arXiv:cs/0612019v2 [cs.IT] 5 Dec 2006
On Finite Memory Universal Data Compression and Classification of Individual Sequences

Jacob Ziv
Department of Electrical Engineering
Technion–Israel Institute of Technology
Haifa 32000, Israel

November 19, 2006
Abstract

Consider the case where consecutive blocks of N letters of a semi-infinite individual sequence X over a finite alphabet are compressed into binary sequences by some one-to-one mapping. No a-priori information about X is available at the encoder, which must therefore adopt a universal data-compression algorithm. It is known that if the universal LZ77 data-compression algorithm is successively applied to N-blocks, then the best error-free compression for the particular individual sequence X is achieved as N tends to infinity. The best possible compression that may be achieved by any universal data-compression algorithm for finite N-blocks is discussed. It is demonstrated that context-tree coding essentially achieves it.

Next, consider a device called a classifier (or discriminator) that observes an individual training sequence X. The classifier's task is to examine individual test sequences of length N and decide whether the test N-sequence has the same features as those that are captured by the training sequence X, or is sufficiently different, according to some appropriate criterion. Here again, it is demonstrated that a particular universal context classifier with a storage-space complexity that is linear in N is essentially optimal. This may contribute a theoretical "individual sequence" justification for the Probabilistic Suffix Tree (PST) approach in learning theory and in computational biology.

Index Terms: Data compression, universal compression, universal classification, context-tree coding.
A. Introduction and Summary of Results
Traditionally, the analysis of information processing systems is based on a certain modeling of the process that generates the observed data (e.g., an ergodic process). Based on this a-priori model, a processor (e.g., a compression algorithm, a classifier, etc.) is then optimally designed. In practice, there are many cases where insufficient a-priori information about this generating model is available, and one must base the design of the processor on the observed data only, under some complexity constraints that the processor must comply with.
1. Universal Data Compression with Limited Memory
The Kolmogorov-Chaitin complexity (1968) is the length of the shortest program that can generate the given individual sequence via a universal Turing machine. More concrete results are achieved by replacing the universal Turing machine model with the more restricted finite-state machine model. The Finite-State (FS) normalized complexity (compression) H(X), measured in bits per input letter, of an individual infinite sequence X is the normalized length of the shortest one-to-one mapping of X into a binary sequence that can be achieved by any finite-state compression device [1]. For example, the counting sequence 0123456..., when mapped into the binary sequence 0,1,00,01,10,11,000,001,010,011,100,101,110,111,..., is incompressible by any finite-state algorithm. Fortunately, the data that one has to deal with is in many cases compressible.

The FS complexity was shown to be asymptotically achieved by applying the LZ universal data-compression algorithm [1] to consecutive blocks of the individual sequence. The FS modeling approach was also applied to yield asymptotically optimal universal prediction of individual sequences [9]. Consider now the special case where the FS class of processors is further constrained to include only block-encoders that process one N-string at a time and then start all over again (e.g., due to bounded latency and error-propagation considerations). Then H(X) is still asymptotically achievable by the LZ algorithm when applied on-line to consecutive strings of length N, as N tends to infinity [1]. But the LZ algorithm may not be the best on-line universal data-compression algorithm when the block-length N is finite. It has been demonstrated that if it is a-priori known that X is a realization of a stationary ergodic Variable Length Markov Chain (VLMC) that is governed by a tree model, then context-tree coding yields a smaller redundancy than the LZ algorithm ([10], [4]). More recently, it has been demonstrated that context-tree coding yields an optimal universal coding policy (relative to the VLMC assumption) ([2]).
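As a concrete aside on the counting-sequence example mentioned above, the following short Python sketch (an illustration added here, not taken from the paper) generates the binary enumeration 0, 1, 00, 01, 10, 11, 000, ...; the concatenated bit stream has no repetitive structure that a finite-state compressor could exploit.

```python
def counting_word(n: int) -> str:
    # n-th word of the enumeration 0, 1, 00, 01, 10, 11, 000, 001, ...
    # (the binary representation of n + 2 with its leading '1' removed)
    return bin(n + 2)[3:]

# First 14 words, concatenated into the "counting sequence" bit stream:
stream = "".join(counting_word(n) for n in range(14))
print(stream)  # 0100011011000001010011100101110111
```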
Inspired by these results, one may ask whether the optimality of context-tree coding relative to tree models still holds for more general setups. It is demonstrated here that the best possible compression that may be achieved by any universal data-compression algorithm for finite N-blocks is essentially achieved by context-tree coding for any individual sequence X, and not just for individual sequences that are realizations of a VLMC.

In the following, a number of quantities are defined that are characterized by non-traditional notations, which seem unavoidable due to end-effects resulting from the finite length of $X_1^N$. These end-effects vanish as N tends to infinity, but must be taken into account here.

Refer to an arbitrary sequence over a finite alphabet $A$, $|A| = A$, $X^N = X_1^N = X_1, X_2, \ldots, X_N$; $X_i \in A$, as being an individual sequence. Let $X = X_1, X_2, \ldots$ denote a semi-infinite sequence over the alphabet $A$.

Next, an empirical probability distribution $P_{MN}(Z_1^N, N)$ is defined for N-vectors that appear in a sequence of length MN. The reason for using the notation $P_{MN}(Z_1^N, N)$ rather than, say, $P_{MN}(Z_1^N)$ is due to end-effects, as discussed below. We then define an empirical entropy that results from $P_{MN}(Z_1^N, N)$, namely $H_{MN}(N)$. This quantity is similar to the classical definition of the empirical entropy of N-blocks in an individual sequence of length MN and, as one should anticipate, serves as a lower bound for the compression that can be achieved by any N-block encoder. Furthermore, $H_{MN}(N)$ is achievable in the impractical case where one is allowed to first scan the long sequence $X_1^{MN}$, generate the corresponding empirical probability $P_{MN}(Z_1^N, N)$ for each N-vector $Z_1^N$ that appears in $X_1^{MN}$, and apply the corresponding Huffman coding to consecutive N-blocks. Then, define $H(X, N) = \limsup_{M\to\infty} H_{MN}(N)$. It follows that
\[
H(X) = \limsup_{N\to\infty} H(X, N)
\]
is the smallest number of bits per letter that can be asymptotically achieved by any N-block data-compression scheme for X. However, in practice, universal data-compression is executed on-line and the only available information on X is the currently processed N-block.

Next, an empirical probability measure $P_{MN}(Z_1^i, N)$ is defined for $i < N$ vectors that appear in an MN-sequence; it is derived from $P_{MN}(Z_1^N, N)$ by summing $P_{MN}(Z_1^N, N)$ over the last $N-i$ letters of the N-vectors in the MN-sequence. Again, observe that due to end-effects, $P_{MN}(Z_1^i, N)$ is different from $P_{MN}(Z_1^i, i)$, but converges to it asymptotically as M tends to infinity. Similarly, an empirical entropy $H_{MN}(i, N)$ is derived from $P_{MN}(Z_1^i, N)$. In the analysis that follows, both $H_{MN}(N)$ and $H_{MN}(i, N)$ play an important role.

An empirical entropy $H_{MN}(Z_i\,|\,Z_1^{i-1}, N)$ is associated with each vector $Z_1^{i-1}$; $1 \le i \le (\log N)^2$ in $X_1^{MN}$. This entropy is derived from
\[
P_{MN}(Z_i\,|\,Z_1^{i-1}, N) = \frac{P_{MN}(Z_1^i, N)}{P_{MN}(Z_1^{i-1}, N)} .
\]
Note that this empirical entropy is conditioned on the particular value of $Z_1^{i-1}$ and is not averaged over all $Z_1^{i-1} \in A^{i-1}$ relative to $P_{MN}(Z_1^{i-1}, N)$.

A context-tree with approximately N leaves, consisting of the N most empirically probable contexts in $X_1^{MN}$, is generated. For each leaf of this tree, choose the one context, among the contexts on the path from the root of the tree to this leaf, for which the associated entropy is the smallest. Then, these minimal associated entropies are averaged over the set of leaves of the tree. This average entropy is denoted by $H_u(N, M)$. Note that $H_u(N, M)$ is essentially an empirical conditional entropy which is derived for a suitably defined variable-length Markov chain (VLMC). Finally, define $H_u(X, N) = \limsup_{M\to\infty} H_u(N, M)$.

It is demonstrated that $\liminf_{N\to\infty}\big[H(X, N) - H_u(X, N)\big] \ge 0$ (Lemma 1 below). Thus, for large enough N, $H_u(X, N)$, like $H(X, N)$, may also serve as a lower bound on the compression that may be achieved by any encoder for N-sequences. The relevance of $H_u(X, N)$ becomes apparent when it is demonstrated in Theorem 2 below that a context-tree universal data-compression scheme, when applied to $N'$-blocks, essentially achieves $H_u(X, N)$ for any X if $\log N'$ is only slightly larger than $\log N$, and achieves H(X) as $N'$ tends to infinity. Furthermore, it is shown in Theorem 1 below that among the many compressible sequences X for which $H_u(X, N) = H(X) < \log A$, there are some for which no on-line universal data-compression algorithm can achieve any compression at all when applied to consecutive blocks of length $N'$, if $\log N'$ is slightly smaller than $\log N$. Context-tree universal data-compression is therefore essentially optimal.

Note that the threshold effect described above is expressed in a logarithmic scaling of N. At the same time, the logarithmic scaling of N is apparently the natural scaling for the length of contexts in a context-tree with N leaves.
2. Application to Universal Classification
A device called a classifier (or discriminator) observes an individual training sequence of length m letters, $X_1^m$. The classifier's task is to consider individual test sequences of length N and decide whether the test N-sequence has, in some sense, the same properties as those that are captured by the training sequence, or is sufficiently different, according to some appropriate criterion. No a-priori information about the test sequences is available to the classifier aside from the training sequence.

A universal classifier $d(X_1^m, Z_1^N \in A^N)$ for N-vectors is defined to be a mapping from $A^N$ onto $\{0,1\}$. Upon observing $Z_1^N$, the classifier declares $Z_1^N$ to be similar to one of the N-vectors $X_{j+1}^{j+N}$; $j = 0,1,\ldots,m-N$, iff $d(X_1^m, Z_1^N) = 1$ (or, in some applications, if a slightly distorted version $\tilde Z_1^N$ of $Z_1^N$ satisfies $d(X_1^m, \tilde Z_1^N) = 1$).
In the classical case, the probability distribution of N-sequences is known, and an optimal classifier accepts all N-sequences $Z_1^N$ such that the probability $P(Z_1^N)$ is bigger than some preset threshold. If $X_1^L$ is a realization of a stationary ergodic source, one has, by the Asymptotic Equipartition Property (A.E.P.) of information theory, that the classifier's task is tantamount (almost surely, for large enough N and M) to deciding whether the test sequence is equal to a "typical" sequence of the source (or, when some distortion is acceptable, whether a slightly distorted version of the test sequence is equal to a "typical" sequence). The cardinality of the set of typical sequences is, for large enough N, about $2^{NH}$, where H is the entropy rate of the source [10].

What should one do when $P(Z_1^N)$ is unknown or does not exist, and the only available information about the generating source is a training sequence $X_1^m$? The case where the training sequence is a realization of an ergodic source with vanishing memory is studied in [11], where it is demonstrated that a certain universal context-tree based classifier is essentially optimal for this class of sources. This is in unison with related results on universal prediction ([11], [12], [13]).

Universal classification of test sequences relative to a long training sequence is a central problem in computational biology. One common approach is to assume that the training sequence is a realization of a VLMC and, upon viewing the training sequence, to construct an empirical Probabilistic Suffix Tree (PST), the size of which is limited to the available storage complexity of the classifier, and apply a context-tree based classification algorithm [7, 8]. But how should one proceed if there is no a-priori support for the VLMC assumption? In the following, it is demonstrated that the PST approach is essentially optimal for every individual training sequence, even without the VLMC assumption.

Denote by $S_\epsilon(N, X_1^m)$ a set of N-sequences $Z_1^N$ which are declared to be similar to $X_1^m$ (i.e. $d(X_1^m, Z_1^N) = 1$), where $d(X_1^m, X_{j+1}^{j+N}) = 0$ should be satisfied by no more than $\epsilon(m-N+1)$ instances $j = 0,1,2,\ldots,m-N$, and where $\epsilon$ is an arbitrarily small positive number. Also, given a particular classifier, let $D_\epsilon(N, X_1^m) = |S_\epsilon(N, X_1^m)|$, and let
\[
H_\epsilon(N, X_1^m) = \frac{1}{N}\log D_\epsilon(N, X_1^m) .
\]
Thus, any classifier is characterized by a certain $H_\epsilon(N, X_1^m)$. Given $X_1^m$, let $D_{\epsilon,\min}(N, X_1^m)$ be the smallest achievable $D_\epsilon(N, X_1^m)$ and let
\[
H_{\epsilon,\min}(N, X_1^m) = \frac{1}{N}\log D_{\epsilon,\min}(N, X_1^m)
\]
and, for an infinite training sequence X,
\[
H_\epsilon(N, X) = \limsup_{m\to\infty} H_{\epsilon,\min}(N, X_1^m) .
\]
Note that
\[
\bar H(X) = \lim_{\epsilon\to 0}\limsup_{N\to\infty} H_{\epsilon,\min}(N, X)
\]
is the topological entropy of X [6].

Naturally, if the classifier has the complete list of N-vectors that achieve $D_{\epsilon,\min}(N, X_1^m)$, it can achieve perfect classification by making $d(X_1^m, Z_1^N) = 1$ iff $Z_1^N = X_{j+1}^{j+N}$, for every instant $j = 0,1,2,\ldots,m-N$ for which $X_{j+1}^{j+N} \in S_\epsilon(N, X_1^m)$.
The discussion is constrained to cases where $H_{\epsilon,\min}(N, X_1^m) > 0$. Therefore, when m is large, $D_{\epsilon,\min}(N, X_1^m)$ grows exponentially with N (e.g., when the test sequence is a realization of an ergodic source with a positive entropy rate).

The attention is limited to classifiers that have a storage-space complexity that grows only linearly with N. Thus, the long training sequence cannot be stored. Rather, the classifier is constrained to represent the long training sequence by a short "signature", and to use this short signature to classify incoming test sequences of length N. It is shown that it is possible to find such a classifier, denoted by $d(X_1^m, \epsilon, Z_1^N)$, which is essentially optimal in the following sense: an optimal "$\epsilon$-efficient" universal classifier $d(X_1^m, \epsilon, Z_1^N)$ is defined to be one that satisfies the condition that $d(X_1^m, X_{j+1}^{j+N}) = 1$ for $(1-\hat\epsilon)(m-N+1)$ instances $j = 0,1,\ldots,m-N$, where $\hat\epsilon \le \epsilon$. This corresponds to the rejection of at most $\epsilon D_{\epsilon,\min}(N, X_1^m)$ vectors from among the $D_{\epsilon,\min}(N, X_1^m)$ typical N-vectors in $X_1^m$. Also, an optimal "$\epsilon$-efficient" universal classifier should satisfy the condition that $d(X_1^m, Z_1^N) = 1$ is satisfied by no more than $2^{N(H_{\epsilon,\min}(N, X_1^m)+\epsilon)}$ N-vectors $Z_1^N$. This corresponds to a false-alarm rate of
\[
\frac{2^{N(H_{\epsilon,\min}(N, X_1^m)+\epsilon)} - 2^{N H_{\epsilon,\min}(N, X_1^m)}}{2^{N\log A} - 2^{N H_{\epsilon,\min}(N, X_1^m)}}
\]
when N-vectors are selected independently at random, with an induced uniform probability distribution over the set of $2^{N\log A} - 2^{N H_{\epsilon,\min}(N, X_1^m)}$ N-vectors that should be rejected. Note that the false-alarm rate is thus guaranteed to decrease exponentially with N for any individual sequence $X_1^m$ for which $H_{\epsilon,\min}(N, X_1^m) < \log A - \epsilon$.

A context-tree based classifier for N-sequences, given an infinite training sequence X and a storage-complexity of O(N), is shown by Theorem 3 below to be $\epsilon$-efficient for any $N \ge N_0(X)$ and some $m = m_0(N, X)$. Furthermore, by Theorem 3 below, among the set of training sequences for which the proposed classifier is $\epsilon$-efficient, there are some for which no $\epsilon$-efficient classifier for $N'$-sequences exists, if $\log N' < \log N$, for any $\epsilon < \log A - H_{\epsilon,\min}(N, X)$. Thus, the proposed classifier is essentially optimal.

Finally, the following universal classification problem is considered: given two test sequences $Y_1^N$ and $Z_1^N$ and no training data, are these two test sequences "similar" to each other? The case where both $Y_1^N$ and $Z_1^N$ are realizations of some (unknown) finite-order Markov processes is discussed in [14], where an asymptotically optimal empirical divergence measure is derived empirically from $Y_1^N$ and $Z_1^N$.
In the context of the individual-sequence approach that is adopted here, this amounts to the following problem: given $Y_1^N$ and $Z_1^N$, is there a training sequence X with $H_{\epsilon,\min}(X) > 0$ such that both $Y_1^N$ and $Z_1^N$ are accepted by some $\epsilon$-efficient universal classifier with linear space complexity? (This problem is reminiscent of the "common ancestor" problem in computational biology [15], where one may think of X as a training sequence that captures the properties of a possible "common ancestor" of two DNA sequences $Y_1^N$ and $Z_1^N$.) This is the topic of the Corollary following Theorem 3 below.
B. Definitions, Theorems and Algorithms
Given $X_1^N \in A^N$, let $c(X_1^N)$; $X_1^N \in A^N$, be a one-to-one mapping of $X_1^N$ into a binary sequence of length $L(X_1^N)$, which is called the length function of $c(X_1^N)$. It is assumed that $L(X_1^N)$ satisfies the Kraft inequality. For every X and any positive integers M, N, define the compression of the prefix $X_1^{NM}$ to be:
\[
\rho_L(X, N, M) = \max_{i;\,1\le i\le N-1}\frac{1}{NM}\Bigg[\sum_{j=0}^{M-2} L\big(X_{i+jN+1}^{i+(j+1)N}\big) + L\big(X_1^i\big) + L\big(X_{i+1+(M-1)N}^{NM}\big)\Bigg] .
\]
Thus, one looks for the compression of the sequence $X_1^{MN}$ that is achieved by successively applying a given length-function $L(X_1^N)$, with the worst starting phase $i$; $i = 1,2,\ldots,N-1$. Observe that by ignoring the terms $\frac{1}{NM}L(X_1^i)$ and $\frac{1}{NM}L\big(X_{i+1+(M-1)N}^{NM}\big)$ (which vanish for large values of M) in the expression above, one gets a lower bound on the actual compression.

In the following, a lower bound $H_{MN}(N)$ on $\rho_L(X, N, M)$ is derived, that applies to any length-function $L(X_1^N)$. First, the notion of the empirical probability of an N-vector in a finite MN-vector is derived, for any two positive integers N and M. Given a sequence $X_1^{MN}$, define for a vector $Z_1^N \in A^N$,
\[
P_{i,MN}(Z_1^N, N) = \frac{1}{M-1}\sum_{j=0}^{M-2}\mathbf{1}_{Z_1^N}\big(X_{i+jN}^{i+(j+1)N-1}\big)\,; \quad 1 \le i \le N-1 \tag{1}
\]
and
\[
P_{MN}(Z_1^N, N) = \frac{1}{N}\sum_{i=1}^{N} P_{i,MN}(Z_1^N, N) \tag{2}
\]
where
\[
\mathbf{1}_{Z_1^N}\big(X_{jN+i}^{(j+1)N+i-1}\big) = 1 \ \text{ iff } \ X_{jN+i}^{(j+1)N+i-1} = Z_1^N\,; \quad \text{else } \mathbf{1}_{Z_1^N}\big(X_{jN+i}^{(j+1)N+i-1}\big) = 0 .
\]
Thus,
\[
P_{MN}(Z_1^N, N) = \frac{1}{(M-1)N+1}\sum_{i=1}^{(M-1)N+1}\mathbf{1}_{Z_1^N}\big(X_i^{i+N-1}\big) ,
\]
the empirical probability of $Z_1^N$. Similarly, define
\[
P_{MN}(Z_1^i, N) = \sum_{Z_{i+1}^{N}\in A^{N-i}} P_{MN}(Z_1^N, N)\,; \quad 1 \le i \le N-1 .
\]
(As noted in the Introduction, $P_{MN}(Z_1^i, N)$ converges to the empirical probability $P_{MN}(Z_1^i, i)$ as M tends to infinity. However, for finite values of M, these two quantities are not identical due to end-effects.) Let
\[
H_{MN}(i, N) = -\frac{1}{i}\sum_{Z_1^i\in A^i} P_{MN}(Z_1^i, N)\log P_{MN}(Z_1^i, N)
\]
and
\[
H_{MN}(N) = H_{MN}(N, N) = -\frac{1}{N}\sum_{Z_1^N\in A^N} P_{MN}(Z_1^N, N)\log P_{MN}(Z_1^N, N) \tag{3}
\]
Then,

Proposition 1
\[
\rho_L(X, N, M) \ge H_{MN}(N) \quad\text{and}\quad \limsup_{M\to\infty}\rho_L(X, N, M) \ge H(X, N)
\]
where $H(X, N) = \limsup_{M\to\infty} H_{MN}(N)$.

The proof appears in the Appendix. Thus, the best possible compression of $X_1^{MN}$ that may be achieved by any one-to-one encoder for N-blocks is bounded from below by $H_{MN}(N)$. Furthermore, $H_{MN}(N)$ is achievable for N that is much smaller than $\log M$, if $c(Z_1^N)$ (and its corresponding length-function $L(Z_1^N)$) is tailored to the individual sequence $X_1^{MN}$: by first scanning $X_1^{MN}$, evaluating the empirical distribution $P_{MN}(Z_1^N, N)$ of N-vectors, and then applying the corresponding Huffman data-compression algorithm. However, in practice, the data compression has to be executed on-line and the only available information on $X_1^{MN}$ is that contained in the currently processed N-block. The main topic of this paper is to find out how well one can do in such a case, where the same mapping $c(X_1^N)$ is applied to successive N-vectors of $X_1^{MN}$.
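As an illustration of the quantities just defined, here is a minimal Python sketch (added for concreteness, not part of the paper; it uses the sliding-window form of $P_{MN}(Z_1^N, N)$ displayed above and ignores end-effects). It computes the per-letter empirical N-block entropy $H_{MN}(N)$, which by Proposition 1 lower-bounds the compression achievable by any N-block length function.

```python
from collections import Counter
from math import log2

def empirical_block_entropy(x: str, N: int) -> float:
    """Per-letter empirical entropy of the overlapping N-blocks of x,
    i.e. H_MN(N) computed from the sliding-window empirical distribution."""
    total = len(x) - N + 1
    counts = Counter(x[i:i + N] for i in range(total))
    return -sum((c / total) * log2(c / total) for c in counts.values()) / N

# Toy usage: a sequence of length M*N with N = 3 and M = 8 over the alphabet {a, b}.
x = "aababbab" * 3
print(round(empirical_block_entropy(x, 3), 3))
```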
Next, given N and M, a particular context-tree is generated from $X_1^{MN}$ for each letter $X_i$; $1 \le i \le MN$, and a related conditional empirical entropy $H_u(N, M)$ that corresponds to these context trees is defined. It is then demonstrated that for large enough N and M, $H_u(N, M)$ may also serve as a lower bound on $\rho_L(X, N, M)$.

Construction of the context-tree for the letter $Z_i$:

1) Consider contexts which are no longer than $t = \lceil(\log N)^2\rceil$ and let K be a positive number.

2) Let $K_1(Z_1^N, K) = \min[j-1, t]$, where j is the smallest positive integer such that $P_{MN}(Z_1^j, N) \le \frac{1}{K}$, and where the probability measure $P_{MN}(Z_1^j, N)$ for vectors $Z_1^j \in A^j$ is derived from $X_1^{MN}$. If no such j exists, set $K_1(Z_1^N, K) = -1$, where $Z_1^0$ is the null vector.

3) Given $X_1^{MN}$, evaluate $P_{MN}(Z_1^i, N)$. For the i-th symbol in $Z_1^N$, let $Z_1^{i-1}$ be the corresponding suffix. For each particular $Z_1^{i-1} \in A^{i-1}$ define
\[
H_{MN}(Z_i\,|\,Z_1^{i-1}, N) = -\sum_{Z_i\in A}\frac{P_{MN}(Z_1^i, N)}{P_{MN}(Z_1^{i-1}, N)}\log\frac{P_{MN}(Z_1^i, N)}{P_{MN}(Z_1^{i-1}, N)} \tag{4}
\]

4) Let $j_0 = j_0(Z_1^{i-1})$ be the integer for which
\[
H_{MN}\big(Z_i\,\big|\,Z_{i-j_0}^{i-1}, N\big) = \min_{1\le j\le 1+K_1(Z_1^N, K)} H_{MN}\big(Z_i\,\big|\,Z_{i-j}^{i-1}, N\big) .
\]
Each such $j_0$ is a node in a tree with about K leaves. The set of all such nodes represents the particular context tree for the i-th instant.

5) Finally,
\[
H_u(N, K, M) = \sum_{Z_1^N\in A^N} P_{MN}(Z_1^N, N)\, H_{MN}\big(Z_i\,\big|\,Z_{i-j_0(Z_1^{i-1})}^{i-1}, N\big) \tag{5}
\]
Observe that $H_u(N, K, M)$ is an entropy-like quantity defined by an optimal data-driven tree of variable depth $K_1$, where each leaf that is shorter than t has roughly an empirical probability $\frac{1}{K}$.

Set $K = N$ and let $H_u(N, M) = H_u(N, N, M)$. Also, let
\[
H_u(X, N) = \limsup_{M\to\infty} H_u(N, M) \tag{6}
\]
and
\[
H_u(X) = \limsup_{N\to\infty} H_u(X, N). \tag{7}
\]
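As a rough illustration of how the quantity in Eq. (5) is data-driven, the following Python sketch selects, at every position, the allowed context depth with the smallest empirical conditional entropy and averages the result. It is a simplified stand-in, not the paper's exact construction: the helper names conditional_entropy_table and hu_sketch, the probability threshold test, and the treatment of block boundaries are choices made here.

```python
from collections import Counter, defaultdict
from math import ceil, log2

def conditional_entropy_table(x, max_depth):
    """Empirical context counts and conditional entropies H(next letter | context)
    for every context of length 0..max_depth that occurs in x."""
    ctx_count, next_count = Counter(), defaultdict(Counter)
    for i in range(len(x)):
        for d in range(0, min(max_depth, i) + 1):
            ctx = x[i - d:i]
            ctx_count[(d, ctx)] += 1
            next_count[(d, ctx)][x[i]] += 1
    H = {}
    for key, total in ctx_count.items():
        H[key] = -sum((c / total) * log2(c / total)
                      for c in next_count[key].values())
    return ctx_count, H

def hu_sketch(x, K):
    """Simplified stand-in for H_u(N, K, M): at each position keep the allowed
    context depth (empirical probability at least 1/K, depth at most (log N)^2)
    with the smallest conditional entropy, then average over positions."""
    n = len(x)
    t = ceil(log2(n) ** 2)
    ctx_count, H = conditional_entropy_table(x, t)
    total = 0.0
    for i in range(n):
        best = H[(0, "")]               # the null context is always allowed
        for d in range(1, min(t, i) + 1):
            ctx = x[i - d:i]
            if ctx_count[(d, ctx)] / n < 1.0 / K:
                break                   # longer contexts are rarer still
            best = min(best, H[(d, ctx)])
        total += best
    return total / n

print(round(hu_sketch("ab" * 16, K=8), 3))
```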
Let
\[
H(X, N) = \limsup_{M\to\infty} H_{MN}(N)
\]
and
\[
H(X) = \limsup_{N\to\infty} H(X, N) .
\]
Then,

Lemma 1 For every individual sequence X,
\[
\liminf_{N\to\infty}\Big[H(X, N) - H_u(X, N)\Big] \ge 0 \tag{8}
\]
Hence, $H_u(X) \le H(X)$.

The proof of Lemma 1 appears in the Appendix. A compression algorithm that achieves a compression $H_u(X, N)$ is therefore asymptotically optimal as N tends to infinity. Note that the conditional entropy for a data-driven Markov tree of a uniform depth of, say, $O(\log\log N)$ may still satisfy Lemma 1 (by the proof of Lemma 1), but this conditional empirical entropy is lower-bounded by $H_u(X, N)$ for finite values of N. A context-tree data-compression algorithm for N-blocks that essentially achieves a compression $H_u(X, N)$ is introduced below, and is therefore asymptotically optimal; but so are other universal data-compression algorithms (e.g., a simpler context-tree data-compression algorithm with a uniform context depth, or the LZ algorithm [1]). However, the particular context-tree algorithm that is proposed below is shown to be essentially optimal for non-asymptotic values of N as well.

It is now demonstrated that no universal data-compression algorithm that utilizes a length-function for N-blocks can always achieve a compression which is essentially better than $H_u(X, N)$ for any value of N, in the following sense. Let $\delta$ be an arbitrarily small positive number. Consider the class $C_{N_0,M_0,\delta}$ of all X-sequences for which, for some $\hat H$ such that $\delta < \hat H < (1-2\delta)\log A$:

1) $H_{M_0N_0}(N_0, N_0) = \hat H$;

2) $H_u(N_0, K_0, M_0) - \hat H \le \delta$, where $K_0 = N_0$.

It is demonstrated that the class $C_{N_0,M_0,\delta}$ is not empty. In the proof of Theorem 1 in the Appendix, a class of sequences of cardinality $M_{\ell,h} = \binom{2^{(\log A)\ell}}{2^{h\ell}}$ is constructed such that this class is included in $C_{N_0,M_0,\delta}$ for $h = \frac{\delta}{2}$ and for $\ell$ that satisfies $N_0 = 2^{h\ell}$. Moreover, by Lemma 1, it follows that every sequence X is in the set $C_{N_0,M_0,\delta}$, for large enough $N_0$ and $M_0 = M_0(N_0)$.
Theorem 1 Let $N' = N^{1-\delta}$. For any universal data-compression algorithm for $N'$-vectors that utilizes some length-function $L(Z_1^{N'})$, there exist some sequences $X \in C_{N_0,M_0,\delta}$ such that for any $M \ge M_0$ and any $N \ge N_0$:
\[
\rho_L(X, M, N') \ge (1-\delta)[\log A - \delta] > \hat H
\]
for large enough $N_0$.
The proof of Theorem 1 appears in the Appendix. The next step demonstrates that there exists a universal data-compression algorithm which is optimal in the sense that, when it is applied to consecutive N-blocks, its associated compression is about $H_u(X, N')$ for every individual sequence X, where $\log N'$ is slightly smaller than $\log N$.

Theorem 2 Let $\delta$ be an arbitrarily small positive number and let $N' = \lfloor N^{1-\delta}\rfloor$. There exists a context-tree universal coding algorithm for N-blocks, with a length-function $\hat L(Z_1^N)$, for which, for every individual $X_1^{MN} \in A^{MN}$,
\[
\rho_{\hat L}(X, M, N) \le H_u(N, N', M) + O\Big(\frac{\log N}{N^\delta}\Big) .
\]

It should be noted here that the particular universal context-tree algorithm that is described below is not claimed to yield the lowest possible redundancy for a given block-length N. No attempt was made to minimize the redundancy, since it suffices to establish the fact that this particular, essentially optimal, universal algorithm indeed belongs to the class of context-tree algorithms. The reader is referred to [4] for an exhaustive discussion of optimal universal context-tree algorithms.
Description of the universal compression algorithm: Consider first the encoding of the first N-vector $X_1^N$ (to be repeated for every $X_{(i-1)N+1}^{iN}$; $i = 2,3,\ldots,M-1$). Let $t = \lceil(\log N)^2\rceil$.

A) Given the first N-vector $X_1^N$, generate the set $T_u(X_1^N)$ that consists of all contexts $Z_1^{i-1}$; $i \le t$, that appear in $X_1^N$ and satisfy $P_N(Z_1^{i-1}, t) \ge \frac{1}{N^{1-\delta}}$.

Clearly, $T_u(X_1^N)$ is a context tree with no more than $N^{1-\delta}$ leaves and a maximum depth of $t = \lceil(\log N)^2\rceil$. The depth t is chosen to be just small enough so as to yield an implementable compression scheme and, at the same time, still big enough so as to yield an efficient enough compression.
B) Evaluate
\[
H_u(X_1^N, T_u, t) = \sum_{X_1^{t-1}\in A^{t-1}} P_N(X_1^{t-1}, t)\,\min_{0\le j\le t-1;\, X_1^{j-1}\in T_u(X_1^N)} H_N\big(X_j\,\big|\,X_1^{j-1}, t\big) .
\]
Let $\hat T_u(X_1^N)$ be a sub-tree of $T_u(X_1^N)$ whose leaves are the set of contexts that achieves $H_u(X_1^N, T_u, t)$.

C) A length function $\hat L(X_1^N) = \hat L_1(X_1^N) + \hat L_2(X_1^N) + \hat L_3(X_1^N)$ is constructed as follows:

1) $\hat L_1(X_1^N)$ is the length of an uncompressed binary word $\hat m_1(X_1^N)$ that enables the decoder to reconstruct the context tree $\hat T_u(X_1^N)$, which consists of the set of contexts that achieves $H_u(X_1^N, T_u, t)$. This tree has, by construction, at most $N^{1-\delta}$ leaves and at most t letters per leaf. It takes at most $1 + t\log A$ bits to encode a vector of length t over an alphabet of A letters. It also takes at most $1 + \log t$ bits to encode the length of a particular context. Therefore,
\[
\hat L_1(X_1^N) \le N^{1-\delta}(t\log A + \log t + 2) \le \big[(\log N)^2\log A + \log\big((\log N)^2\big) + 2\big]N^{1-\delta} \ \text{bits.}
\]

2) $\hat L_2(X_1^N)$ is the length of a binary word $\hat m_2(X_1^N)$ ($t\log A$ bits long), which is an uncompressed binary mapping of $X_1^t$, the first t letters of $X_1^N$.

3) Observe that given $\hat m_1(X_1^N)$ and $\hat m_2(X_1^N)$, the decoder can re-generate $X_1^t$ and the sub-tree $\hat T_u(X_1^N)$ that achieves $H_u(X_1^N, T_u, t)$. Given $\hat T_u(X_1^N)$ and the prefix $X_1^t$ of $X_1^N$, the remainder $X_{t+1}^N$ is compressed by a context-tree algorithm for FSMX sources [3, 4], which is tailored to the contexts that are the leaves of $\hat T_u(X_1^N)$, yielding a length function $\hat L_3(X_1^N) \le N H_u(X_1^N, T_u, t) + O(1)$.

D) Repeat steps 1), 2) and 3) above for the N-vectors $X_{(i-1)N+1}^{iN}$; $i = 2,3,\ldots,M-1$.
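To make the three-part construction concrete, here is a hedged Python sketch of the code-length accounting for one N-block. It is illustrative only and not an actual encoder: the name block_codelength_budget and the value of delta are choices made here, and an order-1 empirical conditional entropy is used as a crude stand-in for $N\,H_u(X_1^N, T_u, t)$.

```python
from collections import Counter, defaultdict
from math import ceil, log2

def block_codelength_budget(x: str, alphabet_size: int, delta: float = 0.1) -> float:
    """Rough per-letter budget for the three-part scheme of step C:
    L1 = bound on the bits describing the pruned context tree,
    L2 = the raw first t letters, L3 ~ N * (conditional entropy) for the rest."""
    N = len(x)
    t = ceil(log2(N) ** 2)
    logA = log2(alphabet_size)
    L1 = (N ** (1 - delta)) * (t * logA + log2(t) + 2)  # tree-description bound
    L2 = t * logA                                       # uncompressed prefix x[:t]
    # Crude stand-in for N * H_u: total bits of an order-1 conditional entropy code.
    pair = defaultdict(Counter)
    for a, b in zip(x, x[1:]):
        pair[a][b] += 1
    L3 = 0.0
    for nxt in pair.values():
        tot = sum(nxt.values())
        L3 -= sum(c * log2(c / tot) for c in nxt.values())
    return (L1 + L2 + L3) / N   # bits per letter for this block

print(round(block_codelength_budget("abracadabra" * 8, alphabet_size=5), 2))
```

For short blocks the header terms L1 and L2 dominate; they play the role of the vanishing per-letter overhead that appears in Theorem 2's redundancy term as the block length grows.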
Let $\bar T_u(X_1^N)$ be the set of all vectors $Z_1^{i-1}$; $i \le t$, satisfying $P_{MN}(Z_1^{i-1}, N) \ge \frac{1}{N''}$, where $N'' = N^{1-2\delta}$.

The proof of Theorem 2 follows from the construction and from the convexity of the entropy function, since
\[
\frac{1}{M}\sum_{i=0}^{M-1} H_u\big(Z_{iN+1}^{(i+1)N}, T_u, t\big) \le H_u(N, N'', M) + O(N^{-\delta}) + O\Big(\frac{[\log N]^4}{N}\Big)
\]
where the term $O(N^{-\delta})$ is an upper bound on the relative frequency of instances in $X_{iN+1}^{(i+1)N}$ that have as a context a leaf of $\hat T_u(X_{iN+1}^{(i+1)N})$ that is a suffix of one of the vectors in $\bar T_u(X_1^N)$ and therefore is not an element of the set of contexts that achieve $H_u(N, N'', M)$. The term $O\big(\frac{[\log N]^4}{N}\big)$ is due to end-effects and follows from Lemma 2.7 in [5, page 33] (see the proof of Lemma 1 in the Appendix).
C. Application to Universal Classification
A device called a classifier (or discriminator) observes an individual training sequence of length m letters, $X_1^m$. The classifier's task is to consider individual test sequences of length N and decide whether the test N-sequence has the same features as those that are captured by the training sequence, or is sufficiently different, according to some appropriate criterion. No a-priori information about the test sequences is available to the classifier aside from the training sequence.

Following the discussion in the Introduction, a universal classifier $d(X_1^m, Z_1^N \in A^N)$ for N-vectors is defined to be a mapping from $A^N$ onto $\{0,1\}$. Upon observing $Z_1^N$, the classifier declares $Z_1^N$ to be similar to one of the N-vectors $X_{j+1}^{j+N}$; $j = 0,1,\ldots,m-N$, iff $d(X_1^m, Z_1^N) = 1$ (or, in some applications, if a slightly distorted version $\tilde Z_1^N$ of $Z_1^N$ satisfies $d(X_1^m, \tilde Z_1^N) = 1$). Denote by $S(N, \epsilon, X_1^m)$ a set of N-sequences $Z_1^N$ which are declared to be similar to $X_1^m$ (i.e. $d(X_1^m, Z_1^N) = 1$), where $d(X_1^m, X_{j+1}^{j+N}) = 0$ should be satisfied by no more than $\epsilon(m-N+1)$ instances $j = 0,1,2,\ldots,m-N$, and where $\epsilon$ is an arbitrarily small positive number. Also, given a particular classifier, let $D_\epsilon(N, X_1^m) = |S(N, \epsilon, X_1^m)|$, and let
\[
H_\epsilon(N, X_1^m) = \frac{1}{N}\log D_\epsilon(N, X_1^m) .
\]
Thus, any classifier is characterized by a certain $H_\epsilon(N, X_1^m)$. Given $X_1^m$, let $D_{\epsilon,\min}(N, X_1^m)$ be the smallest achievable $D_\epsilon(N, X_1^m)$ and let
\[
H_{\epsilon,\min}(N, X_1^m) = \frac{1}{N}\log D_{\epsilon,\min}(N, X_1^m) .
\]
Naturally, if the classifier has the complete list of N-vectors that achieve $D_{\epsilon,\min}(N, X_1^m)$, it can perform a perfect classification by making $d(X_1^m, Z_1^N) = 1$ iff $Z_1^N = X_{j+1}^{j+N}$, for every instant $j = 0,1,2,\ldots,m-N$ for which $X_{j+1}^{j+N} \in S(N, \epsilon, X_1^m)$. The discussion is constrained to cases where $H_{\epsilon,\min}(N, X_1^m) > 0$. Therefore, when m is large, $D_{\epsilon,\min}(N, X_1^m)$ grows exponentially with N.

The attention is limited to classifiers that have a storage-space complexity that grows only linearly with N. Thus, the training sequence cannot be stored. Rather, the classifier should represent the long training sequence by a short "signature" and use it to classify incoming test sequences of length N. It is shown that it is possible to find such a classifier, denoted by $d(X_1^m, \epsilon, Z_1^N)$, that is essentially optimal in the following sense (as discussed and motivated in the Introduction): $d(X_1^m, \epsilon, Z_1^N)$ is defined to be one that satisfies the condition that $d(X_1^m, X_{j+1}^{j+N}) = 1$ for $(1-\hat\epsilon)(m-N+1)$ instances $j = 0,1,\ldots,m-N$, where $\hat\epsilon \le \epsilon$. This corresponds to a rejection of at most $\epsilon(m-N+1)$ of the $(m-N+1)$ N-vectors in $X_1^m$. Also, an optimal "$\epsilon$-efficient" universal classifier should satisfy the condition that $d(X_1^m, Z_1^N) = 1$ is satisfied by no more than $2^{N(H_{\epsilon,\min}(N, X_1^m)+\epsilon)}$ N-vectors $Z_1^N$.

Observe that in the case where X is a realization of a finite-alphabet stationary ergodic process, $\lim_{\epsilon\to 0}\lim_{N\to\infty}\limsup_{m\to\infty} H_{\epsilon,\min}(N, X_1^m)$ is equal almost surely to the entropy rate of the source and, for large enough N, the classifier efficiently identifies typical N-vectors without searching the exponentially large list of typical N-vectors, by replacing the long training sequence with an "optimal sufficient statistic" that occupies a memory of O(N) only. In the following, a universal context classifier for N-vectors with a storage-space complexity that is linear in N is shown to be essentially optimal for large enough N and m.
Description of the universal classification algorithm:

A) Evaluate $H_{\epsilon,\min}(N, X_1^m)$.

B) Let $M = \lfloor\frac{m}{N}\rfloor$ and let $N'' = \lfloor N^{1-2\epsilon}\rfloor$. Compute $H_u(N, N'', M)$, where $H_u(N, N'', M)$ is given by Eq. (5), with $N''$ replacing K. Let $\bar T_u(X_1^m)$ be the subset of contexts for which the minimization that yields $H_u(N, N'', M)$ is achieved.

Note that steps A) and B) above are preliminary pre-processing steps that are carried out prior to the construction of the classifier that is tailored to the training data $X_1^m$.

C) Compute
\[
h_u(Z_1^N, X_1^m, \bar T_u, t) = -\sum_{Z_{-i+1}^0\in\bar T_u(X_1^m)} P_N(Z_{-i+1}^0, t)\sum_{Z_1\in A} P_N(Z_1\,|\,Z_{-i+1}^0, t)\log P_m(Z_1\,|\,Z_{-i+1}^0, N)
\]
(here $P_N(\cdot, t)$ is empirically derived from the test sequence $Z_1^N$, while $P_m(\cdot\,|\,\cdot, N)$ is empirically derived from the training sequence $X_1^m$).

D) Similar to step B in the description of the universal data-compression algorithm above, let $T_u(Z_1^N)$ be the context tree that consists of all contexts $Z_1^{i-1}$; $i \le t$, that appear in $Z_1^N$ and satisfy $P_N(Z_1^{i-1}, t) \ge \frac{1}{N^{1-\epsilon}}$. Let $S(Z_1^N, \mu)$ be the set of all $X_1^N \in A^N$ such that $g(Z_1^N, X_1^N) \le \mu$, where $g(\ast,\ast)$ is some non-negative distortion function satisfying $g(Z_1^N, X_1^N) = 0$ iff $X_1^N = Z_1^N$. Given a test sequence $Z_1^N$, compute
\[
H_u(\tilde Z_1^N, T_u, t) = \min_{\bar Z_1^N\in S(Z_1^N, \mu)}\Big[-\sum_{\bar Z_{-i+1}^0\in T_u(\bar Z_1^N)} P_N(\bar Z_{-i+1}^0, t)\sum_{\bar Z_1\in A} P_N(\bar Z_1\,|\,\bar Z_{-i+1}^0, t)\log P_N(\bar Z_1\,|\,\bar Z_{-i+1}^0, t)\Big] .
\]

E) Let
\[
\Delta(X_1^m, Z_1^N) = h_u(Z_1^N, X_1^m, \bar T_u, t) - \min\big[H_u(Z_1^N, T_u, t),\, H_{\epsilon,\min}(N, X_1^m)\big]
\]
and set $\hat d(X_1^m, \epsilon, Z_1^N \in A^N) = 1$ iff $\Delta(X_1^m, Z_1^N) \le \epsilon'$, where $\epsilon'$ is set so as to guarantee that $\hat d(X_1^m, \epsilon, \tilde Z_{j+1}^{j+N}) = 1$, for some $\tilde Z_1^N \in S(Z_1^N, \mu)$, for at least $(1-\epsilon)(m-N+1)$ instances $j = 0,1,\ldots,m-N$. If $H_u(Z_1^N, T_u, t) + \epsilon' > \log A$, set $\hat d(X_1^m, Z_1^N \in A^N) = 1$ for every $Z_1^N \in A^N$.
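The decision statistic in step E compares a cross-entropy of the test sequence against the training statistics with the test sequence's own conditional entropy. The following Python sketch illustrates that comparison. It is a deliberately simplified, depth-1 stand-in for the context-tree quantities $h_u$ and $H_u$: the names delta_score and cond_stats, the smoothing floor, and the omission of the $H_{\epsilon,\min}$ term are choices made here, not the paper's algorithm.

```python
from collections import Counter, defaultdict
from math import log2

def cond_stats(s):
    """Depth-1 empirical statistics of a string: P(context) and P(letter | context)."""
    ctx, pair = Counter(), defaultdict(Counter)
    for a, b in zip(s, s[1:]):
        ctx[a] += 1
        pair[a][b] += 1
    n = sum(ctx.values())
    P_ctx = {c: v / n for c, v in ctx.items()}
    P_cond = {c: {x: k / sum(cnt.values()) for x, k in cnt.items()}
              for c, cnt in pair.items()}
    return P_ctx, P_cond

def delta_score(train, test, floor=1e-6):
    """Simplified stand-in for Delta(X_1^m, Z_1^N): cross-entropy of the test
    statistics against the training statistics minus the test sequence's own
    conditional entropy (depth-1 contexts only)."""
    Pc_te, Pcond_te = cond_stats(test)
    _, Pcond_tr = cond_stats(train)
    h_cross = h_self = 0.0
    for c, pc in Pc_te.items():
        for x, p in Pcond_te[c].items():
            q = Pcond_tr.get(c, {}).get(x, floor)   # smoothing floor avoids log(0)
            h_cross -= pc * p * log2(q)
            h_self -= pc * p * log2(p)
    return h_cross - h_self

# A test piece that matches the training statistics scores near 0;
# a structurally different one scores well above it.
train = "abcabcabcabcabcabcabcabc"
print(round(delta_score(train, "abcabcabcabc"), 3))
print(round(delta_score(train, "aabbccaabbcc"), 3))
```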
Refer to a test sequence $Z_1^N$ as being $\epsilon$-acceptable (relative to $X_1^m$) iff $\Delta(X_1^m, Z_1^N) \le \epsilon'$. It should be noted that for most instances in $Z_1^N$, except for at most $N\,O(N^{-\epsilon})$ instances, a context that is an element of $\bar T_u(X_1^m)$ is a suffix of an element in $T_u(Z_1^N)$. Hence, by the convexity of the logarithmic function, it follows that $\Delta(X, Z_1^N) \ge -O(N^{-\epsilon})$. If, for example, X is a realization of a stationary i.i.d. process, and if no distortion is allowed ($\mu = 0$), then $\Delta(X, Z_1^N) + O(N^{-\epsilon})$ is almost surely larger than or equal to the divergence
\[
\sum_{Z\in A} Q_{Z_1^N}(Z)\log\frac{Q_{Z_1^N}(Z)}{P(X = Z)}
\]
where $P(X)$ is the probability distribution of the i.i.d. process and where $Q_{Z_1^N}(Z)$ is the empirical probability of $Z \in A$ in $Z_1^N$.

It should be noted that for some small values of N, one may find some values of m for which $\hat H_\epsilon(N, Z_1^m)$ is much larger than $H_{\epsilon,\min}(X)$. It should also be noted that if no distortion is allowed (i.e. $\mu = 0$), the time complexity of the proposed algorithm is linear in N as well.

Theorem 3

1) For any arbitrarily small positive $\epsilon$, the "$\epsilon$-efficient" classifier that is described above accepts no more than $2^{N\hat H_\epsilon(N, Z_1^m)}$ N-vectors, where
\[
\limsup_{N\to\infty}\liminf_{m\to\infty}\hat H_\epsilon(N, Z_1^m) \le H_{\epsilon,\min}(X) + \epsilon^2 .
\]

2) There exist m-sequences such that $\hat H_\epsilon(N, X_1^m)$ is much smaller than $\log A$, for which no classifier can achieve $\hat H_\epsilon(N', X_1^m) < \log A - \delta$ if $\log N' < \log N$, where $\delta$ is an arbitrarily small positive number.

Thus, for every X there exists some $N_0(X)$ such that for every $N \ge N_0$, the proposed algorithm is essentially optimal for some $m_0 = m_0(N, X)$ and is characterized by a storage-space complexity that is linear in N. Furthermore, if one sets $\mu = 0$ (i.e. no distortion), the proposed algorithm is also characterized by a linear time-complexity.

The proof of Theorem 3 appears in the Appendix. Also, it follows from the proof of Theorem 3 that if one generates a training sequence X such that $\liminf_{M\to\infty} H_u(N, N, M) = \lim_{M\to\infty} H_u(N, N, M)$ (i.e. a "stationary" training sequence), then there always exist positive integers $N_0$ and $m_0 = m_0(N_0)$ such that the proposed classifier is essentially optimal for any $N > N_0(X)$ and any $m > m_0(N_0)$, rather than only for some specific values of m that depend on N.

Now, let $Y_1^N$ and $Z_1^N$ be two N-sequences and assume that no training sequence is available. However, one would still like to test the hypothesis that there exists some training sequence X such that both N-sequences are $\epsilon$-acceptable with respect to X. (This is reminiscent of the "common ancestor" problem in computational biology, where one may think of X as a training sequence that captures the properties of a possible "common ancestor" [15] of two DNA sequences $Y_1^N$ and $Z_1^N$.)
Corollary 1 Let $Y_1^N$ and $Z_1^N$ be two N-sequences and let $S(Y_1^N, Z_1^N)$ be the union of all their corresponding contexts that are no longer than $t = \lceil(\log N)^2\rceil$ and have an empirical probability of at least $\frac{1}{N^{1-\delta}}$. If there does not exist a conditional probability distribution
\[
P(X_1\,|\,X_{-i+1}^0)\,; \quad X_{-i+1}^0\in S(Y_1^N, Z_1^N)
\]
such that
\[
\sum_{X_1\in A} P_{N,Y_1^N}(X_1\,|\,X_{-i+1}^0, t)\log\frac{P_{N,Y_1^N}(X_1\,|\,X_{-i+1}^0, t)}{P(X_1\,|\,X_{-i+1}^0)} \le \epsilon
\]
and, at the same time,
\[
\sum_{X_1\in A} P_{N,Z_1^N}(X_1\,|\,X_{-i+1}^0, t)\log\frac{P_{N,Z_1^N}(X_1\,|\,X_{-i+1}^0, t)}{P(X_1\,|\,X_{-i+1}^0)} \le \epsilon
\]
(where $P_{N,Y_1^N}(X_1\,|\,X_{-i+1}^0, t)$ is empirically derived from $Y_1^N$ and $P_{N,Z_1^N}(X_1\,|\,X_{-i+1}^0, t)$ is empirically derived from $Z_1^N$), then there does not exist a training sequence X such that, for some $X_1^m$ with $H_0(X) > 0$ and $\hat H_\epsilon(N, X_1^m) \le H_{\epsilon,\min}(X) + \epsilon^2$ (i.e. N is "long enough" relative to X), both $Y_1^N$ and $Z_1^N$ are $\epsilon$-acceptable relative to $X_1^m$.
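The Corollary's condition asks whether a single conditional law P can simultaneously be close, in divergence, to the empirical conditionals of both test sequences. The Python sketch below illustrates that test under heavy simplification: it uses depth-1 contexts, restricts attention to contexts shared by both sequences, and the names common_ancestor_plausible and cond_dists are hypothetical. A standard convexity argument suggests that, per context, the P minimizing the larger of the two divergences can be searched for on the mixture segment between the two empirical conditionals, which is what the grid search does.

```python
from collections import Counter, defaultdict
from math import log2

def cond_dists(s, alphabet):
    """Depth-1 empirical conditional distributions P(next | context) of a string."""
    pair = defaultdict(Counter)
    for a, b in zip(s, s[1:]):
        pair[a][b] += 1
    return {c: {x: pair[c][x] / sum(pair[c].values()) for x in alphabet}
            for c in pair}

def kl(p, q, alphabet, floor=1e-12):
    return sum(p[x] * log2(p[x] / max(q[x], floor)) for x in alphabet if p[x] > 0)

def common_ancestor_plausible(y, z, alphabet, eps, grid=101):
    """For every shared depth-1 context, search mixtures P = lam*P_Y + (1-lam)*P_Z
    and require both divergences to P to be at most eps."""
    Py, Pz = cond_dists(y, alphabet), cond_dists(z, alphabet)
    for c in set(Py) & set(Pz):
        ok = False
        for k in range(grid):
            lam = k / (grid - 1)
            P = {x: lam * Py[c][x] + (1 - lam) * Pz[c][x] for x in alphabet}
            if kl(Py[c], P, alphabet) <= eps and kl(Pz[c], P, alphabet) <= eps:
                ok = True
                break
        if not ok:
            return False   # the Corollary then rules out a common training sequence
    return True

A = "ab"
print(common_ancestor_plausible("ababababab", "babababababa", A, eps=0.2))  # compatible statistics
print(common_ancestor_plausible("ababababab", "aaabaaabaaab", A, eps=0.2))  # incompatible statistics
```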
In unison with the "individual sequence" justification for the essential optimality of context-tree universal data-compression algorithms [3, 4] that was established above, these results may contribute a theoretical "individual sequence" justification for the Probabilistic Suffix Tree approach in learning and in computational biology [7, 8].
Acknowledgement Helpful discussions with Neri Merhav and helpful comments by the anonymous reviewers are acknowledged with thanks.
Appendix

Proof of Proposition 1: By definition,
\[
\rho_L(X, N, M) \ge \max_{i}\frac{1}{NM}\sum_{j=0}^{M-2} L\big(X_{i+jN}^{i+(j+1)N-1}\big) \ge \frac{1}{N}\sum_{i=1}^{N}\frac{1}{N}\sum_{Z_1^N\in A^N} P_{i,MN}(Z_1^N, N)\,L(Z_1^N) = \frac{1}{N}\sum_{Z_1^N\in A^N} P_{MN}(Z_1^N, N)\,L(Z_1^N) \ge H_{MN}(N) \tag{9}
\]
which leads to Proposition 1 by the Kraft inequality.
Proof of Lemma 1: Let $N_0$, $M_0$ and M be positive numbers and let $\epsilon = \epsilon(M)$ be an arbitrarily small positive number, satisfying $\log M_0 > N_0\log A$, $H_u(X) \ge H_u(X, N_0) - \epsilon$, and $M \ge N_0^2$, such that $H_u(M_0N_0, M_0N_0, M) \ge H_u(X) - \epsilon$, where $H_u(M_0N_0, M_0N_0, M) = H_u(M_0N_0, K, M)$ with $K = M_0N_0$. Therefore, by the properties of the entropy function, by applying the chain-rule to $H_{MM_0N_0}(N_0, M_0N_0)$ and by Eq. (5),
\[
H_{MM_0N_0}(N_0, M_0N_0) \ge H_{MM_0N_0}\big(Z_{N_0}\,\big|\,Z_1^{N_0-1}, M_0N_0\big) \ge H_u(M_0N_0, M_0N_0, M) - \frac{\log A}{N_0} \ge H_u(X, N_0) - 2\epsilon - \frac{\log A}{N_0}
\]
where, by Eq. (4), the term $\frac{\log A}{N_0}$ is an upper bound on the total contribution to $H_u(M_0N_0, M_0N_0, M)$ by vectors $Z_1^{N_0-1}$ for which $P_{MM_0N_0}(Z_1^{N_0-1}, M_0N_0) < \frac{1}{M_0N_0}$ (hence yielding $K_1(Z_1^{M_0N_0}, K) < N_0 - 1$, where $K = M_0N_0$). Note that for any vector $Z_1^{M_0N_0}$, the parameter t that determines $K_1(Z_1^{M_0N_0}, K)$ satisfies $t > N_0^2 > N_0$.

Now,
\[
\big|P_{MM_0N_0}(Z_1^{N_0}, M_0N_0) - P_{MM_0N_0}(Z_1^{N_0}, N_0)\big| \le \frac{M_0N_0}{MM_0N_0} \le \frac{1}{N_0^2} = D(N_0) .
\]
By Lemma 2.7 in [5, page 33], for any two probability distributions $P(Z_1^{N_0})$ and $Q(Z_1^{N_0})$,
\[
\Big|\sum_{Z_1^{N_0}\in A^{N_0}} P(Z_1^{N_0})\log\frac{P(Z_1^{N_0})}{Q(Z_1^{N_0})}\Big| \le -d\log\frac{d}{A^{N_0}}
\]
where $d = \max_{Z_1^{N_0}\in A^{N_0}}\big|P(Z_1^{N_0}) - Q(Z_1^{N_0})\big|$. Hence,
\[
H_{MM_0N_0}(N_0, M_0N_0) - H_{MM_0N_0}(N_0, N_0) \le \frac{1}{N_0^2}\Big[N_0\log A + 2\log N_0\Big]
\]
and therefore,
\[
H_{MM_0N_0}(N_0, N_0) \ge H_u(X, N_0) - 2\epsilon - \frac{\log A}{N_0} - \frac{1}{N_0^2}\Big[N_0\log A + 2\log N_0\Big]
\]
which, by Eqs. (6), (7) and (8), and by setting $\epsilon = \frac{1}{N_0}$, proves Lemma 1.
Proof of Theorem 1: Consider the following construction of $X_1^{NM}$. Let h be an arbitrarily small positive number and $\ell$ be a positive integer, where h and $\ell$ satisfy $N = \ell 2^{h\ell}$, and assume that $\ell$ divides N.

1) Let $S_{\ell,h}$ be a set of some $T' = \frac{N}{\ell} = 2^{h\ell}$ distinct $\ell$-vectors from $A^\ell$.

2) Generate a concatenation $Z_1^N$ of the $T'$ distinct $\ell$-vectors in $S_{\ell,h}$.

3) Return to step 2 for the generation of the next N-block.

Now, by construction, for M consecutive N-blocks,
\[
\rho_L(X, N, M) \ge \frac{1}{M}
\]
and
\[
\frac{1}{N} \ge P_{j,MN}(Z_1^\ell, N) \ge \frac{M-1}{(M-1)N+1}\,; \quad j = 1,2,\ldots,N .
\]
Thus, by construction, $P_{MN}(Z_1^\ell, N) \ge \frac{M-1}{MN}$. Furthermore, there exists a positive integer $N_0 = N_0(h)$ such that for any $N \ge N_0$,
\[
H_{MN}(\ell, N) \le \frac{\log N}{\ell} \le 2h
\]
where
\[
H_{MN}(\ell, N) = -\frac{1}{\ell}\sum_{Z_1^\ell\in A^\ell} P_{MN}(Z_1^\ell, N)\log P_{MN}(Z_1^\ell, N) .
\]
Observe that any vector $Z_{j-i}^{j}$; $i+1 \le j \le MN$, $1 \le i \le \ell-1$, except for a subset of instances j with a total empirical probability measure of at most $\frac{1}{2^{h\ell}}$, is therefore a suffix of $Z_{j-K_1(X_1^N,\hat K)}^{j}$, where $\hat K = \frac{NM}{M-1}$, and that $K_1(X_1^N, \hat K) \le t$ for any $N > N_0(h)$, where $t = \lceil(\log N)^2\rceil$. Thus, by applying the chain-rule to $H_{MN}(\ell, N)$, by the convexity of the entropy function and by Eq. (5),
\[
H_u(N, \hat K, M) \le H_{MN}\big(Z_1\,\big|\,Z_{-\ell+1}^0, N\big) \le H_{MN}(\ell, N) \le 2h . \tag{10}
\]
Also, $\limsup_{M\to\infty} H_u(N, \hat K, M) = \limsup_{M\to\infty} H_u(\hat K, \hat K, M-1)$.
Consider now the class $\sigma_{\ell,h}$ of all sets like $S_{\ell,h}$ that consist of $2^{h\ell}$ distinct $\ell$-vectors. The next step is to establish that no compression is possible for N-sequences which consist of the $2^{h\ell}$ distinct $\ell$-vectors that are selected from some member of the class $\sigma_{\ell,h}$, at least for some such N-sequences.

Let the normalized length-function $\bar L(Z_1^N)$ be defined by:
\[
\bar L(Z_1^N) = -\log\frac{2^{-L(Z_1^N)}}{\sum_{Z_1^N\in A^N} 2^{-L(Z_1^N)}} .
\]
Clearly, $\bar L(Z_1^N) \le L(Z_1^N)$, since $L(Z_1^N)$ satisfies the Kraft inequality while $\bar L(Z_1^N)$ satisfies it with equality, since $2^{-\bar L(Z_1^N)}$ is a probability measure. Then,
\[
L(Z_1^N) \ge \bar L(Z_1^N) = \sum_{i=0}^{\frac{N}{\ell}-1}\bar L\big(Z_{i\ell+1}^{(i+1)\ell}\,\big|\,Z_1^{i\ell}\big)
\]
where
\[
\bar L\big(Z_{i\ell+1}^{(i+1)\ell}\,\big|\,Z_1^{i\ell}\big) = \bar L\big(Z_1^{(i+1)\ell}\big) - \bar L\big(Z_1^{i\ell}\big)
\]
is a (normalized) conditional length-function that, given $Z_1^{i\ell}$, satisfies the Kraft inequality with equality, since $2^{-\bar L(Z_{i\ell+1}^{(i+1)\ell}|Z_1^{i\ell})}$ is a conditional probability measure.

Lemma 2 For any $h > 0$, any $N \ge N_0 = N_0(\ell, h)$ and any $\bar L(X_1^\ell\,|\,X_{-N+1}^0)$, there exists a set $S_{\ell,h}$ of $2^{h\ell}$ $\ell$-vectors such that
\[
\sum_{X_1^\ell\in A^\ell} P_{1,N}(X_1^\ell, \ell)\,\bar L(X_1^\ell\,|\,X_{-N+1}^0) \ge \ell(1-\delta)(\log A - \delta)
\]
for all $X_{-N+1}^0$ which are concatenations of $\ell$-vectors from $S_{\ell,h}$, as described above.
Proof of Lemma 2: The number of possible sets $S_{\ell,h}$ that may be selected from the $A^\ell$ $\ell$-vectors over A is
\[
M_{\ell,h} = \binom{2^{(\log A)\ell}}{2^{h\ell}} .
\]
Given a particular $\bar L(X_1^\ell\,|\,X_{-N+1}^0)$, consider the collection $M_{\ell,h,\delta|X_{-N+1}^0}$ of all sets $S_{\ell,h,\delta}$ that consist of at most $(1-\delta)2^{h\ell}$ vectors selected from the set of vectors $\hat X_1^\ell$ for which $\bar L(\hat X_1^\ell\,|\,X_{-N+1}^0) \ge (\log A - \delta)\ell$ (observe that there are at least $2^{\ell\log A} - 2^{(\log A-\delta)\ell}$ such vectors). The collection $M_{\ell,h,\delta|X_{-N+1}^0}$ is referred to as the collection of "good" sets $S_{\ell,h,\delta}$ (i.e. sets yielding $\bar L(\hat X_1^\ell\,|\,X_{-N+1}^0) \le (\log A - \delta)\ell$).
It will now be demonstrated that $\sum_{X_{-N+1}^0\in A^N} M_{\ell,h,\delta|X_{-N+1}^0}$ is exponentially smaller than $M_{\ell,h}$ if $N < \delta 2^{h\ell}(1-h)\ell$. Hence, for any conditional length-function $L(X_1^\ell\,|\,X_{-N+1}^0)$ and any $X_{-N+1}^0 \in A^N$, most of the sets $S_{\ell,h} \in M_{\ell,h}$ will not contain a "good" $S_{\ell,h,\delta} \in M_{\ell,h,\delta|X_{-N+1}^0}$, and therefore less than $\delta 2^{h\ell}$ $\ell$-vectors out of the $2^{h\ell}$ $\ell$-vectors in $S_{\ell,h}$ will be associated with an $L(X_1^\ell\,|\,X_{-N+1}^0) < (\log A - \delta)\ell$.

The cardinality of $M_{\ell,h,\delta|X_{-N+1}^0}$ is upper bounded by
\[
\sum_{j=\delta 2^{h\ell}}^{2^{h\ell}}\binom{2^{(\log A-\delta)\ell}}{j}\binom{2^{\ell\log A} - 2^{(\log A-\delta)\ell}}{2^{h\ell}-j} .
\]
Now, by [3], one has, for a large enough positive integer n,
\[
\log_2\binom{n}{pn} = [h(p) + \epsilon(n)]\,n
\]
where $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$ and where $\lim_{n\to\infty}\epsilon(n) = 0$. Thus,
\[
\log\frac{\sum_{X_{-N+1}^0\in A^N} M_{\ell,h,\delta|X_{-N+1}^0}}{M_{\ell,h}} \le -\delta 2^{h\ell}(1-h)\ell + N + \epsilon'(N)N \tag{11}
\]
where $\lim_{N\to\infty}\epsilon'(N) = 0$. Therefore, if $N < \delta 2^{h\ell}(1-h)\ell$, there exists some $S_{\ell,h}$ for which
\[
\sum_{X_1^\ell\in A^\ell} 2^{-h\ell}\,\bar L(X_1^\ell\,|\,X_{-N+1}^0) \ge \ell(1-\delta)(\log A - \delta)
\]
for all N-vectors $X_{-N+1}^0 \in A^N$.
Hence, by construction, there exists some $S_{\ell,h}$ for which
\[
L(Z_1^N) \ge \bar L(Z_1^N) = \sum_{i=0}^{\frac{N}{\ell}-1}\bar L\big(Z_{i\ell+1}^{(i+1)\ell}\,\big|\,Z_1^{i\ell}\big) \ge N\big[(1-\delta)(\log A - \delta) + \epsilon'(N)\big] .
\]
This completes the proof of Lemma 2 and, setting $h = \delta$, the proof of Theorem 1.
Proof of Theorem 3: It follows from the construction of the universal compression algorithm that is associated with Theorem 2 above that $N[H_u(Z_1^N, T_u, t) + O(N^{-\epsilon})]$ is a proper length-function. Consider the one-to-one mapping of $X_1^N$ with the following length-function:

1) $L(Z_1^N) = 2 + N[H_u(Z_1^N, T_u, t) + O(N^{-\epsilon})]$ if $H_u(Z_1^N, T_u, t) \le H_\epsilon(N, X_1^m)$ and if $Z_1^N \in S_0(N, X_1^m) = S(N, \epsilon, X_1^m)$;

2) $L(Z_1^N) = 2 + N[H_0(N, X_1^m)]$ if $H_u(Z_1^N, T_u, t) > H_\epsilon(N, X_1^m)$ and if $Z_1^N \in S(N, \epsilon, X_1^m)$;

3) else, $L(Z_1^N) = 2 + N[H_u(Z_1^N, T_u, t) + O(N^{-\epsilon})]$.

Note that for every $Z_1^N \in S(N, \epsilon, X_1^m)$, $L(Z_1^N) \le 2 + N[H_\epsilon(N, X_1^m)]$. By Proposition 1 and by Lemma 1, since $L(Z_1^N)$ is a length-function, and by the construction of the universal data-compression algorithm that is associated with Theorem 2, for any X there exists a positive integer $N_0$ such that for any $N > N_0$ and some $M > M_0 = M_0(N_0)$, $H_u(N, N'', M) \le H_u(X, N) + \frac{1}{4}\epsilon^2$ and
\[
\frac{1}{m-N+1}\sum_{j=0}^{m-N} H_u\big(X_{j+1}^{j+N}, \bar T_u, t\big) + \frac{2}{N} + O(N^{-\epsilon}) \le H_u(N, N'', M) + \frac{2}{N} + O(N^{-\epsilon}) .
\]
Also, by Proposition 1 and Lemma 1, for any $N \ge N_0'(X) \ge N_0$ and any $m \ge M N_0'$,
\[
\frac{1}{m-N+1}\sum_{j=0}^{m-N} L\big(X_{j+1}^{j+N}\big) \ge H_{MN}(N) \ge H_u(X, N) - \frac{1}{4}\epsilon^2 .
\]
Therefore,
\[
\frac{1}{m-N+1}\sum_{j=0}^{m-N}\Delta\big(X_1^m, X_{j+1}^{j+N}\big) \le \frac{1}{2}\epsilon^2 + O(N^{-\epsilon}) .
\]
But, as pointed out in the description of the universal classifier above,
\[
\Delta\big(X_1^m, X_{j+1}^{j+N}\big) + O(N^{-\epsilon}) \ge 0\,; \quad j = 0,1,\ldots,m-N .
\]
Statement 1) in Theorem 3 then follows by the Markov inequality and by setting $\epsilon' = \frac{1}{2}\epsilon^2 + O(N^{-\epsilon})$.

Next, a training sequence $X_1^m$ is constructed, for which $H_{\epsilon,\min}(N, X_1^m) > 0$ and where the $2^{N H_{\epsilon,\min}(N, X_1^m)}$ "typical" N-vectors in $X_1^m$ are equiprobable. At the same time, $\hat H_\epsilon(N, Z_1^N) \le H_{\epsilon,\min}(N, X_1^m) + \epsilon^2$, where $\epsilon$ is an arbitrarily small positive number. Hence, if $N'$ is a positive integer that satisfies $\log N' < \log N$, any classifier with a storage complexity of $O(N')$ can store only an exponentially small fraction of the $2^{N H_{\epsilon,\min}(N, X_1^m)}$ typical N-vectors that should be accepted, and therefore, for any $\epsilon$-efficient classifier, $\hat H_\epsilon(N, Z_1^N) \ge \log A - \epsilon$. This leads to statement 2) of Theorem 3.

Let B and $\ell$ be positive integers satisfying $B > \frac{\ell^{3\epsilon}}{\epsilon}$ and let $R_{\ell,h}$ be the set of all the $\ell$-vectors in $S_{\ell,h}$ and their cyclic shifts, where $S_{\ell,h}$ is described in the proof of Theorem 1 above. Consider training sequences $X_1^m$ that are generated by B repetitions of an $\ell$-vector in $S_{\ell,h}$, followed by B repetitions of yet another $\ell$-vector in $R_{\ell,h}$, and so on, until $R_{\ell,h}$ is exhausted. Thus, $m = \ell^2 2^{h\ell} B$.

Let $N = \epsilon\ell B$ and, assuming that N divides m, let $M = \frac{m}{N}$. It then follows that $0 < H_{\epsilon,\min}(N, X_1^m) \le h + \frac{\log\ell}{\ell}$. Also, it follows that $P_{MN}(Z_j^{j+\ell}) \ge \frac{1}{(1-\epsilon)\ell}$ for most instances j, except for a subset of instances that has an empirical probability of at most $\epsilon$. Thus, similarly to the proof of Theorem 1 above, it follows that there exists a positive integer $\ell_0$ such that for any $\ell \ge \ell_0$, $H_{MN}(\ell, N) \le h + \frac{\log\ell}{\ell} + \epsilon\log A$ and also, for small enough $\epsilon$, $H_u(N, K, M) \le H_u\big(N, \frac{\ell}{1-\epsilon}, M\big) < 2h$, where $K = N''$ and where $N'' = N^{1-2\epsilon} = \ell^{(1+3\epsilon)(1-2\epsilon)} \ge \frac{\ell}{1-\epsilon}$. Therefore, $\Delta(X_1^m, X_{j+1}^{j+N}) \le 2h$.

Hence, by setting $h = \epsilon^2$ and $\epsilon' = \epsilon^2 + O(N^{-\epsilon})$, and by the Markov inequality, $\hat H_\epsilon(N, Z_1^N) \le H_{\epsilon,\min}(N, X_1^m) + \epsilon^2$. Also, the $2^{N H_{\epsilon,\min}(N, X_1^m)}$ "typical" N-vectors in $X_1^m$ are equiprobable, which completes the proof of Theorem 3.
References

[1] J. Ziv, A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sept. 1978.

[2] A. Martin, G. Seroussi, M. Weinberger, "Linear Time Universal Coding and Time Reversal of Tree Sources via FSM Closure", IEEE Trans. Inf. Theory, vol. 50, no. 7, July 2004.

[3] R.B. Ash, "Information Theory", Dover, New York, 1965, p. 113.

[4] F.M.J. Willems, Y.M. Shtarkov, T.J. Tjalkens, "The Context-Tree Weighting Method: Basic Properties", IEEE Trans. Inf. Theory, vol. IT-41, no. 3, pp. 653-664, May 1995.

[5] I. Csiszar, J. Korner, "Information Theory: Coding Theorems for Discrete Memoryless Systems", Academic Press, 1981.

[6] P.C. Shields, "The Interactions Between Ergodic Theory and Information Theory", IEEE Trans. Inf. Theory, vol. IT-44, no. 6, 1998.

[7] D. Ron, Y. Singer, N. Tishby, "The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length", Machine Learning, vol. 25, no. 2-3, pp. 117-149, 1996.

[8] A. Apostolico, G. Bejerano, "Optimal Amnesic Probabilistic Automata, or How to Learn and Classify Proteins in Linear Time and Space", Proc. of ACM RECOMB, pp. 25-32, 2000.

[9] M. Feder, N. Merhav, M. Gutman, "Universal prediction of individual sequences", IEEE Trans. Inf. Theory, vol. IT-38, no. 4, pp. 1258-1270, July 1992.

[10] J. Rissanen, "Fast universal coding with context models", IEEE Trans. Inf. Theory, vol. IT-45, no. 4, pp. 1065-1071, May 1999.

[11] J. Ziv, "Classification with Finite Memory Revisited", submitted to the IEEE Trans. Inf. Theory.

[12] J. Ziv, "An efficient universal prediction algorithm for unknown sources with limited training data", IEEE Trans. Inf. Theory, vol. IT-48, pp. 1690-1693, June 2002; and Correction, IEEE Trans. Inf. Theory, vol. IT-50, pp. 1851-1852, August 2004.

[13] P. Jacquet, W. Szpankowski, I. Apostol, "A universal predictor based on pattern matching", IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1462-1472, June 2002.

[14] J. Ziv, N. Merhav, "A measure of relative entropy between individual sequences with applications to universal classification", IEEE Trans. Inf. Theory, vol. IT-39, pp. 1280-1292, 1993.

[15] M. Ridley, "The Search for LUCA", Natural History, November 2000, pp. 82-85.