The Index Entropy of a Mismatched Codebook

Ram Zamir
Dept. of Electrical Engineering - Systems, Tel-Aviv University, Tel-Aviv 69978, Israel
[email protected]

Submitted to IEEE Trans. on Information Theory, December 27, 1999

Abstract

We show that if a random codebook for lossy source coding is generated by a non-optimum reproduction distribution Q, then the entropy of the index of the D-matching codeword is reduced by conditioning on the codebook: the number of bits saved is equal to the divergence between the "favorite type" in the codebook and the generating distribution Q. Specific examples are provided.

Key Words: Approximate string matching, favorite type, entropy coded quantization.

I. Introduction and Main Result

Consider coding a string X = X_1...X_l, generated by a discrete memoryless source having a distribution P over a finite alphabet X, into a code word y = y_1...y_l from a finite alphabet Y, under the distortion constraint

    d(X, y) = (1/l) Σ_{i=1}^{l} d(X_i, y_i) ≤ D                                  (1)

where d : X × Y → [0, ∞) is a finite distortion measure. If (1) holds we say that "y D-matches X".
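
As a concrete illustration (not part of the original text), the D-match test in (1) amounts to a few lines of Python; the alphabets, distortion table and threshold below are arbitrary choices for the example.

    def d_matches(x, y, d, D):
        """True if the codeword y D-matches the source word x under the per-letter distortion table d."""
        assert len(x) == len(y)
        return sum(d[xi][yi] for xi, yi in zip(x, y)) / len(x) <= D

    hamming = [[0, 1], [1, 0]]                                            # binary Hamming distortion
    print(d_matches([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], hamming, D=0.25))   # True: distortion 0.2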

Suppose a random codebook Y_1, Y_2, ... of words in Y^l is generated such that each letter in each code word is independent and identically distributed as Q. Let N_l denote the index of the first codeword that satisfies (1), i.e.,

    d(X, Y_i) > D,   i = 1, ..., N_l - 1,
    d(X, Y_{N_l}) ≤ D.

To avoid technical subtleties we assume that the distortion measure is such that every source letter has a perfect reconstruction letter, i.e., for each x

    d(x, y) = 0 for some y.                                                      (2)

We also assume that Q(y) > 0 for all y in Y. It follows that for any source string x and D ≥ 0, there is a positive probability p_match > 0 that each codeword Y_i will D-match x. As a consequence Pr{N_l < ∞ | X = x} = 1, i.e., a D-match is found in the codebook with probability one. See, e.g., [4, 10, 5] for various settings of lossy source coding and the related topic of approximate string matching. In [5], Yang and Kieffer show that (1/l) log(N_l) converges to a constant,

    (1/l) log(N_l) → R(P, Q, D)  as l → ∞ in probability,                        (3)

and they characterize R(P, Q, D) in terms of information theoretic quantities. A modification of their formula for R(P, Q, D), which appears in [7, 9], has the form

    R(P, Q, D) = min_{Q'} { I_m(P||Q', D) + D(Q'||Q) }                           (4)

where D(·||·) denotes divergence (or relative entropy), and I_m denotes "lower mutual information" [10]:

    I_m(P||Q, D) = min I(P, W),                                                  (5)

where I(P, W) denotes the mutual information associated with input distribution P and transition distribution W from X to Y, and the minimization in (5) is over all the W's that induce output distribution Q and average distortion less than or equal to D. If no such W exists then I_m(P||Q, D) is equal to infinity. Thus, for large word length l, approximately 2^{lR(P,Q,D)} code words suffice to guarantee a D-match with high probability. This result may give some insight into the workings of adaptive lossy source coding schemes, where during the adaptation phase the codebook is sub-optimum with respect to the source [7].

In [7, 8, 9], Zamir and Rose address the random "type" T_{N_l} of the D-matching codeword Y_{N_l} (i.e., the empirical distribution of Y_{N_l} [2]). They show that for large word length, T_{N_l} concentrates around a limiting distribution:

    T_{N_l} → Q_{P,Q,D}  as l → ∞ in probability,                                (6)

where Q_{P,Q,D} is the distribution Q' which achieves the minimum in (4). We call this concentration point "the favorite type" (although Q_{P,Q,D} is in general not an l-type). The intuition behind this phenomenon comes from the roles of the two terms in the minimum in (4): the lower mutual information I_m(P||Q', D) characterizes the "covering efficiency" of a type Q' - it is the D-match probability exponent given that the codeword's type is Q' - while the divergence D(Q'||Q) characterizes the frequency of a type Q' in the codebook - the frequency is ≈ 2^{-l D(Q'||Q)}. The common types in the codebook are close to Q, but their covering efficiency is small; the most covering-efficient types in the codebook are close to an optimum reproduction distribution

    Q_{P,D} = arg min_Q R(P, Q, D),

which achieves the rate-distortion function R(P, D) [1], but these types are too rare in the codebook. The "favorite type" Q_{P,Q,D} strikes the optimum balance between covering efficiency and frequency in the codebook. It follows from (6) that most of the 2^{lR(P,Q,D)} codewords in the codebook are asymptotically useless; only those having a type close to Q_{P,Q,D} - whose fraction in the codebook is only ≈ 2^{-l D(Q_{P,Q,D}||Q)} - have a good chance to D-match the source word. In a sense, we are paying extra D(Q_{P,Q,D}||Q) bits in coding rate for the random appearance of types in the codebook. Our main result shows that this redundancy can be removed by entropy coding conditioned on the codebook. Let

    H(N_l) = - Σ_{n=1}^{∞} Pr{N_l = n} log Pr{N_l = n}

denote the entropy of the index of the D-matching code word, and let

    H(N_l | Y_1, Y_2, ...) = lim_{M→∞} Σ_{y_1,...,y_M} Pr{Y_1 = y_1, ..., Y_M = y_M} H(N_l | y_1, ..., y_M)

denote its conditional entropy given the codebook, where the limit holds since the sequence of conditional entropies is non-increasing with M.

Theorem 1  If the codebook generating distribution Q is positive everywhere, then

    lim_{l→∞} (1/l) H(N_l) = R(P, Q, D)                                          (7)

and

    lim_{l→∞} (1/l) H(N_l | Y_1, Y_2, ...) = I_m(P||Q_{P,Q,D}, D)                (8)
                                           = R(P, Q, D) - D(Q_{P,Q,D}||Q).       (9)

Thus conditioning on the codebook saves D(Q_{P,Q,D}||Q) bits in coding rate.

The proof is given in Section III. Conditioning on the codebook can be viewed as conditioning on the past reproduction in "backward adaptive" sequential coding. Roughly speaking, for a random codebook the index N_l is uniformly distributed over the entire range (1, ..., 2^{lR(P,Q,D)}), since the "favorite type" can appear anywhere in that range, and therefore the unconditional entropy in (7) does not improve the coding rate in (3). Conditioning on the codebook amounts to re-ordering the codewords according to their probabilities to be selected [3], which effectively constructs a sub-codebook of size 2^{lR(P,Q,D)} / 2^{l D(Q_{P,Q,D}||Q)}. The number of bits saved, D(Q_{P,Q,D}||Q), is strictly positive, unless the generating distribution is an optimal reproduction distribution Q_{P,D}; see [8].
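
To make formula (4) and Theorem 1 concrete, the following minimal numerical sketch (not from the paper) evaluates the minimum in (4) by a grid search over binary test channels W, for a Bernoulli(p) source, Hamming distortion and a uniform generating distribution Q; it is compared against the closed forms given later in Lemma 1 (Section II). The parameters p, D and the grid resolution are arbitrary illustrative choices.

    # Numerical sketch of formula (4) for a binary-Hamming example with uniform Q.
    # Grid-search over binary channels W and minimize I(P,W) + D(PW || Q).
    import numpy as np

    def h2(q):
        """Binary entropy in bits, with 0 log 0 = 0."""
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

    p, D = 0.2, 0.1                  # Bernoulli(p) source and distortion level D (assumed values)
    P = np.array([1 - p, p])

    # Parametrize W by the crossover probabilities a = W(1|0), b = W(0|1).
    a, b = np.meshgrid(np.linspace(0, 1, 1001), np.linspace(0, 1, 1001))
    dist = P[0] * a + P[1] * b                    # E d(X,Y) under P and W (Hamming distortion)
    q1 = P[0] * a + P[1] * (1 - b)                # output distribution Q'(1) = (PW)(1)
    I = h2(q1) - (P[0] * h2(a) + P[1] * h2(b))    # mutual information I(P,W)
    Dkl = 1.0 - h2(q1)                            # D(Q'||Q) = log|Y| - H(Q') for uniform Q
    obj = np.where(dist <= D, I + Dkl, np.inf)    # the bracketed term in (4), over admissible W

    idx = np.unravel_index(np.argmin(obj), obj.shape)
    pD = p * (1 - D) + (1 - p) * D                # binary convolution p * D
    print(f"grid search : R ≈ {obj[idx]:.4f} bits, Q'(1) ≈ {q1[idx]:.4f}, Im ≈ {I[idx]:.4f}")
    print(f"closed forms: R = 1 - h(D) = {1 - h2(D):.4f}, Q'(1) = p*D = {pD:.4f}, "
          f"Im = h(p*D) - h(D) = {h2(pD) - h2(D):.4f}")
    print(f"bits saved by conditioning (Theorem 1): D(Q'||Q) ≈ {obj[idx] - I[idx]:.4f}")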

II. Examples

In the following special case we have explicit formulas for R(P, Q, D) and Q_{P,Q,D}, and hence for the conditional index entropy. Let X = Y = {0, ..., |X| - 1}. Consider a symmetric distortion measure of the form d(x, y) = d(y - x), where the subtraction is modulo |X|. Define H_max(D) and V_D as the maximum entropy under a D-constraint and the maximum-entropy achieving distribution, respectively [2]:

    H_max(D) = H(V_D) = max_{V: Σ_x V(x) d(x) ≤ D} H(V).

Lemma 1  For a symmetric distortion measure and uniform codebook generating distribution Q(y) = 1/|Y| for all y,

    R(P, Q, D) = log|X| - H_max(D)

and

    Q_{P,Q,D} = P ⊛ V_D

where the ⊛ sign denotes a circular convolution (i.e., P ⊛ V_D is the distribution of the independent sum of a random variable ~ P and a random variable ~ V_D). Hence the conditional index entropy (8) is given by

    I_m(P||Q_{P,Q,D}, D) = H(P ⊛ V_D) - H_max(D).

In particular, in the symmetric binary-Hamming case, i.e., X = Y = {0, 1}, Q(0) = Q(1) = 1/2, d(x, y) = x ⊕ y, and for a Bernoulli(p) source, we have by Lemma 1

    R(P, Q, D) = 1 - H_B(D)                                                      (10)
    Q_{P,Q,D}(1) = p * D                                                         (11)
    I_m(P||Q_{P,Q,D}, D) = H_B(p * D) - H_B(D),                                  (12)

where H_B(p) = -p log(p) - (1 - p) log(1 - p) denotes the binary entropy, and p * D = p(1 - D) + (1 - p)D denotes the binary convolution. Note that, while R(P, Q, D) is highly redundant and is independent of the source distribution, the conditional index entropy is significantly lower, and for low distortions it is close to the Shannon lower bound on the rate-distortion function, H(P) - H_max(D) [1].

Proof:  If E d(Y - X) ≤ D then by the properties of the entropy function

    H(Y|X) = H(Y - X | X) ≤ H(Y - X) ≤ H_max(D)

with equality if and only if Y can be written as the independent sum of X and "noise" ~ V_D. Thus, supposing Y' achieves I_m(P||Q', D), we obtain the "reversed Shannon lower bound"

    I_m(P||Q', D) = H(Y') - H(Y'|X) ≥ H(Q') - H_max(D),

with equality iff Q' = P ⊛ V_D. On the other hand, since Q is uniform, the divergence in (4) can be written for any Q' as D(Q'||Q) = H(Q) - H(Q') = log|X| - H(Q'). Substituting in (4) we obtain the lower bound R(P, Q, D) ≥ log|X| - H_max(D). The lemma now follows since substituting Q' = P ⊛ V_D in (4) achieves this lower bound. QED

In a forthcoming paper with I. Kontoyiannis [3] we extend the "favorite type theorem" (6) and the index entropy result above to continuous sources and reproduction alphabets. These extensions allow us to consider encoding a general continuous source using a Gaussian codebook under a squared error criterion. We show that in the limit of large codebook variance, the conditional index entropy (8) is given in this case by the mutual information obtained by passing the source through an additive Gaussian noise channel with noise variance D. This result gives an interesting interpretation for the entropy rate of dithered lattice quantizers [6].
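
As a numerical companion to Lemma 1 (an illustrative sketch, not material from the paper), the Python code below evaluates H_max(D), V_D, the favorite type P ⊛ V_D and the resulting rates for a Hamming-type symmetric distortion on Z_4 with uniform Q. It relies on the standard fact that the maximum-entropy distribution under a linear distortion constraint has the exponential form V(x) ∝ exp(-λ d(x)); the alphabet size, distortion values, source distribution and D are assumed for the example.

    # Sketch of Lemma 1 for a symmetric distortion d(x,y) = d((y - x) mod m) on Z_m.
    import numpy as np

    def entropy(v):
        v = v[v > 0]
        return -np.sum(v * np.log2(v))

    def max_entropy_dist(d, D):
        """Maximum-entropy distribution on {0,...,m-1} subject to E[d] <= D (bisection on the tilt)."""
        d = np.asarray(d, dtype=float)
        uniform = np.ones_like(d) / len(d)
        if uniform @ d <= D:                 # constraint inactive: the uniform law is max-entropy
            return uniform
        lo, hi = 0.0, 100.0                  # E[d] under exp(-lam*d)/Z is decreasing in lam
        for _ in range(200):
            lam = 0.5 * (lo + hi)
            v = np.exp(-lam * d)
            v /= v.sum()
            lo, hi = (lo, lam) if v @ d <= D else (lam, hi)
        v = np.exp(-hi * d)
        return v / v.sum()

    m = 4
    d = np.array([0.0, 1.0, 1.0, 1.0])       # Hamming-type distortion on Z_4 (assumed)
    P = np.array([0.5, 0.2, 0.2, 0.1])       # source distribution (assumed)
    D = 0.15

    V = max_entropy_dist(d, D)
    Hmax = entropy(V)
    R = np.log2(m) - Hmax                    # Lemma 1: R(P,Q,D) for uniform Q
    fav = np.array([sum(P[x] * V[(y - x) % m] for x in range(m)) for y in range(m)])  # P (*) V_D
    Im = entropy(fav) - Hmax                 # conditional index entropy (8)

    print("V_D         :", np.round(V, 4))
    print("favorite Q' :", np.round(fav, 4))
    print(f"R(P,Q,D) = log m - Hmax(D) = {R:.4f} bits")
    print(f"conditional index entropy  = {Im:.4f} bits  (saving D(Q'||Q) = {R - Im:.4f})")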


III. Proof of Theorem

We start with (7). Consider the conditional entropy of N_l given a specific source string x,

    H(N_l | X = x) = - Σ_{n=1}^{∞} p_n(x) log(p_n(x))                            (13)

where p_n(x) is the conditional probability that N_l = n given X = x. Since the code words are independent and identically distributed, we have

    p_n(x) = [1 - p_match(x)]^{n-1} p_match(x)                                   (14)

where p_match(x) = Pr{ d(x, Y_i) ≤ D }. Substituting in (13), and using the series Σ_{m=1}^{∞} m q^m = q/(1-q)^2, we obtain

    H(N_l | X = x) = - log p_match(x) + 1 + O(p_match(x)).                       (15)
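
The step from (14) to (15) is just the entropy of a geometric distribution. As a quick numerical check (an illustrative aside, not part of the proof), the snippet below compares the exact geometric entropy with -log p_match + 1; entropies are taken in nats here, so that the additive constant is exactly 1, and any fixed constant is in any case absorbed by the o(1) term once we divide by l in (17) below. The p_match values are arbitrary.

    # Entropy of the geometric index distribution (14) versus the approximation in (15).
    import numpy as np

    def geometric_entropy_nats(p):
        """Entropy (nats) of P(N = n) = (1-p)^(n-1) p, n = 1, 2, ... (closed form)."""
        return (-(1 - p) * np.log(1 - p) - p * np.log(p)) / p

    for p in [0.1, 0.01, 0.001]:
        exact = geometric_entropy_nats(p)
        approx = -np.log(p) + 1
        print(f"p_match = {p:5}:  exact H = {exact:.4f} nats,  -log(p) + 1 = {approx:.4f}")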

From [5, 9], the probability that Y ~ Q D-matches x satisfies

    p_match(x) = 2^{-l[R(T_x, Q, D) + o(1)]}                                     (16)

where T_x is the type of the string x, and o(1) → 0 as l → ∞ uniformly in x. Substituting (16) in (15), we get

    (1/l) H(N_l | X = x) = R(T_x, Q, D) + o(1)                                   (17)

where o(1) → 0 as l → ∞ uniformly in x. Since the right hand side depends only on the type of x, we also have

    (1/l) H(N_l | T_X = P') = R(P', Q, D) + o(1).

Using the fact that the o(1) term converges uniformly, we take an expectation over the source string type T_X, and get

    (1/l) H(N_l | T_X) = E{ R(T_X, Q, D) } + o(1).

But the source string is memoryless, so T_X converges to P in probability. Furthermore, due to Q being everywhere positive, R(P', Q, D) as a function of P' is continuous and bounded, so convergence in probability implies convergence in the mean. We conclude that

    lim_{l→∞} (1/l) H(N_l | T_X) = R(P, Q, D).

Now, by the properties of the mutual information function and the entropy we have

    0 ≤ H(N_l) - H(N_l | T_X) = I(T_X; N_l)                                      (18)
                              ≤ H(T_X)                                           (19)
                              = O(log(l))                                        (20)

where the last line follows since the support of T_X (the number of l-types) grows polynomially with l. Thus (1/l)[H(N_l) - H(N_l | T_X)] → 0, and the first part of the theorem follows.

We turn to the conditional index entropy (8). Consider first the "direct" half of the statement, i.e., lim_{l→∞} (1/l) H(N_l | Y_1, Y_2, ...) ≤ I_m(P||Q_{P,Q,D}, D). Let T_i = T_{Y_i} denote the (random) type of the i-th codeword in the code. Since T_i is a function of Y_i, we have

    H(N_l | Y_1, Y_2, ...) ≤ H(N_l | T_1, T_2, ...).                             (21)

Pick any l-types P' and Q' such that

    I_m(P'||Q', D) < ∞.                                                          (22)

Given a specific type sequence T_1, T_2, ... = t_1, t_2, ..., and conditioned on the event T_{N_l} = Q', the index N_l belongs to the sub-sequence

    {n : t_n = Q'} = {n_1, n_2, ...}.

Therefore

    Pr{N_l = n | T_X = P', T_{N_l} = Q'} = 0   if n ∉ {n_1, n_2, ...}.

Moreover, analogously to (14), we have

    Pr{ N_l = n_i | T_X = P', T_{N_l} = Q' } = [1 - p_match(l, P', Q')]^{i-1} p_match(l, P', Q'),

for i = 1, 2, ..., where by the "conditional D-match exponent theorem" [10, 9],

    p_match(l, P', Q') = Pr{ d(X, Y) ≤ D | T_X = P', T_Y = Q' }                  (23)
                       = 2^{-l[I_m(P'||Q', D) + o(1)]},                          (24)

and o(1) → 0 as l → ∞ uniformly in P' and Q'. The calculation above applies since D-match events are statistically independent even when conditioned on a specific code word type (though it wouldn't apply if conditioned on the codebook y_1, y_2, ... itself). Analogously to the derivation leading to (17), we thus obtain

    (1/l) H(N_l | T_X = P', T_{N_l} = Q', T_1 = t_1, T_2 = t_2, ...) = I_m(P'||Q', D) + o(1)         (25)

where o(1) → 0 as l → ∞ uniformly for all P', Q' satisfying (22). The latter result holds also for the average conditional entropy given T_1, T_2, ..., since the right hand side is independent of t_1, t_2, .... By averaging also over the types of the source string and the D-matching code word, we therefore obtain

    (1/l) H(N_l | T_X, T_{N_l}, T_1, T_2, ...) = E{ I_m(T_X||T_{N_l}, D) } + o(1).                   (26)

Now, since the source is memoryless, the law of large numbers and the "favorite type theorem" (6) imply

    (T_X, T_{N_l}) → (P, Q_{P,Q,D})  as l → ∞ in probability.                    (27)

Note that the function I_m(P'||Q', D) is continuous in P', Q' and bounded by log|X| whenever it is finite. In fact, for any pair (P', Q'), either I_m(P'||Q', D) ≤ log|X| or Pr(T_X = P', T_{N_l} = Q') = 0.¹ In other words, I_m(T_X||T_{N_l}, D) is continuous and bounded with probability one for every l. It follows that the convergence in probability in (27) implies convergence in the mean in (26), so

    lim_{l→∞} (1/l) H(N_l | T_X, T_{N_l}, T_1, T_2, ...) = I_m(P||Q_{P,Q,D}, D).

By arguments similar to (18) - (20) we can now remove the conditioning on T_X and T_{N_l}, to obtain

    lim_{l→∞} (1/l) H(N_l | T_1, T_2, ...) = I_m(P||Q_{P,Q,D}, D).               (28)

¹ This is because I_m(P'||Q', D) = ∞ implies that d(x, y) > D for every (x, y) such that x is of type P' and y is of type Q'.

Thus, conditioning on the types is enough to reduce the entropy to the level in (8), and in view of (21) the "direct" half of (8) follows. The "converse" half of (8) shows that conditioning also on Y_1, Y_2, ... does not reduce the entropy further. The proof is based on the following lemma which is proved in the Appendix.

Lemma 2  If the probability of each letter in some discrete alphabet is upper bounded by 1/M for some integer M (where M is less than or equal to the size of the alphabet), then the entropy is lower bounded by log(M). The minimum is achieved if and only if the distribution is uniform over some M letters.

Suppose that the source string is of some type P', and the D-matching code word is of the same type as the n-th code word, i.e., T_{N_l} = T_n = Q' for some Q'. Since previous codewords can D-match the source before the n-th codeword, the probability that N_l = n cannot exceed the probability that y_n D-matches the source string, i.e.,

    Pr{ N_l = n | T_X = P', T_{N_l} = Q', y_1, y_2, ... } ≤ Pr{ d(X, y_n) ≤ D | T_X = P' }           (29)
                                                          = Pr{ d(X, Y_n) ≤ D | T_X = P', T_n = Q' } (30)
                                                          = 2^{-l[I_m(P'||Q', D) - o(1)]}            (31)

where (30) follows since the D-match probability depends only on the type of y, (31) follows as in (23), and o(1) → 0 as l → ∞ uniformly in P' and Q'. Combining with Lemma 2 we conclude that

    H(N_l | T_X = P', T_{N_l} = Q', y_1, y_2, ...) ≥ l[I_m(P'||Q', D) - o(1)].

Following the same arguments leading from (25) to (28), we conclude that

    liminf_{l→∞} (1/l) H(N_l | Y_1, Y_2, ...) ≥ I_m(P||Q_{P,Q,D}, D)

and the "converse" half of (8) follows. This completes the proof of the theorem. QED

Appendix: Proof of Lemma 2

Assume first a finite alphabet. Pick any two letters whose probabilities are strictly less than 1/M, and "sharpen" their distribution by increasing the bigger one to the maximum and reducing the smaller one to the minimum; specifically, if 1/M > p_1 ≥ p_2, then p_1' = min{1/M, p_1 + p_2} and p_2' = max{p_2 - (1/M - p_1), 0}. We have H_B(p_1/(p_1 + p_2)) > H_B(p_1'/(p_1' + p_2')), i.e., this operation reduces the binary entropy of the two letters, and as a consequence (by the chain rule) reduces the entropy of the entire distribution. We can repeat this procedure until the distribution is uniform over some M letters, in which case the entropy is log(M). Thus log(M) lower bounds the entropy of the initial distribution.

If the distribution has an infinite alphabet, then first map it into a finite alphabet, by combining the letters in the tail of the distribution into a single letter, such that the upper bound 1/M is not exceeded. This operation can only reduce the entropy. Then continue as above. QED
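
To illustrate the sharpening step (an informal sketch, not part of the proof above), the Python code below applies the pairwise sharpening operation to an example distribution whose letter probabilities are all at most 1/M, and checks that the procedure ends in a distribution that is uniform over M letters; since each step can only lower the entropy, the entropy of the starting distribution is at least log2(M). The starting distribution and M are arbitrary.

    # Sketch of the sharpening argument behind Lemma 2.
    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def sharpen_to_uniform(p, M):
        p = np.sort(np.asarray(p, dtype=float))[::-1].copy()   # kept in descending order
        while True:
            small = np.flatnonzero((p > 0) & (p < 1.0 / M - 1e-12))
            if len(small) < 2:
                return p
            i, j = small[0], small[-1]         # biggest and smallest of the sub-1/M letters
            move = min(1.0 / M - p[i], p[j])   # "sharpen": fill p[i] up to 1/M out of p[j]
            p[i] += move
            p[j] -= move

    M = 4
    p = np.array([0.22, 0.20, 0.18, 0.15, 0.13, 0.07, 0.05])   # max <= 1/M = 0.25 (assumed example)
    q = sharpen_to_uniform(p, M)
    print(f"H(p) = {entropy(p):.4f} >= H(final) = {entropy(q):.4f} = log2({M}) = {np.log2(M):.4f}")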

Acknowledgment

My joint work with K. Rose formed the basis for this work. I thank I. Kontoyiannis for the discussion that motivated this result. I thank T. Berger for introducing me to the work of Pinkston on entropy coded random codes. I thank T. Linder for helpful comments, and U. Erez for the proof of Lemma 2.

References

[1] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.
[3] I. Kontoyiannis. Private communications.
[4] Y. Steinberg and M. Gutman. An algorithm for source coding subject to a fidelity criterion, based on string matching. IEEE Trans. Information Theory, IT-39:877-886, May 1993.
[5] E. H. Yang and J. Kieffer. On the performance of data compression algorithms based upon string matching. IEEE Trans. Information Theory, IT-44:47-65, Jan. 1998.
[6] R. Zamir and M. Feder. On lattice quantization noise. IEEE Trans. Information Theory, pages 1152-1159, July 1996.
[7] R. Zamir and K. Rose. Towards lossy Lempel-Ziv: Natural type selection. In Proc. of the Information Theory Workshop, Haifa, Israel, page 58, June 1996.
[8] R. Zamir and K. Rose. A string-matching interpretation for the Arimoto-Blahut algorithm. In Proc. of the Sixth Canadian Workshop on Information Theory, Kingston, Ontario, page 48, June 1999.
[9] R. Zamir and K. Rose. Natural type selection in adaptive lossy compression. IEEE Trans. Information Theory, revised Oct. 1999.
[10] Z. Zhang, E. H. Yang, and V. Wei. The redundancy of source coding with a fidelity criterion - Part one: Known statistics. IEEE Trans. Information Theory, IT-43:71-91, Jan. 1997.
