Empirical distribution of good channel codes with non-vanishing error probability
Yury Polyanskiy and Sergio Verdú
Abstract This paper studies several properties of channel codes that approach the fundamental limits of a given memoryless channel with a non-vanishing probability of error. The output distribution induced by an ǫ-capacity-achieving code is shown to be close in a strong sense to the capacity achieving output distribution (for DMC and AWGN). Relying on the concentration of measure (isoperimetry) property enjoyed by the latter, it is shown that regular (Lipschitz) functions of channel outputs can be precisely estimated and turn out to be essentially non-random and independent of the used code. It is also shown that the binary hypothesis testing between the output distribution of a good code and the capacity achieving one cannot be performed with exponential reliability. Using related methods it is shown that quadratic forms and sums of q-th powers when evaluated at codewords of good AWGN codes approach the values obtained from a randomly generated Gaussian codeword. The random process produced at the output of the channel is shown to satisfy the asymptotic equipartition property (for DMC and AWGN).
Index Terms — Shannon theory, discrete memoryless channels, additive white Gaussian noise, information measures, relative entropy, empirical output statistics, asymptotic equipartition property, concentration of measure, optimal transportation
Y. Polyanskiy is with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, 02139 USA, e-mail: [email protected]. S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, NJ, 08544 USA, e-mail: [email protected]. The work was supported in part by the National Science Foundation (NSF) under Grant CCF-1016625 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant CCF-0939370. Parts of this work were presented at the 49th and 50th Allerton Conferences on Communication, Control, and Computing, Allerton Retreat Center, IL, 2011-2012.
I. INTRODUCTION

A reliable channel code is a collection of waveforms of fixed duration distinguishable with small probability of error when observed through a noisy channel. Such a code is optimal (extremal) if it possesses the maximal cardinality among all codebooks of equal duration and probability of error. In this paper we characterize several properties of optimal and close-to-optimal channel codes indirectly, i.e. without identifying the best code explicitly. This characterization provides theoretical insight and ultimately may facilitate the search for new good codes. Shannon [1] was the first to recognize that to maximize information transfer across a memoryless channel the input (output) waveforms must be "noise-like", i.e. resemble a typical sample of a memoryless random process with marginal distribution PX∗ (PY∗ ) – a capacity-achieving input (output) distribution. The most general formal statement of this property of optimal codes was put forward in [2], which showed that a capacity-achieving sequence of codes with vanishing probability of error must satisfy [2, Theorem 2]
\[ \frac{1}{n}\, D(P_{Y^n} \,\|\, P^*_{Y^n}) \to 0 , \]    (1)
where PY n denotes the output distribution induced by the codebook (assuming equiprobable codewords) and
\[ P^*_{Y^n} = P^*_Y \times \cdots \times P^*_Y \]    (2)
is the n-th power of the single-letter capacity-achieving output distribution PY∗ and D(·||·) is the relative entropy. Furthermore, [2] shows that (under regularity conditions) the empirical frequency of input letters (or sequential k-letter blocks) inside the codebook approaches the capacity-achieving input distribution (or its k-th power) in the sense of vanishing relative entropy. Note that (1) establishes a strong sense in which PY n is close to the memoryless PY∗ n . In this paper we extend the result in [2] to the case of non-vanishing probability of error. Studying this regime, as opposed to vanishing probability of error, has recently proved fruitful for the non-asymptotic characterization of the maximal achievable rate [3]. Although for the memoryless channels considered in this paper the ǫ-capacity Cǫ is independent of the probability of error ǫ, it does not immediately follow that a Cǫ-achieving code necessarily satisfies the empirical distribution property (1). In fact, somewhat surprisingly, we will show that (1) only holds under the maximal probability of error criterion.
To illustrate the delicacy of the question of approximating PY n with PY∗ n , consider a good, capacity-achieving k-to-n code for the binary symmetric channel (BSC) with crossover probability δ and capacity C. First, consider the set of all codewords; its probability under PY n is (approximately) (1 − δ)^n while under PY∗ n it is 2^{k−n} – the two being exponentially very different (since for a reliable code k < n(1 + log₂(1 − δ))). On the other hand, consider a set E consisting of a union of small Hamming balls surrounding each codeword, whose radius ≈ δn is chosen such that PY n [E] = 1/2, say. Assuming that the code is decodable with small probability of error, the union will be almost disjoint and hence PY∗ n [E] ≈ 2^{k−nC} – the two becoming exponentially comparable (provided k ≈ nC). Thus, although PY n and PY∗ n differ dramatically in "small" details, on a "larger" scale they look the same. One of the goals of this paper is to make this intuition precise. We will show that the closer the code approaches the fundamental limits, the better PY n approximates PY∗ n .
Studying the output distribution PY n also becomes important in the context of secure communication, where the output due to the code is required to resemble white noise, and in the problem of asynchronous communication, where the output statistics of the code impose limits on the quality of synchronization [4]. For example, in a multi-terminal communication problem, the channel output of one user may create interference for another. Assessing the average impairment caused by such interference involves the analysis of the expectation of a certain function of the channel output, E[F(Y n)]. We show that under certain regularity assumptions on F not only can one approximate the expectation of F by substituting the unknown PY n with PY∗ n , as in
\[ \int F(y^n)\, dP_{Y^n} \approx \int F(y^n)\, dP^*_{Y^n} , \]    (3)
but one can also prove that in fact the distribution of F(Y n) will be tightly concentrated around its expectation. Thus, we are able to predict with overwhelming probability the random value of F(Y n) without any knowledge of the code used to produce Y n (but assuming the code is ǫ-capacity-achieving).
Besides (1) and (3) we will show that
1) the hypothesis testing problem between PY n and PY∗ n has zero Stein's exponent;
2) a convenient inequality holds for the conditional relative entropy of the channel output in terms of the cardinality of the employed code;
3) codewords of good codes for the additive white Gaussian noise (AWGN) channel become more and more isotropically distributed (in the sense of evaluating quadratic forms) and resemble white Gaussian noise (in the sense of ℓq norms) as the code approaches the fundamental limits;
4) the output process Y n enjoys an asymptotic equipartition property.
Throughout the paper we will observe repeated connections with concentration of measure (isoperimetry) and optimal transportation, which were introduced into information theory by the seminal works [5]–[7]. Although some key results are stated for general channels, most of the discussion is specialized to discrete memoryless channels (DMC) (possibly with a (separable) input cost constraint) and to the AWGN channel.
The organization of the paper is as follows. Section II contains the main definitions and notation. Section III proves a sharp upper bound on the relative entropy D(PY n ||PY∗ n ). In Section IV we discuss various implications of the bounds on relative entropy and in particular prove the approximation (3). Section V considers the hypothesis testing problem of discriminating between PY n and PY∗ n . The asymptotic equipartition property of the channel output process is established in Section VI. Section VII discusses results for the quadratic forms and ℓp norms of the codewords of good Gaussian codes. Finally, Section VIII outlines extensions to other channels.

II. DEFINITIONS AND NOTATION
A random transformation PY |X : X → Y is a Markov kernel acting between a pair of measurable spaces. An (M, ǫ)avg code for the random transformation PY |X is a pair of random transformations f : {1, . . . , M} → X and g : Y → {1, . . . , M} such that
\[ P[\hat W \neq W] \le \epsilon , \]    (4)
where in the underlying probability space X = f(W) and Ŵ = g(Y) with W equiprobable on {1, . . . , M}, and W, X, Y, Ŵ form a Markov chain:
\[ W \xrightarrow{\ f\ } X \xrightarrow{\ P_{Y|X}\ } Y \xrightarrow{\ g\ } \hat W . \]    (5)
In particular, we say that PX (resp., PY ) is the input (resp., output) distribution induced by the encoder f. An (M, ǫ)max code is defined similarly except that (4) is replaced with the more stringent maximal probability of error criterion:
\[ \max_{1 \le j \le M} P[\hat W \neq W \mid W = j] \le \epsilon . \]    (6)
A code is called deterministic, denoted (M, ǫ)det , if the encoder f is a functional (non-random) mapping. A channel is a sequence of random transformations, {PY n |X n , n = 1, . . .} indexed by the parameter n, referred to as the blocklength. An (M, ǫ) code for the n-th random transformation is called an (n, M, ǫ) code. The non-asymptotic fundamental limit of communication is defined as1 M ∗ (n, ǫ) = max{M : ∃(n, M, ǫ)-code} .
(7)
To the three types of channels considered below we also associate a special sequence of output distributions PY∗ n , defined as the n-th power of a certain single-letter PY∗ :2
\[ P^*_{Y^n} \triangleq (P^*_Y)^n , \]    (8)
where PY∗ is defined as follows:
1) A DMC (without feedback) is built from a single-letter transformation PY |X : X → Y acting between finite spaces by extending the latter to all n ≥ 1 in a memoryless way. Namely, the input space of the n-th random transformation PY n |X n is given by3
\[ \mathcal{X}_n = \mathcal{X}^n = \underbrace{\mathcal{X} \times \cdots \times \mathcal{X}}_{n\ \text{times}} \]    (9)
and similarly for the output space \(\mathcal{Y}^n = \mathcal{Y} \times \cdots \times \mathcal{Y}\), while the transition kernel is set to be
\[ P_{Y^n|X^n}(y^n|x^n) = \prod_{j=1}^{n} P_{Y|X}(y_j|x_j) . \]    (10)
The capacity C and PY∗ , the unique capacity-achieving output distribution (caod), are found by solving
\[ C = \max_{P_X} I(X;Y) . \]    (11)
2) A DMC with input constraint (c, P) is a generalization of the previous construction with an additional restriction on the input space Xn :
\[ \mathcal{X}_n = \Big\{ x^n \in \mathcal{X}^n : \sum_{j=1}^{n} c(x_j) \le nP \Big\} . \]    (12)
1 Additionally, one should also specify which probability of error criterion, (4) or (6), is used.
2 For general channels, the sequence {PY∗ n } is required to satisfy a quasi-caod property, see [8, Section IV].
3 To unify notation we denote the input space as Xn (instead of the more natural X n ) even in the absence of cost constraints.
In this case the capacity C and the caod PY∗ are found from a restricted maximization
\[ C = \max_{P_X} I(X;Y) , \]    (13)
where PX is any distribution such that
\[ E[c(X)] \le P . \]    (14)
3) The AWGN(P) channel has an input space4
\[ \mathcal{X}_n = \big\{ x \in \mathbb{R}^n : \|x\|_2 \le \sqrt{nP} \big\} , \]    (15)
the output space \(\mathcal{Y}^n = \mathbb{R}^n\) and the transition kernel
\[ P_{Y^n|X^n=x} = \mathcal{N}(x, I_n) , \]    (16)
where N(x, Σ) denotes a (multidimensional) normal distribution with mean x and covariance matrix Σ, and In is the n × n identity matrix. Then5
\[ C = \frac{1}{2}\log(1+P) , \]    (17)
\[ P^*_Y = \mathcal{N}(0, 1+P) . \]    (18)
As shown in [9], [10], in all three cases PY∗ is unique and PY∗ n satisfies the key property
\[ D(P_{Y^n|X^n=x} \,\|\, P^*_{Y^n}) \le nC \]    (19)
for all x ∈ Xn . Property (19) implies that for every input distribution PX n the induced output distribution PY n satisfies
\[ P_{Y^n} \ll P^*_{Y^n} , \]    (20)
\[ P_{Y^n|X^n=x^n} \ll P^*_{Y^n} , \qquad \forall x^n \in \mathcal{X}_n , \]    (21)
\[ D(P_{Y^n} \,\|\, P^*_{Y^n}) \le nC - I(X^n;Y^n) . \]    (22)
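To make (11) and the key property (19) concrete, here is a minimal Python sketch (not part of the paper; the channel matrix, tolerance, and iteration count are arbitrary illustrative choices). It computes C and the caod PY∗ of a small DMC by the standard Blahut–Arimoto iteration and then checks the single-letter form of (19), namely D(PY |X=x ||PY∗ ) ≤ C for every input letter x, with near-equality on the support of the capacity-achieving input distribution:

import numpy as np

def blahut_arimoto(W, tol=1e-12, max_iter=10000):
    # Capacity (in nats) and caod of a DMC with transition matrix W[x, y] > 0.
    nx, ny = W.shape
    p = np.full(nx, 1.0 / nx)                      # input distribution, start uniform
    for _ in range(max_iter):
        q = p @ W                                  # induced output distribution
        d = np.array([np.sum(W[x] * np.log(W[x] / q)) for x in range(nx)])
        p_new = p * np.exp(d)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    q = p @ W
    d = np.array([np.sum(W[x] * np.log(W[x] / q)) for x in range(nx)])
    return float(p @ d), p, q

# Arbitrary 3-input / 3-output channel with strictly positive entries.
W = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.25, 0.70]])
C, p_star, q_star = blahut_arimoto(W)
print("C (nats):", C, "  caod:", q_star)
for x in range(W.shape[0]):
    d_x = np.sum(W[x] * np.log(W[x] / q_star))
    print(f"x={x}: D(P_Y|X=x || P_Y*) = {d_x:.6f}  (<= C = {C:.6f})")

The per-letter divergences printed by the last loop are exactly the quantities that tensorize into the n-letter bound (19).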
As a consequence of (21) the information density is well defined:
\[ \imath^*_{X^n;Y^n}(x^n; y^n) \triangleq \log \frac{dP_{Y^n|X^n=x^n}}{dP^*_{Y^n}}(y^n) . \]    (23)
4 For convenience we denote the elements of Rn as x, y (for non-random vectors) and X n , Y n (for the random vectors).
5 As usual, all logarithms log and exponents exp are taken to an arbitrary fixed base, which also specifies the information units.
Moreover, for every channel considered here there is a constant a1 > 0 such that6
\[ \sup_{x^n \in \mathcal{X}_n} \mathrm{Var}\big[ \imath^*_{X^n;Y^n}(X^n;Y^n) \,\big|\, X^n = x^n \big] \le n a_1 . \]    (24)
In all three cases, the ǫ-capacity Cǫ equals C for all 0 < ǫ < 1, i.e.
\[ \log M^*(n,\epsilon) = nC + o(n) , \qquad n \to \infty . \]    (25)
In fact, see [3]:7 log M ∗ (n, ǫ) = nC − for any 0 < ǫ
ρj |W = j] ≥ 1 − ǫ . exp{ρj }β1−ǫ (PY |W =j , PY ) + 1 − M
(43)
Multiplying by exp{−ρj } and using resulting bound in place of (41) we repeat steps (37)-(40) to obtain 1 M
inf P [ıW ;Y (W ; Y ) ≤ ρj |W = j] ≥
1≤j≤M
inf P [ıW ;Y (W ; Y ) ≤ ρj |W = j] − ǫ exp{−¯ ρ} ,
1≤j≤M
(44)
which in turn is equivalent to (42). Choosing ρ(x) = D(PY |X=x ||QY ) + ∆ we can specialize Theorem 1 in the following convenient form:
Theorem 2: Consider a random transformation PY |X , a distribution PX induced by an (M, ǫ)max,det code and an auxiliary output distribution QY . Assume that for all x ∈ X we have
\[ d(x) \triangleq D(P_{Y|X=x} \,\|\, Q_Y) < \infty \]    (45)
and
\[ \sup_x P_{Y|X=x}\Big[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \ge d(x) + \Delta \Big] \le \delta' , \]    (46)
for some pair of constants ∆ ≥ 0 and 0 ≤ δ′ < 1 − ǫ. Then, we have
\[ D(P_{Y|X} \,\|\, Q_Y \mid P_X) \ge \log M - \Delta + \log(1 - \epsilon - \delta') . \]    (47)
Remark 3: Note that (46) holding with a small δ′ is a natural non-asymptotic embodiment of information stability of the underlying channel, cf. [8, Section IV]. One way to estimate the upper deviations in (46) is by using Chebyshev's inequality. As an example, we obtain
Corollary 3: If in the conditions of Theorem 2 we replace (46) with
\[ \sup_x \mathrm{Var}\Big[ \log \frac{dP_{Y|X=x}}{dQ_Y}(Y) \,\Big|\, X = x \Big] \le S_m \]    (48)
for some constant Sm ≥ 0, then we have
\[ D(P_{Y|X} \,\|\, Q_Y \mid P_X) \ge \log M - \sqrt{\frac{2 S_m}{1-\epsilon}} + \log \frac{1-\epsilon}{2} . \]    (49)
Notice that when QY is chosen to be a product distribution, such as PY∗ n , log(dPY |X=x /dQY ) becomes a sum of independent random variables. In particular, (24) leads to an equivalent restatement of (1):
Theorem 4: Consider a memoryless channel belonging to one of the three classes in Section II. Then for any 0 < ǫ < 1 and any sequence of (n, Mn , ǫ)max,det capacity-achieving codes we have
\[ \frac{1}{n} I(X^n;Y^n) \to C \iff \frac{1}{n} D(P_{Y^n} \,\|\, P^*_{Y^n}) \to 0 , \]    (50)
where X n is the output of the encoder. Proof: The direction ⇒ is trivial from property (22) of PY∗ n . For the direction ⇐ we only
need to lower-bound I(X n ; Y n ) since I(X n ; Y n ) ≤ nC. To that end, we have from (24) and Corollary 3:
\[ D(P_{Y^n|X^n} \,\|\, P^*_{Y^n} \mid P_{X^n}) \ge \log M_n + O(\sqrt{n}) . \]    (51)
Then the conclusion follows from (28) and the following identity applied with QY n = PY∗ n :
\[ I(X^n;Y^n) = D(P_{Y^n|X^n} \,\|\, Q_{Y^n} \mid P_{X^n}) - D(P_{Y^n} \,\|\, Q_{Y^n}) , \]    (52)
which holds for all QY n such that the unconditional relative entropy is finite. We remark that Theorem 4 can also be derived from a simple extension of the Wolfowitz converse [11] to an arbitrary output distribution QY n , e.g. [14, Theorem 10], and then choosing QY n = PY∗ n . Note that Theorem 4 implies the result in [2], since capacity-achieving codes with vanishing error probability are a subclass of those considered in Theorem 4.

B. DMC with C1 < ∞

For a given DMC denote the parameter introduced by Burnashev [16]
\[ C_1 = \max_{a,a'} D(P_{Y|X=a} \,\|\, P_{Y|X=a'}) . \]    (53)
Note that C1 < ∞ if and only if the transition matrix does not contain any zeros. In this section we show (34) for a (regular) class of DMCs with C1 < ∞ by an application of the main inequality (47). We also demonstrate that (1) may not hold for codes with non-deterministic encoders or unconstrained maximal probability of error.
Theorem 5: Consider a DMC PY |X with capacity C > 0 (with or without an input constraint). Then for any 0 < ǫ < 1 there exists a constant a = a(ǫ) > 0 such that any (n, Mn , ǫ)max,det code satisfies
\[ D(P_{Y^n} \,\|\, P^*_{Y^n}) \le nC - \log M_n + a\sqrt{n} , \]
(54)
where PY n is the output distribution induced by the code. In particular, for any capacity-achieving sequence of such codes we have
\[ \frac{1}{n} D(P_{Y^n} \,\|\, P^*_{Y^n}) \to 0 . \]
(55)
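The quantity bounded in (54) can be computed exactly in toy cases. The Python sketch below is an illustration only (the blocklength, crossover probability, and codebook are arbitrary, and the code is not claimed to be good): it enumerates all 2^n outputs of a BSC(δ), forms the output distribution PY n induced by equiprobable codewords, and evaluates D(PY n ||PY∗ n ) next to the benchmark nC − log M appearing in (54):

import itertools
import numpy as np

delta, n = 0.11, 10
codebook = np.array([[0]*10,
                     [1]*10,
                     [0, 1]*5,
                     [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]])   # arbitrary small codebook
M = len(codebook)
C = np.log(2) + delta*np.log(delta) + (1-delta)*np.log(1-delta)   # BSC capacity, nats

outputs = np.array(list(itertools.product([0, 1], repeat=n)))
def cond_prob(c):
    d = np.sum(outputs != c, axis=1)                     # Hamming distance to codeword c
    return delta**d * (1 - delta)**(n - d)

P_Yn = sum(cond_prob(c) for c in codebook) / M           # code-induced output distribution
P_star = np.full(2**n, 2.0**(-n))                        # caod of the BSC: uniform on {0,1}^n
D = np.sum(P_Yn * np.log(P_Yn / P_star))                 # relative entropy, nats
print("D(P_Yn || P*_Yn) =", D, "   nC - log M =", n*C - np.log(M))

For such a small, far-from-optimal code the a√n slack in (54) matters; the point of the sketch is only to make the objects PY n , PY∗ n and D(PY n ||PY∗ n ) tangible.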
Remark 4: If PY∗ is equiprobable on Y (such as for some symmetric channels), (55) is equivalent to
\[ H(Y^n) = n \log |\mathcal{Y}| + o(n) . \]    (56)
In any case, as we will see in Section IV-A, (55) always implies
\[ H(Y^n) = n H(Y^*) + o(n) \]    (57)
(by (133) applied to f(y) = log PY∗ (y)). Note also that traditional combinatorial methods, e.g. [17], are not helpful in dealing with quantities like H(Y n ), D(PY n ||PY∗ n ) or PY n -expectations of functions which are not of the form of a cumulative average.
Proof: Fix y^n , ȳ^n ∈ Y^n which differ in the j-th letter only. Then, denoting y_{\j} = {y_k , k ≠ j}, we have
\[ |\log P_{Y^n}(y^n) - \log P_{Y^n}(\bar y^n)| = \Big| \log \frac{P_{Y_j|Y_{\backslash j}}(y_j|y_{\backslash j})}{P_{Y_j|Y_{\backslash j}}(\bar y_j|y_{\backslash j})} \Big| \le \max_{a,b,b'} \log \frac{P_{Y|X}(b|a)}{P_{Y|X}(b'|a)} \triangleq a_1 < \infty , \]    (58)–(60)
where (59) follows from
\[ P_{Y_j|Y_{\backslash j}}(b|y_{\backslash j}) = \sum_{a \in \mathcal{X}} P_{Y|X}(b|a)\, P_{X_j|Y_{\backslash j}}(a|y_{\backslash j}) . \]    (61)
Thus, the function y^n ↦ log PY n (y^n ) is a1-Lipschitz in the Hamming metric on Y^n . Its discrete gradient (see the definition of D(f) in [18, Section 4]) is bounded by n|a1 |^2 and thus by the discrete Poincaré inequality [18, Theorem 4.1f] we have
\[ \mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n] \le n |a_1|^2 . \]    (62)
Therefore, for some 0 < a2 < ∞ and all x^n ∈ Xn we have
\[ \mathrm{Var}[\imath_{X^n;Y^n}(X^n;Y^n) \mid X^n = x^n] = \mathrm{Var}\Big[ \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \,\Big|\, X^n = x^n \Big] \]    (63)–(64)
\[ \le 2\,\mathrm{Var}[\log P_{Y^n|X^n}(Y^n|X^n) \mid X^n = x^n] + 2\,\mathrm{Var}[\log P_{Y^n}(Y^n) \mid X^n = x^n] \le 2 n a_2 + 2 n |a_1|^2 , \]    (65)
where (65) follows from (62) and the fact that the random variable in the first variance in (64) is a sum of n independent terms. Applying Corollary 3 with Sm = 2na2 + 2n|a1 |2 and QY = PY n we obtain:
√ D(PY n |X n ||PY n |PX n ) ≥ log Mn + O( n) .
(66)
We can now complete the proof: D(PY n ||PY∗ n ) = D(PY n |X n ||PY∗ n |PX n ) − D(PY n |X n ||PY n |PX n )
(67)
≤ nC − D(PY n |X n ||PY n |PX n ) √ ≤ nC − log Mn + O( n)
(68) (69)
where (68) is because PY∗ n satisfies (19) and (69) follows from (66). This completes the proof of (54).
Remark 5: (55) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be random. Indeed, for any 0 < ǫ < 1 we construct a sequence of (n, Mn , ǫ)avg capacity-achieving codes which do not satisfy (55). Consider a sequence of (n, M′n , ǫ′n )max,det codes with ǫ′n → 0 and
\[ \frac{1}{n}\log M'_n \to C . \]    (70)
For all n such that ǫ′n
0
(71)
(the existence of such x0 relies on the assumption C > 0). Denote the output distribution induced by this code by PY′ n . Next, extend this code by adding
ǫ−ǫ′n Mn′ 1−ǫ
identical codewords: (x0 , . . . , x0 ) ∈ Xn . Then the
minimal average probability of error achievable with the extended codebook of size △
Mn =
1 − ǫn ′ M 1−ǫ n
(72)
is easily seen to be not larger than ǫ. Denote the output distribution induced by the extended code by PY n and define a binary random variable S = 1{X n = (x0 , . . . , x0 )} with distribution PS (1) = 1 − PS (0) =
ǫ − ǫ′n . 1 − ǫ′n
(73)
(74)
We have then D(PY n ||PY∗ n ) = D(PY n |S ||PY∗ n |PS ) − I(S; Y n )
(75)
≥ D(PY n |S ||PY∗ n |PS ) − log 2
(76)
= nD(PY |X=x0 ||PY∗ )PS (1) + D(PY′ n ||PY∗ n )PS (0) − log 2
(77)
= nD(PY |X=x0 ||PY∗ )PS (1) + o(n) ,
(78)
where (75) is by (52), (76) follows since S is binary, (77) is by noticing that PY n |S=0 = PY′ n , and (78) is by (55). It is clear that (71) and (78) show the impossibility of (55) for this code. Similarly, one shows that (55) cannot hold if the assumption of the deterministic encoder is dropped. Indeed, then we can again take the very same (n, Mn′ , ǫ′n ) code and make its encoder randomized so that with probability
ǫ−ǫ′n 1−ǫ′n
it outputs (x0 , . . . , x0 ) ∈ Xn and otherwise it outputs
the original codeword. The same analysis shows that (78) holds again and thus (55) fails.
Note that the counterexamples constructed above also demonstrate that in Theorem 2 (and hence Theorem 1) the assumptions of maximal probability of error and deterministic encoders are not superfluous, contrary to what is claimed by Ahlswede [13, Remark 1]. C. DMC with C1 = ∞ 3
Next, we show the estimate for D(PY n ||PY∗ n ) differing by a log 2 n factor from (34) for the DMCs with C1 = ∞. Theorem 6: For any DMC PY |X with capacity C > 0 (with or without input constraints) and 0 < ǫ < 1 there exists a constant b > 0 with the property that for any sequence of (n, Mn , ǫ)max,det codes we have for all n ≥ 1
√ 3 D(PY n ||PY∗ n ) ≤ nC − log Mn + b n log 2 n .
(79)
In particular, for any such sequence achieving capacity we have 1 D(PY n ||PY∗ n ) → 0 . n
(80)
Remark 6: Claim (80) is not true if either the maximal probability of error is replaced with the average, or if we allow the encoder to be stochastic. Counterexamples are constructed exactly as in Remark 5.
Proof: Let ci and Di , i = 1, . . . , Mn denote the codewords and the decoding regions of the code. Denote the sequence
\[ \ell_n = b_1 \sqrt{n \log n} \]    (81)
with b1 > 0 to be further constrained shortly. According to the isoperimetric inequality for Hamming space [17, Corollary I.5.3], there is a constant a > 0 such that for every i = 1, . . . , Mn
\[ 1 - P_{Y^n|X^n=c_i}[\Gamma^{\ell_n} D_i] \le Q\Big( Q^{-1}(\epsilon) + \frac{\ell_n}{a\sqrt{n}} \Big) \le \exp\Big\{ -b_2 \frac{\ell_n^2}{b_1^2 n} \Big\} = n^{-b_2} \le \frac{1}{n} , \]    (82)–(85)
where
\[ \Gamma^{\ell} D = \{ b^n \in \mathcal{Y}^n : \exists\, y^n \in D \ \text{s.t.}\ |\{j : y_j \neq b_j\}| \le \ell \} \]    (86)
denotes the ℓ-th Hamming neighborhood of a set D and we assumed that b1 was chosen large enough so there is b2 ≥ 1 satisfying (85). Let Mn′ =
Mn |Y|ℓn n
(87)
n ℓn
and consider a subcode F = (F1 , . . . , FMn′ ), i.e. an ordered list of Mn′ indices from {1, . . . , Mn }. Then for every possible choice of F we denote by PX n (F ) and PY n (F ) the input/output distribution induced by a given subcode, so that for example: ′
PY n (F )
Mn 1 X PY n |X n =Fj . = ′ Mn j=1
(88) Mn′
We aim to apply the random coding argument over all equally likely Mn
choices of a subcode
F . Random coding among subcodes was originally invoked in [6] to demonstrate the existence of a good subcode. One easily verifies that the expected induced output distribution is △
E[PYn (F )] =
1 Mn′
Mn
= PY n .
Mn X
F1 =1
···
Mn X
PY n (F )
(89)
FM ′ =1 n
(90)
Next, for every F we denote by ǫ′ (F ) the minimal possible average probability of error achieved by an appropriately chosen decoder. With this notation we have, for every possible value of F : D(PY n (F ) ||PY∗ n ) = D(PY n |X n ||PY∗ n |PX n (F ) ) − I(X n (F ); Y n (F ))
(91)
≤ nC − I(X n (F ); Y n (F ))
(92)
≤ nC − (1 − ǫ′ (F )) log Mn′ + log 2
(93)
≤ nC − log Mn′ + nǫ′ (F ) log |X | + log 2 √ 3 ≤ nC − log Mn + nǫ′ (F ) log |X | + b3 n log 2 n
(94) (95)
where (91) is by (52), (92) is by (19), (93) is by Fano’s inequality, (94) is because log Mn′ ≤
n log |X| and (95) holds for some b3 > 0 by the choice of M′n in (87) and by
\[ \log \binom{n}{\ell_n} \le \ell_n \log n . \]    (96)
Taking the expectation of both sides of (95), applying convexity of relative entropy and (90) we get D(PY n ||PY∗ n )
√ 3 ≤ nC − log Mn + n E[ǫ′ (F )] log |X | + b3 n log 2 n .
(97)
Accordingly, it remains to show that n E[ǫ′ (F )] ≤ 2 .
(98)
To that end, for every subcode F define the suboptimal randomized decoder: ˆ (y) = Fj W
∀Fj ∈ L(y, F ) (with probability
1 ), |L(y,F )|
(99)
where L(y, F ) is a list of those indices i ∈ F for which y ∈ Γℓn Di . Let FW be equiprobable on F , then averaged over the selection of F we have n
n
E[L(Y , F ) | FW ∈ L(Y , F )] ≤ 1 + because each y ∈ Y n can belong to at most
n ℓn
n ℓn
|Y|ℓn
Mn
(Mn′ − 1) ,
(100)
|Y|ℓn enlarged decoding regions Γℓn Di and
since each Fj is chosen independently and equiprobably among all possible Mn alternatives. The average probability of error for the given decoder when also averaged over F can be upperbounded as E[ǫ′ (F )] ≤ P[FW 6∈ L(Y n , F )] |L(Y n , F )| − 1 n +E 1{FW ∈ L(Y , F )} |L(Y n , F )| ℓ ′ n |Y| n Mn ≤ P[FW 6∈ L(Y n , F )] + ℓn Mn ℓ n ′ n |Y| Mn 1 ≤ + ℓn n Mn 2 , ≤ n where (102) is by Jensen’s inequality applied to
x−1 x
(101) (102) (103) (104)
and (100), (103) is by (85), and (104) is
by (87). Since (104) also serves as an upper bound to E[ǫ′ (F )] the proof of (98) is complete.
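The role of the blow-up (81)–(85) can be seen in the simplest possible configuration, where the decoding region is itself a Hamming ball around the transmitted codeword (a hedged illustration, not the general argument; the parameters below are arbitrary). Under BSC(δ) noise the number of flipped positions is Binomial(n, δ), so a ball of conditional probability roughly 1 − ǫ, once enlarged by ℓn ≈ √(n log n) positions, is missed with probability well below 1/n:

import numpy as np
from scipy.stats import binom

delta, eps = 0.11, 0.1
for n in [100, 1000, 10000]:
    r = int(binom.ppf(1 - eps, n, delta))            # ball radius with P[ball] >= 1 - eps
    ell = int(np.ceil(np.sqrt(n * np.log(n))))       # blow-up, cf. (81) with b1 = 1
    miss_ball = 1 - binom.cdf(r, n, delta)           # P[output outside the ball]
    miss_blowup = 1 - binom.cdf(r + ell, n, delta)   # P[output outside the blown-up ball]
    print(f"n={n:6d}  miss(ball)={miss_ball:.2e}  miss(blow-up)={miss_blowup:.2e}  1/n={1/n:.1e}")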
D. Gaussian channel Theorem 7: For any 0 < ǫ < 1 and P > 0 there exists a = a(ǫ, P ) > 0 such that the output distribution PY n of any (n, Mn , ǫ)max,det code for the AW GN(P ) channel satisfies √ D(PY n ||PY∗ n ) ≤ nC − log M + a n ,
(105)
where PY∗ n = N (0, 1 + P )n . In particular for any capacity-achieving sequence of such codes we have 1 D(PY n ||PY∗ n ) → 0 . n
(106)
Remark 7: (106) need not hold if the maximal probability of error is replaced with the average or if the encoder is allowed to be stochastic. Counterexamples are constructed similarly to those for Remark 5 with x0 = 0. Note also that Theorem 7 need not hold if the power-constraint is in the average-over-the-codebook sense; see [14, Section 4.3.3]. Proof: Denote by lower-case pY n |X n =x and pY n densities of PY n |X n =x and PY n . Then an elementary computation shows ∇ log pY n (y) =
log e ∇pY n (y) pY n (y)
(107)
M
log e X 1 − 21 ||y−cj ||2 = n ∇e pY n (y) j=1 M(2π) 2
(108)
M
1 log e X − 21 ||y−cj ||2 = n (cj − y)e pY n (y) j=1 M(2π) 2 = (E[X n |Y n = y] − y) log e .
(109) (110)
For convenience denote
and notice that since kX n k ≤
√
ˆ n = E[X n |Y n ] X
(111)
nP we have also
√
ˆ n
X ≤ nP .
(112)
Then 1 E[k∇ log pY n (Y n )k2 | X n ] log2 e
2
n
n n ˆ X = E Y − X n 2 n
ˆ n 2 n ≤ 2 E kY k X + 2 E X X ≤ 2 E kY n k2 X n + 2nP = 2 E kX n + Z n k2 X n + 2nP
(113) (114) (115) (116)
≤ 4kX n k2 + 4n + 2nP
(117)
≤ (6P + 4)n ,
(118)
where (114) is by ka + bk2 ≤ 2kak2 + 2kbk2 ,
(119)
(115) is by (112), in (116) we introduced Z n ∼ N (0, In) which is independent of X n , (117) is by (119) and (118) is by the power-constraint imposed on the codebook. Conditioned on X n , the random vector Y n is Gaussian. Thus, from Poincar´e’s inequality for the Gaussian measure, e.g. [19, (2.16)], we have Var[log pY n (Y n ) | X n] ≤ E[k∇ log pY n k2 | X n ]
(120)
and together with (118) this yields the required estimate Var[log pY n (Y n ) | X n] ≤ a1 n
(121)
for some a1 > 0. The argument then proceeds step by step as in the proof of Theorem 5 with (121) taking the place of (62) and recalling that property (19) holds for the AWGN channel too.
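The identity (110), which expresses the score of the output density through the MMSE estimate, is easy to check numerically. The following hedged Python sketch (an arbitrary toy codebook; natural logarithms, so the factor log e in (110) equals 1) compares a finite-difference gradient of log pY n with E[X n |Y n = y] − y:

import numpy as np

rng = np.random.default_rng(0)
n, M = 3, 5
codebook = rng.normal(size=(M, n))
codebook *= np.sqrt(n) / np.linalg.norm(codebook, axis=1, keepdims=True)  # ||c_j||^2 = n

def log_p(y):
    # log of p_{Y^n}(y) = (1/M) sum_j N(y; c_j, I_n)
    expo = -0.5 * np.sum((y - codebook) ** 2, axis=1)
    return np.log(np.mean(np.exp(expo))) - 0.5 * n * np.log(2 * np.pi)

def mmse(y):
    # E[X^n | Y^n = y] for equiprobable codewords
    w = np.exp(-0.5 * np.sum((y - codebook) ** 2, axis=1))
    return (w / w.sum()) @ codebook

y = rng.normal(size=n)
h = 1e-6
grad_fd = np.array([(log_p(y + h*e) - log_p(y - h*e)) / (2*h) for e in np.eye(n)])
print("finite differences :", grad_fd)
print("E[X|Y=y] - y       :", mmse(y) - y)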
IV. DISCUSSION

We have shown that there is a constant a = a(ǫ) independent of n and M such that
\[ D(P_{Y^n} \,\|\, P^*_{Y^n}) \le nC - \log M + a\sqrt{n} , \]    (122)
where PY n is the output distribution induced by an arbitrary (n, M, ǫ)max,det code. Therefore, any (n, M, ǫ)max,det code necessarily satisfies
\[ \log M \le nC + a(\epsilon)\sqrt{n} , \]    (123)
as is classically known [20]. In particular, (122) implies that any ǫ-capacity-achieving code must satisfy (1). In this section we discuss this and other implications of this result, such as:
1) (122) implies that the empirical marginal output distribution
\[ \bar P_n \triangleq \frac{1}{n}\sum_{j=1}^{n} P_{Y_j} \]    (124)
converges to PY∗ in a strong sense (Section IV-A);
2) (122) guarantees estimates of the precision in the approximation (3) (Sections IV-B and IV-E);
3) (122) provides estimates for the deviations of f(Y n) from its average (Section IV-C);
4) relation to optimal transportation (Section IV-D);
5) implications of (1) for the empirical input distribution of the code (Sections IV-G and IV-H).

A. Empirical distributions and empirical averages

Considering the empirical marginal distributions, the convexity of relative entropy and (1) result in
\[ D(\bar P_n \,\|\, P^*_Y) \le \frac{1}{n} D(P_{Y^n} \,\|\, P^*_{Y^n}) \to 0 , \]    (125)
where P̄n is the empirical marginal output distribution (124). More generally, we have [2, (41)]
\[ D(\bar P_n^{(k)} \,\|\, P^{*k}_Y) \le \frac{k}{n-k+1} D(P_{Y^n} \,\|\, P^*_{Y^n}) \to 0 , \]    (126)
where P̄n^{(k)} is the k-th order empirical output distribution
\[ \bar P_n^{(k)} = \frac{1}{n-k+1} \sum_{j=1}^{n-k+1} P_{Y_j^{j+k-1}} . \]    (127)
Knowing that a sequence of distributions Pn converges in relative entropy to a distribution P, i.e.
\[ D(P_n \,\|\, P) \to 0 , \]    (128)
implies convergence properties for the expectations of functions:
1) The following inequality (often called Pinsker's) is established in [21]:
\[ \|P_n - P\|_{TV}^2 \le \frac{1}{2\log e} D(P_n \,\|\, P) , \]    (129)
where
\[ \|P - Q\|_{TV} \triangleq \sup_A |P(A) - Q(A)| . \]    (130)
From (129) we have that (128) implies
\[ \int f\, dP_n \to \int f\, dP \]    (131)
for all bounded functions f.
2) In fact, (131) also holds for unbounded f as long as it satisfies Cramér's condition under P, i.e.
\[ \int e^{tf}\, dP < \infty \]    (132)
for all t in some neighborhood of 0; see [22, Lemma 3.1].
Together (131) and (125) show that for a wide class of functions f : Y → R, empirical averages over distributions induced by good codes converge to the average over the capacity-achieving output distribution (caod):
\[ E\Big[ \frac{1}{n}\sum_{j=1}^{n} f(Y_j) \Big] \to \int f\, dP^*_Y . \]    (133)
From (126) a similar conclusion holds for k-th order empirical averages.

B. Averages of functions of Y n

To go beyond empirical averages, we need to provide some definitions.
Definition 1: The function F : Y n → R is called (b, c)-concentrated with respect to a measure µ on Y n if for all t ∈ R
\[ \int \exp\{t(F(Y^n) - \bar F)\}\, d\mu \le b \exp\{c t^2\} , \qquad \bar F = \int F\, d\mu . \]    (134)
A function F is called (b, c)-concentrated for the channel if it is (b, c)-concentrated for every PY n |X n =x and PY∗ n . A couple of simple properties of (b, c)-concentrated functions:
1) Gaussian concentration around the mean:
\[ P[|F(Y^n) - E[F(Y^n)]| > t] \le b \exp\Big\{ -\frac{t^2}{4c} \Big\} . \]    (135)
2) Small variance:
\[ \mathrm{Var}[F(Y^n)] = \int_0^\infty P[|F(Y^n) - E[F(Y^n)]|^2 > t]\, dt \]    (136)
\[ \le \int_0^\infty \min\Big\{ b\exp\Big\{-\frac{t}{4c}\Big\},\, 1 \Big\}\, dt \]    (137)
\[ = 4c\log(2be) . \]    (138)
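A quick Monte Carlo illustration of (135) and (138) (a hedged sketch, not from the paper; the function and parameters are arbitrary): F(y) = ||y||₂ is 1-Lipschitz, hence under µ = N(0,1)^n it is (1, c)-concentrated with a c that does not grow with n, so its variance stays O(1) and its tails stay below the classical Gaussian bound 2 exp{−t²/2} uniformly in the dimension:

import numpy as np

rng = np.random.default_rng(1)
N, t = 10000, 2.0
for n in [10, 100, 1000]:
    Y = rng.normal(size=(N, n))
    F = np.linalg.norm(Y, axis=1)          # 1-Lipschitz function of Y ~ N(0, I_n)
    dev = F - F.mean()
    print(f"n={n:5d}  Var[F]={dev.var():.3f}  "
          f"P[|F-EF|>{t}]={np.mean(np.abs(dev) > t):.3e}  "
          f"bound 2*exp(-t^2/2)={2*np.exp(-t*t/2):.3e}")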
Examples of concentrated functions (see [19] for a survey):
• A bounded function F with ||F||∞ ≤ A is (exp{A²(4c)⁻¹}, c)-concentrated for any c and any measure µ. Moreover, for a fixed µ and a sufficiently large c any bounded function is (1, c)-concentrated.
• If F is (b, c)-concentrated then λF is (b, λ²c)-concentrated.
• Let f : Y → R be (1, c)-concentrated with respect to µ. Then so is
\[ F(y^n) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} f(y_j) \]    (139)
with respect to µ^n. In particular, any F defined in this way from a bounded f is (1, c)-concentrated for a memoryless channel (for a sufficiently large c independent of n).
• If µ = N(0,1)^n and F is a Lipschitz function on R^n with Lipschitz constant ||F||Lip then F is (1, ||F||²Lip /(2 log e))-concentrated with respect to µ, e.g. [23, Proposition 2.1]:
\[ \int_{\mathbb{R}^n} \exp\{t(F(y^n) - \bar F)\}\, d\mu(y^n) \le \exp\Big\{ \frac{\|F\|_{Lip}^2}{2\log e}\, t^2 \Big\} . \]    (140)
Therefore any Lipschitz function is (1, (1+P)||F||²Lip /(2 log e))-concentrated for the AWGN channel.
• For discrete Y^n we endow it with the Hamming distance
\[ d(y^n, z^n) = |\{i : y_i \neq z_i\}| \]    (141)
and define Lipschitz functions in the usual way. In this case, a simpler criterion is: F : Y n → R is Lipschitz with constant ℓ if and only if
\[ \max_{y^n, b, j} |F(y_1, \ldots, y_j, \ldots, y_n) - F(y_1, \ldots, b, \ldots, y_n)| \le \ell . \]    (142)
Let µ be any product probability measure P1 × · · · × Pn on Y^n; then the standard Azuma–Hoeffding estimate shows that
\[ \sum_{y^n \in \mathcal{Y}^n} \exp\{t(F(y^n) - \bar F)\}\, \mu(y^n) \le \exp\Big\{ \frac{n\|F\|_{Lip}^2}{2\log e}\, t^2 \Big\} \]    (143)
and thus any Lipschitz function F is (1, n||F||²Lip /(2 log e))-concentrated with respect to any product measure on Y^n. Note that unlike the Gaussian case, the constant of concentration c worsens linearly with the dimension n. Generally, this growth cannot be avoided, as shown by the coefficient 1/√n in the exact solution of the Hamming isoperimetric problem [24]. At the same time, this growth does not mean (143) is "weaker" than (140); for example, F = Σ_{j=1}^n φ(y_j) has Lipschitz constant O(√n) in Euclidean space and O(1) in Hamming. Surprisingly, however, for convex functions the concentration (140) holds for product measures even under the Euclidean distance [25]. We now show how to approximate expectations of concentrated functions:
Proposition 8: Suppose that F : Y n → R is (b, c)-concentrated with respect to PY∗ n . Then
\[ | E[F(Y^n)] - E[F(Y^{*n})] | \le 2\sqrt{ c\, D(P_{Y^n} \,\|\, P^*_{Y^n}) + c\log b } , \]    (144)
where
\[ E[F(Y^{*n})] = \int F(y^n)\, dP^*_{Y^n} . \]    (145)
Proof: Recall the Donsker–Varadhan inequality [26, Lemma 2.1]: For any probability measures P and Q with D(P||Q) < ∞ and a measurable function g such that ∫ exp{g} dQ < ∞, the integral ∫ g dP exists (but is perhaps −∞) and moreover
\[ \int g\, dP - \log \int \exp\{g\}\, dQ \le D(P \,\|\, Q) . \]    (146)
Since by (134) the moment generating function of F exists under PY∗ n , applying (146) to tF we get
\[ t\, E[F(Y^n)] - \log E[\exp\{t F(Y^{*n})\}] \le D(P_{Y^n} \,\|\, P^*_{Y^n}) . \]    (147)
From (134) we have
\[ c t^2 - t\, E[F(Y^n)] + t\, E[F(Y^{*n})] + D(P_{Y^n} \,\|\, P^*_{Y^n}) + \log b \ge 0 \]    (148)
for all t. Thus the discriminant of the parabola in (148) must be non-positive, which is precisely (144).
Note that for empirical averages F(y^n) = (1/n) Σ_{j=1}^n f(y_j) we may either apply the estimate for concentration in the example (139) and then use Proposition 8, or directly apply Proposition 8 to (125) – the result is the same:
\[ \Big| \frac{1}{n}\sum_{j=1}^{n} E[f(Y_j)] - E[f(Y^*)] \Big| \le 2\sqrt{ \frac{c}{n}\, D(P_{Y^n} \,\|\, P^*_{Y^n}) } \to 0 , \]    (149)
for any f which is (1, c)-concentrated with respect to PY∗ .
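A toy numerical check of the mechanism behind Proposition 8 and (149) (a hedged sketch with arbitrarily chosen distributions, working in nats): let P = N(µ,1) stand in for PY n and P∗ = N(0,1) for PY∗ n . Any 1-Lipschitz f is (1, 1/2)-concentrated under P∗, so (144) reads |E_P f − E_{P∗} f| ≤ √(2 D(P||P∗)) with D(P||P∗) = µ²/2; for f(x) = x the bound is attained exactly, up to Monte Carlo error:

import numpy as np

rng = np.random.default_rng(2)
N, mu = 1_000_000, 0.7
P = rng.normal(mu, 1.0, N)          # samples from P  (role of P_{Y^n})
Q = rng.normal(0.0, 1.0, N)         # samples from P* (role of P*_{Y^n})
D = 0.5 * mu**2                     # D(N(mu,1) || N(0,1)), nats
bound = np.sqrt(2 * D)

for name, f in [("x", lambda x: x),
                ("|x|", np.abs),
                ("min(x,1)", lambda x: np.minimum(x, 1.0))]:
    gap = abs(f(P).mean() - f(Q).mean())
    print(f"f(x)={name:10s} |E_P f - E_P* f| = {gap:.4f}   bound sqrt(2D) = {bound:.4f}")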
For the Gaussian channel, Proposition 8 and (140) yield:
Corollary 9: For any 0 < ǫ < 1 there exist two constants a1 , a2 > 0 such that for any (n, M, ǫ)max,det code for the AWGN(P) channel and for any Lipschitz function F : Rn → R we have
\[ | E[F(Y^n)] - E[F(Y^{*n})] | \le a_1 \|F\|_{Lip} \sqrt{ nC - \log M_n + a_2 \sqrt{n} } , \]    (150)
where C = ½ log(1 + P) is the capacity.
Note that in the proof of Corollary 9 concentration of measure is used twice: once for PY n |X n in the form of the Poincaré inequality (proof of Theorem 7) and once in the form of (134) (proof of Proposition 8).

C. Concentration of functions of Y n

Surprisingly, not only can we estimate expectations of F(Y n) by replacing the unwieldy PY n with the simple PY∗ n , but in fact the distribution of F(Y n) exhibits a sharp peak at its expectation:
Proposition 10: Consider a channel for which (122) holds. Then for any F which is (b, c)-concentrated for such a channel, we have for every (n, M, ǫ)max,det code:
\[ P[|F(Y^n) - E[F(Y^{*n})]| > t] \le 3b \exp\Big\{ nC - \log M + a\sqrt{n} - \frac{t^2}{16c} \Big\} \]    (151)
and
\[ \mathrm{Var}[F(Y^n)] \le 16c\big( nC - \log M + a\sqrt{n} + \log(6be) \big) . \]    (152)
Proof: Denote for convenience:
△ F¯ = E[F (Y ∗n )] ,
φ(xn ) = E[F (Y n )|X n = xn ] . Then as a consequence of F being (b, c)-concentrated for PY n |X n =xn we have 2 t n n n n . P[|F (Y ) − φ(x )| > t|X = x ] ≤ b exp − 4c December 21, 2012
(153) (154)
(155)
Consider now a subcode C1 consisting of all codewords such that φ(xn ) > F¯ + t for t > 0. The number M1 = |C1 | of codewords in this subcode is M1 = MP[φ(X n ) > F¯ + t] . Let QY n be the output distribution induced by C1 . We have the following chain: 1 X φ(xn ) F¯ + t ≤ M1 x∈C 1 Z = F (Y n )dQY n q ¯ ≤ F + 2 cD(QY n ||PY∗ n ) + c log b) q √ ¯ ≤ F + 2 c(nC − log M1 + a n) + c log b
(156)
(157) (158) (159) (160)
where (157) is by the definition of C1 , (158) is by (154), (159) is by Proposition 8 and the assumption of (b, c)-concentration of F under PY∗ n , and (160) is by (122). Together (156) and (160) imply: √ t2 ¯ P[φ(X ) > F + t] ≤ b exp nC − log M + a n − . 4c n
(161)
Applying the same argument to −F we obtain a similar bound on P[|φ(X n ) − F¯ | > t and thus P[|F (Y n ) − F¯ | > t] ≤ P[|F (Y n ) − φ(X n )| > t/2] + P[|φ(X n ) − F¯ | > t/2] √ t2 ≤ b exp − 1 + 2 exp{nC − log M + a n} 16c √ t2 ≤ 3b exp − + nC − log M + a n , 16c
(162) (163) (164)
where (163) is by (155) and (161) and (164) is by (123). Thus (151) is proven. Moreover, (152) follows by (138). D. Relation to optimal transportation Since the seminal work of Marton [7], [27] one of the tools for proving (b, c)-concentration of Lipschitz functions is the optimal transportation theory. Namely, Marton demonstrated that if a probability measure µ on a metric space satisfies a T1 inequality W1 (ν, µ) ≤
p c′ D(ν||µ)
∀ν
(165)
then any Lipschitz f is (b, kf k2Lip c)-concentrated with respect to µ for some b = b(c, c′ ) and any 0 0 such that for all n ≥ n
1X E[f (Yj )] ≤ E[f (Y ∗ )] + n j=1
16 θ4
we have √ D(PY n ||PY∗ n ) + b n 3
n4
,
(175)
provided that PY∗ n = (PY∗ )n .
√ Remark 8: Substituting the estimate (122) and assuming log M = nC + O( n) we can see 1
that both Proposition 8 and (175) give the same order n− 4 for the deviation of the empirical average from E[f (Y ∗ )]. Note also that Proposition 11 applies to functions which only have onesided exponential moments; this will be useful in Theorem 23.qFinally, the remainder term is √ D(PY n ||PY∗ n ) at the expense of order-optimal for D(PY n ||PY∗ n ) ∼ n but can be improved to n adding technical conditions on n, θ and m1 .
Proof: It is clear that if the moment-generating function t 7→ E[exp{tf (Y ∗ )}] exists for t = θ > 0 then it also exists for all 0 ≤ t ≤ θ. Notice that since x2 exp{−x2 } ≤ e−2 log e ,
∀x ≥ 0
(176)
we have for all 0 ≤ t ≤ θ2 : E[f 2 (Y ∗ ) exp{tf (Y ∗ )}] ≤ E[f 2 (Y ∗ )1{f < 0}] + e−2 m1 log e (θ − t)2 4e−2 m1 log e ≤ m2 + θ2
e−2 log e E [exp{θf (Y ∗ )}1{f ≥ 0}] (177) (θ − t)2
≤ m2 +
(178) (179)
△
= b(m1 , m2 , θ) · 2 log e .
(180)
Then a simple estimate log E[exp{tf (Y ∗ )}] ≤ t E[f (Y ∗ )] + bt2 ,
0≤t≤
θ , 2
(181)
can be obtained by taking the logarithm of the identity Z t Z s 1 t ∗ ∗ E[f (Y )] + E[exp{tf (Y )}] = 1 + ds E[f 2 (Y ∗ ) exp{tf (Y ∗ )}]du log e log2 e 0 0
(182)
and applying upper bounds (180) and log x ≤ (x − 1) log e. P Next, we define F (y n ) = n1 nj=1 f (yi ) and consider the chain:
t E[F (Y n )] ≤ log E[exp{tF (Y ∗n )}] + D(PY n ||PY∗ n )
(183)
t = n log E[exp{ f (Y ∗ )}] + D(PY n ||PY∗ n ) n bt2 + D(PY n ||PY∗ n ) , ≤ t E[f (Y ∗ )] + n
where (183) is by (147), (184) is because PY∗ n = (PY∗ )n and (185) is by (181) assuming 3
(184) (185) t n
≤ θ2 .
The proof concludes by taking t = n 4 in (185). A natural extension of Proposition 11 to functions such as n−r+1 X 1 f (yjj+r−1) F (y ) = n − r + 1 j=1 n
(186)
is made by replacing the step (184) with an estimate
n−r+1 tr ∗r log E[exp{tF (Y )}] ≤ log E exp f (Y ) , r n ∗
(187)
which in turn is shown by splitting the sum into r subsums with independent terms and then applying Holder’s inequality: 1
E[X1 · · · Xr ] ≤ (E[|X1 |r ] · · · E[|X1 |r ]) r
(188)
(for n not a multiple of r a small remainder term needs to be added to (187)). December 21, 2012
F. Functions of degraded channel outputs Notice that if the same code is used over a channel QY |X which is stochastically degraded with respect to PY |X then by the data-processing for relative entropy, the upper bound (122) holds for D(QY n ||Q∗Y n ), where QY n is the output of the QY |X channel and Q∗Y n is the output of QY |X when the input is distributed according to a capacity-achieving distribution of PY |X . Thus, in all the discussions the pair (PY n , PY∗ n ) can be replaced with (QY n , Q∗Y n ) without any change in arguments or constants. This observation can be useful in questions of information theoretic security, where the wiretapper has access to a degraded copy of the channel output. G. Input distribution: DMC As shown in Section IV-A we have for every ǫ-capacity-achieving code: n
1X PY → PY∗ . P¯n = n j=1 j
(189)
As noted in [2], convergence of output distributions can be propagated to statements about the input distributions. For example, this is obvious for the case of a DMC with a non-singular (more generally, injective) matrix PY |X . For other DMCs, the following argument extends that of [2, Theorem 4]. By Theorem 4 and 5 we know that 1 I(X n ; Y n ) → C . n
(190)
By concavity of mutual information, we must necessarily have ¯ Y¯ ) → C , I(X; where PX¯ =
1 n
Pn
j=1
(191)
PXj . By compactness of the simplex of input distributions and continuity
of the mutual information on that simplex the distance to the (compact) set of capacity achieving distributions Π must vanish: d(PX¯ , Π) → 0 .
(192)
If the capacity-achieving distribution PX∗ is unique, then (192) shows the convergence of PX̄ →
PX∗ in the (strong) sense of total variation.
H. Input distribution: AWGN In the case of the AWGN, just like in the discrete case, (50) implies that for any capacity achieving sequence of codes we have n
(n) PX¯
1X △ = PXj → PX∗ = N (0, P ) , n j=1
(193)
however, in the sense of weak convergence of distributions only. Indeed, first, a collection of all distributions on R with second moment not exceeding P is weakly compact by Prokhorov’s (n)
criterion. Thus, the sequence of empirical input marginal distributions PX¯
may be assumed
to weakly converge to some distribution QX . On the other hand, for the sequence of induced (n)
empirical output distributions PY¯
from (50) we get (n)
D(PY¯ ||PY∗ ) → 0 .
(194)
Thus by the weak lower-semicontinuity of relative entropy [32] we have D(QY ||PY∗ ) = 0, where QY = QX ∗ N (0, 1) is the output distribution induced by QX over the AWGN channel PY |X and ∗ is convolution. Since the map QX 7→ QX ∗N (0, 1) is injective (e.g. by computing characteristic
functions), we must also have QX = PX∗ and thus (193) holds.
We now discuss whether (193) can be claimed in a stronger topology than the weak one. Since PX¯ is purely atomic and PX∗ is purely diffuse, we have ||PX¯ − PX∗ ||T V = 1 ,
(195)
and convergence in total variation (let alone in relative entropy) cannot hold. On the other hand, it is quite clear (and also follows from a quantitative estimate in (267) P PXj necessarily converges to that of N (0, P ). Together below) that the second moment of n1 weak convergence and control of second moments imply [29, (12), p.7] ! n X 1 PXj , PX∗ → 0 . W22 n j=1
(196)
Therefore (193) holds in the sense of the topology metrized by the W2-distance. Note that convexity properties of W2²(·, ·) imply
\[ W_2^2\Big( \frac{1}{n}\sum_{j=1}^{n} P_{X_j},\, P^*_X \Big) \le \frac{1}{n}\sum_{j=1}^{n} W_2^2( P_{X_j}, P^*_X ) \]    (197)
\[ \le \frac{1}{n} W_2^2( P_{X^n}, P^*_{X^n} ) , \]    (198)
where we denoted △
PX∗ n = (PX∗ )n = N (0, P In) .
(199)
Comparing (196) and (198), it is natural to conjecture a stronger result: For any capacityachieving sequence of codes 1 √ W2 (PX n , PX∗ n ) → 0 . n
(200)
Another reason to conjecture (200) arises from considering the behavior of Wasserstein distance under convolutions. Indeed from the T2 -transportation inequality (171) and the relative entropy bound (122) we have 1 2 W (PX n ∗ N (0, In ), PX∗ n ∗ N (0, In)) → 0 , n 2
(201)
PY n = PX n ∗ N (0, In )
(202)
PY∗ n = PX∗ n ∗ N (0, In ) ,
(203)
since by definition
where ∗ denotes convolution of distributions on Rn . Trivially, for any P, Q and N – probability
measures on Rn it is true that (e.g. [29, Proposition 7.17])
W2 (P ∗ N , Q ∗ N ) ≤ W2 (P, Q) .
(204)
1 1 0 ← √ W2 (PX n ∗ N (0, In ), PX∗ n ∗ N (0, In )) ≤ √ W2 (PX n , PX∗ n ) , n n
(205)
Thus, overall we have
and (200) simply conjectures that (conversely to (204)) the convolution with Gaussian kernel is unable to significantly decrease W2 . Despite the foregoing intuitive considerations, the conjecture (200) is false. Indeed, define D ∗ (M, n) to be the minimum achievable average square distortion among all vector quantizers of the memoryless Gaussian source N (0, P ) for blocklength n and cardinality M. In other words, D ∗ (M, n) =
1 inf W22 (PX∗ n , Q) , n Q
(206)
where the infimum is over all probability measures Q supported on M equiprobable atoms in Rn. The standard rate-distortion (converse) lower bound dictates
\[ \frac{1}{n}\log M \ge \frac{1}{2}\log\frac{P}{D^*(M,n)} \]    (207)
and hence W22 (PX n , PX∗ n ) ≥ nD ∗ (n, M) 2 ≥ nP exp − log M , n
(208) (209)
which shows that for any sequence of codes with log Mn = O(n), the normalized transportation distance stays strictly bounded away from zero:
\[ \liminf_{n\to\infty} \frac{1}{\sqrt{n}} W_2( P_{X^n}, P^*_{X^n} ) > 0 . \]    (210)

V. BINARY HYPOTHESIS TESTING: PY n VS. PY∗ n
We now turn to the question of distinguishing PY n from PY∗ n in the sense of binary hypothesis testing. First, a simple data-processing reasoning yields d(α||βα(PY n , PY∗ n )) ≤ D(PY n ||PY∗ n ) ,
(211)
where we have denoted the binary relative entropy △
d(x||y) = x log
1−x x + (1 − x) log . y 1−y
From (122) and (211) we conclude: Every (n, M, ǫ)max,det code must satisfy α1 M C √ a ∗ βα (PY n , PY n ) ≥ exp{−n − n } 2 α α
(212)
(213)
for all 0 < α ≤ 1. Therefore, in particular we see that the hypothesis testing problem for
discriminating PY n from PY∗ n has zero Stein’s exponent (and thus a zero Chernoff exponent), provided the sequence of (n, Mn , ǫ)max,det codes, defining PY n , is capacity achieving. The following result shows that (213) is not tight: Theorem 12: Consider one of the three types of channels introduced in Section II. Then every (n, M, ǫ)avg code must satisfy √ βα (PY n , PY∗ n ) ≥ M exp{−nC − a2 n}
ǫ ≤ α ≤ 1,
(214)
where a2 = a2 (ǫ, a1 ) > 0 depends only on ǫ and the constant a1 from (24). To prove Theorem 12 we need the following strengthening of the meta-converse results [3, Theorems 27 and 31]:
Theorem 13: Consider an (M, ǫ)avg code for an arbitrary random transformation PY |X . Let PX be the distribution of the encoder output if all codewords are equally likely and PY be the induced output distribution. Then for any QY and ǫ ≤ α ≤ 1 we have βα (PY , QY ) ≥ Mβα−ǫ (PXY , PX QY ) .
(215)
If the code is (M, ǫ)max,det then additionally we have for every δ > 0: βα (PY , QY ) ≥
δ M inf βα−ǫ−δ (PY |X=x , QY ) x∈C 1−α+δ
ǫ +δ ≤ α ≤ 1,
(216)
where the infimum is taken over the codebook C.
Proof: On the probability space corresponding to a given (M, ǫ)avg code, define the following random variable ˆ (Y ) = W, Y ∈ E} , Z = 1{W
(217)
where E is an arbitrary subset satisfying PY [E] ≥ α .
(218)
Precisely as in the original meta-converse [3, Theorem 26] the main idea is to use Z as a suboptimal hypothesis test for discriminating PXY against PX QY . Following the same reasoning as in [3, Theorem 27] one notices that (PX QY )[Z = 1] ≤
QY [E] M
(219)
and PXY [Z = 1] ≥ α − ǫ .
(220)
Therefore, by definition of βα we must have βα−ǫ (PXY , PX QY ) ≤
QY [E] . M
(221)
To complete the proof of (215) we just need to take the infimum in (221) over all E satisfying (218). To prove (216), we again consider any set E satisfying (218). Let ci , i = 1, . . . , M be the codewords of the (M, ǫ)max,det code. Define pi = PY |X=ci [E] ˆ = i, E] , qi = QY [W December 21, 2012
(222) i = 1, . . . , M.
(223) DRAFT
34
ˆ = i} are disjoint, the (arithmetic) average of qi is lower-bounded by Since the sets {W E[qW ] ≤
QY [E] , M
(224)
whereas because of (218) we have E[pW ] ≥ α . Thus, the following lower bound holds: QY [E] QY [E] E pW − δqW ≥ (α − δ) M M
(225)
(226)
implying that there must exist i ∈ {1, . . . , M} such that QY [E] QY [E] pi − δqi ≥ (α − δ) . M M
(227)
PY |X=ci [E] ≥ α − δ
(228)
For such i we clearly have
ˆ = i, E] ≤ QY [W
QY [E] 1 − α − δ . M δ
(229)
By the maximal probability of error constraint we deduce ˆ = i] ≥ α − ǫ − δ PY |X=ci [E, W
(230)
and thus by the definition of βα : βα−ǫ−δ (PY |X=ci , QY ) ≤
QY [E] 1 − α − δ . M δ
(231)
Taking the infimum in (231) over all E satisfying (218) completes the proof of (216). Proof of Theorem 12: To show (214) we first notice that as a consequence of (19), (24) and [3, Lemma 59] (see also [14, (2.71)]) we have for any xn ∈ Xn : ) ( r 2a n α 1 . βα (PY n |X n =xn , PY∗ n ) ≥ exp −nC − 2 α
(232)
From [14, Lemma 32] and the fact that the function of α in the right-hand side of (232) is convex we obtain that for any PX n βα (PX n Y n , PX n PY∗ n )
) ( r α 2a1 n . ≥ exp −nC − 2 α
(233)
Finally, (233) and (215) imply (214).
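The quantity β_α(P, Q) appearing throughout this section is the classical Neyman–Pearson trade-off, and for product distributions it can be evaluated exactly. The hedged Python sketch below (arbitrary parameters, illustration only) computes β_α(PY n |X n =x , PY∗ n ) for the BSC, where the log-likelihood ratio depends on the output only through its Hamming distance to x; the printed exponent approaches the single-letter divergence, which for the BSC equals C, in the spirit of (232):

import numpy as np
from scipy.stats import binom

def beta_alpha_bsc(n, delta, alpha):
    # beta_alpha(Bern(delta)^n, Bern(1/2)^n), exact, via the Neyman-Pearson lemma.
    p = binom.pmf(np.arange(n + 1), n, delta)   # P-mass per Hamming-weight class
    q = binom.pmf(np.arange(n + 1), n, 0.5)     # Q-mass per class (uniform distribution)
    beta, acc = 0.0, 0.0
    for pw, qw in zip(p, q):                    # classes ordered by decreasing P/Q ratio
        if acc + pw < alpha:
            acc += pw
            beta += qw
        else:                                   # randomize on the boundary class
            return beta + qw * (alpha - acc) / pw
    return beta

delta, alpha = 0.11, 0.9
C = np.log(2) + delta*np.log(delta) + (1-delta)*np.log(1-delta)   # BSC capacity, nats
for n in [100, 400, 1600]:
    b = beta_alpha_bsc(n, delta, alpha)
    print(f"n={n:5d}   -(1/n) log beta_alpha = {-np.log(b)/n:.4f}   C = {C:.4f}")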
VI. AEP FOR THE OUTPUT PROCESS Y n
We say that a sequence of distributions PY n on Y n (with Y a countable set) satisfies the asymptotic equipartition property (AEP) if
\[ \frac{1}{n}\Big( \log\frac{1}{P_{Y^n}(Y^n)} - H(Y^n) \Big) \to 0 \]    (234)
in probability.
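What (234) asserts for a code-induced output process can be eyeballed by simulation. A hedged Python sketch (arbitrary random codebook and parameters; nothing here is claimed to be a good code): sample outputs of a BSC driven by equiprobable codewords, evaluate the exact mixture probability PY n (Y n ), and observe that (1/n) log(1/PY n (Y n )) clusters tightly around its mean, which equals (1/n)H(Y n ):

import numpy as np

rng = np.random.default_rng(3)
delta, n, M, N = 0.11, 200, 64, 5000
codebook = rng.integers(0, 2, size=(M, n))             # random codebook, equiprobable codewords
idx = rng.integers(0, M, size=N)
Y = codebook[idx] ^ (rng.random((N, n)) < delta)       # BSC(delta) outputs

# log P_{Y^n}(y) = log (1/M) sum_c delta^d (1-delta)^(n-d), d = Hamming distance; log-sum-exp for stability
d = (Y[:, None, :] != codebook[None, :, :]).sum(axis=2)
logk = d * np.log(delta) + (n - d) * np.log(1 - delta)
m = logk.max(axis=1, keepdims=True)
logP = (m + np.log(np.exp(logk - m).sum(axis=1, keepdims=True))).ravel() - np.log(M)

h = -logP / n                                          # (1/n) log 1/P_{Y^n}(Y^n), nats per symbol
print("mean (≈ H(Y^n)/n):", h.mean(), "   std dev:", h.std())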
Although the sequence of output distributions induced by codes are far from being (a finite chunk of) a stationary ergodic process, we will show that (234) is satisfied for ǫ-capacityachieving codes (and other codes). Thus, in particular, if the channel outputs are to be losslessly compressed and stored for later decoding n1 H(Y n ) bits per sample would suffice (cf. (57)). In √ fact, the log PY n1(Y n ) concentrates up to n around the entropy H(Y n ). Such questions are also interesting in other contexts and for other types of distributions, see [33]. Theorem 14: Consider a DMC PY |X with C1 < ∞ (with or without input constraints) and a capacity achieving sequence of (n, Mn , ǫ)max,det codes. Then the output AEP (234) holds. Proof: In the proof of Theorem 5 it was shown that log PY n (y n ) is Lipschitz with Lipschitz constant upper bounded by a1 , see (60). Thus by (143) and Proposition 10 we find that for any capacity-achieving sequence of codes Var[log PY n (Y n )] = o(n2 ) ,
n → ∞.
(235)
Thus (234) holds by Chebyshev inequality. Surprisingly, for many practically interesting DMCs (such as those with additive noise in a finite group), the estimate (235) can be improved to O(n) even without assuming the code to be capacity-achieving. Theorem 15: Consider a DMC PY |X with C1 < ∞ (with or without input constraints) and such that H(Y |X = x) is constant on X . Then for any sequence of (n, Mn , ǫ)max,det codes there exists a constant a = a(ǫ) such that for all n sufficiently large 1 ≤ an . Var log PY n (Y n )
(236)
In particular, the output AEP (234) holds.
Proof: First, let X be a random variable and A some event (think P[A^c] ≪ 1) such that
\[ |X - E[X]| \le L \]    (237)
if X 6∈ A. Then Var[X] = E[(X − E[X])2 1A ] + E[(X − E[X])2 1Ac ] ≤ E[(X − E[X])2 1A ] + P[Ac ]L2 ! 2 c P[A ] = P[A] Var[X|A] + (E[X] − E[X|Ac ]) P[Ac ]L2 P[A] ≤ Var[X|A] +
P[Ac ] 2 L , P[A]
(238) (239) (240) (241)
where (239) is by (237), (240) is after rewriting E[(X − E[X])2 |A] in terms of Var[X|A] and applying the obvious identity E[X|A] =
E[X] − P[Ac ] E[X|Ac ] P[A]
(242)
and (241) is because (237) implies | E[X|Ac ] − E[X]| ≤ L.
Next, fix n and for any codeword xn ∈ Xn denote for brevity d(xn ) = D(PY n |X n =xn ||PY n ) n 1 n n X = x v(x ) = E log PY n (Y n ) = d(xn ) + H(Y n |X n = xn ) .
(243) (244) (245)
If we could show that for some a1 > 0 Var[d(X n )] ≤ a1 n the proof would be completed as follows: n 1 1 Var log = Var log X + Var[v(X n )] PY n (Y n ) PY n (Y n ) ≤ a2 n + Var[v(X n )]
(246)
(247) (248)
= a2 n + Var[d(X n )]
(249)
≤ (a1 + a2 )n ,
(250)
where (248) follows for an appropriate constant a2 > 0 from (62), (249) is by (245) and H(Y n |X n = xn ) does not depend on xn by assumption9, and (250) is by (246). 9
This argument also shows how to construct a counterexample when H(Y |X = x) is non-constant: merge two constant
composition subcodes of types P1 and P2 such that H(W |P1 ) 6= H(W |P2 ) where W = PY |X is the channel matrix. In this case one clearly has Var[log PY n (y n )] ≥ Var[v(X n )] = const · n2 . December 21, 2012
To show (246), first, notice that iX n ;Y n (xn ; y n ) = log
PX n |Y n (y n |xn ) ≤ log Mn . PX n (xn )
(251)
Second, as shown in (65) one may take Sm = a3 n in Corollary 3. In turn, this implies that one q 3n and δ ′ = 1−ǫ in Theorem 2, that is: can take ∆ = 2a 1−ǫ 2 n PY n |X n =xn n 1+ǫ n (Y ) < d(x) + ∆ X = x ≥ . (252) inf P log n x PY n 2
Then applying Theorem 1 with ρ(xn ) = d(xn ) + ∆ to the subcode consisting of all codewords with {d(X n ) ≤ log Mn − 2∆} we get
P[d(X n ) ≤ log Mn − 2∆] ≤ since
2 exp{−∆} , 1−ǫ
E[ρ(X n )|d(X n )] ≤ log Mn − 2∆] ≤ log Mn − ∆ .
(253)
(254)
Now, we apply (241) to d(X n ) with L = log Mn and A = {d(X n ) > log Mn − 2∆}. Since
Var[X|A] ≤ ∆2 this yields
2 log2 Mn exp{−∆} (255) Var[d(X )] ≤ ∆ + 1−ǫ √ 2 exp{−∆} ≤ 21 . Since ∆ = O( n) and log Mn = O(n) we conclude for all n such that 1−ǫ n
2
from (255) that there must be a constant a1 such that (246) holds. We also have a similar extension for the AWGN channel, except that this time output-AEP does not have a natural interpretation in terms of fundamental limits of compression. Theorem 16: Consider the AW GN channel. Then for any sequence of (n, Mn , ǫ)max,det codes there exists a constant a = a(ǫ) such that for all n sufficiently large 1 ≤ an , Var log pY n (Y n )
(256)
where pY n is the density of PY n with respect to the Lebesgue measure on Rn. In particular, the AEP holds for Y n.
Proof: The argument follows that of Theorem 15 step by step, with (121) used in place of (62).
Corollary 17: If in the setting of Theorem 16 the codes are spherical (i.e., the energies of all codewords X n are equal) or, more generally,
\[ \mathrm{Var}[\|X^n\|^2] = o(n^2) , \]    (257)
then
in PY n -probability.
1 dPY n n ∗ log (Y ) − D(PY n ||PY n ) → 0 ∗ n dPY n
(258)
Yn (Y n ) we need in addition to (256) show Proof: To apply Chebyshev’s inequality to log dP dP ∗ n Y
Var[log p∗Y n (Y n )] = o(n2 ) , n
(259)
||y n ||2
where p∗Y n (y n ) = (2π(1 + P ))− 2 e− 2(1+P ) . Introducing i.i.d. Zj ∼ N (0, 1) we have # " n 2 X log e Var ||X n ||2 + 2 Xj Zj + ||Z n ||2 . Var[log p∗Y n (Y n )] = 4(1 + P )2 j=1
(260)
Variances of second and third terms are clearly O(n), while the variance of the first term is o(n2 ) by assumption (257). Then (260) implies (259) via Var[A + B + C] ≤ 3(Var[A] + Var[B] + Var[C]) .
(261)

VII. EXPECTATIONS OF NON-LINEAR POLYNOMIALS OF GAUSSIAN CODES
This section contains results special to the AWGN channel. Because of the algebraic structure available on Rn it is natural to ask whether we can provide approximations for polynomials. Since Theorem 7 shows the validity of (122), all the results for Lipschitz (in particular linear) functions from Section IV follow. Polynomials of higher degree, however, do not admit bounded Lipschitz constants. In this section we discuss the case of quadratic polynomials (Section VII-A) and polynomials of higher degree (Section VII-B). We present results directly in terms of the polynomials in (X1 , . . . , Xn ) on the input space. This is strictly stronger than considering polynomials on the output space, since E[q(Y n )] = E[q(X n + Z n )] and thus by integrating over the distribution of Z n the problem reduces to computing the expectation of a (different) polynomial of X n . The reverse reduction is clearly not possible.

A. Quadratic forms

We denote the canonical inner product on Rn as (a, b) =
n X
aj bj ,
(262)
j=1
and write the quadratic form corresponding to matrix A as n X n X (Ax, x) = ai,j xi xj .
(263)
j=1 i=1
n
n
Note that when X ∼ N (0, P ) we have trivially
E[(AX n , X n )] = P tr A ,
(264)
where tr is the trace operator. Therefore, the next result shows that the distribution of good codes must be close to isotropic Gaussian distribution, at least in the sense of evaluating quadratic forms: Theorem 18: For any P > 0 and 0 < ǫ < 1 there exists a constant b = b(P, ǫ) > 0 such that for all (n, M, ǫ)max,det codes and all quadratic forms A such that
we have
−In ≤ A ≤ In
(265)
√ q √ 2(1 + P ) n √ nC − log M + b n | E[(AX , X )] − P tr A| ≤ log e
(266)
n
n
and (a refinement for A = In ) n X √ 2(1 + P ) (nC − log M + b n) . | E[Xj2 ] − nP | ≤ log e j=1
(267)
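The Gaussian benchmark (264) against which Theorem 18 measures a code can be verified directly by Monte Carlo. A hedged Python sketch (dimension, power and matrix are arbitrary): draw X n with i.i.d. N(0, P) coordinates, build a symmetric A with spectrum in [−1, 1] as required by (265), and compare the empirical mean of (AX n , X n ) with P tr A:

import numpy as np

rng = np.random.default_rng(4)
n, P, N = 50, 2.0, 100000
U, _ = np.linalg.qr(rng.normal(size=(n, n)))             # random orthogonal matrix
A = U @ np.diag(rng.uniform(-1.0, 1.0, size=n)) @ U.T    # symmetric, -I_n <= A <= I_n
X = rng.normal(scale=np.sqrt(P), size=(N, n))            # i.i.d. N(0, P) coordinates
quad = np.einsum('ij,jk,ik->i', X, A, X)                 # (A x, x) for each sample
print("Monte Carlo E[(AX,X)]:", quad.mean(), "   P * tr A:", P * np.trace(A))

For codewords of a good code, Theorem 18 states that the same comparison holds up to the slack on the right-hand side of (266).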
Remark 9: By using the same method as in the proof of Proposition 10 one can also show that
the estimate (266) holds on a per-codeword basis for an overwhelming majority of codewords. Proof: Denote Σ = E[xxT ] ,
(268)
V = (In + Σ)−1 ,
(269)
QY n = N (0, In + Σ) , dPY n |X n =x R(y|x) = log (y) , dQY n log e ln det(In + Σ) + (Vy, y) − ||y − x||2 , = 2 d(x) = E[R(Y n |x)|X n = x] , log e = (ln det(In + Σ) + (Vx, x) + tr(V − In )) 2 v(x) = Var[R(Y n |x)|X n = x] . December 21, 2012
(270) (271) (272) (273) (274) (275)
Denote also the spectrum of Σ by {λi , i = 1, . . . , n} and its eigenvectors by {vi , i = 1, . . . , n}. We have then |E[(AX n , X n )] − P tr A| = |tr(Σ − P In )A| n X = (λi − P )(Avi , vi ) ≤
i=1 n X i=1
|λi − P | ,
(276) (277) (278)
where (277) follows by computing the trace in the eigenbasis of Σ and (278) is by (265). From (274), it is straightforward to check that D(PY n |X n ||QY n |PX n ) = E[d(X n )] 1 = log det(In + Σ) 2 n 1X log(1 + λj ) . = 2 j=1 By using (261) we estimate 1 1 2 n 2 n n n v(x) ≤ 3 log e Var[||Z || ] + Var[(V Z , Z )] + Var[(V x, Z )] 4 4 9 △ + 3P log2 e = nb21 ≤ n 4
(279) (280) (281)
(282) (283)
where (283) results from applying the following identities and bounds for Z n ∼ N (0, In ): Var[||Z n ||2 ] = 3n ,
(284)
Var[(a, Z n )] = ||a||2 ,
(285)
Var[(VZ n , Z n )] = 3 tr V2 ≤ 3n
(286)
||Vx||2 ≤ ||x||2 ≤ nP .
(287)
Finally from Corollary 3 applied with Sm = b21 n and (281) we have n
√ 1X 2 log(1 + λj ) ≥ log M − b1 n − log 2 j=1 1−ǫ √ ≥ log M − b n =
n (log(1 + P ) − δn ) , 2
(288) (289) (290)
where we abbreviated s
+ 3P 2 b = log e + log 1−ǫ 1−ǫ √ δn = 2(nC + b n − log M) . 2
9 4
(291) (292)
To derive (267) consider the chain: n
−δn
1X 1 + λi ≤ log n j=1 1+P n
≤ log
1 X 1 + λi n j=1 1 + P
(293) !
(294)
n
log e X ≤ (λi − P ) n(1 + P ) j=1 =
log e (E[||X n ||2 ] − nP ) n(1 + P )
(295) (296)
where (293) is (290), (294) is by Jensen’s inequality, (295) is by log x ≤ (x − 1) log e. Note that (267) is equivalent to (296). Finally, (266) follows from (278), (293) and the next Lemma i , i = 1, . . . , n . applied with X equiprobable on 1+λ 1+P Lemma 19: Let X > 0 and E[X] ≤ 1, then
E[|X − 1|] ≤
s
1 2 E ln X
(297)
Proof: Define two distributions on R+ : △
(298)
△
(299)
P [E] = P[X ∈ E] Q[E] = E[X · 1{X ∈ E}] + (1 − E[X])1{0 ∈ E} . Then, we have 2||P − Q||T V
= 1 − E[X] + E[|X − 1|] 1 D(P ||Q) = E log . X
(300) (301)
and (297) follows by (129). The proof of Theorem 18 relied on a direct application of the main inequality (in the form of Corollary 3) and is independent of the previous estimate (122). At the expense of a more
December 21, 2012
DRAFT
42
technical proof we could derive an order-optimal form of Theorem 18 starting from (122) using concentration properties of Lipschitz functions. Indeed, notice that because E[Z n ] = 0 we have E[(AY n , Y n )] = E[(AX n , X n )] + tr A .
(302)
Thus, (266) follows from (122) if we can show n
n
| E[(AY , Y )] − E[(AY
∗n
,Y
∗n
q )]| ≤ b nD(PY n ||PY∗ n ) .
(303)
This is precisely what Corollary 9 would imply if the function y 7→ (Ay, y) were Lipschitz √ with constant O( n). However, (Ay, y) is generally not Lipschitz when considered on the entire of Rn . On the other hand, it is clear that from the point of view of evaluation of both √ the E[(AY n , Y n )] and E[(AY ∗n , Y ∗n )] only vectors of norm O( n) are important, and when √ restricted to the ball S = {y : kyk2 ≤ b n} quadratic form (Ay, y) does have a required √ Lipschitz constant of O( n). This approximation idea can be made precise using Kirzbraun’s theorem (see [34] for a short proof) to extend (Ay, y) beyond the ball S preserving the maximum √ absolute value and the Lipschitz constant O( n). Another method of showing (303) is by using the Bobkov-G¨otze extension of Gaussian concentration (140) to non-Lipschitz functions [30, Theorem 1.2] to estimate the moment generating function of (AY ∗n , Y ∗n ) and apply (147) with q t = n1 D(PY n ||PY∗ n ). Both methods yield (303), and hence (266), but with less sharp constants
than those in Theorem 18. B. Behavior of ||x||q
The next natural question is to consider polynomials of higher degree. The simplest example P of such polynomials are F (x) = nj=1 xqj for some power q, to analysis of which we proceed now. To formalize the problem, consider 1 ≤ q ≤ ∞ and define the q-th norm of the input vector in the usual way △
||x||q =
n X i=1
|xi |q
! q1
.
(304)
The aim of this section is to investigate the values of ||x||q for the codewords of good codes for the AWGN channel. Notice that when the coordinates of x are independent Gaussians we expect to have
n X i=1
December 21, 2012
|xi |q ≈ n E[|Z|q ] ,
(305)
DRAFT
43
where Z ∼ N (0, 1). In fact it can be shown that there exists a sequence of capacity achieving codes and constants Bq , 1 ≤ q ≤ ∞ such that every codeword x at every blocklength n satisfies10 : 1
1
1 ≤ q < ∞,
||x||q ≤ Bq n q = O(n q )
(306)
and ||x||∞ ≤ B∞
p p log n = O( log n) .
(307)
But do (306)-(307) hold (possibly with different constants) for any good code? It turns out that the answer depends on the range of q and on the degree of optimality of the code. Our findings are summarized in Table I. The precise meaning of each entry will be clear from Theorems 20, 23 and their corollaries. The main observation is that the closer the code size comes to M ∗ (n, ǫ), the better ℓq -norms reflect those of random Gaussian codewords (306)(307). Loosely speaking, very little can be said about ℓq -norms of capacity-achieving codes, while O(log n)-achieving codes are almost indistinguishable from the random Gaussian ones. In particular, we see that, for example, for capacity-achieving codes it is not possible to approximate expectations of polynomials of degrees higher than 2 (or 4 for dispersion-achieving codes) by assuming Gaussian inputs, since even the asymptotic growth rate with n can be dramatically different. The question of whether we can approximate expectations of arbitrary polynomials for O(log n)-achieving codes remains open. We proceed to support the statements made in Table I. 1
In fact, each estimate in Table I, except n q log
q−4 2q
n, is tight in the following sense: if the √ entry is nα , then there exists a constant Bq and a sequence of O(log n)-, dispersion-, O( n)-, or capacity-achieving (n, Mn , ǫ)max,det codes such that each codeword x ∈ Rn satisfies for all n≥1
kxkq ≥ Bq nα .
(308)
If the entry in the table states o(nα ) then there is Bq such that for any sequence τn → 0 there √ exists a sequence of O(log n)-, dispersion-, O( n)-, or capacity-achieving (n, Mn , ǫ)max,det codes 10
This does not follow from a simple random coding argument since we want the property to hold for every codeword, which
constitutes exponentially many constraints. However, the claim can indeed be shown by invoking the κβ-bound [3, Theorem 25] with a suitably chosen constraint set F.
December 21, 2012
DRAFT
44
TABLE I B EHAVIOR OF ℓq
NORMS
kxkq
OF CODEWORDS FROM CODES FOR THE
1≤q≤2
Code
2 0 there exists a constant b = b(P, ǫ) such
that for any11 n ≥ N(P, ǫ) and any (n, M, ǫ)max,det -code for the AW GN(P ) channel at least half of the codewords satisfy kxk2∞
4(1 + P ) ≤ log e
nC −
√
M nV Q (ǫ) + 2 log n + log b − log 2 −1
,
(321)
where C and V are the capacity and the dispersion. In particular, the expression in (·) is nonnegative for all codes and blocklengths. Remark 11: What sets Theorem 20 apart from other results is its sensitivity to whether the code achieves the dispersion term. This is unlike estimates of the form (122), which only sense √ whether the code is O( n)-achieving or not. From Theorem 20 the explanation of the entries in the last column of Table I becomes obvious: the more terms the code achieves in the asymptotic expansion of log M ∗ (n, ǫ) the closer kxk∞ √ becomes to the O( log n), which arises from a random Gaussian codeword (307). To be specific, we give the following exact statements: Corollary 21 (q = ∞ for O(log n)-codes): For any 0 < ǫ < 1 and P > 0 there exists a constant b = b(P, ǫ) such that for any (n, Mn , ǫ)max,det -code for the AW GN(P ) with log Mn ≥ nC − 11
N (P, ǫ) = 8(1 + 2P −1 )(Q−1 (ǫ))2 for ǫ
0 we have that at least half of the codewords satisfy p kxk∞ ≤ (b + K) log n + b .
(323)
Corollary 22 (q = ∞ for capacity-achieving codes): For any capacity-achieving sequence of
(n, Mn , ǫ)max,det -codes there exists a sequence τn → 0 such that for at least half of the codewords we have 1
kxk∞ ≤ τn n 2 .
(324)
Similarly, for any dispersion-achieving sequence of (n, Mn , ǫ)max,det -codes there exists a sequence τn → 0 such that for at least half of the codewords we have 1
kxk∞ ≤ τn n 4 .
(325)
Remark 12: By (309), the sequences τn in Corollary 22 are necessarily code-dependent. For the q = 4 we have the following estimate (see Appendix for the proof): Theorem 23 (q = 4): For any 0 < ǫ
0 there exist constants b1 > 0 and b2 > 0,
depending on P and ǫ, such that for any (n, M, ǫ)max,det -code for the AW GN(P ) channel at least half of the codewords satisfy kxk24
2 ≤ b1
nC + b2
√
M n − log 2
,
(326)
where C is the capacity of the channel. In fact, we also have a lower bound √ 1 E[kxk44 ] ≥ 3nP 2 − (nC − log M + b3 n)n 4 ,
(327)
for some b3 = b3 (P, ǫ) > 0. Remark 13: Note that E[kzk44 ] = 3nP 2 for z ∼ N (0, P )n. We can now complete the proof of the results in Table I: 1) Row 2: q = 4 is Theorem 23; 2 < q ≤ 4 follows by (314) with p = 4; q = ∞ is Corollary 21; for 4 < q < ∞ we apply interpolation via (316) with p = 4. 2) Row 3: q ≤ 4 is treated as in Row 2; q = ∞ is Corollary 22; for 4 < q < ∞ apply interpolation (316) with p = 4. 3) Row 4: q ≤ 4 is treated as in Row 2; q ≥ 4 follows from (317) with p = 4. 4) Row 5: q = ∞ is Theorem 22; for 2 < q < ∞ we apply interpolation (316) with p = 2. The upshot of this section is that we cannot approximate values of non-quadratic polynomials √ in x (or y) by assuming iid Gaussian entries, unless the code is O( n)-achieving, in which December 21, 2012
DRAFT
48
case we can go up to degree 4 but still will have to be content with one-sided (lower) bounds only, cf. (327).12 Before closing this discussion we demonstrate the sharpness of the arguments in this section by considering the following example. Suppose that a power of a codeword x from a capacitydispersion optimal code is measured by an imperfect tool, such that its reading is described by
n
1X (xi )2 Hi , E= n i=1
(328)
where Hi ’s are i.i.d bounded random variables with expectation and variance both equal to 1. For large blocklengths n we expect E to be Gaussian with mean P and variance n1 kxk44 . On the one hand, Theorem 23 shows that the variance will not explode; (327) shows that it will be at least as large as that of a Gaussian codebook. Finally, to establish the asymptotic normality rigorously, the usual approach based on checking Lyapunov condition will fail as shown by (309), but the Lindenberg condition does hold as a consequence of Theorem 22. If in addition, the code is O(log n)-achieving then 2 nδ√ 1 +b2 δ log n
−b
P[|E − E[E]| > δ] ≤ e VIII. E XTENSION
.
(329)
TO OTHER CHANNELS : TILTING
Let us review the scheme of investigating functions of the output F (Y n ) that was employed in this paper so far. First, an inequality (122) was shown by verifying that QY = PY n satisfies conditions of Theorem 2. Then approximation of the form F (Y n ) ≈ E[F (Y n )] ≈ E[F (Y ∗n )]
(330)
follows by the arguments in Section IV, cf. Propositions 8 and 10. Therefore, in some sense application of Theorem 2 with QY = PY n implies an approximation (330) simultaneously for all concentrated (e.g. Lipschitz) functions. On one hand, this is very convenient since all the channel-specific work is isolated in proving (122). On the other hand, verifying conditions of Theorem 2 for QY = PY n may not in general be feasible even for memoryless channels. In this section we show how Theorem 2 can be used to investigate (330) for a specific function F in 12
Using quite similar methods, (327) can be extended to certain bi-quadratic forms, i.e. 4-th degree polynomials
where A = (ai−j ) is a Toeplitz positive semi-definite matrix.
December 21, 2012
P
i,j
ai−j x2i x2j ,
DRAFT
49
the absence of the universal estimate (122). A different application of the same technique was previously reported in [8, Section VIII]. In this section we focus on studying an arbitrary random transformation PY |X : X → Y,
for convenience we introduce Y ♭ distributed according to auxiliary distribution QY and denote analogously to (145) ♭
△
E[F (Y )] =
Z
F (y)dQY ,
(331)
where F – is arbitrary function of Y. Suppose that F is such that ZF = log E[exp{F (Y ♭ )}] < ∞ , (F )
Denote by QY
(332)
an F -tilting of QY , namely (F )
dQY
= exp{F − ZF }dQY
(333)
The core idea of our technique is that if F is sufficiently regular and QY satisfies conditions (F )
of Theorem 2, then QY
also does. Consequently, the expectation of F under PY (induced by
the code) can be investigated in terms of the moment-generating function of F under QY . For brevity we only present a variance-based version (similar to Corollary 3): Theorem 24: Let QY and F be such that (332) holds and dPY |X=x (Y ) X = x < ∞ , S = sup Var log QY x SF = sup Var[F (Y )|X = x] .
(334) (335)
x
Then there exists a constant a = a(ǫ, S) > 0 such that for any (M, ǫ)max,det code we have for all 0 ≤ t ≤ 1 p t E[F (Y )] − log E[exp{tF (Y ♭ )}] ≤ D(PY |X ||QY |PX ) − log M + a S + t2 SF (336) √ SF 2 t (337) ≤ D(PY |X ||QY |PX ) − log M + a S 1 + 2S Proof: Note that since log
dPY |X (F ) dQY
= log
dPY |X − F (Y ) + ZF dQY
(338)
we have for any 0 ≤ t ≤ 1: (tF )
D(PY |X ||QY |PX ) = D(PY |X ||QY |PX ) − t E[F (Y )] + log E[exp{tF (Y ♭ )}] , December 21, 2012
(339)
DRAFT
50
and
# dPY |X Var log X = x ≤ 2(S + t2 SF ) , (F ) dQ "
(340)
Y
where we used Var[A + B] ≤ 2(Var[A] + Var[B]). Then from Corollary 3 applied with QY (tF )
and S replaced by QY √ 1 + x ≤ 1 + x2 .
and 2S + 2t2 SF , respectively, we obtain (336), while (337) follows by
To see that (337) is actually useful, notice that for example Corollary 9 can easily be recovered by taking QY = PY∗ n , applying (19), estimating the moment-generating function via (140) and SF via Poincar´e inequality: SF ≤ bkF k2Lip .
(341)
The advantage of such a direct approach is that in fact a “dimension-dependent” Poincare inequality would also succeed in this case: SF ≤ bnkF k2Lip ,
(342)
whereas the full “dimensionless” Poincar´e inequality (341) was required in order to show (122). A PPENDIX In this appendix we prove results from Section VII-B. To prove Theorem 20 the basic intuition is that any codeword which is abnormally peaky (i.e., has a high value of kxk∞ ) is wasteful in terms of allocating its power budget. Thus a good capacity- or dispersion-achieving codebook cannot have too many of such wasteful codewords. A non-asymptotic formalization of this intuitive argument is as follows: Lemma 25: For any ǫ ≤
1 2
and P > 0 there exists a constant b = b(P, ǫ) such that given any
(n, M, ǫ)max,det code for the AW GN(P ) channel, we have for any 0 ≤ λ ≤ P :13 n o p √ b exp nC(P − λ) − nV (P − λ)Q−1 (ǫ) + 2 log n , P[kX n k∞ ≥ λn] ≤ M
(343)
where C(P ) and V (P ) are the capacity and the dispersion of the AW GN(P ) channel, and X n is the output of the encoder assuming equiprobable messages. 13
For ǫ >
1 2
one must replace V (P − λ) with V (P ) in (343). This does not modify any of the arguments required to prove
Theorem 20.
December 21, 2012
DRAFT
51
Proof: Our method is to apply the meta-converse in the form of [3, Theorem 30] to the √ subcode that satisfies kX n k∞ ≥ λn. Application of a meta-converse requires selecting a suitable auxiliary channel QY n |X n . We specify this channel now. For any x ∈ Rn let j ∗ (x) be
the first index s.t. |xj | = ||x||∞, then we set QY n |X n (y n |x) = PY |X (yj ∗ |xj ∗ )
Y
PY∗ (yj )
(344)
j6=j ∗ (x)
We will show below that for some b1 = b1 (P ) any M-code over the Q-channel (344) has average probability of error ǫ′ satisfying:
3
b1 n 2 . 1−ǫ ≤ M ′
dP
n
(345) n
=x (Y n ) we see that it coincides with On the other hand, writing the expression for log dQYY n|X |X n =x
the expression for log
dPY n |X n =x dPY∗ n
except that the term corresponding to j ∗ (x) will be missing;
compare with [14, (4.29)]. Thus, one can repeat step by step the analysis in the proof of [3, Theorem 65] with the only difference that nP should be replaced by nP − kxk2∞ reflecting the
reduction in the energy due to skipping of j ∗ . Then, we obtain for some b2 = b2 (α, P ): ! v ! u 2 2 u kxk kxk∞ 1 ∞ +tnV P − Q−1 (ǫ)− log n−b2 , log β1−ǫ (PY n |X n =x , QY n |X n =x ) ≥ −nC P − n n 2 (346) √ which holds simultaneously for all x with kxk ≤ nP . Two remarks are in order: first, the
analysis in [3, Theorem 64] must be done replacing n with n − 1, but this difference is absorbed into b2 . Second, to see that b2 can be chosen independent of x notice that B(P ) in [3, (620)] tends to 0 with P → 0 and hence can be bounded uniformly for all P ∈ [0, Pmax ]. √ Denote the cardinality of the subcode {kxk∞ ≥ λn} by √ Mλ = MP[kxk∞ ≥ λn] .
(347)
Then according to [3, Theorem 30], we get inf β1−ǫ (PY n |X n =x , QY n |X n =x ) ≤ 1 − ǫ′ ,
(348)
x
where the infimum is over the codewords of Mλ -subcode. Applying both (345) and (346) we get
inf −nC x
P−
December 21, 2012
kxk2∞ n
!
v u u + tnV
P−
kxk2∞ n
!
1 3 Q−1 (ǫ)− log n−b2 ≤ − log Mλ +log b1 + log n 2 2 (349)
DRAFT
52
and, further, since the function of ||x||∞ in left-hand side of (349) is monotone in kxk∞ : −nC(P − λ) + Thus, overall
p
nV (P − λ)Q−1 (ǫ) −
log Mλ ≤ nC(P − λ) −
p
3 1 log n − b2 ≤ − log Mλ + log b1 + log n . 2 2
nV (P − λ)Q−1 (ǫ) + 2 log n + b2 + log b1 ,
(350)
(351)
which is equivalent to (343) with b = b1 exp{b2 }.
It remains to show (345). Consider an (n, M, ǫ′ )avg,det -code for the Q-channel and let Mj , j =
1, . . . , n denote the cardinality of the set of all codewords with j ∗ (x) = j. Let ǫ′j denote the minimum possible average probability of error of each such codebook achievable with the maximum likelihood (ML) decoder. Since n 1 X Mj (1 − ǫ′j ) 1−ǫ ≤ M j=1 ′
it suffices to prove 1−
ǫ′j
≤
q
2nP π
Mj
(352)
+2 (353)
for all j. Without loss of generality assume j = 1 in which case the observations Y2n are useless for determining the value of the true codeword. Moreover, the ML decoding regions Di , i = 1, . . . , Mj for each codeword are disjoint intervals in R1 (so that decoder outputs message estimate i whenever Y1 ∈ Di ). Note that for Mj ≤ 2 there is nothing to prove, so assume otherwise. Denote the first coordinates of the Mj codewords by xi , i = 1, . . . , Mj and assume √ √ (without loss of generality) that − nP ≤ x1 ≤ x2 ≤ · · · ≤ xMj ≤ nP and that D2 , . . . DMj −1
December 21, 2012
DRAFT
53
are finite intervals. We have the following chain then 1−
ǫ′j
Mj 1 X PY |X (Di |xi ) = Mj i=1
(354)
Mj −1 1 X 2 PY |X (Di |xi ) + ≤ Mj Mj j=2
(355)
Mj −1 2 Leb(Di ) 1 X ≤ 1 − 2Q + Mj Mj j=2 2 Mj −1 X 2 Mj − 2 1 ≤ 1 − 2Q Leb(Di ) + Mj Mj 2Mj − 4 i=2 !! √ Mj − 2 2 nP 1 − 2Q + ≤ Mj Mj Mj − 2 q 2nP 2 π ≤ + , Mj Mj
(356)
(357)
(358)
(359)
where in (354) PY |X=x = N (x, 1), (355) follows by upper-bounding probability of successful decoding for i = 1 and i = Mj by 1, (356) follows since clearly for a fixed value of the length Leb(Di ) the optimal location of the interval Di , maximizing the value PY |X (Di |xi ), is centered at xi , (357) is by Jensen’s inequality applied to x → 1 − 2Q(x) concave for x ≥ 0, (358) is because
Mj −1
[
i=2
√ √ Di ⊂ [− nP , nP ]
(360)
and Di are disjoint, and (359) is by
1 − 2Q(x) ≤
r
2 x, π
x ≥ 0.
(361)
Thus, (359) completes the proof of (353), (345) and the theorem. Proof of Theorem 20: Notice that for any 0 ≤ λ ≤ P we have C(P − λ) ≤ C(P ) −
λ log e . 2(1 + P )
(362)
p V (P ) and since V (0) = 0 we have for any 0 ≤ λ ≤ P p p p V (P ) λ. (363) V (P − λ) ≥ V (P ) − P
On the other hand, by concavity of
December 21, 2012
DRAFT
54
Thus, taking s = λn in Lemma 25 we get with the help of (362) and (363): n o 1 P[kxk2∞ ≥ s] ≤ exp ∆n − (b1 − b2 n− 2 )s ,
(364)
where we denoted for convenience
log e , 2(1 + P ) p V (P ) −1 Q (ǫ) , = P p = nC(P ) − nV (P )Q−1 (ǫ) + 2 log n − log M + log b .
b1 =
(365)
b2
(366)
∆n
(367)
Note that Lemma 25 only shows validity of (364) for 0 ≤ s ≤ nP , but since for s > nP the left-hand side is zero, the statement actually holds for all s ≥ 0. Then for n ≥ N(P, ǫ) we have 1
(b1 − b2 n− 2 ) ≥
b1 2
(368)
and thus further upper-bounding (364) we get P[kxk2∞
b1 s ≥ s] ≤ exp ∆n − . 2
(369)
Finally, if the code is so large that ∆n < 0, then (369) would imply that P[kxk2∞ ≥ s] < 1 for all s ≥ 0, which is clearly impossible. Thus we must have ∆n ≥ 0 for any (n, M, ǫ)max,det code. The proof concludes by taking s =
2(log 2+∆n ) b1
in (369).
Proof of Theorem 23: To prove (326) we will show the following statement: There exist two constants b0 and b1 such that for any (n, M1 , ǫ) code for the AW GN(P ) channel with codewords x satisfying 1
kxk4 ≥ bn 4
(370)
we have an upper bound on the cardinality: ( ) r √ 4 nV M1 ≤ exp nC + 2 − b1 (b − b0 )2 n , 1−ǫ 1−ǫ provided b ≥ b0 (P, ǫ). From here (326) follows by first upper-bounding (b − b0 )2 ≥
(371) b2 2
− b20 and
then verifying easily that the choice b2 = with b2 = b20 b1 + 2 December 21, 2012
q
V 1−ǫ
√ M 2 √ (nC + b2 n − log ) b1 n 2
(372)
4 + log 1−ǫ takes the right-hand side of (371) below log M2 .
DRAFT
55
To prove (371), denote S =b−
6 1+ǫ
14
(373)
and choose b large enough so that 1√ △ δ = S − 64 1 + P > 0 .
(374)
Then, on one hand we have 1
1
PY n [kY n k4 ≥ Sn 4 ] = P[kX n + Z n k4 ≥ Sn 4 ]
(375) 1
≥ P[kX n k4 − kZ n k4 ≥ Sn 4 ] 1
(376)
≥ P[kZ n k4 ≤ n 4 (S − b)]
(377)
≥
(378)
1+ǫ , 2
where (376) is by the triangle inequality for k·k4 , (377) is by the constraint (370) and (378) is P by the Chebyshev inequality applied to kZ n k44 = nj=1 Zj4 . On the other hand, we have 1 1√ 1 PY∗ n [kY n k4 ≤ Sn 4 ] = PY∗ n [kY n k4 ≤ (6 4 1 + P + δ)n 4 ] 1 1 1√ ≥ PY∗ n [{kY n k4 ≤ 6 4 1 + P n 4 } + {kY n k4 ≤ δn 4 }] 1√ 1 1 ≥ PY∗ n [{kY n k4 ≤ 6 4 1 + P n 4 } + {kY n k2 ≤ δn 4 }] √ ≥ 1 − exp{−b1 δ 2 n} ,
(379)
(380) (381) (382)
where (379) is by the definition of δ in (374), (380) is by the triangle inequality for k·k4 which implies the inclusion {y : kyk4 ≤ a + b} ⊃ {y : kyk4 ≤ a} + {y : kyk4 ≤ b}
(383)
with + denoting the Minkowski sum of sets, (381) is by (317) with p = 2, q = 4; and (382) holds for some b1 = b1 (P ) > 0 by the Gaussian isoperimetric inequality [35] which is applicable since
1 1 1√ (384) PY∗ n [kY n k4 ≤ 6 4 1 + P n 4 ] ≥ 2 P by the Chebyshev inequality applied to nj=1 Yj4 (note: Y n ∼ N (0, 1 + P )n under PY∗ n ). As a
side remark, we add that the estimate of the large-deviations of the sum of 4-th powers of iid √ Gaussians as exp{−O( n)} is order-optimal.
December 21, 2012
DRAFT
56
Together (378) and (382) imply √ β 1+ǫ (PY n , PY∗ n ) ≤ exp{−b1 δ 2 n} .
(385)
2
√ On the other hand, by [3, Lemma 59] we have for any x with kxk2 ≤ nP and any 0 < α < 1: ) ( r 2nV α , (386) βα (PY n |X n =x , PY∗ n ) ≥ exp −nC − 2 α where C and V are the capacity and the dispersion of the AW GN(P ) channel. Then, by convexity in α of the right-hand side of (386) and [14, Lemma 32] we have for any input distribution PX n : βα (PX n Y n , PX n PY∗ n ) ≥
(
α exp −nC − 2
r
2nV α
)
.
(387)
We complete the proof of (371) by invoking Theorem 13 in the form (215) with QY = PY∗ n and α =
1+ǫ : 2
β 1+ǫ (PY n , PY∗ n ) ≥ M1 β 1−ǫ (PX n Y n , PX n PY∗ n ) . 2
(388)
2
Applying bounds (385) and (387) to (388) we conclude that (371) holds with 41 1√ 6 b0 = + 64 1 + P . 1+ǫ
(389)
Next, we proceed to the proof of (327). On one hand, we have n X j=1
n X 4 E Yj = E (Xj + Zj )4
(390)
j=1
=
n X
E[Xj4 + 6Xj2 Zj2 + Zj4 ]
(391)
j=1
≤ E[kxk44 ] + 6nP + 3n ,
(392)
where (390) is by the definition of the AWGN channel, (391) is because X n and Z n are P 2 independent and thus odd terms vanish, (392) is by the power-constraint Xj ≤ nP . On the other hand, applying Proposition 11 with f (y) = −y 4 , θ = 2 and using (122) we obtain14 n X j=1
√ 1 E Yj4 ≥ 3n(1 + P )2 − (nC − log M + b3 n)n 4 ,
(393)
for some b3 = b3 (P, ǫ) > 0. Comparing (393) and (392) statement (327) follows. 14
Of course, a similar Gaussian lower bound holds for any cumulative sum, in particular for any power
December 21, 2012
P
E[|Yj |q ], q ≥ 1. DRAFT
57
We remark that by extending Proposition 11 to expectations like
1 n−1
Pn−1 j=1
2 E[Yj2 Yj+1 ], cf. (187),
we could provide a lower bound similar to (327) for more general 4-th degree polynomials in P x. For example, it is possible to treat the case of p(x) = i,j ai−j x2i x2j , where A = (ai−j ) is a Toeplitz positive semi-definite matrix. We would proceed as in (392), computing E[p(Y n )]
in two ways, with the only difference that the peeled off quadratic polynomial would require application of Theorem 18 instead of the simple power constraint. Finally, we also mention that the method (392) does not work for estimating E[kxk66 ] because we would need an upper bound √ E[kxk44 ] . 3nP 2 , which is not possible to obtain in the context of O( n)-achieving codes as the counterexamples (308) show.
R EFERENCES [1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, Jul./Oct. 1948. [2] S. Shamai and S. Verd´u, “The empirical distribution of good codes,” IEEE Trans. Inf. Theory, vol. 43, no. 3, pp. 836–846, 1997. [3] Y. Polyanskiy, H. V. Poor, and S. Verd´u, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010. [4] A. Tchamkerten, V. Chandar, and G. W. Wornell, “Communication under strong asynchronism,” IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4508–4528, Oct. 2009. [5] R. Ahlswede, P. G´acs, and J. K¨orner, “Bounds on conditional probabilities with applications in multi-user communication,” Probab. Th. Rel. Fields, vol. 34, no. 2, pp. 157–177, 1976. [6] R. Ahlswede and G. Dueck, “Every bad code has a good subcode: a local converse to the coding theorem,” Probab. Th. Rel. Fields, vol. 34, no. 2, pp. 179–182, 1976. [7] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, 1986. [8] Y. Polyanskiy and S. Verd´u, “Relative entropy at the channel output of a capacity-achieving code,” in Proc. 2011 49th Allerton Conference, Allerton Retreat Center, Monticello, IL, USA, Oct. 2011. [9] F. Topsøe, “An information theoretical identity and a problem involving capacity,” Studia Sci. Math. Hungar., vol. 2, pp. 291–292, 1967. [10] J. H. B. Kemperman, “On the Shannon capacity of an arbitrary channel,” Indagationes Math., vol. 77, no. 2, pp. 101–115, 1974. [11] J. Wolfowitz, “The coding of messages subject to chance errors,” Illinois J. Math., vol. 1, pp. 591–606, 1957. [12] U. Augustin, “Ged¨achtnisfreie kan¨ale f¨ur diskrete zeit,” Z. Wahrscheinlichkeitstheorie und Verw. Geb., vol. 6, pp. 10–61, 1966. [13] R. Ahlswede, “An elementary proof of the strong converse theorem for the multiple-access channel,” J. Comb. Inform. Syst. Sci, vol. 7, no. 3, pp. 216–230, 1982.
December 21, 2012
DRAFT
58
[14] Y. Polyanskiy, “Channel coding: non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton Univ., Princeton, NJ, USA, 2010, available: http://people.lids.mit.edu/yp/homepage/. [15] H. V. Poor and S. Verd´u, “A lower bound on the error probability in multihypothesis testing,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1992–1993, 1995. [16] M. V. Burnashev, “Data transmission over a discrete channel with feedback. random transmission time,” Prob. Peredachi Inform., vol. 12, no. 4, pp. 10–30, 1976. [17] I. Csisz´ar and J. K¨orner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.
Cambridge
University Press, 2011. [18] S. Bobkov and F. G¨otze, “Discrete isoperimetric and Poincar´e-type inequalities,” Probab. Theory Relat. Fields, vol. 114, pp. 245–277, 1999. [19] M. Ledoux, “Concentration of measure and logarithmic Sobolev inequalities,” Seminaire de probabilites XXXIII, pp. 120– 216, 1999. [20] J. Wolfowitz, Coding Theorems of Information Theory.
Englewood Cliffs, NJ: Prentice-Hall, 1962.
[21] I. Csisz´ar, “Information-type measures of difference of probability distributions and indirect observation,” Studia Sci. Math. Hungar., vol. 2, pp. 229–318, 1967. [22] ——, “I-divergence geometry of probability distributions and minimization problems,” Ann. Probab., vol. 3, no. 1, pp. 146–158, Feb. 1975. [23] M. Ledoux, “Isoperimetry and Gaussian analysis,” Lecture Notes in Math., vol. 1648, pp. 165–294, 1996. [24] L. Harper, “Optimal numberings and isoperimetric problems on graphs,” J. Comb. Th., vol. 1, pp. 385–394, 1966. [25] M. Talagrand, “An isoperimetric theorem on the cube and the Khintchine-Kahane inequalities,” Proc. Amer. Math. Soc., vol. 104, pp. 905–909, 1988. [26] M. Donsker and S. Varadhan, “Asymptotic evaluation of certain markov process expectations for large time. I. II.” Comm. Pure Appl. Math., vol. 28, no. 1, pp. 1–47, 1975. ¯ [27] K. Marton, “Bounding d-distance by information divergence: a method to prove measure concentration,” Ann. Probab., vol. 24, pp. 857–866, 1990. [28] V. Strassen, “The existence of probability measures with given marginals,” Ann. Math. Stat., vol. 36, pp. 423–439, 1965. [29] C. Villani, Topics in optimal transportation.
Providence, RI:. American Mathematical Society, 2003, vol. 58.
[30] S. Bobkov and F. G¨otze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J. Functional Analysis, vol. 163, pp. 1–28, 1999. [31] M. Talagrand, “Transportation cost for Gaussian and other product measures,” Geom. Funct. Anal., vol. 6, no. 3, pp. 587–600, 1996. [32] I. Gelfand, A. Kolmogorov, and A. Yaglom, “On the general definition of the amount of information (in russian),” Dokl. Akad. Nauk SSSR, vol. 111, no. 4, pp. 745–748, 1956. [33] S. Bobkov and M. Madiman, “Concentration of the information in data with log-concave distributions,” Ann. Probab., vol. 39, no. 4, pp. 1528–1543, 2011. [34] I. J. Schoenberg, “On a theorem of Kirzbraun and Valentine,” Am. Math. Monthly, vol. 60, no. 9, pp. 620–622, Nov. 1953. [35] V. Sudakov and B. Tsirelson, “Extremal properties of half-spaces for spherically invariant measures,” Zap. Nauch. Sem. LOMI, vol. 41, pp. 14–24, 1974.
December 21, 2012
DRAFT