Coding for Computing

Alon Orlitsky†    James R. Roche‡
Abstract
A sender communicates with a receiver who wishes to reliably evaluate a function of their combined data. We show that if only the sender can transmit, the number of bits required is a conditional entropy of a naturally defined graph. We also determine the number of bits needed when the communicators exchange two messages.
1 Introduction

Let $f$ be a function of two random variables $X$ and $Y$. A sender $P_X$ knows $X$, a receiver $P_Y$ knows $Y$, and both want $P_Y$ to reliably determine $f(X,Y)$. How many bits must $P_X$ transmit? Embedding this communication-complexity scenario (Yao [22]) in the standard information-theoretic setting (Shannon [17]), we assume that (1) $f(X,Y)$ must be determined for a block of many independent $(X,Y)$-instances, (2) $P_X$ transmits after observing the whole block of $X$ instances, (3) a vanishing block error probability is allowed, and (4) the problem's rate $L_f(X|Y)$ is the number of bits transmitted for the block, normalized by the number of instances.

Two naive bounds are easily established. Clearly $L_f(X|Y) \ge H(f(X,Y)|Y)$, the number of bits required when $P_X$ knows $Y$ in advance, and by a simple application of the Slepian-Wolf Theorem [19], $L_f(X|Y) \le \min\{H(g(X)|Y) : g(X)$ and $Y$ determine $f(X,Y)\}$. Both bounds are tight in special cases, but not in general. Drawing on rate-distortion results, we show that for every $X$, $Y$, and $f$,
$$L_f(X|Y) = H_G(X|Y). \eqno(1)$$
The graph $G$, defined by Witsenhausen [20] and used in [14, 6], is the characteristic graph of $X$, $Y$, and $f$. $H_G(X|Y)$ is the conditional $G$-entropy of $X$ given $Y$. It extends $H_G(X)$, the $G$-entropy of $X$, or graph entropy of $(G,X)$, defined by Körner [10]. Graph entropy has recently been used to derive an alternative characterization of perfect graphs [5], lower bounds on perfect hashing [7, 11, 12], lower bounds for Boolean formula size [13, 16], and algorithms for sorting [8]. For an excellent review of graph entropy and its applications, see [18].

The lower bound ($\ge$) in (1) is proven via an analogy between $H_G(X|Y)$ and rate-distortion results of Wyner and Ziv [21] and their extension in Csiszár and Körner [4]. The upper bound ($\le$)

Submitted, IEEE Transactions on Information Theory. This version was printed on June 9, 1998.
† AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974.
‡ Center for Communications Research, Thanet Road, Princeton, NJ 08540.
strengthens these rate-distortion results, showing that in certain applications the same rate suffices to achieve small block- and not just bit-error probability. The proof uses robust typicality, a more restrictive form of the asymptotic equipartition property.

We also consider the more general scenario in which the communicators can exchange two messages. $P_Y$ sends a message based on the block of $Y$'s, and $P_X$ responds with a message based on $P_Y$'s message and the block of $X$'s. Again, $P_Y$ must accurately evaluate all $f(X,Y)$'s. $P_X$'s and $P_Y$'s transmission rates are the number of bits they transmit, normalized by the block length. We determine the region of possible rate pairs for all $X$, $Y$, and $f$. The inner bound is derived by generalizing the one-way achievability results. To prove the (matching) outer bound, we extend results of Kaspi and Berger [9] to a larger class of distortion measures.

The description of the two-way rate region is somewhat involved, hence postponed to Section 5. In Section 4 we prove Characterization (1). Section 3 uses two naive bounds and three examples to motivate the results. To begin with, the next section formally defines $L_f(X|Y)$ and $H_G(X|Y)$.
2 Definitions

Characterization (1) relates $L_f(X|Y)$ to $H_G(X|Y)$, where $G$ is the problem's characteristic graph. Each of the following subsections defines one of these quantities.
2.1 $L_f(X|Y)$

$\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ are finite sets, and $f : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$. $\{(X_i,Y_i)\}_{i=1}^{\infty}$ are independent instances of a random pair $(X,Y)$ ranging over $\mathcal{X} \times \mathcal{Y}$. A protocol consists of an encoding function $\varphi : \mathcal{X}^n \to \{0,1\}^k$ and a decoding function $\psi : \{0,1\}^k \times \mathcal{Y}^n \to \mathcal{Z}^n$, where $n \in \mathbb{N}$ is the block length and $k \in \mathbb{N}$ is the number of bits transmitted. The protocol's rate is $k/n$ and its block error probability is
$$p(\psi(\varphi(\mathbf{X}),\mathbf{Y}) \ne f(\mathbf{X},\mathbf{Y})),$$

where $\mathbf{X} \stackrel{\rm def}{=} X_1,\ldots,X_n$, $\mathbf{Y} \stackrel{\rm def}{=} Y_1,\ldots,Y_n$, and $f(\mathbf{X},\mathbf{Y}) \stackrel{\rm def}{=} f(X_1,Y_1),\ldots,f(X_n,Y_n)$. A rate $r$ is achievable if for every $\epsilon > 0$ and $n \in \mathbb{N}$ there is a protocol with rate¹ $\le r$, block length $\ge n$, and block-error probability $< \epsilon$. The rate $L_f(X|Y)$ of $(X,Y)$ and $f$ is the infimum of the set of achievable rates.

Intuitively, the sender $P_X$ knows the $X_i$'s, the receiver $P_Y$ knows the $Y_i$'s, and both want $P_Y$ to reliably evaluate the $f(X_i,Y_i)$'s. $P_X$ considers the $n$-instance block $\mathbf{X}$ and uses $k$ bits to transmit $\varphi(\mathbf{X})$ at an average rate of $k/n$ bits per copy. $P_Y$ decides that $f(\mathbf{X},\mathbf{Y})$ is $\psi(\varphi(\mathbf{X}),\mathbf{Y})$. He is correct for the block if the two vectors agree in every coordinate, and errs otherwise. $L_f(X|Y)$ is the minimum number of bits that $P_X$ must transmit per instance for $P_Y$'s block-error probability to be arbitrarily small.
¹ Inequality allows for irrational rates.
2.2 $H_G(X|Y)$

A set of vertices of a graph $G$ is independent if no two are connected to each other. $\Gamma(G)$ is the collection of independent sets of $G$. Let $X$ be a random variable and let a graph $G$ be defined over the support set of $X$. Körner [10] defined the $G$-entropy of $X$ (or graph entropy of $(G,X)$) to be
$$H_G(X) \stackrel{\rm def}{=} \min_{X \in W \in \Gamma(G)} I(W;X).$$

Elaboration is in order. $X$ induces a probability distribution over $G$'s vertices. For every vertex $x$ we select a transition probability distribution $p(w|x)$ ranging over the independent sets containing $x$: $p(w|x) \ge 0$ and $\sum_{w \ni x} p(w|x) = 1$. This specifies a joint distribution of $X$ and a random variable $W$ ranging over the independent sets and always containing $X$. The $G$-entropy of $X$ is the smallest possible mutual information between $X$ and $W$. By the Data-Processing Inequality, this minimization can be restricted to $W$'s ranging over maximal independent sets of $G$. Note also that $0 \le I(W;X) \le H(X)$ for all $W$, hence $0 \le H_G(X) \le H(X)$ for all $G$ and $X$.
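For a small graph, the minimization above is a finite convex program and can be checked by brute force. The following sketch (an illustration of ours, not part of the paper; all names are ours) grids over the single free transition probability for $X$ uniform over $\{1,2,3\}$ and $G$ consisting of the single edge $1-3$, the instance treated in Example 1 below.

```python
import math

def h(p):
    """Entropy contribution -p log2 p, with 0 log 0 = 0."""
    return 0.0 if p <= 0 else -p * math.log2(p)

def mutual_information_W_X(a):
    """I(W;X) for X uniform on {1,2,3} and G the single edge 1-3.
    W ranges over the maximal independent sets {1,2} and {2,3};
    x=1 forces W={1,2}, x=3 forces W={2,3}, and a = p(W={1,2} | X=2)."""
    p_w1 = 1/3 + a/3            # p(W = {1,2})
    p_w2 = 1/3 + (1 - a)/3      # p(W = {2,3})
    H_W = h(p_w1) + h(p_w2)
    H_W_given_X = (h(a) + h(1 - a)) / 3   # only x = 2 is random
    return H_W - H_W_given_X

# Grid search over the single free parameter a.
best = min(mutual_information_W_X(a / 1000) for a in range(1001))
print(round(best, 4))  # 0.6667, i.e. H_G(X) = 2/3, attained at a = 1/2
```

The grid search reproduces the value $2/3$ that Example 1 derives by convexity.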
Example 1 For an empty $G$, the set of all vertices is independent and always contains $X$, hence $H_G(X) = 0$. For a complete $G$, the only independent sets are singletons, hence $W = \{X\}$, yielding $H_G(X) = I(W;X) = H(X)$. More interestingly, let $X$ be uniformly distributed over $\{1,2,3\}$ and let $G$ consist of the single edge $1-3$. $\Gamma(G)$ contains only two maximal independent sets: $\{1,2\}$ and $\{2,3\}$. By convexity of mutual information, $I(W;X)$ is minimized when $p(\{1,2\}|2) = p(\{2,3\}|2) = \frac12$. Therefore, $H_G(X) = H(W) - H(W|X) = 1 - \frac13 = \frac23$. □

We extend the definition of graph entropy to include conditioning. Let $(X,Y)$ be a random pair and let a graph $G$ be defined over the support set of $X$. The $G$-entropy of $X$ given $Y$ is
$$H_G(X|Y) \stackrel{\rm def}{=} \min_{\substack{W - X - Y \\ X \in W \in \Gamma(G)}} I(W;X|Y), \eqno(2)$$
where $W - X - Y$ indicates that $W, X, Y$ is a Markov chain: $p(w|x,y) = p(w|x)$ for every $w, x, y$. As with unconditional graph entropy, the minimization can be restricted to $W$'s ranging over maximal independent sets of $G$. Also, $0 \le I(W;X|Y) \le H(X|Y)$ for all $W$, hence $0 \le H_G(X|Y) \le H(X|Y)$ for all $G$ and $(X,Y)$.
Example 2 For an empty $G$, we can again take $W = \mathcal{X}$, hence $H_G(X|Y) = 0$. For a complete $G$, we must again have $W = \{X\}$, yielding $H_G(X|Y) = I(W;X|Y) = H(X|Y)$. We will later use the following example: $(X,Y)$ is uniformly distributed over $\{(x,y) : x,y \in \{1,2,3\},\ x \ne y\}$ and $G$ consists of the single edge $1-3$. $\Gamma(G)$ contains only two maximal independent sets: $\{1,2\}$ and $\{2,3\}$. By convexity, $I(W;X|Y)$ is minimized when $p(\{1,2\}|2) = p(\{2,3\}|2) = \frac12$. Therefore, $H_G(X|Y) = H(W|Y) - H(W|XY) = \frac13 + \frac23 h(\frac14) - \frac13 = \frac23 h(\frac14)$. □
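The conditional minimization can be checked numerically the same way. The sketch below (ours, for illustration) evaluates $I(W;X|Y)$ for the pair $(X,Y)$ of Example 2 as a function of the one free parameter $a = p(\{1,2\}\,|\,2)$ and confirms that the minimum is $\frac23 h(\frac14) \approx .541$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0 or p >= 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def I_W_X_given_Y(a):
    """I(W;X|Y) for (X,Y) uniform on {(x,y) : x != y}, G the edge 1-3,
    and a = p(W={1,2} | X=2); x=1 and x=3 force W deterministically.
    Given Y=1, X is uniform on {2,3}, so p(W={1,2} | Y=1) = a/2;
    given Y=2 it is 1/2; given Y=3 it is (1+a)/2."""
    H_W_given_Y = (h2(a / 2) + 1.0 + h2((1 + a) / 2)) / 3
    H_W_given_XY = h2(a) / 3      # W - X - Y: conditioning on X suffices
    return H_W_given_Y - H_W_given_XY

best = min(I_W_X_given_Y(a / 1000) for a in range(1001))
print(round(best, 3))  # 0.541, i.e. (2/3) h(1/4), attained at a = 1/2
```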
2.3 $G$

The characteristic graph $G$ of $X$, $Y$, and $f$ was defined by Witsenhausen [20]. Its vertex set is the support set of $X$, and distinct vertices $x, x'$ are connected if there is a $y$ such that $p(x,y), p(x',y) > 0$ and $f(x,y) \ne f(x',y)$.
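In code, the characteristic graph is a few lines. The sketch below (ours) connects two $x$'s exactly when some positive-probability $y$ separates them; only the support of $p$ matters, so the distribution is passed as the sets $\{y : p(x,y) > 0\}$.

```python
from itertools import combinations

def characteristic_graph(support, f):
    """Edge set of Witsenhausen's characteristic graph: distinct vertices
    x, x' are connected iff some y has p(x,y) > 0, p(x',y) > 0 and
    f(x,y) != f(x',y).  support[x] is the set {y : p(x,y) > 0}."""
    edges = set()
    for x1, x2 in combinations(sorted(support), 2):
        if any(f(x1, y) != f(x2, y) for y in support[x1] & support[x2]):
            edges.add((x1, x2))
    return edges

# The card-drawing instance of Example 5: p(x,y) > 0 iff x != y,
# and f(x,y) indicates whether x beats y.
support = {x: {y for y in (1, 2, 3) if y != x} for x in (1, 2, 3)}
print(characteristic_graph(support, lambda x, y: int(x > y)))  # {(1, 3)}
```

Only $1$ and $3$ are connected, confirming the single-edge graph used throughout Section 3.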
3 Motivation

We motivate the characterization $L_f(X|Y) = H_G(X|Y)$ in (1) via three examples and two naive bounds that are tight in some special cases, but not in general.
3.1 Naive bounds
Lemma 1 (Naive lower bound) For all $(X,Y)$ and $f$, $L_f(X|Y) \ge H(f(X,Y)|Y)$.
Proof: $P_X$ would need that many bits per instance even if he knew $Y$ in advance. □
$g(X)$, where $g$ is a function defined over $\mathcal{X}$, and $Y$ determine $f(X,Y)$ if there is a function $h$ such that

$$p[h(g(X),Y) \ne f(X,Y)] = 0.$$

For example, if $X, Y \in \mathbb{N}$ and $f(X,Y) = X + Y \bmod 4$, then $g(X) \stackrel{\rm def}{=} X \bmod 4$ and $Y$ determine $f(X,Y)$. If, in addition, $p((X+Y) \equiv 1 \bmod 2) = 0$, then $g(X) \stackrel{\rm def}{=} \lfloor (X \bmod 4)/2 \rfloor$ and $Y$ determine $f(X,Y)$ as well.
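On a finite support the determination condition can be tested exhaustively. The sketch below (ours) checks both claims of the preceding paragraph, truncating $\mathbb{N}$ to $\{0,\ldots,15\}$ for the test.

```python
def f(x, y):
    return (x + y) % 4

def determines(pairs, g, h):
    """True iff h(g(x), y) reproduces f(x, y) on every support pair,
    i.e. g(X) and Y determine f(X, Y) via the reconstruction h."""
    return all(h(g(x), y) == f(x, y) for x, y in pairs)

# Full support: X mod 4 suffices, X mod 2 does not.
full = [(x, y) for x in range(16) for y in range(16)]
print(determines(full, lambda x: x % 4, lambda g, y: (g + y) % 4))  # True
print(determines(full, lambda x: x % 2, lambda g, y: (g + y) % 4))  # False

# Support restricted to X + Y even: one bit of X mod 4 can be dropped,
# since X mod 2 = Y mod 2 lets the receiver restore it.
even = [(x, y) for x, y in full if (x + y) % 2 == 0]
g_even = lambda x: (x % 4) // 2
h_even = lambda g, y: (2 * g + y % 2 + y) % 4
print(determines(even, g_even, h_even))  # True
```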
Lemma 2 (Naive upper bound) For all $(X,Y)$ and $f$: $L_f(X|Y) \le \min\{H(g(X)|Y) : g(X)$ and $Y$ determine $f\}$.
Proof: By the Slepian-Wolf theorem [19], $P_X$ can transmit that many bits per instance to reliably convey $g(X_1),\ldots,g(X_n)$ to $P_Y$. $P_Y$ can then determine $f(\mathbf{X},\mathbf{Y})$. □
3.2 Examples

We evaluate the two bounds via three, progressively more illuminating, examples. In the first, the naive bounds coincide, hence the graph-entropy characterization in (1) provides no additional information. In the second, the naive bounds diverge and (1) shows that the upper bound is tight. In the third, (1) agrees with neither bound, showing that optimal coding can save over the naive upper-bound approach. While the examples are presented in the single-instance description, the solutions assume the usual block interpretation.
Example 3 A satellite has ten $k$-bit parameters to report to base. Base knows eight of the parameters (satellite does not know which ones). How many bits must the satellite transmit for base to know all ten parameters: the $10k$ bits it would need if base knew nothing at all, or the $2k$ bits it would need if it knew which two parameters base was missing?

Formally, $X^1,\ldots,X^{10}$ are random $k$-bit strings representing the parameters, and $I', I''$ ($1 \le I' < I'' \le 10$) are the random indices of the parameters not known to base. All variables are distributed uniformly and independently. $X = (X^1,\ldots,X^{10})$ and $Y = (Y^1,\ldots,Y^{10})$, where $Y^i = X^i$ for the eight indices $i \notin \{I',I''\}$ and $Y^i = *$ for the two indices $i \in \{I',I''\}$. Finally, $f(X,Y) = (X^{I'},X^{I''})$. It is easy to verify that given $Y$, there is a 1-1 correspondence between $X$ and $f(X,Y)$. Hence $H(X|Y) = H(f(X,Y)|Y) = 2k$, and the above bounds imply
$$L_f(X|Y) = 2k.$$

For a simple illustration, imagine that base is missing just one parameter. Satellite transmits $k$ bits representing the bit-wise parity of all 10 parameters. Base can deduce $X$, and hence $f(X,Y)$. In this atypical case, the minimal rate can be achieved without error using a single copy of $(X,Y)$. When more parameters are missing, multiple copies and a nonzero (albeit vanishing) error probability may be unavoidable. □
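The parity scheme in the illustration above is one line of code: satellite XORs its ten parameters, and base XORs the transmitted parity with the nine parameters it knows. A sketch (ours; the parameter values and the missing index are arbitrary):

```python
from functools import reduce
from operator import xor
import random

k = 16                                               # bits per parameter
random.seed(1)
params = [random.getrandbits(k) for _ in range(10)]  # satellite's ten parameters
missing = 7                                          # the one index base lacks

parity = reduce(xor, params)                         # the k bits satellite sends
known = [p for i, p in enumerate(params) if i != missing]
recovered = reduce(xor, known, parity)               # base cancels the nine known

print(recovered == params[missing])                  # True
```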
Example 4 As in Example 3, satellite has ten $k$-bit parameters to report to base and base needs to learn only two parameters (only base knows which ones). This time, however, base does not know the values of the other eight. How many bits must satellite transmit now?

As before, $H(f(X,Y)|Y) = 2k$. But now if $g(X)$ and $Y$ determine $f(X,Y)$, then $H(g(X)|Y) = H(X|Y) = 10k$. Hence the bounds yield:

$$2k \le L_f(X|Y) \le 10k.$$

Yet a simple calculation shows that
$$H_G(X|Y) = 10k,$$
hence (1) implies that the upper bound is tight and the lower bound is not. Note that when, as here, $Y$ is independent of $X$, the same result can be obtained without using (1). However, we do not know of more direct proofs when $X$ and $Y$ are correlated. □
Example 5 A top hat contains three cards, labeled 1, 2, and 3. Each of $P_X$ and $P_Y$ selects one card (without replacement). What is the minimum number of bits that $P_X$ must transmit for $P_Y$ to determine who has the higher-valued card? For $x, y \in \{1,2,3\}$,

$$p(x,y) = \begin{cases} \frac16 & \text{if } x \ne y, \\ 0 & \text{if } x = y, \end{cases} \qquad \text{and} \qquad f(x,y) = \begin{cases} 1 & \text{if } x > y, \\ 0 & \text{if } x < y. \end{cases}$$

It is easy to verify that $H(f(X,Y)|Y) = \frac13$ and that $g(x) \stackrel{\rm def}{=} \begin{cases} 0 & \text{if } x = 1, \\ 1 & \text{if } x \in \{2,3\}, \end{cases}$ and $y$ determine $f(x,y)$. $H(g(X)|Y) = 2/3$ bits, and this is the lowest entropy for all $g$'s in the naive upper bound.
Hence the bounds yield

$$\frac13 \le L_f(X|Y) \le \frac23.$$

The characteristic graph $G$ consists of the vertices $\{1,2,3\}$ and the edge $1-3$, hence Example 2 showed that

$$H_G(X|Y) = \frac23 h\!\left(\frac14\right) \approx .541.$$

Therefore (1) implies that neither of the naive bounds is tight and that $.125$ bits per instance can be saved over the naive encoding. □
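The quantities in this example are easily verified numerically. The sketch below (ours) computes the two naive bounds and the value $\frac23 h(\frac14)$ directly from the distribution.

```python
import math

def h2(p):
    return 0.0 if p <= 0 or p >= 1 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# (X,Y) uniform over the six pairs with x != y; f indicates x > y;
# g is the binary function from the text.
pairs = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3) if x != y]
f = lambda x, y: int(x > y)
g = lambda x: int(x >= 2)

def cond_entropy_given_Y(value):
    """H(value(X,Y) | Y) for (X,Y) uniform on `pairs`."""
    total = 0.0
    for y0 in (1, 2, 3):
        xs = [x for x, y in pairs if y == y0]
        vals = [value(x, y0) for x in xs]
        probs = [vals.count(v) / len(xs) for v in set(vals)]
        total += (len(xs) / len(pairs)) * -sum(p * math.log2(p) for p in probs)
    return total

print(round(cond_entropy_given_Y(f), 4))                   # 0.3333: lower bound
print(round(cond_entropy_given_Y(lambda x, y: g(x)), 4))   # 0.6667: upper bound
print(round((2 / 3) * h2(0.25), 4))                        # 0.5409: H_G(X|Y)
```

The graph-entropy value sits strictly between the two naive bounds, as the example asserts.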
3.3 Special cases

In special cases one or both naive bounds are tight.

1. When $p(x,y) > 0$ for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$, the upper bound is tight. Write $x \equiv x'$ if $x$ is not connected to $x'$ in $G$, namely, if $f(x,y) = f(x',y)$ for all $y$ such that $p(x,y), p(x',y) > 0$. In general $\equiv$ can be arbitrarily complicated, but when, as we assume here, all $(X,Y)$ values are possible, $\equiv$ is an equivalence relation. Let $[x]$ be $x$'s equivalence class. It is easy to verify that $[X]$ minimizes $H(g(X)|Y)$ among all $g$'s such that $g(X)$ and $Y$ determine $f(X,Y)$. The naive upper bound is therefore $L_f(X|Y) \le H([X]|Y)$. In this special case, the upper bound is tight. However, except when $[X]$ and $Y$ are independent, the only way we know of proving that fact (i.e., establishing a matching lower bound) is via the general result. $G$ is a complete multipartite graph connecting all $\mathcal{X}$ pairs belonging to different equivalence classes. A simple extension of Example 2 shows that for all such graphs $H_G(X|Y) = H([X]|Y)$. Hence,
$$L_f(X|Y) = H([X]|Y).$$

In the further restricted case where all $(X,Y)$ values are possible and for every $x \ne x'$ there is a $y$ such that $f(x,y) \ne f(x',y)$, we have $[X] = \{X\}$, hence
$$L_f(X|Y) = H(X|Y).$$

This was the case in Example 4. We provide two additional examples. The first describes a typical scenario where the amount of information required by $P_Y$ depends on what he already knows. While the one-way results are disappointing, we will revisit this scenario in Section 5 and show that interesting things happen when two messages are allowed. The second example relates the current result to a well-studied communication-complexity problem.
Example 6 Consider satellite and base once more. Satellite knows a random variable $X$ uniformly distributed over $\{1,\ldots,m\}$ and base knows an independent Bernoulli-$p$ random variable $Y$ ($Y \in \{0,1\}$, $p(Y=1) = p$). If $Y = 1$ ("alert") base wants to know $X$. If $Y = 0$ ("relax"), it doesn't. Though not necessarily useful for the proof, it may be worthwhile to note that base computes $f(x,y) = x \cdot y$. Since all values of $(X,Y)$ are possible, and every two $X$ values are distinguishable, satellite must transmit the maximum number of bits even when $p$ is small:
$$L_f(X|Y) = H(X|Y) = \log m.$$

This of course is disappointing. When $p$ is small, $P_Y$ needs to learn $X$ only in a small fraction of the instances. Yet $P_X$ must transmit almost all the bits. We will return to this example in Section 5, where we show that in two-message communication coding can save over simple schemes. □
Example 7 Consider the equality function

$$f(x,y) \stackrel{\rm def}{=} \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise}, \end{cases}$$

where all $(X,Y) \in \mathcal{X} \times \mathcal{Y}$ values are possible. Then

$$L_f(X|Y) = H(X|Y). \eqno(3)$$
Rabin and Yao [15] showed that $O(\log\log|\mathcal{X}| + \log(\frac1\epsilon))$ bits suffice for $P_Y$ to evaluate a single instance of the equality function with error probability $\epsilon$. For fixed $\epsilon$ this number can be smaller than $H(X|Y)$. Equation (3) shows that when only arbitrarily small error probability is allowed, exactly $H(X|Y)$ bits per instance are required even for multiple instances. □

2. When $f(X,Y) = X$, both bounds are tight. The Slepian-Wolf Theorem says that
$$L_f(X|Y) = H(X|Y).$$

This equality (and the implied tightness of the bounds) also follows from the fact that the naive bounds coincide. For a still more complicated proof, we employ (1). Distinct $x$ and $x'$ are connected in the characteristic graph $G$ if $p(x,y), p(x',y) > 0$ for some $y \in \mathcal{Y}$. Hence, for every $y$, the set $\{x : p(x,y) > 0\}$ of possible $x$'s is a clique in $G$. Since the intersection of a clique and an independent set is a singleton, $Y$ and an independent set containing $X$ determine $X$. Hence
$$H_G(X|Y) = \min_{\substack{W - X - Y \\ X \in W \in \Gamma(G)}} I(W;X|Y) = H(X|Y) - \max_{\substack{W - X - Y \\ X \in W \in \Gamma(G)}} H(X|W,Y) = H(X|Y).$$
4 Proofs

Recall that

$$H_G(X|Y) \stackrel{\rm def}{=} \min_{\substack{W - X - Y \\ X \in W \in \Gamma(G)}} I(W;X|Y).$$

We prove
Theorem 1 For every $X, Y$, and $f$, $L_f(X|Y) = H_G(X|Y)$. □
The next subsection establishes the lower bound ($\ge$). The upper bound ($\le$) is proved in Subsection 4.2.
4.1 Lower bound

To obtain the lower bound we relax the error demands. Instead of insisting on small block-error probability, we require only that $Z_i = f(X_i,Y_i)$ for all but a small fraction of the coordinates. Applying rate-distortion results, we show that $H_G(X|Y)$ bits per copy are needed even under this weaker error requirement.

We need the following rate-distortion definitions and results. For more details see Cover and Thomas [3]. Let $d : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ be a distortion measure over two finite alphabets. The (normalized) distortion between two sequences $\mathbf{x} = x_1,\ldots,x_n \in \mathcal{X}^n$ and $\mathbf{z} = z_1,\ldots,z_n \in \mathcal{Z}^n$ is

$$d(\mathbf{x},\mathbf{z}) \stackrel{\rm def}{=} \frac1n \sum_{i=1}^n d(x_i,z_i).$$
The expected distortion between two random variables $X$ and $Z$ is

$$d(X,Z) \stackrel{\rm def}{=} \sum_{x,z} p(x,z)\, d(x,z).$$
The (normalized) expected distortion between two random sequences $\mathbf{X} = X_1,\ldots,X_n \in \mathcal{X}^n$ and $\mathbf{Z} = Z_1,\ldots,Z_n \in \mathcal{Z}^n$ is therefore

$$d(\mathbf{X},\mathbf{Z}) = \sum_{\mathbf{x},\mathbf{z}} p(\mathbf{x},\mathbf{z})\, d(\mathbf{x},\mathbf{z}) = \frac1n \sum_{i=1}^n d(X_i,Z_i).$$

Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be independent copies of a random pair $(X,Y) \in \mathcal{X} \times \mathcal{Y}$. $P_X$ knows $\mathbf{X} = X_1,\ldots,X_n$ while $P_Y$ knows $\mathbf{Y} = Y_1,\ldots,Y_n$. How many bits must $P_X$ transmit for $P_Y$ to determine $\mathbf{X}$ within distortion $D$? Specifically, as in the introduction, an $(n,k)$ protocol for $n, k \in \mathbb{N}$ consists of an encoding function $\varphi : \mathcal{X}^n \to \{0,1\}^k$ and a decoding function $\psi : \{0,1\}^k \times \mathcal{Y}^n \to \mathcal{Z}^n$. The protocol's rate is $k/n$ and its distortion is $d(\mathbf{X}, \psi(\varphi(\mathbf{X}),\mathbf{Y}))$. A rate-distortion pair $(R,D)$ is achievable if for every $\epsilon > 0$ there is a protocol with rate $\le R$ and distortion $< D + \epsilon$. The rate-distortion function of $(X,Y)$ and $d$ maps every distortion $D$ to $R(D)$, the infimum of the rates $R$ such that $(R,D)$ is an achievable pair.
In a seminal paper, Wyner and Ziv [21] showed that for every $(X,Y)$ and $d$,

$$R(D) = \min_{\substack{V - X - Y \\ \exists g\ d(X,\,g(V,Y)) \le D}} I(V;X|Y),$$

where the function $g$ is defined over the support set of $(V,Y)$. We need a lesser known extension of this result, allowing the distortion $d : \mathcal{X} \times \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$ to depend on $Y$ as well as on $X$ and $Z$. All other definitions are modified accordingly. For example, the protocol's distortion is $d(\mathbf{X},\mathbf{Y},\psi(\varphi(\mathbf{X}),\mathbf{Y}))$. Csiszár and Körner [4] showed that for every $(X,Y)$ and $d$,

$$R(D) = \min_{\substack{V - X - Y \\ \exists g\ d(X,Y,g(V,Y)) \le D}} I(V;X|Y). \eqno(4)$$
To express our problem in rate-distortion terms, let

$$d(x,y,z) = \begin{cases} 0 & \text{if } z = f(x,y), \\ 1 & \text{otherwise}, \end{cases}$$

so that for every random variable $Z \in \mathcal{Z}$,

$$d(X,Y,Z) = p(Z \ne f(X,Y)).$$

For a sequence of variables,
$$d(\mathbf{X},\mathbf{Y},\mathbf{Z}) = \frac1n \sum_{i=1}^n d(X_i,Y_i,Z_i) = \frac1n \sum_{i=1}^n p(Z_i \ne f(X_i,Y_i))$$

is the expected fraction of indices $i$ for which $Z_i \ne f(X_i,Y_i)$. Viewing $\mathbf{Z}$ as an estimate of $f(\mathbf{X},\mathbf{Y})$, the distortion $d(\mathbf{X},\mathbf{Y},\mathbf{Z})$ is the bit-error probability. We are especially interested in $R(0)$, the smallest rate ensuring that the bit-error probability $d(\mathbf{X},\mathbf{Y},\mathbf{Z})$ diminishes to 0. In essence, the next lemma shows that this requirement is weaker than that of vanishing block-error probability.
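Concretely, for fixed sequences the distortion is just the fraction of miscomputed coordinates. A small sketch (ours; the sequences are arbitrary) of this distortion measure:

```python
def distortion(x, y, z, f):
    """d(x, y, z): fraction of coordinates where z_i != f(x_i, y_i)."""
    return sum(zi != f(xi, yi) for xi, yi, zi in zip(x, y, z)) / len(x)

f = lambda a, b: int(a > b)
x, y = [3, 1, 2, 3], [1, 2, 2, 2]
z_good = [1, 0, 0, 1]          # agrees with f(x_i, y_i) everywhere
z_off  = [1, 1, 0, 1]          # wrong in one of the four coordinates

print(distortion(x, y, z_good, f))   # 0.0  (block error would also be 0)
print(distortion(x, y, z_off, f))    # 0.25 (block error would be 1)
```

The second line illustrates Lemma 3's point: a single wrong coordinate makes the block error 1 but contributes only $1/n$ to the distortion.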
Lemma 3 For every $X, Y$, and $f$, $L_f(X|Y) \ge R(0)$.

Proof: For every $i \in \{1,\ldots,n\}$,

$$p(\mathbf{Z} \ne f(\mathbf{X},\mathbf{Y})) \ge p(Z_i \ne f(X_i,Y_i)).$$

Therefore,

$$p(\mathbf{Z} \ne f(\mathbf{X},\mathbf{Y})) \ge \frac1n \sum_{i=1}^n p(Z_i \ne f(X_i,Y_i)) = d(\mathbf{X},\mathbf{Y},\mathbf{Z}).$$

It follows that a protocol's distortion is smaller than its block-error probability, hence every rate achievable with vanishing block error is also achievable with zero distortion. □
Theorem 2 For every $(X,Y)$ and $f$, $R(0) = H_G(X|Y)$.

Proof: In view of (4) and the definition of $H_G(X|Y)$, we need to prove that

$$\min_{\substack{V - X - Y \\ \exists g\ Ed(X,Y,g(V,Y)) \le 0}} I(V;X|Y) = \min_{\substack{W - X - Y \\ X \in W \in \Gamma(G)}} I(W;X|Y).$$
$\le$: We show that if $X \in W \in \Gamma(G)$ then there is a (partial) function $g$ over $\Gamma(G) \times \mathcal{Y}$ such that $f(x,y) = g(w,y)$ whenever $p(w,x,y) > 0$, hence $Ed(X,Y,g(W,Y)) = 0$. Let $w \in \Gamma(G)$ and $y \in \mathcal{Y}$. If $p(x,y) = 0$ for all $x \in w$, leave $g$ undefined. Otherwise, let $g(w,y) \stackrel{\rm def}{=} f(x_0,y)$ for any (say the first) $x_0 \in w$ such that $p(x_0,y) > 0$. By definition of $w$, $f(x,y) = f(x_0,y) = g(w,y)$ for all $x \in w$ such that $p(x,y) > 0$, hence (as $X \in W$) whenever $p(w,x,y) > 0$.

$\ge$: Suppose that $V - X - Y$ and that there exists $g$ such that $Ed(X,Y,g(V,Y)) \le 0$. We define $W$ such that $X \in W \in \Gamma(G)$ and $I(W;X|Y) \le I(V;X|Y)$. Let $p(v,x,y)$ be the probability distribution underlying $(V,X,Y)$. Set $w(v) = \{x : p(v,x) > 0\}$ to be the set of $X$ values that could result in $v$, and define the Markov chain $W - V - XY$ by

$$p(w|v,x,y) = \begin{cases} 1 & \text{if } w = w(v), \\ 0 & \text{otherwise.} \end{cases}$$

The following claims, in order, show that $X \in W \in \Gamma(G)$, that $W - X - Y$, and that $I(W;X|Y) \le I(V;X|Y)$.
Claim 1 $X \in W \in \Gamma(G)$.

Proof: $X \in W$: $p(w,x) > 0$ implies a $v$ such that $w = w(v)$ and $p(v,x) > 0$; by definition of $w(v)$, $x \in w(v) = w$.

$W \in \Gamma(G)$: We want to show that if $p(w) > 0$ then $f(x',y) = f(x'',y)$ for all $x', x'' \in w$ such that $p(x',y), p(x'',y) > 0$. By definition, $p(w) > 0$ implies a $v$ such that $w = w(v)$. If $x \in w$ then $p(v,x) > 0$, hence, by Markovity, $p(x,y) > 0$ implies $p(v,x,y) > 0$. To achieve $d(X,Y,g(V,Y)) \le 0$, we must have $g(v,y) = f(x,y)$. It follows that if $x', x'' \in w$ and $p(x',y), p(x'',y) > 0$ then $f(x',y) = g(v,y) = f(x'',y)$. □
Claim 2 $W - V - XY$ and $V - X - Y$ imply $W - X - Y$.

Proof: $W - V - XY$ alone implies $W - V - X$, as

$$H(W|VX) \le H(W|V) = H(W|VXY) \le H(W|VX),$$

hence $H(W|VX) = H(W|V)$. Therefore,

$$p(w|x) = \sum_v p(v|x)\, p(w|vx) = \sum_v p(v|x)\, p(w|v) = \sum_v p(v|xy)\, p(w|vxy) = p(w|xy). \qquad \Box$$
Claim 3 $W - V - XY$ implies $I(W;X|Y) \le I(V;X|Y)$.

Proof: First observe that $W - VY - X$, as

$$H(W|VXY) \le H(W|VY) \le H(W|V) = H(W|VXY).$$

Hence,

$$I(W;X|Y) = H(X|Y) - H(X|W,Y) \le H(X|Y) - H(X|W,V,Y) = H(X|Y) - H(X|V,Y) = I(V;X|Y),$$

completing the proof of the theorem. □

Lemma 3 and Theorem 2 prove the lower bound in Theorem 1.
4.2 Upper bound

In the previous subsection we saw that $R(0) = H_G(X|Y)$. Namely, $P_X$ must transmit about $H_G(X|Y)$ bits per copy to achieve vanishing bit-error probability. We now show that the same rate suffices to achieve vanishing block-error probability as well.

The first part of the proof of Theorem 2 constructs a partial function $g : \Gamma(G) \times \mathcal{Y} \to \mathcal{Z}$ such that $f(x,y) = g(w,y)$ whenever $x \in w$ and $p(x,y) > 0$; namely, for all $w \in \Gamma(G)$,

$$p[f(X,Y) = g(w,Y)\,|\,X \in w] = 1.$$

The protocol communicates to $P_Y$ a sequence $\mathbf{W} = W_1,\ldots,W_n$ such that $X_i \in W_i$ for all $i$. $P_Y$ can therefore evaluate $f(X_i,Y_i)$ for every $i \in \{1,\ldots,n\}$.

The proof uses various properties of robust typicality, defined and analyzed in Appendix B. For a random variable $X$ distributed over a support set $\mathcal{X}$ and for $n \in \mathbb{N}$ and $\delta > 0$, we define a $\delta$-robustly-typical set $T_{X,\delta} \subseteq \mathcal{X}^n$. Among the properties of robust typicality is that if $\mathbf{x} \in T_{X,\delta}$ then $p(x) = 0$ implies that for all $i$, $x_i \ne x$. Let $p(w|x)$ achieve $H_G(X|Y)$ and let $(W,X,Y)$ be distributed according to the implied joint distribution $p(w,x,y) = p(x,y)\,p(w|x)$. By definition, $x \notin w$ implies that $p(w,x) = 0$. Hence, if $(\mathbf{w},\mathbf{X}) \in T_{(W,X),\delta}$ for some $\delta > 0$, then $X_i \in w_i$ for all $i$.
Lemma 4 If $(\mathbf{w},\mathbf{X}) \in T_{(W,X),\delta}$ for some $\delta > 0$, then for all $i$,

$$p[f(X_i,Y_i) = g(w_i,Y_i)] = 1.$$

Hence, if $P_Y$ knows any $\mathbf{w}$ robustly typical with $\mathbf{X}$, he can compute $f(X_i,Y_i)$ for all $i$. □
The protocol uses roughly $I(W;X|Y) = H_G(X|Y)$ bits to convey to $P_Y$ a sequence $\mathbf{w}$ that is robustly typical with $\mathbf{X}$. It strengthens similar (strong-typicality) results of Wyner and Ziv [21] and Csiszár and Körner [4]. The proof follows the outlines in Berger [2] and Cover and Thomas [3]. It also addresses subtleties arising in the detailed proof.

For every rate $r > H_G(X|Y)$ we exhibit a collection of deterministic protocols. Each protocol's rate is $\le r$, and the collection's average block-error probability is exponentially small. Standard arguments imply a single deterministic protocol with the same rate and block-error probability.

Before formally presenting the protocol, we need a number of definitions. Let

$$0 < \epsilon < r - H_G(X|Y).$$

Set

$$s \stackrel{\rm def}{=} \left\lceil 2^{(I(W;X)+\epsilon/2)n} \right\rceil \qquad \text{and} \qquad t \stackrel{\rm def}{=} \left\lceil 2^{(I(W;X)-I(W;Y)+\epsilon)n} \right\rceil.$$

For $j \in \{1,\ldots,s\}$ let $\mathbf{W}^j \stackrel{\rm def}{=} W_1^j,\ldots,W_n^j$, where each $W_i^j$ is distributed according to $p(w)$, independently of $\mathbf{X}$, $\mathbf{Y}$, and all other $W_i^j$'s. Let $\kappa$ be a random mapping from $\{1,\ldots,s\}$ to $\{1,\ldots,t\}$. Note that the $\mathbf{W}^j$'s are independent of the values of $\mathbf{X}$ and $\mathbf{Y}$ known to $P_X$ and $P_Y$. They are chosen ahead of time as part of the protocol. We show that with high probability there is a $\mathbf{W}^j$ that is robustly jointly typical with $\mathbf{X}$ and that if $P_X$ transmits $\kappa(j)$, $P_Y$ can determine $j$ and therefore $\mathbf{W}^j$. It will follow from Lemma 4 that $P_Y$ can determine $f(\mathbf{X},\mathbf{Y})$ with no errors. The actual proof requires additional work.

Pick any

$$0 < \delta_1 < \delta_2 < \delta_3 < \epsilon/4,$$

and with

$$H \stackrel{\rm def}{=} \max\{H(X,Y),\ 2H(X),\ 2H(Y)\}$$

define the robust typicality sets

$$T_1 \stackrel{\rm def}{=} T_{X,\delta_1/H}, \qquad T_2 \stackrel{\rm def}{=} T_{(W,X),\delta_2/H}, \qquad T_3 \stackrel{\rm def}{=} T_{(W,Y),\delta_3/H},$$
and error bounds

$$\epsilon_1(n) \stackrel{\rm def}{=} 2|S_X|\, e^{-(\delta_1/H)^2 \mu_X n/3} \qquad \text{and} \qquad \epsilon_3(n) \stackrel{\rm def}{=} 2|S_{W,Y}|\, e^{-\left(\frac{\delta_3-\delta_2}{H(\delta_2+1)}\right)^2 \mu_{W,Y}\, n/3},$$

where $S_X \stackrel{\rm def}{=} \{x \in \mathcal{X} : p(x) > 0\}$ is the support set of $X$ and $\mu_X \stackrel{\rm def}{=} \min_{x \in S_X} p(x)$ is the smallest nonzero $X$-probability (similarly for $S_{W,Y}$ and $\mu_{W,Y}$). The following result is crucial in the proof.
Lemma 5 For every $n$,

1. $p[\mathbf{X} \notin T_1] \le \epsilon_1(n)$.
2. For every $j \in \{1,\ldots,s\}$ and all $\mathbf{x} \in T_1$, $p[(\mathbf{W}^j,\mathbf{x}) \in T_2] \ge 2^{-(I(W;X)+\delta_2)n}$.
3. For all $(\mathbf{w},\mathbf{x}) \in T_2$, $p[(\mathbf{w},\mathbf{Y}) \notin T_3\,|\,\mathbf{X} = \mathbf{x}] \le \epsilon_3(n)$.
4. For every $j \in \{1,\ldots,s\}$, $p[(\mathbf{W}^j,\mathbf{Y}) \in T_3] \le 2^{-(I(W;Y)-\delta_3)n}$.

Proof: The four parts follow, in order, from Lemmas 17 and 25 and Corollaries 3 and 4. □
We now apply these assertions to the protocol. Formally, define

$$\mathcal{J} \stackrel{\rm def}{=} \{j : (\mathbf{W}^j,\mathbf{X}) \in T_2\}$$

to be the random index set of the $\mathbf{W}^j$'s that are robustly jointly typical with $\mathbf{X}$. If $\mathcal{J} \ne \emptyset$, the empty set, let $J$ be the first² element of $\mathcal{J}$. If $\mathcal{J} = \emptyset$, $J$ is undefined. Note that $\mathcal{J}$ and $J$ are functions of $\mathbf{W}^1,\ldots,\mathbf{W}^s$ and $\mathbf{X}$, and that if $\mathcal{J} \ne \emptyset$ then $\mathbf{W}^J$ is robustly jointly typical with $\mathbf{X}$. Finally, if $\mathcal{J} \ne \emptyset$, let

$$\mathcal{K} \stackrel{\rm def}{=} \{k : (\mathbf{W}^k,\mathbf{Y}) \in T_3 \text{ and } \kappa(k) = \kappa(J)\}$$

be the random index set of $\mathbf{W}^k$'s that are robustly jointly typical with $\mathbf{Y}$ and are mapped by $\kappa$ to the same element as $J$. If $|\mathcal{K}| = 1$, let $K$ be its unique element. Otherwise, $K$ is undefined. When defined, $K$ is the only $k$ in $\kappa^{-1}(\kappa(J))$ such that $\mathbf{W}^k$ is robustly jointly typical with $\mathbf{Y}$.
Protocol:
$P_X$: If $\mathcal{J}$ is empty, transmits an error message. Otherwise, transmits $\kappa(J)$.
$P_Y$: If $|\mathcal{K}| \ne 1$, declares an error. Otherwise, proceeds to determine $g(W_i^K,Y_i)$ for all $i$.

Error Analysis: If $K = J$ then $(\mathbf{W}^K,\mathbf{X}) \in T_2$. By Lemma 4, $f(X_i,Y_i) = g(W_i^K,Y_i)$ for every $i \in \{1,\ldots,n\}$, hence $P_Y$ is correct on the whole block. An error can occur only if $\mathcal{J}$ is empty, if $|\mathcal{K}| \ne 1$, or if $K \ne J$; namely, if one or more of the following error events occurs:

$E_1$: $\mathcal{J} = \emptyset$.
$E_2$: $\mathcal{J} \ne \emptyset$ and $(\mathbf{W}^J,\mathbf{Y}) \notin T_3$.
$E_3$: $\mathcal{J} \ne \emptyset$ and there is $k \ne J$ such that $(\mathbf{W}^k,\mathbf{Y}) \in T_3$ and $\kappa(k) = \kappa(J)$.

Lemmas 6-8 show that each of these error events occurs with exponentially small probability.
Corollary 1

$$p[E_1 \cup E_2 \cup E_3] \le e^{-2^{\epsilon n/4}} + \epsilon_1(n) + \epsilon_3(n) + 2^{-\epsilon n/4}. \qquad \Box$$

² Any other choice, e.g., a random element of $\mathcal{J}$, will do.
It follows that for every rate $r > H_G(X|Y)$ there is a deterministic protocol with rate $\le r$ and exponentially small block-error probability. The upper bound of Theorem 1 follows.
Lemma 6 $p(E_1) \le e^{-2^{\epsilon n/4}} + \epsilon_1(n)$.

Proof: By definition,

$$p(E_1) = p(\mathcal{J} = \emptyset) = p[(\mathbf{W}^j,\mathbf{X}) \notin T_2 \text{ for all } j \in \{1,\ldots,s\}].$$

By Part (2) of Lemma 5, for every $\mathbf{x} \in T_1$ and $j \in \{1,\ldots,s\}$,

$$p[(\mathbf{W}^j,\mathbf{x}) \in T_2] \ge 2^{-(I(W;X)+\delta_2)n}.$$

Hence, for every $\mathbf{x} \in T_1$,

$$p(E_1\,|\,\mathbf{X} = \mathbf{x}) \le \left(1 - 2^{-(I(W;X)+\delta_2)n}\right)^s \le e^{-s 2^{-(I(W;X)+\delta_2)n}} \le e^{-2^{(\epsilon/2-\delta_2)n}} < e^{-2^{\epsilon n/4}},$$

and therefore,

$$p(E_1\,|\,\mathbf{X} \in T_1) \le e^{-2^{\epsilon n/4}}.$$

Part (1) of Lemma 5 implies that

$$p(E_1) = p(E_1 \text{ and } \mathbf{X} \in T_1) + p(E_1 \text{ and } \mathbf{X} \notin T_1) \le p(E_1\,|\,\mathbf{X} \in T_1) + p(\mathbf{X} \notin T_1) \le e^{-2^{\epsilon n/4}} + \epsilon_1(n). \qquad \Box$$
Lemma 7 $p(E_2) \le \epsilon_3(n)$.

Proof: By definition of $E_2$ and standard calculations,

$$p(E_2) = p(\mathcal{J} \ne \emptyset,\ (\mathbf{W}^J,\mathbf{Y}) \notin T_3) = \sum_{(\mathbf{w},\mathbf{x}) \in T_2} p(\mathcal{J} \ne \emptyset,\ (\mathbf{W}^J,\mathbf{Y}) \notin T_3,\ \mathbf{W}^J = \mathbf{w},\ \mathbf{X} = \mathbf{x})$$
$$= \sum_{(\mathbf{w},\mathbf{x}) \in T_2} p(\mathcal{J} \ne \emptyset,\ \mathbf{W}^J = \mathbf{w},\ \mathbf{X} = \mathbf{x})\, p((\mathbf{w},\mathbf{Y}) \notin T_3\,|\,\mathcal{J} \ne \emptyset,\ \mathbf{W}^J = \mathbf{w},\ \mathbf{X} = \mathbf{x}).$$

If $W - X - Y$ is a Markov chain, so is $g(W,X) - X - Y$ for any function $g$. By construction, $(\mathbf{W}^1,\ldots,\mathbf{W}^s) - \mathbf{X} - \mathbf{Y}$ is a Markov chain, and both $\mathcal{J}$ and $\mathbf{W}^J$ depend only on $(\mathbf{W}^1,\ldots,\mathbf{W}^s)$ and $\mathbf{X}$, hence $(\mathcal{J},\mathbf{W}^J) - \mathbf{X} - \mathbf{Y}$ is a Markov chain. Incorporating Part (3) of Lemma 5, we obtain

$$p(E_2) = \sum_{(\mathbf{w},\mathbf{x}) \in T_2} p(\mathcal{J} \ne \emptyset,\ \mathbf{W}^J = \mathbf{w},\ \mathbf{X} = \mathbf{x})\, p((\mathbf{w},\mathbf{Y}) \notin T_3\,|\,\mathbf{X} = \mathbf{x}) \le \sum_{(\mathbf{w},\mathbf{x}) \in T_2} p(\mathcal{J} \ne \emptyset,\ \mathbf{W}^J = \mathbf{w},\ \mathbf{X} = \mathbf{x})\, \epsilon_3(n) \le \epsilon_3(n). \qquad \Box$$
Lemma 8 $p(E_3) \le 2^{-\epsilon n/4}$.

Proof: By definition,

$$p(E_3) = p[\mathcal{J} \ne \emptyset,\ \exists k \ne J \text{ s.t. } (\mathbf{W}^k,\mathbf{Y}) \in T_3 \text{ and } \kappa(k) = \kappa(J)]$$
$$\le \sum_{j=1}^s \sum_{\substack{k=1 \\ k \ne j}}^s p(\mathcal{J} \ne \emptyset,\ J = j,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3)\, p(\kappa(k) = \kappa(j)\,|\,\mathcal{J} \ne \emptyset,\ J = j,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3).$$

For $k \ne j$,

$$p(\kappa(k) = \kappa(j)\,|\,\mathcal{J} \ne \emptyset,\ J = j,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3) = p(\kappa(k) = \kappa(j)) = \frac1t.$$

Reversing the order of summation to overcome the dependency of the events $J = j$ and $(\mathbf{W}^k,\mathbf{Y}) \in T_3$, and using Part (4) of Lemma 5,

$$\sum_{j=1}^s \sum_{\substack{k=1 \\ k \ne j}}^s p(\mathcal{J} \ne \emptyset,\ J = j,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3) = \sum_{k=1}^s \sum_{\substack{j=1 \\ j \ne k}}^s p(\mathcal{J} \ne \emptyset,\ J = j,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3) \le \sum_{k=1}^s p(\mathcal{J} \ne \emptyset,\ (\mathbf{W}^k,\mathbf{Y}) \in T_3) \le s\, p((\mathbf{W}^1,\mathbf{Y}) \in T_3) \le s\, 2^{-(I(W;Y)-\delta_3)n}.$$

Therefore,

$$p(E_3) \le \frac{s}{t}\, 2^{-(I(W;Y)-\delta_3)n} \le 2^{-(\epsilon/2-\delta_3)n} < 2^{-\epsilon n/4},$$

where the second inequality ignores the "ceiling" in the definition of $s$, an easily fixable flaw. □
5 Two-way communication

We consider the number of bits that must be transmitted when the communicators can exchange two messages. The next subsection defines the problem and the region of achievable rates. Subsection 5.2 describes the results. Subsection 5.3 shows that even in natural scenarios coding can reduce transmissions over simple-minded communication schemes. Subsection 5.4 gives the proof of the main theorem on two-message communication.
5.1 Definitions

As before, $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ are finite sets, and $f : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$. $\{(X_i,Y_i)\}_{i=1}^{\infty}$ are independent instances of a random pair $(X,Y)$ ranging over $\mathcal{X} \times \mathcal{Y}$.

A two-message protocol consists of a $Y$-encoding function $\varphi : \mathcal{Y}^n \to \{0,1\}^l$, an $X$-encoding function $\chi : \{0,1\}^l \times \mathcal{X}^n \to \{0,1\}^k$, and a decoding function $\psi : \{0,1\}^l \times \{0,1\}^k \times \mathcal{Y}^n \to \mathcal{Z}^n$. Here, $n \in \mathbb{N}$ is the block length, $k \in \mathbb{N}$ is the number of bits transmitted by $P_X$, and $l \in \mathbb{N}$ is the number of bits transmitted by $P_Y$. The protocol's $x$-rate is $k/n$ and its $y$-rate is $l/n$. The block error probability is

$$p[\psi(\varphi(\mathbf{Y}),\chi(\varphi(\mathbf{Y}),\mathbf{X}),\mathbf{Y}) \ne f(\mathbf{X},\mathbf{Y})],$$

where $\mathbf{X} \stackrel{\rm def}{=} X_1,\ldots,X_n$, $\mathbf{Y} \stackrel{\rm def}{=} Y_1,\ldots,Y_n$, and $f(\mathbf{X},\mathbf{Y}) \stackrel{\rm def}{=} f(X_1,Y_1),\ldots,f(X_n,Y_n)$. A rate pair $(r_x,r_y)$ is achievable if for every $\epsilon > 0$ and $n \in \mathbb{N}$ there is a protocol with $x$-rate $\le r_x$, $y$-rate $\le r_y$, block length $\ge n$, and block-error probability $< \epsilon$. The two-message rate region $R_f^2(X|Y)$ of $(X,Y)$ and $f$ is the closure of the set of achievable rate pairs. The two-message communication complexity is

$$L_f^2(X|Y) \stackrel{\rm def}{=} \min\{r_x + r_y : (r_x,r_y) \in R_f^2(X|Y)\}.$$

Intuitively, $P_X$ knows the $X_i$'s, $P_Y$ knows the $Y_i$'s, and both want $P_Y$ to reliably evaluate the $f(X_i,Y_i)$'s. $P_Y$ considers the $n$-instance block $\mathbf{Y}$ and uses $l$ bits to transmit $\varphi(\mathbf{Y})$ at an average rate of $l/n$ bits per instance. $P_X$ then considers the $n$-instance block $\mathbf{X}$ and uses $k$ bits to transmit $\chi(\varphi(\mathbf{Y}),\mathbf{X})$ at an average rate of $k/n$ bits per instance. $P_Y$ decides that $f(\mathbf{X},\mathbf{Y})$ is $\psi(\varphi(\mathbf{Y}),\chi(\varphi(\mathbf{Y}),\mathbf{X}),\mathbf{Y})$. He is correct for the block if the two vectors agree in every coordinate, and errs otherwise. $R_f^2(X|Y)$ is the set of rate pairs at which the communicators can transmit while keeping $P_Y$'s block-error probability arbitrarily small. The problem's communication complexity $L_f^2(X|Y)$ is the total number of bits both communicators must transmit. While a more complicated definition may seem necessary, one where $P_X$'s and $P_Y$'s messages are of variable lengths, it can be shown that the current definition is equivalent.
5.2 Results

Two random variables $U$ and $V$ defined over finite alphabets are admissible if:

1. $U - Y - X$,
2. $V - UX - Y$,
3. $U$, $V$, and $Y$ determine $f(X,Y)$.
Theorem 3 For every $(X, Y)$ and $f$,
$$\mathcal{R}^2_f(X|Y) = \left\{(r_x, r_y) : r_x \ge I(V; X|UY) \text{ and } r_y \ge I(U; Y|X) \text{ for some admissible } U \text{ and } V\right\}. \qquad \Box$$
The theorem is illustrated in the next subsection and proved in Subsection 5.4.
5.3 Example

We use Theorem 3 to show that even for natural problems, coding can save transmission over simple-minded communication.
Example 8 Recall the scenario in Example 6. The satellite knows a random variable $X$ uniformly distributed over $\{1, \dots, m\}$ and the base knows an independent Bernoulli-$p$ random variable $Y$. If $Y = 1$ ("alert"), the base wants to know $X$; if $Y = 0$ ("relax"), it doesn't. We showed that, since all values of $(X, Y)$ are possible, one-way communication requires the maximum number of bits even for small $p$:
$$L_f(X|Y) = H(X|Y) = \log m.$$
When two messages are allowed, the communicators can save bits when $Y = 0$. $P_Y$ transmits $Y$ using $h(p)$ bits per instance, and $P_X$ describes $X$ only if $Y = 1$. Hence,
$$L^2_f(X|Y) \le h(p) + p \log m.$$
For small $p$ this is significantly smaller than $L_f(X|Y)$. However, Theorem 3 shows that transmission can be further reduced. Define the Markov chain $X - Y - U$, where $U \in \{z, e\}$, by
$$p(U = e \mid Y = 1) = 1$$
and
$$p(U = e \mid Y = 0) = \alpha,$$
namely, if $Y = 1$ then $U = e$, and if $Y = 0$ then $U = e$ with probability $\alpha$. Then $U = z$ implies that $Y$ is zero, while $U = e$ implies that $Y$ is either 0 or 1. Let
$$q \stackrel{\rm def}{=} p(U = e) = p + \bar{p}\alpha = 1 - \bar{p}\bar{\alpha},$$
where $\bar{p} = 1 - p$ and $\bar{\alpha} = 1 - \alpha$. The Markovity $X - Y - U$ and the independence of $X$ and $Y$ imply that $U$ and $X$ are independent. Therefore,
$$I(U; Y|X) = H(U) - H(U|Y) = h(q) - \bar{p}\,h(\alpha).$$
Next define $Y - UX - V$, where $V \in \{1, \dots, m\}$, via
$$p(v \mid U = e, x) = \begin{cases} 1 & \text{if } v = x, \\ 0 & \text{if } v \ne x, \end{cases} \qquad\qquad p(v \mid U = z, x) = \frac{1}{m}.$$
Then
$$I(V; X|UY) = H(V|UY) - H(V|UX) = H(V) - p(U = z)H(V|X, U = z) = q \log m.$$
Therefore,
$$\mathcal{R}^2_f(X|Y) \supseteq \left\{(r_x, r_y) : r_x \ge q \log m,\ r_y \ge h(q) - \bar{p}\,h(\bar{q}/\bar{p}),\ q \in [p, 1]\right\},$$
where $\bar{q} = 1 - q$.
With more work it can be shown that this is exactly the achievable region. Next, consider the communication complexity. For every $q \in [p, 1]$,
$$L^2_f(X|Y) \le h(q) - \bar{p}\,h(\bar{q}/\bar{p}) + q \log m = -q \log q - \bar{p} \log \bar{p} + (q - p)\log(q - p) + q \log m.$$
Differentiation with respect to $q$ shows that the right-hand side is minimized when
$$q = \min\left\{\frac{mp}{m - 1},\ 1\right\}.$$
For simplicity, assume that $p \le 1 - \frac{1}{m}$. Then
$$L^2_f(X|Y) = h(p) + p \log(m - 1).$$
It is instructive to consider $m = 2$, equivalently, the scenario where $X \sim B(1/2)$ and $Y \sim B(p)$ are independent Bernoulli variables and $P_Y$ wants to determine $X \wedge Y$. For conciseness, we assume that $p \le \frac{1}{2}$. In one-way communication, $P_X$ must transmit 1 bit (per instance). In simple-minded two-message communication, $P_Y$ transmits $h(p)$ bits and $P_X$ responds with $p$ bits, for a total of $h(p) + p$ bits. In optimal two-message communication, $P_Y$ transmits $h(p) - 2p$ bits and $P_X$ responds using $2p$ bits, for a total of $h(p)$ bits. $\Box$
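The minimization in Example 8 is easy to check numerically. The sketch below (our own helpers, with base-2 logarithms) evaluates the bound $h(q) - \bar{p}\,h(\bar{q}/\bar{p}) + q\log m$ over a grid of $q \in [p, 1]$ and compares the minimum against the closed form $h(p) + p\log(m-1)$.

```python
import math

# Numerical check of Example 8's rate formulas (a sketch; the helper
# names are ours, not the paper's). Logarithms are base 2 throughout.

def h(x):                                   # binary entropy h(x)
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def total_rate(q, p, m):
    # h(q) - (1-p) h((1-q)/(1-p)) + q log m, the bound for a given q
    return h(q) - (1 - p) * h((1 - q) / (1 - p)) + q * math.log2(m)

p, m = 0.1, 4                               # p <= 1 - 1/m, so q* = mp/(m-1)
q_star = min(m * p / (m - 1), 1.0)
closed_form = h(p) + p * math.log2(m - 1)   # claimed minimum value

# A grid search over q in [p, 1] should find (roughly) the same minimum.
grid_min = min(total_rate(p + i * (1 - p) / 10000, p, m) for i in range(10001))
assert abs(total_rate(q_star, p, m) - closed_form) < 1e-9
assert abs(grid_min - closed_form) < 1e-6
print(round(closed_form, 4))                # about 0.6275 bits per instance
```

At $q = p$ the bound degenerates to the simple-minded $h(p) + p\log m$, and at $q = 1$ to the one-way $\log m$; the interior optimum $q^* = mp/(m-1)$ is what the coding gain buys.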
5.4 Proof

It remains to prove Theorem 3. We describe only the outer-bound (converse) proof. The inner-bound (achievability) proof, omitted to save space, combines random-coding arguments similar to ones given by Kaspi and Berger [9] with robust-typicality techniques like those used to prove the one-way results in Section 4 of this paper. This combination proves vanishing block-error (not just bit-error) probability. The remainder of this subsection outlines the outer-bound (converse) proof. As with one-way communication, we show that the outer bound holds even under the relaxed requirement of small bit-error (rather than block-error) probability. The proof consists of three main parts:

1. As in [9], we obtain an outer bound on the three-dimensional region $S_3$, defined essentially as the set of all triples $(r_x, r_y, D)$ that have associated protocols with x-rate at most $r_x + \epsilon$, y-rate at most $r_y + \epsilon$, and bit-error rate at most $D + \epsilon$ for all $\epsilon > 0$. It will be shown that $S_3 \subseteq \bar{S}_3$, where $\bar{S}_3 = (S'_3)^{\rm cl}$ is the closure of the set $S'_3$, with $S'_3$ defined in terms of mutual informations involving $X$, $Y$, and three auxiliary variables $T$, $U$, and $V$. The corresponding alphabets $\mathcal{T}$, $\mathcal{U}$, and $\mathcal{V}$ may be taken to have cardinalities bounded above by polynomials in $|\mathcal{X}|$ and $|\mathcal{Y}|$.

2. Manipulating Markov chains, we show that $S'_3$ and its closure, $\bar{S}_3$, can be expressed in terms of $X$, $Y$, $U$, and $V$ alone.

3. Using the continuity of the mutual-information function and the uniform bounds on $|\mathcal{U}|$ and $|\mathcal{V}|$, we argue that $S'_3$ is closed and that the two-dimensional set $\bar{S}_3 \cap \{D = 0\}$, or $(S'_3)^{\rm cl} \cap \{D = 0\}$, is the same as $S'_3 \cap \{D = 0\}$. It follows that $S_2$, the two-dimensional slice through $S_3$ on which $D = 0$ (which is equivalent to the two-dimensional region $\mathcal{R}^2_f(X|Y)$), is contained in the two-dimensional region given in Theorem 3.
We begin with some definitions. For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, let
$$\delta(x, y, z) \stackrel{\rm def}{=} \begin{cases} 0 & \text{if } z = f(x, y), \\ 1 & \text{if } z \ne f(x, y), \end{cases}$$
and for $\mathbf{x} \in \mathcal{X}^n$, $\mathbf{y} \in \mathcal{Y}^n$, $\mathbf{z} \in \mathcal{Z}^n$, let
$$\Delta_n \stackrel{\rm def}{=} \Delta_n(\mathbf{x}, \mathbf{y}, \mathbf{z}) \stackrel{\rm def}{=} \frac{1}{n}\sum_{i=1}^{n} \delta(x_i, y_i, z_i).$$
Define $S_3 \subseteq \mathbb{R}^3$ to be the closure of the set of all $(r_x, r_y, D)$ such that for every $\epsilon > 0$ there are an $n \in \mathbb{N}$ and a protocol with x-rate $\le r_x$, y-rate $\le r_y$, and
$$E(\Delta_n) = E\left[\frac{1}{n}\sum_{i=1}^{n} \delta(X_i, Y_i, Z_i)\right] = \frac{1}{n}\sum_{i=1}^{n} p\{Z_i \ne f(X_i, Y_i)\} \le D + \epsilon.$$
Let
$$S_2 \stackrel{\rm def}{=} S_3 \cap \{D = 0\}.$$
Note that $S_2$ is equivalent to the region $\mathcal{R}^2_f(X|Y)$:
$$S_2 = \{(r_x, r_y, 0) : (r_x, r_y) \in \mathcal{R}^2_f(X|Y)\}.$$
Next we define the regions $S'_3$ and $\bar{S}_3$ that provide an outer bound on $S_3$. For $D \ge 0$ let $\mathcal{P}_3(D)$ be the set of all triples of r.v.'s $(T, U, V)$ jointly distributed with $X, Y$ and ranging over finite sets $\mathcal{T}$, $\mathcal{U}$, and $\mathcal{V}$, respectively, satisfying the following properties:

a) $U - Y - X$,
b) $V - UX - Y$,
c) $X - UY - T$,
d) $X - UVY - T$,
e) $f(X, Y) - UVY - T$,
f) $\exists F : \mathcal{T} \times \mathcal{U} \times \mathcal{V} \times \mathcal{Y} \to \mathcal{Z}$ such that $E[\delta(X, Y, F(T, U, V, Y))] \le D$.

Then
$$S'_3 \stackrel{\rm def}{=} \{(r_x, r_y, D) : r_x \ge I(V; X|TUY),\ r_y \ge I(U; Y|X) \text{ for some } (T, U, V) \in \mathcal{P}_3(D)\}$$
and
$$\bar{S}_3 \stackrel{\rm def}{=} (S'_3)^{\rm cl}$$
is the closure of $S'_3$. As in [9], $\bar{S}_3$ is convex, and the cardinalities $|\mathcal{T}|$, $|\mathcal{U}|$, $|\mathcal{V}|$ may be taken to be bounded above by polynomial functions of $|\mathcal{X}|$ and $|\mathcal{Y}|$. We now prove the first of the three steps establishing the outer bound.
Theorem 4
$$S_3 \subseteq \bar{S}_3.$$
Proof: The proof is similar to that of Theorem 4.1 in [9]; we provide only an outline. $\bar{S}_3$ is closed by definition. Hence it suffices to show that if $s \stackrel{\rm def}{=} (r_x, r_y, D) \in S_3$, then for any $\epsilon > 0$,
$$s_\epsilon \stackrel{\rm def}{=} (r_x + \epsilon + \xi(\epsilon),\ r_y + \epsilon + \xi(\epsilon),\ D + \epsilon + \xi(\epsilon))$$
is in the interior of $\bar{S}_3$, where $\xi(\epsilon) \to 0$ as $\epsilon \to 0$. Since $s \in S_3$, we know that for sufficiently large $n$ there exists a code $(\varphi, \psi, \hat\psi)$ that satisfies the conditions in the definition of $S_3$. We lower-bound the two rates for the code as in [9]. First, for $i = 1, 2, \dots, n$ define the auxiliary r.v.'s
$$T_i \stackrel{\rm def}{=} (Y_1^{i-1}, Y_{i+1}^n), \qquad U_i \stackrel{\rm def}{=} (\varphi(\mathbf{Y}), X_1^{i-1}, Y_{i+1}^n), \qquad V_i \stackrel{\rm def}{=} \psi(\varphi(\mathbf{Y}), \mathbf{X}).$$
For convenience we sometimes omit the arguments of the encoding functions $\varphi$ and $\psi$ (and of the decoding function $\hat\psi$). We also often omit the index and limits of summation when a sum is over all $i$ from 1 to $n$. Proceeding as in [9], we find that for any given $\epsilon > 0$ and for any code satisfying the conditions that define the region $S_3$,
$$n(r_x + \epsilon) \ge \sum I(X_i; V_i | T_i U_i Y_i), \qquad n(r_y + \epsilon) \ge \sum I(Y_i; U_i | X_i).$$
Let $\Delta_i \stackrel{\rm def}{=} E[\delta(X_i, Y_i, Z_i)]$ be the probability of bit error that the given code achieves for the $i$th character of the blocks, where $Z_i$ is the $i$th component of $\mathbf{Z} = \hat\psi(\varphi, \psi, \mathbf{Y})$. By definition, $(T_i, U_i, V_i, Y_i)$ includes the values of the coding functions $\varphi$ and $\psi$, as well as the complete vector $\mathbf{Y}$. Therefore there exists a function $F_i$ such that $Z_i = F_i(T_i, U_i, V_i, Y_i)$; namely, $F_i(T_i, U_i, V_i, Y_i)$ is the $i$th component of $\hat\psi(\varphi, \psi, \mathbf{Y})$. As we shall see shortly, we also have the following five Markov conditions on $T_i$, $U_i$, $V_i$.
Condition 1: $U_i - Y_i - X_i$.
Condition 2: $V_i - U_i X_i - Y_i$.
Condition 3: $X_i - U_i Y_i - T_i$.
Condition 4: $X_i - U_i V_i Y_i - T_i$.
Condition 5: $f(X_i, Y_i) - U_i V_i Y_i - T_i$.

Taking these five conditions on faith for the moment, we see from the definition of $\mathcal{P}_3(D)$ that $(T_i, U_i, V_i) \in \mathcal{P}_3(\Delta_i)$. By a simple convexity lemma (proved as in [9]), if we let $\Delta = n^{-1}\sum \Delta_i$, then there exist $(T, U, V) \in \mathcal{P}_3(\Delta)$ such that
$$I(X; V|TUY) = n^{-1}\sum I(X_i; V_i|T_i U_i Y_i), \qquad I(Y; U|X) = n^{-1}\sum I(Y_i; U_i|X_i).$$
Combining the previous results, we obtain
$$r_x + \epsilon \ge I(X; V|TUY), \qquad r_y + \epsilon \ge I(Y; U|X), \qquad D + \epsilon \ge \Delta.$$
Thus $s_\epsilon$ is in the interior of $\bar{S}_3$, as we wished to show. Furthermore, as in [9], we may without loss of generality impose uniform bounds on the cardinalities of the alphabets $\mathcal{T}$, $\mathcal{U}$, and $\mathcal{V}$. $\Box$

We now return to the five Markov conditions in Theorem 4 and outline their proofs. Conditions 1 and 2 are established as in [9], and Condition 3 is straightforward, so we concentrate on the last two conditions.
Lemma 9 (Condition 4) $X_i - U_i V_i Y_i - T_i$.

Proof: This Markov constraint follows from the stronger condition $X_i^n - U_i V_i Y_i - T_i$ which, in turn, follows from the definitions of $T_i$, $U_i$, and $V_i$, and from the next lemma after letting
$$A_1 = X_1^{i-1}, \quad A_2 = X_i^n, \quad B_1 = Y_1^{i-1}, \quad B_2 = Y_i^n, \quad g = \psi, \quad h = \varphi. \qquad \Box$$
Lemma 10 Let $A_1$, $A_2$, $B_1$, $B_2$ be r.v.'s with $(A_1, A_2)$ independent of $(B_1, B_2)$. Let $h = h(B_1, B_2)$ and $g = g(h, A_1, A_2)$ be deterministic functions. Then
$$A_2 - (h, g, A_1, B_2) - B_1.$$
Proof: Straightforward. $\Box$
Before establishing the fifth and final Markov condition, we prove two claims and two lemmas. Abbreviate
$$f_i \stackrel{\rm def}{=} f(X_i, Y_i).$$

Claim 4 $f_i X_i - Y_i - U_i T_i$.

Proof: Follows readily from the definitions of $T_i$ and $U_i$. $\Box$
Lemma 11 (Lemma A.1 of [9]) Let $A_i$ and $B_i$, $i = 1, 2$, be r.v.'s taking values on respective finite sets $\mathcal{A}_i$ and $\mathcal{B}_i$, with $(A_1, B_1)$ independent of $(A_2, B_2)$. Let $g = g(B_1, B_2)$ and $f = f(A_1, A_2, g)$ be deterministic functions. Then
$$f - (A_1, B_2, g) - B_1.$$
Proof: See [9]. $\Box$
Claim 5 $T_i Y_i - f_i X_i U_i - V_i$.

Proof: Follows almost immediately from the fact that $f_i T_i Y_i - X_i U_i - V_i$, which can in turn be readily proved after using the previous lemma to show that $\psi - (X_1^i, Y_{i+1}^n, \varphi) - Y_1^i$. $\Box$

Lemma 12 Let $A$, $B$, $T$, $U$, $V$ be r.v.'s defined over finite alphabets with $B - A - TU$ and $AT - BU - V$. Then (a) $BV - AU - T$, (b) $B - AUV - T$.

Proof: (a) For every realization $a, b, t, u, v$ of $A$, $B$, $T$, $U$, $V$ we have
$$p(t|abuv) = \frac{p(abtuv)}{p(abuv)} = \frac{p(ab)p(tu|a)p(v|bu)}{\sum_t p(ab)p(tu|a)p(v|bu)} = \frac{p(tu|a)}{\sum_t p(tu|a)} = \frac{p(tu|a)p(a)}{\sum_t p(tu|a)p(a)} = \frac{p(atu)}{\sum_t p(atu)} = \frac{p(atu)}{p(au)} = p(t|au).$$
Therefore
$$p(t|abuv) = p(t|au)$$
for every $a, b, t, u, v$, so $BV - AU - T$.

(b) $H(T|AUV) \ge H(T|ABUV) = H(T|AU) \ge H(T|AUV)$, where the equality follows from part (a) and the two inequalities follow from the fact that additional conditioning reduces entropy. Therefore
$$H(T|AUV) = H(T|ABUV),$$
so $B - AUV - T$. $\Box$
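Lemma 12 can be sanity-checked numerically. The sketch below (our own construction, not from the paper) builds a random joint p.m.f. satisfying the two Markov hypotheses $B - A - TU$ and $AT - BU - V$, then verifies the conclusion $p(t|a,b,u,v) = p(t|a,u)$.

```python
import numpy as np

# Numerical sanity check of Lemma 12 (our own construction): build a
# random joint p(a,b,t,u,v) satisfying B - A - TU and AT - BU - V, then
# verify the conclusion of part (a): p(t|a,b,u,v) = p(t|a,u).

rng = np.random.default_rng(0)
nA = nB = nT = nU = nV = 2

p_a    = rng.random(nA);           p_a    /= p_a.sum()
p_b_a  = rng.random((nA, nB));     p_b_a  /= p_b_a.sum(axis=1, keepdims=True)
p_tu_a = rng.random((nA, nT, nU)); p_tu_a /= p_tu_a.sum(axis=(1, 2), keepdims=True)
p_v_bu = rng.random((nB, nU, nV)); p_v_bu /= p_v_bu.sum(axis=2, keepdims=True)

# Joint distribution p(a,b,t,u,v) = p(a) p(b|a) p(t,u|a) p(v|b,u),
# which encodes exactly the two Markov hypotheses of the lemma.
P = np.einsum('a,ab,atu,buv->abtuv', p_a, p_b_a, p_tu_a, p_v_bu)
assert abs(P.sum() - 1.0) < 1e-12

p_abuv = P.sum(axis=2)                                   # marginal over t
p_t_given_abuv = P / p_abuv[:, :, None, :, :]

p_atu = P.sum(axis=(1, 4))                               # marginal p(a,t,u)
p_t_given_au = p_atu / p_atu.sum(axis=1, keepdims=True)  # p(t|a,u)

# Conclusion (a): T is independent of (B, V) given (A, U).
diff = np.abs(p_t_given_abuv - p_t_given_au[:, None, :, :, None]).max()
assert diff < 1e-12
```

Because the factorization $p(a)p(b|a)p(t,u|a)p(v|b,u)$ is the general form allowed by the two hypotheses, the check exercises the same cancellation used in the proof of part (a).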
Lemma 13 (Condition 5) $f_i - U_i V_i Y_i - T_i$.

Proof: Follows from the previous two claims and part (b) of the last lemma with $A = Y_i$, $B = (f_i, X_i)$, $T = T_i$, $U = U_i$, and $V = V_i$. $\Box$
This completes the proof of Theorem 4. Next we show that the auxiliary variable $T$ can be eliminated from the definition of the region $S'_3$ and its closure, $\bar{S}_3$. For $D \ge 0$, let $\tilde{\mathcal{P}}_3(D)$ be the set of all pairs of r.v.'s $(U, V)$ jointly distributed with $X, Y$ and ranging over finite sets $\mathcal{U}$ and $\mathcal{V}$, respectively, satisfying the following properties:

a) $U - Y - X$,
b) $V - UX - Y$,
c) $\exists F : \mathcal{U} \times \mathcal{V} \times \mathcal{Y} \to \mathcal{Z}$ such that
$$E[\delta(X, Y, F(U, V, Y))] \le D.$$
Let
$$\tilde{S}'_3 \stackrel{\rm def}{=} \left\{(r_x, r_y, D) : r_x \ge I(V; X|UY),\ r_y \ge I(U; Y|X) \text{ for some } (U, V) \in \tilde{\mathcal{P}}_3(D)\right\},$$
and let
$$\tilde{S}_3 \stackrel{\rm def}{=} (\tilde{S}'_3)^{\rm cl}$$
be its closure.
Theorem 5
$$S_3 \subseteq \tilde{S}_3.$$
In fact, $\tilde{S}'_3 = S'_3$ and $\tilde{S}_3 = \bar{S}_3$.

Proof: Follows from the next two lemmas, which show that $T$ can be eliminated in Theorem 4. $\Box$
Lemma 14 If $X - UY - T$ and $X - UVY - T$, then
$$I(X; V|UTY) = I(X; V|UY).$$
Proof:
$$I(X; V|UTY) = H(X|UYT) - H(X|UVYT).$$
By the two Markov hypotheses, we have
$$H(X|UYT) = H(X|UY) \qquad \text{and} \qquad H(X|UVYT) = H(X|UVY),$$
and so
$$I(X; V|UTY) = H(X|UY) - H(X|UVY) = I(X; V|UY). \qquad \Box$$
Lemma 15 Given r.v.'s $T$, $U$, $V$, $X$, and $Y$ defined over finite alphabets with a particular joint distribution, suppose that $f(X, Y) - UVY - T$ and that there exists a decoding function $F(T, U, V, Y)$ such that
$$p\{F(T, U, V, Y) \ne f(X, Y)\} \le D.$$
Then there exists a decoding function $\tilde{F}(U, V, Y)$ such that
$$p\{\tilde{F}(U, V, Y) \ne f(X, Y)\} \le D.$$
Proof: Define the r.v.'s $A = f(X, Y)$, $B = (U, V, Y)$, and $C = T$ over the corresponding finite alphabets $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ with the joint distribution induced by $T$, $U$, $V$, $X$, $Y$, and $f$. Given any realization $(b, c)$ of the random pair $(B, C)$, there is some discrete distribution on the value of $A$. The best possible decoding rule, $F^*$, is one that for each $(b, c)$ sets $F^*(b, c)$ equal to the most probable value of $A$. (If there is more than one such value of $A$, it may be chosen arbitrarily.) In particular, any $F^*$ chosen as above must achieve
$$p\{F^*(B, C) \ne A\} \le p\{F(B, C) \ne A\} \le D.$$
But by the Markov condition $A - B - C$, we may choose $F^*$ to be a function of $B$ alone without altering $p\{F^*(B, C) \ne A\}$. Thus we may take
$$\tilde{F}(U, V, Y) = F^*(B) = F^*(U, V, Y). \qquad \Box$$
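The argument of Lemma 15 is easy to illustrate numerically: under a Markov chain $A - B - C$, the MAP decoder based on $(B, C)$ achieves exactly the same error probability as the one based on $B$ alone. The sketch below (our own toy construction) checks this for a random Markov joint distribution.

```python
import itertools
import numpy as np

# Toy check of the Lemma 15 argument (our construction): when A - B - C
# is a Markov chain, the MAP decoder based on (B, C) and the one based
# on B alone achieve the same error probability.

rng = np.random.default_rng(1)
nA, nB, nC = 3, 3, 4

p_b   = rng.random(nB);       p_b   /= p_b.sum()
p_a_b = rng.random((nB, nA)); p_a_b /= p_a_b.sum(axis=1, keepdims=True)
p_c_b = rng.random((nB, nC)); p_c_b /= p_c_b.sum(axis=1, keepdims=True)

# Markov chain A - B - C: p(a,b,c) = p(b) p(a|b) p(c|b).
P = np.einsum('b,ba,bc->abc', p_b, p_a_b, p_c_b)

# Error of the MAP decoder F*(b, c) = argmax_a p(a, b, c):
err_bc = 1.0 - sum(P[:, b, c].max()
                   for b, c in itertools.product(range(nB), range(nC)))

# Error of the MAP decoder based on b alone:
p_ab = P.sum(axis=2)
err_b = 1.0 - sum(p_ab[:, b].max() for b in range(nB))

assert abs(err_bc - err_b) < 1e-12   # identical under the Markov condition
```

The equality holds because $p(a|b,c) = p(a|b)$, so the argmax over $a$ never depends on $c$; this is precisely why $F^*$ can be taken to be a function of $B$ alone.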
The previous two lemmas and Theorem 4 establish Theorem 5. We are finally ready to prove the outer bound (converse) of Theorem 3. Recall that
$$S_2 \stackrel{\rm def}{=} S_3 \cap \{D = 0\} = \{(r_x, r_y, 0) : (r_x, r_y) \in \mathcal{R}^2_f(X|Y)\}.$$
Thus the outer bound of Theorem 3 will be established by the following theorem.

Theorem 6
$$S_2 \subseteq \tilde{S}'_3 \cap \{D = 0\}.$$
Proof: It follows from Theorem 5 that if $s \in S_2$, then
$$s \in \tilde{S}_3 \cap \{D = 0\} = (\tilde{S}'_3)^{\rm cl} \cap \{D = 0\}.$$
To prove Theorem 6, we essentially need to show that $\tilde{S}'_3$ already contains all its points of closure for which $D = 0$. The following claim establishes this result and thus, together with the inner bound on $\mathcal{R}^2_f(X|Y)$, proves Theorem 3, our main theorem on two-way communication. $\Box$
Claim 6 $\tilde{S}'_3$ is closed.

Proof: We show that for every infinite sequence $\{(r_x, r_y, D)_k\}_{k \ge 1}$ from $\tilde{S}'_3$ that has a limit $(r_x^*, r_y^*, D^*)$, this limiting triple is also contained in $\tilde{S}'_3$. We use the fact that we can place uniform upper bounds on the cardinalities of the alphabets $\mathcal{U}$, $\mathcal{V}$, and $\mathcal{Z}$ in terms of $|\mathcal{X}|$ and $|\mathcal{Y}|$. Thus the number of possible decoding functions $F(U, V, Y)$ is bounded, and the space of admissible probability mass functions (p.m.f.'s)
$$p(x, y, u, v) = p(x, y)\,p(u|y)\,p(v|u, x)$$
(the form allowed by the Markov constraints in the definition of $\tilde{\mathcal{P}}_3(D)$) is closed and bounded and has bounded dimension. Now given any infinite sequence of triples $\{(r_x, r_y, D)_k\}$ from $\tilde{S}'_3$, and any corresponding sequence $\{F_k(U, V, Y), p_k(u|y), p_k(v|u, x)\}$ achieving the given $\{(r_x, r_y, D)_k\}$, there must be an infinite subsequence having a common decoding function $F^*(U, V, Y)$. Within this subsequence, there must be an accumulation point $(F^*(U, V, Y), p^*(u|y), p^*(v|u, x))$. Since the space of p.m.f.'s is closed and since mutual information is a continuous function, it follows from the definition of $\tilde{S}'_3$ that we can define $U$ and $V$ by $p^*(x, y, u, v) = p(x, y)\,p^*(u|y)\,p^*(v|u, x)$ and $F$ by $F^*(U, V, Y)$ to achieve the point $(r_x^*, r_y^*, D^*)$. The claim follows. $\Box$
Acknowledgements We are indebted to Toby Berger, Janos Korner, and especially Aaron Wyner, for elucidating discussions, literature references, and help in smoothing out the rough edges of some proofs.
Appendix: Robust typicality

Let $X$ be a random variable distributed over a finite set $\mathcal{X}$ according to a probability distribution $p$, and let $\mathbf{X} = X_1, \dots, X_n$ be a sequence of independent random copies of $X$. For large $n$, we expect roughly a $p(x)$-fraction of the copies to attain the value $x \in \mathcal{X}$. This fundamental property underlies many an information-theory result. Formally, the empirical frequency of $x \in \mathcal{X}$ in a sequence $\mathbf{x} = x_1, \dots, x_n \in \mathcal{X}^n$ is
$$\nu_{\mathbf{x}}(x) \stackrel{\rm def}{=} \frac{|\{i : x_i = x\}|}{n}.$$
The sequence $\mathbf{x}$ is $\epsilon$-strongly typical for $\epsilon > 0$ if for all $x \in \mathcal{X}$,
$$|\nu_{\mathbf{x}}(x) - p(x)| \le \epsilon \qquad \text{(or } \le \epsilon/|\mathcal{X}| \text{ in some definitions)}.$$
Strong typicality is studied in detail by Berger [1], Csiszar and Korner [4], Cover and Thomas [3], and others. We need a slightly modified form of typicality. The sequence $\mathbf{x}$ is $\epsilon$-robustly typical ($\epsilon$-r.t.) for $\epsilon > 0$ if for all $x \in \mathcal{X}$,
$$|\nu_{\mathbf{x}}(x) - p(x)| \le \epsilon\, p(x).$$
Suppressing the sequence length $n$ for brevity, we let $T_{X,\epsilon}$, or $T_\epsilon$ for short, denote the set of $\epsilon$-r.t. $\mathbf{x}$'s. In most cases the random variable is implied by context, and we use the shorthand notation, for example, $\mathbf{x} \in T_\epsilon$ instead of $\mathbf{x} \in T_{X,\epsilon}$. Pairs and, more generally, collections of random variables are special kinds of random variables, hence the definition of robust typicality applies to them too. For example, let $(X, Y)$ be a random pair distributed over a finite set $\mathcal{X} \times \mathcal{Y}$ according to a probability distribution $p$ and, using informal notation, let $(\mathbf{X}, \mathbf{Y}) \stackrel{\rm def}{=} (X_1, Y_1), \dots, (X_n, Y_n)$ be a sequence of independent copies of $(X, Y)$.
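The robust-typicality test above is simple to state in code. The following sketch (our own helper, not from the paper) checks $|\nu_{\mathbf{x}}(x) - p(x)| \le \epsilon\, p(x)$ for every symbol, and exhibits the defining property that a zero-probability symbol may not appear at all.

```python
from collections import Counter

# A small sketch (our own helper) of the robust-typicality test:
# x is eps-r.t. if |nu_x(a) - p(a)| <= eps * p(a) for every symbol a.
# Note the defining property: p(a) = 0 forces nu_x(a) = 0, since the
# allowed deviation eps * p(a) is then zero.

def is_robustly_typical(x, p, eps):
    n = len(x)
    freq = Counter(x)
    # Symbols outside the support must not appear at all (eps * 0 = 0).
    if any(a not in p for a in freq):
        return False
    return all(abs(freq.get(a, 0) / n - pa) <= eps * pa for a, pa in p.items())

p = {'a': 0.5, 'b': 0.5}            # 'c' has probability 0
assert is_robustly_typical('abab', p, eps=0.1)
assert not is_robustly_typical('aaab', p, eps=0.1)   # nu('a') = 0.75
assert not is_robustly_typical('abac', p, eps=1.0)   # zero-prob symbol appears
```

Contrast the last line with strong typicality, where a single occurrence of a zero-probability symbol can still pass the test for a fixed $\epsilon$; it is exactly this difference that yields zero block error in the protocol of Subsection 4.2.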
According to the definitions above, the empirical frequency of $(x, y) \in \mathcal{X} \times \mathcal{Y}$ in a sequence pair $(\mathbf{x}, \mathbf{y}) = (x_1, y_1), \dots, (x_n, y_n)$ is $\nu_{\mathbf{x},\mathbf{y}}(x, y) = |\{i : (x_i, y_i) = (x, y)\}|/n$, the sequence $(\mathbf{x}, \mathbf{y})$ is $\epsilon$-r.t. if $|\nu_{\mathbf{x},\mathbf{y}}(x, y) - p(x, y)| \le \epsilon\, p(x, y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, and the set of all $\epsilon$-r.t. $(\mathbf{x}, \mathbf{y})$'s is denoted by $T_{(X,Y),\epsilon}$, usually abbreviated $T_\epsilon$. Robust typicality has certain advantages over its strong counterpart. The bounds have more natural expressions (e.g., Lemma 18), many proofs are simplified (cf. Lemma 21), and a single definition applies to arbitrary collections of random variables (as above). For our purposes, the crucial property distinguishing robust from strong typicality³ is also the one that lends it its name: if $\mathbf{x}$ is $\epsilon$-r.t. then $p(x) = 0$ implies $\nu_{\mathbf{x}}(x) = 0$. It is this property that allows us to achieve near-zero block error probability. In the protocol discussed in Subsection 4.2, $P_Y$ attempts to find a $\mathbf{w} \in \Gamma(G)^n$ that is $\epsilon$-r.t. with $\mathbf{x}$. If $\mathbf{x} \notin \mathbf{w}$ (i.e., $x_i \notin w_i$ for some $i$) then $p(\mathbf{w}|\mathbf{x}) = 0$, and robust typicality guarantees that if $P_Y$ finds $\mathbf{w}$ as above then $x_i \in w_i$ for all $i$, and therefore $P_Y$ is correct on the whole block. In the rest of the appendix we prove robust-typicality equivalents of standard strong-typicality results (e.g., Cover and Thomas [3]). Many of the results hold under similar qualifications. To abbreviate the exposition, we describe all qualifiers here. All results hold for every random variable $X$ and every $n \in \mathbb{N}$. Whenever $\epsilon$ appears (mainly Lemmas 16-21 and Corollary 2), the result holds for every $0 < \epsilon \le 1$; and, if used,
$$\epsilon(n) \stackrel{\rm def}{=} 2|S_X|\,e^{-\epsilon^2 \pi_X n/3},$$
where $S_X \stackrel{\rm def}{=} \{x \in \mathcal{X} : p(x) > 0\}$ is the support set of $X$ and $\pi_X = \min_{x \in S_X} p(x)$ is its smallest nonzero probability. Whenever $\epsilon_1$ and $\epsilon_2$ appear (mainly following Lemma 22), the result holds for every $0 < \epsilon_1 < \epsilon_2 \le 1$, and, if used,
$$\epsilon_{1,2}(n) \stackrel{\rm def}{=} 2|S_{X,Y}|\,e^{-\left(\frac{\epsilon_2 - \epsilon_1}{1 + \epsilon_1}\right)^2 \pi_{X,Y}\, n/3}.$$
Note that $\epsilon(n)$ and $\epsilon_{1,2}(n)$ diminish to zero exponentially with $n$. We first consider a single random variable $X$. We prove that with high probability a random sequence is typical, that every typical sequence has probability of about $2^{-nH(X)}$, and that there are roughly $2^{nH(X)}$ typical sequences. We use the following basic tool.

Lemma 16 (Chernoff Bound) For every $x \in \mathcal{X}$,
$$p[\nu_{\mathbf{X}}(x) \le (1 - \epsilon)p(x)] \le e^{-\epsilon^2 p(x)n/2}$$
and
$$p[\nu_{\mathbf{X}}(x) \ge (1 + \epsilon)p(x)] \le e^{-\epsilon^2 p(x)n/3}. \qquad \Box$$
Off we go.
Lemma 17 $p(\mathbf{X} \in T_\epsilon) \ge 1 - \epsilon(n)$.

Proof: Using the Chernoff Bound,
$$p(\mathbf{X} \notin T_\epsilon) \le \sum_{x \in S_X} p\left[|\nu_{\mathbf{X}}(x) - p(x)| \ge \epsilon p(x)\right] \le 2|S_X|\,e^{-\epsilon^2 \pi_X n/3} = \epsilon(n). \qquad \Box$$

³And from weak typicality, which is less relevant here.
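Lemma 17 can be illustrated empirically. The sketch below (our own experiment, with an arbitrary binary source) draws many i.i.d. blocks and checks that the fraction of robustly typical ones is at least $1 - \epsilon(n)$.

```python
import math, random

# Empirical illustration (our own experiment) of Lemma 17: the fraction
# of eps-robustly-typical blocks should be at least
# 1 - eps(n) = 1 - 2 |S_X| exp(-eps^2 pi_X n / 3).

random.seed(2)
p = {0: 0.7, 1: 0.3}
n, eps, trials = 1000, 0.2, 2000

pi_X = min(p.values())
eps_n = 2 * len(p) * math.exp(-eps ** 2 * pi_X * n / 3)  # about 0.073 here

def robustly_typical(seq):
    return all(abs(seq.count(a) / n - pa) <= eps * pa for a, pa in p.items())

hits = sum(robustly_typical([0 if random.random() < p[0] else 1
                             for _ in range(n)])
           for _ in range(trials))
assert hits / trials >= 1 - eps_n
```

In practice the empirical typical fraction is far above the bound (the Chernoff bound is loose for these parameters), but the lemma's guarantee is what matters for the coding arguments.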
Lemma 18 For every $\mathbf{x} \in T_\epsilon$,
$$2^{-(1+\epsilon)H(X)n} \le p(\mathbf{x}) \le 2^{-(1-\epsilon)H(X)n}.$$
Proof: For every $\mathbf{x} \in \mathcal{X}^n$,
$$p(\mathbf{x}) = 2^{-\left(\sum_{x \in \mathcal{X}} \nu_{\mathbf{x}}(x)\log\frac{1}{p(x)}\right)n} = 2^{-\left(H(X) + \sum_{x \in \mathcal{X}} (\nu_{\mathbf{x}}(x) - p(x))\log\frac{1}{p(x)}\right)n}.$$
For every $\mathbf{x} \in T_\epsilon$,
$$\left|\sum_{x \in \mathcal{X}} (\nu_{\mathbf{x}}(x) - p(x))\log\frac{1}{p(x)}\right| \le \epsilon \sum_{x \in \mathcal{X}} p(x)\log\frac{1}{p(x)} = \epsilon H(X). \qquad \Box$$
Lemma 19 $(1 - \epsilon(n))\,2^{(1-\epsilon)H(X)n} \le |T_\epsilon| \le 2^{(1+\epsilon)H(X)n}$.

Proof: Using Lemma 18,
$$1 \ge \sum_{\mathbf{x} \in T_\epsilon} p(\mathbf{x}) \ge |T_\epsilon|\,2^{-(1+\epsilon)H(X)n}.$$
Conversely, incorporating Lemma 17,
$$1 - \epsilon(n) \le \sum_{\mathbf{x} \in T_\epsilon} p(\mathbf{x}) \le |T_\epsilon|\,2^{-(1-\epsilon)H(X)n}. \qquad \Box$$

We now turn to pairs of random variables. The next corollary summarizes some implications of the general results proven in the previous lemmas. The ensuing results are specific to random pairs.
Corollary 2

1. For every $(\mathbf{x}, \mathbf{y}) \in T_\epsilon$, $\ 2^{-(1+\epsilon)H(X,Y)n} \le p(\mathbf{x}, \mathbf{y}) \le 2^{-(1-\epsilon)H(X,Y)n}$.
2. $p((\mathbf{X}, \mathbf{Y}) \in T_\epsilon) \ge 1 - \epsilon(n)$.
3. $(1 - \epsilon(n))\,2^{(1-\epsilon)H(X,Y)n} \le |T_\epsilon| \le 2^{(1+\epsilon)H(X,Y)n}$. $\Box$
Using the results in Lemma 18 and Part (1) of the corollary, we can bound $p(\mathbf{y}|\mathbf{x})$ for every $(\mathbf{x}, \mathbf{y}) \in T_\epsilon$. However, an adaptation of the proof of Lemma 18 yields a stronger bound:

Lemma 20 For all $(\mathbf{x}, \mathbf{y}) \in T_\epsilon$,
$$2^{-(1+\epsilon)H(Y|X)n} \le p(\mathbf{y}|\mathbf{x}) \le 2^{-(1-\epsilon)H(Y|X)n}. \qquad \Box$$
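Lemma 20 can be spot-checked by sampling. The sketch below (our own experiment, with an arbitrary joint distribution) draws blocks until one is jointly $\epsilon$-r.t., then verifies in the log domain that $p(\mathbf{y}|\mathbf{x})$ lies within the stated bounds.

```python
import math, random

# Numerical spot check (our own experiment) of Lemma 20: for a jointly
# robustly typical pair (x, y), p(y|x) lies between 2^{-(1 +/- eps) H(Y|X) n}.

random.seed(3)
p = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}
px = {0: 0.6, 1: 0.4}                       # marginal of the first coordinate
n, eps = 2000, 0.2
H_Y_given_X = -sum(pab * math.log2(pab / px[a]) for (a, b), pab in p.items())

def sample_pair():
    r = random.random()
    for ab, q in p.items():
        r -= q
        if r < 0:
            return ab
    return (1, 1)

while True:   # draw blocks until one is eps-robustly typical (very likely)
    block = [sample_pair() for _ in range(n)]
    if all(abs(block.count(ab) / n - q) <= eps * q for ab, q in p.items()):
        break

# log2 p(y|x) = sum_i log2 p(y_i | x_i), computed in the log domain
log_p_y_given_x = sum(math.log2(p[ab] / px[ab[0]]) for ab in block)
assert -(1 + eps) * H_Y_given_X * n <= log_p_y_given_x
assert log_p_y_given_x <= -(1 - eps) * H_Y_given_X * n
```

Working in the log domain is essential here: for $n = 2000$ the probability $p(\mathbf{y}|\mathbf{x}) \approx 2^{-nH(Y|X)}$ underflows any floating-point representation.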
Lemma 21 $T_{(X,Y),\epsilon} \subseteq T_{X,\epsilon} \times T_{Y,\epsilon}$.

Proof: We need to show that if $(\mathbf{x}, \mathbf{y}) \in T_\epsilon$ then (using shameless notation) $\mathbf{x}, \mathbf{y} \in T_\epsilon$. For every $x \in \mathcal{X}$,
$$|\nu_{\mathbf{x}}(x) - p(x)| = \left|\sum_{y \in \mathcal{Y}} \left(\nu_{\mathbf{x},\mathbf{y}}(x, y) - p(x, y)\right)\right| \le \sum_{y \in \mathcal{Y}} \epsilon\, p(x, y) = \epsilon\, p(x),$$
therefore $\mathbf{x} \in T_\epsilon$. Similarly, $\mathbf{y} \in T_\epsilon$. $\Box$
Recall that the ensuing lemmas hold for all $n \in \mathbb{N}$ and all $0 < \epsilon_1 < \epsilon_2 \le 1$, and that
$$\epsilon_{1,2}(n) \stackrel{\rm def}{=} 2|S_{X,Y}|\,e^{-\left(\frac{\epsilon_2 - \epsilon_1}{1 + \epsilon_1}\right)^2 \pi_{X,Y}\, n/3}.$$
Part (2) of Corollary 2 says that with high probability $(\mathbf{X}, \mathbf{Y}) \in T_\epsilon$. The same holds when we condition on every typical value of $\mathbf{X}$:

Lemma 22 For every $\mathbf{x} \in T_{\epsilon_1}$,
$$p\left[(\mathbf{x}, \mathbf{Y}) \in T_{\epsilon_2} \mid \mathbf{X} = \mathbf{x}\right] \ge 1 - \epsilon_{1,2}(n).$$
Proof: Let $\mathbf{x} \in T_{\epsilon_1}$, namely, for every $x \in \mathcal{X}$,
$$(1 - \epsilon_1)p(x) \le \nu_{\mathbf{x}}(x) \le (1 + \epsilon_1)p(x). \tag{5}$$
We show that if $\mathbf{X} = \mathbf{x}$ then for every $(x, y)$, $\nu_{\mathbf{x},\mathbf{Y}}(x, y)$ is close to $p(x, y)$ with high probability. If $p(x, y) = 0$, then $\nu_{\mathbf{x},\mathbf{Y}}(x, y) = \nu_{\mathbf{X},\mathbf{Y}}(x, y) = 0 = p(x, y)$. If $\nu_{\mathbf{x}}(x) = 0$, then, by (5), $p(x, y) \le p(x) = 0$ and again $\nu_{\mathbf{x},\mathbf{Y}}(x, y) = p(x, y)$. For $(x, y) \in S_{X,Y}$ such that $\nu_{\mathbf{x}}(x) \ne 0$ we apply the Chernoff bound. The expected value of $\nu_{\mathbf{x},\mathbf{Y}}(x, y)$ is $\nu_{\mathbf{x}}(x)\,p(y|x)$. Hence,
$$p\left[\nu_{\mathbf{x},\mathbf{Y}}(x, y) \ge (1 + \epsilon_2)p(x, y)\right] = p\left[\nu_{\mathbf{x},\mathbf{Y}}(x, y) \ge (1 + \epsilon_2)\frac{p(x)}{\nu_{\mathbf{x}}(x)}\,\nu_{\mathbf{x}}(x)\,p(y|x)\right]$$