Algorithmic Aspects of Optimal Channel Coding

Siddharth Barman∗

Omar Fawzi†

arXiv:1508.04095v1 [cs.IT] 17 Aug 2015

Abstract

A central question in information theory is to determine the maximum success probability that can be achieved in sending a fixed number of messages over a noisy channel. This was first studied in the pioneering work of Shannon, who established a simple expression characterizing this quantity in the limit of multiple independent uses of the channel. Here we consider the general setting with only one use of the channel. We observe that the maximum success probability can be expressed as the maximum value of a submodular function. Using this connection, we establish the following results:

1. There is a simple greedy polynomial-time algorithm that computes a code achieving a (1 − e^{−1})-approximation of the maximum success probability. Moreover, for this problem it is NP-hard to obtain an approximation ratio strictly better than (1 − e^{−1}).

2. Shared quantum entanglement between the sender and the receiver can increase the success probability by a factor of at most 1/(1 − e^{−1}). In addition, this factor is tight if one allows an arbitrary non-signaling box between the sender and the receiver.

3. We give tight bounds on the one-shot performance of the meta-converse of Polyanskiy-Poor-Verdú.

1 Introduction

One of the central threads of research in information theory is the study of the minimum error probability that can be achieved in sending a fixed number of messages over a noisy channel. This task can be formulated as the maximization—over valid encoders and decoders—of the probability of correctly determining the sent message (which we refer to as success probability for the rest of the paper). In his seminal work, Shannon [20] showed that for n independent copies of the channel, this question is almost completely answered by a single number, C, the capacity of the channel. In fact, for a number of messages k(n) satisfying sup_n (log k(n))/n < C, the maximum success probability tends to 1 as n tends to infinity, and when inf_n (log k(n))/n > C, the maximum success probability tends to 0 as n tends to infinity. Here, we study the algorithmic aspects of determining the optimal encoder and decoder which lead to the maximum success probability over a noisy channel in the non-asymptotic regime. Recently, there has been significant interest in the information theory literature in understanding the non-asymptotic behavior when the number of channel uses n is finite [21, 10, 16, 23]. But, instead of focusing on closed-form expressions for the maximum rate at which information can be transmitted, we rather ask how well the optimal rates of transmission can be computed with an efficient algorithm. One way to formulate the computational problem is as follows. The input of the problem is an integer k (which denotes the total number of messages) together with the description of a channel W that maps elements in X to elements in Y; specifically, for each y ∈ Y and x ∈ X, we have W(y|x), which is the probability of receiving the symbol y when the symbol x is transmitted. The objective is to determine the maximum

∗ California Institute of Technology. [email protected]
† UMR 5668 LIP - ENS Lyon - CNRS - UCBL - INRIA, Université de Lyon and California Institute of Technology. [email protected]


success probability,¹ S(W, k), for transmitting k distinct messages using the channel W (only once). This algorithmic formulation leads us to interesting implications that are described below. We refer to this problem of determining the maximum success probability of a given noisy channel as the optimal channel coding problem. Our more specific findings can be summarized as follows:

• We observe that the problem of computing the optimal success probability S(W, k) corresponds to a submodular maximization problem with cardinality constraints (Proposition 2.1), which implies that this quantity can be efficiently approximated using a simple greedy algorithm that achieves a (1 − e^{−1})-approximation ratio. As the maximum coverage problem can be reduced to the optimal channel coding problem, we also find that it is NP-hard to approximate S(W, k) within a factor larger than (1 − e^{−1}).

• The natural linear programming relaxation of the optimal channel coding problem is well studied in the information theory literature under a different name. It corresponds to the well-known meta-converse of Polyanskiy-Poor-Verdú (PPV) [16], which puts an upper bound on the maximum success probability of sending k messages.² Matthews [13] showed that this linear programming relaxation corresponds to the maximum success probability, S^NS(W, k), when the sender and the receiver are allowed to share any non-signaling box, i.e., a (hypothetical) device taking inputs from both parties and generating outputs for both parties in a manner that makes it, by itself, useless for communication. Our main finding is an upper bound on the integrality gap for the linear programming relaxation of the optimal channel coding problem (Theorem 3.1). In particular, for any channel W and integers k and ℓ, we have:

    (k/ℓ) (1 − e^{−ℓ/k}) S^NS(W, k)  ≤  S(W, ℓ)  ≤  S^NS(W, ℓ).                             (1)

When ℓ = k, this inequality says that the ratio of the optimal success probability to the meta-converse is at least (1 − e^{−1}). More generally, if a better guarantee on the success probability is desired, this can be achieved at the expense of taking a number of messages ℓ that is slightly smaller than k. This bound can be seen as a bicriterion upper bound on the integrality gap, highlighting the tradeoff between the two important parameters: success probability and number of messages. We note that it is important for our applications to analyze the linear programming relaxation and not only understand the performance of the greedy algorithm for the optimal channel coding problem. The bound (1) can be proved by combining a result of [6, 14] together with an important observation that can be found in [12, Theorem 1.5]. We provide a self-contained elementary proof of the result. Using a family of examples from [1] gives a family of channels for which the guarantees in (1) are tight.

• As quantum entanglement cannot be used for signaling, the inequality (1) puts a limit on the usefulness of entanglement between the sender and the receiver for the problem of coding over a noisy (classical) channel. The fact that entanglement could improve success probabilities was highlighted by [19, 7]. In this paper we in fact obtain an explicit upper bound on the improvement that can be achieved via entanglement. The bound (1) with ℓ = k addresses a question asked by Hemenway et al. [11], who proved that for the special case of transmitting a single bit, i.e., k = 2, the ratio of the quantum advantage to the classical advantage is at most 2.³ For an explicit generalization of their result to arbitrary values of k, see inequality (19), which is a consequence of our main theorem.

¹ In this paper, we focus on the success probability averaged over the k messages.
² Technically, the PPV bound gives an upper bound on the number of messages for a desired success probability. It is, however, simple to adapt it to maximizing the success probability for a fixed number of messages. The reason for the usefulness of the meta-converse is that it can be evaluated analytically in many settings of interest, in particular for n independent channel uses with non-asymptotic values of n; see [15, 22] and references therein for an overview of the active area of finite-blocklength analysis.

1.1 Outline

In Section 2 we establish that the optimal channel coding problem corresponds to submodular maximization with cardinality constraints. Though this connection is, in and of itself, direct, it provides a useful starting point. In particular, we extend this connection in Section 3 to obtain interesting implications for channel coding with non-signaling boxes, and in particular quantum entanglement.

2 Optimal channel coding as a submodular maximization problem

Given a noisy channel W whose input and output alphabets are X and Y respectively, along with an integer k, our aim is to build an encoder and a decoder that can transmit k distinct messages with the smallest error probability averaged over the messages. We can describe a (possibly randomized) encoder e taking as input a message i ∈ [k] and mapping it to x ∈ X with probability e(x|i). Similarly, a decoder d takes as input some y ∈ Y and outputs i ∈ [k] with probability d(i|y). The maximum success probability S(W, k) of sending k messages using the noisy channel W can be written as the following optimization program over encoding and decoding maps.

    S(W, k)  :=  maximize_{e,d}   (1/k) ∑_{x,y,i} e(x|i) W(y|x) d(i|y)
                 subject to       ∑_x e(x|i) = 1          ∀ i ∈ [k]
                                  ∑_i d(i|y) = 1          ∀ y ∈ Y                           (2)
                                  0 ≤ e(x|i) ≤ 1          ∀ (x, i) ∈ X × [k]
                                  0 ≤ d(i|y) ≤ 1          ∀ (i, y) ∈ [k] × Y.

The next proposition is a simple but important observation that the problem described in (2) is of a very well-studied type: maximizing a submodular function subject to a cardinality constraint.

Proposition 2.1. Let W be a channel, with input alphabet X and output alphabet Y, and k ≥ 1 be an integer. Then we have

    S(W, k)  =  (1/k) max_{S ⊆ X, |S| ≤ k} f_W(S),   where f_W : 2^X → R_+ is defined by f_W(S) = ∑_{y ∈ Y} max_{x ∈ S} W(y|x).       (3)

Moreover, f_W is monotone and submodular, i.e., for any S ⊆ T ⊂ X and x ∉ T,

    Monotone:      f_W(T ∪ {x}) ≥ f_W(T)                                                    (4)
    Submodular:    f_W(S ∪ {x}) − f_W(S) ≥ f_W(T ∪ {x}) − f_W(T).                           (5)

Proof. The monotonicity of f_W is clear. For the submodularity, let S ⊆ T ⊂ X. For any u ∉ T,

    (f_W(S ∪ {u}) − f_W(S)) − (f_W(T ∪ {u}) − f_W(T))                                                                        (6)
      = ∑_y [ (max_{x ∈ S∪{u}} W(y|x) − max_{x ∈ S} W(y|x)) − (max_{x ∈ T∪{u}} W(y|x) − max_{x ∈ T} W(y|x)) ].               (7)

³ Note that in the case k = 2, the quantity S(W, 2) can be written in a very explicit form as a function of the maximum total variation distance between pairs of distributions W(·|x₁) and W(·|x₂); see [11, Proposition 1] for more details.


For each y ∈ Y, we distinguish two cases. If max_{x ∈ T∪{u}} W(y|x) = W(y|u), the expression reduces to max_{x ∈ T} W(y|x) − max_{x ∈ S} W(y|x) ≥ 0. The second case is that the maximum is achieved in T, i.e., max_{x ∈ T∪{u}} W(y|x) = W(y|x) for some x ∈ T. In this case max_{x ∈ T∪{u}} W(y|x) − max_{x ∈ T} W(y|x) = 0 and the above expression is clearly non-negative by monotonicity of f_W.

We now show that S(W, k) ≥ (1/k) max_{S ⊆ X, |S| ≤ k} f_W(S). For this, choose S ⊆ X of size ℓ ≤ k (as f_W is monotone, we can assume that the maximum is attained for some S of size exactly min(k, |X|)). Then arbitrarily order the elements in S = {x_1, ..., x_ℓ}. Define e(x_i|i) = 1 for all i ∈ [ℓ] and set all the other variables e(·|i) for i ∈ [ℓ] to zero. For i ∈ {ℓ+1, ..., k}, set e(x|i) in an arbitrary way that satisfies the normalization constraint. Then, for any y define m(y) to be the i ∈ [ℓ] such that W(y|x_i) is maximum (in case of multiple i's with the same maximum value, choose the smallest i). Then set d(m(y)|y) = 1 for all y ∈ Y and zero for all other entries in d. Clearly e and d satisfy the constraints. Moreover,

    (1/k) ∑_{i ∈ [k], x, y} W(y|x) e(x|i) d(i|y)  ≥  (1/k) ∑_{i ∈ [ℓ], y} W(y|x_i) d(i|y)            (8)
                                                   =  (1/k) ∑_y W(y|x_{m(y)})                        (9)
                                                   =  (1/k) ∑_y max_{x ∈ S} W(y|x),                  (10)

which leads to the desired result.

For the other direction, if we define x_i to be the symbol that maximizes ∑_y W(y|x) d(i|y), we have

    ∑_{i,x} e(x|i) ∑_y W(y|x) d(i|y)  ≤  ∑_i max_x ∑_y W(y|x) d(i|y)                                (11)
                                       =  ∑_i ∑_y W(y|x_i) d(i|y)                                    (12)
                                       ≤  ∑_y max_{i ∈ [k]} W(y|x_i)                                 (13)
                                       ≤  max_{S ⊆ X, |S| ≤ k} ∑_y max_{x ∈ S} W(y|x).               (14)   □
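To make the combinatorial formulation concrete, here is a minimal sketch in Python (our own illustration, not code from the paper) of the set function f_W in (3) and of S(W, k) computed by brute force over subsets. It assumes the channel is stored as a |Y| × |X| numpy array with W[y, x] = W(y|x); the helper names f_W and s_exact are ours, and the enumeration is only practical for tiny alphabets.

    # Sketch: f_W from (3) and exact S(W, k) by enumerating subsets (tiny alphabets only).
    from itertools import combinations
    import numpy as np

    def f_W(W, S):
        """f_W(S) = sum_y max_{x in S} W(y|x), with f_W(empty set) = 0."""
        S = list(S)
        return float(W[:, S].max(axis=1).sum()) if S else 0.0

    def s_exact(W, k):
        """S(W, k) = (1/k) max_{|S| <= k} f_W(S); by monotonicity a set of size min(k, |X|) suffices."""
        nx = W.shape[1]
        return max(f_W(W, S) for S in combinations(range(nx), min(k, nx))) / k

    # Example: binary symmetric channel with crossover probability 0.1.
    W_bsc = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
    print(s_exact(W_bsc, 2))   # 0.9: use both inputs and decode each output to the likelier input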

Using the notable result of [14], the formulation in (3) immediately shows that the quantity S(W, k) can be approximated efficiently within a factor of (1 − e^{−1}). In fact, this can be achieved with a very simple greedy algorithm. Starting with the set S_0 = ∅, the subset S_{ℓ+1} ⊆ X is constructed from the subset S_ℓ ⊆ X by adding an element x_{ℓ+1} that maximizes f_W(S_ℓ ∪ {x_{ℓ+1}}), so that S_{ℓ+1} = S_ℓ ∪ {x_{ℓ+1}}. Let S^greedy(W, k) be the success probability obtained by the greedy algorithm on input W and k. With the notation used above and the function f_W defined in (3), we have S^greedy(W, k) = (1/k) f_W(S_k); here S_k is the size-k subset computed by the greedy algorithm.

Corollary 2.2. For any channel W and any k,

    (1 − e^{−1}) S(W, k)  ≤  S^greedy(W, k)  ≤  S(W, k).                                    (15)

Moreover, for any ε > 0, there is no polynomial-time algorithm approximating S(W, k) within a multiplicative factor of (1 − e^{−1} + ε) unless P = NP.

Proof. The fact that the greedy algorithm provides a (1 − e^{−1})-approximation algorithm for S follows directly from [14]. In Section 3, we provide a proof of a strengthening of this result.

For the hardness of the problem, we use the hardness of the maximum-k-coverage problem [8]. In the maximum-k-coverage problem, we are given a collection of sets {T_x}_{x ∈ X} of elements in Y (i.e., T_x ⊆ Y for each x ∈ X) and the objective is to find a subset S ⊆ X of size k such that |∪_{x ∈ S} T_x| is as large as possible. Feige [8] showed that this problem is hard to approximate within a factor better than (1 − e^{−1}). In fact, as highlighted in [9], the problem remains hard if the sets T_x all have the same size, call it d. Given such an instance we define the following channel: W(y|x) = 1/d if y ∈ T_x and 0 otherwise. Then for any choice S ⊆ X, we have |∪_{x ∈ S} T_x| = ∑_y max_{x ∈ S} W(y|x) · d = d · f_W(S). This shows the desired result.   □
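To illustrate, here is a sketch of the greedy algorithm analyzed above, under the same assumed numpy conventions as the previous snippet (W[y, x] = W(y|x)); by (15) the value it returns is within a factor (1 − e^{−1}) of S(W, k).

    # Sketch: greedy construction of S_k, returning S^greedy(W, k) = f_W(S_k)/k and the set S_k.
    import numpy as np

    def s_greedy(W, k):
        ny, nx = W.shape
        best = np.zeros(ny)          # best[y] = max_{x in S} W(y|x); starts at f_W(empty set) = 0
        S = []
        for _ in range(min(k, nx)):
            # marginal gains f_W(S ∪ {x}) − f_W(S) for every candidate symbol x
            gains = [np.maximum(best, W[:, x]).sum() - best.sum() for x in range(nx)]
            x_star = int(np.argmax(gains))
            S.append(x_star)
            best = np.maximum(best, W[:, x_star])
        return best.sum() / k, S

Maintaining the pointwise maximum best[y] avoids recomputing f_W from scratch at every step, which is the standard way to implement the greedy rule for coverage-type objectives.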

3 Channel coding with non-signaling boxes

A key motivation for this work is to understand the advantage that can be achieved when the sender and the receiver share additional resources that are by themselves useless for communication. Such resources are commonly called non-signaling boxes [17]. The simplest example of a non-signaling box is a device providing shared randomness between the sender and the receiver. It is quite simple to see that allowing the encoder e and decoder d to depend on some shared randomness will not affect the value of (2). However, if the sender and the receiver share entanglement, we know that for some channels, a success probability S_Q(W, k) that exceeds S(W, k) can be achieved [19, 7]. This is analogous to a Bell inequality violation [5], or in other words the fact that the entangled value of a 2-prover 1-round game can be larger than the classical value.⁴ A natural question to ask here is how much entanglement (or, in general, a non-signaling resource) between the sender and receiver can help for reliable transmission. For example, is there a choice of channel for which the success probability with entanglement is close to 1 and without entanglement is very small? Our main result shows that this cannot be the case and that the ratio S(W, k)/S_Q(W, k) ≥ 1 − e^{−1}.

As S_Q(W, k) does not seem to be easy to analyze,⁵ it is helpful to consider even more general resources between the sender and the receiver. Allowing for any non-signaling box between the sender and the receiver leads to a very natural linear programming (LP) relaxation of (2):

    S^NS(W, k)  :=  maximize_{r_{x,y}, p_x}   (1/k) ∑_{x,y} W(y|x) r_{x,y}
                    subject to                 ∑_x r_{x,y} ≤ 1              ∀ y ∈ Y
                                               ∑_x p_x = k                                  (16)
                                               r_{x,y} ≤ p_x                ∀ (x, y) ∈ X × Y
                                               0 ≤ r_{x,y}, p_x ≤ 1         ∀ (x, y) ∈ X × Y.

Matthews [13] showed that S^NS(W, k) corresponds to the maximum success probability that can be achieved if the sender and the receiver are allowed to use any additional non-signaling resource. This means that they can use any additional box as long as it does not allow the sender and the receiver to send information to each other; see [17] and references therein for a review of the study of non-signaling boxes. For the convenience of the reader, we repeat the proof of [13] in Appendix A.⁶

⁴ In fact, the problem of optimal channel coding that we are studying can be seen as a kind of game. The input of the sender is the label i of the message and his output is an element x ∈ X. The input of the receiver is some y ∈ Y and his output is a label j of some message. The difference is that the way one would normalize for a game is different from the way we do it here. Also, in our setting, the referee's output is not necessarily 0 or 1 but rather a utility for each input-output pair that depends on the channel probabilities W(y|x).
⁵ We only know of a hierarchy of semidefinite programs that converges to this value [4].
⁶ Note that we also have a small modification compared to the LP of [13] in that we require p_x ≤ 1. We can safely add this constraint provided k ≤ |X|, which we assume throughout this section.
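As an aside, the LP (16) is small enough to be handed to an off-the-shelf solver. The following is a sketch (not the authors' code) using scipy.optimize.linprog, with the channel stored as a |Y| × |X| array W[y, x] = W(y|x) as in the earlier snippets; it assumes k ≤ |X| so that the constraint p_x ≤ 1 may be added, as in footnote 6, and the dense constraint matrix limits it to modest alphabet sizes.

    # Sketch: evaluate the non-signaling value S^NS(W, k) of (16) with scipy's LP solver.
    import numpy as np
    from scipy.optimize import linprog

    def s_ns(W, k):
        ny, nx = W.shape
        n_r = nx * ny                    # variables r_{x,y}, flattened with index x*ny + y
        n_vars = n_r + nx                # followed by the variables p_x

        c = np.zeros(n_vars)             # linprog minimizes, so negate the objective
        for x in range(nx):
            for y in range(ny):
                c[x * ny + y] = -W[y, x] / k

        A_ub, b_ub = [], []
        for y in range(ny):              # sum_x r_{x,y} <= 1 for every y
            row = np.zeros(n_vars)
            row[[x * ny + y for x in range(nx)]] = 1.0
            A_ub.append(row); b_ub.append(1.0)
        for x in range(nx):              # r_{x,y} <= p_x for every (x, y)
            for y in range(ny):
                row = np.zeros(n_vars)
                row[x * ny + y] = 1.0
                row[n_r + x] = -1.0
                A_ub.append(row); b_ub.append(0.0)

        A_eq = np.zeros((1, n_vars)); A_eq[0, n_r:] = 1.0   # sum_x p_x = k
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=A_eq, b_eq=[float(k)], bounds=[(0.0, 1.0)] * n_vars)
        return -res.fun

Together with the s_exact and s_greedy sketches of Section 2, this allows a direct numerical comparison of S, S^greedy and S^NS on small channels.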


[Figure 1 here: a plot of success probability (from 0 to 1) against the number of messages (from 1 to |X|), showing the curve S^NS above the curve S; at |X| messages both curves take the value (1/|X|) ∑_y max_{x ∈ X} W(y|x).]

Figure 1: This plot illustrates the maximum success probability as a function of the number of messages to be sent. The top curve corresponds to the maximum success probability if any non-signaling box between the sender and the receiver is allowed. The bottom one corresponds to the setting where the sender and the receiver do not share any additional resources. Theorem 3.1 states that for any value ℓ of the number of messages, the ratio between the non-signaling success probability and the usual success probability is at most 1/(1 − e^{−1}). It also gives a way of comparing the two values for different numbers of messages to be sent.

As pointed out in [13], a form of the LP (16) is widely known in the information theory literature as the Polyanskiy-Poor-Verdú meta-converse [16], which gives a bound on the number of messages that can be sent through a channel in terms of some hypothesis test. In Appendix B, we basically repeat the argument of [13] to show how to interpret the LP (16) in terms of a hypothesis test.

Our main result shows that the LP relaxation (16) cannot be too far from the maximum success probability in (2).

Theorem 3.1. Let W be a channel. Then, for any integers ℓ, k ∈ {1, ..., |X|} we have:

    S(W, ℓ)  ≥  (k/ℓ) (1 − (1 − 1/k)^ℓ) S^NS(W, k).                                         (17)

Figure 1 gives an illustration for the statement of the theorem.

Comments on the proof. The ℓ = k case of this theorem can be proved by using a result of [6] relating the performance of the greedy algorithm and the linear programming relaxation for the location problem. In fact, the expression in (3) shows that the optimal channel coding problem can be written as a location problem. To obtain the tradeoff between success probability and number of messages, we use the observation about the greedy algorithm that can be found in [12, Theorem 1.5]. A complete proof of this theorem appears in Section 3.1. The bound we prove is in fact slightly stronger than (17) and can be written as

    S(W, ℓ)  ≥  S^greedy(W, ℓ)  ≥  (k/ℓ) (1 − (1 − 1/k)^ℓ) S^NS(W, k).                      (18)

In particular, with this strengthening, we know that the greedy algorithm gives a (1 − e^{−1})-approximation even for the maximum success probability S_Q(W, k) using entanglement.

Another relevant observation concerning the proof is that we can in fact obtain a multiplicative bound between centered quantities. As S(W, k) ≥ 1/k for any channel W (the decoder can just randomly guess a message), it might be useful to consider the ratio of S(W, k) − 1/k to S^NS(W, k) − 1/k, in particular for small values of k. A simple modification in the proof leads to the bound:

    S(W, k) − 1/k  ≥  (1 − (1 − 1/k)^{k−1}) (S^NS(W, k) − 1/k).                             (19)

This inequality generalizes the bounds obtained by Hemenway et al. [11], who considered the case k = 2.

Comments on the ratio. The pre-factor in the right-hand side of (17) can be simplified using the following inequalities:

    (k/ℓ) (1 − (1 − 1/k)^ℓ)  ≥  (k/ℓ) (1 − e^{−ℓ/k}),                                       (20)

and when ℓ/k ≪ 1, a good approximation is given by the following inequality:

    (k/ℓ) (1 − e^{−ℓ/k})  ≥  1 − ℓ/(2k).                                                    (21)

The bound shows, for example, that if it is possible to send log k bits of information with a success probability S^NS(W, k) = 1 − ε using non-signaling boxes, then it is also possible to send log k − 10 bits of information with a success probability of at least 0.998 · (1 − ε) without using any additional resources. We show in Section 3.2 that the bound (17) is tight.

Multiple independent uses of a channel. It has been known for a long time that asymptotically, for n channel uses with n → ∞, entanglement cannot increase the capacity C(W) of a noisy classical channel W [3, 2]. In fact, this is easily recovered from Theorem 3.1. Let R be a rate achievable using a non-signaling box, i.e., S^NS(W^{⊗n}, R^n) → 1 as n → ∞. We will show that R ≤ C(W). In fact, let δ > 0. Then, using Theorem 3.1 together with (21), we get

    S(W^{⊗n}, (R(1 − δ))^n)  ≥  (1 − (1/2) ((R(1 − δ))/R)^n) S^NS(W^{⊗n}, R^n),             (22)

and thus S(W^{⊗n}, (R(1 − δ))^n) → 1, which shows that R(1 − δ) is an achievable rate for the channel W (without using any non-signaling boxes). By definition of the capacity C(W), this means that R(1 − δ) ≤ C(W). As this holds for any δ, this shows that R ≤ C(W).

3.1 Proof of Theorem 3.1

In comparison with Corollary 2.2, we prove a stronger result relating the performance of the greedy algorithm to the value of the linear programming relaxation. For that, it is useful to define the following extension of the function f_W defined in (3) to fractional vectors p ∈ [0, 1]^{|X|}:

    f_W(p)  :=  maximize_{r_{x,y}}   ∑_{x,y} W(y|x) r_{x,y}
                subject to           ∑_x r_{x,y} ≤ 1         ∀ y ∈ Y                        (23)
                                     0 ≤ r_{x,y} ≤ p_x       ∀ (x, y) ∈ X × Y.

With this notation, we can write

    S^NS(W, k)  =  (1/k) max_{p_x ≥ 0, ∑_x p_x = k} f_W(p).                                 (24)

The following lemma is crucial in proving Theorem 3.1.

Lemma 3.2. Let S ⊆ X and let p ∈ [0, 1]^{|X|} be a vector such that ∑_x p_x = k. Then

    f_W(p)  ≤  f_W(S) + k (max_{x ∈ X} f_W(S ∪ {x}) − f_W(S)).                              (25)

Proof. Define, for any x ∈ X, q_x = max{p_x, 1_{x ∈ S}}. Note that q_x ≥ p_x for all x, and so f_W(p) ≤ f_W(q). We aim to show the stronger statement f_W(q) ≤ f_W(S) + k (max_{x ∈ X} f_W(S ∪ {x}) − f_W(S)). Let r_{x,y} correspond to an optimal solution for the program defining f_W(q); in particular f_W(q) = ∑_{x,y} W(y|x) r_{x,y}. We have

    f_W(q) − f_W(S)  =  ∑_{x,y} W(y|x) r_{x,y} − ∑_y max_{x' ∈ S} W(y|x')                                        (26)
                      =  ∑_y ( ∑_x W(y|x) r_{x,y} − max_{x' ∈ S} W(y|x') )                                       (27)
                      ≤  ∑_y ( ∑_{x ∈ S} W(y|x) r_{x,y} − (∑_{x ∈ S} r_{x,y}) max_{x' ∈ S} W(y|x')
                               + ∑_{x ∉ S} W(y|x) r_{x,y} − (∑_{x ∉ S} r_{x,y}) max_{x' ∈ S} W(y|x') ),          (28)

using the fact that ∑_x r_{x,y} ≤ 1. Now observe that ∑_{x ∈ S} W(y|x) r_{x,y} − (∑_{x ∈ S} r_{x,y}) max_{x' ∈ S} W(y|x') ≤ 0, and thus

    f_W(q) − f_W(S)  ≤  ∑_y ( ∑_{x ∉ S} W(y|x) r_{x,y} − (∑_{x ∉ S} r_{x,y}) max_{x' ∈ S} W(y|x') )              (29)
                      =  ∑_y ∑_{x ∉ S} r_{x,y} (W(y|x) − max_{x' ∈ S} W(y|x'))                                   (30)
                      ≤  ∑_{x ∉ S} ∑_{y ∈ Γ(x)} r_{x,y} (W(y|x) − max_{x' ∈ S} W(y|x')),                         (31)

where Γ(x) = {y : W(y|x) − max_{x' ∈ S} W(y|x') > 0}. For x ∉ S, we have q_x = p_x and thus r_{x,y} ≤ p_x. As a result,

    f_W(q) − f_W(S)  ≤  ∑_{x ∉ S} p_x ∑_{y ∈ Γ(x)} (W(y|x) − max_{x' ∈ S} W(y|x'))                               (32)
                      ≤  k · ∑_{y ∈ Γ(x*)} (W(y|x*) − max_{x' ∈ S} W(y|x')),                                     (33)

where x* ∉ S maximizes the quantity ∑_{y ∈ Γ(x)} (W(y|x) − max_{x' ∈ S} W(y|x')). Now observe that by definition of Γ(x*), we have

    f_W(S ∪ {x*}) − f_W(S)  =  ∑_{y ∈ Γ(x*)} (W(y|x*) − max_{x' ∈ S} W(y|x')).                                   (34)

Combining this with (33), we get the desired result.   □

Proof [of Theorem 3.1]. Using Lemma 3.2, we can apply the framework of [14] for analyzing the performance of the greedy algorithm. Recall the notation introduced for the greedy algorithm: starting from S_0 = ∅, S_{ℓ+1} is constructed from S_ℓ by adding an element x_{ℓ+1} that maximizes f_W(S_ℓ ∪ {x_{ℓ+1}}), so that S_{ℓ+1} = S_ℓ ∪ {x_{ℓ+1}}. Fix an integer i_0 ∈ {0, ..., |X|} (think of i_0 = 0, but it will also be useful to choose i_0 = 1). We prove by induction on ℓ that for any ℓ ≥ i_0,

    max_p f_W(p) − f_W(S_ℓ)  ≤  (1 − 1/k)^{ℓ − i_0} (max_p f_W(p) − f_W(S_{i_0})),          (35)

where the maximizations are taken over all p such that p_x ≥ 0 and ∑_x p_x = k. The base case ℓ = i_0 is clear. Using Lemma 3.2 together with the fact that f_W(S_{ℓ+1}) = max_{x* ∈ X} f_W(S_ℓ ∪ {x*}) gives

    max_p f_W(p) − f_W(S_ℓ)  ≤  k · (f_W(S_{ℓ+1}) − f_W(S_ℓ)).                              (36)

Rearranging the terms, we see that

    max_p f_W(p) − f_W(S_{ℓ+1})  ≤  (1 − 1/k) (max_p f_W(p) − f_W(S_ℓ))                     (37)
                                  ≤  (1 − 1/k)^{ℓ+1−i_0} (max_p f_W(p) − f_W(S_{i_0})),     (38)

using the induction hypothesis. For i_0 = 0, this can be written as

    f_W(S_ℓ)  ≥  (1 − (1 − 1/k)^ℓ) max_p f_W(p).                                            (39)

Dividing by ℓ and using (24), we have

    S(W, ℓ)  ≥  S^greedy(W, ℓ)  =  (1/ℓ) f_W(S_ℓ)  ≥  (1/ℓ) (1 − (1 − 1/k)^ℓ) max_p f_W(p)  (40)
                                                    =  (k/ℓ) (1 − (1 − 1/k)^ℓ) S^NS(W, k).  (41)

This concludes the proof of the theorem. To prove (19), take i_0 = 1, observe that f_W(S_1) = 1 and set ℓ = k; then (35) becomes

    max_p f_W(p) − f_W(S_k)  ≤  (max_p f_W(p) − 1) (1 − 1/k)^{k−1}.                         (42)

Rearranging the terms, we get the inequality (19).   □

3.2 Tightness of Theorem 3.1

We now prove that the result shown in Theorem 3.1 is tight using a simple family of graphs proposed in [1]. Consider the following channel for integers k, t ≥ 1. The input alphabet X is composed of n := kt symbols and the output alphabet Y is composed of \binom{n}{t} symbols that we interpret as subsets of X of size t. On input x, the output of the channel is a randomly chosen y such that x ∈ y:

    W(y|x)  =  1/\binom{n-1}{t-1}   if x ∈ y,    and    W(y|x) = 0   otherwise.             (43)

Note that, interestingly, the case k = t = 2 is exactly the channel that is studied in [19], in which it is experimentally demonstrated that entanglement assistance can help in improving the success probability for sending a bit over this channel.

We first show that S^NS(W, k) = 1. For this, let p_x = k/n for all x ∈ X and r_{x,y} = k/n if x ∈ y and r_{x,y} = 0 otherwise. We have ∑_x r_{x,y} = t · (k/n) = 1. Moreover,

    (1/k) ∑_{x,y} W(y|x) r_{x,y}  =  (1/k) ∑_y ∑_{x ∈ y} (1/\binom{n-1}{t-1}) · (k/n)  =  (1/n) \binom{n}{t} · t · (1/\binom{n-1}{t-1})  =  1.       (44)

Using the symmetry of the channel, it is simple to determine S(W, ℓ) for any 1 ≤ ℓ ≤ n. In fact, one can see that f_W(S) only depends on the size of S; let S_ℓ be any fixed set of size ℓ:

    S(W, ℓ)  =  (1/ℓ) max_{|S| = ℓ} f_W(S)  =  (1/ℓ) ∑_{y ∈ Y : y ∩ S_ℓ ≠ ∅} 1/\binom{n-1}{t-1}.                                                     (45)

So we only need to count the number of subsets y that intersect the set S_ℓ of size ℓ. This number is given by \binom{n}{t} − \binom{n-ℓ}{t}. Observing that \binom{n-1}{t-1} = (t/n) \binom{n}{t} and t/n = 1/k, we have

    S(W, ℓ)  =  (k/ℓ) (1 − \binom{n-ℓ}{t} / \binom{n}{t})  =  (k/ℓ) (1 − \binom{n-t}{ℓ} / \binom{n}{ℓ})                                              (46)
              =  (k/ℓ) (1 − ((n−t)(n−t−1) ⋯ (n−ℓ−t+1)) / (n(n−1) ⋯ (n−ℓ+1)))                                                                         (47)
              =  (k/ℓ) (1 − (1 − t/n)(1 − t/(n−1)) ⋯ (1 − t/(n−ℓ+1))).                                                                               (48)

From this expression we see that the bound of Theorem 3.1 is approached, for example, by fixing ℓ and k to be constants and letting t → ∞: the expression then approaches

    (k/ℓ) (1 − (1 − 1/k)^ℓ),                                                                                                                         (49)

which exactly matches the bound in Theorem 3.1.
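The family above is small enough to check numerically. The following self-contained sketch (hypothetical helper names, not from the paper) builds the channel (43) for given integers k and t and evaluates S(W, k) by brute force, to be compared with S^NS(W, k) = 1 established in (44).

    # Sketch: the channel (43) and a brute-force evaluation of S(W, k).
    from itertools import combinations
    from math import comb
    import numpy as np

    def tightness_channel(k, t):
        n = k * t                                    # input alphabet X = {0, ..., n-1}
        outputs = list(combinations(range(n), t))    # outputs are the t-subsets of X
        W = np.zeros((len(outputs), n))
        for yi, y in enumerate(outputs):
            for x in y:
                W[yi, x] = 1.0 / comb(n - 1, t - 1)  # W(y|x) = 1/binom(n-1, t-1) if x in y
        return W

    def s_exact(W, k):
        # S(W, k) = (1/k) max_{|S| <= k} sum_y max_{x in S} W(y|x), by enumeration
        nx = W.shape[1]
        return max(W[:, list(S)].max(axis=1).sum()
                   for S in combinations(range(nx), min(k, nx))) / k

    W = tightness_channel(2, 2)       # the k = t = 2 channel studied in [19]
    print(s_exact(W, 2))              # 5/6 by (46), whereas S^NS(W, 2) = 1;
                                      # Theorem 3.1 only guarantees a ratio of at least 3/4 here

Larger values of t bring the ratio S(W, k)/S^NS(W, k) closer to the prefactor 1 − (1 − 1/k)^k of Theorem 3.1, as computed in (46)-(49).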

4 Discussion

The main message of this work is to draw a connection between the study of optimal coding for noisy channels in information theory and algorithmic aspects of submodular maximization. We believe this connection could be fruitful in both directions. As we showed in this paper, techniques developed in the context of submodular maximization—and, in general, approximation algorithms—can have interesting implications when analyzing the problem of optimal channel coding. We believe there are many more relevant applications to be explored. A particular question is whether algorithmic techniques can be helpful in obtaining better finite-blocklength bounds for well-studied channels such as the ones given in [15].

Acknowledgements

We would like to thank Mario Berta, Shaddin Dughmi, Volkher Scholz, Piyush Srivastava, Vincent Tan, and Marco Tomamichel for discussions on the topic of this paper.


A Non-signaling assisted channel coding

Suppose that the sender and the receiver have a box shared between them with the following properties. Alice inputs α and receives output a, and Bob inputs β and receives output b. We say that such a box is non-signaling if by itself it is useless for communication. More formally, such a box is described by a conditional probability distribution P(a, b|α, β) representing the probability that the outputs are a and b given that the inputs are α and β. The non-signaling property is then easily formulated as a linear constraint on these numbers: the marginal distribution on a is independent of the input β of Bob and the marginal distribution on b is independent of the input α of Alice,

    P(a|α, β)  :=  ∑_b P(a, b|α, β)  =  P_A(a|α)                                            (50)
    P(b|α, β)  :=  ∑_a P(a, b|α, β)  =  P_B(b|β),                                           (51)

for some conditional distributions P_A and P_B. In other words, P(a|α, β) is the same for all values of β. Perhaps the simplest example of a non-signaling box is one that provides shared randomness: in this case P(a, a'|α, β) = δ_{a=a'}/A, where A is the size of the output alphabet, i.e., both parties receive the same uniformly random symbol. Also, if Alice and Bob share a quantum state (which could be entangled) and perform measurements that depend on their inputs, the outputs being the measurement outcomes, this also defines a non-signaling box. There are also some distributions that are non-signaling but do not seem to be physically realizable without communication, the most well-known being the Popescu-Rohrlich box [18].

Now assume that the sender and the receiver share such a box. The sender may give an input depending on the message he wishes to send and use the output he receives from the box to choose the symbol x ∈ X. Similarly, the receiver can give an input β that depends on the symbol y he receives and might use the output to decode the message. By encompassing all pre- and post-processing of the sender and the receiver into the box itself, we can assume that Alice inputs the message i into the box and the output of the box is exactly the input to the channel. Similarly, Bob inputs y into the box and receives j, a candidate for the sent message. Given this definition, we can naturally define the non-signaling success probability as

    S^NS(W, k)  :=  maximize_{P(x,j|i,y), P_A, P_B}   (1/k) ∑_{x,y,i} W(y|x) P(x, i|i, y)
                    subject to    ∑_j P(x, j|i, y) = P_A(x|i)     ∀ (i, x, y) ∈ [k] × X × Y
                                  ∑_x P(x, j|i, y) = P_B(j|y)     ∀ (j, i, y) ∈ [k] × [k] × Y        (52)
                                  ∑_{x,j} P(x, j|i, y) = 1        ∀ (i, y) ∈ [k] × Y
                                  0 ≤ P(x, j|i, y)                ∀ (j, i, x, y) ∈ [k] × [k] × X × Y.

We here prove that this linear program and the one in (16) have the same value.

Given a box P, let us construct a feasible solution for (16). Let r_{x,y} = ∑_i P(x, i|i, y) and p_x = ∑_i P_A(x|i). Then clearly ∑_x r_{x,y} = ∑_i P_B(i|y) = 1 and ∑_x p_x = ∑_{x,i} P_A(x|i) = k. We still need to show that we can assume p_x ≤ 1. For this, define p'_x = min{p_x, 1}. As r_{x,y} ≤ 1, we still have r_{x,y} ≤ p'_x. The sum ∑_x p'_x might be less than k, but as k ≤ |X|, there exist p̄_x ≥ p'_x, while keeping p̄_x ≤ 1, satisfying ∑_x p̄_x = k. The pair (r_{x,y}, p̄_x) is then a feasible solution for (16) with objective function equal to (1/k) ∑_{x,y,i} W(y|x) P(x, i|i, y).

For the other direction, define

    P(x, j|i, y)  =  r_{x,y}/k                  if i = j,
                     (p_x − r_{x,y})/(k(k−1))   if i ≠ j.                                   (53)

It is simple to see that this distribution defines a non-signaling box.

B Interpreting the LP in terms of hypothesis testing

Consider two distributions P and Q over the set Z. Given a sample from either P or Q, we wish to determine which distribution generated the sample. One can define a randomized test T : Z → [0, 1], where T(z) is the probability of declaring the distribution to be P. An important quantity studied in statistical hypothesis testing is the smallest probability of falsely outputting P among all tests that correctly identify P with probability at least α. More precisely,

    β_α(P, Q)  =  min_{T : Z → [0,1], ∑_z P(z) T(z) ≥ α}   ∑_z Q(z) T(z).                   (54)
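Since (54) is itself a small linear program, it can be evaluated directly; the following is a sketch of such an evaluation with scipy (assumed usage, not from the paper).

    # Sketch: beta_alpha(P, Q) as an LP: minimize sum_z Q(z) T(z) over tests 0 <= T <= 1
    # subject to sum_z P(z) T(z) >= alpha.
    import numpy as np
    from scipy.optimize import linprog

    def beta(alpha, P, Q):
        res = linprog(c=Q,                          # objective: sum_z Q(z) T(z)
                      A_ub=[-P], b_ub=[-alpha],     # encodes sum_z P(z) T(z) >= alpha
                      bounds=[(0.0, 1.0)] * len(P))
        return float(res.fun)

    # Example: distinguishing two biased coins with alpha = 0.9.
    print(beta(0.9, np.array([0.8, 0.2]), np.array([0.3, 0.7])))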

We will see that the quantity S^NS(W, k) is related to β_α(P, Q) for distributions P and Q constructed from the channel W. More specifically, let μ ∈ D(X) be a distribution on the inputs X and ν ∈ D(Y) be a distribution on the outputs Y. Then let the distribution μ · W on X × Y be defined by the probabilities μ(x) W(y|x), and let μ · ν be the distribution defined by the probabilities μ(x) ν(y). Then the maximum success probability S^NS(W, k) can be interpreted as a distance between the product distribution μ · ν and the distribution μ · W induced by the channel, in the following sense:

Proposition B.1.

    1 − S^NS(W, k)  =  min_{μ ∈ D(X)} max_{ν ∈ D(Y)} β_{1−1/k}(μ · ν, μ · W).               (55)

Proof. Given a feasible solution p_x and r_{x,y} for (16), we define a distribution μ(x) = p_x/k and a test T(x, y) = 1 − r_{x,y}/p_x. The constraints in (16) readily give ∑_x μ(x) = 1 and T(x, y) ∈ [0, 1] for all x, y. We then have that, for any distribution ν on Y,

    ∑_{x,y} μ(x) ν(y) T(x, y)  ≥  min_y ∑_x (p_x/k) · (1 − r_{x,y}/p_x)                     (56)
                                =  min_y (1 − (1/k) ∑_x r_{x,y})                            (57)
                                ≥  1 − 1/k.                                                 (58)

In addition, we have

    ∑_{x,y} μ(x) W(y|x) T(x, y)  =  1 − (1/k) ∑_{x,y} r_{x,y} W(y|x)                        (59)
                                  ≥  1 − S^NS(W, k),                                        (60)

with equality in (60) when (p_x, r_{x,y}) is an optimal solution of (16). As a result, there exists μ (the one induced by an optimal solution) such that for all ν, β_{1−1/k}(μ · ν, μ · W) ≤ 1 − S^NS(W, k).

For the other direction, we first show using linear programming duality that for any μ,

    max_{ν ∈ D(Y)} β_{1−1/k}(μ · ν, μ · W)  =  min_{T : X×Y → [0,1], ∀y: ∑_x μ(x) T(x,y) ≥ 1 − 1/k}   ∑_{x,y} μ(x) W(y|x) T(x, y).       (61)

In order to see this, observe that β_α(μ · ν, μ · W) is a linear program and thus, using duality, it can also be written as a maximization program. In fact,

    β_α(μ · ν, μ · W)  =  max_{λ_1 ≥ 0, λ_2(x,y) ≥ 0 : μ(x)W(y|x) + λ_2(x,y) ≥ λ_1 μ(x)ν(y)}   λ_1 α − ∑_{x,y} λ_2(x, y).                                            (62)

As a result,

    max_{ν ∈ D(Y)} β_α(μ · ν, μ · W)  =  max_{λ_1 ≥ 0, λ_2(x,y) ≥ 0, ν(y) ≥ 0, ∑_y ν(y) = 1 : μ(x)W(y|x) + λ_2(x,y) ≥ λ_1 μ(x)ν(y)}   λ_1 α − ∑_{x,y} λ_2(x, y)      (63)
                                       =  max_{λ_1(y) ≥ 0, λ_2(x,y) ≥ 0 : μ(x)W(y|x) + λ_2(x,y) ≥ λ_1(y) μ(x)}   α ∑_y λ_1(y) − ∑_{x,y} λ_2(x, y),                   (64)

where we simply set λ_1(y) = λ_1 ν(y). To conclude the proof of (61), it suffices to observe that this last expression is nothing but the dual program for the right-hand side of (61).

Now, given a distribution μ on X and a test T that satisfies the constraint on the right-hand side of (61), we define p_x = k · μ(x) and r_{x,y} = k · μ(x)(1 − T(x, y)). Then the condition ∑_x μ(x) T(x, y) ≥ 1 − 1/k translates to ∑_x μ(x) − (1/k) ∑_x r_{x,y} ≥ 1 − 1/k; in other words, ∑_x r_{x,y} ≤ 1. In addition, we clearly have ∑_x p_x = k and r_{x,y} ≤ p_x. This concludes the proof of the claim.   □

References

[1] A. A. Ageev and M. I. Sviridenko. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. Journal of Combinatorial Optimization, 8(3):307–328, 2004.
[2] C. H. Bennett, G. Brassard, C. Crépeau, R. Jozsa, A. Peres, and W. K. Wootters. Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels. Phys. Rev. Lett., 70:1895–1899, Mar 1993.
[3] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal. Entanglement-assisted classical capacity of noisy quantum channels. Phys. Rev. Lett., 83:3081–3084, Oct 1999.
[4] M. Berta, O. Fawzi, and V. B. Scholz. Quantum Bilinear Optimization. arXiv preprint arXiv:1506.08810, 2015.
[5] N. Brunner, D. Cavalcanti, S. Pironio, V. Scarani, and S. Wehner. Bell nonlocality. Reviews of Modern Physics, 86(2):419, 2014.
[6] G. Cornuejols, M. L. Fisher, and G. L. Nemhauser. Location of bank accounts to optimize float: An analytic study of exact and approximate algorithms. Management Science, 23(8):789–810, 1977.
[7] T. S. Cubitt, D. Leung, W. Matthews, and A. Winter. Improving zero-error classical communication with entanglement. Phys. Rev. Lett., 104(23):230503, 2010.
[8] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.
[9] U. Feige. Approximation thresholds for combinatorial optimization problems. In Proceedings of the International Congress of Mathematicians, 2002.
[10] M. Hayashi. Information spectrum approach to second-order coding rate in channel coding. IEEE Transactions on Information Theory, 55(11):4947–4966, 2009.
[11] B. Hemenway, C. A. Miller, Y. Shi, and M. Wootters. Optimal entanglement-assisted one-shot classical communication. Phys. Rev. A, 87(6):062301, 2013.
[12] A. Krause and D. Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3:19, 2012.
[13] W. Matthews. A linear program for the finite block length converse of Polyanskiy-Poor-Verdú via nonsignaling codes. IEEE Trans. Inform. Theory, 58(12):7036–7044, 2012.
[14] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[15] Y. Polyanskiy. Channel coding: non-asymptotic fundamental limits. PhD thesis, Princeton University, 2010.
[16] Y. Polyanskiy, H. V. Poor, and S. Verdú. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory, 56(5):2307–2359, 2010.
[17] S. Popescu. Nonlocality beyond quantum mechanics. Nat. Phys., 10(4):264–270, 2014.
[18] S. Popescu and D. Rohrlich. Quantum nonlocality as an axiom. Foundations of Physics, 24(3):379–385, 1994.
[19] R. Prevedel, Y. Lu, W. Matthews, R. Kaltenbaek, and K. J. Resch. Entanglement-enhanced classical communication over a noisy classical channel. Phys. Rev. Lett., 106(11):110505, 2011.
[20] C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
[21] V. Strassen. Asymptotische Abschätzungen in Shannons Informationstheorie. In Trans. Third Prague Conf. Information Theory, pages 689–723, 1962.
[22] V. Y. F. Tan. Asymptotic estimates in information theory with non-vanishing error probabilities. Foundations and Trends in Communications and Information Theory, 11(1-2):1–184, 2014.
[23] M. Tomamichel and V. Y. Tan. A tight upper bound for the third-order asymptotics for most discrete memoryless channels. IEEE Trans. Inform. Theory, 59(11):7041–7051, 2013.
