Universal Zero-Delay Joint Source-Channel Coding

Shahriyar Matloub, Tsachy Weissman∗

January 13, 2005

Abstract

We consider zero-delay joint source-channel coding of individual source sequences for a general known channel. Given an arbitrary finite set of schemes with finite-memory (not necessarily time-invariant) decoders, a scheme is devised that does essentially as well as the best in the set on all individual source sequences. Using this scheme, we construct a universal zero-delay joint source-channel coding scheme that is guaranteed to achieve, asymptotically, the performance of the best zero-delay encoding-decoding scheme with a finite-state encoder and a Markov decoder, on all individual sequences. For the case where the channel is a DMC, we construct an implementable zero-delay joint source-channel coding scheme that is based on the "follow the perturbed leader" scheme of György et al. for lossy source coding of individual sequences. Our scheme is guaranteed to attain asymptotically the performance of the best in the set of all encoding-decoding schemes with a "symbol by symbol" decoder (and arbitrary encoder), on all individual sequences.

Key words and phrases: Discrete memoryless channel (DMC), Finite-state encoder/decoder, Individual sequences, Joint source-channel coding, Markov encoder/decoder, Zero-delay schemes.

1 Introduction

The two main coding theorems of classical information theory determine the fundamental performance limits for data compression (source coding theorem) and for the transmission rate of communication (channel coding theorem). Both theorems rely heavily on the properties of "typical" sequences generated by a stationary and ergodic process and, usually, the optimum performance is achievable by codes that introduce asymptotically large delays. The "separation theorem" is another fundamental result of classical information theory, stating that the performance of the optimum joint source-channel coding scheme can be achieved by optimizing the source coding and the channel coding separately. Unfortunately, in many real-world applications the long delay introduced by the code is not tolerable, so the delay introduced by the encoding-decoding scheme needs to be either zero (zero-delay) or no more than some prescribed value (finite-delay). Once the finite-delay constraint is added to the encoding-decoding process, the separation theorem no longer holds, and to attain the optimum performance one needs to jointly optimize the source coding and the channel coding.

∗ Department of Electrical Engineering, Stanford University, Stanford 94305-9510, USA. Email: smatloub,[email protected].


A general formulation of the joint source-channel coding problem is the following: An encoder accesses the source sequence x1, x2, . . . and transforms it into a sequence of channel input symbols u1, u2, . . .. The decoder observes the channel output sequence v1, v2, . . . and, based on it, generates a reconstruction sequence x̂1, x̂2, . . .. Assuming that one channel use is available per source symbol, a joint source-channel coding scheme has (or can be implemented with) zero delay if, for each t, the t-th channel input symbol depends on the source sequence only through its first t components, while the t-th reconstruction symbol depends on the channel output sequence only through its first t components. Notwithstanding the obvious practical interest in zero-delay schemes in communications and control, little is known regarding the fundamental limitations on their performance. For example, a fundamental concept such as channel capacity is not useful in this framework because, as shown in [1], channels with the same capacity may have quite different behavior under the zero-delay constraint. The problem of real-time lossy encoding for a memoryless source has been studied in [2, 3, 4]. The structure of the optimal real-time encoder-decoder pair when the source is a k-th order Markov source and the decoder has a limited memory is derived in [5, 6]. A different, yet closely related, setting has been developed in [7] for the causal source coding problem. Causal source codes have been studied in [8, 9, 7, 10, 11] for different types of sources (e.g., memoryless, stationary, binary symmetric first-order Markov, etc.). Real-time decoders for different Markov sources passed through a noisy channel have been studied in [12, 13].
For a memoryless source corrupted by a memoryless channel it is straightforward to show, using arguments similar to those of [2] and [4], that the minimum expected distortion in zero-delay joint source-channel coding can be attained using "symbol by symbol" encoding and decoding. The optimality of symbol by symbol operations for this setting was recently shown in [14] (cf., in particular, Section 4 therein) to persist also under the large deviations criterion. The form and performance of the optimal scheme for a stochastic source remain open questions, however, under any perturbation of the assumption of a memoryless source and channel. In this work we consider zero-delay joint source-channel coding in a semi-stochastic setting, where the source signal is assumed an individual sequence while the channel noise is assumed stochastic with a known distribution. This setting, as was argued in [15], is well-connected to practical scenarios, where there is little if any knowledge regarding the source signal or its statistical properties, yet the noisy medium through which communication is to take place has a well-specified statistical characterization. The semi-stochastic setting in joint source-channel coding dates back to Ziv's work on lossy compression of individual sequences [16], which also considered the setting of an individual source sequence transmitted via a memoryless channel (with no delay limitations). Zero-delay lossy source coding in the individual sequence setting was studied by Linder and Lugosi in [17], where a scheme

was devised and shown to attain the performance of the best scalar quantizer in hindsight on every individual source sequence. This scheme was later simplified in [18], which generalized the setting of [17] to accommodate an arbitrary reference class, more general alphabets and distortion measures, the possibility of delay, and the case where source sequences are corrupted prior to reaching the encoder. The setting of [17] was recently revisited in [19] and in [20], where it was shown that the performance of the best scalar quantizer on every individual sequence can be practically attained by avoiding the need to explicitly track the performance of each scalar quantizer (a computational bottleneck which rendered the schemes of [17] and [18] impractical). The setting of the present work can thus be considered an extension of Ziv's semi-stochastic joint source-channel coding problem to the case of zero delay and general channels. Alternatively, it can be considered an extension of the setting of [17, 18, 19, 20] to the case of a noisy channel between the encoder and the decoder. This extension from the case of a noiseless channel to the current setting adds two new ingredients to the problem: 1) the encoder does not know exactly the performance of the decoder, and 2) the encoder cannot perfectly communicate to the decoder which scheme to use. After concretely formulating the problem in Section 2, we present in Section 3 the construction of, and performance analysis for, a scheme that attains the performance of the best in an arbitrary given finite reference set of zero-delay schemes (with a mild structural assumption on the decoders), for the case of a general known channel. In Section 4, we first present a precise definition of universality with respect to different classes of zero-delay encoding-decoding schemes, and then we use the result of Section 3 to design a universal zero-delay joint source-channel coding scheme for individual sequences.
Specifically, the scheme achieves, asymptotically, the performance of the optimum zero-delay encoding-decoding scheme, implemented by a finite-state encoder (of arbitrarily large, but finite state-space) and a Markov decoder (of any order). For reasons that will be discussed, however, implementation of the generic scheme of Section 3, even for reference sets of moderate size, would be prohibitively complex. In Section 5, we first present a more detailed discussion on the computational issues, and then construct an implementable “follow the perturbed leader”-type scheme extending that of [19] to our setting. Our scheme is guaranteed to asymptotically do at least as well as the best in the set of all schemes with an arbitrary encoder and a symbol by symbol decoder, for the case where the channel is discrete and memoryless. In Section 6 we conclude by mentioning a few directions for related future research.

2 Zero-delay joint source-channel coding schemes

We assume the following general time-invariant channel model: . . . , N_{−1}, N_0, N_1, . . . is a stationary process (the noise process) with a given known distribution¹ and there exists a nonnegative integer q

¹ Stationarity of the noise process is not crucial for our results, but simplifies the details.

with a function ψ such that, for each t ≥ q, the channel output at time t, V_t, is given by ψ(u_{t−q+1}^t, N^t), with N^t = (. . . , N_{t−1}, N_t) and u_{t−q+1}^t = (u_{t−q+1}, . . . , u_t) being the channel input symbols from time t − q + 1 to time t. We further assume, for some rate R > 0, that there exists a σ > 0 with P_e(2^{nR}, n) ≤ exp(−nσ), where P_e(2^{nR}, n) is the minimum probability of error in discriminating one message among 2^{nR} possible messages after n channel uses. This is a benign assumption on the channel: it holds for any channel with a positive capacity and a positive error exponent. Note that if the channel has zero capacity clearly nothing can be done, and almost any channel with positive capacity has a positive error exponent. Furthermore, our schemes and principal results will be seen to require only positive capacity; the additional requirement of a positive error exponent allows us to obtain more explicit performance bounds.

A zero-delay joint source-channel coding scheme is defined by an encoder-decoder pair. We allow for randomized encoders by formally assuming that the encoder has access to a sequence Ω_1, Ω_2, . . . of i.i.d. random variables, independent of the channel noise process, distributed uniformly over the interval [0, 1]. At each time instant i = 1, 2, . . . the encoder produces a channel input symbol U_i based on the past and present source symbols x^i = (x_1, . . . , x_i) and randomization values Ω^i = (Ω_1, . . . , Ω_i). After receiving V_i, the i-th channel output, the decoder outputs the reconstruction X̂_i based on the channel output symbols V^i = (V_1, . . . , V_i) received so far. More formally, letting X, U, V, X̂ denote, respectively, the alphabets of the source, channel input, channel output, and reconstruction, a scheme, which we generically denote by S, is given by a sequence of measurable encoder-decoder function pairs {(f_t, g_t)}_{t≥1}, where

    f_t : X^t × [0, 1]^t → U   and   g_t : V^t → X̂,

so that U_t = f_t(x^t, Ω^t) and X̂_t = g_t(V^t). Let ρ : X × X̂ → R be a bounded distortion measure. The expected total distortion and the expected per-symbol distortion of the scheme S from time t_1 to t_2 (1 ≤ t_1 ≤ t_2), when the source signal is the individual sequence x = (x_1, x_2, . . .), are denoted, respectively, by D_S(x_{t_1}^{t_2}) and D̄_S(x_{t_1}^{t_2}), where

    D_S(x_{t_1}^{t_2}) = E[ Σ_{t=t_1}^{t_2} ρ(x_t, X̂_t) ],   (1)

and D̄_S(x_{t_1}^{t_2}) = D_S(x_{t_1}^{t_2})/(t_2 − t_1 + 1). The expectation in (1) is taken with respect to the randomness in both the randomization sequence and the channel noise. We will similarly let

    D_S(x_{t_1}^{t_2} | ω_{t_1}^{t_2}) = E[ Σ_{t=t_1}^{t_2} ρ(x_t, X̂_t) | Ω_{t_1}^{t_2} = ω_{t_1}^{t_2} ]

denote the expected total distortion conditioned on the randomization sequence², and define the per-symbol distortion conditioned on the randomization sequence by D̄_S(x_{t_1}^{t_2} | ω_{t_1}^{t_2}). Note that

    D_S(x_{t_1}^{t_2}) = E[ D_S(x_{t_1}^{t_2} | Ω_{t_1}^{t_2}) ].   (2)

² Or, more precisely, a version of the conditional expectation evaluated at Ω = ω.

3 A generic scheme

Following [18], a scheme will be said to have a finite-memory decoder of order m if, for all t ≥ m and all v^t, ṽ^t ∈ V^t such that v_{t−m+1}^t = ṽ_{t−m+1}^t, g_t(v^t) = g_t(ṽ^t). Note that time-invariance is not required. F_m will denote the class of all schemes with a finite-memory decoder of order m. F_m consists of schemes which may be relevant in situations where algorithmic resources are abundant on the encoding end, yet decoding (which may take place in a primitive device) must be simple. With ρ_max = sup_{x,y} ρ(x, y) < ∞, we have the following:

Theorem 1 Let S be an arbitrary finite subset of F_m for some m ≥ 1. There exists a zero-delay joint source-channel code S such that for all N ≥ 1 and x^N ∈ X^N,

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ Γ(ρ_max, m, q, σ, R, |S|) (log N / N)^{1/3}.   (3)

Moreover, for all n ≤ N,

    D_S(x^n) − min_{S′∈S} D_{S′}(x^n) ≤ Γ(ρ_max, m, q, σ, R, |S|) N^{2/3} (log N)^{1/3}.   (4)

The explicit form of Γ will be evident in the proof. We mention that the dependence on |S| (in the bound we have been able to obtain) is proportional to (log |S|)^{2/3}. Note the dependence of the bound on N, which is similar to that in the bound of [18, Theorem 1], up to the logarithmic factor stemming from an additional channel coding ingredient arising in the current setting (as detailed below).

Proof: Description of the Algorithm: Assume N large, fix α ≪ l ≪ N, and divide the time axis [1, . . . , N] into N/l consecutive blocks of length l (assume l divides N). We construct the scheme S as follows: At the beginning of the k-th block (1 ≤ k ≤ N/l), i.e., at the i = (k − 1)l + 1-th channel use, when both x^i and Ω^i are available to the encoder (so, in particular, D_{S′}(x^{(k−1)l} | Ω^{(k−1)l}) is known to the encoder for all S′ ∈ S), the encoder uses Ω_i to generate S^{(k)}, an S-valued random variable with distribution satisfying

    Pr{ S^{(k)} = S′ | Ω^{(k−1)l} } = exp( −η D_{S′}(x^{(k−1)l} | Ω^{(k−1)l}) ) / Σ_{S̃∈S} exp( −η D_{S̃}(x^{(k−1)l} | Ω^{(k−1)l}) )   a.s.,   (5)

where η > 0 is another degree of freedom to be chosen later. The encoder now dedicates the first α channel symbols at the beginning of the k-th block (i.e., U_i for i = (k − 1)l + 1, . . . , (k − 1)l + α) to

convey to the decoder the identity of S^{(k)}. It does so using an optimal channel code (in the sense of maximum error probability) of block length α for discriminating one among |S| alternatives. At the remainder of the block (i.e., at times i = (k − 1)l + α + 1, . . . , kl) the encoder produces the channel input symbols U_i = f_i^{(k)}(x^i, Ω^i) (where {f_t^{(k)}} are the encoding functions associated with the scheme S^{(k)}). Meanwhile, on the decoder's side, at the beginning of the block, at times i = (k − 1)l + 1, . . . , (k − 1)l + α, it outputs some arbitrary sequence of reproductions X̂_i. Then, having observed the first α channel output symbols at the beginning of the k-th block, the decoder uses the optimal decoding rule associated with the channel code to come up with Ŝ^{(k)}, its estimate of S^{(k)}. On the remainder of the block, the decoder produces the reproduction symbols X̂_i = ĝ_i^{(k)}(V^i), where {ĝ_t^{(k)}} denote the decoding functions associated with the scheme Ŝ^{(k)}.
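The randomized selection rule (5) is a standard exponential-weighting draw over the reference class. The following is a minimal sketch of that draw, not the authors' implementation; the function name and the inverse-CDF sampling loop are our own choices:

```python
import math
import random

def exponential_weights_choice(cum_losses, eta, rng):
    """Pick a reference-scheme index with probability proportional to
    exp(-eta * cumulative_loss), as in the block-selection rule (5)."""
    weights = [math.exp(-eta * loss) for loss in cum_losses]
    total = sum(weights)
    # inverse-CDF sampling from the normalized weights
    r = rng.random() * total
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r <= acc:
            return idx
    return len(weights) - 1
```

Schemes with smaller cumulative (expected, channel-averaged) loss are exponentially more likely to be selected for the next block; η trades off how aggressively the current leader is followed.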

Performance Analysis: We now turn to the analysis of the performance of scheme S, which bears some similarity to that in [18, Section III] (which, in turn, relies on the exponential weighting ideas in [17]). For each 0 ≤ j ≤ N/l − 1 define the random variable

    W_{j+1} = Σ_{S′∈S} exp{ −η D_{S′}(x^{jl} | Ω^{jl}) }.

Since W_1 = |S|,

    log(W_{k+1}/W_1) = log( Σ_{S′∈S} exp{ −η D_{S′}(x^{kl} | Ω^{kl}) } ) − log |S|
                     ≥ log max_{S′∈S} exp{ −η D_{S′}(x^{kl} | Ω^{kl}) } − log |S|
                     = −η min_{S′∈S} D_{S′}(x^{kl} | Ω^{kl}) − log |S|.   (6)

On the other hand, for each 0 ≤ j ≤ k − 1,

    log(W_{j+2}/W_{j+1}) = log [ Σ_{S′∈S} exp{ −η D_{S′}(x^{(j+1)l} | Ω^{(j+1)l}) } / Σ_{S′∈S} exp{ −η D_{S′}(x^{jl} | Ω^{jl}) } ]
                         = log [ Σ_{S′∈S} exp{ −η D_{S′}(x^{jl} | Ω^{jl}) } exp{ −η D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) } / Σ_{S′∈S} exp{ −η D_{S′}(x^{jl} | Ω^{jl}) } ]
                         = log E_{Q_{j+1}} [ exp{ −η D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) } ]
                    (a)  ≤ log ( exp{ −η E_{Q_{j+1}}[ D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] + η² ρ_max² l² / 8 } )
                         = −η E_{Q_{j+1}}[ D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] + η² ρ_max² l² / 8,

where E_{Q_{j+1}}[·] is expectation with respect to the random probability distribution on S defined in (5). Inequality (a) follows from an application of Hoeffding's bound (cf. [21, Lemma 8.1]).

Let α(ε) denote the minimum number of channel uses needed to discriminate one message among |S| messages with error probability less than or equal to ε. In the case that the decoder picks the wrong decoding scheme in the (j+1)-th block, the total distortion of scheme S on this block is bounded by lρ_max. Otherwise, schemes S and S′ generate the same reconstruction sequence after α(ε) + m + q symbols from the beginning of the block, and the total distortion of scheme S in this block is bounded by D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) + (α(ε) + m + q) ρ_max. Hence

    E_{Q_{j+1}}[ D_S(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] ≤ lρ_max ε + ( E_{Q_{j+1}}[ D_{S′}(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] + (α(ε) + m + q) ρ_max ) (1 − ε).

Therefore

    log(W_{j+2}/W_{j+1}) ≤ −η E_{Q_{j+1}}[ D_S(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] + η ρ_max ( lε + α(ε) + m + q ) + η² ρ_max² l² / 8.

Using

    log(W_{k+1}/W_1) = Σ_{j=0}^{k−1} log(W_{j+2}/W_{j+1}),

we have

    log(W_{k+1}/W_1) ≤ Σ_{j=0}^{k−1} ( η ρ_max ( lε + α(ε) + m + q ) + η² ρ_max² l² / 8 − η E_{Q_{j+1}}[ D_S(x_{jl+1}^{(j+1)l} | Ω_{jl+1}^{(j+1)l}) ] ).   (7)

Combining (6) and (7), and taking expectation with respect to the randomization sequence, gives

    D̄_S(x^{kl}) − min_{S′∈S} D̄_{S′}(x^{kl}) ≤ log|S|/(klη) + ρ_max ε + ρ_max (α(ε) + m + q)/l + η ρ_max² l / 8.   (8)

Now, letting k = N/l in (8),

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ log|S|/(Nη) + ρ_max ε + ρ_max (α(ε) + m + q)/l + η ρ_max² l / 8.   (9)

We now have the freedom to determine the optimum values for η, l, and ε that minimize the right side of (9). Optimizing over η we have η = ( 8 log|S| / (N ρ_max² l) )^{1/2}, and for this value of η,

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ ( log|S| ρ_max² / 2 )^{1/2} N^{−1/2} l^{1/2} + ρ_max ε + (α(ε) + m + q) ρ_max l^{−1}.   (10)

The right side of (10) is of the form f(l) = A l^{1/2} + B + C l^{−1}, where A = ( log|S| ρ_max² / 2 )^{1/2} N^{−1/2}, B = ρ_max ε, and C = (α(ε) + m + q) ρ_max. The optimum value of l that minimizes f(l) is l = (2C/A)^{2/3}, and for this value of l,

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ (3/2) [ (α(ε) + m + q) log|S| ]^{1/3} ρ_max N^{−1/3} + ρ_max ε.

In order to find a good value for ε, we need an upper bound on α(ε). Using the assumption on the behavior of P_e(2^{nR}, n), the probability of error in discriminating one message among 2^{nR}

possible messages after n channel uses, any integer ν > 0 that satisfies exp(−νσ) ≤ ε and 2^{νR} ≥ |S| is an upper bound on α(ε). Thus

    α(ε) ≤ max{ (1/R) log|S|, (1/σ) log(1/ε) } ≤ (1/R) log|S| + (1/σ) log(1/ε),

and

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ (3/2) [ ( (1/R) log|S| + (1/σ) log(1/ε) + m + q ) log|S| ]^{1/3} ρ_max N^{−1/3} + ρ_max ε.   (11)

By choosing ε = N^{−1/3} in (11) we have

    D̄_S(x^N) − min_{S′∈S} D̄_{S′}(x^N) ≤ Γ(ρ_max, m, q, σ, R, |S|) (log N / N)^{1/3},   (12)

where the precise form of Γ is evident from (11); in particular, Γ(ρ_max, m, q, σ, R, |S|) = O( (log|S|)^{2/3} ).

For the second part of the theorem, let Ŝ = arg min_{S′∈S} D_{S′}(x^{kl}) and kl ≤ n < (k+1)l. Then

    D_S(x^n) − min_{S′∈S} D_{S′}(x^n) ≤ (n − kl) ρ_max + D_S(x^{kl}) − D_{Ŝ}(x^{kl})
                                      ≤ (n − kl) ρ_max + kl ( log|S|/(klη) + ρ_max ε + ρ_max (α(ε) + m + q)/l + η ρ_max² l / 8 )
                                      ≤ l ρ_max + N ( log|S|/(Nη) + ρ_max ε + ρ_max (α(ε) + m + q)/l + η ρ_max² l / 8 ),

where the second inequality follows from (8). By our choice of ε = N^{−1/3} we have

    α(ε) = (1/R) log|S| + (1/(3σ)) log N = O(log N),
    l = ( 2( (1/R) log|S| + (1/(3σ)) log N + m + q ) )^{2/3} N^{1/3} (log|S|)^{−1/3} = O( N^{1/3} (log N)^{2/3} ),
    η = 2 ( (1/R) log|S| + (1/(3σ)) log N + m + q )^{−1/3} N^{−2/3} (log|S|)^{2/3} / ρ_max = O( N^{−2/3} (log N)^{−1/3} ),

hence

    D_S(x^n) − min_{S′∈S} D_{S′}(x^n) ≤ Γ(ρ_max, m, q, σ, R, |S|) N^{2/3} (log N)^{1/3}.   (13)   □
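The parameter choices made at the end of the proof (ε = N^{−1/3}, the bound on α(ε), and the minimizing l and η) can be computed numerically. The following is a sketch under our own naming, using natural logarithms throughout:

```python
import math

def tuned_parameters(N, num_schemes, R, sigma, rho_max, m, q):
    """Sketch of the parameter choices in the proof of Theorem 1:
    eps = N^(-1/3); alpha(eps) bounded via rate R and error exponent sigma;
    l minimizes f(l) = A*l^(1/2) + B + C/l; eta balances the exponential-
    weighting tradeoff in (9)."""
    eps = N ** (-1.0 / 3)
    log_S = math.log(num_schemes)
    # channel uses needed to signal one of |S| schemes with error <= eps
    alpha = max(log_S / R, math.log(1.0 / eps) / sigma)
    A = math.sqrt(log_S * rho_max ** 2 / 2.0) / math.sqrt(N)
    C = (alpha + m + q) * rho_max
    l = (2.0 * C / A) ** (2.0 / 3)   # minimizer of A*l^(1/2) + C/l
    eta = math.sqrt(8.0 * log_S / (N * rho_max ** 2 * l))
    return eps, alpha, l, eta
```

At the returned l, the first-order condition A/(2√l) = C/l² holds, which is exactly the balance used to obtain the N^{−1/3} rate.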

A few remarks:

1. At the beginning of each block, scheme S picks one scheme from the reference class and uses that scheme for the rest of the block. Assuming all the schemes in the reference class have a finite-memory decoder, the output symbol of the decoder at time t is independent of all past input symbols up to time t − (q + m). Hence, excluding the first α(ε) + m + q reconstructed symbols at the beginning of each block, the reconstructed symbols for the block depend only on the choice of decoding scheme. Without the assumption of finite-memory decoders, the performance analysis of scheme S would be complicated by the fact that the channel output symbol at any time instant (and, in particular, those at the beginning of the sub-block) can potentially affect the entire reconstruction sequence at future time instants. We will discuss the consequences of this assumption with regard to the coding scheme presented in the next section.

2. Though Theorem 1 presented a bound on performance under the expected loss criterion D̄_S(x^N), where the loss is averaged with respect to the randomization and the channel noise, similar bounds on the actual loss that hold with high probability can be given under slightly stronger assumptions on the channel (e.g., mixing of the channel output process). This is done by establishing a "concentration" between the actual normalized cumulative loss of a scheme and its expectation.

3. The impracticability of the scheme stems from the fact that, as in the generic scheme of [18], the encoder must keep track of the performance of each scheme in the reference class. In the present setting, however, this task is even more demanding, as the encoder needs to compute the expected performance of each scheme with respect to the channel noise (whereas in the setting of [18], the encoder knows the exact performance based on observing the source sequence). The channel coding that needs to be performed at the beginning of each sub-block is another source of computational burden (though a similar result and proof would also hold for a scheme that uses more practical, possibly sub-optimal, codes for this purpose).

4. No restrictions or assumptions have been made on the alphabets X, U, V, X̂. The only assumption is a bounded distortion measure.

4 Universal zero-delay joint source-channel coding problem

Throughout this section we assume the alphabets X, U, V, X̂ are finite.

A Problem definition

E will be said to be a zero-delay finite-state encoder with finite state-space S = {1, 2, . . . , |S|} if there exists an encoding function f : S × X → U, a next-state function h_f : S × X → S, and an initial state s_1 ∈ S such that

    u_t = f(s_t, x_t),   s_{t+1} = h_f(s_t, x_t),   t = 1, 2, . . . .

D will be said to be a zero-delay finite-state decoder with finite state-space S = {1, 2, . . . , |S|} if there exists a decoding function g : S × V → X̂, a next-state function h_g : S × V → S, and an initial state s_1 ∈ S such that

    x̂_t = g(s_t, v_t),   s_{t+1} = h_g(s_t, v_t),   t = 1, 2, . . . .
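A finite-state encoder (and, symmetrically, a decoder) is simply a Mealy machine driven by its input stream. A minimal sketch, with the toy encoding and next-state functions in the usage below chosen by us purely for illustration:

```python
def run_finite_state_machine(f, h, s1, inputs):
    """Run a zero-delay finite-state encoder/decoder:
    output_t = f(s_t, in_t), s_{t+1} = h(s_t, in_t)."""
    s, outputs = s1, []
    for symbol in inputs:
        outputs.append(f(s, symbol))
        s = h(s, symbol)
    return outputs
```

For example, with binary alphabets, f(s, x) = s XOR x and h(s, x) = x gives a one-state-bit "differential" encoder: each output compares the current input with the previous one.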

Let G_φ(S_1, S_2) denote the class of all zero-delay finite-state encoder-decoder pairs (E, D) with state-spaces S_1 = {1, 2, . . . , S_1} and S_2 = {1, 2, . . . , S_2}, respectively. Define

    φ(S_1, S_2, x^n) = min_{(E,D)∈G_φ(S_1,S_2)} D̄_{(E,D)}(x^n).   (14)

The quantity φ(S_1, S_2, x^n) is the expected distortion introduced by the best zero-delay finite-state encoder-decoder pair with state-spaces S_1 and S_2, respectively, on x^n. For any infinite sequence x = x_1, x_2, . . . the minimum distortion introduced by any zero-delay finite-state encoder-decoder pair is defined as

    φ(x) = lim_{S_1→∞, S_2→∞} lim sup_{n→∞} φ(S_1, S_2, x^n).

A zero-delay joint source-channel coding scheme T is said to be universal with respect to the class of schemes with finite-state encoder and decoder if, for all sequences x = x_1, x_2, . . .,

    lim sup_{n→∞} D̄_T(x^n) ≤ φ(x).   (15)

A Markov encoder (decoder) of order m is a finite-state encoder (decoder) with state-space S = X^m (S = V^m) and s_t = x_{t−m}^{t−1} (s_t = v_{t−m}^{t−1}). Similarly to (14), the minimum expected distortion introduced by the best scheme with a finite-state encoder with state-space S = {1, 2, . . . , |S|} and a Markov decoder of order m on x^n is defined as

    µ(S, m, x^n) = min_{(E,D)∈G_µ(S,m)} D̄_{(E,D)}(x^n),

where G_µ(S, m) denotes the class of all zero-delay joint source-channel coding schemes with a finite-state encoder with S states and a Markov decoder of order m. Assuming S = {1, 2, . . . , S}, we also use G_µ(S, m) to denote this class. Similarly, for an infinite sequence x = x_1, x_2, . . ., the minimum distortion introduced by any zero-delay scheme with a finite-state encoder and a Markov decoder is defined as

    µ(x) = lim_{S→∞, m→∞} lim sup_{n→∞} µ(S, m, x^n).

Similarly to (15), a zero-delay joint source-channel coding scheme T is said to be universal with respect to the class of schemes with a finite-state encoder and a Markov decoder if, for all sequences x = x_1, x_2, . . ., lim sup_{n→∞} D̄_T(x^n) ≤ µ(x).

B Construction of a universal scheme with respect to the class of finite-state encoders and Markov decoders

By combinatorial reasoning,

    |G_µ(S, m)| = (|U| |S|)^{|X||S|} |X̂|^{|V|^m}.   (16)

Hence, choosing S_m = {1, 2, . . . , |V|^m}, we have

    log |G_µ(S_m, m)| = |X| |S_m| log(|U| |S_m|) + |S_m| log |X̂|   (17)
                      ≤ C_1 m |V|^m log |V|,   (18)

where C_1 is a constant independent of m. We construct scheme S* as follows: we break the time axis into blocks of length l_i = |V|^{i log i}, i = 1, 2, . . ., and for the i-th block we use the scheme S_i* from Theorem 1 that competes with the class of schemes G_µ(S_i, i) on that block. We claim that scheme S* is universal with respect to the class of schemes with finite-state encoders and Markov decoders.

Theorem 2 The scheme S* satisfies, for all x ∈ X^∞, lim sup_{n→∞} D̄_{S*}(x^n) ≤ µ(x).
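Before the proof, note that the block schedule l_i = |V|^{i log i} grows super-exponentially, so each block dwarfs the total length of all earlier ones and the per-block regret from Theorem 1 is absorbed. A sketch of the schedule (the base of the logarithm is not pinned down by the text; we use the natural logarithm here):

```python
import math

def block_lengths(V_size, num_blocks):
    """Block lengths l_i = |V|^(i * log i) used by the universal scheme S*;
    block i competes against the reference class G_mu(S_i, i)."""
    lengths = []
    for i in range(1, num_blocks + 1):
        exponent = i * (math.log(i) if i > 1 else 0.0)
        lengths.append(V_size ** exponent)
    return lengths
```

With |V| = 2, the first few lengths are roughly 1, 2.6, 9.8, 46.8, . . .; in practice these would be rounded to integers, a detail the asymptotic analysis does not depend on.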

The main idea behind the proof is to show that, for any m > 0, scheme S* achieves the performance of the best scheme in G_µ(S_m, m) for sufficiently large n.

Proof: For any set A of encoding-decoding schemes,

    min_{S∈A} D_S(x^n) + min_{S∈A} D_S(x_{n+1}^{n+m}) ≤ min_{S∈A} D_S(x^{n+m}).

Letting N_k = Σ_{i=1}^k l_i, we then have

    D_{S*}(x^{N_k}) − min_{S′∈G_µ(S_k,k)} D_{S′}(x^{N_k})
        ≤ D_{S*}(x^{N_{k−1}}) − min_{S′∈G_µ(S_k,k)} D_{S′}(x^{N_{k−1}}) + D_{S_k*}(x_{N_{k−1}+1}^{N_k}) − min_{S′∈G_µ(S_k,k)} D_{S′}(x_{N_{k−1}+1}^{N_k})
        ≤ N_{k−1} ρ_max + C (log |G_µ(S_k, k)|)^{2/3} l_k^{2/3} (log l_k)^{1/3},

where the second inequality holds by Theorem 1. Moreover, for any N_k ≤ n < N_{k+1},

    D_{S*}(x^n) − min_{S′∈G_µ(S_k,k)} D_{S′}(x^n)
        ≤ D_{S*}(x^{N_k}) − min_{S′∈G_µ(S_k,k)} D_{S′}(x^{N_k}) + D_{S*}(x_{N_k+1}^n) − min_{S′∈G_µ(S_{k+1},k+1)} D_{S′}(x_{N_k+1}^n)
        ≤ N_{k−1} ρ_max + C (log |G_µ(S_k, k)|)^{2/3} l_k^{2/3} (log l_k)^{1/3} + C (log |G_µ(S_{k+1}, k+1)|)^{2/3} l_{k+1}^{2/3} (log l_{k+1})^{1/3},

where the second inequality holds by the second part of Theorem 1. By the definition of N_k, |V|^{k log k} = l_k ≤ N_k and

    N_k = Σ_{i=1}^k |V|^{i log i} ≤ Σ_{i=1}^k |V|^{i log k} ≤ |V|^{(k+1) log k}.

Thus

    D̄_{S*}(x^n) − min_{S′∈G_µ(S_k,k)} D̄_{S′}(x^n)
        ≤ (N_{k−1}/N_k) ρ_max + C (log |G_µ(S_k, k)|)^{2/3} l_k^{2/3} (log l_k)^{1/3} / N_k + C (log |G_µ(S_{k+1}, k+1)|)^{2/3} l_{k+1}^{2/3} (log l_{k+1})^{1/3} / N_k
        ≤ |V|^{−log k} ρ_max + C_2 k (log k)^{1/3} |V|^{(k/3)(2 − log k)} + C_3 |V|^{(2k/3)(4 − (1/2) log k)}
        ≤ C_4 |V|^{(2k/3)(4 − (1/2) log k)}.

Let j(n) = max{ j : N_j ≤ n }. For any ε > 0, fixed m, k sufficiently large that C_4 |V|^{(2k/3)(4 − (1/2) log k)} ≤ ε and k ≥ m, and all n ≥ N_k,

    D̄_{S*}(x^n) − min_{S′∈G_µ(S_m,m)} D̄_{S′}(x^n) ≤ D̄_{S*}(x^n) − min_{S′∈G_µ(S_{j(n)}, j(n))} D̄_{S′}(x^n)
        ≤ C_4 |V|^{(2 j(n)/3)(4 − (1/2) log j(n))}
        ≤ C_4 |V|^{(2k/3)(4 − (1/2) log k)}   (19)
        ≤ ε.

As (19) holds for any ε > 0, any m, and all sufficiently large n, it follows that lim sup_{n→∞} D̄_{S*}(x^n) ≤ µ(x).   □
A few remarks:

1. Theorem 2 assumes the semi-stochastic setting, and its result holds for any individual sequence x. As in problems such as source coding [22], prediction [23], denoising [15], and filtering [24], it can be shown that scheme S* is universally asymptotically optimal in a fully stochastic setting as well, where x is emitted by a stationary source (and the performance of the optimum distribution-dependent scheme is sought).

2. As we showed, scheme S* is universal with respect to the class of schemes with finite-state encoders and Markov decoders. Universality of the scheme S* with respect to the class of schemes with finite-state encoders and finite-state decoders remains an open question. We conjecture that, as in the prediction problem (cf. [23]) and the filtering problem (cf. [24]), the optimum performance of these two classes is the same, i.e., for any infinite sequence x = x_1, x_2, . . ., φ(x) = µ(x). In particular, this would imply that scheme S* is also universal with respect to the class of schemes with finite-state encoders and decoders. Proving this equivalence has defied our efforts thus far.

5 On implementable schemes

A A remark on the implementation of the generic scheme for sliding-window decoders

As mentioned before, the impracticability of the scheme described in Theorem 1 stems from the fact that the encoder must keep track of the performance of each scheme in the reference class. In the case where the reference class is a subset of the schemes with a Markov decoder of order m, the following observation is the key to a more efficient algorithm for tracking the performance of each scheme. To alleviate technicalities, we assume finite alphabets throughout this section.

As defined before, a scheme S = {(f_t, g_t)} will be said to have a Markov decoder of order m if there exists a function g : V^m → X̂ such that g_t(v^t) = g(v_{t−m+1}^t) for all t ≥ m and v^t ∈ V^t. Denoting the class of all schemes with such a decoder by W_m, we have W_m ⊆ F_m, where, as defined in Section 3, F_m denotes the class of all schemes with a finite-memory decoder of order m (but not necessarily time-invariant). For any scheme S ∈ W_m (S = {f_t, g}),

    D_S(x_{t_1}^{t_2} | ω_{t_1}^{t_2}) = E[ Σ_{i=t_1}^{t_2} ρ(x_i, X̂_i) | ω_{t_1}^{t_2} ]
                                       = E[ Σ_{i=t_1}^{t_2} ρ( x_i, g(V_{i−m+1}^i) ) | ω_{t_1}^{t_2} ]
                                       = E[ Σ_{i=t_1}^{t_2} ρ( x_i, g( ψ(u_{i−m−q+2}^{i−m+1}, N^{i−m+1}), . . . , ψ(u_{i−q+1}^i, N^i) ) ) | ω_{t_1}^{t_2} ]
                                       = Σ_{i=t_1}^{t_2} ρ̃_g(x_i, u_{i−m−q+2}^i),   (20)

where we assume u_{−m−q+3}^0 is fixed and

    ρ̃_g(β, γ^{m+q−1}) = E[ ρ( β, g( ψ(γ_1^q, N^q), . . . , ψ(γ_m^{m+q−1}, N^{m+q−1}) ) ) ],

where the expectation is taken with respect to the random channel noise components. The last equality in (20) holds because the channel noise process . . . , N_{−1}, N_0, N_1, . . . is stationary. For any (source, randomization) sequence pair (x^n, ω^n), we define

    P^{(m+q)}[x^n, ω^n, {f_t}](β, γ^{m+q−1}) = (1/n) |{ 1 ≤ t ≤ n : x_t = β, u_{t−m−q+2}^t = γ^{m+q−1} }|.
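The observation above says the encoder need only maintain the empirical distribution of (source symbol, recent channel inputs) pairs. A sketch of that bookkeeping (the padding symbol 0 for channel inputs before time 1 is our own stand-in for the fixed u_{−m−q+3}^0):

```python
from collections import Counter

def empirical_mq_distribution(x, u, m, q):
    """Empirical distribution of (x_t, u_{t-m-q+2}^t) pairs, i.e.
    P^(m+q)[x^n, omega^n, {f_t}] from the text. Channel inputs before
    time 1 are taken as a fixed padding symbol (an assumption)."""
    n = len(x)
    w = m + q - 1                       # length of each channel-input window
    counts = Counter()
    padded = [0] * (w - 1) + list(u)    # fixed pre-history (assumed all 0)
    for t in range(n):
        window = tuple(padded[t:t + w])
        counts[(x[t], window)] += 1
    return {key: c / n for key, c in counts.items()}
```

Each scheme's expected performance then depends on the data only through this table, which can be updated in constant time per symbol.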

Thus P^{(m+q)}[x^n, ω^n, {f_t}] is the empirical distribution of (m+q)-tuples induced by the (source, channel input) sequence pair (x^n, u^n). Using this definition,

    D̄_S(x^n | ω^n) = E_{P^{(m+q)}}[ ρ̃_g(X, U^{m+q−1}) ],

where (X, U^{m+q−1}) is a random vector with distribution P^{(m+q)}[x^n, ω^n, {f_t}]. Hence, the performance of scheme S depends on x^n only through the (m+q)-th order empirical distribution its encoder induces on the (source, channel input) pair. This implies that the generic scheme of Theorem 1, when applied to compete with a finite subset of W_m, can be implemented by keeping track only of the P^{(m+q)}[x^n, ω^n, {f_t}] that each scheme induces.

B An implementable "follow the perturbed leader"-type scheme for the memoryless channel

In this subsection we construct a low-complexity algorithm for zero-delay joint source-channel coding of individual sequences, assuming the channel between the encoder and the decoder is memoryless. Our approach is based on the “follow the perturbed leader” scheme devised in [19] for zero-delay lossy coding of individual sequences; the method itself was first introduced in Hannan's seminal work [25] for the prediction problem. Assume that the channel is memoryless with positive capacity and a specified channel matrix $\Psi_{|\mathcal{U}|\times|\mathcal{V}|}$, where $\Psi(u,v)$ denotes the probability of the channel output symbol $v$ when the input symbol is $u$. Furthermore, assume $S\in\mathcal{W}_1$, the class of all schemes with a “symbol by symbol” decoder. From (20) we have
$$D_S(x^n\mid\omega^n)=\frac{1}{n}\sum_{i=1}^{n}\tilde{\rho}_g(x_i,u_i),$$
where $\tilde{\rho}_g(x,u)=\sum_{v}\rho\bigl(x,g(v)\bigr)\Psi(u,v)$ for this channel. In this case, the choice of $u$ at time $i$ affects only the output of the decoder at time $i$, and has no effect on future time instances. Hence the optimum encoder is the one that minimizes $\tilde{\rho}_g(x_i,u_i)$ for each input symbol $x_i$. This optimum encoder is a symbol-by-symbol encoder, defined by
$$f(x)=\arg\min_{u}\,\tilde{\rho}_g(x,u).\qquad(21)$$
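For a fixed symbol-by-symbol decoder $g$, both the modified distortion $\tilde{\rho}_g$ and the optimal encoder of (21) are direct computations from the channel matrix; a minimal sketch, assuming for illustration a binary alphabet, Hamming distortion, and a BSC with crossover 0.1 (none of these specifics come from the paper):

```python
def rho_tilde(x, u, g, Psi, rho):
    """Expected end-to-end distortion when x is sent as channel input u:
    rho_tilde_g(x, u) = sum_v rho(x, g(v)) * Psi[u][v]."""
    return sum(rho(x, g[v]) * Psi[u][v] for v in range(len(Psi[u])))

def optimal_encoder(g, Psi, rho, X, U):
    """Eq. (21): f(x) = argmin_u rho_tilde_g(x, u), one entry per source symbol."""
    return {x: min(U, key=lambda u: rho_tilde(x, u, g, Psi, rho)) for x in X}

# Illustrative example: BSC(0.1), identity decoder, Hamming distortion.
Psi = [[0.9, 0.1],
       [0.1, 0.9]]                  # Psi[u][v] = P(output v | input u)
g = {0: 0, 1: 1}                    # a symbol-by-symbol decoder
rho = lambda x, xh: float(x != xh)  # Hamming distortion
f = optimal_encoder(g, Psi, rho, X=[0, 1], U=[0, 1])
# f maps each source symbol to the channel input with least expected distortion
```

Since the channel is memoryless and the decoder acts symbol by symbol, this table lookup is the whole encoder.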

Note that the encoder defined in (21) is a zero-delay encoder that is optimum among all possible encoders, including the non-causal ones. Moreover, as $u$ is a deterministic function of $x$, the dependence of $D_S(x^n\mid\omega^n)$ on $x^n$ and $\omega^n$ (for $S$ employing the encoder in (21)) is only through the first-order empirical distribution induced by $x^n$. More specifically, defining $d_g(x)=\tilde{\rho}_g\bigl(x,f(x)\bigr)$ (for the encoder $f$ in (21)), and denoting the first-order empirical distribution induced by $x^n$ by $P^1[x^n]$,
$$D_S(x^n)=E_{P^1[x^n]}[d_g(X)]=\frac{1}{n}\,d_g\cdot h_n,\qquad(22)$$
where $X\sim P^1[x^n]$ in (22), $h_n=nP^1[x^n]$ is a $|\mathcal{X}|\times 1$ vector denoting the histogram of $x^n$, $d_g$ is a $|\mathcal{X}|\times 1$ vector with $d_g[x]=d_g(x)$ ($d_g[x]$ is the $x$-th element of the vector $d_g$), and $a\cdot b=\sum_{j=1}^{|\mathcal{X}|}a_j b_j$ for $a=(a_1,a_2,\ldots,a_{|\mathcal{X}|})$ and $b=(b_1,b_2,\ldots,b_{|\mathcal{X}|})$.

Let $a=(a_1,a_2,\ldots,a_{|\mathcal{X}|})$ be a vector in $\mathbb{R}^{|\mathcal{X}|}$ with $a_j\ge 0$, $j=1,2,\ldots,|\mathcal{X}|$, and let $P_a=a/|a|_1$, where $|a|_1=\sum_{j=1}^{|\mathcal{X}|}a_j$. Then $P_a$ defines a probability distribution on $\mathcal{X}$. Noting that there are $|\hat{\mathcal{X}}|^{|\mathcal{V}|}$ different symbol-by-symbol decoders $g$, we define
$$g^*(a)=\arg\min_{g:\mathcal{V}\to\hat{\mathcal{X}}}E_{P_a}[d_g(X)],\qquad(23)$$
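Since there are only $|\hat{\mathcal{X}}|^{|\mathcal{V}|}$ symbol-by-symbol decoders, $g^*(a)$ in (23) can be found by exhaustive search; a sketch under an illustrative binary setup (the BSC, Hamming distortion, and histogram $a=(3,1)$ are assumptions made here for illustration):

```python
from itertools import product

def d_g(x, g, Psi, rho, U):
    """d_g(x) = rho_tilde_g(x, f(x)) for the optimal encoder f of (21)."""
    return min(sum(rho(x, g[v]) * Psi[u][v] for v in range(len(Psi[u])))
               for u in U)

def g_star(a, Psi, rho, X, Xhat, U):
    """Eq. (23): the decoder minimizing E_{P_a}[d_g(X)], with P_a = a / |a|_1."""
    V = range(len(Psi[0]))
    total = sum(a)
    best, best_cost = None, float("inf")
    for outs in product(Xhat, repeat=len(V)):   # all |Xhat|^|V| decoders
        g = dict(zip(V, outs))
        cost = sum(a[x] * d_g(x, g, Psi, rho, U) for x in X) / total
        if cost < best_cost:
            best, best_cost = g, cost
    return best

# Illustrative: BSC(0.1), Hamming distortion, histogram a = (3, 1).
Psi = [[0.9, 0.1], [0.1, 0.9]]
rho = lambda x, xh: float(x != xh)
best = g_star([3, 1], Psi, rho, X=[0, 1], Xhat=[0, 1], U=[0, 1])
```

The enumeration is exponential in $|\mathcal{V}|$ but independent of the sequence length, so it is a fixed per-block cost in the scheme below.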

where the subscript in the expectation denotes $X\sim P_a$. Note that $\min_{S'\in\mathcal{W}_1}D_{S'}(x^n)$ is achieved by the symbol-by-symbol decoder $g^*(P^1[x^n])$ (and the encoder $f$ from (21) corresponding to it). We now consider a horizon-dependent “follow the perturbed leader”-type scheme $T$, based on the idea presented in [19]: Fix $\alpha\ll l\ll N$, and divide the time axis $[1,\ldots,N]$ into $N/l$ consecutive blocks of length $l$. At the beginning of the $k$-th block ($1\le k\le N/l$), i.e., at the $i=(k-1)l+1$-th channel use, the encoder computes $g_k^*=g^*(h_{(k-1)l}+\xi_k)$ (recall the notation from (23)), where $\xi_k=\eta\cdot(\Omega_{i-|\mathcal{X}|},\ldots,\Omega_{i-1})$ and $\eta>0$ is a parameter to be optimized later. Note that $\xi_k$ is a random vector uniformly distributed over $[0,\eta]^{|\mathcal{X}|}$, and can be viewed as a random perturbation of the histogram of $x^{(k-1)l}$. The encoder now dedicates the first $\alpha$ channel uses of the block to communicating the value of $g_k^*$ to the decoder. At all time points $t$ in the remainder of the block the encoder feeds the channel with $f_k^*(x_t)$, the optimum associated encoder for the decoder $g_k^*$.

Meanwhile, on the decoder's side, at the beginning of the $k$-th block, at times $i=(k-1)l+1,\ldots,(k-1)l+\alpha$, it outputs some arbitrary reproductions $\hat{X}_i$. Then, having observed the first $\alpha$ channel output symbols of the $k$-th block, the decoder uses the optimal decoding rule associated with the channel code to come up with $\hat{g}_k^*$, its estimate of $g_k^*$. On the remainder of the block the decoder uses $\hat{g}_k^*$ to decode.

Theorem 3 For the scheme $T$ described above, with an appropriate choice of the parameters $\alpha$, $l$, $\eta$,
$$D_T(x^N)-\min_{S'\in\mathcal{W}_1}D_{S'}(x^N)\le C\left(\frac{\log N}{N}\right)^{1/3}\qquad\forall\,x^N\in\mathcal{X}^N.$$

Proof: Throughout the proof, with a slight abuse of notation, $D$ denotes cumulative (unnormalized) distortion; the theorem follows upon normalizing by $N$ at the end. We know
$$D_T\bigl(x_{(k-1)l+1}^{kl}\mid\xi_k\bigr)\le d_{g^*(h_{(k-1)l}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})+\alpha(\epsilon)\rho_{\max}(1-\epsilon)+\epsilon l\rho_{\max},$$
where $\epsilon$ is the probability that the decoder does not pick the correct decoding scheme in the $k$-th block ($\Pr\{\hat{g}_k^*\neq g_k^*\}$), and $\alpha(\epsilon)$ is the minimum number of channel uses needed to convey the identity of

the decoding scheme $g_k^*$ to the decoder with error probability less than $\epsilon$. Hence
$$D_T(x^N)=E\left[\sum_{k=1}^{N/l}D_T\bigl(x_{(k-1)l+1}^{kl}\mid\xi_k\bigr)\right]\le E\left[\sum_{k=1}^{N/l}d_{g^*(h_{(k-1)l}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\right]+N\rho_{\max}\left(\frac{\alpha(\epsilon)}{l}+\epsilon\right).\qquad(24)$$

To find an upper bound for the expression on the right side of (24), we first consider the more tractable expectation
$$E\left[\sum_{k=1}^{N/l}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\right].$$
Defining $\xi_0=(0,0,\ldots,0)$, for any $1\le m\le N/l$ we have
$$\sum_{k=1}^{m}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}+\xi_k-h_{(k-1)l}-\xi_{k-1})\overset{(a)}{\le}d_{g^*(h_{ml}+\xi_m)}\cdot(h_{ml}+\xi_m)\overset{(b)}{\le}d_{g^*(h_{ml})}\cdot(h_{ml}+\xi_m).\qquad(25)$$
Inequality (b) holds since $g^*(h_{ml}+\xi_m)$ is optimal for $h_{ml}+\xi_m$, so that $d_{g^*(h_{ml}+\xi_m)}\cdot(h_{ml}+\xi_m)\le d_g\cdot(h_{ml}+\xi_m)$ for any $g$, in particular for $g=g^*(h_{ml})$. We prove inequality (a) by induction: for $m=1$, the inequality holds trivially. Assuming the inequality holds for $m$, we have
$$\begin{aligned}
\sum_{k=1}^{m+1}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}+\xi_k-h_{(k-1)l}-\xi_{k-1})
&\le d_{g^*(h_{(m+1)l}+\xi_{m+1})}\cdot(h_{(m+1)l}+\xi_{m+1}-h_{ml}-\xi_m)+d_{g^*(h_{ml}+\xi_m)}\cdot(h_{ml}+\xi_m)\\
&=d_{g^*(h_{(m+1)l}+\xi_{m+1})}\cdot(h_{(m+1)l}+\xi_{m+1})-\bigl(d_{g^*(h_{(m+1)l}+\xi_{m+1})}-d_{g^*(h_{ml}+\xi_m)}\bigr)\cdot(h_{ml}+\xi_m)\\
&\le d_{g^*(h_{(m+1)l}+\xi_{m+1})}\cdot(h_{(m+1)l}+\xi_{m+1}),
\end{aligned}$$
since $d_{g^*(h_{ml}+\xi_m)}\cdot(h_{ml}+\xi_m)\le d_{g^*(h_{(m+1)l}+\xi_{m+1})}\cdot(h_{ml}+\xi_m)$, by the optimality of $g^*(h_{ml}+\xi_m)$ for $h_{ml}+\xi_m$. Hence the inequality holds for $m+1$ and (a) is proved.

Replacing $\xi_m$ by $\sum_{k=1}^{m}(\xi_k-\xi_{k-1})$ in (25) we have
$$\begin{aligned}
\sum_{k=1}^{m}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})
&\le d_{g^*(h_{ml})}\cdot h_{ml}+\sum_{k=1}^{m}\bigl[d_{g^*(h_{ml})}-d_{g^*(h_{kl}+\xi_k)}\bigr]\cdot(\xi_k-\xi_{k-1})\\
&\overset{(a)}{\le}\min_{S\in\mathcal{W}_1}D_S(x^{ml})+|\mathcal{X}|\rho_{\max}\sum_{k=1}^{m}|\xi_k-\xi_{k-1}|_{\infty},
\end{aligned}$$
where $|a|_{\infty}=\max_i|a_i|$, and inequality (a) holds since $a\cdot b\le|a|_1|b|_{\infty}$ for any $a,b\in\mathbb{R}^{|\mathcal{X}|}$. Since the expectation of each summand on the left side depends on $\xi_k$ only through its marginal distribution, we may evaluate it assuming $\xi_1=\xi_2=\cdots=\xi_m$, in which case $\sum_{k=1}^{m}|\xi_k-\xi_{k-1}|_{\infty}=|\xi_1|_{\infty}\le\eta$. Hence
$$E\left[\sum_{k=1}^{m}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\right]\le\min_{S\in\mathcal{W}_1}D_S(x^{ml})+|\mathcal{X}|\rho_{\max}\eta.\qquad(26)$$

Note now that both $h_{(k-1)l}+\xi_k$ and $h_{kl}+\xi_k$ are uniformly distributed over $|\mathcal{X}|$-dimensional cubes. Denoting by $\tau$ the fraction of these cubes that overlaps,
$$\Bigl|E\bigl[d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\bigr]-E\bigl[d_{g^*(h_{(k-1)l}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\bigr]\Bigr|\le(1-\tau)l\rho_{\max},$$
since the conditional expectations of $d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})$ and $d_{g^*(h_{(k-1)l}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})$ coincide when $h_{(k-1)l}+\xi_k$ and $h_{kl}+\xi_k$ fall into the intersection of the cubes. Note also that $\tau\ge 1-|h_{kl}-h_{(k-1)l}|_1/\eta=1-l/\eta$ for the cubes $h_{(k-1)l}+[0,\eta]^{|\mathcal{X}|}$ and $h_{kl}+[0,\eta]^{|\mathcal{X}|}$, given $\eta\ge l$. Hence
$$E\left[\sum_{k=1}^{N/l}d_{g^*(h_{(k-1)l}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\right]\le E\left[\sum_{k=1}^{N/l}d_{g^*(h_{kl}+\xi_k)}\cdot(h_{kl}-h_{(k-1)l})\right]+\frac{Nl\rho_{\max}}{\eta}.\qquad(27)$$
Combining (24), (26), and (27), we have
$$D_T(x^N)\le\min_{S\in\mathcal{W}_1}D_S(x^N)+\frac{Nl\rho_{\max}}{\eta}+|\mathcal{X}|\rho_{\max}\eta+N\rho_{\max}\left(\frac{\alpha(\epsilon)}{l}+\epsilon\right).\qquad(28)$$

Using the same reasoning as in Section 3 and denoting by $\sigma$ the error exponent of the channel at rate $R$, we have
$$\alpha(\epsilon)\le\frac{|\mathcal{V}|\log|\hat{\mathcal{X}}|}{R}+\frac{1}{\sigma}\log\frac{1}{\epsilon}.$$
Choosing $\epsilon=N^{-1/3}$, the right side of (28) is minimized for
$$l=|\mathcal{X}|^{-1/3}\left(\frac{|\mathcal{V}|\log|\hat{\mathcal{X}}|}{R}+\frac{\log N}{3\sigma}\right)^{2/3}N^{1/3}=O\bigl(N^{1/3}(\log N)^{2/3}\bigr),\qquad
\eta=|\mathcal{X}|^{-2/3}\left(\frac{|\mathcal{V}|\log|\hat{\mathcal{X}}|}{R}+\frac{\log N}{3\sigma}\right)^{1/3}N^{2/3}=O\bigl(N^{2/3}(\log N)^{1/3}\bigr),\qquad(29)$$
and for these values of $\eta$ and $l$ we have
$$D_T(x^N)-\min_{S\in\mathcal{W}_1}D_S(x^N)\le CN^{2/3}(\log N)^{1/3},$$
or, after normalizing by $N$,
$$D_T(x^N)-\min_{S\in\mathcal{W}_1}D_S(x^N)\le C\left(\frac{\log N}{N}\right)^{1/3},$$
which proves the theorem. $\Box$
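The block structure of scheme $T$ can be simulated end to end; the sketch below brute-forces $g^*$ at each block boundary and, for simplicity, assumes the $\alpha$ header channel uses convey $g_k^*$ error-free (that simplification, together with the binary alphabets, BSC, and parameter values, is an assumption made here for illustration, not part of the paper's scheme):

```python
import random
from itertools import product

def rho(x, xh): return float(x != xh)   # Hamming distortion (illustrative)

Psi = [[0.9, 0.1], [0.1, 0.9]]          # BSC(0.1), Psi[u][v] = P(v | u)
X = Xhat = U = [0, 1]
V = [0, 1]

def rho_tilde(x, u, g):
    return sum(rho(x, g[v]) * Psi[u][v] for v in V)

def g_star(h, xi):
    """Best symbol-by-symbol decoder for the perturbed histogram h + xi."""
    a = [h[x] + xi[x] for x in X]
    best, best_cost = None, float("inf")
    for outs in product(Xhat, repeat=len(V)):
        g = dict(zip(V, outs))
        cost = sum(a[x] * min(rho_tilde(x, u, g) for u in U) for x in X)
        if cost < best_cost:
            best, best_cost = g, cost
    return best

def fpl_jscc(x_seq, l, eta, seed=0):
    """'Follow the perturbed leader' scheme T: re-pick the decoder every block
    from the perturbed running histogram; header transmission assumed error-free."""
    rng = random.Random(seed)
    h = [0] * len(X)                    # running histogram of the source
    dist = 0.0
    for start in range(0, len(x_seq), l):
        xi = [eta * rng.random() for _ in X]          # uniform over [0, eta]^|X|
        g = g_star(h, xi)                             # decoder for this block
        f = {x: min(U, key=lambda u: rho_tilde(x, u, g)) for x in X}
        for x in x_seq[start:start + l]:
            u = f[x]                                  # zero-delay encoding
            v = 0 if rng.random() < Psi[u][0] else 1  # channel realization
            dist += rho(x, g[v])                      # decoder output g(v)
            h[x] += 1
    return dist / len(x_seq)

# Illustrative run: 100 source symbols, blocks of length 10.
avg = fpl_jscc([0] * 50 + [1] * 50, l=10, eta=20.0)
```

With a well-chosen block length the per-symbol distortion approaches that of the best fixed symbol-by-symbol scheme, mirroring the guarantee of Theorem 3.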

6 Conclusion and Open Directions

Given a finite reference class of schemes with finite-memory decoders, we have constructed a zero-delay joint source-channel coding scheme that performs essentially as well as the best in the reference class. The setting is a semi-stochastic one where the source sequence is an individual sequence and the output of the encoder is corrupted by a known stochastic channel. The scheme and the result extend to the setting of limited- (rather than zero-) delay coding in a straightforward way. Following the methodology pioneered in [22], we used this scheme on increasing reference classes and constructed a universal scheme that asymptotically attains the performance of any (finite-state encoder, sliding-window decoder) pair on any individual sequence (in turn also implying a universality result for the stochastic setting). Furthermore, we devised an implementable “follow the perturbed leader”-type scheme for the case of a memoryless channel, and proved its universality with respect to the class of all joint source-channel coding schemes with a symbol-by-symbol decoder (and arbitrary encoder).

A natural question arising from our result is whether the aforementioned scheme is also universal with respect to the class of schemes with both finite-state encoders and decoders. The optimum performance of finite-state schemes and Markov schemes proved to be the same for the prediction problem in [23], as well as for the perhaps more closely related filtering problem in [24]. We believe a similar equivalence holds for zero-delay decoding, though a proof has defied our efforts. Another open question is whether a generic scheme in the spirit of the one presented in Section 3 can be devised to compete with schemes that have a finite-state decoder. Even more basically: can the performance of schemes with finite-state encoders and decoders be attained at all universally (not necessarily by our scheme)?
Finally, our brute-force universal scheme is prohibitively complex in most cases of practical interest. A major challenge is to find a practical scheme with similar universality properties.

References

[1] J. C. Walrand and P. Varaiya, “Optimal causal coding-decoding problems,” IEEE Trans. Inform. Theory, vol. IT-29, no. 6, pp. 814–820, November 1983.

[2] T. Ericson, “A result on delay-less information transmission,” Int. Symposium on Information Theory, June 1979, Grignano, Italy.

[3] N. T. Gaarder and D. Slepian, “On optimal finite-state digital transmission systems,” Int. Symposium on Information Theory, June 1979, Grignano, Italy.

[4] ——, “On optimal finite-state digital transmission systems,” IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 167–186, March 1982.

[5] H. S. Witsenhausen, “On the structure of real-time source coders,” Bell System Technical Journal, vol. 58, no. 6, pp. 1437–1451, July–August 1979.

[6] D. Teneketzis, “Optimal real-time encoding-decoding of Markov sources in noisy environments,” Proc. Math. Theory of Networks and Sys. (MTNS), 2004, Leuven, Belgium.

[7] D. L. Neuhoff and R. K. Gilbert, “Causal source codes,” IEEE Trans. Inform. Theory, vol. 28, no. 5, pp. 701–713, September 1982.

[8] S. P. Lloyd, “Rate vs. fidelity for the binary source,” Bell System Technical Journal, vol. 56, no. 3, pp. 427–437, March 1977.

[9] P. Piret, “Causal sliding block encoders with feedback,” IEEE Trans. Inform. Theory, vol. IT-25, no. 2, pp. 237–240, March 1979.

[10] T. Linder and R. Zamir, “Causal source coding of stationary sources with high resolution,” Int. Symposium on Information Theory, June 2001.

[11] R. K. Gilbert and D. L. Neuhoff, “Bounds to the performance of causal codes for Markov sources,” Proc. Allerton Conf. Comm., Contr. and Comput., pp. 284–292, 1979.

[12] J. Devore, “A note on the observation of a Markov source through a noisy channel,” IEEE Trans. Inform. Theory, vol. 20, pp. 762–764, November 1974.

[13] E. Ordentlich and T. Weissman, “On the optimality of symbol by symbol filtering and denoising,” submitted to IEEE Trans. Inform. Theory.

[14] N. Merhav and I. Kontoyiannis, “Source coding exponents for zero-delay coding with finite memory,” IEEE Trans. Inform. Theory, vol. 49, no. 3, pp. 609–625, March 2003.

[15] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. Weinberger, “Universal discrete denoising: Known channel,” IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 5–28, January 2005.

[16] J. Ziv, “Distortion-rate theory for individual sequences,” IEEE Trans. Inform. Theory, vol. 26, no. 2, pp. 137–143, March 1980.

[17] T. Linder and G. Lugosi, “A zero-delay sequential scheme for lossy coding of individual sequences,” IEEE Trans. Inform. Theory, vol. 47, pp. 2533–2538, 2001.

[18] T. Weissman and N. Merhav, “Limited-delay lossy coding and filtering of individual sequences,” IEEE Trans. Inform. Theory, vol. 48, no. 3, pp. 721–733, March 2002.

[19] A. György, T. Linder, and G. Lugosi, “A ‘follow the perturbed leader’-type algorithm for zero-delay quantization of individual sequences,” Proc. Data Compression Conference, March 23–25, 2004, Snowbird, Utah.

[20] ——, “Efficient algorithms and minimax bounds for zero-delay lossy source coding,” IEEE Trans. Signal Processing, vol. 52, pp. 2337–2347, August 2004.

[21] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.

[22] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Inform. Theory, vol. 24, no. 5, pp. 530–536, September 1978.

[23] M. Feder, N. Merhav, and M. Gutman, “Universal prediction of individual sequences,” IEEE Trans. Inform. Theory, vol. 38, pp. 1258–1270, July 1992.

[24] E. Ordentlich, T. Weissman, M. J. Weinberger, A. Somekh-Baruch, and N. Merhav, “Discrete universal filtering through incremental parsing,” Proc. Data Compression Conference, March 23–25, 2004, Snowbird, Utah.

[25] J. Hannan, “Approximation to Bayes risk in repeated plays,” in Contributions to the Theory of Games (M. Dresher, A. W. Tucker, and P. Wolfe, eds.), vol. 3, pp. 97–139, Princeton University Press, 1957.