1
Information Structures of Extremum Problems of Feedback Capacity for General Channels with Memory
arXiv:1604.01063v1 [cs.IT] 4 Apr 2016
Charalambos D. Charalambous and Christos K. Kourtellaris
This work was financially supported by a medium size University of Cyprus grant entitled “DIMITRIS”. The authors are with the Department of Electrical and Computer Engineering, University of Cyprus, 75 Kallipoleos Avenue, P.O. Box 20537, Nicosia, 1678, Cyprus, e-mail: {kourtellaris.christos,
[email protected]}
2
C ONTENTS I
Introduction
3
I-A
Motivation, Objectives and Finite Transmission Feedback Information Capacity . . . . . . . . . . . . .
4
I-B
Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
I-C
Characterizations of FTFI Capacity for Class C Channels and Transmission Cost Functions . . . . . . .
6
I-C1
Special Case M = 2, L = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
I-C2
Special Case M = 0, L = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
I-D II
Extremum problems of Directed Information and Variational Equalities II-A
III
9 10
Characterization of FTFI Capacity
13
III-A
Channel Class A or B and Transmission Cost Class A or B . . . . . . . . . . . . . . . . . . . . . . . .
14
III-A1
Channel Class A and Transmission Cost Class A . . . . . . . . . . . . . . . . . . . . . .
14
III-A2
Channel Class A and Transmission Cost Class B and Vice-Versa . . . . . . . . . . . . .
21
III-B
IV
FTFI Capacity and Variational Equalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Channels Class C and Transmission Cost Class C, A or B . . . . . . . . . . . . . . . . . . . . . . . . .
21
III-B1
Channels Class C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
III-B2
Channel Class C with Transmission Costs Class C . . . . . . . . . . . . . . . . . . . . .
25
III-B3
Channel Class C with Transmission Costs Class A . . . . . . . . . . . . . . . . . . . . .
28
III-B4
Channel Class C with Transmission Costs Class B . . . . . . . . . . . . . . . . . . . . .
29
General Discrete-Time Recursive Channel Models & Gaussian Linear Channel Models with Memory
29
IV-A
General Discrete-Time Recursive Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
IV-B
Multiple Input Multiple Output Gaussian Linear Channel Models with Memory . . . . . . . . . . . . .
33
V
Achievability
36
VI
Conclusion
37
References
37
3
Abstract 4
For any class of channel conditional distributions, with finite memory dependence on channel input RVs An = {Ai : i = 0, . . . , n} 4 Bn = {Bi
4
CI ⊆ P or channel output RVs : i = 0, . . . , n} or both, we characterize the subsets of channel input distributions P[0,n] [0,n] = PAi |Ai−1 ,Bi−1 : i = 1, . . . , n , which satisfy conditional independence on past information, and maximize directed information defined by 4
I(An → Bn ) =
n
∑ I(Ai ; Bi |Bi−1 )
i=0
and we derive the corresponding expressions, called “characterizations of Finite Transmission Feedback Information (FTFI) capacity”. We derive similar characterizations, when general transmission cost constraints are imposed. Moreover, we also show that the structural properties apply to general nonlinear and linear autoregressive channel models defined by discrete-time recursions on general alphabet spaces, and driven by arbitrary distributed noise processes. We derive these structural properties by invoking stochastic optimal control theory and variational equalities of directed information, to identify tight upper bounds on I(An → Bn ), which are achievable over subsets of conditional independence CI ⊆ P distributions P[0,n] [0,n] and specified by the dependence of channel distributions and transmission cost functions on inputs and output symbols. We apply the characterizations to recursive Multiple Input Multiple Output Gaussian Linear Channel Models with limited memory on channel input and output sequences. The structural properties of optimal channel input distributions, generalize the structural properties of Memoryless Channels with feedback, to any channel distribution with memory, and settle various long standing problems in information theory.
I. I NTRODUCTION Shannon’s mathematical model for a communication channel is defined by Ai : i = 0, . . . , n , Bi : i = 0, . . . , n , PBi |Bi−1 ,Ai : i = 0, . . . , n 4
4
where an = {a0 , a1 , . . . , an } ∈ ×ni=0 Ai are the channel input symbols, bn = {b0 , b1 , . . . , bn } ∈ ×ni=0 Bi are the channel output 4 symbols, and C[0,n] = PBi |Bi−1 ,Ai : i = 0, 1, . . . , n , is the sequence of channel conditional distributions. A fundamental problem in extremum problems of feedback capacity, is to determine the information structures of optimal channel input conditional 4 distributions P[0,n] = PAi |Ai−1 ,Bi−1 : i = 0, 1, . . . , n , for any class of channel distributions, of the optimization problem 4
CAFB ∞ →B∞ = lim inf n−→∞
1 CFB n n, n + 1 A →B
4
n n CAFB n →Bn = sup I(A → B ) P[0,n]
(I.1)
where I(An → Bn ) is the directed information from An to Bn , defined by [1], [2]
Here,
PAi ,Bi , PBi |Bi−1
o n dP i−1 i (·|Bi−1 , Ai ) n n 4 Bi |B ,A (Bi ) (I.2) I(An → Bn ) = ∑ I(Ai ; Bi |Bi−1 ) = ∑ E log i−1 dPBi |Bi−1 (·|B ) i=0 i=0 : i = 0, . . . , n are the joint and conditional distributions induced by the channel input conditional
distribution from P[0,n] and the specific channel conditional distribution from C[0,n] , and E{·} denotes expectation with respect to the joint distribution. Under certain conditions, CAFB ∞ →B∞ is the supremum of all achievable rates of the sequence of feedback codes {(n, Mn , εn ) : n = 0, 1, . . . }, defined as follows. 4
(a) A set of messages Mn = {1, . . . , Mn } and a set of encoding strategies, mapping source messages into channel inputs of block length (n + 1), defined by n FB E[0,n] (κ) , gi : Mn × Bi−1 7−→ Ai , a0 = g0 (w, a−1 , b−1 ), . . . , an = gn (w, an−1 , bn−1 ), w ∈ Mn : o 1 Eg c0,n (An , Bn−1 ) ≤ κ , n = 0, 1, . . . . n+1
(I.3)
4
The codeword for any w ∈ Mn is uw ∈ An , uw = (g0 (w, a−1 , b−1 ), g1 (w, a0 , b0 ), , . . . , gn (w, an−1 , bn−1 )), and Cn = (u1 , u2 , . . . , uMn ) is the code for the message set Mn . In general, the code depends on the initial data (A−1 , B−1 ) = (a−1 , b−1 ) (unless it can be shown that in the limit, as n −→ ∞, the induced channel output process has a unique invariant distribution). (b) Decoder measurable mappings d0,n : Bn 7−→ Mn , such that the average probability of decoding error satisfies1 n o n o 1 (n) g n g n Pe , P d (B ) = 6 w|W = w ≡ P d (B ) = 6 W ≤ εn 0,n 0,n Mn w∈∑ Mn Here, rn ,
1 n+1
log Mn is the coding rate or transmission rate (and the messages are uniformly distributed over Mn ). A rate R
1 is said to be an achievable rate, if there exists a code sequence satisfying limn−→∞ εn = 0 and lim infn−→∞ n+1 log Mn ≥ R. The
feedback capacity is defined by C , sup{R : R is achievable}. Coding theorems for channels with memory with and without feedback are developed extensively over the years, in an anthology of papers, such as, [3]–[14]. Our interest in optimization problem CAFB n →Bn is the following. From the the converse coding theorem, if the supremum over FB channel input distributions in CAFB n →Bn exists, and its per unit time limit exists and it is finite, then CA∞ →B∞ is a non-trivial upper
bound on the supremum of all achievable rates of feedback codes-the feedback capacity, while under stationary ergodicity or Dobrushin’s directed information stability [4], then CAFB ∞ →B∞ is indeed the feedback capacity. When transmission cost constraints are imposed of the form (or variants of them) o n n 1 4 P[0,n] (κ) = PAi |Ai−1 ,Bi−1 , i = 0, . . . , n : E ∑ γi (T i An , T i Bn ) ≤ κ , κ ∈ [0, ∞) n + 1 i=0
(I.4)
the optimization problem (I.1) is replaced by 4
CAFB ∞ →B∞ (κ) = lim inf n−→∞
1 CFB n n (κ), n + 1 A →B
4
CAFB n →Bn (κ) =
sup I(An → Bn )
(I.5)
P[0,n] (κ)
where for each i, the dependence of transmission cost function γi (·, ·) : i = 0, . . . , n , on input and output symbols is specified by T i an ⊆ {a0 , a1 , . . . , ai }, T i bn ⊆ {b0 , b1 , . . . , bi }, and these are either fixed or nondecreasing with i, for i = 0, 1, . . . , n.
A. Motivation, Objectives and Finite Transmission Feedback Information Capacity In general, it is very difficult, almost impossible, to determine the information structures of optimal channel input distributions FB directly from CAFB ∞ →B∞ and CA∞ →B∞ (κ). Indeed, in the related theory of infinite horizon Markov Decision (MD), the fundamental
question, whether optimizing a given pay-off over all non-Markov strategies occurs in the subclass of Markov strategies, is addressed from its finite horizon version. Then by using the Markovian property of strategies, the infinite horizon or per unit time limit (i.e., asymptotic limit) over Markov strategies is analyzed [15]. Moreover, for specific MD models the analysis of the asymptotic limit, reveals several hidden properties of the role of optimal strategies to affect the controlled process, such as, in Linear Quadratic Gaussian Stochastic Optimal Control Problems and in MD models with finite or countable state spaces [15]. FB Therefore, similar to MD models, we investigate CAFB n →Bn and CAn →Bn (κ), which are related to channel coding theorems, as
follows. 1) The characterizations of FTFI capacity gives tight bounds on any achievable code rate (of feedback codes). Via these tight bounds, the direct part of the coding theorem can be shown, by investigating the per unit time limit of the characterizations of FTFI capacity, without unnecessary a´ priori assumptions on the channel, such as, stationarity, ergodicity, or information stability of the joint process {(Ai , Bi ) : i = 0, 1, . . .}. 2) Through the characterizations of FTFI capacity, several hidden properties of the role of optimal channel conditional distributions to affect the channel output transition probability distribution can be identified, including answers to the questions 1 The
superscript on expectation, i.e., Pg indicates the dependence of the distribution on the encoding strategies.
5
whether feedback increases capacity, and by how much. However, in this paper we do not pursue any lengthy investigation regarding 1) or 2), but instead we provide the necessary steps to address such issues. Our main objective is to determine the information structures of optimal channel input distriCI ⊆ P CI butions, by characterizing the subsets of channel input distributions P[0,n] [0,n] and P[0,n] (κ) ⊆ P[0,n] (κ), which satisfy
conditional independence, and give tight upper bounds on directed information I(An → Bn ), which are achievable, called the “characterizations of Finite Transmission Feedback Information (FTFI) capacity”. We derive such characterizations for any class of time-varying channel distributions and transmission cost functions, of the following type. Channel Distributions. Class A. PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = PBi |Bi−1 ,Ai (dbi |bi−1 , aii−L ), i = 0, . . . , n,
(I.6)
i Class B. PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = PBi |Bi−1 ,Ai (dbi |bi−1 i−M , a ), i = 0, . . . , n,
(I.7)
i Class C. PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = PBi |Bi−1 ,Ai (dbi |bi−1 i−M , ai−L ), i = 0, . . . , n.
(I.8)
i−L
i−M
i−M
i−L
Transmission Costs. Class A. γi (T i an , T i bn ) = γiA.N (aii−N , bi ),
i = 0, . . . , n,
(I.9)
Class B. γi (T i an , T i bn ) = γiB.K (ai , bii−K ),
i = 0, . . . , n,
(I.10)
i n
i n
Class C. γi (T a , T b
) = γiC.N,K (aii−N , bii−K ),
i = 0, . . . .n,
(I.11)
Here, {K, L, M, N} are nonnegative finite integers and we use the following convention. i
i i i i If M = 0 then PB |Bi−1 ,Ai (dbi |bi−1 i−M , a )|M=0 = PB |Ai (dbi |a ), for any A ∈ {A , Ai−L }, i = 0, 1, . . . , n. i
i−M
i
For M = L = 0, the channel is memoryless. By invoking function restriction, if necessary, the above transmission cost functions include, as degenerate cases, many others, such as, γi (·, T i bn ) = γi (·, bi−1 i−K ), i = 0, . . . , n. We show in Theorem IV.1 (see also Definition IV.1), that channel conditional distributions induced by various nonlinear channel models (NCM) driven by arbitrary noise processes, such as, nonlinear and linear time-varying Autoregressive models, and nonlinear and linear channel models expressed in state space form [16], are included in the list of channel distributions (I.6)-(I.8). In view of Theorem IV.1, the above list of channel distributions and transmission cost functions, includes many of the existing channels investigated in the literature, for example, nonstationary and nonergodic Additive Gaussian Noise channels investigated by Cover and Pombra [17], stationary deterministic channels [18], and finite alphabet channels with channel state information investigated in [19]–[25]. We derive in Theorem IV.2, the form of the optimal channel input conditional distribution for Multiple Input Multiple Output Gaussian Linear Channel Models with finite memory dependence on channel input and output symbols, from which the characterization of FTFI capacity is obtained.
B. Methodology The methodology we apply to derive the information structures of optimal channel input distributions and the corresponding characterizations of FTFI capacity, combines stochastic optimal control theory [26] and variational equalities of directed information [27]. This method is applied in [28] to derive characterizations of FTFI capacity for channel distributions of Class A and C, with L = 1. In this paper, we apply the method with some variations, to any combination of channel distributions and transmission cost functions of class A, B, C, as follows. FB First, we identify the following connection between stochastic optimal control theory and extremum problems CAFB n →Bn ,CAn →Bn (κ)
(see also Figure I-B). (i)
The information measure I(An → Bn ) is the pay-off;
(ii)
the channel output process {Bi : i = 0, 1, . . . , n} is the controlled process;
6
Fig. I.1.
(iii) (iv)
Communication block diagram and its analogy to stochastic optimal control.
the channel input process {Ai : i = 0, 1, . . . , n} is the control process; the channel output process {Bi : i = 0, 1, . . . , n} is controlled, by controlling its transition probability distribution PBi |Bi−1 : i = 0, . . . , n called the controlled object, via the choice of the transition probability distribution PAi |Ai−1 ,Bi−1 : i = 0, . . . , n ∈ P[0,n] or P[0,n] (κ) called the control object.
Second, we identify variational equalities of directed information, which can be used to determine achievable upper bounds CI (κ) ⊆ P on directed information over subsets of channel input conditional distributions, P[0,n] [0,n] (κ).
We apply the analogy to stochastic optimal control theory and the variational equalities of directed information, to show that for any combination of a channel distribution of class A, B, or C and a transmission cost function of class A, B, or C, the CI (κ), which satisfy conditional independence conditions, maximization of I(An → Bn ) over P[0,n] (κ), occurs in a subset P[0,n]
as follows. n o\ 4 CI P[0,n] (κ) = PAi |Ai−1 ,Bi−1 (dai |ai−1 , bi−1 ) = πi (dai |IiP ) ≡ P Ai ∈ dai |IiP : i = 0, . . . , n P[0,n] (κ), IiP ⊆ ai−1 , bi−1 , i = 0, . . . , n, 4 IiP =
Information Structure of optimal distributions, i = 0, . . . , n.
(I.12) (I.13) (I.14)
Further, we show that the information structure IiP , i = 0, 1, . . . , n, is specified by the memory of the channel conditional distribution, and the dependence of the transmission cost function on the channel input and output symbols. This procedure allows us to determine the dependence, of the joint distribution of {(Ai , Bi ) : i = 0, . . . , n}, and the transition distribution of the channel output process {Bi : i = 0, . . . , n} (i.e., controlled object) on the control object, πi (dai |IiP ) : i = 0, . . . , , and to determine the characterizations of FIFI capacity. The characterizations of FTFI capacity are generalizations of the two-letter characterization of Shannon’s feedback capacity of DMCs, in which the two-letter characterization is replaced by a multi-letter characterizations, depending on the memory of the channel and the dependence of the transmission cost function on channel input and output symbols.
C. Characterizations of FTFI Capacity for Class C Channels and Transmission Cost Functions Next, we describe the characterizations of FTFI capacity, which corresponds to a channel distribution of class C and a transmission cost of class C, and we discuss several degenerate cases. We show that the conditional independence and information structure of the optimal channel input conditional distribution are specified by n o 4 4 4 i−1 PAi |Ai−1 ,Bi−1 = πi (dai |IiP ), IiP = ai−1 i−I , bi−J , i = 0, . . . , n, I = max{L, N}, J = max{M, L, K, N
(I.15)
7
and moreover the induced joint distribution and channel output transition distribution are given by2 i i−1 i−1 PπAi ,Bi (dai , dbi ) = ⊗ni=0 PBi |Bi−1 ,Ai (dbi |bi−1 i−M , ai−L ) ⊗ πi (dai |ai−I , bi−J ) , i = 0, . . . , n, i−M
PπB |Bi−1 (dbi |bi−1 i−J ) = i
i−J
Z Aii−I
i−L
i−1 i−1 i i−1 i−1 π PBi |Bi−1 ,Ai (dbi |bi−1 i−M , ai−L ) ⊗ πi (dai |ai−I , bi−J ) ⊗ PAi−1 |Bi−1 (dai−I |bi−J ). i−M
i−L
i−I
(I.16) (I.17)
i−J
The characterization of FTFI capacity is then given by the following extremum problem. CAFB,C.I,J n →Bn (κ) =
n
sup
i−1 ) ∑ I(Aii−L ; Bi |Bi−J
(I.18)
i o n dPB |Bi−1 ,Ai (·|Bi−1 i−M , Ai−L ) i i−M i−L π E log (Bi ) ∑ i−1 π dP i−1 (·|Bi−J ) i=0
(I.19)
◦ C.I,J i=0 P [0,n] (κ)
n
=
sup ◦ C.I,J P [0,n] (κ)
Bi |Bi−J
where the transmission cost constraint is defined by n ◦ C.I,J 4 i−1 i−1 P [0,n] (κ) = πi (dai |ai−I , bi−J ), i = 0, . . . , n :
n o 1 Eπ ∑ γiC.N,K (Aii−N , Bii−K ) ≤ κ , κ ∈ [0, ∞). n+1 i=0
(I.20)
These structural properties of optimal channel input conditional distributions are generalizations of those of memoryless channels, and they hold for finite, countable and abstract alphabet spaces (i.e., continuous), and channels defined by nonlinear models, state space models, autoregressive models, etc. Next, we discuss certain special cases, to illustrate the explicit analogy to Shannon’s two-letter characterization or capacity formulae of memoryless channels. 1) Special Case M = 2, L = 1: For channel
PBi |Bi−1 ,Ai i−2
: i = 0, 1, . . . , n , the optimal channel input distribution, which
i−1
maximizes I(An → Bn ) occurs in the set o ◦ C.1,2 4 n C.1,2 i−1 i−1 (dai |ai−1 , bi−1 ), i = 0, 1, . . . , n P [0,n] = PAi |Ai−1 ,Bi−1 (dai |a , b ) = πi i−2
(I.21)
and this implies the following. The joint process {(Ai , Bi ) : i = 0, . . . , n} is second-order Markov, the channel output process {Bi : i = 0, . . . , n} is second-order Markov, that is, C.1,2
C.1,2
i
i
PπA .B |Ai−1 ,Bi−1 = PπA ,B |Ai−1 ,Bi−1 , i
i
i−2
i−2
C.1,2
C.1,2
i
i
PπB |Bi−1 = PπB |Bi−1 , i = 0, 1, . . . , n
(I.22)
i−2
If a transmission cost P[0,n] (κ) is imposed corresponding to γiC.1,K (ai , bii−K ), i = 0, 1, . . . , n, the optimal channel input distribution maximizing I(An → Bn ) occurs in set n o\ ◦ C.1,J 4 C.1,J i−1 i−1 (dai |ai−1 , bi−1 P[0,n] (κ) P [0,n] (κ) = PAi |Ai−1 ,Bi−1 (dai |a , b ) = πi i−J ), i = 0, 1, . . . , n
(I.23)
4
where J = max{2, K}, which then implies the following. The joint process {(Ai , Bi ) : i = 0, . . . , n} is J−order Markov, the channel output process {Bi : i = 0, . . . , n} is J−order Markov, and the characterization of FTFI capacity is CAFB,C.1,J n →Bn =
2 The
n
sup
i−1 ) ∑ I(Aii−1 ; Bi |Bi−J
◦ C.1,J T i=0 P [0,n] P[0,n] (κ)
superscript notation on distributions and expectations indicates their dependence on πi (dai |I P ) : i = 0, . . . , n .
(I.24)
8
2) Special Case M = 0, L = 1: For channel
PBi |Ai−1 ,Ai : i = 0, 1, . . . , n , called the Unit Memory Channel Input (UMCI)
channel, the optimal channel input distribution, which maximizes I(An → Bn ) occurs in the set o ◦ C.1,1 4 n i−1 i−1 C.1,1 (dai |ai−1 , bi−1 ), i = 0, 1, . . . , n P [0,n] = PAi |Ai−1 ,Bi−1 (dai |a , b ) = π
(I.25)
which then implies the following. The joint process {(Ai , Bi ) : i = 0, . . . , n} and channel output process {Bi : i = 0, . . . , n} are first-order Markov, and the characterization of FTFI capacity is n
4
CAFB,C.1,1 n →Bn =
∑ I(Ai−1 , Ai ; Bi |Bi−1 ).
sup C.1,1
πi
(I.26)
(dai |ai−1 ,bi−1 ),i=0,1,...,n i=0
Note that for both the UMCI channel, PBi |Ai−1 ,Ai : i = 0, 1, . . . , n , and the channel PBi |Bi−1 ,Ai−1 ,Ai : i = 0, 1, . . . , n , called the Unit Memory Channel Input Output (UMCIO) channel, the optimal channel input distributions, which maximize I(An → Bn ) ◦ C.1,1
occur in the set P [0,n] . That is, the information structures of the optimal channel input distributions are identical for both channels, the characterization of FTFI capacity is given by (I.26), however, the induced joint and transition distributions of the output processes are different. Moreover, for both channels, for each i, the term I(Ai−1 , Ai ; Bi |Bi−1 ), depends on {Ai , Ai−1 , Bi , Bi−1 }, and this dependence is fixed and does not increase with i, i.e., it is a four-letter expression. The Markovian properties of the processes {(Ai , Bi ) : i = 0, . . . , n} imply that the theory of Markov Decision is directly available, and it can be applied, subject to modifications, to investigate the ergodic, via the per unit time limit 4
CAFB,C.I,J ∞ →B∞ (κ) = lim inf n−→∞
1 CFB,C.I,J n n (κ). n + 1 A →B
(I.27)
D. Literature Review Shannon and subsequently Dobrushin [29] characterized the capacity of DMCs with and without feedback, and obtained the well-known two-letter expression 4
C = max I(A; B)
(I.28)
PA
and similarly for memoryless channels with continuous alphabets, with transmission cost
R
|a|2 PA (da) ≤ κ. For memoryless
channels without feedback, this characterization is obtained from the upper bound n
4
CAn ;Bn = max I(An ; Bn ) ≤ PAn
∑ I(Ai ; Bi ) PA ,i=0,...,n max
i
(I.29)
i=0
since this bound is achievable, when the channel input distribution satisfies conditional independence PAi |Ai−1 (dai |ai−1 ) = PAi (dai ), i = 0, 1, . . . , n, and moreover C is obtained, when {Ai : i = 0, 1, . . . , } is identically distributed, which then implies the joint process {(Ai , Bi ) : i = 0, 1, . . . , } is independent and identically distributed. For memoryless channels with feedback, the procedure is similar, provided it is shown (via the converse to the coding theorem) that feedback does not increase capacity, which then implies PAi |Ai−1 ,Bi−1 (dai |ai−1 , bi−1 ) = PAi (dai ), i = 0, 1, . . . , n, and C is obtained if {Ai : i = 0, 1, . . . , } is identically distributed. The conditional independence conditions imply that the Information Structure of the maximizing channel input distributions is the Null Set. Our methodology described in Section I-B, in principle, repeats the above steps, although, each of the steps is more involved due to the memory of the channels, and hence new tools are required to established achievable upper bounds, over proper subsets of channel input conditional distributions. Cover and Pombra [17] (see also [8], [30]) characterized the feedback capacity of nonstationary nonergodic Additive Gaussian Noise (AGN) channels with memory, defined by Bi = Ai +Vi ,
i = 0, 1, . . . , n,
1 n+1 n 2 o ∑ E |Ai | ≤ κ, κ ∈ [0, ∞) n + 1 i=0
(I.30)
9
where {Vi : i = 0, 1, . . . , n} is a real-valued jointly nonstationary Gaussian process N(µV n , KV n ), under the assumption that “An is causally related to V n ” defined by3 PAn ,V n (dan , dvn ) = ⊗ni=0 PAi |Ai−1 ,V i−1 (dai |ai−1 , vi−1 ) ⊗ PV n (dvn ).
(I.31)
In [17], the authors characterized feedback capacity, via the maximization of mutual information between uniformly distributed messages and the channel output process, denoted by I(W, Bn ), and obtained the following characterization 4 . 4
FB,CP CW ;Bn (κ) = n
max 1 n+1
∑ni=0 E|Ai |2 ≤ κ
=
n o I(W, B ) = n
sup
1 Tr Γn K n ΓT +K n ≤κ Γn ,KZ n : n+1 V Z
n n o H(B ) − H(V )
max 1 n+1
∑ni=0 E|Ai |2 ≤ κ:
Ai =∑i−1 j=0 γi, j V j +Zi :
(I.32)
i=0,1,...,n
n n n Γ + I KV (Γ + I)T + KZ n 1 log n 2 K
(I.33)
V
4
4
where Z n = {Zi : i = 0, 1, . . . , n} is a Gaussian process N(0, KZ n ), orthogonal to V n = {Vi : i = 0, . . . , n}, and {γi, j : i, j = 0, . . . , n} are deterministic functions, which are the entries of the lower diagonal matrix Γn . The feedback capacity is shown to be 4
FB,CP FB,CP 1 CW ;B∞ (κ) = limn−→∞ n+1 CW ;Bn (κ). Based on the characterization derived in [17], several investigations of versions of the
Cover and Pombra [17] AGN channel are found in the literature, such as, [8], [31], [32]. Specifically, in [32], the stationary ergodic version of Cover and Pombra [17] AGN channel, is revisited by utilizing characterization (I.33) to derive expressions FB,CP for feedback capacity, CW ;B∞ (κ), using frequency domain methods, when the noise power spectral density corresponds to a
stationary Gaussian autoregressive moving-average model with finite memory. For finite alphabet channels with memory and feedback, expressions of feedback capacity are derived for certain channels with symmetry, in [21]–[25], while in [20] it is illustrated that if the input to the channel and the channel state are related by a one-to-one mapping, and the channel assumes a specific structure, specifically, PBi |Ai ,Ai−1 : i = 0, . . . , n , then dynamic programming can be used, in such capacity problems. In [19] the unit memory channel output (UMCO) channel PBi |Bi−1 ,Ai : i = 0, . . . , n}, is analyzed under the assumption that the optimal channel input distribution is PAi |Bi−1 : i = 0, . . . , n}. The authors in [19] showed that the UMCO channel can be transformed to one with state information. Coding theorems for channels with memory with and without feedback are developed in many papers and books [3]–[14].
II. E XTREMUM PROBLEMS OF D IRECTED I NFORMATION AND VARIATIONAL E QUALITIES In this section, we introduce the basic notation, the precise definition of extremum problem of FTFI capacity (I.1), the variational equalities of directed information [33], and some of their properties. Throughout the paper we use the following notation. R : set of real numbers; Z : set of integer; N0 : set of nonnegative integers {0, 1, 2, . . . }; Rn : set of n tuples of real natural; (Ω, F , P) : probability space, where F is the σ −algebra generated by subsets of Ω; B(W) : Borel σ −algebra of a given topological space W; M (W) : set of all probability measures on B(W) of a Borel space W; K (V|W) : set of all stochastic kernels on (V, B(V)) given (W, B(W)) of Borel spaces W, V;
3
[17], page 39, above Lemma 5. methodology in [17] utilizes the converse coding theorem to obtain an upper bound on the entropy H(Bn ), by restricting {Ai : i = 0, . . . , n} to a Gaussian process. 4 The
10
All spaces (unless stated otherwise) are complete separable metric spaces also called Polish spaces, i.e., Borel spaces. This generalization is adopted to treat simultaneously discrete, finite alphabet, real-valued Rk or complex-valued Ck random processes for any positive integer k, and general Rk −valued random processes with absolute summable p-moments characterized by ` p (N × Ω, F , P; Rn )-spaces, p = 1, 2, . . ., (see [34]) etc. 4
Given two measurable spaces (X, B(X)), (Y, B(Y)) then X × Y = {(x, y) : x ∈ X, y ∈ Y} is the cartesian product of X and Y, and for A ∈ B(X) and B ∈ B(Y) then A × B is called a measurable rectangle. The product measurable space of (X, B(X)) and (Y, B(Y)) is denoted by (X × Y, B(X) ⊗ B(Y)), where B(X) ⊗ B(Y) is the product σ −algebra generated by {A × B : A ∈ B(X), B ∈ B(Y)}. A Random Variable (RV) defined on a probability space (Ω, F , P) by the mapping X : (Ω, F ) 7−→ (X, B(X)) induces a probability distribution P(·) ≡ PX (·) on (X, B(X)) as follows5 . 4 P(A) ≡ PX (A) = P ω ∈ Ω : X(ω) ∈ A , ∀A ∈ B(X).
(II.34)
4
A RV is called discrete if there exists a countable set SX = {xi : i ∈ N} such that ∑xi ∈SX P{ω ∈ Ω : X(ω) = xi } = 1. The probability distribution PX (·) is then concentrated on points in SX , and it is defind by 4
PX (A) =
∑T
xi ∈SX
P ω ∈ Ω : X(ω) = xi , ∀A ∈ B(X).
(II.35)
A
If the cardinality of SX is finite then the RV is finite-vaued and it is called a finite alphabet RV. Given another RV Y : (Ω, F ) 7−→ (Y, B(Y)), PY |X (dy|X)(ω) is called the conditional distribution of RV Y given RV X. The conditional distribution of RV Y given X = x is denoted by PY |X (dy|X = x) ≡ PY |X (dy|x). Such conditional distributions are equivalently described by stochastic kernels or transition functions K(·|·) on B(Y) × X, mapping X into M (Y) (the space of probability measures on (Y, (B(Y))), i.e., x ∈ X 7−→ K(·|x) ∈ M (Y), such that for every F ∈ B(Y), the function K(F|·) is B(X)-measurable.
n 4 The family of such probability distruibutions on (Y, B(Y) parametrized by x ∈ X, is defined by K (Y|X) = K(·|x) ∈ M (Y) : o x∈X . A. FTFI Capacity and Variational Equalities The communication block diagram is shown in Figure I-B. The channel input and channel output alphabets are sequences of Polish measurable spaces (complete separable metric spaces) {(Ai , B(Ai )) : i ∈ Z} and {(Bi , B(Bi )) : i ∈ Z}, respectively, 4
4
and their history spaces are the product spaces AZ = ×i∈Z Ai , BZ = ×i∈Z Bi . These spaces are endowed with their respective 4 product topologies, and B(ΣZ ) = ⊗i∈Z B(Σi ), denotes the σ −algebra on ΣZ , where Σi ∈ Ai , Bi , ΣZ ∈ AZ , BZ , generated 4
4
m m m by cylinder sets. Points in Σm k = × j=k Σ j are denoted by zk = {zk , zk+1 , . . . , zm } ∈ Σk , (k, m) ∈ Z × Z. We often restrict Z to N0 .
Channel Distributions with Memory. A sequence of stochastic kernels or distributions defined by n o 4 C[0,n] = Qi (dbi |bi−1 , ai ) = PBi |Bi−1 ,Ai ∈ K (Bi |Bi−1 × Ai ) : i = 0, 1, . . . , n .
(II.36)
At each time instant i the conditional distribution of the channel is affected causally by past channel output symbols bi−1 ∈ Bi−1 and current and past channel input symbols ai ∈ Ai , i = 0, 1, . . . , n. The distribution at time t = 0 is either fixed or the conditioning information is fixed, depending to the convention used. Channel Input Distributions with Feedback. A sequence of stochastic kernels defined by n o 4 P[0,n] = Pi (dai |ai−1 , bi−1 ) = PAi |Ai−1 ,Bi−1 ∈ K (Ai |Ai−1 × Bi−1 ) : i = 0, 1, . . . , n .
(II.37)
At each time instant i the conditional channel input distribution with feedback is affected causally by past channel inputs and output symbols {ai−1 , bi−1 } ∈ Ai × Bi−1 , i = 0, 1, . . . , n. Hence, the information structure of the channel input distribution at 5 The
subscript X is often omitted.
11
4
time instant i is IiP = {ai−1 , bi−1 } ∈ Ai−1 × Bi−1 , i = 0, 1, . . . , n. Transmission Cost. The cost of transmitting and receiving symbols is a measurable function c0,n : An × Bn 7−→ [0, ∞). The average transmission cost is defined by n o n 1 4 EP c0,n (An , Bn ) ≤ κ, c0,n (an , bn ) = ∑ γi (T i an , T i bn ), κ ∈ [0, ∞) n+1 i=0
(II.38)
where the superscript notation EP {·} denotes the dependence of the joint distribution on the choice of conditional distribution {Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n} ∈ P[0,n] . A channel input distribution with feedback and transmission cost is defined by n o 1 4 P[0,n] (κ) = Pi (dai |ai−1 , bi−1 ) ∈ M (Ai ), i = 1, . . . , n : EP c0,n (An , Bn−1 ) ≤ κ ⊂ P[0,n] (II.39) n+1 FTFI Capacity. Given any channel input distribution Pi (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n ∈ P[0,n] and the channel distribution Q(dbi |bi−1 , ai−1) : i = 0, 1, . . . , n ∈ C[0,n] , then we can uniquely define the induced joint PP (dan , dbn ) on the distribution
canonical space An × Bn , B(An ) ⊗ B(Bn ) , and we can construct a probability space Ω, F , P carrying the sequence of
RVs {(Ai , Bi ) : i = 0, . . . , n}, as follows. 4 P An ∈ dan , Bn ∈ dbn = PP (dan , dbn ), n ∈ N = ⊗nj=0 P(db j |b j−1 , a j ) ⊗ P(da j |a j−1 , b j−1 ) = ⊗nj=0 Q j (db j |b j−1 , a j ) ⊗ Pj (da j |a j−1 , b j−1 ) .
(II.40) (II.41)
Further, we define the joint distribution of Bi : i = 0, . . . , n and its conditional distribution by6 Z 4 P Bn ∈ dbn = PP (dbn ) =
An
PP (dan , dbn ), n ∈ N,
(II.42)
≡ ΠP0,n (dbn ) = ⊗ni=0 ΠPi (dbi |bi−1 ) ΠPi (dbi |bi−1 ) =
Z Ai
(II.43)
Qi (dbi |bi−1 , ai ) ⊗ Pi (dai |ai−1 , bi−1 ) ⊗ PP (dai−1 |bi−1 ), i = 0, . . . , n.
(II.44)
The above distributions are parametrized by either a fixed B−1 = b−1 ∈ B−1 or a fixed distribution P(db−1 ) = ν(db−1 ). An alternative equivalent representation of the sets P[0,n] , C[0,n] and induced joint distribution, and marginal distribution, is via → − ← − the causally conditioned compound probability distributions Q 0,n (·|an ) ∈ M (Bn ) parametrized by an ∈ An and P 0,n (·|bn−1 ) ∈ M (An ) parametrized by bn−1 ∈ Bn−1 , defined as follows. → − ← − 4 4 Q 0,n (dbn |an ) = ⊗ni=0 Qi (dbi |bi−1 , ai ), P 0,n (dan |bn−1 ) = ⊗ni=0 Pi (dai |ai−1 , bi−1 ), Z ← − → − → − ← − ← − 4 P PP (dan , dbn ) = ( P 0,n ⊗ Q 0,n )(dan , dbn ), PP (dbn ) = Π0,n (dbn ) = ( P 0,n ⊗ Q 0,n )(dan , dbn ). An
(II.45) (II.46)
→ − ← − It is shown in [27], that the set of distributions Q 0,n (·|an ) ∈ M (Bn ) and P 0,n (·|bn−1 ) ∈ M (An ) are convex. The pay-off or directed information I(An → Bn ) is defined as follows. o n dQ (·|Bi−1 , Ai ) n 4 i (B ) I(An → Bn ) = ∑ EP log i dΠPi (·|Bi−1 ) i=0 Z dQ (·|bi−1 , ai ) n i log (b ) PP (dai , dbi ) ≡ IAn →Bn (Pi , Qi , : i = 0, 1, . . . , n) =∑ i P i−1 ) i ×Bi dΠ (·|b A i i=0 − Z d→ ← Q 0,n (·|ai ) → − − − ← − → = log (b0,n ) ( P 0,n ⊗ Q 0,n )(dan , dbn ) ≡ IAn →Bn ( P 0,n , Q 0,n ) ← − P An ×Bn dΠ0,n (·)
(II.47) (II.48) (II.49)
where the notation in (II.48) illustrates that I(An → Bn ) is a functional of the two sequences of conditional distributions, → − ← − Pi (dai |ai−1 , bi−1 ), Qi (dbi |bi−1 , ai ) : i = 0, 1, . . . , n and the notation in (II.49) indicates it is a functional of { P 0,n (dan |bn−1 ), Q 0,n (dbn |an )}. 6 Throughout
the paper the superscript notation PP (·), ΠP0,n (·), etc., indicates the dependence of the distributions on the channel input conditional distribution.
12
These are equivalent representations [27].
− → − ← − → ← − Further, it is shown in [27], that the functional IAn →Bn ( P 0,n , Q 0,n ) is convex in Q 0,n (·|an ) ∈ M (Bn ) for a fixed P 0,n (·|bn−1 ) ∈ → − ← − M (An ) and concave in P 0,n (·|bn−1 ) ∈ M (An ) for a fixed Q 0,n (·|an ) ∈ M (Bn ). Next, we introduce the definition of FTFI capacity CAFB n →Bn , using the established notation of distributions given by (II.40)-(II.44). Definition II.1. (Extremum problem with feedback) Given any channel distribution from the class C[0,n] , find the Information Structure of the optimal channel input distribution Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n ∈ P[0,n] (assuming it exists) of the extremum problem defined by 4
CAFB n →Bn =
Pi (dai
I(An → Bn ),
sup
I(An → Bn ) = (II.48).
(II.50)
|ai−1 ,bi−1 ):i=0,...,n
∈P[0,n]
When an transmission cost constraint is imposed the extremum problem is defined by 4
CAFB n →Bn (κ) =
I(An → Bn ).
sup
(II.51)
Pi (dai |ai−1 ,bi−1 ):i=0,...,n ∈P[0,n] (κ)
Our objective is to determine the information structures of optimal channel input distributions for any combination of channel distribution and transmission cost of class A, B, or C, as discussed by (I.12)-(I.14). Clearly, for each time i the largest 4
FB P i−1 , bi−1 }, i = 0, 1, . . . , n. information structure of the channel input distributions of problem CAFB n →Bn and CAn →Bn (κ) is Ii = {a
Next, we introduce the two variational equalities of directed information, which we employ in many of the derivations. Theorem II.1. (Variational Equalities) Given a channel input distribution Pi (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n ∈ P[0,n] and channel distribution Qi (dbi |bi−1 , ai ) : → − ← − i = 0, 1, . . . , n ∈ C[0,n] , define the corresponding joint and marginal distributions PP (dan , dbn ) ≡ ( P 0,n ⊗ Q 0,n )(dan , dbn ) ∈ ← −
P (dbn ) ∈ M (Bn ) by (II.40)-(II.46). M (An × Bn ), ΠP0,n (dbn ) = ⊗ni=0 ΠPi (dbi |bi−1 ) ≡ Π0,n
(a) Let V0,n (dbn ) ∈ M (Bn ) be any arbitrary distribution on Bn , which is uniquely defined by Vi (dbi |bi−1 ) ∈ M (Bi ) : i = 4
0, . . . , n} via V0,n (dbn ) = ⊗ni=0Vi (dbi |bi−1 ) and vice-versa. → − ← − For a fixed P 0,n (dan |bn−1 ) ∈ M (An ) and Q 0,n (dbn |an ) ∈ M (Bn ), define the following functional. − − ← − → ← − → I0,n (·; P 0,n , Q 0,n ) : M (Bn ) 7−→ R, +∞}, V0,n (dbn ) 7−→ I0,n (V0,n ; P 0,n , Q 0,n ), − Z d→ Q 0,n (·|an ) n ← → − − − ← − → 4 (b ) ( P 0,n ⊗ Q 0,n )(dan , dbn ) I0,n (V0,n ; P 0,n , Q 0,n )) = log dVn (·) An ×Bn
(II.52) (II.53)
Then the following hold.
→ − − ← − → ← − (i) The functional I0,n (V0,n ; P 0,n , Q 0,n ) is convex in V0,n (dbn ) ∈ M (Bn ) for fixed P 0,n (dan |bn−1 ) ∈ M (An ) and Q 0,n (dbn |an ) ∈ M (Bn ). (ii) The following variational equality holds. I(An → Bn ) =
inf
V0,n (dbn )∈M (Bn )
− ← − → I0,n (V0,n ; P 0,n , Q 0,n )
(II.54)
← −
P (dbn ). and the infimum in (II.54) is achieved at V0,n (dbn ) = Π0,n
Equivalently, the following variational equality holds. n
I(An → Bn ) =
inf
Z
∑
i i Vi (dbi |bi−1 )∈M (Bi ):i=0,1,...,n i=0 A ×B
log
dQ (·|bi−1 , ai ) i
dVi (·|bi−1 )
(bi ) PP (dai , dbi )
(II.55)
and the infimum in (II.55) is achieved at Vi (dbi |bi−1 ) = ΠPi (dbi |bi−1 ), i = 0, . . . , n
or
ΠP0,n (dbn )
given by (II.42)-(II.44).
(II.56)
13
(b) Let Si (dbi |bi−1 , ai−1 ) ∈ M (Bi ) : i = 0, . . . , n and Ri (dai |ai−1 , bi ) ∈ M (Ai ) : i = 0, 1, .. . , n be arbitrary distributions and define the joint distribution on M (An × Bn ) by ⊗ni=0 Si (dbi |bi−1 , ai−1 ) ⊗ Ri (dai |ai−1 , bi ) . Then the following variational equality holds. n
I(X n → Y n ) =
sup
∑
Z
Ai ×Bi Si (dbi |bi−1 ,ai−1 )⊗Ri (dai |ai−1 ,bi )∈M (Ai ×Bi ):i=0,1,...,n i=0
log
dRi (·|ai−1 , bi ) (ai ) dPi (·|ai−1 , bi−1 )
Si (dbi |bi−1 ,ai−1 )∈M (Bi ), Ri (dai |ai−1 ,bi )∈M (Ai )
! dSi (·|bi−1 , ai−1 ) . (bi ) PP (dai , dbi ) dΠPi (·|bi−1 )
(II.57)
and the supremum in (II.57) is achieved when the following identity holds. dPi (·|ai−1 , bi−1 ) dQi (·|bi−1 , ai ) (a ). (bi ) = 1 − a.a.(an , bn ), i = 0, 1, . . . , n. i dRi (·|ai−1 , bi ) dSi (·|bi−1 , ai−1 )
(II.58)
Equivalently, the supremum in (II.57) is achieved at ⊗ni=0 Si (dbi |bi−1 , ai−1 ) ⊗ Ri (dai |ai−1 , bi ) = PP (dan , dbn ).
(II.59)
Proof: These are derived in [27], Theorem IV.1. We shall use the variation equality in (a) to identify upper bounds on directed information, which are achievable over specific subsets of the set of distributions P0,n] and P[0,n] (κ), which depend on the properties of the channel distribution and the transmission cost function. The second variation equality, although not essential in this paper, we use it to identify lower bounds on directed information, which are achievable over specific subsets of the set of distributions P0,n] and P[0,n] (κ). The first variational equality encompasses as a special case, the maximum entropy properties of joint and conditional distributions, such as, the maximizing entropy property of Gaussian distributions.
III. C HARACTERIZATION OF FTFI C APACITY In this section, we derive the information structures of optimal channel input distributions, as described in Section I-B. Using the established notation, the channel output process {Bi : i = 0, 1, . . . , n} is controlled, by controlling the controlled object P(dbi |bi−1 ) ≡ ΠPi (dbi |bi−1 ) ∈ M (Bi ) : i = 0, . . . , n via the choice of the control object Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n ∈ P[0,n] . We derive the characterizations of FTFI capacity in the following sequence. Step 1-finite memory on channel input or channel output symbols. Given a channel distribution of Class A or B, and transmission cost functions of Class A or B, where {L, N} are finite, we show via stochastic optimal control and variational equality (II.55), that at each time instant i, the information structure of the optimal channel input distribution lies in a subset IiP ⊆ {ai−1 , bi−1 }, i = 0, 1, . . . , n, and depends on the entire sequence of channel output symbols. This implies the corresponding channel input distributions lie in a subset P [0,n] ⊆ P[0,n] , and have at most finite memory with respect to channel input symbols. Step 2-finite memory on channel input and output symbols. Given a channel distribution of Class C, and transmission cost functions of Class C, since these are special cases of the ones in Step 1, then the optimal channel input distributions lie in a subset P [0,n] ⊆ P[0,n] . Further, we apply stochastic optimal control and variational equality (II.55), to the resulting optimization ◦
problem to obtain an upper bound, which is achievable over smaller subsets of conditional distributions P [0,n] ⊂ P [0,n] , and have finite memory with respect to channel input and output symbols.
14
A. Channel Class A or B and Transmission Cost Class A or B 1) Channel Class A and Transmission Cost Class A: Given the channel distribution (I.6), the joint distribution is defined by 4 PP (dai , dbi ) = ⊗ij=0 Pj (da j |a j−1 , b j−1 ) ⊗ Q j (db j |b j−1 , a jj−L ) , i = 0, . . . , n.
(III.60)
Consequently, directed information is given by n dQ (·|Bi−1 , Ai ) o n n i i−L I(An → Bn ) = ∑ EP log (B ) ≡ I(Aii−L ; Bi |Bi−1 ) i ∑ dΠPi (·|Bi−1 ) i=0 i=0 n o n i−1 i−1 = ∑ EP `Pi (Ai , Ai−L ,B )
(III.61) (III.62)
i=0
where 4
i−1 `Pi (ai , ai−1 )= i−L , b
Z
log Bi
ΠPi (dbi |bi−1 ) = =
Z i ZA
Ai
dQ (·|Bi−1 , ai
i−L ) (bi ) P i−1 dΠi (·|b ) i
dQi (dbi |bi−1 , aii−L )
(III.63)
Qi (dbi |bi−1 , aii−L ) ⊗ Pi (dai |ai−1 , bi−1 ) ⊗ PP (dai−1 |bi−1 ), i = 0, . . . , n.
(III.64)
Qi (dbi |bi−1 , aii−L ) ⊗ PP (daii−L |bi−1 )
(III.65)
i−1 i−1 i−1 ) through , b ) depends on (ai−1 We make the following observation. For each i, the pay-off in (III.62), i.e., `Pi (ai , ai−L i−L , b 4 the channel distribution dependence on these variables, and the control object g j (a j−1 , b j−1 ) = Pj (da j |a j−1 , b j−1 ) : j = 4 0, . . . , i , througt the controlled object ξ j (b j−1 ) = ΠPi (dbi |bi−1 ) : j = 0, . . . , i}, defined by III.64), for i = 0, . . . , n. More over, for each i, the dependence of the controlled object ξ j (b j−1 ) : j = 0, . . . , i} on ai−1−L is through the control object g j (a j−1 , b j−1 ) : j = 0, . . . , i , for i = 0, . . . , n, and not the channel distribution. Hence, by stochastic optimal control theory, i−1 i−1 and precisely as done in Markov decision theory, [15], [26], we can introduce additional state variables for (ai−L , b ), and
then apply either dynamic programming, to deduce that the maximization of directed information I(An → Bn ) defined by 4 (III.61) over gi (ai−1 , bi−1 ) = Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n , occurs in the subset, which satisfies conditional independence i−1 ) − a.a.(ai−1 , bi−1 ), i = 0, . . . , n. Pi (dai |ai−1 , bi−1 ) = P(dai |ai−1 i−L , b i−L
Although, the above observation suffices to determine the information structures of optimal channel input conditional distributions, we provide an alternative derivation based on the variational equalities of Theorem II.1, applied to the channel distribution (I.6). For convenience we introduce the following application of Theorem II.1 to channel distribution (I.6).
Theorem III.1. (Variational equalities for Class A channels) Consider the channel distribution (I.6) and directed information I(An → Bn ) defined by (III.61), via distributions (III.60), (III.64). The following hold. 4 (a) Let V[0,n] = Vi (dbi |Bi−1 ) ∈ M (Bi ) : i = 0, . . . , n be an arbitrary set of distributions. Then n dQ (·|Bi−1 , Ai ) o i i−L P E log (B ) i ∑ dΠPi (·|Bi−1 ) P[0,n] i=0 n dQ (·|Bi−1 , Ai ) o n i i−L = sup inf ∑ EP log (B ) . i dVi (·|Bi−1 ) P[0,n] V[0,n] i=0 n
sup I(An → Bn ) = sup
P[0,n]
(III.66) (III.67)
Moreover, the infimum over V[0,n] is achieved at Vi (dbi |bi−1 ) = ΠPi (dbi |Bi−1 ), i = 0, . . . , n given by (III.64). (b) Let Si (dbi |bi−1 , ai−1 ) ∈ M (Bi ) : i = 0, . . . , n and Ri (dai |ai−1 , bi ) ∈ M (Ai ) : i = 0, 1, . . . , n be arbitrary distributions
15
and defined the joint distribution on M (An × Bn ) by ⊗ni=0 Si (dbi |bi−1 , ai−1 ) ⊗ Ri (dai |ai−1 , bi ) . Then n
sup
∑ EP
P[0,n] i=0
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) dΠPi (·|Bi−1 )
o
i
(III.68) n
= sup P [0,n]
sup
∑
Ai ×Bi Si (dbi |bi−1 ,ai−1 )⊗Ri (dai |ai−1 ,bi )∈M (Ai ×Bi ):i=0,1,...,n i=0
! dSi (·|bi−1 , ai−1 ) dRi (·|ai−1 , bi ) (ai ) (bi ) PP (dai , dbi ) dPi (·|ai−1 , bi−1 ) dΠPi (·|bi−1 )
Z
log
Si (dbi |bi−1 ,ai−1 )∈M (Bi ), Ri (dai |ai−1 ,bi )∈M (Ai )
(III.69) Moreover, the supremum over Si (dbi |bi−1 , ai−1 )⊗Ri (dai |ai−1 , bi ) ∈ M (Ai ×Bi ) : i = 0, 1, . . . , n is achieved when the following identity holds. dQi (·|bi−1 , aii−L ) dPi (·|ai−1 , bi−1 ) (a ). (bi ) = 1 − a.a.(ai , bi ), i = 0, 1, . . . , n. i dRi (·|ai−1 , bi ) dSi (·|bi−1 , ai−1 ) Equivalently, the supremum is achieved at ⊗ni=0 Si (dbi |bi−1 , ai−1 ) ⊗ Ri (dai |ai−1 , bi ) = PP (dan , dbn ).
(III.70)
Proof: (a), (b) These are applications of Theorem II.1 to the specific channel, hence the derivations are omitted. Next, we apply the variational equalities of Theorem III.1 and stochastic optimal control theory, to identify the information structure of the optimal channel input conditional distribution, which maximizes (III.61) over P[0,n] , without and with a transmission transmission cost constraint of Class A. Theorem III.2. (Class A channels and transmission cost functions) Suppose the channel distribution is of Class A defined by (I.6), i.e., PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = Qi (dbi |bi−1 , aii−L ) − a.a.(bi−1 , ai ), i = 0, . . . , n.
(III.71)
Define the following restricted class of channel input distributions. n o A.L 4 i−1 i−1 P [0,n] = Pi (dai |ai−1 , bi−1 ) = πiA.L (dai |ai−L , b ) − a.a.(ai−1 , bi−1 ) : i = 0, 1, . . . , n ⊂ P[0,n]
(III.72)
The following hold. A.L
(a) Without Transmission Cost. The maximization of I(An → Bn ) defined by (III.61) over P[0,n] occurs in P [0,n] and the characterization of FTFI capacity is given by the following expression. n
CAFB,A.L n →Bn =
π A.L
∑E
sup
i−1 )∈M (A ):i=0,...,n i=0 πiA.L (dai |ai−1 i i−L ,b
n
log
dQ (·|Bi−1 , Ai i
π A.L
dΠi
i−L ) (Bi ) i−1 (·|B )
o
(III.73)
where A.L
Z
A.L
i−1 i−1 ), ) ⊗ Pπ (dai−1 Qi (dbi |bi−1 , aii−L ) ⊗ πiA.L (dai |ai−1 i−L , b i−L |b A.L i−1 ) , i = 0, . . . , n Pπ (dai , dbi ) = ⊗ni=0 Qi (dbi |bi−1 , aii−L ) ⊗ πiA.L (dai |ai−1 i−L , b
Ππi
(dbi |bi−1 ) =
Aii−L
(III.74) (III.75)
and the intial data are specified from the convention used. (b) With Transmission Cost. Define the average transmission cost constraint by n n o 1 4 A P[0,n] (κ) = Pi (dai |ai−1 , bi−1 ), i = 0, 1, . . . , n : EP ∑ γiA.N (Aii−N , Bi ) ≤ κ ⊂ P[0,n] n+1 i=0
(III.76)
and suppose the following condition holds. sup I(An → Bn ) = inf λ ≥0 A
P[0,n] (κ)
Pi (dai
sup |ai−1 ,bi−1 ):i=0,...,n
∈P[0,n]
n n n oo I(An → Bn ) − λ EP ∑ γiA.N (Aii−N , Bi ) − κ(n + 1) i=0
(III.77)
16
where λ is the Lagrange multiplier associated with the transmission cost constraint. A (κ) occurs in the subset The maximization of I(An → Bn ) defined by (III.61) over P[0,n]
n 4 A.I i−1 i−1 P [0,n] (κ) = Pi (dai |ai−1 , bi−1 ) = πiA.I (dai |ai−I , b ) − a.a.(ai−1 , bi−1 ), i = 0, 1, . . . , n : n o A.I 1 4 A.N i i A γ (A , B ) ≤ κ ⊂ P[0,n] (κ), I = max{L, N}. Eπ ∑ i i−N n+1 i=0
(III.78)
and the characterization of FTFI capacity is given by the following expression. n
CAFB,A.I n →Bn =
sup
i−1 ),i=0,...,n: 1 Eπ A.I πiA.I (dai |ai−1 i−I ,b n+1
∑ni=0 γiA.N (Aii−N ,Bi )
≤κ
∑ Eπ i=0
A.I
n
log
dQ (·|Bi−1 , Ai i
π A.I
dΠi
i−L ) (Bi ) (·|Bi−1 )
o
(III.79)
where the joint and marginal distributions are given by A.I
Z
A.I
i−1 i−1 i−1 ), Qi (dbi |bi−1 , aii−L ) ⊗ πiA.I (dai |ai−I , b ) ⊗ Pπ (dai−1 i−I |b A.I i−1 ) , i = 0, . . . , n Pπ (dai , dbi ) = ⊗ni=0 Qi (dbi |bi−1 , aii−L ) ⊗ πiA.I (dai |ai−1 i−I , b
Ππi (dbi |bi−1 ) =
Aii−L
(III.80) (III.81)
and the initial data are specified by the convention used. Proof: (a) Recall (III.61). By applying the re-conditioning property of expectation, we obtain the following identities. n dQi (·|Bi−1 , Aii−L ) n n P I(A → B ) = ∑ E log (Bi ) (III.82) dΠPi (·|Bi−1 ) i=0 n o n dQi (·|Bi−1 , Aii−L ) i i−1 = ∑ EP EP log A , B (III.83) (B ) i ΠPi (·|Bi−1 ) i=0 n o n dQi (·|Bi−1 , Aii−L ) i i−1 P P = ∑ E E log (III.84) (Bi ) Ai−L , B dΠPi (·|Bi−1 ) i=0 n 4 4 i−1 = ∑ EP `Pi Ai , Si , Si = (Ai−1 ), Si = Bi−1 , i = 0, . . . , n, (III.85) i−L , B i=0
`Pi
Q (db |s , a ) Z i i i i Qi (dbi |si , ai ), i = 0, . . . , n ai , si ≡`Pi ai , ai−1 , s = log i−L i ΠPi (dbi |si ) Bi
where (III.84) is due to the channel conditional independence property (III.71). Hence, o n dQ (·|Bi−1 , Ai ) n n i i−L P P E (B ) = sup ` A , S . sup ∑ EP log i i i ∑ i dΠPi (·|Bi−1 ) P[0,n] i=0 P[0,n] i=0
(III.86)
(III.87)
P 4 i−1 ) : i = 0, 1, . . . , n via the `i (ai , si ) : i = 0, . . . , n defined by (III.86) depends on si = (ai−1 i−L , b 4 channel distribution dependence on these variables, and the control object gi (ai−1 , bi−1 ) = Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n ,
The pay-off functional
4
via the controlled object {ξi (bi−1 ) = ΠPi (dbi |bi−1 ) : i = 0, . . . , n}. Hence, {Si : i = 0, . . . , n} is the controlled process, control i−1 i−1 )≡ by the control process {Ai : i = 0, . . . , n}, and therefore the optimal control object satisfies gi (ai−1 , bi−1 ) = gA.L i (ai−L , b i−1 i−1 A.L πi (dai |ai−L , b ) : i = 0, . . . , n . Nevetheless, we will show this by invoking the variational equalities of Theorem III.1 to identify achievable upper bounds. 4 Consider the set of arbitrary distributions V[0,n] = Vi (dbi |bi−1 ) ∈ Mi (Bi ) : i = 0, . . . , n and define the pay-off function Q (db |s , a ) Z i i i i `Vi ai , si = log Qi (dbi |si , ai ), i = 0, . . . , n. Vi (dbi |si ) Bi
(III.88)
17
By virtue of (III.67), identity (III.87), and inequality sup inf ≤ inf sup we obtain the following upper bound. n
sup
∑ EP
P[0,n] i=0
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) dΠPi (·|Bi−1 )
o
i
n
n o `Vi Ai , Si
(III.89)
o n P V S E ` A , i i ∑ i
(III.90)
∑ EP V[0,n]
= sup inf P[0,n]
i=0 n
≤ inf sup V[0,n] P
[0,n] i=0
n
∑ EP
≤ sup
P[0,n] i=0
n o `Vi Ai , Si , ∀ Vi (dbi |bi−1 ) ∈ M (Bi ), i = 0, . . . , n
(III.91)
4 i−1 ) : i = 0, . . . , n (and also ζ (bi−1 ) = Since the pay-off functions `Vi ai , · : i = 0, . . . , n depend on si = (ai−1 Vi (dbi |bi−1 ) ∈ i i−L , b M (Bi ), whose information is already included in si ), by stochastic optimal control theory [26], the maximizing distribution A.L
in the right hand side of (III.91) occurs in the set P [0,n] , defined by (III.72). Hence, the following upper bound is obtained. n
sup
∑ EP
n
P[0,n] i=0
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) P i−1 dΠi (·|B )
o
i
n
≤
π ∑E
sup
A.L
n o `Vi Ai , Si ,
(III.92)
i−1 ):i=0,...,n i=0 πiA.L (dai |ai−1 i−L ,b
∀ Vi (dbi |bi−1 ) ∈ M (Bi ), i = 0, . . . , n where Eπ
A.L
means expectation with respect to joint distribution (III.75). Next, we evaluate the upper bound (III.92) at
A.L Vi (dbi |bi−1 ) = Πiπ (dbi |bi−1 ), i
`Vi
= 0, . . . , n, defined by (III.74), which implies Q (db |s , a ) Z i i i i π A.L i−1 ≡ ` a , a , s = log ai , si Qi (dbi |si , ai ), i = 0, . . . , n i i i i−L A.L π V =π A.L Bi Πi (dbi |si )
(III.93)
to obtain the following upper bound. n
∑ EP
sup
P[0,n] i=0
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) P i−1 dΠi (·|B ) n
=
o
i
π ∑E
sup
i−1 ):i=0,...,n πiA.L (dai |ai−1 i−L ,b
A.L
n
log
n
≤
π ∑E
sup
A.L
n A.L o `πi Ai , S i
(III.94)
i−1 ):i=0,...,n i=0 πiA.L (dai |ai−1 i−L ,b
dQ (·|Bi−1 , Ai
i=0
i
π A.L
dΠi
i−L ) (Bi ) i−1 (·|B )
Note that any other choice of Vi (dbi |bi−1 ), other than Vi (dbi |bi−1 ) = Ππi joint distribution induced by the channel distribution and
o
≡ CAFB,A.L n →Bn .
(III.95)
A.L
(dbi |bi−1 ), i = 0, . . . , n will not be consistent with the A.L i {πi (dai |ai−L , bi−1 ) : i = 0, . . . , n}, i.e., the distribution over which
the expectation is taken in (III.92). A.L
The reverse inequality can be shown by restricting the maximization in (III.87) to the subset P [0,n] ⊂ P[0,n] , which then implies the joint and transition probability distribution of the channel output process are given by (III.74) and (III.75), and consequently the reverse inequality is obtained. We can also show the reverse inequality via an application of variational equality (III.69). We do so to illustrate the power of variational equalities. By virtue of (III.69), and by removing the supremum over Si (dbi |bi−1 , ai−1 ) ⊗ Ri (dai |ai−1 , bi ) : i = 0, . . . , n , and setting Si (dbi |bi−1 , ai−1 )⊗Ri (dai |ai−1 , bi ) =
dPi (·|ai−1 , bi−1 ) dΠP (·|bi−1 ) (ai ) (bi ) A.L i−1 dπ A.L (·|ai−L , bi−1 ) dΠπi (·|bi )
i−1 i−1 .πiA.L (dai |ai−1 ) ⊗ dQi (dbi |bi−1 , ai−L ), a.s., i = 0, 1, . . . , n. i−L , b
(III.96)
A.L where Ππi (bi |bi ) : i = 0, . . . , n is given by (III.74), then the following lower bound is obtained. n
sup
∑ EP
P[0,n] i=0
Since for each i, the pay-off `πi
A.L
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) P i−1 dΠi (·|B ) i
o
n
≥ sup
∑ EP
P[0,n] i=0
n A.L o `πi Ai , Si .
(III.97)
ai , · depends on s, for i = 0, . . . , n, by stochastic optimal control theory (as above) then the
18
following lower bound is obtained. n
∑ EP
sup
n
P[0,n] i=0
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) dΠPi (·|Bi−1 )
o
i
≥ CAFB,A.L n →Bn .
(III.98)
Combining (III.95) and (III.98) we establish the claims in (a). (b) Since by condition (III.77), the constraint problem is equivalent to an unconstraint problem, we repeat the steps in (a), to the aumgented pay-off given by the following expression. I(An → Bn ) − λ EP
o n dQ (·|Bi−1 , Ai ) n i i−L A.N i i A.N i i P (B ) − λ γ (A , B ) . γ (A , B ) = E log i ∑ i i−N ∑ i i−N dΠPi (·|Bi−1 ) i=0 i=0 n
(III.99)
Note that the term λ (n + 1)κ is not included, because it does not affect the derivation of information structures of optimal channel input conditional distribution. Similarly as in the unconstraint case, we have the following. n I(An → Bn ) − λ EP ∑ γiA.N (Aii−N , Bi )
(III.100)
i=0
n P
=∑E i=0 n
P
E
n
log
dQ (·|Bi−1 , Ai
P = ∑ EP `i Ai , Sbi
i−L )
i
ΠPi (·|Bi−1 )
,
o A.N i i i i−1 (Bi ) − λ γi (Ai−N , B ) A , B
(III.101)
4 4 4 4 i−1 i−1 Sbi = (Ai−1 ), Si = (Ai−1 ), Si = Bi−1 , I = max{L, N} i = 0, . . . , n, i−I , B i−L , B
(III.102)
i=0
where i Z h Qi (dbi |si , ai ) P − λ γiA.N (Aii−N , Bi ) Qi (dbi |si , ai ), i = 0, . . . , n. `i ai , sbi = log (III.103) P Πi (dbi |si ) Bi P Note that unlike part (a), the augmented pay-off function `i ai , · : i = 0, . . . , n defined by (III.103) depends on sbi = i−1 ) : i = 0, . . . , n , via the channel and cost function, and that if I = L then sˆ = s , i = 0, . . . , n (same as in (a)). (ai−1 i i i−I , b It is easy to verify that the variational equalities of Theorem II.1 (see Theorem IV.1 in [27]) are also valid, when transmission A (κ). Note that if cost constraints are imposed. Thus, Theorem III.1 holds, with the supremum over P[0,n] replaced by P[0,n]
I = L, the optimal channel input distribution has exactly the same form as in part (a). P
i−1 ), for i = 0, . . . , n, Keeping in mind the dependence, for each i, of the unconstraint pay-off function `i (ai , ·) on sbi = (ai−1 i−I , b
we repeat the derivation of the upper bound in (a), by invoking the first variational equality, and then remove the infimum, to deduce n
sup
∑ EP
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) P i−1 dΠi (·|B )
o
i
A (κ) i=0 P[0,n]
n
≤
sup
∑ EP
A (κ) i=0 P[0,n]
n o `Vi Ai , Si .
(III.104)
A.I
Further, by setting Vi (dbi |bi−1 ) = Ππi (dbi |bi−1 ), i = 0, . . . , n defined by (III.80), the maximization of the right hand side of A.I
A (κ), hence the following upper bound. (III.104) occurs in the subset P [0,n] (κ) ⊆ P[0,n] n
sup
∑ EP
A (κ) i=0 P[0,n]
n
log
dQ (·|Bi−1 , Ai
i−L ) (Bi ) dΠPi (·|Bi−1 ) i
o
≤ CAFB,A.I n →Bn (κ).
(III.105)
From this point forward, by repeating the derivation of part (a), if necessary, it is easy to deduce that the information structure i−1 }, for i = 0, 1, . . . , n, of the channel input distribution, which maximizes directed information, for each i, is IiP = {ai−1 i−I , b
which then implies (III.79)-(III.81). We note that if N > L, the upper bound corresponds to {Ππi
A.N
(dbi |bi−1 ) : i = 0, . . . , n} defined by (III.80), (with I = N), which
i−1 ) : i = 0, . . . , n}. This completes the prove. depends on the channel input conditional distribution {πiA.N (dai |ai−1 i−N , b
We make the following comments regarding the derivation of the theorem. Remark III.1. (Comments on Theorem III.2)
19
(a) As illustrated prior to the statement of the theorem and in the derivation, we do not need to apply the variational equality to determine the information structures of optimal channel input distributions for channel distribution of Class A and transmission cost function of Class A. The conclusion follows directly from the standard theory of Markov Decision.
→ − (b) Recall the functional defined by (II.54) of Theorem II.1, specialized to channel distribution Class A, with Q (dbn |an ) → − 4 replaced by Q A (dbn |an ) = ⊗ni=0 Qi (dbi |bi−1 , aii−L ). The pay-off functional in (III.90), is equivalent to n o n − ← − → 4 I(V0,n ; P 0,n , Q A0,n ) = ∑ EP `Vi Ai , Si
(III.106)
i=0
→ − ← − and this functional is convex in V0,n (dbn ) ∈ M (Bn ) for fixed P 0,n (dan |bn−1 ) ∈ M (An ) (since the channel Q A (dbn |an ) ∈ ← − (Bn ) is always fixed), and concave in P 0,n (dan |bn−1 ) ∈ M (An ) for a fixed V0,n (dbn ) ∈ M (Bn ). Hence, if we also impose − ← − → sufficient conditions so that I(V0,n ; P 0,n , Q A0,n ) is lower semicontinuous in V0,n (dbn ) ∈ M (Bn ) and upper semicontinuous in ← − P 0,n (dan |bn−1 ) ∈ M (An ), then the saddle point inequalities hold [35], and we have n
∑ EP V[0,n]
sup inf
P[0,n]
n o `Vi Ai , Si = inf sup V[0,n] P
i=0
n
∑ EP
n o `Vi Ai , Si .
(III.107)
[0,n] i=0
It is important to note that for finite alphabet spaces {(Ai , Bi ) : i = 0, . . . , n}, all conditions for validity of (III.107) hold. However, for countable or Borel spaces (i.e., continuous alphabet spaces) we need to impose conditions for upper and lower semicontinuity − ← − → of the functional I(V0,n ; P 0,n , Q A0,n ). Such conditions are identified in [27] using the topology of weak convergence of probability distributions. In the next remark, we illustrate that the last theorem gives as degenerate case, one of the information structures derived [28]. Moreover, we also illustrate that the derivation based on variational equalities, provides an alternative derivation of capacity achieving distributions of memoryless channels.
Remark III.2. (Special cases) (a) Relation to [28]. If L = N = 0 then I = 0, which corresponds to a channel distribution and transmission cost function, which do not depend on past channel input symbols, and (III.80) and (III.81) are induced by the channel and channel input distribution {πiA.0 (dai |bi−1 ) : i = 0, . . . , n , and all statements of Theorem III.2, (b) specialize to one of the results derived in [28]. (b) Application of variational equalities to memoryless channels. If the channel is memoryless, i.e., PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = Qi (dbi |ai ) − a.a.(bi−1 , ai ), i = 0, . . . , n, then from the derivation of Theorem III.2, we can replace (III.85), (III.86) by Q (db |a ) Z 4 i i i P Qi (dbi |ai ), s = bi−1 , i = 0, . . . , n `i ai , si = log (III.108) ΠPi (dbi |bi−1 Bi ΠPi (dbi |bi−1 ) =
Z
Qi (dbi |ai ) ⊗ PP (dai |bi−1 ).
(III.109)
Ai
4 Since for each i, the pay-off function `Pi ai , · depends on s = bi−1 only through the control object gi (bi−1 ) = PP (dai |bi−1 ), and not the channel distribution, then by Markov Decision theory, the optimal channel input conditional distribution of memoryless channels satisfies gi (bi−1 ) ≡ P(dai ) − a.a.bi−1 , for i = 0, . . . , n, i.e., it is independent of the conditioning information. Alternatively, by an application of the variational equality, repeating the steps, starting with (III.88) and leading to (III.95), with the corresponding upper bound obtained by considering (III.91) evaluated at 4
Vi (dbi |bi−1 ) = Ππi (dbi ) =
Z
Qi (dbi |ai ) ⊗ πi (dai ), i = 0, . . . , n
(III.110)
Ai
i.e., corresponding to Pi (dai |ai−1 , bi−1 ) = πi (dai ) ≡ P(dai ), i = 0, . . . , n, then the following upper bound is obtained. n dQ (·|A ) o n dQ (·|A ) o n n i i i i π sup ∑ EP log (B ) ≤ sup E log (B ) . (III.111) i i ∑ dΠπi (·) dΠPi (·|Bi−1 ) P[0,n] i=0 i=0 πi (dai ):i=0,...,n
20
Further, the reverse inequality holds, by restricting the channel input distributions to the smaller set Pi (dai |ai−1 , bi−1 ) = πi (dai ), i = 0, . . . , n , i.e., the upper bound is achievable, when the process (Ai , Bi ) : i = 0, . . . , n is jointly independent. Note that for memoryless channels with feedback, the standard method often applied to derive the capacity achieving distribution, is via the converse coding theorem, by first showing that feedback does not increase capacity compared to the case without feedback [7]. As pointed out by Massey [2], for channels with feedbac, it will be a mistake to use mutual information I(An ; Bn ), because by Marko’s bidirectional information [1], mutual information is not a tight bound on any achievable rate for channels with feedback. Strictly speaking, for memoryless channels, any derivation of capacity achieving distribution for channels with feedback, which applies the bound I(An ; Bn ) ≤ ∑ni=0 I(Ai ; Bi ), presupposes that it is already shown that feedback does not increase capacity, i.e., that P(dai |ai−1 , bi−1 ) = P(dai ) − a.a.(ai−1 , bi−1 ), i = 0, . . . , n. Next, we give examples. Example III.1. (Channel Class A and Transmission Cost Class A) Consider a channel Qi (bi |bi−1 , ai , ai−1 ) : i = 0, 1, . . . , n . (a) Without Transmission Cost. By Theorem III.2, (a) (since there is no transmission cost constraint) the optimal channel input conditional distribution occurs in the subset A.1 4 P [0,n] = πiA.1 (dai |ai−1 , bi−1 ) : i = 0, . . . , n ⊂ P[0,n]
(III.112)
and the characterization of the FTFI capacity is 4
CAFB,A.1 n →Bn = sup A.1
n
∑
P [0,n] i=0
Z
log
dQ (·|bi−1 , a , a i
π A.1
dΠi
i i−1 ) (bi ) (·|bi−1 )
A.1 Pπ (dbi , dai , dai−1 )
(III.113)
n
= sup A.1
∑ I(Ai−1 , Ai ; Bi |Bi−1 )
(III.114)
P [0,n] i=0
where A.1
Ππi (dbi |bi−1 ) =
Z
A.1
Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiA.1 (dai |ai−1 , bi−1 ) ⊗ Pπ (ai−1 |bi−1 ), i = 0, . . . , n, A.1 A.1 Pπ (dbi , dai ) = ⊗ni=0 Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiπ (dai |ai−1 , bi−1 ) , i = 0, . . . , n. Aii−1
(III.115) (III.116)
(b) With Transmission Cost Function γiA.2 (aii−2 , bi ), i = 0, 1, . . . , n , that is, L = 1, N = 2. The characterization of the FTFI capacity is given by the following expression. CAFB,A.2 sup n →Bn (κ) = A.2
n
∑ I(Ai−1 , Ai ; Bi |Bi−1 )
(III.117)
o n A.2 1 Eπ γiA.2 (Ai , Ai−1 , Ai−2 , Bi ) ≤ κ , ∑ n+1 i=0
(III.118)
P [0,n] (κ) i=0
where n 4 A.2 P [0,n] (κ) = πiA.2 (dai |ai−1 , ai−2 , bi−1 ), i = 0, . . . , n : A.2
Z
A.2
Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiA.2 (dai |ai−1 , ai−2 , bi−1 ) ⊗ Pπ (ai−1 , ai−2 |bi−1 ), A.2 Pπ (dbi , dai ) = ⊗ni=0 Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiA.2 (dai |ai−1 , ai−2 , bi−1 ) , i = 0, . . . , n.
Ππi (dbi |bi−1 ) =
Aii−1
(III.119) (III.120)
Since, N = 2 and L = 1, the dependence of the optimal channel input distribution on past channel input symbols is determined from the dependence of the instantaneous transmission cost on past channel input symbols. Moreover, although, in both cases, with and without transmission cost, the pay-off ∑ni=0 I(Ai−1 , Ai ; Bi |Bi−1 ) is the same, the channel output transition probability distributions and joint distributions, are different, because these are induced by different optimal channel input conditional distributions.
21
2) Channel Class A and Transmission Cost Class B and Vice-Versa: From Theorem III.2, we can also deduce the information structures of optimal channel input conditional distributions for channels of Class A and transmission cost functions of Class B, and vice-versa. These are stated as a corollary.
Corollary III.1. (Class A channels and Class B transmission cost functions and vice-versa) (a) Suppose the channel distribution is of Class A, as in Theorem III.2, the transmission cost function is of Class B, specifically, {γiB.K (ai , bii−K ) : i = 0, . . . , n}, and the corresponding average transmission cost constraint is defined by n 4 B (κ) = Pi (dai |ai−1 , bi−1 ), i = 0, 1, . . . , n : P[0,n]
n o 1 EP ∑ γiB.K (Ai , Bii−K ) ≤ κ ⊂ P[0,n] . n+1 i=0
(III.121)
B (κ), is of Then the optimal channel input conditional distribution, which maximizes I(An → Bn ) defined by (III.61) over P[0,n] i−1 i−1 the form Pi (dai |a , b ), i = 0, 1, . . . , n (i.e., there is no reduction in the information structure of the optimal channel input
distribution). (b) Suppose the channel distribution is of Class B, defined by i i−1 i , a ), i = 0, . . . , n PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = Qi (dbi |bi−1 i−M , a ) − a.a.(b
(III.122)
A (κ) defined by (III.76) (i.e., it corresponds to a transmission cost function and the average transmission cost constraint is P0,n
of Class A). Then directed information is given by o n dQ (·|Bi−1 , Ai ) n i i−M I(An → Bn ) = ∑ EP log (B ) i dΠPi (·|Bi−1 ) i=0
(III.123)
where ΠPi (dbi |bi−1 ) =
Z Ai
i i−1 i−1 Qi (dbi |bi−1 , b ) ⊗ PP (dai−1 |bi−1 ), i = 0, . . . , n. i−M , a ) ⊗ Pi (dai |a
(III.124)
A (κ) is of the form P (da |ai−1 , bi−1 ), i = Moreover the optimal channel input distribution, which maximizes (III.123) over P0,n i i 0, 1, . . . , n (i.e., there is no reduction in information structure). Proof: (a), (b) By repeating the derivation of Theorem III.2, (b), if necessary, we can verify that if either the pay-off or the channel conditional distribution, depends on the entire history of the channel input process, then the augmented pay-off is a functional of the entire past of channel input symbols. This implies there is no reduction in the information structure of the optimal channel input conditional distribution, which maximizes directed information.
B. Channels Class C and Transmission Cost Class C, A or B In this section, we consider channel distributions of Class C and transmission cost functions Class C, A or B. Clearly, channel distributions of Class C and transmission cost functions of Class C, depend only on finite channel input and output symbols, when compared to any of the ones treated in previous sections. Since any channel of Class C is a special case of Channels of Class A, and any transmission cost of Class C is a special case of transmission costs of Class A, then we can invoke Theorem III.2 to conclude that the maximizing channel input conditional A,I
distribution occurs in the preliminary subset P [0,n] (κ) ⊂ P[0,n] (κ). Then we can further apply the variational equalities of directed information and stochastic optimal control theory (as in Theorem III.2), to show the supremum over the set of channel A.I
◦
A.I
input conditional distributions P [0,n] (κ), occurs in a smaller subset P [0,n] (κ) ⊂ P [0,n] (κ) ⊂ P[0,n] (κ). Alternatively, we can apply the variational equalities directly, without using the results of Theorem III.2. First, we do not impose any transmission cost constraint, to illustrate the methodology, without the need to introduce complex notation.
22
i−1 1) Channels Class C: Consider the channel distribution Qi (dbi |bi−M , aii−L ) : i = 0, 1, . . . , n . Since there is no transmission cost, by Theorem III.2, (a), we obtain the following preliminary characterization of the FTFI capacity. dQ (·|bi−1 , ai ) i i−M i−L P i i (b ) i P (db , dai−L ) ∑ dΠPi (·|bi−1 ) P[0,n] i=0 n dQ (·|Bi−1 , Ai ) o n i i−M i−L π A.L log (Bi ) = sup A.L ∑E dΠπi (·|Bi−1 ) i−1 ):i=0,...,n i=0 πiA.L (dai |ai−1 i−L ,b o n A.L n i−1 i−1 π π A.L (A , A , B ) ` E = sup i i i−L ∑ n
4
CAFB,C n →Bn = sup
Z
log
(III.125) (III.126)
(III.127)
i−1 ):i=0,...,n i=0 πiA.L (dai |ai−1 i−L ,b
where `πi
A.L
4
i−1 (ai , ai−1 )= i−L , b
Z
log
dQ (·|bi−1 , ai ) i i−M i−L
Bi A.L
(dbi |bi−1 ) =
π A.L
dΠi
(·|bi−1 )
i (bi ) dQi (dbi |bi−1 i−M , ai−L ),
Z
(III.128)
A.L
i A.L i−1 i−1 i−1 i−1 Qi (dbi |bi−1 ) ⊗ Pπ (dai−L |b ), i−M , ai−L ) ⊗ πi (dai |ai−L , b A.L j j−1 j−1 A.L ) , i = 0, . . . , n. Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 j−M , a j−L ) ⊗ π j (da j |a j−L , b
Ππi
Aii−L
(III.129) (III.130)
The main objective is to show the optimal channel input conditional distribution in (III.126) satisfies the following conditional 4
i−1 i−1 independence condition. Pi (dai |ai−1 , bi−1 ) = P(dai |ai−L , bi−J ) − a.a.(ai−1 , bi−1 ), i = 0, . . . , n, J = max{M, L}. Similarly, as in the
channel distribution, and on
bi−1−M
via the control object
A.L
i−1 (ai , ·), depends on (bi−1 i−M , Ai−L ) via 4 i−1 ) = π A.L (da |ai−1 , bi−1 ), for i = 0, . . . , n. To identify gi (ai−1 i i−L i i−L , b
previous section, we note that for each i, the pay-off functional in (III.127), i.e., `πi
the the
information structure of the optimal channel input conditional distribution, we apply the variational equality.
Theorem III.3. (Channel class C) Suppose the channel conditional distribution is of Class C, defined by i PBi |Bi−1 ,Ai (dbi |bi−1 , ai ) = Qi (dbi |bi−1 i−M , ai−L ), i = 0, . . . , n.
(III.131)
◦ C.L,J
Define the restricted class of channel input conditional distributions P [0,n] by ◦ C.L,J 4 n A.L i−1 i−1 P [0,n] = πi (dai |ai−L , b ) ∈ M (Ai ), i = 0, . . . , n :
o 4 A.L C.L,J i−1 i−1 i−1 i−1 i−1 , b ) = π (da |a , b ) − a.a.(a , b ), i = 0, . . . , n ⊂ P [0,n] , J = max{M, L}. πiA.L (dai |ai−1 i i−L i−L i−J i
(III.132)
◦ C.L,J
The maximization in CAFB,C n →Bn defined by (III.125) or equivalently (III.126) occurs in the subset P [0,n] and the characterization of FTFI capacity is given by the following expression. CAFB,C.L,J n →Bn =
n
sup C.L,J
πi
π ∑E
i−1 i=0 (dai |ai−1 i−L ,bi−J )∈M (Ai ):i=0,...,n
C.L,J
n
log
dQ (·|Bi−1 , Ai ) i i−M i−L π C.L,J
dΠi
(·|Bi−1 i−J )
o (Bi )
(III.133)
where C.L,J Ππi (dbi |bi−1 i−J ) =
Z
C.L,J
C.L,J i i−1 π i−1 Qi (dbi |bi−1 (dai |ai−1 (dai−1 i−M , ai−L ) ⊗ πi i−L , bi−J ) ⊗ P i−L |bi−J ), C.L,J j j−1 j−1 C.L,J Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 , a ) ⊗ π (da |a , b ) j j−M j−L j j−L j−J , i = 0, . . . , n. Aii−L
(III.134) (III.135)
and the initial data are specified by the convention used. Proof: The derivation is based on applying the variational equality of directed information to (III.126). 4 i−1 ) : i = 0, . . . , n , and any arbitrary distribution V i−1 ) ∈ M (B ) : i = 0, . . . , n}, Given a policy πiA.L (dai |ai−1 i [0,n] = {Vi (dbi |b i−L , b
23
by virtue of Theorem II.1, then CAFB,C n →Bn
=
inf
sup
V[0,n]
i−1 )∈M (A ):i=0,...,n πiA.L (dai |ai−1 i i−L ,b
where {Pπ
A.L
n
n
! i o dQi (·|bi−1 i−M , ai−L ) π A.L i−1 i (b ) P (db , db , da ) i i i−L dVi (·|bi−1 )
Z
log
∑
i=0
(III.136)
(bi , bi−1 , daii−L ) : i = 0, . . . , n} is defined by (III.130). For any arbitrary {Vi (dbi |bi−1 ) : i = 0, . . . , n} ∈ M (Bi ) : i =
0, . . . , n}, define the restricted set o n C.J i−1 C.J 4 ) − a.a.b : i = 0, . . . , n ⊆ V[0,n] . V[0,n] = Vi (dbi |bi−1 ) = V i (dbi |bi−1 i−J
(III.137)
Then we obtain the following upper bounds on (III.136). CAFB,C n →Bn
n
(α)
≤
sup
i−1 )∈M (A ):i=0,...,n πiA.L (dai |ai−1 i i−L ,b
Pπ
A.L
inf
log
∑
! i dQi (·|bi−1 i−M , ai−L ) (bi ) dVi (·|bi−1 )
Z
C.J i=0 Vi (dbi |bi−1 ),i=0,...,n ∈V[0,n]
(dbi , daii−L )
(III.138) n
(β )
≤
sup
i−1 )∈M (A ):i=0,...,n πiA.L (dai |ai−1 i i−L ,b
!
i dQi (·|bi−1 i−M , ai−L ) (bi ) C.J dV i (·|bi−1 i−J )
Z
log
∑
i=0
Pπ
A.L
(dbi , daii−L ),
C.J
∀V i (dbi |bi−1 i−J ) ∈ M (Bi ), i = 0, . . . , n n
(γ)
=
sup
∑
C.L,J i−1 πi (dai |ai−1 i−L ,bi−J )∈M (Ai ):i=0,...,n
(III.139) !
i dQi (·|bi−1 i−M , ai−L ) (bi ) C.J dV i (·|bi−1 i−J )
Z
log
i=0
Pπ
C.L,J
(dbii−J , daii−L ),
C.J
∀ V i (dbi |bi−1 i−J ) ∈ M (Bi ), i = 0, . . . , n
(III.140)
where (α) is due to taking the infimum of a smaller set; (β ) is due to removing the infimum; (γ) is due to the fact that the pay-off function
R
Bi log
i dQi (·|bi−1 i−M ,ai−L ) C.J
dV i (·|bi−1 i−J )
i−1 i i (bi )Qi (dbi |bi−1 i−M , ai−L ) is at most a functional of {bi−J , ai−L },
i−1 ) ∈ M (A ) : i = 0, . . . , n} for i = 0, . . . , n, and hence by stochastic optimal control theory, the supremum over {πiA.L (dai |ai−1 i i−L , b i−1 in (III.139), occurs in the subset {πiC.L,J (dai |ai−1 i−L , bi−J ) ∈ M (Ai ) : i = 0, . . . , n}. C.J is arbitrary, we let Since, {Vi (dbi |bi−1 ) : i = 0, . . . , n} ∈ V[0,n] π C.L,J
C.J
Vi (dbi |bi−1 ) =V i (dbi |bi−1 i−J ) = V i Z
=
Z Aii−L
i π Qi (dbi |bi−1 i−M , ai−L ) ⊗ P
C.L,J i−1 i−1 π Qi (dbi |bi−1 (dai |ai−1 i−M , ai−L ) ⊗ πi i−L , bi−J ) ⊗ P
Aii−L π C.L,J
=Πi
4
(dbi |bi−1 i−J ) =
C.L,J
C.L,J
(daii−L |bi−1 i−J )
i−1 i−1 (dai−L |bi−J )
i−1 (dbi |bi−1 , i = 0, 1, . . . , n i−J ) − a.a.b
(III.141)
which is precisely (III.134). Evaluating the upper bound (III.140) at the distribution defined by (III.141), we obtain the following. ! i ) n Z (δ ) dQi (·|bi−1 , a C.L,J FB,C i−M i−L (III.142) (bi ) Pπ (dbii−J , daii−L ) CAn →Bn ≤ sup log ∑ C.L,J i−1 π C.L,J dΠ (·|b ) πi
i−1 i=0 (dai |ai−1 i−L ,bi−J )∈M (Ai ):i=0,...,n
i
i−J
Next, we can show the reverse inequality also holds, by replacing the maximization in (III.126) by the subset of distributions ◦ C.L,J
A.L
P [0,n] ⊂ P [0,n] ⊂ P[0,n] , and hence by (III.125)-(III.130), we obtain the following lower boud. ! i−1 n Z dQi (·|bi−1 A.L FB,C i−M , ai−L ) CAn →Bn = sup (bi ) Pπ (dbi , daii−L ) A.L ∑ log dΠπ (·|bi−1 ) i−1 )∈M (A ),i=0,...,n i=0 πiA.L (dai |ai−1 i i−L ,b
n
≥
sup
∑
C.L,J i−1 πi (dai |ai−1 i−L ,bi−J )∈M (Ai ):i=0,...,n
i=0
(III.143)
i
Z
log
i dQi (·|bi−1 i−M , ai−L ) C.L,J dΠπi (·|bi−1 i−J )
! (bi ) Pπ
C.L,J
(dbi−J , daii−L ).
(III.144)
24
i−1 ) : i = 0, . . . , n} ∈ P Combining (III.144) and (III.142), we deduce that the supremum over {πiA.L (dai |ai−1 [0,n] in the i−L , b ◦ C.L,J
definition of CAFB,C n →Bn occurs in P [0,n] , which implies statements (III.133)-(III.135). This completes the prove. Although, Theorem III.3 is derived by invoking, as an intermediate step, the results of Theorem III.2, (a), to conclude identity (III.126), this is not essential. We can apply the variational equality directly to (III.125), and repeat the pevious procedure to show validity of the statements of Theorem III.3. Next, we present examples. Example III.2. (Channel Class C) (a) Consider a channel Qi (dbi |bi−1 , ai , ai−1 ) : i = 0, 1, . . . , n , i.e., M = 1, L = 1. By Theorem III.3, the optimal channel input conditional distribution occurs in the subset o ◦ C.1,1 4 n C.1,1 (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n P [0,n] = πi
(III.145)
and the characterterization of the FTFI capacity is CAFB,C.1,1 n →Bn
n
Z
= sup
∑ ◦ C.1,1
P [0,n]
log
dQi (·|bi−1 , ai , ai−1 ) c.1,1 dΠiπ (·|bi−1 )
i=0
C.1,1 (bi ) Pπ (dbi , bi−1 , dai , ai−1 )
(III.146)
n
∑ I(Ai−1 , Ai ; Bi |Bi−1 ). ◦ C.1,1
≡ sup
P [0,n]
(III.147)
i=0
where C.1,1
Z
C.1,1
Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiC.1,1 (dai |ai−1 , bi−1 ) ⊗ Pπ (dai−1 |bi−1 ), C.1,1 Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 , a j−1 ) ⊗ π C.1,1 (da j |a j−1 , b j−1 ) , i = 0, . . . , n. j
Ππi
(dbi |bi−1 ) =
Aii−1
(III.148) (III.149)
The above characterization of FTFI capacity implies (a.i)
the joint process {(Ai , Bi ) : i = 0, . . . , n} is first-order Markov;
(a.ii) the channel output process {Bi : i = 0, . . . , n} is first-order Markov, or equivalently the following hold. P(dai , dbi |ai−1 , bi−1 ) = Pπ
C.1,1
(dai , dbi |ai−1 , bi−1 ), P(dbi |bi−1 ) = Pπ
C.1,1
(dbi |bi−1 ), i = 0, 1, . . . , n.
(III.150)
(b) Consider a channel Qi (dbi |bi−1 , bi−2 , ai , ai−1 ) : i = 0, 1, . . . , n , i.e., M = 2, L = 1. By Theorem III.3, the optimal channel input conditional distribution occurs in the subset o ◦ C.1,2 4 n C.1,2 (dai |ai−1 , bi−1 , bi−2 ) : i = 0, 1, . . . , n P [0,n] = πi
(III.151)
and the characterterization of the FTFI capacity is CAFB,C.1,2 n →Bn = sup
n
∑ ◦ C.1,2
P [0,n]
Z
log
i=0
dQi (·|bi−1 , bi−2 , ai , ai−1 ) π C.1,2
dΠi
(·|bi−1 , bi−2 )
C.1,2 (bi ) Pπ (dbi , bi−1 , bi−2 , dai , ai−1 )
(III.152)
n
≡ sup
∑ I(Ai−1 , Ai ; Bi |Bi−1 , Bi−2 ). ◦ C.1,2
P [0,n]
i=0
The above characterization of FTFI capacity implies (b.i)
4
the joint process {Zi = (Bi−1 , Bi−2 , Ai−1 ) : i = 0, . . . , n} is first-order Markov; 4
(b.ii) the channel output process {Si = (Bi−1 , Bi−2 ) : i = 0, . . . , n} is first-order Markov,
(III.153)
25
or equivalently the following hold. P(dzi |zi−1 ) = Pπ
C.1,2
(dzi |zi−1 ), P(dsi |si−1 ) = Pπ
C.1,2
(dsi |si−1 ), i = 0, 1, . . . , n.
(III.154)
The optimizations of characterizations of FTFI capacity expressions in (a) and (b) over the channel input distributions can be solved by applying dynamic programming, in view of the Markov property of the channel output processes. 2) Channel Class C with Transmission Costs Class C: Consider a channel distribution of Class C defined by (III.131) and an average transmission cost constraint corresponding to a transmission cost function of Class C, specifically, {γiC.N,K (aii−N , bii−K ) : i = 0, . . . , n}, defined as follows. n 4 C P[0,n] (κ) = Pi (dai |ai−1 , bi−1 ), i = 0, 1, . . . , n :
n o 1 EP ∑ γiC.N,K (Aii−N , Bii−K ) ≤ κ . n+1 i=0
(III.155)
Since a channel of Class C is a special case of channel of Class A, and a transmission cost function of Class C, is a special case of a transmission cost function of Class A, then Theorem III.2, (b) is directly applicable (we do not apply Theorem III.3 because of the transmission cost constraint), hence we obtain the following preliminary characterization of FTFI capacity. n
4
CAFB,C n →Bn (κ) =
sup
∑
Z
log
C (κ) i=0 P[0,n]
dQ (·|bi−1 , ai ) i i−M i−L P i i (b ) i P (db , da ) dΠPi (·|bi−1 ) n
=
sup
i−1 ),i=0,...,n: 1 Eπ πiA.I (dai |ai−1 i−I ,b n+1
A.I
C.N,K
∑ni=0 γi
(Aii−N ,Bii−K ) ≤κ
π ∑E
A.I
n
(III.156)
log
i=0
dQ (·|Bi−1 , Ai ) i i−M i−L A.I
dΠπi (·|Bi−1 )
o (Bi )
(III.157)
where A.I
Ππi (dbi |bi−1 ) =
Z
A.I
4
i A.I i−1 i−1 i−1 Qi (dbi |bi−1 ) ⊗ Pπ (dai−1 ), I = max{L, N}, i−M , ai−L ) ⊗ πi (dai |ai−I , b i−I |b A.I j j−1 j−1 A.I Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 ) , i = 0, . . . , n. j−M , a j−L ) ⊗ π j (da j |a j−I , b Aii−I
(III.158) (III.159)
Next, we state the main theorem concerning the maximization in (III.157) or equivalently (III.156), for different parameters {M, N, L, K}.
Theorem III.4. (Channel class C transmission cost class C) C (κ) is Suppose the channel conditional distribution is of Class C, defined by (III.131), the transmission cost constraint P[0,n]
defined by (III.155), and the following condition holds. n n n oo sup I(An → Bn ) = inf sup I(An → Bn ) − λ EP ∑ γiC.N,K (Aii−N , Bii−K ) − κ(n + 1) . C (κ) P[0,n]
λ ≥0 P[0,n]
(III.160)
i=0
Then the following hold. (a) If M ≥ K and L ≥ N then the characterization of the FTFI capacity is given by the following expression. CAFB,C.L,J n →Bn (κ) =
n
sup
∑ Eπ
◦ C.L,J i=0 P [0,n] (κ)
C.L,J
n
log
dQ (·|Bi−1 , Ai ) i i−M i−L dΠπi
C.L,J
(·|Bi−1 i−J )
o 4 (Bi ) , J = max{L, M}
where the maximizing channel input conditional distribution occurs in the subset n n o ◦ C.L,J C.L,J 1 4 C.L,J i−1 (dai |ai−1 Eπ γiC.N,K (Aii−N , Bii−K ) ≤ κ P [0,n] (κ) = πi ∑ i−L , bi−J ), i = 0, 1, . . . , n : n+1 i=0
(III.161)
(III.162)
C.L,J i−1 i−1 and the joint and marginal distributions are induced by {Qi (dbi |bi−1 (dai |ai−i i−M , ai−L ), πi i−L , bi−J ) : i = 0, . . . , n , that is, they are given by (III.134), (III.135).
26
(b) If {M, N, L, K} are arbitrary then the characterization of the FTFI capacity is given by the following expression. CAFB,C.I,J n →Bn (κ) =
n
∑ Eπ
sup ◦ C.I,J
P [0,n] (κ)
C.I,J
n
log
dQ (·|Bi−1 , Ai ) i i−M i−L π C.I,J
dΠi
i=0
(·|Bi−1 i−J )
o 4 4 (Bi ) , I = max{L, N}, J = max{L, N, K, M}
where the maximizing channel input conditional distribution occurs in the subset n o n ◦ C.I,J C.I,J 1 4 C.I,J i−1 i−1 γiC.N,K (Aii−N , Bii−K ) ≤ κ Eπ P [0,n] (κ) = πi (dai |ai−I , bi−J ), i = 0, 1, . . . , n : ∑ n+1 i=0
(III.163)
(III.164)
and the joint and channel output distributions are given by C.I,J
Z
C.I,J
C.I,J i i−1 i−1 i−1 Qi (dbi |bi−1 (dai |ai−I , bi−J ) ⊗ Pπ (dai−1 i−M , ai−L ) ⊗ πi i−I |bi−J ), C.I,J j−1 C.I,J i (da j |a j−1 Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 j−M , a j−L ) ⊗ π j j−I , b j−J ) , i = 0, . . . , n.
Ππi
(dbi |bi−1 i−J ) =
(III.165)
Aii−I
(III.166)
Proof: The derivation is based on repeating the steps of Theorem III.3, with some modifications to account for the average transmission cost constraints. Since condition (III.160) holds, consider the augmented pay-off given by the following expression. I(An → Bn ) − λ EP
n dQ (·|Bi−1 , Ai ) o n i C.N,K i C.N,K i i−M i−L i−1 P i γ (A , B ) = E log (B ) − λ γ (A , B ) i ∑ i ∑ i−N i−K i−N i−K i dΠPi (·|Bi−1 ) i=0 i=0 n
(III.167)
where the term λ (n + 1)κ is omitted. (a) Consider the case M ≥ K, L ≥ N. By repeating the derivation of Theorem III.2, if necessary, for the above augmented pay-off, we conclude the following preliminary characterization of the FTFI capacity. n
CAFB,C n →Bn (κ) =
sup
i−1 ),i=0,...,n: 1 Eπ πiA.L (dai |ai−1 i−L ,b n+1
A.L
C.N,K
∑ni=0 γi
(Aii−N ,Bii−K ≤κ
π ∑E
A.L
n
log
dQ (·|Bi−1 , Ai ) i i−M i−L
i=0
dΠπi
A.L
(·|Bi−1 )
o (Bi )
(III.168)
From the last equation, we conclude that, when N ≥ K, L ≥ N, the preliminary characterizations of the FTFI capacity, is not affected by the presence of the transmission cost constraint, compared to that of CAFB,C n →Bn (κ) given by (III.126) (i.e., which does not use transmission cost constraint). Hence, by imposing the additional transmission cost constraint, and by repeating the derivation of Theorem III.3, if necessary, then we conclude (III.162) and (III.161). (b) Consider the case, when {M, K, L, N} are arbitrary. By repeating the steps of the derivation of Theorem III.2, specifically, (III.82)-(III.86), for the augmented pay-off (III.167), we obtain the following. n I(An → Bn ) − λ EP ∑ γiC.N,K (Aii−N , Bii−K ) (III.169) i=0
i dQi (·|Bi−1 C.N,K i i−M , Ai−L ) i P (Bi ) − λ γi (Ai−N , Bi−K ) = ∑ E log dΠPi (·|Bi−1 ) i=0 n i−1 n i i−1 o dQi (·|Bi−M , Aii−L ) C.N,K i P P i = ∑ E E log (Bi ) − λ γi (Ai−N , Bi−K ) A , B ΠPi (·|Bi−1 ) i=0 n o n 4 4 P i−1 = ∑ EP `i Ai , Sbi , Sbi = (Ai−1 ), I = max{L, N} i−I , B n
(α)
=
i=0 n
∑ Eπ
A.I
n A.I o π `i Ai , Sbi , i = 0, . . . , n
(III.170) (III.171) (III.172) (III.173)
i=0
where Z h Q (·|s , a ) i 4 i−1 4 P i i i C.N,K i i i−1 `i ai , sbi = (b ) − λ γ (A , B ) Qi (dbi |si , ai ), S = (Ai−1 , log i i−N i−K i−L , Bi−M ), Si = B i P (·|s ) Π Bi i i Z h i Q (·|s , a ) (β ) i i i = log (bi ) − λ γiC.N,K (Aii−N , Bii−K ) Qi (dbi |si , ai ) A.I Bi Ππ (·|si ) i A.I π ≡`i ai , sbi , i = 0, . . . , n
(III.174) (III.175) (III.176)
27
and (α), (β ) are due to the preliminary characterization of FTFI capacity given by (III.157). π A.I
Clearly, for each i, the function `i
i−1 (ai , ·) depends on sbi = (ai−I , si ), where its dependence on channel distribution and
i−1 transmission cost function symbols is (Ai−1 i−I , bi−I ), I = max{K, M}, while the dependence on the rest of the symbols is via the 4
i−1 i−1 i−1 ), for i = 0, . . . , n. From this point forward, we apply the variational equality control object gA,I ) = πiA.I (dai |ai−1 i (ai−I , b i−I , b
of directed information, by repeating the derivation of Theorem III.3, if necessary, to deduce that, when the channel input conditional distribution occurs in the set (III.164), then we can obtain upper and lower bounds, which are identical, and they are given by (III.163), with corresponding channel output distribution and joint distribution defined by (III.165) and (III.166). This completes the prove. Let us illustrate, in the next example, the difference of the information structure of optimal channel input distribution, compared to Example III.2. Example III.3. (Channel Class C and Transmission Cost Class C) Consider a channel Qi (dbi |bi−1 , ai , ai−1 ) : i = 0, 1, . . . , n and transmission cost function γiC.2,1 (aii−2 , bii−1 ) : i = 0, . . . , n , i.e., , M = 1, L = 1, N = 2, K = 1. By Theorem III.4, the optimal channel input conditional distribution occurs in the subset n o n ◦ C.2,2 C.2.2 1 4 C.2,2 i−1 (dai |ai−1 Eπ γi (Aii−2 , Bii−1 ) ≤ κ P [0,n] (κ) = πi ∑ i−2 , bi−2 ), i = 0, . . . , n : n+1 i=0
(III.177)
and the characterterization of the FTFI capacity is CAFB,C.2,2 n →Bn (κ) =
n
sup ◦ C.2,2
P [0,n] (κ)
∑ Eπ
C.2,2
n
log
i=0
dQi (·|bi−1 , ai , ai−1 ) π C.2,2
dΠi
(·|bi−1 , bi−2 )
o (bi )
(III.178)
where Ππi
C.2,2
Z
C.2,2
i−1 π i−1 (dai−1 Qi (dbi |bi−1 , ai , ai−1 ) ⊗ πiC.2,2 (dai |ai−1 i−2 , bi−2 ) ⊗ P i−2 |bi−2 ), C.2,2 j−1 Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 , a j , a j−1 ) ⊗ π C.2,2 (da j |a j−1 j j−2 , b j−2 ) .
(dbi |bi−1 , bi−2 ) =
Aii−2
(III.179) (III.180)
The above characterization implies the joint process {(Ai , Bi ) : i = 0, . . . , n} and channel output process {Bi : i = 0, . . . , n} are second-order Markov,
(i)
equivalently Pπ
C.2,2
(dai , dbi |ai−1 , bi−1 ) = Pπ
C.2,2
i−1 π (dai , dbi |ai−1 i−2 , bi−2 ), P
C.2,2
(dbi |bi−1 ) = Pπ
C.2,2
i−1 (dbi |bi−2 ), i = 0, . . . , n.
(III.181)
This example illustrates that the dependence of the transmission cost function, for each i, on ai−2 in addition to symbols {ai−1 , ai } (i.e, the ones the channel depends on), implies the information structure of the optimal channel input conditional distribution is {ai−1 , ai−2 , bi−1 , bi−2 }, for i = 0, . . . , n, which is fundamentally different from the information structures of Example III.2, (a) (although the channels are identical). In the next example, we illustrate that for any channel distribution, which depends only on finite memory on present and past channel input symbols, our characterizations of FTFI capacity give tighter bounds, compared to the expression given in [20] (see (eqn(2), eqn(3), Theorem 1 in [20]). Example III.4. (Unit Memory Channel Input) Consider a channel of the form {Qi (dbi |ai , ai−1 ) : i = 0, 1, . . . , n}, called the Unit Memory Channel Input UMCI (without transmission). Since this channel is a special case of channel distributions of Class C, by Theorem III.4, we have the following.
28
The characterization of the FTFI capacity is given by the following expression. n dQ (·|A , A ) o n i i i−1 π C.1,1 log (B ) E CAFB.C.1,1 sup n →Bn = i C.1,1 ∑ C.1,1 Πiπ (·|Bi−1 ) (a |a ,b )∈M (A ):i=0,...,n i=0 π i
i
i−1
(III.182)
i
i−1
n
=
∑ I(Ai , Ai−1 ; Bi |Bi−1 ).
sup C.1,1
πi
(III.183)
(dai |ai−1 ,bi−1 )∈M (Ai ):i=0,...,n i=0
where (Ai , Bi ) : i = 0, . . . , n is jointly Markov and Bi : i = 0, . . . , n is Markov
(III.184)
and for i = 0, . . . , n the distributions are given by Ππi Pπ
C.1,1
C.1,1
(dbi |bi−1 ) =
Z Ai ×Ai−1
Qi (dbi |ai , ai−1 ) ⊗ πiC.1,1 (dai |ai−1 , bi−1 ) ⊗ Pπ
C.1,1
(dai , dai−1 , dbi , dbi−1 ) = Qi (dbi |ai , ai−1 ) ⊗ πiC.1,1 (dai |ai−1 , bi−1 ) ⊗ Pπ
(dai−1 |bi−1 ),
C.1,1
(dbi−1 |ai−1 ) ⊗ Pπ
(III.185) C.1,1
(dai−1 ).
(III.186)
◦ C.1,1 4 Clearly, the fact that the optimal channel input conditional distribution maximizing I(An → Bn ) occurs in P [0,n] = πiC.1,1 (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n implies the four letter expression given by (III.183). This in turn, simplifies considerably, any attempt to compute
the optimal capacity achieving channel input conditional distribution, for different types of channels (i.e., Gaussian, finite alphabet channels, etc.). We note that the above characterization of FTFI capacity, and hence its per unit time limiting version, the feedback capacity, is fundamentally different from the main results derived in [20] (see (eqn(2), eqn(3), Theorem 1 in [20]), where the authors state that, for the UMCI defined on finite alphabet spaces, the optimal channel input distribution occurs in A.1 4 P [0,n] = PAi |Ai−1 ,Bi−1 (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n , and that the corresponding formulae for CAFB n →Bn is given by n
KT CAFB,Y n →Bn =
i−1
∑ I(Ai , Ai−1 ; Bi |B
sup
).
(III.187)
P(dai |ai−1 ,bi−1 ):i=0,...,n i=0
KT FB,C.1,1 Clearly, CAFB,Y n →Bn ≤ CAn →Bn , this bound is achievable, and moreover (III.187) is much more difficult to compute, compared to
(III.183). A.1 4 A.L , with L = 1, is the In fact, by Theorem III.2, (a), it is clear that P [0,n] = PAi |Ai−1 ,Bi−1 (dai |ai−1 , bi−1 ) : i = 0, 1, . . . , n ≡ P[0,n] capacity achieving set of channel input conditional distributions for channel conditional distributions Qi (dbi |bi−1 , ai , ai−1 ) : i = 0, . . . , n , and not for channel distributions Qi (dbi |ai , ai−1 ) : i = 0, . . . , n .
Theorem III.4 settles a long standing question on the information structures of optimal channel input conditional distributions, for extremum problems of feedback capacity, since it holds for general alphabet spaces and arbitrary channel distributions. It provides tighter bounds on any achievable feedback codes, compared to other papers, which appeared in the literature.
3) Channel Class C with Transmission Costs Class A: Consider a channel distribution of Class C defined by (III.131) A (κ) defined by (III.76), and corresponding to a transmission cost function of and an average transmission cost constraint P[0,n] Class A, {γiA.N (aii−N , bi ) : i = 0, . . . , n}. We can repeat the derivation of Theorem III.4, (b), to obtain the following preliminary characterization of FTFI capacity. 4
CAFB,C,A n →Bn (κ) =
n
sup
∑
A (κ) i=0 Pi (dai |ai−1 ,bi−1 ):i=0,...,n ∈P[0,n]
=
sup
i−1 ),i=0,...,n: 1 Eπ πiA.I (dai |ai−1 i−I ,b n+1
A.I
Z
dQ (·|bi−1 , ai ) i i−M i−L (b ) PP (dbi , dai ) i dΠPi (·|bi−1 ) ( ) dQ (·|Bi−1 , Ai ) n i i−M i−L π A.I log (Bi ) A.I ∑E dΠπi (·|Bi−1 ) γ A.N (Ai ,Bi ) ≤κ i=0 log
∑ni=0 i
i−N
(III.188)
(III.189)
29
4
where I = max{L, N} and A.I
Ππi (dbi |bi−1 ) =
Z
A.I
i−1 i−1 i i−1 i−1 |b ), Qi (dbi |bi−1 ) ⊗ Pπ (dai−I i−M , ai−L ) ⊗ Pi (dai |ai−I , b A.I j j−1 j−1 A.I ) , i = 0, . . . , n. Pπ (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 j−M , a j−L ) ⊗ π j (da j |a j−I , b Aii−I
(III.190) (III.191)
i−1 i−1 However, we cannot go further to reduce the dependence of the channel input conditional distribution πiA.I (dai |ai−I , b ) on
bi−1 , and end up with an upper bound, which is achievable over a smaller set, because of the dependence of the transmission cost function of Class A, on bi−1 , for i = 0, . . . , n. That is, there will be a lack of consistency of the channel output conditional distribution (if it depends on any channel input distribution with less information structure) chosen via the variational equality, and the joint distribution resulting from the maximization using stochastic optimal control theory. This prevents us from showing i−1 ) : i = 0, . . . , n . achievability of upper bounds over any subset of {πiA.I (dai |ai−1 i−I , b 4) Channel Class C with Transmission Costs Class B: Consider a channel distribution of Class C defined by (III.131) and B (κ) defined by (III.121), and corresponding to a transmission cost function of an average transmission cost constraint P[0,n]
Class B, {γiB.K (ai , bii−K ) : i = 0, . . . , n}. Similarly as above, we can repeat the derivation of Theorem III.4, (b), to obtain the following preliminary characterization of FTFI capacity. n
4
CAFB,C,B n →Bn (κ) =
sup
Z
∑
B (κ) i=0 Pi (dai |ai−1 ,bi−1 ):i=0,...,n ∈P[0,n]
log
dQ (·|bi−1 , ai ) i i−M i−L (b ) PP (dbi , dai ) i dΠPi (·|bi−1 )
(III.192)
where ΠPi (dbi |bi−1 ) =
Z
i i−1 i−1 Qi (dbi |bi−1 , b ) ⊗ PP (dai−1 |bi−1 ), i = 0, . . . , n, i−M , ai−L ) ⊗ Pi (dai |a Ai j j−1 j−1 , b ) , i = 0, . . . , n. PP (dai , dbi ) = ⊗ij=0 Q j (db j |b j−1 j−M , a j−L ) ⊗ Pj (da j |a
(III.193) (III.194)
However, we cannot go further to reduce the dependence of the channel input conditional distribution Pi (dai |ai−1 , bi−1 ) on ai−1 , and end up with an upper bound which is achievable over a smaller set, because of the dependence of the transmission cost function of Class B, on ai−1 , for i = 0, . . . , n, for the same reasons discussed above, for channels of Class C and transmission costs of Class A. Hence, we cannot show achievability of any upper bound, via the variational equality, over any subset of B (κ). Pi (dai |ai−1 , bi−1 ) : i = 0, . . . , n ∈ P[0,n] The characterizations of FTFI capacity presented in this section cover many channel distributions and transmission cost functions of practical interest. We conclude with some comments.
Conclusion III.1. (Comments on information structures) (a) The characterizations of the FTFI capacity are analogous to the two-letter feedback capacity characterization of DMC and memoryless continuous alphabet channels. (b) The characterization of the feedback capacity, follow directly from the per unit time limit of the characterization of FTFI capacity. (c) For specific channel distributions and transmission cost functions, it is possible to derive closed form expressions for the optimal channel input conditional distributions, and expressions of the corresponding characterizations of FTFI capacity. (d) The optimal channel input distributions corresponding to the characterizations of FTFI can be found via dynamic programming, since for most of the channels and transmission cost functions, appropriately defined augmented processes are Markov processes.
IV. G ENERAL D ISCRETE -T IME R ECURSIVE C HANNEL M ODELS & G AUSSIAN L INEAR C HANNEL M ODELS WITH M EMORY In this section, we show the following.
30
(i)
Channel distributions (I.6)-(I.8), are induced by various nonlinear channel models (NCM), driven by arbitrary distributed noise processes. These include nonlinear and linear time-varying Autoregressive models, and nonlinear and linear channel models expressed in state space form [16].
(ii)
The optimal channel input conditional distributions of Multiple-Input Multiple Output (MIMO) Gaussian Linear Channel Models (G-LCM), driven by correlated Gaussian noise processes, which maximize directed information I(An → Bn ), are Gaussian.
Claim (i) illustrates that many of the existing channels investigated in the literature, for example, [17]–[24], induce channel distributions of Class A, B or C. Claim (ii) generalizes the Cover and Pombra [17] characterization (I.32) of feedback capacity of nonnstationary nonergodic Additive Gaussian channels driven by correlated noise.
A. General Discrete-Time Recursive Channels We show claim (i), by using the following preliminary definition of NCM. Definition IV.1. (Nonlinear channel models and transmission costs) (a) NCM-A. Nonlinear Channel Models A (NCM-A) are defined by nonlinear recursive models and transmission cost functions, as follows. −1 Bi = hAi (Bi−1 , Aii−L ,Vi ), B−1 = b−1 , A−1 −L = a−L , i = 0, . . . , n, 1 n A.N i E γi (Ai−N , Bi ) ≤ κ ∑ n + 1 i=0
(IV.195) (IV.196)
where {Vi : i = 0, 1, . . . , n} is the noise process, and the following assumption holds. Assumption A.(i). The alphabet spaces include any of the following. 4
4
4
Continuous Alphabets: Bi = R p , Ai = Rq , Vi = Rr , i = 0, 1, . . . , n; 4 4 4 Finite Alphabets: Bi = 1, . . . , p , Ai = 1, . . . , q}, Vi = 1, . . . , r , i = 0, 1, . . . , n;
(IV.198)
Combinations of Continuous and Discrete (Finite or Countable) Alphabets.
(IV.199)
(IV.197)
Assumption A.(ii). hAi : Bi−1 × Aii−L × Vi 7−→ Bi , γiA.N : Aii−N × Bi 7−→ Ai and hAi (·, ·, ·), γiA.N (·, ·) are measurable functions, for i = 0, 1, . . . , n; Assumption A.(iii). The noise process {Vi : i = 0, . . . , n} satisfies conditional independence condition PVi |V i−1 ,Ai (dvi |vi−1 , ai ) = PVi (dvi ) − a.a.(vi−1 , ai ), i = 0, . . . , n.
(IV.200)
Clearly, by (IV.200) the noise process distribution satisfies PV n (dvn ) = ⊗ni=0 PVi (dvi ), and the following consistency condition holds. n o P Bi ∈ Γ Bi−1 = bi−1 , Ai = ai = PVi Vi : hAi (bi−1 , aii−L ,Vi ) ∈ Γ , Γ ∈ B(Bi ) i−1
= Qi (Γ|b
, aii−L ),
i = 0, 1, . . . , n.
(IV.201) (IV.202) 4
There is no loss of generality to use the convention that transmission starts at time i = 0, and the initial data B−1 = −1 b−1 , A−1 −L = a−L are either specified or their distribution is fixed. Alternatively, we can assume no information is avail-
able for i ∈ {−1, −2, . . . , }, i.e., σ {B−1 , A−1 } = {Ω, 0}, / which then implies B0 = hA0 (A0 ,V0 ), B1 = hA1 (B0 , A1 , A0 ,V1 ), . . . , Bn = hAn (Bn−1 , . . . , B0 , An , . . . , A0 ,Vn ). (b) Nonlinear Channel Models A.B and B.A are as follows. (b.1) NCM-A.B. Nonlinear Channel Models A.B (NCM-A.B) correspond to nonlinear recursive models NCM-A, with γiA (Aii−N , Bi )
31
in (IV.195) replaced by γiB.K (Ai , Bii−K ), i = 0, . . . , n. (b.2) NCM-B.A. Nonlinear Channel Models B.A (NCM-B.A) correspond to nonlinear recursive models NCM-A, with hA (Bi−1 , Aii−L ,Vi ) i in (IV.195) replaced by hBi (Bi−1 i−M , A ,Vi ), i = 0, . . . , n.
The underlying assumptions for NCM-A.B and NC-M.B.A are the following. Assumption B. Assumptions A.(i)-A.(iii) hold with appropriate changes. Assumption B implies the NCM-A.B induces channel distribution {Qi (dbi |bi−1 , aii−L ) : i = 0, . . . , n and NCM-B.A induces i channel distribution {Qi (dbi |bi−1 i−M , a ) : i = 0, . . . , n (i.e., they satisfy a consistency conditions as in (IV.202)). (c) NCM-C. Nonlinear Channel Models C (NCM-C) are defined as follows. i −1 −1 −1 −1 Bi = hCi (Bi−1 i−M , Ai−L ,Vi ), B−M = b−M , A−L = a−L , i = 0, . . . , n, 1 n C.N,K i E γi (Ai−N , Bii−K ) ≤ κ ∑ n + 1 i=0
(IV.203) (IV.204)
where {Vi : i = 0, 1, . . . , n} is the noise process, and the following assumptions hold. Assumption C. Assumptions A.(i)-A.(iii) hold with appropriate changes. Similarly, as above, Assumption C implies the following consistency condition holds. n o i P Bi ∈ Γ Bi−1 = bi−1 , Ai = ai = PVi Vi : hCi (bi−1 i−M , ai−L ,Vi ) ∈ Γ , Γ ∈ B(Bi ) i−1 = Qi (Γ|bi−M , aii−L ), i = 0, 1, . . . , n.
(IV.205) (IV.206)
(d) NCM-D. Nonlinear Channel Models D (NCM-D) correspond to any one of NCM-A, NCM-A.B, NCM-B.A, NCM-C, with recursive function hD i (·, ·, ·) for D ∈ {A, B,C}, and correlated noise process {Vi : i = 0, . . . , n} having distribution, which satisfies the following condition. Assumption D. The noise process {Vi : i = 0, . . . , n} distribution satisfies conditional independence i−1 PVi |V i−1 ,Ai (dvi |vi−1 , ai ) = PV |V i−1 (dvi |vi−1 ), vi−1 ∈ vi−1 − a.a.(vi−1 , ai ), i = 0, . . . , n i−T , v i
(IV.207)
where T is nonnegative and finite. Next, we show that any NCM-D can be equivalently transformed, under mild conditions, to one of the NCMs, NCM-A, NCMA.B, NCM-B.A, or NCM-C. which then implies channel distributions of Class A, B, C are sufficient to deal with NCMs, driven by noise processes, which are not necessarily independent. Theorem IV.1. (NCM driven by arbitrary noise processes) Any NCM-D for which the inverse of the map i−1
vi ∈ Vi 7−→ hD i (b
i−1
, ai , vi ), ai ∈ {ai , aii−L }, b
exists and it measurable (i.e., the inverse is gD i (bi , b
i−1
∈ {bi−1 , bi−1 i−M }, i = 0, . . . , n
(IV.208)
, ai ), for i = 0, . . . , n), induces either a channel distribution {Qi (dbi |bi−1 , ai ) :
i = 0, . . . , n} or a channel distribution induced by one of the channel models NCM-A, NCM-B.A, or NCM-C, with parameters {M, L} replaced by some {M 0 , L0 } (i.e., corresponding to a channel distribution of Class A, B, or C). i−1 i Proof: First, suppose the NCM-D, corresponds to NCM-C, i.e., Bi = hD i (Bi−M , Ai−L ,Vi ), i = 0, . . . , n, where the noise process i i is arbitrary distributed, satisfying Assumption D. By the hypothesis of the invertibility of the maps hD i (bi−M , ai−L , ·), i = 0, . . . , n,
32
and Assumption D, the following identities hold7 . o n i−1 i P Bi ∈ dbi Bi−1 = bi−1 , Ai = ai =PVi |Bi−1 ,Ai ,V i−1 Vi : hD i (bi−M , ai−L ,Vi ) ∈ dbi i−1 i =PVi |V i−1 ,Ai Vi : hD i (bi−M , ai−L ,Vi ) ∈ dbi i−1 i−1 i i−1 (b , a ,V ) ∈ db , V ∈ {V i−1 ,Vi−T } by (IV.207) =PV |V i−1 Vi : hD i i i i−M i−L i
i−2
i i−1 = Qi (dbi |bi−1 ), vi−1 = gi (bi−1 , bi−1−M , ai−1 i−M , ai−L , v i−1−L ), i = 0, 1, . . . , n
(IV.209) (IV.210) (IV.211) (IV.212)
i i−1 ) : i = 0, . . . , n}, which depends on the distribution of the noise {V : i = for some conditional distribution {Qi (·|bi−1 i i−M , ai−L , v i−1 }, from (IV.212) either vi−1 = vi−1 , which implies vi−1 is specified by (bi−1 , ai−1 ), and hence 0, . . . , n}. Since vi−1 ∈ {vi−1 i−T , v
the distribution (IV.212) depends on (bi−1 , ai ), or vi−1 = vi−1 i−T , which implies the distribution (IV.212) is limited memory and i 0 0 depends on (bi−1 i−M 0 , ai−L0 ), for some nonnegative M , L . Hence, any such NCM-D can be equivalently transformed to a channel
distribution of Class C or a NCM-C. Similar conclusions also hold for the rest of NCM-D, having recursions corresponding to NCM-A and NCM-B.A, with arbitrary distributed noise {Vi : i = 0, . . . , n} satisfying Assumption D (i.e., they can be equivalently transformed to a channel distribution of Class A or B). This completes the prove. The above theorem illustrates that, under relaxed assumptions, there is no loss of generality to consider NCM driven by an independent noise process. Next, we present an example which illustrates the previous Theorem. Example IV.1. (Information structures of recursive models driven by correlated noise) Consider a NCM-D described by Bi =hi (Bi−1 , Ai ) + σ (Bi−1 , Ai )Vi , B−1 = b−1 , i = 0, . . . , n,
1 n ∑ E γi (Ai , Bi−1 ) ≤ κ n + 1 i=0
(IV.213)
with appropriate initial conditions, where {Vi : i = 0, 1, . . . , n} satisfies Assumption D, with T = 1, i.e., PVi |V i−1 ,Ai (dvi |vi−1 , ai ) = PVi |Vi−1 (dvi |vi−1 ) − a.a.(vi−1 , ai ), i = 0, . . . , n
(IV.214)
the maps are measurable, and the following inverse maps exist and they are measurable. −1 vi = gi (bi , bi−1 , ai ) ≡ σ (bi−1 , ai ) bi − hi (bi−1 , ai ) , i = 0, . . . , n.
(IV.215)
Then the induced channel distribution is obtained as follows. n o n o P Bi ≤ bi Bi−1 = bi−1 , Ai = ai =P Bi ≤ bi Bi−1 = bi−1 , Ai = ai ,V i−1 = vi−1 (IV.216) =PVi |Bi−1 ,Ai ,V i−1 Vi : σ (bi−1 , ai )Vi ≤ bi − hi (bi−1 , ai ) (IV.217) =PVi |V i−1 ,Ai Vi : σ (bi−1 , ai )Vi ≤ bi − hi (bi−1 , ai ) , bi−1 is specified by (ai−1 , vi−1 ) (IV.218) =PVi |Vi−1 Vi : σ (bi−1 , ai )Vi ≤ bi − hi (bi−1 , ai ) by (IV.214) (IV.219) =Qi (−∞, bi ] bi−1 , ai , vi−1 and vi−1 = gi−1 (bi−1 , bi−2 , ai−1 ) (IV.220) i−1 i (IV.221) =Qi (−∞, bi ] ≤ bi bi−2 , ai−1 , i = 0, . . . , n n o i for some channel distribution Qi (−∞, bi ] bi−1 i−2 , ai−1 : i = 0, . . . , n , which is determined from the distribution of the channel noise {Vi : i = 0, . . . , n} given by (IV.214). By (IV.221), we deduce that the induced channel distribution is of Class C, which i for each i depends on symbols {bi−1 i−2 , ai−1 }, for i = 0, . . . , n.
7 Since the maps v ∈ V 7−→ hA (bi−1 , ai , v ), i = 0, . . . , n are invertible and measurable, with inverse g (b , bi−1 , ai ), then knowledge of Bi−1 = i i i i i−M i−L i i−M i−L i bi−1 , Ai−1 = ai−1 specifies V i−1 = vi−1 , for i = 0, . . . , n.
33
Special Case. Suppose {σi (bi−1 , ai ) = I : i = 0, . . . , n} (i.e., identity matrix), and the noise is Gaussian with density fVi |Vi−1 (dvi |vi−1 ) ∼ N Θi,i−1 vi−1 , Σi,i−1 , i = 0, . . . , n (IV.222) where (Θi,i−1 , Σi,i−1 ) : i = 0, . . . , n} are deterministic matrices. Then the equivalent channel is given by the following recursion. (IV.223) Bi = hi (Bi−1 , Ai ) + Θi,i−1 Bi−1 − hi−1 (Bi−2 , Ai−1 ) +Wi ≡ hi (Bi−1 , Bi−2 , Ai , Ai−1 ) +Wi , i = 0, . . . , n, where {Wi ∼ N(0, Σi,i−1 ) : i = 0, . . . , n} is an independent process. Hence, the optimal channel input distribution, which maximizing directed information I(An → Bn ) is of the form PAi |A ,Bi−1 : i = 0, . . . , n . i−1
i−2
B. Multiple Input Multiple Output Gaussian Linear Channel Models with Memory We show claim (ii), by considering the following Gaussian-LCM-C, which is a degenerate version of the NCM-D of Example IV.18 . M
Bi =
L
∑ Ci,i− j Bi− j + ∑ Di,i− j Ai− j +Vi ,
j=1
−1 −1 −1 B−1 −M = b−M , A−L = a−L , i = 0, . . . , n,
(IV.224)
j=0
i ≡ CM (i)Bi−1 i−M + DL (i)Ai−L +Vi , o n C.L,M i 4 1 n n i i i i i E γ (A , B ) = E hA , R (i)A i + hB , Q (i)B i ≤ κ, L M ∑ i ∑ i−L i−M i−L i−L i−M i−M n + 1 i=0 i=0 4
(IV.225)
RL (i) = RTL (i) 0 ∈ R(L+1)q×(L+1)q , QM (i) = QTM (i) 0 ∈ R(M+1)p×(M+1)p , J = max{L, M},
(IV.226)
Assumption A.(iii) given by (IV.200) holds and Vi ∼ N(0, KVi ), for i = 0, . . . , n.
(IV.227)
By Assumption A.(iii), the channel distribution is Gaussian given by n o i P Bi ≤ bi |Bi−1 = bi−1 , Ai = ai ∼ N(CM (i) bi−1 i−M + DL (i) ai−L , KVi ), i = 0, 1, . . . , n
(IV.228)
By Theorem III.4, we directly obtain that the optimal channel input conditional distribution occurs in the following set. o n ◦ C.L,J 1 n πC.L,J C.L,M i 4 C.L,J i i−1 E γ (A , B ) ≤ κ (IV.229) (dai |ai−1 , b ), i = 0, . . . , n : P [0,n] (κ) = πi ∑ i−L i−M i−L i−J i n + 1 i=0 and that the characterization of FTFI capacity is given by the following expression. n n o i−1 CAFB,C.L,J sup H(B |B ) − H(V n ) n →Bn (κ) = i ∑ i−J ◦ C.L,J P [0,n] (κ)
where n o Z i−1 P Bi ≤ bi |Bi−1 = b i−J i−J =
Aii−L
(IV.230)
i=0
n o C.L,J C.L,J i i−1 i−1 i−1 P Vi ≤ bi −CM (i) bi−1 (dai |ai−L , bi−J ) ⊗ Pπ (dai−1 i−M + DL (i) ai−L ⊗ πi i−L |bi−J ), i = 0, 1, . . . , n. (IV.231)
Next, we show that the optimal channel input distribution satisfying the average transmission n o n cost constraint is Gaussian, oi.e., g C.L,J i−1 i−1 πi (dai |ai−L , bi−J ) = P i−1 i−1 : i = 0, . . . , n , and the corresponding joint process (Ai , Bi ) = (Agi , Bgi ) : i = 0, . . . , n is jointly Gaussian.
Ai |Ai−L ,Bi−J
Theorem IV.2. (MIMO Gaussian LCM-C) Consider the G-LCM-C defined by (IV.224)-(IV.227). Then the following hold. n g,C.L,J i−1 i−1 i−1 (a) The optimal channel input conditional distribution is Gaussian distributed, denoted by πiC.L,J (·|ai−1 (·|ai−L , bi−J ) : i−L , bi−J ) = πi 8L
= M = 0 corresponds to a memoryless channel.
34
o o n i = 0, . . . , n , and the corresponding joint process is jointly Gaussian distributed, denoted by (Ai , Bi ) = (Agi , Bgi ) : i = 0, . . . , n . Moreover, the characterization of FTFI capacity is given by CAFB,G−C.L,J (κ) = n →Bn
n
sup ◦ G−C.L,J (κ) P [0,n]
n
o
g,i−1 ) ∑ H(Bgi |Bi−J
− H(V n )
(IV.232)
i=0
where n o ◦ G−C.L,J 1 n π g,C.L,J C.L,M g,i g,i 4 i−1 , b ), i = 0, . . . , n : (κ) = πig,C.L,J (dai |ai−1 ) ≤ κ , B γ (A E P [0,n] ∑ i−L i−J i−L i−M i n + 1 i=0 o Z n o n i−1 i P Vi ≤ bi −CM (i) bi−1 P Bgi ≤ bi |Bg,i−1 i−M + DL (i) ai−L i−J = bi−J =
(IV.233)
Aii−L
g i−1 ⊗ πig,C.L,J (dai |ai−1 i−L , bi−J ) ⊗ P i−1
Ai−L |Bi−1 i−J
i−1 (dai−1 i−L |bi−J ), i = 0, 1, . . . , n.
(IV.234)
(b) An equivalent FTFI characterization is given by the following expressions. Agi =
J
L
j=1
j=1
∑ Γi,i− j Bgi− j + ∑ Λi,i− j Agi− j + Zi ,
i = 0, 1, . . . , n
g,i−1 = ΓJ (i)Bg,i−1 i−J + ΛL (i)Ai−L + Zi ,
Bgi =
M
L
j=1
j=0
∑ Ci,i− j Bgi− j + ∑ Di,i− j Agi− j +Vi ,
(IV.235) (IV.236)
i = 0, . . . , n,
(IV.237)
g,i = CM (i)Bg,i−1 i−M + DL (i)Ai−L +Vi i) Zi is independent of Ag,i−1 , Bg,i−1 , i = 0, . . . , n,
(IV.238)
ii) Z i is independent of V i , i = 0, . . . , n n o iii) Zi ∼ N(0, KZi ), KZi 0 : i = 0, 1, . . . , n is an independent Gaussian process
(IV.240)
4
(IV.241)
n
CAFB,G−C.L,J (κ) = n →Bn IL−G−C.L,J E[0,n] (κ) =
(IV.239)
sup
IL−G.C.L,J
ΓJ (i),ΛL (i),KZi ,i=0,...,n ∈E[0,n]
n
ΓJ (i), ΛL (i), KZi , i = 0, . . . , n :
n ∑ H(Bgi |Bg,i−1 i−J ) − H(V )
(IV.242)
(κ) i=0
o 1 n C.L,M g,i g,i E γ (A , B ) ≤ κ , ∑ i i−L i−M n + 1 i=0
4
g,i g,i g,i g,i g,i γiC.L,M (Ag,i i−L , Bi−M ) = hAi−L , RL (i)Ai−L i + hBi−M , QM (i)Bi−M i, i = 0, . . . , n.
(IV.243) (IV.244)
Proof: (a) We can show that the optimal channel input distribution satisfying the average transmission cost constraint is Gaussian, by various ways. One way is to invoke the maximum entropy property of Gaussian distributions, as done in Cover i−1 and Pombra [17], for the channel model (I.30) satisfying (I.31), to upper bound the term ∑ni=0 H(Bi |Bi−J ) = H(Bn ) in the right 4
hand side of (IV.230) by the inequality H(Bnn−J ) ≤ H(Bg,n ), where Bg,n = {Bgi : i = 0, 1, . . . , n} is jointly Gaussian distributed, and the constraint is satisfied, and then to show achievability, when the process {Agi : i = 0, . . . , n} is Gaussian. An alternative, way is to use the distribution of the channel model given by (IV.228) and the distribution (IV.231), as follows. By (IV.231) the output process is jointly Gaussian if and only if Ai , Bi ,Vi : i = 0, . . . , n are jointly Gaussian. The upper bound is achieved g i−1 if and only if the channel input distribution is Gaussian, denoted by {πi (dai |ai−1 i−L , bi−J ) ≡ PAi |Ai−1 ,Bi−1 : i = 0, 1, . . . , n}, having i−L i−J g,i−1 conditional mean which is a linear combination of (Ag,i−1 , B ) : i = 0, . . . , n , conditional covariance which is independent i−L i−J
of the channel input and output processes, and the average transmission cost is satisfied. Hence, the upper bound is achieved if g i−1 and only if the channel output conditional distribution denoted by {P{Bi ≤ bi |bi−1 i−J } ≡ PBi |Bi−1 (dbi |bi−J ) : i = 0, 1, . . . , n} is also g i−J Gaussian, with conditional mean which is a linear combination of Bi : i = 0, . . . , n − 1 and conditional covariance, which is
independent of the channel output process. Finally, by the linearity of the model (IV.224), the upper bound is achieved if and only if the channel input process is Gaussian, denoted by Agi : i = 0, . . . , n and the constraint is satisfied. Hence, we obtain the characterization of the FTFI capacity given by (IV.232)-(IV.234). (b) Clearly, (IV.235) follows from (a) and the information
35
n structure of the maximizing channel input distribution, Pg
i−1 i−1 ,Bi−J Ai |Ai−L
o : i = 0, 1, . . . , n . The independence properties i)-iii) follows
from follow from Assumption A.(iii) given by (IV.200) and fact that Vi ∼ N(0, KVi ), for i = 0, . . . , n. is an independent process.
In the next remark we relate the above theorem to the Cover and Pombra [17] characterization, and illustrate some of the fundamental differences. Remark IV.1. (MIMO LCM and Relation to Cover and Pombra [17]) −1 −1 −1 (a) By Theorem IV.2, and assuming, without loss of generality the initial data are B−1 −M = b−M = 0, A−L = a−L = 0, we can
express the decomposition (IV.235) in terms of the channel noise and the process noise {(Vi , Zi ) : i = 0, . . . , n}, by simple recursive substitution, as follows. i−1
i
j=0
j=0
Agi = ∑ Γi,i− jV j + ∑ ∆i,i− j Z j , Ag0 = ∆0,0 Z0 , i−1
4
= ∑ Γi,i− jV j +Ui , Ui = j=0
i = 0, 1, . . . , n,
(IV.245)
i
(IV.246)
∑ ∆i,i− j Z j
j=0
for appropriately chosen matrices {Γi,i− j : j = 0, . . . , i − 1}, {∆i,i− j : j = 0, . . . , i}, i = 0, . . . , n. Clearly, in (IV.245) the process {Zi : i = 0, . . . , n} is an orthogonal or independent innovations process, while in the alternative equivalent expression (IV.246), the process {Ui : i = 0, . . . , n} is not an independent innovations process. In fact, the realization of the process {Agi : i = 0, . . . , n} given by (IV.246), in terms of {Ui : i = 0, . . . , n}, is analogous to the realization derived by Cover and Pombra [17] (see (I.30)-(I.33)), where the process {Zi : i = 0, . . . , n} is not an orthogonal process. Since the objective is to compute CAFB,G−C.L,J (κ) given by (IV.242) explicitly, then decomposition (IV.235), where the process n →Bn {Zi : i = 0, . . . , n} is an orthogonal process, is much simpler to analyze, compared to decomposition (IV.246), where {Ui : i = 0, . . . , n} is correlated. This is possibly one of the main reason, which prevented many of the past attempts to solve the Cover and Pombra [17] characterization explicitly, or any of its variants [31], [32], without any assumptions of stationarity. (b) By combining Example IV.1, specifically (IV.222) and (IV.223), and Theorem IV.2, it is clear that any G-LCM of the form o 1 n n E hAi , Ri Ai i + hBi−1 , Qi,i−1 Bi−1 i ≤ κ, (IV.247) Bi = Ai +Vi , i = 0, . . . , n, ∑ n + 1 i=0 Vi ∼ N Θi,i−1 vi−1 , Σi,i−1 , Ri = RTi 0, Qi,i−1 = QTi,i−1 0, i = 0, . . . , n (IV.248) is equivalent to the channel model Bi = Ai + Θi,i−1 Bi−1 − Ai−1 +Wi , B−1 = b−1 , A−1 = a−1 , i = 0, . . . , n
(IV.249)
where {Wi ∼ N(0, Σi,i−1 ) : i = 0, . . . , n} is an independent process (also another convention can be used instead of B−1 = b−1 , A−1 = a−1 ). Hence, the optimal channel input distribution, which maximizing directed information I(An → Bn ) is Gaussian of the form PgAi |A ,B : i = 0, . . . , n . Such distributions can be realized as follows. i−1
i−1
Agi = Γi,i−1 Bgi−1 + ∆i,i−1 Agi−1 + Zi , i = 0, . . . , n, i) Zi is independent of Ag,i−1 , Bg,i−1 , i = 0, . . . , n,
(IV.250) (IV.251)
ii) Z i is independent of V i , i = 0, . . . , n, n o iii) Zi ∼ N(0, KZi ) : i = 0, 1, . . . , n is an independent Gaussian process.
(IV.252) (IV.253)
Moreover, if we use the convention B−1 = 0, A−1 = 0, by recursive substitution we obtain the alternative realization Agi =
i−1
∑ Γi,i− jW j +Ui ,
j=0
4
Ag0 = ∆0,0 Z0 , Ui =
i
∑ ∆i,i− j Z j ,
i = 0, . . . , n
(IV.254)
j=0
for some matrices (Γi,i− j ∆i,i− j ), where {Ui : i = 0, . . . , n} is a correlated Gaussian process and {Ui : i = 0, . . . , n} is independent of
36
{Vi : i = 0, . . . , n} Clearly, (IV.254) generalizes the realization obtained by Cover and Pombra [17], to nonstationary, nonergodic noise process with unit memory. However, the above realization gives additional insight compared to the realization of the optimal channel input distribution derived in [17]. Since the above method can be repeated for any nonstationary nonergodic noise process with arbitrary memory, then the material of this section, generalize the characterization of Cover and Pombra to channels with arbitrary memory on past channel outputs, present and past channel inputs, and channel noise with arbitrary memory. Moreover, they also generalize the characterization obtained by Kim [32], where the author assumed the AGN channel Bi = Ai +Vi , with stationary and ergodic noise {Vi : i = 0, . . . , n}, generated by a limited memory autoregressive model. Finally, we note that although, the emphasis is to illustrate applications in MIMO G-LCM, the methodology applies to arbitrary channel models, irrespectively of the type of alphabet spaces and channel noise distributions.
V. ACHIEVABILITY

Many existing coding theorems, found in [8], [11]–[13], [18], [19], [32], [36], are either directly applicable or can be generalized to show that the per unit time limiting versions of the characterizations of FTFI capacity correspond to feedback capacity, under appropriate conditions. Next, we provide a short elaboration on the technical issues which need to be resolved in order to ensure, under relaxed conditions (i.e., without imposing stationarity or ergodicity), that the per unit time limiting versions of the characterizations of FTFI capacity correspond to the supremum of all achievable feedback code rates.

It is straightforward to conclude that the characterizations of FTFI capacity give tight bounds on any achievable code rate (of feedback codes). Via these tight bounds, the direct part of the coding theorem can be shown by investigating the per unit time limit of the characterizations of FTFI capacity, without unnecessary a priori assumptions on the channel, such as stationarity, ergodicity, or information stability of the joint process {(A_i, B_i) : i = 0, 1, . . .}. Further, through the characterizations of FTFI capacity, several hidden properties of how the optimal channel input conditional distributions affect the channel output transition probability distribution can be identified.

Next, we state the fundamental conditions needed to make the transition to the per unit time limiting versions of the characterizations of FTFI capacity, and to give an operational meaning to these characterizations.

(C1) For any source process {X_i : i = 0, 1, . . .} to be encoded and transmitted over the channel, the following conditional independence [2] is satisfied:

P_{B_i|B^{i−1}, A^i, X^k} = P_{B_i|B^{i−1}, A^i},   ∀ k ∈ {0, 1, . . . , n},  i = 0, . . . , n.   (V.255)
As pointed out by Massey [2], the conditional independence condition (V.255) is a necessary condition for directed information I(A^n → B^n) to give a tight upper bound on the information conveyed by the source to the channel output (Theorem 3 in [2]); moreover, directed information reduces to mutual information in the absence of feedback, that is, if P_{A_i|A^{i−1}, B^{i−1}} = P_{A_i|A^{i−1}}, i = 0, . . . , n, then I(A^n → B^n) = I(A^n; B^n).

(C2) For any of the channels and transmission cost functions investigated, there exist channel input conditional distributions, denoted by {π^*_i(da_i|I^P_i) : i = 0, . . . , n} ∈ P_{[0,n]}(κ) (if a transmission cost is imposed), which achieve the supremum of the characterizations of FTFI capacity, and their per unit time limits are finite.

For the converse part of the channel coding theorem, existence (i.e., (C2)) is necessary, because the converse is often shown by invoking Fano's inequality, which requires finiteness of lim inf_{n→∞} (1/(n+1)) C^{FB}_{A^n→B^n}(κ). Similarly, the direct part of the coding theorem is often shown by generating channel codes according to the channel input distribution which achieves lim inf_{n→∞} (1/(n+1)) C^{FB}_{A^n→B^n}(κ).
Hence, the derivation of coding theorems presupposes existence of optimal channel input distributions and finiteness of the limiting expression. Since, for continuous and countable alphabet spaces {(A_i, B_i) : i = 0, . . . , n}, information theoretic measures are not necessarily continuous functions on the space of distributions [37], and directed information is only lower semicontinuous as a functional
of the channel input conditional distributions {P_{A_i|A^{i−1}, B^{i−1}} : i = 0, . . . , n} ∈ P_{[0,n]}, sufficient conditions for continuity of directed information should be identified. Such conditions are given in [27].

(C3) The optimal channel input distributions {π^*_i(da_i|I^P_i) : i = 0, 1, . . . , n} ∈ P_{[0,n]}(κ), which achieve the supremum of the characterizations of FTFI capacity, induce stability, in the sense of Dobrushin [4], of the directed information density, that is,

lim_{n→∞} P^{π^*}{ (A^n, B^n) ∈ A^n × B^n : (1/(n+1)) | i^{π^*}(A^n, B^n) − E^{π^*}[ i^{π^*}(A^n, B^n) ] | > ε } = 0,   (V.256)

and stability of the transmission cost constraint, that is,

lim_{n→∞} P^{π^*}{ (A^n, B^n) ∈ A^n × B^n : (1/(n+1)) | ∑_{i=0}^{n} γ_i(T^i A^n, T^i B^n) − E^{π^*}[ ∑_{i=0}^{n} γ_i(T^i A^n, T^i B^n) ] | > ε } = 0.   (V.257)
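For a feel of what (V.257) asks, the sketch below Monte Carlo estimates, for an illustrative stable AR(1) input with quadratic cost γ_i(a_i) = |a_i|², the probability that the per unit time sample cost deviates from its expectation by more than ε, and shows it shrinking with n; the input model, the cost, and all parameters are assumptions made purely to illustrate the concentration requirement, not the optimal input distributions of the paper.

```python
import numpy as np

def cost_deviation_prob(n, eps, num_runs, delta=0.5, K_Z=1.0, rng=None):
    """Estimate P( |(1/(n+1)) sum_i A_i^2 - E[(1/(n+1)) sum_i A_i^2]| > eps )
    for a stable AR(1) input A_i = delta*A_{i-1} + Z_i, Z_i ~ N(0, K_Z) i.i.d.,
    i.e. the kind of event whose probability must vanish in (V.257)."""
    rng = rng or np.random.default_rng()
    avg_costs = np.empty(num_runs)
    for r in range(num_runs):
        a, total = 0.0, 0.0
        for i in range(n + 1):
            a = delta * a + rng.normal(0.0, np.sqrt(K_Z))
            total += a ** 2                      # instantaneous cost gamma_i = |A_i|^2
        avg_costs[r] = total / (n + 1)           # per unit time sample cost
    mean_cost = avg_costs.mean()                 # Monte Carlo proxy for the expectation
    return np.mean(np.abs(avg_costs - mean_cost) > eps)

rng = np.random.default_rng(1)
for n in (100, 1_000, 10_000):
    print(n, cost_deviation_prob(n, eps=0.1, num_runs=200, rng=rng))
```

The printed probabilities decrease toward zero as n grows, which is precisely the behavior condition (V.257) postulates for the cost sample paths induced by the optimal input distributions.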
For example, for any channel distribution of Class C, and any transmission cost of Class C, the directed information density is

i^{π^*}(A^n, B^n) ≡ i^{π^{*,C.L,J}}(A^n, B^n) ≜ ∑_{i=0}^{n} log ( Q_i(·|B^{i−1}_{i−M}, A^i_{i−L}) / Π^{π^{*,C.L,J}}_i(·|B^{i−1}_{i−J}) )(B_i),   (V.258)

and similarly for the rest of the characterizations of FTFI capacity derived in the paper.
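To give a concrete feel for the density (V.258), the following self-contained sketch evaluates the analogous log-ratio sum for a toy scalar model, B_i = A_i + V_i with feedback input A_i = γ B_{i−1} + Z_i, where both the channel kernel and the induced output kernel are Gaussian and hence available in closed form; the model, the parameter values, and the function names are illustrative assumptions, not the Class C channel or its optimal input distribution.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """log N(x; mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def directed_info_density(n, gamma, K_Z, K_V, rng):
    """One sample of the per unit time directed information density, the analogue
    of (V.258), for the toy model B_i = A_i + V_i, A_i = gamma*B_{i-1} + Z_i.
    Channel kernel:  Q(b_i | a_i)      = N(b_i; a_i, K_V)
    Output kernel:   Pi(b_i | b_{i-1}) = N(b_i; gamma*b_{i-1}, K_Z + K_V)."""
    b_prev, density = 0.0, 0.0
    for i in range(n + 1):
        Z = rng.normal(0.0, np.sqrt(K_Z))
        V = rng.normal(0.0, np.sqrt(K_V))
        a = gamma * b_prev + Z                   # feedback input
        b = a + V                                # memoryless Gaussian channel
        density += (gaussian_logpdf(b, a, K_V)
                    - gaussian_logpdf(b, gamma * b_prev, K_Z + K_V))
        b_prev = b
    return density / (n + 1)

rng = np.random.default_rng(2)
samples = np.array([directed_info_density(2_000, gamma=0.4, K_Z=1.0, K_V=1.0, rng=rng)
                    for _ in range(100)])
print(samples.mean(), samples.std(), 0.5 * np.log(1 + 1.0 / 1.0))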
The important research question of showing (V.256) and (V.257) requires extensive analysis, especially for abstract (i.e., continuous) alphabet spaces, and this is beyond the scope of this paper. Condition (C1) implies the well-known data processing inequality, while condition (C2) implies existence of the optimal channel input distribution and finiteness of the corresponding characterizations of FTFI capacity and their per unit time limits. Condition (C3) is sufficient to ensure the AEP holds, and hence standard random coding arguments apply (i.e., following Ihara [8], by replacing the information density of mutual information with that of directed information). Finally, we note that, for specific application examples, it is possible to invoke the characterizations of FTFI capacity derived in this paper to compute the expressions of error exponents derived in [13], and to establish coding theorems via this alternative direction.

VI. CONCLUSION

We derived structural properties of optimal channel input conditional distributions, which maximize directed information from channel input RVs to channel output RVs, for general channel distributions with memory, with and without transmission cost constraints. These structural properties generalize the structural properties of Memoryless Channels with feedback, and Shannon's two-letter characterization of channel capacity, to channels with memory. We applied one of the main theorems to recursive Multiple Input Multiple Output Gaussian Linear Channel Models, with limited memory on channel input and output sequences, under general transmission cost constraints, and derived the corresponding characterization of FTFI capacity. The feedback capacity can be obtained via its per unit time limiting version and standard results from ergodic Markov Decision theory.

In future work, it is of interest to understand the role of feedback in controlling the channel output process, to derive, for specific channel models, closed form expressions for the characterizations of FTFI capacity and feedback capacity, and to determine whether feedback increases capacity, and by how much. Whether the methodology of this paper can be applied to extremum problems of network information theory, to identify information structures of optimal distributions and achievable upper bounds, remains a subject for further research.

REFERENCES

[1] H. Marko, "The bidirectional communication theory–A generalization of information theory," IEEE Transactions on Communications, vol. 21, no. 12, pp. 1345–1351, Dec. 1973.
[2] J. L. Massey, "Causality, feedback and directed information," in International Symposium on Information Theory and its Applications (ISITA '90), Nov. 27–30 1990, pp. 303–305.
[3] R. L. Dobrushin, "General formulation of Shannon's main theorem of information theory," Usp. Math. Nauk., vol. 14, pp. 3–104, 1959, translated in Am. Math. Soc. Trans., 33:323–438.
[4] M. Pinsker, Information and Information Stability of Random Variables and Processes. Holden-Day Inc, San Francisco, 1964, translated by Amiel Feinstein.
[5] R. G. Gallager, Information Theory and Reliable Communication. John Wiley & Sons, Inc., New York, 1968.
[6] R. E. Blahut, Principles and Practice of Information Theory, ser. in Electrical and Computer Engineering. Reading, MA: Addison-Wesley Publishing Company, 1987.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.
[8] S. Ihara, Information Theory for Continuous Systems. World Scientific, 1993.
[9] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Transactions on Information Theory, vol. 40, no. 4, pp. 1147–1157, July 1994.
[10] T. S. Han, Information-Spectrum Methods in Information Theory, 2nd ed. Springer-Verlag, Berlin, Heidelberg, New York, 2003.
[11] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Transactions on Information Theory, vol. 49, no. 1, pp. 4–21, Jan. 2003.
[12] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Transactions on Information Theory, vol. 55, no. 1, pp. 323–349, Jan. 2009.
[13] H. Permuter, T. Weissman, and A. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Transactions on Information Theory, vol. 55, no. 2, pp. 644–662, Feb. 2009.
[14] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, December 2011.
[15] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, 1986.
[16] P. E. Caines, Linear Stochastic Systems, ser. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1988.
[17] T. Cover and S. Pombra, "Gaussian feedback capacity," IEEE Transactions on Information Theory, vol. 35, no. 1, pp. 37–43, Jan. 1989.
[18] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Transactions on Information Theory, vol. 54, no. 4, pp. 1488–1499, 2008.
[19] J. Chen and T. Berger, "The capacity of finite-state Markov channels with feedback," IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 780–798, March 2005.
[20] S. Yang, A. Kavcic, and S. Tatikonda, "Feedback capacity of finite-state machine channels," IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 799–810, March 2005.
[21] H. Permuter, P. Cuff, B. Van Roy, and T. Weissman, "Capacity of the trapdoor channel with feedback," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3150–3165, July 2008.
[22] H. Permuter, H. Asnani, and T. Weissman, "Capacity of a POST channel with and without feedback," IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6041–6057, Oct. 2014.
[23] O. Elishco and H. Permuter, "Capacity and coding of the Ising channel with feedback," IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5138–5149, June 2014.
[24] C. Kourtellaris and C. Charalambous, "Capacity of binary state symmetric channel with and without feedback and transmission cost," in IEEE Information Theory Workshop (ITW), May 2015.
[25] C. Kourtellaris, C. Charalambous, and J. Boutros, "Nonanticipative transmission of sources and channels with memory," in IEEE International Symposium on Information Theory (ISIT), 2015.
[26] O. Hernandez-Lerma and J. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, ser. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer Verlag, 1996, no. v. 1.
[27] C. D. Charalambous and P. A. Stavrou, "Directed information on abstract spaces: Properties and variational equalities," submitted to IEEE Transactions on Information Theory, 2013. [Online]. Available: http://arxiv.org/abs/1302.3971
[28] C. Kourtellaris and C. D. Charalambous, "Information structures of capacity achieving distributions for feedback channels with memory and transmission cost: Stochastic optimal control & variational equalities–Part I," submitted to IEEE Transactions on Information Theory, November 2015. [Online]. Available: http://arxiv.org/abs/1512.04514
[29] R. L. Dobrushin, "Information transmission in a channel with feedback," Theory of Probability and its Applications, vol. 3, no. 2, pp. 367–383, 1958.
[30] P. M. Ebert, "The capacity of the Gaussian channel with feedback," Bell Sys. Tech. Journal, pp. 1705–1712, October 1970.
[31] S. Yang, A. Kavcic, and S. Tatikonda, "On feedback capacity of power-constrained Gaussian noise channels with memory," IEEE Transactions on Information Theory, vol. 53, no. 3, pp. 929–954, March 2007.
[32] Y.-H. Kim, "Feedback capacity of stationary Gaussian channels," IEEE Transactions on Information Theory, vol. 56, no. 1, pp. 57–85, 2010.
[33] C. D. Charalambous and P. A. Stavrou, "Directed information on abstract spaces: properties and extremum problems," in IEEE International Symposium on Information Theory (ISIT), July 1–6 2012, pp. 518–522.
[34] N. Dunford and J. T. Schwartz, Linear Operators Part I: General Theory. John Wiley & Sons, Inc., Hoboken, New Jersey, 1988.
[35] M. Sion, "On general minimax theorems," Pacific Journal of Mathematics, vol. 8, pp. 171–176, 1958.
[36] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), December 1998.
[37] S.-W. Ho and R. Yeung, "On the discontinuity of the Shannon information measures," IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5362–5374, Dec. 2009.