Iterative Scaling Algorithm for Channels

arXiv:1603.07181v1 [cs.IT] 23 Mar 2016

Paolo Perrone∗1 and Nihat Ay1,2,3

1 Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, 04103 Leipzig, Germany
2 Faculty of Mathematics and Computer Science, University of Leipzig, PF 100920, 04109 Leipzig, Germany
3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

∗Correspondence: [email protected]

Abstract

Here we define a procedure for evaluating KL-projections (I- and rI-projections) of channels. These can be useful in the decomposition of mutual information between inputs and outputs, e.g. to quantify synergies and interactions of different orders, as well as information integration and other related measures of complexity. The algorithm is a generalization of the standard iterative scaling algorithm, which we here extend from probability distributions to channels (also known as transition kernels).

Keywords: Markov Kernels, Hierarchy, I-Projections, Divergences, Interactions, Iterative Scaling, Information Geometry.

1 Introduction

Here we present an algorithm to compute projections of channels onto exponential families of fixed interactions. The approach is geometrical, and it is based on the idea that the quantities we work with are channels, or conditionals (also known as Markov kernels, stochastic kernels, transition kernels, or stochastic maps), rather than joint distributions. Our algorithm can be considered a channel version of the iterative scaling of (joint) probability distributions presented in [1].

Exponential and mixture families (of joints and of channels) have a duality property, shown in Section 2. By fixing some marginals, one determines a mixture family. By fixing (Boltzmann-type) interactions, one determines an exponential family. These two families intersect in a single point, which means (Theorem 2) that there exists a unique element with the desired marginals and the desired interactions. As a consequence, this result translates projections onto exponential families (which are generally hard to compute) into projections onto fixed-marginals mixture families (which can be approximated by an iterative procedure). Section 3 explains how this is done.

Projections onto exponential families are becoming more and more important in the definition of measures of statistical interaction, complexity, synergy, and related quantities. In particular, the algorithm can be used to compute decompositions of mutual information, such as the ones defined in [2] and [3], and it was indeed used to compute all the numerical examples in [3]. Another application of the algorithm is the explicit computation of complexity measures, as treated in [4], [5], [6], and [7]. Examples of both applications can be found in Section 4.

For all the technical details about the iterative scaling algorithm in its traditional version, we refer the interested reader to [1]. All proofs can be found in the Appendix.

1.1 Technical Definitions

We take the same definitions and notations as in [3], except that we allow multiple outputs. More precisely, we consider a set of N input nodes V, taking values in the sets X1, ..., XN, and a set of M output nodes W, taking values in the sets Y1, ..., YM. We write the input globally as X := X1 × ... × XN, and the output globally as Y := Y1 × ... × YM. We denote by F(Y) the set of real functions on Y, and by P(X) the set of probability measures on X.

Definition 1. Let I ⊆ V and J ⊆ W. We call F_IJ the space of functions that depend only on X_I and Y_J:

F_IJ := { f ∈ F(X,Y) | f(x_I, x_{I^c}, y_J, y_{J^c}) = f(x_I, x'_{I^c}, y_J, y'_{J^c})  ∀ x_{I^c}, x'_{I^c}, y_{J^c}, y'_{J^c} } .   (1)

We can model the channel from X to Y as a Markov kernel k, which assigns to each x ∈ X a probability measure on Y (for a detailed treatment, see [8]). Here we will consider only finite systems, so we can think of a channel simply as a transition matrix (or stochastic matrix), whose rows sum to one:

k(x; y) ≥ 0  ∀ x, y ;     Σ_y k(x; y) = 1  ∀ x .   (2)
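For concreteness, a finite channel can be stored as a row-stochastic array. The following minimal Python sketch is our own illustration (not part of the original text; the helper name is hypothetical) of condition (2):

```python
import numpy as np

def is_channel(k, tol=1e-9):
    """Check condition (2): k[x, y] >= 0 and each row sums to one."""
    k = np.asarray(k, dtype=float)
    return bool(np.all(k >= -tol) and np.allclose(k.sum(axis=1), 1.0, atol=tol))

# A channel from a 2-element input space X to a 2-element output space Y,
# written as a transition (stochastic) matrix with rows indexed by x.
k = np.array([[0.9, 0.1],
              [0.2, 0.8]])
assert is_channel(k)
```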

The space of channels from X to Y will be denoted by K(X; Y). We will denote by X and Y also the corresponding random variables, whenever this does not lead to confusion.

Conditional probabilities define channels: if p(X,Y) ∈ P(X,Y) and the marginal p(X) is strictly positive, then p(Y|X) ∈ K(X;Y) is a well-defined channel. Vice versa, if k ∈ K(X;Y), given p ∈ P(X) we can form a well-defined joint probability:

pk(x, y) := p(x) k(x; y)   ∀ x, y .   (3)

To extend the notion of divergence from probability distributions to channels, we need an "input distribution":

Definition 2. Let p ∈ P(X), and let k, m ∈ K(X;Y). Then:

D_p(k||m) := Σ_{x,y} p(x) k(x;y) log [ k(x;y) / m(x;y) ] .   (4)
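As an illustration of (3) and (4), here is a short sketch (our own, with hypothetical helper names, assuming channels are stored as row-stochastic arrays as above):

```python
import numpy as np

def joint(p, k):
    """Form the joint pk(x, y) = p(x) k(x; y) of equation (3)."""
    return p[:, None] * k

def divergence_p(p, k, m):
    """Channel divergence D_p(k || m) of equation (4).

    Terms with p(x) k(x; y) = 0 contribute 0 (the usual 0 log 0 convention)."""
    pk = joint(p, k)
    mask = pk > 0
    return float(np.sum(pk[mask] * np.log(k[mask] / m[mask])))

p = np.array([0.5, 0.5])
k = np.array([[0.9, 0.1], [0.2, 0.8]])
m = np.array([[0.5, 0.5], [0.5, 0.5]])
print(divergence_p(p, k, m))   # non-negative; zero iff k = m p-almost surely
```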

Let p, q be joint probability distributions on X × Y, and let D be the KL-divergence. Then:

D(p(X,Y) || q(X,Y)) = D(p(X) || q(X)) + D_{p(X)}(p(Y|X) || q(Y|X)) .   (5)
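The chain rule (5) can be checked numerically; the following self-contained toy verification is our own (not part of the original text):

```python
import numpy as np

def kl(a, b):
    """KL divergence between two distributions given as arrays of the same shape."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = np.array([0.3, 0.7]);  k = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([0.6, 0.4]);  m = np.array([[0.5, 0.5], [0.7, 0.3]])

lhs = kl(p[:, None] * k, q[:, None] * m)                 # D(pk || qm)
rhs = kl(p, q) + np.sum(p[:, None] * k * np.log(k / m))  # D(p||q) + D_p(k||m)
assert np.isclose(lhs, rhs)
```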

2 Families of Channels

Suppose we have a family E of channels, and a channel k that may not be in E. Then we can define the "distance" between k and E in terms of D_p.

Definition 3. Let p be an input distribution. The divergence between a channel k and a family of channels E is given by:

D_p(k|| E) := inf_{m ∈ E} D_p(k||m) .   (6)

If the minimum is uniquely realized, we call the channel

π_E k := arg min_{m ∈ E} D_p(k||m)   (7)

the rI-projection of k on E (and simply "an" rI-projection if it is not unique).

The families considered here are of two types, dual to each other: linear and exponential. For both cases, we take the closures, so that the minima defined above always exist.

Definition 4. A mixture family of K(X;Y) is a subset of K(X;Y) defined by an affine equation, i.e., the locus of the k which satisfy a (finite) system of equations in the form:

Σ_{x,y} k(x;y) f_i(x,y) = c_i ,   (8)

for some functions f_i ∈ F(X,Y), and some constants c_i.

Example. Consider a channel m ∈ K(X; Y1, Y2). We can form the marginal:

m(x; y1) := Σ_{y2} m(x; y1, y2) .   (9)

The channels k ∈ K(X; Y1, Y2) such that k(x; y1) = m(x; y1) form a mixture family, defined by the system of equations (for all x' ∈ X, y'_1 ∈ Y1):

Σ_{x, y1, y2} k(x; y1, y2) δ(x, x') δ(y1, y'_1) = m(x'; y'_1) ,   (10)

where the function δ(z, z') is equal to 1 for z = z', and zero for any other case. More examples of channel marginals will appear in the next section.
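The marginal (9) is just a sum over the second output index. A minimal sketch (our own; the array layout is an assumption):

```python
import numpy as np

# A channel from X (2 values) to Y1 x Y2 (2 x 2 values), stored with the
# output axes kept separate: m[x, y1, y2].
m = np.array([[[0.4, 0.1],
               [0.3, 0.2]],
              [[0.25, 0.25],
               [0.25, 0.25]]])

# Marginal channel m(x; y1) of equation (9): sum out y2.
m_y1 = m.sum(axis=2)
print(m_y1)   # rows still sum to one, so this is again a channel
```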

Definition 5. A (closed) exponential family of K(X;Y) is (the closure of) a subset of K(X;Y) of strictly positive solutions to a log-affine equation, i.e., the locus of the k which satisfy a (finite) system of equations in the form:

Π_{x,y} k(x;y)^{f_i(x,y)} = c_i ,   (11)

for some functions f_i ∈ F(X,Y), and some constants c_i. This is a multiplicative equivalent of mixture families. For strictly positive k, logarithms are defined, and equation (11) is equivalent to saying that log k satisfies equation (8). However, strictly positive channels do not form a compact space.

Example. Consider the space K(X;Y). Constant channels (i.e. channels for which Y does not depend on X) form an exponential family, defined by the system of equations (for all x' ∈ X, y' ∈ Y):

Π_{x,y} k(x;y)^{(δ(x,x')−1)(δ(y,y')−1)} = 1 .   (12)

Here is why. The product above is equivalent to writing:

k(x;y) k(x';y') / [ k(x;y') k(x';y) ] = 1 ,   (13)

whose strictly positive solutions satisfy:

k(x;y) / k(x;y') = k(x';y) / k(x';y') ,   (14)

so that the fraction of elements mapped to y does not depend on x.

We now want to show how to construct families which are, in some sense, dual.

Proposition 1. Let L be a (finite-dimensional) linear subspace of F(X,Y). Let p ∈ P(X) and k ∈ K(X;Y). Define:

M(k, L) := { m ∈ K(X;Y) | Σ_{x,y} m(x;y) l(x,y) = Σ_{x,y} k(x;y) l(x,y)  ∀ l ∈ L } ,   (15)

and:

E(k, L) := { e^{l(x,y)} k(x;y) / Z(x) | l ∈ L,  Z(x) = Σ_y e^{l(x,y)} k(x;y) } .   (16)

Then:

1. M(k, L) is a mixture family;
2. E(k, L) is an exponential family.

The duality is expressed by the following result.

Theorem 2. Let L be a subspace of F(X,Y). Let p ∈ P(X) be strictly positive. Let k0 ∈ K(X;Y) be a strictly positive "reference" channel. Let E := E(k0, L) and M := M(k, L). For k' ∈ K(X;Y), the following conditions are equivalent:

1. k' ∈ M ∩ E.
2. k' ∈ E, and D_p(k||k') = inf_{m ∈ E} D_p(k||m).
3. k' ∈ M, and D_p(k'||k0) = inf_{m ∈ M} D_p(m||k0).

In particular, k' is unique, and it is exactly π_E k.

Geometrically, we are saying that k' = π_E k, the rI-projection of k on E. We call the mapping k → k' the rI-projection operator, and the mapping k0 → k' the I-projection operator. These are the channel equivalent of the I-projections introduced in [9] and generalized in [10]. The result is illustrated in Figure 1.
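Elements of the exponential family E(k0, L) in (16) are obtained by tilting a reference channel with exp(l) and renormalizing each row. A short sketch (our own illustration, hypothetical names; l stands for any function in the chosen subspace L):

```python
import numpy as np

def tilt(k0, l):
    """Return the channel exp(l(x, y)) k0(x; y) / Z(x) of equation (16)."""
    unnorm = np.exp(l) * k0
    return unnorm / unnorm.sum(axis=1, keepdims=True)

k0 = np.array([[0.5, 0.5], [0.5, 0.5]])   # strictly positive reference channel
l = np.array([[1.0, -1.0], [0.3, 0.0]])   # some element of the subspace L
print(tilt(k0, l))                        # again a row-stochastic channel
```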


Figure 1: Illustration of Theorem 2. The point k' at the intersection minimizes on E the distance from k, and minimizes on M the distance from k0.

I- and rI-projections on exponential families satisfy a Pythagoras-type equality (see Figure 2). For any m ∈ E, with E an exponential family:

D_p(k||m) = D_p(k||π_E k) + D_p(π_E k||m) .   (17)

This statement follows directly from the analogous statement for probability distributions found in [11], after applying (5).


Figure 2: Illustration of equation (17).

3 Algorithm

The algorithm can be considered as a channel equivalent of the iterative scaling procedure (see for example [1]). The computations in the algorithm are iterated many times, but as they are mostly array rescalings, they are well suited for parallelization.

The intuitive idea is the following. Theorem 2 implies that the rI-projection of k on an exponential family E is the channel minimizing the divergence from a reference channel k0 ∈ E, subject to constraints on the marginals. We can then obtain the rI-projection of k with the following trick: rescaling the marginals of k0, while keeping the divergence as low as possible. Adjusting several marginals at once is not an easy task, but it can be done iteratively.

To apply the theorem, we choose as mixture family M a family of prescribed marginals. Marginals of a channel have to be defined, and they depend on the input distribution.

Definition 6. Let I ⊆ [N], J ⊆ [M], J ≠ ∅. We define the marginal operator for channels as:

k(x; y) ↦ k(x_I; y_J) := Σ_{x_{I^c}, y_{J^c}} p(x_{I^c} | x_I) k(x_I, x_{I^c}; y_J, y_{J^c}) ,   (18)

for some input probability distribution p. This way, k(X_I; Y_J) is exactly the conditional probability for the marginal pk(X_I, Y_J).
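A sketch of the marginal operator (18), with the input distribution stored as an array p[x_1, ..., x_N] and channels as arrays k[x_1, ..., x_N, y_1, ..., y_M] (our own illustration; the function name and array conventions are hypothetical). Here I and J are sorted tuples of input and output axis indices:

```python
import numpy as np

def channel_marginal(p, k, I, J):
    """Marginal operator (18): keep input axes I and output axes J.

    Computed as the conditional of the joint marginal, pk(x_I, y_J) / p(x_I),
    which is the same as (18) by the remark after Definition 6."""
    N = p.ndim
    M = k.ndim - N
    pk = p.reshape(p.shape + (1,) * M) * k              # joint pk(x, y)
    keep = list(I) + [N + j for j in J]
    drop = tuple(ax for ax in range(N + M) if ax not in keep)
    pk_IJ = pk.sum(axis=drop)                           # joint marginal pk(x_I, y_J)
    p_I = p.sum(axis=tuple(ax for ax in range(N) if ax not in I))
    return pk_IJ / p_I.reshape(p_I.shape + (1,) * len(J))

p = np.full((2, 2), 0.25)                               # uniform input on X1 x X2
k = np.random.dirichlet(np.ones(4), size=(2, 2)).reshape(2, 2, 2, 2)
print(channel_marginal(p, k, I=(0,), J=(0,)))           # k(x1; y1), shape (2, 2)
```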

Definition 7. Let I ⊆ [N], J ⊆ [M], J ≠ ∅. We define the mixture families M_IJ(k̄) as:

M_IJ(k̄) := { k(x_{1...N}; y_{1...M}) | p(x_I) k(x_I; y_J) = p(x_I) k̄(x_I; y_J) } ,   (19)

where the k̄(x_I; y_J) are prescribed channel marginals, for a strictly positive input distribution p.

Proposition 3. M_IJ(k̄) corresponds exactly to the set M(k̄, L) defined in Proposition 1, where as L we take the space F_IJ of functions which only depend on the nodes in I, J.

Just as in [1], the I-projections for single marginals can be obtained by scaling:

Definition 8. We define the IJ-scaling as the operator σ_IJ : K(X;Y) → K(X;Y), mapping k to:

(σ_IJ k)(x_I, x_{I^c}; y_J, y_{J^c}) := (1 / Z(x)) · [ k̄(x_I; y_J) / k(x_I; y_J) ] · k(x_I, x_{I^c}; y_J, y_{J^c}) ,   (20)

where:

Z(x) := Σ_{y'_J} k(x_I, x_{I^c}; y'_J) · k̄(x_I; y'_J) / k(x_I; y'_J) .   (21)

The scaling leaves the input distribution unchanged.

This operation indeed yields an element of the desired family:

Proposition 4. Let k ∈ K(X;Y). Then σ_IJ k ∈ M_IJ(k̄).

The IJ-scaling indeed yields projections, for which a Pythagoras-type relation holds. This means that IJ-scalings are indeed I-projections.

Lemma 5. Let k ∈ K(X;Y), let m ∈ M_IJ(k̄), and let p be a strictly positive input distribution. Then:

D_p(m||k) = D_p(m|| σ_IJ k) + D_p(σ_IJ k||k) .   (22)

Corollary 6. σ_IJ is an I-projection:

D_p(σ_IJ k||k) = min_{m ∈ M_IJ(k̄)} D_p(m||k) .   (23)
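A sketch of the IJ-scaling (20)-(21), in the same array conventions as above and reusing the channel_marginal helper sketched earlier (our own illustration of the formulas as reconstructed here; names are hypothetical). The factor Z(x) renormalizes each row so that the result is again a channel:

```python
import numpy as np

def ij_scaling(p, k, k_bar, I, J):
    """IJ-scaling (20)-(21): rescale k towards the prescribed marginal k_bar.

    k_bar is indexed by (x_I, y_J), like the output of channel_marginal."""
    N, M = p.ndim, k.ndim - p.ndim
    k_IJ = channel_marginal(p, k, I, J)                   # current marginal k(x_I; y_J)
    shape = [1] * (N + M)
    for ax in list(I) + [N + j for j in J]:
        shape[ax] = k.shape[ax]
    ratio = (k_bar / k_IJ).reshape(shape)                 # broadcastable k_bar / k(x_I; y_J)
    unnorm = k * ratio
    Z = unnorm.reshape(k.shape[:N] + (-1,)).sum(axis=-1)  # Z(x), as in equation (21)
    return unnorm / Z.reshape(Z.shape + (1,) * M)
```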

To prescribe all marginals together, we need to iterate all the scalings. The sequence then converges to the desired channel.

Definition 9. Let M_1, ..., M_n be mixture families in K(X;Y), and denote the corresponding I-projection operators by σ_1, ..., σ_n. Let k^0 ∈ K(X;Y). We call the iterative scaling of k^0 the infinite sequence starting at k^0, defined inductively by:

k^i := σ_{(i mod n)} k^{i−1} .   (24)

In our case, we choose the linear families as families of prescribed marginals for different subsets of the nodes. That is, M_i = M_{I_i J_i}(k̄), with I_i ⊆ [N] and J_i ⊆ [M], J_i ≠ ∅ for all i.

Theorem 7. If M := M_1 ∩ · · · ∩ M_n ≠ ∅, and we denote by σ the I-projection on M, then:

lim_{i→∞} k^i = σ k^0 .   (25)

We do not require the projection to be strictly positive. This result is illustrated in Figure 3.


Figure 3: Illustration of the iterative scaling procedure, Theorem 7: subsequent projections tend to the projection on the intersection of the families.

In our case, we choose as initial channel k^0 exactly the reference channel k0 of Theorem 2. The rI-projection of k on the desired exponential family will be obtained by the iterative scaling of k0. The limit point will have the same prescribed marginals as the original kernel k.
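Putting the pieces together, the iterative scaling of Definition 9 cycles through the prescribed marginals and applies the corresponding scalings. A sketch using the ij_scaling helper above (our own; the stopping rule is an assumption, not part of the original text):

```python
import numpy as np

def iterative_scaling(p, k0, constraints, sweeps=200, tol=1e-10):
    """Iterate the IJ-scalings cyclically, starting from the reference channel k0.

    constraints : list of (I, J, k_bar) triples, one per mixture family M_i."""
    k = np.array(k0, dtype=float)
    for _ in range(sweeps):
        previous = k.copy()
        for I, J, k_bar in constraints:
            k = ij_scaling(p, k, k_bar, I, J)
        if np.max(np.abs(k - previous)) < tol:   # iterates have stabilized
            break
    return k
```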

4 Applications

4.1 Synergy Measures

The algorithm presented here makes it possible to compute the decompositions of mutual information between inputs and outputs defined in [2] and [3]. We give here examples of computations of pairwise synergy as an rI-projection for channels, as described in [3]. It is not within the scope of this article to motivate this measure; we rather want to show how it can be computed.

Let k be a channel from X = (X1, X2) to Y. Let p ∈ P(X) be a strictly positive input distribution. We define in [3] the synergy of k as:

d_2(k) := D_p(k|| E_1) ,   (26)

where E_1 is the (closure of the) family of channels in the form:

m(x1, x2; y) = (1 / Z(x)) exp( φ0(x1, x2) + φ1(x1, y) + φ2(x2, y) ) ,   (27)

where:

Z(x) := Σ_y exp( φ0(x1, x2) + φ1(x1, y) + φ2(x2, y) ) ,   (28)

and:

φ0 ∈ F_{{1,2}∅} ,   φ1 ∈ F_{{1}{1}} ,   φ2 ∈ F_{{2}{1}} .   (29)

According to Theorem 2, the rI-projection of k on E_1 is the unique point k' of E_1 which has all the prescribed marginals:

k'(x1; y) = k(x1; y) ,   k'(x2; y) = k(x2; y) ,   (30)

and can therefore be computed by iterative scaling, either of the joint distribution (as is traditionally done, see [1]) or of the channels (our algorithm).

Here we present a comparison of the two algorithms, implemented similarly and in the same language (Mathematica). The red dots represent our (channel) algorithm, and the blue dots represent the joint rescaling algorithm. For the easiest channels (see Figure 4), both algorithms converge instantly.
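For concreteness, here is a usage sketch (our own, reusing the hypothetical helpers from Section 3; the numerical values reported in this section come from the authors' Mathematica implementation, not from this sketch) of how the projection of the XOR channel onto E_1 could be set up:

```python
import numpy as np

# XOR channel: two binary inputs, one binary output y = x1 XOR x2,
# stored as k[x1, x2, y]; uniform input distribution.
p = np.full((2, 2), 0.25)
k = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        k[x1, x2, x1 ^ x2] = 1.0

# Prescribed marginals (30): k'(x1; y) = k(x1; y) and k'(x2; y) = k(x2; y).
# Start the scaling from a uniform (strictly positive) reference channel.
constraints = [((0,), (0,), channel_marginal(p, k, (0,), (0,))),
               ((1,), (0,), channel_marginal(p, k, (1,), (0,)))]
k0 = np.full((2, 2, 2), 0.5)
k_proj = iterative_scaling(p, k0, constraints)

# Divergence D_p(k || k_proj) on the flattened joint, cf. equation (4).
pk = (p[..., None] * k).ravel()
mask = pk > 0
d2 = float(np.sum(pk[mask] * np.log(k.ravel()[mask] / k_proj.ravel()[mask])))
print(d2)   # numerical estimate of the pairwise synergy d_2(k) of equation (26)
```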

Figure 4: Comparison of convergence times for the synergy of the XOR gate. Both algorithms immediately reach the desired result. (The dots here are overlapping; the red ones are not visible.)

A more interesting example is a randomly generated channel (Figure 5), in which both methods need 5-10 iterations to reach the desired value. However, the channel method is slightly faster.


Figure 5: Comparison of convergence times for the synergy of a randomly generated channel. The channel method (red) is slightly faster.

The most interesting example is the synergy of the AND gate, which should be zero according to the procedure of [3]. In that article we mistakenly reported a different value, which we would like to correct here (it is zero). The convergence to zero is very slow, of the order of 1/n (Figure 6). The channel method is again clearly slightly faster in terms of iterations.


Figure 6: Comparison of convergence times for the synergy of the AND gate. The channel method (red) tends to zero proportionally to n^{−1.05}, the joint method (blue) proportionally to n^{−0.95}.

It has to be noted, however, that rescaling a channel requires more elementary operations than rescaling a joint distribution. Because of this, a single iteration with our method takes longer than with the joint method. In the end, despite needing fewer iterations, the total computation time of a projection with our algorithm can be longer (depending on the particular problem). For example, again for the synergy of the AND gate, we can plot the computation time as a function of the accuracy (distance to the actual value), down to 10^{−3}. The results are shown in Figure 7.


Figure 7: Comparison of total computation times for the synergy of the AND gate. The channel method (red) is slightly slower than the joint method (blue). To reach the same accuracy, though, the channel approach used fewer iterations.

In summary, our algorithm is better in terms of iteration complexity, but generally worse in terms of computing time.

4.2 Complexity Measures

Iterative scaling can also be used to compute measures of complexity, as defined in [4], [5], [6], and in Section 6.9 of [7]. For simplicity, consider two inputs X1, X2, two outputs Y1, Y2, and a generic channel between them. In general, any sort of interaction is possible, which in terms of graphical models (see [12]) can be represented by diagrams such as those in Figure 8. Any line in the graph indicates an interaction between the nodes.

In [4] and [5] the outputs are assumed to be conditionally independent, i.e. they do not directly interact (or, their interaction can be explained away by conditioning on the inputs). In this case the graph looks like Figure 8a. Suppose now that Y1, Y2 correspond to X1, X2 at a later time. In this case it is natural to assume that the system is not complex if Y1 does not depend (directly) on X2, and Y2 does not depend (directly) on X1. Intuitively, in this case "the whole is exactly the sum of its parts".


Figure 8: a) The graphical model corresponding to conditionally independent outputs; Y1 and Y2 are indeed correlated, but only indirectly, via the inputs. b) The graphical model corresponding to a non-complex system.

In terms of graphical models, this means that our system is represented by Figure 8b. These channels (or joints) form an exponential family (see [4] and [5]) which we call F_1.

Suppose now, though, that the outputs are indeed conditionally independent, but that they also depend on additional inputs, which we call X3, that we cannot observe and which we can consider as "noise". The graph would then be that of Figure 9a. If such a system is not complex (but the noise persists), we have a graphical model as in Figure 9b. Since we cannot observe the noise alone, we have to integrate (or sum, or marginalize) over the node X3. This way a "noisy" correlation between the output nodes appears (see Figure 10). In particular, since X3 is now "hidden", the outputs are not conditionally independent anymore, and a non-complex but "noisy" system would be represented by Figure 10b. Such channels (or joints) form again an exponential family, which we call F_2.

We would now like to have a measure of complexity for a channel (or joint). In [4] and [5], the measure of complexity is defined as the divergence from the family F_1 represented in Figure 8b. We will call such a measure c_1. In the case of noise, however, it is argued in [6] and [7] that the divergence should be computed from the family F_2 represented in Figure 10b, because of the marginalization over the noise (and, as written in the cited papers, because such a complexity measure should be required to be upper bounded by the mutual information between X and Y). We will call such a measure c_2. Both divergences can be computed with our algorithm.

As an example, we have considered the following channel:

k(x1, x2, x3; y1, y2) = (1 / Z(x)) exp( α x1 x2 + β x3 (y1 − y2) ) ,   (31)


Figure 9: a) The model of Figure 8a, with an additional "noise" input term X3. b) The graphical model corresponding to a non-complex system, with possible noise.


Figure 10: a) The graphical model of Figure 9a, after marginalizing over the noise. b) The non-complex model of Figure 9b, after marginalizing over the noise.

with:

Z(x) = Σ_{y'_1, y'_2} exp( α x1 x2 + β x3 (y'_1 − y'_2) ) .   (32)
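As a sketch of how the channel (31)-(32) could be tabulated and marginalized over the unobserved input X3 (our own illustration; we assume all nodes take values ±1, which the text does not state explicitly, and we use the values α = 1, β = 2 chosen below):

```python
import numpy as np

alpha, beta = 1.0, 2.0
vals = np.array([-1.0, 1.0])   # assumed +/-1 values for every node

# k[x1, x2, x3, y1, y2] proportional to exp(alpha*x1*x2 + beta*x3*(y1 - y2)),
# normalized over (y1, y2) for each input configuration, as in (31)-(32).
x1, x2, x3, y1, y2 = np.meshgrid(vals, vals, vals, vals, vals, indexing="ij")
energy = alpha * x1 * x2 + beta * x3 * (y1 - y2)
k = np.exp(energy)
k /= k.reshape(2, 2, 2, -1).sum(axis=-1).reshape(2, 2, 2, 1, 1)

# Uniform input distribution on (X1, X2, X3); marginalize the joint over X3
# to obtain a channel from (X1, X2) to (Y1, Y2).
p = np.full((2, 2, 2), 1 / 8)
pk = p[..., None, None] * k
k_marg = pk.sum(axis=2) / p.sum(axis=2)[..., None, None]
print(k_marg.shape)   # (2, 2, 2, 2): channel (X1, X2) -> (Y1, Y2)
```

Computing c_1(k) and c_2(k) would then amount to projecting k_marg onto the families F_1 and F_2 with the iterative scaling routine and the corresponding marginal constraints.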

We have chosen α = 1 and β = 2, and a uniform input probability p. After marginalizing over X3, we can compute the two divergences:

• c_1(k) = D_p(k|| F_1) = 0.519.
• c_2(k) = D_p(k|| F_2) = 0.110.

This indicates that c_1 is incorporating the correlation of the output nodes due to the "noise", and is therefore probably overestimating the complexity, at least in this case. One could nevertheless argue that c_2 can underestimate complexity, as we can see in the following "control" example. Consider the channel:

h(x1, x2, x3; y1, y2) = (1 / Z(x)) exp( α x1 x2 (y1 − y2) ) ,   (33)

with:

Z(x) = Σ_{y'_1, y'_2} exp( α x1 x2 (y'_1 − y'_2) ) ,   (34)

which is represented by the graph in Figure 8a. If the difference between c_1 and c_2 were just due to the noise, then for our new channel c_1(h) and c_2(h) should be equal. This is not the case:

• c_1(h) = D_p(h|| F_1) = 0.946.
• c_2(h) = D_p(h|| F_2) = 0.687.

The divergences are still different. This means that there is an element h_2 in F_2, which does not lie in F_1, for which:

D_p(h||h_2) < D_p(h||h_1)   ∀ h_1 ∈ F_1 .   (35)

The difference is this time smaller, which could mean that noise still does play a role, but rigorously it is hard to say, since none of these quantities is linear, and divergences do not satisfy a triangle inequality. We do not want to argue here in favor of or against any of these measures. We would rather like to point out that such considerations can mostly be made only after explicit computations, which can be carried out with iterative scaling.

References

[1] Csiszár, I. and Shields, P. C. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.
[2] Olbrich, E., Bertschinger, N., and Rauh, J. Information decomposition and synergy. Entropy, 17(5):3501–3517, 2015.
[3] Perrone, P. and Ay, N. Hierarchical quantification of synergy in channels. Frontiers in Robotics and AI, 35(2), 2016.
[4] Ay, N. Information Geometry on Complexity and Stochastic Interaction. MPI MiS Preprint 95/2001. Available online: http://www.mis.mpg.de/publications/preprints/2001/prepr2001-95.html
[5] Ay, N. Information Geometry on Complexity and Stochastic Interaction. Entropy, 17:2432–2458, 2015.
[6] Oizumi, M., Tsuchiya, N., and Amari, S. A unified framework for information integration based on information geometry. Preprint, arXiv:1510.04455, 2015.
[7] Amari, S. Information Geometry and Its Applications. Springer, 2016.
[8] Kakihara, Y. Abstract Methods in Information Theory. World Scientific, 1999.
[9] Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146–158, 1975.
[10] Csiszár, I. and Matúš, F. Information projections revisited. IEEE Transactions on Information Theory, 49:1474–1490, 2003.
[11] Amari, S. Information geometry on a hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5):1701–1709, 2001.
[12] Lauritzen, S. L. Graphical Models. Oxford University Press, 1996.
[13] Williams, P. L. and Beer, R. D. Nonnegative decomposition of multivariate information. Preprint, arXiv:1004.2151, 2010.
[14] Amari, S. and Nagaoka, H. Differential geometry of smooth families of probability distributions. Technical Report METR 82-7, Univ. of Tokyo, 1982.
[15] Amari, S. and Nagaoka, H. Methods of Information Geometry. Oxford, 1993.

A Proofs

Proof of Proposition 1. Let {f_i} be a basis of L, orthonormal in L^2 (w.r.t. the counting measure). Then:

1. The condition in (15) is equivalent to:

Σ_{x,y} m(x;y) f_i(x,y) = Σ_{x,y} k(x;y) f_i(x,y) = c_i ,   (36)

for every i, and for some constants c_i, and this is in the form of (8).

2. The elements of E(k, L) are in the form, for some θ_i:

h(x;y) = ( k(x;y) / Z(x) ) exp( Σ_i θ_i f_i(x,y) ) .   (37)

We have, for the exponential term:

Π_{x,y} exp( Σ_i θ_i f_i(x,y) )^{f_j(x,y)} = exp( Σ_i θ_i Σ_{x,y} f_i(x,y) f_j(x,y) ) ,   (38)

which because of orthonormality is equal to:

exp( Σ_i θ_i δ_{ij} ) = e^{θ_j} .   (39)

Therefore:

Π_{x,y} h(x;y)^{f_i(x,y)} = e^{θ_i} Π_{x',y'} ( k(x';y') / Z(x') )^{f_i(x',y')} = c_i   (40)

for some constants c_i, which is exactly in the form of (11).

Proof of Theorem 2. 1 ⇔ 2: Choose a basis f_0 = 1, f_1, ..., f_d of L. Define the map θ ↦ k_θ, with:

k_θ(x;y) = k(θ_1, ..., θ_d)(x;y) := ( k0(x;y) / Z_θ(x) ) exp( Σ_{j=1}^d θ_j f_j(x,y) ) ,   (41)

and:

Z_θ(x) := Σ_y k0(x;y) exp( Σ_{i=1}^d θ_i f_i(x,y) ) .   (42)

Then:

D_p(k||k_θ) = D_p(k||k0) − Σ_{j=1}^d θ_j E_{pk}[f_j] + E_p[log Z_θ] .   (43)

Differentiating (where ∂_j is w.r.t. θ_j):

∂_j D_p(k||k_θ) = − E_{pk}[f_j] + E_p[ ∂_j Z_θ / Z_θ ] .   (44)

The term in the last brackets is equal to:

∂_j Z_θ / Z_θ = (1 / Z_θ) Σ_y k0(x;y) exp( Σ_{i=1}^d θ_i f_i(x,y) ) f_j(x,y)   (45)
             = Σ_y k_θ(x;y) f_j(x,y) ,   (46)

so that (44) now reads:

∂_j D_p(k||k_θ) = − E_{pk}[f_j] + E_{pk_θ}[f_j] .   (47)

This quantity is equal to zero for every j if and only if k_θ ∈ M. Now if k_θ is a minimizer, the derivatives (47) vanish, and so k_θ ∈ M. Vice versa, suppose k_θ ∈ M, so that (47) vanishes for every j. To prove that it is a global minimizer, we look at the Hessian:

∂_i ∂_j D_p(k||k_θ) = ∂_i ∂_j D(pk||pk_θ) .   (48)

This is precisely the covariance matrix of the joint probability measure pk_θ, which is positive definite.

1 ⇔ 3: For every m ∈ M, we have:

D_p(m||k0) = Σ_{x,y} p(x) m(x;y) log [ m(x;y) / k0(x;y) ] = E_{pm}[ log (m/k0) ] .   (49)

If k' ∈ E, then:

D_p(m||k0) = E_{pm}[ log (m/k') + log (k'/k0) ] = D_p(m||k') + E_{pm}[ log (k'/k0) ] .   (50)

By definition of E, the logarithm in the last brackets belongs to L, and since m ∈ M:

E_{pm}[ log (k'/k0) ] = E_{pk}[ log (k'/k0) ] = E_{pk'}[ log (k'/k0) ] .   (51)

Inserting this in (50):

D_p(m||k0) = D_p(m||k') + E_{pk'}[ log (k'/k0) ] = D_p(m||k') + D_p(k'||k0) .   (52)

Proof of Proposition 3. For f in FIJ : X X Epk [f ] = p(x) k(x; y) f (x, y) = p(xI ) k(xI ; yJ ) f (xI ; yJ ) , x,y

(53)

xI ,yJ

and just as well: X X ¯ y) f (x, y) = ¯ I ; yJ ) f (xI ; yJ ) . Epk¯ [f ] = p(x) k(x; p(xI ) k(x x,y

(54)

xI ,yJ

The definition in Proposition 1 (with strict positivity of p) requires exactly that: Epk [f ] = Epk¯ [f ]

(55)

for every f ∈ FIJ . Using (53) and (54), the equality becomes: X X ¯ I ; yJ ) f (xI ; yJ ) p(xI ) k(xI ; yJ ) f (xI ; yJ ) = p(xI ) k(x xI ,yJ

(56)

xI ,yJ

¯ I ; yJ ). for every f in FIJ , which means that k(xI ; yJ ) = k(x Proof of Proposition 4. Let p(X) be a strictly positive input distribution. Applying (18) to (20): (σ IJ k)(xI ; yJ ) =

X p(xI c |xI ) ¯ I ; yJ ) k(x k(xI , xI c ; yJ ) Z(xI , xI c ) k(xI ; yJ ) x c I

¯ I ; yJ ) = k(x

X p(xI c |xI ) k(xI , xI c ; yJ ) Z(xI , xI c ) k(xI ; yJ ) x c

¯ I ; yJ ) = k(x

X

¯ I ; yJ ) = k(x

X

I

xI c

xI c

¯ I ; yJ ) = k(x

pk(xI , xI c , yJ ) 1 Z(xI , xI c ) pk(xI , yJ )

P

0 yJ

X P xI c

¯ I ; yJ ) = k(x

0 yJ

pk(xI c |xI , yJ ) pk(xI c |xI , yJ0 ) ¯ pk(xI , yJ0 ) p(x) pk(xI c |xI , yJ ) p(x) ¯ I , y0 ) pk(xI c |xI , yJ0 ) pk(x J

X pk(xI c |xI , yJ ) p(x) xI c

p(x)

,

as the input probability is the same. We are left with: X ¯ I ; yJ ) ¯ I ; yJ ) , (σ IJ k)(xI ; yJ ) = k(x pk(xI c |xI , yJ ) = k(x xI c

which is exactly the constraint of MIJ .

(57)

Proof of Lemma 5. Expanding the r.h.s. of (22) with (4), and inserting (20): Dp (m|| σ IJ k) + Dp (σ IJ k||k)   X m(x; y) σ IJ k(x; y) = p(x) m(x; y) log + σ IJ k(x; y) log σ IJ k(x; y) k(x; y) x,y  X ¯ I ; yJ )  k(x m(x; y) Z(x) k(xI ; yJ ) = p(x) m(x; y) log ¯ I ; yJ ) + σ IJ k(x; y) log Z(x) k(xI ; yJ ) k(x; y) k(x x,y   X  Z(x) k(xI ; yJ ) m(x; y) m(x; y) − σ IJ k(x; y) + log ¯ = p(x) m(x; y) log k(x; y) k(xI ; yJ ) x,y = Dp (m||k) +

X

log

xI ,yJ

= Dp (m||k) +

X xI ,yJ

= Dp (m||k) +

X

 Z(x) k(xI ; yJ ) X p(x) m(x; y) − p(x) σ IJ k(x; y) ¯ k(xI ; yJ ) x c ,y c I

J

 Z(x) k(xI ; yJ ) p(xI ) m(xI ; yJ ) − p(xI ) σ IJ k(xI ; yJ ) log ¯ k(xI ; yJ ) log

xI ,yJ

 Z(x) k(xI ; yJ ) ¯ I ; yJ ) − k(x ¯ I ; yJ ) p(xI ) k(x ¯ k(xI ; yJ )

= Dp (m||k) + 0 , ¯ (the latter using Proposition 4), and so as both m and σ IJ k belong to MIJ (k) they have the prescribed marginals. Proof of Theorem (7). The proof is adapted from the analogous statement for probability distributions found in [1]. Equation (22) implies that for every i, for k ∈ M: Dp (k||k i−1 ) = Dp (k||k i ) + Dp (k i ||k i−1 ) . (58) Summing these equations for i = 1, . . . , N we get: Dp (k||k 0 ) = Dp (k||k N ) +

N X

Dp (k i ||k i−1 ) .

(59)

i=1

Since the polytope of channels is compact, there exists at least an accumulation point k 0 and a subsequence k Nj → k 0 . This means: Dp (k||k 0 ) = Dp (k||k 0 ) +

∞ X

Dp (k i ||k i−1 ) ,

(60)

i=1

so that the series on the right must be convergent, and its terms must tend to zero. This in turn implies, using Pinsker’s inequality: X |k Nj (x; y) − k Nj −1 (x; y)| → 0 , (61) x,y

which means that also the subsequence k Nj −1 → k 0 . So does k Nj −2 , and so on until k Nj −n . Among the terms k Nj −1 , . . . , k Nj −n there is one in each

M1 , . . . , Mn , and since they all tend to the same accumulation point k 0 , k 0 must lie in the intersection M. Therefore (60) holds for k 0 as well: Dp (k 0 ||k 0 ) = 0 +

∞ X

Dp (k i ||k i−1 ) ,

(62)

i=1

which substituted again gives, for any k ∈ M: Dp (k||k 0 ) = Dp (k||k 0 ) + Dp (k 0 ||k 0 ) ,

(63)

i.e. k 0 = σ k 0 , which is unique for strictly positive p. Since the choice of subsequence was arbitrary, σ k 0 is the only accumulation point, so the sequence converges.