
Capacity-achieving Sparse Superposition Codes via Approximate Message Passing Decoding

Cynthia Rush, Yale University

Adam Greig University of Cambridge

Ramji Venkataramanan∗ University of Cambridge

[email protected]

[email protected]

[email protected]

arXiv:1501.05892v1 [cs.IT] 23 Jan 2015

January 26, 2015

Abstract

Sparse superposition codes were recently introduced by Barron and Joseph for reliable communication over the AWGN channel at rates approaching the channel capacity. The codebook is defined in terms of a Gaussian design matrix, and codewords are sparse linear combinations of columns of the matrix. In this paper, we propose an approximate message passing decoder for sparse superposition codes, whose decoding complexity scales linearly with the size of the design matrix. The performance of the decoder is rigorously analyzed and it is shown to asymptotically achieve the AWGN capacity with an appropriate power allocation. We provide simulation results to demonstrate the performance of the decoder at finite block lengths, and investigate the effects of various power allocations on the decoding performance.

1 Introduction

This paper considers the problem of constructing low-complexity, capacity-achieving codes for the memoryless additive white Gaussian noise (AWGN) channel. The channel generates output y from input x according to

y = x + w,   (1)

where the noise w is a Gaussian random variable with zero mean and variance σ². There is an average power constraint P on the input x: if x_1, ..., x_n are transmitted over n uses of the channel, then we require that (1/n) \sum_{i=1}^{n} x_i^2 ≤ P. The signal-to-noise ratio P/σ² is denoted by snr. The goal is to construct codes with computationally efficient encoding and decoding, whose rates approach the channel capacity given by

C := \frac{1}{2}\log(1 + \mathrm{snr}).   (2)
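As a quick sanity check on the numbers used later in the paper, the following snippet (an illustrative Python calculation of ours, not part of the original text) evaluates the capacity formula (2); with snr = 15, as in the simulations of Section 4.1, it gives C = 2 bits per channel use.

```python
import numpy as np

def awgn_capacity(snr):
    """AWGN channel capacity in bits per channel use, eq. (2)."""
    return 0.5 * np.log2(1 + snr)

print(awgn_capacity(15))   # prints 2.0, the value of C used in Section 4.1
```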

Sparse superposition codes, also called Sparse Regression Codes (SPARCs), were recently introduced by Barron and Joseph [1, 2] for communication over the channel in (1). They proposed an efficient decoding algorithm called 'adaptive successive decoding', and showed that for any fixed rate R < C, the probability of decoding error decays to zero exponentially in n/log n, where n is the block length of the code. Despite the strong theoretical performance guarantees, the rates achieved by this decoder for practical block lengths are significantly less than C. Subsequently, a soft-decision iterative decoder was proposed by Cho and Barron [3, 4], with theoretical guarantees similar to the earlier decoder in [2] but improved empirical performance for finite block lengths.

In this paper, we propose an approximate message passing (AMP) decoder for SPARCs. We analyze its performance and prove that the probability of decoding error goes to zero with growing block length for all fixed rates R < C. The decoding complexity is proportional to the size of the design matrix defining the code, which is a low-order polynomial in n.

∗ This work was supported in part by a Marie Curie Career Integration Grant (GA Number 631489).


1.1 Approximate Message Passing (AMP)

"Approximate message passing" refers to a class of algorithms [5–11] that are Gaussian or quadratic approximations of loopy belief propagation algorithms (e.g., min-sum, sum-product) on dense factor graphs. AMP has proved particularly effective for the problem of reconstructing sparse signals from a small number of noisy linear measurements. This problem, commonly referred to as compressed sensing [12], is described by the measurement model

y = Aβ + w.   (3)

Here A is an n×N measurement matrix with n < N, β ∈ R^N is a sparse vector to be estimated from the observed vector y ∈ R^n, and w ∈ R^n is the measurement noise. One popular class of algorithms to reconstruct β is ℓ1-norm based convex optimization, e.g., [13–15]. Though these algorithms have strong theoretical guarantees and excellent empirical performance, the computational cost makes it challenging to implement the convex optimization procedures for problems where N is large. A fast AMP reconstruction algorithm for the model in (3) was proposed in [5]. Its empirical performance (for a large class of measurement matrices) was found to be similar to that of convex optimization based methods at significantly lower computational cost. The factor graph corresponding to the model in (3) is dense, hence it is infeasible to implement message passing algorithms in which the messages are complicated real-valued functions. AMP circumvents this difficulty by passing only scalar parameters corresponding to these functions. For example, the scalars could be the mean and the variance if the functions are posterior distributions. The references [6, 8–10] describe how various flavors of AMP for the model in (3) can be obtained by approximating the standard message passing equations. These approximations reduce the message passing equations to a set of simple rules for computing successive estimates of β. In [5], it was demonstrated via numerical experiments that the mean-squared reconstruction error of these estimates of β could be tracked by a simple scalar iteration called state evolution. In [7], it was rigorously proved that the state evolution is accurate in the large system limit¹ for measurement matrices A with i.i.d. Gaussian entries. In addition to compressed sensing, AMP has also been applied to a variety of related problems, e.g., [16–18]. We will not attempt a complete survey of the growing literature on AMP; the reader is referred to [10, 11] for comprehensive lists of related work.
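For readers unfamiliar with AMP, the following minimal Python sketch illustrates the generic recipe for the compressed sensing model (3) with a soft-thresholding denoiser. It is our own illustration under simplifying assumptions (the threshold is set to the empirical noise level and the iteration count is fixed); it is not the decoder proposed in this paper, which uses a section-wise denoiser and is described in Section 3.

```python
import numpy as np

def soft_threshold(x, theta):
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def amp_compressed_sensing(y, A, num_iters=30):
    """Generic AMP for y = A @ beta + w with a soft-threshold denoiser.

    The last term of the residual update is the Onsager correction; without it,
    this would be plain iterative thresholding.
    """
    n, N = A.shape
    delta = n / N                                   # undersampling ratio
    beta, z = np.zeros(N), y.copy()
    for _ in range(num_iters):
        theta = np.linalg.norm(z) / np.sqrt(n)      # crude noise-level estimate
        beta = soft_threshold(beta + A.T @ z, theta)
        onsager = (z / delta) * np.mean(np.abs(beta) > 0)   # <eta'> averaged over N
        z = y - A @ beta + onsager
    return beta
```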

1.2 Main Contributions

• We propose an AMP decoder for sparse regression codes, which is derived via a first-order approximation of a min-sum-like message passing algorithm.

• The main result of the paper is Theorem 1, in which we rigorously show that the probability of decoding error goes to zero as the block length tends to infinity, for all rates R < C.

• We demonstrate the performance of the decoder for finite block lengths via simulation results. We also investigate the effects of various power allocations for the SPARC, and provide guidelines on how to fix the code parameters depending on the ratio R/C.

To prove our main result, we use the framework of Bayati and Montanari [7], who in turn built on techniques introduced by Bolthausen [19]. However, we remark that the analysis of the proposed algorithm does not follow directly from the results in [7, 20]. The main reason for this is that the undersampling ratio n/N in our setting goes to zero in the large system limit, whereas previous rigorous analyses of AMP consider the case where the undersampling ratio is a constant. This point is discussed further in Section 5.

¹ The large system limit considered in [7] lets n, N → ∞ with n/N held constant.

1.3 Related work on communication with SPARCs

The adaptive successive decoder of Joseph-Barron [2] and the iterative soft-decision decoder of Cho-Barron [3, 4] both have probability of error that decays to zero exponentially in n/log n for any fixed rate R < C, but the latter has better empirical performance. Theorem 1 shows that the probability of error for the AMP decoder goes to zero for all R < C, but does not give a rate of decay; hence we cannot theoretically compare its performance with the Cho-Barron decoder in [4]. We can, however, compare the two decoders qualitatively. Both the AMP and the Cho-Barron decoder generate a succession of estimates β^1, β^2, ... for the message vector β based on test statistics s^0, s^1, ..., respectively. At step t, the Cho-Barron decoder generates the statistic s^t based on an orthonormalization of the observed vector y and the previous 'fits' Aβ^1, ..., Aβ^t. In contrast, the test statistic in the AMP decoder is based on a modified version of the residue (y − Aβ^t). Despite being generated in very different ways, the test statistics of the AMP and Cho-Barron decoders have a similar structure: they are asymptotically equivalent to an observation of β corrupted by additive Gaussian noise whose variance decreases with t. However, the AMP statistic is faster to compute in each step, which makes it feasible to implement the decoder for larger block lengths, which in turn results in lower (empirical) probability of decoding error.

An approximate message passing decoder for sparse superposition codes was recently proposed by Barbier and Krzakala in [21]. This decoder has different update rules from the AMP proposed here. A replica-based analysis of the decoder in [21] suggested it could not achieve rates beyond a threshold which was strictly smaller than C. Very recently, Barbier et al. [22] reported empirical results which show that the performance of the decoder in [21] can be improved by using spatially coupled Hadamard matrices to define the code.

Finally, we mention bit-interleaved coded modulation [23] and the recently proposed low-density lattice codes [24] as two alternative approaches to designing high-rate codes for the AWGN channel. Though these schemes have very good empirical performance, in general there are no theoretical guarantees that they achieve the AWGN capacity with growing block length.

1.4 Paper outline and Notation

The paper is organized as follows. The sparse regression code construction is described in Section 2. We describe the AMP channel decoder in Section 3, and provide some intuition about its iterations. We also show how the decoder can be derived as a first-order approximation to a min-sum-like message passing algorithm. Section 4 contains the main result, which characterizes the performance of the AMP decoder for any rate R < C in the large system limit. In Section 4.1, we present simulation results to demonstrate the performance of the decoder at finite block lengths. Section 5 contains the proof of the main result.

Notation: The ℓ2-norm of a vector x is denoted by ‖x‖. The transpose of a matrix B is denoted by B^*. N(μ, σ²) denotes the Gaussian distribution with mean μ and variance σ². For any positive integer m, [m] is the set {1, ..., m}. The indicator function of an event A is denoted by 1(A). f(x) = o(g(x)) means lim_{x→∞} f(x)/g(x) = 0; f(x) = Θ(g(x)) means f(x)/g(x) asymptotically lies in an interval [κ₁, κ₂] for some constants κ₁, κ₂ > 0. log and ln are used to denote logarithms with base 2 and base e, respectively. Rate is measured in bits.

Figure 1: A is an n × ML matrix and β is an ML × 1 vector. The positions of the non-zeros in β correspond to the gray columns of A which combine to form the codeword Aβ.

2 The Sparse Regression Codebook

A sparse regression code (SPARC) is defined in terms of a dictionary or design matrix A of dimension n × ML, whose entries are i.i.d. N(0, 1/n). Here n is the block length, and M, L are integers whose values will be specified shortly in terms of n and the rate R. As shown in Fig. 1, one can think of the matrix A being composed of L sections with M columns each. Each codeword is a linear combination of L columns, with one column from each section. Formally, a codeword can be expressed as Aβ, where β is an ML × 1 vector (β₁, ..., β_{ML}) with the following property: there is exactly one non-zero β_j for 1 ≤ j ≤ M, one non-zero β_j for M+1 ≤ j ≤ 2M, and so forth. The non-zero value of β in section ℓ is set to √(nP_ℓ), where P₁, ..., P_L are positive constants that satisfy

\sum_{\ell=1}^{L} P_\ell = P.   (4)

Denote the set of all β's that satisfy this property by B_{M,L}(P₁, ..., P_L). Since there are M columns in each of the L sections, the total number of codewords is M^L. To obtain a communication rate of R bits/sample, we need

M^L = 2^{nR}   or   L log M = nR.   (5)

There are several choices for the pair (M, L) which satisfy (5). For example, L = 1 and M = 2^{nR} recovers the Shannon-style random codebook in which the number of columns in A is 2^{nR}. For our constructions, we will choose M equal to L^b, for some constant b > 0. In this case, (5) becomes

bL log L = nR.   (6)

Thus L = Θ(n/log n), and the size of the design matrix A (given by n × ML = n × L^{b+1}) now grows polynomially in n.

Encoding: The encoder splits its stream of input bits into segments of log M bits each. A length-ML message vector β₀ is indexed by L such segments: the decimal equivalent of segment ℓ determines the position of the non-zero coefficient in section ℓ of β₀. The input codeword is then computed as X = Aβ₀; note that computing X simply involves the addition of L columns of A, weighted by appropriate coefficients.

Power Allocation: The power allocation {P_ℓ}_{ℓ=1}^{L} plays an important role in determining the performance of the decoder. We will consider allocations where P_ℓ = Θ(1/L). Two examples are:

• Flat power allocation across sections: P_ℓ = P/L, ℓ ∈ [L].

• Exponentially decaying power allocation: Fix parameter κ > 0. Then P_ℓ ∝ 2^{−κℓ/L}, ℓ ∈ [L].

We use the exponentially decaying allocation with κ = 2C for Theorem 1. In Section 4.1, we discuss other power allocations, and find that an appropriate combination of exponential and flat allocations yields good decoding performance at finite block lengths. Both the design matrix A and the power allocation {P_ℓ} are known to the encoder and the decoder before communication begins.
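To make the construction concrete, here is a short Python sketch (ours, purely illustrative; the function names make_sparc and encode are not from the paper) that builds the design matrix, an exponentially decaying allocation with κ = 2C normalised to satisfy (4), and the encoder mapping bit segments to a codeword Aβ₀. It assumes σ² = 1, so that P = snr.

```python
import numpy as np

def make_sparc(L, M, R, snr, rng):
    """Design matrix, power allocation, and block length for a SPARC."""
    n = int(round(L * np.log2(M) / R))           # block length from L log M = nR
    C = 0.5 * np.log2(1 + snr)
    P = snr                                      # sigma^2 = 1 assumed, so P = snr
    # Exponentially decaying allocation P_l ∝ 2^{-2C l / L}, normalised to sum to P.
    P_l = 2.0 ** (-2 * C * np.arange(1, L + 1) / L)
    P_l *= P / P_l.sum()
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, L * M))   # i.i.d. N(0, 1/n)
    return A, P_l, n

def encode(bits, A, P_l, n, M):
    """Map L segments of log2(M) bits to beta_0 and the codeword A @ beta_0."""
    L = len(P_l)
    logM = int(np.log2(M))
    beta0 = np.zeros(L * M)
    for ell in range(L):
        seg = bits[ell * logM:(ell + 1) * logM]
        idx = int("".join(map(str, seg)), 2)     # decimal value of the segment
        beta0[ell * M + idx] = np.sqrt(n * P_l[ell])
    return beta0, A @ beta0

rng = np.random.default_rng(0)
L, M, snr = 32, 64, 15.0
R = 0.7 * 0.5 * np.log2(1 + snr)                 # a rate of 0.7C for illustration
A, P_l, n = make_sparc(L, M, R, snr, rng)
bits = rng.integers(0, 2, size=L * int(np.log2(M)))
beta0, x = encode(bits, A, P_l, n, M)
```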

2.1 Some more notation

In the analysis, we will treat the message as a random vector β, which is uniformly distributed over B_{M,L}(P₁, ..., P_L), the set of length-ML vectors that have a single non-zero entry √(nP_ℓ) in section ℓ, for ℓ ∈ [L]. We will denote the true message vector by β₀; β₀ should be understood as a realization of the random vector β. We will use indices i, j to denote specific entries of β, while the index ℓ will be used to denote the entire section ℓ of β. Thus β_i, β_j are scalars, while β_ℓ is a length-M vector. We also set N = ML. The performance of the SPARC decoder will be characterized in the limit as the dictionary size goes to ∞. We write lim x to denote the limit of the quantity x as the SPARC parameters n, L, M → ∞ simultaneously, according to M = L^b and bL log L = nR.

3 The AMP Channel Decoder

Given the received vector y = Aβ₀ + w, the AMP decoder generates successive estimates of the message vector, denoted by {β^t}, where β^t ∈ R^N for t = 1, 2, .... Set β^0 = 0, the all-zeros vector. For t = 0, 1, ..., compute

z^t = y - A\beta^t + \frac{z^{t-1}}{\tau_{t-1}^2}\left(P - \frac{\|\beta^t\|^2}{n}\right),   (7)

\beta_i^{t+1} = \eta_i^t(\beta^t + A^* z^t),   for i = 1, ..., N = ML,   (8)

where quantities with negative indices are set equal to zero. The constants {τ_t} and the estimation functions η_i^t(·) are defined as follows for t = 0, 1, ....

• Define

\tau_0^2 = \sigma^2 + P, \qquad \tau_{t+1}^2 = \sigma^2 + P(1 - x_{t+1}), \quad t \geq 0,   (9)

where

x_{t+1} = \sum_{\ell=1}^{L} \frac{P_\ell}{P}\, E\left[ \frac{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\}}{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\} + \sum_{j=2}^{M} \exp\left\{U_j^\ell \frac{\sqrt{nP_\ell}}{\tau_t}\right\}} \right].   (10)

In (10), {U_j^ℓ} are i.i.d. N(0, 1) random variables for j ∈ [M], ℓ ∈ [L].

• For i ∈ [N], define

\eta_i^t(s) = \sqrt{nP_\ell}\; \frac{\exp\left(s_i \sqrt{nP_\ell}/\tau_t^2\right)}{\sum_{j \in \mathrm{sec}_\ell} \exp\left(s_j \sqrt{nP_\ell}/\tau_t^2\right)}, \quad \text{if } i \in \mathrm{sec}_\ell,\ 1 \leq \ell \leq L.   (11)

The notation j ∈ sec_ℓ is used as shorthand for "index j in section ℓ", i.e., j ∈ {(ℓ−1)M + 1, ..., ℓM}. Notice that η_i^t(s) depends on all the components of s in the section containing i. For brevity, the argument of η_i^t in (8) is written as A^*z^t + β^t, with the understanding that only the components in the section containing i play a role in computing η_i^t.

Before running the AMP decoder, the constants {τ_t} must be iteratively computed using (9) and (10). This is an offline computation: for given values of M, L, n, the expectations in (10) can be computed via Monte Carlo simulation. The relation (9), which describes how τ_{t+1} is obtained from τ_t, is called state evolution, following the terminology in [5, 7]. In Section 4, we derive closed-form expressions for the trajectories of x_{t+1} and τ_t² as n → ∞. For now, it suffices to note that for any fixed R < C, τ_t strictly decreases with t for a finite number of steps T_n, at which point we have τ_{T_n+1} ≥ τ_{T_n}. Having determined τ_0, τ_1, ..., τ_{T_n}, the decoder iteratively computes codeword estimates β^1, ..., β^{T_n} using (7) and (8). The algorithm is then terminated. Finally, in each section ℓ of β^{T_n}, set the maximum value to √(nP_ℓ) and the remaining entries to 0 to obtain the decoded message β̂.

Computational Complexity: The complexity is determined by the matrix-vector multiplications Aβ^t and A^*z^t, whose running time is O(nN) if performed in the straightforward way. The remaining operations are O(N). As the number of iterations is finite, the complexity scales linearly with the size of the design matrix.
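The decoding recipe above can be summarised in a short sketch. The following Python code is our own illustration (the function names are ours, and the log-sum-exp shift inside η is added only for numerical stability): it estimates the constants {τ_t} by Monte Carlo evaluation of (9)-(10), runs the AMP iterations (7)-(8) with the section-wise estimator (11), and ends with the hard-decision step.

```python
import numpy as np

def eta(s, P_l, n, M, tau_sq):
    """Section-wise estimator (11): a softmax within each section, scaled by sqrt(n P_l)."""
    out = np.empty(len(P_l) * M)
    for ell, P_ell in enumerate(P_l):
        sec = slice(ell * M, (ell + 1) * M)
        u = s[sec] * np.sqrt(n * P_ell) / tau_sq
        w = np.exp(u - u.max())                     # shift for numerical stability
        out[sec] = np.sqrt(n * P_ell) * w / w.sum()
    return out

def state_evolution(P_l, n, M, sigma2, num_mc=200, max_iters=50, rng=None):
    """Monte Carlo evaluation of (9)-(10); returns tau_0^2, ..., tau_{T_n}^2."""
    rng = rng or np.random.default_rng(0)
    P = P_l.sum()
    tau_sq = [sigma2 + P]
    for _ in range(max_iters):
        tau, x = np.sqrt(tau_sq[-1]), 0.0
        for P_ell in P_l:
            c = np.sqrt(n * P_ell) / tau
            expo = c * rng.standard_normal((num_mc, M))
            expo[:, 0] += c * c                     # the transmitted term in this section
            expo -= expo.max(axis=1, keepdims=True)
            w = np.exp(expo)
            x += (P_ell / P) * np.mean(w[:, 0] / w.sum(axis=1))
        nxt = sigma2 + P * (1.0 - x)
        if nxt >= tau_sq[-1]:                       # tau stops decreasing: step T_n
            break
        tau_sq.append(nxt)
    return tau_sq

def amp_decode(y, A, P_l, n, M, tau_sq):
    """AMP iterations (7)-(8) with precomputed tau_t^2, then the hard-decision step."""
    P, N = P_l.sum(), A.shape[1]
    beta, z = np.zeros(N), np.zeros_like(y)
    for t, ts in enumerate(tau_sq):
        onsager = 0.0 if t == 0 else (z / tau_sq[t - 1]) * (P - beta @ beta / n)
        z = y - A @ beta + onsager                  # residual step (7)
        beta = eta(beta + A.T @ z, P_l, n, M, ts)   # estimation step (8)
    beta_hat = np.zeros(N)
    for ell, P_ell in enumerate(P_l):               # keep the maximum in each section
        sec = beta[ell * M:(ell + 1) * M]
        beta_hat[ell * M + int(np.argmax(sec))] = np.sqrt(n * P_ell)
    return beta_hat
```

With the encoder sketch of Section 2 (and σ² = 1), a decoding run would look like `tau_sq = state_evolution(P_l, n, M, 1.0)` followed by `beta_hat = amp_decode(x + rng.normal(0, 1, n), A, P_l, n, M, tau_sq)`.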

3.1 The Test Statistic β^t + A^*z^t

To understand the decoder let us first focus on (8), in which β^{t+1} is generated from the test statistic

s^t := β^t + A^*z^t.   (12)

The AMP update step (8) is underpinned by the following key property of the test statistic: s^t is asymptotically (as n → ∞) distributed as β + τ̄_t Z, where τ̄_t is the limit of τ_t, and Z is an i.i.d. N(0,1) random vector independent of the message vector β. This property, which is proved in Section 5, is due to the presence of the "Onsager" term

\frac{z^{t-1}}{\tau_{t-1}^2}\left(P - \frac{\|\beta^t\|^2}{n}\right)

in the residue update step (7). The reader is referred to [7, Section I-C] for intuition about the role of the Onsager term in the standard AMP algorithm. In light of the above property, a natural way to generate β^{t+1} from s^t = s is

β^{t+1}(s) = E[β | β + τ_t Z = s],   (13)

i.e., β^{t+1} is the Bayes-optimal estimate of β given the observation s^t = β + τ_t Z. For i ∈ sec_ℓ, ℓ ∈ [L], we have

\beta_i^{t+1}(s) = E[\beta_i \mid \beta + \tau_t Z = s] = E[\beta_i \mid \{\beta_j + \tau_t Z_j = s_j\}_{j \in \mathrm{sec}_\ell}]
= \sqrt{nP_\ell}\, P(\beta_i = \sqrt{nP_\ell} \mid \{\beta_j + \tau_t Z_j = s_j\}_{j \in \mathrm{sec}_\ell})
= \sqrt{nP_\ell}\, \frac{f(\{\beta_j + \tau_t Z_j = s_j\}_{j \in \mathrm{sec}_\ell} \mid \beta_i = \sqrt{nP_\ell})\, P(\beta_i = \sqrt{nP_\ell})}{\sum_{k \in \mathrm{sec}_\ell} f(\{\beta_j + \tau_t Z_j = s_j\}_{j \in \mathrm{sec}_\ell} \mid \beta_k = \sqrt{nP_\ell})\, P(\beta_k = \sqrt{nP_\ell})},   (14)

where we have used Bayes' theorem, with f denoting the joint density function of {β_j + τ_t Z_j}_{j∈sec_ℓ}. Since β and Z are independent, with Z having i.i.d. N(0,1) entries, for each k ∈ sec_ℓ we have

f(\{\beta_j + \tau_t Z_j = s_j\}_{j \in \mathrm{sec}_\ell} \mid \beta_k = \sqrt{nP_\ell}) \propto e^{-(s_k - \sqrt{nP_\ell})^2/2\tau_t^2} \prod_{j \in \mathrm{sec}_\ell, j \neq k} e^{-s_j^2/2\tau_t^2}
= e^{s_k \sqrt{nP_\ell}/\tau_t^2}\, e^{-nP_\ell/2\tau_t^2} \prod_{j \in \mathrm{sec}_\ell} e^{-s_j^2/2\tau_t^2}.   (15)

Using (15) in (14), together with the fact that P(β_k = √(nP_ℓ)) = 1/M for each k ∈ sec_ℓ, we obtain

\beta_i^{t+1}(s) = E[\beta_i \mid \beta + \tau_t Z = s] = \sqrt{nP_\ell}\, \frac{\exp\left(s_i \sqrt{nP_\ell}/\tau_t^2\right)}{\sum_{j \in \mathrm{sec}_\ell} \exp\left(s_j \sqrt{nP_\ell}/\tau_t^2\right)},   (16)

which is the expression in (11). Thus, under the distributional assumption that s^t equals β + τ_t Z, β^{t+1} is the estimate of the message vector β (based on s^t) that minimizes the expected squared estimation error. Also, for i ∈ sec_ℓ, β_i^{t+1}/√(nP_ℓ) is the posterior probability of β_i being the non-zero entry in section ℓ, conditioned on the observation s^t = β + τ_t Z.

3.2 State Evolution and its Consequences

We now discuss the role of the quantity x_{t+1} in the state evolution equations (9) and (10).

Proposition 3.1. Under the assumption that s^t = β + τ_t Z, where Z is i.i.d. ∼ N(0,1) and independent of β, the quantity x_{t+1} defined in (10) satisfies

x_{t+1} = \frac{1}{nP} E[\beta^* \beta^{t+1}], \qquad \text{and} \qquad 1 - x_{t+1} = \frac{1}{nP} E[\|\beta - \beta^{t+1}\|^2].   (17)

Proof. For convenience of notation, we relabel the N i.i.d. random variables Z_k, k ∈ [N], as {U_j^ℓ}, j ∈ [M], ℓ ∈ [L]. Then

\frac{1}{nP} E[\beta^* \beta^{t+1}] = \frac{1}{nP} E[\beta^* \eta^t(\beta + \tau_t Z)] \overset{(a)}{=} \frac{1}{nP} \sum_{\ell=1}^{L} E[\sqrt{nP_\ell}\, \eta^t_{\mathrm{sent}(\ell)}(\beta_\ell + \tau_t Z_\ell)]

\overset{(b)}{=} \frac{1}{nP} \sum_{\ell=1}^{L} E\left[ \sqrt{nP_\ell} \cdot \sqrt{nP_\ell}\, \frac{\exp\left\{\frac{(\sqrt{nP_\ell} + \tau_t U_1^\ell)\sqrt{nP_\ell}}{\tau_t^2}\right\}}{\exp\left\{\frac{(\sqrt{nP_\ell} + \tau_t U_1^\ell)\sqrt{nP_\ell}}{\tau_t^2}\right\} + \sum_{j=2}^{M} \exp\left\{\frac{\tau_t U_j^\ell \sqrt{nP_\ell}}{\tau_t^2}\right\}} \right]

= \sum_{\ell=1}^{L} \frac{P_\ell}{P}\, E\left[ \frac{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\}}{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\} + \sum_{j=2}^{M} \exp\left\{U_j^\ell \frac{\sqrt{nP_\ell}}{\tau_t}\right\}} \right] = x_{t+1}.   (18)

In (a) above, the index of the non-zero term in section ℓ is denoted by sent(ℓ). (b) is obtained by assuming that sent(ℓ) is the first entry in section ℓ; this assumption is valid because the prior on β is uniform over B_{M,L}(P₁, ..., P_L).

Next, consider

\frac{1}{nP} E[\|\beta - \beta^{t+1}\|^2] = 1 + \frac{E[\|\beta^{t+1}\|^2] - 2E[\beta^* \beta^{t+1}]}{nP}.   (19)

Under the assumption that s^t = β + τ_t Z, recall from Section 3.1 that β^{t+1} can be expressed as β^{t+1} = E[β | s^t]. We therefore have

E[\|\beta^{t+1}\|^2] = E[\|E[\beta \mid s^t]\|^2] = E[(E[\beta \mid s^t] - \beta + \beta)^* E[\beta \mid s^t]] \overset{(a)}{=} E[\beta^* E[\beta \mid s^t]] = E[\beta^* \beta^{t+1}],   (20)

where step (a) follows because E[(E[β|s^t] − β)^* E[β|s^t]] = 0 due to the orthogonality principle. Substituting (20) in (19) and using (18) yields

\frac{1}{nP} E[\|\beta - \beta^{t+1}\|^2] = 1 - \frac{E[\beta^* \beta^{t+1}]}{nP} = 1 - x_{t+1}.

Hence x_{t+1} can be interpreted as the expectation of the (power-weighted) fraction of correctly decoded sections in step t+1. We emphasize that this interpretation is accurate only in the limit as n, M, L → ∞, when s^t is distributed as β + τ̄_t Z, with τ̄_t := lim τ_t. In Section 5 (Lemmas 1 and 2), we derive a closed-form expression for x̄_{t+1} := lim x_{t+1} under an exponentially decaying power allocation of the form P_ℓ ∝ 2^{−2Cℓ/L}. We show that for rates R < C,

\bar{x}_{t+1} = \frac{(1 + \mathrm{snr}) - (1 + \mathrm{snr})^{1 - \xi_t}}{\mathrm{snr}}, \qquad \bar{\tau}_{t+1}^2 = \sigma^2 + P(1 - \bar{x}_{t+1}), \qquad \text{for } t \geq 0,   (21)

where

\xi_0 = \frac{1}{2C}\log\left(\frac{C}{R}\right), \qquad \xi_{t+1} = \min\left\{\frac{1}{2C}\log\left(\frac{C}{R}\right) + \xi_t,\ 1\right\}.   (22)

A direct consequence of (21) and (22) is that x̄_t strictly increases with t until it reaches one, and the number of steps T^* until x̄_{T^*} = 1 is T^* = ⌈2C / log(C/R)⌉. The constants {ξ_t}_{t≥0} have a nice interpretation in the large system limit: at the end of step t+1, the first ξ_t fraction of sections in β^{t+1} will be correctly decodable with high probability, i.e., the true non-zero entry in these sections will have almost all the posterior probability mass. The other (1 − ξ_t) fraction of sections will not be correctly decodable from β^{t+1} as the power allocated to these sections is not large enough. An additional (1/2C) log(C/R) fraction of sections become correctly decodable in each step until T^*, when all the sections are correctly decodable with high probability. As x̄_t increases to 1, (21) implies that τ̄_t², the variance of the "noise" in the AMP test statistic, decreases monotonically from τ̄_0² = σ² + P down to τ̄²_{T^*} = σ². In other words, the initial observation y = Aβ + w is effectively transformed by the AMP decoder into a cleaner statistic s^{T^*} = β + w′, where w′ is Gaussian with the same variance as the measurement noise w.

To summarize, in the large system limit:

• The AMP decoder terminates within a finite number of steps for any fixed R < C.

• At the termination step T^*, lim (1/n) E‖β − β^{T^*}‖² equals zero.

For finite-sized dictionaries, the test statistic s^t will not be precisely distributed as β + τ_t Z. Nevertheless, computing x_{t+1} numerically via the state evolution equations (9) and (10) yields an estimate for the expected weighted fraction of correctly decoded sections after each step. Figure 2 shows the trajectory of x_t vs t for a SPARC with the parameters specified in the figure. The empirical average of (β_0^* β^t)/(nP) matches almost exactly with x_t. The theoretical limit x̄_t given in (21) is also shown in the figure.

Figure 2: Comparison of state evolution and AMP. The SPARC parameters are M = 512, L = 1024, snr = 15, R = 0.7C, P` ∝ 2−2C`/L . The average of the 200 trials (green curves) is the dashed red curve, which is almost indistinguishable from the state evolution prediction (black curve).
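The limiting recursion (21)-(22) is simple enough to evaluate directly. The following Python snippet (ours, illustrative) computes ξ_t, x̄, τ̄² and the number of steps T^* at which x̄ reaches one, e.g. for the parameters of Figure 2.

```python
import numpy as np

def limiting_state_evolution(R, snr, sigma2=1.0):
    """Evaluate the large-system recursion (21)-(22); return xi_t, x-bar, tau-bar^2, T*."""
    C = 0.5 * np.log2(1 + snr)
    P = snr * sigma2
    step = np.log2(C / R) / (2 * C)          # per-step increase of xi_t
    xi, xbar, tau_sq = [], [], []
    t = 0
    while True:
        xi_t = min((t + 1) * step, 1.0)      # xi_0 = step, xi_{t+1} = xi_t + step
        x_next = ((1 + snr) - (1 + snr) ** (1 - xi_t)) / snr   # x-bar_{t+1} from (21)
        xi.append(xi_t)
        xbar.append(x_next)
        tau_sq.append(sigma2 + P * (1 - x_next))
        t += 1
        if xi_t >= 1.0:
            break
    T_star = int(np.ceil(2 * C / np.log2(C / R)))   # number of steps until x-bar = 1
    return xi, xbar, tau_sq, T_star

# Example: R = 0.7C with snr = 15 (so C = 2 bits), as in Figure 2.
xi, xbar, tau_sq, T_star = limiting_state_evolution(0.7 * 2.0, 15.0)
print(T_star, xbar[-1])   # 8 steps; x-bar reaches 1 at the last step
```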

3.3 Derivation of the AMP

We describe a min-sum-like message passing algorithm for SPARC decoding from which the AMP decoder is obtained as a first-order approximation. The aim is to highlight the similarities and differences from the derivation of the AMP in [7]. The derivation here is not required for the analysis in the remainder of the paper.

Consider the factor graph for the model y = Aβ + w, where β ∈ B_{M,L}(P₁, ..., P_L). Each row of A corresponds to a constraint (factor) node, while each column corresponds to a variable node. We use the indices a, b to denote factor nodes, and indices i, j to denote variable nodes. The AMP updates in (7)-(8) are obtained via a first-order approximation to the following message passing algorithm that iteratively computes estimates of β from y. For i ∈ [N], a ∈ [n], set β_{j→a}^0 = 0, and compute the following for t ≥ 0:

z_{a \to i}^t = y_a - \sum_{j \in [N] \setminus i} A_{aj} \beta_{j \to a}^t,   (23)

\beta_{i \to a}^{t+1} = \eta_i^t(s_{i \to a}),   (24)

where η_i^t(·) is the estimation function defined in (11), and for i ∈ sec_ℓ, the entries of the test statistic s_{i→a} ∈ R^M are defined as

(s_{i \to a})_i = \sum_{b \in [n] \setminus a} A_{bi} z_{b \to i}^t, \qquad (s_{i \to a})_j = \sum_{b \in [n]} A_{bj} z_{b \to j}^t, \quad j \in \mathrm{sec}_\ell \setminus i.   (25)

It is useful to compare the β-update in (24) to the message passing algorithm from which the traditional AMP is derived (cf. equation (1.2) in [7]). In [7], the vector x to be recovered is assumed to be i.i.d. across entries; hence we have a single estimating function η^t in this case, which for i ∈ [N], a ∈ [n], generates the message

x_{i \to a}^{t+1} = \eta^t\left(\sum_{b \in [n] \setminus a} A_{bi} z_{b \to i}^t\right).   (26)

In (26), each outgoing message from the ith variable node depends only on its own incoming messages. In contrast, in (24), each outgoing message from a variable node depends on the incoming messages of all the other nodes in the same section. This is due to the constraint that β has exactly one non-zero entry in each section, which ensures that entries of β^t within each section are dependent, while entries in different sections are mutually independent. The derivation of the AMP updates in (7)-(8) starting from the message passing algorithm (23)-(24) is given in Appendix A.1.

4 Performance of the AMP Decoder

Before giving the main result, we state two lemmas that specify the limiting behaviour of the state evolution parameters defined in (9), (10). Treating x_{t+1} in (10) as a function of τ, we can define

x(\tau) := \sum_{\ell=1}^{L} \frac{P_\ell}{P}\, E\left[ \frac{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau}\right)\right\}}{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau}\right)\right\} + \sum_{j=2}^{M} \exp\left\{U_j^\ell \frac{\sqrt{nP_\ell}}{\tau}\right\}} \right],   (27)

where {U_j^ℓ} are i.i.d. ∼ N(0, 1) for j ∈ [M], ℓ ∈ [L].

Lemma 1. For t = 0, 1, ..., we have

\bar{x}(\tau) := \lim x(\tau) = \lim_{L \to \infty} \sum_{\ell=1}^{L} \frac{P_\ell}{P}\, 1\{c_\ell > 2(\ln 2) R \tau^2\},   (28)

where c_ℓ := lim_{L→∞} LP_ℓ.

Proof. In Appendix A.2.

We remark that the ln 2 term appears because R and C are measured in bits. The performance of the AMP decoder will be analyzed with the following exponentially decaying power allocation:

P_\ell = P \cdot \frac{2^{2C/L} - 1}{1 - 2^{-2C}} \cdot 2^{-2C\ell/L}, \qquad \ell \in [L].   (29)

For the power allocation in (29),

c_\ell = \lim_{L \to \infty} LP_\ell = 2(\ln 2)\, C (P + \sigma^2) \lim_{L \to \infty} \left(\frac{\sigma^2}{\sigma^2 + P}\right)^{\ell/L}.   (30)

Lemma 2. For the power allocation {P_ℓ} given in (29), we have for t = 0, 1, ...:

\bar{x}_t := \lim x_t = \frac{(1 + \mathrm{snr}) - (1 + \mathrm{snr})^{1 - \xi_{t-1}}}{\mathrm{snr}},   (31)

\bar{\tau}_t^2 := \lim \tau_t^2 = \sigma^2 + P(1 - \bar{x}_t) = \sigma^2 (1 + \mathrm{snr})^{1 - \xi_{t-1}},   (32)

where ξ_{−1} = 0, and for t ≥ 0,

\xi_t = \min\left\{\frac{1}{2C}\log\left(\frac{C}{R}\right) + \xi_{t-1},\ 1\right\}.   (33)

Proof. In Appendix A.3.

We observe from Lemma 2 that ξ_t increases in each step by (1/2C) log(C/R) until it equals 1. Also note that τ̄_t² strictly decreases with t until it reaches σ² (when ξ_t reaches 1), after which it remains constant. Thus the number of steps until ξ_t reaches one (i.e., τ̄_t² stops decreasing) equals

T^* = \left\lceil \frac{2C}{\log(C/R)} \right\rceil.   (34)

Recall from Section 3 that the termination step T_n is the smallest t for which τ_t² ≤ τ_{t+1}². Hence we have shown that in the large system limit, the number of steps until the AMP decoder terminates is lim T_n = T^*. We remark that since T_n and T^* are both integers, lim T_n = T^* implies that for sufficiently large n we will have T_n = T^*.

Our main result is proved for the following slightly modified AMP decoder, which runs for exactly T^* steps. Set β^0 = 0 and compute

z^t = y - A\beta^t + \frac{z^{t-1}}{\bar{\tau}_{t-1}^2}\left(P - \frac{\|\beta^t\|^2}{n}\right),   (35)

\beta_i^{t+1} = \eta_i^t(\beta^t + A^* z^t), \qquad \text{for } i \in [N],   (36)

where for i ∈ sec_ℓ, ℓ ∈ [L],

\eta_i^t(s) = \sqrt{nP_\ell}\, \frac{\exp\left(s_i \sqrt{nP_\ell}/\bar{\tau}_t^2\right)}{\sum_{j \in \mathrm{sec}_\ell} \exp\left(s_j \sqrt{nP_\ell}/\bar{\tau}_t^2\right)}.   (37)

The only difference from the earlier decoder described in (9)-(11) is that we now use the limiting value τ̄_t² defined in Lemma 2 instead of τ_t². The algorithm terminates after generating β^{T^*}, where T^* is defined in (34). The decoded codeword β̂ ∈ B_{M,L}(P₁, ..., P_L) is obtained by setting the maximum of β^{T^*} in each section ℓ to √(nP_ℓ) and the remaining entries to 0. The section error rate of a decoder for a SPARC S is defined as

E_{sec}(S) := \frac{1}{L} \sum_{\ell=1}^{L} 1\{\hat{\beta}_\ell \neq \beta_{0_\ell}\}.   (38)

Theorem 1. Fix any rate R < C, and b > 0. Consider a sequence of rate R SPARCs {S_n} indexed by block length n, with design matrix parameters L and M = L^b determined according to (6), and an exponentially decaying power allocation given by (29). Then the section error rate of the AMP decoder (described in (35)-(37), and run for T^* steps) converges to zero almost surely, i.e., for any ε > 0,

\lim_{n_0 \to \infty} P\left(E_{sec}(S_n) < \varepsilon,\ \forall n \geq n_0\right) = 1.   (39)


Remarks:

1. The probability measure in (39) is over the Gaussian design matrix A, the Gaussian channel noise w, and the message β distributed uniformly in B_{M,L}(P₁, ..., P_L).

2. As in [2], we can construct a concatenated code with an inner SPARC of rate R and an outer Reed-Solomon (RS) code of rate (1 − 2ε). If M is a prime power, an RS code defined over GF(M) defines a one-to-one mapping between a symbol of the RS codeword and a section of the SPARC. The concatenated code has rate R(1 − 2ε), and decoding complexity that is polynomial in n. The decoded message β̂ equals β whenever the section error rate of the SPARC is less than ε. Thus for any ε > 0, the theorem guarantees that the probability of message decoding error for a sequence of rate R(1 − 2ε) SPARC-RS concatenated codes will tend to zero, i.e., lim P(β̂ ≠ β) = 0.

The proof of Theorem 1 is given in Section 5.

4.1 Experimental Results and the Effect of Power Allocation

Fig. 3 shows the performance of the AMP at different rates for a SPARC with the parameters specified in the figure. The block length n is determined by the rate R according to (5), e.g., n = 7680 for R = 0.6C, and n = 5120 for R = 0.9C. The upper solid curve shows the average section error rate of the AMP (over 1000 runs) with an exponentially decaying power allocation with P_ℓ ∝ 2^{−2Cℓ/L}.

The upper dashed curve is the section error rate prediction obtained from state evolution as follows. Recall from Section 3.2 that x_{t+1} in (10) can be interpreted as the expectation of the (power-weighted) fraction of correctly decoded sections in step t+1. Using arguments similar to Proposition 3.1, we can show that under the assumption that the test statistic s^t ∼ β + τ_t Z, the (non-weighted) expectation of the fraction of correctly decoded sections after step (t+1) is given by

v_{t+1} := \frac{1}{L} \sum_{\ell=1}^{L} \frac{1}{nP_\ell} E[\beta_\ell^* \beta_\ell^{t+1}] = \frac{1}{L} \sum_{\ell=1}^{L} E\left[ \frac{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\}}{\exp\left\{\frac{\sqrt{nP_\ell}}{\tau_t}\left(U_1^\ell + \frac{\sqrt{nP_\ell}}{\tau_t}\right)\right\} + \sum_{j=2}^{M} \exp\left\{U_j^\ell \frac{\sqrt{nP_\ell}}{\tau_t}\right\}} \right].   (40)

In Fig. 3, the dashed curve on top is the state evolution prediction (1 − v_{T_n}) computed using (40), where T_n denotes the termination step. We see that the average section error rate agrees closely with this prediction, but it decays rather slowly with decreasing R.

The bottom solid curve shows the section error rate of the AMP with the following power allocation, characterized by two parameters a, f. For f ∈ [0, 1], let

P_\ell = \begin{cases} P \cdot \frac{2^{a2C/L} - 1}{1 - 2^{-a2C}} \cdot 2^{-a2C\ell/L}, & 1 \leq \ell \leq fL,\\ P \cdot \frac{2^{a2C(1-f)} - 1}{(1-f)L\,(2^{a2C} - 1)}, & fL + 1 \leq \ell \leq L. \end{cases}   (41)

For intuition, first assume that f = 1. Then (41) implies that P_ℓ ∝ 2^{−a2Cℓ/L} for ℓ ∈ [L]. Setting a = 1 recovers the original power allocation, and a = 0 allocates P/L to each section. Increasing a increases the power allotted to the initial sections, which makes them more likely to decode correctly, which in turn helps by decreasing the effective noise variance τ_t² in subsequent AMP iterations. However, if a is too large, the final sections may have too little power to decode correctly.

Figure 3: Section error rate of the AMP. The SPARC parameters are M = 512, L = 1024, snr = 15, C = 2

bits. The top solid curve shows the average section error rate of the AMP over 1000 trials with P` ∝ 2−2C`/L . The bottom solid curve shows the average section error rate with the power allocation in (41). In both cases, the dashed lines show the section error rate predicted by state evolution.

Hence we want the parameter a to be large enough to ensure that the AMP gets started on the right track, but not much larger. This intuition can be made precise in the large system limit using Lemma 1, which specifies the condition for a section ℓ to be correctly decoded in step (t+1): the limit of LP_ℓ must exceed a threshold proportional to Rτ_t². For rates close to C, we need a to be close to 1 for the initial sections to cross this threshold and get decoding started correctly. On the other hand, for rates such as R = 0.6C, a = 1 allocates more power than necessary to the initial sections, leading to poor decoding performance in the final sections. Thus the section error rate in the top curve of Fig. 3 can be improved by setting a to an appropriate value smaller than 1.

In addition, we found that further improvement can be obtained by flattening the power allocation in the final sections. For a given a, (41) has an exponential power allocation until section fL, and distributes the remaining power equally among the last (1−f)L sections. Flattening boosts the power given to the final sections compared to an exponentially decaying allocation. The two parameters (a, f) let us trade off between the conflicting objectives of assigning enough power to the initial sections and ensuring that the final sections have enough power to be decoded correctly.

The bottom solid curve in Fig. 3 shows the average section error rates with values of (a, f) obtained via a rough optimization around an initial guess of a = f = R/C. Again, these values are close to the state evolution prediction (1 − v_{T_n}) computed using (40). Across trials, we observed good concentration around the average section error rates. For example, at R = 0.75C, 674 of the 1000 trials had zero errors, and all but three trials had four or fewer section errors. Further, all the section errors were in the flat part of the power allocation, as expected.

It is evident that judicious power allocation can yield significant improvements in section error rates. An interesting open question is to find good rules of thumb for the power allocation as a function of rate and snr. For any given allocation, one can determine whether the section error rate goes to zero in the large system limit.


Indeed, using Lemma 1 with τ² = τ̄_0² = σ² + P, we see that those sections ℓ for which the indicator in (28) is positive are decoded in the first step; this also gives the value of x̄_1. Then with τ̄_1² = σ² + P(1 − x̄_1) we can determine which sections are decoded in step 2, and so on. The section error rate goes to zero if and only if x̄_T = 1, where T is the termination step in the limit. The proof of this is essentially identical to that of Theorem 1. Thus Lemma 1 gives a straightforward way to check whether a power allocation is good in the large system limit. This can provide some guidance for the finite-length case, but the challenge is to choose between several power allocations for which x̄_T = 1. One way to compare these allocations may be via the state evolution prediction v_{T_n} from (40), but this needs additional investigation.

The main implementation bottleneck for SPARCs with large dictionaries is memory rather than computation time. One way to reduce the memory footprint at the expense of increased computation time is to generate the A matrix procedurally during the operation, so that only one section (or column) needs to be stored at once. Such a set-up can then be parallelized in hardware quite effectively. Another way to address this issue and scale the decoder to large values of (n, M, L) is via structured dictionaries such as Hadamard matrices, as proposed recently in [22]. Investigating the performance-complexity trade-off of SPARCs with Hadamard design matrices is a promising direction for further research.
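As an illustration of this check, the following Python sketch (ours; it assumes a > 0 and uses LP_ℓ as a finite-L surrogate for the limit c_ℓ in Lemma 1) constructs the allocation (41) and iterates the thresholding rule of (28) to see whether x̄ reaches 1.

```python
import numpy as np

def modified_power_allocation(L, C, P, a, f):
    """Power allocation of eq. (41): exponential decay up to section fL, flat afterwards.

    Assumes a > 0 (a = 0 would make the exponential branch 0/0).
    """
    P_l = np.empty(L)
    fL = int(round(f * L))
    ell = np.arange(1, fL + 1)
    P_l[:fL] = (P * (2 ** (a * 2 * C / L) - 1) / (1 - 2 ** (-a * 2 * C))
                * 2 ** (-a * 2 * C * ell / L))
    if fL < L:
        P_l[fL:] = P * (2 ** (a * 2 * C * (1 - f)) - 1) / ((L - fL) * (2 ** (a * 2 * C) - 1))
    return P_l

def decodes_in_large_system_limit(P_l, R, sigma2=1.0, max_steps=100):
    """Iterate the rule of Lemma 1 / eq. (28) and report whether x-bar reaches 1."""
    L, P = len(P_l), P_l.sum()
    c = L * P_l                          # finite-L surrogate for c_l = lim L P_l
    tau_sq, x = sigma2 + P, 0.0
    for _ in range(max_steps):
        x_new = np.sum(P_l[c > 2 * np.log(2) * R * tau_sq]) / P
        if x_new <= x:                   # no additional sections decode
            break
        x, tau_sq = x_new, sigma2 + P * (1 - x_new)
    return x >= 1.0 - 1e-9

snr, C = 15.0, 2.0
P_l = modified_power_allocation(L=1024, C=C, P=snr, a=0.75, f=0.75)
print(decodes_in_large_system_limit(P_l, R=0.75 * C))
```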

5 Proof of Theorem 1

The main ingredient in the proof is a technical lemma (Lemma 3) which shows that the performance of the AMP decoder in the large system limit is accurately predicted by the state evolution equations (31) and (32). In particular, it is shown that the squared error (1/n)‖β^t − β‖² converges almost surely to P(1 − x̄_t), for 0 ≤ t ≤ T^*. Lemma 3 is similar to [7, Lemma 1], with several modifications to account for the differences between the two settings (e.g., the undersampling ratio n/N in our case goes to zero in the limit).

5.1 Definitions and Notation for Lemma 3

For consistency and ease of comparison, we use notation similar to [7]. Define the following column vectors recursively for t ≥ 0, starting with β^0 = 0 and z^0 = y:

h^{t+1} = \beta_0 - (A^* z^t + \beta^t), \qquad q^t = \beta^t - \beta_0, \qquad b^t = w - z^t, \qquad m^t = -z^t.   (42)

Recall that β₀ is the message vector chosen by the transmitter. Due to the symmetry of the code construction, we can assume that the non-zeros of β₀ are in the first entry of each section. Define S_{t₁,t₂} to be the sigma-algebra generated by b^0, ..., b^{t₁−1}, m^0, ..., m^{t₁−1}, h^1, ..., h^{t₂}, q^0, ..., q^{t₂}, and β₀, w. Lemma 3 recursively computes the conditional distributions b^t|S_{t,t} and h^{t+1}|S_{t+1,t}, as well as the limiting values of various inner products involving h^{t+1}, q^t, b^t, and m^t. A key ingredient in proving the lemma is the conditional distribution of the design matrix A given S_{t₁,t₂}. For t ≥ 1, let

\lambda_t = \frac{-1}{\bar{\tau}_{t-1}^2}\left(P - \frac{\|\beta^t\|^2}{n}\right).   (43)

We then have

b^t + \lambda_t m^{t-1} = Aq^t,   (44)

which follows from (7) and (42). We also have

h^{t+1} + q^t = A^* m^t.   (45)

From (44) and (45), we have the matrix equations

X_t = A^* M_t, \qquad Y_t = AQ_t,   (46)

where

X_t = [h^1 + q^0 \mid h^2 + q^1 \mid \ldots \mid h^t + q^{t-1}], \qquad M_t = [m^0 \mid \ldots \mid m^{t-1}],
Y_t = [b^0 \mid b^1 + \lambda_1 m^0 \mid \ldots \mid b^{t-1} + \lambda_{t-1} m^{t-2}], \qquad Q_t = [q^0 \mid \ldots \mid q^{t-1}].   (47)

The notation [c_1 | c_2 | ... | c_k] is used to denote a matrix with columns c_1, ..., c_k. Note that M_0 and Q_0 are the all-zero vector. We use the notation m^t_∥ and q^t_∥ to denote the projection of m^t and q^t onto the column space of M_t and Q_t, respectively. Let \vec{α}_t = (α_0, ..., α_{t−1}) and \vec{γ}_t = (γ_0, ..., γ_{t−1}) be the coefficient vectors of these projections, i.e.,

m^t_\parallel = \sum_{i=0}^{t-1} \alpha_i m^i, \qquad q^t_\parallel = \sum_{i=0}^{t-1} \gamma_i q^i.   (48)

The projections of m^t and q^t onto the orthogonal complements of M_t and Q_t, respectively, are denoted by

m^t_\perp = m^t - m^t_\parallel, \qquad q^t_\perp = q^t - q^t_\parallel.   (49)

Given two random vectors X, Y and a sigma-algebra S, X|_S \overset{d}{=} Y implies that the conditional distribution of X given S equals the distribution of Y. For random variables X, Y, the notation X \overset{a.s.}{=} Y means that X and Y are equal almost surely. We use the notation \vec{o}_t(n^{−δ}) to denote a vector in R^t such that each of its coordinates is o(n^{−δ}) (here t is fixed). The t × t identity matrix is denoted by I_{t×t}, and the t × s all-zero matrix is denoted by 0_{t×s}. The notation 'lim' is used to denote the large system limit as n, M, L → ∞; recall that the three quantities are related as L log M = nR, with M = L^b. We keep in mind that (given R and b) the block length n uniquely determines the dimensions of all the quantities in the system, including A, β₀, w, h^{t+1}, q^t, b^t, m^t. Thus we have a sequence indexed by n of each of these random quantities, associated with the sequence of SPARCs {S_n}. Finally, we recall the definition of pseudo-Lipschitz functions from [7].

Definition 5.1. A function φ : R^m → R is pseudo-Lipschitz of order k (denoted by φ ∈ PL(k)) if there exists a constant C > 0 such that for all x, y ∈ R^m,

|\varphi(x) - \varphi(y)| \leq C(1 + \|x\|^{k-1} + \|y\|^{k-1})\|x - y\|.   (50)

We will use the fact that when φ ∈ PL(k), there is a constant C′ such that for all x ∈ R^m,

|\varphi(x)| \leq C'(1 + \|x\|^k).   (51)

5.2 Main Lemma

In the lemma below, δ ∈ (0, 1/2) is a generic positive number whose exact value is not required. The value of δ in each statement of the lemma may be different. We will say that a sequence x_n converges to a constant c at rate n^{−δ} if lim_{n→∞} n^δ(x_n − c) = 0.

Lemma 3. The following statements hold for 0 ≤ t ≤ T^*, where T^* = ⌈2C / log(C/R)⌉.

(a)

h^{t+1}\big|_{S_{t+1,t}} \overset{d}{=} \sum_{i=1}^{t-1} \alpha_i h^{i+1} + \tilde{A}^* m^t_\perp + \tilde{Q}_{t+1}\, \vec{o}_{t+1}(n^{-\delta}),   (52)

b^t\big|_{S_{t,t}} \overset{d}{=} \sum_{i=1}^{t-1} \gamma_i b^i + \tilde{A} q^t_\perp + \tilde{M}_t\, \vec{o}_t(n^{-\delta}),   (53)

where \tilde{A} is an independent copy of A and the columns of the matrices \tilde{Q}_t and \tilde{M}_t form an orthogonal basis for the column space of Q_t and M_t, respectively, such that

\tilde{Q}_t^* \tilde{Q}_t = \tilde{M}_t^* \tilde{M}_t = nI_{t \times t}.   (54)

(b) i) Consider the following functions φ_h defined on R^M × R^M × R^M → R:

\varphi_h(h_\ell, \tilde{h}_\ell, \beta_\ell) = \begin{cases} (h_\ell)^* \tilde{h}_\ell / M, & \\ \|\eta^r(\beta_\ell - h_\ell)\|^2 / \log M, & 0 \leq r \leq t,\\ [\eta^r(\beta_\ell - h_\ell) - \beta_\ell]^* [\eta^s(\beta_\ell - \tilde{h}_\ell) - \beta_\ell] / \log M, & 0 \leq r \leq s \leq t,\\ h_\ell^* [\eta^r(\beta_\ell - h_\ell) - \beta_\ell] / \log M, & 0 \leq r \leq t. \end{cases}   (55)

For each function in (55) and arbitrary constants (a_0, ..., a_t, b_0, ..., b_t), we have

\lim n^{\delta} \left[ \frac{1}{L}\sum_{\ell=1}^{L} \varphi_h\!\left(\sum_{r=0}^{t} a_r h_\ell^{r+1},\ \sum_{s=0}^{t} b_s h_\ell^{s+1},\ \beta_{0_\ell}\right) - \lim \frac{1}{L}\sum_{\ell=1}^{L} E\left\{\varphi_h\!\left(\sum_{r=0}^{t} a_r \bar{\tau}_r Z_{r_\ell},\ \sum_{s=0}^{t} b_s \bar{\tau}_s Z_{s_\ell},\ \beta_\ell\right)\right\} \right] \overset{a.s.}{=} 0,   (56)

where τ̄_r is defined in Lemma 2 and Z_0, ..., Z_t are length-N Gaussian random vectors independent of β, with Z_{r_ℓ} denoting the ℓth section of Z_r. For 0 ≤ s ≤ t, {Z_{s,j}}_{j∈[N]} are i.i.d. ∼ N(0,1), and for each i ∈ [N], (Z_{0,i}, ..., Z_{t,i}) are jointly Gaussian. The inner limit in (56) exists and is finite for each φ_h in (55).

ii) For all pseudo-Lipschitz functions φ_b : R^{t+2} → R of order two, we have

\lim n^{\delta} \left[ \frac{1}{n}\sum_{i=1}^{n} \varphi_b(b_i^0, ..., b_i^t, w_i) - E\{\varphi_b(\bar{\sigma}_0 \hat{Z}_0, ..., \bar{\sigma}_t \hat{Z}_t, \sigma Z_w)\} \right] \overset{a.s.}{=} 0,   (57)

where for s ≥ 0,

\bar{\sigma}_s^2 := \bar{\tau}_s^2 - \sigma^2 = P(1 - \bar{x}_s),   (58)

with x̄_s defined in Lemma 2. The random variables (Ẑ_0, ..., Ẑ_t) are jointly Gaussian with Ẑ_s ∼ N(0, 1) for 0 ≤ s ≤ t. Further, (Ẑ_0, ..., Ẑ_t) are independent of Z_w ∼ N(0, 1).

(c) For all 0 ≤ r ≤ s ≤ t,

\lim \frac{(h^{r+1})^* h^{s+1}}{N} = \lim \frac{(m^r)^* m^s}{n} \overset{a.s.}{=} E[(\bar{\sigma}_r \hat{Z}_r - \sigma Z_w)(\bar{\sigma}_s \hat{Z}_s - \sigma Z_w)],   (59)

\lim \frac{(b^r)^* b^s}{n} = \lim \frac{(q^r)^* q^s}{n} \overset{a.s.}{=} \bar{\sigma}_s^2,   (60)

where the random variables Ẑ_r, Ẑ_s, Z_w are those in (57), and σ̄_s is defined in (58). The convergence rate in both (59) and (60) is n^{−δ}.

(d) For all 0 ≤ r ≤ s ≤ t,

\lim \frac{(h^{r+1})^* q^{s+1}}{n} \overset{a.s.}{=} \lim \lambda_{s+1} \frac{(m^r)^* m^s}{n} = \frac{-\bar{\sigma}_{s+1}^2}{\sigma^2 + \bar{\sigma}_s^2}\, E[(\bar{\sigma}_r \hat{Z}_r - \sigma Z_w)(\bar{\sigma}_s \hat{Z}_s - \sigma Z_w)],   (61)

\lim \frac{(b^r)^* m^s}{n} \overset{a.s.}{=} \lim \frac{(b^r)^* b^s}{n} = \bar{\sigma}_s^2.   (62)

The convergence rate in both (61) and (62) is n^{−δ}.

(e) For all 0 ≤ r ≤ t,

\lim \frac{(h^{r+1})^* q^0}{n} \overset{a.s.}{=} 0.   (63)

(f) The following hold almost surely:

\lim \frac{\|q^0_\perp\|^2}{n} = \bar{\sigma}_0^2 = P, \qquad \lim \frac{\|m^0_\perp\|^2}{n} = \bar{\tau}_0^2 = \sigma^2 + P,

\lim \frac{\|q^r_\perp\|^2}{n} = \bar{\sigma}_r^2 \left(1 - \frac{\bar{\sigma}_r^2}{\bar{\sigma}_{r-1}^2}\right) \quad \text{for } 1 \leq r \leq t,   (64)

\lim \frac{\|m^s_\perp\|^2}{n} = \bar{\tau}_s^2 - u^* C^{-1} u, \quad \text{for } 1 \leq s \leq t-1,   (65)

where for 1 ≤ i, j ≤ s,

u_i = E\left[(\bar{\sigma}_s \hat{Z}_s - \sigma Z_w)(\bar{\sigma}_{i-1} \hat{Z}_{i-1} - \sigma Z_w)\right], \qquad C_{ij} = E\left[(\bar{\sigma}_{i-1} \hat{Z}_{i-1} - \sigma Z_w)(\bar{\sigma}_{j-1} \hat{Z}_{j-1} - \sigma Z_w)\right].

The limits in (64) and (65) are strictly positive for r, s < T^*.

The lemma is proved in Section 5.5. The main difference from [7, Lemma 1] is in part (b).i of the lemma, which is a key ingredient in proving Theorem 1. The functions involving η in (55) all act section-wise when applied to vectors in R^N, in contrast to the component-wise functions considered in [7] (and in part (b).ii above). To prove (56) for the section-wise functions as the section size M → ∞, we need that the limits in the other parts of the lemma (particularly in (52) and (53)) have convergence rates of n^{−δ} for some δ > 0. Minimum rates of convergence were not needed for [7, Lemma 1].

5.3 Proof of Theorem 1

From the definition in (38), the event that the section error rate is larger than ε can be written as

\{E_{sec}(S_n) > \varepsilon\} = \left\{ \sum_{\ell=1}^{L} 1\{\hat{\beta}_\ell \neq \beta_{0_\ell}\} > \varepsilon L \right\}.   (66)


When a section ` is decoded in error, the correct non-zero entry has no more than half the total mass of section ` at the termination step T ∗ . That is, ∗

T βsent(`) ≤

1p nP` 2

(67)

where sent(`) is the index of the non-zero entry in section ` of the true message β0 . Since β0sent (`) = √ nP` , we have nP` ∗ 1{βˆ` 6= β0` } ⇒ kβ`T − β0` k2 ≥ , ` ∈ [L]. (68) 4 Hence when (66) holds, we have kβ

T∗

2

− β0 k =

L X `=1

∗ kβ`T

2

(a)

− β0` k ≥

L X `=1

1{βˆ` 6= β0` }

nPL (c) n  σ 2 ln(1 + snr) nP` (b) ≥ L ≥ , 4 4 4

(69)

where (a) follows from (68); (b) is obtained using (66), and the fact that P_ℓ > P_L for ℓ ∈ [L−1] for the exponentially decaying power allocation in (29); (c) is obtained using the first-order Taylor series lower bound LP_L ≥ σ² ln(1 + P/σ²). We therefore conclude that {E_{sec}(S_n) > ε} ⇒



kβ T − β0 k2  σ 2 ln(1 + snr) ≥ n 4

 .

(70)

Now, from (60) of Lemma 3(c), we know that ∗



kβ T − β0 k2 kq T k2 a.s. (a) lim = lim = P (1 − x ¯T ∗ ) = 0, n n where (a) follows from Lemma 2, which implies that ξT ∗ −1 = 1 for T ∗ =

(71) l

2C log(C/R)

m , and hence



x ¯T ∗ = 1. Thus we have shown in (71) that  lim P

n0 →∞

kβ T −β0 k2 n

converges almost surely to zero, i.e.,



kβ T − β0 k2 < , ∀n ≥ n0 n

for any e > 0. From (70), this implies that for 0 =

 =1

(72)

4 , σ 2 ln(1+snr)

 lim P Esec (Sn ) ≤ 0 , ∀n ≥ n0 = 1.

n0 →∞

(73)

5.4 Useful Probability and Linear Algebra Results We now list some results that will be used in the proof of Lemma 3. Most of these can be found in [7, Section III.G], but we summarize them here for completeness. Fact 1. Let u ∈ RN and v ∈ Rn be deterministic vectors such that limn→∞ kuk2 /n and limn→∞ kvk2 /n both exist and are finite. Let A˜ ∈ Rn×N be a matrix with independent N (0, 1/n) entries. Then: (a) d kuk d kvk ˜ = √ Zu and A˜∗ v = √ Zv , Au (74) n n

18

where Zu ∈ Rn and Zv ∈ RN are Gaussian random vectors distributed as N (0, In×n ) and N (0, IN ×N ), respectively. Consequently, n 2 ˜ 2 a.s. kuk2 X Zu,i kuk2 kAuk a.s. = lim = lim n→∞ n n→∞ n n→∞ n n

(75)

N 2 kvk2 X Zv,j kvk2 kA˜∗ vk2 a.s. a.s. = lim = lim n→∞ n n→∞ n n→∞ N N

(76)

lim

i=1

lim

j=1

(b) Let W be a d-dimensional subspace of Rn for d ≤ n. Let (w1 , ..., wd ) be an orthogonal basis of W with kwi k2 = n for i ∈ [d], and let PW denote the orthogonal projection operator onto W. d kuk ˜ = √ Dx where x ∈ Rd is a random vector with i.i.d. Then for D = [w1 | . . . | wd ], we have PW Au n

N (0, 1/n) entries. Therefore limn→∞ with d fixed.)

a.s. n−δ kxk =

0 for any constant δ ∈ [0, 0.5). (The limit is taken

Fact 2 (Strong Law for Triangular Arrays). Let {Xn,i : i ∈ [n], n ≥ 1} be a triangular array of random variables such that for each n (Xn,1 , . . . , Xn,n ) are mutually independent, have zero mean, and satisfy n 1X E|Xn,i |2+κ ≤ cnκ/2 for some κ ∈ (0, 1) and c < ∞. (77) n i=1 P Then n1 ni=1 Xn,i → 0 almost surely as n → ∞. Fact 3. Let v ∈ Rn be a random vector with i.i.d. entries ∼ pV where the measure pV has bounded second moment. Then for any function ψ that is pseudo-Lipschitz of order two: n

1X a.s. lim ψ(vi ) = EpV [ψ(V )] n→∞ n

(78)

i=1

with convergence rate n−δ , for some δ ∈ (0, 1/4). Fact 4. Let Z1 , . . . , Zt be jointly Gaussian random variables with zero mean and an invertible covariance matrix C. Then Var(Zt | Z1 , . . . , Zt−1 ) = E[Zt2 ] − u∗ C −1 u, where for i ∈ [t − 1], ui = E[Zt Zi ]. Fact 5. Let Z1 , . . . , Zt be jointly Gaussian random variables such that for all i ∈ [t], E[Zi2 ] ≤ K

and

Var(Zi | Z1 , . . . , Zi−1 ) ≥ ci ,

for some strictly positive constants c1 , . . . , ct . Let Y be a random variable defined on the same probability space, and let g : R2 → R be a Lipschitz function with z → g(z, Y ) non-constant with positive probability. Then there exists a positive constant c0t such that E[(g(Zt , Y ))2 ] − u∗ C −1 u > c0t , where u ∈ Rt−1 and C ∈ R(t−1)×(t−1) are given by ui = E[g(Zt , Y )g(Zi , Y )], Cij = E[g(Zi , Y )g(Zj , Y )], i, j ∈ [t − 1]. (The constant c0t depends only on the K, the random variable Y and the function g.) 19

Fact 6 (Stein’s lemma). For zero-mean jointly Gaussian random variables Z1 , Z2 , and any function f : R → R for which E[Z1 f (Z2 )] and E[f 0 (Z2 )] both exist, we have E[Z1 f (Z2 )] = E[Z1 Z2 ]E[f 0 (Z2 )]. Fact 7. Let v1 , . . . , vt be a sequence of vectors in Rn such that for i ∈ [t] 1 kvi − Pi−1 (vi )k2 ≥ c, n where c is a positive constant and Pi−1 is the orthogonal projection onto the span of v1 , . . . , vi−1 .Then the matrix C ∈ Rt×t with Cij = vi∗ vj /n has minimum eigenvalue λmin ≥ c0 , where c0 is a strictly positive constant (depending only on c and t). Fact 8. Let {Sn }n≥1 be a sequence of t × t matrices such that limn→∞ Sn = S∞ where the limit is element-wise. Then if lim inf n→∞ λmin (Sn ) ≥ c for a positive constant c, then λmin (S∞ ) ≥ c.

5.5 Proof of Lemma 3 A key ingredient in the proof is the distribution of A conditioned on the sigma algebra St1 ,t where t1 is either t + 1 or t. Observing that conditioning on St1 ,t is equivalent to conditioning on the linear constraints AQt1 = Yt1 , A∗ Mt = Xt , the following lemma from [7] specifies the conditional distribution A|St1 ,t .

2

Lemma 4. [7, Lemma 10] For t1 = t + 1 or t, the conditional distribution of the random matrix A given St1 ,t satisfies d ˜ ⊥ (79) A|St1 ,t = Et1 ,t + P⊥ Mt APQt . 1

d

∗ −1 ∗ Here A˜ = A is random matrix independent of St1 ,t , and P⊥ Mt = I−PMt where PMt = Mt (Mt Mt ) Mt is the orthogonal projection matrix onto the column space of Mt ; similarly, P⊥ Qt1 = I − PQt1 , where ∗ −1 ∗ PQt1 = Qt1 (Qt1 Qt1 ) Qt1 . The matrix Et1 ,t = E[A|St1 ,t ] is given by ∗ Et1 ,t = E[APQt1 + PMt AP⊥ Qt | AQt1 = Yt1 , A Mt = Xt ] 1

=

Yt1 (Q∗t1 Qt1 )−1 Q∗t1

+ Mt (Mt∗ Mt )−1 Xt∗ − Mt (Mt∗ Mt )−1 Mt∗ Yt1 (Q∗t1 Qt1 )−1 Q∗t1 .

(80)

Lemma 5. [7, Lemma 12] For the matrix Et1 ,t defined in Lemma 4, the following hold: ∗ ∗ Et+1,t mt = Xt (Mt∗ Mt )−1 Mt∗ mtk + Qt+1 (Q∗t+1 Qt+1 )−1 Yt+1 m⊥ t ,

Et,t q t = Yt (Q∗t Qt )−1 Q∗t qkt + Mt (Mt∗ Mt )−1 Xt∗ qt⊥ ,

(81) (82)

t ⊥ where mtk , m⊥ t , qk , qt are defined in (48) and (49).

We mention that Lemmas 4 and 5 can be applied only when Mt∗ Mt and Q∗t1 Qt1 are invertible. We are now ready to prove Lemma 3. The proof proceeds by induction on t. We label as Ht+1 the results (52), (56), (59), (61), (63), (64) and similarly as B t the results (53), (57), (60), (62), (65). The proof consists of four steps: 2

While conditioning on the linear constraints, we emphasize that only A is treated as random.

20

1. B0 holds. 2. H1 holds. 3. If Br , Hs holds for all r < t and s ≤ t, then Bt holds. 4. if Br , Hs holds for all r ≤ t and s ≤ t, then Ht+1 holds. 5.5.1

Step 1: Showing B0 holds

We wish to show that (53), (57), (60), (62), and (65) hold when t = 0. (a) The sigma-algebra S0,0 is generated by q 0 = −β0 and w. Both M0 and Q0 are empty ˜ 0 is an empty matrix and q 0 = q 0 . The result follows by noting that matrices, and therefore M ⊥ 0 b = −Aβ0 = Aq0 , from the definitions in (42). (b) We will first use Fact 2 to show that " n # n X  a.s. 1X δ 1 0 0 lim n φb (bi , wi ) − EA φb (bi , wi ) = 0. (83) n n i=1

i=1

To apply Fact 2, we need to verify that n  1X E|nδ φb (b0i , wi ) − nδ EA φb (b0i , wi ) |2+κ ≤ cnκ/2 . n

(84)

i=1

for some 0 ≤ κ ≤ 1 and c some constant. Using b0 = Aq 0 ,   ˜ 0 ]i , wi ) − EA φb ([Aq 0 ]i , wi ) |2+κ E|φb (b0i , wi ) − EA φb (b0i , wi ) |2+κ = EA˜ |φb ([Aq 2+κ (a) 0 0 ˜ φ ([ Aq ] , w ) − φ ([Aq ] , w ) ≤ EA,A ˜ b i i i i b   2+κ  (b) 0 0 0 0 2+κ ˜ ˜ 1 + |[ Aq ] | + |w | + |[Aq ] | ≤ c0 EA,A |[ Aq ] − [Aq ] | ˜ i i i i i oi o n  h n 2+κ 0 0 2+κ 0 2+κ 0 2+κ 0 0 2+κ ˜ ˜ ˜ + |w | E |[ Aq ] − [Aq ] | 1 + |[ Aq ] | + |[Aq ] | ≤ c0 EA,A |[ Aq ] − [Aq ] | ˜ ˜ i i i i i i i A,A (c)

≤ c1 + c2 |wi |2+κ ,

(85)

where c0 , c0 , c1 , c2 are positive constants. In the chain above, (a) uses Jensen’s inequality, (b) d √ holds because φb ∈ P L(2), and (c) is obtained using the fact that [Aq0 ]i = −[Aβ0 ]i = P Z, and d √ ˜ ˜ 0 ]i = [Aq P Z, where Z, Z˜ are i.i.d. N (0, 1). Using (85) in (84), we obtain n n  1X nδ(κ+2) X E|nδ φb (b0i , wi ) − nδ EA φb (b0i , wi ) |2+κ ≤ (c1 + c2 |wi |2+κ ) ≤ cnκ/2 , n n i=1

(86)

i=1

κ/2 2 κ+2 since the wi ’s are i.i.d. N (0, σ ). Thus (83) holds. √ d Since b0 = Aq 0 = P Z, where Z ∈ Rn is i.i.d. ∼ N (0, 1), we

for δ
0. Choosing δ ∈ (0, δ 0 ) ensures that we can drop the Q 23

(101)

˜ ` [A] ˜ = a0 [A˜∗ m0 ]` and h ˜ = b0 [A˜∗ m0 ]` , making explicit In what follows, we use the notation h` [A] ˜ We will appeal to Fact 2 to show that the dependence on A. " # L L  1X n  o  X 1 a.s. ˜ ` [A], ˜ ` [A], ˜ h ˜ β0 − ˜ h ˜ β0 lim nδ φh h` [A], EA˜ φh h` [A], = 0 (102) ` ` L L `=1

`=1

To invoke Fact 2 (conditionally on S1,0 ), we need to verify that L   n  o 2+κ 1X ˜ ` [A], ˜ ` [A], ˆ h ˆ β0 − nδ E ˜ φh h` [A], ˜ h ˜ β0 EAˆ nδ φh h` [A], ≤ cLκ/2 ` ` A L

(103)

`=1

ˆ A˜ are i.i.d. copies of A. From Jensen’s for some 0 ≤ κ ≤ 1 and some constant c. In (103), A, inequality, we have   n  o 2+κ ˜ ` [A], ˜ ` [A], ˆ h ˆ β0 − E ˜ φh h` [A], ˜ h ˜ β0 EAˆ φh h` [A], ` ` A    2+κ  ˜ ˜ ˜ ˆ ˜ ˆ , ≤ EA, ˆA ˜ φh h` [A], h` [A], β0` − φh h` [A], h` [A], β0` and in [25], it is shown that for each function in (55),    2+κ  a.s. ˜ ˜ ˜ ˜ ˆ ˆ = O((log M )2+κ ), EA, ˆA ˜ φh h` [A], h` [A], β0` − φh h` [A], h` [A], β0`

` ∈ [L].

(104)

The bound in (104) implies (103) holds if δ(2 + κ) is chosen to be smaller than κ/2. (Recall that L = Θ(n/ log n)). We have thus shown (102). √ d Recall that for each ` ∈ [L], we have [A˜∗ m0 ]` = (km0 k/ n)Z0 where Z0 ∼ N (0, IM ×M ). `

`

0k 0k d d ˜ ` [A] ˜ = ˜ = √ Z0 , and h √ Z0 . We will next show that Therefore, in (102), h` [A] a0 km b0 km ` ` n n

" lim n

δ

#   L 0k 0k a.s. km 1X km EZ0 φh a0 √ Z0` , b0 √ Z0` , β0` − φh (a0 τ¯0 Z0` , b0 τ¯0 Z0` , β0` ) = 0. (105) L n n `=1

  0k km0 k ˜ ` and ∆ ˜ ` similarly with b0 √ Z0 and ∆` = a0 τ √ ¯ − Z0` . Define h Let us redefine h` = a0 km 0 ` n n replacing a0 . Then (105) can be written as # " L     X δ 1 ˜ ` , β0 − φh h` + ∆` , h ˜` + ∆ ˜ ` , β0 a.s. = 0. (106) lim n EZ0 φh h` , h ` ` L `=1

Note that from H1 (c) and the fact that Z0` ∼ N (0, IM ×M ), p km0 k a.s. max |h`j | = |a0 | √ max |Z0`j | = Θ( log M ), n j∈sec(`) j∈sec(`) km0 k 0p a.s. max |∆`j | = |a0 | τ¯0 − √ max |Z0`j | = Θ(n−δ log M ) n j∈sec(`) j∈sec(`)

(107)

for some δ 0 > 0. The almost-sure equality in each line of (107) holds for sufficiently large M . (This can be shown using the standard normal distribution of Z0 and the Borel-Cantelli lemma). 24

√ ˜ ` | = Θ(√log M ) and maxj∈sec(`) |∆ ˜ ` | = Θ(n−δ0 log M ). Using (107), it is Similarly maxj∈sec(`) |h j j shown in [25] that     0 ˜ ` , β0 − φh h` + ∆` , h ˜` + ∆ ˜ ` , β0 a.s. = o(n−δ log M ) (108) φ h h` , h ` ` for some δ 0 > 0. Thus (105) holds for δ < δ 0 . To complete the proof, we need to show that " # L L X 1X a.s. δ 1 lim n EZ0 [φh (a0 τ¯0 Z0` , b0 τ¯0 Z0` , β0` )] − E(Z0 ,β) [φh (a0 τ¯0 Z0` , b0 τ¯0 Z0` , β` )] = 0 L L `=1 `=1 (109) But (109) holds because the uniform distribution of the non-zero entry in β` over the M possible locations and the i.i.d. distribution of Z0 together ensure that for all β0 ∈ BM,L , we have EZ0 [φh (a0 τ¯0 Z0` , b0 τ¯0 Z0` , β0` )] = E(Z0 ,β) [φh (a0 τ¯0 Z0` , b0 τ¯0 Z0` , β` )] ,

∀` ∈ [L].

(d) By definition $q^1 = \eta^0(\beta_0 - h^1) - \beta_0$, and hence $\frac{(h^1)^* q^1}{n} = \frac{1}{n}\sum_{\ell=1}^{L} \phi_h(h^1_\ell, \beta_{0_\ell})$, where the function $\phi_h : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ is $\phi_h(h^1_\ell, \beta_{0_\ell}) := (h^1_\ell)^*[\eta^0_\ell(\beta_0 - h^1) - \beta_{0_\ell}]$. Applying H1(b) to $\phi_h$ yields
\[
\lim_n n^{\delta}\Bigg[ \frac{1}{n}\sum_{\ell=1}^L \phi_h(h^1_\ell, \beta_{0_\ell}) - \lim_n \frac{1}{n}\sum_{\ell=1}^L \mathbb{E}\big\{\bar{\tau}_0 Z_{0_\ell}^* [\eta^0_\ell(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_\ell}]\big\} \Bigg] \overset{a.s.}{=} 0. \qquad (110)
\]
Consider a single term in the expectation in (110), say ℓ = 1. We have
\[
\mathbb{E}\big\{\bar{\tau}_0 Z_{0_{(1)}}^* [\eta^0_{(1)}(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_{(1)}}]\big\} = \bar{\tau}_0 \sum_{i=1}^{M} \mathbb{E}\big\{Z_{0_i}[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}]\big\} \qquad (111)
\]
where $\beta_{0_{(1)}} = (\beta_{01}, \beta_{02}, \ldots, \beta_{0M})$ and $Z_{0_{(1)}} = (Z_{01}, Z_{02}, \ldots, Z_{0M})$. Note that for each i, the function $\eta^0_i(\cdot)$ depends on all the M indices in the section containing i. For each $i \in [M]$, we evaluate the expectation on the RHS of (111) using the law of iterated expectations:
\[
\mathbb{E}\{Z_{0_i}[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}]\} = \mathbb{E}\Big[\mathbb{E}\big\{Z_{0_i}[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}] \,\big|\, \beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\big\}\Big] \qquad (112)
\]
where the inner expectation is over $Z_{0_i}$ conditioned on $\{\beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\}$. Since $Z_{0_i}$ is independent of $\{\beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\}$, the latter just act as constants in the inner expectation, which is over $Z_{0_i} \sim \mathcal{N}(0,1)$. Applying Stein's lemma (Fact 6) to the inner expectation, we obtain
\[
\begin{aligned}
\mathbb{E}\Big[\mathbb{E}\big\{Z_{0_i}[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}] \mid \beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\big\}\Big]
&= \mathbb{E}\Big[\mathbb{E}\Big\{\tfrac{\partial}{\partial Z_{0_i}}\big[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}\big] \,\Big|\, \beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\Big\}\Big] \\
&\overset{(a)}{=} -\frac{\bar{\tau}_0}{\bar{\tau}_0^2}\, \mathbb{E}\Big[\mathbb{E}\big\{\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0)\big(\sqrt{nP_1} - \eta^0_i(\beta_0 - \bar{\tau}_0 Z_0)\big) \,\big|\, \beta_{0_{(1)}}, Z_{0_{(1)\setminus i}}\big\}\Big] \\
&\overset{(b)}{=} -\frac{1}{\bar{\tau}_0}\, \mathbb{E}\big[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0)\big(\sqrt{nP_1} - \eta^0_i(\beta_0 - \bar{\tau}_0 Z_0)\big)\big]
\end{aligned} \qquad (113)
\]
where (a) holds because the definition of $\eta^t_i$ in (11) implies that
\[
\frac{\partial \eta^t_i(s)}{\partial s_i} = \frac{\eta^t_i(s)}{\bar{\tau}_t^2}\big(\sqrt{nP_\ell} - \eta^t_i(s)\big) \quad \text{for } i \in \text{section } \ell,
\]
and (b) follows from the law of iterated expectation. Using (113) in (112) and (111), we have
\[
\mathbb{E}\big\{\bar{\tau}_0 Z_{0_{(1)}}^*[\eta^0_{(1)}(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_{(1)}}]\big\} = \sum_{i=1}^{M} \mathbb{E}\Big[\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0)\Big(\eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \sqrt{nP_1}\Big)\Big]. \qquad (114)
\]
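For the reader's convenience, we record the form of Stein's lemma (Fact 6) used in (113); the following is the standard Gaussian integration-by-parts identity, stated here only as a reminder:
\[
Z \sim \mathcal{N}(0,1), \quad f \text{ absolutely continuous with } \mathbb{E}|f'(Z)| < \infty \quad \Longrightarrow \quad \mathbb{E}[Z f(Z)] = \mathbb{E}[f'(Z)].
\]
In (113) it is applied with $f(Z_{0_i}) = \eta^0_i(\beta_0 - \bar{\tau}_0 Z_0) - \beta_{0_i}$, viewed as a function of $Z_{0_i}$ alone, so that $f'(Z_{0_i}) = -\bar{\tau}_0\, \partial \eta^0_i(s)/\partial s_i$ evaluated at $s = \beta_0 - \bar{\tau}_0 Z_0$, which together with the derivative formula above gives step (a).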

The argument above can be repeated for each section ℓ ∈ [L] to obtain a relation analogous to (114). Using this for the expectation in (110), we obtain
\[
\lim_n n^{\delta}\Bigg[ \frac{1}{n}\sum_{\ell=1}^{L}\phi_h(h^1_\ell, \beta_{0_\ell}) - \lim_n \bigg( \frac{\mathbb{E}\,\|\eta^0(\beta_0 - \bar{\tau}_0 Z_0)\|^2}{n} - P \bigg) \Bigg] \overset{a.s.}{=} 0. \qquad (115)
\]
It is shown in Appendix A.4 that $\lim \big(P - \mathbb{E}\{\|\eta^0(\beta_0 - \bar{\tau}_0 Z_0)\|^2\}/n\big) = \bar{\sigma}_1^2$. Therefore (115) becomes
\[
\lim_n n^{\delta}\Bigg[ \frac{1}{n}\sum_{\ell=1}^{L} (h^1_\ell)^*[\eta^0_\ell(\beta_0 - h^1) - \beta_{0_\ell}] + \bar{\sigma}_1^2 \Bigg] \overset{a.s.}{=} 0, \qquad (116)
\]
where we have used $\phi_h(h^1_\ell, \beta_{0_\ell}) = (h^1_\ell)^*[\eta^0_\ell(\beta_0 - h^1) - \beta_{0_\ell}]$.

To complete the proof, recall from H1(c) that $\|m^0\|^2/n \overset{a.s.}{\to} \sigma^2 + \bar{\sigma}_0^2$ at rate $n^{-\delta}$. Further, from (43), we observe that
\[
\lambda_1 = \frac{1}{\bar{\tau}_0^2}\bigg( \frac{\|\beta^1\|^2}{n} - P \bigg) \overset{a.s.}{\to} \lim \frac{1}{\bar{\tau}_0^2}\bigg( \frac{\mathbb{E}\,\|\eta^0(\beta_0 - \bar{\tau}_0 Z_0)\|^2}{n} - P \bigg) = \frac{-\bar{\sigma}_1^2}{\bar{\tau}_0^2} = \frac{-\bar{\sigma}_1^2}{\sigma^2 + \bar{\sigma}_0^2}, \qquad (117)
\]
where the convergence at rate $n^{-\delta}$ follows from H1(b) applied to the function $\|\eta^0(\beta - h^1)\|^2/n = \|\beta^1\|^2/n$.

(e) We use H1(a) to represent
\[
h^1\big|_{S_{1,0}} \overset{d}{=} \tilde{A}^* m^0_{\perp} + \tilde{Q}_1\, o(n^{-\delta}) = \tilde{A}^* m^0 + \frac{q^0}{\sqrt{P}}\, o(n^{-\delta}). \qquad (118)
\]
Therefore
\[
\frac{(q^0)^* h^1}{n}\bigg|_{S_{1,0}} \overset{d}{=} \frac{(q^0)^* \tilde{A}^* m^0}{n} + \frac{\|q^0\|^2}{n\sqrt{P}}\, o(n^{-\delta}) \overset{d}{=} \sqrt{P}\, \frac{\|m^0\|}{\sqrt{n}}\, \frac{Z}{\sqrt{n}} + \sqrt{P}\, o(n^{-\delta}), \qquad (119)
\]
where we have used Fact 1(a), as $q^0$, $m^0$ are in the sigma-field and independent of $\tilde{A}$. By H1(c), $\lim \|m^0\|^2/n \overset{a.s.}{=} \bar{\tau}_0^2$, and therefore (119) goes to zero almost surely in the limit at rate $n^{-\delta}$.

(f) Since $Q_0$ is the empty matrix, $q^0_{\perp} = q^0$ and so $\lim \|q^0_{\perp}\|^2/n = \lim \|q^0\|^2/n = P$.

5.5.3 Step 3: Showing B_t holds

(f) By the induction hypothesis B_{t−1}, (65) is true for 0 ≤ s ≤ t − 2, so we prove the s = t − 1 case. Let $\mathsf{P}_{M_{t-1}} = M_{t-1}(M_{t-1}^* M_{t-1})^{-1} M_{t-1}^*$ be the projection matrix onto the column space of $M_{t-1}$. Then,
\[
\frac{\|m^{t-1}_{\perp}\|^2}{n} = \frac{\|(\mathsf{I} - \mathsf{P}_{M_{t-1}})\, m^{t-1}\|^2}{n} = \frac{\|m^{t-1}\|^2}{n} - \frac{(m^{t-1})^* M_{t-1}}{n}\bigg(\frac{M_{t-1}^* M_{t-1}}{n}\bigg)^{-1}\frac{M_{t-1}^* m^{t-1}}{n}. \qquad (120)
\]

kmr − PMr−1 mr k2 kmr⊥ k2 = lim > ςr for 0 ≤ r ≤ t − 2, n n

(121) M∗

Mt−1

for positive constants ςr . Using (121), Facts 7 and 8 imply that the smallest eigenvalue of lim t−1n is greater than some positive constant; hence its inverse exists. Let φb (bri , bsi , wi ) = (bri − wi )(bsi − wi ) = mri msi . It can be verified that φb ∈ P L(2). Using the induction hypothesis Bt−1 (b), we have for 0 ≤ r, s ≤ t − 1: lim

n i 1X (mr )∗ ms a.s. h σs Zˆs − σZw ) = E (¯ σr Zˆr − σZw )(¯ φb (bri , bsi , wi ) = lim n n

(122)

i=1

where (Zˆr , Zˆs ) are jointly Gaussian with N (0, 1) marginals, and independent of Zw . Using (122) in (120), we obtain h i 2 kmt−1 ⊥ k lim = E (¯ σt−1 Zˆt−1 − σZw )2 − u∗ C −1 u (123) n where for 1 ≤ i, j ≤ (t − 1), h i h i ui = E (¯ σt−1 Zˆt−1 − σZw )(¯ σi−1 Zˆi−1 − σZw ) , Cij = E (¯ σi−1 Zˆi−1 − σZw )(¯ σj−1 Zˆj−1 − σZw ) . (124)

Now the result follows from Fact 5 if we can show that there exists strictly positive constants c1 , . . . , ct−1 such that Var(¯ σr Zr |¯ σ0 Z0 , . . . , σ ¯r−1 Zr−1 ) ≥ cr , for 1 ≤ r ≤ (t − 1). Indeed, we will now prove that   σ ¯r2 2 Var(¯ σr Zr |¯ σ0 Z0 , . . . , σ ¯r−1 Zr−1 ) = σ ¯r 1 − 2 . (125) σ ¯r−1   Since σ ¯r2 = σ 2 (1 + snr)1−ξr−1 − 1 , the definition of ξr−1 in (33) implies that the RHS of (125) is m l 2C . strictly positive for r ≤ T ∗ − 1, where T ∗ = log(C/R) For r ∈ [t − 1], we have     kbr⊥ k2 kbr k2 (br )∗ Br Br∗ Br −1 Br∗ br kq r k2 (q r )∗ Qr Q∗r Qr −1 Q∗r q r lim = lim − = lim − (126) n n n n n n n n n where the second equality follows from the induction hypothesis Bt−1 (c) which says that 0

0

(br )∗ br (q r )∗ q r lim = lim =σ ¯r2 for 0 ≤ r0 ≤ r ≤ (t − 1). n n

(127)

i ∗

∗ ∗ ˜ we have C˜ij = C˜ji = lim (q ) qj = σ Denoting lim BrnBr = lim QrnQr by C, ¯j2 , for 0 ≤ i ≤ j ≤ n

kq r k2

(r − 1). The induction hypothesis Ht (f) guarantees that ⊥n is strictly positive for 0 ≤ r ≤ t − 1. Consequently, Facts 7 and 8 imply that C˜ is invertible. Hence     kbr⊥ k2 σ ¯r2 kq r k2 (q r )∗ Qr Q∗r Qr −1 Q∗r q r (a) 2 2 ∗ ˜ −1 2 (b) 2 lim = lim − =σ ¯r − σ ¯r (er C er )¯ σr = σ ¯r 1 − 2 . n n n n n σ ¯r−1 (128) 27

In (128), (a) is obtained using (127) with er ∈ Rr denoting the all-ones column vector. The equality ˜ = er : since all the entries in the last (b) is obtained using the fact that C˜ −1 er is the solution to Cx 2 2 )−1 ]∗ , ˜ ˜ = er is x = [0, . . . , 0, (¯ column of C are equal to σ ¯r−1 , by inspection the solution to Cx σr−1 which yields equality (b) in (128). Using the induction hypothesis Bt−1 (b) for the P L(2) function φb (x, y) = xy, we have n

X1 1 lim (br )∗ bs = lim br bs = E[¯ σr Zˆr σ ¯s Zˆs ], 0 ≤ r, s ≤ (t − 1). n n i i

(129)

i=1

Using this, we obtain 

Br∗ Br n

−1

Br∗ br (a) =σ ¯r2 − v ∗ D−1 v = Var(¯ σr Zˆr |¯ σ0 Zˆ0 , . . . , σ ¯r−1 Zˆr−1 ) n (130) h i h i ¯r σ ¯i Zˆr Zˆi , and Dij = E σ ¯i σ ¯j Zˆi Zˆj . Equality (a) in (130) where for 0 ≤ i, j ≤ (r − 1), vi = E σ follows from Fact 4. We have proved (125) via (128) and (130), which completes the proof of Bt (f). We now state a couple of lemmas that will be useful for proving the remainder of Bt and Ht+1 . kbr k2 kbr k2 (br )∗ Br − lim ⊥ = lim n n n

Lemma 6. For t ≤ T ∗ , the vectors of coefficients in (48), given by  ∗ −1 ∗  ∗ −1 ∗ Mt Mt Mt mt Qt Qt Qt qt α ~ = (α0 , α1 , . . . , αt−1 ) = , ~γ = (γ0 , γ1 , . . . , γt−1 ) = n n n n converge to finite limits at rate n−δ as n → ∞.

r ∗

s

Proof. From the induction hypothesis Ht (c), (m n) m converges almost surely to a constant at rate n−δ for r, s ≤ (t − 1). Further, Bt (f) proved above and Fact 7 together imply that that the smallest M ∗M eigenvalue of the matrix tn t is bounded from below by a positive constant for all n; then Fact 8 implies that its inverse has a finite limit. Further, the inverse converges to its limit at rate n−δ as Mt∗ Mt each entry in n converges at this rate. The statement for ~γ is proved in an analogous manner using the induction hypotheses Bt−1 (c) and Ht (f), together with Facts 7 and 8. Lemma 7. The following statements hold for t ≤ T ∗ : d

ht+1 |St+1,t = Ht (Mt∗ Mt )−1 Mt∗ mtk + PQ⊥t+1 A˜∗ mt⊥ + Qt+1~ot+1 (n−δ ), d ⊥ ˜ t bt |St,t = Bt (Q∗t Qt )−1 Q∗t qkt + PM Aq⊥ + Mt~ot (n−δ ), t

(131) (132)

where Bt = [b0 | . . . | bt−1 ] and Ht = [h1 | . . . | ht ]. Proof. The proof is very similar to that of [7, Lemma 13]. We use Lemmas 4 and 5 to write d t ⊥ ˜ t bt |St,t = (Aq t − λt mt−1 )|St,t = Yt (Q∗t Qt )−1 Q∗t qkt + Mt (Mt∗ Mt )−1 Xt∗ q⊥ + PM Aq⊥ − λt mt−1 t

t ⊥ ˜ t = Bt (Q∗t Qt )−1 Q∗t qkt + [0|Mt−1 ]Λt (Q∗t Qt )−1 Q∗t qkt + Mt (Mt∗ Mt )−1 Ht∗ q⊥ + PM Aq⊥ − λt mt−1 , t (133)

where Λt = diag(λ0 , . . . , λt−1 ). The last equality above is obtained using Yt = Bt + [0|Mt−1 ]Λt , and Xt = Ht + Qt . Thus, to show (132), we need to prove that t [0|Mt−1 ]Λt~γ + Mt (Mt∗ Mt )−1 Ht∗ q⊥ − λt mt−1 = Mt ~ot (n−δ ).

28

(134)

Observe that each side of (134) is a linear combination of {mk }, 0 ≤ k ≤ (t − 1). The coefficient of mk on the LHS equals    t Mt∗ Mt −1 Ht∗ q⊥ λk+1 γk+1 + for 0 ≤ k ≤ t − 2, n n k+1   (135)  t Mt∗ Mt −1 Ht∗ q⊥ −λt + , for k = t − 1. n n t

We prove (134) by showing that each of the coefficients above is o(n−δ ). Indeed, for 1 ≤ i ≤ t, 

t Ht∗ q⊥ n

 = i

t−1 t (hi )∗ (q t − qkt ) (hi )∗ q⊥ (hi )∗ q t X (hi )∗ q r = = − γr n n n n r=0 # " t−1 (mi−1 )∗ mt−1 X (mi−1 )∗ mr−1 a.s. → lim λt − γr λr n n

(136)

r=1

where the convergence (at rate n−δ ) follows from Ht (d); Lemma 6 guarantees the convergence of the γr coefficients. Therefore " #  ∗ t  t−2 Ht q⊥ a.s. (Mt )∗ mt−1 X (Mt )∗ mr → lim λt − γr+1 λr+1 at rate n−δ . (137) n n n r=0

Substituting (137) in (135) yields (134), and completes the proof of (132). The other part of the lemma, (131), is proved in a similar manner. (a) From Lemma 7, we have d ⊥ ˜ t bt |St,t = Bt (Q∗t Qt )−1 Q∗t qkt + PM Aq⊥ + Mt~ot (n−δ ). t

Using the fact that qkt =

Pt−1 i=0

(138)

γi q i , we have

Bt (Q∗t Qt )−1 Q∗t qkt =

t−1 X

γi Bt (Q∗t Qt )−1 Q∗t q i =

i=0

t−1 X

γi bi

(139)

i=0

⊥ Aq ˜ t = as (Q∗t Qt )−1 Q∗t q i ∈ Rt is a vector of zeros with a 1 in position (i+1). Next, observe that PM ⊥ t ˜ t − P k Aq ˜ t . Hence the result follows if we can show that P k Aq ˜ t=M ˜ t~ot (n−δ ). Indeed, using Aq Mt Mt Fact 1(b), we see that t d ˜ k ˜ t d kq⊥ k ˜ √ Mt~ot (n−δ ) = M ot (n−δ ) PMt Aq = t~ ⊥ n kq t k2

t 2

where the last equality follows since ⊥n ≤ kqnk ≤ 2P . (c) By the induction hypothesis, the result holds for all r, s < t, so we only consider the r < t, s = t and r = s = t cases. From Bt (a) above, we have t

d

b |St,t =

t−1 X

˜ t + Mt~ot (n−δ ) γi bi + Aq ⊥

i=0

29

(140)

˜ t~ot (n−δ ). For r < t, s = t, we have from (140): where we have used Mt~ot (n−δ ) = M t−1 t−1 X X ˜ t (br )∗ bt (br )∗ bi (br )∗ Aq (br )∗ mi d ⊥ = γi + + . o(n−δ ) n n n n St,t i=1

(141)

i=0

˜ t d kbr kkq t k Z (br )∗ Aq ⊥ √ ⊥ = , where Z ∼ N (0, 1). Theren n n t k r kq⊥ since n ≤ 2P and Bt−1 (c), (d) imply that kbn k and

Applying Fact 1(a), the second term in (141) is fore the last two terms in (141) are o(n−δ ) (br )∗ mi n

as

converge to finite limits. Using Bt−1 (c) again, the limit of first term in (141) can be written lim

t−1 X i=0

γi

t−1 X (q r )∗ qkt (a) (br )∗ bi a.s. (q r )∗ q i (q r )∗ q t (b) 2 = lim = lim =σ ¯r a.s. γi = lim n n n n

(142)

i=1

t ⊥ qr , where the γi ’s have finite limits due to Lemma 6. Equality (a) in (142) holds because q⊥ while (b) is obtained by applying Ht (b) to the function

φh (hr` , hs` , β0` ) := [η`r−1 (β0 − hr ) − β0` ]∗ [η`s−1 (β0 − hs ) − β0` ] = (q`r )∗ q`s , which yields lim

(q r )∗ q t a.s. 1 ¯r2 , = lim E{[η r−1 (β − τr−1 Zr−1 ) − β]∗ [η t−1 (β − τt−1 Zt−1 ) − β]} = σ n n

where the second equality above is proved in Appendix A.4. From Bt−1 (c) and Ht−1 (b), it follows that the rate of convergence in (142) is n−δ . For r = s = t, using (140), we have 0 t−1 X t−1 X ˜ t k2 kAq kbt k2 (bi )∗ bi d ⊥ |St,t = + γ i γ i0 n n n 0

i=0 i =0

+2

t−1 X i=0

t−1

X (bi )∗ Mt~ot (n−δ ) ˜ t ˜ t )∗ Mt~ot (n−δ ) kMt~ot (n−δ )k2 (bi )∗ Aq (Aq ⊥ ⊥ γi +2 γi +2 + . n n n n

(143)

i=0

Using arguments similar to those for the r < t case, the last four terms in (143) can be shown to be o(n−δ ), and by Fact 1

˜ t k2 kAq ⊥ n

=

t−1 t−1

t k2 kq⊥ kZk2 n n

where Z ∼ Rn is standard normal. Therefore,

0

XX kq t k2 (bi )∗ bi kbt k2 d γ i γ i0 |St,t = + ⊥ + o(n−δ ) n n n 0 i=0 i =0 t−1 X t−1 X a.s.

→ lim

i=0 i0 =0

0 kqkt k2 kq t k2 kq t k2 (q i )∗ q i kq t k2 γ i γ i0 + ⊥ = lim + ⊥ = lim , n n n n n

where the convergence at rate n−δ follows from Bt−1 (c). (b) Using the characterization for bt obtained in Bt (a) above, we have " t−1 # ! X 0 d t−1 0 t 0 r t −δ ˜ +M ˜ t~ot (n ) , wi . φb (bi , . . . , bi , wi ) = φb bi , . . . , bi , γr b + Aq ⊥ St,t

r=0

30

i

(144)

˜ t~ot (n−δ0 ) in the RHS can be dropped. Indeed, defining for some δ 0 > 0. The term M " t−1 ! ! # " t−1 # X X 0 t−1 t−1 0 0 r t −δ r t ˜ +M ˜ t~ot (n ) , wi , ci = b , . . . , b , ˜ ai = bi , . . . , bi , γr b + Aq , wi , γr b + Aq i ⊥ ⊥ i r=0

r=0

i

i

we can show that n n n n  nδ X  X (a) C X 1 X ˜ −δ 0 φb (ai ) − |φb (ai ) − φb (ci )| ≤ (1 + kai k + kci k) M ~ o (n ) φb (ci ) ≤ t t n n n i i=1 i=1 i=1 i=1 v v u t−1 u n 2 uX km (b) uX ˜ r k2 t (1 + ka k + kc k) 0 (c) 0 i i t 2 o(n−δ ) = o(n−δ ). ≤ Ct n n r=0

i=1

(145) In (145), (a) holds because φb ∈ P L(2). (b) is obtained using Holder’s inequality and the fact i2 P h˜ 0 P −δ 0 ) ˜ r k2 . Equality (c) can be shown by verifying that ~ o (n ≤ 2t~o1 (n−δ ) t−1 that ni=1 M t t r=0 km i Pn kai k2 Pn kci k2 and i=1 n i=1 n are bounded and finite. The details are similar to [7, Bt (b)] and are omitted. Thus by choosing δ < δ 0 , we can work with ci instead of ai . Next, we use Fact 2 to show that " n # n X X 1 1 a.s. lim nδ φb (ci ) − EA˜ {φb (ci )} = 0, (146) n n i=1

i=1

To appeal to Fact 2, we need to verify that n n o 2+κ 1 X δ E n φb (ci ) − EA˜ nδ φb (ci ) ≤ cnκ/2 . n

(147)

i=1

Using steps similar to (85), we can show that o  n 2+κ t ˜ t ]i |2+κ ˜ t ]i |2+κ 1 + |[A˜0 q t ]i |2+κ + |[Aq ]i − [Aq ≤ κ0 EA˜0 ,A˜ |[A˜0 q⊥ E φb (ci ) − EA˜ {φb (ci )} ⊥ ⊥ ⊥ ! t−1 o n X t ˜ t ]i |2+κ (|1 + γr ||bri |)2+κ + |wi |2+κ EA˜0 ,A˜ |[A˜0 q⊥ ]i − [Aq + κ0 ⊥ r=0

(a)

≤ κ1 + κ2

|wi |

2+κ

t−1 X + (1 + γr )2+κ |bri |2+κ

!

r=0

(148) ˜ A˜0 are independent copies of A. In (148), (a) holds for some constants κ0 , κ1 , κ2 > 0, where A, t t t k t d kq⊥ k ˜ d kq⊥ kq⊥ k t = ˜ t = ˜0 ˜ ˜0 because Aq ≤ kqn k ≤ P ; similarly, A˜0 q⊥ ⊥ n Z and n n Z , where Z, Z are N (0, 1). Substituting (148) in the LHS of (147), and applying induction hypothesis Bt (b) shows that the κ/2 condition (148) is satisfied if δ < κ+2 . Thus we now need to show that " ( ) # n t−1 X nδ X a.s. ˜ t ]i , wi ) − E{φb (¯ lim EA˜ φb (b0i , . . . , bit−1 , γr bri + [Aq σ0 Z0 , . . . , σ ¯t Zt , W )} = 0. ⊥ n r=0 i=1 (149) 31

t k d kq⊥ ˜ √ Z n

˜ t ]i = Recalling that [Aq ⊥ ( EA˜

φb (b0i , . . . , bit−1 ,

t−1 X

where Z˜ ∼ N (0, 1), we have )

˜ t ]i , wi ) γr bri + [Aq ⊥

t−1 X

) t k kq ˜ wi ) . φb (b0i , . . . , bt−1 γr bri + √⊥ Z, i , n r=0 (150)

( = EZ˜

r=0

Define the function t−1 X

) t k kq ˜ wi ) . φb (b0i , . . . , bt−1 γr bri + √⊥ Z, i , n r=0

( EW 0 φN (bi , . . . , bt−1 ˜ b i , wi ) := EZ

(151)

EW ∈ P L(2), and hence the induction hypothesis B It can be verified that φN t−1 (b) implies that b " n # n o X a.s. t−1 δ 1 N EW 0 N EW lim n φb (bi , . . . , bi , wi ) − E φb = 0. (152) (¯ σ0 Zˆ0 , . . . , σ ¯t−1 Zˆt−1 , σZw ) n i=1

Thus from (150) – (152), we see that " n ( ) t−1 X nδ X t−1 0 r t ˜ ]i , wi ) lim EA˜ φb (bi , . . . , bi , γr bi + [Aq ⊥ n r=0 i=1 )# ( t−1 t k X kq a.s. ˜ σZw ) − EEZ˜ φb (¯ σ0 Zˆ0 , . . . , σ ¯t−1 Zˆt−1 , γr σ ¯r Zˆr + √⊥ Z, = 0. n r=0

(153)

In (153), Lemma 6 implies that the γr ’s converge to a finite limit as n → ∞. Further, t 2 t k2 kq⊥ kq t k2 kqk k kq t k2 k = − = − n n n n

Pt−1

r 2 kq t k2 r=1 γr q k = − n n

Pt−1

r,s=1 γr γs (q

n

r )∗ q s

.

t k kq⊥ √ n

also converges to a finite limit due to Bt (c), proved above. The final step is to show P  kq t k t−1 that the variance of the Gaussian random variable ¯r Zˆr + √⊥n Z˜ converges to σ ¯t2 at rate r=0 γr σ

Hence 0

n−δ for some δ 0 > 0. Applying (153) to the P L(2) function φb (b0i , . . . , bti , wi ) := (bti )2 , we obtain   !2  t−1   t t 2 X kq k kb k  a.s. lim nδ  −E γr σ ¯r Zˆr + √⊥ Z˜ = 0. (154)   n n r=0

Using the induction hypothesis Ht (b) for the function φ` (h` , β` ) = kη`t−1 (β − ht ) − β` k2 = kq`t k2 , we have  t−1   t 2   t 2 kη (β − τ¯t−1 Zt−1 ) − βk2 a.s. δ kq k δ kq k 2 a.s. −E = lim n −σ ¯t = 0 (155) lim n n n n t−1 2 since Appendix A.4 shows that lim E{kη ¯t2 . Further, induction h t 2 (β t−2 iτ¯t−1 Zt−1 ) − βk /n} = σ a.s. hypothesis Bt (c) implies that lim nδ kbnk − kqnk = 0. Combining this with (154) and (155) completes the proof. (d) By definition ms = bs − w and so

32

lim

(br )∗ ms (br )∗ bs (br )∗ w = lim − lim . n n n

r ∗ s

By Bt (c), (b n) b converges almost surely to σ ¯s2 at rate n−δ . Hence the result follows if it can r ∗ be shown that (b n) w approaches 0 almost surely at rate n−δ . Applying Bt (b) to the function φb (bri , wi ) = bri wi , we obtain # " n n o X 1 a.s = 0. lim nδ φb (bri , wi ) − E φb (¯ σr Zˆr , σZw ) (156) n i=1

n o n o The result holds since E φb (¯ σr Zˆr , σZw ) = E σ ¯r σ Zˆr Zw = 0 as Zˆr is independent of Zw . 5.5.4

Step 4: Showing Ht+1 holds

(f ) By the induction hypothesis, Ht (f) is true for 0 ≤ r ≤ (t − 1). For r = t, we have   t k2 kq⊥ kq t k2 (q t )∗ Qt Q∗t Qt −1 (Qt )∗ q t lim = lim − n n n n n

(157)

We note the matrix inverse in (157) exists almost surely. Indeed, from the induction hypothesis Ht (f) we have   r k2 kq⊥ σ ¯r2 2 =σ ¯r 1 − 2 lim > 0 for 0 ≤ r ≤ (t − 1). n σ ¯r−1 Then Facts 7 and 8 imply that the matrix lim From to those

t 2 a.s. Bt (c), we know that kqnk → σ ¯t2 . used to prove (128) in Bt (f ), we

Q∗t Qt n

is invertible.

Substituting this in (157), and using arguments identical obtain     kq t k2 (q t )∗ Qt Q∗t Qt −1 Q∗t q t σ ¯t2 2 lim − =σ ¯t 1 − 2 . (158) n n n n σ ¯t−1   Since σ ¯t2 = σ 2 (1 + snr)1−ξt−1 − 1 , the definition of ξt−1 in (33) implies that the RHS of (158) is strictly positive for t ≤ T ∗ − 1. (a) We start with the characterization for ht+1 in (131) of Lemma 7. The proof from there on is along the same lines as Bt (a), with (Ht , Mt , mt , Qt+1 ) replacing (Bt , Qt , q t , Mt ), respectively. (c) From Ht+1 (a), we have t+1

h

d

|St+1,t =

t−1 X

˜ t+1~ot+1 (n−δ ) αi hi+1 + A˜∗ mt⊥ + Q

(159)

i=1

˜ t+1~ot+1 (n−δ ). For r < t, s = t, we have where we have used Qt+1~ot+1 (n−δ ) = Q t−1 t X (hr+1 )∗ hi+1 (hr+1 )∗ A˜∗ mt⊥ X (hr+1 )∗ q i (hr+1 )∗ ht+1 d = αi + + o(n−δ ) . N N N N St+1,t i=1

˜ t d khr+1 kkmt k Z (hr+1 )∗ Aq ⊥ ⊥ √ = , where Z ∼ N (0, 1). n N n −δ second term is o(n ). The third term is also o(n−δ )

Applying Fact 1(a), the second term in (160) is Therefore, Ht (c) and Bt (f) imply that the

(160)

i=0

33

since Ht (e) implies that the inner products first term in (160) converges at rate n−δ to lim

t−1 X i=0

(hr+1 )∗ q i N

go to zero. Using Ht (c) and Lemma 6, the

t−1 X (mr )∗ mtk (mr )∗ mi (mr )∗ mt (hr+1 )∗ hi+1 a.s. = lim αi = lim = lim αi N n n n i=1

(161)

a.s.

σt Zˆt − σZw )], = E[(¯ σr Zˆr − σZw )(¯

where the last equality is obtained by applying Bt (b) to φb (bri , bti , wi ) = (bri − wi )(bti − wi ) = mri mti . For r = s = t, using (159) we have t−1 t−1

XX (hj+1 )∗ hi+1 kA˜∗ mt⊥ k2 kht+1 k2 d |St+1,t = + αi αj N N N i=0 j=0

+2

t−1 X i=0

t−1 X (hi+1 )∗ A˜∗ mt⊥ [Qt+1~ot+1 (n−δ )]∗ A˜∗ mt⊥ kQt+1~ot+1 (n−δ )k2 (hi+1 )∗ Qt+1~ot+1 (n−δ ) αi +2 +2 + . αi N N N N i=0

(162) Using arguments similar to those for the r < t case, the last four terms in (162) can be shown to be o(n−δ ), and by Fact 1(a)

˜∗ mt d kmt k Z A ⊥ = √⊥n N N

where Z ∼ RN is standard normal. Therefore,

t−1 X t−1 X kht+1 k2 (hj+1 )∗ hi+1 kmt⊥ k2 a.s. lim |St,t = lim + αi αj N N n i=0 j=0

a.s.

= lim

t−1 X t−1 X i=0 j=0

kmtk k2 kmt k2 (mj )∗ mi kmt⊥ k2 kmt k2 ⊥ αi αj + = lim + = lim , n n n n n

where the second equality is obtained using Ht (c), which together with the central limit theorem also gives the n−δ rate of convergence. (b) From Ht+1 (a), we have t+1

h

d

|St+1,t =

t−1 X

˜ t+1~ot+1 (n−δ0 ), αu hu+1 + A˜∗ mt⊥ + Q

(163)

u=0

˜ t+1 form an orthogonal basis where A˜ is an independent copy of A and the columns of the matrix Q ˜∗ Q ˜ for the columns of Qt+1 with Q t+1 t+1 = nIt×t . We therefore have ! t t X X d v+1 ˜` + ∆ ˜ ` , β0 ), = φh (h` + ∆` , h φh au hu+1 , b h , β (164) v 0 ` ` ` ` u=0

v=0

St+1,t

P u+1 ˜ ` and ˜ t+1~ot+1 (n−δ0 )]` . Similarly define h where h` = t−1 +at [A˜∗ mt⊥ ]` and ∆` = at [Q u=0 (au +at αu )h` √ √ ˜ ` , with the bv ’s replacing the au ’s. Note that for each r ≥ 0, we have kq r k ≤ c nP` = Θ( log M ). ∆ ` 0√ Therefore, maxj∈[M ] |∆`j | = Θ(n−δ log M ) for ` ∈ [L]. Using this, it is shown in [25] that for each of the functions in (55), we have L (a) 1 X −δ 0 ˜ ˜ ˜ φ (h + ∆ , h + ∆ , β ) − φ (h , h , β ) log M ). h ` ` ` ` 0` h ` ` 0` = o(n L `=1

34

(165)

˜ t+1~ot+1 (n−δ0 )]` terms. for some δ 0 > 0. Consequently, by choosing δ ∈P(0, δ 0 ) we can drop the [Q t−1 u+1 ˜ ` [A] ˜ = ˜ = + at [A˜∗ mt⊥ ]` and h In what follows, we use the notation h` [A] u=0 (au + at αu )h` Pt−1 v+1 ∗ t ˜ ˜ + bt [A m⊥ ]` , making explicit the dependence on A. We now appeal to Fact v=0 (bv + bt αv )h` 2 to show that # " L L  1X o  n  X 1 a.s. ˜ ` [A], ˜ ` [A], ˜ h ˜ β0 − ˜ h ˜ β0 = 0 (166) lim nδ φh h` [A], EA˜ φh h` [A], ` ` L L `=1

`=1

To invoke Fact 2 (conditionally on S1,0 ), we need to verify that L

  n  o 2+κ 1X ˜ ` [A], ˜ ` [A], ˆ h ˆ β0 − E ˜ φh h` [A], ˜ h ˜ β0 ≤ cLκ/2 EAˆ φh h` [A], ` ` A L

(167)

`=1

ˆ A˜ are i.i.d. copies of A. In [25], it is shown that for each for constants κ ∈ (0, 1). In (167), A, function in (55),     2+κ a.s. ˜ ˜ ˆ ˆ ˜ ˜ EA, φ h [ A], h [ A], β − φ h [ A], h [ A], β = O((log M )2+κ ), ` ∈ [L]. (168) 0` 0` ˆA ˜ h ` ` h ` ` Due to Jensen’s inequality, the bound in (168) implies that (103) holds if δ is chosen such that δ(2 + κ) < κ/2. Hence (166) holds. √ d Recalling that [A˜∗ mt⊥ ] = (kmt⊥ k/ n)Z where Z ∼ N (0, IN ×N ), we have " !# t−1 t−1 o n  t k t k X X km km u+1 v+1 0 0 ⊥ ⊥ ˜ ` [A], ˜ β0 ˜ h = EZ φh EA˜ φh h` [A], au h` + at √ Z` , bv h` + bt √ Z` , β0` ` n n u=0 | {z v=0 } Pt−1 0 u+1 Pt−1 0 v+1 φnew a h , b h , β ( ) 0 u=0 u ` v=0 v ` h ` (169) where we have defined a0u = (au + at αu ) and b0v = (bv + bt αv ). Using Jensen’s inequality, it can be shown that the induction hypothesis Ht (b) holds for the function φnew holds whenever Ht (b) holds h in (169). We therefore have for the function φh inside the expectation defining φnew h !## " " ! t−1 L t−1 L t−1 t−1 X X X X X 1X 0 v+1 0 u+1 new 0 0 new δ 1 φh au h` , bv h` , β0` − E φh au τ¯u Zu` , bv τ¯v Zv` , β` lim n L L `=1

u=0

v=0

`=1

u=0

v=0

a.s.

= 0. (170)

It is then shown in [25] that P P 0 ¯ Z , t−1 b0 τ φnew ( t−1 u u` u=0 au τ v=0 v ¯v Zv` ,β` ) h z }| ( !){ t−1 L t−1 t k t k X X km km nδ X lim b0v τ¯v Zv` + bt √⊥ Z` , β` a0u τ¯u Zu` + at √⊥ Z` , E EZ φh L n n u=0 v=0 `=1 ( !) t−1 t−1 X X a.s. 0 0 − EEZ φh au τ¯u Zu` + at ζt Z` , bv τ¯v Zv` + bt ζt Z` , β` = 0 u=0

v=0

35

(171)

where ζt is the limit of

kmt⊥ k √ . n

That ζt is well-defined and finite can be seen as follows.

0 t 2 t−1 t−1 kmt⊥ k2 kmt k2 kmk k kmt k2 X X (mi )∗ mi = − = − . αu αi0 n n n n n 0

(172)

i=1 i =1

Each of the terms in (172) converges to a finite limit at rate n−δ by Ht+1 (c) and Lemma 6. Using the definitions a0u = (au + at αu ) and b0v = (bv + bt αv ), we have for ` ∈ [L] ! t−1 t−1 X X 0 0 bv τ¯v Zv` + bt ζt Z` , β` φh au τ¯u Zu` + at ζt Z` , u=0 t−1 X

= φh

v=0

au τ¯u Zu`

t−1 t−1 t−1 X X X + at ( αu Zu` + ζt Z` ), bv τ¯v Zv` + bt ( αv Zv` + ζt Z` ), β`

u=0

u=0

v=0

(173)

! .

v=0

Thus Pt−1 the proof is complete if we2 show that the i.i.d. entries of the Gaussian random vector ¯t . To see this, apply the proof thus far (from (164) – (173)) to u=0 αu Zu + ζt Z have variance τ ˜` (h` )∗ h ˜ the function φh (h` , h` , β` ) = with at = bt = 1 and au = bu = 0 for 0 ≤ u ≤ (t − 1). We thus M

obtain

" lim nδ

L

kht+1 k2 1 X Ek − N L

Pt−1

`=1

¯u Zu` + γt Z` k2 u=0 αu τ M

# a.s

= 0.

(174)

P Pt−1 P ¯u Zu` + γt Z` k2 equals αu Zu + ζt Z has i.i.d. entries, M1L L Ek t−1 Further, since u=0 αu τ u=0 `=1 P 2 t−1 E ¯u Zui + γt Zi for any i ∈ [N ]. On the other hand, from Ht+1 (c) we know that u=0 αu τ h t+1 2 i t k2 a.s. kh k km lim nδ − n = 0, and from Bt (b) setting φb (b1i , ..., bti , wi ) = (mti )2 = (bti − wi )2 , we have N lim n

δ



 2  kmt k2 a.s ˆ −E σ ¯t Zt − σZw = 0, n

 2 The result follows since E σ ¯t Zˆt − σZw = σ ¯t2 + σ 2 = τ¯t2 .

(d) By definition q s+1 = η s (β0 − hs+1 ) − β0 , and hence L

(hr+1 )∗ q s+1 1X s+1 = φh (hr+1 ` , h` , β0` ) n n `=1

s+1 r+1 ∗ s s+1 ) − β ]. Applying for φh : RM × RM → R defined as φh (hr+1 0` ` , h` , β0` ) = (h` ) [η` (β0 − h Ht+1 (b) to φh yields # " L L X 1X a.s. r+1 s+1 ∗ s δ 1 φh (h` , h` , β0` ) − lim E{¯ τr Zr` [η` (β0 − τ¯s Zs ) − β0` ]} = 0. (175) lim n n n `=1

`=1

Using arguments very similar to those in H1 (d) (iterated expectations and Stein’s lemma), we obtain that E{¯ τr Zr∗` [η`s (β0 − τ¯s Zs ) − β0` ]} =

 τ¯r E[Zr1 Zs1 ] Ekη`s (β − τ¯s Zs )k2 − nP` , τ¯s 36

` ∈ [L].

(176)

Here Zr1 , Zs1 refer to the first entries of the vectors Zr , Zs , respectively. Thus (175) becomes " L  # X Ekη s (β − τ¯s Zs )k2 τ¯r a.s. r+1 s+1 δ 1 lim n φh (h` , h` , β0` ) − lim E[Zr1 Zs1 ] −P = 0. (177) n τ¯s n `=1

From (43), we observe that λs+1

1 = 2 τ¯s



kβ s+1 k2 −P n



1 a.s. → lim 2 τ¯s

!  E kη s (β − τ¯s Zs )k2 −¯ σ2 − P = 2 s+12 , n σ ¯s + σ s

(178) s+1

2

)k = where the convergence at rate n−δ follows from Ht+1 (b) applied to the function kη (β−h n   s (β −¯ 2 s+1 2 E kη τ Z )k { } s s 0 kβ k 2 . The last equality in (178) holds because P − →σ ¯s+1 (cf. Appendix n n A.4). Substituting (178) in (177), we see that what remains to be shown is a.s.

τ¯r τ¯s E[Zr1 Zs1 ] = lim

(mr )∗ ms a.s. = E[(¯ σr Zˆr − σZw )(¯ σs Zˆs − σZw )]. n r ∗

s

a.s.

(179) r+1 ∗

s+1

The second equality above is due to Ht+1 (c), which also says that lim (m n) m = lim (h )N(h ) . ∗ s+1 to see Then the first equality in (179) is obtained by applying Ht+1 (b) to the function (hr+1 ` ) h` that (hr+1 )∗ (hs+1 ) a.s. lim = τ¯r τ¯s E[Zr1 Zs1 ]. N (e) By Ht+1 part (a), t−1

X (q 0 )∗ hi+1 (q 0 )∗ A˜∗ mt ˜ t+1~ot+1 (n−δ ) (q 0 )∗ Q (q 0 )∗ ht+1 d ⊥ |St+1,t = + + . αi n n n n

(180)

i=1

We argue that each term on the RHS approaches 0 almost surely with rate n−δ . This is true for the first term by the induction hypothesis Ht (e) and Lemma 6. Next, Fact 1(a) implies that

˜∗ mt d kq 0 k kmt k Z (q 0 )∗ A ⊥ = √n √⊥n √n where Z ∼ N (0, 1). Thus the second term in (180) approaches 0 almost n √ √ √ surely with rate n−δ since kq 0 k/ n = P and limkmt⊥ k/ n is a constant by Ht (f). For the third 0 ∗ r term, the result holds because (q n) q converges to a constant for r = 0, . . . , t, due to Bt (c).

A Appendices

A.1 AMP Derivation

In (23), the dependence of $z^t_{a\to i}$ on i is only due to the term $A_{ai}\beta^t_{i\to a}$ being excluded from the sum. Similarly, in (24) the dependence of $\beta^t_{i\to a}$ on a is due to excluding the term $A_{ai} z^t_{a\to i}$ from the argument. We begin by estimating the order of these excluded terms.

Note that $A_{ai} = O(n^{-1/2})$, and $\beta^t_{i\to a} = O(\sqrt{\log n})$. The latter is true since for i in section ℓ, $\beta^t_i \le \sqrt{nP_\ell}$, where $P_\ell = O(1/L)$, and $L = \Theta(n/\log n)$. Therefore $A_{ai}\beta^t_{i\to a} = O(\sqrt{\log n}/n)$. In (24), the excluded term $A_{bi} z^t_{b\to i}$ is $O(n^{-1/2})$ because $z^t_{b\to i} = O(1)$. We set
\[
z^t_{a\to i} = z^t_a + \delta z^t_{a\to i}, \qquad \text{and} \qquad \beta^{t+1}_{i\to a} = \beta^{t+1}_i + \delta\beta^{t+1}_{i\to a}. \qquad (181)
\]

Comparing (181) with (23), we can write X t Aaj βj→a , zat = ya −

t t δza→i = Aai βi→a .

(182)

j∈[N ]

t For i ∈ [N ], let sec(i) denote the set of indices in the section containing i. To nP o determine δβi→a , t we expand ηit in (24) in a Taylor series around the argument , which does b∈[n] Abj zb→j j∈sec(i)

not depend on a. We thus obtain  o nX t+1 t βi→a ≈ ηit  Abj zb→j

j∈sec(i)

b∈[n]





 o nX t t  − Aai za→i ∂i ηit  Abj zb→j

j∈sec(i)

b∈[n]

,

(183)

where ∂i ηit (.) is the partial derivative of ηit with respect to the component of the argument corresponding to index i. (Recall from (11) that the argument is a length M vector.) From (11), the partial derivative can be evaluated as  √    si nP` √ √  exp 2 nP nP ηit (s) p τt ` ` t    √ ∂i ηit (s) = ηit (s) ∂i ln ηit (s) = ηit (s)  2 − = nP − η (s) . ` i s nP τt τt2 P τt2 exp j 2 ` j∈sec(i)

τt

(184) Using (184) in (183) yields  nX o t+1 t βi→a = ηit  Abj zb→j

j∈sec(i)

b∈[n]



Aai zat τt2

 

 ηit 

nX b∈[n]

t Abj zb→j

o j∈sec(i)

  nX o p t   nP` − ηit  Abj zb→j

j∈sec(i)

b∈[n]

(185)

  .

t Notice that we have√replaced the stand-alone term Aai za→i in (183) with Aai zat because the differt ence Aai δza→i is O( log n/n), which can be ignored — we only keep terms as small as O(n−1/2 ). Since only the second term on the right-hand side of (185) depends on a, we can write   nX o t , βit+1 = ηit  Abj (zbt + δzb→j ) (186) j∈sec(i)

b∈[n]

and t+1 δβi→a =

Aai z t − 2a τt

We observe that

 nX o t ) ηit  Abj (zbt + δzb→j b∈[n] t δβi→a

j∈sec(i)

  nX o p t   nP` − ηit  Abj (zbt + δzb→j )

√ = O(log n/ n). Hence, in (182), we can write t δza→i = Aai βit


b∈[n]



j∈sec(i)

 .

(187) (188)

t because the difference Aai δβi→a = O(log n/n). Substituting (188) in (186), we see that ! n  nX o o (a) t t+1 t t 2 t ∗ t t βi = ηi = ηi , Abj zb + Abj βj (A z + β )j j∈sec(i)

b∈[n]

t+1 δβi→a =

2 b Abj

→ 1 as n → ∞. Analogously, using (188) in (187) gives ! n X o nX o p  nP` − ηit (A∗ z t + β t )j (A∗ z t + β t )j

where (a) holds because −Aai zat t ηi τt2

P

(189)

j∈sec(i)

j∈sec(i)

b∈[n]

j∈sec(i)

b∈[n]

 

. (190)

Finally, we use (189) and (190) in (182) to obtain X t Aak (βkt + δβk→a ) zat = ya − k∈[N ]

= ya − (b)

X k∈[N ]

 A2 z t−1  hq i Aak ηkt−1 A∗ z t−1 + β t−1 + ak2 a ηkt−1 A∗ z t−1 + β t−1 nPsec(k) − ηkt−1 A∗ z t−1 + β t−1 τt−1

= ya − (Aβ t )a +

zat−1 (nP − kβ t k2 ), 2 nτt−1

(191)

where (b) is obtained as follows. First, we use $A_{ak}^2 \approx \frac{1}{n}$. Next, (11) implies that for all s,
\[
\sum_{k\in[N]} \sqrt{nP_{\mathrm{sec}(k)}}\, \eta^t_k(s) = \sum_{\ell=1}^{L} nP_\ell = nP.
\]
Finally, note from (189) that $\sum_k \big(\eta^{t-1}_k(A^* z^{t-1} + \beta^{t-1})\big)^2 = \sum_k (\beta^t_k)^2 = \|\beta^t\|^2$. The AMP update equations are thus given by (191) and (189).
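To make the derivation above concrete, the following is a minimal Python sketch of the resulting decoder, iterating (189) and (191) with the section-wise denoiser of (11). It is illustrative only: the function and variable names (power_allocation, eta, amp_decode) are ours, the allocation follows the exponentially decaying rule whose partial sums appear in (210) of Appendix A.3, and τ_t² is estimated empirically as ‖z^t‖²/n — an assumption made for this sketch, not a prescription taken from the text.

import numpy as np

def power_allocation(L, P, snr, C):
    # Exponentially decaying allocation: partial sums match (210),
    # sum_{l<=k} P_l = (P + sigma^2)(1 - 2^{-2Ck/L}); the P_l sum to P.
    sigma2 = P / snr
    k = np.arange(1, L + 1)
    csum = (P + sigma2) * (1.0 - 2.0 ** (-2.0 * C * k / L))
    return np.diff(np.concatenate(([0.0], csum)))

def eta(s, P_vec, n, M, tau2):
    # Section-wise denoiser of (11): within section l, entry i becomes
    # sqrt(n P_l) times a softmax of sqrt(n P_l) * s_i / tau^2 over the M entries.
    out = np.zeros_like(s)
    for l, P_l in enumerate(P_vec):
        sec = slice(l * M, (l + 1) * M)
        c = np.sqrt(n * P_l)
        x = c * s[sec] / tau2
        x -= x.max()              # subtract the max for numerical stability
        w = np.exp(x)
        out[sec] = c * w / w.sum()
    return out

def amp_decode(A, y, P_vec, M, T):
    # AMP iterations (189) and (191); tau_t^2 estimated as ||z^t||^2 / n (assumption).
    n, N = A.shape
    P = P_vec.sum()
    beta = np.zeros(N)            # beta^0 = 0
    z = y.copy()                  # z^0 = y - A beta^0
    for _ in range(T):
        tau2 = z @ z / n
        s = A.T @ z + beta                            # effective observation A* z^t + beta^t
        beta = eta(s, P_vec, n, M, tau2)              # update (189)
        z = y - A @ beta + (z / tau2) * (P - beta @ beta / n)   # update (191), with Onsager term
    return beta

After T iterations, the decoded message is read off from the returned estimate by taking the location of the largest entry in each section of β^T.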

A.2 Proof of Lemma 1

From (27), x(τ) can be written as
\[
x(\tau) := \sum_{\ell=1}^{L} \frac{P_\ell}{P}\, E_\ell \qquad (192)
\]
where
\[
E_\ell = \mathbb{E}\Bigg[ \frac{\exp\big(\frac{\sqrt{nP_\ell}}{\tau}\, U_1^\ell\big)}{\exp\big(\frac{\sqrt{nP_\ell}}{\tau}\, U_1^\ell\big) + \exp\big(-\frac{nP_\ell}{\tau^2}\big)\sum_{j=2}^{M}\exp\big(\frac{\sqrt{nP_\ell}}{\tau}\, U_j^\ell\big)} \Bigg]. \qquad (193)
\]

We will prove the lemma by showing that for ` = 1, . . . , L:  1, if c` > 2(ln 2)Rτ 2 , lim E` = 0, if c` < 2(ln 2)Rτ 2 , where c` = lim LP` as defined in (30).3 Using the relation nR = nP` = ν` ln M, τ2 3

L ln M ln 2 ,

(193)

(194) we can write (195)

We can also prove that lim E` = 21 if c` = 2(ln 2)Rτ 2 , but we do not need this for the exponentially decaying power allocation since c` exactly equals 2(ln 2)Rτ 2 for only a vanishing fraction of sections. Since E` ∈ [0, 1], these sections do not affect the value of lim x(τ ) in (192).

39

where ν` =

LP` . Rτ 2 ln 2

Hence E` in (193) can be written as  √   √ ln M ν` U1` exp √  √  E` = E  P √ √ ` exp ln M ν` U1` + M −ν` M ln M ν U exp ` j j=2  √    √ ln M ν` U1` exp √ √   U1`  . = E E  PM √ √ ` ` −ν exp ln M ν` U1 + M ` j=2 exp ln M ν` Uj

The inner expectation in (197) is of the form √    √   exp ln M ν` U1` c ` √ √   U1  = EX E , P √ √ c+X ` exp exp ln M ν` U1` + M −ν` M ln M ν U ` j=2 j

(196)

(197)

(198)

√  √ where c = exp ln M ν` U1` is treated as a positive constant, and the expectation is with respect to the random variable M  √ X √ ln M ν` Uj` . (199) X := M −ν` exp j=2

Case 1: lim νl > 2. Since

c c+X

is a convex function of X, applying Jensen’s inequality we get   c c EX ≥ . (200) c+X c + EX

The expectation of X is EX = M −ν`

M X j=2

h √ i (a) √ E exp ln M ν` Uj` = M −ν` (M − 1)M ν` /2 ≤ M 1−ν` /2

(201)

where (a) is obtained using the moment generating function of a Gaussian random variable. We therefore have   c c c 1 ≥ EX . (202) ≥ ≥ c+X c + EX c + M 1−ν` /2 √  √ Recalling that c = exp ln M ν` U1` , (202) implies that √   √ exp ln M ν` U1` 1 ` √   √ . EX  U1 ≥ √ √ exp ln M ν` U1` + X 1 + M 1−ν` /2 exp − ln M ν` U1` 

(203)

√  When {U1` > −(ln M )1/4 }, the RHS of (203) is at least [1 + M 1−ν` /2 exp (ln M )3/4 ν` ]−1 . Using this in (197), we obtain that 1 ≥ E` ≥ P (U1` > −(ln M )1/4 ) ·

1+

M 1−ν` /2

1 M →∞ √  −→ 1 since ν` > 2. 3/4 exp (ln M ) ν`

Hence E` → 1 when ν` > 2. 40

Case 2: lim νl < 2. The random variable X in (199) can be bounded from below as follows.    √  √ √ √ ` ` −ν` −ν` exp max Uj max exp ln M ν` Uj = M ln M ν` . (204) X≥M j∈{2,...,M }

j∈{2,...,M }

Using standard bounds for the standard normal distribution, it can be shown that   √ ` P max Uj < 2 ln M (1 − ) ≤ exp(−M (1−) ),

(205)

j∈{2,...,M }

ln ln M ln M



4

Combining (205) and (204), we obtain that   √ (1−) ` exp(−M )≥P max Uj < 2 ln M (1 − ) j∈{2,...,M } √    √ √ √  ≥ P X < M −ν` exp 2 ln M (1 − ) ln M ν` = P X < M 2ν` (1−)−ν` .

for  = ω

.

(206)

Since√lim ν` < 2 and  > 0 can be arbitrary small, there exists a strictly positive constant δ such that δ < 2ν` (1 − ) − ν` for all sufficiently large L. Therefore, for sufficiently large M , the expectation in (198) can be bounded as   c c EX ≤ P (X < M δ ) · 1 + P (X ≥ M δ ) · c+X c + Mδ (207) c 2 (1−) ≤ exp(−M )+1· ≤ . c + Mδ 1 + c−1 M δ √  √ Recalling that c = exp ln M ν` U1` , and using the bound of (207) in (197), we obtain  E` ≤ E 



1+



1  √  √ exp − ln M ν` U1`

≤ P (U1` > (ln M )1/4 ) · 1 + P (U1` ≤ (ln M )1/4 ) · (a)

≤ exp(− 12 (ln M )1/2 ) + 1 ·

1+



1 √ exp(− ν` (ln M )3/4 )

(208)

1 (b)  −→ 0 as M → ∞. √ 3/4 1 + exp δ ln M − ν` (ln M )

In (208), (a) is obtained using the bound Φ(x) < exp(−x2 /2) for x ≥ 0, where Φ(·) is the Gaussian cdf; (b) holds since δ and lim ν` are both positive constants. This proves that E` → 0 when lim ν` < 2. The proof of the lemma is complete since we have proved both statements in (194).

A.3 Proof of Lemma 2 For t = 0, τ0 = σ 2 + P . From Lemma 1, we have x ¯1 = lim

L X P` `=1

4

P

2

1{c` > 2(ln 2)R (σ + P )} = lim

L→∞

L X P` `=1

P

 1

` log(C/R) < L 2C

Recall that f (n) = ω(g(n)) if for each k > 0, |f (n)|/|g(n)| ≥ k for sufficiently large n,

41

 ,

(209)

where the second equality is obtained using the expression for c` in (30) and simplifying. Substituting log(C/R) = ξ0 , and using the geometric series formula 2C k X `=1

P` = (P + σ 2 )(1 − 2−2Ck/L ),

(210)

(209) becomes bξ0 Lc

x ¯1 = lim

L→∞

X P` P + σ2 (1 + snr) − (1 + snr)1−ξ0 = (1 − 2−2Cξ0 ) = . P P snr

(211)

`=1

τ¯12

The expression for is a straightforward simplification of σ 2 + P (1 − x ¯1 ). 2 Assume towards induction that (31) and (32) hold for x ¯t , τ¯t . For step (t + 1), from Lemma 1 we have   L L X X P` ` 1 C(P + σ 2 ) P` 2 1{c` > 2(ln 2)R τ¯t } = lim 1 < log x ¯t+1 = lim , (212) L→∞ P P L 2C R τ¯t2 `=1

`=1

where the second equality is obtained using the expression for c` in (30) and simplifying. Using the induction hypothesis for τ¯t2 , we get (P + σ 2 ) (P + σ 2 ) = (1 + snr)ξt−1 = 22Cξt−1 . = τ¯t2 σ 2 (1 + snr)1−ξt−1

(213)

  1 C(P + σ 2 ) 1 C log = log + ξt−1 . 2 2C 2C R R¯ τt {z } |

(214)

Hence

ξt

Using (214) in (212), we obtain bξt Lc

x ¯t+1 = lim

L→∞

X P` (1 + snr) − (1 + snr)1−ξt P + σ2 = (1 − 2−2Cξt ) = . P P snr

(215)

`=1

2 = P + σ 2 (1 − x The proof is concluded by using (215) to compute τ¯t+1 ¯t+1 ). 2 A.4 The limit of n1 E{[η r (β − τ¯r Zr ) − β]∗ [η s (β − τ¯s Zs ) − β]} equals σ ¯s+1 for r ≤ s.

Since kβk2 = P , the required limit is

1 1 1 lim E{[η r (β − τ¯r Zr )]∗ [η s (β − τ¯s Zs )]} − E{β ∗ η r (β − τ¯r Zr )} − E{β ∗ η s (β − τ¯s Zs )} + P. (216) n n n  2 2 For r ≤ s, we prove that the limit in (216) equals σ ¯s+1 = σ (1 + snr)1−ξs − 1 by showing the following:   1 lim E{β ∗ η r (β − τ¯r Zr )} = σ 2 (1 + snr) − (1 + snr)1−ξr , (217) n 1 1 E{kη r (β − τ¯r Zr )]k2 } = E{β ∗ η r (β − τ¯r Zr )}, (218) n n 1 1 lim E{[η r (β − τ¯r Zr )]∗ [η s (β − τ¯s Zs )]} = lim E{β ∗ η r (β − τ¯r Zr )}, for r < s. (219) n n 42

Since β is distributed uniformly over the set BM,L , the expectation in (217) can be computed by assuming that β has a non-zero in the first entry of each section. Thus   √    nP` nP` ` L U exp exp X 2 1 τ¯r 1 τ¯  √  √ r  P  lim E{β ∗ η r (β − τ¯r Zr )} = lim P` E  M nP nP` ` nP n ` ` ` exp exp exp U + U l=1

(a)

=

L X l=1

τ¯r2

τ¯r

1

j=2

τ¯r

j

  (b) P` 1{c` > 2(ln 2)R¯ τr2 } = σ 2 (1 + snr) − (1 + snr)1−ξr . (220)

In (220), {Uj` } with ` ∈ [L], j ∈ [M ] is just a relabeled version of −Zr , and is thus i.i.d. N (0, 1). The equality (a) is obtained from (193) and (194) in Appendix A.2, noting that c` = lim LP` while (b) follows from Lemmas 1 and 2 (cf. (28) and (31)). Since β r+1 (s) = η r (s), (218) was proved in Proposition 3.1 (cf. (19) and (20)). Next, from the Cauchy-Schwarz inequality, we have L 1/2 1 1X E{(η r (β − τ¯r Zr ))∗ η s (β − τ¯s Zs )} ≤ E{kη`r (β` − τ¯r Zr` )k2 } E{kη`s (β` − τ¯s Zs` )k2 } n n l=1 (221) X (a) X 2 2 (b) 2 = P` 1{c` > 2(ln 2)R¯ τr } 1{c` > 2(ln 2)R¯ τs } = P` 1{c` > 2(ln 2)R¯ τr }, `

`

where (a) follows from (218) and (220), and (b) holds because τ¯r2 > τ¯s2 since r < s. Since β is distributed uniformly over the set BM,L , the expectation E{[η`r (β` − τ¯r Zr` )]∗ [η`s (β` − τ¯s Zs` )]} can be computed by assuming that β has a non-zero in the first entry of each section: X 1 1X E{(η r (β − τ¯r Zr ))∗ η s (β − τ¯s Zs )} = E{[η`r (β` − τ¯r Zr` )]∗ [η`s (β` − τ¯s Zs` )]} = P` Ers,` (222) n n `

`

where   √   √   nP` nP` ` ` ` ` exp nP exp exp exp nP τ¯r2 τ¯r Ur1 τ¯s2 τ¯s Us1   √   √  P ·  P  √ √ Ers,` = E M M nP` nP` ` nP` nP` ` ` ` ` ` exp nP exp exp nP exp j=2 exp j=2 exp τ¯r2 τ¯r Ur1 + τ¯r Urj τ¯s2 τ¯s Us1 + τ¯s Usj  √  √ # nP` nP` ` ` M U exp U exp X ri si τ¯r τ¯s   √  P √ ·   √  P √  . + M M nP` nP` ` nP` nP` ` nP` nP` ` ` exp U + exp U exp U + exp U exp exp i=2 r1 s1 rj sj j=2 j=2 τ¯2 τ¯r τ¯r τ¯2 τ¯s τ¯s "

r

s

(223) ` , U ` )}, j ∈ [M ] are i.i.d. across index j, and for each In (223), the pairs of random variables {(Urj sj ` and U ` are jointly Gaussian with N (0, 1) marginals. j, Urj sj

43

The expectation of the first term on the right-hand side of (223) can be written as √

" " E E

nP` ` τ¯r Ur1 ) PM ` exp( −nP j=2 τ¯r2 )



nP` ` τ¯r Ur1 )



nP` ` τ¯s Us1 ) PM ` exp( −nP j=2 τ¯s2 )

!

exp(

exp(



nP` ` τ¯r Urj )

` ` exp( nP τ¯s Us1 ) +  √ √ ! ! nP` ` ` ` (a) exp( τnP U ) exp( U ) r1 s1 ¯r τ¯s  √ √ ≥ E nP` ` nP` ` −nP` ` exp( τ¯r Ur1 ) + M exp( 2¯τ 2 ) exp( τ¯s Us1 ) + M exp( −nP ) 2 2¯ τs r  ! ! 1 1  √ √ = E nP` ` nP` nP` ` ) ` U 1 + M exp(− 2¯τ 2 ) exp(− τ¯r Ur1 ) 1 + M exp(− 2¯τ 2 ) exp(− τnP s1 ¯s r s √ 1/2 ! √ 1/2 ! nP` nP` ` ` ≥ P Ur1 >− P Us1 >− τ¯r τ¯s ! ! 1 1     · nP` 3/4 nP` nP` 3/4 ` 1 + M exp(− nP ) exp −( ) 1 + M exp(− ) exp −( ) 2 2 2 2 2¯ τ τ¯ 2¯ τ τ¯

exp(

+

r

exp(

r

s



exp(

nP` ` τ¯s Usj )

## ! ` ` Ur1 , Us1

s

nP` 1 −→ 1 as M → ∞ if lim > 1. M →∞ 2¯ τr2 ln M (b)

(224)

In (224), (a) is obtained as follows. The inner expectation on the first line of the form EX,Y [f (X, Y )] 1 2 with f (X, Y ) = κ1κ+X · κ2κ+Y , where κ1 , κ2 are positive constants. Since f is a convex function of √

(X, Y ), Jensen’s inequality implies E[f (X, Y )] ≥ f (EX, EY ), with E[exp( Since Ers,` in (223) lies in [0, 1], (224) implies that lim Ers,` = 1 if

lim

M →∞

nP` 1 c` = > 1. 2 2¯ τr ln M 2R¯ τr2 ln 2

nP` ` τ¯r Ur1 )]

` = exp( nP ). 2¯ τ2 r

(225)

where we have used nR = L log M , noting P that c` := lim LP`2. Using this in (222), we conclude that 1 r ∗ s ¯r Zr )) η (β − τ¯s Zs )} ≥ ` P` 1{c` > 2(ln 2)R¯ τr }. Together with the upper bound in n E{(η (β − τ (221), this proves (219), and hence completes the proof.

Acknowledgement

The authors would like to thank A. Barron and S. Cho for several discussions regarding their soft-decision iterative decoder for sparse regression codes and its connections to the AMP algorithm.

References

[1] A. Barron and A. Joseph, "Least squares superposition codes of moderate dictionary size are reliable at rates up to capacity," IEEE Trans. on Inf. Theory, vol. 58, pp. 2541–2557, Feb. 2012.
[2] A. Joseph and A. R. Barron, "Fast sparse superposition codes have near exponential error probability for R < C," IEEE Trans. Inf. Theory, vol. 60, pp. 919–942, Feb. 2014.
[3] A. R. Barron and S. Cho, "High-rate sparse superposition codes with iteratively optimal estimates," in Proc. IEEE Int. Symp. Inf. Theory, 2012.
[4] S. Cho, High-dimensional regression with random design, including sparse superposition codes. PhD thesis, Yale University, 2014.
[5] D. L. Donoho, A. Maleki, and A. Montanari, "Message-passing algorithms for compressed sensing," Proceedings of the National Academy of Sciences, vol. 106, no. 45, pp. 18914–18919, 2009.
[6] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing: I. motivation and construction," in IEEE Information Theory Workshop (ITW), 2010.
[7] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inf. Theory, pp. 764–785, 2011.
[8] A. Montanari, "Graphical models concepts in compressed sensing," in Compressed Sensing (Y. C. Eldar and G. Kutyniok, eds.), pp. 394–438, Cambridge University Press, 2012.
[9] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, "Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices," Journal of Statistical Mechanics: Theory and Experiment, no. 8, 2012.
[10] S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in Proc. IEEE Int. Symp. Inf. Theory, pp. 2168–2172, 2011.
[11] D. L. Donoho, A. Javanmard, and A. Montanari, "Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing," IEEE Trans. Inf. Theory, pp. 7434–7464, Nov. 2013.
[12] R. Baraniuk, E. Candes, R. Nowak, and M. Vetterli (editors), "Special issue on compressive sampling," IEEE Signal Processing Magazine, vol. 25, March 2008.
[13] E. Candes and T. Tao, "Decoding by linear programming," IEEE Trans. Inf. Theory, vol. 51, pp. 4203–4215, Dec. 2005.
[14] D. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, pp. 1289–1306, April 2006.
[15] J. Tropp, "Just relax: convex programming methods for identifying sparse signals in noise," IEEE Trans. Inf. Theory, vol. 52, pp. 1030–1051, March 2006.
[16] U. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, "Approximate message passing with consistent parameter estimation and applications to sparse learning," IEEE Trans. on Inf. Theory, vol. 60, pp. 2969–2985, May 2014.
[17] P. Schniter, "A message-passing receiver for BICM-OFDM over unknown clustered-sparse channels," IEEE Journal of Selected Topics in Signal Processing, vol. 5, pp. 1462–1474, Dec. 2011.
[18] S. Som and P. Schniter, "Compressive imaging using approximate message passing and a Markov-tree prior," IEEE Trans. Signal Processing, vol. 7, pp. 3439–3448, July 2012.
[19] E. Bolthausen, "An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model," arXiv:1201.2891, 2012.
[20] A. Javanmard and A. Montanari, "State evolution for general approximate message passing algorithms, with applications to spatial coupling," Information and Inference, p. iat004, 2013.
[21] J. Barbier and F. Krzakala, "Replica analysis and approximate message passing decoder for sparse superposition codes," in Proc. IEEE Int. Symp. Inf. Theory, 2014.
[22] J. Barbier, C. Schulke, and F. Krzakala, "Approximate message-passing with spatially coupled structured operators, with applications to compressed sensing and sparse superposition codes," arXiv:1312.1740.
[23] A. Guillén i Fàbregas, A. Martinez, and G. Caire, Bit-interleaved coded modulation. Now Publishers Inc, 2008.
[24] N. Sommer, M. Feder, and O. Shalvi, "Low-density lattice codes," IEEE Trans. on Inf. Theory, vol. 54, pp. 1561–1585, April 2008.
[25] "Extended proof of steps 2(b) and 4(b)," Online. http://sigproc.eng.cam.ac.uk/foswiki/pub/Main/RV285/Steps_2b4b.pdf
