Universal Compressed Sensing of Markov Sources
arXiv:1406.7807v1 [cs.IT] 30 Jun 2014
Shirin Jalali, H. Vincent Poor
Abstract: The main promise of compressed sensing is accurate recovery of high-dimensional structured signals from an underdetermined set of randomized linear projections. Several types of structure, such as sparsity and low-rankness, have already been explored in the literature. For each type of structure, a recovery algorithm has been developed based on the properties of signals with that type of structure. Such algorithms recover any signal complying with the assumed model from a sufficient number of its linear projections. However, in many practical situations the underlying structure of the signal is not known, or is only partially known. Moreover, it is desirable to have recovery algorithms that can be applied to signals with different types of structure. In this paper, the problem of developing "universal" algorithms for compressed sensing of stochastic processes is studied. First, Rényi's notion of information dimension is generalized to analog stationary processes. This provides a measure of complexity for such processes and is connected to the number of measurements required for their accurate recovery. Then a minimum entropy pursuit (MEP) algorithm is proposed, and it is proven that it can reliably and robustly recover any Markov process of any order from a sufficient number of randomized linear measurements, without any prior information about the distribution of the process. An implementable version of MEP with the same performance guarantees is also provided.
I. INTRODUCTION

Consider the fundamental problem of compressed sensing (CS): a signal $x_o^n \in \mathbb{R}^n$ is observed through a data acquisition system that can be modeled as a linear projection system $y_o^m = A x_o^n$, where $A \in \mathbb{R}^{m\times n}$ denotes the measurement matrix. The decoder is interested in recovering the signal $x_o^n$ from the measurements $y_o^m$. In CS we are usually interested in the case where $x_o^n$ is high-dimensional and the number of measurements $m$ is much smaller than the ambient dimension of the signal, i.e., $m \ll n$. In such a setup, clearly, without any extra information, it is impossible to recover $x_o^n$ from $y_o^m$, as the system of linear equations described by $y_o^m = A x^n$ has infinitely many solutions. However, if some extra information about the structure of $x_o^n$ is available, for instance if it is known that $x_o^n$ is sparse, i.e., $\|x_o^n\|_0 \ll n$, then this additional information can be exploited and $x_o^n$ can be recovered from $y_o^m$. For sparse signals, it can be proved that the signal recovery can in fact be achieved efficiently and robustly [1]–[3]. This result can be extended to several other structures as well [4]–[11]. In each of these cases, the signal is known to have some specific structure, and the decoder takes advantage of its knowledge about the structure of the signal to recover it from its linear projections.
It is generally believed that if a signal is "structured" or of "low complexity", then it is possible to recover it from an under-determined set of its linear projections; moreover, the number of linear measurements required for successful recovery is expected to be proportional to the complexity of the signal rather than to its ambient dimension. In order to develop a recovery algorithm that can be applied to signals with various structures, one fundamental question to ask is: what does it mean, mathematically, for an analog signal to be of low complexity, and how can one measure the structure of a signal? Another relevant theoretical question, which is also important from a practical standpoint, is the following: is it possible to design a "universal" compressed sensing decoder that is able to recover structured signals from their randomized linear projections without knowing the underlying structure of the signal? Universal algorithms, defined as algorithms that achieve the optimal performance without knowing the source distribution, have been well-studied in the information theory literature. The existence of such algorithms has been proved for several different problems, such as lossless compression [12], [13], lossy compression [14]–[19], denoising [20], [21] and prediction [22], [23]. For some problems, such as universal lossless compression, there are well-known efficient algorithms, such as the Lempel-Ziv algorithm [12], [13], that are employed in various commercial products; for some others, such as universal lossy compression, designing implementable algorithms that are both efficient and optimal is still an ongoing effort. The existence of universal algorithms for compressed sensing has recently been proved in [24]–[26]. In [24], the authors define the Kolmogorov information dimension (KID) of a deterministic analog signal as a measure of its complexity.
The KID of a signal $x_o^n$ is defined as the growth rate of the Kolmogorov complexity of the quantized version of $x_o^n$, normalized by the log of the number of quantization levels, as the number of quantization levels grows to infinity. Employing this measure of complexity, the authors in [24] and [26] propose minimum complexity pursuit (MCP) as a universal signal recovery algorithm. MCP is based on Occam's razor [27], i.e., among all signals satisfying the linear measurement constraints, MCP seeks the one with the lowest complexity. While MCP proves the existence of universal compressed sensing algorithms, it is not an implementable algorithm, since it is based on minimizing Kolmogorov complexity [28], [29], which is not computable. In this paper we focus on stochastic signals and develop an "implementable" algorithm for "universal" compressed sensing of stochastic processes. As a measure of complexity for analog stochastic signals, we extend Rényi's information dimension measure [30] and define the information dimension of a stochastic process. We prove that for an independent and identically distributed (i.i.d.) process $\{X_i\}_{i=1}^\infty$, its information dimension is equal to the Rényi information dimension of $X_1$. It has recently been proved that for i.i.d. processes the Rényi information dimension characterizes the fundamental limits of (non-universal) compressed sensing [31].

Consider $X_o^n$ generated by an analog stationary process $\{X_i\}_{i=1}^\infty$, and assume that the decoder observes its linear projections $Y_o^m = A X_o^n$, where $m < n$. To recover $X_o^n$ from $Y_o^m$, in the same spirit as the MCP algorithm, we propose the minimum entropy pursuit (MEP) algorithm, which, among all the signals $x^n$ satisfying $Y_o^m = A x^n$, outputs the one whose quantized version has minimum conditional empirical entropy. We prove that asymptotically, for a proper choice of the number of quantization levels and the order of the conditional empirical entropy, and with slightly more than twice the (upper) information dimension of the process times the ambient dimension of the process randomized linear measurements, MEP provides a reliable and robust estimate of $X_o^n$. While MEP is not easy to implement, we also present an implementable version with the same asymptotic performance guarantees as MEP. The implementable version of MEP is similar to the heuristic algorithm proposed in [32] and [33] for universal compressed sensing.

The organization of the paper is as follows. Section II introduces the notation used in the paper and reviews some related background. Section III extends Rényi's information dimension of a random variable [30] to define the information dimension of a stationary process, explores its properties, and computes it for some processes. Section IV presents the MEP algorithm for universal compressed sensing of Markov sources of any order, and proves its robustness. Section V concludes the paper.

II. BACKGROUND

A. Notation

Calligraphic letters such as $\mathcal{X}$ and $\mathcal{Y}$ denote sets. For a discrete set $\mathcal{X}$, let $|\mathcal{X}|$ denote the size of $\mathcal{X}$. Given vectors $u^n, v^n \in \mathbb{R}^n$, let $\langle u^n, v^n\rangle$ denote their inner product, i.e., $\langle u^n, v^n\rangle \triangleq \sum_{i=1}^n u_i v_i$. Given $u^n \in \mathbb{R}^n$, $\|u^n\|_0 \triangleq |\{i: u_i \neq 0\}|$ denotes its $\ell_0$-norm, defined as the number of non-zero elements of $u^n$. Also, $\|u^n\|_2 \triangleq (\sum_{i=1}^n u_i^2)^{0.5}$ denotes the $\ell_2$-norm of $u^n$. For $1 \le i \le j \le n$, $u_i^j \triangleq (u_i, u_{i+1}, \ldots, u_j)$. To simplify the notation, $u^j \triangleq u_1^j$. The set of all finite-length binary sequences is denoted by $\{0,1\}^*$, i.e., $\{0,1\}^* \triangleq \cup_{n\ge 1}\{0,1\}^n$. Similarly, $\{0,1\}^\infty$ denotes the set of infinite-length binary sequences. Throughout the paper, $\log$ refers to the logarithm to base 2 and $\ln$ refers to the natural logarithm. Random variables are represented by upper-case letters such as $X$ and $Y$. The alphabet of the random variable $X$ is denoted by $\mathcal{X}$. Given a sample space $\Omega$ and an event $\mathcal{A} \subseteq \Omega$, $\mathbb{1}_{\mathcal{A}}$ denotes the indicator function of $\mathcal{A}$. Given $x \in \mathbb{R}$, $\delta_x$ denotes the Dirac measure with an atom at $x$, and $\lfloor x\rfloor$ ($\lceil x\rceil$) denotes the largest (smallest) integer smaller (larger) than or equal to $x$. Further, $[x]_b$ denotes the $b$-bit quantized version of $x$ that results from taking the first $b$ bits in the binary expansion of $x$. That is, for $x = \lfloor x\rfloor + \sum_{i=1}^\infty 2^{-i}(x)_i$, where $(x)_i \in \{0,1\}$,
$$[x]_b \triangleq \lfloor x\rfloor + \sum_{i=1}^b 2^{-i}(x)_i.$$
Also, for $x^n \in \mathbb{R}^n$, define $[x^n]_b \triangleq ([x_1]_b, \ldots, [x_n]_b)$.
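The quantization operator $[\cdot]_b$ above can be sketched in a few lines of Python. This is our illustrative implementation (the function names are ours, not the paper's): it keeps the integer part plus the first $b$ fractional bits of the binary expansion.

```python
import math

def quantize(x: float, b: int) -> float:
    """b-bit quantization [x]_b: keep floor(x) plus the first b bits
    of the binary expansion of the fractional part of x."""
    frac = x - math.floor(x)
    # the first b fractional bits correspond to floor(frac * 2^b) / 2^b
    return math.floor(x) + math.floor(frac * 2 ** b) / 2 ** b

def quantize_vec(xs, b):
    """Componentwise quantization [x^n]_b."""
    return [quantize(x, b) for x in xs]
```

For example, $[0.8125]_2 = 0.75$, since $0.8125 = (0.1101)_2$ and the first two fractional bits give $(0.11)_2 = 0.75$.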
B. Empirical distribution

Consider a stochastic process $\mathbf{X} = \{X_i\}_{i=1}^\infty$, with discrete alphabet $\mathcal{X}$ and probability measure $\mu(\cdot)$. The entropy rate of a stationary and ergodic process $\mathbf{X}$ is defined as
$$\bar{H}(\mathbf{X}) \triangleq \lim_{n\to\infty} \frac{H(X_1,\ldots,X_n)}{n}.$$
Alternatively, for such processes [34], the entropy rate can be characterized as the almost-sure limit
$$\lim_{n\to\infty} \frac{1}{n}\log\frac{1}{\mu(X^n)} = \bar{H}(\mathbf{X}), \quad \text{almost surely}.$$
The $k$-th order empirical distribution induced by $x^n \in \mathcal{X}^n$ is defined as
$$p_k(a^k \mid x^n) = \frac{\left|\{i: x_{i-k}^{i-1} = a^k,\ 1 \le i \le n\}\right|}{n},$$
where we make the circular assumption $x_j = x_{j+n}$, for $j \le 0$. We define the conditional empirical entropy induced by $x^n \in \mathcal{X}^n$, $\hat{H}_k(x^n)$, to be equal to $H(U_{k+1}\mid U^k)$, where $U^{k+1} \sim p_{k+1}(\cdot\mid x^n)$.
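For intuition, the $k$-th order empirical distribution and the conditional empirical entropy $\hat{H}_k$ can be computed directly from the definitions above. The following is a small illustrative Python sketch (ours, not the paper's), using the circular convention $x_j = x_{j+n}$ and the identity $H(U_{k+1}\mid U^k) = H(U^{k+1}) - H(U^k)$.

```python
from collections import Counter
from math import log2

def empirical_dist(x, k):
    """k-th order empirical distribution p_k(a^k | x^n), circular indexing."""
    n = len(x)
    counts = Counter(
        tuple(x[(i - k + j) % n] for j in range(k)) for i in range(n)
    )
    return {a: c / n for a, c in counts.items()}

def cond_empirical_entropy(x, k):
    """H_k(x^n) = H(U_{k+1} | U^k) = H(U^{k+1}) - H(U^k),
    where U^{k+1} ~ p_{k+1}(.|x^n)."""
    h = lambda p: -sum(v * log2(v) for v in p.values())
    return h(empirical_dist(x, k + 1)) - h(empirical_dist(x, k))
```

A constant sequence has $\hat{H}_k = 0$, while a sequence that looks i.i.d. uniform over its alphabet has $\hat{H}_k$ close to the log-size of the alphabet.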
C. Universal lossless compression

Consider the problem of universal lossless compression of discrete stationary ergodic sources, described as follows. A family of source codes $\{\mathcal{C}_n\}_{n\ge 1}$ consists of a sequence of codes corresponding to different blocklengths. Each code $\mathcal{C}_n$ in this family is defined by an encoder function $f_n$ and a decoder function $g_n$, such that $f_n: \mathcal{X}^n \to \{0,1\}^*$ and $g_n: \{0,1\}^* \to \hat{\mathcal{X}}^n$. Here $\hat{\mathcal{X}}$ denotes the reconstruction alphabet, which is also assumed to be discrete and in many cases is equal to $\mathcal{X}$. The encoder $f_n$ maps each source block $X^n$ to a binary sequence of finite length, and the decoder $g_n$ maps the coded bits back to the signal space as $\hat{X}^n = g_n(f_n(X^n))$. Let $l_n(f_n(X^n)) = |f_n(X^n)|$ denote the length of the binary sequence assigned to the sequence $X^n$. We assume that the codes are lossless (non-singular), i.e., $f_n(x^n) \neq f_n(\tilde{x}^n)$, for all $x^n \neq \tilde{x}^n$. A family of lossless codes is called universal if
$$\frac{1}{n}\mathrm{E}[l_n(f_n(X^n))] \to \bar{H}(\mathbf{X}),$$
for any discrete stationary process $\mathbf{X}$. A family of lossless codes is called point-wise universal if
$$\frac{1}{n} l_n(f_n(X^n)) \to \bar{H}(\mathbf{X}),$$
almost surely, for any discrete stationary ergodic process $\mathbf{X}$.

III. INFORMATION DIMENSION OF STATIONARY PROCESSES

Consider an arbitrarily distributed analog random variable $X$. For a positive integer $n$, let
$$\langle X\rangle_n \triangleq \frac{\lfloor nX\rfloor}{n}.$$
By this definition, $\langle X\rangle_n$ is a finite-alphabet approximation of the random variable $X$, such that
$$0 \le X - \langle X\rangle_n < \frac{1}{n}.$$
Rényi defined the upper and lower information dimensions of random variable $X$ in terms of the entropy of $\langle X\rangle_n$ as
$$\bar{d}(X) = \limsup_{n\to\infty} \frac{H(\langle X\rangle_n)}{\log n},$$
and
$$\underline{d}(X) = \liminf_{n\to\infty} \frac{H(\langle X\rangle_n)}{\log n},$$
respectively [30]. If $\bar{d}(X) = \underline{d}(X)$, then the information dimension of the random variable $X$ is defined as
$$d(X) = \lim_{n\to\infty} \frac{H(\langle X\rangle_n)}{\log n}.$$
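As a rough numerical sanity check (ours, not from the paper), the ratio $H(\langle X\rangle_n)/\log n$ can be estimated by Monte Carlo. For a variable that is $0$ with probability $1-p$ and Uniform$(0,1)$ with probability $p$, the ratio should approach $p$ as $n$ grows; the plug-in entropy estimate is biased at finite sample sizes, so the match is only approximate.

```python
import random
from collections import Counter
from math import log2

def entropy_bits(samples):
    """Plug-in (empirical) Shannon entropy, in bits, of discrete samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def info_dim_estimate(sample_x, n_quant, n_samples=200_000, seed=0):
    """Monte Carlo estimate of H(<X>_n)/log n, with <x>_n = floor(n*x)/n."""
    rng = random.Random(seed)
    q = [int(n_quant * sample_x(rng)) for _ in range(n_samples)]
    return entropy_bits(q) / log2(n_quant)

# X = 0 w.p. 0.7, Uniform(0,1) w.p. 0.3: information dimension is 0.3
mixed = lambda rng: rng.random() if rng.random() < 0.3 else 0.0
```

With `n_quant = 1024` this returns a value near 0.39 rather than 0.30: the discrete part contributes an $O(1/\log n)$ term, so the ratio converges to $p$ only slowly as the quantization is refined.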
The information content of analog signals is infinite; that is, such processes have infinite entropy rate. Define the $b$-bit quantized version of the stochastic process $\mathbf{X} = \{X_i\}_{i=1}^\infty$ as $[\mathbf{X}]_b = \{[X_i]_b\}_{i=1}^\infty$. Consider a stationary process $\mathbf{X} = \{X_i\}_{i=1}^\infty$; since $[\mathbf{X}]_b$ is derived from a stationary coding of $\mathbf{X}$, it is also a stationary process. We define the $k$-th order upper information dimension of a process $\mathbf{X}$ as
$$\bar{d}_k(\mathbf{X}) = \limsup_{b\to\infty} \frac{H([X_{k+1}]_b \mid [X^k]_b)}{b}.$$
Similarly, the $k$-th order lower information dimension of $\mathbf{X}$ is defined as
$$\underline{d}_k(\mathbf{X}) = \liminf_{b\to\infty} \frac{H([X_{k+1}]_b \mid [X^k]_b)}{b}.$$

Lemma 1. Both $\bar{d}_k(\mathbf{X})$ and $\underline{d}_k(\mathbf{X})$ are non-increasing in $k$.

Proof: For the stationary process $[\mathbf{X}]_b$, for any value of $k$,
$$\frac{H([X_{k+2}]_b \mid [X^{k+1}]_b)}{b} \le \frac{H([X_{k+2}]_b \mid [X_2^{k+1}]_b)}{b} = \frac{H([X_{k+1}]_b \mid [X^k]_b)}{b},$$
where the inequality holds because conditioning reduces entropy and the equality follows from stationarity. Taking $\limsup$ or $\liminf$ of both sides yields the desired result.

From Lemma 1, $\lim_{k\to\infty}\bar{d}_k(\mathbf{X})$ and $\lim_{k\to\infty}\underline{d}_k(\mathbf{X})$ exist. For a stationary process $\mathbf{X}$, we define its upper information dimension and lower information dimension as
$$\bar{d}_o(\mathbf{X}) = \lim_{k\to\infty} \bar{d}_k(\mathbf{X}),$$
and
$$\underline{d}_o(\mathbf{X}) = \lim_{k\to\infty} \underline{d}_k(\mathbf{X}),$$
respectively.

Lemma 2. For a stationary process $\mathbf{X}$,
$$\bar{d}_o(\mathbf{X}) = \lim_{k\to\infty} \frac{1}{k}\limsup_{b\to\infty} \frac{H([X^k]_b)}{b}.$$

Proof: Since the process is stationary,
$$H([X^k]_b) = \sum_{i=1}^k H([X_i]_b \mid [X^{i-1}]_b) = \sum_{i=1}^k H([X_k]_b \mid [X_{k-i+1}^{k-1}]_b) \ge k\, H([X_k]_b \mid [X^{k-1}]_b). \tag{1}$$
Therefore,
$$\frac{1}{k}\limsup_{b\to\infty} \frac{H([X^k]_b)}{b} \ge \limsup_{b\to\infty}\frac{H([X_k]_b \mid [X^{k-1}]_b)}{b} = \bar{d}_{k-1}(\mathbf{X}).$$
Taking the $\liminf$ of both sides as $k$ grows to infinity proves that
$$\liminf_{k\to\infty} \frac{1}{k}\limsup_{b\to\infty}\frac{H([X^k]_b)}{b} \ge \bar{d}_o(\mathbf{X}). \tag{2}$$
On the other hand,
$$\limsup_{b\to\infty} \frac{H([X^k]_b)}{bk} = \limsup_{b\to\infty} \frac{\sum_{i=1}^k H([X_k]_b \mid [X_{k-i+1}^{k-1}]_b)}{bk} \le \frac{1}{k}\sum_{i=1}^k \limsup_{b\to\infty}\frac{H([X_k]_b \mid [X_{k-i+1}^{k-1}]_b)}{b} = \frac{1}{k}\sum_{i=0}^{k-1} \bar{d}_i(\mathbf{X}). \tag{3}$$
Since
$$\lim_{k\to\infty}\frac{1}{k}\sum_{i=0}^{k-1}\bar{d}_i(\mathbf{X}) = \lim_{k\to\infty}\bar{d}_k(\mathbf{X}) = \bar{d}_o(\mathbf{X}),$$
taking the $\limsup$ of both sides of (3) yields
$$\limsup_{k\to\infty}\limsup_{b\to\infty}\frac{H([X^k]_b)}{bk} \le \bar{d}_o(\mathbf{X}). \tag{4}$$
The desired result follows from combining (2) and (4).
Lemma 3. For a stationary process $\mathbf{X}$ with $\mathcal{X} = [l, u]$, $\bar{d}_o(\mathbf{X}) \le 1$ and $\underline{d}_o(\mathbf{X}) \le 1$. Also, $\bar{d}_k(\mathbf{X}) \le 1$ and $\underline{d}_k(\mathbf{X}) \le 1$, for all $k$.

Proof: Note that
$$H([X_{k+1}]_b \mid [X^k]_b) \le \log\left((u-l)2^b\right) = \log(u-l) + b,$$
for all $b$ and all $k$, and therefore,
$$\frac{H([X_{k+1}]_b \mid [X^k]_b)}{b} \le 1 + \frac{\log(u-l)}{b}.$$
Taking $\limsup$ and $\liminf$ of both sides yields $\bar{d}_k(\mathbf{X}) \le 1$ and $\underline{d}_k(\mathbf{X}) \le 1$, which in turn proves that $\bar{d}_o(\mathbf{X}) \le 1$ and $\underline{d}_o(\mathbf{X}) \le 1$.

Proposition 1. For an i.i.d. random process $\mathbf{X} = \{X_i\}_{i=1}^\infty$, $\bar{d}_o(\mathbf{X})$ ($\underline{d}_o(\mathbf{X})$) is equal to $\bar{d}(X_1)$ ($\underline{d}(X_1)$), the Rényi upper (lower) information dimension of $X_1$.

Proof: Since the process is memoryless,
$$\bar{d}_k(\mathbf{X}) = \bar{d}_0(\mathbf{X}) = \limsup_{b\to\infty}\frac{H([X_1]_b)}{b}.$$
As proved in Proposition 2 of [31],
$$\bar{d}(X_1) = \limsup_{b\to\infty}\frac{H(\langle X_1\rangle_{2^b})}{b}.$$
Since $\langle X_1\rangle_{2^b} = [X_1]_b$, this yields the desired result.

The following theorem, which follows from Theorem 3 of [30] combined with Proposition 1, characterizes the information dimension of i.i.d. processes whose components are drawn from a mixture of continuous and discrete distributions.

Theorem 1 (Theorem 3 in [30]). Consider an i.i.d. process $\{X_i\}_{i=1}^\infty$, where each $X_i$ is distributed according to $(1-p)f_d + p f_c$, where $f_d$ and $f_c$ represent a discrete measure and an absolutely continuous measure, respectively. Assume that $H(\lfloor X_1\rfloor) < \infty$. Then,
$$\bar{d}_o(\mathbf{X}) = \underline{d}_o(\mathbf{X}) = d_o(\mathbf{X}) = p.$$

From Theorem 1, for an i.i.d. process with components drawn from an absolutely continuous distribution¹, the information dimension is equal to one. As the weight $p$ of the continuous part decreases, the information dimension decreases linearly with $p$.

Theorem 2. Consider a first-order stationary Markov process $\mathbf{X} = \{X_i\}_{i=1}^\infty$, such that conditioned on $X_{t-1} = x_{t-1}$, $X_t$ has a mixture of discrete and absolutely continuous distributions equal to $(1-p)\delta_{x_{t-1}} + p f_c$, where $f_c$ represents the pdf of an absolutely continuous distribution over $[0,1]$ with bounded differential entropy. Then, $\underline{d}_o(\mathbf{X}) = \bar{d}_o(\mathbf{X}) = p$.

¹A probability distribution is called absolutely continuous if it has a probability density function (pdf).
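To make the model of Theorem 2 concrete, the following illustrative sketch (ours; $f_c$ is taken to be Uniform$[0,1]$, which has bounded differential entropy) samples a realization: at each step the chain holds its previous value with probability $1-p$ and jumps to a fresh uniform value with probability $p$, producing piecewise constant realizations.

```python
import random

def piecewise_constant_markov(n, p, seed=0):
    """Sample X_1,...,X_n with X_t | X_{t-1}=x ~ (1-p)*delta_x + p*Uniform[0,1]."""
    rng = random.Random(seed)
    x = [rng.random()]
    for _ in range(n - 1):
        # jump to a fresh uniform value w.p. p, otherwise hold the previous value
        x.append(rng.random() if rng.random() < p else x[-1])
    return x
```

In a length-$n$ realization, the fraction of indices where the value changes concentrates around $p$, matching the intuition that only a fraction $p$ of the coordinates carry "continuous" information.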
The proof of Theorem 2 is presented in Appendix A.

Remark 1. Processes with piecewise constant realizations are among the standard models in signal processing [35], [36], and are studied in various problems such as denoising [37] and compressed sensing [3], [38].

Theorem 3. Consider a stationary Markov process of order $l$ such that, conditioned on $X_{t-l}^{t-1} = x_{t-l}^{t-1}$, $X_t$ is distributed as $\sum_{i=1}^l a_i x_{t-i} + Z_t$, where $a_i \in (0,1)$, for $i = 1,\ldots,l$, and $\{Z_t\}$ is an i.i.d. process distributed according to $(1-p)\delta_0 + p f_c$, where $f_c$ is the pdf of an absolutely continuous distribution. Let $\mathcal{Z}$ denote the support of $f_c$ and assume that there exist $0 < \alpha < \beta < \infty$ such that $\alpha < f_c(z) < \beta$, for $z \in \mathcal{Z}$. Then, $\underline{d}_o(\mathbf{X}) = \bar{d}_o(\mathbf{X}) = p$.

The proof of Theorem 3 is presented in Appendix B.

IV. UNIVERSAL COMPRESSED SENSING
Consider a stationary process $\{X_i\}_{i=1}^\infty$ such that $\bar{d}_o(\mathbf{X}) < 0.5$. As we argued in Section III, since $\bar{d}_o(\mathbf{X}) < 1$, we expect this process to be a structured process. In this section we explore universal compressed sensing of such processes. That is, we study acquiring a signal $X^n$ generated by the source $\mathbf{X}$ through underdetermined linear projections of $X^n$, without having any prior information about the source distribution.

A. Noiseless measurements

Consider the standard compressed sensing setup: instead of observing $x_o^n$, the decoder observes $y_o^m = A x_o^n$, where $A \in \mathbb{R}^{m\times n}$ denotes the linear measurement matrix, and typically $m \ll n$. To recover $x_o^n$ from $y_o^m$, we employ the following optimization algorithm:
$$\hat{x}_o^n = \arg\min_{A x^n = y_o^m} \hat{H}_k([x^n]_b). \tag{5}$$
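At toy scale, the optimization in (5) can be carried out by brute force: enumerate candidate signals on the quantized grid, keep those consistent with the measurements, and return the one with the smallest conditional empirical entropy. The sketch below is our illustration only (the paper does not prescribe an implementation); for simplicity the candidates already live on the quantized grid, so $x^n = [x^n]_b$.

```python
from itertools import product
from collections import Counter
from math import log2

def cond_emp_entropy(x, k):
    """Conditional empirical entropy H(U_{k+1}|U^k), circular windows."""
    n = len(x)
    def h(order):
        c = Counter(tuple(x[(i - order + j) % n] for j in range(order))
                    for i in range(n))
        return -sum((v / n) * log2(v / n) for v in c.values())
    return h(k + 1) - h(k)

def mep_bruteforce(A, y, n, levels, k=1, tol=1e-9):
    """Among all x^n on the grid levels^n with A x = y (within tol),
    return the candidate minimizing the conditional empirical entropy."""
    best, best_h = None, float("inf")
    for x in product(levels, repeat=n):
        resid = max(abs(sum(A[i][j] * x[j] for j in range(n)) - y[i])
                    for i in range(len(y)))
        if resid < tol:
            h = cond_emp_entropy(x, k)
            if h < best_h:
                best, best_h = x, h
    return best
```

For a constant candidate, $\hat{H}_k = 0$, so among measurement-consistent candidates a constant or piecewise constant signal is preferred, mirroring the minimum-entropy principle.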
We name this algorithm minimum entropy pursuit (MEP). The following theorem proves that MEP is a universal recovery algorithm for Markov processes of any order.

Theorem 4. Consider an aperiodic stationary Markov process $\{X_i\}_{i=1}^\infty$ of order $l$, with $\mathcal{X} = [0,1]$ and upper information dimension $\bar{d}_o(\mathbf{X})$. Let $b = b_n = \lceil \log\log n\rceil$, $k = k_n = o(\frac{\log n}{\log\log n})$ and $m = m_n \ge 2(1+\delta)\bar{d}_o(\mathbf{X}) n$, where $\delta > 0$. For each $n$, let the entries of the measurement matrix $A = A_n \in \mathbb{R}^{m\times n}$ be drawn i.i.d. according to $\mathcal{N}(0,1)$. For $X_o^n$ generated by the source $\mathbf{X}$ and $Y_o^m = A X_o^n$, let $\hat{X}_o^n = \hat{X}_o^n(Y_o^m, A)$ denote the solution of (5), i.e., $\hat{X}_o^n = \arg\min_{Ax^n = Y_o^m} \hat{H}_k([x^n]_b)$. Then,
$$\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 \xrightarrow{P} 0.$$

Proof: We show that for any $\epsilon > 0$,
$$P\left(\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 > \epsilon\right) \to 0,$$
as $n \to \infty$. Let $X_o^n = [X_o^n]_b + q_o^n$ and $\hat{X}_o^n = [\hat{X}_o^n]_b + \hat{q}_o^n$. By assumption, $A X_o^n = A\hat{X}_o^n$, and therefore $A([X_o^n]_b - [\hat{X}_o^n]_b) = A(\hat{q}_o^n - q_o^n)$. Note that
$$\|A([X_o^n]_b - [\hat{X}_o^n]_b)\|_2 = \|A(q_o^n - \hat{q}_o^n)\|_2 \le \sigma_{\max}(A)\,\|q_o^n - \hat{q}_o^n\|_2.$$
Define the event
$$\mathcal{E}_1 \triangleq \{\sigma_{\max}(A) \le \sqrt{n} + 2\sqrt{m}\}.$$
From [39], $P(\mathcal{E}_1^c) \le e^{-m/2}$. But
$$\|q_o^n - \hat{q}_o^n\|_2 \le \|q_o^n\|_2 + \|\hat{q}_o^n\|_2 \le \sqrt{n}\,2^{-b+1}.$$
Hence, conditioned on $\mathcal{E}_1$,
$$\|A(q_o^n - \hat{q}_o^n)\|_2 \le \sigma_{\max}(A)\,\|q_o^n - \hat{q}_o^n\|_2 \le n\left(1 + 2\sqrt{\frac{m}{n}}\right)2^{-b+1}.$$
As the next step, we want to lower bound $\|A([X_o^n]_b - [\hat{X}_o^n]_b)\|_2$. For a fixed vector $u^n$ and any $\tau \in (0,1)$, by Lemma 4 in Appendix C,
$$P\left(\|Au^n\|_2 \le \sqrt{m(1-\tau)}\,\|u^n\|_2\right) \le e^{\frac{m}{2}(\tau + \ln(1-\tau))}. \tag{6}$$
However, $[X_o^n]_b - [\hat{X}_o^n]_b$ is not a fixed vector. But, as we show next, with high probability we can upper bound the number of possible such vectors.

Since $\hat{X}_o^n$ is the solution of (5), we have
$$\hat{H}_k([\hat{X}_o^n]_b) \le \hat{H}_k([X_o^n]_b). \tag{7}$$
On the other hand, as proved in Appendix D,
$$\frac{1}{n}\ell_{\mathrm{LZ}}([\hat{X}_o^n]_b) \le \hat{H}_k([\hat{X}_o^n]_b) + \frac{b(kb + b + 3)}{(1-\epsilon_n)\log n - b} + \gamma_n, \tag{8}$$
where $\gamma_n = o(1)$ does not depend on $[\hat{X}_o^n]_b$ or $b$, and
$$\epsilon_n = \frac{\log\left((2^b - 1)(\log n)/b + 2^b - 2\right) + 2b}{\log n}.$$
Combining (7) and (8) and dividing both sides by $b = b_n$ yields
$$\frac{1}{n b_n}\ell_{\mathrm{LZ}}([\hat{X}_o^n]_{b_n}) \le \frac{\hat{H}_k([X_o^n]_{b_n})}{b_n} + \frac{k b_n + b_n + 3}{(1-\epsilon_n)\log n - b_n} + \frac{\gamma_n}{b_n}. \tag{9}$$
As the next step, we find an upper bound on $\hat{H}_k([X_o^n]_{b_n})/b_n$ that holds with high probability. The upper information dimension of process $\mathbf{X}$, $\bar{d}_o(\mathbf{X})$, is defined as $\lim_{k\to\infty}\bar{d}_k(\mathbf{X})$. By Lemma 1, $\bar{d}_k(\mathbf{X})$ is a non-increasing
function of $k$. Therefore, given $\delta_1 > 0$, there exists $k_{\delta_1} > 0$, such that for any $k > k_{\delta_1}$,
$$\bar{d}_o(\mathbf{X}) \le \bar{d}_k(\mathbf{X}) \le \bar{d}_o(\mathbf{X}) + \delta_1. \tag{10}$$
By the definition of $\bar{d}_{k_{\delta_1}}(\mathbf{X})$, given $\delta_2 > 0$, there exists $b_{\delta_2}$, such that for $b \ge b_{\delta_2}$,
$$\frac{H([X_{k_{\delta_1}+1}]_b \mid [X^{k_{\delta_1}}]_b)}{b} \le \bar{d}_{k_{\delta_1}}(\mathbf{X}) + \delta_2. \tag{11}$$
But $\hat{H}_k([X_o^n]_{b_n})/b_n$ is a non-increasing function of $k$. Therefore, since $k = k_n$ is by construction a diverging sequence, for $n$ large enough, $k_n > k_{\delta_1}$, and
$$\frac{\hat{H}_{k_n}([X_o^n]_{b_n})}{b_n} \le \frac{\hat{H}_{k_{\delta_1}}([X_o^n]_{b_n})}{b_n}.$$
We now prove that, given our choice of parameters, for large values of $n$, $\hat{H}_{k_{\delta_1}}([X_o^n]_{b_n})/b_n$ is, with high probability, close to
$$\frac{H([X_{k_{\delta_1}+1}]_{b_n}\mid[X^{k_{\delta_1}}]_{b_n})}{b_n}.$$
Since the original process is a stationary aperiodic Markov chain of order $l$, $\{[X_i]_b\}_{i=1}^\infty$ is also an aperiodic stationary Markov chain of order $l$, with a discrete alphabet. Theorem 7 states that for such a process, given $\epsilon_1 > 0$, there exists $g \in \mathbb{N}$, depending only on $\epsilon_1$ and the transition probabilities of the process $\{[X_i]_b\}_{i=1}^\infty$, such that for any $k > l$ and $n > 6(k+g)/\epsilon_1 + k$,
$$P\left(\|p_k(\cdot\mid[X^n]_b) - \mu_k\|_1 \ge \epsilon_1\right) \le 2^{c\epsilon_1^2/8}(k+g)\,n\,2^{kb}\,2^{-\frac{n c\epsilon_1^2}{8(k+g)}},$$
where $c = 1/(2\ln 2)$. Let
$$\mathcal{E}_2 \triangleq \left\{\|p_{k_{\delta_1}+1}(\cdot\mid[X^n]_{b_n}) - \mu_{k_{\delta_1}+1}\|_1 \le \frac{\epsilon_1}{k_{\delta_1}+1}\right\}.$$
Letting $k = k_{\delta_1}+1$, where $k_{\delta_1} > l$, and $b = b_n$, for $n$ large enough,
$$P(\mathcal{E}_2^c) \le 2^{\frac{c\epsilon_1^2}{8(k_{\delta_1}+1)^2}}(k_{\delta_1}+g+1)\,n\,2^{b_n(k_{\delta_1}+1)}\,2^{-\frac{n c\epsilon_1^2}{8(k_{\delta_1}+g+1)(k_{\delta_1}+1)^2}}. \tag{12}$$
Let $U^{k_{\delta_1}+1} \sim p_{k_{\delta_1}+1}(\cdot\mid[x^n]_{b_n})$. Then, conditioned on $\mathcal{E}_2$,
$$
\begin{aligned}
\left|\hat{H}_{k_{\delta_1}}([x^n]_{b_n}) - H([X_{k_{\delta_1}+1}]_{b_n}\mid[X^{k_{\delta_1}}]_{b_n})\right|
&= \left|H(U_{k_{\delta_1}+1}\mid U^{k_{\delta_1}}) - H([X_{k_{\delta_1}+1}]_{b_n}\mid[X^{k_{\delta_1}}]_{b_n})\right| \\
&= \left|H(U^{k_{\delta_1}+1}) - H(U^{k_{\delta_1}}) - H([X^{k_{\delta_1}+1}]_{b_n}) + H([X^{k_{\delta_1}}]_{b_n})\right| \\
&\le \left|H(U^{k_{\delta_1}+1}) - H([X^{k_{\delta_1}+1}]_{b_n})\right| + \left|H(U^{k_{\delta_1}}) - H([X^{k_{\delta_1}}]_{b_n})\right| \\
&\stackrel{(a)}{\le} -\frac{2\epsilon_1}{k_{\delta_1}+1}\log\left(\frac{\epsilon_1}{k_{\delta_1}+1}\right) + \epsilon_1 b_n,
\end{aligned} \tag{13}
$$
where (a) follows from Lemma 6 in Appendix C. Dividing both sides of (13) by $b_n$ yields
$$\left|\frac{\hat{H}_{k_{\delta_1}}([x^n]_{b_n})}{b_n} - \frac{H([X_{k_{\delta_1}+1}]_{b_n}\mid[X^{k_{\delta_1}}]_{b_n})}{b_n}\right| \le -\frac{2\epsilon_1}{(k_{\delta_1}+1)b_n}\log\left(\frac{\epsilon_1}{k_{\delta_1}+1}\right) + \epsilon_1. \tag{14}$$
On the other hand, $b_n$ is a diverging sequence in $n$. Therefore, for $n$ large enough, $b_n \ge b_{\delta_2}$, and as a result, combining (10), (11) and (14) yields that, for $n$ large enough, conditioned on $\mathcal{E}_2$,
$$\frac{\hat{H}_{k_n}([X_o^n]_{b_n})}{b_n} \le \bar{d}_o(\mathbf{X}) + \delta_3, \tag{15}$$
where $\delta_3 \triangleq \delta_1 + \delta_2 - \frac{2\epsilon_1}{(k_{\delta_1}+1)b_n}\log\frac{\epsilon_1}{k_{\delta_1}+1} + \epsilon_1$ can be made arbitrarily small by choosing $\epsilon_1$, $\delta_1$ and $\delta_2$ small enough. Furthermore, given our choice of parameters $b_n$ and $k_n$, from (9) and (15), conditioned on $\mathcal{E}_2$,
$$\frac{1}{n b_n}\ell_{\mathrm{LZ}}([\hat{X}_o^n]_{b_n}) \le \bar{d}_o(\mathbf{X}) + \delta_4, \tag{16}$$
and
$$\frac{1}{n b_n}\ell_{\mathrm{LZ}}([X_o^n]_{b_n}) \le \bar{d}_o(\mathbf{X}) + \delta_4, \tag{17}$$
where $\delta_4 \triangleq (k_n b_n + b_n + 3)/((1-\epsilon_n)\log n - b_n) + \gamma_n/b_n + \delta_3$ can be made arbitrarily small.

Let $\mathcal{C}_n \triangleq \{[x^n]_{b_n}: \ell_{\mathrm{LZ}}([x^n]_{b_n}) \le n b_n(\bar{d}_o(\mathbf{X}) + \delta_4)\}$. Since the Lempel-Ziv code is a uniquely decodable code,
each binary string corresponding to an LZ-coded sequence corresponds to a unique uncoded sequence. Hence, the number of sequences in $\mathcal{C}_n$, i.e., the number of quantized sequences satisfying the upper bound of (17), can be bounded as
$$|\mathcal{C}_n| \le \sum_{i=1}^{n b_n(\bar{d}_o(\mathbf{X})+\delta_4)} 2^i \le 2^{n b_n(\bar{d}_o(\mathbf{X})+\delta_4)+1}. \tag{18}$$
Define the event
$$\mathcal{E}_3 \triangleq \left\{\forall\, x_1^n, x_2^n \in [0,1]^n \text{ with } [x_1^n]_{b_n}, [x_2^n]_{b_n} \in \mathcal{C}_n:\ \|A([x_1^n]_{b_n} - [x_2^n]_{b_n})\|_2 \ge \sqrt{m(1-\tau)}\,\|[x_1^n]_{b_n} - [x_2^n]_{b_n}\|_2\right\}.$$
Combining the union bound, (6) and (18), it follows that
$$P(\mathcal{E}_3^c) \le 2^{2n b_n(\bar{d}_o(\mathbf{X})+\delta_4)+2}\, e^{\frac{m}{2}(\tau + \ln(1-\tau))}. \tag{19}$$
Finally, conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$,
$$\sqrt{m(1-\tau)}\,\|X_o^n - \hat{X}_o^n\|_2 \le n\left(1 + 2\sqrt{\frac{m}{n}}\right)2^{-b_n+1},$$
or
$$\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 \le \sqrt{\frac{n}{m(1-\tau)}}\left(1 + 2\sqrt{\frac{m}{n}}\right)2^{-b_n+1}. \tag{20}$$
By the union bound, $P((\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)^c) \le P(\mathcal{E}_1^c) + P(\mathcal{E}_2^c) + P(\mathcal{E}_3^c)$. Clearly $P(\mathcal{E}_1^c) \to 0$, as $n\to\infty$. Also, for our choice of parameter $b = b_n$, from (12), as $n \to \infty$, $P(\mathcal{E}_2^c) \to 0$ as well. Let
$$\tau_n = 1 - \frac{1}{(\log n)^{2/(1+\upsilon)}},$$
where $\upsilon \in (0,1)$. Then from (19), it follows that
$$
\begin{aligned}
P(\mathcal{E}_3^c) &\le 2^{2n b_n(\bar{d}_o(\mathbf{X})+\delta_4)+2}\, e^{\frac{m}{2}\left(1 - \frac{2}{1+\upsilon}\ln\log n\right)} \\
&\le 2^{2n(\log\log n+1)(\bar{d}_o(\mathbf{X})+\delta_4)+2}\, 2^{(1+\delta)\bar{d}_o(\mathbf{X})n\left(\log e - \frac{2}{1+\upsilon}\log\log n\right)} \\
&= 2^{2n(\log\log n)(1+\delta)\bar{d}_o(\mathbf{X})\left(\frac{1+\delta_5}{1+\delta} - \frac{1}{1+\upsilon} + \rho_n\right)},
\end{aligned} \tag{21}
$$
where $\delta_5 \triangleq \delta_4/\bar{d}_o(\mathbf{X})$ and $\rho_n = o(1)$. Given $\delta > 0$, choosing $\upsilon < (\delta - \delta_5)/(1 + \delta_5)$ ensures that (21) converges to zero, as $n \to \infty$. Note that $\delta_4$, and as a result $\delta_5$, can be made arbitrarily small for $n$ large enough. On the other hand, from (20), since always $m/n \le 1$, we have
$$\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 \le \frac{3(\log n)^{1/(1+\upsilon)}}{\sqrt{2(1+\delta)\bar{d}_o(\mathbf{X})}}\, 2^{-\log\log n + 1} \le \frac{6}{\sqrt{2(1+\delta)\bar{d}_o(\mathbf{X})}}\, 2^{-\log\log n\,\left(1 - \frac{1}{1+\upsilon}\right)}, \tag{22}$$
which goes to zero as $n$ grows to infinity. This concludes the proof.
Remark 2. For i.i.d. processes, the number of measurements required by MEP for reliable recovery is twice the fundamental limit of the non-universal setup characterized in [31]. Since we do not yet have a lower bound on the number of measurements required by a universal compressed sensing algorithm, it is not clear to us at this stage whether this factor of two is the price of universality, the sub-optimality of MEP, or an artifact of our proof technique.

The optimization presented in (5) is not easy to handle. While the search domain is a hyperplane, i.e., the set of points satisfying $Ax^n = y_o^m$, the cost function is defined on a discretized space. To move towards designing an implementable universal compressed sensing algorithm, consider the following algorithm:
$$\hat{x}_o^n = \arg\min_{x^n} \left(\hat{H}_k([x^n]_b) + \frac{\lambda}{n^2}\|A[x^n]_b - y_o^m\|_2^2\right). \tag{23}$$
We refer to this algorithm as Lagrangian-MEP. The main difference between (5) and (23) is that in (23) the search is done over the discretized space. This algorithm is similar to the heuristic algorithm proposed in [32] and [33] based on universal priors for universal compressed sensing. The advantage of Lagrangian-MEP compared to MEP is that, as discussed in [32] and [33], Markov chain Monte Carlo (MCMC) and simulated annealing [40]–[42] can be employed to approximate the optimizer of Lagrangian-MEP. The following theorem shows that (23) approximates
the solution of MEP with no loss in the asymptotic performance.

Theorem 5. Consider an aperiodic stationary Markov process $\{X_i\}_{i=1}^\infty$ of order $l$, with $\mathcal{X} = [0,1]$ and upper information dimension $\bar{d}_o(\mathbf{X})$. Let $b = b_n = \lceil r\log\log n\rceil$, where $r > 1$, $k = k_n = o(\frac{\log n}{\log\log n})$, $\lambda = \lambda_n = (\log n)^{2r}$ and $m = m_n \ge 2(1+\delta)\bar{d}_o(\mathbf{X})n$, where $\delta > 0$. For each $n$, let the entries of the measurement matrix $A = A_n \in \mathbb{R}^{m\times n}$ be drawn i.i.d. according to $\mathcal{N}(0,1)$. Given $X_o^n$ generated by source $\mathbf{X}$ and $Y_o^m = A X_o^n$, let $\hat{X}_o^n = \hat{X}_o^n(Y_o^m, A)$ denote the solution of (23), i.e., $\hat{X}_o^n = \arg\min_{x^n}(\hat{H}_k([x^n]_b) + \frac{\lambda}{n^2}\|A[x^n]_b - Y_o^m\|_2^2)$. Then,
$$\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 \xrightarrow{P} 0.$$

Proof: We need to prove that for any $\epsilon > 0$,
$$P\left(\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 > \epsilon\right) \to 0,$$
as $n \to \infty$. Throughout the proof, $\bar{d}_o$ refers to $\bar{d}_o(\mathbf{X})$. As before, let $X_o^n = [X_o^n]_b + q_o^n$ and $\hat{X}_o^n = [\hat{X}_o^n]_b + \hat{q}_o^n$, where, as we proved earlier, $\|q_o^n\|_2, \|\hat{q}_o^n\|_2 \le \sqrt{n}\,2^{-b}$. Since $\hat{X}_o^n = \arg\min(\hat{H}_k([x^n]_b) + \frac{\lambda}{n^2}\|A[x^n]_b - Y_o^m\|_2^2)$, we have
$$\hat{H}_k([\hat{X}_o^n]_b) + \frac{\lambda}{n^2}\|A[\hat{X}_o^n]_b - Y_o^m\|_2^2 \le \hat{H}_k([X_o^n]_b) + \frac{\lambda}{n^2}\|Aq_o^n\|_2^2 \le \hat{H}_k([X_o^n]_b) + \frac{\lambda\,\sigma_{\max}(A)^2\, 2^{-2b}}{n}. \tag{24}$$
Let $\mathcal{E}_1 \triangleq \{\sigma_{\max}(A) \le \sqrt{n} + 2\sqrt{m}\}$, where from [39], $P(\mathcal{E}_1^c) \le e^{-m/2}$. Also, given $\epsilon > 0$, define
$$\mathcal{E}_2 \triangleq \left\{\frac{1}{b}\hat{H}_k([X_o^n]_b) \le \bar{d}_o + \epsilon\right\}.$$
In the proof of Theorem 4, we showed that, for any $\epsilon > 0$, given our choice of parameters, $P(\mathcal{E}_2^c)$ converges to zero as $n$ grows to infinity. Conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2$, from (24), we derive
$$\frac{1}{b}\hat{H}_k([\hat{X}_o^n]_b) + \frac{\lambda}{bn^2}\|A[\hat{X}_o^n]_b - Y_o^m\|_2^2 \le \bar{d}_o + \epsilon + \frac{\lambda 2^{-2b}}{b}\left(1 + 2\sqrt{\frac{m}{n}}\right)^2 \le \bar{d}_o + \epsilon + \frac{9\lambda 2^{-2b}}{b}, \tag{25}$$
where the last inequality holds since for $m \le n$, $1 + 2\sqrt{m/n} \le 3$. On the other hand, for $b = b_n = \lceil r\log\log n\rceil$ and $\lambda = \lambda_n = (\log n)^{2r}$, we have
$$\frac{9\lambda 2^{-2b}}{b} \le \frac{9(\log n)^{2r}}{r(\log n)^{2r}\log\log n} = \frac{9}{r\log\log n},$$
which goes to zero as $n$ grows to infinity. For $n$ large enough, $9/(r\log\log n) \le \epsilon$, and hence from (25),
$$\frac{1}{b}\hat{H}_k([\hat{X}_o^n]_b) + \frac{\lambda}{bn^2}\|A[\hat{X}_o^n]_b - Y_o^m\|_2^2 \le \bar{d}_o + 2\epsilon,$$
which implies that
$$\frac{1}{b}\hat{H}_k([\hat{X}_o^n]_b) \le \bar{d}_o + 2\epsilon, \tag{26}$$
and, since $\bar{d}_o \le 1$,
$$\sqrt{\frac{\lambda}{bn^2}}\,\|A[\hat{X}_o^n]_b - Y_o^m\|_2 \le \sqrt{1 + 2\epsilon}. \tag{27}$$
For $x^n \in [0,1]^n$, from (D.27) in Appendix D,
$$\frac{1}{n}\ell_{\mathrm{LZ}}([x^n]_b) \le \hat{H}_k([x^n]_b) + \frac{b(kb+b+3)}{(1-\epsilon_n)\log n - b} + \gamma_n,$$
where $\epsilon_n = o(1)$ and $\gamma_n = o(1)$ are both independent of $x^n$. Therefore, there exists $n_\epsilon$ such that for $n > n_\epsilon$,
$$\frac{1}{nb}\ell_{\mathrm{LZ}}([x^n]_b) \le \frac{1}{b}\hat{H}_k([x^n]_b) + \epsilon.$$
On the other hand, conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2$, $\frac{1}{b}\hat{H}_k([X_o^n]_{b_n}) \le \bar{d}_o + \epsilon$ and $\frac{1}{b}\hat{H}_k([\hat{X}_o^n]_{b_n}) \le \bar{d}_o + 2\epsilon$. Therefore, conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2$, $[X_o^n]_b, [\hat{X}_o^n]_b \in \mathcal{C}_n$, where
$$\mathcal{C}_n \triangleq \left\{[x^n]_{b_n}: \frac{1}{nb}\ell_{\mathrm{LZ}}([x^n]_{b_n}) \le \bar{d}_o + 3\epsilon\right\}.$$
Define the event $\mathcal{E}_3$ as
$$\mathcal{E}_3 \triangleq \left\{\|A([\hat{X}_o^n]_b - [X_o^n]_b)\|_2 \ge \sqrt{(1-\tau)m}\,\|[\hat{X}_o^n]_b - [X_o^n]_b\|_2\right\},$$
where $\tau > 0$. Note that
$$P((\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3)^c) \le P(\mathcal{E}_1^c) + P(\mathcal{E}_2^c) + P(\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3^c) \le P(\mathcal{E}_1^c) + P(\mathcal{E}_2^c) + P(\mathcal{E}_3^c\mid\mathcal{E}_1\cap\mathcal{E}_2). \tag{28}$$
Since, conditioned on $\mathcal{E}_1\cap\mathcal{E}_2$, $[X_o^n]_b, [\hat{X}_o^n]_b \in \mathcal{C}_n$, as we argued in the proof of Theorem 4, by the union bound it follows that
$$P(\mathcal{E}_3^c\mid\mathcal{E}_1\cap\mathcal{E}_2) \le 2^{2(\bar{d}_o+3\epsilon)bn}\, e^{\frac{m}{2}(\tau+\ln(1-\tau))}.$$
Let $\tau = 1 - (\log n)^{-\frac{2r}{1+f}}$, where $f > 0$. Then, since $b \le r\log\log n + 1$ and $m = m_n > 2(1+\delta)\bar{d}_o n$,
$$
\begin{aligned}
P(\mathcal{E}_3^c\mid\mathcal{E}_1\cap\mathcal{E}_2) &\le 2^{2(\bar{d}_o+3\epsilon)(r\log\log n+1)n}\, 2^{\frac{m}{2}\left(\log e - \frac{2r}{1+f}\log\log n\right)} \\
&= 2^{2r\log\log n\left(n(\bar{d}_o+3\epsilon) - \frac{m}{2(1+f)} + \alpha_n\right)} \\
&\le 2^{2r(\log\log n)\,n\left(\bar{d}_o + 3\epsilon - \left(\frac{1+\delta}{1+f}\right)\bar{d}_o + \frac{1}{n}\alpha_n\right)},
\end{aligned} \tag{29}
$$
where $\alpha_n = o(1)$. Since $\epsilon > 0$ and $f > 0$ can be selected arbitrarily small, they can be chosen such that $3\epsilon < (\delta - f)\bar{d}_o/(1+f)$, which yields $P(\mathcal{E}_3^c\mid\mathcal{E}_1\cap\mathcal{E}_2) \to 0$, as $n \to \infty$.
On the other hand, from the triangle inequality, $\|A[\hat{X}_o^n]_b - Y_o^m\|_2 \ge \|A([\hat{X}_o^n]_b - [X_o^n]_b)\|_2 - \|Aq_o^n\|_2$. Therefore, conditioned on $\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$, from (27),
$$
\begin{aligned}
\sqrt{\frac{\lambda(1-\tau)m}{bn^2}}\,\|[\hat{X}_o^n]_b - [X_o^n]_b\|_2 &\le \sqrt{\frac{\lambda}{bn^2}}\,\|Aq_o^n\|_2 + \sqrt{1+2\epsilon} \\
&\le \sqrt{\frac{\lambda}{b}}\left(1 + 2\sqrt{\frac{m}{n}}\right)2^{-b} + \sqrt{1+2\epsilon} \\
&\le \sqrt{\frac{9\lambda}{b}}\,2^{-b} + \sqrt{1+2\epsilon},
\end{aligned} \tag{30}
$$
or
$$\frac{1}{\sqrt{n}}\|[\hat{X}_o^n]_b - [X_o^n]_b\|_2 \le 2^{-b}\sqrt{\frac{9n}{(1-\tau)m}} + \sqrt{\frac{(1+2\epsilon)bn}{(1-\tau)\lambda m}}.$$
Hence, for $\tau = 1 - (\log n)^{-\frac{2r}{1+f}}$, $\lambda = (\log n)^{2r}$, and $b = \lceil r\log\log n\rceil$,
$$\frac{1}{\sqrt{n}}\|[\hat{X}_o^n]_b - [X_o^n]_b\|_2 \le \frac{3 + \sqrt{(1+2\epsilon)(r\log\log n + 1)}}{\sqrt{2\bar{d}_o(1+\delta)}\,(\log n)^{\frac{rf}{1+f}}}, \tag{31}$$
which can be made arbitrarily small.
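As noted above, MCMC and simulated annealing can be used to approximate the Lagrangian-MEP optimizer in (23). The following is a minimal illustrative sketch (ours, not the algorithm of [32], [33]): single-coordinate proposals over the quantized alphabet, accepted by the Metropolis rule under a slowly decreasing temperature. Since candidates live on the quantized grid, $A[x^n]_b = Ax^n$ here.

```python
import math
import random
from collections import Counter

def lagrangian_cost(x, A, y, k, lam):
    """Lagrangian-MEP objective: H_k(x) + (lam / n^2) * ||A x - y||_2^2."""
    n = len(x)
    def h(order):
        c = Counter(tuple(x[(i - order + j) % n] for j in range(order))
                    for i in range(n))
        return -sum((v / n) * math.log2(v / n) for v in c.values())
    resid = sum((sum(Ai[j] * x[j] for j in range(n)) - yi) ** 2
                for Ai, yi in zip(A, y))
    return h(k + 1) - h(k) + lam * resid / n ** 2

def lagrangian_mep_sa(A, y, n, levels, k=1, lam=1e4, iters=5000, seed=0):
    """Simulated annealing over the quantized grid levels^n."""
    rng = random.Random(seed)
    x = [rng.choice(levels) for _ in range(n)]
    cost = lagrangian_cost(x, A, y, k, lam)
    for t in range(1, iters + 1):
        i, v = rng.randrange(n), rng.choice(levels)
        if v == x[i]:
            continue
        old = x[i]
        x[i] = v
        new_cost = lagrangian_cost(x, A, y, k, lam)
        temp = 1.0 / math.log(t + 1)  # logarithmic cooling schedule
        if new_cost <= cost or rng.random() < math.exp(-(new_cost - cost) / temp):
            cost = new_cost  # accept the move
        else:
            x[i] = old       # reject and restore
    return x
```

In practice, [32] and [33] update a cached table of empirical counts rather than recomputing the empirical entropy from scratch at every step; the full recomputation here is for clarity only.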
B. Robustness to noise

In the previous section we assumed that the measurements are perfect and noise-free. In this section we study the case where the measurements are corrupted by noise. In this case, instead of $Ax_o^n$, the decoder observes $y_o^m = A x_o^n + z^m$, where $z^m$ denotes the noise in the measurement system, and employs the following optimization algorithm to recover $x_o^n$:
$$\hat{x}^n = \arg\min_{\hat{H}_k([x^n]_b) \le C} \|Ax^n - y_o^m\|_2, \tag{32}$$
where $C$ is a constant depending on the source complexity.

Theorem 6. Consider an aperiodic stationary Markov process $\{X_i\}_{i=1}^\infty$ of order $l$, with $\mathcal{X} = [0,1]$ and upper information dimension $\bar{d}_o(\mathbf{X})$. Consider a measurement matrix $A = A_n \in \mathbb{R}^{m\times n}$ with i.i.d. entries distributed according to $\mathcal{N}(0,1)$. For $X_o^n$ generated by the source $\mathbf{X}$, we observe $Y_o^m = A X_o^n + Z^m$, where $Z^m$ denotes the measurement noise. Assume that there exists a deterministic sequence $c_m$ such that $\lim_{m\to\infty} P(\|Z^m\|_2 > c_m) = 0$ and $c_m = o(\frac{m}{\log m})$. Let $\hat{X}_o^n = \hat{X}_o^n(Y_o^m, A)$ denote the solution of (32) with $C = b(1+\delta)\bar{d}_o(\mathbf{X})$, where $\delta > 0$, i.e.,
$$\hat{X}_o^n = \arg\min_{\hat{H}_k([x^n]_b)\le b(1+\delta)\bar{d}_o(\mathbf{X})} \|Ax^n - Y_o^m\|_2. \tag{33}$$
Let $b = b_n = \lceil \log\log n\rceil$, $k = k_n = o(\frac{\log n}{\log\log n})$ and $m = m_n \ge 2(1+3\delta)\bar{d}_o(\mathbf{X}) n$. Then,
$$\frac{1}{\sqrt{n}}\|X_o^n - \hat{X}_o^n\|_2 \xrightarrow{P} 0.$$
Proof: Throughout the proof, for simplicity, $\bar{d}_o$ denotes $\bar{d}_o(\mathbf{X})$. Define the event $\mathcal{E}_0$ as
$$\mathcal{E}_0 \triangleq \{\hat{H}_k([X_o^n]_b) \le b(1+\delta)\bar{d}_o\}.$$
In the proof of Theorem 4, we proved that, given our choice of parameters, for any fixed $\delta > 0$, $P(\mathcal{E}_0^c) \to 0$, as $n$ grows to infinity. Conditioned on $\mathcal{E}_0$, $X_o^n$ satisfies the constraint of (33), and since $\hat{X}_o^n$ is the minimizer of (33), we have
$$\|A\hat{X}_o^n - Y_o^m\|_2 \le \|AX_o^n - Y_o^m\|_2 = \|Z^m\|_2. \tag{34}$$
Let $X_o^n = [X_o^n]_b + q_o^n$ and $\hat{X}_o^n = [\hat{X}_o^n]_b + \hat{q}_o^n$. Then, from (34),
$$\|A([X_o^n]_b - [\hat{X}_o^n]_b) + A(q_o^n - \hat{q}_o^n) - Z^m\|_2 \le \|Z^m\|_2,$$
and by the triangle inequality,
$$\|A([X_o^n]_b - [\hat{X}_o^n]_b)\|_2 \le \|A(q_o^n - \hat{q}_o^n)\|_2 + 2\|Z^m\|_2. \tag{35}$$
Let
$$U^m \triangleq \frac{A([X_o^n]_b - [\hat{X}_o^n]_b)}{\|[X_o^n]_b - [\hat{X}_o^n]_b\|_2}.$$
By this definition, (35) can be rewritten as
$$\|[X_o^n]_b - [\hat{X}_o^n]_b\|_2\,\|U^m\|_2 \le \|A(q_o^n - \hat{q}_o^n)\|_2 + 2\|Z^m\|_2. \tag{36}$$
Define the following events:
$$\mathcal{E}_1 \triangleq \{\sigma_{\max}(A) \le \sqrt{n} + 2\sqrt{m}\},$$
$$\mathcal{E}_2 \triangleq \{\|U^m\|_2 \ge \sqrt{(1-\tau)m}\},$$
$$\mathcal{E}_3 \triangleq \{\|Z^m\|_2 \le c_m\},$$
where $\tau > 0$. Conditioned on $\mathcal{E}_0\cap\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$, since $\|q_o^n - \hat{q}_o^n\|_2 \le 2^{-b+1}\sqrt{n}$, it follows from (36) that, for $n$ large enough and since $m \le n$,
$$\frac{1}{\sqrt{n}}\|[X_o^n]_b - [\hat{X}_o^n]_b\|_2 \le \frac{2^{-b+1}\left(1+2\sqrt{\frac{m}{n}}\right)\sqrt{n}}{\sqrt{(1-\tau)m}} + \frac{2c_m}{\sqrt{(1-\tau)mn}} \le \frac{6}{2^b}\sqrt{\frac{n}{(1-\tau)m}} + \frac{2c_m}{\sqrt{(1-\tau)mn}}. \tag{37}$$
To set the parameter $\tau$, we first study the probability of $\mathcal{E}_0 \cap \mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$. Note that
\[
P\Big(\Big(\bigcap_{i=0}^{3}\mathcal{E}_i\Big)^c\Big) \le P(\mathcal{E}_0^c) + P(\mathcal{E}_1^c) + P(\mathcal{E}_2^c \mid \mathcal{E}_0) + P(\mathcal{E}_3^c).
\]
As mentioned earlier, $P(\mathcal{E}_0^c) \to 0$ as $n$ grows to infinity, and, from [39], $P(\mathcal{E}_1^c) \le e^{-m/2}$. Furthermore, since $m$ grows linearly with $n$, by the theorem's assumption, $P(\mathcal{E}_3^c) \to 0$ as $n \to \infty$. In (D.27) of Appendix D, we showed that, for $x^n \in [0,1]^n$,
\[
\frac1n \ell_{LZ}([x^n]_b) \le \hat H_k([x^n]_b) + \frac{b(kb+b+3)}{(1-\epsilon_n)\log n - b} + \gamma_n,
\]
where $\epsilon_n = o(1)$ and $\gamma_n = o(1)$ are both independent of $x^n$. Therefore, given $\delta' > 0$, $b = b_n = \lceil\log\log n\rceil$ and $k = k_n = o(\frac{\log n}{\log\log n})$, there exists $n_{\delta'}$ such that, for $n > n_{\delta'}$,
\[
\frac{1}{nb}\ell_{LZ}([x^n]_b) \le \frac1b \hat H_k([x^n]_b) + \delta'.
\]
On the other hand, by construction, $\frac1b\hat H_k([\hat X_o^n]_{b_n}) \le \bar d_o(1+\delta)$, and, conditioned on $\mathcal{E}_0$, $\frac1b\hat H_k([X_o^n]_{b_n}) \le \bar d_o(1+\delta)$. Choosing $\delta' = \delta\bar d_o$, for $n$ large enough,
\[
\frac{1}{nb}\ell_{LZ}([X_o^n]_b) \le \bar d_o(1+2\delta) \quad\text{and}\quad \frac{1}{nb}\ell_{LZ}([\hat X_o^n]_b) \le \bar d_o(1+2\delta).
\]
As argued in the proof of Theorem 4, since the Lempel-Ziv code is uniquely decodable,
\[
\big|\{[x^n]_b : \tfrac{1}{nb}\ell_{LZ}([x^n]_b) \le \bar d_o(1+2\delta)\}\big| \le 2^{nb(1+2\delta)\bar d_o + 1}.
\]
Therefore, by the union bound and Lemma 4,
\[
P(\mathcal{E}_2^c \mid \mathcal{E}_0) \le 2^{2nb(1+2\delta)\bar d_o + 1}\, e^{\frac m2(\tau + \ln(1-\tau))}. \tag{38}
\]
Let $\tau = \tau_n = 1 - (\log n)^{-\frac{2}{1+f}}$, where $f > 0$. Then, for $b = \lceil\log\log n\rceil$ and $m > 2n(1+3\delta)\bar d_o$, from (38),
\begin{align*}
P(\mathcal{E}_2^c \mid \mathcal{E}_0) &\le 2^{2n(\log\log n+1)(1+2\delta)\bar d_o + 1}\, 2^{\frac m2\left(\tau\log e - \frac{2}{1+f}\log\log n\right)}\\
&\le 2^{2n(\log\log n+1)(1+2\delta)\bar d_o + 1}\, 2^{\frac{2n(1+3\delta)\bar d_o}{2}\left(\tau\log e - \frac{2}{1+f}\log\log n\right)}\\
&= 2^{2n\bar d_o\log\log n\left(1+2\delta - \frac{1+3\delta}{1+f} + \eta_n\right)},
\end{align*}
where $\eta_n \to 0$ as $n \to \infty$. Choosing $f$ small enough such that $1+2\delta - \frac{1+3\delta}{1+f} < 0$ guarantees that $P(\mathcal{E}_2^c \mid \mathcal{E}_0) \to 0$ as well. Finally, conditioned on $\mathcal{E}_0\cap\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3$, combining (37) with the choices of $b$, $m$ and $\tau_n$ yields that, for any $\epsilon > 0$ and $n$ large enough,
\[
\frac{1}{\sqrt n}\|X_o^n - \hat X_o^n\|_2 \le \frac{6(\log n)^{-f/(1+f)}}{\sqrt{2(1+3\delta)\bar d_o}} + \frac{2c_m\log n}{m} + 2^{-b+1} < \epsilon,
\]
which proves that $\lim_{n\to\infty} P(\frac{1}{\sqrt n}\|X_o^n - \hat X_o^n\|_2 > \epsilon) = 0$.
V. CONCLUSIONS

In this paper we have studied the problem of universal compressed sensing, i.e., the problem of recovering "structured" signals from an under-determined set of random linear projections without prior information about the structure of the signal. We have considered structured signals modeled by stationary processes. We have generalized Rényi's information dimension and defined the information dimension of a stationary process, as a measure of complexity for such processes. We have also calculated the information dimension of some stationary processes, such as the Markov processes used to model piecewise-constant signals. We have then introduced the MEP algorithm for universal signal recovery. The algorithm is based on Occam's razor: among the signals satisfying the measurement constraints, it seeks the "simplest" signal, where the complexity of a signal is measured by the conditional empirical entropy of its quantized version. We have proved that for an aperiodic stationary Markov chain with upper information dimension $\bar d_o(X)$, the algorithm requires slightly more than $2\bar d_o(X)n$ random linear measurements to recover the signal. We have also provided an implementable version of the MEP algorithm with the same asymptotic performance as the original MEP.

APPENDIX A
PROOF OF THEOREM 2
Since the process is a stationary first-order Markov process, $H([X_{k+1}]_b \mid [X^k]_b) = H([X_{k+1}]_b \mid [X_k]_b) = H([X_2]_b \mid [X_1]_b)$, and therefore $\bar d_k(X) = \bar d_1(X)$ and $\underline d_k(X) = \underline d_1(X)$, for all $k \ge 1$. Let $\mathcal X_b = \{\sum_{i=1}^{b} b_i 2^{-i} : b_i \in \{0,1\},\ i = 1,\dots,b\}$ denote the alphabet at resolution $b$. Clearly, $|\mathcal X_b| = 2^b$. Then,
\[
H([X_2]_b \mid [X_1]_b) = \sum_{c \in \mathcal X_b} P([X_1]_b = c)\, H([X_2]_b \mid [X_1]_b = c).
\]
We next prove that $\frac1b H([X_2]_b \mid [X_1]_b = c_1)$ converges to $p$, uniformly in $c_1$, as $b$ grows to infinity. Define the indicator random variable $I = \mathbb 1_{X_2 = X_1}$. Given the transition probabilities of the Markov chain, $I$ is independent of $X_1$, and $P(I = 1) = 1 - p$. Also define a random variable $U$, independent of $(X_1, X_2)$ and
distributed according to $f_c$. Then, it follows that
\begin{align*}
P([X_2]_b = c_2 \mid [X_1]_b = c_1) &= P([X_2]_b = c_2, I = 0 \mid [X_1]_b = c_1) + P([X_2]_b = c_2, I = 1 \mid [X_1]_b = c_1)\\
&= p\, P([X_2]_b = c_2 \mid [X_1]_b = c_1, I = 0) + (1-p)\, P([X_2]_b = c_2 \mid [X_1]_b = c_1, I = 1)\\
&= p\, P([U]_b = c_2) + (1-p)\,\mathbb 1_{c_2 = c_1},
\end{align*}
where the last line follows from the fact that, conditioned on $X_2 \ne X_1$, $X_2$ is distributed according to $f_c$, independently of the value of $X_1$. For $a \in \mathcal X_b$,
\[
P([U]_b = a) = \int_{a}^{a + 2^{-b}} f_c(u)\,du. \tag{A.1}
\]
On the other hand, by the mean value theorem, there exists $x_a \in [a, a + 2^{-b}]$ such that
\[
2^b \int_{a}^{a+2^{-b}} f_c(u)\,du = f_c(x_a).
\]
Therefore, $P([U]_b = a) = 2^{-b} f_c(x_a)$, and
\begin{align*}
H([X_2]_b \mid [X_1]_b = c_1) &= -\sum_{a\in\mathcal X_b, a\ne c_1} (p 2^{-b} f_c(x_a)) \log(p 2^{-b} f_c(x_a)) \\
&\quad - (p 2^{-b} f_c(x_{c_1}) + 1 - p)\log(p 2^{-b} f_c(x_{c_1}) + 1 - p) \\
&= -\sum_{a\in\mathcal X_b, a\ne c_1} (p 2^{-b} f_c(x_a))(\log p - b + \log f_c(x_a)) \\
&\quad - (p 2^{-b} f_c(x_{c_1}) + 1 - p)\log(p 2^{-b} f_c(x_{c_1}) + 1 - p). \tag{A.2}
\end{align*}
Dividing both sides of (A.2) by $b$ yields
\begin{align*}
\frac{H([X_2]_b \mid [X_1]_b = c_1)}{b} &= \frac{(b - \log p)p}{b}\sum_{a\in\mathcal X_b, a\ne c_1} 2^{-b} f_c(x_a) - \frac pb \sum_{a\in\mathcal X_b, a\ne c_1} 2^{-b} f_c(x_a)\log f_c(x_a) \\
&\quad - \frac{(p 2^{-b} f_c(x_{c_1}) + 1 - p)\log(p 2^{-b} f_c(x_{c_1}) + 1 - p)}{b}. \tag{A.3}
\end{align*}
On the other hand, from (A.1),
\[
\sum_{a\in\mathcal X_b} 2^{-b} f_c(x_a) = \sum_{a\in\mathcal X_b}\int_a^{a+2^{-b}} f_c(u)\,du = \int_0^1 f_c(u)\,du = 1. \tag{A.4}
\]
Also, since $\int_0^1 f_c(u)\,du = 1$,
\[
\lim_{b\to\infty} \sum_{a\in\mathcal X_b} 2^{-b} f_c(x_a)\log f_c(x_a) = -h(f_c),
\]
where $h(f_c) = -\int f_c(u)\log f_c(u)\,du$ denotes the differential entropy of $U$. Therefore, for any $\epsilon > 0$, there exists $b_\epsilon \in \mathbb N$ such that, for $b > b_\epsilon$,
\[
\Big|\sum_{a\in\mathcal X_b} 2^{-b} f_c(x_a)\log f_c(x_a) + h(f_c)\Big| \le \epsilon. \tag{A.5}
\]
Since $f_c$ is bounded by assumption, $M = \sup_{x\in[0,1]} f_c(x) < \infty$, and $|h(f_c)| \le \log M < \infty$. Finally, $-q\log q \le e^{-1}\log e$, for $q\in[0,1]$. Therefore, combining (A.3), (A.4) and (A.5), it follows that, for $b > b_\epsilon$,
\[
\Big|\frac{H([X_2]_b\mid[X_1]_b = c_1)}{b} - p\Big| \le \frac{pM}{2^b} - \frac{p\log p}{b} + \frac{p(|h(f_c)|+\epsilon)}{b} + \frac{\log e}{eb}.
\]
Hence, for any $\epsilon' > 0$, there exists $b_{\epsilon'}$ such that, for $b > \max\{b_\epsilon, b_{\epsilon'}\}$,
\[
\Big|\frac{H([X_2]_b\mid[X_1]_b = c_1)}{b} - p\Big| \le \epsilon', \tag{A.6}
\]
for all $c_1 \in \mathcal X_b$. Since the right-hand side of (A.6) does not depend on $c_1$ and goes to zero as $b \to \infty$,
\[
\Big|\frac{H([X_2]_b\mid[X_1]_b)}{b} - p\Big| \le \sum_{c_1\in\mathcal X_b} P([X_1]_b = c_1)\,\Big|\frac{H([X_2]_b\mid[X_1]_b = c_1)}{b} - p\Big| \le \epsilon'\sum_{c_1\in\mathcal X_b} P([X_1]_b = c_1) = \epsilon', \tag{A.7}
\]
which concludes the proof.

APPENDIX B
PROOF OF THEOREM 3
Since the process is stationary and Markov of order $l$, $\underline d_o(X) = \underline d_l(X)$ and $\bar d_o(X) = \bar d_l(X)$. To compute $\underline d_l(X)$ and $\bar d_l(X)$, we need to study $H([X_{l+1}]_b \mid [X^l]_b)$. Let $\mathcal X_b$ denote the quantized alphabet of $X$ at resolution $b$. Define the indicator random variable $I = \mathbb 1_{Z_l = 0}$. By the definition of the Markov chain, $P(I = 1) = 1 - p$.
Then, given $c_1,\dots,c_{l+1} \in \mathcal X_b$, since $I$ is independent of $X^l$,
\begin{align*}
H([X_{l+1}]_b \mid [X^l]_b) &\le H([X_{l+1}]_b, I \mid [X^l]_b) \le 1 + H([X_{l+1}]_b \mid [X^l]_b, I) \\
&= 1 + p\, H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b \,\Big|\, [X^l]_b\Big) + (1-p)\, H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i}\Big]_b \,\Big|\, [X^l]_b\Big), \tag{B.8}
\end{align*}
where $U$ is independent of $X^l$ and is distributed according to $f_c$. Similarly,
\begin{align*}
H([X_{l+1}]_b \mid [X^l]_b) &\ge H([X_{l+1}]_b \mid [X^l]_b, I) \\
&= p\, H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b \,\Big|\, [X^l]_b\Big) + (1-p)\, H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i}\Big]_b \,\Big|\, [X^l]_b\Big). \tag{B.9}
\end{align*}
Conditioned on $[X^l]_b = c^l$, we have $c_i \le X_i < c_i + 2^{-b}$, for $i = 1,\dots,l$, and therefore
\[
\Big|\sum_{i=1}^{l} a_i X_{l-i} - \sum_{i=1}^{l} a_i c_{l-i}\Big| \le 2^{-b}\sum_{i=1}^{l} |a_i|.
\]
Let $M = \big\lceil \sum_{i=1}^{l} |a_i| \big\rceil$ and $c = \sum_{i=1}^{l} a_i c_{l-i}$. Then,
\[
\sum_{i=1}^{l} a_i X_{l-i} \in [c - M 2^{-b},\, c + M 2^{-b}].
\]
Therefore, $[\sum_{i=1}^{l} a_i X_{l-i}]_b$ can take only $2M + 1$ different values, and as a result $H([\sum_{i=1}^{l} a_i X_{l-i}]_b \mid [X^l]_b) \le \log(2M+1)$. Since $M$ does not depend on $b$, it follows that
\[
\lim_{b\to\infty} \frac{H([\sum_{i=1}^{l} a_i X_{l-i}]_b \mid [X^l]_b)}{b} = 0.
\]
We next prove that
\[
\lim_{b\to\infty} \frac{H([\sum_{i=1}^{l} a_i X_{l-i} + U]_b \mid [X^l]_b)}{b} = 1,
\]
for any absolutely continuous distribution with pdf $f_c$. This, combined with the upper and lower bounds in (B.8) and (B.9), respectively, yields the desired result.

To bound $H([\sum_{i=1}^{l} a_i X_{l-i} + U]_b \mid [X^l]_b)$, we need to study $P([\sum_{i=1}^{l} a_i X_{l-i} + U]_b = c \mid [X^l]_b = c^l)$, where $c^l \in \mathcal X_b^l$ and $c \in \mathcal X_b$. Let $\mathcal N(c^l, b) = \{x^l : c_i \le x_i \le c_i + 2^{-b},\ i = 1,\dots,l\}$, and define the function $g : \mathbb R^l \to \mathbb R$
by $g(x^l) = \sum_{i=1}^{l} a_i x_{l-i}$. Note that
\begin{align*}
P\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b = c \,\Big|\, [X^l]_b = c^l\Big) &= \frac{P([\sum_{i=1}^{l} a_i X_{l-i} + U]_b = c,\, [X^l]_b = c^l)}{P([X^l]_b = c^l)} \\
&= \frac{\int_{\mathcal N(c^l, b)} \int_{c - g(x^l)}^{c - g(x^l) + 2^{-b}} f(x^l) f_c(u)\,du\,dx^l}{P([X^l]_b = c^l)}, \tag{B.10}
\end{align*}
where $f(x^l)$ denotes the pdf of $X^l$. By the mean value theorem, there exists $\delta(x^l) \in (0, 2^{-b})$ such that
\[
\int_{c - g(x^l)}^{c - g(x^l) + 2^{-b}} f_c(u)\,du = 2^{-b} f_c(c - g(x^l) + \delta(x^l)). \tag{B.11}
\]
Combining (B.10) and (B.11) yields
\[
P\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b = c \,\Big|\, [X^l]_b = c^l\Big) = 2^{-b}\, \frac{\int_{\mathcal N(c^l,b)} f(x^l)\, f_c(c - g(x^l) + \delta(x^l))\,dx^l}{\int_{\mathcal N(c^l,b)} f(x^l)\,dx^l}. \tag{B.12}
\]
Define the pdf $p_{c^l,b}(y^l)$ over $\mathcal N(c^l, b)$ as
\[
p_{c^l,b}(y^l) = \frac{f(y^l)}{\int_{\mathcal N(c^l,b)} f(x^l)\,dx^l}.
\]
Then, $P([\sum_{i=1}^{l} a_i X_{l-i} + U]_b = c \mid [X^l]_b = c^l) = 2^{-b}\, \mathrm E[f_c(c - g(Y^l) + \delta(Y^l))]$, where $Y^l \sim p_{c^l,b}$. Hence,
\begin{align*}
H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b \,\Big|\, [X^l]_b\Big) &= \sum_{c^l}\sum_{c} \big(b - \log \mathrm E[f_c(c - g(Y^l) + \delta(Y^l))]\big)\, P\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b = c,\ [X^l]_b = c^l\Big) \\
&= b - \sum_{c^l}\sum_{c} \log \mathrm E[f_c(c - g(Y^l) + \delta(Y^l))]\, P\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b = c,\ [X^l]_b = c^l\Big). \tag{B.13}
\end{align*}
Since, by assumption, $f_c$ is bounded on its support between $\alpha$ and $\beta$, for $b$ large enough, whenever $c - \sum_{i=1}^{l} a_i c_{l-i}$ falls in the support of $f_c$, $\mathrm E[f_c(c - g(Y^l) + \delta(Y^l))]$ is also bounded between $\alpha$ and $\beta$, and hence the desired result follows; that is,
\[
\lim_{b\to\infty} b^{-1} H\Big(\Big[\sum_{i=1}^{l} a_i X_{l-i} + U\Big]_b \,\Big|\, [X^l]_b\Big) = 1.
\]
APPENDIX C
USEFUL LEMMAS

The first two lemmas below are from [26] and are useful in our proofs.
Lemma 4 ($\chi^2$ concentration). Fix $\tau > 0$, and let $U_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal N(0,1)$, $i = 1,\dots,m$. Then,
\[
P\Big(\sum_{i=1}^{m} U_i^2 < m(1-\tau)\Big) \le e^{\frac m2(\tau+\ln(1-\tau))}
\]
and
\[
P\Big(\sum_{i=1}^{m} U_i^2 > m(1+\tau)\Big) \le e^{-\frac m2(\tau-\ln(1+\tau))}. \tag{C.14}
\]
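As a quick numerical sanity check of the lower-tail bound in Lemma 4 (illustrative only; the helper names are ours), a Monte Carlo estimate can be compared against the analytical exponent:

```python
import math
import random

def chi2_lower_tail_bound(m, tau):
    """Right-hand side of Lemma 4's first inequality: exp((m/2)(tau + ln(1 - tau)))."""
    return math.exp(0.5 * m * (tau + math.log(1.0 - tau)))

def empirical_lower_tail(m, tau, trials=20000, seed=0):
    """Monte Carlo estimate of P(sum_i U_i^2 < m(1 - tau)) for U_i ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m)) < m * (1.0 - tau)
    )
    return hits / trials

# The empirical tail probability should sit below the analytical bound.
m, tau = 50, 0.5
assert empirical_lower_tail(m, tau) <= chi2_lower_tail_bound(m, tau)
```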
Lemma 5. Let $U^n$ and $V^n$ denote two independent Gaussian vectors of length $n$ with i.i.d. elements distributed as $\mathcal N(0,1)$. Then the distribution of $\langle U^n, V^n\rangle = \sum_{i=1}^{n} U_i V_i$ is the same as the distribution of $\|U^n\|_2\, G$, where $G \sim \mathcal N(0,1)$ is independent of $\|U^n\|_2$.
Lemma 6. Consider distributions $p$ and $q$ on a finite alphabet $\mathcal X$ such that $\|p - q\|_1 \le \epsilon$. Then,
\[
|H(p) - H(q)| \le -\epsilon\log\epsilon + \epsilon\log|\mathcal X|.
\]
Proof: Define $f(y) = -y\log y$, for $y \in [0,1]$, and $g(y) = f(y+\epsilon) - f(y)$. Since $g'(y) = \log(y/(y+\epsilon)) < 0$, $g$ is a decreasing function of $y$. Therefore, $g(y) \le g(0) = -\epsilon\log\epsilon$. For $x \in \mathcal X$, let $|p(x) - q(x)| = \epsilon_x$. By our assumption,
\[
\sigma \triangleq \sum_{x\in\mathcal X} \epsilon_x \le \epsilon. \tag{C.15}
\]
On the other hand, we just proved that $|-p(x)\log p(x) + q(x)\log q(x)| \le -\epsilon_x \log \epsilon_x$. Therefore,
\begin{align*}
|H(p) - H(q)| &= \Big|\sum_{x\in\mathcal X} (-p(x)\log p(x) + q(x)\log q(x))\Big| \\
&\le \sum_{x\in\mathcal X} |-p(x)\log p(x) + q(x)\log q(x)| \le \sum_{x\in\mathcal X} -\epsilon_x\log\epsilon_x. \tag{C.16}
\end{align*}
Also,
\begin{align*}
\sum_{x\in\mathcal X} -\epsilon_x\log\epsilon_x &= \sum_{x\in\mathcal X} -\epsilon_x\log\Big(\frac{\epsilon_x}{\sigma}\,\sigma\Big) = -\sigma\log\sigma + \sigma H\Big(\frac{\epsilon_x}{\sigma} : x\in\mathcal X\Big) \\
&\le -\epsilon\log\epsilon + \epsilon\log|\mathcal X|, \tag{C.17}
\end{align*}
where the last line follows because $f(y)$ is an increasing function for $y \le e^{-1}$.
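A direct numerical check of Lemma 6 on a small example (illustrative; the helper names are ours):

```python
import math

def entropy_bits(p):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def lemma6_bound(eps, alphabet_size):
    """Right-hand side of Lemma 6: -eps*log(eps) + eps*log|X|, in bits."""
    return -eps * math.log2(eps) + eps * math.log2(alphabet_size)

p = [0.7, 0.2, 0.1]
q = [0.65, 0.25, 0.1]
eps = sum(abs(a - b) for a, b in zip(p, q))  # L1 distance, here 0.1
assert abs(entropy_bits(p) - entropy_bits(q)) <= lemma6_bound(eps, len(p))
```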
APPENDIX D
CONNECTION BETWEEN $\ell_{LZ}$ AND $\hat H_k$

In this appendix, we adapt the results of [43] to the case where the source is non-binary. Since in this work we mostly deal with real-valued sources that are quantized at different numbers of quantization levels, and the number of quantization levels usually grows to infinity with the blocklength, we need to derive all the constants carefully, to make sure that the bounds remain valid when the size of the alphabet depends on the blocklength. Consider a finite-alphabet sequence $z^n \in \mathcal Z^n$, where $|\mathcal Z| = r$ and $r = 2^b$, for some $b \in \mathbb N$.² Let $N_{LZ}(z^n)$ denote the
number of phrases in $z^n$ after parsing it according to the Lempel-Ziv algorithm [13]. For $k \in \mathbb N$, let
\[
n_k = \sum_{j=1}^{k} j r^j = \frac{r^{k+1}(kr - (k+1)) + r}{(r-1)^2}. \tag{D.18}
\]
Let $N_{LZ}(n)$ denote the maximum possible number of phrases in a parsed sequence of length $n$. For $n = n_k$,
\begin{align*}
N_{LZ}(n_k) &\le \sum_{j=1}^{k} r^j = \frac{r(r^k - 1)}{r-1} = \frac{r^{k+1}(kr - (k+1))}{(r-1)(kr - (k+1))} - \frac{r}{r-1} \\
&\le \frac{(r-1)\,n_k}{kr - (k+1)} \le \frac{n_k}{k-1}. \tag{D.19}
\end{align*}
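The incremental parsing behind $N_{LZ}(z^n)$ can be sketched in a few lines (an LZ78-style parser in the spirit of [13]; illustrative, not the paper's implementation). Each phrase is the shortest prefix of the remaining input that has not appeared as a phrase before:

```python
def lz78_phrase_count(z):
    """Number of phrases N_LZ(z^n) in the incremental (LZ78-style) parsing of z."""
    seen = set()
    phrase = ()
    count = 0
    for sym in z:
        phrase += (sym,)
        if phrase not in seen:   # shortest previously unseen prefix: close the phrase
            seen.add(phrase)
            count += 1
            phrase = ()
    if phrase:                   # trailing partial phrase, if any
        count += 1
    return count

# "aabbaabb" parses into the phrases a | ab | b | aa | bb.
assert lz78_phrase_count("aabbaabb") == 5
```

For a fixed alphabet, the counting argument of this appendix caps this quantity by bounds of the form $N_{LZ}(n) \le n/(k-1)$.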
Now, given $n$, assume that $n_k \le n < n_{k+1}$, for some $k$. It is straightforward to check that, for $k \ge 2$, $n \ge n_k \ge r^k$. Therefore, if $n > r(1+r)$,
\[
k \le \frac{\log n}{\log r}.
\]
On the other hand, $n < n_{k+1}$, where, from (D.18),
\[
n_{k+1} = \frac{r^{k+2}\big((k+1)r - (k+2)\big) + r}{(r-1)^2} \le \frac{r^{k+2}\big((r-1)\log_r n + r - 2\big) + r}{(r-1)^2}.
\]
Therefore,
\[
k + 2 \ge \log_r \frac{n(r-1)^2 - r}{(r-1)\log_r n + r - 2},
\]
or
\[
k \ge (1 - \epsilon_n)\,\frac{\log n}{\log r}, \tag{D.20}
\]
where
\[
\epsilon_n = \frac{\log\big(((r-1)\log_r n + r - 2)\, r^2\big)}{\log n}. \tag{D.21}
\]

² Restricting the alphabet size to satisfy this condition simplifies the arguments; the results can be generalized to any finite-alphabet source.
To pack the maximum possible number of phrases into a sequence of length $n$, we first use all possible phrases of length at most $k$, and then use phrases of length $k+1$ to cover the rest. Therefore,
\[
N_{LZ}(n) \le N_{LZ}(n_k) + \frac{n - n_k}{k+1} \le \frac{n_k}{k-1} + \frac{n - n_k}{k+1} \le \frac{n}{k-1}. \tag{D.22}
\]
Combining (D.22) with (D.20), and noting that $\log r = b$, yields
\[
\frac{N_{LZ}(n)}{n} \le \frac{b}{(1-\epsilon_n)\log n - b}. \tag{D.23}
\]
Taking into account the number of bits required for describing the blocklength $n$, the number of phrases $N_{LZ}$, the pointers, and the extra symbols of the phrases, we derive
\[
\frac1n \ell_{LZ}(z^n) = \frac1n N_{LZ}\log N_{LZ} + \frac bn N_{LZ} + \eta_n, \tag{D.24}
\]
where
\[
\eta_n = \frac1n\big(\log n + 2\log\log n + \log N_{LZ} + 2\log\log N_{LZ} + 2\big). \tag{D.25}
\]
On the other hand, a straightforward extension of the analysis presented in [43] to general non-binary alphabets yields
\[
\frac1n N_{LZ}\log N_{LZ} \le \hat H_k(z^n) + \frac{N_{LZ}}{n}\big((\mu+1)\log(\mu+1) - \mu\log\mu + k\log r\big), \tag{D.26}
\]
where $\mu = N_{LZ}/n$. But
\[
(\mu+1)\log(\mu+1) - \mu\log\mu = \log(\mu+1) + \mu\log(1 + 1/\mu) \le \log(\mu+1) + 1/\ln 2 < \log\tfrac1\mu + 2,
\]
for $n$ large enough. Also, it is easy to show that, for any value of $r$ and $z^n$, $n \le \sum_{i=1}^{N_{LZ}} l_i$, where $l_i$ denotes the length of the $i$-th phrase, which implies $N_{LZ}(z^n) \ge \sqrt{2n} - 1$, or $n/N_{LZ}(z^n) \le \sqrt n$, for $n$ large enough. Therefore, since $n/N_{LZ}(z^n) \le \sqrt n$,
\[
\mu\log\tfrac1\mu = \frac{\log(n/N_{LZ})}{n/N_{LZ}} \le \frac{\log n}{2\sqrt n}.
\]
Hence, combining (D.24), (D.23) and (D.26), we conclude that, for $n$ large enough,
\[
\frac1n \ell_{LZ}(z^n) \le \hat H_k(z^n) + \frac{b(kb+b+3)}{(1-\epsilon_n)\log n - b} + \gamma_n, \tag{D.27}
\]
where $\gamma_n = \eta_n + \frac{\log n}{2\sqrt n}$, and $\epsilon_n$ and $\eta_n$ are defined in (D.21) and (D.25), respectively. Note that $\gamma_n = o(1)$ and does not depend on $b$ or $z^n$.

APPENDIX E
EXPONENTIAL CONVERGENCE RATES
Consider an aperiodic Markov process $\{Z_i\}_{i=1}^\infty$ of order $l$ with probability measure $\mu$, i.e.,
\[
\mu(z^n) = \mu(z^l)\prod_{i=l+1}^{n} \mu(z_i \mid z_{i-l}^{i-1}).
\]
Let $r \triangleq |\mathcal Z|$.

Definition 1. The process $\mathbf Z = \{Z_i\}_{i=1}^\infty$ is called $\Psi$-mixing if there exists a non-increasing function $\Psi(g)$ such that $\Psi(g) \to 1$, as $g \to \infty$, and
\[
\mu(u^{\ell_1} v^{g} w^{\ell_3}) \le \mu(u^{\ell_1})\,\mu(w^{\ell_3})\,\Psi(g),
\]
for all $\ell_1, \ell_3, g \in \mathbb N$, $u^{\ell_1} \in \mathcal Z^{\ell_1}$, $v^g \in \mathcal Z^g$ and $w^{\ell_3} \in \mathcal Z^{\ell_3}$.

First-order Markov chains are known to be $\Psi$-mixing [34]. We prove that Markov chains of order $l$ have a weaker but similar property; as we show later, this is enough to prove that an aperiodic stationary Markov process of order $l$ has exponential rates for frequencies of all orders.

Lemma 7. Consider an aperiodic Markov process $\{Z_i\}_{i=1}^\infty$ of order $l$. There exists a non-increasing function $\Psi : \mathbb N \to \mathbb R^+$ such that, for all $\ell_1 \ge l$, $\ell_3 \ge l$, $u^{\ell_1} \in \mathcal Z^{\ell_1}$, $v^g \in \mathcal Z^g$ and $w^{\ell_3} \in \mathcal Z^{\ell_3}$, we have $\mu(u^{\ell_1} v^g w^{\ell_3}) \le \mu(u^{\ell_1})\,\mu(w^{\ell_3})\,\Psi(g)$, and $\lim_{g\to\infty} \Psi(g) = 1$.

Proof: The proof follows directly from the case of $l = 1$. For general $l$, $\Psi$ is defined as
\[
\Psi(g) = \max_{(a^l, b^l) \in \mathcal Z^l \times \mathcal Z^l} \frac{M^{g}_{a^l, b^l}}{\pi(b^l)},
\]
where $M = [M_{a^l, b^l}]$ denotes the transition probability matrix of the process $\{Z_i\}_{i=1}^\infty$, viewed as a first-order chain on $\mathcal Z^l$, and $\pi(\cdot)$ denotes its stationary distribution.

The following theorem uses Lemma 7 and extends Theorem III.1.7 of [34] to Markov processes of order $l$.

Theorem 7. An aperiodic Markov process of order $l$ has exponential rates for frequencies of all orders. More precisely, given a process $\{Z_i\}$ that is an aperiodic Markov chain of order $l$, for any $\epsilon > 0$, there exists $g \in \mathbb N$, depending only on $\epsilon$ and the transition probabilities of $\{Z_i\}$, such that for any $k > l$ and $n > 6(k+g)/\epsilon + k$,
\[
P(\{z^n : \|p_k(\cdot \mid z^n) - \mu_k\|_1 \ge \epsilon\}) \le 2^{c\epsilon^2/8}\,(k+g)\, n^{|\mathcal Z|^k}\, 2^{-\frac{n c\epsilon^2}{8(k+g)}},
\]
where $c = 1/(2\ln 2)$.

Proof: Since, for any $z^n \in \mathcal Z^n$, $\|\mu_{k_1} - p_{k_1}(\cdot \mid z^n)\|_1 \le \|\mu_{k_2} - p_{k_2}(\cdot \mid z^n)\|_1$, for all $k_1 \le k_2$, it is enough to prove the statement for $k$ large. Employing Lemma 7 and the technique used in Chapter III of [34], for $n > 6(k+g)/\epsilon + k$ and $k \ge l$, we derive
\[
P(\{z^n : \|p_k(\cdot \mid z^n) - \mu_k\|_1 \ge \epsilon\}) \le (k+g)\,[\Psi(g)]^t\, (t+1)^{|\mathcal Z|^k}\, 2^{-t c\epsilon^2/4},
\]
where $c = 1/(2\ln 2)$, and $t \in \mathbb N$ is the integer that satisfies
\[
\frac{n}{k+g} - 1 \le t < \frac{n}{k+g}.
\]
On the other hand, by Lemma 7, $\Psi(g)$ is a non-increasing function that converges to 1. Hence, for $g$ large enough, $\Psi(g) < 2^{c\epsilon^2/8}$, and
\[
P(\{z^n : \|p_k(\cdot \mid z^n) - \mu_k\|_1 \ge \epsilon\}) \le (k+g)\,(t+1)^{|\mathcal Z|^k}\, 2^{-t c\epsilon^2/8}.
\]
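The quantity controlled by Theorem 7, the empirical $k$-block distribution $p_k(\cdot \mid z^n)$ and its $\ell_1$ deviation from the true $k$-th order marginal $\mu_k$, can be sketched as follows (illustrative Python; the names are ours, not the paper's):

```python
import random
from collections import Counter

def empirical_k_dist(z, k):
    """Empirical distribution p_k(.|z^n) of overlapping k-blocks of z."""
    counts = Counter(tuple(z[i:i + k]) for i in range(len(z) - k + 1))
    total = sum(counts.values())
    return {blk: c / total for blk, c in counts.items()}

def l1_distance(p, q):
    """||p - q||_1 over the union of the two supports."""
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))

# For a long i.i.d. fair-bit sequence (a trivial Markov source), the empirical
# pair distribution should be close to the uniform distribution mu_2 on pairs.
rng = random.Random(1)
z = [rng.randrange(2) for _ in range(200000)]
mu2 = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
assert l1_distance(empirical_k_dist(z, 2), mu2) < 0.02
```

Exponential rates, in the sense above, mean that the probability of this $\ell_1$ deviation exceeding $\epsilon$ decays exponentially in the blocklength.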
Note that the choice of $g$ depends only on the Markov chain transition matrix and $\epsilon$.

REFERENCES

[1] D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
[2] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory, 52(12):5406–5425, 2006.
[3] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509, Feb. 2006.
[4] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Trans. Inf. Theory, 56(4):1982–2001, Apr. 2010.
[5] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Found. Comp. Math., 12(6):805–849, 2012.
[6] M. Vetterli, P. Marziliano, and T. Blu. Sampling signals with finite rate of innovation. IEEE Trans. Signal Process., 50(6):1417–1428, Jun. 2002.
[7] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Apr. 2010.
[8] P. Shah and V. Chandrasekaran. Iterative projections for signal identification on manifolds. In Proc. 49th Annual Allerton Conf. on Commun., Control, and Comp., Sep. 2011.
[9] C. Hegde and R. G. Baraniuk. Signal recovery on incoherent manifolds. IEEE Trans. Inf. Theory, 58(12):7204–7214, 2012.
[10] C. Hegde and R. G. Baraniuk. Sampling and recovery of pulse streams. IEEE Trans. Signal Process., 59(4):1505–1517, Apr. 2011.
[11] D. L. Donoho, H. Kakavand, and J. Mammen. The simplest solution to an underdetermined system of linear equations. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pages 1924–1928, Jul. 2006.
[12] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977.
[13] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory, 24(5):530–536, Sep. 1978.
[14] D. J. Sakrison. The rate of a class of random processes. IEEE Trans. Inf. Theory, 16:10–16, Jan. 1970.
[15] J. Ziv. Coding of sources with unknown statistics, Part II: Distortion relative to a fidelity criterion. IEEE Trans. Inf. Theory, 18:389–394, May 1972.
[16] D. L. Neuhoff, R. M. Gray, and L. D. Davisson. Fixed rate universal block source coding with a fidelity criterion. IEEE Trans. Inf. Theory, 21:511–523, May 1972.
[17] D. L. Neuhoff and P. L. Shields. Fixed-rate universal codes for Markov sources. IEEE Trans. Inf. Theory, 24:360–367, May 1978.
[18] J. Ziv. Distortion-rate theory for individual sequences. IEEE Trans. Inf. Theory, 24:137–143, Jan. 1980.
[19] R. Garcia-Munoz and D. L. Neuhoff. Strong universal source coding subject to a rate-distortion constraint. IEEE Trans. Inf. Theory, 28:285–295, Mar. 1982.
[20] D. Donoho. The Kolmogorov sampler. Technical Report 2002-04, Stanford University, Jan. 2002.
[21] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. Weinberger. Universal discrete denoising: Known channel. IEEE Trans. Inf. Theory, 51(1):5–28, 2005.
[22] M. Feder, N. Merhav, and M. Gutman. Universal prediction for individual sequences. IEEE Trans. Inf. Theory, 38(4):1258–1270, 1992.
[23] N. Merhav and M. Feder. Universal prediction. IEEE Trans. Inf. Theory, 44(6):2124–2147, 1998.
[24] S. Jalali and A. Maleki. Minimum complexity pursuit. In Proc. 49th Annual Allerton Conf. on Commun., Control, and Comp., pages 1764–1770, Sep. 2011.
[25] S. Jalali, A. Maleki, and R. Baraniuk. Minimum complexity pursuit: Stability analysis. In Proc. IEEE Int. Symp. Inf. Theory, pages 1857–1861, Jul. 2012.
[26] S. Jalali, A. Maleki, and R. G. Baraniuk. Minimum complexity pursuit for universal compressed sensing. IEEE Trans. Inf. Theory, 60(4):2253–2268, Apr. 2014.
[27] S. C. Tornay. Ockham: Studies and Selections. Open Court Publishers, La Salle, IL, third edition, 1938.
[28] R. J. Solomonoff. A formal theory of inductive inference. Inform. Contr., 7:224–254, 1964.
[29] A. N. Kolmogorov. Logical basis for information theory and probability theory. IEEE Trans. Inf. Theory, 14:662–664, Sep. 1968.
[30] A. Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1-2):193–215, 1959.
[31] Y. Wu and S. Verdú. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Trans. Inf. Theory, 56(8):3721–3748, Aug. 2010.
[32] D. Baron and M. F. Duarte. Universal MAP estimation in compressed sensing. In Proc. 49th Annual Allerton Conf. on Commun., Control, and Comp., Sep. 2011.
[33] D. Baron and M. F. Duarte. Signal recovery in compressed sensing via universal priors. arXiv:1204.2611, 2012.
[34] P. Shields. The Ergodic Theory of Discrete Sample Paths. American Mathematical Society, 1996.
[35] D. L. Donoho, M. Vetterli, R. A. DeVore, and I. Daubechies. Data compression and harmonic analysis. IEEE Trans. Inf. Theory, 44(6):2435–2476, 1998.
[36] P. L. Dragotti and M. Vetterli. Wavelet footprints: Theory, algorithms, and applications. IEEE Trans. Signal Process., 51(5):1306–1323, May 2003.
[37] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
[38] D. Needell and R. Ward. Stable image reconstruction using total variation minimization. SIAM Journal on Imaging Sciences, 6(2):1035–1058, 2013.
[39] E. Candès, J. Romberg, and T. Tao. Decoding by linear programming. IEEE Trans. Inf. Theory, 51(12):4203–4215, Dec. 2005.
[40] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721–741, Nov. 1984.
[41] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
[42] V. Cerny. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan. 1985.
[43] E. Plotnik, M. J. Weinberger, and J. Ziv. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Trans. Inf. Theory, 38(1):66–72, Jan. 1992.
Theory, 56(8):3721 –3748, Aug. 2010. [32] D. Baron and M. F. Duarte. Universal MAP estimation in compressed sensing. In Proc. 49th Annual Allerton Conf. on Commun., Control, and Comp., Sep. 2011. [33] D. Baron and M. F. Duarte. Signal recovery in compressed sensing via universal priors. arXiv:1204.2611, 2012. [34] P. Shields. The Ergodic Theory of Discrete Sample Paths. American Mathematical Society, 1996. [35] D. L. Donoho, M. Vetterli, R. A. DeVore, and I. Daubechies. Data compression and harmonic analysis. IEEE Trans. Inf. Theory, 44(6):2435–2476, 1998. [36] P.L. Dragotti and M. Vetterli. Wavelet footprints: theory, algorithms, and applications. IEEE Trans. Signal Process., 51(5):1306–1323, May 2003. [37] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992. [38] D. Needell and R. Ward. Stable image reconstruction using total variation minimization. SIAM Journal on Imaging Sciences, 6(2):1035– 1058, 2013. [39] E. Cand`es, J. Romberg, and T. Tao. Decoding by linear programming. IEEE Trans. Inf. Theory, 51(12):4203 – 4215, Dec. 2005. [40] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions and the bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721–741, Nov 1984. [41] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983. [42] V. Cerny. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan 1985. [43] E. Plotnik, M.J. Weinberger, and J. Ziv. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Trans. Inf. Theory, 38(1):66–72, Jan 1992.