Analysis of Approximate Message Passing Algorithm

Arian Maleki
Department of Electrical Engineering, Stanford University
[email protected]

David L. Donoho
Department of Statistics, Stanford University
[email protected]

Andrea Montanari
Department of Electrical Engineering and Department of Statistics, Stanford University
[email protected]

Abstract— Finding fast recovery algorithms is a problem of significant interest in compressed sensing. In many applications extremely large problem sizes are envisioned, with at least tens of thousands of equations and hundreds of thousands of unknowns. Interior point methods for solving the best studied algorithm, ℓ1 minimization, are very slow and hence ineffective for such problem sizes. Faster methods such as LARS or homotopy [1] have been proposed for solving ℓ1 minimization, but there is still a need for algorithms with lower computational complexity. There is therefore a fast growing literature on greedy methods and iterative thresholding for finding the sparsest solution. It was shown by two of the authors that most of these algorithms perform worse than ℓ1 when it comes to the sparsity-measurement tradeoff [2]. In a recent paper the authors introduced a new algorithm called AMP, which is as fast as previously proposed iterative soft thresholding algorithms and performs exactly the same as ℓ1 minimization [3]. This algorithm is inspired by message passing algorithms on bipartite graphs. The statistical properties of AMP let the authors propose a theoretical framework to analyze the asymptotic performance of the algorithm, which results in very sharp predictions of different observables of the algorithm. In this paper we address several questions regarding the thresholding policy. We consider a very general thresholding policy λ(σ) for the algorithm and use the maximin framework to tune it optimally. It is shown that, when formulated in this general form, the maximin thresholding policy is not unique and many different thresholding policies may lead to the same phase transition. We then propose two very simple thresholding policies that can be implemented easily in practice and prove that they are both maximin. This analysis also sheds some light on several other aspects of the algorithm, such as the least favorable distribution and the similarity of all maximin optimal thresholding policies. We also show how one can derive the AMP algorithm from the full message passing algorithm.

I. INTRODUCTION

Compressed sensing (CS) [4], [5] is a fairly new field of research that has found many applications in signal processing, machine learning and biology. The main problem of interest in compressed sensing is to find the sparsest solution of an underdetermined system of linear equations y = Ax_o. Unfortunately, this problem is NP-hard and in general there is no polynomial time algorithm for solving it. Chen et al. [6] proposed the following convex optimization, also called Basis Pursuit, for recovering the sparsest solution:

min ‖x‖_1   subject to   Ax = y.    (1)
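To make the computational burden concrete, (1) can be recast as a linear program over variables (x, t) with |x_i| ≤ t_i. The following sketch assumes SciPy's generic linprog solver is available; it is meant only to illustrate the formulation, and generic interior point / simplex solvers of this kind are exactly what becomes impractically slow at the problem sizes mentioned in the abstract.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 subject to Ax = y by recasting it as a linear program."""
    n, N = A.shape
    # Decision variables are (x, t); minimize sum(t) subject to -t <= x <= t and Ax = y.
    c = np.concatenate([np.zeros(N), np.ones(N)])
    I = np.eye(N)
    A_ub = np.block([[I, -I], [-I, -I]])      # x - t <= 0 and -x - t <= 0
    b_ub = np.zeros(2 * N)
    A_eq = np.hstack([A, np.zeros((n, N))])   # Ax = y (t does not enter this constraint)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * N + [(0, None)] * N)
    return res.x[:N]

For a sparse x_o and a random A with n < N rows, basis_pursuit(A, A @ x_o) recovers x_o whenever (δ, ρ) falls below the phase transition discussed in Section II.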

The computational complexity of this minimization problem has motivated researchers to look for cheaper algorithms, and many algorithms based on different heuristics have been proposed in the last five years for finding the sparsest solution. A class of methods that has drawn the most attention recently is the class of iterative thresholding algorithms. Starting with x^0 = 0 and z^0 = 0, the iteration is

x^{t+1} = η(x^t + A∗z^t; λ_t),
z^t = y − Ax^t.    (2)

Fig. 1. The pictorial representation of achieving the sparsest solution for iterative thresholding algorithms.

In this equation x^t is the estimate of the signal at time t and z^t is the residual at time t; if the problem is noiseless and the algorithm performs well, the residual will converge to 0. η is called the thresholding function and is a scalar nonlinearity applied to the elements of the vector componentwise, and λ_t is the parameter that the thresholding function uses at time t. The main task of the thresholding function is to impose sparsity at each iteration. Here we assume that the elements of the matrix A are scaled properly so that the iterations converge. If the nonlinearity or thresholding function η is removed from the algorithm, it is well known that the algorithm converges to the least ℓ2-norm solution of y = Ax. However, by applying η at each iteration we direct the algorithm toward the sparsest solution. This phenomenon is shown in Figure 1. As is clear from the figure, the thresholding function moves the algorithm toward the sparsest solution; otherwise the algorithm moves in the direction perpendicular to the plane and hits the plane at a dense solution. Many different variants of iterative thresholding algorithms have been proposed in the literature from many different perspectives, such as [7], [8], [9], [10], [11], [12], [13], [14], [15] and [16]. For a more complete list of references please refer to [2]. Although many nonlinearities have been proposed in the literature, in this paper we focus on soft thresholding, which is given by

η(x; λ) = sgn(x)(|x| − λ)_+,    (3)

where (z)_+ = z I(z ≥ 0) and I is the indicator function. The resulting algorithm is called iterative soft thresholding or IST. Although these algorithms are much faster than ℓ1 minimization, it has been shown through extensive simulations that they perform worse than ℓ1 when it comes to the sparsity-measurement tradeoff [2]. Recently the authors have proposed a slightly modified version of this algorithm which is capable of performing as well as ℓ1.
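For reference, the following is a minimal NumPy sketch of the soft thresholding function (3) and of the IST iteration (2). The threshold schedule is left as a caller-supplied rule, since choosing λ_t is precisely the tuning question studied in the rest of the paper.

import numpy as np

def eta(x, lam):
    """Soft thresholding (3): eta(x; lam) = sgn(x) * (|x| - lam)_+ applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ist(A, y, threshold_rule, T=200):
    """Iterative soft thresholding (2): x^{t+1} = eta(x^t + A^T z^t; lam_t), z^t = y - A x^t."""
    N = A.shape[1]
    x = np.zeros(N)
    for t in range(T):
        z = y - A @ x
        pseudo_data = x + A.T @ z
        x = eta(pseudo_data, threshold_rule(pseudo_data))
    return x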

Fig. 2. Phase diagram of the ℓ1 minimization algorithm.

This new algorithm is inspired by message passing algorithms on graphs and is called Approximate Message Passing or AMP. The statistical properties of AMP make the asymptotic analysis of this algorithm possible. In this paper we discuss the framework that has been used for the asymptotic analysis of the algorithm. This theoretical framework makes the tuning task very simple. After finding optimally tuned thresholding policies we discuss the algorithm and show its derivation from the full message passing algorithm.

The organization of the paper is as follows. In Section II we review the concept of the phase transition that formalizes the sparsity-measurement tradeoff. We also describe the maximin framework for tuning the free parameters of different algorithms. In Section III the state evolution framework is explained. Using this theoretical framework and the maximin tuning of Section II we calculate the optimal thresholding policy (the values of λ_t). In Section IV the actual message passing algorithm is presented and used to derive the AMP algorithm. Finally, Section V is devoted to the simulation results.

II. MAXIMIN FRAMEWORK

In Section I computational complexity was presented as one measure of performance for compressed sensing algorithms. There is another very important measure of performance, the sparsity-measurement tradeoff. Roughly speaking, given the sparsity level, the algorithm that needs fewer measurements for the correct recovery of the original signal is better. The majority of theoretical frameworks that exist in the compressed sensing literature, such as coherence conditions [17] and the restricted isometry property [18], provide us with sufficient conditions and therefore cannot be used for comparison purposes. In the next section we introduce phase diagrams and phase transitions, which give necessary and sufficient conditions for correct recovery. We also introduce the maximin framework that addresses the tuning of the free parameters in the recovery schemes.

A. Phase Diagram

Suppose that the elements of x_o are drawn i.i.d. from a distribution F_X(x_{o,i}) = (1 − ε)δ_0(x_{o,i}) + εG(x_{o,i}), where δ_0 denotes the distribution with all of its mass at zero and G(x_{o,i}) is another distribution function that specifies the distribution of the non-zero elements of the vector but is usually unknown beforehand. In other words, each element of x_o is zero with probability 1 − ε and is drawn from another distribution G with probability ε.

If the dimension of x_o is N, the expected number of non-zero elements is εN. According to the compressed sensing framework, x_o is measured through a fat random matrix A whose elements are drawn i.i.d. from some distribution. The matrix A has n rows and N columns, with n < N. We define δ = n/N and ρ = ε/δ. Suppose that algorithm A is used for recovering the original vector from the measurements y. No matter how well A works, there are certain samples of this random ensemble on which A has no chance of recovering the correct answer, for example when the matrix A is equal to zero or when x_o is dense. Therefore an important measure of performance is the probability of correct recovery of the algorithm as a function of (N, δ, ρ, F). If the distribution F is fixed, the probability of correct recovery is considered as a function of (N, ρ, δ). Since in compressed sensing we are interested in very high dimensional vectors, we calculate these probabilities as N → ∞. This results in a two-dimensional map of the probability of correct recovery in terms of (ρ, δ), called the phase diagram. Figure 2 shows the phase diagram of ℓ1 minimization as derived in [19], [20], [21]. It has been shown that this phase transition is independent of F. As can be seen in the figure, as N → ∞ the probability of correct recovery exhibits a sharp transition, i.e., there is a curve ρ∗(δ) below which ℓ1 minimization recovers the correct answer with probability 1 and above which the probability of correct recovery goes to zero very rapidly as N → ∞. This curve is called the phase transition curve. Although we described the phase transition for the case of ℓ1 minimization, the existence of a phase transition for many other algorithms has been either empirically observed or theoretically proved; the interested reader may refer to [12], [22], [23], [24], [19], [25], [20], [26] for more information. The phase transition displays the sparsity-measurement tradeoff of an algorithm and is therefore useful for comparison and tuning purposes. The only problem is that the phase transition may depend on the underlying distribution of x_o, which is not known a priori. In the next section we explain how this can be handled.

B. Maximin tuning

Consider a sparse recovery algorithm A_θ, where θ includes all the free parameters of the algorithm. It is clear that the performance of this algorithm depends highly on the parameter vector θ, and therefore a good tuning of the algorithm seems necessary. The phase transition provides us with a tool for tuning these parameters. The only problem is the dependence of the phase transition on the distribution of the non-zero elements of x_o. We represent the value of ρ at the phase transition as ρ∗_A(δ, θ; G), where as before G is the distribution of the non-zero elements of x_o. But, as mentioned in the last section, the distribution of the non-zero elements may not be known beforehand. In order to resolve this issue we proposed the maximin framework [2]. Suppose that we have a class of distributions G for the non-zero values of the original vector, and set

θ∗(δ) = arg max_θ inf_{G∈G} ρ∗(δ, θ; A, G).    (4)

In other words, for every value of θ the worst distribution is found and the phase transition for that value of the parameter and that value of δ is calculated under this least favorable distribution. Then all the phase transition values are compared and the value of θ that exhibits the maximum value of ρ∗ is chosen. The optimal phase transition is called the maximin phase transition ρ_MM(δ, A; G). We may also write ρ_MM(δ, A) when the family of distributions is clear from the context. This framework can also be used for comparing different algorithms; [2] used this framework to compare several well-known compressed sensing algorithms.
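To make the phase diagram and the maximin tuning concrete, a single point of an empirical phase diagram can be estimated by Monte Carlo as sketched below. The recovery routine recover and the success tolerance are placeholders for whichever algorithm and criterion one wishes to study; they are illustrative choices rather than the exact experimental protocol of [2].

import numpy as np

def success_probability(recover, N, delta, rho, n_trials=50, tol=1e-4, seed=0):
    """Estimate the probability of exact recovery at one (delta, rho) point.

    delta = n/N and rho = eps/delta, so the number of nonzeros is k = rho*delta*N.
    """
    rng = np.random.default_rng(seed)
    n = int(round(delta * N))
    k = int(round(rho * delta * N))
    wins = 0
    for _ in range(n_trials):
        A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, N))
        x_o = np.zeros(N)
        x_o[rng.choice(N, k, replace=False)] = rng.normal(size=k)   # one choice of G
        x_hat = recover(A, A @ x_o)
        wins += np.linalg.norm(x_hat - x_o) <= tol * max(np.linalg.norm(x_o), 1.0)
    return wins / n_trials

Sweeping ρ for each δ and locating where the success probability drops from 1 to 0 traces the empirical phase transition curve; repeating the sweep over a family of distributions G and taking the worst case corresponds to the maximin tuning of (4).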

In the next section we discuss a theoretical framework that aims to predict the performance of the approximate message passing algorithm.

III. STATE EVOLUTION

Consider the t-th iteration of the iterative thresholding algorithm. If we define H = A∗A − I, at the t-th iteration x^t + A∗z^t = x_o + H(x_o − x^t). Naively, the matrix H does not correlate with x_o or x^t, and therefore from the central limit theorem heuristic we might pretend that H(x_o − x^t) is a Gaussian vector whose entries have zero mean and variance n^{-1}‖x_o − x^t‖_2^2 (here we assume that the expected value of the square of the ℓ2-norm of each column of A is equal to 1). This phenomenon has also been observed in digital communication, where the noise is called mutual access interference. This heuristic suggests that each iteration can be cast as an estimation problem in which a noisy observation of a sparse signal is given and the goal is to estimate the original sparse signal as well as possible. The soft thresholding function has been extensively studied for this purpose, and the interested reader is referred to [27], [28]. If the algorithm decreases the variance of the noise at each iteration such that the variance eventually converges to zero, the algorithm is successful in recovering the sparsest solution. On the other hand, the algorithm fails if the variance does not converge to zero. In order to formalize this intuition we state the following simple but important lemma.

Lemma 3.1: Let A_ij be the ij-th element of the measurement matrix and suppose that these elements are drawn i.i.d. from a distribution with the properties E(A_ij) = 0, E(A_ij^2) = 1/n and E(A_ij^4) = O(1/n^2). We assume that the ratio δ = n/N is fixed. Let x be a vector whose elements are drawn i.i.d. from a different distribution with bounded second moment, independent of the matrix A. If s = (A∗A − I)x, then

E(s_i) = 0,
E(s_i s_j) = O(1/n),
E(s_i^2) = E(x_i^2)/δ + O(1/n).

The proof of this lemma can be found in the appendix. It is clear that many ensembles, including N(0, 1/n), satisfy the above conditions. Using this lemma we can formally track the evolution of the mean-squared error from iteration to iteration:

σ_{t+1}^2 = (1/δ) E{[η(X + σ_t Z; λ_t) − X]^2}.    (5)

Here the expectation is with respect to the independent random variables Z ∼ N(0, 1) and X, whose distribution coincides with the empirical distribution of the entries of x_o. Since all the parameters are fixed except for σ, which changes from iteration to iteration, we represent the thresholding policy as λ(σ), although it should be noted that it also implicitly depends on the parameter δ.

Definition 1: Given implicit parameters (δ, ρ, F) and the thresholding mapping λ(σ), State Evolution (SE) is the mapping that evolves by the rule

σ_t^2 ↦ Ψ(σ_t^2) = (1/δ) E{[η(X + σ_t Z; λ(σ_t)) − X]^2}.    (6)

For a given thresholding policy we therefore define the following regions:
Region (I): Ψ(σ^2) < σ^2 for all values of σ^2. Here σ_t^2 → 0 as t → ∞: the SE converges to zero.
Region (II): The complement of Region (I). Here the SE does not evolve to σ^2 = 0.
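The SE map (6) is straightforward to evaluate numerically. The sketch below uses Monte Carlo with the two-point prior X ∼ (1 − ε)δ_0 + εδ_1 (the prior used for Figure 3) and the linear policy λ(σ) = βσ of Section III-A.1; the value β = 1.8 is an arbitrary illustrative choice, not the optimal one.

import numpy as np

def eta(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def se_map(sigma2, delta, rho, beta, n_mc=200_000, seed=0):
    """One application of the SE map (6) for X ~ (1-eps)*delta_0 + eps*delta_1
    and the thresholding policy lambda(sigma) = beta*sigma, estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    eps = rho * delta
    sigma = np.sqrt(sigma2)
    X = (rng.random(n_mc) < eps).astype(float)    # nonzero entries equal 1 with probability eps
    Z = rng.normal(size=n_mc)
    return np.mean((eta(X + sigma * Z, beta * sigma) - X) ** 2) / delta

# Iterating sigma^2 -> Psi(sigma^2); decay to zero means (delta, rho, lambda) lies in Region (I).
sigma2 = 1.0
for _ in range(50):
    sigma2 = se_map(sigma2, delta=0.1, rho=0.15, beta=1.8)
print(sigma2)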

Fig. 3. State evolution curves for δ = .1 and three different values of ρ. As ρ increases the curves move upward. The phase transition happens at ρ ≈ 0.19. The thresholding policy used here is λ(σ) = βσ for a fixed β; β is chosen as explained in Section III-A.1. X ∼ (1 − ε)δ_0 + εδ_1.

Therefore for a given thresholding policy we can define the following notion:

ρ_SE(δ, λ, F) = sup{ρ : (δ, ρ, λ) ∈ Region (I)}.    (7)

Figure 3 depicts three state evolution curves for a fixed value of δ and three different values of ρ. As is clear from the figure, as ρ increases the curves move up, and at ρ = ρ_SE the curve touches the y = x line. Below this critical value of ρ, according to the state evolution, the variance of the noise converges to zero and therefore the algorithm recovers the original vector x_o exactly; above this critical value the algorithm has another fixed point which is stable, and therefore it fails. Suppose that the state evolution explains the performance of an algorithm. The thresholding policy λ(σ) may be considered as a free parameter of this algorithm. Clearly this parameter affects the phase transition of the algorithm, and it is therefore important to tune it optimally. The goal of this section is to find the maximin thresholding policy for an iterative algorithm that satisfies SE.

Consider the class of distributions F_{ε,γ} = {F(µ) : F(0+) − F(0−) > 1 − ε, E_F(µ^2) ≤ γ^2}. Whenever we remove the parameter γ from the subscript we mean the set of all distributions that have a mass of at least 1 − ε at 0, without any second moment constraint. We also use the notation F_{ε,γ}(µ) = (1 − ε)δ_0 + εδ_γ for the distribution that puts a mass of 1 − ε at 0 and a mass of ε at γ; clearly F_{ε,γ}(µ) ∈ F_{ε,γ}. Finally, G_γ = {G : E_G µ^2 ≤ γ^2} is the set of distributions with second moment bounded above by γ^2.

Definition 2: The risk of the soft thresholding function is defined as

r(µ, λ; σ) = E(η(µ + σZ; λ) − µ)^2,    (8)

where Z ∼ N(0, 1) and η is the soft thresholding function defined before. Note that the expectation here is only over the noise Z.

Lemma 3.2: r(µ, λ; σ) is a concave function of µ^2.

For the proof of this lemma, please refer to the appendix.
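The risk (8) is easy to evaluate numerically, which also gives a quick way to eyeball the concavity in µ^2 claimed by Lemma 3.2. A small Monte Carlo sketch:

import numpy as np

def eta(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def risk(mu, lam, sigma, n_mc=400_000, seed=0):
    """Monte Carlo estimate of r(mu, lambda; sigma) = E(eta(mu + sigma*Z; lambda) - mu)^2."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=n_mc)
    return np.mean((eta(mu + sigma * Z, lam) - mu) ** 2)

# Plotting the risk against mu^2 for fixed (lambda, sigma) should give a concave curve.
for mu2 in [0.0, 0.25, 1.0, 4.0, 9.0]:
    print(mu2, risk(np.sqrt(mu2), lam=1.0, sigma=1.0))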

Definition 3: The minimum risk thresholding policy is defined as

λ∗_MR(σ; ε, γ) = arg min_λ sup_{F∈F_{ε,γ}} E_{µ∼F} r(µ, λ; σ).    (9)

The least favorable distribution for soft thresholding is chosen first, and then λ is optimized for the risk of that distribution.

Theorem 3.3: Under the minimum risk thresholding policy λ∗_MR(σ; ε, γ) the SE phase transition happens at

ρ_SE(δ, λ∗_MR) = max_z { [1 − (2/δ)[(1 + z^2)Φ(−z) − zφ(z)]] / [1 + z^2 − 2[(1 + z^2)Φ(−z) − zφ(z)]] }.    (10)

Furthermore,

ρ_SE(δ, λ∗_MR) = sup_{λ(σ)} inf_{G∈G_γ} ρ_SE(δ, λ(σ), G).    (11)
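The right-hand side of (10) is a one-dimensional maximization and can be evaluated directly; the sketch below does so on a grid. By Theorem 3.6 below, the maximizing z is also the optimal β of the fixed false alarm rate policy, so the same routine produces the curve plotted in Figure 4.

import numpy as np
from scipy.stats import norm

def rho_se(delta, z_grid=None):
    """Evaluate the maximin SE phase transition (10) by maximizing over z on a grid.
    Returns (rho, z_star); z_star is also the optimal beta of the FAR policy (Theorem 3.6)."""
    if z_grid is None:
        z_grid = np.linspace(1e-3, 10.0, 20_000)
    m = (1.0 + z_grid ** 2) * norm.cdf(-z_grid) - z_grid * norm.pdf(z_grid)
    vals = (1.0 - (2.0 / delta) * m) / (1.0 + z_grid ** 2 - 2.0 * m)
    i = np.argmax(vals)
    return vals[i], z_grid[i]

for delta in (0.1, 0.3, 0.5):
    print(delta, rho_se(delta))   # rho_se(0.1)[0] is roughly 0.19, matching Figure 3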

Although the above theorem claims that the minimum risk thresholding policy is optimal in the maximin sense, it should be emphasized that it is not yet a legitimate thresholding policy, since in the maximin framework the sparsity level of the signal is not known beforehand, while the minimum risk thresholding policy uses this information through the parameter ε. To fix this issue, we consider the minimum risk thresholding policy and plug in the sparsity level corresponding to the phase transition, i.e. ε = δ ρ_SE(δ, λ∗_MR). From the proof of the above theorem it is clear that this new thresholding policy, which does not use the actual sparsity level, still exhibits a phase transition at exactly the same place as λ∗_MR(σ; ε, γ). The proof of the theorem is deferred to the appendix; here we just mention the parts of the proof that explain several interesting properties of the algorithm.

Corollary 3.4: The least favorable distribution that achieves the supremum in sup_{F∈F_{ε,γ}} E_{µ∼F} r(µ, λ; σ) is independent of the standard deviation of the noise σ.

Corollary 3.5: For any thresholding policy λ(σ) and any G ∈ F_{ε,γ} we have

Ψ(σ^2; G) ≤ Ψ(σ^2; F_{ε,γ})   ∀σ.    (12)

In other words, a two point prior is the least favorable prior for the algorithm. It should be mentioned that the least favorable prior is not unique: all the distributions that have a point mass of 1 − ε at zero and mass ε on the set {−γ, γ} are least favorable as well. Although it is possible to calculate the minimum risk thresholding policy as a function of σ and δ, there are a few problems in using this policy. First, there is no explicit solution for the optimal λ in terms of σ and δ, and therefore we would have to store a huge table in memory. Second, we may need oracle information about γ. There are other, simpler thresholding policies that achieve the same phase transition. In the next section we discuss these simpler approaches and show that they achieve the same phase transition as the minimum risk thresholding policy.

A. Other Thresholding Policies

In the last section we considered the minimum risk thresholding policy and showed that it achieves the maximin phase transition of the SE. In this section we propose two simpler thresholding policies and show that they both achieve the maximin phase transition of the SE. Finally, it will be explained why they all achieve the same phase transition.

1) Fixing False Alarm Rate: In [3] we considered a class of thresholding policies in which λ(σ) = βσ, where β depends only on δ. This is called the fixed false alarm rate thresholding policy. A false alarm happens when an element whose actual value is zero passes the threshold, and this thresholding policy fixes the probability of false alarm.

With this thresholding policy the algorithm has just one free parameter. According to the maximin framework, the optimal value of this parameter is given by

β∗(δ) = arg max_{β∈[0,∞)} inf_{G∈G_γ} ρ(δ, β, G).    (13)

The following theorem about the fixed false alarm rate thresholding policy is proved in [3].

Theorem 3.6: Under the fixed FAR thresholding policy and for the optimal value of β given above, the SE has the following phase transition:

ρ_SE(δ; λ∗_FAR) = max_z { [1 − (2/δ)[(1 + z^2)Φ(−z) − zφ(z)]] / [1 + z^2 − 2[(1 + z^2)Φ(−z) − zφ(z)]] },    (14)

which is exactly the same as the SE phase transition of the minimum risk thresholding policy. Moreover, the optimal value of β is equal to the maximizing z∗ in equation (14).

The following corollary is clear from Theorems 3.6 and 3.3.

Corollary 3.7: The optimized fixed false alarm rate thresholding policy has the same phase transition as the minimum risk thresholding policy and is therefore also maximin for the original problem sup_{λ(σ)} inf_{G∈G_γ} ρ_SE(δ, λ(σ), G).

In other words, the above corollary shows that the phase transition that we get from such a simple thresholding policy is the same as the phase transition of the minimum risk thresholding policy. Figure 4 shows the optimal value of β as a function of δ. The final remark about this algorithm concerns the estimation of σ at each iteration. Although there are many ways to do this, here we explain one simple yet efficient approach based on the median of the absolute values. Since x_o is sparse and at each iteration we observe x_o + σz with z ∼ N(0, I_{N×N}), in order to estimate σ we can calculate the median of |x_o + σz|. The median is robust to outliers, and hence one might expect it to be close to the median of |σz| ≈ 0.6745σ. As mentioned before, there are many other ways to estimate σ. In the next section we introduce another thresholding policy that does not need an estimate of the standard deviation and still achieves the maximin phase transition.

2) Fixing the Number of Detections: Consider another thresholding policy in which the threshold is set according to the equation

P(|X + σZ| > λ_FD(σ)) = α,    (15)

where, as before, X ∼ F ∈ F_{ε,γ} is the sparse signal and Z ∼ N(0, 1) is the Gaussian noise.
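In code, the two practical policies amount to small helpers such as the ones sketched below. The constant 0.6745 is the median of |Z| for Z ∼ N(0, 1); β would be set to β∗(δ), e.g. the maximizing z∗ of (14), and the fixed-number-of-detections threshold is written here with the choice α = δ, which Theorem 3.8 below identifies as the maximin choice.

import numpy as np

def sigma_hat(pseudo_data):
    """Estimate sigma from pseudo-data x_o + sigma*z: for sparse x_o, the median of the
    absolute values is close to the median of |sigma*z| = 0.6745*sigma."""
    return np.median(np.abs(pseudo_data)) / 0.6745

def far_threshold(pseudo_data, beta):
    """Fixed false alarm rate policy: lambda = beta * sigma_hat."""
    return beta * sigma_hat(pseudo_data)

def fd_threshold(pseudo_data, n):
    """Fixed number of detections policy with alpha = delta: lambda is the magnitude of
    the n-th largest entry of |pseudo_data|, so about n coefficients pass the threshold."""
    return np.sort(np.abs(pseudo_data))[-n]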

In practice this thresholding policy corresponds to fixing the number of elements that pass the threshold: at each step the threshold is set to the magnitude of the ℓ-th largest coefficient (in absolute value). A good aspect of this thresholding policy is that no estimate of the variance is involved. Again, the parameter α, which corresponds to the parameter ℓ, is a free parameter that may be optimized in the maximin framework. The following theorem gives a very simple rule for the maximin tuning of this parameter.

Theorem 3.8: For α = δ the fixed number of detections thresholding policy achieves the same phase transition as the minimum risk thresholding policy.

The proof of this theorem is omitted from the paper. It is basically similar to the proof of Theorem 3.6; the main difference is that the calculations needed for proving concavity are more complicated. It is also clear that, according to Theorem 3.3, we are not able to improve this phase transition with any thresholding policy, and therefore this thresholding policy can also be considered maximin for the original problem, i.e. sup_{λ(σ)} inf_{F∈F_{ε,γ}} ρ_SE(δ, λ(σ), F). In practice, the optimal value of α corresponds to the case where ℓ = n elements pass the threshold at each iteration. Therefore at each iteration we can simply set the threshold to the magnitude of the n-th largest coefficient in absolute value.

Fig. 4. Maximin optimal values of β for the fixed false alarm rate thresholding policy.

One of the main questions that is still left unanswered is why all three of these thresholding policies exhibit the same phase transition. The following lemma shows the main similarity among them.

Lemma 3.9: Let λ∗_MR(σ), λ∗_FAR(σ) and λ∗_FD(σ) be the maximin optimally tuned values of λ in the minimum risk, fixed false alarm rate and fixed number of detections thresholding policies, respectively. Suppose X ∼ F with F = (1 − ε)δ_0 + εG, where G is continuous at 0 and ε = ρ∗δ with ρ∗ the value of ρ at the phase transition. Then

lim_{σ→0} P(η(X + σZ; λ∗_MR(σ)) > 0) = lim_{σ→0} P(η(X + σZ; λ∗_FAR(σ)) > 0) = lim_{σ→0} P(η(X + σZ; λ∗_FD(σ)) > 0) = δ.    (16)

The proof of this lemma is summarized in the appendix. What this lemma suggests is that as σ → 0 all these policies set their thresholds similarly. Of course, for larger values of σ they behave very differently, but what matters for the phase transition is the behavior as σ → 0.

B. Relation with ℓ1

So far we have considered three different thresholding policies for the state evolution. As proved in the last section, all these thresholding policies, when tuned in the maximin framework, give us the same phase transition. This phase transition is shown in Figure 5. One very interesting fact about this phase transition curve is the following finding, reported in [3].

Finding 3.10: For any δ ∈ (0, 1),

ρ_SE(δ) = ρ_ℓ1(δ),    (17)

where ρ_SE(δ) is the optimal phase transition obtained from any of the above thresholding policies and ρ_ℓ1 is the phase transition of ℓ1 minimization as derived in [19], [20] and [21]. Now that we know the phase transition of the SE happens at the same place as that of ℓ1, the main questions are the following. Does IST satisfy the SE equations? If not, is there any fast iterative thresholding algorithm that satisfies these equations? These two questions are answered in the next section.

IV. MESSAGE PASSING AND APPROXIMATE MESSAGE PASSING ALGORITHMS

In the last two sections the state evolution was derived from a few assumptions on iterative soft thresholding, and we showed how this framework can be used for tuning the algorithm and calculating the phase transition. Now the main question is: are these assumptions right? Unfortunately they are not right for iterative soft thresholding, and it has been shown that the phase transition of the iterative soft thresholding algorithm, when it is optimally tuned, occurs at a significantly lower place than the phase transition of ℓ1 minimization [2]. Figure 5 also depicts this phenomenon. In [3] it was shown that SE predicts the performance of a modified version of iterative thresholding inspired by message passing algorithms for inference in graphical models and graph-based error correcting codes [29], [30], [31]. These are iterative algorithms whose basic variables (messages) are associated to directed edges in a graph that encodes the structure of the statistical model. The relevant graph here is a complete bipartite graph with N nodes on one side ('variable nodes') and n on the other ('measurement nodes'). Messages are updated according to the rules

x^{t+1}_{i→a} = η( Σ_{b≠a} A_{bi} z^t_{b→i} ),    (18)

z^t_{a→i} = y_a − Σ_{j≠i} A_{aj} x^t_{j→a}.    (19)

MP has one important drawback compared to iterative thresholding. Instead of updating N estimates, at each iteration we need to update Nn messages, which increases the complexity of the algorithm significantly. On the other hand, it is easy to see that the right-hand side of Equation (18) depends weakly on the index a (only one out of n terms is excluded) and that the right-hand side of Equation (19) depends weakly on i. Suppose that

x^t_{i→a} = x^t_i + δx^t_{i→a} + O(1/N),
z^t_{a→i} = z^t_a + δz^t_{a→i} + O(1/N),    (20)

where δx^t_{i→a} and δz^t_{a→i} are of order O(1/√N).

Theorem 4.1: Suppose that the asymptotic behavior (20) holds for the message passing algorithm (18), (19). Then x^t_i and z^t_a satisfy the following equations:

x^{t+1}_i = η_t( Σ_a A_{ia} z^t_a + x^t_i ) + o_N(1),
z^t_a = y_a − Σ_j A_{aj} x^t_j + (|I^{t−1}|/(Nδ)) z^{t−1}_a + o_N(1),    (21)

where |I^{t−1}| is the cardinality of the active set at iteration t − 1 and the o_N(1) terms vanish as N, n → ∞.

The proof of this theorem can be found in the appendix. This algorithm is called Approximate Message Passing or AMP.

Fig. 5. Comparison of the theoretical phase transition of ℓ1 with the empirical phase transitions of AMP and IST for N = 2000. AMP has a significant improvement over the very similar IST algorithm. The top curve is the phase transition of ℓ1 minimization or, as stated in Finding 3.10, the maximin phase transition predicted by the state evolution equations. The IST algorithm is optimally tuned as explained in [2].

In the vector notation of this paper the algorithm can be written as

x^{t+1} = η(x^t + A∗z^t; λ_t),
z^t = y − Ax^t + (|I^{t−1}|/n) z^{t−1}.    (22)

For the threshold parameter we can use one of the thresholding policies mentioned in the last section. As can be seen, this algorithm is very similar to the iterative soft thresholding algorithm introduced at the beginning of the paper. The only difference is the extra term added in the calculation of z^t. Although this extra term barely adds to the computational complexity of the algorithm, it has a significant effect on its phase transition. Figure 5 compares the theoretically predicted phase transition of ℓ1 with the phase transitions of both the AMP and IST algorithms. As may be seen in this figure, the extra term added to the iterations improves the phase transition significantly.
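The following is a minimal NumPy sketch of the AMP iteration (22), using the fixed number of detections policy of Theorem 3.8 for the threshold (the FAR policy of Section III-A.1 could be substituted via far_threshold above). It is an illustrative rendering, not the authors' reference implementation.

import numpy as np

def eta(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def amp(A, y, T=100):
    """AMP iteration (22) with the threshold set to the magnitude of the n-th largest
    entry of |x^t + A^T z^t| (fixed number of detections policy with alpha = delta)."""
    n, N = A.shape
    x = np.zeros(N)
    z = y.copy()
    for t in range(T):
        pseudo_data = x + A.T @ z                 # x^t + A* z^t
        lam = np.sort(np.abs(pseudo_data))[-n]    # n-th largest magnitude
        x = eta(pseudo_data, lam)
        active = np.count_nonzero(x)              # |I^t|, the cardinality of the active set
        z = y - A @ x + (active / n) * z          # the extra term of (22)
    return x

Compared with the ist sketch given earlier, the only change is the (active / n) * z term, which is exactly the modification that moves the empirical phase transition from the IST curve up to the ℓ1 curve in Figure 5.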

It should also be mentioned that other message passing algorithms have been proposed in the literature for the case where a specific prior distribution on the actual coefficients is available; the interested reader may refer to [32]. Also, in a recent paper the authors showed that the AMP algorithm with a specific thresholding policy corresponds to the sum-product belief propagation algorithm on graphs [33]. That thresholding policy is different from the thresholding policies introduced here and does not work as well.

V. EMPIRICAL JUSTIFICATION

In this section we present some simulation results that show extremely good agreement between the theoretical calculations derived for AMP from the state evolution framework and the empirical performance of the algorithm. The first assumption that simplified the analysis was that at each iteration x^t + A∗z^t can be written as x_o + σz, where z ∼ N(0, I_N) and I_N is the N × N identity matrix. Figure 6 shows the qqplot of x^t + A∗z^t − x_o at several different iterations.

Fig. 6. QQ-plot of the interference term at different iterations. The data points lie on a line that passes through the origin, as expected for points drawn from a zero mean Gaussian. The decrease in the slope of the line shows that the standard deviation of the noise is shrinking toward zero. Here N = 2000, δ = .9 and ρ = .5.

We now consider predictions made by SE on some observables and compare them with the empirical values from simulations. We consider predictions of the following four observables:
• The conditional mean square error given that the original value is zero, i.e. E((x^t_i)^2 | x_{o,i} = 0).
• The conditional mean square error given that the original value is non-zero, i.e. E((x^t_i − x_{o,i})^2 | x_{o,i} ≠ 0).
• The missed detection rate, i.e. P(x^t_i = 0 | x_{o,i} ≠ 0).
• The false alarm rate, i.e. P(x^t_i ≠ 0 | x_{o,i} = 0).

For our simulation purposes we assume that x_o is drawn from F_{ε,1}. The theoretical calculation of the above quantities is simple, and all of them can be expressed in terms of the cdf and pdf of Gaussian distributions. For the empirical calculation of the above observables we run the algorithm 200 times and obtain 200 Monte Carlo samples; averaging these two hundred samples gives the empirical curves. Figure 7 compares the theoretical and empirical results. Although our analysis is for N → ∞, in this medium size problem the empirical results match the theoretical results extremely well, and for the first three observables the empirical results are indistinguishable from the theoretical results. For more simulations on the match between the theoretical and empirical results and detailed comparisons with other existing algorithms, refer to [34].
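For completeness, the SE predictions of the four observables can be evaluated as sketched below, assuming the SE model of Section III in which a coordinate of x^t + A∗z^t behaves like x_{o,i} + σ_t Z and the prior is F_{ε,1}. The exact Gaussian cdf/pdf expressions mentioned above are replaced here by Monte Carlo (except for the false alarm rate) to keep the sketch short.

import numpy as np
from scipy.stats import norm

def eta(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def se_observables(sigma, lam, nonzero_value=1.0, n_mc=500_000, seed=0):
    """SE predictions of the four observables at noise level sigma and threshold lam."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=n_mc)
    est_zero = eta(sigma * Z, lam)                       # coordinate with x_{o,i} = 0
    est_nonzero = eta(nonzero_value + sigma * Z, lam)    # coordinate with x_{o,i} = 1
    return {
        "mse_given_zero": np.mean(est_zero ** 2),
        "mse_given_nonzero": np.mean((est_nonzero - nonzero_value) ** 2),
        "missed_detection_rate": np.mean(est_nonzero == 0.0),
        "false_alarm_rate": 2.0 * norm.cdf(-lam / sigma),   # P(|sigma*Z| > lam) in closed form
    }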

Fig. 7. Comparison of empirical and theoretical estimates of four observables of the AMP algorithm. Here N = 5000, δ = .3 and ρ = .15. The empirical data is the average of 200 Monte Carlo samples.

VI. CONCLUSION

In this paper we analyzed state evolution, a theoretical framework for calculating the asymptotic behavior of message passing and approximate message passing algorithms. This framework let us analyze the performance of the AMP algorithm and find the optimal thresholding policy. We introduced three different thresholding policies that are all maximin and all exhibit the same phase transition. As shown in [3], this optimal phase transition is the same as the phase transition of ℓ1 minimization.

REFERENCES

[1] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, pp. 407–499, 2004.
[2] A. Maleki and D. L. Donoho, "Optimally tuned iterative thresholding algorithm for compressed sensing," IEEE journal on selected areas in signal processing, 2009. Submitted.
[3] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing," Proceedings of the National Academy of Sciences, vol. 106, Nov. 2009.
[4] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, April 2006.
[5] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, February 2006.
[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.
[7] I. Daubechies, M. Defrise, and C. D. Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 75, pp. 1412–1457, 2004.
[8] S. Sardy, A. G. Bruce, and P. Tseng, "Block coordinate relaxation methods for nonparametric wavelet denoising," Journal of Computational and Graphical Statistics, vol. 9, pp. 361–379, 2000.
[9] J. L. Starck, M. Elad, and D. L. Donoho, "Redundant multiscale transforms and their application for morphological component analysis," Journal of Advances in Imaging and Electron Physics, vol. 132, pp. 287–348, 2004.
[10] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Modeling and Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
[11] K. K. Herrity, A. C. Gilbert, and J. A. Tropp, "Sparse approximation via iterative thresholding," Proc. ICASSP, vol. 3, pp. 624–627, May 2006.
[12] D. L. Donoho, I. Drori, Y. Tsaig, and J. L. Starck, "Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit," Stanford Statistics Department Technical Report, 2006.
[13] M. Fornasier and H. Rauhut, "Iterative thresholding algorithms," Applied and Computational Harmonic Analysis, vol. 25, no. 2, pp. 187–208, 2008.
[14] T. Blumensath and M. E. Davies, "Iterative thresholding for sparse approximations," Journal of Fourier Analysis and Applications, special issue on sparsity, vol. 14, no. 5, pp. 629–654, 2008.
[15] T. Blumensath and M. E. Davies, "Iterative hard thresholding for compressed sensing," Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009.
[16] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, "Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing," SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.
[17] D. L. Donoho, M. Elad, and V. Temlyakov, "Stable recovery of sparse overcomplete representations in the presence of noise," IEEE Transactions on Information Theory, vol. 52, pp. 6–18, January 2006.
[18] E. Candès, "The restricted isometry property and its implications for compressed sensing," Compte Rendus de l'Academie des Sciences, vol. 346, pp. 589–592, 2008.
[19] D. L. Donoho and J. Tanner, "Neighborliness of randomly-projected simplices in high dimensions," Proceedings of the National Academy of Sciences, vol. 102, no. 27, pp. 9452–9457, 2005.
[20] D. L. Donoho and J. Tanner, "Counting faces of randomly projected polytopes when the projection radically lowers dimension," Journal of the American Mathematical Society, vol. 22, pp. 1–53, 2009.
[21] D. Donoho and J. Tanner, "Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing," Phil. Trans. A, 2009.
[22] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[23] D. L. Donoho and Y. Tsaig, "Fast solution of ℓ1-norm minimization problems when the solution may be sparse," IEEE Transactions on Information Theory, vol. 54, pp. 4789–4812, November 2008.
[24] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, "One sketch for all: fast algorithms for compressed sensing," Proc. STOC, pp. 237–246, 2007.
[25] D. L. Donoho, "High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension," Discrete and Computational Geometry, vol. 35, no. 4, pp. 617–652, 2006.
[26] D. L. Donoho and J. Tanner, "Observed universality of phase transitions in high-dimensional geometry, with applications in modern signal processing and data analysis," Philosophical Transactions of the Royal Society A, vol. 367, no. 1906, pp. 4273–4293, 2009.
[27] D. Donoho, I. Johnstone, J. Hoch, and A. Stern, "Maximum entropy and the nearly black object," Journal of the Royal Statistical Society, Series B (Methodological), vol. 54, no. 1, pp. 41–81, 1992.
[28] D. L. Donoho and I. M. Johnstone, "Ideal spatial adaptation via wavelet shrinkage," Biometrika, vol. 81, pp. 425–455, 1994.
[29] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, Massachusetts: MIT Press, 1963. Available online at http://web./gallager/www/pages/ldpc.pdf.
[30] T. J. Richardson and R. Urbanke, Modern Coding Theory. Cambridge: Cambridge University Press, 2008. Available online at http://lthcwww.epfl.ch/mct/index.php.
[31] T. J. Richardson and R. Urbanke, "The capacity of low-density parity check codes under message-passing decoding," IEEE Trans. Inform. Theory, vol. 47, pp. 599–618, 2001.
[32] S. Sarvotham, D. Baron, and R. Baraniuk, "Compressed sensing reconstruction via belief propagation." Preprint, 2006.
[33] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing: I. motivation and construction," Proc. of Information Theory Workshop, 2010.
[34] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing," Proceedings of the National Academy of Sciences, vol. 106, Nov. 2009. Supplement Information.

APPENDIX

A. Proof of Lemma 3.1

In this proof x_i represents the i-th element of x and H = A∗A − I. H_i and A_i denote the i-th columns of the H and A matrices respectively, and H_ij and A_ij are the ij-th elements of H and A. It is easy to see that

E(H_ij) = 0.    (23)

Therefore,

E(s_i) = E( Σ_j H_ij x_j ) = Σ_j E(H_ij) E(x_j) = 0.

It is also clear that H_ik depends only on the i-th and k-th columns of the A matrix. Hence if i ≠ j, i ≠ ℓ, j ≠ k and j ≠ ℓ,

E(H_ij H_kℓ) = 0.    (24)

Now consider the case i ≠ j and ℓ ≠ i, j:

E(H_ij H_jℓ) = E(A∗_i A_j A∗_j A_ℓ) = E(A∗_i) E(A_j A∗_j A_ℓ) = 0.

The final case we consider here is i ≠ j:

E(H_ij^2) = E(A∗_i A_j A∗_j A_i) = (1/n) E(A∗_i A_i) = 1/n.

Therefore,

E(s_i s_j) = E( (Σ_k H_ik x_k)(Σ_ℓ H_jℓ x_ℓ) ) = Σ_k Σ_ℓ E(H_ik H_jℓ) E(x_k x_ℓ) =^(1) E(H_ij H_ji) E(x_j x_i) = O(1/n).    (25)

Equality (1) is the result of what was proved before for the covariance of H_ij and H_kℓ. Moreover, E(s_i^2) = E(x∗ H_i H_i∗ x) = E(x∗ E(H_i∗ H_i) x). From what we proved before, E(H_i∗ H_i) is a diagonal matrix and therefore

E(x∗ E(H_i∗ H_i) x) = E(x_i^2) Σ_k E(H_ik^2).    (26)

We also have

Σ_{k≠i} E((A∗_i A_k)^2) = Σ_{k≠i} E Tr(A∗_i A_k A∗_k A_i) = Σ_{k≠i} (1/n^2) Tr(I_n) = (N − 1)/n = 1/δ + O(1/n),    (27)

where I_n is the n × n identity matrix, and

E((A∗_i A_i − 1)^2) = E((A∗_i A_i)^2 − 2(A∗_i A_i) + 1) = E((A∗_i A_i)^2) − 2 + 1 = E( (Σ_j A_ij^2)(Σ_k A_ik^2) ) − 1 = Σ_{j≠k} E(A_ij^2 A_ik^2) + Σ_k E(A_ik^4) − 1 = n(n − 1)/n^2 + O(1/n) − 1 = O(1/n).

By combining the three above equations we have

E(s_i^2) = E(x_i^2)/δ + O(1/n).    (28)

B. Proof of Lemma 3.2

In order to prove that the soft thresholding risk is a concave function of µ^2 we use the second order conditions. For simplicity of notation we assume here that µ > 0. We have

dr(µ, λ; σ)/dµ^2 = (1/µ) E[−I(−λ < µ + σZ < λ)(η(µ + σZ; λ) − µ)] = E[I(−λ < µ + σZ < λ)] = ∫_{(−λ−µ)/σ}^{(λ−µ)/σ} φ_Z(z) dz.

Therefore the second derivative is given by

d^2 r(µ, λ; σ)/d(µ^2)^2 = −(1/(σµ)) ( φ((λ − µ)/σ) + φ((λ + µ)/σ) ).

Since λ, µ > 0 the second derivative is negative and therefore the risk function is concave.

C. Proof of Theorem 3.3

We break the proof into a few simpler steps.

Lemma 1.1: For any λ and σ, F_{ε,γ}(µ) is a least favorable distribution for the soft thresholding risk function, i.e. one of the distributions that achieve sup_{F∈F_{ε,γ}} E_{µ∼F} r(µ, λ; σ). The other distributions that achieve this supremum have probability mass 1 − ε at zero and ε on the set {−γ, γ}.

Proof: Suppose that the distribution of the non-zero elements of µ is G(µ). Then

E(r(µ, λ; σ)) = (1 − ε) E(r(0, λ; σ)) + ε E_{µ∼G}( E(r(µ, λ; σ) | µ) ).

Since the first part of the risk is fixed and does not depend on the distribution G, we ignore it and focus on the second term. Writing Υ(µ; λ, σ) = E(r(µ, λ; σ) | µ),

E_{µ∼G} Υ(µ; λ, σ) = E Υ(µ^2; λ, σ) ≤ Υ(γ^2; λ, σ).    (29)

The inequality follows from Jensen's inequality and the concavity of Υ proved in Lemma 3.2. Equality is achieved if and only if P(µ ∈ {γ, −γ}) = 1, although we should emphasize that this is only true for the non-zero elements, since we conditioned on this event.

Corollary 1.2: The least favorable distribution is independent of σ, and therefore no other thresholding policy can have a better phase transition than the minimum risk policy.

Lemma 1.3:

lim_{σ→0} λ∗_MR(σ)/σ = c,    (30)

where c is a finite number different from zero.

Proof: By calculating the derivative of the risk function for the least favorable distribution with respect to λ and setting it to zero, we get

(λ∗_MR/σ) E( 1 − Φ((λ∗_MR − µ)/σ) + 1 − Φ((λ∗_MR + µ)/σ) ) = E( φ((λ∗_MR − µ)/σ) + φ((λ∗_MR + µ)/σ) ),    (31)

where φ and Φ are the probability density function and distribution function of a normal random variable with mean zero and variance one. Now suppose that lim_{σ→0} λ/σ = 0. Then the left hand side converges to zero while the right hand side converges to a positive number. On the other hand, if λ/σ → ∞ then the left hand side diverges while the right hand side goes to 0. Therefore the limit converges to a constant.

Theorem 1.4: Under the minimum risk thresholding policy λ∗_MR the SE phase transition happens at

ρ_MR(δ) = max_z { [1 − (2/δ)[(1 + z^2)Φ(−z) − zφ(z)]] / [1 + z^2 − 2[(1 + z^2)Φ(−z) − zφ(z)]] }.    (32)

Proof: Consider the following two thresholding policies. The first one is λ_MR and the second one is λ(σ) = βσ, where β is fixed and does not depend on σ; suppose moreover that β = lim_{σ→0} λ∗_MR(σ)/σ. Call the state evolution mappings for these two thresholding policies Ψ_MR(σ^2) and Ψ_L(σ^2). Clearly we have

Ψ_MR(σ^2) ≤ Ψ_L(σ^2),

and

lim_{σ→0} Ψ_MR(σ^2)/Ψ_L(σ^2) = 1.

Clearly,

ρ_MR(δ) ≥ ρ_L(δ).

But it has been proved in [3] that Ψ_L(σ^2) is a concave function, and therefore for all ρ > ρ_L,

lim_{σ→0} Ψ_L(σ^2)/σ^2 > 1,    (33)

which makes it clear that for all ρ > ρ_L,

lim_{σ→0} Ψ_MR(σ^2)/σ^2 > 1.

In other words this tells us that ρ_MR(δ) ≤ ρ_L(δ). So we see that these two thresholding policies give us the same phase transition.

D. Proof of Lemma 3.9

For the fixed number of detections thresholding policy the claim follows directly from the definition. Also, in Appendix C we showed that lim_{σ→0} λ∗_MR(σ)/λ∗_FAR(σ) = 1, and therefore the only case left to prove is the fixed false alarm rate thresholding policy. For simplicity of notation we write β∗_FAR = β∗. We have

lim_{σ→0} P(η(X + σZ; λ∗_FAR(σ)) > 0) = 2(1 − ε) P(σZ > β∗σ) + ε P(|X + σZ| > β∗σ) = 2(1 − ε)(1 − Φ(β∗)) + ε.    (34)

In [34] we showed the following relations:

δ = φ(β∗) / ( φ(β∗) + β∗(Φ(β∗) − 1/2) ),
ρ∗ = 1 − β∗(1 − Φ(β∗)) / φ(β∗).    (35)

By plugging these two equations into Equation (34) we get

2(1 − ε)(1 − Φ(β∗)) + ε = δ,

which is clearly what we wanted to prove.

E. Proof of Theorem 4.1

To prove the theorem we substitute (20) in the general equations (18) and (19) and write the Taylor expansion of the latter. For this proof, in order to keep the notation and the proof as simple as possible, we assume that the elements of the matrix are drawn i.i.d. from ±1/√n with probability 1/2. The update equation for z^t_{a→i} yields

z^t_{a→i} = y_a − Σ_{j∈[N]} A_{aj} x^t_j − Σ_{j∈[N]} A_{aj} δx^t_{j→a} + A_{ai} x^t_i + O(1/N),

where the first three terms are identified with z^t_a and the last term A_{ai} x^t_i with δz^t_{a→i}. For x^{t+1}_{i→a} we have

x^{t+1}_{i→a} = η_t( Σ_{b∈[n]} A_{bi} z^t_b + Σ_{b∈[n]} A_{bi} δz^t_{b→i} ) − A_{ai} z^t_a η'_t( Σ_{b∈∂i} A_{bi} z^t_b + Σ_{b∈∂i} A_{bi} δz^t_{b→i} ) + O(1/N),

where the first term is identified with x^{t+1}_i and the second with δx^{t+1}_{i→a}. Substituting the expressions indicated for δx^t_{i→a} and δz^t_{a→i}, we obtain the recursions for x^t_i and z^t_a. In particular x^t_i is updated according to

x^{t+1}_i = η_t( Σ_{b∈[n]} A_{bi} z^t_b + Σ_{b∈[n]} A_{bi} δz^t_{b→i} ) + o(1)
         = η_t( Σ_{b∈[n]} A_{bi} z^t_b + Σ_{b∈[n]} A_{bi}^2 x^t_i ) + o(1)
         = η_t( Σ_{b∈[n]} A_{bi} z^t_b + x^t_i ) + o(1).

For z^t_a we get

z^t_a = y_a − Σ_{j∈[N]} A_{aj} x^t_j + Σ_{j∈[N]} A_{aj}^2 z^{t−1}_a η'_{t−1}( Σ_{b∈[n]} A_{bj} z^{t−1}_b + Σ_{b∈[n]} A_{bj} δz^{t−1}_{b→j} ) + o(1)
     = y_a − Σ_{j∈[N]} A_{aj} x^t_j + (1/n) z^{t−1}_a Σ_{j∈[N]} η'_{t−1}( Σ_{b∈[n]} A_{bj} z^{t−1}_b + x^{t−1}_j ) + o(1)
     = y_a − Σ_{j∈[N]} A_{aj} x^t_j + (1/δ) z^{t−1}_a ⟨ η'_{t−1}( Σ_{b∈[n]} A_{bj} z^{t−1}_b + x^{t−1}_j ) ⟩ + o(1).

This concludes the proof.