
Optimized Projections for Compressed Sensing via Direct Mutual Coherence Minimization

arXiv:1508.03117v1 [cs.IT] 13 Aug 2015

Zhouchen Lin, Senior Member, IEEE, Canyi Lu and Huan Li

Compressed Sensing (CS) is a novel technique for simultaneous signal sampling and compression based on the existence of a sparse representation of the signal and a projected dictionary $PD$, where $P \in \mathbb{R}^{m\times d}$ is the projection matrix and $D \in \mathbb{R}^{d\times n}$ is the dictionary. To exactly recover the signal with a small number of measurements $m$, the projected dictionary $PD$ is expected to have low mutual coherence. Several previous methods attempt to find a projection $P$ such that the mutual coherence of $PD$ is as low as possible. However, they do not minimize the mutual coherence directly, so their solutions are far from optimal. Moreover, the solvers they use lack convergence guarantees, and hence there is no guarantee on the quality of their obtained solutions. This work aims to address these issues. We propose to find an optimal projection by minimizing the mutual coherence of $PD$ directly. This leads to a nonconvex, nonsmooth minimization problem. We then approximate it by smoothing and solve it by alternating minimization, and we further prove the convergence of our algorithm. To the best of our knowledge, this is the first work which directly minimizes the mutual coherence of the projected dictionary with a convergence guarantee. Numerical experiments demonstrate that the proposed method recovers sparse signals better than existing methods.

I. INTRODUCTION

The recently developed theory of Compressed Sensing (CS) [1], [2] is a novel technique for joint signal sampling and compression. It shows that signals which have a sparse representation with respect to appropriate bases can be recovered from a small number of measurements. A fundamental problem in CS is how to construct a measurement matrix so that the number of measurements is near minimal. Consider a signal $x \in \mathbb{R}^d$ which is assumed to have a sparse representation over a fixed redundant dictionary $D \in \mathbb{R}^{d\times n}$ ($d < n$). This can be described as
$$x = D\alpha, \qquad (1)$$

where $\alpha \in \mathbb{R}^n$ is a sparse representation coefficient, i.e., $\|\alpha\|_0 \ll n$. Here $\|\alpha\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero elements in $\alpha$. The solution to (1) is not unique since $d < n$. To find a unique solution, we need to exploit additional structure of $D$ and $\alpha$. Since $\alpha$ is sparse, we are interested in system (1) with the sparsest representation coefficient $\alpha$. This leads to the following sparse representation problem
$$\min_{\alpha} \|\alpha\|_0, \quad \text{s.t. } x = D\alpha. \qquad (2)$$

Z. Lin and H. Li are with the Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, China. They are also with Cooperative Medianet Innovation Center, Shanghai Jiaotong University, China. (e-mail: [email protected]; lihuan [email protected]) C. Lu is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]).

However, the above problem is known to be NP-hard [3] and is thus challenging to solve. Suboptimal methods such as Basis Pursuit (BP) [4] and Orthogonal Matching Pursuit (OMP) [5] are used in practice. An interesting theoretical question is: under what conditions is the solution to (2) unique? And if the solution is unique, when can it be exactly or approximately computed by BP or OMP? Previous work establishes the uniqueness of the sparse solution to (2) based on the mutual coherence of the dictionary $D$ [6].

Definition 1: Given $D = [d_1, \cdots, d_n] \in \mathbb{R}^{d\times n}$, its mutual coherence is defined as the largest absolute normalized inner product between different columns of $D$, i.e.,
$$\mu(D) = \max_{1\le i,j\le n,\; i\ne j} \frac{|d_i^T d_j|}{\|d_i\|\, \|d_j\|}.$$
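For concreteness, the following is a minimal NumPy sketch (our own illustration, not part of the original paper; the function name is our choice) of how the mutual coherence in Definition 1 can be computed.

```python
import numpy as np

def mutual_coherence(D):
    """Mutual coherence of D: the largest absolute normalized inner product
    between distinct columns (Definition 1)."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize each column
    G = np.abs(Dn.T @ Dn)                              # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore the unit diagonal
    return G.max()

# Example: coherence of a random Gaussian dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 60))
print(mutual_coherence(D))
```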

The mutual coherence measures the highest correlation between any two columns of $D$. It is expected to be as low as possible in order to find the sparsest solution to (2).

Theorem 1: [6], [7], [8] For problem (2), if $\alpha$ satisfies
$$\|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right), \qquad (3)$$
then the following hold:
• $\alpha$ is the solution to (2).
• $\alpha$ is also the solution to the following convex $\ell_1$-minimization problem
$$\min_{\alpha} \|\alpha\|_1, \quad \text{s.t. } x = D\alpha,$$
where $\|\alpha\|_1 = \sum_i |\alpha_i|$ is the $\ell_1$-norm of $\alpha$.
• $\alpha$ can be obtained by OMP.

The above theorem shows that if the mutual coherence of $D$ is low enough, then the sparsest solution to (2) is computable. Thus, how to construct a dictionary $D$ with low mutual coherence is crucial in sparse coding. In CS, to reduce the number of measurements, we face a similar problem for the sensing matrix construction. CS merges data acquisition and compression and guarantees exact signal recovery from a number of measurements that can be much smaller than what the conventional Nyquist-Shannon sampling theorem requires. Given the signal $x \in \mathbb{R}^d$ in (1), CS suggests replacing the $d$ direct samples with $m$ indirect ones obtained by measuring linear projections of $x$ defined by a proper projection or sensing matrix $P \in \mathbb{R}^{m\times d}$, i.e.,
$$y = Px, \qquad (4)$$


such that $m \ll d$. This means that instead of sensing all $d$ elements of the original signal $x$, we can sense it indirectly through its compressed form $y$, which has a much smaller size $m$. Surprisingly, the original signal $x$ can be recovered from the observed $y$ by using the sparse representation in (1), i.e., $y = PD\alpha$ with the sparsest $\alpha$. Thus the reconstruction requires solving the following problem
$$\min_{\alpha} \|\alpha\|_0, \quad \text{s.t. } y = M\alpha, \qquad (5)$$
where $M = PD \in \mathbb{R}^{m\times n}$ is called the effective dictionary. Problem (5) is NP-hard. As suggested by Theorem 1, if the mutual coherence of $PD$ is low enough, then the solution $\alpha$ to (5) is computable by OMP or by solving the following convex problem
$$\min_{\alpha} \|\alpha\|_1, \quad \text{s.t. } y = M\alpha. \qquad (6)$$

Finally, the original signal $x$ can be reconstructed by $x = D\alpha$. So it is desirable to find a projection matrix $P$ such that $\mu(PD)$ is low. Furthermore, much previous work [9], [10] shows that the number of measurements required to recover the signal $x$ by CS can be reduced if $\mu(PD)$ is low. In a word, the above discussion implies that by choosing an appropriate projection matrix $P$ such that $\mu(PD)$ is low enough, the true signal $x$ can be recovered with high probability by efficient algorithms. Previously, the random projection matrix was shown to be a good choice since its columns are incoherent with any fixed basis $D$ with high probability [11]. However, much previous work [9], [12], [10] shows that a well designed deterministic projection matrix can usually lead to better signal reconstruction performance than random projections. In this work, we focus on the construction of deterministic projection matrices. We first give a brief review of previous deterministic methods.

A. Related Work

In this work, we only consider the case that $D$ is fixed while $P$ can be arbitrary. Our target is to find $P$ by minimizing $\mu(M) = \mu(PD)$. If each column of $M$ is normalized to Euclidean unit length, then $\mu(M) = \|G\|_{\infty,\text{off}}$, where $G = (g_{ij}) = M^T M$ is called the Gram matrix and $\|G\|_{\infty,\text{off}} = \max_{i\ne j} |g_{ij}|$ is the largest off-diagonal entry of $|G|$. Several previous works used the Gram matrix to find the projection matrix $P$ [9], [12], [10]. We review these methods below.

1) The algorithm of Elad: The algorithm of Elad [9] minimizes the t-averaged mutual coherence, defined as the average of the absolute normalized inner products between different columns of $M$ that are above a threshold $t$, i.e.,
$$\mu_t(M) = \frac{\sum_{1\le i,j\le n,\, i\ne j} \chi_t(|g_{ij}|)\,|g_{ij}|}{\sum_{1\le i,j\le n,\, i\ne j} \chi_t(|g_{ij}|)},$$
where $\chi_t(x)$ is the characteristic function defined as
$$\chi_t(x) = \begin{cases} 1, & \text{if } x \ge t, \\ 0, & \text{otherwise}, \end{cases}$$
and $t$ is a fixed threshold which controls the top fraction of the matrix elements of $|G|$ that are to be considered. To find $P$ by minimizing $\mu_t(M)$, some properties of the Gram matrix $G = M^T M$ are used. Assume that each column of $M$ is normalized to Euclidean unit length. Then
$$\operatorname{diag}(G) = \mathbf{1}, \qquad (7)$$
$$\operatorname{rank}(G) = m. \qquad (8)$$
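Before turning to the update steps, here is a small NumPy sketch (our own illustration, not code from [9]; the helper name is ours) contrasting the t-averaged measure $\mu_t(M)$ defined above with the true mutual coherence $\mu(M)$, which is the quantity this paper targets directly.

```python
import numpy as np

def t_averaged_coherence(M, t):
    """t-averaged mutual coherence: mean of the off-diagonal |g_ij| that are >= t."""
    Mn = M / np.linalg.norm(M, axis=0, keepdims=True)
    G = np.abs(Mn.T @ Mn)
    off = G[~np.eye(G.shape[0], dtype=bool)]      # off-diagonal entries of |G|
    selected = off[off >= t]
    return selected.mean() if selected.size else 0.0

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 60))
Mn = M / np.linalg.norm(M, axis=0, keepdims=True)
G = np.abs(Mn.T @ Mn)
print(t_averaged_coherence(M, t=0.2))             # the averaged surrogate minimized in [9]
print(G[~np.eye(G.shape[0], dtype=bool)].max())   # the true mutual coherence mu(M)
```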

The work [9] proposed to minimize $\mu_t(M)$ by iteratively updating $P$ as follows. First, initialize $P$ as an arbitrary random matrix and normalize each column of $PD$ to Euclidean unit length. Second, shrink the elements of $G = M^T M$ (where $M = PD$) by
$$g_{ij} = \begin{cases} \gamma g_{ij}, & \text{if } |g_{ij}| \ge t, \\ \gamma t\, \mathrm{sign}(g_{ij}), & \text{if } t > |g_{ij}| \ge \gamma t, \\ g_{ij}, & \text{if } \gamma t > |g_{ij}|, \end{cases}$$
where $0 < \gamma < 1$ is the down-scaling factor. Third, apply SVD and reduce the rank of $G$ to be equal to $m$. Finally, build the square root of $G$, $S^T S = G$, where $S \in \mathbb{R}^{m\times n}$, and find $P = SD^\dagger$, where $\dagger$ denotes the Moore-Penrose pseudoinverse.

There are several limitations of the algorithm of Elad. First, it is suboptimal since the t-averaged mutual coherence $\mu_t(M)$ differs from the mutual coherence $\mu(M)$, which is our real target. Second, the proposed algorithm for minimizing $\mu_t(M)$ has no convergence guarantee, so there is no guarantee on the quality of the obtained solution. Third, the choices of the two parameters $t$ and $\gamma$ are crucial for the signal recovery performance in CS; tuning them empirically in practice incurs additional cost.

2) The algorithm of Duarte-Carvajalino and Sapiro: The algorithm of Duarte-Carvajalino and Sapiro [12] is not based on the mutual coherence. It instead aims to find the sensing matrix $P$ such that the corresponding Gram matrix is as close to the identity as possible, i.e.,
$$G = M^T M = D^T P^T P D \approx I, \qquad (9)$$

where $I$ denotes the identity matrix. Multiplying both sides of the previous expression by $D$ on the left and $D^T$ on the right, it becomes
$$DD^T P^T P DD^T \approx DD^T. \qquad (10)$$
Let $DD^T = V\Lambda V^T$ be the eigen-decomposition of $DD^T$. Then (10) is equivalent to
$$\Lambda V^T P^T P V \Lambda = \Lambda. \qquad (11)$$

Define $\Gamma = PV$. Then they formulated the following model w.r.t. $\Gamma$:
$$\min_{\Gamma} \left\| \Lambda - \Lambda \Gamma^T \Gamma \Lambda \right\|_F. \qquad (12)$$

After solving the above problem, the projection matrix can be obtained by $P = \Gamma V^T$. However, the signal recovery performance of the algorithm of Duarte-Carvajalino and Sapiro is not very good, since $M$ is overcomplete and the Gram matrix $G$ cannot be an identity matrix.


3) The algorithm of Xu et al.: The algorithm of Xu et al. [10] is motivated by the known Welch bound [13]. For any $M \in \mathbb{R}^{m\times n}$, the mutual coherence $\mu(M)$ is lower bounded, i.e.,
$$\mu(M) \ge \sqrt{\frac{n-m}{m(n-1)}}. \qquad (13)$$
The algorithm of Xu et al. aims to find $M$ such that the off-diagonal elements of $G = M^T M$ approximate the Welch bound well. They proposed to solve the following problem
$$\min_{G} \|G - G_\Lambda\|_F \quad \text{s.t. } G_\Lambda = G_\Lambda^T,\ \operatorname{diag}(G_\Lambda) = \mathbf{1},\ \|G_\Lambda\|_{\infty,\text{off}} \le \mu_W, \qquad (14)$$
where $\mu_W = \sqrt{\frac{n-m}{m(n-1)}}$. The proposed iterative solver for the above problem is similar to the algorithm of Elad; the main difference is the shrink function used to control the elements of $G$. See [10] for more details. However, their solver for (14) also lacks a convergence guarantee. Another issue is that, for $M \in \mathbb{R}^{m\times n}$, the Welch bound (13) is not tight when $n$ is large. Actually, equality in (13) can hold only if $n \le \frac{m(m+1)}{2}$. This implies that the algorithm of Xu et al. is not optimal in the case that $n > \frac{m(m+1)}{2}$.

B. Contributions

There are at least two main issues in the previous methods reviewed above. First, none of them finds $P$ by directly minimizing $\mu(PD)$, which is our real target, so their objectives are not optimal. For their obtained solutions $P$, $\mu(PD)$ is usually much larger than the Welch bound in (13). Second, the algorithms of Elad and Xu et al. have no convergence guarantee and thus may produce very different solutions given slightly different initializations. The convergence issue may limit their application in CS.

To address the above issues, we develop the Direct Mutual Coherence Minimization (DMCM) models. First, we show how to construct a low mutual coherence matrix $M$ by minimizing $\mu(M)$ directly. This leads to a nonconvex and nonsmooth problem. To solve it efficiently, we first smooth the objective function so that its gradient is Lipschitz continuous, and then solve the approximated problem by proximal gradient, which has a convergence guarantee. Second, motivated by DMCM, we propose the DMCM based Projection (DMCM-P) model, which aims to find a projection $P$ by minimizing $\mu(PD)$ directly. To solve the nonconvex DMCM-P problem, we propose an alternating minimization method and provide its convergence guarantee in theory. Experimental results show that our DMCM-P achieves lower mutual coherence of $PD$ and thus leads to better signal recovery performance.

II. LOW MUTUAL COHERENCE MATRIX CONSTRUCTION

In this section, we show how to construct a matrix $M \in \mathbb{R}^{m\times n}$ with low mutual coherence $\mu(M)$ by DMCM. Assume that each column of $M$ is normalized to Euclidean unit length. Then we aim to find $M$ by the following DMCM model

$$\min_{M\in\mathbb{R}^{m\times n}} \mu(M) = \|M^T M\|_{\infty,\text{off}} \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \qquad (15)$$
where $M_i$ (or $(M)_i$) denotes the $i$-th column of $M$. The above problem is equivalent to

$$\min_{M\in\mathbb{R}^{m\times n}} f(M) = \|M^T M - I\|_{\infty} \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \qquad (16)$$
where $\|A\|_\infty = \max_{i,j}|a_{ij}|$ denotes the infinity norm of $A$. Solving the above problem is not easy since it is nonconvex and its objective is nonsmooth. Generally, due to the nonconvexity, the globally optimal solution to (16) is not computable. We instead aim to find a locally optimal solution with a convergence guarantee. First, to ease the problem, we adopt the smoothing technique in [14] to smooth the nonsmooth $\ell_\infty$-norm in the objective of (16). Note that the $\ell_\infty$-norm $f$ can be rewritten as

$$f(M) = \|M^T M - I\|_\infty = \max_{\|V\|_1 \le 1} \langle M^T M - I, V\rangle,$$
where $\|V\|_1 = \sum_{ij} |v_{ij}|$ denotes the $\ell_1$-norm of $V$. Since $\{V \mid \|V\|_1 \le 1\}$ is a bounded convex set, we can define a proximal function $d(V)$ on this set, where $d(V)$ is continuous and strongly convex. A natural choice is $d(V) = \frac{1}{2}\|V\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm. Hence, we have the following smooth approximation of $f$ defined in (16):
$$f_\rho(M) = \max_{\|V\|_1 \le 1} \langle M^T M - I, V\rangle - \frac{\rho}{2}\|V\|_F^2, \qquad (17)$$

where $\rho > 0$ is a smoothness parameter. Note that the smooth function $f_\rho$ approximates the nonsmooth $f$ with arbitrary precision and is easier to minimize. Indeed, $f$ and $f_\rho$ satisfy
$$f_\rho(M) \le f(M) \le f_\rho(M) + \rho\gamma,$$
where $\gamma = \max_V \{\frac{1}{2}\|V\|_F^2 \mid \|V\|_1 \le 1\}$. For any $\epsilon > 0$, if we choose $\rho = \epsilon/\gamma$, then $|f(M) - f_\rho(M)| \le \epsilon$. This implies that if $\rho$ is sufficiently small, then the difference between $f$ and $f_\rho$ is sufficiently small. This motivates us to replace $f$ with $f_\rho$ in (16), which yields the following relaxed problem
$$\min_{M\in\mathbb{R}^{m\times n}} f_\rho(M) \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n. \qquad (18)$$
Problem (18) is easier to solve since $\nabla f_\rho(M) = M(V^* + {V^*}^T)$, where $V^*$ is the optimal solution to (17), is Lipschitz continuous. That is, for any $M_1, M_2 \in \mathbb{R}^{m\times n}$, there exists a constant $L = 1/\rho$ such that


$$\|\nabla f_\rho(M_1) - \nabla f_\rho(M_2)\|_F \le L\, \|M_1 - M_2\|_F.$$
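To make the smoothing concrete, the following NumPy sketch (our own illustration, not from the paper; names are ours) evaluates $f_\rho(M)$ and its gradient $M(V^* + {V^*}^T)$. The maximizer $V^*$ of (17) is the Euclidean projection of $(M^T M - I)/\rho$ onto the $\ell_1$ ball, computed here with the standard sort-based scheme of [15].

```python
import numpy as np

def project_l1_ball(A, radius=1.0):
    """Euclidean projection of A onto {V : sum_ij |v_ij| <= radius} (method of [15])."""
    a = A.ravel()
    if np.abs(a).sum() <= radius:
        return A.copy()
    u = np.sort(np.abs(a))[::-1]           # sorted magnitudes, descending
    cssv = np.cumsum(u) - radius
    idx = np.arange(1, a.size + 1)
    r = idx[u - cssv / idx > 0][-1]        # largest index satisfying the threshold condition
    theta = cssv[r - 1] / r
    return (np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)).reshape(A.shape)

def f_rho_and_grad(M, rho):
    """Smoothed objective f_rho(M) of (17) and its gradient M(V* + V*^T)."""
    G = M.T @ M - np.eye(M.shape[1])
    V = project_l1_ball(G / rho)           # maximizer V* of (17)
    val = np.sum(G * V) - 0.5 * rho * np.sum(V ** 2)
    grad = M @ (V + V.T)
    return val, grad
```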


With the above property, problem (18) can be solved by the proximal gradient method, which updates $M$ in the $(k+1)$-th iteration by
$$\begin{aligned} M_{k+1} &= \operatorname*{arg\,min}_M\, \langle \nabla f_\rho(M_k), M - M_k\rangle + \frac{1}{2\alpha}\|M - M_k\|_F^2 \\ &= \operatorname*{arg\,min}_M\, \frac{1}{2}\|M - (M_k - \alpha \nabla f_\rho(M_k))\|_F^2 \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \end{aligned} \qquad (19)$$
where $\alpha > 0$ is the step size. To guarantee convergence, it is required that $\alpha < \rho$. In this work, we simply set $\alpha = 0.99\rho$. The above problem has a closed form solution obtained by normalizing each column of $M_k - \alpha\nabla f_\rho(M_k)$, i.e.,
$$(M_{k+1})_i = \frac{(M_k - \alpha\nabla f_\rho(M_k))_i}{\|(M_k - \alpha\nabla f_\rho(M_k))_i\|_2}. \qquad (20)$$
To compute $\nabla f_\rho(M_k) = M_k(V_k + V_k^T)$, where $V_k$ is optimal to (17) when $M = M_k$, one has to solve (17), which is equivalent to the following problem
$$V_k = \operatorname*{arg\,min}_V\, \frac{1}{2}\left\|V - (M_k^T M_k - I)/\rho\right\|_F^2 \quad \text{s.t. } \|V\|_1 \le 1. \qquad (21)$$
Solving the above problem requires computing a proximal projection onto the $\ell_1$ ball, which can be done efficiently by the method in [15]. Iteratively updating $V$ by (21) and $M$ by (19) leads to the Proximal Gradient (PG) algorithm for solving problem (18). We summarize the whole procedure of PG for (18) in Algorithm 1.

Algorithm 1 Solve (18) by Proximal Gradient
Initialize: k = 0, M_k ∈ R^{m×n}, ρ > 0, α = 0.99ρ, K > 0.
Output: M* = PG(M_k, ρ).
while k < K do
  1) Compute V_k by solving (21);
  2) Compute M_{k+1} by solving (19);
  3) k = k + 1.
end while

For nonconvex problems such as (18), PG is guaranteed to converge (in this paper, we say an algorithm converges if any accumulation point of the generated sequence is a stationary point).

Theorem 2: [16] Let $\{M_k\}$ be the sequence generated by PG in Algorithm 1. Then $\{M_k\}$ is bounded and any accumulation point $M^*$ of $\{M_k\}$ is a stationary point.
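The following self-contained NumPy sketch of Algorithm 1 (with the continuation of Algorithm 2 below) is our own illustration; the helper names and the small iteration counts are ours, and the parameter values are only indicative. Each inner iteration solves (21) by projecting onto the $\ell_1$ ball, takes a gradient step, and renormalizes the columns as in (20).

```python
import numpy as np

def project_l1_ball(A, radius=1.0):
    # Projection onto the l1 ball by the sort-based method of [15]
    a = A.ravel()
    if np.abs(a).sum() <= radius:
        return A.copy()
    u = np.sort(np.abs(a))[::-1]
    cssv = np.cumsum(u) - radius
    idx = np.arange(1, a.size + 1)
    r = idx[u - cssv / idx > 0][-1]
    theta = cssv[r - 1] / r
    return (np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)).reshape(A.shape)

def pg_dmcm(M0, rho, K=15):
    """Proximal gradient (Algorithm 1) for the smoothed DMCM problem (18)."""
    M = M0 / np.linalg.norm(M0, axis=0, keepdims=True)
    alpha = 0.99 * rho                                 # step size alpha < rho, as in the text
    I = np.eye(M.shape[1])
    for _ in range(K):
        V = project_l1_ball((M.T @ M - I) / rho)       # step 1: solve (21)
        grad = M @ (V + V.T)                           # gradient of f_rho at M
        M = M - alpha * grad                           # gradient step of (19)
        M /= np.linalg.norm(M, axis=0, keepdims=True)  # column renormalization (20)
    return M

# Continuation (Algorithm 2): start with a larger rho and shrink it gradually.
rng = np.random.default_rng(0)
M, rho, eta = rng.standard_normal((8, 60)), 0.5, 1.2
for _ in range(50):            # far fewer outer rounds than the paper's experiments, for brevity
    M = pg_dmcm(M, rho)
    rho /= eta
```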

Though PG is guaranteed to converge, the obtained suboptimal solution to (18) may be far from optimal for problem (16), which is our original target. There are two important factors that affect the quality of the solution obtained by PG. First, due to the nonconvexity of (18), the solution may be sensitive to the initialization of $M$. Second, the smoothness parameter $\rho > 0$ should be small so that the objective $f_\rho$ in (18) approximates the objective $f$ in (16) well. However, if $\rho$ is directly set to a very small value, PG may decrease the objective value of (18) slowly, as can be seen from the update of $M$ in (19). To address these two issues, we use a continuation trick to find a better solution to (16) by solving (18) with different initializations. Namely, we begin with a relatively large value of $\rho$ and reduce it gradually. For each fixed $\rho$, we solve (18) by PG in Algorithm 1 and use the solution as a new initialization of $M$ in PG. To achieve a better solution, we repeat the procedure $T$ times or stop when $\rho$ reaches a predefined target $\rho_t$. We summarize the procedure of PG with continuation trick in Algorithm 2.

Algorithm 2 Solve (18) by PG with continuation trick
Initialize: ρ > 0, α = 0.99ρ, η > 1, M, t = 0, T > 0.
while t < T do
  1) M = PG(M, ρ) by calling Algorithm 1;
  2) ρ = ρ/η, α = 0.99ρ;
  3) t = t + 1.
end while

Finally, we would like to emphasize some advantages of our DMCM model (16) and the proposed solver. A main merit of our model (16) is that it minimizes the mutual coherence $\mu(M)$ directly, so the mutual coherence of its optimal solution should be the lowest. Though the optimal solution is generally not computable due to the nonconvexity of (16), our proposed solver, which first smooths the objective and then minimizes it by PG, has a convergence guarantee. To the best of our knowledge, this is the first work that directly minimizes the mutual coherence of a matrix with a convergence guarantee.

III. LOW MUTUAL COHERENCE BASED PROJECTION

In this section, we show how to find a projection matrix $P$ such that $\mu(PD)$ is as low as possible. This is crucial for signal recovery by compressed sensing via problem (5). Similar to the DMCM model in (16), the ideal way is to minimize $\mu(PD)$ directly, i.e.,

$$\min_{P\in\mathbb{R}^{m\times d}} \|(PD)^T(PD) - I\|_\infty \quad \text{s.t. } \|PD_i\|_2 = 1,\ i = 1,\cdots,n. \qquad (22)$$
However, the constraint of (22) is more complex than the one in (16), and thus (22) is much more challenging to solve. We instead consider an approximate model of (22) based on the following observation.

Theorem 3: For any $M_1, M_2 \in \mathbb{R}^{m\times n}$, if $M_1 \to M_2$, then $\mu(M_1) \to \mu(M_2)$.

This result follows easily from the definition of the mutual coherence of a matrix. The theorem indicates that the difference between the mutual coherences of two matrices is small when the difference between the two matrices is small. This motivates us to find $M$ such that $\mu(M)$ is low and the difference between $M$ and $PD$ is small. Thus we have the following approximate model of (22):
$$\min_{P\in\mathbb{R}^{m\times d},\, M\in\mathbb{R}^{m\times n}} \|M^T M - I\|_\infty + \frac{1}{2\beta}\|M - PD\|_F^2 \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \qquad (23)$$
where $\beta > 0$ trades off $\mu(M)$ and the difference between $M$ and $PD$. To distinguish it from the DMCM model in (16), we name the above model DMCM based Projection (DMCM-P).
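For clarity, a minimal NumPy sketch (ours, not part of the paper; the function name and the unit-column assumption are our choices) of evaluating the DMCM-P objective in (23):

```python
import numpy as np

def dmcmp_objective(M, P, D, beta):
    """Objective of (23): ||M^T M - I||_inf + (1/(2*beta)) * ||M - P D||_F^2,
    assuming the columns of M have unit Euclidean norm."""
    G = M.T @ M - np.eye(M.shape[1])
    return np.abs(G).max() + np.linalg.norm(M - P @ D, 'fro') ** 2 / (2.0 * beta)
```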


Algorithm 3 Solve (24) by Alternating Minimization
Initialize: k = 0, P_k ∈ R^{m×d}, M_k ∈ R^{m×n}, ρ > 0, α = 0.99ρ, β > 0.
Output: {P*, M*} = AM(M_k, P_k, ρ, β).
while k < K do
  1) Compute P_{k+1} by solving (26);
  2) Compute V_k by solving (21);
  3) Compute M_{k+1} by solving (25);
  4) k = k + 1.
end while

Now we show how to solve (23). First, we smooth $\|M^T M - I\|_\infty$ as $f_\rho(M)$ defined in (17). Thus problem (23) can be approximated by the following problem with smooth objective
$$\min_{P,M} F(M,P) = f_\rho(M) + \frac{1}{2\beta}\|M - PD\|_F^2 \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n. \qquad (24)$$
We propose to alternately update $P$ and $M$ to solve problem (24).

1. Fix $P = P_k$ and update $M$ by
$$\begin{aligned} M_{k+1} &= \operatorname*{arg\,min}_M\, \langle \nabla f_\rho(M_k), M - M_k\rangle + \frac{1}{2\alpha}\|M - M_k\|_F^2 + \frac{1}{2\beta}\|M - P_k D\|_F^2 \\ &= \operatorname*{arg\,min}_M\, \frac{1}{2}\left\| M - \frac{\frac{1}{\alpha}M_k + \frac{1}{\beta}P_k D - \nabla f_\rho(M_k)}{\frac{1}{\alpha} + \frac{1}{\beta}} \right\|_F^2 \quad \text{s.t. } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \end{aligned} \qquad (25)$$
where $\alpha > 0$ is the step size satisfying $\alpha < \rho$. Similar to (19), the above problem has a closed form solution. To compute $\nabla f_\rho(M_k)$ in (25), we also need to compute $V_k$ by solving (21).

2. Fix $M = M_{k+1}$ and update $P$ by solving

$$P_{k+1} = \operatorname*{arg\,min}_P\, \|M_{k+1} - PD\|_F^2, \qquad (26)$$
which has a closed form solution $P_{k+1} = M_{k+1}D^\dagger$. Iteratively updating $P$ by (26) and $M$ by (25) leads to the Alternating Minimization (AM) method for (24). We summarize the whole procedure of AM in Algorithm 3. In theory, we prove that any accumulation point of AM is a stationary point of (24).

Theorem 4: Assume that $D$ in problem (24) has full row rank. Let $\{(M_k, P_k)\}$ be the sequence generated by Algorithm 3. Then the following hold:
(i) $F(M_k, P_k)$ is monotonically decreasing.
(ii) $M_{k+1} - M_k \to 0$, $P_{k+1} - P_k \to 0$.
(iii) The sequence $\{(M_k, P_k)\}$ is bounded.
(iv) Any accumulation point of $\{(M_k, P_k)\}$ is a stationary point.

The proof of Theorem 4 can be found in the Appendix. Note that to guarantee the convergence of Algorithm 3, Theorem 4 requires $D$ in problem (24) to have full row rank. Such an assumption usually holds in compressed sensing since $D \in \mathbb{R}^{d\times n}$ is a redundant dictionary with $d < n$.

Though AM is guaranteed to converge, the obtained solution to (24) may be far from optimal for problem (23), which is our original target. In order to find a better solution to (23) by solving (24), $\rho > 0$ should be small. On the other hand, $\beta > 0$ should also be small so that the difference between $M$ and $PD$ is small, and thus $\mu(PD)$ is expected to be as small as $\mu(M)$. Similar to Algorithm 2, we use a continuation trick to achieve a better solution to (23). We begin with relatively large values of $\rho > 0$ and $\beta > 0$ and reduce them gradually. For each fixed pair $(\rho, \beta)$, we solve (24) by AM in Algorithm 3 and use the solution as a new initialization of $P$ and $M$ in AM. We repeat the procedure $T$ times or stop when $\rho$ and $\beta$ reach predefined targets $\rho_t$ and $\beta_t$. This leads to a better solution. We summarize the procedure of AM with continuation trick in Algorithm 4.

Algorithm 4 Solve (24) by AM with continuation trick
Initialize: ρ > 0, α = 0.99ρ, β > 0, η > 1, M, P, t = 0, T > 0.
while t < T do
  1) (P, M) = AM(P, M, ρ, β) by calling Algorithm 3;
  2) ρ = ρ/η, α = 0.99ρ;
  3) β = β/η;
  4) t = t + 1.
end while

Note that when $\beta$ is small enough, $\|M - PD\|_F$ can be arbitrarily small. Thus, for the obtained projection $P$, $\mu(PD)$ is expected to be as small as $\mu(M)$. Finally, we would like to emphasize some advantages of our DMCM-P over previous methods. The main merit of DMCM-P is that it is the first model which minimizes $\mu(PD)$ directly, and the proposed solver has a convergence guarantee. The algorithms of Elad [9] and Xu et al. [10] are also mutual coherence based methods, but their objectives are suboptimal and their solvers lack convergence guarantees.
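To make Algorithms 3 and 4 concrete, here is a hedged NumPy sketch (our own illustration; names, parameter values, and the small iteration counts are ours). The M-update implements the weighted averaging in (25) followed by column normalization, and the P-update uses the closed form P = M D† from (26).

```python
import numpy as np

def project_l1_ball(A, radius=1.0):
    a = A.ravel()
    if np.abs(a).sum() <= radius:
        return A.copy()
    u = np.sort(np.abs(a))[::-1]
    cssv = np.cumsum(u) - radius
    idx = np.arange(1, a.size + 1)
    r = idx[u - cssv / idx > 0][-1]
    theta = cssv[r - 1] / r
    return (np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)).reshape(A.shape)

def am_dmcmp(P, M, D, rho, beta, K=15):
    """One call of AM (Algorithm 3) for the smoothed DMCM-P problem (24)."""
    alpha = 0.99 * rho
    I = np.eye(M.shape[1])
    D_pinv = np.linalg.pinv(D)
    for _ in range(K):
        P = M @ D_pinv                                     # P-update, closed form of (26)
        V = project_l1_ball((M.T @ M - I) / rho)           # solve (21) to get the gradient
        grad = M @ (V + V.T)
        M = (M / alpha + (P @ D) / beta - grad) / (1.0 / alpha + 1.0 / beta)  # (25)
        M /= np.linalg.norm(M, axis=0, keepdims=True)      # column normalization
    return P, M

# Algorithm 4: continuation over rho and beta.
rng = np.random.default_rng(0)
d, n, m = 30, 60, 10
D = rng.standard_normal((d, n))
P = rng.standard_normal((m, d))
M = P @ D
M /= np.linalg.norm(M, axis=0, keepdims=True)
rho, beta, eta = 0.5, 2.0, 1.2
for _ in range(50):
    P, M = am_dmcmp(P, M, D, rho, beta)
    rho, beta = rho / eta, beta / eta
```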

IV. NUMERICAL RESULTS

In this section, we conduct several experiments to demonstrate the effectiveness of our proposed methods. The experiments contain two parts. The first part shows the effectiveness of DMCM and DMCM-P in terms of the mutual coherence, compared with previous methods. The second part shows the effectiveness of the projection learned by DMCM-P in terms of the signal recovery error in compressed sensing.

A. Comparison on the Mutual Coherence

This subsection contains two experiments, showing the effectiveness of DMCM and DMCM-P, respectively. In the first experiment, we show that our DMCM is able to construct a matrix $M \in \mathbb{R}^{m\times n}$ with low mutual coherence.

[Fig. 1: Plots of the averaged mutual coherence of M v.s. number of measurements m; panels (a) n = 60, (b) n = 120, (c) n = 180; curves: DMCM, Random, Elad, Xu, Duarte, and the Welch bound.]

[Fig. 2: Plots of the averaged mutual coherence of PD v.s. number of measurements m; panels (a) n = 60, (b) n = 120, (c) n = 180; curves: DMCM-P, Random, Elad, Xu, Duarte, and the Welch bound.]

We compare DMCM with:
• Random: a random matrix with entries drawn from the standard normal distribution.
• Elad: the algorithm of Elad [9] with D = I.
• Xu: the algorithm of Xu et al. [10] with D = I.
• Duarte: the algorithm of Duarte-Carvajalino and Sapiro [12] with D = I.
• Welch bound: the Welch bound [13] shown in (13).

Note that the compared algorithms of Elad [9], Xu et al. [10] and Duarte-Carvajalino and Sapiro [12] were designed to find a projection $P$ such that $M = PD$ has low mutual coherence. They can still be compared with our DMCM by setting $D = I$, the identity matrix. To solve our DMCM model in (18), we run Algorithm 1 for 15 iterations and Algorithm 2 for 1000 iterations. In Algorithm 2, we set $\rho_0 = 0.5$ and $\eta = 1.2$. $M$ is initialized as a random Gaussian matrix. The compared methods are tested on three schemes with varying size of $M \in \mathbb{R}^{m\times n}$: (1) $m = [6:2:16]$, $n = 60$; (2) $m = [10:5:35]$, $n = 120$; (3) $m = [10:10:50]$, $n = 180$. Note that the constructed matrices may differ across the compared methods under different initializations. So for each size pair $(m, n)$, we repeat the experiment 10 times and report the average of the mutual coherences of the constructed matrices $M$. The averaged mutual coherences v.s. the number of measurements $m$, together with the Welch bound, are reported in Figure 1. It can be seen that the matrix constructed by our

DMCM achieves much lower mutual coherence than previous methods. The main reason is that our DMCM minimizes the mutual coherence of $M$ directly, while the objectives of all the previous methods are indirect.

In the second experiment of this subsection, we show that our DMCM-P is able to compute a projection $P \in \mathbb{R}^{m\times d}$ such that $PD \in \mathbb{R}^{m\times n}$ has low mutual coherence for a given $D \in \mathbb{R}^{d\times n}$. We choose $D$ to be a random Gaussian matrix in this experiment. To solve our DMCM-P model in (23), we run Algorithm 3 for 15 iterations and Algorithm 4 for 1000 iterations. In Algorithm 4, we set $\rho_0 = 0.5$, $\beta = 2$ and $\eta = 1.2$. $P$ is initialized as a random Gaussian matrix. We compare our DMCM-P with the algorithms of Elad [9], Xu et al. [10] and Duarte-Carvajalino and Sapiro [12], measured by the mutual coherence of $PD$. We test on three schemes: (1) $m = [6:2:16]$, $n = 60$, $d = 30$; (2) $m = [10:5:35]$, $n = 120$, $d = 60$; (3) $m = [10:10:50]$, $n = 180$, $d = 90$. Figure 2 shows the mutual coherence of $PD$ as a function of the number of measurements $m$. It can be seen that our DMCM-P achieves a better projection in the sense that $PD$ has lower mutual coherence for any given $D$. These results are similar to those of DMCM in Figure 1. It is worth mentioning that although our DMCM-P in (24) does not minimize the mutual coherence of $PD$ directly, we still achieve a mutual coherence of $PD$ as low as that of $M$, since the difference between $PD$ and $M$ tends to be very small in (24).
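As a simple reference computation for the comparisons in Figures 1 and 2 (our own sketch, not the exact experimental code of the paper), the following contrasts the coherence of a plain random projection against the Welch bound in (13); an optimized projection such as DMCM-P should push the first value toward the second.

```python
import numpy as np

def mutual_coherence(M):
    Mn = M / np.linalg.norm(M, axis=0, keepdims=True)
    G = np.abs(Mn.T @ Mn)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
m, d, n = 20, 60, 120
D = rng.standard_normal((d, n))
P = rng.standard_normal((m, d))            # random projection baseline
welch = np.sqrt((n - m) / (m * (n - 1)))   # Welch bound (13)
print(mutual_coherence(P @ D), welch)
```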

[Fig. 3: Distribution of the absolute values of (PD)^T(PD); panels: DMCM-P, Random, Elad, Xu, Duarte.]

[Fig. 4: Signal reconstruction errors v.s. number of measurements; curves: DMCM-P, Random, Elad, Xu, Duarte.]

Figure 3 shows the distribution of the absolute values of the inner products between distinct columns of $PD$ with $m = 20$, $n = 120$, $d = 60$. It can be seen that our DMCM-P has the shortest tail, showing that the number of Gram entries close to the ideal Welch bound is larger than for the compared methods. Such a result is consistent with the low mutual coherence values depicted in Figure 2.

B. Comparison on the CS Performance

In this subsection, we consider the application of the projection optimized by our DMCM-P in compressed sensing. We first generate a $T$-sparse vector $\alpha \in \mathbb{R}^n$, which constitutes a sparse representation of the signal $x = D\alpha$, $x \in \mathbb{R}^d$. The locations of the nonzeros are chosen randomly and their values are drawn from a uniform distribution on $[-1, 1]$. We choose the dictionary $D \in \mathbb{R}^{d\times n}$ to be a random Gaussian matrix. Then we apply different projection matrices $P$, learned by our DMCM-P, the random projection, and the algorithms of Elad [9], Xu et al. [10] and Duarte-Carvajalino and Sapiro [12], to generate the compressed $y$ by $y = PD\alpha$. Finally, we solve problem (5) by OMP to obtain $\hat{\alpha}$. We compare the performance of the different projection methods based on the signal reconstruction error $\|x - D\hat{\alpha}\|_2$; a smaller reconstruction error means better CS performance. We conduct two experiments in this subsection: the first varies the number of measurements $m$ and the second varies the sparsity level $T$.
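The setup just described can be sketched as follows (our own illustration in NumPy with a minimal OMP routine; it mirrors the protocol of the text but not the exact parameters, repetitions, or the projections learned by each method).

```python
import numpy as np

def omp(A, y, T):
    """Minimal Orthogonal Matching Pursuit: recover a T-sparse alpha with y ~= A alpha."""
    n = A.shape[1]
    support, residual = [], y.copy()
    alpha = np.zeros(n)
    for _ in range(T):
        j = int(np.argmax(np.abs(A.T @ residual)))         # atom most correlated with residual
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    alpha[support] = coef
    return alpha

rng = np.random.default_rng(0)
d, n, m, T = 30, 60, 12, 2
D = rng.standard_normal((d, n))
alpha = np.zeros(n)
idx = rng.choice(n, T, replace=False)
alpha[idx] = rng.uniform(-1, 1, T)                         # T-sparse representation
x = D @ alpha
P = rng.standard_normal((m, d))                            # replace with a learned projection
y = P @ x                                                  # compressed measurements y = P x = P D alpha
alpha_hat = omp(P @ D, y, T)
print(np.linalg.norm(x - D @ alpha_hat))                   # reconstruction error ||x - D alpha_hat||_2
```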

[Fig. 5: Signal reconstruction error v.s. sparsity; curves: DMCM-P, Random, Elad, Xu, Duarte.]

For every value of the aforementioned parameters, we perform 3000 experiments and calculate the reconstruction error; if the mean squared error of a reconstruction exceeds a threshold of order $O(10^{-4})$, the reconstruction is considered a failure. In the first experiment, we set $m = [6:2:16]$, $n = 60$, $d = 30$ and $T = 2$. Figure 4 shows the averaged reconstruction error v.s. the number of measurements $m$ ($T$ is fixed). Generally, the CS performance of OMP improves as $m$ increases in this experiment. Also, as expected, all the optimized projection methods lead to better CS performance compared with the random projection. Our proposed DMCM-P outperforms the algorithms of Elad, Xu et al. and Duarte-Carvajalino and Sapiro. In the second experiment, we vary the sparsity level $T$ from 1 to 6 and set $m = 18$, $n = 180$ and $d = 90$. Figure 5 shows the signal reconstruction error as a function of the sparsity level $T$ ($m$ is fixed). Generally, the CS performance improves


for OMP as $T$ decreases. Also, our DMCM-P still outperforms the random projection and the other deterministic projection methods. This is due to the low mutual coherence of $PD$ achieved by our optimized projection, as also verified in the previous experiments.

V. CONCLUSIONS

This paper focuses on optimizing the projection matrix for measurements in compressed sensing of signals which are sparse in an overcomplete dictionary. We develop the first model which finds a projection $P$ by minimizing the mutual coherence of $PD$ directly. We propose to solve the nonconvex problem by alternating minimization and prove its convergence. Simulation results show that our method does achieve much lower mutual coherence of $PD$ and thus leads to better CS performance. Considering that the mutual coherence is important in sparse coding, which has many applications, we expect that the proposed construction will be useful in other applications besides CS.

APPENDIX

In this section, we give the convergence proof of Theorem 4.

Definition 2: [17], [16] Let $g$ be a proper and lower semicontinuous function.
1) For a given $x \in \operatorname{dom} g$, the Fréchet subdifferential of $g$ at $x$, written $\hat{\partial}g(x)$, is the set of all vectors $u \in \mathbb{R}^n$ which satisfy
$$\liminf_{y\ne x,\, y\to x} \frac{g(y) - g(x) - \langle u, y - x\rangle}{\|y - x\|} \ge 0.$$
2) The limiting-subdifferential, or simply the subdifferential, of $g$ at $x \in \mathbb{R}^n$, written $\partial g(x)$, is defined through the following closure process:
$$\partial g(x) := \{u \in \mathbb{R}^n : \exists\, x_k \to x,\ g(x_k) \to g(x),\ u_k \in \hat{\partial}g(x_k) \to u,\ k \to \infty\}.$$

Proposition 1: [17], [16] The following results hold:
1) In the nonsmooth context, Fermat's rule remains unchanged: if $x \in \mathbb{R}^n$ is a local minimizer of $g$, then $0 \in \partial g(x)$.
2) Let $(x_k, u_k)$ be a sequence such that $x_k \to x$, $u_k \to u$, $g(x_k) \to g(x)$ and $u_k \in \partial g(x_k)$. Then $u \in \partial g(x)$.
3) If $f$ is a continuously differentiable function, then $\partial(f + g)(x) = \nabla f(x) + \partial g(x)$.

Proof of Theorem 4: First, we define
$$h(M) = \begin{cases} 0, & \text{if } \|M_i\|_2 = 1,\ i = 1,\cdots,n, \\ \infty, & \text{otherwise.} \end{cases}$$
Then (25) can be rewritten as
$$M_{k+1} = \operatorname*{arg\,min}_M\, \langle \nabla f_\rho(M_k), M - M_k\rangle + \frac{1}{2\alpha}\|M - M_k\|_F^2 + \frac{1}{2\beta}\|M - P_k D\|_F^2 + h(M). \qquad (27)$$
By the optimality of $M_{k+1}$, we have
$$h(M_{k+1}) + \langle \nabla f_\rho(M_k), M_{k+1} - M_k\rangle + \frac{1}{2\alpha}\|M_{k+1} - M_k\|_F^2 + \frac{1}{2\beta}\|M_{k+1} - P_k D\|_F^2 \le h(M_k) + \frac{1}{2\beta}\|M_k - P_k D\|_F^2, \qquad (28)$$
and
$$0 \in \partial h(M_{k+1}) + \nabla f_\rho(M_k) + \frac{1}{\alpha}(M_{k+1} - M_k) + \frac{1}{\beta}(M_{k+1} - P_k D).$$
From the Lipschitz continuity of $\nabla f_\rho(M)$, we have
$$F(M_{k+1}, P_k) = f_\rho(M_{k+1}) + \frac{1}{2\beta}\|M_{k+1} - P_k D\|_F^2 \le f_\rho(M_k) + \langle \nabla f_\rho(M_k), M_{k+1} - M_k\rangle + \frac{1}{2\rho}\|M_{k+1} - M_k\|_F^2 + \frac{1}{2\beta}\|M_{k+1} - P_k D\|_F^2. \qquad (29)$$
Adding (28) and (29), we have
$$h(M_{k+1}) + F(M_{k+1}, P_k) \le h(M_k) + f_\rho(M_k) - \left(\frac{1}{2\alpha} - \frac{1}{2\rho}\right)\|M_{k+1} - M_k\|_F^2 + \frac{1}{2\beta}\|M_k - P_k D\|_F^2 = h(M_k) + F(M_k, P_k) - \left(\frac{1}{2\alpha} - \frac{1}{2\rho}\right)\|M_{k+1} - M_k\|_F^2. \qquad (30)$$
Second, from the optimality of $P_{k+1}$ for problem (26), we have
$$F(M_{k+1}, P_{k+1}) \le F(M_{k+1}, P_k), \qquad (31)$$
and
$$0 = \nabla_P F(M_{k+1}, P_{k+1}) = (M_{k+1} - P_{k+1}D)D^T. \qquad (32)$$
By the assumption that $D$ has full row rank, (32) implies that
$$P_{k+1} = M_{k+1}D^T(DD^T)^{-1}. \qquad (33)$$
Combining (30) and (31) leads to
$$h(M_{k+1}) + F(M_{k+1}, P_{k+1}) \le h(M_k) + F(M_k, P_k) - \left(\frac{1}{2\alpha} - \frac{1}{2\rho}\right)\|M_{k+1} - M_k\|_F^2. \qquad (34)$$
So $h(M_k) + F(M_k, P_k)$ and $F(M_k, P_k)$ are monotonically decreasing. Summing the above inequalities over all $k \ge 0$, it follows that
$$h(M_0) + F(M_0, P_0) \ge \sum_{k=0}^{+\infty} \left(\frac{1}{2\alpha} - \frac{1}{2\rho}\right)\|M_{k+1} - M_k\|_F^2 \ge 0. \qquad (35)$$
This implies that $M_{k+1} - M_k \to 0$. Hence $P_{k+1} - P_k \to 0$ by (33).

Third, note that $F(M, P)$ is coercive, i.e., $F(M, P)$ is bounded from below and $F(M, P) \to +\infty$ when $\|[M, P]\|_F \to +\infty$. It can be seen from (34) that $F(M_k, P_k)$ is bounded. Thus $\{(M_k, P_k)\}$ is bounded. Then there exists an accumulation point $(M^*, P^*)$ and a subsequence $\{(M_{k_j}, P_{k_j})\}$ such that $(M_{k_j}, P_{k_j}) \to (M^*, P^*)$ as $j \to +\infty$. Now we prove that $(M^*, P^*)$ is a stationary point. From the optimality condition for $M_{k+1}$ in (27), we have
$$\begin{aligned} &\nabla f_\rho(M_{k+1}) - \nabla f_\rho(M_k) - \frac{1}{\alpha}(M_{k+1} - M_k) - \frac{1}{\beta}(P_{k+1} - P_k)D \\ &= -\frac{1}{\alpha}(M_{k+1} - M_k) - \nabla f_\rho(M_k) - \frac{1}{\beta}(M_{k+1} - P_k D) + \nabla f_\rho(M_{k+1}) + \frac{1}{\beta}(M_{k+1} - P_{k+1}D) \\ &\in \partial h(M_{k+1}) + \nabla f_\rho(M_{k+1}) + \frac{1}{\beta}(M_{k+1} - P_{k+1}D) \\ &= \partial h(M_{k+1}) + \nabla_M F(M_{k+1}, P_{k+1}). \end{aligned} \qquad (36)$$
Thus
$$\begin{aligned} &\left\|\nabla f_\rho(M_{k+1}) - \nabla f_\rho(M_k) - \frac{1}{\alpha}(M_{k+1} - M_k) - \frac{1}{\beta}(P_{k+1} - P_k)D\right\|_F \\ &\le \|\nabla f_\rho(M_{k+1}) - \nabla f_\rho(M_k)\|_F + \frac{1}{\alpha}\|M_{k+1} - M_k\|_F + \frac{1}{\beta}\|(P_{k+1} - P_k)D\|_F \\ &\le \frac{1}{\rho}\|M_{k+1} - M_k\|_F + \frac{1}{\alpha}\|M_{k+1} - M_k\|_F + \frac{1}{\beta}\|P_{k+1} - P_k\|_F\|D\|_F \to 0. \end{aligned} \qquad (37)$$
Since $F(M, P)$ is continuously differentiable, we have $F(M_{k_j}, P_{k_j}) \to F(M^*, P^*)$. As $h(M_k) = 0$ for all $k$ and the set $\{M : \|M_i\|_2 = 1, i = 1,\cdots,n\}$ is closed, we have $h(M^*) = 0$ and $F(M_{k_j}, P_{k_j}) + h(M_{k_j}) \to F(M^*, P^*) + h(M^*)$. Combining (32), (36), (37) and using Proposition 1, we have
$$0 \in \partial h(M^*) + \nabla F(M^*, P^*). \qquad (38)$$
Thus $(M^*, P^*)$ is a stationary point.

REFERENCES

[1] Emmanuel J. Candès, Justin Romberg, and Terence Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
[2] David L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[3] Balas Kausik Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, no. 2, pp. 227-234, 1995.
[4] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33-61, 1998.
[5] Yagyensh Chandra Pati, Ramin Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers. IEEE, 1993, pp. 40-44.
[6] Rémi Gribonval and Morten Nielsen, "Sparse representations in unions of bases," IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320-3325, 2003.
[7] David L. Donoho and Michael Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197-2202, 2003.
[8] Joel A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231-2242, 2004.
[9] Michael Elad, "Optimized projections for compressed sensing," IEEE Transactions on Signal Processing, vol. 55, no. 12, pp. 5695-5702, 2007.
[10] Jianping Xu, Yiming Pi, and Zongjie Cao, "Optimized projection matrix for compressive sensing," EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 43, 2010.
[11] Emmanuel J. Candès and Terence Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?," IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406-5425, 2006.
[12] Julio Martin Duarte-Carvajalino and Guillermo Sapiro, "Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1395-1408, 2009.
[13] Lloyd Welch, "Lower bounds on the maximum cross correlation of signals (corresp.)," IEEE Transactions on Information Theory, pp. 397-399, 1974.
[14] Yu Nesterov, "Smooth minimization of non-smooth functions," Mathematical Programming, vol. 103, no. 1, pp. 127-152, 2005.
[15] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra, "Efficient projections onto the l1-ball for learning in high dimensions," in International Conference on Machine Learning, 2008.
[16] Jérôme Bolte, Shoham Sabach, and Marc Teboulle, "Proximal alternating linearized minimization for nonconvex and nonsmooth problems," Mathematical Programming, vol. 146, no. 1-2, pp. 459-494, 2014.
[17] R. Tyrrell Rockafellar and Roger J.-B. Wets, Variational Analysis, Springer, 1998.

Zhouchen Lin received the Ph.D. degree in Applied Mathematics from Peking University, in 2000. He is currently a Professor at Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University. He is also a Chair Professor at Northeast Normal University and a Guest Professor at Beijing Jiaotong University. Before March 2012, he was a Lead Researcher at Visual Computing Group, Microsoft Research Asia. He was a Guest Professor at Shanghai Jiaotong University and Southeast University, and a Guest Researcher at Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer vision, image processing, computer graphics, machine learning, pattern recognition, and numerical computation and optimization. He is an Associate Editor of IEEE Trans. Pattern Analysis and Machine Intelligence and International J. Computer Vision, an area chair of CVPR 2014, ICCV 2015, NIPS 2015 and AAAI 2016, and a Senior Member of the IEEE.




Canyi Lu received the bachelor degree in mathematics from the Fuzhou University in 2009, and the master degree in the pattern recognition and intelligent system from the University of Science and Technology of China in 2012. He is currently a Ph.D. student with the Department of Electrical and Computer Engineering at the National University of Singapore. His current research interests include computer vision, machine learning, pattern recognition and optimization. He was the winner of the Microsoft Research Asia Fellowship 2014.


Huan Li received the bachelor degree in medical information from Central South University in 2011, and the master degree in software engineering from Peking University in 2014. He is currently a Ph.D. student at Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University. His research interests include optimization and machine learning.