Revisiting Compressed Sensing: Exploiting the Efficiency of Simplex and Sparsification Methods

Robert Vanderbei · Kevin Lin · Han Liu · Lie Wang
Abstract We present two new approaches to solve large-scale compressed sensing problems. The first approach uses the parametric simplex method to recover very sparse signals by taking a small number of simplex pivots, while the second approach reformulates the problem using Kronecker products to achieve faster computation via a sparser problem formulation. Although neither approach is a new optimization method, we feel their computational advantages have been neglected in the compressed sensing literature. For the first approach, if the true signal is very sparse and we initialize our solution to be the zero vector, then a customized parametric simplex method usually takes a small number of pivots to arrive at a solution. Our numerical studies show that this approach is more than 10 times faster than state-of-the-art methods for very sparse signals. The second approach requires the sensing matrix to be the Kronecker product of two smaller matrices. We show that Kronecker compressed sensing (KCS) requires a stronger condition for perfect recovery than the original problem. However, KCS can be formulated as a linear program with a very sparse constraint matrix, whereas the original problem involves a completely dense constraint matrix. Hence, algorithms that benefit from sparse problem representations, such as interior point methods (IPMs) and most modern algorithms, can solve the Kronecker sensing problem much faster than the corresponding dense problem. We numerically demonstrate that KCS makes IPMs up to 10 times faster, rendering these general-purpose IPMs competitive with state-of-the-art methods.

Keywords Linear programming · compressed sensing · parametric simplex method · sparse signals · interior-point methods

Mathematics Subject Classification (2000) MSC 65K05 · 62P99

The first author's research is supported by ONR Award N00014-13-1-0093, the third author's by NSF Grant III-1116730, and the fourth author's by NSF Grant DMS-1005539.

Robert J. Vanderbei
Department of Ops. Res. and Fin. Eng., Princeton University, Princeton, NJ 08544.
Tel.: +609-258-2345
E-mail: [email protected]
1 Introduction and Contribution Overview

Compressed sensing (CS) aims to recover a sparse signal from a small number of measurements. The theoretical foundation of compressed sensing was first laid out by Donoho (2006) and Candès et al. (2006) and can be traced further back to the sparse recovery work of Donoho and Stark (1989); Donoho and Huo (2001); Donoho and Elad (2003). More recent progress in the area of compressed sensing is summarized in Kutyniok (2012) and Elad (2010).

Let $x^* := (x^*_1, \ldots, x^*_n)^T \in \mathbb{R}^n$ denote a signal to be recovered. We assume $n$ is large and $x^*$ is sparse. Let $A$ be a given (or chosen) $m \times n$ matrix with $m < n$, and let $a_{ij}$ denote the $(i,j)$th element of $A$. The compressed sensing problem is to recover $x^*$ assuming only that we know $y := Ax^* \in \mathbb{R}^m$ and that $x^*$ is sparse. Since $x^*$ is a sparse vector, one can hope, and hence assume, that it is the sparsest solution to the underdetermined linear system and therefore can be recovered from $y$ by solving

$$\min_x \|x\|_0 \quad \text{subject to} \quad Ax = y, \qquad (P_0)$$

where $\|x\|_0 := \sum_{i=1}^n \mathbb{1}(x_i \neq 0)$. This problem is NP-hard because of the nonconvexity of the $\ell_0$ pseudo-norm. To avoid the NP-hardness, Chen et al. (1998) proposed the basis pursuit approach, in which $\|x\|_1 := \sum_{i=1}^n |x_i|$ replaces $\|x\|_0$:

$$\min_x \|x\|_1 \quad \text{subject to} \quad Ax = y. \qquad (P_1)$$
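Problem $(P_1)$ is a linear program in disguise. For illustration only, the following small Python sketch (not part of our implementation; sizes are arbitrary and recovery at these sizes is typical but not guaranteed) recovers a sparse signal by solving $(P_1)$ with a generic LP solver, using the standard split $x = x^+ - x^-$ revisited in Section 2:

    # Basis pursuit (P1) as a linear program via the split x = x+ - x-; illustrative sketch.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n, k = 40, 100, 5                       # measurements, dimension, sparsity
    A = rng.standard_normal((m, n))            # Gaussian sensing matrix
    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    y = A @ x_true                             # noiseless measurements

    # min 1^T (x+ + x-)  subject to  A(x+ - x-) = y,  x+, x- >= 0
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    x_hat = res.x[:n] - res.x[n:]
    print("relative l1 error:", np.linalg.norm(x_hat - x_true, 1) / np.linalg.norm(x_true, 1))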
Donoho and Elad (2003) and Cohen et al. (2009) give conditions under which the solutions to Problems $(P_0)$ and $(P_1)$ are each unique.
One key question is to understand when the solutions to $(P_0)$ and $(P_1)$ are equal. Various sufficient conditions have been discovered. For example, letting $A_{*S}$ denote the submatrix of $A$ with columns indexed by a subset $S \subset \{1, \ldots, n\}$, we say that $A$ has the $k$-restricted isometry property ($k$-RIP) with constant $\delta_k$ if for any $S$ with cardinality $k$,

$$(1 - \delta_k)\|v\|_2^2 \;\le\; \|A_{*S} v\|_2^2 \;\le\; (1 + \delta_k)\|v\|_2^2 \quad \text{for any } v \in \mathbb{R}^k, \qquad (1)$$

where $\|v\|_2 := \sqrt{\sum_j v_j^2}$. We define $\delta_k(A)$ to be the smallest value of $\delta_k$ for which the matrix $A$ has the $k$-RIP property. Under the assumption that $k := \|x^*\|_0 \ll n$ and $A$ satisfies the $k$-RIP condition, Cai and Zhang (2012) prove that whenever $\delta_k(A) < 1/3$, the solutions to $(P_0)$ and $(P_1)$ are the same. Similar results have been obtained by Donoho and Tanner (2005a,b, 2009) using convex geometric functional analysis.

Existing algorithms for solving the convex program $(P_1)$ include interior-point methods (Candès et al., 2006; Kim et al., 2007), projected gradient methods (Figueiredo et al., 2008), first-order methods (Juditsky et al., 2014) and Bregman iterations (Yin et al., 2008). Besides algorithms that solve the convex program $(P_1)$, several greedy algorithms have been proposed, including matching pursuit (Mallat and Zhang, 1993) and its many variants (Tropp, 2004; Donoho et al., 2006; Needell and Vershynin, 2009; Needell and Tropp, 2010; Donoho et al., 2009; Foucart, 2011). To achieve more scalability, combinatorial algorithms such as HHS pursuit (Gilbert et al., 2007) and the sub-linear Fourier transform (Iwen, 2010) have also been developed.

In this paper, we revisit the optimization aspects of the classical compressed sensing formulation $(P_1)$ and one of its extensions, Kronecker compressed sensing (Duarte and Baraniuk, 2012). We consider two ideas for accelerating iterative algorithms: one reduces the total number of iterations, and the other reduces the computation required per iteration. We demonstrate the effectiveness of these ideas by numerical simulations.

Our first idea, an optimization algorithm, is motivated by the fact that if the desired solution is very sparse, it should be reached after a relatively small number of simplex pivots starting from the zero vector. If we use the parametric simplex method (see, e.g., Vanderbei (2007); Dantzig (1998)), the zero vector is a suitable initial basic solution. While the standard simplex method might not be computationally as competitive as modern optimization solvers, it is well known that slightly altering the pivot rules and formulation of the simplex method can result in a significant increase in computational speed (Forrest and Goldfarb, 1992; Pan, 2008). In our work, our customized parametric simplex method is one such alternative simplex method that we feel can strongly benefit the compressed sensing literature.
The second idea, a problem reformulation, adds a new structural assumption on $A$. Specifically, we assume that $A$ is the Kronecker product of two smaller matrices, $B$ and $C$. Since we are typically allowed to design the sensing matrix $A$ ourselves, this assumption does not add any practical limitations. The resulting Kronecker compressed sensing (KCS) problem has been considered before in Duarte and Baraniuk (2012), but in our paper we reformulate the linear programming problem so that the constraint matrix is very sparse and the problem can therefore be solved very efficiently. While most optimization research in compressed sensing focuses on creating faster algorithms, our approach uses existing algorithms with a sparser problem formulation to increase computational speed. We believe that this alternative way to speed up most optimization algorithms, especially linear programming algorithms (Vanderbei, 1991), has not yet been exploited in the compressed sensing literature.
Theoretically, KCS involves a tradeoff between computational complexity and statistical requirements: it gains computational advantages (shown empirically) at the price of requiring more measurements (i.e., larger $m$). More specifically, using sub-Gaussian random sensing matrices (to be defined later), whenever $m$ is of the order $m = O\big(k^2 \log^2(\sqrt{n}/k)\big)$, KCS recovers the true signal with probability at least $1 - 4\exp(-C\sqrt{m})$ for some constant $C$. This size requirement for $m$ is larger than in standard compressed sensing, which needs only $m = O\big(k \log(n/k)\big)$ measurements to recover the true signal with probability at least $1 - 2\exp(-Cm)$.
The rest of the paper is organized as follows. The problem formulations presented above involve no noise—the measurements are assumed to be exact. In the next section, we describe how to solve the standard compressed sensing version (P1 ) of the problem using the parametric simplex method. Then, in Section 3, we describe the statistical foundation behind Kronecker compressed sensing (KCS). In Section 4, we present the sparse formulation of KCS that dramatically speeds up existing optimization algorithms (such as interior point methods). In Section 5, we provide numerical comparisons against state-of-the-art methods to show that our simple modifications to existing approaches yield competitive results. In Section 6, we conclude by discussing how to extend our algorithm to handle noisy cases of compressed sensing.
2 Compressed Sensing via the Parametric Simplex Method

Consider the following parametric perturbation to $(P_1)$ for a specified $\mu$:

$$(\hat{x}, \hat{\varepsilon}) = \operatorname*{argmin}_{x,\,\varepsilon} \; \mu\|x\|_1 + \|\varepsilon\|_1 \quad \text{subject to} \quad Ax + \varepsilon = y, \qquad (P_2)$$

where $\varepsilon \in \mathbb{R}^m$ denotes the residual errors. Since $\mu$ is a tuning parameter that affects the solution $\hat{x}$, we should denote the solution to $(P_2)$ as $\hat{x}_\mu$, but for notational simplicity we write $\hat{x}$ as shown in $(P_2)$ instead. For large values of $\mu$, the optimal solution is $\hat{x} = 0$ and $\hat{\varepsilon} = y$. For values of $\mu$ close to zero, the situation reverses: $\hat{\varepsilon} = 0$ and we have solved the original problem $(P_1)$. Belloni and Chernozhukov (2011) considered the above formulation and provide statistical guarantees on the solution for a particular magnitude of $\mu$. In this section, we focus on any matrix $A$ that satisfies a suitable $k$-RIP property specified by the following lemma.

Lemma 1 (Cai and Zhang (2012)) For a specified $\mu$, assume that the optimal solution of $(P_2)$ yields $\hat{\varepsilon} = 0$. Let $k = \|x^*\|_0$, the sparsity of the vector $x^*$. If $\delta_k(A) < 1/3$, then $\hat{x} = x^*$.

For example, setting $\mu = 0$ will necessarily yield $\hat{\varepsilon} = 0$. Problem $(P_2)$ can be re-expressed as a parametric linear programming problem and solved using the parametric simplex method (see, e.g., Vanderbei (2007); Dantzig (1998)). This is a homotopy method commonly used for sensitivity and perturbation analysis, and the formulation $(P_2)$ itself is closely related to $\ell_1$-penalized quantile regression. In particular, we start at a large value of $\mu$ and successively reduce the value of $\mu$ for which the current basic solution is optimal until we arrive at a value of $\mu$ for which the optimal solution has $\hat{\varepsilon} = 0$, at which point we have solved the original problem $(P_1)$. If there are few iterations, then the final vector $\hat{x}$ will be mostly zero.

The advantage of the parametric simplex method over other variants is that it solves the entire regularization path in terms of $\mu$ on its way to the optimal solution of the original problem. Specifically, the parametric simplex method outputs a finite collection of intervals covering the real line for $\mu$ such that the optimal solution for each value of $\mu$ lying in any of the intervals is already computed. Aside from a finite collection of points, these intervals are mutually disjoint. From this collection of intervals for $\mu$, we could pick the desired $\mu$ as specified in Belloni and Chernozhukov (2011) to satisfy certain statistical properties. However, in this paper we focus on picking the smallest $\mu$ that achieves $\hat{\varepsilon} = 0$ in order to compare computational performance. This procedure is illustrated in Figure 1.
Fig. 1 Illustration of the parametric simplex method. The real number line (black) corresponds to varying values of $\mu$; superscripts denote the iteration counter. The real line is partitioned into a finite number of intervals such that each interval corresponds to a solution $\{x, \varepsilon\}$ that is optimal for any value of $\mu$ within that interval. $\mu$ is initialized to $\|A\|_1$, and $\{x^{(0)}, \varepsilon^{(0)}\}$, our initial solution, is optimal for its corresponding interval. The algorithm decreases $\mu$ towards 0 until it reaches a solution $\{x^{(T)}, \varepsilon^{(T)}\}$ where $\varepsilon^{(T)} = 0$. Then $x^{(T)} = \hat{x}$ is our desired optimal solution to Problem $(P_1)$. If the solution to $(P_1)$ is unique, the interval corresponding to $\{x^{(T)}, \varepsilon^{(T)}\}$ will contain $\mu = 0$. Since we actually solve for the entire regularization path, other methods such as Belloni and Chernozhukov (2011) could suggest picking the solution $x^{(T-1)}$ corresponding to $\mu^*$, but we do not investigate such methods in this paper.
To formulate $(P_2)$ as a linear programming problem, we need to rewrite it using nonnegative variables and equality constraints. We also need to explain how to find an initial basic solution. We split each variable into a difference of two nonnegative variables,

$$x = x^+ - x^- \quad \text{and} \quad \varepsilon = \varepsilon^+ - \varepsilon^-,$$

where the entries of $x^+, x^-, \varepsilon^+, \varepsilon^-$ are all nonnegative. The next step is to replace $\|x\|_1$ with $\mathbf{1}^T(x^+ + x^-)$ and to make a similar substitution for $\|\varepsilon\|_1$. In general, the sum does not equal the absolute value, but it is easy to see that equality holds at optimality. This is a well-known and standard technique for rewriting problems involving $\ell_1$-norms as linear programs. The resulting linear program, which we call Simplex CS, then becomes

$$\min_{x^+,\,x^-,\,\varepsilon^+,\,\varepsilon^-} \; \mu\mathbf{1}^T(x^+ + x^-) + \mathbf{1}^T(\varepsilon^+ + \varepsilon^-)
\quad \text{subject to} \quad
A(x^+ - x^-) + (\varepsilon^+ - \varepsilon^-) = y, \quad
x^+, x^-, \varepsilon^+, \varepsilon^- \ge 0. \qquad (P_3)$$

For $\mu$ large, the optimal solution has $x^+ = x^- = 0$ and $\varepsilon^+ - \varepsilon^- = y$. Given that these latter variables are required to be nonnegative, it follows that

$$y_i > 0 \;\Longrightarrow\; \varepsilon_i^+ > 0 \text{ and } \varepsilon_i^- = 0, \qquad (2)$$
$$y_i < 0 \;\Longrightarrow\; \varepsilon_i^- > 0 \text{ and } \varepsilon_i^+ = 0. \qquad (3)$$

The equality case can be decided either way. With these choices for the variable values, the solution is feasible for all $\mu$ and is optimal for large $\mu$. Furthermore, declaring the nonzero variables to be basic and the zero variables to be nonbasic, we use this as our initial solution for the parametric simplex method with an initial $\mu = \|A\|_1$. We are guaranteed that our initial solution is optimal for this setting of $\mu$ by the triangle inequality, i.e., $\|y\|_1 = \|Ax + \varepsilon\|_1 \le \|A\|_1\|x\|_1 + \|\varepsilon\|_1$. The full pseudo-code to initialize and determine $\mu$ is presented in Algorithm 1, where Step 4 refers to the pivot step (see Chapters 7 and 8 of Vanderbei (2001)).

Algorithm 1 Pseudo-Code for the Parametric Simplex Method
Require: Inputs $A$ and $y$ as set in Problem $(P_3)$.
1: Set $\mu = \|A\|_1$. Set the initial optimal solution to be $x^+ = x^- = 0$, with $\varepsilon^+$ and $\varepsilon^-$ following (2) and (3) respectively. This is reflected in Figure 1.
2: while $\varepsilon^+ + \varepsilon^- \neq 0$ do
3:   Determine the smallest value of $\mu$ such that the current solution for $(P_3)$ is optimal.
4:   For the current value of $\mu$, apply a simplex pivot step to determine a new feasible solution.
5: end while
6: return The optimal solution to $(P_1)$, $\hat{x} = x^+ - x^-$.
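To illustrate the role of $\mu$ in $(P_3)$, the following brute-force sketch (not our C implementation, and not the parametric simplex method itself) re-solves the LP for a decreasing grid of $\mu$ values with a generic solver and reports where $\|\varepsilon\|_1$ reaches zero; the parametric simplex method traces the same path far more efficiently by pivoting between breakpoints instead of re-solving from scratch. Sizes and the $\mu$-grid below are arbitrary illustration choices.

    # Brute-force sweep over mu for (P3); a sanity check, not the parametric simplex method.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n, k = 30, 80, 4
    A = rng.standard_normal((m, n))
    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    y = A @ x_true

    I = np.eye(m)
    A_eq = np.hstack([A, -A, I, -I])              # A(x+ - x-) + (eps+ - eps-) = y
    mu_max = np.abs(A).sum(axis=0).max()          # ||A||_1, the initial mu in Algorithm 1
    for mu in np.geomspace(mu_max, 1e-3, 15):
        c = np.concatenate([mu * np.ones(2 * n), np.ones(2 * m)])
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
        eps_norm = np.sum(res.x[2 * n:])          # ||eps||_1 = 1^T (eps+ + eps-)
        if eps_norm < 1e-7:
            print(f"residual vanishes by mu ~ {mu:.4f}")
            break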
3 Kronecker Compressed Sensing Formulation and Statistical Recovery

In this section, we introduce the Kronecker compressed sensing problem (Duarte and Baraniuk, 2012). Unlike the previous section, in which we allowed the sensing matrix $A$ to be any matrix satisfying a suitable $k$-RIP condition, in this section we add a structural requirement on the sensing matrix. In particular, for given matrices $B$ and $C$, we consider the following formulation, called IPM CS, that explicitly writes the Kronecker product:

$$\min \|x\|_1 \quad \text{subject to} \quad (B \otimes C)\, x = y. \qquad (P_4)$$

The matrices $B$ and $C$ are of size $m_1 \times n_1$ and $m_2 \times n_2$ respectively, and $A$, our sensing matrix, is the $(m_1 m_2) \times (n_1 n_2)$ matrix defined as the Kronecker product of $B$ and $C$:

$$A := B \otimes C = \begin{pmatrix} C b_{11} & \cdots & C b_{1 n_1} \\ \vdots & \ddots & \vdots \\ C b_{m_1 1} & \cdots & C b_{m_1 n_1} \end{pmatrix}.$$

This Kronecker structural assumption on the new sensing matrix $A$ has a natural physical interpretation of using a left and a right sensing matrix, and, as we will explain in the next section, it leads to a computational advantage.

We adopt the following notation. We first stack $x^* \in \mathbb{R}^n$ into a matrix $X^* \in \mathbb{R}^{n_2 \times n_1}$ by putting each length-$n_2$ sub-vector of $x^*$ into a column of $X^*$. Here, without loss of generality, we assume $n = n_1 \times n_2$. Let the compressed matrix signal $Y$ be defined as $Y := C X^* B^T$. We define $\|X\|_0 := \sum_{i,j} \mathbb{1}(x_{ij} \neq 0)$ and $\|X\|_1 := \sum_{i,j} |x_{ij}|$. Let the $\mathrm{vec}(\cdot)$ operator take a matrix and concatenate its elements column-by-column to build one large column vector containing all the elements of the matrix.

Given a matrix $Y \in \mathbb{R}^{m_2 \times m_1}$ and the sensing matrices $B$ and $C$, we make two easily verifiable observations. First, if $y := A x^*$ where $x^* = \mathrm{vec}(X^*)$, we have the equivalence $y = \mathrm{vec}(Y)$. Second, Problem $(P_4)$ is mathematically equivalent to

$$\operatorname*{argmin}_X \|X\|_1 \quad \text{subject to} \quad C X B^T = Y. \qquad (P_5)$$

That is, if $\hat{x}$ is the solution to $(P_4)$ and $\hat{X}$ is the solution to $(P_5)$, then $\hat{x} = \mathrm{vec}(\hat{X})$. Hence, we can interpret our Kronecker compressed sensing scheme as either placing a Kronecker structural assumption on $A$ or having two separate (a left and a right) sensing matrices, $B$ and $C$.
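The equivalence rests on the identity $\mathrm{vec}(C X B^T) = (B \otimes C)\,\mathrm{vec}(X)$, which is easy to verify numerically; the sizes in the following sketch are arbitrary and the code is purely illustrative:

    # Verify vec(C X B^T) = (B kron C) vec(X), with vec() stacking columns.
    import numpy as np

    rng = np.random.default_rng(1)
    m1, n1, m2, n2 = 3, 4, 5, 6
    B = rng.standard_normal((m1, n1))
    C = rng.standard_normal((m2, n2))
    X = rng.standard_normal((n2, n1))

    vec = lambda M: M.flatten(order="F")        # column-major stacking
    print(np.allclose(np.kron(B, C) @ vec(X), vec(C @ X @ B.T)))   # True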
Recall the definition of RIP given in Equation (1). To understand the statistical impact of assuming Kronecker structure on the sensing matrix $A$, we use Lemma 2 from Duarte and Baraniuk (2012) to relate the $k$-RIP constants $\delta_k(B)$ and $\delta_k(C)$ to $\delta_k(A)$.

Lemma 2 (Duarte and Baraniuk (2012)) Suppose $A = B \otimes C$. Then

$$1 + \delta_k(A) \le (1 + \delta_k(B))(1 + \delta_k(C)). \qquad (4)$$

In addition, we define a sub-Gaussian distribution as follows:
Definition 1 (Sub-Gaussian Distribution) We say a mean-zero random variable $X$ follows a sub-Gaussian distribution if there exists some $\sigma \in \mathbb{R}^+$ such that

$$\mathbb{E}\exp(tX) \le \exp\!\left(\frac{\sigma^2 t^2}{2}\right) \quad \text{for all } t \in \mathbb{R}.$$

It is obvious that the Gaussian distribution with mean 0 and variance $\sigma^2$ satisfies the above definition (by a Taylor expansion). The next theorem provides sufficient conditions that guarantee perfect recovery for the KCS problem with high probability.

Theorem 1 Suppose the entries of the matrices $B$ and $C$ follow a sub-Gaussian distribution with parameter $\sigma$. Then there exists a constant $C > 0$ (depending on $\sigma$) such that whenever

$$m_1 \ge C \cdot k \log(n_1/k) \quad \text{and} \quad m_2 \ge C \cdot k \log(n_2/k),$$

the convex program $(P_5)$ attains perfect recovery with probability

$$\mathbb{P}\big(\hat{X} = X^*\big) \;\ge\; 1 - \underbrace{\left[2\exp\!\left(-\frac{m_1}{2C}\right) + 2\exp\!\left(-\frac{m_2}{2C}\right)\right]}_{\rho(m_1, m_2)}.$$

Proof We use the equivalence between Problem $(P_5)$ and Problem $(P_4)$. From Lemma 1 and Lemma 2, it suffices to show that $\delta_k(B)$ and $\delta_k(C)$ are both less than $2/\sqrt{3} - 1$: in that case, (4) gives $1 + \delta_k(A) \le (1 + \delta_k(B))(1 + \delta_k(C)) < (2/\sqrt{3})^2 = 4/3$, so $\delta_k(A) < 1/3$. Let $\tau := 2/\sqrt{3} - 1$. Using Theorem 9.2 of Foucart and Rauhut (2013), there exist constants $C_1$ and $C_2$ (depending on $\sigma$) such that if $m_1 \ge 2C_1\tau^{-2} k \log(e n_1/k)$ and $m_2 \ge 2C_2\tau^{-2} k \log(e n_2/k)$, then

$$\mathbb{P}\big(\delta_k(B) < \tau \text{ and } \delta_k(C) < \tau\big)
\;\ge\; 1 - \big[\mathbb{P}(\delta_k(B) \ge \tau) + \mathbb{P}(\delta_k(C) \ge \tau)\big]
\;\ge\; 1 - \left[2\exp\!\left(-\frac{\tau^2 m_1}{2C_1}\right) + 2\exp\!\left(-\frac{\tau^2 m_2}{2C_2}\right)\right]
\;\ge\; 1 - \rho(m_1, m_2). \qquad \square$$

This theorem can also be derived from the results in Baraniuk et al. (2010), but we focus on the computational aspects in this paper. From the above theorem, we see that for $m_1 = m_2 = \sqrt{m}$ and $n_1 = n_2 = \sqrt{n}$, whenever the number of measurements satisfies

$$m = O\!\left(k^2 \log^2\!\left(\frac{\sqrt{n}}{k}\right)\right), \qquad (5)$$

we have $\hat{X} = X^*$ with probability at least $1 - 4\exp(-C\sqrt{m})$ for some constant $C$.
Here we compare the above result to that of (standard) compressed sensing $(P_1)$. Following the same argument as in Theorem 1, whenever

$$m = O\!\left(k \log\!\left(\frac{n}{k}\right)\right), \qquad (6)$$

we have $\hat{x} = x^*$ with probability at least $1 - 2\exp(-Cm)$. Comparing Equation (6) to Equation (5), we see that KCS requires more stringent conditions for perfect recovery. Specifically, for a fixed $n$, as $k$ (the unknown sparsity level) increases, the required number of samples for KCS grows at a faster rate than for standard CS. However, in the next section and in the numerical results, we will see that KCS enjoys a tremendous improvement in computation time.
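For intuition, the two scalings can be tabulated side by side. The sketch below sets all unspecified constants to 1, so the numbers indicate only how the two bounds grow with $k$, not actual sample sizes:

    # Illustrative only: compare the growth of (6) and (5) with constants set to 1.
    import numpy as np

    n = 141 * 142                                   # dimension used in Section 5
    for k in [2, 20, 50, 100]:
        standard = k * np.log(n / k)                # O(k log(n/k))
        kcs = k ** 2 * np.log(np.sqrt(n) / k) ** 2  # O(k^2 log^2(sqrt(n)/k))
        print(f"k={k:3d}:  k log(n/k) ~ {standard:8.1f}   k^2 log^2(sqrt(n)/k) ~ {kcs:10.1f}")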
4 Sparsifying the Constraint Matrix for Efficient Computation

We can use an interior-point algorithm or most other optimization algorithms to solve $(P_4)$. However, we can craft our optimization algorithm to explicitly use the Kronecker structure of $A$. The key to efficiently solving the linear programming problem associated with the Kronecker sensing problem lies in noting that the dense matrix $A$, a Kronecker product, can be factored into a product of two very sparse matrices:

$$A = \begin{pmatrix} C b_{11} & \cdots & C b_{1 n_1} \\ \vdots & \ddots & \vdots \\ C b_{m_1 1} & \cdots & C b_{m_1 n_1} \end{pmatrix}
= \begin{pmatrix} C & 0 & \cdots & 0 \\ 0 & C & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C \end{pmatrix}
\begin{pmatrix} b_{11} I_{n_2} & b_{12} I_{n_2} & \cdots & b_{1 n_1} I_{n_2} \\ b_{21} I_{n_2} & b_{22} I_{n_2} & \cdots & b_{2 n_1} I_{n_2} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m_1 1} I_{n_2} & b_{m_1 2} I_{n_2} & \cdots & b_{m_1 n_1} I_{n_2} \end{pmatrix}
=: VW,$$

where $I_{n_2}$ denotes an $n_2 \times n_2$ identity matrix and $0$ denotes an $m_2 \times n_2$ zero matrix. Notice that while the matrix $A$ is usually completely dense, it is a product of two very sparse matrices: $V = I_{m_1} \otimes C \in \mathbb{R}^{m \times m_1 n_2}$ and $W = B \otimes I_{n_2} \in \mathbb{R}^{m_1 n_2 \times n}$. Hence, if we introduce new variables $z$, we can rewrite $(P_4)$ equivalently as

$$\min_{x,\,z} \|x\|_1 \quad \text{subject to} \quad Wx - z = 0, \quad Vz = y. \qquad (P_6)$$
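The factorization $A = VW$ and its sparsity are easy to check numerically. The following sketch (arbitrary small sizes, purely illustrative) builds $V$ and $W$ with sparse Kronecker products and compares their nonzero counts with the dense $A$:

    # Verify A = VW with V = I_{m1} kron C and W = B kron I_{n2}, and count nonzeros.
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(2)
    m1, n1, m2, n2 = 8, 20, 9, 21
    B = rng.standard_normal((m1, n1))
    C = rng.standard_normal((m2, n2))

    A = np.kron(B, C)                                # dense (m1*m2) x (n1*n2) matrix
    V = sp.kron(sp.identity(m1), C, format="csr")    # block-diagonal copies of C
    W = sp.kron(B, sp.identity(n2), format="csr")    # blocks b_ij * I_{n2}

    print(np.allclose((V @ W).toarray(), A))         # True: the factorization holds
    print(A.size, V.nnz + W.nnz)                     # dense entries vs. sparse nonzeros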
Using our previous notation, we can rewrite $(P_6)$ as IPM KCS:

$$\min_{X,\,Z} \|X\|_1 \quad \text{subject to} \quad CX - Z = 0, \quad Z B^T = Y. \qquad (P_7)$$

If we want to use the parametric simplex method to solve $(P_6)$, we can, as before, split $x$ and $\varepsilon$ into differences of their positive and negative parts and enforce equality constraints to convert the problem into a linear program, Simplex KCS:

$$\min_{x^+,\,x^-,\,\varepsilon^+,\,\varepsilon^-,\,z} \; \mu\mathbf{1}^T(x^+ + x^-) + \mathbf{1}^T(\varepsilon^+ + \varepsilon^-)
\quad \text{subject to} \quad
z - W(x^+ - x^-) = 0, \quad
Vz + (\varepsilon^+ - \varepsilon^-) = y, \quad
x^+, x^-, \varepsilon^+, \varepsilon^- \ge 0. \qquad (P_8)$$

This formulation $(P_8)$ has more variables and more constraints than $(P_3)$, but now the constraint matrix is very sparse. For linear programming, sparsity of the constraint matrix is a significant contributor to algorithm efficiency (Vanderbei, 1991). In fact, we can view the decomposition in $(P_7)$ as one step of the fast Fourier transform (Vanderbei, 2012).
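As a small illustration of this reformulation (using SciPy's HiGHS LP solver as a stand-in for our simplex and interior-point codes, with arbitrary small sizes), the sketch below assembles the sparse constraint matrix of $(P_6)$ with split variables; the dense $A$ is formed only to synthesize the measurements:

    # Solve the sparse KCS reformulation (P6) as an LP; illustrative sketch only.
    import numpy as np
    import scipy.sparse as sp
    from scipy.optimize import linprog

    rng = np.random.default_rng(3)
    m1, n1, m2, n2, k = 10, 14, 11, 15, 3
    B = rng.standard_normal((m1, n1))
    C = rng.standard_normal((m2, n2))
    n, p = n1 * n2, m1 * n2                        # p = number of auxiliary variables z

    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    y = np.kron(B, C) @ x_true                     # measurements (dense A used only here)

    V = sp.kron(sp.identity(m1), C, format="csr")  # m x p, block-diagonal copies of C
    W = sp.kron(B, sp.identity(n2), format="csr")  # p x n, blocks b_ij * I_{n2}

    # variables [x+, x-, z]; constraints W(x+ - x-) - z = 0 and Vz = y
    A_eq = sp.bmat([[W, -W, -sp.identity(p)], [None, None, V]], format="csr")
    b_eq = np.concatenate([np.zeros(p), y])
    c = np.concatenate([np.ones(2 * n), np.zeros(p)])
    bounds = [(0, None)] * (2 * n) + [(None, None)] * p   # z is free

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    x_hat = res.x[:n] - res.x[n:2 * n]
    print("relative l1 error:", np.linalg.norm(x_hat - x_true, 1) / np.linalg.norm(x_true, 1))

At these sizes the error is typically at the solver's tolerance, although recovery is not guaranteed for such small $m_1$ and $m_2$.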
5 Numerical Results Compared against Other Methods

Before showing our simulation results, we briefly describe three other popular methods for solving compressed sensing problems: $\ell_1\ell_s$, Mirror Prox and Fast Hard Thresholding Pursuit. We compare the performance of our methods against these three on noiseless compressed sensing problems. We use "KCS" to refer to the optimization problems that explicitly exploit sparsity structure in the formulation, $(P_7)$ and $(P_8)$.

Proposed Methods: In this paper, we have presented two ideas, the parametric simplex method (an optimization algorithm) and Kronecker compressed sensing (a problem reformulation). We have made implementations that use these ideas either separately or jointly. We modify the parametric simplex algorithm described in Vanderbei (2007), implemented in C and found at http://www.orfe.princeton.edu/~rvdb/LPbook/src/index.html, to solve both $(P_3)$ and $(P_8)$. We refer to these implementations as "Simplex" and "Simplex KCS" respectively in our simulation results. We also use an interior-point solver called loqo (Vanderbei (1999)) to solve $(P_4)$ and $(P_7)$. We refer to these implementations as "IPM" and "IPM KCS" respectively in our simulation results.

(Specialized) Interior-Point Method: $\ell_1\ell_s$ (Kim et al., 2007) is a truncated Newton interior-point method that uses a preconditioner to solve

$$\hat{x} = \operatorname*{argmin}_x \; \|Ax - y\|_2^2 + \lambda\|x\|_1 \qquad (P_9)$$

for a given regularization parameter $\lambda$. Since computing the Newton direction is computationally prohibitive for large-scale problems due to its Hessian, $\ell_1\ell_s$ circumvents this problem by approximating the Newton direction using preconditioned conjugate gradients. Interior-point methods are generally advocated for using second-order information to advance in the correct direction and to choose an appropriate step size. The Matlab code for this algorithm can be found at http://stanford.edu/~boyd/l1_ls/.
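Problem $(P_9)$ is a lasso-type objective. As a point of reference only (this is not the $\ell_1\ell_s$ solver, which uses a truncated Newton interior-point method), a generic proximal-gradient (ISTA) loop for the same objective looks as follows; all names and sizes are illustrative:

    # Generic ISTA sketch for min ||Ax - y||_2^2 + lam * ||x||_1; not the l1_ls code.
    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, y, lam, iters=1000):
        L = 2.0 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = 2.0 * A.T @ (A @ x - y)       # gradient of ||Ax - y||_2^2
            x = soft_threshold(x - grad / L, lam / L)
        return x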
Greedy Method: Fast Hard Thresholding Pursuit (FHTP) (Foucart, 2011) is a greedy algorithm that iterates two steps to approximately solve

$$\hat{x} = \operatorname*{argmin}_{x:\,\|x\|_0 = k} \; \{\|x\|_1 : Ax = y\} \qquad (P_{10})$$

for a given sparsity level $k$. In the first step, it chooses the best $k$ coordinates of $x$ according to a certain criterion, and in the next step it optimizes $x$ over only these $k$ coordinates while setting the remaining coordinates to 0. This algorithm is advocated for its simplicity while retaining exact recovery as long as $A$ satisfies an RIP condition and $k$ is correctly chosen. The Matlab code for this algorithm can be found at http://www.math.uga.edu/~foucart/software.htm.
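To make the two-step structure concrete, here is a simplified hard-thresholding-pursuit iteration written under the description above (a generic sketch, not Foucart's FHTP implementation): a gradient step, selection of the $k$ largest coordinates, and a least-squares refit on that support.

    # Simplified hard thresholding pursuit sketch; illustrative, not the FHTP code.
    import numpy as np

    def htp(A, y, k, iters=50):
        m, n = A.shape
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # conservative step size
        x = np.zeros(n)
        for _ in range(iters):
            g = x + step * A.T @ (y - A @ x)     # gradient step on 0.5 * ||Ax - y||^2
            S = np.argsort(np.abs(g))[-k:]       # keep the k largest coordinates
            x = np.zeros(n)
            x[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]   # refit on the support
        return x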
First-Order Method: The Mirror Prox algorithm (Juditsky et al., 2014) is a first-order algorithm that solves

$$\hat{x} = \operatorname*{argmin}_x \; \{\|x\|_1 : \|Ax - y\|_2 \le \delta\} \qquad (P_{11})$$

for a given tolerance $\delta$ by reformulating the problem as a saddle-point problem and using a proximal method to solve its variational inequality. First-order algorithms are typically favored in the compressed sensing literature because they avoid computing the Hessian matrix, and saddle-point formulations are advantageous since they naturally combine the primal and dual of the problem via variational inequalities. The Matlab code for this algorithm can be found at http://www2.isye.gatech.edu/~nemirovs/MirrorProxJan10_2012.zip.

Experimental Protocol: In the following section, let $x^*$ denote the true signal and $\hat{x}$ denote the estimated signal produced by one of the algorithms. We measure the accuracy of the solution by

$$\text{Relative } \ell_1 \text{ error: } \frac{\|x^* - \hat{x}\|_1}{\|x^*\|_1} \qquad \text{and} \qquad \ell_\infty \text{ error: } \|x^* - \hat{x}\|_\infty.$$
We compare five different algorithms, the parametric simplex method, the interior-point method, $\ell_1\ell_s$, FHTP and Mirror Prox, at different sparsity levels $k$. FHTP is run in two different modes, oracle and agnostic. In the former, FHTP is given the true sparsity of $x^*$; in the latter, FHTP is always given a sparsity of 100 regardless of the truth.

Since each algorithm solves a slightly different optimization problem, we devise a methodology to treat each optimization problem roughly fairly. An important quantity for achieving fairness is the imprecision, $\|A\hat{x} - y\|_2^2$, the degree to which the solution $\hat{x}$ satisfies the constraints. We first apply our proposed methods (Simplex KCS, IPM KCS, Simplex CS, IPM CS) and record their magnitudes of imprecision. We then require the solutions of the $\ell_1\ell_s$ problem $(P_9)$ and the Mirror Prox problem $(P_{11})$ to have an imprecision at most two orders of magnitude larger. We cannot reasonably expect the exact same magnitude of constraint error since simplex methods are exact algorithms. For each $k$, we found that $\lambda = 0.01$ in $(P_9)$ achieves a comparable imprecision. Given the solution to $\ell_1\ell_s$, we can easily set $\delta$ in $(P_{11})$ to match the precision. The parametric simplex method and oracle FHTP naturally achieve the highest precision in most cases.

To ensure that each optimization algorithm solves the same problem, we sample $B$ and $C$ as Gaussian matrices in which each entry is a standard Gaussian (mean 0, standard deviation 1). "Simplex KCS" and "IPM KCS" utilize the matrices $V = I \otimes C$ and $W = B \otimes I$, while all the other methods utilize the sensing matrix $A = B \otimes C$. In the following, we perform two different simulation sets, with 10 trials for each setting in both simulation sets. Instructions for downloading and running the various codes/algorithms described in this section can be found at http://www.orfe.princeton.edu/~rvdb/tex/CTS/kronecker_sim.html.

Results (Varying Sparsity): In the first simulation set, we vary the true degree of sparsity by generating random problems with $m = 1{,}122 = 33 \times 34$ and $n = 20{,}022 = 141 \times 142$ and varying the number of nonzeros $k$ in the signal $x^*$ from 2 to 150. Table 1 reports the time measured in seconds, the relative $\ell_1$ error and the $\ell_\infty$ error averaged across 10 trials for each degree of sparsity. All the simulation times are shown in Figure 2.

There are two observations to make. First, when the true sparsity is very small ($k \le 70$), even the simplex method that does not utilize Kronecker structure (Simplex) outperforms most modern methods in terms of precision and time. Second, we see that previously slow methods (Simplex and IPM) are tremendously sped up once they utilize Kronecker structure (Simplex KCS and IPM KCS). In fact, IPM KCS is uniformly faster than $\ell_1\ell_s$ and Mirror Prox, roughly a ten-fold improvement in speed. This runs counter to the common claim in the compressed sensing literature that simplex methods and interior-point methods are too slow compared to first-order methods. Our results show that if the sensing matrix is designed to have Kronecker structure, simplex and interior-point methods are still competitive.
Fig. 2 Solution times for a large number of trials having m = 1,122, n = 20,022, and various degrees of sparsity in the underlying signal. The horizontal axis shows the number of nonzeros in the signal. The vertical axis gives a semi-log scale of solution times. The error bars have lengths equal to one standard deviation based on the multiple trials. Between Simplex KCS and IPM KCS, we outperform $\ell_1\ell_s$ and Mirror Prox.
In four of the Simplex trials in this simulation set, the optimization problem became unbounded. We suspect this is because the hundreds of thousands of simplex iterations required for large values of $k$ caused numerical instability. We coded the entire parametric simplex method ourselves in C instead of using commercial functions in Matlab.

Sparsity k = 2 (columns 2-4) and k = 20 (columns 5-7):

Method          | Time (s)  | Rel. ℓ1 error | ℓ∞ error | Time (s)  | Rel. ℓ1 error | ℓ∞ error
Simplex KCS     | 0.00000   | 0.00000       | 0.00000  | 2.00000   | 0.00000       | 0.00000
IPM KCS         | 44.65000  | 0.00000       | 0.00000  | 48.00000  | 0.00000       | 0.00000
Simplex         | 2.00000   | 0.00000       | 0.00000  | 3.50000   | 0.00000       | 0.00000
IPM             | 730.35000 | 0.00000       | 0.00000  | 758.70000 | 0.00000       | 0.00001
ℓ1ℓs            | 110.12000 | 0.00000       | 0.00001  | 96.75300  | 0.00000       | 0.00010
Mirror Prox     | 28.07000  | 0.00025       | 0.00099  | 82.74000  | 0.00143       | 0.01535
FHTP (Oracle)   | 16.91200  | 0.00000       | 0.00000  | 14.27000  | 0.00001       | 0.00009
FHTP (Agnostic) | 191.57500 | 0.25993       | 4.67440  | 196.25500 | 0.17329       | 3.49615

Sparsity k = 50 (columns 2-4) and k = 70 (columns 5-7):

Method          | Time (s)  | Rel. ℓ1 error | ℓ∞ error | Time (s)  | Rel. ℓ1 error | ℓ∞ error
Simplex KCS     | 12.00000  | 0.00000       | 0.00000  | 25.50000  | 0.00000       | 0.00000
IPM KCS         | 47.70000  | 0.00000       | 0.00001  | 48.95000  | 0.00000       | 0.00001
Simplex         | 19.50000  | 0.00000       | 0.00000  | 66.00000  | 0.00000       | 0.00000
IPM             | 758.70000 | 0.00000       | 0.00001  | 821.25000 | 0.00000       | 0.00002
ℓ1ℓs            | 170.31500 | 0.00000       | 0.00031  | 267.15500 | 0.00000       | 0.00052
Mirror Prox     | 42.84000  | 0.00011       | 0.06293  | 64.03000  | 0.00015       | 0.11449
FHTP (Oracle)   | 13.27300  | 0.00000       | 0.00160  | 15.57550  | 0.00000       | 0.00144
FHTP (Agnostic) | 146.21500 | 0.00504       | 1.77320  | 35.64750  | 0.00518       | 1.42405

Sparsity k = 100 (columns 2-4) and k = 150 (columns 5-7):

Method          | Time (s)  | Rel. ℓ1 error | ℓ∞ error | Time (s)   | Rel. ℓ1 error | ℓ∞ error
Simplex KCS     | 70.50000  | 0.00000       | 0.00002  | 462.50000  | 0.00000       | 0.00017
IPM KCS         | 49.95000  | 0.00000       | 0.00002  | 56.50000   | 0.00000       | 0.00136
Simplex         | 234.50000 | 0.00000       | 0.00001  | 1587.50000 | 0.00000       | 0.00010
IPM             | 783.45000 | 0.00000       | 0.00035  | 794.50000  | 0.00000       | 0.00150
ℓ1ℓs            | 377.31500 | 0.00000       | 0.00104  | 789.16500  | 0.00000       | 0.00683
Mirror Prox     | 410.05000 | 0.00011       | 0.42348  | 635.08500  | 0.00036       | 2.43170
FHTP (Oracle)   | 16.23100  | 0.00000       | 0.00471  | 79.67700   | 0.01460       | 128.01000
FHTP (Agnostic) | 20.43900  | 0.00000       | 0.00471  | 148.94500  | 0.01646       | 145.07500

Table 1 Time (seconds), relative ℓ1 error and ℓ∞ error for 6 selected sparsity levels, averaged (median) over 10 trials. The first two rows of each block represent our proposed methods. The parametric simplex method outperforms the other methods for very sparse problems ($k \le 70$). Naturally, FHTP (oracle) is the fastest for $k \le 100$, but both of our methods outperform FHTP in both relative ℓ1 error and ℓ∞ error. By incorporating Kronecker structure, previously slow methods experience a drastic speed-up (e.g., the difference between IPM and IPM KCS).

(Varying Size) In the second simulation set, we vary the problem size by fixing the number of nonzeros $k$ in the signal $x^*$ to 100 and constructing random problems with $m = 1{,}122 = 33 \times 34$ and $n$ varying from $n = 141 \times 142$ to $n = 141 \times 402$. Let $n_2$ denote the varying dimension. Table 2 shows the same attributes as the previous table. We make two observations. First, as shown in Figure 3, the relative order of the seven algorithms mostly persists throughout the simulation set. In particular, even when the vector $x^*$ contains four times as many dimensions, Simplex KCS still outperforms $\ell_1\ell_s$ and Mirror Prox, and IPM KCS outperforms most methods throughout the entire simulation set. Second, we notice that IPM (with no Kronecker structure) fails beyond $n_2 = 202$: when $n_2$ increases further, our computer runs out of memory after ten or more minutes of computation. This raises another practical issue in favor of using Kronecker structure. Storing the matrix $A$ is much more computationally expensive than storing the matrices $B$ and $C$, resulting in a dramatic increase in computation time.
6 Discussion and Conclusions

We revisit compressed sensing from an optimization perspective. Many authors in the compressed sensing community have frowned upon the simplex method for its lack of theoretical convergence rates and upon standard interior-point methods for their computationally expensive per-iteration updates. In fact, our results confirm that densely formulated IPMs are slow for CS problems. However, we advance the field of compressed sensing in two ways.

First, despite lacking a theoretical convergence rate, the parametric simplex method is still competitive for very sparse signals. It outperforms $\ell_1\ell_s$ and Mirror Prox (and, we suspect, many other methods) in both time and precision. Also, by adopting a parametric simplex method, we solve the problem for all values of $\mu$, thereby simultaneously solving the entire regularization path. This allows the user to pick a particular value of $\mu$ or take the $\mu$ that satisfies the equality constraints; our paper focused on the latter.

Second, if we utilize Kronecker structure, both the parametric simplex and interior-point methods speed up ten-fold, making them competitive with other modern CS algorithms. But, as explained earlier, the Kronecker sensing problem involves changing the underlying problem being solved: the sensing matrix is now the Kronecker product of two Gaussian matrices.
Fig. 3 Solution times for a large number of trials with m = 1,122 and k = 100. The number of dimensions of x* in each trial is n = 141 × n2, and the horizontal axis shows the range of n2. The vertical axis and error bars denote the same quantities as in Figure 2. IPM KCS outperforms most methods throughout the entire simulation set.
While we presented this idea using only the parametric simplex and interior-point methods, we expect it to benefit most optimization methods. The Kronecker idea is based on fast Fourier transforms, where dense computations are split into two smaller computations (Vanderbei (2012), Vanderbei (1991)). Hence, any optimization method that can accommodate fast matrix multiplication operations can be altered to benefit from a Kronecker compressed sensing scheme. The theoretical guarantees for using Kronecker compressed sensing are more stringent, however. This illustrates the tradeoff between computation and statistics.

In most applications, we expect some noise to corrupt the true signal. As long as the signal-to-noise ratio is high, we can adapt our methods to handle inexact constraints. Furthermore, the parametric simplex method computes solutions for all values of $\mu$. We can pick the solution associated with the value suggested in Belloni and Chernozhukov (2011) to achieve statistical properties for noisy cases. If we have a specific residual size $\|\varepsilon\|_1$ that we are willing to tolerate, we can pick the appropriate solution from the regularization path. In future work, we plan to investigate noisy cases more thoroughly and extend the proposed method to the setting of 1-bit compressed sensing.

Size n2 = 102 (columns 2-4) and n2 = 202 (columns 5-7):

Method        | Time (s)   | Rel. ℓ1 error | ℓ∞ error | Time (s)   | Rel. ℓ1 error | ℓ∞ error
Simplex KCS   | 104.50000  | 0.00000       | 0.00002  | 180.50000  | 0.00000       | 0.00002
IPM KCS       | 51.50000   | 0.00000       | 0.00000  | 455.00000  | 0.00000       | 0.00000
Simplex       | 53.50000   | 0.00000       | 0.00000  | 81.25000   | 0.00000       | 0.00000
IPM           | 776.15000  | 0.00000       | 0.00000  | 1585.15000 | 0.00000       | 0.00000
ℓ1ℓs          | 181.62000  | 0.00000       | 0.00089  | 380.70500  | 0.00000       | 0.00097
Mirror Prox   | 383.16000  | 0.00009       | 0.27131  | 1178.05000 | 0.00010       | 0.42257
FHTP (Oracle) | 10.22450   | 0.00001       | 0.00681  | 30.13650   | 0.00000       | 0.00590

Size n2 = 302 (columns 2-4) and n2 = 402 (columns 5-7):

Method        | Time (s)   | Rel. ℓ1 error | ℓ∞ error | Time (s)   | Rel. ℓ1 error | ℓ∞ error
Simplex KCS   | 483.00000  | 0.00000       | 0.00004  | 989.00000  | 0.00166       | 2.76549
IPM KCS       | 1269.50000 | 0.00000       | 0.00002  | 3555.50000 | 0.00002       | 0.00505
Simplex       | 169.70000  | 0.00000       | 0.00000  | 320.30000  | 0.00042       | 1.00000
IPM           | —          | —             | —        | —          | —             | —
ℓ1ℓs          | 1734.65000 | 0.00000       | 0.00203  | 2201.20000 | 0.00166       | 2.76892
Mirror Prox   | 1800.60000 | 0.00007       | 0.94227  | 1802.35000 | 0.00197       | 5.98285
FHTP (Oracle) | 316.26500  | 0.00366       | 62.96350 | 596.54000  | 0.00433       | 82.61700

Table 2 Results in the same format as Table 1 for 4 selected dimension sizes $n_2$, averaged (median) over 10 trials. Simplex KCS and IPM KCS outperform most methods in terms of time and accuracy. The parametric simplex method (without Kronecker structure) performs better than $\ell_1\ell_s$ for $k = 100$ even as $n_2$ grows. Note that the ℓ∞ error for $n_2 = 402$ is large for $\ell_1\ell_s$, Mirror Prox and FHTP, meaning at least one dimension is drastically incorrect. There are no values for IPM at $n_2 \in \{302, 402\}$ due to insufficient memory during the computation.
References

Baraniuk, R., Davenport, M. A., Duarte, M. F. and Hegde, C. (2010). An Introduction to Compressive Sensing. CONNEXIONS, Rice University, Houston, Texas.
Belloni, A. and Chernozhukov, V. (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. Annals of Statistics 39 82-130.
Cai, T. T. and Zhang, A. (2012). Sharp RIP bound for sparse signal and low-rank matrix recovery. Applied and Computational Harmonic Analysis.
Candès, E., Romberg, J. and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 489-509.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 33-61.
Cohen, A., Dahmen, W. and DeVore, R. (2009). Compressed sensing and best k-term approximation. J. Amer. Math. Soc. 211-231.
Dantzig, G. B. (1998). Linear Programming and Extensions. Princeton University Press.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory 52 1289-1306.
Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1-minimization. Proc. Natl. Acad. Sci. USA 100.
Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory 52 6-18.
Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47 2845-2862.
Donoho, D. L., Maleki, A. and Montanari, A. (2009). Message passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 106 18914-18919.
Donoho, D. L. and Stark, P. B. (1989). Uncertainty principles and signal recovery. SIAM J. Appl. Math. 49 906-931.
Donoho, D. L. and Tanner, J. (2005a). Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences 102 9452-9457.
Donoho, D. L. and Tanner, J. (2005b). Sparse nonnegative solutions of underdetermined linear equations by linear programming. In Proceedings of the National Academy of Sciences.
Donoho, D. L. and Tanner, J. (2009). Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philos. Trans. Roy. Soc. S.-A 367 4273-4293.
Duarte, M. F. and Baraniuk, R. G. (2012). Kronecker compressive sensing. IEEE Transactions on Image Processing 21 494-504.
Elad, M. (2010). Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer.
Figueiredo, M., Nowak, R. and Wright, S. (2008). Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing 1 586-597.
Forrest, J. J. and Goldfarb, D. (1992). Steepest-edge simplex algorithms for linear programming. Mathematical Programming 57 341-374.
Foucart, S. (2011). Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis 49 2543-2563.
Foucart, S. and Rauhut, H. (2013). A Mathematical Introduction to Compressive Sensing. Springer.
Gilbert, A. C., Strauss, M. J., Tropp, J. A. and Vershynin, R. (2007). One sketch for all: Fast algorithms for compressed sensing. In STOC (D. S. Johnson and U. Feige, eds.). ACM.
Iwen, M. A. (2010). Combinatorial sublinear-time Fourier algorithms. Foundations of Computational Mathematics 10 303-338.
Juditsky, A., Karzan, F. K. and Nemirovski, A. (2014). ℓ1 minimization via randomized first order algorithms. Tech. rep.
Kim, S., Koh, K., Lustig, M., Boyd, S. and Gorinevsky, D. (2007). An interior-point method for large-scale ℓ1-regularized least squares. IEEE Transactions on Selected Topics in Signal Processing 1 606-617.
Kutyniok, G. (2012). Compressed sensing: Theory and applications. CoRR abs/1203.3815.
Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 3397-3415.
Needell, D. and Tropp, J. A. (2010). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 53 93-100.
Needell, D. and Vershynin, R. (2009). Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics 9 317-334.
Pan, P.-Q. (2008). A largest-distance pivot rule for the simplex algorithm. European Journal of Operational Research 187 393-402.
Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50 2231-2242.
Vanderbei, R. (1991). Splitting dense columns in sparse linear systems. Lin. Alg. and Appl. 152 107-117.
Vanderbei, R. (1999). LOQO: An interior point code for quadratic programming. Optimization Methods and Software 12 451-484.
Vanderbei, R. (2007). Linear Programming: Foundations and Extensions. 3rd ed. Springer.
Vanderbei, R. J. (2001). Linear Programming: Foundations and Extensions. International Series in Operations Research & Management Science 37.
Vanderbei, R. J. (2012). Fast Fourier optimization. Math. Prog. Comp. 4 1-17.
Yin, W., Osher, S., Goldfarb, D. and Darbon, J. (2008). Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM J. Img. Sci. 1 143-168.