Latent Variable Bayesian Models for Promoting Sparsity
David Wipf
Biomagnetic Imaging Lab, University of California, San Francisco
Compressive Sensing Workshop, Duke University, Feb. 2009
Overview
♦ Sparse inverse problems
♦ Latent variable representations of sparse priors
♦ Two approaches to sparse estimation
  ♦ Type I: Integrate out latent variables, maximize over coefficients (MAP)
  ♦ Type II: Integrate out coefficients, maximize over latent variables (empirical Bayes)
♦ Duality
♦ Properties of the Type II cost function
♦ Optimization strategies, e.g., reweighted L1 and L2
♦ Empirical results
Sparse Inverse Problem
♦ Linear generative model:
$$ y = \Phi x + \varepsilon, $$
where $y$ is the observed $m$-dimensional data vector, $\Phi$ is a matrix of $n$ basis vectors (columns), $x$ is the vector of unknown coefficients, and $\varepsilon$ is Gaussian noise with variance $\lambda$.
♦ Objective: Estimate the unknown $x$ given the following assumptions:
  1. $\Phi$ is overcomplete, meaning the number of features (columns) $n$ is greater than the signal dimension $m$.
  2. $x$ is maximally sparse, i.e., many elements equal zero.
Sparse Inverse Problem Cont.
♦ Noiseless case ($\varepsilon = 0$):
$$ x_0 \triangleq \arg\min_x \|x\|_0 \quad \text{s.t.} \quad y = \Phi x, $$
where the L0 quasi-norm $\|x\|_0$ counts the number of nonzeros in $x$.
♦ Noisy case ($\varepsilon > 0$):
$$ x_0(\lambda) \triangleq \arg\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_0 = \arg\max_x \underbrace{\exp\!\left[-\tfrac{1}{2\lambda}\|y - \Phi x\|_2^2\right]}_{\text{likelihood}} \, \underbrace{\exp\!\left[-\tfrac{1}{2}\|x\|_0\right]}_{\text{prior}} $$
Sparse Inverse Problem Cont.
♦ Although the forward model is linear, the inverse problem is very difficult to solve for two reasons:
  1. Combinatorial number of local minima (see the toy sketch below)
  2. The objective is discontinuous
♦ A variety of approximate methods can be viewed in Bayesian terms using a flexible class of sparse priors.
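As a toy illustration of the combinatorial difficulty (not from the talk; the dimensions, numpy dependency, and random seed are my own choices), the sketch below solves the noiseless L0 problem by brute-force enumeration of supports, which is only feasible at tiny sizes:

```python
import itertools
import numpy as np

# Toy illustration: brute-force search over supports for the noiseless L0 problem
#   min ||x||_0  s.t.  y = Phi x.
# The number of candidate supports grows exponentially in n.
rng = np.random.default_rng(0)
m, n, k = 4, 8, 2                                  # tiny, hypothetical sizes
Phi = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = Phi @ x_true

recovered = None
for size in range(1, n + 1):                       # supports of increasing cardinality
    for S in itertools.combinations(range(n), size):
        cols = list(S)
        x_S, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)
        if np.linalg.norm(y - Phi[:, cols] @ x_S) < 1e-9:
            recovered = dict(zip(cols, x_S))       # first exact fit = sparsest support
            break
    if recovered is not None:
        break

print("recovered support and values:", recovered)
```

Even at n = 8 this loop may examine up to 2^8 supports; at n = 100 exhaustive search is hopeless, which is what motivates the approximate Bayesian methods that follow.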
Latent Variable Representations of Sparse Priors
1. Gaussian scale mixture (a numerical check follows below):
$$ p(x_i) = \int \mathcal{N}(x_i \mid 0, \gamma_i)\, p(\gamma_i)\, d\gamma_i \;\propto\; \exp\!\left[-\tfrac{1}{2} g(x_i^2)\right] $$
2. Convexity-based representation:
$$ p(x_i) = \sup_{\gamma_i \ge 0} \mathcal{N}(x_i \mid 0, \gamma_i)\, \varphi(\gamma_i) \;\propto\; \exp\!\left[-\tfrac{1}{2} g(x_i^2)\right] $$
Properties
♦ Essentially all sparse priors can be represented in both forms [Palmer et al., 2006].
♦ For non-negative functions $p(\gamma_i)$ and $\varphi(\gamma_i)$, the resulting $g(x_i^2)$ will be non-decreasing and concave in $x_i^2$ (favors sparsity).
Examples
♦ $g(x_i^2) = \log(x_i^2 + \epsilon)$  [Chartrand and Yin, 2008; Tipping, 2001]
♦ $g(x_i^2) = \log(|x_i| + \epsilon)$  [Candes et al., 2008]
♦ $g(x_i^2) = |x_i|^p$  [Leahy and Jeffs, 1991; Rao and Kreutz-Delgado, 1999]
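As a sanity check of representation 1, the sketch below numerically verifies the classical fact that a Laplace prior is a Gaussian scale mixture with an exponential hyperprior on the variance. This specific example is mine, not from the slide; numpy and scipy are assumed.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Classical example of representation 1: the Laplace prior
#   p(x_i) = (b/2) exp(-b |x_i|)
# is a Gaussian scale mixture with an exponential hyperprior on the variance,
#   gamma_i ~ Exponential(rate = b^2 / 2).
b = 1.5

def gsm_density(x):
    """Numerically integrate N(x | 0, gamma) p(gamma) over gamma >= 0."""
    integrand = lambda g: stats.norm.pdf(x, scale=np.sqrt(g)) * stats.expon.pdf(g, scale=2.0 / b**2)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for x in [0.2, 1.0, 3.0]:
    mixture = gsm_density(x)
    laplace = 0.5 * b * np.exp(-b * abs(x))
    print(f"x = {x}: mixture = {mixture:.6f}, Laplace = {laplace:.6f}")  # values should agree
```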
Type I (MAP Estimation)
Integrate out the latent variables $\gamma$, maximize over the coefficients $x$:
$$ x^{(I)} \triangleq \arg\max_x \int p(y \mid x) \prod_i \mathcal{N}(x_i \mid 0, \gamma_i)\, p(\gamma_i)\, d\gamma $$
$$ = \arg\min_x \; -\log p(y \mid x) - \log p(x) $$
$$ = \arg\min_x \; \underbrace{\|y - \Phi x\|_2^2}_{\text{data fit}} + \lambda \underbrace{\sum_{i=1}^n g(x_i^2)}_{\text{nondecreasing, concave penalty}} $$
Type I Cont.
♦ Convenient optimization ⇒ iterative reweighted L1, L2 (a minimal reweighted-L2 sketch follows below).
♦ Examples:
  ♦ Candes et al. (2008)
  ♦ Chartrand and Yin (2008)
  ♦ Figueiredo et al. (2007)
  ♦ Rao et al. (2003)
♦ Potential limitations:
  ♦ If $g(x_i^2)$ promotes sparsity too strongly, the problem is not convex.
  ♦ If it does not promote sparsity strongly enough, the global minimum may not be sparse enough.
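A minimal sketch of one Type I reweighted-L2 scheme, assuming numpy and the example penalty $g(x_i^2) = \log(x_i^2 + \epsilon)$; this is in the spirit of Chartrand and Yin (2008), not a transcription of any specific published implementation:

```python
import numpy as np

def type1_reweighted_l2(y, Phi, lam=1e-2, eps=1e-2, iters=50):
    """Type I iterative reweighted L2 for the penalty g(x_i^2) = log(x_i^2 + eps).
    Each step minimizes ||y - Phi x||^2 + lam * sum_i x_i^2 / (x_i_old^2 + eps),
    a standard majorize-minimize bound on the log penalty."""
    m, n = Phi.shape
    x = np.zeros(n)
    for _ in range(iters):
        w = x ** 2 + eps                            # large coefficients get weak shrinkage
        W = np.diag(w)
        # closed-form minimizer of the weighted ridge subproblem
        x = W @ Phi.T @ np.linalg.solve(lam * np.eye(m) + Phi @ W @ Phi.T, y)
    return x
```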
Type II (Empirical Bayes)
Integrate out the coefficients $x$, maximize over the latent variables $\gamma$:
$$ \gamma^{(II)} \triangleq \arg\max_\gamma \int p(y \mid x) \prod_i \mathcal{N}(x_i \mid 0, \gamma_i)\, p(\gamma_i)\, dx $$
$$ = \arg\min_\gamma \; \log\left|\lambda I + \Phi \Gamma \Phi^T\right| + y^T \left(\lambda I + \Phi \Gamma \Phi^T\right)^{-1} y + \sum_i f(\gamma_i), $$
where
$$ \Gamma \triangleq \mathrm{diag}[\gamma], \qquad f(\gamma_i) \triangleq -2 \log p(\gamma_i). $$
Given $\gamma^{(II)}$, can easily get a point estimate for $x$ using
$$ x^{(II)} \triangleq E\left[x \mid y, \gamma^{(II)}\right] = \Gamma^{(II)} \Phi^T \left(\lambda I + \Phi \Gamma^{(II)} \Phi^T\right)^{-1} y $$
(a numerical sketch of this cost and point estimate follows below).
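A minimal numerical sketch of evaluating this Type II cost and the corresponding point estimate, assuming numpy; the function name and the default $f(\gamma_i) = 0$ are my own choices:

```python
import numpy as np

def type2_cost_and_mean(gamma, y, Phi, lam, f=lambda g: 0.0):
    """Evaluate the Type II cost in gamma and the posterior mean of x.
    f(gamma_i) = -2 log p(gamma_i); the default f = 0 is the simplest case."""
    m = len(y)
    Gamma = np.diag(gamma)
    Sigma_y = lam * np.eye(m) + Phi @ Gamma @ Phi.T
    _, logdet = np.linalg.slogdet(Sigma_y)
    cost = logdet + y @ np.linalg.solve(Sigma_y, y) + sum(f(g) for g in gamma)
    x_mean = Gamma @ Phi.T @ np.linalg.solve(Sigma_y, y)
    return cost, x_mean
```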
Type II Cont.
♦ Convenient optimization ⇒ iterative reweighted L1, L2.
♦ Examples:
  ♦ Bishop and Tipping (2000)
  ♦ Girolami (2001)
  ♦ Neal (1996)
  ♦ Sato et al. (2004)
  ♦ Tipping (2001)
  ♦ Wipf and Nagarajan (2008)
♦ Potential limitations:
  ♦ Typically non-convex cost function.
  ♦ Unclear how to choose $f(\gamma_i)$ to get maximal sparsity.
Outstanding Issues
♦ What is the exact relationship between Type I and Type II?
♦ Duality:
  ♦ Type I can be expressed as an equivalent problem in γ-space.
  ♦ Type II can be expressed as an equivalent problem in x-space.
  ♦ So direct comparisons are possible by evaluating in an equivalent space, e.g., Type I is a special limiting case of Type II.
♦ Viewing Type II in x-space leads to theoretical insights.
Type II Cost Function in x-Space
Theorem 1
$$ x^{(II)} = E\left[x \mid y, \gamma^{(II)}\right] = \arg\min_x \; \|y - \Phi x\|_2^2 + \lambda\, g^{(II)}(x^2), $$
where $x^2 \triangleq [x_1^2, \ldots, x_n^2]^T$ and $g^{(II)}(x^2)$ is the concave conjugate of $-\left(\log\left|\lambda I + \Phi \Gamma \Phi^T\right| + \sum_i f(\gamma_i)\right)$, taken with respect to $\Gamma^{-1}$.
Properties of Type II Penalty
Assume the simplest case where $f(\gamma_i) = 0$ [Tipping, 2001].
1. Concave in $|x_i|$ for all $i$ ⇒ sparsity-inducing
2. Non-factorial, meaning $g^{(II)}(x^2) \neq \sum_i g_i^{(II)}(x_i^2)$ ⇒ better approximation to the L0 quasi-norm
Advantages of Non-Factorial Penalty
Theorem 2
♦ In the low-noise limit ($\lambda \to 0$), and assuming $\|x_0\|_0 < \mathrm{spark}[\Phi] - 1$, the Type II penalty satisfies
$$ x_0 = \arg\min_x \; g^{(II)}(x^2) \quad \text{s.t.} \quad y = \Phi x. $$
♦ No factorial penalty $g(x^2) = \sum_i g(x_i^2)$ satisfies this condition and has fewer minima than the Type II penalty $g^{(II)}(x^2)$ in the feasible region.
Example of Local Minima Smoothing
♦ Consider when $y = \Phi x$ has a 1-D feasible region, i.e., $n = m + 1$.
♦ Any feasible solution $x$ will satisfy
$$ x = x' + \alpha v, \qquad v \in \mathrm{Null}(\Phi), $$
where $\alpha$ is a scalar and $x'$ is a fixed solution.
♦ Can plot penalty functions vs. $\alpha$ to view the local-minima profile over the 1-D feasible region.
Local Minima Smoothing Example
[Figure: normalized penalty vs. $\alpha$ along the null-space direction, comparing $g^{(II)}(x^2)$ with the factorial penalty $\sum_i |x_i|^{0.01} \approx \|x\|_0$.]
Conditions for a Single Minimum
Theorem 3
♦ Assume $\|x_0\|_0 < \mathrm{spark}[\Phi] - 1$. If the magnitudes of the nonzero elements in $x_0$ are sufficiently scaled, then the Type II cost function ($\lambda = 0$) has a single minimum, which is located at $x_0$.
[Figure: example $x_0$ with scaled weights (easy) vs. $x_0$ with uniform weights (hard).]
♦ No possible factorial penalty satisfies this condition.
Reweighted L2 Implementation of Type II
$$ x^{(k+1)} \;\rightarrow\; \arg\min_x \; \|y - \Phi x\|_2^2 + \lambda \sum_i \frac{x_i^2}{w_i^{(k)}} \;=\; W^{(k)} \Phi^T \left(\lambda I + \Phi W^{(k)} \Phi^T\right)^{-1} y $$
$$ w_i^{(k+1)} \;\rightarrow\; \left(x_i^{(k+1)}\right)^2 + \underbrace{w_i^{(k)} \left[1 - w_i^{(k)} \phi_i^T \left(\lambda I + \Phi W^{(k)} \Phi^T\right)^{-1} \phi_i\right]}_{\epsilon_i} $$
Many other variants are possible using different majorization-minimization algorithms (a minimal sketch of this loop follows below).
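A minimal sketch of this reweighted-L2 loop for the $f(\gamma_i) = 0$ case, assuming numpy; variable names and the fixed iteration count are my own choices:

```python
import numpy as np

def type2_reweighted_l2(y, Phi, lam=1e-2, iters=100):
    """Type II reweighted L2 updates (f(gamma_i) = 0 case), following the slide:
    solve the weighted ridge problem, then refresh each w_i with x_i^2 + epsilon_i."""
    m, n = Phi.shape
    w = np.ones(n)                                   # w_i plays the role of gamma_i
    for _ in range(iters):
        W = np.diag(w)
        Sigma_y_inv = np.linalg.inv(lam * np.eye(m) + Phi @ W @ Phi.T)
        x = W @ Phi.T @ Sigma_y_inv @ y              # weighted L2 solution
        diag_terms = np.einsum('ij,jk,ki->i', Phi.T, Sigma_y_inv, Phi)  # phi_i^T S^-1 phi_i
        eps = w * (1.0 - w * diag_terms)             # posterior variance term epsilon_i
        w = x ** 2 + eps
    return x, w
```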
Connection with Type I
Equivalent to solving
$$ x^{(II)} = \arg\min_x \; \|y - \Phi x\|_2^2 + \lambda \sum_i \log\left(x_i^2 + \epsilon_i\right), $$
where each $\epsilon_i$ is adaptive.
Reweighted L1 Implementation of Type II
$$ x^{(k+1)} \;\rightarrow\; \arg\min_x \; \|y - \Phi x\|_2^2 + \lambda \sum_i \frac{|x_i|}{w_i^{(k)}} $$
$$ w_i^{(k+1)} \;\rightarrow\; \left[\phi_i^T \left(\lambda I + \Phi W^{(k)} \mathrm{diag}\!\left[|x^{(k+1)}|\right] \Phi^T\right)^{-1} \phi_i\right]^{-\frac{1}{2}} $$
(A minimal sketch of this loop follows below.)
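A minimal sketch of this reweighted-L1 loop, assuming numpy; the inner weighted-L1 subproblem is solved here with plain ISTA (soft-thresholding) purely for self-containment, and any weighted-L1 solver could be substituted:

```python
import numpy as np

def weighted_l1_solve(y, Phi, weights, lam, iters=500):
    """Solve min_x ||y - Phi x||^2 + lam * sum_i weights_i |x_i| by plain ISTA
    (proximal gradient with soft-thresholding); any weighted-L1 solver would do."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        z = x - step * 2.0 * Phi.T @ (Phi @ x - y)     # gradient step on the data fit
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return x

def type2_reweighted_l1(y, Phi, lam=1e-2, outer_iters=5):
    """Type II reweighted L1 updates following the slide (f(gamma_i) = 0 case)."""
    m, n = Phi.shape
    w = np.ones(n)
    for _ in range(outer_iters):
        x = weighted_l1_solve(y, Phi, 1.0 / w, lam)    # penalty: lam * sum_i |x_i| / w_i
        S = lam * np.eye(m) + Phi @ np.diag(w * np.abs(x)) @ Phi.T
        S_inv = np.linalg.inv(S)
        w = np.einsum('ij,jk,ki->i', Phi.T, S_inv, Phi) ** -0.5
    return x, w
```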
Properties of Reweighted L1 Minimization
♦ Globally convergent [Zangwill, 1969]: guaranteed to locally minimize the Type II objective.
♦ Sparsity will not increase ($\lambda = 0$):
$$ \|\hat{x}^{(k+1)}\|_0 \;\le\; \|\hat{x}^{(k)}\|_0 \;\le\; \|x_{BP}\|_0 $$
♦ Extensible: easy to extend to more general cases by adding constraints to the x-update step, e.g., non-negative sparse inverse problems, alternative likelihood models, etc.
♦ Fast, robust: even one or two iterations greatly improves upon the performance of the minimum-L1-norm solution.
Always Room for Improvement
Theorem 4
♦ Assume $\mathrm{spark}(\Phi) = m + 1$.
♦ Let $x^*$ be any coefficient vector drawn from support $S$ with cardinality $|S| < (m + 1)/2$ such that standard L1 minimization fails.
♦ Then there exists a set of signals $y = \Phi x$, with $x$ having support $S$, such that:
  1. Type II reweighted L1 always succeeds, and
  2. standard L1 minimization always fails.
Empirical Example
Algorithms compared (penalty / update type):
♦ $\sum_i \log(x_i^2 + \epsilon)$, L2 iterations [Chartrand and Yin, 2008]
♦ $\sum_i \log(|x_i| + \epsilon)$, L1 iterations [Candes et al., 2008]
♦ $\sum_i \log(|x_i| + \epsilon)$, L2 iterations
♦ $g^{(II)}(x^2)$, L1 iterations [Wipf and Nagarajan, 2008]
♦ $g^{(II)}(x^2)$, L2 iterations [Bishop and Tipping, 2000; Wipf, 2006]
Empirical Example
♦ Generate data via $Y = \Phi X_0$ (see the setup sketch below):
  ♦ $\Phi$ is 50-by-100 with Gaussian i.i.d. entries.
  ♦ $X_0$ is 100-by-5 with random nonzero rows, i.e., a simultaneous sparse approximation problem [Cotter et al., 2005; Tropp, 2006; Wipf and Rao, 2007].
♦ Run each algorithm and check whether $X_0$ is recovered.
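A minimal sketch of this experimental setup, assuming numpy; the random seed, the number of nonzero rows, and the success tolerance are my own choices:

```python
import numpy as np

# Setup from the slide: Y = Phi X0 with a 50-by-100 Gaussian dictionary and
# X0 having a small number of random nonzero rows (simultaneous sparse approximation).
rng = np.random.default_rng(0)
m, n, L, k = 50, 100, 5, 10                       # k = number of nonzero rows (varied in the talk)
Phi = rng.standard_normal((m, n))
X0 = np.zeros((n, L))
support = rng.choice(n, size=k, replace=False)    # shared support across all L columns
X0[support] = rng.standard_normal((k, L))
Y = Phi @ X0                                      # noiseless observations

def recovered(X_hat, tol=1e-6):
    """Declare success if the estimate matches X0 up to a small tolerance."""
    return np.max(np.abs(X_hat - X0)) < tol
```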
Results (m = 50, n = 100)
Non-Negative Sparse Recovery Example Using Iterative Reweighted L1 (Type II)
m = 50, n = 100, 30 nonzeros
Results with Different Nonzero Coefficient Distributions (m = 50, n = 100)
[Figure: error rate vs. $\|x_0\|_0$ for OMP, BP, and Type II, with approximately Jeffreys-distributed nonzero weights (left panel) and unit weights (right panel).]
Summary
♦ Type II methods motivate some non-traditional means of solving underdetermined sparse linear inverse problems.
♦ Non-factorial penalty functions have some very desirable properties.
♦ Reweighted L1 and L2 updates:
  ♦ reveal some connections between algorithms,
  ♦ often lead to performance improvements, and
  ♦ remove some of the stigma of non-convex cost functions.
Thank You
References:
Wipf and Nagarajan, "Iterative Reweighted L1 and L2 Methods for Finding Sparse Solutions," UCSF Technical Report, 2009.
Wipf and Nagarajan, "Latent Variable Models for Promoting Sparsity," UCSF Technical Report, 2009.