Sparse Signal Recovery

Bhaskar Rao
Department of Electrical and Computer Engineering, University of California, San Diego

David Wipf
Biomedical Imaging Laboratory, University of California, San Francisco

Special thanks to Yuzhe Jin

Outline: Part 1

- Motivation for Tutorial
- Sparse Signal Recovery Problem
- Applications
- Computational Algorithms
  - Greedy Search
  - ℓ1 norm minimization
- Performance Guarantees

Outline: Part 2

- Motivation: Limitations of popular inverse methods
- Maximum a posteriori (MAP) estimation
- Bayesian Inference
- Analysis of Bayesian inference and connections with MAP
- Applications to neuroimaging


Early Works

- R. R. Hocking and R. N. Leslie, "Selection of the Best Subset in Regression Analysis," Technometrics, 1967.
- S. Singhal and B. S. Atal, "Amplitude Optimization and Pitch Estimation in Multipulse Coders," IEEE Trans. Acoust., Speech, Signal Processing, 1989.
- S. D. Cabrera and T. W. Parks, "Extrapolation and spectral estimation with iterative weighted norm modification," IEEE Trans. Acoust., Speech, Signal Processing, April 1991.
- Many more works.
- Our first work:
  - I. F. Gorodnitsky, B. D. Rao and J. George, "Source Localization in Magnetoencephalography using an Iterative Weighted Minimum Norm Algorithm," IEEE Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pp. 167-171, Oct. 1992.

Early Session on Sparsity

Organized with Prof. Bresler: a special session at the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing.

SPEC-DSP: Signal Processing with Sparseness Constraint
- Signal Processing with the Sparseness Constraint - B. Rao (University of California, San Diego, USA)
- Application of Basis Pursuit in Spectrum Estimation - S. Chen (IBM, USA); D. Donoho (Stanford University, USA)
- Parsimony and Wavelet Method for Denoising - H. Krim (MIT, USA); J. Pesquet (University Paris Sud, France); I. Schick (GTE Internetworking and Harvard Univ., USA)
- Parsimonious Side Propagation - P. Bradley, O. Mangasarian (University of Wisconsin-Madison, USA)
- Fast Optimal and Suboptimal Algorithms for Sparse Solutions to Linear Inverse Problems - G. Harikumar (Tellabs Research, USA); C. Couvreur, Y. Bresler (University of Illinois, Urbana-Champaign, USA)
- Measures and Algorithms for Best Basis Selection - K. Kreutz-Delgado, B. Rao (University of California, San Diego, USA)
- Sparse Inverse Solution Methods for Signal and Image Processing Applications - B. Jeffs (Brigham Young University, USA)
- Image Denoising Using Multiple Compaction Domains - P. Ishwar, K. Ratakonda, P. Moulin, N. Ahuja (University of Illinois, Urbana-Champaign, USA)

Motivation for Tutorial

- Sparse Signal Recovery is an interesting area with many potential applications. Unification of the theory will provide synergy.
- Methods developed for solving the Sparse Signal Recovery problem can be a valuable tool for signal processing practitioners.
- Many interesting developments in the recent past make the subject timely.


Problem Description

  y = Φx + ε

- y is the n × 1 measurement vector.
- Φ is the n × m dictionary matrix, with m >> n.
- x is the m × 1 desired vector, which is sparse with k nonzero entries.
- ε is additive noise, modeled as white Gaussian.
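As a concrete reference for the algorithms discussed later, here is a minimal NumPy sketch that generates a random instance of this model; the dimensions n, m, k and the noise level are illustrative choices, not values from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 20, 50, 3                          # measurements, dictionary size, sparsity (example values)
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm dictionary columns

x_true = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
x_true[support] = rng.standard_normal(k)     # k nonzero entries

noise = 0.01 * rng.standard_normal(n)        # additive white Gaussian noise
y = Phi @ x_true + noise                     # y = Phi x + eps
```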

Problem Statement

- Noise-free case: Given a target signal y and a dictionary Φ, find the weights x that solve

    min_x Σ_{i=1}^{m} I(x_i ≠ 0)  subject to  y = Φx

  where I(·) is the indicator function.

- Noisy case: Given a target signal y and a dictionary Φ, find the weights x that solve

    min_x Σ_{i=1}^{m} I(x_i ≠ 0)  subject to  ‖y − Φx‖_2 ≤ β

Complexity

- Searching over all possible subsets would mean a search over a total of C(m, k) subsets: combinatorial complexity. With m = 30, n = 20, and k = 10 there are about 3 × 10^7 subsets (very complex).
- A branch-and-bound algorithm can be used to find the optimal solution. The space of subsets searched is pruned, but the search may still be very complex.
- The indicator function is not continuous and so is not amenable to standard optimization tools.
- Challenge: find low-complexity methods with acceptable performance.
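To make the combinatorial nature concrete, the following sketch performs the exhaustive subset search for the noise-free case; it is only feasible for very small m and k, and the function name and tolerance are illustrative.

```python
import numpy as np
from itertools import combinations

def l0_by_exhaustive_search(Phi, y, k, tol=1e-9):
    """Check every size-k support: C(m, k) least-squares fits, feasible only for tiny problems."""
    n, m = Phi.shape
    for support in combinations(range(m), k):
        cols = Phi[:, support]
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        if np.linalg.norm(y - cols @ coef) < tol:    # exact (noise-free) fit found
            x = np.zeros(m)
            x[list(support)] = coef
            return x
    return None
```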


Applications

- Signal Representation (Mallat, Coifman, Wickerhauser, Donoho, ...)
- EEG/MEG (Leahy, Gorodnitsky, Ioannides, ...)
- Functional Approximation and Neural Networks (Chen, Natarajan, Cun, Hassibi, ...)
- Bandlimited extrapolation and spectral estimation (Papoulis, Lee, Cabrera, Parks, ...)
- Speech Coding (Ozawa, Ono, Kroon, Atal, ...)
- Sparse channel equalization (Fevrier, Greenstein, Proakis, ...)
- Compressive Sampling (Donoho, Candes, Tao, ...)
- Magnetic Resonance Imaging (Lustig, ...)

DFT Example

Measurement:

  y[l] = 2(cos ω0 l + cos ω1 l),  l = 0, 1, 2, ..., n − 1,  n = 64,
  ω0 = (2π/64)(33/2),  ω1 = (2π/64)(34/2).

Dictionary elements:

  φ_l^(m) = [1, e^{−jω_l}, e^{−j2ω_l}, ..., e^{−j(n−1)ω_l}]^T,  ω_l = 2πl/m

Consider m = 64, 128, 256 and 512.

Questions:
- What is the result of a zero-padded DFT?
- When viewed as the problem of solving a linear system of equations with this dictionary, what solution does the DFT give us?
- Are there more desirable solutions for this problem?

DFT Example (continued)

Note that the frequency components in the data correspond to columns {33, 34, 94, 95} of Φ^(128), columns {66, 68, 188, 190} of Φ^(256), and columns {132, 136, 376, 380} of Φ^(512).

Consider the linear system of equations y = Φ^(m) x.

- The frequency components in the data are in the dictionaries Φ^(m) for m = 128, 256, 512.
- What solution among all possible solutions does the DFT compute?

DFT Example (continued)

[Figure: zero-padded DFT magnitude spectra for m = 64, 128, 256, and 512; the adjacent frequency components are not resolved.]
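The DFT experiment above can be reproduced with a few lines of NumPy (an illustrative sketch, not the original tutorial code): zero-padding the 64-sample signal to length m evaluates the dictionary frequencies 2πl/m, and inspecting the largest-magnitude bins shows the unresolved peaks.

```python
import numpy as np

n = 64
l = np.arange(n)
w0 = 2 * np.pi / 64 * 33 / 2
w1 = 2 * np.pi / 64 * 34 / 2
y = 2 * (np.cos(w0 * l) + np.cos(w1 * l))

for m in (64, 128, 256, 512):
    Y = np.fft.fft(y, n=m)                    # zero-padded DFT of length m
    peaks = np.argsort(np.abs(Y))[-4:]        # indices of the four largest bins
    print(m, sorted(peaks.tolist()))
```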

Sparse Channel Estimation

  r(i) = Σ_{j=0}^{m−1} s(i − j) c(j) + ε(i),  i = 0, 1, ..., n − 1

- r(i): received sequence
- s(i): training sequence
- c(j): channel impulse response (sparse)
- ε(i): noise

Example: Sparse Channel Estimation

Formulated as a sparse signal recovery problem:

  [ r(0)   ]   [ s(0)    s(−1)   ...  s(−m+1) ] [ c(0)   ]   [ ε(0)   ]
  [ r(1)   ] = [ s(1)    s(0)    ...  s(−m+2) ] [ c(1)   ] + [ ε(1)   ]
  [   ⋮    ]   [   ⋮       ⋮      ⋱     ⋮     ] [   ⋮    ]   [   ⋮    ]
  [ r(n−1) ]   [ s(n−1)  s(n−2)  ...  s(n−m)  ] [ c(m−1) ]   [ ε(n−1) ]

Any relevant sparse recovery algorithm can then be used to estimate the sparse channel coefficients.
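A small sketch of how the training (convolution) matrix and received sequence might be formed with NumPy/SciPy; the sequence length, channel length, sparsity, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)

n, m, k = 100, 40, 4                        # observations, channel length, nonzero taps (example values)
s = rng.choice([-1.0, 1.0], size=n)         # training sequence (e.g., BPSK symbols)

# Convolution matrix with (i, j) entry s(i - j); s is taken as zero for negative indices
first_col = s
first_row = np.zeros(m); first_row[0] = s[0]
S = toeplitz(first_col, first_row)

c = np.zeros(m)
c[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)   # sparse channel
r = S @ c + 0.01 * rng.standard_normal(n)                          # received sequence
# r and S now define a sparse recovery problem in the channel taps c.
```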

Compressive Sampling

- D. Donoho, "Compressed Sensing," IEEE Trans. on Information Theory, 2006.
- E. Candes and T. Tao, "Near Optimal Signal Recovery from Random Projections: Universal Encoding Strategies," IEEE Trans. on Information Theory, 2006.

Compressive Sampling

Transform coding: b = Ψx, where x is sparse.

What is the problem here?
- Sampling at the Nyquist rate.
- Keeping only a small number of nonzero coefficients.
- Can we directly acquire the signal below the Nyquist rate?

Compressive Sampling

- Transform coding: b = Ψx, with x sparse.
- Compressive sampling: y = Ab = AΨx = Φx, where A is the sampling matrix and Φ = AΨ.

Compressive Sampling

Computation:
1. Solve for x such that Φx = y.
2. Reconstruction: b = Ψx.

Issues:
- Need to recover the sparse signal x under the constraint Φx = y.
- Need to design the sampling matrix A.


Potential Approaches

Combinatorial complexity, so alternate strategies are needed:

- Greedy search techniques: Matching Pursuit, Orthogonal Matching Pursuit.
- Minimizing diversity measures: the indicator function is not continuous, so define surrogate cost functions that are more tractable and whose minimization leads to sparse solutions, e.g., ℓ1 minimization.
- Bayesian methods: make appropriate statistical assumptions on the solution and apply estimation techniques to identify the desired sparse solution.

Greedy Search Method: Matching Pursuit (MP)

Model: y = Φx + ε.

- Select the column most aligned with the current residual:
  - r^(0) = y
  - S^(i): set of indices selected so far
  - l = argmax_{1≤j≤m} |φ_j^T r^(i−1)|
- Remove its contribution from the residual:
  - Update S^(i): if l ∉ S^(i−1), set S^(i) = S^(i−1) ∪ {l}; otherwise keep S^(i) = S^(i−1).
  - Update r^(i): r^(i) = P⊥_{φ_l} r^(i−1) = r^(i−1) − φ_l (φ_l^T r^(i−1))
- Practical stopping criteria: a fixed number of iterations, or ‖r^(i)‖_2 smaller than a threshold.
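A minimal NumPy sketch of Matching Pursuit as described above, assuming unit-norm dictionary columns; the iteration cap and tolerance are illustrative.

```python
import numpy as np

def matching_pursuit(Phi, y, n_iter=20, tol=1e-6):
    """Basic Matching Pursuit; assumes the columns of Phi have unit norm."""
    r = y.copy()
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        corr = Phi.T @ r                      # alignment with the current residual
        l = np.argmax(np.abs(corr))           # most aligned column
        x[l] += corr[l]                       # accumulate its coefficient
        r = r - corr[l] * Phi[:, l]           # remove its contribution from the residual
        if np.linalg.norm(r) < tol:           # practical stopping rule
            break
    return x
```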

Greedy Search Method: Orthogonal Matching Pursuit (OMP)

Model: y = Φx + ε.

- Select the column most aligned with the current residual:
  - r^(0) = y
  - S^(i): set of indices selected so far
  - l = argmax_{1≤j≤m} |φ_j^T r^(i−1)|
- Remove its contribution from the residual:
  - Update S^(i): S^(i) = S^(i−1) ∪ {l}
  - Update r^(i): r^(i) = P⊥_{[φ_{l1}, ..., φ_{li}]} r^(i−1) = r^(i−1) − P_{[φ_{l1}, ..., φ_{li}]} r^(i−1)
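A corresponding sketch of OMP: the difference from MP is that, after each selection, the coefficients on the current support are re-estimated by least squares, so the residual becomes orthogonal to all selected columns.

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: select k columns, re-fitting by least squares each step."""
    r = y.copy()
    support = []
    for _ in range(k):
        l = np.argmax(np.abs(Phi.T @ r))              # most aligned column
        if l not in support:
            support.append(l)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        r = y - Phi[:, support] @ coef                # residual orthogonal to selected columns
    x = np.zeros(Phi.shape[1])
    x[support] = coef
    return x
```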

Greedy Search Method: Order Recursive OMP

Model: y = Φx + ε.

- Select the column whose normalized projection is most aligned with the current residual:
  - r^(0) = y
  - S^(i): set of indices selected so far
  - l = argmax_{1≤j≤m} |φ_j^{(i−1) T} r^(i−1)|² / ‖φ_j^(i−1)‖_2²
- Remove its contribution from the residual:
  - Update S^(i): S^(i) = S^(i−1) ∪ {l}
  - Update r^(i): r^(i) = P⊥_{[φ_{l1}, ..., φ_{li}]} r^(i−1) = r^(i−1) − P_{[φ_{l1}, ..., φ_{li}]} r^(i−1)
  - Update φ_j^(i) = P⊥_{[φ_{l1}, ..., φ_{li}]} φ_j; this can be computed recursively.

Deficiency of Matching Pursuit Type Algorithms

If the algorithm picks a wrong index at some iteration, there is no way to correct this error in subsequent iterations.

Some recent algorithms that address this:
- Stagewise Orthogonal Matching Pursuit (Donoho, Tsaig, ...)
- CoSaMP (Needell, Tropp)

Inverse Techniques

- For the system of equations Φx = y, the solution set is characterized by {x_s : x_s = Φ⁺y + v, v ∈ N(Φ)}, where N(Φ) denotes the null space of Φ and Φ⁺ = Φ^T(ΦΦ^T)⁻¹.
- Minimum norm solution: the minimum ℓ2 norm solution x_mn = Φ⁺y is a popular choice.
- Noisy case: the regularized ℓ2 norm solution is often employed, given by x_reg = Φ^T(ΦΦ^T + λI)⁻¹y.

Minimum 2-Norm Solution

Problem: the minimum ℓ2 norm solution is not sparse.

Example:

  Φ = [ 1 0 1 ]      y = [ 1 ]
      [ 0 1 1 ],         [ 0 ]

  x_mn = [ 2/3, −1/3, 1/3 ]^T   vs.   x = [ 1, 0, 0 ]^T

The DFT also computes the minimum 2-norm solution.

Diversity Measures

Recall:

  min_x Σ_{i=1}^{m} I(x_i ≠ 0)  subject to  y = Φx

- Diversity measures are functionals whose minimization leads to sparse solutions.
- Many examples are found in the fields of economics, social science and information theory.
- These functionals are usually concave, which leads to difficult optimization problems.

Examples of Diversity Measures

- ℓ^(p≤1) diversity measure:

    E^(p)(x) = Σ_{i=1}^{m} |x_i|^p,  p ≤ 1

- As p → 0,

    lim_{p→0} E^(p)(x) = lim_{p→0} Σ_{i=1}^{m} |x_i|^p = Σ_{i=1}^{m} I(x_i ≠ 0)

- ℓ1 norm, the convex relaxation of ℓ0:

    E^(1)(x) = Σ_{i=1}^{m} |x_i|

ℓ1 Diversity Measure

- Noiseless case:

    min_x Σ_{i=1}^{m} |x_i|  subject to  Φx = y

- Noisy case:
  - ℓ1 regularization [Candes, Romberg, Tao]:

      min_x Σ_{i=1}^{m} |x_i|  subject to  ‖y − Φx‖_2 ≤ β

  - Lasso [Tibshirani], Basis Pursuit De-noising [Chen, Donoho, Saunders]:

      min_x ‖y − Φx‖_2² + λ Σ_{i=1}^{m} |x_i|
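A minimal sketch of one way to solve the Lasso / Basis Pursuit De-noising form above, using iterative soft-thresholding (ISTA); the step size follows the Lipschitz constant of the quadratic term, and λ and the iteration count are illustrative choices.

```python
import numpy as np

def ista_lasso(Phi, y, lam, n_iter=500):
    """Minimize ||y - Phi x||_2^2 + lam * ||x||_1 by iterative soft-thresholding (ISTA)."""
    L = 2 * np.linalg.norm(Phi, 2) ** 2                  # Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ x - y)                 # gradient of the quadratic term
        z = x - grad / L                                 # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold (prox of the l1 term)
    return x
```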

ℓ1 Norm Minimization and MAP Estimation

MAP estimation:

  x̂ = argmax_x p(x | y) = argmax_x [log p(y | x) + log p(x)]

If we assume
- ε_i is zero-mean, i.i.d. Gaussian noise, and
- p(x) = Π_i p(x_i), with p(x_i) ∝ exp(−λ|x_i|),

then

  x̂ = argmin_x [ ‖y − Φx‖_2² + λ Σ_i |x_i| ]

Attractiveness of ℓ1 Methods

- Convex optimization, with an associated rich class of optimization algorithms:
  - Interior-point methods
  - Coordinate descent methods
  - ...
- Question: what is the ability to find the sparse solution?

Why Does the Diversity Measure Encourage Sparse Solutions?

  min_{[x_1, x_2]} |x_1|^p + |x_2|^p  subject to  φ_1 x_1 + φ_2 x_2 = y

[Figure: an equal-norm contour of the diversity measure for 0 < p ≤ 1 touching the constraint line; the contour is concentrated along the axes, so the minimizer lands on an axis, i.e., is sparse.]

ℓ1 Norm and Linear Programming

Linear program (LP):

  min_x c^T x  subject to  Φx = y,  x ≥ 0

(The ℓ1 minimization problem can be cast in this form by splitting x into nonnegative parts.)

Key result in LP (Luenberger):
a) If there is a feasible solution, there is a basic feasible solution.*
b) If there is an optimal feasible solution, there is an optimal basic feasible solution.

* If Φ is n × m, then a basic feasible solution is a solution with at most n nonzero entries.

Example with ℓ1 Diversity Measure

  Φ = [ 1 0 1 ]      y = [ 1 ]
      [ 0 1 1 ],         [ 0 ]

- Noiseless case:
  - x_BP = [1, 0, 0]^T (to machine precision)
- Noisy case:
  - Assume measurement noise ε = [0.01, −0.01]^T.
  - ℓ1 regularization result: x_l1R = [0.986, 0, 8.77 × 10⁻⁶]^T
  - Lasso result (λ = 0.05): x_lasso = [0.975, 0, 2.50 × 10⁻⁵]^T

Example with ℓ1 Diversity Measure

Continue with the DFT example:

  y[l] = 2(cos ω0 l + cos ω1 l),  l = 0, 1, ..., n − 1,  n = 64,
  ω0 = (2π/64)(33/2),  ω1 = (2π/64)(34/2).

- The 64-, 128-, 256- and 512-point DFTs cannot separate the adjacent frequency components.
- Using ℓ1 diversity measure minimization (m = 256), the adjacent components are resolved.

[Figure: magnitude of the ℓ1 solution for m = 256.]


Important Questions

- When is the ℓ0 solution unique?
- When is the ℓ1 solution equivalent to that of ℓ0?
  - Noiseless case
  - Noisy measurements

Uniqueness

- Definition of spark: the smallest number of columns of Φ that are linearly dependent. (Spark(Φ) ≤ n + 1.)
- Uniqueness of the sparsest solution: if

    Σ_{i=1}^{m} I(x_i ≠ 0) < Spark(Φ) / 2,

  then x is the unique solution to

    argmin_x Σ_{i=1}^{m} I(x_i ≠ 0)  subject to  Φx = y

Mutual Coherence

For a given matrix Φ = [φ_1, φ_2, ..., φ_m], the mutual coherence µ(Φ) is defined as

  µ(Φ) = max_{1≤i,j≤m, i≠j} |φ_i^T φ_j| / (‖φ_i‖_2 ‖φ_j‖_2)
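A short NumPy sketch that computes the mutual coherence of a dictionary; the example size mirrors the Gaussian-matrix histogram shown a few slides below.

```python
import numpy as np

def mutual_coherence(Phi):
    """Largest absolute normalized inner product between distinct columns of Phi."""
    G = Phi / np.linalg.norm(Phi, axis=0)    # normalize columns
    gram = np.abs(G.T @ G)                   # |phi_i^T phi_j| / (||phi_i|| ||phi_j||)
    np.fill_diagonal(gram, 0.0)              # exclude i == j
    return gram.max()

# Example: empirical coherence of a 100 x 200 Gaussian dictionary (cf. the histogram slide below)
rng = np.random.default_rng(0)
print(mutual_coherence(rng.standard_normal((100, 200))))
```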

Performance of ℓ1 Diversity Minimization Algorithms

Noiseless case [Donoho & Elad 03]: If

  Σ_{i=1}^{m} I(x_i ≠ 0) < (1/2)(1 + 1/µ(Φ)),

then x is the unique solution to

  argmin_x Σ_{i=1}^{m} |x_i|  subject to  Φx = y

Performance of ℓ1 Diversity Minimization Algorithms

Noisy case [Donoho et al. 06]: Assume y = Φx + ε with ‖ε‖_2 ≤ β and

  Σ_{i=1}^{m} I(x_i ≠ 0) < (1/4)(1 + 1/µ(Φ)).

Then the solution

  d* = argmin_d Σ_{i=1}^{m} |d_i|  subject to  ‖y − Φd‖_2 ≤ β

satisfies

  ‖d* − x‖_2² ≤ 4β² / (1 − µ(Φ)(4‖x‖_0 − 1))

Empirical Distribution of Mutual Coherence (Gaussian Matrices)

- Gaussian matrices: entries N(0, 1), with column norms normalized to 1.
- 2000 randomly generated matrices Φ.

[Figure: histograms of µ(Φ). For Φ of size 100 × 200: mean = 0.4025, std = 0.0249. For Φ of size 1000 × 2000: mean = 0.161, std = 0.0073.]

Performance of Orthogonal Matching Pursuit

Noiseless case [Tropp 04]: If

  Σ_{i=1}^{m} I(x_i ≠ 0) < (1/2)(1 + 1/µ(Φ)),

then OMP guarantees recovery of x after Σ_{i=1}^{m} I(x_i ≠ 0) iterations.

Performance of Orthogonal Matching Pursuit

Noisy case [Donoho et al.]: Assume y = Φx + ε with ‖ε‖_2 ≤ β, and let x_min = min_{1≤i≤m} |x_i| over the nonzero entries. If

  Σ_{i=1}^{m} I(x_i ≠ 0) < (1/2)(1 + 1/µ(Φ)) − β / (µ(Φ) · x_min),

and OMP is stopped when the residual error is ≤ β, then the OMP solution satisfies

  ‖x_OMP − x‖_2² ≤ β² / (1 − µ(Φ)(‖x‖_0 − 1))

Restricted Isometry Constant

Definition [Candes et al.]: For a matrix Φ, the restricted isometry constant δ_k is the smallest constant such that

  (1 − δ_k)‖x‖_2² ≤ ‖Φx‖_2² ≤ (1 + δ_k)‖x‖_2²

for every k-sparse signal x.

Performance of ℓ1 Diversity Measure Minimization Algorithms

[Candes 08]: Assume y = Φx + ε, x is k-sparse, ‖ε‖_2 ≤ β, and δ_2k < √2 − 1. Then the solution

  d* = argmin_d Σ_{i=1}^{m} |d_i|  subject to  ‖y − Φd‖_2 ≤ β

satisfies ‖d* − x‖_2 ≤ C · β, where C depends only on δ_2k.

Performance of ℓ1 Diversity Measure Minimization Algorithms

[Candes 08]: Assume y = Φx + ε, ‖ε‖_2 ≤ β, and δ_2k < √2 − 1. Then the solution

  d* = argmin_d Σ_{i=1}^{m} |d_i|  subject to  ‖y − Φd‖_2 ≤ β

satisfies

  ‖d* − x‖_2 ≤ C1 · ‖x − x_k‖_1 / √k + C2 · β,

where C1, C2 depend only on δ_2k, and x_k is the vector x with all but the k largest entries set to zero.

Matrices with Good RIC

- It turns out that random matrices can satisfy the requirement (say, δ_2k < √2 − 1) with high probability.
- For a matrix Φ of size n × m:
  - Generate each element φ_ij ~ N(0, 1/n), i.i.d.
  - [Candes et al.] If n = O(k log(m/k)), then P(δ_2k < √2 − 1) → 1.
- Observations:
  - A large Gaussian random matrix will have a good RIC with high probability.
  - Similar results can be obtained using other probability ensembles: Bernoulli, random Fourier, ...
- For ℓ1-based procedures, the number of measurements required is on the order of n ~ k log m.

More General Question

- What are the limits of recovery in the presence of noise, with no constraint on the recovery algorithm?
- Information-theoretic approaches are useful in this case (Wainwright, Fletcher, Akcakaya, ...).
- One can connect the problem of support recovery to channel coding over the Gaussian multiple access channel; capacity regions become useful (Jin).

Performance for Support Recovery

- Noisy measurements: y = Φx + ε, with ε_i ~ N(0, σ_ε²), i.i.d.
- Random matrix Φ, with φ_ij ~ N(0, σ_a²), i.i.d.
- Performance metric: exact support recovery.
- Consider a sequence of problems with increasing sizes. For k = 2 (two-user MAC capacity):

    c(x) = min_{T ⊆ {1,2}, T ≠ ∅} (1/(2|T|)) log(1 + (σ_a²/σ_ε²) Σ_{i∈T} x_{nonzero,i}²)

- Sufficient condition: if

    limsup_{m→∞} (log m)/n_m < c(x),

  then there exists a sequence of support recovery methods (for the different problems, respectively) such that

    lim_{m→∞} P{supp(x̂) ≠ supp(x)} = 0.

- Necessary condition: if there exists a sequence of support recovery methods such that

    lim_{m→∞} P{supp(x̂) ≠ supp(x)} = 0,

  then

    limsup_{m→∞} (log m)/n_m ≤ c(x).

- The approach can deal with general scaling among (m, n, k).

Network Information Theory Perspective

Connection to channel coding (k = 1): when x has a single nonzero entry x_i, the measurement model reduces to

  y = x_i φ_i + ε,

i.e., the selected column φ_i plays the role of a transmitted codeword.

Additive White Gaussian Noise (AWGN) Channel

- a: channel input
- h: channel gain
- e: additive Gaussian noise
- b: channel output, b = ha + e

Recall y = x_i φ_i + ε: the sparse measurement with k = 1 has the same form, with the column φ_i acting as the input, x_i as the channel gain, and ε as the noise.


Connection to Channel Coding (k ≥ 1): Multiple Access Channel (MAC)

- Sparse recovery with k nonzero entries x_{s1}, ..., x_{sk}:

    y = x_{s1} φ_{s1} + ... + x_{sk} φ_{sk} + ε

- K-user multiple access channel:

    b = h_1 a_1 + ... + h_K a_K + e

Connection Between the Two Problems

  Sparse recovery (support recovery)        Channel coding
  --------------------------------------    ------------------------------
  Φ: dictionary / measurement matrix        codebook
  φ_i: a column                             a codeword
  m different positions in x                message set {1, 2, ..., m}
  s1, ..., sk: indices of the nonzeros      the messages selected
  x_{si}: source activity                   channel gain h_i

Differences Between the Two Problems

- Codebook: in channel coding, each sender works with a codebook designed only for that sender; in support recovery, all "senders" share the same codebook (the dictionary), and different senders select different codewords (common codebook).
- Channel knowledge: capacity results are available when the channel gain h_i is known at the receiver; in support recovery we don't know the nonzero signal activities x_i (unknown channel).

Proposed approach [Jin, Kim and Rao]:
- Leverage the MAC capacity result to shed light on the performance limit of exact support recovery.
- Conventional methods + customized methods: distance decoding, nonzero signal value estimation, Fano's inequality.

Part 2

Outline

1. Motivation: Limitations of popular inverse methods
2. Maximum a posteriori (MAP) estimation
3. Bayesian Inference
4. Analysis of Bayesian inference and connections with MAP
5. Applications to neuroimaging

Section I: Motivation

Limitation I

♦ Most sparse recovery results, using either greedy methods (e.g., OMP) or convex ℓ1 minimization (e.g., BP), place heavy restrictions on the form of the dictionary Φ.
♦ While in some situations we can satisfy these restrictions (e.g., compressive sensing), for many applications we cannot (e.g., source localization, feature selection).
♦ When the assumptions on the dictionary are violated, performance may be very suboptimal.

Limitation II

♦ The distribution of nonzero magnitudes in the maximally sparse solution x0 can greatly affect the difficulty of canonical sparse recovery problems.
♦ This effect is strongly algorithm-dependent.

Examples: ℓ1-norm Solutions and OMP

♦ With ℓ1-norm solutions, performance is independent of the magnitudes of the nonzero coefficients [Malioutov et al., 2004].
  - Problem: performance does not improve when the situation is favorable.
♦ OMP is highly sensitive to these magnitudes.
  - Problem: performance degrades heavily with unit magnitudes.

In General

♦ If the magnitudes of the nonzero elements in x0 are highly scaled, then the canonical sparse recovery problem should be easier.

[Figure: two example coefficient vectors x0 - scaled coefficients (easy) vs. uniform coefficients (hard).]

♦ The (approximate) Jeffreys distribution produces sufficiently scaled coefficients such that the best solution can always be easily computed.

Jeffreys Distribution

Density: p(x) ∝ 1 / |x|

[Figure: the density p(x) ∝ 1/|x|; the intervals [1, 2], [2, 4], and [4, 8], marked A, B, C, all have equal area, illustrating how heavily scaled the samples are.]

Empirical Example

♦ For each test case:
  1. Generate a random dictionary Φ with 50 rows and 100 columns.
  2. Generate a sparse coefficient vector x0.
  3. Compute the signal via y = Φx0 (noiseless).
  4. Run BP and OMP, as well as a competing Bayesian method called SBL (more on this later), to try to correctly estimate x0.
  5. Average over 1000 trials to compute the empirical probability of failure.
♦ Repeat with different sparsity values, i.e., ‖x0‖_0 ranging from 10 to 30.

Sample Results (n = 50, m = 100)

[Figure: empirical error rate vs. ‖x0‖_0 for approximate Jeffreys coefficients (left) and unit coefficients (right).]

Limitation III

♦ It is not immediately clear how to use these methods to assess uncertainty in coefficient estimates (e.g., covariances).
  - Such estimates can be useful for designing an optimal (non-random) dictionary Φ.
  - For example, it is well known in the machine learning and image processing communities that under-sampled random projections of natural scenes are very suboptimal.

Section II: MAP Estimation Using the Sparse Linear Model

Overview

♦ MAP estimation can be viewed as sparse penalized regression with general sparsity-promoting penalties (the ℓ1 penalty is a special case).
♦ These penalties can be chosen to overcome some previous limitations when minimized with simple, efficient algorithms.
♦ Theory is somewhat harder to come by, but practical results are promising.
♦ In some cases we can guarantee improvement over the ℓ1 solution on sparse recovery problems.

Sparse Linear Model

♦ Linear generative model:

    y = Φx + ε

  where y is the observed n-dimensional data vector, Φ is a matrix of m basis vectors, x contains the unknown coefficients, and ε is Gaussian noise with variance λ.

♦ Objective: estimate the unknown x given the following assumptions:
  1. Φ is overcomplete, meaning the number of columns m is greater than the signal dimension n.
  2. x is maximally sparse, i.e., many elements equal zero.

Sparse Inverse Problem

♦ Noiseless case (ε = 0):

    x0 = argmin_x ‖x‖_0  s.t.  y = Φx

  where the ℓ0 norm ‖x‖_0 is the number of nonzeros in x.

♦ Noisy case (ε > 0):

    x0(λ) = argmin_x ‖y − Φx‖_2² + λ‖x‖_0
          = argmax_x exp(−(1/2λ)‖y − Φx‖_2²) · exp(−(1/2)‖x‖_0)

  i.e., a likelihood term times a prior term.

Difficulties

1. Combinatorial number of local minima.
2. The objective is discontinuous.

A variety of existing approximate methods can be viewed as MAP estimation using a flexible class of sparse priors.

Sparse Priors: 2-D Example [Tipping, 2001]

[Figure: 2-D contours of a Gaussian distribution vs. a sparse distribution.]

Basic MAP Estimation

  x̂ = argmax_x p(x | y)
    = argmin_x [−log p(y | x) − log p(x)]
    = argmin_x ‖y − Φx‖_2² + λ Σ_i g(x_i)

i.e., a data-fit term plus a penalty, with g(x_i) = h(x_i²), where h is a nondecreasing, concave function.

Note: the Bayesian interpretation will be useful later.

Example Sparsity Penalties

♦ With g(x_i) = I[x_i ≠ 0] we have the canonical sparsity penalty and its associated problems.
♦ Practical selections:
  - g(x_i) = log(x_i² + ε)  [Chartrand and Yin, 2008]
  - g(x_i) = log(|x_i| + ε)  [Candes et al., 2008]
  - g(x_i) = |x_i|^p  [Rao et al., 2003]
♦ Different choices favor different levels of sparsity.

Example 2-D Contours

[Figure: contours of g(x_i) = |x_i|^p in the (x1, x2) plane for p = 0.2, p = 0.5, and p = 1.]

Which Penalty Should We Choose?

♦ Two competing issues:
  1. If the prior is too sparse (e.g., p ≈ 0), then we may get stuck in a local minimum: convergence error.
  2. If the prior is not sparse enough (e.g., p ≈ 1), then the global minimum may be found, but it might not equal x0: structural error.
♦ The answer is ultimately application- and algorithm-dependent.

Convergence Errors vs. Structural Errors

[Figure: −log p(x|y) over the feasible region. Left, convergence error (p ≈ 0): the iterate x' is stuck in a local minimum away from x0. Right, structural error (p ≈ 1): the global minimum x' differs from the maximally sparse solution x0.]

x' = solution we have converged to; x0 = maximally sparse solution.

Desirable Properties of Algorithms

♦ Can be implemented via relatively simple primitives already in use, e.g., ℓ1 solvers, etc.
♦ Improved performance over OMP, BP, etc.
♦ Naturally extends to more general problems:
  1. Constrained sparse estimation (e.g., finding non-negative sparse solutions)
  2. Group sparsity problems

Extension: Group Sparsity

♦ Example: the simultaneous sparse estimation problem. The goal is to recover a matrix X with maximal row sparsity [Cotter et al., 2005; Tropp, 2006], given an observation matrix Y produced via

    Y = ΦX + E

♦ Optimization problem:

    X0(λ) = argmin_X ‖Y − ΦX‖_F² + λ Σ_{i=1}^{m} I[x_{i·} ≠ 0]

  where the sum counts the number of nonzero rows in X.

♦ Can be efficiently solved/approximated by replacing the indicator function with an alternative function g.

Reweighted ℓ2 Optimization

♦ Assume: g(x_i) = h(x_i²), with h concave.

♦ Updates:

    x^(k+1) → argmin_x ‖y − Φx‖_2² + λ Σ_i w_i^(k) x_i²
            = W^(k) Φ^T (λI + ΦW^(k) Φ^T)^(−1) y

    w_i^(k+1) → ∂g(x_i)/∂x_i² evaluated at x_i = x_i^(k+1),
    W^(k+1) → diag[w^(k+1)]^(−1)

♦ Based on a simple first-order approximation to g(x_i) [Palmer et al., 2006].
♦ Guaranteed not to increase the objective function.
♦ Global convergence assured given additional assumptions.
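A compact NumPy sketch of these updates, specialized to the FOCUSS penalty g(x_i) = |x_i|^p described on the next slide; λ, p, the iteration count, and the small eps safeguard are illustrative choices.

```python
import numpy as np

def reweighted_l2(Phi, y, lam=1e-3, p=1.0, eps=1e-8, n_iter=30):
    """Iterative reweighted l2 (FOCUSS-style) for the penalty sum_i |x_i|^p, 0 < p <= 2."""
    n, m = Phi.shape
    w = np.ones(m)                                       # initial weights
    for _ in range(n_iter):
        W = np.diag(1.0 / w)                             # W = diag(w)^(-1)
        x = W @ Phi.T @ np.linalg.solve(lam * np.eye(n) + Phi @ W @ Phi.T, y)
        w = (np.abs(x) + eps) ** (p - 2)                 # FOCUSS weight update; eps avoids division by zero
    return x
```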

Examples

1. FOCUSS algorithm [Rao et al., 2003]:
   - Penalty: g(x_i) = |x_i|^p, 0 ≤ p ≤ 2.
   - Weight update: w_i^(k+1) → |x_i^(k+1)|^(p−2).
   - Properties: well-characterized convergence rates; very susceptible to local minima when p is small.

2. Chartrand and Yin (2008) algorithm:
   - Penalty: g(x_i) = log(x_i² + ε), ε ≥ 0.
   - Weight update: w_i^(k+1) → [(x_i^(k+1))² + ε]^(−1).
   - Properties: slowly reducing ε to zero smooths out local minima initially, allowing better solutions to be found; very useful for recovering scaled coefficients.

Empirical Comparison

♦ For each test case:
  1. Generate a random dictionary Φ with 50 rows and 250 columns.
  2. Generate a sparse coefficient vector x0.
  3. Compute the signal via y = Φx0 (noiseless).
  4. Compare Chartrand and Yin's reweighted ℓ2 method with the ℓ1 norm solution with regard to estimating x0.
  5. Average over 1000 independent trials.
♦ Repeat with different sparsity levels and different nonzero coefficient distributions.

[Figure: probability of success vs. ‖x0‖_0 for unit nonzeros (left) and Gaussian nonzeros (right).]

Reweighted ℓ1 Optimization

♦ Assume: g(x_i) = h(|x_i|), with h concave.

♦ Updates:

    x^(k+1) → argmin_x ‖y − Φx‖_2² + λ Σ_i w_i^(k) |x_i|

    w_i^(k+1) → ∂g(x_i)/∂|x_i| evaluated at x_i = x_i^(k+1)

♦ Based on a simple first-order approximation to g(x_i) [Fazel et al., 2003].
♦ Global convergence given minimal assumptions [Zangwill, 1969].
♦ Per-iteration cost is expensive, but few iterations are needed (and each iterate is sparse).
♦ Easy to incorporate alternative data-fit terms or constraints on x.
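A minimal sketch of this reweighted ℓ1 scheme using the Candes et al. weight rule from the next slide. The inner weighted ℓ1 problem is solved here with a simple ISTA loop (any preferred ℓ1 solver could be substituted), and λ, ε, and the iteration counts are illustrative.

```python
import numpy as np

def weighted_ista(Phi, y, lam, w, n_iter=300):
    """Inner solver: minimize ||y - Phi x||_2^2 + lam * sum_i w_i |x_i| by ISTA."""
    L = 2 * np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - 2 * Phi.T @ (Phi @ x - y) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)
    return x

def reweighted_l1(Phi, y, lam=1e-3, eps=0.1, n_outer=10):
    """Reweighted l1: after each weighted l1 solve, set w_i <- 1 / (|x_i| + eps)."""
    w = np.ones(Phi.shape[1])
    for _ in range(n_outer):
        x = weighted_ista(Phi, y, lam, w)
        w = 1.0 / (np.abs(x) + eps)          # small coefficients get penalized more next round
    return x
```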

Example [Candes et al., 2008]

♦ Penalty: g(x_i) = log(|x_i| + ε), ε ≥ 0.

♦ Updates:

    x^(k+1) → argmin_x ‖y − Φx‖_2² + λ Σ_i w_i^(k) |x_i|

    w_i^(k+1) → [|x_i^(k+1)| + ε]^(−1)

♦ When the nonzeros in x0 are scaled, this works much better than regular ℓ1, depending on how ε is chosen.
♦ Local minima exist, but since each iteration is sparse, local solutions are not so bad (no worse than the regular ℓ1 solution).

Empirical Comparison

♦ For each test case:
  1. Generate a random dictionary Φ with 50 rows and 100 columns.
  2. Generate a sparse coefficient vector x0 with 30 truncated Gaussian, strictly positive nonzero coefficients.
  3. Compute the signal via y = Φx0 (noiseless).
  4. Compare Candes et al.'s reweighted ℓ1 method (10 iterations) with the ℓ1 norm solution, both constrained to be non-negative, to try to estimate x0.
  5. Average over 1000 independent trials.
♦ Repeat with different values of the parameter ε.

[Figure: probability of success vs. ε for the empirical comparison above.]

Conclusions

♦ In practice, MAP estimation addresses some limitations of standard methods (although not Limitation III, assessing uncertainty).
♦ Simple updates are possible using either iterative reweighted ℓ1 or ℓ2 minimization.
♦ More generally, iterative reweighted f minimization, where f is a convex function, is possible.

Section III: Bayesian Inference Using the Sparse Linear Model

Note

♦ MAP estimation is really just standard/classical penalized regression.
♦ So the Bayesian interpretation has not really contributed much as of yet.

Posterior Modes vs. Posterior Mass

♦ The previous methods focus on finding the (implicit) mode of p(x|y) by maximizing the joint distribution p(x, y) = p(y|x) p(x).
♦ Bayesian inference uses posterior information beyond the mode, i.e., posterior mass. Examples:
  1. Posterior mean: can have attractive properties when used as a sparse point estimate (more on this later).
  2. Posterior covariance: useful for assessing uncertainty in estimates, e.g., experimental design, or learning new projection directions for compressive sensing measurements.

Posterior Modes vs. Posterior Mass

[Figure: a posterior p(x|y) whose mode lies away from the bulk of its probability mass.]

Problem

♦ For essentially all sparse priors, we cannot compute the normalized posterior

    p(x | y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx

♦ We also cannot compute posterior moments, e.g.,

    µ_x = E[x | y],   Σ_x = Cov[x | y]

♦ So efficient approximations are needed.

Approximating Posterior Mass

♦ Goal: approximate p(x, y) with some distribution p̂(x, y) that
  1. reflects the significant mass in p(x, y),
  2. can be normalized to get the posterior p(x|y), and
  3. has easily computable moments, e.g., E[x|y] or Cov[x|y].

♦ Optimization problem: find the p̂(x, y) that minimizes the sum of the misaligned mass:

    ∫ |p(x, y) − p̂(x, y)| dx = ∫ |p(y | x) p(x) − p(y | x) p̂(x)| dx

Recipe

1. Start with a Gaussian likelihood:

    p(y | x) = (2πλ)^(−n/2) exp(−(1/2λ)‖y − Φx‖_2²)

2. Pick an appropriate prior p(x) that encourages sparsity.
3. Choose a convenient parameterized class of approximate priors p̂(x) = p(x; γ).
4. Solve:

    γ̂ = argmin_γ ∫ |p(y | x) p(x) − p(y | x) p(x; γ)| dx

5. Normalize:

    p(x | y; γ̂) = p(y | x) p(x; γ̂) / ∫ p(y | x) p(x; γ̂) dx

Step 2: Prior Selection

♦ Assume a sparse prior distribution on each coefficient:

    −log p(x_i) ∝ g(x_i) = h(x_i²),  with h concave and non-decreasing.

[2-D example figure, as before: Gaussian distribution vs. sparse distribution (Tipping, 2001).]

Step 3: Forming Approximate Priors p(x; γ)

♦ Any sparse prior can be expressed via the dual form [Palmer et al., 2006]:

    p(x_i) = max_{γ_i ≥ 0} (2πγ_i)^(−1/2) exp(−x_i² / (2γ_i)) φ(γ_i)

♦ Two options:
  1. Start with p(x_i) and then compute φ(γ_i) via convexity results, or
  2. Choose φ(γ_i) directly and then compute p(x_i); this procedure will always produce a sparse prior [Palmer et al., 2006].

♦ Dropping the maximization gives the strict variational lower bound

    p(x_i) ≥ p(x_i; γ_i) = (2πγ_i)^(−1/2) exp(−x_i² / (2γ_i)) φ(γ_i)

♦ This yields a convenient class of scaled Gaussian approximations:

    p(x; γ) = Π_i p(x_i; γ_i)

Example: Approximations to a Sparse Prior

[Figure: a sparse prior p(x_i) and several scaled Gaussian lower bounds p(x_i; γ_i).]

Step 4: Solving for the Optimal γ

♦ To find the best approximation, we must solve

    γ̂ = argmin_{γ ≥ 0} ∫ |p(y | x) p(x) − p(y | x) p(x; γ)| dx

♦ By virtue of the strict lower bound, this is equivalent to

    γ̂ = argmax_{γ ≥ 0} ∫ p(y | x) p(x; γ) dx
       = argmin_{γ ≥ 0} log|Σ_y| + y^T Σ_y^(−1) y − 2 Σ_{i=1}^{m} log φ(γ_i)

  where Σ_y = λI + ΦΓΦ^T and Γ = diag(γ).

How Difficult Is Finding the Optimal Parameters γ?

♦ If the original MAP estimation problem is convex, then so is the Bayesian inference cost function [Nickisch and Seeger, 2009].
♦ In other situations, the Bayesian inference cost is generally much smoother than the associated MAP estimation cost (more on this later).

Step 5: Posterior Approximation

♦ We have found the approximation

    p(y | x) p(x; γ̂) = p(x, y; γ̂) ≈ p(x, y)

♦ By design, this approximation reflects significant mass in the full distribution p(x, y).
♦ It is also easily normalized to form

    p(x | y; γ̂) = N(µ_x, Σ_x)

    µ_x = E[x | y; γ̂] = Γ̂Φ^T (λI + ΦΓ̂Φ^T)^(−1) y
    Σ_x = Cov[x | y; γ̂] = Γ̂ − Γ̂Φ^T (λI + ΦΓ̂Φ^T)^(−1) ΦΓ̂
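Given a fixed γ̂, the posterior moments above are a direct computation; a minimal NumPy sketch follows (illustrative; a numerically careful implementation would use Cholesky solves rather than an explicit inverse).

```python
import numpy as np

def posterior_moments(Phi, y, gamma, lam):
    """Gaussian posterior mean and covariance of x given y for fixed gamma (variational approximation)."""
    Gamma = np.diag(gamma)
    Sigma_y = lam * np.eye(Phi.shape[0]) + Phi @ Gamma @ Phi.T
    K = Gamma @ Phi.T @ np.linalg.inv(Sigma_y)       # Gamma Phi^T (lam I + Phi Gamma Phi^T)^(-1)
    mu_x = K @ y                                     # posterior mean
    Sigma_x = Gamma - K @ Phi @ Gamma                # posterior covariance
    return mu_x, Sigma_x
```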

Applications of Bayesian Inference

1. Finding maximally sparse representations:
   - Replace the MAP estimate with the posterior mean estimator.
   - For certain prior selections, this can be very effective (next section).
2. Active learning, experimental design.

Experimental Design

♦ Basic idea [Shihao Ji et al., 2007; Seeger and Nickisch, 2008]: use the approximate posterior

    p(x | y; γ̂) = N(µ_x, Σ_x)

  to learn new rows of the design matrix Φ such that uncertainty about x is reduced.

♦ Choose each additional row to minimize the differential entropy H:

    H = (1/2) log|Σ_x|,   Σ_x = Γ̂ − Γ̂Φ^T (λI + ΦΓ̂Φ^T)^(−1) ΦΓ̂

Experimental Design (cont.)

♦ Drastic improvement over random projections is possible in a variety of domains.
♦ Examples:
  - Reconstructing natural scenes [Seeger and Nickisch, 2008]
  - Undersampled MRI reconstruction [Seeger et al., 2009]

Section IV: Analysis of Bayesian Inference and Connections with MAP

Overview

♦ Bayesian inference can be recast as a general form of MAP estimation in x-space.
♦ This is useful for several reasons:
  1. It allows us to leverage the same algorithmic formulations as the iterative reweighted methods for MAP estimation.
  2. It reveals that Bayesian inference can actually be an easier computational task than searching for the mode as with MAP.
  3. It provides theoretical motivation for the posterior mean estimator when searching for maximally sparse solutions.
  4. It allows modifications to the Bayesian inference cost (e.g., adding constraints), and inspires new non-Bayesian sparse estimators.

Reformulation of Posterior Mean Estimator

Theorem [Wipf and Nagarajan, 2008]:

    x̄ = E[x | y, γ̂] = argmin_x ‖y − Φx‖_2² + λ g_infer(x)

with Bayesian inference penalty function

    g_infer(x) = min_{γ ≥ 0} x^T Γ^(−1) x + log|λI + ΦΓΦ^T| − 2 Σ_i log φ(γ_i)

♦ So the posterior mean can be obtained by minimizing a penalized regression problem, just like MAP.
♦ The posterior covariance is a natural byproduct as well.

Property I of Penalty g_infer(x)

♦ The penalty g_infer(x) is formed from a minimum of upper-bounding hyperplanes with respect to each x_i².
♦ This implies:
  1. Concavity in x_i² for all i [Boyd, 2004].
  2. It can be implemented via iterative reweighted ℓ2 minimization (multiple possibilities using various bounding techniques) [Seeger, 2009; Wipf and Nagarajan, 2009].
  3. Note: the posterior covariance can be obtained easily too, so the entire approximation can be computed for full Bayesian inference.

Student's t Example

♦ Assume the following sparse distribution for each unknown coefficient:

    p(x_i) ∝ (b + x_i²/2)^(−(a + 1/2))

♦ Note:
  - When a = b → ∞, the prior approaches a Gaussian (not sparse).
  - When a = b → 0, the prior approaches a Jeffreys prior (highly sparse).
♦ Using convex bounding techniques to approximate the required derivatives leads to simple reweighted ℓ2 update rules [Seeger, 2009; Wipf and Nagarajan, 2009].
♦ The algorithm can also be derived via EM [Tipping, 2001].
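Since the algorithm can also be derived via EM, here is a compact EM-style sketch of the γ updates for a fixed noise variance λ, in the a = b → 0 (SBL) limit where the update reduces to γ_i ← µ_i² + Σ_ii; an illustrative sketch rather than the exact reweighted ℓ2 variant on the next slide.

```python
import numpy as np

def sbl_em(Phi, y, lam=1e-3, n_iter=100):
    """EM-style sparse Bayesian learning: alternate posterior moments and gamma updates."""
    n, m = Phi.shape
    gamma = np.ones(m)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = lam * np.eye(n) + Phi @ Gamma @ Phi.T
        K = Gamma @ Phi.T @ np.linalg.inv(Sigma_y)
        mu = K @ y                                   # posterior mean E[x | y, gamma]
        Sigma = Gamma - K @ Phi @ Gamma              # posterior covariance
        gamma = mu ** 2 + np.diag(Sigma)             # EM update for each gamma_i (a = b = 0 case)
    return mu, gamma
```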

Reweighted ℓ2 Implementation Example

    x^(k+1) → argmin_x ‖y − Φx‖_2² + λ Σ_i w_i^(k) x_i²
            = W^(k) Φ^T (λI + ΦW^(k) Φ^T)^(−1) y,   W^(k) = diag[w^(k)]^(−1)

    w_i^(k+1) → (1 + 2a) / [ (x_i^(k+1))² + (w_i^(k))^(−1) − (w_i^(k))^(−2) φ_i^T (λI + ΦW^(k) Φ^T)^(−1) φ_i + 2b ]

♦ Guaranteed to reduce or leave unchanged the objective function at each iteration.
♦ Other variants are possible using different bounding techniques.
♦ Upon convergence, the posterior covariance is given by

    Σ_x = (λ^(−1) Φ^T Φ + diag[w^(k)])^(−1)

Property II of Penalty g_infer(x)

If −2 log φ(γ_i) is concave in γ_i, then:
1. g_infer(x) is concave in |x_i| for all i, and hence sparsity-inducing.
2. This implies the posterior mean will always have at least m − n elements equal to exactly zero, as occurs with MAP.
3. It can be useful for canonical sparse recovery problems.
4. It can be implemented via reweighted ℓ1 minimization [Wipf, 2006; Wipf and Nagarajan, 2009].

Reweighted ℓ1 Implementation Example

♦ Assume −2 log φ(γ_i) = aγ_i, a ≥ 0.

    x^(k+1) → argmin_x ‖y − Φx‖_2² + λ Σ_i w_i^(k) |x_i|

    w_i^(k+1) → [ φ_i^T (σ²I + ΦW^(k) diag[|x^(k+1)|] Φ^T)^(−1) φ_i + a ]^(1/2)

♦ Guaranteed to converge to a minimum of the Bayesian inference objective function.
♦ Easy to outfit with additional constraints.
♦ In the noiseless case, sparsity will not increase, i.e.,

    ‖x^(k+1)‖_0 ≤ ‖x^(k)‖_0 ≤ ‖x^(1)‖_0   (ℓ0 norm)

Property III of Penalty g_infer(x)

♦ Bayesian inference is most sensitive to posterior mass, and is therefore less sensitive to spurious local peaks than MAP estimation.
♦ Consequently, in x-space the Bayesian inference penalized regression problem

    min_x ‖y − Φx‖_2² + λ g_infer(x)

  is generally smoother than the associated MAP problem.

Student's t Example

♦ Assume the Student's t distribution for each unknown coefficient:

    p(x_i) ∝ (b + x_i²/2)^(−(a + 1/2))

♦ Note:
  - When a = b → ∞, the prior approaches a Gaussian (not sparse).
  - When a = b → 0, the prior approaches a Jeffreys prior (highly sparse).
♦ Goal: compare Bayesian inference and MAP for different sparsity levels to show the smoothing effect.

Visualizing Local Minima Smoothing

♦ Consider the case where y = Φx has a 1-D feasible region, i.e., m = n + 1.
♦ Any feasible solution x will satisfy

    x = x_true + α v,   v ∈ Null(Φ),

  where α is a scalar and x_true is the true generative coefficient vector.
♦ We can plot the penalty functions vs. α to view the local minima profile over the 1-D feasible region.

Empirical Example

♦ Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
♦ Generate a sparse coefficient vector x_true with 9 nonzeros and Gaussian i.i.d. amplitudes.
♦ Compute the signal y = Φx_true.
♦ Assume a Student's t prior on x with varying degrees of sparsity.
♦ Plot the MAP and Bayesian inference penalties vs. α to compare local minima profiles over the 1-D feasible region.

[Figures: normalized penalty vs. α over the 1-D feasible region for low, medium, and high sparsity Student's t priors, comparing MAP and Bayesian inference local minima profiles.]

Property IV of Penalty g_infer(x)

♦ The penalty is non-separable, meaning

    g_infer(x) ≠ Σ_i g_i(x_i)

♦ Non-separable penalty functions can have an advantage over separable penalties (i.e., MAP) when it comes to canonical sparse recovery problems [Wipf and Nagarajan, 2010].

Example

♦ Consider the original sparse estimation problem

    x0 = argmin_x ‖x‖_0  s.t.  y = Φx

♦ Problem: a combinatorial number of local minima:

    C(m − 1, n) + 1 ≤ number of local minima ≤ C(m, n)

♦ Local minima occur at each basic feasible solution (BFS): ‖x‖_0 ≤ n such that y = Φx.

Visualization of Local Minima in the ℓ0 Norm

♦ Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
♦ Generate a sparse coefficient vector x_true with 9 nonzeros and Gaussian i.i.d. amplitudes.
♦ Compute the signal via y = Φx_true.
♦ Plot ‖x‖_0 vs. α (the 1-D null space dimension) to view the local minima profile of the ℓ0 norm over the 1-D feasible region.

[Figure: number of nonzeros vs. α - the ℓ0 norm over the 1-D feasible region.]

Non-Separable Penalty Example

♦ We would like to smooth local minima while retaining the same global solution as ℓ0 at all times (unlike the ℓ1 norm).
♦ This can be accomplished by a simple modification of the ℓ0 penalty.
♦ Truncated ℓ0 penalty:

    x̂ = argmin_x ‖x̃‖_0  s.t.  y = Φx,

  where x̃ contains the k largest elements of x.
♦ If k < m, then there will necessarily be fewer local minima; however, the implicit prior/penalty function is non-separable.

[Figure: truncated ℓ0 norm (k = 10) vs. α over the 1-D feasible region, showing fewer local minima.]

Using the Posterior Mean Estimator for Finding Maximally Sparse Solutions

Summary of why this could be a good idea:
1. If −2 log φ(γ_i) is concave, then the posterior mean will be sparse (local minima of the Bayesian inference cost will also be sparse).
2. The implicit Bayesian inference cost function can be much smoother than the associated MAP objective.
3. There are potential advantages of non-separable penalty functions.

Choosing the Function φ

♦ For sparsity, we require that −2 log φ(γ_i) is concave.
♦ To avoid adding extra local minima (i.e., to maximally exploit the smoothing effect), we require that −2 log φ(γ_i) is convex.
♦ So −2 log φ(γ_i) = aγ_i, a ≥ 0, is well-motivated [Wipf et al., 2007].
♦ Assume the simplest case: −2 log φ(γ_i) = 0, sometimes referred to as sparse Bayesian learning (SBL) [Tipping, 2001]. We denote the penalty function in this case g_SBL(x).

Advantages of the Posterior Mean Estimator

Theorem [Wipf and Nagarajan, 2008]:
♦ In the low noise limit (λ → 0), and assuming ‖x0‖_0 < spark[Φ] − 1, the SBL penalty satisfies

    argmin_{x: y=Φx} g_SBL(x) = argmin_{x: y=Φx} ‖x‖_0

♦ No separable penalty g(x) = Σ_i g_i(x_i) satisfies this condition and has fewer minima than the SBL penalty in the feasible region.

Conditions for a Single Minimum

Theorem [Wipf and Nagarajan, 2009]:
♦ Assume ‖x0‖_0 < spark[Φ] − 1. If the magnitudes of the nonzero elements in x0 are sufficiently scaled, then the SBL cost (λ → 0) has a single minimum, which is located at x0.
♦ No possible separable penalty (standard MAP) satisfies this condition.

[Figure: scaled coefficients (easy) vs. uniform coefficients (hard), as before.]

Empirical Example

♦ Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
♦ Generate a maximally sparse coefficient vector x0 with 9 nonzeros and either
  1. amplitudes of similar scales, or
  2. amplitudes with very different scales.
♦ Compute the signal via y = Φx0.
♦ Plot the MAP and Bayesian inference penalty functions vs. α to compare local minima profiles over the 1-D feasible region and see the effect of coefficient scaling.

[Figures: normalized penalty vs. α for coefficients with similar scales and with very different scales.]

Always Room for Improvement

Theorem [Wipf and Nagarajan, 2010]:
♦ Consider the noiseless sparse recovery problem

    x0 = argmin_x ‖x‖_0  s.t.  y = Φx

♦ Under very mild conditions, SBL with the reweighted ℓ1 implementation will:
  1. never do worse than the regular ℓ1-norm solution, and
  2. for any dictionary and sparsity profile, there will always be cases where it does better.

Empirical Example: Simultaneous Sparse Approximation

♦ Generate a data matrix via Y = ΦX0 (noiseless):
  - X0 is 100-by-5 with random nonzero rows.
  - Φ is 50-by-100 with Gaussian i.i.d. entries.
♦ Check whether X0 is recovered using various algorithms:
  1. Generalized SBL, reweighted ℓ2 implementation [Wipf and Nagarajan, 2010]
  2. Candes et al. (2008) reweighted ℓ1 method
  3. Chartrand and Yin (2008) reweighted ℓ2 method
  4. ℓ1 solution via Group Lasso [Yuan and Lin, 2006]

[Figure: empirical results (1000 trials) - probability of success vs. row sparsity for each algorithm, with SBL highlighted.]

Conclusions

♦ Posterior information beyond the mode can be very useful in a wide variety of applications.
♦ The variational approximation provides useful estimates of posterior means and covariances, which can be computed efficiently using standard iterative reweighting algorithms.
♦ In certain situations, the posterior mean estimate can be an effective substitute for ℓ0 norm minimization.
♦ In simulation tests, it out-performs a wide variety of MAP-based algorithms [Wipf and Nagarajan, 2010].

Section V: Application Examples in Neuroimaging

Applications of Sparse Bayesian Methods

1. Recovering fiber track geometry from diffusion weighted MR images [Ramirez-Manzanares et al., 2007].
2. Multivariate autoregressive modeling of fMRI time series for functional connectivity analyses [Harrison et al., 2003].
3. Compressive sensing for rapid MRI [Lustig et al., 2007].
4. MEG/EEG source localization [Sato et al., 2004; Friston et al., 2008].

MEG/EEG Source Localization

[Figure: source space (X) mapped to sensor space (Y) via Maxwell's equations; the inverse problem is to recover the sources.]

The Dictionary Φ

♦ Can be computed using a boundary element brain model and Maxwell's equations.
♦ Will depend on the location of the sensors and whether we are doing MEG, EEG, or both.
♦ Unlike the compressive sensing domain, the columns of Φ will be highly correlated regardless of where the sensors are placed.

Source Localization

♦ Given multiple measurement vectors Y, MAP or Bayesian inference algorithms can be run to estimate X.
♦ The estimated nonzero rows should correspond with the locations of active brain areas (also called sources).
♦ As with compressive sensing, the algorithms may be applied in an appropriate transform domain where the row-sparsity assumption holds.

Empirical Results

1. Simulations with real brain noise/interference:
   - Generate damped sinusoidal sources.
   - Map to the sensors using Φ and add real brain noise and artifacts.
2. Data from real-world experiments:
   - Auditory evoked fields from binaurally presented tones (which produce correlated, bilateral activations).

Compare localization results using MAP estimation and the SBL posterior mean from Bayesian inference.

MEG Source Reconstruction Example

[Figure: ground truth source locations vs. reconstructions by SBL and Group Lasso.]

Real Data: Auditory Evoked Field (AEF)

[Figure: AEF localization results for SBL, sLORETA, Beamformer, and Group Lasso.]

Conclusion

♦ MEG/EEG source localization demonstrates the effectiveness of Bayesian inference on problems where the dictionary is:
  - Highly overcomplete, meaning m >> n, e.g., n = 275 and m = 100,000.
  - Very ill-conditioned and coherent, i.e., columns are highly correlated.

Final Thoughts

♦ Sparse Signal Recovery is an interesting area with many potential applications.
♦ Methods developed for solving the Sparse Signal Recovery problem can be valuable tools for signal processing practitioners.
♦ Rich set of computational algorithms, e.g.,
  - Greedy search (OMP)
  - ℓ1 norm minimization (Basis Pursuit, Lasso)
  - MAP methods (reweighted ℓ1 and ℓ2 methods)
  - Bayesian inference methods like SBL (which show great promise)
♦ Potential for great theory in support of performance guarantees for algorithms.
♦ The expectation is that there will be continued growth in the application domain as well as in algorithm development.

Thank You