Streaming Sparse Principal Component Analysis
Wenzhuo Yang, Huan Xu – National University of Singapore

Introduction

Standard principal component analysis (PCA):
- Perform the spectral decomposition of the sample covariance matrix.
- Select the eigenvectors corresponding to the largest eigenvalues.

Weaknesses of PCA:
- The output may lack interpretability.
- In the high-dimensional regime where p ≫ n, PCA is not consistent.

ICML 2015

Introduction

To address these issues, previous works focus on sparse PCA: only a few attributes of the resulting PCs are non-zero.
- A regression-type formulation based on the elastic net (Zou et al., 2006).
- A convex semidefinite program formulation (d'Aspremont et al., 2007).
- The TPower method and the iterative thresholding method (Yuan & Zhang, 2013; Ma, 2013).
- The Fantope projection selection (Vu et al., 2013).

Introduction

The difficulty with sparse PCA: it is hard to apply existing methods to large-scale data.
- They either explicitly compute the sample covariance matrix or store all the samples → O(p·min{p, n}) storage.
- The computational cost may become prohibitive when the dimensionality is high → O(p³) operations.

Non-sparse PCA methods avoid this cost, but each has a drawback:
- Online PCA (Warmuth & Kuzmin, 2008): O(p²) storage
- Incremental PCA (Brand, 2002): no theoretical guarantees
- Stochastic power method (Arora et al., 2012): no theoretical guarantees
- Streaming PCA (Mitliagkas et al., 2013): inconsistent when p ≫ n

Introduction

For sparse PCA, e.g., online sparse PCA based on the online learning algorithm for sparse coding (Mairal et al., 2010):
- Memory complexity: O(pk)
- Computational complexity: high (the elastic net must be solved in each iteration)

How to design a computation- and memory-efficient sparse PCA method remains unsolved…

Introduction

Another important issue – the sub-Gaussianity assumption:
- Many sparse PCA methods are analyzed theoretically under the spike model, e.g., Amini & Wainwright, 2009; Vu et al., 2013; Yuan & Zhang, 2013; Vu & Lei, 2012; Shen et al., 2013; Mitliagkas et al., 2013; Cai et al., 2014.
- The spike model requires sub-Gaussian data and noise, and hence cannot model heavy-tailed distributions.

To relax this assumption:
- The semiparametric transelliptical and elliptical families are used to model the data (Han & Liu, 2013): transelliptical component analysis (TCA) and elliptical component analysis (ECA) → O(p³) computation and O(p²) memory.

Introduction

Our contributions:
- We propose two variants of sparse PCA:
  • Spike model → streaming sparse PCA
  • Elliptical model → streaming sparse ECA
- Our theoretical analysis shows that both algorithms have:
  • Memory complexity: O(pk)
  • Computational complexity: O(pks log p)
  • Sample complexity: Θ(s log p)

Problem Setup

Streaming data model:
- One receives sample x_t at time t, and x_t vanishes after it is collected unless it is stored in memory.

Spike model:
- Sample x_t is generated according to x_t = A z_t + w_t, where
  • z_t – a sample of the standard Gaussian N(0, I_d);
  • w_t – a sample of the Gaussian N(0, σ² I_p);
  • A ∈ R^{p×d} – a deterministic but unknown matrix.
- The covariance matrix is Σ = A A^⊤ + σ² I_p.

Problem Setup

Elliptical model:
- Sample x_t is generated according to x_t = μ + ξ_t A z_t, where
  • z_t – a sample of the uniform distribution on the unit sphere;
  • ξ_t – a sample of a scalar random variable with unknown distribution;
  • A ∈ R^{p×d} – a deterministic matrix satisfying A A^⊤ = Σ.

The sparse setting:
- The projection matrix Π = U_k U_k^⊤ satisfies ‖diag(Π)‖₀ ≤ s.
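For concreteness, the elliptical model above can be simulated as below. This is an illustrative sketch, not the authors' code: the function name and the interface for drawing ξ are ours, and the distribution of ξ is deliberately left as a parameter since the model does not fix it.

```python
import numpy as np

def elliptical_samples(A, mu, xi_sampler, n, rng):
    """Draw n samples x = mu + xi * A z with z uniform on the unit sphere."""
    p, d = A.shape
    Z = rng.standard_normal((n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize -> uniform on the sphere
    xi = xi_sampler(n, rng)                          # user-supplied scalar variable
    return mu + xi[:, None] * (Z @ A.T)
```

Heavy-tailed choices of ξ (e.g., Cauchy) produce data that the sub-Gaussian spike model cannot capture.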

Algorithm

Basic idea:
- Block-wise stochastic power method – update the estimated PCs once a block of samples is received.
- The "row truncation" operator – maintains the sparsity of the estimated PCs.

Algorithm

Streaming sparse PCA (for the spike model):
- Block size: B = Θ(s log p).
- Update step: S_{τ+1} = Σ̂_{τ+1} Q_τ, where Σ̂_{τ+1} is the empirical covariance of the current block (never formed explicitly).
- Per-step costs: 2) O(pkB); 3) O(pk + p log p); 4) O(pk²). Computational complexity: O(pk·min{k, s log p}).
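To make the block-wise update concrete, here is a minimal NumPy sketch of the idea: block-wise power iteration with row truncation, using only O(pk) memory. It is an illustration, not the authors' Algorithm 2; the function names and the truncation-by-largest-row-norms rule are assumptions based on the description above.

```python
import numpy as np

def row_truncate(S, gamma):
    """Keep the gamma rows of S with the largest Euclidean norms; zero out the rest."""
    norms = np.linalg.norm(S, axis=1)
    keep = np.argsort(norms)[-gamma:]
    S_trunc = np.zeros_like(S)
    S_trunc[keep] = S[keep]
    return S_trunc

def streaming_sparse_pca(sample_stream, p, k, gamma, B, T):
    """Block-wise stochastic power method with row truncation (O(pk) memory)."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))   # random orthonormal start
    for _ in range(T):
        # S accumulates (1/B) * sum_t x_t (x_t^T Q) over one block, i.e.
        # (empirical covariance of the block) @ Q, without forming the p x p matrix.
        S = np.zeros((p, k))
        for _ in range(B):
            x = next(sample_stream)
            S += np.outer(x, x @ Q) / B
        S = row_truncate(S, gamma)    # enforce row sparsity of the iterate
        Q, _ = np.linalg.qr(S)        # re-orthonormalize
    return Q
```

On spike-model data with a row-sparse A, the iterate keeps at most γ nonzero rows while converging toward the leading subspace.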

Algorithm

The iterative deflation method: used when, e.g., the leading k PCs are all sparse but their supports are nearly disjoint.
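One way to realize the deflation idea (a sketch, not necessarily the paper's exact procedure): after estimating some PCs Q, project every subsequent sample onto the orthogonal complement of span(Q) and rerun the streaming method on the deflated stream to extract the next PC.

```python
import numpy as np

def deflate_stream(sample_stream, Q):
    """Yield samples projected onto the orthogonal complement of span(Q),
    so a further run of the streaming method targets the next PCs."""
    for x in sample_stream:
        yield x - Q @ (Q.T @ x)
```

Applied repeatedly with k = 1, this recovers sparse PCs one at a time even when their supports barely overlap.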

Algorithm

Advantages compared with streaming PCA and TPower:
1. Streaming sparse PCA is consistent in the high-dimensional regime where streaming PCA is inconsistent.
2. TPower requires O(p·min{p, n}) storage, but our method requires only O(pk) storage.
3. When the leading k PCs are row sparse, our method can extract them simultaneously, whereas TPower can only extract them one by one.

Algorithm

For elliptically distributed data, ECA (Han & Liu, 2013) utilizes the multivariate Kendall's tau statistic K: the eigenspace of K is identical to that of Σ.

Consider the following estimator of K: the empirical covariance matrix of the samples

  y_i = (x_{2i−1} − x_{2i}) / ‖x_{2i−1} − x_{2i}‖₂.

Algorithm

Streaming sparse ECA (for the elliptical model):
- Compute a sample y_t from each pair of incoming samples, then apply the block-wise update as in streaming sparse PCA.

Performance Guarantees

The main theorem for streaming sparse PCA:

Theorem 1: For parameters η > 0, 0 < ε < 1, and γ ≥ s, let

$$\mu \triangleq \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k}, \qquad f(\mu,\eta,k) \triangleq \max\left\{\frac{2+\sqrt{2\mu}}{k},\; \frac{\eta}{k}\right\}.$$

If the initial solution Q_0 is "good",

$$\nu \triangleq \|U_{k,\perp}^\top Q_0\|_2^2 < \frac{(1-\mu)^2}{(1-\mu)^2 + (\mu+1)f(\mu,\eta,k)},$$

then as long as the block size B and the iteration number T are "large enough",

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\lambda_1^2\,(s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu},$$

with probability at least 1 − s^{−10}, the output Q_T of Algorithm 2 satisfies ‖U_{k,⊥}^⊤ Q_T‖₂ ≤ ε.

Theoretical Guarantees

Recall from Theorem 1:

$$\mu \triangleq \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k}, \qquad \|U_{k,\perp}^\top Q_0\|_2^2 < \frac{(1-\mu)^2}{(1-\mu)^2 + (\mu+1)f(\mu,\eta,k)},$$

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\lambda_1^2\,(s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu}.$$

Remarks:
1) The algorithm succeeds as long as λ_k > (k+1)λ_{k+1}, since then there exists η such that μ < 1.
2) A smaller μ leads to faster convergence and fewer required samples.
3) A more accurate initial solution is required when μ is larger.
4) The algorithm can succeed with B = Θ(s(log p + log T)) if s ≤ γ ≤ 2s.
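Remark 1 can be checked in one line from the (reconstructed) definition of μ:

```latex
\mu \;=\; \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k} \;<\; 1
\quad\Longleftrightarrow\quad
0 \;<\; \eta \;<\; \frac{\lambda_k - (k+1)\lambda_{k+1}}{2\lambda_k},
```

and the right-hand bound is positive exactly when λ_k > (k+1)λ_{k+1}, so such an η exists.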

Theoretical Guarantees

The main theorem for streaming sparse ECA:

Theorem 2: For parameters η > 0, 0 < ε < 1, and γ ≥ s, define μ, f(μ,η,k), and ν as in Theorem 1 (with the eigenvalues of K in place of those of Σ). If the initial solution Q_0 is "good", i.e., ν < (1−μ)²/((1−μ)² + (μ+1)f(μ,η,k)), then as long as the block size B and the iteration number T are "large enough",

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\big(1+\lambda_1(K)\big)^2 (s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k(K)^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu},$$

with probability at least 1 − s^{−10}, the output Q_T of the algorithm satisfies ‖U_{k,⊥}^⊤ Q_T‖₂ ≤ ε.

Experimental Results

Comparison between streaming sparse PCA, streaming PCA, FPS (Fantope projection selection), and online sparse PCA:

The samples are generated under the spike model.

Experimental Results

Comparison between streaming sparse PCA and streaming PCA:

The samples are generated under the spike model.

Experimental Results

Comparison between ECA, streaming sparse ECA, streaming sparse PCA, and streaming PCA:

ξ follows (Left) the chi-distribution and (Right) the F-distribution.

Experimental Results

Real-world datasets: (Left) NIPS dataset and (Right) NYTimes dataset. Parameters B and γ in streaming sparse PCA are set to 300 and 500, respectively. The compared method is large-scale sparse PCA (Zhang & El Ghaoui, 2011).
