Streaming Sparse Principal Component Analysis
Wenzhuo Yang, Huan Xu – National University of Singapore

Introduction

Standard principal component analysis (PCA):
- Perform the spectral decomposition of the sample covariance matrix.
- Select the eigenvectors corresponding to the largest eigenvalues.

Weaknesses of PCA:
- The output may lack interpretability.
- In the high-dimensional regime where p ≫ n, PCA is not consistent.

ICML 2015

Introduction

To address these issues, previous works focus on sparse PCA: only a few attributes of the resulting PCs are non-zero.
- A regression-type formulation based on the elastic net (Zou et al., 2006).
- A convex semidefinite program formulation (d'Aspremont et al., 2007).
- The TPower method and the iterative thresholding method (Yuan & Zhang, 2013; Ma, 2013).
- The Fantope projection selection (Vu et al., 2013).

Introduction

The difficulty with sparse PCA: it is hard to apply existing methods to large-scale data.
- They either explicitly compute the sample covariance matrix or store all the samples → O(p·min{p, n}) storage.
- The computational cost may become prohibitive when the dimensionality is high → O(p³) operations.

Non-sparse PCA methods avoid this cost, but each has a drawback:
- Online PCA (Warmuth & Kuzmin, 2008): O(p²) storage
- Incremental PCA (Brand, 2002): no theoretical guarantees
- Stochastic power method (Arora et al., 2012): no theoretical guarantees
- Streaming PCA (Mitliagkas et al., 2013): inconsistent when p ≫ n

Introduction

For sparse PCA, e.g., online sparse PCA based on the online learning algorithm for sparse coding (Mairal et al., 2010):
- Memory complexity: O(pk)
- Computational complexity: high (the elastic net must be solved in each iteration)

How to design a computation- and memory-efficient sparse PCA method remains unsolved…

Introduction

Another important issue – the sub-Gaussianity assumption:
- Many sparse PCA methods are analyzed theoretically under the spike model, e.g., Amini & Wainwright, 2009; Vu et al., 2013; Yuan & Zhang, 2013; Vu & Lei, 2012; Shen et al., 2013; Mitliagkas et al., 2013; Cai et al., 2014.
- The spike model requires sub-Gaussian data and noise, and hence cannot model heavy-tailed distributions.

To relax this assumption:
- The semiparametric transelliptical and elliptical families are used to model the data (Han & Liu, 2013): transelliptical component analysis (TCA) and elliptical component analysis (ECA) → O(p³) computation and O(p²) memory.

Introduction

Our contributions:
- We propose two variants of sparse PCA:
  • Spike model → streaming sparse PCA
  • Elliptical model → streaming sparse ECA
- Our theoretical analysis shows that both algorithms have:
  • Memory complexity: O(pk)
  • Computational complexity: O(pks log p)
  • Sample complexity: Θ(s log p)

Problem Setup

Streaming data model:
- One receives sample x_t at time t, and x_t vanishes after it is collected unless it is stored in memory.

Spike model:
- Sample x_t is generated according to x_t = A z_t + w_t, where
  • z_t – a sample of the standard Gaussian N(0, I_d);
  • w_t – a sample of the Gaussian N(0, σ² I_p);
  • A ∈ R^{p×d} – a deterministic but unknown matrix.
- The covariance matrix is Σ = A A^⊤ + σ² I_p.

Problem Setup

Elliptical model:
- Sample x_t is generated according to x_t = μ + ξ_t A z_t, where
  • z_t – a sample of the uniform distribution on the unit sphere;
  • ξ_t – a sample of a scalar random variable with unknown distribution;
  • A ∈ R^{p×d} – a deterministic matrix satisfying A A^⊤ = Σ.

The sparse setting:
- The projection matrix Π = U_k U_k^⊤ satisfies ‖diag(Π)‖₀ ≤ s.
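For concreteness, the elliptical model above can be simulated as below. This is an illustrative sketch, not the authors' code: the function name and the interface for drawing ξ are ours, and the distribution of ξ is deliberately left as a parameter since the model does not fix it.

```python
import numpy as np

def elliptical_samples(A, mu, xi_sampler, n, rng):
    """Draw n samples x = mu + xi * A z with z uniform on the unit sphere."""
    p, d = A.shape
    Z = rng.standard_normal((n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize -> uniform on the sphere
    xi = xi_sampler(n, rng)                          # user-supplied scalar variable
    return mu + xi[:, None] * (Z @ A.T)
```

Heavy-tailed choices of ξ (e.g., Cauchy) produce data that the sub-Gaussian spike model cannot capture.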

Algorithm

Basic idea:
- Block-wise stochastic power method – update the estimated PCs once a block of samples is received.
- The "row truncation" operator – maintains the sparsity of the estimated PCs.

Algorithm

Streaming sparse PCA (for the spike model):
- Block size: B = Θ(s log p).
- Update step: S_{τ+1} = Σ̂_{τ+1} Q_τ, where Σ̂_{τ+1} is the empirical covariance of the current block (never formed explicitly).
- Per-step costs: 2) O(pkB); 3) O(pk + p log p); 4) O(pk²). Computational complexity: O(pk·min{k, s log p}).
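To make the block-wise update concrete, here is a minimal NumPy sketch of the idea: block-wise power iteration with row truncation, using only O(pk) memory. It is an illustration, not the authors' Algorithm 2; the function names and the truncation-by-largest-row-norms rule are assumptions based on the description above.

```python
import numpy as np

def row_truncate(S, gamma):
    """Keep the gamma rows of S with the largest Euclidean norms; zero out the rest."""
    norms = np.linalg.norm(S, axis=1)
    keep = np.argsort(norms)[-gamma:]
    S_trunc = np.zeros_like(S)
    S_trunc[keep] = S[keep]
    return S_trunc

def streaming_sparse_pca(sample_stream, p, k, gamma, B, T):
    """Block-wise stochastic power method with row truncation (O(pk) memory)."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))   # random orthonormal start
    for _ in range(T):
        # S accumulates (1/B) * sum_t x_t (x_t^T Q) over one block, i.e.
        # (empirical covariance of the block) @ Q, without forming the p x p matrix.
        S = np.zeros((p, k))
        for _ in range(B):
            x = next(sample_stream)
            S += np.outer(x, x @ Q) / B
        S = row_truncate(S, gamma)    # enforce row sparsity of the iterate
        Q, _ = np.linalg.qr(S)        # re-orthonormalize
    return Q
```

On spike-model data with a row-sparse A, the iterate keeps at most γ nonzero rows while converging toward the leading subspace.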

Algorithm

The iterative deflation method: used when, e.g., the leading k PCs are all sparse but their supports are nearly disjoint.
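One way to realize the deflation idea (a sketch, not necessarily the paper's exact procedure): after estimating some PCs Q, project every subsequent sample onto the orthogonal complement of span(Q) and rerun the streaming method on the deflated stream to extract the next PC.

```python
import numpy as np

def deflate_stream(sample_stream, Q):
    """Yield samples projected onto the orthogonal complement of span(Q),
    so a further run of the streaming method targets the next PCs."""
    for x in sample_stream:
        yield x - Q @ (Q.T @ x)
```

Applied repeatedly with k = 1, this recovers sparse PCs one at a time even when their supports barely overlap.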

Algorithm

Advantages compared with streaming PCA and TPower:
1. Streaming sparse PCA is consistent in the high-dimensional regime where streaming PCA is inconsistent.
2. TPower requires O(p·min{p, n}) storage, but our method requires only O(pk) storage.
3. When the leading k PCs are row sparse, our method can extract them simultaneously, whereas TPower can only extract them one by one.

Algorithm

For elliptically distributed data, ECA (Han & Liu, 2013) utilizes the multivariate Kendall's tau statistic K: the eigenspace of K is identical to that of Σ.

Consider the following estimator of K: the empirical covariance matrix of the samples

  y_i = (x_{2i−1} − x_{2i}) / ‖x_{2i−1} − x_{2i}‖₂.

Algorithm

Streaming sparse ECA (for the elliptical model):
- Compute a sample y_t from each pair of incoming samples, then apply the block-wise update as in streaming sparse PCA.

Performance Guarantees

The main theorem for streaming sparse PCA:

Theorem 1: For parameters η > 0, 0 < ε < 1, and γ ≥ s, let

$$\mu \triangleq \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k}, \qquad f(\mu,\eta,k) \triangleq \max\left\{\frac{2+\sqrt{2\mu}}{k},\; \frac{\eta}{k}\right\}.$$

If the initial solution Q_0 is "good",

$$\nu \triangleq \|U_{k,\perp}^\top Q_0\|_2^2 < \frac{(1-\mu)^2}{(1-\mu)^2 + (\mu+1)f(\mu,\eta,k)},$$

then as long as the block size B and the iteration number T are "large enough",

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\lambda_1^2\,(s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu},$$

with probability at least 1 − s^{−10}, the output Q_T of Algorithm 2 satisfies ‖U_{k,⊥}^⊤ Q_T‖₂ ≤ ε.

Theoretical Guarantees

Recall from Theorem 1:

$$\mu \triangleq \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k}, \qquad \|U_{k,\perp}^\top Q_0\|_2^2 < \frac{(1-\mu)^2}{(1-\mu)^2 + (\mu+1)f(\mu,\eta,k)},$$

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\lambda_1^2\,(s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu}.$$

Remarks:
1) The algorithm succeeds as long as λ_k > (k+1)λ_{k+1}, since then there exists η such that μ < 1.
2) A smaller μ leads to faster convergence and fewer required samples.
3) A more accurate initial solution is required when μ is larger.
4) The algorithm can succeed with B = Θ(s(log p + log T)) if s ≤ γ ≤ 2s.
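Remark 1 can be checked in one line from the (reconstructed) definition of μ:

```latex
\mu \;=\; \frac{(k+1)\lambda_{k+1} + 2\eta\lambda_k}{\lambda_k} \;<\; 1
\quad\Longleftrightarrow\quad
0 \;<\; \eta \;<\; \frac{\lambda_k - (k+1)\lambda_{k+1}}{2\lambda_k},
```

and the right-hand bound is positive exactly when λ_k > (k+1)λ_{k+1}, so such an η exists.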

Theoretical Guarantees

The main theorem for streaming sparse ECA:

Theorem 2: For parameters η > 0, 0 < ε < 1, and γ ≥ s, define μ, f(μ,η,k), and ν as in Theorem 1 (with the eigenvalues of K in place of those of Σ). If the initial solution Q_0 is "good", i.e., ν < (1−μ)²/((1−μ)² + (μ+1)f(μ,η,k)), then as long as the block size B and the iteration number T are "large enough",

$$T \ge \frac{\log(\epsilon/\nu)}{\log(\mu/k)}, \qquad B \ge \frac{c\,k^2\big(1+\lambda_1(K)\big)^2 (s+2\gamma)(\log p + \log T)}{\eta^2\lambda_k(K)^2\,\epsilon\,\big(1-\nu-f(\mu,\eta,k)\big)\,\nu},$$

with probability at least 1 − s^{−10}, the output Q_T of the algorithm satisfies ‖U_{k,⊥}^⊤ Q_T‖₂ ≤ ε.

Experimental Results

Comparison between streaming sparse PCA, streaming PCA, FPS (Fantope projection selection), and online sparse PCA:

The samples are generated under the spike model.

Experimental Results

Comparison between streaming sparse PCA and streaming PCA:

The samples are generated under the spike model.

Experimental Results

Comparison between ECA, streaming sparse ECA, streaming sparse PCA, and streaming PCA:

ξ follows (Left) the chi-distribution and (Right) the F-distribution.

Experimental Results

Real-world datasets: (Left) NIPS dataset and (Right) NYTimes dataset. Parameters B and γ in streaming sparse PCA are set to 300 and 500, respectively. The compared method is large-scale sparse PCA (Zhang & El Ghaoui, 2011).
