Recover Signals under Systemic Effects via Convex Optimization

Project Report, EE 364b
Yunting Sun
June 3, 2011
1 Background
There has been considerable progress in multiple testing methods for high-throughput applications. A common example, coming from biology, is testing which of N genes' expression levels correlate significantly with a scalar variable, which we call the primary variable. The primary variable may be an experimentally applied treatment, or it may be a covariate such as a phenotype. We use the gene expression example for concreteness, although it is just one of many instances of this problem. The general data model is

Y = γg^T + ΣE,

where Y ∈ R^{N×n} is a matrix of gene expression values, n is the number of arrays/subjects (in the hundreds), and N ≫ n is the number of genes (in the thousands). Y_{ij} is the expression value for gene i and sample j. g ∈ R^{n×1} is the vector of primary-variable values for the n samples, and γ ∈ R^{N×1} is the primary parameter of interest: γ_i = 0 when gene i is not associated with g (the primary variable), and γ_i ≠ 0 otherwise. In many circumstances, the fraction of genes associated with g is small (< 10%). The noise matrix satisfies E ∼ N(0, I_N ⊗ I_n), and Σ is a diagonal matrix whose entry Σ_{ii}, i = 1, ..., N, is the noise standard deviation of the ith gene.

The independence of the noise E across rows and columns is a crucial assumption for most multiple hypothesis testing techniques. In practice, however, it rarely holds. For example, in microarray-based experiments it is well known that samples processed in the same batch are correlated. Batch, technician, and other sources of variation in sample preparation can be modeled by latent variables. Another example comes from genetic association studies, where differences in ancestral history among subjects act as latent variables and can lead to correlation across samples and to false or inaccurate associations. We therefore modify the model to incorporate latent factors:

Y = γg^T + UV^T + ΣE,    (1)
where the columns of V ∈ R^{n×k} are unknown random latent factors, U ∈ R^{N×k} contains the unknown latent loadings, and k, the number of latent factors, is much smaller than min(N, n) and also unknown. The matrix product L = UV^T models the dependence across genes induced by the latent effects. The latent factors may even be correlated with the primary variable g as a result of imperfect randomization. Ignoring UV^T can lead to serious confounding: it not only alters the level of the hypothesis tests but also changes the ordering of the N p-values. On the one hand, UV^T confounds the inference on γ; on the other hand, the signal S = γg^T acts as large contaminating noise on the latent structure UV^T. The goal is to adjust for the latent structure UV^T and recover the sparse signal γ from an observed pair (Y, g).

For this purpose, Leek and Storey (2008) proposed an iteratively reweighted surrogate variable analysis (SVA). A full and precise description of SVA appears in the supplementary information and online software for Leek and Storey (2008); here we present a brief outline. The SVA iteration is as follows. First, they fit a linear model of Y on g without any latent variables, obtaining
estimates γ̂ and the residual R = Y − γ̂g^T. Second, they apply the simulation method of Buja and Eyuboglu (1992) to R to estimate the number k of factors, and then take the top k right singular vectors of R as the initial estimate V̂. Third, they form the empirical Bayes estimates w_i = Pr(γ_i = 0, U_i ≠ 0 | Y, g, V̂) from Storey et al. (2005). Fourth, based on those weights, they perform a weighted singular value decomposition of the original data matrix Y, where row i is weighted by w_i. The weighted SVD gives an updated estimate V̂. They repeat steps 3 and 4, revising the weights w_i and then the matrix V̂, until V̂ converges. Finally, they estimate γ through the multivariate linear regression model Y = γg^T + UV̂^T + ΣE, where V̂ is treated as a known covariate adjusting for the primary effect g. A skeleton of this iteration is sketched below. It is a good idea to estimate the latent factors V from rows that are less likely to be affected by the primary variable g. However, there is no guarantee that the SVA solution recovers γ; convergence can be very slow, and the algorithm does not exploit the sparsity of γ.
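For concreteness, here is a minimal R skeleton of the SVA-style iteration just described. It is a sketch, not Leek and Storey's implementation: the empirical Bayes weighting step is abstracted into a user-supplied weight_fn standing in for the Storey et al. (2005) estimates, and the names sva_iterate and weight_fn are ours.

    # Skeleton of the SVA-style iteration; weight_fn is a stand-in for the
    # empirical Bayes weights P(gamma_i = 0, U_i != 0 | Y, g, V_hat).
    sva_iterate <- function(Y, g, k, weight_fn, iters = 20) {
      gamma_hat <- as.vector(Y %*% g) / sum(g^2)   # rowwise regression of Y on g
      R <- Y - outer(gamma_hat, g)                 # residual matrix
      V_hat <- svd(R)$v[, 1:k, drop = FALSE]       # initial factors from residual
      for (i in seq_len(iters)) {
        w <- weight_fn(Y, g, V_hat)                # one weight per gene (row)
        V_hat <- svd(w * Y)$v[, 1:k, drop = FALSE] # row-weighted SVD of Y
      }
      V_hat
    }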
2 Our method
With sufficient data it is possible to estimate Σ accurately: fit a linear model of Y on g without any latent variables, obtaining estimates γ̂ and the residual R = Y − γ̂g^T; perform an SVD of R, choosing the number of factors according to Buja and Eyuboglu (1992), so that R ≈ ÛV̂^T; and estimate Σ by

Σ̂^2 = (1/n) diag((R − ÛV̂^T)(R − ÛV̂^T)^T).

Hence, for simplicity, we assume a common known noise standard deviation across genes, i.e. Σ = σI_{N×N}.

To make good inferences about γ, it makes sense to estimate γ and L together. What we know is that γ is sparse and L has low rank. To incorporate this information, we propose the estimation scheme

minimize    ‖Y − γg^T − L‖_F^2
subject to  card(γ) ≤ s, rank(L) ≤ k.

The constraints are nonconvex, but we can use relaxation to transform the problem into a convex one:

minimize    ‖Y − γg^T − L‖_F^2
subject to  ‖γ‖_1 ≤ t, ‖L‖_* ≤ l,

where ‖·‖_1 is the ℓ1 norm and ‖·‖_* is the nuclear norm. Here t and l are fixed tuning parameters, which can be chosen by cross validation or by moment matching to the known noise standard deviation σ. We can transform the problem to reduce the number of tuning parameters:

minimize    ‖L‖_* + λ‖γ‖_1
subject to  ‖Y − γg^T − L‖_F ≤ σ√(Nn).    (2)
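For small instances, problem (2) can be handed directly to a modeling framework. Below is a minimal sketch using the R package CVXR; the experiments in this report used cvx, so the package choice, the function name recover_cvx, and the solver defaults are our own assumptions.

    # Solve (2) directly with CVXR; a sketch for small N, n.
    library(CVXR)
    recover_cvx <- function(Y, g, lambda, sigma) {
      N <- nrow(Y); n <- ncol(Y)
      gamma <- Variable(N)        # sparse primary effects
      L <- Variable(N, n)         # low-rank latent structure
      objective <- norm_nuc(L) + lambda * p_norm(gamma, 1)
      resid <- Y - gamma %*% t(g) - L
      constraints <- list(cvxr_norm(resid, "F") <= sigma * sqrt(N * n))
      result <- solve(Problem(Minimize(objective), constraints))
      list(gamma = result$getValue(gamma), L = result$getValue(L))
    }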
We will show in simulation that, with properly chosen λ, we are able to recover γ. First we use cvx to solve the above problem for a few simulation settings. Let N = 60 and n = 30; the primary variable g takes the value 1 on half of the samples and −1 on the other half, and is then standardized so that ‖g‖ = 1. The primary parameter γ has independent components γ_i taking the values ss and 0 with probabilities π = 0.1 and 1 − π = 0.9, respectively, where ss is a scalar controlling the signal strength. We use k = 1 latent variable that has correlation ρ with g. The latent loading vector U ∈ R^N has independent N(0, uu^2) entries, where uu is a scalar controlling the strength of the latent structure. V is taken to be ρg + √(1 − ρ^2) W, where W is uniformly distributed on the set of unit vectors orthogonal to g. The i.i.d. noise matrix E has independent components E_{ij} ∼ N(0, 1), and the noise standard deviation is Σ = ee I_{N×N}, where ee is a scalar controlling the noise strength. We can vary ss, uu, and ee to obtain different combinations of signal, latent-structure, and noise strength. In all the simulations, the tuning parameter λ is set to 1/√n, following (Candes et al. 2009). The simulation setup can be generated as in the sketch below.
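A minimal R sketch of this data-generating process (the helper name simulate_data and the default arguments are ours; W is drawn by projecting a Gaussian vector away from g and normalizing):

    # One simulated dataset under the settings above (k = 1 latent factor).
    simulate_data <- function(N = 60, n = 30, pi = 0.1, ss = 0.5,
                              uu = 1, ee = 0.2, rho = 0.3) {
      g <- rep(c(1, -1), each = n / 2)
      g <- g / sqrt(sum(g^2))                  # standardize: ||g|| = 1
      gamma <- ss * rbinom(N, 1, pi)           # sparse signal, P(nonzero) = pi
      U <- rnorm(N, sd = uu)                   # latent loadings
      z <- rnorm(n); z <- z - sum(z * g) * g   # remove component along g
      W <- z / sqrt(sum(z^2))                  # uniform unit vector orthogonal to g
      V <- rho * g + sqrt(1 - rho^2) * W       # latent factor, corr(V, g) = rho
      Y <- outer(gamma, g) + outer(U, V) + ee * matrix(rnorm(N * n), N, n)
      list(Y = Y, g = g, gamma = gamma, U = U, V = V)
    }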
We compare our estimator with an oracle that is given UV^T and then regresses Y − UV^T on g. The measure for comparison is the area under the ROC curve (AUC), which measures the accuracy of the estimate γ̂. The AUC is equivalent to

( Σ_{i∈Ω^c} Σ_{j∈Ω} I(|γ̂_i| > |γ̂_j|) ) / (|Ω| |Ω^c|),

where Ω = {k : γ_k = 0} is the set of genes not associated with g. The higher the AUC, the better the estimator γ̂; if γ̂ = γ, the AUC equals 1. We also compute the correlations of Û and V̂, as estimated by our method, with the actual U and V. An R sketch of the AUC computation follows the table.

ρ     ss    uu    ee    AUC(our method)  AUC(oracle)  corr(V̂, V)  corr(Û, U)
0     0.5   0.25  0.2   0.9997           0.9998       0.74         0.66
0     0.5   0.5   0.2   1                1            0.95         0.92
0     0.5   1     0.2   0.9739           0.9834       0.99         0.98
0     0.5   1.5   0.2   0.9393           0.9458       0.99         0.99
0     0.5   2     0.2   0.8637           0.8818       0.99         0.99
0     0.5   0.25  0.4   0.8292           0.8525       0.30         0.23
0     0.5   0.5   0.4   0.8410           0.8485       0.70         0.62
0     0.5   1     0.4   0.8143           0.8015       0.96         0.92
0     0.5   1.5   0.4   0.7642           0.7820       0.98         0.96
0     0.5   2     0.4   0.7293           0.7444       0.99         0.97
0.3   0.5   0.25  0.2   0.9997           0.9998       0.72         0.64
0.3   0.5   0.5   0.2   1                1            0.95         0.92
0.3   0.5   1     0.2   0.9582           0.9834       0.99         0.98
0.3   0.5   1.5   0.2   0.9297           0.9458       0.99         0.99
0.3   0.5   2     0.2   0.8242           0.8818       0.31         0.99
0.3   0.5   0.25  0.4   0.8198           0.8525       0.30         0.23
0.3   0.5   0.5   0.4   0.8168           0.8485       0.73         0.64
0.3   0.5   1     0.4   0.7984           0.8015       0.95         0.92
0.3   0.5   1.5   0.4   0.7573           0.7820       0.98         0.96
0.3   0.5   2     0.4   0.6937           0.7444       0.99         0.98
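As referenced above, a small R sketch of the pairwise AUC (the helper name pairwise_auc is ours):

    # AUC as the fraction of (nonzero, zero) pairs ranked correctly by |gamma_hat|.
    pairwise_auc <- function(gamma_hat, gamma) {
      nz <- abs(gamma_hat[gamma != 0])   # estimates on Omega^c (true signals)
      z  <- abs(gamma_hat[gamma == 0])   # estimates on Omega (true nulls)
      mean(outer(nz, z, ">"))            # I(|gamma_hat_i| > |gamma_hat_j|), averaged
    }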
It appears that as long as the latent structure is strong compared with the signal and noise, the performance of our method is reasonable. If the latent structure is very weak, it can become undetectable (Onatski 2007), which is why our method underperforms the oracle when the latent-to-noise ratio uu/ee is small. We also notice that the performance of our method catches up with that of the oracle when the signal strength is roughly the same as the latent-structure strength. That is mainly because we choose λ = 1/√n, which is designed for the case where the signal and latent structure are comparable. To get better performance in other settings, we would need to choose the tuning parameter λ by cross validation.
3 Solving large problems
We propose an accelerated proximal gradient (APG) method to scale our approach to large problems. It is well established that solving (2) is equivalent to solving

minimize_{γ,L}  μ(σ)‖L‖_* + μ(σ)λ‖γ‖_1 + (1/2)‖Y − γg^T − L‖_F^2

for a well-chosen μ(σ), a function of σ. Define the soft-thresholding operator

S_ε(x) = { x − ε, if x > ε;  x + ε, if x < −ε;  0, otherwise },

for x ∈ R and ε > 0, and extend this operator to vectors and matrices by applying it element-wise. The APG algorithm runs as follows; an R sketch is given after the listing.

Input: Y ∈ R^{N×n}, g ∈ R^{n×1}, λ, μ, σ ∈ R
1. γ_{−1} = γ_0 = 0; L_{−1} = L_0 = 0; s_{−1} = s_0 = 1
2. while not converged do
3.   Z_k^L = L_k + ((s_{k−1} − 1)/s_k)(L_k − L_{k−1}),  Z_k^γ = γ_k + ((s_{k−1} − 1)/s_k)(γ_k − γ_{k−1})
4.   (U, D, V) = SVD(Z_k^L − (1/2)(Z_k^L + Z_k^γ g^T − Y)),  L_{k+1} = U S_{μ/2}(D) V^T
5.   γ_{k+1} = S_{λμ/2}(Z_k^γ − (1/2)(Z_k^L + Z_k^γ g^T − Y)g)
6.   s_{k+1} = (1 + √(4s_k^2 + 1))/2
7.   k = k + 1
8. end while
Output: L, γ. We stop when ‖2(Z_k^L − L_k) + L_k + γ_k g^T − Z_k^L − Z_k^γ g^T‖_2 < 10^{−4}.
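The following is a minimal R sketch of this iteration under the assumption Σ = σI; the function name apg_recover, the iteration cap, and the simple change-based stopping test are ours.

    # APG for mu*||L||_* + mu*lambda*||gamma||_1 + (1/2)||Y - gamma g^T - L||_F^2.
    soft <- function(x, eps) sign(x) * pmax(abs(x) - eps, 0)

    apg_recover <- function(Y, g, lambda, mu, max_iter = 500, tol = 1e-4) {
      N <- nrow(Y); n <- ncol(Y)
      L_old <- L <- matrix(0, N, n)
      gam_old <- gam <- rep(0, N)
      s_old <- s <- 1
      for (k in seq_len(max_iter)) {
        w <- (s_old - 1) / s
        ZL <- L + w * (L - L_old)               # extrapolated L
        Zg <- gam + w * (gam - gam_old)         # extrapolated gamma
        G <- ZL + outer(Zg, g) - Y              # gradient of the smooth term
        sv <- svd(ZL - G / 2)                   # singular value thresholding for L
        L_new <- sv$u %*% (soft(sv$d, mu / 2) * t(sv$v))
        gam_new <- soft(Zg - as.vector(G %*% g) / 2, lambda * mu / 2)
        done <- sum((L_new - L)^2) + sum((gam_new - gam)^2) < tol^2
        L_old <- L; L <- L_new
        gam_old <- gam; gam <- gam_new
        s_old <- s; s <- (1 + sqrt(4 * s^2 + 1)) / 2
        if (done) break
      }
      list(L = L, gamma = gam)
    }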
Since μ(σ) is not known, we can instead solve the dual problem

maximize_y  inf_{L,γ} ‖L‖_* + λ‖γ‖_1 + (y/2)(‖Y − γg^T − L‖_F^2 − σ^2 Nn)
subject to  y ≥ 0.

Input: Y ∈ R^{N×n}, g ∈ R^{n×1}, λ, σ, and a fixed step size α.
1. while not converged do
2.   (L_k, γ_k) = argmin_{γ,L} ‖L‖_* + λ‖γ‖_1 + (y_k/2)‖Y − γg^T − L‖_F^2
3.   y_{k+1} = y_k − α(σ^2 Nn − ‖Y − L_k − γ_k g^T‖_F^2)
4.   k = k + 1
5. end while
Output: L, γ.

Step 2 is solved by the APG method; dividing its objective by y_k shows that it is the penalized problem above with μ = 1/y_k. A sketch of this outer loop is given below.
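A minimal sketch of the dual ascent loop, reusing the apg_recover sketch above (the name dual_solve, the initialization y = 1, and the iteration count are our assumptions; the max(·, 0) projection enforces y ≥ 0):

    # Dual gradient ascent on y; inner problem solved by APG with mu = 1/y.
    dual_solve <- function(Y, g, lambda, sigma, alpha, iters = 50, y = 1) {
      N <- nrow(Y); n <- ncol(Y)
      fit <- NULL
      for (k in seq_len(iters)) {
        fit <- apg_recover(Y, g, lambda, mu = 1 / y)     # step 2
        r2 <- sum((Y - fit$L - outer(fit$gamma, g))^2)   # ||Y - L - gamma g^T||_F^2
        y <- max(y - alpha * (sigma^2 * N * n - r2), 0)  # step 3, projected
      }
      fit
    }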
4 Discussion
For the model in the ideal world without latent effects, Donoho and Johnstone (1994) showed that the soft-thresholded estimator γ̂ is nearly minimax, and that the following oracle inequality holds:

E‖γ̂ − γ‖_2^2 ≤ (2 log n + 1)|T|σ^2,

where T is the support of the sparse vector γ. In the presence of latent effects UV^T, no "nearly minimax" result exists in the literature, but it should be possible to follow the robust principal component analysis results of (Candes et al. 2009) to obtain exact recovery of γ in the noiseless scenario under certain assumptions on γ, g, U, and V. In the presence of noise, more work needs to be done to obtain the oracle inequality. The APG method we use is a special form of FISTA (fast iterative shrinkage-thresholding algorithm) and is similar to the method proposed by (Lin et al. 2009) for robust principal component analysis. Theorem 2.1 in (Lin et al. 2009) shows that the convergence rate of the algorithm is O(1/k^2), where k is the number of iterations.
5 Sample code
The following code is an implementation of the APG algorithm in R; it begins with the componentwise soft-thresholding operator.

# model: 1/2 ||y - gamma||_2^2 + lambda ||gamma||_1
# prox of lambda*||.||_1: componentwise soft thresholding,
# where ((y > 0) - (y < 0)) computes sign(y)
l1threshold <- function(y, lambda)
  ((y > 0) - (y < 0)) * pmax(abs(y) - lambda, 0)
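A quick usage check of l1threshold in the latent-free setting, connecting to the Donoho and Johnstone discussion above (the universal threshold σ√(2 log N) and the toy sizes are illustrative assumptions):

# Toy check: recover a sparse gamma from y = gamma + noise.
set.seed(1)
gamma <- c(rep(2, 5), rep(0, 55))
sigma <- 0.5
y <- gamma + rnorm(60, sd = sigma)
gamma_hat <- l1threshold(y, sigma * sqrt(2 * log(60)))  # universal threshold
which(gamma_hat != 0)  # should mostly be indices 1..5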