
Low-Rank Positive Semidefinite Matrix Recovery from Quadratic Measurements with Outliers

arXiv:1602.02737v1 [cs.IT] 8 Feb 2016

Yuanxin Li, Yue Sun, and Yuejie Chi

Abstract—We address the problem of estimating a low-rank positive semidefinite (PSD) matrix $X \in \mathbb{R}^{n \times n}$ from a set of measurements $z_i = a_i^T X a_i$, $i = 1, \ldots, m$, which are quadratic in the sensing vectors $a_i$ composed of i.i.d. standard Gaussian entries. This problem arises in applications such as covariance sketching, quantum state tomography, and power spectrum estimation. Specifically, we consider a convex optimization problem that seeks the PSD matrix with minimum $\ell_1$ norm of the observation residual. The advantage of our algorithm is that it is parameter-free, therefore eliminating the need for tuning and allowing easy implementation. We establish that with high probability, a rank-$r$ PSD matrix $X$ of size $n$ can be exactly recovered from $O(nr^2)$ measurements, even when a fraction of the measurements is corrupted by outliers with arbitrary magnitudes. Moreover, the recovery is also stable against bounded noise. With the additional information of the rank of the PSD matrix, a non-convex algorithm based on subgradient descent is proposed that demonstrates superior empirical performance.

Index Terms—quadratic measurements, low-rank covariance estimation, outliers

Y. Li and Y. Chi are with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mails: {li.3822, chi.97}@osu.edu). Y. Sun is with the Department of Electronics Engineering, Tsinghua University, Beijing, China. Part of the work was done while Y. Sun was visiting The Ohio State University. This work is supported in part by NSF under grants CCF-1422966 and ECCS-1462191, and by AFOSR under grant FA9550-15-1-0205. Corresponding e-mail: [email protected]. Date: February 9, 2016. The results in this paper were presented in part at the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, March 2016.

I. INTRODUCTION

In many emerging applications of science and engineering, we are interested in estimating a low-rank positive semidefinite (PSD) matrix $X_0 \in \mathbb{R}^{n \times n}$ from a set of magnitude measurements that are quadratic in the sensing vectors $a_i \in \mathbb{R}^n$:

$$z_i = \langle a_i a_i^T, X_0 \rangle = a_i^T X_0 a_i, \quad i = 1, \ldots, m. \tag{1}$$

These measurements either arise due to physical limitations, e.g., the inability to capture phases, as in phase retrieval and optical imaging from intensity measurements [1]–[4], which can be formulated as

$$z_i = |\langle a_i, x_0 \rangle|^2 = a_i^T x_0 x_0^T a_i = a_i^T X_0 a_i, \tag{2}$$

where $X_0 = x_0 x_0^T$ is a lifted rank-one matrix from the signal of interest; or arise by design, such as in covariance sketching or power spectrum estimation [5], where $z_i$ is aggregated from $L$ data samples as

$$z_i = \frac{1}{L} \sum_{l=1}^{L} |\langle a_i, x_l \rangle|^2 = a_i^T \left( \frac{1}{L} \sum_{l=1}^{L} x_l x_l^T \right) a_i \approx a_i^T X_0 a_i. \tag{3}$$

Here, $X_0$ corresponds to the covariance matrix of the data, which is assumed low-rank due to the intrinsic low dimensionality of the data. Our goal is to recover the low-rank PSD matrix $X_0$ from (1) using as small a number of measurements as possible, in a computationally efficient and robust manner.

A popular convex relaxation algorithm is based on trace minimization [5], which seeks the PSD matrix with the smallest trace while satisfying the observation constraint. It is shown in [5] that this algorithm exactly recovers all rank-$r$ PSD matrices as soon as the number of measurements exceeds the information-theoretic limit $O(nr)$ in the absence of noise, and that the recovery is also robust against bounded noise.

A. Our Contributions

In this paper, we focus on robust recovery of the low-rank PSD matrix when the measurements are further corrupted by outliers, possibly adversarial and with arbitrary amplitudes. In practice, outliers are somewhat inevitable: they may be caused by sensor failures, malicious attacks, or reading errors. In the application of covariance sketching (3), a sufficiently large aggregation length $L$ is necessary for each measurement $z_i$ to be well approximated by (1); measurements that are not aggregated from a large enough $L$ may be regarded as outliers. Therefore, it becomes critical to address robust recovery of $X_0$ in the presence of outliers. Fortunately, the number of outliers is usually much smaller than the total number of measurements, making it possible to leverage the sparsity of the outliers to faithfully recover the low-rank PSD matrix of interest.

We first propose a convex relaxation algorithm that seeks the PSD matrix minimizing the $\ell_1$ norm of the measurement residual, where the $\ell_1$ norm is adopted to promote outlier sparsity. The algorithm is free of tuning parameters and eliminates the need for trace minimization by only enforcing the PSD constraint. When the sensing vectors are composed of i.i.d. standard Gaussian entries, we establish that for a fixed rank-$r$ PSD matrix of size $n$, as long as the number of measurements exceeds $O(nr^2)$, the algorithm exactly recovers it with high probability, even when a fraction of $O(1/r)$ of the measurements is arbitrarily corrupted. Our measurement complexity is order-wise near-optimal up to a factor of $r$, and is optimal in


the rank-one case. Furthermore, the recovery is also robust to bounded noise.

To further reduce the computational burden for large-scale problems, we next develop a non-convex algorithm based on subgradient descent for the case when the rank of the PSD matrix, or an estimate of it, is known a priori. Since any rank-$r$ PSD matrix can be decomposed as $X_0 = U_0 U_0^T$ with $U_0 \in \mathbb{R}^{n \times r}$, uniquely up to orthonormal transformations, it is sufficient to recover $U_0$ without constructing the PSD matrix explicitly. The algorithm iteratively updates the estimate by descending along the subgradient of the $\ell_1$ norm of the measurement residual, using a properly selected step size and initialization. While it is challenging to establish rigorous global convergence of the proposed algorithm, numerical experiments validate its superior empirical performance.

B. Organization

The rest of the paper is organized as follows. Section II presents the proposed convex optimization algorithm and its corresponding performance guarantee. Section III describes the proposed non-convex subgradient descent algorithm, which is computationally efficient with excellent empirical performance. Numerical examples are provided in Section IV. The proof of the main theorem is given in Section V. Finally, we conclude in Section VI.

II. PARAMETER-FREE CONVEX RELAXATION

Let $X_0 \in \mathbb{R}^{n \times n}$ be a rank-$r$ PSD matrix. The measurements, which may be corrupted by either bounded noise or arbitrary outliers, can be represented as

$$z = \mathcal{A}(X_0) + \beta + w, \tag{4}$$

where $z, \beta, w \in \mathbb{R}^m$. The linear mapping $\mathcal{A}: \mathbb{R}^{n \times n} \rightarrow \mathbb{R}^m$ is defined as $\mathcal{A}(X_0) = \{a_i^T X_0 a_i\}_{i=1}^m$, where the $a_i \in \mathbb{R}^n$ are the sensing vectors composed of i.i.d. standard Gaussian entries. The vector $\beta$ denotes the outlier vector, which is assumed to be sparse with entries that can be arbitrarily large; the fraction of nonzero entries is defined as $s := \|\beta\|_0 / m$. Moreover, the vector $w$ denotes the additive noise, which is assumed bounded as $\|w\|_1 \le \epsilon$.
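As a concrete illustration, the following Python sketch simulates the corrupted measurement model (4) with a random rank-$r$ ground truth. The function name, default parameter values, and the Gaussian-valued outlier magnitudes are our own illustrative choices (the experiments in Section IV use a similar setup), not a prescribed implementation.

```python
import numpy as np

def generate_measurements(n=40, r=3, m=400, s=0.05, noise=0.0, rng=None):
    """Simulate z = A(X0) + beta + w from (4), with A(X0)_i = a_i^T X0 a_i."""
    rng = np.random.default_rng(rng)
    U0 = rng.standard_normal((n, r))
    X0 = U0 @ U0.T                               # ground-truth rank-r PSD matrix
    A = rng.standard_normal((m, n))              # rows are the sensing vectors a_i
    z = np.einsum('mi,ij,mj->m', A, X0, A)       # quadratic measurements a_i^T X0 a_i
    beta = np.zeros(m)                           # sparse outliers, fraction s
    supp = rng.choice(m, size=int(s * m), replace=False)
    beta[supp] = rng.standard_normal(supp.size)  # arbitrary (here Gaussian) magnitudes
    w = noise * rng.uniform(-1.0, 1.0, m)        # bounded additive noise
    return z + beta + w, A, X0
```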

A. Recovery of General PSD Matrices

When only the outlier vector $\beta$ is present in (4), one may seek a rank-$r$ PSD matrix that minimizes the cardinality of the measurement residual:

$$\hat{X} = \operatorname*{argmin}_{X \succeq 0} \; \|z - \mathcal{A}(X)\|_0, \quad \text{s.t.} \quad \operatorname{rank}(X) = r. \tag{5}$$

However, both the cardinality minimization and the rank constraint are NP-hard in general, making this method computationally infeasible. A common approach is to resort to convex relaxation, where we minimize the $\ell_1$ norm of the measurement residual to promote outlier sparsity and, meanwhile, drop the rank constraint:

$$\hat{X} = \operatorname*{argmin}_{X \succeq 0} \; \|z - \mathcal{A}(X)\|_1. \tag{6}$$
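Since $a_i^T X a_i$ is linear in $X$, (6) is a semidefinite program and can be prototyped directly in an off-the-shelf modeling tool. The sketch below uses CVXPY, which is an assumption on our part; the paper does not specify a solver.

```python
import cvxpy as cp
import numpy as np

def solve_l1(z, A):
    """A minimal sketch of (6): minimize ||z - A(X)||_1 over PSD X."""
    m, n = A.shape
    X = cp.Variable((n, n), PSD=True)    # X >= 0 is the only constraint
    meas = cp.hstack([cp.quad_form(A[i], X) for i in range(m)])  # a_i^T X a_i
    cp.Problem(cp.Minimize(cp.norm(z - meas, 1))).solve()
    return X.value
```

Note that the objective is a plain $\ell_1$ residual and the only constraint is the semidefinite cone, so there is indeed nothing to tune.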

This algorithm coincides with the PhaseLift algorithm studied in [6]–[8] for phase retrieval. The advantage of this formulation is that it requires no prior knowledge of the noise bound, the rank of $X_0$, or the sparsity level of the outliers, and it involves no regularization parameter. Encouragingly, we prove that the algorithm (6) admits robust recovery of a rank-$r$ PSD matrix as soon as the number of measurements is large enough, even with a fraction of arbitrary outliers. To the best of our knowledge, this is the first proof of the robustness of (6) with respect to arbitrary outliers in the low-rank setting. Our main theorem is given below.

Theorem 1. Suppose that $\|w\|_1 \le \epsilon$. Assume the support of $\beta$ is selected uniformly at random, with the signs of its nonzero entries generated from the Rademacher distribution as $\mathbb{P}\{\operatorname{sgn}(\beta_i) = -1\} = \mathbb{P}\{\operatorname{sgn}(\beta_i) = 1\} = 1/2$ for each $i \in \operatorname{supp}(\beta)$. Then for a fixed rank-$r$ PSD matrix $X_0 \in \mathbb{R}^{n \times n}$, there exist absolute constants $c_1 > 0$ and $0 < s_0 < 1$ such that, as long as

$$m \ge c_1 n r^2, \qquad s \le \frac{s_0}{r},$$

the solution to (6) satisfies

$$\left\| \hat{X} - X_0 \right\|_F \le c_2 \frac{r}{m} \epsilon,$$

with probability exceeding $1 - \exp(-\gamma m / r^2)$ for some constants $c_2$ and $\gamma$.

Theorem 1 has the following consequences.

• Exact Recovery with Outliers: When $\epsilon = 0$, Theorem 1 suggests the recovery is exact, i.e., $\hat{X} = X_0$, even when a fraction of the measurements is arbitrarily corrupted, as long as the number of measurements $m$ is on the order of $nr^2$. Given that there are at least $nr$ unknowns, our measurement complexity is near-optimal up to a factor of $r$.

• Stable Recovery with Bounded Noise: In the presence of bounded noise, Theorem 1 suggests that the recovery performance degrades gracefully as $\epsilon$ increases, where the Frobenius norm of the reconstruction error is proportional to the per-entry noise level of the measurements up to a factor of $r$.

• Phase Retrieval: When $r = 1$, the problem degenerates to phase retrieval, and Theorem 1 recovers the existing results in [8] for outlier-robust phase retrieval, where the measurement complexity is on the order of $n$, optimal up to a scaling factor.

B. Comparisons to Related Work

In the absence of outliers, the PhaseLift algorithm in the following form,

$$\min_{X \succeq 0} \; \operatorname{Tr}(X) \quad \text{s.t.} \quad \|z - \mathcal{A}(X)\|_1 \le \epsilon, \tag{7}$$

where $\operatorname{Tr}(X)$ denotes the trace of $X$, has been proposed to solve the phase retrieval problem [2], [3], [6]. Later, the same algorithm was employed to recover low-rank PSD matrices in [5], where $O(nr)$ measurements obtained from i.i.d. sub-Gaussian sensing vectors are shown to guarantee exact recovery in the noise-free case and stable recovery with bounded noise. One problem with the algorithm (7) is that the noise bound $\epsilon$ is assumed known. Furthermore, it is not amenable to handling outliers, since $\|z - \mathcal{A}(X_0)\|_1$ can be arbitrarily large with outliers, and consequently the ground truth $X_0$ quickly becomes infeasible for (7).

The proposed algorithm (6) is studied in [6]–[8] as a variant of PhaseLift for phase retrieval, corresponding to the case where $X_0 = x_0 x_0^T$ is rank-one. It is shown in [6], [7] that with $O(n)$ i.i.d. Gaussian sensing vectors, the algorithm succeeds with high probability. Compared with (7), the algorithm (6) eliminates trace minimization and leads to easier implementations. We note that [9] also considers a regularization-free algorithm for PSD matrix estimation that minimizes the $\ell_2$ norm of the residual, which, unfortunately, cannot handle outliers as (6) does. Hand [8] first considered the robustness of the algorithm (6) in the presence of outliers for phase retrieval, establishing that the same guarantee holds even with a constant fraction of outliers. Our work extends the performance guarantee in [8] to the general low-rank PSD matrix case.

Standard approaches for separating low-rank and sparse components [10]–[13] via convex optimization are given as

$$\min_{X \succeq 0, \, \beta} \; \operatorname{Tr}(X) + \lambda \|\beta\|_1, \quad \text{s.t.} \quad \|z - \mathcal{A}(X) - \beta\|_1 \le \epsilon,$$

where $\lambda$ is a regularization parameter that requires proper tuning. In contrast, the formulation (6) is parameter-free.

III. A NON-CONVEX SUBGRADIENT DESCENT ALGORITHM

When the rank of the PSD matrix $X_0$ is known a priori, we can decompose $X_0$ as $X_0 = U_0 U_0^T$, where $U_0 \in \mathbb{R}^{n \times r}$. Instead of directly recovering $X_0$, we may aim at recovering $U_0$ up to orthogonal transforms, since $(U_0 Q)(U_0 Q)^T = U_0 U_0^T$ for any orthonormal matrix $Q \in \mathbb{R}^{r \times r}$. Consider the relaxation of (5) that keeps the rank constraint:

$$\hat{X} = \operatorname*{argmin}_{X \succeq 0} \; \|z - \mathcal{A}(X)\|_1, \quad \text{s.t.} \quad \operatorname{rank}(X) = r.$$

Since any rank-$r$ PSD matrix $X$ can be written as $X = U U^T$ for some $U \in \mathbb{R}^{n \times r}$, we can equivalently rewrite the above as

$$\hat{U} = \operatorname*{argmin}_{U \in \mathbb{R}^{n \times r}} f(U), \tag{8}$$

where

$$f(U) = \frac{1}{m} \left\| z - \mathcal{A}(U U^T) \right\|_1 = \frac{1}{m} \sum_{i=1}^{m} \left| z_i - \left\| U^T a_i \right\|_2^2 \right|;$$

thus, the algorithm (8) is no longer convex. The first row of Fig. 1 illustrates the objective function on the negative logarithmic scale, $-\log f(U)$, under different corruption scenarios with $U \in \mathbb{R}^{2 \times 1}$. For comparison, the second row of Fig. 1 shows the objective function evaluated in the squared $\ell_2$ norm, $g(U) = \|z - \mathcal{A}(U U^T)\|_2^2$, which is obviously not robust to outliers.

Fig. 1: Illustrations of the objective function $-\log f(U)$ and its $\ell_2$-norm counterpart $-\log g(U)$ (in negative logarithmic scales) under different corruption scenarios (no outliers, modest outlier amplitudes, large outlier amplitudes) when $U \in \mathbb{R}^{2 \times 1}$. The number of measurements is $m = 100$ with i.i.d. Gaussian sensing vectors, and the fraction of outliers is $s = 0.2$ with uniformly selected support and amplitudes drawn from Unif[0, 10] or Unif[0, 100]. It is interesting to observe that while large outliers completely distort $g(U)$, the proposed objective is quite robust, with the ground truth being the only global optimum of $f(U)$.
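The landscape comparison of Fig. 1 is easy to reproduce numerically. The following sketch evaluates $f$ and $g$ on a grid for $u \in \mathbb{R}^2$ under the large-outlier scenario; the grid range and resolution are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 100, 0.2
u_true = rng.standard_normal(2)              # ground truth, X0 = u u^T
A = rng.standard_normal((m, 2))
z = (A @ u_true) ** 2
out = rng.choice(m, int(s * m), replace=False)
z[out] += rng.uniform(0, 100, out.size)      # large-amplitude outliers

def f(u):   # l1 objective of (8)
    return np.abs(z - (A @ u) ** 2).mean()

def g(u):   # squared-l2 objective, not robust to outliers
    return np.mean((z - (A @ u) ** 2) ** 2)

grid = np.linspace(-3, 3, 121)
F = np.array([[f(np.array([x, y])) for x in grid] for y in grid])
G = np.array([[g(np.array([x, y])) for x in grid] for y in grid])
# plotting -np.log(F) and -np.log(G) reproduces the two rows of Fig. 1
```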

Motivated by recent non-convex approaches [14]–[16] for solving quadratic systems, we propose a subgradient descent algorithm to solve (8) effectively, working with the non-smooth function $f(U)$. Note that a subgradient of $f(U)$ with respect to $U$ can be given as

$$\partial f(U) = -\frac{1}{m} \sum_{i=1}^{m} \operatorname{sgn}\left( z_i - \left\| U^T a_i \right\|_2^2 \right) a_i a_i^T U, \tag{9}$$

where the sign function $\operatorname{sgn}(\cdot)$ is defined as

$$\operatorname{sgn}(x) = \begin{cases} +1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}$$

Our subgradient descent algorithm proceeds as follows. Denote the estimate in the $t$th iteration by $U^{(t)} \in \mathbb{R}^{n \times r}$. First, we initialize $U^{(0)}$ as the best rank-$r$ approximation of the following matrix:

$$U^{(0)} (U^{(0)})^T = \mathcal{P}_r \left( \frac{1}{m} \sum_{i=1}^{m} z_i a_i a_i^T \right), \tag{10}$$

where $\mathcal{P}_r(Z) := \operatorname*{argmin}_{X: \operatorname{rank}(X) = r} \|X - Z\|_F^2$ denotes the projection of $Z$ onto the closest rank-$r$ matrix in the Frobenius norm. Second, at the $(t+1)$th iteration, $t \ge 0$, we apply subgradient descent to refine the estimate as

$$U^{(t+1)} = U^{(t)} - \mu_t \cdot \partial f(U^{(t)}) = U^{(t)} + \frac{\mu_t}{m} \sum_{i=1}^{m} \operatorname{sgn}\left( z_i - \left\| (U^{(t)})^T a_i \right\|_2^2 \right) a_i a_i^T U^{(t)}, \tag{11}$$

where the step size $\mu_t$ is adaptively set as $\mu_t = \mu f(U^{(t)})$, with $\mu$ being some small constant. This is because the subgradient depends only on the signs of the errors, not on their amplitudes; the step size is therefore selected to reflect the magnitude of the current residual $f(U^{(t)})$. The procedure is summarized in Alg. 1. In the numerical simulations, the default value of $\mu$ is set as 0.1. The stopping rule in Alg. 1 is simply a maximum number of iterations; in practice, we can also examine the difference of the residuals between consecutive iterations, and stop when the difference is negligible.

Algorithm 1: Subgradient descent for solving (8)
  Parameters: rank $r$, number of iterations $T_{\max}$, and step size $\mu_t$;
  Input: measurements $z$, and sensing vectors $\{a_i\}_{i=1}^m$;
  Initialization: initialize $U^{(0)} \in \mathbb{R}^{n \times r}$ via (10);
  for $t = 0 : T_{\max} - 1$ do
    update $U^{(t+1)}$ via (11);
  end for
  Output: $\hat{U} = U^{(T_{\max})}$.

The main advantage of Alg. 1 is its low memory and computational complexity. Since it is not necessary to construct the full PSD matrix, the memory complexity is simply the size of $U^{(t)}$, which is $O(nr)$ (we do not count the storage of the sensing vectors here). The computational complexity per iteration is also low, on the order of $O(mnr)$, which is linear in all the parameters. We demonstrate the excellent empirical performance of Alg. 1 in Section IV-B.
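A direct NumPy transcription of Alg. 1 is given below, assuming the measurement model of Section II. The eigendecomposition-based rank-$r$ projection in the initialization and the clipping of negative eigenvalues are our implementation choices for realizing (10); they are a sketch, not the authors' reference code.

```python
import numpy as np

def subgradient_descent(z, A, r, mu=0.1, T_max=1000):
    """A sketch of Alg. 1: init via (10), then subgradient updates (11)."""
    m, n = A.shape
    # (10): best rank-r approximation of (1/m) sum_i z_i a_i a_i^T
    M = (A.T * z) @ A / m
    vals, vecs = np.linalg.eigh(M)                         # ascending eigenvalues
    U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))   # clip negatives
    for _ in range(T_max):
        res = z - np.sum((A @ U) ** 2, axis=1)   # z_i - ||U^T a_i||_2^2
        mu_t = mu * np.abs(res).mean()           # adaptive step mu_t = mu * f(U)
        # (11): U <- U + (mu_t/m) sum_i sgn(res_i) a_i a_i^T U
        U = U + (mu_t / m) * (A.T * np.sign(res)) @ (A @ U)
    return U
```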


IV. NUMERICAL EXAMPLES

A. Performance of Convex Relaxation

We first consider the performance of (6). Let $n = 40$. We randomly generate a low-rank PSD matrix of rank $r$ as $X_0 = U U^T$, where $U \in \mathbb{R}^{n \times r}$ is composed of i.i.d. standard Gaussian variables. The sensing vectors are also composed of i.i.d. standard Gaussian variables. Each Monte Carlo simulation is deemed successful if $\|\hat{X} - X_0\|_F / \|X_0\|_F \le 10^{-3}$, where $\hat{X}$ denotes the solution of (6). For each cell, the success rate is calculated by averaging over 10 Monte Carlo simulations. Fig. 2 shows the success rate of the proposed algorithm with respect to the number of measurements and the rank, (a) with trace minimization as in (7), and (b) without trace minimization as proposed in (6). It can be seen that the performance of the two algorithms is almost equivalent, confirming that a similar numerical observation for the phase retrieval problem [4] also holds in the low-rank setting. This also suggests possible room for improvement in our theoretical guarantee, as the numerical results indicate that the required measurement complexity for successful recovery has a seemingly linear relationship with $r$.
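A sketch of the success-rate protocol for a single cell of the phase-transition plots is given below; it relies on the hypothetical helpers generate_measurements and solve_l1 from the earlier sketches, with the trial count and tolerance taken from the setup above.

```python
import numpy as np

def success_rate(n=40, r=2, m=300, trials=10, tol=1e-3):
    """Fraction of Monte Carlo trials with relative error below tol."""
    wins = 0
    for t in range(trials):
        z, A, X0 = generate_measurements(n=n, r=r, m=m, s=0.0, noise=0.0, rng=t)
        X_hat = solve_l1(z, A)
        rel_err = np.linalg.norm(X_hat - X0, 'fro') / np.linalg.norm(X0, 'fro')
        wins += (rel_err <= tol)
    return wins / trials
```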


Fig. 2: Phase transitions for PSD matrix recovery with respect to the number of measurements and the rank, (a) with trace minimization, and (b) without trace minimization, from noise-free measurements, when $n = 40$.

Fig. 3 further shows the success rate of the proposed algorithm (a) with respect to the number of measurements and the rank, when 5% of the measurements are selected uniformly at random and corrupted by arbitrary standard Gaussian variables, and (b) with respect to the percent of outliers and the rank, for a fixed number of measurements $m = 400$.

We next consider robust recovery of Toeplitz PSD matrices, where we allow complex-valued sensing vectors, $\mathcal{A}(X) = \{a_i^H X a_i\}_{i=1}^m$, and complex-valued Toeplitz PSD matrices $X$. We modify (6) by incorporating the Toeplitz constraint as

$$\hat{X} = \operatorname*{argmin}_{X \succeq 0} \; \|z - \mathcal{A}(X)\|_1, \quad \text{s.t.} \quad X \text{ is Toeplitz}. \tag{12}$$

Let $n = 64$. The Toeplitz PSD matrix $X_0$ is generated as $X_0 = V \Sigma V^H$, where $V = [v(f_1), \ldots, v(f_r)] \in \mathbb{C}^{n \times r}$ is a Vandermonde matrix with $v(f_i) = [1, e^{j2\pi f_i}, \ldots, e^{j2\pi(n-1) f_i}]^T$, $f_i \sim \text{Unif}[0, 1]$, and $\Sigma = \operatorname{diag}[\sigma_1^2, \ldots, \sigma_r^2]$ with $\sigma_i^2 \sim \text{Unif}[0, 1]$.
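The Toeplitz experiment can be sketched as follows, where the Toeplitz structure is imposed through the standard shift-invariance constraint $X_{2:n,2:n} = X_{1:n-1,1:n-1}$. As before, the use of CVXPY and the helper names are our own assumptions rather than the paper's implementation.

```python
import numpy as np
import cvxpy as cp

def toeplitz_psd_instance(n=64, r=4, rng=None):
    """Generate X0 = V diag(sigma_i^2) V^H with a random Vandermonde V."""
    rng = np.random.default_rng(rng)
    f = rng.uniform(0, 1, r)
    V = np.exp(2j * np.pi * np.outer(np.arange(n), f))   # columns v(f_i)
    return V @ np.diag(rng.uniform(0, 1, r)) @ V.conj().T

def solve_l1_toeplitz(z, A):
    """A sketch of (12) with complex sensing vectors, z_i = a_i^H X a_i."""
    m, n = A.shape
    X = cp.Variable((n, n), hermitian=True)
    constraints = [X >> 0, X[1:, 1:] == X[:-1, :-1]]     # PSD and Toeplitz
    meas = cp.hstack([cp.real(A[i].conj() @ X @ A[i]) for i in range(m)])
    cp.Problem(cp.Minimize(cp.norm(z - meas, 1)), constraints).solve()
    return X.value
```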

Fig. 3: Phase transitions of PSD matrix recovery with respect to (a) the number of measurements and the rank, with 5% of the measurements corrupted by arbitrary standard Gaussian variables, and (b) the percent of outliers and the rank, when the number of measurements is $m = 400$, for $n = 40$.

Fig. 4 shows the phase transitions of Toeplitz PSD matrix recovery with respect to the number of measurements and the rank, without outliers in (a), and when 5% of the measurements are selected uniformly at random and corrupted by arbitrary standard Gaussian variables in (b).


Fig. 4: Phase transitions of Toeplitz PSD matrix recovery with respect to the number of measurements and the rank, (a) without outliers, and (b) with 5% of the measurements corrupted by arbitrary standard Gaussian variables, when $n = 64$.

B. Performance of Non-Convex Subgradient Descent

We next consider the performance of the non-convex subgradient descent algorithm in Alg. 1 under the same setup as Fig. 2. In Alg. 1, the number of iterations is set as $T_{\max} = 6 \times 10^4$ ($T_{\max}$ is set to a large value to guarantee convergence upon termination) and $\mu = 0.1$. Denote the solution to Alg. 1 by $\hat{U}$; each Monte Carlo simulation is deemed successful if $\|\hat{U}\hat{U}^T - X_0\|_F / \|X_0\|_F \le 10^{-6}$. For each cell, the success rate is calculated by averaging over 50 Monte Carlo simulations. Fig. 5 (a) shows the success rate of Alg. 1 with respect to the number of measurements and the rank, when $n = 100$. Indeed, empirically the algorithm succeeds as soon as the number of measurements is on the order of $nr$.

We also compare against the extension of the Wirtinger Flow (WF) algorithm in [14], [16] to the low-rank case, which minimizes the squared $\ell_2$ norm of the residual, with the update rule

$$U^{(t+1)} = U^{(t)} + \frac{\mu_t^{\mathrm{WF}}}{m} \sum_{i=1}^{m} \left( z_i - \left\| (U^{(t)})^T a_i \right\|_2^2 \right) a_i a_i^T U^{(t)},$$

using the same initialization (10). The step size is set as $\mu_t^{\mathrm{WF}} = 0.1$. Fig. 5 (b) depicts the success rate of the WF algorithm under the same conditions as Fig. 5 (a). Both algorithms achieve comparable performance with noise-free observations. However, the proposed Alg. 1 allows perfect recovery even in the presence of outliers, while the WF algorithm fails. Fig. 6 (a) shows the success rate of Alg. 1 with respect to the percent of outliers and the rank, under the same setup as Fig. 3 (b), where the performance is similar to its convex counterpart (6). In contrast, the WF algorithm performs poorly even with a few outliers, as shown in its success rate plot in Fig. 6 (b).

Fig. 5: Phase transitions of PSD matrix recovery with respect to the number of measurements and the rank by (a) the proposed Alg. 1 with the $\ell_1$-norm objective, and (b) the WF algorithm with the squared $\ell_2$-norm objective, when $n = 100$.

Fig. 6: Phase transitions of PSD matrix recovery with respect to the percent of outliers and the rank by (a) the proposed Alg. 1, and (b) the WF algorithm, when $n = 40$ and $m = 400$.

Under the same setup as Fig. 5, Fig. 7 (a) further shows the convergence rate of the proposed Alg. 1 for different step sizes $\mu$, when $r = 1$. It can be seen that a larger step size leads to faster convergence. Fig. 7 (b) compares the convergence rates of Alg. 1 and the WF algorithm in the presence of 20% outliers, where the proposed Alg. 1 converges to the ground truth while the WF algorithm fails.

Fig. 7: (a) Convergence rate of Alg. 1 for rank-1 recovery with different step sizes; and (b) convergence rates of Alg. 1 and the WF algorithm in the presence of 20% outliers, when $n = 40$ and $m = 400$.

V. PROOF OF MAIN THEOREM

In this section, we prove Theorem 1; the roadmap of our proof is as follows. In Section V-A, we first provide


the sufficient conditions for an approximate dual certificate that certifies the optimality of the proposed algorithm (6), in Lemma 1. Section V-B records a few lemmas showing that $\mathcal{A}$ satisfies the required restricted isometry properties. Then, a dual certificate is constructed and validated for a fixed low-rank PSD matrix $X_0$ in Section V-C. Finally, the proof is concluded in Section V-D.

First, we introduce some additional notation. Denote the adjoint operator of $\mathcal{A}$ by $\mathcal{A}^*(\mu) = \sum_{i=1}^{m} \mu_i a_i a_i^T$. We use $\|X\|$, $\|X\|_F$ and $\|X\|_1$ to denote the spectral norm, the Frobenius norm and the nuclear norm of the matrix $X$, respectively, and use $\|x\|_p$ to denote the $\ell_p$ norm of the vector $x$. Let the singular value decomposition of the fixed rank-$r$ PSD matrix $X_0$ be $X_0 = U \Lambda U^T$; then the symmetric tangent space $T$ at $X_0$ is denoted by

$$T := \left\{ U Z^T + Z U^T \mid Z \in \mathbb{R}^{n \times r} \right\}.$$

We denote by $\mathcal{P}_T$ and $\mathcal{P}_{T^\perp}$ the orthogonal projections onto $T$ and its orthogonal complement, respectively. For notational simplicity, we write $H_T := \mathcal{P}_T(H)$ and $H_{T^\perp} := H - \mathcal{P}_T(H)$ for any symmetric matrix $H \in \mathbb{R}^{n \times n}$. Moreover, $\gamma$, $c$, $c_1$ and $c_2$ represent absolute constants, whose values may change from line to line.
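As a quick sanity check on the notation, the following sketch verifies the adjoint pairing $\langle \mathcal{A}(X), \mu \rangle = \langle X, \mathcal{A}^*(\mu) \rangle$ numerically; the dimensions are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 30
A = rng.standard_normal((m, n))                 # rows are a_i^T
X = rng.standard_normal((n, n)); X = X + X.T    # arbitrary symmetric matrix
mu = rng.standard_normal(m)
AX = np.einsum('mi,ij,mj->m', A, X, A)          # forward map A(X)_i = a_i^T X a_i
A_star_mu = (A.T * mu) @ A                      # adjoint A*(mu) = sum_i mu_i a_i a_i^T
assert np.isclose(AX @ mu, np.trace(A_star_mu @ X))   # <A(X), mu> = <X, A*(mu)>
```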

A. Approximate Dual Certificate

The following lemma suggests that, under appropriate restricted isometry properties of $\mathcal{A}$, a properly constructed dual certificate guarantees faithful recovery by the proposed algorithm (6).

Lemma 1 (Approximate Dual Certificate for (6)). Denote a subset $S$ with $|S| := \left\lceil \frac{s_0}{13\sqrt{2}\, r} m \right\rceil$, where $0 < s_0 < 1$ is some constant, and suppose the support of $\beta$ satisfies $\operatorname{supp}(\beta) \subseteq S$. Suppose that the mapping $\mathcal{A}$ obeys, for all symmetric matrices $X$,

$$\frac{1}{m} \|\mathcal{A}(X)\|_1 \le \left(1 + \frac{1}{10}\right) \|X\|_1, \tag{13}$$

$$\frac{1}{|S|} \|\mathcal{A}_S(X)\|_1 \le \left(1 + \frac{1}{10}\right) \|X\|_1, \tag{14}$$

and, for all matrices $X \in T$,

$$\frac{1}{|S^\perp|} \|\mathcal{A}_{S^\perp}(X)\|_1 > \frac{1}{5} \left(1 - \frac{1}{12}\right) \|X\|_F, \tag{15}$$

where $\mathcal{A}_S$ and $\mathcal{A}_{S^\perp}$ are the operator $\mathcal{A}$ constrained to $S$ and $S^\perp$, respectively. Then, if there exists a matrix $Y = \mathcal{A}^*(\mu)$ that satisfies

$$Y_{T^\perp} \preceq -\frac{1}{r} I_{T^\perp}, \qquad \|Y_T\|_F \le \frac{1}{13r}, \tag{16}$$

and

$$\begin{cases} \mu_i = \frac{9}{m} \operatorname{sgn}(\beta_i), & i \in \operatorname{supp}(\beta), \\ |\mu_i| \le \frac{9}{m}, & i \notin \operatorname{supp}(\beta), \end{cases} \tag{17}$$

the solution to (6) satisfies

$$\left\| \hat{X} - X_0 \right\|_F \le c \frac{r}{m} \epsilon,$$

where $c$ is a constant.

Proof: Denote the solution of (6) by $\hat{X} = X_0 + H \ne X_0$; then $\hat{X} \succeq 0$, $H_{T^\perp} \succeq 0$, and furthermore,

$$\|\mathcal{A}(H) - (\beta + w)\|_1 = \|z - \mathcal{A}(X_0 + H)\|_1 = \|z - \mathcal{A}(\hat{X})\|_1 \le \|z - \mathcal{A}(X_0)\|_1 = \|\beta + w\|_1.$$

Since $\|\mathcal{A}(H) - (\beta + w)\|_1 = \|\mathcal{A}_S(H) - \beta - w_S\|_1 + \|\mathcal{A}_{S^\perp}(H) - w_{S^\perp}\|_1$ and $\|\beta + w\|_1 = \|\beta + w_S\|_1 + \|w_{S^\perp}\|_1$, we have

$$\|\mathcal{A}_{S^\perp}(H)\|_1 \le \|\mathcal{A}_{S^\perp}(H) - w_{S^\perp}\|_1 + \|w_{S^\perp}\|_1 \le \|\beta + w\|_1 - \|\mathcal{A}_S(H) - \beta - w_S\|_1 + \|w_{S^\perp}\|_1 \le \|\beta + w_S\|_1 - \|\mathcal{A}_S(H) - \beta - w_S\|_1 + 2\|w_{S^\perp}\|_1 \le \|\mathcal{A}_S(H)\|_1 + 2\|w_{S^\perp}\|_1,$$

Since kA(H)−(β+w)k1 = kAS (H)−β−wS k1 +kAS ⊥ (H)−wS ⊥ k1 , and kβ + wk1 = kβ + wS k1 + kwS ⊥ k1 , we have kAS ⊥ (H)k1 ≤ kAS ⊥ (H) − wS ⊥ k1 + kwS ⊥ k1 ≤ kβ + wk1 − kAS (H) − β − wS k1 + kwS ⊥ k1 ≤ kβ + wS k1 − kAS (H) − β − wS k1 + 2kwS ⊥ k1 ≤ kAS (H)k1 + 2kwS ⊥ k1 ,


where the last inequality follows from the triangle inequality. We can further bound

$$\|\mathcal{A}_{S^\perp}(H_T)\|_1 \le \|\mathcal{A}_{S^\perp}(H)\|_1 + \|\mathcal{A}_{S^\perp}(H_{T^\perp})\|_1 \le \|\mathcal{A}_S(H)\|_1 + \|\mathcal{A}_{S^\perp}(H_{T^\perp})\|_1 + 2\|w_{S^\perp}\|_1 \le \|\mathcal{A}_S(H_T)\|_1 + \|\mathcal{A}_S(H_{T^\perp})\|_1 + \|\mathcal{A}_{S^\perp}(H_{T^\perp})\|_1 + 2\|w_{S^\perp}\|_1 = \|\mathcal{A}_S(H_T)\|_1 + \|\mathcal{A}(H_{T^\perp})\|_1 + 2\|w_{S^\perp}\|_1. \tag{18}$$

Our assumptions on $\mathcal{A}$ imply that

$$\left(1 + \frac{1}{10}\right) \operatorname{Tr}(H_{T^\perp}) \ge \frac{1}{m} \|\mathcal{A}(H_{T^\perp})\|_1 \ge \frac{1}{m} \left( \|\mathcal{A}_{S^\perp}(H_T)\|_1 - \|\mathcal{A}_S(H_T)\|_1 - 2\|w_{S^\perp}\|_1 \right) \ge \frac{|S^\perp|}{5m} \left(1 - \frac{1}{12}\right) \|H_T\|_F - \frac{|S|}{m} \left(1 + \frac{1}{10}\right) \|H_T\|_1 - \frac{2\epsilon}{m},$$

where the first inequality follows from (13), the second from (18), and the last from (14) and (15). This gives

$$\operatorname{Tr}(H_{T^\perp}) \ge \left( \frac{|S^\perp|}{6m} - \frac{|S|}{m} \sqrt{2r} \right) \|H_T\|_F - \frac{2\epsilon}{m}, \tag{19}$$

where we use the inequality $\|H_T\|_1 \le \sqrt{2r} \|H_T\|_F$. On the other hand, since $\mu / (9/m)$ is a subgradient of the $\ell_1$ norm at $\beta$ by (17), we have

$$\|\beta\|_1 + \left\langle \frac{m}{9} \mu, \, w - \mathcal{A}(H) \right\rangle \le \|w + \beta - \mathcal{A}(H)\|_1 \le \|\beta + w\|_1 \le \|\beta\|_1 + \|w\|_1,$$

which, by a simple transformation, gives

$$\langle \mu, \mathcal{A}(H) \rangle \ge \langle \mu, w \rangle - \frac{9}{m} \|w\|_1 \ge -\left( \|\mu\|_\infty + \frac{9}{m} \right) \|w\|_1 \ge -\frac{18\epsilon}{m}.$$

Then, with $\langle H, Y \rangle = \langle \mathcal{A}(H), \mu \rangle$, we get

$$-\frac{18\epsilon}{m} \le \langle \mathcal{A}(H), \mu \rangle = \langle H, Y \rangle = \langle H_T, Y_T \rangle + \langle H_{T^\perp}, Y_{T^\perp} \rangle \le \|Y_T\|_F \|H_T\|_F - \frac{1}{r} \langle H_{T^\perp}, I_{T^\perp} \rangle \le \frac{1}{13r} \|H_T\|_F - \frac{1}{r} \operatorname{Tr}(H_{T^\perp}),$$

which gives

$$\operatorname{Tr}(H_{T^\perp}) \le \frac{1}{13} \|H_T\|_F + \frac{18 r \epsilon}{m}. \tag{20}$$

Combining with (19), we know

$$\left( \frac{|S^\perp|}{6m} - \frac{|S|}{m} \sqrt{2r} \right) \|H_T\|_F - \frac{2\epsilon}{m} \le \frac{1}{13} \|H_T\|_F + \frac{18 r \epsilon}{m}.$$

Since $\frac{|S^\perp|}{6m} - \frac{|S|}{m} \sqrt{2r} - \frac{1}{13} > 0$ under the assumption on $|S|$ in Lemma 1, we have

$$\|H_T\|_F \le \frac{20 r \epsilon / m}{\frac{|S^\perp|}{6m} - \frac{|S|}{m} \sqrt{2r} - \frac{1}{13}} \le c_1 \frac{r}{m} \epsilon,$$

where $c_1$ is some fixed constant. Finally,

$$\|\hat{X} - X_0\|_F \le \|H_T\|_F + \|H_{T^\perp}\|_F \le \|H_T\|_F + \operatorname{Tr}(H_{T^\perp}) \le \left(1 + \frac{1}{13}\right) \|H_T\|_F + \frac{18 r \epsilon}{m} \le c \frac{r}{m} \epsilon,$$

for some constant $c$.

B. Restricted Isometry of $\mathcal{A}$

The first two conditions (13) and (14) in Lemma 1 are supplied straightforwardly by the following lemma, as long as $m \ge c n r$ and $|S| = c_1 m / r \ge c_2 n$ for some constants $c$, $c_1$ and $c_2$.

Lemma 2 ([2]). Fix any $\delta \in (0, \frac{1}{2})$ and assume $m \ge 20 \delta^{-2} n$. Then for all PSD matrices $X$, one has

$$(1 - \delta) \|X\|_1 \le \frac{1}{m} \|\mathcal{A}(X)\|_1 \le (1 + \delta) \|X\|_1$$

with probability exceeding $1 - 2e^{-m\epsilon^2/2}$, where $\epsilon^2 + \epsilon = \frac{\delta^2}{4}$. The right-hand side holds for all symmetric matrices.

The third condition (15) in Lemma 1 can be obtained using the mixed-norm RIP-$\ell_2/\ell_1$ provided in [5], as long as $m \ge c n r$ and $|S| \le c_1 m$ for some constants $c$ and $c_1$.

Lemma 3 ([5]). Suppose the sensing vectors $a_i$ are composed of i.i.d. sub-Gaussian entries. Then there exist positive universal constants $c_1$, $c_2$, $c_3$ such that, provided $m > c_3 n r$, for all matrices $X$ of rank at most $r$, one has

$$\left(1 - \delta_r^{lb}\right) \|X\|_F \le \frac{2}{m} \|\mathcal{B}(X)\|_1 \le \left(1 + \delta_r^{ub}\right) \|X\|_F,$$

with probability exceeding $1 - c_1 e^{-c_2 m}$, where $\delta_r^{lb}$ and $\delta_r^{ub}$ are defined as the RIP-$\ell_2/\ell_1$ constants, and the operator $\mathcal{B}$ represents the linear transformation that maps $X \in \mathbb{R}^{n \times n}$ to $\{\mathcal{B}_i(X)\}_{i=1}^{m/2} \in \mathbb{R}^{m/2}$, where $\mathcal{B}_i(X) := \langle a_{2i-1} a_{2i-1}^T - a_{2i} a_{2i}^T, X \rangle$.

The third condition (15) can then be easily validated from the lower bound by setting $\delta_r^{lb}$ appropriately, since

$$\frac{2}{m} \|\mathcal{B}(X)\|_1 \le \frac{2}{m} \sum_{i=1}^{m/2} \left( \left| \langle a_{2i-1} a_{2i-1}^T, X \rangle \right| + \left| \langle a_{2i} a_{2i}^T, X \rangle \right| \right) = \frac{2}{m} \|\mathcal{A}(X)\|_1.$$

C. Construction of the Dual Certificate

For notational simplicity, let $\alpha_0 := \mathbb{E}[Z^2 1_{\{|Z| \le 3\}}] \approx 0.9707$, $\beta_0 := \mathbb{E}[Z^4 1_{\{|Z| \le 3\}}] \approx 2.6728$ and $\theta_0 := \mathbb{E}[Z^6 1_{\{|Z| \le 3\}}] \approx 11.2102$ for a standard Gaussian random variable $Z$, where $1_E$ is the indicator function of an event $E$.
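These truncated-moment constants can be checked numerically; a small sketch (assuming SciPy is available) is:

```python
from scipy import integrate, stats

def truncated_moment(p, t=3.0):
    """E[Z^p 1{|Z| <= t}] for a standard Gaussian Z, by numerical quadrature."""
    val, _ = integrate.quad(lambda x: x**p * stats.norm.pdf(x), -t, t)
    return val

print(truncated_moment(2))   # alpha_0 ~ 0.9707
print(truncated_moment(4))   # beta_0  ~ 2.6728
print(truncated_moment(6))   # theta_0 ~ 11.2102
```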

Consider that the singular value decomposition of a PSD matrix $X_0$ of rank at most $r$ can be represented as $X_0 = \sum_{i=1}^{r} \lambda_i u_i u_i^T$. Then, inspired by [6], [8], we construct $Y$ as

$$Y := \frac{1}{m} \sum_{j \in S^\perp} \left[ \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) \right] a_j a_j^T + \frac{9}{m} \sum_{j \in S} \chi_j a_j a_j^T := Y^{(0)} - Y^{(1)} + Y^{(2)},$$

where

$$Y^{(0)} = \frac{1}{m} \sum_{j \in S^\perp} \left[ \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} \right] a_j a_j^T, \quad Y^{(1)} = \frac{1}{m} \sum_{j \in S^\perp} \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) a_j a_j^T, \quad Y^{(2)} = \frac{9}{m} \sum_{j \in S} \chi_j a_j a_j^T.$$

We set $\chi_j = \operatorname{sgn}(\beta_j)$ if $j \in \operatorname{supp}(\beta)$; otherwise the $\chi_j$ are i.i.d. Rademacher random variables with $\mathbb{P}\{\chi_j = 1\} = \mathbb{P}\{\chi_j = -1\} = 1/2$. The construction immediately indicates that $Y$ satisfies (17). We will show that $Y$ satisfies (16) with high probability. In what follows, we separate the constructed $Y$ into two parts and bound $Y^{(0)} - Y^{(1)}$ and $Y^{(2)}$ respectively.

1) Proof of $Y_{T^\perp} + \frac{1}{r} I_{T^\perp} \preceq 0$: First, by standard results in random matrix theory [17, Corollary 5.35], we have

$$\left\| \frac{m}{|S^\perp|} Y^{(1)} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) I \right\| \le \frac{\beta_0}{40r},$$

with probability at least $1 - 2e^{-\gamma |S^\perp| / r^2}$ for some constant $\gamma$, provided $|S^\perp| \ge c n r^2$ for some constant $c$. In particular, this gives

$$\left\| \frac{m}{|S^\perp|} Y^{(1)}_{T^\perp} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) I_{T^\perp} \right\| \le \frac{\beta_0}{40r}. \tag{21}$$

Let $a_j' = (I - U U^T) a_j$ be the projection of $a_j$ onto the orthogonal complement of the column space of $U$; then we have

$$Y^{(0)}_{T^\perp} = \frac{1}{m} \sum_{j \in S^\perp} \epsilon_j \epsilon_j^T,$$

where $\epsilon_j = \left( \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} \right)^{1/2} a_j'$ are i.i.d. copies of a zero-mean, isotropic and sub-Gaussian random vector $\epsilon$, which satisfies $\mathbb{E}[\epsilon \epsilon^T] = \alpha_0 I_{T^\perp}$. Following [17, Theorem 5.39], we have

$$\left\| \frac{m}{|S^\perp|} Y^{(0)}_{T^\perp} - \alpha_0 I_{T^\perp} \right\| \le \frac{\alpha_0}{40r}, \tag{22}$$

with probability at least $1 - 2e^{-\gamma |S^\perp| / r^2}$ for some constant $\gamma$, provided $|S^\perp| \ge c n r^2$ for some constant $c$. As a result, if $m \ge c n r^2$ for some large constant $c$ and $|S| \le c_1 m$ for some constant $c_1$ small enough, then with probability at least $1 - e^{-\gamma m / r^2}$,

$$\left\| Y^{(0)}_{T^\perp} - Y^{(1)}_{T^\perp} + \frac{\beta_0 - \alpha_0}{r} I_{T^\perp} \right\| \le \left\| Y^{(0)}_{T^\perp} - Y^{(1)}_{T^\perp} + \frac{|S^\perp|}{m} \cdot \frac{\beta_0 - \alpha_0}{r} I_{T^\perp} \right\| + \left(1 - \frac{|S^\perp|}{m}\right) \frac{\beta_0 - \alpha_0}{r} \le \frac{|S^\perp|}{m} \left\| \frac{m}{|S^\perp|} Y^{(0)}_{T^\perp} - \frac{m}{|S^\perp|} Y^{(1)}_{T^\perp} + \frac{\beta_0 - \alpha_0}{r} I_{T^\perp} \right\| + \frac{|S|}{m} \cdot \frac{\beta_0 - \alpha_0}{r} \le \frac{\beta_0}{30r} + \frac{\alpha_0}{60r}. \tag{23}$$

Next, let us check $Y^{(2)}_{T^\perp}$. Since $Y^{(2)} = \frac{1}{m} \sum_{j \in S} 9 \chi_j a_j a_j^T$, where $\mathbb{E}[9 \chi_j a_j a_j^T] = 0$, by [17, Theorem 5.39] we have

$$\left\| Y^{(2)} \right\| = \frac{|S|}{m} \left\| \frac{1}{|S|} \sum_{j \in S} 9 \chi_j a_j a_j^T \right\| \le \frac{1}{10r},$$

with probability at least $1 - 2\exp(-\gamma m / r)$, as long as $m \ge c n r^2$ and $|S| = c_1 m / r \ge c_2 n r$ for some constants $c$, $c_1$ and $c_2$. In particular, this gives

$$\left\| Y^{(2)}_{T^\perp} \right\| \le \frac{1}{10r}. \tag{24}$$

Putting this together with (23), we obtain that if $m \ge c n r^2$ and $|S| = c_1 m / r \ge c_2 n r$ for some constants $c$, $c_1$ and $c_2$, then with probability at least $1 - e^{-\gamma m / r^2}$,

$$\left\| Y_{T^\perp} + \frac{1.7}{r} I_{T^\perp} \right\| = \left\| Y^{(0)}_{T^\perp} - Y^{(1)}_{T^\perp} + Y^{(2)}_{T^\perp} + \frac{1.7}{r} I_{T^\perp} \right\| \le \left( \frac{\alpha_0}{60} + \frac{\beta_0}{30} + 0.11 \right) \frac{1}{r} \le \frac{0.25}{r}.$$

2) Proof of $\|Y_T\|_F \le \frac{1}{13r}$: Let $\tilde{Y} = \left( Y^{(0)} - Y^{(1)} \right) U$, and let $\tilde{Y}' = \left( I - U U^T \right) \tilde{Y}$ be the projection of $\tilde{Y}$ onto the orthogonal complement of $U$; then we have

$$\left\| Y^{(0)}_T - Y^{(1)}_T \right\|_F^2 = \left\| U^T \tilde{Y} \right\|_F^2 + 2 \left\| \tilde{Y}' \right\|_F^2. \tag{25}$$

First consider the term $\|U^T \tilde{Y}\|_F^2$ in (25), where the $k$th column of $U^T \tilde{Y}$ can be expressed explicitly as

$$\left( U^T \tilde{Y} \right)_k = \frac{1}{m} \sum_{j \in S^\perp} \left[ \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) \right] (a_j^T u_k) \, U^T a_j := \frac{1}{m} \Phi c_k,$$

where $\Phi \in \mathbb{R}^{r \times |S^\perp|}$ is constructed from the $U^T a_j$'s, and $c_k \in \mathbb{R}^{|S^\perp|}$ is composed of the $c_{k,j}$'s, each expressed as

$$c_{k,j} = \left[ \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) \right] a_j^T u_k,$$

with

$$\mathbb{E}[c_{k,j}^2] = \frac{1}{r^2} \left( \theta_0 + (r-1)\beta_0 - \beta_0^2 - (r-1)\alpha_0^2 \right) = \frac{1}{r} \left( \beta_0 - \alpha_0^2 \right) + \frac{1}{r^2} \left( \theta_0 + \alpha_0^2 - \beta_0^2 - \beta_0 \right) \le \frac{4.07}{r}.$$

Note that the $c_{k,j}^2$'s are i.i.d. sub-exponential random variables with $\left\| c_{k,j}^2 \right\|_{\psi_1} \le K$ for some constant $K$; then, according to [17, Corollary 5.17],

$$\mathbb{P} \left\{ \left| \sum_{j \in S^\perp} \left( c_{k,j}^2 - \mathbb{E} c_{k,j}^2 \right) \right| \ge \epsilon |S^\perp| \right\} \le 2 \exp \left( -c \, \frac{\epsilon^2 |S^\perp|}{K^2} \right),$$

which shows that, as long as $|S| \le c_1 m$ for some constants $c$ and $c_1$,

$$\|c_k\|_2^2 \le \frac{4.07 + c}{r} m \le \frac{4.1 m}{r}$$

holds with probability at least $1 - e^{-\gamma m / r^2}$. Furthermore, for a fixed vector $x \in \mathbb{R}^{|S^\perp|}$ obeying $\|x\|_2 = 1$, $\|\Phi x\|_2^2$ is distributed as a chi-square random variable with $r$ degrees of freedom. From [18, Lemma 1], we have

$$\|\Phi x\|_2^2 \le \frac{m}{12000 r^2},$$

with probability at least $1 - e^{-\gamma m / r^2}$, provided $m \ge c n r^2$ for some sufficiently large constant $c$. Therefore, we obtain

$$\left\| \left( U^T \tilde{Y} \right)_k \right\|_2^2 = \frac{1}{m^2} \left\| \Phi \frac{c_k}{\|c_k\|_2} \right\|_2^2 \|c_k\|_2^2 \le \frac{1}{2700 r^3},$$

which yields

$$\left\| U^T \tilde{Y} \right\|_F^2 = \sum_{k=1}^{r} \left\| \left( U^T \tilde{Y} \right)_k \right\|_2^2 \le \frac{1}{2700 r^2}, \tag{26}$$

with probability at least $1 - e^{-\gamma m / r^2}$, when $m \ge c n r^2$ and $|S| \le c_1 m$.

To bound the second term in (25), we adopt the same techniques as before. The $k$th column of $\tilde{Y}'$ can be expressed explicitly as

$$\left( \tilde{Y}' \right)_k = \frac{1}{m} \sum_{j \in S^\perp} \left[ \frac{1}{r} \sum_{i=1}^{r} (a_j^T u_i)^2 1_{\{|a_j^T u_i| \le 3\}} - \left( \alpha_0 + \frac{\beta_0 - \alpha_0}{r} \right) \right] (a_j^T u_k) \left( I - U U^T \right) a_j := \frac{1}{m} \sum_{j \in S^\perp} c_{k,j} a_j' := \frac{1}{m} \Psi c_k,$$

where $\Psi \in \mathbb{R}^{n \times |S^\perp|}$ is constructed from the $a_j'$'s, each of which, as a reminder, is the projection of $a_j$ onto the orthogonal complement of the column space of $U$, i.e., $a_j' = (I - U U^T) a_j$. Equivalently, $\Psi = (I - U U^T) A$, where $A \in \mathbb{R}^{n \times |S^\perp|}$ is constructed from the $a_j$'s, $j \in S^\perp$. For a fixed vector $x \in \mathbb{R}^{|S^\perp|}$ obeying $\|x\|_2 = 1$, we have $\|\Psi x\|_2^2 = \|(I - U U^T) A x\|_2^2 \le \|A x\|_2^2$, where $\|A x\|_2^2$ is distributed as a chi-square random variable with $n$ degrees of freedom. Again, [18, Lemma 1] tells us

$$\|\Psi x\|_2^2 \le \|A x\|_2^2 \le \frac{m}{12000 r^2},$$

with probability exceeding $1 - e^{-\gamma m / r^2}$, provided $m \ge c n r^2$ for a sufficiently large constant $c$. Hence,

$$\left\| \left( \tilde{Y}' \right)_k \right\|_2^2 = \frac{1}{m^2} \left\| \Psi \frac{c_k}{\|c_k\|_2} \right\|_2^2 \|c_k\|_2^2 \le \frac{1}{2700 r^3},$$

which leads to

$$\left\| \tilde{Y}' \right\|_F^2 = \sum_{k=1}^{r} \left\| \left( \tilde{Y}' \right)_k \right\|_2^2 \le \frac{1}{2700 r^2}. \tag{27}$$

Combining (26) and (27), we know that

$$\left\| Y^{(0)}_T - Y^{(1)}_T \right\|_F = \left( \left\| U^T \tilde{Y} \right\|_F^2 + 2 \left\| \tilde{Y}' \right\|_F^2 \right)^{1/2} \le \frac{1}{30r}. \tag{28}$$

Next, consider $\|Y^{(2)}_T\|_F^2$, which can be written as

$$\left\| Y^{(2)}_T \right\|_F^2 = \left\| U^T \bar{Y} \right\|_F^2 + 2 \left\| \bar{Y}' \right\|_F^2,$$

where $\bar{Y} = Y^{(2)} U$ and $\bar{Y}' = (I - U U^T) \bar{Y}$. For the first term $\|U^T \bar{Y}\|_F^2$, the $k$th column of $U^T \bar{Y}$ can be formulated explicitly as

$$\left( U^T \bar{Y} \right)_k = \frac{1}{m} \sum_{j \in S} 9 \chi_j (a_j^T u_k) \, U^T a_j := \frac{1}{m} \bar{\Phi} d_k,$$

where $\bar{\Phi} \in \mathbb{R}^{r \times |S|}$ is constructed from the $U^T a_j$'s, and $d_k \in \mathbb{R}^{|S|}$ is composed of the $d_{k,j}$'s, each expressed as

$$d_{k,j} = 9 \chi_j a_j^T u_k,$$

with $\mathbb{E}[d_{k,j}^2] = 81$. Note that the $d_{k,j}^2$'s are i.i.d. sub-exponential random variables with $\left\| d_{k,j}^2 \right\|_{\psi_1} \le K$ for some constant $K$; then, based on [17, Corollary 5.17],

$$\mathbb{P} \left\{ \left| \sum_{j \in S} \left( d_{k,j}^2 - \mathbb{E} d_{k,j}^2 \right) \right| \ge \epsilon |S| \right\} \le 2 \exp \left( -c_1 \, \frac{\epsilon^2 |S|}{K^2} \right),$$

which indicates that if $|S| = c m / r$ for some constant $c$,

$$\|d_k\|_2^2 \le (81 + c_1) |S| \le 82 |S| := \delta_0 |S|$$

holds with probability at least $1 - e^{-\gamma m / r}$. For a fixed vector $x \in \mathbb{R}^{|S|}$ obeying $\|x\|_2 = 1$, $\|\bar{\Phi} x\|_2^2$ is also a chi-square random variable with $r$ degrees of freedom, so

$$\left\| \bar{\Phi} x \right\|_2^2 \le \frac{m}{2700 \, \delta_0 c r^2},$$

with probability at least $1 - e^{-\gamma m / r^2}$, provided $m \ge c_1 n r^2$ for some sufficiently large constant $c_1$. Thus we have

$$\left\| \left( U^T \bar{Y} \right)_k \right\|_2^2 = \frac{1}{m^2} \left\| \bar{\Phi} \frac{d_k}{\|d_k\|_2} \right\|_2^2 \|d_k\|_2^2 \le \frac{1}{2700 r^3},$$

which gives

$$\left\| U^T \bar{Y} \right\|_F^2 = \sum_{k=1}^{r} \left\| \left( U^T \bar{Y} \right)_k \right\|_2^2 \le \frac{1}{2700 r^2}, \tag{29}$$

with probability at least $1 - e^{-\gamma m / r^2}$, when $m \ge c_1 n r^2$ and $|S| = c m / r$ for appropriate constants $c$ and $c_1$.

Now consider the second term $\|\bar{Y}'\|_F^2$ in $\|Y^{(2)}_T\|_F^2$, where the $k$th column of $\bar{Y}'$ can be expressed explicitly as

$$\left( \bar{Y}' \right)_k = \frac{1}{m} \sum_{j \in S} 9 \chi_j (a_j^T u_k) \left( I - U U^T \right) a_j := \frac{1}{m} \sum_{j \in S} d_{k,j} a_j' := \frac{1}{m} \bar{\Psi} d_k,$$

where $\bar{\Psi} \in \mathbb{R}^{n \times |S|}$ is constructed from the $a_j'$'s. Again, we can decompose $\bar{\Psi}$ as $\bar{\Psi} = (I - U U^T) \bar{A}$, where $\bar{A} \in \mathbb{R}^{n \times |S|}$ is constructed from the $a_j$'s, $j \in S$. For a fixed vector $x \in \mathbb{R}^{|S|}$ obeying $\|x\|_2 = 1$, we have $\|\bar{\Psi} x\|_2^2 = \|(I - U U^T) \bar{A} x\|_2^2 \le \|\bar{A} x\|_2^2$, where $\|\bar{A} x\|_2^2$ is a chi-square random variable with $n$ degrees of freedom as well. Since, provided $m \ge c_1 n r^2$ for a sufficiently large constant $c_1$,

$$\left\| \bar{\Psi} x \right\|_2^2 \le \left\| \bar{A} x \right\|_2^2 \le \frac{m}{2700 \, \delta_0 c r^2},$$

with probability exceeding $1 - e^{-\gamma m / r^2}$, we have

$$\left\| \left( \bar{Y}' \right)_k \right\|_2^2 = \frac{1}{m^2} \left\| \bar{\Psi} \frac{d_k}{\|d_k\|_2} \right\|_2^2 \|d_k\|_2^2 \le \frac{1}{2700 r^3},$$

and further

$$\left\| \bar{Y}' \right\|_F^2 = \sum_{k=1}^{r} \left\| \left( \bar{Y}' \right)_k \right\|_2^2 \le \frac{1}{2700 r^2}, \tag{30}$$

which, combined with (29), leads to

$$\left\| Y^{(2)}_T \right\|_F = \left( \left\| U^T \bar{Y} \right\|_F^2 + 2 \left\| \bar{Y}' \right\|_F^2 \right)^{1/2} \le \frac{1}{30r}. \tag{31}$$

Finally, we obtain that if $m \ge c n r^2$ and $|S| = c_1 m / r$ for some constants $c$ and $c_1$, then with probability at least $1 - e^{-\gamma m / r^2}$,

$$\|Y_T\|_F = \left\| Y^{(0)}_T - Y^{(1)}_T + Y^{(2)}_T \right\|_F \le \frac{1}{15r}. \tag{32}$$

D. Proof of Theorem 1

The required restricted isometry properties of the linear mapping $\mathcal{A}$ are supplied in Section V-B, and a valid approximate dual certificate is constructed in Section V-C; therefore, Theorem 1 follows straightforwardly from Lemma 1 in Section V-A.

VI. CONCLUSION

In this paper, we address the problem of estimating a low-rank PSD matrix $X \in \mathbb{R}^{n \times n}$ from $m$ quadratic magnitude measurements that are possibly corrupted by arbitrary outliers and bounded noise. This problem has many applications in covariance sketching, phase space tomography, and non-coherent detection in communications. It is shown that with $O(nr^2)$ random Gaussian sensing vectors, a PSD matrix of rank $r$ can be robustly recovered by minimizing the $\ell_1$ norm of the observation residual within the semidefinite cone with high

(0) (1) (2) kY T kF = Y T − Y T + Y T ≤ . (32) 15r F D. Proof of Theorem 1 The required restricted isometry properties of the linear mapping A are supplied in Section V-B and a valid appropriate dual certificate is constructed in Section V-C, therefore, Theorem 1 can be straightforwardly obtained from the Lemma 1 in Section V-A. VI. C ONCLUSION In this paper, we address the problem of estimating a lowrank PSD matrix X ∈ Rn×n from m quadratic magnitude measurements that are possibly corrupted by arbitrary outliers and bounded noise. This problem has many applications in covariance sketching, phase space tomography, and noncoherent detection in communications. It is shown that with O(nr2 ) random Gaussian sensing vectors, a PSD matrix of rank-r can be robustly recovered by minimizing the `1 -norm of the observation residual within the semidefinite cone with high

probability, even when a fraction of the measurements are adversarially corrupted. This convex formulation eliminates the need for trace minimization and tuning of parameters. Moreover, a non-convex subgradient descent algorithm is proposed with excellent empirical performance, when additional information of the rank of the PSD matrix is available. R EFERENCES [1] J. R. Fienup, “Reconstruction of an object from the modulus of its fourier transform,” Optics letters, vol. 3, no. 1, pp. 27–29, 1978. I [2] E. J. Candes, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming,” Communications on Pure and Applied Mathematics, vol. 66, no. 8, pp. 1241–1274, 2013. I, II-B, 2 [3] E. J. Candes, Y. C. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix completion,” SIAM Journal on Imaging Sciences, vol. 6, no. 1, pp. 199–225, 2013. I, II-B [4] I. Waldspurger, A. d’Aspremont, and S. Mallat, “Phase recovery, maxcut and complex semidefinite programming,” Mathematical Programming, vol. 149, no. 1-2, pp. 47–81, 2015. I, IV-A [5] Y. Chen, Y. Chi, and A. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” Information Theory, IEEE Transactions on, vol. 61, no. 7, pp. 4034–4059, July 2015. I, I, II-B, V-B, 3 [6] E. J. Candes and X. Li, “Solving quadratic equations via phaselift when there are about as many equations as unknowns,” Foundations of Computational Mathematics, vol. 14, no. 5, pp. 1017–1026, 2014. II-A, II-B, V-C [7] L. Demanet and P. Hand, “Stable optimizationless recovery from phaseless linear measurements,” Journal of Fourier Analysis and Applications, vol. 20, no. 1, pp. 199–221, 2014. II-A, II-B [8] P. Hand, “Phaselift is robust to a constant fraction of arbitrary errors,” arXiv preprint arXiv:1502.04241, 2015. II-A, II-A, II-B, V-C [9] M. Kabanava, R. Kueng, H. Rauhut, and U. Terstiege, “Stable low-rank matrix recovery via null space properties,” arXiv preprint arXiv:1507.07184, 2015. II-B [10] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of ACM, vol. 58, no. 3, pp. 11:1–11:37, Jun 2011. II-B [11] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” SIAM Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011. II-B [12] J. Wright, A. Ganesh, K. Min, and Y. Ma, “Compressive principal component pursuit,” Information and Inference, vol. 2, no. 1, pp. 32–68, 2013. II-B [13] X. Li, “Compressed sensing and matrix completion with constant proportion of corruptions,” Constructive Approximation, vol. 37, pp. 73– 99, 2013. II-B [14] E. J. Candès, X. Li, and M. Soltanolkotabi, “Phase retrieval via wirtinger flow: Theory and algorithms,” Information Theory, IEEE Transactions on, vol. 61, no. 4, pp. 1985–2007, 2015. III, IV-B [15] Y. Chen and E. J. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” arXiv:1505.05114, May 2015. III [16] C. D. White, R. Ward, and S. Sanghavi, “The local convexity of solving quadratic equations,” arXiv preprint arXiv:1506.07868, 2015. III, IV-B [17] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” Compressed Sensing, Theory and Applications, pp. 210 – 268, 2012. V-C1, V-C1, V-C1, V-C2, V-C2 [18] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,” Annals of Statistics, pp. 1302–1338, 2000. V-C2, V-C2