
AN EFFICIENT, SPARSITY-PRESERVING, ONLINE ALGORITHM FOR DATA APPROXIMATION

arXiv:1602.05950v1 [cs.NA] 18 Feb 2016

DAVID G. ANDERSON∗ AND MING GU∗

∗Department of Mathematics, University of California, Berkeley, California 94720 ([email protected], [email protected]).

Abstract. The Singular Value Decomposition (SVD) is a longstanding standard for data approximation because it is optimal in the 2 and Frobenius norms. The SVD, nevertheless, suffers from many setbacks, including computational cost, loss of sparsity in the decomposition, and the inability to be updated easily when new information arrives. Additionally, the SVD provides limited information on data features and variables that best represent the data. In this work, we present a truncated LU factorization called Spectrum-Revealing LU (SRLU) for effective low-rank matrix approximation, and develop the first algorithm to compute an SRLU factorization, which is both efficient and reliable. Our algorithm uses randomization and a novel LU updating technique with partial pivoting, which is more stable than any other known LU updating algorithm. We provide both approximation error bounds and singular value bounds for the SRLU approximation computed by our algorithm. Our analysis suggests that SRLU is competitive with the best low-rank matrix approximation methods, deterministic or randomized, in both computational complexity and approximation quality. Numeric experiments illustrate that SRLU preserves sparsity, highlights important data features and variables, can be efficiently updated, and calculates data approximations nearly as accurately as the SVD. To the best of our knowledge this is the first practical variant of the LU decomposition for efficient and effective low-rank matrix approximation.

Key words. LU decomposition, randomized algorithm, rank-revealing, low-rank approximation

1. Introduction. Low-rank approximation is an essential data processing technique for understanding data that is either large or noisy. The Singular Value Decomposition (SVD) is a stable algorithm for factoring a data matrix, and is optimal in the following sense [10, 12]:

(1.1)    ‖A − A_k‖_ξ = min_{rank(B) ≤ k} ‖A − B‖_ξ,
where ξ = 2, F, and A_k is the truncated SVD of A. The far-reaching applications of low-rank approximation and the SVD include data compression, image and pattern recognition, signal processing, compressed sensing, latent semantic indexing, anomaly detection, and recommendation systems. The SVD, nevertheless, suffers from several setbacks:
• the SVD is expensive to compute,
• the SVD is dense in general, even when the input is sparse,
• the SVD cannot be updated easily when new data arrives,
• important data variables and features of the data are difficult or impossible to discern from an SVD factorization.
Another rudimentary matrix factorization, the LU decomposition, is much cheaper to compute, preserves sparsity, highlights important rows and columns of the data, and can be updated. The stability of the LU decomposition has been studied extensively, and it is implemented in practice with partial pivoting to solve general linear systems. Creating a low-rank approximation with the LU decomposition, however, requires complete pivoting during factorization. If LU with complete pivoting is performed, then a low-rank approximation in practice can be achieved by implementing


a truncated factorization of the form

\Pi_1 A \Pi_2^T =
\begin{pmatrix} L_{11} & \\ L_{21} & I_{m-k} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ & S \end{pmatrix},
where Π_1 and Π_2 are permutation matrices, L_{11} is k × k, and S is the (m − k) × (n − k) Schur complement, and then approximating

\Pi_1 A \Pi_2^T \approx
\begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \end{pmatrix}
\stackrel{def}{=} L_1 U_1.

However, there are two known fatal setbacks of this approach:
• complete pivoting is expensive,
• complete pivoting does not guarantee a reliable low-rank matrix approximation.
Indeed, although some theory and software for LU with complete pivoting exists, its efficient implementation remains an open problem.

1.1. Preliminaries on the LU Decomposition. The LU decomposition factors any matrix into the product of a lower triangular matrix and an upper triangular matrix, which can then be used to efficiently solve linear systems by solving the two triangular systems. The standard LU decomposition picks a row pivot in the current column, scales the current column of L, and then updates the Schur complement (Algorithm 1). This algorithm is called right-looking LU because it updates the Schur complement (located to the right of the column being updated) before the next iteration.

Inputs: Data matrix A ∈ R^{m×n}
1: for k = 1 to min(m, n) − 1 do
2:   apply permutations
3:   for i = k + 1 to m do
4:     A_{i,k} = A_{i,k} / A_{k,k}
5:   end for
6:   for i = k + 1 to m do
7:     for j = k + 1 to n do
8:       A_{i,j} = A_{i,j} − A_{i,k} · A_{k,j}
9:     end for
10:  end for
11: end for
Algorithm 1: Right-Looking LU [7]

Another variation of the LU decomposition, the left-looking LU decomposition, factors columns from left to right, but does not update the Schur complement after each update. Instead, before a column is factored, all necessary updates are applied simultaneously. A lesser known matrix factorization, the Crout version of the LU decomposition [5], similarly postpones updating the Schur complement, but factors rows and columns during each iteration (Algorithm 2). As the goal of this work is a rank-k truncated LU factorization with k ≪ m, n, the left-looking and Crout LU decompositions exhibit more desirable properties than the right-looking version because they avoid updating the Schur complement S, which is not needed in the final approximation. The Crout version allows for row and column pivoting, and the data approximation method presented here is derived from the Crout LU decomposition.
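To make the truncated form Π_1 A Π_2^T ≈ L_1 U_1 concrete, the following is a minimal NumPy sketch (not part of the original presentation, and not the efficient algorithm developed later in this paper): it runs k steps of Gaussian elimination with complete pivoting on a small dense matrix, keeps the leading k columns of L and k rows of U, and checks that the approximation error matches the norm of the leftover Schur complement. All function and variable names are illustrative.

import numpy as np

def truncated_lu_complete_pivoting(A, k):
    """k steps of LU with complete pivoting; returns L1, U1, the Schur complement, and permutations."""
    A = A.astype(float).copy()
    m, n = A.shape
    row_perm, col_perm = np.arange(m), np.arange(n)
    for j in range(k):
        # complete pivoting: bring the largest entry of the Schur complement to position (j, j)
        p, q = np.unravel_index(np.argmax(np.abs(A[j:, j:])), (m - j, n - j))
        A[[j, j + p], :] = A[[j + p, j], :]
        A[:, [j, j + q]] = A[:, [j + q, j]]
        row_perm[[j, j + p]] = row_perm[[j + p, j]]
        col_perm[[j, j + q]] = col_perm[[j + q, j]]
        # eliminate below the pivot and update the Schur complement (right-looking step)
        A[j + 1:, j] /= A[j, j]
        A[j + 1:, j + 1:] -= np.outer(A[j + 1:, j], A[j, j + 1:])
    L1 = np.tril(A[:, :k], -1)
    L1[:k, :k] += np.eye(k)               # unit diagonal on the leading block
    U1 = np.triu(A[:k, :])
    return L1, U1, A[k:, k:], row_perm, col_perm

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
L1, U1, S, rp, cp = truncated_lu_complete_pivoting(A, k=3)
Ap = A[np.ix_(rp, cp)]                     # Pi1 * A * Pi2^T
print(np.linalg.norm(Ap - L1 @ U1), np.linalg.norm(S))   # the two norms agree

In this sketch the Frobenius error of the rank-k approximation coincides with the norm of the discarded Schur complement, as formalized later in Theorem 3.2.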


Inputs: Data matrix A ∈ R^{m×n}
1: for k = 1 to min(m, n) do
2:   apply permutations
3:   for i = 1 to k − 1 do
4:     for j = k to n do
5:       A_{k,j} = A_{k,j} − A_{k,i} · A_{i,j}
6:     end for
7:   end for
8:   for i = 1 to k − 1 do
9:     for j = k + 1 to m do
10:      A_{j,k} = A_{j,k} − A_{i,k} · A_{j,i}
11:    end for
12:  end for
13:  A_{k+1:m,k} = A_{k+1:m,k} / A_{k,k}
14: end for
Algorithm 2: Crout LU [5]

2. Previous Work.

2.1. Deterministic Algorithms. As mentioned above, the truncated SVD is the standard deterministic method for low-rank data approximation. SVD-like methods, such as the URV and ULV decompositions, are variants that allow for data updating. Sparse PCA seeks to compute an SVD-like factorization that is sparse, but is computationally more difficult than the truncated SVD. The Interpolative Decomposition (ID), the truncated QR with column pivoting factorization, and deterministic column selection algorithms, such as the main result in [2], are other common deterministic low-rank approximation methods. None of these methods is as efficient as the LU decomposition.

2.2. Randomized Low-Rank Approximation Algorithms. Randomized algorithms have grown in popularity in recent years because of their ability to efficiently process large data matrices and because they can be supported with rigorous theory. Randomized low-rank approximation algorithms generally fall into one of two categories: sampling algorithms and black box algorithms. Sampling algorithms form data approximations from a random selection of rows and/or columns of the data. Examples include [8, 9, 11, 20]. The latter proves that if R and C are subsets of O(ε^{-2} k log k) rows and columns of a data matrix A, randomly selected using leverage score weights, then for U = C†AR†

(2.1)    ‖A − CUR‖_F ≤ (2 + ε) ‖A − A_k‖_F

with probability at least 98%. Here k is the target rank. The leverage scores are calculated from a rank-k truncated SVD of the data. Black box algorithms project a data matrix onto a smaller subspace and then analyze or factor the projection. This technique relies on the ability to efficiently project the data to a lower-dimensional subspace while maintaining structural properties of the data. The projection itself does not form an easily interpretable low-rank approximation, and it must be applied quickly enough so that the time savings of operating on the lower-dimensional space are not negated. An efficient projection technique for a general data matrix was shown to exist by Johnson and Lindenstrauss [18], where


the projection is achieved by premultiplying by a random Gaussian matrix. Low-rank data approximation methods using a Johnson-Lindenstrauss projection have been extensively studied [4, 16, 21, 26, 27, 29]. The random projection introduces error and adds an initialization cost. For both sampling and black box algorithms the tuning parameter ε cannot be arbitrarily small, as the methods become meaningless if the number of rows and columns sampled (in the case of sampling algorithms) or the size of the random projection (in the case of black box algorithms) surpasses the size of the data. A common practice is ε ≈ 1/2.

2.3. The LU Decomposition. The LU decomposition has been studied extensively since long before the invention of computers, with notable results from many mathematicians, including Gauss, Turing, and Wilkinson. Current research on LU decompositions includes communication-avoiding implementations, such as tournament pivoting [19], sparse implementations [14], and new computations of preconditioners [3]. These results are almost exclusively in the context of linear equation solving, either directly or indirectly through an incomplete factorization used to precondition an iterative method. This work repurposes the LU decomposition to create a novel, efficient, and effective low-rank approximation algorithm using modern randomization technology. A previous result on randomized LU factorizations for low-rank approximation was presented in [1], but the theoretical guarantees and computational performance presented here are far stronger, and, for the first time, are competitive with all other low-rank approximation algorithms.

In practice, LU with partial pivoting provides reliable data factorizations, but pathological data matrices may still produce numerically unstable factorizations in the form of exponential element growth in the decomposition. Complete pivoting has long been known to provide a mathematically stable form of the LU decomposition [28], but requires an additional O(n³) comparisons to find the pivot elements. A randomized approach to efficiently compute the LU decomposition with complete pivoting recently appeared in [22].

3. Main Contribution: Spectrum-Revealing LU (SRLU). Our algorithm for computing SRLU is composed of two subroutines: partially factoring the data matrix with randomized complete pivoting (TRLUCP) and performing swaps to improve the quality of the approximation (SRP). The first provides an efficient algorithm for computing a truncated LU factorization, whereas the second ensures the resulting approximation is provably reliable.

3.1. Truncated Randomized LU with Complete Pivoting (TRLUCP). Intuitively, TRLUCP performs deterministic LU with partial row pivoting on initial data whose columns have been permuted. TRLUCP uses a random projection of the Schur complement to cheaply find and move forward columns that are more likely to be representative of the data. To accomplish this, Algorithm 3 performs an iteration of block LU factorization in a careful order that resembles Crout LU reduction. The ordering is reasoned as follows: LU with partial row pivoting cannot be performed until the needed columns are selected, and so column selection must occur first at each iteration. Once a block column is selected, a Schur update must be performed on that column before proceeding. At this point, an iteration of block LU with partial row pivoting can be performed on the current block.
Once the row pivoting is performed, a Schur update of a block row of U can be performed, which completes the factorization up to rank j + b. Finally, the projection matrix R can be cheaply


updated to prepare for the next iteration.

Inputs: Data matrix A ∈ R^{m×n}, target rank k, block size b, oversampling parameter p ≥ b, random Gaussian matrix Ω ∈ R^{p×m}
1: Calculate random projection R = ΩA
2: for j = 0, b, 2b, · · · , k − b do
3:   Perform column selection algorithm on R and swap columns of A
4:   Update block column of L_1
5:   Perform block LU with partial row pivoting and swap rows of A
6:   Update block row of U_1
7:   Update R
8: end for
Algorithm 3: TRLUCP

Note that any column selection method may be used when picking column pivots from R, such as QR with column pivoting, LU with row pivoting, or even the row pivots selected by this algorithm. The flop count of TRLUCP is dominated by the three matrix multiplication steps. The total number of flops is

F_TRLUCP = 2pmn + (m + n)k² + lower-order terms.

Because the output of TRLUCP is only written once, the total number of memory writes is

W_TRLUCP = (m + n − k)k.

Minimizing the number of data writes by only writing data once significantly improves efficiency because writing data is typically one of the slowest computational operations.

3.1.1. Updating R. The goal of TRLUCP is to access the entire matrix once in the initial random projection, and then choose column pivots at each iteration without accessing the Schur complement. Therefore, a projection of the Schur complement must be obtained at each iteration without accessing the Schur complement, a method that first appeared in [22]. Assume that s iterations of TRLUCP have been performed and denote the projection matrix

Ω = ( Ω_1  Ω_2  Ω_3 ),

where Ω_1 has sb columns, Ω_2 has b columns, and Ω_3 has m − (s + 1)b columns. Then the current projection of the Schur complement is

R^cur = ( R^cur_1  R^cur_2 ) = ( Ω_2  Ω_3 ) \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix},

where R^cur_1 has b columns, R^cur_2 has n − (s + 1)b columns, and the right-most matrix is the current Schur complement. The next iteration of TRLUCP will need to choose columns based on a random projection of the Schur


complement, which we wish to avoid accessing. We can write:

(3.1)
R^update = Ω_3 (A_{33} − A_{32} A_{22}^{-1} A_{23})
         = Ω_3 A_{33} + Ω_2 A_{23} − Ω_2 A_{23} − Ω_3 A_{32} A_{22}^{-1} A_{23}
         = Ω_3 A_{33} + Ω_2 A_{23} − Ω_2 L_{22} U_{23} − Ω_3 L_{32} U_{23}
         = R^cur_2 − (Ω_2 L_{22} + Ω_3 L_{32}) U_{23}.
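As a small illustration of update (3.1), the sketch below builds the blocks named in the derivation (Ω_2, Ω_3, L_{22}, L_{32}, U_{22}, U_{23}, and the corresponding A blocks of the current Schur complement) from random data and verifies that the update reproduces a direct projection of the explicitly formed Schur complement. The block names simply mirror the partitioning used above; this is an illustration, not the TRLUCP implementation.

import numpy as np

rng = np.random.default_rng(1)
p, b, r = 4, 3, 5                       # projection rows, block size, remaining size

# blocks of the current Schur complement, built from an LU factorization of its
# leading b-by-b block so that the identities A23 = L22 U23 and A32 = L32 U22 hold
L22 = np.tril(rng.standard_normal((b, b)), -1) + np.eye(b)
U22 = np.triu(rng.standard_normal((b, b)))
U23 = rng.standard_normal((b, r))
L32 = rng.standard_normal((r, b))
A22, A23, A32 = L22 @ U22, L22 @ U23, L32 @ U22
A33 = rng.standard_normal((r, r))

Omega2 = rng.standard_normal((p, b))
Omega3 = rng.standard_normal((p, r))
R2_cur = Omega2 @ A23 + Omega3 @ A33    # current projection of the trailing columns

# update (3.1): no Schur complement is formed
R_update = R2_cur - (Omega2 @ L22 + Omega3 @ L32) @ U23

# reference: project the explicitly formed Schur complement
R_reference = Omega3 @ (A33 - A32 @ np.linalg.solve(A22, A23))
print(np.allclose(R_update, R_reference))   # True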

Here the current L and U at stage s have been blocked in the same way as Ω. Note that equation (3.1) no longer contains the term A_{33}. Furthermore, A_{22}^{-1} has been replaced by substituting in submatrices of L and U that have already been calculated, which helps eliminate potential instability. When the block size b = 1 and TRLUCP runs fully (k = min(m, n)), TRLUCP is mathematically equivalent to the Gaussian Elimination with Randomized Complete Pivoting (GERCP) algorithm of [22]. However, TRLUCP differs from GERCP in two very important aspects: TRLUCP is based on the Crout variant of the LU factorization, which allows efficient truncation for low-rank matrix approximation; and TRLUCP has been structured in block form for more efficient implementation. An argument parallel to that of [22] leads to the following stability result concerning TRLUCP.

Theorem 3.1 (Column Growth Factor for TRLUCP). Choose ε, δ ∈ (0, 1). If

p > b + \frac{4}{ε^2 − ε^3} \ln\frac{n(n+1)}{2δ},

then the element growth factor ρ of TRLUCP satisfies

ρ ≤ \left(1 + \sqrt{\frac{1+ε}{1−ε}}\right) e(n+1)\, n^{\frac{1}{2}\ln(n)}

with probability greater than 1 − δ. Thus TRLUCP enjoys almost the same level of numerical stability as deterministic LU with complete pivoting, but is much more efficient. In practice, we have found that TRLUCP works very well even when p is only slightly larger than b.

3.1.2. Choice of Block Size and Projection Size. A heuristic for choosing a block size for TRLUCP is developed here, which differs from standard block size methodologies for the LU decomposition. Because the ideal block size depends on many parameters, such as the architecture of the computer and the costs of various arithmetic, logic, and memory operations, guidelines are sought instead of an exact determination of the most efficient block size. To simplify calculations, only the DGEMM operations are considered, as they are the bottleneck of the computation. Let M denote the size of the cache, f and m the number of flops and memory movements, and t_f and t_m the cost of a floating point operation and the cost of a memory movement. We seek to choose a block size to minimize the total calculation time T, modeled as T = f · t_f + m · t_m. Choosing p = b + c for a small, fixed constant c, and minimizing, implies

(3.2)
T = \left[ (m + n − k)(k^2 − kb) − \frac{4}{3}k^3 + 2bk^2 − \frac{2}{3}b^2 k \right] t_f
  + \left[ (m + n − k)\left(\frac{k^2}{b} − k\right) − \frac{4}{3}\frac{k^3}{b} + 2k^2 − \frac{2}{3}bk \right] \frac{M}{\left(\sqrt{b^2 + M} − b\right)^2} t_m.


To see this, we analyze blocking by allowing different block sizes in each dimension. For matrices Ω ∈ R^{p×m} and A ∈ R^{m×n}, consider blocking the product Ω · A into blocks of size s × ℓ for Ω and ℓ × b for A:

Ω · A = \begin{pmatrix} * & * & * \\ * & * & * \end{pmatrix} · \begin{pmatrix} * & * & * \\ * & * & * \\ * & * & * \end{pmatrix},

where each block of the left factor is s × ℓ and each block of the right factor is ℓ × b. Then a current block update requires cache storage of

sℓ + ℓb + sb ≤ M.

Thus we will constrain

ℓ ≤ \frac{M − sb}{s + b}.

The total runtime T is

T = 2pmn · t_f + \frac{p}{s}\,\frac{m}{ℓ}\,\frac{n}{b}\,(sℓ + ℓb + sb) · t_m
  = 2pmn · t_f + pmn\left(\frac{s + b}{sb} + \frac{1}{ℓ}\right) · t_m
  ≥ 2pmn · t_f + pmn\left(\frac{s + b}{sb} + \frac{s + b}{M − sb}\right) · t_m
  = 2pmn · t_f + pmnM\,\frac{s + b}{sb(M − sb)} · t_m
  =: 2pmn · t_f + pmnM\, L(s, b, M) · t_m.

Given Ω and A, changing the block sizes has no effect on the flop count. Optimizing L(s, b, M) over s yields s² + 2sb = M. By symmetry, b² + 2sb = M. Note, nevertheless, that s ≤ p by definition. Hence

s* = min\left(\sqrt{\frac{M}{3}},\; p\right),

and

b* = max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2 + M} − p\right).

These values assume

ℓ* = \frac{M − sb}{s + b} = max\left(\sqrt{\frac{M}{3}},\; \sqrt{p^2 + M} − p\right) = b*.


This analysis applies to matrix-matrix multiplication where the matrices are fixed and the leading matrix is short and fat or the trailing matrix is tall and skinny. As noted above, nevertheless, the oversampling parameter p is a constant amount larger than the block size used during the LU factorization. The total initialization time is

T^init = 2pmn · t_f + pmnM\,\frac{s + b}{sb(M − sb)} · t_m
       = 2pmn · t_f + mn · min\left(3\sqrt{3}\,\frac{p}{\sqrt{M}},\ \frac{M}{\left(\sqrt{p^2 + M} − p\right)^2}\right) · t_m.
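The sketch below illustrates these heuristics: cache_optimal_blocks evaluates the formulas for s*, b*, and ℓ* just derived, and model_T evaluates the modeled runtime, combining the initial projection cost (2pmn flops with p = b + c) with the DGEMM cost (3.2) as reconstructed above, so that a minimizing LU block size b can be found by a direct scan. The machine parameters (cache size M, relative costs t_f and t_m) and the problem sizes are made-up values for illustration only.

import numpy as np

def cache_optimal_blocks(M, p):
    """Cache-optimal DGEMM blocking (s*, b*, ell*) for the initial product Omega @ A."""
    s = min(np.sqrt(M / 3.0), p)
    b = max(np.sqrt(M / 3.0), np.sqrt(p * p + M) - p)
    ell = (M - s * b) / (s + b)
    return s, b, ell

def model_T(b, m, n, k, M, t_f, t_m, c=8):
    """Modeled TRLUCP time: projection cost 2*p*m*n (p = b + c) plus the block-update cost (3.2)."""
    p = b + c
    mem = M / (np.sqrt(b * b + M) - b) ** 2
    flops = (m + n - k) * (k * k - k * b) - 4.0 / 3.0 * k**3 + 2 * b * k * k - 2.0 / 3.0 * b * b * k
    moves = (m + n - k) * (k * k / b - k) - 4.0 / 3.0 * k**3 / b + 2 * k * k - 2.0 / 3.0 * b * k
    return 2.0 * p * m * n * t_f + flops * t_f + moves * mem * t_m

# made-up machine and problem parameters, for illustration only
M, t_f, t_m = 4.0e6, 1.0, 20.0          # cache capacity (in words) and relative op costs
m, n, k = 8000, 8000, 300
print(cache_optimal_blocks(M, p=72))
best_b = min(range(4, 129), key=lambda b: model_T(b, m, n, k, M, t_f, t_m))
print(best_b)                           # a modest block size, well below LAPACK's typical 64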

We next choose the parameter b used for blocking the LU factorization, where p = b + c for a small constant c. Steps 4 and 6 of Algorithm 3 have a cumulative DGEMM runtime of

T^DGEMM = \sum_{j = b, 2b, \ldots, k−b} \left\{ \left[ 2jb(m − j) + 2jb(n − j − b) \right] t_f + 2\left[ j(m − j) + j(n − j − b) \right] \frac{M}{\left(\sqrt{b^2 + M} − b\right)^2}\, t_m \right\}
        = \left[ (m + n − k)(k^2 − kb) − \frac{4}{3}k^3 + 2bk^2 − \frac{2}{3}b^2 k \right] t_f
        + \left[ (m + n − k)\left(\frac{k^2}{b} − k\right) − \frac{4}{3}\frac{k^3}{b} + 2k^2 − \frac{2}{3}bk \right] \frac{M}{\left(\sqrt{b^2 + M} − b\right)^2}\, t_m
        =: N_f^{DGEMM} · t_f + N_m^{DGEMM} · t_m,

which implies (3.2). Given hardware-dependent parameters M, t_f, and t_m, a minimizing b can easily be found. For problems where LAPACK chooses b = 64, our experiments have shown block sizes of 8 to 20 to be optimal for TRLUCP. Experiments are shown below. To our knowledge, this is the first algorithm where the choice of block size affects the flop count. This relationship between block size and projection size reflects a key benefit over other current algorithms: the random projection is significantly smaller for TRLUCP than the projection for other methods (2pmn flops for TRLUCP versus 2kmn for others) because it is based on the block size, not the truncation rank, which can be much larger than the block size.

3.1.3. Analysis of Truncated LU Decompositions. We begin with a basic result for any truncated LU decomposition:

Theorem 3.2. For any truncated LU factorization,

‖Π_1 A Π_2^T − L_1 U_1‖ = ‖S‖

for any norm ‖·‖. Furthermore,

‖Π_1 A Π_2^T − (L_1 U_1)_s‖_2 ≤ 2‖S‖_2 + σ_{s+1}(A),

where (·)_s is the rank-s truncated SVD for s ≤ k ≪ m, n.


Proof. The equation simply follows from

Π_1 A Π_2^T = L_1 U_1 + \begin{pmatrix} 0 & 0 \\ 0 & S \end{pmatrix}.

For the inequality:

‖Π_1 A Π_2^T − (L_1 U_1)_s‖_2 = ‖Π_1 A Π_2^T − L_1 U_1 + L_1 U_1 − (L_1 U_1)_s‖_2
  ≤ ‖Π_1 A Π_2^T − L_1 U_1‖_2 + ‖L_1 U_1 − (L_1 U_1)_s‖_2
  = ‖S‖_2 + σ_{s+1}(L_1 U_1)
  = ‖S‖_2 + σ_{s+1}\left(Π_1 A Π_2^T − \begin{pmatrix} 0 & 0 \\ 0 & S \end{pmatrix}\right)
  ≤ ‖S‖_2 + σ_{s+1}(A) + ‖S‖_2.

Theorem 3.2 simply concludes that the approximation is accurate if the Schur complement is small, but an additional result is needed to guarantee that the approximation retains structural properties of the original data:

Theorem 3.3. For a general rank-k truncated LU decomposition, we have, for all 1 ≤ j ≤ k,

σ_j(A) ≤ σ_j(L_1 U_1) \left(1 + \frac{‖S‖_2}{σ_k(L_1 U_1)}\right)\left(1 + \frac{‖S‖_2}{σ_j(A)}\right).

Proof.

σ_j(A) ≤ σ_j(L_1 U_1)\left(1 + \frac{‖S‖_2}{σ_j(L_1 U_1)}\right)
  = σ_j(L_1 U_1)\left(1 + \frac{σ_j(A)}{σ_j(L_1 U_1)}\,\frac{‖S‖_2}{σ_j(A)}\right)
  ≤ σ_j(L_1 U_1)\left(1 + \frac{σ_j(L_1 U_1) + ‖S‖_2}{σ_j(L_1 U_1)}\,\frac{‖S‖_2}{σ_j(A)}\right)
  ≤ σ_j(L_1 U_1)\left(1 + \frac{‖S‖_2}{σ_j(L_1 U_1)}\right)\left(1 + \frac{‖S‖_2}{σ_j(A)}\right)
  ≤ σ_j(L_1 U_1)\left(1 + \frac{‖S‖_2}{σ_k(L_1 U_1)}\right)\left(1 + \frac{‖S‖_2}{σ_j(A)}\right).

Note that the relaxation in the final step serves to establish a universal constant across all j, which leads to fewer terms that need bounding when the global SRLU swapping strategy is developed. Just as in the case of deterministic LU with complete pivoting, the sizes of ‖S‖_2/σ_k(L_1 U_1) and ‖S‖_2/σ_j(L_1 U_1) range from moderate to small for almost all data matrices of practical interest. They, nevertheless, cannot be effectively bounded for a general TRLUCP factorization. Below we develop a swapping algorithm that applies necessary corrections to the TRLUCP algorithm to ensure that these terms are controlled. The result is a truncated LU factorization that effectively approximates a data matrix.

3.2. Computing the Spectrum-Revealing LU Decomposition. Since TRLUCP produces high-quality data approximations for almost all data matrices, despite the lack of theoretical guarantees, below we develop an efficient variant of the existing rank-revealing LU algorithms [15, 23] to rapidly detect and, if necessary, correct any possible matrix approximation failures of TRLUCP.


Intuitively, the quality of the factorization can be tested by searching for the next choice of pivot in the Schur complement if the factorization were to continue. Because TRLUCP does not provide an updated Schur complement, the largest element in the Schur complement can be approximated by finding the column of R with largest norm, performing a Schur update of that column, and then picking the largest element in that column. Let α be this element, and, without loss of generality, assume it is the first entry of the Schur complement. Denote:

Π_1 A Π_2^T = \begin{pmatrix} L_{11} & & \\ ℓ^T & 1 & \\ L_{31} & & I \end{pmatrix} \begin{pmatrix} U_{11} & u & U_{13} \\ & α & s_{12}^T \\ & s_{21} & S_{22} \end{pmatrix}.

Next, we must find the row and column that should be replaced if the row and column containing α are important. Note that the smallest entry of L_{11} U_{11} may still lie in an important row and column, and so the largest element of the inverse should be examined instead. Define

A_{11} \stackrel{def}{=} \begin{pmatrix} L_{11} & \\ ℓ^T & 1 \end{pmatrix} \begin{pmatrix} U_{11} & u \\ & α \end{pmatrix}.

Then, for a tolerance parameter f ≥ 1 that provides a control of accuracy versus the number of swaps needed, we propose testing

(3.3)    ‖A_{11}^{-1}‖_max ≤ \frac{f}{|α|}.

Should the test fail, the row and column containing α are swapped with the row and column containing the largest element in A_{11}^{-1}. Note that this element may occur in the last row or last column of A_{11}^{-1}, indicating that only a column swap or row swap, respectively, is needed. When the swaps are performed, the factorization must be updated to maintain truncated LU form. We have developed a variant of the LU updating algorithm of [13] to efficiently update the SRLU factorization. Our update procedure is as follows: we position the row and column to be swapped into the leading k rows and columns in the (k+1)th row and column position. We swap the row and column to be moved out into the kth row and column. The updating method of [13] is then performed to restore the form of the factorization, but is terminated once the (k−1)th row and column have been updated. Because updating the kth row and column may require a swap, updating the last row and column may fail because such a swap would conflict with the swap needed to move the needed data in and the unnecessary data out. Instead, the kth and (k+1)th rows and columns are then swapped, and LU with partial pivoting is performed on stages k−1 and k to reliably restore the factorization.

SRP can be implemented efficiently: each swap requires O(k(m + n)) operations, and ‖A_{11}^{-1}‖_max can be calculated quickly using [17]. An argument similar to that used in [23] shows that each swap after a failed test will increase the determinant det(A_{11}) by a factor of at least f. Thus the test (3.3) will eventually succeed for any data matrix, because there is a finite upper bound on det(A_{11}). It is possible to derive theoretical upper bounds on the worst-case number of swaps necessary, but in practice this number is zero for most matrices, and does not exceed 3 to 5 for the most pathological data matrices of dimension at most 1000 that we could contrive.
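A minimal sketch of test (3.3) is given below. It forms A_{11}^{-1} explicitly for clarity, although, as noted above, ‖A_{11}^{-1}‖_max can be estimated cheaply instead [17]. The arguments L11_ext and U11_ext denote the (k+1) × (k+1) triangular factors whose product is A_{11} (the leading block extended by the row and column of α); all names are illustrative.

import numpy as np

def srp_swap_needed(L11_ext, U11_ext, alpha, f=5.0):
    """Test (3.3): return whether a corrective swap is needed and where the
    largest entry of inv(A_11) sits (the row/column to swap in)."""
    A11_inv = np.linalg.inv(L11_ext @ U11_ext)            # explicit inverse, for clarity only
    i, j = np.unravel_index(np.argmax(np.abs(A11_inv)), A11_inv.shape)
    return np.abs(A11_inv[i, j]) > f / np.abs(alpha), (i, j)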


Inputs: Truncated LU factorization A ≈ L_1 U_1, tolerance f > 1
1: while ‖A_{11}^{-1}‖_max > f / |α| do
2:   Set α to be the largest element in S (or find an approximate α using R)
3:   Swap the row and column containing α with the row and column of the largest element in A_{11}^{-1}
4:   Update the truncated LU factorization
5: end while
Algorithm 4: Spectrum-Revealing Pivoting (SRP)

The combination of Algorithms 3 and 4 ensures the computation of a Spectrum-Revealing LU factorization: the data approximation is guaranteed to capture all of the leading singular values of the original matrix well.

Theorem 3.4 (SRLU Error Bounds). For j ≤ k and γ = O(f k \sqrt{mn}), a rank-k SRLU factorization satisfies

‖Π_1 A Π_2^T − L_1 U_1‖_2 ≤ γ σ_{k+1}(A),
‖Π_1 A Π_2^T − (L_1 U_1)_j‖_2 ≤ σ_{j+1}(A) \left(1 + 2γ \frac{σ_{k+1}(A)}{σ_{j+1}(A)}\right).

Proof. Note that the definition of α implies

‖S‖_2 ≤ \sqrt{(m − k)(n − k)}\,|α|.

From [25]:

σ_min(A_{11}) ≤ σ_{k+1}(A).

Then:

\frac{1}{σ_{k+1}(A)} ≤ ‖A_{11}^{-1}‖_2 ≤ (k + 1)\,‖A_{11}^{-1}‖_max ≤ (k + 1)\,\frac{f}{|α|}.

Thus |α| ≤ f (k + 1) σ_{k+1}(A). The theorem follows by using this result with Theorem 3.2, with γ ≤ \sqrt{mn}\, f (k + 1).

Theorem 3.4 is a special case of Theorem 3.2 for SRLU factorizations. For a data matrix with a rapidly decaying spectrum, the right-hand side of the second inequality is close to σ_{j+1}(A), a substantial improvement over the sharpness of bound (2.1) above.

Theorem 3.5 (SRLU Spectral Bound). For 1 ≤ j ≤ k and τ ≤ O(mnk²f³), a rank-k SRLU factorization satisfies

\frac{σ_j(A)}{1 + τ \frac{σ_{k+1}(A)}{σ_j(A)}} ≤ σ_j(L_1 U_1) ≤ σ_j(A) \left(1 + τ \frac{σ_{k+1}(A)}{σ_j(A)}\right).


Proof. After running k iterations of rank-revealing LU,

Π_1 A Π_2^T = L_1 U_1 + C, where C = \begin{pmatrix} 0 & 0 \\ 0 & S \end{pmatrix}

and S is the Schur complement. Then

(3.4)    σ_j(A) ≤ σ_j(L_1 U_1) + ‖C‖_2 = σ_j(L_1 U_1) \left(1 + \frac{‖C‖_2}{σ_j(L_1 U_1)}\right).

For the upper bound:

σ_j(L_1 U_1) = σ_j(A − C) ≤ σ_j(A) + ‖C‖_2 = σ_j(A) \left(1 + \frac{‖C‖_2}{σ_j(A)}\right) = σ_j(A) \left(1 + \frac{‖S‖_2}{σ_j(A)}\right).

The final form is achieved using the same bound on γ as in Theorem 4.2.

Theorem 3.5 contrasts with the singular value bounds of [23]: an LU factorization is strong rank-revealing if it satisfies
• σ_i(A_{11}) ≥ σ_i(A)/q_1(k, m, n) and σ_j(S) ≤ σ_{k+1}(A)\,q_1(k, m, n),
• |(A_{21} A_{11}^{-1})_{ij}| ≤ q_2(k, m, n), and
• |(A_{11}^{-1} A_{12})_{ij}| ≤ q_3(k, m, n),
where 1 ≤ i ≤ k, 1 ≤ j ≤ n − k, and q_1(k, m, n), q_2(k, m, n), and q_3(k, m, n) are functions bounded by some low-degree polynomials in k, m, and n. (A rank-revealing factorization only provides a bound on the kth singular value.) A strong rank-revealing factorization requires numerical calculation or theoretical bounds of all of these expressions. Furthermore, the singular values of A_{11} are not the same as the singular values of the low-rank approximation induced by the truncated LU factorization, which could be larger than those of A. Alternatively, an SRLU factorization, which is a factorization of the form in Theorem 3.5, can be stated and calculated in a concise manner, and approximates the singular values of the desired approximation matrix. Condition (3.3) is more easily achieved than the requirements of a strong rank-revealing factorization, as these requirements imply (3.3). Note, nevertheless, that Theorem 3.5 is stronger than the singular value bounds of a strong rank-revealing factorization, indicating that the extra work for the stronger requirements is unnecessary. For the purpose of low-rank approximation, a spectrum-revealing factorization is generally more desirable than a strong rank-revealing factorization.

While the worst-case upper bound on τ is large, it is dimension-dependent, and j and k may be chosen so that τ σ_{k+1}(A)/σ_j(A) is arbitrarily small. This implies Theorem 3.5 is near-optimal when the gap between σ_j(A) and σ_{k+1}(A) is large, which is another advantage of SRLU bounds over strong rank-revealing bounds. In particular, if k is the numeric rank of A, then the singular values of the approximation are numerically equal to those of the data.


These bounds are matrix-specific because their quality depends on the spectrum of the original data, rather than on universal constants that appear in previous results. The benefit of these matrix-specific bounds is that an approximation of data with a rapidly decaying spectrum is guaranteed to be high-quality. Furthermore, the Eckart-Young Theorem, equation (1.1), implies that if σ_{k+1}(A) is not small compared to σ_j(A), then no high-quality low-rank approximation is possible in the 2 and Frobenius norms. Thus, in this sense, the bounds presented in Theorems 3.4 and 3.5 are optimal. Given a high-quality rank-k truncated LU factorization, Theorem 3.5 ensures that a low-rank approximation of rank ℓ with ℓ < k of the compressed data is an accurate rank-ℓ approximation of the full data. The proof of this theorem centers on bounding the terms in Theorems 3.2 and 3.3. Experiments will show that τ is small in almost all cases.

4. The CUR Decomposition with LU. A natural extension of truncated LU factorizations is a CUR-type decomposition for improved accuracy [20]:

Π_1 A Π_2^T ≈ L_1 \left(L_1^† A U_1^†\right) U_1 \stackrel{def}{=} L_1 M U_1.

A general result for truncated LU factorizations:

Theorem 4.1.
‖Π_1 A Π_2^T − L_1 M U_1‖_2 ≤ 2‖S‖_2,
‖Π_1 A Π_2^T − L_1 M U_1‖_F ≤ ‖S‖_F.

Proof. Here Q_1^L and Q_1^U denote orthonormal bases for the column space of L_1 and the row space of U_1, respectively (from the QR and LQ decompositions introduced in the proof of Theorem 4.3 below), Q_2^L and Q_2^U complete them to square orthogonal matrices, and C = \begin{pmatrix} 0 & 0 \\ 0 & S \end{pmatrix} as in the proof of Theorem 3.5. In this notation L_1 M U_1 = Q_1^L (Q_1^L)^T A (Q_1^U)^T Q_1^U, and by (4.1) and (4.2) below, (Q_2^L)^T A = (Q_2^L)^T C and A (Q_2^U)^T = C (Q_2^U)^T. First,

‖Π_1 A Π_2^T − L_1 M U_1‖_2 = \left\| \begin{pmatrix} 0 & (Q_1^L)^T C (Q_2^U)^T \\ (Q_2^L)^T C (Q_1^U)^T & (Q_2^L)^T C (Q_2^U)^T \end{pmatrix} \right\|_2
  ≤ \left\| (Q_1^L)^T C (Q_2^U)^T \right\|_2 + \left\| (Q_2^L)^T C \begin{pmatrix} (Q_1^U)^T & (Q_2^U)^T \end{pmatrix} \right\|_2
  ≤ 2‖C‖_2 = 2‖S‖_2.

Also,

‖Π_1 A Π_2^T − L_1 M U_1‖_F = \left\| Π_1 A Π_2^T − Q_1^L (Q_1^L)^T A (Q_1^U)^T Q_1^U \right\|_F
  = \left\| Q_1^L (Q_1^L)^T C (Q_2^U)^T Q_2^U + Q_2^L (Q_2^L)^T C (Q_1^U)^T Q_1^U + Q_2^L (Q_2^L)^T C (Q_2^U)^T Q_2^U \right\|_F
  = \left\| \begin{pmatrix} 0 & (Q_1^L)^T C (Q_2^U)^T \\ (Q_2^L)^T C (Q_1^U)^T & (Q_2^L)^T C (Q_2^U)^T \end{pmatrix} \right\|_F
  ≤ ‖C‖_F = ‖S‖_F.
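As a brief illustration of this CUR-type refinement, the sketch below forms M = L_1^† Ap U_1^† with pseudoinverses and returns L_1 M U_1; by the Frobenius-norm case of Theorem 4.1, this approximation is never worse than L_1 U_1 in the Frobenius norm. Here Ap stands for the permuted matrix Π_1 A Π_2^T, the factors are assumed to come from a truncated LU such as the earlier sketch, and all names are illustrative.

import numpy as np

def cur_from_truncated_lu(Ap, L1, U1):
    """CUR-type refinement: M = pinv(L1) @ Ap @ pinv(U1), approximation L1 @ M @ U1."""
    M = np.linalg.pinv(L1) @ Ap @ np.linalg.pinv(U1)
    return L1 @ M @ U1

# usage with the factors from the earlier truncated-LU sketch:
#   Ap = A[np.ix_(rp, cp)]
#   A_cur = cur_from_truncated_lu(Ap, L1, U1)
#   print(np.linalg.norm(Ap - A_cur), np.linalg.norm(Ap - L1 @ U1))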

The CUR decomposition can be created from either a TRLUCP factorization or an SRLU factorization. The next two results are specifically for SRLU:

Theorem 4.2.
‖Π_1 A Π_2^T − L_1 M U_1‖_2 ≤ 2γ σ_{k+1}(A),
‖Π_1 A Π_2^T − L_1 M U_1‖_F ≤ ω σ_{k+1}(A),
where γ = O(f k \sqrt{mn}) is the same as in Theorem 3.4, and ω = O(f k m n).

Proof. Note that the definition of α implies ‖S‖_F ≤ (m − k)(n − k)|α|. The rest follows by using Theorem 4.1 in a manner similar to how Theorem 3.4 invoked Theorem 3.2.

As before, γ and ω are small in practice.

Theorem 4.3. If σ_j^2(A) > 2‖S‖_2^2, then

σ_j(A) ≥ σ_j(L_1 M U_1) ≥ σ_j(A) \sqrt{1 − 2γ \left(\frac{σ_{k+1}(A)}{σ_j(A)}\right)^2},

where γ = O(mnk²f²), and f is an input parameter controlling a tradeoff of quality versus speed as before.

Proof. Perform the QR and LQ decompositions

L = Q^L R^L =: \begin{pmatrix} Q_1^L & Q_2^L \end{pmatrix} \begin{pmatrix} R_{11}^L & R_{12}^L \\ & R_{22}^L \end{pmatrix}
and
U = L^U Q^U =: \begin{pmatrix} L_{11}^U & \\ L_{21}^U & L_{22}^U \end{pmatrix} \begin{pmatrix} Q_1^U \\ Q_2^U \end{pmatrix}.

Then

L_1 M U_1 = Q_1^L (Q_1^L)^T A (Q_1^U)^T Q_1^U.

Note that

(4.1)
A^T Q_2^L = (L_1 U_1 + C)^T Q_2^L
          = \left(Q_1^L R_{11}^L L_{11}^U Q_1^U + C\right)^T Q_2^L
          = (Q_1^U)^T (L_{11}^U)^T (R_{11}^L)^T (Q_1^L)^T Q_2^L + C^T Q_2^L
          = C^T Q_2^L.

Analogously,

(4.2)    A (Q_2^U)^T = C (Q_2^U)^T.


Then

σ_j(A) = σ_j\left( \begin{pmatrix} (Q_1^L)^T A (Q_1^U)^T & (Q_1^L)^T A (Q_2^U)^T \\ (Q_2^L)^T A (Q_1^U)^T & (Q_2^L)^T A (Q_2^U)^T \end{pmatrix} \right)
       = σ_j\left( \begin{pmatrix} (Q_1^L)^T A (Q_1^U)^T & (Q_1^L)^T C (Q_2^U)^T \\ (Q_2^L)^T C (Q_1^U)^T & (Q_2^L)^T C (Q_2^U)^T \end{pmatrix} \right)

by (4.1) and (4.2). Writing this matrix as X and expanding λ_j(X^T X), the cross terms involving C contribute at most 2‖C‖_2^2, so that

σ_j(A) = \sqrt{λ_j(X^T X)}
       ≤ \sqrt{ σ_j^2\left((Q_1^L)^T A (Q_1^U)^T\right) + 2‖C‖_2^2 }
       = \sqrt{ σ_j^2(L_1 M U_1) + 2‖C‖_2^2 }
       = σ_j(L_1 M U_1) \sqrt{ 1 + 2\left(\frac{‖S‖_2}{σ_j(L_1 M U_1)}\right)^2 }.

Solve for σ_j(L_1 M U_1) for the lower bound. The upper bound:

σ_j(A) = σ_j\left( \begin{pmatrix} (Q_1^L)^T A (Q_1^U)^T & (Q_1^L)^T A (Q_2^U)^T \\ (Q_2^L)^T A (Q_1^U)^T & (Q_2^L)^T A (Q_2^U)^T \end{pmatrix} \right)
       ≥ σ_j\left( (Q_1^L)^T A (Q_1^U)^T \right)
       = σ_j\left( Q_1^L (Q_1^L)^T A (Q_1^U)^T Q_1^U \right)
       = σ_j(L_1 M U_1).

Observe that for most data matrices in practice, their singular values decay with increasing j. For such matrices this result is significantly stronger than Theorem 3.5. No known comparable result exists.


Table 1: Rank-300 TRLUCP factorization time (seconds) of random matrices of various sizes.

                         Cols
Rows        2k       4k       6k       8k
 2k      0.065    0.122    0.180    0.237
 4k      0.112    0.193    0.277    0.360
 6k      0.159    0.265    0.375    0.485
 8k      0.202    0.335    0.471    0.607

Table 2: Rank-300 Truncated DGETRF factorization time (seconds) of random matrices of various sizes.

                         Cols
Rows        2k       4k       6k       8k
 2k      0.105    0.215    0.326    0.435
 4k      0.228    0.436    0.659    0.879
 6k      0.325    0.656    0.989    1.328
 8k      0.442    0.890    1.339    1.788

5. Numeric Experiments. Because this work introduces a novel data matrix factorization, a variety of numeric experiments are now presented to verify its properties. While there are many informative examples that are not included here, the most pertinent experiments are offered below.

5.1. Benchmarking. The runtime of TRLUCP is compared to the runtime of Truncated DGETRF¹ in Tables 1 and 2. As the size of the data matrix increases, the TRLUCP factorization appears to become increasingly faster than Truncated DGETRF. Correcting swaps are not included here, as none are expected in most cases. The efficiency advantage over LU demonstrates that TRLUCP is also more efficient than comparable QR and SVD factorizations, as well as methods based on QR and SVD factorizations. Benchmarking experiments were run on 1 core of NERSC's Edison and compiled against Intel's MKL.

5.2. Accuracy. Several accuracy tests are explored here to illustrate that TRLUCP and SRLU provide accurate data approximations. In Figure 1, three random matrices with differing rates of spectral decay are factored using TRLUCP only and using SRLU. The relative Frobenius norm error (compared to the SVD) for various truncation ranks is always below the error bound provided by (2.1). In Figure 2, as the truncation rank increases, the convergence of the 2-norm error of the TRLUCP and CUR approximations is shown for random matrices with different rates of spectral decay.

¹ Truncated DGETRF is the standard LAPACK right-looking DGETRF factorization with partial row pivoting, but which is aborted after k rows and columns have been factored. The factorization is not guaranteed to be meaningful, and is for benchmarking only.


Fig. 1: The relative error in the Frobenius norm for random matrices of varying spectral decay. No swaps were needed. For faster spectral decay, larger truncation ranks are not included because the factorization error is near machine epsilon. The error bound for the previous results is presented as a range because this bound depends on the choice of ε. Here 0 < ε < 1/2.

Table 3: Mean values of the constants from the theorems presented in this work, for various random matrices. Constants for spectral theorems are averaged over the top 10 singular values. No swaps were needed, and so SRLU results match TRLUCP.

Constant    Theorem            Method     Mean
γ           3.4 (1st ineq.)    TRLUCP     7.97
γ           3.4 (2nd ineq.)    TRLUCP     0.04
τ           3.5                TRLUCP     0.16
γ           4.2                TRLUCP     1.57
ω           4.2                TRLUCP     0.24
γ           4.3                TRLUCP     0.40

Next, we test the dimension-dependent constants that appear in the theorems in sections 3.2 and 4. Although the data matrix does affect the size of the constants seen in practice, Table 3 summarizes the constants observed for random matrices whose spectra decay at rates 0.9, 0.8, and 0.7. In the case of the spectral bounds, the values were taken over the top 10 singular values.


Fig. 2: Comparison of the convergence of the error of TRLUCP and of TRLUCP with CUR versus the error of the truncated SVD as the truncation rank increases. The rate of spectral decay of each test matrix is in parentheses. No swaps were needed.

5.3. Feature Selection. An image processing example is now presented to illustrate the benefit of feature selection. In Figure 3 an image is compressed to a rank-50 approximation using SRLU. Note that the rows and columns chosen overlap with the astronaut and the planet, implying that minimal storage is needed to capture the black background, which composes approximately two thirds of the image.

Fig. 3: Image processing example. The original image [24], a rank-50 approximation with SRLU, and a highlight of the rows and columns selected by SRLU.

5.4. Testing the Necessity of Swaps. The theoretical results presented in this work require correcting swaps, although few are expected in general. Table 3 illustrates that swaps are not needed in many cases, that TRLUCP produces reasonable low-rank data approximations, and that correcting swaps may provide minimal benefit. The number of swaps performed, nevertheless, is highly dependent on the choice of f. For a random 1000-by-1000 matrix with spectral decay rate 0.9 and truncation rank k = 100, values of f = 1.5, 2, and 5 require 11.1, 1.3, and 0 swaps, respectively,


when averaged over ten tests. Note that the random projection means the number of swaps may vary when an experiment is repeated. We recommend a choice of f = 5 or 10 in general. 5.5. Block Size Experiments. The methodology for choosing a block size presented in section 3.1.2 is compared to other choices of block size in Figure 4. Note that LAPACK generally chooses a block size of 64 for these matrices, which is suboptimal in all cases, and can be up to twice as slow. In all of the cases tested, the calculated block size is close to or exactly the optimal block size.

Fig. 4: Benchmarking TRLUCP with various block sizes on random matrices of different sizes and truncation ranks.

5.6. Sparsity Preservation. The sparsity-preserving nature of SRLU is demonstrated on a sparse circuit simulation matrix example from [6] (example 1112). The original 430-by-430 matrix contains 1,544 nonzero entries (Figure 5). An SRLU factorization of rank 43 produces an approximation of 0.257% relative error in the 2-norm (no correcting swaps are needed in SRLU). If the Schur complement is updated, then this decomposition produces an exact factorization with 3,542 nonzeros (Figure 6). If just the approximation is considered (i.e. the leading 43 columns of the L factor and leading 43 rows of the U factor), then only 868 nonzeros appear, over 40% storage savings. A full LU factorization with partial row pivoting yields 9,729 nonzeros. Figure 6 demonstrates how the amount of fill increases as an LU factorization progresses, indicating a sparsity-preserving benefit of a truncated factorization. The full SVD is dense and has 369,204 nonzeros. A QR factorization with column pivoting produces 23,296 nonzeros.


Fig. 5: Circuit simulation example and sparsity pattern of corresponding simulation matrix.

Fig. 6: Top row: Sparsity pattern of rank 43 SRLU factorization. The green entries compose the low-rank approximation of the data. The red entries are the additional data needed for an exact factorization. Bottom row: Sparsity pattern of full rightlooking LU factorization with partial row pivoting.

5.7. Online Data Processing. Given a factored data matrix A ∈ R^{m×n} and new observations B = ( B_1  B_2 ) ∈ R^{s×n}, where B_1 contains the first k columns (in the column order of the factorization) and B_2 the remaining n − k columns, an augmented LU decomposition takes the form

\begin{pmatrix} Π_1 A Π_2^T \\ B \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ & S \\ & S^{new} \end{pmatrix},


where L_{31} = B_1 U_{11}^{-1} and S^{new} = B_2 − B_1 U_{11}^{-1} U_{12}. An SRLU factorization can then be obtained by simply performing correcting swaps. The expected number of swaps, nevertheless, will in general be larger than the number required for the initial factorization, but is bounded by min(s, k). Consider a random 500-by-500 matrix with a spectrum decaying at the rate of 0.9, a truncation rank of k = 100, and tuning parameter f = 1.8. The SRLU factorization was performed on the first 250 rows, and then updated using the data integration method described in this section. The test was repeated 20 times, and in all cases the relative error of the top 10 singular values of the final factorization was near machine epsilon. On average, 3.4 swaps were needed for the factorization of the first 250 rows and 3.3 swaps for the data integration. Factoring the entire data matrix at once required an average of 3.95 swaps. Thus a few extra swaps are required for online data processing, but TRLUCP only needed to be run on half of the data.
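The row-update formula above can be implemented with two triangular solves, as in the following sketch; B is assumed to already be in the column order of the factorization (i.e., permuted by Π_2^T), and the function and variable names are illustrative.

import numpy as np
from scipy.linalg import solve_triangular

def add_rows(U11, U12, B):
    """Absorb new observations B = [B1 B2] into an existing truncated LU factorization."""
    k = U11.shape[0]
    B1, B2 = B[:, :k], B[:, k:]
    # L31 solves L31 @ U11 = B1, i.e. U11^T @ L31^T = B1^T (a triangular solve)
    L31 = solve_triangular(U11, B1.T, trans='T', lower=False).T
    S_new = B2 - L31 @ U12
    return L31, S_new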

6. Conclusion. We have introduced the Spectrum-Revealing LU factorization for computing low-rank data approximations, and have provided the first algorithm to efficiently and reliably calculate the SRLU decomposition. This algorithm is composed of two subroutines: TRLUCP, the powerhouse behind SRLU, and SRP, which is rarely needed in practice, but ensures the quality of the SRLU factorization. SRP also introduces a novel method for updating truncated LU decompositions with better stability than previous LU updating algorithms. We have shown strong error and singular value bounds for this data approximation method, and have included numeric experiments to demonstrate that SRLU has many desirable properties: efficiency, accuracy, sparsity preservation, the ability to be updated, and the ability to highlight important data features and variables.

REFERENCES

[1] Y. Aizenbud, G. Shabat, and A. Averbuch. Randomized LU decomposition using sparse projections. CoRR, abs/1601.04280, 2016.
[2] J. Batson, D. A. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.
[3] E. Chow and A. Patel. Fine-grained parallel incomplete LU factorization. SIAM J. Scientific Computing, 37(2), 2015.
[4] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. CoRR, abs/1207.6365, 2012.
[5] P. D. Crout. A short method for evaluating determinants and solving systems of linear equations with real or complex coefficients. AIEE Transactions, 60:1235–1240, 1941.
[6] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38:1:1–1:25, 2011. http://www.cise.ufl.edu/research/sparse/matrices.
[7] J. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[8] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(12):225–247, 2006.
[9] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In APPROX-RANDOM, volume 4110 of Lecture Notes in Computer Science, pages 292–303. Springer, 2006.
[10] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
[11] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025–1041, 2004.
[12] G. H. Golub and C. F. van Loan. Matrix Computations. JHU Press, 4th edition, 2013.
[13] J. Gondzio. Stable algorithm for updating dense LU factorization after row or column exchange and row and column addition or deletion. Optimization: A Journal of Mathematical Programming and Operations Research, pages 7–26, 2007.


[14] L. Grigori, J. Demmel, and X. S. Li. Parallel symbolic factorization for sparse LU with static pivoting. SIAM J. Scientific Computing, 29(3):1289–1314, 2007.
[15] M. Gu and S. C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17(4):848–869, 1996.
[16] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[17] N. J. Higham and S. D. Relton. Estimating the largest entries of a matrix. 2015.
[18] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[19] A. Khabou, J. Demmel, L. Grigori, and M. Gu. LU factorization with panel rank revealing pivoting and its communication avoiding version. SIAM J. Matrix Analysis Applications, 34(3):1401–1429, 2013.
[20] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[21] P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the approximation of matrices. Tech. Rep. 1361, Yale University, Department of Computer Science, 2006.
[22] C. Melgaard and M. Gu. Gaussian elimination with randomized complete pivoting. CoRR, abs/1511.08528, 2015.
[23] L. Miranian and M. Gu. Strong rank revealing LU factorizations. Linear Algebra and its Applications, 367:1–16, 2003.
[24] NASA. NASA celebrates 50 years of spacewalking. https://www.nasa.gov/image-feature/nasa-celebrates-50-years-of-spacewalking. Accessed on August 22, 2015. Published on June 3, 2015. Original photograph from February 7, 1984.
[25] C.-T. Pan. On the existence and computation of rank-revealing LU factorizations. Linear Algebra and its Applications, 316(1):199–222, 2000.
[26] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217–235, 2000.
[27] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152. IEEE Computer Society, 2006.
[28] J. H. Wilkinson. Error analysis of direct methods of matrix inversion. J. ACM, 8(3):281–330, 1961.
[29] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.