An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization
David Bindel and Jeffrey Chadwick
Department of Computer Science, Cornell University
22 July 2013
u = K \ f

Great for circuit simulations, 1D or 2D finite elements, etc.
Standard advice to students: just try backslash for these problems.
Standard response: what about the 3D case?
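(Aside: a minimal SciPy sketch of the "just use backslash" advice; the 2D Laplacian below is a stand-in model problem, not the talk's benchmark, and the variable names are mine.)

    # Sparse direct solve, the SciPy analogue of MATLAB's u = K \ f.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 50
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    K = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))   # 2D Laplacian on an n-by-n grid
    f = np.ones(K.shape[0])
    u = spla.spsolve(K.tocsc(), f)                      # "backslash": sparse direct solve
    print(np.linalg.norm(K @ u - f))                    # residual check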
“Try PCG with a good preconditioner. Maybe start with the ones in PETSc. You’ve taken Matrix Computations, right? Blah blah yadda blah...”
(Not an actual student)
Direct or iterative?

[Figure: sparsity patterns of A and its Cholesky factor L]

Conventional wisdom: Gaussian elimination scales poorly. Iterate instead!
Pro: less memory, potentially better complexity
Con: less robust, potentially worse memory access patterns
Commercial finite element codes still use (out-of-core) Cholesky.
Longer compute times, but fewer tech support hours.
Desiderata
I want a code for sparse Cholesky (A = LL^T) that
- Handles modest problems on a desktop (or laptop?)
- Inside a loop, without trying my patience
  ⟹ Does not need gobs of memory
  ⟹ Makes effective use of level 3 BLAS
- Requires little parameter fiddling / hand-holding
- Works with general elliptic problems (esp. elasticity)

Idea is in the air – see talks by Darve and Li (bracketing me!), also work by Xia, Gu, Martinsson, Ying, ...
From ND to “superfast” ND
ND gets performance using just graph structure:
  2D: O(N^{3/2}) time, O(N log N) space.
  3D: O(N^2) time, O(N^{4/3}) space.

Superfast ND reduces space/time complexity via low-rank structure.
Strategy
- Start with CHOLMOD (a good supernodal left-looking Cholesky)
  - Supernodal data structures are compact
  - Algorithm + data layout ⟹ most work in level 3 BLAS
  - Widely used already (so re-use the API!)
- Incorporate compact representations for low-rank blocks
  - Outer product for off-diagonal blocks
  - HSS-style representations for diagonal blocks
- Optimize, test, swear, fix, repeat
Supernodal storage structure
[Figure: supernode j of the factor L: dense diagonal block L^D_j and collapsed off-diagonal block L^O_j form the supernode block L_j]

L(C_j, C_j) ≡ L^D_j (dense diagonal block)
L(R_j, C_j) ≡ L^O_j (off-diagonal block, stored collapsed)
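Below, a rough sketch of the per-supernode storage this picture implies; the Python field names are illustrative, not CHOLMOD's actual layout.

    # Illustrative supernode container: column set C_j, row set R_j, the dense
    # diagonal factor L^D_j, and the collapsed off-diagonal block L^O_j.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Supernode:
        cols: np.ndarray   # C_j: columns owned by this supernode
        rows: np.ndarray   # R_j: row indices below the diagonal block
        LD: np.ndarray     # L(C_j, C_j), dense lower triangular, |C_j| x |C_j|
        LO: np.ndarray     # L(R_j, C_j), stored collapsed as a dense |R_j| x |C_j| block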
Supernode factorization

Initialize storage:
  U^D_j ← A(C_j, C_j),  U^O_j ← A(R_j, C_j)
Pull Schur contributions:
  for each k ∈ D_j do
    Build dense updates from L^O_k
    Scatter updates to U^D_j and U^O_j
Finish forming L^D_j and L^O_j:
  L^D_j ← cholesky(U^D_j)
  L^O_j ← U^O_j (L^D_j)^{-T}

What changes in the rank-structured Cholesky?
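A dense NumPy sketch of the uncompressed step above, assuming the descendants' updates have already been built and scattered into U_DD and U_OD (names and the subtraction convention are mine):

    import numpy as np
    import scipy.linalg as sla

    def factor_supernode(A_DD, A_OD, U_DD, U_OD):
        """One supernode step: A_DD = A(C_j,C_j), A_OD = A(R_j,C_j); U_DD, U_OD
        hold the accumulated Schur updates from the descendants k in D_j."""
        UD = A_DD - U_DD                                     # initialize storage + pull updates
        UO = A_OD - U_OD
        LD = np.linalg.cholesky(UD)                          # L^D_j = cholesky(U^D_j)
        LO = sla.solve_triangular(LD, UO.T, lower=True).T    # L^O_j = U^O_j (L^D_j)^{-T}
        return LD, LO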
Off-diagonal block compression
[Figure: collapsed L^O_j ≈ V_j U_j^T (compressed L^O_j)]

Collapsed off-diagonal block is a (nearly low-rank) dense matrix
Off-diagonal block compression

  G ← rand(|C_j|, r + p)
  C ← (L^O_j)^T (L^O_j G)
  for i = 1, ..., s do
    C ← (L^O_j)^T (L^O_j C)
  U_j ← orth(C)
  V_j ← L^O_j U_j

Compress without explicit L^O_j:
- Probe (L^O_j)^T with random G
- Extract orthonormal row basis U_j
- L^O_j = V_j U_j^T ⟹ V_j = L^O_j U_j
Where do we get the estimated rank bound r?
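A NumPy sketch of this step under the reading above; apply_LO and apply_LOT are assumed callbacks that apply L^O_j and its transpose without ever forming the block:

    import numpy as np

    def compress_offdiag(apply_LO, apply_LOT, ncols, r, p=10, s=2, seed=0):
        """Randomized sketch of L^O_j ~= V_j U_j^T from matrix-vector products only.
        ncols = |C_j|, r = estimated rank, p = oversampling, s = power steps."""
        rng = np.random.default_rng(seed)
        G = rng.standard_normal((ncols, r + p))
        C = apply_LOT(apply_LO(G))        # probe the row space of L^O_j
        for _ in range(s):                # power iterations sharpen the basis
            C = apply_LOT(apply_LO(C))
        U, _ = np.linalg.qr(C)            # orthonormal row basis U_j
        V = apply_LO(U)                   # V_j = L^O_j U_j, so L^O_j ~= V_j U_j^T
        return V, U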
(U. Waterloo, ND40)
Rank-Structured Cholesky
2013-07-22
11 / 27
Interaction rank
Could dynamically estimate the rank of L^O_j.
Practice: empirical rank bound ≈ α √k log(k).
Optimization: Selective off-diagonal compression
Compress off-diagonal blocks of sufficiently large supernodes (j1, j2).
Optimization: Interior blocks
[Figure: mesh partition with blocks B1–B4 and supernodes j1, j2, j3]
Don’t store any of L^O_j for “interior” blocks.
(Represent as L^O_j = A^O_j (L^D_j)^{-T} when needed.)
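A small sketch of the implicit representation (helper name is mine): applying L^O_j for an interior block reduces to a triangular solve with L^D_j followed by a sparse multiply with A^O_j = A(R_j, C_j).

    import scipy.linalg as sla

    def apply_interior_LO(A_OD, LD, X):
        """Compute L^O_j @ X without storing L^O_j, via L^O_j = A(R_j,C_j) (L^D_j)^{-T}.
        A_OD: sparse A(R_j,C_j); LD: dense lower-triangular L^D_j; X: |C_j| x k."""
        Y = sla.solve_triangular(LD, X, lower=True, trans='T')   # Y = (L^D_j)^{-T} X
        return A_OD @ Y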
Diagonal block compression

L^D_j =
[ L^D_{j,1}                                      ]
[ L^D_{j,2}   L^D_{j,3}                          ]
[       L^D_{j,4}          L^D_{j,5}             ]
[                          L^D_{j,6}   L^D_{j,7} ]

(blank positions are zero; L^D_{j,4} spans the two leading block columns)
Basic observation: off-diagonal blocks are low-rank. (H-matrix, semiseparable structure, quasiseparable structure, ...) Assumes reasonable ordering of unknowns!
Diagonal block compression
L^D_j ≈
[ L^D_{j,1}                                                            ]
[ V_{j,2} (U^D_{j,2})^T   L^D_{j,3}                                    ]
[        V_{j,4} (U^D_{j,4})^T          L^D_{j,5}                      ]
[                                       V_{j,6} (U^D_{j,6})^T   L^D_{j,7} ]

(same pattern as before, with each off-diagonal block stored as a low-rank outer product)

How do we get directly to this without forming U^D_j explicitly?
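To see why the compressed form pays off, here is a sketch of a forward solve with one level of this structure (the real factor nests this pattern recursively; the helper is illustrative, not the code's API):

    import scipy.linalg as sla

    def forward_solve_2block(L1, L2, V, U, b1, b2):
        """Solve [L1 0; V U^T L2] [x1; x2] = [b1; b2] using the low-rank off-diagonal."""
        x1 = sla.solve_triangular(L1, b1, lower=True)
        # off-diagonal contribution through the compressed factors:
        # (V U^T) x1 costs O((m + n) r) instead of O(m n)
        x2 = sla.solve_triangular(L2, b2 - V @ (U.T @ x1), lower=True)
        return x1, x2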
Forming compressed updates
[Figure: blocks L^D_{j,1}, ..., L^D_{j,7} and the descendant set D_j]
Rank-structured supernode factorization
Basic ingredients:
- Randomized algorithms form U^D_j
- Rank-structured factorization of U^D_j
- Randomized algorithm forms L^O_j (involves solves with L^D_j)
Plus various optimizations.
Example: Large deformation of an elastic block
Benchmark based on example from deal.II:
- Nearly-incompressible hyperelastic block under compression
- Mixed FE formulation (pressure and dilation condensed out)
- Tried both p = 1 and p = 2 finite elements
- Two load steps, Newton on each (14-15 steps)

Experimental setup:
- 8-core Xeon X5570 with 48 GB RAM
- LAPACK/BLAS from MKL 11.0
- PCG + preconditioners from Trilinos
RSC vs standard preconditioners (p = 1, N = 50)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for the RSC, ICC, ML, and Jacobi preconditioners]
RSC vs standard preconditioners (p = 2, N = 35)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for the RSC, ICC, ML, and Jacobi preconditioners]
Time and memory comparisons (p = 1)

[Figure: solve time (s) and memory (GB) vs. problem size n (up to 1.5 x 10^6) for RSC, ICC, ML, Jacobi, and Cholesky]
Effect of in-separator ordering

Semi-separable diagonal compression relies on the variable order – don’t want any old order!
Apply recursive bisection based on spatial coordinates:
- Use coordinates if known
- Else assign spectrally

[Figure: relative residual vs. iteration for geometric, spectral (eigenvector), and random in-separator orderings]
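A minimal sketch of the geometric variant (recursive coordinate bisection); the spectral fallback is not shown, and the routine is illustrative only.

    import numpy as np

    def rcb_order(coords):
        """Recursive coordinate bisection ordering of n points given as an n x d array.
        Splits along the axis of largest spread at the median and recurses."""
        def recurse(ids):
            if len(ids) <= 2:
                return list(ids)
            pts = coords[ids]
            axis = np.argmax(pts.max(axis=0) - pts.min(axis=0))
            order = ids[np.argsort(pts[:, axis], kind='stable')]
            mid = len(order) // 2
            return recurse(order[:mid]) + recurse(order[mid:])
        return np.array(recurse(np.arange(coords.shape[0])))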
Example: Trabecular bone model (≈ 1M dof)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for RSC1, RSC2, ICC, and ML]
Example: Steel flange (≈ 1.5M dof)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for RSC1, RSC2, ML, and ICC]
Conclusions
For more: www.cs.cornell.edu/~bindel
[email protected]

Current status: Jeff has graduated! But...
- Code basically ready (still tweaking build system)
- Paper is mostly there (submit + arXiv before September)