An Efficient Solver for Sparse Linear Systems based on Rank-Structured Cholesky Factorization
David Bindel and Jeffrey Chadwick
Department of Computer Science, Cornell University
22 July 2013
u = K \ f

Great for circuit simulations, 1D or 2D finite elements, etc.
Standard advice to students: just try backslash for these problems.
Standard response: what about the 3D case?
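(Aside: a minimal SciPy sketch of the "just use backslash" advice; the 2D Laplacian below is a stand-in model problem, not the talk's benchmark, and the variable names are mine.)

    # Sparse direct solve, the SciPy analogue of MATLAB's u = K \ f.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 50
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    K = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))   # 2D Laplacian on an n-by-n grid
    f = np.ones(K.shape[0])
    u = spla.spsolve(K.tocsc(), f)                      # "backslash": sparse direct solve
    print(np.linalg.norm(K @ u - f))                    # residual check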
“Try PCG with a good preconditioner. Maybe start with the ones in PETSc. You’ve taken Matrix Computations, right? Blah blah yadda blah...”
(Not an actual student)
Direct or iterative?

[Figure: sparsity patterns of A and its Cholesky factor L]

Conventional wisdom: Gaussian elimination scales poorly. Iterate instead!
Pro: less memory, potentially better complexity
Con: less robust, potentially worse memory access patterns
Commercial finite element codes still use (out-of-core) Cholesky.
Longer compute times, but fewer tech support hours.
Desiderata
I want a code for sparse Cholesky (A = LL^T) that
- Handles modest problems on a desktop (or laptop?)
- Inside a loop, without trying my patience
  ⟹ Does not need gobs of memory
  ⟹ Makes effective use of level 3 BLAS
- Requires little parameter fiddling / hand-holding
- Works with general elliptic problems (esp. elasticity)

Idea is in the air – see talks by Darve and Li (bracketing me!), also work by Xia, Gu, Martinsson, Ying, ...
From ND to “superfast” ND
ND gets performance using just graph structure:
  2D: O(N^{3/2}) time, O(N log N) space.
  3D: O(N^2) time, O(N^{4/3}) space.

Superfast ND reduces space/time complexity via low-rank structure.
Strategy
- Start with CHOLMOD (a good supernodal left-looking Cholesky)
  - Supernodal data structures are compact
  - Algorithm + data layout ⟹ most work in level 3 BLAS
  - Widely used already (so re-use the API!)
- Incorporate compact representations for low-rank blocks
  - Outer product for off-diagonal blocks
  - HSS-style representations for diagonal blocks
- Optimize, test, swear, fix, repeat
Supernodal storage structure
[Figure: supernode j of the factor L: dense diagonal block L^D_j and collapsed off-diagonal block L^O_j form the supernode block L_j]

L(C_j, C_j) ≡ L^D_j (dense diagonal block)
L(R_j, C_j) ≡ L^O_j (off-diagonal block, stored collapsed)
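Below, a rough sketch of the per-supernode storage this picture implies; the Python field names are illustrative, not CHOLMOD's actual layout.

    # Illustrative supernode container: column set C_j, row set R_j, the dense
    # diagonal factor L^D_j, and the collapsed off-diagonal block L^O_j.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Supernode:
        cols: np.ndarray   # C_j: columns owned by this supernode
        rows: np.ndarray   # R_j: row indices below the diagonal block
        LD: np.ndarray     # L(C_j, C_j), dense lower triangular, |C_j| x |C_j|
        LO: np.ndarray     # L(R_j, C_j), stored collapsed as a dense |R_j| x |C_j| block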
Supernode factorization

Initialize storage:
  U^D_j ← A(C_j, C_j),  U^O_j ← A(R_j, C_j)
Pull Schur contributions:
  for each k ∈ D_j do
    Build dense updates from L^O_k
    Scatter updates to U^D_j and U^O_j
Finish forming L^D_j and L^O_j:
  L^D_j ← cholesky(U^D_j)
  L^O_j ← U^O_j (L^D_j)^{-T}

What changes in the rank-structured Cholesky?
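A dense NumPy sketch of the uncompressed step above, assuming the descendants' updates have already been built and scattered into U_DD and U_OD (names and the subtraction convention are mine):

    import numpy as np
    import scipy.linalg as sla

    def factor_supernode(A_DD, A_OD, U_DD, U_OD):
        """One supernode step: A_DD = A(C_j,C_j), A_OD = A(R_j,C_j); U_DD, U_OD
        hold the accumulated Schur updates from the descendants k in D_j."""
        UD = A_DD - U_DD                                     # initialize storage + pull updates
        UO = A_OD - U_OD
        LD = np.linalg.cholesky(UD)                          # L^D_j = cholesky(U^D_j)
        LO = sla.solve_triangular(LD, UO.T, lower=True).T    # L^O_j = U^O_j (L^D_j)^{-T}
        return LD, LO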
Off-diagonal block compression
[Figure: collapsed L^O_j ≈ V_j U_j^T (compressed L^O_j)]

Collapsed off-diagonal block is a (nearly low-rank) dense matrix
Off-diagonal block compression

  G ← rand(|C_j|, r + p)
  C ← (L^O_j)^T (L^O_j G)
  for i = 1, ..., s do
    C ← (L^O_j)^T (L^O_j C)
  U_j ← orth(C)
  V_j ← L^O_j U_j

Compress without explicit L^O_j:
- Probe (L^O_j)^T with random G
- Extract orthonormal row basis U_j
- L^O_j = V_j U_j^T ⟹ V_j = L^O_j U_j
Where do we get the estimated rank bound r?
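A NumPy sketch of this step under the reading above; apply_LO and apply_LOT are assumed callbacks that apply L^O_j and its transpose without ever forming the block:

    import numpy as np

    def compress_offdiag(apply_LO, apply_LOT, ncols, r, p=10, s=2, seed=0):
        """Randomized sketch of L^O_j ~= V_j U_j^T from matrix-vector products only.
        ncols = |C_j|, r = estimated rank, p = oversampling, s = power steps."""
        rng = np.random.default_rng(seed)
        G = rng.standard_normal((ncols, r + p))
        C = apply_LOT(apply_LO(G))        # probe the row space of L^O_j
        for _ in range(s):                # power iterations sharpen the basis
            C = apply_LOT(apply_LO(C))
        U, _ = np.linalg.qr(C)            # orthonormal row basis U_j
        V = apply_LO(U)                   # V_j = L^O_j U_j, so L^O_j ~= V_j U_j^T
        return V, U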
(U. Waterloo, ND40)
Rank-Structured Cholesky
2013-07-22
11 / 27
Interaction rank
Could dynamically estimate the rank of L^O_j.
Practice: empirical rank bound ≈ α √k log(k).
Optimization: Selective off-diagonal compression
Compress off-diagonal blocks of sufficiently large supernodes (j1, j2).
Optimization: Interior blocks
[Figure: mesh partition with blocks B1–B4 and supernodes j1, j2, j3]
Don’t store any of L^O_j for “interior” blocks.
(Represent as L^O_j = A^O_j (L^D_j)^{-T} when needed.)
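A small sketch of the implicit representation (helper name is mine): applying L^O_j for an interior block reduces to a triangular solve with L^D_j followed by a sparse multiply with A^O_j = A(R_j, C_j).

    import scipy.linalg as sla

    def apply_interior_LO(A_OD, LD, X):
        """Compute L^O_j @ X without storing L^O_j, via L^O_j = A(R_j,C_j) (L^D_j)^{-T}.
        A_OD: sparse A(R_j,C_j); LD: dense lower-triangular L^D_j; X: |C_j| x k."""
        Y = sla.solve_triangular(LD, X, lower=True, trans='T')   # Y = (L^D_j)^{-T} X
        return A_OD @ Y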
Diagonal block compression

L^D_j =
[ L^D_{j,1}                                      ]
[ L^D_{j,2}   L^D_{j,3}                          ]
[       L^D_{j,4}          L^D_{j,5}             ]
[                          L^D_{j,6}   L^D_{j,7} ]

(blank positions are zero; L^D_{j,4} spans the two leading block columns)
Basic observation: off-diagonal blocks are low-rank. (H-matrix, semiseparable structure, quasiseparable structure, ...) Assumes reasonable ordering of unknowns!
Diagonal block compression
L^D_j ≈
[ L^D_{j,1}                                                            ]
[ V_{j,2} (U^D_{j,2})^T   L^D_{j,3}                                    ]
[        V_{j,4} (U^D_{j,4})^T          L^D_{j,5}                      ]
[                                       V_{j,6} (U^D_{j,6})^T   L^D_{j,7} ]

(same pattern as before, with each off-diagonal block stored as a low-rank outer product)

How do we get directly to this without forming U^D_j explicitly?
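To see why the compressed form pays off, here is a sketch of a forward solve with one level of this structure (the real factor nests this pattern recursively; the helper is illustrative, not the code's API):

    import scipy.linalg as sla

    def forward_solve_2block(L1, L2, V, U, b1, b2):
        """Solve [L1 0; V U^T L2] [x1; x2] = [b1; b2] using the low-rank off-diagonal."""
        x1 = sla.solve_triangular(L1, b1, lower=True)
        # off-diagonal contribution through the compressed factors:
        # (V U^T) x1 costs O((m + n) r) instead of O(m n)
        x2 = sla.solve_triangular(L2, b2 - V @ (U.T @ x1), lower=True)
        return x1, x2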
Forming compressed updates
[Figure: blocks L^D_{j,1}, ..., L^D_{j,7} and the descendant set D_j]
Rank-structured supernode factorization
Basic ingredients:
- Randomized algorithms form U^D_j
- Rank-structured factorization of U^D_j
- Randomized algorithm forms L^O_j (involves solves with L^D_j)
Plus various optimizations.
Example: Large deformation of an elastic block
Benchmark based on example from deal.II:
- Nearly-incompressible hyperelastic block under compression
- Mixed FE formulation (pressure and dilation condensed out)
- Tried both p = 1 and p = 2 finite elements
- Two load steps, Newton on each (14-15 steps)

Experimental setup:
- 8-core Xeon X5570 with 48 GB RAM
- LAPACK/BLAS from MKL 11.0
- PCG + preconditioners from Trilinos
RSC vs standard preconditioners (p = 1, N = 50)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for the RSC, ICC, ML, and Jacobi preconditioners]
RSC vs standard preconditioners (p = 2, N = 35)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for the RSC, ICC, ML, and Jacobi preconditioners]
Time and memory comparisons (p = 1)

[Figure: solve time (s) and memory (GB) vs. problem size n (up to 1.5 x 10^6) for RSC, ICC, ML, Jacobi, and Cholesky]
Effect of in-separator ordering

Semi-separable diagonal compression relies on the variable order – don’t want any old order!
Apply recursive bisection based on spatial coordinates:
- Use coordinates if known
- Else assign spectrally

[Figure: relative residual vs. iteration for geometric, spectral (eigenvector), and random in-separator orderings]
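A minimal sketch of the geometric variant (recursive coordinate bisection); the spectral fallback is not shown, and the routine is illustrative only.

    import numpy as np

    def rcb_order(coords):
        """Recursive coordinate bisection ordering of n points given as an n x d array.
        Splits along the axis of largest spread at the median and recurses."""
        def recurse(ids):
            if len(ids) <= 2:
                return list(ids)
            pts = coords[ids]
            axis = np.argmax(pts.max(axis=0) - pts.min(axis=0))
            order = ids[np.argsort(pts[:, axis], kind='stable')]
            mid = len(order) // 2
            return recurse(order[:mid]) + recurse(order[mid:])
        return np.array(recurse(np.arange(coords.shape[0])))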
Example: Trabecular bone model (≈ 1M dof)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for RSC1, RSC2, ICC, and ML]
Example: Steel flange (≈ 1.5M dof)

[Figure: relative residual vs. iteration count and vs. wall-clock time (seconds) for RSC1, RSC2, ML, and ICC]
Conclusions
For more: www.cs.cornell.edu/~bindel
[email protected]

Current status: Jeff has graduated! But...
- Code basically ready (still tweaking build system)
- Paper is mostly there (submit + arXiv before September)