Assessing the impact of ABFT & Checkpoint composite strategies
George Bosilca (1), Aurélien Bouteiller (1), Thomas Hérault (1), Yves Robert (1,2) and Jack Dongarra (1)
1. University of Tennessee Knoxville, USA
2. École Normale Supérieure de Lyon & INRIA, France

April 25, 2014 - ICL Lunch talk

Outline

1 Motivation
2 ABFT&PeriodicCkpt
3 Performance Modeling
4 Periodic Checkpointing Protocols (for comparison)
5 Evaluation
  As function of α (% in library) and µ (MTBF)
  Weak Scaling
6 Conclusion

Faults: a reality?

Optimists
New technologies (graphene) solve the heat and voltage issues, making faults extremely unlikely
Progress in the fabrication process will make the hardware much more reliable
Simple methods fix the problem directly at the hardware level: unit duplication, checksums (ECC)

Skeptics
How many “faults” do you see on your daily working tool (laptop)?
Why would it be different with an Exascale tool?
That something could have been done before and nobody did it does not mean nobody will do it in the future (!)
Do you have life insurance?

Faults: with big numbers comes big responsibility

Assume independent failures
Let r be the probability of a component to operate for 1h
Let N be the number of components (“System Size”)
Let R be the probability of the system to operate for 1h

$R = r^N$
$R \approx \frac{1}{e^{\lambda N}}, \quad \lambda = 1 - r$

[Figure: MTTF (hours, 0 to 140) as a function of System Size (100 to 100000), one curve per 1 hour reliability 0.9999, 0.99999 and 0.999999. Figure from Dan Reed, “The Challenge of Complexity and Scale”, replotted.]
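The reliability argument above can be sketched numerically. This is a minimal illustration, not the code behind the replotted figure: the function names are hypothetical, and the MTTF estimate 1 / (λN) is an assumption that follows from treating failures as exponentially distributed.

```python
import math

def system_reliability(r, n):
    """Probability that n independent components all operate for one hour."""
    return r ** n

def system_mttf_hours(r, n):
    """Approximate system MTTF in hours.

    Assumption: failures are exponentially distributed, so that
    R ~ exp(-lam * n) with lam = 1 - r, and the MTTF is 1 / (lam * n).
    """
    lam = 1.0 - r
    return 1.0 / (lam * n)

if __name__ == "__main__":
    for r in (0.9999, 0.99999, 0.999999):
        for n in (100, 1_000, 10_000, 100_000):
            print(f"r={r}  N={n:>6}  R={system_reliability(r, n):.4f}  "
                  f"MTTF={system_mttf_hours(r, n):.2f} h")
```

Under these assumptions the system MTTF shrinks in proportion to the number of components, whatever the reliability of a single component.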

Fault Tolerance Techniques

General Techniques
Replication
Rollback Recovery
  Coordinated Checkpointing
  Uncoordinated Checkpointing & Message Logging
  Hierarchical Checkpointing

Application-Specific Techniques
Algorithm Based Fault Tolerance (ABFT)
Iterative Convergence

Coordinated Checkpointing and Rollback Recovery

[Figure: timeline of processes P0, P1 and P2 exchanging messages m1 to m5]

Coordinated checkpoints over all processes
Global restart after a failure

(+) General technique (we assume preemptive checkpointing capability)
(-) All processors need to roll back
(-) All memory needs to be saved

Algorithm-Based Fault Tolerance

[Figure: an Operation transforms A into B; applied to A extended with its checksum C = Cksum(A), the same Operation produces B extended with its checksum C' = Cksum(B)]

Principle of ABFT
Input Data (A) and Result (B) are distributed
Operation preserves Checksum properties
Apply the operation on Data + Checksum (A, C)
In case of failure, recover the missing data by inversion of the checksum
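The principle can be illustrated with a small sketch. It assumes one particular operation (right-multiplication by a matrix M) and one particular checksum (a row holding the column sums); these choices, and every function name, are illustrative assumptions and not the ABFT factorization used in the talk.

```python
def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def with_checksum(a):
    """Append a checksum row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def recover_row(bc, lost):
    """Rebuild row `lost` of the data part of bc from the checksum row.

    The last row of bc is the checksum; the missing row is the checksum
    minus the sum of all surviving data rows (inversion of the checksum).
    Assumes at least two data rows.
    """
    n = len(bc) - 1
    survivors = [bc[i] for i in range(n) if i != lost]
    return [c - sum(col) for c, col in zip(bc[n], zip(*survivors))]

A = [[1, 2], [3, 4], [5, 6]]
M = [[2, 0], [1, 3]]

B = matmul(A, M)                   # result without protection
BC = matmul(with_checksum(A), M)   # same operation on data + checksum

# The operation preserves the checksum: the last row of BC is Cksum(B).
assert BC[:-1] == B
assert BC[-1] == [sum(col) for col in zip(*B)]

# Pretend the process holding row 1 of B failed, then rebuild that row.
recovered = recover_row(BC, lost=1)
```

The recovery touches only the surviving rows and the checksum, so no rollback of the other rows is needed in this toy setting.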

Typical Application

    for( aninsanenumber ) {
        /* Extract data from
         * simulation, fill up
         * matrix */
        sim2mat();
        /* Factorize matrix,
         * Solve */
        dgeqrf();
        dsolve();
        /* Update simulation
         * with result vector */
        vec2sim();
    }

[Figure: timeline of Process 0, Process 1 and Process 2, each alternating between Application code (GENERAL Phase) and Library calls (LIBRARY Phase)]

Characteristics
(+) Large part of (total) computation spent in factorization/solve
Between LA operations:
(-) use resulting vector / matrix with operations that do not preserve data checksums
(-) modify data not covered by ABFT algorithms

Goodbye ABFT?!

Problem Statement

How to use fault tolerant operations (*) within a non-fault tolerant (**) application? (***)

(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT
(***) And keep the application globally fault tolerant...


ABFT&PeriodicCkpt: no failure

[Figure: timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with Periodic Checkpoint and Split Forced Checkpoints]

ABFT&PeriodicCkpt: failure during Library phase

[Figure: timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with Failure (during LIBRARY), Rollback (partial) Recovery and ABFT Recovery]

ABFT&PeriodicCkpt: failure during General phase

[Figure: timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with Failure (during GENERAL) and Rollback (full) Recovery]

ABFT&PeriodicCkpt: Optimizations

[Figure: ABFT&PERIODICCKPT timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with the GENERAL Checkpoint Interval]

If the duration of the General phase is too small: don’t add checkpoints
If the duration of the Library phase is too small: don’t do ABFT recovery, remain in General mode
  this assumes a performance model for the library call
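The two optimizations can be read as a small decision rule applied at the start of each epoch. The sketch below is only one possible reading: `plan_epoch` and the threshold `min_library` are hypothetical names, "too small" for the General phase is taken to mean shorter than one checkpoint period, and the predicted library duration stands in for the performance model the slide mentions.

```python
def plan_epoch(t_general, t_library, p_general, min_library):
    """Pick the protection for one epoch (all durations in the same unit).

    t_library is the duration predicted by a performance model of the
    library call. min_library is a hypothetical threshold below which
    ABFT is not activated; the slides only say "too small".
    """
    use_abft = t_library >= min_library
    if not use_abft:
        # Remain in General mode: the library call is protected like
        # General code, so it extends the General phase.
        t_general += t_library
    # A General phase shorter than one checkpoint period gets no
    # periodic checkpoint.
    periodic_checkpoints = t_general >= p_general
    return {"abft": use_abft, "periodic_checkpoints": periodic_checkpoints}

long_epoch = plan_epoch(t_general=2016, t_library=8064, p_general=46.7, min_library=60)
short_epoch = plan_epoch(t_general=20, t_library=5, p_general=46.7, min_library=60)
```

Here `long_epoch` enables both mechanisms while `short_epoch` enables neither; the numeric thresholds are made up for the example.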


A few notations

[Figure: timeline of Process 0, Process 1 and Process 2 over one epoch of duration T0, marking the General phase duration TG, its checkpointing period PG, and the Library phase duration TL]

Times, Periods
$T_0$: Duration of an Epoch (without FT)
$T_L = \alpha T_0$: Time spent in the Library phase
$T_G = (1 - \alpha) T_0$: Time spent in the General phase
$P_G$: Periodic Checkpointing Period
$T^{ff}$, $T_G^{ff}$, $T_L^{ff}$: “Fault Free” times
$t_G^{lost}$, $t_L^{lost}$: Lost time (recovery overheads)
$T_G^{final}$, $T_L^{final}$: Total times (with faults)

A few notations

[Figure: timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with the checkpoint costs]

Costs
$C_L = \rho C$: time to take a checkpoint of the Library data set
$C_{\bar{L}} = (1 - \rho) C$: time to take a checkpoint of the General data set
$R$, $R_{\bar{L}}$: time to load a full / General data set checkpoint
$D$: down time (time to allocate a new machine / reboot)
$Recons_{ABFT}$: time to apply the ABFT recovery
$\phi$: Slowdown factor on the Library phase, when applying ABFT

General phase, fault free waste

[Figure: General phase on the timeline of Process 0, Process 1 and Process 2, annotated with Periodic Checkpoint and Split Forced Checkpoints]

Without Failures
$T_G^{ff} = \begin{cases} T_G + C_{\bar{L}} & \text{if } T_G < P_G \\ \frac{T_G}{P_G - C} \times P_G & \text{if } T_G \geq P_G \end{cases}$

Library phase, fault free waste

[Figure: Library phase on the timeline of Process 0, Process 1 and Process 2, annotated with Periodic Checkpoint and Split Forced Checkpoints]

Without Failures
$T_L^{ff} = \phi \times T_L + C_L$
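The two fault-free formulas translate directly into code. The sketch below is a plain transcription under the notation of the slides; the function names are hypothetical, and the numeric values in the usage lines are illustrative (one week expressed in minutes, α = 0.8, ρ = 0.8, C = 10 minutes, φ = 1.03).

```python
def general_fault_free(t_g, p_g, c, c_lbar):
    """T_G^ff: duration of the General phase with checkpoints, no failure."""
    if t_g < p_g:
        # Phase shorter than one period: only the forced checkpoint of
        # the General data set is paid.
        return t_g + c_lbar
    # Otherwise each period of length p_g holds p_g - c of useful work.
    return t_g / (p_g - c) * p_g

def library_fault_free(t_l, phi, c_l):
    """T_L^ff: ABFT slowdown on the Library phase plus one forced checkpoint."""
    return phi * t_l + c_l

t0, alpha, rho, c, phi = 7 * 24 * 60, 0.8, 0.8, 10.0, 1.03
t_l, t_g = alpha * t0, (1 - alpha) * t0
t_l_ff = library_fault_free(t_l, phi, rho * c)
t_g_ff = general_fault_free(t_g, p_g=50.0, c=c, c_lbar=(1 - rho) * c)
```

The period `p_g=50.0` is an arbitrary placeholder for the example, not a value taken from the talk.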

General phase, failure overhead

[Figure: General phase on the timeline of Process 0, Process 1 and Process 2, annotated with Failure (during GENERAL) and Rollback (full) Recovery]

Failure Overhead
$t_G^{lost} = \begin{cases} D + R + \frac{T_G^{ff}}{2} & \text{if } T_G < P_G \\ D + R + \frac{P_G}{2} & \text{if } T_G \geq P_G \end{cases}$

Library phase, failure overhead

[Figure: Library phase on the timeline of Process 0, Process 1 and Process 2, annotated with Failure (during LIBRARY), ABFT Recovery and Rollback (partial) Recovery]

Failure Overhead
$t_L^{lost} = D + R_{\bar{L}} + Recons_{ABFT}$
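The time lost per failure in each phase can be transcribed the same way. This is a sketch with hypothetical function names; in the short General phase case it uses T_G^ff = T_G + C_L̄, as given on the fault-free slide.

```python
def general_lost(t_g, p_g, d, r, c_lbar):
    """t_G^lost: expected time lost to one failure in the General phase."""
    if t_g < p_g:
        # No periodic checkpoint: on average half of the fault-free
        # phase (T_G + C_Lbar) is re-executed.
        return d + r + (t_g + c_lbar) / 2
    # With periodic checkpoints, half a period is lost on average.
    return d + r + p_g / 2

def library_lost(d, r_lbar, recons_abft):
    """t_L^lost: downtime, reload of the General data set, ABFT reconstruction."""
    return d + r_lbar + recons_abft

short_phase = general_lost(t_g=30, p_g=50, d=1, r=10, c_lbar=2)
long_phase = general_lost(t_g=100, p_g=50, d=1, r=10, c_lbar=2)
library = library_lost(d=1, r_lbar=2, recons_abft=2)
```

Note that the Library phase loss does not depend on how much of the phase had already executed, which is where the ABFT recovery pays off.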

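Overall

Time (with overheads) of Library phase is constant (in $P_G$):
$T_L^{final} = \frac{1}{1 - \frac{D + R_{\bar{L}} + Recons_{ABFT}}{\mu}} \times (\phi \times T_L + C_L)$

Time (with overheads) of General phase accepts two cases:
$T_G^{final} = \begin{cases} \frac{1}{1 - \frac{D + R + \frac{T_G + C_{\bar{L}}}{2}}{\mu}} \times (T_G + C_{\bar{L}}) & \text{if } T_G < P_G \\ \frac{T_G}{\left(1 - \frac{C}{P_G}\right)\left(1 - \frac{D + R + \frac{P_G}{2}}{\mu}\right)} & \text{if } T_G \geq P_G \end{cases}$

Which is minimal in the second case, if
$P_G = \sqrt{2C(\mu - D - R)}$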
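The claim about the minimizing period can be checked numerically. The sketch below evaluates the second case of the General phase time around the stated optimum; the function names are hypothetical and the parameter values (µ = 120, C = R = 10, D = 1, in minutes) are illustrative.

```python
import math

def optimal_period(c, mu, d, r):
    """Period stated on the slide: sqrt(2 C (mu - D - R))."""
    return math.sqrt(2 * c * (mu - d - r))

def general_final(t_g, p_g, c, mu, d, r):
    """T_G^final for the case T_G >= P_G."""
    return t_g / ((1 - c / p_g) * (1 - (d + r + p_g / 2) / mu))

mu, c, r, d, t_g = 120.0, 10.0, 10.0, 1.0, 2016.0
p_opt = optimal_period(c, mu, d, r)
best = general_final(t_g, p_opt, c, mu, d, r)

# Evaluate a few periods on both sides of the optimum.
nearby = [general_final(t_g, p_opt * k, c, mu, d, r) for k in (0.8, 0.9, 1.1, 1.2)]
```

Every value in `nearby` is larger than `best`, which is consistent with the closed form obtained by setting the derivative of the denominator to zero.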

Waste

From the previous, we derive the waste, which is obtained by
$Waste = 1 - \frac{T_0}{T_G^{final} + T_L^{final}}$
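Putting the pieces together gives a compact waste estimator for the composite protocol. This is a sketch, not the authors' simulator: it assumes all durations are in minutes, that ReconsABFT = 2 is also in minutes, that the General phase uses the optimal period, and that R_L̄ = (1 − ρ)R when it is not supplied (this last value does not appear in the slides).

```python
import math

def waste(alpha, mu, t0=7 * 24 * 60.0, c=10.0, r=10.0, d=1.0,
          rho=0.8, phi=1.03, recons_abft=2.0, r_lbar=None):
    """Waste of ABFT&PeriodicCkpt for one epoch, following the model formulas."""
    if r_lbar is None:
        r_lbar = (1 - rho) * r          # assumption, not stated in the slides
    t_l, t_g = alpha * t0, (1 - alpha) * t0
    c_l, c_lbar = rho * c, (1 - rho) * c
    p_g = math.sqrt(2 * c * (mu - d - r))

    # Library phase: fault-free time divided by (1 - lost time / MTBF).
    t_l_final = (phi * t_l + c_l) / (1 - (d + r_lbar + recons_abft) / mu)

    if t_g < p_g:
        # Short General phase: forced checkpoint only.
        t_g_ff = t_g + c_lbar
        t_g_final = t_g_ff / (1 - (d + r + t_g_ff / 2) / mu)
    else:
        # Long General phase: periodic checkpoints every p_g.
        t_g_final = t_g / ((1 - c / p_g) * (1 - (d + r + p_g / 2) / mu))

    return 1 - t0 / (t_g_final + t_l_final)

w_120 = waste(alpha=0.8, mu=120.0)
w_240 = waste(alpha=0.8, mu=240.0)
```

Both phase times share the form T^ff / (1 − t^lost / µ), so a larger MTBF lowers the waste through both terms.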


PurePeriodicCkpt

[Figure: PUREPERIODICCKPT timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with the Optimal Checkpoint Interval]

Optimization
$P_{PPC}^{opt} = \sqrt{2C(\mu - D - R)}$

BiPeriodicCkpt

[Figure: BIPERIODICCKPT timeline of Process 0, Process 1 and Process 2 alternating between Application and Library phases, annotated with the GENERAL Checkpoint Interval and the LIBRARY Checkpoint Interval]

Optimization
$P_{BPC,G}^{opt} = \sqrt{2C(\mu - D - R)}$
$P_{BPC,L}^{opt} = \sqrt{2C_L(\mu - D - R)}$
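For a feel of the magnitudes, the two BiPeriodicCkpt periods can be evaluated for sample parameters. The helper name is hypothetical and the values (µ = 120, C = R = 10, D = 1, all in minutes, ρ = 0.8) are illustrative.

```python
import math

def optimal_period(ckpt_cost, mu, d, r):
    """sqrt(2 * checkpoint cost * (mu - D - R)), as in the Optimization formulas."""
    return math.sqrt(2 * ckpt_cost * (mu - d - r))

mu, d, r, c, rho = 120.0, 1.0, 10.0, 10.0, 0.8

p_general = optimal_period(c, mu, d, r)        # full checkpoint cost C
p_library = optimal_period(rho * c, mu, d, r)  # cheaper checkpoint C_L = rho * C

print(f"GENERAL interval: {p_general:.1f} min, LIBRARY interval: {p_library:.1f} min")
```

Because C_L is smaller than C, the Library phase interval comes out shorter than the General phase interval: cheaper checkpoints can be taken more often.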


Model & Simulations: PurePeriodicCkpt

[Figure: PurePeriodicCkpt, SIMULATION and MODEL panels. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]

Model & Simulations: BiPeriodicCkpt

[Figure: BiPeriodicCkpt, SIMULATION and MODEL panels. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]

Model & Simulations: ABFT&PeriodicCkpt

[Figure: ABFT&PeriodicCkpt, SIMULATION and MODEL panels. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]

Model: PurePeriodicCkpt vs. BiPeriodicCkpt

[Figure: two panels, PurePeriodicCkpt and BiPeriodicCkpt. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]

Model & Simulations: PurePeriodicCkpt vs. ABFT&PeriodicCkpt

[Figure: two panels, PurePeriodicCkpt and ABFT&PeriodicCkpt. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]

Model & Simulations: BiPeriodicCkpt vs. ABFT&PeriodicCkpt

[Figure: two panels, BiPeriodicCkpt and ABFT&PeriodicCkpt. x-axis: MTBF system (minutes), 60 to 240. y-axis: Ratio of time spent in Library Phase (α), 0 to 1. Parameters: T0=1w, C=R=10min, D=1min, ρ=0.8, φ=1.03, ReconsABFT=2]


Toward Exascale, and Beyond!

Let’s think at scale
Number of components ↗ ⇒ MTBF ↘
Number of components ↗ ⇒ Problem Size ↗
Problem Size ↗ ⇒ Computation Time spent in Library phase ↗

(+) ABFT&PeriodicCkpt should perform better with scale
(-) By how much?

Weak Scale Scenario #1
Number of components, $x$, increases
Memory per component $M_{ind}$ remains constant
Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
$\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
$C$ (= $R$) at $x = 10^5$ is 1 minute, is in $O(x)$
$\alpha$ is constant at 0.8, as is $\rho$ (both Library and General phase increase in time at the same speed)
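The scaling rules of this scenario can be written as functions of the node count x. This is a minimal sketch with hypothetical names; it anchors both laws at the reference point x = 10^5 given above, and reuses D = 1 minute from the earlier evaluation parameters as an assumption.

```python
import math

REF_NODES = 100_000  # x = 10^5, the reference point of the scenario

def mtbf_minutes(x):
    """System MTBF: 1 day at the reference size, decreasing in O(1/x)."""
    return 24 * 60 * REF_NODES / x

def checkpoint_minutes(x):
    """Checkpoint and restart cost C = R: 1 minute at the reference size, in O(x)."""
    return 1.0 * x / REF_NODES

def optimal_period(x, d=1.0):
    """sqrt(2 C (mu - D - R)) with C = R and mu taken at x nodes."""
    c = checkpoint_minutes(x)
    return math.sqrt(2 * c * (mtbf_minutes(x) - d - c))

for x in (1_000, 10_000, 100_000, 1_000_000):
    print(f"x={x:>9}  mu={mtbf_minutes(x):>9.1f} min  "
          f"C=R={checkpoint_minutes(x):>5.2f} min  P={optimal_period(x):.1f} min")
```

At one million nodes these rules give an MTBF of 144 minutes against a 10 minute checkpoint, which is the regime where periodic checkpointing alone becomes expensive.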

[Figure: Weak Scale #1. x-axis: Nodes, 1k to 1M. Top panel: # Faults, 0 to 40 (Nb Faults PeriodicCkpt, Nb Faults Bi-PeriodicCkpt, Nb Faults ABFT PeriodicCkpt). Bottom panel: Waste, 0 to 0.4 (PeriodicCkpt, Bi-PeriodicCkpt, ABFT PeriodicCkpt)]

Weak Scale Scenario #2
Number of components, $x$, increases
Memory per component $M_{ind}$ remains constant
Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
$\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
$C$ (= $R$) at $x = 10^5$ is 1 minute, is in $O(x)$
$\rho$ remains constant at 0.8, but the Library phase is in $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $x = 10^5$ nodes)

[Figure: Weak Scale #2. x-axis: Nodes, 1k to 1M. Top panel: # Faults, 0 to 40 (Nb Faults PeriodicCkpt, Nb Faults Bi-PeriodicCkpt, Nb Faults ABFT PeriodicCkpt). Bottom panel: Waste, 0 to 0.4 (PeriodicCkpt, Bi-PeriodicCkpt, ABFT PeriodicCkpt), with a second axis, Ratio of time spent in the ABFT routine, 0 to 1 (ABFT Ratio)]

Weak Scale Scenario #3
Number of components, $x$, increases
Memory per component $M_{ind}$ remains constant
Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
$\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
$C$ (= $R$) at $x = 10^5$ is 1 minute, stays independent of $x$ ($O(1)$)
$\rho$ remains constant at 0.8, but the Library phase is in $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $x = 10^5$ nodes)

[Figure: Weak Scale #3. x-axis: Nodes, with the corresponding α: 1k (α = 0.55), 10k (α = 0.8), 100k (α = 0.92), 1M (α = 0.975). Top panel: # Faults, 0 to 6 (Nb Faults PeriodicCkpt, Nb Faults Bi-PeriodicCkpt, Nb Faults ABFT PeriodicCkpt). Bottom panel: Waste, 0 to 0.4 (PeriodicCkpt, Bi-PeriodicCkpt, ABFT PeriodicCkpt)]
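The growth of α with scale follows from the two complexity laws: with the Library phase in O(n^3), the General phase in O(n^2) and n in O(√x), the ratio T_L / T_G grows like √x. The sketch below is an illustration with hypothetical names. The anchor point is left as a parameter: the α values annotated on the Nodes axis of the Scenario #3 plot (0.55, 0.8, 0.92, 0.975) appear to match an anchor of α = 0.8 at 10k nodes, whereas the scenario text states x = 10^5.

```python
import math

def alpha_at(x, alpha_ref=0.8, x_ref=10_000):
    """Fraction of time spent in the Library phase at x nodes.

    T_L / T_G is proportional to n, hence to sqrt(x); the proportionality
    constant is fixed by requiring alpha = alpha_ref at x = x_ref.
    """
    ratio_ref = alpha_ref / (1 - alpha_ref)       # T_L / T_G at the anchor
    ratio = ratio_ref * math.sqrt(x / x_ref)      # scaled to x nodes
    return ratio / (1 + ratio)

for x in (1_000, 10_000, 100_000, 1_000_000):
    print(f"x={x:>9}  alpha={alpha_at(x):.3f}")
```

Changing `x_ref` to 100_000 gives the values implied by the scenario text instead of those printed on the plot axis.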


Conclusion

Method of composing fault tolerance approaches
  applications that alternate between ABFT-aware and ABFT-unaware sections
  each section is protected by its own mechanism

Performance model shows good opportunity for scaling
  even when checkpointing hypothesis is optimistic
  composite approach benefits from checkpointing improvements too

Energy Efficiency?
Checkpointing on Buddies? Checksumming?
Better techniques to recover the ABFT-protected data in some cases.