Assessing the impact of ABFT & Checkpoint composite strategies
George Bosilca1, Aurélien Bouteiller1, Thomas Hérault1, Yves Robert1,2 and Jack Dongarra1
1. University of Tennessee Knoxville, USA
2. École Normale Supérieure de Lyon & INRIA, France
April 25, 2014 - ICL Lunch talk
Outline
1. Motivation
2. ABFT&PeriodicCkpt
3. Performance Modeling
4. Periodic Checkpointing Protocols (for comparison)
5. Evaluation
   - As a function of α (% in library) and µ (MTBF)
   - Weak Scaling
6. Conclusion
Faults: a reality?
Optimists
- New technologies (graphene) solve the heat and voltage issues, making faults extremely unlikely
- Progress in the fabrication process will make the hardware much more reliable
- Simple methods fix the problem directly at the hardware level: unit duplication, checksums (ECC)
Skeptics
- How many “faults” do you see on your daily working tool (laptop)? Why would it be different with an exascale tool?
- That they could have done it before and nobody did does not mean that nobody will do it in the future (!)
- Do you have life insurance?
Faults: with big numbers comes big responsibility
- Assume independent failures
- Let $r$ be the probability that a component operates for 1 hour
- Let $N$ be the number of components (“System Size”)
- Let $R$ be the probability that the system operates for 1 hour:
  $R = r^N$, $R \approx \frac{1}{e^{\lambda N}}$, with $\lambda = 1 - r$
[Figure: MTTF (hours) versus System Size (100 to 100000 components), for 1 hour reliability of 0.9999, 0.99999 and 0.999999. Figure from Dan Reed, “The Challenge of Complexity and Scale”, replotted.]
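As a rough illustration of how quickly reliability degrades with system size, the sketch below evaluates $R = r^N$ and a system MTTF. It assumes exponentially distributed, independent failures, so that the per-component hourly failure rate is $-\ln r$ (close to $1 - r$); the function names are illustrative and not taken from the talk.

```python
import math

def system_reliability(r, n):
    """Probability that all n independent components survive one hour."""
    return r ** n

def system_mttf_hours(r, n):
    """System MTTF in hours, assuming exponentially distributed failures."""
    failure_rate = -math.log(r)  # per-component hourly rate, close to 1 - r
    return 1.0 / (n * failure_rate)

for n in (100, 1_000, 10_000, 100_000):
    print(n, system_reliability(0.9999, n), system_mttf_hours(0.9999, n))
```

With r = 0.9999, the one-hour survival probability of a 10000-component system is already close to 1/e.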
Fault Tolerance Techniques
General Techniques
- Replication
- Rollback Recovery
  - Coordinated Checkpointing
  - Uncoordinated Checkpointing & Message Logging
  - Hierarchical Checkpointing
Application-Specific Techniques
- Algorithm-Based Fault Tolerance (ABFT)
- Iterative Convergence
Coordinated Checkpointing and Rollback Recovery
[Figure: timeline of processes P0, P1 and P2 exchanging messages m1 to m5]
- Coordinated checkpoints over all processes
- Global restart after a failure
(+) General technique (we assume preemptive checkpointing capability)
(-) All processors need to roll back
(-) All memory needs to be saved
Algorithm-Based Fault Tolerance
[Figure: an Operation maps input A, extended with C = Cksum(A), to result B, extended with C' = Cksum(B)]
Principle of ABFT
- Input data (A) and result (B) are distributed
- The operation preserves checksum properties
- Apply the operation on data + checksum ($A_C$)
- In case of failure, recover the missing data by inversion of the checksum
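To make the checksum principle concrete, here is a minimal sketch for one operation that preserves checksums, a matrix product. The choice of operation, the column-sum checksum and all function names are assumptions for illustration; the talk does not prescribe them. A checksum row appended to A is carried through the product, and a lost row of the result is rebuilt by subtracting the surviving rows from the checksum row.

```python
def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_checksum_row(a):
    """Append a row holding the column-wise sums of a."""
    return a + [[sum(col) for col in zip(*a)]]

def recover_row(c_ext, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows."""
    data, checksum = c_ext[:-1], c_ext[-1]
    survivors = [row for i, row in enumerate(data) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
result = matmul(with_checksum_row(A), B)  # last row is the checksum of A times B
print(recover_row(result, 1))             # rebuilds row 1 without reading it
```

The recovery never reads the lost row, which mirrors the case where the process holding it has failed.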
Typical Application

    for (aninsanenumber) {
        /* Extract data from
         * simulation, fill up
         * matrix */
        sim2mat();
        /* Factorize matrix,
         * Solve */
        dgeqrf();
        dsolve();
        /* Update simulation
         * with result vector */
        vec2sim();
    }

[Figure: processes 0 to 2 alternating between GENERAL phases (Application) and LIBRARY phases (Library)]
Characteristics
(+) Large part of (total) computation spent in factorization/solve
Between LA operations:
(-) use the resulting vector / matrix with operations that do not preserve data checksums
(-) modify data not covered by ABFT algorithms
Goodbye ABFT?!
Problem Statement
How to use fault tolerant operations (*) within a non-fault tolerant (**) application? (***)
(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT
(***) And keep the application globally fault tolerant...
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: no failure
[Figure: processes 0 to 2 alternating Application and Library phases, with Periodic Checkpoints and Split Forced Checkpoints]
ABFT&PeriodicCkpt: failure during Library phase
[Figure: Failure (during LIBRARY), followed by Rollback (partial) Recovery and ABFT Recovery]
ABFT&PeriodicCkpt: failure during General phase
[Figure: Failure (during GENERAL), followed by Rollback (full) Recovery]
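The three cases above can be summarized as a small dispatch on the phase in which the failure strikes. The sketch below is only one reading of the slides: the names `Phase`, `forced_checkpoint` and `recovery_plan` are hypothetical, and the assignment of each split forced checkpoint to a phase switch is an assumption drawn from the cost terms of the performance model.

```python
from enum import Enum

class Phase(Enum):
    GENERAL = "GENERAL"
    LIBRARY = "LIBRARY"

def forced_checkpoint(entering):
    """Data set saved by the split forced checkpoint at a phase switch (assumed)."""
    if entering is Phase.LIBRARY:
        return "General data set (cost C_Lbar)"
    return "Library data set (cost C_L)"

def recovery_plan(failed_in):
    """Ordered recovery steps after a failure in the given phase."""
    if failed_in is Phase.LIBRARY:
        return [
            "downtime D",
            "partial rollback: reload General data set (R_Lbar)",
            "ABFT recovery of Library data set (Recons_ABFT)",
        ]
    return [
        "downtime D",
        "full rollback: reload last checkpoint (R)",
        "re-execute work lost since that checkpoint",
    ]

print(recovery_plan(Phase.LIBRARY))
```

Only a failure in the Library phase avoids re-executing lost work, which is where the composite approach is expected to gain over pure checkpointing.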
ABFT&PeriodicCkpt: Optimizations
[Figure: ABFT&PERIODICCKPT timeline for processes 0 to 2, alternating Application and Library phases, with the GENERAL Checkpoint Interval]
- If the duration of the General phase is too small: don’t add checkpoints
- If the duration of the Library phase is too small: don’t do ABFT recovery, remain in General mode
  - this assumes a performance model for the library call
A few notations
[Figure: one epoch of duration $T_0$ on processes 0 to 2, split into a General phase ($T_G$, checkpointed with period $P_G$) and a Library phase ($T_L$)]
Times, Periods
- $T_0$: duration of an epoch (without FT)
- $T_L = \alpha T_0$: time spent in the Library phase
- $T_G = (1 - \alpha) T_0$: time spent in the General phase
- $P_G$: periodic checkpointing period
- $T^{ff}$, $T_G^{ff}$, $T_L^{ff}$: “fault free” times
- $t_G^{lost}$, $t_L^{lost}$: lost time (recovery overheads)
- $T_G^{final}$, $T_L^{final}$: total times (with faults)
A few notations
[Figure: checkpoints of cost $C$, $C_L$ and $C_{\bar{L}}$ on the timeline of processes 0 to 2]
Costs
- $C_L = \rho C$: time to take a checkpoint of the Library data set
- $C_{\bar{L}} = (1 - \rho) C$: time to take a checkpoint of the General data set
- $R$, $R_{\bar{L}}$: time to load a full / General data set checkpoint
- $D$: down time (time to allocate a new machine / reboot)
- $Recons_{ABFT}$: time to apply the ABFT recovery
- $\phi$: slowdown factor on the Library phase, when applying ABFT
General phase, fault free waste
[Figure: General phase on processes 0 to 2, with Periodic Checkpoints and Split Forced Checkpoints]
Without Failures
$$T_G^{ff} = \begin{cases} T_G + C_{\bar{L}} & \text{if } T_G < P_G \\ \dfrac{T_G}{P_G - C} \times P_G & \text{if } T_G \geq P_G \end{cases}$$
Library phase, fault free waste
[Figure: Library phase on processes 0 to 2, with Periodic Checkpoints and Split Forced Checkpoints]
Without Failures
$$T_L^{ff} = \phi \times T_L + C_L$$
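The two fault-free times can be evaluated directly from the notations. The sketch below is a plain transcription of the formulas under the stated definitions $C_L = \rho C$ and $C_{\bar{L}} = (1 - \rho) C$; the function and parameter names are illustrative, and all times must share one unit.

```python
def general_fault_free(t_g, p_g, c, rho):
    """Fault-free duration of the General phase."""
    c_lbar = (1 - rho) * c           # forced checkpoint of the General data set
    if t_g < p_g:
        return t_g + c_lbar          # too short for a periodic checkpoint
    return t_g / (p_g - c) * p_g     # each period holds P_G - C of useful work

def library_fault_free(t_l, c, rho, phi):
    """Fault-free duration of the Library phase under ABFT."""
    return phi * t_l + rho * c       # ABFT slowdown plus forced checkpoint C_L

print(general_fault_free(600, 60, 10, 0.8))    # long General phase
print(library_fault_free(100, 10, 0.8, 1.03))  # Library phase
```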
General phase, failure overhead
[Figure: Failure (during GENERAL) on processes 0 to 2, followed by Rollback (full) Recovery]
Failure Overhead
$$t_G^{lost} = \begin{cases} D + R + \dfrac{T_G^{ff}}{2} & \text{if } T_G < P_G \\ D + R + \dfrac{P_G}{2} & \text{if } T_G \geq P_G \end{cases}$$
Library phase, failure overhead
[Figure: Failure (during LIBRARY) on processes 0 to 2, followed by Rollback (partial) Recovery and ABFT Recovery]
Failure Overhead
$$t_L^{lost} = D + R_{\bar{L}} + Recons_{ABFT}$$
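The step from lost times to total times is not spelled out. One hedged reading is a first-order argument: assuming failures arrive with mean time $\mu$ between them, a phase of total length $T^{final}$ suffers about $T^{final}/\mu$ failures, each costing $t^{lost}$. Solving for $T^{final}$ gives the shape of the overall times used for both phases.

```latex
% Assumption: about T^{final}/\mu failures hit a phase of total length T^{final},
% and each of them costs t^{lost}.
\begin{align*}
T^{final} &= T^{ff} + \frac{T^{final}}{\mu}\, t^{lost} \\
\Longrightarrow \quad T^{final} &= \frac{T^{ff}}{1 - \dfrac{t^{lost}}{\mu}}
\end{align*}
% Library phase: T^{ff} = \phi T_L + C_L and t^{lost} = D + R_{\bar{L}} + Recons_{ABFT}.
```

The expression is only meaningful while $t^{lost} < \mu$, that is, while recovering from a failure takes less time than the mean time between failures.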
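Overall
Time (with overheads) of the Library phase is constant (in $P_G$):
$$T_L^{final} = \frac{1}{1 - \dfrac{D + R_{\bar{L}} + Recons_{ABFT}}{\mu}} \times (\phi \times T_L + C_L)$$
Time (with overheads) of the General phase accepts two cases:
$$T_G^{final} = \begin{cases} \dfrac{1}{1 - \dfrac{D + R + \frac{T_G + C_{\bar{L}}}{2}}{\mu}} \times (T_G + C_{\bar{L}}) & \text{if } T_G < P_G \\ \dfrac{T_G}{\left(1 - \dfrac{C}{P_G}\right)\left(1 - \dfrac{D + R + \frac{P_G}{2}}{\mu}\right)} & \text{if } T_G \geq P_G \end{cases}$$
Which is minimal in the second case if
$$P_G = \sqrt{2C(\mu - D - R)}$$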
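The optimal period follows from maximizing the denominator of the second case. The short derivation below is a sketch of that step, assuming $\mu > D + R$; the auxiliary function $g$ is introduced here for convenience and is not part of the slides.

```latex
% g is proportional to the denominator of T_G^{final} in the case T_G >= P_G.
\begin{align*}
g(P_G) &= \left(1 - \frac{C}{P_G}\right)\left(\mu - D - R - \frac{P_G}{2}\right) \\
g'(P_G) &= \frac{C(\mu - D - R)}{P_G^{2}} - \frac{1}{2} = 0 \\
\Longrightarrow \quad P_G &= \sqrt{2C(\mu - D - R)}
\end{align*}
% g''(P_G) = -2C(\mu - D - R)/P_G^3 < 0, so this stationary point is a maximum of g,
% hence a minimum of T_G^{final}.
```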
Waste
From the previous, we derive the waste, which is obtained by
$$Waste = 1 - \frac{T_0}{T_G^{final} + T_L^{final}}$$
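Putting the pieces together, the waste of ABFT&PeriodicCkpt can be computed from the model parameters. The sketch below is a transcription under assumptions: the function name is illustrative, $R_{\bar{L}}$ defaults to $(1 - \rho) R$ by analogy with $C_{\bar{L}}$ (the slides do not give its value), the Library phase uses the slowdown factor $\phi$ as in its fault-free time, and all times are in minutes.

```python
import math

def abft_periodic_waste(t0, alpha, mu, c, r, d, rho, phi, recons, r_lbar=None):
    """Waste of ABFT&PeriodicCkpt; requires mu > d + r."""
    c_l, c_lbar = rho * c, (1 - rho) * c
    if r_lbar is None:
        r_lbar = (1 - rho) * r                 # assumption, not from the slides
    t_l, t_g = alpha * t0, (1 - alpha) * t0
    p_g = math.sqrt(2 * c * (mu - d - r))      # optimal General checkpoint period
    t_l_final = (phi * t_l + c_l) / (1 - (d + r_lbar + recons) / mu)
    if t_g < p_g:
        t_g_ff = t_g + c_lbar
        t_g_final = t_g_ff / (1 - (d + r + t_g_ff / 2) / mu)
    else:
        t_g_final = t_g / ((1 - c / p_g) * (1 - (d + r + p_g / 2) / mu))
    return 1 - t0 / (t_g_final + t_l_final)

week = 7 * 24 * 60  # T0 = 1 week, in minutes
waste = abft_periodic_waste(week, 0.8, 120, 10, 10, 1, 0.8, 1.03, 2)
print(round(waste, 3))
```

The call uses the parameter values quoted on the evaluation plots (C = R = 10 min, D = 1 min, ρ = 0.8, φ = 1.03), with ReconsABFT = 2 interpreted as minutes, for α = 0.8 and a platform MTBF of 120 minutes.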
PurePeriodicCkpt
[Figure: PUREPERIODICCKPT timeline for processes 0 to 2, alternating Application and Library phases, with the Optimal Checkpoint Interval]
Optimization
$$P_{PC}^{opt} = \sqrt{2C(\mu - D - R)}$$
BiPeriodicCkpt
[Figure: BIPERIODICCKPT timeline for processes 0 to 2, with a GENERAL Checkpoint Interval and a LIBRARY Checkpoint Interval]
Optimization
$$P_{BPC,G}^{opt} = \sqrt{2C(\mu - D - R)}$$
$$P_{BPC,L}^{opt} = \sqrt{2C_L(\mu - D - R)}$$
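Both comparison protocols use the same period formula and differ only in the checkpoint cost they plug in. A minimal sketch, with an illustrative function name and times in minutes:

```python
import math

def optimal_period(ckpt_cost, mu, d, r):
    """Checkpoint period sqrt(2 * cost * (mu - D - R)); requires mu > D + R."""
    if mu <= d + r:
        raise ValueError("MTBF must exceed downtime plus recovery time")
    return math.sqrt(2 * ckpt_cost * (mu - d - r))

c, rho, mu, d, r = 10, 0.8, 120, 1, 10
p_general = optimal_period(c, mu, d, r)        # PurePeriodicCkpt, BiPeriodicCkpt General phase
p_library = optimal_period(rho * c, mu, d, r)  # BiPeriodicCkpt Library phase, C_L = rho * C
print(round(p_general, 2), round(p_library, 2))
```

Because $C_L = \rho C$ is smaller than $C$, the Library phase of BiPeriodicCkpt checkpoints more often than its General phase.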
Parameters for all plots: T0 = 1 week, C = R = 10 min, D = 1 min, ρ = 0.8, φ = 1.03, ReconsABFT = 2
Axes for all plots: MTBF system (minutes), from 60 to 240; Ratio of time spent in Library Phase (α), from 0 to 1
[Figure: Model & Simulations: PurePeriodicCkpt. Panels: SIMULATION, MODEL]
[Figure: Model & Simulations: BiPeriodicCkpt. Panels: SIMULATION, MODEL]
[Figure: Model & Simulations: ABFT&PeriodicCkpt. Panels: SIMULATION, MODEL]
[Figure: Model: PurePeriodicCkpt vs. BiPeriodicCkpt. Panels: PurePeriodicCkpt, BiPeriodicCkpt]
[Figure: Model & Simulations: PurePeriodicCkpt vs. ABFT&PeriodicCkpt. Panels: PurePeriodicCkpt, ABFT&PeriodicCkpt]
[Figure: Model & Simulations: BiPeriodicCkpt vs. ABFT&PeriodicCkpt. Panels: BiPeriodicCkpt, ABFT&PeriodicCkpt]
Toward Exascale, and Beyond!
Let’s think at scale
- Number of components ↗ ⇒ MTBF ↘
- Number of components ↗ ⇒ Problem Size ↗
- Problem Size ↗ ⇒ Computation Time spent in Library phase ↗
(+) ABFT&PeriodicCkpt should perform better with scale
(-) By how much?
Weak Scale #1
Weak Scale Scenario #1
- Number of components, $x$, increases
- Memory per component $M_{ind}$ remains constant
- Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
- $\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
- $C$ ($= R$) at $x = 10^5$ is 1 minute, is in $O(x)$
- $\alpha$ is constant at 0.8, as is $\rho$ (both Library and General phases increase in time at the same speed)
[Figure: Weak Scale #1. Number of faults (top) and Waste (bottom) versus Nodes (1k to 1M), for PeriodicCkpt, Bi-PeriodicCkpt and ABFT PeriodicCkpt]
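The platform parameters of the weak scaling scenarios can be extrapolated from their values at $x = 10^5$. The sketch below assumes exact proportionality where the slides only give $O(\cdot)$ orders; the function name and the `ckpt_scales` switch are illustrative, and times are in minutes.

```python
def scale_platform(x, x0=1e5, mu0=1440.0, c0=1.0, ckpt_scales=True):
    """Return (MTBF, checkpoint cost) at x components, given values at x0."""
    mu = mu0 * x0 / x                          # MTBF is in O(1/x)
    c = c0 * x / x0 if ckpt_scales else c0     # C is in O(x), or constant
    return mu, c

for x in (1e3, 1e4, 1e5, 1e6):
    print(int(x), scale_platform(x))
```

At one million components this gives an MTBF of 144 minutes with a 10 minute checkpoint when the checkpoint cost grows with x. Passing `ckpt_scales=False` keeps C at 1 minute, which corresponds to a checkpoint cost independent of x.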
Weak Scale #2
Weak Scale Scenario #2
- Number of components, $x$, increases
- Memory per component $M_{ind}$ remains constant
- Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
- $\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
- $C$ ($= R$) at $x = 10^5$ is 1 minute, is in $O(x)$
- $\rho$ remains constant at 0.8, but the Library phase is $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $x = 10^5$ nodes)
[Figure: Weak Scale #2. Number of faults (top) and Waste (bottom) versus Nodes (1k to 1M), for PeriodicCkpt, Bi-PeriodicCkpt and ABFT PeriodicCkpt; second axis: ABFT Ratio, the ratio of time spent in the ABFT routine (0 to 1)]
Weak Scale #3
Weak Scale Scenario #3
- Number of components, $x$, increases
- Memory per component $M_{ind}$ remains constant
- Problem size $n$ increases in $O(\sqrt{x})$ (e.g. matrix, $n^2 = x M_{ind}$)
- $\mu$ at $x = 10^5$: 1 day, is in $O(\frac{1}{x})$
- $C$ ($= R$) at $x = 10^5$ is 1 minute, stays independent of $x$ ($O(1)$)
- $\rho$ remains constant at 0.8, but the Library phase is $O(n^3)$ while the General phase progresses in $O(n^2)$ ($\alpha$ is 0.8 at $x = 10^5$ nodes)
[Figure: Weak Scale #3. Number of faults (top) and Waste (bottom) versus Nodes, for PeriodicCkpt, Bi-PeriodicCkpt and ABFT PeriodicCkpt; axis ticks: 1k (α = 0.55), 10k (α = 0.8), 100k (α = 0.92), 1M (α = 0.975)]
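When the Library phase grows in $O(n^3)$ and the General phase in $O(n^2)$ with $n$ in $O(\sqrt{x})$, the ratio $T_L / T_G$ grows like $\sqrt{x}$, which fixes how $\alpha$ evolves with scale. The sketch below assumes exact proportionality and an illustrative function name; it works on the scale factor relative to the reference point where $\alpha$ is known.

```python
import math

def scale_alpha(alpha0, factor):
    """alpha after multiplying the component count by `factor`."""
    ratio0 = alpha0 / (1 - alpha0)         # T_L / T_G at the reference scale
    ratio = ratio0 * math.sqrt(factor)     # T_L / T_G grows like n, i.e. sqrt(x)
    return ratio / (1 + ratio)

for factor in (0.1, 1, 10, 100):
    print(factor, round(scale_alpha(0.8, factor), 3))
```

Starting from α = 0.8, one decade down and one and two decades up give values close to the 0.55, 0.92 and 0.975 shown on the plot axis.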
Conclusion
Method of composing fault tolerance approaches
- applications that alternate between ABFT-aware and ABFT-unaware sections
- each section is protected by its own mechanism
Performance model shows good opportunity for scaling
- even when the checkpointing hypothesis is optimistic
- the composite approach benefits from checkpointing improvements too
Energy Efficiency? Checkpointing on Buddies? Checksumming?
Better techniques to recover the ABFT-protected data in some cases.