Efficient Diagnosis of Multiprocessor Systems under Probabilistic Models*

Douglas M. Blough
Department of Electrical and Computer Engineering
University of California
Irvine, CA 92717

Gregory F. Sullivan
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218

Gerald M. Masson
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218
Abstract

In this paper, the problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. The focus of this work is on minimizing the number of tests that must be conducted in order to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm that can correctly diagnose the state of every processor with probability approaching one in a class of systems performing slightly greater than a linear number of tests is presented. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. Lower and upper bounds on the number of tests required for regular systems are also presented. A class of regular systems which includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly less than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of the diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques.

Index Terms: Algorithms, fault diagnosis, hypercube, multiprocessor systems, permanent faults, probabilistic models.

*This research was supported in part by NASA under grant 1442.
1 Introduction
Highly
parallel
of distinct tions.
computer
processing
of diagnostic
of the The from
probabilistic
diagnosis
systems
an associated
known
processors
has
probabilistically O(n _) algorithm method
has
slightly ing
less
more
an incorrect can
rect
set
fault
homogeneous p. An systems It was must
have
containing
is an avail-
evaluation are
a more
increases
the
p can
set
faulty
realistic
complexity
may
n)
the
with
in which
approaching
zero
may
log_' was
with the
of achieving
Unfortunately,
due
2
to
best
set
of a
presented
a
with
diagno-
of diagnosis
set
of occurmay
be
probability diagnoses
probability
optimal the
i.e.
to
of failure a class
of
one.
all algorithms
diagnosis flaw
cor-
applies
approaching
possible,
only
of choos-
when
utilized
correctly
an
maximum
for which
model
correct
fault
optimal
even
systems
a subtle
while
probability
be true
is is p-
I20]
probability fault
a common that
of
of faulty
system
quality
the
The
has
the
with
examined
presented
result
set
Blount
this
likely that
probability.
for c > this
likely
is correct
sets
most
same
processor
any
each
class
to p of occurring
work,
how
meaning
The
the
a given
most
which
fault
author
was
n tests,
the
con-
in which
be co-NP-complete
the
sets,
high
each
algorithm
tests.
or equal
determined
systems, fault
in which
In related
paper
examined
[4]. Unfortunately,
Hence,
be high.
in [18] that
probability
to
previously
first
systems
whether
(diagnosis model
In [18],
cnlog
claimed
addressed
authors
than
model.
exist.
other
be identified
diagnosis
been
The
The
[6] for determining
time and it was not of tests conducted.
than
has
systems
shown
diagnosable
fault can
been
given
be achieved.
o(nlog
on
processors
heterogeneous
greater
probabilistic
than
containing also
has
diagnosis
systems
efficient
focuses yields
time
of determining
optimal
probable
diagnosis
system
approach
diagnosis
diagnosable
weighted
exponential the number
slightly
as increasing
paper
same
of failure.
related
In p-probabilistically rence
the
[13] examined
problem
been
closely
in a general
sis requires varies with
diagnosis
in which
This
at
probability
The
of achieving
probability)
model
system
probability
diagnosable
in the
of applica-
fault
in [3,4,6,7,13,15,17,18,19,20].
a priori
diagnosable.
system
but
as p-probabilistically
that
uniquely
in this
probabilities.
of multiprocessor viewpoint
has
number
analysis.
problem
processor
number
automatic as well
a large
system fault diagnosis has been primarily leading to overly pessimistic assessments
presented
capability
containing
in a growing
costs
a probabilistic
a probabilistic
cerning
work
identical
of diagnostic
corresponding
utilized
multiprocessor fault scenarios,
under
and
systems
of processors,
maintenance
The
strategies
independent
assessment
number
of reducing work on worst-case
computer
are being
a large
capability.
of diagnosis with
with
method
ability. Previous concerned with
i.e.
elements,
For systems
attractive
systems,
in the
in systems proof,
this
result is untrue. This result was also used in bound
in a more
In this lower
paper,
bound
constant
in
probability that
graphs
produces
correct
class
of hypercubes.
can
It is also
all diagnosis
algorithms
class
of tests
required
is then
proven.
A class
of systems
achieved
with
systems
shown
that
perform
lower
poorly.
regular
the
problem O(n
systems
This
final
systems,
of diagnosis
result
one
in
is preclass
o(n log n) tests,
implies
forms
tests
important
possessing
weaker
diagnosis
log n)
approaching as the
in di-
A nearly
correct
conducting
in [18] as well
for regular
one
is given.
to achieve
with
a diagnosis
approaching
probability
given
Next,
of tests
Finally,
to the
is achieved
1 tests.
probability
number
be
of fixed-degree
with
number
the
flawed
A counterexample diagnosis
n -
a linear
one
contains
with
than
is considered.
diagnosis
This
portant
on the
in [18]. correct
digraphs
diagnosis
more
approaching
systems
sented.
presented in which
correct
bound
probability
which
model
[18] is given
slightly
lower
in regular
the
a similarly
model.
in a sequence of
containing
matching
probabilistic
we utilize
claimed
algorithm
with
general
[3] to prove
that
for
the
im-
must
be
in [16].
In
of diagnosis
considered.
2 Preliminaries
The
multiprocessor
system
this
model
is represented
a system
representing
processors
performed related
model
in the
by one processor to this
model
utilized
this
as a directed system
and
on another
are
in
graph
edges
processor.
defined
and
of n
processors,
paper
a measure
was with
proposed vertices
of the
of the
digraph
representing
In this
section,
all basic
of diagnosis
algorithm
digraph tests
quantities
performance
is presented.
2.1 Basic Definitions
For a system composed of $n$ processors, the set of processors is represented by the set $U = \{u_1, \ldots, u_n\}$. It is assumed that processors are capable of performing tests on one another. This situation is represented by a digraph $G(U,E)$, where each vertex in $U$ corresponds to a processor of the system and $(u,v) \in E$ if and only if processor $u$ tests processor $v$. Associated with each $(u,v) \in E$ is a test outcome. This outcome is a 1 (0) if $u$ evaluates $v$ as faulty (fault-free). A complete collection of test outcomes constitutes a syndrome. Below, the fundamental concepts of syndromes, fault sets, and other sets of processors are defined.

Definition 1 For a digraph $G(U,E)$, a syndrome is a function from $E$ to $\{0,1\}$.

Definition 2 For a digraph $G(U,E)$, a fault set is a subset of the vertex set $U$.

For a processor $u$, the tester set consists of the processors that test $u$, the failure set consists of those processors that fail $u$, and the neighbor set consists of the processors that test $u$ along with the processors that $u$ tests. These quantities are defined below.

Definition 3 For a digraph $G(U,E)$ and $u \in U$, the tester set of $u$, denoted by $\Gamma^{-1}(u)$, is given by

$\Gamma^{-1}(u) = \{v \in U : (v,u) \in E\}$

Definition 4 For a digraph $G(U,E)$, a syndrome $S$, and $u \in U$, the failure set of $u$, denoted by $\mathrm{fail}_{in}(u)$, is given by

$\mathrm{fail}_{in}(u) = \{v \in \Gamma^{-1}(u) : S((v,u)) = 1\}$

Definition 5 For a digraph $G(U,E)$ and $u \in U$, the neighbor set of $u$, denoted by $N(u)$, is given by

$N(u) = \{v \in U : (u,v) \in E \text{ or } (v,u) \in E\}$

2.2
Diagnosis
Algorithm
A fundamental
problem
sors
given
in a system
diagnosis
of the
diagnosed drome by
as faulty it is possible
set
analysis however, digraph
of diagnosis the G(U,
FaultyA(S)
Using this, characterized
of correct and
:
as the
algorithm
notion E),
used
performance.
represents
the diagnosis in Definition
the quality 6.
set
of an
in the
processors
outputs
processors a syn-
is correct Syndrome,
subsequent
probabilistic
with
this
For a syndrome
analysis S from
a
A, let
A diagnoses
u as faulty
of Algorithm algorithm
to as a
and
processors.
proceeding
be defined.
proces-
and the
algorithm
of faulty
element
algorithm
output
of faulty
Before
must
U : Algorithm
exactly
of a deterministic the
basic
diagnosis
a deterministic
{u •
with
faulty
is referred
as input
contains
for a set
the
problem
a syndrome
subset
output
output
for this
takes This
Thus,
if the
algorithm's
are therefore
FaultyA(S) Thus,
to evaluate
the
pairs
system.
algorithm.
is to identify
algorithm
algorithm
in the
by the
systems
An
A diagnosis
processors
comparing
fault
in multiprocessor a syndrome.
algorithm.
a subset
Evaluation
on
when
A when a syndrome,
run
run
on S}
on syndrome fault
set
pair
S. is
Definition 6 terministic
Note
diagnosis
if and
only
if FaultyA(S)
C F,
alarm
diagnosis
Definition
for
care
previous
has
focused
model,
correct
system
is no greater
practice.
algorithm the
relatively Some
is correct
each
faulty
a rigorous
this
goal
we take
performance can
pro-
as well
be evaluated.
This
section.
are
faulty
are
made
diagnosis
to an overly
produce
can
sets
the
the
paper
that
be achieved
in the
view
outcomes
when of tests
in contrast with
high
to the probability
in
of diagnosis
that
system.
rare
for the
faults
a diagnosis
This
performance
outcome
set
be extremely
approach
by accounting
In our
p independently
correct
set
of t or
model
in a system.
probability
concerning in this
in the
fault
any
probability
algorithm
fault
with
may
pessimistic
the
algorithm
processors
allows
a probabilistic
processors
of the
of faulty
a model
sets that
we present
diagnosis
a bounded-size
number
Such
including
of diagnosis
faulty
always
if the
lead
area,
Under
of performance,
the
assessment
diagnosis
t < n/2.
paper,
as a measure
It will be shown
probabilistic
of one
another,
performing
a test,
performed
by faulty
bounded-size in this
fault model
at
low cost. comments
are
in order.
by
faulty
virtually
so long
is to provide achieve
algorithm
performance.
therefore
of occurrence
no assumptions correct
and
paper To
[21], where
6, diagnosis
performance
system-level
to be faulty,
identifies
realistic
processors
model,
this
following
value
In this
processors
processors.
of this
e.g.
as fault-free
as fault-free
problem.
be guaranteed
some can
we use,
likelihood
model,
fault-free
set
approach
correctly
a more
work,
In Definition
goals
which
in the
in the
in a system
and
previous
to be identified
of diagnosis
worst-case
can
than
performance.
in a system
a de-
_ F.
in some
diagnosis
under
work on
diagnosis
This
algorithm
of the
measure
model
E),
Model
of the
fewer processors
One of the
is presented
Probabilistic
used
processors as faulty.
a proper
fault
that
is identified
analysis
G(U,
and
if FaultyA(S)
processor
as faulty.
the
only
is identified
fault-free
model
evaluation
and
faulty
in defining
probabilistic
fault
from
allow
processor
each
if and
6 differs may
as a probabilistic
for
a digraph
partial
is identified
yields
(S, F) from
-=- F,
foundation
In much
pair
if FaultyA(S)
when
3
set
to produce only
as no fault-free
great
fault
is said if and
diagnosis
cessor
A
diagnosis
that
only
a syndrome,
correct
false
correct
For
algorithm
processors. any
concerning
We make manner.
the
no assumptions Thus,
faulty
For example,
behavior
of faulty
concerning processors faulty
the can
processors
pass
processors
under
outcomes
of tests
or fail
can:
other
this
model
performed
processors
in
1. always fail other processors, 2. always pass other processors, 3.
fail other
4.
collaborate
processors with
attempt
any
these
equivalent most
processor
as
other
that
The
for which possible
these
achieved
by restricting
under
are
probability
flG(y,z Since specified fault
set
pairs
for the
pair
model. which
made
Note
that
event
may
The have
basic the
family
many
of events
also
systems
in an
model,
With
paper
of these
show
that
are this
are faulty
the
very
is
in the
in this any
which
this
outcomes
improvements
S is a function the
set
nearly can
of the
only
in mind,
events
same
fault
in a basic V(u,v) fault
distinct
outcomes
of the set
be
we now
and
7C(U,E}
in this
consist
whose
syndromes
Formally,
F, S((u,v))=
with
probability
each
set
event
basic
pairs. of G(U,
space
con-
by faulty may
not
are
identical fault
S'((u,v))}
event
Now,
be
of syndrome,
a syndrome,
u e U-
fault
set
of sets
as follows:
associated
syndrome,
performed
a fault
B defined
e E with
set
model
model
E to {0, 1}}.
of tests given
processors.
event
probability
from
syndrome
/3G(tZ,E) = {B : B is a basic The
We
processors.
out of faulty
and
is a unique
contain
this test
under
significant
of faulty
concerning
on edges
r = F'
there
robust.
contains
of a particular
(S a, F e) is contained
B = {(S,F):
outcomes
we present
probability
hence,
F) : F C U and
are
labels
under
produce
the sample space 12a(tZ,E ) of this set pairs in that digraph, i.e.
probability
in this
except set
) = {(S,
the
very
work
behavior
test
model.
no assumptions
processors,
high
therefore and
allowed
algorithms
with
algorithms
For a digraph G(U, E), of all syndrome, fault
sists
are processors
diagnosis
this the
their
or
behaviors.
faulty
model
through
algorithm,
behaviors
the
diagnosis
sparsest
processors
above
correct
systems
the
any
and
probability,
diagnosis
manner.
behaviors
present
faulty
the
assuming
to produce
some
or all of the
as well to
detrimental
shown
other
to confuse
5. combine Since
with
but
that
each
let
E)}.
is the
set
of
all subsets
of
$\mathcal{B}_{G(U,E)}$.

Definition 7 A syndrome, fault set pair $(S,F)$ in a digraph $G(U,E)$ is said to be incompatible if and only if $\exists u, v \in U$ such that $u \in U - F$, $(u,v) \in E$, and

1. $v \in U - F$ and $S((u,v)) = 1$, or
2. $v \in F$ and $S((u,v)) = 0$.

A syndrome, fault set pair which is not incompatible is said to be compatible. A basic event is said to be incompatible if its syndrome, fault set pairs are incompatible, otherwise it is compatible. The probability of a basic event $B$ in a digraph $G(U,E)$ is defined as follows:

$$P_G(B) = \begin{cases} 0 & \text{if } B \text{ is incompatible} \\ p^{|F|}(1-p)^{n-|F|} & \text{otherwise} \end{cases}$$

where $F$ represents the unique fault set associated with $B$. Clearly,

$$\sum_{B \in \mathcal{B}_{G(U,E)}} P_G(B) = 1$$

and, hence, this is a legitimate probability measure.

The primary measure of the performance of a diagnosis algorithm used in this paper is the probability that the algorithm produces correct diagnosis. For a digraph $G(U,E)$ and a deterministic algorithm $A$, let

$$\mathrm{Correct}_G(A) = \{(S,F) : \mathrm{Faulty}_A(S) = F\}$$

and let $\mathrm{NotCorrect}_G(A)$ represent the complement of $\mathrm{Correct}_G(A)$. Thus, $\mathrm{Correct}_G(A)$ represents the set of all syndrome, fault set pairs in a digraph for which Algorithm $A$ produces correct diagnosis. Note that it may be the case that $\mathrm{Correct}_G(A) \notin \mathcal{F}_{G(U,E)}$, in which case $P_G(\mathrm{Correct}_G(A))$ will not be defined. The output of a particular diagnosis algorithm may depend on the outcomes of tests performed by faulty processors and thus, the probability of correct diagnosis for the algorithm cannot be determined until a probability distribution on these edges is specified.
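The compatibility condition of Definition 7 and the probability assigned to a basic event can be sketched in a few lines of Python. This is only an illustrative sketch: the names `is_compatible` and `basic_event_probability`, and the representation of a syndrome as a dictionary keyed by (tester, tested) pairs, are assumptions made here and do not come from the paper.

```python
from typing import Dict, Hashable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]      # (tester, tested); hypothetical representation
Syndrome = Dict[Edge, int]    # 1 = tester fails the tested unit, 0 = passes it


def is_compatible(syndrome: Syndrome, fault_set: Set[Node]) -> bool:
    """Definition 7: only tests performed by fault-free units are constrained."""
    for (u, v), outcome in syndrome.items():
        if u in fault_set:
            continue  # a faulty tester may report anything
        expected = 1 if v in fault_set else 0
        if outcome != expected:
            return False
    return True


def basic_event_probability(n: int, syndrome: Syndrome,
                            fault_set: Set[Node], p: float) -> float:
    """Probability of the basic event containing (syndrome, fault_set).

    Returns 0 for an incompatible pair and p^|F| (1-p)^(n-|F|) otherwise,
    where each of the n units is faulty independently with probability p.
    """
    if not is_compatible(syndrome, fault_set):
        return 0.0
    k = len(fault_set)
    return p ** k * (1.0 - p) ** (n - k)
```

For a three-unit chain in which `a` tests `b` and `b` tests `c`, the pair with fault set `{"c"}` and outcomes 0 and 1 is compatible, and with p = 0.1 its basic event has probability 0.1 x 0.9 x 0.9.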
For a digraph $G(U,E)$, let $P'_G$ be a probability function defined on $\Omega_{G(U,E)}$ such that the family of events is equal to all subsets of $\Omega_{G(U,E)}$ and $\forall B \in \mathcal{B}_{G(U,E)}$, $P'_G(B) = P_G(B)$. Such a probability function will be referred to as a refinement of $P_G$. Now, let $\mathcal{P}_G$ represent the set of all refinements of $P_G$. Since any type of behavior of the faulty processors is allowed in this model, the probability of correct diagnosis for a deterministic algorithm $A$ in a digraph $G(U,E)$, denoted by $\mathrm{DiagProb}_G(A)$, is defined to be

$$\mathrm{DiagProb}_G(A) = \min_{P'_G \in \mathcal{P}_G} P'_G(\mathrm{Correct}_G(A)) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in \mathrm{Correct}_G(A)} P'_G((S,F))$$
Thus, when calculating the probability of correct diagnosis for an algorithm, it is assumed that the faulty processors perform their tests in the manner most detrimental to the algorithm. We may also define this probability for probabilistic diagnosis algorithms. Given a syndrome $S$, a probabilistic diagnosis algorithm $A$ chooses a fault set $F$ with some probability, call it $p_{A,S}(F)$, where $\sum_{F \subseteq U} p_{A,S}(F) = 1$. Thus, for a digraph $G(U,E)$ and a probabilistic diagnosis algorithm $A$, the probability of correct diagnosis for Algorithm $A$ is defined to be

$$\mathrm{DiagProb}_G(A) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in \Omega_{G(U,E)}} p_{A,S}(F)\, P'_G((S,F)).$$

4 Diagnosis Using n - 1 Tests

In [18], an efficient diagnosis algorithm that achieves correct diagnosis with probability approaching one in sequences of digraphs containing $cn \log n$ edges, for $c$ greater than a threshold specified in [18], was presented. It was also claimed in [18] that all diagnosis algorithms must have a probability of correct diagnosis that approaches zero for digraphs containing $o(n \log n)$ edges. In this section, a sequence of digraphs containing $n - 1$ edges is exhibited for which a simple diagnosis algorithm can achieve correct diagnosis with constant probability, thereby providing a counterexample to this claim. Consider a sequence of digraphs $G_n(U_n, E_n)$ with $U_n = \{u_1, \ldots, u_n\}$ and

$$E_n = \{(u_1,u_2), (u_1,u_3), \ldots, (u_1,u_{n-1}), (u_1,u_n)\},$$

i.e. $u_1$ tests all other processors. Now, consider the following simple diagnosis algorithm.

Algorithm Naive
Input: A syndrome $S$ in a digraph $G(U,E)$.
Output: A set $F \subseteq U$.

    $F \leftarrow \emptyset$
    for each $v \in \{u_2, u_3, \ldots, u_n\}$
        if $S((u_1,v)) = 1$ then $F \leftarrow F \cup \{v\}$

Algorithm Naive simply assumes that $u_1$ is fault-free and diagnoses a processor as faulty if and only if it is failed by $u_1$. Clearly, if $u_1$ is faulty, Algorithm Naive incorrectly diagnoses $u_1$ itself. If $u_1$ is fault-free, Naive produces correct diagnosis. Thus, $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Naive})) = P'_{G_n}(\{(S,F) : u_1 \text{ is fault-free}\}) = 1 - p$$

and therefore $\mathrm{DiagProb}_{G_n}(\mathrm{Naive}) = 1 - p$. Thus, this simple diagnosis algorithm produces correct diagnosis with constant probability in a sequence of digraphs containing exactly $n - 1$ edges.

5 A Majority-Vote Algorithm

In this section, a simple yet powerful diagnosis algorithm known as Algorithm Majority is presented. In Algorithm Majority, a processor is diagnosed as faulty if and only if it is failed by more than 1/2 the processors in its tester set.

Algorithm Majority
Input: A syndrome $S$ in a digraph $G(U,E)$.
Output: A set $F \subseteq U$.

    $F \leftarrow \emptyset$
    for each $u \in U$
        if $|\mathrm{fail}_{in}(u)| > |\Gamma^{-1}(u)|/2$ then $F \leftarrow F \cup \{u\}$

Theorem 1 For a digraph $G(U,E)$, Algorithm Majority has a time complexity of $O(|E|)$ and a space complexity of $O(|E|)$.

Proof: The cardinalities of the tester set and of the failure set of every processor can be calculated in a single traversal of the adjacency lists of the digraph, in which the tests are labeled with their outcomes. This traversal requires $O(|E|)$ time. Hence, the time complexity is $O(|E|)$. The only storage requirement aside from the input and the output is a set of temporary variables to hold the cardinalities as they are calculated, so the space complexity is also $O(|E|)$.

Algorithm Majority is slightly more sophisticated than Algorithm Naive. Rather than blindly believing the test outcomes of a single processor, it relies on a majority vote among the processors in the tester set of a given processor. It should be noted, however, that for the special class of systems in which one processor tests every other processor and no other tests are conducted, Algorithms Naive and Majority are equivalent.
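The two algorithms can be sketched in Python as follows. The function names `naive` and `majority`, the `hub` parameter (standing in for $u_1$), and the dictionary representation of a syndrome are assumptions made for this sketch, not notation from the paper. The majority routine makes a single pass over the labeled edges, in the spirit of the traversal used in the proof of Theorem 1.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Hashable
Syndrome = Dict[Tuple[Node, Node], int]   # (tester, tested) -> 0 or 1


def naive(syndrome: Syndrome, hub: Node) -> Set[Node]:
    """Algorithm Naive: trust the single tester `hub` completely."""
    return {v for (u, v), outcome in syndrome.items()
            if u == hub and outcome == 1}


def majority(nodes: Iterable[Node], syndrome: Syndrome) -> Set[Node]:
    """Algorithm Majority: u is faulty iff |fail_in(u)| > |tester set of u| / 2."""
    testers = defaultdict(int)   # size of each tester set
    fails = defaultdict(int)     # size of each failure set
    for (_, v), outcome in syndrome.items():   # one pass over the edges
        testers[v] += 1
        fails[v] += outcome
    # Integer comparison avoids dividing by two; untested units are passed.
    return {u for u in nodes if 2 * fails[u] > testers[u]}


# Star digraph: unit 1 tests units 2, 3 and 4, as in the sequence G_n above.
star = {(1, 2): 0, (1, 3): 1, (1, 4): 0}
print(naive(star, hub=1), majority([1, 2, 3, 4], star))
```

On the star digraph both routines return the same set, which illustrates the remark that the two algorithms coincide when one processor conducts all the tests.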
6 Diagnosis in Sparse Systems

In this section, we examine the problem of correctly diagnosing multiprocessor systems having sparse communication networks. First, it is shown that for a class of irregularly structured systems utilizing a number of tests growing just faster than $n$, Algorithm Majority correctly diagnoses every processor with probability approaching one. Next, the probability of correct diagnosis of Algorithm Majority is evaluated on some fixed systems which utilize a modest number of tests. Finally, it is proven that a linear number of tests are required for any diagnosis algorithm to be capable of producing correct diagnosis with high probability.

6.1 An Upper Bound on the Number of Tests Necessary for Correct Diagnosis
Consider a class of systems in which there is a set of processors known as the testers. The systems are such that any processor which is a tester tests all other processors in the system (including the other testers). Any processor that is not a tester conducts no tests. Thus, a (small) fraction of the processors are relied upon to satisfy all the testing requirements of the system. Such a digraph will be referred to as a tester digraph, formally defined below.

Definition 8 A digraph $G(U,E)$ is said to be a tester digraph if and only if $\exists T_G \subseteq U$ such that

$$E = \{(u,v) : u \in T_G,\ v \in U, \text{ and } u \neq v\}.$$

The set $T_G$ is known as the testing set of $G$.

Figure 1 is an example of a tester digraph with 3 testers and 8 vertices. For a tester digraph $G(U,E)$ with testing set $T_G$, let

$$\mathrm{GoodMaj}_G = \{(S,F) : |T_G \cap (U - F)| > |T_G|/2 \text{ and } (S,F) \text{ is compatible}\}$$

Thus, $\mathrm{GoodMaj}_G$ represents the set of compatible syndrome, fault set pairs in which more than 1/2 the testers are fault-free. The following lemma shows that if the majority of testers in a tester digraph are fault-free, then the diagnosis of Algorithm Majority will be correct.

Lemma 1 For a tester digraph $G(U,E)$, $\mathrm{GoodMaj}_G \subseteq \mathrm{Correct}_G(\mathrm{Majority})$.

Figure 1: A Tester Digraph
Proof: We will show that if $(S,F) \in \mathrm{GoodMaj}_G$, then $(S,F) \in \mathrm{Correct}_G(\mathrm{Majority})$ and therefore, $\mathrm{GoodMaj}_G \subseteq \mathrm{Correct}_G(\mathrm{Majority})$. Consider any $(S,F) \in \mathrm{GoodMaj}_G$ and any $u \in U$. Recall that $\mathrm{Faulty}_{\mathrm{Majority}}(S)$ is the set of processors diagnosed as faulty by Algorithm Majority when run on $S$.

case 1: $u \in (U - T_G) \cap (U - F)$
Because $(S,F)$ is compatible, $u$ must be passed by more than 1/2 the testers, implying that $u \notin \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

case 2: $u \in (U - T_G) \cap F$
Similarly, $u$ must be failed by more than 1/2 the testers, implying $u \in \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

case 3: $u \in T_G \cap (U - F)$
Here, $u$ can be failed by at most 1/2 the remaining testers. Since Algorithm Majority diagnoses a unit as faulty only when it is failed by a strict majority of its tester set, $u \notin \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

case 4: $u \in T_G \cap F$
In this case, $u$ must be failed by more than 1/2 the remaining testers, implying $u \in \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

Hence, $\mathrm{Faulty}_{\mathrm{Majority}}(S) = F$ and therefore $(S,F) \in \mathrm{Correct}_G(\mathrm{Majority})$.
Thus, if more than 1/2 the testers in a tester digraph are fault-free, Algorithm Majority produces correct diagnosis. Theorem 2 shows that if the number of testers is given by any unbounded function, this condition will be achieved with probability approaching one and hence the probability of correct diagnosis for Algorithm Majority approaches one.

Theorem 2 Let $\omega(n)$ be any unbounded function. If $p < 1/2$, then for any sequence of tester digraphs on $n$ vertices having $\omega(n)$ testers, the probability of correct diagnosis for Algorithm Majority approaches one as $n \to \infty$.

Proof: We must show that for any sequence satisfying the theorem condition, $\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \to 1$ as $n \to \infty$. If we let $X$ be a random variable representing the number of faulty units in the testing set of a tester digraph $G$, then

$$\mathrm{GoodMaj}_G = \{(S,F) : X < |T_G|/2 \text{ and } (S,F) \text{ is compatible}\}$$

Now, $X$ is a binomial random variable with parameters $|T_G|$ and $p$. It follows from Lemma 1 that $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge P'_{G_n}(\mathrm{GoodMaj}_{G_n})$$
$$= P'_{G_n}(\{(S,F) : X < |T_{G_n}|/2\})$$
$$= 1 - P'_{G_n}(\{(S,F) : X \ge |T_{G_n}|/2\})$$
$$\ge 1 - P'_{G_n}\left(\left\{(S,F) : \left|\frac{X}{|T_{G_n}|} - p\right| \ge \frac{1}{2} - p\right\}\right)$$

Now, since $p < 1/2$, $\frac{1}{2} - p > 0$, and by the Weak Law of Large Numbers [9],

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \to 1$$

and therefore $\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \to 1$.
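The convergence step can be made quantitative with Chebyshev's inequality. This is a standard bound supplied here for illustration and is not part of the original argument; $X$ is the binomial variable from the proof, with variance $|T_{G_n}|\,p(1-p)$.

```latex
\Pr\!\left[X \ge \tfrac{1}{2}|T_{G_n}|\right]
  \;\le\; \Pr\!\left[\left|\frac{X}{|T_{G_n}|} - p\right| \ge \tfrac{1}{2} - p\right]
  \;\le\; \frac{\operatorname{Var}\!\left(X/|T_{G_n}|\right)}{\left(\tfrac{1}{2} - p\right)^{2}}
  \;=\; \frac{p(1-p)}{|T_{G_n}|\left(\tfrac{1}{2} - p\right)^{2}}
```

Since $|T_{G_n}| = \omega(n)$ is unbounded, the right-hand side tends to zero for every fixed $p < 1/2$, which is all the proof requires. The bound is loose; the exact binomial tail decays much faster in $|T_{G_n}|$.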
Thus, Algorithm Majority produces correct diagnosis with probability approaching one in a class of digraphs containing a number of edges given by $n \cdot \omega(n)$, where $\omega(n)$ is any function that goes to infinity (arbitrarily slowly) with $n$. Under a bounded-size fault set model a quadratic number of tests are required to withstand a linear number of faults, while this result shows that in this probabilistic model a linear expected number of faults can be tolerated with a number of tests that is arbitrarily close to linear. The maximum degree of the vertices in this class of digraphs is large, however, which may be a problem in some applications. This motivates us to study the problem of diagnosis in sparse regular systems in Section 7.

6.2 Performance of Algorithm Majority on Fixed Systems

In this section, the number of tests required to achieve a given probability of correct diagnosis in tester digraphs using Algorithm Majority is examined. For a tester digraph $G(U,E)$ with testing set $T_G$, Lemma 1 gives

$$\mathrm{DiagProb}_G(\mathrm{Majority}) \ge \Pr\left[X < |T_G|/2\right] = \sum_{i < |T_G|/2} \binom{|T_G|}{i} p^i (1-p)^{|T_G| - i} \qquad (1)$$

where $X$ is a binomial random variable with parameters $|T_G|$ and $p$. Note that the probability of correct diagnosis depends only on the testing set cardinality and not on $n$. For a given probability of failure, Inequality 1 can be used to determine the number of testers needed for Algorithm Majority to achieve a specified probability of correct diagnosis. The size of the testing set required to achieve a correct diagnosis probability of 0.99999 for various values of $p$ is shown in Table 1. If the probability of failure of a processor is 0.001, Algorithm Majority can achieve correct diagnosis with a probability of 0.99999 using three tests per processor regardless of the number of processors in the system. For a probability of failure of 0.005 or 0.010 the tester set need only be of cardinality five for Algorithm Majority to achieve a probability of correct diagnosis of 0.99999.

    p        |T_G|
    0.001      3
    0.005      5
    0.010      5
    0.050     11
    0.100     19
    0.200     41
    0.300    105

Table 1: Size of Testing Set Required for Correct Diagnosis Probability of 0.99999
Thus, when the probability of failure is small, correct diagnosis can be achieved with extremely high probability using a small number of tests per processor. When $p$ is larger, more tests are required. As indicated in Table 1, for a probability of failure of 0.300, more than 100 tests per processor are necessary to achieve a correct diagnosis probability of 0.99999. Since a large fraction of the processors in the system will be faulty in this situation, it is to be expected that a larger number of tests will be required. The important point is that the total number of tests remains proportional to $n$ regardless of the value of $p$.

In Table 2, we compare the total number of tests required under the bounded-size fault set model and under the probabilistic model for various values of $n$ and $p$. The number of tests required under the bounded-size fault set model was calculated in the following manner. For a given $n$ and $p$, determine $t$ such that the probability of more than $t$ out of $n$ processors being faulty is no greater than 0.01. The tabulated bounded-size values are consistent with $n \cdot t$ tests for this $t$. The number of tests required under the probabilistic model is the number required by Algorithm Majority in order to achieve a correct diagnosis probability of 0.99.

    n         p       Bounded-size    Probabilistic
    100       0.01             400               99
    100       0.10           1,800              495
    100       0.30           4,100            3,069
    1000      0.01          18,000              999
    1000      0.10         123,000            4,995
    1000      0.30         334,000           30,969
    10,000    0.01       1,240,000            9,999
    10,000    0.10      10,700,000           49,995
    10,000    0.30      31,070,000          309,969

Table 2: Total Number of Tests Necessary for Correct Diagnosis Probability of 0.99

Table 2 shows that the number of tests required under the probabilistic model is dramatically lower than the number required under the bounded-size fault set model. For example, when $n = 10{,}000$ and $p = 0.10$, the number of tests required is reduced by a factor of more than 214.
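The testing set sizes in Table 1 can be recomputed from the binomial tail that underlies Inequality 1. The following is a minimal sketch under that reading of the inequality; the function names and the `max_size` search limit are assumptions introduced here.

```python
from math import comb


def majority_success_probability(num_testers: int, p: float) -> float:
    """Pr[X < num_testers / 2] for X ~ Binomial(num_testers, p).

    This is the probability that strictly more than half of the testers
    are fault-free, which lower-bounds DiagProb for Algorithm Majority.
    """
    t = num_testers
    return sum(comb(t, i) * p ** i * (1.0 - p) ** (t - i)
               for i in range(t + 1) if 2 * i < t)


def min_testing_set_size(p: float, target: float, max_size: int = 1000) -> int:
    """Smallest testing set whose fault-free-majority probability meets target."""
    for t in range(1, max_size + 1):
        if majority_success_probability(t, p) >= target:
            return t
    raise ValueError("no testing set of size <= max_size meets the target")


for p in (0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.300):
    print(p, min_testing_set_size(p, 0.99999))
```

Because the result does not depend on $n$, the total number of tests in a tester digraph is the testing set size multiplied by $n - 1$, which is how the probabilistic column of Table 2 scales linearly with the system size.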
6.3 A Lower Bound on the Number of Tests Necessary for Correct Diagnosis

In this section, a lower bound on the number of tests necessary to achieve correct diagnosis with high probability is proven. It is shown that if the number of edges in an arbitrary sequence of digraphs grows slower than $n$, then all diagnosis algorithms have probability approaching zero of achieving correct diagnosis. This result implies that Algorithm Majority achieves a probability approaching one of correct diagnosis on systems that are very nearly as sparse as possible. Thus, this relatively simple diagnosis algorithm is indeed extremely powerful.

When the number of edges in a sequence of digraphs grows slower than $n$, isolated processors, i.e. processors which have no incident edges, must exist. Intuitively, no diagnosis algorithm should be capable of correctly identifying the state of all these isolated processors with high probability, making diagnosis in such situations impossible. This is formally proven in Theorem 3. The essence of the proof of Theorem 3 can be explained quite simply. To prove that a deterministic diagnosis algorithm $A$ has a probability approaching zero of achieving correct diagnosis in a sequence of digraphs $G_n(U_n, E_n)$, a set of $(S,F)$ pairs disjoint from $\mathrm{Correct}_{G_n}(A)$ must be exhibited that has a probability dominating the probability of $\mathrm{Correct}_{G_n}(A)$. For a given syndrome from a system with isolated processors, it can be shown that so long as the number of isolated processors approaches infinity, the probability of that syndrome and a fault set with a particular labeling of the isolated processors is dominated by the probability of that syndrome and the fault sets in which the isolated processors are relabeled in all possible ways. Thus, for any $(S,F) \in \mathrm{Correct}_{G_n}(A)$, a set of syndrome, fault set pairs disjoint from $\mathrm{Correct}_{G_n}(A)$ can be exhibited that has probability dominating the probability of $(S,F)$. It is also shown that there exists a deterministic diagnosis algorithm that has performance at least as good as the performance of any probabilistic algorithm, thus completing the proof.

Theorem 3 Let $A$ be any probabilistic or deterministic diagnosis algorithm. If $0 < p < 1$, then for any sequence of digraphs on $n$ vertices having $o(n)$ edges, the probability of correct diagnosis for Algorithm $A$ approaches zero as $n \to \infty$.

Proof: We must show that for any probabilistic or deterministic diagnosis algorithm $A$ and any sequence of digraphs $G_n(U_n, E_n)$ having $|E_n| \in o(n)$, $\mathrm{DiagProb}_{G_n}(A) \to 0$ as $n \to \infty$. Assume faulty processors pass all processors they test. This yields a refinement $P'_{G_n} \in \mathcal{P}_{G_n}$, where

$$P'_{G_n}((S,F)) = \begin{cases} 0 & \text{if } (S,F) \text{ is incompatible or } \exists u \in F, v \in U \text{ with } S((u,v)) = 1 \\ p^{|F|}(1-p)^{n-|F|} & \text{otherwise} \end{cases}$$

Now, let $\mathrm{ISO}_{G_n} \subseteq U_n$ represent the set of isolated processors, i.e. processors which have no incident edges, in $G_n(U_n, E_n)$. Clearly,

$$|\mathrm{ISO}_{G_n}| \ge n - 2|E_n| \to \infty.$$

For a syndrome, fault set pair $(S,F) \in \mathrm{Correct}_{G_n}(A)$ let

$$\mathrm{Relabel}_{(S,F)} = \{(S',F') : S' = S,\ F' \neq F, \text{ and } F - \mathrm{ISO}_{G_n} = F' - \mathrm{ISO}_{G_n}\}$$

and let $\mathrm{AllLabel}_{(S,F)} = \mathrm{Relabel}_{(S,F)} \cup \{(S,F)\}$. Thus, $\mathrm{Relabel}_{(S,F)}$ consists of the syndrome, fault set pairs in which the processors of $\mathrm{ISO}_{G_n}$ are relabeled in all possible ways. Clearly,

$$P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{Relabel}_{(S,F)}) = \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \left[P'_{G_n}(\mathrm{AllLabel}_{(S,F)}) - P'_{G_n}((S,F))\right]$$

and since all processors in the set $\mathrm{ISO}_{G_n}$ are isolated,

$$P'_{G_n}((S,F)) = p^{|\mathrm{ISO}_{G_n} \cap F|}(1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|}\, P'_{G_n}(\mathrm{AllLabel}_{(S,F)}).$$

Therefore,

$$\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{AllLabel}_{(S,F)}) = \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \frac{P'_{G_n}((S,F))}{p^{|\mathrm{ISO}_{G_n} \cap F|}(1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|}} \ge \frac{\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F))}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}}$$

and thus

$$P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \left(\frac{1}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} - 1\right) \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F)).$$

Therefore,

$$P'_{G_n}(\mathrm{Correct}_{G_n}(A)) \le \frac{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}}{1 - [\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} \to 0$$

as $n \to \infty$. Thus, for any deterministic diagnosis algorithm $A$, $\mathrm{DiagProb}_{G_n}(A) \to 0$.
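The final bound in the argument above is easy to evaluate numerically. The sketch below is illustrative only: the function name is hypothetical, and it uses the worst-case count $n - 2|E_n|$ of isolated processors from the proof rather than the actual number in any particular digraph.

```python
def correct_diagnosis_upper_bound(n: int, num_edges: int, p: float) -> float:
    """Upper bound m / (1 - m) on the probability of correct diagnosis,
    with m = max(p, 1 - p) ** (n - 2 * num_edges).

    The exponent is the guaranteed minimum number of isolated processors.
    The bound is capped at 1, since it is vacuous when few units are isolated.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    isolated = n - 2 * num_edges
    if isolated <= 0:
        return 1.0  # no isolated processor is guaranteed; bound is vacuous
    m = max(p, 1.0 - p) ** isolated
    return min(1.0, m / (1.0 - m))


# 1000 processors and 100 tests leave at least 800 processors isolated.
print(correct_diagnosis_upper_bound(1000, 100, 0.1))
```

Even for a small failure probability the bound collapses quickly, because every isolated processor must be guessed correctly and each guess succeeds with probability at most $\max(p, 1-p)$.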
ap-
_.
The
r
systems
from
of systems which
This
class
not
contained
The
systems
conduct
is tested
includes
regular
4 shows
sequent needed.
results,
achieves
correct
contains
many
in the
at
with
Dl,¢log_
class.
in this
section
least
O(nlog
sequence
following
Let Y
be a binomial
-
e.g.
for
to a theorem
variable
pi(1-
with
which
processor
c sufficiently large Majority
In order
to prove
proved
in the
large. degree.
parameters
will produce this
and
n and
O
1.

Most of the previous work in the diagnosis area has utilized a bounded-size fault set model where it is assumed that no more than $t$ faults occur in the system. A system is said to be $t$-diagnosable if any combination of $t$ faulty units in the system can be uniquely diagnosed. It is well known that a $k$-dimensional hypercube is $k$-diagnosable but not $(k+1)$-diagnosable. Since $k = \log_2 n$, where $n$ is the number of vertices of the cube, the assumptions of the bounded-size fault set model are satisfied only when the number of faults is less than or equal to the logarithm of the number of units. It is unlikely that this condition will be met in large systems. Under the probabilistic model, however, a number of faults that is linear in the number of units can be tolerated.

Table 3 illustrates the difference in diagnosis performance between the two models on hypercube systems for probabilities of failure of 0.002 and 0.020. The fourth column of this table lists the expected number of faulty processors for the corresponding system and failure probability. $P_k$ represents the probability that no more
than $k$ units are faulty in a $k$-dimensional, $n$-node hypercube. Since the proposed bounded-size fault set algorithms can only guarantee correct diagnosis when the number of faults is less than or equal to $k$, $P_k$ is an estimate for the probability of correct diagnosis for those algorithms. $P_{Maj}$ represents a lower bound on the probability of correct diagnosis for Algorithm Majority.

It can be seen from Table 3 that the diagnosis performance of the bounded-size fault set model degrades rapidly as the size of the hypercube increases. Algorithm Majority, however, still produces correct diagnosis with probability very nearly one for all of the hypercubes studied, even when the probability of failure of a processor is as large as 0.02. Consider the case where $k = 16$ and the probability of failure is 0.02. In this situation, the number of processors is 65,536. While this number may seem large, a system containing this many processors, namely the Connection Machine [11], has been built. Under the bounded-size fault set model, the number of faults is limited to 16 for this situation. The expected number of faults, however, is greater than 1300, and the probability that the number of faults is less than or equal to 16 is very nearly zero, yet Algorithm Majority still produces the correct diagnosis with a probability very nearly one.

Table 3: Diagnosis probability on hypercube systems

  k        n      p   Exp #faulty     P_k   P_Maj
  6       64  0.002          0.13  1.0000  1.0000
  6       64  0.020          1.28  0.9997  0.9999
  8      256  0.002          0.51  1.0000  1.0000
  8      256  0.020          5.12  0.9258  1.0000
 10     1024  0.002          2.05  1.0000  1.0000
 10     1024  0.020         20.48  0.0079  1.0000
 12     4096  0.002          8.19  0.9267  1.0000
 12     4096  0.020         81.92  0.0000  1.0000
 14    16384  0.002         32.77  0.0002  1.0000
 14    16384  0.020        327.68  0.0000  1.0000
 16    65536  0.002        131.07  0.0000  1.0000
 16    65536  0.020       1310.72  0.0000  1.0000
 20  1048576  0.002       2097.15  0.0000  1.0000
 20  1048576  0.020      20971.52  0.0000  1.0000
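The expected-fault and $P_k$ columns of Table 3 can be recomputed with a short script. This is a sketch under a stated assumption: each of the $n = 2^k$ processors is taken to fail independently with probability $p$, so the number of faults is binomial. The function names are illustrative, and the $P_{Maj}$ column is not reproduced because its bound is not restated here.

```python
import math


def expected_faults(k, p):
    """Expected number of faulty processors in a k-dimensional hypercube."""
    return (2 ** k) * p


def prob_at_most_k_faults(k, p):
    """P[number of faults <= k] for n = 2**k independent processors.

    Assumes independent failures with probability p (binomial fault count).
    Terms are evaluated in log space to avoid overflow for large n.
    """
    n = 2 ** k
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(i + 1)
                      - math.lgamma(n - i + 1))
        total += math.exp(log_choose + i * log_p + (n - i) * log_q)
    return min(1.0, total)


for k in (6, 8, 10, 12, 14, 16, 20):
    for p in (0.002, 0.020):
        print(k, 2 ** k, p,
              round(expected_faults(k, p), 2),
              round(prob_at_most_k_faults(k, p), 4))
```

Summing only the first $k + 1$ binomial terms keeps the computation cheap even for $n = 2^{20}$, since $k$ is logarithmic in $n$.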
7.3 Lower Bound
While hypercubes are an important class of system, systems with even fewer connections are expected to see increased use in future multiprocessor applications. We are therefore interested in determining a lower bound on the total number of tests necessary to achieve correct diagnosis with high probability. Such a lower bound was proven in [2] for regular systems. This result states that all diagnosis algorithms must have a probability of correct diagnosis that approaches zero in regular systems with $o(n \log n)$ tests. The more general probability model of [2] contains the model utilized in this paper as a special case and hence this result holds for this model as well. Thus, for the important class of regular systems the algorithm given in [18] as well as Algorithm Majority are both optimal to within a constant factor. This result also demonstrates that the irregular structure of the tester digraphs studied in this paper is a crucial factor in making them amenable to diagnosis.

Of special interest due to their widespread use are multiprocessor systems which are regular and of fixed degree. Included in this class of systems are rings, tori, and hexagonal meshes. Because a regular system of fixed degree conducts only $O(n)$ tests, which is $o(n \log n)$, the lower bound applies to all such systems. This somewhat pessimistic result implies that weaker forms of diagnosis must be considered for these systems.

8 Diagnosis using a Linear Number of Tests
It has been shown that Algorithm Majority can achieve correct diagnosis with probability approaching one in digraphs containing $n\omega(n)$ edges, while all algorithms must have probability approaching zero of correct diagnosis in digraphs possessing $o(n)$ edges. These results leave open the question of what can be achieved using $cn$ edges, for some positive constant $c$. In this section, it is shown that with $cn$ edges Algorithm Majority can achieve a probability of correct diagnosis that is a constant arbitrarily close to one. It is also shown that a constant probability less than one is the best that any algorithm can hope to achieve in this situation, meaning that Algorithm Majority is optimal for digraphs with a linear number of edges.

The following theorem characterizes the performance of Algorithm Majority on digraphs with a linear number of edges.

Theorem 5: Let $\epsilon$ be any real number such that $0 < \epsilon < 1$. There exists a constant $c$ such that, for all sufficiently large tester digraphs with $|T_{G_n}| \ge c$, the probability of correct diagnosis for Algorithm Majority is at least $1 - \epsilon$.

Proof: We must show that, $\exists c > 0, n_0$ such that if $G_n(U_n, E_n)$ is a sequence of tester digraphs with $|T_{G_n}| \ge c$, then $\forall n \ge n_0$, $\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \ge 1 - \epsilon$.
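Let $\alpha = \frac{1}{2(1-p)} < 1$. Then, $\forall P_{G_n} \in \mathcal{P}_{G_n}$,

$$
P_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge 1 - \left[ e^{-(1-\alpha)^2/2} \right]^{(1-p)c}
$$

by Corollary 1. Now, if $c$ is chosen such that

$$
c > \frac{-2 \ln \epsilon}{(1-\alpha)^2 (1-p)}
$$

then

$$
P_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge 1 - e^{\ln \epsilon} = 1 - \epsilon.
$$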
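Thus, Algorithm Majority can achieve correct diagnosis with probability arbitrarily close to one in sequences of digraphs having a linear number of edges. The following theorem shows that all diagnosis algorithms must have a probability of correct diagnosis that is bounded away from one by a positive constant in this situation.

Theorem 6: Let $c$ be any positive constant. If $0 < p < 1$, $\exists \epsilon > 0$ such that for any probabilistic or deterministic diagnosis algorithm $A$ and any sufficiently large digraph on $n$ vertices having no more than $cn$ edges, the probability of correct diagnosis for Algorithm $A$ is at most $1 - \epsilon$.

Proof: We must show that for any $c > 0$, $\exists \epsilon > 0, n_0$ such that if $G_n(U_n, E_n)$ is a sequence of digraphs with $|E_n| \le cn$, then $\forall n \ge n_0$, $\mathrm{DiagProb}_{G_n}(A) \le 1 - \epsilon$. Let $P'_{G_n} \in \mathcal{P}_{G_n}$ be such that faulty processors fail all other processors. Now, let $u_{\min_{G_n}} \in U_n$ be any vertex of $G_n$ such that $\forall u \in U_n$, $|N(u_{\min_{G_n}})| \le |N(u)|$. Thus, $u_{\min_{G_n}}$ is a processor having minimum size neighbor set in $G_n$. Clearly, $|N(u_{\min_{G_n}})| \le 2c$.

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \min\left(\frac{p}{1-p}, \frac{1-p}{p}\right) P'_{G_n}(\mathrm{Correct}_{G_n}(A) \cap \mathrm{Surr}_{G_n}) \ge \min\left(\frac{p}{1-p}, \frac{1-p}{p}\right) \left[ p^{2c} - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \right]
$$

or

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \left[ 1 + \min\left(\frac{p}{1-p}, \frac{1-p}{p}\right) \right] \ge \min\left(\frac{p}{1-p}, \frac{1-p}{p}\right) p^{2c}
$$

and

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \frac{\min\left(\frac{p}{1-p}, \frac{1-p}{p}\right) p^{2c}}{1 + \min\left(\frac{p}{1-p}, \frac{1-p}{p}\right)} = \epsilon > 0
$$

so long as $0 < p < 1$. Now, consider any probabilistic diagnosis algorithm $A$. Then, $\forall P_{G_n}$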
6 PG. DiagPr°ba.
Consider
the deterministic
F such that
(A)