Efficient Diagnosis of Multiprocessor Systems under Probabilistic Models


Douglas M. Blough
Department of Electrical and Computer Engineering
University of California
Irvine, CA 92717

Gregory F. Sullivan
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218

Gerald M. Masson
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218

Abstract

In this paper, the problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. This work focuses on minimizing the number of tests that must be conducted in order to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm that can correctly diagnose the state of every processor with probability approaching one in a class of systems performing slightly greater than a linear number of tests is presented. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. Lower and upper bounds on the number of tests required for regular systems are also presented. A class of regular systems which includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly less than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques.

Index Terms: Algorithms, fault diagnosis, hypercube, multiprocessor systems, permanent faults, probabilistic models.

*This research was supported in part by NASA grant 1442.


1 Introduction

Highly

parallel

of distinct tions.

computer

processing

of diagnostic

of the The from

probabilistic

diagnosis

systems

an associated

known

processors

has

probabilistically O(n _) algorithm method

has

slightly ing

less

more

an incorrect can

rect

set

fault

homogeneous p. An systems It was must

have

containing

is an avail-

evaluation are

a more

increases

the

p can

set

faulty

realistic

complexity

may

n)

the

with

in which

approaching

zero

may

log_' was

with the

of achieving

Unfortunately,

due

2

to

best

set

of a

presented

a

with

diagno-

of diagnosis

set

of occurmay

be

probability diagnoses

probability

optimal the

i.e.

to

of failure a class

of

one.

all algorithms

diagnosis flaw

cor-

applies

approaching

possible,

only

of choos-

when

utilized

correctly

an

maximum

for which

model

correct

fault

optimal

even

systems

a subtle

while

probability

be true

is is p-

I20]

probability fault

a common that

of

of faulty

system

quality

the

The

has

the

with

examined

presented

result

set

Blount

this

likely that

probability.

for c > this

likely

is correct

sets

most

same

processor

any

each

class

to p of occurring

work,

how

meaning

The

the

a given

most

which

fault

author

was

n tests,

the

con-

in which

be co-NP-complete

the

sets,

high

each

algorithm

tests.

or equal

determined

systems, fault

in which

In related

paper

examined

[4]. Unfortunately,

Hence,

be high.

in [18] that

probability

to

previously

first

systems

whether

(diagnosis model

In [18],

cnlog

claimed

addressed

authors

than

model.

exist.

other

be identified

diagnosis

been

The

The

[6] for determining

time and it was not of tests conducted.

than

has

systems

shown

diagnosable

fault can

been

given

be achieved.

o(nlog

on

processors

heterogeneous

greater

probabilistic

than

containing also

has

diagnosis

systems

efficient

focuses yields

time

of determining

optimal

probable

diagnosis

system

approach

diagnosis

diagnosable

weighted

exponential the number

slightly

as increasing

paper

same

of failure.

related

In p-probabilistically rence

the

[13] examined

problem

been

closely

in a general

sis requires varies with

diagnosis

in which

This

at

probability

The

of achieving

probability)

model

system

probability

diagnosable

in the

of applica-

fault

in [3,4,6,7,13,15,17,18,19,20].

a priori

diagnosable.

system

but

as p-probabilistically

that

uniquely

in this

probabilities.

of multiprocessor viewpoint

has

number

analysis.

problem

processor

number

automatic as well

a large

system fault diagnosis has been primarily leading to overly pessimistic assessments

presented

capability

containing

in a growing

costs

a probabilistic

a probabilistic

cerning

work

identical

of diagnostic

corresponding

utilized

multiprocessor fault scenarios,

under

and

systems

of processors,

maintenance

The

strategies

independent

assessment

number

of reducing work on worst-case

computer

arebeing

a large

capability.

of diagnosis with

with

method

ability. Previous concerned with

i.e.

elements,

For systems

attractive

systems,

in the

in systems proof,

this

result is untrue. This result wasalsousedin bound

in a more

In this lower

paper,

bound

constant

in

probability that

graphs

produces

correct

class

of hypercubes.

can

It is also

all diagnosis

algorithms

class

of tests

required

is then

proven.

A class

of systems

achieved

with

systems

shown

that

perform

lower

poorly.

regular

the

problem O(n

systems

This

final

systems,

of diagnosis

result

one

in

is preclass

o(n log n) tests,

implies

forms

tests

important

possessing

weaker

diagnosis

log n)

approaching as the

in di-

A nearly

correct

conducting

in [18] as well

for regular

one

is given.

to achieve

with

a diagnosis

approaching

probability

given

Next,

of tests

Finally,

to the

is achieved

1 tests.

probability

number

be

of fixed-degree

with

number

the

flawed

A counterexample diagnosis

n -

a linear

one

contains

with

than

is considered.

diagnosis

This

portant

on the

in [18]. correct

digraphs

diagnosis

more

approaching

systems

sented.

presented in which

correct

bound

probability

which

model

[18] is given

slightly

lower

in regular

the

a similarly

model.

in a sequence'of

containing

matching

probabilistic

we utilize

claimed

algorithm

with

general

[3] to prove

that

for

the

im-

must

be

in [16].

In

of diagnosis

considered.

2 Preliminaries

The

multiprocessor

system

this

model

is represented

a system

representing

processors

performed related

model

in the

by one processor to this

model

utilized

this

as a directed system

and

on another

are

in

graph

edges

processor.

defined

and

of n

processors,

paper

a measure

was with

proposed vertices

of the

of the

digraph

representing

In this

section_

all basic

of diagnosis

algorithm

digraph tests

quantities

performance

is presented.

2.1 Basic Definitions

For a system composed of n processors, the set of processors is represented by $U = \{u_1, \ldots, u_n\}$. It is assumed that processors are capable of performing tests on one another. This situation is represented by a digraph G(U, E), where each vertex in the set U corresponds to a processor of the system and $(u, v) \in E$ if and only if processor u tests processor v. Associated with each test is a test outcome. The outcome of test (u, v) is a 1 (0) if u evaluates v as faulty (fault-free). A complete collection of test outcomes constitutes a syndrome. Below, the fundamental concepts of syndromes and fault sets are defined.

Definition 1 For a digraph G(U, E), a syndrome is a function from E to {0, 1}.

Definition 2 For a digraph G(U, E), a fault set is a subset of the vertex set U.

For a processor u, the tester set consists of the processors that test u, the failure set consists of those processors that test and fail u, and the neighbor set consists of the processors that u tests along with the processors that test u. These quantities are defined below.

Definition 3 For a digraph G(U, E) and $u \in U$, the tester set of u, denoted by $\Gamma^{-1}(u)$, is given by

$$\Gamma^{-1}(u) = \{v \in U : (v, u) \in E\}$$

Definition 4 For a digraph G(U, E), a syndrome S, and $u \in U$, the failure set of u, denoted by $fail_{in}(u)$, is given by

$$fail_{in}(u) = \{v \in \Gamma^{-1}(u) : S((v, u)) = 1\}$$

Definition 5 For a digraph G(U, E) and $u \in U$, the neighbor set of u, denoted by N(u), is given by

$$N(u) = \{v \in U : (u, v) \in E \text{ or } (v, u) \in E\}$$

2.2

Diagnosis

Algorithm

A fundamental

problem

sors

given

in a system

diagnosis

of the

diagnosed drome by

as faulty it is possible

set

analysis however, digraph

of diagnosis the G(U,

FaultyA(S)

Using this, characterized

of correct and

:

as the

algorithm

notion E),

used

performance.

represents

the diagnosis in Definition

the quality 6.

set

of an

in the

processors

outputs

processors a syn-

is correct Syndrome,

subsequent

probabilistic

with

this

For a syndrome

analysis S from

a

A, let

A diagnoses

u as faulty

of Algorithm algorithm

to as a

and

processors.

proceeding

be defined.

proces-

and the

algorithm

of faulty

element

algorithm

output

of faulty

Before

must

U : Algorithm

exactly

of a deterministic the

basic

diagnosis

a deterministic

{u •

with

faulty

is referred

as input

contains

for a set

the

problem

a syndrome

subset

output

output

for this

takes This

Thus,

if the

algorithm's

are therefore

FaultyA(S) Thus,

to evaluate

the

pairs

system.

algorithm.

is to identify

algorithm

algorithm

in the

by the

systems

An

A diagnosis

processors

comparing

fault

in multiprocessor a syndrome.

algorithm.

a subset

Evaluation

on

when

A when a syndrome,

run

run

on S}

on syndrome fault

set

pair

S. is

Definition 6 terministic

Note

diagnosis

if and

only

if FaultyA(S)

C F,

alarm

diagnosis

Definition

for

care

previous

has

focused

model,

correct

system

is no greater

practice.

algorithm the

relatively Some

is correct

each

faulty

a rigorous

this

goal

we take

performance can

pro-

as well

be evaluated.

This

section.

are

faulty

are

made

diagnosis

to an overly

produce

can

sets

the

the

paper

that

be achieved

in the

view

outcomes

when of tests

in contrast with

high

to the probability

in

of diagnosis

that

system.

rare

for the

faults

a diagnosis

This

performance

outcome

set

be extremely

approach

by accounting

In our

p independently

correct

set

of t or

model

in a system.

probability

concerning in this

in the

fault

any

probability

algorithm

fault

with

may

pessimistic

the

algorithm

processors

allows

a probabilistic

processors

of the

of faulty

a model

sets that

we present

diagnosis

a bounded-size

number

Such

including

of diagnosis

faulty

always

if the

lead

area,

Under

of performance,

the

assessment

diagnosis

t < n/2.

paper,

as a measure

It will be shown

probabilistic

of one

another,

performing

a test,

performed

by faulty

bounded-size in this

fault model

at

low cost. comments

are

in order.

by

faulty

virtually

so long

is to provide achieve

algorithm

performance.

therefore

of occurrence

no assumptions correct

and

paper To

[21], where

6, diagnosis

performance

system-level

to be faulty,

identifies

realistic

processors

model,

this

following

value

In this

processors

processors.

of this

e.g.

as fault-free

as fault-free

problem.

be guaranteed

some can

we use,

likelihood

model,

fault-free

set

approach

correctly

a more

work,

In Definition

goals

which

in the

in the

in a system

and

previous

to be identified

of diagnosis

worst-case

can

than

performance.

in a system

a de-

_ F.

in some

diagnosis

under

work on

diagnosis

This

algorithm

of the

measure

model

E),

Model

of the

fewer processors

One of the

is presented

Probabilistic

used

processors as faulty.

a proper

fault

that

is identified

analysis

G(U,

and

if FaultyA(S)

processor

as faulty.

the

only

is identified

fault-free

model

evaluation

and

faulty

in defining

probabilistic

fault

from

allow

processor

each

if and

6 differs may

as a probabilistic

for

a digraph

partial

is identified

yields

(S, F) from

-=- F,

foundation

In much

pair

if FaultyA(S)

when

3

set

to produce only

as no fault-free

great

fault

is said if and

diagnosis

cessor

A

diagnosis

that

only

a syndrome,

correct

false

correct

For

algorithm

processors. any

concerning

We make manner.

the

no assumptions Thus,

faulty

For example,

behavior

of faulty

concerning processors faulty

the can

processors

pass

processors

under

outcomes

of tests

or fail

can:

other

this

mode]

performed

processors

in

1. alwaysfail otherprocessors, 2. alwayspassotherprocessors, 3.

fail other

4.

collaborate

processors with

attempt

any

these

equivalent most

processor

as

other

that

The

for which possible

these

achieved

by restricting

under

are

probability

flG(y,z Since specified fault

set

pairs

for the

pair

model. which

made

Note

that

event

may

The have

basic the

family

many

of events

also

systems

in an

model,

With

paper

of these

show

that

are this

are faulty

the

very

is

in the

in this any

which

this

outcomes

improvements

S is a function the

set

nearly can

of the

only

in mind,

events

same

fault

in a basic V(u,v) fault

distinct

outcomes

of the set

be

we now

and

7C(U,E}

in this

consist

whose'

syndromes

Formally,

F, S((u,v))=

with

probability

each

set

event

basic

pairs. of G(U,

space

con-

by faulty may

not

are

identical fault

S'((u,v))}

event

Now,

be

of syndrome,

a syndrome,

u e U-

fault

set

of sets

as follows:

associated

syndrome,

performed

a fault

B defined

e E with

set

model

model

E to {0, 1}}.

of tests given

processors.

event

probability

from

syndrome

/3G(tZ,E) = {B : B is a basic The

We

processors.

out of faulty

and

is a unique

contain

this test

under

significant

of faulty

concerning

on edges

r = F'

there

robust.

contains

of a particular

(S a, F e) is contained

B = {(S,F):

outcomes

we present

probability

hence,

F) : F C U and

are

labels

under

produce

the sample space 12a(tZ,E ) of this set pairs in that digraph, i.e.

probability

in this

except set

) = {(S,

the

very

work

behavior

test

model.

no assumptions

processors,

high

therefore and

allowed

algorithms

with

algorithms

For a digraph G(U, E), of all syndrome, fault

sists

are processors

diagnosis

this the

their

or

behaviors.

faulty

model

through

algorithm,

behaviors

the

diagnosis

sparsest

processors

above

correct

systems

the

any

and

probability,

diagnosis

manner.

behaviors

present

faulty

the

assuming

to produce

some

or all of the

as well to

detrimental

shown

other

to confuse

5. combine Since

with

but

that

each

let

E)}.

is the

set

of

all subsets

of

$\mathcal{B}_{G(U,E)}$.

Definition 7 A syndrome, fault set pair (S, F) in a digraph G(U, E) is said to be incompatible if and only if $\exists u, v \in U$ such that $u \in U - F$, $(u, v) \in E$, and

1. $v \in U - F$ and $S((u, v)) = 1$, or
2. $v \in F$ and $S((u, v)) = 0$.

A syndrome, fault set pair which is not incompatible is said to be compatible. A basic event is said to be incompatible if its syndrome, fault set pairs are incompatible, otherwise it is compatible. The probability of a basic event B in a digraph G(U, E) is defined as follows:

$$P_G(B) = \begin{cases} 0 & \text{if } B \text{ is incompatible} \\ p^{|F|}(1 - p)^{n - |F|} & \text{otherwise} \end{cases}$$

where F represents the unique fault set associated with B. Clearly,

$$\sum_{B \in \mathcal{B}_{G(U,E)}} P_G(B) = 1$$

and, hence, this is a legitimate probability measure.

The primary measure of the performance of a diagnosis algorithm used in this paper is the probability that the algorithm produces correct diagnosis. For a digraph G(U, E) and a deterministic algorithm A, let

$$Correct_G(A) = \{(S, F) : Faulty_A(S) = F\}$$

and let $NotCorrect_G(A)$ represent the complement of $Correct_G(A)$. Thus, $Correct_G(A)$ represents the set of all syndrome, fault set pairs in a digraph for which Algorithm A produces correct diagnosis. Note that it may be the case that $Correct_G(A) \notin \mathcal{F}_{G(U,E)}$, in which case $P_G(Correct_G(A))$ will not be defined. The output of a particular diagnosis algorithm may depend on the outcomes of tests performed by faulty processors and thus, the probability of correct diagnosis for the algorithm cannot be determined until a probability distribution on these edges is specified.
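The compatibility condition and the basic-event probability can be checked mechanically on a tiny digraph. The following is a minimal sketch, not the authors' code: the function names (`is_compatible`, `fault_set_probability`, `total_probability`) and the dictionary representation of a syndrome are assumptions made for illustration. It enumerates one basic event per fault set and per labeling of the edges whose tester is fault-free, and sums the probabilities of the compatible ones.

```python
from itertools import combinations, product

def is_compatible(edges, syndrome, faulty):
    """Definition 7: every test performed by a fault-free tester must be truthful."""
    for (u, v) in edges:
        if u not in faulty:
            truth = 1 if v in faulty else 0
            if syndrome[(u, v)] != truth:
                return False
    return True

def fault_set_probability(n, faulty, p):
    """p^|F| (1-p)^(n-|F|): each processor fails independently with probability p."""
    return p ** len(faulty) * (1 - p) ** (n - len(faulty))

def total_probability(nodes, edges, p):
    """Sum P_G(B) over all basic events B (hypothetical helper for illustration)."""
    n = len(nodes)
    total = 0.0
    for k in range(n + 1):
        for faulty in map(frozenset, combinations(nodes, k)):
            # Edges whose tester is fault-free are the ones that define a basic event.
            reliable = [e for e in edges if e[0] not in faulty]
            for labels in product((0, 1), repeat=len(reliable)):
                partial = dict(zip(reliable, labels))
                # Outcomes of tests by faulty testers do not affect compatibility,
                # so any value (here 0) can stand in for them.
                full = {e: partial.get(e, 0) for e in edges}
                if is_compatible(edges, full, faulty):
                    total += fault_set_probability(n, faulty, p)
    return total
```

For each fault set exactly one labeling of the reliable edges is compatible, which is why the sum collapses to the sum over fault sets and equals one.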

For a digraph G(U, E), let $P'_G$ be a probability function defined on $\Omega_{G(U,E)}$ such that the family of events is equal to all subsets of $\Omega_{G(U,E)}$ and $\forall B \in \mathcal{B}_{G(U,E)}$, $P'_G(B) = P_G(B)$. Such a probability function will be referred to as a refinement of $P_G$. Now, let $\mathcal{P}_G$ represent the set of all refinements of $P_G$. Since any type of behavior of the faulty processors is allowed in this model, the probability of correct diagnosis for a deterministic algorithm A in a digraph G(U, E), denoted by $DiagProb_G(A)$, is defined to be

$$DiagProb_G(A) = \min_{P'_G \in \mathcal{P}_G} P'_G(Correct_G(A)) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in Correct_G(A)} P'_G((S, F))$$

Thus, when calculating the probability of correct diagnosis for an algorithm it is assumed that the faulty processors perform their tests in the manner most detrimental to the algorithm.

We may also define this diagnosis probability for probabilistic diagnosis algorithms. Given a syndrome S, a probabilistic diagnosis algorithm A chooses a fault set F with some probability, call it $p_{A,S}(F)$, where $\sum_{F \subseteq U} p_{A,S}(F) = 1$. Thus, for a digraph G(U, E) and a probabilistic diagnosis algorithm A, the probability of correct diagnosis for Algorithm A is defined to be

$$DiagProb_G(A) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in \Omega_{G(U,E)}} p_{A,S}(F) \, P'_G((S, F)).$$

4 Diagnosis Using n - 1 Tests

In [18], an efficient diagnosis algorithm that achieves correct diagnosis with probability approaching one in sequences of digraphs containing cn log n edges, for $c > 1/\log(1/p)$, was presented. It was also claimed in [18] that all diagnosis algorithms must have a probability of correct diagnosis that approaches zero for digraphs containing o(n log n) edges. In this section, a sequence of digraphs containing n - 1 edges is exhibited for which a simple diagnosis algorithm can achieve correct diagnosis with constant probability, thereby providing a counter-example to this claim.

Consider a sequence of digraphs $G_n(U_n, E_n)$ with $U_n = \{u_1, \ldots, u_n\}$ and $E_n$ defined as follows:

$$E_n = \{(u_1, u_2), (u_1, u_3), \ldots, (u_1, u_{n-1}), (u_1, u_n)\},$$

i.e. $u_1$ tests all other processors. Now, consider the following simple diagnosis algorithm.

Algorithm Naive
Input: A syndrome S in a digraph G(U, E).
Output: A set F ⊆ U.

  F ← ∅
  for each v ∈ {u_2, u_3, ..., u_n}
    if S((u_1, v)) = 1 then F ← F ∪ {v}

Algorithm Naive simply assumes that $u_1$ is fault-free and diagnoses a processor as faulty if and only if it is failed by $u_1$. Clearly, if $u_1$ is faulty, Algorithm Naive incorrectly diagnoses $u_1$ itself. If $u_1$ is fault-free, however, Algorithm Naive produces correct diagnosis. Thus, $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$P'_{G_n}(Correct_{G_n}(Naive)) = P'_{G_n}(\{(S, F) : u_1 \text{ is fault-free}\}) = 1 - p$$

and therefore $DiagProb_{G_n}(Naive) = 1 - p$. Thus, this simple diagnosis algorithm produces correct diagnosis with constant probability in a sequence of digraphs containing exactly n - 1 edges.

5 A Majority-Vote Diagnosis Algorithm

In this section, a simple yet powerful diagnosis algorithm known as Algorithm Majority is presented. In Algorithm Majority, a processor is diagnosed as faulty if and only if it is failed by more than 1/2 the processors in its tester set.

Algorithm Majority
Input: A syndrome S in a digraph G(U, E).
Output: A set F ⊆ U.

  F ← ∅
  for each u ∈ U
    if |fail_in(u)| > |Γ^{-1}(u)| / 2 then F ← F ∪ {u}

Theorem 1 For a digraph G(U, E), Algorithm Majority has a time complexity of O(|E|) and a space complexity of O(|E|).

Proof: The cardinalities of the tester set and the failure set of a given processor can be calculated in a single traversal of the adjacency lists for the digraph. This requires O(|E|) time. Hence, the time complexity of Algorithm Majority is O(|E|). The only space requirement, aside from the storage of the labeled digraph given as input and of the output, is a set of temporary variables to hold these cardinalities as they are calculated. Hence, the space complexity is also O(|E|). ∎

It should be noted that Algorithm Majority is slightly more sophisticated than Algorithm Naive. Rather than blindly believing the outcomes of the tests conducted by a single processor, it relies on a majority vote among the testers of a given processor. For the special class of systems in which one processor tests every other processor and no other tests are conducted, however, Algorithms Naive and Majority are equivalent.
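The two algorithms are short enough to run directly. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `naive`, `majority` and `diag_prob`, and the representation of a syndrome as a dictionary keyed by edges, are hypothetical. The helper `diag_prob` relies on one reading of the definitions: because an adversarial refinement may place all of a basic event's probability on any single syndrome consistent with the fault set, a fault set contributes its probability only if the algorithm is correct for every compatible syndrome. The enumeration is exponential and is meant only for very small digraphs.

```python
from itertools import combinations, product

def naive(nodes, edges, syndrome, root):
    """Trust the single tester `root`: faulty iff failed by root."""
    return frozenset(v for v in nodes
                     if v != root and syndrome.get((root, v)) == 1)

def majority(nodes, edges, syndrome):
    """Faulty iff failed by more than half of the tester set (one pass over E)."""
    testers = {u: 0 for u in nodes}
    fails = {u: 0 for u in nodes}
    for (v, u) in edges:  # edge (v, u) means v tests u
        testers[u] += 1
        fails[u] += syndrome[(v, u)]
    return frozenset(u for u in nodes if 2 * fails[u] > testers[u])

def diag_prob(nodes, edges, algorithm, p):
    """Worst-case probability of correct diagnosis by brute force (assumed reading)."""
    n = len(nodes)
    total = 0.0
    for k in range(n + 1):
        for faulty in map(frozenset, combinations(nodes, k)):
            # Tests by fault-free testers are truthful; the rest are adversarial.
            truthful = {e: int(e[1] in faulty) for e in edges if e[0] not in faulty}
            free = [e for e in edges if e[0] in faulty]
            always_correct = True
            for labels in product((0, 1), repeat=len(free)):
                syndrome = dict(truthful)
                syndrome.update(zip(free, labels))
                if algorithm(nodes, edges, syndrome) != faulty:
                    always_correct = False
                    break
            if always_correct:
                total += p ** k * (1 - p) ** (n - k)
    return total

# Star digraph on four processors: processor 1 tests everyone else.
star_nodes = [1, 2, 3, 4]
star_edges = [(1, v) for v in star_nodes if v != 1]
naive_from_1 = lambda nodes, edges, s: naive(nodes, edges, s, root=1)
```

On the star digraph both algorithms are correct exactly when processor 1 is fault-free, so `diag_prob` evaluates to 1 - p for either of them, in line with the analysis of Algorithm Naive.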

6 Diagnosis in Sparse Systems

In this section, we examine the problem of correctly diagnosing multiprocessor systems having sparse communication networks. First, it is shown that for a class of irregularly structured systems utilizing a number of tests growing just faster than n, Algorithm Majority correctly diagnoses every processor with probability approaching one. Next, the probability of correct diagnosis of Algorithm Majority is evaluated on some fixed systems which utilize a modest number of tests. Finally, it is proven that a linear number of tests are required for any diagnosis algorithm to be capable of producing correct diagnosis with high probability.

6.1 An Upper Bound on the Number of Tests Necessary for Correct Diagnosis

Consider a class of systems in which there is a set of processors known as the testers. The systems are such that any processor which is a tester tests all other processors in the system (including the other testers). Any processor that is not a tester conducts no tests. Thus, a (small) fraction of the processors are relied upon to satisfy all the testing requirements of the system. Such a digraph will be referred to as a tester digraph, formally defined below.

Definition 8 A digraph G(U, E) is said to be a tester digraph if and only if $\exists T_G \subseteq U$ such that

$$E = \{(u, v) : u \in T_G, v \in U, \text{ and } u \neq v\}.$$

The set $T_G$ is known as the testing set of G.

Figure 1 is an example of a tester digraph with 3 testers and 8 vertices.

For a tester digraph G(U, E) with testing set $T_G$, let

$$GoodMaj_G = \{(S, F) : |T_G \cap (U - F)| > |T_G|/2 \text{ and } (S, F) \text{ is compatible}\}$$

Thus, $GoodMaj_G$ represents the set of compatible syndrome, fault set pairs in which more than 1/2 the testers are fault-free. The following lemma shows that if the majority of testers in a tester digraph are fault-free, then the diagnosis of Algorithm Majority will be correct.

Lemma 1 For a tester digraph G(U, E), $GoodMaj_G \subseteq Correct_G(Majority)$.

Figure 1: A Tester Digraph

Proof: We will show that if $(S, F) \in GoodMaj_G$, then $(S, F) \in Correct_G(Majority)$ and therefore, $GoodMaj_G \subseteq Correct_G(Majority)$. Consider any $(S, F) \in GoodMaj_G$ and any $u \in U$. Recall that $Faulty_{Majority}(S)$ is the set of processors diagnosed as faulty by Algorithm Majority when run on S.

case 1: $u \in (U - T_G) \cap (U - F)$
Because (S, F) is compatible, u must be passed by more than 1/2 the testers, implying that $u \notin Faulty_{Majority}(S)$.

case 2: $u \in (U - T_G) \cap F$
Similarly, u must be failed by more than 1/2 the testers, implying $u \in Faulty_{Majority}(S)$.

case 3: $u \in T_G \cap (U - F)$
Here, u can be failed by at most 1/2 the remaining testers. Since Algorithm Majority diagnoses a unit as faulty only when it is failed by a strict majority of its tester set, $u \notin Faulty_{Majority}(S)$.

case 4: $u \in T_G \cap F$
In this case, u must be failed by more than 1/2 the remaining testers, implying $u \in Faulty_{Majority}(S)$.

Hence, $Faulty_{Majority}(S) = F$ and therefore $(S, F) \in Correct_G(Majority)$. ∎

Thus, if more than 1/2 the testers in a tester digraph are fault-free, Algorithm Majority produces correct diagnosis. Theorem 2 shows that if the number of testers is given by any unbounded function, this condition will be achieved with probability approaching one and hence the probability of correct diagnosis for Algorithm Majority approaches one.

Theorem 2 Let $\omega(n)$ be any unbounded function. If p < 1/2, then for any sequence of tester digraphs on n vertices having $\omega(n)$ testers, the probability of correct diagnosis for Algorithm Majority approaches one as $n \to \infty$.

Proof: We must show that for any sequence $G_n$ satisfying the theorem condition, $DiagProb_{G_n}(Majority) \to 1$ as $n \to \infty$. If we let X be a random variable representing the number of faulty units in the testing set of a tester digraph G, then

$$GoodMaj_G = \{(S, F) : X < |T_G|/2 \text{ and } (S, F) \text{ is compatible}\}$$

Now, X is a binomial random variable with parameters $|T_G|$ and p. It follows from Lemma 1 that $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$\begin{aligned}
P'_{G_n}(Correct_{G_n}(Majority)) &\ge P'_{G_n}(GoodMaj_{G_n}) \\
&= P'_{G_n}(\{(S, F) : X < |T_{G_n}|/2\}) \\
&= 1 - P'_{G_n}(\{(S, F) : X \ge |T_{G_n}|/2\}) \\
&= 1 - P'_{G_n}\left(\left\{(S, F) : \frac{X}{|T_{G_n}|} - p \ge \frac{1}{2} - p\right\}\right)
\end{aligned}$$

Now, since p < 1/2, $\frac{1}{2} - p > 0$, and by the Weak Law of Large Numbers [9],

$$P'_{G_n}(Correct_{G_n}(Majority)) \to 1$$

and therefore

$$DiagProb_{G_n}(Majority) \to 1. \qquad ∎$$

Thus, Algorithm Majority produces correct diagnosis with probability approaching one in a class of digraphs containing a number of edges given by $n \cdot \omega(n)$, where $\omega(n)$ is any function that goes to infinity (arbitrarily slowly) with n. Under a bounded-size fault set model a quadratic number of tests are required to withstand a linear number of faults, while this result shows that in this probabilistic model a linear expected number of faults can be tolerated with a number of tests that is arbitrarily close to linear. The maximum degree of the vertices in this class of digraphs is large, however, which may be a problem in some applications. This motivates us to study the problem of diagnosis in sparse regular systems in Section 7.

6.2 Performance of Algorithm Majority on Fixed Systems

In this section, the number of tests required to achieve a given probability of correct diagnosis in tester digraphs using Algorithm Majority is examined. For a tester digraph G(U, E) with testing set $T_G$, Lemma 1 and the binomial distribution of the number of faulty testers give

$$DiagProb_G(Majority) \ge \sum_{i=0}^{\lceil |T_G|/2 \rceil - 1} \binom{|T_G|}{i} p^i (1 - p)^{|T_G| - i} \qquad (1)$$

Note that the probability of correct diagnosis depends only on the testing set cardinality and not on n. For a given probability of failure, Inequality 1 can be used to determine the number of testers needed for Algorithm Majority to achieve a specified probability of correct diagnosis. The size of the testing set required to achieve a correct diagnosis probability of 0.99999 for various values of p is shown in Table 1.

Table 1: Size of Testing Set Required for Correct Diagnosis Probability of 0.99999

  p        |T_G|
  0.001    3
  0.005    5
  0.010    5
  0.050    11
  0.100    19
  0.200    41
  0.300    105

If the probability of failure of a processor is 0.001, Algorithm Majority can achieve correct diagnosis with a probability of 0.99999 using three tests per processor regardless of the number of processors in the system. For a probability of failure of 0.005 or 0.010 the tester set need only be of cardinality five for Algorithm Majority to achieve a probability of correct diagnosis of 0.99999.
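The testing-set sizes discussed above can be explored numerically. The sketch below is a hedged illustration: it assumes the bound in question is the probability that fewer than half of the testers are faulty, with the number of faulty testers binomially distributed as in the proof of Theorem 2, and the function names (`majority_lower_bound`, `min_testers`) and the search limit are hypothetical choices made here.

```python
from math import comb

def majority_lower_bound(num_testers, p):
    """P(X < num_testers / 2) for X ~ Binomial(num_testers, p)."""
    return sum(comb(num_testers, i) * p ** i * (1 - p) ** (num_testers - i)
               for i in range(num_testers + 1)
               if 2 * i < num_testers)

def min_testers(p, target, limit=1000):
    """Smallest testing-set size whose bound reaches `target`, or None if not found."""
    for t in range(1, limit + 1):
        if majority_lower_bound(t, p) >= target:
            return t
    return None

if __name__ == "__main__":
    for p in (0.001, 0.005, 0.010):
        print(p, min_testers(p, 0.99999))
```

Because the bound depends only on the testing-set size, the total number of tests in a tester digraph on n processors is that size multiplied by n - 1, which stays proportional to n for a fixed p.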

Thus, when the probability of failure is small, correct diagnosis can be achieved with extremely high probability using a total number of tests that is near n. When p is larger, more tests are required. As indicated in Table 1, for a probability of failure of 0.300, more than 100 tests per processor are required to achieve a probability of correct diagnosis of 0.99999. Since a large fraction of the processors in the system will be faulty in this situation, it is to be expected that a larger number of tests are necessary. The important point is that the total number of tests required remains proportional to n regardless of the value of p.

Table 2: Total Number of Tests Necessary for Correct Diagnosis Probability of 0.99

  n         p      Bounded-size    Probabilistic
  100       0.01   400             99
  100       0.10   1800            495
  100       0.30   4100            3069
  1000      0.01   18,000          999
  1000      0.10   123,000         4995
  1000      0.30   334,000         30,969
  10,000    0.01   1,240,000       9999
  10,000    0.10   10,700,000      49,995
  10,000    0.30   31,070,000      309,969

In Table 2, we compare the number of tests required under the bounded-size fault set model to the number of tests required by Algorithm Majority under the probabilistic model in order to achieve a correct diagnosis probability of 0.99. The number of tests required under the bounded-size fault set model was calculated in the following manner. For a given n and p, determine t such that the probability of more than t out of n processors being faulty is no greater than 0.01. Table 2 shows the results of this comparison for various values of n and p. For large n and small p the number of tests required under the probabilistic model is dramatically lower than the number required under the bounded-size fault set model. For example, when n = 10,000 and p = 0.10, the number of tests required is reduced by a factor of 214 over the bounded-size fault set model.

6.3 A Lower Bound on the Number of Tests Necessary for Correct Diagnosis

In this section, a lower bound on the number of tests necessary to achieve correct diagnosis with high probability is proven. It is shown that if the number of edges in an arbitrary sequence of digraphs grows slower than n, then all diagnosis algorithms have probability approaching zero of achieving correct diagnosis. This result implies that Algorithm Majority achieves a probability approaching one of correct diagnosis on systems that are very nearly as sparse as possible. Thus, this relatively simple diagnosis algorithm is indeed extremely powerful.

When the number of edges in a sequence of digraphs grows slower than n, isolated processors, i.e. processors which have no incident edges, must exist. Intuitively, no diagnosis algorithm should be capable of correctly identifying the state of all these isolated processors with high probability, making diagnosis in such situations impossible. This is formally proven in Theorem 3.

The essence of the proof of Theorem 3 can be explained quite simply. To prove that a deterministic diagnosis algorithm A has a probability approaching zero of achieving correct diagnosis in a sequence of digraphs $G_n(U_n, E_n)$, a set of (S, F) pairs disjoint from $Correct_{G_n}(A)$ must be exhibited that has a probability dominating the probability of $Correct_{G_n}(A)$. For a given syndrome from a system with isolated processors, it can be shown that so long as the number of isolated processors approaches infinity, the probability of that syndrome and a fault set with a particular labeling of the isolated processors is dominated by the probability of that syndrome and the fault sets in which the isolated processors are relabeled in all possible ways. Thus, for any $(S, F) \in Correct_{G_n}(A)$, a set of syndrome, fault set pairs disjoint from $Correct_{G_n}(A)$ can be exhibited that has probability dominating the probability of (S, F). It is also shown that there exists a deterministic diagnosis algorithm that has performance at least as good as the performance of any probabilistic algorithm, thus completing the proof.

Theorem 3 Let A be any probabilistic or deterministic diagnosis algorithm. If 0 < p < 1, then for any sequence of digraphs on n vertices having o(n) edges, the probability of correct diagnosis for Algorithm A approaches zero as $n \to \infty$.

Proof:

We must

A and any sequence

DiagProba,(A test.

show that

of digraphs

G,(Un,

) --* 0 as n --* o_. Assume

This yields

P_,, ((S,F))

for any probabilistic

=

a refinement

Pb,, 6 ?a.,

or deterministic

E,)

faulty

having

processors

diagnosis

the

algo-

IE,[ 6 o(n), pass

all processors

they

where

0plF](1 if (S,F) is incompatible _ p)n-IFI otherwise

15

or 3u 6 F,v 6 U with S((u,v))

= 1

Now, let $\mathrm{ISO}_{G_n} \subseteq U_n$ represent the set of isolated processors, i.e. processors which have no incident edges, in $G_n(U_n, E_n)$. Clearly, $|\mathrm{ISO}_{G_n}| \ge n - 2|E_n| \to \infty$.
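The counting step above can be checked on a small example. The following is a minimal sketch, not part of the original proof; the function name `isolated_processors` and the edge-list representation of the digraph are assumptions made for illustration. Each edge touches at most two vertices, so at least $n - 2|E_n|$ vertices are left with no incident edge.

```python
def isolated_processors(n, edges):
    """Return the vertices of range(n) that have no incident edge.

    `edges` is a list of (tester, tested) pairs; direction does not matter
    here because a vertex is isolated only if it appears in no edge at all.
    """
    touched = set()
    for u, v in edges:
        touched.add(u)
        touched.add(v)
    return set(range(n)) - touched


# Hypothetical 10-processor system with only three tests.
n = 10
edges = [(0, 1), (1, 2), (3, 0)]
iso = isolated_processors(n, edges)

# The bound used in the proof: |ISO| >= n - 2|E|.
assert len(iso) >= n - 2 * len(edges)
```

In this example four processors are touched by the three tests, so six are isolated, which is above the guaranteed minimum of four.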

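For a syndrome, fault set pair $(S, F) \in \mathrm{Correct}_{G_n}(A)$ let

$$
\mathrm{Relabel}_{(S,F)} = \{(S', F') : S' = S,\ F' \neq F,\ \text{and}\ F - \mathrm{ISO}_{G_n} = F' - \mathrm{ISO}_{G_n}\}
$$

and let $\mathrm{AllLabel}_{(S,F)} = \mathrm{Relabel}_{(S,F)} \cup \{(S, F)\}$. Thus, $\mathrm{Relabel}_{(S,F)}$ consists of the syndrome, fault set pairs in which the processors of $\mathrm{ISO}_{G_n}$ are relabeled in all possible ways. Clearly,

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{Relabel}_{(S,F)}) = \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \left[ P'_{G_n}(\mathrm{AllLabel}_{(S,F)}) - P'_{G_n}((S,F)) \right]
$$

and since all processors in the set $\mathrm{ISO}_{G_n}$ are isolated,

$$
P'_{G_n}((S,F)) = p^{|\mathrm{ISO}_{G_n} \cap F|} (1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|} \, P'_{G_n}(\mathrm{AllLabel}_{(S,F)}).
$$

Therefore,

$$
\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{AllLabel}_{(S,F)}) = \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \frac{P'_{G_n}((S,F))}{p^{|\mathrm{ISO}_{G_n} \cap F|} (1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|}} \ge \frac{\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F))}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}}
$$

and thus

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \left( \frac{1}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} - 1 \right) \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F)).
$$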

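Therefore,

$$
P'_{G_n}(\mathrm{Correct}_{G_n}(A)) \le \frac{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}}{1 - [\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} \to 0
$$

as $n \to \infty$. Thus, for any deterministic diagnosis algorithm $A$, $\mathrm{DiagProb}_{G_n}(A) \to 0$. Since some deterministic diagnosis algorithm performs at least as well as any probabilistic algorithm, the same holds for any probabilistic diagnosis algorithm $A$.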
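To give a feel for how quickly this bound collapses, the following minimal sketch evaluates $m/(1-m)$ with $m = [\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}$ for a given number of isolated processors. The function name and the choice to cap the vacuous case at 1 are assumptions made for illustration only.

```python
def correct_diagnosis_upper_bound(p, num_isolated):
    """Evaluate the bound m / (1 - m), m = max(p, 1 - p) ** num_isolated.

    The value is capped at 1.0 because the expression exceeds 1 (and is
    therefore uninformative as a probability bound) whenever m >= 1/2.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    m = max(p, 1.0 - p) ** num_isolated
    if m >= 0.5:
        return 1.0
    return m / (1.0 - m)


# With p = 0.02 the bound decays slowly at first, then geometrically.
for iso in (10, 100, 1000):
    print(iso, correct_diagnosis_upper_bound(0.02, iso))
```

Because $\max(p, 1-p) < 1$ whenever $0 < p < 1$, the bound tends to zero as the number of isolated processors grows, however close $p$ is to 0 or 1.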


ap-

_.

The

r

systems

from

of systems which

This

class

not

contained

The

systems

conduct

is tested

includes

regular

4 shows

sequent needed.

results,

achieves

correct

contains

many

in the

at

with

Dl,¢log_

class.

in this

section

least

O(nlog

sequence

following

Let Y

be a binomial

-

e.g.

for

to a theorem

variable

pi(1-

with

which

processor

c sufficiently large Majority

In order

to prove

proved

in the

large. degree.

parameters

will produce this

and

n and

O
1.

Most of the previous work in the diagnosis area has utilized a bounded-size fault set model where it is assumed that no more than $t$ faults occur in the system. A system is said to be $t$-diagnosable if any combination of $t$ faulty units in the system can be uniquely diagnosed. It is well known that a $k$-dimensional hypercube is $k$-diagnosable but not $(k + 1)$-diagnosable. Since $k = \log_2 n$, where $n$ is the number of vertices of the cube, the assumptions of the bounded-size fault set model are satisfied only when the number of faults is less than or equal to the logarithm of the number of units. It is unlikely that this condition will be met in large systems. Under the probabilistic model, however, a number of faults that is linear in the number of units can be tolerated.

Table 3 illustrates the difference in diagnosis performance between the two models on hypercube systems for probabilities of failure of 0.002 and 0.020. The fourth column of this table lists the expected number of faulty processors for the corresponding system and failure probability. $P_k$ represents the probability that no more than $k$ units are faulty in a $k$-dimensional, $n$-node hypercube. Since diagnosis algorithms proposed for the bounded-size fault set model can only guarantee correct diagnosis when the number of faults is less than or equal to $k$, $P_k$ is an estimate for the probability of correct diagnosis for those algorithms. $P_{Maj}$ represents a lower bound on the probability of correct diagnosis for Algorithm Majority.

  k          n      p    Exp #faulty     P_k    P_Maj
  6         64  0.002           0.13  1.0000   1.0000
  6         64  0.020           1.28  0.9997   0.9999
  8        256  0.002           0.51  1.0000   1.0000
  8        256  0.020           5.12  0.9258   1.0000
 10       1024  0.002           2.05  1.0000   1.0000
 10       1024  0.020          20.48  0.0079   1.0000
 12       4096  0.002           8.19  0.9267   1.0000
 12       4096  0.020          81.92  0.0000   1.0000
 14      16384  0.002          32.77  0.0002   1.0000
 14      16384  0.020         327.68  0.0000   1.0000
 16      65536  0.002         131.07  0.0000   1.0000
 16      65536  0.020        1310.72  0.0000   1.0000
 20    1048576  0.002        2097.15  0.0000   1.0000
 20    1048576  0.020       20971.52  0.0000   1.0000

Table 3: Diagnosis probability on hypercubes

It can be seen from Table 3 that the performance of the bounded-size fault set model degrades rapidly as the number of processors increases. Under the probabilistic model, however, Algorithm Majority produces correct diagnosis with probability very nearly one for all of the hypercubes studied, even when the probability of failure of a processor is as large as 0.02. Consider the case where $k = 16$ and the probability of failure is 0.02. In this situation, the number of processors is 65,536. The expected number of faults is greater than 1300, and yet Algorithm Majority still produces the correct diagnosis with a probability that is very nearly one. Under the bounded-size fault set model, the number of faults is limited to 16 for this situation. While this number of processors may seem large, a system containing this many processors, namely the Connection Machine [11], has been built.
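The "Exp #faulty" and $P_k$ columns of Table 3 follow directly from the binomial fault model and can be recomputed. The sketch below is an illustration under the assumption that each of the $n = 2^k$ processors fails independently with probability $p$; the function names are hypothetical, and the $P_{Maj}$ column is not reproduced because it depends on bounds stated elsewhere in the paper. Terms are summed in log space so that the large systems do not overflow.

```python
from math import exp, lgamma, log, log1p


def expected_faults(n, p):
    """Expected number of faulty processors among n independent units."""
    return n * p


def prob_at_most_k_faults(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term in log space."""
    total = 0.0
    for i in range(k + 1):
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        log_term = log_choose + i * log(p) + (n - i) * log1p(-p)
        total += exp(log_term)
    return min(total, 1.0)


# Recompute two columns of the table for each hypercube dimension k.
for k in (6, 8, 10, 12, 14, 16, 20):
    n = 2 ** k
    for p in (0.002, 0.020):
        print(k, n, p, round(expected_faults(n, p), 2),
              round(prob_at_most_k_faults(n, p, k), 4))
```

The recomputed values show the same trend as the table: the expected number of faults grows linearly in $n$ while the tolerated number $k$ grows only logarithmically, so $P_k$ drops to essentially zero for the larger cubes.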

7.3 Lower Bound

While hypercubes are an important class of system, systems with even fewer connections are expected to see increased use in future multiprocessor applications. We are therefore interested in determining a lower bound on the total number of tests necessary to achieve correct diagnosis with high probability. Such a lower bound was proven in [2] for regular systems. This result states that all diagnosis algorithms must have a probability of correct diagnosis that approaches zero in regular systems with $o(n \log n)$ tests. The more general probability model used there contains the model utilized in this paper as a special case, and hence this result holds for this model as well. Thus, for the important class of regular systems, the algorithm given in [18] and Algorithm Majority are both optimal to within a constant factor. This result also demonstrates that the irregular structure of the tester digraphs studied in this paper is a crucial factor in making them amenable to diagnosis.

Of special interest due to their widespread use are multiprocessor systems which are regular and of fixed degree. Included in this class of systems are rings, tori, and hexagonal meshes. This somewhat pessimistic result implies that weaker forms of diagnosis must be considered for these systems.

8 Diagnosis using a Linear Number of Tests

It has been shown that Algorithm Majority can achieve correct diagnosis with probability approaching one in digraphs containing $n\omega(n)$ edges, while all algorithms must have probability approaching zero of correct diagnosis in digraphs possessing $o(n)$ edges. These results leave open the question of what can be achieved using $cn$ edges, for some positive constant $c$. In this section, it is shown that with $cn$ edges Algorithm Majority can achieve a probability of correct diagnosis that is a constant arbitrarily close to one. It is also shown that a constant probability less than one is the best that any algorithm can hope to achieve for digraphs with a linear number of edges, meaning that Algorithm Majority is optimal in this situation.

The following theorem characterizes the performance of Algorithm Majority on digraphs with a linear number of edges.

Theorem 5 Let $\epsilon$ be any real number such that $0 < \epsilon < 1$. Then there exists a constant $c$ such that, for all sufficiently large tester digraphs $G_n$ with $|T_{G_n}| \ge c$, the probability of correct diagnosis for Algorithm Majority is at least $1 - \epsilon$.

Proof: We must show that, $\forall \epsilon > 0$, $\exists c > 0, n_0$ such that if $G_n(U_n, E_n)$ is a sequence of digraphs with $|T_{G_n}| \ge c$, then $\forall n \ge n_0$, $\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \ge 1 - \epsilon$.

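Let $a = \frac{1}{2(1-p)} < 1$. Then, $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$
P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge 1 - \left[ e^{-(1-a)^2/2} \right]^{(1-p)c}
$$

by Corollary 1. Now, if $c$ is chosen such that

$$
c > \frac{-2 \ln \epsilon}{(1-a)^2 (1-p)}
$$

then

$$
P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge 1 - e^{\ln \epsilon} = 1 - \epsilon.
$$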

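Thus, Algorithm Majority can achieve correct diagnosis with probability arbitrarily close to one in sequences of digraphs having a linear number of edges. The following theorem shows that all diagnosis algorithms must have a probability of correct diagnosis that is bounded away from one by a positive constant in this situation.

Theorem 6 Let $c$ be any positive constant. If $0 < p < 1$, then there exists $\epsilon > 0$ such that for any probabilistic or deterministic diagnosis algorithm $A$ and any sufficiently large digraph on $n$ vertices having no more than $cn$ edges, the probability of correct diagnosis for Algorithm $A$ is at most $1 - \epsilon$.

Proof: We must show that for any $c > 0$, $\exists \epsilon > 0, n_0$ such that if $G_n(U_n, E_n)$ is a sequence of digraphs with $|E_n| \le cn$, then $\forall n \ge n_0$, $\mathrm{DiagProb}_{G_n}(A) \le 1 - \epsilon$. Let $P'_{G_n} \in \mathcal{P}_{G_n}$ be such that faulty processors fail all other processors. Now, let $u_{\min_{G_n}} \in U_n$ be any vertex of $G_n$ such that $\forall u \in U_n$, $|N(u_{\min_{G_n}})| \le |N(u)|$. Thus, $u_{\min_{G_n}}$ is a processor having a minimum size neighbor set in $G_n$. Clearly, $|N(u_{\min_{G_n}})| \le 2c$. Let $\mathrm{Surr}_{G_n}$ denote the set of syndrome, fault set pairs in which every processor in $N(u_{\min_{G_n}})$ is faulty, so that $P'_{G_n}(\mathrm{Surr}_{G_n}) \ge p^{2c}$. Then,

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) P'_{G_n}(\mathrm{Correct}_{G_n}(A) \cap \mathrm{Surr}_{G_n}) \ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) \left[ p^{2c} - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \right]
$$

or

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \left[ 1 + \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) \right] \ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) p^{2c}
$$

and

$$
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \ge \frac{\min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) p^{2c}}{1 + \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right)} = \epsilon > 0
$$

so long as $0 < p < 1$. Now, consider any probabilistic diagnosis algorithm $A$. Then, $\forall P_{G_n}$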

6 PG. DiagPr°ba.

Consider

the deterministic

F such that

(A)