Determining Plurality

Laurent Alonso    Edward M. Reingold†

February 14, 2006

Abstract. Given a set of $n$ elements, each of which is colored one of $c$ colors, we must determine an element of the plurality (most frequently occurring) color by pairwise equal/unequal color comparisons of elements. We prove that $(c-1)(n-c)/2$ color comparisons are necessary in the worst case to determine the plurality color and give an algorithm requiring $(0.775c + 3.6)n + O(c^2)$ color comparisons for $c \ge 9$.

Key words. Algorithm analysis, plurality problem, majority problem

AMS(MOS) subject classifications. 68Q25, 68P10, 68Q20, 68M15

1 Introduction

The simplest case of determining plurality is the majority problem: determine an element of the majority color in a set of $n$ elements, each of which is colored red or blue, using pairwise equal/unequal color comparisons of elements. This problem arose implicitly in connection with fault diagnosis: how can we find a working component among $n$ components, assuming that the majority of the components are working and that any component can test any other [13]; the close connection with the majority problem was shown in [5]. The majority problem was posed explicitly by Moore [12] as the determination of a majority vote (color) among $n$ processors by paired comparisons.¹ [12] did not specify the range of votes the processors could give (that is, the number of colors), but in the published solution [9], an unknown number of possible votes was assumed, and $\lceil 3n/2 \rceil - 2$ comparisons were proved necessary and sufficient. Saks and Werman [14] solved the binary (two-color) problem by proving $n - \nu(n)$ questions are necessary

INRIA-Lorraine and LORIA, Université Henri Poincaré-Nancy I, BP 239, 54506 Vandoeuvre-lès-Nancy, France. Email: [email protected]
† Department of Computer Science, Illinois Institute of Technology, 10 West 31st Street, Chicago, Illinois 60616-2987, USA. Email: [email protected]
¹ It was posed in the problems column, edited by Leo Guibas, of the Journal of Algorithms. By a strange coincidence, the succeeding problem, by Blecher [6] (partial solution [17]), was to determine the reliable people among a mixed group of reliable and unreliable people; nobody remarked on the equality of the bounds in the two problems, $\lceil 3n/2 \rceil - 2$ and $\lfloor 3(n-1)/2 \rfloor$, respectively. The close connection between Moore’s and Blecher’s problems was shown in [5].



and sufficient for the majority problem, where, following [10], $\nu(n)$ is the number of 1-bits in the binary representation of $n$. A short, elementary proof was given in [3]; [16] presented yet a different proof. In [4], $2n/3 - \sqrt{8n/9\pi} + O(\log n)$ color comparisons were proved necessary and sufficient in the average case of the two-color problem, assuming all $2^n$ distinct colorings of the $n$ elements are equally probable. [11] proved similar bounds for randomized algorithms. In [1], a new wrinkle was added by exploring nonadaptive algorithms (that is, which comparisons to make cannot depend on the results of previous comparisons; the full sequence of questions is fixed a priori), showing that $2\lfloor n/2 \rfloor - 1$ comparisons are necessary and sufficient in this case. [7] gave a nonadaptive algorithm using $(27 + o(1))n$ color comparisons to determine the majority color when there are an unknown number of colors, but a majority is known to exist.

The plurality problem is the obvious generalization of the majority problem to $c$ colors. Posed in [3], the first results for $c > 2$ were for three colors: [2] proved that $3\lfloor n/2 \rfloor - 2$ color comparisons are necessary and that $\lfloor 5n/3 \rfloor - 2$ suffice; they proved in general that $cn/40$ color comparisons are necessary for $c$ colors, a lower bound improved to $2cn/27 - o(n)$ in [11]. [7] showed that $(c - 1 - 1/c)n - 2$ comparisons suffice, and that for nonadaptive strategies with $c = 3$, $n^2/6 - 3n/2$ comparisons are necessary to determine plurality. [8] studied the case $c = 3$ and proved that $3n/2 - O(\sqrt{n \log n})$ comparisons are necessary and $3n/2 + O(1)$ are sufficient when randomized algorithms are allowed; for $c$ colors, [11] proved a lower bound of $\Omega(cn)$ for randomized algorithms.

[8] also studied the partition problem in which the three color classes of the elements must be completely determined; they proved that $2n - 3$ comparisons are necessary and sufficient in the ordinary case and $(5n - 8)/3 + o(1)$ in the randomized case; for the ordinary case of the partition problem with $c$ colors and $n \ge c$, they proved that $(c-1)n - \binom{c}{2}$ comparisons are necessary and sufficient; [11] showed that $(c-1)(n-c)/4$ are necessary for randomized algorithms. With an unknown number of colors, $\binom{n-1}{2}$ comparisons are necessary and sufficient to determine a color that occurs at least as often as any other color (the “non-strict” plurality problem) [15].

In this paper we prove in Section 2 that $(c-1)(n-c)/2$ color comparisons are necessary to determine the plurality color. Then, in Section 3 we give an algorithm requiring $(0.775c + 3.6)n + O(c^2)$ color comparisons for $c \ge 9$. These new results substantially narrow the gap between the best lower and upper bounds previously known, $2cn/27 - o(n)$ [11] and $(c - 1 - 1/c)n - 2$ [7], respectively.
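To make two of the bounds quoted above concrete, here is a minimal sketch that evaluates the two-color bound $n - \nu(n)$ of [14] and the lower bound $(c-1)(n-c)/2$ stated for Section 2. The function names are hypothetical and chosen only for illustration; the formulas are taken from the text, not from any library.

```python
from fractions import Fraction


def majority_bound(n: int) -> int:
    """Two-color bound n - nu(n), where nu(n) is the number of 1-bits of n."""
    return n - bin(n).count("1")


def plurality_lower_bound(n: int, c: int) -> Fraction:
    """Lower bound (c - 1)(n - c) / 2 for c colors, kept exact as a fraction."""
    return Fraction((c - 1) * (n - c), 2)


if __name__ == "__main__":
    for n in (7, 8, 10):
        print(n, majority_bound(n))
    print(plurality_lower_bound(12, 3), plurality_lower_bound(11, 4))
```

The plurality bound is returned as an exact fraction because $(c-1)(n-c)$ can be odd; an integer number of comparisons would be its ceiling.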

2 The Lower Bound

To prove the lower bound, we use an adversary argument in which answers to color comparisons asked by the algorithm are given by the adversary so as to force the algorithm to ask approximately $(c-1)n/2$ questions; if the algorithm does not ask enough questions, it can be tricked by the adversary into giving the wrong result. The details of the adversary will differ slightly depending on $n \bmod c$, with the simplest case being when $n$ is an integer multiple of $c$.

We are given a set of elements $x_1, x_2, \ldots, x_n$ colored with (at most) $c$ colors, and a coloring that assigns to each element a color, $C(x_1), C(x_2), \ldots, C(x_n)$, respectively. A plurality algorithm must determine an element of the plurality color by asking questions “Is $C(x_i) = C(x_j)$?”. We will look at sets of colorings, $\mathcal{S}$; the specific sets of colorings will depend on $n \bmod c$, but the general idea is that a set of colorings $\mathcal{S}$ will be “balanced,” having equal numbers of each color (or as nearly so as possible). Failure to ask enough questions will make it possible for the adversary to show that the algorithm cannot correctly distinguish between two different pluralities.

We can view any algorithm as a binary decision tree in which the internal nodes specify color comparisons between elements and the leaves either specify an element determined to be of the plurality color or specify that there is no plurality color. For each node in the tree, there is a set of colorings consistent with the node: all colorings are consistent with the root because there have been no comparisons; all colorings consistent with a leaf have either identified a particular element $x_i$ as being of the plurality color, if there is a plurality, or have no plurality.

The adversary for set $\mathcal{S}$ of colorings behaves as follows: If prior answers together with an “unequal” answer to the present question are consistent with some member of $\mathcal{S}$, answer “unequal”; otherwise, answer “equal”. Because all colorings of elements are consistent with the root, the adversary forces the algorithm to a leaf (that is, outcome) consistent with some subset of colorings in $\mathcal{S}$. We choose a member of $\mathcal{S}$ compatible with the leaf and fix the colors of the elements accordingly. The adversary thus defines a specific coloring of the input elements that causes the algorithm to terminate at that leaf.

Lemma 1. If $n$ is a multiple of $c$, any algorithm correctly solving the plurality problem must make at least $(c-1)n/2$ color comparisons.

Proof.
Let $n = kc$, and let $\mathcal{S}$ be the set of all colorings in which there are $k$ elements of each of the $c$ colors. Consider the leaf defined by the adversary for $\mathcal{S}$. We claim that to reach this leaf, where the algorithm must announce that there is no plurality, there must have been at least $k$ “unequal” answers to comparisons between elements of color $A$ and elements of color $B$, for any two distinct colors $A$ and $B$.

Suppose not; in other words, suppose there are at most $k - 1$ “unequal” answers between elements of colors $A$ and $B$. Then there must be an element $a$ of color $A$ and an element $b$ of color $B$ that were never compared. If $a$ has no “equal” comparison to another element of color $A$, we can change the color of $a$ to $B$ and thus cause the algorithm’s answer “No plurality” at that leaf to be wrong. So, $a$ must have been compared to some other element of color $A$, resulting in an “equal” answer. Similarly, $b$ must have been involved in an “equal” comparison to another element of color $B$. But the adversary only answers “equal” if there are no colorings consistent with an “unequal” answer, and clearly in this case switching the colors of $a$ and $b$ gives us a coloring consistent with the “unequal” answer preferred by the adversary, contradicting our assumption that this is the leaf defined by the adversary.

Thus we must have at least $k = n/c$ comparisons between elements of colors $A$ and $B$, for any pair of different colors $A$ and $B$. There are $\binom{c}{2} = c(c-1)/2$ such pairs, giving a total of at least

$$k\binom{c}{2} = \frac{n}{c} \cdot \frac{c(c-1)}{2} = \frac{(c-1)n}{2}$$

color comparisons.

Lemma 2. If $n \equiv 1 \pmod{c}$ and $c \ge 2$, any algorithm correctly solving the plurality problem must make at least $(c-1)(n-1)/2$ color comparisons.

Proof. Let $n = kc + 1$, and let $\mathcal{S}$ be the set of all colorings in which there are $k + 1$ elements of one color and $k$ elements of each of the remaining $c - 1$ colors. Consider the leaf defined by the adversary for $\mathcal{S}$, and the coloring of the elements thus defined; call the plurality color, which has $k + 1$ elements, blue and the other $c - 1$ colors shades of red. Arguing as in Lemma 1, we claim that to reach this leaf, where the algorithm will produce a blue element representing the plurality, there must have been at least:

(i) $k$ “unequal” answers to comparisons between elements of any two different shades of red,

(ii) $k$ “unequal” answers to comparisons between blue elements and elements of any shade of red.

In case (i), there are elements of each of two shades of red that can be recolored, so the plurality color could be either blue or a shade of red, fooling the algorithm. Similarly, in case (ii), there are at least two blue elements that can be recolored, so the plurality can be either blue or a shade of red and there is no way for the algorithm to choose an element of plurality color. That is, if the algorithm selects a plurality element in a shade of red, it errs because the plurality color is blue; but if the algorithm selects a blue plurality element, the adversary recolors a blue element some shade of red, forcing the plurality to be that shade of red. Thus we must have at least

$$k\binom{c-1}{2} + k(c-1) = \frac{(c-1)(n-1)}{2}$$

color comparisons.

Lemma 3. If $n \equiv l \pmod{c}$, $l > 1$, and $c > 2$, any algorithm correctly solving the plurality problem must make at least $(c-1)(n-c+l)/2$ color comparisons.

Proof. Let $n = kc + l$, and let $\mathcal{S}$ be the set of all colorings in which there are $k + 1$ elements of each of $l$ colors and $k$ elements of each of the remaining $c - l$ colors. Consider the leaf defined by the adversary for $\mathcal{S}$, and the coloring of the elements thus defined; call the $l$ colors of which there are $k + 1$ elements shades of blue and the other $c - l$ colors shades of red. Arguing as in Lemma 1, we claim that to reach this leaf, where the algorithm will announce that there is no plurality, there must have been at least:

(i) $k + 1$ “unequal” answers to comparisons between elements of any two different shades of blue,

(ii) $k - 1$ “unequal” answers to comparisons between elements of any two different shades of red,

(iii) $k$ “unequal” answers to comparisons between elements of any shade of blue and elements of any shade of red.

Because $l > 1$ in case (i), fewer than $k + 1$ “unequal” answers mean that (because of the adversary’s strategy) there are elements of two different shades of blue that can be freely recolored by the adversary to fool the algorithm into an erroneous answer. In case (ii), there are pairs of elements of each of two shades of red that can be recolored, so the plurality color could be either a shade of blue or a shade of red, again fooling the algorithm. Similarly, in case (iii), the plurality can be either a shade of red or a shade of blue. Thus we must have at least

$$(k+1)\binom{l}{2} + (k-1)\binom{c-l}{2} + kl(c-l) = \frac{(c-1)(n-c+l)}{2}$$

color comparisons. 
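The closed forms in Lemmas 1 to 3 are plain counting identities, so they can be checked numerically. The sketch below assumes the per-pair counts read as $k$ (Lemma 1), $k$ and $k$ (Lemma 2), and $k+1$, $k-1$, $k$ (Lemma 3, cases (i) to (iii)); the function name `check_lemma_counts` and the search ranges are hypothetical choices for illustration.

```python
from math import comb


def check_lemma_counts(max_c: int = 12, max_k: int = 6) -> bool:
    """Verify the three counting identities for small c and k (doubled to stay integral)."""
    for c in range(2, max_c + 1):
        for k in range(1, max_k + 1):
            # Lemma 1: n = kc, k "unequal" answers per pair of colors.
            n = k * c
            assert 2 * k * comb(c, 2) == (c - 1) * n

            # Lemma 2: n = kc + 1, k answers per red/red pair and per blue/red pair.
            n = k * c + 1
            assert 2 * (k * comb(c - 1, 2) + k * (c - 1)) == (c - 1) * (n - 1)

            # Lemma 3: n = kc + l with 1 < l < c.
            for l in range(2, c):
                n = k * c + l
                total = ((k + 1) * comb(l, 2)
                         + (k - 1) * comb(c - l, 2)
                         + k * l * (c - l))
                assert 2 * total == (c - 1) * (n - c + l)
    return True


if __name__ == "__main__":
    print(check_lemma_counts())
```

This only confirms the arithmetic of the totals; it says nothing about the adversary argument that justifies each per-pair count.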



Because $c \ge 1$, $n - c \le n - 1$, and because $(c-1)(n-c+l)/2$ from Lemma 3 is minimized at $l = 0$, these three lemmas combine to give us

Theorem 1. At least $(c-1)(n-c)/2$ color comparisons are necessary to solve the $c$-color plurality problem.

3 The Upper Bound

In this section we present and analyze a plurality algorithm that requires $(0.775c + 3.6)n + O(c^2)$ color comparisons for $c \ge 9$. The algorithm works in three phases: The first phase processes a portion of the elements to partition them into color classes, after which the second phase identifies any “large” color classes in the unprocessed elements and merges them with previous classes; the second phase can leave some elements of undetermined color. In the third phase any elements of undetermined color from the second phase are processed; the result is a partition of the elements into color classes, with the possible omission of elements of color classes too small to affect the plurality. At the end of the third phase, a plurality element (if one exists) can be identified directly from the partition. We first describe the three phases in detail, then establish their correctness, and finally analyze the number of color comparisons used in the worst case. Our presentation of the algorithm uses some subtly chosen constants; we discuss these choices in the concluding subsection of this section.

We begin with a set of $n$ elements of (at most) $c$ colors.

Phase I. We process the elements one at a time, maintaining a set $R$ of as yet unprocessed elements and a partition $S$ of the processed elements into at most $c$ color classes of non-increasing size $S_1, S_2, \ldots$. The partitioning is done as follows: Start with no color classes and with the set $R$ of $n$ unprocessed elements. An element of $R$ is removed and compared to an element from the smallest color class, then with the second-smallest color class, and so on until there is a color match or we have exhausted the classes. In the former case the element is added to the class; in the latter case a new class is created. In both cases the color classes are reordered, if necessary, so they are in non-increasing size. This phase can end with all the elements having been processed ($|R| = 0$), or it can abort before completion if a certain condition (discussed below) is met.

Phase II. This phase processes the remaining (unprocessed) elements $R$ to determine any “large” color classes in $R$; the definition of “large” depends on how Phase I ended. Phase II takes the elements of $R$ and processes them one by one into color classes $R_1, R_2, \ldots$, just as in Phase I, but with an important difference: There is a limit $f$ on the number of classes allowed; if there are already $f$ classes and the next element processed is of a color not among those $f$, that element is discarded and each of the classes $R_1, R_2, \ldots, R_f$ has an element removed from it and discarded; then any empty classes are discarded and the non-empty classes are renumbered $R_1, R_2, \ldots$. This uses at most $f \cdot |R|$ color comparisons. When all elements of $R$ have been processed, we merge the $R_i$ into the $S_i$ using at most $c \cdot f$ color comparisons. We discard all “small” non-empty classes (discussed below), resulting in $\hat{c} \le c$ color classes. At this point the color classes are no longer in non-increasing size order.

Phase III. Elements discarded in Phase II are now compared to representatives of each of the undiscarded $\hat{c}$ color classes, added to the classes if they match, and permanently discarded if they do not match any of the classes; this uses at most $\hat{c}$ color comparisons per discarded element. The largest of these classes is of the plurality color. If there is a tie, there is no plurality.

To complete the description, we must specify the condition under which Phase I aborts, the limit $f$ on the number of color classes in Phase II, and the definition of “small” at the end of Phase II; naturally, these four constants are interdependent. We define

$$a = \lceil 0.225c \rceil, \qquad b = \lceil 0.45c \rceil, \qquad f = \lfloor 0.16c \rfloor, \qquad m = 8.5.$$

The values $a$ and $b$ define the extent of a portion of the subsets $S_i$ in the analysis of Phase I, that is, $S_{a+1}, \ldots, S_b$. $f$, as noted, is the limit on the number of classes $R_i$ in Phase II. $m$ is a multiplicative factor used in deciding whether Phase I must abort; specifically, Phase I aborts if there is a significant disparity between the sizes of $S_1$ and $S_b$, namely if

$$|S_1| - |S_b| \ge \frac{m|R|}{c}. \qquad (1)$$

At the end of Phase II, a color class is “small”, and hence is discarded, if it contains fewer than

$$|S_1| - \frac{|R|}{f+1} \qquad (2)$$

elements. Therefore, when Phase III begins, (1) and (2) tell us that each undiscarded color class contains at least

$$|S_b| + |R|\left(\frac{m}{c} - \frac{1}{f+1}\right) \qquad (3)$$

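elements.

Slightly better choices for $a$, $b$, $f$, and $m$ will be possible if we are careful to track their interdependences. Thus we define the parameters

$$\alpha = 0.225, \qquad \gamma = 6.25,$$

so that

$$a = \lceil \alpha c \rceil, \qquad b = \lceil 2\alpha c \rceil, \qquad f = \lfloor c/\gamma \rfloor.$$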

In the analysis of Section 3.2 we will express inequalities both with constants (say, 0.775) and with the origins of those constants (that is, say, $1 - \alpha$). This dual presentation will facilitate refinement of the algorithm in Section 3.3.
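The Phase II discarding rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `phase2_classes` and the callback `same_color` (standing in for one equal/unequal color comparison) are hypothetical, and the sketch scans the classes in list order rather than maintaining the smallest-first order of Phase I, which does not affect the bound of $f \cdot |R|$ comparisons.

```python
def phase2_classes(elements, same_color, f):
    """Group elements into at most f color classes, discarding f + 1
    differently colored elements whenever an (f + 1)-st color shows up.

    Returns (classes, discarded, comparisons).
    """
    classes = []        # each class is a non-empty list of like-colored elements
    discarded = []      # elements set aside for Phase III
    comparisons = 0
    for x in elements:
        placed = False
        for cls in classes:
            comparisons += 1            # one comparison against a representative
            if same_color(x, cls[0]):
                cls.append(x)
                placed = True
                break
        if placed:
            continue
        if len(classes) < f:
            classes.append([x])         # room for a new color class
        else:
            discarded.append(x)         # x plus one element from each class
            for cls in classes:
                discarded.append(cls.pop())
            classes = [cls for cls in classes if cls]   # drop emptied classes
    return classes, discarded, comparisons


if __name__ == "__main__":
    colors = ["r", "r", "g", "b", "r", "g", "r"]
    result = phase2_classes(range(len(colors)),
                            lambda x, y: colors[x] == colors[y], f=2)
    print(result)
```

Each element is compared with at most $f$ class representatives, which is where the $f \cdot |R|$ bound comes from, and every discard removes $f + 1$ elements of pairwise different colors, the property used in the correctness argument.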

3.1 Correctness

If Phase I never aborts, when it ends we have partitioned the entire set of elements into sets $S_1, S_2, \ldots$. The largest of these, $S_1$, is of the plurality color, unless $|S_1| = |S_2|$, in which case there is no plurality.

We must prove that if Phase I aborts and Phase II begins, the small classes discarded in Phase II do not affect plurality. When we discard an element of $R$ in Phase II, we also discard $f$ other elements; these $f + 1$ elements are all of different colors. This means that among the elements discarded from $R$ in Phase II, there are at most $|R|/(f+1)$ like-colored elements. Therefore, if we take a “small” color class (that is, one with fewer than $|S_1| - |R|/(f+1)$ elements) and add to it all discarded elements of the same color, the class will have fewer than $|S_1| - |R|/(f+1) +$ […] $S_2 \cup \cdots \cup S_c$, we have made at most

$$(c+1)|S| - \sum_{i=1}^{c} i|S_i| - c \qquad (4)$$

color comparisons.
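Bound (4) can be checked empirically against a direct simulation of the Phase I insertion rule (without the abort test). The sketch below is only an illustration under stated assumptions: `phase1_partition`, `lemma4_bound`, and the `same_color` callback are hypothetical names, and classes are represented as Python lists kept in non-increasing size order.

```python
def phase1_partition(elements, same_color):
    """Insert elements one at a time, scanning classes from smallest to largest.

    Returns (classes, comparisons), with classes in non-increasing size order.
    """
    classes = []
    comparisons = 0
    for x in elements:
        hit = None
        # Backward scan: S_c, S_{c-1}, ..., S_1 (smallest class first).
        for idx in range(len(classes) - 1, -1, -1):
            comparisons += 1
            if same_color(x, classes[idx][0]):
                hit = idx
                break
        if hit is None:
            classes.append([x])         # new color: a class of size 1 goes last
            continue
        classes[hit].append(x)
        # Restore non-increasing order by moving the grown class leftward.
        while hit > 0 and len(classes[hit - 1]) < len(classes[hit]):
            classes[hit - 1], classes[hit] = classes[hit], classes[hit - 1]
            hit -= 1
    return classes, comparisons


def lemma4_bound(classes):
    """(c + 1)|S| - sum_i i*|S_i| - c for the current partition (1-indexed i)."""
    c = len(classes)
    total = sum(len(s) for s in classes)
    weighted = sum(i * len(s) for i, s in enumerate(classes, start=1))
    return (c + 1) * total - weighted - c


if __name__ == "__main__":
    colors = "abacabadabacaba"
    parts, used = phase1_partition(range(len(colors)),
                                   lambda x, y: colors[x] == colors[y])
    print([len(p) for p in parts], used, lemma4_bound(parts))
```

A new color costs one comparison per existing class, and a known color at position $k$ costs $c - k + 1$, which matches the two cases of the induction in the proof.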

Proof. The first element placed in $S_i$ uses $i - 1$ comparisons. Placing the remaining $|S_i| - 1$ elements with the backward scan of $S_c, S_{c-1}, \ldots$ uses at most $c + 1 - i$ comparisons per element; however, the dynamic nature of the indices (through rearrangement) prevents us from simply summing the costs $(|S_i| - 1)(c + 1 - i)$ of the final partition. Instead we prove (4) by induction on $|S|$. For simplicity of notation, let $S = S_1 \cup S_2 \cup \cdots \cup S_c$ and $c$ be free variables representing the set of elements partitioned and the number of colors found so far, rather than their final values.

When $|S| = 0$, $c = 0$ and (4) is trivial. Suppose (4) holds for $S$ and consider the insertion of the $(|S|+1)$st element. If it represents a new color, we use $c$ color comparisons and

$$c + \left[(c+1)|S| - \sum_{i=1}^{c} i|S_i| - c\right] \le (c+2)(|S|+1) - \sum_{i=1}^{c+1} i|S_i| - (c+1)$$

because $|S_{c+1}| = 1$ and $c \le |S|$. On the other hand, if the $(|S|+1)$st element is of a previously known color $S_k$, we use $c - k + 1$ color comparisons. If, after rearrangement, that color becomes $S_{\hat{k}}$, $\hat{k} \le k$, we must show that

$$(c - k + 1) + \left[(c+1)|S| - \sum_{i=1}^{c} i|S_i| - c\right] \le (c+1)(|S|+1) - \sum_{i=1}^{c} i|\hat{S}_i| - c,$$

where $\hat{S}_i$ are the rearranged colors. This inequality simplifies to

$$\sum_{i=1}^{c} i\left(|\hat{S}_i| - |S_i|\right) \le k,$$

which further simplifies to

$$\hat{k} \le k,$$

because only the size at index $\hat{k}$ changes, from $|S_{\hat{k}}|$ to $|\hat{S}_{\hat{k}}|$ (it is incremented by 1); the induction follows by the definition of $\hat{k}$.

Proposition 1. If Phase I never aborts on $n$ elements, but ends with the partition $S = S_1 \cup S_2 \cup \cdots$ and no unprocessed elements $R$, then the number of color comparisons made is at most

$$((1-\alpha)c + 0.5)\,n = (0.775c + 0.5)\,n, \qquad \text{for } c \ge 9.$$

Proof. No generality is lost by assuming that there are $c$ colors and no fewer, so we assume that $S = S_1 \cup S_2 \cup \cdots \cup S_c$; because Phase I did not abort, we know by (1) that just before the final element of $S$ is placed in a color class (that is, when $|R| = 1$), $|S_1|$ […] $S_2 \cup \cdots$ and unprocessed elements $R$. We use Lemma 4 and bound $(c+1)|S| - \sum_{i=1}^{c} i|S_i|$ by refining the sum-splitting argument in the previous proof by writing

$$|S| = A + B + C + D + E,$$

where

$$A = \sum_{i=1}^{a} \left(|S_i| - |S_b|\right),$$

$$B = \sum_{i=1}^{b} |S_b|,$$

$$C = \sum_{i=a+1}^{b} \min\left(|S_i| - |S_b|,\ |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right),$$

$$D = \sum_{i=a+1}^{b} \left(|S_i| - |S_b|\right) - C,$$

$$E = \sum_{i=b+1}^{c} |S_i|.$$

Intuitively, when Phase I aborts, we use the size of $S_b$ as a reference point, looking at the sizes of sets $S_1, S_2, \ldots, S_b$ by their excess over $|S_b|$: $B$ thus represents the baseline value. $A$ gives the total excess over $B$ in $S_1, S_2, \ldots, S_a$. By (3), $C$ gives a lower bound on the excess over the baseline of $S_{a+1}, S_{a+2}, \ldots, S_b$; this is the least number of elements that can prevent all of the sets $S_{a+1}, S_{a+2}, \ldots, S_b$ from being discarded as “small”. $D$ gives the error in that lower bound. $E$ is everything no larger than the baseline.

Proposition 2. If Phase I aborts on $n$ elements with the partition $S = S_1 \cup S_2 \cup \cdots$ and unprocessed elements $R$, then the number of color comparisons done in Phase I is at most

 -

for c 

1  α c  1 n  α2 c 2 2



cC2  2 m  γ L. R .

 

0  775c 

X'

α2 m 2





α  1 c . R .4 (

4mα  7 L. R . 8

α2 m  α  2  c 2αm  1  2 4 cC2 2  0  559c . R .4 0  09 . R .Y 0  026c  0  67c  1  21 K 1 n  4 5. R .

9.

Proof. We relate each of the sums $A$, $B$, $C$, $D$, and $E$ to the corresponding sums at the end of Phase I,

$$A' = \sum_{i=1}^{a} i\left(|S_i| - |S_b|\right),$$

$$B' = \sum_{i=1}^{b} i|S_b|,$$

$$C' = \sum_{i=a+1}^{b} i \min\left(|S_i| - |S_b|,\ |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right),$$

$$D' = \sum_{i=a+1}^{b} i\left(|S_i| - |S_b|\right) - C',$$

$$E' = \sum_{i=b+1}^{c} i|S_i|.$$

Then by Lemma 4 the number of color comparisons done in Phase I is at most

$$(c+1)|S| - \sum_{i=1}^{c} i|S_i| - c = (c+1)(A + B + C + D + E) - (A' + B' + C' + D' + E') - c$$

$$= [(c+1)A - A'] + [(c+1)B - B'] + [(c+1)C - C'] + [(c+1)D - D'] + [(c+1)E - E'] - c$$

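and we will bound each of these terms.

The simplest terms are $(c+1)B - B'$ and $(c+1)E - E'$, which were handled in the proof of Proposition 1. We have

$$B' = \sum_{i=1}^{b} i|S_b| = \frac{b(b+1)}{2}|S_b| = \frac{b+1}{2}\sum_{i=1}^{b} |S_b| = \frac{b+1}{2}B,$$

so that

$$(c+1)B - B' = \left(c + 1 - \frac{b+1}{2}\right)B = \left(c - \frac{b}{2} + 0.5\right)B \le ((1-\alpha)c + 0.5)B \qquad (10)$$

$$= (0.775c + 0.5)B \qquad (11)$$

because $b = \lceil 2\alpha c \rceil \ge 2\alpha c = 0.45c$, so $c - b/2 \le (1-\alpha)c = 0.775c$. Similarly,

$$E' = \sum_{i=b+1}^{c} i|S_i| \ge (b+1)\sum_{i=b+1}^{c} |S_i| \ge \frac{b+1}{2}E, \qquad (12)$$

so that

$$(c+1)E - E' \le \left(c + 1 - \frac{b+1}{2}\right)E = \left(c - \frac{b}{2} + 0.5\right)E \le ((1-\alpha)c + 0.5)E \qquad (13)$$

$$= (0.775c + 0.5)E. \qquad (14)$$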

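Now for $(c+1)D - D'$. Note the identity $x - \min(y, z) = \max(x - y, x - z)$, so that

$$D = \sum_{i=a+1}^{b} \left(|S_i| - |S_b|\right) - C$$

$$= \sum_{i=a+1}^{b} \left(|S_i| - |S_b|\right) - \sum_{i=a+1}^{b} \min\left(|S_i| - |S_b|,\ |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right)$$

$$= \sum_{i=a+1}^{b} \max\left(0,\ |S_i| - |S_b| - |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right),$$

and similarly,

$$D' = \sum_{i=a+1}^{b} i \max\left(0,\ |S_i| - |S_b| - |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right).$$

The maximum here is non-negative, thus

$$D' \ge \sum_{i=a+1}^{b} (a+1) \max\left(0,\ |S_i| - |S_b| - |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right) = (a+1)\sum_{i=a+1}^{b} \max\left(0,\ |S_i| - |S_b| - |R|\left(\frac{m}{c} - \frac{1}{f+1}\right)\right) = (a+1)D,$$

and so

$$(c+1)D - D' \le (c - a)D \qquad (15)$$

$$\le (1-\alpha)cD = 0.775cD \qquad (16)$$

because $a = \lceil \alpha c \rceil \ge \alpha c = 0.225c$, hence $c - a \le (1-\alpha)c = 0.775c$.

Bounding $(c+1)A - A'$ and $(c+1)C - C'$ is trickier. Recall that the algorithm enforces $|S_1|$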