EXACT AND APPROXIMATE MEMBERSHIP TESTERS
Larry Carter, Robert Floyd, John Gill, George Markowsky, Mark Wegman
1. Introduction
In this paper we consider the question of how much space is needed to represent a set. Given a finite universe U and some subset V (called the vocabulary), an exact membership tester is a procedure that for each element s in U determines if s is in V. An approximate membership tester is allowed to make mistakes: we require that the membership tester correctly accepts every element of V, but we allow it to also accept a small fraction of the elements of U - V.
In this paper, lower bounds on the size of membership testers are given and algorithms are presented that nearly achieve the lower bounds. Some of these algorithms perform better than the two methods proposed by Bloom [Bl] and analyzed here. If we allow an approximate membership tester to incorrectly accept a fraction $2^{-r}$ of the words of U, then approximately vr bits are needed for a vocabulary of v words. Although we are primarily concerned with the program's size, we are secondarily concerned with the time required to test membership. For both exact and approximate membership testers, we present theoretical procedures which nearly achieve the lower bounds and some more practical procedures which require a little extra space. For instance, one of the practical approximate membership testers can be implemented using at most v(r+2) bits.
Membership testers are useful in several areas. An obvious application is spelling checking in document preparation, where many typographical errors can be recognized as not in a standard English vocabulary. Another use is in automated ciphertext-only cryptanalysis; a proposed decipherment can be discarded unless a significant fraction of the words in the decipherment are valid. In optical character recognition systems, uncertainties could be resolved in favor of words or letter sequences stored in a vocabulary of common words [RH]. Yet another application is the storing of the set of valid user IDs and passwords, or credit card numbers, or account numbers.
The first, fourth, and fifth authors are in the Automatic Programming Group at the IBM Watson Research Center, Yorktown Heights, NY 10598. The second and third authors are in the Computer Science and Electrical Engineering Departments, respectively, of Stanford University, Stanford, CA 94305. This research was supported in part by National Science Foundation Grants MCS72-03663-A04 and MCS77-07555 and by Joint Services Electronics Program Contract N00014-75-C-0601.
2. Lower bounds on the size of membership testers
We are motivated by applications in which V can be considered to be randomly chosen from U, with each subset of size v being equally likely. Thus, the question we ask is, "Given a set U and an integer v, what is the space required to represent any vocabulary of size v chosen from U?" If we wish to store a vocabulary that has a fair amount of structure, then there may be ways to further reduce the space requirements by taking advantage of the structure. For instance, to represent an English dictionary, we might use a Huffman (optimum variable-length) code [Hu], [Ga, pp. 52-55] to remove some of the redundancy of the English words, then partition the set of encoded words according to lengths of the encodings, and finally for each length use a representation proposed in this paper to store the sets of encodings of each length.
A standard counting argument from program size complexity [Ko], [Ch1] gives the minimal memory requirements for exact membership testers. In this paper lg denotes logarithms base 2.
Proposition 1: In a universe of size u, at most a fraction $2^{-k}$ of the vocabularies of size v can be accepted by exact membership testers of size less than $\lg\binom{u}{v} - k$ bits.
Proof: There are $\binom{u}{v}$ vocabularies of v words, but only $2^{\lg\binom{u}{v} - k} - 1 < 2^{-k}\binom{u}{v}$ programs of size less than $\lg\binom{u}{v} - k$ bits.
Not surprisingly, as the universe size u grows, that is, as the maximum possible length of a vocabulary word grows, the memory required to represent the vocabulary increases. (The lower bound of this proposition can be increased if we assume that programs are self-delimiting [Ch2].)
We now study the storage needed to approximate vocabularies to within a specified error probability. There are two ways in which approximate membership testers can make mistakes: valid words can go unrecognized as belonging to the vocabulary (false alarms) or incorrect words can escape detection (undetected errors). It is easy to see that little memory can be saved by permitting false alarms; a membership tester with a small false alarm probability is in fact a checker for a slightly smaller vocabulary. On the other hand, if we do not insist that all incorrect words be detected, but require an undetected error probability less than $2^{-r}$, then a vocabulary can be represented in an amount of storage that depends only on v and r; this storage is independent of the universe size u and for reasonable values of r is less than the program size complexity of V.
The next proposition gives a lower bound on the storage in terms of the number of vocabulary words and the undetected error probability. We assume that all incorrect words are equally likely. (We ignore for example the "local" nature of typographical errors and the consistent misspellings of certain words. These practical problems could be attacked in a spelling checking application by using a table of frequent misspellings.)
Proposition 2: In a universe of size u >> v, at most a fraction $2^{-k}$ of the vocabularies of size v can be checked with undetected error probability $2^{-r}$ by programs of size less than vr - k bits.
Proof: Every approximate membership tester can be described by the subset W of U that it actually accepts. If the false alarm rate is 0 and the undetected error probability is $2^{-r}$, then W is a superset of V that contains no more than $v + (u-v)2^{-r}$ elements. There are $\binom{u}{v}$ different vocabularies of size v, and each set W of $v + (u-v)2^{-r}$ words can correspond to approximate membership testers for no more than $\binom{v + (u-v)2^{-r}}{v}$ vocabularies. Therefore, for every v and every r' < r, if u is sufficiently large, then at least $\binom{u}{v} \big/ \binom{v + (u-v)2^{-r}}{v} \ge 2^{vr'}$ different programs are needed to include all vocabularies of size v. The proof of the proposition is concluded by the same counting arguments as in the proof of Proposition 1: to achieve an undetected error probability of $2^{-r}$, for large u, at least vr - k bits are needed to represent all but $2^{-k}$ of the dictionaries of v words.
3. Exact membership testers
In this section, we present several methods for creating exact membership testers. We assume that, compared to the space required to store the representation of the set, the space required for the executable code of the membership tester is negligible. We also ignore any temporary work area that the procedure may require. Granting this, the first exact membership tester achieves the lower bound of $\lceil \lg\binom{u}{v} \rceil$ bits, but it is not practical.
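To give a feel for the sizes involved, the following minimal sketch evaluates the exact lower bound $\lceil \lg\binom{u}{v} \rceil$ and the approximate figure of vr bits quoted in the introduction. The function names and the sample values of u, v, and r are illustrative assumptions, not taken from the paper.

```python
from math import comb


def exact_lower_bound_bits(u: int, v: int) -> int:
    """Return ceil(lg C(u, v)), the bits needed to name one of C(u, v) vocabularies."""
    count = comb(u, v)
    # For an integer n >= 1, (n - 1).bit_length() equals ceil(log2(n)) exactly,
    # which avoids floating-point error on very large binomial coefficients.
    return (count - 1).bit_length()


def approximate_bound_bits(v: int, r: int) -> int:
    """Return v * r, the approximate size when a fraction 2**-r of U may be wrongly accepted."""
    return v * r


if __name__ == "__main__":
    u, v, r = 2**20, 2**10, 8  # illustrative values only (assumption)
    print("exact tester lower bound:", exact_lower_bound_bits(u, v), "bits")
    print("approximate tester, about:", approximate_bound_bits(v, r), "bits")
```

Only the first figure grows with the universe size u; the second depends on v and r alone.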
Exact Membership Tester 1: Assume some method of enumerating without repetition all subsets of U of size v. Represent a vocabulary V by the number of subsets that come before V in this enumeration.
Exact Membership Tester 2: Let V = {$w_1, w_2, \ldots, w_v$} where $w_1 < w_2 < \cdots < w_v$. For an integer parameter k, let X[i] and Y[i] be the quotient and the remainder, respectively, when $w_i$ is divided by $2^k$. In other words, Y[i] is the low-order k bits of $w_i$ and X[i] is the high-order bits of $w_i$. If $b = \lceil u/2^k \rceil$ then X[i] is in {0, 1, ..., b-1}. We represent the set V by the array Y together with the bit string Z of v+b bits that has 1's in positions 1+X[1], 2+X[2], ..., v+X[v] and 0's elsewhere.
Since the elements of X are in nondecreasing order, the 1 we put in the (i+X[i])-th position is indeed the i-th 1 of Z. Thus the array X can be recovered from the array Z; in fact, X[i] is the number of 0's preceding the i-th 1 in Z. Given s in U, we can determine if s belongs to V as follows:
(i) Write s as $s_1 2^k + s_2$, where $s_2 < 2^k$.
(ii) Determine j, the number of 1's before the $s_1$-th zero in Z, and i, the number of 1's between the $s_1$-th zero and the $(s_1+1)$-th zero in Z.
(iii) If $s_2$ is one of Y[j+1], ..., Y[j+i], then s is in V. Otherwise, s is not in V.
Analysis: Assuming that u and v are powers of 2, Exact Membership Tester 2 requires [...] bits. If u >> v, the lower bound of Proposition 1 is closely approximated by [...], where lg e = 1.44.
Exact Membership Tester 3: Let t be an integer with 1 < t < v. The smaller t is, the faster the membership tester will operate and the more space will be required. The representation of V consists of the array Y and the bit string Z of Membership Tester 2, along with the array W where W[i] is the number of 1's before the (i × t)-th zero of Z.
Step (ii) can now be replaced by the following faster sequence of steps:
(iia) Divide $s_1$ by t to obtain q and r such that $s_1 = qt + r$ and r < t. We know that in the initial W[q] + qt bits of Z, there are exactly qt zeroes and W[q] 1's.
(iib) In the substring of Z starting at the (W[q]+qt+1)-th bit, count how many 1's there are before the r-th zero. Add this number to W[q] to obtain j. As before, let i be the number of 1's between the r-th and the (r+1)-th zero.
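The quotient/remainder representation and its indexed variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `QuotientTester`, the helper `_scan`, and the use of Python lists in place of packed bit arrays are all hypothetical choices made here, and k and t are simply passed in by the caller.

```python
class QuotientTester:
    """Sketch of Exact Membership Tester 2; passing t adds the index W of Tester 3."""

    def __init__(self, vocabulary, u, k, t=None):
        words = sorted(set(vocabulary))
        if any(not 0 <= w < u for w in words):
            raise ValueError("vocabulary words must lie in range(u)")
        self.k, self.t = k, t
        self.mask = (1 << k) - 1
        self.b = -(-u // (1 << k))               # b = ceil(u / 2^k)
        self.Y = [w & self.mask for w in words]  # remainders: low-order k bits
        # Z has v + b bits; the i-th 1 sits at 1-based position i + X[i].
        self.Z = [0] * (len(words) + self.b)
        for i, w in enumerate(words, start=1):
            self.Z[i + (w >> k) - 1] = 1
        if t is not None:
            # W[q] = number of 1's before the (q*t)-th zero of Z (W[0] = 0).
            self.W, ones, zeros = [0], 0, 0
            for bit in self.Z:
                if bit:
                    ones += 1
                else:
                    zeros += 1
                    if zeros % t == 0:
                        self.W.append(ones)

    def _scan(self, start, skip):
        """From 0-based index start, count the 1's before the skip-th zero
        and the 1's between that zero and the next one."""
        pos, before = start, 0
        while skip > 0:
            if self.Z[pos]:
                before += 1
            else:
                skip -= 1
            pos += 1
        run = 0
        while self.Z[pos]:  # the last bit of Z is always 0, so this stops
            run += 1
            pos += 1
        return before, run

    def __contains__(self, s):
        s1, s2 = s >> self.k, s & self.mask      # step (i)
        if not 0 <= s1 < self.b:
            return False
        if self.t is None:                       # step (ii): scan from the start
            j, run = self._scan(0, s1)
        else:                                    # steps (iia) and (iib)
            q, r = divmod(s1, self.t)
            before, run = self._scan(self.W[q] + q * self.t, r)
            j = self.W[q] + before
        return s2 in self.Y[j:j + run]           # step (iii)
```

For example, `17 in QuotientTester({3, 5, 17, 18, 40, 63}, u=64, k=3, t=3)` walks steps (i), (iia), (iib), and (iii). In a packed encoding, Y would occupy v·k bits and Z would occupy v+b bits, with the entries of W as the extra cost of the faster lookup.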