EXACT AND APPROXIMATE MEMBERSHIP TESTERS

Larry Carter, Robert Floyd, John Gill, George Markowsky, Mark Wegman

1. Introduction

In this paper we consider the question of how much space is needed to represent a set. Given a finite universe U and some subset V (called the vocabulary), an exact membership tester is a procedure that for each element s in U determines if s is in V. An approximate membership tester is allowed to make mistakes: we require that the membership tester correctly accepts every element of V, but we allow it to also accept a small fraction of the elements of U - V.

In this paper, lower bounds on the size of membership testers are given and algorithms are presented that nearly achieve the lower bounds. Some of these algorithms perform better than the two methods proposed by Bloom [Bl] and analyzed here. If we allow an approximate membership tester to incorrectly accept a fraction $2^{-r}$ of the words of U, then approximately vr bits are needed for a vocabulary of v words. Although we are primarily concerned with the program's size, we are secondarily concerned with the time required to test membership. For both exact and approximate membership testers, we present theoretical procedures which nearly achieve the lower bounds and some more practical procedures which require a little extra space. For instance, one of the practical approximate membership testers can be implemented using at most v(r+2) bits.

Membership testers are useful in several areas. An obvious application is spelling checking in document preparation, where many typographical errors can be recognized as not in a standard English vocabulary. Another use is in automated ciphertext-only cryptanalysis; a proposed decipherment can be discarded unless a significant fraction of the words in the decipherment are valid. In optical character recognition systems, uncertainties could be resolved in favor of words or letter sequences stored in a vocabulary of common words [RH]. Yet another application is the storing of the set of valid user IDs and passwords, or credit card numbers, or account numbers.

2. Lower bounds on the size of membership testers

We are motivated by applications in which V can be considered to be randomly chosen from U, with each subset of size v being equally likely. Thus, the question we ask is, "Given a set U and an integer v, what is the space required to represent any vocabulary of size v chosen from U?" If we wish to store a vocabulary that has a fair amount of structure, then there may be ways to further reduce the space requirements by taking advantage of the structure. For instance, to represent an English dictionary, we might use a Huffman (optimum variable-length) code [Hu], [Ga, pp. 52-55] to remove some of the redundancy of the English words, then partition the set of encoded words according to lengths of the encodings, and finally for each length use a representation proposed in this paper to store the sets of encodings of each length.

The first, fourth, and fifth authors are in the Automatic Programming Group at the IBM Watson Research Center, Yorktown Heights, NY 10598. The second and third authors are in the Computer Science and Electrical Engineering Departments, respectively, of Stanford University, Stanford, CA 94305. This research was supported in part by National Science Foundation Grants MCS72-03663-A04 and MCS77-07555 and by Joint Services Electronics Program Contract N00014-75-C-0601.

A standard counting argument from program size complexity [Ko], [Ch1] gives the minimal memory requirements for exact membership testers. In this paper lg denotes logarithms base 2.

Proposition 1: In a universe of size u, at most a fraction $2^{-k}$ of the vocabularies of size v can be accepted by exact membership testers of size less than $\lg \binom{u}{v} - k$ bits.

Proof: There are $\binom{u}{v}$ vocabularies of v words, but only

$$2^{\lg \binom{u}{v} - k} - 1 < 2^{-k} \binom{u}{v}$$

programs of size less than $\lg \binom{u}{v} - k$ bits.

Not surprisingly, as the universe size u grows, that is, as the maximum possible length of a vocabulary word grows, the memory required to represent the vocabulary increases. (The lower bound of this proposition can be increased if we assume that programs are self-delimiting [Ch2].)

We now study the storage needed to approximate vocabularies to within a specified error probability. There are two ways in which approximate membership testers can make mistakes: valid words can be unrecognized as belonging to the vocabulary (false alarms) or incorrect words can escape detection (undetected errors). It is easy to see that little memory can be saved by permitting false alarms; a membership tester with a small false alarm probability is in fact a checker for a slightly smaller vocabulary. On the other hand, if we do not insist that all incorrect words be detected, but require an undetected error probability less than $2^{-r}$, then a vocabulary can be represented in an amount of storage that depends only on v and r; this storage is independent of the universe size u and for reasonable values of r is less than the program size complexity of V.

The next proposition gives a lower bound on the storage in terms of the number of vocabulary words and the undetected error probability. We assume that all incorrect words are equally likely. (We ignore for example the "local" nature of typographical errors and the consistent misspellings of certain words. These practical problems could be attacked in a spelling checking application by using a table of frequent misspellings.)

Proposition 2: In a universe of size u >> v, at most a fraction $2^{-k}$ of the vocabularies of size v can be checked with undetected error probability $2^{-r}$ by programs of size less than vr - k bits.

Proof: Every approximate membership tester can be described by the subset W of U that it actually accepts. If the false alarm rate is 0 and the undetected error probability is $2^{-r}$, then W is a superset of V that contains no more than $v + (u-v)2^{-r}$ elements. There are $\binom{u}{v}$ different vocabularies of size v, and each set W of $v + (u-v)2^{-r}$ words can correspond to approximate membership testers for no more than $\binom{v + (u-v)2^{-r}}{v}$ vocabularies. Therefore, for every v and every r' < r, if u is sufficiently large, then at least

$$\binom{u}{v} \Big/ \binom{v + (u-v)2^{-r}}{v} \ge 2^{vr'}$$

different programs are needed to include all vocabularies of size v. The proof of the proposition is concluded by the same counting arguments as in the proof of Proposition 1: to achieve an undetected error probability of $2^{-r}$, for large u, at least vr - k bits are needed to represent all but $2^{-k}$ of the dictionaries of v words.

3. Exact membership testers

In this section, we present several methods for creating exact membership testers. We assume that, compared to the space required to store the representation of the set, the space required for the executable code of the membership tester is negligible. We also ignore any temporary work area that the procedure may require. Granting this, the first exact membership tester achieves the lower bound of $\lg \binom{u}{v}$ bits, but it is not practical.
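As a rough numerical illustration of the two lower bounds, the following Python sketch evaluates $\lg \binom{u}{v}$ (Proposition 1 with k = 0), the quantity vr (Proposition 2 with k = 0), and the logarithm of the ratio of binomial coefficients used in the proof of Proposition 2. The function names and the sample values of u, v, and r are assumptions chosen only for illustration, and the size of the accepted set W is rounded down to an integer.

```python
import math

def exact_lower_bound_bits(u: int, v: int) -> float:
    """lg C(u, v): bits needed to distinguish all vocabularies of size v."""
    return math.log2(math.comb(u, v))

def approx_lower_bound_bits(v: int, r: int) -> int:
    """v * r: the approximate-tester bound, meaningful when u >> v."""
    return v * r

def program_count_bits(u: int, v: int, r: int) -> float:
    """lg of C(u, v) / C(w, v), where w = v + floor((u - v) / 2**r).

    This is the logarithm of the number of distinct accepted sets W
    needed to cover every vocabulary of size v.
    """
    w = v + ((u - v) >> r)
    return math.log2(math.comb(u, v)) - math.log2(math.comb(w, v))

if __name__ == "__main__":
    u, v, r = 2**20, 1000, 8  # illustrative sizes, not taken from the paper
    print("exact bound  :", round(exact_lower_bound_bits(u, v)), "bits")
    print("approx bound :", approx_lower_bound_bits(v, r), "bits")
    print("lg(#programs):", round(program_count_bits(u, v, r)), "bits")
```

For these sample sizes the count of required programs stays a little below vr bits, which is consistent with the statement that the bound $2^{vr'}$ holds for every r' < r once u is sufficiently large.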

Exact Membership Tester 1: Assume some method of enumerating without repetition all subsets of U of size v. Represent a vocabulary V by the number of subsets that come before V in this enumeration.
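One concrete enumeration that could serve here is colexicographic order (the combinatorial number system). This choice is an assumption: the text only asks for some enumeration without repetition. The sketch below takes U = {0, ..., u-1}, represents V by its rank, and tests membership by unranking; the function names are hypothetical.

```python
from math import comb

def rank_subset(vocab):
    """Number of v-subsets that precede `vocab` in colexicographic order."""
    return sum(comb(w, i + 1) for i, w in enumerate(sorted(vocab)))

def unrank_subset(rank, v):
    """Inverse of rank_subset: recover the sorted v-subset from its rank."""
    elems = []
    for i in range(v, 0, -1):
        # Find the largest w with comb(w, i) <= rank (linear search).
        w = i - 1
        while comb(w + 1, i) <= rank:
            w += 1
        elems.append(w)
        rank -= comb(w, i)
    return elems[::-1]

def is_member(rank, v, s):
    """Exact membership test: decode the whole vocabulary, then look up s."""
    return s in unrank_subset(rank, v)

if __name__ == "__main__":
    n = rank_subset([2, 5, 7])
    print(n, unrank_subset(n, 3), is_member(n, 3, 5), is_member(n, 3, 6))
```

Because the rank is an integer below $\binom{u}{v}$, it fits in $\lceil \lg \binom{u}{v} \rceil$ bits, but every query decodes the entire vocabulary, which illustrates why such a tester is not practical.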

Let X[i] and Y[i] be, respectively, the quotient and the remainder when $w_i$ is divided by $2^k$. In other words, Y[i] is the low-order k bits of $w_i$ and X[i] is the high-order bits of $w_i$. If $b = \lceil u/2^k \rceil$, then X[i] is in {0, 1, ..., b-1}. We represent the set V by the array Y together with the bit string Z of v+b bits that has 1's in positions 1+X[1], 2+X[2], ..., v+X[v] and 0's elsewhere.

Since the elements of X are in increasing order, the 1 we put in the (i+X[i])-th position is indeed the i-th 1 of Z. Thus the array X can be recovered from the array Z; in fact, X[i] is the number of 0's preceding the i-th 1 in Z. Given s in U, we can determine if s belongs to V as follows:

(i) Write s as $s_1 2^k + s_2$, where $s_2 < 2^k$.

(ii) Determine j, the number of 1's before the $s_1$-th zero in Z, and the number of 1's between the $s_1$-th and the $(s_1+1)$-th zero in Z.

(iii) If [...], then s is in V. Otherwise, s is not in V.

Analysis: Assuming that u and v are powers of 2, Exact Membership Tester 2 requires [...]. If u >> v, the lower bound of Proposition 1 is closely approximated by [...], where lg e = 1.44.

Exact Membership Tester 3: Let t be an integer with 1 < t < v. The smaller t is, the faster the membership tester will operate and the more space will be required. The representation of V consists of the array Y and the bit string Z of Membership Tester 2, along with the array W where W[i] is the number of 1's before the (i×t)-th zero of Z.

Step (ii) can now be replaced by the following faster sequence of steps:

(iia) Divide $s_1$ by t to obtain q and r such that $s_1 = qt + r$ and r < t. We know that in the initial W[q] + qt bits of Z, there are exactly qt zeroes and W[q] 1's.

(iib) In the substring of Z starting at the (W[q]+qt+1)-th bit, count how many 1's there are before the r-th zero. Add this number to W[q] to obtain j. As before, also determine the number of 1's between the r-th and the (r+1)-th zero.
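The representation by Y and Z, and the speed-up array W of Exact Membership Tester 3, can be sketched in Python as follows. Several points are assumptions rather than statements from the text: the universe is taken to be {0, ..., u-1}; k is passed in as a parameter because its choice is not shown in this passage; Z is held as a list of 0/1 values instead of packed bits; and the final comparison (step (iii)) is assumed to check whether $s_2$ equals one of the Y entries whose 1's lie between the two zeros located in step (ii). All function names are hypothetical.

```python
def build_tester2(vocab, u, k):
    """Build (Y, Z): low-order k bits per word, plus the bit string Z."""
    words = sorted(vocab)
    b = -(-u // (1 << k))                       # ceil(u / 2**k)
    Y = [w & ((1 << k) - 1) for w in words]     # remainders (low-order bits)
    X = [w >> k for w in words]                 # quotients (high-order bits)
    Z = [0] * (len(words) + b)
    for i, x in enumerate(X, start=1):
        Z[i + x - 1] = 1                        # 1-based position i + X[i]
    return Y, Z

def build_W(Z, t):
    """W[i] = number of 1's before the (i*t)-th zero of Z, with W[0] = 0."""
    W, ones, zeros = [0], 0, 0
    for bit in Z:
        if bit:
            ones += 1
        else:
            zeros += 1
            if zeros % t == 0:
                W.append(ones)
    return W

def is_member(s, Y, Z, k, W=None, t=None):
    """Membership test; uses the W shortcut when W and t are supplied."""
    s1, s2 = s >> k, s & ((1 << k) - 1)         # step (i)
    if W is None:
        pos, j, r = 0, 0, s1                    # step (ii): scan from the start
    else:
        q, r = divmod(s1, t)                    # step (iia)
        pos, j = W[q] + q * t, W[q]
    zeros = 0
    while zeros < r:                            # step (ii) / (iib): skip r zeros
        if Z[pos]:
            j += 1
        else:
            zeros += 1
        pos += 1
    run = 0
    while pos < len(Z) and Z[pos]:              # 1's up to the next zero
        run += 1
        pos += 1
    return s2 in Y[j:j + run]                   # assumed form of step (iii)

def storage_bits(v, u, k):
    """Bits used by Y (v entries of k bits) and Z (v + b bits)."""
    return v * k + v + -(-u // (1 << k))

if __name__ == "__main__":
    vocab = [3, 9, 10, 17, 40, 63]
    Y, Z = build_tester2(vocab, u=64, k=3)
    W = build_W(Z, t=2)
    print([s for s in range(64) if is_member(s, Y, Z, 3, W, 2)])
```

The scan in `is_member` touches at most about t zeros when W is supplied, at the cost of storing one extra count per t zeros, which mirrors the time and space trade-off described for t.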

{w_1, w_2, ..., w_v} where w_1 < w_2