A TREE ALGORITHM FOR NEAREST NEIGHBOR SEARCHING IN DOCUMENT RETRIEVAL SYSTEMS

By Caroline M. Eastman
Department of Mathematics
Florida State University
Tallahassee, Florida 32306

Stephen F. Weiss
University of North Carolina
Chapel Hill, North Carolina

Abstract

The problem of finding nearest neighbors to a query in a document collection is a special case of associative retrieval, in which searches are performed using more than one key. A nearest neighbors associative retrieval algorithm, suitable for document retrieval using similarity matching, is described. The basic structure used is a binary tree, in which at each node a set of keys (concepts) is tested to select the most promising branch. Backtracking to initially rejected branches is allowed and often necessary. Under certain conditions, the search time required by this algorithm is O((log2 N)^k), where N is the number of documents and k is a system-dependent parameter. A series of experiments with a small collection confirm the predictions made using the analytic model; k is approximately 4 in this situation. This algorithm is compared with two other searching algorithms: sequential search and clustered search. For large collections, the average search time for this algorithm is less than that for a sequential search and greater than that for a clustered search. However, the clustered search, unlike the sequential search and this algorithm, does not guarantee that the near neighbors found are actually the nearest neighbors.

1. Introduction

This paper describes a searching algorithm that can be used in document retrieval systems and that appears to have some advantages over currently suggested methods. It could also be used in other similar searching problems.

The type of document retrieval system considered here is that described by Salton (1975) and used in the SMART system. In such a system, a set of concepts is used to classify both the documents and the queries in order to represent them as concept vectors. A similarity measure is used to compare documents and queries in concept vector form in order to select those documents best matching the queries; this approach may be contrasted with the use of Boolean matching.

The searches involved in such a system require the use of more than one search key; they are thus multi-attribute, or associative, searches. Since each concept is used as a key, the search takes place in an n-dimensional space of high dimension. Such a search looks for those documents which most closely match the query according to any one of a variety of similarity measures (e.g. cosine, Euclidean distance). So it is a best-match search rather than an exact-match search or a partial-match search. Several documents are generally retrieved in response to a query; the search is thus an m nearest neighbor search rather than simply a nearest neighbor search.

The standard searching methods for document retrieval are sequential search and inverted file search. The sequential search is straightforward, but in large collections it is very time-consuming to examine the entire collection. The inverted file search is often used to provide quick searches in systems using Boolean matching; with somewhat more effort, it can be used in similarity-based searching as well. Inverted file searches become rather inefficient when requests have many concepts (Salton, 1968).

Clustered files are often suggested as a way to cut down search time in similarity-based systems. In such an organization, documents are grouped together in clusters, and only the most promising clusters are examined. Although such an approach can drastically reduce the search time, the near neighbors retrieved by a clustered file search are not always the nearest neighbors to the query. An overview of clustering methods is given in Salton (1975). Yu and Luk (1977) describe a model which provides an estimate of the number of nearest neighbors missed in a clustered file search.

Rivest (1974) demonstrates the optimality of a particular hashing algorithm for a restricted class of best-match searches. However, he has assumed that the attributes are binary, that the records are randomly distributed, and that only one nearest neighbor is to be found; none of these conditions is met here.

Various tree-based methods have been suggested for retrieving nearest neighbors. Quad trees (Finkel and Bentley, 1974) and k-d trees (Friedman, Bentley, and Finkel, 1977) are fine for small numbers of keys, but become impractical when many keys are involved. Burkhard and Keller (1973) and Fukunaga and Narendra (1975) describe tree structured algorithms similar to the one suggested here; however, these two algorithms depend on the use of a distance metric rather than a similarity measure for matching.

2. Description

Each document in a collection of N documents is represented by a concept vector. Each query is also represented as a concept vector. The m documents most similar to the query are to be chosen. A similarity measure ranging between 0 and 1 is used to compare documents and queries. (A similarity of 1 means that the document and query are identical, and a similarity of 0 means that the document and the query contain no concepts in common.) The search algorithm described does not depend on the use of a particular similarity measure.

The document collection is organized as a binary tree. Associated with each internal node of the tree is a set of concepts. The documents are stored in buckets at the leaves of the tree; each document resides in exactly one bucket.

A document is inserted into the subtree rooted at a particular node as follows. The document is compared to the concept set associated with the node. If the intersection between the two is empty, then the document is inserted recursively into the subtree whose root is the left child of the original node. If the intersection is non-empty, then the document is inserted in the right subtree. This compare and descend process continues until a leaf node is reached; the document is then placed in the bucket associated with that leaf node. Thus, for each node, all documents that are in the right subtree contain at least one concept in common with that node's concept set. All left-descendent documents contain none of the concepts.
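To make the organization above concrete, the following short Python sketch (not taken from the paper) represents documents and queries as sets of concept identifiers, uses the cosine measure of Section 4 for such binary concept vectors, and inserts documents by the rule just described. The names Node, cosine, and insert are illustrative only.

    import math
    from dataclasses import dataclass, field
    from typing import List, Optional, Set

    @dataclass
    class Node:
        concepts: Set[str] = field(default_factory=set)   # concept set tested at this node
        left: Optional["Node"] = None                      # documents sharing no concept with the set
        right: Optional["Node"] = None                     # documents sharing at least one concept
        bucket: Optional[List[Set[str]]] = None            # document bucket; present only at leaves

    def cosine(query: Set[str], doc: Set[str]) -> float:
        """Cosine similarity between two binary concept vectors (sets of concepts)."""
        if not query or not doc:
            return 0.0
        return len(query & doc) / math.sqrt(len(query) * len(doc))

    def insert(root: Node, doc: Set[str]) -> None:
        """Place a document in exactly one leaf bucket, descending right whenever
        the document shares a concept with the node's concept set."""
        node = root
        while node.bucket is None:                         # stop at a leaf
            node = node.right if (doc & node.concepts) else node.left
        node.bucket.append(doc)

    # A height-1 toy tree: one internal node testing {"retrieval"}, two leaf buckets.
    root = Node(concepts={"retrieval"}, left=Node(bucket=[]), right=Node(bucket=[]))
    insert(root, {"retrieval", "clustering"})   # goes to the right bucket
    insert(root, {"hashing"})                   # goes to the left bucket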

The time required to insert one document into a tree of height L is O(L). The maximum reasonable value of L for a collection of size N is log2 N; such a tree would have an average bucket size of one, and any higher tree would be likely to contain empty buckets. Thus, in general, L <= log2 N, and hence putting N documents into a tree is at worst O(N log2 N).

The search is a similar process. The query is compared to the concept set of the root of a tree. If the query and the concept set have concepts in common, then the right subtree is more promising; if the intersection is empty, then the left subtree is more promising. In either case, a bound can be put on the similarity between the query and the documents in a subtree by considering the concept sets to be found on the path to that subtree. The bound can be determined by considering the nearest possible neighbor that could be found in a particular subtree. Of course, it is a theoretical bound and not a guarantee; the nearest possible neighbor need not actually be present in the subtree. The similarity between the query and its nearest possible neighbor in a subtree will be referred to as the similarity bound for that subtree. The calculation of the similarity bound will depend on the particular similarity measure used.

The similarity bound for the entire tree is 1, since the nearest possible neighbor is one which exactly matches the query. After the query is compared to the concept set at the root, a similarity bound for both subtrees can be calculated. Suppose the query and the concept set have c concepts in common. Then all documents in the left subtree are missing at least c of the concepts in the query. The nearest possible neighbor along the left path is thus a document that matches all but those c query concepts. The nearest possible neighbor for the right subtree is still an exact match to the query; the similarity bound for that subtree remains equal to 1. If the query and the concept set have no concepts in common, all documents in the right subtree have at least one concept that was not requested. The nearest possible neighbor in the right subtree is thus a document that has all the query concepts plus one extra. The nearest possible neighbor for the left subtree is an exact match to the query.

Whether or not the query matches any of the concepts at the root node, the similarity bound for the less promising subtree can be calculated by determining the similarity between the query and its nearest possible neighbor in that subtree. The similarity bound for the more promising subtree remains the same.

Similarity bounds for subtrees of nodes below the root can be determined in much the same way. When a concept set is matched, the nearest possible neighbor and similarity bound remain the same for the right subtree. The matched concepts are not present in documents in the left subtree, so the nearest possible neighbor for the left subtree is the previous nearest possible neighbor without the matched concepts. The similarity bound for this subtree is the similarity of the query with this nearest possible neighbor. When a concept set is not matched, the nearest possible neighbor and similarity bound for the left subtree remain the same. An extra concept is added to the nearest possible neighbor for that tree to obtain the nearest possible neighbor for its right subtree. The similarity bound can then be determined from this nearest possible neighbor.

The search process descends the tree until a leaf is reached. All documents in the associated bucket are compared with the query. If m nearest neighbors have been requested, the best m are saved. As before, the process of descending from the root to a bucket requires O(L) time.

Backtracking now occurs. If the similarity bound of any subtree encountered is greater than the similarity with the mth nearest neighbor encountered so far, that subtree is examined; there may be documents associated with it which are better than those already found. The search is over when the root of the tree is reached.

The search algorithm is summarized in Figure 1. The search is set in motion by calling SEARCH(Query, number of nearest neighbors, root of entire tree).

Figure 1: Search Algorithm

SEARCH(QUERY, M, ROOT)
  IF (ROOT is a leaf) THEN
    Compare each document in the associated bucket with the QUERY.
    Merge the results with the M best documents seen so far; keep the best M.
  IF (ROOT concept set ∩ QUERY ≠ ∅) THEN
    Calculate the similarity bound for the left branch.
    CALL SEARCH(QUERY, M, right child of ROOT).
    IF (the similarity bound for the left child is greater than the
        similarity of the Mth best document found so far) THEN
      CALL SEARCH(QUERY, M, left child of ROOT).
  IF (ROOT concept set ∩ QUERY = ∅) THEN
    Calculate the similarity bound for the right branch.
    CALL SEARCH(QUERY, M, left child of ROOT).
    IF (the similarity bound for the right child is greater than the
        similarity of the Mth best document found so far) THEN
      CALL SEARCH(QUERY, M, right child of ROOT).
END SEARCH
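The following Python sketch is one possible rendering of Figure 1; it is an illustration under assumptions rather than the authors' implementation. It repeats the Node class and cosine measure from the Section 2 sketch so that it runs on its own, and it tracks the nearest possible neighbor on the path as the set of query concepts it still shares with the query (overlap) plus a count of extra, unrequested concepts; the similarity bound is then computed in closed form for the cosine measure.

    import math
    from dataclasses import dataclass, field
    from typing import List, Optional, Set, Tuple

    # Node and cosine repeated from the Section 2 sketch so this block runs on its own.
    @dataclass
    class Node:
        concepts: Set[str] = field(default_factory=set)
        left: Optional["Node"] = None
        right: Optional["Node"] = None
        bucket: Optional[List[Set[str]]] = None      # present only at leaves

    def cosine(query: Set[str], doc: Set[str]) -> float:
        if not query or not doc:
            return 0.0
        return len(query & doc) / math.sqrt(len(query) * len(doc))

    def bound(query: Set[str], overlap: Set[str], extra: int) -> float:
        """Similarity bound: cosine between the query and its nearest possible neighbor,
        a hypothetical document holding exactly the query concepts in `overlap` plus
        `extra` unrequested concepts."""
        if not query or not overlap:
            return 0.0
        return len(overlap) / math.sqrt(len(query) * (len(overlap) + extra))

    def search(node: Node, query: Set[str], m: int,
               overlap: Optional[Set[str]] = None, extra: int = 0,
               best: Optional[List[Tuple[float, Set[str]]]] = None):
        """m-nearest-neighbor search with backtracking, following Figure 1.
        `overlap`/`extra` describe the nearest possible neighbor on the path so far;
        `best` collects the best m (similarity, document) pairs found so far."""
        if overlap is None:
            overlap = set(query)          # at the root the nearest possible neighbor is the query itself
        if best is None:
            best = []
        if node.bucket is not None:       # leaf: scan the bucket and keep the best m
            best.extend((cosine(query, doc), doc) for doc in node.bucket)
            best.sort(key=lambda pair: pair[0], reverse=True)
            del best[m:]
            return best
        mth = lambda: best[m - 1][0] if len(best) >= m else 0.0   # similarity of the mth best so far
        matched = query & node.concepts
        if matched:                       # the right branch is more promising
            search(node.right, query, m, overlap, extra, best)
            left_overlap = overlap - matched          # left documents lack the matched concepts
            if bound(query, left_overlap, extra) > mth():   # backtrack only if the bound beats the mth best
                search(node.left, query, m, left_overlap, extra, best)
        else:                             # the left branch is more promising
            search(node.left, query, m, overlap, extra, best)
            if bound(query, overlap, extra + 1) > mth():    # right documents carry an unrequested concept
                search(node.right, query, m, overlap, extra + 1, best)
        return best

    # Example (with `root` built as in the Section 2 sketch):
    #     nearest = search(root, {"retrieval", "tree"}, m=5)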

3. Analysis

If some restrictive but reasonable assumptions are made, the average time complexity of this algorithm can be determined. These assumptions are:

(1) The document collection is organized into a full binary tree of height L.

(2) Each bucket contains N/2^L documents.

(3) The cutoff similarity, s, for the m nearest neighbors is the same for all queries. In other words, the m nearest neighbors to a query have similarities greater than or equal to s; all other documents have similarities less than or equal to s. (In a real situation, the value of s would almost certainly not be the same for all queries.)

(4) At any node, the reduction in the similarity bound for the subtree not initially chosen is r. (This assumption is clearly only a rough approximation to the actual situation.)

(5) Backtracking through the tree occurs in similarity bound order. That is, if the subtrees yet to be examined have similarity bounds s1, s2, ..., sj, then the subtree examined next is the one with bound equal to MAX(s1, s2, ..., sj).

If these assumptions are made, then the number of buckets to be examined in a search can be shown to be the sum of the binomial coefficients C(L,1) + C(L,2) + ... + C(L,k), where k is equal to the integer part of (1-s)/r. If k is less than L/2, then this quantity is O((log2 N)^k) and may be taken as the average time complexity. (Details of this derivation are given in Eastman (1977).)

The behavior of this algorithm depends on the values of s and r. Clearly this algorithm will be of practical interest only when k is small. Large values of s (near 1) simply mean that the nearest neighbors to be found are very similar to the query; such large values of s result in small values of k. Large values of r (near 1) will also result in small values of k. Large values of r mean that the similarity bounds are rapidly tightened, so that entire subtrees are eliminated from further consideration and searches are shorter. It should be noted that the value of s depends in no way on the tree; the value of r, unlike that of s, does depend on the particular tree used.

The behavior of this algorithm is intermediate between that of an O(log2 N) algorithm and that of an O(N) algorithm. For any given k, N grows faster than (log2 N)^k; however, the values of N for which (log2 N)^k is smaller than N may be so enormous that they are much larger than any that would arise in a practical situation. Figure 2 shows a comparison of the growth of N and (log2 N)^4. (The value k = 4 was used because that was the value found in the experimental situation described in the next section.) This graph illustrates the general shape of the two curves.

A more precise estimate of the expected search length can be obtained by dividing the number of buckets to be searched, C(L,1) + ... + C(L,k), by the total number of buckets, 2^L. Figure 3 shows the predicted scan fraction, the fraction of documents examined, for several values of N. (These figures do not include time spent searching through the tree; this time should be roughly proportional to the scan fraction.) Although the algorithm performs poorly in small collections, it should do well in large ones.

Figure 2: Growth of N and (log2 N)^4 (log10 of each quantity plotted against log10 N, for log10 N from 2 to 10)

Figure 3: Some predicted scan fractions (bucket size = 1)

                           Scan Fraction
          N              k=3         k=5
        128              0.50        0.94
      1,024              0.17        0.72
  1,048,576              0.0013      0.021
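The entries in Figure 3 follow from the formula of this section. The sketch below is an illustrative recomputation only (predicted_scan_fraction is not a name used in the paper); it reproduces the predictions for the largest collection in the table.

    from math import comb, log2

    def predicted_scan_fraction(n_documents: int, k: int) -> float:
        """Expected fraction of documents examined for a full binary tree with
        bucket size 1: C(L,1) + ... + C(L,k) buckets searched out of 2^L,
        where L = log2(N)."""
        L = int(log2(n_documents))
        searched = sum(comb(L, i) for i in range(1, k + 1))
        return searched / 2 ** L

    # Last row of Figure 3:
    print(round(predicted_scan_fraction(1_048_576, 3), 4))   # 0.0013
    print(round(predicted_scan_fraction(1_048_576, 5), 3))   # 0.021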

4. Experimental Results

Several experiments using this search algorithm were performed. These experiments were directed towards three related questions. First, how should the concept sets be chosen? Second, how useful is the rough analysis described in the previous section? Third, what value of k would be found in the actual retrieval situation?

The experiments described here used the American Documentation Institute (ADI) collection (Keen, 1971). The ADI collection contains 83 documents and 35 queries in concept vector form. It has been used in many document retrieval experiments based on the concept vector model.

As is generally true when there are a large number of factors influencing an outcome, it is infeasible to try all combinations of possible values for these variables. So a reasonable (but arbitrary) set of base conditions was chosen. These conditions are described in the next paragraph.

The 5 nearest neighbors are retrieved for each of the queries; this number was chosen because there are an average of 5 relevant documents per query for this collection. (In general, the fraction of the collection that is retrieved will be much smaller than the 6.1% retrieved here.) The concept tree used is a full binary tree of height 6; this height gives the average bucket size closest to 1 for a collection of 82 documents. Concept sets at the same level in a tree are identical; with a small number of concepts, it is easier to allow some duplication between concept sets. The nodes are searched in similarity bound order. The similarity measure used is the cosine.

The experimental conditions approximated the analytic assumptions described in the previous section. The tree used was a full binary tree, but the number of documents per bucket varied. The subtrees were searched in similarity bound order. The average value of the cutoff similarity, s, for 5 documents was 0.34. The reduction in similarity bound, r, varied with the direction chosen and the level of the node.

The measure used to compare the different searches is the average scan fraction, the fraction of documents in the collection searched in response to a query. Although the total work involved, including the tree traversal, is important, it should be roughly proportional to the scan fraction for trees of equal height.

Even though it is not clear what the optimal tree for a particular collection is, it is obvious that the selection of concept sets is critical to the performance of this algorithm. A greater chance of a match between a query and a concept set should increase r and thus decrease the average search time. Unfortunately, more sophisticated choice methods designed to increase the probability of such matches involve correspondingly greater effort.

Several methods of increasing sophistication were used to construct trees. For each method, trees with concept sets of different sizes were constructed; the size giving the best performance for each method was used in the final comparison. In each case, the concepts used in the ADI queries were used as a source. (The same queries were used for constructing the tree and for searching. Of course, a production implementation would have to use some sample of queries.)

The simplest method of concept choice is random selection. A somewhat more sophisticated method uses concept frequencies; the concepts that occur most often in the sample of queries are chosen to form the concept sets. Since the most common concepts are used, there should be more matches.

Use of concepts highly correlated with each other should increase the probability of multiple matches with concept sets and decrease the search length. Two possible selection algorithms using correlations were used. The first forms the sets one at a time by taking the most common unused concept to start a set and then selecting those unused concepts most highly correlated with it. By adding one element to each set in turn, the second selection algorithm avoids the possibility that the very highest frequency concepts will all be in the same set.

A more sophisticated way of using the correlation information is to cluster the concepts and then choose concept sets from those clusters. Three of the many possible clustering algorithms were tried. Two cluster the concepts using the single-link clustering criterion until a cutoff similarity is reached. One algorithm, referred to as frequency-ordered clustering, selected those clusters containing the most common concepts for concept sets; a second algorithm, referred to as size-ordered clustering, selected those clusters containing the most concepts. The third clustering algorithm uses the most common concepts as seeds to start the clusters; if any two of the potential seed concepts are highly correlated with each other, they are put in the same cluster. Then single-link clustering is used to bring the clusters up to the desired size.

The results from searches using each of these methods are shown in Figure 4.

Figure 4: Scan Fraction Using Different Methods of Concept Choice

  Method                          Scan Fraction    Relevant Parameter
  Random                              0.97         8 concepts/set
  Most Common                         0.92         4 concepts/set
  Correlated (version 1)              0.89         8 concepts/set
  Correlated (version 2)              0.89         8 concepts/set
  Frequency-Ordered Clustered         0.94         Cutoff similarity = 0.55
  Size-Ordered Clustered              0.94         Cutoff similarity = 0.55
  Most Common Clustered               0.89         8 concepts/set

Figure 5: Increase in Scan Fraction with Number of Nearest Neighbors Retrieved

    m      Average s      Scan Fraction
    1         0.52            0.80
    2         0.43            0.83
    3         0.39            0.86
    4         0.37            0.87
    5         0.34            0.89
   10         0.27            0.93
   30         0.14            0.94
   82         0.00            0.94

A Friedman analysis of variance was performed to test the null hypothesis that the concept selection method does not affect the scan fraction. The 35 scan fractions (one for each query) were calculated for each of the 7 different methods. The test statistic (chi_r^2) is 20.7; the level of significance for this value is 0.002. So it is reasonable to conclude that the various methods differ significantly. The results from this experiment are in accord with the prediction, based on the analytic model, that those methods incorporating additional information about concept frequencies and correlations would give the best results. The three methods with the lowest best scan fraction, 0.89, all make use of the correlation information. Probably the most reasonable approach in practice is Most Common Clustered, which uses the most common concepts to start the sets and then fills them out with those concepts most highly correlated with them.

The influence of the value of m upon the search length can be seen by looking at the scan fraction when different numbers of nearest neighbors are retrieved. A series of searches retrieving different numbers of nearest neighbors was performed using the tree with 8 concepts per set. The results are shown in Figure 5. (The number of nearest neighbors actually retrieved would be less than m if there were fewer than m documents with similarity greater than 0 to the query.) The scan fraction rises sharply from the minimum of 0.80, obtained when 1 nearest neighbor is retrieved, to a plateau of 0.94. When m is 10, the average scan fraction is already 0.93; the plateau is reached when m is 30. The maximum scan fraction is less than 1 because subtrees with similarity bounds of 0 need not be searched by the algorithm.

As predicted, high scan fractions occur with relatively few documents retrieved in this collection. However, the behavior of the algorithm is such that much better results should be observed in large collections.

The results of the two experiments described here are in accord with predictions made on the basis of the analytic model. Several other experiments are described in Eastman (1977); the results of these experiments also agree with predictions made on the basis of the analysis. So it would appear that even the rough analysis described here can be useful in examining the behavior of this algorithm.

A rough value of k for the ADI collection can be estimated in two different ways. First, a value of k can be estimated from the observed scan fractions. Second, an estimate of k can be based on the estimated values of s and r. Both methods give an estimate of 4.

The expected scan fractions for a tree of height 6 can be obtained by dividing the estimated number of buckets searched, C(6,1) + ... + C(6,k), by the number of buckets, 2^6. The lowest scan fraction obtained when 5 neighbors are retrieved in the ADI collection is 0.89; this is the scan fraction predicted when k is 4.

The average value of s found in the ADI collection when 5 neighbors are retrieved is 0.34. The average query length of an ADI query is 4. The reductions in similarity bound occurring for 1 to 4 unmatched concepts and for 1 to 6 extra concepts for a query of this length were calculated. The average r when matches occur is 0.25; the average r when matches do not occur is 0.06. Since most searches will involve a mixture of matches and misses, these two values were averaged to obtain an estimate of 0.16 for r. Since k is equal to the integer part of (1-s)/r, the estimate of k obtained by this method is also 4.
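The arithmetic behind this second estimate can be checked directly; the following lines are only an illustrative sketch using the values quoted above.

    # Illustrative recomputation of the second estimate of k (not the authors' code).
    s = 0.34              # average cutoff similarity for 5 neighbors in the ADI collection
    r = 0.16              # (0.25 + 0.06) / 2, rounded as in the text
    k = int((1 - s) / r)  # integer part of (1 - s) / r
    print(k)              # 4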

5. Conclusions

For similarity-based document retrieval systems, this algorithm has advantages over both sequential search and clustered search. It offers a middle ground in terms of retrieval quality and search time.

This algorithm retrieves the same items as a sequential search, but the average search length is O((log2 N)^k) rather than O(N). If k is approximately 4 in a large collection, as it was in the ADI collection, only a small fraction of the entire collection would need to be searched.

The algorithm described here requires more search time than a clustered search, both for the average case and for the worst case. However, it does a better job of retrieval. The items retrieved are actually the nearest neighbors to the query, rather than near neighbors as for a clustered search. Also, as the document collection grows, degradation for a clustered search would occur in retrieval quality rather than in search length. The tree-structured algorithm would still retrieve the nearest neighbors, so the speed, rather than the retrieval quality, would suffer.

References

Burkhard, W.A. and Keller, R.M. "Some approaches to best-match file searching". Communications of the ACM, Vol. 16, No. 4, April 1973, pp. 230-236.

Eastman, Caroline M. "A tree algorithm for nearest neighbor searching in document retrieval systems". Ph.D. Dissertation, University of North Carolina at Chapel Hill, 1977.

Finkel, R.A. and Bentley, J.L. "Quad trees: a data structure for retrieval on composite keys". Acta Informatica, Vol. 4, No. 1, 1974, pp. 1-9.

Friedman, J.H., Bentley, J.L. and Finkel, R.A. "An algorithm for finding best matches in logarithmic expected time". ACM Transactions on Mathematical Software, Vol. 3, No. 3, Sept. 1977, pp. 209-226.

Fukunaga, Keinosuke and Narendra, Patrenahalli M. "A branch and bound algorithm for computing k-nearest neighbors". IEEE Transactions on Computers, Vol. C-24, No. 7, July 1975, pp. 750-753.

Keen, E.M. "An analysis of the documentation requests". In Salton, G., editor, The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

Rivest, R.L. Analysis of Associative Retrieval Algorithms. Stanford Computer Science Department, Report STAN-CS-74-415, 1974.

Salton, G. Automatic Information Organization and Retrieval. McGraw-Hill Book Company, New York, New York, 1968.

Salton, G., editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

Salton, G. Dynamic Information and Library Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1975.

Yu, C.T. and Luk, W.S. "Analysis of effectiveness of retrieval in clustered files". Journal of the ACM, Vol. 24, No. 4, Oct. 1977, pp. 607-622.