A TREE ALGORITHM FOR NEAREST NEIGHBOR SEARCHING IN DOCUMENT RETRIEVAL SYSTEMS

Caroline M. Eastman
Department of Mathematics
Florida State University
Tallahassee, Florida 32306

Stephen F. Weiss
University of North Carolina
Chapel Hill, North Carolina
Abstract

The problem of finding nearest neighbors to a query in a document collection is a special case of associative retrieval, in which searches are performed using more than one key. A nearest neighbor associative retrieval algorithm, suitable for document retrieval using similarity matching, is described. The basic structure used is a binary tree, in which at each node a set of keys (concepts) is tested to select the most promising branch. Backtracking to initially rejected branches is allowed and often necessary. Under certain conditions, the search time required by this algorithm is O((log2 N)^k), where N is the number of documents and k is a system-dependent parameter. A series of experiments with a small collection confirms the predictions made using the analytic model; k is approximately 4 in this situation. This algorithm is compared with two other searching algorithms: sequential search and clustered search. For large collections, the average search time for this algorithm is less than that for a sequential search and greater than that for a clustered search. However, the clustered search, unlike the sequential search and this algorithm, does not guarantee that the near neighbors found are actually the nearest neighbors.
1. Introduction

This paper describes a searching algorithm that can be used in document retrieval systems and that appears to have some advantages over currently suggested methods. It could also be used in other similar searching problems.

The type of document retrieval system considered here is that described by Salton (1975) and used in the SMART system. In such a system, a set of concepts is used to classify both the documents and the queries in order to represent them as concept vectors. A similarity measure is used to compare documents and queries in concept vector form in order to select those documents best matching the queries; this approach may be contrasted with the use of Boolean matching.

The searches involved in such a system require the use of more than one search key; they are thus multi-attribute, or associative, searches. Since each concept is used as a key, the search space is of high dimension. Such a search looks for those documents which most closely match the query, according to any one of a variety of similarity measures (e.g. cosine, Euclidean distance). So it is a best-match search rather than an exact-match search or a partial-match search. Several documents are generally retrieved in response to a query; the search is thus an m nearest neighbor search rather than simply a nearest neighbor search.
The standard searching methods for document retrieval are sequential search and inverted file search. The sequential search is straightforward, but in large collections it is very time-consuming to examine the entire collection. The inverted file search is often used to provide quick searches in systems using Boolean matching; with somewhat more effort, it can be used in similarity-based searching as well. Inverted file searches become rather inefficient when requests have many concepts (Salton, 1968).

Clustered files are often suggested as a way to cut down search time in similarity-based systems. In such an organization, similar documents are grouped together in clusters, and only the most promising clusters are examined. Although such an approach can drastically reduce the search time, the near neighbors retrieved by a clustered file search are not always the nearest neighbors to the query. An overview of clustering methods is given in Salton (1975). Yu and Luk (1977) describe a model which provides an estimate of the number of nearest neighbors missed in a clustered search.

Rivest (1974) demonstrates the optimality of a particular hashing algorithm for a restricted class of best-match searches. However, he has assumed that the attributes are binary, that the records are randomly distributed, and that only one nearest neighbor is to be found; none of these conditions is met here. Various tree-based methods have been suggested for retrieving nearest neighbors. Quad trees (Finkel and Bentley, 1974) and k-d trees (Friedman, Bentley, and Finkel, 1977) are fine for small numbers of keys, but become impractical when lots of keys are involved. Burkhard and Keller (1973) and Fukunaga and Narendra (1975) describe tree structured algorithms rather similar to the one suggested here; however, these two algorithms depend on the use of a distance metric rather than a similarity measure for matching.
2. Description

Each document in a collection is represented by a concept vector. Each query is also represented as a concept vector. The m documents most similar to the query are to be chosen. A similarity measure ranging between 0 and 1 is used to compare documents and queries. (A similarity of 1 means that the document and query are identical, and a similarity of 0 means that the document and the query contain no concepts in common.) The search algorithm described does not depend on the use of a particular similarity measure.

The document collection is organized as a binary tree. Associated with each internal node of the tree is a set of concepts. The documents are stored in buckets at the leaves of the tree; each document resides in exactly one bucket. A document is inserted into the subtree rooted at a particular node as follows. The document is compared to the concept set associated with the node. If the intersection between the two is empty, then the document is inserted recursively into the subtree whose root is the left child of the original node. If the intersection is non-empty, then the document is inserted in the right subtree. This compare and descend process continues until a leaf node is reached; the document is then placed in the bucket associated with that leaf node. Thus, for each node, all documents that are in the right subtree contain at least one concept in common with that node's concept set. All left-descendent documents contain none of that node's concepts.

The time required to insert one document into a tree of height L is O(L). The maximum reasonable value of L for a collection of size N is log2 N. Such a tree would have an average bucket size of one; any higher tree would be likely to contain empty buckets. With L roughly log2 N, building the tree for a collection of N documents is thus at worst O(N log2 N).

The search is a similar process. The query is compared to the concept set of the root of a tree. If the intersection is empty, then the left subtree is more promising; if the query and the concept set have concepts in common, then the right subtree is more promising. The search descends into the more promising subtree. In either case, a bound can be put on the similarity between the query and the documents in a particular subtree, and hence a guarantee on the near neighbors that could be found there. The bound can be determined by considering the nearest possible neighbor for that subtree, given the concept sets to be found on the path to the subtree. The nearest possible neighbor need not actually be present in the subtree; it is a theoretical bound. The similarity between the query and the nearest possible neighbor will be referred to as the similarity bound for that subtree. Of course, the calculation of the similarity bound will depend on the particular similarity measure used.

The similarity bound for the entire tree is 1, since the nearest possible neighbor is one which exactly matches the query.
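The tree organization and insertion rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code; the class and field names are invented here, and concept vectors are modeled as sets of concept identifiers.

```python
# Sketch of the document tree described above (names are illustrative).
# Internal nodes carry a concept set; documents sit in buckets at the leaves.

class Node:
    def __init__(self, concepts=None, left=None, right=None):
        self.concepts = concepts   # concept set tested at an internal node
        self.left = left           # documents sharing no concept with the set
        self.right = right         # documents sharing at least one concept
        self.bucket = []           # document ids; used only at leaves

    def is_leaf(self):
        return self.left is None and self.right is None

def insert(node, doc_concepts, doc_id):
    """Compare-and-descend insertion: one node per level, so O(L) per document."""
    if node.is_leaf():
        node.bucket.append(doc_id)
    elif node.concepts & doc_concepts:   # non-empty intersection: go right
        insert(node.right, doc_concepts, doc_id)
    else:                                # empty intersection: go left
        insert(node.left, doc_concepts, doc_id)
```

Each document lands in exactly one bucket, and right-descendant documents always share a concept with every concept set on their path, as required by the construction above.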
After the query is compared to the concept set at the root, a similarity bound for both subtrees can be calculated. Suppose the query and the concept set have c concepts in common. Then all documents in the left subtree are missing at least c of the concepts in the query. The nearest possible neighbor along the left path is thus a document that matches all query concepts but those c. The nearest possible neighbor for the right subtree is still an exact match to the query; the similarity bound for that subtree remains equal to 1.

If the query and the concept set have no concepts in common, all documents in the right subtree have at least one concept that was not requested. The nearest possible neighbor in the right subtree is thus a document that has all the query concepts plus one extra. The nearest possible neighbor for the left subtree is an exact match to the query.

Whether or not the query matches any of the concepts at the root node, the similarity bound for the less promising subtree can be calculated by determining the similarity between the query and its nearest possible neighbor. The similarity bound for the more promising subtree remains 1.

Similarity bounds for subtrees of nodes below the root can be determined in much the same way. When a concept set is matched, the nearest possible neighbor and similarity bound for the right subtree remain the same. The matched concepts are not present in documents in the left subtree, so the nearest possible neighbor for the left subtree is the previous nearest possible neighbor without the matched concepts. The similarity bound for this subtree is the similarity of the query with this nearest possible neighbor.

When a concept set is not matched, the nearest possible neighbor and similarity bound for the left subtree remain the same. An extra concept is added to the nearest possible neighbor for that subtree to obtain the nearest possible neighbor for its right subtree. The similarity bound can then be determined from this nearest possible neighbor.

The search process descends the tree until a leaf is reached. All documents in the associated bucket are compared with the query. If m nearest neighbors have been requested, the best m are saved. As before, the process of descending from the root to a bucket requires O(L) time.

Backtracking now occurs. If the similarity bound of any subtree encountered is greater than the similarity with the mth nearest neighbor encountered so far, that subtree is examined. There may be documents associated with it which are better than those already found. The search is over when the root of the tree is reached.

The search algorithm is summarized in Figure 1. The search is set in motion by calling SEARCH(Query, number of nearest neighbors, root of entire tree).
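Under the cosine measure, the root-level bound calculation described above can be made concrete. A minimal sketch (the function names are mine, not the paper's): for binary concept vectors stored as sets, the cosine of query q and document d is |q ∩ d| / sqrt(|q| · |d|), and the nearest possible neighbor is built exactly as the text describes.

```python
import math

def cosine(a, b):
    """Cosine similarity of two binary concept vectors given as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def root_bounds(query, concept_set):
    """Similarity bounds (left, right) for the root's subtrees, following
    the nearest-possible-neighbor argument above."""
    matched = query & concept_set
    if matched:
        # Left-subtree documents are missing the matched concepts, so the
        # nearest possible neighbor matches all query concepts but those.
        return cosine(query, query - matched), 1.0
    # Right-subtree documents carry at least one unrequested concept, so the
    # nearest possible neighbor is the query plus one extra concept.
    return 1.0, cosine(query, query | {object()})
```

For a 4-concept query matching one root concept, the left bound is 3/sqrt(12), about 0.87, while the right bound stays at 1.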
Figure 1: Search Algorithm

SEARCH(QUERY, M, ROOT)
    IF (ROOT is a leaf) THEN
        Compare each document in the associated bucket with the QUERY.
        Merge the results with the M best documents seen so far;
        keep the best M.
    IF (ROOT concept set ∩ QUERY ≠ ∅) THEN
        Calculate the similarity bound for the left branch.
        CALL SEARCH(right child of ROOT).
        IF (the similarity bound for the left child is greater than
            the similarity of the Mth best document found so far) THEN
            CALL SEARCH(left child of ROOT).
    IF (ROOT concept set ∩ QUERY = ∅) THEN
        Calculate the similarity bound for the right branch.
        CALL SEARCH(left child of ROOT).
        IF (the similarity bound for the right child is greater than
            the similarity of the Mth best document found so far) THEN
            CALL SEARCH(right child of ROOT).
END SEARCH
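Figure 1 can be turned into a small running program. The sketch below is one possible realization rather than the paper's code: the tree is a nest of dictionaries, the nearest possible neighbor is carried down the recursion as a set, and the cosine measure supplies both the document similarities and the bounds.

```python
import math

def cosine(a, b):
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def mth_best(best, m):
    """Similarity of the m-th best document found so far (0 if fewer than m)."""
    return best[m - 1][0] if len(best) >= m else 0.0

def search(query, m, node, npn, best):
    """SEARCH of Figure 1.  `npn` is this subtree's nearest possible
    neighbor; `best` accumulates (similarity, doc_id) pairs, best first."""
    if node["concepts"] is None:                 # leaf: scan the bucket
        best.extend((cosine(query, doc), doc_id)
                    for doc_id, doc in node["bucket"])
        best.sort(key=lambda pair: -pair[0])
        del best[m:]                             # keep only the best m
        return
    matched = query & node["concepts"]
    if matched:
        # Right subtree keeps the current bound; the left subtree's nearest
        # possible neighbor loses the matched concepts.
        left_npn = npn - matched
        search(query, m, node["right"], npn, best)
        if cosine(query, left_npn) > mth_best(best, m):
            search(query, m, node["left"], left_npn, best)
    else:
        # Left subtree keeps the current bound; the right subtree's nearest
        # possible neighbor gains one extra, unrequested concept.
        right_npn = npn | {("extra", id(node))}
        search(query, m, node["left"], npn, best)
        if cosine(query, right_npn) > mth_best(best, m):
            search(query, m, node["right"], right_npn, best)
```

Calling search(q, m, root, q, []) starts with the whole-tree bound of 1. The pruning test mirrors the figure's "greater than" comparison, so subtrees whose bound is 0 are never entered.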
3. Analysis

If some restrictive but reasonable assumptions are made, the average time complexity of this algorithm can be determined. These assumptions are:

(1) The document collection is organized into a full binary tree of height L.

(2) Each bucket contains N/2^L documents.

(3) The cutoff similarity, s, for the m nearest neighbors is the same for all queries. In other words, the m nearest neighbors to a query have similarities greater than or equal to s; all other documents have similarities less than or equal to s. (In a real situation, the value of s would almost certainly not be the same for all queries.)

(4) At any node, the reduction in the similarity bound for the subtree not initially chosen is r. (This assumption is clearly only a rough approximation to the actual situation.)

(5) Backtracking through the tree occurs in similarity bound order. That is, if the subtrees yet to be examined have similarity bounds s1, s2, ..., sj, then the subtree examined next is the one with bound equal to MAX(s1, s2, ..., sj).

If these assumptions are made, then the number of buckets to be examined in a search can be shown to be Π_{i=1}^{k} (L/i), where k is equal to ⌊(1-s)/r⌋. If k is less than L/2, this quantity is O((log2 N)^k) and may be taken as the average time complexity.
(Details of this derivation are given in Eastman (1977).)

Clearly this algorithm will be of practical interest only when k is small. Large values of s simply mean that the nearest neighbors to be found are very similar to the query; such large values of s (near 1) will result in small values of k. Large values of r mean that the similarity bounds are rapidly reduced, eliminating entire subtrees from further consideration; large values of r (near 1) will also result in small values of k. The value of r, unlike that of s, does depend on the particular tree used.

The behavior of this algorithm is intermediate between that of an O(log2 N) algorithm and that of an O(N) algorithm. For any given k, (log2 N)^k grows more slowly than N as N grows. If k is large, however, (log2 N)^k may be so enormous that the values of N for which it is smaller than N are much larger than any that would arise in a practical situation. Figure 2 shows a comparison of the growth of N and (log2 N)^4; k = 4 was used because that was the value found in the experimental situation (described in the next section). This graph illustrates the general shape of the two curves.

A more precise estimate of the expected search length can be obtained by dividing the number of buckets to be searched, Π_{i=1}^{k} (L/i), by the total number of buckets, 2^L. Figure 3 shows the predicted scan fractions for several values of N. (These figures do not include time spent searching through the tree; this time should be roughly proportional to the scan fraction.) Although the algorithm performs poorly in small collections, it should do well in large ones.
Figure 2: Growth of N and (log2 N)^4

[Graph: log10 N and log10 (log2 N)^4 plotted against log10 N, for log10 N from 2 to 10.]

Figure 3:
Some predicted scan fractions (bucket size = 1)

    N            Scan Fraction
                 k=3       k=5
    128          0.50      0.94
    1,024        0.17      0.72
    1,048,576    0.0013    0.021
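The predicted scan fractions come from the expression in the text: the product for i = 1 to k of (L/i), divided by the 2^L buckets. A sketch follows (the function name is mine, and the fraction is capped at 1); the tabulated values in Figure 3 follow the fuller derivation in Eastman (1977), so this simple product only approximates the smallest collections (it saturates at 1.0 where the table shows 0.94).

```python
import math

def predicted_scan_fraction(n_docs, k):
    """Approximate fraction of buckets examined for a full binary tree with
    an average bucket size of one (height L = log2 N)."""
    L = int(math.log2(n_docs))
    buckets_searched = 1.0
    for i in range(1, k + 1):
        buckets_searched *= L / i
    return min(1.0, buckets_searched / 2 ** L)
```

For N = 1,048,576 and k = 3 this gives roughly 0.0013, in line with Figure 3; the slow growth of the product with N is the O((log2 N)^k) behavior.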
4. Experimental Results

Several experiments using the search algorithm described in the previous section were performed. These experiments were directed towards three related questions. First, how should the concept sets be chosen? Second, what value of k would be found in the actual retrieval situation? Third, how useful is the rough analysis described in the previous section?

The experiments described here used the American Documentation Institute (ADI) collection (Keen, 1971). The ADI collection contains 82 documents and 35 queries in concept vector form. It has been used in many document retrieval experiments based on the concept vector model.

As is generally true when there are a large number of factors influencing an outcome, it is infeasible to try all combinations of possible values for these variables. So a reasonable (but arbitrary) set of base conditions was chosen. These conditions are described in the next paragraph.

The 5 nearest neighbors are retrieved for each of the queries; this number was chosen because there are an average of 5 relevant documents per query for this collection. (In general, the fraction of the collection that is retrieved will be much smaller than the 6.1% retrieved here.) The concept tree used is a full binary tree of height 6; this height gives the average bucket size closest to 1 for a collection of 82 documents. Concept sets at the same level of a tree are identical; it is easier to allow some duplication of concepts. The similarity measure used is the cosine.

The experimental conditions approximated the analytic assumptions described in the previous section. The tree used was a full binary tree, but the number of documents per bucket varied. The subtrees were searched in similarity bound order. The average value of the cutoff similarity, s, the similarity ceiling for 5 documents, was 0.34. The reduction in similarity bound, r, varied with the direction chosen and the level of the node.
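The choice of height 6 can be checked in a couple of lines (an illustrative calculation; the paper simply states the result): pick the height L whose average bucket size N/2^L is closest to one.

```python
def best_height(n_docs, max_height=20):
    """Tree height whose average bucket size n_docs / 2**L is closest to 1."""
    return min(range(1, max_height + 1),
               key=lambda L: abs(n_docs / 2 ** L - 1))
```

best_height(82) returns 6, since 82/64 = 1.28 is nearer to 1 than 82/128 = 0.64.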
u s e d to c o m p a r e
age scan fraction, to a query. traversal,
the total w o r k
is important,
sets is c r i t i c a l
is,
involved,
is the averin r e s p o n s e
including
the tree
it should be r o u g h l y p r o p o r t i o n a l
it is o b v i o u s
to the p e r f o r m a n c e
c h a n c e of a m a t c h b e t w e e n
tunately,
searched
it is not clear w h a t the o p t i m a l
collection
the a v e r a g e
searches
to the
for trees of equal height.
Even though ticular
the f r a c t i o n of d o c u m e n t s
Although
scan f r a c t i o n
the d i f f e r e n t
r
more
sophisticated
the p r o b a b i l i t y
that the s e l e c t i o n of c o n c e p t of this algorithm.
a q u e r y and a c o n c e p t
and thus d e c r e a s e
the a v e r a g e
choice methods
of such m a t c h e s
tree for a par-
A greater
set should increase s e a r c h time.
designed
Unfor-
to i n c r e a s e
involve c o r r e s p o n d i n g l y
greater
effort. Several m e t h o d s used to c o n s t r u c t
of i n c r e a s i n g
concept
c e p t sets of d i f f e r e n t best performance In each case, source.
trees.
sohpistication
For each method,
sizes w e r e c o n s t r u c t e d )
trees w i t h con-
the size g i v i n g
the
for each m e t h o d was used in the final c o m p a r i s o n .
the c o n c e p t s
u s e d in the ADI q u e r i e s w e r e used as a
(The same q u e r i e s w e r e u s e d
for searching.
and o v e r h e a d w e r e
Of course,
for c o n s t r u c t i n g
a production
implementation
the tree and w o u l d have
to use some sample of queries.) The s i m p l e s t m e t h o d of c o n c e p t c h o i c e 143
is r a n d o m selection.
A
somewhat more concept
sophisticated
that occur m o s t o f t e n
to form the c o n c e p t there
sets.
uses
concept
frequencies;
in the sample of queries
Since
the m o s t c o m m o n
the
are c h o s e n
concepts
are used,
should be m o r e matches.
Use of concepts highly correlated with each other in concept sets should increase the probability of multiple matches with concept sets and decrease the search length. Two possible algorithms using correlations were used. The first method forms the sets one at a time by taking the most common unused concept to start a set and then selecting those unused concepts most highly correlated with it. By adding one element to each set in turn, the second selection algorithm avoids the possibility that the very highest frequency concepts will all be in the same set.

A more sophisticated way of using the correlation information is to cluster the concepts and then choose concept sets from those clusters. Three of the many possible clustering algorithms were tried. Two clustered the concepts using the single-link clustering criterion until a cutoff similarity is reached and then selected clusters for concept sets. One algorithm, referred to as frequency-ordered clustering, selected those clusters containing the most common concepts. A second algorithm, referred to as size-ordered clustering, selected those clusters containing the most concepts. The third clustering algorithm uses the most common concepts as seeds to start the clusters. If any two of the potential seed concepts are highly correlated with each other, they are put in the same cluster. Then single-link clustering is used to bring the clusters up to the desired size.

The results from searches using each of these methods are shown below in Figure 4.
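The single-link criterion used by the first two clustering methods can be sketched as follows. This is an illustrative, unoptimized implementation rather than the one used in the experiments; `sim` stands for whatever concept-correlation measure is available.

```python
def single_link_clusters(items, sim, cutoff):
    """Agglomerative single-link clustering: repeatedly merge the two
    clusters whose most-similar cross pair is highest, stopping once that
    similarity falls below the cutoff."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < cutoff:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters
```

With a cutoff of 0.55, as in Figure 4, clustering stops as soon as no pair of clusters shares a correlation that high.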
Figure 4: Scan Fraction Using Different Methods of Concept Choice

    Method                        Scan Fraction   Relevant Parameter
    Random                        0.97            8 concepts/set
    Most Common                   0.92            4 concepts/set
    Correlated (version 1)        0.89            8 concepts/set
    Correlated (version 2)        0.89            8 concepts/set
    Frequency-Ordered Clustered   0.94            Cutoff similarity = 0.55
    Size-Ordered Clustered        0.94            Cutoff similarity = 0.55
    Most Common Clustered         0.89            8 concepts/set
Figure 5: Increase in Scan Fraction with Number of Nearest Neighbors Retrieved

    m     Average s   Scan Fraction
    1     0.52        0.80
    2     0.43        0.83
    3     0.39        0.86
    4     0.37        0.87
    5     0.34        0.89
    10    0.27        0.93
    30    0.14        0.94
    82    0.00        0.94
A Friedman analysis of variance was used to test the null hypothesis that the concept selection method does not affect the scan fraction. The 35 scan fractions (one for each query) were calculated for the 7 different methods in each case. The test statistic (χr²) was 20.7; the level of significance for this value is about 0.002. So it is reasonable to conclude that the various methods differ significantly. The three methods using correlation information had the lowest best scan fraction, 0.89. These results are in accord with the prediction, based on the analytic model, that the methods incorporating correlation information should give the best results. Probably the Most Common Clustered method, which uses the most common concepts to start the sets and then fills them out with highly correlated concepts, would be a reasonable approach in practice.

The influence of the value of s upon the search length can be seen by looking at the scan fractions obtained when different numbers of nearest neighbors are retrieved; increasing m decreases the value of s. A series of searches retrieving different numbers of nearest neighbors was performed using the tree with 8 concepts per set. The number of nearest neighbors actually retrieved would be less than m if there were fewer than m documents with similarity greater than 0 to the query. The results are shown in Figure 5.

The scan fraction rises sharply from 0.80, the minimum obtained when 1 nearest neighbor is retrieved, to a plateau of 0.94. When m is 10, the average scan fraction is already 0.93; this value is close to the plateau, which is reached when m is 30. The maximum scan fraction is less than 1 because documents in subtrees with similarity bounds of 0 are never retrieved; the algorithm need not search such subtrees.
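The reported statistic can be recomputed from the standard formula chi_r^2 = 12/(n k (k+1)) * sum_j R_j^2 - 3 n (k+1), where R_j is the rank sum of method j over the n queries. The sketch below is illustrative; it breaks ties arbitrarily, where the full test would assign average ranks.

```python
def friedman_statistic(scores):
    """scores[i][j] = scan fraction of method j on query i (lower is better).
    Returns the Friedman chi_r^2 statistic."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # rank 1 = best method
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))
```

When every query ranks the methods identically, the statistic reaches its maximum of n(k-1). With k - 1 = 6 degrees of freedom, the reported 20.7 corresponds to roughly the 0.002 significance level quoted above.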
As predicted, high scan fractions occur with relatively few documents in this collection. However, the behavior of the algorithm is such that much better results should be observed in large collections.

The results of the two experiments described here are in accord with predictions made on the basis of the analytic model. Several other experiments are described in Eastman (1977); the results of those experiments also agree with predictions made on the basis of the analysis. So it would appear that even the rough analysis described here can be useful in examining the behavior of this algorithm.

A rough value of k for the ADI collection can be estimated in two different ways. First, a value of k can be estimated from the observed scan fractions. Second, an estimate of k can be based on the estimated values of s and r. Both methods give an estimate of 4.

The expected scan fractions for a tree of height 6 can be obtained by dividing the estimated number of buckets searched, Π_{i=1}^{k} (L/i), by the number of buckets, 2^L. The lowest scan fraction obtained when 5 neighbors are retrieved in the ADI collection is 0.89; this is the scan fraction predicted when k is 4.

The average query length of an ADI query is 4. The reductions in similarity bound occurring for 1 to 4 unmatched concepts and for 1 to 6 extra concepts for a query of this length were calculated.
The average
r
The average
r
when matches
when matches do not occur
most searches will involve a mixture of matches two values were averaged Since
k
is equal to
to obtain an estimate
l-s [--7]
, the estimate of
occur
is 0.06.
and misses, for k
r
of
Since these
0.16.
for this method
is also 4.
5. Conclusions

For similarity-based document retrieval systems, this algorithm offers a middle ground in terms of retrieval quality and search time. It has advantages over both sequential search and clustered search.

This algorithm retrieves the same items as a sequential search, but the average search length is O((log2 N)^k) rather than O(N). If k is approximately 4 in a large collection, as it was in the ADI collection, only a small fraction of the entire collection would need to be searched.

The algorithm described here requires more search time than a clustered search, both for the average case and for the worst case. However, it does a better job of retrieval. The items retrieved are actually the nearest neighbors to the query, rather than near neighbors. Also, as the document collection grows, the retrieval degradation for a clustered search would occur in retrieval quality rather than search length. The tree-structured algorithm would still retrieve the nearest neighbors, so the speed, rather than retrieval quality, would suffer.
References
Burkhard, W.A. and Keller, R.M. "Some approaches to best-match file searching". Communications of the ACM, Vol. 16, No. 4, April 1973, pp. 230-236.

Eastman, Caroline M. "A tree algorithm for nearest neighbor searching in document retrieval systems". Ph.D. Dissertation, University of North Carolina at Chapel Hill, 1977.

Finkel, R.A. and Bentley, J.L. "Quad trees: a data structure for retrieval on composite keys". Acta Informatica, Vol. 4, No. 1, 1974, pp. 1-9.

Friedman, J.H., Bentley, J.L. and Finkel, R.A. "An algorithm for finding best matches in logarithmic expected time". ACM Transactions on Mathematical Software, Vol. 3, No. 3, Sept. 1977, pp. 209-226.

Fukunaga, Keinosuke and Narendra, Patrenahalli M. "A branch and bound algorithm for computing k-nearest neighbors". IEEE Transactions on Computers, Vol. C-24, No. 7, July 1975, pp. 750-753.

Keen, E.M. "An analysis of the documentation requests". In Salton, G., editor, The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

Rivest, R.L. Analysis of Associative Retrieval Algorithms. Stanford Computer Science Department, Report STAN-CS-74-415, 1974.

Salton, G. Automatic Information Organization and Retrieval. McGraw-Hill Book Company, New York, New York, 1968.

Salton, G., editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

Salton, G. Dynamic Information and Library Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1975.

Yu, C.T. and Luk, W.S. "Analysis of effectiveness of retrieval in clustered files". Journal of the ACM, Vol. 24, No. 4, Oct. 1977, pp. 607-622.