Evaluating Database Selection Techniques: A Testbed and Experiment

James C. French, Allison L. Powell
Department of Computer Science, University of Virginia, Charlottesville, VA
[email protected]

Charles L. Viles†
School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC
[email protected]

Travis Emmitt, Kevin J. Prey
Department of Computer Science, University of Virginia, Charlottesville, VA
[email protected]

Abstract

We describe a testbed for database selection techniques and an experiment conducted using this testbed. The testbed is a decomposition of the TREC/TIPSTER data that allows analysis of the data along multiple dimensions, including collection-based and temporal-based analysis. We characterize the subcollections in this testbed in terms of number of documents, queries against which the documents have been evaluated for relevance, and distribution of relevant documents. We then present initial results from a study conducted using this testbed that examines the effectiveness of the gGlOSS approach to database selection. The databases from our testbed were ranked using the gGlOSS techniques and compared to the gGlOSS Ideal(l) baseline and a baseline derived from TREC relevance judgements. We have examined the degree to which several gGlOSS estimate functions approximate these baselines. Our initial results confirm that the gGlOSS estimators are excellent predictors of the Ideal(l) ranks but that the Ideal(l) ranks do not estimate relevance-based ranks well.
This work supported in part by DARPA contract N66001-97-C-8542 and NASA GSRP NGT5-50062.
† This work supported in part by DARPA contract N66001-97-C-8542.

1 Introduction

As information resources proliferate on internets and intranets, algorithms for database selection, distributed searching, and results merging become increasingly important. A number of researchers have been working on issues in distributed information retrieval systems, and much of this work has been done in connection with TREC using the TIPSTER data.¹ Other large corpora, such as USENET news groups, have also been used.

Distributed IR research encompasses many important problems, such as the following:

- database or collection selection [3, 7, 6, 9];
- collection fusion or results merging [1, 3, 4, 15, 14]; and
- dissemination of collection information to increase retrieval effectiveness [5, 11, 10, 12].

An examination of Table 1 shows the variety of test environments employed by researchers and gives some insight into why it is often impossible to compare findings from different research efforts. Several investigations centering on one or more of these areas have been undertaken, but in general it is not possible to compare the results of these inquiries in a meaningful way. In this paper we describe a specific testbed which provides a standardized test environment with a conformant evaluation methodology. In addition, we describe preliminary results of a comprehensive investigation that we are undertaking to evaluate collection selection algorithms within this testbed.

2 The Testbed

Our testbed is based on the TIPSTER data used in the TREC conferences. We decompose the large collections into smaller subcollections that serve as hypothetical "sites" in our distributed information retrieval test environment. Specific details of the decomposition, as well as a discussion of some characteristics of this environment, follow.
¹ A description of this data can be found in Harman [8], for example.
Group                   Sources                   DB    Queries
Gravano et al. [6]      news groups               53    6,800 user
Moffat & Zobel [9]      TREC (by source, disk)     9    51-150
Viles & French [10]     TREC-CatB (random)        20    201-250
Vorhees et al. [14]     TREC (by source)           5    1-200
Callan et al. [3]       TREC (by source, disk)    17    51-150
Walczuch et al. [16]    TREC (by source)           5    1-100
Fox et al. [4]          TREC (by source)           5    1-100
Vorhees [13]            TREC (source, month)      98    251-300
Table 1: Summary attributes of distributed document collections that have been used in a sampling of previous work.
2.1 Test Data
For effectiveness experiments, the possible sources of data are relatively limited. Because relevance judgements are needed, researchers are limited to the traditional IR test collections and the TREC/TIPSTER data. On the other hand, if relevance judgements are not needed, then the number of data sources is potentially much greater. Because we are interested in both efficiency and effectiveness, the TREC/TIPSTER data is the only realistic starting point. The main problem to address was how to partition the data into subcollections. Our requirements for this included:

A natural partition. Earlier experiments involving parameter-driven creation of document collections ([11]) were illuminating, but the partitions themselves did not reflect any physical (i.e., time or source) attribute of the source data. In addition, the TREC data is often referenced in terms of source : disk number, and has often been subdivided using one or both of those attributes [3, 16, 4, 13]. To the extent possible, a candidate partition should not obscure these other, more coarse-grained possibilities.

At least 100 subcollections. We feel realistic experiments must at least involve distributed document collections with hundreds of participating subcollections.

A temporal dimension. Because we are specifically interested in examining differences in document collections over time, we wanted the documents to be partitioned by date.

Easy composition of "supercollections" from components. As much as possible, we wanted to create a partition of the data from which easily identifiable compositions could be created. For example, all of the documents in the twelve subcollections labeled AP.90.01 through AP.90.12 represent the AP documents on disk 3. This will be particularly helpful for those who are used to the standard TREC referencing that uses source and disk number.

We started with the data available to participants in the TREC-4 [8] experiments. Gross characteristics of this data appear in Table 2. To summarize, this data is approximately 3 GB of text spread over several years and from seven (7) primary sources: AP Newswire (AP), Wall Street Journal (WSJ), Computer Select (ZIFF), the Patent Office (PAT), San Jose Mercury News (SJMN), Federal Register (FR), and Department of Energy (DOE).

Disk   Source         Size (MB)   Size (docs)
1      WSJ (86-89)        270        98,732
       AP (89)            259        84,678
       ZIFF               245        75,180
       FR (89)            262        25,960
2      WSJ (90-92)        247        74,520
       AP (88)            241        79,919
       ZIFF               178        56,920
       FR (88)            211        19,860
3      AP (90)            242        78,321
       SJMN (91)          290        90,257
       PAT                245         6,711
Totals                  2,690       691,058

Table 2: Summary characteristics of TREC data on disks 1, 2, and 3. ZIFF from disk 3 and DOE omitted. (From Harman, TREC-4.)

Much of the TREC data is from news sources and so has easily identifiable date components. The one undated collection is the set of documents from DOE. The net result of combining the particular attributes of the TREC data with our own requirements was a partition comprised of 236 document collections derived from some, but not all, of TREC disks 1, 2, and 3. Summary characteristics of this partition are given in Table 3.
Disk   Source         Num. DB   Date Range     Total DB
1      WSJ (86-89)       29     12/86-11/89
       AP (89)           12     01/89-12/89
       ZIFF              14     11/89-12/90
       FR (89)           12     01/89-12/89
       DOE               XX     XX                 67
2      WSJ (90-92)       22     04/90-03/92
       AP (88)           11     02/88-12/88
       ZIFF              11     01/89-11/89
       FR (88)           10     01/88-12/88        54
3      AP (90)           12     01/90-12/90
       SJMN (91)         12     01/91-12/91
       ZIFF              XX     XX
       PAT               92     06/82-08/92       116

Table 3: Summary characteristics of the document partition. Note: the Total DB column sums to 237 because there is overlap of one database in ZIFF between disks 1 and 2 (ZIFF.89.11).

Subject to the exceptions and special cases noted below, we split the available data by source and month into 236 smaller collections.² Specific comments about the partition follow.

² The unambiguous document id → collection id mapping is available from the authors.

No DOE documents. The DOE data is undated.

No ZIFF documents from TREC disk 3. Some of the data from ZIFF on disk 3 overlaps temporally with data from ZIFF on disks 1 and 2. We considered placing these documents with the others from the same month, but this would have involved considerable intermingling of disk 3 documents with those from disks 1 and 2. While certainly possible, it would have violated the composability requirement given previously.

Vast size heterogeneity. There are several dozen very small collections, 1 to 20 documents in size. These are mainly derived from early PAT data.

Fixed date increment of one month. However, many sources only contain partial years, so the data is not continuous temporally. This is an attribute of the underlying TREC corpus.

Disambiguation of dates in ZIFF. A small number of the documents in the ZIFF subcollections are dated as "Summer", "Winter", etc., rather than by month. In these cases, we determined the date of the documents by looking at the dates of documents surrounding the document in question. Thus if a "Spring" document was immediately preceded by one dated "March", then we assigned the "Spring" document to the "March" subcollection for that year.

Disambiguation of multiple dates in PAT. The structure of the PAT documents is complex and often contains references to previously issued patents. To disambiguate, we used the first appearance of the "Application Filing Date" (AFD) field within a document as the operative date. Subsequent occurrences of this field within a document refer to other patent documents.

An equally viable, alternative partitioning strategy would have split the data into N equal-sized subcollections. This partitioning approach has attractive characteristics in that (1) it is easy to control the number of subcollections and (2) one confounding variable, database size, is held constant. We chose not to pursue this strategy because equal-sized temporal chunks are more appropriate for our research interests.

Coverage of the 236 document databases by source and date is given in Figure 1. Researchers who are interested in examining collection changes over time must be judicious in the subset of databases they use as the basis for their work. For example, there are several multi-month "holes" in the Wall Street Journal collections.
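To make the naming convention concrete, here is a minimal sketch of how a document's source and date map to a subcollection label of the form SOURCE.YY.MM; the function name is our own and the snippet is illustrative, not code distributed with the testbed.

```python
from datetime import date

def collection_id(source: str, doc_date: date) -> str:
    """Label a subcollection by source, two-digit year, and month, e.g. AP.90.01."""
    return f"{source}.{doc_date.year % 100:02d}.{doc_date.month:02d}"

# An AP story from January 1990 falls in subcollection AP.90.01 (disk 3); a ZIFF
# document dated only "Spring" would inherit the month of the dated document
# immediately preceding it (e.g. "March"), as described above.
print(collection_id("AP", date(1990, 1, 15)))   # AP.90.01
print(collection_id("WSJ", date(1987, 6, 2)))   # WSJ.87.06
```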
2.2 Test Queries
In many years, the TREC conference has introduced new document sets. In every year, new topic sets have been introduced, generally in batches of 50 topics per set. Thus, because of the evolutionary nature of the conference, relevance judgements are not available for all combinations of topic and document sets. Through TREC-4, there were a total of 250 topics with relevance judgements over some portion of the TREC documents. In Table 4 we summarize this coverage through TREC-4, including an indication of how subsets of disks map to the number of representative databases in our partition.

Source                        Topic Set                        Num.
Disk     1-50   51-100   101-150   151-200   201-250            DB
1          X       X         X         X                        67
2          X       X         X         X         X              54
3                  X         X                   X             116
1,2        X       X         X         X                       120
2,3                X         X                   X             170
1,2,3              X         X                                 236

Table 4: Coverage of topics over TREC data, disks 1-3, through TREC-4. Note: Database ZIFF.89.11 is drawn from both disks 1 and 2.

The topic coverage (Table 4) is important because it constrains the possible combinations of topic sets and document collections that can be used in a distributed retrieval experiment where relevance judgements are needed. For example, if you want to work with topics 201-250, then you cannot use any of the collections derived from disk 1; thus instead of 236 subcollections, only 54 + 116 = 170 subcollections could be used. Similarly, if you want to work with the San Jose Mercury News data (found on disk 3), then only three of the five topic sets are applicable. A final detail is that if you are using topic set 201-250, then the ZIFF subcollection from 11/89 (ZIFF.89.11) must be excluded because a portion of these documents come from disk 1 and thus do not have associated relevance judgements.
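The constraints in Table 4 are straightforward to encode. The following sketch is our own illustration (the `disks_of` mapping is assumed, not part of the testbed distribution): it keeps only the subcollections whose disks are all judged for the chosen topic set, which automatically handles the ZIFF.89.11 special case.

```python
# Disks with relevance judgements for each topic set, per Table 4.
JUDGED_DISKS = {
    "1-50":    {1, 2},
    "51-100":  {1, 2, 3},
    "101-150": {1, 2, 3},
    "151-200": {1, 2},
    "201-250": {2, 3},
}

def usable_collections(topic_set, disks_of):
    """Return the subcollections usable with a topic set.

    disks_of maps a subcollection label (e.g. "AP.90.01") to the set of disks
    it draws from; the split database ZIFF.89.11 maps to {1, 2}.  A collection
    is usable only if every disk it draws from is judged for the topic set.
    """
    judged = JUDGED_DISKS[topic_set]
    return [c for c, disks in disks_of.items() if disks <= judged]

# With topics 201-250, anything touching disk 1 is dropped, which excludes
# ZIFF.89.11 as well as all of the disk 1 subcollections.
```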
3 Database Selection Experiment

This section describes some very preliminary experiments in database selection using the testbed described above.
3.1 The Problem
Distributed searching can be decomposed into three fundamental activities: (1) choosing the specific databases to search; (2) searching the chosen databases; and (3) merging the results into a cohesive response. Although there is considerable interest in all these aspects, we focus specifically on the first activity. Callan et al. [3] call this the collection selection problem, while Gravano et al. [7] refer to it as the text database resource discovery problem. Both Callan et al. [3] and Gravano et al. [6] formulate the problem similarly. We first assume that we have a group of databases that are candidates for search whenever we process a query. Then, given a query, we rank the databases, that is, we decide in what order to search the databases in the group of candidates. We assume that there is some preferred order in which to search the databases, but the nature of that preferred order is cast differently by different researchers. Finally, there is an evaluation phase of the work in which the predicted ranks for queries are compared with the preferred orderings to decide how well the particular ranking methodology worked. Unfortunately, the nature of this comparison also differs from research group to research group. This point will be developed more fully in the section on evaluation below. In our experiments we investigate the gGlOSS [6] methodology in a different test environment and compare its performance to the standard proposed by its developers, the so-called Ideal(l) ranks, as well as to the standard used by Callan et al. [3], the so-called optimal ranks. These are described more fully later.
Figure 1: Document and query coverage for the testbed (236 subcollections, partitioned by original source and date).
3.2 gGlOSS
Gravano et al. [7] proposed GlOSS, the Glossary-of-Servers Server, as an approach to the database selection problem. GlOSS originally assumed a Boolean retrieval model but was later generalized to gGlOSS [6] to handle the vector space information retrieval model. gGlOSS assumes that the group of databases can be characterized according to their goodness with respect to any particular query. gGlOSS's job is then to estimate the goodness of each candidate database with respect to a particular query and then suggest a ranking of the databases according to the estimated goodness.
Goodness and Ideal Ranks
In [6], Gravano et al. make the following two assumptions:

1. all the databases in a group of databases employ the same algorithms to compute term weights and similarities; and
2. given a query q and a document d, d is only useful for q if sim(q, d) > l for a given threshold l.

They then define a notion of goodness for each database, db, as follows:

    Goodness(l, q, db) = \sum_{\{d \in db \,\mid\, sim(q,d) > l\}} sim(q, d)    (1)

Once Goodness(l, q, db) has been calculated for each database db with respect to q at threshold l, the ideal rank for the query at threshold l, Ideal(l), can be formed by sorting the databases in descending order of their goodness.
Note that gGlOSS does not compute Ideal(l). Ideal(l) is advanced as the goal to which gGlOSS ranks will be compared. The strategy employed by gGlOSS is to attempt to estimate the goodness of each database and thereby create a ranking. Since by hypothesis the database utility to the query is expressed by goodness, it is reasonable to measure the performance of gGlOSS by how well it estimates this goodness. That leaves open the question of how well goodness correlates to relevance.
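As a concrete illustration of Equation (1) and the Ideal(l) ranking, the sketch below assumes per-document similarity scores for the query are already available; the toy data and function names are ours, not part of gGlOSS.

```python
def goodness(l: float, sims: list) -> float:
    """Goodness(l, q, db): sum of sim(q, d) over documents with sim(q, d) > l (Eq. 1)."""
    return sum(s for s in sims if s > l)

def ideal_ranking(l: float, db_sims: dict) -> list:
    """Ideal(l): databases sorted by descending goodness for the query."""
    return sorted(db_sims, key=lambda db: goodness(l, db_sims[db]), reverse=True)

# Toy example: three databases with precomputed sim(q, d) values for one query.
db_sims = {
    "AP.90.01":  [0.9, 0.4, 0.05],
    "WSJ.87.06": [0.3, 0.25, 0.2, 0.2],
    "PAT.92.08": [0.6],
}
print(ideal_ranking(0.2, db_sims))   # ranking at threshold l = 0.2
```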
Goodness Estimators
gGlOSS needs two vectors of information from each database db_i in order to make its estimates:

1. the document frequency df_{ij} for each term t_j in db_i; and
2. the sum of the weight of each term t_j over all documents in db_i, that is, the column sums of the document-term matrix.

This information is gathered periodically by unspecified means from all databases in the group of databases and used to formulate estimates. As we have already noted, gGlOSS creates rankings for each query by estimating the goodness of each database with respect to a query. There are many ways that one might make these estimates; two were reported in [6].

1. High-Correlation Scenario: if two query terms t_1 and t_2 appear in db_i and df_{i1} \leq df_{i2}, then every document in db_i that contains t_1 also contains t_2. This assumption gives rise to the Max(l) estimator.
2. Disjoint Scenario: two terms appearing in the query do not appear together in any database document. This assumption gives rise to the Sum(l) estimator.

In both cases above it is further assumed that the weight of a term is distributed uniformly over all documents that contain the word. gGlOSS uses the assumptions underlying Max(l) (or Sum(l)) to estimate the number of documents in a database db having similarity to a query greater than a threshold l. This forms the basis for the gGlOSS estimate of the goodness of db. Complete details for calculating the Max(l) and Sum(l) estimators are given in [6] and are not reproduced here. But, for later reference, we note that

    Max(0) = Sum(0) = Ideal(0);    (2)

that is, at threshold l = 0 both estimators give identically the Ideal(0) ranking of databases for all queries.
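For the l = 0 case of Equation (2), we can sketch both the summary vectors gGlOSS collects and the resulting estimate. This is our reading of the special case (the general Max(l)/Sum(l) computations are in [6] and are not reproduced here), and the data layout is our own simplification.

```python
from collections import defaultdict

def summarize(doc_weights):
    """Build the two gGlOSS summary vectors for one database:
    df[t] = number of documents containing term t;
    W[t]  = sum of the weight of t over all documents (the column sum)."""
    df, W = defaultdict(int), defaultdict(float)
    for doc in doc_weights:          # each doc is a {term: weight} mapping
        for term, w in doc.items():
            df[term] += 1
            W[term] += w
    return dict(df), dict(W)

def estimated_goodness_l0(query, W):
    """The l = 0 case: under the uniform-weight assumption both Max(0) and Sum(0)
    reduce to the dot product of the query vector with the column sums, matching
    Ideal(0)'s goodness (Equation 2)."""
    return sum(qw * W.get(t, 0.0) for t, qw in query.items())
```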
3.3 The Experiment

Baselines for Comparison

The original gGlOSS evaluation compared the gGlOSS rankings produced by Max(l) and Sum(l) to the Ideal(l) rankings. The authors argued that "ranks based on end-user relevance are not appropriate for evaluating schemes like gGlOSS." [6] They further claim that "the best we can hope for any tool like gGlOSS is that it predicts the answers that the databases will give when presented with a query." [6] We strongly agree with the latter assertion, but feel that there is something to be learned by evaluating against end-user relevance; in the end that is the only interesting metric from a user's standpoint. So our plan was to look at two questions.

1. How well do the gGlOSS estimators predict Ideal(l)?
2. How well does Ideal(l) predict the relevance-based ranking?

The relevance-based ranking was called the optimal ranking by Callan et al. [3] and Rel All by Gravano et al. [6]. We simply mean that the databases will be ordered by the number of relevant documents they contain. In order to avoid overloaded terms such as "optimal" and "ideal," we will use the acronym RBR for the relevance-based ranking.

3.4 Evaluation Methodology

We tested the gGlOSS methodology against two different benchmarks (Ideal(l) and the relevance-based ranking) using the full 236 subcollections of the TREC data described earlier. We used TREC topics 51-150 as the test query set.

gGlOSS Estimates: We prepared the test collection using SMART version 11.0 [2]. We used the following SMART parameters to be consistent with Gravano et al. [6]: documents were indexed using ntc (document weights are formed using the familiar tf · idf); queries were indexed using nnn (query weights are based on term frequency, tf); and the similarity measure is the dot product of the document and query vectors. Note that for these experiments each of the 236 sites used the same parameters and search engine (SMART) to process queries. The next step was to prepare a union vocabulary incorporating all the terms appearing at any of the separate collections. This gave us a canonical global vocabulary with which to store the document frequencies and weight sums required by gGlOSS to make its estimates. Next, TREC topics 51-150 were indexed by SMART to convert them into term lists in the global vocabulary for use by gGlOSS. Finally, we produced gGlOSS rankings for each of the queries using two threshold values, l = 0.2 and l = 0, for both the Max(l) and Sum(l) gGlOSS estimators. These are the same threshold values used by Gravano et al. [6] in their experiments.

Ideal(l): Gravano et al. [6] conducted their performance evaluation of gGlOSS using Ideal(l) rankings as baselines. We produced two reference rankings, Ideal(0) and Ideal(0.2), by processing each query at each of the 236 subcollections and then using the goodness to rank the subcollections.

RBR: These rankings were produced for each query by using the relevance judgements supplied with the TREC data.
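For readers unfamiliar with SMART's three-letter weighting codes, the sketch below shows, in simplified form, what the above choices amount to: ntc documents (raw tf times an idf, cosine-normalized) and nnn queries (raw tf), compared by inner product. It is an illustration of the weighting scheme rather than SMART code, and the log(N/df) idf form is an assumption on our part.

```python
import math

def ntc_weights(tf, df, n_docs):
    """ntc: raw tf times idf (assumed log(N/df)), then cosine-normalize the vector."""
    w = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def nnn_weights(tf):
    """nnn: raw term frequency, no idf, no normalization."""
    return {t: float(f) for t, f in tf.items()}

def dot(query_vec, doc_vec):
    """Similarity is the inner product of the query and document vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
```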
3.5 Metrics for Comparison
There is no general agreement on how this type of comparison should be done. The general problem is that we are given a baseline ranking for some query and a ranking produced by some collection selection algorithm. The goal is to decide how well the candidate ranking approximates the baseline ranking. We have chosen to use some of the approaches given in the literature as well as a new approach proposed here.
Mean Squared Error
Callan et al.[3] reported their comparisons using the mean squared error of the predicted ranks and the desired ranks. Given a group of N databases to rank for any candidate ranking we compute
    MSE = \frac{1}{N} \sum_{i=1}^{N} \left( base\_rank(db_i) - est\_rank(db_i) \right)^2    (3)

where base_rank(db_i) is the baseline or desired rank and est_rank(db_i) is the predicted rank for db_i.
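A direct rendering of Equation (3), assuming both rankings are given as ordered lists of database identifiers (a simplification of ours):

```python
def mse(baseline, estimate):
    """Mean squared error between baseline and estimated ranks (Equation 3)."""
    base_rank = {db: i + 1 for i, db in enumerate(baseline)}
    est_rank = {db: i + 1 for i, db in enumerate(estimate)}
    n = len(baseline)
    return sum((base_rank[db] - est_rank[db]) ** 2 for db in baseline) / n

# Example with three databases: swapping the top two gives MSE = 2/3.
print(mse(["A", "B", "C"], ["B", "A", "C"]))
```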
Recall and Precision Analogs
In this section we discuss performance metrics that are analogous to the well known IR metrics of recall and precision. We begin by introducing some terminology and notation that tries to make this analysis neutral and generalizes it to include a variety of baselines. Recall that for each query we provide a baseline ranking, say B , that represents a desired goal or query plan. Given
some algorithm that produces an estimated ranking, E, our goal is to decide how well E approximates B. To begin, we assume that each database db_i in the collection has some merit, merit(q, db_i), to the query q. We expect the baseline to be expressed in terms of this merit; we expect the estimated ranking to be formulated by implicitly or explicitly estimating merit. Let db_{b_i} and db_{e_i} denote the database in the i-th ranked position of rankings B and E respectively. Let

    B_i = merit(q, db_{b_i}) \quad \text{and} \quad E_i = merit(q, db_{e_i})    (4)

denote the merit associated with the i-th ranked database in the baseline and estimated rankings respectively. We note that for viable baseline rankings it should always be the case that

    B_i \geq B_{i+1}, \quad i = 1, \ldots, n-1.    (5)

For the baselines reported here this is always true because we assume that the baseline ranking is determined by sorting the databases in decreasing order of merit for some appropriate definition of merit. However, it is not generally the case that E_i \geq E_{i+1}. The performance measures described here attempt to quantify the degree to which this is true for any estimated ranking.

Gravano et al. [6] defined R_n as follows:

    R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}    (6)

This is a measure of how much of the available merit in the top n ranked databases of the baseline has been accumulated via the top n databases in the estimated ranking. There is an alternative definition that we will use to present some of the performance results. First we need one more definition. Let

    n^* = \max k \text{ such that } B_k \neq 0.    (7)

Intuitively, n* is the ordinal position in the ranking of the last database with non-zero merit; it is the breakpoint between the useful and useless databases. With this definition we define our alternative "recall" metric as follows:

    \hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^*} B_i}    (8)

The denominator is just the total merit contributed by all the databases that are useful to the query. Thus, R̂_n is a measure of how much of the total merit has been accumulated via the top n databases in the estimated ranking. These two measures are clearly related. Since

    R_n \sum_{i=1}^{n} B_i = \hat{R}_n \sum_{i=1}^{n^*} B_i,    (9)

we have R̂_n ≤ R_n and R̂_{n*} = R_{n*}. However, they provide different perspectives on the database selection process: how much of the available merit has been estimated (R_n) versus how much of the total merit has been estimated (R̂_n).

Gravano et al. [6] have also proposed a precision-related measure, P_n. It is defined as follows:

    P_n = \frac{|\{db \in Top_n(E) \mid merit(q, db) > 0\}|}{|Top_n(E)|}    (10)

This gives the fraction of the top n databases in the estimated ranking that have non-zero merit.

In the results that follow we have the following convention. For the Ideal(l) calculations we have merit(q, db) = Goodness(l, q, db); for the RBR we define merit(q, db) to be the number of relevant documents in db.
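A compact rendering of Equations (6), (8), and (10), assuming both rankings are ordered lists of database identifiers and `merit` maps each database to its merit for the query (goodness for the Ideal(l) baselines, number of relevant documents for RBR); the function names are ours.

```python
def recall_analogs(baseline, estimate, merit, n):
    """Return (R_n, R_hat_n) for the top-n prefix of the estimated ranking."""
    B = [merit[db] for db in baseline]   # merit in baseline order
    E = [merit[db] for db in estimate]   # merit in estimated order
    n_star = max((i + 1 for i, b in enumerate(B) if b > 0), default=0)
    top_e = sum(E[:n])
    r_n = top_e / sum(B[:n]) if sum(B[:n]) > 0 else 0.0        # Equation (6)
    r_hat_n = top_e / sum(B[:n_star]) if n_star else 0.0       # Equation (8)
    return r_n, r_hat_n

def precision_analog(estimate, merit, n):
    """P_n: fraction of the top-n estimated databases with non-zero merit (Eq. 10)."""
    top = estimate[:n]
    return sum(1 for db in top if merit[db] > 0) / len(top)
```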
3.6 Experimental Results

Mean Squared Error

We first evaluated the gGlOSS rankings against the two baseline standards Ideal(l) and RBR using the mean squared error (MSE) of the ranks as described earlier. Since theoretically Max(0) = Sum(0) = Ideal(0), we simply verified this case. For all other cases we produced a plot of MSE by query ID to help get a sense of the error distribution. Max(0.2) had very low MSE in general (MSE ≤ 272 for all queries) when compared to Ideal(0.2). Due to the scale of the MSE graphs, and the comparatively small MSE values for this graph, little variability could be seen; therefore, this graph is not displayed. Sum(0.2) had considerably more variability and higher MSE values (see Figure 2). We then compared the gGlOSS ranks to the RBR baseline. Figure 3 shows the MSE for Ideal(0) compared to RBR. Figure 3 is typical of the MSE for Max(0.2), Sum(0.2), and Ideal(0.2) when compared to RBR. The comparison with Ideal(0) was made to help determine how well the Ideal(l) baseline approximated the RBR baseline.

Figure 2: Mean Squared Error for Sum(0.2) compared to Ideal(0.2)

Figure 3: Mean Squared Error for Ideal(0) compared to Relevance-based Ranking (RBR)

The MSE analysis suggests that the gGlOSS estimators are reasonably accurate models of Ideal(l). The MSE analysis further suggests that Ideal(l) is not a very strong predictor of RBR. MSE can identify variability but is not very useful for resolving the source of variability. We probe this further in the next section.

Recall and Precision Analogs

We plotted a number of scatterplots on the (R, P) plane. Each point on the plane represents the R_n and P_n values measured for a particular query when compared against the operative baseline metric. 100 points, one for each query, are plotted. We considered two fixed values of n, 5 and 10, as well as n*. The two fixed values were chosen to determine how well an estimated ranking would perform when selecting a small number of databases. We want to emphasize that n* is potentially different for every query and this affects the way that the plots are interpreted. To make this distinction clear we will write R_{n*} as R*.

First we look at plots of (R*, P*). To help visualize the distribution we show arcs centered at (1, 1) in increments of 0.25 as well as the line P* = R*. Note that by definition R* and P* evaluate to one when applied to the baseline ranking. Also recall that R_{n*} = R̂_{n*}, so (R*, P*) plots are identical for R_n and R̂_n. However, when we plot (R_5, P_5) and (R_10, P_10), R_n and R̂_n will produce differently scaled outcomes. Again, since Max(0) = Sum(0) = Ideal(0), the (R*, P*) plots comparing Max(0) and Sum(0) to Ideal(0) would simply be (1, 1), so we do not plot these. When Max(0.2) is compared to Ideal(0.2), the agreement is very strong, with R* ≥ 0.999 and P* ≥ 0.974 for all queries. Sum(0.2) also predicts the ranking of Ideal(0.2) well, but the agreement is less pronounced (see Figure 4). Figure 5 shows the performance of Ideal(0) with respect to the RBR baseline. This figure is typical of all the comparisons that we made between gGlOSS estimators and the RBR baseline.

Figure 4: (R*, P*) for Sum(0.2) compared to Ideal(0.2)

Figure 5: (R*, P*) for Ideal(0) compared to RBR

A histogram of distances from (1, 1) shows that 73% of the points in Figure 5 lie within a radius of 0.5. R* > 0.65 for these points, with P* > 0.5. This would seem to suggest that Ideal(0) can effectively deliver documents according to the RBR criterion. However, our MSE analysis suggests that Ideal(0) is not a strong predictor of RBR (see Figure 3). Part of the explanation for the apparent contradiction of the results shown in Figures 3 and 5 may be due to the distribution of n*. Figure 6 shows that n* for RBR ranges from 21 to 150 (9% to 64% of the databases in our test collection). By the time an estimated rank includes this many databases it will accumulate a significant amount of merit by chance alone.

Figure 6: Distribution of n* for RBR and Ideal(l)

For reasons of efficiency or cost, it may not be desirable to search all of the databases in the collection that have
been estimated to contain merit. It may be preferable to search only a select few from the ranked list. To evaluate the results from this perspective we plotted (R_5, P_5) and (R_10, P_10), that is, R_n and P_n at two fixed values of n. Figures 7 and 8 show these results. Again, each of these points should be compared with (1, 1). Note also that R_10 has twice as many precision levels as R_5. The figures show a fairly uniform scatter. Most of the points have P_n > 0.6, but R_n has a much wider range. This suggests that the top n databases in the estimated rankings frequently included databases with some merit, but not often those with the greatest merit.

Figure 7: (R_5, P_5) for Max(0.2) compared to RBR

Figure 8: (R_10, P_10) for Max(0.2) compared to RBR

We also plotted (n, R_n) (Figure 9), (n, R̂_n) (Figure 10), and (n, P_n) (Figure 11) for n = 1, ..., N. The labels in the figures are of the form E.B, where E is the estimate ranking and B is the baseline ranking. R_n and P_n are the same metrics as those reported in Gravano et al. [6]. A comparison of Figures 9 and 10 illustrates the differences in interpretation of R_n and R̂_n. Figure 9 tracks how much of the available merit in the top n databases of the baseline has been accumulated by the top n databases of the estimate. In contrast, Figure 10 tracks the fraction of total merit accumulated. It is apparent that the two measures can produce drastically different numeric results. Note that the Ideal(0).RBR plot in Figure 9 confirms on a larger scale the phenomenon shown in Figures 7 and 8; the databases with the most merit (in terms of relevant documents) tend not to be ranked extremely highly by the gGlOSS estimates. Finally, note that Figure 11 shows that, on average, the Ideal(0) estimate can incorrectly predict merit 20 to 50 percent of the time when compared to the RBR baseline. Again, these results seem to corroborate the MSE conclusion that gGlOSS models Ideal(l) well but does not seem to approximate RBR well.

Figure 9: R_n

Figure 10: R̂_n

Figure 11: P_n
4 Conclusions

We have made several contributions in this paper. We proposed and characterized a standard test environment for distributed information retrieval algorithms. We conducted a preliminary evaluation of gGlOSS, a prominent collection selection technology. We introduced an alternative recall-like metric R̂_n and showed its relationship to the R_n of Gravano et al. [6]. We investigated some new evaluation methodologies.

The gGlOSS estimators model Ideal(l) reasonably accurately in this test environment (i.e., all sites use the same index weighting and search strategy), but Ideal(l) does not accurately approximate RBR. Since Ideal(l) expresses usefulness on the basis of similarity scores, there will be many practical situations that confound this methodology. For example, a database with very many documents of marginal similarity will appear, in aggregate, to be more useful than a database having a few documents with large similarity. It is also the case that similarity scores across heterogeneous document collections are essentially noncomparable. For these reasons it is not surprising that gGlOSS does not correlate well with RBR; on the contrary, it would have been surprising if it did. Although in general gGlOSS does rank good databases highly, it does not necessarily rank the best databases highest. We note that RBR might be too rigid a standard for schemes like gGlOSS, but it provides some insight into how well these technologies can realize the potential merit of the system.

It was not our intention to declare gGlOSS good or bad; rather, we are trying to discover under what conditions it performs best. We intend to ask the same questions of other algorithms for collection selection. We have gained new insight into the collection selection problem and will undertake further investigations to reach more definitive conclusions. We also intend to reproduce the experiments of others in our test environment to better assess the various approaches proposed.
Acknowledgement. We would like to thank Walter R. Creighton III for his help with this research. We would also like to thank the reviewers for their thoughtful comments.
References

[1] Nicholas J. Belkin, Paul Kantor, Edward A. Fox, and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing and Management, 31(4):431-448, 1995.

[2] Chris Buckley. SMART version 11.0, 1992. ftp://ftp.cs.cornell.edu/pub/smart.

[3] James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching Distributed Collections with Inference Networks. In Proceedings of the 18th International Conference on Research and Development in Information Retrieval, pages 21-29, Seattle, WA, 1995.

[4] Edward A. Fox, M. Prabhakar Koushik, Joseph Shaw, Russell Modlin, and Durgesh Rao. Combining Evidence from Multiple Searches. In The First Text REtrieval Conference (TREC-1), pages 319-328, Gaithersburg, MD, November 1992.

[5] James C. French and Charles L. Viles. Ensuring Retrieval Effectiveness in Distributed Digital Libraries. Journal of Visual Communication and Image Representation, 7(1):61-73, 1996.

[6] Luis Gravano and Hector Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. In Proceedings of the 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995.

[7] Luis Gravano, Hector Garcia-Molina, and Anthony Tomasic. The Effectiveness of GlOSS for the Text Database Discovery Problem. In SIGMOD '94, pages 126-137, Minneapolis, MN, May 1994.

[8] Donna Harman. Overview of the Fourth Text REtrieval Conference (TREC-4). In Proceedings of the Fourth Text REtrieval Conference (TREC-4), Gaithersburg, MD, 1996.

[9] Alistair Moffat and Justin Zobel. Information Retrieval Systems for Large Document Collections. In Proceedings of the Third Text REtrieval Conference (TREC-3), pages 85-94, Gaithersburg, MD, 1995.

[10] Charles L. Viles and James C. French. TREC-4 Experiments Using Drift. In Proceedings of the Fourth Text REtrieval Conference (TREC-4), Gaithersburg, MD, 1996.

[11] Charles L. Viles and James C. French. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. In Proceedings of the 18th International Conference on Research and Development in Information Retrieval, pages 12-20, Seattle, WA, July 1995.

[12] Charles L. Viles and James C. French. On the Update of Term Weights in Dynamic Information Retrieval Systems. In Proceedings of the 4th International Conference on Knowledge and Information Management, pages 167-174, Baltimore, MD, November 1995.

[13] Ellen Vorhees. The TREC-5 Database Merging Track. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), Gaithersburg, MD, November 1996.

[14] Ellen Vorhees, Narendra K. Gupta, and Ben Johnson-Laird. Learning Collection Fusion Strategies. In Proceedings of the 18th International Conference on Research and Development in Information Retrieval, pages 172-179, Seattle, WA, 1995.

[15] Ellen Vorhees, Narendra K. Gupta, and Ben Johnson-Laird. The Collection Fusion Problem. In Proceedings of the Third Text REtrieval Conference (TREC-3), pages 95-104, Gaithersburg, MD, 1995.

[16] Nikolaus Walczuch, Norbert Fuhr, Michael Pollman, and Birgit Sievers. Routing and Ad-hoc Retrieval with the TREC-3 Collection in a Loosely Federated Environment. In The Third Text REtrieval Conference (TREC-3), pages 135-144, Gaithersburg, MD, November 1994.