Inferring Document Relevance from Incomplete Information

Javed A. Aslam∗, Emine Yilmaz
College of Computer and Information Science, Northeastern University
360 Huntington Ave, #202 WVH, Boston, MA 02115

{jaa,emine}@ccs.neu.edu

ABSTRACT


Recent work has shown that average precision can be accurately estimated from a small random sample of judged documents. Unfortunately, such "random pools" cannot be used to evaluate retrieval measures in any standard way. In this work, we show that given such estimates of average precision, one can accurately infer the relevances of the remaining unjudged documents, thus obtaining a fully judged pool that can be used in standard ways for system evaluation of all kinds. Using TREC data, we demonstrate that our inferred judged pools are well correlated with assessor judgments, and we further demonstrate that our inferred pools can be used to accurately infer precision-recall curves and all commonly used measures of retrieval performance.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance evaluation (efficiency and effectiveness)

General Terms
Theory, Measurement, Experimentation

Keywords
Relevance Judgments, Incomplete Judgments, Average Precision

1. INTRODUCTION

We consider the problem of large-scale retrieval evaluation. Standard methods of retrieval evaluation can be quite expensive when conducted on a large scale. Evaluation measures depend on the assumption that the relevance of the documents retrieved by a search engine is known, requiring enormous human effort for judging the documents in the document collection. In order to avoid judging the entire document collection, TREC uses a technique called pooling: in the case of depth-k pooling, the union of the top k documents retrieved by each submitted run is formed and the documents in this pool are judged for relevance with respect to the given topic. In the standard TREC setup, depth-100 pools are found to be effectively complete in terms of robust and accurate evaluation; hence, depth-100 pooling has been shown to be an effective way of evaluating the quality of retrieval systems without judging the complete document collection [8, 11]. Although depth-100 pooling considerably reduces the judgment effort needed, the pool still requires a large number of judgments. In the case of TREC, obtaining accurate, robust, and reusable assessments requires that tens of thousands of documents be judged. In TREC 8, for example, the 129 submitted systems were run against 50 queries (topics), and a total of 86,850 documents were judged in order to evaluate systems with respect to all these queries. This number becomes much larger when real collections are considered (e.g., the world wide web).

Much research has attempted to address the assessment effort required for large-scale retrieval evaluation. Shallower pools [11] and pooling techniques that are likely to capture most of the relevant documents have been studied [6, 1]. However, when the total number of judgments is limited, all of these methods produce biased or unprovable estimates of evaluation measures. Carterette et al. [5] developed a method that aims at ranking systems correctly using limited relevance judgments; however, this method cannot be used to compute a provable estimate of an evaluation measure when incomplete judgments are present. Recently, Aslam et al. [2] and Yilmaz and Aslam [10] proposed statistical techniques for estimating the value of an evaluation measure from a judged random sample of documents. Unlike previous methods, these statistical techniques determine unbiased estimates of the standard measures themselves. These results show that average precision (and other measures of retrieval performance) can be accurately estimated from a carefully chosen judged random pool as small as 4% of the typical TREC-style depth pool.

One disadvantage of the aforementioned statistical techniques is that they cannot be used to assess the performance of systems in any standard way: in order to estimate the measures, a special procedure requiring access to information from the sampling process is needed, so standard tools such as trec_eval or other software implementations that calculate average precision and other performance measures cannot be used. The computation of evaluation measures in a standard way requires complete knowledge of the relevance of documents. Hence, methods that can infer the relevance of documents from a few judged documents are desired. In this work, we show that estimates of average precision computed from a few judged documents can be used to accurately infer the relevance of unjudged documents, thus making it possible to infer fully judged pools from a small fraction of judged documents. We further show, through the use of TREC data, that the judgments in these inferred pools correlate well with actual assessments and that these inferred pools can be used to compute precision-recall curves and other measures of retrieval performance in standard ways, permitting efficient standard evaluation on a large scale.

∗ We gratefully acknowledge the support provided by NSF grant IIS-0534482.

2. FROM INCOMPLETE JUDGMENTS TO COMPLETE JUDGMENTS

The goal of this work is to accurately infer the relevance of unjudged documents given a few relevance judgments (a preliminary version of this work appeared as a recent poster [3]). The method for inferring these relevance judgments is based on the following three hypotheses:

1. Given the actual average precision value of a single system (together with the total number of relevant documents), one can accurately infer the relevance of the documents retrieved by this system.

2. Given the actual average precision values of multiple systems (together with the total number of relevant documents), one can accurately infer the relevance of the documents in the complete (depth-100) pool.

3. Given average precision estimates of multiple systems obtained using a few judgments (together with an estimate of the total number of relevant documents), one can accurately infer the relevance of the documents in the complete (depth-100) pool.

In recent work, we have shown that Hypothesis (1) is largely correct: given the value of average precision of a list retrieved in response to a query and the total number of relevant documents in the query, one can accurately infer the distribution of relevant and nonrelevant documents in the retrieved list [4]. The main idea behind Hypothesis (2) is as follows. The actual average precision of a single system provides some information about the relevance of the documents retrieved by that system on a particular query. If there are multiple systems, then the average precision of each system provides some information about the relevance of the documents retrieved by that system on that query. Furthermore, some documents will be retrieved by multiple systems, which imposes a constraint on what the relevance of these documents can be, since the relevance of a document retrieved by multiple systems should be consistent among all systems. Hence, information from multiple systems can be used to obtain better inferences about the relevance of documents than the inferences obtained from a single system.

Assuming that Hypothesis (2) holds, and based on the fact that accurate inferences about the relevance of documents can be obtained using actual average precision values together with the total number of relevant documents (R) as input, one might then use estimates of average precision computed from a few judgments to infer the relevance of unjudged documents, obtaining complete judgments (qrels) from a few judged documents (Hypothesis (3)). If Hypothesis (3) holds, then complete judgments could be obtained from a small number of incomplete judgments, enabling efficient retrieval evaluation on a large scale using standard tools. Every year many systems are submitted to the annual Text REtrieval Conference (TREC). These systems are run against some number of queries (typically 50), and for each query the depth-100 pool of documents is formed and stored in a qrel file. For each query, the actual average precision of each system is then computed using this depth-100 pool. Due to this setup and the extensive judgment effort it requires, the aforementioned hypotheses are applicable and useful to TREC; hence, we use TREC data to validate our hypotheses. In the sections that follow, we validate the correctness of Hypotheses (2) and (3), and we describe the technique that can be used to infer complete judgments given incomplete judgments. We begin in Section 2.1 by describing our methodology for inferring relevance judgments given the average precision values of multiple lists together with the total number of relevant documents in a query as input. In Section 3, we show that the inferences obtained from the proposed method using actual average precision values and actual R are highly accurate, thus validating Hypothesis (2). Note that inferring the relevance of documents given the value of actual average precision is not very interesting in practice, since the computation of average precision requires the judgments in the first place. However, the problem becomes much more interesting if it is possible to obtain accurate inferences of the relevance of unjudged documents given estimates of average precision obtained using a few judgments. In Section 4, we show that Hypothesis (3) holds; thus, given estimates of average precision, one can accurately infer the relevance of unjudged documents.

2.1 Methodology

The methodology for inferring relevance assessments from average precision is conceptually simple: given (1) the ranked lists of documents submitted in response to a given topic, (2) the average precisions associated with these lists, and (3) R, the number of documents relevant to the topic, find the binary relevance judgments associated with the underlying documents which minimize the "difference" between the given average precisions and those incurred by the inferred relevance assessments. This optimization problem can be written as:

• Goal: Assign relevance values to each document in the complete judgment set.
• Optimization: The average precisions incurred must be "close" to the given average precisions.
• Constraints: (1) The total number of relevant documents is R. (2) Any document contained in multiple lists must have the same relevance assessment.
• Integrality: The inferred assessments must be binary.

Figure 1: Diagram for inferring relevance judgments. The upper right plot corresponds to the input estimates of average precisions and R and the upper left plot corresponds to the document constraints imposed by documents retrieved from multiple lists. Using these inputs, the optimization procedure can then be used to obtain complete relevance judgments (qrels) that can be used to accurately evaluate systems in a standard way.

In order to ensure that the inferred relevance judgments incur average precision values "close" to those given, we minimize the sum squared error between the actual and inferred average precision values. The above definition of the problem is a constrained integer optimization problem. Hence, this problem is intractable for the same reason that integer programming is intractable. To alleviate this problem, we relax the condition that the inferred relevance assessments must be binary. We instead allow the inferred relevance assessments to correspond to probabilities of relevance, and we deduce an expected value for average precision from these probabilistic relevance assessments. Let p_i be the probability of relevance associated with the document at rank i in the list of length Z. In a recent work [4], we showed that the expected value of average precision can then be computed as

E[AP] = \frac{1}{R} \sum_{i=1}^{Z} \frac{p_i}{i} \left( 1 + \sum_{j=1}^{i-1} p_j \right).

Therefore, we ensure that the inferred relevance judgments incur average precision values "close" to those given by minimizing the sum squared error between the actual and inferred expected average precision values. Thus, our optimization criterion is \min \sum_i (E[AP_i] - ap_i)^2, where ap_i is the given average precision associated with list i. The problem as formulated above can be solved using any number of constrained optimization routines, available, for instance, in Matlab. The output of the above optimization procedure is the probability of relevance of the documents that are in the complete judgment set (the depth-100 pool in TREC). However, most of the standard evaluation measures, such as average precision, R-precision and precision-at-cutoff k, use binary judgments,2 i.e., a document can be either relevant (1) or nonrelevant (0). Hence, in order to make use of the inferred probabilistic judgments in a standard way, these judgments must be converted to binary judgments. There are three intuitive ways of converting probabilistic relevance assessments to binary. (1) A trivial possibility is to threshold the probabilities at 0.5, assigning a relevance score of 1 to documents with probability of relevance at least 0.5 and a relevance score of 0 to the remaining documents. (2) Another possibility is to sort the documents in decreasing order based on their probability of relevance and assign a relevance score of 1 to the top R documents (if we know or assume that there are R relevant documents in total). (3) The last option is based on the idea of randomized rounding

that is used in linear programming to solve integer programs using probabilities. Based on this method, a document with probability of relevance p is assigned a relevance score of 1 with this probability and a score of 0 with probability 1 − p. Later in this paper, we explore all three methods and show that randomized rounding experimentally gives the best results. Hence, this method will be used to convert probabilistic relevance judgments to binary in the remainder of our experiments. Figure 1 shows how the aforementioned method can be used for inferring complete judgments from incomplete judgments. The black box in the figure corresponds to the method described above at an abstract level. The input to the optimization method is an estimate of R together with average precision estimates of multiple systems. The optimization method makes use of the fact that any document contained in multiple lists must have the same relevance assessment and outputs complete relevance judgments (inferred qrels) that could be used in any standard way (e.g. trec_eval) to evaluate retrieval systems.
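To make the relaxed optimization and the three conversion methods concrete, the following is a minimal sketch in Python using SciPy's SLSQP solver. It is not the authors' implementation (the paper only states that a constrained optimization routine, e.g. one available in Matlab, was used); the function names, the uniform starting point, and the equality constraint on the expected number of relevant documents are illustrative assumptions. Because every pooled document contributes a single probability variable shared by all lists that retrieve it, the consistency constraint across lists is satisfied by construction.

import numpy as np
from scipy.optimize import minimize


def expected_ap(p, R):
    """E[AP] = (1/R) * sum_i (p_i / i) * (1 + sum_{j<i} p_j) for one ranked list.

    With p_i restricted to {0, 1} this reduces to ordinary average precision."""
    p = np.asarray(p, dtype=float)
    prefix = np.concatenate(([0.0], np.cumsum(p)[:-1]))   # sum_{j<i} p_j
    ranks = np.arange(1, len(p) + 1)
    return float(np.sum((p / ranks) * (1.0 + prefix)) / R)


def infer_relevance(ranked_lists, given_aps, R, pool_size):
    """Infer one probability of relevance per pooled document.

    ranked_lists holds, for each system, the indices (into the pool) of its
    retrieved documents in rank order; given_aps holds the corresponding
    average precision values."""
    def objective(p):
        # sum squared error between expected and given average precisions
        return sum((expected_ap(p[docs], R) - ap) ** 2
                   for docs, ap in zip(ranked_lists, given_aps))

    p0 = np.full(pool_size, min(1.0, R / pool_size))       # uniform starting point
    bounds = [(0.0, 1.0)] * pool_size
    constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - R}]
    res = minimize(objective, p0, method="SLSQP", bounds=bounds,
                   constraints=constraints)
    return res.x


# The three conversions from probabilities of relevance to binary judgments.
def threshold_at_half(p):
    return (np.asarray(p) >= 0.5).astype(int)


def top_r(p, R):
    rel = np.zeros(len(p), dtype=int)
    rel[np.argsort(p)[::-1][:R]] = 1
    return rel


def randomized_rounding(p, seed=0):
    rng = np.random.default_rng(seed)
    return (rng.random(len(p)) < np.asarray(p)).astype(int)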



3. INFERRING DOCUMENT RELEVANCE FROM ACTUAL AP

One of the main hypotheses of this paper is that given the actual values of average precision for multiple systems and the total number of relevant documents R in a query, the relevance of documents in the complete pool can be accurately inferred (Hypothesis 2). The goal of this section is to validate this hypothesis using TREC data. To be consistent with TREC terminology, where the set of complete judgments is referred to as qrels, in the remainder of this paper we refer to the inferred relevance judgments as inferred qrels and to the relevance judgments available from TREC as actual qrels. Using data from TREC, we run the method with actual average precision and R values and obtain a probability of relevance for each document.

2 In a recent work [4], together with the definition of generalized average precision using probabilities of relevance, we also define generalized probabilistic versions of R-precision and precision-at-cutoff k. Given a ranked list of N documents with probabilities of relevance p_1, p_2, ..., p_N and the total number of relevant documents R in the query, the R-precision and precision-at-cutoff k values can be computed as rp = \frac{1}{R}\sum_{i=1}^{R} p_i and PC(k) = \frac{1}{k}\sum_{i=1}^{k} p_i. Hence, one can use the generalized probabilistic versions of these measures to evaluate the retrieval systems using the inferred probabilistic qrels.
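These two formulas translate directly into code; a small sketch (the function names are ours), assuming p is an array of probabilities of relevance in rank order:

import numpy as np

def generalized_r_precision(p, R):
    # rp = (1/R) * sum of the probabilities of relevance at the top R ranks
    return float(np.sum(p[:R]) / R)

def generalized_precision_at_cutoff(p, k):
    # PC(k) = (1/k) * sum of the probabilities of relevance at the top k ranks
    return float(np.sum(p[:k]) / k)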





Figure 2: The probability of relevances of the top 2R documents using inferred qrels obtained for queries 1, 5 and 9 in TREC 8, sorted in decreasing order. (Panels: TREC8 query 1, R = 300; query 5, R = 38; query 9, R = 22; y-axis: probability.)

The runs that are submitted to TREC are naturally split into two groups: those runs which contributed to the original TREC depth-100 pool (the actual qrel) and those which did not contribute to the pool. In the case of TREC 8, for example, there are 70 runs that contributed to the pool and 59 runs that did not. We ran our method for inferring qrels using only the runs that contribute to the pool as input. For each topic, we inferred relevance judgments by running the aforementioned method with, as input, the actual average precision values of the runs that contributed to the pool and the actual total number of relevant documents (R) in the topic. To verify our hypothesis, we need to evaluate the quality of the inferred qrels to show that, without knowing anything about which documents are relevant, and by using the actual average precision values, R, and the structure of the lists alone, one can accurately infer which documents are relevant. The quality of the inferred qrels can be evaluated based on at least three criteria:

1. How do the inferred qrels evaluate the input systems, as compared to actual qrels?
2. How do the inferred qrels generalize to evaluating unseen runs?
3. How do the inferred qrels compare with actual qrels?

3.1 From probability of relevance to binary judgments

In order to perform any of these evaluations in a standard way, the inferred qrels need to be binary. However, the inferred qrels obtained from our optimization method are non-binary (probabilistic). Hence, we need to convert the inferred probabilistic relevance assessments to binary. In the previous section, we mentioned three different methods that could be used to convert probabilities of relevance to binary judgments: (1) thresholding at 0.5, (2) picking the top R documents based on probability of relevance, and (3) randomized rounding. If there is a sharp contrast among the probabilities of relevance of the documents in the inferred qrels, i.e., most of the documents have probabilities of relevance either close to 1 or close to 0, all of these methods would give similar results. Even though this is the case for some topics, for some of the inferred qrels there are many documents with probability of relevance between 0 and 1. Figure 2 shows the inferred probabilistic qrels obtained for topics 1, 5 and 9 in TREC 8. For each of these queries, we sort the documents in decreasing order based on their probability of relevance and plot the probabilities of relevance of the top 2R documents. The first two plots in the figure show that the inferred probabilities of relevance do not always have a sharp contrast; hence, the method used to convert the probabilities to binary values makes a difference. In order to assess all three methods for converting probabilistic judgments to binary, we focused on the first evaluation criterion, namely, how do the inferred qrels evaluate the systems compared to actual qrels? We convert the probabilistic qrels to binary using all three possible methods, obtaining three different inferred binary qrels. We then compute the average precisions of the systems using each inferred binary qrel and compare the inferred average precision values to the actual average precisions of the systems.

Figure 3: The estimated MAP vs. actual MAP values for TREC 8 using three different methods to convert the probabilities of relevance to binary judgments: (left) thresholding at 0.5, (middle) picking the top R documents with highest probability of relevance, and (right) randomized rounding.

Figure 4: Precision-recall curves obtained using the inferred qrels vs. the actual qrels for system MITSLStd, queries 1, 2 and 3 in TREC 8.

Figure 3 shows how the mean average precision (average precision averaged over all queries) estimates calculated using the binary relevance judgments obtained from the three proposed methods, in the given order, compare with the actual MAP (mean average precision) values. The dots in these plots refer to the systems that were used to create the pool (referred to as the training systems) and the plus signs refer to the systems that did not contribute to the pool (referred to as the testing systems). For comparison purposes, the three plots report the root mean squared (RMS) error (how good are the estimates in terms of value?) and Kendall's τ (how good are the estimates in terms of ranking?). It can be seen that if randomized rounding is used to convert the inferred probabilistic qrels to binary, then the MAP values computed through the inferred binary qrels are very close to the actual MAP values. Hence, throughout this paper, we will use the method of randomized rounding to convert the inferred probabilistic qrels to binary.
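For reference, the two comparison statistics reported in Figure 3 (and in the later figures) can be computed as follows; this is a plain sketch using SciPy, not code from the paper, and the function names are ours:

import numpy as np
from scipy.stats import kendalltau

def rms_error(estimated, actual):
    # "How good are the estimates in terms of value?"
    estimated, actual = np.asarray(estimated, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((estimated - actual) ** 2)))

def kendall_tau(estimated, actual):
    # "How good are the estimates in terms of ranking?"
    tau, _ = kendalltau(estimated, actual)
    return tau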

3.2 How accurate are the inferred qrels?

Once the inferred binary qrels are obtained, we can evaluate their quality based on the three evaluation criteria in order to verify the hypothesis that the inferences are highly accurate.

Criterion 1: How do the inferred qrels evaluate the systems, as compared to actual qrels? The first evaluation criterion is whether the inferred qrels can evaluate systems in the same way as the actual qrels. Since many aspects of retrieval performance can be inferred from precision-recall curves, one way to evaluate this is to compute the precision-recall curves of the systems using the inferred qrels and compare these inferred precision-recall curves with the actual precision-recall curves. Figure 4 shows the inferred vs. actual precision-recall curves of a randomly chosen system from TREC 8 (system MITSLStd) on queries 1, 2 and 3. It can be seen that the inferred precision-recall curves are highly accurate, and they are even identical to the actual precision-recall curves for some queries (query 3). Since the precision-recall curves obtained using the inferred qrels are highly accurate, the inferred qrels are expected to evaluate systems similarly to the actual qrels. In order to test this, we compute estimates of the standard measures mean R-precision and mean precision-at-cutoff 10 and 100 using the inferred qrels and compare these estimates to the actual values of these measures. Figure 5 shows that the inferred qrels evaluate the systems in essentially the same way as the actual qrels.

Criterion 2: How do the inferred qrels generalize to evaluating unseen runs? When a test collection is built, one of the primary goals is to be able to use the test collection to evaluate unseen systems (systems that did not contribute to the test collection), i.e., reusability. Hence, it is important that our inferred qrels generalize to evaluating unseen runs (the second evaluation criterion). Since we only use the systems that contribute to the pool as input (the training systems), the generalizability of the inferred qrels can be evaluated based on how well the inferred qrels can evaluate the systems that did not contribute to the pool (the testing systems). Figure 5 shows that the inferred qrels generalize well to the testing systems, in the sense that the estimated values of the measures for these systems are also very close to the actual values of these measures. Hence, the inferred qrels are reusable.

Criterion 3: How do the inferred qrels compare with actual qrels? Note that until now we have shown that the inferred qrels evaluate the systems in the same way as the actual qrels. However, this does not necessarily mean that the relevance judgments in these qrels are "correct." One would like to compare the inferred qrels with the actual qrels to check whether the relevant documents in the inferred qrels match the relevant documents in the actual qrels (the third evaluation criterion). In order to perform this comparison, we treat the relevant documents in the inferred qrel as a set and calculate the set precision, recall and F1 values (averaged over 50 queries) of these sets using the actual qrels for TRECs 7, 8 and 10. Note that the precision value computed in this setup corresponds to the question "What fraction of the documents that are relevant in the inferred qrel are actually relevant in the actual qrel?" and the recall value corresponds to the question "What fraction of the documents that are actually relevant are identified as relevant in the inferred qrels?". Table 1 shows the computed precision, recall, and F1 values for TRECs 7, 8 and 10.

Table 1: Precision, recall and F1 values of the inferred binary qrels for TREC 7, 8 and 10.

            TREC 7   TREC 8   TREC 10
  prec      0.7127   0.7156   0.5955
  recall    0.7090   0.7096   0.5958
  F1        0.7099   0.7119   0.5939
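The set-based comparison behind Table 1 (and Table 2 later) is straightforward; a minimal sketch, with per-query values averaged afterwards (the function name is ours):

def set_prf(inferred_relevant, actual_relevant):
    """Set precision, recall and F1 of the inferred relevant set against the actual one."""
    inferred, actual = set(inferred_relevant), set(actual_relevant)
    overlap = len(inferred & actual)
    prec = overlap / len(inferred) if inferred else 0.0
    rec = overlap / len(actual) if actual else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return prec, rec, f1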


Figure 5: MRP, MPC(10) and MPC(100) estimates for testing and training systems calculated using the inferred qrels vs. the actual values of these measures for TREC 8.

It can be seen through the precision, recall and F1 measures that the relevances of the majority of the documents in the inferred qrels match their relevances in the actual qrels. While our relevance inferences are largely correct, there are also clearly differences. One can argue that the actual qrels reported by TREC cannot be considered the only "correct" way of judging the documents. Voorhees [9] shows that different judges may strongly disagree on what is relevant and what is not, and hence the qrels formed by different judges may be quite different. The difference between the inferred qrels and the actual qrels may be due to a different interpretation of relevance compared to that of the judge who created the actual qrels. In the future, we plan to manually investigate the documents for which the inferred qrels and actual qrels disagree. Based on the three evaluation criteria, we have shown that given the values of average precision for multiple lists, together with the total number of relevant documents in a query, one can accurately infer the relevance of documents, validating Hypothesis (2). In the following section, we validate Hypothesis (3), which is more interesting and highly useful in practice.

4. INFERRING DOCUMENT RELEVANCE FROM AP ESTIMATES

In the previous section, we have shown that using the values of actual average precision and R, one can accurately infer the relevance of documents using the proposed methodology (Hypothesis (2)). Inferring the relevance of documents using the values of actual AP and R is not very useful in practice, since in order to obtain these actual values one needs access to the complete relevance judgment set that we wish to infer. Since Hypothesis (2) is valid, one might reasonably expect that using accurate estimates of average precision and R (obtained by judging a few documents), one could still accurately infer the relevance of documents, even documents that were not judged to obtain the initial estimates; this is Hypothesis (3). This is particularly important since, if one could accurately infer complete judgments by judging a few documents, then large-scale evaluation using standard tools would be possible by judging very few documents. In order to make use of the proposed optimization method to infer complete judgments from a limited number of relevance judgments, we first need to obtain estimates of average precision and R from these limited judgments. Any method that produces accurate estimates of average precision from incomplete judgments could be used to estimate these values. We recently developed a measure, inferred AP, that can be used to estimate average precision when only limited judgments are available [10]. We also developed a statistical method for estimating evaluation measures such as average precision, R-precision and precision-at-cutoff k, as well as R, using limited relevance judgments [2]. This latter method uses random sampling to compute the expected value of average precision using limited judgments. Both of these methods are valid options for computing the input average precision estimates; we will use the latter method to obtain the input estimates of average precision and R. Throughout this paper, we refer to the estimates of AP and R obtained using this statistical method as the sampling estimates. Figure 6 shows how the MAP sampling estimates computed using 29, 71, and 200 judgments on average per query compare with the actual MAP values for TREC 8. These judgments correspond to 1.7%, 4.1%, and 11.5% of the complete judged depth-100 pool. Figure 6 shows that the MAP estimates obtained are highly accurate estimates of the actual MAP values obtained using complete judgments (1737 judgments on average per query). Having identified the method used to compute estimates of AP and R, we now describe in greater detail the method used to infer relevance judgments given access to a limited number of relevance judgments. This method can be explained in four main steps (a simplified stand-in for step 1 is sketched after the list):

1. Obtain input estimates: Sample and judge some documents, obtaining estimates of AP and R through the sampling method.
2. Optimization: Use the estimates obtained through the sampling method to infer the probability of relevance of documents.
3. From probabilities to binary: Convert the inferred probabilistic relevance judgments into binary judgments using the method of randomized rounding.
4. Correction: The judgments obtained in the previous step contain estimates of the relevance of documents that were in fact judged to compute the estimates of AP and R in Step (1). Since these documents were already judged, their actual relevances are known. Correct the inferred binary judgments of these documents, obtaining the final inferred qrels.
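As a simplified stand-in for step 1 (the actual sampling method of [2] uses a more careful sampling design and also estimates average precision), the sketch below judges a uniform random sample of the depth-100 pool and scales up to estimate R; the names and the uniform design are illustrative assumptions only:

import random

def estimate_R_uniform(pool, sample_size, judge, seed=0):
    """pool: list of document ids; judge: callable returning 1 (relevant) or 0."""
    rng = random.Random(seed)
    sample = rng.sample(pool, sample_size)            # uniform sample without replacement
    judged = {doc: judge(doc) for doc in sample}      # judgments kept for step 4 (correction)
    relevant_rate = sum(judged.values()) / sample_size
    return relevant_rate * len(pool), judged          # unbiased estimate of R under this design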

Figure 6: MAP estimates obtained using the sampling method with 29 (depth 1), 71 (depth 3), and 200 (depth 10) judgments for TREC 8.

Figure 7: Precision-recall curves obtained using the inferred qrels obtained from 29, 71 and 200 judgments for TREC 8 vs. the actual qrels, for queries 1, 2, and 3 in TREC 8.

Note that we include the last step so that the judgment effort that was already used to create the estimates is not wasted. Later in this paper we show that the inferred qrels are quite accurate even without this correction and that the effect of this correction is not dramatic. Using these steps, we now have a method to infer complete relevance judgments (qrels) given a small number of input judgments and the estimates of average precision and R obtained from these input judgments. Now, we can show that these inferred qrels are "close" to the actual qrels that are generated using many more relevance judgments, validating our hypothesis that average precision estimates of multiple systems (together with an estimate of R) can be used to infer complete relevance judgments (Hypothesis (3)).


4.1 How accurate are the inferred qrels?

In order to evaluate the quality of the inferred qrels, we use the three criteria defined in the previous section. For evaluation, we use data from TRECs 7, 8 and 10, focusing on results obtained from TREC 8 due to space constraints. For all TRECs, we run the above method using estimates obtained from judgments with different levels of incompleteness. For TREC 8, we mainly focus on the inferred qrels obtained from estimates of AP and R computed with 29 (1.7% of complete judgments), 71 (4.1% of complete judgments), and 200 (11.5% of complete judgments) judgments as input.3

Criterion 1: How do the inferred qrels evaluate the systems, as compared to actual qrels? Our first criterion in evaluating the quality of the inferred qrels is how well they evaluate systems. Using the same idea as in the previous section, we show that (1) the precision-recall curves of systems computed using the inferred qrels are very close to the actual precision-recall curves, and (2) the inferred qrels and actual qrels evaluate systems similarly. Figure 7 shows the inferred precision-recall curves of the system MITSLStd for queries 1, 2 and 3 when the inferred qrels are obtained from sampling estimates using 29 (first row), 71 (middle row) and 200 (third row) judgments. It can be seen that with as few as 29 judgments (1.7%) per query, the inferred precision-recall curve of the system is close to the actual precision-recall curve of this system. Furthermore, with as few as 200 judgments (11.5%) per query, the inferred precision-recall curves are almost exactly the same as the actual precision-recall curves. Also, the precision-recall curves obtained from sampling estimates with 200 judgments are almost as good as the precision-recall curves obtained using actual AP as input (Figure 4). According to Figure 7, for some queries the area under the inferred precision-recall curve is more or less than the area under the actual precision-recall curve (e.g., mostly lower for the inferred precision-recall curves from 71 judgments). Since average precision is an approximation to the area under the precision-recall curve, this means that the inferred average precision of the system on these queries is higher or lower than the actual average precision value. Furthermore, at first glance, the inferred precision-recall curves obtained using 71 judgments seem worse than the inferred precision-recall curves from 29 judgments. This may seem counterintuitive given that the input sampling MAP estimates from 71 judgments are much better than the sampling MAP estimates from 29 judgments (Figure 6). This behavior is due to the variability inherent in the input estimates from the sampling method. Since this method is based on random sampling, the average precision estimates of a system on some queries may be lower or higher than the actual value (variance). However, when these estimates are averaged over many queries, the resulting estimate of mean average precision is an unbiased estimate of the actual mean average precision [2] (reduction of variance). Since the optimization method finds the best fit to the given average precision values, the average precision estimates obtained from the inferred qrels are also expected to have some variance. For example, since the area under the inferred precision-recall curves from 71 judgments is less than the area under the actual precision-recall curves for queries 1, 2 and 3 (due to the input estimates), the inferred area must be higher than the actual area for some other queries, since the inferred mean average precision values are very close to the actual mean average precision values. Note also that as better (lower variance) estimates of average precision are used as input to the optimization procedure, the quality of the inferred qrels would likely increase. To evaluate how the inferred qrels evaluate systems when compared to actual qrels, Figure 8 shows how the inferred qrels using sampling estimates obtained from 29, 71 and 200 judgments as input evaluate the retrieval systems. It can be seen from the plots that using as few as 29 judgments, the inferred MAP, MRP, MPC(10) and MPC(100) values are very close to their actual values, especially for ranking purposes (Kendall's τ). The inferred values of these measures are even closer to their actual values when 71 judgments are used, and when 200 judgments are used to infer the qrels, the inferred qrels evaluate systems almost identically to the actual qrels (third row). Note that the mean average precision values computed using the inferred qrels are expected to be close to the actual mean average precision values, since the optimization procedure finds the best fit to the input sampling average precision estimates, which are quite accurate. Therefore, the fact that the inferred mean average precision values are highly correlated with the actual mean average precision values is not very interesting or surprising. However, the fact that the inferred MRP, MPC(10) and MPC(100) values are highly accurate shows that the inferred qrels are highly useful and can be used to accurately evaluate any standard measure. One interesting result is that when the input MAP estimates obtained through the sampling method (Figure 6) are compared to the MAP estimates obtained through the inferred qrels (leftmost plots in Figure 8), the estimates obtained through the inferred qrels have less variance and are often better than the input estimates. For example, the inferred mean average precision values obtained from 29 judgments (top left plot in Figure 8) are much better than the input sampling estimates from 29 judgments (Figure 6). This improvement can be explained as follows: the sampling estimates input to the optimization procedure are noisy due to random sampling. However, the optimization procedure cannot perfectly fit this random noise, especially given the constraint that a document retrieved by multiple systems must have the same relevance. Hence, the optimization procedure finds the best possible fit subject to these constraints, resulting in a reduction in the noise associated with the input. Thus, the inferred qrels evaluate systems even better than the input estimates.

Criterion 2: How do the inferred qrels generalize to evaluating unseen runs? Figure 8 shows that the Kendall's τ and RMS error values for the training and testing systems are not much different, showing that the inferred qrels generalize well to unseen systems.

3 In the sampling method [2], the sampling estimates of the measures are compared with the estimates obtained using depth-k pooling by sampling and judging the same number of documents on average per query as would be judged if depth-k pooling were used. Using the same setup, we use the estimates obtained using the judgment effort equivalent to depth-k pooling, for k ∈ {1, 3, 5, 10, 15, 20}.

Criterion 3: How do the inferred qrels compare with actual qrels? Our final evaluation criterion is whether the inferred qrels are similar to actual qrels. Following our previous setup, we treat the relevant documents in the inferred qrel as a set and calculate the set precision, recall and F1 values (averaged over 50 queries) of these sets using the actual qrels for TRECs 7, 8 and 10 (Table 2). It can be seen that very high precision values (showing that the relevant documents in the inferred qrels are also relevant in the actual qrel) and very high recall values (showing that the inferred qrels contain most of the relevant documents in the actual qrel) can be obtained using as few as approximately 21% (or even less) of the complete relevance judgments as input; we conclude that the inferred qrels are quite similar to actual qrels. The table contains two groups of columns, labeled correction and no correction. These refer to the qrels obtained when the judgments used to obtain the input sampling estimates are used to correct the inferred qrel and when these input judgments are not used for correction (the last step of the procedure). The goal of this comparison is to check the effect of these given judgments on the inferred qrels: is the performance of the method mainly governed by these given judgments, or is the method itself correctly identifying most of the relevant documents? It can be seen that even without using correction, the optimization method accurately identifies many relevant documents. This behavior can also be seen in Figure 9. For various TRECs, this figure shows the number of relevant documents in the given input judgments (averaged over all queries) vs. the number of relevant documents in the inferred qrels (with and without correction) as the number of input judgments used in obtaining the sampling estimates varies.

Table 2: Precision, recall and F1 values of the corrected and uncorrected inferred qrels for TREC 7, 8 and 10 when sampling estimates obtained with various numbers of judgments are used as input.

TREC-7                     correction                 no correction
docs judged         prec    recall  F1         prec    recall  F1
 27  (1.7%)         0.4808  0.3386  0.3336     0.4250  0.3284  0.3191
 65  (4.0%)         0.5784  0.4622  0.4785     0.5301  0.4280  0.4383
 98  (6.1%)         0.5754  0.5224  0.5228     0.5136  0.4630  0.4630
179  (11.1%)        0.6618  0.6361  0.6354     0.5538  0.5356  0.5304
257  (16.0%)        0.7293  0.6852  0.6926     0.5940  0.5638  0.5665
337  (21.0%)        0.7415  0.7448  0.7302     0.5871  0.6206  0.5944

TREC-8                     correction                 no correction
docs judged         prec    recall  F1         prec    recall  F1
 29  (1.7%)         0.5562  0.3833  0.4171     0.5340  0.3675  0.3992
 71  (4.1%)         0.5919  0.5495  0.5332     0.5277  0.5016  0.4816
110  (6.3%)         0.6243  0.6004  0.5880     0.5458  0.5234  0.5165
200  (11.5%)        0.7068  0.6887  0.6906     0.5910  0.5664  0.5714
290  (16.7%)        0.7720  0.7361  0.7465     0.6191  0.6052  0.6062
379  (21.8%)        0.8101  0.7694  0.7835     0.6535  0.6338  0.6409

TREC-10                    correction                 no correction
docs judged         prec    recall  F1         prec    recall  F1
 25  (1.6%)         0.3240  0.2622  0.2431     0.2932  0.2352  0.2179
 64  (4.5%)         0.4694  0.4008  0.3890     0.3510  0.3241  0.3076
100  (7.1%)         0.6207  0.4459  0.4813     0.5014  0.3520  0.3793
184  (13.1%)        0.6360  0.5890  0.5931     0.4774  0.4430  0.4426
263  (18.7%)        0.7115  0.6597  0.6732     0.5225  0.4749  0.4839
339  (24.1%)        0.7217  0.7155  0.7041     0.5426  0.5130  0.5150

Figure 8: MAP, MRP, MPC(10), and MPC(100) estimates obtained using the sampling estimates obtained from 29 judgments (first row), 71 judgments (middle row) and 200 judgments (third row) as input vs. the actual values of these measures for testing and training systems for TREC 8.
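The correction / no correction distinction in Table 2 amounts to whether the judgments gathered in step 1 overwrite the corresponding inferred binary judgments; a minimal sketch (the function name is ours):

def apply_correction(inferred_qrel, judged_sample):
    """inferred_qrel and judged_sample map document id -> 0/1; known judgments win."""
    corrected = dict(inferred_qrel)
    corrected.update(judged_sample)
    return corrected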

Figure 9: TREC 7, 8 and 10: Total number of relevant documents, averaged over all queries, in the input judgments, in the inferred qrels without correction, and in the inferred qrels with correction, as the number of judgments used to create the input sampling estimates changes. (Panels: TREC 7, avrg R = 93.46; TREC 8, avrg R = 94.56; TREC 10, avrg R = 67.26; x-axis: num judged.)

It can be seen that, especially when the number of input judgments is small, the proposed method finds many more relevant documents than are given in the input judgments. Also, as expected, using the input judgments to correct the qrels does not change the inferred qrels much when the number of input judgments is small. We have shown that, based on all three evaluation criteria, the qrels inferred from estimates are highly accurate, validating the claim that the qrels obtained from a small number of judgments using the proposed method can reliably be used to construct test collections with a limited judgment budget.

5. CONCLUSIONS AND FUTURE WORK

We described a method that can be used to infer complete judgments (qrels) given a small number of judged documents. The method uses estimates of the average precision of multiple systems, together with an estimate of R, computed using a small number of relevance judgments, to infer fully judged pools from a small fraction of judged documents. The proposed method has great potential for efficient large-scale retrieval evaluation. First, we show that the inferred qrels are highly accurate in the sense that (1) they evaluate systems similarly to actual qrels and (2) the relevance of documents in the inferred qrels is very similar to the actual qrels. Hence, these inferred qrels obtained from a small number of judgments can reliably be used in place of actual qrels, significantly decreasing the judgment effort needed. Furthermore, we show that the inferred qrels contain many relevant documents, many more than the initial judged documents contain. Hence, the proposed method can be used to build qrels with real judgments using many fewer judgments than the traditional depth-100 pools, as follows: (1) judge some small number of documents to obtain the sampling estimates, (2) run the optimization procedure to obtain the inferred qrels, and (3) judge the documents that are marked as relevant in the inferred qrels. Since the inferred qrels correctly identify most of the relevant documents, by judging only the documents marked as relevant in the inferred qrels (far fewer than the size of the depth-100 pool), most of the relevant documents can be correctly identified. Furthermore, this three-step process could be repeated by obtaining better estimates of average precision using the judgments from the last step and feeding these estimates back into the first step. In the future, we plan to investigate this process in more detail.

6. REFERENCES

[1] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484–491. ACM Press, November 2003.
[2] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541–548. ACM Press, August 2006.
[3] J. A. Aslam and E. Yilmaz. Inferring document relevance via average precision. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 601–602. ACM Press, August 2006.
[4] J. A. Aslam, E. Yilmaz, and V. Pavlu. The maximum entropy method for analyzing retrieval measures. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27–34. ACM Press, August 2005.
[5] B. Carterette, J. Allan, and R. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 268–275, 2006.
[6] G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Croft et al. [7], pages 282–289.
[7] W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 1998.
[8] D. Harman. Overview of the third Text REtrieval Conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1–19. U.S. Government Printing Office, April 1995.
[9] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 315–323. ACM Press, 1998.
[10] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In Proceedings of the Fifteenth ACM International Conference on Information and Knowledge Management, pages 102–111. ACM Press, November 2006.
[11] J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [7], pages 307–314.