The Impact of Ranker Quality on Rank Aggregation Algorithms: Information vs. Robustness
Sibel Adalı, Brandeis Hill and Malik Magdon-Ismail
Rensselaer Polytechnic Institute
Motivation
[Figure: three input rankers (Ranker1, Ranker2, Ranker3), each assigning ranks 1-5 to a set of objects.]
• Given a set of ranked lists of objects, what is the best way to aggregate them into a final ranked list?
• The correct answer depends on what the objective is:
  • the consensus among the input rankers, or
  • the most correct final ordering
• In this paper:
➡ We implement existing rank aggregation methods and introduce new ones.
➡ We implement a statistical framework for evaluating the methods and report on their relative performance.
Related Work
• Rank aggregation methods
  • Use of cheap methods such as average and median is common
  • Methods based on consensus were first introduced by Dwork, Kumar, Naor and Sivakumar [WWW 2001], with median rank as an approximation by Fagin, Kumar and Sivakumar [SIGMOD 2003]
  • Methods that integrate rank and textual information are common in metasearch, for example Lu, Meng, Shu, Yu and Liu [WISE 2005]
  • Machine learning methods learn the best factors for a user by incorporating user feedback, for example Joachims [SIGKDD 2002]
• Evaluation of rank aggregation methods is done mainly with real but fairly small data sets, for example Renda and Straccia [SAC 2003]
Error Measures
• Given two rankers A and B:
  • Precision (p) counts the number of objects A and B have in common (a maximization problem)
  • Kendall-tau (τ) counts the total number of pairwise disagreements between A and B (a minimization problem)
Input Rankers            Aggregate
  A     B     C     |       D
  o1    o2    o4    |       o2
  o2    o3    o2    |       o1
  o3    o1    o3    |       o3
  o4    o4    o1    |       o4
  o5    o6    o7    |       o5
• Precision of D with respect to A, B, and C: p(A,D) + p(B,D) + p(C,D) = 5 + 4 + 4 = 13
• Kendall-tau of D with respect to A, B, and C: τ(A,D) + τ(B,D) + τ(C,D) = 1 + 1 + 4 = 6
• Missing values for τ are handled separately.
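As a concrete check, both error measures can be computed directly. The following minimal Python sketch reproduces the numbers above; pairwise comparisons are restricted to objects the two lists share, since missing values are handled separately:

```python
from itertools import combinations

def precision(a, b):
    """Number of objects the two top-k lists have in common."""
    return len(set(a) & set(b))

def kendall_tau(a, b):
    """Pairwise disagreements between two rankings, counted over shared objects."""
    common = set(a) & set(b)
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    disagreements = 0
    for x, y in combinations(common, 2):
        # A pair disagrees if the two lists order x and y differently.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            disagreements += 1
    return disagreements

A = ["o1", "o2", "o3", "o4", "o5"]
B = ["o2", "o3", "o1", "o4", "o6"]
C = ["o4", "o2", "o3", "o1", "o7"]
D = ["o2", "o1", "o3", "o4", "o5"]

# Reproduces the slide's totals: precision 5 + 4 + 4 = 13, tau 1 + 1 + 4 = 6.
print(sum(precision(x, D) for x in (A, B, C)))     # 13
print(sum(kendall_tau(x, D) for x in (A, B, C)))   # 6
```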
Aggregation Methods
• Cheap methods:
  • Average (Av)
  • Median (Me)
  • Precision optimal (PrOpt)
• Methods that aim to optimize the Kendall-tau error of the aggregate with respect to the input rankers:
  • Markov chain methods (Pagerank, Pg)
  • Iterative methods that improve a given aggregate:
    • adjacent pairs (ADJ)
    • iterative best flip (IBF)
Precision Optimal
• Rank objects with respect to the number of times they appear across all the input lists
• Break ties with respect to their average rank in the input lists
• Break any remaining ties randomly
(A worked example follows; a code sketch appears after it.)
[Worked example: four input rankers A-D over objects o1-o7. Objects are first grouped by the number of lists they appear in (o1, o2, o4 appear most often, then o3 and o5, then o6 and o7); ties within each group are broken by average rank, and the top K objects form the aggregate D.]
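A minimal sketch of the precision-optimal aggregator as described above. For simplicity it breaks remaining ties by stable sort order rather than randomly; `rankers` is assumed to be a list of top-k lists:

```python
from collections import defaultdict

def precision_optimal(rankers, k):
    """PrOpt sketch: sort objects by how many input lists contain them,
    break ties by average rank; remaining ties fall to stable order
    (the slide breaks them randomly)."""
    count = defaultdict(int)   # number of lists containing the object
    ranks = defaultdict(list)  # ranks (1-based) the object receives
    for ranking in rankers:
        for rank, obj in enumerate(ranking, start=1):
            count[obj] += 1
            ranks[obj].append(rank)
    avg_rank = {o: sum(r) / len(r) for o, r in ranks.items()}
    # More appearances first, then lower average rank.
    ordered = sorted(count, key=lambda o: (-count[o], avg_rank[o]))
    return ordered[:k]
```

Sorting by the key (-count, avg_rank) applies both rules in a single pass.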
Pagerank
• Construct a graph from the rankings (similar to Dwork et al. [WWW 2001])
  • Each object returned in a ranked list is a vertex
  • Insert an edge (i,j) for each ranked list where i is ranked higher than j
• Compute the pagerank [Brin & Page, WWW 1998] on this graph
  • Each edge is weighted ($w_{j,i}$) proportionally to the rank difference it represents
  • The navigation probability is proportional to the edge weights
  • The random jump probability ($p_i$) is proportional to the indegree of each node
  • Alpha (α) is set to 0.85
• The pagerank $Pg_i$ is the solution to the equations:

  $$Pg_i = \alpha p_i + (1 - \alpha) \sum_{(j,i) \in E} Pg_j \, w_{j,i}$$

[Figure: example graph on vertices o1, o2, o3 with normalized edge weights 1/3, 2/3, and 1.]
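Below is a minimal power-iteration sketch of this construction under the stated conventions (edge weights proportional to rank differences, out-weights normalized into navigation probabilities, jump probabilities proportional to indegree, α = 0.85). It returns the raw Pg scores; the slide does not pin down the orientation of the final sort, so that step is omitted:

```python
import numpy as np

def pagerank_scores(rankers, alpha=0.85, iters=200):
    """Solve Pg_i = alpha * p_i + (1 - alpha) * sum_{(j,i) in E} Pg_j * w_{j,i}
    on the graph built from the input ranked lists."""
    objects = sorted({o for r in rankers for o in r})
    idx = {o: k for k, o in enumerate(objects)}
    n = len(objects)
    W = np.zeros((n, n))        # W[j, i]: raw weight of edge (j, i)
    indeg = np.zeros(n)
    for ranking in rankers:
        for a, hi in enumerate(ranking):
            for b in range(a + 1, len(ranking)):
                lo = ranking[b]
                W[idx[hi], idx[lo]] += b - a   # weight ~ rank difference
                indeg[idx[lo]] += 1
    # Normalize outgoing weights into navigation probabilities.
    out = W.sum(axis=1, keepdims=True)
    W = np.divide(W, out, out=np.zeros_like(W), where=out > 0)
    p = indeg / indeg.sum()     # jump probability ~ indegree
    pg = np.full(n, 1.0 / n)
    for _ in range(iters):      # (1 - alpha) * W is a contraction, so this converges
        pg = alpha * p + (1 - alpha) * pg @ W
    return dict(zip(objects, pg))
```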
Iterative Improvement Methods
• Adjacent Pairs (ADJ)
  • Given an aggregate ranking, flip adjacent pairs as long as the total error with respect to the input rankers is reduced; normally the Kendall-tau error metric is used [Dwork et al.]
• Iterative Best Flip (IBF)
  • Given an aggregate ranking:
    While not done:
      For each object:
        record the current configuration in the history
        find the best flip with any other object and perform it,
          even if it temporarily increases the error;
          make the result the current configuration
      Choose the lowest-error configuration from the history
      If its overall error is lower, or it is a configuration not seen before:
        make it the current configuration
      Else: break
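A short sketch of ADJ, reusing the `kendall_tau` helper from the error-measures sketch earlier:

```python
def adjacent_pairs(agg, rankers):
    """ADJ sketch: keep flipping adjacent pairs of the aggregate while
    doing so reduces the total Kendall-tau error against the input rankers."""
    agg = list(agg)
    total = lambda r: sum(kendall_tau(r, x) for x in rankers)
    improved = True
    while improved:
        improved = False
        for i in range(len(agg) - 1):
            # Swap the adjacent pair at positions i, i+1 and keep it if it helps.
            candidate = agg[:i] + [agg[i + 1], agg[i]] + agg[i + 2:]
            if total(candidate) < total(agg):
                agg, improved = candidate, True
    return agg
```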
Iterative Best Flip

Input Rankers            Aggregate
  A     B     C     |       D
  o1    o5    o1    |       o5
  o2    o2    o4    |       o1
  o3    o3    o2    |       o2
  o4    o4    o3    |       o4
  o5    o1    o5    |       o3

• Error of the aggregate: τ = 14.
• Successive best flips for o5, o1, o2, o4, and o3 produce configurations with τ = 13, 14, 13, 12, and 11. For example, the best flip for o5 swaps it with o1, giving (o1, o5, o2, o4, o3) with τ = 13.
• Choose the minimum-error configuration from this run and continue!
• IBF seems to outperform ADJ and to do well even when started from a random ranking.
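A sketch of IBF following the pseudocode above, again reusing `kendall_tau`; a "flip" here swaps an object with any other position:

```python
def iterative_best_flip(agg, rankers):
    """IBF sketch: for each position in turn, perform its best flip even if
    the error rises temporarily; after each pass, jump to the lowest-error
    configuration seen, and stop once it neither improves nor is new."""
    def total(r):
        return sum(kendall_tau(r, x) for x in rankers)

    current = list(agg)
    seen = {tuple(current)}
    while True:
        start_err = total(current)
        history = []
        for i in range(len(current)):
            # Best flip for the object at position i: try every swap and
            # keep the lowest-error result, even if it is worse than before.
            best = None
            for j in range(len(current)):
                if j == i:
                    continue
                cand = list(current)
                cand[i], cand[j] = cand[j], cand[i]
                if best is None or total(cand) < total(best):
                    best = cand
            current = best
            history.append(current)
        best_cfg = min(history, key=total)
        if total(best_cfg) < start_err or tuple(best_cfg) not in seen:
            current = best_cfg
            seen.add(tuple(best_cfg))
        else:
            return current
```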
Analysis of Aggregation Methods
• Complex aggregators incorporate subtle nuances of the input rankers: they use more information, but are sensitive to noise.
• Simple aggregators disregard information contained in the input rankers, but are less sensitive to noise.
• For example, average is more complex than median and precision optimal.
• How about pagerank and the other Kendall-tau-based optimizers?

Input Rankers      Kendall-tau optimal aggregations
  A    B       |     D1    D2    D3
  o1   o3      |     o3    o1    o1
  o2   o1      |     o1    o2    o3
  o3   o2      |     o2    o3    o2

• The question we would like to answer is which aggregator performs well under different conditions. Does reducing Kendall-tau always lead to a good solution?
Statistical Model of Aggregators
• Suppose there is a correct ranked list, called the ground truth, that represents the correct ordering.
• The correct ordering is computed for each object using:
  • A set of factors that measure the fit of an object for a specific criterion (factors $F = f_1, \ldots, f_F$ where $f_l \in [-3, 3]$)
    • Examples of factors are the number of occurrences of a keyword, the recency of updates to a document, or pagerank
  • A weight for each factor ($W = w_1, \ldots, w_F$ where $w_1 + \cdots + w_F = 1$)
• The final score of each object $o_i$ is computed using a linear combination function:

  $$V_i = \sum_{l=1}^{F} w_l f_l(o_i)$$

• Objects are ranked with respect to the scores.
Statistical Model of Aggregators
• Each ranker produces a ranked list using the same formula and the same factors.
• Ranker j estimates the factors' true values for each object, producing $F^j$ with $f_l^j = f_l + \epsilon_l$.
• It also guesses the correct weights for the combination formula, producing $W^j$.

[Figure: the ground truth scores objects $o_1, \ldots, o_n$ with weights $w_1, \ldots, w_5$ and factors $f_1, \ldots, f_5$ via $V_i = \sum_{l=1}^{F} w_l f_l(o_i)$; ranker j uses its own weights $w_l^j$ and noisy factor estimates $f_l^j = f_l + \epsilon_l$ via $V_i^j = \sum_{l=1}^{F} w_l^j f_l^j(o_i)$.]
Statistical Model of Aggregators
• The ranker's estimate $F^j$ of a factor introduces an error ε, i.e. $F^j = F + \epsilon^j$.
• The magnitude of the error depends on a variance parameter σ².
• The distribution of the error can be adjusted to model different types of spam:

  $$\mathrm{Var}(\epsilon_{jil}) = \sigma^2 \, \frac{(\gamma - f_l(o_i))^\delta \,(\gamma + f_l(o_i))^\beta}{\max_{f \in [-3,3]} (\gamma - f)^\delta \,(\gamma + f)^\beta}$$
• In our model, we can also capture various types of correlation between the factors and the errors, but we do not report on those here.
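To make the model concrete, here is a minimal simulation sketch. The default sizes and weights anticipate the test setup on the next slide; the plain Gaussian noise is an illustrative simplification (the slides additionally shape the error variance per factor value with the γ, δ, β formula above):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_instance(n_objects=100, n_factors=5, n_rankers=5,
                  sigma2=1.0, n_misinformed=0):
    """Generate one synthetic instance of the model: true factors F,
    ground-truth weights W, and one noisy ranking per ranker."""
    F = rng.uniform(-3, 3, size=(n_objects, n_factors))  # true factor values
    w = np.arange(1, n_factors + 1) / np.arange(1, n_factors + 1).sum()
    w_wrong = w[::-1]            # misinformed rankers reverse the weights

    def ranking(scores):
        return list(np.argsort(-scores))  # object indices, best first

    ground_truth = ranking(F @ w)         # V_i = sum_l w_l * f_l(o_i)
    rankers = []
    for j in range(n_rankers):
        eps = rng.normal(0.0, np.sqrt(sigma2), size=F.shape)  # F^j = F + eps
        wj = w_wrong if j < n_misinformed else w              # W^j
        rankers.append(ranking((F + eps) @ wj))
    return ground_truth, rankers
```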
Test Setup
• We distribute the scores for each factor uniformly for 100 objects, and use 5 factors and 5 rankers.
• We set γ = 1, δ = 5, β = 0.01, which models a case where rankers make small mistakes for "good" objects and increasingly larger mistakes for "bad" objects.
• We vary σ² over 0.1, 1, 5, and 7.5.
• We set the ground truth weights to W = (1/15, 2/15, 3/15, 4/15, 5/15).
• We assign 1, 2, 3, 4, or 5 rankers the correct weights (W) and the remaining rankers the incorrect weights Wr = (5/15, 4/15, 3/15, 2/15, 1/15). (nMI denotes the number of rankers with the wrong weights.)
Test Setup
• For each setting, we construct 40,000 different data sets.
• For each data set, we construct each aggregator from the top 10 of the input rankers and output the top 10.
• We compare the performance of each aggregator against the ground truth using precision and Kendall-tau.
• For each error metric, we compute the difference between all pairs of aggregators.
• For each test case and error metric, we output for every pair of aggregators [A1, A2] a range [l, h] with 99.9% confidence.
  • We consider A1 and A2 roughly equivalent (A1 ≈ A2) if the range [l, h] crosses zero.
  • Otherwise, we construct an ordering A1 > A2 or A2 > A1 based on the range and the error metric.
• We order the aggregators using topological sort based on this ordering for each test and each error metric.
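A sketch of the pairwise comparison step, assuming a normal approximation for the 99.9% interval (z ≈ 3.29); the slide does not specify how the interval is computed, so this is one plausible reading:

```python
import numpy as np

def compare(err_a, err_b, z=3.29):
    """Paired comparison sketch: normal-approximation confidence interval
    for the mean per-dataset error difference between two aggregators
    (z = 3.29 corresponds to roughly 99.9% two-sided coverage).
    Returns 0 if the interval crosses zero (A1 ~ A2), otherwise the sign
    of the mean difference (negative: A1 has the lower error)."""
    d = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    half = z * d.std(ddof=1) / np.sqrt(len(d))
    lo, hi = d.mean() - half, d.mean() + half
    if lo <= 0.0 <= hi:
        return 0
    return -1 if hi < 0 else 1
```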
Results, precision for nMI = 0

Legend: Av = Average, Me = Median, Pg = Pagerank, Rnd = Random, PrOpt = Precision Optimal, xADJ = ADJ opt. after aggregator x, xIBF = IBF opt. after aggregator x.

[Figure: partial orders of the aggregators by precision for nMI = 0, one panel for σ² = 0.1 and one for σ² = 1.0; edges between adjacent aggregators are labeled with their mean precision differences.]
Results, precision for nMI = 0 (continued)

[Figure: partial orders of the aggregators by precision for nMI = 0, one panel for σ² = 5 and one for σ² = 7.5 (legend as before).]
Kendall-tau results for nMI = 2

[Figure: partial orders of the aggregators by Kendall-tau for nMI = 2, one panel each for σ² = 0.1, 1, 5, and 7.5; edge labels are mean Kendall-tau differences, negative since Kendall-tau is minimized (legend as before).]
Precision results for nMI = 4

[Figure: partial orders of the aggregators by precision for nMI = 4, one panel for σ² = 0.1 and one for σ² = 7.5 (legend as before).]
Result Summary
• Low noise:
  • Average is best when all the rankers are the same
  • Median is best when there is asymmetry among the rankers
• High noise:
  • Robustness is needed; PrOpt, IBF and Pg are the best
• As misinformation increases, robust but more complex rankers tend to do better

[Table: best aggregators arranged along two axes, from low to high noise and from less to more misinformation; cells list combinations of Av, Me, PrOpt, Pg and their ADJ/IBF variants, with x* denoting x together with xIBF.]
Conclusion and Future Work
• Two new aggregation methods, PrOpt and IBF, seem to do well in many cases; IBF seems to do well even when starting from a random ranking.
• No single rank aggregation method is best; there is a trade-off between information and robustness.
• Further evaluation of rank aggregation methods is needed:
  • Testing with various correlations, both positive and negative, between the ranking factors and the errors made on them
  • Testing of the model with negative weights, where misinformation is more misleading