
The Impact of Ranker Quality on Rank Aggregation Algorithms: Information vs. Robustness
Sibel Adalı, Brandeis Hill and Malik Magdon-Ismail
Rensselaer Polytechnic Institute

Motivation

[Figure: three rankers (Ranker1, Ranker2, Ranker3), each assigning ranks 1-5 to a set of objects]

• Given a set of ranked lists of objects, what is the best way to aggregate them into a final ranked list?
• The correct answer depends on what the objective is:
  • the consensus among the input rankers, or
  • the most correct final ordering.
• In this paper:
➡ We implement existing rank aggregation methods and introduce new ones.
➡ We implement a statistical framework for evaluating the methods and report on how they compare.

Related Work
• Rank aggregation methods
  • Use of cheap methods such as average and median is common.
  • Methods based on consensus were introduced first by Dwork, Kumar, Naor and Sivakumar [WWW 2001], with median rank as an approximation by Fagin, Kumar and Sivakumar [SIGMOD 2003].
  • Methods that integrate rank and textual information are common in metasearch, for example Lu, Meng, Shu, Yu, Liu [WISE 2005].
  • Machine learning methods learn the best factors for a user by incorporating user feedback, for example Joachims [SIGKDD 2002].
• Evaluation of rank aggregation methods is mainly done with real data using fairly small data sets, for example Renda, Straccia [SAC 2003].

Error Measures
Given two rankers A and B:
• Precision (p) counts the number of objects A and B have in common (a maximization problem).
• Kendall-tau (τ) counts the total number of pairwise disagreements between A and B (a minimization problem).

Input Rankers    Aggregate
A    B    C      D
o1   o2   o4     o2
o2   o3   o2     o1
o3   o1   o3     o3
o4   o4   o1     o4
o5   o6   o7     o5

• Precision of D with respect to A, B, and C:
  p(A,D) + p(B,D) + p(C,D) = 5 + 4 + 4 = 13
• Kendall-tau of D with respect to A, B, and C:
  τ(A,D) + τ(B,D) + τ(C,D) = 1 + 1 + 4 = 6
• Missing values for τ are handled separately.
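To make the two measures concrete, here is a minimal Python sketch (our own helper functions, not the authors' code); as on the slide, Kendall-tau is counted only over pairs of objects that both lists contain, with missing objects handled separately:

from itertools import combinations

def precision(a, b):
    # Number of objects the two top-k lists have in common.
    return len(set(a) & set(b))

def kendall_tau(a, b):
    # Pairwise disagreements, counted over objects present in both lists.
    common = set(a) & set(b)
    pos_a = {o: i for i, o in enumerate(a)}
    pos_b = {o: i for i, o in enumerate(b)}
    return sum(1 for x, y in combinations(common, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

# The worked example above:
A = ["o1", "o2", "o3", "o4", "o5"]
B = ["o2", "o3", "o1", "o4", "o6"]
C = ["o4", "o2", "o3", "o1", "o7"]
D = ["o2", "o1", "o3", "o4", "o5"]
assert sum(precision(r, D) for r in (A, B, C)) == 13
assert sum(kendall_tau(r, D) for r in (A, B, C)) == 6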

Aggregation Methods
• Cheap methods:
  • Average (Av)
  • Median (Me)
  • Precision optimal (PrOpt)
• Methods that aim to optimize the Kendall-tau error of the aggregate with respect to the input rankers:
  • Markov chain methods (Pagerank, Pg)
• Iterative methods that improve a given aggregate:
  • adjacent pairs (ADJ)
  • iterative best flip (IBF)

Precision Optimal
• Rank objects by the number of times they appear across all the input lists.
• Break ties by the objects' average rank across the input lists.
• Break any remaining ties randomly.

[Worked example: Input Rankers A, B, C → number of times each object appears → break ties → choose top K. The objects o1, o2, o4 appear in all three lists, o3 and o5 in two, and o6, o7 in one; ties within each group are broken by average rank to produce the full ordering D, and the top K objects are output.]
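A minimal Python sketch of PrOpt along these lines (our own code; we read "ranking in average rankers" as the object's average rank across the lists that contain it, and rely on the sort order rather than explicit randomness for any remaining ties):

def precision_optimal(rankers, k):
    objects = {o for r in rankers for o in r}
    # How many input lists each object appears in.
    count = {o: sum(o in r for r in rankers) for o in objects}
    # Average rank across the lists that contain the object (tie-break).
    avg_rank = {o: sum(r.index(o) for r in rankers if o in r) / count[o]
                for o in objects}
    ordered = sorted(objects, key=lambda o: (-count[o], avg_rank[o]))
    return ordered[:k]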

Pagerank
• Construct a graph from the rankings (similar to Dwork et al., WWW 2001):
  • Each object returned in a ranked list is a vertex.
  • Insert an edge (j,i) for each ranked list where i is ranked higher than j.
  • Each edge is weighted (w_{j,i}) proportionally to the difference in rank it represents.
• Compute the pagerank [Brin & Page, WWW 1998] on this graph:
  • The navigation probability is proportional to the edge weights.
  • The random jump probability (p_i) is proportional to the indegree of each node.
  • Alpha (α) is set to 0.85.
• The pagerank Pg_i is the solution to the equations:

  Pg_i = α · p_i + (1 − α) · Σ_{(j,i)∈E} w_{j,i} · Pg_j

[Figure: a three-node example graph over o1, o2, o3 with edge weights 1, 1/3, and 2/3]
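A sketch of this aggregator in Python (our own reading of the slide: edges run from each object to every object ranked above it, so the sum over (j,i) ∈ E feeds Pg_j into Pg_i; the placement of α follows the equation as written):

import numpy as np

def pagerank_aggregate(rankers, alpha=0.85, iters=200):
    objects = sorted({o for r in rankers for o in r})
    idx = {o: n for n, o in enumerate(objects)}
    n = len(objects)
    W = np.zeros((n, n))      # W[j, i] = weight of edge (j, i)
    indeg = np.zeros(n)
    for r in rankers:
        for hi, winner in enumerate(r):
            for lo in range(hi + 1, len(r)):
                j, i = idx[r[lo]], idx[winner]
                W[j, i] += lo - hi       # proportional to the rank difference
                indeg[i] += 1
    # Normalize each node's outgoing weights into navigation probabilities.
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    p = indeg / indeg.sum()              # jump probability ~ indegree
    pg = np.full(n, 1.0 / n)
    for _ in range(iters):               # fixed-point iteration of the equation
        pg = alpha * p + (1 - alpha) * W.T @ pg
    return sorted(objects, key=lambda o: -pg[idx[o]])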

Iterative Improvement Methods
• Adjacent Pairs (ADJ)
  • Given an aggregate ranking, flip adjacent pairs as long as the total error with respect to the input rankers is reduced; normally the Kendall-tau error metric is used [Dwork].
• Iterative Best Flip (IBF)
  • Given an aggregate ranking:

    while not done:
        for each object:
            record the current configuration
            find the best flip with any other object and perform it,
            even if it temporarily increases the error;
            make the result the current configuration
        choose the lowest-error configuration from the history
        if the overall error is lower, or this configuration has not
        been seen before, make it the current configuration
        else break
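The following Python sketch of both procedures reuses the kendall_tau helper from the error-measures example above; the IBF loop follows the pseudocode, tracking a history of configurations and restarting from the best one seen:

def total_tau(agg, rankers):
    # Total Kendall-tau error of an aggregate w.r.t. the input lists.
    return sum(kendall_tau(agg, r) for r in rankers)

def adjacent_pairs(agg, rankers):
    # ADJ: flip adjacent pairs as long as the total error decreases.
    agg, best = list(agg), total_tau(agg, rankers)
    improved = True
    while improved:
        improved = False
        for i in range(len(agg) - 1):
            agg[i], agg[i + 1] = agg[i + 1], agg[i]
            err = total_tau(agg, rankers)
            if err < best:
                best, improved = err, True
            else:
                agg[i], agg[i + 1] = agg[i + 1], agg[i]  # undo the flip
    return agg

def iterative_best_flip(agg, rankers):
    current = list(agg)
    seen = {tuple(current)}
    best_cfg, best_err = list(current), total_tau(current, rankers)
    while True:
        history = []
        for i in range(len(current)):
            # Best flip of position i with any other position,
            # taken even if it temporarily increases the error.
            flips = []
            for j in range(len(current)):
                if j != i:
                    cand = list(current)
                    cand[i], cand[j] = cand[j], cand[i]
                    flips.append((total_tau(cand, rankers), cand))
            err, current = min(flips, key=lambda t: t[0])
            history.append((err, current))
        err, cfg = min(history, key=lambda t: t[0])
        if err < best_err or tuple(cfg) not in seen:
            if err < best_err:
                best_err, best_cfg = err, list(cfg)
            current = list(cfg)
            seen.add(tuple(cfg))
        else:
            return best_cfg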

Iterative Best Flip

Input Rankers    Aggregate    After best flip for
A    B    C      D            o5     o1     o2     o4     o3
o1   o5   o1     o5           o1     o2     o5     o4     o4
o2   o2   o4     o1           o5     o5     o2     o2     o2
o3   o3   o2     o2           o2     o1     o1     o1     o1
o4   o4   o3     o4           o4     o4     o4     o5     o3
o5   o1   o5     o3           o3     o3     o3     o3     o5

Error τ:         14           13     14     13     12     11

Choose the minimum-error configuration from this run and continue. IBF seems to outperform ADJ and to do well even when started from a random ranking.

Analysis of Aggregation Methods
• Complex aggregators incorporate subtle nuances of the input rankers. They use more information but are sensitive to noise.
• Simple aggregators disregard information contained in the input rankers but are less sensitive to noise.
• For example, average is more complex than median and precision optimal.
• What about pagerank and the other Kendall-tau-based optimizers?

Input Rankers    Kendall-tau optimal aggregations
A    B           D1    D2    D3
o1   o3          o3    o1    o1
o2   o1          o1    o2    o3
o3   o2          o2    o3    o2

The question we would like to answer is which aggregator performs well under different conditions. Does reducing Kendall-tau with respect to the input rankers always lead to a good solution?

Statistical Model of Aggregators
• Suppose there is a correct ranked list, the ground truth, that represents the correct ordering.
• The correct ordering is computed for each object using:
  • A set of factors that measure the fit of an object for a specific criterion (factors F = f_1, ..., f_F, where f_l ∈ [−3,3]). Examples of factors are the number of occurrences of a keyword, the recency of updates to a document, or pagerank.
  • A weight for each factor (W = w_1, ..., w_F, where w_1 + ... + w_F = 1).
• The final score of each object o_i is computed using a linear combination:

  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)

• Objects are ranked with respect to their scores.

Statistical Model of Aggregators
• Each ranker produces a ranked list using the same formula and the same factors.
• Ranker j tries to estimate the factors' true values for each object, producing F^j with f_l^j = f_l + ε_l.
• It also guesses the correct weights for the combination formula, producing W^j.

  Ground truth:  V_i = Σ_{l=1}^{F} w_l · f_l(o_i)
  Ranker j:      V_i^j = Σ_{l=1}^{F} w_l^j · f_l^j(o_i)

[Figure: the ground truth combines factors f_1..f_5 with weights w_1..w_5 over objects o_1..o_n; ranker j combines noisy estimates f_l^j = f_l + ε_l with its own weights w_l^j]

Statistical Model of Aggregators
• The ranker's estimate F^j of a factor introduces an error ε, i.e. F^j = F + ε^j.
• The magnitude of the error depends on a variance parameter σ².
• The distribution of the error can be adjusted to model different types of spam:

  Var(ε_{il}^j) = σ² · (γ − f_l(o_i))^δ · (γ + f_l(o_i))^β / max_{f∈[−3,3]} (γ − f)^δ · (γ + f)^β

• In our model we can also capture various types of correlation between the factors and the errors, but we do not report on those here.

Test Setup
• We distribute the scores for each factor uniformly over 100 objects, and use 5 factors and 5 rankers.
• We set γ = 1, δ = 5, β = 0.01, which models a case where rankers make small mistakes for "good" objects and increasingly larger mistakes for "bad" objects.
• We vary σ² over 0.1, 1, 5, and 7.5.
• We set the ground truth weights to

  W = (1/15, 2/15, 3/15, 4/15, 5/15)

• We assign 1, 2, 3, 4, or 5 rankers the correct weights (W), and the remaining rankers are assigned the incorrect weights (Wr); nMI denotes the number of rankers with the wrong weights:

  Wr = (5/15, 4/15, 3/15, 2/15, 1/15)
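As an illustration, here is a sketch of generating one synthetic data set under this model (our own code and names; we assume Gaussian noise, and we floor the bases of the variance formula at a tiny epsilon so the fractional powers stay real over the whole factor range when γ = 1):

import numpy as np

rng = np.random.default_rng(0)

def error_variance(f, sigma2, gamma=1.0, delta=5.0, beta=0.01):
    # Slide formula; bases floored at a tiny epsilon so fractional
    # powers stay real over the whole factor range [-3, 3].
    def base(x):
        return (np.clip(gamma - x, 1e-9, None) ** delta *
                np.clip(gamma + x, 1e-9, None) ** beta)
    norm = base(np.linspace(-3, 3, 601)).max()
    return sigma2 * base(f) / norm

def make_dataset(n_objects=100, n_factors=5, n_rankers=5, n_mi=0,
                 sigma2=1.0, k=10):
    F = rng.uniform(-3, 3, size=(n_objects, n_factors))  # true factors
    W = np.arange(1, n_factors + 1) / 15.0               # ground-truth weights
    Wr = W[::-1]                                         # misinformed weights
    truth = list(np.argsort(-(F @ W))[:k])               # ground-truth top k
    rankers = []
    for j in range(n_rankers):
        eps = rng.normal(0.0, np.sqrt(error_variance(F, sigma2)))
        Wj = Wr if j < n_mi else W       # n_mi rankers get the wrong weights
        rankers.append(list(np.argsort(-((F + eps) @ Wj))[:k]))
    return truth, rankers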

Test Setup
• For each setting, we construct 40,000 different data sets.
• For each data set, we construct each aggregator from the top 10 of the input rankers and output a top 10.
• We compare the performance of each aggregator against the ground truth using precision and Kendall-tau.
• For each error metric, we compute the difference between all pairs of aggregators.
• For each test case and error metric, we output for every pair of aggregators [A1, A2] a range [l, h] with 99.9% confidence.
• We consider A1 and A2 roughly equivalent (A1 ≈ A2) if the range [l, h] crosses zero.
• Otherwise, we construct an ordering A1 > A2 or A1 < A2 based on the range and the error metric.
• We order the aggregators using a topological sort over these orderings for each test and each error metric.
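A small sketch of this pairwise comparison (the slides do not spell out how the interval is built, so the normal approximation over the 40,000 trials is our assumption; for an error metric like Kendall-tau, a positive mean difference means A1 is worse):

import numpy as np

def compare(err_a1, err_a2, z=3.29):     # z for a two-sided 99.9% interval
    # err_a1, err_a2: one error value per synthetic data set.
    d = np.asarray(err_a1) - np.asarray(err_a2)
    half = z * d.std(ddof=1) / np.sqrt(len(d))
    l, h = d.mean() - half, d.mean() + half
    if l <= 0 <= h:
        return "A1 ~ A2"                 # interval crosses zero: equivalent
    return "A1 worse" if d.mean() > 0 else "A1 better"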

Results, precision for nMI = 0

Legend: Av = Average, Me = Median, Pg = Pagerank, Rnd = Random, PrOpt = Precision Optimal; xADJ = ADJ optimization after aggregator x; xIBF = IBF optimization after aggregator x.

[Figure: partial orderings of the aggregators by precision, with the 99.9%-confidence gap between each adjacent pair labeled on the edges; one panel for σ² = 0.1 and one for σ² = 1.0]

Results, precision for nMI = 0 (continued)

[Figure: the corresponding orderings for σ² = 5 and σ² = 7.5]

Kendall-tau results for nMI = 2

[Figure: partial orderings of the aggregators by Kendall-tau error, with 99.9%-confidence gaps on the edges, for σ² = 0.1, 1, 5, and 7.5]

Precision results for nMI = 4

[Figure: partial orderings of the aggregators by precision, with 99.9%-confidence gaps on the edges, for σ² = 0.1 and σ² = 7.5]

Result Summary
• Low noise:
  • Average is best when all the rankers are the same.
  • Median is best when there is asymmetry among the rankers.
• High noise:
  • Robustness is needed; PrOpt, IBF and Pg are the best.
• As misinformation increases, robust but more complex rankers tend to do better.

[Figure: grid of the best-performing aggregators as noise increases (low to high) along one axis and misinformation increases (less to more) along the other; Av and Me variants dominate the low-noise cells, while PrOpt, MeIBF, and Pg variants dominate the high-noise, high-misinformation cells]

Conclusion and Future Work
• We introduce two new aggregation methods, PrOpt and IBF, that seem to do well in many cases; IBF seems to do well even when starting from a random ranking.
• No single rank aggregation method is best; there is a trade-off between information and robustness.
• Further evaluation of rank aggregation methods is needed:
  • testing with various correlations, both positive and negative, between ranking factors and the errors made on them;
  • testing of the model with negative weights, where misinformation is more misleading.