KMV-Peer: A Robust and Adaptive Peer-Selection Algorithm

Yosi Mass, Yehoshua Sagiv, Michal Shmueli-Scheuer
IBM Haifa Research Lab / Hebrew University of Jerusalem

WSDM'11, Feb 9-12, Hong Kong

Motivation and Problem Statement

Motivation
- Scale up indexing and retrieval of large data collections
- The solution is described in the context of cooperative peers P1, P2, P3, P4, ..., each having its own collection

Problem Statement
- Find a good approximation of a centralized system for answering conjunctive multi-term queries, while keeping at a minimum both the number of peers that are contacted and the communication cost

Solution Framework - Indexing

- Create small-size per-term local statistics: each peer summarizes its full posting list for a term (e.g., P1's list t1 -> d1,d3,...) by a compact statistic (e.g., σ11; likewise σ12 for t2, and σ41, σ42 for P4's lists)
- Make all statistics globally available
- Use a DHT to assign terms to peers; the peer that is responsible for a term holds the statistics of all other peers for that term (sketched below)
- Example: P3 is responsible for term t1 and stores the list
    t1 -> (P1,σ11),(P4,σ41)
  and, similarly, the peer responsible for t2 stores
    t2 -> (P1,σ12),(P4,σ42)
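A minimal sketch of the term-to-peer assignment, assuming a simple hash-mod placement rather than a full DHT protocol such as Chord; responsible_peer and the peer list are illustrative names, not from the paper:

    import hashlib

    def responsible_peer(term, peers):
        """Map a term to the peer responsible for it (hash mod number of peers)."""
        h = int(hashlib.sha1(term.encode("utf-8")).hexdigest(), 16)
        return peers[h % len(peers)]

    peers = ["P1", "P2", "P3", "P4"]
    # The returned peer holds every peer's statistics for t1
    print(responsible_peer("t1", peers))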

Our Contributions

- Novel per-term statistics based on KMV synopses (Beyer et al. 2007) and histograms
- A peer-selection algorithm that exploits the above statistics
- An improvement over the state of the art by a factor of four

Agenda

- Collection statistics
- Peer-selection algorithm
- Experiments
- Summary and future work

Per-term KMV Statistics

- Keep the posting list of each term tj sorted by increasing score for the single-term query q=(tj)
- Divide the documents into M equi-width score intervals [0, Sij/M), [Sij/M, 2*Sij/M), ..., where Sij is the max score of peer Pi for term tj
- Apply a uniform hash function to the doc ids in each interval and keep the l minimal hash values, yielding one KMV synopsis per interval (sketched below)

The statistics σij of peer Pi for term tj thus consist of the max score Sij and the KMV synopses Lij^1, ..., Lij^M (e.g., Lij^5 is the KMV synopsis for interval 5).
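A minimal sketch of building σij for one term at one peer, assuming postings arrive as (doc_id, score) pairs; the SHA-1 based hash stands in for the uniform hash function of the slide:

    import hashlib

    def doc_hash(doc_id):
        """Uniform hash of a doc id (160-bit integer from SHA-1)."""
        return int(hashlib.sha1(str(doc_id).encode("utf-8")).hexdigest(), 16)

    def build_stats(postings, M, l):
        """postings: list of (doc_id, score) pairs for term tj at peer Pi.
        Returns (S, synopses): S is the max score, and synopses[m] is the KMV
        synopsis (the l minimal hash values, sorted) of the m-th score interval."""
        S = max(score for _, score in postings)
        width = S / M
        buckets = [[] for _ in range(M)]
        for doc_id, score in postings:
            m = min(int(score / width), M - 1)  # clamp score S into the last interval
            buckets[m].append(doc_hash(doc_id))
        return S, [sorted(b)[:l] for b in buckets]

    # Toy example: 6 documents, M=3 intervals, synopses of size l=2
    S, synopses = build_stats([("d1", 0.1), ("d3", 0.2), ("d5", 0.5),
                               ("d15", 0.6), ("d8", 0.8), ("d2", 0.9)], M=3, l=2)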

Peer-Scoring Functions

Given a query q=(t1,...,tn) and the statistics σi1,...,σin of peer Pi for the query terms, use the histograms to estimate the score of a virtual document that belongs to Pi.

In a centralized system, a document is scored by aggregating its per-term scores:

    score_q(d) = g_aggr(score_t1(d), ..., score_tn(d))

The goal is an analogous peer score

    score_q(Pi) = F(σi1, ..., σin)

for a suitable function F, defined next.

Peer-Scoring Functions - contd

- Consider the set C = { h = (h1,...,hn) | hj ∈ σij }, namely all combinations of one KMV synopsis per query term (hj is one of the synopses Lij^1, Lij^2, ... of σij)
- The score associated with a KMV synopsis hj, denoted mid(hj), is the middle of the score interval that corresponds to that synopsis
- By analogy with score_q(d) = g_aggr(score_t1(d), ..., score_tn(d)), each combination is scored as

    score(h) = g_aggr(mid(h1), ..., mid(hn))
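A minimal sketch of enumerating C and computing score(h), reusing the output of build_stats() from the earlier sketch and assuming g_aggr is summation (the slides leave the aggregation function abstract):

    from itertools import product

    def combination_scores(term_stats, M):
        """term_stats: one (S, synopses) pair per query term, as built above.
        Yields (h, score(h)) for each combination h of one non-empty synopsis
        per term, with g_aggr taken to be summation."""
        per_term = []
        for S, synopses in term_stats:
            width = S / M
            per_term.append([(syn, (m + 0.5) * width)       # (synopsis, mid(h_j))
                             for m, syn in enumerate(synopses) if syn])
        for combo in product(*per_term):
            h = tuple(syn for syn, _ in combo)
            yield h, sum(mid for _, mid in combo)            # score(h)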

KMV-int: The Peer Intersection Score

- Non-emptiness estimator: ∩(h) is true if the intersection of the synopses {h1,...,hn} is not empty
- Intersection score (sketched below):

    score-int_q(Pi) = max { score(h) : h ∈ C and ∩(h) is true }

- If ∩(h) is true, then we are guaranteed that Pi has a document d containing all query terms
- But ∩(h) can be an underestimate (false negative), especially for queries with a large number of terms
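A minimal sketch of KMV-int, reusing combination_scores() from the previous sketch; since all synopses are built with the same hash function, the non-emptiness estimator can simply intersect the stored hash values:

    def kmv_int(term_stats, M):
        """Intersection score: the best score(h) over combinations whose
        synopses share at least one hash value (the non-emptiness estimator)."""
        best = 0.0
        for h, s in combination_scores(term_stats, M):
            if set.intersection(*(set(syn) for syn in h)):  # estimator is true
                best = max(best, s)
        return best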

KMV-exp: The Peer Expected Score

Measures the expected relevance of the documents of Pi to the query q:

    score-exp_q(Pi) = |Di| * Σ_{h ∈ C} score(h) * Pr(h)

    Pr(h) = Π_{j=1..n} e(hj) / |Di|

where e(hj) is the KMV size estimator for hj and |Di| is the total number of documents in peer Pi.
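A minimal sketch of KMV-exp, reusing combination_scores() from above and assuming the standard KMV distinct-count estimator e(h) = (l-1)/U_l of Beyer et al. 2007 (the slides name the estimator but do not spell it out):

    MAX_HASH = 2 ** 160                  # range of the SHA-1 based doc_hash

    def size_estimate(synopsis, l):
        """KMV estimate of the number of distinct docs behind a synopsis:
        (l-1)/U_l, with U_l the l-th smallest hash value normalized to (0,1];
        exact when fewer than l docs were hashed."""
        if len(synopsis) < l:
            return float(len(synopsis))
        return (l - 1) / (synopsis[l - 1] / MAX_HASH)

    def kmv_exp(term_stats, M, l, num_docs):
        """Expected score: |Di| * sum over h of score(h) * Pr(h),
        with Pr(h) = prod_j e(h_j)/|Di|."""
        total = 0.0
        for h, s in combination_scores(term_stats, M):
            pr = 1.0
            for syn in h:
                pr *= size_estimate(syn, l) / num_docs
            total += s * pr
        return num_docs * total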

A Basic Peer-Selection Algorithm

Input: q=(t1,...,tn), k (top-k results), K (max number of peers to contact)

1. Locate the peers that are responsible for the query terms
2. Get all their statistics, e.g.
     t1 -> (P1,σ11),(P4,σ41)
     t2 -> (P1,σ12),(P4,σ42)
     ...
     tn -> (P1,σ1n),(P5,σ5n),(P9,σ9n)
3. Rank the peers using KMV-int; if fewer than K peers have a non-empty intersection, rank the rest by KMV-exp
4. Select the top-K peers and contact them to get their top-k results
5. Merge the returned results and return the top-k (see the sketch below)
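A minimal sketch of the basic algorithm, reusing kmv_int()/kmv_exp() from the earlier sketches; the DHT lookup is assumed done, so peer_stats maps each candidate peer to its per-term statistics, and fetch_top_k is a hypothetical remote call returning (score, doc) pairs. The defaults M=5, l=10 mirror the l10,M5 setting of the experiments:

    def select_peers(q, k, K, peer_stats, fetch_top_k, M=5, l=10, num_docs=1000):
        by_int = {p: kmv_int(ts, M) for p, ts in peer_stats.items()}
        ranked = sorted((p for p in by_int if by_int[p] > 0),
                        key=lambda p: -by_int[p])
        if len(ranked) < K:              # too few non-empty intersections:
            rest = sorted((p for p in by_int if by_int[p] == 0),
                          key=lambda p: -kmv_exp(peer_stats[p], M, l, num_docs))
            ranked += rest
        results = []
        for peer in ranked[:K]:          # contact only the top-K peers
            results.extend(fetch_top_k(peer, q, k))
        return sorted(results, reverse=True)[:k]   # merge, return the top-k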

Algorithm Improvements – Saving Communication Cost

At the query-initiating peer Pq:
- Among the peers responsible for the query terms, locate the two whose terms have the smallest statistics; call them P^tf and P^ts
- Forward the query to peer P^ts

At peer P^ts:
- Get all statistics from peer P^tf
- Apply KMV-int to the peers appearing in the two lists and obtain a set of candidate peers P
- Get the rest of the statistics about q, but only for peers in P

Algorithm Improvements – Adaptive Ranking

Work in rounds:
- In each round, contact the next best k' peers (k' < K)
- Obtain a threshold score (min-k), which is the score of the last (i.e., k-th) document among the current top-k results
- Adaptively re-rank the remaining peers:
  - Define high(h) = g_aggr(high(h1), ..., high(hn)), where high(hj) is the upper end of the score interval of hj
  - In the scoring functions (KMV-int and KMV-exp), ignore tuples h with high(h) < min-k
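A minimal sketch of the adaptive filter, again assuming g_aggr = sum; it replaces combination_scores() from the earlier sketch and takes high(hj) to be the upper end of hj's score interval:

    from itertools import product

    def combination_scores_adaptive(term_stats, M, min_k):
        """Like combination_scores(), but skips tuples h whose optimistic
        score high(h) is below the current threshold min-k."""
        per_term = []
        for S, synopses in term_stats:
            width = S / M
            per_term.append([(syn, (m + 0.5) * width, (m + 1) * width)
                             for m, syn in enumerate(synopses) if syn])
        for combo in product(*per_term):
            if sum(hi for _, _, hi in combo) < min_k:
                continue                 # high(h) < min-k: cannot affect the top-k
            yield (tuple(syn for syn, _, _ in combo),
                   sum(mid for _, mid, _ in combo))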

KMV-Peer: The Peer-Selection Algorithm

Parameters:
- k – top-k results are requested
- k' – number of peers to contact in each iteration
- K – max number of peers to contact

Score peers by KMV-int, but if fewer than k' peers have a non-zero score, then use KMV-exp. (A sketch of the full round-based loop follows.)
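A minimal sketch of the round-based loop, reusing the earlier helpers; fetch_top_k and the default parameter values are illustrative only:

    def kmv_peer(q, k, k_prime, K, peer_stats, fetch_top_k, M=5, l=10, num_docs=1000):
        results, contacted = [], set()
        while len(contacted) < min(K, len(peer_stats)):
            remaining = {p: ts for p, ts in peer_stats.items()
                         if p not in contacted}
            by_int = {p: kmv_int(ts, M) for p, ts in remaining.items()}
            if sum(1 for v in by_int.values() if v > 0) >= k_prime:
                ranked = sorted(remaining, key=lambda p: -by_int[p])
            else:                        # too few non-zero scores: fall back
                ranked = sorted(remaining,
                                key=lambda p: -kmv_exp(remaining[p], M, l, num_docs))
            for peer in ranked[:min(k_prime, K - len(contacted))]:
                contacted.add(peer)      # contact the next best k' peers
                results.extend(fetch_top_k(peer, q, k))
            results = sorted(results, reverse=True)[:k]
            # Once k results exist, min-k = results[-1][0] would drive the
            # adaptive pruning of combination_scores_adaptive() above
        return results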

Experimental Setting

Datasets
- Trec – 10M web pages from the TREC GOV2 collection
- Blog – 2M blog posts from Blogger.com

Setups
- Trec-10K – 10,000 peers, each having 1,000 documents
- Trec-1K – 1,000 peers, each having 10,000 documents
- Blog – 1,000 peers, each having 2,000 documents

Queries
- Trec – 15 queries from the topic-distillation track of the TREC 2003 Web Track benchmark
- Blog – 75 queries from the blog track of TREC 2008

Parameters
- l (KMV size), M (number of score intervals), G (number of groups)

Evaluation
- Normalized DCG (nDCG), which considers the order of the results in the ground truth (i.e., a centralized system)
- MAP

KMV-Peer Compared to State-of-the-Art

[Figures: nDCG (0-1) vs. number of selected peers (0-70) on Trec-10K (l10,M5) and Blog (l10,M5), comparing KMV against hist, cdf-ctf, cori, and crcs; communication cost is reported in KBytes.]

Tuning the Parameters of KMV-Peer

[Figures: nDCG vs. number of selected peers (0-70) on Trec-1K and Blog for different settings of l (KMV size), M (score intervals), and G (groups), including l10,M20; l20,M10; l10,M5; l5,M10; l5,M5; l100,M1; l10,M10; l25,M1; l10,M5,G5; l5,M5,G10; l10,M5,G10.]

Testing Different Variants of KMV-Peer

[Figures: nDCG vs. number of selected peers (0-100) on Trec-1K and Blog, comparing KMV-exp-adaptive, KMV-exp-nonAdaptive, KMV-int-adaptive, and KMV-int-nonAdaptive.]

Testing Different Scoring Functions (nDCG at K=20)

- Lucene – Apache Lucene score with global synchronization
- BM25 – Okapi BM25 score with global synchronization
- Lucene* – Lucene score with the parameters (e.g., idf) derived by each peer from its own collection
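For reference, the standard Okapi BM25 formula (general background, not taken from the slides), with the usual free parameters k1 and b; N is the collection size, df(t) the document frequency of t, tf(t,d) the term frequency, |d| the document length, and avgdl the average document length:

    \mathrm{score}(d,q) = \sum_{t \in q} \mathrm{idf}(t)\,
        \frac{\mathrm{tf}(t,d)\,(k_1+1)}{\mathrm{tf}(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
    \qquad
    \mathrm{idf}(t) = \ln\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}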

Conclusions

- We presented a fully decentralized peer-selection algorithm (KMV-Peer) for approximating the results of a centralized search engine, while using only a small subset of the peers and controlling the communication cost.
- The algorithm employs two scoring functions for ranking peers: the intersection score, based on a non-emptiness estimator, and the expected score.
- KMV-Peer outperforms the state-of-the-art methods, achieving an improvement of more than 400% over other methods.
- Regarding communication cost, we showed how to filter out peers in early stages of the algorithm, thereby avoiding the need to send their synopses.

Future Work

- Investigate further reductions in communication cost by using top-k algorithms with a stopping condition
- Consider less restrictive non-emptiness estimators (disjunctive queries)

Thank You! Questions?