KMV-Peer: A Robust and Adaptive Peer-Selection Algorithm
Yosi Mass, Yehoshua Sagiv, Michal Shmueli-Scheuer
IBM Haifa Research Lab · Hebrew University of Jerusalem
WSDM'11, Feb 9–12, Hong Kong
Motivation and Problem Statement
Motivation
- Scale up indexing and retrieval of large data collections
- The solution is described in the context of cooperative peers, each holding its own document collection
Problem Statement
- Find a good approximation of a centralized system for answering conjunctive multi-term queries, while keeping to a minimum both the number of peers that are contacted and the communication cost
Solution Framework - Indexing
- Create small-size per-term local statistics: each peer $P_i$ summarizes its full posting list for term $t_j$ (e.g., P1's list d1, d3, … for t1) by a compact synopsis $\sigma_{ij}$
- Make all statistics globally available: use a DHT to assign terms to peers; a peer that is responsible for a term holds the statistics of all other peers for that term
- Example: P3, being responsible for t1, stores t1 → (P1,σ11), (P4,σ41); similarly, the peer responsible for t2 stores t2 → (P1,σ12), (P4,σ42)
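A minimal sketch of this indexing step, assuming a simple hash-based term-to-peer assignment in place of a full DHT; the names (responsible_peer, publish_stats, directory) are illustrative, not from the paper:

```python
import hashlib

def responsible_peer(term: str, peers: list) -> str:
    """Map a term to the peer responsible for it (stand-in for a DHT lookup)."""
    h = int(hashlib.sha1(term.encode()).hexdigest(), 16)
    return peers[h % len(peers)]

# Global directory held collectively by the responsible peers:
# term -> list of (peer_id, synopsis) pairs, one per peer that has the term.
directory: dict = {}

def publish_stats(peer_id: str, term: str, synopsis, peers: list) -> None:
    """Peer peer_id sends its per-term synopsis to the peer responsible for term."""
    _target = responsible_peer(term, peers)  # in a real system: route via the DHT
    directory.setdefault(term, []).append((peer_id, synopsis))
```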
Our Contributions
- Novel per-term statistics based on KMV synopses (Beyer et al. 2007) and histograms
- A peer-selection algorithm that exploits these statistics
- An improvement over the state of the art by a factor of four
Agenda
- Collection statistics
- Peer-selection algorithm
- Experiments
- Summary and future work
Per-term KMV Statistics
- Keep the posting list of each term $t_j$ sorted by increasing score for the single-term query q = ($t_j$)
- Divide the documents into M equi-width score intervals $S_{ij}^1, \dots, S_{ij}^M$ spanning [0, max score]
- Apply a uniform hash function to the doc ids in each interval and keep the l minimal hash values: $L_{ij}^m$ is the KMV synopsis of peer $P_i$ for term $t_j$ in interval m
- The per-term statistics $\sigma_{ij}$ of peer $P_i$ for term $t_j$ consist of the M intervals together with their KMV synopses $L_{ij}^1, \dots, L_{ij}^M$

[Figure: the score range of $t_j$ split into M equi-width intervals; the doc ids in each interval (e.g., d1, d3, d5, d15, …) are hashed, and the l minimal values form that interval's KMV synopsis $L_{ij}^m$]
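A minimal sketch of building these per-term statistics, assuming scores lie in [0, max_score] and a hash function mapped into (0, 1]; the function and variable names are illustrative:

```python
import hashlib

def h01(doc_id: str) -> float:
    """Uniform hash of a doc id into (0, 1]."""
    x = int(hashlib.sha1(doc_id.encode()).hexdigest(), 16)
    return (x % (2 ** 53) + 1) / 2 ** 53

def build_synopsis(postings, M: int, l: int, max_score: float):
    """postings: list of (doc_id, score) for one term at one peer.
    Returns, per interval m = 0..M-1, the l minimal hash values
    (that interval's KMV synopsis)."""
    intervals = [[] for _ in range(M)]
    for doc_id, score in postings:
        m = min(int(score / max_score * M), M - 1)  # equi-width score interval
        intervals[m].append(h01(doc_id))
    return [sorted(vals)[:l] for vals in intervals]
```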
Peer-Scoring Functions
Given a query q = ($t_1, \dots, t_n$) and the statistics of peer $P_i$ for the query terms, use the histograms to estimate the score of a virtual document that belongs to $P_i$:

$\mathit{score}_q(d) = g_{\mathrm{aggr}}(\mathit{score}_{t_1}(d), \dots, \mathit{score}_{t_n}(d))$

$\mathit{score}_q(P_i) = F(\sigma_{i1}, \dots, \sigma_{in})$

[Figure: peer $P_i$ holds the synopses $\sigma_{i1}, \dots, \sigma_{in}$ for the query terms $t_1, \dots, t_n$]
Peer-Scoring Functions (contd.)
Consider the set $C = \{\bar{h} = (h_1, \dots, h_n) \mid h_j \in \sigma_{ij}\}$, namely all combinations of one KMV synopsis per query term. The score associated with a KMV synopsis $h_j$, denoted $\mathit{mid}(h_j)$, is the midpoint of the interval that corresponds to that synopsis:

$\mathit{score}(\bar{h}) = g_{\mathrm{aggr}}(\mathit{mid}(h_1), \dots, \mathit{mid}(h_n))$

[Figure: a combination $\bar{h}$ picks one interval synopsis $L_{ij}^m$ from each of $\sigma_{i1}, \sigma_{i2}, \dots, \sigma_{in}$]
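A sketch of enumerating C and scoring each combination, reusing build_synopsis from the earlier sketch. Here $g_{\mathrm{aggr}}$ is taken to be summation, which is one common monotone choice but an assumption on our part:

```python
from itertools import product

def mid(m: int, M: int, max_score: float) -> float:
    """Midpoint of the m-th equi-width score interval."""
    return (m + 0.5) * (max_score / M)

def combination_scores(synopses, M: int, max_score: float):
    """synopses: one per query term, each a list of M per-interval KMV lists.
    Yields (h_bar, score(h_bar)) over all combinations of one non-empty
    interval synopsis per term (the set C)."""
    per_term = [
        [(m, L) for m, L in enumerate(syn) if L]  # keep non-empty intervals
        for syn in synopses
    ]
    for h_bar in product(*per_term):
        score = sum(mid(m, M, max_score) for m, _ in h_bar)  # g_aggr = sum (assumed)
        yield h_bar, score
```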
KMV-int: The Peer Intersection Score
- Non-emptiness estimator: $\cap(\bar{h})$ is true if the intersection of $\{h_1, \dots, h_n\}$ is not empty
- Intersection score: $\mathit{score}_{\cap}^{q}(P_i) = \max_{\bar{h} \in C,\ \cap(\bar{h})} \mathit{score}(\bar{h})$
- If $\cap(\bar{h})$ is true, then we are guaranteed that $P_i$ has a document d containing all the query terms
- But $\cap(\bar{h})$ can be an underestimate (false negative), especially for queries with a large number of terms
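A simplified sketch of the intersection score, reusing combination_scores from the previous sketch. The non-emptiness test below only checks for a shared hash value across the synopses, which is the witness-based core of the KMV intersection machinery of Beyer et al.; the paper's full estimator is more refined:

```python
def nonempty(h_bar) -> bool:
    """True if the per-term KMV synopses in h_bar share a hash value.
    A shared value is a witness: the same document matched every query
    term (assuming a collision-free hash), so the intersection is
    provably non-empty. The converse may fail (false negative)."""
    sets = [set(L) for _, L in h_bar]
    return bool(set.intersection(*sets))

def kmv_int_score(synopses, M: int, max_score: float) -> float:
    """Intersection score of a peer: max score over combinations that pass
    the non-emptiness test; 0 if none passes."""
    best = 0.0
    for h_bar, s in combination_scores(synopses, M, max_score):
        if nonempty(h_bar) and s > best:
            best = s
    return best
```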
KMV-exp: The Peer Expected Score
Measures the expected relevance of the documents of Pi to the query q
$\mathit{score}_{E}^{q}(P_i) = |D_i| \sum_{\bar{h} \in C} \mathit{score}(\bar{h}) \cdot \Pr(\bar{h})$, where $\Pr(\bar{h}) = \prod_{j=1}^{n} \frac{e(h_j)}{|D_i|}$

- $e(h_j)$ is the KMV size estimator for $h_j$
- $|D_i|$ is the number of documents in peer $P_i$
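A sketch of the expected score, reusing combination_scores and assuming the standard KMV distinct-count estimator $e(h) = (l - 1) / U_{(l)}$, where $U_{(l)}$ is the l-th minimal hash value in (0, 1]:

```python
def kmv_size(L) -> float:
    """KMV distinct-count estimator: (l - 1) / (l-th minimal hash value)."""
    if len(L) < 2:
        return float(len(L))
    return (len(L) - 1) / max(L)

def kmv_exp_score(synopses, M: int, max_score: float, n_docs: int) -> float:
    """Expected score: |D_i| * sum over C of score(h_bar) * Pr(h_bar),
    with Pr(h_bar) = prod_j e(h_j) / |D_i|."""
    total = 0.0
    for h_bar, s in combination_scores(synopses, M, max_score):
        pr = 1.0
        for _, L in h_bar:
            pr *= kmv_size(L) / n_docs
        total += s * pr
    return n_docs * total
```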
A Basic Peer-Selection Algorithm
Input: q=(t1,…,tn), k (top-k results), K (max number of peers to contact)
1. Locate the peers that are responsible for the query terms
2. Get all their statistics, e.g., t1 → (P1,σ11),(P4,σ41); t2 → (P1,σ12),(P4,σ42); …; tn → (P1,σ1n),(P5,σ5n),(P9,σ9n)
3. Rank the peers using KMV-int; if fewer than K peers have a non-empty intersection, rank the rest by KMV-exp
4. Select the top-K peers and contact them to get their top-k results
5. Merge the returned results and return the top-k (a sketch of the full flow follows)
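A sketch of the basic algorithm tying the earlier pieces together; fetch_top_k is a hypothetical RPC, result objects are assumed to carry a .score field, and peer_sizes maps a peer to its |D_i|:

```python
def basic_peer_selection(query_terms, k, K, M, max_score, directory, peer_sizes):
    """Rank peers by KMV-int with KMV-exp as tie-breaker/fallback,
    contact the top-K, and merge their top-k results."""
    stats = {}  # peer_id -> {term: synopsis}
    for t in query_terms:
        for peer_id, syn in directory.get(t, []):
            stats.setdefault(peer_id, {})[t] = syn
    # Conjunctive query: a candidate must hold all query terms.
    cands = {p: s for p, s in stats.items() if len(s) == len(query_terms)}

    def key(p):
        syns = [cands[p][t] for t in query_terms]
        # Lexicographic: KMV-int first; peers with zero intersection score
        # are ordered among themselves by KMV-exp, as in the slide.
        return (kmv_int_score(syns, M, max_score),
                kmv_exp_score(syns, M, max_score, peer_sizes[p]))

    ranked = sorted(cands, key=key, reverse=True)
    results = [d for p in ranked[:K] for d in fetch_top_k(p, query_terms, k)]
    return sorted(results, key=lambda d: d.score, reverse=True)[:k]
```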
Algorithm Improvements – Save Communication Cost
At the query-initiating peer $P_q$:
- Locate the two peers that are responsible for the terms $t_f$ and $t_s$ with the smallest statistics; call them $P^{t_f}$ and $P^{t_s}$
- Forward the query to peer $P^{t_s}$

At peer $P^{t_s}$:
- Get all the statistics from peer $P^{t_f}$
- Apply KMV-int to the peers in the two lists and obtain a set of candidate peers P
- Get the rest of the statistics about q, but only for the peers in P
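A rough sketch of this routing, under the same hash-based stand-in for the DHT; forward_query, fetch_stats, kmv_int_prune, and fetch_stats_for are all hypothetical helpers standing in for the network messages and the pruning step:

```python
def route_query(query_terms, stats_size, peers):
    """At the initiating peer: pick the two terms with the smallest
    statistics and forward the query to the peer responsible for t_s."""
    t_f, t_s = sorted(query_terms, key=stats_size)[:2]
    p_ts = responsible_peer(t_s, peers)            # reusing the DHT stand-in
    forward_query(p_ts, query_terms, t_f, t_s)     # hypothetical network send

def handle_at_p_ts(query_terms, t_f, t_s, own_list, peers):
    """At P^{t_s}: fetch P^{t_f}'s statistics list, prune with KMV-int,
    then fetch the remaining per-term statistics only for survivors."""
    other_list = fetch_stats(responsible_peer(t_f, peers), t_f)  # hypothetical
    candidates = kmv_int_prune(own_list, other_list)  # peers passing KMV-int
    for t in query_terms:
        if t not in (t_f, t_s):
            fetch_stats_for(responsible_peer(t, peers), t, candidates)
    return candidates
```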
Algorithm Improvements – Adaptive Ranking
Work in rounds:
- In each round, contact the next best k' peers (k' < K)
- Obtain a threshold score (min-k), which is the score of the last (i.e., k-th) document among the current top-k
- Adaptively re-rank the remaining peers: define $\mathit{high}(\bar{h}) = g_{\mathrm{aggr}}(\mathit{high}(h_1), \dots, \mathit{high}(h_n))$, where $\mathit{high}(h_j)$ is the upper endpoint of the interval that corresponds to $h_j$
- In the scoring functions (KMV-int and KMV-exp), ignore combinations $\bar{h}$ with $\mathit{high}(\bar{h}) <$ min-k, as sketched below
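A sketch of the adaptive pruning step, reusing combination_scores and again assuming $g_{\mathrm{aggr}}$ = sum; high(h_j) is the upper edge of the combination's interval, so high(h̄) optimistically upper-bounds its score:

```python
def high_edge(m: int, M: int, max_score: float) -> float:
    """Upper endpoint of the m-th equi-width score interval."""
    return (m + 1) * (max_score / M)

def surviving_combinations(synopses, M: int, max_score: float, min_k: float):
    """Keep only combinations whose optimistic bound high(h_bar) can still
    beat the current k-th best document score (min-k)."""
    for h_bar, s in combination_scores(synopses, M, max_score):
        bound = sum(high_edge(m, M, max_score) for m, _ in h_bar)  # g_aggr = sum (assumed)
        if bound >= min_k:
            yield h_bar, s
```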
KMV-Peer: The Peer-Selection Algorithm
- k – number of top results requested
- k' – number of peers to contact in each round
- K – max number of peers to contact
- Score peers by KMV-int; if fewer than k' peers have a non-zero score, use KMV-exp (see the sketch below)
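A sketch of the full round-based loop; rank_peers is a hypothetical helper that applies KMV-int with KMV-exp as fallback, pruned against min-k as above, and fetch_top_k is the same hypothetical RPC used earlier:

```python
def kmv_peer(query_terms, k: int, k_prime: int, K: int):
    """Round-based selection: contact k' peers per round, tighten the
    min-k threshold from the merged results, and re-rank the rest."""
    contacted, results, min_k = set(), [], 0.0
    while len(contacted) < K:
        ranked = rank_peers(query_terms, exclude=contacted, min_k=min_k)  # hypothetical
        batch = ranked[:min(k_prime, K - len(contacted))]
        if not batch:
            break
        for p in batch:
            results.extend(fetch_top_k(p, query_terms, k))  # hypothetical RPC
        contacted.update(batch)
        results = sorted(results, key=lambda d: d.score, reverse=True)[:k]
        if len(results) == k:
            min_k = results[-1].score  # score of the current k-th document
    return results
```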
Experimental Setting
Datasets
- Trec – 10M web pages from the TREC GOV2 collection
- Blog – 2M blog posts from Blogger.com

Setups
- Trec-10K – 10,000 peers, each holding 1,000 documents
- Trec-1K – 1,000 peers, each holding 10,000 documents
- Blog – 1,000 peers, each holding 2,000 documents

Queries
- Trec – 15 queries from the topic-distillation track of the TREC 2003 Web Track benchmark
- Blog – 75 queries from the blog track of TREC 2008

Parameters
- l (KMV synopsis size), M (number of score intervals), G (number of groups)

Evaluation
- Normalized DCG (nDCG), which considers the order of the results in the ground truth (i.e., a centralized system)
- MAP
KMV-Peer Compared to State-of-the-Art

[Figure: nDCG vs. number of selected peers (0–70) on Trec-10K (l=10, M=5) and Blog (l=10, M=5), comparing KMV against hist, cdf-ctf, cori, and crcs; a companion plot reports communication cost in KBytes]
Tuning the Parameters of KMV-Peer

[Figure: nDCG vs. number of selected peers (0–70) on Trec-1K and Blog for different parameter settings, including l10,M20; l10,M5,G5; l5,M5,G10; l10,M5,G10; l20,M10; l5,M10; l10,M5; l100,M1; l5,M5; l10,M10; l25,M1]
Testing Different Variants of KMV-Peer

[Figure: nDCG vs. number of selected peers (0–100) on Trec-1K and Blog, comparing KMV-exp-adaptive, KMV-exp-nonAdaptive, KMV-int-adaptive, and KMV-int-nonAdaptive]
Testing Different Scoring Functions (nDCG at K=20)
- Lucene – Apache Lucene score with global synchronization
- BM25 – Okapi BM25 score with global synchronization
- Lucene* – Lucene score with the parameters (e.g., idf) derived by each peer from its own collection
Conclusions
- We presented a fully decentralized peer-selection algorithm (KMV-Peer) for approximating the results of a centralized search engine, while using only a small subset of the peers and controlling the communication cost
- The algorithm employs two scoring functions for ranking peers: the first is the intersection score, based on a non-emptiness estimator; the second is the expected score
- KMV-Peer outperforms the state-of-the-art methods, achieving an improvement of more than 400% over them
- Regarding communication cost, we showed how to filter out peers in early stages of the algorithm, thereby avoiding the need to send their synopses
Future Work
- Investigate further reductions in communication cost by using top-k algorithms with a stopping condition
- Consider less restrictive non-emptiness estimators (e.g., for disjunctive queries)
Thank You! Questions?