PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

Andrea Esuli
Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo”, Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1, 56124 Pisa, Italy

ISTI:Science seminar, May 12, 2009

Andrea Esuli (ISTI-CNR)

PP-Index

ISTI:Science

1 / 48

Outline

1 Introduction
2 The PP-Index
3 Experiments
4 Demo
5 Conclusions


Introduction

Similarity search


The similarity search model involves:
- a collection of objects D, belonging to a domain O;
- a query object q ∈ O;
- a distance function d : O × O → R+.

The goal is to sort the objects in D by their distance from q and to return those closest to q, which are considered the most similar. Typically, only the k top-ranked objects are returned (k-NN query), or those within a maximum distance r (range query).

Choosing a meaningful value for r is often difficult. k-NN queries are usually preferred, especially in end-user applications, also because they give direct control over the size of the result set.
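The two query types can be illustrated with a brute-force evaluation that compares q with every object in D. This is only a minimal sketch: the function names, the sample points, and the choice of the Euclidean distance are assumptions made for illustration, not part of the talk.

```python
import heapq
import math


def euclidean(a, b):
    """L2 distance between two points given as tuples of floats."""
    return math.dist(a, b)


def range_query(D, q, r, d=euclidean):
    """Return (distance, object) pairs for the objects within distance r of q."""
    scored = ((d(q, o), o) for o in D)
    return sorted(pair for pair in scored if pair[0] <= r)


def knn_query(D, q, k, d=euclidean):
    """Return the k (distance, object) pairs closest to q, closest first."""
    return heapq.nsmallest(k, ((d(q, o), o) for o in D))


# Hypothetical sample collection in (R^2, L2).
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
q = (0.5, 0.0)

nearest_two = knn_query(D, q, k=2)
within_radius = range_query(D, q, r=2.5)
```

With a k-NN query the caller fixes the result size in advance, whereas the size of a range query result depends on how the data is distributed around q.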

Example (R², L2):

Figure 1: Range query.

Figure 2: k-NN query (k = 5).

Approximate similarity search

Exhaustive search: for all o_i ∈ D compute the distance d(q, o_i), while keeping track of which objects satisfy the query.
- It does not scale to large collections.

Exact methods: equivalent to exhaustive search, but using data structures that exploit the properties of the similarity space at hand (e.g., vector spaces, metric spaces) in order to reduce the number of objects of D that must be compared with the query.
- Usually efficient, but still not efficient enough for huge collections.

Approximate methods: gain efficiency by accepting that the results may contain errors (e.g., d(q, o_1) < d(q, o_2), yet o_2 is in the results and o_1 is not).
- Approximation is acceptable, e.g., when d is itself an approximation of a complex, human-perceived concept of similarity.
- It (obviously) scales!
- Typically derived from “relaxed” exact methods.
- Natively approximate proposals include, e.g., the locality-sensitive hashing (LSH) index and the permutation-based index (the PP-Index takes inspiration from both).


Approximation quality:
- What have we missed?
- What have we included?
- How much have we saved?

Figure 3: Approximate result for a k-NN query (k = 5).


Permutation-based methods

Independently proposed by Amato and Savino [1] and Chavez et al. [2], using different data structures.

The idea: an object is represented by its view of the surrounding world. Intuitively, if two objects “see” the elements of a set of reference objects R in the same order of (increasing) distance, they are likely to be close to each other.

Example: where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki.

[1] G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages 1-10.
[2] E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647-1658, 2008.


The method:
- A set of reference objects R = {r_0, ..., r_{|R|−1}} ⊂ O is defined (e.g., by randomly selecting |R| objects from D).
- Every object o_i ∈ D is then represented by a permutation Π_{o_i} of ⟨0, ..., |R| − 1⟩, i.e., the list of the identifiers of the reference objects, sorted by the distance of the corresponding reference objects from o_i.
- The search process mainly consists in computing Π_q and estimating the true distance d(q, o_i) using a permutation-based distance d′(Π_q, Π_{o_i}), e.g., Spearman's footrule distance.

Amato and Savino have shown that using only the prefix Π^l_{o_i} of the permutation Π_{o_i} (e.g., l = 100 when |R| = 500) improves both efficiency and effectiveness.

The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., l = 6 when |R| = 1000). Unlike previous approaches, it uses the permutation prefixes only to quickly find a small set of candidate objects from D for inclusion in the results, not to estimate their relative order.
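The representation described above can be sketched in a few lines. The code below is an illustration under assumptions: the helper names, the four reference points, and the tie-breaking rule (ties in distance are broken by identifier) are not taken from the talk. The footrule is written in its usual form, as the sum of the displacements of each identifier between two full permutations.

```python
import math


def permutation(o, R, d=math.dist):
    """Identifiers of the reference objects, sorted by increasing distance from o."""
    return sorted(range(len(R)), key=lambda j: (d(o, R[j]), j))


def prefix(o, R, l, d=math.dist):
    """Permutation prefix of length l, as a hashable tuple."""
    return tuple(permutation(o, R, d)[:l])


def footrule(perm_a, perm_b):
    """Spearman's footrule between two permutations of the same identifiers."""
    pos_b = {ident: pos for pos, ident in enumerate(perm_b)}
    return sum(abs(pos - pos_b[ident]) for pos, ident in enumerate(perm_a))


# Hypothetical reference objects and data objects in the plane.
R = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
a, b, c = (1.0, 0.5), (1.2, 0.4), (3.5, 3.0)

near = footrule(permutation(a, R), permutation(b, R))  # a and b are close
far = footrule(permutation(a, R), permutation(c, R))   # a and c are far apart
```

In this toy setting the two nearby objects see the reference points in the same order, while the distant one sees them in a very different order, which is the intuition the method relies on.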

Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points (r0, ..., r5), using the Euclidean distance, with full-length permutations (left) or permutation prefixes of length 3 (right).


Locality-sensitive hashing (LSH) methods

A family H of hash functions h : O → U is called (r, ε, p_1, p_2)-sensitive, with r, ε > 0 and p_1 > p_2 > 0, if for any p, q ∈ O:
- if d(p, q) ≤ r then P[h(p) = h(q)] ≥ p_1;
- if d(p, q) > r(1 + ε) then P[h(p) = h(q)] ≤ p_2;
for any function h randomly selected from H.

Intuitively: two objects have a (high) probability x_1 ≥ p_1 of colliding if they are closer than r, and a (low) probability x_2 ≤ p_2 of colliding if they are farther apart than r(1 + ε).

LSH-Index [3]:
- j randomly chosen functions h_i ∈ H define a hash function g(x) = (h_1(x) h_2(x) ... h_j(x)), i.e., the bad-collision probability is significantly lowered, to p_2^j.
- t different hash tables are built, based on randomly generated functions g_1, ..., g_t, in order to increase the good-collision probability.

[3] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages 604-613.
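The effect of the two parameters j and t can be made concrete with a small calculation. The sketch below assumes that the h_i functions and the t tables are drawn independently, which is the usual textbook assumption rather than something stated on the slide; the function names and the sample values of p_1, p_2, j, and t are purely illustrative.

```python
def bad_collision_bound(p2, j):
    """Upper bound on P[g(p) = g(q)] for far objects: all j functions must collide."""
    return p2 ** j


def good_collision_bound(p1, j, t):
    """Lower bound on the probability that near objects share a bucket in at
    least one of the t tables, each table using a j-fold concatenation."""
    miss_one_table = 1.0 - p1 ** j
    return 1.0 - miss_one_table ** t


# Illustrative values only.
p1, p2, j, t = 0.9, 0.5, 10, 5
far_bound = bad_collision_bound(p2, j)
near_bound = good_collision_bound(p1, j, t)
near_bound_single_table = good_collision_bound(p1, j, 1)
```

Concatenation lowers both collision probabilities, so a single table would also miss many near objects; adding tables recovers them at the cost of more storage.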


It is hard to tune the LSH-Index (i.e., the length of the hash keys) to obtain good effectiveness, because the appropriate key length depends on the data distribution.

LSH-Forest [4]:
- Uses variable-length hash keys.
- Long hash keys are indexed in a prefix tree (LSH-Tree).
- At search time the key length is varied in order to retrieve a given number of candidate objects.
- Candidate objects are retrieved sequentially from a data storage on disk.
- Multiple LSH-Trees, i.e., a forest, are used to improve effectiveness.

The PP-Index uses a similar data structure.

[4] M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages 651-660.


The PP-Index


PP-Index: data structures

The PP-Index represents each indexed object with a permutation prefix of length l.

Data structures:
- a prefix tree kept in main memory, indexing the permutation prefixes;
- a data storage kept on disk, storing the information required to compute real distances between the objects in D and any object in O.

The prefix tree is used to rapidly identify a set of at least z candidates (z ≥ k), leaving to the original distance function the task of determining the final k-NN result from that set. Candidates are retrieved from the data storage with a few sequential disk accesses.

The PP-Index adopts a bulk data processing model, similar to the one used for text-based inverted list indexes (it assumes that the data is static). Update capabilities (i.e., insert, delete, modify) are nevertheless easy to provide.


PP-Index: building the index

BuildIndex(D, d, R, l)
    prefixTree ← EmptyPrefixTree()
    dataStorage ← EmptyDataStorage()
    for i ← 0 to |D| − 1
        do o_i ← GetObject(D, i)
           dataBlock_{o_i} ← GetDataBlock(o_i)
           p_{o_i} ← Append(dataBlock_{o_i}, dataStorage)
           w_{o_i} ← ComputePrefix(o_i, R, d, l)
           h_{o_i} ← i
           Insert(w_{o_i}, h_{o_i}, p_{o_i}, prefixTree)
    L ← ListPointersByOrderedVisit(prefixTree)
    P ← CreateInvertedList(L)
    ReorderStorage(dataStorage, P)
    CorrectLeafValues(prefixTree, dataStorage)
    index ← NewIndex(d, R, l, prefixTree, dataStorage)
    return index

Figure 5: The BuildIndex function.


Input: dataset D, distance function d, reference objects R, prefix length l.

Indexing process:
- Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to the data storage.
- Data storage reordering: data blocks are sorted to reflect the order of the prefixes.
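The two phases can be mimicked in memory with ordinary Python containers. This is a simplified sketch, not the implementation presented in the talk: the prefix tree is flattened into a dictionary that maps every prefix of length 1 to l to the interval of the data storage it covers, the data storage is a plain list instead of a file on disk, and all names and sample points are hypothetical.

```python
import math


def compute_prefix(o, R, d, l):
    """Permutation prefix of length l (ties in distance broken by identifier)."""
    order = sorted(range(len(R)), key=lambda j: (d(o, R[j]), j))
    return tuple(order[:l])


def build_index(D, d, R, l):
    """Return (prefix_tree, data_storage) for the collection D."""
    # Main loop: one (prefix, object id) entry per object.
    entries = [(compute_prefix(o, R, d, l), i) for i, o in enumerate(D)]
    # Reordering: sort by prefix, so that every subtree of the prefix tree
    # corresponds to a contiguous interval of the data storage.
    entries.sort()
    data_storage = [(i, D[i]) for _, i in entries]
    # Flattened prefix tree: sub-prefix -> (start, end) positions, inclusive.
    prefix_tree = {}
    for pos, (w, _) in enumerate(entries):
        for depth in range(1, l + 1):
            start, _ = prefix_tree.get(w[:depth], (pos, pos))
            prefix_tree[w[:depth]] = (start, pos)
    return prefix_tree, data_storage


# Hypothetical toy data: three reference points, five objects, l = 2.
R = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
D = [(1.0, 2.0), (9.0, 1.0), (1.0, 9.0), (2.0, 1.0), (8.0, 2.0)]
prefix_tree, data_storage = build_index(D, math.dist, R, l=2)
```

After the reordering, objects that share a prefix are adjacent in the storage, which is what later allows candidates to be read with sequential accesses.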

Figure 6: Sample data (index characteristics: |D| = 10, |R| = 6, l = 3).

Figure 7: Index data structure after the first phase of object insertion (prefix tree in main memory, data storage in secondary memory).

Figure 8: Index data structure after the first phase of object insertion. Data storage order: o0, o1, o2, o3, o4, o5, o6, o7, o8, o9.

Figure 9: Index data structure after the data storage reordering. Data storage order: o0, o4, o8, o6, o1, o3, o5, o9, o2, o7.

Data storage reordering: data blocks are sorted to reflect the order of the prefixes. The leaves of the final prefix tree point to intervals of the data storage.

Efficiency alert: the reordering is performed using an m-way merge sort algorithm.

PP-Index: search function

FindCandidates(q, prefixTree, R, d, l, z)
    w_q ← ComputePrefix(q, R, d, l)
    for i ← l to 1
        do w_q^i ← SubPrefix(w_q, i)
           node ← SearchPath(w_q^i, prefixTree)
           if node ≠ nil
              then minLeaf ← GetMin(node, prefixTree)
                   maxLeaf ← GetMax(node, prefixTree)
                   if (maxLeaf.h_end − minLeaf.h_start + 1) ≥ z ∨ i = 1
                      then return (minLeaf.p_start, maxLeaf.p_end)
    return (0, 0)

Figure 10: The FindCandidates function.

Given the prefix representing the query, FindCandidates searches for the smallest subtree of the prefix tree pointing to at least z data blocks; the number of data blocks actually pointed to is denoted z′.
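The shortening of the query prefix can be sketched as follows. The code assumes a simplified, hypothetical representation in which the prefix tree is a dictionary from sub-prefixes to inclusive (start, end) positions in the data storage; it also returns None, rather than the (0, 0) pair of the pseudocode, when not even the first element of the prefix is found.

```python
def find_candidates(w_q, prefix_tree, z):
    """Return the storage interval of the smallest subtree holding >= z blocks."""
    for i in range(len(w_q), 0, -1):       # from the full prefix down to length 1
        interval = prefix_tree.get(w_q[:i])
        if interval is None:
            continue                        # no such path: try a shorter prefix
        start, end = interval
        if end - start + 1 >= z or i == 1:  # enough candidates, or cannot shorten
            return start, end
    return None


# Hypothetical flattened prefix tree over five data blocks (l = 2).
prefix_tree = {
    (0,): (0, 1), (0, 1): (0, 0), (0, 2): (1, 1),
    (1,): (2, 3), (1, 0): (2, 3),
    (2,): (4, 4), (2, 0): (4, 4),
}

exact_hit = find_candidates((1, 0), prefix_tree, z=2)   # (2, 3)
shortened = find_candidates((0, 2), prefix_tree, z=2)   # (0, 1)
too_few = find_candidates((2, 1), prefix_tree, z=3)     # (4, 4)
no_match = find_candidates((5, 0), prefix_tree, z=1)    # None
```

The third call shows why z′ can be smaller than z: at prefix length 1 the interval is returned even if it holds fewer than z blocks.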


Search(q, k, z, index)
    (p_start, p_end) ← FindCandidates(q, index.prefixTree, index.R, index.d, index.l, z)
    resultsHeap ← EmptyHeap()
    cursor ← p_start
    while cursor ≤ p_end
        do dataBlock ← Read(cursor, index.dataStorage)
           AdvanceCursor(cursor)
           distance ← index.d(q, dataBlock.data)
           if resultsHeap.size < k
              then Insert(resultsHeap, distance, dataBlock.id)
              else if distance < resultsHeap.top.distance
                      then ReplaceTop(resultsHeap, distance, dataBlock.id)
    Sort(resultsHeap)
    return resultsHeap

Figure 11: The Search function.

The z′ candidate data blocks are read sequentially from the data storage. A heap is used to keep track of the best k results.
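The candidate scan can be written with Python's heapq module. Since heapq provides a min-heap, the sketch stores negated distances so that the top of the heap is the worst of the current best k results. The list-based data storage, the function name, and the sample blocks are assumptions made for illustration.

```python
import heapq
import math


def search(q, k, interval, data_storage, d=math.dist):
    """Scan the candidate interval and return the best k (distance, id) pairs."""
    start, end = interval
    heap = []  # entries are (-distance, id): heap[0] is the current worst result
    for cursor in range(start, end + 1):     # sequential read of the candidates
        obj_id, data = data_storage[cursor]
        dist = d(q, data)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, obj_id))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, obj_id))
    return sorted((-neg, obj_id) for neg, obj_id in heap)


# Hypothetical data storage: (object id, object data), already sorted by prefix.
data_storage = [
    (3, (2.0, 1.0)), (0, (1.0, 2.0)), (1, (9.0, 1.0)),
    (4, (8.0, 2.0)), (2, (1.0, 9.0)),
]
q = (8.5, 1.0)
best_in_interval = search(q, 1, (2, 3), data_storage)
best_two_overall = search(q, 2, (0, 4), data_storage)
```

Only the objects inside the interval are compared with q using the real distance, so the cost of a query is governed by z′ rather than by |D|.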


PP-Index: improving the search effectiveness

The basic search strategy is designed for efficiency. Effectiveness can be boosted using various search strategies:

Multiple index: building n PP-Indexes using different reference sets R_1, ..., R_n (LSH-Forest style).
- Projecting different R-induced “grids” on the objects helps to approximate a better (less skewed) partitioning of the space.
- Can be implemented using data replication (faster, more storage) or data referencing (slower, less storage).
- The k-NN results from the various indexes are merged into the final result.

Multiple query: generating m perturbed versions of w_q in order to explore the neighborhood of w_q.
- The perturbed prefixes w_q^i are generated by swapping pairs of elements of w_q, first selecting the pairs with the smallest difference in distance from q.
- All the w_q^i prefixes are used to find candidates on the same index.
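One possible reading of the multiple query strategy is sketched below. It is an assumption-laden illustration: the slide does not say which pairs are eligible or how many swaps each perturbed prefix contains, so this version considers only adjacent pairs and applies a single swap per perturbed prefix. All names and sample values are hypothetical.

```python
import math


def perturbed_prefixes(q, R, l, m, d=math.dist):
    """Return (w_q, variants): the query prefix and up to m single-swap variants."""
    dist = [d(q, r) for r in R]
    order = sorted(range(len(R)), key=lambda j: (dist[j], j))
    w_q = order[:l]
    # A small distance gap between neighbouring elements means their relative
    # order is the least reliable, so those pairs are swapped first.
    positions = sorted(range(l - 1), key=lambda p: dist[w_q[p + 1]] - dist[w_q[p]])
    variants = []
    for p in positions[:m]:
        w = list(w_q)
        w[p], w[p + 1] = w[p + 1], w[p]
        variants.append(tuple(w))
    return tuple(w_q), variants


# Hypothetical one-dimensional example.
R = [(0.0,), (1.0,), (3.0,), (3.1,), (7.0,)]
q = (0.4,)
w_q, variants = perturbed_prefixes(q, R, l=4, m=2)
```

Each variant can then be passed to the candidate search on the same index, and the candidate sets are scanned together.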


PP-Index: prefix tree optimizations

Figure 12: Pruning of only-child paths to leaves.

Figure 13: Only-child paths compression.

Scalability alert: any subtree pointing to fewer than z data blocks can be reduced to a single leaf.
- Applicable when z is hardcoded into the search function.
- Does not affect the quality of search results.
- Lossy with respect to index update operations.


PP-Index: merging (and updating) the index

Scalability alert: the index (prefix tree) reaches its maximum memory requirement at the end of the main loop of the indexing process. At that point it may not fit into memory, even though the final index (after optimizations) would.

Strategy: build many smaller indexes, using the same R set, then merge them together.

The merge process is efficient:
- The source prefix trees are merged into the final prefix tree by performing a parallel ordered visit on them.
- Data storages are merged into the final data storage while building the final prefix tree.
- Can be done from disk to disk, with minimal memory occupation.
- Linear cost with respect to index size (if not done in an m-way style).
- Uses only sequential reads/writes.

Update operations can be supported by recording them in a small all-in-memory index and performing periodic merge operations.
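The merge can be illustrated with `heapq.merge`, which lazily combines already sorted inputs while reading each of them sequentially. The sketch assumes, for simplicity, that every stored block carries its own prefix and that the prefix tree is rebuilt as a flat dictionary of intervals; both are choices made here for illustration and not details given in the talk.

```python
import heapq


def merge_indexes(storages):
    """Merge storages sorted by prefix; return (prefix_tree, merged_storage)."""
    merged, prefix_tree = [], {}
    for pos, (w, obj_id) in enumerate(heapq.merge(*storages)):
        merged.append((w, obj_id))
        # Extend the interval of every sub-prefix of w to cover this position.
        for depth in range(1, len(w) + 1):
            start, _ = prefix_tree.get(w[:depth], (pos, pos))
            prefix_tree[w[:depth]] = (start, pos)
    return prefix_tree, merged


# Two hypothetical partial indexes built with the same R set and l = 2.
part_a = [((0, 1), 3), ((1, 0), 1)]
part_b = [((0, 2), 0), ((1, 0), 4), ((2, 0), 2)]
prefix_tree, merged = merge_indexes([part_a, part_b])
```

Because both inputs must share the same R set, equal prefixes in different partial indexes denote the same region of the space and end up adjacent in the merged storage.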


Experiments


The CoPhIR collection

CoPhIR [5] consists of a crawl of 106 million images from the Flickr photo sharing website.
- Textual data + five MPEG-7 visual descriptors (240 GB of XML description data).
- Visual similarity measure: linear combination of distance functions defined on the MPEG-7 descriptors.

MPEG-7 Visual Descriptor    Distance type    Dimension    Weight
Scalable Color              L1               64           2
Color Structure             L1               64           3
Color Layout                sum of L2        80           2
Edge Histogram              L1               62           4
Homogeneous Texture         L1               12           0.5

Table 1: Details on the five MPEG-7 visual descriptors used in CoPhIR, and the weights used in the linear combination. The “Dimension” column refers to the specific dimensions of the visual descriptors adopted by the CoPhIR data set.

Experiments were run on collections of 1, 10, and 100 million images, using 100 randomly selected images as queries (excluded from the indexes).

[5] http://cophir.isti.cnr.it/ http://www.saphir.eu/ http://www.flickr.com/


Evaluation measures

Effectiveness measures:

Recall (ranking-based [6]):

$$\mathrm{Recall}(k) = \frac{|D_q^k \cap P_q^k|}{k} \tag{1}$$

Relative Distance Error (distance-based):

$$\mathrm{RDE}(k) = \frac{1}{k} \sum_{i=1}^{k} \frac{d(q, P_q^k(i))}{d(q, D_q^k(i))} - 1 \tag{2}$$

where D_q^k is the list of the k elements of D closest to q, sorted by their distance from q, and P_q^k is the list returned by the algorithm.

Efficiency measures:
- index time;
- index size (RAM, disk);
- number of candidates retrieved from disk (z′);
- average search time.

[6] M. Patella and P. Ciaccia, The many facets of approximate similarity search, SISAP 2008, pages 10-21.
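Both effectiveness measures are straightforward to compute once the exact and the approximate results are available. The following sketch uses hypothetical function names and made-up result lists; it assumes that both lists have length k, are sorted by increasing distance, and that no exact distance is zero (the queries are excluded from the index).

```python
def recall(exact_ids, approx_ids):
    """Fraction of the true k nearest neighbours found in the approximate result."""
    k = len(exact_ids)
    return len(set(exact_ids) & set(approx_ids)) / k


def relative_distance_error(exact_dists, approx_dists):
    """Average ratio between approximate and exact distances at each rank, minus one."""
    k = len(exact_dists)
    return sum(a / e for a, e in zip(approx_dists, exact_dists)) / k - 1.0


# Made-up example with k = 4.
exact_ids, exact_dists = [1, 2, 3, 4], [1.0, 2.0, 2.0, 4.0]
approx_ids, approx_dists = [1, 2, 4, 7], [1.0, 2.0, 4.0, 5.0]

r = recall(exact_ids, approx_ids)
e = relative_distance_error(exact_dists, approx_dists)
```

A perfect result gives a recall of 1 and an RDE of 0; RDE additionally tells how far the wrongly included objects are from the ones they replaced.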


Results

Figure 14: Indexing time (s) with respect to the size of R (|R| = 100, 200, 500, 1000) and the data set size (1M, 10M, 100M).

|D|     indexing time (s)    prefix tree size (full)    prefix tree size (comp.)    data storage    l′
1M      419                  7.7 MB                     91 kB                       349 MB          2.1
10M     4385                 53.8 MB                    848 kB                      3.4 GB          2.7
100M    45664                354.5 MB                   6.5 MB                      34 GB           3.5

Table 2: Indexing times (with |R| = 100), resulting index sizes, and average prefix tree depth l′ (after prefix tree compression with z = 1,000), for the various data set sizes.


Figure 15: Search time (s) with respect to the size of R and the data set size (1M, 10M, 100M). Search performed with z = 1,000 and k = 100 (single index, single query).

|R|      |D| = 1M    |D| = 10M    |D| = 100M
100      4,075       5,817        7,941
200      3,320       5,571        7,302
500      1,803       5,065        6,853
1,000    1,091       4,748        6,644

Table 3: Average z′ value (z = 1,000), i.e., average number of retrieved candidate objects for a query, with respect to the size of the reference objects set and the data set size.


Figure 16: Effectiveness (RDE(k) and Recall(k)) with respect to the size of the R set, on various index sizes (1M, 10M, 100M), using k = 100 and z = 1,000 (single index, single query).

Figure 17: Effectiveness (RDE(k) and Recall(k), for k = 1, 10, 100) with respect to the size of the R set, on the 100M index, using z = 1,000 (single index, single query).


Figure 18: Effectiveness (RDE(k) and Recall(k), for k = 1, 10, 100) of the multiple index search strategy on the 100M index, with 1, 2, 4, and 8 indexes, using |R| = 1,000 and z = 1,000.

Figure 19: Effectiveness (RDE(k) and Recall(k), for k = 1, 10, 100) of the multiple query search strategy on the 100M index, with 1, 2, 4, and 8 queries, using |R| = 1,000 and z = 1,000.


Figure 20: Effectiveness (RDE(k) and Recall(k), for k from 1 to 100) of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes (1M, 10M, 100M), using |R| = 100 and z = 1,000.

Figure 21: Effectiveness (RDE(k) and Recall(k), for k from 1 to 100) of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes (1M, 10M, 100M), using |R| = 1,000 and z = 1,000.


Demo


http://mipai.esuli.it

Conclusions


Summary

The PP-Index:
- is a simple but effective data structure for approximate similarity search;
- scales well, both at indexing time and at search time;
- can be kept updated with minor additional effort;
- has good parallelization properties;
- relates well with other data structures (i.e., inverted lists).

There is still a lot to investigate:
- policies for the selection of reference points;
- studying the relations between l, |R|, z, k, l′, and z′;
- giving a theoretical foundation to permutation-based methods;
- applicability to other domains and types of similarity space;
- policies for data partitioning.


Questions?


FAQ

Q: How does the PP-Index differ from the “orthodox” metric approach X?

A: Please help yourself:
- No “explicit” requirement of metric properties.
- Use of a predetermined (i.e., fixed) set of reference points.
- Any reference point has a “global” influence.
- Different data access model.
- Different data update model.
- ...


Q: What are the key differences between the permutation-based methods and the LSH-based methods?

A:
- The permutation-based methods are mostly based on geometrical considerations, while LSH-based methods are mostly based on probabilistic considerations.
- The permutation-based methods are able to take into account how data is distributed in the similarity space (by means of R), while the LSH hash functions are derived only from the distance function.
- Each element of the hash key generated by an LSH hash function is independent of the others, while the order relation between the elements of a permutation is the crucial information for a permutation-based method.
