Coding for Random Projections and Approximate Near Neighbor Search

Ping Li, Department of Statistics & Biostatistics and Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA, [email protected]

Michael Mitzenmacher, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA, [email protected]

Anshumali Shrivastava, Department of Computer Science, Cornell University, Ithaca, NY 14853, USA, [email protected]

Abstract

This technical note compares two coding (quantization) schemes for random projections in the context of sub-linear time approximate near neighbor search. The first scheme is based on uniform quantization [4], while the second scheme uses uniform quantization plus a uniformly random offset [1] (which has been popular in practice). The prior work [4] compared the two schemes in the context of similarity estimation and training linear classifiers, and concluded that the random offset is not necessary and may hurt the performance (depending on the similarity level). The task of near neighbor search is related to similarity estimation, but with important distinctions, and requires its own study. In this paper, we demonstrate that in the context of near neighbor search, the random offset is not needed either and may hurt the performance (sometimes significantly so, depending on the similarity and other parameters). For approximate near neighbor search, when the target similarity level is high (e.g., correlation > 0.85), our analysis suggests using uniform quantization to build hash tables, with a bin width w = 1 ∼ 1.5. On the other hand, when the target similarity level is not that high, it is preferable to use larger w values (e.g., w ≥ 2 ∼ 3). Equivalently, it suffices to use only a small number of bits (or even just 1 bit) to code each hashed value in the context of sublinear time near neighbor search. An extensive experimental study on two reasonably large datasets confirms the theoretical findings.

Coding for building hash tables is a different task from coding for similarity estimation. For near neighbor search, we code the projected data to determine which buckets the data points should be placed in (and the coded values are not stored). For similarity estimation, the purpose of coding is to accurately estimate the similarities using small storage space. Therefore, if necessary, we can actually code the projected data twice (with different bin widths). In this paper, we do not study the important issue of "re-ranking" of retrieved data points by using estimated similarities. That step is needed when exact (all pairwise) similarities cannot be practically stored or computed on the fly. In a concurrent work [5], we demonstrate that the retrieval accuracy can be further improved by using nonlinear estimators of the similarities based on a 2-bit coding scheme.

1 Introduction

This paper focuses on the comparison of two quantization schemes for random projections in the context of sublinear time near neighbor search. The task of near neighbor search is to identify a set of data points which are "most similar" (in some measure of similarity) to a query data point. Efficient algorithms for near neighbor search have numerous applications in search, databases, machine learning, recommender systems, computer vision, etc. Developing efficient algorithms for finding near neighbors has been an active research topic since the early days of modern computing [2]. Near neighbor search with extremely high-dimensional data (e.g., texts or images) is still a challenging task and an active research problem.

Among many types of similarity measures, the (squared) Euclidean distance (denoted by d) and the correlation (denoted by ρ) are most commonly used. Without loss of generality, consider two high-dimensional data vectors u, v ∈ R^D. The squared Euclidean distance and the correlation are defined as follows:

d = \sum_{i=1}^{D} |u_i - v_i|^2, \qquad \rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}}    (1)

In practice, it appears that the correlation is more often used than the distance, partly because |ρ| is nicely normalized within 0 and 1. In fact, in this study, we will assume that the marginal l2 norms \sum_{i=1}^{D} u_i^2 and \sum_{i=1}^{D} v_i^2 are known. This is a reasonable assumption: computing the marginal l2 norms only requires scanning the data once, which is needed anyway during the data collection process. In machine learning practice, it is common to first normalize the data (to have unit l2 norm) before feeding the data to classification (e.g., SVM) or clustering (e.g., K-means) algorithms. For convenience, throughout this paper, we assume unit l2 norms, i.e.,

\rho = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\,\sqrt{\sum_{i=1}^{D} v_i^2}} = \sum_{i=1}^{D} u_i v_i, \quad \text{where } \sum_{i=1}^{D} u_i^2 = \sum_{i=1}^{D} v_i^2 = 1    (2)

1.1 Random Projections

As an effective tool for dimensionality reduction, the idea of random projections is to multiply the data, e.g., u, v ∈ R^D, with a random normal projection matrix R ∈ R^{D×k} (where k ≪ D), to generate

x = u \times R \in \mathbb{R}^k, \qquad y = v \times R \in \mathbb{R}^k, \qquad R = \{r_{ij}\},\ i = 1, \ldots, D,\ j = 1, \ldots, k, \qquad r_{ij} \sim N(0,1) \text{ i.i.d.}    (3)

The method of random projections has become popular for large-scale machine learning applications such as classification, regression, matrix factorization, singular value decomposition, near neighbor search, etc. The potential benefits of coding with a small number of bits arise because the (uncoded) projected data, x_j = \sum_{i=1}^{D} u_i r_{ij} and y_j = \sum_{i=1}^{D} v_i r_{ij}, being real-valued numbers, are neither convenient/economical for storage and transmission, nor well-suited for indexing. The focus of this paper is on approximate (sublinear time) near neighbor search in the framework of locality sensitive hashing [3]. In particular, we will compare two coding (quantization) schemes of random projections [1, 4] in the context of near neighbor search.
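To make (3) concrete, the following minimal sketch (Python with NumPy; not part of the original paper) generates a Gaussian projection matrix and checks that each pair of projected coordinates (x_j, y_j) inherits the correlation ρ = ⟨u, v⟩ of the unit-normed inputs. The dimensions D and k, and the way v is constructed, are illustrative choices.

```python
import numpy as np

# Minimal sketch of Gaussian random projections, following eq. (3).
# D, k, and the construction of u, v are illustrative, not the paper's.
rng = np.random.default_rng(0)
D, k = 10000, 64

u = rng.standard_normal(D)
u /= np.linalg.norm(u)                       # unit l2 norm, as in eq. (2)
v = 0.9 * u + 0.45 * rng.standard_normal(D)
v /= np.linalg.norm(v)

R = rng.standard_normal((D, k))              # r_ij ~ N(0, 1) i.i.d.
x = u @ R                                    # x = u x R, a vector in R^k
y = v @ R                                    # y = v x R

# Each pair (x_j, y_j) is bivariate normal with correlation rho = <u, v>.
print("rho =", round(float(u @ v), 3),
      "  empirical corr of (x, y) =", round(float(np.corrcoef(x, y)[0, 1]), 3))
```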

1.2 Uniform Quantization

The recent work [4] proposed an intuitive coding scheme based on simple uniform quantization:

h_w^{(j)}(u) = \lfloor x_j / w \rfloor, \qquad h_w^{(j)}(v) = \lfloor y_j / w \rfloor    (4)

where w > 0 is the bin width and ⌊·⌋ is the standard floor operation. The following theorem is proved in [4] about the collision probability Pw = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right).

Theorem 1

P_w = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right) = 2\sum_{i=0}^{\infty}\int_{iw}^{(i+1)w} \phi(z)\left[\Phi\left(\frac{(i+1)w - \rho z}{\sqrt{1-\rho^2}}\right) - \Phi\left(\frac{iw - \rho z}{\sqrt{1-\rho^2}}\right)\right] dz    (5)

In addition, Pw is a monotonically increasing function of ρ.

The fact that Pw is monotonically increasing in ρ makes (4) a suitable coding scheme for approximate near neighbor search in the general framework of locality sensitive hashing (LSH).

1.3 Uniform Quantization with Random Offset

[1] proposed the following well-known coding scheme, which uses windows and a random offset:

h_{w,q}^{(j)}(u) = \left\lfloor \frac{x_j + q_j}{w} \right\rfloor, \qquad h_{w,q}^{(j)}(v) = \left\lfloor \frac{y_j + q_j}{w} \right\rfloor    (6)

where q_j \sim \mathrm{uniform}(0, w). [1] showed that the collision probability can be written as

P_{w,q} = \Pr\left(h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v)\right) = \int_0^w \frac{1}{\sqrt{d}}\, 2\phi\left(\frac{t}{\sqrt{d}}\right)\left(1 - \frac{t}{w}\right) dt    (7)

where d = ||u − v||^2 = 2(1 − ρ) is the squared Euclidean distance between u and v. Compared with (6), the scheme (4) does not use the additional randomization with q ∼ uniform(0, w) (i.e., the offset). [4] elaborated the following advantages of (4) in the context of similarity estimation:

1. Operationally, hw is simpler than hw,q.
2. With a fixed w, hw is always more accurate than hw,q, often significantly so.
3. For each coding scheme, one can separately find the optimum bin width w. The optimized hw is also more accurate than the optimized hw,q, often significantly so.
4. hw requires a smaller number of bits than hw,q.

In this paper, we will compare hw,q with hw in the context of sublinear time near neighbor search. A minimal code sketch of the two schemes is given below.
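Specifically, the sketch below (Python with NumPy; not from the paper, with k and w chosen arbitrarily for illustration) implements hw of (4) and hw,q of (6) and compares their empirical collision rates on a pair of correlated projected vectors. The resulting rates can be compared qualitatively against the curves in Figure 1.

```python
import numpy as np

# Minimal sketch of the two coding schemes: h_w (eq. (4), plain uniform
# quantization) and h_wq (eq. (6), uniform quantization with a random
# offset q_j ~ uniform(0, w)). k and w are arbitrary illustrative values.

def h_w(x, w):
    """Uniform quantization: floor(x_j / w), applied elementwise."""
    return np.floor(x / w).astype(np.int64)

def h_wq(x, q, w):
    """Uniform quantization with random offset: floor((x_j + q_j) / w)."""
    return np.floor((x + q) / w).astype(np.int64)

rng = np.random.default_rng(0)
k, w, rho = 10000, 2.0, 0.5
x = rng.standard_normal(k)                                 # plays the role of x = uR
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(k)   # correlation rho with x
q = rng.uniform(0.0, w, size=k)                            # one offset per projection

# Empirical collision rates across the k projections.
print("h_w   collision rate:", np.mean(h_w(x, w) == h_w(y, w)))
print("h_w,q collision rate:", np.mean(h_wq(x, q, w) == h_wq(y, q, w)))
```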

1.4 Sublinear Time c-Approximate Near Neighbor Search

Consider a data vector u. Suppose there exists another vector whose Euclidean distance (√d) from u is at most √d0 (the target distance). The goal of c-approximate √d0-near neighbor algorithms is to return data vectors (with high probability) whose Euclidean distances from u are at most c × √d0 with c > 1.

Recall that, in our definition, d = 2(1 − ρ) is the squared Euclidean distance. To be consistent with [1], we present the results in terms of √d. Corresponding to the target distance √d0, the target similarity ρ0 can be computed from d0 = 2(1 − ρ0), i.e., ρ0 = 1 − d0/2. To simplify the presentation, we focus on ρ ≥ 0 (as is common in practice), i.e., 0 ≤ d ≤ 2. Once we fix a target similarity ρ0, c cannot exceed a certain value:

c\sqrt{2(1-\rho_0)} \le \sqrt{2} \;\Longrightarrow\; c \le \sqrt{\frac{1}{1-\rho_0}}    (8)

For example, when ρ0 = 0.5, we must have 1 ≤ c ≤ √2.


Under the general framework, the performance of an LSH algorithm largely depends on the difference (gap) between the two collision probabilities P^{(1)} and P^{(2)} (respectively corresponding to √d0 and c√d0):

P_w^{(1)} = \Pr\left(h_w(u) = h_w(v)\right) \quad \text{when } d = ||u - v||_2^2 = d_0    (9)

P_w^{(2)} = \Pr\left(h_w(u) = h_w(v)\right) \quad \text{when } d = ||u - v||_2^2 = c^2 d_0    (10)

Corresponding to hw,q, the collision probabilities P_{w,q}^{(1)} and P_{w,q}^{(2)} are analogously defined. A larger difference between P^{(1)} and P^{(2)} implies a more efficient LSH algorithm. The following "G" values (Gw for hw and Gw,q for hw,q) characterize the gaps:

G_w = \frac{\log 1/P_w^{(1)}}{\log 1/P_w^{(2)}}, \qquad G_{w,q} = \frac{\log 1/P_{w,q}^{(1)}}{\log 1/P_{w,q}^{(2)}}    (11)

A smaller G (i.e., a larger difference between P^{(1)} and P^{(2)}) leads to a potentially more efficient LSH algorithm, and G < 1/c is particularly desirable [3]. The general theory says the query time for c-approximate d0-near neighbor search is dominated by O(N^G) distance evaluations, where N is the total number of data vectors in the collection. This is better than O(N), the cost of a linear scan.

2 Comparison of the Collision Probabilities

To help understand the intuition why hw may lead to better performance than hw,q, in this section we examine their collision probabilities Pw and Pw,q, which can be expressed in terms of the standard normal pdf and cdf functions, \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} and \Phi(x) = \int_{-\infty}^{x} \phi(z)\,dz:

P_{w,q} = \Pr\left(h_{w,q}^{(j)}(u) = h_{w,q}^{(j)}(v)\right) = 2\Phi\left(\frac{w}{\sqrt{d}}\right) - 1 - \frac{2}{\sqrt{2\pi}\, w/\sqrt{d}} + \frac{2}{w/\sqrt{d}}\,\phi\left(\frac{w}{\sqrt{d}}\right)    (12)

P_w = \Pr\left(h_w^{(j)}(u) = h_w^{(j)}(v)\right) = 2\sum_{i=0}^{\infty}\int_{iw}^{(i+1)w} \phi(z)\left[\Phi\left(\frac{(i+1)w - \rho z}{\sqrt{1-\rho^2}}\right) - \Phi\left(\frac{iw - \rho z}{\sqrt{1-\rho^2}}\right)\right] dz    (13)

It is clear that Pw,q → 1 as w → ∞. Figure 1 plots both Pw and Pw,q for selected ρ values. The difference between Pw and Pw,q becomes apparent when w is not small. For example, when ρ = 0, Pw quickly approaches the limit 0.5 while Pw,q keeps increasing (to 1) as w increases. Intuitively, the fact that Pw,q → 1 when ρ = 0 is undesirable because it means two orthogonal vectors will have the same coded value. Thus, it is not surprising that hw will have better performance than hw,q, for both similarity estimation and sublinear time near neighbor search.
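The curves in Figure 1 can be reproduced numerically. Below is a minimal sketch (Python with NumPy/SciPy; not part of the paper) that evaluates Pw,q via the closed form (12) and Pw by truncating the infinite series in (13) and integrating each term with scipy.integrate.quad. The probe values of ρ and w are arbitrary.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Minimal sketch for evaluating the collision probabilities of eqs. (12)
# and (13). P_wq uses the closed form; P_w truncates the series once the
# terms become negligible. The probe values of rho and w are arbitrary.

def P_wq(rho, w):
    d = 2.0 * (1.0 - rho)                    # squared Euclidean distance
    s = w / np.sqrt(d)
    return (2.0 * norm.cdf(s) - 1.0
            - 2.0 / (np.sqrt(2.0 * np.pi) * s) + (2.0 / s) * norm.pdf(s))

def P_w(rho, w, max_terms=100):
    r = np.sqrt(1.0 - rho ** 2)
    total = 0.0
    for i in range(max_terms):
        integrand = lambda z: norm.pdf(z) * (
            norm.cdf(((i + 1) * w - rho * z) / r) - norm.cdf((i * w - rho * z) / r))
        term, _ = quad(integrand, i * w, (i + 1) * w)
        total += term
        if term < 1e-12:                     # Gaussian tail: later terms vanish
            break
    return 2.0 * total

for rho in (0.0, 0.5, 0.9):
    for w in (1.0, 2.0, 4.0):
        print(f"rho={rho:.2f} w={w:.1f}  P_w={P_w(rho, w):.4f}  P_wq={P_wq(rho, w):.4f}")
```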

Figure 1: Collision probabilities, Pw and Pw,q, for ρ = 0, 0.25, 0.5, 0.75, 0.9, and 0.99. The scheme hw has smaller collision probabilities than the scheme hw,q [1], especially when w > 2.

3 Theoretical Comparison of the Gaps

Figure 2 compares Gw with Gw,q at their "optimum" w values, as functions of c, for a wide range of target similarity levels ρ0. Basically, at each c and ρ0, we choose the w to minimize Gw and the w to minimize Gw,q. This figure illustrates that Gw is smaller than Gw,q, noticeably so in the low similarity region.

Figure 3, Figure 4, Figure 5, and Figure 6 present Gw and Gw,q as functions of w, for ρ0 = 0.99, ρ0 = 0.95, ρ0 = 0.9, and ρ0 = 0.5, respectively. In each figure, we plot the curves for a wide range of c values. These figures illustrate where the optimum w values are obtained. Clearly, in the high similarity region, the smallest G values are obtained at low w values, especially at small c. In the low (or moderate) similarity region, the smallest G values are usually attained at relatively large w.

In practice, we normally have to pre-specify a w, for all c and ρ0 values. In other words, the "optimum" G values presented in Figure 2 are in general not attainable. Therefore, Figure 7, Figure 8, Figure 9, and Figure 10 present Gw and Gw,q as functions of c, for ρ0 = 0.99, ρ0 = 0.95, ρ0 = 0.9, and ρ0 = 0.5, respectively. In each figure, we plot the curves for a wide range of w values. These figures again confirm that Gw is smaller than Gw,q.

Figure 2: Comparison of the optimum gaps (smaller is better) for hw and hw,q. For each ρ0 and c, we find the smallest gaps individually for hw and hw,q over the entire range of w. For all target similarity levels ρ0, both hw,q and hw exhibit better performance than 1/c. hw always has a smaller gap than hw,q, although in the high similarity region both schemes perform similarly.

Figure 3: The gaps Gw and Gw,q as functions of w, for ρ0 = 0.99. In each panel, we plot both Gw and Gw,q for a particular c value. The plots illustrate where the optimum w values are obtained.

Figure 4: The gaps Gw and Gw,q as functions of w, for ρ0 = 0.95 and a range of c values.

Figure 5: The gaps Gw and Gw,q as functions of w, for ρ0 = 0.9 and a range of c values.

Figure 6: The gaps Gw and Gw,q as functions of w, for ρ0 = 0.5 and a range of c values.

Figure 7: The gaps Gw and Gw,q as functions of c, for ρ0 = 0.99. In each panel, we plot both Gw and Gw,q for a particular w value.

Figure 8: The gaps Gw and Gw,q as functions of c, for ρ0 = 0.95. In each panel, we plot both Gw and Gw,q for a particular w value.

Figure 9: The gaps Gw and Gw,q as functions of c, for ρ0 = 0.9. In each panel, we plot both Gw and Gw,q for a particular w value.

Figure 10: The gaps Gw and Gw,q as functions of c, for ρ0 = 0.5. In each panel, we plot both Gw and Gw,q for a particular w value.

4 Optimal Gaps

To view the optimal gaps more clearly, Figure 11 and Figure 12 plot the best gaps (left panels) and the optimal w values (right panels) at which the best gaps are attained, for selected values of c and the entire range of ρ. The results can be summarized as follows:

• At any ρ and c, the optimal gap Gw,q is always at least as large as the optimal gap Gw. At relatively low similarities, the optimal Gw,q can be substantially larger than the optimal Gw.

• When the target similarity level ρ is high (e.g., ρ > 0.85), for both schemes hw and hw,q, the optimal w values are relatively low, for example, w = 1 ∼ 1.5 when 0.85 < ρ < 0.9. In this region, both hw,q and hw behave similarly.

• When the target similarity level ρ is not so high, for hw it is best to use a large value of w, in particular w ≥ 2 ∼ 3. In comparison, for hw,q, the optimal w values grow smoothly with decreasing ρ.

These plots again confirm the previous comparisons: (i) we should always replace hw,q with hw; (ii) if we use hw and target a very high similarity, a good choice of w might be w = 1 ∼ 1.5; (iii) if we use hw and the target similarity is not too high, then we can safely use w = 2 ∼ 3. We should also mention that, although the optimal w values for hw appear to exhibit a "jump" in the right panels of Figure 11 and Figure 12, the choice of w does not influence the performance much, as shown in the previous plots. In Figures 3 to 6, we have seen that even when the optimal w appears to approach "∞", the actual gaps differ little between w = 3 and w ≫ 3. In the real-data evaluations in the next section, we will see the same phenomenon for hw. Note that the Gaussian density decays rapidly at the tail, for example, 1 − Φ(6) = 9.9 × 10^{−10}. If we choose w = 1.5, or 2, or 3, then we only need a small number of bits to code each hashed value.
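As a complement to Figures 11 and 12, the following sketch (Python with NumPy/SciPy; not the authors' code) performs a simple grid search over the bin width w to find the optimal gap Gw for a given target similarity ρ0 and approximation factor c, using the collision probability of (13). The grid and the example (ρ0, c) pairs are arbitrary choices, and Pw is re-derived here so that the snippet stands alone.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Minimal sketch of the grid search behind Figures 11-12 for the h_w
# scheme: scan bin widths w and keep the one minimizing
# G_w = log(1/P_w^(1)) / log(1/P_w^(2)), cf. eq. (11).

def P_w(rho, w, max_terms=100):
    """Collision probability of eq. (13), with the series truncated."""
    r = np.sqrt(1.0 - rho ** 2)
    total = 0.0
    for i in range(max_terms):
        integrand = lambda z: norm.pdf(z) * (
            norm.cdf(((i + 1) * w - rho * z) / r) - norm.cdf((i * w - rho * z) / r))
        term, _ = quad(integrand, i * w, (i + 1) * w)
        total += term
        if term < 1e-12:
            break
    return 2.0 * total

def optimal_w(rho0, c, grid=np.arange(0.25, 6.01, 0.25)):
    rho_c = 1.0 - c * c * (1.0 - rho0)       # similarity at distance c * sqrt(d0)
    gaps = [np.log(1.0 / P_w(rho0, w)) / np.log(1.0 / P_w(rho_c, w)) for w in grid]
    j = int(np.argmin(gaps))
    return grid[j], gaps[j]

print("rho0=0.9, c=1.5:", optimal_w(0.9, 1.5))   # high similarity: expect a small optimal w
print("rho0=0.5, c=1.3:", optimal_w(0.5, 1.3))   # moderate similarity: expect a larger optimal w
```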

Figure 11: Left panels: the optimal (smallest) gaps at given c values over the entire range of ρ. We can see that Gw,q is always larger than Gw, confirming that it is better to use hw instead of hw,q. Right panels: the optimal values of w at which the optimal gaps are attained. When the target similarity ρ is very high, it is best to use a relatively small w. When the target similarity is not that high, if we use hw, it is best to use w > 3.

Figure 12: Left panels: the optimal (smallest) gaps at given c values over the entire range of ρ. We can see that Gw,q is always larger than Gw, confirming that it is better to use hw instead of hw,q. Right panels: the optimal values of w at which the optimal gaps are attained. When the target similarity ρ is very high, it is best to use a relatively small w. When the target similarity is not that high, if we use hw, it is best to use w > 3.

5 An Experimental Study

Two datasets, Peekaboom and Youtube, are used in our experiments to validate the theoretical results. Peekaboom is a standard image retrieval dataset, which is divided into two subsets, one with 1998 data points and another with 55599 data points. We use the larger subset for building hash tables and the smaller subset for query data points. The reported experimental results are averaged over all query data points. Available in the UCI repository, Youtube is a multi-view dataset. For simplicity, we only use the largest set of audio features. The original training set, with 97934 data points, is used for building hash tables. 5000 data points, randomly selected from the original test set, are used as query data points.

We use the standard (K, L)-LSH implementation [3]. We generate K × L independent hash functions h_{i,j}, i = 1 to K, j = 1 to L. For each hash table j, j = 1 to L, we concatenate K hash functions ⟨h_{1,j}, h_{2,j}, h_{3,j}, ..., h_{K,j}⟩. For each data point, we compute the hash values and place the point (in fact, its pointer) into the appropriate bucket of hash table j. In the query phase, we compute the hash values of the query data point using the same hash functions to find the bucket to which the query data point belongs, and we only search for near neighbors among the data points in that bucket of hash table j. We repeat the process for each hash table, and the final retrieved data points are the union of the retrieved data points over all the hash tables. Ideally, the number of retrieved data points will be substantially smaller than the total number of data points. We use the term fraction retrieved to indicate the ratio of the number of retrieved data points over the total number of data points. A smaller value of fraction retrieved is more desirable. A minimal code sketch of this table-building and querying procedure is given at the end of this section.

To thoroughly evaluate the two coding schemes, we conduct extensive experiments on the two datasets, using many combinations of K (from 3 to 40) and L (from 1 to 200). At each choice of (K, L), we vary w from 0.5 to 5. Thus, the total number of combinations is large, and the experiments are very time-consuming.

There are many ways to evaluate the performance of an LSH scheme. We could specify a threshold of similarity and only count the retrieved data points whose (exact) similarity is above the threshold as "true positives". To avoid specifying a threshold, and considering that in practice people often would like to retrieve the top-T nearest neighbors, we take a simple approach and compute the recall based on the top-T neighbors. For example, suppose the number of retrieved data points is 120, among which 70 data points belong to the top-T. Then the recall value would be 70/T = 70% if T = 100. Ideally, we hope the recalls are as high as possible while the fraction retrieved is kept as low as possible.

Figure 13 presents the results on Youtube for T = 100 and target recalls from 0.1 to 0.99. In every panel, we set a target recall threshold. At every bin width w, we find the smallest fraction retrieved over a wide range of LSH parameters, K and L. Note that, if the target recall is high (e.g., 0.95), we basically have to effectively lower the target threshold ρ, so that we do not have to go down the re-ranked list too far. The plots show that, for high target recalls, we need to use relatively large w (e.g., w ≥ 2 ∼ 3), and for low target recalls, we should use a relatively small w (e.g., w = 1.5). Figures 14 to 18 present similar results on the Youtube dataset for T = 50, 20, 10, 5, 3. We only include plots with relatively high recalls, which are often more useful in practice. Figures 19 to 24 present the results on the Peekaboom dataset, which are very similar to the results on the Youtube dataset.

These plots confirm the previous theoretical analysis: (i) it is essentially always better to use hw instead of hw,q, i.e., the random offset is not needed; (ii) when using hw and the target recall is high (which essentially means the target similarity is low), it is better to use a relatively large w (e.g., w = 2 ∼ 3); (iii) when using hw and the target recall is low, it is better to use a smaller w (e.g., w = 1.5); (iv) when using hw, the influence of w is not that large as long as w is in a reasonable range, which is important in practice.
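The sketch below (Python with NumPy; not the code used for the experiments) illustrates the (K, L)-LSH table construction and query procedure described above, using the hw coding scheme. The dataset and the values of D, n, K, L, and w are synthetic placeholders rather than the paper's settings.

```python
import numpy as np
from collections import defaultdict

# Minimal sketch of the standard (K, L)-LSH procedure with the h_w coding
# scheme. Table j (j = 0..L-1) uses its own K projections; a bucket key is
# the K-tuple of codes floor(x_j / w). All sizes here are synthetic.

rng = np.random.default_rng(0)
D, n, K, L, w = 256, 10000, 10, 20, 1.5

data = rng.standard_normal((n, D))
data /= np.linalg.norm(data, axis=1, keepdims=True)      # unit l2 norm

R = rng.standard_normal((D, K * L))                      # K*L independent projections

def codes(X):
    """Integer codes of shape (len(X), L, K): table j uses columns j*K:(j+1)*K."""
    return np.floor((X @ R) / w).astype(np.int64).reshape(len(X), L, K)

tables = [defaultdict(list) for _ in range(L)]
for idx, c in enumerate(codes(data)):
    for j in range(L):
        tables[j][tuple(c[j])].append(idx)               # place pointer in bucket

def query(q):
    """Union over the L tables of the bucket the query falls into."""
    c = codes(q[None, :])[0]
    retrieved = set()
    for j in range(L):
        retrieved.update(tables[j].get(tuple(c[j]), []))
    return retrieved

# A query near data[0]; "fraction retrieved" as defined above.
q = data[0] + 0.05 * rng.standard_normal(D)
q /= np.linalg.norm(q)
print("fraction retrieved:", len(query(q)) / n)
```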

Figure 13: Youtube Top 100. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-100) with respect to w for both coding schemes hw and hw,q. Lower is better.

Figure 14: Youtube Top 50. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-50) with respect to w for both coding schemes hw and hw,q.

Figure 15: Youtube Top 20. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-20) with respect to w for both coding schemes hw and hw,q.

Figure 16: Youtube Top 10. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-10) with respect to w for both coding schemes hw and hw,q.

Figure 17: Youtube Top 5. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-5) with respect to w for both coding schemes hw and hw,q.

Figure 18: Youtube Top 3. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-3) with respect to w for both coding schemes hw and hw,q.

Figure 19: Peekaboom Top 100. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-100) with respect to w for both coding schemes hw and hw,q.

Figure 20: Peekaboom Top 50. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-50) with respect to w for both coding schemes hw and hw,q.

Figure 21: Peekaboom Top 20. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-20) with respect to w for both coding schemes hw and hw,q.

Figure 22: Peekaboom Top 10. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-10) with respect to w for both coding schemes hw and hw,q.

Figure 23: Peekaboom Top 5. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-5) with respect to w for both coding schemes hw and hw,q.

Figure 24: Peekaboom Top 3. In each panel, we plot the optimal fraction retrieved at a target recall value (for top-3) with respect to w for both coding schemes hw and hw,q.

6 Conclusion

We have compared two quantization (coding) schemes for random projections in the context of sublinear time approximate near neighbor search. The recently proposed scheme based on uniform quantization [4] is simpler than the influential existing work [1] (which uses uniform quantization with a random offset). Our analysis confirms that, under the general theory of LSH, the scheme of [4] is simpler and more accurate than that of [1]. In other words, the random offset step in [1] is not needed and may hurt the performance. Our analysis also provides practical guidelines for using the proposed coding scheme to build hash tables. Our recommendation is to use a bin width of about w = 1.5 when the target similarity is high and a bin width of about w = 3 when the target similarity is not that high. In addition, with the proposed coding scheme based on uniform quantization (without the random offset), the performance is not very sensitive to the choice of w, which makes it very convenient in practical applications.

References

[1] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, Brooklyn, NY, 2004.

[2] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

[3] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[4] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections. Technical report, arXiv:1308.2218, 2013.

[5] Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava. Coding for random projections and nonlinear estimators. Technical report, 2014.
