Fast Cross-Polytope Locality-Sensitive Hashing

Christopher Kennedy∗ and Rachel Ward†
Department of Mathematics, University of Texas at Austin

March 21, 2016
Abstract

We provide a variant of cross-polytope locality-sensitive hashing with respect to angular distance which is both optimal in asymptotic sensitivity and provably fast. Precisely, we replace the random rotation on S^{d−1} in the standard cross-polytope scheme with a fast Johnson-Lindenstrauss transform followed by a lifted rotation, reducing the time for hash computations from O(d^2) to O(d ln d). This builds on a recent result in [AIL+15] by providing an LSH scheme for angular distance which is not only optimal in asymptotic sensitivity, but also fast. Finally, we present a discretized version of the scheme which reduces the number of random bits to O(ln^9 d) while still retaining asymptotic optimality and efficient hash computations.
1 Introduction
In recent years, the nearest neighbor search problem has become important in a wide array of applications including data compression, information retrieval, image storage, computer vision, and pattern recognition. Nearest neighbor search (NN) can be stated as follows: given a metric space (X, D) and a set of points P = {x_1, ..., x_n} ⊂ X, for a query point x ∈ P find y = argmin_{x_i ∈ P \ {x}} D(x_i, x). In high dimensions, it is known that no algorithm can perform better than a linear search (see [WSB]); that is, for a query point x ∈ P, any algorithm for NN must essentially compute the distances between x and each point in P \ {x}. One way to improve on linear search is to relax the problem and instead solve (R, c) nearest neighbor search ((R, c)-NN): given a query point x ∈ P and the assurance of a point y′ ∈ P such that D(y′, x) < R, find y ∈ P such that D(y, x) < cR. In particular, it is known that approximate nearest neighbor search can be solved in sublinear time using locality-sensitive hashing (LSH). The idea in LSH is to specify a function from the domain X to a discrete set of hash values (a so-called hash function) which sends closer points to the same hash value with higher probability than points which are far apart. Then, for a set of points P = {x_1, ..., x_n} ⊂ X and a query point x ∈ P, we search within the query point's hash bucket for a nearest neighbor. To quantify the notion of how sensitive a hash function is, we let p_1 be the probability of collision for points that are distance R from each other and p_2 the probability of collision for points that are distance cR from each other (for c > 1).

∗ Email: [email protected]. C. Kennedy was supported in part by R. Ward's NSF CAREER grant and an AFOSR Young Investigator Award.
† Email: [email protected]. R. Ward was supported in part by an NSF CAREER grant and an AFOSR Young Investigator Award.
Note that, ideally, for any fixed radius R we want p_2 to decrease rapidly as a function of c. For this reason, we study the parameter ρ = ln(p_1^{−1})/ln(p_2^{−1}). In the case of the unit sphere endowed with the Euclidean distance,
spherical LSH ([AINR14], [AR15a]) has been shown to satisfy the optimal asymptotic sensitivity ρ = 1/c^2 [OWZ14]; however, the corresponding hash functions are not practical to compute. The work [AIL+15] showed the existence of an LSH scheme with both optimal sensitivity and hash functions which are practical to implement; namely, the cross-polytope LSH scheme which had been previously proposed in [TT07] (see also [AR15b], [OWZ14], [MNP06]). Still, the standard cross-polytope construction in [AIL+15] requires O(d^2) computations per hash, whereas in practice one would like a scheme with O(d ln d) computations.

Our contribution. The goal of this paper is to provide a hash function with optimal sensitivity parameter ρ that is computable in time O(d ln d), where d is the dimension of the data. In particular, we introduce a variant of cross-polytope LSH which first embeds the points into polylogarithmically many dimensions via a fast Johnson-Lindenstrauss transform before lifting back up to S^{d−1} and hashing. This is, to our knowledge, the first hashing scheme over the unit sphere with Euclidean metric to provably achieve the optimal sensitivity parameter ρ with efficient hash computations. Additionally, by discretizing our hashing matrix, we extend this result to a scheme that uses only O(ln^9 d) bits of randomness while still achieving the asymptotically optimal ρ.

Related work. Many of our results hinge on the careful analysis of collision probabilities for the cross-polytope LSH scheme given in [AIL+15]. Additionally, various ways to reduce the runtime of cross-polytope LSH, specifically using fast, structured projection matrices, are mentioned in [BL15]. Johnson-Lindenstrauss transforms have previously been used in many approximate nearest neighbor algorithms (see [IM98], [LMYG04], [AC09], [Osi11], [AINR14], and [DKS11], to name a few), primarily as a preprocessing step to speed up computations that have some dependence on the dimension. LSH with p-stable distributions, as introduced in [IDIM04], uses a random projection onto a single dimension; this was later generalized in [AI08] to random projection onto o(ln d) dimensions, with the latter having optimal exponent ρ = 1/c^2 + O(ln(ln d)/ln^{1/3} d). We note that our scheme uses dimension reduction slightly differently, as an intermediate step before lifting the vectors back up to a different dimension. Similar dimension reduction techniques have been used in [LJC+13], where the data is sparsified and then a random projection matrix is applied. The authors exploit the fact that the random projection matrix has the restricted isometry property, which preserves pairwise distances between any two sparse vectors. This result is notable in that the reduced dimension has no dependence on n, the number of points. See Section 5 for more discussion.
2 Notation
We now establish notation that will be used in the remainder. O_R(f(d)) means O_R(f(d)) = O(f(d)g(R)) for some finite-valued function g : (0, 2) → R. The expression o(1) denotes a quantity such that lim_{d→∞} o(1) = 0. H ∈ R^{d×d} is the Hadamard matrix. D_b ∈ R^{d×d} is a diagonal matrix whose entries are i.i.d. Rademacher variables. For a matrix M ∈ R^{d×d}, M_S will denote the restriction of M to its rows indexed by the set S ⊂ {1, ..., d}. The variable G will always denote a matrix with i.i.d. standard Gaussian entries, where the matrix may vary in size. The variable Ĝ will always denote a matrix with i.i.d. copies of a discrete random variable X which roughly models a Gaussian, details of which can be found in the appendix. The variables C_i will denote constants that are bounded independent of the dimension. We will use m to denote the projected dimension of our points, where m ≪ d, and d′ the lifted dimension, where m ≤ d′ ≤ d.
3 Preliminaries

3.1 Locality Sensitive Hashing
We fix our space X = S^{d−1} endowed with the Euclidean metric. First, we define an explicit notion of what it means for a hash family to be sensitive.

Definition 1 For r_1 ≤ r_2 and p_2 ≤ p_1, a hash family H is (r_1, r_2, p_1, p_2)-sensitive if for all x, y ∈ S^{d−1},
• If ||x − y||_2 ≤ r_1, then Pr_H[h(x) = h(y)] ≥ p_1.
• If ||x − y||_2 ≥ r_2, then Pr_H[h(x) = h(y)] ≤ p_2.

We primarily care about the case where r_1 = R and r_2 = cR, in which case we study the parameter

    ρ = ln(p_1^{−1}) / ln(p_2^{−1})    (1)

as a measure of sensitivity for our hashing scheme. For a hash function with sensitivity ρ, there exists an algorithm that, with constant probability, solves (R, c)-NN with query time O(d n^ρ), where n is the number of points to be hashed (see Theorem 1 in [IDIM04]). The optimal sensitivity is ρ = 1/c^2; note that as ρ decreases (or as c increases), we get more improvement in query time compared to linear search.
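To make the effect of ρ concrete, here is a back-of-the-envelope comparison (our illustration; the figure n = 10^8 is arbitrary):

```latex
% With approximation factor c = 2 and the optimal exponent rho = 1/c^2 = 1/4,
% a database of n = 10^8 points gives a query cost of roughly
%     d * n^rho = d * (10^8)^(1/4) = 100 d
% distance computations, versus d * n = 10^8 d for a linear scan.
\[
  \rho = \frac{1}{c^2} = \frac{1}{4}, \qquad
  d\,n^{\rho} = d\,(10^8)^{1/4} = 100\,d \;\ll\; d\,n = 10^8\,d .
\]
```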
3.2 Cross-Polytope LSH
The cross-polytope LSH scheme (which is actually a special case of the spherical LSH algorithm from [TT07]) is of interest because it is simple and achieves the asymptotically optimal sensitivity. Given a Gaussian matrix G ∈ R^{d×d} with i.i.d. N(0, 1) entries, the cross-polytope hash of a point x ∈ S^{d−1} is defined as

    h(x) = argmin_{u ∈ {±e_i}} || Gx/||Gx||_2 − u ||_2,    (2)

where {e_i}_{i=1}^d is the standard basis for R^d.
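For concreteness, here is a minimal numpy sketch of (2) (our illustration, not code from the paper). It uses the fact that the argmin over u ∈ {±e_i} of ||Gx/||Gx||_2 − u||_2 is attained at the signed coordinate of Gx with the largest absolute value, so the normalization can be skipped:

```python
import numpy as np

def cross_polytope_hash(x, G):
    """Cross-polytope hash of x under the Gaussian rotation G, as in eq. (2).

    Returns a signed index in {+-1, ..., +-d} encoding which vertex +-e_j
    of the cross-polytope is closest to Gx / ||Gx||_2.
    """
    v = G @ x
    j = int(np.argmax(np.abs(v)))        # axis closest to Gx
    return (j + 1) * (1 if v[j] >= 0 else -1)

# Example usage on a random unit vector.
rng = np.random.default_rng(0)
d = 128
G = rng.standard_normal((d, d))          # i.i.d. N(0,1) entries
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(cross_polytope_hash(x, G))
```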
A recent paper of Andoni, Indyk, Laarhoven, Razenshteyn, and Schmidt [AIL+15] gives the following collision probability for cross-polytope LSH.

Proposition 2 (Theorem 1 in [AIL+15]) Suppose x, y ∈ S^{d−1} are such that ||x − y||_2 = R, with 0 < R < 2, and H is the hash family defined in (2). Then,

    ln( 1 / Pr_H[h(x) = h(y)] ) = (R^2 / (4 − R^2)) ln d + O_R(ln(ln d)).    (3)

Consequently,

    ρ = (1/c^2) · (4 − c^2 R^2)/(4 − R^2) + o(1).

3.3 Johnson-Lindenstrauss Transforms
Johnson-Lindenstrauss (JL) transforms have been used in a wide variety of applications to reduce the dimension of data sets while preserving pairwise distances according to a given metric. Formally, given
a finite metric space (X, || · ||) ⊂ R^d, a JL transform is a linear map Φ : R^d → R^m such that for all x ∈ X, (1 − δ)||x||_2 ≤ ||Φx||_2 ≤ (1 + δ)||x||_2, with m ≪ d close to the optimal scaling m = Cδ^{−2} ln(|X|) [JL84, Alo03, LN14]. In our case, we consider the unit sphere with the Euclidean metric, and our data set is simply the two points we want to hash, along with their difference. There has been a lot of recent work towards constructing fast JL transforms (to name a few, see [AC09, AL13, KW11]), many of which involve subsampling Hadamard or Fourier matrices. We consider a scheme where we first multiply x ∈ S^{d−1} by a diagonal matrix D_b ∈ R^{d×d} with i.i.d. Rademacher entries, followed by a subsampled Hadamard matrix H_S ∈ R^{m×d}. Here, S ⊂ {1, ..., d} is a random subset of size |S| = m. The following lemma combines a recent improvement on the restricted isometry property for partial Hadamard matrices [HR15] with a reduction from RIP to Johnson-Lindenstrauss transforms in [KW11]; we defer the proof to the appendix.

Lemma 3 Suppose γ > 0, x, y ∈ S^{d−1}, x̃ = H_S D_b x, ỹ = H_S D_b y, and H_S ∈ R^{m×d} is such that m = O(γ ln^4(d) ln^4(ln d)). Then with probability 1 − O(d^{−γ}),

    1 − 1/ln d ≤ ||x̃||_2^2 ≤ 1 + 1/ln d,    (4)
    1 − 1/ln d ≤ ||ỹ||_2^2 ≤ 1 + 1/ln d,    (5)
    (1 − 1/ln d) ||x − y||_2^2 ≤ ||x̃ − ỹ||_2^2 ≤ (1 + 1/ln d) ||x − y||_2^2.    (6)
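A compact Python sketch of the map x ↦ H_S D_b x (our illustration; we add a 1/√m rescaling so that the subsampled transform is roughly norm-preserving, a normalization the lemma leaves implicit):

```python
import numpy as np

def fwht(v):
    """In-place fast Walsh-Hadamard transform in O(d ln d) operations.
    The length of v must be a power of two; the transform is unnormalized."""
    d = len(v)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def fast_jl(x, signs, S):
    """Sketch of the transform x -> H_S D_b x from Lemma 3.

    signs : length-d vector of +-1 entries (the diagonal of D_b)
    S     : array of m row indices (the random subset S)
    """
    v = fwht((signs * x).astype(float))
    return v[S] / np.sqrt(len(S))        # rescale so norms are ~preserved

# Quick check that the embedded norm is close to 1.
rng = np.random.default_rng(1)
d, m = 1024, 64
signs = rng.choice([-1.0, 1.0], size=d)
S = rng.choice(d, size=m, replace=False)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(np.linalg.norm(fast_jl(x, signs, S)))   # roughly 1
```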
4 Main Results
In order to speed up the cross-polytope LSH algorithm, we first use a JL transform to embed our data into a lower dimension m ≪ d, then "lift" the points back into dimension d′ (m ≤ d′ ≤ d) while rotating them with a standard i.i.d. Gaussian matrix G ∈ R^{d′×m}. A simple but key observation is that computing Gx̃, where x̃ = H_S D_b x ∈ R^m, is equivalent to applying a d′ × d′ Gaussian matrix to the sparse vector [x̃; 0] ∈ R^{d′}; therefore, assuming the distortion of our points due to the JL transform is not too large, this fast scheme should behave the same as the original cross-polytope LSH scheme. For practical purposes it would be ideal to lift the dimension as little as possible, but this results in empirically higher collision probabilities when points are far apart (see Section 6). We can define the hash function for fast cross-polytope LSH as follows.
    h_F(x) = argmin_{u ∈ {±e_i}} || G(H_S D_b x)/||G(H_S D_b x)||_2 − u ||_2.    (7)
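Putting the pieces together (our sketch, reusing fast_jl from the snippet in Section 3.3; the parameter choices are illustrative):

```python
import numpy as np
# Assumes fwht and fast_jl from the previous snippet are in scope.

def fast_cross_polytope_hash(x, signs, S, G):
    """Sketch of eq. (7): reduce x to dimension m with the fast JL transform,
    lift/rotate with a d' x m Gaussian matrix G, and round to the closest +-e_i."""
    x_tilde = fast_jl(x, signs, S)       # x~ = H_S D_b x in R^m, O(d ln d)
    v = G @ x_tilde                      # lift to R^{d'}, O(d' m)
    j = int(np.argmax(np.abs(v)))
    return (j + 1) * (1 if v[j] >= 0 else -1)

rng = np.random.default_rng(2)
d, m, d_prime = 1024, 64, 256
signs = rng.choice([-1.0, 1.0], size=d)
S = rng.choice(d, size=m, replace=False)
G = rng.standard_normal((d_prime, m))
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(fast_cross_polytope_hash(x, signs, S, G))
```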
It is easy to see that for this scheme, the hash computation x ↦ h_F(x) takes O(d′m) time for the Gaussian matrix multiplication and O(d ln d) time for the JL transform. Table 1 contains the construction of the original cross-polytope LSH scheme, our fast cross-polytope scheme, and the discretized version which we will present later. We now formalize the above intuition about how this scheme behaves relative to cross-polytope LSH.

Theorem 4 Suppose H is the family of hash functions defined in (7) with the choice m = O(ln^5(d) ln^4(ln d)), and ρ is as defined in (1) for this particular family. Then we have

    ρ = (1/c^2) · (4 − c^2 R^2)/(4 − R^2) + o(1),

and this hashing scheme runs in time O(d ln d).
Table 1: Various LSH families and their corresponding hash functions.

    LSH Family                          Hash Function
    Cross-Polytope LSH                  h(x)   = argmin_{u ∈ {±e_i}} || Gx/||Gx||_2 − u ||_2,                      G ∈ R^{d×d}
    Fast Cross-Polytope LSH             h_F(x) = argmin_{u ∈ {±e_i}} || G(H_S D_b x)/||G(H_S D_b x)||_2 − u ||_2,  G ∈ R^{d′×m}
    Fast Discrete Cross-Polytope LSH    h_D(x) = argmin_{u ∈ {±e_i}} || Ĝ(Zx)/||Ĝ(Zx)||_2 − u ||_2,               Ĝ ∈ R^{d′×m}
A few remarks are in order.

• This sensitivity ρ matches that of standard cross-polytope LSH in Proposition 2 and, for R ∈ [0, R_0] with R_0 < 2, is optimal up to the factor (4 − c^2 R_0^2)/(4 − R_0^2).
• When hashing n points simultaneously, the embedded dimension m picks up a factor of ln(n). Assuming that n is polynomial in d, the result in Theorem 4 still holds simultaneously over all pairs of points.

In addition to creating a fast hashing scheme, one can reduce the amount of randomness involved. In particular, we show that a slight alteration of the scheme still achieves the optimal ρ-value while using only O(ln^9 d) bits of randomness. The idea is to replace the Gaussian matrix by a matrix of i.i.d. discrete random variables. Some care is required in tuning the size of this matrix so that the correct number of bits is achieved. As a consequence, the number of hash values for this scheme is of order O(m) (i.e., we lift up to a smaller dimension), which lowers performance in practice but does not affect the asymptotic sensitivity ρ. We additionally use a JL transform developed by Kane and Nelson [KN14] that uses only O(ln(d) ln(ln d)) bits of randomness. Specifically, the hash function for this scheme is
    h_D(x) = argmin_{u ∈ {±e_i}} || Ĝ(Zx)/||Ĝ(Zx)||_2 − u ||_2,

where Ĝ ∈ R^{d′×m} is a matrix with i.i.d. copies of a discrete random variable X which roughly models a Gaussian, and Z ∈ R^{m×d} is the JL transform constructed in [KN14]. Our analysis allows us to pick the threshold value d′ = m to minimize the number of random bits.
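For intuition, here is a simplified sparse-JL sketch in the spirit of the [KN14] construction (our illustration: each column gets s nonzero entries ±1/√s; the actual construction in [KN14] additionally derandomizes these choices down to O(ln(d) ln(ln d)) random bits, which we do not attempt here):

```python
import numpy as np

def sparse_jl_matrix(d, m, s, rng):
    """Simplified sparse JL matrix Z in R^{m x d}: each column has exactly s
    nonzero entries, placed in uniformly random rows and set to +-1/sqrt(s).
    (Illustrative stand-in for the derandomized transform of [KN14].)"""
    Z = np.zeros((m, d))
    for col in range(d):
        rows = rng.choice(m, size=s, replace=False)
        Z[rows, col] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return Z

# Embedded norms concentrate around 1 for unit vectors.
rng = np.random.default_rng(3)
d, m, s = 1024, 64, 8
Z = sparse_jl_matrix(d, m, s, rng)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(np.linalg.norm(Z @ x))   # roughly 1
```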
Theorem 5 There is a hash family H with O(ln^9 d) bits of randomness that achieves the bound

    ρ = (1/c^2) · (4 − c^2 R^2)/(4 − R^2) + o(1),

and runs in time O(d ln d).
5 Open Problems
Although we achieve a logarithmic number of bits of randomness in Theorem 5, there is no reason to believe this is optimal among all hash families. More generally, given a particular rate of convergence to the optimal asymptotic sensitivity, we would like to know the minimal number of required bits of randomness. We can state this as follows.

Problem 6 Given a rate of convergence f(d, R) such that for each 0 < R < 2, lim_{d→∞} f(d, R) = 0, find the minimal number of bits O_f(d) such that any hash family H over the sphere S^{d−1} with O_f(d) bits of randomness satisfies ρ ≥ 1/c^2 + f(d, R).

We note that in existing schemes, it seems unlikely that we can remove the dependence of f on R unless we impose a threshold value for R. There is a construction of practical use from [AIL+15] that sends x ↦ H D_b H D_b′ H D_b″ x before rounding to the largest entry in absolute value, where as above H ∈ R^{d×d} is a Hadamard matrix and D_b ∈ R^{d×d} is a diagonal matrix with i.i.d. Rademacher entries on the diagonal. This scheme uses exactly 3d bits of randomness and is easy to implement, hence it would be of interest to prove theoretical guarantees for it. A more practical question is, given a rate of convergence for ρ, how fast one can compute a hash family achieving this rate.

Problem 7 Given a rate of convergence f(d, R) as in Problem 6, find the hash family H over S^{d−1} such that ρ ≥ 1/c^2 + f(d, R) with the fastest hash computations.

It would be natural to extend our theoretical analysis to the case of hashing a collection of n points simultaneously. In this setting, the embedding dimension of the JL matrix would inherit an additive factor depending on ln(n). Inspired by the construction in [LJC+13], which first sparsifies the data and then exploits the restricted isometry property (which applies uniformly over all sparse vectors), one can aim for a construction that does not depend on the number of data points.
6 Numerical Experiments
To illustrate our theoretical results in the low-dimensional case, we ran Monte Carlo simulations to compare the collision probabilities for regular cross-polytope LSH as well as the fast and discrete versions for various values of the original and lifted dimension. We refer to [AIL+15] for an in-depth comparison of run times for cross-polytope LSH and other popular hashing schemes. The experiments were run with N = 20000 trials. The discretized scheme used 10 bits of randomness for each entry. The fast, discrete, and regular cross-polytope LSH schemes exhibit similar collision probabilities for small distances, with fast/discrete cross-polytope having marginally higher collision probabilities for larger distances. It is clear that as the lifted dimension decreases, the fast and discrete versions have higher collision probabilities at further distances, which decreases the sensitivity of the schemes.
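A minimal version of this experiment for the standard scheme can be reproduced along the following lines (our sketch; the trial count and distance grid are illustrative, and a fresh hash function is drawn per trial):

```python
import numpy as np

def cp_hash(x, G):
    """Cross-polytope hash (eq. (2)): signed index of the largest |(Gx)_i|."""
    v = G @ x
    j = int(np.argmax(np.abs(v)))
    return (j + 1) * (1 if v[j] >= 0 else -1)

def pair_at_distance(d, r, rng):
    """Unit vectors x, y in R^d with ||x - y||_2 exactly r (0 < r < 2)."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    z = rng.standard_normal(d)
    z -= (z @ x) * x                  # make z orthogonal to x
    z /= np.linalg.norm(z)
    t = 1.0 - r**2 / 2.0              # = <x, y> at distance r
    return x, t * x + np.sqrt(1.0 - t**2) * z

def collision_probability(d, r, trials, rng):
    """Monte Carlo estimate of Pr[h(x) = h(y)] at distance r."""
    hits = 0
    for _ in range(trials):
        G = rng.standard_normal((d, d))
        x, y = pair_at_distance(d, r, rng)
        hits += cp_hash(x, G) == cp_hash(y, G)
    return hits / trials

rng = np.random.default_rng(4)
for r in (0.2, 0.6, 1.0, 1.4):
    print(f"r = {r:.1f}: p = {collision_probability(128, r, 2000, rng):.3f}")
```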
[Figures 1–6: collision probability (log scale) versus distance for the Standard, Fast, and Fast Discrete cross-polytope schemes.]

Figure 1: d = 128, d′ = 128.  Figure 2: d = 128, d′ = 64.  Figure 3: d = 128, d′ = 32.
Figure 4: d = 256, d′ = 256.  Figure 5: d = 256, d′ = 128.  Figure 6: d = 256, d′ = 64.
Acknowledgement

We thank Arie Israel and Eric Price for valuable suggestions that helped improve this manuscript.
References

[AC09]
Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, May 2009.
[AI08]
Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, January 2008.
[AIL+ 15]
Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and Optimal LSH for Angular Distance. ArXiv e-prints, September 2015.
[AINR14]
Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond Locality-Sensitive Hashing. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1018–1028. SIAM, 2014.
[AL13]
Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms (TALG), 9(3):21, 2013.
[Alo03]
Noga Alon. Problems and results in extremal combinatorics. Discrete Mathematics, 273:31– 53, 2003.
[AR15a]
Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. arXiv preprint arXiv:1501.01062, 2015.
[AR15b]
Alexandr Andoni and Ilya Razenshteyn. Tight Lower Bounds for Data-Dependent LocalitySensitive Hashing. ArXiv e-prints, July 2015.
[BL15]
Anja Becker and Thijs Laarhoven. Efficient (ideal) lattice sieving using cross-polytope LSH. Cryptology ePrint Archive, Report 2015/823, 2015. http://eprint.iacr.org/.
[DKS11]
Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. Fast locality-sensitive hashing. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1073–1081. ACM, 2011.
[HR15]
Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier matrices. arXiv preprint arXiv:1507.01768, 2015.
[IDIM04]
Piotr Indyk, Mayur Datar, Nicole Immorlica, and Vahab Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253–262, New York, 2004.
[IM98]
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM.
[JL84]
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[KN14]
Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1–4:23, January 2014.
[KW11]
Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269– 1281, 2011.
[LJC+ 13]
Yue Lin, Rong Jin, Deng Cai, Shuicheng Yan, and Xuelong Li. Compressed hashing. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 446–451, June 2013.
[LMYG04] Ting Liu, Andrew W. Moore, Ke Yang, and Alexander G. Gray. An investigation of practical approximate nearest neighbor algorithms. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 825–832. MIT Press, Cambridge, MA, 2004.

[LN14]

Kasper Green Larsen and Jelani Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. arXiv preprint arXiv:1411.2404, 2014.
[MNP06]
Rajeev Motwani, Assaf Naor, and Rina Panigrahy. Lower bounds on Locality Sensitive Hashing. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, SCG '06, pages 154–157, New York, NY, USA, 2006. ACM.
[Osi11]
Andrei Osipov. A Randomized Approximate Nearest Neighbors Algorithm. PhD thesis, New Haven, CT, USA, 2011. AAI3467911.
[OWZ14]
Ryan O’Donnell, Yi Wu, and Yuan Zhou. Optimal lower bounds for Locality-Sensitive Hashing (except when Q is tiny). ACM Trans. Comput. Theory, 6(1):5:1–5:13, March 2014.
[TT07]
Kengo Terasawa and Yuzuru Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Algorithms and Data Structures, pages 27–38. Springer, 2007.
[WSB]

Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces.

7 Appendix
7.1 Proof of Lemma 3
Define the event E_{v,δ} := { (1 − δ)||v||_2 ≤ ||ṽ||_2 ≤ (1 + δ)||v||_2 }. Combining Theorem 4.5 of [HR15] and Theorem 3.1 of [KW11], we know that for any η ∈ (0, 1), any s ≥ 40 ln(12/η), some C_0 > 0, and provided m = O(δ^{−2} ln^2(1/δ) s ln^2(s/δ) ln(d)),

    Pr[E_{x,δ} ∩ E_{y,δ} ∩ E_{x−y,δ}] ≥ (1 − η)(1 − 2^{−C_0 ln(d) ln(s/δ)}).

Setting δ = 1/ln(d), η = d^{−γ}, and s = 40γ ln(12d), we get

    Pr[E_{x,δ} ∩ E_{y,δ} ∩ E_{x−y,δ}] ≥ (1 − d^{−γ})(1 − 2^{−C_0 ln(d) ln(40γ ln(12d) ln(d))}),

and the lemma follows.
7.2 Proof of Theorem 4
First we state an elementary limit result that we will apply in the proofs of both Theorem 4 and Theorem 5.

Lemma 8 Suppose m_d(a), m_d(b) are positive functions with lim_{d→∞} m_d(a) = a and lim_{d→∞} m_d(b) = b, and that f(d), g(d) are also positive, with lim_{d→∞} f(d) = lim_{d→∞} g(d) = ∞ and lim_{d→∞} f(d)/g(d) = ∞. Then,

    lim_{d→∞} (m_d(a) f(d) + g(d)) / (m_d(b) f(d) + g(d)) = a/b.

Proof. We know that for any ε > 0 and d large enough, m_d(b) ≥ b − ε, so that

    lim_{d→∞} g(d)/(m_d(b) f(d) + g(d)) ≤ lim_{d→∞} g(d)/((b − ε) f(d) + g(d)) = lim_{d→∞} 1/((b − ε) f(d)/g(d) + 1) = 0,

and by positivity the inequality is an equality. This implies that

    lim_{d→∞} (m_d(a) f(d) + g(d)) / (m_d(b) f(d) + g(d)) = lim_{d→∞} m_d(a) f(d) / (m_d(b) f(d) + g(d)).

The same argument applied to the reciprocal shows that

    lim_{d→∞} m_d(a) f(d) / (m_d(b) f(d) + g(d)) = lim_{d→∞} m_d(a) f(d) / (m_d(b) f(d)) = a/b.

Recall that x, y ∈ S^{d−1} are such that ||x − y||_2^2 = R^2. Note that Gṽ = G_0 [ṽ; 0] for ṽ ∈ R^m, where G_0 ∈ R^{d′×d′} is an i.i.d. Gaussian matrix. This means that we can reduce to finding bounds for the angular distance between x̃ and ỹ and use the results for the regular cross-polytope LSH. Note that

    || x̃/||x̃||_2 − ỹ/||ỹ||_2 ||_2^2 = 2 − 2 ⟨x̃, ỹ⟩/(||x̃||_2 ||ỹ||_2).    (8)
From Lemma 3, with probability 1 − C_1 d^{−γ} (with the choice m = C_2 γ ln^4(d) ln^4(ln d)), we can bound the inner product above by

    ⟨x̃, ỹ⟩ = (||x̃||_2^2 + ||ỹ||_2^2 − ||x̃ − ỹ||_2^2)/2
            ≤ (1 + 1/ln d) − (1 − 1/ln d) ||x − y||_2^2 / 2
            ≤ ⟨x, y⟩ + 3/ln d,    (9)

where the inequality follows from (4), (5), (6). The same argument shows that ⟨x̃, ỹ⟩ ≥ ⟨x, y⟩ − 3/ln(d). Now, combining (4), (5), (8), (9),
    || x̃/||x̃||_2 − ỹ/||ỹ||_2 ||_2^2 ≤ (2 − 2(⟨x, y⟩ − 3/ln d)) / (1 + 1/ln d) = ||x − y||_2^2/(1 + 1/ln d) + 5/(ln d + 1).

Using the reverse inequalities, we get

    || x̃/||x̃||_2 − ỹ/||ỹ||_2 ||_2^2 ≥ ||x − y||_2^2/(1 − 1/ln d) − 5/(ln d − 1).

Let F_d be the event that

    ||x − y||_2^2/(1 − 1/ln d) − 5/(ln d − 1) ≤ || x̃/||x̃||_2 − ỹ/||ỹ||_2 ||_2^2 ≤ ||x − y||_2^2/(1 + 1/ln d) + 5/(ln d + 1).    (10)
Recalling that ||x − y||_2^2 = R^2, this motivates the definition

    R_+ := √( R^2/(1 + 1/ln d) + 5/(ln d + 1) ),

and similarly for R_− (using the bounds in (10)). Given that F_d holds, combining (10) and Proposition 2,

    Pr[h_F(x) = h_F(y)] = Pr[ h([x̃; 0]) = h([ỹ; 0]) ] ≤ (d′)^{−R_+^2/(4 − R_+^2)} O_{R_+}(ln^{−1}(d′)).
We now note that for large enough d, R_+ and R_− are bounded away from 2, so that O_{R_+}(ln^{−1}(d′)) = C_+ ln^{−1}(d′) and O_{R_−}(ln^{−1}(d′)) = C_− ln^{−1}(d′), where C_+ and C_− are constants that are bounded independent of the dimension. This implies that

    Pr[h_F(x) = h_F(y)] = Pr[F_d ∩ {h_F(x) = h_F(y)}] + Pr[F_d^c ∩ {h_F(x) = h_F(y)}]
                        = (1 − C_1 d^{−γ}) (d′)^{−R_+^2/(4 − R_+^2)} C_+ ln^{−1}(d′) + C_1 d^{−γ}
                        ≤ (d′)^{−R_+^2/(4 − R_+^2)} C_3 ln^{−1}(d′),

with the choice γ = ln(d), provided d is large enough that ln(d) > R_+^2/(4 − R_+^2) (note that d′ ≤ d). Hence, m = C_2 γ ln^4(d) ln^4(ln d) = C_2 ln^5(d) ln^4(ln d). A similar but less delicate argument shows that (for the same choice of γ),

    Pr[h_F(x) = h_F(y)] ≥ (d′)^{−R_−^2/(4 − R_−^2)} C_4 ln^{−1}(d′).
ρ≤
=
R2+ 4−R2+
ln(d′ ) + ln ln(d′ ) + C3
c2 R2− 4−c2 R2−
ln(d′ ) + ln ln(d′ ) + C4
1 4 − c2 R 2 + o(1), by lemma 8. c2 4 − R 2
2 2 We make a technical note that we need d large enough that R− , R+ are bounded away from 4 so that the above expression isn’t undefined.
7.3 Proof of Theorem 5
We will use the following result (formulated as an analogue of Lemma 3), due to Kane and Nelson, which reduces the amount of randomness required to perform a JL transform.

Proposition 9 (Theorem 13 and Remark 14 in [KN14]) Suppose γ > 0 and x, y ∈ S^{d−1}. Then there is a random matrix Z ∈ R^{m×d} with m = O(γ ln^3(d)), sampled with O(γ ln^2(d)) random bits, such that with probability 1 − O(d^{−γ}),

    1 − 1/ln d ≤ ||Zx||_2^2 ≤ 1 + 1/ln d,
    1 − 1/ln d ≤ ||Zy||_2^2 ≤ 1 + 1/ln d,
    (1 − 1/ln d) ||x − y||_2^2 ≤ ||Z(x − y)||_2^2 ≤ (1 + 1/ln d) ||x − y||_2^2.

First we want to construct a hash scheme that uses a Gaussian rotation with which to compare our discretized scheme. Define

    h′_D(x) = argmin_{u ∈ {±e_i}} || G′Zx/||G′Zx||_2 − u ||_2,    (11)
where G′ ∈ R^{m×m} is a standard i.i.d. Gaussian matrix. The following elementary lemma gives us a suitable replacement for each Gaussian entry in the matrix G′.

Lemma 10 Suppose g ∼ N(0, 1). Then there is a symmetric, discrete random variable X taking 2^b values such that for any x ∈ R,

    Pr[g ≤ x] = Pr[X ≤ x] + O(2^{−b}).    (12)

Proof. Let {x_i}_{i=0}^{2^b − 1} be such that Pr[g ≤ x_i] = i 2^{−b} + 2^{−b−1}, and let X be such that Pr[X = x_i] = 2^{−b}. It is now easy to check that (12) is satisfied and that X is symmetric.
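A direct implementation of this construction (our sketch, using the Gaussian quantile function from scipy; the trial sizes are illustrative):

```python
import numpy as np
from scipy.stats import norm

def discrete_gaussian_support(b):
    """Support points of the 2^b-valued variable X from Lemma 10:
    x_i is the Gaussian quantile at level i * 2^-b + 2^-(b+1), so that
    Pr[X = x_i] = 2^-b matches the Gaussian CDF up to O(2^-b)."""
    i = np.arange(2**b)
    return norm.ppf(i * 2.0**-b + 2.0**-(b + 1))

def sample_X(b, size, rng):
    """Draw i.i.d. copies of X, each using b random bits."""
    return rng.choice(discrete_gaussian_support(b), size=size)

rng = np.random.default_rng(5)
xs = sample_X(10, 100000, rng)     # 10 bits per entry, as in Section 6
print(xs.mean(), xs.var())         # close to 0 and 1
```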
The discretized scheme can now be constructed as

    h_D(x) = argmin_{u ∈ {±e_i}} || ĜZx/||ĜZx||_2 − u ||_2,    (13)

where the entries of Ĝ ∈ R^{d′×m} are i.i.d. copies of the random variable X in Lemma 10. Note that each discrete random variable uses b bits of randomness, so the hashing scheme has minimal randomness when d′ = m; thus there are m × m × b + O(γ ln^2(d)) = O(γ^2 ln^6(d) b + γ ln^2(d)) bits of randomness. As we will see, we can choose γ and b to be powers of ln(d) while still achieving the optimal asymptotic ρ. For this we have the following lemma.
Lemma 11 Let x, y ∈ R^d be such that ||x − y||_2 = R, x̃ = Zx, ỹ = Zy, and let h_D, h′_D be as defined in (13) and (11) respectively, with m = O(ln^4(d)) and b = log_2(d) ln(d). Writing R̃ = ||x̃ − ỹ||_2, we have

    ln(Pr[h_D(x) = h_D(y)]) = ln(Pr[h′_D(x) = h′_D(y)]) + O_{R̃}(1).    (14)
Proof. Note that since the entries of Ĝx̃ are symmetric and i.i.d., the probability of hashing to any one value is equal across hash values, so we get

    Pr[h_D(x) = h_D(y)] = 2d′ Pr[h_D(x) = h_D(y) = e_1]
    = 2d′ Pr[ ∩_{j=2}^{d′} { (Ĝx̃)_1 ≥ |(Ĝx̃)_j|, (Ĝỹ)_1 ≥ |(Ĝỹ)_j| } ]
    = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( Pr[ (Ĝx̃)_1 ≥ |(Ĝx̃)_2|, (Ĝỹ)_1 ≥ |(Ĝỹ)_2| ]^{d′−1} ).    (15)
Our goal is to bound the probability Pr[(Ĝx̃)_1 ≥ |(Ĝx̃)_2|, (Ĝỹ)_1 ≥ |(Ĝỹ)_2|] in terms of the probability Pr[(G′x̃)_1 ≥ |(G′x̃)_2|, (G′ỹ)_1 ≥ |(G′ỹ)_2|]. Define E_{G′} = { (G′x̃)_1 ≥ |(G′x̃)_2|, (G′ỹ)_1 ≥ |(G′ỹ)_2| }, and similarly E_Ĝ for Ĝ. Since E_{G′} is a convex polytope, we can write

    Pr[E_{G′}] = ∫_{I_1} ∫_{I_2(x_1)} ... ∫_{I_m(x_1, ..., x_{m−1})} (2π)^{−m/2} e^{−(x_1^2 + ... + x_m^2)/2} dx_m ... dx_1,

where each I_j(x_1, ..., x_{j−1}) is a (possibly unbounded) interval. By construction of X,

    ∫_{I_{j+1}(x_1, ..., x_j)} (2π)^{−1/2} e^{−x_{j+1}^2/2} dx_{j+1} = ∫_{I_{j+1}(x_1, ..., x_j)} p_X(x_{j+1}) dx_{j+1} + O(2^{−b}),

where p_X(x) is the pdf of X. This implies that
    Pr[E_{G′}] = ∫_{I_1} ∫_{I_2(x_1)} ... ∫_{I_m(x_1, ..., x_{m−1})} (2π)^{−(m−1)/2} e^{−(x_1^2 + ... + x_{m−1}^2)/2} p_X(x_m) dx_m ... dx_1 + O(2^{−b})
    = ∫_{I_1} ∫_{I_2(x_1)} ... ∫_{I_m(x_1, ..., x_{m−1})} p_X(x_1) ... p_X(x_m) dx_m ... dx_1 + O(m 2^{−b})
    = Pr[E_Ĝ] + O(m 2^{−b}).
Plugging this into (15), we get

    Pr[h_D(x) = h_D(y)] = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( (Pr[E_{G′}] + O(m 2^{−b}))^{d′−1} )
    = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( Σ_{k=0}^{d′−1} (d′−1 choose k) Pr[E_{G′}]^{d′−1−k} (O(m 2^{−b}))^k ).

We now make the choice m = C_5 ln^4(d), b = log_2(d) ln(d), so that the above summation becomes

    Σ_{k=0}^{d′−1} (d′−1 choose k) Pr[E_{G′}]^{d′−1−k} (C_5 ln^4(d) d^{−ln(d)})^k.
The k = 0 term is the main term Pr[E_{G′}]^{d′−1}, and the other terms can be bounded using Stirling's approximation as follows:

    (d′−1 choose k) Pr[E_{G′}]^{d′−1−k} (C_5 ln^4(d) d^{−ln(d)})^k ≤ (de/k)^k (C_5 ln^4(d) d^{−ln(d)})^k.

For k ≥ 1 this is certainly bounded by O(d^{−ln(d)+1}), and we have

    Σ_{k=0}^{d′−1} (d′−1 choose k) Pr[E_{G′}]^{d′−1−k} (C_5 ln^4(d) d^{−ln(d)})^k = Pr[E_{G′}]^{d′−1} + O(d^{−ln(d)+2}).
We note that the last asymptotic approximation is very rough but sufficient for our purposes. This means that

    Pr[h_D(x) = h_D(y)] = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( Pr[E_{G′}]^{d′−1} ) + O(m d^{−ln(d)+2}).    (16)
Using the same technique as above, replacing the Gaussian density with p_X(x), we have

    Pr[h′_D(x) = h′_D(y)] = 2d′ E_{(G′x̃)_1, (G′ỹ)_1}( Pr[E_{G′}]^{d′−1} )
    = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( (Pr[E_{G′}] + O(m 2^{−b}))^{d′−1} )
    = 2d′ E_{(Ĝx̃)_1, (Ĝỹ)_1}( Pr[E_{G′}]^{d′−1} ) + O(m d^{−ln(d)+2}).
Finally, plugging this into (16), we get

    Pr[h_D(x) = h_D(y)] = Pr[h′_D(x) = h′_D(y)] + O(m d^{−ln(d)+2}) = Pr[h′_D(x) = h′_D(y)] + O(d^{−ln(d)+3}).

Now, we know by Proposition 2 that ln(Pr[h′_D(x) = h′_D(y)]) = −(R̃^2/(4 − R̃^2)) ln(d′) + O_{R̃}(ln(ln d′)), so provided d is large enough that ln(d) − 2 > R̃^2/(4 − R̃^2), the lemma follows.
To complete the proof of Theorem 5: in the notation of the proof of Theorem 4, since our choice of m is large enough, the O_{R̃}(1) term is bounded in the dimension, so that after fixing γ = ln d
and running the same argument as in Theorem 4, we have

    ρ = ln(Pr[h_D(x) = h_D(y)]) / ln(Pr[h_D(cx) = h_D(cy)])
      = ( ln(Pr[h′_D(x) = h′_D(y)]) + O_{R̃}(1) ) / ( ln(Pr[h′_D(cx) = h′_D(cy)]) + O_{R̃}(1) )
      = ( R_+^2/(4 − R_+^2) ln(d′) + ln ln(d′) + C_+ + O_{R̃}(1) ) / ( c^2 R_−^2/(4 − c^2 R_−^2) ln(d′) + ln ln(d′) + C_− + O_{R̃}(1) )
      = ( R_+^2/(4 − R_+^2) ln(d′) + ln ln(d′) + C_6 ) / ( c^2 R_−^2/(4 − c^2 R_−^2) ln(d′) + ln ln(d′) + C_7 )
      = (1/c^2) · (4 − c^2 R^2)/(4 − R^2) + o(1),   by Lemma 8.
Finally, by our choice of γ and b in the above lemma, we know that there are O(ln^9(d)) bits of randomness.