A Nonlinear Approach to Dimension Reduction

Lee-Ad Gottlieb∗  Robert Krauthgamer∗
Weizmann Institute of Science

arXiv:0907.5477v1 [cs.CG] 31 Jul 2009

July 31, 2009

Abstract

A powerful embedding theorem in the context of dimension reduction is the ℓ2 flattening lemma of Johnson and Lindenstrauss [JL84]. It has been conjectured that improved dimension bounds may be achievable for some data sets by bounding the target dimension in terms of the intrinsic dimensionality of the data set (for example, the doubling dimension). One such problem was proposed by Lang and Plaut [LP01] (see also [GKL03, Mat02, ABN08]), and is still open. We pose another question in this line of work: Does the snowflake metric d^{1/2} of a doubling set S ⊂ ℓ2 always embed with distortion O(1) into ℓ_2^D, for dimension D that depends solely on the doubling constant of the metric? We resolve this question in the affirmative, and furthermore obtain distortion arbitrarily close to 1. Moreover, our techniques are sufficiently robust to be applicable also to the more difficult spaces ℓ1 and ℓ∞, although these extensions achieve dimension bounds that are quantitatively inferior to those for ℓ2.

1 Introduction

Dimension reduction, in which high-dimensional data is faithfully represented in a low-dimensional space, is a key tool in several fields. Probably the most prevalent mathematical formulation of this problem considers the data to be a set S ⊂ ℓ2, and the goal is to map the points in S into a low-dimensional ℓ_2^k. (Here and throughout, ℓ_p^k denotes the space R^k endowed with the ℓp-norm; ℓp is the infinite-dimensional counterpart of all sequences that are p-th power summable.) A celebrated result in this area is the so-called JL-Lemma:

Theorem 1.1 (Johnson and Lindenstrauss [JL84]). For every n-point subset S ⊂ ℓ2 and every 0 < ε < 1, there is a mapping ΨJL : S → ℓ_2^k that preserves all interpoint distances in S within factor 1 + ε, and has target dimension k = O(ε^{-2} log n).

This positive result is remarkably strong; in fact, the map ΨJL is an easy-to-describe linear transformation. It has found many applications, and has become a basic tool. It is natural to seek the optimal (minimum) target dimension k possible in this theorem. The logarithmic dependence on n = |S| is necessary, as can easily be seen by volume arguments, and Alon [Alo03] further proved

∗ This work was supported in part by The Israel Science Foundation (grant #452/08), and by a Minerva grant. Weizmann Institute of Science, Rehovot, Israel. Email: {lee-ad.gottlieb,robert.krauthgamer}@weizmann.ac.il


that the JL-Lemma is optimal up to a factor of O(log(1/ε)). These lower bounds are existential, meaning that there are sets S for which the result of the JL-Lemma cannot be improved. However, it may still be possible to significantly reduce the dimension for sets S that are "intrinsically" low-dimensional. This raises the interesting and fundamental question of bounding k in terms of parameters other than n, which we formalize next.

We recall some basic terminology involving metric spaces.¹ The doubling constant of a metric (M, d_M), denoted λ(M), is the smallest λ ≥ 1 such that every (metric) ball in M can be covered by at most λ balls of half its radius. We say that M is doubling if its doubling constant λ(M) is bounded independently of |M|. It is sometimes more convenient to refer to dim(M) := log_2 λ(M), which is known as the doubling dimension of M [GKL03].

An embedding of one metric space (M, d_M) into another (N, d_N) is a map Ψ : M → N. We say that Ψ attains distortion D′ ≥ 1 if Ψ preserves every pairwise distance within factor D′, namely, there is a scaling factor s > 0 such that

1 ≤ d_N(Ψ(x), Ψ(y)) / (s · d_M(x, y)) ≤ D′,  ∀x, y ∈ M.
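In code, the distortion of a map between finite metric spaces follows directly from this definition: taking s to be the smallest of the ratios d_N(Ψ(x), Ψ(y))/d_M(x, y) normalizes the lower bound to 1, and the distortion is then the ratio of extremes. A minimal illustrative sketch (the point sets and maps below are hypothetical examples, not from the paper):

```python
def distortion(points, psi, d_M, d_N):
    """Distortion D' of the map psi between finite metric spaces:
    the ratio of the largest to the smallest value of
    d_N(psi(x), psi(y)) / d_M(x, y) over all pairs x != y."""
    ratios = [d_N(psi(x), psi(y)) / d_M(x, y)
              for i, x in enumerate(points)
              for y in points[i + 1:]]
    return max(ratios) / min(ratios)

d = lambda a, b: abs(a - b)
pts = [0.0, 1.0, 3.0]
assert distortion(pts, lambda x: 2.0 * x, d, d) == 1.0  # pure scaling has distortion 1
assert distortion(pts, lambda x: x * x, d, d) == 4.0    # squaring stretches larger gaps more
```

Note that the optimal scaling s never needs to be searched for: dividing all ratios by their minimum is exactly what the max/min quotient computes.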

The following problem was posed independently by [LP01] and [GKL03] (see also [Mat02, ABN08]):

Question 1. Does every doubling subset S ⊂ ℓ2 embed with distortion D′ into ℓ_2^D for D, D′ that depend only on λ(S)?

This question is still open and seems very challenging. Resolving it in the affirmative seems to require completely different techniques than the JL-Lemma, since such an embedding cannot be achieved by a linear map [IN07, Remark 4.1]. For algorithmic applications, the ideal embedding would be an even stronger variant of Question 1, where the target distortion D′ is an absolute constant independent of λ(S), or even 1 + ε as in the JL-Lemma. This stronger version has not been excluded, and is still open as well.

The motivations for a theoretical study of dimension reduction (such as Question 1, as well as our results, which are similar in spirit but technically incomparable) can be summarized as follows:

1. Conceptual – understand the formal relationship between intrinsic dimension and embedding dimension.

2. Technical – devise tools that exploit (low) intrinsic dimensionality, and methods for embedding into low-dimensional spaces.

3. Algorithmic – the performance of many common algorithms depends heavily on the embedding dimension (rather than the intrinsic dimension), and dimension reduction can boost their performance significantly. This is true also for heuristics such as k-means clustering or SVM classification, but then the final outcome is a heuristic as well.²

1.1 Results and Techniques

We present dimension reduction results for doubling subsets of Euclidean spaces. In fact, we devise a framework that is robust and even extends to the notoriously difficult spaces ℓ1 and ℓ∞. Our

¹ A metric space is a set M of points endowed with a distance function d_M(·, ·) that is nonnegative, symmetric, and satisfies the triangle inequality; in addition, we stipulate that d_M(x, y) = 0 if and only if x = y.
² An alternative approach is to adapt a known algorithm so that it exploits a (low) doubling dimension, see e.g. the near neighbor search algorithms of [IN07, DF08]. This approach can potentially achieve superior performance, but it requires specialization, while "generic" dimension reduction is an off-the-shelf technology.


results incur constant or 1 + ε distortion, with target dimension that depends not on |S| but rather on dim(S) (and this dependence is unavoidable due to volume arguments). We remark that embeddings of this type (low distortion and dimension) are not commonly achieved for metric embeddings, despite being highly desirable. All our results are algorithmic: they are constructive and can be computed in polynomial time. The details are mostly straightforward, and thus we will not address this issue explicitly. We state our results in the context of finite metrics (subsets of ℓp), although the results extend to infinite subsets of Lp via standard arguments. We use Õ(f) to denote f · (log f)^{O(1)}.

Snowflake Embedding. Our primary embedding achieves distortion 1 + ε for the snowflake metric d^α of an input metric d (i.e., the snowflake metric is obtained by raising every pairwise distance to the power α). It is instructive to view α as a fixed constant, say α = 1/2. We prove the following in Section 3.

Theorem 1.2. Let 0 < ε < 1/4 and 0 < α < 1. Every finite subset S ⊂ ℓ2 admits an embedding Φ : S → ℓ_2^k for k = Õ(ε^{-4}(1 − α)^{-1}(dim S)²), such that

1 ≤ ‖Φ(x) − Φ(y)‖_2 / ‖x − y‖_2^α ≤ 1 + ε,  ∀x, y ∈ S.
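Since t ↦ t^α is concave and vanishes at 0, it is subadditive, so the snowflake operation indeed returns a metric for every α ∈ (0, 1). A quick numeric sanity check of this fact (the random point set below is arbitrary):

```python
import itertools
import random

def snowflake(d, alpha):
    """The snowflake of a distance function d: every distance raised to power alpha."""
    return lambda x, y: d(x, y) ** alpha

random.seed(0)
pts = [random.uniform(0.0, 10.0) for _ in range(15)]
d = lambda a, b: abs(a - b)
d_half = snowflake(d, 0.5)  # the snowflake metric d^(1/2) of Theorem 1.2

# The triangle inequality survives snowflaking, since t^alpha is subadditive.
for x, y, z in itertools.permutations(pts, 3):
    assert d_half(x, z) <= d_half(x, y) + d_half(y, z) + 1e-12
```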

Notice the differences between our embedding and the embedding of Question 1: Our embedding achieves better distortion 1 + ε, but it applies to the (often easier) snowflake metric d^α. Our result is also related to the following theorem of Assouad [Ass83]: For every doubling metric (M, d) and every 0 < α < 1, the snowflake metric d^α embeds into ℓ_2^D with distortion D′, where D, D′ depend only on λ(M) and α. Note the theorem's vast generality – the only requirements are the doubling property (which by volume arguments is an obvious necessity) and that the data be a metric – at the nontrivial price that the distortion achieved depends on λ(M). Compared to Assouad's theorem, our embedding achieves a stronger distortion 1 + ε, but requires the additional assumption that the input metric is Euclidean.

Previously, Theorem 1.2 was only known to hold in the special case where S = R (the real line). For this case, Kahane [Kah81] and Talagrand [Tal92] exhibit a (1 + ε)-distortion embedding of the snowflake metric |x − y|^α into ℓ_2^k. The results of Kahane [Kah81] and Talagrand [Tal92] differ in that [Kah81] achieves dimension k = O(1/ε) for the specific snowflake metric |x − y|^{1/2} (also known as Wilson's helix), while [Tal92] embeds every snowflake metric |x − y|^α, α ∈ (0, 1), achieving dimension O(K(α)/ε²) (which is larger). Thus, our theorem can be viewed as a generalization of [Kah81, Tal92] to arbitrary doubling subsets of ℓ2 (or other ℓp), albeit with a somewhat worse dependence on ε.

Embedding for a Single Scale. Most of our technical work is devoted to designing an embedding that preserves distances at a single scale r > 0, while still maintaining a one-sided guarantee (Lipschitz condition) for all scales. We now state our most basic result, which achieves only a constant distortion (for the desired scale).

Theorem 1.3. For every scale r > 0 and every 0 < δ < 1/4, every finite set S ⊂ ℓ2 admits an embedding ϕ : S → ℓ_2^k for k = Õ((dim S)² log(1/δ)), satisfying:

(a) Lipschitz: ‖ϕ(x) − ϕ(y)‖_2 ≤ ‖x − y‖_2 for all x, y ∈ S;

(b) Bi-Lipschitz at scale r: ‖ϕ(x) − ϕ(y)‖_2 = Ω(‖x − y‖_2) whenever ‖x − y‖_2 ∈ [δr, r]; and

(c) Boundedness: ‖ϕ(x)‖_2 ≤ r for all x ∈ S.

The constant-factor accuracy achieved by this theorem is too weak to yield the 1 + ε distortion asserted in Theorem 1.2. While we cannot improve condition (b) to a factor of 1 + ε, we are able to refine it in a useful way. Roughly speaking, we introduce a "correction" function G̃ : R → R, such that whenever ‖x − y‖_2 ∈ [δr, r],

‖ϕ(x) − ϕ(y)‖_2 / ‖x − y‖_2 = (1 ± ε) G̃(‖x − y‖_2 / r).   (1)

This function G̃ does not depend on r and equals Θ(1) in the appropriate range. Using the correction function, we obtain very accurate bounds on distances in the target space, at the price of increasing the dimension by a factor of Õ(1/ε³). This high-level idea is implemented in Theorem 3.1, which immediately implies Theorem 1.3, although the precise guarantee therein slightly differs from Equation (1). Embeddings for a single scale are commonly used in the embeddings literature, but not in the context of dimension reduction, and it is plausible that in some applications the single-scale embedding may suffice, or even provide better bounds than our snowflake embedding (or Question 1).

Technical Ingredients. The main technical challenge is to keep both distortion and dimension under tight control. We use the JL-Lemma to obtain the desired low distortion and dimension "locally" for one scale, e.g. for an (εr)-net in an O(r)-ball, which contains λ^{O(log(1/ε))} net points. (See Section 2 for more standard terminology.) This creates the problem of "stitching" many local (linear) maps into one global embedding, which must be smooth (Lipschitz) yet faithful (low-distortion at one scale). Previous work (e.g. [GKL03, KLMN05, ABN08]) devised techniques that solve this stitching problem; however, these techniques significantly increase the distortion and dimension. Fortunately, we have additional structure at hand (each local piece is Euclidean), which we can exploit to refine the previous stitching technique so as to maintain the low distortion and dimension.
The main technical tools we use are: (1) dimension reduction for a finite set, such as the JL-Lemma; (2) Lipschitz extension theorems, such as Kirszbraun's Theorem [Kir34]; (3) bounding the interpoint distances at a certain threshold, for example via Schoenberg's Theorem [Sch38]; (4) probabilistic partitioning of metric spaces, such as padded decompositions; (5) gluing embeddings of several local pieces by smoothing them near the boundary; and (6) Assouad's technique for snowflake embedding (scaling down by √r) [Ass83]. Nowadays, these are all either standard or classic tools, yet applying them in the right combination yields a surprisingly strong outcome. In fact, the way we employ these standard techniques has some new and interesting aspects. The first three ingredients rely on the ℓ2 structure, while the last three are standard in the context of doubling metrics, and our work is the first to combine these two groups of tools. Similarly, our result is the first to exploit the full power of Assouad's snowflake technique to achieve an embedding with 1 + ε distortion (as opposed to constant distortion). Observe also that most of these steps are nonlinear.

Our results may also be viewed as partial progress towards the resolution of Question 1. Indeed, Theorem 1.2 answers positively the special case where the given metric satisfies the condition that its square is Euclidean, and Theorem 1.3 achieves guarantees that relax those required by Question 1.

Extension to Other Spaces. Our embedding framework extends to ℓp (i.e. S ⊂ ℓp and Φ : S → ℓ_p^k) for both p = 1 and p = ∞, as discussed in more detail in Section 4. The bounds we obtain therein are worse than in the ℓ2 case; namely, the dimension k is at least exponential in dim(S), which is to be expected because of strong lower bounds known in terms of n = |S| (see Section 1.3). We remark that previous work on dimension reduction in ℓp spaces did not establish any dimension bound in terms of λ(S); these bounds are all expressed in terms of the size of S, or the dimension of S as a linear subspace (Section 1.3).

Our framework extends to ultrametrics with even stronger guarantees, and for them we resolve Question 1 in the affirmative, as follows. Ultrametrics embed isometrically (i.e. with distortion 1) into ℓ2, hence Theorem 1.2 immediately applies. The dimension bound can be further improved by replacing some steps with more specialized machinery. Moreover, the snowflake operator can be eliminated altogether by observing that (M, d) is an ultrametric if and only if its snowflake (M, d^β) is an ultrametric (for all β > 0), and applying Theorem 1.2 to the ultrametric d² with α = 1/2. However, a near-optimal bound can be achieved by a simpler and more direct construction; we defer further details.

1.2 Applications

In many application areas it is important to succinctly represent data that is given as points in ℓp, and here our embeddings may be applicable, especially if the data has low doubling dimension, and if the snowflake effect can be mitigated or an embedding for a single scale suffices. Here we illustrate the effectiveness and potential of our results by describing a few immediate (theoretical) algorithmic applications.

Distance Labeling Scheme (DLS). Consider this problem for the family of n-point ℓ2 metrics with a given bound on the doubling dimension. As usual, we assume the interpoint distances are in the range [1, R]. Our snowflake embedding into ℓ_2^k (Theorem 1.2 for α = 1/2) immediately provides a DLS with approximation (1 + ε)² ≤ 1 + 3ε, simply by rounding each coordinate to a multiple of ε/2k. It achieves label size

k · log(R / (ε/2k)) = Õ(ε^{-4}(dim S)²) log R.

Notice that, apart from the log R term, this bound is independent of n. The published bounds of this form (see [HM06] and references therein) apply to the more general family of all doubling metrics (not necessarily Euclidean), but require exponentially larger label size, roughly (1/ε)^{O(dim S)}.

Clustering. Many clustering problems are defined as an optimization problem whose objective function is expressed in terms of distances between data points. Consider for example the k-center problem, where one is given a metric (S, d) and is asked to identify a subset of centers C ⊂ S that minimizes the objective max_{x∈S} d(x, C). When the data set S is Euclidean (and the centers are discrete, i.e. from S), one can apply our snowflake embedding (Theorem 1.2) and solve the problem in the target space, which has low dimension k. Indeed, it is easy to see how to map solutions from the original space to the target space and vice versa, with a loss of at most a (1 + ε)² ≤ 1 + 3ε factor in the objective.
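The coordinate rounding underlying the labeling scheme above can be made concrete. In this sketch the grid size q and the vectors are illustrative stand-ins (the paper rounds to a multiple of ε/2k), but the error accounting is the same: each coordinate moves by at most q/2, so each point moves by at most q√k/2 in ℓ2, and any interpoint distance changes by at most q√k.

```python
import math

def quantize(vec, q):
    """Round every coordinate to the nearest multiple of the grid size q."""
    return [q * round(v / q) for v in vec]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

k, q = 8, 0.25  # illustrative dimension and grid size
u = [0.3, -1.2, 0.7, 2.2, -0.4, 1.1, 0.0, 0.5]
v = [1.0, 0.2, -0.3, 1.9, 0.6, -1.4, 0.8, 0.1]

# Distance between quantized labels differs from the true distance
# by at most q * sqrt(k) (triangle inequality over the two endpoints).
err = abs(l2(quantize(u, q), quantize(v, q)) - l2(u, v))
assert err <= q * math.sqrt(k)
```

Storing each rounded coordinate takes log(range/q) bits, which is the source of the label-size bound quoted above.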
In some other problems, like k-median or min-sum clustering, the objective function is the sum of certain distances. The argument above applies, except that now in the target space we need an algorithm that solves the problem with ℓ2-squared costs. For instance, to solve the k-median problem in the original space, we might need an algorithm for k-means in the target space. One example is the algorithms designed by Schulman [Sch00] for min-sum clustering under both ℓ2 and ℓ2-squared costs, whose running times depend exponentially on the dimension.

1.3 Other Related Work on Low Dimensionality

The subject of dimensionality has attracted considerable attention from various directions, and clearly we can only mention a few of the most relevant threads.

The JL-Lemma and Other ℓp Spaces. The dimension reduction bounds known for ℓp norms other than ℓ2 are much weaker than those given by the JL-Lemma, despite significant research; see [Sch87, Tal90] for p = 1, [Bal90, Tal95] for 1 < p < 2, and [Mat96] for p = ∞. In fact, strong dimension reduction lower bounds are known for ℓ1 [BC05, LN04] and for ℓ∞ [Mat96]. Another negative finding in this direction, due to [JN09], is that dimension reduction à la the JL-Lemma is quite unique to the ℓ2-norm.

Alternative Notions of Dimension Reduction. A few other notions of dimension reduction were suggested in the literature. For ℓ1, Ostrovsky and Rabani [OR02] designed a weak analogue of the JL-Lemma [JL84], which is faithful for a range of distance scales (but not for all distances). For ℓp spaces, 1 ≤ p < 2, Indyk [Ind06] devised a different analogue of the JL-Lemma, which uses p-stable distributions to produce accurate estimates of interpoint distances. Strictly speaking, this is not an embedding into ℓp (e.g. it uses a median over the coordinates). Motivated by the Nearest Neighbor Search (NNS) problem, Indyk and Naor [IN07] proposed a weaker form of dimension reduction, and showed that every doubling subset S ⊂ ℓ2 admits such a dimension reduction into ℓ2 with dimension O(dim S). Roughly speaking, this notion is weaker in that distances in the target space are allowed to err in one direction (be too large) for all but one pair of points. Bartal, Recht and Schulman [BRS07] develop a variant of the JL-Lemma that is local – it preserves the distance between every point and the k̂ points closest to it. Assuming S ⊂ ℓ2 satisfies a certain growth rate condition, they achieve, for any desired k̂ and ε > 0, an embedding of this type with distortion 1 + ε and dimension O(ε^{-2} log k̂).

Embedding with Low Dimension and Distortion.
Very few results establish sufficient conditions for constant distortion embedding into ℓ_p^{O(1)}, and we have already mentioned [Ass83, Kah81, Tal92]. Apart from these, Gupta, Krauthgamer and Lee [GKL03] show that every doubling tree metric admits a constant distortion embedding into ℓ_2^{O(1)}. Other embeddings with distortion and target dimension that are low (but not both constant) include [BM04, KLMN05, ABN08, CGT08].

Computational Aspects. The computational problem of embedding an input metric into ℓ_2^k (i.e., the input is a distance function d(·, ·) on n points, and the goal is to embed it into ℓ_2^k with minimum distortion) is NP-hard even for k = 1, and is even hard to approximate within factor roughly n^{1/12} [BCIS05]. An approximation algorithm for k = 1 is given in [BDG+05], and near-optimal hardness of approximation results are given for k ≥ 2 in [MS08]. Dimensionality of data points was also studied from a property testing perspective; see [PR01, KS03].


2 Preliminaries and tools

2.1 Geometric properties

Doubling dimension. For a metric (X, d), let λ be the smallest value such that every ball in X can be covered by λ balls of half the radius. The doubling dimension of X is dim(X) = log_2 λ. A metric is doubling when its doubling dimension is constant. The following property can be demonstrated via repeated application of the doubling property.

Property 2.1. For a set S with doubling dimension log λ, if the minimum interpoint distance in S is at least α, and the diameter of S is at most β, then |S| = λ^{O(log(β/α))}.

ε-nets. For a point set S, an ε-net of S is a subset T ⊂ S with the following properties:

1. Packing: For every pair u, v ∈ T, d(u, v) ≥ ε.

2. Covering: Every point u ∈ S is strictly within distance ε of some point v ∈ T: d(u, v) < ε.
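An ε-net satisfying both properties can be produced greedily: scan the points and keep any point that is at distance at least ε from all points kept so far. A minimal sketch (the point set is illustrative):

```python
def epsilon_net(points, eps, d):
    """Greedy epsilon-net: the kept points are eps-separated (packing),
    and every discarded point is strictly within eps of a kept one (covering)."""
    net = []
    for p in points:
        if all(d(p, q) >= eps for q in net):
            net.append(p)
    return net

pts = [0.0, 0.1, 0.25, 0.9, 1.0, 2.0]
d = lambda a, b: abs(a - b)
net = epsilon_net(pts, 0.5, d)
assert net == [0.0, 0.9, 2.0]
# Covering: every input point is within eps of some net point.
assert all(min(d(p, q) for q in net) < 0.5 for p in pts)
```

Packing holds because a point is kept only when it is ε-far from the current net; covering holds because a discarded point was within ε of some already-kept point.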

2.2 Embeddings

Lipschitz function. Let (X, dX) and (Y, dY) be metric spaces. A function f : X → Y is said to be K-Lipschitz (for K > 0) if for all x, x′ ∈ X we have dY(f(x), f(x′)) ≤ K · dX(x, x′). The Lipschitz constant (or Lipschitz norm) of f, denoted ‖f‖Lip, is the smallest K > 0 satisfying the above. A 1-Lipschitz function is called, in short, Lipschitz. The following basic properties are easy to prove:

(P1). Let f : X → ℓ_2^k and α > 0. Then ‖αf‖Lip ≤ α‖f‖Lip.

(P2). Let f_1, ..., f_m : X → ℓ_2^k. Then the sum Σ_{i=1}^m f_i, which maps x ↦ f_1(x) + ... + f_m(x) ∈ ℓ_2^k, has Lipschitz norm ‖Σ_{i=1}^m f_i‖Lip ≤ Σ_{i=1}^m ‖f_i‖Lip ≤ m · max_{i=1,...,m} ‖f_i‖Lip. Similarly, the direct sum ⊕_{i=1}^m f_i, which maps x ↦ f_1(x) ⊕ ... ⊕ f_m(x) ∈ ℓ_2^{mk}, has Lipschitz norm ‖⊕_{i=1}^m f_i‖Lip ≤ (Σ_{i=1}^m ‖f_i‖²Lip)^{1/2} ≤ m^{1/2} max_{i=1,...,m} ‖f_i‖Lip.

(P3). Let f : X → Y and g : Y → Z. Then their composition g ∘ f, mapping x ↦ g(f(x)) ∈ Z, has Lipschitz norm ‖g ∘ f‖Lip ≤ ‖f‖Lip · ‖g‖Lip.

(P4). Let f : X → ℓ_2^k and g : X → R. Then their product fg : x ↦ g(x) · f(x) has Lipschitz norm ‖fg‖Lip ≤ ‖f‖Lip · max_x |g(x)| + ‖g‖Lip · max_x ‖f(x)‖.

(P5). Let f : X → R and T > 0. Then thresholding f at value T, i.e. g : x ↦ min{f(x), T}, has Lipschitz norm ‖g‖Lip ≤ ‖f‖Lip.

Extension Theorem. The Kirszbraun Theorem [Kir34] states that if S and X are Euclidean spaces, T ⊂ S, and f : T → X is Lipschitz, then there exists a function f̃ : S → X that has the same Lipschitz constant as f and also extends f, meaning that the restriction of f̃ to T is identical to f: f̃|T = f.

Threshold function. A threshold function is a map that guarantees a bound on the maximum interpoint distance between points in its image. To create a threshold function, we follow the lead of Schoenberg [Sch38, DL97], who gave a technique to implement the Gaussian transform. For a parameter r > 0, the transform is a map g : ℓ2 → ℓ2 that transforms every interpoint distance t to

Gr(t) = r(1 − e^{−t²/r²})^{1/2}.

Note that

Gr(t) ≤ t,  ∀t ≥ 0,   (2)

thus ‖g‖Lip ≤ 1. In addition, Gr(t) ≤ r, hence we indeed obtain a threshold at r.
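Both bounds on the transformed distances are easy to probe numerically; the grid of test values below is arbitrary:

```python
import math

def G_r(t, r):
    """The Gaussian transform of a distance t, thresholded at scale r."""
    return r * math.sqrt(1.0 - math.exp(-t * t / (r * r)))

r = 2.0
for t in [x / 100.0 for x in range(1, 1001)]:
    g = G_r(t, r)
    assert g <= t   # Equation (2): the transform never expands a distance
    assert g <= r   # thresholding: transformed distances never exceed r

# Far beyond r the transform saturates at r.
assert abs(G_r(100.0, r) - r) < 1e-9
```

For t much smaller than r, G_r(t) is essentially t itself, which is why the transform is harmless at the scale of interest while capping all larger distances.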

Probabilistic partitions. Probabilistic partitions are a common tool used in embeddings. Let (X, d) be a finite metric space. A partition P of X is a collection of non-empty pairwise disjoint clusters C(P) = {C_1, C_2, ..., C_t} such that X = ∪_j C_j. For x ∈ X we denote by P(x) the cluster containing x. We will need the following decomposition lemma, due to Gupta, Krauthgamer and Lee [GKL03] and Abraham, Bartal and Neiman [ABN08]. Let B(x, r) = {y : ‖x − y‖ ≤ r}.

Theorem 2.1 (Padded Decomposition of doubling metrics [GKL03, ABN08]). There exists a constant c_0 > 1, such that for every metric space (X, d) and every Δ > 0, there is a multi-set D = [P_1, ..., P_m] of partitions of X, such that:

1. Bounded radius: diam(C) ≤ Δ for all clusters C ∈ ∪_{i=1}^m P_i.

2. Padding: If P is chosen uniformly from D, then for all x ∈ X,

Pr_{P∈D}[B(x, Δ/(c_0 dim(X))) ⊆ P(x)] ≥ 1 − ε.

3. Support: m ≤ c_0 ε^{-1} dim(X) log dim(X).

Remark: [GKL03] provided slightly different quantitative bounds than in Theorem 2.1. The first two properties follow from Lemma 2.7 in [ABN08], and the third property from an application of the Lovász local lemma sketched there.
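For intuition, here is a minimal sketch of one standard ball-carving construction of a bounded-diameter random partition. It illustrates only the partition and bounded-radius properties; it is not the construction achieving the padding and support guarantees of Theorem 2.1, and the parameters are illustrative:

```python
import random

def ball_carving_partition(points, delta, d, rng):
    """Random partition with cluster diameters at most delta: visit the points
    in random order and carve a ball of a random radius <= delta/2 around each
    still-unassigned center."""
    order = points[:]
    rng.shuffle(order)
    radius = rng.uniform(delta / 4.0, delta / 2.0)
    assigned, clusters = set(), []
    for c in order:
        ball = [p for p in points if p not in assigned and d(p, c) <= radius]
        if ball:
            clusters.append(ball)
            assigned.update(ball)
    return clusters

rng = random.Random(1)
pts = [float(i) for i in range(20)]
d = lambda a, b: abs(a - b)
parts = ball_carving_partition(pts, 4.0, d, rng)

# Bounded radius: every cluster has diameter at most delta.
assert all(max(C) - min(C) <= 4.0 for C in parts)
# Partition: the clusters are disjoint and cover all points.
assert sorted(p for C in parts for p in C) == pts
```

Every point is eventually assigned because when its own turn comes as a center, its ball contains at least the point itself.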

3 Dimension Reduction for ℓ2

In this section we first design a single-scale embedding that achieves distortion 1 + ε after including a correction function. This result is stated in Theorem 3.1 below, which is a refined version of Theorem 1.3. We then use this single-scale embedding to prove Theorem 1.2 in Section 3.3. Throughout this section, the norm notation ‖ · ‖ denotes the ℓ2-norm. We make no attempt to optimize constants.

Following Section 2, define G : R → R by G(x) = (1 − e^{−x²})^{1/2}, and let

Gr(x) = r · G(x/r) = r(1 − e^{−x²/r²})^{1/2}.

Theorem 3.1. For every scale r > 0 and every 0 < δ, ε < 1/4, every finite set S ⊂ ℓ2 admits an embedding ϕ : S → ℓ_2^k for k = Õ(ε^{-3} log(1/δ) log² λ), satisfying:

(a) Lipschitz: ‖ϕ(x) − ϕ(y)‖ ≤ ‖x − y‖ for all x, y ∈ S.

(b) 1 + ε distortion to the Gaussian (at scales near r): For all x, y ∈ S with δr ≤ ‖x − y‖ ≤ r/δ,

1/(1 + ε) ≤ ‖ϕ(x) − ϕ(y)‖ / Gr(‖x − y‖) ≤ 1.

(c) Boundedness: ‖ϕ(x)‖ ≤ r for all x ∈ S.

In the sequel, we shall prove guarantees that are slightly weaker than those stated above, but only by a constant C > 1, e.g. ‖ϕ‖Lip ≤ 1 + Cε. The actual theorem follows immediately from these guarantees by scaling ϕ by 1/(1 + Cε), and ε by C.

3.1 Embedding for a single scale

We now describe the construction of the embedding ϕ for Theorem 3.1. All the hidden constants are absolute, i.e. independent of λ, ε, δ and r. We believe the dimension can be improved to depend near-linearly on log λ, by carefully combining some of these steps.

Step 1 (Net Extraction): Let N ⊆ S be an (εδr)-net of S.

Step 2 (Padded Decomposition): Compute for N a padded decomposition with padding 3r/δ. More specifically, by Theorem 2.1, there is a multiset [P_1, ..., P_m] of partitions of N, where every point is (3r/δ)-padded in a 1 − ε fraction of the partitions, all clusters have diameter bounded by Δ = O((r/δ) log λ), and m = O(ε^{-1} log λ log log λ).

Step 3 (Thresholding Distances): In each partition P_i and each cluster C ∈ P_i, threshold the interpoint distances in C at maximum value r. Specifically, apply a Gaussian transform as per Section 2 to obtain a map g_C : C → ℓ2 such that

‖g_C(x) − g_C(y)‖_2² = Gr(‖x − y‖)² = r²(1 − e^{−‖x−y‖_2²/r²}),  ∀x, y ∈ C.

Step 4 (Dimension Reduction): For each partition P_i and each cluster C ∈ P_i, the point set g_C(C) ⊂ ℓ2 admits a JL dimension reduction with distortion 1 + ε. Specifically, by the JL-Lemma there is a map ΨJL : g_C(C) → ℓ_2^{k′} such that

‖t − t′‖/(1 + ε) ≤ ‖ΨJL(t) − ΨJL(t′)‖ ≤ ‖t − t′‖,  ∀t, t′ ∈ g_C(C),   (3)

and the target dimension is (using Property 2.1)

k′ = O(ε^{-2} log |C|) = O(ε^{-2} log(λ^{O(log(Δ/εδr))})) = O(ε^{-2} log(1/(εδ)) · log λ log log λ).

Composing the last two steps, define f_C = ΨJL ∘ g_C mapping C → ℓ_2^{k′}.
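The map ΨJL of Step 4 can be instantiated, as in standard proofs of the JL-Lemma, by a random Gaussian matrix scaled by 1/√k′. The dimensions and the sample vectors below are illustrative:

```python
import math
import random

def jl_map(dim_in, dim_out, rng):
    """Random Gaussian projection: a (dim_out x dim_in) matrix of N(0,1)
    entries scaled by 1/sqrt(dim_out), so squared distances are preserved
    in expectation and concentrate around their mean."""
    scale = 1.0 / math.sqrt(dim_out)
    rows = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]
    return lambda x: [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in rows]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(7)
psi = jl_map(dim_in=1000, dim_out=200, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
y = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ratio = l2(psi(x), psi(y)) / l2(x, y)
assert 0.7 < ratio < 1.3  # with k' = 200, the ratio concentrates near 1
```

Note that this map is linear; the nonlinearity of the overall construction comes from the thresholding, gluing, and extension steps, not from ΨJL itself.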

Step 5 (Gluing Clusters): For each partition P_i, "glue" the cluster embeddings f_C by smoothing them near the boundary. Specifically, for each cluster C ∈ P_i, assume by translation that f_C attains the all-zeros vector, i.e. there exists z_C ∈ C such that ‖f_C(z_C)‖ = 0. Define h_C : C → R by h_C(x) = min_{y∈N\C} ‖x − y‖, as a proxy for x's distance to the boundary of its cluster. Now define ϕ_i : N → ℓ_2^{k′} by

ϕ_i(x) = f_{P_i(x)}(x) · min{1, (δ/r) h_{P_i(x)}(x)};

recall that P_i(x) is the unique cluster C ∈ P_i containing x.

Step 6 (Gluing Partitions): Combine the maps obtained in the previous step via direct sum and scaling. Specifically, define ϕ : N → ℓ_2^{mk′} by ϕ = m^{-1/2} ⊕_{i=1}^m ϕ_i.

Step 7 (Extension beyond the Net): Use the Kirszbraun theorem to extend the map ϕ to all of S, without increasing the Lipschitz constant.

3.2 Analysis for a single scale

Let us show that the embedding ϕ constructed above indeed satisfies the guarantees required for Theorem 3.1. By the description above, the target dimension is mk′ = O(ε^{-3} log(1/(εδ)) (log λ log log λ)²). We first focus on points in the net N, and later extend the analysis to all points in S. Let us start with a few immediate observations.

Lemma 3.2. For every x, y ∈ N and every i ∈ {1, ..., m},

(i). ‖f_{P_i(x)}(x)‖ ≤ r.

(ii). If P_i(x) = P_i(y) = C then ‖f_C(x) − f_C(y)‖ ≤ Gr(‖x − y‖) ≤ ‖x − y‖.

(iii). If P_i(x) ≠ P_i(y) then h_{P_i(x)}(x) ≤ ‖x − y‖.

Proof of Lemma 3.2. For the first assertion, recall that by the translation, every cluster, and in particular C = P_i(x), contains a point z_C ∈ N such that f_C(z_C) = 0. Thus, using Equation (3) we have ‖f_C(x) − f_C(z_C)‖ ≤ ‖g_C(x) − g_C(z_C)‖ = Gr(‖x − z_C‖) ≤ r. To prove the second assertion, use Equations (2) and (3) to get ‖f_C(x) − f_C(y)‖ ≤ ‖ΨJL‖Lip · ‖g_C(x) − g_C(y)‖ ≤ Gr(‖x − y‖) ≤ ‖x − y‖. For the third assertion, since C = P_i(x) ≠ P_i(y) we have that y ∈ N \ C, and so h_C(x) = min_{z∈N\C} ‖x − z‖ ≤ ‖x − y‖.

Analysis for the net N. We now prove Properties (a)-(c) for (only) net points. To this end, fix x, y ∈ N.

(a) Lipschitz: If ‖x − y‖ > r/δ, we use the boundedness condition and the fact that δ < 1/4:

‖ϕ(x) − ϕ(y)‖ ≤ ‖ϕ(x)‖ + ‖ϕ(y)‖ ≤ 2r < 2δ‖x − y‖ ≤ ‖x − y‖.

Since Gr(t)/t is monotonically decreasing in t, whenever ‖x − y‖ ≤ 2r/δ,

Gr(‖x − y‖)/‖x − y‖ ≥ Gr(2r/δ)/(2r/δ) > δ/3.   (5)

We proceed by considering the exact same three cases as above.

Case 1′: x is padded. By the analogous case above, P_i(x) = P_i(y) = C and ‖ϕ_i(x) − ϕ_i(y)‖ = ‖f_C(x) − f_C(y)‖. By (3) we have 1 − ε ≤ ‖f_C(x) − f_C(y)‖ / ‖g_C(x) − g_C(y)‖ ≤ 1, where, by construction, the denominator equals Gr(‖x − y‖). Altogether, we get

1 − ε ≤ ‖ϕ_i(x) − ϕ_i(y)‖ / Gr(‖x − y‖) ≤ 1.

Case 2′: x is not padded and P_i(x) ≠ P_i(y). Combining the analogous case above and Equation (5), we have ‖ϕ_i(x) − ϕ_i(y)‖ ≤ 2δ‖x − y‖ < 6Gr(‖x − y‖).

Case 3′: x is not padded and x, y belong to the same cluster P_i(x) = P_i(y) = C. Refining the analysis in the analogous case above, we have

‖ϕ_i(x) − ϕ_i(y)‖ = ‖f_C(x)h̃_C(x) − f_C(y)h̃_C(y)‖
  ≤ ‖f_C(x)h̃_C(x) − f_C(x)h̃_C(y)‖ + ‖f_C(x)h̃_C(y) − f_C(y)h̃_C(y)‖
  ≤ ‖f_C(x)‖ · |h̃_C(x) − h̃_C(y)| + ‖f_C(x) − f_C(y)‖ · |h̃_C(y)|
  ≤ r · (δ/r)‖x − y‖ + Gr(‖x − y‖) · 1 ≤ 4Gr(‖x − y‖),

where the last inequality again uses Equation (5). Again combining these three cases by plugging into Equation (4), and recalling that x is padded in at least a 1 − ε fraction of the partitions, we get

(1 − ε)² ≤ ‖ϕ(x) − ϕ(y)‖² / Gr(‖x − y‖)² ≤ (1 − ε) + ε · 36 = 1 + 35ε.

Later, we will make use of the fact that ‖ϕ(x) − ϕ(y)‖ ≤ (1 + 35ε)^{1/2} Gr(‖x − y‖) ≤ (1 + 6ε) Gr(‖x − y‖) ≤ (5/2) Gr(‖x − y‖).

(c). Boundedness: By the fact that 0 ≤ h̃_{P_i(x)}(x) ≤ 1 and Lemma 3.2(i),

‖ϕ(x)‖² ≤ (1/m) Σ_{i=1}^m ‖ϕ_i(x)‖² ≤ (1/m) Σ_{i=1}^m ‖f_{P_i(x)}(x)‖² ≤ r².

This completes the analysis for net points x, y ∈ N.

Analysis for the entire set S. Finally, we extend the analysis to all points in S. Fix x, y ∈ S, and let x′, y′ ∈ N be the net points closest to x and y, respectively. Recalling that N is an (εδr)-net, we have ‖x − x′‖, ‖y − y′‖ ≤ εδr. To prove the Lipschitz requirement, recall that Step 7 extends ϕ from the net N to the entire set S using the Kirszbraun theorem, i.e. without increasing its Lipschitz norm; hence ‖ϕ(x) − ϕ(y)‖ ≤ ‖x − y‖. Using this Lipschitz condition and the triangle inequality, we immediately obtain the boundedness requirement: ‖ϕ(x)‖ ≤ ‖ϕ(x′)‖ + ‖ϕ‖Lip ‖x − x′‖ ≤ (1 + εδ)r.

To prove the requirement of distortion to the Gaussian (which is slightly more involved), assume further that δr ≤ ‖x − y‖ ≤ r/δ. By the triangle inequality,

|‖x − y‖ − ‖x′ − y′‖| ≤ ‖x − x′‖ + ‖y − y′‖ ≤ 2εδr.   (6)

We conclude that (1 − 2ε)δr ≤ ‖x′ − y′‖ ≤ (1/δ + 2εδ)r, and hence x′, y′ ∈ N possess the guarantee for distortion to the Gaussian. It also follows that 2εδr ≤ 4(1 − 2ε)εδr ≤ 4ε‖x′ − y′‖. Using the

Lipschitz condition on ϕ, and the above distortion to the Gaussian for net points, we similarly derive

|‖ϕ(x) − ϕ(y)‖ − ‖ϕ(x′) − ϕ(y′)‖| ≤ ‖x − x′‖ + ‖y − y′‖ ≤ 2εδr ≤ 4ε‖x′ − y′‖ < 10εGr(‖x′ − y′‖).   (7)

We shall need the following bound on the behavior of Gr(t).

Lemma 3.3. Let 0 < η < 1/3 and suppose 0 < t′ ≤ (1 + η)t. Then

Gr (t′ ) Gr (t)

≤ 1 + 3η.

Proof of Lemma 3.3. Observe that $G_r(t)$ is monotonically increasing (in $t$), and thus
$$\frac{G_r(t')}{G_r(t)} \;\le\; \frac{G_r((1 + \eta) t)}{G_r(t)} \;\le\; \frac{G((1 + \eta) t / r)}{G(t / r)}.$$
Letting $s = t/r$, we have
$$\frac{G((1 + \eta) s)^2}{G(s)^2} - 1 \;=\; \frac{G((1 + \eta) s)^2 - G(s)^2}{G(s)^2} \;=\; \frac{e^{-s^2} - e^{-(1 + \eta)^2 s^2}}{1 - e^{-s^2}} \;\le\; \frac{e^{-s^2} \, (1 - e^{-3\eta s^2})}{1 - e^{-s^2}}. \qquad (8)$$

Recall that for all $0 \le z \le 1$ we have $1 - z \le e^{-z} \le 1 - z + z^2/2 \le 1 - z/2$. Using this estimate, we now have three cases:

• When $s^2 \le 1$, the righthand side of (8) is at most $\frac{1 \cdot 3\eta s^2}{s^2/2} \le 6\eta$.

• When $1 \le s^2 \le 1/(3\eta)$, the righthand side of (8) is at most $\frac{e^{-s^2} \cdot 3\eta s^2}{1 - 1/e} \le 6\eta s^2 e^{-s^2} \le 6\eta/e$, where the last inequality follows from the observation that $z \mapsto z e^{-z}$ is monotonically decreasing for all $z \ge 1$.

• When $s^2 \ge 1/(3\eta)$, the righthand side of (8) is at most $\frac{e^{-s^2} \cdot 1}{1 - 1/e} \le \frac{e^{-s^2} \cdot 3\eta s^2}{1 - 1/e} \le 6\eta/e$, where the last inequality follows similarly to the previous case.

Altogether, we conclude that $\frac{G_r(t')}{G_r(t)} \le \frac{G((1 + \eta) s)}{G(s)} \le \sqrt{1 + 6\eta} \le 1 + 3\eta$.
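Since $G(s)^2 = 1 - e^{-s^2}$ and $G_r(t) = r\,G(t/r)$ are fully explicit, Lemma 3.3 is easy to sanity-check numerically; a small sketch (the grid and parameter values are arbitrary, chosen only for illustration):

```python
import math

def G(s):
    # G(s)^2 = 1 - exp(-s^2); the Gaussian is G_r(t) = r * G(t/r)
    return math.sqrt(1.0 - math.exp(-s * s))

def Gr(r, t):
    return r * G(t / r)

def worst_ratio(eta, r=1.0):
    # Largest G_r(t') / G_r(t) over a grid of t, at the extreme t' = (1+eta)*t;
    # Lemma 3.3 asserts this never exceeds 1 + 3*eta.
    grid = [10.0 ** e for e in range(-3, 4)]
    return max(Gr(r, (1 + eta) * t) / Gr(r, t) for t in grid)
```

The worst case occurs as $t \to 0$, where the ratio approaches $1 + \eta$, comfortably below the lemma's bound $1 + 3\eta$.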

We are now ready to complete the proof of distortion to the Gaussian (for the entire set $S$). Similarly to the derivation of Equation (6), we derive $\|x' - y'\| \le (1 + 2\varepsilon) \|x - y\|$, and by Lemma 3.3 we get $G_r(\|x' - y'\|) \le (1 + 6\varepsilon)\, G_r(\|x - y\|)$. Together with Equation (7) and the earlier bound for net points, we obtain
$$\|\varphi(x) - \varphi(y)\| \;\le\; \|\varphi(x') - \varphi(y')\| + 10\varepsilon\, G_r(\|x' - y'\|) \;\le\; (1 + 6\varepsilon + 10\varepsilon)\, G_r(\|x' - y'\|) \;\le\; (1 + 16\varepsilon)(1 + 6\varepsilon)\, G_r(\|x - y\|).$$
The other direction is analogous. By (6) we have $\|x - y\| \le (1 + 4\varepsilon) \|x' - y'\|$, and by Lemma 3.3 we get $G_r(\|x - y\|) \le (1 + 12\varepsilon)\, G_r(\|x' - y'\|)$. Together with (7) and the earlier bound for net points, we obtain
$$\|\varphi(x) - \varphi(y)\| \;\ge\; \|\varphi(x') - \varphi(y')\| - 10\varepsilon\, G_r(\|x' - y'\|) \;\ge\; (1 - \varepsilon - 10\varepsilon)\, G_r(\|x' - y'\|) \;\ge\; (1 - 11\varepsilon)(1 - 12\varepsilon)\, G_r(\|x - y\|).$$
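As an aside, distances of the shape $G_r(t) = r\sqrt{1 - e^{-t^2/r^2}}$ arise naturally as the distance profile of Gaussian-kernel feature maps, and random Fourier features give a finite-dimensional surrogate whose distances concentrate around this profile. A minimal numeric sketch (this is the standard Rahimi–Recht construction, not the paper's embedding; $\sigma$, $D$, and the sample points are illustrative):

```python
import math
import random

random.seed(0)
d, D, sigma = 3, 20000, 1.0

# Random Fourier features for the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2)):
# frequencies w ~ N(0, sigma^{-2} I), phases b ~ Uniform[0, 2*pi).
W = [[random.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

def z(x):
    # z(x) = sqrt(2/D) * (cos(w_j . x + b_j))_j
    return [math.sqrt(2.0 / D) * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
            for w, bi in zip(W, b)]

x = [0.3, -0.1, 0.2]
y = [0.9, 0.4, -0.5]
t2 = sum((a - c) ** 2 for a, c in zip(x, y))
emp = sum((a - c) ** 2 for a, c in zip(z(x), z(y)))
# E ||z(x)-z(y)||^2 = 2 (1 - exp(-||x-y||^2 / (2 sigma^2))),
# i.e. proportional to G_r(||x-y||)^2 with r = sqrt(2) * sigma.
exact = 2.0 * (1.0 - math.exp(-t2 / (2.0 * sigma ** 2)))
```

So the squared feature-space distance tracks $2(1 - e^{-t^2/(2\sigma^2)})$, the same functional form as $G_r(t)^2$.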

3.3 Snowflake Embedding

We now use Theorem 3.1 (the single-scale embedding) to prove Theorem 1.2. For simplicity, we will first prove the theorem for $\alpha = 1/2$, and then extend the proof to arbitrary $\alpha$. Fix a set $S \subset \ell_2$, and assume without loss of generality that the minimum interpoint distance in $S$ is 1. Define $p = 6 \lceil \log_{1+\varepsilon}(\frac{1}{\varepsilon}) \rceil = O(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon})$, and the set $I = \{ i \in \mathbb{Z} :\ \varepsilon^5 \le (1 + \varepsilon)^i \le \varepsilon^{-5}\, \mathrm{diam}(S) \}$. For each $i \in I$, let $\varphi_i : S \to \ell_2^k$ be the embedding guaranteed by Theorem 3.1 for $S$ and $\varepsilon$ with respect to the parameters $r = (1 + \varepsilon)^i$ and $\delta = (1 + \varepsilon)^{-p/2} = \Theta(\varepsilon^3)$. Notice that each $\varphi_i$ has target dimension $k = \tilde{O}(\varepsilon^{-3} \log^2 \lambda)$.

We shall now use the following technique due to Assouad [Ass83]. First, each $\varphi_i$ is scaled by $1/\sqrt{r} = (1 + \varepsilon)^{-i/2}$. The maps are then grouped in a round-robin fashion into $p$ groups, and the embeddings in each group are summed up. This yields $p$ embeddings, each into $\ell_2^k$, which are combined using a direct sum, resulting in one map into $\ell_2^{pk}$. Formally, let $i \equiv_p j$ denote that two integers $i, j$ are equal modulo $p$. Define $\Phi : S \to \ell_2^{pk}$ using the direct sum $\Phi = \bigoplus_{j \in [p]} \Phi_j$, where each $\Phi_j : S \to \ell_2^k$ is given by
$$\Phi_j \;=\; \sum_{i \in I:\ i \equiv_p j} \frac{\varphi_i}{(1 + \varepsilon)^{i/2}}.$$
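The round-robin combination itself is mechanical; the following structural sketch shows the scaling and grouping in Python (the single-scale maps here are toy stand-ins that only mimic the scaling and boundedness of the actual $\varphi_i$ from Theorem 3.1; $p$, $k$, and the scale range are illustrative):

```python
import math

eps = 0.5
p = 4              # number of round-robin groups (paper: p = 6*ceil(log_{1+eps}(1/eps)))
k = 3              # target dimension of each single-scale map (hypothetical)
I = range(-4, 8)   # scale indices (paper: eps^5 <= (1+eps)^i <= diam(S)/eps^5)

def phi(i, x):
    """Toy stand-in for the single-scale embedding at scale r = (1+eps)^i:
    Lipschitz in x and bounded by r, like the real phi_i, but not the paper's map."""
    r = (1.0 + eps) ** i
    return [r * math.tanh(x * (j + 1) / r) for j in range(k)]

def Phi(x):
    """Assouad-style combination: scale phi_i by (1+eps)^(-i/2), sum the scales
    that agree modulo p, and concatenate (direct-sum) the p group sums."""
    out = []
    for j in range(p):
        group = [0.0] * k
        for i in I:
            if i % p == j:
                s = (1.0 + eps) ** (-i / 2)
                group = [g + s * v for g, v in zip(group, phi(i, x))]
        out.extend(group)
    return out
```

The output dimension is $pk$, matching the direct sum of $p$ maps into $\ell_2^k$.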

For $M = M(\varepsilon) > 0$ that will be defined later, our final embedding is $\Phi/\sqrt{M} : S \to \ell_2^{pk}$, which has target dimension $pk \le \tilde{O}(\varepsilon^{-4} \log^2 \lambda)$, as required (for $\alpha = 1/2$). It thus remains to prove the distortion bound. The key idea is that in each $\Phi_j$, most of the contribution comes from a single $\varphi_i$, as formulated in the following lemma.

Lemma 3.4. Let $\Phi : S \to \ell_2^{pk}$ be as above, let $x, y \in S$, and let $A \subset I$ be an interval of size $p$, i.e.\ there is $a$ such that $A = \{a, a+1, \ldots, a+p-1\}$. Then
$$\|\Phi(x) - \Phi(y)\|^2 \;\le\; \sum_{i \in A} \Bigg( \frac{\|\varphi_i(x) - \varphi_i(y)\|}{(1 + \varepsilon)^{i/2}} \;+\; \sum_{i' \in I \setminus A:\ i' \equiv_p i} \frac{\|\varphi_{i'}(x) - \varphi_{i'}(y)\|}{(1 + \varepsilon)^{i'/2}} \Bigg)^2,$$
$$\|\Phi(x) - \Phi(y)\|^2 \;\ge\; \sum_{i \in A} \max\Bigg\{ 0,\ \frac{\|\varphi_i(x) - \varphi_i(y)\|}{(1 + \varepsilon)^{i/2}} \;-\; \sum_{i' \in I \setminus A:\ i' \equiv_p i} \frac{\|\varphi_{i'}(x) - \varphi_{i'}(y)\|}{(1 + \varepsilon)^{i'/2}} \Bigg\}^2.$$

Proof. By construction,
$$\|\Phi(x) - \Phi(y)\|^2 \;=\; \sum_{j \in [p]} \big\| \Phi_j(x) - \Phi_j(y) \big\|^2 \;=\; \sum_{i \in A} \Bigg\| \sum_{i' \in I:\ i' \equiv_p i} \frac{\varphi_{i'}(x) - \varphi_{i'}(y)}{(1 + \varepsilon)^{i'/2}} \Bigg\|^2.$$
Fix $i \in A$ and let us bound the term corresponding to $i$. The first required inequality now follows by separating (among all $i' \in I$ with $i' \equiv_p i$) the unique $i' \in A$ (namely $i' = i$) from the rest, and applying the following triangle inequality for vectors $v_1, \ldots, v_s \in \ell_2^k$, namely $\|\sum_l v_l\| \le \sum_l \|v_l\|$. The second inequality follows similarly by separating the term for $i' = i$ from the rest, and applying the following triangle inequality for vectors $u, v_1, \ldots, v_s \in \ell_2^k$, namely $\|u + \sum_l v_l\| \ge \max\{0,\ \|u\| - \sum_l \|v_l\|\}$.
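Lemma 3.4 is purely a statement about norms of grouped sums, so it can be sanity-checked with arbitrary difference vectors in place of $\varphi_{i'}(x) - \varphi_{i'}(y)$; a small numeric sketch (synthetic random data, not the actual embeddings):

```python
import math
import random

random.seed(1)
eps = 0.5
p, k = 3, 2
I = list(range(-3, 9))        # scale indices
A = list(range(0, 0 + p))     # an interval of size p inside I

# d[i] stands in for the per-scale difference vector phi_i(x) - phi_i(y)
d = {i: [random.gauss(0.0, 1.0) for _ in range(k)] for i in I}

def norm(v):
    return math.sqrt(sum(c * c for c in v))

# ||Phi(x)-Phi(y)||^2 = sum over residue classes of || sum_i d_i/(1+eps)^(i/2) ||^2
lhs = 0.0
for j in range(p):
    g = [0.0] * k
    for i in I:
        if i % p == j:
            s = (1.0 + eps) ** (-i / 2)
            g = [a + s * b for a, b in zip(g, d[i])]
    lhs += norm(g) ** 2

# The two bounds of Lemma 3.4: per i in A, a main term plus/minus the "noise"
# from the other scales in the same residue class.
upper = lower = 0.0
for i in A:
    main = norm(d[i]) / (1.0 + eps) ** (i / 2)
    noise = sum(norm(d[i2]) / (1.0 + eps) ** (i2 / 2)
                for i2 in I if i2 not in A and i2 % p == i % p)
    upper += (main + noise) ** 2
    lower += max(0.0, main - noise) ** 2
```

The sandwich `lower <= lhs <= upper` holds for any choice of vectors, which is exactly what the two triangle inequalities in the proof guarantee.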

We proceed with the proof of Theorem 1.2. Fix $x, y \in S$, and let $i^* \in I$ be such that $(1 + \varepsilon)^{i^*} \le \|x - y\| \le (1 + \varepsilon)^{i^* + 1}$. We wish to apply Lemma 3.4. To this end, let $A = \{ i^* - p/2 + 1, \ldots, i^* + p/2 \}$ and consider $i \in A$. Observe that
$$\delta \;\le\; (1 + \varepsilon)^{-p/2} \;\le\; (1 + \varepsilon)^{i^* - i} \;\le\; \frac{\|x - y\|}{(1 + \varepsilon)^i} \;\le\; (1 + \varepsilon)^{i^* + 1 - i} \;\le\; (1 + \varepsilon)^{p/2} \;\le\; \frac{1}{\delta},$$
hence we can apply Theorem 3.1(b) to obtain
$$\frac{1}{1 + \varepsilon} \;\le\; \frac{\|\varphi_i(x) - \varphi_i(y)\|}{G_{(1+\varepsilon)^i}(\|x - y\|)} \;\le\; 1. \qquad (9)$$

Combining this with the monotonicity of $G_r$ and Lemma 3.3, and noting that $G(1) > \frac{1+\varepsilon}{2}$ when $\varepsilon < \frac{1}{4}$, we further obtain
$$\frac{\|\varphi_i(x) - \varphi_i(y)\|}{(1 + \varepsilon)^{i/2}} \;\ge\; \frac{(1 + \varepsilon)^{i-1}\, G(1)}{(1 + \varepsilon)^{i/2}} \;\ge\; \tfrac{1}{2} (1 + \varepsilon)^{i/2}. \qquad (10)$$

By Theorem 3.1(a) and (c), for all $i' \in I$,
$$\|\varphi_{i'}(x) - \varphi_{i'}(y)\| \;\le\; \min\{ \|x - y\|,\ (1 + \varepsilon)^{i'} \},$$
and thus
$$\sum_{i' \in I \setminus A:\ i' \equiv_p i} \frac{\|\varphi_{i'}(x) - \varphi_{i'}(y)\|}{(1 + \varepsilon)^{i'/2}} \;\le\; \sum_{i' < i:\ i' \equiv_p i} (1 + \varepsilon)^{i'/2} \;+\; \sum_{i' > i:\ i' \equiv_p i} \frac{\|x - y\|}{(1 + \varepsilon)^{i'/2}}.$$
Recalling that a geometric series with ratio less than $\frac{1}{2}$ sums to less than twice its largest term,
$$\sum_{i' \in I \setminus A:\ i' \equiv_p i} \frac{\|\varphi_{i'}(x) - \varphi_{i'}(y)\|}{(1 + \varepsilon)^{i'/2}} \;\le\; 2 (1 + \varepsilon)^{(i-p)/2} + 2 \|x - y\| (1 + \varepsilon)^{-(i+p)/2} \;\le\; 2 (1 + \varepsilon)^{i/2} \Big[ (1 + \varepsilon)^{-p/2} + \tfrac{1}{\varepsilon} (1 + \varepsilon)^{1 - p/2} \Big] \;\le\; \varepsilon (1 + \varepsilon)^{i/2}.$$
Observe that the last bound is at most $2\varepsilon$ times (10). Using this information and plugging (9) into Lemma 3.4, we obtain
$$\|\Phi(x) - \Phi(y)\|^2 \;\ge\; \sum_{i \in A} \Big( \frac{1 - 2\varepsilon}{1 + \varepsilon} \Big)^2 \cdot \frac{G_{(1+\varepsilon)^i}(\|x - y\|)^2}{(1 + \varepsilon)^i}$$