Distributed Kernel Principal Component Analysis

Maria-Florina Balcan∗, Yingyu Liang†, Le Song‡, David Woodruff§, Bo Xie¶


Abstract

Kernel Principal Component Analysis (KPCA) is a key technique in machine learning for extracting the nonlinear structure of data and pre-processing it for downstream learning algorithms. We study the distributed setting in which there are multiple workers, each holding a set of points, who wish to compute the principal components of the union of their point sets. Our main result is a communication-efficient algorithm that takes as input arbitrary data points and computes a set of global principal components that give a relative-error approximation for polynomial kernels, or a relative-error approximation with an arbitrarily small additive error for a wide family of kernels including Gaussian kernels. While recent work shows how to do PCA in a distributed setting, the kernel setting is significantly more challenging. Although the "kernel trick" is useful for efficient computation, it is unclear how to use it to reduce communication. The main problem with previous work is that it achieves communication proportional to the dimension of the data points, which in the kernel setting would be the dimension of the feature space, or to the number of examples, both of which could be very large. We instead first select a small subset of points whose span contains a good approximation (the column subset selection problem, CSS), and then use sketching for low-rank approximation to achieve our result. The column subset selection is done using a careful combination of oblivious subspace embeddings for kernels, oblivious leverage score approximation, and adaptive sampling. These lead to a nearly optimal communication bound for CSS, and also to nearly optimal communication for KPCA in the constant-approximation regime. Experiments on large-scale real-world datasets show the efficacy of our algorithm.

1 Introduction

Principal Component Analysis (PCA) is a widely used tool for dimensionality reduction and data pre-processing. It consists of finding the most relevant lower-dimensional subspace of the data, in the sense that the projection should capture as much of the variance of the original data as possible. Kernel Principal Component Analysis (KPCA) is the extension of this method to data mapped to a kernel feature space [26]. It allows one to exploit the nonlinear structure of the data, since vectors in the kernel feature space correspond to nonlinear functions in the original space. This is crucial for complex data such as text and image data, and thus KPCA is used in many learning problems. However, such data sets are often large scale, and KPCA is known to be computationally expensive and not easily scalable. Since large-scale data is often partitioned across multiple workers, there is increasing interest in solving learning problems in the distributed model. Kernel methods in this setting face an additional subtle difficulty. The algorithm should use the kernel trick to avoid going to the kernel feature space explicitly, so intermediate results are often represented as a function (e.g., a weighted combination) of the feature mapping of some data points. Communicating such intermediate results requires communicating all the data points they depend on. To lower the communication, the intermediate results should depend on only a small number of points. A distributed algorithm then needs to be carefully designed to meet this constraint. In this paper, we propose a distributed algorithm to compute KPCA for polynomial kernels, and for a large family of kernels with random feature expansions, including Gaussian kernels.

∗School of Computer Science, Carnegie Mellon University. Email: [email protected]
†Department of Computer Science, Princeton University. Email: [email protected]
‡College of Computing, Georgia Institute of Technology. Email: [email protected]
§Almaden Research Center, IBM Research. Email: [email protected]
¶College of Computing, Georgia Institute of Technology. Email: [email protected]


For n data points in Rᵈ arbitrarily partitioned over s workers, the algorithm computes a rank-k subspace in the kernel feature space represented by O(k/ε) original points. For polynomial kernels, it achieves a (1 + ε) relative-error approximation to the best rank-k subspace, and for kernels with random feature expansions, it achieves a (1 + ε)-approximation with an additive error term that can be made arbitrarily small. In both cases, the total communication is Õ(sρk/ε + sk²/ε³) words, where ρ is the average number of nonzero entries in one data point and is always bounded by d. For constant ε this nearly matches the lower bound Ω(sdk) for linear PCA [9]. As far as we know, this is the first algorithm that can achieve provable approximation with such communication bounds. Experimental results show that our algorithm can be run on large datasets for which non-distributed algorithms would be prohibitive, and that it needs less communication and fewer representation points than its competitors to achieve the same error. Our algorithm also leads to other distributed kernel algorithms: the data can be projected onto the subspace found and processed by downstream applications. For example, an immediate application is distributed spectral clustering, which first computes KPCA to rank k/ε and then runs k-means on the data projected onto the subspace found by KPCA (e.g., [17]). This can be done by combining our algorithm with any efficient distributed k-means algorithm (e.g., [7]). In the process, we develop an algorithm for the distributed Column Subset Selection (CSS) problem, which selects a set of O(k/ε) points whose span contains a (1 + ε)-approximation, with communication O(sρk/ε + sk²). This is the first algorithm that addresses the problem for kernels, and it nearly matches the communication lower bound Ω(sρk/ε) for this problem in the linear case [11]. The column subset selection problem has various applications in big-data scenarios, so this result may be of independent interest.

Main Results  Suppose there are s workers that can communicate with a central processor, where the communication cost is measured in words¹. Worker i ∈ [s] has a local data set Aⁱ ∈ R^{d×nᵢ}, and the global data set A ∈ R^{d×n} is the union of the local data (n = Σᵢ₌₁ˢ nᵢ). Consider a kernel k(x, x′) with feature mapping φ(x) ∈ H such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H. Let φ(A) ∈ Hⁿ denote the matrix obtained by applying φ to each column of A and concatenating the results. The goal of distributed KPCA is to find, and send to each worker, a subspace of dimension k that approximates φ(A). Thus it is also called (kernel) low-rank approximation. Formally,

Definition 1. A subspace L ∈ Hᵏ is a rank-k (1 + ε, Δ)-approximation subspace for φ(A) if
‖φ(A) − LL⊤φ(A)‖²_H ≤ (1 + ε)‖φ(A) − [φ(A)]_k‖²_H + Δ, and L⊤L = I_k,
where [φ(A)]_k is the best rank-k approximation to φ(A).

Our main result is a randomized algorithm for distributed KPCA. It takes as input the local datasets, a rank k, and error parameters ε, Δ, and outputs a subspace that with constant probability is a (1 + ε, Δ)-approximation to the optimum. Formally,

Theorem 2. There exists a randomized algorithm (Algorithm 2) that produces a subspace L that with probability at least 0.9 satisfies the following.
1. L is a rank-k (1 + ε, 0)-approximation subspace when applied to polynomial kernels.
2. L is a rank-k (1 + ε, Δ)-approximation subspace when applied to continuous shift-invariant kernels k(x, y) = k(x − y) with proper regularization, i.e., defined over a bounded compact domain in Rᵈ, with k(0) ≤ 1 and bounded ∇²k(0).
In both cases, the total communication cost required is Õ(sρk/ε + sk²/ε³), where ρ is the average number of nonzero entries in one data point.

The theorem also holds for other properly regularized kernels with random feature expansions (see [23, 15] for more such kernels); the extension of our proof is straightforward. The constant success probability can be boosted to any high probability 1 − δ by repetition, which adds only an extra O(log(1/δ)) term to the communication and computation. The output subspace L is represented by Õ(k/ε) sampled points Y from A (i.e., L = φ(Y)C for some coefficient matrix C ∈ R^{|Y|×k}), so L can be easily communicated, and the projection of any point onto L can be easily computed by the kernel trick.

¹We assume that a word has Ω(log(sdn/ε)) bits to guarantee the accuracy of real-value computation.


The communication has linear dependence on the dimension and the number of workers, and no dependence on the number of data points (the Õ only hides a factor of log k), which is crucial for big-data scenarios. Also, it does not depend on Δ (though the computation does), so the additive error can be made arbitrarily small with more computation.

Our distributed KPCA algorithm begins with selecting a small subset whose span contains a good approximation, which is typically referred to as the Column Subset Selection (CSS) problem. This intermediate result of independent interest is summarized as follows.

Theorem 3. There exists a randomized algorithm (Algorithm 1) that with probability at least 0.9 selects a subset Y of A such that the span of φ(Y) contains a rank-k (1 + ε, 0)-approximation subspace when applied to polynomial kernels, and contains a (1 + ε, Δ)-approximation subspace when applied to continuous shift-invariant kernels with proper regularization. In both cases, Y contains O(k/ε) points, and the communication is O(sρk/ε + sk²).

Main Techniques  Our algorithm first solves the column subset selection problem, and then computes a good solution from the selected subset. In the first step, it computes a set of weights measuring the importance of the data points, and then samples accordingly. The weights computed are statistical leverage scores, which play a key role in many modern randomized algorithms. Computing them needs to be done with care, as naïve approaches lead to high communication and computational cost. We observe that for rank-k approximation, it suffices to compute the generalized leverage scores with respect to rank k. Our key technique for this is subspace embeddings, which project the feature mapping of the data to a low-dimensional space while preserving the scores, and thus allow computation with communication depending on k rather than on the feature space dimension or the number of points. For this purpose, we propose communication- and computation-efficient subspace embeddings for kernels. After computing the leverage scores, simply sampling according to them does not lead to the desired solution directly: it would give a (1 + ε)-approximation but with a rank-O(k/ε) subspace (suppose Δ = 0 for now). Fortunately, there exists an adaptive sampling approach that can start with a constant approximation of rank O(k) and produce O(k/ε) samples whose span contains a rank-k (1 + ε)-approximation subspace. So we first use the leverage scores to achieve a constant approximation, and then use adaptive sampling to pick O(k/ε) points Y with the desired bounds. The last step is then to find a good approximation within the span of φ(Y). Intuitively, one can just find the best rank-k approximation for the projection of the data onto the span. This naïve approach has high communication depending on the number of points, but it can be improved since it suffices to work on a sketch of the projected data, which essentially reduces the number of points needed. These steps are carefully designed to obtain the desired error and communication bounds. We note that naïve methods or standard combinations of these techniques fail to do so; see the remark at the end of Section 5.

Outline  Section 2 reviews related work, and Section 3 presents our key building block, subspace embeddings. Sections 4 and 5 describe our algorithms for CSS and KPCA, and Section 6 provides empirical results. Due to space limitations, some proofs and implementation details are presented in the appendix.

2 Related Work

There has been a surge of work on distributed machine learning, e.g., [6, 28, 20, 7]. The most related to ours are [20, 7], which give efficient algorithms for linear PCA. However, these algorithms cannot be directly adapted to the kernel case. The subspace embedding approach in [20] would need to be performed on the explicit feature mapping, which is costly; the approach in [7] requires sending the local PCA solutions, which would need to be represented by the explicit feature mapping or by the set of all local points, neither of which is practical. Prior work [9, 11] gives the first distributed protocol and lower bound for column subset selection. Our main advance over theirs is that our protocol is the first to work for low-rank approximation in the kernel space rather than the original space, which is achieved through several key differences. For instance, in all intermediate steps of our protocol we can only afford to communicate points in the original space, since both the kernel feature space dimension and the number of points are too large, whereas all prior protocols depend linearly on at least one of these quantities. Another difference is that in the kernel case we can only obtain leverage scores of the data sketched on one side, which we show to be sufficient for our purpose. Note that kernel CSS is not directly comparable to the Nyström method, as their guarantees are different.


The key technique, subspace embeddings, has been extensively studied in recent years [25, 1, 3, 2, 13]. The recent fast sparse subspace embeddings [13] and their optimizations [21, 22] are particularly suitable for large-scale sparse datasets, since their running time is linear in the number of non-zero entries in the data matrix. [4] show that a fast computational approach, TensorSketch, is indeed a subspace embedding for the polynomial kernel, which is a key tool used by our algorithm. Another element is leverage score sampling, which has recently been used successfully in randomized matrix algorithms; see [27].

3 Subspace Embedding for Kernels

Notations  In the distributed setting, there are s workers that are connected to a central processor. Worker i ∈ [s] has a local data set Aⁱ ∈ R^{d×nᵢ}, and the global data set A ∈ R^{d×n} is the concatenation of the local data (n = Σᵢ₌₁ˢ nᵢ). For any matrix M ∈ R^{d×n} and any i ∈ [d], j ∈ [n], let M_{i:} denote the i-th row of M and M_{:j} the j-th column of M. Let ‖M‖_F denote its Frobenius norm and ‖M‖₂ its spectral norm. Let I_d denote the d × d identity matrix. Let the rank of M ∈ R^{d×n} be r ≤ min{n, d}, and denote the SVD of M as M = UΣV⊤, where U ∈ R^{d×r}, Σ ∈ R^{r×r}, and V ∈ R^{n×r}. For a kernel k(x, x′), let φ(·) ∈ H denote the corresponding feature mapping. Regard M ∈ Hᵗ as a matrix whose columns are elements in H and define matrix operations accordingly. More precisely, for any M ∈ H^{t₁} and N ∈ H^{t₂}, let B = M⊤N ∈ R^{t₁×t₂} where B_{ij} = ⟨M_{:i}, N_{:j}⟩_H, and let ‖M‖²_H = tr(M⊤M). For M ∈ Hⁿ and C ∈ R^{n×t}, let L = MC ∈ Hᵗ where L_{:j} = Σᵢ₌₁ⁿ C_{ij} M_{:i}. Finally, for a matrix A ∈ R^{d×n}, let φ(A) denote the set of elements in H corresponding to the columns A_{:j}, that is, φ(A) = (φ(A_{:1}), φ(A_{:2}), …, φ(A_{:n})) ∈ Hⁿ.

Subspace Embeddings  Subspace embeddings are a useful technique that embeds data points into a lower dimension while preserving the properties of interest. Formally,

Definition 4. An ε-subspace embedding of M ∈ R^{m×n} is a matrix S ∈ R^{t×m} such that ‖SMx‖₂ = (1 ± ε)‖Mx‖₂ for all x ∈ Rⁿ.

Mx is in the column space of M and SMx is its embedding, so the definition means that the norm of any vector in the column space of M is approximately preserved.

Subspace embeddings can also be applied on the right-hand side, i.e., S ∈ R^{n×t} with ‖x⊤MS‖₂ = (1 ± ε)‖x⊤M‖₂ for any x ∈ Rᵐ. Our algorithm repeatedly makes use of subspace embeddings. In particular, the embedding we use is the concatenation of the following known sketching matrices: CountSketch and i.i.d. Gaussians (or the concatenation of CountSketch, fast Hadamard, and i.i.d. Gaussians). The details can be found in [27]; we only need the following fact.

Fact 1. For M ∈ R^{m×n}, there exist sketching matrices S ∈ R^{t×m} with t = O(n/ε²) that are ε-subspace embeddings. Furthermore, SM can be successfully computed in time Õ(nnz(M)) with probability at least 1 − δ, where nnz(M) is the number of non-zero entries in M.
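To make Fact 1 concrete, below is a minimal Python sketch of the kind of composed embedding it refers to: a CountSketch applied to the rows of M in nnz(M) time, followed by an i.i.d. Gaussian map down to the final dimension. The intermediate dimension and the constants are illustrative choices, not the ones dictated by the theory.

import numpy as np

def countsketch_then_gaussian(M, t, t_cs=None, seed=0):
    """Apply S = G * C to M (m x n): C is a CountSketch on the m rows,
    G is an i.i.d. Gaussian map reducing to the final sketch dimension t."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    if t_cs is None:
        t_cs = max(4 * n * n, t)                # illustrative intermediate dimension
    buckets = rng.integers(0, t_cs, size=m)     # hash each row of M to one bucket
    signs = rng.choice([-1.0, 1.0], size=m)
    CM = np.zeros((t_cs, n))
    np.add.at(CM, buckets, signs[:, None] * M)  # scatter-add: an nnz(M)-time sketch
    G = rng.standard_normal((t, t_cs)) / np.sqrt(t)
    return G @ CM                               # t x n matrix S M

# quick sanity check: column-space norms are approximately preserved
M = np.random.randn(5000, 20)
SM = countsketch_then_gaussian(M, t=400)
x = np.random.randn(20)
print(np.linalg.norm(SM @ x) / np.linalg.norm(M @ x))   # close to 1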

Kernel Subspace Embeddings  Subspace embeddings can also be generalized to the feature mapping of kernels, simply by setting M = φ(A), letting S be a linear mapping from H to Rᵗ, and using the corresponding inner product. If the kernel subspace embedding suffices for solving the problem under consideration, then only S(φ(A)), which has much lower dimension, is needed. This is especially interesting for distributed kernel methods, since naively using the feature mapping or the kernel trick in this setting leads to high communication cost. For our purpose, we need the embedding S to have the following properties.

Definition 5. S is an (ε, Δ)-good subspace embedding for a kernel k(·, ·) if
P1 (Subspace Embedding): for any V ∈ Hᵏ with orthonormal elements (i.e., V⊤V = I_k) and all x ∈ Rᵏ, ‖S(V)x‖₂ = (1 ± c₀)‖Vx‖_H, where c₀ is a sufficiently small constant;
P2 (Approximate Matrix Product): ‖S(M)⊤S(N) − M⊤N‖²_F ≤ (ε/k)‖M‖²_H ‖N‖²_H + Δ for any M ∈ H^{t₁}, N ∈ H^{t₂}.

The implication of a good subspace embedding for kernels is given in Lemma 6 below.

Algorithm 1 Distributed Kernel Column Subset Selection: Y = disKCSS({Aⁱ}ˢᵢ₌₁, k, ε, Δ)
1: Each worker i: compute a (c₀, Δ)-good subspace embedding Eⁱ = S(φ(Aⁱ)) ∈ R^{t×nᵢ}, t = O(k/c₀²);
   apply another subspace embedding EⁱTⁱ ∈ R^{t×p} with p = O(t/c₀²), and send EⁱTⁱ to the Center;
   Center: QR-factorize [E¹T¹, …, EˢTˢ]⊤ = UZ; send Z to all workers;
   Each worker i: compute ℓ̃ᵢⱼ = ‖((Z⊤)⁻¹Eⁱ)_{:j}‖², and send Σⱼ ℓ̃ᵢⱼ to the other workers via the Center.
2: Workers: sample Aⁱ_{:j} with probability min{1, O(k log k) ℓ̃ᵢⱼ / Σᵢⱼ ℓ̃ᵢⱼ}; send the sampled points to the Center;
   Center: send the union P of the sampled points to all workers;
   Workers: sample O(k/ε) points Ỹ according to the squared distances to the span of φ(P) in feature space; send them to the Center;
   Center: send Y = Ỹ ∪ P to all workers.

Lemma 6. Suppose S is an (ε, Δ)-good subspace embedding for the kernel k(·, ·), and suppose S(φ(A)) has SVD UΣV⊤. Then there exists X such that ‖X S(φ(A)) − φ(A)‖²_H ≤ (1 + ε)‖φ(A) − [φ(A)]_k‖²_H + Δ.

Lemma 6 means that the row space of S(φ(A)) contains a good approximation to φ(A). As discussed below, S(φ(A)) can be used to compute the generalized leverage scores for φ(A).

Polynomial Kernels  For polynomial kernels, there exists an efficient algorithm, TensorSketch, to compute the embedding [4]. However, its embedding dimension has a quadratic dependence on the rank k, which would increase the communication. Fortunately, subspace embeddings can be concatenated, so we can further apply another known subspace embedding, such as i.i.d. Gaussians or fast Hadamard, which, though not fast for the feature mapping itself, is fast for the already embedded data and has lower dimension. In this way, we enjoy the benefits of both approaches.

Lemma 7. For polynomial kernels k(x, x′) = (⟨x, x′⟩)^q, there exists an (ε, 0)-good subspace embedding matrix S ∈ R^{t×d^q} with t = O(k/ε).

Kernels with Random Feature Expansions  Polynomial kernels have finite-dimensional feature mappings, for which sketching seems natural. It turns out that it is possible to extend subspace embeddings to kernels with infinite-dimensional feature mappings. More precisely, we propose subspace embeddings for kernels with random feature expansions, i.e., k(x, y) = E_ω[ξ_ω(x)ξ_ω(y)] for some function ξ(·). A large family of kernels have such random feature expansions, including Gaussian kernels and other shift-invariant kernels, inner product kernels, etc. ([23, 15]). We provide guarantees for shift-invariant kernels using Fourier random features; the extension of the analysis to other kernels is straightforward. A continuous shift-invariant kernel takes the form k(x, y) = k(x − y). Bochner's theorem [24] states that k(x, y) = E_{ω∼p(ω)}[ξ_ω(x)ξ*_ω(y)] = E_{ω,b}[z_ω(x)z_ω(y)], where ω, b are drawn from known distributions and z_ω(x) = √2 cos(ω⊤x + b). Therefore, one can approximate the kernel by using m features z_ω(x) on randomly sampled ω, b [23]. Such a random feature expansion can be exploited for subspace embeddings: view the expansion as the "new" data points and apply a sketching matrix on top of it. Compared to polynomial kernels, the finite random feature expansion leads to an additional additive error term. To bound this error, we need some regularization conditions on the kernel: it is defined over a bounded compact domain in Rᵈ, with k(0) ≤ 1 and bounded ∇²k(0). Our analysis shows that bounding the additive error term only requires a sufficiently large number of random features m, which affects the computation but does not affect the final embedding dimension and thus the communication cost.

Lemma 8. For a continuous shift-invariant kernel k(x, y) = k(x − y) with proper regularization, there exists an (ε, Δ)-good subspace embedding S : H → Rᵗ with t = O(k/ε). Furthermore, S(φ(x)) = T R(φ(x)), where R(φ(x)) ∈ Rᵐ is a vector of m random features for x and T ∈ R^{t×m} is a sketching matrix as in Fact 1.
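For intuition, here is a minimal Python sketch of the embedding promised by Lemma 8 for the Gaussian kernel: m random Fourier features followed by a sketching matrix T. The Gaussian choice of T, the bandwidth, and the values of m and t are illustrative, not the ones required by the analysis.

import numpy as np

def gaussian_kernel_embedding(A, m, t, sigma=1.0, seed=0):
    """Compute E = S(phi(A)) = T R(phi(A)) for the Gaussian kernel
    exp(-||x - y||^2 / (2 sigma^2)); A is d x n, the output is t x n."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    W = rng.standard_normal((m, d)) / sigma        # frequencies omega ~ p(omega)
    b = rng.uniform(0.0, 2 * np.pi, size=(m, 1))   # random phases
    R = np.sqrt(2.0 / m) * np.cos(W @ A + b)       # m x n random features, R.T @ R ~ kernel matrix
    T = rng.standard_normal((t, m)) / np.sqrt(t)   # sketch on top of the features
    return T @ R

# sanity check: random features approximate the Gaussian kernel value (sigma = 1)
A = np.random.randn(5, 100)
W = np.random.randn(4000, 5); b = np.random.uniform(0, 2 * np.pi, (4000, 1))
R = np.sqrt(2.0 / 4000) * np.cos(W @ A + b)
print(R[:, 0] @ R[:, 1], np.exp(-0.5 * np.sum((A[:, 0] - A[:, 1]) ** 2)))   # close to each other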


4 Distributed Kernel Column Subset Selection

As mentioned above, our algorithm first computes leverage scores that measure the non-uniform structure of the data, and then samples according to them to get the desired subset. We first review (generalized) leverage scores, discuss why they fit our purpose and how to compute them via subspace embeddings, and then point out that sampling according to them alone does not give the desired guarantee, but that adaptive sampling addresses the issue. Finally, we present and analyze the algorithm.

Leverage Scores  Leverage scores are critical for importance sampling in many fast randomized algorithms. For our purpose, it suffices to consider a generalized notion of leverage scores with respect to rank k.

Definition 9. For E ∈ R^{t×n} with SVD E = UΣV⊤, the leverage score ℓⱼ of its j-th column is the squared ℓ₂ norm of the j-th column of V⊤, i.e., ℓⱼ = ‖V_{j:}‖²₂. If E can approximate the row space of M up to ε, i.e., there exists X with ‖XE − M‖_F ≤ (1 + ε)‖M − [M]_k‖_F + Δ, then {ℓⱼ} are also called the leverage scores for M with respect to rank k.²

²This generalizes the definition in [18] by allowing rank(E) to be larger than k and allowing an additive error Δ.

A key property is that a small set of columns sampled according to the leverage scores spans a subspace close to that spanned by all the columns.

Fact 2 (Theorem 5 in [19]). Suppose E ∈ R^{t×n} has rank at most r, M ∈ R^{m×n}, and ε ∈ (0, 1]. Let ℓⱼ be the leverage score of the j-th column of E. Define sampling probabilities pⱼ such that pⱼ ≥ β ℓⱼ / Σⱼ ℓⱼ for all j ∈ [n] and some β ∈ (0, 1]. Sample M_{:j} with probability min{1, O(r log r / (βε²)) pⱼ}. Define an n × n diagonal sampling matrix T for which T_{j,j} = 1/√pⱼ if M_{:j} is sampled and T_{j,j} = 0 otherwise. Then with probability at least 0.99,
‖MT(ET)†E − M‖_F ≤ (1 + ε) min_X ‖XE − M‖_F.

In the theorem, the leverage scores of E are used for sampling in M. A special case is when the row space of E approximates the row space of M, i.e., there exists X such that ‖M − XE‖_F is small. Then the sampling error is small, and {ℓⱼ} are just the leverage scores for M with respect to rank k by Definition 9. So our task reduces to computing the generalized scores for M, which further reduces to finding a matrix E that approximates M. In general, computing leverage scores is non-trivial: naive approaches require an SVD, which is expensive. Even ignoring the computational cost, a naive SVD is prohibitive in the distributed kernel setting due to its high communication cost. However, computing the generalized scores with respect to rank k can be much more efficient, since the intrinsic dimension becomes k rather than the ambient dimension (the number of points or the dimension of the feature space). The only remaining difficulty is to efficiently find a smaller matrix that approximates the row space of the original data, and this is readily solved by subspace embeddings (Lemma 6).

Adaptive Sampling  Leverage score sampling can be used to get a set of points P such that the span of φ(P) contains a rank-O(k/ε) (1 + ε)-approximation to φ(A). This is still not the desired guarantee. Therefore, we first sample a constant-factor approximation and then resort to the adaptive sampling algorithm of [16, 10], which reduces the bound to a linear dependence.

Lemma 10. Suppose the span of φ(P) contains a rank-k (c, Δ)-approximation for φ(A), and let rⱼ be the squared distance from φ(A_{:j}) to its projection onto the subspace spanned by φ(P). Sample a set of O(ck/ε) points Ỹ from A according to the rⱼ, and let Y = Ỹ ∪ P. Then with probability at least 0.99, the span of φ(Y) contains a (1 + ε, Δ)-approximation for φ(A).

Algorithm  The details are described in Algorithm 1. First, we compute a constant-factor approximation of the generalized leverage scores. Since a c₀-approximation only introduces an O(c₀) factor in the later sample size, a constant c₀ suffices. To get the generalized scores, we apply a (c₀, Δ)-good subspace embedding and then compute the scores of the embedded data E = [E¹, …, Eˢ]. Again, we only need to approximate the scores of E, so we can apply another embedding on the right to reduce the number of data points.


Let ET = [E¹T¹, …, EˢTˢ] denote the embedded data, and do the QR factorization (ET)⊤ = UZ. Now the rows of U⊤ = (Z⊤)⁻¹ET form a basis for the row space of ET. Then, think of U⊤T† = (Z⊤)⁻¹E as a basis for E, so it suffices to compute the norms of the columns of (Z⊤)⁻¹E. In the second step, we sample a subset of points Y that acts as a proxy for the original data. We first sample a set P of O(k log k) points according to the scores; the span of φ(P) then contains a constant-factor approximation. Adaptively sampling O(k/ε) further points according to the squared distances from the points to their projections onto the span of φ(P), and combining them with P, gives the desired Y.

Algorithm 2 Distributed Kernel Principal Component Analysis: L = disKPCA({Aⁱ}ˢᵢ₌₁, k, ε, Δ)
1: Y = disKCSS({Aⁱ}ˢᵢ₌₁, k, ε, Δ).
2: Each worker i: compute the basis Q for φ(Y) and Πⁱ = Q⊤φ(Aⁱ);
   embed ΠⁱTⁱ ∈ R^{|Y|×w} with w = O(|Y|/ε²), and send ΠⁱTⁱ to the Center;
   Center: concatenate ΠT = [Π¹T¹, …, ΠˢTˢ] and send the top-k left singular vectors W of ΠT to the workers;
   Each worker i: set L = QW.

Proof of Theorem 3  First, by Lemma 6 we know that the row space of the embedding matrix E = S(φ(A)) contains enough information about φ(A), so the scores for E are the generalized scores for φ(A). Note that the algorithm can be viewed as applying an embedding T = diag(T¹, …, Tˢ) on E to approximate the scores while saving costs. Since each Tⁱ is an O(c₀)-subspace embedding matrix, a simple calculation shows that T is an O(c₀)-subspace embedding. Such a scheme of using an embedding to approximate the scores has been analyzed in [18]. Let ℓᵢⱼ be the true leverage score of (Eⁱ)_{:j} in E.

Fact 3 (Lemma 6 in [18]). If T is a c₀-subspace embedding, then ℓ̃ᵢⱼ = (1 ± c₀) ℓᵢⱼ.

By Lemma 10, it suffices to prove that the span of φ(P) contains an (O(1), Δ)-approximation. By Fact 3, ℓ̃ᵢⱼ ∈ [1/2, 3/2] ℓᵢⱼ. Then by Fact 2 and Lemma 6,
‖φ(A) − φ(A)T(ET)†E‖²_H ≤ O(1) min_{X ∈ Hᵗ} ‖φ(A) − XE‖²_H ≤ O(1) ‖φ(A) − [φ(A)]_k‖²_H + Δ.
Note that here T is the sampling matrix used to get P from A, so φ(A)T can be written as φ(P)T′ for some T′. Then C = T′(ET)†E satisfies ‖φ(A) − φ(P)C‖²_H ≤ O(1) ‖φ(A) − [φ(A)]_k‖²_H + Δ, completing the proof. Note that the size of P can be reduced to O(k) by using BSS sampling to get P; see [10].
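For concreteness, the following is a minimal single-machine simulation of the leverage-score step (step 1) of Algorithm 1, assuming the kernel subspace embeddings Eⁱ = S(φ(Aⁱ)) have already been computed (e.g., with an embedding like the one sketched after Lemma 8). The Gaussian right sketch and the constant in the sampling probability are illustrative choices, not the paper's prescription.

import numpy as np

def approx_leverage_score_sampling(E_blocks, k, p, seed=0):
    """E_blocks: list of t x n_i sketched local datasets E^i = S(phi(A^i)).
    Returns, per worker, the column indices sampled in step 2 of Algorithm 1."""
    rng = np.random.default_rng(seed)
    # each worker sketches on the right and ships a t x p matrix to the center
    ET = np.hstack([E @ (rng.standard_normal((E.shape[1], p)) / np.sqrt(p))
                    for E in E_blocks])
    # center QR-factorizes (ET)^T = U Z and broadcasts Z (assumes ET has full row rank)
    _, Z = np.linalg.qr(ET.T)
    Zinv_T = np.linalg.inv(Z).T                 # (Z^T)^{-1}
    # each worker: approximate scores = squared column norms of (Z^T)^{-1} E^i
    scores = [np.sum((Zinv_T @ E) ** 2, axis=0) for E in E_blocks]
    total = sum(s.sum() for s in scores)        # workers exchange score sums via the center
    samples = []
    for s in scores:
        probs = np.minimum(1.0, 10 * k * max(np.log(k), 1.0) * s / total)
        samples.append(np.nonzero(rng.random(len(s)) < probs)[0])
    return samples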

5 Distributed Kernel Principal Component Analysis

The details of our algorithm are described in Algorithm 2. After obtaining the selected subset Y, it simply computes a subspace L within the span of φ(Y). It first projects the data onto the span to get the low-dimensional data Π. Now it suffices to compute the best rank-k approximation for Π. Again, since we only need an approximation, we can apply an embedding to reduce the number of data points and get ΠT = [Π¹T¹, …, ΠˢTˢ]. The algorithm then computes the best rank-k approximation W for ΠT, which is a good approximation for Π and thus for φ(A). It then returns L = QW, the representation of W back in the kernel feature space.

Proof of Theorem 2  It follows by combining Theorem 3 and the following lemma.

Lemma 11. If the span of φ(Y) contains a (1 + ε, Δ)-approximation subspace, then

‖LL⊤φ(A) − φ(A)‖²_H ≤ (1 + ε)²‖φ(A) − [φ(A)]_k‖²_H + (1 + ε)Δ.


Proof Sketch. For our choice of w, Tⁱ is an ε-subspace embedding matrix for Πⁱ. Then their concatenation B is an ε-subspace embedding for Π, the concatenation of the Πⁱ. We can then apply the idea in Lemma 5 of [4] (also implicit in Theorem 1.5 of [20]). By the Pythagorean Theorem, the error can be decomposed into ‖LL⊤φ(A) − QQ⊤φ(A)‖²_H and ‖φ(A) − QQ⊤φ(A)‖²_H. The first term is about ‖[Q⊤φ(A)]_k − Q⊤φ(A)‖²_H, since LL⊤ = QWW⊤Q⊤ and by the property of sketching. Again by the Pythagorean Theorem, the error is then roughly ‖Q[Q⊤φ(A)]_k − φ(A)‖²_H. The lemma then follows from the assumption on φ(Y).

Remark  Equipped with the subspace embedding tools, one may come up with the following simpler algorithms. Here we point out why they fail.
1. Return the PCA of the subspace embedding. The subspace embedding only guarantees that the row space of the embedded matrix E contains a good approximation; the best rank-k approximation of E is not necessarily a good approximation (see Lemma 6). To guarantee that it is, the embedding dimension would need to be linear in the number of data points, which is impractical.
2. Take m random features of the data points and then apply distributed linear PCA. The communication is Õ(skm/ε), which is too large: one needs m = O(d/ε²) to preserve the kernel value up to additive error ε (which is still not enough to get the desired final bound).
3. Classic leverage score sampling. Approximating the classic leverage scores requires communication O(sm²), where m is the feature dimension (d^q for polynomial kernels, or the number of random features for other kernels). Furthermore, simply sampling according to the leverage scores does not give the desired rank-k solution.
Our algorithm succeeds with much lower communication because we only need a constant-factor approximation of the leverage scores with respect to rank k, which allows a much more aggressive subspace embedding (a kernel subspace embedding followed by another embedding on the right-hand side, as in Step 1 of Algorithm 1). This results in a bilinear sketch, which is crucial for low communication. We stress that the standard way of applying a subspace embedding is only on the right-hand side, which would not work here since the communication would be proportional to the feature dimension.
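As an illustration of step 2 of Algorithm 2, here is a minimal single-machine simulation of the sketch-then-SVD stage, assuming the projected blocks Πⁱ = Q⊤φ(Aⁱ) have already been formed (for instance via the kernel-trick computation described in Appendix A); the Gaussian sketch and the exact choice of w are illustrative.

import numpy as np

def kpca_final_step(Pi_blocks, k, eps, seed=0):
    """Pi_blocks: list of |Y| x n_i projected local data; returns W (|Y| x k)
    so that L = Q W is the output subspace of Algorithm 2."""
    rng = np.random.default_rng(seed)
    r = Pi_blocks[0].shape[0]              # r = |Y|
    w = int(np.ceil(r / eps ** 2))         # per-worker sketch size
    sketches = []
    for Pi in Pi_blocks:                   # each worker sketches its own block ...
        T = rng.standard_normal((Pi.shape[1], w)) / np.sqrt(w)
        sketches.append(Pi @ T)            # ... and sends a |Y| x w matrix to the center
    PiT = np.hstack(sketches)              # center concatenates the sketches
    U, _, _ = np.linalg.svd(PiT, full_matrices=False)
    return U[:, :k]                        # top-k left singular vectors of Pi T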

6 Experiments

First, we verify the efficacy of our algorithm by comparison to non-distributed batch KPCA on small datasets where the latter is feasible. Then, under varying communication budgets, we compare to the baseline that uniformly samples data from the dataset and then performs batch KPCA (denoted uniform+batch KPCA), which is often adopted in practice. We also evaluate a simplified version of our algorithm (denoted uniform+disKPCA), obtained by replacing the first step (disKCSS) with uniform sampling. We additionally evaluate on spectral clustering tasks. Part of the results for polynomial and Gaussian kernels are presented here; other results are provided in the appendix.

Datasets  We use the small datasets 1) insurance (dimension 86, 9000 points) and 2) har (561 × 10299) for comparison with batch KPCA. We also use larger datasets: 1) 20 newsgroup (61118 × 11269); 2) CT-slice (384 × 53500); 3) YearPredictionMSD (90 × 463715); 4) MNIST8M (784 × 8000000); 5) HIGGS (28 × 11000000); 6) SUSY (18 × 3000000); 7) Protein Structure (9 × 45730). These are from the UCI Repository [5] and from [8].

Methodology  The data is partitioned over the workers according to a power-law distribution with exponent 2 to simulate the distribution of data over large networks [14]. Depending on the size of the dataset, the number of workers used ranges from 5 to 200. We compare the low-rank approximation errors for distributed KPCA. Each method is run 5 times; the mean and the standard deviation are plotted. Note that our guarantee is on the multiplicative error, so the ideal criterion would be the ratio between the error of a solution and that of the best rank-k solution. Computing the latter is infeasible for the larger datasets, so only the error of the solution is plotted; the best way to compare two algorithms is then to compare the communication needed to achieve the same error.
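For reproducibility, one way to realize the power-law partitioning described above is sketched below; the exact partitioning procedure used in the experiments is not specified beyond the exponent, so this is an assumed implementation.

import numpy as np

def powerlaw_partition(n, s, exponent=2.0, seed=0):
    """Return a list of s index arrays whose sizes follow a power law."""
    rng = np.random.default_rng(seed)
    weights = np.arange(1, s + 1, dtype=float) ** (-exponent)
    weights /= weights.sum()
    assignment = rng.choice(s, size=n, p=weights)   # worker id for each data point
    return [np.nonzero(assignment == i)[0] for i in range(s)]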


Figure 1: KPCA for polynomial kernels on small datasets: low-rank approximation error and runtime. Panels: (a) error on insurance, (b) runtime on insurance, (c) error on har, (d) runtime on har.

Figure 2: KPCA for Gaussian kernels on small datasets: low-rank approximation error and runtime. Panels: (a) error on insurance, (b) runtime on insurance, (c) error on har, (d) runtime on har.


Figure 3: KPCA for polynomial kernels on larger datasets (higgs, susy, 20news).

Results  Figures 1 and 2 show the comparison between our disKPCA algorithm and batch KPCA on small datasets for the polynomial kernel (degree q = 4) and for the Gaussian kernel, respectively. disKPCA achieves errors similar to batch KPCA, but it is much faster: we gain a speedup of 10× using five workers. Figure 3 shows the performance on polynomial kernels. Our algorithm outperforms the others by significant margins. uniform+batch KPCA only scales up to a small number of points, so it is not run for Gaussian kernels. uniform+disKPCA can use more data points, but still cannot reach the same error as disKPCA even with much higher communication. Figure 4 shows the performance on Gaussian kernels. Similarly, disKPCA achieves better performance. The leverage scores for Gaussian kernels are typically more uniform than those for polynomial kernels, so the advantage of disKPCA is less significant than in the polynomial kernel case. Still, uniform+disKPCA needs 3 times the communication or more to achieve the same error. Since it does not pay the communication of computing leverage scores, this means that it needs to sample more points to get similar performance. The consistent advantage of our algorithm therefore demonstrates the efficacy of disKCSS.


Figure 4: KPCA for Gaussian kernels on larger datasets (higgs, susy, mnist8m).

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 2003.
[2] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 2009.
[3] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In FOCS, 1999.
[4] H. Avron, H. Nguyen, and D. Woodruff. Subspace embeddings for the polynomial kernel. In NIPS, 2014.
[5] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[6] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.
[7] M.-F. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff. Improved distributed principal component analysis. In NIPS, 2014.
[8] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 2014.
[9] C. Boutsidis, M. Sviridenko, and D. P. Woodruff. Optimal distributed principal component analysis. Manuscript, 2015.
[10] C. Boutsidis and D. P. Woodruff. Optimal CUR matrix decompositions. In STOC, pages 353–362, 2014.
[11] C. Boutsidis and D. P. Woodruff. Communication-optimal distributed principal component analysis in the column-partition model. CoRR, abs/1504.06729, 2015.
[12] Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, pages 342–350, 2009.
[13] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.
[14] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[15] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. F. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, 2014.
[16] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In APPROX & RANDOM, 2006.
[17] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In KDD, 2004.
[18] P. Drineas, M. Magdon-Ismail, M. Mahoney, and D. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012.
[19] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
[20] R. Kannan, S. Vempala, and D. Woodruff. Principal component analysis and higher correlations for distributed data. In COLT, 2014.
[21] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, 2013.
[22] J. Nelson and H. L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, 2013.
[23] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[24] W. Rudin. Fourier Analysis on Groups. Number 12. John Wiley & Sons, 1990.
[25] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, 2006.
[26] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In ICANN. Springer-Verlag, 1997.
[27] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
[28] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In NIPS, 2012.


A Remark on Using Kernel Tricks in the Algorithms

In computing the final solution from Y in Algorithm 2, we need to compute the projection of φ(A) onto the span of φ(Y). This can be done using the kernel trick and implicit Gram–Schmidt. Note that Πⁱ = Q⊤φ(Aⁱ), where Q is the basis for φ(Y). Suppose φ(Y) has QR factorization φ(Y) = QR. Then Q = φ(Y)R⁻¹ and Q⊤φ(A) = (R⁻¹)⊤φ(Y)⊤φ(A), where φ(Y)⊤φ(A) is just the matrix of kernel values between points in Y and points in A. For R, we have Q⊤Q = (R⁻¹)⊤φ(Y)⊤φ(Y)R⁻¹ = I, so R⊤R = φ(Y)⊤φ(Y), and thus R can be computed by factorizing the kernel matrix on Y. Similarly, to compute the distances for adaptive sampling in Algorithm 1, we first need to compute the projection of φ(A) onto φ(P), which can be done in the same way. The squared distance from an original point to its projection can then be computed by subtracting the squared norm of the projection from the squared norm of the original point.
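A minimal Python sketch of this computation, assuming the kernel matrices K_YY = φ(Y)⊤φ(Y) and K_YA = φ(Y)⊤φ(A) and the diagonal values k(aⱼ, aⱼ) are available; the small diagonal jitter is an assumed numerical safeguard, not part of the paper's description.

import numpy as np

def project_onto_span(K_YY, K_YA):
    """Return Pi = Q^T phi(A) using only kernel evaluations:
    R^T R = K_YY and Q^T phi(A) = (R^T)^{-1} K_YA."""
    R = np.linalg.cholesky(K_YY + 1e-10 * np.eye(K_YY.shape[0])).T   # upper-triangular R
    return np.linalg.solve(R.T, K_YA)

def residual_sq_distances(K_diag, Pi):
    """Squared distance from phi(a_j) to its projection:
    k(a_j, a_j) - ||Q^T phi(a_j)||^2."""
    return K_diag - np.sum(Pi ** 2, axis=0)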

B Additional Proofs

This section provides the missing proofs from the main text. Some proofs mainly follow known arguments in the literature, slightly generalized to handle the additive error term needed for our purpose; we include them here for completeness.

B.1 Properties of Subspace Embeddings for Kernels

Lemma 6. Suppose S is an (ε, Δ)-good subspace embedding for the kernel k(·, ·), and suppose E = S(φ(A)) has SVD E = UΣV⊤. Then there exists W such that
‖WE − φ(A)‖²_H ≤ (1 + O(ε))‖φ(A) − [φ(A)]_k‖²_H + O(Δ).

Proof. Let φ_k := [φ(A)]_k and φ := φ(A). Suppose φ_k has SVD Û Σ̂ V̂⊤. Define
X̃ := argmin_X ‖S(φ_k)X − S(φ)‖_F,
and also note that I = argmin_X ‖φ_k X − φ‖_F, since φ_k is defined to be the best rank-k approximation of φ.

First, consider bounding β := Û⊤φ_k(X̃ − I) = Û⊤(φ_k X̃ − φ_k). We have
β = S(Û)⊤S(Û)β + (I − S(Û)⊤S(Û))β =: T₁ + T₂.
For the first term, since S is a linear mapping,
T₁ = S(Û)⊤S(Û)β = S(Û)⊤S(ÛÛ⊤φ_k)(X̃ − I) = S(Û)⊤S(φ_k)(X̃ − I).
Also note that by the definition of X̃, we have S(φ_k)⊤(S(φ_k)X̃ − S(φ)) = 0. Since Û = φ_k W for some W,
S(Û)⊤(S(φ_k)X̃ − S(φ)) = W⊤S(φ_k)⊤(S(φ_k)X̃ − S(φ)) = 0.
So
T₁ = S(Û)⊤S(φ_k)(X̃ − I) + S(Û)⊤(S(φ) − S(φ_k)X̃) = S(Û)⊤(S(φ) − S(φ_k)),
which leads to
‖T₁‖²_F ≤ (ε/k)‖Û‖²_H ‖φ − φ_k‖²_H + Δ = ε‖φ − φ_k‖²_H + Δ.
For the second term, by the subspace embedding property, we have
‖T₂‖²_F ≤ ‖S(Û)⊤S(Û) − I‖²₂ ‖β‖²_F.
Therefore, we have
‖β‖²_F ≤ (ε/(1 − c₀²)) ‖φ − φ_k‖²_H + Δ/(1 − c₀²).

Since I = argmin_X ‖φ_k X − φ‖_F, we have φ_k⊤(φ_k − φ) = 0. Then by the Pythagorean Theorem,
‖φ_k X̃ − φ‖²_H = ‖φ_k − φ‖²_H + ‖φ_k X̃ − φ_k‖²_H.
Since ‖β‖²_F = ‖φ_k X̃ − φ_k‖²_H, we arrive at
‖φ_k X̃ − φ‖²_H ≤ (1 + O(ε))‖φ_k − φ‖²_H + O(Δ).
Note that X̃ = (S(φ_k))†S(φ), where (S(φ_k))† is the pseudoinverse of S(φ_k). Then φ_k X̃ = φ_k(S(φ_k))†E, and W := φ_k(S(φ_k))† satisfies the statement.

B.2 Existence of Subspace Embeddings for Kernels

Lemma 7. For the polynomial kernel k(x, x′) = (⟨x, x′⟩)^q, there exists an (ε, 0)-good subspace embedding matrix S ∈ R^{t×d^q} with t = O(k/ε). Furthermore, Sφ(A) can be successfully computed with probability at least 1 − δ in time Õ(q3^q k²/ε + q·nnz(A) + k/(ε²δ)).

Proof. First use TensorSketch [4] to bring the dimension down to O(3^q k² + k/ε). Then we can use an i.i.d. Gaussian matrix, which reduces it to t = O(k/ε); or we can first use a fast Hadamard transform to bring it down to O(k polylog(k)/ε), and then multiply by i.i.d. Gaussians to bring it down to O(k/ε). P1 follows immediately from the definition, so we only need to check the matrix product property. Let S = ΩT where T is the TensorSketch matrix and Ω is an i.i.d. Gaussian matrix. Since both are subspace embedding matrices, for any M and N,
‖M⊤S⊤SN − M⊤T⊤TN‖_F ≤ √(ε/k) ‖MT‖_F ‖NT‖_F,
‖M⊤T⊤TN − M⊤N‖_F ≤ √(ε/k) ‖M‖_H ‖N‖_H.
By the subspace embedding property, ‖MT‖_F = (1 ± ε₀)‖M‖_H and ‖NT‖_F = (1 ± ε₀)‖N‖_H for some small constant ε₀. Combining all these bounds and choosing ε properly, we know that S = ΩT satisfies P2.

Lemma 8. For a continuous shift-invariant kernel k(x, y) = k(x − y) with proper regularization, there exists an (ε, Δ)-good subspace embedding S : H → Rᵗ with t = O(k/ε). Furthermore, S(φ(x)) = T R(φ(x)), where R(φ(x)) ∈ Rᵐ is a vector of m random features for x and T ∈ R^{t×m} is a sketching matrix as in Fact 1.

Proof. Let R(φ(x)) be the random feature expansion with m random features. Let T ∈ R^{t×m} be a subspace embedding matrix. We define S(φ(x)) := T R(φ(x)). Note that S(·) is a linear mapping from H to Rᵗ. For the random feature expansion, we have

Fact 4 (Claim 1 in [23]). Let k be a continuous shift-invariant positive-definite function k(x, y) = k(x − y) defined on a compact set of Rᵈ of diameter D, with k(0) = 1 and such that ∇²k(0) exists. Let σ²_p denote the second moment of the Fourier transform of k. Then |k(x, y) − R(φ(x))⊤R(φ(y))| ≤ ε₀ with probability at least 1 − δ when
m = O((d/ε₀²) log(σ_p D/(ε₀ δ))).

For the subspace embedding matrix, we have

Fact 5 (Lemma 32 in [13]). For matrices A and B with n rows, and given ε > 0, there is t = Θ(ε⁻²) so that for a t × n generalized sparse embedding matrix S, a t × n fast JL matrix, a t log(nd) × n subsampled randomized Hadamard matrix, or a leverage-score sketching matrix for A under the condition that A has orthonormal columns,
Pr[‖(SA)⊤SB − A⊤B‖_F ≤ ε‖A‖_F‖B‖_F] ≥ 1 − δ
for any fixed constant δ > 0.

Then we have
‖S(φ(A))⊤S(φ(B)) − φ(A)⊤φ(B)‖²_F ≤ ‖S(φ(A))⊤S(φ(B)) − R(φ(A))⊤R(φ(B))‖²_F + ‖R(φ(A))⊤R(φ(B)) − φ(A)⊤φ(B)‖²_F.
For the first term, we have
‖S(φ(A))⊤S(φ(B)) − R(φ(A))⊤R(φ(B))‖²_F ≤ ε²‖R(φ(A))‖²_F ‖R(φ(B))‖²_F ≤ ε²‖φ(A)‖²_H ‖φ(B)‖²_H + O(ε² ab ε₀),
where a is the number of columns in A and b is the number of columns in B. Similarly, the second term is bounded by O(aε₀ + bε₀ + ab ε₀²). So the matrix product approximation is satisfied with
‖S(φ(A))⊤S(φ(B)) − φ(A)⊤φ(B)‖²_F ≤ ε²‖φ(A)‖²_H ‖φ(B)‖²_H + O(ε² ab ε₀) + O(aε₀ + bε₀ + ab ε₀²).
Now consider the subspace embedding condition. Let V be an orthonormal basis spanning the subspace. Then the condition is equivalent to saying that S(V)⊤S(V) has eigenvalues bounded in [1 ± O(ε)]. Note that
‖S(V)⊤S(V) − V⊤V‖²_F ≤ ε²‖V‖⁴_H + O(ε² k² ε₀) + O(kε₀ + kε₀ + k² ε₀²),
where ‖V‖⁴_H = k². Then choosing t = O(k/ε) and m = max{O(da²/Δ²), O(db²/Δ²), Õ((ε² ab/(k²Δ))² d), Õ(dk²/c₀²)} satisfies the two conditions.
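For reference, here is a minimal implementation of the TensorSketch map used in the proof of Lemma 7 (the FFT-based construction of [4]); the fully independent hash functions below are an illustrative simplification of the limited-independence hash families required by the theory.

import numpy as np

def tensorsketch(X, q, m, seed=0):
    """Sketch the degree-q polynomial feature map of each column of X (d x n)
    into R^m, so that <TS(x), TS(y)> approximates (<x, y>)^q."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    buckets = rng.integers(0, m, size=(q, d))       # q independent CountSketch hash functions
    signs = rng.choice([-1.0, 1.0], size=(q, d))
    prod = np.ones((m, n), dtype=complex)
    for i in range(q):
        cs = np.zeros((m, n))
        np.add.at(cs, buckets[i], signs[i][:, None] * X)   # CountSketch of the i-th factor
        prod *= np.fft.fft(cs, axis=0)              # combine the q CountSketches by FFT convolution
    return np.real(np.fft.ifft(prod, axis=0))

# sanity check against the polynomial kernel value
X = np.random.randn(10, 2)
S = tensorsketch(X, q=3, m=4096)
print(S[:, 0] @ S[:, 1], (X[:, 0] @ X[:, 1]) ** 3)  # the two numbers should be close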

B.3 Adaptive Sampling

Lemma 10. Suppose the span of φ(P) contains a rank-k (c, Δ)-approximation for φ(A), let rⱼ be the square of the distance from φ(A_{:j}) to its projection onto the subspace spanned by φ(P), and let pⱼ = rⱼ / Σᵢ rᵢ. Sample a set of O(ck/(αε)) points Ỹ from A according to a distribution qⱼ which satisfies qⱼ ≥ αpⱼ, and let Y = Ỹ ∪ P. Then with probability at least 0.99, the span of φ(Y) contains a (1 + ε, Δ)-approximation for φ(A).

Proof. Let Z denote the best rank-k approximation for φ(A) in the span of φ(Y), and let W denote the projection of φ(A) onto the span of φ(P). By Theorem 3 in [16], we have
E[‖φ(A) − Z‖²_H] ≤ ‖φ(A) − [φ(A)]_k‖²_H + (ε/c)‖φ(A) − W‖²_H,
where ‖φ(A) − W‖²_H ≤ c‖φ(A) − [φ(A)]_k‖²_H + Δ by our assumption. The lemma then follows by Markov's inequality.
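A one-function sketch of the sampling step in Lemma 10, with α = 1 (i.e., qⱼ = pⱼ); the squared residuals rⱼ can be computed with the kernel-trick computation of Appendix A.

import numpy as np

def adaptive_sample(res_sq, num_samples, seed=0):
    """Sample column indices with probability proportional to the squared
    residual distances r_j to the span of phi(P)."""
    rng = np.random.default_rng(seed)
    p = res_sq / res_sq.sum()
    return rng.choice(len(res_sq), size=num_samples, replace=True, p=p)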

B.4 Computing the Approximation Subspace

Lemma 11. In Algorithm 2, if the span of φ(Y) contains a (1 + ε, Δ)-approximation subspace, then
‖LL⊤φ(A) − φ(A)‖²_H ≤ (1 + ε)²‖φ(A) − [φ(A)]_k‖²_H + (1 + ε)Δ.

Proof. For our choice of w, Tⁱ is an ε-subspace embedding matrix for Πⁱ. Then their concatenation B is an ε-subspace embedding for Π, the concatenation of the Πⁱ. We can then apply the argument of Lemma 5 in [4] (also implicit in Theorem 1.5 in [20]). Below we give a simplified proof.

First, let X denote the (1 + ε, Δ)-approximation in the span of φ(Y). Then
‖Q[Q⊤φ(A)]_k − φ(A)‖²_H ≤ ‖φ(A) − X‖²_H ≤ (1 + ε)‖φ(A) − [φ(A)]_k‖²_H + Δ.   (1)


Second, we apply Theorem 7 in [20] to Q⊤φ(A). Note that that theorem is stated for a specific subspace embedding scheme, but it holds for any subspace embedding. Then we have
‖WW⊤Q⊤φ(A) − Q⊤φ(A)‖²_H ≤ (1 + ε)‖[Q⊤φ(A)]_k − Q⊤φ(A)‖²_H.   (2)

We now bound the error using the above two claims. By the Pythagorean Theorem,
‖LL⊤φ(A) − φ(A)‖²_H = ‖LL⊤φ(A) − QQ⊤φ(A)‖²_H + ‖φ(A) − QQ⊤φ(A)‖²_H.   (3)

Noting L = QW, the first term on the right-hand side is
‖LL⊤φ(A) − QQ⊤φ(A)‖²_H = ‖WW⊤Q⊤φ(A) − Q⊤φ(A)‖²_H ≤ (1 + ε)‖[Q⊤φ(A)]_k − Q⊤φ(A)‖²_H = (1 + ε)‖Q[Q⊤φ(A)]_k − QQ⊤φ(A)‖²_H,
where the inequality is by (2) and the equalities hold because multiplying vectors in the span of Q by Q does not change their norms. Plugging into (3),
‖LL⊤φ(A) − φ(A)‖²_H ≤ (1 + ε)(‖Q[Q⊤φ(A)]_k − QQ⊤φ(A)‖²_H + ‖φ(A) − QQ⊤φ(A)‖²_H) = (1 + ε)‖Q[Q⊤φ(A)]_k − φ(A)‖²_H ≤ (1 + ε)²‖φ(A) − [φ(A)]_k‖²_H + (1 + ε)Δ,
where the equality is by the Pythagorean Theorem and the last inequality is by (1).

Figure 5: Additional KPCA results for polynomial and Gaussian kernels (ctslice and yearpredmsd datasets). Panels: (a) polynomial kernels; (b) Gaussian kernels.

C Additional Experimental Results

Experiment Details  We used different numbers of workers for datasets of different sizes; the smallest number is 5 and the largest is 200. For the polynomial kernel we used degree q = 4, and for the Gaussian RBF kernel the bandwidth σ is set to 0.2 times the median pairwise distance among a subset of 20000 randomly chosen data points (a.k.a. the "median trick"). For the Gaussian random feature expansion, we used 2000 random features. In all experiments, we set the number of principal components to k = 10, which is also the number of clusters for k-means. The algorithm-specific parameters are set as follows: 1) the subspace embedding dimension for the feature expansion, t, is 50; 2) the subspace embedding dimension for the data points, p, is 250; 3) we vary the number of sampled points |Y| from 50 to 400; 4) the subspace embedding dimension w is set equal to |Y|.
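Collected in one place, these settings amount to a configuration like the following hypothetical summary (not code shipped with the paper):

# experiment configuration as described above
EXPERIMENT_CONFIG = {
    "num_workers": (5, 200),       # range, depending on dataset size
    "poly_degree": 4,              # q for the polynomial kernel
    "rbf_bandwidth": "0.2 * median pairwise distance of 20000 random points",
    "num_random_features": 2000,   # Gaussian random feature expansion
    "k": 10,                       # principal components / k-means clusters
    "t": 50,                       # embedding dimension for the feature expansion
    "p": 250,                      # embedding dimension for the data points
    "num_sampled_points": range(50, 401, 50),   # |Y| (illustrative grid)
    "w": "equal to |Y|",           # final sketch dimension
}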

Figure 6: KPCA results for arc-cosine kernels on the protein and 20news datasets.

Figure 7: KPCA + k-means clustering. Panels: (a) polynomial kernels; (b) Gaussian kernels (datasets include 20news, susy, and ctslice).

Distributed Kernel PCA  Figure 5 shows additional results for the distributed KPCA task: Figure 5(a) shows results for polynomial kernels and Figure 5(b) for Gaussian kernels. Figure 6 shows results for arc-cosine kernels [12]. The arc-cosine kernels have random feature bases similar to the Rectified Linear Units (ReLU) used in deep learning; we used degree n = 2 for the arc-cosine kernels. These results are similar to those in the main text: our disKPCA algorithm consistently outperforms the alternatives.

Distributed Spectral Clustering  We present additional experimental results for distributed spectral clustering (KPCA followed by k-means clustering). The evaluation criterion is the k-means objective (in the feature space). We compare disKPCA, the baseline uniform+batch KPCA (first uniformly sample points and then run batch KPCA), and uniform+disKPCA (disKPCA with disKCSS replaced by uniform sampling). Figure 7 presents results for polynomial kernels on the 20 newsgroup and susy datasets. The curve of the baseline is cut short due to its drastic increase in running time. Our disKPCA algorithm compares favorably with the other methods and achieves a better tradeoff between communication and error. This means that although the other methods require similar communication, they need to sample more data points to achieve the same loss, demonstrating the effectiveness of disKCSS.
