Robust Subspace Clustering via Thresholding

Reinhard Heckel and Helmut Bölcskei

arXiv:1307.4891v1 [stat.ML] 18 Jul 2013

Dept. of IT & EE, ETH Zurich, Switzerland

Abstract The problem of clustering noisy and incompletely observed high-dimensional data points into a union of low-dimensional subspaces and a set of outliers is considered. The number of subspaces, their dimensions, and their orientations are assumed unknown. We propose a simple low-complexity clustering algorithm based on thresholding correlations between the data points followed by spectral clustering. A statistical performance analysis shows that this algorithm succeeds even when the subspaces intersect and when the dimensions of the subspaces scale (up to a log-factor) linearly in the ambient dimension. Moreover, we prove that the algorithm succeeds even when the data points are incompletely observed with the number of missing observations scaling (up to a log-factor) linearly in the ambient dimension. The algorithm is shown to be robust in the presence of structured (i.e., sparse) and of unstructured additive noise. Our results reveal an explicit tradeoff between the affinity of the subspaces and the tolerable noise level. Finally, we propose a simple scheme that provably detects outliers, and present numerical results on real and synthetic data.

1 Introduction

One of the major challenges in modern data analysis is to extract relevant information from large high-dimensional data sets. The relevant features in high-dimensional data sets are often of limited complexity, or, more specifically, have low-dimensional structure. For example, images of faces are high-dimensional as the number of pixels is typically large, whereas the set of images of a given face under varying illumination conditions approximately lies in a 9-dimensional linear subspace [1]. This and similar findings for other types of data have motivated research on finding low-dimensional representations of high-dimensional data [33]. A prevalent low-dimensional structure is that of data points lying in a union of (unknown) low-dimensional subspaces. The problem of finding the assignments of the data points to these subspaces is known as subspace clustering [33] or hybrid linear modeling. An example application of subspace clustering is the following: Given a set of images of faces, cluster the images such that each of the resulting clusters corresponds to a single person [13]. Other application areas include unsupervised learning, image representation and segmentation [14], computer vision, specifically motion segmentation [34, 25], and disease detection [16]; we refer to [33] and the references therein for a more complete list.

Often the data available is corrupted by noise and contains outliers. The general subspace clustering problem we consider can therefore be formulated as follows. Suppose we are given a set of N data points in R^m, denoted by X, and assume that

$$\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_L \cup \mathcal{O},$$

Footnote: Parts of this paper were presented at the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [12] and at the 2013 IEEE International Symposium on Information Theory (ISIT) [11].


where O denotes a set of outliers and the nl := |Xl| points in Xl, l ∈ [L], are given by

$$x_j^{(l)} = y_j^{(l)} + e_j^{(l)}, \qquad (1)$$

where e_j^(l) ∈ R^m is noise and y_j^(l) ∈ Sl with Sl a dl-dimensional subspace of R^m. The association of the points in X with the Xl and O, the number of subspaces L, their dimensions dl, and their orientations are all unknown. We want to cluster the (noisy) points in X, i.e., find the assignments of the points in X to the sets X1, ..., XL, O. Once these associations have been identified, it is straightforward to extract approximations (recall that we have access to noisy observations only) of the subspaces Sl through principal component analysis (PCA).

Numerous approaches to subspace clustering have been proposed in the literature, including algebraic, statistical, and spectral clustering methods; we refer to [33] for an excellent overview. Spectral clustering methods (see [36] for an introduction) have found particularly widespread use thanks to their excellent performance properties and efficient implementations. At the heart of spectral clustering lies the construction of an adjacency matrix A ∈ R^{N×N}, where the (i, j)th entry of A measures the similarity between the data points xi, xj ∈ X. A typical measure of similarity is, e.g., e^{−dist(xi, xj)}, where dist(·, ·) is some distance measure [33]. The association of the points in X to the subspaces Sl is obtained by finding the connected components in the graph G with adjacency matrix A. This is accomplished via a singular value decomposition of the Laplacian of G followed by k-means clustering [36].

As noted in [28], there are very few algorithms that are computationally tractable and succeed provably under non-restrictive conditions (such as overlapping subspaces). A notable exception is the sparse subspace clustering (SSC) algorithm by Elhamifar and Vidal [6, 7], which applies spectral clustering to an adjacency matrix A whose clever construction is inspired by ideas from sparse signal recovery. SSC provably succeeds, in the noiseless case, under very general conditions, as shown in [27] via an elegant (geometric functional) analysis. Most importantly, the statistical analysis in [27] reveals that SSC succeeds even when the subspaces Sl intersect (the linear subspaces Sl and Sk are said to intersect if Sl ∩ Sk ≠ {0}). Analytical performance results for subspace clustering in the presence of noise are even more scarce. Vidal noted in [33] that "the development of theoretically sound algorithms [...] in the presence of noise and outliers is a very important open challenge". A significant step towards addressing this challenge was reported in [28], posted while [11] was submitted. The robust SSC (RSSC) algorithm in [28] essentially replaces the ℓ1-minimization steps in SSC by ℓ1-penalized least squares, i.e., LASSO, steps and provably clusters data points corrupted by Gaussian noise under quite general conditions on the orientations of the subspaces Sl and on the number of points in the subspaces. To construct the adjacency matrix A, SSC for the noiseless and RSSC for the noisy case require the solution of N ℓ1-minimization and N LASSO instances, respectively, each in N variables; this poses significant computational challenges for large data sets.

Contributions
We introduce a simple, computationally efficient, subspace clustering algorithm, which applies spectral clustering to an adjacency matrix A obtained by thresholding correlations between the data points in X. Specifically, A is constructed from the q nearest neighbors of each data point in X, with correlation serving as the underlying distance measure.
The resulting algorithm is termed thresholding based subspace clustering (TSC). The gist of the results we obtain is that TSC provably succeeds (even when the data is noisy or incompletely observed) if the subspaces are sufficiently distinct and if X contains sufficiently many points from each subspace. Specifically, under additive Gaussian noise, TSC is shown to

succeed under very general conditions on the relative orientations of the subspaces and the number of points from each subspace contained in X. In particular, the subspaces are allowed to intersect. Moreover, in the noiseless case, TSC is shown to succeed even when the dimensions of the subspaces scale (up to a log-factor) linearly in the ambient dimension. While SSC for the noiseless case and RSSC for the noisy case share these desirable properties, TSC is computationally much less demanding. Specifically, the construction of the adjacency matrix A in the TSC algorithm requires the computation of N^2 inner products followed by thresholding only, as opposed to solving N ℓ1-minimization and N LASSO instances, each in N variables, for SSC and RSSC, respectively. Our results for the case of noisy data nicely reflect the intuition that the more distinct the orientations of the subspaces, the more noise TSC tolerates. What is more, TSC can succeed even under massive noise on the data, provided the subspaces are sufficiently low-dimensional. We also provide results which show that TSC provably succeeds under the influence of sparse noise, which models, e.g., corruptions or occlusions of images in face recognition [37], or mistracking of feature points in motion segmentation [25]. In practical applications the data points to be clustered are often incompletely observed, due to, e.g., scratches on images. The literature is essentially void of corresponding analytical performance results. We prove that TSC succeeds even when the number of missing entries in each data vector scales (up to a log-factor) linearly in the ambient dimension. Finally, we propose a simple scheme for outlier detection which provably succeeds, even in the noisy case. Numerical results on synthetic data and on clustering handwritten digits (MNIST data set of handwritten digits [19]) and images of faces (extended Yale Face Database B [9, 21]) complement our theoretical results.

Relation to previous work
Lauer and Schnorr [18] also apply spectral clustering to an adjacency matrix constructed from correlations between data points, albeit, without thresholding. More importantly, no analytical performance results are available for the algorithm in [18]. Liu et al. [24] consider spectral clustering applied to an adjacency matrix A obtained from a low-rank representation (LRR) of the data points through a nuclear norm minimization. A deterministic performance analysis reported in [24, Theorem 3.1] shows that LRR succeeds provided the subspaces are independent (the linear subspaces Sl are called independent if the dimension of their (set) sum is equal to the sum of their dimensions), which implies that the subspaces must not intersect. Moreover, minimizing the nuclear norm results in significant computational challenges for large data sets. Dyer et al. [4] propose to substitute the ℓ1-minimization step in SSC by an orthogonal matching pursuit (OMP) step, and derive deterministic recovery conditions for the resulting algorithm. In [5], the ℓ1-minimization step in SSC is replaced by an ℓ2-minimization step, which results in a least-squares representation of each data point in terms of all the other data points. While numerical results in [5] demonstrate the effectiveness of this approach, no analytical performance results are available.
The local subspace affinity algorithm [38] and the spectral local best-fit flats (SLBF) algorithm [39] are based on spectral clustering and rely on an adjacency matrix A built by finding the nearest neighbors in Euclidean distance (cf. [33]). Lerman and Zhang [23] pose the problem of the recovery of multiple subspaces from data drawn from a distribution on a union of subspaces as a non-convex optimization problem. No computationally tractable algorithm to solve this optimization problem [22] seems to be available, though. The problem of fitting a single low-dimensional subspace to a data set which consists of a modest number of noisy inliers and a large number of outliers was considered in [22]. A convex

programming algorithm proposed in [22] is shown to provably succeed.

Outline of the paper
The remainder of this paper is organized as follows. In Section 2, we introduce the TSC algorithm. Sections 3 and 4 contain analytical performance results for the noiseless and the noisy case, respectively. In Section 5, we analyze the impact of incompletely observed data points on TSC. Section 6 describes an outlier detection scheme along with corresponding analytical performance guarantees. In Section 7, we put our analytical results into perspective with respect to other analytical performance results for subspace clustering available in the literature. Section 8 contains numerical results on synthetic and on real data, including a performance comparison to SSC/RSSC. We decided to discuss the various settings (noiseless, noisy, incomplete observations, and outliers) in an isolated fashion to keep the exposition accessible. Proofs are given in the appendices.

Notation
We use lowercase boldface letters to denote (column) vectors, e.g., x, and uppercase boldface letters to designate matrices, e.g., A. The superscript T stands for transposition. For the vector x, x_q denotes its qth entry. For the matrix A, A_ij designates the entry in its ith row and jth column, $A^\dagger := (A^T A)^{-1} A^T$ is its pseudo-inverse, $\|A\|_{2\to 2} := \max_{\|v\|_2 = 1} \|Av\|_2$ its spectral norm, and $\|A\|_F := (\sum_{i,j} |A_{ij}|^2)^{1/2}$ its Frobenius norm. I_m denotes the m × m identity matrix, log(·) is the natural logarithm, and x ∧ y denotes the minimum of x and y. For the set T, |T| designates its cardinality and $\bar{T}$ stands for its complement. The set {1, ..., N} is denoted by [N]. We write N(µ, Σ) for a Gaussian random vector with mean µ and covariance matrix Σ. The unit sphere in R^m is $S^{m-1} := \{x \in \mathbb{R}^m : \|x\|_2 = 1\}$. 1_A(·) denotes the indicator function of the set A. For notational convenience, we use the shorthand max_{k≠l} for max_{k∈[L]: k≠l} and max_{k,l: k≠l} for max_{k,l∈[L]: k≠l}.

2 The TSC algorithm

The formulation of the thresholding based subspace clustering (TSC) algorithm provided below assumes that outliers have already been removed from X, e.g., through the outlier detection scheme described in Section 6, and that the data points in X are either normalized or of comparable norm. The latter assumption is relevant for Step 1 below and is not restrictive as the data points can be normalized prior to clustering.

TSC algorithm. Given a set of data points X and the parameter q (the choice of q is discussed below), perform the following steps:

Step 1: For every xj ∈ X, identify the set Tj ⊂ [N] \ j of cardinality q defined through |⟨xi, xj⟩| ≥ |⟨xp, xj⟩| for all i ∈ Tj and all p ∉ Tj.

Step 2: Let zj ∈ R^N be the vector with ith entry |⟨xj, xi⟩| if i ∈ Tj, and 0 if i ∉ Tj.

Step 3: Construct the adjacency matrix A according to A = Z + Z^T, where Z = [z1, ..., zN].

Step 4: Estimate the number of subspaces using the eigengap heuristic [36] according to L̂ = arg max_{i∈[N−1]} (λ_{i+1} − λ_i), where λ_1 ≤ λ_2 ≤ ... ≤ λ_N are the eigenvalues of the normalized Laplacian of the graph with adjacency matrix A.

Step 5: Apply normalized spectral clustering [36] to (A, L̂).
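To make the steps above concrete, the following is a minimal NumPy/SciPy/scikit-learn sketch of Steps 1–5; it is not the authors' reference implementation. The function and parameter names (tsc, use_lsq, etc.) are ours, the columns of X are assumed to be normalized as discussed above, and scikit-learn's SpectralClustering is used as a stand-in for the normalized spectral clustering of [36].

    import numpy as np
    from scipy.sparse.csgraph import laplacian
    from sklearn.cluster import SpectralClustering

    def tsc(X, q, use_lsq=False):
        """Thresholding based subspace clustering (TSC), Steps 1-5.
        X: (m, N) array whose columns are the (normalized) data points.
        q: number of nearest neighbors in absolute correlation.
        use_lsq: if True, use the Step 2b variant (TSCb) discussed later in this section."""
        m, N = X.shape
        G = np.abs(X.T @ X)                    # |<x_i, x_j>| for all pairs
        np.fill_diagonal(G, 0.0)               # exclude the point itself
        Z = np.zeros((N, N))
        for j in range(N):
            Tj = np.argsort(G[:, j])[-q:]      # Step 1: indices of the q largest |<x_i, x_j>|
            if use_lsq:                        # Step 2b: least-squares coefficients (TSCb)
                Z[Tj, j] = np.abs(np.linalg.lstsq(X[:, Tj], X[:, j], rcond=None)[0])
            else:                              # Step 2: absolute correlations
                Z[Tj, j] = G[Tj, j]
        A = Z + Z.T                            # Step 3: adjacency matrix
        L_norm = laplacian(A, normed=True)     # normalized Laplacian of the graph
        eigvals = np.sort(np.linalg.eigvalsh(L_norm))
        L_hat = int(np.argmax(np.diff(eigvals))) + 1   # Step 4: eigengap heuristic
        labels = SpectralClustering(n_clusters=L_hat,  # Step 5: normalized spectral clustering
                                    affinity='precomputed',
                                    assign_labels='kmeans').fit_predict(A)
        return labels, L_hat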

The hope is that the indices in Tj correspond to vectors in the subspace xj lies in. If for each j ∈ [N] the indices in Tj correspond to vectors in the subspace xj lies in, then each connected component in the graph G with adjacency matrix A corresponds only to points that lie in the same subspace. The idea behind estimating L according to Step 4 above is the following. By a well known result in spectral graph theory [29] the number of zero eigenvalues of the Laplacian of G is equal to the number of connected components of G. In the noisy case, where the eigenvalues are not exactly zero, a robust estimator for L is the eigengap heuristic [36] (which is also used for SSC in [27]). Finally, segmentation of the data into subspaces is obtained by applying spectral clustering to (A, L̂). The following definition is motivated by the desire that each connected component in the graph with adjacency matrix A correspond to points within one subspace.

Definition 1. The subspace detection property holds for X = X1 ∪ ... ∪ XL and adjacency matrix A if

i. Aij ≠ 0 only if xi and xj belong to the same set Xl and if

ii. for every i ∈ [N], Aij ≠ 0 for at least q points xj that belong to the same set Xl as xi.

This subspace detection property is akin to the ℓ1 subspace detection property introduced in [27]. The corresponding notion in [28] is that of A having "no false discoveries" and at least q "true discoveries" in each row (column). The subspace detection property guarantees that each node in the graph G (with adjacency matrix A) is connected to at least q other nodes, all of which correspond to points within the same subspace. Owing to the spectral clustering step, the subspace detection property does not guarantee that TSC yields the correct associations of the data points to the subspaces. Conversely, even when the subspace detection property does not hold strictly, but the Aij for pairs xi, xj belonging to different subspaces are "small enough", TSC may still cluster the data correctly, as will be seen in the numerical results in Section 8.

It is easily seen that for the noiseless case the subspace detection property is satisfied for subspaces that are orthogonal to each other, as long as q ≤ min_l (|Xl| − dl). Specifically, for subspaces that are orthogonal to each other, ⟨xp, xj⟩ = 0 for all xp ∈ Xl, xj ∈ Xk, l ≠ k, while for each l, there are at least |Xl| − dl inner products ⟨xp, xj⟩ with xp, xj ∈ Xl that are not equal to zero (since no more than dl points in a dl-dimensional subspace can be orthogonal to each other). The results in the following sections show that the subspace detection property can be satisfied under much more general conditions, in particular also when the subspaces intersect.

Recall that q is an input parameter to the TSC algorithm. Choosing q too small/large will lead to over/under-estimation of the number of subspaces L. Our analytical performance results ensure that the subspace detection property holds given that q is sufficiently small relative to the nl. Once this condition is satisfied, the specific choice of q is irrelevant in terms of our performance guarantees, in particular it does not depend on the dl and the noise variance. Indeed, we will see, in the numerical results in Section 8, that TSC is not overly sensitive to the specific choice of q. Finally, note that TSC has only one input parameter. SSC shares this desirable feature and has only the LASSO parameter λ as input parameter.

Finally, we note that a natural substitute for Step 2 in the TSC algorithm is to construct zj from the best linear approximation of xj through points indexed by Tj. Specifically, let X_Tj be the matrix whose columns are the vectors in X indexed by Tj.
Substitute Step 2 by:

Step 2b: Set the entries of zj ∈ R^N indexed by Tj to the absolute values of $X_{T_j}^\dagger x_j$ and set all other entries to zero.

We call the TSC algorithm with Step 2 replaced by Step 2b TSCb. The formal relationship between Steps 2 and 2b is brought out by noting that $[z_{T_j}]_i = |[X_{T_j}^\dagger x_j]_i| = |[(X_{T_j}^T X_{T_j})^{-1} X_{T_j}^T x_j]_i|$, $i \in [q]$, and realizing that $z_{T_j}$ from the original Step 2 is the vector containing the absolute values of $X_{T_j}^T x_j$. Our analytical performance results stated below also apply to TSCb since the subspace detection property depends on the location of the non-zero entries of A only, which in turn depends on the sets Tj obtained in Step 1 only. The numerical properties of TSC and TSCb will be compared in Section 8. We note that TSCb is similar, in spirit, to SSC, as the Steps 1 and 2b above yield a (sparse) linear representation of xj in terms of q points in X \ xj. SSC finds a sparse linear representation of xj in terms of points in X \ xj via ℓ1-minimization. We are now ready to state our main performance results.

3 Performance results for the noiseless case

In order to elicit the impact of the relative orientations of the subspaces Sl on the performance of TSC, we take the Sl to be deterministic and choose the points within the Sl randomly. To this end, we represent the data points in Sl by

$$y_j^{(l)} = U^{(l)} a_j^{(l)} \qquad (2)$$

where U^(l) ∈ R^{m×dl} is a basis for the dl-dimensional subspace Sl, and the a_j^(l) ∈ R^{dl} are random. Unless explicitly stated otherwise, we take each matrix U^(l) to be deterministic and orthonormal. We will consider the following two data models.

1. The a_j^(l) are i.i.d. uniformly distributed on S^{dl−1} (throughout the paper, whenever we say that the a_j^(l) or e_j^(l) are i.i.d., we actually mean i.i.d. across j and l). Since each U^(l) is orthonormal, the data points y_j^(l) = U^(l) a_j^(l) are distributed uniformly on {y ∈ Sl : ‖y‖_2 = 1}.

2. The a_j^(l) are i.i.d. N(0, (1/dl) I_{dl}). As the corresponding direction vectors $y_j^{(l)}/\|y_j^{(l)}\|_2$ are distributed uniformly on {y ∈ Sl : ‖y‖_2 = 1} and $\|y_j^{(l)}\|_2^2 = \|a_j^{(l)}\|_2^2$ concentrates around its expectation $\mathbb{E}[\|a_j^{(l)}\|_2^2] = 1$, this model is conceptually equivalent to the first model. The raison d'être for this model is that it is sometimes analytically more tractable.
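For the numerical experiments in Section 8 it is useful to have a generator for data drawn according to the second model. The sketch below is one way to do this; the function name and the choice of obtaining orthonormal bases from a QR decomposition of a Gaussian matrix are ours.

    import numpy as np

    def sample_union_of_subspaces(m, dims, n_per_subspace, rng=None):
        """Draw n points from each subspace S_l according to data model 2:
        y_j^(l) = U^(l) a_j^(l) with a_j^(l) ~ N(0, (1/d_l) I_{d_l})."""
        rng = np.random.default_rng() if rng is None else rng
        points, labels = [], []
        for l, d in enumerate(dims):
            # deterministic orthonormal basis U^(l), here obtained via QR
            U, _ = np.linalg.qr(rng.standard_normal((m, d)))
            A = rng.standard_normal((d, n_per_subspace)) / np.sqrt(d)  # a_j^(l)
            points.append(U @ A)
            labels += [l] * n_per_subspace
        return np.hstack(points), np.array(labels)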

We first consider clustering of noise-free data sets that have no outliers. Before stating concrete results, let us start with a back-of-the-envelope calculation to get an idea of the nature of the performance guarantees we can expect to find. For ease of exposition, we set dl = d and nl = |Xl| = n, for all l. We take the a_i^(l) to be i.i.d. N(0, (1/d) I_d) distributed. By Definition 1, the subspace detection property can only be satisfied (for a given q) if Aij ≠ 0 only if xi and xj belong to the same set Xl. This is the case, if for each x_i^(l) ∈ Xl, and for each Xl, the set T_i (associated with x_i^(l)) corresponds to points in Xl only. Consider a specific x_i^(l) with associated set T_i and define $z_j^{(k)} := \langle x_j^{(k)}, x_i^{(l)} \rangle$. Note that, for simplicity of exposition, the notation z_j^(k) does not reflect dependence on x_i^(l). From Step 1 in the TSC algorithm it follows that T_i corresponds to points in Xl only if

$$z_{(n-q)}^{(l)} > \max_{k \neq l,\, j} z_j^{(k)} \qquad (3)$$

where the order statistics $z_{(1)}^{(l)} \le z_{(2)}^{(l)} \le \dots \le z_{(n-1)}^{(l)}$ are defined by sorting the $\{z_j^{(l)}\}_{j \in [n] \setminus i}$ in ascending order. We seek a sufficient condition on the relative orientations of the subspaces Sl and the number of points from each subspace contained in X, for (3) to hold. Conditioned on x_i^(l), and hence on a_i^(l), $\max_{k \neq l, j} z_j^{(k)}$ is the maximum of the absolute values of independent Gaussian random variables with variances $\|U^{(k)T} U^{(l)} a_i^{(l)}\|_2^2$ and is thus with high probability smaller than (see Lemma 5 in Appendix G)

$$c' \max_{k \neq l} \big\|U^{(k)T} U^{(l)} a_i^{(l)}\big\|_2 \sqrt{\log n + \log(L-1)} \qquad (4)$$

with an absolute constant c'. On the other hand, conditioned on x_i^(l), the random variable $z_{(n-q)}^{(l)}$ is, with high probability, larger than (see Lemma 4 in Appendix G)

$$\tilde{c}\, \big\|U^{(l)T} U^{(l)} a_i^{(l)}\big\|_2 \sqrt{\log \frac{n-1}{q-1}} = \tilde{c}\, \big\|a_i^{(l)}\big\|_2 \sqrt{\log(n-1) - \log(q-1)} \qquad (5)$$

where c̃ is an absolute constant. We can hence expect TSC to succeed (with high probability) if (4) is smaller than (5) for all a_i^(l), which leads to the condition

$$\max_{k,l:\, k \neq l} \frac{\big\|U^{(k)T} U^{(l)} a_i^{(l)}\big\|_2}{\big\|a_i^{(l)}\big\|_2} < \frac{\tilde{c}\sqrt{\log(n-1) - \log(q-1)}}{c'\sqrt{\log n + \log(L-1)}}, \quad \text{for all } i \in [n]. \qquad (6)$$

Indeed, the rigorous performance guarantees for TSC stated below depend on the relative orientations of the subspaces Sl through the quantity on the left hand side (LHS) of (6). Specifically, the more distinct the relative orientations of the subspaces, the easier it becomes to satisfy (6). Inspired by the LHS of (6), we define two different notions of affinity between subspaces, on which the formal recovery guarantees stated below will be seen to depend, namely

$$\mathrm{affp}(S_k, S_l) := \big\|U^{(k)T} U^{(l)}\big\|_{2 \to 2} \quad \text{and} \quad \mathrm{aff}(S_k, S_l) := \frac{1}{\sqrt{d_k \wedge d_l}} \big\|U^{(k)T} U^{(l)}\big\|_F.$$

Note that the affinity notion [27, Definition 2.6] and [28, Definition 1.2], relevant to the analysis of SSC and RSSC, is equivalent to aff(·, ·). The relation between the affinity notions affp(·) and aff(·) is brought out by expressing them in terms of principal angles between Sk and Sl according to

$$\mathrm{affp}(S_k, S_l) = \cos(\theta_1) \qquad (7)$$

while

$$\mathrm{aff}(S_k, S_l) = \frac{\sqrt{\cos^2(\theta_1) + \dots + \cos^2(\theta_{d_k \wedge d_l})}}{\sqrt{d_k \wedge d_l}} \qquad (8)$$

where θ_1, ..., θ_{d_k∧d_l} with 0 ≤ θ_1 ≤ ... ≤ θ_{d_k∧d_l} ≤ π/2 denote the principal angles between Sk and Sl, defined as follows.

Definition 2. The principal angles θ_1, ..., θ_{d_k∧d_l} between the subspaces Sk and Sl are defined recursively according to cos(θ_j) = ⟨v_j, u_j⟩, where (v_j, u_j) = arg max ⟨v, u⟩ with the maximization carried out over all v ∈ Sk: ‖v‖_2 = 1, u ∈ Sl: ‖u‖_2 = 1, subject to ⟨v, v_i⟩ = 0 and ⟨u, u_i⟩ = 0 for all i = 1, ..., j − 1 (for j = 1, this constraint is void).
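Since the singular values of U^(k)T U^(l) are exactly the cosines of the principal angles, both affinity notions can be computed directly from an SVD. A small sketch (assuming orthonormal bases as inputs; the function name is ours):

    import numpy as np

    def affinities(U_k, U_l):
        """affp and aff between span(U_k) and span(U_l), for orthonormal U_k, U_l.
        The singular values of U_k^T U_l equal cos(theta_1) >= ... >= cos(theta_{d_k ^ d_l})."""
        cosines = np.linalg.svd(U_k.T @ U_l, compute_uv=False)
        affp = cosines[0]                                        # Eq. (7): largest cosine
        aff = np.linalg.norm(cosines) / np.sqrt(min(U_k.shape[1], U_l.shape[1]))  # Eq. (8)
        return affp, aff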

Note that 0 ≤ aff(Sk, Sl) ≤ affp(Sk, Sl) ≤ 1. If Sk and Sl intersect in p dimensions, i.e., if Sk ∩ Sl is p-dimensional, then cos(θ_1) = ... = cos(θ_p) = 1 [10]. Hence, if Sk and Sl intersect in p ≥ 1 dimensions we have affp(Sk, Sl) = 1 and aff(Sk, Sl) ≥ $\sqrt{p/(d_k \wedge d_l)}$. We are now ready to state our first main result.

Theorem 1. Suppose that Xl, l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 10/3, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l), j ∈ [nl], where the a_j^(l) are i.i.d. N(0, (1/dl) I_{dl}), and set X = X1 ∪ ... ∪ XL. If

$$\max_{k,l:\, k \neq l} \mathrm{affp}(S_k, S_l) \le c_1 \frac{\sqrt{\log \rho_{\min}}}{\sqrt{\log N}} \qquad (9)$$

with ρ_min = min_l ρ_l, then the subspace detection property holds (for X) with probability at least

$$1 - \sum_{l \in [L]} n_l \left( e^{-c_2 (n_l - 1)} + 2 \Big( \sum_{k \neq l} n_k \Big)^{-2} \right)$$

where c_1 and c_2 are absolute constants satisfying 0 < c_1, c_2 < 1.

Theorem 1 states that TSC succeeds with high probability if max_{k,l: k≠l} affp(Sk, Sl) is sufficiently small, and if X contains sufficiently many points from each subspace (i.e., ρl = nl/q ≥ 10/3, for all l). Intuitively, we expect that clustering becomes easier when the nl increase. To see that Theorem 1 confirms this intuition, set nl = n, for all l, and observe that the probability of success in Theorem 1 increases in n; at the same time, for a fixed number of subspaces L, the RHS of (9) increases in ρ_min (recall that nl = ρl q), and therefore (9) becomes easier to satisfy. Theorem 1 does not impose any restrictions on the dimensions of the subspaces; the only dependence on the subspaces Sl is via the affinity in the clustering condition (9). It is interesting to see that the clustering condition (9) implies (6), the condition derived in our back-of-the-envelope calculation. Note that Theorem 1 does not apply to subspaces that intersect as affp(Sk, Sl) = 1 in that case and the RHS of (9) is strictly smaller than 1. We next present a result analogous to Theorem 1 that applies to intersecting subspaces.

Theorem 2. Suppose that Xl, l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 6, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l), j ∈ [nl], where the a_j^(l) are i.i.d. uniform on S^{dl−1}, and set X = X1 ∪ ... ∪ XL. If

$$\max_{k,l:\, k \neq l} \mathrm{aff}(S_k, S_l) \le \frac{1}{15 \log N} \qquad (10)$$

with m ≥ 6 log N, then the subspace detection property holds (for X) with probability at least $1 - 10/N - \sum_{l \in [L]} n_l e^{-c(n_l - 1)}$, where c > 0 is an absolute constant.

The interpretation of Theorem 2 is analogous to that of Theorem 1 with the important difference that the RHS of (10), as opposed to the RHS of (9), decreases, albeit very slowly, in the nl since N = Σ_l nl. For SSC a result in the spirit of Theorem 2 was reported in [27, Theorem 2.8].

4 Impact of noise

In many practical applications the data points to be clustered are corrupted by measurement noise, typically modeled as additive Gaussian noise. Other noise sources require a structured noise model,

that describes, e.g., occlusions of images in face recognition [37], mistracking of feature points in motion segmentation [25], or corruptions in images. We will deal both with unstructured and structured noise and, as in the last section, take the subspaces to be deterministic and the data points in the subspaces to be random.

4.1 Unstructured (additive white Gaussian) noise

Theorem 3. Suppose that Xl, l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 6, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l) + e_j^(l), j ∈ [nl], where the a_j^(l) are i.i.d. uniform on S^{dl−1} and the e_j^(l) are i.i.d. N(0, (σ²/m) I_m). Set X = X1 ∪ ... ∪ XL and let d_max = max_l dl. If

$$\max_{k,l:\, k \neq l} \mathrm{aff}(S_k, S_l) + \frac{\sigma(1+\sigma)}{\sqrt{\log N}} \sqrt{\frac{d_{\max}}{m}} \le \frac{1}{15 \log N} \qquad (11)$$

with m ≥ 6 log N, then the subspace detection property holds (for X) with probability at least $1 - \frac{10}{N} - \sum_{l \in [L]} n_l e^{-c(n_l - 1)}$, where c > 0 is an absolute constant.

Theorem 3 states that TSC succeeds with high probability if the additive noise variance and the affinities between the subspaces are sufficiently small, and if X contains sufficiently many points from each subspace. Condition (11) nicely reflects the intuition that the more distinct the orientations of the subspaces, the more noise TSC tolerates. What is more, Condition (11) reveals that TSC can succeed even when the noise variance σ² is large, i.e., when $\sigma^2 = \mathbb{E}[\|e_j^{(l)}\|_2^2] > \|U^{(l)} a_j^{(l)}\|_2^2 = 1$, provided that d_max/m is sufficiently small. This means that TSC can succeed even under massive noise, provided the dimensions of the subspaces are not too large.

The intuition behind the factor $\sigma(1+\sigma)\sqrt{d_{\max}/m}$ in (11) is as follows. Assume, for simplicity, that dl = d, for all l, and consider the most favorable situation of subspaces that are orthogonal to each other, i.e., aff(Sk, Sl) = 0, for all k ≠ l. TSC relies on the inner products between points within a given subspace to typically be larger than the inner products between points in distinct subspaces. First, note that ⟨xj, xi⟩ = ⟨yj, yi⟩ + ⟨ej, ei⟩ + ⟨yj, ei⟩ + ⟨ej, yi⟩. Then, under the statistical data model of Theorem 3, we have $\mathbb{E}[|\langle y_j, y_i \rangle|^2]^{1/2} = \frac{1}{\sqrt{d}}$ if yj, yi ∈ Sl and ⟨yj, yi⟩ = 0 if yj ∈ Sk and yi ∈ Sl, with k ≠ l. When the terms ⟨ej, ei⟩, ⟨yj, ei⟩, and ⟨ej, yi⟩ are small relative to $\frac{1}{\sqrt{d}}$, we have a margin on the order of $\frac{1}{\sqrt{d}}$ to separate points within a given cluster from points in other clusters. Indeed, if $\frac{\sigma}{\sqrt{m}}$ is small relative to $\frac{1}{\sqrt{d}}$, ⟨yj, ei⟩ and ⟨ej, yi⟩ are (sufficiently) small, while $\frac{\sigma^2}{\sqrt{m}}$ being small relative to $\frac{1}{\sqrt{d}}$ ensures that ⟨ej, ei⟩ is (sufficiently) small. These two conditions are satisfied when $\sigma(1+\sigma)\sqrt{d/m}$ is (sufficiently) small.

4.2 Structured noise

Examples of noise sources which require a structured noise model are as follows. In face recognition problems errors due to occlusion and corruption are common [37]. In motion segmentation trajectories may be corrupted due to feature points that are mistracked [25]. The resulting errors in these two examples can be modeled as corruptions that are sparse in the identity basis [37, 25]. In the ensuing analysis we will be a bit more general by assuming that the data points are corrupted by errors that are sparse in a general orthonormal basis C ∈ R^{m×m} with Gaussian coefficient vectors. We consider two settings. First, we take both the orientations of the subspaces as well as the sparse support of the error signal to be random. Specifically, we will assume that the U^(l) are Gaussian random matrices, which ensures that the U^(l) are approximately orthonormal with high probability. Second, we consider the case of deterministic subspaces and sparse error signals with deterministic support. We start with our result on the former setting.

Theorem 4. Suppose that Xl, l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 6, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l) + C_j^(l) c_j^(l), j ∈ [nl], where the a_j^(l) ∈ R^d are i.i.d. uniform on S^{d−1} and the c_j^(l) ∈ R^s are i.i.d. N(0, (σ²/s) I_s). Let the entries of each matrix U^(l) ∈ R^{m×d} be i.i.d. N(0, 1/m) and suppose that the C_j^(l) ∈ R^{m×s} are chosen uniformly at random from all submatrices of C ∈ R^{m×m} with s columns, where C is orthonormal. Set X = X1 ∪ ... ∪ XL. If

$$\frac{c_3}{\sqrt{m}} \sqrt{3d + \log L} + 4 \sqrt{\frac{d}{m}}\, \sigma(2+\sigma) \le \frac{1}{12 \log N} \qquad (12)$$

then the subspace detection property is satisfied (for X) with probability at least

$$1 - \frac{12}{N} - N e^{-c(n-1)} - N^2 e^{-\frac{s^2}{2}\frac{m}{m-s+1}} - \frac{N^2}{m^2} e^{-18\,ds} - 4 e^{-c_1 m} \qquad (13)$$

where c, c_1, c_3 > 0 are absolute constants.

The first term in the clustering condition (12) accounts for the affinity between the random subspaces, the second term accounts for noise and is structurally equivalent to the corresponding term in the clustering condition for the case of additive white Gaussian noise (11). Since s must be large for the probability estimate (13) to be close to one, our result suggests that sparse errors can be detrimental. On the other hand, since the term in the clustering condition (12) accounting for the noise is structurally equivalent to (11), we expect structured noise according to the statistical model in Theorem 4 to have an effect similar to that of white Gaussian noise, as long as s is sufficiently large. This result should, however, be taken with a grain of salt, as choosing the U^(l) Gaussian ensures that for each pair (U^(l), C_j^(l)), the matrices U^(l) and C_j^(l) are close to orthogonal to each other, and, in addition, that aff(Sk, Sl), for k ≠ l, is close to zero, with high probability. This is a favorable situation for subspace clustering, which one is unlikely to encounter in practice. We therefore next consider the case of deterministic subspaces and deterministic C_j^(l).

Theorem 5. Suppose that Xl, l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 6, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l) + C_j^(l) c_j^(l), j ∈ [nl], where the a_j^(l) ∈ R^{dl} are i.i.d. uniform on S^{dl−1}, the c_j^(l) ∈ R^s are i.i.d. N(0, (σ²/s) I_s), and the C_j^(l) ∈ R^{m×s} are orthonormal. Set X = X1 ∪ ... ∪ XL and let d_max = max_l dl. If

$$\max_{k \neq l} \mathrm{aff}(S_k, S_l) + 2\sigma \frac{\sqrt{d_{\max}}}{\sqrt{s}} (\sigma + 2) \le \frac{1}{12 \log N} \qquad (14)$$

then the subspace detection property is satisfied (for X) with probability at least $1 - \frac{12}{N} - \sum_{l \in [L]} n_l e^{-c(n_l - 1)}$, where c > 0 is an absolute constant.

Observe that the clustering condition (14) is structurally equivalent to (11) with d/m substituted by d/s. Again, this result suggests that sparse errors can be detrimental. Moreover, it also suggests that more unstructured noise is better tolerated by TSC. This will be confirmed in the numerical results in Section 8.

5 Incomplete data and impact of subspace dimension

In practical applications the data points to be clustered are often incompletely observed, think of, e.g., images that exhibit scratches or have missing parts. We next investigate the impact of incompletely observed data points on the performance of TSC. We chose to work in the original m-dimensional signal space and set the missing entries in each data vector to zero. As the TSC algorithm depends on inner products between the data points only, this ensures that the missing observations result in contributions equal to zero. Understanding the impact of incompletely observed data points on clustering performance is obviously of significant importance. The literature seems, however, essentially void of corresponding analytical results. We start by noting that in the deterministic subspace setting any such result will necessarily depend on the specific orientations of the subspaces. To simplify things, we therefore assume both the orientations of the subspaces as well as the points in the subspaces to be random. Specifically, we will again assume that the U^(l) are Gaussian random matrices, which ensures that the U^(l) are approximately orthonormal with high probability. For simplicity of exposition, we furthermore take all subspaces to have equal dimension d and let the number of points in each of the subspaces be n.

Theorem 6. Suppose that Xl, l ∈ [L], is obtained by choosing n = ρq, with ρ ≥ 10/3, points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l), j ∈ [n], where the a_j^(l) are i.i.d. N(0, (1/d) I_d), and set X = X1 ∪ ... ∪ XL. Assume that in each xj ∈ X up to s entries (possibly different for different xj) are set to 0. Let the entries of each U^(l) ∈ R^{m×d} be i.i.d. N(0, 1/m). If

$$m \ge c_4 \frac{\log N}{\log \rho} \left( 3d + \log L + s \log\Big(\frac{me}{2s}\Big) \right) + s c_3 \qquad (15)$$

then the subspace detection property holds (for X) with probability at least $1 - Lne^{-c_2(n-1)} - \frac{2L}{(L-1)^2 n} - 4e^{-c_1 m}$, where c_1, c_2, c_3, c_4 > 0 are absolute constants. If s = 0, (15) reduces to $m \ge c_4 \frac{\log N}{\log \rho} (3d + \log L)$.

Strikingly, Theorem 6 shows that the number of missing entries in the data vectors is allowed to scale (up to a log-factor) linearly in the ambient dimension.

Impact of subspace dimension
We can furthermore conclude, from Theorem 6, that TSC succeeds with high probability even when the dimensions of the subspaces scale (up to a log-factor) linearly in the ambient dimension. Drawing such a conclusion from Theorems 1 or 2, which are based on deterministic subspaces, seems difficult as the relation between m, d, and L is implicit in the affinity measures. These findings should, however, be taken with a grain of salt as the fully random subspace model ensures that the subspaces are approximately orthogonal to each other with high probability. In the case of s = 0, a result for SSC, analogous to Theorem 6, was reported in [27, Thm. 1.2]. We note, however, that [27, Theorem 1.2] assumes the U^(l) to be chosen uniformly from the set of all orthonormal matrices in R^{m×d}, whereas in Theorem 6 the U^(l) are Gaussian. Conceptually, these two models for the U^(l) are equivalent.
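For completeness, the zero-filling of missing entries described at the beginning of this section amounts to the following small helper (the mask convention is ours); TSC is then run on the zero-filled data without further changes.

    import numpy as np

    def zero_fill(X, missing):
        """Replace unobserved entries (marked True in `missing`) by zero, so that
        they drop out of the inner products TSC is based on."""
        return np.where(missing, 0.0, X)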

6 Outlier detection

We discuss the noisy and noise-free cases separately as the corresponding statistical models for the outliers differ slightly; moreover, the proof for the noiseless case is very simple and insightful and thus warrants an individual presentation.

6.1 Noise-free case

Outliers are data points that do not lie in one of the low-dimensional subspaces Sl and have no low-dimensional linear structure. Here, this is conceptualized by assuming random outliers distributed uniformly on the unit sphere of R^m. The outlier detection criterion we employ is based on the following observation. The maximum inner product between an outlier and any other point (outlier or inlier) in X is, with high probability, smaller than $c\sqrt{\log N}/\sqrt{m}$. We therefore classify xj as an outlier if

$$\max_{i \in [N] \setminus j} |\langle x_i, x_j \rangle| < c \sqrt{\log N}/\sqrt{m}. \qquad (16)$$

The maximum inner product between any point xj ∈ Xl and the points in Xl \ xj is unlikely to be smaller than $1/\sqrt{d_{\max}}$. Hence, an inlier is unlikely to be misclassified as an outlier if $c\sqrt{\log N}/\sqrt{m} \le 1/\sqrt{d_{\max}}$, i.e., if d_max/m is sufficiently small.

Theorem 7. Suppose that the set of outliers, O, is obtained by choosing N_0 outliers i.i.d. uniformly on S^{m−1}, and that Xl, l ∈ [L], is obtained by choosing nl points corresponding to Sl at random according to x_j^(l) = U^(l) a_j^(l), j ∈ [nl], where the a_j^(l) are i.i.d. uniform on S^{dl−1} and each U^(l) is deterministic and orthonormal. Set X = X1 ∪ ... ∪ XL ∪ O and declare xj ∈ X to be an outlier if (16) holds with c = √6. Then, with N = Σ_l nl + N_0, every outlier is detected with probability at least 1 − 2N_0/N². Furthermore, with d_max = max_l dl, provided that

$$\frac{d_{\max}}{m} \le \frac{1}{6 \log N} \qquad (17)$$

with probability at least $1 - n_l e^{-\frac{1}{2} \log(\frac{\pi}{2})(n_l - 1)}$ no inlier in Sl is misclassified as an outlier.

Theorem 7 states that, provided Condition (17) holds, and the set X contains sufficiently many points from each subspace, outlier detection succeeds with high probability, i.e., every outlier is detected and no inlier is misclassified as an outlier. The dependence on nl is intuitive. If nl = 1, there is no way of distinguishing the single inlier in Sl from the outliers. Since (17) can be rewritten as $N_0 \le e^{\frac{m}{6 d_{\max}}} - \sum_l n_l$, it follows that outlier detection succeeds even if the number of outliers scales exponentially in m/d_max, i.e., if d_max is kept constant, exponentially in the ambient dimension! Note that this result does not make any assumptions on the orientations of the subspaces Sl. The outlier detection scheme proposed in [27] in the context of SSC allows to identify outliers under a very similar condition. However, the scheme in [27] requires the solution of N ℓ1-minimization problems, each in N variables, while the algorithm proposed here needs to compute N² inner products followed by thresholding only.
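A sketch of the detection rule (16), with c = √6 as in Theorem 7, for normalized data points stacked as columns of X (the function name and the return convention are ours):

    import numpy as np

    def detect_outliers(X, c=np.sqrt(6.0)):
        """Declare x_j an outlier if max_{i != j} |<x_i, x_j>| < c * sqrt(log N) / sqrt(m).
        X: (m, N) array with (approximately) unit-norm columns."""
        m, N = X.shape
        G = np.abs(X.T @ X)
        np.fill_diagonal(G, 0.0)                       # exclude the point itself
        threshold = c * np.sqrt(np.log(N)) / np.sqrt(m)
        return G.max(axis=0) < threshold               # boolean vector marking detected outliers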

6.2 Noisy case

We next consider outlier detection under additive noise. To simplify the analysis, we change our outlier model slightly. Specifically, we assume the outliers to be distributed as N(0, (1/m) I_m). Conceptually, this outlier model is equivalent to that used in Section 6.1, as the directions of the outliers in the present model, i.e., $x_i/\|x_i\|_2$, are uniformly distributed on S^{m−1}, and $\|x_i\|_2$ concentrates around 1. We furthermore employ a normalization which ensures that the inliers also have about unit norm. This guarantees that outlier detection is not trivially accomplished by exploiting differences in the norms between inliers and outliers.


Theorem 8. Suppose that O is obtained by choosing N_0 outliers i.i.d. N(0, (1/m) I_m), and that Xl, l ∈ [L], is obtained by choosing nl points corresponding to Sl at random according to $x_j^{(l)} = \frac{1}{\sqrt{1+\sigma^2}}\big(U^{(l)} a_j^{(l)} + e_j^{(l)}\big)$, j ∈ [nl], where the a_j^(l) are i.i.d. uniform on S^{dl−1} and the e_j^(l) are i.i.d. N(0, (σ²/m) I_m) and independent of the a_j^(l). Set X = X1 ∪ ... ∪ XL ∪ O and declare xj ∈ X to be an outlier if (16) holds with c = 2.3√6. Then, with N = Σ_l nl + N_0, assuming m ≥ 6 log N, every outlier is detected with probability at least 1 − 3N_0/N². Furthermore, with d_max = max_l dl, provided that

$$\frac{d_{\max}}{m} \le \frac{c_1}{(1+\sigma^2)^2 \log N} \qquad (18)$$

where c_1 is an absolute constant, with probability at least $1 - n_l e^{-\frac{1}{2}\log(\frac{\pi}{2})(n_l-1)} - n_l^2 \frac{7}{N^3}$ no inlier belonging to Sl is misclassified as an outlier.

Theorem 8 states that, provided Condition (18) holds, and X contains sufficiently many points from each subspace, outlier detection succeeds with high probability. Our result furthermore shows that outlier detection can succeed even when σ 2 is large, provided that dmax /m is sufficiently small. An outlier detection scheme for RSSC does not seem to be available.

7 Comparison with SSC/RSSC and other algorithms

As mentioned in the introduction, there are very few clustering algorithms that are both computationally tractable and succeed provably under non-restrictive conditions. Notable exceptions are the SSC algorithm [27], and for the noisy case the RSSC algorithm [28]. Since our results are in the spirit of those for SSC and RSSC—in particular we use the same statistical data model—we next compare our findings to those in [27, 28]. While SSC and RSSC employ a "global" criterion for building the adjacency matrix A by sparsely representing each data point in terms of all the other data points through ℓ1-minimization, TSC is based on a "local" criterion, namely the comparison of inner products of pairs of data points. This makes TSC computationally much less demanding than SSC and RSSC, while, perhaps surprisingly, sharing the performance guarantees and clustering conditions of SSC and RSSC. Specifically, for SSC in the noiseless case and RSSC in the noisy case, results analogous to Theorems 2 and 3 were reported in [27, Theorem 2.8] and [28, Theorem 3.1], respectively. The clustering condition for SSC in [27, Theorem 2.8] is identical (up to constants and log-factors) to (10), and the clustering condition for RSSC in [28] is identical (up to constants and log-factors) to (11) with σ(1 + σ) in (11) replaced by σ. We note, however, that [28] requires σ to be bounded in the sense of σ ≤ c, for some constant c, an assumption not needed here. If we took σ to be bounded, i.e., σ ≤ c, the factor σ(1 + σ) in Condition (11) above would be replaced by σ(1 + c) and we would get a condition that is equivalent (up to constants and log-factors) to that in [28]. In [28] σ is assumed to be bounded in the sense of σ ≤ c to ensure sufficiently many "true discoveries" in each row/column in the affinity matrix; (i, j) is said to be a true discovery if Aij ≠ 0 for xi and xj belonging to the same subspace. Condition (11) in Theorem 3 guarantees that the subspace detection property is satisfied, and thus ensures (see ii. in Definition 1) q true discoveries in each row/column. Owing to the simplicity of TSC, the proofs in this paper are conceptually and technically less involved than the proofs of the corresponding results for SSC and RSSC in [27, 28]. Finally, we provide performance guarantees for clustering of incompletely observed data points and structured noise. Corresponding results for SSC and RSSC do not seem to be available.

A comparison of the analytical performance results for RSSC (in particular [28, Theorem 3.1]) to those for a number of representative subspace clustering approaches such as generalized PCA (GPCA) [35], the K-flats algorithm [32], and LRR [24], can be found in [28, Section 5]. In addition, this comparison also features computational tractability and robustness aspects. As the main analytical performance results for TSC are structurally equivalent to those for SSC and RSSC the conclusions drawn in the comparison in [28, Section 5] carry over to TSC.

8 Numerical results

We use the following performance metrics.

• The clustering error (CE) measures the number of misclassified points relative to the total number of points and is defined as follows. Denote the estimate of the number of subspaces by L̂ and note that possibly L̂ ≠ L. Let c ∈ [L]^N and ĉ ∈ [L̂]^N be the original and estimated assignments of the points in X to the individual clusters. The clustering error is given by

$$\mathrm{CE}(\hat{c}, c) = \min_{\pi} \left( 1 - \frac{1}{N} \sum_{i=1}^{N} 1_{\{\pi(c_i) = \hat{c}_i\}} \right)$$

where the minimum is with respect to all assignments π : [L] → [L̂] (for L̂ = L, π is simply a permutation). Note that π naturally appears in this definition as the specific cluster indices are irrelevant to the clustering error. The problem of finding the optimal assignment π can be cast as finding the maximal matching of a weighted bipartite graph, which can be solved efficiently via the Hungarian algorithm [30].

• The error in estimating the number of subspaces L is denoted as EL and takes the value 0 if the estimate L̂ is correct, 1 if L < L̂, and −1 if L > L̂. We employ a signed error measure to be able to discriminate between under- and overestimation. In principle, EL averaged over several problem instances may equal zero, while L̂ ≠ L for each individual problem instance. However, as it turns out (in the numerical results below), for a given choice of problem parameters, we get that either L < L̂ or L > L̂ almost consistently.

• The feature detection error (FDE) is defined as

$$\mathrm{FDE}(A) = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \|b_i^{x}\|_2 / \|b_i\|_2 \right)$$

where b_i is the ith column of the adjacency matrix A and b_i^x is the vector containing the entries of b_i corresponding to the subspace x_i lives in. The FDE measures to which extent points from different subspaces are connected in the graph with adjacency matrix A, and equals zero if the subspace detection property holds.
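For reference, the CE can be computed by solving the underlying assignment problem with the Hungarian algorithm (available in SciPy as linear_sum_assignment), and the FDE follows directly from its definition. The sketch below uses our own function names and assumes integer label vectors and a symmetric adjacency matrix.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_error(labels_est, labels_true):
        """CE: fraction of misclassified points under the best matching pi between
        estimated and true cluster indices (Hungarian algorithm)."""
        labels_est = np.asarray(labels_est)
        labels_true = np.asarray(labels_true)
        N = labels_true.size
        est_ids, true_ids = np.unique(labels_est), np.unique(labels_true)
        # overlap[a, b] = number of points with estimated label a and true label b
        overlap = np.array([[np.sum((labels_est == a) & (labels_true == b))
                             for b in true_ids] for a in est_ids])
        row, col = linear_sum_assignment(-overlap)       # maximize total overlap
        return 1.0 - overlap[row, col].sum() / N

    def feature_detection_error(A, labels_true):
        """FDE: average fraction of adjacency weight leaking to other subspaces."""
        labels_true = np.asarray(labels_true)
        same = labels_true[:, None] == labels_true[None, :]
        col_norms = np.linalg.norm(A, axis=0)
        within = np.linalg.norm(np.where(same, A, 0.0), axis=0)
        ratio = np.divide(within, col_norms,
                          out=np.ones_like(within), where=col_norms > 0)
        return float(np.mean(1.0 - ratio))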

8.1 Synthetic data

Throughout this section, unless explicitly stated otherwise, we set, for simplicity, nl = n and dl = d, for all l, and generate the d-dimensional subspaces Sl by drawing independently (across l) a corresponding orthonormal basis U^(l) ∈ R^{m×d} uniformly at random from the set of all orthonormal matrices in R^{m×d}.

Figure 1: Clustering error metrics (FDE, CE, EL) as a function of q on the vertical and n on the horizontal axis.

8.1.1 Impact of the choice of q

The goal of the first experiment is to demonstrate, as stated in Section 2, that TSC is not overly sensitive to the exact choice of q. We generate L = 10 subspaces in R^50, each of dimension d = 5, and vary the number of points in each subspace n, as well as the parameter q. For a given choice of subspaces, the data points are chosen according to the statistical model Theorem 2 is based on. For each pair (n, q) we average over 100 problem instances. The results, summarized in Figure 1, show that if q is too small (i.e., q ≤ 3 in this experiment), TSC fails, even though the FDE is close to zero. This is a consequence of too small q leading to overestimation of L, which, in turn, results in points belonging to a single subspace being assigned to distinct clusters. For q > n, TSC can fail because points from different subspaces may be assigned to a single cluster. When n is sufficiently large (for this experiment n ≥ 3d) TSC is seen to succeed for q ∈ {4, ..., 30}. This shows that TSC is not overly sensitive to the exact choice of q.

8.1.2 Intersection of subspaces

We next demonstrate that TSC can succeed even when the subspaces Sl intersect, as proven in Theorem 2. To facilitate comparison to SSC, we perform the same experiment as in [27, Sec. 5.1.2]. Specifically, we set m = 200, d = 10, q = d, and generate two subspaces, S1 and S2, at random through their defining bases U^(1) and U^(2) as follows. We choose, uniformly at random, a set of 2d − t orthonormal vectors in R^m, and identify the columns of U^(1) and U^(2) with the last and first d of those vectors, respectively. This ensures that the intersection of S1 and S2 is of dimension t. Next, we generate n = 20d data points in each of the two subspaces according to x_i^(l) = U^(l) a_i^(l), with the a_i^(l) drawn i.i.d. uniformly on S^{d−1}. For each t = 0, ..., d the CE, EL, and FDE are obtained by averaging over 100 problem instances. The results, shown in Figure 2, allow us to conclude that, as long as the dimension of the intersection of the subspaces is not too large, TSC does, indeed, succeed. An analogous experiment was performed for SSC in [27, Sec. 5.1.2] and delivered comparable results.
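A sketch of this construction (function name ours): draw 2d − t orthonormal vectors and let the two bases share t of them, which makes the intersection of the two spans t-dimensional.

    import numpy as np

    def two_subspaces_with_intersection(m, d, t, rng=None):
        """Return orthonormal bases U1, U2 of two d-dimensional subspaces of R^m
        whose intersection has dimension t (the bases share t basis vectors)."""
        rng = np.random.default_rng() if rng is None else rng
        V, _ = np.linalg.qr(rng.standard_normal((m, 2 * d - t)))  # 2d - t orthonormal vectors
        U1 = V[:, :d]           # first d vectors
        U2 = V[:, d - t:]       # last d vectors; t shared columns => t-dimensional intersection
        return U1, U2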

8.1.3 Influence of d, n, and incomplete data

The goal of the next experiment is to elicit the dependence of clustering performance on d, n, and the number of missing entries in the data points, and to furthermore demonstrate that TSC can succeed even when the subspace detection property is not satisfied strictly. We generate L = 10 subspaces of R^50, set q = d, and vary the dimension d of the subspaces and the number of points in each subspace n = dγ by varying the parameter γ. The data points are chosen according to the statistical model Theorem 2 is based on. For each pair (d, γ), the FDE, CE, and EL are obtained by averaging over 10 problem instances. The results, depicted in Figure 3, show, as indicated in Section 2, that TSC can, indeed, succeed even when the subspace detection property does not hold strictly. Finally, we perform the same experiment, but set the entries of xi with indices in Di to zero, where the sets Di are chosen uniformly at random from the set {D ⊆ [m] : |D| = s} and independently for each xi. The results, summarized in Figure 4, show that TSC succeeds even when a large fraction of the entries in each data vector is missing.

Figure 2: Clustering error metrics (FDE, CE, EL) as a function of the dimension of the intersection t.

Figure 3: Clustering error metrics (FDE, CE, EL) as a function of the dimension of the subspaces, d, on the vertical and γ on the horizontal axis.

8.1.4 Unstructured noise

We next perform the same experiment as in Section 8.1.3, but subject the data points to additive noise before clustering. Specifically, the data points are chosen according to the statistical model Theorem 3 is based on. The results, depicted in Figure 5, show that TSC can succeed even when the noise variance is large, provided that n is sufficiently large. In Section 4.1, we found that TSC can succeed even under massive noise (i.e., if σ² > 1), provided that d/m is sufficiently small. To demonstrate this effect numerically, we generate L = 5 subspaces in R^100, each of dimension d = 2 (hence d/m = 1/50), and choose the data points again according to the statistical model Theorem 3 is based on. We vary the number of points in each subspace n = dγ by varying γ, and we also vary the noise variance. Moreover, we set q = 0.65n. The results, depicted in Figure 6, show that, indeed, even when σ² > 1, TSC can succeed.

Figure 4: CE as a function of the dimension of the subspaces, d, on the vertical and γ on the horizontal axis for s missing entries in the data vectors (panels: s = 0, 5, 10, 15, 20).

Figure 5: CE as a function of the dimension of the subspaces, d, on the vertical and γ on the horizontal axis for different noise variances σ² (panels: σ² = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8).

Figure 6: Clustering metrics (FDE, CE, EL) as a function of the noise variance, σ², on the vertical and γ on the horizontal axis.

8.1.5 Sparse noise

Finally, we investigate the impact of sparse corruptions on clustering performance. We generate L = 15 subspaces, and vary the dimension of the subspaces, d, and the number of points in each subspace, n = dγ, again by varying γ. The points in each subspace are chosen according to the statistical model Theorem 4 is based on, with σ² = 0.4 and C = I_m. The results, depicted in Figure 7, corroborate the results in Section 4.2, namely that the effect of sparse corruptions is drastic if s is small, and similar to that for additive unstructured noise if s is large.

8.1.6 Detection of outliers

In order to facilitate a comparison with the outlier detection scheme proposed for SSC in [27], we perform our experiment with exactly the parameters used in [27, Sec. 5.2]. Specifically, we set d = 5, vary m ∈ {50, 100, 200}, and generate L = 2m/d subspaces at random. We choose n inliers per subspace and a total of N_0 = Ln outliers according to the statistical model Theorem 7 is based on. Note that this results in the number of outliers being equal to the number of inliers. We measure performance in terms of the misclassification probability, defined as the number of misclassified points (i.e., outliers misclassified as inliers and inliers misclassified as outliers) divided by the total number of points in X. We find a misclassification error probability of {0.017, 1.5·10^−4, 2.5·10^−5} for m = {50, 100, 200}, respectively. Similar performance was reported for the scheme proposed in [27] for SSC.

Figure 7: CE as a function of the dimension of the subspaces, d, on the vertical and γ on the horizontal axis, for σ² = 0.4, and varying s (panels: s = 1, 2, 4, 6, 10, 15, n).


Figure 8: Empirical mean and standard deviation of the CE for clustering handwritten digits, as a function of the number of points of each digit n (curves: TSC, TSCb, SSC).

8.2 Clustering handwritten digits

We next apply TSC to the problem of clustering handwritten digits. Specifically, we work with the MNIST data set of handwritten digits [19], and we use the training set that contains 60,000 centered 20 × 20 pixel images of handwritten digits. The assumption underlying the idea of posing the problem of clustering handwritten digits as a subspace clustering problem is that the images of a single digit lie in a low-dimensional subspace of unknown dimension and orientation. We compare the performance of TSC and TSCb to the performance of SSC/RSSC. The empirical mean and variance of the clustering error are computed by averaging over 50 of the following problem instances. We choose the digits {2, 4, 8} and for each digit we choose n images uniformly at random from the set of all images of that digit. In order to make the LASSO step in SSC computationally more tractable, we first apply PCA to the corresponding images to reduce the dimension of the images from M = 400 to m = 50. We choose q = 10 for TSC and set the LASSO penalty to λ = 10 for SSC. The choice λ = 10 is motivated by the fact that it yields the best performance among λ ∈ {0.5, 1, 2, 5, 10, 20, 30} for n = 200. The results, summarized in Figure 8, show that TSC performs best, TSCb second best, and SSC performs slightly worse than TSCb. Note that the same experiment, for n = 200, was performed in [39, Sec. 3.4] for several further subspace clustering algorithms. Comparing the results in [39, Tab. 8] to those obtained for TSC shows that TSC is competitive with respect to state-of-the-art subspace clustering algorithms such as GPCA [35], K-flats [32], and SLBF [39].
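The pipeline just described (n images per digit, PCA to m = 50 dimensions, TSC with q = 10, CE via the Hungarian algorithm) can be sketched as follows. Here tsc and clustering_error refer to the sketches in Sections 2 and 8, the loading of the MNIST images is assumed to be done elsewhere, and details such as centering before PCA are our choices rather than the paper's exact preprocessing.

    import numpy as np

    def digit_clustering_experiment(images, digit_labels, digits=(2, 4, 8),
                                    n=200, m=50, q=10, rng=None):
        """One problem instance: pick n images per digit, reduce to m dimensions
        via PCA, run TSC, and return the clustering error.
        images: array of vectorizable images; digit_labels: integer labels."""
        rng = np.random.default_rng() if rng is None else rng
        cols, truth = [], []
        for d in digits:
            idx = rng.choice(np.flatnonzero(digit_labels == d), size=n, replace=False)
            cols.append(images[idx].reshape(n, -1).T)   # columns are vectorized images
            truth += [d] * n
        X = np.hstack(cols).astype(float)
        X -= X.mean(axis=1, keepdims=True)              # centering (our choice)
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        X = U[:, :m].T @ X                              # PCA: project onto top m components
        X /= np.linalg.norm(X, axis=0, keepdims=True)   # normalize prior to TSC
        labels_est, _ = tsc(X, q=q)
        return clustering_error(labels_est, np.array(truth))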


Figure 9: Empirical mean and standard deviation of the CE for clustering of faces, as a function of the number of faces L.

8.3 Clustering faces

We next apply TSC to the problem of clustering images of faces under varying illumination conditions. The motivation for applying TSC to this problem comes from the insight that images of a given face under varying illumination conditions can be approximated well by a 9-dimensional linear subspace [1]. Each 9-dimensional subspace Sl would then contain the images corresponding to a given person. We work with the extended Yale Face Database B [9, 21], and compare our results to those reported in [39, Tab. 5] for the same experiment. The extended Yale Face Database B contains 168 × 192 pixel (cropped) images of 38 persons, each taken under 64 different illumination conditions. We resized the images to 84 × 96 pixels using the imresize function in Matlab. In each problem instance, we chose L persons uniformly at random from the set of 38 persons. As pointed out in [39, Section 3.3] the subspaces corresponding to different persons are extremely close to each other, which makes clustering hard. Discrimination between the clusters (and hence persons) can be improved through preprocessing of the data as described in [39, Section 3.3]. Specifically, we remove the first three principal components of X, where X is the matrix whose columns are the data points in the data set X . For each L, we apply TSC with q = 10, and the CE was obtained by averaging over 500 problem instances. The results are summarized in Figure 9. Comparing Figure 9 to [39, Tab. 5] (which contains performance results for a number of subspace clustering algorithms, such as SSC [6, 7], GPCA [35], K-flats [32], and SLBF [39]) shows that the performance of TSC is comparable to that of state of the art subspace clustering algorithms. We hasten to add, however, that in this experiment SSC performs better than TSC.
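The preprocessing step used here, removing the first three principal components of the data matrix before clustering, can be sketched as follows (function name ours):

    import numpy as np

    def remove_leading_principal_components(X, k=3):
        """Project the columns of X onto the orthogonal complement of the span of
        the first k principal components, as in the face-clustering preprocessing."""
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        P = np.eye(X.shape[0]) - U[:, :k] @ U[:, :k].T   # projector removing top-k components
        return P @ X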

A

Proof of Theorem 1

Theorem 1 is a special case of the following, more general, result, which is a key ingredient of the proof of Theorem 6 as well. Theorem 9. Suppose that Xl , l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 10/3, points corre(l) (l) (l) sponding to Sl at random according to xj = U(l) aj , j ∈ [nl ], where the aj are i.i.d. N (0, (1/dl ) Idl ), 19

and set X = X1 ∪...∪XL . Assume that in each xj ∈ X up to s entries (possibly different for different xj ) are missing, i.e., set to 0. Assume that the U(l) satisfy

(l) T (l)

min (19)

UD U a ≥ η l,D : |D|≤2s,kak2 =1

2

and max

k,l : k6=l,D : |D|≤2s,kak2 =1



(k) T (l)

U

D U a ≤ δ

(20)

2

(l)

where UD ∈ Rm×dl is the matrix obtained from U(l) ∈ Rm×dl by setting the rows with indices in D ⊆ [m] to zero. If √ δ log ρmin (21) ≤ c1 √ η log N with ρmin = minl ρl , then the subspace detection property holds (for X ) with probability at least X −2  X  −c2 (nl −1) 1− nl e +2 nk (22) k6=l

l∈[L]

where c1 and c2 are absolute constants satisfying 0 < c1 < 1 and c2 > 0. Proof of Theorem 1. Theorem 1 follows as a special case of Theorem 9 by setting s = 0 and (l) choosing the U(l) to be orthonormal. For s = 0, UD = U(l) and since the U(l) are orthonormal, (l) T

T

UD U(l) = U(l) U(l) = Idl . Thus (19) is satisfied with η = 1, and the LHS of (20) reduces to maxk,l : k6=l affp(Sk , Sl ), in turn (20) and (21) reduce to (9), which then establishes the result.

Proof of Theorem 9. As stated in Section 3, the subspace detection property is satisfied if for each (l) xi ∈ Xl , and for each Xl , we have (l)

(k)

z(nl −q) > max zj

(23)

k6=l,j

(k)

where zj (l)



(l) (l) (l) (l) := x(k) and the order statistics z(1) ≤ z(2) ≤ ... ≤ z(nl −1) are defined by sorting j , xi (k)

the {zj }j∈[nl ]\i in ascending order. For simplicity of exposition the notation zj

does not reflect

(l) xi .

the dependence on Our proof proceeds in two steps. First, we upper-bound the probability of (l) (l) (23) being violated for a given xi . A union bound over all N vectors xi , i ∈ [nl ], l ∈ [L], then yields the final result. Note that  D E D E  (k) (l) (k) (k) (l) (l) (k) (k) T (l) (l) (k) zj = xj , xi = UD aj , UE ai = aj , UD UE ai . (k)

Here the sets D, E ⊂ [m] contain the indices of the entries of xj (l)

(k)

missing (i.e., set to zero). Thus, conditioned on ai , zj

is distributed as



(k) T (l) (l) (k)

U

D UE ai z¯j 2

20

(l)

and xi , respectively, that are

(k)

where the z¯j are the absolute values of independent (over j and k) N (0, 1) distributed random variables. Condition (19) guarantees that for every D, E with |D|, |E| ≤ s, for every l, and for every

(l) T (l) (l)

(l) (l) a , we have U U a ≥ η a . We can, therefore, bound the probability that (23), i

D (l)

E

i

i

2

2

conditioned on ai , does not hold, according to

   



(k) T (l) (l) (k)

(l) (l) (l) (k)

max UD UE ai z¯j ≤ P z¯(nl −q) η ai ≤ P z(nl −q) ≤ max zj k6=l,j 2 k6=l, j, D, E : |D|,|E|≤s 2   δ (l) (k) ≤ P z¯(nl −q) ≤ max z¯j η k6=l,j

(24)

(k) T (l) (l)

(l) where we used (20) to conclude that UD UE ai 2 ≤ δ ai 2 for all D, E with |D|, |E| ≤ s. Next, we note that for random variables X and Y and constants φ and ϕ satisfying φ ≥ ϕ, we have that P[X ≤ Y ] ≤ P[{X ≤ φ} ∪ {ϕ ≤ Y }] Applying (25) to (24), with φ = q P δ 6 log( k6=l nk ), yields η



≤ P[X ≤ φ] + P[ϕ ≤ Y ] .

(25)

θ log ρl , where θ > 0 is an absolute constant and ϕ =

r    i h X  p δ (l) (k) (l) (k) P z¯(nl −q) ≤ max z¯j ≤ P z¯(nl −q) ≤ θ log ρl + P 6 log nk ≤ max z¯j . k6=l k6=l,j η k6=l,j

(26)

Note that this requires φ ≥ ϕ, i.e., r X  p δ θ log ρl ≥ 6 log nk k6=l η p P which is implied by (21) with c1 = θ/6 upon noting that k6=l nk ≤ N . Using ρl = nl −q 1 1−γ with γ = nl −1 and applying Lemma 4 in Appendix G yields s "  i h p (l) (l) P z¯(nl −q) ≤ θ log ρl ≤ P z¯(nl −q) ≤ θ log

1 1−γ

#

≤ e−c2 (nl −1) .

nl q


0 if ρl > 10/3, a condition which ensures 7 γ > 10 , needed for Lemma 4. Application of (27) and Lemma 5 in Appendix G to (26) yields   2 (l) (k) P z(nl −q) ≤ max zj ≤ e−c2 (nl −1) + P (28) 2 . k6=l,j n k k6=l (l)

Taking the union bound over all xi , i ∈ [nl ], l ∈ [L], yields the desired lower bound on the probability of the subspace detection property to hold.

21

B

Proof of Theorems 2 and 3

Theorem 2 follows from Theorem 3 by setting σ = 0. We therefore need to prove Theorem 3 only. We again start by noting that, as stated in Section 3, the subspace detection property is satisfied (l) if for each xi ∈ Xl , and for each Xl , we have (l)

(k)

z(nl −q) > max zj

(29)

k6=l,j

(k)

where zj

(k) (l) = xj , xi . Next, we upper-bound the probability of (29) being violated. A union (l)

bound over all N vectors xi , i ∈ [nl ] l ∈ [L], will, as before, yield the final result. We start by setting D E T (k) (k) (l) z˜j := aj , U(k) U(l) ai (30) and noting that D E D E T (k) (l) (k) (l) (k) = xj , xi = aj , U(k) U(l) ai + ej

(31)

D E D E D E (k) (l) (l) (l) (k) (k) (l) := e(k) + e , U a + U a , e . , e j i j i j i

(32)

(k)

zj with (k)

ej (l)

(l)

(l)

(l)

Recalling that z(1) ≤ z(2) ≤ ... ≤ z(nl −1) are the order statistics of {zj }j∈[nl ]\i , it follows that (l)

(l)

(l)

z˜(nl −q) − max |ej | ≤ z(nl −q) j6=i

and hence the probability of (29) being violated can be upper-bounded as     (l) (k) (l) (k) (l) (k) P z(nl −q) ≤ max zj ≤ P z˜(nl −q) −max ej ≤ max z˜j + max ej k6=l,j j6=i k6=l,j k6=l,j   ν (l) ≤ P z˜(nl −q) ≤ √ dl   (l) (k) (k) +P α + 2 ≤ max ej +max z˜j +max ej j6=i k6=l,j k6=l,j     ν (l) (k) ≤ P z˜(nl −q) ≤ √ + P max z˜j ≥ α k6=l,j dl     (l) (k) + P max ej ≥  + P max ej ≥  j6=i k6=l,j | {z h i } P (k) ≤ (j,k)6=(i,l) P ej ≥ 

(33)

(34)

where α, , and ν are chosen later and for (33) we used (25) with φ = √νd and ϕ = α + 2, which l leads to the assumption α + 2 ≤ √νd , resolved below. We next upper-bound the individual terms l in (34) to get the following results proved at the end of this appendix: Step 1

Setting  =

2σ(1+σ) √ β, m

we have for all β ≥

√1 2π

√ satisfying β/ m ≤ 1 that

h i β2 (k) P ej ≥  ≤ 7e− 2 . 22

(35)

Step 2

Setting α=

β(1 + β) 1 T

√ max √ U(k) U(l) k6=l F dl dk

we have for all β ≥ 0 that   X β2 β2 (k) (1 + 2nk )e− 2 ≤ 3N e− 2 . P max z˜j ≥ α ≤ k6=l,j

Step 3

(36)

(37)

k∈[L]\l

For ν = 2/3 and nl /q = ρl ≥ 6, there is a constant c = c(ρl , ν) > 1/20 such that   ν (l) P z˜(nl −q) ≤ √ ≤ e−c(nl −1) . dl

(38)

Before presenting the detailed arguments leading to (35), (37), and (38), we show how the proof is √ √ completed. Setting β = 6 log N and using (35), (37) (note that β/ m ≤ 1 is satisfied since, by assumption, m ≥ 6 log N ), and (38) in (34) yields   β2 β2 (l) (k) P z(nl −q) ≤ max zj ≤ e−c(nl −1) + 3N e− 2 + 7N e− 2 k6=l,j

=

10 + e−c(nl −1) . N2

(39)

(l)

Taking the union bound over all vectors xi , i ∈ [nl ], l ∈ [L], yields the desired lower bound on the probability of the subspace detection property to hold. Recall that for (33) we imposed the condition α + 2 ≤ √νd . With our choices for , α, and ν in l Step 1,2, and 3, respectively, this condition becomes √

dl 2 1

(k) T (l) β(1 + β) max √ U U + 4σ(1 + σ) √ β ≤ . (40) k6=l 3 m F dk Next, note that p p (1 + β) ≤ 3.2 log N ≤ 4 log N

(41)

as a consequence of N ≥ 6, by assumption. Therefore, (40) is implied by √

1 σ(1 + σ) dl 2

(k) T (l) √ ≤ √ max √ U U + √ k6=l F log N m dk 3 · 4 6 log N which, in turn, is implied by (11). This concludes the proof. It remains to prove the bounds (35), (37), and (38). Step 1, proof of (35)



(k) (k) (k) (l) Conditional on aj , with U(k) aj = 1, we have U(k) aj , ei ∼

N (0, σ 2 /m). Using Lemma 3 in Appendix G, for β ≥

√1 , 2π

2

we hence get

 D  E β2 σ (k) (k) (l) P U aj , ei > √ β ≤ 2e− 2 . m 23

(42)

h

i

(k) (l) 2 (l) 2  (k) (l) (l) Next, we upper-bound P ej , ei ≥ φ . Conditional on ei , we have ej , ei ∼ N 0, σm ei 2 .

Lemma 3 yields, for β ≥

√1 , 2π

that

 D

 E β2 σ

(l) (k) (l) P ej , ei ≥ β √ ei ≤ 2e− 2 . m 2

(43)

√ Since β/ m ≤ 1, by assumption, we get     i h β2 β

(l)

(l) P ei ≥ 2σ ≤ P ei ≥ 1 + √ σ ≤ e− 2 m 2 2

(44)

where the second inequality follows from (103). Next, note that for random variables X, Y (possibly dependent) and a constant φ, we have P[X ≥ φ] = P[{X ≥ Y ≥ φ} ∪ {X ≥ φ ≥ Y } ∪ {Y ≥ X ≥ φ}] ≤ P[{X ≥ Y } ∪ {Y ≥ φ}] ≤ P[X ≥ Y ] + P[Y ≥ φ] .

(45)

Combining (43) and (44) via (45) yields P

i h D E h D E h i β2 2σ 2 i σ (k) (l) (k) (l)

(l)

(l) ej , ei ≥ √ β ≤ P ej , ei ≥ β √ ei + P ei ≥ 2σ ≤ 3e− 2 . (46) m m 2 2 {z } | {z | } {z } | X

Y

φ

Finally, combining (42) and (46) via a union bound yields (35). (k)

(k)

Step 2, proof of (37) We first upper-bound the probability that maxj z˜j (with z˜j as defined in (30)) is larger than a constant, which then yields, via a union bound, an upper bound on the T (k) probability that maxk6=l,j z˜j is larger than a constant. For convenience, we define B = U(k) U(l) . (k) (l) (k) With this notation, z˜ = a , Ba . We have [27, Proof of Lem. 7.5] j

j

i

 

κ2 −dl kBkF

(l) 2kBk2 2→2 . P Bai ≥ √ +κ ≤e 2 dl

(47)

√ Setting κ = βkBkF / dl in (47) yields   2 kBk2

F 2 − β2 1+β

(l) − β2 kBk2 2→2 ≤ e P Bai ≥ √ kBkF ≤ e . 2 dl By Theorem 12 in Appendix G we have  D

 E β2 β (k)

(l) (l) P aj , Bai > √ Bai ≤ 2e− 2 . 2 dk

24

(48)

(49)

(k)

Using (45) with X = maxj z˜j , φ =

√β 1+β √ kBk , F dk dl

and Y =



(l) √β Ba , i dk 2

we get

   D

 E β 1+β β (k)

(l) (k) (l) √ kBkF ≤ P max aj , Bai > √ Bai P max z˜j ≥ √ j j 2 dk dl d   k

1+β

(l) + P Bai ≥ √ kBkF 2 dl  

1+β

(l) ≤ P Bai ≥ √ kBkF 2 dl

 E X  D (k) β

(l) (l) P aj , Bai > √ Bai + 2 dk

(50)

j∈[nk ]

≤ (1 + 2nk )e−

β2 2

(51)

where a union bound is used to obtain (50), and (51) follows from (48) and (49). Taking the union bound over k ∈ [L]\l concludes the proof of (37).

(l) (l) We first note that the pdf of aj , ai is given by f (z) = (l) (l)  dl −3 (l) z 2 ) 2 1{|z|≤1} . Hence, we get recall that z˜j = aj , ai Step 3, proof of (38)

Γ(dl /2) √1 (1− π Γ((dl −1)/2)

  Z √ν dl −3 dl 2 Γ(dl /2) ν (l) ≤√ (1 − z 2 ) 2 1{z≤1} dz P z˜j ≤ √ π Γ((dl − 1)/2) 0 dl r 2 Γ(dl /2) 2 ν √ ≤ ≤√ ν π π Γ((dl − 1)/2) dl | {z }

(52)

p:=

where the last inequality follows from [8, Eq. 8.1]. Next, observe that,    ν (l) √ = P there exists a set I ⊂ [nl − 1] with |I| = nl − q P z˜(nl −q) ≤ dl  ν (l) such that z˜i ≤ √ for all i ∈ I d    l nl − 1 ν (l) ≤ max P z˜i ≤ √ , for all i ∈ I nl − q I : |I|=nl −q dl  q−1   nl −q nl − 1 ν (l) ≤ e P z˜i ≤ √ q−1 dl  q−1   n l −1 (n −1) 1− 1 nl − 1 l nl −q % % ≤ e p = (e%) p q−1 =e

     −(nl −1) log p1 1− %1 − %1 log(e%)

= e−c(%,ν)(nl −1)

(53) (54) (55) (56)

  k n where we used a union bound to get (53), n−k = nk ≤ en [3] for (54), and (52) yields (55); k nl −1 we also set % = q−1 for convenience. Here c(%, ν) satisfies c(%, ν) > 1/20 for ν = 2/3 and % ≥ 6. l −1 Note that % = nq−1 ≥ nql = ρl ≥ 6, where the last inequality holds by assumption.

25

C

Proofs of Theorems 4 and 5

Theorems 4 and 5 rely on the following result. Theorem 10. Suppose that Xl , l ∈ [L], is obtained by choosing nl = ρl q, with ρl ≥ 6, points (l) (l) (l) (l) (l) corresponding to Sl at random according to xj = U(l) aj + Cj cj , j ∈ [nl ], where the aj ∈ Rdl (l)

(l)

are i.i.d. uniform on S dl −1 , the cj ∈ Rs are i.i.d. N (0, (σ 2 /s)Is ), and Cj ∈ Rm×s . Set X = X1 ∪ ... ∪ XL and let dmax = maxl dl . If max aff(Sk , Sl ) + ξ ≤

k,l : k6=l

with

1 12 log N





 1 1 dmax

(k) T (l) (k) T (l)

√ Cj Ci + 2 max √ U σ max ξ = 2σ √ Ci k,l,i s s F k,j : (k,j)6=(l,i) dk F

(57)

(58)

then the subspace detection property is satisfied (for X ) with probability at least 1 − P − l∈[L] nl e−c(nl −1) , where c > 0 is an absolute constant.

12 N

(k) T (l) √ Proof of Theorem 5. Theorem 5 follows from Theorem 10 by noting that Cj Ci F ≤ s and

(k) T (l) √ (l)

U C ≤ dk (recall that C is a submatrix of the orthonormal matrix C and thus has i

F

i

orthonormal columns). Thus, ξ is upper-bounded by 2σ (14).

√ d √max s

(σ + 2), and hence (57) is implied by

Proof of Theorem 4. Theorem 4 follows from Theorem 10 by bounding the probability that the clustering condition (57) √ is violated. √ First, set δ = √c3m 3d + log L with c3 = c2 , where c2 is the constant in Lemma 2 in Appendix D. Using (12), we can bound the probability that (57) is violated according to " # √   d P max aff(Sk , Sl ) +  ≥ δ + 4 √ σ(2 + σ) ≤ P max aff(Sk , Sl ) ≥ δ k,l : k6=l k,l : k6=l m r   X 1 s (k) T (l)

+ P √ Cj Ci F ≥ 2 (59) m s (k,j)6=(l,i) " r # X

1 d T (l) + P √ U(k) Ci F ≥ 2 . (60) m s k,l,i

T (l) First, we upper-bound the probability in (60). Observe that U(k) Ci F is the `2 -norm of an (l)

N (0, (1/m)Ids ) random vector (recall that Ci is a submatrix of the orthonormal matrix C and thus has orthonormal columns), and therefore we have, according to (103) in Appendix G, that " # √ β2 1 ds + β T (l) (k) P √ U Ci F ≥ √ ≤ e− 2 s ms √ which, with β = ds, reduces to " r #

ds 1 d T (l) P √ U(k) Ci F ≥ 2 ≤ e− 2 . (61) m s 26

(k)

(l)

Next, we upper-bound the probability in (59). Since Cj and Ci are submatrices of the

(k) T (l) 2 (k) (l) orthonormal matrix C, Cj Ci F is simply given by the number of columns in Cj and Ci (k)

that are equal. Since the Cj are chosen uniformly at random from all submatrices of C with s columns, this number is simply the number of successes in s draws without replacement from a population of size m containing s successes, and thus follows a hypergeometric distribution [26]. According to [26, Eq. 1.3], we have for κ > 0 that  s 

(k) T (l) 2 m 2 P Cj Ci F ≥ s + κ ≤ e−2s m−s+1 κ m and we get, with κ = 3s/m, that r   m s2 1 s (k) T (l)

P √ Cj Ci F ≥ 2 ≤ e−18s m−s+1 m2 . m s Finally, we have, with aff(Sk , Sl ) ≤ affp(Sk , Sl ), that     P max aff(Sk , Sl ) ≥ δ ≤ P max affp(Sk , Sl ) ≥ δ ≤ 4e−c1 m k,l : k6=l

k,l : k6=l

(62)

(63)

where the last inequality follows by application of Lemma 2 in Appendix D. Note that this requires m ≥ δc22 (3d + log L) which is ensured by the definition of δ above. Using (61), (62), and (63) in (60) yields the following upper bound for the probability of (57) being violated: m

ds

s2

N 2 e− 2 + N 2 e−18s m−s+1 m2 + 4e−c1 m which concludes the proof. Proof of Theorem 10. The proof is identical to the proof of Theorem 3 in Appendix B, with Step 1 in Appendix B substituted by the following statement proven below: Setting



 σ 1 1

(k) T (l) (k) T (l)

√ Cj Ci + 2 max √ U Ci  = β(1 + β) √ σ max k,l,i s s F k,j : (k,j)6=(l,i) dk F we have for β ≥ 0, that h i β2 (k) P ej ≥  ≤ 9e− 2 where

(k)

ej

(64)

D E D E D E (k) (k) (l) (l) (k) (k) (l) (k) (l) (l) = Cj cj , Ci ci + Cj cj , U(l) ai + U(k) aj , Ci ci .

Note that (64) is equivalent to (35), with, however, a different choice of . Using (64) instead of (35) in the proof of Theorem 3, (37), and (38) in (34), and choosing α and β as in the proof of Theorem 3 yields   12 (l) (k) (65) P z(nl −q) ≤ max zj ≤ 2 + e−c(nl −1) . k6=l,j N

27

√ Recall that for (39) we imposed the condition α + 2 ≤ ν/ dl with ν = 2/3, which here becomes

√  1 dl (k) T (l)

√ Cj Ci max aff(Sk , Sl ) + 2σ √ σ max

+ k,l : k6=l s s k,j : (k,j)6=(l,i) F

 1 2

(k) T (l) 2 max √ U ≤ Ci . (66) k,l,i 3β(1 + β) F dk √ The proof is concluded by using (1 + β) ≤ 3.2 log N (cf. (41)) to show that 2 1 ≥ . 3β(1 + β) 12 log N

(67)

which, in turn, yields that (66) is implied by (57). (l) T

(k)

It remains to prove (64). For expositional convenience, we set B = Ci Cj . We start by upperh D E i

(k) (l) 2 (k) (l) (k) (k) 2  bounding P Bcj , ci ≥ φ . Conditional on cj we have Bcj , ci ∼ N 0, σs Bcj 2 . Using Lemma 3, we get  D

 E β2 σ

(k) (k) (l) P Bcj , ci ≥ β √ Bcj ≤ 2e− 2 . (68) s 2 (k) (l)

(k) σ √ , and Y = β √ Bcj 2 , we obtain Invoking (45) with X = Bcj , ci , φ = kBkF σ 2 √βs 1+β s s  D   D

 E E σ

(k) (k) (l) (k) (l) 2 β 1+β P Bcj , ci ≥ kBkF σ √ √ ≤ P Bcj , ci ≥ β √ Bcj s s s 2  

1+β

(k) + P Bcj ≥ σ √ kBkF s 2 ≤ 3e−

β2 2

(69)

where we used (68) and (102) to get (69). T (k) (k) Next, for expositional convenience, set D = U(l) Cj . Conditional on cj , it follows from Theorem 12 that  D

 E β2 β

(k) (k) (l) (70) P Dcj , ai ≥ √ Dcj ≤ 2e− 2 . 2 dl By (102) we have  

β2 σ

(k) (71) P Dcj ≥ kDkF (1 + β) √ ≤ e− 2 . s 2 D

E (k) (l) (k) β σ √β √ √ Again invoking (45) with X = Dcj , ai , φ = kDkF (1 + β) s d , and Y = d Dcj , we l l 2 get  D  E σ β (k) (l) P Dcj , ai ≥ kDkF (1 + β) √ √ s dl  D   

E β (k) σ

(k) (k) (l) √ √ ≤ P Dcj , ai ≥

Dcj + P Dcj ≥ kDkF (1 + β) s 2 2 dl ≤ 3e−

β2 2

(72)

where we used (70) and (71) to get (72). Finally, (64) is obtained by combining (69) and (72) via a union bound. 28

D

Proof of Theorem 6

Theorem 6 follows from Theorem 9 and the following result from [8], which builds on the concentration inequality the Johnson-Lindenstrauss Lemma is based on, and a covering argument. Lemma 1 ([8, Eq. 9.12 with ρ = e32−1 ]). Let U be a p × s random matrix which satisfies for some c˜ > 0, for every x ∈ Rs , and every t ∈ (0, 1), i h 2 (73) P kUxk22 − kxk22 ≥ t2 kxk22 ≤ 2e−˜ct p . Then, we have

  2 P UT U − Is 2→2 ≥ δ ≤ 2e−0.6˜cδ p+3s .

(74)

We note that (73) is satisfied for random matrices with i.i.d. N (0, 1/p) entries as well as for random matrices with i.i.d. subgaussian entries of variance 1/p [8, Lem. 9.8] (a random variable x is 2 subgaussian [8, Sec. 7.4] if its tail probability satisfies P[|x| > t] ≤ c1 e−c2 t for constants c1 , c2 > 0). (l)

Lemma 2. Let the entries of the U(l) ∈ Rm×d , l ∈ [L], be i.i.d. N (0, 1/m), and let UD ∈ Rm×d be the matrix obtained from U(l) ∈ Rm×d by setting the rows with indices in D ⊆ [m] to zero. Then, we have for δ ∈ (0, 1) that

(l) T (l) m − 2s

UD U a min (75)

≥ (1 − δ) m l, D : |D|≤2s, kak2 =1 2 and max

k6=l, D : |D|≤2s, kak2 =1



(k) T (l)

U

D U a ≤ δ

(76)

2

with probability at least 1 − 4e−c1 m , provided that  me  c2  m ≥ 2 3d + log L + s log + c3 s δ 2s

(77)

where c1 , c2 , c3 > 0 are absolute constants. Before proving q Lemma 2, we show how the proof of Theorem 6 can be completed using Lemma 2.

Set δ =

c1 c3 −2 2 c3

log ρ log N ,

where c1 is the constant in (21) and c3 is the constant in (15). With this

choice of δ, (15) (with c4 = c2 (2c3 /(c1 (c3 − 2)))2 ) implies (77) and hence, for η = (1 − δ) m−2s m , −˜ c m according to Lemma 2, (19) and (20) are satisfied with probability ≥ 1 − 4e , where c˜ is an absolute constant. Next, observe that using c ≤ 1 (by assumption) and ρ ≤ N , we have δ = 1 q c1 c3 −2 2 c3

log ρ log N

≤ 12 . Thus,

δ 1−δ

≤ 2δ, and hence

√ δ δ m m c3 log ρ = ≤ 2δ ≤ 2δ = c1 √ η 1 − δ m − 2s m − 2s c3 − 2 log N m 3 where we used that m−2s ≤ c3c−2 as a consequence of m ≥ c3 s (from (15)). We therefore established that (21) holds with probability ≥ 1 − 4e−˜cm , and application of Theorem 9 concludes the proof.

29

(k)

Proof of Lemma 2. We start with (76). First, note that since the rows of UD indexed by D (k) T

have all entries equal to zero by definition, we have UD U(l) = ViT Vj , where Vi ∈ Rp×d and Vj ∈ Rp×d , with p = m − |D|,pdenote the restrictions of U(k) and U(l) , respectively, to the rows ˜i V ˜ j ] ∈ Rp×2d , and note that the entries of U ˜ i = m/pVi , let U = [V indexed by [m] \ D. Set V are i.i.d. N (0, 1/p). Using m ≥ p, we have



T

m

˜T ˜

ViT Vj

Vi Vj = ≤ UT U − I2d 2→2 ≤ V V

j i 2→2 2→2 p 2→2 ˜TV ˜ j is a submatrix of UT U − I2d [31, Ex. where the last inequality follows from the fact that V i 3.4]. Therefore, we get

    P ViT Vj 2→2 ≥ δ ≤ P UT U − I2d 2→2 ≥ δ ≤ 2e−c0 δ ≤ 2e−c0

2 p+3·2d

(78)

δ 2 (m−2s)+6d

(79)

where c0 is an absolute constant, (78) follows from Lemma 1, and (79) is a consequence of p ≥ m−2s. Taking the union bound over all pairs (i, j), i.e., over all pairs (k, l) with k, l ∈ [L] and for each  2s m ≤ me of those pairs (k, l) over all D ⊆ [m] with |D| = 2s, i.e., over 2s sets, we obtain 2s    me 2s

T 2

P max Vi Vj 2→2 ≥ δ ≤ L2 2e−c0 δ (m−2s)+6d (80) i6=j 2s me 2 = 2e−c0 δ (m−2s)+6d+2 log L+2s log( 2s ) = 2e ≤ 2e

h −c0 δ 2 m−2s− −c1 m

2 c0 δ 2

i

(3d+log L+s log( me 2s ))

(81) (82)

where the last inequality holds for c1 > 0 provided that (77) (with c2 and c3 sufficiently large) holds, as (77) implies that the term in the square brackets in (81) equals mc4 for a constant c4 > 0. We therefore established that (76) holds with probability ≥ 1 − 2e−c1 m , provided that (77) is satisfied. ˜ i (recall that We next prove (75). With the same notation as above, applying Lemma 1 to V the entries of Vi are i.i.d. N (0, 1/p)), we get

i h 2

˜T ˜

P V V − I ≥ δ ≤ 2e−c0 δ p+3d . i d i 2→2

Next, taking the union bound over all L subspaces and over all D ⊆ [m] with |D| ≤ 2s, yields  

 me 2s 2

˜T ˜

P max Vi Vi − Id 2e−c0 δ p+3d (83) ≥δ ≤L i 2s 2→2 ≤ 2e−c1 m (84) where we used that the RHS of (83) is smaller than the RHS of (80) (recall that p ≥ m − 2s) and therefore (84) follows from the steps leading from (80) to (82). Note that for every B ∈ Rm×d and every a ∈ Rd , we have

kak22 − kBak22 = (Id − BT B)a, a ≤ BT B − Id 2→2 kak2 (85)

30

where we used that Id − BT B is self adjoint [15]. Since there is an a ∈ Rd that satisfies (85) with equality, we have



1 − BT B − Id 2→2 = min kBak22 = min BT Ba 2 . (86) kak2 =1

kak2 =1

Applying (86), we get



p

˜T ˜ min ViT Vi a 2 = min V V a i i m i,kak2 =1 2 i, kak2 =1  

p

˜T ˜

1 − max Vi Vi − Id . = i m 2→2

(87)

From (87), and p ≥ m − 2s, it follows that    

T

T

m−2s p P min Vi Vi a 2 ≤ (1−δ) ≤ P min Vi Vi a 2 ≤ (1−δ) m m i,kak2 =1 i,kak2 =1  

˜T ˜

= P max V ≥ δ ≤ 2e−c1 m i Vi − Id i

2→2

where the last inequality follows by application of (84). We therefore established (75), which concludes the proof.

E

Proof of Theorem 7

The proof consists of two parts, corresponding to the two statements in Theorem 7. First, we bound the probability that the outlier detection scheme does not detect all outliers, and then we bound the probability that one or more of the inliers are misclassified. We start by bounding the probability that the outlier detection scheme fails to detect a given outlier. A union bound over all N0 outliers will then yield a bound on the probability that the outlier detection √ scheme does not detect all outliers. Let xj be an outlier. The probability that (16) with c = 6 is violated for xj can be upper-bounded as √ √    X  6 log N 6 log N ≤ P |hxi , xj i| > √ kxj k2 P max |hxi , xj i| > √ m m i∈[N ]\j i∈[N ]\j

≤ 2N e−3 log N =

2 N2

(88)

where we used the union bound and kxj k2 = 1 in the first inequality and Theorem 12 in the second. Taking the union bound over all N0 outliers, we have thus established that the probability of our scheme failing to detect one or more outliers is at most 2N0 /N 2 . Next, we bound the probability that the outlier detection scheme misclassifies a given inlier as an outlier. A union bound over all nl inliers in Sl will then yield a bound on the probability that the outlier detection scheme misclassifies one or more of the inliers in Sl as an outlier. Consider the inlier xj ∈ Sl . With D D E E (l) (l) (l) (l) max |hxi , xj i| ≥ max xi , xj = max ai , aj i∈[N ]\j

i∈[nl ]\j

i∈[nl ]\j

31

√ √ √ √ √ and using (17), i.e., 6 log N / m ≤ 1/ dmax ≤ 1/ dl , the probability that (16) holds with c = 6 for xj can be bounded as √     D E 6 log N 1 (l) (l) P max |hxi , xj i| ≤ √ ≤ P max ai , aj ≤ √ m i∈[N ]\j i∈[nl ]\j dl   D E Y 1 (l) (l) = P ai , aj ≤ √ dl i∈[nl ]\j r Y 1 π 2 ≤ = e− 2 log( 2 )(nl −1) (89) π i∈[nl ]\j

where (89) follows from (52) with ν = 1. Taking the union bound over all inliers proves that the probability of the outlier detection scheme misclassifying one or more of the inliers in Sl as an 1 π outlier is at most nl e− 2 log( 2 )(nl −1) , which concludes the proof.

F

Proof of Theorem 8

The basic structure of the proof is the same as that of the proof of Theorem 7. The individual steps are more technical, though, due to the additive noise term. We start by bounding the probability that the outlier detection scheme fails to detect a given outlier. A union bound over all N0 outliers will then yield a bound on the probability√that the outlier detection scheme does not detect √ all outliers. Let xj be an outlier and set β = 6 log N . The probability that (16) with c = 2.3 6 is violated for xj can be upper-bounded as    X  2.3β 2.3β P max |hxi , xj i| > √ ≤ P |hxi , xj i| > √ m m i∈[N ]\j i∈[N ]\j   X   β (90) ≤ P |hxi , xj i| ≥ √ kxi k2 + P[kxi k2 ≥ 2.3] m i∈[N ]\j

√ , and Y = √β kxi k to get (90). We next bound where we used (45) with X = |hxi , xj i|, φ = 2.3β 2 m m the first term in the sum in (90). Since xj√∼ N (0, (1/m)Im ), we have that, conditioned on xi , hxj , xi i ∼ N (0, kxi k22 /m). Hence, with β = 6 log N , it follows from (100) that   β2 β 2 P |hxj , xi i| ≥ √ kxi k2 ≤ 2e− 2 = 3 . (91) N m

We next bound the second term in the sum in (90) and treat the cases where xi is an inlier and where it is an outlier separately. First, suppose that xi is an inlier. Since √1+2σ ≤ 2.3 for σ ≥ 0, we have 1+σ 2  

1 1 + 2σ

(l) (l) (l) P[kxi k2 ≥ 2.3] ≤ P √

U ai + ei ≥ √ 2 1 + σ2 1 + σ2

h i

(l) (l)

(l) ≤ P U ai + ei ≥ 1 + 2σ (92) 2 2 h i

(l) = P ei ≥ 2σ 2

1 ≤ 3 N

(93) 32

where (92) follows from the triangle inequality, and to obtain (93) we applied (44) with β = and used that √βm ≤ 1, by assumption.

Next, suppose that xi is an outlier. Applying (44) with σ = 1 (again using that assumption), we have P[kxi k2 ≥ 2.3] ≤



√β m

1 . N3

6 log N

≤ 1, by (94)

Finally, combining (91), (93) (for xi an inlier), and (94) (for xi an outlier) in (90) yields √    X  2 1 3 2.3 6 log N √ ≤ + 3 ≤ N 3. P max |hxi , xj i| > 3 N N N m i∈[N ]\j i∈[N ]\j

Taking the union bound over all N0 outliers yields the desired result, namely that the probability of our scheme not detecting all N0 outliers is at most N0 N32 . Next, we bound the probability that the outlier detection scheme misclassifies a given inlier as an outlier. Consider the inlier xj belonging to subspace Sl . Then D E (l) (l) max |hxi , xj i| ≥ max xi , xj = max

i∈[N ]\j

i∈[nl ]\j

i∈[nl ]\j

1 (l) z 1 + σ2 i

(k) T (l) (l) (k) with zi = aj , U(k) U(l) ai + ej . Thus, for  ≥ 0, under the assumption 1 1 + σ2



√  2.3 6 log N 1 √ − ≥ √ m dl

resolved below, we have √      2.3 6 log N 1 1 1 (l) √ √ P max |hxj , xi i| ≤ zi ≤ ≤ P max − 1 + σ2 m i∈[N ]\j i∈[nl ]\j 1 + σ 2 dl   1 (l) = P max zi ≤ √ −  i∈[nl ]\j dl   1 (l) (l) ≤ P max z˜i − max |ei | ≤ √ −  i∈[nl ]\j i∈[nl ]\j dl     1 (l) (l) √ ≤ P max z˜i ≤ + P  ≤ max |ei | i∈[nl ]\j i∈[nl ]\j dl (l)

(l)

(95)

(96) (97)

where (96) follows with ei and z˜i as defined in (32) and (30), respectively, (97) follows from (25) (l) (l) with X = maxi∈[nl ]\j z˜i − √1d , Y = maxi∈[nl ]\j |ei | − , and φ = ϕ = 0. Next, note that (52) with l ν = 1 yields     Y 1 1 (l) (l) P max z˜i ≤ √ = P z˜i ≤ √ i∈[nl ]\j dl dl i∈[nl ]\j r Y 1 π 2 ≤ = e− 2 log( 2 )(nl −1) . (98) π i∈[nl ]\j

33

√ √ √ Application of (98) and (35) with  = 2σ(1+σ) 6 log N (using that β/ m ≤ 1, as verified below) m to (97) yields √   1 π 7 2.3 6 log N √ ≤ e− 2 log( 2 )(nl −1) + nl 3 . (99) P max |hxj , xi i| ≤ N m i∈[N ]\j To see that

√β m

≤ 1, note that for c1 sufficiently small (i.e., c1 ≤ 1/6), we have from (18), 1 dmax c1 c1 ≤ ≤ ≤ . m m (1 + σ 2 )2 log N log N

Taking a union bound over all inliers belonging to Sl shows that the probability of the outlier detection scheme misclassifying one or more of the inliers belonging to Sl as an outlier is at most   7 − 12 log( π2 )(nl −1) + nl 3 . nl e N Finally, we show that (95), for all l ∈ [L], is implied by (18). Rewriting (95) yields √   1 dl 1 2σ(1 + σ) √ ≥√ 2.3 + . 1 + σ 2 6 log N 1 + σ2 m Since

σ(1+σ) 1+σ 2

≤ 1.3 for σ ≥ 0, (95) is implied by

√ 1 1 dmax √ ≥ √ 4.9 2 1+σ m 6 log N

which is equal to (18) with c1 =

G

1 . (4.9)2 ·6

Supplementary results

For convenience, in the following, we summarize tail bounds from the literature that are frequently used throughout this paper. We start with a well-known tail bound on Gaussian random variables. Lemma 3 ([17, Prop. 19.4.2]). Let x ∼ N (0, 1). For β ≥ P[x ≥ β] ≤ e−

β2 2

.

√1 , 2π

we have (100)

Theorem 11 ([20, Eq. 1.6]). Let f be Lipschitz on Rm with Lipschitz constant L ∈ R, i.e., |f (b1 ) − f (b2 )| ≤ Lkb1 − b2 k2 for all b1 , b2 ∈ Rm , and let x ∼ N (0, Im ). Then, for β > 0, we have β2 P[f (x) ≥ E [f (x)] + β] ≤ e− 2L2 .

34

Applying the concentration inequality in Theorem 11 to f (x) = kBxk 2 with B ∈ Rs×m which has L = kBk2→2 , and using that by Jensen’s inequality (E [kBxk2 ])2 ≤ E kBxk22 = kBk2F , we get P[kBxk2 ≥ kBkF + β] ≤ e



β2 2kBk2 2→2

.

(101)

With kBk2→2 ≤ kBkF (101) becomes P[kBxk2 ≥ kBkF (1 + β)] ≤ e−

β2 2

.

(102)

We will frequently need (101) particularized to B = Im , which reads   √ β2 P kxk2 ≥ m + β ≤ e− 2 .

(103)

Theorem 12 ([2, Theorem 1]). Let a be distributed uniformly on S m−1 and fix b ∈ Rm . Then, for β ≥ 0, we have   β2 β P |ha, bi| > √ kbk2 ≤ 2e− 2 . m Lemma 4. Consider the order statistics z(1) ≤ z(2) ≤ ... ≤ z(n) for the set {zi }i∈[n] , where zi = |xi | and the xi are i.i.d. N (0, 1), and set γ = r/n. For γ > 0.7 there are constants c(γ) > 0 and α(γ) > 0 such that h i p P z(r) ≤ α(γ) log(1/(1 − γ)) ≤ e−c(γ)n . (104) Proof.   P z(r) ≤ β = P[There exists a set I with |I| = r such that zi ≤ β for all i ∈ I]   n ≤ max P[zi ≤ β, for all i ∈ I] r I : |I|=r   n = (P[zi ≤ β])r r   en n−r ≤ (P[zi ≤ β])r (105) n−r  n(1−γ) e = (P[zi ≤ β])nγ (106) 1−γ    k n where we used nr = n−r and nk ≤ en [3] in (105). We first consider the case where γ is large, k p i.e., close to one, set β = α log(1/(1 − γ)), and assume that α is chosen such that β ≥ 2. With   β2 2e− 2 ≤ 1 − β12 β1 for β ≥ 2, and a standard lower bound on the Q-function [17, Prop. 19.4.2], we get r   2 β2 2 −β 2 1 − β2 1 1 − β2 √ e 2 ≤ 1− 2 √ e− 2 e = 2e π β 2π β 2π 1 − P[zi ≤ β] ≤ Q(β) = . 2 35

Applying this to (106) yields   P z(r) ≤ β ≤ =



!nγ r n(1−γ) e 2 −β 2 1−2 e 1−γ π !γ !n r (1−γ)  e 2 1−2 (1 − γ)α 1−γ π

≤ e−c(γ,α)n

(107) (108)

p where we used β = α log(1/(1 − γ)). For the last inequality, we assumed that the term in the outer brackets in (107) is smaller than 1; this is the case if γ is sufficiently close to 1 so that α can be chosen sufficiently small. Else, we simply set β = 1/2, thus get α = 1/(4 log(1/(1 − γ))), and note that P[zi ≤ 1/2] < 0.383 (by numerical evaluation). Thus, with (106), we have !n  (1−γ)   e P z(r) ≤ 1 = 0.383γ 1−γ ≤ e−c(γ,α)n

(109)

and we see (by numerical evaluation) that c(γ, α) > 0 as long as γ ≥ 0.7. Lemma 5. Let z(n) = maxi∈[n] zi , where zi = |xi | and the xi are i.i.d. N (0, 1). Then, for ξ ≥ 0, we have h i p 2 P z(n) ≥ 2(1 + ξ) log n ≤ ξ . (110) n p Proof. Observe that with β = 2(1 + ξ) log n, we have   P z(n) ≥ β = P[{z1 ≥ β} ∪ ... ∪ {zn ≥ β}] ≤ nP[zi ≥ β] ≤ 2ne−

β2 2

= 2e−

β2 2

+log n

=

2 nξ

(111) (112) (113)

where (112) follows from the union bound, and (113) from Lemma 3.

References [1] R. Basri and D.W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(2):218 – 233, 2003. [2] Nathanael Berestycki. Concentration of measure, 2009. Lecture notes. [3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2001. [4] Eva L. Dyer, Aswin C. Sankaranarayanan, and Richard G. Baraniuk. Greedy feature selection for subspace clustering. arXiv:1303.4778, 2013. 36

[5] Eva L. Dyer, Christoph Studer, and R.G. Baraniuk. Subspace clustering with dense representations. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013. [6] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 2790–2797, 2009. [7] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. arXiv:1203.1005, 2012. To appear in IEEE Trans. Pattern Anal. Mach. Intell. [8] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Springer, Berlin, Heidelberg, 2013. [9] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001. [10] Gene H. Golub and Charles F. Van Loan. Matrix computations. JHU Press, 1996. [11] Reinhard Heckel and Helmut B¨ olcskei. Noisy subspace clustering via thresholding. In Proc. of IEEE International Symposium on Information Theory, 2013. [12] Reinhard Heckel and Helmut B¨ olcskei. Subspace clustering via thresholding and spectral clustering. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013. [13] J. Ho, Ming-Husang Yang, Jongwoo Lim, Kuangchih Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 11–18, 2003. [14] Wei Hong, J. Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. IEEE Trans. Image Process., 15(12):3655–3671, 2006. [15] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge University Press, 1986. [16] Hans-Peter Kriegel, Peer Kr¨ oger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data, 3(1):1–58, 2009. [17] Amos Lapidoth. A foundation in digital communication. Cambridge University Press, 1 edition, August 2009. [18] F. Lauer and C. Schnorr. Spectral clustering of linear subspaces for motion segmentation. In Proc. of 12th IEEE International Conf. on Computer Vision, pages 678–685, 2009. [19] Yann LeCun and Corinna Cortes. The MNIST database, 2013. [20] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer, Berlin, Heidelberg, 1991. [21] K.C. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684–698, 2005.

37

[22] Gilad Lerman, Michael McCoy, Joel A. Tropp, and Teng Zhang. Robust computation of linear models, or how to find a needle in a haystack. arXiv:1202.4044, 2012. [23] Gilad Lerman and Teng Zhang. Robust recovery of multiple subspaces by geometric `p minimization. Ann. Statist., 39(5):2686–2715, 2011. [24] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In Proc. of 27th Int. Conf. on Machine Learning, pages 663–670, 2010. [25] S.R. Rao, R. Tron, R. Vidal, and Yi Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 1–8, 2008. [26] R. J. Serfling. Probability inequalities for the sum in sampling without replacement. Ann. Statist., 2(1):39–48, 1974. [27] Mahdi Soltanolkotabi and Emmanuel J. Cand`es. A geometric analysis of subspace clustering with outliers. Ann. Statist., 40(4):2195–2238, 2012. [28] Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J. Cand`es. Robust subspace clustering. arXiv:1301.2603, 2013. [29] D. Spielman. Spectral graph theory. Lecture notes, 2012. [30] A.P. Topchy, M.H.C. Law, A.K. Jain, and A.L. Fred. Analysis of consensus partition in cluster ensemble. In Proc. of Fourth IEEE International Conf. on Data Mining, pages 225–232, 2004. [31] Lloyd N. Trefethen and David Bau. Numerical linear algebra. SIAM, 1997. [32] P. Tseng. Nearest q-flat to m points. J. Optim. Theory Appl., 105(1):249–252, 2000. [33] R. Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52–68, 2011. [34] R. Vidal and R. Hartley. Motion segmentation with missing data using PowerFactorization and GPCA. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 310–316, 2004. [35] R. Vidal, Yi Ma, and S. Sastry. Generalized principal component analysis (GPCA). IEEE Trans. Pattern Anal. Mach. Intell., 27(12):1945–1959, 2005. [36] Ulrike von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007. [37] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31(2):210 –227, 2009. [38] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conf. Computer Vision, pages 94–106, 2006. [39] Teng Zhang, Arthur Szlam, Yi Wang, and Gilad Lerman. Hybrid linear modeling via local best-fit flats. Int. J. Comput. Vision, 100:217–240, 2012.

38

Recommend Documents