Randomized Subspace Learning with Structure Preservation

arXiv:1401.4489v2 [cs.CV] 5 Feb 2014

Devansh Arpit, Gaurav Srivastava, Venu Govindaraju

February 6, 2014

Abstract

Modeling data as being sampled from a union of independent or disjoint subspaces has been widely applied to a number of real world applications. Recently, high dimensional data has come into focus because of advancements in computational power and storage capacity. However, a number of algorithms that assume the aforementioned data model have high time complexity, which makes them slow. Dimensionality reduction is a commonly used technique to tackle this problem. In this paper, we first formalize the concept of Independent Subspace Structure and then propose two randomized algorithms (supervised and unsupervised) for subspace learning that theoretically preserve this structure for any given dataset. Under the supervised setting, we show that 2K projection vectors are sufficient for structure preservation of K class data. For the unsupervised setting, we show that random projection preserves this structure without knowledge of the labels. We support our theoretical analysis with empirical results on both synthetic and real world data.

1 Introduction

Subspace clustering or subspace segmentation has recently emerged as a powerful tool for data modeling and analysis in low dimensions. Conventionally, a given dataset is approximated by a single low dimensional subspace via PCA (Principal Component Analysis). However, in many problems it is more appropriate to approximate the data by a union of multiple independent linear subspaces rather than a single subspace. Given just the set of data points, the problem of subspace clustering is to segment them into clusters, each of which can be fitted by a low-dimensional subspace. Such low dimensional data modeling with multiple independent subspaces has found numerous applications in image segmentation [20], motion segmentation [18], face clustering [12] [9], image representation and compression [10], and systems theory [17].

With increasing technological advancements in computation speed and memory, high dimensional feature vectors are being used in many machine learning applications. However, as computation time increases with dimensionality, a number of algorithms for the above mentioned tasks become increasingly slow. Dimensionality reduction techniques (PCA, LDA, LLE [13], LPP [7]) are usually employed to overcome this problem. However, for algorithms that perform tasks like subspace segmentation or low rank recovery, the subspace structure of the data should be preserved after dimensionality reduction.

In this paper, we formalize the idea of Independent Subspace Structure for datasets. Based on this definition, we propose two different algorithms for dimensionality reduction (supervised and unsupervised) that preserve this structure of a dataset. For the unsupervised setting, we show that random projection preserves the subspace structure of data vectors generated from a union of independent linear subspaces. In the past decade, random projection has been extensively used for dimensionality reduction in a number of tasks such as k-means clustering [5] and classification [3] [15]. In these papers the authors prove that certain desired properties of data vectors for the respective tasks are preserved under random projection. However, to the best of our knowledge, a formal analysis of linear subspace structure preservation under random projection has not been reported thus far. For the supervised setting, we first show that for any two disjoint subspaces with arbitrary dimensionality, there exists a two dimensional subspace such that both subspaces collapse to form two lines. We then extend this idea to the K class case and show that 2K projection vectors are sufficient for preserving the subspace structure of a dataset.

2 Definitions and Motivation

A linear subspace in R^n of dimension d can be represented using a matrix B ∈ R^{n×d}, where the columns of B form the support of the subspace. Any vector in this subspace can then be written as x = Bw for some w ∈ R^d. Without any loss of generality, we will assume that the columns of B are unit vectors and pair-wise orthogonal. Let there be K independent subspaces denoted by S_1, S_2, ..., S_K. A subspace S_i is said to be independent of all the other subspaces if there does not exist any non-zero vector in S_i which is a linear combination of vectors in the other subspaces. Formally,

    Σ_{i=1}^{K} S_i = ⊕_{i=1}^{K} S_i

where ⊕ denotes the direct sum of subspaces. We will use the notation x = B_i w to denote that the data vector x ∈ S_i. Here, the columns of B_i ∈ R^{n×d_i} form the support of S_i and w ∈ R^{d_i} can be any arbitrary vector.

While the above definition states the condition under which two or more subspaces are independent, it does not tell us quantitatively how well they are separated. This leads us to the definition of the margin between a pair of subspaces.

Definition 1 (Subspace Margin) Subspaces S_i and S_j are separated by margin γ_ij if

    γ_ij = max_{u ∈ S_i, v ∈ S_j} ⟨u, v⟩ / (‖u‖_2 ‖v‖_2)        (1)

Geometrically, the above definition simply says that the margin between any two subspaces is the maximum dot product between two unit vectors, one from either subspace. The vector pair u and v that maximizes this dot product is known as the principal vector pair between the two subspaces, while the angle between these vectors is called the principal angle. There are min(d_i, d_j) such principal angles (vector pairs), which can be found by performing an SVD of B_i^T B_j. Notice that γ_ij ∈ [0, 1], where γ_ij = 0 implies that the subspaces are maximally separated while γ_ij = 1 implies that the two subspaces are not independent.

Having defined these concepts, our goal is to learn a subspace from any given dataset sampled from a union of independent linear subspaces such that this independent subspace structure is approximately preserved. We will make this idea more concrete shortly. Before that, notice that the above definitions of independent subspaces and separation margin (definition 1) apply explicitly to well defined subspaces. So a natural question is: how do we define these concepts for datasets? We define the Independent Subspace Structure for a dataset as follows.

Definition 2 (Independent Subspace Structure) Let X = {x_j}_{j=1}^{N} be a K class dataset of N data vectors in R^n and let X_i ⊂ X (i ∈ {1 ... K}) be such that the data vectors in X_i belong to class i. Then we say that the dataset X has Independent Subspace Structure if each ith class data vector x ∈ X_i is sampled from a linear subspace S_i (i ∈ {1 ... K}) in R^n such that the subspaces S_i are independent.

Again, the above definition only specifies that data samples from different classes belong to independent subspaces. To quantify the margin between these subspaces, we define the Subspace Margin for datasets as follows.

Definition 3 (Subspace Margin for datasets) For a dataset X with Independent Subspace Structure, class i (i ∈ {1 ... K}) data is separated from all the other classes with margin γ_i if ∀x ∈ X_i and ∀y ∈ X \ {X_i}, ⟨x, y⟩ / (‖x‖_2 ‖y‖_2) ≤ γ_i, where γ_i ∈ [0, 1).

With these definitions, we can now make the idea of independent subspace structure preservation concrete. Specifically, by subspace structure preservation we mean that if a given set of data vectors is originally sampled from a union of independent linear subspaces, then after projection the projected data vectors also belong to a union of independent linear subspaces. Formally, let X be a K class dataset in R^n with independent subspace structure such that class i samples (x ∈ X_i) are drawn from subspace S_i. Then the projected data vectors (using a projection matrix P ∈ R^{n×m}) in the sets X̄_i := {P^T x : x ∈ X_i} for i ∈ {1 ... K} are such that the data vectors in each set X̄_i belong to a linear subspace S̄_i in R^m, and the subspaces S̄_i (i ∈ {1 ... K}) are independent, i.e., Σ_{i=1}^{K} S̄_i = ⊕_{i=1}^{K} S̄_i.

In this paper, we consider both supervised and unsupervised problem settings for subspace learning. In the unsupervised setting, we present a non-adaptive subspace learning technique, while in the supervised setting we show that for a K class problem, 2K projection vectors (with certain properties) are sufficient for independent subspace structure preservation. Further, under the latter setting, we present a heuristic algorithm that approximately computes such 2K projection vectors from a given dataset.
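Before moving on, note that the margin of definition 1 is straightforward to compute when orthonormal bases of the two subspaces are available: as stated above, the principal angles come from the SVD of B_i^T B_j, and γ_ij is the largest singular value. The following minimal sketch (Python with NumPy, used here only for illustration; the paper does not prescribe any implementation) computes this quantity.

    import numpy as np

    def subspace_margin(B1, B2):
        # Margin of Definition 1 for subspaces with orthonormal bases B1, B2:
        # the largest cosine of a principal angle, i.e. the largest singular value of B1^T B2.
        return np.linalg.svd(B1.T @ B2, compute_uv=False).max()

    rng = np.random.default_rng(0)
    B1, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal basis of S_1 in R^50
    B2, _ = np.linalg.qr(rng.standard_normal((50, 4)))   # orthonormal basis of S_2 in R^50
    print(subspace_margin(B1, B2))                       # a value in [0, 1); 1 would mean not independent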


3 Dimensionality Reduction for Unsupervised setting

Random projection has gained large popularity in recent years due to its low computational cost and the guarantees it comes with. Specifically, it has been shown for linearly separable data [15] [3] and for data that lies on a low dimensional compact manifold [4] [8] that random projection preserves linear separability and manifold structure respectively, provided certain conditions are satisfied. Notice that a union of independent linear subspaces is a specific case of manifold structure, and hence the results of random projection for manifold structure apply in general to our case. However, as those results are derived for a more general setting, the resulting bounds are weak when applied to our problem. Further, to the best of our knowledge, there has not been any prior analysis of the effect of random projection on the margin between independent subspaces. In this section, we derive bounds on the minimum number of random vectors required specifically for the preservation of the independent linear subspace structure of a given dataset.

The various applications of random projection for dimensionality reduction are rooted in the following version of the Johnson-Lindenstrauss lemma [16].

Lemma 4 For any vector x ∈ R^n, matrix R ∈ R^{m×n} where each element of R is drawn i.i.d. from a scaled standard Gaussian distribution, R_ij ∼ (1/√m) N(0, 1), and any ε ∈ (0, 1/2),

    Pr((1 − ε)‖x‖_2^2 ≤ ‖Rx‖_2^2 ≤ (1 + ε)‖x‖_2^2) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))        (2)
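As a quick numerical illustration of (2) (a sketch only; the dimensions, ε and number of trials below are arbitrary choices, not taken from the paper), one can draw R with entries N(0,1)/√m and count how often the squared norm is preserved within a factor (1 ± ε).

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, eps, trials = 300, 200, 0.2, 500
    x = rng.standard_normal(n)

    inside = 0
    for _ in range(trials):
        R = rng.standard_normal((m, n)) / np.sqrt(m)         # R_ij ~ (1/sqrt(m)) N(0,1)
        ratio = np.linalg.norm(R @ x)**2 / np.linalg.norm(x)**2
        inside += (1 - eps) <= ratio <= (1 + eps)

    print(inside / trials)                                   # empirical success frequency
    print(1 - 2 * np.exp(-(m / 4) * (eps**2 - eps**3)))      # lower bound from (2)

The empirical frequency is typically much higher than the bound, which is loose but independent of the ambient dimension n.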

Lemma 4 states that the ℓ2 norm of the randomly projected vector is approximately equal to the ℓ2 norm of the original vector. Using this lemma, it can readily be shown that for any number of data vectors, the pair-wise distances among them are preserved with high probability, under the mild condition that the number of random vectors used should be proportional to the logarithm of the number of data vectors. While conventionally the elements of the random matrix are generated from a Gaussian distribution, it has been proved [1] [11] that one can indeed use sparse random matrices (with most of the elements being zero with high probability) to achieve the same goal.

Before studying the conditions required for independent subspace structure preservation in the binary and multiclass cases, we first state our cosine preservation lemma, which simply states that the cosine of the angle between any two fixed vectors is approximately preserved under random projection. A similar angle preservation theorem is stated in [15]; we discuss the difference between the two after the lemma.

Lemma 5 (Cosine preservation) For all x, y ∈ R^n, any ε ∈ (0, 1/2) and matrix R ∈ R^{m×n} where each element of R is drawn i.i.d. from a scaled standard Gaussian distribution, R_ij ∼ (1/√m) N(0, 1), one of the following inequalities holds true:

    (1/(1−ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) − ε/(1−ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1+ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) + ε/(1+ε)        (3)

if ⟨x, y⟩/(‖x‖_2 ‖y‖_2) < −ε,

    (1/(1−ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) − ε/(1−ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) + ε/(1−ε)        (4)

if −ε ≤ ⟨x, y⟩/(‖x‖_2 ‖y‖_2) < ε, and

    (1/(1+ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) − ε/(1+ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) + ε/(1−ε)        (5)

if ⟨x, y⟩/(‖x‖_2 ‖y‖_2) ≥ ε. Further, the corresponding inequality holds true with probability at least 1 − 8 exp(−(m/4)(ε^2 − ε^3)).

Proof: Let x̄ := x/‖x‖_2 and ȳ := y/‖y‖_2, and consider the case when ⟨x, y⟩/(‖x‖_2 ‖y‖_2) ≥ ε. Then from lemma 4,

    Pr((1 − ε)‖x̄ + ȳ‖_2^2 ≤ ‖Rx̄ + Rȳ‖_2^2 ≤ (1 + ε)‖x̄ + ȳ‖_2^2) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))
    Pr((1 − ε)‖x̄ − ȳ‖_2^2 ≤ ‖Rx̄ − Rȳ‖_2^2 ≤ (1 + ε)‖x̄ − ȳ‖_2^2) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))        (6)

Using the union bound on the above two, both hold true simultaneously with probability at least 1 − 4 exp(−(m/4)(ε^2 − ε^3)). Notice that ‖Rx̄ + Rȳ‖_2^2 − ‖Rx̄ − Rȳ‖_2^2 = 4⟨Rx̄, Rȳ⟩. Using (6), we get

    4⟨Rx̄, Rȳ⟩ = ‖Rx̄ + Rȳ‖_2^2 − ‖Rx̄ − Rȳ‖_2^2
              ≤ (1 + ε)‖x̄ + ȳ‖_2^2 − (1 − ε)‖x̄ − ȳ‖_2^2
              = (1 + ε)(2 + 2⟨x̄, ȳ⟩) − (1 − ε)(2 − 2⟨x̄, ȳ⟩)
              = 4ε + 4⟨x̄, ȳ⟩        (7)

We can similarly prove the other direction to yield ⟨Rx̄, Rȳ⟩ ≥ ⟨x̄, ȳ⟩ − ε. Together we have that

    ⟨x̄, ȳ⟩ − ε ≤ ⟨Rx̄, Rȳ⟩ ≤ ⟨x̄, ȳ⟩ + ε        (8)

holds true with probability at least 1 − 4 exp(−(m/4)(ε^2 − ε^3)). Finally, applying lemma 4 to the vectors x and y, we get

    (1 − ε)‖x‖_2 ‖y‖_2 ≤ ‖Rx‖_2 ‖Ry‖_2        (9)

Thus, ⟨Rx̄, Rȳ⟩ = ⟨Rx, Ry⟩/(‖x‖_2 ‖y‖_2) ≥ (1 − ε) ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2), where we used (9) together with the fact that ⟨Rx, Ry⟩ ≥ 0 in this case (by (8) and the assumption ⟨x̄, ȳ⟩ ≥ ε). Combining this with eq (8), we get ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) ⟨x̄, ȳ⟩ + ε/(1−ε). We can similarly get the other inequality to arrive at (5). Notice that we made use of lemma 4 four times, and hence inequality (5) holds with probability at least 1 − 8 exp(−(m/4)(ε^2 − ε^3)) using the union bound. Inequalities (3) and (4) can be derived similarly. □
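A minimal numerical check of the lemma (illustrative settings only, chosen by us rather than taken from the paper): inequalities (3)-(5) jointly imply that the projected cosine deviates from the original one by at most ε(1 + |cos|)/(1 − ε), with the stated probability.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, eps = 300, 150, 0.25
    x = 5.0 * rng.standard_normal(n)          # arbitrary, unequal lengths on purpose
    y = 0.1 * rng.standard_normal(n)

    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    for _ in range(5):
        R = rng.standard_normal((m, n)) / np.sqrt(m)
        cos_R = (R @ x) @ (R @ y) / (np.linalg.norm(R @ x) * np.linalg.norm(R @ y))
        # relaxation common to (3)-(5); should hold with high probability (may occasionally fail)
        print(round(cos, 3), round(cos_R, 3),
              abs(cos_R - cos) <= eps * (1 + abs(cos)) / (1 - eps))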

We would like to point out that cosines of both acute and obtuse angles are preserved under random projection, as is evident from the above lemma. However, if the cosine value is close to zero, the additive error in inequalities (3), (4) and (5) distorts the cosine significantly after projection. On the other hand, [15] state in their paper that obtuse angles are not preserved. As evidence, the authors empirically show cosines with negative values close to zero. However, as already stated, cosine values close to zero are not well preserved in general, so this does not serve as evidence that obtuse angles are not preserved under random projection; indeed, our experiments show that they are preserved. Notice that this is not the case for the JL lemma (4), where the error is multiplicative and hence the lengths of vectors are preserved to a good degree for all vectors.

On the other hand, the inner product between vectors is in general not well preserved under random projection, irrespective of the angle between the two vectors. This can be analysed using equation (8). Rewriting this equation in the following form, we have that

    ⟨x, y⟩ − ε‖x‖_2 ‖y‖_2 ≤ ⟨Rx, Ry⟩ ≤ ⟨x, y⟩ + ε‖x‖_2 ‖y‖_2        (10)

holds with high probability. Clearly, because the error term itself depends on the lengths of the vectors, the inner product between arbitrary vectors after random projection is not well preserved. However, as a special case, the inner product of vectors with length less than 1 is preserved (corollary 2 in [2]) because the error term is diminished in this case.

For ease of exposition, in all further analysis we will use equation (5) when making use of the cosine preservation lemma. Notice that the difference between the three inequalities lies in the cosine preservation error, while each holds with the same probability bound. We will later see that this does not affect our bounds on the number of random vectors required for subspace margin preservation and hence does not affect our analysis.

In order for the independent subspace structure of a dataset to be preserved, we need two things to hold simultaneously. First, data sampled from each subspace should continue to belong to a linear subspace after projection. Second, the subspace margin of the dataset should be preserved.

Remark 6 (Individual Subspace preservation) Let X_i denote the set of data vectors x drawn from the subspace S_i, and let R ∈ R^{m×n} denote the random projection matrix as defined before. Then after projection, all the vectors in X_i continue to lie along the linear subspace spanned by the columns of RB_i, where the columns of B_i span S_i.

The above straightforward remark states that the first requirement always holds true. We now derive the condition needed for the second requirement to hold.

Theorem 7 (Binary Subspace Preservation) Let X = {x_j}_{j=1}^{N} be a 2 class (K = 2) dataset with Independent Subspace Structure and margin γ. Then for any ε ∈ (0, 1/2), the subspace structure of the dataset is preserved with margin γ̄ after random projection using a matrix R ∈ R^{m×n} (R_ij ∼ i.i.d. (1/√m) N(0, 1)) as follows:

    Pr(γ̄ ≤ (1/(1−ε)) γ + ε/(1−ε)) ≥ 1 − 6N^2 exp(−(m/4)(ε^2 − ε^3))        (11)

Proof: Let x ∈ X_1 and y ∈ X_2 be samples from class 1 and 2 respectively in the dataset X. Applying the union bound on lemma 5 for a single vector x and all vectors y ∈ X_2,

    ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2 ‖y‖_2) + ε/(1−ε)
                             ≤ (1/(1−ε)) max_{x ∈ X_1, y ∈ X_2} ⟨x, y⟩/(‖x‖_2 ‖y‖_2) + ε/(1−ε)
                             = (1/(1−ε)) γ + ε/(1−ε)        (12)

holds with probability at least 1 − 6N_2 exp(−(m/4)(ε^2 − ε^3)), where N_2 is the number of samples in X_2. As we are only interested in upper bounding the projected margin, we have only used one side of the inequality in lemma 5 and hence obtain a tighter probability bound. Again, applying the above bound for all the samples x ∈ X_1,

    ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) γ + ε/(1−ε)        (13)

holds with probability at least 1 − 6N_1 N_2 exp(−(m/4)(ε^2 − ε^3)), where N_1 is the number of samples in X_1. Using N^2 to upper bound N_1 N_2, we get (11). □

The above theorem analyses the condition required for margin preservation in the binary class case. We now analyse the requirement for the multiclass case.

Theorem 8 (Multiclass Subspace Preservation) Let X = {x_j}_{j=1}^{N} be a K class dataset with Independent Subspace Structure such that the ith class has margin γ_i. Then for any ε ∈ (0, 1/2), the subspace structure of the entire dataset is preserved after random projection using a matrix R ∈ R^{m×n} (R_ij ∼ i.i.d. (1/√m) N(0, 1)) with margin γ̄_i for class i as follows:

    Pr(γ̄_i ≤ (1/(1−ε)) γ_i + ε/(1−ε), ∀i ∈ {1 ... K}) ≥ 1 − 6N^2 exp(−(m/4)(ε^2 − ε^3))        (14)

Proof: Applying the union bound on lemma 5 for a single vector x ∈ X_i and all vectors y ∈ X \ {X_i},

    ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) γ_i + ε/(1−ε)        (15)

holds with probability at least 1 − 6N̄_i exp(−(m/4)(ε^2 − ε^3)), where N̄_i := N − N_i. Again, applying the above bound for all the samples x ∈ X_i,

    ⟨Rx, Ry⟩/(‖Rx‖_2 ‖Ry‖_2) ≤ (1/(1−ε)) γ_i + ε/(1−ε)        (16)

holds with probability at least 1 − 6N_i N̄_i exp(−(m/4)(ε^2 − ε^3)). Computing bounds similar to (16) for all the classes, we have that

    γ̄_i ≤ (1/(1−ε)) γ_i + ε/(1−ε), ∀i ∈ {1 ... K}        (17)

holds with probability at least 1 − Σ_{i=1}^{K} 6N_i N̄_i exp(−(m/4)(ε^2 − ε^3)). Noticing that Σ_{i=1}^{K} N_i N̄_i ≤ Σ_{i=1}^{K} N_i N = N^2 leads to (14). □

Notice that the probability bounds are the same for both the binary and the multiclass case. This implies that the number of classes does not affect subspace structure preservation; it only depends on the number of data vectors. Also recall from our discussion of the cosine preservation lemma (5) that cosine values close to zero are not well preserved under random projection. However, from our error bound on the margin (eq (14)), it turns out that this is not a problem. Notice that two subspaces separated with a margin close to zero have even their smallest principal angle close to 90 degrees, i.e., they are maximally separated. Under these circumstances, the projected subspaces are also well separated. Formally, let γ = 0; then after projection, γ̄ ≤ ε/(1−ε), which is upper bounded by 1 as ε tends to 0.5. In practice we set ε to a much smaller quantity and hence γ̄ is well below 1.

Corollary 9 For any δ ∈ [0, 1) and ε ∈ (0, 1/2), if any K class (K > 1) dataset X with N data samples has Independent Subspace Structure, then for this structure to be preserved with probability at least δ, the number of random vectors m must satisfy

    m ≥ (8/(ε^2 − ε^3)) ln(√6 N / (1 − δ))        (18)

The above corollary states the minimum number of random vectors required for independent subspace structure preservation with error parameter ε and probability δ for any given dataset. While all the analysis so far only concerns structure preservation for datasets with Independent Subspace Structure, it is not hard to see that the same bounds also apply to a dataset with a disjoint subspace structure, i.e., where the subspaces (classes) are pairwise disjoint but not independent overall.
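A small helper evaluating the bound (18) as reconstructed above (a sketch only; the exact constants printed are indicative and need not coincide with the curves of figure 6):

    import numpy as np

    def min_random_vectors(N, eps, delta):
        # Minimum projected dimension m from (18): m >= 8/(eps^2 - eps^3) * ln(sqrt(6) N / (1 - delta))
        return int(np.ceil(8.0 / (eps**2 - eps**3) * np.log(np.sqrt(6) * N / (1.0 - delta))))

    for N in (1000, 10000):
        print(N, [min_random_vectors(N, eps, delta=0.99) for eps in (0.15, 0.3, 0.4)])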

4 Dimensionality Reduction for Supervised setting

In this section, we propose a new subspace learning approach applicable to labeled datasets that also theoretically guarantees independent subspace structure preservation.

In the previous section, we saw that the minimum number of random vectors required for dimensionality reduction depends on the size of the dataset. Under the supervised setting, the number of projection vectors required by our new approach is not only independent of the size of the dataset but is also fixed, depending only on the number of classes. Specifically, we show that for any K class labeled dataset with independent subspace structure, only 2K projection vectors are required for structure preservation.

The idea of being able to find a fixed number of projection vectors for the structure preservation of a K class dataset is motivated by theorem 10, which states a useful property of any pair of disjoint subspaces.

Theorem 10 Let the unit vectors v_1 and v_2 be the ith principal vector pair for any two disjoint subspaces S_1 and S_2 in R^n. Let the columns of the matrix P ∈ R^{n×2} be any two orthonormal vectors in the span of v_1 and v_2. Then for all vectors x ∈ S_j, P^T x = α t_j (j ∈ {1, 2}), where α ∈ R depends on x and t_j ∈ R^2 is a fixed vector independent of x. Further, t_1^T t_2 / (‖t_1‖_2 ‖t_2‖_2) = v_1^T v_2.

Proof: We use the notation (M)_j to denote the jth column vector of a matrix M. We claim that t_j = P^T v_j (j ∈ {1, 2}). Without any loss of generality, assume that (P)_1 = v_1; then (P)_2 is proportional to v_2 − (v_1^T v_2) v_1 = B_2 w_2 − (w_1^T B_1^T B_2 w_2) B_1 w_1. In order to prove theorem 10, it suffices to show that ∀x ∈ S_1, (P)_2^T x = 0. By symmetry, ∀x ∈ S_2, P^T x will also lie along a line in the subspace spanned by the columns of P. Let the columns of B_1 ∈ R^{n×d_1} and B_2 ∈ R^{n×d_2} be the support of S_1 and S_2 respectively, where d_1 and d_2 are the dimensionalities of the two subspaces. Then we can represent v_1 and v_2 as v_1 = B_1 w_1 and v_2 = B_2 w_2 for some w_1 ∈ R^{d_1} and w_2 ∈ R^{d_2}. Let B_1 w be any arbitrary vector in S_1, where w ∈ R^{d_1}. Then we need to show that T := (B_1 w)^T (P)_2 = 0 for all w. Notice that

    T = (B_1 w)^T (B_2 w_2 − (w_1^T B_1^T B_2 w_2) B_1 w_1)
      = w^T (B_1^T B_2 w_2 − (w_1^T B_1^T B_2 w_2) w_1)   ∀w        (19)

Let U S V^T be the SVD of B_1^T B_2. Then w_1 and w_2 are the ith columns of U and V respectively, and v_1^T v_2 is the ith diagonal element of S, since v_1 and v_2 are the ith principal vectors of S_1 and S_2. Thus,

    T = w^T (U S V^T w_2 − S_ii (U)_i)
      = w^T (S_ii (U)_i − S_ii (U)_i) = 0        (20)

□
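The following sketch illustrates theorem 10 numerically (NumPy, with illustrative dimensions chosen by us): two random subspaces are generated, a principal vector pair is taken from the SVD of B_1^T B_2, and the projections of both subspaces onto its span are verified to be one dimensional, with the predicted angle between the two resulting lines.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d1, d2 = 20, 4, 3
    B1, _ = np.linalg.qr(rng.standard_normal((n, d1)))          # basis of S_1
    B2, _ = np.linalg.qr(rng.standard_normal((n, d2)))          # basis of S_2

    U, s, Vt = np.linalg.svd(B1.T @ B2)
    i = 0                                                       # use the 1st principal vector pair
    v1, v2 = B1 @ U[:, i], B2 @ Vt[i, :]
    P, _ = np.linalg.qr(np.column_stack([v1, v2]))              # orthonormal basis of span{v1, v2}

    X1 = B1 @ rng.standard_normal((d1, 100))                    # points from S_1
    X2 = B2 @ rng.standard_normal((d2, 100))                    # points from S_2
    print(np.linalg.matrix_rank(P.T @ X1), np.linalg.matrix_rank(P.T @ X2))   # 1 1: two lines
    t1, t2 = P.T @ v1, P.T @ v2
    print(t1 @ t2 / (np.linalg.norm(t1) * np.linalg.norm(t2)), s[i])          # equal cosines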

Geometrically, theorem 10 says that after projection onto the plane P defined by any one of the principal vector pairs between subspaces S_1 and S_2, both subspaces collapse to just two lines, such that points from S_1 lie along one line while points from S_2 lie along the other. Further, the angle that separates these lines is equal to the angle between the ith principal vector pair between S_1 and S_2 if the span of the ith principal vector pair is used as P.

We apply theorem 10 to a three dimensional example as shown in figure 1. In figure 1 (a), the first subspace (the y-z plane) is denoted by red while the second subspace is the black line in the x-y plane. Notice that for this setting, the x-y plane (denoted by blue) is in the span of the 1st (and only) principal vector pair between the two subspaces. After projection of both subspaces onto the x-y plane, we get two lines (figure 1 (b)), as stated in the theorem.

Figure 1: A three dimensional example of the application of theorem 10; (a) independent subspaces in 3 dimensions, (b) subspaces after projection. See text in section 4 for details.

We would also like to mention that for this geometrical property to hold, the two subspaces need not be disjoint in general. On careful examination of the statement of theorem 10, we find that it suffices for at least one principal vector pair between the two subspaces to be separated by a non-zero angle. However, because such a vector pair is hard to estimate from a given dataset, we consider the case of disjoint subspaces. Before stating our main theorem (12), we first state lemma 11, which we will use later in our proof. This lemma states that if two vectors are separated by a non-zero angle, then after augmenting these vectors with arbitrary vectors, the new vectors remain separated by some non-zero angle as well. This straightforward idea will help us extend the two subspace case of theorem 10 to multiple subspaces.

Lemma 11 Let x_1, y_1 be any two fixed vectors of the same dimensionality such that x_1^T y_1 / (‖x_1‖_2 ‖y_1‖_2) = γ, where γ ∈ [0, 1). Let x_2, y_2 be any two arbitrary vectors of the same dimensionality with respect to each other. Then there exists a constant γ̄ ∈ [0, 1) such that the vectors x' = [x_1; x_2] and y' = [y_1; y_2] are also separated, i.e., x'^T y' / (‖x'‖_2 ‖y'‖_2) ≤ γ̄.

Proof: Notice that

    x'^T y' / (‖x'‖_2 ‖y'‖_2) = (x_1^T y_1 + x_2^T y_2) / (‖x'‖_2 ‖y'‖_2)
                              = (γ ‖x_1‖_2 ‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2 ‖y'‖_2)
                              < (‖x_1‖_2 ‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2 ‖y'‖_2)
                              ≤ max_{x_2, y_2} (‖x_1‖_2 ‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2 ‖y'‖_2) = 1        (21)

□

Finally, we now show that for any K class dataset with independent subspace structure, 2K projection vectors are sufficient for structure preservation. The theorem also states which 2K vectors have this property.

Theorem 12 Let X = {x_j}_{j=1}^{N} be a K class dataset in R^n with Independent Subspace Structure. Let P = [P_1 ... P_K] ∈ R^{n×2K} be a projection matrix for X such that the columns of each matrix P_k ∈ R^{n×2} consist of orthonormal vectors in the span of any principal vector pair between the subspaces S_k and Σ_{j≠k} S_j. Then the Independent Subspace Structure of the dataset X is preserved after projection onto the 2K vectors in P.

Proof: It suffices to show that data vectors from the subspaces S_k and Σ_{j≠k} S_j (for any k ∈ {1 ... K}) are separated by margin less than 1 after projection using P. Let x and y be any vectors in S_k and Σ_{j≠k} S_j respectively, and let the columns of the matrix P_k be in the span of the ith (say) principal vector pair between these subspaces. Using theorem 10, the projected vectors P_k^T x and P_k^T y are separated by an angle equal to the angle between the ith principal vector pair between S_k and Σ_{j≠k} S_j. Let the cosine of this angle be γ. Then, using lemma 11, the vectors P^T x and P^T y obtained by appending the remaining dimensions to P_k^T x and P_k^T y are also separated by some margin γ̄ < 1. As the same argument holds for vectors from all classes, the Independent Subspace Structure of the dataset remains preserved after projection. □

Algorithm 1 Computation of the projection matrix P for the supervised setting
INPUT: X, K, λ
for k = 1 to K do
    w_2* ← random vector in R^{N̄_k}
    while γ not converged do
        w_1* ← argmin_{w_1} ‖X_k w_1 − X̄_k w_2* / ‖X̄_k w_2*‖_2‖_2^2 + λ‖w_1‖_2^2
        w_2* ← argmin_{w_2} ‖X_k w_1* / ‖X_k w_1*‖_2 − X̄_k w_2‖_2^2 + λ‖w_2‖_2^2
        γ ← (X_k w_1*)^T (X̄_k w_2*) / (‖X_k w_1*‖_2 ‖X̄_k w_2*‖_2)
    end while
    P_k ← orthonormalized form of the vectors X_k w_1*, X̄_k w_2*
end for
P ← orthonormalized form of the vectors in [P_1 ... P_K]
OUTPUT: P

For any two disjoint subspaces, theorem 10 tells us that there is a two dimensional plane on which both projected subspaces form two lines. Using lemma 11, it can be argued that after augmenting the basis of this plane with arbitrary additional vectors, the two projected subspaces still remain disjoint. Theorem 12 simply applies this argument to each subspace and the sum of the remaining subspaces, one at a time. Thus for K subspaces, we get 2K projection vectors.
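When the subspace bases are known (which, as discussed in section 4.1, is not the case in practice), the 2K-column matrix P of theorem 12 can be assembled directly. The sketch below does this with NumPy and is only meant to make the construction concrete; it is not the data-driven procedure of algorithm 1.

    import numpy as np

    def projection_from_bases(bases):
        # bases: list of K matrices with orthonormal columns spanning S_1, ..., S_K.
        cols = []
        for k, Bk in enumerate(bases):
            rest = np.column_stack([B for j, B in enumerate(bases) if j != k])
            Brest, _ = np.linalg.qr(rest)                       # basis of sum_{j != k} S_j
            U, s, Vt = np.linalg.svd(Bk.T @ Brest)
            v1, v2 = Bk @ U[:, 0], Brest @ Vt[0, :]             # a principal vector pair
            Pk, _ = np.linalg.qr(np.column_stack([v1, v2]))     # orthonormal vectors in span{v1, v2}
            cols.append(Pk)
        P, _ = np.linalg.qr(np.column_stack(cols))              # orthonormalize [P_1 ... P_K], as in algorithm 1
        return P

    rng = np.random.default_rng(3)
    bases = [np.linalg.qr(rng.standard_normal((100, 5)))[0] for _ in range(4)]   # K = 4 subspaces in R^100
    print(projection_from_bases(bases).shape)                   # (100, 8): 2K projection vectors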

4.1 Implementation

Even though theorem 12 guarantees the structure preservation of the dataset X after projection using P as specified, this does not yet solve the problem of dimensionality reduction. The reason is that, given a labeled dataset sampled from a union of independent subspaces, we do not have any information about the basis or even the dimensionality of the underlying subspaces. Under these circumstances, constructing the projection matrix P as specified in theorem 12 itself becomes a problem. To solve this problem, we propose a heuristic algorithm that tries to approximate a principal vector pair between the subspaces S_k and Σ_{j≠k} S_j (for k = 1 to K) given the labeled dataset X. The assumption behind this approach is that samples from each subspace (class) are not heavily corrupted and that the underlying subspaces are independent.

Notice that we are not interested in one particular principal vector pair between any two subspaces for the computation of the projection matrix. This is because we have assumed an independent subspace structure, so every principal vector pair is separated by some margin γ < 1. We therefore need an algorithm that computes any arbitrary principal vector pair given data from two independent subspaces. These vectors can then be used to form one of the K submatrices of P as specified in theorem 12. For computing the submatrix P_k, we need to find a principal vector pair between the subspaces S_k and Σ_{j≠k} S_j. In terms of the dataset X, we heuristically estimate this vector pair using the data in X_k and X̄_k, where X̄_k := X \ {X_k}. We repeat this process for each class to finally form the entire matrix P. Our heuristic approach is stated in algorithm 1. For each class k, the idea is to start with a random vector in the span of X̄_k and find the vector in the span of X_k closest to it. We then fix this vector and search for the closest vector in the span of X̄_k. Repeating this process until the cosine between the two vectors converges yields an approximate principal vector pair. Notice that in order to estimate the closest vector from the opposite subspace, algorithm 1 uses a quadratic program that minimizes the reconstruction error of the fixed vector (of one subspace) using vectors from the opposite subspace. The regularization term in the optimization handles noise in the data.
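A possible NumPy rendering of algorithm 1 is sketched below. It replaces the "γ not converged" test with a fixed number of alternations and solves each quadratic program in closed form as a ridge regression; the function and variable names are ours, and this is only an illustration of the heuristic, not the authors' reference implementation.

    import numpy as np

    def ridge_coeffs(A, b, lam):
        # Solve min_w ||A w - b||_2^2 + lam ||w||_2^2 (the quadratic program used in algorithm 1).
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

    def supervised_projection(X, labels, lam=0.1, iters=20, seed=0):
        # X: n x N data matrix; labels: length-N array of class ids. Returns P in R^{n x 2K}.
        rng = np.random.default_rng(seed)
        cols = []
        for k in np.unique(labels):
            Xk, Xbar = X[:, labels == k], X[:, labels != k]
            u2 = Xbar @ rng.standard_normal(Xbar.shape[1])           # random vector in span(X_bar_k)
            for _ in range(iters):                                   # alternate until (approximately) converged
                u2 = u2 / np.linalg.norm(u2)
                u1 = Xk @ ridge_coeffs(Xk, u2, lam)                  # closest vector in span(X_k)
                u1 = u1 / np.linalg.norm(u1)
                u2 = Xbar @ ridge_coeffs(Xbar, u1, lam)              # closest vector in span(X_bar_k)
            cols.append(np.column_stack([u1, u2]))                   # approximate principal vector pair
        P, _ = np.linalg.qr(np.column_stack(cols))                   # orthonormalize the 2K vectors
        return P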

5 Empirical Analysis

In this section, we present empirical evidence to support the theoretical analysis presented so far. We report experiments on both our supervised and unsupervised subspace learning approaches.

5.1 Analysis of our Unsupervised approach

5.1.1 Cosine preservation

In lemma 5, we concluded that the cosine of the angle between any two vectors remains preserved under random projection, irrespective of the angle being acute or obtuse. However, we also stated that cosine values close to zero are not well preserved. Here, we perform an empirical analysis on vectors with varying angles (both acute and obtuse) and arbitrary lengths to verify this. We use settings similar to [15]. We generate 2000 random projection matrices R_i ∈ R^{m×n} (i = 1 to 2000), where we vary m ∈ {30, 60, ..., 300} and n = 300 is the dimension of the original space. We define the empirical rejection probability for cosine preservation, similar to [15], as

    P̂ = 1 − (1/2000) Σ_{i=1}^{2000} 1((1 − ε) ≤ (⟨R_i x, R_i y⟩ ‖x‖_2 ‖y‖_2) / (‖R_i x‖_2 ‖R_i y‖_2 ⟨x, y⟩) ≤ (1 + ε))

where we vary ε ∈ {0.1, 0.3} and 1(·) is the indicator function. For acute angles, we randomly generate vectors x and y of arbitrary length but with fixed cosine values γ = {0.019021, 0.37161, 0.67809, 0.92349}. For obtuse angles, we similarly generate vectors x and y with fixed cosine values γ = {−0.036831, −0.45916, −0.65797, −0.92704}. We then compute the empirical rejection probability defined above for different values of ε. Figures 2 and 3 show the results on these vectors. In both figures, notice that the rejection probability decreases as the absolute value of the cosine of the angle (γ) increases (from 0 to 1), as well as for the higher value of ε. For cosine values close to zero, the rejection probability is close to 1 even at high dimensions. These results corroborate our theoretical analysis in lemma 5.

Figure 2: Empirical rejection probability for cosine preservation (acute angle), plotted against the projected dimensionality m for (a) ε = 0.1 and (b) ε = 0.3. See section 5.1.1 for details.

5.1.2 Inner Product under Random projection

We use the same experimental setting as in section 5.1.1. We define the empirical rejection probability of the inner product, similar to [15], as

    P̂ = 1 − (1/2000) Σ_{i=1}^{2000} 1((1 − ε) ≤ ⟨R_i x, R_i y⟩ / ⟨x, y⟩ ≤ (1 + ε))

We use the same vectors as in section 5.1.1 for the experiments in this section and compute the empirical rejection probability defined above for different values of ε. Figures 4 and 5 show the results on these vectors. As is evident from these figures, the inner product between vectors is not well preserved (even when the cosine values are close to 1). This result is in line with our theoretical bound in equation (10), as the vector lengths in our experiment are arbitrarily greater than 1.
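The two rejection probabilities above can be estimated with a few lines of NumPy (a sketch only; the cosine value, lengths and dimensions below are arbitrary choices of ours rather than the exact settings of figures 2-5):

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, eps, trials = 300, 150, 0.3, 2000
    x = 3.0 * rng.standard_normal(n)
    z = rng.standard_normal(n); z -= (z @ x) / (x @ x) * x           # component of z orthogonal to x
    target = 0.37                                                    # prescribed cosine between x and y
    y = 2.0 * (target * x / np.linalg.norm(x) + np.sqrt(1 - target**2) * z / np.linalg.norm(z))

    cos_xy, ip_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), x @ y
    cos_ok = ip_ok = 0
    for _ in range(trials):
        R = rng.standard_normal((m, n)) / np.sqrt(m)
        Rx, Ry = R @ x, R @ y
        cos_R = Rx @ Ry / (np.linalg.norm(Rx) * np.linalg.norm(Ry))
        cos_ok += (1 - eps) <= cos_R / cos_xy <= (1 + eps)
        ip_ok += (1 - eps) <= (Rx @ Ry) / ip_xy <= (1 + eps)

    print(1 - cos_ok / trials)    # rejection probability for the cosine (section 5.1.1)
    print(1 - ip_ok / trials)     # rejection probability for the inner product (section 5.1.2)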

Figure 3: Empirical rejection probability for cosine preservation (obtuse angle), plotted against the projected dimensionality m for (a) ε = 0.1 and (b) ε = 0.3. See section 5.1.1 for details.

Figure 4: Empirical rejection probability for inner product preservation (acute angle), plotted against the projected dimensionality m for (a) ε = 0.1 and (b) ε = 0.3. See section 5.1.2 for details.

5.1.3 Required number of random vectors

Following corollary 9, we study the number of random vectors required for subspace preservation while varying the different parameters. For δ = 0.99, figure 6 shows the minimum number of random vectors m required (y-axis) while varying the number of data samples N (x-axis) between 100 and 10,000, for ε = {0.15, 0.3, 0.4}. It can be seen that for N = 1000 and ε = 0.15, random projection to lower dimensions is effective only if m > 6000, while for ε = 0.4, m > 900 suffices. The choice of ε depends on the robustness of the algorithm (for the respective task) towards noise, and is a trade-off between the amount of noise allowed and the number of random vectors m required.

5.1.4 Subspace preservation for synthetic data

To test our claim of subspace structure preservation, we run a subspace segmentation algorithm on randomly projected synthetic data. Similarly to [12], we construct K = 5 independent subspaces {S_i}_{i=1}^{5} ⊂ R^n (n = 1000), whose bases {U_i}_{i=1}^{5} are computed by U_{i+1} = T U_i, i ∈ {1, ..., 4}, where T is a random rotation matrix and U_1 is a random orthogonal matrix of dimension 1000 × 5. Thus each subspace has a dimension of 5. We randomly sample 200 data vectors from each subspace by X_i = U_i Q_i, i ∈ {1, ..., 5}, with Q_i being a 5 × 200 i.i.d. N(0, 1) matrix, resulting in a total of 1000 data vectors. We also add data-dependent Gaussian noise to a fraction of the generated data. For a data vector x chosen to be corrupted, the observed vector is computed as x + n_g, where n_g is Gaussian noise with zero mean and variance 0.3‖x‖_2 (‖x‖_2 mostly ranges from 0.1 to 1.7 in this experiment). We vary the fraction of samples corrupted in this way from 0 to 0.5 with a step size of 0.1.

In our experiments, we use m = 100, 300, 500 random vectors to project the 1000-dimensional original data to lower dimensions and then apply the LRR algorithm [12] for subspace segmentation. For each value of m and each fraction of corrupted data, we perform 50 runs of random projection matrix generation, data projection and subspace segmentation. For comparison purposes, subspace segmentation is also carried out in the original dimension n = 1000. We report the mean and standard deviation of the segmentation accuracy over the 50 runs. This is illustrated in figure 7, which shows the % segmentation accuracy as m is varied. We observe that for every magnitude of data corruption, the segmentation accuracy stays nearly flat when m is varied over {100, 300, 500, n(= 1000)}. This corroborates our claim that the subspace structure is preserved when the data is randomly projected to a lower dimensional subspace. We would also like to mention that the reason for the high accuracy at low dimensions (high error) is that the segmentation algorithm is robust to noise.
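The synthetic data of this section can be reproduced along the following lines (a NumPy sketch of the construction only; the LRR segmentation step [12] is not included and the seeds are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    n, K, d, per_class = 1000, 5, 5, 200

    T, _ = np.linalg.qr(rng.standard_normal((n, n)))             # random orthogonal matrix used as the rotation T
    U = [np.linalg.qr(rng.standard_normal((n, d)))[0]]           # U_1: random 1000 x 5 orthonormal basis
    for _ in range(K - 1):
        U.append(T @ U[-1])                                      # U_{i+1} = T U_i

    X = np.column_stack([Ui @ rng.standard_normal((d, per_class)) for Ui in U])
    labels = np.repeat(np.arange(K), per_class)

    frac = 0.3                                                   # fraction of corrupted samples
    for j in rng.choice(X.shape[1], int(frac * X.shape[1]), replace=False):
        X[:, j] += rng.normal(0.0, np.sqrt(0.3 * np.linalg.norm(X[:, j])), size=n)   # variance 0.3 ||x||_2

    m = 300
    X_proj = (rng.standard_normal((m, n)) / np.sqrt(m)) @ X      # random projection before segmentation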

Figure 5: Empirical rejection probability for inner product preservation (obtuse angle), plotted against the projected dimensionality m for (a) ε = 0.1 and (b) ε = 0.3. See section 5.1.2 for details.

Figure 6: Required number of random vectors (m) versus number of data samples (N), for ε = 0.15, 0.3, 0.4. The blue broken line is for reference. See section 5.1.3 for details.

Figure 7: Segmentation accuracy vs projection dimension m for different degrees of data corruption (0% to 50%). Performance at m = 100, 300, 500 is approximately preserved relative to the original dimension n = 1000. See section 5.1.4 for details.

Table 1: Segmentation accuracy on the Extended Yale dataset B under the unsupervised setting. Accuracy in the original feature space (R^10,000, no projection) is 83.75%. See section 5.1.5 for details.

    m              400            900            1600           2500
    accuracy (%)   81.47 ± 2.71   83.14 ± 1.20   83.18 ± 1.17   83.39 ± 1.15

5.1.5 Subspace preservation for real data

We choose the Extended Yale face dataset B [6] for evaluation on real data. It consists of ∼2414 frontal face images of 38 individuals, with 64 images each. For our evaluation we use the first 5 classes as our 5 clusters and perform the task of segmentation; thus the total dataset size is N = 320. We choose a face dataset because it is generally assumed that face images with illumination variation lie along linear subspaces [14]. We use 100 × 100 images and stack all the pixels to form our feature vector. For ε = 0.4, the minimum required number of random vectors m is ∼900. For evaluation, we project the image vectors onto dimensions ranging between 400 and 2500 using random vectors. Similar to our synthetic data analysis, we perform 50 runs of random projection matrix generation and segmentation for each value of m and report the results. The accuracy of subspace segmentation after random projection can be seen in table 1. The accuracy of subspace segmentation in the original feature space (100 × 100 dimensions) is 83.75%. As evident from the results, the accuracy at 400 dimensions is slightly low but saturates after 900 dimensions.

Figure 8: Synthetic two class high dimensional data projected onto a two dimensional subspace under the supervised setting; (a) data projected using P_a, (b) data projected using P_b. See section 5.2.1 for details.

5.2 Analysis of our Supervised approach

5.2.1 Two Subspaces Two Lines

We test both the claim of theorem 10 and the quality of the approximation achieved by algorithm 1 in this section, on both synthetic and real data.

1. Synthetic Data: We generate two random subspaces in R^1000 of dimensionality 20 and 30 (notice that these subspaces will be independent with probability 1). We randomly generate 100 data vectors from each subspace and normalize them to have unit length. We then compute the 1st principal vector pair between the two subspaces from their basis vectors by performing an SVD of B_1^T B_2, where B_1 and B_2 are the bases of the two subspaces. We orthonormalize this vector pair to form the projection matrix P_a. Next, we use the labeled dataset of 200 points to form the projection matrix P_b by applying algorithm 1. The entire dataset of 200 points is then projected onto P_a and P_b separately and plotted in figure 8, where the green and red points denote data from either subspace. The results not only substantiate our claim in theorem 10 but also suggest that our heuristic algorithm, which estimates a principal vector pair between the subspaces from the sampled dataset in order to form the projection matrix, is a good approximation.

2. Real Data: Here we use the Extended Yale dataset B for analysis (see section 5.1.5 for details). Since we are interested in the projection of two class data in this experimental setup, we randomly choose 4 different pairs of classes from the dataset and use the labeled data from each pair to generate the corresponding two dimensional projection matrix using algorithm 1. The resulting projected data for the 4 pairs of classes can be seen in figure 9. As is evident from the figure, the projected two class data for each pair approximately lie along two different lines.



Figure 9: Four different pairs of classes from the Extended Yale dataset B projected onto two dimensional subspaces under the supervised setting. See section 5.2.1 for details.

5.2.2 Projection of K class data

In order to evaluate theorem 12, we perform a classification experiment on a real dataset after projecting the data vectors using different dimensionality reduction techniques, including our proposed method. We use all 38 classes of the Extended Yale dataset B for evaluation (see section 5.1.5 for details) with a fixed 50%-50% train-test split. Further, since the classes have independent subspace structure, we make use of sparse coding [19] for classification (see that paper for technical details). We compare our approach against PCA, LDA and random projection (RP). The results are shown in table 2. Since our supervised method is randomized, we perform 50 runs of computing the projection matrix using algorithm 1 and then perform the evaluation; the same holds for RP, for which we generate 50 different random matrices before classification. Since all the other methods are deterministic, there is no need for multiple runs. Evidently, our proposed supervised approach yields the best classification accuracy using just 2K = 76 projection vectors.

6 Conclusion

In this paper, we presented two randomized algorithms for dimensionality reduction that preserve the independent subspace structure of datasets. First, we showed that random projection matrices preserve this structure, and we additionally derived a bound on the minimum number of random vectors required for this to hold (corollary 9). We conclude that this number depends logarithmically on the number of data samples. All of the above arguments also hold under the disjoint subspace setting. As a side analysis, we also showed that while cosine values (lemma 5) are preserved under random projection, inner products (equation (10)) between vectors are not well preserved in general. These results were confirmed by our empirical analysis. Second, we proposed a supervised subspace learning algorithm for dimensionality reduction. As a theoretical analysis, we showed that for K independent subspaces, 2K projection vectors are sufficient for subspace structure preservation (theorem 12). This result is motivated by our observation that for any two disjoint subspaces of arbitrary dimensionality, there exists a two dimensional subspace such that, after projection, both subspaces collapse to just two lines (theorem 10). Further, we proposed a heuristic algorithm (algorithm 1) that tries to exploit these properties of independent subspaces for learning a projection matrix for dimensionality reduction. However, there may be better ways of taking advantage of these properties, which we leave as future work.

Table 2: Classification accuracy on the Extended Yale dataset B under the supervised setting. See section 5.2.2 for details.

    Method   Ours           PCA     LDA     RP
    dim      76             90      37      90
    acc      95.57 ± 0.22   93.62   86.81   93.95 ± 0.19

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63:161–182, 2006.

[3] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. In 15th International Conference on Algorithmic Learning Theory (ALT 04), pages 79–94, 2004.

[4] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9:51–77, 2009.

[5] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for k-means clustering. CoRR, abs/1011.4632, 2010.

[6] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.

[7] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2004.

[8] Chinmay Hegde, Michael B. Wakin, and Richard G. Baraniuk. Random projections for manifold learning. In NIPS. Curran Associates, Inc., 2007.

[9] Jeffrey Ho, Ming-Hsuan Yang, Jongwoo Lim, Kuang-Chih Lee, and David Kriegman. Clustering appearances of objects under varying illumination conditions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages I-11–I-18, 2003.

[10] Wei Hong, John Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. IEEE Transactions on Image Processing, 15(12):3655–3671, 2006.

[11] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pages 287–296, New York, NY, USA, 2006. ACM.

[12] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.

[13] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.

[14] Gregory Shakhnarovich and Baback Moghaddam. Face recognition in subspaces. In S.Z. Li and A.K. Jain, editors, Handbook of Face Recognition, pages 141–168. Springer, 2004.

[15] Qinfeng Shi, Chunhua Shen, Rhys Hill, and Anton van den Hengel. Is margin preserved after random projection? CoRR, abs/1206.4651, 2012.

[16] S. Vempala. The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 2004.

[17] René Vidal, Stefano Soatto, Yi Ma, and Shankar Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In 42nd IEEE Conference on Decision and Control, volume 1, pages 167–172, 2003.

[18] René Vidal, Roberto Tron, and Richard Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1):85–105, 2008.

[19] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, February 2009.

[20] Allen Y. Yang, John Wright, Yi Ma, and S. Shankar Sastry. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, 110(2):212–225, 2008.
