Clustering-based Locally Linear Embedding

Kanghua Hui, Chunheng Wang
Institute of Automation, Chinese Academy of Sciences
{kanghua.hui, chunheng.wang}@ia.ac.cn

Abstract

The locally linear embedding (LLE) algorithm is considered a powerful method for nonlinear dimensionality reduction. In this paper, a new method called clustering-based locally linear embedding (CLLE) is first proposed, which reduces the high time cost of LLE while preserving the data topology. Then it is analyzed how the proposed method decreases the time complexity of LLE. Moreover, further comparison shows that, in most cases, CLLE performs better than LLE in terms of time cost, topology preservation, and classification performance on several different data sets.

1. Introduction

The locally linear embedding (LLE) algorithm [1], an unsupervised learning method that obtains low dimensional, neighborhood-preserving embeddings of high dimensional data, has been proposed as one of the effective algorithms for dimensionality reduction. It has desirable properties, such as a fast implementation [3] and the ability to process new data without rerunning the whole algorithm [4, 6]. However, finding the $k$ nearest neighbors of each point and computing the bottom eigenvectors remain time consuming for LLE. In this paper a new method called clustering-based locally linear embedding (CLLE) is presented, which combines $K$-means clustering [2] with LLE. The proposed method not only decreases the time complexity efficiently but also preserves the data topology in the low dimensional mapped space, while achieving better classification performance in the supervised setting.

2. LLE and Clustering-based LLE

In this section, the algorithms of LLE and CLLE are introduced first, and the time complexity of the two methods is then analyzed in detail. The comparison of how well the two methods preserve the data topology is described in Section 3, together with the comparison of their time cost and classification performance.

2.1. Algorithms of LLE and CLLE

LLE can be briefly summarized in three steps:
(1) Find the $k$ nearest neighbors $\Omega_i = \{x_i^1, x_i^2, \ldots, x_i^k\}$ of each point $x_i$ in the original space, $i = 1, 2, \ldots, N$.
(2) Compute the weights $w_{i,j}$ that best reconstruct each point $x_i$ from its neighbors:


$$\varepsilon(W) = \min \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} w_{i,j} x_j \Big\|^2, \qquad (1)$$

subject to $\sum_{j=1}^{N} w_{i,j} = 1$, and $w_{i,j} = 0$ if $x_j \notin \Omega_i$.

(3) Compute the $d$ dimensional embeddings $y_i$ that are best reconstructed by the weights $w_{i,j}$:

$$\Phi(Y) = \min \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} w_{i,j} y_j \Big\|^2, \qquad (2)$$

subject to $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$ and $\sum_{i=1}^{N} y_i = 0$.
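To make the three steps concrete, here is a minimal NumPy sketch of standard LLE (an illustration added for this write-up, not the authors' implementation). It uses the convention that row $i$ of $W$ holds the weights of $x_i$; the small regularization of the local Gram matrix and the function name lle_embed are our own choices.

```python
import numpy as np

def lle_embed(X, k=10, d=2, reg=1e-3):
    """Minimal LLE sketch: neighbors, reconstruction weights, bottom eigenvectors."""
    N = X.shape[0]
    # Step (1): k nearest neighbors of each point (brute-force distances, for illustration).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Step (2): weights that best reconstruct x_i from its neighbors,
    # with the sum-to-one constraint of Eq. (1).
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]            # shift neighbors to the origin
        G = Z @ Z.T                           # local Gram matrix (k x k)
        G += reg * np.trace(G) * np.eye(k)    # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, neighbors[i]] = w / w.sum()

    # Step (3): bottom d+1 eigenvectors of M = (I - W)^T (I - W),
    # discarding the constant eigenvector.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]

# Example: unroll a noisy 1-D curve embedded in 3-D.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 300))
X = np.c_[np.cos(t), np.sin(t), t] + 0.01 * rng.standard_normal((300, 3))
Y = lle_embed(X, k=10, d=2)
print(Y.shape)  # (300, 2)
```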

In step (1), LLE needs to find the $k$ nearest neighbors of each point in the data set, and in step (3) an $N \times N$ eigenvector problem has to be solved; each of these is an expensive step. In order to reduce this cost and preserve the data topology at the same time, CLLE is proposed. It can be summarized in the following steps:
(1) The data set $C$ is divided into $K$ clusters $\{C_1, C_2, \ldots, C_K\}$ by $K$-means clustering, subject to

$$C = \bigcup_{i=1}^{K} C_i, \quad \text{and} \quad C_i \cap C_j = \varnothing, \ \forall\, i, j \in \{1, 2, \ldots, K\},\ i \neq j.$$

(2) Find the $k$ nearest neighbors $\Omega_i = \{x_i^1, x_i^2, \ldots, x_i^k\}$ of each $x_i \in C_j$ within $C_j$, $i = 1, 2, \ldots, |C_j|$, $j = 1, 2, \ldots, K$, where $|C_j|$ denotes the number of points belonging to the cluster $C_j$.
(3) Compute the weight matrix $W$:

$$\varepsilon(W) = \min \sum_{i=1}^{N} \Big\| x_i - \sum_{h=1}^{N} w_{i,h} x_h \Big\|^2, \qquad (3)$$

subject to $\sum_{h=1}^{N} w_{i,h} = 1$, and $w_{i,h} = 0$ if $x_i \in C_j$ but $x_h \notin \Omega_i$.

(4) Compute the $d$ dimensional embeddings:

$$\Phi(Y) = \min \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} w_{i,j} y_j \Big\|^2, \qquad (4)$$

subject to $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$ and $\sum_{i=1}^{N} y_i = 0$.
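Under the same caveats, the four CLLE steps amount to running the LLE machinery inside each cluster. The sketch below assumes the lle_embed function from the previous sketch, uses scikit-learn's KMeans for step (1), and assumes every cluster contains more than $k$ points; clle_embed is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

def clle_embed(X, n_clusters=5, k=10, d=2):
    """CLLE sketch: K-means partition, then the LLE steps restricted to each cluster.

    Because neighbors are never taken across cluster borders, W (and M) is block
    diagonal, so each block can be embedded independently (see Section 2.2).
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    Y = np.zeros((X.shape[0], d))
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)      # indices of the points in cluster C_c
        Y[idx] = lle_embed(X[idx], k=k, d=d)   # steps (2)-(4) within the cluster
    return Y, labels
```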

Although step (4) of CLLE looks the same as step (3) of LLE, the eigenvector computation in step (4) of CLLE is carried out quite differently and has a lower time complexity, as analyzed below.

2.2. Time complexity analysis

Each step of LLE has the following time complexity. In step (1), finding the $k$ nearest neighbors costs $O(kDN^2)$. In step (2), computing the weight matrix costs $O(DNk^3)$. Finally, in step (3), computing the $d$ bottom eigenvectors costs $O(dN^2)$. Correspondingly, the time complexity of CLLE in its four steps is $O(KDN)$ for the $K$-means clustering, $O(\frac{k}{K}DN^2)$ for finding the $k$ nearest neighbors within one cluster, which contains approximately $\frac{N}{K}$ points, $O(DNk^3)$ for computing the weight matrix, and $O(\frac{d}{K}N^2)$ for computing the $d$ bottom eigenvectors, respectively. In step (4), the eigenvector computation, whose time complexity is lower than that of the eigenvector computation in step (3) of LLE, proceeds as follows.

First of all, the data set is given a new ordering $(x_1, x_2, \ldots, x_N)$ such that

$$C = (x_1, x_2, \ldots, x_N), \quad C_1 = \big(x_1, \ldots, x_{|C_1|}\big), \quad C_2 = \big(x_{|C_1|+1}, \ldots, x_{|C_1|+|C_2|}\big), \quad \ldots, \quad C_K = \big(x_{N-|C_K|+1}, \ldots, x_N\big),$$

and for every $x_i \in C_j$ its $k$ nearest neighbors are also in $C_j$, where $C_j$ is the $j$th cluster of $C$.

Next, the weight matrix $W$, the $i$th column of which holds the weights of $x_i$, can be obtained:

$$W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,N} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N,1} & w_{N,2} & \cdots & w_{N,N} \end{pmatrix} = \begin{pmatrix} W_1 & & & \\ & W_2 & & \\ & & \ddots & \\ & & & W_K \end{pmatrix},$$

where the size of $W_i$ is $|C_i| \times |C_i|$, $i = 1, 2, \ldots, K$, and accordingly,

$$M = (I - W)(I - W)^T = \begin{pmatrix} M_1 & & & \\ & M_2 & & \\ & & \ddots & \\ & & & M_K \end{pmatrix},$$

where $M_i = (I_i - W_i)(I_i - W_i)^T$, and the sizes of $M_i$ and $I_i$ are both $|C_i| \times |C_i|$ as well.

Then, computing the $d$ eigenvectors $Y$ of $M$, where

$$Y = \begin{pmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,d} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N,1} & y_{N,2} & \cdots & y_{N,d} \end{pmatrix},$$

is equivalent to computing the $d$ eigenvectors $Y_i$ of each $M_i$, where

$$Y_i = \begin{pmatrix} y_{N_{i-1}+1,1} & y_{N_{i-1}+1,2} & \cdots & y_{N_{i-1}+1,d} \\ y_{N_{i-1}+2,1} & y_{N_{i-1}+2,2} & \cdots & y_{N_{i-1}+2,d} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N_{i-1}+|C_i|,1} & y_{N_{i-1}+|C_i|,2} & \cdots & y_{N_{i-1}+|C_i|,d} \end{pmatrix},$$

$N_{i-1} = |C_1| + |C_2| + \cdots + |C_{i-1}|$, and usually $|C_i| \approx \frac{N}{K} \gg d$, $i = 1, 2, \ldots, K$. As a result, the $N \times N$ problem in step (3) of LLE is reduced to an $\frac{N}{K} \times \frac{N}{K}$ problem in step (4) of CLLE. Hence, the time complexity of CLLE in step (4) is $O(\frac{d}{K}N^2)$. Note that the $i$th row of $Y$ is the image of the sample $x_i$ in the mapped space.
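The key fact used here, that the eigenproblem of a block diagonal matrix decomposes into the eigenproblems of its blocks, can be checked numerically with a few lines of NumPy (an illustration added here, not part of the original paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_sym(n):
    """An arbitrary symmetric matrix standing in for one block M_i."""
    A = rng.standard_normal((n, n))
    return A @ A.T

M1, M2 = random_sym(4), random_sym(3)

# Block-diagonal M, as produced when neighbors never cross cluster borders.
M = np.block([[M1, np.zeros((4, 3))],
              [np.zeros((3, 4)), M2]])

# The eigenvalues of M are exactly the union of the blocks' eigenvalues,
# so the bottom eigenvectors can be found block by block.
full = np.sort(np.linalg.eigvalsh(M))
blocks = np.sort(np.concatenate([np.linalg.eigvalsh(M1), np.linalg.eigvalsh(M2)]))
print(np.allclose(full, blocks))  # True
```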

Table 1. Data sets used in this paper

Data                  N (T)        D
Iris                  150 (30)     4
Wine                  178 (36)     13
MNIST digits 0,1      2400 (400)   784
MNIST digits 3,6      2400 (400)   784
MNIST digits 1,5,8    3600 (600)   784

3. Experiments

The data sets used in this paper are listed in Table 1, where N denotes the total number of train and test samples in each data set, T is the number of test samples (i.e. N - T is the number of train samples), and D is the dimensionality of each sample. The test samples are chosen arbitrarily from each data set, and the data for each digit are the first 1200 samples of that digit from MNIST.

3.1. Time cost

From Section 2, the time cost of LLE is $O(kDN^2) + O(DNk^3) + O(dN^2)$, and that of CLLE is $O(KDN) + O(\frac{k}{K}DN^2) + O(DNk^3) + O(\frac{d}{K}N^2)$. The time costs clearly depend on the parameters $k$, $d$ and $K$. Moreover, since usually $N \gg K$, the time cost of CLLE is approximately $\frac{1}{K}$ of that of LLE. As shown in Figure 1, the data set of MNIST digits 0 and 1 is used to compare the costs of the two methods, and CLLE is clearly faster than LLE. In this experiment, $k$ is fixed to 20, and $K$ varies from 20 to 30 as the number of samples grows from 1,000 to 6,000.
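To see where the factor of roughly $\frac{1}{K}$ comes from, one can compare the dominant $N^2$ terms of the two costs (a back-of-the-envelope step added here, assuming $N$ is large enough that the $N^2$ terms dominate):

$$\frac{\mathrm{cost}(\mathrm{CLLE})}{\mathrm{cost}(\mathrm{LLE})} \approx \frac{\frac{k}{K}DN^2 + \frac{d}{K}N^2}{kDN^2 + dN^2} = \frac{\frac{1}{K}(kD + d)N^2}{(kD + d)N^2} = \frac{1}{K}.$$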

Figure 1. Comparison of costs between LLE and CLLE.

3.2. Topology preservation

In order to evaluate the topology preservation of LLE and CLLE, two evaluation criteria, Spearman's rho ($\rho_{Sp}$) and the procrustes measure ($Procr$), are adopted. Spearman's rho reflects how well the data topology is preserved in the low dimensional mapped space; the closer the value of $\rho_{Sp}$ is to 1, the better the data topology is preserved. The procrustes measure indicates how well a linear transformation (translation, reflection, orthogonal rotation, and scaling) of the points in the mapped space conforms to the points in the corresponding high dimensional space; the smaller the value of $Procr$, the better the fit. All of the train data sets are mapped into 2 dimensional space by LLE and CLLE respectively. Figure 2 shows two examples mapped by LLE and CLLE. Furthermore, from Table 2 it can be found that CLLE obtains better topology preservation than LLE, while the values of the two methods differ only slightly, which means that CLLE preserves approximately the same data topology as LLE. The better values are shown in bold.
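As an illustration of how such criteria can be computed (a sketch using SciPy added for this write-up; the paper does not specify its implementation, so the exact definitions may differ), Spearman's rho can be taken over the rankings of pairwise distances, and the procrustes disparity over the aligned point sets, with the low dimensional embedding zero-padded to the original dimensionality:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist
from scipy.spatial import procrustes

def topology_scores(X_high, Y_low):
    """Spearman's rho over pairwise-distance rankings, plus procrustes disparity."""
    rho, _ = spearmanr(pdist(X_high), pdist(Y_low))
    # scipy's procrustes() needs equal shapes, so pad the low-dimensional
    # embedding with zero columns (an implementation choice here; assumes D > d).
    pad = np.zeros((Y_low.shape[0], X_high.shape[1] - Y_low.shape[1]))
    _, _, disparity = procrustes(X_high, np.hstack([Y_low, pad]))
    return rho, disparity
```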

3.3. Classification performance

In order to estimate the classification performance on the mapped data (train data), supervised LLE (SLLE) [5] and a supervised CLLE (SCLLE), in which all of the points of one cluster belong to the same class, are used first. Then the new data (test data) are mapped into the same low dimensional space by a linear generalization method (LG), which is used for both LLE and CLLE. Note that the generalization procedure for the test data does not use label information, i.e. the test points are unlabeled.

LG [6] of LLE: To find the point $y_{N+1}$ in the mapped space corresponding to $x_{N+1}$, first, the $k$ nearest neighbors $\Omega_{N+1} = \{x_{N+1}^1, x_{N+1}^2, \ldots, x_{N+1}^k\}$ of $x_{N+1}$ in the high dimensional space are found. Second, the weights $w_{N+1}$ that best reconstruct $x_{N+1}$ are computed by Eq. (1) with the sum-to-one constraint $\sum_{j=1}^{N} w_{N+1,j} = 1$. In the end, the new output is obtained as $y_{N+1} = \sum_{j=1}^{N} w_{N+1,j}\, y_j$, where $y_j$ corresponds to the $j$th nearest neighbor of $x_{N+1}$.

LG of CLLE: To find $y_{N+1}$, first, determine the cluster $C_i$ that satisfies $d(x_{N+1}, C_i) = \min_{l=1,\ldots,K} d(x_{N+1}, C_l)$. Second, the $k$ nearest neighbors $\Omega_{N+1} = \{x_{N+1}^1, x_{N+1}^2, \ldots, x_{N+1}^k\}$ of $x_{N+1}$ in $C_i$ are found. Third, the weights $w_{N+1}$ that best reconstruct $x_{N+1}$ are computed by Eq. (3). Finally, the new output is found as $y_{N+1} = \sum_{j} w_{N+1,j}\, y_j$, where $y_j$ corresponds to the $j$th nearest neighbor of $x_{N+1}$ in $C_i$.
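A minimal sketch of this generalization step, assuming training points X_train with their embedding Y_train produced as in Section 2 (the function name lg_map and the regularization constant are our own):

```python
import numpy as np

def lg_map(x_new, X_train, Y_train, k=10, reg=1e-3):
    """Map a new point into the existing embedding by the LG step:
    reconstruct x_new from its k nearest training points and reuse the weights."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nbrs = np.argsort(dists)[:k]               # k nearest neighbors of x_new
    Z = X_train[nbrs] - x_new                  # local coordinates
    G = Z @ Z.T
    G += reg * np.trace(G) * np.eye(k)         # regularize the local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                               # sum-to-one constraint
    return w @ Y_train[nbrs]                   # y_new = sum_j w_j * y_j

# For CLLE, the same routine is applied after restricting X_train and Y_train
# to the cluster whose center is closest to x_new.
```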

After all of the test data are mapped into 2 dimensional space by the LG of LLE and of CLLE respectively, the mapped test data are classified with a K nearest neighbor classifier (KNN), with K = 1, 3, and 5; in other words, each test point finds its K nearest neighbors among the train data mapped by SLLE and SCLLE respectively. The experimental results are shown in Table 3. One can see that in most cases CLLE reaches better classification performance than LLE. The better values are shown in bold.
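This evaluation step can be sketched with scikit-learn's KNeighborsClassifier (variable names are illustrative; the paper does not specify an implementation):

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_error_rate(Y_train, labels_train, Y_test, labels_test, n_neighbors=1):
    """Classify the mapped test points with KNN fitted on the mapped train points."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(Y_train, labels_train)
    return 100.0 * (1.0 - clf.score(Y_test, labels_test))  # test error rate in %

# e.g. evaluated for n_neighbors in (1, 3, 5), as in Table 3.
```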

Figure 2. 2000 samples (digits 0 and 1) mapped into 2 dimensional space by LLE (a) and CLLE (b).

Table 2. Spearman's rho ($\rho_{Sp}$) and procrustes measure ($Procr$) of LLE and CLLE for the data sets

Data       $\rho_{Sp}$ (LLE)   $\rho_{Sp}$ (CLLE)   $Procr$ (LLE)   $Procr$ (CLLE)
Iris       0.25                0.06                 0.61            0.47
Wine       0.30                0.07                 0.50            0.48
0, 1       0.44                0.42                 0.86            0.81
3, 6       0.10                0.12                 0.88            0.84
1, 5, 8    -0.17               -0.26                0.81            0.79

Table 3. Test error rate (%) of LLE and CLLE

Data       K   KNN (LLE)   KNN (CLLE)
Iris       1   26.67       26.67
           3   26.67       23.33
           5   30.00       16.67
Wine       1   38.89       16.67
           3   25.00       22.22
           5   30.56       22.22
0, 1       1   11.00       5.50
           3   9.50        5.00
           5   8.00        4.25
3, 6       1   24.75       12.75
           3   22.00       8.50
           5   20.00       6.00
1, 5, 8    1   26.33       21.83
           3   24.17       21.17
           5   21.33       20.17

4. Conclusions

In this paper, a new method called CLLE is proposed, and its time complexity is analyzed. Furthermore, LLE and CLLE are compared according to several evaluation criteria. As shown in Section 3, in most cases CLLE achieves better results than LLE in terms of time cost, topology preservation, and classification performance on several different data sets. In future work, an unfixed number of nearest neighbors for each point will be investigated.

5. Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grants No. 60602031 and No. 60621001.

6. References

[1] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290 (2000) 2323-2326.
[2] J.B. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, 281-297.
[3] O. Kouropteva, O. Okun, and M. Pietikäinen. Selection of the optimal parameter value for the locally linear embedding algorithm. Proc. of the 1st International Conference on Fuzzy Systems and Knowledge Discovery, Singapore, 2002, 359-363.
[4] O. Kouropteva, O. Okun, and M. Pietikäinen. Incremental locally linear embedding. Pattern Recognition, 38(10), 2005, 1764-1767.
[5] O. Kouropteva, O. Okun, A. Hadid, M. Soriano, S. Marcos, and M. Pietikäinen. Beyond locally linear embedding algorithm. Technical Report MVG-01-2002, Machine Vision Group, University of Oulu, Finland, 2002.
[6] L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of nonlinear manifolds. Journal of Machine Learning Research, 4 (2003) 119-155.