Dimensionality Reduction for Classification through Visualisation Using L1SNE

Lennon V. Cook and Junbin Gao

School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia
[email protected], [email protected]

Abstract. Dimensionality reduction algorithms have wide precedent for use in preprocessing for classification problems. This paper presents a new algorithm, based on a modification of Stochastic Neighbour Embedding and t-Distributed SNE that uses the Laplacian distribution in place of, respectively, the Gaussian distribution and a mismatched pair of the Gaussian distribution and Student's t-distribution. Experimental results are presented to demonstrate that this modification yields an improvement.
1 Introduction
Recent years have seen a large increase in studies on dimensionality reduction (DR) algorithms, the goal of which is to find a lower-dimensional representation of high-dimensional data [20]. A common motive is to overcome the "curse of dimensionality" [20,6]: the complexity of many algorithms is bound by the dimension of the data, and can become intractable for the many real-world datasets whose dimensionality is quite high. Application areas include manifold learning [11], pattern recognition [13], data mining [15], data classification [2] and data visualisation [19,8]. Originally, the focus was on algorithms such as Principal Component Analysis (PCA) [10] and Linear Discriminant Analysis (LDA) [4], which assume linearity - that is, that the low-dimensional vectors are a linear combination of the original high-dimensional vectors. More recently, however, much improvement has been gained by relaxing this constraint. The resulting Non-Linear Dimensionality Reduction (NLDR) algorithms include, for example, Locally Linear Embedding (LLE) [16], Laplacian Eigenmaps (LE) [1], Isometric Mapping (Isomap) [17], Local Tangent Space Alignment (LTSA) [21], and the Gaussian Process Latent Variable Model (GPLVM) [12]. The Stochastic Neighbour Embedding (SNE) algorithm is one of these nonlinear dimensionality reduction techniques [7], which considers the probability that any two points will be neighbours. A major justification of this technique is
that it can be readily extended to more complex relationships between the high- and low-dimensional data than the strictly one-to-one mapping of many earlier algorithms. This is presented as being useful in the classification of documents based on the words they contain, where a strict one-to-one relationship fails to account for a single word having multiple meanings. However, SNE suffers from the so-called "crowding problem" [19], where distances between points in the high-dimensional space are reflected less accurately in the low-dimensional space as they increase. To counteract this, the authors of [19] propose that mismatched distributions for the neighbouring probabilities be used, suggesting a heavier-tailed distribution for the lower-dimensional vectors. They choose the Student-t distribution for this, and their experiments demonstrate that the new algorithm, tSNE, outperforms most existing nonlinear dimensionality reduction algorithms. Using the Student-t distribution, which is a heavy-tailed generalisation of the Gaussian distribution used by SNE, increases the robustness of the model against the crowding problem. The same goal can be achieved by instead using the centered Laplacian distribution (also called the L1 distribution, or least absolute deviance). The L1 distribution is much less sensitive to outliers than the Gaussian density and has only one tunable parameter, while the Student-t distribution is determined by two parameters (the degrees of freedom and the scale parameter). The approach of using the L1 distribution originates from the LASSO [18], and has attracted interest in machine learning [14] and statistics. Beyond robustness against outliers, the L1 distribution assumption is also used as a penalty/regularisation term on model parameters to enforce sparsity, or parameter/feature selection, as in sparse PCA [9,22] and logistic regression [14]. A recent paper [3] gives a detailed analysis of the generalised distribution which includes L1 as a special case.

This paper presents a new dimensionality reduction algorithm for data visualisation to aid classification, based on using the generalised L1 distribution in the classical SNE. Like tSNE, the new algorithm differs in the optimisation problem which is solved - that is, this paper argues that the technique provides a better match between the formal optimisation problem and the abstract research goal, rather than presenting a better way of finding a solution to the same optimisation problem.

The paper is organised as follows. Section 2 presents a revision of the existing background work, focusing on two algorithms for dimensionality reduction. Section 3 presents the proposed new algorithm. Following this, experiments for classification on several datasets are conducted in Section 4 and the results are analysed and compared against the results of the existing algorithms. Finally, the conclusions drawn from these experiments are presented.

For the remainder of the paper, $X = \{x_i\}_{i=1}^{N}$ is the set of $D$-dimensional input vectors, and $Y = \{y_i\}_{i=1}^{N}$ is the set of $d$-dimensional reduced vectors. It can be assumed from the direct goal of dimensionality reduction that $d \ll D$.
2 SNE and t-SNE

2.1 Formulation of SNE
The algorithm works by determining which pairs of points should be matched, in terms of their Euclidean distance, in the high- and low-dimensional spaces respectively. The matching is based on the normalized discrete distribution determined by their Gaussian kernels in the two spaces. For the high-dimensional input space, the probability that any $x_i$ and $x_j$ are neighboured is:
$$p_{ij} = \frac{\exp\left(-\frac{1}{2\sigma_i^2}\|x_i - x_j\|^2\right)}{\sum_{k \neq i}\exp\left(-\frac{1}{2\sigma_i^2}\|x_i - x_k\|^2\right)} \qquad (2.1)$$
where $\|x\|$ denotes the Euclidean norm of $x$. $P_i = \{p_{ij} \mid \forall j\}$ is the Gaussian-distributed set of all such probabilities for a particular $x_i$. The low-dimensional neighbouring probabilities are calculated in the same way:
$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i}\exp\left(-\|y_i - y_k\|^2\right)} \qquad (2.2)$$
and likewise $Q_i = \{q_{ij} \mid \forall j\}$. SNE aims to select the $y_i$ so that each $Q_i$ matches its associated $P_i$ as closely as possible. This leads to the cost function:
$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} \qquad (2.3)$$
where $KL(P_i \,\|\, Q_i)$ is the Kullback-Leibler divergence between the distributions.
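To make the formulation concrete, the following is a minimal NumPy sketch of equations (2.1)-(2.3); it is not the authors' implementation, the function names are ours, and the single fixed bandwidth `sigma` is a simplification (SNE normally tunes each $\sigma_i$ from a target perplexity).

```python
import numpy as np

def sne_probabilities(X, sigma=1.0):
    """p_ij of eq. (2.1), row-normalised; a single fixed bandwidth is assumed."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # squared Euclidean distances
    P = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                                 # a point is not its own neighbour
    return P / P.sum(axis=1, keepdims=True)

def sne_low_dim_probabilities(Y):
    """q_ij of eq. (2.2): the same construction in the low-dimensional space, unit bandwidth."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = np.exp(-D2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """Sum of Kullback-Leibler divergences, eq. (2.3)."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```

With these helpers, the SNE objective for a candidate embedding `Y` is simply `kl_cost(sne_probabilities(X), sne_low_dim_probabilities(Y))`.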
2.2 t-Distributed SNE
The t-distributed SNE algorithm was originally presented in [19]. It is formulated as a modification to SNE in which the Gaussian distribution (2.1) is retained in the high-dimensional space, while the heavier-tailed Student-t distribution is used for the low-dimensional space. This leaves the calculation of each $p_{ij}$ identical to that in SNE. However, the calculation of $q_{ij}$ changes to:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i}\left(1 + \|y_i - y_k\|^2\right)^{-1}} \qquad (2.4)$$
In this formulation (2.4), as in [19], the degrees-of-freedom parameter of the Student-t distribution is assumed to be 1. The cost function, as in SNE, is given by the Kullback-Leibler divergences (2.3).
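Relative to the SNE sketch above, only the low-dimensional probabilities change. A hedged rendering of (2.4) as written here follows; the function name is ours, and the per-point normalisation simply mirrors (2.2).

```python
def tsne_low_dim_probabilities(Y):
    """q_ij of eq. (2.4): Student-t kernel with one degree of freedom,
    normalised per point as in the SNE sketch above."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + D2)          # heavy-tailed kernel replaces the Gaussian
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)
```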
3 Laplacian-Distributed (L1) SNE
SNE's $P_i$ and $Q_i$ distributions measure the probability of points being neighboured. The difference between them is not that they measure this on different data, but rather that they measure it on different representations of the same data. It therefore makes sense that the particular probability distributions should be identical. This is at odds with tSNE's use of a different distribution for each $Q_i$ than for its corresponding $P_i$, and suggests that the crowding problem it seeks to solve is caused by an inappropriate choice of distribution in the first place. The choice of a Gaussian distribution in SNE is never justified by [7], but simply taken as a default. [5] notes that this is a common default choice, but suggests that it is not necessarily justified by statistical theory. It seems sensible, then, that the argument presented in [5] can apply here, and hence that the Laplacian (L1) distribution may be a better choice than the Gaussian for the neighbourhood probability model of SNE. Therefore, we define:
$$p_{i|j} = \exp\left(-\frac{\|x_i - x_j\|_1}{2\sigma_i^2}\right) \qquad (3.1)$$
where $\|x\|_1 = |x_1| + |x_2| + |x_3| + \dots + |x_D|$ is the L1 norm of $x$, and $\sigma_i^2$ is the variance of $x_i$. In order to ensure symmetry, $p_{ij} = p_{ji}$, and hence simplify the cost function, we then define:
$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \qquad (3.2)$$
with $p_{ii} = 0$. And similarly in the low-dimensional space:
$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|_1\right)}{\sum_{k \neq l}\exp\left(-\|y_k - y_l\|_1\right)} \qquad (3.3)$$
with $q_{ii} = 0$. Again, we seek to minimise the cost function (2.3) given by the Kullback-Leibler divergences between $P_i$ and $Q_i$. Like the authors of [19], we do this by iterative gradient descent. Since the only distances $d_{ij} = \|y_i - y_j\|_1$ affected by a change in any particular $y_i$ are $d_{ij}$ and $d_{ji}$ for all $j$, we have:
$$\frac{\partial C}{\partial y_i} = 2\sum_j \frac{\partial C}{\partial d_{ij}}\,\mathrm{sgn}(y_i - y_j)$$
where $\mathrm{sgn}(y) = \left(\frac{y_1}{|y_1|}, \frac{y_2}{|y_2|}, \dots, \frac{y_d}{|y_d|}\right)$ is the sign vector of $y$.
From (2.3), we can find:
$$\frac{\partial C}{\partial d_{ij}} = -\sum_{k \neq l} p_{kl}\,\frac{\partial \log q_{kl}}{\partial d_{ij}} = p_{ij} - q_{ij}\sum_{k \neq l} p_{kl} = p_{ij} - q_{ij}$$
$$\therefore\quad \frac{\partial C}{\partial y_i} = 2\sum_j \left(p_{ij} - q_{ij}\right)\mathrm{sgn}(y_i - y_j) \qquad (3.4)$$
This form has roughly equivalent computational complexity to the gradient of tSNE. Also, as mentioned in the introduction, it lacks the degrees-of-freedom parameter which tSNE requires to be tuned to the particular dataset.
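For readers who prefer code to equations, the following is a minimal NumPy sketch of the procedure implied by (3.1)-(3.4); it is not the authors' implementation. The fixed bandwidth, learning rate, iteration count and the explicit row normalisation of (3.1) are our assumptions, and no momentum or adaptive step size is used.

```python
import numpy as np

def l1sne(X, d=2, sigma=1.0, lr=100.0, n_iter=1000, rng=None):
    """Minimal L1SNE sketch: eqs. (3.1)-(3.4) optimised by plain gradient descent.
    sigma, lr and n_iter are illustrative assumptions, not values from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    # High-dimensional probabilities: eq. (3.1), symmetrised by eq. (3.2).
    D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)      # pairwise L1 distances
    P = np.exp(-D1 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)                    # assumed normaliser (implicit in (3.1))
    P = (P + P.T) / (2.0 * n)                                # eq. (3.2); P now sums to one

    # Random initial solution, as used in the experiments of Section 4.
    Y = 1e-4 * rng.standard_normal((n, d))

    for _ in range(n_iter):
        # Low-dimensional probabilities: eq. (3.3), normalised over all pairs.
        Dy = np.abs(Y[:, None, :] - Y[None, :, :]).sum(-1)
        Q = np.exp(-Dy)
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum()

        # Gradient (3.4): dC/dy_i = 2 * sum_j (p_ij - q_ij) sgn(y_i - y_j).
        S = np.sign(Y[:, None, :] - Y[None, :, :])           # sign vectors, shape (n, n, d)
        grad = 2.0 * np.einsum('ij,ijd->id', P - Q, S)
        Y -= lr * grad                                       # fixed-step gradient descent
    return Y
```

In practice a momentum term and adaptive gains, as used for tSNE in [19], would be natural additions to this bare-bones descent.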
4 Experimental Results
Experiments were run to compare tSNE and L1SNE on three datasets with class information, each of which is detailed in the following subsections. Both algorithms are optimised using the iterative gradient descent method mentioned previously. In all cases, the initial solution is given randomly, and the optimisation is terminated after a fixed number of iterations. The implementation of tSNE is from Laurens van der Maaten's MATLAB Toolbox for Dimensionality Reduction (http://ict.ewi.tudelft.nl/~lvandermaaten/Matlab_Toolbox_for_Dimensionality_Reduction.html).

For each experiment, graphs are presented to show the results. These are colourised to show the true class of each point, which is not made available to the algorithms as they run. Quantitative errors are also provided for each experiment, calculated as a modified KNN error called the k-point local clustering error (KLCE). This measure avoids KNN's need for a back-projection by which new high-dimensional points would be mapped individually onto a precalculated low-dimensional space. KLCE considers each low-dimensional point in turn and determines what proportion of its k nearest points are of a different true class to the point under consideration. The final error is the average of these proportions across the reduced dataset.
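A minimal sketch of the KLCE measure as described above, assuming Euclidean nearest neighbours in the embedding; the function name and the default value of k are ours, since the paper does not state the k used.

```python
import numpy as np

def klce(Y, labels, k=10):
    """k-point local clustering error: for each embedded point, the fraction of its
    k nearest neighbours (Euclidean, excluding itself) with a different true label,
    averaged over the dataset. k=10 is an illustrative choice."""
    Y = np.asarray(Y)
    labels = np.asarray(labels)
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    np.fill_diagonal(D, np.inf)                      # exclude the point itself
    nn = np.argsort(D, axis=1)[:, :k]                # indices of the k nearest neighbours
    mismatches = labels[nn] != labels[:, None]
    return mismatches.mean()
```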
4.1 Handwritten Digits
Images of handwritten digits 0-9 were selected randomly from the MNIST dataset, with 1,500 images taken per digit. The results are presented on a two-dimensional scatterplot, colourised to reflect the digit each point represents. Results are provided for three different random subsets to demonstrate repeatability.

It can be seen in Figure 1 that both L1SNE and tSNE successfully group the data into tight clusters in the lower-dimensional space according to the digit they represent.
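The sampling step can be sketched as follows, assuming scikit-learn's OpenML copy of MNIST as the data source; the paper does not specify how its random subsets were drawn, so this is an illustrative reconstruction only.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def mnist_subset(per_digit=1500, seed=0):
    """Draw per_digit random images of each digit 0-9 from MNIST (assumed data source)."""
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    y = y.astype(int)
    rng = np.random.default_rng(seed)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == d), per_digit, replace=False)
                          for d in range(10)])
    return X[idx] / 255.0, y[idx]
```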
Fig. 1. Handwritten Digits Experiments: (a) tSNE Experiment 1, (b) L1SNE Experiment 1, (c) tSNE Experiment 2, (d) L1SNE Experiment 2, (e) tSNE Experiment 3, (f) L1SNE Experiment 3 [two-dimensional scatterplots not reproduced]

Table 1. KLCE Errors for Handwritten Digits Experiments

Experiment   tSNE     L1SNE
1            0.0723   0.0899
2            0.0731   0.0937
3            0.0847   0.1043
Fig. 2. Photographs of Faces: (a) tSNE (Lights labels), (b) L1SNE (Lights labels), (c) tSNE (1st Poses labels), (d) L1SNE (1st Poses labels), (e) tSNE (2nd Poses labels), (f) L1SNE (2nd Poses labels) [three-dimensional scatterplots not reproduced]
L1SNE does not spread the distinct clusters as clearly from one another as does tSNE; however, it shows less tendency for small groups of data to 'wander' between clusters. In the cases where it does do this, the offending group tends to appear far nearer to its proper cluster, and visibly smaller in number, than in tSNE. L1SNE's tendency to place the clusters much closer together in the space appears to be the cause of its slightly higher error rate (Table 1): points near the edge of a cluster have many close points which belong to a neighbouring cluster. If the clusters were spread further from each other, as in
tSNE, then points deeper in the same cluster would be closer than points on the nearest edge of a nearby cluster.
4.2 Photographs of Faces
This experiment is designed to show utility in reducing to more than three dimensions, as well as when multiple, independent classification schemes are available for the data. The data is taken from the Faces dataset, which contains photographs of faces in varying poses and under changed lighting. Each datapoint is the pixel data from a 64-by-64 pixel image, giving 698 4096-dimensional points. This dataset comes with three sets of labels, two representing the pose of the subject and one representing the lighting of the photograph. Three experiments were performed to test the utility of reduction to different low-dimensional spaces. Data are reduced by each algorithm to four dimensions.

Since the labelling information for this dataset is continuous rather than discrete, KLCE errors are unavailable for this experiment. Instead, the reduced data is projected onto a 3-dimensional space using Principal Component Analysis [10] so that it can be graphed for visual inspection. The results in Figure 2 suggest that classification by L1SNE would be more accurate on two of the three sets of labels. Once again, it can be seen that L1SNE's primary drawback is a failure to clearly demarcate the boundaries of each class of points, but that it otherwise shows improvement over the existing algorithm.
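The inspection step can be sketched as a plain SVD-based PCA applied to the 4-dimensional embedding; this is an illustrative rendering under our own assumptions, not necessarily the PCA implementation used to produce the figures.

```python
import numpy as np

def pca_project(Y, k=3):
    """Project an embedding Y (n x d) onto its top-k principal components for plotting."""
    Yc = Y - Y.mean(axis=0)                        # centre the embedding
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T                           # n x k coordinates for a 3-D scatterplot
```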
5 Conclusions
In these experiments, L1SNE shows improvement over its predecessors. It shows particular strength in the case where the reduced dimension is greater than three, and where there are multiple, independent ways to classify the data. The primary identifiable issue which could interfere with automatic classification of data is that it fails to separate the classes of points sufficiently far from each other, so that points on the very edge of a class cluster may appear to belong to the neighbouring cluster. This could be overcome, for example, by partially human-assisted classification in ambiguous edge cases.

The gradient descent method used here for optimisation is rather trivial, as are the random initial solution and the fixed-iterations termination condition. Future research may look for improved techniques in all of these areas, and this may do much to improve the results of all of the algorithms tested here. However, since these issues were identical across both algorithms, both could reasonably be expected to improve by similar margins. The experiments conducted argue that the L1SNE cost function has a minimum which is closer to the abstract "ideal" solution, regardless of the optimisation method used to search for it. The evidence then suggests that, of these two, L1SNE would continue to give the better results under such improvements, while maintaining relatively low computational complexity compared to the original SNE algorithm.
References

1. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
2. Buchala, S., Davey, N., Frank, R.J., Gale, T.M.: Dimensionality reduction of face images for gender classification (2004)
3. Caron, F., Doucet, A.: Sparse Bayesian nonparametric regression, pp. 88–95 (2008)
4. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188 (1936)
5. Gao, J.: Robust L1 principal component analysis and its Bayesian variational inference. Neural Computation 20(2), 555–572 (2008)
6. Guo, Y., Kwan, P.W.H., Hou, K.X.: Visualization of protein structure relationships using constrained twin kernel embedding (2008)
7. Hinton, G., Roweis, S.: Stochastic neighbour embedding. In: Roweis, S. (ed.) Advances in Neural Information Processing Systems, vol. 15, pp. 833–840. MIT Press, Cambridge (2003)
8. Huang, S., Ward, M., Rundensteiner, E.: Exploration of dimensionality reduction for text visualisation. In: Proceedings of the Third International Conference on Coordinated and Multiple Views in Exploratory Visualisation (2005)
9. Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
10. Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)
11. Kentsis, A., Gindin, T., Mezei, M., Osman, R.: Calculation of the free energy and cooperativity of protein folding (May 2007)
12. Lawrence, N.: Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6, 1783–1816 (2005)
13. Lima, A., Zen, H., Nankaku, Y., Tokuda, K., Miyajima, C., Kitamura, T.: On the use of kernel PCA for feature extraction in speech recognition (2004)
14. Ng, A.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Intl. Conf. on Machine Learning (2004)
15. Oliveira, S., Zaïane, O.: Privacy-preserving clustering by object similarity-based representation and dimensionality reduction transformation. In: Workshop on Privacy and Security Aspects of Data Mining, pp. 21–30 (2004)
16. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(22), 2323–2326 (2000)
17. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(22), 2319–2323 (2000)
18. Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. Royal Statist. Soc. B 58, 267–288 (1996)
19. van der Maaten, L., Hinton, G.: Visualising data using t-SNE (2008)
20. van der Maaten, L., Postma, E.O., van den Herik, H.J.: Dimensionality reduction: A comparative review (2008)
21. Zhang, Z., Zha, H.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing 26(1), 313–338 (2005)
22. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Technical report, Statistics Department, Stanford University (2004)