Dimensionality Reduction for Classification through Visualisation Using L1SNE

Lennon V. Cook and Junbin Gao

School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia
[email protected], [email protected]

Abstract. Dimensionality reduction algorithms have wide precedent for use in preprocessing for classification problems. This paper presents a new algorithm, based on a modification of Stochastic Neighbour Embedding and t-Distributed SNE that uses the Laplacian distribution in place of, respectively, the Gaussian distribution and a mismatched pair of the Gaussian distribution and Student's t-distribution. Experimental results are presented to demonstrate that this modification yields an improvement.
1 Introduction
Recent years have seen a large increase in studies on dimensionality reduction (DR) algorithms, the goal of which is to find a lower-dimensional representation of high-dimensional data [20]. A common motive is to overcome the "curse of dimensionality" [20,6]: the complexity of many algorithms is bound by the dimension of the data, and can become intractable for the many real-world datasets whose dimensionality is quite high. Application areas include manifold learning [11], pattern recognition [13], data mining [15], data classification [2] and data visualisation [19,8]. Originally, the focus was on algorithms such as Principal Component Analysis (PCA) [10] and Linear Discriminant Analysis (LDA) [4], which assume linearity - that is, that the low-dimensional vectors are a linear combination of the original high-dimensional vectors. More recently, however, much improvement has been gained by relaxing this constraint. The resulting Non-Linear Dimensionality Reduction (NLDR) algorithms include, for example, Locally Linear Embedding (LLE) [16], Laplacian Eigenmaps (LE) [1], Isometric Mapping (Isomap) [17], Local Tangent Space Alignment (LTSA) [21], and the Gaussian Process Latent Variable Model (GPLVM) [12]. The Stochastic Neighbour Embedding (SNE) algorithm is one of these nonlinear dimensionality reduction techniques [7], which considers the probability that any two points will be neighbours. A major justification of this technique is
that it can be readily extended to more complex relationships between the high- and low-dimensional data than the strictly one-to-one mapping of many earlier algorithms. This is presented as being useful in the classification of documents based on the words they contain, where a strict one-to-one relationship fails to account for a single word having multiple meanings. However, SNE suffers from the so-called "crowding problem" [19], where distances between points in the high-dimensional space are reflected less accurately in the low-dimensional space as they increase. To counteract this, the authors of [19] propose that mismatched distributions for the neighbouring probabilities be used, suggesting a heavier-tailed distribution for the lower-dimensional vectors. They choose the Student-t distribution for this, and their experiments demonstrate that the new algorithm, tSNE, outperforms most existing nonlinear dimensionality reduction algorithms. Using the Student-t distribution, which is a heavy-tailed generalisation of the Gaussian distribution used by SNE, increases the robustness of the model against the crowding problem. The same goal can be achieved by instead using the centered Laplacian distribution (also called the L1 distribution, or least absolute deviance). The L1 distribution is much less sensitive to outliers than the Gaussian density and has only one tunable parameter, while the Student-t distribution is determined by two parameters (the degrees of freedom and the scale parameter). The approach of using the L1 distribution originates from the LASSO [18], and has attracted interest in machine learning [14] and statistics. Beyond robustness against outliers, the L1 distribution assumption is also used as a penalty/regularisation term on model parameters to enforce sparsity, or parameter/feature selection, as in sparse PCA [9,22] and logistic regression [14]. A recent paper [3] gives a detailed analysis of the generalised distribution which includes L1 as a special case.

This paper presents a new dimensionality reduction algorithm for data visualisation to aid classification, based on using the generalised L1 distribution in the classical SNE. Like tSNE, the new algorithm differs in the optimisation problem which is solved - that is, this paper argues that the technique provides a better match between the formal optimisation problem and the abstract research goal, rather than presenting a better way of finding a solution to the same optimisation problem.

The paper is organised as follows. Section 2 presents a revision of the existing background work, focusing on two algorithms for dimensionality reduction. Section 3 presents the proposed new algorithm. Following this, experiments for classification on several datasets are conducted in Section 4 and the results are analysed and compared against the results of the existing algorithms. Finally, the conclusions drawn from these experiments are presented.

For the remainder of the paper, $X = \{x_i\}_{i=1}^{N}$ is the set of $D$-dimensional input vectors, and $Y = \{y_i\}_{i=1}^{N}$ is the set of $d$-dimensional reduced vectors. It can be assumed from the direct goal of dimensionality reduction that $d \ll D$.
2 SNE and t-SNE

2.1 Formulation of SNE
The algorithm works by determining which pairs of points should be matched, in terms of their Euclidean distance, in the high- and low-dimensional spaces respectively. The matching is based on the normalized discrete distribution determined by their Gaussian kernels in the two spaces. For the high-dimensional input space, the probability that any $x_i$ and $x_j$ are neighboured is:
$$p_{ij} = \frac{\exp\left(-\frac{1}{2\sigma_i^2}\|x_i - x_j\|^2\right)}{\sum_{k \neq i}\exp\left(-\frac{1}{2\sigma_i^2}\|x_i - x_k\|^2\right)} \qquad (2.1)$$
where $\|x\|$ denotes the Euclidean norm of $x$. $P_i = \{p_{ij} \mid \forall j\}$ is the Gaussian-distributed set of all such probabilities for a particular $x_i$. The low-dimensional neighbouring probabilities are calculated in the same way:
$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i}\exp\left(-\|y_i - y_k\|^2\right)} \qquad (2.2)$$
and likewise $Q_i = \{q_{ij} \mid \forall j\}$. SNE aims to select the $y_i$ so that each $Q_i$ matches its associated $P_i$ as closely as possible. This leads to the cost function:
$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} \qquad (2.3)$$
where $KL(P_i \,\|\, Q_i)$ is the Kullback-Leibler divergence between the distributions.
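To make the formulation concrete, the following is a minimal NumPy sketch of equations (2.1)-(2.3); it is not the authors' implementation, the function names are ours, and the single fixed bandwidth `sigma` is a simplification (SNE normally tunes each $\sigma_i$ from a target perplexity).

```python
import numpy as np

def sne_probabilities(X, sigma=1.0):
    """p_ij of eq. (2.1), row-normalised; a single fixed bandwidth is assumed."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # squared Euclidean distances
    P = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                                 # a point is not its own neighbour
    return P / P.sum(axis=1, keepdims=True)

def sne_low_dim_probabilities(Y):
    """q_ij of eq. (2.2): the same construction in the low-dimensional space, unit bandwidth."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = np.exp(-D2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """Sum of Kullback-Leibler divergences, eq. (2.3)."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```

With these helpers, the SNE objective for a candidate embedding `Y` is simply `kl_cost(sne_probabilities(X), sne_low_dim_probabilities(Y))`.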
2.2 t-Distributed SNE
The t-distributed SNE algorithm was originally presented in [19]. It is formulated as a modification to SNE in which the Gaussian distribution (2.1) is retained in the high-dimensional space, while the heavier-tailed Student-t distribution is used for the low-dimensional space. This leaves the calculation of each $p_{ij}$ identical to that in SNE. However, the calculation of $q_{ij}$ changes to:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i}\left(1 + \|y_i - y_k\|^2\right)^{-1}} \qquad (2.4)$$
In this formulation (2.4), as in [19], the degrees-of-freedom parameter of the Student-t distribution is assumed to be 1. The cost function, as in SNE, is given by the Kullback-Leibler divergences (2.3).
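Relative to the SNE sketch above, only the low-dimensional probabilities change. A hedged rendering of (2.4) as written here follows; the function name is ours, and the per-point normalisation simply mirrors (2.2).

```python
def tsne_low_dim_probabilities(Y):
    """q_ij of eq. (2.4): Student-t kernel with one degree of freedom,
    normalised per point as in the SNE sketch above."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + D2)          # heavy-tailed kernel replaces the Gaussian
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)
```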
3 Laplacian-Distributed (L1) SNE
SNE's $P_i$ and $Q_i$ distributions measure the probability of points being neighboured. The difference between them is not that they measure this on different data, but rather that they measure it on different representations of the same data. It therefore makes sense that the particular probability distributions should be identical. This is at odds with tSNE's use of a different distribution for each $Q_i$ than for its corresponding $P_i$, and suggests that the crowding problem it seeks to solve is caused by an inappropriate choice of distribution in the first place. The choice of a Gaussian distribution in SNE is never justified by [7], but simply taken as a default. [5] notes that this is a common default choice, but suggests that it is not necessarily justified by statistical theory. It seems sensible, then, that the argument presented in [5] can apply here, and hence that the Laplacian (L1) distribution may be a better choice than the Gaussian for the neighbourhood probability model of SNE. Therefore, we define:
$$p_{i|j} = \exp\left(-\frac{\|x_i - x_j\|_1}{2\sigma_i^2}\right) \qquad (3.1)$$
where $\|x\|_1 = |x_1| + |x_2| + |x_3| + \dots + |x_D|$ is the L1 norm of $x$, and $\sigma_i^2$ is the variance of $x_i$. In order to ensure symmetry, $p_{ij} = p_{ji}$, and hence simplify the cost function, we then define:
$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \qquad (3.2)$$
with $p_{ii} = 0$. And similarly in the low-dimensional space:
$$q_{ij} = \frac{\exp\left(-\|y_i - y_j\|_1\right)}{\sum_{k \neq l}\exp\left(-\|y_k - y_l\|_1\right)} \qquad (3.3)$$
with $q_{ii} = 0$. Again, we seek to minimise the cost function (2.3) given by the Kullback-Leibler divergences between $P_i$ and $Q_i$. Like the authors of [19], we do this by iterative gradient descent. Since the only distances $d_{ij} = \|y_i - y_j\|_1$ affected by a change in any particular $y_i$ are $d_{ij}$ and $d_{ji}$ for all $j$, we have:
$$\frac{\partial C}{\partial y_i} = 2\sum_j \frac{\partial C}{\partial d_{ij}}\,\mathrm{sgn}(y_i - y_j)$$
where $\mathrm{sgn}(y) = \left(\frac{y_1}{|y_1|}, \frac{y_2}{|y_2|}, \dots, \frac{y_d}{|y_d|}\right)$ is the sign vector of $y$.
From (2.3), we can find:
$$\frac{\partial C}{\partial d_{ij}} = -\sum_{k \neq l} p_{kl}\,\frac{\partial \log q_{kl}}{\partial d_{ij}} = p_{ij} - q_{ij}\sum_{k \neq l} p_{kl} = p_{ij} - q_{ij}$$
$$\therefore\quad \frac{\partial C}{\partial y_i} = 2\sum_j \left(p_{ij} - q_{ij}\right)\mathrm{sgn}(y_i - y_j) \qquad (3.4)$$
This form has roughly equivalent computational complexity to the gradient of tSNE. Also, as mentioned in the introduction, it lacks the degrees-of-freedom parameter which tSNE requires to be tuned to the particular dataset.
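For readers who prefer code to equations, the following is a minimal NumPy sketch of the procedure implied by (3.1)-(3.4); it is not the authors' implementation. The fixed bandwidth, learning rate, iteration count and the explicit row normalisation of (3.1) are our assumptions, and no momentum or adaptive step size is used.

```python
import numpy as np

def l1sne(X, d=2, sigma=1.0, lr=100.0, n_iter=1000, rng=None):
    """Minimal L1SNE sketch: eqs. (3.1)-(3.4) optimised by plain gradient descent.
    sigma, lr and n_iter are illustrative assumptions, not values from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    # High-dimensional probabilities: eq. (3.1), symmetrised by eq. (3.2).
    D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)      # pairwise L1 distances
    P = np.exp(-D1 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)                    # assumed normaliser (implicit in (3.1))
    P = (P + P.T) / (2.0 * n)                                # eq. (3.2); P now sums to one

    # Random initial solution, as used in the experiments of Section 4.
    Y = 1e-4 * rng.standard_normal((n, d))

    for _ in range(n_iter):
        # Low-dimensional probabilities: eq. (3.3), normalised over all pairs.
        Dy = np.abs(Y[:, None, :] - Y[None, :, :]).sum(-1)
        Q = np.exp(-Dy)
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum()

        # Gradient (3.4): dC/dy_i = 2 * sum_j (p_ij - q_ij) sgn(y_i - y_j).
        S = np.sign(Y[:, None, :] - Y[None, :, :])           # sign vectors, shape (n, n, d)
        grad = 2.0 * np.einsum('ij,ijd->id', P - Q, S)
        Y -= lr * grad                                       # fixed-step gradient descent
    return Y
```

In practice a momentum term and adaptive gains, as used for tSNE in [19], would be natural additions to this bare-bones descent.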
4 Experimental Results
Experiments were run to compare tSNE and L1SNE on three datasets with class information, each of which is detailed in the following subsections. Both algorithms are optimised using the iterative gradient descent method mentioned previously. In all cases, the initial solution is given randomly, and the optimisation is terminated after a fixed number of iterations. The implementation of tSNE is from Laurens van der Maaten's MATLAB Toolbox for Dimensionality Reduction (http://ict.ewi.tudelft.nl/~lvandermaaten/Matlab_Toolbox_for_Dimensionality_Reduction.html).

For each experiment, graphs are presented to show the results. These are colourised to show the true class of each point, which is not made available to the algorithms as they run. Quantitative errors are also provided for each experiment, calculated as a modified KNN error called the k-point local clustering error (KLCE). This measure avoids KNN's need for a back-projection by which new high-dimensional points would be mapped individually onto a precalculated low-dimensional space. KLCE considers each low-dimensional point in turn and determines what proportion of its k nearest points are of a different true class to the point under consideration. The final error is the average of these proportions across the reduced dataset.
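A minimal sketch of the KLCE measure as described above, assuming Euclidean nearest neighbours in the embedding; the function name and the default value of k are ours, since the paper does not state the k used.

```python
import numpy as np

def klce(Y, labels, k=10):
    """k-point local clustering error: for each embedded point, the fraction of its
    k nearest neighbours (Euclidean, excluding itself) with a different true label,
    averaged over the dataset. k=10 is an illustrative choice."""
    Y = np.asarray(Y)
    labels = np.asarray(labels)
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    np.fill_diagonal(D, np.inf)                      # exclude the point itself
    nn = np.argsort(D, axis=1)[:, :k]                # indices of the k nearest neighbours
    mismatches = labels[nn] != labels[:, None]
    return mismatches.mean()
```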
4.1 Handwritten Digits
Images of handwritten digits 0-9 were selected randomly from the MNIST dataset, with 1,500 images taken per digit. The results are presented on a two-dimensional scatterplot, colourised to reflect the digit each point represents. Results are provided for three different random subsets to demonstrate repeatability.

It can be seen in Figure 1 that both L1SNE and tSNE successfully group the data into tight clusters in the lower-dimensional space according to the digit they represent.
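The sampling step can be sketched as follows, assuming scikit-learn's OpenML copy of MNIST as the data source; the paper does not specify how its random subsets were drawn, so this is an illustrative reconstruction only.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def mnist_subset(per_digit=1500, seed=0):
    """Draw per_digit random images of each digit 0-9 from MNIST (assumed data source)."""
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    y = y.astype(int)
    rng = np.random.default_rng(seed)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == d), per_digit, replace=False)
                          for d in range(10)])
    return X[idx] / 255.0, y[idx]
```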
Fig. 1. Handwritten Digits Experiments: (a) tSNE Experiment 1, (b) L1SNE Experiment 1, (c) tSNE Experiment 2, (d) L1SNE Experiment 2, (e) tSNE Experiment 3, (f) L1SNE Experiment 3 [two-dimensional scatterplots not reproduced]

Table 1. KLCE Errors for Handwritten Digits Experiments

Experiment   tSNE     L1SNE
1            0.0723   0.0899
2            0.0731   0.0937
3            0.0847   0.1043
Fig. 2. Photographs of Faces: (a) tSNE (Lights labels), (b) L1SNE (Lights labels), (c) tSNE (1st Poses labels), (d) L1SNE (1st Poses labels), (e) tSNE (2nd Poses labels), (f) L1SNE (2nd Poses labels) [three-dimensional scatterplots not reproduced]
L1SNE does not spread the distinct clusters as clearly from one another as does tSNE; however, it shows less tendency for small groups of data to 'wander' between clusters. In the cases where it does do this, the offending group tends to appear far nearer to its proper cluster, and visibly smaller in number, than in tSNE. L1SNE's tendency to place the clusters much closer together in the space appears to be the cause of its slightly higher error rate (Table 1): points near the edge of a cluster have many close points which belong to a neighbouring cluster. If the clusters were spread further from each other, as in
tSNE, then points deeper in the same cluster would be closer than points on the nearest edge of a nearby cluster.
4.2 Photographs of Faces
This experiment is designed to show utility in reducing to more than three dimensions, as well as when multiple, independent classification schemes are available for the data. The data is taken from the Faces dataset, which contains photographs of faces in varying poses and under changed lighting. Each datapoint is the pixel data from a 64-by-64 pixel image, giving 698 4096-dimensional points. This dataset comes with three sets of labels, two representing the pose of the subject and one representing the lighting of the photograph. Three experiments were performed to test the utility of reduction to different low-dimensional spaces. Data are reduced by each algorithm to four dimensions.

Since the labelling information for this dataset is continuous rather than discrete, KLCE errors are unavailable for this experiment. Instead, the reduced data is projected onto a 3-dimensional space using Principal Component Analysis [10] so that it can be graphed for visual inspection. The results in Figure 2 suggest that classification by L1SNE would be more accurate on two of the three sets of labels. Once again, it can be seen that L1SNE's primary drawback is a failure to clearly demarcate the boundaries of each class of points, but that it otherwise shows improvement over the existing algorithm.
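The inspection step can be sketched as a plain SVD-based PCA applied to the 4-dimensional embedding; this is an illustrative rendering under our own assumptions, not necessarily the PCA implementation used to produce the figures.

```python
import numpy as np

def pca_project(Y, k=3):
    """Project an embedding Y (n x d) onto its top-k principal components for plotting."""
    Yc = Y - Y.mean(axis=0)                        # centre the embedding
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T                           # n x k coordinates for a 3-D scatterplot
```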
5 Conclusions
In these experiments, L1SNE shows improvement over its predecessors. It shows particular strength in the case where the reduced dimension is greater than three, and where there are multiple, independent ways to classify the data. The primary identifiable issue which could interfere with automatic classification of data is that it fails to separate the classes of points sufficiently far from each other, so that points on the very edge of a class cluster may appear to belong to the neighbouring cluster. This could be overcome, for example, by partially human-assisted classification in ambiguous edge cases.

The gradient descent method used here for optimisation is rather trivial, as are the random initial solution and the fixed-iterations termination condition. Future research may look for improved techniques in all of these areas, and this may do much to improve the results of all of the algorithms tested here. However, since these issues were identical across both algorithms, both could reasonably be expected to improve by similar margins. The experiments conducted argue that the L1SNE cost function has a minimum which is closer to the abstract "ideal" solution, regardless of the optimisation method used to search for it. The evidence then suggests that, of these two, L1SNE would continue to give the better results under such improvements, while maintaining relatively low computational complexity compared to the original SNE algorithm.
References

1. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
2. Buchala, S., Davey, N., Frank, R.J., Gale, T.M.: Dimensionality reduction of face images for gender classification (2004)
3. Caron, F., Doucet, A.: Sparse Bayesian nonparametric regression, pp. 88–95 (2008)
4. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188 (1936)
5. Gao, J.: Robust L1 principal component analysis and its Bayesian variational inference. Neural Computation 20(2), 555–572 (2008)
6. Guo, Y., Kwan, P.W.H., Hou, K.X.: Visualization of protein structure relationships using constrained twin kernel embedding (2008)
7. Hinton, G., Roweis, S.: Stochastic neighbour embedding. In: Roweis, S. (ed.) Advances in Neural Information Processing Systems, vol. 15, pp. 833–840. MIT Press, Cambridge (2003)
8. Huang, S., Ward, M., Rundensteiner, E.: Exploration of dimensionality reduction for text visualisation. In: Proceedings of the Third International Conference on Coordinated and Multiple Views in Exploratory Visualisation (2005)
9. Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
10. Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)
11. Kentsis, A., Gindin, T., Mezei, M., Osman, R.: Calculation of the free energy and cooperativity of protein folding (May 2007)
12. Lawrence, N.: Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6, 1783–1816 (2005)
13. Lima, A., Zen, H., Nankaku, Y., Tokuda, K., Miyajima, C., Kitamura, T.: On the use of kernel PCA for feature extraction in speech recognition (2004)
14. Ng, A.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Intl. Conf. on Machine Learning (2004)
15. Oliveira, S., Zaïane, O.: Privacy-preserving clustering by object similarity-based representation and dimensionality reduction transformation. In: Workshop on Privacy and Security Aspects of Data Mining, pp. 21–30 (2004)
16. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(22), 2323–2326 (2000)
17. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(22), 2319–2323 (2000)
18. Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. Royal Statist. Soc. B 58, 267–288 (1996)
19. van der Maaten, L., Hinton, G.: Visualising data using t-SNE (2008)
20. van der Maaten, L., Postma, E.O., van den Herik, H.J.: Dimensionality reduction: A comparative review (2008)
21. Zhang, Z., Zha, H.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing 26(1), 313–338 (2005)
22. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Technical report, Statistics Department, Stanford University (2004)