Deep Bottleneck Classifiers in Supervised Dimension Reduction

Elina Parviainen

BECS (Dept. of Biomedical Engineering and Computational Science), Aalto University School of Science and Technology, Finland
Abstract. Deep autoencoder networks have been applied successfully in unsupervised dimension reduction. The autoencoder has a "bottleneck" middle layer of only a few hidden units, which gives a low-dimensional representation of the data when the full network is trained to minimize reconstruction error. We propose using a deep bottlenecked neural network for supervised dimension reduction: instead of trying to reproduce the data, the network is trained to perform classification. Pretraining with restricted Boltzmann machines is combined with supervised finetuning. Supervised finetuning has been studied before, but with cost functions that scale quadratically in the number of data points. Training a bottleneck classifier scales linearly, yet gives results comparable to, and sometimes better than, two earlier supervised methods.
1 Introduction
This work contributes to the line of research started by [1], where the authors present a way of training a deep neural network by pretraining it with restricted Boltzmann machines (RBM) and finetuning the result by backpropagation. Backpropagation, if started from a random initialization, easily ends up in a local optimum, especially in a high-dimensional parameter space like that of a deep network. Pretraining mitigates this problem by finding a good initialization.

In an autoencoder, the finetuning phase minimizes the reconstruction error. By using a different cost function, the network can be made to learn a different mapping, for example one that gives results similar to an existing dimension reduction method [2, 3]. Information about class labels can also be introduced into the cost function. To the best of our knowledge, supervised finetuning of deep RBM-pretrained networks has only been done using cost functions based on the concept of neighborhood. The problem with neighborhood-based costs is the need to refer to neighboring points when computing the cost for a single point, which results in complex cost functions. Two such cost functions, neighborhood components analysis (NCA) [4] and a large margin k-nearest neighbor (kNN) classification cost [5], will be described in detail later.

In this work, we suggest a much faster way of achieving supervision: the network is finetuned to perform nonlinear classification. We add a two-layer classifier after the bottleneck layer, and backpropagate its classification error
through the whole network to finetune the weights. Training a classifier network scales linearly in the number of data points.

We start by briefly reviewing earlier work on supervision in bottleneck architectures in Sect. 2, and introduce the bottleneck classifier in Sect. 3. Methods used for comparison are described in Sect. 4. Sections 5 and 6 report our experiments and their results, and conclusions are drawn in Sect. 7.
2 Related Work
Dimension reduction with autoencoders was suggested already in the 1980s, e.g. [6], but training a deep autoencoder remained a challenge until recently. Currently, at least two different strategies for training deep networks are known [7]: training each layer as an RBM or as a shallow autoencoder.

We will compare the bottleneck classifier to two supervised cost functions which have been used to finetune an RBM-pretrained deep network. Neighborhood components analysis [8] attempts to maximize the expected number of correctly classified points in nearest neighbor classification. The same cost function is used with a deep network in [4], yielding a nonlinear extension of NCA. Another neighborhood-based cost function, for achieving representations with a large margin in k-nearest neighbor (kNN) classification, is suggested in [5]. It is based on distances to a number of same-class and other-class points.

In addition to these fully supervised cost functions, different combinations of supervised and unsupervised structures have been suggested. A strategy called divergent autoencoding [9] uses a model with a joint bottleneck layer but separate decoder networks for each class; each decoder is trained with samples from a certain class. Regularizers, either for the network output or for each layer in turn, are used for adding supervisory information to a deep network in [10]. Supervision in the form of linear classifiers has been suggested in [11] for a deep architecture with autoencoder pretraining of the layers. Each layer is coupled with a linear classifier, and a tuning parameter determines the relative importance of the two in training. The authors achieve good precision and recall on three document data sets, which makes us expect good results also with an RBM-based architecture combined with classification.

A shallow classifier with a bottleneck layer was used in [12]. The authors studied whether adding class labels would help the network produce representations which ignore features irrelevant to classification. They used two artificial data sets, small by today's standards, and considered a bottleneck network with three hidden layers. In that setting, they found that the bottleneck classifier improved results over autoencoders. Our work brings this idea up to date, using a deep architecture, modern pretraining and real data sets of tens of thousands of samples.
3 Bottleneck Classifier vs. Other Network Layouts

3.1 RBM Pretraining for a Deep Network
The structure of a restricted Boltzmann machine [13] is a fully connected bipartite graph. Nodes in its visible layer are associated with the input data, and those in the hidden layer form a representation of the data. Training tries to find weights such that the model's marginal distribution over the visible nodes matches the data distribution as well as possible. Minimizing the Kullback-Leibler divergence between the two distributions would achieve this, but the gradient contains terms which cannot be computed analytically. The work in [1] gets around this problem by minimizing a slightly different cost function called contrastive divergence, which can be optimized by standard gradient descent.

Pretraining a deep network requires a stack of RBMs, which form the layers of the deep network. They are trained one at a time, using the data as input to the first layer, and training each subsequent layer to reproduce the result of the previous layer as accurately as possible. The sizes of the layers are, of course, determined by the desired layout of the deep network. The weights for the encoder part of the network, that is, all layers up to and including the bottleneck layer, are the weights of the RBM stack. For the decoder part (layers after the bottleneck), the same weights are reused (in the autoencoder) or a random initialization is used (in the bottleneck classifier).
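To make the layer-wise procedure concrete, the sketch below (a minimal NumPy illustration, not the authors' original code; binary units, full-batch CD-1 updates, and the learning rate are assumptions made here for brevity) trains a stack of RBMs greedily, each layer on the hidden activations of the previous one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1, rng=np.random.default_rng(0)):
    """Train one binary RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visible layer and back up.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # Contrastive divergence update (full batch here; mini-batches in practice).
        n = data.shape[0]
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pretraining: each RBM sees the previous layer's activations."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, _, b_hid = train_rbm(layer_input, n_hidden)
        weights.append((W, b_hid))
        layer_input = sigmoid(layer_input @ W + b_hid)
    return weights

# Example: encoder layout 784-1000-500-250-2, as used for MNIST in Sect. 5.2.
# encoder_weights = pretrain_stack(mnist_train, [1000, 500, 250, 2])
```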
3.2 Finetuning
Network layouts for the methods compared in this work are illustrated in Fig. 1. Pretraining is done in exactly the same way for all architectures. The finetuning phases differ in the cost function used and in the number and structure of layers in the decoder part of the network. The finetuning error is backpropagated through the network; in our experiments, a conjugate gradient optimizer was used for minimizing the error.

The bottleneck classifier produces an estimate ỹ ≈ y when trained with input x and target y (the targets y are class labels in 1-of-C binary encoding). The low-dimensional embedding for points x can be read from the outputs of the bottleneck layer. After pretraining, two classifier layers (hidden and output) are added, with random initialization of their weights. Both layers have sigmoid activation functions. The outputs are combined using softmax, and the network is finetuned to minimize the cross-entropy classification error.

The autoencoder produces a reconstruction x̃ ≈ x when trained with input x and target x. Outputs of the middle layer form the low-dimensional embedding for points x. After pretraining, the decoding layers are added, mirroring the weights from the pretrained layers. The network is finetuned to minimize the mean square reconstruction error.

With the NCA and kNN cost functions, and the t-SNE cost (which will be introduced in Sect. 4.1), no new layers are added. Distances computed from the low-dimensional embedding e (see Fig. 1) are compared to distances between input vectors using the cost function, and the finetuning tries to minimize this cost.
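As an illustration of the bottleneck-classifier layout and finetuning just described, here is a minimal sketch (a re-implementation for clarity, not the code used in the experiments; the MNIST layer sizes and the 210 hidden units of Sect. 5 are reused as defaults, sigmoid units throughout the encoder are an assumption, and plain stochastic gradient descent stands in for the conjugate gradient optimizer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckClassifier(nn.Module):
    """Pretrained encoder ending in a bottleneck, plus a randomly initialized
    two-layer classifier head (hidden + softmax output)."""
    def __init__(self, encoder_sizes=(784, 1000, 500, 250, 2), n_hidden=210, n_classes=10):
        super().__init__()
        self.encoder = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(encoder_sizes[:-1], encoder_sizes[1:]))
        self.hidden = nn.Linear(encoder_sizes[-1], n_hidden)  # classifier hidden layer
        self.out = nn.Linear(n_hidden, n_classes)             # softmax output layer

    def embed(self, x):
        # Outputs of the bottleneck layer give the low-dimensional embedding.
        for layer in self.encoder:
            x = torch.sigmoid(layer(x))
        return x

    def forward(self, x):
        e = self.embed(x)
        h = torch.sigmoid(self.hidden(e))
        return self.out(h)  # logits; softmax is applied inside the loss

# Finetuning: the cross-entropy classification error is backpropagated through
# the classifier head and the whole pretrained encoder.
# model = BottleneckClassifier()
# (encoder weights would be copied from the pretrained RBM stack here)
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
# for x, y in train_loader:                 # y: integer class labels
#     loss = F.cross_entropy(model(x), y)
#     opt.zero_grad(); loss.backward(); opt.step()
```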
Fig. 1. Illustration of network layouts for (a) the bottleneck classifier (H hidden units, C classes), (b) the autoencoder, and (c) the NCA, kNN and t-SNE costs, each performing dimension reduction to d dimensions. The text gives more details on training each network.
4 Comparison Methods
We compare the bottleneck classifier to two supervised cost functions (NCA and kNN), and to two unsupervised methods. To justify the often heavier computation required in the supervised setting, a supervised method should produce better results than a good unsupervised one. As t-SNE has achieved remarkably good results on several data sets, especially on clustered data, we include it in our comparison as the unsupervised baseline method. We also include autoencoders in the comparison because of their structural similarity to the bottleneck classifier.

In the following, we denote and index generic data points with h, i, j and l. N is the number of data points. The point i has coordinates $x_i$ in the data space and $y_i$ in the low-dimensional space. Formulas below are taken from the cited references, but notation for the different methods has been unified.
4.1 Stochastic Neighbor Embedding with Student-t Distributions

T-SNE [14] minimizes the Kullback-Leibler divergence $\sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$ between neighborhood probabilities $p_{ij}$ in the data space and $q_{ij}$ in the low-dimensional space. Neighborhood probabilities in the data space are based on normal distributions,

$$ p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{h \neq i} \exp\left(-\|x_i - x_h\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} . \qquad (1) $$

The variances $\sigma_i$ vary according to data density, and are determined by a perplexity parameter, which can be thought of as a soft version of the number of neighbors. In the low-dimensional space, neighborhood probabilities are computed using the t-distribution, with a degrees-of-freedom parameter $\alpha$,

$$ q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{h \neq l} \left(1 + \|y_h - y_l\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}} . \qquad (2) $$
The heavy tails of the t-distribution compensate for the larger volume of a high-dimensional neighborhood. In the low-dimensional space, the points are allowed to move slightly further from each other than in the data space, but the neighborhood relations are still preserved well. A deep neural network which uses the t-SNE cost function has been presented in [3].
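For concreteness, the sketch below evaluates this cost with NumPy (an illustrative simplification, not the implementation used in the experiments: a single fixed sigma replaces the per-point, perplexity-matched sigma_i of Eq. (1), and the small epsilon is only for numerical safety in the logarithm).

```python
import numpy as np

def tsne_cost(X, Y, sigma=1.0, alpha=1.0, eps=1e-12):
    """KL divergence between data-space and embedding-space neighbor probabilities."""
    n = X.shape[0]
    # Pairwise squared distances in both spaces.
    dX = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dY = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

    # Eq. (1): conditional probabilities in the data space, then symmetrize.
    P = np.exp(-dX / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)   # p_{j|i}
    P = (P + P.T) / (2 * n)             # p_{ij}

    # Eq. (2): Student-t probabilities in the low-dimensional space.
    Q = (1 + dY / alpha) ** (-(alpha + 1) / 2)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    return np.sum(P * np.log((P + eps) / (Q + eps)))
```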
4.2 Large Margin kNN Classification
A deep network for finding classification-preserving low-dimensional representations is given in [5]. It applies, in slightly modified form, a cost function presented in [15]. Minimizing the cost function of [5],

$$ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{l=1}^{N} \eta_{il} \, \gamma_{ij} \cdot \max\left(0,\; 1 + \|y_i - y_l\|^2 - \|y_i - y_j\|^2\right), \qquad (3) $$
maximizes the margin between classes by requiring that the distance from point i to an other-class point j must be at least one plus the largest distance to the k nearest same-class points l. The class relationships are encoded with the help of the binary variables $\eta_{il}$ (which is 1 if l is one of the k same-class neighbors chosen for comparison with point i) and $\gamma_{ij}$ (which is 1 if j is a chosen point from a class different from that of i). Applying this idea directly would make the computational complexity $O(kN^2)$. Therefore, the computation is simplified by considering only m points from any other class. With C classes, this reduces the complexity to $O((C-1)kmN)$, but, as neighbors from all classes are needed for each point, the method is still quite slow.
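A minimal NumPy sketch of this cost is given below (an illustration only; the exact neighbor and other-class point selection in [5] may differ, and the full pairwise distance matrix used here is practical only for small N).

```python
import numpy as np

def large_margin_knn_cost(Y, labels, k=3, m=10, rng=np.random.default_rng(0)):
    """Hinge cost of Eq. (3) with the O((C-1)kmN) simplification: for each point,
    k nearest same-class neighbors (eta_il = 1) and m sampled points from every
    other class (gamma_ij = 1)."""
    n = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared distances
    cost = 0.0
    for i in range(n):
        same = np.where((labels == labels[i]) & (np.arange(n) != i))[0]
        targets = same[np.argsort(d2[i, same])[:k]]       # k same-class neighbors
        for c in np.unique(labels):
            if c == labels[i]:
                continue
            other = np.where(labels == c)[0]
            imposters = rng.choice(other, size=min(m, other.size), replace=False)
            for l in targets:
                for j in imposters:
                    # Penalize other-class points that come within the margin.
                    cost += max(0.0, 1.0 + d2[i, l] - d2[i, j])
    return cost
```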
4.3 Neighborhood Components Analysis
Neighborhood components analysis [8] learns a discriminative Mahalanobis metric by maximizing the accuracy of nearest neighbor classification. The cost function measures the expected number of correctly classified points, using soft neighbor assignments. Soft neighborhoods are defined as probabilities with which a point i chooses another point j as its nearest neighbor,

$$ p_{ij} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{h \neq i} \exp\left(-\|y_i - y_h\|^2\right)}, \qquad p_{ii} = 0 . \qquad (4) $$
In nearest neighbor classification this neighbor j determines the class of point i, and therefore the neighborhood probabilities depend on distances in the low-dimensional space. The closer the point i is located to other points of its class, the more probable it is to have its nearest neighbor from the correct class, and the higher the value of the sum $\sum_{i=1}^{N} \sum_{j: c_i = c_j} p_{ij}$ (where $c_i$ denotes the class of point i). The nonlinear version of NCA [4] finetunes a deep network to maximize this sum. The computational complexity is $O(N^2)$.
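The following short NumPy sketch evaluates this objective for a given embedding (illustrative only; it is not the finetuning code, which would also need the gradient with respect to the network weights).

```python
import numpy as np

def nca_objective(Y, labels):
    """Expected number of correctly classified points under the soft neighbor
    assignments of Eq. (4); nonlinear NCA finetuning maximizes this value."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2)
    np.fill_diagonal(P, 0.0)                  # p_ii = 0
    P /= P.sum(axis=1, keepdims=True)         # p_ij of Eq. (4)
    same_class = labels[:, None] == labels[None, :]
    return (P * same_class).sum()             # sum_i sum_{j: c_i = c_j} p_ij
```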
5 Experiments
5.1 Data Sets
We tried the bottleneck classifier, as well as the comparison methods, on three data sets. Dimension reduction was done to 30D, which would be appropriate when looking for low-dimensional but nevertheless classification-preserving representations of the data, and to 2D, which is easy to visualize.

The data sets used were the MNIST benchmark data on handwritten digits, the USPS digits data, and a 100-dimensional variant of the 20 newsgroups data set.¹ The MNIST data contains 70000 digit images of 28×28 pixels, with roughly equal numbers of digits from each class. The data set has a fixed division into training data (60000 samples) and test data (10000 samples), which we also respected in our experiments. The USPS data has 1100 16×16 pixel images from each class. These were used to form training data of 9000 samples and test data of 2000 samples. The newsgroups data consists of 16242 documents collected from internet newsgroups, with binary occurrence vectors for 100 chosen words. The documents are classified into four broad categories (comp.*, rec.*, sci.*, talk.*). We used 12000 randomly chosen documents for training and 4200 for testing (for easier division into training batches, some documents were left out).

The training samples were further divided into the training set proper (from now on, "training data" refers to this set) and validation data (10% of samples). The validation data was used for choosing parameter values and for determining the number of epochs in the finetuning phase.
5.2 Network Parameters
For RBM training and autoencoder finetuning we used the implementation of the authors of [1].² We implemented the other cost functions based on the same code.

For the MNIST data, we used a network layout of 784 × 1000 × 500 × 250 × d, with d set to 2 or 30 as appropriate. In the 2D case we also experimented with the 784 × 500 × 500 × 2000 × d layout often used in the literature [1, 4, 3], but due to the larger number of weights it was slower to train and did not seem to improve the results, so the smaller net was chosen for the experiments. For the newsgroups data, the network layout was 100 × 100 × 75 × 50 × d.

The value of 50 epochs from the literature [1] was used in pretraining. The number of finetuning epochs was determined by early stopping, with an upper limit of 100 epochs. The same pretraining result was used with the different finetuning costs.
¹ All data sets are available from http://www.cs.nyu.edu/~roweis/data.html.
² Available from http://web.mit.edu/~rsalakhu/www/DBM.html. We thank the authors for generously making their code available, which greatly facilitated our work.
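The layouts and training schedule above can be summarized as a small configuration sketch (a hypothetical summary for readability, not a published configuration file; the USPS layout is not restated because it is not listed explicitly in the text).

```python
# Hypothetical summary of the setups described in Sect. 5.2; "d" is 2 or 30.
NETWORK_LAYOUTS = {
    "mnist":      [784, 1000, 500, 250, "d"],
    "newsgroups": [100, 100, 75, 50, "d"],
}
PRETRAIN_EPOCHS = 50          # value from [1]
FINETUNE_MAX_EPOCHS = 100     # with early stopping on the validation set
```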
5.3 Cost Function Parameters
The parameters for the kNN cost were set to k = 3 and m = 10; small values were chosen to keep the running times reasonable. The autoencoder and NCA costs have no tuning parameters. The degrees of freedom for t-SNE were set to α = d − 1, one of the alternatives suggested in [3].

A preliminary run was done to check the parameter sensitivities of t-SNE and the bottleneck classifier. The network was trained with the training data and results were computed for the validation set. Ten uniformly spaced parameter values were tried, and the one with the best results was chosen for the final experiments. Initial experimentation had shown little variation over repeated runs, so we decided not to do full cross-validation, which would have been computationally costly.

For t-SNE, perplexity values 10 + 20j, j = 0 . . . 9, were tried. In the final runs, we used perplexity values of 10, 50, 110, 10, 10 and 130 for the MNIST/2D, MNIST/30D, news/2D, news/30D, USPS/2D and USPS/30D cases, respectively. For the bottleneck classifier, the number of hidden units needs to be chosen. Values 10 + 40a, a = 0 . . . 9, were tried; the chosen values were 210, 130, 10, 170, 10 and 90 for the MNIST/2D, MNIST/30D, news/2D, news/30D, USPS/2D and USPS/30D cases. In the 30D case, the number of hidden units had no dramatic effect on accuracy: the differences between the best and worst 1NN classification errors were 0.6% (MNIST), 1.67% (newsgroups) and 1.33% (USPS). The 2D case had more variation, with differences of 2.22%, 4.92% and 15.44%.
5.4 Semisupervised Experiment
Unlabeled data is often easier and cheaper to obtain in large quantities than labeled data. A benefit of using a generative model to initialize the network is that, even if the finetuning uses labels, the initialization phase can also benefit from unlabeled samples. We performed a small experiment with semisupervised learning, initializing the network with the full training data but using varying fractions of labeled data in finetuning. The experiment was performed on the MNIST data set, with the same parameters as with fully labeled data.
6 Results
The top row of Fig. 2 illustrates the baseline results: what we would achieve without supervision by class labels. Pretraining clearly gives a feasible starting point, with some structure visible but classes not clearly separable. The autoencoder does somewhat better, and t-SNE gives a good result, though with some overlap between classes. Visualizations using the supervised methods are shown in the bottom row of Fig. 2. All the supervised methods compared manage to separate the classes more clearly than the unsupervised methods.

As all the supervised methods aim at keeping the classes separate, classification accuracy is a natural criterion for numerical comparisons. We used a k-NN classifier (for k = 1, 3) and a one-hidden-layer neural network classifier trained on the 2D/30D results produced by each network.
Fig. 2. Visualizations of the MNIST test data: (a) RBM pretraining, (b) autoencoder, (c) t-SNE cost, (d) NCA cost, (e) kNN cost, (f) bottleneck classifier. The top row shows unsupervised and the bottom row supervised results. For readability, only half of the points have been plotted. The visualization does not show clear quality differences between the different supervised methods; for numerical comparisons, see the tables in Sect. 6.
Two methods were used since k-NN classification might favor the neighborhood-based methods, whereas the bottleneck classifier could get some unfair benefit in network-based classification. Results are shown in Table 1 for the neural network classifier and in Table 2 for the k-NN classifier. The bottleneck classifier performs consistently well in the 2D case, usually being the best method. In the 30D case, it typically falls second to one of the comparison methods while being superior to the other.

These results lead us to the conclusion that the performance of the different supervised methods is somewhat data-dependent. It may also depend on the parameters used, especially when increasing a parameter value would increase the computational effort (as with the kNN cost, where relatively small numbers of neighbors were used). Based on these experiments, no clear order of superiority could be established between the three methods. Therefore, issues like computational complexity become important when choosing the method. The bottleneck classifier is worth considering because of its computational simplicity. Also, if the main interest lies in the 2D visualizations, the bottleneck classifier could be the best choice.

Results for the semisupervised experiment are shown in Table 3. In the 2D case, the accuracy gained by adding labels seems to saturate rapidly: the difference between 30% and 90% of labels is only about 4 percentage points, much less than the roughly 8 percentage point difference between 10% and 30% of labels. In the 30D case, adding more labels improves results roughly linearly.
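As an aside, a minimal sketch of the k-NN part of this evaluation protocol might look as follows (scikit-learn is used purely for illustration; it is an assumption, not the code behind the reported numbers).

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_error(train_emb, train_labels, test_emb, test_labels, k=1):
    """Classification error (%) of a k-NN classifier trained on the embedded
    training data and evaluated on the embedded test data."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_emb, train_labels)
    return 100.0 * (1.0 - clf.score(test_emb, test_labels))

# e.g. for each method: knn_error(embed(train_x), train_y, embed(test_x), test_y, k=3)
```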
Table 1. Classification errors (%) for the test data with a neural network classifier, which was trained on 2D/30D embeddings of the training data.
              pretrain   autoenc   t-SNE   NCA    KNN    bottleclass
2D    MNIST     54.3      34.3     12.1     6.1    5.5      4.9
      news      43.4      30.8     30.0    22.0   25.5     21.9
      USPS      52.2      56.2     44.4    26.7   36.1     19.3
30D   MNIST      5.1       4.2      4.9     3.9    2.0      3.1
      news      22.4      21.8     23.0    18.7   22.1     20.0
      USPS       8.1       7.5     10.8     8.2    6.8      7.3
Table 2. Classification errors (%) for the test data with a k-NN classifier.
                     pretrain   autoenc   t-SNE    NCA     KNN     bottleclass
2D    MNIST  1-NN      61.8      40.5     14.5     8.3     8.1       6.91
             3-NN      60.9      37.9     10.6     6.6     6.3       5.43
      news   1-NN      41.0      33.5     32.5    25.93   30.7      27.7
             3-NN      40.0      30.9     30.7    23.05   27.1      24.2
      USPS   1-NN      60.2      64.8     40.4    28.3    48.6      27.80
             3-NN      59.1      60.6     36.5    25.0    42.5      21.75
30D   MNIST  1-NN       3.6       3.2      3.9     1.9     1.88      1.9
             3-NN       3.2       2.8      3.6     1.8     1.47      1.8
      news   1-NN      28.9      28.5     27.2    22.62   25.1      24.9
             3-NN      27.6      26.8     25.3    20.29   23.0      22.3
      USPS   1-NN       9.8       8.4     10.4     5.40    5.8       5.9
             3-NN       9.8       8.6     10.1     5.7     5.35      6.5

7 Conclusions
We presented a method for introducing supervision into an RBM-pretrained deep neural network which performs dimension reduction by forcing information to flow through a bottleneck layer. The idea is very simple: after pretraining, the network is trained as a classifier. The method is simple to implement and scales linearly in the number of data points.

We compared our method with two other supervised methods and found that the bottleneck classifier gives results comparable to theirs. We therefore see it as a choice worth recommending, since it is computationally much lighter than the other methods.

In this work we trained the network as a multiclass classifier; it could equally easily be trained to perform regression. A possible direction for future work is to test the applicability of the idea to sufficient dimension reduction (see e.g. [16] and references therein), which aims to find low-dimensional representations preserving the information necessary for predicting a target variable in regression.
Table 3. K-NN classification errors (%) for the test data in the semisupervised experiment (MNIST data), for different amounts of label information.
              1%     5%     10%    30%    50%    70%    90%
2D    1-NN   42.4   26.3   19.8   11.4   10.2    8.1    7.92
      3-NN   39.1   22.1   16.7    9.2    7.7    6.3    6.06
30D   1-NN    3.3    3.0    2.7    2.3    2.0    1.8    1.67
      3-NN    3.0    2.6    2.5    2.1    1.8    1.6    1.53
References

1. Hinton, G.E., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science 313 (2006) 504–507
2. Mao, J., Jain, A.K.: Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks 6(2) (1995) 296–317
3. van der Maaten, L.: Learning a parametric embedding by preserving local structure. In: Proc. of AISTATS. Volume 5 of JMLR: W&CP. (2009) 384–391
4. Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class neighborhood structure. In: Proc. of AISTATS. (2007)
5. Min, R., Stanley, D.A., Yuan, Z., Bonner, A., Zhang, Z.: A deep non-linear feature mapping for large-margin kNN classification. In: Proc. of ICDM. (2009)
6. Elman, J.L., Zipser, D.: Learning the hidden structure of speech. Journal of the Acoustical Society of America 83(4) (1988) 1615–1626
7. Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for training deep neural networks. Journal of Machine Learning Research 10 (2009) 1–40
8. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood components analysis. In: Proc. of NIPS. Volume 17. (2005) 513–520
9. Kurtz, K.J.: The divergent autoencoder (DIVA) model of category learning. Psychonomic Bulletin & Review 14(4) (2007) 560–576
10. Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding. In: Proc. of ICML. (2008)
11. Ranzato, M., Szummer, M.: Semi-supervised learning of compact document representations with deep networks. In: Proc. of ICML. (2008)
12. Intrator, N., Edelman, S.: Learning low-dimensional representations via the usage of multiple-class labels. Network: Computation in Neural Systems 8(3) (1997) 259–281
13. Smolensky, P.: Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D.E., McClelland, J.L., eds.: Parallel Distributed Processing. Volume 1. MIT Press, Cambridge, USA (1986) 194–281
14. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008) 2579–2605
15. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (2009) 207–244
16. Kim, M., Pavlovic, V.: Covariance operator based dimensionality reduction with extension to semi-supervised learning. In: Proc. of AISTATS. Volume 5 of JMLR: W&CP. (2009) 280–287