MULTILAYER PERCEPTRON WITH SPARSE HIDDEN OUTPUTS FOR PHONEME RECOGNITION

G.S.V.S. Sivaram and Hynek Hermansky

Dept. of Electrical & Computer Engineering, Center for Language and Speech Processing, Human Language Technology Center of Excellence, The Johns Hopkins University, USA.
e-mail: {sivaram, hynek}@jhu.edu

The research presented in this paper was partially funded by the IARPA BEST program under contract Z857701 and the DARPA RATS program under D10PC20015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of IARPA or DARPA.

ABSTRACT

This paper introduces the sparse multilayer perceptron (SMLP), which learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP) while the outputs of one of its internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and learning the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, the SMLP based system trained using perceptual linear prediction (PLP) features performs better than the conventional MLP based system. Furthermore, their combination yields a phoneme error rate of 21.2%, a relative improvement of 6.2% over the baseline.

Index Terms— Multilayer perceptron, sparse features, machine learning, phoneme recognition.

1. INTRODUCTION

Sparse features have been used successfully in many pattern classification applications such as face recognition [1], handwritten digit recognition [2, 3, 4], and phoneme recognition [5, 6, 7]. The majority of approaches determine sparse features by expressing a signal as a linear combination of a minimum number (sparse set) of atoms in a dictionary [1, 4, 5, 6]. Alternatively, the transformation from the input to sparse features is learned using a trainable network that is optimized to minimize the reconstruction error of the input [3, 7]. Once these sparse features are derived, a separate classifier is used for making decisions. However, sparse features have not often been optimized in conjunction with the subsequent classifier for the discriminability of the various output classes. Some previous works have attempted to address this issue by considering a simple scenario. For example, a two-class classification problem with a linear or bilinear classifier is considered in [4]. In a different work, Fisher's linear discrimination criterion with sparsity is used [2].

In this paper, we propose to jointly learn both sparse features and the nonlinear classifier boundaries that best discriminate multiple output classes. Specifically, we propose to learn sparse features at the output of a hidden layer of an MLP trained to discriminate multiple output classes. The core idea is to force one of the hidden layers'
outputs to be sparse when learning the parameters of an MLP for a classification task. This is achieved by adding a sparse regularization term to the conventional cross-entropy cost between the target values and their predicted values at the output layer. The parameters of the MLP are learned to minimize the joint cost using an error back-propagation algorithm which takes the additional sparse regularization term into consideration. The resultant model is referred to as the sparse multilayer perceptron (SMLP) throughout this paper. Further, the SMLP estimates the Bayesian a posteriori probabilities of the output classes conditioned on the sparse hidden features if certain conditions, which we shall discuss, hold.

The performance of the proposed SMLP classifier is evaluated using a state-of-the-art TIMIT phoneme recognition system. The phoneme recognition system used in our experiments is based on a hierarchical hybrid Hidden Markov Model (HMM)-MLP approach [8, 9], where the hierarchically estimated phoneme posterior probabilities are converted to scaled likelihoods which are then used to model the HMM states. Note that the SMLP classifier is used for estimating the 3-state phoneme posterior probabilities as shown in Fig. 1. Experimental results indicate that the SMLP based system achieves better performance than the conventional MLP based system. Additionally, their combination improves over the baseline MLP system.

2. THEORY OF SMLP

The notation used in this paper is as follows.

$m$ - number of layers (including input and output layers)
$N_l$ - number of neurons (or nodes) in the $l^{th}$ layer
$\phi_l$ - output nonlinearity at the $l^{th}$ layer
$x_j^l$ - input to the $j^{th}$ neuron in the $l^{th}$ layer
$\phi_l(x_j^l) = y_j^l$ - output of the $j^{th}$ neuron in the $l^{th}$ layer
$w_{ij}^{l-1}$ - weight connecting the $i^{th}$ neuron in the $(l-1)^{th}$ layer and the $j^{th}$ neuron in the $l^{th}$ layer
$d_j$ - target of the $j^{th}$ neuron in the output layer
$e_j = d_j - y_j^m$ - error of the $j^{th}$ neuron in the output layer

The goal of the SMLP classifier is to jointly learn sparse features at the output of its $p^{th}$ layer and estimate posterior probabilities of multiple classes at its output layer. In the case of an MLP, estimates of the posterior probabilities are typically obtained by minimizing the cross-entropy cost between the output layer values (after softmax) and the hard targets¹. We modify this cost function for the SMLP as follows.

¹ A hard target vector consists of all zeros except for a one at the index corresponding to the phoneme that the current input feature vector belongs to.
Fig. 1. SMLP based hierarchical estimation of posterior probabilities of phonemes. Though both networks are fully connected, only part of the connections is shown for clarity.
2.1. Cost function

The two objectives of the SMLP are to

• minimize the cross-entropy cost between the output layer values and the hard targets, and
• force the outputs of the $p^{th}$ layer to be sparse for a particular $p \in \{2, 3, \ldots, m-1\}$.

The instantaneous² cross-entropy cost is

$$L = -\sum_{j=1}^{N_m} \left[ d_j \log\left(y_j^m\right) + (1 - d_j) \log\left(1 - y_j^m\right) \right]. \quad (1)$$

² By instantaneous we mean corresponding to a single input pattern.
To obtain the SMLP instantaneous cost function, we add an additional sparse regularization term to the cross-entropy cost (1), yielding

$$\tilde{L} = L + \frac{\lambda}{2} \sum_{j=1}^{N_p} \log\left(1 + (y_j^p)^2\right), \quad (2)$$

where $\lambda$ is a positive scalar controlling the trade-off between the sparsity and the cross-entropy cost. The function $\sum_{j=1}^{N_p} \log\left(1 + (y_j^p)^2\right)$,
which is continuous and differentiable everywhere, was successfully used in previous works to obtain a sparse representation [10, 3, 6, 7]. The weights of the SMLP are adjusted to minimize (2), as discussed below.

2.2. Error back-propagation training

Stochastic gradient descent is applied for updating the weights of the SMLP. The conventional error back-propagation training algorithm results from applying the chain rule of calculus to compute the gradient of the cross-entropy cost (1) with respect to the weights. For training the SMLP, error back-propagation must be modified to accommodate the additional sparse regularization term. In the rest of this section, we derive the update equations for training the SMLP by minimizing the cost function (2) with respect to the weights³ over the training data in an average sense. Since the learning is based on stochastic gradient descent, determining the gradient of the cost function (2) with respect to the weights is the key.

³ The bias values at any layer can be interpreted as weights connecting an imaginary node in the previous layer, whose output is unity, to all the nodes in the current layer.
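For concreteness, the following minimal NumPy sketch (our illustration, not the authors' code; the function and variable names are assumptions) evaluates the instantaneous cost (2) for a single input pattern, given the softmax outputs of the network and the outputs of the $p^{th}$ (sparse) hidden layer.

```python
import numpy as np

def smlp_instantaneous_cost(y_out, d, y_p, lam):
    """Instantaneous SMLP cost (2) for one input pattern.

    y_out : softmax outputs y^m of the final layer, shape (N_m,)
    d     : hard (one-hot) target vector, shape (N_m,)
    y_p   : outputs y^p of the p-th hidden layer, shape (N_p,)
    lam   : positive scalar lambda trading sparsity against cross-entropy
    """
    eps = 1e-12                                     # numerical safety for log
    # Cross-entropy term L of equation (1)
    L = -np.sum(d * np.log(y_out + eps) + (1.0 - d) * np.log(1.0 - y_out + eps))
    # Sparse regularization term of equation (2)
    penalty = 0.5 * lam * np.sum(np.log(1.0 + y_p ** 2))
    return L + penalty
```

Setting lam to zero recovers the conventional MLP cross-entropy cost, which is exactly the baseline configuration used in Section 3.3.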
2.2.1. Gradient of $\tilde{L}$ w.r.t. $y_j^l$

From (1) and (2), $\forall l \in \{p+1, p+2, \ldots, m\}$, $\forall j \in \{1, 2, \ldots, N_l\}$,

$$\frac{\partial \tilde{L}}{\partial y_j^l} = \frac{\partial L}{\partial y_j^l}. \quad (3)$$

Using (2), for layer $p$, $\forall j \in \{1, 2, \ldots, N_p\}$,

$$\frac{\partial \tilde{L}}{\partial y_j^p} = \frac{\partial L}{\partial y_j^p} + \lambda \left( \frac{y_j^p}{1 + (y_j^p)^2} \right). \quad (4)$$

Using (2) and the chain rule of calculus, $\forall (l-1) \in \{2, 3, \ldots, p-1\}$, $\forall i \in \{1, 2, \ldots, N_{l-1}\}$,

$$\frac{\partial \tilde{L}}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial y_i^{l-1}} \right) = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) w_{ij}^{l-1}. \quad (5)$$

Equations (3), (4) and (5) indicate that the gradients of $\tilde{L}$ w.r.t. $y_j^l$ can be computed from the gradients of $L$ w.r.t. $y_j^l$. Specifically, we need the gradients of $L$ w.r.t. $y_j^l$, $\forall l \in \{p, p+1, \ldots, m\}$, $\forall j \in \{1, 2, \ldots, N_l\}$, in order to compute the gradients of $\tilde{L}$ w.r.t. $y_j^l$, $\forall l \in \{2, 3, \ldots, m\}$, $\forall j \in \{1, 2, \ldots, N_l\}$. The gradient of $L$ w.r.t. $y_j^l$ can be obtained using the conventional error back-propagation algorithm.

2.2.2. Gradient of $\tilde{L}$ w.r.t. $w_{ij}^{l-1}$

By definition, $w_{ij}^{l-1}$ denotes the weight connecting the $i^{th}$ neuron in the $(l-1)^{th}$ layer and the $j^{th}$ neuron in the $l^{th}$ layer. Thus, by the chain rule,

$$\frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial w_{ij}^{l-1}} \right) = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) y_i^{l-1}. \quad (6)$$
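As a hedged sketch of how these equations change standard back-propagation (ours, assuming sigmoid hidden nonlinearities; helper names such as `add_sparse_term` are hypothetical), the only modification is the extra term of (4) injected into the error signal at layer $p$; (5) and (6) are then the usual chain-rule steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # phi_l'(x) for a sigmoid layer

def add_sparse_term(dL_dy_p, y_p, lam):
    """Equation (4): add the sparse-penalty gradient at the p-th layer."""
    return dL_dy_p + lam * y_p / (1.0 + y_p ** 2)

def backprop_to_prev_outputs(dLt_dy_l, x_l, W):
    """Equation (5): gradient w.r.t. the previous layer's outputs.

    W has shape (N_{l-1}, N_l) with W[i, j] = w_ij^{l-1}.
    """
    return W @ (dLt_dy_l * sigmoid_deriv(x_l))

def weight_gradient(dLt_dy_l, x_l, y_prev):
    """Equation (6): gradient w.r.t. the weights w_ij^{l-1}."""
    return np.outer(y_prev, dLt_dy_l * sigmoid_deriv(x_l))
```

Above layer p the error signal is the one produced by conventional back-propagation of the cross-entropy cost, as (3) states.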
2.2.3. Update equations

The weights of the SMLP are updated using stochastic gradient descent. The gradient (6) of the cost function with respect to a particular weight is accumulated over several input patterns, and the weight is then updated using

$$w_{ij}^{l-1} \leftarrow w_{ij}^{l-1} - \eta \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}}, \quad (7)$$

where $\eta$ is a small positive learning rate and $\frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}}$ is the accumulated value of the gradient.
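A minimal sketch of this accumulate-then-update step (our illustration; `forward` and `backward` are assumed helpers that return per-layer activations and the per-layer weight gradients of $\tilde{L}$ obtained from (3)-(6)):

```python
import numpy as np

def train_step(W_list, mini_batch, eta, lam, forward, backward):
    """One SMLP weight update implementing equation (7).

    W_list     : list of weight matrices, one per layer
    mini_batch : iterable of (input_vector, hard_target) pairs
    eta        : small positive learning rate
    """
    grad_accum = [np.zeros_like(W) for W in W_list]
    for x, d in mini_batch:                          # accumulate over several patterns
        activations = forward(W_list, x)
        grads = backward(W_list, activations, d, lam)   # equations (3)-(6)
        grad_accum = [g + dg for g, dg in zip(grad_accum, grads)]
    # Equation (7): take one gradient step per accumulated mini-batch
    return [W - eta * g for W, g in zip(W_list, grad_accum)]
```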
2.3. SMLP as a posterior probability estimator

The numbers of input and output nodes of the SMLP are set equal to the dimensionality of its input acoustic feature vector and the number of output phoneme classes, respectively. A softmax nonlinearity is used at its output layer. The weights of the SMLP are adjusted to minimize (2) when hard targets are used. Note from equations (3), (4), (5) and (6) that the sparse regularization term affects the update of only those weights $w_{ij}^l$, $\forall l \in \{1, 2, \ldots, p-1\}$. This implies that the weights $w_{ij}^l$, $\forall l \in \{p, p+1, \ldots, m-1\}$, can be adjusted to minimize the cross-entropy term of (2) without affecting the sparse regularization term. If $p < m-1$ and one of the hidden layers between the $p^{th}$ and $m^{th}$ layers is sigmoidal (nonlinear), then the $p^{th}$ layer outputs can be nonlinearly transformed to the SMLP outputs. Therefore, in such a case, the SMLP estimates the posterior probabilities of the output classes conditioned on the $p^{th}$ layer outputs (the sparse representation). This follows from the facts that an MLP with a single nonlinear hidden layer estimates posterior probabilities of output classes conditioned on the input features when it is trained to minimize the cross-entropy cost (1) between the hard targets and its outputs [11], and that the SMLP outputs are conditionally independent of the inputs given the outputs of the $p^{th}$ layer.
Fig. 2. PER of the cross-validation data as a function of λ for PLP features.
3. EXPERIMENTAL RESULTS

3.1. Database

Phoneme recognition experiments are conducted on the TIMIT database. It has a total of 630 speakers with 10 utterances per speaker, each sampled at 16 kHz. The two SA dialect sentences per speaker are excluded from the setup as they are identical across all the speakers. The training, cross-validation and test sets consist of 3400, 296 and 1344 utterances from 425, 37 and 168 speakers, respectively.

3.2. Features

PLP cepstral coefficients [12] are extracted from the speech signal using an analysis window of length 25 ms and a frame shift of 10 ms. A nine-frame context of these vectors, along with the corresponding delta and delta-delta features, is used as the input acoustic feature. These features are normalized for speaker-specific mean and variance.

3.3. System description

The phoneme recognition system in our experiments is based on a hybrid HMM/MLP approach, where the posterior probability estimates of the various phonemes are converted to scaled likelihoods to model the HMM states [8]. In all our experiments, posterior probabilities are estimated in a hierarchical manner [9]. The 61 hand-labeled phone symbols are mapped to 49 phoneme classes for the purpose of training the classifiers. The mapping is obtained by treating each of the following sets of phonemes as a single class: {/tcl/, /pcl/, /kcl/}, {/gcl/, /dcl/, /bcl/}, {/h#/, /pau/}, {/eng/, /ng/}, {/axr/, /er/}, {/axh/, /ah/}, {/ux/, /uw/}, {/nx/, /n/}, {/hv/, /hh/}, and {/em/, /m/}.

As shown in Fig. 1, a four layer (m = 4) SMLP is used for estimating the 3-state phoneme posterior probabilities. It consists of an input layer that receives a given feature stream, two hidden layers with a sigmoid nonlinearity, and an output layer with a softmax nonlinearity. The numbers of nodes in the input and output layers are set equal to the dimensionality of the input feature vector and the number of phoneme states (i.e., 49 x 3 = 147), respectively. The outputs of the first hidden layer (p = 2) are forced to be sparse, with the number of nodes in it being the same as that of the input layer. The number of nodes in the second hidden layer is chosen to be 1000. The SMLP 3-state posterior probabilities are mapped to single-state phoneme posterior probability estimates by training an MLP which operates on a context of 230 ms, or 23 posterior probability vectors. Its hidden layer consists of 3500 nodes with a sigmoid nonlinearity, and its output layer consists of 49 nodes with a softmax nonlinearity. The value of λ in the SMLP cost function (2) is chosen to minimize the PER of the cross-validation data. Fig. 2 shows the effect of λ on the PER of the cross-validation data. As a baseline, λ is set to zero when training the SMLP; in other words, the SMLP is replaced with a four layer MLP.⁴

The hierarchically estimated single-state phoneme posterior probabilities are converted to scaled likelihoods by dividing them by the corresponding prior probabilities of the phonemes obtained from the training data, as sketched below. An HMM with 3 states, which has equal self- and transition probabilities associated with each state, is used for modeling each phoneme. The emission likelihood of each state is set to the scaled likelihood. A bigram language model is used in all the experiments. Finally, the Viterbi algorithm is applied for decoding the phoneme sequence. For the purpose of scoring, the 49 phoneme labels are mapped to 39 phoneme labels as described in [13]. While evaluating the PER on the test set, the language model scaling factor is chosen to minimize the PER of the cross-validation data.

⁴ We observed that a four layer MLP based system performs slightly better than a conventional three layer MLP based system having the same number of parameters. Therefore, we use the four layer MLP based system as the baseline.
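The posterior-to-scaled-likelihood conversion mentioned above can be sketched as follows (our simplified illustration of the standard hybrid HMM/MLP recipe [8]; the function and variable names are assumptions).

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-12):
    """Convert frame-level phoneme posteriors into scaled log-likelihoods.

    posteriors : (num_frames, num_classes) hierarchical MLP outputs,
                 each row summing to one
    priors     : (num_classes,) phoneme priors estimated from the training data
    Returns log(p(class | x_t) / p(class)), used as the HMM state emission
    scores before bigram-LM Viterbi decoding.
    """
    return np.log(posteriors + eps) - np.log(priors[np.newaxis, :] + eps)
```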
Table 1. PER (in %) on the TIMIT test set for various systems using PLP features. The classifier used for estimating the 3-state posterior probabilities is listed for each system. The last column indicates the PER of the combination of the SMLP and baseline systems.

  MLP (baseline) (4-layers)    SMLP (4-layers)    SMLP + baseline
  22.6                         21.9               21.2

Table 2. Average measure of sparsity (κ) of the first hidden layer outputs of a four layer MLP and SMLP trained on PLP features.

  MLP (4-layers)    SMLP (4-layers)
  0.275             0.496
3.4. Results

Table 1 shows the PER of the proposed SMLP based hierarchical hybrid system and of the baseline MLP based hierarchical hybrid system for PLP features on the TIMIT test set. It can be observed that the SMLP system achieves a better PER than the baseline MLP system, a 3.1% relative improvement. This improved performance can be attributed to the sparse regularization term. The Dempster-Shafer (DS) combination [14] of the SMLP system and the baseline MLP system at the posterior level yields a PER of 21.2%, a relative improvement of 6.2% over the baseline system. This result indicates that the proposed SMLP system combines well with the conventional MLP system.

In the next set of experiments, we quantify the sparsity of the $p^{th}$ hidden layer outputs using the following measure κ [15]:

$$\kappa = \frac{1}{\sqrt{N_2} - 1} \left( \sqrt{N_2} - \frac{\sum_{i=1}^{N_2} |y_i^p|}{\sqrt{\sum_{i=1}^{N_2} (y_i^p)^2}} \right), \quad (8)$$

where $y_i^p$ represents the output of node $i$ in layer $p$ of the SMLP. Furthermore, $0 \leq \kappa \leq 1$, and the value of κ is one (or large) for maximally sparse representations and close to zero (or small) for minimally sparse representations.
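The measure (8) can be computed per frame as in the sketch below (ours; `hoyer_sparsity` is a hypothetical helper name), and averaging it over all cross-validation frames gives the values reported in Table 2.

```python
import numpy as np

def hoyer_sparsity(y_p):
    """Sparsity measure kappa of equation (8) for one layer-p output vector."""
    n = y_p.size                                  # N_2: number of nodes in layer p
    l1 = np.sum(np.abs(y_p))                      # L1 norm of the hidden outputs
    l2 = np.sqrt(np.sum(y_p ** 2)) + 1e-12        # L2 norm (guarded against zero)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
```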
Table 2 lists the average κ value of the first hidden layer outputs over the cross-validation data for the two phoneme recognition systems. As expected, the SMLP based system has a higher κ value than the MLP based system.

4. CONCLUSIONS

In this paper, we introduced the theory of the SMLP, in which the outputs of one of the hidden layers are forced to be sparse while learning the mapping from the inputs to the targets. We proposed a new cost function and derived the update equations for training the SMLP. Further, we experimentally showed that the SMLP based system outperforms a state-of-the-art MLP based phoneme recognition system on TIMIT, and that their combination results in a further improvement.

5. ACKNOWLEDGEMENTS

The authors would like to thank Samuel Thomas and Balakrishnan Varadarajan for sharing some of the scripts used in the baseline phoneme recognition system.
6. REFERENCES

[1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2008.
[2] K. Huang and S. Aviyente, "Sparse representation for signal classification," Advances in Neural Information Processing Systems 19, pp. 609–616, 2006.
[3] M. Ranzato, Y. Boureau, and Y. LeCun, "Sparse Feature Learning for Deep Belief Networks," Advances in Neural Information Processing Systems 20, pp. 1185–1192, 2007.
[4] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Supervised dictionary learning," Advances in Neural Information Processing Systems 21, pp. 1033–1040, 2008.
[5] T.N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, "Bayesian compressive sensing for phonetic classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4370–4373, 2010.
[6] G.S.V.S. Sivaram, S.K. Nemala, M. Elhilali, T. Tran, and H. Hermansky, "Sparse Coding for Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4346–4349, 2010.
[7] G.S.V.S. Sivaram, G. Sriram, and H. Hermansky, "Sparse Autoassociative Neural Networks: Theory and Application to Speech Recognition," Proc. of INTERSPEECH, 2010.
[8] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Springer, 1994.
[9] J. Pinto, G.S.V.S. Sivaram, M. Magimai-Doss, H. Hermansky, and H. Bourlard, "Analyzing MLP Based Hierarchical Phoneme Posterior Probability Estimator," IEEE Transactions on Audio, Speech, and Language Processing, DOI:10.1109/TASL.2010.2045943, to be published.
[10] B.A. Olshausen and D.J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[11] M.D. Richard and R.P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.
[12] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[13] K.F. Lee and H.W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[14] F. Valente, "A Novel Criterion for Classifiers Combination in Multistream Speech Recognition," IEEE Signal Processing Letters, vol. 16, no. 7, pp. 561–564, July 2009.
[15] P.O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.