Recognition of Greek Phonemes using Support Vector Machines

Iosif Mporas, Todor Ganchev, Panagiotis Zervas, Nikos Fakotakis
Wire Communications Laboratory, Dept. of Electrical and Computer Engineering,
University of Patras, 261 10 Rion, Patras, Greece
Tel: +30 2610 997336, Fax: +30 2610 997336
{imporas, tganchev, pzervas, fakotaki}@wcl.ee.upatras.gr
Abstract. In the present work we study the applicability of Support Vector Machines (SVMs) to the phoneme recognition task. Specifically, the Least Squares version of the algorithm (LS-SVM) is employed for recognition of the Greek phonemes in the framework of a telephone-driven voice-enabled information service. The N-best candidate phonemes are identified and subsequently fed to the speech and language recognition components. In a comparative evaluation of various classification methods, the SVM-based phoneme recognizer demonstrated superior performance. A recognition rate of 74.2% was achieved from the N-best list, for N=5, prior to applying the language model.
1 Introduction

The increased market interest in multilingual speech-enabled systems, such as telephone-driven information access systems, has raised the necessity of developing computationally efficient and noise-robust speech and language recognition methods. In speech and language recognition, the phonotactic approach became very popular, since it offers a good trade-off between recognition accuracy and the amount of data required for training. In brief, in the phonotactic approach the speech signal is decoded into a phoneme sequence, which is further processed by a statistical language model for the language of interest. This technique, proposed by Zissman [1], is known as phoneme recognition followed by language model (PRLM). Due to the success of the phonotactic approach, phoneme recognition became a cornerstone of every speech and language recognition component.

At present, various approaches to phoneme recognition have been proposed. In [2], a combination of context-dependent and context-independent ANNs led to a phoneme recognition accuracy of about 46%. Phoneme recognition using independent component analysis (ICA)-based feature extraction [3] yielded an accuracy of 51%. A continuous-mixture HMM-based phoneme recognizer with a conventional three-state left-to-right architecture [4] achieved a recognition performance of 54%. A language-dependent approach to phoneme recognition demonstrated accuracy in the range of 45% to 55% [5]. A speaker-independent approach, using multiple codebooks of various LPC parameters and discrete HMMs, achieved 65% accuracy on a context-independent test
corpus [6]. The GlobalPhone project provides phoneme recognizers for multiple languages with accuracy varying in the range of 55% to 65% [8]. A mixture of language-dependent phonemes and language-independent speech units achieved a phoneme recognition accuracy of 38% [9]. The language-dependent phone recognition approach, modeling a three-state context-independent HMM, achieved accuracy between 33% and 52% [10]. A broad phoneme recognition approximation, trained with context-independent HMMs, yielded accuracy of 50% to 60% [11]. The CMU Sphinx 3 system, which employs three-emitting-state Gaussian mixture HMMs [7], yielded a phoneme recognition accuracy of 69%. Finally, an approach [12] similar to ours, which uses SVMs for framewise classification on TIMIT [13], reports 70.6% of correctly classified frames.

In the present study we employ an SVM-based classifier on the phoneme recognition task. Because of their strong discrimination capabilities, Support Vector Machines (SVMs) became a popular classification tool, successfully employed in various real-world applications. Section 3 offers further details about the SVM algorithm, and Section 4 provides a comparison with other classification methods.
2 Phoneme recognition

In a phoneme recognizer built on the PRLM architecture, the sequence of phonemes decoded by the recognizer is matched against a set of phoneme-bigram language models, one for each language of interest. Thus, training of both the acoustic models of the phonemes and the language model for each language is required. Since the present work aims at selecting the optimal classification approach for phoneme recognition, details about the training of the language model are not discussed.

The typical structure of a generic phoneme recognizer is presented in Fig. 1. In brief, the speech signal is first sampled and subsequently pre-processed. The pre-processing consists of: (1) band-pass filtering, (2) pre-emphasis, and (3) segmentation. The band-pass filtering aims at suppressing the frequency bands with little contribution to the speech contents, eliminating the drift of the signal, reducing the effects caused by level saturation, and smoothing the clicks; in telephone-quality speech, these degradations are frequently observed. The pre-emphasis reduces the spectral tilt at the higher frequencies, enhancing the estimation of the higher formants. Subsequently, segmentation of the speech signal is performed to extract the phoneme borders.

As illustrated in Fig. 1, during the second step speech parameterization is performed for each speech segment. Specifically, 13 Mel-frequency cepstral coefficients (MFCC) and the first four formants are estimated. During the post-processing step, the ratios of the second, third, and fourth formants to the first formant, F2/F1, F3/F1, and F4/F1 respectively, are computed. Next, all speech parameters are grouped together to form the feature vector, and mean normalization is performed for each parameter. The normalized {MFCC0, …, MFCC12, F2/F1, F3/F1, F4/F1} vectors are fed to the phoneme classifier. The phoneme classifier estimates the degree of proximity between the input and a set of predefined acoustic models of the phonemes.
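The grouping and mean-normalization steps above can be sketched as follows. This is a minimal illustration that assumes the 13 MFCCs and the formants F1-F4 have already been estimated per segment by an external front-end; `build_feature_vectors` is a hypothetical helper name, not part of the described system.

```python
# Sketch: assembling the 16-dimensional feature vector
# {MFCC0, ..., MFCC12, F2/F1, F3/F1, F4/F1} described above.

def build_feature_vectors(segments_mfcc, segments_formants):
    """segments_mfcc: one 13-dim MFCC list per phoneme segment.
    segments_formants: one (F1, F2, F3, F4) tuple per segment."""
    raw = []
    for mfcc, (f1, f2, f3, f4) in zip(segments_mfcc, segments_formants):
        # formant ratios relative to F1, as in the post-processing step
        raw.append(list(mfcc) + [f2 / f1, f3 / f1, f4 / f1])
    # mean normalization: subtract the per-parameter mean
    n, dim = len(raw), len(raw[0])
    means = [sum(v[d] for v in raw) / n for d in range(dim)]
    return [[v[d] - means[d] for d in range(dim)] for v in raw]
```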
Fig. 1. Structure of a generic phoneme recognizer

The output of the classifier consists of the N-best list of candidates, which is further utilized by a rule-driven language modeling stage. The output of the language model (specific to each language) is the likelihood of the phoneme sequence in that language. Specifically, the N-best list of phonemes is processed by n-gram language models to estimate the likelihood of every hypothesized phoneme sequence. The maximum likelihood selection criterion is utilized to select the most probable phoneme sequence. For a given phoneme sequence Ai = a1, a2, a3, …, aT the likelihood is:
L(A_i \mid AM_i) = \frac{1}{T} \left[ \log P(a_1 \mid AM_i) + \sum_{t=2}^{T} \log P(a_t \mid a_{t-1}, \ldots, a_1, AM_i) \right]   (1)
where AMi is the corresponding acoustic model. The sequence with maximum likelihood is detected as:
\hat{A} = \arg\max_i L(A_i \mid AM_i)   (2)
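Under a bigram approximation of the conditional probabilities, Eqs. (1) and (2) reduce to the following sketch; the dictionaries `unigram` and `bigram` are hypothetical stand-ins for a trained language model, not actual model tables from this work.

```python
import math

def sequence_log_likelihood(seq, unigram, bigram):
    """Average log-likelihood of a phoneme sequence as in Eq. (1),
    with P(a_t | a_{t-1}, ..., a_1) approximated by P(a_t | a_{t-1})."""
    ll = math.log(unigram[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(bigram[(prev, cur)])
    return ll / len(seq)

def best_sequence(hypotheses, unigram, bigram):
    """Eq. (2): pick the hypothesized sequence with maximum likelihood."""
    return max(hypotheses,
               key=lambda s: sequence_log_likelihood(s, unigram, bigram))
```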
In the present work we focus on the classification stage. Specifically, when seeking a classifier, our choice fell on the SVM method due to its strong discrimination capabilities and the compact models it employs. In Section 3 we confine ourselves to a description of the SVM approach and its least squares version (LS-SVM). Next, in Section 4, the performance of the LS-SVM classifier on the phoneme recognition task is contrasted with that of other classification algorithms.
3 Fundamentals of Support Vector Machines

SVMs were initially introduced as a two-class classifier for solving pattern recognition problems (Vapnik, 1995; 1998). Since many real-world problems, such as phoneme recognition, involve multiple classes, techniques to extend SVMs to multiple classes have been proposed. In the present work we utilize the voting scheme [14] for SVM-based multi-class classification. In this method a binary classifier is created for each pair of classes; for K classes the resulting number of binary classifiers is K(K-1)/2. The data are mapped into a higher-dimensional space and an optimal separating hyperplane is constructed in that space. Specifically, in the present study we employ the least squares version of SVMs (LS-SVM) [15] for identification of the Greek phonemes. For completeness of exposition, in Section 3.1 we first introduce the original SVM theory. Subsequently, in Section 3.2 the specifics of the LS-SVM version are discussed.
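The pairwise voting scheme can be sketched as follows. This is a minimal illustration: the per-pair decision functions below are hypothetical stand-ins for trained two-class SVMs, each returning a positive value to vote for the first class of the pair.

```python
from itertools import combinations
from collections import Counter

def ovo_predict(x, classes, binary_classifiers):
    """One-vs-one voting over K(K-1)/2 binary classifiers.
    binary_classifiers maps a class pair (i, j) to a decision
    function returning > 0 (vote for i) or <= 0 (vote for j)."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[i if binary_classifiers[(i, j)](x) > 0 else j] += 1
    # the class collecting the most votes wins
    return votes.most_common(1)[0][0]
```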
3.1 SVMs for classification

For a training set of N data points {y_k, x_k}_{k=1}^{N}, where x_k is the k-th input pattern and y_k is the corresponding class label, the SVM algorithm constructs a classifier of the form:
y(x) = \operatorname{sign}\left[ \sum_{k=1}^{N} a_k y_k \psi(x, x_k) + b \right]   (3)
where a_k are positive real coefficients and b is a bias term. Assuming that

\begin{cases} w^T \phi(x_k) + b \geq +1, & y_k = +1 \\ w^T \phi(x_k) + b \leq -1, & y_k = -1 \end{cases}   (4)

which is equivalent to

y_k \left[ w^T \phi(x_k) + b \right] \geq 1, \quad k = 1, \ldots, N,   (5)

where \phi(\cdot) is a nonlinear function which maps the input space into a higher-dimensional space. To allow violations of (5), in case a separating hyperplane in this higher-dimensional space does not exist, slack variables \xi_k are introduced such that

\begin{cases} y_k \left[ w^T \phi(x_k) + b \right] \geq 1 - \xi_k, & k = 1, \ldots, N \\ \xi_k \geq 0, & k = 1, \ldots, N \end{cases}   (6)
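For a fixed candidate hyperplane (w, b), the smallest slacks satisfying (6) have the closed form ξ_k = max(0, 1 − y_k[w^T φ(x_k) + b]). A linear-kernel sketch (taking φ as the identity map, purely for illustration):

```python
def slacks(w, b, X, y):
    """Minimal slack xi_k = max(0, 1 - y_k * (w . x_k + b)) needed
    for each training point to satisfy the constraints (6)."""
    return [max(0.0, 1.0 - yk * (sum(wi * xi for wi, xi in zip(w, xk)) + b))
            for xk, yk in zip(X, y)]
```

A point on the correct side of its margin gets zero slack; a point inside the margin or misclassified gets a positive slack, which the objective (7) then penalizes.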
According to the structural risk minimization principle, the risk bound is minimized by formulating the optimization problem:
\min_{w, \xi_k} J_1(w, \xi_k) = \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k   (7)
subject to (6), so a Lagrangian is constructed
L_1(w, b, \xi_k; a_k, v_k) = J_1(w, \xi_k) - \sum_{k=1}^{N} a_k \left\{ y_k \left[ w^T \phi(x_k) + b \right] - 1 + \xi_k \right\} - \sum_{k=1}^{N} v_k \xi_k   (8)
by introducing Lagrange multipliers a_k ≥ 0, v_k ≥ 0 (k = 1, …, N). The solution is given by the saddle point of the Lagrangian, computed as
\max_{a_k, v_k} \min_{w, b, \xi_k} L_1(w, b, \xi_k; a_k, v_k)   (9)
Fig. 2. Linear separating hyperplanes. The support vectors are circled
This leads to

\begin{cases} \dfrac{\partial L_1}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^{N} a_k y_k \phi(x_k) \\ \dfrac{\partial L_1}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} a_k y_k = 0 \\ \dfrac{\partial L_1}{\partial \xi_k} = 0 \;\rightarrow\; 0 \leq a_k \leq c, \quad k = 1, \ldots, N \end{cases}   (10)
which gives the solution of the quadratic programming problem
\max_{a_k} Q_1(a_k; \phi(x_k)) = -\frac{1}{2} \sum_{k,l=1}^{N} y_k y_l \phi(x_k)^T \phi(x_l) a_k a_l + \sum_{k=1}^{N} a_k   (11)

such that

\begin{cases} \sum_{k=1}^{N} a_k y_k = 0 \\ 0 \leq a_k \leq c, \quad k = 1, \ldots, N \end{cases}   (12)
The function φ(xk) in (11) is related to ψ(x,xk) by imposing
\phi(x)^T \phi(x_k) = \psi(x, x_k)   (13)
which is motivated by Mercer’s Theorem. The classifier (3) is designed by solving
\max_{a_k} Q_1(a_k; \psi(x_k, x_l)) = -\frac{1}{2} \sum_{k,l=1}^{N} y_k y_l \psi(x_k, x_l) a_k a_l + \sum_{k=1}^{N} a_k   (14)

subject to the constraints in (12).
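The kernel substitution (13) means that only the values ψ(x_k, x_l) are ever needed, never φ itself; by Mercer's theorem an admissible kernel yields a positive semidefinite Gram matrix. A small sketch with the Gaussian RBF kernel (an illustrative choice, not necessarily the kernel used in the experiments):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel psi(x, y), a common Mercer kernel."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))

def gram_matrix(points, kernel=rbf_kernel):
    """Kernel (Gram) matrix K with K[k, l] = psi(x_k, x_l)."""
    n = len(points)
    return np.array([[kernel(points[k], points[l]) for l in range(n)]
                     for k in range(n)])
```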
Fig. 3. Mapping (φ) the input space into a higher dimensional space
Neither w nor \phi(x_k) has to be calculated in order to determine the decision surface. The matrix associated with this quadratic programming problem is not indefinite, so the solution to (14) will be global. Hyperplanes (5) satisfying the constraint \|w\|_2 \leq a have a Vapnik-Chervonenkis (VC) dimension h which is bounded by
h \leq \min([r^2 a^2], n) + 1   (15)
where [\cdot] denotes the integer part and r is the radius of the smallest ball containing the points \phi(x_1), \ldots, \phi(x_N). This ball is computed by defining the Lagrangian

L_2(r, q, \lambda_k) = r^2 - \sum_{k=1}^{N} \lambda_k \left( r^2 - \| \phi(x_k) - q \|_2^2 \right)   (16)
where q is the center of the ball and \lambda_k are positive Lagrange multipliers. In a similar way as for (7), the center is given by q = \sum_k \lambda_k \phi(x_k), where the Lagrange multipliers follow from

\max_{\lambda_k} Q_2(\lambda_k; \phi(x_k)) = -\sum_{k,l=1}^{N} \lambda_k \lambda_l \phi(x_k)^T \phi(x_l) + \sum_{k=1}^{N} \lambda_k \phi(x_k)^T \phi(x_k)   (17)

such that

\begin{cases} \sum_{k=1}^{N} \lambda_k = 1 \\ \lambda_k \geq 0, \quad k = 1, \ldots, N \end{cases}   (18)
Based on (13), Q_2 can also be expressed in terms of \psi(x_k, x_l). Finally, a support vector machine with minimal VC dimension is selected by solving (14) and computing (15) from (17).

3.2 Least Squares Support Vector Machines

Formulating the classification problem as
\min_{w, b, e} J_2(w, b, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^2   (19)

subject to the equality constraints

y_k \left[ w^T \phi(x_k) + b \right] = 1 - e_k, \quad k = 1, \ldots, N   (20)
a least squares version of the SVM classifier is introduced. The Lagrangian is defined as

L_3(w, b, e; a) = J_2(w, b, e) - \sum_{k=1}^{N} a_k \left\{ y_k \left[ w^T \phi(x_k) + b \right] - 1 + e_k \right\}   (21)
where a_k are the Lagrange multipliers, which can now be either positive or negative (Kuhn-Tucker conditions for equality constraints). The conditions for optimality
\begin{cases} \dfrac{\partial L_3}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^{N} a_k y_k \phi(x_k) \\ \dfrac{\partial L_3}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^{N} a_k y_k = 0 \\ \dfrac{\partial L_3}{\partial e_k} = 0 \;\rightarrow\; a_k = \gamma e_k, \quad k = 1, \ldots, N \\ \dfrac{\partial L_3}{\partial a_k} = 0 \;\rightarrow\; y_k \left[ w^T \phi(x_k) + b \right] - 1 + e_k = 0, \quad k = 1, \ldots, N \end{cases}   (22)

and can be written immediately as the solution to the following set of linear equations
\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix} \begin{bmatrix} w \\ b \\ e \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}   (23)

where Z = [\phi(x_1)^T y_1; \ldots; \phi(x_N)^T y_N], Y = [y_1; \ldots; y_N], \vec{1} = [1; \ldots; 1], e = [e_1; \ldots; e_N], a = [a_1; \ldots; a_N]. The solution can also be given by

\begin{bmatrix} 0 & -Y^T \\ Y & Z Z^T + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}   (24)
Mercer’s condition can be applied to the matrix Ω=ΖΖΤ where
\Omega_{kl} = y_k y_l \phi(x_k)^T \phi(x_l) = y_k y_l \psi(x_k, x_l)   (25)
The classifier (3) is thus found by solving the linear set of equations (24)-(25) instead of a quadratic programming problem. The support values a_k are proportional to the errors at the data points.
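As a minimal numerical sketch of this procedure (not the WEKA-based setup used in the experiments below), the system (24) can be solved with a standard linear solver and the resulting a_k, b plugged into the decision rule (3). The RBF kernel and the parameter values are illustrative assumptions:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    d = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma=10.0, kernel=rbf):
    """Solve the linear system (24): LS-SVM replaces the SVM QP with
    one (N+1) x (N+1) linear solve for the bias b and supports a."""
    N = len(y)
    Omega = np.array([[y[k] * y[l] * kernel(X[k], X[l])
                       for l in range(N)] for k in range(N)])  # Eq. (25)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = -np.asarray(y, float)        # -Y^T
    A[1:, 0] = y                            # Y
    A[1:, 1:] = Omega + np.eye(N) / gamma   # Omega + gamma^{-1} I
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                  # a, b

def lssvm_predict(x, X, y, a, b, kernel=rbf):
    """Decision rule (3): sign(sum_k a_k y_k psi(x, x_k) + b)."""
    s = sum(a[k] * y[k] * kernel(x, X[k]) for k in range(len(y))) + b
    return 1 if s >= 0 else -1
```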
4 Experiments and Results

Evaluation of the recognizer was carried out on the SpeechDat(II) FDB-5000-Greek corpus. It contains recordings from 5,000 native Greek speakers (2,405 males, 2,595 females) recorded over the fixed telephone network of Greece. Speech samples are stored as sequences of 8-bit, 8 kHz, A-law format. A comprehensive description of the SpeechDat(II) FDB-5000-Greek corpus is available in [16], [17].

On a common experimental setup, we have evaluated the following six classification techniques on the phoneme recognition task:
• NB: Naïve Bayes [18], using kernel density estimation,
• MLP: Multi-Layer Perceptron, using back-propagation for training,
• M5P: regression decision tree based on the M5 algorithm,
• J.48: learning algorithm [19], generating a pruned or unpruned C4.5 decision tree,
• SVM: Support Vector Machines,
• LS-SVM: Least Squares Support Vector Machines.

These algorithms were used alone or combined with several meta-classification algorithms, such as Bagging and AdaBoost [20]. Bagging takes random samples, with replacement, and builds one classifier on each sample; the ensemble decision is made by majority vote. AdaBoost designs the ensemble members one at a time, based on the performance of the previous member, in order to give objects that are difficult to classify more chances to be picked in the subsequent training sets. The WEKA implementation (WEKA machine learning library [21]) of these classification techniques, with the default parameter setup, was used.

The performance of the evaluated classifiers is presented in Table 1. Two feature vectors were compared: (1) MFCC only, presented in the second column, and (2) MFCC plus normalized formants, in the third column of the table.

Table 1. Phoneme recognition accuracy in percentage for various classifiers and two feature vectors

CLASSIFIER                               Accuracy [%]          Accuracy [%]
                                         {MFCC0, …, MFCC12}    {MFCC0-12, F2/F1, F3/F1, F4/F1}
NB                                       26.17                 28.91
MLP                                      28.83                 30.01
J.48                                     23.11                 24.73
Meta-classification via Regression M5P   26.37                 28.79
Meta Bagging J.48                        28.15                 29.82
Meta AdaBoost M1 J.48                    28.97                 30.09
SVM                                      34.88                 37.98
LS-SVM                                   34.84                 38.06
As the experimental results suggest, the SVM classifiers demonstrated the highest accuracy compared to the other classifiers. The observed superiority of the SVM algorithm was expected, since NB and decision trees require discretized attributes. Besides this, SVMs perform well in higher-dimensional spaces, since they do not suffer from the curse of dimensionality. Moreover, SVMs have the advantage over other approaches, such as neural networks, that their training always reaches a global minimum [22].

In addition to the results presented in Table 1, Fig. 4 illustrates the N-best list accuracy for the most successful classifier, namely the Least Squares version of SVM. The results for the N-best list for N=1 to N=5, shown in Fig. 4, illustrate the potential improvement of the accuracy that could be gained after including the language model. As is well known, some phonemes which share a similar manner of articulation, and therefore possess kindred acoustic features, are often misclassified. Examples of phonemes with a similar manner of articulation are the glides /i/ and /j/, the fricatives /f/, /x/ and /T/, the nasals /m/ and /n/, and the palatals /k/ and /x/. This results in pairs as well as groups of phonemes that can be confused during the recognition process. The correlation among the phonemes can be seen from the confusion matrix presented in Table 2.
[Fig. 4: bar chart of N-best list accuracy (%) for N = 1 to 5. FS1: 38.06, 53.41, 62.13, 68.58, 74.19. FS2: 34.84, 50.66, 61.29, 66.10, 73.01.]

Fig. 4. N-best list accuracy (%) for LS-SVM and two different Feature Sets (FS). FS1 consists of {MFCC0, …, MFCC12, F2/F1, F3/F1, F4/F1}. FS2 consists of {13 MFCC}.
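The N-best accuracies plotted in Fig. 4 count a test segment as correct whenever the true phoneme appears among the top N candidates returned by the classifier. A minimal sketch of this metric (the function name is illustrative):

```python
def nbest_accuracy(ranked_predictions, truths, n):
    """Percentage of segments whose true phoneme appears among the
    top-n candidates of the classifier's ranked output list."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, truths)
               if truth in ranked[:n])
    return 100.0 * hits / len(truths)
```

By construction the curve is non-decreasing in N, which is why the N=5 figure of 74.19% bounds what a subsequent language model could recover.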
We deem that the lower accuracy observed in our experiments, compared to [12], is mainly due to the nature of the recordings in the SpeechDat(II) database. While in [12] the authors experimented with the TIMIT recordings (clean speech, sampled at 16 kHz and recorded with a high-quality microphone), the SpeechDat(II) recordings were collected over the fixed telephone network of Greece. Thus, the speech in the SpeechDat(II) database is sampled at 8 kHz, band-limited to telephone quality, and distorted by the nonlinear transfer function of the handset and the transmission channel. Moreover, there is interference from the real-world environment. Finally, the TIMIT recordings consist of read speech, while SpeechDat(II) contains both prompted and spontaneous speech, which results in larger variations in the length of the phonemes, as well as in stronger coarticulation effects.
Table 2. Confusion matrix of the Greek phonemes (phonemes are transcribed according to SAMPA). The accuracy is presented in percentage.
5 Conclusions

A phoneme recognizer based on Support Vector Machines, and specifically on their Least Squares version (LS-SVM), has been presented. It employs a feature vector based on MFCC and formants. A comparative evaluation of the SVM classifier against several other classification methods was performed on the task of Greek phoneme recognition. The SVM method demonstrated the highest accuracy. The N-best list of candidate phonemes produced by the classifier is further fed to language-specific models in order to increase the phoneme recognition performance and facilitate the speech and language recognition components. This phoneme recognizer is intended as a part of a telephone-driven voice-enabled information service.
References

1. Zissman M., "Comparison of four Approaches to Automatic Language Identification of Telephone Speech", IEEE Trans. Speech and Audio Processing, vol. 4, Jan. 1996, pp. 31-44.
2. Mak M., "Combining ANNs to improve phone recognition", IEEE ICASSP'97, Munich, Germany, 1997, vol. 4, pp. 3253-3256.
3. Kwon O-W., Lee T-W., "Phoneme recognition using ICA-based feature extraction and transformation", Signal Processing, June 2004.
4. Caseiro D., Trancoso I., "Identification of Spoken European Languages", EUSIPCO, IX European Signal Processing Conference, Greece, Sept. 1998.
5. Yan Y., Barnard E., "Experiments for an approach to Language Identification with conversational telephone speech", ICASSP, Atlanta, USA, May 1996, vol. 1, pp. 789-792.
6. Lee K-F., Hon H-W., "Speaker Independent Phone Recognition using HMM", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 11, Nov. 1989.
7. Schultz T., Waibel A., "Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition", Speech Communication, vol. 35, issue 1-2, pp. 31-51, Aug. 2001.
8. Dalsgaard P., Andersen O., Hesselager H., Petek B., "Language Identification using Language-dependent phonemes and Language-independent speech units", ICSLP, 1996, pp. 1808-1811.
9. Corredor-Ardoy C., Gauvain J., Adda-Decker M., Lamel L., "Language Identification with Language-independent acoustic models", Proc. of EUROSPEECH'97, September 1997.
10. Martin T., Wong E., Baker B., Mason M., "Pitch and Energy Trajectory Modeling in a Syllable Length Temporal Framework for Language Identification", ODYSSEY 2004, May 31 - June 3, 2004, Toledo, Spain.
11. Pusateri E., Thong J-M., "N-best List Generation using Word and Phoneme Recognition Fusion", 7th European Conference on Speech Communication and Technology (EuroSpeech), September 2001, Aalborg, Denmark.
12. Salomon J., King S., Osborne M., "Framewise phone classification using support vector machines", Proceedings of the International Conference on Spoken Language Processing, Denver, 2002.
13. Garofolo J., "Getting started with the DARPA-TIMIT CD-ROM: An acoustic phonetic continuous speech database", National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 1988.
14. Friedman J., "Another approach to polychotomous classification", Technical report, Stanford University, 1996.
15. Suykens J., Vandewalle J., "Least Squares Support Vector Machine Classifiers", Neural Processing Letters, vol. 9, no. 3, Jun. 1999, pp. 293-300.
16. Hodge H., "SpeechDat multilingual speech databases for teleservices: across the finish line", EUROSPEECH'99, Budapest, Hungary, Sept. 5-9, 1999, vol. 6, pp. 2699-2702.
17. Chatzi I., Fakotakis N., Kokkinakis G., "Greek speech database for creation of voice driven teleservices", EUROSPEECH'97, Rhodes, Greece, Sept. 22-25, 1997, vol. 4, pp. 1755-1758.
18. John G., Langley P., "Estimating Continuous Distributions in Bayesian Classifiers", 11th Conference on Uncertainty in Artificial Intelligence, pp. 338-345, Morgan Kaufmann, San Mateo, 1995.
19. Quinlan J. R., "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, 1993.
20. Quinlan J. R., "Bagging, Boosting, and C4.5", AAAI/IAAI, vol. 1, 1996.
21. Witten I., Frank E., "Data Mining: Practical machine learning tools with Java implementations", Morgan Kaufmann, 1999.
22. Burges C., "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, Kluwer Academic Publishers, 1998.