Speaker Identification Using Glottal-Source Waveforms and Support-Vector-Machine Modelling

David Vandyke, Michael Wagner, Roland Goecke, Girija Chetty
University of Canberra, ACT, Australia
{david.vandyke, michael.wagner, roland.goecke, girija.chetty}@canberra.edu.au
Abstract

Speaker identification experiments are performed with novel features representative of the glottal source waveform, derived from closed-phase analysis and inverse filtering. Source waveforms are segmented into two consecutive pitch periods and normalised in prosody, forming so-called source-frame feature vectors. Support-vector-machines are used to construct speaker-discriminative hyperplanes, and identification rates are reported. Groups of male speakers of size 5 to 20 from the YOHO corpus are examined, and 65% correct identification rates are achieved on a per-source-frame basis. Finally, the source-frames' phonetic independence is confirmed on the TI 46-Word corpus.

Index Terms: Glottal Waveform, Speaker Identification, Support Vector Machine, Linear Predictive Coding
1. Introduction

The prevailing paradigm for speaker recognition is based on short-term magnitude-spectrum features, with speaker characteristics represented by Gaussian mixture models (GMMs). Work over the last decade, however, has shown that the excitation signal in the source/filter model of speech production contains speaker-discriminating information [1, 2, 3]. Further, this information has been shown empirically to be complementary to common vocal-tract-related parameters such as mel-frequency cepstral coefficients (MFCCs) [1]. The excitation in the source/filter model physically represents the volume-velocity flow of air passing through the vocal tract, which is controlled and modulated by the vocal-fold vibration. Linear predictive coding (LPC) allows this excitation signal to be estimated from the recorded speech signal: a short-term estimate of the vocal tract as a linear filter is determined and then used to inverse-filter the speech signal. This vocal-tract estimate is most accurate during the closed phase of the vocal-fold vibration, when the vocal tract is closed off at the glottal end and can be modelled as an all-pole linear filter. Determining the instants of glottal closure and opening from the speech signal alone is a difficult task. Glottal-instant detection algorithms have been developed using a speech signal together with a simultaneous electroglottograph signal as a baseline [4, 5]. Determining the instant of vocal-fold closure has generally been found to be easier than determining the instant of opening, owing to the greater discontinuity in the motion of the vocal folds at closure compared to their more gradual opening.
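To make the inverse-filtering idea concrete, the following is a minimal sketch (not the implementation used in this paper) of computing the LPC residual of a single analysis frame, assuming NumPy and SciPy; the autocorrelation-method normal equations are solved directly via a Toeplitz solver, where a Levinson-Durbin recursion would be equivalent.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual(frame, order=12):
    """Estimate the LPC excitation (residual) of one speech frame.

    Autocorrelation-method LPC: solve the normal equations for the
    all-pole vocal-tract filter coefficients a_1..a_p, then inverse-filter
    the frame with A(z) = 1 - sum_k a_k z^{-k} to recover the prediction
    error, i.e. an estimate of the excitation signal.
    """
    # Autocorrelation of the frame up to the model order.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Symmetric Toeplitz system R a = r[1:] (normal equations).
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    # Inverse (FIR) filter: e[n] = s[n] - sum_k a_k s[n-k].
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
```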
The inverse filtering following this closed-phase LPC analysis finally allows the determination of the derivative of the volume-velocity flow source waveform, where the derivative accounts for the single-pole filter modelling the spectral fall-off due to radiation of the speech pressure wave at the lips. Models of these glottal waveforms have been proposed, their development historically aimed at high perceptual-quality speech synthesis and informed by physiological studies and high-speed observations of vocal-fold motion [6]. The majority of these consist of piecewise functions, generally modelling the opening, closing and return phases of the vocal folds during one pitch period, for example Rosenberg [7] and Fant et al. [8]. For speaker recognition, where the aim is to retain and model effectively all speaker-discriminating information, data-driven models have been considered and have proved important in other speaker-recognition studies involving the source waveform [3, 9]. This paper describes the discriminative modelling of source waveforms based on the deterministic component of the deterministic-plus-stochastic model (DSM) of the glottal waveform by Drugman et al., first proposed for speech synthesis [10]. A speaker verification experiment using these signals was reported in [11], where, using only an arithmetic distance measure in the Euclidean space of the source-frames, an equal-error-rate of 9.43% was achieved for a collection of 53 male speakers from the YOHO speaker verification corpus [12]. This paper continues the discriminative modelling of these glottal source features via support-vector-machines (SVMs). While most applications of SVMs to speaker recognition have operated on supervectors extracted from generative models, based on the positive results obtained using a distance measure to compare source-frames in the original feature space, we investigate the ability of SVMs to separate the source-frame feature vectors themselves. The remainder of the paper is structured as follows: Section 2 describes the feature extraction and modelling processes, and Section 3 reports the SVM speaker identification experiment results. Section 4 analyses the phonetic dependence of the glottal waveforms, and a summary, including future research directions, is given in Section 5.
2. Glottal source-frames

First, we describe how the pitch-synchronous glottal waveforms are extracted from the speech signal.
Table 1: Number of source-frames per speaker used for cross-validation experiments, for each cohort size.

Speaker group size:        5    10   15   20
Number of source-frames:   300  500  800  1000
Figure 1: Single source-frame vector extracted by inverse filtering. A single period of the source waveform is evident centrally.

2.1. Feature extraction

Autocorrelation LPC analysis is performed on 25 ms frames with a 5 ms shift, and inverse filtering with the overlap-add method determines the error signal over the whole utterance. Instants of glottal closure and opening during voiced speech are determined from this error signal and an averaged version of the speech waveform as per [5], which has been shown to work better than other glottal-instant detection algorithms such as DYPSA [4]. Closed-phase covariance LPC analysis is then performed to best determine the linear filter representing the vocal tract. The inverse of this all-pole filter is used to inverse-filter the speech signal, yielding the pitch-synchronous error signal representing the source waveform. A collection of double pitch periods, centred on glottal closure instants (GCIs), is gathered, and each double period is normalised in amplitude and length. We refer to these normalised, double-pitch-period, GCI-centred glottal source waveforms as source-frames. The amplitude scaling is done by normalising the standard deviation of the source-frame data, and the frame length is mapped to a constant number of samples by interpolation or decimation as necessary, along with the required anti-aliasing low-pass filtering. Finally, the source-frames are Hamming-windowed to emphasise the shape of the signal around the glottal closure instant. These features are based upon the deterministic component of the DSM model [10]. One such source-frame feature vector is shown in Figure 1. A sketch of this normalisation step is given at the end of this section.

2.2. Support-vector-machine modelling

We investigate the ability of SVMs to separate these source-frames based on speaker identity. The number of source-frames used in the 10-fold cross-validation identification experiments increases with the size of the cohort group of speakers, as shown in Table 1; for example, each speaker in the group of 5 speakers had 300 source-frames for SVM training. For each closed cohort of speakers, the training data is taken and a common basis is derived via principal component analysis (PCA) and used for dimension reduction. The percentage of the variation within the cohort's data retained by increasing numbers of principal components is shown in Figure 2. We see that more than 98% of the variation within the data is covered
by the first 50 principal components, independent of cohort size. This is confirmed by including results from cohorts even larger than those used for these identification experiments.

Figure 2: Percentage variation of the data covered by PCA.

These PCA projections are then used to train a higher-dimensional, speaker-discriminative hyperplane via SVM. The data was not scaled beyond the prosody normalisation step during feature extraction. Identification rates on a per-source-frame basis are given in Section 3.
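The following is a minimal sketch of the prosody normalisation described in Section 2.1, assuming the LPC residual and GCI locations are already available; SciPy's FFT-based resample stands in for the interpolation/decimation plus anti-alias filtering, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import resample, get_window

N_SAMPLES = 256  # fixed source-frame length used in the paper

def make_source_frame(residual, prev_gci, next_gci, n=N_SAMPLES):
    """Build one prosody-normalised source-frame from the LPC residual.

    Takes the two pitch periods bounded by the GCIs either side of a
    central glottal closure instant, normalises length by resampling
    (FFT resampling band-limits the signal, standing in for explicit
    anti-alias filtering), normalises amplitude to unit standard
    deviation, and applies a Hamming window to emphasise the shape of
    the signal around the central GCI.
    """
    segment = residual[prev_gci:next_gci]       # two pitch periods
    segment = resample(segment, n)              # length normalisation
    segment = segment / np.std(segment)         # amplitude normalisation
    return segment * get_window("hamming", n)   # window about the GCI
```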
3. Speaker identification results

We use the YOHO speaker identification database, consisting of multiple-session, real-world office-environment recordings of combination-lock phrases (e.g. "14-23-11") sampled at 8 kHz [12]. We report source-frame identification rates for groups of male speakers, from a group size of 5 up to 20. Source-frames are extracted for the first 20 male speakers of the YOHO corpus, and the two-pitch-period, variable-length waveforms are normalised in length to 256 samples. Support vector machines are then applied to a principal-component representation of these source-frames with the intention of training speaker-discriminative hyperplanes. Empirically, radial basis function (RBF) kernels were found to separate them best, performing slightly better than 3rd-degree polynomials. We use the common libSVM library of support vector machines for these experiments [13]. A disjoint set of male YOHO speakers was used with a coarse-to-fine grid search to determine the best parameters for the RBF kernel (gamma = 0.007, cost = 32). Figure 3 shows identification rates at the per-source-frame level; the highest identification rate for each speaker cohort size is given in Table 2.
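As an illustrative sketch of this modelling stage, the following uses scikit-learn (whose SVC classifier is itself backed by libSVM) with the grid-searched parameters quoted above; the function and variable names are ours, not from the paper, and the PCA/cross-validation wiring is a simplification of the per-cohort procedure of Section 2.2.

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_identifier(X, y, n_components=50):
    """X: source-frames as rows (n_frames x 256); y: speaker labels.

    Derives a common PCA basis over the cohort's data (Section 2.2),
    where ~98% of the variation is retained by the first 50 components,
    then trains an RBF-kernel SVM with the parameters from Section 3.
    """
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)
    svm = SVC(kernel="rbf", gamma=0.007, C=32.0)
    # 10-fold cross-validated per-source-frame identification rate.
    rate = cross_val_score(svm, Z, y, cv=10).mean()
    return pca, svm.fit(Z, y), rate
```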
Figure 3: Source-frame identification rates against PCA dimension size. (Group sizes 15 and 20 are nearly identical.)

Table 2: Best source-frame ID rates for each cohort size.

Cohort size:   5     10    15    20
ID rate (%):   70.8  64.3  65.0  64.7
These results are significantly better than the chance identification rate of 1/(number of speakers) for each cohort size. As we are initially investigating how far these source-frames can separate speakers, results are reported at the lowest level, on a per-source-frame rather than a per-utterance basis. In the same way that modest phoneme recognition rates translate into acceptable word recognition systems, these identification rates are expected to translate to strong utterance-based speaker recognition systems.
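The paper does not specify the utterance-level decision rule, which is left to future work; purely as an illustration of how per-frame decisions might be pooled into an utterance-level identification, a simple majority vote could look as follows.

```python
from collections import Counter

def identify_utterance(frame_predictions):
    """Aggregate per-source-frame speaker predictions into one
    utterance-level decision by simple majority vote."""
    return Counter(frame_predictions).most_common(1)[0][0]

# e.g. identify_utterance(["spk3", "spk3", "spk7", "spk3"]) -> "spk3"
```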
4. Phonetic variation

Ideally, for text-independent speaker recognition, these source-frame features should be independent of the phonetic content being uttered. This has indeed been found in [9, 3] for different representations of the source signal, and is investigated here on the source-frames we use for SVM speaker identification, as described in Section 2. We use the TI 46-Word database [14], which has 16 speakers, evenly divided by gender, each recording 10 instances of each letter of the English alphabet for training. The testing portion consists of 8 sessions, with each speaker recording 2 utterances of the alphabet each time. TI-46 is sampled at 12.5 kHz with 12-bit resolution.

Source-frames $\hat{x}, \hat{y} \in \mathbb{R}^N$, where $N$ is the length to which the source-frames are normalised, are compared by a scaled Euclidean distance metric:

$$ d(\hat{x}, \hat{y}) = \frac{1}{N} \, \| \hat{x} - \hat{y} \| \qquad (1) $$

We refer to this as an arithmetic Euclidean distance.

The highly voiced letters of the English alphabet are grouped according to their phonetic similarities, as shown in Table 3, with group labellings from the International Phonetic Alphabet.

Table 3: Voiced vowel groups of the English language used for phonetic dependence testing on the TI-46 database.

Vowel group   Letters
/eɪ/          A, H, J, K
/iː/          B, C, D, E, G, P, T, V, Z
/aɪ/          I, Y
/oʊ/          O
/juː/         Q, U
/ɛ/           F, S, X

Two experiments are performed with the aim of discerning:
1. how the glottal waveform varies BETWEEN SPEAKERS across the voiced letter groups;
2. what the variation is WITHIN SPEAKERS across the voiced letter groups.

These are described, and results given, in Sections 4.1 and 4.2. In each case we perform our confirmatory data analysis using a two-sample Kolmogorov-Smirnov test to compare the relative-frequency histograms (via their cumulative density functions) of the match and mismatch data generated. Our null hypothesis is that there is no difference in glottal waveform shape due to phonetic content, and we consider the p-value at the α = 5% significance level to determine whether this hypothesis can be statistically rejected.

4.1. Between speakers

Aiming to determine whether there is a common glottal waveform for certain voiced letters, independent of the speaker, we combine all training and testing data of TI-46, eliminating temporal/sessional variations. For each letter in our vowel groups we calculate a mean glottal waveform from all the source-frames for that letter, per speaker and per training/testing set. This gives a collection of 672 mean glottal waveforms: number of speakers (16) × training/testing sets (2) × number of letters in our vowel groups (21). We then calculate arithmetic Euclidean distances by (1) between vowels of the same group and between vowels of different groups, forming the two sets of data which we compare. We know from [11] that the distributions of scores generated in this way from comparisons of source-frames from same and different speakers are highly separated; thus we do not compare any two source waveforms from the same speaker. Based on the p-values from the Kolmogorov-Smirnov tests, for no vowel group could we conclude that a distinctive glottal waveform existed. At the α = 5% significance level we could not reject the null hypothesis for any of the vowel groups. In fact, all p-values were greater than 65%, and qualitatively the relative-frequency histograms of same-vowel and different-vowel scores showed significant overlap. Typical of this is Figure 4, showing the two distributions from one of the vowel groups.

4.2. Within speakers

Having found no phone-dependent, distinct glottal waveform across speakers, we investigate (with the same null hypothesis) whether there exist any significant differences within individual speakers.
Figure 4: Between-speakers relative-frequency histogram for one vowel group. The abscissa shows the arithmetic Euclidean distance scores on mean source waveforms between letters from the same and from different vowel groups.

Table 4: Kolmogorov-Smirnov test p-values for TI-46 speakers.

Speaker   p-value   Speaker   p-value
F1        0.6751    M1        0.9748
F2        0.6751    M2        0.9748
F3        0.9748    M3        0.9748
F4        0.9748    M4        0.6751
F5        0.9748    M5        0.9748
F6        0.6751    M6        0.3129
F7        0.1108    M7        0.9748
F8        0.9748    M8        0.6751
The 16 speakers in the TI-46 database are processed as follows. Using only the training portion, source-frames are extracted from each letter in a vowel group, and for each phone the collection of source-frames is divided into four groups, with a mean source waveform calculated for each. Arithmetic Euclidean distances are then calculated between phones from the same vowel group and between phones from different vowel groups. This produces two distributions of scores, which are compared with a Kolmogorov-Smirnov test to determine whether any statistically significant difference exists. The results are given in Table 4. The Kolmogorov-Smirnov tests allow us to conclude only that we cannot reject the hypothesis that there is no difference between the two groups of measurements. Although a statistically weak conclusion, this is a positive result for the use of source-frames in speaker recognition, particularly text-independent recognition. Ideally we wish to conclude with statistical significance that the two distributions are similar; this requires further exploration with equivalence testing.
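A minimal sketch of this within-speaker comparison, under the assumption that the per-phone mean source waveforms have already been computed; the distance function implements Eq. (1), SciPy's two-sample Kolmogorov-Smirnov test supplies the p-value, and all names are illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ks_2samp

def arithmetic_distance(x, y):
    # Eq. (1): Euclidean distance scaled by the source-frame length N.
    return np.linalg.norm(x - y) / len(x)

def within_speaker_pvalue(means_by_group):
    """means_by_group maps a vowel-group label to the list of mean
    source waveforms (NumPy arrays) for one speaker. Compares the
    distributions of same-group and different-group distances with a
    two-sample Kolmogorov-Smirnov test and returns the p-value."""
    same, diff = [], []
    groups = list(means_by_group)
    for g in groups:  # distances within the same vowel group
        same += [arithmetic_distance(a, b)
                 for a, b in combinations(means_by_group[g], 2)]
    for g, h in combinations(groups, 2):  # across different groups
        diff += [arithmetic_distance(a, b)
                 for a in means_by_group[g] for b in means_by_group[h]]
    return ks_2samp(same, diff).pvalue
```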
5. Conclusions

No qualitative (visible) differences in the source waveforms were observed as a result of phonetic content. This was established quantitatively by comparing the cumulative densities of vowel match and mismatch comparison scores via a Kolmogorov-Smirnov test. Based on p-values at the 5% significance level, we were not able to conclude that there was a statistical difference between the two distributions. This is a positive result for text-independent speaker recognition. With a suitable collection of databases, the dependence of the source-frame on language may also be investigated; our hypothesis is that the source-frames are language-independent as well. The novel source-frame features were shown to be highly useful for speaker recognition. Using support-vector-machines in a speaker identification experiment, we achieved on average a 66% correct identification rate on a per-source-frame basis across the speaker cohort group sizes. We aim to show how these strong SVM source-frame identification rates transfer to an accurate speaker recognition system for classifying utterances. We also intend to develop generative models of these source-frames, and to explore the source-frames' complementarity with short-term magnitude-spectral features such as MFCCs, where we expect score or feature fusion to result in higher-performing, robust speaker recognition.
6. References

[1] M. Plumpe, T. Quatieri, and D. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 569-586, Sep. 1999.
[2] M. Wagner, "Speaker verification using the shape of the glottal excitation function for vowels," in Proc. SST, 2006, pp. 233-238.
[3] T. Drugman and T. Dutoit, "On the potential of glottal signatures for speaker recognition," in Proc. INTERSPEECH, 2010, pp. 2106-2109.
[4] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of glottal closure instants in voiced speech using the DYPSA algorithm," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 34-43, Jan. 2007.
[5] T. Drugman and T. Dutoit, "Glottal closure and opening instant detection from speech signals," in Proc. INTERSPEECH, 2009, pp. 2891-2894.
[6] D. G. Childers and J. N. Larar, "Electroglottography for laryngeal function assessment and speech analysis," IEEE Transactions on Biomedical Engineering, vol. 31, no. 12, pp. 807-817, 1984.
[7] A. E. Rosenberg, "Effect of glottal pulse shape on the quality of natural vowels," Journal of the Acoustical Society of America, vol. 49, no. 2, pp. 583-590, 1971.
[8] G. Fant, J. Liljencrants, and Q. Lin, "A four-parameter model of glottal flow," Speech Transmission Laboratory QPSR, vol. 4, no. 4, pp. 1-13, 1985.
[9] J. Gudnason, M. R. P. Thomas, D. P. W. Ellis, and P. A. Naylor, "Data-driven voice source waveform analysis and synthesis," Speech Communication, vol. 54, no. 2, pp. 199-211, 2012.
[10] T. Drugman, G. Wilfart, and T. Dutoit, "A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis," in Proc. INTERSPEECH, 2009, pp. 1779-1782.
[11] D. Vandyke, M. Wagner, and R. Goecke, "Speaker verification with a Euclidean distance measure on normalised glottal waveforms," in preparation, 2012.
[12] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, May 1995, pp. 341-344.
[13] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[14] M. Liberman et al., "TI 46-Word," Linguistic Data Consortium, Philadelphia, 1993.