Int J Speech Technol DOI 10.1007/s10772-012-9169-x

Phoneme recognition using zerocrossing interval distribution of speech patterns and ANN R.K. Sunil Kumar · V.L. Lajish

Received: 16 February 2012 / Accepted: 11 July 2012 © Springer Science+Business Media, LLC 2012

Abstract The speech signal is modeled using the zerocrossing interval distribution of the signal in the time domain. The distributions of these parameters are studied over five Malayalam (one of the major Indian languages) vowels. We found that the distribution patterns are almost similar for repeated utterances of the same vowel and vary from vowel to vowel. These distribution patterns are used to recognize the vowels with a multilayer feedforward artificial neural network. After analyzing the distribution patterns and the vowel recognition results, we conclude that the zerocrossing interval distribution parameters can be effectively used for speech phone classification and recognition. The noise robustness of this parameter is also studied by adding additive white Gaussian noise at different signal-to-noise ratios. The computational complexity of the proposed technique is also lower than that of the conventional spectral techniques, including FFT and Cepstral methods, used in the parameterization of speech signals. Keywords Zerocrossing interval distribution (ZCD) · Zerocrossing number (ZCN) · ANN

1 Introduction

Phoneme recognition plays a major role in the design of practical, efficient speech recognition systems. Many phoneme recognition systems are reported in the literature

R.K. Sunil Kumar · V.L. Lajish () Department of Computer Science, University of Calicut, Kerala 673635, India e-mail: [email protected] R.K. Sunil Kumar e-mail: [email protected]

(Chandrasekhar and Yegnanarayana 1996; Kim and Hwang 1991; Rabiner and Juang 1993; Waibel et al. 1988; Chandrasekhar 1996). Parameterization of the analog speech signal is the first step in the speech recognition process. Although various aspects of phoneme recognition have been investigated, the parameterization of analog speech and its computational complexity remain a challenge. This paper is an attempt to model the speech signal with a computationally simple parameter based on the zerocrossing information of the signal, and to use it for phoneme recognition applications. Many speech recognition systems have been built by various researchers using different feature extraction techniques such as LPC analysis, the band pass filtering method and the Cepstral method (Itakura 1975; Bui et al. 1983; Kwok et al. 1983). Even though satisfactory recognition accuracies are reported, most of these techniques need either complex hardware or a large amount of computation (Lau and Chan 1985). Band pass filtering techniques usually employ 7–12 well tuned analog filters (Bui et al. 1983; Kwok et al. 1983). In the LPC method, 10–12 poles are commonly computed for each time interval, resulting in 10–12-dimensional feature vectors (Itakura 1975). Time domain methods such as the zerocrossing analysis technique have been applied to several signal processing and signal analysis tasks, including speech analysis and speech recognition (Arai and Yoshida 1990; Niederjohn and Lahat 1985; Niederjohn et al. 1987; Erdol et al. 1993; Sreenivas and Niederjohn 1992), electroencephalographic (EEG) and biomedical applications, communication applications, oceanographic analysis and many others (Niederjohn 1975). The relative ease with which zerocrossing information can be extracted and its low cost implementation have made this technique attractive, and the zerocrossing rate is widely used for many speech analysis and recognition purposes. In applications involving


speech processing and speech recognition, additional interest in zerocrossing analysis has gained impetus through the observation of Licklider and Pollack that clipped speech is highly intelligible (Licklider and Pollack 1948; Licklider 1950). As a result, numerous speech analysis and recognition devices have been built utilizing zerocrossing analysis techniques (Sreenivas and Niederjohn 1992; Erdol et al. 1993; Kim et al. 1996). In previous work utilizing zerocrossing analysis, several methods for extracting significant features from zerocrossing data have been proposed. Zerocrossing information of the speech signal is a perceptually meaningful parameter, because parameters such as formant frequencies can be extracted from the zerocrossing information of noise corrupted speech (Sreenivas and Niederjohn 1992; Kim et al. 1996). Wasson and Donaldson (1975) showed that zerocrossing information can be used for speaker recognition applications. Parameters derived from zerocrossing information and their noise robustness were studied by Kim et al. (1996). The importance of zerocrossing locations and zerocrossing intervals for the intelligibility of speech has also been reported (Niederjohn et al. 1987). However, the variations of the zerocrossing intervals of the signal and their use in parametric estimation for recognition purposes are yet to be investigated. In this paper a new feature, the Zerocrossing Interval Distribution (ZCD), is proposed. The present work illustrates how this parameter can be effectively used in phoneme recognition systems based on neural networks and other statistical pattern recognition techniques. The extraction of the zerocrossing interval distribution is less time consuming than other spectral techniques for the parameterization of the speech signal. The paper is organized in three sections. The first section deals with speech parameterization using zerocrossing intervals and their distribution. The second section describes vowel recognition using an Artificial Neural Network (ANN) and the final section presents the conclusion.

2 Speech modeling using zerocrossing intervals

Consider the speech segment shown in Fig. 1. $ZC_i^k$ denotes the $i$th zerocrossing and $ZC_{i+1}^k$ the $(i+1)$th zerocrossing of the $k$th observation window. The time interval between these two points is called the $i$th zerocrossing interval $T_i^k$ of the $k$th observation window. In the present work we extracted the zerocrossing interval as $T_i^k = ZC_{i+1}^k - ZC_i^k$. The aim of the study is to extract a robust coefficient, based on zerocrossings, for phoneme recognition. The actual zerocrossing can be found by linearly interpolating consecutive samples of alternating sign, as shown in Fig. 2. In our computations we approximated the zerocrossing point by the midpoint of the samples with alternating signs; from our studies we found that such an approximation works well.

Fig. 1 Speech segment in kth observation window

These zerocrossings are then used to model speech as follows. Define $T_1^k, T_2^k, \ldots, T_M^k$ as the successive zerocrossing interval durations of the signal $X(t)$ in the $k$th observation window, i.e.,

$$\sum_{i=1}^{M} T_i^k \approx W^k$$

where $W^k$ is the duration of the $k$th observation window. An XY plot is generated by plotting the index number of $T_i^k$ along the X axis and $T_i^k$ along the Y axis. The aim of the plot is to model the speech signal using $T_i^k$; this plot can be used for recognition applications. The analysis is done for five Malayalam vowels /a/, /i/, /u/, /e/, and /o/. Figure 3(a–e) shows the zerocrossing interval duration vs zerocrossing interval index number plots of the vowels /a/, /i/, /u/, /e/, and /o/ respectively. From these plots we see that the zerocrossing intervals of the speech signal vary continuously and that this variation differs from vowel to vowel; hence this information can be used to model the speech signal.

2.1 Speech modeling using zerocrossing interval distribution

The aim of this section is to investigate how the consecutive zerocrossing intervals of a speech signal vary, and hence to find the distribution of zerocrossing intervals throughout the signal. Let us define a range of interval durations $t_j = [T_{\min}(j), T_{\max}(j)]$, where $T_{\max}(j)$ is the maximum and $T_{\min}(j)$ the minimum of the $j$th interval, as illustrated in Fig. 4. Now we define a parameter $P^k(t_j)$ that gives the distribution of zerocrossing interval durations of the signal $X(t)$ in a particular interval $t_j$ of the $k$th observation window $W^k$. The parameter can be written as

$$P^k(t_j) = \sum_{i=1}^{M} \eta(t_j, T_i^k)$$
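The midpoint approximation of the zerocrossing points, and the extraction of the intervals between them, can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the plain-list signal representation are assumptions.

```python
def zero_crossing_intervals(x, fs):
    """Return the successive zerocrossing intervals T_i (in seconds) of a
    sampled signal x at sampling rate fs, approximating each zerocrossing
    by the midpoint of the pair of samples with alternating signs."""
    crossings = []
    for n in range(1, len(x)):
        if x[n - 1] * x[n] < 0:              # sign change between samples n-1 and n
            crossings.append((n - 0.5) / fs)  # midpoint approximation of the crossing
    # interval T_i is the time between consecutive crossings
    return [b - a for a, b in zip(crossings, crossings[1:])]
```

For a pure 100 Hz tone sampled at 8 kHz, every interval comes out close to the 5 ms half-period, which matches the intuition that zerocrossing intervals track the dominant frequency content.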

Fig. 2 Approximation of zerocrossings

Fig. 3 Zerocrossing interval duration vs zerocrossing interval index number for five different vowels


where $j = 1, 2, \ldots, L$ and $k = 1, 2, \ldots, N$, and

$$\eta(t_j, T_i^k) = \begin{cases} 1 & \text{if } T_i^k \text{ lies in the range specified for } t_j \\ 0 & \text{otherwise.} \end{cases}$$

For $k = 1$ we get the distribution values in all the $L$ intervals as $P^1(t_1), P^1(t_2), \ldots, P^1(t_L)$; for $k = 2$, $P^2(t_1), P^2(t_2), \ldots, P^2(t_L)$; and in general, for $k = N$, the distribution values are $P^N(t_1), P^N(t_2), \ldots, P^N(t_L)$. For $j = 1$ the total value of the distribution function is $P_{tot}(t_1) = P^1(t_1) + P^2(t_1) + \cdots + P^N(t_1)$, and in general, for $j = L$, $P_{tot}(t_L) = P^1(t_L) + P^2(t_L) + \cdots + P^N(t_L)$. Therefore

$$P_{tot}(t_j) = \sum_{k=1}^{N} P^k(t_j); \quad j = 1, 2, \ldots, L$$

where $P_{tot}(t_j)$ represents the distribution of zerocrossing intervals in $t_j$. The total number of zerocrossings is given by

$$Z_{tot} = \sum_{j=1}^{L} P_{tot}(t_j)$$

and the zerocrossing rate of the speech signal is given by

$$Z_{rate} = Z_{tot} \Big/ \sum_{k=1}^{N} W^k$$

where $W^k$ is the duration of the $k$th observation window.

Fig. 4 Maximum and minimum intervals

Here the order of computation is $n$, where $n$ is the number of samples used. It is also evident that the computation of the proposed zerocrossing based parameterization technique is less than that of conventional spectral parameters such as FFT and Cepstral methods (Cristi 2007).

2.2 Pattern formation

The speech signal is low pass filtered to 4 kHz, sampled at an 8 kHz rate, and digitized using a 16 bit A/D converter. Five Malayalam vowels /a/, /i/, /u/, /e/, and /o/ uttered by a single speaker are used in the present study. From the band limited speech data, we extracted zerocrossing intervals by locating the midpoint between samples of alternate sign as mentioned above. For finding the zerocrossing interval distribution, we fixed an upper threshold, say 0.01, by examining the zerocrossing intervals of all the vowels. This threshold range is then divided into twenty uniform ranges, viz. 0–0.0005, 0.0005–0.001, . . ., 0.0095–0.01. The distributions of the zerocrossing intervals throughout these ranges were then evaluated. The distributions were also found by normalizing the zerocrossing interval data; for this we fixed twenty uniform ranges 0–0.05, 0.05–0.1, . . ., 0.95–1, and found the distributions of the normalized intervals throughout these ranges. However, the process of normalization did not improve the results much.

Figure 5(a–e) shows the zerocrossing interval distribution of the Malayalam vowels /a/, /i/, /u/, /e/, and /o/ uttered by a single speaker at different times. This pattern is used for training and testing the artificial neural network described in the following section. From the analysis of the distribution graphs of the vowels, it can be seen that for each vowel the distribution of zerocrossing intervals is similar for repeated utterances and is distinguishable among the vowels. Hence the zerocrossing interval distribution of a vowel segment can be used as a parameter for speech recognition.
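The binning step that produces the twenty-element $P_{tot}$ pattern can be sketched as follows. The 0.01 s upper threshold and the twenty uniform ranges follow the paper; the function name and clamping of out-of-range intervals into the last bin are assumptions made for illustration.

```python
def zcd_pattern(intervals, t_max=0.01, n_bins=20):
    """Count zerocrossing intervals falling in each of n_bins uniform
    ranges spanning [0, t_max] -- the P_tot(t_j) feature vector fed to
    the network. Intervals at or beyond t_max are clamped to the last bin."""
    width = t_max / n_bins            # 0.0005 s per range, as in the paper
    counts = [0] * n_bins
    for T in intervals:
        j = min(int(T / width), n_bins - 1)
        counts[j] += 1
    return counts
```

For example, intervals of 0.0003, 0.0007, 0.0007 and 0.0042 s fall into ranges 0–0.0005 (once), 0.0005–0.001 (twice) and 0.004–0.0045 (once), giving a sparse 20-element count vector.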

3 Vowel recognition using artificial neural networks (ANN)

The application of artificial neural networks to speech recognition is the youngest and least understood of the recognition technologies. The ANN is based on the notion that complex "computing" operations can be implemented by the massive integration of individual computing units, each of which performs an elementary computation. Artificial neural networks have several advantages over sequential machines. First, the ability to adapt is at the very center of ANN operation: adaptation takes the form of adjusting the connection weights in order to achieve desired mappings, and an ANN can continue to adapt and learn, which is extremely useful in the processing and recognition of speech. Second, ANNs tend to be more robust or fault tolerant than Von Neumann machines, because the network is composed of many interconnected neurons, all computing in parallel, and the failure of a few processing units can often be compensated for by the redundancy in the network; similarly, ANNs can often generalize from incomplete or noisy data. Finally, an ANN used as a classifier does not require a strong statistical characterization or parameterization of the data (Hecht-Nielsen 1990). These are the main motivations for choosing artificial neural networks for phoneme recognition.

3.1 Recognition experiment

We used a multilayer feedforward architecture for this experiment. The network consists of 20 input nodes and


Fig. 5 Zerocrossing interval distribution vs interval range number for five different vowels

five output nodes, representing the five vowels. The network is trained using the zerocrossing interval distribution of the signal described in Sect. 2. The experiment is repeated by changing the number of hidden layers and by adding Additive White Gaussian Noise (AWGN) to the signal at different Signal to Noise Ratios (SNR). The experimental details of the vowel recognition using a multilayer feedforward neural network trained with the zerocrossing interval distribution of the

signal are as follows. We used a database of 150 samples of each vowel spoken by a single speaker. Seventy-five patterns of the repeated utterances of each vowel are used for training and the remaining 75 patterns of each vowel for testing. The error tolerance (Emax ) is fixed at 0.001 and the learning parameter (η) is chosen between 0.01 and 1.0. The network is trained using the error back propagation learning method.
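As a rough illustration of the kind of network used, the following is a minimal one-hidden-layer feedforward network trained by error backpropagation. The 20 inputs, 5 outputs and learning-rate range follow the paper; the single hidden layer of 8 units, the sigmoid activations and the weight initialization are assumptions made to keep the sketch short (the paper's tabulated results used three hidden layers).

```python
import math
import random

class MLP:
    """Sketch of a one-hidden-layer feedforward network with sigmoid units,
    trained by plain error backpropagation. Each weight row carries a bias
    weight appended at the end."""

    def __init__(self, n_in=20, n_hid=8, n_out=5, eta=0.5, seed=0):
        rng = random.Random(seed)
        self.eta = eta
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
                   for _ in range(n_out)]

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(self, x):
        # hidden and output activations; [1.0] appends the bias input
        h = [self._sig(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in self.w1]
        y = [self._sig(sum(w * v for w, v in zip(ws, h + [1.0]))) for ws in self.w2]
        return h, y

    def train_step(self, x, target):
        """One backpropagation update; returns the squared error before it."""
        h, y = self.forward(x)
        # output deltas: (t - y) * y * (1 - y)
        d_out = [(t - yi) * yi * (1 - yi) for t, yi in zip(target, y)]
        # hidden deltas, backpropagated through w2
        d_hid = [hi * (1 - hi) * sum(d * ws[j] for d, ws in zip(d_out, self.w2))
                 for j, hi in enumerate(h)]
        for ws, d in zip(self.w2, d_out):
            for j, v in enumerate(h + [1.0]):
                ws[j] += self.eta * d * v
        for ws, d in zip(self.w1, d_hid):
            for j, v in enumerate(x + [1.0]):
                ws[j] += self.eta * d * v
        return sum((t - yi) ** 2 for t, yi in zip(target, y))
```

In use, each 20-element ZCD pattern would be presented with a one-of-five target vector, and training repeated until the error falls below the tolerance.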

Table 1 Recognition accuracy for five Malayalam vowels with additive white Gaussian noise of different dB levels

Vowel      Recognition accuracy %
           0 dB     3 dB     10 dB    20 dB    30 dB    Normal
/a/        26.0     29.3     33.3     65.3     86.6     90.6
/i/        33.3     33.3     34.7     70.7     84.0     88.0
/u/        26.6     28.0     36.0     69.3     78.7     92.0
/e/        21.3     26.6     28.0     73.3     86.6     90.6
/o/        26.6     33.3     45.3     61.3     85.3     86.6
Average    26.76    30.10    35.46    67.98    84.24    89.56

It is also found that the network shows poor recognition accuracy when additive white Gaussian noise at 0 dB, 3 dB and 10 dB SNR is added to the signal; above 20 dB SNR the results are considerably better. The average vowel recognition accuracy in this experiment is 89.56 %, which is comparable with ANN based phoneme recognition using spectral methods. The detailed recognition results for the network with three hidden layers are tabulated in Table 1.
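The noise conditions of the experiment can be reproduced in outline with a helper that injects white Gaussian noise at a prescribed SNR. This is an illustrative sketch, not the authors' code; the function name and the use of measured signal power to scale the noise are assumptions.

```python
import math
import random

def add_awgn(x, snr_db, seed=0):
    """Return x corrupted by additive white Gaussian noise at the given
    signal-to-noise ratio in dB: noise power = signal power / 10**(SNR/10)."""
    rng = random.Random(seed)
    p_sig = sum(v * v for v in x) / len(x)          # mean signal power
    sigma = math.sqrt(p_sig / 10 ** (snr_db / 10.0))  # noise std deviation
    return [v + rng.gauss(0.0, sigma) for v in x]
```

At 0 dB the noise power equals the signal power, which is consistent with the sharp drop in recognition accuracy reported in Table 1 for that condition.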

4 Conclusion

The speech signal is modeled using the zerocrossing interval distribution of the signal. This distribution pattern is used to recognize five Malayalam vowels using a multilayer feedforward artificial neural network. The average vowel recognition accuracy for a single speaker using the ANN based method is 89.56 %. The advantages of this method are that the network performs well compared with other conventional techniques and that it requires less computation than conventional speech parameterization techniques such as FFT and Cepstral methods. The present work can be extended, with the proposed neural network model and training algorithm, using the zerocrossing interval distribution patterns of each vowel uttered by different male and female speakers of different age groups, in order to analyze the speaker independence of the patterns.

References

Arai, T., & Yoshida, Y. (1990). Study on zerocrossing of speech signals by means of analytic signal. Journal of the Acoustical Society of Japan, 692, 242–246.
Bui, N. C., Jmonbaron, J., & Michel, J. G. (1983). An integrated voice recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-31, 323–328.
Chandrasekhar, C. (1996). Neural network models for recognition of stop consonant vowel (SCV) segments in continuous speech. PhD thesis, Department of Computer Science and Engg., IIT Madras, India.
Chandrasekhar, C., & Yegnanarayana, B. (1996). Recognition of stop–consonant–vowel (SCV) segments in continuous speech using neural network models. Journal of the Institution of Electronics and Telecommunication Engineers, 42, 269–280.
Cristi, R. (2007). Modern digital signal processing. Washington: Thomson Brooks/Cole.
Erdol, N., Castelluccia, C., & Zilouchian, A. (1993). Recovery of missing speech packets using the short time energy and zerocrossing measurements. IEEE Transactions on Audio and Electroacoustics, 1(3), 295–303.
Hecht-Nielsen, R. (1990). Neurocomputing. Reading: Addison-Wesley.
Itakura, F. (1975). Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23, 67–72.
Kim, K.-S., & Hwang, H.-Y. (1991). A study on the speech recognition of Korean phonemes using recurrent neural network models. Transactions of the Korean Institute of Electrical Engineers, 40(8), 782–791.
Kim, D.-S., Jeong, J. H., Kim, J. W., & Lee, S.-Y. (1996). Feature extraction based on zerocrossing with peak amplitudes for robust speech recognition in noisy environments. IEEE Transactions on Audio and Electroacoustics, AU-17, 61–64.
Kwok, H. L., Tai, T. C., & Fung, Y. M. (1983). Machine recognition of the Cantonese digits using band pass filters. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-31, 220–222.
Lau, Y.-K., & Chan, C.-K. (1985). Speech recognition based on zerocrossing rate and energy. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1).
Licklider, J. C. R. (1950). Intelligibility of amplitude-dichotomized time-quantized speech waves. The Journal of the Acoustical Society of America, 22(5), 820–823.
Licklider, J. C. R., & Pollack, I. (1948). Effects of differentiation, integration and infinite peak clipping upon intelligibility of speech. The Journal of the Acoustical Society of America, 20, 42.
Niederjohn, R. J. (1975). A mathematical formulation and comparison of zerocrossing analysis techniques which have been applied to automatic speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(4), 373–380.
Niederjohn, R. J., & Lahat, M. (1985). A zero-crossing consistency method for formant tracking of voiced speech in high noise levels. IEEE Transactions on Acoustics, Speech, and Signal Processing, 2, 349–355.
Niederjohn, R. J., Krutz, M. W., & Brown, B. M. (1987). An experimental investigation of the perceptual effects of altering the zerocrossings of a speech signal. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(5), 618–625.
Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. New York: Prentice Hall.
Sreenivas, T. V., & Niederjohn, R. J. (1992). Zerocrossing based spectral analysis and SVD spectral analysis for formant frequency estimation in noise. IEEE Transactions on Signal Processing, 40(2).
Wasson, D. A., & Donaldson, R. W. (1975). Speech amplitude and zerocrossings for automatic identification of human speakers. IEEE Transactions on Acoustics, Speech, and Signal Processing, 390–392.


Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1988). Phoneme recognition: neural networks vs. hidden Markov models. IEEE Transactions on Neural Networks, 18(2), 107–110.