AUDIO SIGNAL CLASSIFICATION WITH TEMPORAL ENVELOPES

M. Umair Bin Altaf and Biing-Hwang (Fred) Juang

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332

ABSTRACT

The conventional approach to audio processing, based on the short-time power spectrum model, is not adequate when it comes to general audio signals. We propose an approach, justified by studies in psycho-acoustics and neuroimaging, which uses the magnitude and frequency envelopes of the audio signal, in the form of AM-FM modulations, to build an ARMA model whose parameters are then fed to a GMM for classification into various audio classes. We show that this makes explicit certain aspects of the signal which are overlooked when processing is limited to the spectral domain.

Index Terms— Audio Signal Processing, Audio Classification, Temporal Features, AM-FM Signal Models

1. INTRODUCTION

The processing of audio, whether for compression, coding, classification, or recognition, has been largely based on the short-time frequency representation. This is motivated by the interpretation of the human auditory system as essentially a frequency analyzer, comprising a bank of analysis filters with overlapping bandwidths and logarithmically distributed center frequencies. The organization of these filters is referred to as tonotopy. This spectral organization in the auditory system begins with the cochlea and has been shown to extend through to the primary auditory cortex and beyond, both functionally and topologically. It forms the basis of the power spectral model, where the stimuli are assumed to be represented by their long-term power spectra derived from short-time Fourier analysis. While the viability of this model is not in question, it is also a fact that it ignores many temporal aspects of the signal which have been shown to have at least as much neurobiological and perceptual motivation as the spectral model. This sometimes leads to unexplained phenomena, since relative phases and the temporal variations of the signal on short and long time scales are ignored. Examples include the well-known ability to differentiate time-reversed signals, and the various effects uncovered by masking experiments such as 'dip listening' [1] or 'co-modulation masking release' [2]. There are also clear perceptual differences when listening to an audible impulse and to white noise, though both have essentially the same flat spectra.


Furthermore, beats are perceived when two tones with slightly different frequencies are added together. This perception arises from the varying envelope of the signal and is not evident from the short-time spectrum. For specific auditory tasks, [3], for example, shows the importance of the temporal envelope in a speech recognition task with a psycho-acoustic experiment in which the envelope is defined as the slowly varying component of the signal, and [4] uses the Hilbert transform to extract the envelope and phase of audio signals and demonstrates their significance in audition. Moreover, studies in neuroimaging [5] have also turned up evidence indicating that temporal information is not only encoded in the timing of the spikes of the auditory nerve but, like the tonotopic organization, also extends to higher cortical regions along the auditory pathways, though not as explicitly as the tonotopic organization [6]. The presence of specialized structures in the auditory cortex with high temporal resolution, and of complementary regions with finer spectral resolution, also suggests a more integrated handling of the audio signal in the auditory cortex [5]. These pieces of evidence highlight discrepancies between such observations and the most basic assumptions of the power spectrum model, and suggest that dimensions of the signal which encode temporal information at various time scales, such as the phase and envelope, are important to our sense of audition, as they add to its capabilities and robustness. The features derived from short-time Fourier analysis assume the absence of such variations within the analysis time frame and do not account for variations across frame boundaries. Notable exceptions that model non-spectral aspects motivated by the auditory system are few, while those that take into account the limitations of the short-time analysis are even fewer (e.g., [7, 8, 9]) and mostly limited to speech. These and other such approaches have been dismissed as yielding marginal performance improvements coupled with high computational costs [10]. We do not agree with this assessment and maintain that the claims of limited usefulness are not fully justified, given our incomplete understanding of the auditory system; the short-time nature of the analysis, the many simplifications made in these models, and the constraints imposed by the structure of the system, which is generally designed independently of the features, make it difficult to judge the value of non-conventional approaches.


We wish to point out this gap in the traditional framework and present preliminary but conceptually plausible ideas to address the shortcomings.

2. SIGNAL ENVELOPE

In this paper we focus on modeling the time-varying frequency and amplitude envelope of a general audio signal and present experimental results to support its use in an auditory classification task. Work in the area of audio event classification is very sparse [11] and mostly relies on and extends techniques developed for speech recognition and classification [12]. In the mammalian auditory system, the envelope information is encoded in the timing of the spikes on the auditory nerve fibers for a particular filter. It is generally believed that the fidelity of this information is preserved up to 5 kHz [13] with high time resolution [1] and a very good dynamic range of 80 dB, better than that of the tonotopic information, which is around 40 dB. We argue, on the basis of results in this paper and elsewhere, that a properly defined envelope can represent the significant information in an auditory signal. This is especially true for a general audio signal since, unlike speech, we cannot impose a production model on it. This implies the need for a minimally constrained representation of the signal, which the envelope can provide. A general audio signal is also highly non-stationary, which renders the long-term power spectrum useless but makes the time-varying envelope a suitable representation. The envelope is not a uniquely defined concept, not least due to the gaps in our understanding of the auditory system, and the definition has most often been tailored to the nature of the task, as we noted in the previous section. However, the response of the peripheral auditory system to modulation information in complex harmonic signals, and the detection of specific responses to the spectro-temporal modulation content of the signal in the auditory cortex [14], show that treating any audio signal as a complex AM-FM (amplitude-modulation, frequency-modulation) resonance is fully justified. The AM component models the temporal variations while the FM component models the frequency variations. Note that a similar conclusion can be drawn for speech, but there the motivation is the production model of speech [7], whereas here it comes from the auditory system and is therefore more general. The next section addresses the task of extracting these components.

3. ENVELOPE FEATURES EXTRACTION AND MODELING

Consider the following discrete real-valued AM-FM signal:

    s(n) = e(n) \cos\left( \Omega_c n + \Omega_m \int_0^n q(m)\,dm + \theta \right)    (1)


where Ω_c is the discrete-time carrier frequency and Ω_m is the discrete-time deviation from Ω_c, q(m) is the frequency-modulating signal with |q(m)| ≤ 1, and e(n) is the amplitude envelope. We define the discrete instantaneous frequency as

    \Omega_{inst} \triangleq \frac{d}{dn}\left( \Omega_c n + \Omega_m \int_0^n q(m)\,dm + \theta \right) = \Omega_c + \Omega_m q(n)    (2)

Then, using the Discrete Energy Separation Algorithm (DESA), the instantaneous frequency and amplitude envelope are given as [15]:

    \Omega_{inst}(n) \approx \frac{1}{2} \arccos\left( 1 - \frac{\Psi[s(n+1) - s(n-1)]}{2\,\Psi[s(n)]} \right)    (3)

    |e_{inst}(n)| \approx \frac{2\,\Psi[s(n)]}{\sqrt{\Psi[s(n+1) - s(n-1)]}}    (4)

where Ψ(·) is the discrete-time Teager-Kaiser energy operator, which forms the basis for DESA:

    \Psi[s(n)] \triangleq s^2(n) - s(n-1)\,s(n+1)    (5)
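To make the extraction step concrete, here is a minimal sketch (assuming NumPy; the signal parameters are invented for illustration and are not taken from the paper) that synthesizes a discrete AM-FM signal of the form in eq. (1) and recovers the instantaneous frequency and amplitude envelope with the Teager-Kaiser operator of eq. (5) and the DESA estimates of eqs. (3) and (4):

    import numpy as np

    def teager_kaiser(s):
        """Discrete Teager-Kaiser energy operator, eq. (5): Psi[s](n) = s(n)^2 - s(n-1) s(n+1)."""
        psi = np.zeros_like(s)
        psi[1:-1] = s[1:-1] ** 2 - s[:-2] * s[2:]
        return psi

    def desa(s):
        """DESA estimates of instantaneous frequency (rad/sample) and amplitude envelope, eqs. (3)-(4)."""
        psi_s = teager_kaiser(s)
        diff = np.zeros_like(s)
        diff[1:-1] = s[2:] - s[:-2]                    # symmetric difference s(n+1) - s(n-1)
        psi_d = np.maximum(teager_kaiser(diff), 0.0)   # clamp tiny negative values near frame edges
        eps = 1e-12                                    # guards against division by zero
        ratio = np.clip(1.0 - psi_d / (2.0 * psi_s + eps), -1.0, 1.0)
        omega_inst = 0.5 * np.arccos(ratio)                   # eq. (3)
        e_inst = 2.0 * psi_s / (np.sqrt(psi_d) + eps)         # eq. (4)
        return omega_inst, e_inst

    # Illustrative AM-FM signal per eq. (1): slow AM and FM around a carrier of 0.2*pi rad/sample.
    n = np.arange(4000)
    e = 1.0 + 0.5 * np.cos(2 * np.pi * 0.001 * n)              # amplitude envelope e(n)
    q = np.cos(2 * np.pi * 0.002 * n)                          # frequency-modulating signal, |q(m)| <= 1
    s = e * np.cos(0.2 * np.pi * n + 0.02 * np.pi * np.cumsum(q))  # running sum approximates the integral
    omega_inst, e_inst = desa(s)

In this noise-free example, omega_inst should hover near the carrier frequency and e_inst should approximately track e(n), up to edge effects, since the slow-modulation assumptions behind eqs. (3) and (4) hold.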

The assumptions made in arriving at eqs. (3) and (4) require that the variations in e(n) and the deviation Ω_m be small relative to Ω_c.

3.1. Parametrization of the Envelope Features

The autoregressive moving-average model of a signal y(n), or ARMA(p, q), is defined as

    y(n) + \sum_{k=1}^{q} a_k\, y(n-k) = \sum_{j=1}^{p} b_j\, \epsilon(n-j)    (6)

where ε(n) is unit-variance white noise, a_k, k = 1, …, q, are the autoregressive coefficients and b_j, j = 1, …, p, are the moving-average coefficients. Since we are not interested in the spectrum of the signal, stationarity is not necessary for the model. We model the instantaneous AM and FM envelopes extracted with DESA as time series parameterized by the ARMA(p, q) model; that is, e_inst(n) from eq. (4) and Ω_inst(n) from eq. (3) substitute for y(n) in eq. (6). The parameters of the model are estimated iteratively, minimizing the squared error between the model and the time series.
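As a rough sketch of this parametrization step (the paper does not name an implementation; statsmodels is assumed here as one possible ARMA estimator, and the helper name is illustrative), each instantaneous envelope is treated as a time series whose fitted ARMA coefficients form the feature vector:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA   # an ARIMA model with d = 0 is an ARMA(p, q) model

    def arma_features(envelope, p=1, q=6):
        """Fit ARMA(p, q) to an envelope time series and return its AR and MA coefficients."""
        # The fit iteratively minimizes the error between the model and the series,
        # which is the role eq. (6) plays for e_inst(n) and Omega_inst(n).
        result = ARIMA(np.asarray(envelope, dtype=float), order=(p, 0, q)).fit()
        return np.concatenate([result.arparams, result.maparams])

    # e.g., features = np.concatenate([arma_features(e_inst), arma_features(omega_inst)])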

3.2. GMM Model

The parameters of the ARMA model are aggregated by a Gaussian mixture model (GMM) for classification. A GMM is a weighted sum of M component Gaussian densities:

    p(x|\lambda) = \sum_{i=1}^{M} w_i\, p_i(x)    (7)

where x is a D-dimensional random vector, w_i, i = 1, …, M, are the mixture weights, and λ denotes the model. Each p_i(x), i = 1, …, M, is a D-variate Gaussian density:

    p_i(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    (8)

where μ_i is the mean vector and Σ_i is the covariance matrix. The mixture weights satisfy the constraint \sum_{i=1}^{M} w_i = 1. The GMM λ is thus parametrized by the collection

    \lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M    (9)

where each class is represented by its own model λ. We estimate the GMM parameters with the EM algorithm using class-specific training data and use the standard GMM log-likelihood: for a sequence of T data vectors X = {x_1, x_2, …, x_T},

    L(X|\lambda) = \log p(X|\lambda) = \sum_{t=1}^{T} \log p(x_t|\lambda)    (10)
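A compact sketch of the classification stage, with scikit-learn's EM-trained GaussianMixture assumed as a stand-in for the paper's GMM training (class labels and variable names are illustrative): one diagonal-covariance GMM λ is fit per class, and a test sequence is assigned to the class whose model maximizes the log-likelihood of eq. (10).

    import numpy as np
    from sklearn.mixture import GaussianMixture   # GMM fitted by the EM algorithm

    def train_class_models(features_by_class, n_mix=8):
        """Fit one diagonal-covariance GMM per class on that class's training feature vectors."""
        return {
            label: GaussianMixture(n_components=n_mix, covariance_type="diag",
                                   random_state=0).fit(np.vstack(X))
            for label, X in features_by_class.items()
        }

    def classify(models, X_test):
        """Eq. (10): sum per-vector log-likelihoods under each class model and take the arg max."""
        scores = {label: gmm.score_samples(X_test).sum() for label, gmm in models.items()}
        return max(scores, key=scores.get)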

4. EXPERIMENTS

4.1. Data Set and Experimental Setup

We used the non-speech sounds from the RWCP database [16] and chose eight classes within the database as our data set. Each sound in the data set is sampled at 16 kHz and varies in duration from 0.5 to 2 seconds. Two of the eight classes have about 50 sound samples each, while the rest have 100 sound samples per class. We used 70% of the data for training and the rest for testing. The frequency and magnitude envelopes are calculated using eqs. (3) and (4) within a frame length of 50 ms. Ideally, for a given type of sound source, this fixed frame length would be replaced by an optimal temporal window computed from the effective durations of the signals themselves. The envelopes are then parameterized with ARMA coefficients at p = 1 and q = 6 for the frequency and magnitude envelopes; the model orders are estimated by 10-fold cross-validation over the classification results. Our use of a short time window in this analysis is purely for parametrization for classification, as both the frequency and magnitude envelopes are instantaneous. The baseline system uses 13 MFCCs and log-energy, for a total of 14 parameters; this is chosen to make a fair comparison with the proposed system. The baseline frame length is set at 25 ms and the MFCC parameters are calculated every 12.5 ms.
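Tying the steps together, a hedged sketch of per-file feature extraction under the setup described above (50 ms frames at 16 kHz; it reuses the illustrative desa and arma_features helpers from the earlier sketches):

    import numpy as np

    FS = 16000                    # sampling rate of the RWCP sounds
    FRAME_LEN = int(0.050 * FS)   # 50 ms analysis frames

    def extract_file_features(signal):
        """Per-frame AM/FM envelopes via DESA, each parameterized by ARMA(1, 6) coefficients."""
        feats = []
        for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN):
            frame = signal[start:start + FRAME_LEN]
            omega_inst, e_inst = desa(frame)              # eqs. (3)-(4), see the earlier sketch
            feats.append(np.concatenate([arma_features(e_inst, p=1, q=6),
                                         arma_features(omega_inst, p=1, q=6)]))
        return np.vstack(feats)   # one 14-dimensional feature vector per 50 ms frame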


GMM models are estimated for both the baseline and the proposed system, with 8 mixtures over the 14 parameters and a diagonal covariance matrix. Both systems are tested under clean conditions and under additive white Gaussian noise; for the noisy case, the models are trained with noisy data.

4.2. Experimental Results and Discussion

Fig. 1 shows the results of the classification experiments, obtained by 10-fold cross-validation repeated 10 times.

[Figure 1: correct classification rate (normalized, roughly 0.85 to 1.0) versus SNR in dB (0 to 50, plus clean/inf) for the envelope-based (ENV) and MFCC classifiers; see caption below.]

Fig. 1. Comparison of classification rates between the MFCC-based and envelope-based classifiers in the presence of noise. The rates have been normalized to 1.

It is clear that the envelope features do not perform well in noise, while their clean-data performance is almost equal to that of the baseline MFCC system. Note that there are still a few errors in the proposed system on clean data. The confusion matrix for clean data in Table 1 shows that the class 'phone4' is the main culprit, though the margin is small. We can also notice from Fig. 1 that there are dips at 15 dB and 40 dB which do not seem to follow the trend, suggesting that the performance gap is not completely due to noise. The confusion matrix at 40 dB in Table 2 shows that this problem is almost exclusive to class 'phone4' and is accentuated at this relatively high SNR, where the effect of noise is not expected to be dominant. A similar observation is made for the confusion matrix at 15 dB (not shown). On closer inspection of the samples from class 'phone4', it becomes apparent that the class is not entirely uniform. Fig. 2 shows two representative samples from the class which do not have similar magnitude envelopes and are perceptually different when listened to. We therefore conclude that some of the samples constitute a new class distinct from the label in the database. The envelope features make this distinction explicit, while it is completely missed by the MFCC-based system, which only focuses on the spectral structure of the signal.

Table 1. Confusion matrix of the proposed system with clean data (rows: true class; columns: classified as).

            buzzer  clock2  phone1  phone2  phone3  phone4  pipong  toy2
  buzzer      1       0       0       0       0       0       0      0
  clock2      0       0.99    0.01    0       0       0       0      0
  phone1      0       0       1       0       0       0       0      0
  phone2      0       0       0       1       0       0       0      0
  phone3      0       0       0       0       1       0       0      0
  phone4      0       0       0.06    0       0       0.94    0      0
  pipong      0       0       0       0       0       0       1      0
  toy2        0       0       0       0       0       0       0      1

Table 2. Confusion matrix of the proposed system with SNR = 40 dB (rows: true class; columns: classified as).

            buzzer  clock2  phone1  phone2  phone3  phone4  pipong  toy2
  buzzer      1       0       0       0       0       0       0      0
  clock2      0       1       0       0       0       0       0      0
  phone1      0       0       1       0       0       0       0      0
  phone2      0       0       0       1       0       0       0      0
  phone3      0       0       0       0       1       0       0      0
  phone4      0       0       0.04    0.2     0       0.41    0.35   0
  pipong      0       0       0       0       0       0       1      0
  toy2        0.01    0       0.01    0       0       0       0      0.98

[Figure 2: waveform plots (magnitude versus time in seconds, 0 to 2 s) of two samples from class 'phone4', files phone4-032.raw and phone4-069.raw; see caption below.]

Fig. 2. Two sound samples from the class 'phone4'.

5. CONCLUSIONS

We showed that temporal features motivated by psycho-acoustic studies highlight significant dimensions of the audio signal which are not captured by MFCCs, yet can be important to our perception. The performance of these AM-FM features does suffer at high noise levels. This work is not meant to be conclusive evidence for the usefulness of the proposed approach, but it does caution against the blanket application of short-time Fourier analysis where its use is unwarranted on account of the weakness of the underlying assumptions. We believe that a holistic approach which integrates the spectral-temporal dichotomy, among other characteristics of an audio signal, is a necessary step toward gaining full advantage of the information available within an audio signal.

6. REFERENCES

[1] Brian C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, 5th edition, Feb. 2003.


[2] J. W. Hall, M. P. Haggard, and M. A. Fernandes, "Detection in noise by spectro-temporal pattern analysis," The Journal of the Acoustical Society of America, vol. 76, no. 1, pp. 50–56, July 1984.
[3] R. V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid, "Speech recognition with primarily temporal cues," Science, vol. 270, no. 5234, pp. 303–304, Oct. 1995.
[4] Z. M. Smith, B. Delgutte, and A. J. Oxenham, "Chimaeric sounds reveal dichotomies in auditory perception," Nature, vol. 416, no. 6876, pp. 87–90, Mar. 2002.
[5] R. J. Zatorre and J. T. Gandour, "Neural specializations for speech and pitch: moving beyond the dichotomies," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 363, no. 1493, pp. 1087–1104, Mar. 2008.
[6] S. Shamma, "On the role of space and time in auditory processing," Trends in Cognitive Sciences, vol. 5, no. 8, pp. 340–348, Aug. 2001.
[7] F.-G. Zeng, K. Nie, G. S. Stickney, Y.-Y. Kong, M. Vongphoe, A. Bhargave, C. Wei, and K. Cao, "Speech recognition with amplitude and frequency modulations," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 7, pp. 2293–2298, Feb. 2005.
[8] D. Dimitriadis and P. Maragos, "Continuous energy demodulation methods and application to speech analysis," Speech Communication, vol. 48, no. 7, pp. 819–837, July 2006.
[9] O. Crouzet and W. A. Ainsworth, "On the various influences of envelope information on the perception of speech in adverse conditions: An analysis of between-channel envelope correlation," in Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis, Aalborg, Denmark, 2001.
[10] C. R. Jankowski, H.-D. H. Vo, and R. P. Lippmann, "A comparison of signal processing front ends for automatic word recognition," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 286–293, 1995.
[11] A. Temko and C. Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition, vol. 39, no. 4, pp. 682–694, 2006.
[12] I. Potamitis and T. Ganchev, "Generalized recognition of sound events: Approaches and applications," in Multimedia Services in Intelligent Environments, G. A. Tsihrintzis and L. C. Jain, Eds., pp. 41–79, Springer, 2008.
[13] Y. Ando, "Model of temporal and spatial factors in the central auditory system," in Auditory and Visual Sensations, pp. 73–89, Springer New York, 2009.
[14] M. Schönwiesner and R. J. Zatorre, "Spectro-temporal modulation transfer function of single voxels in the human auditory cortex measured with high-resolution fMRI," Proceedings of the National Academy of Sciences, vol. 106, no. 34, pp. 14611–14616, 2009.
[15] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Transactions on Signal Processing, vol. 41, no. 10, pp. 3024–3051, 1993.
[16] S. Nakamura, K. Hiyane, F. Asano, Y. Kaneda, T. Yamada, T. Nishiura, T. Kobayashi, S. Ise, and H. Saruwatari, "Design and collection of acoustic sound data for hands-free speech recognition and sound scene understanding," in Proc. IEEE International Conference on Multimedia and Expo, 2002, vol. 2, pp. 161–164.