A BINAURAL ALGORITHM FOR SPACE AND PITCH DETECTION

Wen-Sheng Chou, Kah-Meng Cheong and Tai-Shih Chi
Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.
(This research is supported by the National Science Council, R.O.C. under Grant NSC 99-2220-E-009-056.)

ABSTRACT

A binaural algorithm to simultaneously detect the azimuth angle and the pitch of a sound source is proposed in this paper. The algorithm extends the stereausis model with two-dimensional coincidence detectors in the joint space-pitch domain. In our simulations, sounds from different locations are produced by passing signals through the Head-Related Transfer Function (HRTF). Simulation results show that the azimuth angles estimated by the proposed algorithm are more accurate than those from the stereausis model in the single-sound-source testing condition. Pilot experiments also demonstrate satisfactory results in streaming sound sources from a two-sound mixture using the estimated space-pitch information.

Index Terms—binaural process, sound localization, pitch detection, stereausis, coincidence detector

1. INTRODUCTION

Localizing and segregating multiple sound sources in a complex sound field is a very challenging task for machines, especially when the sound signals have overlapping spectra. However, humans deal with such problems in their daily lives all the time. This type of problem is referred to as the "Cocktail Party" problem [1]. It is well known that binaural and monaural cues are delicately used by humans in a complex sound field to segregate sounds successfully. As observed from humans' social behaviors, binaural cues appear to be used first to localize sound sources, and monaural cues are then adopted to separate sound streams arriving from the same location. The monaural cues comprise the spectro-temporal properties of sounds, such as the pitch, the harmonicity, and the FM and AM of sounds. Research on monaural sound segregation is referred to as Computational Auditory Scene Analysis (CASA).

It is well known that the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD) are crucial cues for binaural localization of sounds. Neurophysiological studies show that neurons in the Medial Superior Olive (MSO) and the Inferior Colliculus (IC) discharge maximally when the internal delay precisely compensates the external delay (ITD); this internal delay is called the Best Delay (BD) of the neuron [2].
Moreover, the Best Frequency (BF) of a neuron has been shown to correlate strongly with its BD [3]. To date, the most common models explaining the mechanism of a neuron's BD are the Jeffress model [4] and the stereausis model [5]. Both models use coincidence detectors to simulate neurons' responses. The Jeffress model compensates the external delay by introducing a dual delay line (the axonal delay), whereas the stereausis model compensates the external delay by the traveling-wave delay along the cochlea.

On the other hand, psychoacoustic studies show that the harmonicity cue is more salient than the ITD cue in sound segregation when both are present [6]. This raises the question of whether the harmonicity and location cues are resolved simultaneously by auditory neurons at a certain stage along the auditory pathway, or whether the binaural cues are suppressed by a higher cognitive function when the harmonicity cue is resolved later. More neurophysiological and psychoacoustic studies are needed to answer this question. In this study, we extend the stereausis model and propose a computational binaural process which resolves the ITD and the pitch of a sound simultaneously. A series of two-dimensional space-pitch templates (SPTs) are trained from outputs of the stereausis model, and two-dimensional coincidence detectors are utilized to detect the azimuth angle and the pitch of sounds.

This paper is organized as follows. In Section 2, a brief review of the monaural cochlear model and the stereausis model is given. In Section 3, our extended algorithm, which includes the trained SPTs, the 2-D coincidence detectors and the decision mechanism for detecting the azimuth angle and the pitch of the sound, is described. In Section 4, experimental results on a single sound source and on a two-sound mixture are demonstrated. We end in Section 5 with conclusions and discussions.

2. MONAURAL AND BINAURAL MODEL

In this section, a brief introduction of the monaural cochlear model [7] and the binaural stereausis model [5] is given.

2.1. Monaural Cochlear Model

Stages of the monaural cochlear model are shown in Fig. 1. It first consists of a bank of 128 constant-Q band-pass filters which mimic the frequency selectivity of the cochlea. Each filter is designed to have a specific delay related to its center frequency (CF) to model the traveling-wave delay of the input sound along the cochlea. As indicated by the dashed line in the rightmost panel of Fig. 1, the lower the CF, the longer the delay. The output of each filter is then fed into a non-linear hair cell stage, which contains a high-pass filter, a sigmoid function and a low-pass filter; this stage imitates the discharge and high-gain saturation of inner hair cells. Finally, the signal in each band is passed through a lateral inhibitory network (LIN), which effectively sharpens the frequency resolution of the cochlear filters and accounts for the frequency masking effect. The final two-dimensional output is referred to as an auditory spectrogram, which represents neural activity along the time and log-frequency axes. More details of this model can be found in [7].
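To make this pipeline concrete, a minimal Python sketch of the stages is given below. It is not the model of [7]: the Butterworth filters, the quarter-octave bandwidth, the tanh saturation and the 50-Hz smoothing are illustrative assumptions, and the CF-dependent group delay and the exact hair-cell high-pass/low-pass stages are omitted.

```python
# Crude sketch of an auditory spectrogram: constant-Q filterbank,
# saturating "hair cell" non-linearity, and lateral inhibition (LIN).
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrogram(x, fs, n_chan=128, f_lo=125.0, f_hi=4000.0):
    cfs = f_lo * (f_hi / f_lo) ** np.linspace(0.0, 1.0, n_chan)  # log-spaced CFs
    bands = []
    for cf in cfs:
        lo, hi = cf / 2 ** 0.125, cf * 2 ** 0.125  # ~quarter-octave constant-Q band
        b, a = butter(2, [lo / (fs / 2), min(hi / (fs / 2), 0.99)], btype="band")
        y = lfilter(b, a, x)
        bands.append(np.tanh(5.0 * y))             # hair-cell style saturation
    S = np.asarray(bands)                          # (n_chan, n_samples)
    # LIN: first difference across channels, half-wave rectified.
    S = np.maximum(np.diff(S, axis=0, prepend=S[:1]), 0.0)
    b, a = butter(1, 50.0 / (fs / 2))              # short-term envelope smoothing
    return lfilter(b, a, S, axis=1), cfs
```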
[Fig. 1. Stages of the monaural cochlear model.]
2.2. Binaural Stereausis Model

The stereausis model is based on the fact that the ITD causes different traveling-wave propagation times along the cochleae of the two ears. Fig. 2 shows the main process of the stereausis model.

[Fig. 2. Process of the stereausis model: the left- and right-ear auditory spectrograms feed the coincidence detectors C(i, j, t), averaged over the time interval T.]

At any time instant $t$, the element $(i, j)$ of the coincidence detector matrix can be written as

$C(i, j, t) = R(i, t) \cdot L(j, t)$,

where $R$ and $L$ represent the auditory spectrograms of the right and left ears, and $i, j$ are the indexes of the cochlear filters. Finally, averaging within a short time interval $T$ ($= t_2 - t_1$), i.e., $\frac{1}{T}\sum_{t \in T} C(i, j, t)$, gives the output matrix of the stereausis model.

For a zero-ITD sound, the stereausis model produces the maximum response along the diagonal of the output matrix. For a non-zero-ITD sound, the maximum response in the output matrix shifts toward the ear ipsilateral to the sound source. However, the maximum responses of the high-frequency channels shift more (away from the diagonal) than those of the low-frequency channels, due to the progressively longer traveling-wave delays along the cochlea from the base (high-frequency channels) to the apex (low-frequency channels). In other words, for a fixed ITD, the channel index difference between the maximally activated channels of the two ears is larger for a high-frequency input sound than for a low-frequency input sound. Therefore, the stereausis model, which further collapses the output along the diagonal direction and produces an energy distribution on the lag axis to estimate the azimuth angle of the input sound, would report different ITDs for high-frequency and low-frequency sounds at the same location.
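Under the multiplicative-detector assumption adopted later in this study, the time-averaged coincidence matrix reduces to a frame-normalized matrix product of the two spectrograms; a minimal sketch:

```python
# Stereausis output matrix with multiplicative coincidence detectors:
# C(i,j,t) = R(i,t) * L(j,t), averaged over the T frames of the interval.
import numpy as np

def stereausis_output(R, L):
    """R, L: right/left-ear auditory spectrograms, shape (n_channels, n_frames).
    Returns the (n_channels, n_channels) matrix (1/T) * sum_t R(i,t) L(j,t),
    which equals the matrix product (R @ L.T) / T."""
    T = R.shape[1]
    return (R @ L.T) / T
```

This matrix-product form makes explicit why a zero-ITD sound concentrates energy on the diagonal i = j, while an interaural delay shifts the locus off-diagonal.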
The final output of the stereausis model is affected by other variables, such as the operation of the coincidence detectors (e.g., $C(i, j, t) = [R(i, t) + L(j, t)]^2$), the parameters of the non-linear hair cell stage, the time interval $T$, and the frequency range. In this study, we adopt the following assumptions: the multiplication operation for the coincidence detectors, a 50-ms time interval $T$, an almost linear hair cell stage, and the frequency range $f \le 1.5$ kHz, which provides adequate ITD cues [3]. More discussion of the effects of these parameters can be found in [5].
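For comparison, the additive-squared detector operation mentioned above can be sketched the same way; this is an illustration of the alternative, not the operation adopted in this study:

```python
# Alternative additive-squared coincidence detectors:
# C(i,j,t) = [R(i,t) + L(j,t)]^2, averaged over frames.
import numpy as np

def stereausis_output_additive(R, L):
    C = (R[:, None, :] + L[None, :, :]) ** 2  # shape (i, j, t)
    return C.mean(axis=2)
```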
3. EXTENDED BINAURAL ALGORITHM

In this section, we propose an extension of the stereausis model to simultaneously detect the azimuth angle and the pitch of the input sound. An overview of the proposed algorithm is shown in Fig. 3.
[Fig. 3. Overall flowchart of the proposed method: in the training phase, acoustic signals are passed through the left- and right-ear HRTFs, the monaural cochlear models and the stereausis model to build the Space-Pitch Templates; in the testing phase, 2-D coincidence detectors compare the stereausis output of the input signal with the templates, and the resulting coincidence matrix R is processed by the decision and grouping algorithms to estimate the azimuth and the pitch.]
3.1. Trained SPTs and 2-D Coincidence Detectors

As mentioned in Section 2.2, sound sources of different frequencies at the same location (i.e., with the same ITD) evoke different stereausis outputs due to the traveling-wave delay along the cochlea. Therefore, we hypothesize that, in addition to the ITD cue, the harmonicity cue is resolved simultaneously by a certain mechanism. To build the SPTs, harmonic sequences with the highest frequency below 1.5 kHz are passed through the HRTF to embed different ITDs. The 'diffuse_elev0' impulse response in [8] is used to simulate different azimuth angles with a zero elevation angle. The stereausis outputs of these harmonic sequences are stored as SPTs. The templates are generated with a pitch resolution of 2 Hz and an azimuth angle resolution of 5 degrees, the latter bounded by the resolution of the HRTF. The templates can be expressed as

$Z = \{C_{train}(i, j, \text{azimuth}, \text{pitch})\}$  (1)

$\text{azimuth}: \{\Theta \mid -90^\circ \le \Theta \le 90^\circ,\ 5^\circ \text{ step size}\}$
$\text{pitch}: \{F0 \mid 80\,\text{Hz} \le F0 \le 350\,\text{Hz},\ 2\,\text{Hz step size}\}$  (2)

During the testing phase, the stereausis output $C_{test}(i, j)$ is calculated for the input sound mixture. A matrix of 2-D coincidence detectors is utilized to assess the similarity between $C_{test}(i, j)$ and each trained SPT in the set $Z$ as follows:

$R(\theta, f0) = \sum_{i,j} C_{train}(i, j, \theta, f0) \cdot C_{test}(i, j)$  (3)

Equation (3) generates a result similar to that of a 2-D correlation; however, to reduce the complexity, the multiplicative 2-D coincidence detectors are adopted in deriving the coincidence matrix $R$.
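A hedged sketch of the template training and matching of equations (1)-(3) follows. Here `make_stereausis` is a hypothetical helper standing in for the HRTF convolution (the 'diffuse_elev0' responses of [8]), the cochlear models of Section 2.1 and the stereausis model of Section 2.2; the harmonic-train generator is likewise an assumption about the training stimuli.

```python
# Sketch of SPT construction (eq. (1)-(2)) and 2-D coincidence detection (eq. (3)).
import numpy as np

def harmonic_sequence(f0, fs=16000, dur=0.05, f_max=1500.0):
    """Harmonic train with all partials below f_max (assumed stimulus)."""
    t = np.arange(int(dur * fs)) / fs
    ks = np.arange(1, int(f_max // f0) + 1)
    return np.sum([np.sin(2 * np.pi * k * f0 * t) for k in ks], axis=0)

def build_spts(azimuths, f0s, make_stereausis):
    """Eq. (1): Z = {C_train(i,j,azimuth,pitch)} over the grid of eq. (2).
    make_stereausis(signal, azimuth) is assumed to apply the HRTF pair,
    the cochlear models and the stereausis model, returning C(i,j)."""
    return {(az, f0): make_stereausis(harmonic_sequence(f0), az)
            for az in azimuths for f0 in f0s}

def coincidence_matrix(C_test, spts, azimuths, f0s):
    """Eq. (3): R(theta, f0) = sum_{i,j} C_train(i,j) * C_test(i,j)."""
    R = np.zeros((len(azimuths), len(f0s)))
    for a, az in enumerate(azimuths):
        for p, f0 in enumerate(f0s):
            R[a, p] = np.sum(spts[(az, f0)] * C_test)
    return R

# Grids of eq. (2): 5-degree azimuth steps, 2-Hz pitch steps.
azimuths = np.arange(-90, 91, 5)
f0s = np.arange(80, 351, 2)
```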
3.2. Decision Mechanism

Fig. 4(a) demonstrates the coincidence matrix $R$ of a sound mixture with two equal-power speech utterances from $-15^\circ$ and $35^\circ$ azimuth angles. A three-step algorithm is proposed to detect the pitch and azimuth angle from the matrix $R$. First, $R$ is thresholded at a quarter of its maximum value, followed by a local peak-picking procedure, to produce $W(\theta, f0)$ as in Fig. 4(b). Second, $W(\theta, f0)$ is collapsed along the pitch axis to produce an energy distribution on the azimuth axis. From a smoothed version of this energy distribution (shown on the left of Fig. 4(b)), the number of sources is determined by the number of peaks. Third, the maximum peak in $W(\theta, f0)$ is extracted as the first source with estimated azimuth angle $\theta_{est}$ and pitch $f0_{est}$. Then, similar to the procedures in [9], we clean up $W(\theta, f0)$ around the harmonic frequencies of the first source by setting

$W(\theta, f \pm 2\,\text{Hz}) = 0, \quad \forall \theta,\ 80 \le f \le 350\,\text{Hz}$  (4)

where $f = k \cdot f0_{est}$, $k = 0.5, 1, 2, 3$; and around its azimuth angle by setting

$W(\varphi, F0) = 0$  (5)

where

$\varphi = \theta_{est} \pm 5^\circ, \quad -25^\circ \le \theta_{est} \le 25^\circ$
$\varphi = \theta_{est} \pm 10^\circ, \quad -55^\circ \le \theta_{est} \le -30^\circ \ \text{or}\ 30^\circ \le \theta_{est} \le 55^\circ$
$\varphi = \theta_{est} \pm 25^\circ, \quad -90^\circ \le \theta_{est} \le -60^\circ \ \text{or}\ 60^\circ \le \theta_{est} \le 90^\circ$  (6)

The third step is repeated until $W(\theta, f0)$ becomes a null matrix or the source number derived in the second step is reached. Note that equation (6) reflects the different azimuth angle resolution of humans in different directions, i.e., progressively poorer resolution from $0^\circ$ to $\pm 90^\circ$ [10].
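A hedged sketch of this three-step mechanism is given below; the 3x3 neighborhood for peak picking and the Gaussian smoothing width are assumptions not specified in the text.

```python
# Sketch of the three-step decision mechanism of Section 3.2.
import numpy as np
from scipy.ndimage import maximum_filter, gaussian_filter1d
from scipy.signal import find_peaks

def azimuth_halfwidth(theta):
    """Eq. (6): direction-dependent cleanup width around theta_est."""
    a = abs(theta)
    return 5.0 if a <= 25 else (10.0 if a <= 55 else 25.0)

def decide(R, azimuths, f0s):
    # Step 1: threshold at a quarter of the maximum, then pick local peaks.
    W = np.where(R >= R.max() / 4, R, 0.0)
    W = np.where(maximum_filter(W, size=3) == W, W, 0.0)
    # Step 2: collapse along pitch, smooth, count peaks = number of sources.
    profile = gaussian_filter1d(W.sum(axis=1), sigma=2)
    n_src = len(find_peaks(profile)[0])
    # Step 3: iteratively extract the maximum peak and clean up W.
    sources = []
    for _ in range(n_src):
        if not W.any():
            break
        a, p = np.unravel_index(np.argmax(W), W.shape)
        th, f0 = azimuths[a], f0s[p]
        sources.append((th, f0))
        for k in (0.5, 1, 2, 3):                  # eq. (4): harmonic cleanup
            W[:, np.abs(f0s - k * f0) <= 2.0] = 0.0
        dw = azimuth_halfwidth(th)                # eq. (5)-(6): azimuth cleanup
        W[np.abs(azimuths - th) <= dw, :] = 0.0
    return sources
```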
[Fig. 4. (a) Coincidence matrix R of a sound mixture (real azimuths -15 and 35 degrees); (b) the matrix W after thresholding and local peak picking.]

4. EXPERIMENT EVALUATION
In this section, our proposed SPT-based algorithm is compared with the stereausis model in detecting the azimuth angle of sound sources. The absolute degree error (ADE) and its standard deviation (STD) are measured for comparison. For pitch detection, the accuracy rate (AR), i.e., the ratio of the number of frames with correctly estimated pitch to the total number of frames, is reported. Two experimental scenarios are considered.

Experiment 1: Our algorithm is tested with a single speech source moving continuously from -90 to 90 degrees in steps of 5 degrees, dwelling for 50 ms at each location. Fig. 5(a) shows the waveform of a sample speech signal (top panel) and the azimuth angles detected by our SPT algorithm and by the stereausis model (bottom panel). It clearly demonstrates the improvement of our algorithm over the stereausis model. In addition, the contour of the pitch detected by our algorithm is superimposed on the monaural auditory spectrogram, as shown in Fig. 5(b). Table 1 shows the average performance over ten utterances spoken by 5 male and 5 female speakers from the TIMIT corpus under this testing scenario.
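The evaluation metrics amount to simple frame-wise statistics; a minimal sketch follows. The paper does not state the tolerance for a "correct" pitch estimate, so the 2-Hz tolerance below (matching the template resolution) is an assumption.

```python
# Sketch of the evaluation metrics: absolute degree error (ADE) with its
# standard deviation (STD), and the pitch accuracy rate (AR).
import numpy as np

def degree_error_stats(est_az, true_az):
    err = np.abs(np.asarray(est_az) - np.asarray(true_az))
    return err.mean(), err.std()                  # ADE, STD

def pitch_accuracy(est_f0, true_f0, tol_hz=2.0):  # tolerance is an assumption
    est, ref = np.asarray(est_f0), np.asarray(true_f0)
    return np.mean(np.abs(est - ref) <= tol_hz)   # fraction of correct frames
```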
[Fig. 5. (a) (Top) Waveform of the input speech; (bottom) azimuth angles detected by our method and by the stereausis model. (b) Pitch contour estimated by our method, superimposed on the monaural auditory spectrogram.]

Experiment 2: An experiment on separating two sounds from a mixture by their ITDs and pitches is conducted. Sound source 1 is a male-spoken utterance from the TIMIT corpus placed at the -20 degree azimuth angle, and source 2 is a female-spoken utterance of equal energy from the AURORA2 corpus placed at the 30 degree azimuth angle. Fig. 6(a) shows the auditory spectrograms of each original sound and of their mixture. Fig. 6(b) shows three separated pitch contours, based on the pitches and azimuth angles detected by our algorithm, superimposed on the original monaural spectrograms. A pilot grouping algorithm, which mainly considers the continuity of the azimuth angle and the pitch value over time, was developed at this early stage to sequentially group pitch contours (see the sketch below). Undoubtedly, more advanced grouping algorithms would produce better streaming results and will be pursued in the future.

Table 1: Average performance of tracking a single source
                 ADE     STD     Pitch accuracy rate
SPT algorithm    3.067   3.312   93.84%
Stereausis       14.37   10.17   N/A
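The paper describes the pilot grouping algorithm only as exploiting the continuity of azimuth and pitch over time; the following is a hedged sketch of one such rule, with the continuity limits `max_daz` and `max_df0` as assumed parameters.

```python
# Sequential grouping of per-frame (azimuth, f0) detections into streams
# based on continuity over time (a sketch, not the authors' algorithm).
def group_by_continuity(frames, max_daz=10.0, max_df0=10.0):
    """frames: per-frame detection lists, e.g. [[(az, f0), ...], ...].
    Each detection joins the stream whose last point is closest, if within
    the continuity limits; otherwise it starts a new stream."""
    streams = []                                  # each stream: list of (t, az, f0)
    for t, dets in enumerate(frames):
        for az, f0 in dets:
            best, best_d = None, float("inf")
            for s in streams:
                _, paz, pf0 = s[-1]               # last point of this stream
                if abs(az - paz) <= max_daz and abs(f0 - pf0) <= max_df0:
                    d = abs(az - paz) + abs(f0 - pf0)
                    if d < best_d:
                        best, best_d = s, d
            if best is None:                      # no continuous stream: start new
                best = []
                streams.append(best)
            best.append((t, az, f0))
    return streams
```

For example, group_by_continuity([[(-20, 120)], [(-19, 122)], [(30, 210)]]) yields two streams: one tracking the source near -20 degrees and a new one for the detection at 30 degrees.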
[Fig. 6. (a) Original auditory spectrograms of each sound source (male speech at -20 degrees, female speech at 30 degrees) and of their mixture; (b) separated pitch contours after processing, with estimated azimuths of mean -20 (std 1.95), mean -5 (std 2.5) and mean 29.8 (std 3.4) degrees.]

5. CONCLUSIONS AND DISCUSSIONS

A computational binaural algorithm which simultaneously detects the azimuth angle and the pitch of sounds is proposed in this study. The algorithm outperforms the stereausis model in detecting the azimuth angle under a single-sound-source experimental setup.
In addition, it captures the pitch of the sound with an accuracy rate of about 94%. By cascading a sequential grouping algorithm, a satisfactory segregation of pitch contours from a two-sound mixture is observed in our pilot study. However, our algorithm has difficulty in detecting the azimuth angle when the two sounds are harmonically related (as indicated by the left circle in Fig. 6(a)). Although the azimuth angle is mis-detected as -5 degrees, the pitch value is still accurate, as shown in the center panel of Fig. 6(b). This result matches the conclusion of [6] that the pitch cue is more salient than the ITD cue when both are present at the same time. Nevertheless, our algorithm has several restrictions induced by the decision mechanism of Section 3.2. For example, a fixed threshold inevitably neglects sounds with small energy, as shown by the right circle in Fig. 6(a), where the male speech is ignored from 750 to 1000 ms. Another problem arises from the smoothing function in our decision mechanism, which confines the azimuth angle resolution of our algorithm. Therefore, our algorithm has difficulty in streaming nearby sounds or sounds with unbalanced volumes. Both of these tasks are also difficult for humans when they are not paying attention. In the future, we will develop "adaptive" decision mechanisms to model the binaural-process-related "attentional" behaviors of humans.

6. REFERENCES

[1] S. Haykin and Z. Chen, "The cocktail party problem," Neural Computation, vol. 17, pp. 1875-1902, 2005.
[2] P. Joris and T. C. T. Yin, "A matter of time: internal delays in binaural processing," Trends Neurosci., vol. 30, no. 2, pp. 70-78, 2007.
[3] D. McAlpine, D. Jiang, and A. R. Palmer, "A neural code for low-frequency sound localization in mammals," Nature Neurosci., vol. 4, no. 4, pp. 396-401, 2001.
[4] L. A. Jeffress, "A place theory of sound localization," J. Comp. Physiol. Psychol., vol. 41, no. 1, pp. 35-39, 1948.
[5] S. A. Shamma, "Stereausis: binaural processing without neural delays," J. Acoust. Soc. Am., vol. 86, no. 3, pp. 989-1006, 1989.
[6] T. N. Buell and E. R. Hafter, "Combination of binaural information across frequency bands," J. Acoust. Soc. Am., vol. 90, no. 4, pp. 1894-1900, 1991.
[7] T. Chi, P. Ru, and S. A. Shamma, "Multi-resolution spectro-temporal analysis of complex sounds," J. Acoust. Soc. Am., vol. 118, no. 2, pp. 887-906, 2005.
[8] B. Gardner and K. Martin, "HRTF measurements of a KEMAR dummy-head microphone," Perceptual Computing Group, MIT Media Lab, Cambridge, MA, Tech. Rep. 280, 1994.
[9] A. de Cheveigné, "Separation of concurrent harmonic sounds: fundamental frequency estimation and a time-domain cancellation model of auditory processing," J. Acoust. Soc. Am., vol. 93, no. 6, pp. 3271-3290, 1993.
[10] D. W. Grantham, "Spatial hearing and related phenomena," in Hearing, edited by B. C. J. Moore, Academic Press, London, pp. 297-339, 1995.