SCENE RECONSTRUCTION USING DISTRIBUTED MICROPHONE ARRAYS

Parham Aarabi, Bob Mungamuru
The Artificial Perception Laboratory
University of Toronto, Toronto, Ontario, Canada M5S 3G4
[email protected], [email protected]

ABSTRACT

A method for the joint localization and orientation estimation of a directional sound source using distributed microphones is presented. By modeling the signal attenuation due to the microphone directivity, the source directivity, and the source-microphone distance, a multi-dimensional search over all possible sound source locations and orientations is performed. This sound source scene reconstruction algorithm is presented in the context of an experiment with 24 microphones and a dynamic speech source. At a signal-to-noise ratio of 20dB and with a reverberation time of approximately 0.1s, accurate location estimates (20cm error) and orientation estimates (less than 10° average error) are obtained.

1. INTRODUCTION

Microphones distributed throughout an environment can provide extensive information regarding the sound sources within that environment [7, 1, 13, 12]. By sampling sound waves spatially as well as temporally, information regarding the location, orientation, and speech content of the sources, as well as the environmental acoustics, is collected. Often, this information is convoluted to such an extent that simultaneous localization, orientation estimation, speech recognition, and room impulse response estimation becomes difficult, if not impossible [1]. Of course, the estimation of these parameters is often necessary for a variety of applications, including improved human-computer interaction, teleconferencing, and robotics [9, 10, 1].

Prior work in this area has primarily focused on sound source localization [7, 11, 4, 6, 3, 5]. Techniques such as the ones proposed by [8] and [4] were shown to successfully localize sound sources in reverberant and noisy environments. More recent sound localization techniques employing spatial likelihood functions have produced simple and effective algorithms for robust sound localization [4, 11, 2, 1]. In [4, 2], a spatial accessibility mask (or, alternatively, an observability mask) for each microphone sub-array was proposed, resulting in a substantial increase in accuracy.
More recently, in [14], a joint sound localization and orientation estimation technique was proposed which modeled the directions of the source and the microphones, thereby obtaining more accurate localization estimates while simultaneously estimating the source orientation. The accuracy gains of this technique could partially be attributed to its systematic accounting for the spatial accessibility mask, in contrast to the ad-hoc accounting of [4, 1]. In this paper, we briefly overview this joint sound localization and orientation estimation technique while employing it in an experiment within a reverberant environment with 24 microphones.

2. SOUND ATTENUATION MODEL

We start with an overview of the sound attenuation model that was initially proposed in [14]. In this context, we aim to model the attenuation and the time-delay resulting from the locations and the directivities of the sound source and the microphones, in order to estimate the location and the orientation of the sound source in a multi-microphone environment. The three main causes of this attenuation are the directivity attenuation a(θ) due to the angle of arrival of the sound at the microphone, the directivity attenuation b(θ) due to the angle of departure of the sound from the source, and the attenuation d(x_s, x_i) due to the source-microphone distance. The time-delay τ_i corresponding to microphone i is only a function of the source-microphone distance, as follows:

    τ_i = ‖x_s − x_i‖ / ν    (1)

where x_s is the spatial location of the sound source, x_i is the spatial location of the ith microphone, and ν is the speed of sound in air (≈ 345 m/s). In this paper, we assume that the arrival directivity attenuation a(θ) has the following form:

    a(θ_A) = cos(θ_A / 2)    (2)

where θ_A is the direction of arrival in the frame of reference of the microphone, with θ_A = 0 corresponding to the sound signal arriving directly from the front of the microphone.
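As an illustration (not part of the original paper), equations 1 and 2 can be evaluated directly from the geometry. The sketch below assumes 2-D positions in metres and orientations in radians, all in a shared global frame; the helper names and the example geometry are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 345.0  # nu in equation 1, in m/s (approximate value used in the paper)

def time_delay(x_s, x_i, fs=None):
    """Equation 1: propagation delay from the source at x_s to the microphone at x_i.

    Returns seconds, or samples if a sampling rate fs (Hz) is supplied."""
    tau = np.linalg.norm(np.asarray(x_s, float) - np.asarray(x_i, float)) / SPEED_OF_SOUND
    return tau if fs is None else tau * fs

def arrival_attenuation(x_s, x_i, theta_i):
    """Equation 2: a(theta_A) = cos(theta_A / 2), where theta_A is the angle between
    the microphone's facing direction theta_i and the direction of the incoming sound."""
    dx, dy = np.asarray(x_s, float) - np.asarray(x_i, float)
    bearing_to_source = np.arctan2(dy, dx)                          # global bearing of the source
    theta_A = np.angle(np.exp(1j * (bearing_to_source - theta_i)))  # wrapped to (-pi, pi]
    return np.cos(theta_A / 2.0)

# Hypothetical example: a microphone at the origin facing along +x,
# and a source 2*sqrt(2) m away at a 45-degree bearing.
print(time_delay([2.0, 2.0], [0.0, 0.0]))                # about 0.0082 s
print(arrival_attenuation([2.0, 2.0], [0.0, 0.0], 0.0))  # cos(pi/8), about 0.92
```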
To determine the departure attenuation b(θ), an experiment was performed with a person speaking the same 10s phrase with different angles of departure. The normalized standard deviations of the resulting signals were measured, and are plotted in Figure 1. As a result of this experiment, the speaker directivity was modeled as follows:

    b(θ_D) = 1 / (1 + 0.6 sin²(θ_D / 2))    (3)

where θ_D is the direction of departure in the frame of reference of the speaker, with θ_D = 0 corresponding to a direct frontal angle of departure of the speaker's sound signal. Clearly, the distance attenuation factor d(x_s, x_i) can be modeled as [13]:

    d(x_s, x_i) = k / ‖x_s − x_i‖    (4)

where k is a constant whose exact value is not relevant in this paper. It should be noted that θ_A and θ_D for each microphone are a function of the position x_i and orientation θ_i of the ith microphone and the position x_s and orientation θ_s of the speaker, all of which are referenced to the same global coordinate system. Putting the three attenuation functions together, we obtain the following overall attenuation function:

    ψ(x_s, θ_s, x_i, θ_i) = a(θ_A) · b(θ_D) · d(x_s, x_i)    (5)

Fig. 1. Experimentally measured directivity of a human speaker. (Polar plot comparing the experimentally measured amplitude attenuation with the approximated attenuation (1 + 0.6 sin²(θ/2))⁻¹.)

Assuming that we record the following set of N signal segments (each consisting of Q samples) from a total of N microphones:

    M = [ m_1(t_1)  m_2(t_1)  ···  m_N(t_1) ]
        [ m_1(t_2)  m_2(t_2)  ···  m_N(t_2) ]
        [    ⋮          ⋮              ⋮    ]
        [ m_1(t_Q)  m_2(t_Q)  ···  m_N(t_Q) ]    (6)

we proceed to find the most likely location and orientation of the speaker, given the recorded data and the positions and orientations of the microphones, as follows:

    [x_s, θ_s] = arg max_{x_s, θ_s} P(x_s, θ_s | M, x_1, …, x_N, θ_1, …, θ_N)    (7)

If we assume that the only source of noise is zero-mean additive white Gaussian (independent for each microphone) with variance σ², that the speech source emits independent Gaussian signals with variance φ², and ignore the effects of reverberation in our model, our maximum likelihood estimation reduces to:

    (x̂_s, θ̂_s) = arg max_{x_s, θ_s} [ −Qσ² log( 1/σ² + (Σ_{i=1..N} ψ_i²)/φ² )
                    + Σ_{k=1..Q} ( Σ_{i=1..N} m_i(t_k + τ_i) ψ_i )² / ( σ²/φ² + Σ_{i=1..N} ψ_i² )
                    − Σ_{k=1..Q} Σ_{i=1..N} m_i²(t_k + τ_i) ]    (8)

where each ψ_i is ψ_i(x_s, θ_s) evaluated at the current search values.
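To make the estimator concrete, the following sketch (an interpretation of equations 2-5 and 8, not the authors' code) evaluates the overall attenuation ψ for each microphone and performs a brute-force search over candidate source positions and orientations. The grid resolution, the choice k = 1, the 2-D metre/radian conventions, and the integer-sample delay alignment are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 345.0  # m/s

def attenuation(x_s, theta_s, x_i, theta_i, k=1.0):
    """Equations 2-5: psi = a(theta_A) * b(theta_D) * d(x_s, x_i).

    Positions are 2-D (metres); orientations are radians in the same global frame.
    The constant k is arbitrary here, since its exact value does not matter."""
    diff = np.asarray(x_s, float) - np.asarray(x_i, float)
    dist = np.linalg.norm(diff) + 1e-9                       # guard against division by zero
    bearing_mic_to_src = np.arctan2(diff[1], diff[0])        # global direction of arrival
    bearing_src_to_mic = np.arctan2(-diff[1], -diff[0])      # global direction of departure
    theta_A = np.angle(np.exp(1j * (bearing_mic_to_src - theta_i)))
    theta_D = np.angle(np.exp(1j * (bearing_src_to_mic - theta_s)))
    a = np.cos(theta_A / 2.0)                                # equation 2
    b = 1.0 / (1.0 + 0.6 * np.sin(theta_D / 2.0) ** 2)       # equation 3
    d = k / dist                                             # equation 4
    return a * b * d                                         # equation 5

def log_likelihood(M, fs, x_s, theta_s, mic_pos, mic_theta, sigma2, phi2):
    """Equation 8 evaluated at a single candidate (x_s, theta_s).

    M is a (Q, N) array of samples from N microphones; fs is the sampling rate in Hz.
    The delays tau_i (equation 1) are applied as integer sample shifts for simplicity."""
    Q, N = M.shape
    psi = np.array([attenuation(x_s, theta_s, mic_pos[i], mic_theta[i]) for i in range(N)])
    taus = np.array([np.linalg.norm(np.asarray(x_s, float) - np.asarray(mic_pos[i], float))
                     / SPEED_OF_SOUND for i in range(N)])
    shifts = np.round(taus * fs).astype(int)
    aligned = np.stack([np.roll(M[:, i], -shifts[i]) for i in range(N)], axis=1)  # m_i(t_k + tau_i)
    weighted_sum = aligned @ psi                             # sum_i m_i(t_k + tau_i) * psi_i
    term1 = -Q * sigma2 * np.log(1.0 / sigma2 + np.sum(psi ** 2) / phi2)
    term2 = np.sum(weighted_sum ** 2) / (sigma2 / phi2 + np.sum(psi ** 2))
    term3 = -np.sum(aligned ** 2)
    return term1 + term2 + term3

def grid_search(M, fs, mic_pos, mic_theta, sigma2, phi2,
                xs=np.arange(0.2, 3.0, 0.1), ys=np.arange(0.2, 3.0, 0.1),
                thetas=np.arange(-np.pi, np.pi, np.pi / 18)):
    """Three-dimensional search of equation 8 over position (2-D) and orientation."""
    best, best_val = None, -np.inf
    for x in xs:
        for y in ys:
            for th in thetas:
                val = log_likelihood(M, fs, (x, y), th, mic_pos, mic_theta, sigma2, phi2)
                if val > best_val:
                    best, best_val = ((x, y), th), val
    return best
```

In practice the search grid would be restricted to the room and refined coarse-to-fine; the sketch is only meant to mirror the structure of equation 8.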
3. EXPERIMENTAL EXAMPLE

We now evaluate the proposed sound localization and orientation estimation algorithm in a real environment with 24 microphones. Note that the developed algorithm makes several unrealistic assumptions, including a Gaussian noise and source model as well as the fact that reverberations are ignored. Nevertheless, our experiment consists of an actual human speaker in a reverberant environment (with a reverberation time of approximately 0.1s and a signal-to-noise ratio of approximately 20dB). The experimental setup, including the positions of the speaker and the microphones, is shown in Figure 2. Note that the variance σ² was estimated from a speech-less time segment, and this estimate was then used to evaluate φ² for the speech segments.

Fig. 2. The configuration of the environment with the 24 microphones and the human speaker. (SNR ≈ 20dB, reverberation time ≈ 0.1s, sampling rate = 20kHz, sound segment size ≈ 0.5s, room height ≈ 2.5m, sound source and microphone height = 1.6m.)

In our experiment, the speaker initially spoke facing his left, as shown in Figure 3, then slowly turned his head towards the right as shown in Figure 4, and then turned his head back towards the left, to the final position shown in Figure 5.

Fig. 3. Starting point of the speaker's direction of speech, facing left.

Fig. 4. Speaker's head rotated to the right.

Fig. 5. Ending point of the speaker's direction of speech, once again facing left.

The expression in equation 8 was applied to the signals recorded by the 24 microphones. A window segment size of 0.25s was used for each location and orientation estimate, processed by half-overlapped Hanning windows. The entire recorded signal length was 10 seconds for the complete head rotation motion. For each 0.25s frame, a three-dimensional search for the most likely spatial location (2 dimensions) and orientation (third dimension) was performed. For the orientation with the highest likelihood, the spatial likelihood functions (i.e., the likelihoods of each position in space with the speaker's orientation being fixed at the most likely one) for all the segments were averaged, with the result being displayed in Figure 6. As shown, the correct location of the speaker is found, with an error of approximately 20cm. This error can be partly attributed to the changing speaker directivity, due to the rotation of the speaker's head.

Fig. 6. Spatial Likelihood Function (SLF) obtained as a result of averaging all of the SLFs from each 0.25s segment. The 'X' here corresponds to the correct location of the speaker. Lighter regions correspond to a higher likelihood and darker regions correspond to a lower likelihood. (Axes: Spatial x1 Axis vs. Spatial x2 Axis.)

The most likely speaker direction (estimated using Equation 8) is plotted in Figure 7. For comparison, the true direction of the speaker, measured during certain time segments, is also plotted. Clearly, the estimated orientations accurately coincide with the true orientations, typically falling within 10° of the true orientations.
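The per-frame processing described above can be reproduced, under assumptions about the exact framing, along the following lines: the recording is cut into 0.25s frames with half-overlapped Hanning windows, equation 8 is evaluated on a position-orientation grid for each frame, and the spatial likelihood functions at each frame's most likely orientation are averaged. The sketch below is illustrative only; log_likelihood is assumed to be any callable implementing equation 8 (such as the one sketched after equation 8), and the grid arguments are hypothetical.

```python
import numpy as np

def frame_signal(M, fs, frame_dur=0.25):
    """Split a (num_samples, N) multichannel recording into half-overlapped,
    Hanning-windowed frames of frame_dur seconds."""
    frame_len = int(frame_dur * fs)
    hop = frame_len // 2
    window = np.hanning(frame_len)[:, None]          # same window applied to every channel
    return [M[start:start + frame_len, :] * window
            for start in range(0, M.shape[0] - frame_len + 1, hop)]

def averaged_slf(frames, fs, mic_pos, mic_theta, sigma2, phi2, xs, ys, thetas, log_likelihood):
    """Average the spatial likelihood functions over all frames, with the orientation
    fixed, for each frame, at its most likely value."""
    slf_sum = np.zeros((len(xs), len(ys)))
    for F in frames:
        # Per-frame likelihood over the full (x, y, theta) grid.
        L = np.array([[[log_likelihood(F, fs, (x, y), th, mic_pos, mic_theta, sigma2, phi2)
                        for th in thetas] for y in ys] for x in xs])
        best_theta = np.unravel_index(np.argmax(L), L.shape)[2]
        slf_sum += L[:, :, best_theta]               # SLF at the most likely orientation
    return slf_sum / len(frames)
```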
Fig. 7. Estimated speaker orientation (using Equation 8) indicated by the continuous line and actual speaker orientations indicated by the black bars. (Axes: Speaker Direction in Radians vs. Time in seconds.)

4. CONCLUSIONS

A method for the joint localization and orientation estimation of a sound source using a distributed microphone array was proposed. Given the sub-optimality of the assumptions that were made (e.g. Gaussian speech and noise signals and the fact that reverberations were ignored), very encouraging experimental results were obtained (i.e. a 20cm localization error and a 10° average orientation error). Further work in this area will focus on the two main concerns of the current algorithm. First, reverberations were not modeled, and as such, in highly-reverberant environments, the localization and orientation estimation algorithm proposed here will not be successful. Consequently, by either directly modeling reverberations or utilizing techniques that are robust to reverberations, a more practical source location and orientation estimation algorithm would be obtained. The second concern regarding the proposed method is that both the source signal and the noise signal were assumed to be Gaussian. This is clearly not true in the case of a human speaker, and as a result, the direct modeling of the speaker's signals will yield improved localization and orientation estimates.

5. REFERENCES

[1] P. Aarabi. The Integration and Localization of Distributed Sensor Arrays. PhD thesis, Stanford University, May 2001.

[2] P. Aarabi. The integration of distributed microphone arrays. In Proceedings of the 4th International Conference on Information Fusion, August 2001.

[3] P. Aarabi. Self-localizing dynamic microphone arrays. IEEE Transactions on Systems, Man, and Cybernetics Part C, 32(4), November 2002.

[4] P. Aarabi. The fusion of distributed microphone arrays for sound localization. EURASIP JASP Special Issue on Sensor Networks (to appear), 2003.

[5] P. Aarabi and A. Mahdavi. The relation between speech segment selectivity and time-delay estimation accuracy. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, May 2002.

[6] P. Aarabi and S. Zaky. Robust sound localization using multi-source audiovisual information fusion. Information Fusion, 3(2):209-223, September 2001.

[7] M.S. Brandstein. A Framework for Speech Source Localization Using Sensor Arrays. PhD thesis, Brown University, May 1995.

[8] M.S. Brandstein and H. Silverman. A robust method for speech signal time-delay estimation in reverberant rooms. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, May 1997.

[9] R. A. Brooks, M. Coen, D. Dang, J. DeBonet, J. Kramer, T. Lozano-Perez, J. Melor, P. Pook, C. Stauffer, L. Stein, M. Torrance, and M. Wessler. The intelligent room project. In Proceedings of the Second International Cognitive Technology Conference, August 1997.

[10] M. Coen. Design principles for intelligent environments. In Proceedings of the 1998 AAAI Conference on Artificial Intelligence, 1998.

[11] J. DiBiase, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In M.S. Brandstein and D.B. Ward (eds.), Microphone Arrays: Signal Processing Techniques and Applications, 2001.

[12] J. Flanagan, J. Johnston, R. Zahn, and G. Elko. Computer-steered microphone arrays for sound transduction in large rooms. Journal of the Acoustical Society of America, pages 1508-1518, November 1985.

[13] K. Guentchev and J. Weng. Learning-based three dimensional sound localization using a compact non-coplanar array of microphones. In Proceedings of the 1998 AAAI Symposium on Intelligent Environments, 1998.

[14] B. Mungamuru and P. Aarabi. Joint sound localization and orientation estimation. In the 6th International Conference on Information Fusion (submitted), July 2003.