
2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

October 18-21, 2015, New Paltz, NY

A PLENACOUSTIC APPROACH TO ACOUSTIC SIGNAL EXTRACTION

Lucio Bianchi, Fabrizio D'Amelio, Fabio Antonacci, Augusto Sarti, Stefano Tubaro

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Piazza Leonardo da Vinci 32 – 20133 Milano, Italy

ABSTRACT

This paper considers the problem of separating acoustic sources from convolutive mixtures captured by a microphone array. The problem is approached through the Plane Wave Decomposition (PWD) of the sound field measured at multiple points along the extension of the array. The directional components of the sound field are analyzed by means of the plenacoustic framework to accurately estimate the directions of arrival of the desired and undesired sources at every point at which the PWD is measured. Multiple spatial filters are designed, one for each PWD measurement point, to leave the desired source undistorted and attenuate the interferer. A subsequent stage that delays and sums the outputs of the individual spatial filters enables the reconstruction of the desired source. The use of the plenacoustic framework allows us to gather an intuitive and immediate interpretation of the acoustic scene. We prove the effectiveness of the proposed solution through simulations on speech data.

Index Terms—Microphone array processing, informed spatial filtering, plenacoustic function.

1. INTRODUCTION

Enhancement of a desired sound while attenuating interferers and background noise is a goal that the signal processing community has pursued for decades [1]. An intuitive way of approaching the problem is to first gather spatial information about the acoustic scene, and then to focus on the source of interest through spatial filtering, similarly to the human auditory system [2]. Microphone arrays provide a technological means to accurately capture spatial information [3], with techniques borrowed from antenna array processing. This approach has some advantages, such as the availability of a large corpus of algorithms. However, it should be remarked that these algorithms are usually designed under assumptions that are not, in general, satisfied in the acoustic domain: narrowband stationary signals, free-field propagation, and far-field sources. On the contrary, microphone array algorithms must be designed to deal with i) wideband non-stationary signals; ii) reverberant environments; and iii) near-field sources. The first issue is commonly solved by performing spatial filtering in a short-time frequency domain [4, 5]. The second issue has recently been addressed, e.g. in [6, 7, 8]. In particular, the authors of [8] propose a spatial filter that minimizes the diffuse-plus-noise power at the filter output, subject to constraints on the desired spatial response. In this way, the authors can specify a desired spatial response that enhances (attenuates) sounds coming from desired (undesired) source locations in the far field; the constraints are set according to an on-line estimate of the directions of arrival of the sources. However, this approach relies on the assumption of far-field sources, impinging on the array as plane wavefronts.

This assumption is, in general, violated in practical acoustic scenarios, thus degrading the performance of the spatial filter. In this paper we present an approach to source separation that addresses the problem of working in near-field conditions, while still using the simple and well-established techniques borrowed from the array processing literature. We proceed in two steps, first localizing the source and then using this information for separation purposes. As far as source localization is concerned, we make use of the directional plenacoustic function [9], which encodes the directional sound field components impinging on a microphone array. In particular, the authors of [9] subdivide a linear array of microphones into overlapping sub-arrays composed of adjacent sensors. Data from each subarray are used to compute the Plane Wave Decomposition (PWD) through a beamscan technique [10]. In the geometrical acoustics representation, the PWD is an estimate of the acoustic energy carried by acoustic rays passing through a reference point in the sub-array. The magnitude of the plenacoustic function is then represented as an image built by juxtaposition of the PWDs; hereinafter we refer to this image as the plenacoustic image. The advantage of working in this framework is that acoustic primitives appear as linear patterns in the plenacoustic image. For example, a point-like source (i.e. a source in the near field) appears as a line in the plenacoustic image. Thus, localizing a near-field acoustic source amounts to finding a linear pattern in the image. As for the source separation step, in this paper we build on the technique in [8], modified to work in a sub-array fashion. More specifically, we apply a spatial filter to each sub-array, leaving the energy emitted by the desired sound source undistorted while attenuating the interferer. Since sub-arrays are much smaller than the overall array, the far-field propagation model remains applicable even in situations where the distance of the source is comparable to the overall array length. Finally, a time-reversal approach is used to delay (possibly negatively) and sum the desired signals at each sub-array, so as to propagate the signals back to the location of the desired source. The rest of the paper is structured as follows. Section 2 introduces the notation used throughout the manuscript and states the problem of source separation in the near field. Section 3 reviews the basic concepts of plenacoustic imaging and localization from plenacoustic images, and describes the adopted spatial filtering technique. Section 4 describes the informed source extraction approach based on the plenacoustic representation. Section 5 validates the proposed approach by means of simulations. Finally, Section 6 draws some conclusions.


Figure 1: Source locations and microphone array.

Figure 2: Signal extraction of near-field sources. Block diagram: a localization branch (subarray partitioning, beamscan MVDR yielding p_i(ω, θ), mapping to the ray space I(ω, m, q), frequency averaging into Ĩ(m, q), localization of p̂_{S,l}) informs an extraction branch (subarray partitioning, spatial filtering yielding ŷ_i(ω), back-propagation yielding ŷ(ω)).

2. DATA MODEL AND PROBLEM STATEMENT

We consider the setup depicted in Fig. 1, with L (possibly moving) acoustic sources located at p_{S,l}(t) = [x_{S,l}(t), y_{S,l}(t)]^T, l = 1, ..., L, where t denotes time. A uniform linear array of M microphones is placed on the y axis between y = -q_0 and y = q_0, in such a way that the ith microphone is located at m_i = [0, q_0 - 2q_0(i-1)/(M-1)]^T, i = 1, ..., M. The overall array is subdivided into I = M - W + 1 subarrays, W being the (odd) number of microphones in each subarray. We denote by m_k, k = i-(W-1)/2, ..., i+(W-1)/2, the locations of the microphones belonging to the subarray centered at m_i. The distance between adjacent microphones is d = 2q_0/(M-1). We denote by x_k(t) the time-domain signal acquired by the kth microphone. In the following, we consider a short-time analysis framework, in which the signal acquired by the kth microphone is transformed into the signal x_k(τ, ω) by

x_k(\tau, \omega) = \int_{-\infty}^{+\infty} x_k(t) \, w(t - \tau) \, e^{-j\omega t} \, dt,   (1)

where w(t) denotes a suitable real window function.
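As a minimal illustration of the analysis transform in (1), the following Python sketch (numpy and scipy assumed; the frame length and hop are hypothetical choices, not taken from the paper) computes x_k(τ, ω) on a discrete grid of frame centers τ and frequencies ω:

```python
import numpy as np
from scipy.signal import stft

fs = 44100                  # sampling frequency (Hz), as in Section 5
x_k = np.random.randn(fs)   # placeholder for the kth microphone signal

# Hann-windowed short-time transform; 1024-sample frames with 50% overlap
# are illustrative values approximating eq. (1) on a discrete grid.
freqs, taus, X_k = stft(x_k, fs=fs, window='hann', nperseg=1024, noverlap=512)
# X_k[w, t] corresponds to x_k(tau, omega) at tau = taus[t], omega = 2*pi*freqs[w]
```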

For the purpose of clarity, in the following we omit the dependency on time and frequency; we will make this dependency explicit again in Section 4, for the purpose of re-synthesizing time-domain signals. The signals acquired by the microphones of the ith subarray are stacked into the vector x_i = [x_{i-(W-1)/2}, ..., x_{i+(W-1)/2}]^T. We assume each source to be at a distance greater than the size of the sub-array, so that far-field propagation holds within sub-arrays. Under this assumption, we can model the signal captured by the ith subarray as

x_i = A_i s_i + \psi_D + \psi_N,   (2)

s_i = [s_{1,i}, ..., s_{L,i}]^T \in C^{L \times 1} being a vector containing the L source signals. We denote by \psi_D \in C^{W \times 1} the diffuse field impinging on the subarray and by \psi_N \in C^{W \times 1} the additive microphone noise. A_i \in C^{W \times L} denotes the collection of steering vectors [11] towards each of the L sources, A_i = [a(\theta_{i,1}), ..., a(\theta_{i,L})], with

a(\theta_{i,l}) = \left[ e^{j\left(i - \frac{W-1}{2}\right)\frac{\omega}{c} d \sin\theta_{i,l}}, \; \ldots, \; e^{j\left(i + \frac{W-1}{2}\right)\frac{\omega}{c} d \sin\theta_{i,l}} \right]^T,   (3)

where the symbol \theta_{i,l} denotes the angle under which the lth source is seen by the ith subarray, i.e.

\theta_{i,l} = \arctan\left( \frac{y_{S,l} - q_0 + 2 q_0 (i-1)/(M-1)}{x_{S,l}} \right).   (4)

The covariance matrix of the array data can be modeled as [11]

R = E[x_i x_i^H] = A_i \Sigma A_i^H + \Psi_D + \Psi_N,   (5)

where \Sigma = \mathrm{diag}(|s_{1,i}|^2, ..., |s_{L,i}|^2) contains the powers of the L source signals on its diagonal, (\cdot)^H denotes the Hermitian transpose, \Psi_D = \sigma_D^2 \Gamma and \Psi_N = \sigma_N^2 I. We denote by \sigma_D^2 and \sigma_N^2 the expected powers of the diffuse field and of the microphone noise, respectively; \Gamma is the diffuse field coherence matrix [12] and I is the identity matrix.


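As a concrete illustration of the data model (2)-(5), the sketch below (numpy assumed; all parameter values are hypothetical) builds the steering matrix A_i from (3)-(4) for example DOAs and assembles the model covariance R of (5), using the classical sinc spatial-coherence model for an isotropic diffuse field [12]:

```python
import numpy as np

W, d, c = 7, 0.06, 340.0            # illustrative subarray size, spacing, speed of sound
omega = 2 * np.pi * 1000.0          # analysis frequency (rad/s), 1 kHz as an example
k_idx = np.arange(W) - (W - 1) / 2  # microphone indices relative to the subarray center

def steering_vector(theta):
    """Far-field steering vector of the subarray, cf. eq. (3)."""
    return np.exp(1j * k_idx * (omega / c) * d * np.sin(theta))

thetas = np.array([0.3, -0.5])      # example DOAs (rad) of L = 2 sources
A = np.column_stack([steering_vector(t) for t in thetas])  # W x L steering matrix

# Diffuse-field coherence: Gamma[k, l] = sin(omega*d_kl/c) / (omega*d_kl/c) for a
# spherically isotropic field; note np.sinc(x) = sin(pi*x)/(pi*x).
dists = np.abs(k_idx[:, None] - k_idx[None, :]) * d
Gamma = np.sinc(omega * dists / (c * np.pi))

src_pow = np.array([1.0, 0.5])      # example source powers |s_{l,i}|^2
sigma2_D, sigma2_N = 0.1, 0.01      # diffuse and sensor-noise powers
R = A @ np.diag(src_pow) @ A.conj().T + sigma2_D * Gamma + sigma2_N * np.eye(W)  # eq. (5)
```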

2.1. Problem Statement

The main goal of this paper is to show how signal extraction of near-field sound sources can be conveniently accomplished in the plenacoustic framework. In particular, we show how data from individual subarrays concur in the estimation of the desired signal through the application of a spatial filter informed by the location of the sound sources. The spatial filters are designed at the subarray level to attenuate/enhance sound coming from specific directions, derived from the source positions by basic geometric reasoning. Filter outputs from the individual subarrays are then back-propagated from the acoustic center of each subarray to the estimated location of the source of interest. Finally, the signals are re-synthesized in the time domain. Figure 2 shows a block diagram of the overall system; part of the notation used in the figure is explained in the next section.

3. BACKGROUND

3.1. Review of plenacoustic imaging

In order to obtain an estimate of the PWD from the ith subarray, the authors of [9] employ multiple MVDR (Capon) beamformers [11, 10] steered towards a grid of directions. The output powers of these beamformers are stored into the PWD p_i(ω, θ). The PWDs from all the subarrays are then mapped onto a convenient space P, called the ray space. According to [9], an acoustic ray is the set of points (x, y) that satisfy the linear equation y = mx + q, where the parameters (m, q) uniquely identify a ray and define the ray space P.¹ The transformation from the geometric domain to P is governed by

m = \tan(\theta), \quad q = q_0 - 2 q_0 (i-1)/(M-1).   (6)

The plenacoustic image I(ω, m, q) at frequency ω is then obtained through juxtaposition of the pseudospectra as

I(\omega, m, q_i) = p_i(\omega, m), \quad \text{with } q = q_i = q_0 - 2 q_0 (i-1)/(M-1),   (7)

where p_i(ω, m) denotes the PWD p_i(ω, θ) re-parameterized through (6).

¹The plenacoustic image is mapped onto P because acoustic primitives appear in P as linear patterns. We refer the interested reader to [13].
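To make Section 3.1 concrete, here is a minimal numpy sketch (geometry values and the snapshot matrix are hypothetical) of the beamscan MVDR computation of one PWD and of the ray-space coordinates (6)-(7) used to index the plenacoustic image:

```python
import numpy as np

def mvdr_pwd(X_i, A_scan, diag_load=1e-3):
    """Beamscan PWD for one subarray: Capon pseudospectrum over a DOA grid.
    X_i: W x T snapshot matrix at one frequency; A_scan: W x D scan steering matrix."""
    W = X_i.shape[0]
    R = X_i @ X_i.conj().T / X_i.shape[1] + diag_load * np.eye(W)
    Rinv = np.linalg.inv(R)
    # Capon spectrum: p(theta) = 1 / (a^H R^{-1} a) for each scan direction
    denom = np.einsum('wd,wv,vd->d', A_scan.conj(), Rinv, A_scan).real
    return 1.0 / denom

# Ray-space coordinates, eqs. (6)-(7): subarray i contributes one image row at q_i.
q0, M, I = 0.45, 16, 10             # illustrative geometry
thetas = np.linspace(-np.pi / 2, np.pi / 2, 65)[1:-1]  # trim endpoints: tan diverges
m_axis = np.tan(thetas)
q_axis = q0 - 2 * q0 * (np.arange(1, I + 1) - 1) / (M - 1)
# image[i, :] = mvdr_pwd(X[i], A_scan), indexed by (m_axis, q_axis[i])
```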


Figure 3: Illustrative plenacoustic image. The symbol R_{pS,l} denotes the linear pattern corresponding to the lth source, i.e. R_{pS,l}: y_{S,l} = m x_{S,l} + q.

Plenacoustic images computed at each frequency ω are combined into a single plenacoustic image Ĩ(m, q) through the product of their geometric and harmonic means [14]. In [9] the authors propose a two-step method for the localization of acoustic sources based on the analysis of the plenacoustic image. The first step consists in the identification and disambiguation of the DOAs corresponding to the acoustic sources, estimated from the peaks of the estimated PWDs. This operation is simplified by the geometrical nature of the ray space, which enables the clustering of DOAs corresponding to the same source on a linear pattern. The assignment of peaks to one of the sources is accomplished using a Hough transform [15]. The second step consists in finding an estimate of the location of the sources through a least-squares regression, as sketched below. We refer the interested reader to [9] for details. Fig. 3 shows an illustrative plenacoustic image generated by two acoustic sources; the black lines R_{pS,1} and R_{pS,2} represent the linear patterns corresponding to the two sources.
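To make the second step concrete, the following sketch (numpy assumed; the function name is hypothetical) performs the least-squares regression that recovers a source position from peaks (m_j, q_j) that the Hough step has already assigned to one source:

```python
import numpy as np

def localize_from_peaks(m_peaks, q_peaks):
    """Least-squares source location from ray-space peaks.
    A near-field source at (x_S, y_S) produces peaks on the line
    q = y_S - m * x_S (cf. Fig. 3), so we regress q on m."""
    B = np.column_stack([-m_peaks, np.ones_like(m_peaks)])
    (x_S, y_S), *_ = np.linalg.lstsq(B, q_peaks, rcond=None)
    return x_S, y_S

# Example: noiseless peaks of a source at (1.0, 0.3) are perfectly collinear.
m = np.array([-0.2, 0.0, 0.2, 0.4])
q = 0.3 - m * 1.0
print(localize_from_peaks(m, q))    # -> approximately (1.0, 0.3)
```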

3.2. Spatial filtering

The goal of spatial filtering is to linearly combine the array data x such that sounds coming from different directions are enhanced or attenuated according to a desired directivity function, while attenuating both the diffuse and the noise field components. In order to simplify the notation, in the remainder of this paragraph we consider the overall array, thus allowing the omission of sub-array indices. The desired signal, i.e. a spatially filtered version of the source signals, can be written as y = c^T s, where c ∈ R^{L×1} denotes the desired directivity function. A spatial filter h ∈ C^{W×1} provides an estimate of y as a linear combination of the array data

\hat{y} = h^H x.   (8)

In [8] the spatial filter is designed by minimizing the sum of the diffuse and noise field powers at the filter output, subject to the desired directivity function, i.e.

\hat{h} = \arg\min_h h^H (\Psi_D + \Psi_N) h \quad \text{s.t.} \quad A^H h = c.   (9)

By defining the Diffuse-to-Noise Ratio (DNR) \sigma^2 = \sigma_D^2 / \sigma_N^2 and J = \sigma^2 \Gamma + I, the optimization problem in (9) can be rewritten in the equivalent form

\hat{h} = \arg\min_h h^H J h \quad \text{s.t.} \quad A^H h = c,   (10)

whose solution is [16]

\hat{h} = J^{-1} A \left( A^H J^{-1} A \right)^{-1} c.   (11)
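A compact sketch of the constrained solution (11) (numpy assumed; it can reuse the steering matrix A and coherence matrix Gamma from the earlier sketches, and the DNR value is illustrative):

```python
import numpy as np

def informed_lcmv(A, Gamma, dnr, c):
    """Informed LCMV weights, eq. (11): h = J^{-1} A (A^H J^{-1} A)^{-1} c."""
    W = A.shape[0]
    J = dnr * Gamma + np.eye(W)     # J = sigma^2 * Gamma + I, cf. eq. (10)
    JinvA = np.linalg.solve(J, A)   # J^{-1} A without forming the inverse explicitly
    return JinvA @ np.linalg.solve(A.conj().T @ JinvA, c)

# Example: pass source 1 undistorted and null source 2, cf. eq. (13):
# h = informed_lcmv(A, Gamma, dnr=10.0, c=np.array([1.0, 0.0]))
```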

4. SOURCE SIGNAL EXTRACTION BASED ON PLENACOUSTIC INFORMATION

In this section we show how to extract the signal of a near-field sound source in the plenacoustic framework. We consider the partitioning of the overall microphone array into maximally overlapped sub-arrays, as described in Section 2. On a sub-array basis we adopt the technique introduced in [8] and reviewed in Paragraph 3.2 to enhance/attenuate sound coming from different directions with respect to the ith subarray, while minimizing the diffuse and noise components. This filtering operation is informed by the source locations, extracted from the plenacoustic image as described in Paragraph 3.1.

Consider the number of sources L and the locations {p̂_{S,l}} estimated as described in Paragraph 3.1. We set a grid of directions {θ̂_{i,l}} defined according to the estimated positions {p̂_{S,l}} as

\hat{\theta}_{i,l} = \arctan\left( \frac{\hat{y}_{S,l} - q_0 + 2 q_0 (i-1)/(M-1)}{\hat{x}_{S,l}} \right).   (12)

Thanks to the use of the plenacoustic function, we expect the DOA estimate at each subarray to be more accurate than in [8]. In fact, the DOA is derived from the estimate {p̂_{S,l}} of the source location, obtained by the least-squares line fitting described in Paragraph 3.1, which attenuates the impact of wrong DOA estimates at a small number of sub-arrays. Let A_i denote the subarray steering matrix computed on the set of directions {θ̂_{i,l}}. The desired directivity function for the ith subarray is defined as

\{c_i\}_l = \begin{cases} 1, & \text{if the } l\text{th source is desired } (l = \check{l}), \\ 0, & \text{if the } l\text{th source is unwanted } (l \neq \check{l}), \end{cases}   (13)

where the symbol \check{l} denotes the index of the desired source. The spatial filter \hat{h}_{i,\check{l}} to be applied to the ith subarray for estimating the \check{l}th source is then computed through (11), and its output is given by

\hat{y}_{i,\check{l}} = \hat{h}_{i,\check{l}}^H x_i.   (14)

The signals \hat{y}_{i,\check{l}} estimated on a sub-array basis are then re-aligned and combined to form a single source signal estimate. For this purpose, we adopt a back-propagation approach, in which each sub-array signal \hat{y}_{i,\check{l}} is back-propagated from m_i to p_{S,\check{l}}. The source signal estimate \hat{y}_{\check{l}} is then obtained as the linear combination of the back-propagated sub-array signals

\hat{y}_{\check{l}} = 4\pi \sum_{i=1}^{I} \hat{y}_{i,\check{l}} \, \| p_{S,\check{l}} - m_i \| \, e^{j \frac{\omega}{c} \| p_{S,\check{l}} - m_i \|},   (15)

where the back-propagation is performed assuming that the sound sources are isotropic point sources. We observe that this particular choice for the re-alignment of the sub-array signals is arbitrary, as they could be re-aligned through back-propagation to any point in space. However, the plenacoustic analysis framework provides an easy estimate of the actual source locations, hence it is natural to exploit this further information in the re-alignment phase. Finally, the estimated source signal is re-synthesized in the time domain through a short-time Fourier synthesis approach. In particular, denoting by \hat{y}_{\check{l}}(\tau, \omega) the signal estimated as in (15) from a frame windowed by w(t - \tau), temporally centered at \tau and at temporal frequency \omega, we have

\hat{y}(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \hat{y}_{\check{l}}(\tau, \omega) \, w(t - \tau) \, e^{j\omega t} \, d\omega \, d\tau.   (16)
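The two re-synthesis steps admit a short sketch (numpy assumed; variable shapes are hypothetical): eq. (15) inverts the point-source Green's function G(r) = e^{-j omega r / c} / (4 pi r) at each subarray center, and eq. (16) then reduces, on a discrete grid, to a standard inverse STFT (e.g. scipy.signal.istft):

```python
import numpy as np

def back_propagate(Y_sub, centers, p_src, omega, c=340.0):
    """Combine per-subarray filter outputs at one frequency, cf. eq. (15).
    Y_sub: (I,) outputs y_hat_{i,l}; centers: (I, 2) subarray centers m_i;
    p_src: (2,) estimated source location p_hat_{S,l}."""
    r = np.linalg.norm(p_src - centers, axis=1)        # ||p_S - m_i||
    return np.sum(4 * np.pi * r * np.exp(1j * omega * r / c) * Y_sub)
```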


5. RESULTS

In this section we present simulations aimed at validating the proposed approach to acoustic signal extraction. All the simulations are performed with M = 16 microphones with spacing d = 6 cm. The number of sub-arrays is varied among simulations and will be specified for each case. The sampling frequency is set to 44.1 kHz. The short-time analysis is accomplished using a Hanning window of length 23.21 ms with 50% overlap. The speed of sound is fixed at 340 m/s. The set of directions {θ_{i,l}} is obtained by uniformly sampling the interval [-π/2, π/2] at 65 points. The number of acoustic sources is fixed at L = 2. The source signals are speech signals from [17, Tracks 49 (female) and 50 (male)]. In the following, the female speech is assigned to the source with index 1, while the male speech is assigned to the source with index 2. Prior to processing, the source signals are filtered with a bandpass filter whose cutoff frequencies are 500 Hz and 5 kHz. In all the simulations, the power of the diffuse noise field is set to σ_D² = 0. In the second and third simulations, the variance of the additive noise σ_N² is set so that the signal-to-noise ratio is 20 dB.

The performance of the signal extraction approach is evaluated in terms of the Signal-to-Interference-plus-Noise Ratio (SIR), defined as the ratio between the energy of the desired source signal and the energy of the noise plus the undesired sources [18]:

\mathrm{SIR} = 10 \log_{10} \frac{\| \hat{y}_{\check{l}}(t) \|^2}{\| e(t) \|^2},   (17)

where e(t) is defined as the difference of two contributions: i) the orthogonal projection of the estimated signal \hat{y}_{\check{l}}(t) onto the space spanned by all source signals (i.e. s_l(t), l = 1, ..., L); and ii) the orthogonal projection of \hat{y}_{\check{l}}(t) onto the space spanned by the original desired source signal s_{\check{l}}(t). The SIR metric has been computed using the Matlab implementation provided in [19]; a simplified sketch is given at the end of this section.

The first simulation is aimed at assessing the impact of the localization error on the overall performance. For this purpose, we add a random perturbation to the true source locations, p̂_{S,l} = p_{S,l} + η, η being a zero-mean bi-variate Gaussian random variable with covariance matrix σ_η² I. The sources are placed at p_{S,1} = [√2/2 m, √2/2 m]^T and p_{S,2} = [√2/2 m, -√2/2 m]^T. In order to assess only the impact of the localization error, in this simulation the additive noise power is set to σ_N² = 0, as depicted in Fig. 4a. The standard deviation of the localization error σ_η is varied between 0 m and 0.23 m, and for each value of σ_η, 50 realizations of η are simulated. Figure 4b shows the SIR averaged over all realizations for each value of σ_η². We observe that the localization error introduces some impairment in the source extraction system, but even when the localization error is high there is still an acceptable degree of separation between the two sources. In particular, for localization errors with variance σ_η² > 0.01 m², the SIR for both sources remains approximately constant.

The second simulation is aimed at assessing the performance of the source extraction approach when the two sources are placed at p_{S,1} = [1 m, ∆y/2]^T and p_{S,2} = [1 m, -∆y/2]^T. Figure 4c shows the simulation setup. As Fig. 4d shows, the proposed method is still able to extract the two sources with SIR ≈ 25 dB even when ∆y = 50 cm, which in the considered setup is equivalent to an angular separation of 25°. The extraction performance improves further as ∆y increases.

Finally, the third simulation is aimed at assessing the performance of the source extraction approach when the two sources are aligned on the x axis, placed at p_{S,1} = [0.5 m, 0 m]^T and p_{S,2} = [(0.5 + ∆x) m, 0 m]^T. Figure 4e shows the simulation setup. As Fig. 4f shows, the proposed method is able to extract the two sources even in this challenging scenario. In particular, the sound source closer to the microphone array is extracted with SIR > 20 dB when ∆x > 0.8 m; the source farthest from the array is still extracted, but with a lower SIR.

Figure 4: Simulation setups and corresponding results. (a) Setup of simulation 1. (b) Results of simulation 1: SIR [dB] versus σ_η² [m²]. (c) Setup of simulation 2. (d) Results of simulation 2: SIR [dB] versus ∆y/2 [m]. (e) Setup of simulation 3. (f) Results of simulation 3: SIR [dB] versus ∆x [m].
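A simplified sketch of the projection-based SIR metric in (17) (numpy assumed; the reference Matlab implementation in [19] treats filtering distortions more carefully):

```python
import numpy as np

def sir_db(y_hat, sources, l_des):
    """SIR of eq. (17). y_hat: (T,) estimated signal; sources: (L, T) true
    source signals; l_des: index of the desired source."""
    S = sources.T                                       # T x L basis of all sources
    proj_all = S @ np.linalg.lstsq(S, y_hat, rcond=None)[0]
    s = sources[l_des]
    proj_des = s * (s @ y_hat) / (s @ s)                # projection onto desired source
    e = proj_all - proj_des                             # interference component e(t)
    return 10 * np.log10(np.sum(y_hat**2) / np.sum(e**2))
```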

6. CONCLUSIONS

In this paper we have proposed a method for the extraction of sound sources from convolutive mixtures captured by a microphone array. The array is subdivided into sub-arrays, and data from each sub-array are processed by a spatial filter designed to enhance the desired sound source and attenuate the undesired ones, while minimizing the diffuse field and the microphone noise. We have shown how the use of the plenacoustic framework simplifies the whole process, and we have provided numerical evaluations proving the effectiveness of the proposed solution.


7. REFERENCES


[1] J. Benesty, S. Makino, and J. Chen, Speech Enhancement. Berlin, DE: Springer-Verlag, 2005.
[2] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am., vol. 25, pp. 975–979, Sept. 1953.
[3] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, DE: Springer-Verlag, 2008.
[4] J. B. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-25, no. 3, pp. 235–238, 1977.
[5] J. Benesty, J. Chen, and E. A. P. Habets, Speech Enhancement in the STFT Domain, ser. Springer Briefs in Electrical and Computer Engineering. Heidelberg, DE: Springer, 2012.
[6] G. Reuven, S. Gannot, and I. Cohen, "Dual-source transfer-function generalized sidelobe canceller," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 4, pp. 711–727, May 2008.
[7] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 6, pp. 1071–1086, Aug. 2009.
[8] O. Thiergart and E. A. P. Habets, "An informed LCMV filter based on multiple instantaneous direction-of-arrival estimates," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), Vancouver, Canada, May 2013.
[9] D. Marković, F. Antonacci, A. Sarti, and S. Tubaro, "Soundfield imaging in the ray space," IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 12, pp. 2493–2505, 2013.
[10] H. L. Van Trees, Optimum Array Processing. New York, NY, USA: John Wiley & Sons, 2002, Part IV of Detection, Estimation, and Modulation Theory.
[11] P. Stoica and R. Moses, Spectral Analysis of Signals. Upper Saddle River, NJ, USA: Prentice Hall, 2004.


[12] R. K. Cook, R. V. Waterhouse, R. D. Berendt, S. Edelman, and M. C. Thompson, "Measurement of correlation coefficients in reverberant sound fields," J. Acoust. Soc. Am., vol. 27, pp. 1072–1077, 1955.
[13] F. Antonacci, M. Foco, A. Sarti, and S. Tubaro, "Fast tracing of acoustic beams and paths through visibility lookup," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 4, pp. 812–824, 2008.
[14] M. R. Azimi-Sadjadi, A. Pezeshki, L. L. Scharf, and M. Hohil, "Wideband DOA estimation algorithms for multiple target detection and tracking using unattended acoustic sensors," in Proc. Defense and Security, 2004.
[15] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Comm. ACM, vol. 15, no. 1, pp. 11–15, Jan. 1972.
[16] O. L. Frost, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.
[17] Sound Quality Assessment Material Recordings for Subjective Tests, European Broadcasting Union Std., 2008.

[18] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.
[19] E. Vincent, S. Araki, F. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, V. Gowreesunker, D. Lutter, and N. Q. K. Duong, "The signal separation evaluation campaign (2007–2010): Achievements and remaining challenges," Signal Processing, vol. 92, pp. 1928–1936, 2012.