REPRODUCTION OF INDEPENDENT NARROWBAND SOUNDFIELDS IN A MULTIZONE SURROUND SYSTEM AND ITS EXTENSION TO SPEECH SIGNAL SOURCES

Nasim Radmanesh, Ian S. Burnett
School of Electrical and Computer Engineering, RMIT University, Melbourne, VIC, Australia

ABSTRACT

While higher order ambisonic approaches can be used to generate multiple zone soundfields, this paper adopts a Least Squares matching approach which provides a more flexible formulation. The base approach, adopted from [1], computes speaker weights which allow for the placement of single sources in the soundfield. In this paper the approach is extended firstly to two multi-frequency sources and then to narrowband speech signals. The results for multi-frequency sources explore the zonal soundfield errors resulting from varied source positions. For speech signals, the approach provides a potential solution for multiple conversation reproduction in a multi-user environment. The results indicate that the approach is feasible for zones which do not suffer occlusion effects from other zones. However, for more versatile multizone soundfield reproduction a 3D approach is recommended.

Index Terms: Audio, Sound fields, Multizone, independent sound fields, 2D surround system

1. INTRODUCTION

The generation of complex, surround audio in the form of soundfields has been a topic of interest since the 1970s, when Gerzon initiated work in the area of Ambisonics [2]. Higher order ambisonics [3] can be used to generate soundfields in which the audio is accurately reproduced in multiple zones on the basis of mode matching. In [1], Poletti proposed an alternative (but equivalent) approach using least squares matching to generate 2D monochromatic sound fields in a multizone surround system. This was based on the computation of loudspeaker weights for a sound source positioned within, or on, a ring of speakers [1]. Multizone soundfield reproduction was further investigated in [4], where zones of differing radii and radial distance were considered. However, for real world audio applications, it is necessary to consider signals which include a range of frequencies rather than a single tone. While there are limitations to the production of independent monochromatic soundfields in multiple zones for single tones, the generation of independent narrowband soundfields is significantly more challenging. This paper addresses the problems of creating audio soundfields with multiple active zones, initially with multiple tone, narrowband sources and then using two independent speech sources.

Throughout the work in this paper, the soundfield is generated by an array of 300 loudspeakers. The first section of the paper considers the reproduction of soundfields with two multi-tone sources in separate active zones. This is used as a vehicle to examine the speaker weights required for generation of multiple soundfields for the full range of source angles. The paper then extends the analysis to a scenario in which independent speech signals are delivered to active zones simultaneously. This scenario might be found, for example, in a living room environment where two people wish to hold phone conversations over a VoIP-enabled television. The scenario is similar to the personal audio spaces considered in [5]. It is important to note that the paper does not seek to address the issues of encoding multiple speakers in such an environment. It is reasonable to assume that close talking microphones or beamforming algorithms might be employed to solve such issues. Section 3 examines the effects of varying source positions relative to the active zones in terms of the reproduction of independent soundfields. This is quantified through the use of the PESQ [6] measure of speech quality. Section 4 of the paper concludes with a discussion of the consequences of the results and potential future directions.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE


2. GENERATION OF INDEPENDENT NARROWBAND SOUNDFIELDS

The following analysis is performed on the basis of a set of N zones (fixed in position) surrounded by an array of loudspeakers placed on a circle of fixed radius. For an accurate two-dimensional sound field to be reproduced over a radius R (where R is less than the radius of the speaker circle), the required number of loudspeakers is [7]:

$$L \geq 2kR + 1$$

where k denotes the wave number. It is assumed that there are S sources containing multiple sinusoidal components and that these are generated by the L loudspeakers on the basis of a set of complex signal weights $W_l$. The aim is to generate independent sound fields in multiple zones, and every zone is the target of just one of the S source signals.
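As a quick check of the loudspeaker-count bound, the following minimal sketch evaluates L ≥ 2kR + 1 with k = 2πf/c. The function name `min_loudspeakers` and the speed of sound c = 343 m/s are assumptions made for illustration; the paper does not state a value for c.

```python
import math

def min_loudspeakers(freq_hz: float, radius_m: float, c: float = 343.0) -> int:
    """Smallest integer L satisfying L >= 2kR + 1, where k = 2*pi*f/c.

    c is an assumed speed of sound in m/s (not given in the paper).
    """
    k = 2.0 * math.pi * freq_hz / c          # wave number in rad/m
    return math.ceil(2.0 * k * radius_m + 1.0)

# 4 kHz over a radius of 2 m, the operating point quoted in Section 2.2
print(min_loudspeakers(4000.0, 2.0))
```

Under this assumed value of c, the bound for 4 kHz and R = 2 m falls just below the 300 loudspeakers used in the simulations.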

ICASSP 2011

2.1. Calculating loudspeaker weights

The analysis initially considers the pressure generated by a single source with one sinusoidal component. Each of the N zones in the soundfield is covered by M matching points which are distributed so as to avoid spatial aliasing (see [1]). For each of the MN matching points, a pressure matching approach is used to calculate least squares optimized speaker weights which minimize the pressure error at that point. Assuming the time dependency $e^{j\omega t}$, the pressure produced by the loudspeakers for a given matching point m is given by [1]:

$$p(r_m, \phi_m) = \sum_{l=1}^{L} W_l \, \frac{e^{-jk\left|\vec{r}_l - \vec{r}_m\right|}}{\left|\vec{r}_l - \vec{r}_m\right|} \qquad (1)$$

where the vector positions of the loudspeakers and matching points in polar coordinates are $\vec{r}_l = (r_l, \phi_l)$ and $\vec{r}_m = (r_m, \phi_m)$ respectively. Equation (1) represents a monochromatic sound field with angular frequency $\omega$ and an lth loudspeaker weight $W_l$. The desired sound field for one source produced at the matching points [1] is then given by:

$$D(r_m, \phi_m) = \begin{cases} A \, \dfrac{e^{-jk\left|\vec{r}_s - \vec{r}_m\right|}}{\left|\vec{r}_s - \vec{r}_m\right|}, & m = 1, 2, \ldots, M \\[2ex] \alpha A \, \dfrac{e^{-jk\left|\vec{r}_s - \vec{r}_m\right|}}{\left|\vec{r}_s - \vec{r}_m\right|}, & m = M+1, \ldots, NM \end{cases} \qquad (2)$$

where the active zone is zone 1, $\vec{r}_s = (r_s, \phi_s)$ is the source position (in polar coordinates), $\alpha$ is the sound field attenuation in inactive zones (zones 2, 3, etc.) and A is a constant which ensures a peak sound pressure of unity at the centre of the active zone. Combining equations (1) and (2) for the reproduced and desired pressure, the matrix notation [1] can be expressed as:

$$HW = D \qquad (3)$$

where H is the matrix of free field monopole sound pressures and W is the L by 1 vector of speaker weights. To calculate the optimum speaker weights, a least squares approach was taken, though other optimizations are possible. On the basis of a least squared error computation, W can be determined such that:

$$W = \left[H^H H + \delta I\right]^{-1} H^H D \qquad (4)$$

where $H^H$ represents the conjugate transpose of H. Regularization is applied in (4) to control the least squares solutions [7], where $\delta$ is the constraint parameter and I is the identity matrix (see [7] for an explanation of this process). When there are S sources containing $Q_s$ sinusoidal components, all complex loudspeaker weights for all source frequency components are calculated. Assuming a time dependency $e^{j\omega t}$, the pressure generated by all sources at matching point m is then:

$$\hat{p}(r_m, \phi_m) = \sum_{s=1}^{S} \sum_{q=1}^{Q_s} \sum_{l=1}^{L} W_{sql} \, \frac{e^{-jk_{sq}\left|\vec{r}_l - \vec{r}_m\right|}}{\left|\vec{r}_l - \vec{r}_m\right|} \qquad (5)$$

Fig. 1. Sound field visualization for N = 3, R = 1.5 m, Rz = 0.3 m with source 1: $f_{S1,1}$ = 4 kHz, $f_{S1,2}$ = 0.7 kHz, $\phi_{s1}$ = 0°, zone 1 active; source 2: $f_{S2,1}$ = 2 kHz, $f_{S2,2}$ = 0.3 kHz, $\phi_{s2}$ = 30°, zone 2 active. [Zones are labelled 1, 2 and 3 in the figure.]

Fig. 2. The computed speaker weights for three zones, i.e. N = 3, R = 1.5 m, Rz = 0.3 m with source 1: $f_{S1,1}$ = 4 kHz, $f_{S1,2}$ = 0.7 kHz, $\phi_{s1}$ = 0°, zone 1 active; source 2: $f_{S2,1}$ = 2 kHz, $f_{S2,2}$ = 0.3 kHz, $\phi_{s2}$ = 30°, zone 2 active. [Plot: magnitude versus speaker angle (degrees).]
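The pressure matching procedure of equations (1) to (4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matching-point layout (`zone_points`), the speed of sound, the attenuation α = 0.01, the regularization δ = 1e-6, and the reduced set-up (60 loudspeakers, two zones, a single 500 Hz tone) are all hypothetical choices made to keep the example small.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s (not stated in the paper)

def polar_to_xy(r, phi_deg):
    """Convert polar coordinates (metres, degrees) to Cartesian rows."""
    phi = np.deg2rad(phi_deg)
    return np.stack([r * np.cos(phi), r * np.sin(phi)], axis=-1)

def monopole(k, src_xy, pts_xy):
    """exp(-jk d)/d between every point (rows) and every source (columns)."""
    d = np.linalg.norm(pts_xy[:, None, :] - src_xy[None, :, :], axis=-1)
    return np.exp(-1j * k * d) / d

def zone_points(centre_xy, radius, n_rings=3, n_per_ring=8):
    """Hypothetical matching-point layout: zone centre plus concentric rings."""
    ang = 2.0 * np.pi * np.arange(n_per_ring) / n_per_ring
    unit = np.stack([np.cos(ang), np.sin(ang)], axis=-1)
    rings = [centre_xy + radius * i / n_rings * unit for i in range(1, n_rings + 1)]
    return np.vstack([centre_xy[None, :]] + rings)

def solve_weights(H, D, delta):
    """Regularized least squares of eq. (4): W = (H^H H + delta I)^-1 H^H D."""
    Hh = H.conj().T
    return np.linalg.solve(Hh @ H + delta * np.eye(H.shape[1]), Hh @ D)

# Illustrative set-up, much smaller than the paper's L = 300 configuration
freq, n_spk, r_spk = 500.0, 60, 2.0
k = 2.0 * np.pi * freq / C
spk_xy = polar_to_xy(r_spk, 360.0 * np.arange(n_spk) / n_spk)
src_xy = polar_to_xy(4.0, np.array([0.0]))            # source at rs = 4 m, 0 degrees
centres = polar_to_xy(1.0, np.array([90.0, 270.0]))   # zone 1 (active), zone 2 (quiet)
zones = [zone_points(c, 0.3) for c in centres]
M = len(zones[0])                                     # matching points per zone
pts = np.vstack(zones)

H = monopole(k, spk_xy, pts)                  # (N*M, L) matrix of eq. (3)
A = np.linalg.norm(centres[0] - src_xy[0])    # unit magnitude at the zone-1 centre
alpha = 0.01                                  # assumed attenuation for the quiet zone
D = A * monopole(k, src_xy, pts)[:, 0]        # eq. (2), active-zone form for all rows
D[M:] *= alpha                                # eq. (2), scale the inactive-zone rows

W = solve_weights(H, D, delta=1e-6)
p = H @ W                                     # eq. (1) at every matching point
err_active_db = 10.0 * np.log10(np.mean(np.abs(p[:M] - D[:M]) ** 2))
level_quiet_db = 20.0 * np.log10(np.mean(np.abs(p[M:])) / np.mean(np.abs(p[:M])))
print(f"active-zone error: {err_active_db:.1f} dB, quiet-zone level: {level_quiet_db:.1f} dB")
```

For a multi-frequency source as in equation (5), the same solve would be repeated for each wave number $k_{sq}$ and the resulting pressures summed. The zones here are placed so that neither occludes the other as seen from the source; moving the source so that the zones line up would be expected to degrade both figures.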

2.2. Simulation results

For the purposes of simulation, the narrowband source signals consisted of two frequency components. The number of loudspeakers, L, is equal to 300 and hence the array is able to produce accurate soundfields with frequencies up to 4 kHz in an area with a radius of R = 2 m [1]. All zones (marked 1, 2 and 3 in Figs. 1 and 3) have radius Rz = 0.3 m and both sources are at radius rs = 4 m. Fig. 1 illustrates a sound field consisting of two narrowband sources at 0° and 30°. Source 1 consists of two sinusoidal components of 4 kHz and 0.7 kHz and the target zone for this signal is zone 1. Source 2 comprises two sinusoidal components of 2 kHz and 0.3 kHz and the target zone for this signal is zone 2. The mean soundfield reproduction error in zones 1 to 3 is -17 dB, -16 dB and -19 dB respectively. Fig. 2 shows the speaker weights used for generation of the soundfield in Fig. 1. The speaker weights are significantly more active around the source angles, as expected. Generally, low frequency components use loudspeakers over a wider range of angles than higher frequencies, which require a much smaller aperture [1].

Fig. 3 illustrates the soundfield for two narrowband sources at 340.8° and 120°. At 340.8°, zone 2 is occluded by zone 1, and this is thus the worst angle in terms of the soundfield quality generated for source 1 in zone 1 (and the lack of silence in zone 2). The mean error of soundfield reproduction in zones 1 to 3 is -9 dB, -8 dB and -14 dB respectively. Figs. 4 and 5 illustrate zone errors for different angles of sources 1 and 2, respectively. In both figures the source signals comprise frequencies f1 = 2 kHz and f2 = 0.3 kHz. These results demonstrate that for angles of occlusion (of one zone by another) the error can be almost 30 dB higher than in the non-occluded case, e.g. in zones two and three. In these figures, both sources consisted of similar frequency components so as to isolate the zonal aspects of the errors. In all cases, the error in a zone is the summation of errors generated by all sources at that zone's matching point(s).

Fig. 3. Sound field visualization for N = 3, R = 1.5 m, Rz = 0.3 m with source 1: $f_{S1,1}$ = 3 kHz, $f_{S1,2}$ = 0.8 kHz, $\phi_{s1}$ = 340.8°, zone 1 active; source 2: $f_{S2,1}$ = 2.5 kHz, $f_{S2,2}$ = 0.5 kHz, $\phi_{s2}$ = 120°, zone 2 active.

Fig. 4. Zone errors for different angles of source 1, with zone 1 active for source 1 and source 2 fixed at $\phi_{S2}$ = 0° with zone 2 active. Both sources consist of f1 = 2 kHz, f2 = 0.3 kHz. Simulations were run with N = 3, R = 1.5 m, Rz = 0.3 m and L = 300. [Plot: squared error (dB) versus source angle, one curve per zone.]

Fig. 5. Zone errors for different angles of source 2, with zone 2 active for source 2 and source 1 fixed at $\phi_{S1}$ = 0° with zone 1 active. Both sources consist of f1 = 2 kHz, f2 = 0.3 kHz. Simulations were run with N = 3, R = 1.5 m, Rz = 0.3 m and L = 300. [Plot: squared error (dB) versus source angle, one curve per zone.]
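The statement that a zone's error is the summation of the errors from all sources can be illustrated with a short sketch. The paper does not give the exact normalisation of its error measure, so the definition below (squared errors summed over source components, averaged over the zone's matching points, expressed in dB) is an assumption, as is the function name `zone_error_db`.

```python
import numpy as np

def zone_error_db(reproduced, desired):
    """Zone error in dB under an assumed definition.

    reproduced, desired: complex arrays of shape (n_components, n_points),
    one row per source/frequency component, one column per matching point.
    """
    per_point = np.sum(np.abs(reproduced - desired) ** 2, axis=0)  # sum over components
    return 10.0 * np.log10(np.mean(per_point))                     # mean over points

# Synthetic example: two components, four matching points, amplitude error 0.1
desired = np.ones((2, 4), dtype=complex)
reproduced = desired + 0.1
print(round(zone_error_db(reproduced, desired), 2))
```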

3. GENERATION OF INDEPENDENT SPEECH SOUNDFIELDS

In this section, independent speech soundfields are generated in a multizone system. Every zone is the target zone of only one speech signal and this effectively generates a set of personal audio spaces within the soundfield. Speech signals are sampled at 8 kHz and the number of loudspeakers used is again L = 300.

3.1. Generation of speech signals at zones

Fig. 6 illustrates the multizone, independent soundfield system used in the multiple speech source scenario. In the system, the radius of all zones is fixed at Rz for simplicity, but different radii are a simple extension of the approach. The area containing the zones is defined by an array of loudspeakers with radius $r_l$, and each loudspeaker is fed the summation of speech signals weighted by the corresponding complex speaker weight, $w_l$. If $G_s$ is the spectrum of the sth speech signal, the speech signal at the matching point m is then the inverse Fourier transform of $\hat{p}$:

$$\hat{p}(r_m, \phi_m) = \sum_{l=1}^{L} \left( \sum_{s=1}^{S} G_s \cdot W_s \right) \frac{e^{-jk\left|\vec{r}_l - \vec{r}_m\right|}}{\left|\vec{r}_l - \vec{r}_m\right|} \qquad (6)$$

where k is a K by 1 vector of wave numbers, $\left|\vec{r}_l - \vec{r}_m\right|$ is a 1 by L vector of the distances between all the loudspeakers and a matching point m, and $W_s$ is a K by L matrix of all loudspeaker weights.

Fig. 6. Diagram of the reproduction of independent speech sound fields in a multizone surround system.

3.2. Simulation results

For the simulation, the same configuration of the multizone surround system used in Section 2 was employed; the zones were positioned as per Figs. 1 and 3. Two utterances were used as the source signals: the first utterance was reproduced in the first active zone (zone 1) and the second utterance in the second active zone (zone 2). The third zone is intended to be a silent zone for both sources, so both speech soundfields are minimised there. Fig. 7 shows the original utterances (“green” and “strong” by a male speaker) and the generated speech derived from the soundfield at the centre of zones 1, 2 and 3. Fig. 8 records the speaker weights used for generating the speech soundfields in zones 1 and 2 independently and also to minimize (in a least squares sense) both soundfields in zone 3. Table 1 shows PESQ values between the source speech signals and speech signals at the matching points in the centre of zones 1, 2 and 3 across a number of source angles.
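Equation (6) of Section 3.1 can be sketched as a frequency-domain computation. This is an illustration under assumptions rather than the authors' code: the function name `pressure_at_point`, the array shapes, the speed of sound, and the treatment of the whole signal as a single FFT block (which implies circular rather than linear convolution) are all choices made here. In a full system the weights for each frequency bin would come from the least squares solution of equation (4); the usage example uses unit weights only as a sanity check.

```python
import numpy as np

def pressure_at_point(speech, weights, dist, fs, c=343.0):
    """Time-domain pressure at one matching point, following eq. (6).

    speech:  (S, n) real time-domain source signals
    weights: (S, K, L) complex loudspeaker weights per source and bin, K = n // 2 + 1
    dist:    (L,) distances |r_l - r_m| from each loudspeaker to the matching point
    c:       assumed speed of sound in m/s
    """
    n = speech.shape[1]
    G = np.fft.rfft(speech, axis=1)                        # (S, K) spectra G_s
    k = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs) / c   # (K,) wave numbers
    feeds = np.einsum("sk,skl->kl", G, weights)            # sum over sources of G_s . W_s
    green = np.exp(-1j * np.outer(k, dist)) / dist         # (K, L) monopole terms
    P = np.sum(feeds * green, axis=1)                      # sum over loudspeakers
    return np.fft.irfft(P, n=n)                            # inverse Fourier transform

# Sanity check: one source, one loudspeaker, unit weights, 10-sample propagation delay
fs, n, delay = 8000, 256, 10
rng = np.random.default_rng(0)
g = rng.standard_normal((1, n))                  # stand-in for a speech frame
d = np.array([343.0 * delay / fs])               # distance that gives a 10-sample delay
W_unit = np.ones((1, n // 2 + 1, 1), dtype=complex)
p_time = pressure_at_point(g, W_unit, d, fs)
print(np.allclose(p_time, np.roll(g[0], delay) / d[0]))
```

With a single loudspeaker and unit weights the result reduces to the source signal delayed by the propagation time and scaled by 1/d, which is what the final comparison checks.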

Fig. 7. Results based on the configuration N = 3, R = 1.5 m, Rz = 0.3 m, rs = 4 m, $\phi_{s1}$ = 0° and $\phi_{s2}$ = 120°: (a) speech 1 (utterance “green”), (b) speech 2 (utterance “strong”), (c) signal at the centre of zone 1, (d) zone 2, and (e) zone 3. [Plots: amplitude versus time (s), 0 to 0.5 s.]

Fig. 8. Speaker weights, N = 3, R = 1.5 m, Rz = 0.3 m, rs = 4 m. Speech 1 (utterance “green”), $\phi_{s1}$ = 0°, zone 1 active; speech 2 (utterance “strong”), $\phi_{s2}$ = 120°, zone 2 active. [Plot: magnitude versus speaker angle (degrees).]

Table 1. PESQ values between source speech signals and speech signals at the centres of zones 1, 2 and 3. Zone 1 is active for speech 1 (at azimuths indicated) and zone 2 is active for speech 2 (at azimuths indicated). N = 3, R = 1.5 m, Rz = 0.3 m, L = 300 and rs = 4 m for both sources.

          (φs1 = 0°, φs2 = 120°)   (φs1 = 0°, φs2 = 90°)   (φs1 = 340.8°, φs2 = 120°)
Zone 1    (2.828, 1.117)           (2.052, 0.932)          (2.722, 1.27)
Zone 2    (1.222, 2.858)           (1.671, 2.368)          (1.025, 2.606)
Zone 3    (1.819, 2.061)           (1.826, 2.345)          (1.276, 1.261)

As can be seen from the results in Table 1, the PESQ values are considerably higher in all simulations for the speech signals in their corresponding active zones. For those source angles where occlusion is a problem, the PESQ values degrade (see column 3 of Table 1). This can be seen clearly if columns 1 and 3 of Table 1 (where $\phi_{s2}$ = 120°) are compared. In column 3, zone 2 is occluded by zone 1 and the PESQ for each source is reduced as compared to column 1. Hence, one of the limitations of the current approach is that the zone positioning is necessarily limited so as to avoid occlusion if high quality multiple speech source reproduction is to be achieved. One issue with multizonal reproduction of speech utterances is that the effects of occlusion are not uniform across all frequencies (as noted by Poletti in [1]). Since speech signals contain a range of frequency components, different utterances (in this case the two test signals) are differently affected in terms of angular error.

4. CONCLUSION

This paper has investigated the generation of multiple independent narrowband soundfields in a multizone 2D surround system. A Least Squares pressure matching approach was adopted and, initially, narrowband, multiple frequency sources were used to investigate zone errors for such sources. From the simulations, it is clear that the predominant error in the system is generated by zonal occlusion effects. The system was then further tested using speech source signals and the effects of the occlusion were measured using PESQ. Occlusion effects are particularly important when considering multi-frequency component signals such as speech, where utterances can be affected differently depending on their frequency spectrum. The results indicate that it is feasible to create a multizone system for multiple users (effectively personal audio spaces) if the zone positioning is selected so as to avoid occlusion effects. The authors are currently investigating other optimization approaches and three-dimensional sound field reproduction to minimize and avoid the effects of zonal occlusion.

5. ACKNOWLEDGMENTS

This work has been supported by the Australian Research Council (ARC) through the grant DP1094053. We would also like to acknowledge useful initial conversations with Dr Mark Poletti.

6. REFERENCES

[1] M. Poletti, “An investigation of 2D multizone surround sound systems,” in Proc. AES 125th Convention, Audio Eng. Society, San Francisco, USA, Oct. 2008.
[2] M. A. Gerzon, “Periphony: With-height sound reproduction,” J. Audio Eng. Soc., vol. 21, pp. 2–10, January 1973.
[3] J. Daniel, R. Nicol, and S. Moreau, “Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging,” Proc. AES 114th Convention, Audio Eng. Society, vol. 51, p. 425, Amsterdam, March 2003.
[4] Y. J. Wu and T. D. Abhayapala, “Spatial Multizone Soundfield Reproduction,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, ICASSP’2009, pp. 93–96, Taipei, April 19–24, 2009.
[5] I. Tashev, “Personal audio space,” http://research.microsoft.com/enus/events/techfest2007/demos.asx.
[6] ITU-T Recommendation P.862 (2000), “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs.”
[7] M. Poletti, “Robust two-dimensional surround sound reproduction for nonuniform loudspeaker layouts,” J. Audio Eng. Society, vol. 55, no. 7/8, pp. 598–610, July/August 2007.