
Speech Modelling Using Subspace and EM Techniques

Gavin Smith, João F. G. de Freitas, Tony Robinson
Cambridge University Engineering Department, Cambridge CB2 1PZ, England
[email protected], [email protected], [email protected]

Mahesan Niranjan
Department of Computer Science, Sheffield University, Sheffield S1 4DP, England
[email protected]

Abstract

This paper concerns modelling speech using a piecewise stationary linear stochastic state space model. The purpose of the paper is to compare two algorithms for speech model parameter estimation: subspace state space system identification (4SID) and Expectation-Maximisation (EM). The 4SID and EM methods are similar in that both estimate a state sequence (using Kalman filters and Kalman smoothers respectively) and then estimate parameters (using least-squares and maximum likelihood techniques respectively). Reasons why the subspace methods are sub-optimal are discussed. Experiments on real speech show that subspace methods produce better pole estimates and richer formant structure than the EM algorithm, but larger time domain errors. A hybrid is thus proposed which uses a 4SID method to initialise the parameters for the EM algorithm. This hybrid method has small time domain errors while still maintaining rich formant structure. Also, EM convergence appears more rapid. When using a stochastic 4SID algorithm, a constraint guaranteeing positive realness of the covariance estimates is used.

1 Introduction

The structure of the paper is as follows. The theory section introduces the stochastic state space model, 4SID parameter estimation techniques, and the EM-based algorithm. These two algorithms and a hybrid are then compared through experiments on real speech data, the results are discussed and conclusions drawn.

2 Theory

2.1 The state-space model

Speech is split into fixed-length, overlapping frames. During each frame, speech is assumed quasi-stationary and represented as a linear time-invariant state space (SS) stochastic model.

x_{t+1} = A x_t + w_t   (1)
y_t = C x_t + v_t   (2)

x_t ∈ R^{p×1} is the state vector. A ∈ R^{p×p} and C ∈ R^{1×p} are system matrices. The output y_t ∈ R is the speech signal at the microphone. Process and observation noises are modelled as white zero-mean Gaussian stationary noises w_t ∈ R^{p×1} ~ N(0, Q) and v_t ∈ R ~ N(0, R) respectively. The system order is p. The problem definition is to estimate the parameters θ = (A, C, Q, R) from the output data y only. Equations 1 and 2 can be represented in block matrix form and are termed the state sequence and block output equations respectively [7]. Time series are of length N, and i > p is the block size.
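The generative model of equations 1 and 2 can be simulated directly; the sketch below (not from the paper) uses arbitrary assumed numerical values, with a random A rescaled to be stable:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 8    # system order, as used in the paper's experiments
N = 240  # samples per 15 ms frame at 16 kHz

# Hypothetical stable A: random matrix rescaled inside the unit circle.
A = rng.standard_normal((p, p))
A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))
C = rng.standard_normal((1, p))
Q = 0.1 * np.eye(p)      # process noise covariance (assumed)
R = np.array([[0.01]])   # observation noise covariance (assumed)

# Simulate x_{t+1} = A x_t + w_t, y_t = C x_t + v_t
x = np.zeros(p)
y = np.empty(N)
for t in range(N):
    y[t] = (C @ x).item() + rng.normal(0.0, np.sqrt(R[0, 0]))
    x = A @ x + rng.multivariate_normal(np.zeros(p), Q)
```

The identification problem of the paper is the inverse of this loop: recover (A, C, Q, R) from y alone.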

X_{i,N} = A^i X_{0,N} + Δ_i^w W_{0,N-1}   (3)
Y_{0,N-1} = Γ_i X_{0,N} + H_i^w W_{0,N-1} + V_{0,N-1}   (4)

X_{i,N} is a state sequence matrix; its columns are the state vectors from time i to N. X_{0,N} is similarly defined. Y_{0,N-1} is a Hankel matrix of outputs from time 0 to (N-1). W_{0,N-1} and V_{0,N-1} are similarly defined. Δ_i^w is the reversed controllability matrix, Γ_i is the observability matrix and H_i^w is a Toeplitz matrix. These are all defined below, where I_p denotes the p × p identity matrix.

Γ_i def=
  [ C
    CA
    ...
    CA^{i-1} ]

X_{0,N} def= [x_0  x_1  x_2  ...  x_{N-i}]

Δ_i^w def= [A^{i-1} I_p   A^{i-2} I_p   ...   I_p]

Y_{0,N-1} def=
  [ y_0      y_1   ...  y_{N-i}
    y_1      y_2   ...  y_{N-i+1}
    ...                 ...
    y_{i-1}  y_i   ...  y_{N-1} ]

H_i^w def=
  [ 0         0         ...  0  0
    C         0         ...  0  0
    ...                 ...
    CA^{i-2}  CA^{i-3}  ...  C  0 ]
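As a minimal NumPy sketch (not part of the paper), the block Hankel output matrix Y_{0,N-1} and the observability matrix Γ_i can be built directly from these definitions; the function names are illustrative:

```python
import numpy as np

def hankel_outputs(y, i):
    """Y_{0,N-1}: i x (N-i+1) block Hankel matrix of scalar outputs.

    Row r holds y_r, y_{r+1}, ..., y_{r+N-i}."""
    N = len(y)
    return np.array([y[r:r + N - i + 1] for r in range(i)])

def observability(A, C, i):
    """Gamma_i: the stack [C; CA; ...; CA^{i-1}]."""
    rows, M = [], C.copy()
    for _ in range(i):
        rows.append(M)
        M = M @ A
    return np.vstack(rows)
```

For example, `hankel_outputs(np.arange(6.0), 3)` is a 3 × 4 matrix whose first row is [0, 1, 2, 3] and last row is [2, 3, 4, 5].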

A sequence of outputs can be separated into two block output equations containing past and future outputs, denoted with subscripts p and f respectively. With Y_p def= Y_{0,N-1}, Y_f def= Y_{i,N+i-1} and similarly for W and V, and with X_p def= X_{0,N} and X_f def= X_{i,N}, past and future are related by the equations

X_f = A^i X_p + Δ_i^w W_p   (5)
Y_p = Γ_i X_p + H_i^w W_p + V_p   (6)
Y_f = Γ_i X_f + H_i^w W_f + V_f   (7)

2.2 Subspace State Space System Identification (4SID) Techniques

4SID methods are discussed thoroughly by Van Overschee and De Moor [7] and are closely related to instrumental variable (IV) methods [8] [9]. 4SID algorithms are composed of two stages: low-rank approximation and estimation of the extended observability matrix directly from the output data, followed by estimation of either the state sequence or the system matrices or both from its column space. Consider the future output block equation 7. To remove noise effects, matrices undergo an orthogonal projection onto the row space of Y_p, where Y_f/Y_p denotes Y_f Y_p^T (Y_p Y_p^T)^{-1} Y_p. Since the future noise terms are uncorrelated with the past outputs,

Y_f/Y_p = (Γ_i X_f + H_i^w W_f + V_f)/Y_p = Γ_i (X_f/Y_p)   (8)

This orthogonal projection coincides with a minimum error, in the Frobenius norm, between the true data Y_f and its linear prediction from Y_p. Greater flexibility is obtained by weighting the projection with matrices W1 and W2 and analysing W1 (Y_f/Y_p) W2. These weightings relate to weighting errors in the frequency domain, and 4SID and IV methods differ with respect to these weighting matrices. From these projections the state sequence can be determined. These state estimates can be considered as outputs from a parallel bank of Kalman filters, each operating over the previous i output data samples for each state, initialised using zero conditions. Once the states are determined, the system parameters are estimated in a least-squares sense. The particular subspace algorithm used in this paper is the sto_pos algorithm for stochastic systems. Refer to [7] for detailed explanations and software. Although this algorithm introduces a small bias into the parameter estimates, it guarantees positive realness of the covariance sequence, which allows a forward innovations model to be defined.
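The orthogonal projection Y_f/Y_p = Y_f Y_p^T (Y_p Y_p^T)^{-1} Y_p can be sketched numerically as follows (a hypothetical helper, not the paper's software; least squares is used to avoid forming the explicit inverse):

```python
import numpy as np

def project_rows(F, P):
    """Orthogonal projection F/P = F P^T (P P^T)^{-1} P of the rows of F
    onto the row space of P, computed via least squares for stability."""
    # Solve min ||P^T coef - F^T||_F, so coef = (P P^T)^{-1} P F^T,
    # and F/P = (P^T coef)^T.
    coef = np.linalg.lstsq(P.T, F.T, rcond=None)[0]
    return (P.T @ coef).T
```

For instance, projecting F = [[1, 2, 3]] onto the row space of P = [[1, 0, 0], [0, 1, 0]] zeroes the third coordinate, giving [[1, 2, 0]].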

2.3 The Expectation-Maximisation (EM) Technique

Given a sequence of observations y, the maximum likelihood estimate for θ is θ_ML = arg max_θ p(y|θ). Because direct maximisation of this likelihood is difficult, the EM algorithm is used to break it down into iterative maximisations of simpler likelihood functions, generating a new estimate θ_k at each iteration. Rewriting max_θ p(y|θ) in terms of the hidden state sequence x and taking expectations over p(x|y, θ_k),

log p(y|θ) = log p(x, y|θ) − log p(x|y, θ)   (9)
log p(y|θ) = E[log p(x, y|θ)] − E[log p(x|y, θ)]   (10)

Iterative maximisation of the first expectation in equation 10 guarantees an increase in log p(y|θ). This converges to a local or global maximum depending on the initial parameter estimates.

θ_{k+1} = arg max_θ E[log p(x, y|θ)]   (11)

This EM algorithm can be applied to the stochastic state space model to determine optimal parameters θ. Detailed explanations are given in [2] [6]. The EM algorithm consists of two stages per iteration. Firstly, given the current parameter estimates, states are estimated using a Kalman smoother. Secondly, given these states, new parameters are estimated. In this paper we use Zoubin Ghahramani's software (www.gatsby.ucl.ac.uk/~zoubin/). This software employs the Rauch-Tung-Striebel formulation of the Kalman smoother [1] [5] and initialises parameters using factor analysis.
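To make the two-stage structure concrete, here is a deliberately simplified single EM iteration (a sketch, not Ghahramani's implementation): a Kalman filter plus RTS smoother for the E-step, followed by approximate least-squares-style updates of A and C. The cross-covariance correction terms of the exact Shumway–Stoffer M-step, and the Q, R and initial-state updates, are omitted for brevity:

```python
import numpy as np

def em_step(y, A, C, Q, R):
    """One simplified EM iteration for x_{t+1} = A x_t + w_t, y_t = C x_t + v_t.

    E-step: Kalman filter + RTS smoother. M-step: approximate updates for A, C
    (smoother cross-covariance terms omitted -- a sketch, not the exact update).
    y is a 1-D array; C is 1 x p; R is 1 x 1."""
    p, N = A.shape[0], len(y)
    # --- E-step: forward Kalman filter ---
    xf = np.zeros((N, p)); Pf = np.zeros((N, p, p))
    xp, Pp = np.zeros(p), np.eye(p)          # prior: x_0 ~ N(0, I)
    for t in range(N):
        S = C @ Pp @ C.T + R                 # innovation covariance (1 x 1)
        K = Pp @ C.T / S[0, 0]               # Kalman gain (p x 1)
        xf[t] = xp + K[:, 0] * (y[t] - (C @ xp)[0])
        Pf[t] = Pp - K @ C @ Pp
        xp, Pp = A @ xf[t], A @ Pf[t] @ A.T + Q
    # --- E-step: backward Rauch-Tung-Striebel smoother ---
    xs, Ps = xf.copy(), Pf.copy()
    for t in range(N - 2, -1, -1):
        Pp_t = A @ Pf[t] @ A.T + Q
        J = Pf[t] @ A.T @ np.linalg.inv(Pp_t)
        xs[t] = xf[t] + J @ (xs[t + 1] - A @ xf[t])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp_t) @ J.T
    # --- M-step: approximate second moments from smoothed estimates ---
    S00 = sum(np.outer(xs[t], xs[t]) + Ps[t] for t in range(N - 1))
    S10 = sum(np.outer(xs[t + 1], xs[t]) for t in range(N - 1))
    A_new = S10 @ np.linalg.inv(S00)
    Sall = sum(np.outer(xs[t], xs[t]) + Ps[t] for t in range(N))
    C_new = (y @ xs) @ np.linalg.inv(Sall)
    return A_new, C_new.reshape(1, p), xs
```

Iterating `em_step`, feeding each new (A, C) back in, mirrors the E-step/M-step alternation described above.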

3 Experiments

Experiments are conducted on the phrase "in arithmetic", spoken by an adult male. The speech waveform is obtained from the Eurom 0 database [3] and sampled at 16 kHz. Speech is preemphasised prior to analysis to reduce the effects of lip radiation. Due to non-stationarity, the speech waveform is divided into fixed-length, overlapping and windowed (Hamming) analysis frames 15 ms in duration, shifted 7.5 ms each frame. Speech is assumed stationary within each frame. Overlap gives smoother parameter transitions and better quality reconstruction. Reconstruction employs the overlap-add method. All models are order 8. In these experiments, three algorithms are compared: a subspace algorithm, an EM algorithm, and a subspace-EM hybrid. The algorithms are compared in terms of their rates of convergence (if iterative), the sum squared error between the true speech (as windowed frames) and the reconstructed speech¹, and the phase of their pole estimates (which corresponds to formant frequencies).
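The framing described above (15 ms Hamming windows, 7.5 ms shift at 16 kHz, overlap-add reconstruction) can be sketched as follows; the function names are illustrative, not from the paper's software:

```python
import numpy as np

def frames(signal, frame_len=240, hop=120):
    """Split into Hamming-windowed frames: 15 ms length (240 samples)
    and 7.5 ms shift (120 samples) at 16 kHz."""
    w = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([w * signal[s:s + frame_len] for s in starts])

def overlap_add(framed, hop=120):
    """Reconstruct by summing overlapped frames; with 50% overlap the
    Hamming windows sum to a roughly constant gain."""
    n_frames, frame_len = framed.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for k, f in enumerate(framed):
        out[k * hop:k * hop + frame_len] += f
    return out
```

In the paper's pipeline, each row of `frames(...)` would be modelled by one order-8 state space model and the per-frame reconstructions recombined with `overlap_add`.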


Figure 1: Convergence graphs for the EM and hybrid methods (normalised log likelihood against iterations).

¹ Speech is reconstructed from the model using a Kalman smoother. Given the parameter estimates, the state sequence is estimated by one run of the smoother, and the speech is then resynthesised using the equation y_t = C x_t.


Figure 2: Speech waveform for "in arithmetic" and error between true windowed and reconstructed frames. [- - subspace, — EM, · · · hybrid]


Figure 3: Pole frequency estimates for subspace, EM and hybrid methods

For the subspace method, the algorithm sto_pos is used with a block size of 20, and the parameters (A, C, Q, R) are estimated in one step. For the EM algorithm, the software written by Ghahramani is used. This initialises the initial state vector, covariance and other matrices using factor analysis, and 30 EM iterations are used to re-estimate the parameters. The hybrid algorithm is similar to the EM method; the only difference is that the subspace method sto_pos is used to initialise (A, C, Q, R), with the initial state vector and covariance matrix set to zeros and identity respectively, before the 30 EM iterations are run. Results are shown in Figures 1, 2 and 3. Figure 1 shows the normalised likelihood of the output data given the state sequence for the EM and hybrid methods as a function of the number of iterations, with the traces for all 94 frames superimposed. The hybrid method shows better convergence. Figure 2 shows the preemphasised speech waveform and the errors between the true windowed data and the reconstructed data. The EM and hybrid methods show smaller errors than the subspace method. Figure 3 shows estimates of the positive frequency poles against frame number. The subspace and hybrid methods show richer formant structure (which is in approximate agreement with the formant patterns when this speech is analysed using conventional polynomial modelling methods).

4 Discussion

Consider first the pole estimates in figure 3 for the subspace, EM and hybrid algorithms. The phases of these poles give the normalised frequencies of the speech formants (0 to 8 kHz). Subspace methods produce a much richer formant structure than EM, and the plot shows how formants change and evolve during speech. This is despite the fact that EM produces smaller sum squared errors between the true and reconstructed data in the time domain, which perhaps demonstrates the multimodality of the likelihood surface p(y|θ). The hybrid algorithm captures the best of both: it has a smaller time domain error, but still captures the rich formant structure. Moreover, from figure 1 it appears that the hybrid algorithm converges more rapidly than the EM algorithm.

Both the subspace and EM algorithms employ similar methodologies: states are first estimated and then used to estimate system parameters. In EM, states are estimated using both past and future outputs with a Kalman smoother, and then system parameters are estimated from the states in a maximum likelihood (ML) framework. In 4SID, states are estimated as the outputs of non-steady-state Kalman filters using the previous i data samples only, and then system matrices are estimated using least squares (LS) with positive realness of the covariance sequence as a constraint. Subspace algorithms are therefore sub-optimal for three reasons. Firstly, states are estimated using only partial output data. Secondly, the LS criterion is only an approximation to the ML criterion. Thirdly, the positive realness constraint introduces bias. This positive realness constraint is necessary for the subspace method: without it, 25 of the 94 frames generate covariance estimates that are not positive real. Reasons for this include the finite amount of data, and the fact that the stochastic SS model is only an approximation (perhaps especially for voiced sounds).

Subspace methods also have some advantages. Firstly, they are non-iterative and linear, and so do not suffer from the disadvantages typical of iterative algorithms (including EM) such as sensitivity to initial conditions, convergence to local optima, and definition of convergence criteria. Secondly, they require little prior parameterisation except definition of the system order, which can be determined in situ from observation of the singular values of the orthogonal projection. Thirdly, the use of the SVD gives numerical robustness to the algorithms.
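The mapping from pole phase to formant frequency discussed above can be sketched as follows (a hypothetical helper; with 16 kHz sampling, phase π corresponds to the 8 kHz Nyquist frequency):

```python
import numpy as np

def pole_frequencies(A, fs=16000):
    """Map the phases of the eigenvalues (poles) of A to frequencies in Hz.

    Only positive-phase poles are kept (complex poles come in conjugate
    pairs); phase pi maps to fs/2 = 8 kHz."""
    phases = np.angle(np.linalg.eigvals(A))
    pos = phases[phases > 0]
    return np.sort(pos * fs / (2 * np.pi))
```

For example, a 2 × 2 rotation matrix with angle π/4 has poles e^{±iπ/4}, giving a single positive-frequency estimate of 2 kHz.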

5 Conclusions

This paper shows that a hybrid subspace-EM algorithm can lead to speech models with small time-domain errors while still maintaining rich formant structure. In the future we hope to determine whether these results generalise to other subspace algorithms, adopt a more principled probabilistic approach to subspace methods, relate these to EM, take advantage of weighting matrices to weight errors in the frequency domain, and apply these methods to the speech enhancement problem.

Acknowledgements

We are grateful for the use of the 4SID software supplied with [7] and the EM software of Zoubin Ghahramani (www.gatsby.ucl.ac.uk/~zoubin/). Gavin Smith is supported by the Schiff Foundation, Cambridge University. Nando de Freitas is supported by two University of the Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship (South Africa), an ORS award and a Trinity College External Research Studentship (Cambridge).

6 References

[1] Gelb, A. (ed.) (1974) Applied Optimal Estimation. Cambridge, MA: MIT Press.
[2] Ghahramani, Z. & Hinton, G. (1996) Parameter estimation for linear dynamical systems. Tech. Rep. CRG-TR-96-2, Dept of Computer Science, University of Toronto, http://www.gatsby.ucl.ac.uk/~zoubin/papers.html.
[3] Grice, M. & Barry, W. (1989) Multi-lingual speech input/output: Assessment, methodology and standardization. Tech. rep., University College, London, ESPRIT Project 1541 (SAM), extension phase final report.
[4] Hansen, P. (1997) Signal Subspace Methods for Speech Enhancement. Ph.D. thesis, IMM Dept of Mathematical Modelling, Technical University of Denmark.
[5] Rauch, H., Tung, F. & Striebel, C. (1965) Maximum likelihood estimates of linear dynamic systems. AIAA Journal, vol. 3, no. 8, pp. 1445-1450.
[6] Shumway, R. & Stoffer, D. (1982) An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, vol. 3, no. 4, pp. 253-264.
[7] Van Overschee, P. & De Moor, B. (1996) Subspace Identification for Linear Systems: Theory, Implementation, Applications. Dordrecht, Netherlands: Kluwer Academic Publishers.
[8] Viberg, M. (1995) Subspace-based methods for the identification of linear time-invariant systems. Automatica, vol. 31, no. 12, pp. 1835-1851.
[9] Viberg, M., Wahlberg, B. & Ottersten, B. (1997) Analysis of state space system identification methods based on instrumental variables and subspace fitting. Automatica, vol. 33, no. 9, pp. 1603-1616.