Hidden Markov Models and Gaussian Mixture Models

Steve Renals and Peter Bell

Automatic Speech Recognition: ASR Lectures 4 & 5, 28/31 January 2013

Overview

Key models and algorithms for HMM acoustic models:
Gaussians
GMMs: Gaussian mixture models
HMMs: hidden Markov models
HMM algorithms:
Likelihood computation (the forward algorithm)
Most probable state sequence (the Viterbi algorithm)
Estimating the parameters (the EM algorithm)
Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = arg max_W P(W | X)

Applying Bayes' Theorem:

    P(W | X) = p(X | W) P(W) / p(X) ∝ p(X | W) P(W)

    W* = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Acoustic Modelling

[Figure: ASR system architecture. Recorded speech is converted by signal analysis into acoustic features; the acoustic model (a hidden Markov model), the lexicon and the language model, all estimated from training data, define the search space that is searched to produce the decoded text (transcription).]
Hierarchical modelling of speech

[Figure: generative hierarchy. An utterance ("No right") generates a word sequence (NO, RIGHT); each word generates a subword (phone) sequence (n, oh; r, ai, t); each phone is modelled by an HMM; each HMM generates acoustic feature vectors x.]

Acoustic Model: Continuous Density HMM

[Figure: a left-to-right HMM with entry state s_I, emitting states s_1, s_2, s_3 and exit state s_E. Transitions carry probabilities P(s_1 | s_I), P(s_1 | s_1), P(s_2 | s_1), P(s_2 | s_2), P(s_3 | s_2), P(s_3 | s_3), P(s_E | s_3); each emitting state s_j has an output density p(x | s_j) over acoustic vectors x.]

A probabilistic finite state automaton with parameters λ:
Transition probabilities: a_{kj} = P(s_j | s_k)
Output probability density function: b_j(x) = p(x | s_j)
Acoustic Model: Continuous Density HMM

[Figure: the same HMM unrolled against an observation sequence x_1, ..., x_6, showing each observation being generated by one of the emitting states s_1, s_2, s_3 on a path from s_I to s_E.]

HMM Assumptions

[Figure: graphical model of an HMM. The hidden states s(t-1), s(t), s(t+1) form a Markov chain, and each state generates the corresponding observation x(t-1), x(t), x(t+1) through the HMM output distribution.]

1. Observation independence: an acoustic observation x is conditionally independent of all other observations given the state that generated it.
2. Markov process: a state is conditionally independent of all other states given the previous state.
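To make the generative reading of these assumptions concrete, here is a minimal sampling sketch (not from the lecture; the three-state topology, Gaussian parameters and exit probability are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state left-to-right HMM. Each row of A plus the matching entry of
# exit_prob sums to 1, so exit_prob[k] plays the role of P(s_E | s_k).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.7]])
exit_prob = np.array([0.0, 0.0, 0.3])
means = np.array([[0.0, 0.0], [3.0, 1.0], [6.0, -1.0]])          # per-state means
covs = np.stack([np.eye(2) * v for v in (0.5, 1.0, 0.8)])        # per-state covariances

def sample(max_len=50):
    """Generate (state sequence, observations) from the toy HMM."""
    state, states, obs = 0, [], []
    for _ in range(max_len):
        states.append(state)
        # Observation independence: x_t depends only on the current state.
        obs.append(rng.multivariate_normal(means[state], covs[state]))
        if rng.random() < exit_prob[state]:       # leave via the exit state
            break
        # Markov process: the next state depends only on the current state.
        state = rng.choice(3, p=A[state] / A[state].sum())
    return states, np.array(obs)

states, X = sample()
print(states, X.shape)
```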
Output distribution

[Figure: the HMM from the previous slide, highlighting the output densities p(x | s_1), p(x | s_2), p(x | s_3) attached to the emitting states.]

Single multivariate Gaussian with mean µ_j and covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

Background: cdf

Consider a real-valued random variable X.
Cumulative distribution function (cdf) F(x) for X:

    F(x) = P(X ≤ x)

To obtain the probability of falling in an interval we can do the following:

    P(a < X ≤ b) = P(X ≤ b) - P(X ≤ a) = F(b) - F(a)
Background: pdf

The rate of change of the cdf gives us the probability density function (pdf), p(x):

    p(x) = d/dx F(x) = F'(x)

    F(x) = ∫_{-∞}^{x} p(x) dx

p(x) is not the probability that X has value x, but the pdf is proportional to the probability that X lies in a small interval centred on x.
Notation: p for a pdf, P for a probability.

The Gaussian distribution (univariate)

The Gaussian (or normal) distribution is the most common (and most easily analysed) continuous distribution.
It is also a reasonable model in many situations (the famous "bell curve").
If a (scalar) variable has a Gaussian distribution, then its probability density function has the form

    p(x | µ, σ²) = N(x; µ, σ²) = (1 / √(2πσ²)) exp(-(x - µ)² / (2σ²))

The Gaussian is described by two parameters:
the mean µ (location)
the variance σ² (dispersion)
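As a quick sanity check of the formula, a minimal NumPy sketch (illustrative, not part of the lecture notes) that evaluates the univariate Gaussian pdf:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """N(x; mu, var) = exp(-(x - mu)^2 / (2 var)) / sqrt(2 pi var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# The peak of a zero-mean, unit-variance Gaussian is 1/sqrt(2 pi), roughly 0.399.
print(gaussian_pdf(np.array([-1.0, 0.0, 1.0]), mu=0.0, var=1.0))
```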
Plot of Gaussian distribution

One-dimensional Gaussian with zero mean and unit variance (µ = 0, σ² = 1):

[Figure: pdf of the Gaussian distribution with mean 0 and variance 1, plotted over roughly -4 ≤ x ≤ 4; the peak value is about 0.4 at x = 0.]

Properties of the Gaussian distribution

    N(x; µ, σ²) = (1 / √(2πσ²)) exp(-(x - µ)² / (2σ²))

Gaussians have the same shape, with the location controlled by the mean and the spread controlled by the variance.

[Figure: pdfs of Gaussian distributions with mean 0 and variances 1, 2 and 4; larger variance gives a wider, flatter curve.]
Parameter estimation

Estimate the mean and variance parameters of a Gaussian from data x^1, x^2, ..., x^n.
Use the sample mean and sample variance estimates:

    µ = (1/n) ∑_{i=1}^{n} x^i             (sample mean)

    σ² = (1/n) ∑_{i=1}^{n} (x^i - µ)²     (sample variance)

Exercise

Consider the log likelihood of a set of N data points {x^1, ..., x^N} being generated by a Gaussian with mean µ and variance σ²:

    L = ln p({x^1, ..., x^N} | µ, σ²) = -(1/2) ∑_{n=1}^{N} ( (x^n - µ)²/σ² + ln σ² + ln(2π) )
      = -(1/(2σ²)) ∑_{n=1}^{N} (x^n - µ)² - (N/2) ln σ² - (N/2) ln(2π)

By maximising the log likelihood function with respect to µ, show that the maximum likelihood estimate for the mean is indeed the sample mean:

    µ_ML = (1/N) ∑_{n=1}^{N} x^n
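A small numerical sketch of the result (the synthetic data and NumPy usage are assumptions here, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # true mean 2, true variance 9

mu_ml = x.mean()                      # sample mean = ML estimate of the mean
var_ml = ((x - mu_ml) ** 2).mean()    # sample variance = ML estimate (divides by n)

print(mu_ml, var_ml)                  # close to 2 and 9 for a large sample
```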
The multidimensional Gaussian distribution

The d-dimensional vector x is multivariate Gaussian if it has a probability density function of the following form:

    p(x | µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( -(1/2) (x - µ)ᵀ Σ⁻¹ (x - µ) )

The pdf is parameterized by the mean vector µ and the covariance matrix Σ.
The 1-dimensional Gaussian is a special case of this pdf.
The argument to the exponential, (1/2)(x - µ)ᵀ Σ⁻¹ (x - µ), is referred to as a quadratic form.

Covariance matrix

The mean vector µ is the expectation of x:

    µ = E[x]

The covariance matrix Σ is the expectation of the deviation of x from the mean:

    Σ = E[(x - µ)(x - µ)ᵀ]

Σ is a d × d symmetric matrix:

    Σ_{ij} = E[(x_i - µ_i)(x_j - µ_j)] = E[(x_j - µ_j)(x_i - µ_i)] = Σ_{ji}

The sign of the covariance helps to determine the relationship between two components:
If x_j is large when x_i is large, then (x_j - µ_j)(x_i - µ_i) will tend to be positive.
If x_j is small when x_i is large, then (x_j - µ_j)(x_i - µ_i) will tend to be negative.
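A hedged sketch of evaluating the multivariate Gaussian log-density directly from this formula (the 2-D covariance values below are invented; production code would typically use a Cholesky factorisation or a library routine):

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a d-dimensional vector x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # quadratic form (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|, computed stably
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])       # an arbitrary 2-D covariance
print(mvn_logpdf(np.array([0.5, -0.5]), mu, Sigma))
```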
Spherical Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 0; 0 1] (ρ_12 = 0): circular contours.]

Diagonal Covariance Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 0; 0 4] (ρ_12 = 0): axis-aligned elliptical contours, wider along x2.]

Full covariance Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 -1; -1 4] (ρ_12 = -0.5): tilted elliptical contours.]

Parameter estimation

It is possible to show that the mean vector µ̂ and covariance matrix Σ̂ that maximize the likelihood of the training data are given by:

    µ̂ = (1/N) ∑_{n=1}^{N} x^n

    Σ̂ = (1/N) ∑_{n=1}^{N} (x^n - µ̂)(x^n - µ̂)ᵀ

The mean of the distribution is estimated by the sample mean and the covariance by the sample covariance.
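A minimal sketch of these estimates on synthetic data (the "true" parameters below are simply borrowed from the full-covariance example above):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([0.0, 0.0])
true_Sigma = np.array([[1.0, -1.0], [-1.0, 4.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x d data matrix

mu_hat = X.mean(axis=0)                   # sample mean vector
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)        # sample covariance (ML estimate, divides by N)

print(mu_hat)
print(Sigma_hat)                          # close to true_Sigma for a large sample
```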
Example data

[Figure: scatter plot of two-dimensional example data points (X1, X2).]

Maximum likelihood fit to a Gaussian

[Figure: the same data with the maximum likelihood Gaussian fit overlaid.]
Data in clusters (example 1)

[Figure: two-dimensional data drawn from two clusters with µ_1 = [0 0]ᵀ, µ_2 = [1 1]ᵀ and Σ_1 = Σ_2 = 0.2 I.]

Example 1 fit by a Gaussian

[Figure: the same two-cluster data fitted by a single Gaussian, whose contours straddle both clusters.]
k-means clustering

k-means is an automatic procedure for clustering unlabelled data.
Requires a prespecified number of clusters.
The clustering algorithm chooses a set of clusters with the minimum within-cluster variance.
Guaranteed to converge (eventually).
The clustering solution is dependent on the initialisation.

k-means example

[Figure: data set of 14 two-dimensional points: (4,13), (2,9), (7,8), (6,6), (4,5), (1,2), (3,1), (10,0), (7,6), (10,5), (5,4), (8,4), (5,2), (1,1), to be clustered with k = 3; three initial centres are chosen.]

[Figure: iteration 1. Points are assigned to their nearest centre and the centres are recomputed as the cluster means, moving to (4.33, 10), (3.57, 3) and (8.75, 3.75).]

[Figure: iteration 2. Points are reassigned and the centres recomputed, moving to (4.33, 10), (3.17, 2.5) and (8.2, 4.2).]

[Figure: iteration 3. Reassigning the points changes nothing, so the algorithm has converged.]
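A compact sketch of the algorithm on the example data (the choice of initial centres below is an assumption; the slides do not state which points were used, and a different initialisation can give a different final clustering):

```python
import numpy as np

# The 14 two-dimensional points from the worked example.
X = np.array([(4, 13), (2, 9), (7, 8), (6, 6), (4, 5), (1, 2), (3, 1), (10, 0),
              (7, 6), (10, 5), (5, 4), (8, 4), (5, 2), (1, 1)], dtype=float)

def kmeans(X, centres, n_iter=20):
    """Plain k-means: alternate assignment to nearest centre and centre re-estimation."""
    k = len(centres)
    for _ in range(n_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Recompute each centre as the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):     # no centre moved: converged
            break
        centres = new_centres
    return centres, assign

centres, assign = kmeans(X, centres=X[[0, 5, 7]].copy())   # assumed initial centres
print(np.round(centres, 2))
print(assign)
```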
Mixture model

A more flexible form of density estimation is made up of a linear combination of component densities:

    p(x) = ∑_{j=1}^{M} p(x | j) P(j)

This is called a mixture model or a mixture density.
p(x | j): component densities
P(j): mixing parameters
Generative model:
1. Choose a mixture component based on P(j)
2. Generate a data point x from the chosen component using p(x | j)

Component occupation probability

We can apply Bayes' theorem:

    P(j | x) = p(x | j) P(j) / p(x) = p(x | j) P(j) / ∑_{j'=1}^{M} p(x | j') P(j')

The posterior probabilities P(j | x) give the probability that component j was responsible for generating data point x.
The P(j | x) are called the component occupation probabilities (or sometimes the responsibilities).
Since they are posterior probabilities:

    ∑_{j=1}^{M} P(j | x) = 1

Gaussian mixture model

The most important mixture model is the Gaussian mixture model (GMM), where the component densities are Gaussians.
Consider a GMM where each component Gaussian N_j(x; µ_j, σ_j²) has mean µ_j and a spherical covariance Σ = σ² I:

    p(x) = ∑_{j=1}^{M} P(j) p(x | j) = ∑_{j=1}^{M} P(j) N_j(x; µ_j, σ_j²)

[Figure: the GMM drawn as a network. The input vector x = (x1, ..., xd) feeds the component densities p(x | 1), ..., p(x | M), which are weighted by the mixing parameters P(1), ..., P(M) and summed to give p(x).]

Parameter estimation

If we knew which mixture component was responsible for each data point:
we would be able to assign each point unambiguously to a mixture component,
we could estimate the mean for each component Gaussian as the sample mean (just like k-means clustering),
and we could estimate the covariance as the sample covariance.
But we don't know which mixture component a data point comes from...
Maybe we could use the component occupation probabilities P(j | x)?
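A small sketch of the mixture density and the responsibilities for a toy one-dimensional GMM (the parameter values are invented for illustration):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy 1-D GMM with M = 2 components.
priors = np.array([0.4, 0.6])             # mixing parameters P(j)
mus = np.array([-1.0, 2.0])
variances = np.array([0.5, 1.0])

x = 0.3
lik = gauss(x, mus, variances)            # p(x | j) for each component
px = (priors * lik).sum()                 # mixture density p(x)
resp = priors * lik / px                  # component occupation probabilities P(j | x)

print(px, resp, resp.sum())               # the responsibilities sum to 1
```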
GMM parameter estimation when we know which component generated the data

Define the indicator variable z_{jn} = 1 if component j generated data point x^n (and 0 otherwise).
If z_{jn} wasn't hidden, then we could count the number of observed data points generated by j:

    N_j = ∑_{n=1}^{N} z_{jn}

And estimate the mean, variance and mixing parameters as:

    µ̂_j = ∑_n z_{jn} x^n / N_j

    σ̂_j² = ∑_n z_{jn} ||x^n - µ̂_j||² / N_j

    P̂(j) = (1/N) ∑_n z_{jn} = N_j / N

Soft assignment

Estimate "soft counts" based on the component occupation probabilities P(j | x^n):

    N_j* = ∑_{n=1}^{N} P(j | x^n)

We can imagine assigning data points to component j weighted by the component occupation probability P(j | x^n).
So we could imagine estimating the mean, variance and prior probabilities as:

    µ̂_j = ∑_n P(j | x^n) x^n / ∑_n P(j | x^n) = ∑_n P(j | x^n) x^n / N_j*

    σ̂_j² = ∑_n P(j | x^n) ||x^n - µ̂_j||² / ∑_n P(j | x^n) = ∑_n P(j | x^n) ||x^n - µ̂_j||² / N_j*

    P̂(j) = (1/N) ∑_n P(j | x^n) = N_j* / N

EM algorithm

Problem! Recall that:

    P(j | x) = p(x | j) P(j) / p(x)

We need to know p(x | j) and P(j) to estimate the parameters of p(x | j) and to estimate P(j)...
Solution: an iterative algorithm where each iteration has two parts:
Compute the component occupation probabilities P(j | x) using the current estimates of the GMM parameters (means, variances, mixing parameters) (E-step).
Compute the GMM parameters using the current estimates of the component occupation probabilities (M-step).
Starting from some initialization (e.g. using k-means for the means), these steps are alternated until convergence.
This is called the EM algorithm and can be shown to maximize the likelihood.

Maximum likelihood parameter estimation

The likelihood of a data set X = {x^1, x^2, ..., x^N} is given by:

    L = ∏_{n=1}^{N} p(x^n) = ∏_{n=1}^{N} ∑_{j=1}^{M} p(x^n | j) P(j)

We can regard the negative log likelihood as an error function:

    E = -ln L = -∑_{n=1}^{N} ln p(x^n) = -∑_{n=1}^{N} ln ∑_{j=1}^{M} p(x^n | j) P(j)

Considering the derivatives of E with respect to the parameters gives expressions like those on the previous slide.
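A compact EM sketch for a one-dimensional two-component GMM (synthetic data; the initialisation, iteration count and parameter values are arbitrary choices for illustration, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data from two zero-mean Gaussians with very different variances.
x = np.concatenate([rng.normal(0.0, np.sqrt(0.1), 500),
                    rng.normal(0.0, np.sqrt(2.0), 500)])

M = 2
P = np.full(M, 1.0 / M)                  # mixing parameters P(j)
mu = np.array([-0.5, 0.5])               # crude initialisation (k-means would be better)
var = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: component occupation probabilities P(j | x_n).
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = P * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the soft counts N_j* = sum_n P(j | x_n).
    Nj = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nj
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    P = Nj / len(x)

print(np.round(mu, 2), np.round(var, 2), np.round(P, 2))
```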
Example 1 fit using a GMM

[Figure: the two-cluster data of example 1 fitted with a two-component GMM; each component sits on one of the clusters.]

Peakily distributed data (Example 2)

[Figure: two-dimensional data drawn from two zero-mean Gaussians, µ_1 = µ_2 = [0 0]ᵀ, with Σ_1 = 0.1 I and Σ_2 = 2 I: a sharp central peak on top of a broad background.]
Example 2 fit by a Gaussian

[Figure: the peaky data (µ_1 = µ_2 = [0 0]ᵀ, Σ_1 = 0.1 I, Σ_2 = 2 I) fitted by a single Gaussian, which cannot capture both the sharp peak and the broad tails.]

Example 2 fit by a GMM

[Figure: the same data fitted with a two-component GMM using EM; the mixture captures both the narrow and the broad component.]
Example 2: component Gaussians

[Figure: the two component Gaussians of the fitted GMM shown separately: one narrow, one broad.]

Comments on GMMs

GMMs trained using the EM algorithm are able to self-organize to fit a data set.
Individual components take responsibility for parts of the data set (probabilistically).
Soft assignment to components, not hard assignment: "soft clustering".
GMMs scale very well; for example, large speech recognition systems can have 30,000 GMMs, each with 32 components: sometimes 1 million Gaussian components!
And the parameters are all estimated from (a lot of) data by EM.
Back to HMMs...

[Figure: the continuous-density HMM again, with states s_I, s_1, s_2, s_3, s_E, transition probabilities P(s_j | s_i) and output densities p(x | s_j).]

Output distribution:
Single multivariate Gaussian with mean µ_j and covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

The three problems of HMMs

Working with HMMs requires the solution of three problems:
1. Likelihood: determine the overall likelihood of an observation sequence X = (x_1, ..., x_t, ..., x_T) being generated by an HMM.
2. Decoding: given an observation sequence and an HMM, determine the most probable hidden state sequence.
3. Training: given an observation sequence and an HMM, learn the best HMM parameters λ = {{a_{jk}}, {b_j()}}.
1. Likelihood: The Forward algorithm

Goal: determine p(X | λ).
Sum over all possible state sequences s_1 s_2 ... s_T that could result in the observation sequence X.
Rather than enumerating each sequence, compute the probabilities recursively (exploiting the Markov assumption).

Recursive algorithms on HMMs

Visualize the problem as a state-time trellis.

[Figure: state-time trellis with states i, j, k as rows and times t-1, t, t+1 as columns; every state at time t connects to every state at time t+1.]

1. Likelihood: The Forward recursion

Forward probability, α_t(s_j): the probability of observing the observation sequence x_1 ... x_t and being in state s_j at time t:

    α_t(s_j) = p(x_1, ..., x_t, S(t) = s_j | λ)

Initialization:

    α_0(s_I) = 1
    α_0(s_j) = 0  if s_j ≠ s_I

Recursion:

    α_t(s_j) = ∑_{i=1}^{N} α_{t-1}(s_i) a_{ij} b_j(x_t)

Termination:

    p(X | λ) = α_T(s_E) = ∑_{i=1}^{N} α_T(s_i) a_{iE}
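A sketch of the forward recursion in NumPy (the conventions are assumptions made for this sketch: `a_entry[j]` = P(s_j | s_I), `A[i, j]` = a_ij, `a_exit[i]` = P(s_E | s_i), and `b[t, j]` = b_j(x_t) precomputed for each frame; the toy numbers are invented):

```python
import numpy as np

def forward_likelihood(a_entry, A, a_exit, b):
    """p(X | lambda) via the forward recursion."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0] = a_entry * b[0]                   # alpha_1(j) = a_Ij b_j(x_1)
    for t in range(1, T):
        # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(x_t)
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    return alpha[-1] @ a_exit                   # termination: sum_i alpha_T(i) a_iE

# Tiny 3-state, 3-frame example with invented numbers.
a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.2, 0.7]])
print(forward_likelihood(a_entry, A, a_exit, b))
```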
1. Likelihood: Forward Recursion

[Figure: the forward recursion on the state-time trellis. α_t(s_i) is computed by summing the predecessor probabilities α_{t-1}(s_i), α_{t-1}(s_j), α_{t-1}(s_k), each weighted by its transition probability, and multiplying by the output probability of the current observation.]

Viterbi approximation

Instead of summing over all possible state sequences, just consider the most likely.
Achieve this by changing the summation to a maximisation in the recursion:

    V_t(s_j) = max_i V_{t-1}(s_i) a_{ij} b_j(x_t)

Changing the recursion in this way gives the likelihood of the most probable path.
We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path.

Viterbi Recursion

[Figure: the Viterbi recursion on the trellis. V_t(s_i) takes the maximum over the predecessor scores V_{t-1}(s_i), V_{t-1}(s_j), V_{t-1}(s_k), each weighted by its transition probability, multiplied by the output probability; the backpointer bt_t(s_i) records which previous state gave the maximum.]
2. Decoding: The Viterbi algorithm

Initialization:

    V_0(s_I) = 1
    V_0(s_j) = 0   if s_j ≠ s_I
    bt_0(s_j) = 0

Recursion:

    V_t(s_j) = max_{i=1}^{N} V_{t-1}(s_i) a_{ij} b_j(x_t)

    bt_t(s_j) = arg max_{i=1}^{N} V_{t-1}(s_i) a_{ij} b_j(x_t)

Termination:

    P* = V_T(s_E) = max_{i=1}^{N} V_T(s_i) a_{iE}

    s_T* = bt_T(s_E) = arg max_{i=1}^{N} V_T(s_i) a_{iE}

Viterbi Backtrace

Backtrace to find the state sequence of the most probable path.

[Figure: the backtrace follows the stored backpointers from the final state back through the trellis, e.g. bt_t(s_i) = s_j and bt_{t+1}(s_k) = s_i.]
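A matching sketch of the Viterbi recursion with backpointers and backtrace (the same assumed conventions and invented toy numbers as the forward sketch above; probabilities are kept in the linear domain here for readability):

```python
import numpy as np

def viterbi(a_entry, A, a_exit, b):
    """Most probable state sequence and its likelihood."""
    T, N = b.shape
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = a_entry * b[0]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A           # scores[i, j] = V_{t-1}(i) a_ij
        back[t] = scores.argmax(axis=0)          # backpointer: best previous state for each j
        V[t] = scores.max(axis=0) * b[t]         # V_t(j) = max_i V_{t-1}(i) a_ij b_j(x_t)
    last = int(np.argmax(V[-1] * a_exit))        # best final state before s_E
    p_star = V[-1, last] * a_exit[last]
    path = [last]                                # Viterbi backtrace
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return p_star, path[::-1]

a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.2, 0.7]])
print(viterbi(a_entry, A, a_exit, b))
```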
3. Training: Forward-Backward algorithm

Goal: efficiently estimate the parameters of an HMM λ from an observation sequence.
Assume a single Gaussian output probability distribution:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

Parameters λ:
Transition probabilities a_{ij}, with ∑_j a_{ij} = 1
Gaussian parameters for state s_j: mean vector µ_j; covariance matrix Σ_j

Viterbi Training

If we knew the state-time alignment, then each observation feature vector could be assigned to a specific state.
A state-time alignment can be obtained using the most probable path obtained by Viterbi decoding.
Maximum likelihood estimate of a_{ij}, if C(s_i → s_j) is the count of transitions from s_i to s_j:

    â_{ij} = C(s_i → s_j) / ∑_k C(s_i → s_k)

Likewise, if Z_j is the set of observed acoustic feature vectors assigned to state j, we can use the standard maximum likelihood estimates for the mean and the covariance:

    µ̂_j = ∑_{x ∈ Z_j} x / |Z_j|

    Σ̂_j = ∑_{x ∈ Z_j} (x - µ̂_j)(x - µ̂_j)ᵀ / |Z_j|

EM Algorithm

Viterbi training is an approximation: we would like to consider all possible paths.
In this case, rather than having a hard state-time alignment, we estimate a probability.
State occupation probability: the probability γ_t(s_j) of occupying state s_j at time t given the sequence of observations. Compare with the component occupation probability in a GMM.
We can use this for an iterative algorithm for HMM training: the EM algorithm.
Each iteration has two steps:
E-step: estimate the state occupation probabilities (Expectation)
M-step: re-estimate the HMM parameters based on the estimated state occupation probabilities (Maximisation)

Backward probabilities

To estimate the state occupation probabilities it is useful to define (recursively) another set of probabilities, the backward probabilities:

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

the probability of the future observations given that the HMM is in state s_j at time t.
These can be recursively computed (going backwards in time).
Initialisation:

    β_T(s_i) = a_{iE}

Recursion:

    β_t(s_i) = ∑_{j=1}^{N} a_{ij} b_j(x_{t+1}) β_{t+1}(s_j)

Termination:

    p(X | λ) = β_0(s_I) = ∑_{j=1}^{N} a_{Ij} b_j(x_1) β_1(s_j) = α_T(s_E)
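A sketch of the backward recursion under the same assumed conventions as the earlier forward/Viterbi sketches; its termination value should match the forward likelihood:

```python
import numpy as np

def backward(a_entry, A, a_exit, b):
    """Backward probabilities beta[t, i] = p(x_{t+1}, ..., x_T | S(t) = s_i)."""
    T, N = b.shape
    beta = np.zeros((T, N))
    beta[-1] = a_exit                            # initialisation: beta_T(i) = a_iE
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(x_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    # Termination: p(X | lambda) = sum_j a_Ij b_j(x_1) beta_1(j)
    return beta, (a_entry * b[0] * beta[0]).sum()

a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]])
beta, pX = backward(a_entry, A, a_exit, b)
print(pX)                                        # equals the forward likelihood
```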
Backward Recursion

[Figure: the backward recursion on the state-time trellis. β_t(s_i) is computed from β_{t+1}(s_i), β_{t+1}(s_j), β_{t+1}(s_k), each weighted by the corresponding transition probability and the output probability of the next observation.]

State Occupation Probability

The state occupation probability γ_t(s_j) is the probability of occupying state s_j at time t given the sequence of observations.
Express it in terms of the forward and backward probabilities:

    γ_t(s_j) = P(S(t) = s_j | X, λ) = (1 / α_T(s_E)) α_t(s_j) β_t(s_j)

recalling that p(X | λ) = α_T(s_E).
Since

    α_t(s_j) β_t(s_j) = p(x_1, ..., x_t, S(t) = s_j | λ) p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)
                      = p(x_1, ..., x_t, x_{t+1}, x_{t+2}, ..., x_T, S(t) = s_j | λ)
                      = p(X, S(t) = s_j | λ)

we have

    P(S(t) = s_j | X, λ) = p(X, S(t) = s_j | λ) / p(X | λ)
Re-estimation of Gaussian parameters

The sum of state occupation probabilities through time for a state may be regarded as a "soft" count.
We can use this "soft" alignment to re-estimate the HMM parameters:

    µ̂_j = ∑_{t=1}^{T} γ_t(s_j) x_t / ∑_{t=1}^{T} γ_t(s_j)

    Σ̂_j = ∑_{t=1}^{T} γ_t(s_j) (x_t - µ̂_j)(x_t - µ̂_j)ᵀ / ∑_{t=1}^{T} γ_t(s_j)

Re-estimation of transition probabilities

Similarly to the state occupation probability, we can estimate ξ_t(s_i, s_j), the probability of being in s_i at time t and s_j at t+1, given the observations:

    ξ_t(s_i, s_j) = P(S(t) = s_i, S(t+1) = s_j | X, λ)
                  = P(S(t) = s_i, S(t+1) = s_j, X | λ) / p(X | λ)
                  = α_t(s_i) a_{ij} b_j(x_{t+1}) β_{t+1}(s_j) / α_T(s_E)

We can use this to re-estimate the transition probabilities:

    â_{ij} = ∑_{t=1}^{T} ξ_t(s_i, s_j) / ∑_{k=1}^{N} ∑_{t=1}^{T} ξ_t(s_i, s_k)
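A sketch of these occupation probabilities and re-estimation formulas for a single utterance with single-Gaussian outputs (it assumes `alpha`, `beta` and the output-likelihood matrix `b` computed as in the earlier sketches, plus an observation matrix `X` of shape T x d; exit transitions are ignored in the transition update, as in the formula above):

```python
import numpy as np

def reestimate(X, A, a_exit, b, alpha, beta):
    """One forward-backward (EM) re-estimation step for a single utterance."""
    T, N = b.shape
    pX = alpha[-1] @ a_exit                                  # p(X | lambda) = alpha_T(s_E)

    # State occupation probabilities gamma_t(j) = alpha_t(j) beta_t(j) / p(X | lambda).
    gamma = alpha * beta / pX

    # xi_t(i, j) = alpha_t(i) a_ij b_j(x_{t+1}) beta_{t+1}(j) / p(X | lambda).
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (b[1:] * beta[1:])[:, None, :]) / pX               # shape (T-1, N, N)

    # Re-estimated transition probabilities.
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]

    # Re-estimated Gaussian parameters from the "soft" counts.
    occ = gamma.sum(axis=0)                                  # soft state counts
    mu_new = gamma.T @ X / occ[:, None]                      # (N, d) mean vectors
    Sigma_new = np.stack([
        ((X - mu_new[j]).T * gamma[:, j]) @ (X - mu_new[j]) / occ[j]
        for j in range(N)])                                  # (N, d, d) covariances
    return A_new, mu_new, Sigma_new, gamma
```

For a corpus of utterances, the numerators and denominators are accumulated over all utterances before dividing, as described in the corpus extension below.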
Pulling it all together

Iterative estimation of HMM parameters using the EM algorithm. At each iteration:
E-step: for all time-state pairs,
1. recursively compute the forward probabilities α_t(s_j) and backward probabilities β_t(s_j);
2. compute the state occupation probabilities γ_t(s_j) and ξ_t(s_i, s_j).
M-step: based on the estimated state occupation probabilities, re-estimate the HMM parameters: mean vectors µ_j, covariance matrices Σ_j and transition probabilities a_{ij}.
The application of the EM algorithm to HMM training is sometimes called the Forward-Backward algorithm.

Extension to a corpus of utterances

We usually train from a large corpus of R utterances.
If x_t^r is the tth frame of the rth utterance X^r, then we can compute the probabilities α_t^r(j), β_t^r(j), γ_t^r(s_j) and ξ_t^r(s_i, s_j) as before.
The re-estimates are as before, except we must sum over the R utterances, e.g.:

    µ̂_j = ∑_{r=1}^{R} ∑_{t=1}^{T} γ_t^r(s_j) x_t^r / ∑_{r=1}^{R} ∑_{t=1}^{T} γ_t^r(s_j)
Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian.
In this case an M-component Gaussian mixture model is an appropriate density function:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

Given enough components, this family of functions can model any distribution.
Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step.

EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and the component-state occupation probabilities γ_t(s_j, m): the probability of occupying mixture component m of state s_j at time t.
We can thus re-estimate the mean of mixture component m of state s_j as follows:

    µ̂_{jm} = ∑_{t=1}^{T} γ_t(s_j, m) x_t / ∑_{t=1}^{T} γ_t(s_j, m)

and likewise for the covariance matrices (mixture models often use diagonal covariance matrices).
The mixture coefficients are re-estimated in a similar way to transition probabilities:

    ĉ_{jm} = ∑_{t=1}^{T} γ_t(s_j, m) / ∑_{ℓ=1}^{M} ∑_{t=1}^{T} γ_t(s_j, ℓ)

Doing the computation

The forward, backward and Viterbi recursions result in a long sequence of probabilities being multiplied.
This can cause floating point underflow problems.
In practice, computations are performed in the log domain (in which multiplies become adds).
Working in the log domain also avoids needing to perform the exponentiation when computing Gaussians.
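A sketch of the usual way this is done (the log-sum-exp trick is the standard technique rather than anything specific to these lectures): the forward recursion below works entirely with log probabilities.

```python
import numpy as np

def logsumexp(v):
    """log(sum(exp(v))), computed without underflow or overflow."""
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def forward_loglikelihood(log_a_entry, log_A, log_a_exit, log_b):
    """Forward recursion in the log domain: products become sums of logs."""
    T, N = log_b.shape
    log_alpha = log_a_entry + log_b[0]
    for t in range(1, T):
        log_alpha = np.array([logsumexp(log_alpha + log_A[:, j])
                              for j in range(N)]) + log_b[t]
    return logsumexp(log_alpha + log_a_exit)
```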
Summary: HMMs

HMMs provide a generative model for statistical speech recognition.
Three key problems:
1. Computing the overall likelihood: the Forward algorithm
2. Decoding the most likely state sequence: the Viterbi algorithm
3. Estimating the most likely parameters: the EM (Forward-Backward) algorithm
Solutions to these problems are tractable due to the two key HMM assumptions:
1. Conditional independence of observations given the current state
2. Markov assumption on the states
References: HMMs

Gales and Young (2007). "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1 (3), 195-304: section 2.2.
Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): sections 6.1-6.5; 9.2; 9.4. (Errata at http://www.cs.colorado.edu/~martin/SLP/Errata/SLP2-PIEV-Errata.html)
Rabiner and Juang (1989). "An introduction to hidden Markov models", IEEE ASSP Magazine, 3 (1), 4-16.
Renals and Hain (2010). "Speech Recognition", Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells.