Hidden Markov Models and Gaussian Mixture Models

Steve Renals and Peter Bell

Automatic Speech Recognition: ASR Lectures 4 & 5, 28/31 January 2013

Overview

Key models and algorithms for HMM acoustic models:
Gaussians
GMMs: Gaussian mixture models
HMMs: hidden Markov models
HMM algorithms:
Likelihood computation (the forward algorithm)
Most probable state sequence (the Viterbi algorithm)
Estimating the parameters (the EM algorithm)
Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = arg max_W P(W | X)

Applying Bayes' Theorem:

    P(W | X) = p(X | W) P(W) / p(X) ∝ p(X | W) P(W)

    W* = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Acoustic Modelling

[Figure: ASR system architecture. Recorded speech is converted by signal analysis into acoustic features; the acoustic model (a hidden Markov model), the lexicon and the language model, all estimated from training data, define the search space that is searched to produce the decoded text (transcription).]
Hierarchical modelling of speech

[Figure: generative hierarchy. An utterance ("No right") generates a word sequence (NO, RIGHT); each word generates a subword (phone) sequence (n, oh; r, ai, t); each phone is modelled by an HMM; each HMM generates acoustic feature vectors x.]

Acoustic Model: Continuous Density HMM

[Figure: a left-to-right HMM with entry state s_I, emitting states s_1, s_2, s_3 and exit state s_E. Transitions carry probabilities P(s_1 | s_I), P(s_1 | s_1), P(s_2 | s_1), P(s_2 | s_2), P(s_3 | s_2), P(s_3 | s_3), P(s_E | s_3); each emitting state s_j has an output density p(x | s_j) over acoustic vectors x.]

A probabilistic finite state automaton with parameters λ:
Transition probabilities: a_{kj} = P(s_j | s_k)
Output probability density function: b_j(x) = p(x | s_j)
Acoustic Model: Continuous Density HMM

[Figure: the same HMM unrolled against an observation sequence x_1, ..., x_6, showing each observation being generated by one of the emitting states s_1, s_2, s_3 on a path from s_I to s_E.]

HMM Assumptions

[Figure: graphical model of an HMM. The hidden states s(t-1), s(t), s(t+1) form a Markov chain, and each state generates the corresponding observation x(t-1), x(t), x(t+1) through the HMM output distribution.]

1. Observation independence: an acoustic observation x is conditionally independent of all other observations given the state that generated it.
2. Markov process: a state is conditionally independent of all other states given the previous state.
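To make the generative reading of these assumptions concrete, here is a minimal sampling sketch (not from the lecture; the three-state topology, Gaussian parameters and exit probability are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state left-to-right HMM. Each row of A plus the matching entry of
# exit_prob sums to 1, so exit_prob[k] plays the role of P(s_E | s_k).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.7]])
exit_prob = np.array([0.0, 0.0, 0.3])
means = np.array([[0.0, 0.0], [3.0, 1.0], [6.0, -1.0]])          # per-state means
covs = np.stack([np.eye(2) * v for v in (0.5, 1.0, 0.8)])        # per-state covariances

def sample(max_len=50):
    """Generate (state sequence, observations) from the toy HMM."""
    state, states, obs = 0, [], []
    for _ in range(max_len):
        states.append(state)
        # Observation independence: x_t depends only on the current state.
        obs.append(rng.multivariate_normal(means[state], covs[state]))
        if rng.random() < exit_prob[state]:       # leave via the exit state
            break
        # Markov process: the next state depends only on the current state.
        state = rng.choice(3, p=A[state] / A[state].sum())
    return states, np.array(obs)

states, X = sample()
print(states, X.shape)
```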
Output distribution

[Figure: the HMM from the previous slide, highlighting the output densities p(x | s_1), p(x | s_2), p(x | s_3) attached to the emitting states.]

Single multivariate Gaussian with mean µ_j and covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

Background: cdf

Consider a real-valued random variable X.
Cumulative distribution function (cdf) F(x) for X:

    F(x) = P(X ≤ x)

To obtain the probability of falling in an interval we can do the following:

    P(a < X ≤ b) = P(X ≤ b) - P(X ≤ a) = F(b) - F(a)
Background: pdf

The rate of change of the cdf gives us the probability density function (pdf), p(x):

    p(x) = d/dx F(x) = F'(x)

    F(x) = ∫_{-∞}^{x} p(x) dx

p(x) is not the probability that X has value x, but the pdf is proportional to the probability that X lies in a small interval centred on x.
Notation: p for a pdf, P for a probability.

The Gaussian distribution (univariate)

The Gaussian (or normal) distribution is the most common (and most easily analysed) continuous distribution.
It is also a reasonable model in many situations (the famous "bell curve").
If a (scalar) variable has a Gaussian distribution, then its probability density function has the form

    p(x | µ, σ²) = N(x; µ, σ²) = (1 / √(2πσ²)) exp(-(x - µ)² / (2σ²))

The Gaussian is described by two parameters:
the mean µ (location)
the variance σ² (dispersion)
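As a quick sanity check of the formula, a minimal NumPy sketch (illustrative, not part of the lecture notes) that evaluates the univariate Gaussian pdf:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """N(x; mu, var) = exp(-(x - mu)^2 / (2 var)) / sqrt(2 pi var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# The peak of a zero-mean, unit-variance Gaussian is 1/sqrt(2 pi), roughly 0.399.
print(gaussian_pdf(np.array([-1.0, 0.0, 1.0]), mu=0.0, var=1.0))
```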
Plot of Gaussian distribution

One-dimensional Gaussian with zero mean and unit variance (µ = 0, σ² = 1):

[Figure: pdf of the Gaussian distribution with mean 0 and variance 1, plotted over roughly -4 ≤ x ≤ 4; the peak value is about 0.4 at x = 0.]

Properties of the Gaussian distribution

    N(x; µ, σ²) = (1 / √(2πσ²)) exp(-(x - µ)² / (2σ²))

Gaussians have the same shape, with the location controlled by the mean and the spread controlled by the variance.

[Figure: pdfs of Gaussian distributions with mean 0 and variances 1, 2 and 4; larger variance gives a wider, flatter curve.]
Parameter estimation

Estimate the mean and variance parameters of a Gaussian from data x^1, x^2, ..., x^n.
Use the sample mean and sample variance estimates:

    µ = (1/n) ∑_{i=1}^{n} x^i             (sample mean)

    σ² = (1/n) ∑_{i=1}^{n} (x^i - µ)²     (sample variance)

Exercise

Consider the log likelihood of a set of N data points {x^1, ..., x^N} being generated by a Gaussian with mean µ and variance σ²:

    L = ln p({x^1, ..., x^N} | µ, σ²) = -(1/2) ∑_{n=1}^{N} ( (x^n - µ)²/σ² + ln σ² + ln(2π) )
      = -(1/(2σ²)) ∑_{n=1}^{N} (x^n - µ)² - (N/2) ln σ² - (N/2) ln(2π)

By maximising the log likelihood function with respect to µ, show that the maximum likelihood estimate for the mean is indeed the sample mean:

    µ_ML = (1/N) ∑_{n=1}^{N} x^n
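A small numerical sketch of the result (the synthetic data and NumPy usage are assumptions here, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # true mean 2, true variance 9

mu_ml = x.mean()                      # sample mean = ML estimate of the mean
var_ml = ((x - mu_ml) ** 2).mean()    # sample variance = ML estimate (divides by n)

print(mu_ml, var_ml)                  # close to 2 and 9 for a large sample
```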
The multidimensional Gaussian distribution

The d-dimensional vector x is multivariate Gaussian if it has a probability density function of the following form:

    p(x | µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( -(1/2) (x - µ)ᵀ Σ⁻¹ (x - µ) )

The pdf is parameterized by the mean vector µ and the covariance matrix Σ.
The 1-dimensional Gaussian is a special case of this pdf.
The argument to the exponential, (1/2)(x - µ)ᵀ Σ⁻¹ (x - µ), is referred to as a quadratic form.

Covariance matrix

The mean vector µ is the expectation of x:

    µ = E[x]

The covariance matrix Σ is the expectation of the deviation of x from the mean:

    Σ = E[(x - µ)(x - µ)ᵀ]

Σ is a d × d symmetric matrix:

    Σ_{ij} = E[(x_i - µ_i)(x_j - µ_j)] = E[(x_j - µ_j)(x_i - µ_i)] = Σ_{ji}

The sign of the covariance helps to determine the relationship between two components:
If x_j is large when x_i is large, then (x_j - µ_j)(x_i - µ_i) will tend to be positive.
If x_j is small when x_i is large, then (x_j - µ_j)(x_i - µ_i) will tend to be negative.
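A hedged sketch of evaluating the multivariate Gaussian log-density directly from this formula (the 2-D covariance values below are invented; production code would typically use a Cholesky factorisation or a library routine):

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a d-dimensional vector x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # quadratic form (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|, computed stably
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])       # an arbitrary 2-D covariance
print(mvn_logpdf(np.array([0.5, -0.5]), mu, Sigma))
```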
Spherical Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 0; 0 1] (ρ_12 = 0): circular contours.]

Diagonal Covariance Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 0; 0 4] (ρ_12 = 0): axis-aligned elliptical contours, wider along x2.]

Full covariance Gaussian

[Figure: contour and surface plots of p(x1, x2) for µ = [0 0]ᵀ, Σ = [1 -1; -1 4] (ρ_12 = -0.5): tilted elliptical contours.]

Parameter estimation

It is possible to show that the mean vector µ̂ and covariance matrix Σ̂ that maximize the likelihood of the training data are given by:

    µ̂ = (1/N) ∑_{n=1}^{N} x^n

    Σ̂ = (1/N) ∑_{n=1}^{N} (x^n - µ̂)(x^n - µ̂)ᵀ

The mean of the distribution is estimated by the sample mean and the covariance by the sample covariance.
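A minimal sketch of these estimates on synthetic data (the "true" parameters below are simply borrowed from the full-covariance example above):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([0.0, 0.0])
true_Sigma = np.array([[1.0, -1.0], [-1.0, 4.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x d data matrix

mu_hat = X.mean(axis=0)                   # sample mean vector
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)        # sample covariance (ML estimate, divides by N)

print(mu_hat)
print(Sigma_hat)                          # close to true_Sigma for a large sample
```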
Example data

[Figure: scatter plot of two-dimensional example data points (X1, X2).]

Maximum likelihood fit to a Gaussian

[Figure: the same data with the maximum likelihood Gaussian fit overlaid.]
Data in clusters (example 1)

[Figure: two-dimensional data drawn from two clusters with µ_1 = [0 0]ᵀ, µ_2 = [1 1]ᵀ and Σ_1 = Σ_2 = 0.2 I.]

Example 1 fit by a Gaussian

[Figure: the same two-cluster data fitted by a single Gaussian, whose contours straddle both clusters.]
k-means clustering

k-means is an automatic procedure for clustering unlabelled data.
Requires a prespecified number of clusters.
The clustering algorithm chooses a set of clusters with the minimum within-cluster variance.
Guaranteed to converge (eventually).
The clustering solution is dependent on the initialisation.

k-means example

[Figure: data set of 14 two-dimensional points: (4,13), (2,9), (7,8), (6,6), (4,5), (1,2), (3,1), (10,0), (7,6), (10,5), (5,4), (8,4), (5,2), (1,1), to be clustered with k = 3; three initial centres are chosen.]

[Figure: iteration 1. Points are assigned to their nearest centre and the centres are recomputed as the cluster means, moving to (4.33, 10), (3.57, 3) and (8.75, 3.75).]

[Figure: iteration 2. Points are reassigned and the centres recomputed, moving to (4.33, 10), (3.17, 2.5) and (8.2, 4.2).]

[Figure: iteration 3. Reassigning the points changes nothing, so the algorithm has converged.]
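A compact sketch of the algorithm on the example data (the choice of initial centres below is an assumption; the slides do not state which points were used, and a different initialisation can give a different final clustering):

```python
import numpy as np

# The 14 two-dimensional points from the worked example.
X = np.array([(4, 13), (2, 9), (7, 8), (6, 6), (4, 5), (1, 2), (3, 1), (10, 0),
              (7, 6), (10, 5), (5, 4), (8, 4), (5, 2), (1, 1)], dtype=float)

def kmeans(X, centres, n_iter=20):
    """Plain k-means: alternate assignment to nearest centre and centre re-estimation."""
    k = len(centres)
    for _ in range(n_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Recompute each centre as the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):     # no centre moved: converged
            break
        centres = new_centres
    return centres, assign

centres, assign = kmeans(X, centres=X[[0, 5, 7]].copy())   # assumed initial centres
print(np.round(centres, 2))
print(assign)
```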
Mixture model

A more flexible form of density estimation is made up of a linear combination of component densities:

    p(x) = ∑_{j=1}^{M} p(x | j) P(j)

This is called a mixture model or a mixture density.
p(x | j): component densities
P(j): mixing parameters
Generative model:
1. Choose a mixture component based on P(j)
2. Generate a data point x from the chosen component using p(x | j)

Component occupation probability

We can apply Bayes' theorem:

    P(j | x) = p(x | j) P(j) / p(x) = p(x | j) P(j) / ∑_{j'=1}^{M} p(x | j') P(j')

The posterior probabilities P(j | x) give the probability that component j was responsible for generating data point x.
The P(j | x) are called the component occupation probabilities (or sometimes the responsibilities).
Since they are posterior probabilities:

    ∑_{j=1}^{M} P(j | x) = 1

Gaussian mixture model

The most important mixture model is the Gaussian mixture model (GMM), where the component densities are Gaussians.
Consider a GMM where each component Gaussian N_j(x; µ_j, σ_j²) has mean µ_j and a spherical covariance Σ = σ² I:

    p(x) = ∑_{j=1}^{M} P(j) p(x | j) = ∑_{j=1}^{M} P(j) N_j(x; µ_j, σ_j²)

[Figure: the GMM drawn as a network. The input vector x = (x1, ..., xd) feeds the component densities p(x | 1), ..., p(x | M), which are weighted by the mixing parameters P(1), ..., P(M) and summed to give p(x).]

Parameter estimation

If we knew which mixture component was responsible for each data point:
we would be able to assign each point unambiguously to a mixture component,
we could estimate the mean for each component Gaussian as the sample mean (just like k-means clustering),
and we could estimate the covariance as the sample covariance.
But we don't know which mixture component a data point comes from...
Maybe we could use the component occupation probabilities P(j | x)?
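A small sketch of the mixture density and the responsibilities for a toy one-dimensional GMM (the parameter values are invented for illustration):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy 1-D GMM with M = 2 components.
priors = np.array([0.4, 0.6])             # mixing parameters P(j)
mus = np.array([-1.0, 2.0])
variances = np.array([0.5, 1.0])

x = 0.3
lik = gauss(x, mus, variances)            # p(x | j) for each component
px = (priors * lik).sum()                 # mixture density p(x)
resp = priors * lik / px                  # component occupation probabilities P(j | x)

print(px, resp, resp.sum())               # the responsibilities sum to 1
```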
GMM parameter estimation when we know which component generated the data

Define the indicator variable z_{jn} = 1 if component j generated data point x^n (and 0 otherwise).
If z_{jn} wasn't hidden, then we could count the number of observed data points generated by j:

    N_j = ∑_{n=1}^{N} z_{jn}

And estimate the mean, variance and mixing parameters as:

    µ̂_j = ∑_n z_{jn} x^n / N_j

    σ̂_j² = ∑_n z_{jn} ||x^n - µ̂_j||² / N_j

    P̂(j) = (1/N) ∑_n z_{jn} = N_j / N

Soft assignment

Estimate "soft counts" based on the component occupation probabilities P(j | x^n):

    N_j* = ∑_{n=1}^{N} P(j | x^n)

We can imagine assigning data points to component j weighted by the component occupation probability P(j | x^n).
So we could imagine estimating the mean, variance and prior probabilities as:

    µ̂_j = ∑_n P(j | x^n) x^n / ∑_n P(j | x^n) = ∑_n P(j | x^n) x^n / N_j*

    σ̂_j² = ∑_n P(j | x^n) ||x^n - µ̂_j||² / ∑_n P(j | x^n) = ∑_n P(j | x^n) ||x^n - µ̂_j||² / N_j*

    P̂(j) = (1/N) ∑_n P(j | x^n) = N_j* / N

EM algorithm

Problem! Recall that:

    P(j | x) = p(x | j) P(j) / p(x)

We need to know p(x | j) and P(j) to estimate the parameters of p(x | j) and to estimate P(j)...
Solution: an iterative algorithm where each iteration has two parts:
Compute the component occupation probabilities P(j | x) using the current estimates of the GMM parameters (means, variances, mixing parameters) (E-step).
Compute the GMM parameters using the current estimates of the component occupation probabilities (M-step).
Starting from some initialization (e.g. using k-means for the means), these steps are alternated until convergence.
This is called the EM algorithm and can be shown to maximize the likelihood.

Maximum likelihood parameter estimation

The likelihood of a data set X = {x^1, x^2, ..., x^N} is given by:

    L = ∏_{n=1}^{N} p(x^n) = ∏_{n=1}^{N} ∑_{j=1}^{M} p(x^n | j) P(j)

We can regard the negative log likelihood as an error function:

    E = -ln L = -∑_{n=1}^{N} ln p(x^n) = -∑_{n=1}^{N} ln ∑_{j=1}^{M} p(x^n | j) P(j)

Considering the derivatives of E with respect to the parameters gives expressions like those on the previous slide.
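A compact EM sketch for a one-dimensional two-component GMM (synthetic data; the initialisation, iteration count and parameter values are arbitrary choices for illustration, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data from two zero-mean Gaussians with very different variances.
x = np.concatenate([rng.normal(0.0, np.sqrt(0.1), 500),
                    rng.normal(0.0, np.sqrt(2.0), 500)])

M = 2
P = np.full(M, 1.0 / M)                  # mixing parameters P(j)
mu = np.array([-0.5, 0.5])               # crude initialisation (k-means would be better)
var = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: component occupation probabilities P(j | x_n).
    lik = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = P * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the soft counts N_j* = sum_n P(j | x_n).
    Nj = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nj
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    P = Nj / len(x)

print(np.round(mu, 2), np.round(var, 2), np.round(P, 2))
```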
Example 1 fit using a GMM

[Figure: the two-cluster data of example 1 fitted with a two-component GMM; each component sits on one of the clusters.]

Peakily distributed data (Example 2)

[Figure: two-dimensional data drawn from two zero-mean Gaussians, µ_1 = µ_2 = [0 0]ᵀ, with Σ_1 = 0.1 I and Σ_2 = 2 I: a sharp central peak on top of a broad background.]
Example 2 fit by a Gaussian

[Figure: the peaky data (µ_1 = µ_2 = [0 0]ᵀ, Σ_1 = 0.1 I, Σ_2 = 2 I) fitted by a single Gaussian, which cannot capture both the sharp peak and the broad tails.]

Example 2 fit by a GMM

[Figure: the same data fitted with a two-component GMM using EM; the mixture captures both the narrow and the broad component.]
Example 2: component Gaussians

[Figure: the two component Gaussians of the fitted GMM shown separately: one narrow, one broad.]

Comments on GMMs

GMMs trained using the EM algorithm are able to self-organize to fit a data set.
Individual components take responsibility for parts of the data set (probabilistically).
Soft assignment to components, not hard assignment: "soft clustering".
GMMs scale very well; for example, large speech recognition systems can have 30,000 GMMs, each with 32 components: sometimes 1 million Gaussian components!
And the parameters are all estimated from (a lot of) data by EM.
Back to HMMs...

[Figure: the continuous-density HMM again, with states s_I, s_1, s_2, s_3, s_E, transition probabilities P(s_j | s_i) and output densities p(x | s_j).]

Output distribution:
Single multivariate Gaussian with mean µ_j and covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

The three problems of HMMs

Working with HMMs requires the solution of three problems:
1. Likelihood: determine the overall likelihood of an observation sequence X = (x_1, ..., x_t, ..., x_T) being generated by an HMM.
2. Decoding: given an observation sequence and an HMM, determine the most probable hidden state sequence.
3. Training: given an observation sequence and an HMM, learn the best HMM parameters λ = {{a_{jk}}, {b_j()}}.
1. Likelihood: The Forward algorithm

Goal: determine p(X | λ).
Sum over all possible state sequences s_1 s_2 ... s_T that could result in the observation sequence X.
Rather than enumerating each sequence, compute the probabilities recursively (exploiting the Markov assumption).

Recursive algorithms on HMMs

Visualize the problem as a state-time trellis.

[Figure: state-time trellis with states i, j, k as rows and times t-1, t, t+1 as columns; every state at time t connects to every state at time t+1.]

1. Likelihood: The Forward recursion

Forward probability, α_t(s_j): the probability of observing the observation sequence x_1 ... x_t and being in state s_j at time t:

    α_t(s_j) = p(x_1, ..., x_t, S(t) = s_j | λ)

Initialization:

    α_0(s_I) = 1
    α_0(s_j) = 0  if s_j ≠ s_I

Recursion:

    α_t(s_j) = ∑_{i=1}^{N} α_{t-1}(s_i) a_{ij} b_j(x_t)

Termination:

    p(X | λ) = α_T(s_E) = ∑_{i=1}^{N} α_T(s_i) a_{iE}
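A sketch of the forward recursion in NumPy (the conventions are assumptions made for this sketch: `a_entry[j]` = P(s_j | s_I), `A[i, j]` = a_ij, `a_exit[i]` = P(s_E | s_i), and `b[t, j]` = b_j(x_t) precomputed for each frame; the toy numbers are invented):

```python
import numpy as np

def forward_likelihood(a_entry, A, a_exit, b):
    """p(X | lambda) via the forward recursion."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0] = a_entry * b[0]                   # alpha_1(j) = a_Ij b_j(x_1)
    for t in range(1, T):
        # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(x_t)
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    return alpha[-1] @ a_exit                   # termination: sum_i alpha_T(i) a_iE

# Tiny 3-state, 3-frame example with invented numbers.
a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.2, 0.7]])
print(forward_likelihood(a_entry, A, a_exit, b))
```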
1. Likelihood: Forward Recursion

[Figure: the forward recursion on the state-time trellis. α_t(s_i) is computed by summing the predecessor probabilities α_{t-1}(s_i), α_{t-1}(s_j), α_{t-1}(s_k), each weighted by its transition probability, and multiplying by the output probability of the current observation.]

Viterbi approximation

Instead of summing over all possible state sequences, just consider the most likely.
Achieve this by changing the summation to a maximisation in the recursion:

    V_t(s_j) = max_i V_{t-1}(s_i) a_{ij} b_j(x_t)

Changing the recursion in this way gives the likelihood of the most probable path.
We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path.

Viterbi Recursion

[Figure: the Viterbi recursion on the trellis. V_t(s_i) takes the maximum over the predecessor scores V_{t-1}(s_i), V_{t-1}(s_j), V_{t-1}(s_k), each weighted by its transition probability, multiplied by the output probability; the backpointer bt_t(s_i) records which previous state gave the maximum.]
2. Decoding: The Viterbi algorithm

Initialization:

    V_0(s_I) = 1
    V_0(s_j) = 0   if s_j ≠ s_I
    bt_0(s_j) = 0

Recursion:

    V_t(s_j) = max_{i=1}^{N} V_{t-1}(s_i) a_{ij} b_j(x_t)

    bt_t(s_j) = arg max_{i=1}^{N} V_{t-1}(s_i) a_{ij} b_j(x_t)

Termination:

    P* = V_T(s_E) = max_{i=1}^{N} V_T(s_i) a_{iE}

    s_T* = bt_T(s_E) = arg max_{i=1}^{N} V_T(s_i) a_{iE}

Viterbi Backtrace

Backtrace to find the state sequence of the most probable path.

[Figure: the backtrace follows the stored backpointers from the final state back through the trellis, e.g. bt_t(s_i) = s_j and bt_{t+1}(s_k) = s_i.]
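A matching sketch of the Viterbi recursion with backpointers and backtrace (the same assumed conventions and invented toy numbers as the forward sketch above; probabilities are kept in the linear domain here for readability):

```python
import numpy as np

def viterbi(a_entry, A, a_exit, b):
    """Most probable state sequence and its likelihood."""
    T, N = b.shape
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = a_entry * b[0]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A           # scores[i, j] = V_{t-1}(i) a_ij
        back[t] = scores.argmax(axis=0)          # backpointer: best previous state for each j
        V[t] = scores.max(axis=0) * b[t]         # V_t(j) = max_i V_{t-1}(i) a_ij b_j(x_t)
    last = int(np.argmax(V[-1] * a_exit))        # best final state before s_E
    p_star = V[-1, last] * a_exit[last]
    path = [last]                                # Viterbi backtrace
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return p_star, path[::-1]

a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.2, 0.7]])
print(viterbi(a_entry, A, a_exit, b))
```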
3. Training: Forward-Backward algorithm

Goal: efficiently estimate the parameters of an HMM λ from an observation sequence.
Assume a single Gaussian output probability distribution:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

Parameters λ:
Transition probabilities a_{ij}, with ∑_j a_{ij} = 1
Gaussian parameters for state s_j: mean vector µ_j; covariance matrix Σ_j

Viterbi Training

If we knew the state-time alignment, then each observation feature vector could be assigned to a specific state.
A state-time alignment can be obtained using the most probable path obtained by Viterbi decoding.
Maximum likelihood estimate of a_{ij}, if C(s_i → s_j) is the count of transitions from s_i to s_j:

    â_{ij} = C(s_i → s_j) / ∑_k C(s_i → s_k)

Likewise, if Z_j is the set of observed acoustic feature vectors assigned to state j, we can use the standard maximum likelihood estimates for the mean and the covariance:

    µ̂_j = ∑_{x ∈ Z_j} x / |Z_j|

    Σ̂_j = ∑_{x ∈ Z_j} (x - µ̂_j)(x - µ̂_j)ᵀ / |Z_j|

EM Algorithm

Viterbi training is an approximation: we would like to consider all possible paths.
In this case, rather than having a hard state-time alignment, we estimate a probability.
State occupation probability: the probability γ_t(s_j) of occupying state s_j at time t given the sequence of observations. Compare with the component occupation probability in a GMM.
We can use this for an iterative algorithm for HMM training: the EM algorithm.
Each iteration has two steps:
E-step: estimate the state occupation probabilities (Expectation)
M-step: re-estimate the HMM parameters based on the estimated state occupation probabilities (Maximisation)

Backward probabilities

To estimate the state occupation probabilities it is useful to define (recursively) another set of probabilities, the backward probabilities:

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

the probability of the future observations given that the HMM is in state s_j at time t.
These can be recursively computed (going backwards in time).
Initialisation:

    β_T(s_i) = a_{iE}

Recursion:

    β_t(s_i) = ∑_{j=1}^{N} a_{ij} b_j(x_{t+1}) β_{t+1}(s_j)

Termination:

    p(X | λ) = β_0(s_I) = ∑_{j=1}^{N} a_{Ij} b_j(x_1) β_1(s_j) = α_T(s_E)
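A sketch of the backward recursion under the same assumed conventions as the earlier forward/Viterbi sketches; its termination value should match the forward likelihood:

```python
import numpy as np

def backward(a_entry, A, a_exit, b):
    """Backward probabilities beta[t, i] = p(x_{t+1}, ..., x_T | S(t) = s_i)."""
    T, N = b.shape
    beta = np.zeros((T, N))
    beta[-1] = a_exit                            # initialisation: beta_T(i) = a_iE
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(x_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    # Termination: p(X | lambda) = sum_j a_Ij b_j(x_1) beta_1(j)
    return beta, (a_entry * b[0] * beta[0]).sum()

a_entry = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 0.9]])
a_exit = np.array([0.0, 0.0, 0.1])
b = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]])
beta, pX = backward(a_entry, A, a_exit, b)
print(pX)                                        # equals the forward likelihood
```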
Backward Recursion

[Figure: the backward recursion on the state-time trellis. β_t(s_i) is computed from β_{t+1}(s_i), β_{t+1}(s_j), β_{t+1}(s_k), each weighted by the corresponding transition probability and the output probability of the next observation.]

State Occupation Probability

The state occupation probability γ_t(s_j) is the probability of occupying state s_j at time t given the sequence of observations.
Express it in terms of the forward and backward probabilities:

    γ_t(s_j) = P(S(t) = s_j | X, λ) = (1 / α_T(s_E)) α_t(s_j) β_t(s_j)

recalling that p(X | λ) = α_T(s_E).
Since

    α_t(s_j) β_t(s_j) = p(x_1, ..., x_t, S(t) = s_j | λ) p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)
                      = p(x_1, ..., x_t, x_{t+1}, x_{t+2}, ..., x_T, S(t) = s_j | λ)
                      = p(X, S(t) = s_j | λ)

we have

    P(S(t) = s_j | X, λ) = p(X, S(t) = s_j | λ) / p(X | λ)
Re-estimation of Gaussian parameters

The sum of state occupation probabilities through time for a state may be regarded as a "soft" count.
We can use this "soft" alignment to re-estimate the HMM parameters:

    µ̂_j = ∑_{t=1}^{T} γ_t(s_j) x_t / ∑_{t=1}^{T} γ_t(s_j)

    Σ̂_j = ∑_{t=1}^{T} γ_t(s_j) (x_t - µ̂_j)(x_t - µ̂_j)ᵀ / ∑_{t=1}^{T} γ_t(s_j)

Re-estimation of transition probabilities

Similarly to the state occupation probability, we can estimate ξ_t(s_i, s_j), the probability of being in s_i at time t and s_j at t+1, given the observations:

    ξ_t(s_i, s_j) = P(S(t) = s_i, S(t+1) = s_j | X, λ)
                  = P(S(t) = s_i, S(t+1) = s_j, X | λ) / p(X | λ)
                  = α_t(s_i) a_{ij} b_j(x_{t+1}) β_{t+1}(s_j) / α_T(s_E)

We can use this to re-estimate the transition probabilities:

    â_{ij} = ∑_{t=1}^{T} ξ_t(s_i, s_j) / ∑_{k=1}^{N} ∑_{t=1}^{T} ξ_t(s_i, s_k)
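A sketch of these occupation probabilities and re-estimation formulas for a single utterance with single-Gaussian outputs (it assumes `alpha`, `beta` and the output-likelihood matrix `b` computed as in the earlier sketches, plus an observation matrix `X` of shape T x d; exit transitions are ignored in the transition update, as in the formula above):

```python
import numpy as np

def reestimate(X, A, a_exit, b, alpha, beta):
    """One forward-backward (EM) re-estimation step for a single utterance."""
    T, N = b.shape
    pX = alpha[-1] @ a_exit                                  # p(X | lambda) = alpha_T(s_E)

    # State occupation probabilities gamma_t(j) = alpha_t(j) beta_t(j) / p(X | lambda).
    gamma = alpha * beta / pX

    # xi_t(i, j) = alpha_t(i) a_ij b_j(x_{t+1}) beta_{t+1}(j) / p(X | lambda).
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (b[1:] * beta[1:])[:, None, :]) / pX               # shape (T-1, N, N)

    # Re-estimated transition probabilities.
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]

    # Re-estimated Gaussian parameters from the "soft" counts.
    occ = gamma.sum(axis=0)                                  # soft state counts
    mu_new = gamma.T @ X / occ[:, None]                      # (N, d) mean vectors
    Sigma_new = np.stack([
        ((X - mu_new[j]).T * gamma[:, j]) @ (X - mu_new[j]) / occ[j]
        for j in range(N)])                                  # (N, d, d) covariances
    return A_new, mu_new, Sigma_new, gamma
```

For a corpus of utterances, the numerators and denominators are accumulated over all utterances before dividing, as described in the corpus extension below.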
Pulling it all together

Iterative estimation of HMM parameters using the EM algorithm. At each iteration:
E-step: for all time-state pairs,
1. recursively compute the forward probabilities α_t(s_j) and backward probabilities β_t(s_j);
2. compute the state occupation probabilities γ_t(s_j) and ξ_t(s_i, s_j).
M-step: based on the estimated state occupation probabilities, re-estimate the HMM parameters: mean vectors µ_j, covariance matrices Σ_j and transition probabilities a_{ij}.
The application of the EM algorithm to HMM training is sometimes called the Forward-Backward algorithm.

Extension to a corpus of utterances

We usually train from a large corpus of R utterances.
If x_t^r is the tth frame of the rth utterance X^r, then we can compute the probabilities α_t^r(j), β_t^r(j), γ_t^r(s_j) and ξ_t^r(s_i, s_j) as before.
The re-estimates are as before, except we must sum over the R utterances, e.g.:

    µ̂_j = ∑_{r=1}^{R} ∑_{t=1}^{T} γ_t^r(s_j) x_t^r / ∑_{r=1}^{R} ∑_{t=1}^{T} γ_t^r(s_j)
Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian.
In this case an M-component Gaussian mixture model is an appropriate density function:

    b_j(x) = p(x | s_j) = ∑_{m=1}^{M} c_{jm} N(x; µ_{jm}, Σ_{jm})

Given enough components, this family of functions can model any distribution.
Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step.

EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and the component-state occupation probabilities γ_t(s_j, m): the probability of occupying mixture component m of state s_j at time t.
We can thus re-estimate the mean of mixture component m of state s_j as follows:

    µ̂_{jm} = ∑_{t=1}^{T} γ_t(s_j, m) x_t / ∑_{t=1}^{T} γ_t(s_j, m)

and likewise for the covariance matrices (mixture models often use diagonal covariance matrices).
The mixture coefficients are re-estimated in a similar way to transition probabilities:

    ĉ_{jm} = ∑_{t=1}^{T} γ_t(s_j, m) / ∑_{ℓ=1}^{M} ∑_{t=1}^{T} γ_t(s_j, ℓ)

Doing the computation

The forward, backward and Viterbi recursions result in a long sequence of probabilities being multiplied.
This can cause floating point underflow problems.
In practice, computations are performed in the log domain (in which multiplies become adds).
Working in the log domain also avoids needing to perform the exponentiation when computing Gaussians.
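A sketch of the usual way this is done (the log-sum-exp trick is the standard technique rather than anything specific to these lectures): the forward recursion below works entirely with log probabilities.

```python
import numpy as np

def logsumexp(v):
    """log(sum(exp(v))), computed without underflow or overflow."""
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def forward_loglikelihood(log_a_entry, log_A, log_a_exit, log_b):
    """Forward recursion in the log domain: products become sums of logs."""
    T, N = log_b.shape
    log_alpha = log_a_entry + log_b[0]
    for t in range(1, T):
        log_alpha = np.array([logsumexp(log_alpha + log_A[:, j])
                              for j in range(N)]) + log_b[t]
    return logsumexp(log_alpha + log_a_exit)
```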
Summary: HMMs

HMMs provide a generative model for statistical speech recognition.
Three key problems:
1. Computing the overall likelihood: the Forward algorithm
2. Decoding the most likely state sequence: the Viterbi algorithm
3. Estimating the most likely parameters: the EM (Forward-Backward) algorithm
Solutions to these problems are tractable due to the two key HMM assumptions:
1. Conditional independence of observations given the current state
2. Markov assumption on the states
References: HMMs

Gales and Young (2007). "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1 (3), 195-304: section 2.2.
Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): sections 6.1-6.5; 9.2; 9.4. (Errata at http://www.cs.colorado.edu/~martin/SLP/Errata/SLP2-PIEV-Errata.html)
Rabiner and Juang (1989). "An introduction to hidden Markov models", IEEE ASSP Magazine, 3 (1), 4-16.
Renals and Hain (2010). "Speech Recognition", Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells.