Expectation-Maximization Algorithm and Applications

Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006

List of Concepts
- Maximum-Likelihood Estimation (MLE)
- Expectation-Maximization (EM)
- Conditional Probability
- Mixture Modeling
- Gaussian Mixture Models (GMMs)
- String edit-distance
- Forward-backward algorithms

Overview
- Expectation-Maximization
- Mixture Model Training
- Learning String Edit-Distance


One-Slide MLE Review
- Say I give you a coin with P(heads) = θ
- But I don't tell you the value of θ
- Now say I let you flip the coin n times
- You get h heads and n−h tails
- What is the natural estimate of θ? This is h/n
- More formally, the likelihood of θ is governed by a binomial distribution:
  P(h heads in n flips | θ) = (n choose h) θ^h (1 − θ)^(n−h)
- Can prove h/n is the maximum-likelihood estimate of θ
  - Differentiate the (log-)likelihood with respect to θ, set equal to 0
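As a quick numerical sanity check (not from the slides), the sketch below compares the closed-form estimate h/n with a grid search over the binomial log-likelihood; the flip counts are invented for illustration.

```python
import math

# Invented illustration: n coin flips, h heads.
n, h = 100, 37

def binomial_log_likelihood(theta, n, h):
    """log P(h heads in n flips | theta)."""
    return (math.log(math.comb(n, h))
            + h * math.log(theta)
            + (n - h) * math.log(1 - theta))

# Closed-form maximum-likelihood estimate from the slide: h / n.
theta_hat = h / n

# Numerical check: no theta on a fine grid beats the closed-form estimate.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: binomial_log_likelihood(t, n, h))
print(theta_hat, best)  # both approximately 0.37
```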

EM Motivation
- So, to solve any ML-type problem, we analytically maximize the likelihood function?
  - Seems to work for 1D Bernoulli (coin toss)
  - Also works for 1D Gaussian (find µ, σ²)
- Not quite
  - Distribution may not be well-behaved, or may have too many parameters
  - Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (about 1M parameters)
  - Direct maximization is not feasible
- Solution: introduce hidden variables to
  - Simplify the likelihood function (more common)
  - Account for actual missing data


Hidden and Observed Variables
- Observed variables: directly measurable from the data, e.g.
  - The waveform values of a speech recording
  - Is it raining today?
  - Did the smoke alarm go off?
- Hidden variables: influence the data, but not trivial to measure
  - The phonemes that produce a given speech recording
  - P(rain today | rain yesterday)
  - Is the smoke alarm malfunctioning?


Expectation-Maximization
- Model dependent random variables:
  - Observed variable x
  - Unobserved (hidden) variable y that generates x
- Assume probability distributions over x and y
  - θ represents the set of all parameters of the distribution
- Repeat until convergence (a generic sketch of this loop follows below)
  - E-step: Compute the expectation Q(θ, θ′) = E[log P(x, y | θ) | x, θ′] (θ′, θ: old, new distribution parameters)
  - M-step: Find the θ that maximizes Q
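A minimal generic EM loop, as a sketch of the recipe above; the helper names e_step and m_step are hypothetical placeholders for the model-specific computations, and convergence is monitored via the observed-data log-likelihood returned by the E-step.

```python
def em(x, theta_init, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic EM skeleton.

    e_step(x, theta) -> (expected_stats, log_likelihood)
        computes the expectations needed for Q(theta, theta').
    m_step(x, expected_stats) -> theta
        returns parameters maximizing (or at least improving) Q.
    """
    theta = theta_init
    prev_ll = float("-inf")
    for _ in range(max_iter):
        stats, ll = e_step(x, theta)   # E-step
        theta = m_step(x, stats)       # M-step
        if ll - prev_ll < tol:         # stop when the likelihood stops improving
            break
        prev_ll = ll
    return theta
```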

Conditional Expectation Review
- Let X, Y be r.v.'s drawn from the distributions P(x) and P(y)
- Conditional distribution given by P(y | x) = P(x, y) / P(x)
- Then E[Y | X] = Σ_y y · P(y | X)
- For a function h(Y): E[h(Y) | X] = Σ_y h(y) · P(y | X)
- Given a particular value of X (X = x): E[h(Y) | X = x] = Σ_y h(y) · P(y | x)
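A tiny numerical illustration of these definitions; the joint probabilities below are invented for the example.

```python
# Joint distribution P(x, y) over two small discrete variables (invented numbers).
P = {
    ("x1", 0): 0.10, ("x1", 1): 0.30,
    ("x2", 0): 0.40, ("x2", 1): 0.20,
}

def conditional_expectation(h, x):
    """E[h(Y) | X = x] = sum_y h(y) * P(y | x)."""
    p_x = sum(p for (xi, _), p in P.items() if xi == x)                   # P(x)
    return sum(h(y) * p / p_x for (xi, y), p in P.items() if xi == x)

print(conditional_expectation(lambda y: y, "x1"))      # E[Y | X = x1] = 0.75
print(conditional_expectation(lambda y: y * y, "x2"))  # E[Y^2 | X = x2] ≈ 0.333
```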

Maximum Likelihood Problem
- Want to pick the θ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables given
  - Observed variable x
  - Previous parameters θ′
- The conditional expectation of log P(x, y | θ) given x and θ′ is
  Q(θ, θ′) = E[log P(x, y | θ) | x, θ′] = Σ_y P(y | x, θ′) log P(x, y | θ)


EM Derivation
- Lemma (special case of Jensen's inequality): Let p(x), q(x) be probability distributions. Then
  Σ_x p(x) log p(x) ≥ Σ_x p(x) log q(x)
- Proof: rewrite as Σ_x p(x) log (p(x)/q(x)) ≥ 0
- Interpretation: relative entropy (KL divergence) is non-negative
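A quick numerical check of the lemma on two made-up distributions:

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # made-up distributions over three outcomes
q = [0.2, 0.5, 0.3]

print(kl(p, q))  # >= 0, and 0 only when p == q
print(kl(p, p))  # exactly 0
```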

EM Derivation
- EM Theorem:
  - If Q(θ, θ′) ≥ Q(θ′, θ′), i.e., E[log P(x, y | θ) | x, θ′] ≥ E[log P(x, y | θ′) | x, θ′]
  - then log P(x | θ) ≥ log P(x | θ′)
- Proof:
  - Write log P(x | θ) = E[log P(x, y | θ) | x, θ′] − E[log P(y | x, θ) | x, θ′]
  - By some algebra and the lemma, log P(x | θ) − log P(x | θ′) ≥ Q(θ, θ′) − Q(θ′, θ′)
  - So, if this quantity is non-negative, so is log P(x | θ) − log P(x | θ′)


EM Summary
- Repeat until convergence
  - E-step: Compute the expectation Q(θ, θ′) = E[log P(x, y | θ) | x, θ′] (θ′, θ: old, new distribution parameters)
  - M-step: Find the θ that maximizes Q
- EM Theorem:
  - If Q(θ, θ′) ≥ Q(θ′, θ′)
  - then log P(x | θ) ≥ log P(x | θ′)
- Interpretation
  - As long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x
  - Actually, it is not necessary to maximize the expectation, just to make sure that it increases – this is called “Generalized EM”

EM Comments
- In practice, x is a series of data points
- To calculate the expectation, we can assume the points are i.i.d. and sum over all of them (a minimal sketch follows below):
  Q(θ, θ′) = Σ_i Σ_y P(y | x_i, θ′) log P(x_i, y | θ)
- Problems with EM?
  - Local maxima
  - Need to bootstrap the training process (pick an initial θ)
- When is EM most useful?
  - When the model distributions are easy to maximize (e.g., Gaussian mixture models)
- EM is a meta-algorithm, and needs to be adapted to each particular application
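A minimal sketch of that i.i.d. sum; posterior and joint are hypothetical callables standing in for P(y | x, θ′) and P(x, y | θ) of some concrete model.

```python
import math

def q_function(points, labels, posterior, joint):
    """Q(theta, theta') for i.i.d. data:
    sum_i sum_y P(y | x_i, theta') * log P(x_i, y | theta).

    posterior(y, x) and joint(x, y) are hypothetical stand-ins for
    the model's P(y | x, theta') and P(x, y | theta).
    """
    return sum(posterior(y, x) * math.log(joint(x, y))
               for x in points for y in labels)
```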


Overview
- Expectation-Maximization
- Mixture Model Training
- Learning String Edit-Distance


EM Applications: Mixture Models
- Gaussian/normal distribution
  - Parameters: mean µ and variance σ²
  - In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions
- We can model arbitrary distributions with density mixtures


Density Mixtures
- Combine m elementary densities to model a complex data distribution:
  P(x | Θ) = Σ_{k=1..m} α_k p_k(x | θ_k), with mixture weights α_k ≥ 0 and Σ_k α_k = 1
- The kth Gaussian is parametrized by θ_k = (µ_k, σ_k)



Density Mixtures
- Combine m elementary densities to model a complex data distribution
- Log-likelihood function of the data x given Θ = (α_1, …, α_m, θ_1, …, θ_m):
  log P(x | Θ) = Σ_i log Σ_{k=1..m} α_k p_k(x_i | θ_k)
- Log of a sum – hard to optimize analytically!
- Instead, introduce a hidden variable y
  - y_i = k: x_i generated by Gaussian k
- EM formulation: maximize Q(Θ, Θ′) = E[log P(x, y | Θ) | x, Θ′]
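As a sketch of this "log of a sum", here is how the mixture log-likelihood could be evaluated for a 1-D Gaussian mixture; the parameter values and data points are invented, and this is only the objective EM improves, not the EM update itself.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(data, alphas, mus, sigmas):
    """sum_i log sum_k alpha_k * N(x_i | mu_k, sigma_k)."""
    return sum(math.log(sum(a * gaussian_pdf(x, m, s)
                            for a, m, s in zip(alphas, mus, sigmas)))
               for x in data)

# Invented two-component mixture and a few data points.
print(mixture_log_likelihood([0.1, 1.9, 2.2, -0.3],
                             alphas=[0.5, 0.5], mus=[0.0, 2.0], sigmas=[1.0, 1.0]))
```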

Gaussian Mixture Model EM
- Goal: maximize Q(Θ, Θ′) = E[log P(x, y | Θ) | x, Θ′]
- n (observed) data points: x = (x_1, …, x_n)
- n (hidden) labels: y = (y_1, …, y_n)
  - y_i = k: x_i generated by Gaussian k
- Several pages of math later, we get:
  - E-step: compute the likelihood of each label, P(y_i = k | x_i, Θ′), for every point i and component k
  - M-step: update α_k, µ_k, σ_k for each Gaussian k = 1..m from those quantities (see the sketch below)
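A minimal 1-D GMM-EM sketch following this recipe, assuming scalar variances and invented toy data; a real implementation would add convergence checks and better numerical safeguards.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_em(data, alphas, mus, sigmas, iterations=50):
    m = len(alphas)
    for _ in range(iterations):
        # E-step: responsibilities w[i][k] = P(y_i = k | x_i, current parameters)
        w = []
        for x in data:
            joint = [alphas[k] * gaussian_pdf(x, mus[k], sigmas[k]) for k in range(m)]
            total = sum(joint)
            w.append([j / total for j in joint])
        # M-step: re-estimate alpha_k, mu_k, sigma_k from the responsibilities
        for k in range(m):
            nk = sum(w[i][k] for i in range(len(data)))
            alphas[k] = nk / len(data)
            mus[k] = sum(w[i][k] * data[i] for i in range(len(data))) / nk
            var = sum(w[i][k] * (data[i] - mus[k]) ** 2 for i in range(len(data))) / nk
            sigmas[k] = math.sqrt(max(var, 1e-12))  # guard against variance collapse
    return alphas, mus, sigmas

# Invented toy data with rough clusters around 0 and 5.
data = [-0.2, 0.1, 0.3, -0.4, 4.8, 5.1, 5.3, 4.9]
print(gmm_em(data, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]))
```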


GMM-EM Discussion
- Summary: EM is naturally applicable to training probabilistic models
- EM is a generic formulation; you need to do some hairy math to get to an implementation
- Problems with GMM-EM?
  - Local maxima
  - Need to bootstrap the training process (pick an initial θ)
- GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
- Hours of fun with GMM-EM

Overview
- Expectation-Maximization
- Mixture Model Training
- Learning String Edit-Distance


String Edit-Distance
- Notation: operate on two strings, x and y
- Edit-distance: transform one string into another using
  - Substitution: kitten → bitten, at some substitution cost
  - Insertion: cop → crop, at some insertion cost
  - Deletion: learn → earn, at some deletion cost
- Can be computed efficiently with a recursive (dynamic-programming) algorithm
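A standard dynamic-programming sketch of this recursion; the slide does not fix the costs, so the unit default costs below are an assumption.

```python
def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Classic Levenshtein-style recurrence: d[i][j] is the cheapest way
    to turn x[:i] into y[:j]."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = i * del_cost
    for j in range(1, len(y) + 1):
        d[0][j] = j * ins_cost
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost),
                d[i][j - 1] + ins_cost,
                d[i - 1][j] + del_cost,
            )
    return d[len(x)][len(y)]

print(edit_distance("kitten", "bitten"))  # 1 substitution
print(edit_distance("cop", "crop"))       # 1 insertion
print(edit_distance("learn", "earn"))     # 1 deletion
```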

Stochastic String Edit-Distance
- Instead of setting costs by hand, model the edit operation sequence as a random process
- Edit operations are selected according to a probability distribution δ
- For an edit operation sequence z_1 … z_n, the likelihood is the product of the δ probabilities of its operations
- View string edit-distance as
  - memoryless (Markov): each edit operation is chosen independently of the previous ones
  - stochastic: the random process according to δ(⋅) is governed by a true probability distribution
  - transducer: the process consumes symbols of one string while emitting symbols of the other, so it can be represented as a finite-state transducer

Edit-Distance Transducer
- Arc label a:b/0 means input a, output b, and weight 0
- (Figure: an example edit-distance transducer over a small alphabet)


Two Distances
- Define the yield of an edit sequence, ν(z^n #), as the set of all string pairs ⟨x, y⟩ such that z^n # turns x into y
- Viterbi edit-distance: negative log-likelihood of the most likely edit sequence turning x into y
- Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y (summed together)

Evaluating Likelihood
- Viterbi: take the single most likely edit sequence turning x into y
- Stochastic: sum over every edit sequence turning x into y
- Both naively require calculation over all possible edit sequences – exponentially many possibilities (three edit operations to choose from at each step)
- However, the memoryless assumption allows us to compute the likelihood efficiently
- Use the forward-backward method!

Forward
- Evaluation of forward probabilities α_{i,j}: the likelihood of picking an edit sequence that generates the prefix pair ⟨x_1…x_i, y_1…y_j⟩
- The memoryless assumption allows efficient recursive computation:
  α_{0,0} = 1
  α_{i,j} = δ(x_i, ε) α_{i−1,j} + δ(ε, y_j) α_{i,j−1} + δ(x_i, y_j) α_{i−1,j−1}
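A sketch of the forward computation under these assumptions; delta is a hypothetical dictionary mapping edit operations (substitutions (a, b), insertions (None, b), deletions (a, None)) to probabilities in the memoryless model described above.

```python
def forward_probabilities(x, y, delta):
    """alpha[i][j] = likelihood of an edit-operation prefix that generates
    the prefix pair (x[:i], y[:j]) under the memoryless model.

    delta maps operations to probabilities:
      (a, b)    substitution of a by b
      (None, b) insertion of b
      (a, None) deletion of a
    """
    alpha = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    alpha[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0:
                alpha[i][j] += delta.get((x[i - 1], None), 0.0) * alpha[i - 1][j]
            if j > 0:
                alpha[i][j] += delta.get((None, y[j - 1]), 0.0) * alpha[i][j - 1]
            if i > 0 and j > 0:
                alpha[i][j] += delta.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1][j - 1]
    # In the stochastic model, the total pair likelihood also multiplies in the
    # probability of the terminating symbol '#' at alpha[len(x)][len(y)].
    return alpha
```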


Backward
- Evaluation of backward probabilities β_{i,j}: the likelihood of picking an edit sequence that generates the suffix pair ⟨x_{i+1}…x_T, y_{j+1}…y_V⟩ (where T = |x|, V = |y|)
- The memoryless assumption allows efficient recursive computation, mirroring the forward case:
  β_{i,j} = δ(x_{i+1}, ε) β_{i+1,j} + δ(ε, y_{j+1}) β_{i,j+1} + δ(x_{i+1}, y_{j+1}) β_{i+1,j+1}


EM Formulation
- Edit operations are selected according to a probability distribution δ
- So, EM has to update δ based on occurrence counts of each operation (similar to the coin-tossing example)
- Idea: accumulate expected counts from the forward and backward variables
- γ(z): expected count of edit operation z

EM Details
- γ(z): expected count of edit operation z
- e.g., for a substitution z = (a, b), sum the posterior probability of using it at each aligned position:
  γ(a, b) = Σ_{i: x_i = a} Σ_{j: y_j = b} α_{i−1,j−1} δ(a, b) β_{i,j} / P(x, y)
- The M-step then renormalizes the expected counts to obtain the new δ


References
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
- C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
- F. Jelinek. Statistical Methods for Speech Recognition, 1997.
- M. Collins. The EM Algorithm, 1997.
- J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report TR-97-021, U.C. Berkeley, 1998.
- E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
- L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2), 1989, pp. 257-286.
- A. D'Souza. Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.
- M. Mohri. Edit-distance of weighted automata. In Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
- J. Glass. Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
- Carlo Tomasi. Estimating Gaussian Mixture Densities with EM – A Tutorial, 2004.
- Wikipedia