Probabilistic Graphical Models
Learning: Parameter Estimation
Bayesian Estimation
Daphne Koller
Limitations of MLE
• Two teams play 10 times, and the first wins 7 of the 10 matches ⇒ probability of first team winning = 0.7
• A coin is tossed 10 times, and comes out ‘heads’ 7 of the 10 tosses ⇒ probability of heads = 0.7
• A coin is tossed 10000 times, and comes out ‘heads’ 7000 of the 10000 tosses ⇒ probability of heads = 0.7
In all three cases the maximum-likelihood estimate is 0.7, even though the amount of data — and hence how confident we should be in the estimate — differs enormously.
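The point can be checked in a few lines: the MLE is just the raw frequency, so it ignores the sample size entirely. A minimal sketch (the helper name is mine):

```python
# Maximum-likelihood estimate for a Bernoulli parameter: the raw frequency.
def mle_bernoulli(successes, trials):
    return successes / trials

# The 10-toss and 10000-toss experiments yield the same point estimate,
# even though the larger one should leave us far more confident.
print(mle_bernoulli(7, 10))        # 0.7
print(mle_bernoulli(7000, 10000))  # 0.7
```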
Parameter Estimation as a PGM
[Figure: plate model — a parameter node θ with an arrow into each toss X[1], ..., X[M], drawn as a plate over the data m = 1..M]
• Given a fixed θ, tosses are independent
• If θ is unknown, tosses are not marginally independent – each toss tells us something about θ
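The dependence between tosses can be made concrete numerically. A sketch assuming a uniform prior over θ (the slide does not fix a prior) and a simple grid approximation to the integrals:

```python
# With theta unknown (uniform prior), observing one head raises the
# probability that the next toss is a head: the tosses are dependent.
N = 10_000
thetas = [(i + 0.5) / N for i in range(N)]  # midpoint grid on (0, 1)
prior = [1.0 / N] * N                        # discretized uniform prior

# Marginal P(X2 = H) = integral of theta * p(theta) dtheta = 1/2
p_h = sum(t * w for t, w in zip(thetas, prior))

# Joint P(X1 = H, X2 = H) = integral of theta^2 * p(theta) dtheta = 1/3
p_hh = sum(t * t * w for t, w in zip(thetas, prior))

# Conditional P(X2 = H | X1 = H) = (1/3) / (1/2) = 2/3 > 1/2
p_h2_given_h1 = p_hh / p_h
print(p_h, p_h2_given_h1)
```

Seeing a head makes larger values of θ more likely, which in turn makes the next head more likely — exactly the "each toss tells us something about θ" effect.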
Bayesian Inference
[Figure: network with θ as a parent of each of X[1], ..., X[M]]
• Joint probabilistic model:
P(x[1], ..., x[M], θ) = P(x[1], ..., x[M] | θ) P(θ)
= P(θ) ∏_{i=1}^{M} P(x[i] | θ)
= P(θ) θ^{M_H} (1 − θ)^{M_T}
• Posterior:
P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
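The posterior formula can be evaluated numerically; a sketch for M_H = 7 heads and M_T = 3 tails under a uniform prior, with a grid sum standing in for the normalizing integral P(x[1], ..., x[M]):

```python
# Posterior over theta for 7 heads, 3 tails, uniform prior:
# P(theta | D) proportional to theta^7 * (1 - theta)^3.
N = 10_000
thetas = [(i + 0.5) / N for i in range(N)]  # midpoint grid on (0, 1)

unnorm = [t**7 * (1 - t)**3 for t in thetas]  # likelihood * (uniform prior)
Z = sum(unnorm)                                # normalizer, up to grid scale
posterior = [u / Z for u in unnorm]

post_mean = sum(t * p for t, p in zip(thetas, posterior))
print(post_mean)  # close to (7+1)/(10+2) = 2/3, the Beta(8,4) mean
```

Note the posterior mean 2/3 differs from the MLE 0.7 — the uniform prior acts like one extra pseudo-head and one extra pseudo-tail.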
Dirichlet Distribution
• θ is a multinomial distribution over k values
• Dirichlet distribution: θ ~ Dirichlet(α1, ..., αk)
– P(θ) = (1/Z) ∏_{i=1}^{k} θ_i^{α_i − 1}, where Z = ∏_{i=1}^{k} Γ(α_i) / Γ(∑_{i=1}^{k} α_i)
– Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt
• Intuitively, hyperparameters correspond to the number of samples we have seen
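The density and its normalizer Z can be evaluated directly with the standard-library gamma function; a minimal sketch for k = 2 (function name is mine):

```python
import math

# Dirichlet density P(theta) = (1/Z) * prod_i theta_i^(alpha_i - 1),
# with Z = prod_i Gamma(alpha_i) / Gamma(sum_i alpha_i).
def dirichlet_pdf(theta, alpha):
    Z = math.prod(math.gamma(a) for a in alpha) / math.gamma(sum(alpha))
    return math.prod(t ** (a - 1) for t, a in zip(theta, alpha)) / Z

print(dirichlet_pdf([0.5, 0.5], [1, 1]))  # 1.0: Dirichlet(1,1) is uniform
print(dirichlet_pdf([0.5, 0.5], [2, 2]))  # 1.5: density peaks at the center
```

For k = 2 this is just the Beta density, which is why the plots on the next slide are curves over a single value θ ∈ [0, 1].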
Dirichlet Distributions
[Figure: densities on [0, 1] for Dirichlet(1,1), Dirichlet(2,2), Dirichlet(0.5,0.5), and Dirichlet(5,5); y-axis from 0 to 5]
Dirichlet Priors & Posteriors
P(θ | D) ∝ P(D | θ) P(θ)
P(D | θ) = ∏_{i=1}^{k} θ_i^{M_i}
P(θ) ∝ ∏_{i=1}^{k} θ_i^{α_i − 1}
• If P(θ) is Dirichlet and the likelihood is multinomial, then the posterior is also Dirichlet
– Prior is Dir(α1, ..., αk)
– Data counts are M1, ..., Mk
– Posterior is Dir(α1+M1, ..., αk+Mk)
• Dirichlet is a conjugate prior for the multinomial
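The conjugate update above is literally just adding counts to hyperparameters; a sketch (function name is mine):

```python
# Conjugate update: prior Dir(alpha), multinomial counts M
# -> posterior Dir(alpha + M). No integration needed.
def dirichlet_posterior(alpha, counts):
    return [a + m for a, m in zip(alpha, counts)]

prior = [1, 1]    # Dir(1,1): uniform prior on a coin's bias
counts = [7, 3]   # 7 heads, 3 tails observed
post = dirichlet_posterior(prior, counts)
print(post)       # [8, 4]

# Posterior mean of theta_i is alpha_i / sum(alpha)
mean = [a / sum(post) for a in post]
print(mean)       # [2/3, 1/3]
```

This is why the hyperparameters behave like pseudo-counts: Dir(1,1) plus 7 heads and 3 tails gives the same Dir(8,4) posterior as the grid computation earlier.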
Summary
• Bayesian learning treats parameters as random variables
– Learning is then a special case of inference
• The Dirichlet distribution is conjugate to the multinomial
– Posterior has same form as prior
– Can be updated in closed form using sufficient statistics from data