Probabilistic Graphical Models
Learning: Parameter Estimation
Bayesian Prediction
Daphne Koller
With a Dirichlet prior $\theta \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k)$, so that $P(\theta) = \frac{1}{Z} \prod_j \theta_j^{\alpha_j - 1}$, the prior predictive probability of each value is:

$$P(X = x_i) = \int_\theta P(X = x_i \mid \theta)\, P(\theta)\, d\theta = \int_\theta \theta_i \cdot \frac{1}{Z} \prod_j \theta_j^{\alpha_j - 1}\, d\theta = \frac{\alpha_i}{\sum_j \alpha_j}$$
• Dirichlet hyperparameters act like counts of samples we have already seen (pseudo-counts)
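The formula above can be sketched in a few lines of plain Python (the function name is my own; `Fraction` is used for exact arithmetic):

```python
from fractions import Fraction

def dirichlet_prior_predictive(alphas):
    """Prior predictive P(X = x_i) under a Dirichlet(alpha_1, ..., alpha_k) prior.

    Integrating theta_i against the Dirichlet density gives alpha_i / sum_j alpha_j.
    """
    total = sum(alphas)
    return [Fraction(a, total) for a in alphas]

# A symmetric Dirichlet(1, 1, 1) prior predicts each of the 3 values uniformly.
print(dirichlet_prior_predictive([1, 1, 1]))
```

Note that only the ratios of the hyperparameters matter for this prior prediction; their total magnitude matters once data arrives, as the next slides show.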
Bayesian Prediction

[Figure: plate model with θ as parent of X[1], …, X[M] and of the query variable X[M+1]]

With $\theta \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k)$ and observed samples x[1], …, x[M]:

$$P(x[M+1] \mid x[1], \ldots, x[M]) = \int_\theta P(x[M+1] \mid x[1], \ldots, x[M], \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$$
$$= \int_\theta P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$$

where the second equality holds because X[M+1] is independent of the earlier samples given θ. The posterior is again Dirichlet, $\theta \mid x[1], \ldots, x[M] \sim \text{Dirichlet}(\alpha_1 + M_1, \ldots, \alpha_k + M_k)$, so:

$$P(X[M+1] = x_i \mid x[1], \ldots, x[M]) = \frac{\alpha_i + M_i}{\alpha + M}$$
• Equivalent sample size α = α1 + … + αk (and M = M1 + … + Mk)
  – Larger α ⇒ more confidence in our prior
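The posterior predictive rule above amounts to adding the hyperparameters to the observed counts before normalizing. A minimal sketch (function name is my own):

```python
from fractions import Fraction

def posterior_predictive(alphas, counts):
    """P(X[M+1] = x_i | data) = (alpha_i + M_i) / (alpha + M),
    where alpha = sum(alphas) and M = sum(counts)."""
    denom = sum(alphas) + sum(counts)
    return [Fraction(a + m, denom) for a, m in zip(alphas, counts)]

# Dirichlet(2, 2) prior plus observed counts (3, 1): P(X=1) = (2+3)/(4+4) = 5/8.
print(posterior_predictive([2, 2], [3, 1]))
```

Setting all alphas to zero recovers the MLE M_i / M, which makes the "imaginary samples" reading of the hyperparameters concrete.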
Example: Binomial Data

• Prior: uniform for θ in [0,1], i.e. Dirichlet(1,1): $P(\theta) = \frac{1}{Z} \prod_k \theta_k^{\alpha_k - 1}$
• Data: (M1, M0) = (4, 1)

[Figure: posterior density over θ ∈ [0, 1] after observing (M1, M0) = (4, 1)]

• MLE for P(X[6] = 1) = 4/5
• Bayesian prediction: (α1 + M1) / (α + M) = (1 + 4) / (2 + 5) = 5/7
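The two numbers on this slide are a one-line computation each; a quick check in Python:

```python
from fractions import Fraction

M1, M0 = 4, 1            # observed counts of X=1 and X=0
alpha1, alpha0 = 1, 1    # uniform prior on [0,1] = Dirichlet(1, 1)

mle = Fraction(M1, M1 + M0)                                # 4/5
bayes = Fraction(alpha1 + M1, alpha1 + alpha0 + M1 + M0)   # 5/7

print(mle, bayes)
```

The Bayesian prediction 5/7 ≈ 0.714 is pulled toward the prior mean 1/2 relative to the MLE 4/5, which is exactly the smoothing effect of the two imaginary samples.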
Effect of Priors

• Prediction of P(X = 1) after seeing data with M1 = ¼ M0, as a function of sample size M

[Figure, two panels, both plotting the prediction against M from 0 to 100: left — different strengths α = α1 + α0 with a fixed ratio α1 / α0; right — fixed strength α = α1 + α0 with different ratios α1 / α0]
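The left panel's behavior can be reproduced numerically; this sketch (my own function, assuming a symmetric prior α1 = α0 and data with M1 = M/5, i.e. M1 = ¼ M0) shows that stronger priors start closer to the prior mean but all curves converge to the empirical ratio 0.2:

```python
def prediction(alpha1, alpha0, M):
    """Bayesian prediction of P(X=1) when M1 = M/5 of the M samples are 1."""
    M1 = M / 5.0
    return (alpha1 + M1) / (alpha1 + alpha0 + M)

# Prior strengths alpha = 2, 10, 50 with fixed ratio alpha1/alpha0 = 1.
for alpha in (2, 10, 50):
    row = [round(prediction(alpha / 2, alpha / 2, M), 3) for M in (5, 20, 100)]
    print(f"alpha={alpha:3d}: {row}")
```

Larger α keeps the early predictions near 0.5 for longer; as M grows past α, the data dominates and every curve approaches M1/M = 0.2.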
Effect of Priors
• In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: running estimate of P(X = 1 | D) after each of M = 1, …, 50 tosses, for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors; stronger priors produce smoother curves]
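This smoothing effect is easy to demonstrate on simulated tosses (the setup below is illustrative, not the slide's actual data; setting the hyperparameters to zero recovers the MLE):

```python
import random

random.seed(0)
# Simulated coin with true P(X=1) = 0.3.
tosses = [1 if random.random() < 0.3 else 0 for _ in range(50)]

def estimates(tosses, alpha1, alpha0):
    """Running estimate of P(X=1) after each toss; (0, 0) gives the MLE."""
    out, m1 = [], 0
    for n, x in enumerate(tosses, start=1):
        m1 += x
        out.append((alpha1 + m1) / (alpha1 + alpha0 + n))
    return out

mle = estimates(tosses, 0, 0)
smooth = estimates(tosses, 5, 5)   # Dirichlet(5, 5) prior

# The MLE jumps between 0 and 1 for small n; the Dirichlet(5,5)
# curve stays near 0.5 early on and drifts smoothly toward the data.
print(mle[:5])
print([round(p, 3) for p in smooth[:5]])
```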
Summary

• Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples
• Asymptotically the same as the MLE
• But Dirichlet hyperparameters determine both the prior beliefs and their strength