Bayesian Prediction

Probabilistic Graphical Models

Learning Parameter Estimation

Bayesian Prediction

Daphne Koller

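Bayesian Prediction

Model: θ → X, with $\theta \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k)$

$P(X) = \int_\theta P(X \mid \theta)\, P(\theta)\, d\theta$

$P(X = x^i \mid \theta) = \theta_i$

$P(X = x^i) = \frac{1}{Z} \int_\theta \theta_i \cdot \prod_j \theta_j^{\alpha_j - 1}\, d\theta = \frac{\alpha_i}{\sum_j \alpha_j}$

• Dirichlet hyperparameters correspond to the number of samples we have seen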
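As a rough check of the closed form α_i / Σ_j α_j, the sketch below compares it with a Monte Carlo average of θ_i over Dirichlet draws. The hyperparameters (2, 3, 5), the sample count, and all function names are illustrative assumptions, not taken from the slides.

```python
import random
from fractions import Fraction


def prior_predictive(alphas):
    """Closed form: P(X = x_i) = alpha_i / sum_j alpha_j (integer alphas)."""
    total = sum(alphas)
    return [Fraction(a, total) for a in alphas]


def sample_dirichlet(alphas, rng):
    """Draw theta ~ Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]


def monte_carlo_predictive(alphas, n_samples=20000, seed=0):
    """Approximate the integral of theta_i * P(theta) by averaging theta_i."""
    rng = random.Random(seed)
    sums = [0.0] * len(alphas)
    for _ in range(n_samples):
        theta = sample_dirichlet(alphas, rng)
        for i, t in enumerate(theta):
            sums[i] += t
    return [s / n_samples for s in sums]


alphas = [2, 3, 5]  # hypothetical hyperparameters
exact = prior_predictive(alphas)
approx = monte_carlo_predictive(alphas)
for e, a in zip(exact, approx):
    print(f"exact={float(e):.3f}  monte_carlo={a:.3f}")
```

The two columns should agree up to sampling error, since averaging θ_i over prior draws estimates the same integral.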


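Bayesian Prediction

Model: θ → X[1], ..., X[M], X[M+1], with $\theta \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k)$

$P(x[M+1] \mid x[1], \ldots, x[M]) = \int_\theta P(x[M+1] \mid x[1], \ldots, x[M], \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$

$= \int_\theta P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$

$\theta \mid x[1], \ldots, x[M] \sim \mathrm{Dirichlet}(\alpha_1 + M_1, \ldots, \alpha_k + M_k)$

$P(X[M+1] = x^i \mid x[1], \ldots, x[M]) = \frac{\alpha_i + M_i}{\alpha + M}$

Here M_i is the number of observations with value x^i, and M = M_1 + … + M_k.

• Equivalent sample size α = α1 + … + αk

– Larger α ⇒ more confidence in our prior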
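The update from prior to posterior hyperparameters and the resulting prediction can be sketched as follows. The value names, prior counts, and data sequence are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def posterior_hyperparameters(alphas, data):
    """Add the observed counts M_i to the prior hyperparameters alpha_i."""
    counts = Counter(data)
    return {value: a + counts.get(value, 0) for value, a in alphas.items()}


def bayesian_prediction(alphas, data):
    """P(X[M+1] = x_i | data) = (alpha_i + M_i) / (alpha + M)."""
    post = posterior_hyperparameters(alphas, data)
    total = sum(post.values())  # equals alpha + M
    return {value: Fraction(a, total) for value, a in post.items()}


alphas = {"a": 1, "b": 1, "c": 2}      # hypothetical prior, alpha = 4
data = ["a", "c", "c", "b", "c", "a"]  # hypothetical sample, M = 6
pred = bayesian_prediction(alphas, data)
for value, p in pred.items():
    print(value, p)
```

Only the counts matter here: reordering the data leaves the prediction unchanged.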

Example: Binomial Data

• Prior: uniform for θ in [0,1], i.e. Dirichlet(1,1)

$P(\theta) = \frac{1}{Z} \prod_k \theta_k^{\alpha_k - 1}$

• Data: (M1, M0) = (4, 1)

[Figure: plot over θ from 0 to 1]

• MLE for P(X[6] = 1) = 4/5

• Bayesian prediction is 5/7
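The two numbers in this example can be reproduced with a few lines. This is a minimal sketch that assumes the uniform prior is encoded as α1 = α0 = 1; the function names are hypothetical.

```python
from fractions import Fraction


def mle(m1, m0):
    """Maximum likelihood estimate of P(X = 1): M1 / (M1 + M0)."""
    return Fraction(m1, m1 + m0)


def bayes(m1, m0, alpha1=1, alpha0=1):
    """Bayesian prediction: (alpha1 + M1) / (alpha1 + alpha0 + M1 + M0)."""
    return Fraction(alpha1 + m1, alpha1 + alpha0 + m1 + m0)


print(mle(4, 1))    # 4/5
print(bayes(4, 1))  # 5/7
```

The uniform prior acts like one imaginary sample of each outcome, which pulls the prediction from 4/5 toward 1/2.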

Effect of Priors

• Prediction of P(X=1) after seeing data with M1 = ¼ M0, as a function of sample size M

[Figure, two panels; x-axis: sample size M from 0 to 100; y-axis: predicted P(X=1). One panel: different strength α = α1 + α0, fixed ratio α1 / α0. Other panel: fixed strength α = α1 + α0, different ratio α1 / α0.]
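Curves of this kind can be recomputed from the prediction formula. The sketch below assumes M1 = ¼ M0 means M1 = M/5, and the specific prior strengths and prior means are illustrative choices, not the values used in the original plots.

```python
def prediction(m, alpha1, alpha0):
    """Predicted P(X = 1) after M samples of which a fifth are 1s."""
    m1 = m / 5.0  # M1 = M0 / 4 implies M1 = M / 5
    return (alpha1 + m1) / (alpha1 + alpha0 + m)


sample_sizes = list(range(0, 101, 20))

# Different strength, fixed ratio alpha1 / alpha0 = 1 (hypothetical strengths).
varying_strength = {
    alpha: [prediction(m, alpha / 2, alpha / 2) for m in sample_sizes]
    for alpha in (2, 10, 50)
}

# Fixed strength alpha = 10, different prior means alpha1 / alpha (hypothetical).
varying_ratio = {
    mean: [prediction(m, 10 * mean, 10 * (1 - mean)) for m in sample_sizes]
    for mean in (0.1, 0.5, 0.9)
}

for alpha, curve in varying_strength.items():
    print(f"alpha={alpha:>2}:", [round(p, 3) for p in curve])
for mean, curve in varying_ratio.items():
    print(f"prior mean={mean}:", [round(p, 3) for p in curve])
```

Every curve starts at its prior mean and moves toward the empirical frequency 0.2; weaker priors get there with fewer samples.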

Effect of Priors

• In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: P(X = 1 | D) against the number of tosses (up to 50) for MLE, Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10); the toss results are plotted alongside.]
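To illustrate the sensitivity claim, the sketch below tracks the running estimate of P(X = 1) on a short toss sequence and reports how far each estimator swings. The toss sequence is invented for illustration, and using the max-minus-min spread as a measure of sensitivity is an assumption of this sketch; the prior settings match the figure legend.

```python
def running_estimates(tosses, alpha1=0.0, alpha0=0.0):
    """Estimate P(X = 1) after each toss; alpha1 = alpha0 = 0 gives the MLE."""
    estimates = []
    heads = 0
    for n, toss in enumerate(tosses, start=1):
        heads += toss
        estimates.append((alpha1 + heads) / (alpha1 + alpha0 + n))
    return estimates


tosses = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical data

priors = {
    "MLE": (0.0, 0.0),
    "Dirichlet(.5,.5)": (0.5, 0.5),
    "Dirichlet(1,1)": (1.0, 1.0),
    "Dirichlet(5,5)": (5.0, 5.0),
    "Dirichlet(10,10)": (10.0, 10.0),
}

# Spread of each trajectory: a crude measure of how much the estimate jumps.
spread = {}
for name, (a1, a0) in priors.items():
    est = running_estimates(tosses, a1, a0)
    spread[name] = max(est) - min(est)
    print(f"{name:<17} spread={spread[name]:.3f}")
```

On this sequence the MLE jumps to 1.0 after the first toss, while the stronger priors stay close to 0.5 and drift slowly.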

Summary

• Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples

• Asymptotically the same as MLE

• But Dirichlet hyperparameters determine both the prior beliefs and their strength
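One way to see the first two summary points together is to rewrite the prediction as a weighted average of the prior mean and the MLE. The short derivation below uses the symbols α_i, α, M_i, and M from the earlier slides; the rearrangement itself is added here as an illustration.

```latex
\begin{align*}
P(X[M+1] = x^i \mid x[1], \ldots, x[M])
  &= \frac{\alpha_i + M_i}{\alpha + M} \\
  &= \frac{\alpha}{\alpha + M} \cdot \frac{\alpha_i}{\alpha}
   + \frac{M}{\alpha + M} \cdot \frac{M_i}{M}
\end{align*}
```

The first term is the prior mean weighted by the share of imaginary samples, and the second is the MLE weighted by the share of real samples. As M grows with α fixed, the weight α / (α + M) goes to 0, so the prediction approaches the MLE.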