Max Likelihood for Log-Linear Models

Probabilistic Graphical Models

Learning

Parameter Estimation

Daphne Koller

Log-Likelihood for Markov Nets

[Figure: chain-structured Markov network A – B – C with potentials φ1(A, B) and φ2(B, C)]

\ell(\theta : \mathcal{D}) = \sum_m \left[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \right] - M \ln Z(\theta)

where Z(\theta) = \sum_{a,b,c} \phi_1(a, b) \, \phi_2(b, c) is the partition function.

• Partition function couples the parameters
  – No decomposition of the likelihood
  – No closed-form solution


Example: Log-Likelihood Function

[Figure: surface plot of the log-likelihood for the A – B – C network as a function of two of its parameters.]


Log-Likelihood for Log-Linear Model

P(X : \theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_i \theta_i f_i(X) \right\}

\ell(\theta : \mathcal{D}) = \sum_i \theta_i \left( \sum_m f_i(x[m]) \right) - M \ln Z(\theta)

where M is the number of data instances: a linear term in \theta minus M times the log-partition function.

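As a concrete illustration (not from the lecture), here is a minimal sketch that evaluates this log-likelihood for a toy binary model by brute-force enumeration; the feature set, the data, and names such as `features` and `log_partition` are illustrative choices:

```python
import itertools
import math

# Illustrative feature functions f_i(x) over three binary variables:
# indicators that adjacent variables agree.
def features(x):
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def log_partition(theta, n=3):
    # ln Z(theta) = ln sum_x exp( sum_i theta_i f_i(x) ), by enumerating all 2^n states
    return math.log(sum(
        math.exp(sum(t * f for t, f in zip(theta, features(x))))
        for x in itertools.product([0, 1], repeat=n)
    ))

def log_likelihood(theta, data):
    # l(theta : D) = sum_i theta_i sum_m f_i(x[m]) - M ln Z(theta)
    linear_term = sum(sum(t * f for t, f in zip(theta, features(x))) for x in data)
    return linear_term - len(data) * log_partition(theta)

data = [(0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(log_likelihood([0.5, -0.2], data))
```

Note how every parameter enters ln Z(θ): changing one θ_i changes the likelihood's dependence on all the others, which is exactly the coupling described above.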

The Log-Partition Function

Theorem:

\frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[f_i]
\qquad
\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i, f_j]

Proof (first identity):

\frac{\partial}{\partial \theta_i} \ln Z(\theta)
= \frac{1}{Z(\theta)} \sum_x \frac{\partial}{\partial \theta_i} \exp\left\{ \sum_j \theta_j f_j(x) \right\}
= \sum_x f_i(x) \, \frac{\exp\{\sum_j \theta_j f_j(x)\}}{Z(\theta)}
= E_\theta[f_i]


The Log-Partition Function

Theorem (consequence): the Hessian of \ln Z(\theta) is the covariance matrix of the features, which is positive semi-definite, so \ln Z(\theta) is a convex function of \theta.

• The log-likelihood, a linear term minus M \ln Z(\theta), is therefore concave
  – No local optima
  – Easy to optimize

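A quick numerical sanity check of the first identity, reusing the toy model from the sketch above (illustrative, not from the lecture): a finite-difference estimate of the derivative of ln Z should agree with the model expectation of the corresponding feature.

```python
import itertools
import math

def features(x):
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def log_partition(theta, n=3):
    return math.log(sum(
        math.exp(sum(t * f for t, f in zip(theta, features(x))))
        for x in itertools.product([0, 1], repeat=n)
    ))

def expected_features(theta, n=3):
    # E_theta[f_i] = sum_x f_i(x) P(x; theta), by exact enumeration
    logZ = log_partition(theta, n)
    exp_f = [0.0, 0.0]
    for x in itertools.product([0, 1], repeat=n):
        p = math.exp(sum(t * f for t, f in zip(theta, features(x))) - logZ)
        for i, f in enumerate(features(x)):
            exp_f[i] += p * f
    return exp_f

theta, eps = [0.5, -0.2], 1e-6
grad0 = (log_partition([theta[0] + eps, theta[1]]) -
         log_partition([theta[0] - eps, theta[1]])) / (2 * eps)
print(grad0, expected_features(theta)[0])  # the two numbers should agree
```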

Maximum Likelihood Estimation

Theorem: \hat{\theta} is the MLE if and only if

E_{\mathcal{D}}[f_i] = E_{\hat{\theta}}[f_i] \quad \text{for every feature } f_i

i.e., the expected feature counts in the data match the expected feature counts relative to the model (moment matching).

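This condition follows from the log-partition function theorem; a one-line sketch of the derivation, using the log-likelihood of the log-linear model above:

\frac{\partial}{\partial \theta_i} \ell(\theta : \mathcal{D})
= \sum_m f_i(x[m]) - M \frac{\partial}{\partial \theta_i} \ln Z(\theta)
= M \left( E_{\mathcal{D}}[f_i] - E_\theta[f_i] \right)

Setting the gradient to zero gives the moment-matching condition; since the log-likelihood is concave, any stationary point is a global maximum, which makes the condition sufficient as well as necessary.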

Computation: Gradient Ascent

• Use gradient ascent
  – typically L-BFGS, a quasi-Newton method
• For the gradient, need expected feature counts:
  – in the data
  – relative to the current model
• Requires inference at each gradient step
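A minimal end-to-end sketch under the same toy model (plain gradient ascent stands in for L-BFGS, and exact enumeration stands in for inference; all names, step sizes, and data are illustrative): at convergence, the model's expected feature counts match the empirical counts, as the moment-matching theorem requires.

```python
import itertools
import math

def features(x):
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def expected_features(theta, n=3):
    # one "inference" call: exact E_theta[f] by enumerating all 2^n assignments
    scores = {x: math.exp(sum(t * f for t, f in zip(theta, features(x))))
              for x in itertools.product([0, 1], repeat=n)}
    Z = sum(scores.values())
    exp_f = [0.0] * len(theta)
    for x, s in scores.items():
        for i, f in enumerate(features(x)):
            exp_f[i] += (s / Z) * f
    return exp_f

data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 1, 1)]
emp_f = [sum(features(x)[i] for x in data) / len(data) for i in range(2)]

theta, lr = [0.0, 0.0], 0.5
for _ in range(2000):                      # gradient ascent (L-BFGS in practice)
    mod_f = expected_features(theta)       # inference at every gradient step
    theta = [t + lr * (e - m) for t, e, m in zip(theta, emp_f, mod_f)]

print(emp_f)                     # empirical expected feature counts
print(expected_features(theta))  # ~ emp_f at the MLE (moment matching)
```

The expensive part is exactly the bullet above: `expected_features` is a full inference call, repeated once per gradient step; for real networks this is done with a calibrated clique tree or cluster graph rather than enumeration.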

Example: Ising Model

For an Ising model over variables x_i \in \{-1, +1\} with the parameterization

P(x) \propto \exp\left\{ \sum_{i<j} w_{ij} x_i x_j + \sum_i u_i x_i \right\}

the features are the pairwise products x_i x_j and the singletons x_i, so the gradient for an edge weight is

\frac{1}{M} \frac{\partial}{\partial w_{ij}} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[x_i x_j] - E_\theta[x_i x_j]

Each gradient step therefore requires the model's pairwise expectations E_\theta[x_i x_j], which must be computed by inference.

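A sketch of that gradient computation for a tiny model (not from the lecture; the 2×2 grid, parameter values, and data are illustrative, and the sign convention matches the parameterization above): the per-edge gradient is E_D[x_i x_j] − E_θ[x_i x_j], with the model expectations obtained here by exact enumeration.

```python
import itertools
import math

edges = [(0, 1), (0, 2), (1, 3), (2, 3)]   # illustrative 2x2 grid

def unnorm_logp(x, w, u):
    # log of unnormalized P(x) = exp( sum_ij w_ij x_i x_j + sum_i u_i x_i )
    return (sum(w[e] * x[i] * x[j] for e, (i, j) in enumerate(edges)) +
            sum(ui * xi for ui, xi in zip(u, x)))

def edge_expectations(w, u):
    # E_theta[x_i x_j] for each edge, enumerating all {-1,+1}^4 states
    states = list(itertools.product([-1, 1], repeat=4))
    ps = [math.exp(unnorm_logp(x, w, u)) for x in states]
    Z = sum(ps)
    return [sum(p * x[i] * x[j] for p, x in zip(ps, states)) / Z
            for (i, j) in edges]

w, u = [0.3, 0.3, 0.3, 0.3], [0.0] * 4
data = [(1, 1, 1, 1), (-1, -1, -1, -1), (1, 1, -1, -1)]
emp = [sum(x[i] * x[j] for x in data) / len(data) for (i, j) in edges]
grad_w = [e - m for e, m in zip(emp, edge_expectations(w, u))]
print(grad_w)  # per-edge gradient of l/M with respect to w_ij
```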

Summary

• Partition function couples the parameters in the likelihood
• No closed-form solution, but a convex optimization problem
  – Solved using gradient ascent (usually L-BFGS)
• Gradient computation requires inference at each gradient step to compute expected feature counts
• By family preservation, each feature's scope lies within some cluster of the cluster graph or clique tree
  – One calibration suffices to compute all feature expectations
