Maximum Likelihood for Log-Linear Models
Daphne Koller
Log-Likelihood for Markov Nets
[Figure: Markov network over variables A, B, C]
• Partition function couples the parameters
  – No decomposition of likelihood
  – No closed-form solution
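The slide's equation is not preserved in the transcript; as a sketch, for the network over A, B, C with clique potentials \phi_1(A,B) and \phi_2(B,C) and M training instances, the log-likelihood takes the form

\ell(\theta : D) = \sum_m \left[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \right] - M \ln Z(\theta)

The first term decomposes over the potentials, but the shared \ln Z(\theta) couples them, which is why there is no closed-form solution.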
Example: Log-Likelihood Function
[Figure: surface plot of the log-likelihood for the A, B, C network as two of the parameters vary]
Log-Likelihood for Log-Linear Model
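The equation itself did not survive extraction; the standard form, consistent with the rest of the lecture, is: a log-linear model with features f_i and weights \theta_i defines

P(X : \theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_i \theta_i f_i(X) \right\}

so the log-likelihood over M instances x[1], \ldots, x[M] is

\ell(\theta : D) = \sum_i \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta)

which is linear in \theta except for the log-partition term.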
The Log-Partition Function
Theorem: the derivatives of the log-partition function are the feature moments of the model:

\frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[f_i]
\qquad
\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i ; f_j]

Proof: differentiating \ln Z(\theta) = \ln \sum_x \exp\{\sum_i \theta_i f_i(x)\} with respect to \theta_i gives \sum_x P(x : \theta) f_i(x) = E_\theta[f_i]; differentiating once more gives E_\theta[f_i f_j] - E_\theta[f_i] E_\theta[f_j] = \mathrm{Cov}_\theta[f_i ; f_j].
The Log-Partition Function
Theorem (consequence): the Hessian of \ln Z(\theta) is the covariance matrix of the features, which is positive semidefinite, so \ln Z(\theta) is convex in \theta.
• The log-likelihood, a linear term minus M \ln Z(\theta), is therefore concave:
  – No local optima
  – Easy to optimize
Maximum Likelihood Estimation
Theorem: \hat{\theta} is the MLE if and only if the expected feature counts match, i.e., for every feature f_i,

E_D[f_i] = E_{\hat{\theta}}[f_i]

where E_D[f_i] = \frac{1}{M} \sum_m f_i(x[m]) is the empirical expectation in the data and E_{\hat{\theta}}[f_i] is the expectation under the model. This follows from setting the gradient of the (concave) log-likelihood to zero.
Computation: Gradient Ascent
• Gradient of the average log-likelihood:

\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = E_D[f_i] - E_\theta[f_i]

• Use gradient ascent:
  – Typically L-BFGS, a quasi-Newton method
• For the gradient, need expected feature counts:
  – In the data
  – Relative to the current model
• Requires inference at each gradient step (a minimal sketch follows)
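As a toy illustration (hypothetical code, not from the lecture): fitting a small log-linear model over binary variables by plain gradient ascent, with brute-force enumeration of Z(\theta) and the model expectations standing in for real inference, and a fixed step size standing in for L-BFGS.

```python
import itertools
import numpy as np

def log_linear_fit(features, data, n_vars, steps=500, lr=0.1):
    """Gradient ascent on the average log-likelihood of a log-linear model.

    features: list of functions f_i mapping a full assignment to a number.
    data:     list of observed assignments (tuples of 0/1).
    """
    theta = np.zeros(len(features))
    states = list(itertools.product([0, 1], repeat=n_vars))

    # Empirical feature expectations E_D[f_i] (fixed across iterations).
    e_data = np.array([np.mean([f(x) for x in data]) for f in features])

    for _ in range(steps):
        # Unnormalized scores exp{sum_i theta_i f_i(x)} for every state;
        # this enumeration plays the role of inference and would be a
        # calibrated clique tree / cluster graph in a real implementation.
        scores = np.array([np.exp(sum(t * f(x) for t, f in zip(theta, features)))
                           for x in states])
        probs = scores / scores.sum()   # P(x : theta); the sum is Z(theta)

        # Model feature expectations E_theta[f_i].
        e_model = np.array([sum(p * f(x) for p, x in zip(probs, states))
                            for f in features])

        # Gradient of the average log-likelihood: E_D[f_i] - E_theta[f_i].
        theta += lr * (e_data - e_model)
    return theta

# Usage: two binary variables, an agreement feature and a bias feature.
feats = [lambda x: float(x[0] == x[1]), lambda x: float(x[0])]
data = [(0, 0), (1, 1), (1, 1), (0, 0), (1, 0)]
print(log_linear_fit(feats, data, n_vars=2))
```

At convergence the two expectations match, which is exactly the moment-matching condition of the MLE theorem above.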
Example: Ising Model
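The formula did not survive extraction; the standard Ising model, a pairwise log-linear model over X_i \in \{-1, +1\} with a feature x_i x_j per edge and x_i per node, is

P(x : w, u) = \frac{1}{Z} \exp\left\{ \sum_{i < j} w_{ij} x_i x_j + \sum_i u_i x_i \right\}

so the gradient requires the pairwise expectations E_\theta[X_i X_j] and the singleton expectations E_\theta[X_i].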
Summary
• Partition function couples the parameters in the likelihood
• No closed-form solution, but a convex optimization problem
  – Solved using gradient ascent (usually L-BFGS)
• Gradient computation requires inference at each gradient step to compute expected feature counts
• Features always fall within clusters in the cluster graph or clique tree, due to family preservation
  – One calibration suffices for all feature expectations
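Concretely (spelling out the family-preservation point above): if feature f_i has scope within cluster C_i, whose normalized calibrated belief is \beta_i, then its model expectation can be read off that belief,

E_\theta[f_i] = \sum_{c_i} \beta_i(c_i) \, f_i(c_i)

summing over assignments c_i to C_i, so a single calibration pass yields every expectation needed for the gradient.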