Probabilistic Graphical Models
Representation: Markov Networks
Pairwise Markov Networks
[Figure: pairwise Markov network over a 4×4 grid of variables A1,1 through A4,4]
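As a minimal sketch (not from the slides), a grid like the one above can be written down as a pairwise Markov network: one variable per grid cell and one pairwise factor per pair of adjacent cells. The potential values below are illustrative placeholders.

```python
import itertools
import numpy as np

# Sketch of a pairwise Markov network on a 4x4 grid.
# Variables A[i,j] are binary; each edge between horizontally or
# vertically adjacent cells carries its own pairwise potential.
n = 4
variables = [(i, j) for i in range(n) for j in range(n)]

def adjacent_pairs(n):
    """Yield pairs of grid coordinates that share an edge."""
    for i, j in itertools.product(range(n), range(n)):
        if j + 1 < n:
            yield (i, j), (i, j + 1)   # horizontal edge
        if i + 1 < n:
            yield (i, j), (i + 1, j)   # vertical edge

# One 2x2 table of potential values per edge (placeholder numbers);
# entry [a, b] scores the assignment (X_u = a, X_v = b).
pairwise_factors = {
    (u, v): np.array([[10.0, 1.0],
                      [1.0, 10.0]])    # favors agreeing neighbors
    for u, v in adjacent_pairs(n)
}
print(len(variables), len(pairwise_factors))   # 16 variables, 24 edges
```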
Probabilistic Graphical Models
Representation: Markov Networks
General Gibbs Distribution
P(A,B,C,D)
[Figure: network over the variables A, B, C, D]
Gibbs Distribution
• Parameters: general factors φi(Di), Φ = {φi(Di)}
A    B    C    φ(A,B,C)
a1   b1   c1   0.25
a1   b1   c2   0.35
a1   b2   c1   0.08
a1   b2   c2   0.16
a2   b1   c1   0.05
a2   b1   c2   0.07
a2   b2   c1   0
a2   b2   c2   0
a3   b1   c1   0.15
a3   b1   c2   0.21
a3   b2   c1   0.09
a3   b2   c2   0.18
Gibbs Distribution
• Unnormalized measure: P̃Φ(X1, …, Xn) = Πi φi(Di)
• Partition function: ZΦ = Σ over X1, …, Xn of P̃Φ(X1, …, Xn)
• Normalized distribution: PΦ(X1, …, Xn) = P̃Φ(X1, …, Xn) / ZΦ
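A hedged sketch of this definition in code: multiply the factors to get the unnormalized measure, sum it over all assignments to get ZΦ, and divide. The variable names and factor values are placeholders, not taken from the slides.

```python
import itertools
import numpy as np

# Gibbs distribution over binary variables: product of factors,
# normalized by the partition function Z.
variables = ["A", "B", "C", "D"]

# Each factor is (scope, table); the table is indexed by the scope variables' values.
factors = [
    (("A", "B"), np.array([[30.0, 5.0], [1.0, 10.0]])),
    (("B", "C"), np.array([[100.0, 1.0], [1.0, 100.0]])),
    (("C", "D"), np.array([[1.0, 100.0], [100.0, 1.0]])),
    (("D", "A"), np.array([[100.0, 1.0], [1.0, 100.0]])),
]

def unnormalized(assignment):
    """Product of all factor values for a full assignment {var: value}."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(assignment[v] for v in scope)]
    return p

# Partition function: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def prob(assignment):
    """Normalized Gibbs probability P_Phi(assignment)."""
    return unnormalized(assignment) / Z

print(prob({"A": 0, "B": 0, "C": 0, "D": 0}))
```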
Induced Markov Network
[Figure: network over the variables A, B, C, D]
Induced Markov network HΦ has an edge Xi ─ Xj whenever there is a factor φm ∈ Φ such that Xi, Xj ∈ Dm
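A small sketch (assumed helper, not from the slides) of constructing the induced Markov network from factor scopes: connect every pair of variables that appear together in some scope.

```python
import itertools

def induced_markov_network(scopes):
    """Return the edge set of H_Phi: an edge Xi - Xj whenever some
    factor scope contains both Xi and Xj."""
    edges = set()
    for scope in scopes:
        for xi, xj in itertools.combinations(sorted(scope), 2):
            edges.add((xi, xj))
    return edges

# Example: factors over {A,B}, {B,C}, {C,D}, {A,D} induce the 4-cycle A-B-C-D-A.
print(induced_markov_network([{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}]))
```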
Factorization
P factorizes over H if there exist Φ = {φ1(D1), …, φk(Dk)} such that P = PΦ and H is the induced graph for Φ
Flow of Influence
[Figure: network over the variables A, B, C, D]
• Influence can flow along any trail, regardless of the form of the factors
Active Trails
• A trail X1 ─ … ─ Xn is active given Z if no Xi is in Z
[Figure: network over the variables A, B, C, D]
Summary
• Gibbs distribution represents distribution as a product of factors
• Induced Markov network connects every pair of nodes that are in the same factor
• Markov network structure doesn't fully specify the factorization of P
• But active trails depend only on graph structure
Probabilistic Graphical Models
Representation: Markov Networks
Conditional Random Fields
Motivation
• Observed variables X
• Target variables Y
CRF Representation
• Unnormalized measure: P̃Φ(X, Y) = Πi φi(Di)
• Partition function: ZΦ(X) = ΣY P̃Φ(X, Y)
• Conditional distribution: PΦ(Y | X) = P̃Φ(X, Y) / ZΦ(X)
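A minimal sketch (with made-up factor values) of the difference in normalization: the partition function is computed per observed x, summing only over the target variable Y.

```python
import numpy as np

# Tiny CRF P(Y | X): same factor parameterization as a Gibbs distribution,
# but normalized separately for each observed assignment x.
phi_xy = np.array([[3.0, 1.0],   # phi(X=0, Y=0), phi(X=0, Y=1)
                   [1.0, 4.0]])  # phi(X=1, Y=0), phi(X=1, Y=1)
phi_y = np.array([2.0, 1.0])     # single-variable factor over Y

def unnormalized(x, y):
    return phi_xy[x, y] * phi_y[y]

def crf_conditional(x):
    """P(Y | X = x): normalize over Y only, not over (X, Y) jointly."""
    scores = np.array([unnormalized(x, y) for y in (0, 1)])
    return scores / scores.sum()   # Z_Phi(x) = sum over y

print(crf_conditional(0), crf_conditional(1))
```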
CRFs and Logistic Model
• With binary Xi and Y, factors φi(Xi, Y) = exp(wi Xi 1{Y = 1}) give a CRF that is exactly the logistic regression model P(Y = 1 | x1, …, xn) = sigmoid(Σi wi xi) (derivation sketched below)
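A short derivation under the assumptions above (binary Xi and Y, factors φi(Xi, Y) = exp(wi Xi 1{Y = 1})):

```latex
\tilde{P}_\Phi(x, Y{=}1) = \exp\Big(\sum_i w_i x_i\Big), \qquad
\tilde{P}_\Phi(x, Y{=}0) = 1,
```
```latex
P_\Phi(Y{=}1 \mid x) = \frac{\exp\big(\sum_i w_i x_i\big)}{1 + \exp\big(\sum_i w_i x_i\big)}
= \sigma\Big(\sum_i w_i x_i\Big).
```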
CRFs for Language
Features: word capitalized, word in atlas or name list, previous word is “Mrs”, next word is “Times”, …
More CRFs for Language
Summary
• A CRF is parameterized the same as a Gibbs distribution, but normalized differently
• Don't need to model distribution over variables we don't care about
• Allows models with highly expressive features, without worrying about wrong independencies
Probabilistic Graphical Models
Representation: Independencies
Markov Networks
Separation in MNs
Definition: X and Y are separated in H given Z if there is no active trail in H between X and Y given Z
[Figure: network over the variables A, B, C, D, E, F]
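A small sketch (assumed helper, not from the slides) of checking separation: remove the nodes in Z from the graph and test whether X and Y are still connected; any remaining path is an active trail.

```python
from collections import deque

def separated(edges, x, y, z):
    """True iff x and y are separated given z in the undirected graph:
    no active trail exists once the observed nodes z are excluded."""
    z = set(z)
    if x in z or y in z:
        return True
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # BFS from x, never entering observed nodes.
    seen, queue = {x}, deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return False          # found an active trail x - ... - y
        for w in adj.get(u, ()):
            if w not in z and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Example on the 4-cycle A-B-C-D-A: A and C are separated given {B, D}.
print(separated([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")], "A", "C", {"B", "D"}))
```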
Factorization ⇒ Independence: MNs
Theorem: If P factorizes over H, and sepH(X, Y | Z), then P satisfies (X ⊥ Y | Z)
[Figure: network over the variables A, B, C, D, E, F]
Factorization ⇒ Independence: MNs
I(H) = {(X ⊥ Y | Z) : sepH(X, Y | Z)}
If P satisfies I(H), we say that H is an I-map (independency map) of P
Theorem: If P factorizes over H, then H is an I-map of P
Independence ⇒ Factorization
• Theorem (Hammersley-Clifford): For a positive distribution P, if H is an I-map for P, then P factorizes over H
Summary
Two equivalent* views of graph structure:
• Factorization: H allows P to be represented
• I-map: independencies encoded by H hold in P
If P factorizes over a graph H, we can read from the graph independencies that must hold in P (an independency map)
* for positive distributions
Probabilistic Graphical Models
Representation: Independencies
I-maps and Perfect Maps
Capturing Independencies in P
• P factorizes over G ⇒ G is an I-map for P: I(G) ⊆ I(P)
• But not always vice versa: there can be independencies in I(P) that are not in I(G)
Want a Sparse Graph
• If the graph encodes more independencies
  – it is sparser (has fewer parameters)
  – and more informative
• Want a graph that captures as much of the structure in P as possible
Minimal I-map
• Minimal I-map: I-map without redundant edges
• Minimal I-map may still not capture I(P)
[Figure: two minimal I-maps over the variables I, D, G]
Perfect Map
• Perfect map: I(G) = I(P)
  – G perfectly captures the independencies in P
Perfect Map
[Figure: a Markov network over A, B, C, D and several candidate directed graphs over the same variables]
Another imperfect map
[Figure: network over X1, X2, Y, where Y = X1 XOR X2]

X1   X2   Y    Prob
0    0    0    0.25
0    1    1    0.25
1    0    1    0.25
1    1    0    0.25
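To make the point concrete (a short check, not on the original slide): in this distribution every pair of variables is marginally independent, an independence the graph above cannot encode.

```latex
P(Y{=}1) = P(x_1{=}0, x_2{=}1, y{=}1) + P(x_1{=}1, x_2{=}0, y{=}1) = 0.25 + 0.25 = 0.5,
```
```latex
P(Y{=}1 \mid X_1{=}1) = \frac{P(x_1{=}1, x_2{=}0, y{=}1)}{P(X_1{=}1)} = \frac{0.25}{0.5} = 0.5,
```
so X1 ⊥ Y, and by symmetry X2 ⊥ Y and X1 ⊥ X2.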
MN as a perfect map
• Perfect map: I(H) = I(P)
  – H perfectly captures the independencies in P
[Figure: networks over the variables I, D, G]
Uniqueness of Perfect Map
• A perfect map, when one exists, is generally not unique: different graphs can encode exactly the same set of independencies
I-equivalence
Definition: Two graphs G1 and G2 over X1, …, Xn are I-equivalent if I(G1) = I(G2)
Most graphs have many I-equivalent variants
Summary
• Graphs that capture more of I(P) are more compact and provide more insight
• A minimal I-map may fail to capture a lot of structure even if it is present
• A perfect map is great, but may not exist
• Converting BNs ↔ MNs loses independencies
  – BN to MN: loses independencies in v-structures
  – MN to BN: must add triangulating edges to loops
Probabilistic Graphical Models
Representation: Local Structure
Log-Linear Models
Log-Linear Representation
• P(X1, …, Xn) ∝ exp( −Σj wj fj(Dj) )
• Each feature fj has a scope Dj
• Different features can have the same scope
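A hedged sketch of the representation: features with scopes, weights, and an unnormalized measure exp(−Σj wj fj(Dj)). The feature functions, names, and values below are illustrative placeholders.

```python
import math

# Log-linear model: unnormalized measure exp(-sum_j w_j f_j(D_j)).
# Each feature is (scope, function, weight).
features = [
    (("X1", "X2"), lambda a: 1.0 if a["X1"] == a["X2"] else 0.0, -1.2),  # agreement feature
    (("X2",),      lambda a: float(a["X2"]),                      0.7),  # bias on X2
]

def unnormalized(assignment):
    energy = sum(w * f(assignment) for scope, f, w in features)
    return math.exp(-energy)

print(unnormalized({"X1": 1, "X2": 1}), unnormalized({"X1": 0, "X2": 1}))
```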
Representing Table Factors
φ(X1, X2):

          X2 = 0   X2 = 1
X1 = 0    a00      a01
X1 = 1    a10      a11
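Sketched here with assumed notation: any table factor with positive entries akl can be written in log-linear form using one indicator feature per table entry, with weight wkl = −ln(akl).

```latex
f_{k\ell}(X_1, X_2) = \mathbf{1}\{X_1 = k,\ X_2 = \ell\}, \qquad
w_{k\ell} = -\ln a_{k\ell},
```
```latex
\exp\Big(-\sum_{k,\ell} w_{k\ell}\, f_{k\ell}(X_1, X_2)\Big) = a_{X_1 X_2} = \phi(X_1, X_2).
```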
Features for Language
Features: word capitalized, word in atlas or name list, previous word is “Mrs”, next word is “Times”, …
Ising Model
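For reference, one common form of the Ising model (a pairwise MRF over spins xi ∈ {−1, +1}; this convention is supplied here, not recovered from the slide):

```latex
E(x_1, \ldots, x_n) \;=\; \sum_{i<j} w_{i,j}\, x_i x_j \;+\; \sum_i u_i\, x_i,
\qquad P(x_1, \ldots, x_n) \;\propto\; \exp\!\big(-E(x_1, \ldots, x_n)\big).
```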
Metric MRFs
• All Xi take values in label space V
• Want Xi and Xj to take "similar" values
• Distance function µ : V × V → R
  – Reflexivity: µ(v,v) = 0 for all v
  – Symmetry: µ(v1,v2) = µ(v2,v1) for all v1, v2
  – Triangle inequality: µ(v1,v2) ≤ µ(v1,v3) + µ(v3,v2) for all v1, v2, v3
• Values of Xi and Xj that are far apart in µ ⇒ lower probability
Metric MRF Examples
• 0/1 metric: µ(vk,vl) = 0 if vk = vl, 1 otherwise
  As a matrix over four labels:
    0 1 1 1
    1 0 1 1
    1 1 0 1
    1 1 1 0
[Plots: µ(vk,vl) as a function of vk − vl]
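A small sketch (assumed helpers) of the metrics used in these examples: the 0/1 metric, the absolute-difference metric, and its truncated variant min(|vk − vl|, d).

```python
def zero_one_metric(vk, vl):
    """0/1 metric: 0 if the labels agree, 1 otherwise."""
    return 0.0 if vk == vl else 1.0

def linear_metric(vk, vl):
    """Absolute difference between numeric labels."""
    return abs(vk - vl)

def truncated_linear_metric(vk, vl, d=3.0):
    """Truncated linear metric: penalizes differences only up to a cap d."""
    return min(abs(vk - vl), d)

# Pairwise energy term for a metric MRF edge, with weight w > 0:
# far-apart labels get higher energy, hence lower probability.
def edge_energy(xi, xj, mu, w=1.0):
    return w * mu(xi, xj)

print(edge_energy(2, 7, truncated_linear_metric))
```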
Metric MRF: Segmentation
• Uses the 0/1 metric: µ(vk,vl) = 0 if vk = vl, 1 otherwise
Metric MRF: Denoising
[Figure: original image, versions with Gaussian noise (stdev 20 and stdev 50), and their denoised results]
• Truncated linear metric: µ(vk,vl) = min(|vk − vl|, d)
• Linear metric: µ(vk,vl) = |vk − vl|
• Similar idea for stereo reconstruction
Probabilistic Graphical Models
Representation: Template Models
Shared Features in Log-Linear Models
Ising Models
• In most MRFs, the same feature and weight are used over many scopes
• Ising model: same weight for every adjacent pair
Natural Language Processing
• In most MRFs, the same feature and weight are used over many scopes
• Same energy terms wk fk(Xi, Yi) repeat for all positions i in the sequence
• Same energy terms wm fm(Yi, Yi+1) also repeat for all positions i
Image Segmentation
• In most MRFs, the same feature and weight are used over many scopes
• Same features and weights for all superpixels in the image
Repeated Features
• Need to specify for each feature fk a set of scopes, Scopes[fk]
• For each Dk ∈ Scopes[fk] we have a term wk fk(Dk) in the energy function (see the sketch below)
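A hedged sketch of how shared features with multiple scopes might be represented: each feature has one weight and a list of scopes, and the energy sums wk fk(Dk) over every scope. Names and values are illustrative placeholders.

```python
# One weight per feature, reused across all of its scopes.
def agree(a, b):
    """Shared pairwise feature: 1 if the two variables agree."""
    return 1.0 if a == b else 0.0

features = [
    # (weight, feature function, Scopes[f_k] = list of variable tuples)
    (-2.0, agree, [("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "Y4")]),
]

def energy(assignment):
    """Energy = sum over features and their scopes of w_k * f_k(D_k)."""
    total = 0.0
    for w, f, scopes in features:
        for scope in scopes:
            total += w * f(*(assignment[v] for v in scope))
    return total

print(energy({"Y1": 1, "Y2": 1, "Y3": 0, "Y4": 0}))
```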
Summary
• Same feature & weight can be used for multiple subsets of variables
  – Pairs of adjacent pixels/atoms/words
  – Occurrences of the same word in a document
• Can provide a single template for multiple MNs
  – Different images
  – Different sentences
• Parameters and structure are reused within an MN and across different MNs
• Need to specify the set of scopes for each feature