Probabilistic Graphical Models

Representation: Markov Networks

Pairwise Markov Networks
Daphne Koller


[Figure: a 4×4 grid of variables A1,1 through A4,4, connected to their horizontal and vertical neighbors as a pairwise Markov network]


Probabilistic Graphical Models

Representation: Markov Networks

General Gibbs Distribution

P(A,B,C,D)

[Figure: network over the four variables A, B, C, D]

Gibbs Distribution
•  Parameters: general factors φi(Di), Φ = {φi(Di) : i = 1, …, k}

A    B    C    φ(A,B,C)
a1   b1   c1   0.25
a1   b1   c2   0.35
a1   b2   c1   0.08
a1   b2   c2   0.16
a2   b1   c1   0.05
a2   b1   c2   0.07
a2   b2   c1   0
a2   b2   c2   0
a3   b1   c1   0.15
a3   b1   c2   0.21
a3   b2   c1   0.09
a3   b2   c2   0.18

Gibbs Distribution
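For reference, the Gibbs distribution is defined by normalizing the product of the factors:

PΦ(X1, …, Xn) = (1/ZΦ) ∏i φi(Di)
ZΦ = Σ over all assignments to X1, …, Xn of ∏i φi(Di)   (the partition function)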


Induced Markov Network

[Figure: network over A, B, C, D]

Induced Markov network HΦ has an edge Xi ― Xj whenever there is a factor φm ∈ Φ whose scope Dm contains both Xi and Xj
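As an illustration (not from the slides), a minimal sketch of building the induced graph from factor scopes; the function name and the plain set-of-edges representation are my own:

```python
from itertools import combinations

def induced_markov_network(factor_scopes):
    """Induced Markov network H_Phi: add an edge Xi -- Xj whenever
    Xi and Xj appear together in the scope of some factor."""
    edges = set()
    for scope in factor_scopes:
        for xi, xj in combinations(sorted(scope), 2):
            edges.add((xi, xj))
    return edges

# Factors over {A,B}, {B,C}, {C,D}, {A,D} induce the 4-cycle A - B - C - D - A.
print(induced_markov_network([{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}]))
```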

Factorization
P factorizes over H if there exist factors Φ = {φ1(D1), …, φk(Dk)} such that P = PΦ and H is the induced graph for Φ

Flow of Influence

[Figure: example network over A, B, C, D]

•  Influence can flow along any trail, regardless of the form of the factors

Active Trails
•  A trail X1 ─ … ─ Xn is active given Z if no Xi is in Z

[Figure: example network over A, B, C, D]

Summary
•  Gibbs distribution represents a distribution as a normalized product of factors
•  The induced Markov network connects every pair of nodes that are in the same factor
•  The Markov network structure doesn’t fully specify the factorization of P
•  But active trails depend only on the graph structure

Probabilistic Graphical Models

Representation: Markov Networks

Conditional Random Fields

Motivation
•  Observed variables X
•  Target variables Y


CRF Representation
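For reference, the CRF is defined like a Gibbs distribution but normalized per value of X:

PΦ(Y | X) = (1/ZΦ(X)) ∏i φi(Di), where ZΦ(X) = ΣY ∏i φi(Di)

The partition function ZΦ(X) sums only over the target variables Y, so the result is a conditional distribution over Y for each value of X.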


CRFs and Logistic Model
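A sketch of the connection this slide draws (a standard derivation; the factor parameterization chosen here is one common choice): with a binary target Y and binary features X1, …, Xk, take factors φi(Xi, Y) = exp(wi Xi Y). The unnormalized measure is exp(Σi wi Xi) for Y = 1 and 1 for Y = 0, so

P(Y = 1 | X1, …, Xk) = exp(Σi wi Xi) / (1 + exp(Σi wi Xi)) = sigmoid(Σi wi Xi)

which is exactly the logistic regression model.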


CRFs for Language

Features: word capitalized, word in atlas or name list, previous word is “Mrs”, next word is “Times”, …


More CRFs for Language


Summary
•  A CRF is parameterized the same as a Gibbs distribution, but normalized differently
•  We don’t need to model the distribution over variables we don’t care about
•  Allows models with highly expressive features, without worrying about wrong independencies


Probabilistic Graphical Models

Representation: Independencies

Markov Networks

Separation in MNs
Definition: X and Y are separated in H given Z if there is no active trail in H between X and Y given Z

[Figure: example network over A, B, C, D, E, F]
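As an illustration (not from the slides), a minimal sketch of checking separation by graph search: drop the observed nodes Z and test whether X can still reach Y. The adjacency-dict representation and names are my own:

```python
from collections import deque

def separated(adjacency, x, y, z):
    """sep_H(x, y | z): True iff every trail from x to y passes through
    an observed node in z, i.e. no active trail remains."""
    blocked = set(z)
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr in blocked or nbr in seen:
                continue
            if nbr == y:
                return False  # reached y while avoiding z: an active trail exists
            seen.add(nbr)
            queue.append(nbr)
    return True

# 4-cycle A - B - C - D - A: A and C are separated given {B, D} but not given {B}.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(separated(adj, "A", "C", {"B", "D"}))  # True
print(separated(adj, "A", "C", {"B"}))       # False: the trail A - D - C is active
```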

Factorization ⇒ Independence: MNs
Theorem: If P factorizes over H, and sepH(X, Y | Z), then P satisfies (X ⊥ Y | Z)

[Figure: example network over A, B, C, D, E, F]

Factorization ⇒ Independence: MNs
I(H) = {(X ⊥ Y | Z) : sepH(X, Y | Z)}
If P satisfies I(H), we say that H is an I-map (independency map) of P
Theorem: If P factorizes over H, then H is an I-map of P


Independence ⇒ Factorization
•  Theorem (Hammersley-Clifford): For a positive distribution P, if H is an I-map for P, then P factorizes over H


Summary
Two equivalent* views of graph structure:
•  Factorization: H allows P to be represented
•  I-map: independencies encoded by H hold in P
If P factorizes over a graph H, we can read from the graph independencies that must hold in P (an independency map)
* for positive distributions


Probabilistic Graphical Models

Representation: Independencies

I-maps and Perfect Maps

Capturing Independencies in P
•  P factorizes over G ⇒ G is an I-map for P: I(G) ⊆ I(P)
•  But not always vice versa: there can be independencies in I(P) that are not in I(G)

Want a Sparse Graph
•  If the graph encodes more independencies
  –  it is sparser (has fewer parameters)
  –  and more informative

•  Want a graph that captures as much of the structure in P as possible


Minimal I-map
•  Minimal I-map: an I-map without redundant edges
•  A minimal I-map may still not capture I(P)

[Figure: two different minimal I-maps over the variables D, I, G]

Perfect Map
•  Perfect map: I(G) = I(P)
  –  G perfectly captures the independencies in P


Perfect Map

[Figure: a network over A, B, C, D and several candidate graphs over A, B, C, D]

Another Imperfect Map

[Figure: network over X1, X2, Y, where Y = X1 XOR X2]

X1   X2   Y    Prob
0    0    0    0.25
0    1    1    0.25
1    0    1    0.25
1    1    0    0.25

MN as a Perfect Map
•  Perfect map: I(H) = I(P)
  –  H perfectly captures the independencies in P

[Figure: candidate Markov networks over D, I, G]

Uniqueness of Perfect Map


I-equivalence
Definition: Two graphs G1 and G2 over X1, …, Xn are I-equivalent if I(G1) = I(G2)

Most graphs have many I-equivalent variants

Summary

•  Graphs that capture more of I(P) are more compact and provide more insight
•  A minimal I-map may fail to capture a lot of structure even if present
•  A perfect map is great, but may not exist
•  Converting BNs ↔ MNs loses independencies
  –  BN to MN: loses independencies in v-structures
  –  MN to BN: must add triangulating edges to loops


Probabilistic Graphical Models

Representation: Local Structure

Log-Linear Models

Log-Linear Representation

•  Each feature fj has a scope Dj
•  Different features can have the same scope
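For reference, the standard log-linear form, written in the energy (negative-exponent) convention used later in these slides, is:

P(X1, …, Xn) ∝ exp( − Σj wj fj(Dj) )

so each factor is represented through weighted features rather than a full table of values.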

Representing Table Factors

φ(X1, X2):

           X2 = 0    X2 = 1
X1 = 0      a00       a01
X1 = 1      a10       a11
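A sketch of how such a table can be captured in log-linear form (a standard construction; the indicator notation fkl is mine): use one indicator feature per assignment, fkl(X1, X2) = 1 if X1 = k and X2 = l (0 otherwise), with weight wkl = −ln akl (assuming every akl > 0). Then

φ(X1, X2) = exp( − Σk,l wkl fkl(X1, X2) )

reproduces every table entry exactly, since only one indicator is active for any assignment.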


Features for Language

Features: word capitalized, word in atlas or name list, previous word is “Mrs”, next word is “Times”, …


Ising Model
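For reference, a standard form of the Ising energy, for variables xi taking values in {−1, +1} with one pairwise term per edge (i, j), is (sign conventions vary):

E(x1, …, xn) = Σi,j wi,j xi xj + Σi ui xi,   with P(x1, …, xn) ∝ exp(−E(x1, …, xn))

so each pair of neighbors is pushed toward agreement or disagreement depending on the sign of wi,j.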


Metric MRFs
•  All Xi take values in label space V
•  For each edge Xi ─ Xj, we want Xi and Xj to take “similar” values
•  Distance function µ : V × V → R
  –  Reflexivity: µ(v,v) = 0 for all v
  –  Symmetry: µ(v1,v2) = µ(v2,v1) for all v1, v2
  –  Triangle inequality: µ(v1,v2) ≤ µ(v1,v3) + µ(v3,v2) for all v1, v2, v3

•  The further apart the values of Xi and Xj are in µ, the lower the probability
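A sketch of the corresponding pairwise term (the usual metric-MRF parameterization; notation mine): each edge Xi ─ Xj contributes an energy proportional to the distance between its labels,

εij(xi, xj) = wij µ(xi, xj) with wij > 0,   i.e.   φij(xi, xj) = exp(−wij µ(xi, xj))

so the larger the distance µ(xi, xj), the smaller the factor value.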

Metric MRF Examples

•  Zero/one metric: µ(vk,vl) = 0 if vk = vl, 1 otherwise
   As a matrix over four labels:
   0 1 1 1
   1 0 1 1
   1 1 0 1
   1 1 1 0

[Plots of µ(vk,vl) as a function of vk-vl]

Metric MRF: Segmentation

µ(vk,vl) = 0 if vk = vl, 1 otherwise

0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0

Metric MRF: Denoising

[Figure: original image, versions with Gaussian noise (stdev 20 and stdev 50), and the corresponding denoised results]

•  Truncated linear metric: µ(vk,vl) = min(|vk-vl|, d)
•  Linear metric: µ(vk,vl) = |vk-vl|
•  A similar idea is used for stereo reconstruction

[Plots of µ(vk,vl) as a function of vk-vl]

Probabilistic Graphical Models

Representation: Template Models

Shared Features in Log-Linear Models

Ising Models
•  In most MRFs, the same feature and weight are used over many scopes
•  Ising model: the same weight is used for every adjacent pair of variables

Natural Language Processing
•  In most MRFs, the same feature and weight are used over many scopes

[Figure: sequence CRF with word variables Xi and label variables Yi]

•  The same energy terms wk fk(Xi, Yi) repeat for all positions i in the sequence
•  The same energy terms wm fm(Yi, Yi+1) also repeat for all positions i

Image Segmentation
•  In most MRFs, the same feature and weight are used over many scopes
•  The same features and weights are used for all superpixels in the image

Repeated Features
•  Need to specify for each feature fk a set of scopes, Scopes[fk]
•  For each Dk ∈ Scopes[fk] we have a term wk fk(Dk) in the energy function
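As an illustration (not from the slides), a minimal sketch of evaluating such an energy with shared parameters: one weight per feature, applied to every scope in Scopes[fk]. The data layout and names are my own:

```python
def energy(assignment, features):
    """Energy with shared parameters: sum over features f_k of
    w_k * f_k(D) for every scope D in Scopes[f_k]."""
    total = 0.0
    for weight, feature_fn, scopes in features:
        for scope in scopes:
            total += weight * feature_fn(*(assignment[v] for v in scope))
    return total

# Ising-style example on a 3-variable chain: one agreement feature with a
# single shared weight, applied to every adjacent pair of variables.
agree = lambda a, b: 1.0 if a == b else 0.0
features = [(-1.5, agree, [("X1", "X2"), ("X2", "X3")])]
print(energy({"X1": 1, "X2": 1, "X3": 0}, features))  # -1.5
# The corresponding unnormalized probability is exp(-energy).
```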


Summary

•  The same feature and weight can be used for multiple subsets of variables
  –  Pairs of adjacent pixels/atoms/words
  –  Occurrences of the same word in a document
•  Can provide a single template for multiple MNs
  –  Different images
  –  Different sentences
•  Parameters and structure are reused within an MN and across different MNs
•  Need to specify the set of scopes for each feature