
Statistical Relational Learning

Mark Craven
Computer Sciences 760, Fall 2015
www.biostat.wisc.edu/~craven/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Tom Dietterich, Pedro Domingos, Tom Mitchell, David Page, and Jude Shavlik

Goals for the lecture

you should understand the following concepts
•  Markov networks
•  Markov logic networks (MLNs)
•  the parameter learning task in MLNs
•  the structure learning task in MLNs

Statistical relational learning (SRL)

•  in the last lecture, we saw the representational advantages of relational learning methods
   •  ability to represent relationships among objects
   •  ability to incorporate background knowledge
•  but there are disadvantages of these methods
   •  not well suited for representing uncertainty
   •  not very robust given noisy data
•  an active area in machine learning is developing statistical relational methods that aim to preserve the advantages while alleviating the disadvantages
   •  the real world is complex and uncertain
   •  logic handles the complexity
   •  probability handles the uncertainty

Many SRL approaches have been developed

•  knowledge-based model construction [Wellman et al., 1992]
•  stochastic logic programs [Muggleton, 1996]
•  probabilistic relational models [Friedman et al., 1999]
•  relational Markov networks [Taskar et al., 2002]
•  constraint logic programming for probabilistic knowledge [Santos Costa et al., UAI 2003]
•  Bayesian logic [Milch et al., 2005]
•  Markov logic [Richardson & Domingos, Machine Learning 2006]
•  etc.

Markov networks

•  a Markov network is an undirected graphical model
   (figure: a network over the variables Smoking, Cancer, Asthma, Cough)

•  potential functions are defined over cliques:

   P(x) = (1/Z) ∏_{c ∈ cliques} Φ_c(x_c)        Z = ∑_x ∏_c Φ_c(x_c)

•  example potential Φ(S, C) over the (Smoking, Cancer) clique:

   Smoking   Cancer   Φ(S, C)
   false     false    4.5
   false     true     4.5
   true      false    2.7
   true      true     4.5
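As a concrete check of these definitions, here is a minimal Python sketch (not from the lecture) that computes the joint distribution of a single (Smoking, Cancer) clique from the potential table above:

```python
from itertools import product

# potential over the (Smoking, Cancer) clique, values from the table above
phi = {(False, False): 4.5, (False, True): 4.5,
       (True, False): 2.7, (True, True): 4.5}

# partition function Z: sum of the clique potential over all assignments
Z = sum(phi[x] for x in product([False, True], repeat=2))

def p(smoking, cancer):
    """P(x) = (1/Z) * Phi(Smoking, Cancer) for this one-clique network."""
    return phi[(smoking, cancer)] / Z
```

With these values, Z = 4.5 + 4.5 + 2.7 + 4.5 = 16.2, and the Smoking ∧ ¬Cancer state gets the lowest probability, 2.7/16.2.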

Markov networks (continued)

(figure: the same network over Smoking, Cancer, Asthma, Cough)

•  potentials can be represented using log-linear models over features:

   P(x) = (1/Z) exp( ∑_i w_i f_i(x) )

   where w_i is the weight of feature i and f_i is feature i, e.g.

   f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise
   w_1 = 1.5
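The log-linear form can be sketched the same way; this hypothetical snippet uses the single feature f_1 and weight w_1 = 1.5 from the slide (the induced potentials, exp(1.5) ≈ 4.48 for states satisfying the feature versus exp(0) = 1 otherwise, need not match the earlier illustrative table):

```python
import math

w1 = 1.5  # weight of feature 1, from the slide

def f1(smoking, cancer):
    # feature 1: true iff (not Smoking) or Cancer
    return 1 if (not smoking) or cancer else 0

states = [(s, c) for s in (False, True) for c in (False, True)]

# partition function for P(x) = (1/Z) exp(sum_i w_i f_i(x))
Z = sum(math.exp(w1 * f1(s, c)) for s, c in states)

def p(smoking, cancer):
    """probability under the log-linear model with a single feature"""
    return math.exp(w1 * f1(smoking, cancer)) / Z
```

Only the state violating ¬Smoking ∨ Cancer (i.e., Smoking ∧ ¬Cancer) is down-weighted; the other three states are equally probable.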

Markov nets vs. Bayes nets

Property             Markov nets              Bayes nets
Graph                undirected               directed
Distribution form    product of potentials    product of potentials
Potentials           arbitrary                conditional probabilities
Cycles               allowed                  forbidden
Partition function   Z = ?                    Z = 1

First-order logic

•  Constants, variables, functions, predicates
   e.g. Anna, x, MotherOf(x), Friends(x, y)
•  Formulas: constructed from constants, variables, functions, and predicates
   e.g. Friends(x, MotherOf(Anna)),  Friends(x, y) ∧ Friends(y, z) ⇒ Friends(x, z)
•  Grounding: replace all variables by constants
   e.g. Friends(Anna, Bob)
•  World (model, interpretation): an assignment of truth values to all ground predicates
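The grounding step can be sketched in a few lines of Python; the constants Anna and Bob and the helper below are hypothetical illustrations, not part of the lecture:

```python
from itertools import product

constants = ["Anna", "Bob"]

def groundings(predicate, arity):
    """all groundings of a predicate: one ground atom per tuple of constants"""
    return [f"{predicate}({','.join(args)})"
            for args in product(constants, repeat=arity)]

friends_atoms = groundings("Friends", 2)  # 4 ground atoms
smokes_atoms = groundings("Smokes", 1)    # 2 ground atoms

# a world (interpretation) assigns a truth value to every ground atom
all_atoms = friends_atoms + smokes_atoms
num_worlds = 2 ** len(all_atoms)
```

With 4 + 2 = 6 ground atoms there are 2^6 = 64 possible worlds.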

Markov logic: intuition

•  a logical knowledge base is a set of hard constraints on the set of possible worlds
•  let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
•  give each formula a weight (higher weight → stronger constraint)

   P(world) ∝ exp( ∑ weights of formulas it satisfies )

MLN definition

•  a Markov Logic Network (MLN) is a set of pairs (F, w) where
   –  F is a formula in first-order logic
   –  w is a real number
•  together with a set of constants, it defines a Markov network with
   –  one node for each grounding of each predicate in the MLN
   –  one feature for each grounding of each formula F in the MLN, with the corresponding weight w

MLN example: friends & smokers

Smoking causes cancer.
Friends have similar smoking habits.

As first-order formulas:

   ∀x Smokes(x) ⇒ Cancer(x)
   ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

With weights attached:

   1.5   ∀x Smokes(x) ⇒ Cancer(x)
   1.1   ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Given two constants, Anna (A) and Bob (B), the ground Markov network has one node per ground predicate

   Friends(A,A)   Friends(A,B)   Friends(B,A)   Friends(B,B)
   Smokes(A)   Smokes(B)   Cancer(A)   Cancer(B)

with cliques determined by the groundings of the two formulas.

Markov logic networks

•  an MLN is a template for ground Markov nets
   –  the logic determines the form of the cliques
   –  but if we had one more constant (say, Larry), we'd get a different Markov net
•  we can determine the probability of a world v (an assignment of truth values to ground predicates) by

   P(v) = (1/Z) exp( ∑_i w_i n_i(v) )

   where w_i is the weight of formula i and n_i(v) is the # of true groundings of formula i in v

Probability of a world in an MLN

v = [ Friends(A,A) = T, Friends(A,B) = T, Friends(B,A) = T, Friends(B,B) = T,
      Smokes(A) = F, Smokes(B) = T, Cancer(A) = F, Cancer(B) = F ]

∀x Smokes(x) ⇒ Cancer(x)
   x = A: T
   x = B: F
   n_1(v) = 1    (# of true groundings of formula 1 in v)

∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
   x = A, y = A: T
   x = A, y = B: F
   x = B, y = A: F
   x = B, y = B: T
   n_2(v) = 2    (# of true groundings of formula 2 in v)

Probability of a world in an MLN (continued)

For the same world v:

   P(v) = (1/Z) exp( ∑_i w_i n_i(v) ) = (1/Z) exp( 1.5·(1) + 1.1·(2) )
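The counts n_1(v) and n_2(v) above can be verified mechanically; here is a Python sketch (atom-name encoding is a hypothetical illustration) for the two-constant example:

```python
import math
from itertools import product

consts = ["A", "B"]

# the world from the slide: everyone is friends, only B smokes, no one has cancer
v = {**{f"Friends({x},{y})": True for x, y in product(consts, repeat=2)},
     "Smokes(A)": False, "Smokes(B)": True,
     "Cancer(A)": False, "Cancer(B)": False}

def implies(p, q):
    return (not p) or q

# n1: # of true groundings of  Smokes(x) => Cancer(x)
n1 = sum(implies(v[f"Smokes({x})"], v[f"Cancer({x})"]) for x in consts)

# n2: # of true groundings of  Friends(x,y) => (Smokes(x) <=> Smokes(y))
n2 = sum(implies(v[f"Friends({x},{y})"],
                 v[f"Smokes({x})"] == v[f"Smokes({y})"])
         for x, y in product(consts, repeat=2))

# unnormalized probability exp(w1*n1 + w2*n2); dividing by Z would normalize it
unnormalized = math.exp(1.5 * n1 + 1.1 * n2)
```

This reproduces n_1(v) = 1 and n_2(v) = 2, so the unnormalized score is exp(1.5 + 2.2).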

Three MLN tasks

•  inference: can use the toolbox of inference methods developed for ordinary Markov networks
   –  Monte Carlo methods
   –  belief propagation
   –  variational approximations in tandem with a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
•  parameter learning
•  structure learning: can use ordinary relational learning methods to learn new formulas

MLN learning tasks

•  the input to the learning process is a relational database of ground predicates:

   Friends(x, y)    Smokes(x)    Cancer(x)
   Anna, Anna       Bob          Bob
   Anna, Bob
   Bob, Anna
   Bob, Bob

•  the closed world assumption is used to infer the truth values of atoms not present in the DB
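A sketch of that convention in Python (the DB encoding below is a hypothetical illustration): any ground atom not listed in the database is taken to be false.

```python
db = {  # the relational database from the slide
    "Friends": {("Anna", "Anna"), ("Anna", "Bob"),
                ("Bob", "Anna"), ("Bob", "Bob")},
    "Smokes": {("Bob",)},
    "Cancer": {("Bob",)},
}

def truth(predicate, *args):
    """closed world assumption: a ground atom absent from the DB is false"""
    return args in db[predicate]
```

So Smokes(Bob) is true because it appears in the DB, while Smokes(Anna) is false by the closed world assumption, not because it was asserted false.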

Parameter learning

•  parameters (weights on formulas) can be learned using gradient ascent to maximize the log likelihood of the training data:

   f(w) = log P_w(V = v) = ∑_i w_i n_i(v) − log Z

   ∂/∂w_i log P_w(V = v) = n_i(v) − ∑_{v'} P_w(V = v') n_i(v')

   where n_i(v) is the # of times formula i is true in the data, and the second term is the expected # of times formula i is true according to the MLN

Parameter learning (continued)

   ∂/∂w_i log P_w(V = v) = n_i(v) − ∑_{v'} P_w(V = v') n_i(v')

•  there are two challenges to using this update rule
   –  counting the number of true groundings may be intractable (counting the number of true groundings of a first-order formula in a database is #P-complete in the length of the formula)
   –  computing the expected number of true groundings may also be intractable (to get P_w we need to compute the partition function Z)
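For a domain as small as the friends & smokers example, the exact gradient can be computed by brute-force enumeration of all 2^8 worlds; this hypothetical Python sketch makes both terms of the gradient concrete (and shows why the expectation term is intractable at scale):

```python
import math
from itertools import product

consts = ["A", "B"]
atoms = ([f"Friends({x},{y})" for x, y in product(consts, repeat=2)]
         + [f"Smokes({x})" for x in consts]
         + [f"Cancer({x})" for x in consts])

def implies(p, q):
    return (not p) or q

def counts(world):
    """n_i(v): number of true groundings of each formula in world v"""
    n1 = sum(implies(world[f"Smokes({x})"], world[f"Cancer({x})"])
             for x in consts)
    n2 = sum(implies(world[f"Friends({x},{y})"],
                     world[f"Smokes({x})"] == world[f"Smokes({y})"])
             for x, y in product(consts, repeat=2))
    return [n1, n2]

def gradient(w, v):
    """d/dw_i log P_w(V=v) = n_i(v) - E_w[n_i], by enumerating all worlds"""
    worlds = [dict(zip(atoms, vals))
              for vals in product([False, True], repeat=len(atoms))]
    ns = [counts(u) for u in worlds]
    scores = [math.exp(sum(wi * ni for wi, ni in zip(w, n))) for n in ns]
    Z = sum(scores)
    expected = [sum(s * n[i] for s, n in zip(scores, ns)) / Z
                for i in range(len(w))]
    return [ni - ei for ni, ei in zip(counts(v), expected)]

# the world from the earlier worked example
v = {**{f"Friends({x},{y})": True for x, y in product(consts, repeat=2)},
     "Smokes(A)": False, "Smokes(B)": True,
     "Cancer(A)": False, "Cancer(B)": False}

# at w = 0 the expected counts are taken under the uniform distribution
g = gradient([0.0, 0.0], v)
```

At w = 0, E[n_1] = 2 · 3/4 = 1.5 and E[n_2] = 2 · 1 + 2 · 3/4 = 3.5 (the x = y groundings of formula 2 are always satisfied), so the gradient is (1 − 1.5, 2 − 3.5) = (−0.5, −1.5).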

Parameter learning (continued)

•  a more efficient alternative is to optimize the pseudo-likelihood:

   P*_w(V = v) = ∏_{l=1}^{n} P_w(V_l = v_l | MB_v(V_l))

   where MB_v(V_l) is the setting of the values in the Markov blanket of variable V_l

Parameter learning (continued)

•  the gradient ascent update rule then becomes

   ∂/∂w_i log P*_w(V = v) = ∑_{l=1}^{n} [ n_i(v) − P_w(V_l = 0 | MB_v(V_l)) n_i(v_[V_l=0]) − P_w(V_l = 1 | MB_v(V_l)) n_i(v_[V_l=1]) ]

   where P_w(V_l = 1 | MB_v(V_l)) is the probability (according to the Markov net) that V_l = 1 when its Markov blanket is set to the values observed in v, and n_i(v_[V_l=1]) is the # of true groundings of formula i when V_l is constrained to be 1
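Because each factor conditions only on the Markov blanket, the conditional is a ratio of two unnormalized scores, so the partition function Z cancels; here is a hypothetical Python sketch using a one-formula toy model (the formula A ⇒ B with weight 2.0 is an invented example, not from the lecture):

```python
import math

def score(w, counts, v):
    """unnormalized score exp(sum_i w_i n_i(v))"""
    return math.exp(sum(wi * ni for wi, ni in zip(w, counts(v))))

def conditional(w, counts, v, atom):
    """P_w(V_l = 1 | MB_v(V_l)): ratio over the two settings of one atom,
    with all other atoms held at their values in v (no Z needed)"""
    s0 = score(w, counts, dict(v, **{atom: False}))
    s1 = score(w, counts, dict(v, **{atom: True}))
    return s1 / (s0 + s1)

def pseudo_log_likelihood(w, counts, v):
    """log P*_w(V = v) = sum_l log P_w(V_l = v_l | MB_v(V_l))"""
    total = 0.0
    for atom, val in v.items():
        p1 = conditional(w, counts, v, atom)
        total += math.log(p1 if val else 1.0 - p1)
    return total

# toy model: a single formula "A => B" with weight 2.0 (hypothetical)
counts = lambda v: [1 if (not v["A"]) or v["B"] else 0]
pll = pseudo_log_likelihood([2.0], counts, {"A": True, "B": True})
```

Note that flipping A leaves the formula count unchanged here (both settings satisfy A ⇒ B when B is true), so its conditional is exactly 0.5, while B's conditional is exp(2)/(1 + exp(2)).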

MLN experiment

•  testbed: a DB describing the Univ. of Washington CS department
•  12 predicates
   Professor(person), Student(person), Area(x, area), AuthorOf(publication, person), AdvisedBy(person, person), etc.
•  2707 constants
   publication (342), person (442), course (176), project (153), etc.

MLN experiment (continued)

•  obtained a knowledge base by having four subjects provide a set of formulas in first-order logic describing the domain
•  the formulas in the KB represent statements such as
   •  students are not professors
   •  each student has at most one advisor
   •  if a student is an author of a paper, so is her advisor
   •  at most one author of a given publication is a professor
   •  etc.
•  note that the KB is not consistent

Learning to predict the AdvisedBy(x, y) relation

Methods compared (results figure):
•  MLN w/ original KB
•  MLN w/ KB + ILP-learned rules
•  KB alone
•  KB + ILP-learned rules
•  ILP-learned rules
•  naïve Bayes
•  Bayes net learner

Comments on statistical relational learning

•  a wide range of approaches
   •  Markov logic networks (MLNs) are one successful approach
•  many algorithmic refinements
•  state-of-the-art applications in natural language, etc.
•  open source software available: http://alchemy.cs.washington.edu/
•  logic handles domain complexity, probability handles uncertainty
•  inference is challenging
   •  exact methods not feasible for most problems
   •  even approximate inference may be very computationally expensive