Statistical Relational Learning
Mark Craven
Computer Sciences 760, Fall 2015
www.biostat.wisc.edu/~craven/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Tom Dietterich, Pedro Domingos, Tom Mitchell, David Page, and Jude Shavlik
Goals for the lecture
you should understand the following concepts
• Markov networks
• Markov logic networks (MLNs)
• the parameter learning task in MLNs
• the structure learning task in MLNs
Statistical relational learning (SRL)
• in the last lecture, we saw the representational advantages of relational learning methods
  • ability to represent relationships among objects
  • ability to incorporate background knowledge
• but there are disadvantages of these methods
  • not well suited for representing uncertainty
  • not very robust given noisy data
• an active area in machine learning is developing statistical relational methods that aim to preserve the advantages while alleviating the disadvantages
  • the real world is complex and uncertain
  • logic handles the complexity
  • probability handles the uncertainty
Many SRL approaches have been developed
• knowledge-based model construction [Wellman et al., 1992]
• stochastic logic programs [Muggleton, 1996]
• probabilistic relational models [Friedman et al., 1999]
• relational Markov networks [Taskar et al., 2002]
• constraint logic programming for probabilistic knowledge [Santos Costa et al., UAI 2003]
• Bayesian logic [Milch et al., 2005]
• Markov logic [Richardson & Domingos, Machine Learning 2006]
• etc.
Markov networks
• a Markov network is an undirected graphical model
[Figure: undirected graph over the variables Smoking, Cancer, Asthma, and Cough]
• potential functions are defined over cliques
P(x) = \frac{1}{Z} \prod_{c \in \mathrm{Cliques}} \Phi_c(x_c)

Z = \sum_x \prod_c \Phi_c(x_c)
example potential Φ(S, C):

  Smoking | Cancer | Φ(S, C)
  false   | false  | 4.5
  false   | true   | 4.5
  true    | false  | 2.7
  true    | true   | 4.5
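To make the potential-based definition concrete, here is a minimal Python sketch (not from the lecture) that computes Z and P(x) by brute-force enumeration for a toy network whose only non-trivial clique is the (Smoking, Cancer) pair with the potential table above; enumerating assignments like this is only feasible for very small networks.

```python
from itertools import product

# Pairwise potential over the (Smoking, Cancer) clique, taken from the table above.
phi_sc = {
    (False, False): 4.5,
    (False, True):  4.5,
    (True,  False): 2.7,
    (True,  True):  4.5,
}

def unnormalized(s, c):
    """Product of clique potentials for one assignment (only one non-trivial clique here)."""
    return phi_sc[(s, c)]

# Partition function: sum the product of potentials over all assignments.
Z = sum(unnormalized(s, c) for s, c in product([False, True], repeat=2))

def prob(s, c):
    """P(Smoking=s, Cancer=c) = (1/Z) * product of potentials."""
    return unnormalized(s, c) / Z

print(Z)                   # 16.2
print(prob(True, False))   # 2.7 / 16.2 ≈ 0.167
```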
Markov networks
[Figure: the same undirected graph over Smoking, Cancer, Asthma, and Cough]
• potentials can be represented using log-linear models over features
P(x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)

where w_i is the weight of feature i and f_i is feature i, e.g.

f_1(\mathrm{Smoking}, \mathrm{Cancer}) = \begin{cases} 1 & \text{if } \neg \mathrm{Smoking} \vee \mathrm{Cancer} \\ 0 & \text{otherwise} \end{cases}

w_1 = 1.5
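As a rough sketch of the log-linear view (again not from the lecture), the distribution can be defined directly from the single feature f_1 and weight w_1 = 1.5 above; the one assignment that violates the feature gets the lowest probability.

```python
import math
from itertools import product

w1 = 1.5  # weight of feature f1, from the slide

def f1(smoking, cancer):
    """Feature: 1 if (not Smoking) or Cancer, else 0."""
    return 1 if (not smoking) or cancer else 0

def score(smoking, cancer):
    """Weighted sum of features, sum_i w_i * f_i(x) (only one feature here)."""
    return w1 * f1(smoking, cancer)

Z = sum(math.exp(score(s, c)) for s, c in product([False, True], repeat=2))

def prob(s, c):
    return math.exp(score(s, c)) / Z

print(prob(True, False))  # the only assignment with f1 = 0, so it gets the lowest probability
```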
Markov nets vs. Bayes nets

  Property           | Markov nets           | Bayes nets
  Graph              | undirected            | directed
  Distribution form  | product of potentials | product of potentials
  Potentials         | arbitrary             | conditional probabilities
  Cycles             | allowed               | forbidden
  Partition function | Z = ?                 | Z = 1
First-order logic
• Constants, variables, functions, predicates: e.g. Anna, x, MotherOf(x), Friends(x, y)
• Formulas: constructed from constants, variables, functions, and predicates using logical connectives and quantifiers, e.g. Friends(x, MotherOf(Anna)), Friends(x, y) ⇒ Friends(x, z)
• Grounding: replace all variables by constants, e.g. Friends(Anna, Bob)
• World (model, interpretation): assignment of truth values to all ground predicates
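A tiny illustrative sketch of grounding (the constants Anna and Bob are just an assumption for the example): substituting constants for the variables of Friends(x, y) in every possible way yields its ground atoms.

```python
from itertools import product

constants = ['Anna', 'Bob']

# Grounding: substitute constants for the variables of Friends(x, y) in every way.
groundings = [f'Friends({x}, {y})' for x, y in product(constants, repeat=2)]
print(groundings)
# ['Friends(Anna, Anna)', 'Friends(Anna, Bob)', 'Friends(Bob, Anna)', 'Friends(Bob, Bob)']
```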
Markov logic: intuition
• a logical knowledge base is a set of hard constraints on the set of possible worlds
• let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
• give each formula a weight (higher weight → stronger constraint)
P(\mathrm{world}) \propto \exp\left( \sum \text{weights of formulas it satisfies} \right)
MLN definition
• a Markov logic network (MLN) is a set of pairs (F, w) where
  – F is a formula in first-order logic
  – w is a real number
• together with a set of constants, it defines a Markov network with
  – one node for each grounding of each predicate in the MLN
  – one feature for each grounding of each formula F in the MLN, with the corresponding weight w
MLN example: friends & smokers
Smoking causes cancer.
Friends have similar smoking habits.
In first-order logic, with attached weights:
1.5   ∀x Smokes(x) ⇒ Cancer(x)
1.1   ∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)
[Figure: the ground Markov network obtained by grounding the two formulas with constants A and B; its nodes are Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), and Cancer(B), with an edge between any two ground atoms that appear together in some ground formula]
Markov logic networks
• an MLN is a template for ground Markov nets
  – the logic determines the form of the cliques
  – but if we had one more constant (say, Larry), we’d get a different Markov net
• we can determine the probability of a world v (assignment of truth values to ground predicates) by
P(v) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(v) \right)

where w_i is the weight of formula i and n_i(v) is the number of true groundings of formula i in v
Probability of a world in an MLN

the world v:
  Friends(A,A) = T   Friends(A,B) = T   Friends(B,A) = T   Friends(B,B) = T
  Smokes(A) = F      Smokes(B) = T
  Cancer(A) = F      Cancer(B) = F

∀x Smokes(x) ⇒ Cancer(x)
  x = A: T
  x = B: F
n_1(v) = 1   (# of true groundings of formula 1 in v)

∀x, y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))
  x = A, y = A: T
  x = A, y = B: F
  x = B, y = A: F
  x = B, y = B: T
n_2(v) = 2   (# of true groundings of formula 2 in v)
Plugging the weights and counts for this world into the formula:

P(v) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(v) \right) = \frac{1}{Z} \exp\big( 1.5\,(1) + 1.1\,(2) \big)
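The counting above can be reproduced with a short sketch that hard-codes this world v and the two weighted formulas; the dictionaries and the `implies` helper are illustrative, not part of any MLN package.

```python
import math
from itertools import product

# The world v from the slide (a truth assignment to all ground atoms).
friends = {('A', 'A'): True, ('A', 'B'): True, ('B', 'A'): True, ('B', 'B'): True}
smokes  = {'A': False, 'B': True}
cancer  = {'A': False, 'B': False}
constants = ['A', 'B']

def implies(p, q):
    return (not p) or q

# Formula 1: Smokes(x) => Cancer(x); count its true groundings.
n1 = sum(implies(smokes[x], cancer[x]) for x in constants)

# Formula 2: Friends(x, y) => (Smokes(x) <=> Smokes(y)); count its true groundings.
n2 = sum(implies(friends[(x, y)], smokes[x] == smokes[y])
         for x, y in product(constants, repeat=2))

print(n1, n2)  # 1, 2

# Unnormalized probability of v: exp(w1*n1 + w2*n2); dividing by Z gives P(v).
w1, w2 = 1.5, 1.1
print(math.exp(w1 * n1 + w2 * n2))
```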
Three MLN tasks
• inference: can use the toolbox of inference methods developed for ordinary Markov networks
  – Monte Carlo methods
  – belief propagation
  – variational approximations
  – in tandem with a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
• parameter learning
• structure learning: can use ordinary relational learning methods to learn new formulas
MLN learning tasks
• the input to the learning process is a relational database of ground predicates

  Friends(x, y) | Smokes(x) | Cancer(x)
  Anna, Anna    | Bob       | Bob
  Anna, Bob     |           |
  Bob, Anna     |           |
  Bob, Bob      |           |

• the closed world assumption is used to infer the truth values of atoms not present in the DB
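A minimal sketch (not from the lecture) of applying the closed world assumption to the toy database above: a ground atom is taken to be true if and only if it appears in the DB.

```python
from itertools import product

constants = ['Anna', 'Bob']

# Ground atoms listed in the training database (everything else is assumed false).
db = {
    'Friends': {('Anna', 'Anna'), ('Anna', 'Bob'), ('Bob', 'Anna'), ('Bob', 'Bob')},
    'Smokes':  {('Bob',)},
    'Cancer':  {('Bob',)},
}
arity = {'Friends': 2, 'Smokes': 1, 'Cancer': 1}

# Closed world assumption: a ground atom is true iff it appears in the DB.
world = {
    (pred, args): (args in db[pred])
    for pred, k in arity.items()
    for args in product(constants, repeat=k)
}

print(world[('Smokes', ('Anna',))])  # False (not in the DB)
print(world[('Cancer', ('Bob',))])   # True
```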
Parameter learning
• parameters (weights on formulas) can be learned using gradient ascent to maximize the log likelihood of the training data

f(w) = \log P_w(V = v) = \sum_i w_i n_i(v) - \log Z

\frac{\partial}{\partial w_i} \log P_w(V = v) = n_i(v) - \sum_{v'} P_w(V = v')\, n_i(v')

where n_i(v) is the # of times formula i is true in the data and \sum_{v'} P_w(V = v')\, n_i(v') is the expected # of times formula i is true according to the MLN
Parameter learning

\frac{\partial}{\partial w_i} \log P_w(V = v) = n_i(v) - \sum_{v'} P_w(V = v')\, n_i(v')

• there are two challenges to using this update rule
  – counting the number of true groundings may be intractable (counting the number of true groundings of a first-order formula in a database is #P-complete in the length of the formula)
  – computing the expected number of true groundings may also be intractable (to get P_w we need to compute the partition function Z)
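For intuition, both terms of the gradient can be computed exactly on a toy domain by enumerating all worlds, which is precisely what becomes intractable at realistic scale; the helper names (`n_funcs`, `all_worlds`) are assumptions for this sketch, not part of any MLN implementation.

```python
import math

def log_likelihood_gradient(weights, n_funcs, observed_world, all_worlds):
    """d/dw_i log P_w(v) = n_i(v) - E_w[n_i], computed by brute-force enumeration.

    n_funcs[i](world) returns the number of true groundings of formula i in `world`;
    all_worlds is every possible truth assignment (tiny domains only).
    """
    all_worlds = list(all_worlds)
    # Unnormalized score of each world: exp(sum_i w_i * n_i(world)).
    scores = [math.exp(sum(w * n(world) for w, n in zip(weights, n_funcs)))
              for world in all_worlds]
    Z = sum(scores)
    gradient = []
    for n in n_funcs:
        expected = sum((s / Z) * n(world) for s, world in zip(scores, all_worlds))
        gradient.append(n(observed_world) - expected)
    return gradient
```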
Parameter learning
• a more efficient alternative is to optimize the pseudo-likelihood

P_w^{*}(V = v) = \prod_{l=1}^{n} P_w\left(V_l = v_l \mid MB_v(V_l)\right)

where MB_v(V_l) is the setting of the values in the Markov blanket of variable V_l
Parameter learning
• the gradient ascent update rule then becomes

\frac{\partial}{\partial w_i} \log P_w^{*}(V = v) = \sum_{l=1}^{n} \left[ n_i(v) - P_w\left(V_l = 0 \mid MB_v(V_l)\right) n_i(v_{[V_l=0]}) - P_w\left(V_l = 1 \mid MB_v(V_l)\right) n_i(v_{[V_l=1]}) \right]

where P_w(V_l = 1 \mid MB_v(V_l)) is the probability (according to the Markov net) that V_l = 1 when its Markov blanket is set to the values observed in v, and n_i(v_{[V_l=1]}) is the # of true groundings of the ith formula when V_l is constrained to be 1
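A sketch of the pseudo-log-likelihood objective under the same assumed helpers: conditioning a ground atom on its Markov blanket reduces to comparing the world as observed with the world in which just that atom is flipped, since formulas that do not mention the atom cancel in the ratio.

```python
import math

def pseudo_log_likelihood(weights, n_funcs, world):
    """log P_w*(V = v) = sum over ground atoms l of log P_w(V_l = v_l | MB_v(V_l)).

    `world` maps each ground atom to True/False; n_funcs[i](world) counts the true
    groundings of formula i.  Formulas not mentioning the flipped atom cancel,
    so the full score can be used in the conditional below.
    """
    def score(w):
        return sum(wi * n(w) for wi, n in zip(weights, n_funcs))

    pll = 0.0
    for atom, value in world.items():
        flipped = dict(world)
        flipped[atom] = not value          # flip one atom, keep its Markov blanket fixed
        s_obs, s_flip = score(world), score(flipped)
        # P(V_l = observed value | MB) = exp(s_obs) / (exp(s_obs) + exp(s_flip))
        pll += s_obs - math.log(math.exp(s_obs) + math.exp(s_flip))
    return pll
```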
MLN experiment
• testbed: a DB describing the Univ. of Washington CS department
• 12 predicates
  Professor(person)
  Student(person)
  Area(x, area)
  AuthorOf(publication, person)
  AdvisedBy(person, person)
  etc.
• 2707 constants
  publication (342)
  person (442)
  course (176)
  project (153)
  etc.
MLN experiment
• obtained knowledge base by having four subjects provide a set of formulas in first-order logic describing the domain
• the formulas in the KB represent statements such as
  • students are not professors
  • each student has at most one advisor
  • if a student is an author of a paper, so is her advisor
  • at most one author of a given publication is a professor
  • etc.
• note that the KB is not consistent
Learning to predict the AdvisedBy(x, y) relation
[Figure: results comparing MLN w/ original KB, MLN w/ KB + ILP-learned rules, KB alone, KB + ILP-learned rules, ILP-learned rules, naïve Bayes, and a Bayes net learner]
Comments on statistical relational learning
• a wide range of approaches
  • Markov logic networks (MLNs) are one successful approach
  • many algorithmic refinements
  • state-of-the-art applications in natural language, etc.
  • open source software available: http://alchemy.cs.washington.edu/
• logic handles domain complexity, probability handles uncertainty
• inference is challenging
  • exact methods not feasible for most problems
  • even approximate inference may be very computationally expensive