Model Combination for Event Extraction in BioNLP 2011

Sebastian Riedel,(a) David McClosky,(b) Mihai Surdeanu,(b) Andrew McCallum,(a) and Christopher D. Manning(b)
(a) University of Massachusetts at Amherst; (b) Stanford University

BioNLP 2011, June 24th, 2011
Previous work / Motivation

- BioNLP 2009: model combination led to 4% F1 improvement over best individual system (Kim et al., 2009)
- Netflix challenge: winning entry relies on model combination (Bennett et al., 2007)
- CoNLL 2007: winning entry relies on model combination (Hall et al., 2007)
- CoNLL 2003: winning entry relies on model combination (Florian et al., 2003)
- etc.
- Most of these use stacking, and so do we
- The stacked model's output serves as features in the stacking model
Stacking Model

Maximize, under global constraints:

    s(e, a, b) = sum_i s_i(e_i) + sum_{i,j} s_{i,j}(a_{i,j}) + sum_{p,q} s_{p,q}(b_{p,q})

[Figure: example sentence "... phosphorylation of TRAF2 inhibits binding to the CD40 domain ...", with trigger label scores, e.g. s(Binding) = -0.1, s(Regulation) = 3.2, s(Phosphor.) = 0.5, and argument role scores, e.g. s(None) = -2.2, s(Theme) = 0.2, s(Cause) = 1.3.]
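The factored score above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the score tables and token indices below are hypothetical, and the global constrained maximization (the actual decoding step) is not shown.

```python
# Sketch of the factored score
#   s(e, a, b) = sum_i s_i(e_i) + sum_{i,j} s_{i,j}(a_{i,j}) + sum_{p,q} s_{p,q}(b_{p,q})
# e maps trigger positions to event labels, a maps edges to argument roles,
# and b lists binding pairs. All score tables here are hypothetical.

def total_score(e, a, b, s_trigger, s_arg, s_bind):
    """Sum per-trigger, per-argument, and per-binding-pair local scores."""
    score = sum(s_trigger[i][label] for i, label in e.items())
    score += sum(s_arg[edge][role] for edge, role in a.items())
    score += sum(s_bind[pair] for pair in b)
    return score

# Toy example with the slide's numbers: one trigger scored as
# Regulation (3.2) plus one Cause argument edge (1.3).
s_trigger = {2: {"Binding": -0.1, "Regulation": 3.2, "Phosphor.": 0.5}}
s_arg = {(2, 4): {"None": -2.2, "Theme": 0.2, "Cause": 1.3}}
s_bind = {}

print(round(total_score({2: "Regulation"}, {(2, 4): "Cause"}, [],
                        s_trigger, s_arg, s_bind), 2))  # 4.5
```

Decoding then searches over assignments (e, a, b) for the one maximizing this sum subject to the global constraints.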
Scores

    s(e, a, b) = sum_i s_i(e_i) + sum_{i,j} s_{i,j}(a_{i,j}) + sum_{p,q} s_{p,q}(b_{p,q})

Each local score is a weighted sum of indicator features, e.g.:

    s_i(e_i) = 1.3 · 1[e = Reg ∧ w = "inhibit"] + (−2.1) · 1[e = Reg] + ...
Stacked Features

    s(e, a, b) = sum_i s_i(e_i) + sum_{i,j} s_{i,j}(a_{i,j}) + sum_{p,q} s_{p,q}(b_{p,q})

Add indicator features over the stacked model's prediction y, e.g.:

    s_i(e_i) = 1.3 · 1[e = Reg ∧ w = "inhibit"] + (−2.1) · 1[e = Reg] + 1.2 · 1[e = Reg ∧ y = Reg] + ...
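As a concrete sketch of stacked features (assumed form, not the authors' implementation): the base indicator features over the candidate label e and word w are simply augmented with indicators over the stacked model's predicted label y. The helper names and weight values below are illustrative.

```python
# Hypothetical stacked feature vector for a trigger candidate:
# base features over (label, word) plus indicators over the
# stacked model's prediction y.

def trigger_features(e, w, y):
    """Indicator features for candidate label e, word w, and
    stacked-model prediction y (all names are illustrative)."""
    return {
        ("label", e): 1.0,             # e.g. weight -2.1 for e = Reg
        ("label+word", e, w): 1.0,     # e.g. weight 1.3 for e = Reg, w = "inhibit"
        ("label+stacked", e, y): 1.0,  # e.g. weight 1.2 for e = Reg, y = Reg
    }

def score(feats, weights):
    """Weighted sum of active indicator features."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

weights = {
    ("label", "Reg"): -2.1,
    ("label+word", "Reg", "inhibit"): 1.3,
    ("label+stacked", "Reg", "Reg"): 1.2,
}
print(round(score(trigger_features("Reg", "inhibit", "Reg"), weights), 2))  # 0.4
```

The stacking model thus learns, per feature, how much to trust the stacked model's 1-best output.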
Stacked model

- Stanford Event Parsing system
- Recall: four different decoders: (1st-, 2nd-order features) × (projective, non-projective)
- Only used the parser for stacking (1-best outputs)
- Different segmentation/tokenization
- Different trigger detection
Performance of individual components (Genia development section, Task 1)

System          F1     with reranker
UMass           54.8   -
Stanford (1N)   49.9   50.2
Stanford (1P)   49.0   49.4
Stanford (2N)   46.5   47.9
Stanford (2P)   49.5   50.5
Model combination strategies (Genia development section, Task 1)

System                     F1
UMass                      54.8
Stanford (2P, reranked)    50.5
Stanford (all, reranked)   50.7
UMass←2N                   54.9
UMass←1N                   55.6
UMass←1P                   55.7
UMass←2P                   55.7
UMass←all (FAUST)          55.9
Ablation analysis for stacking (Genia development section, Task 1)

System                     F1
UMass                      54.8
Stanford (2P, reranked)    50.5
UMass←all                  55.9
UMass←all (triggers)       54.9
UMass←all (arguments)      55.1
Conclusions

- Stacking: easy, effective method of model combination
  - ...even if base models differ significantly in performance
- Variability in models critical for success
- Tree structure best provided by projective decoder
  - Incorporated in UMass model via 2P stacking
- Future work: incorporate projectivity constraint directly

Questions?
Backup slides
Conjoined Features

Conjoin base features with the stacked model's prediction y, e.g.:

    s_i(e_i) = 1.3 · 1[e = Reg ∧ w = "inhibit"] + (−2.1) · 1[e = Reg] + 1.2 · 1[e = Reg ∧ y = Reg] + 3.2 · 1[e = Reg ∧ w = "inhibit" ∧ y = Reg] + ...
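Conjoined features can be sketched as pairing each base feature with the stacked prediction, so the model can learn context-specific trust in the stacked model. This is an assumed illustration, not the authors' code; helper names and weights are hypothetical.

```python
# Hypothetical conjoined feature vector: base (label, word) features,
# stacked-prediction features, and their conjunction.

def conjoined_features(e, w, y):
    """Indicator features conjoining label e, word w, and the
    stacked model's prediction y (names are illustrative)."""
    return {
        ("label+word", e, w): 1.0,             # base feature
        ("label+stacked", e, y): 1.0,          # stacked feature
        ("label+word+stacked", e, w, y): 1.0,  # conjoined feature, e.g. weight 3.2
    }                                          # for e = Reg, w = "inhibit", y = Reg

weights = {
    ("label+word", "Reg", "inhibit"): 1.3,
    ("label+stacked", "Reg", "Reg"): 1.2,
    ("label+word+stacked", "Reg", "inhibit", "Reg"): 3.2,
}
total = sum(weights.get(f, 0.0) for f in conjoined_features("Reg", "inhibit", "Reg"))
print(round(total, 2))  # 5.7
```

In the Infectious Diseases results below, the UMass←2P (conjoined) row corresponds to this feature variant.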
Results on Genia

System                  Simple   Binding   Regulation   Total
UMass                   74.7     47.7      42.8         54.8
Stanford 1N             71.4     38.6      32.8         47.8
Stanford 1P             70.8     35.9      31.1         46.5
Stanford 2N             69.1     35.0      27.8         44.3
Stanford 2P             72.0     36.2      32.2         47.4
UMass←All               76.9     43.5      44.0         55.9
UMass←1N                76.4     45.1      43.8         55.6
UMass←1P                75.8     43.1      44.6         55.7
UMass←2N                74.9     42.8      43.8         54.9
UMass←2P                75.7     46.0      44.1         55.7
UMass←All (triggers)    76.4     41.2      43.1         54.9
UMass←All (arguments)   76.1     41.7      43.6         55.1
Results on Infectious Diseases

System                 Rec    Prec   F1
UMass                  46.2   51.1   48.5
Stanford 1N            43.1   49.1   45.9
Stanford 1P            40.8   46.7   43.5
Stanford 2N            41.6   53.9   46.9
Stanford 2P            42.8   48.1   45.3
UMass←All              47.6   54.3   50.7
UMass←1N               45.8   51.6   48.5
UMass←1P               47.6   52.8   50.0
UMass←2N               45.4   52.4   48.6
UMass←2P               49.1   52.6   50.7
UMass←2P (conjoined)   48.0   53.2   50.4
Results on test

                  UMass                  UMass←All
Task              Rec    Prec   F1       Rec    Prec   F1
GE (Task 1)       48.5   64.1   55.2     49.4   64.8   56.0
GE (Task 2)       43.9   60.9   51.0     46.7   63.8   53.9
EPI (Full task)   28.1   41.6   33.5     28.9   44.5   35.0
EPI (Core task)   57.0   73.3   64.2     59.9   80.3   68.6
ID (Full task)    46.9   62.0   53.4     48.0   66.0   55.6
ID (Core task)    49.5   62.1   55.1     50.6   66.1   57.3