Compositional Morphology for Word Representations and Language Modelling
Jan Botha, Phil Blunsom
ICML 2014, Beijing
Motivation
Proposed Method
Experiments

Motivating Example
What we see:
The king finally abdicated after years of unkingly conduct.
Wait, what? Unkingly?
unkingly /ʌnˈkɪŋli/: a word you have probably never seen, but still understand
⇒ compositional morphology in action
What our models see (mostly):
10 2 95 529 11 88 21 50 74 239
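
The contrast above can be sketched in a few lines: a word-level model typically maps each word type to an arbitrary integer before doing anything else. The vocabulary below is a toy one built from the example sentence alone, so its IDs are illustrative and differ from the ones on the slide.

```python
# Illustrative only: a toy vocabulary built from the example sentence.
sentence = "The king finally abdicated after years of unkingly conduct .".split()

# Assign each distinct word type an arbitrary integer ID (here: sorted order).
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}
ids = [vocab[word] for word in sentence]

# The IDs carry no information about word-internal structure:
# nothing links the ID of "unkingly" to the ID of "king".
print(ids)
print(vocab["king"], vocab["unkingly"])
```

Any other assignment of IDs would serve the model equally well, which is exactly why the relation between "king" and "unkingly" is invisible at this level.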

Motivating Example 2
Other languages display still more variation.
Czech conjugation: čistit (to clean)
čistím, čistíš, čistí, čistíme, čistíte, čistil, čištěn, čisti, čistěte, čistěme
Turkish productive derivation:
Avrupa (Europe)
Avrupalı (of Europe)
Avrupalılaş (become of Europe)
Avrupalılaştır (to Europeanise)
Avrupalılaştırama (be unable to Europeanise)
Avrupalılaştıramadık (we were unable to Europeanise)
...
⇒ we should model morphemes!

Representing Words
- Discrete set? {a, aardvark, ..., account, accounted, accounting, ...}
- Vector space?
[Figure: the words a, aardvark, account and accounted plotted as points in a plane with axes x1 and x2]

Extract from Collobert & Weston Embeddings
[Figure: extract from the Collobert & Weston embeddings]
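
Morpheme Vectors
Existing word vectors already capture some morphology.
- $\overrightarrow{\text{banks}} - \overrightarrow{\text{bank}} \approx \overrightarrow{\text{kings}} - \overrightarrow{\text{king}} \approx \overrightarrow{\text{queens}} - \overrightarrow{\text{queen}}$ (Mikolov et al. 2013)
Logical extension:
- $\overrightarrow{\text{kings}} \approx \overrightarrow{\text{king}} + \overrightarrow{\text{-s}}$
- $\overrightarrow{\text{unkingly}} \approx \overrightarrow{\text{un-}} + \overrightarrow{\text{king}} + \overrightarrow{\text{-ly}}$
How to ...
- obtain morpheme vectors
- compose morpheme vectors
- do it all within a language model usable in an MT decoder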
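
The link between the two observations on this slide can be checked with toy numbers: if every plural is built as stem plus one shared "-s" vector, then all plural-minus-singular offsets coincide. The vectors and the `plural_suffix` name below are hand-picked for illustration and are not learned embeddings.

```python
# Toy 3-d vectors, hand-picked for illustration (not learned embeddings).
stems = {
    "bank": [1.0, 0.0, 2.0],
    "king": [0.0, 3.0, 1.0],
    "queen": [0.5, 3.0, -1.0],
}
plural_suffix = [0.2, -0.1, 0.4]  # hypothetical vector for "-s"


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def close(u, v, tol=1e-9):
    """Approximate equality, to absorb floating-point rounding."""
    return all(abs(a - b) < tol for a, b in zip(u, v))


# Compose each plural as stem + "-s".
plurals = {stem + "s": add(vec, plural_suffix) for stem, vec in stems.items()}

# Every plural-minus-singular offset recovers the same suffix vector.
offsets = [sub(plurals[stem + "s"], stems[stem]) for stem in stems]
print(all(close(offset, plural_suffix) for offset in offsets))
```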
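
Morphological Composition as Addition
Literally, word = sum of its parts?
Problems:
- bag of morphemes: $\overrightarrow{\text{hang}} + \overrightarrow{\text{over}} \neq \overrightarrow{\text{over}} + \overrightarrow{\text{hang}}$ (hangover and overhang should differ, but addition is commutative and cannot tell them apart)
- non-compositionality: $\overrightarrow{\text{greenhouse}} \neq \overrightarrow{\text{green}} + \overrightarrow{\text{house}}$
Pragmatic solution: include word identity as a component too:
$\overrightarrow{\text{greenhouse}} \equiv \overrightarrow{\text{greenhouse}}_{\text{id}} + \overrightarrow{\text{green}}_{\text{stem}} + \overrightarrow{\text{house}}_{\text{stem}}$
$\overrightarrow{\text{unkingly}} \equiv \overrightarrow{\text{unkingly}}_{\text{id}} + \overrightarrow{\text{un}}_{\text{pre}} + \overrightarrow{\text{king}}_{\text{stem}} + \overrightarrow{\text{ly}}_{\text{suf}}$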
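
A minimal sketch of the pragmatic solution, assuming each factor (word identity, prefix, stem, suffix) has its own vector and a word is the plain sum of its factors. The `compose` helper, the `(role, string)` keys and all numbers are hypothetical, chosen only to show how the identity factor separates hangover from overhang.

```python
def compose(factors, table):
    """Sum the vectors of all factors of a word (the sum ignores order)."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for factor in factors:
        total = [t + x for t, x in zip(total, table[factor])]
    return total


# Hypothetical factor vectors, keyed by (role, string).
table = {
    ("stem", "hang"): [1.0, 0.0],
    ("stem", "over"): [0.0, 1.0],
    ("id", "hangover"): [0.5, -0.5],
    ("id", "overhang"): [-0.5, 0.5],
}

# Morphemes alone: the two words collapse onto the same vector.
bag_hangover = compose([("stem", "hang"), ("stem", "over")], table)
bag_overhang = compose([("stem", "over"), ("stem", "hang")], table)

# With the word-identity factor added, they become distinguishable again.
hangover = compose([("id", "hangover"), ("stem", "hang"), ("stem", "over")], table)
overhang = compose([("id", "overhang"), ("stem", "over"), ("stem", "hang")], table)

print(bag_hangover == bag_overhang, hangover != overhang)
```

The identity vector also gives a non-compositional word such as greenhouse room to deviate from the sum of its stems, while a rare word still inherits most of its representation from shared morphemes.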

Simplest Vector-Based Probabilistic LM
LBL (log-bilinear model) (Mnih & Hinton, 2007; Mnih & Teh, 2012)
“colorless green ideas sleep furiously .”

Add Morpheme Vectors Inside LM
LBL++
“colorless green ideas sleep furiously .”
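
A rough sketch of how the two slides fit together, under stated assumptions: a log-bilinear model predicts a vector from the context words and scores each candidate word by a dot product, and the "++" variant is taken here to mean that every word vector is the sum of its factor vectors. The function names, the factor labels, the per-dimension position weights (standing in for full context matrices) and all parameter values are illustrative, not taken from the slides.

```python
import math


def compose(word, factors, table):
    """Word vector = sum of its factor vectors (assumed additive composition)."""
    vectors = [table[f] for f in factors[word]]
    return [sum(xs) for xs in zip(*vectors)]


def lbl_probs(history, vocab, factors, ctx_table, out_table, position_weights, bias):
    """P(w | history) under a log-bilinear model with composed word vectors."""
    dim = len(position_weights[0])
    # Predicted representation: position-weighted sum of context word vectors.
    pred = [0.0] * dim
    for weights, word in zip(position_weights, history):
        q = compose(word, factors, ctx_table)
        pred = [p + c * x for p, c, x in zip(pred, weights, q)]
    # Score each candidate by a dot product with its composed output vector.
    scores = []
    for w in vocab:
        r = compose(w, factors, out_table)
        scores.append(sum(p * x for p, x in zip(pred, r)) + bias[w])
    # Softmax over the whole vocabulary (numerically stabilised).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {w: e / z for w, e in zip(vocab, exps)}


# Toy setup: three words sharing the stem "king"; all values are made up.
vocab = ["king", "kings", "unkingly"]
factors = {
    "king": ["id:king", "stem:king"],
    "kings": ["id:kings", "stem:king", "suf:s"],
    "unkingly": ["id:unkingly", "pre:un", "stem:king", "suf:ly"],
}
all_factors = sorted({f for fs in factors.values() for f in fs})
ctx_table = {f: [0.1 * (i + 1), -0.05 * i] for i, f in enumerate(all_factors)}
out_table = {f: [0.02 * i, 0.1] for i, f in enumerate(all_factors)}
position_weights = [[1.0, 0.5], [0.5, 1.0]]  # one weight vector per history slot
bias = {w: 0.0 for w in vocab}

probs = lbl_probs(["king", "kings"], vocab, factors, ctx_table, out_table,
                  position_weights, bias)
print(probs)
```

Because "stem:king" appears in all three words, any update to that one vector during training would move the representations of all of them at once.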

Computational Efficiency
Problem: each probability query requires normalisation over the vocabulary.
- O(vocab size)
- rich morphology ⇒ large vocabulary
Solution: decompose the model using word classes
$P(\text{word} \mid \text{history}) = P(\text{class}(\text{word}) \mid \text{history}) \times P(\text{word} \mid \text{class}(\text{word}), \text{history})$
- use unsupervised Brown clustering
- each LM query becomes $2 \times O(\sqrt{\text{vocab size}})$ ⇒ fast enough for MT decoding
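
The class decomposition can be sketched as two small softmaxes in place of one large one. The example below assumes the scores for a fixed history are already available and uses a hand-made partition of nine words into three classes; it does not implement Brown clustering, and every name and number in it is hypothetical.

```python
import math


def softmax(scores):
    """Normalise a list of scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def class_factored_prob(word, word_class, class_members, class_scores, word_scores):
    """P(word | h) = P(class(word) | h) * P(word | class(word), h)."""
    classes = sorted(class_scores)
    cls = word_class[word]
    # First softmax: over classes only.
    p_class = softmax([class_scores[c] for c in classes])[classes.index(cls)]
    # Second softmax: over the words inside the chosen class only.
    members = class_members[cls]
    p_word = softmax([word_scores[w] for w in members])[members.index(word)]
    return p_class * p_word


# Hand-made partition of a 9-word vocabulary into 3 classes of 3 words each.
class_members = {0: ["a", "b", "c"], 1: ["d", "e", "f"], 2: ["g", "h", "i"]}
word_class = {w: c for c, ws in class_members.items() for w in ws}

# Made-up scores for one fixed history.
class_scores = {0: 0.3, 1: -1.2, 2: 0.8}
word_scores = {w: 0.1 * i for i, w in enumerate(word_class)}

# The factored probabilities still form a proper distribution over all words.
total = sum(
    class_factored_prob(w, word_class, class_members, class_scores, word_scores)
    for w in word_class
)
# Each query normalises over 3 classes + 3 class members instead of 9 words.
terms_per_query = len(class_members) + len(class_members[0])
print(total, terms_per_query)
```

With roughly √V classes of roughly √V words each, both softmaxes stay small, which is where the 2 × O(√vocab size) cost per query comes from.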

Evaluation Overview
Setup:
- 4-gram models
- Czech, English, French, German, Spanish, Russian
- train on 20–50m tokens
- large vocabularies (exclude 5% of singletons)
Three evaluation contexts:
- Perplexity on test data
- Word similarity rating
- Machine translation
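
For reference, perplexity, the first of the three evaluation measures, is the exponential of the average negative log-probability a model assigns to each test token. The sketch below uses the standard definition; the `perplexity` helper and the input probabilities are illustrative and not taken from the experiments.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# A model that gives every test token probability 1/4 has perplexity 4:
# it is as uncertain as a uniform choice among four words.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(ppl)
```

Lower is better: a drop in perplexity means the model assigns higher probability to the held-out text.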
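
Perplexity Improvements by Language
[Figure: bar chart of perplexity reduction in % (y-axis, 0 to 6) per language (x-axis: CS, DE, EN, ES, FR, RU), CLBL→CLBL++. Bar labels: 683→643, 422→404, 313→300, 281→273, 207→203, 232→227.]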
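
The percentage axis and the perplexity pairs on this chart are related by simple arithmetic, sketched below. The pairs are listed in the order they appear in the extracted slide text, without assigning them to languages; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(before, after):
    """Percentage by which perplexity drops going from `before` to `after`."""
    return 100.0 * (before - after) / before


# CLBL -> CLBL++ perplexity pairs as labelled on the chart.
pairs = [(683, 643), (422, 404), (313, 300), (281, 273), (207, 203), (232, 227)]
reductions = [round(relative_reduction(b, a), 1) for b, a in pairs]
print(reductions)
```

Each pair works out to a reduction of roughly 2 to 6 percent, consistent with the range of the chart's y-axis.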

Perplexity Improvements on German (Break-down by Token Frequency)
[Figure: bar chart of perplexity reduction in % (y-axis, 0 to 20), CLBL→CLBL++, broken down by token frequency]