Transfer learning in language

Hal Daumé III
Computer Science, University of Maryland
[email protected]

IWSML, Kyoto, Japan
31 Mar 2012

With: Piyush Rai (Utah), Avishek Saha (Utah), Abhishek Kumar (UMD),
Jagadeesh Jagarlamudi (UMD), Suresh Venkatasubramanian (Utah)

Linguistic ambiguity

➢ Teacher Strikes Idle Kids
➢ Enraged Cow Injures Farmer With Ax
➢ I saw the Grand Canyon flying to New York
➢ Dog collar vs. flea collar
➢ Plastic cat food can cover
➢ The BUG in the room ...
  ➢ ... flew out the window
  ➢ ... was planted by spies
➢ Everyone on the island speaks two languages

Typical NLP pipeline

Example: "The man ate a sandwich"
  Morphology:     The man eat+past a sandwich
  Tagging:        DT NN VB DT NN
  Parsing:        S → NP VP
  Role labeling:  NP-Agent, NP-Theme
  Interpretation: ∃a ∃t ∃e . man(a) & sandwich(t) & eat(e,a,t) & past(e)

[Figure: the analysis/generation triangle. Analysis ascends Source Words →
Source Morphology → Source Syntax → Source Shallowmantics → Source Semantics
→ Interlingua; Generation descends the mirrored Target path down to Target
Words.]

Typical NLP pipeline

[Figure: the same pipeline diagram, with the Source and Target sides
highlighted: these tasks are highly related!]

Pipeline models break down (sorta)

➢ Tagging + Parsing:             +0%   / +3%
➢ Parsing + Named Entities:      +0.5% / +4%
➢ Parsing + Role Identification: +0%   / -0.3%  (upper bound: +13%)
➢ Named Entities + Coreference:  +0.3% / +1.3%  (upper bound: +8%)

Why? Maybe the simpler model already has a lot of the fancier information.

[Finkel & Manning; ACL 2009. Sutton & McCallum; NAACL 2007.
Daumé III & Marcu; EMNLP 2006. Many others...]

This talk is about...

1. Joint Parsing and Entity Recognition
2. Transfer via Multilinguality
3. Transfer from unlabeled data

[Running example: "Mark Gales spoke at IWSML", tagged NNP NNP VB IN NNP,
parsed as S → NP VP with a PP, and labeled with the entities Person and
Event.]


Agreement-based transfer

  George Bush spoke  to Congress  yesterday
  NNP    NNP  VBD    TO NNP       NN
  [--NP--]    [-VP-] [---PP---]   [--NP--]
  [-Person-]            [--Org--]

● Entities are subsequences of NPs
● NNPs are subsequences of entities

Lots of approaches in 2008

➢ Semi-supervised learning with constraints
  ➢ Force outputs to obey constraints and do self-training
  ➢ [Chang, Ratinov, Rizzolo, Roth; AAAI 2008]
➢ Co-regularization
  ➢ Encourage learned models to have similar structure
  ➢ [Ganchev, Graça, Blitzer, Taskar; UAI 2008]
➢ Cross-task Co-training
  ➢ Do self-training only on outputs that obey constraints
  ➢ [Daumé III; EMNLP 2008]

Simple black-box algorithm

(We love love love black-box algorithms!)

➢ Learn a parser on labeled data
➢ Learn an entity recognizer on labeled data
➢ Run both on unlabeled data
➢ Assume both outputs are correct for any data point that obeys the
  constraints in output space
➢ Retrain the models on the original data plus the new data
➢ Rinse and repeat

If the constraints are:
➢ Correct (true outputs always agree)
➢ Discriminating (the probability of agreement is at most 1 / [4(|Y| - 1)²])
then this algorithm "works" (in a PAC sense).

[Daumé III; EMNLP 2008]
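The loop on this slide can be sketched in a few lines. This is only a toy stand-in, not the paper's implementation: `Memorizer` is a hypothetical black-box learner, and `agree` is whatever constraint check you plug in (e.g. "entities are subsequences of NPs").

```python
class Memorizer:
    """Toy black-box learner: just memorizes (input, output) pairs."""
    def __init__(self):
        self.table = {}
    def train(self, data):
        self.table = dict(data)
    def predict(self, x):
        return self.table.get(x, "?")

def agreement_cotrain(model_a, model_b, labeled_a, labeled_b,
                      unlabeled, agree, rounds=3):
    """Retrain two black-box models on unlabeled points whose two
    outputs jointly obey the agreement constraints."""
    extra_a, extra_b = [], []
    for _ in range(rounds):
        model_a.train(labeled_a + extra_a)
        model_b.train(labeled_b + extra_b)
        extra_a, extra_b = [], []
        for x in unlabeled:
            ya, yb = model_a.predict(x), model_b.predict(x)
            if agree(ya, yb):            # outputs obey the constraints
                extra_a.append((x, ya))  # assume both are correct
                extra_b.append((x, yb))
        # rinse and repeat
    return model_a, model_b
```

Any two learners with `train`/`predict` methods can be dropped in; the algorithm never looks inside them, which is the whole point of "black-box".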

Black-box results

[Figure: experimental results for the black-box algorithm.]

[Daumé III; EMNLP 2008]

This talk is about... (now: 2. Transfer via Multilinguality)

Multilinguality as a source of x-fer

English:  The man ate a tasty sandwich            (English PCFG: S → NP VP)
          D   N   V   D J     N
French:   Le+homme a mangé un sandwich savoureux  (French PCFG)
          D  N     A V     D  N        J
Spanish:  El hombre se comió un bocadillo sabrosa (Spanish PCFG)
          D  N      A  V     D  N         J

A shared parameter ϴ ties the per-language PCFGs together.

Result: 21% improvement on average over 8 languages
(English, Dutch, Danish, Swedish, Portuguese, Spanish, Slovene, Chinese).

See also: [Berg-Kirkpatrick & Klein; ACL 2010], [Iwata, Mochihashi & Sawada;
ACL 2010], Snyder, Barzilay et al. ...

Implicational Universals

Verb-Object (VO), Prepositional (PreP):
  English:  I eat dinner in restaurants.
  French:   je mange le dîner  dans les restaurants
            I  eat   the dinner in  the restaurants

Object-Verb (OV), Postpositional (PostP):
  Japanese: boku -wa   bangohan-o  resutoran  -ni taberu
            I   -topic dinner -obj restaurants-in eat
  Hindi:    main raat ka khaana restra mein khaata hoon
            I    night-of-meal  restaurants in eat am

  VO ⊃ PreP        PostP ⊃ OV

[Daumé III & Campbell; ACL 2007]

Typological Map: VO

[Figure: world map of languages with verb-object (VO) order.]
[Daumé III & Campbell; ACL 2007]

Typological Map: PreP

[Figure: world map of prepositional (PreP) languages.]
[Daumé III & Campbell; ACL 2007]

Unsupervised part of speech tagging

➢ Seeds (frequent words for each tag):
  ➢ N: membro, milhões, obras
  ➢ D: as [the, 2f], o [the, 1m], os [the, 2m]
  ➢ V: afectar, gasta, juntar
  ➢ P: com, como, de, em
➢ Typological rules:
  ➢ Art ← Noun
  ➢ Prp → Noun
➢ Tag knowledge:
  ➢ Open class
  ➢ Closed class

[Teichert & Daumé III; NIPS WS 2009]

Does typology help?

[Figure: tagging accuracy (roughly 20-60%) with and without seeds, with and
without the ArtN typological rule.]

➢ Can also transfer across languages
➢ Even for typologically distinct ones!

[Teichert & Daumé III; NIPS WS 2009. Sanders & Daumé III; EMNLP 2012 sub.]

This talk is about... (now: 3. Transfer from unlabeled data)

Spectral Clustering

➢ Represent data points as the vertices V of a graph G.
➢ All pairs of vertices are connected by an edge E.
➢ Edges have weights W:
  ➢ large weights mean that the adjacent vertices are very similar;
    small weights imply dissimilarity.

Graph partitioning

➢ Clustering on a graph is equivalent to partitioning the vertices of the
  graph.
➢ A loss function for a partition of V into sets A and B:

      cut(A, B) = Σ_{i ∈ A, j ∈ B} W_ij

➢ In a good partition, vertices in different partitions will be dissimilar.
➢ Mincut criterion: find the partition (A, B) that minimizes cut(A, B).

Graph partitioning

➢ The mincut criterion ignores the size of the subgraphs formed.
➢ The normalized cut criterion favors balanced partitions:

      Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B),
      where vol(A) = Σ_{i ∈ A} Σ_j W_ij.

➢ Minimizing the normalized cut criterion exactly is NP-hard.
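A quick numeric check of the two criteria, on a toy 4-node weighted graph (all numbers illustrative): two tightly linked pairs joined by weak cross edges.

```python
import numpy as np

W = np.array([[0., 5., 1., 0.],
              [5., 0., 0., 1.],
              [1., 0., 0., 5.],
              [0., 1., 5., 0.]])   # symmetric affinity matrix

def cut(W, A, B):
    """Total weight of edges crossing from A to B."""
    return sum(W[i, j] for i in A for j in B)

def vol(W, A):
    """Total weight incident on the vertices in A."""
    return W[list(A)].sum()

def ncut(W, A, B):
    c = cut(W, A, B)
    return c / vol(W, A) + c / vol(W, B)

A, B = {0, 1}, {2, 3}        # the balanced partition
print(cut(W, A, B))          # 2.0: only the two weak cross edges are cut
print(round(ncut(W, A, B), 3))   # 0.333
```

Cutting off a single vertex instead (say A = {0}) costs cut = 6 and a much larger Ncut, which is exactly the size bias the normalized criterion penalizes.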

Spectral Clustering

➢ One way of approximately optimizing the normalized cut criterion leads to
  spectral clustering.
➢ Spectral clustering:
  ➢ find a new representation of the original data points;
  ➢ cluster the points in this representation using any clustering scheme
    (say, 2-means).
➢ The representation involves forming the row-normalized matrix built from
  the largest 2 eigenvectors of the normalized affinity matrix
  D^(-1/2) W D^(-1/2), where D is the diagonal degree matrix.
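The two-way recipe above can be sketched in plain numpy. This is a minimal sketch of standard normalized-spectral clustering, not code from the talk; the inlined 2-means is a deliberately tiny stand-in for any off-the-shelf clusterer.

```python
import numpy as np

def spectral_2way(W, iters=20):
    """Two-way spectral clustering of a symmetric affinity matrix W."""
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = Dinv @ W @ Dinv                  # normalized affinity D^-1/2 W D^-1/2
    _, vecs = np.linalg.eigh(L)          # eigenvalues ascending
    U = vecs[:, -2:]                     # largest 2 eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize
    # tiny deterministic 2-means on the rows of U:
    # start from row 0 and the row farthest from it
    centers = np.stack([U[0], U[np.argmax(((U - U[0]) ** 2).sum(-1))]])
    for _ in range(iters):
        labels = np.argmin(((U[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = U[labels == k].mean(axis=0)
    return labels
```

On a graph with two dense blocks joined by weak edges, the second eigenvector acts as a soft block indicator, so the 2-means step recovers the blocks even when the raw points are not linearly separable.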

Example: 2-means

[Figure: clustering result with 2-means in the original space.]

Example: Spectral clustering

[Figure: clustering result with spectral clustering on the same data.]

Multiview spectral clustering

[Figure: the same documents seen in two views, English (En) and Japanese,
with per-view affinity matrices We, Wj and eigenvector matrices Ue, Uj;
each view's affinity is projected through the other view's subspace:
Ue Ueᵀ Wj and Uj Ujᵀ We.]

Multiview spectral clustering

Algorithm:
1. Run SVD on each view.
2. Project each view onto the subspace spanned by the other's top left
   singular vectors (Ue Ueᵀ Wj and Uj Ujᵀ We).
3. Goto 1 unless converged.

Look ma: no hyperparameters!
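The three-step loop above is short enough to write out. A minimal sketch, assuming both views are given as square affinity matrices over the same documents; the subspace size `k` and iteration count are illustrative (the final clustering step on the resulting representation is omitted).

```python
import numpy as np

def multiview_spectral(We, Wj, k=2, iters=10):
    """Alternately project each view's affinity onto the subspace spanned
    by the other view's top-k left singular vectors."""
    for _ in range(iters):
        Ue, _, _ = np.linalg.svd(We)     # step 1: SVD of each view
        Uj, _, _ = np.linalg.svd(Wj)
        Ue, Uj = Ue[:, :k], Uj[:, :k]    # top-k left singular vectors
        # step 2: cross-view projection (Uj Uj^T We and Ue Ue^T Wj)
        We, Wj = Uj @ Uj.T @ We, Ue @ Ue.T @ Wj
    return We, Wj                        # step 3: repeat until converged
```

Each projection is an orthogonal projection, so the spectral norms cannot grow, and after one pass both matrices have rank at most k; the two views are pulled toward a shared low-dimensional structure without any tuned hyperparameters.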

Multiview spectral clustering

Results (Reuters):

              F-score   Norm. MI
  Best View   0.342     0.287
  Concat      0.368     0.298
  SofA        0.381     0.342
  Co-Spec     0.412     0.388

This talk is about...

1. Joint Parsing and Entity Recognition
   → Simple algorithms can achieve great transfer
2. Transfer via Multilinguality
   → Plentiful multilingual data + knowledge = strong models
3. Transfer from unlabeled data
   → Unlabeled (paired) data can be exploited efficiently

? Open questions:
  ➢ When will transfer help? Has transfer helped?
  ➢ How to incorporate knowledge?
  ➢ Scaling to billions of examples?

Thanks!

Piyush, Avishek, Abhishek, Jags, Suresh

ありがとうございます! (Thank you!)
質問は? (Questions?)