Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches
Loganathan Ramasamy and Zdeněk Žabokrtský
Charles University in Prague

Feb 21, 2011

Loganathan Ramasamy and Zdeněk Žabokrtský (Charles University in Prague), CICLing-2011

Outline

1. Motivation & Objectives
2. General Aspects of Tamil Language
3. Annotation Scheme
4. Rule Based Parser for Tamil
   Parsing Example
5. Experiments and Results
6. Conclusion

Motivation & Objectives

- Resource poor (?)
- Morphologically rich
- Develop a treebank and parser for Tamil
- Identify issues in treebank development
- Test rule based (RB) and corpus based (CB) parsers

General Aspects of Tamil Language

- Morphologically rich
  - Agglutinative
  - Compound word constructions
- Head final & relatively free word order
  - Strictly head final
  - Within clause word order freedom
- Subject–Verb agreement
  - Subject agrees with verb in person–number–gender

Annotation Scheme

- Developed a small treebank (approx. 3000 words)
- Based on the Prague Dependency Treebank (PDT 2.0)
- PDT 2.0 uses 3 levels of annotation; ours uses only the first 2 layers: morphological and analytical
- There are 19 analytical functions (or dependency relations) defined for the Tamil treebank
- Morphological layer contains ≈ 460 unique tags
- Rule based parser under the TectoMT framework

Annotation Scheme - Annotation of sentence fragments

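Each node shows the word form (w-layer), the morphological tag (m-layer, in parentheses) and the analytical function (a-layer). Dependents are indented under their heads.

(a) walla maniTan (good man)
    maniTan (3sm_nn)
        walla (adj) Atr

(b) awTa walla maniTan (That good man)
    maniTan (3sm_nn)
        awTa (det) AuxA
        walla (adj) Atr

(c) wanRAka patiTTa paiyan (The boy who studied well)
    paiyan (3sm_nn)
        patiTTa (pst_adj_part) AdjCl
            wanRAka (adv) Adv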

Figure: Illustration of Atr, Adv, AuxA, AdjCl dependencies
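
One way to hold the three layers of such a fragment in code is sketched below for fragment (b). The `Node` class and the `relations` helper are hypothetical names chosen for this illustration; they are not part of TrEd, TectoMT or the treebank format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    form: str        # w-layer: word form
    tag: str         # m-layer: morphological tag
    afun: str = ""   # a-layer: analytical function (left empty for the fragment head)
    children: list = field(default_factory=list)


# Fragment (b): awTa walla maniTan (That good man)
maniTan = Node("maniTan", "3sm_nn", children=[
    Node("awTa", "det", "AuxA"),
    Node("walla", "adj", "Atr"),
])


def relations(node):
    """Yield (dependent, analytical function, head) triples, depth first."""
    for child in node.children:
        yield (child.form, child.afun, node.form)
        yield from relations(child)


for triple in relations(maniTan):
    print(triple)
```

Keeping the head as the owner of its dependents mirrors the figure, where the head noun sits at the top and each modifier carries its own analytical function.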

Annotation Scheme - Annotation of sentence fragments

Each node shows the word form (w-layer), the morphological tag (m-layer, in parentheses) and the analytical function (a-layer). Dependents are indented under their heads.

(a) A-um , B-um (A , B)
    , (,) Coord
        A-um (mtag1) AuxC
        B-um (mtag1) AuxC

(b) A, B, C maRRum D (A, B, C and D)
    maRRum (cconj) Coord
        A (mtag1) AuxC
        , (,) AuxX
        B (mtag1) AuxC
        , (,) AuxX
        C (mtag1) AuxC
        D (mtag1) AuxC

(c) A-O allaTu B-O (A or B)
    allaTu (cconj) Coord
        A-O (mtag1) AuxC
        B-O (mtag1) AuxC

Figure: Illustration of coordination conjunction

Annotation Scheme - Full annotation example

StaM
    katawTa        katawTa        adj
    aTimuka        aTimuka        nnpc
    Atciyil        Atciyil        loc_3n_nn
    latcumi        latcumi        nnpc
    pirAnEsh       pirAnEsh       3n_nnp
    Talaimaic      Talaimaic      nnc
    ceyalALarAka   ceyalALarAka   3h_nn
    iruwTAr        iruwTAr        mv_pst_3sh_f
    .              .              .

StaA
    mv_pst_3sh_f   iruwTAr        Pred
    loc_3n_nn      Atciyil        NR
    adj            katawTa        Atr
    nnpc           aTimuka        Atr
    3n_nnp         pirAnEsh       Sb
    nnpc           latcumi        Atr
    .              .              AuxK
    3h_nn_adv      ceyalALarAka   Atr
    nnc            Talaimaic      Atr

Sentence: katawTa aTimuka Atciyil latcumi pirAnEsh Talaimaic ceyalALarAka iruwTAr.

Figure: Annotation using TrEd tool

Rule Based Parser for Tamil

Uses a tagger and simple morphological & syntactic rules to build unlabeled and labeled dependency trees.

Algorithm
1. Tag the input sentence. We used the TnT tagger.
2. Build the unlabeled dependency tree by calling
   - Identify_main_predicate()
   - Resolve_coordination()
   - Identify_trivial_parents()
   - Process_complements()
3. Assign labels to the edges.
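
The order of the steps in step 2 can be sketched in Python as below. This is only an illustration under stated assumptions: the slides do not list the parser's actual rules, so the tag classes (`MODIFIER_TAGS`, `PREDICATE_PREFIX`) and the attachment rules are guesses made from the tags of the sentence used in the parsing example. Resolve_coordination() and the labeling step are left out, and the input is assumed to be already tagged.

```python
from dataclasses import dataclass

# Hypothetical tag classes guessed from the example sentence,
# not the parser's real rule set.
MODIFIER_TAGS = {"adj", "nnpc", "nnc"}   # adjectives and compound-noun parts
PREDICATE_PREFIX = "mv_"                 # main verb tags such as mv_pst_3sh_f
PUNCT_TAG = "."


@dataclass
class Token:
    form: str
    tag: str
    parent: int = 0   # 0 = artificial Root, otherwise 1-based token index


def identify_main_predicate(tokens):
    """Return the 1-based index of the last main verb, or 0 if there is none."""
    for i in range(len(tokens), 0, -1):
        if tokens[i - 1].tag.startswith(PREDICATE_PREFIX):
            return i
    return 0


def identify_trivial_parents(tokens):
    """Attach each modifier to the nearest non-modifier word on its right."""
    for i, tok in enumerate(tokens, start=1):
        if tok.tag not in MODIFIER_TAGS:
            continue
        for j in range(i + 1, len(tokens) + 1):
            cand = tokens[j - 1]
            if cand.tag not in MODIFIER_TAGS and cand.tag != PUNCT_TAG:
                tok.parent = j
                break


def process_complements(tokens, pred):
    """Attach every word still hanging under Root to the main predicate."""
    for i, tok in enumerate(tokens, start=1):
        if tok.parent == 0 and i != pred and tok.tag != PUNCT_TAG:
            tok.parent = pred


def parse(tagged):
    """Build an unlabeled tree from (form, tag) pairs."""
    tokens = [Token(form, tag) for form, tag in tagged]
    pred = identify_main_predicate(tokens)
    # Resolve_coordination() would run here; it is omitted in this sketch.
    identify_trivial_parents(tokens)
    process_complements(tokens, pred)
    return tokens


EXAMPLE = [
    ("katawTa", "adj"), ("aTimuka", "nnpc"), ("Atciyil", "loc_3n_nn"),
    ("latcumi", "nnpc"), ("pirAnEsh", "nnp"), ("Talaimaic", "nnc"),
    ("ceyalALarAka", "3h_nn_adv"), ("iruwTAr", "mv_pst_3sh_f"), (".", "."),
]
heads = [tok.parent for tok in parse(EXAMPLE)]
print(heads)  # [3, 3, 8, 5, 8, 7, 8, 0, 0]
```

Scanning to the right for a head reflects the head-final property described earlier; a real rule set would need many more tag classes than the three assumed here.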

Parsing Example

How to parse the following Tamil sentence using RB parser?

Tamil sentence   Morphological tag   English gloss
katawTa          adj                 last
aTimuka          nnpc                ADMK
Atciyil          loc_3n_nn           in_the_rule
latcumi          nnpc                Lakshmi
pirAnEsh         nnp                 Pranesh
Talaimaic        nnc                 chief
ceyalALarAka     3h_nn_adv           as_secretariat
iruwTAr          mv_pst_3sh_f        was
.                .

English translation: Lakshmi Pranesh was the chief secretariat in the last ADMK rule

Figure: A Tamil sentence

Root
    katawTa  aTimuka  Atciyil  latcumi  pirAnEsh  Talaimaic  ceyalALarAka  iruwTAr  .

Figure: Initial flat tree

Parsing Example: Identify_main_predicate()

Root
    katawTa  aTimuka  Atciyil  latcumi  pirAnEsh  Talaimaic  ceyalALarAka  iruwTAr  .

Figure: Identify predicate of the sentence

Parsing Example: Identify_trivial_parents()

The slides build this step in three stages. The words raised above the token row as phrasal heads are:

1. Atciyil
2. Atciyil, pirAnEsh
3. Atciyil, pirAnEsh, ceyalALarAka

Figure: Attach modifiers to phrasal heads

Parsing Example: Process_complements()

Root
    iruwTAr
        Atciyil
        pirAnEsh
        ceyalALarAka

Figure: Process complements, attach arguments to clausal predicates

Parsing Example: Labeling the tree

Root
    iruwTAr  Pred
        Atciyil  NR
            katawTa  Atr
            aTimuka  Atr
        pirAnEsh  Subject
            latcumi  Atr
        ceyalALarAka  Atr
            Talaimaic  Atr

Figure: Labeling of the dependency tree

Experiments and Results: Data

- Corpus 1 is morphologically and syntactically annotated.
- Corpus 2 is only morphologically tagged.

                       Corpus 1         Corpus 2
Tagset size            296              459
Lexical verb tags      120              194
Auxiliary verb tags    31               44
# of words             2961             8421
Unique tokens          1634             3747
1 tag count            1534 (93.88%)    3427 (91.46%)
2 tag count            92 (05.63%)      284 (07.58%)
3 tag count            8 (00.49%)       33 (00.88%)
4 tag count            0 (00.00%)       3 (00.08%)

Table: Corpus statistics
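
The percentages in the table can be reproduced from the counts, reading "N tag count" as the number of unique tokens that occur with exactly N different tags. That reading is an assumption, although it is supported by the four counts of each corpus adding up to its number of unique tokens. The variable and function names below are chosen for this illustration only.

```python
# Counts copied from the corpus statistics table.
corpora = {
    "Corpus 1": {"unique_tokens": 1634, "by_tag_count": [1534, 92, 8, 0]},
    "Corpus 2": {"unique_tokens": 3747, "by_tag_count": [3427, 284, 33, 3]},
}


def ambiguity_shares(unique_tokens, by_tag_count):
    """Percentage of unique tokens seen with 1, 2, 3 and 4 different tags."""
    return [round(100.0 * n / unique_tokens, 2) for n in by_tag_count]


shares = {name: ambiguity_shares(**c) for name, c in corpora.items()}
for name, values in shares.items():
    print(name, values)
```

In both corpora more than nine out of ten unique tokens carry a single tag.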

Experiments and Results: Accuracy of RB & CB Parsers

- The rule based parser is tested against the whole 2961 tokens.
- The Manual POS experiment uses gold standard tagged data.
- In the Auto POS experiment, tagging was done by the TnT tagger.
- MST and Malt parsers are trained on 2008 word tokens and tested against 953 tokens.

              Unlabeled (%)   Labeled (%)
Auto POS      71.94           61.70
Manual POS    84.73           79.13

Table: Rule Based parser accuracy

              Unlabeled (%)   Labeled (%)
MaltParser    75.03           65.69
MST Parser    74.92           65.69

Table: Corpus Based parser accuracy
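
The slides do not define how the two accuracies are computed. Assuming they are the usual attachment scores (unlabeled: share of tokens with the correct head; labeled: share with the correct head and the correct label), they could be computed as in the sketch below. The function name and the four-token toy data are hypothetical and unrelated to the reported figures. Note also that the 2008 training tokens and 953 test tokens add up to the 2961 tokens of Corpus 1.

```python
def attachment_scores(gold, predicted):
    """Return (unlabeled, labeled) accuracy in percent.

    gold and predicted are equal-length lists of (head, label) pairs,
    one pair per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


# Toy data (hypothetical): the third token has the right head but the wrong
# label, the fourth token has the wrong head.
gold = [(3, "Atr"), (3, "Atr"), (8, "NR"), (5, "Atr")]
pred = [(3, "Atr"), (3, "Atr"), (8, "Atr"), (8, "Atr")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

By construction the labeled score can never exceed the unlabeled one, which matches the pattern in both tables.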

Conclusion & Future work

- Certain dependency relations are easy to tackle using the RB parser.
- Accurate tagging is required for the RB approach.
- Prediction of relations such as Subject and Coordination is difficult in both RB and CB parsers.
- More data will be annotated.
