Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches
Loganathan Ramasamy – Zdeněk Žabokrtský
Charles University in Prague
Feb 21, 2011
Loganathan Ramasamy – Zdeněk Žabokrtský (Charles University in Prague), CICLing-2011
Outline
1. Motivation & Objectives
2. General Aspects of Tamil Language
3. Annotation Scheme
4. Rule Based Parser for Tamil
     Parsing Example
5. Experiments and Results
6. Conclusion
Motivation & Objectives
Resource poor (?)
Morphologically rich
Develop a treebank and parser for Tamil
Identify issues in treebank development
Test rule based (RB) and corpus based (CB) parsers
General Aspects of Tamil Language
Morphologically rich
    Agglutinative
    Compound word constructions
Head final & relatively free word order
    Strictly head final
    Within clause word order freedom
Subject–verb agreement
    Subject agrees with verb in person–number–gender
Annotation Scheme
Developed a small treebank (approx. 3000 words)
Based on the Prague Dependency Treebank (PDT 2.0)
    PDT 2.0 uses 3 levels of annotation.
    Ours uses only the first 2 layers: morphological and analytical.
There are 19 analytical functions (or dependency relations) defined for the Tamil treebank.
Morphological layer contains ≈ 460 unique tags
Rule based parser under the TectoMT framework
Annotation Scheme - Annotation of sentence fragments
Each node is shown as: w-layer form (m-layer tag) a-layer function. Indentation marks dependency.

(a) walla maniTan (good man)
    maniTan (3sm_nn)
      walla (adj) Atr
(b) awTa walla maniTan (That good man)
    maniTan (3sm_nn)
      awTa (det) AuxA
      walla (adj) Atr
(c) wanRAka patiTTa paiyan (The boy who studied well)
    paiyan (3sm_nn)
      patiTTa (pst_adj_part) AdjCl
        wanRAka (adv) Adv

Figure: Illustration of Atr, Adv, AuxA, AdjCl dependencies
Annotation Scheme - Annotation of sentence fragments
Each node is shown as: w-layer form (m-layer tag) a-layer function. Indentation marks dependency.

(a) A-um , B-um (A , B)
    , (,) Coord
      A-um (mtag1) AuxC
      B-um (mtag1) AuxC
(b) A, B, C maRRum D (A, B, C and D)
    maRRum (cconj) Coord
      A (mtag1) AuxC
      , (,) AuxX
      B (mtag1) AuxC
      , (,) AuxX
      C (mtag1) AuxC
      D (mtag1) AuxC
(c) A-O allaTu B-O (A or B)
    allaTu (cconj) Coord
      A-O (mtag1) AuxC
      B-O (mtag1) AuxC

Figure: Illustration of coordination conjunction
Annotation Scheme - Full annotation example
katawTa aTimuka Atciyil latcumi pirAnEsh Talaimaic ceyalALarAka iruwTAr.

StaM (form, lemma, tag)
katawTa        katawTa        adj
aTimuka        aTimuka        nnpc
Atciyil        Atciyil        loc_3n_nn
latcumi        latcumi        nnpc
pirAnEsh       pirAnEsh       3n_nnp
Talaimaic      Talaimaic      nnc
ceyalALarAka   ceyalALarAka   3h_nn
iruwTAr        iruwTAr        mv_pst_3sh_f
.              .              .

StaA (tag, form, analytical function)
mv_pst_3sh_f   iruwTAr        Pred
loc_3n_nn      Atciyil        NR
adj            katawTa        Atr
nnpc           aTimuka        Atr
3n_nnp         pirAnEsh       Sb
nnpc           latcumi        Atr
3h_nn_adv      ceyalALarAka   Atr
nnc            Talaimaic      Atr
.              .              AuxK

Figure: Annotation using TrEd tool
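The annotated sentence can also be written down as a small token table, which makes it easy to check that the analysis is a well-formed tree. The following is a minimal sketch in Python, not the treebank's actual file format. The forms, tags and analytical functions are copied from the a-layer of the figure. The head indices are an assumption read off the tree drawing and the parsing example (modifiers attach to the phrasal head on their right, the final punctuation hangs from the technical root), and the names `Token`, `is_tree` and `children` are hypothetical.

```python
from typing import NamedTuple

class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    form: str   # word form
    tag: str    # morphological tag as shown on the a-layer
    head: int   # parent index, 0 = technical root (ASSUMED, not given in the text)
    afun: str   # analytical function from the figure

SENTENCE = [
    Token(1, "katawTa", "adj", 3, "Atr"),
    Token(2, "aTimuka", "nnpc", 3, "Atr"),
    Token(3, "Atciyil", "loc_3n_nn", 8, "NR"),
    Token(4, "latcumi", "nnpc", 5, "Atr"),
    Token(5, "pirAnEsh", "3n_nnp", 8, "Sb"),
    Token(6, "Talaimaic", "nnc", 7, "Atr"),
    Token(7, "ceyalALarAka", "3h_nn_adv", 8, "Atr"),
    Token(8, "iruwTAr", "mv_pst_3sh_f", 0, "Pred"),
    Token(9, ".", ".", 0, "AuxK"),
]

def is_tree(tokens):
    """True if every token reaches the technical root (0) without a cycle."""
    heads = {t.idx: t.head for t in tokens}
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True

def children(tokens, head_idx):
    """Forms of all tokens whose parent is head_idx, in sentence order."""
    return [t.form for t in tokens if t.head == head_idx]
```

With these assumed heads, `children(SENTENCE, 8)` lists the three dependents of the predicate iruwTAr, and `is_tree(SENTENCE)` confirms that every word is connected to the root.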
Rule Based Parser for Tamil
Uses a tagger and simple morphological & syntactic rules to build unlabeled and labeled dependency trees.
Algorithm
1. Tag the input sentence. We used the TnT tagger.
2. Build the unlabeled dependency tree by calling
     Identify_main_predicate()
     Resolve_coordination()
     Identify_trivial_parents()
     Process_complements()
3. Assign labels to the edges.
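Step 2 of the algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' TectoMT implementation: the input is assumed to be tagged already (step 1), labeling (step 3) is left out, and every rule below is a guess that happens to fit the example sentence used in these slides. The set `MODIFIER_TAGS`, the test on the `mv_` prefix, and all function bodies are hypothetical; only the four step names come from the slide.

```python
MODIFIER_TAGS = {"adj", "det", "nnpc", "nnc"}  # ASSUMED: tags that modify the next head

def identify_main_predicate(tags, heads):
    """Attach the last finite main verb (tag 'mv_...') to the root (0)."""
    for i in range(len(tags) - 1, -1, -1):
        if tags[i].startswith("mv_"):
            heads[i] = 0
            return i
    raise ValueError("no main predicate found")

def resolve_coordination(tags, heads):
    """Make each conjunction ('cconj') the parent of its two neighbours."""
    for i, tag in enumerate(tags):
        if tag == "cconj":
            for j in (i - 1, i + 1):
                if 0 <= j < len(tags) and heads[j] is None:
                    heads[j] = i + 1

def identify_trivial_parents(tags, heads):
    """Head-final rule: a modifier attaches to the next non-modifier word."""
    for i, tag in enumerate(tags):
        if tag in MODIFIER_TAGS and heads[i] is None:
            for j in range(i + 1, len(tags)):
                if tags[j] not in MODIFIER_TAGS:
                    heads[i] = j + 1
                    break

def process_complements(tags, heads, pred):
    """Attach what is left: punctuation to the root, the rest to the predicate."""
    for i, tag in enumerate(tags):
        if heads[i] is None:
            heads[i] = 0 if tag in {".", ","} else pred + 1

def parse(tagged):
    """tagged: list of (form, tag). Returns 1-based head indices, 0 = root."""
    tags = [tag for _, tag in tagged]
    heads = [None] * len(tags)
    pred = identify_main_predicate(tags, heads)
    resolve_coordination(tags, heads)
    identify_trivial_parents(tags, heads)
    process_complements(tags, heads, pred)
    return heads

example = [("katawTa", "adj"), ("aTimuka", "nnpc"), ("Atciyil", "loc_3n_nn"),
           ("latcumi", "nnpc"), ("pirAnEsh", "nnp"), ("Talaimaic", "nnc"),
           ("ceyalALarAka", "3h_nn_adv"), ("iruwTAr", "mv_pst_3sh_f"), (".", ".")]
heads = parse(example)
```

On the example sentence the sketch attaches katawTa and aTimuka to Atciyil, latcumi to pirAnEsh, Talaimaic to ceyalALarAka, and the three phrasal heads to the predicate iruwTAr. Keeping the four steps as separate functions that fill in only still-empty head slots means earlier, more specific rules are never overwritten by later, more general ones.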
Parsing Example
How to parse the following Tamil sentence using the RB parser?

Tamil sentence      katawTa  aTimuka  Atciyil      latcumi  pirAnEsh  Talaimaic  ceyalALarAka    iruwTAr       .
Morphological tag   adj      nnpc     loc_3n_nn    nnpc     nnp       nnc        3h_nn_adv       mv_pst_3sh_f  .
English gloss       last     ADMK     in_the_rule  Lakshmi  Pranesh   chief      as_secretariat  was           .

English translation: Lakshmi Pranesh was the chief secretariat in the last ADMK rule

Figure: A Tamil sentence
Parsing Example
Figure: Initial flat tree (a Root node above the words of the sentence, each shown with its morphological tag and gloss)
Parsing Example: Identify_main_predicate()
Figure: Identify predicate of the sentence
Parsing Example: Identify_trivial_parents()
Figure: Attach modifiers to phrasal heads (shown in three steps, with Atciyil, then pirAnEsh, then ceyalALarAka drawn as phrasal heads)
Parsing Example: Process_complements()
Figure: Process complements, attach arguments to clausal predicates (iruwTAr drawn under Root, above Atciyil, pirAnEsh and ceyalALarAka)
Parsing Example: Labeling the tree
Figure: Labeling of the dependency tree (labels shown: Pred on iruwTAr, NR on Atciyil, Subject on pirAnEsh, Atr on the remaining five words)
Experiments and Results: Data
Corpus 1 is morphologically and syntactically annotated.
Corpus 2 is only morphologically tagged.

                      Corpus 1         Corpus 2
Tagset size           296              459
Lexical verb tags     120              194
Auxiliary verb tags   31               44
# of words            2961             8421
Unique tokens         1634             3747
1 tag count           1534 (93.88%)    3427 (91.46%)
2 tag count           92 (5.63%)       284 (7.58%)
3 tag count           8 (0.49%)        33 (0.88%)
4 tag count           0 (0.00%)        3 (0.08%)

Table: Corpus statistics
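The percentages in the last four rows appear to be the share of unique tokens that occur with one, two, three or four different tags, since the four counts in each column add up to the number of unique tokens. A short Python check that uses only the counts from the table reproduces them. The function name `tag_ambiguity` is hypothetical.

```python
def tag_ambiguity(counts):
    """counts[k-1] = number of unique tokens seen with exactly k tags.
    Returns (total unique tokens, percentage for each k)."""
    total = sum(counts)
    return total, [round(100.0 * c / total, 2) for c in counts]

corpus1_total, corpus1_pct = tag_ambiguity([1534, 92, 8, 0])
corpus2_total, corpus2_pct = tag_ambiguity([3427, 284, 33, 3])
```

In both corpora more than 90% of the unique tokens are unambiguous with respect to their tag.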
Experiments and Results: Accuracy of RB & CB Parsers
Rule based parser is tested against the whole 2961 tokens.
Manual POS experiment uses gold standard tagged data.
In the Auto POS experiment, tagging was done by the TnT tagger.
MST and Malt parsers are trained on 2008 word tokens and tested against 953 tokens.

              Unlabeled (%)   Labeled (%)
Auto POS      71.94           61.70
Manual POS    84.73           79.13

Table: Rule Based parser accuracy

              Unlabeled (%)   Labeled (%)
MaltParser    75.03           65.69
MST Parser    74.92           65.69

Table: Corpus Based parser accuracy
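The "Unlabeled" and "Labeled" columns are presumably the usual attachment scores: the percentage of tokens that receive the correct head, and the percentage that receive both the correct head and the correct label. The slides do not spell the definition out, so the following Python sketch assumes this standard reading. The function name and the toy data are hypothetical and are not taken from the treebank.

```python
def attachment_scores(gold, pred):
    """gold, pred: equal-length lists of (head, label) pairs, one per token.
    Returns (unlabeled, labeled) accuracy in percent."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n

# Toy data: 4 tokens, one wrong head (token 3) and one wrong label (token 2).
gold = [(2, "Atr"), (4, "Sb"), (4, "Atr"), (0, "Pred")]
pred = [(2, "Atr"), (4, "Atr"), (2, "Atr"), (0, "Pred")]
uas, las = attachment_scores(gold, pred)
```

The labeled score can never exceed the unlabeled one, which matches every row of the two tables. The train and test portions used for the corpus based parsers add up to the full corpus: 2008 + 953 = 2961 tokens.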
Conclusion & Future work
Certain dependency relations are easy to tackle using the RB parser.
Accurate tagging is required for the RB approach.
Prediction of relations such as Subject and Coordination is difficult in both RB and CB parsers.
More data will be annotated.