CUNI in WMT15: Chimera Strikes Again
Ondřej Bojar, Aleš Tamchyna
{bojar,tamchyna}@ufal.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

[Figure: system overview diagram; recoverable labels: Input, TectoMT, Moses.]

TectoMT
- hybrid (rule-based/statistical) MT system
- transfer at a deep syntactic layer (t-layer)
- our combination: get an extra phrase table for Moses from TectoMT output

Moses
- phrase-based SMT
- large-scale data
- morphological tags as factors for better grammatical coherence
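The extra phrase table mentioned in the TectoMT list can be pictured with a small sketch. The poster does not say how the table is built, so everything below is an assumption made for illustration: the function name `extract_phrases`, the hand-written word alignment, and the simplified extraction rule (tight target spans only, no expansion over unaligned words). In practice this step is normally left to standard SMT training tools.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (source_index, target_index) links.
    Simplification (assumption): the target side is the tight span of the
    linked words; unaligned words are never used to extend a phrase.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word in the target span links outside the source span.
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy input: an English fragment and a hypothetical TectoMT output for it.
src = "my performance".split()
tgt = "můj výkon".split()
alignment = {(0, 0), (1, 1)}  # hand-written alignment, illustrative only

table = extract_phrases(src, tgt, alignment)
```

Because the pairs are extracted from translations of the very sentences Moses will decode, long phrases from this table are guaranteed to match the input, which is one way to read the "longer phrases" point made elsewhere on the poster.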

Poor Man's Combination
[Figure: combination diagram. Recoverable labels: Parallel training data, Dev set (En), Test set (En), TectoMT, Synthetic ttable, Baseline ttable, Moses, Depfix, CH0, CH1.]

Depfix
- rule-based error correction in (S)MT output
- output parse corrected based on source

Constrained vs. Unconstrained
- monolingual: 44.3 vs. 392.3 million sentences
- parallel: 15.5 vs. 52.6 million sentence pairs

              CH0     CH1     Delta
Constrained   21.28   23.37   2.09
Full          22.59   24.24   1.65
Delta          1.31    0.87

Delta column (CH1 - CH0): gains from adding TectoMT
Delta row (Full - Constrained): gains from extra data
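The Delta entries in the Constrained/Full table are plain differences, and a few lines of arithmetic reproduce them. The dictionary layout and variable names below are illustrative assumptions; only the four scores come from the poster.

```python
# Scores copied from the Constrained/Full table (CH0 vs. CH1).
scores = {
    "Constrained": {"CH0": 21.28, "CH1": 23.37},
    "Full": {"CH0": 22.59, "CH1": 24.24},
}

# Delta column: CH1 minus CH0 within each data setting (adding TectoMT).
gain_tectomt = {
    setting: round(row["CH1"] - row["CH0"], 2) for setting, row in scores.items()
}

# Delta row: Full minus Constrained for each system (adding extra data).
gain_data = {
    system: round(scores["Full"][system] - scores["Constrained"][system], 2)
    for system in ("CH0", "CH1")
}

print(gain_tectomt)
print(gain_data)
```

Both gains shrink once the other resource is present: TectoMT adds less in the Full setting than in the Constrained one, and extra data adds less to CH1 than to CH0.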

Poor Man's Combination (continued)
- not restricted to tokens in 1-best outputs; can use alternative translations to better "glue" system outputs together

TectoMT Overview
[Figure: an example English sentence and its Czech translation shown as dependency trees at each TectoMT stage (Analysis, Transfer, Synthesis), with t-layer node labels and morphological tags.]
- novel word forms (unseen in the parallel data)
- grammatical coherence, clause structure

Why TectoMT Helps
- TectoMT phrase table matches the test set ⇒ Moses can apply longer phrases
- better grammatical coherence
- search is simplified
- TectoMT provides many novel translations
- reduction of modelling errors

Presented at WMT 2015, Lisbon, Portugal.

Language Models
big
- 4-gram LM on word forms
- use all available data
long
- 7-gram LM on word forms
- mainly WMT monolingual data, individual years interpolated
morph
- 10-gram LM on morphological tags
longmorph
- 15-gram LM on tags
- goal: capture sentential patterns

LMs                    BLEU
long                   21.32
long morph longmorph   22.00
big                    22.00
long morph             22.01
long longmorph         22.14
big morph              22.21
big long               22.26
big morph longmorph    22.28
big longmorph          22.29
big long morph         22.48
big long longmorph     22.69
all                    22.59
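The morph and longmorph models are n-gram LMs over morphological tags rather than word forms. A minimal sketch of that idea follows, assuming unsmoothed maximum-likelihood estimates and a two-sentence toy corpus of tag strings. The function names and the data are hypothetical; a real 10-gram or 15-gram model would be trained with an LM toolkit and smoothing.

```python
from collections import Counter


def ngrams(tags, n):
    """Yield n-grams over a tag sequence padded with boundary symbols."""
    padded = ["<s>"] * (n - 1) + list(tags) + ["</s>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


def train_tag_lm(tagged_sentences, n):
    """Count n-grams and their (n-1)-gram contexts; no smoothing (toy only)."""
    ngram_counts, context_counts = Counter(), Counter()
    for tags in tagged_sentences:
        for gram in ngrams(tags, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def prob(gram, ngram_counts, context_counts):
    """Maximum-likelihood P(last tag | preceding tags); 0.0 for unseen contexts."""
    seen = context_counts[gram[:-1]]
    return ngram_counts[gram] / seen if seen else 0.0


# Toy tag sequences (illustrative only, not real training data).
corpus = [
    ["NNFS1", "VB-S3PA", "PSIS4", "NNIS4"],
    ["NNFS1", "VB-S3PA", "NNIS4"],
]
counts, contexts = train_tag_lm(corpus, n=3)
p = prob(("NNFS1", "VB-S3PA", "PSIS4"), counts, contexts)
```

Since the tag vocabulary is far smaller than the word vocabulary, much longer histories remain estimable, which fits the stated goal of capturing sentential patterns.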

WMT Results

System             BLEU   TER     Manual
CH2                18.8   0.715   0.686
CH1                18.7   0.717   –
JHU-SMT            18.2   0.725   0.503
CH0                17.6   0.730   –
GOOGLE TRANSLATE   16.4   0.750   0.515
CU-TECTOMT         13.4   0.763   0.209

Chimera placed first among English→Czech MT systems in WMT for three years in a row.

This research was supported by the grants H2020-ICT-2014-1-644402 (HimL), H2020-ICT-2014-1-644753 (KConnect), and SVV 260224. This work has been using language resources developed, stored and distributed by the LINDAT-CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).