Joint Tokenization and Translation
Xinyan Xiao (1), Yang Liu (1), Young-Sook Hwang (2), Qun Liu (1), and Shouxun Lin (1)
(1) Institute of Computing Technology
(2) SKTelecom
2010-08-24
Separate Tokenization and Translation
Characters (positions 0-7): 陶 (tao) 菲 (fei) 克 (ke) 有 (you) 望 (wang) 夺 (duo) 分 (fen)
Tokenize → Words: tao-fei-ke | you-wang | duo | fen
Translate → Translation: Taufik | is expected to | gain a point
Challenges
Propagation of tokenization errors
Hard to define optimal granularity (Chang et al., 2008; Zhang et al., 2008)
Inconsistent tokenization between training and testing
Previous Work
Offering more tokenizations with a lattice (Xu et al., 2005; Dyer et al., 2008; Dyer, 2009)
[Lattice over character positions 0-7, with word edges: tao-fei-ke, tao, fei-ke, you-wang, duo-fen, duo, fen]
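A lattice like the one above can be represented as a set of word edges over character boundaries; enumerating the paths from the start boundary to the end boundary recovers every tokenization the lattice encodes. A minimal sketch, with the edge set taken from the slide's example:

```python
# Each edge (i, j, word) covers characters between boundaries i and j.
edges = [
    (0, 3, "tao-fei-ke"), (0, 1, "tao"), (1, 3, "fei-ke"),
    (3, 5, "you-wang"), (5, 7, "duo-fen"), (5, 6, "duo"), (6, 7, "fen"),
]

def paths(start, end, prefix=()):
    """Enumerate every tokenization encoded by the lattice (every start->end path)."""
    if start == end:
        return [list(prefix)]
    out = []
    for i, j, word in edges:
        if i == start:
            out.extend(paths(j, end, prefix + (word,)))
    return out

for tok in paths(0, 7):
    print(" ".join(tok))
```

This lattice encodes four tokenizations, since boundaries 0-3 and 5-7 each admit two readings.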
Joint Tokenization and Translation
Decode directly on the character sequence (positions 0-7): tao fei ke you wang duo fen
Example derivation:
  r1 : tao-fei-ke → Taufik
  r2 : duo fen → gain a point
  r3 : x1 you-wang x2 → x1 is expected to x2
Applying r1 over tao fei ke, r2 over duo fen, and r3 to glue the two around you-wang yields "Taufik is expected to gain a point".
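The derivation above can be sketched in code: each rule pairs a source side (over characters or nonterminals x1, x2) with a target side, and the glue rule r3 substitutes the sub-translations into its slots. The rule table and spans are hard-coded to mirror the slide's example:

```python
# Rules from the example derivation: source side -> target side.
rules = {
    "r1": ("tao-fei-ke", "Taufik"),
    "r2": ("duo fen", "gain a point"),
    "r3": ("x1 you-wang x2", "x1 is expected to x2"),
}

# r1 covers characters 0-3, r2 covers characters 5-7;
# r3 substitutes their translations around you-wang (characters 3-5).
x1 = rules["r1"][1]
x2 = rules["r2"][1]
translation = rules["r3"][1].replace("x1", x1).replace("x2", x2)
print(translation)  # Taufik is expected to gain a point
```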
Log-linear Model
Over characters c, tokenization w, and translation e (Och and Ney, 2002):
  Score(c, w, e) = Σ_k λ_k · h_k(c, w, e)
8 tokenization features
8 translation features (Chiang, 2007)
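The log-linear score is simply a weighted sum of feature values; the decoder searches for the (w, e) pair maximizing it. A minimal sketch, where the feature names and numbers are hypothetical stand-ins for the 8 tokenization and 8 translation features:

```python
def score(features, weights):
    """Weighted sum of feature values; the highest-scoring derivation wins."""
    return sum(weights[k] * v for k, v in features.items())

# Made-up weights and feature values for one candidate derivation.
weights  = {"tok_lm": 0.3, "word_count": -0.1, "trans_lm": 0.5, "rule_count": 0.2}
features = {"tok_lm": -4.2, "word_count": 4.0, "trans_lm": -6.1, "rule_count": 3.0}
print(score(features, weights))
```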
Tokenization Features
Maximum entropy model over label sequences (Xue and Shen, 2003; Ng and Low, 2004):
  P(tao-fei-ke | tao fei ke) = P(b m e | tao fei ke)
N-gram language model over words
Word count
OOV features
Experiment Setup
Bilingual corpus: 1.5M sentence pairs from LDC
4-gram English language model: Xinhua portion of the GIGAWORD corpus
Chinese segmentation corpus: People's Daily (6M words), split into three parts: training, development, and testing
Three Chinese word segmenters: ICTCLAS (ICT), Stanford (SF), Maximum Entropy model (ME)
All: combined corpus processed by the three segmenters to extract rules
Translation Result on MT2005 (BLEU)
            ICT     SF      ME      All
Separate    33.06   33.22   30.91   -
Joint       34.69   34.56   34.17   34.88
Lattice (Dyer et al., 2008): 33.95
Tokenization Result (F-1, %)
            ICT     SF      ME      All
Separate    97.47   97.48   95.53   -
Joint       97.68   97.68   97.70   97.70
Better Tokenization = Better Translation?
Criterion for tuning weights: Max F-1 (on dev-set of People's Daily) vs. Max BLEU (on MT2002)
Criterion for testing: F-1 on test-set of People's Daily; BLEU on MT2005

            F-1     BLEU
Max F-1     97.37   27.43
Max BLEU    92.49   34.88
An Example from MT2002
Character: 作 为 第 二 或 第 三 单 打 的 陶 菲 克 就 有 望 夺 分
Reference: Taufik, as the second or third singles player, will have hopes of scoring
Stanford:  作为 第二 或 第三 单 打 的 陶菲克 就 有望 夺分 → as the second or third hopefully 夺分 陶菲克 fought alone
Lattice:   作为 第二 或 第三 单打 的 陶 菲克 就 有望 夺 分 → as the second or third singles tao 菲克 expected to win
Joint:     作为 第二 或 第三 单打 的 陶菲克 就 有望 夺 分 → as the second or third singles 陶菲克 expected to win points
Baseline errors: tokenization errors; granularity too small or too large
Conclusion
Better tokenization ↔ better translation. Joint tokenization and translation provides an elegant and effective way to:
Optimize tokenization for translation
Improve tokenization with translation information
Future work:
Apply this method to other models
Improve performance for morphologically rich languages

Thanks to Wenbin Jiang, Zhiyang Wang, Zongcheng Ji, and the anonymous reviewers
Considering All Tokenizations
Constructing tokenizations only from the extracted rules limits the tokenization search space.
[Chart over characters tao fei ke (boundaries 0-3): spans 0-1 tao, 1-2 fei, 2-3 ke, 0-2 tao fei, 0-3 tao fei ke, 0-3 tao-fei-ke]
When a span is not tokenized into a single word by the extracted rules, we consider the entire span as an OOV word.
Working with the glue rules (S → S X, S → X), we can construct all potential tokenizations.
[Chart: additional spans 0-2 tao-fei, fei-ke, 0-3 tao-fei ke, 0-3 tao fei-ke]
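Composing single-word spans left to right with the glue rules generates every segmentation of a character span. A minimal sketch of that enumeration (hyphens join characters into one word, as in the slides):

```python
def all_tokenizations(chars):
    """Return every way to segment chars into words, glue-rule style."""
    if not chars:
        return [[]]
    results = []
    for i in range(1, len(chars) + 1):
        word = "-".join(chars[:i])                 # one word (an X) covering chars[0:i]
        for rest in all_tokenizations(chars[i:]):  # glue it onto a tokenized suffix
            results.append([word] + rest)
    return results

toks = all_tokenizations(["tao", "fei", "ke"])
print(len(toks))  # 2^(n-1) = 4 segmentations for 3 characters
```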
We use two types of features to control the generation of OOVs:
OOV character count (OCC): the number of OOV characters
OOV discount features (ODi, i = 1, 2, 3, 4+): penalties for OOV words with different numbers of characters
Example: tokenizing tao fei ke as three OOV words gives OCC = 3, OD1 = 3; tokenizing it as the single OOV word tao-fei-ke gives OCC = 3, OD3 = 1.
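These two feature types can be computed directly from a tokenization and a vocabulary; a sketch that reproduces the slide's two examples (the `vocab` argument is an assumption standing in for the word list induced from the extracted rules):

```python
def oov_features(tokenization, vocab):
    """OCC counts characters inside OOV words; OD_i counts OOV words of i characters."""
    occ = 0
    od = {1: 0, 2: 0, 3: 0, "4+": 0}
    for word in tokenization:
        if word not in vocab:
            n = len(word.split("-"))   # word length in characters
            occ += n
            od[n if n <= 3 else "4+"] += 1
    return occ, od

# Treating every word as OOV, as on the slide:
print(oov_features(["tao", "fei", "ke"], vocab=set()))  # OCC=3, OD1=3
print(oov_features(["tao-fei-ke"], vocab=set()))        # OCC=3, OD3=1
```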