Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
Aobo Wang & Min-Yen Kan
10/07/2013
Web IR / NLP Group
Introduction
• Informal words in tweets
  – e.g. “The song is koo, doesnt really showcase anyones talent though.”
    • koo → cool
    • doesnt → doesn’t
    • anyones → anyone’s
  – Resolving them benefits downstream applications
  – Informal word recognition (IWR) in Chinese is more difficult, since Chinese text has no explicit word boundaries
Introduction
• Informal words in Chinese Weibo
  – Sent 1: 排n久只买到了站票 (Only standing tickets are available after queuing for a long time)
  – Seg (wrong):   排 | n | 久 | 只 | 买到 | 了 | 站票  ✗
  – Seg (correct): 排 | n久 | 只 | 买到 | 了 | 站票  ✓  (n久 → 很久, for a long time)
Outline • Introduction • Motivation – CWS and IWR are interactively related. – Joint inference on CWS and IWR
• Methodology – Joint inference method – Related work & baseline systems
• Experiment Result • Discussion • Conclusion
Methodology
• Problem definition
  – Input: a sentence from microtext, S = x1 x2 … xn
  – Output: two layers of labels
    • Z = z1 z2 … zn, with zi ∈ {F, IF} (formal / informal character)
    • Y = y1 y2 … yn, with yi ∈ {B, I, E, S} (word-boundary tag)
• Joint inference model
  – Two-layer Factorial Conditional Random Fields (FCRFs)
[Figure: two-layer FCRF structure — an IWR label chain z1 … zn and a CWS label chain y1 … yn, jointly conditioned on the observed character sequence x1 … xn]
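To make the two-layer output concrete, here is a minimal Python sketch (not from the paper; the sentence follows the Weibo example in the introduction, and the exact tag assignment is an assumption) showing how each character carries both an IWR label and a CWS label:

```python
# Hypothetical illustration of the two-layer labeling scheme:
# each character x_i gets an IWR label z_i in {F, IF} and a CWS label y_i in {B, I, E, S}.
sentence = "排n久只买到了站票"   # "Only standing tickets are available after queuing for a long time"

# "n久" is the informal form of 很久 (for a long time), so its two characters
# are tagged IF; every other character is formal (F).
z_labels = ["F", "IF", "IF", "F", "F", "F", "F", "F", "F"]

# Segmentation 排 | n久 | 只 | 买到 | 了 | 站票, encoded with B/I/E/S boundary tags.
y_labels = ["S", "B", "E", "S", "B", "E", "S", "B", "E"]

assert len(sentence) == len(z_labels) == len(y_labels)
for x, z, y in zip(sentence, z_labels, y_labels):
    print(x, z, y)
```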
Joint Inference Model • Graphical illustration
– IWR can guide CWS to correctly segment – CWS helps decide the boundaries of informal words
Feature Extraction (“*” indicates a novel feature set)
• Lexical features
  – Character n-grams
  – Character lexicon: C−1C1
  – Whether Ck and Ck+1 are identical
• Dictionary-based features
  – Pinyin list*, stop word list*, informal word list*
• Statistical features*
  – Pointwise mutual information (PMI) over character-based bigrams, estimated in two domains: news and microblog (see the sketch below)
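As a rough illustration of the starred statistical feature, the following sketch computes PMI scores for adjacent character bigrams from a raw corpus; the counting scheme and lack of smoothing are assumptions, and in the paper the scores come from two separate corpora (news and microblog):

```python
import math
from collections import Counter

def char_bigram_pmi(corpus):
    """PMI of adjacent character bigrams over a list of sentences
    (a sketch; thresholds, smoothing, and feature binning are not specified here)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = list(sent)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (c1, c2): math.log((n / n_bi) / ((unigrams[c1] / n_uni) * (unigrams[c2] / n_uni)))
        for (c1, c2), n in bigrams.items()
    }

# Scores would be computed separately on a news corpus and a microblog corpus,
# then attached as features to each character bigram in the input sentence.
pmi_weibo = char_bigram_pmi(["排n久只买到了站票", "很久没见了"])
```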
Related Work & Baseline Systems
• Informal word recognition
  – Xia et al. (2005, 2006, 2008)
    • Recognize Chinese informal sentences in BBS chats
    • Pattern matching and SVM
  – Li and Yarowsky (2008)
    • Manually tuned queries to search the web for definition sentences
    • Bootstrap informal/formal word pairs
  – Other baselines
    • LCRFcws + LCRFiwr
    • Decision Tree
Related Work & Baseline Systems
• Chinese word segmentation
  – Baselines
    • LCRF (Sun and Xu, 2011)
    • ICTCLAS (HHMM-based segmenter)
    • LCRFiwr + LCRFcws
Outline
• Introduction
• Motivation
• Methodology
• Experiment Result
  – Baseline systems
  – Upper bound systems
• Discussion
• Conclusion
Experiment
• Data Set
  – Data crawled from Sina Weibo
  – Annotations crowdsourced via Zhubajie
  – 4,000 posts for training; 1,500 posts for testing
    • 79.8% of the informal words annotated in the test set do not appear in the training set
Baseline Systems versus the FCRF Joint Inference Model

  CWS                          IWR
  ICTCLAS segmenter (2011)     Decision Tree-based recognizer
  Sun and Xu (2011) (LCRF)     Xia et al. (2006) (SVM)
  LCRFiwr + LCRFcws            LCRFcws + LCRFiwr

Upper Bound Systems

  CWS                          IWR
  LCRFiwr + LCRFcws – UB       LCRFcws + LCRFiwr – UB
  FCRF – UB                    FCRF – UB
Experiment
• FCRF versus baselines on the CWS task

  System               Pre     Rec     F1      OOVR
  ICTCLAS (HHMM)       0.640   0.767   0.698   0.551
  Sun and Xu (LCRF)    0.661   0.691   0.675   0.572
  LCRFiwr + LCRFcws    0.733   0.775   0.753   0.617
  FCRF                 0.757   0.790   0.773   0.673
  – Demonstrates the importance of considering IWR in CWS
  – Joint inference outperforms modeling the two tasks sequentially
Experiment
• Upper-bound performance on the CWS task
  – There is room for improvement given better IWR labels
  – Better IWR labels lead to better CWS performance
Experiment
• FCRF versus baselines on the IWR task

  System               Pre     Rec     F1
  SVM                  0.382   0.621   0.473
  DT                   0.402   0.714   0.514
  LCRFcws + LCRFiwr    0.858   0.584   0.695
  FCRF                 0.877   0.653   0.748
  – FCRF achieves the best performance
  – The IWR task is improved when modeled together with CWS
  – SVM and DT over-predict informal words
  – Joint inference is more efficient than sequential models
Experiment
• Upper-bound performance on the IWR task
  – Better CWS labels lead to more informal words being predicted
Experiment
• Feature evaluation
  – FCRF-new refers to the system trained without the “starred” (novel) feature sets

  System      CWS (F1)   IWR (F1)
  FCRF-new    0.690      0.552
  FCRF        0.773      0.748
• SVM Joint Classification (SVM-JC) vs. FCRF
  – Cross-product of the two layers' labels used as gold labels (see the sketch below)
    • 2 {F, IF} × 4 {B, I, E, S} = 8 labels

  System    CWS (F1)   IWR (F1)
  SVM       ──         0.473
  SVM-JC    0.711      0.624
  FCRF      0.773      0.748
  – The over-predicting problem is largely solved
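A minimal sketch of the SVM-JC label encoding described above (the label-string format is an assumption, not the paper's exact implementation): each character receives a single label drawn from the cross-product of the IWR and CWS tag sets, so one classifier predicts both layers at once.

```python
from itertools import product

IWR_TAGS = ("F", "IF")           # formal / informal
CWS_TAGS = ("B", "I", "E", "S")  # word-boundary tags

# 2 x 4 = 8 joint labels, e.g. "IF-E" for the last character of an informal word.
JOINT_LABELS = [f"{z}-{y}" for z, y in product(IWR_TAGS, CWS_TAGS)]

def encode(z, y):
    """Combine per-character IWR and CWS tags into one SVM-JC label."""
    return f"{z}-{y}"

def decode(joint):
    """Recover the IWR and CWS tags from a joint label."""
    z, y = joint.split("-")
    return z, y

print(JOINT_LABELS)   # ['F-B', 'F-I', 'F-E', 'F-S', 'IF-B', 'IF-I', 'IF-E', 'IF-S']
```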
Outline
• Introduction
• Motivation
• Methodology
• Experiment Result
• Discussion
  – Error Analysis
• Conclusion
Discussion
• The partially-seen informal word phenomenon
  – 狠 (hen) → 很 (very) is known
  – 狠久 (hen jiu) → 很久 (for a long time) is not recognized correctly
Conclusion • Informal word recognition is correlated with the Chinese word segmentation problem • The joint inference solution (FCRF) outperforms – individual solutions – sequential linear CRF models
• It helps build an informal word lexicon automatically
Thank You Aobo Wang
Feature Extraction
• Dictionary-based features
  – If Ck (i−4