Web IR / NLP Group

Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
Aobo Wang & Min-Yen Kan
10/07/2013


Introduction
•  Informal words in tweets
   –  e.g. "The song is koo, doesnt really showcase anyones talent though."
      -  koo → cool
      -  doesnt → doesn't
      -  anyones → anyone's

   –  Handling informal words benefits downstream applications
   –  Informal word recognition (IWR) in Chinese is more difficult


Introduction
•  Informal words in Chinese Weibo

   Sent 1: 排n久只买到了站票 (Only standing tickets are available after queuing for a long time)
   –  Wrong segmentation:   排 | n | 久 | 只 | 买到 | 了 | 站票   ✗
   –  Correct segmentation: 排 | n久 | 只 | 买到 | 了 | 站票   ✓   (n久 → 很久, "for a long time")

   Sent 2: 起床了孩纸们 (Get up, kids.)   Pattern: | xx们 |   (a toy pattern matcher is sketched below)
   –  Wrong segmentation:   起床 | 了 | 孩 | 纸 | 们   ✗   (孩 "kid" | 纸 "paper")
   –  Correct segmentation: 起床 | 了 | 孩纸们   ✓   (孩纸们 → hai zhi men → hai zi men → 孩子们, "kids")
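
As a rough illustration of the pattern cue above, a suffix pattern such as | xx们 | can flag a multi-character span like 孩纸们 as a single candidate token. The regex and example below are illustrative only, not the authors' implementation:

```python
import re

# Hypothetical pattern cue: any two characters followed by the plural marker 们.
PLURAL_SUFFIX = re.compile(r"..们")

def candidate_spans(sentence):
    """Return substrings matching the xx们 pattern."""
    return PLURAL_SUFFIX.findall(sentence)

print(candidate_spans("起床了孩纸们"))  # ['孩纸们'] -- a candidate informal token
```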


Outline
•  Introduction
•  Motivation
   –  CWS and IWR are interactively related
   –  Joint inference on CWS and IWR
•  Methodology
   –  Joint inference method
   –  Related work & baseline systems
•  Experiment Results
•  Discussion
•  Conclusion


Methodology
•  Problem definition
   –  Input: a sentence from microtext, S = x1 x2 … xn
   –  Output: two-layer labels (see the encoding sketch after the figure below)
      -  IWR layer: Z = z1 z2 … zn, with zi ∈ { F, IF }
      -  CWS layer: Y = y1 y2 … yn, with yi ∈ { B, I, E, S }
•  Joint inference model
   –  Two-layer Factorial Conditional Random Fields (FCRFs)

[Figure: two-layer FCRF factor graph. An IWR label chain z1 … zn and a CWS label chain y1 … yn sit over the input characters x1 … xn, and the two label chains are connected at each position.]
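
To make the two-layer labeling concrete, here is a minimal sketch (my own, not the authors' code) that derives the CWS layer (B/I/E/S per character) and the IWR layer (F/IF per character) from a gold segmentation, using the 排n久 example from the introduction:

```python
def cws_layer(words):
    """CWS labels: B/I/E for multi-character words, S for single characters."""
    labels = []
    for w in words:
        labels += ["S"] if len(w) == 1 else ["B"] + ["I"] * (len(w) - 2) + ["E"]
    return labels

def iwr_layer(words, informal_words):
    """IWR labels: IF for every character of an informal word, F otherwise."""
    return [("IF" if w in informal_words else "F") for w in words for _ in w]

# Example from the introduction: 排 | n久 | 只 | 买到 | 了 | 站票, informal word n久 (→ 很久)
words = ["排", "n久", "只", "买到", "了", "站票"]
print(cws_layer(words))           # ['S', 'B', 'E', 'S', 'B', 'E', 'S', 'B', 'E']
print(iwr_layer(words, {"n久"}))  # ['F', 'IF', 'IF', 'F', 'F', 'F', 'F', 'F', 'F']
```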


Joint Inference Model
•  Graphical illustration
   –  IWR can guide CWS to correctly segment
   –  CWS helps decide the boundaries of informal words


Feature Extraction
•  "*" indicates a novel feature set
   –  Lexical features
      -  Character n-grams
      -  Character lexicon: C−1C1
      -  Whether Ck and Ck+1 are identical
   –  Dictionary-based features
      -  Pinyin list*, stop word list*, informal word list*
   –  Statistical features*
      -  Pointwise mutual information (PMI) of character-based bigrams,
         computed on two domains: news and microblog (see the sketch below)
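
A minimal sketch of the PMI statistic over character bigrams; the corpora here are tiny placeholders, whereas the feature is computed on large news and microblog collections:

```python
import math
from collections import Counter

def char_bigram_pmi(corpus):
    """PMI(c1, c2) = log( p(c1 c2) / (p(c1) p(c2)) ) over adjacent character pairs."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {
        c1 + c2: math.log((n / n_bi) / ((uni[c1] / n_uni) * (uni[c2] / n_uni)))
        for (c1, c2), n in bi.items()
    }

# Placeholder "corpora"; in practice these would be large news and microblog collections.
pmi_microblog = char_bigram_pmi(["起床了孩纸们", "排n久只买到了站票"])
print(pmi_microblog.get("孩纸"))
```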


Related Work & Baseline Systems
•  Informal word recognition
   –  Xia et al. (2005, 2006, 2008)
      -  Recognize Chinese informal sentences
      -  BBS chats
      -  Pattern matching
      -  SVM
   –  Li and Yarowsky (2008)
      -  Manually-tuned queries
      -  Search the web for definition sentences
      -  Bootstrap informal/formal word pairs
   –  Other baselines
      -  LCRFcws + LCRFiwr
      -  Decision tree


Related Work & Baseline Systems
•  Chinese word segmentation
   –  Baselines
      -  LCRF (Sun and Xu, 2011)
      -  ICTCLAS (HHMM-based segmenter)
      -  LCRFiwr + LCRFcws


Outline
•  Introduction
•  Motivation
•  Methodology
•  Experiment Results
   –  Baseline systems
   –  Upper bound systems
•  Discussion
•  Conclusion


Experiment
•  Data set
   –  Crawled data from Sina Weibo
   –  Crowdsourced annotations using Zhubajie
   –  4,000 posts for training
   –  1,500 posts for testing
      -  79.8% of the informal words annotated in the test set are not covered by the training set
•  Modeling tool
   –  Open-source Mallet GRMM package
      -  http://mallet.cs.umass.edu


Experiment
•  Baseline systems versus the FCRF joint inference model

      CWS                          IWR
      ICTCLAS segmenter (2011)     Decision tree-based recognizer
      Sun and Xu (2011) (LCRF)     Xia et al. (2006) (SVM)
      LCRFiwr + LCRFcws            LCRFcws + LCRFiwr

•  Upper bound systems

      CWS                          IWR
      LCRFiwr + LCRFcws – UB       LCRFcws + LCRFiwr – UB
      FCRF – UB                    FCRF – UB


Experiment
•  FCRF versus baselines on the CWS task

      System                 Pre      Rec      F1       OOVR
      ICTCLAS (HHMM)         0.640    0.767    0.698    0.551
      Sun and Xu (LCRF)      0.661    0.691    0.675    0.572
      LCRFiwr + LCRFcws      0.733    0.775    0.753    0.617
      FCRF                   0.757    0.790    0.773    0.673

   –  Shows the importance of considering IWR in CWS
   –  Joint inference is better than modeling the two tasks sequentially
      (a sketch of how these word-level metrics can be computed follows below)
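
For reference, a minimal sketch (assumed, not the authors' evaluation script) of how the word-level Pre/Rec/F1 and the OOV recall (OOVR) above can be computed by comparing predicted and gold word spans:

```python
def to_spans(words):
    """Segmentation (list of words) -> set of (start, end, word) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w), w))
        start += len(w)
    return spans

def cws_scores(gold_words, pred_words, train_vocab):
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = gold & pred
    pre, rec = len(correct) / len(pred), len(correct) / len(gold)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    # OOVR: recall restricted to gold words unseen in the training data
    oov = {s for s in gold if s[2] not in train_vocab}
    oovr = len(oov & correct) / len(oov) if oov else 0.0
    return pre, rec, f1, oovr

gold = ["排", "n久", "只", "买到", "了", "站票"]
pred = ["排", "n", "久", "只", "买到", "了", "站票"]
print(cws_scores(gold, pred, train_vocab={"排", "只", "买到", "了", "站票"}))
```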


Experiment
•  Upper bound performance on the CWS task
   –  Room for improvement with better IWR labels
   –  Better IWR labels lead to better CWS performance


Experiment
•  FCRF versus baselines on the IWR task

      System                 Pre      Rec      F1
      SVM                    0.382    0.621    0.473
      DT                     0.402    0.714    0.514
      LCRFcws + LCRFiwr      0.858    0.584    0.695
      FCRF                   0.877    0.653    0.748

   –  FCRF achieves the best performance
   –  The IWR task is improved by modeling it jointly with CWS
   –  SVM and DT are over-predicting
   –  Joint inference is more efficient than sequential models


Experiment
•  Upper bound performance on the IWR task
   –  More informal words are predicted when better CWS labels are available


Experiment
•  Feature evaluation
   –  FCRF-new refers to the system without using the "starred" (novel) feature sets

      System        CWS (F1)   IWR (F1)
      FCRF-new      0.690      0.552
      FCRF          0.773      0.748

•  SVM joint classification (SVM-JC) vs. FCRF
   –  Cross-product of the two layer label sets as gold labels
      -  2 (F, IF) × 4 (B, I, E, S) = 8 labels (see the sketch below)

      System        CWS (F1)   IWR (F1)
      SVM           ──         0.473
      SVM-JC        0.711      0.624
      FCRF          0.773      0.748

   –  The over-predicting problem is largely solved
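
A minimal sketch of the cross-product label scheme used by the SVM-JC baseline, where each character receives one of the 2 × 4 = 8 combined labels (the naming convention is illustrative):

```python
from itertools import product

IWR_LABELS = ["F", "IF"]           # formal / informal
CWS_LABELS = ["B", "I", "E", "S"]  # begin / inside / end / single

# 2 * 4 = 8 joint labels, e.g. 'F-B', 'IF-E', ...
JOINT_LABELS = [f"{z}-{y}" for z, y in product(IWR_LABELS, CWS_LABELS)]

def to_joint(z_seq, y_seq):
    """Collapse per-character IWR and CWS labels into single cross-product labels."""
    return [f"{z}-{y}" for z, y in zip(z_seq, y_seq)]

print(JOINT_LABELS)                                  # the 8 joint labels
print(to_joint(["F", "IF", "IF"], ["S", "B", "E"]))  # ['F-S', 'IF-B', 'IF-E']
```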


Outline
•  Introduction
•  Motivation
•  Methodology
•  Experiment Results
•  Discussion
   –  Error analysis
•  Conclusion


Discussion
•  Partially seen informal words
   –  狠 → hen → 很 (very) is known
   –  狠久 → hen jiu → 很久 (for a long time) is not recognized correctly
      (a pinyin-based lookup of this kind is sketched below)
•  Extremely short sentences
   –  "肥家! 太累了。。。" (Go home! Exhausted.)
   –  肥家 → fei jia → hui jia → 回家 (go home)
•  Freestyle named entities
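
As an illustration of the pinyin-based mapping behind these examples, here is a minimal sketch assuming the third-party pypinyin package and a toy pinyin-indexed formal lexicon; both are my assumptions, not the paper's resources:

```python
from pypinyin import lazy_pinyin  # third-party package: pip install pypinyin

# Toy formal lexicon indexed by pinyin; a real system would use a large dictionary.
FORMAL_BY_PINYIN = {
    ("hen", "jiu"): ["很久"],
    ("hui", "jia"): ["回家"],
    ("hai", "zi", "men"): ["孩子们"],
}

def formal_candidates(informal):
    """Look up formal words whose pinyin exactly matches the informal string's pinyin."""
    return FORMAL_BY_PINYIN.get(tuple(lazy_pinyin(informal)), [])

print(formal_candidates("狠久"))    # ['很久']  (狠 and 很 share the pinyin "hen")
print(formal_candidates("孩纸们"))  # []  (纸 is "zhi", not "zi": fuzzy matching would be needed)
```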


Conclusion
•  Informal word recognition is correlated with the Chinese word segmentation problem
•  The joint inference solution (FCRF) outperforms
   –  individual solutions
   –  sequential linear CRF models
•  It helps build the informal word lexicon automatically

Thank You


Feature Extraction
•  Dictionary-based features
   –  If Ck (i−4