Benchmarking Machine Translated Sentiment Analysis for Arabic Tweets

Eshrag Refaee and Dr. Verena Rieser ([email protected], [email protected])
Interaction Lab, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom

Introduction

This study measures the impact of automatically translated data on the accuracy of sentiment analysis (SA) for Arabic tweets.

Challenges of Arabic SA in social networks:
• Users on social networks tend to use Dialectal Arabic (DA), while most existing resources and NLP tools were originally designed for Modern Standard Arabic (MSA).
• DA poses additional challenges for NLP researchers: it lacks standardization, is written as free text, and shows significant variation from MSA.
• There is much interest in analysing Arabic social-media content such as tweets (Abdul-Mageed et al., 2012; Mourad and Darwish, 2013), yet few open resources, i.e. freely available annotated data sets, exist so far (Refaee and Rieser, 2014).

Research Questions

1. How does off-the-shelf Machine Translation (MT) of Arabic social-media data influence SA performance?
2. Can MT-based approaches be a viable alternative for improving sentiment classification performance on Arabic tweets?
3. Given the linguistic resources currently available for Arabic and its dialects, is it more effective to adapt an MT-based approach than to build a new system from scratch?

Approach

1. Start from a test set of Arabic tweets (937 manually labelled tweets).
2. Translate the tweets into English with a publicly available MT system (Google Translate and Microsoft Translator).
3. Assign sentiment labels using a state-of-the-art sentiment classifier (Stanford Sentiment Classifier).

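The translate-then-classify approach can be sketched as a small pipeline. Both components are stubbed out below: `translate_ar_en` stands in for an off-the-shelf MT service and `classify_sentiment` for an English sentiment classifier. The function names, the demo translation table, and the toy lexicon are illustrative assumptions for this sketch, not the actual Google/Microsoft or Stanford APIs.

```python
# Minimal sketch of the translate-then-classify pipeline (illustrative
# stubs, not the real MT or Stanford Sentiment Classifier APIs).

def translate_ar_en(tweet: str) -> str:
    """Stand-in for an off-the-shelf Arabic-to-English MT system."""
    # A real implementation would call the MT service here; this demo
    # table covers a single (correctly spelled) example from the poster.
    demo = {"فرحه محمد بالهدف": "Muhammad's happiness with scoring a goal."}
    return demo.get(tweet, tweet)

def classify_sentiment(text: str) -> str:
    """Stand-in for a state-of-the-art English sentiment classifier."""
    # Toy lexicon lookup, purely for illustration.
    positive = {"happiness", "elegant", "love", "safety"}
    negative = {"backwardness", "trample", "lag"}
    tokens = {tok.strip(".,!?'\"").lower() for tok in text.split()}
    if tokens & positive:
        return "positive"
    if tokens & negative:
        return "negative"
    return "neutral"

def mt_based_sa(arabic_tweets):
    """Translate each Arabic tweet into English, then assign a label."""
    return [classify_sentiment(translate_ar_en(t)) for t in arabic_tweets]
```

Note that when MT transliterates rather than translates a dialectal or misspelled sentiment-bearing word, the English classifier never sees a sentiment cue — exactly the failure mode the error analysis below describes.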
Results

Accuracy of the assigned sentiment labels was evaluated on the manually labelled test set (positive vs. negative classification):

| Approach | F-measure | Accuracy |
| --- | --- | --- |
| MT-based: Google Trans. + Deep Learning | 49.16 | 71.28 |
| MT-based: Microsoft Trans. + Deep Learning | 51.46 | 76.34 |
| Distant supervision: Lexicon-based | 47.51 | 76.72 |
| Distant supervision: Emoticon-based | 53.9 | 57.06 |
| Fully-supervised learning | 41.6 | 49.65 |

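The F-measure and accuracy columns above are standard two-way classification metrics; the sketch below shows one common way to compute them (macro-averaged F over the positive and negative classes, on the 0-100 scale used in the table). This is generic metric code under that assumption, not the authors' evaluation script.

```python
# Accuracy and macro-averaged F-measure for positive-vs-negative
# classification, reported on a 0-100 scale as in the results table.

def accuracy(gold, pred):
    """Percentage of predictions that match the gold labels."""
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f_measure(gold, pred, label):
    """Harmonic mean of precision and recall for one class, on 0-100."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    n_pred = sum(p == label for p in pred)
    n_gold = sum(g == label for g in gold)
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)

def macro_f(gold, pred, labels=("positive", "negative")):
    """Unweighted mean of per-class F-measures."""
    return sum(f_measure(gold, pred, lab) for lab in labels) / len(labels)
```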
Examples of Misclassified Tweets

| # | Arabic tweet | Human Translation | Auto Translation | Manual Label | Auto Label |
| --- | --- | --- | --- | --- | --- |
| 1 | ولي عهد بريطانيا طالع كشخه في الزي السعودي | Crown Prince of Britain looks elegant in the Saudi attire. | Crown Prince of Britain climber Kchkh in Saudi outfit | Positive | Negative |
| 2 | يدوسون صورة بشار الاسد بالنعال | Bashar Al-Assad's picture being stepped on with slippers. | Trample image Bashar al-Assad with shoes | Negative | Neutral |
| 3 | فرححه محمد بالهدف | Muhammad's happiness with scoring a goal. | Farahhh Muhammad goal | Positive | Negative |
| 4 | ياهلل امطر اهل سوريا بالامن والرزق | Oh God, shower people of Syria with safety and livelihood. | Oh God rained folks Syria security and livelihood | Positive | Negative |
| 5 | وعشان انكم معايا انا امتليت حياه, امتليت حب | Because you are with me, I'm full of life and love. | And Ashan you having I Amtlat Amtlat love life | Positive | Negative |
| 6 | الاحتفال بالفالنتاين هو من صور التخلف في بلدي | The way of celebrating Valentine is an image of lag/falling behind in my country. | The celebration of Valentine's Day is one of the images in my backwardness | Negative | Positive |

Error Analysis

• Incorrectly translated sentiment-bearing dialectal words (Example 1).
• Misspelled and hence incorrectly translated sentiment-bearing word (Example 3).
• Potential cultural bias (Example 4): "Oh God" can carry a negative connotation in English, while in Arabic it is a context-dependent phrase.
• Incorrectly translated surrounding dialectal words (Example 5).

Conclusions

• This work is the first to investigate and empirically evaluate the performance of Machine Translation (MT) based Sentiment Analysis (SA) for Arabic tweets.
• MT approaches reach comparable performance to, or significantly outperform, more resource-intensive standard approaches: accuracy 76.34 vs. 76.72 (distant learning) and 76.34 vs. 49.65 (supervised learning).
• With the linguistic resources currently available for Arabic and its dialects, using off-the-shelf tools to perform SA is an effective and efficient alternative to building SA classifiers from scratch.
• The test set has been made available as part of this work.

References

1. Abbasi, A., Hassan, A., & Dhar, M. (2014). Benchmarking Twitter sentiment analysis tools. In Proceedings of LREC'14, Reykjavik, Iceland, 26-31 May 2014. ELRA.
2. Abdul-Mageed, M., Kübler, S., & Diab, M. (2012). SAMAR: A system for subjectivity and sentiment analysis of Arabic social media. pp. 19-28. ACL.
3. Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of NAACL-HLT, pp. 359-369.
4. Mourad, A., & Darwish, K. (2013). Subjectivity and sentiment analysis of Modern Standard Arabic and Arabic microblogs. WASSA 2013, 55.
5. Refaee, E., & Rieser, V. (2014). An Arabic Twitter corpus for subjectivity and sentiment analysis. In Proceedings of LREC'14, Reykjavik, Iceland, 26-31 May 2014.
6. Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pp. 1631-1642.