Benchmarking Machine Translated Sentiment Analysis for Arabic Tweets Eshrag Refaee and Dr. Verena Rieser
[email protected] [email protected] Interaction Lab, School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom
Introduction
This study aims to measure the impact of automatically translated data on the accuracy of sentiment analysis (SA) of Arabic tweets.

Challenges of Arabic Sentiment Analysis in Social Networks:
• Users on social networks tend to use Dialectal Arabic (DA), while most existing resources and NLP tools were originally designed for Modern Standard Arabic (MSA).
• DA poses additional challenges for NLP researchers: it lacks standardisation, is written in free text, and shows significant variation from MSA.
• There is much interest in analysing Arabic social-media content such as tweets (Abdul-Mageed et al., 2012; Mourad and Darwish, 2013), yet few open resources, e.g. freely available annotated data sets, exist (Refaee and Rieser, 2014).

Research Questions
1. How does off-the-shelf Machine Translation (MT) of Arabic social data influence SA performance?
2. Can MT-based approaches be a viable alternative to improve sentiment classification performance on Arabic tweets?
3. Given the linguistic resources currently available for Arabic and its dialects, is it more effective to adapt an MT-based approach instead of building a new system from scratch?

Approach
1. Start from a set of Arabic tweets (937 manually labelled tweets).
2. Translate the tweets into English with a publicly available MT system (Google Trans. and Microsoft Trans.).
3. Assign sentiment labels using a state-of-the-art sentiment classifier (Stanford Sentiment Classifier).
4. Evaluate the accuracy of the assigned sentiment labels.

Results (Positive vs. Negative)

Approach                                      F-measure   Accuracy
MT-based: Google Trans. + Deep Learning         49.16       71.28
MT-based: Microsoft Trans. + Deep Learning      51.46       76.34
Distant supervision: Lexicon-based              47.51       76.72
Distant supervision: Emoticon-based             53.9        57.06
Fully-supervised learning                       41.6        49.65
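The evaluation step compares the auto-assigned labels against the manual gold labels using accuracy and per-class F-measure, the two metrics reported in the results table. A minimal sketch of that computation in pure Python; the function names and the toy labels are illustrative only, not the paper's data or code:

```python
from typing import List

def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tweets whose auto label matches the manual label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f_measure(gold: List[str], pred: List[str], cls: str = "positive") -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 4 labels agree, one positive tweet is missed.
gold = ["positive", "negative", "positive", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(accuracy(gold, pred))             # 0.75
print(round(f_measure(gold, pred), 2))  # 0.67
```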
Examples of Misclassified Tweets

Example 1
Tweet: ولي عهد بريطانيا طالع كشخه في الزي السعودي
Human translation: Crown Prince of Britain looks elegant in the Saudi attire.
Auto translation: Crown Prince of Britain climber Kchkh in Saudi outfit
Manual label: Positive; Auto label: Negative

Example 2
Tweet: يدوسون صورة بشار الاسد بالنعال
Human translation: Bashar Al-Assad's picture being stepped on with slippers.
Auto translation: Trample image Bashar al-Assad with shoes
Manual label: Negative; Auto label: Neutral

Example 3
Tweet: فرححه محمد بالهدف
Human translation: Muhammad's happiness with scoring a goal.
Auto translation: Farahhh Muhammad goal
Manual label: Positive; Auto label: Negative

Example 4
Tweet: ياهلل امطر اهل سوريا بالامن والرزق
Human translation: Oh God, shower people of Syria with safety and livelihood.
Auto translation: Oh God rained folks Syria security and livelihood
Manual label: Positive; Auto label: Negative

Example 5
Tweet: امتليت حب,وعشان انكم معايا انا امتليت حياه
Human translation: Because you are with me, I'm full of life and love.
Auto translation: And Ashan you having I Amtlat Amtlat love life
Manual label: Positive; Auto label: Negative

Example 6
Tweet: الاحتفال بالفالنتاين هو من صور التخلف في بلدي
Human translation: The way of celebrating Valentine is an image of lag/falling behind in my country.
Auto translation: The celebration of Valentine's Day is one of the images in my backwardness
Manual label: Negative; Auto label: Positive
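The manual vs. auto labels of the six examples above can be summarised as a small confusion matrix. A stdlib-only sketch, with the label lists transcribed directly from the examples table:

```python
from collections import Counter

# Manual (gold) and auto labels of the six example tweets, in table order.
gold = ["positive", "negative", "positive", "positive", "positive", "negative"]
auto = ["negative", "neutral", "negative", "negative", "negative", "positive"]

# Count (manual, auto) pairs to form a confusion matrix; every pair here
# disagrees, since all six tweets are misclassified.
confusion = Counter(zip(gold, auto))
for (g, a), n in sorted(confusion.items()):
    print(f"manual={g:8s} auto={a:8s} count={n}")
```

The dominant cell is manual-positive predicted as negative (four of six examples), which matches the error analysis below: mistranslated dialectal words tend to strip the positive sentiment signal before the English classifier sees it.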
Error Analysis
• Incorrectly translated sentiment-bearing dialectal words (Example 1).
• Misspelled, and hence incorrectly translated, sentiment-bearing word (Example 3).
• Potential cultural bias (Example 4), wherein "Oh God" can have a negative connotation in English while being seen as a context-dependent phrase in Arabic.
• Incorrectly translated surrounding dialectal words (Example 5).

Conclusions
• This work is the first to investigate and empirically evaluate the performance of Machine Translation (MT) based Sentiment Analysis (SA) for Arabic tweets.
• MT-based approaches reach comparable performance to, or significantly outperform, more resource-intensive standard approaches: accuracy 76.34 vs. 76.72 (distant supervision) and 76.34 vs. 49.65 (fully-supervised learning).
• With the linguistic resources currently available for Arabic and its dialects, using off-the-shelf tools to perform SA is an effective and efficient alternative to building SA classifiers from scratch.
• The test set has been made available as part of this work.

References
1. Abbasi, A., Hassan, A., & Dhar, M. (2014). Benchmarking Twitter sentiment analysis tools. In Proceedings of LREC'14, Reykjavik, Iceland, 26-31 May 2014. ELRA.
2. Abdul-Mageed, M., Kübler, S., & Diab, M. (2012). SAMAR: A system for subjectivity and sentiment analysis of Arabic social media (pp. 19-28). ACL.
3. Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of NAACL-HLT (pp. 359-369).
4. Mourad, A., & Darwish, K. (2013). Subjectivity and sentiment analysis of Modern Standard Arabic and Arabic microblogs. WASSA 2013, 55.
5. Refaee, E., & Rieser, V. (2014). An Arabic Twitter corpus for subjectivity and sentiment analysis. In Proceedings of LREC'14, Reykjavik, Iceland, 26-31 May 2014.
6. Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP (pp. 1631-1642).