Arabic Sentiment Analysis using Supervised Classification

The 1st International Workshop on Social Networks Analysis, Management and Security (SNAMS - 2014), August 2014, Barcelona, Spain

Rehab M. Duwairi
Department of Computer Information Systems
Jordan University of Science and Technology
Irbid 22110, Jordan
[email protected]

Islam Qarqaz
Department of Computer Science
Jordan University of Science and Technology
Irbid 22110, Jordan

Abstract—Sentiment analysis is a process during which the polarity (i.e. positive, negative or neutral) of a given text is determined. In general, there are two approaches to this problem: the machine learning approach and the lexicon-based approach. The current paper deals with sentiment analysis of Arabic reviews from a machine learning perspective. Three classifiers were applied to an in-house developed dataset of tweets/comments. In particular, the Naïve Bayes, SVM and K-Nearest Neighbor classifiers were run on this dataset. The results show that SVM gives the highest precision while KNN (K=10) gives the highest recall.

Keywords—sentiment analysis; sentiment classification; opinion mining; text mining; Arabic language.

I. INTRODUCTION

Sentiment analysis or opinion mining is a field of study which attempts to analyze people's opinions, sentiments, attitudes, and emotions towards entities such as products, services, and organizations. The expression sentiment analysis first appeared in [11] (Nasukawa and Yi 2003), and the expression opinion mining first appeared in [10] (Dave, Lawrence and Pennock 2003). Several authors use the two phrases interchangeably, and we do the same in this work. Sentiment analysis can be viewed as a classification process that determines whether a certain document or text was written to express a positive or negative opinion. Sentiment classification is helpful in business intelligence applications and recommender systems [20]. There are also potential applications to message filtering [19]. In recent years, sentiment analysis has gained considerable attention, and its applications have spread to almost every possible domain. Many systems and applications have been built for English and other Indo-European languages. However, few studies have focused on the Arabic language, which is the native language of more than 300 million speakers. This paper is concerned with sentiment analysis of public Arabic tweets and comments in social media using classification models built with RapidMiner [16], an open source data mining and machine learning tool. The rest of this paper is structured as follows: Section II describes related work. Section III presents the software that was used in this study in addition to the dataset.


Section IV presents supervised classification. Section V discusses experimentation and result analysis. Finally, Section VI draws the conclusions of this study and highlights some future work.

II. RELATED WORK

Researchers have proposed many different approaches for sentiment analysis. In general, there are two main methods: the first uses machine learning (supervised) techniques, which are adopted in this paper, and the second uses unsupervised techniques. Many studies have focused on sentiment analysis for English and other Indo-European languages. Pang and Lee [6] used machine learning techniques for sentiment classification. They employed three classifiers (Naïve Bayes, Maximum Entropy, and Support Vector Machines). Their data source was the Internet Movie Database (IMDB); they selected only reviews where the author rating was expressed either with stars or some numerical value. Dave, Lawrence, and Pennock [10] proposed an approach that begins with training a classifier using a corpus of self-tagged reviews available from major web sites. They used n-grams in two tests, and the results showed that this approach outperformed traditional machine learning baselines. Much research has been devoted to analyzing sentiments and extracting opinions from the World Wide Web. This proved to be important due to the large amount of data contributed by users on websites such as social networks (Facebook, Twitter, etc.). For example, Hassan, Yulan, and Alani [1] studied semantic sentiment analysis of Twitter. The authors used three different Twitter datasets for their experiments. They proposed the use of semantic features in Twitter sentiment classification and explored three different approaches for incorporating them into the analysis: replacement, augmentation, and interpolation. In [18], Kumar and Sebastian presented a novel approach for sentiment analysis of Twitter data by extracting the opinion words in tweets. There are few studies on sentiment analysis for the Arabic language. For example, Abdul-Mageed and Diab presented a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon [2]. They ran their experiments on three different pre-processing settings based on tokenized text from the Penn Arabic Treebank (PATB). They adopted a two-stage classification approach: in the first stage they built a binary classifier to sort out objective from subjective cases; in the second stage, they applied binary classification that distinguishes positive from negative cases. In [4], the same researchers as in [2] reported efforts to bridge the gap in Arabic research by presenting AWATIF, a multi-genre corpus of Modern Standard Arabic for Subjectivity and Sentiment Analysis (MSA SSA). They extended their previous work by showing how annotation studies within subjectivity and sentiment analysis can both be inspired by existing linguistic theory and cater for genre nuances. Almas and Ahmad [3] targeted three languages (English, Arabic, and Urdu) in their work. They described a method, called local grammar, for automatically extracting specialist (sentiment-bearing) terms. The authors compared the behavior of single and compound tokens in specialist and general language corpora to determine whether a token behaves like a sentiment term or not. Elhawary and Elfeky [5] extracted business reviews scattered on the web and written in the Arabic language. They built a system that comprises two components: a reviews classifier that decides whether a web page contains reviews or not, and a sentiment analyzer that labels the review text as positive, negative, neutral, or mixed.


El-Halees [7] presented a combined approach that extracts opinions from Arabic documents. The approach consists of three methods: first, a lexicon-based method classifies part of the documents; second, the classified documents are used as a training set for a Maximum Entropy model; last, the K-nearest neighbor classifier is used to classify the rest of the documents. Rushdi-Saleh, Martín-Valdivia, Ureña-López, and Perea-Ortega presented an Opinion Corpus for Arabic (OCA) in [8]. It is composed of Arabic reviews extracted from specialized web pages related to movies and films in the Arabic language. They utilized two classifiers, namely Support Vector Machines and Naïve Bayes. Al-Subaihin, Al-Khalifa, and Al-Salman [9] proposed a sentiment analysis tool for modern Arabic using human-based computing, which helps construct and dynamically develop and maintain the tool's lexicons. They also inspected the problem of conducting sentiment analysis on Arabic text on the World Wide Web. The solution they proposed is a lexicon-based approach.

III. SOFTWARE & DATASET

A. RapidMiner
RapidMiner [16] is a Java-based open source data mining and machine learning tool. It has a graphical user interface (GUI) where users can design a machine learning process without having to code. The process is then transformed into an XML (eXtensible Markup Language) file which defines the operations that the user wants to apply to the given dataset. Perhaps one of the most valuable extensions to RapidMiner is the Text Processing package. It includes many operators that support text mining; for example, there are operators for tokenization, stemming, filtering stopwords, and generating n-grams. The main reason for choosing RapidMiner is that the Text Processing package can deal with the Arabic language.

B. The Dataset
We generated our dataset by collecting tweets and Facebook comments from the internet. These tweets and comments address general topics such as education, sports, and political news. As far as the tweets are concerned, we utilized the crawler and annotation tool presented in [12]. The authors in [12] designed a crawler to collect tweets from Twitter. They also relied on crowdsourcing for tweet annotation. Initially, 10,000 tweets were collected and annotated. When the collected tweets were carefully examined, we realized that they suffer from several problems: they include a high number of duplicate tweets, which may be the result of re-tweeting, and some of the automatically collected tweets are empty and contain only the address of the sender. Such tweets were removed from our dataset. We also manually collected 500 comments from Facebook. Many of these comments were removed, either because they are written in Arabizi (a style, widely used by Arab users of social media, in which Arabic words are written with Roman letters) or because the comment consists of emoticons only. Table 1 shows the number of tweets and comments that remained, with their sentiment orientation.

Table 1: Number of negative and positive tweets/comments in the dataset.

                    Positive   Negative   Total
Tweets/Comments       1073       1518      2591

The crowdsourcing tool presented in [12] was used to label the tweets. Volunteers have to create a username and password to use the tool. Once they log onto the system, one tweet or comment appears at a time. The user has the option to choose a label from (1) Positive, (2) Negative, (3) Neutral, and (4) Other. Positive tweets are given the label "1", negative tweets are given the label "-1", and neutral tweets are given the label "0". If the tweet is empty or suffers from a problem, the option "Other" is used and that tweet is deleted from the dataset. Every tweet must be rated by at least three different users, and majority voting is used to assign the final label to every tweet. As a quality assurance measure, one of the raters was one of the researchers in [12]. After labelling was complete, we stored the positive tweets in one file and the negative tweets in another file. The authors of the current paper manually labeled the Facebook comments.
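As an illustration of this labelling rule, the following sketch aggregates the ratings of three or more volunteers into a single label by majority voting. The helper function and the rating values are hypothetical and are not part of the tool described in [12]:

from collections import Counter

def majority_label(ratings):
    """Return the majority label among volunteer ratings (1, -1 or 0).

    Hypothetical helper: assumes at least three ratings per tweet, as
    required by the annotation protocol described above.
    """
    counts = Counter(ratings)
    label, votes = counts.most_common(1)[0]
    # Accept the label only if it is a strict majority; how ties are resolved
    # is not specified in the paper, so returning None here is an assumption.
    if votes > len(ratings) / 2:
        return label
    return None

# Example: three volunteers rated one tweet
print(majority_label([1, 1, -1]))  # -> 1 (positive)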

IV. SUPERVISED CLASSIFICATION

A. K-nearest neighbor classifier (KNN)
This is a simple classifier which chooses the K nearest neighbors among the training documents and classifies an unannotated document based on these K neighbors. Specifically, it calculates the similarity between the unlabeled document and the documents in the training dataset. After that, the labels of the K most similar documents are considered. The final label of the new document is determined using majority voting or a weighted average over the labels of these K neighbors. In [23], the authors used KNN to classify emotions contained in examples, written in Japanese, extracted from the web.
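A minimal sketch of this idea is given below. It uses scikit-learn's KNeighborsClassifier with cosine distance over TF-IDF vectors, which is one common choice for document similarity, not necessarily the configuration used in RapidMiner; the training documents and labels are made-up English examples used only for readability:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy training documents (hypothetical) with polarity labels: 1 = positive, -1 = negative.
train_docs = ["great service and friendly staff",
              "terrible experience, very slow",
              "excellent quality, highly recommended",
              "awful product, waste of money"]
train_labels = [1, -1, 1, -1]

# Represent documents as TF-IDF vectors, then classify by the K nearest neighbours
# using cosine distance as the similarity measure.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_new = vectorizer.transform(["slow and awful service"])
print(knn.predict(X_new))  # majority vote of the 3 nearest training documents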

B. Naïve Bayes
This classifier depends on Bayes rule, written in the following formula:

P(c|d) = P(d|c) P(c) / P(d)        (1)

where c is a class (sentiment label) and d is a document.
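As a brief illustration (not taken from the paper), the sketch below applies Eq. (1) to a two-class sentiment problem. Under the independence assumption discussed next, P(d|c) is approximated by the product of per-word probabilities, and P(d) can be ignored because it is the same for both classes; all probability values here are hypothetical:

# Hypothetical per-class word probabilities P(w|c) and priors P(c).
priors = {"pos": 0.4, "neg": 0.6}
word_probs = {
    "pos": {"good": 0.08, "bad": 0.01, "service": 0.05},
    "neg": {"good": 0.02, "bad": 0.07, "service": 0.05},
}

def naive_bayes_score(words, label):
    """Score proportional to P(c|d) in Eq. (1): P(c) * product of P(w_i|c)."""
    score = priors[label]
    for w in words:
        score *= word_probs[label].get(w, 1e-6)  # tiny value for unseen words
    return score

doc = ["bad", "service"]
scores = {c: naive_bayes_score(doc, c) for c in priors}
print(max(scores, key=scores.get))  # -> "neg"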

The main idea of the Naïve Bayes classifier is to assume that the predictor variables (here, the words) are independent given the class, which substantially reduces the computation of the probabilities. This classifier gives good results and has been used in much research, such as the work reported in [21] and [24].

C. Support Vector Machines (SVM)
SVM is an effective traditional text categorization framework. Its main idea is to find the hyperplane, represented as a vector, that separates document vectors of one class from document vectors of the other classes [26]. SVM has shown very good performance and high accuracy in many studies of sentiment analysis in many languages. The work reported in [25] shows that SVM did well for the English language when compared to other classifiers. Also, the work reported in [22] shows that SVM gives good results for sentiment analysis of reviews written in Chinese.

V. EXPERIMENTATION AND RESULT ANALYSIS

All the experiments carried out in this research were done using RapidMiner [16], which was described in Section III. Here, the classification task for a given classifier is designed as a process. This process consists of several operators, described next.

Process Documents from Files: This is a container operator, i.e. it contains other operators related to text processing. In this work, the Tokenize, Stem(Arabic), Filter Stopwords(Arabic), and Generate-n-Grams(Terms) operators were used. The Tokenize operator is responsible for splitting the text of the review into tokens or words. The Stem(Arabic) operator is responsible for reducing an Arabic token to its stem or root. RapidMiner [16] also has another operator called Stem(Arabic, Light); this operator does not reduce a word to its proper root, but rather removes common prefixes and suffixes from words or tokens. The Filter Stopwords(Arabic) operator removes noise Arabic words that do not affect the classification task. When dealing with sentiment analysis, the usage of this operator is tricky, because negation words are considered stopwords for topical classification and are thus removed. On the other hand, negation words are critical for sentiment analysis, as they can reverse the sentiment from positive to negative and vice versa. The Generate-n-Grams operator can slightly alleviate this problem by generating sequences of n words, where each sequence is considered one token. N, here, specifies the number of words or terms in a sequence. In this work, n was set to 2, i.e. we generated bi-grams. Obviously, there are more sophisticated methods to deal with valence shifters such as negations. For example, one could use a parser that searches for valence shifters, attaches them to the proper term, and determines the sentiment of the sequence as a whole. For instance, good (جيد) is a positive word and not-good (ليس جيدا) is a negative word in a given context. The Process Documents from Files operator takes as input folders that contain text files. In this work, two folders were fed to this operator: one folder which contains the positive reviews and a second folder that contains the negative reviews. This operator has a set of parameters that are generally useful when dealing with text. For example, the Vector Creation parameter allows the user to choose a weighting scheme for the terms, such as TF (Term Frequency), TF-IDF (Term Frequency, Inverse Document Frequency) and others.
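The sketch below reproduces a roughly equivalent preprocessing pipeline outside RapidMiner; it is an analogue under stated assumptions, not the operators themselves. It uses scikit-learn's TfidfVectorizer to tokenize, remove a hypothetical user-supplied Arabic stopword list, and generate unigrams plus bigrams weighted by TF-IDF; Arabic stemming would require an external stemmer and is omitted here:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical Arabic stopword list; a real list would be much longer and, as
# noted above, negation words should be kept out of it for sentiment tasks.
arabic_stopwords = ["في", "من", "على", "هذا"]

reviews = ["الخدمة جيدة جدا",       # example positive review (hypothetical)
           "المنتج ليس جيدا ابدا"]   # example negative review (hypothetical)

vectorizer = TfidfVectorizer(
    stop_words=arabic_stopwords,  # analogue of Filter Stopwords(Arabic)
    ngram_range=(1, 2),           # unigrams plus bigrams, analogue of Generate-n-Grams with n = 2
)
X = vectorizer.fit_transform(reviews)   # TF-IDF weighting (the Vector Creation parameter)
print(vectorizer.get_feature_names_out())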

X-Validation: Validation is an important step that allows us to test the accuracy of algorithms. The most common approaches to validation are the hold-out method and the cross-validation method. In the hold-out method, part of the data (reviews) is held out for testing and the remaining part is used for training the classifier. The cross-validation method, by comparison, also splits the data into testing and training sets, but the data is scanned several times so that each division or part of the data gets to be used in both the training and testing phases. To be clear, in the 10-fold cross-validation method, the data is divided into 10 divisions or parts; one is used for testing and 9 for training in the first run. In the second run, a different part is used for testing and 9 parts, including the one that was used for testing in run one, are used for training. The runs continue until each part or division has had the chance to be part of both the training data and the testing data. The final accuracy is the average of the accuracies obtained in the 10 runs. In the current research, we have used 10-fold cross-validation.
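A minimal sketch of this evaluation scheme, assuming scikit-learn rather than RapidMiner's X-Validation operator, is shown below; the documents, labels, and classifier choice are placeholders, not the actual experimental setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data: in the actual experiments these would be the 2591 labelled
# tweets/comments described in Section III.
docs = ["good service"] * 10 + ["bad service"] * 10
labels = [1] * 10 + [-1] * 10

# 10-fold cross-validation: each fold is used once for testing and nine times
# for training; the reported score is the average over the 10 runs.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, docs, labels, cv=10, scoring="precision_macro")
print(scores.mean())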


The X-Validation operator is a nested one that consists of an operator for the classifier and another operator for calculating the performance of the classifier. The classifiers that we used here are SVM, KNN, and Naïve Bayes.

The Performance Operator is responsible for calculating the accuracy of the classifier. It has many parameters that one can choose from when deciding on a method for calculating the accuracy of the classifier. In the current research, we used precision and recall as measures of accuracy. To calculate these we need:
TP: the number of reviews that were correctly classified by the classifier as belonging to the current class.
TN: the number of reviews that were correctly classified by the classifier as not belonging to the current class.
FP: the number of reviews that were mistakenly classified by the classifier as belonging to the current class.
FN: the number of reviews that were mistakenly classified by the classifier as not belonging to the current class.
Therefore, Precision = TP/(TP+FP) and Recall = TP/(TP+FN) for binary classification tasks. As the problem we are dealing with consists of two classes, we calculated the precision/recall per class, and we also calculated the macro-precision and macro-recall over the two classes together.

Table 2 shows the class precision and recall for the Naïve Bayes classifier. As can be seen from the table, the precision of the Negative class is 78.20 while the precision of the Positive class is 54.21. This variance is mainly attributed to the fact that the dataset is not balanced; the number of negative reviews is larger than the number of positive reviews. Recall, on the other hand, equals 52.70 for the Negative class and 79.22 for the Positive class. The reason for the lower recall of the Negative class, in our opinion, is that the classifier has to retrieve a large number of negative reviews to obtain high recall, as the number of negative reviews is large. Tables 3 and 4 depict the class precision and recall for the SVM and KNN (K=10) classifiers, respectively. Table 5 shows the Macro-Recall and Macro-Precision for the three classifiers. As the table shows, the highest Macro-Recall was achieved in the case of KNN, while the highest Macro-Precision was achieved in the case of SVM.

Table 2: Class Precision and Recall for the Naïve Bayes Classifier.

                     True Negative   True Positive   Class Precision
Predicted Negative        800             223             78.20
Predicted Positive        718             850             54.21
Class Recall             52.70           79.22
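For readers who want to reproduce these figures, the short sketch below recomputes the per-class and macro precision/recall directly from the confusion matrix in Table 2 (rows are predicted labels, columns are true labels); it assumes nothing beyond the numbers shown in the table:

# Confusion matrix from Table 2 (Naïve Bayes classifier).
# Rows are predicted labels, columns are true labels.
pred_neg_true_neg, pred_neg_true_pos = 800, 223
pred_pos_true_neg, pred_pos_true_pos = 718, 850

# Per-class precision: correct predictions of a class / all predictions of that class.
prec_neg = pred_neg_true_neg / (pred_neg_true_neg + pred_neg_true_pos)
prec_pos = pred_pos_true_pos / (pred_pos_true_neg + pred_pos_true_pos)

# Per-class recall: correct predictions of a class / all true members of that class.
rec_neg = pred_neg_true_neg / (pred_neg_true_neg + pred_pos_true_neg)
rec_pos = pred_pos_true_pos / (pred_neg_true_pos + pred_pos_true_pos)

# Macro averages (unweighted means over the two classes), as reported in Table 5.
macro_precision = (prec_neg + prec_pos) / 2
macro_recall = (rec_neg + rec_pos) / 2

print(round(100 * prec_neg, 2), round(100 * prec_pos, 2))             # 78.2  54.21
print(round(100 * rec_neg, 2), round(100 * rec_pos, 2))               # 52.7  79.22
print(round(100 * macro_precision, 3), round(100 * macro_recall, 2))  # 66.205 65.96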

Table 3: Class Precision and Recall for the SVM Classifier.

                     True Negative   True Positive   Class Precision
Predicted Negative       1474             806             64.65
Predicted Positive         44             267             85.85
Class Recall             97.10           24.88

ROC (Receiver Operating Characteristic) curves are graphical methods used for depicting the accuracy of classifiers. A ROC chart describes the effectiveness of a classifier which allocates items into one of two categories depending on whether or not they exceed a threshold. In ROC curves, the x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR). The FPR is the fraction of negative examples that are misclassified, and the TPR is the fraction of positive examples that are correctly labeled. The best point in the ROC space is located in the upper left corner, at coordinate (0,1), and is sometimes called perfect classification. The diagonal of the ROC space represents random guessing, as is the case with flipping coins. Points located above the diagonal are better than random guesses, and points located under the diagonal are worse than random guesses. Figure 1 shows the ROC curves for the three classifiers. As can be seen from the figure, all classifiers performed better than random guessing. In fact, the results are located in the left side of the ROC space, which indicates that the classifiers actually did a good job.
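The sketch below illustrates how such a curve can be produced with scikit-learn for one classifier; it is not the RapidMiner procedure used for Figure 1, and the data and model are placeholders:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the TF-IDF vectors of the labelled reviews.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)   # confidence scores for the positive class
fpr, tpr, _ = roc_curve(y_test, scores)  # FPR on the x-axis, TPR on the y-axis

plt.plot(fpr, tpr, label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")  # the diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()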

VI. CONCLUSIONS AND FUTURE WORK

This work has considered sentiment analysis in Arabic text. A dataset, which consists of 2591 tweets/comments, was collected and labelled using crowdsourcing. The Naïve Bayes, SVM and KNN classifiers were used to detect the polarity of a given review. 10-fold cross-validation was used to split the data into training and testing sets. The best precision was achieved by SVM and equals 75.25. The best recall was achieved in the case of KNN (K=10) and equals 69.04.

Table 4: Class Precision and Recall for the KNN Classifier (K=10).

                     True Negative   True Positive   Class Precision
Predicted Negative       1260             482             72.33
Predicted Positive        258             591             69.61
Class Recall             83.00           55.08

Certainly, there are many ways in which this work can and will be improved. Firstly, the size of the dataset is rather small, and if we want to draw solid conclusions then we definitely need larger datasets. Secondly, crowdsourcing is a useful tool when labelling or annotating large amounts of data; in this work we utilized crowdsourcing to label our dataset. Thirdly, semi-supervised learning could be applied to sentiment analysis in Arabic text, as this technique has been applied successfully to other languages, as described in the research reported in [13], [14], [15] and [17].

Table 5: Macro-Precision and Macro-Recall for the three classifiers.

Classifier        Macro-Precision   Macro-Recall
Naïve Bayes            66.205           65.96
SVM                    75.25            60.99
KNN (K=10)             70.97            69.04

Figure 1: The ROC Curve for the Three Classifiers

REFERENCES
[1] Hassan S., Yulan H., and Alani H., "Semantic sentiment analysis of Twitter." The Semantic Web - ISWC, Springer, pp. 508-524, 2012.
[2] Abdul-Mageed M., Diab M., and Korayem M., "Subjectivity and sentiment analysis of modern standard Arabic." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, 2011.
[3] Almas Y., and Ahmad K., "A note on extracting sentiments in financial news in English, Arabic & Urdu." The Second Workshop on Computational Approaches to Arabic Script-based Languages, 2007.
[4] Abdul-Mageed M., and Diab M., "AWATIF: A multi-genre corpus for Modern Standard Arabic subjectivity and sentiment analysis." Proceedings of LREC, Istanbul, Turkey, 2012.
[5] Elhawary M., and Elfeky M., "Mining Arabic Business Reviews." Data Mining Workshops (ICDMW), pp. 1108-1113, 2010.
[6] Pang B., and Lee L., "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004.
[7] El-Halees A., "Arabic Opinion Mining Using Combined Classification Approach." Proceedings of the International Arab Conference on Information Technology, ACIT, 2011.
[8] Rushdi-Saleh M., Martín-Valdivia M., Ureña-López L., and Perea-Ortega J.M., "Bilingual Experiments with an Arabic-English Corpus for Opinion Mining," 2011.
[9] Al-Subaihin A., Al-Khalifa H., and Al-Salman A.M., "A proposed sentiment analysis tool for modern Arabic using human-based computing." Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services, ACM, 2011.
[10] Dave K., Lawrence S., and Pennock D.M., "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews." Proceedings of the 12th International Conference on World Wide Web, pp. 519-528, ACM, 2003.
[11] Nasukawa T., and Jeonghee Y., "Sentiment analysis: Capturing favorability using natural language processing." Proceedings of the 2nd International Conference on Knowledge Capture, ACM, 2003.
[12] [Self reference to the authors, names were removed as per Journal instructions] "Sentiment Analysis." June 2012.
[13] Rao D., and Ravichandran D., "Semi-supervised polarity lexicon induction." Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009.
[14] Dasgupta S., and Ng V., "Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 701-709, 2009.
[15] Sindhwani V., and Melville P., "Document-word co-regularization for semi-supervised sentiment analysis." 8th IEEE International Conference on Data Mining (ICDM'08), pp. 1025-1030, 2008.
[16] RapidMiner, http://rapid-i.com/, last access on 31-Jan-201.
[17] Goldberg A.B., and Zhu X., "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization." Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, 2006.
[18] Kumar A., and Sebastian T.M., "Sentiment Analysis on Twitter." IJCSI International Journal of Computer Science, Issue 9.3, pp. 372-378, 2012.
[19] Malouf R., and Mullen T., "Taking sides: User classification for informal online political discourse." Internet Research 18.2, pp. 177-190, 2008.
[20] Glance N., Hurst M., Nigam K., Siegler M., Stockton R., and Tomokiyo T., "Deriving marketing intelligence from online discussion." Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 419-428, 2005.
[21] Rish I., "An empirical study of the naive Bayes classifier." IJCAI Workshop on Empirical Methods in Artificial Intelligence, Vol. 3, No. 22, 2001.
[22] Wan X., "Co-training for cross-lingual sentiment classification." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, 2009.
[23] Tokuhisa R., Kentaro I., and Yuji M., "Emotion classification using massive examples extracted from the web." Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, 2008.
[24] McCallum A., and Nigam K., "A comparison of event models for Naive Bayes text classification." AAAI-98 Workshop on Learning for Text Categorization, Vol. 752, 1998.
[25] Ye Q., Zhang Z., and Law R., "Sentiment classification of online reviews to travel destinations by supervised machine learning approaches." Expert Systems with Applications, Vol. 36, Issue 3, pp. 6527-6535, 2009.
[26] Fung G., and Olvi L.M., "Incremental support vector machine classification." Proceedings of the Second SIAM International Conference on Data Mining, Arlington, Virginia, 2002.