Content-Based Spam Filtering

Tiago A. Almeida and Akebo Yamakami

Abstract—The growth in the number of e-mail users has resulted in a dramatic increase in spam. Fortunately, there are different approaches able to automatically detect and remove most of these messages, and the best-known ones are based on Bayesian decision theory and Support Vector Machines. However, there are several forms of Naive Bayes filters, something the anti-spam literature does not always acknowledge. In this paper, we discuss seven different versions of Naive Bayes classifiers and compare them with the well-known linear Support Vector Machine on six non-encoded datasets. Moreover, we propose a new measurement in order to evaluate the quality of anti-spam classifiers. In particular, we investigate the benefits of using the Matthews correlation coefficient as the measure of performance.

I. INTRODUCTION

The term spam is generally used to denote an unsolicited commercial e-mail. The problem of spam can be quantified in economic terms, since many hours are wasted every day by workers: not just the time they spend reading spam, but also the time they spend deleting those messages. According to annual reports, the amount of spam is increasing frighteningly. In absolute numbers, the average number of spams sent per day increased from 2.4 billion in 2002¹ to 300 billion in 2009². The same report indicates that more than 90% of incoming e-mail traffic is spam. According to the National Technology Readiness Survey³, the cost of spam in terms of lost productivity in the United States has reached US$ 21.58 billion annually, while the worldwide productivity cost of spam is estimated at US$ 50 billion. On a worldwide basis, the information technology cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2007. Fortunately, many methods have been proposed to automatically classify messages as spam or legitimate, such as rule-based approaches, white and black lists, collaborative spam filtering, and challenge-response systems, among others. However, among all the proposed techniques, machine learning algorithms have achieved the most success [1]. These methods include approaches that are considered top performers in text categorization, such as the rule induction algorithm [2], Rocchio [3], [4], Boosting [5], Support Vector Machines [6], [7], [8], [9], and Naive Bayes classifiers [10], [11], [12],

Tiago A. Almeida is with the School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13081-970, Campinas, SP, Brazil (phone: +55 19 3521-3849; fax: +55 19 3289-1395; email: [email protected]). Akebo Yamakami is with the School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13081-970, Campinas, SP, Brazil (email: [email protected]).
1 See http://www.spamunit.com/spam-statistics/
2 See www.ciscosystems.cd/en/US/prod/collateral/vpndevc/cisco_2009_asr.pdf
3 See http://www.smith.umd.edu/ntrs/NTRS_2004.pdf

978-1-4244-8126-2/10/$26.00 ©2010 IEEE

[13]. The latter two currently appear to be particularly popular in commercial and open-source spam filters. This is probably due to their simplicity, which makes them easy to implement; their linear computational complexity; and their accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms [12], [13]. In this paper, we present a comparative study of seven different statistical classifiers based on Naive Bayes probabilities and of the linear Support Vector Machine, employed to automatically filter spam. We chose these methods because they are the filters most employed to classify spam nowadays [12], [14] and, moreover, they are the current top performers in spam filtering, being used in several free webmail servers and open-source software packages [15], [14]. Further details about other techniques used for spam filtering and their applications are available in Bratko and Cormack [16], Cormack [1], and Guzella and Caminhas [14]. Here, we carry out a performance evaluation with the practical purpose of filtering e-mail spam, in order to compare the current top-performing spam filters. Furthermore, we investigate the performance measurements applied to compare the quality of spam filters. The remainder of this paper is organized as follows. Section II presents details of the Naive Bayes algorithms applied in the spam filtering domain. The linear Support Vector Machine classifier is described in Section III. In Section IV, we offer a discussion of the benefits of using the Matthews correlation coefficient as a measure of the quality of spam filters. Experimental results are shown in Section V. Finally, Section VI offers conclusions and outlines future work.

II. NAIVE BAYES ANTI-SPAM FILTERS

Probabilistic classifiers are historically the first filters and have been frequently used in recent years.
The Naive Bayes (NB) classifier is the most widely employed in spam filtering because of its simplicity and high performance [12], [14]. From Bayes' theorem and the theorem of total probability, the probability that a message with vector x = ⟨x1, ..., xn⟩ belongs to a category ci ∈ {cs, cl} is:

  P(ci | x) = P(ci) · P(x | ci) / P(x).

Since the denominator does not depend on the category, NB classifies each message into the category that maximizes P(ci) · P(x | ci). In the spam filtering domain, this is equivalent to classifying a message as spam (cs) whenever

  P(cs) · P(x | cs) / [P(cs) · P(x | cs) + P(cl) · P(x | cl)] > T,

with T = 0.5. By varying T, we can opt for more true negatives at the expense of fewer true positives, or vice-versa. The a priori probabilities P(ci) can be estimated as the frequency of documents belonging to category ci in the training set Tr, whereas P(x | ci) is practically impossible to estimate directly, because we would need the training set to contain messages identical to the one we want to classify. The NB classifier therefore makes the simplifying assumption that the terms of a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities P(x | ci) are estimated differently in each NB version. Despite the fact that its independence assumption is usually oversimplistic, several studies have found the NB classifier to be surprisingly effective in the spam filtering task [11], [17], [13]. In the following, we describe seven different versions of the NB anti-spam filter.

A. Basic Naive Bayes

We denote as Basic NB the first NB anti-spam classifier, proposed by Sahami et al. [10]. Let T = {t1, ..., tn} be the set of terms collected in the training stage and M the set of messages in the training set. Each message m is represented as a binary vector x = ⟨x1, ..., xn⟩, where each xk indicates whether or not tk occurs in m. The probabilities P(x | ci) are calculated by

  P(x | ci) = ∏_{k=1}^{n} P(tk | ci),

and the criterion for classifying a message as spam is

  P(cs) · ∏_{k=1}^{n} P(tk | cs) / [Σ_{ci ∈ {cs, cl}} P(ci) · ∏_{k=1}^{n} P(tk | ci)] > T.

Here, the probabilities P(tk | ci) are estimated by

  P(tk | ci) = |M_{tk,ci}| / |M_{ci}|,

where |M_{tk,ci}| is the number of training messages of category ci that contain the term tk, and |M_{ci}| is the total number of training messages of category ci.

B. Multinomial term frequency Naive Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a bag of terms m = {t1, ..., tn}, counting how many times each tk appears in m. In this sense, m can be represented by a vector x = ⟨x1, ..., xn⟩, where each xk corresponds to the number of occurrences of tk in m. Moreover, each message m of category ci can be interpreted as the result of independently picking |m| terms from T with replacement, with probability P(tk | ci) for each tk [18]. Hence, P(x | ci) is the multinomial distribution:

  P(x | ci) = P(|m|) · |m|! · ∏_{k=1}^{n} P(tk | ci)^{xk} / xk!.

Thus, the criterion for classifying a message as spam becomes

  P(cs) · ∏_{k=1}^{n} P(tk | cs)^{xk} / [Σ_{ci ∈ {cs, cl}} P(ci) · ∏_{k=1}^{n} P(tk | ci)^{xk}] > T,

and the probabilities P(tk | ci) are estimated with a Laplacian prior:

  P(tk | ci) = (1 + N_{tk,ci}) / (n + N_{ci}),

where N_{tk,ci} is the number of occurrences of the term tk in the training messages of category ci, and N_{ci} = Σ_{k=1}^{n} N_{tk,ci}.

C. Multinomial Boolean Naive Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to MN TF NB, including the estimates of P(tk | ci), except that each attribute xk is Boolean. Note that these approaches do not take into account the absence of terms (xk = 0) from the messages. Schneider [19] shows that MN Boolean NB may perform better than MN TF NB. This is because the multinomial NB with term frequency attributes is equivalent to an NB version whose attributes are modeled as following Poisson distributions in each category, assuming that the message length is independent of the category. Therefore, the multinomial NB may achieve better performance with Boolean attributes if the term frequency attributes do not actually follow Poisson distributions.
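To make the Basic NB variant concrete, the following Python sketch (our own toy code, not the authors' implementation; the corpus and all names are hypothetical) estimates P(tk | ci) by document frequency and applies the spam criterion with T = 0.5:

```python
def train_basic_nb(messages):
    """messages: list of (terms, label) pairs, label in {"spam", "legit"}.
    Returns priors and per-term document-frequency estimates P(t_k | c_i)."""
    classes = ("spam", "legit")
    prior = {c: sum(1 for _, y in messages if y == c) / len(messages) for c in classes}
    vocab = set().union(*(terms for terms, _ in messages))
    cond = {}
    for c in classes:
        docs = [terms for terms, y in messages if y == c]
        # P(t_k | c_i) = |M_{t_k, c_i}| / |M_{c_i}|  (unsmoothed, as in Section II-A)
        cond[c] = {t: sum(1 for d in docs if t in d) / len(docs) for t in vocab}
    return prior, cond, vocab

def classify_basic_nb(prior, cond, vocab, message_terms, T=0.5):
    """Classify as spam when the normalized spam score exceeds the threshold T."""
    score = {}
    for c in prior:
        p = prior[c]
        for t in message_terms & vocab:  # terms outside the vocabulary are ignored
            p *= cond[c][t]
        score[c] = p
    total = sum(score.values())
    return "spam" if total > 0 and score["spam"] / total > T else "legit"

# Hypothetical toy corpus, for illustration only.
training = [
    ({"cheap", "viagra", "offer"}, "spam"),
    ({"win", "offer", "money"}, "spam"),
    ({"meeting", "report", "monday"}, "legit"),
    ({"report", "project", "deadline"}, "legit"),
]
prior, cond, vocab = train_basic_nb(training)
print(classify_basic_nb(prior, cond, vocab, {"cheap", "offer"}))    # spam
print(classify_basic_nb(prior, cond, vocab, {"report", "monday"}))  # legit
```

Note that the unsmoothed estimate zeroes out the score of any category that has never seen one of the message's terms, which is exactly why the later variants introduce a Laplacian prior.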

D. Multivariate Bernoulli Naive Bayes

The multivariate Bernoulli NB (MV Bernoulli NB) represents each message m by computing only the presence or absence of each term. Therefore, m can be represented as a binary vector x = ⟨x1, ..., xn⟩, where each xk indicates whether or not tk occurs in m. Moreover, each message m of category ci is seen as the result of n Bernoulli trials, where at each trial we decide whether or not tk appears in m. The probability of a positive outcome at trial k is P(tk | ci). Then, the probabilities P(x | ci) are computed by

  P(x | ci) = ∏_{k=1}^{n} P(tk | ci)^{xk} · (1 − P(tk | ci))^{(1 − xk)}.

The criterion for classifying a message as spam becomes

  P(cs) · ∏_{k=1}^{n} P(tk | cs)^{xk} · (1 − P(tk | cs))^{(1 − xk)} / [Σ_{ci ∈ {cs, cl}} P(ci) · ∏_{k=1}^{n} P(tk | ci)^{xk} · (1 − P(tk | ci))^{(1 − xk)}] > T,

and the probabilities P(tk | ci) are estimated with a Laplacian prior:

  P(tk | ci) = (1 + |M_{tk,ci}|) / (2 + |M_{ci}|),

where |M_{tk,ci}| is the number of training messages of category ci that contain the term tk, and |M_{ci}| is the total number of training messages of category ci. For a more theoretical explanation, consult Metsis et al. [12] and Losada and Azzopardi [20].
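The two distinguishing ingredients of MV Bernoulli NB — the Laplacian-prior estimate and the (1 − P) factor contributed by absent terms — can be sketched as follows (our own toy code, with a hypothetical corpus, not the authors' implementation):

```python
def bernoulli_estimates(messages, vocab):
    """Laplacian-prior estimates: P(t | c) = (1 + |M_{t,c}|) / (2 + |M_c|)."""
    cond = {}
    for c in ("spam", "legit"):
        docs = [terms for terms, y in messages if y == c]
        cond[c] = {t: (1 + sum(1 for d in docs if t in d)) / (2 + len(docs))
                   for t in vocab}
    return cond

def bernoulli_likelihood(cond_c, vocab, message_terms):
    """P(x | c): every vocabulary term contributes, present or absent."""
    p = 1.0
    for t in vocab:
        p *= cond_c[t] if t in message_terms else (1.0 - cond_c[t])
    return p

# Hypothetical toy corpus, for illustration only.
training = [
    ({"cheap", "viagra", "offer"}, "spam"),
    ({"win", "offer", "money"}, "spam"),
    ({"meeting", "report", "monday"}, "legit"),
    ({"report", "project", "deadline"}, "legit"),
]
vocab = set().union(*(terms for terms, _ in training))
cond = bernoulli_estimates(training, vocab)
# The Laplacian prior keeps every estimate strictly inside (0, 1), so the
# absence factor (1 - P(t|c)) never collapses the product to zero.
```

A spam-looking message such as {"cheap", "offer"} then receives a higher likelihood under the spam estimates than under the legitimate ones.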

E. Boolean Naive Bayes

We define as Boolean NB the classifier similar to the multivariate Bernoulli NB, with the difference that it does not take into account the absence of terms. Hence, the probabilities P(x | ci) are estimated only by

  P(x | ci) = ∏_{k=1}^{n} P(tk | ci),

and the criterion for classifying a message as spam becomes

  P(cs) · ∏_{k=1}^{n} P(tk | cs) / [Σ_{ci ∈ {cs, cl}} P(ci) · ∏_{k=1}^{n} P(tk | ci)] > T,

where the probabilities P(tk | ci) are estimated in the same way as in the multivariate Bernoulli NB.

F. Multivariate Gauss Naive Bayes

The multivariate Gauss NB (MV Gauss NB) uses real-valued attributes by assuming that each attribute follows a Gaussian distribution g(xk; µ_{k,ci}, σ_{k,ci}) for each category ci, where the parameters µ_{k,ci} and σ_{k,ci} of each distribution are estimated from the training set Tr. The probabilities P(x | ci) are calculated by

  P(x | ci) = ∏_{k=1}^{n} g(xk; µ_{k,ci}, σ_{k,ci}),

and the criterion for classifying a message as spam becomes

  P(cs) · ∏_{k=1}^{n} g(xk; µ_{k,cs}, σ_{k,cs}) / [Σ_{ci ∈ {cs, cl}} P(ci) · ∏_{k=1}^{n} g(xk; µ_{k,ci}, σ_{k,ci})] > T.

G. Flexible Bayes

Flexible Bayes (FB) works similarly to MV Gauss NB. However, instead of using a single normal distribution for each attribute Xk per category ci, FB represents the probabilities P(xk | ci) as the average of L_{k,ci} normal distributions with different values for µ_{k,ci}, but the same σ_{ci}:

  P(xk | ci) = (1 / L_{k,ci}) · Σ_{l=1}^{L_{k,ci}} g(xk; µ_{k,ci,l}, σ_{ci}),

where L_{k,ci} is the number of different values the attribute Xk takes in the training set Tr of category ci. Each of these values is used as the mean µ_{k,ci,l} of a normal distribution of category ci. However, all distributions of a category ci are taken to have the same standard deviation, σ_{ci} = 1 / sqrt(|Tr_{ci}|), so the distributions of each category become narrower as more training messages of that category are accumulated. By averaging several normal distributions, FB can approximate the true distributions of real-valued attributes more closely than MV Gauss NB when the assumption that the attributes follow normal distributions is violated. For further details, consult Androutsopoulos et al. [11] and John and Langley [21].

III. SUPPORT VECTOR MACHINES

The support vector machine (SVM) [6], [7], [8], [9] is one of the most recent techniques used in text classification. In this method, a data point is viewed as a p-dimensional vector, and the approach aims to separate such points with a (p − 1)-dimensional hyperplane; this is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. Therefore, the SVM chooses the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier (Figure 1).

Fig. 1. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

SVMs belong to a family of generalized linear classifiers. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum-margin classifiers. Consult Drucker, Wu and Vapnik [6], Kolcz and Alspector [7], and Hidalgo [8] for further details about the implementation of SVMs applied to filtering messages as spam or legitimate.

IV. PERFORMANCE MEASUREMENTS

Let S and L be the sets of spam and legitimate messages, respectively. The possible prediction results are: true positives (TP), the set of spam messages correctly classified; true negatives (TN), the set of legitimate messages correctly classified; false negatives (FN), the set of spam messages incorrectly classified as legitimate; and false positives (FP), the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: true positive rate (Tpr), true negative rate (Tnr), spam precision (Spr), spam recall (Sre), legitimate precision (Lpr), legitimate recall (Lre), area under the ROC curve (1-AUC) [1], LAM [1], and precision × recall [12], among others.
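The listed rates all derive directly from the four confusion-matrix counts; as a small sketch (our own helper, with hypothetical counts, not code from the paper):

```python
def basic_rates(tp, tn, fp, fn):
    """Rates from the confusion-matrix counts |TP|, |TN|, |FP|, |FN|."""
    return {
        "Tpr": tp / (tp + fn),  # true positive rate = spam recall (Sre)
        "Tnr": tn / (tn + fp),  # true negative rate = legitimate recall (Lre)
        "Spr": tp / (tp + fp),  # spam precision
        "Lpr": tn / (tn + fn),  # legitimate precision
    }

# Hypothetical counts: 100 spams (90 caught) and 50 legitimate messages (5 lost).
print(basic_rates(tp=90, tn=45, fp=5, fn=10))  # e.g. Tpr = 90/100 = 0.9
```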

Nevertheless, failures to identify legitimate and spam messages have materially different consequences. Misclassified non-spam substantially increases the risk that the information contained in the message will be lost, or at least delayed. Exactly how much risk and delay are incurred is difficult to quantify, as are the consequences, which depend on the nature of the message. On the other hand, failures to identify spam also vary in importance, but are generally less important than failures to identify non-spam. Viruses, worms, and phishing messages may be an exception, as they pose significant risks to the user [1]. In order to take into consideration the asymmetry in the misclassification costs, Androutsopoulos et al. [11] proposed a refinement based on spam recall and precision, to allow performance evaluation based on a single measure. They consider a false positive to be λ times more costly than a false negative, with λ equal to 1 or 9. Hence, each false positive is counted as λ mistakes, with the weighted accuracy (Accw) given by

  Accw = (|TP| + λ|TN|) / (|S| + λ|L|).

The total cost ratio (TCR) can then be calculated by

  TCR = |S| / (λ|FP| + |FN|).

It offers an indication of the improvement provided by the filter. A greater TCR indicates better performance, and for TCR < 1, not using the filter at all is better.

A. Matthews correlation coefficient

According to Carpinter and Hunt [22] and Cormack and Lynam [23], the value of λ is very difficult to determine. In particular, it depends on the message, since some messages are more important than others, as previously discussed. Furthermore, a problem with TCR is that it does not return a value inside a predefined range. Consider, for example, two classifiers A and B employed to filter 600 messages (450 spams + 150 legitimate messages, λ = 1). Suppose that A attained a perfect prediction, with |FP_A| = |FN_A| = 0, and that B incorrectly classified only 3 spam messages as legitimate, so |FP_B| = 0 and |FN_B| = 3. In this case, TCR_A = +∞ and TCR_B = 150. Intuitively, we can observe that both classifiers achieved similar performance, with a small advantage for A. However, if we analyzed only the TCR, we would wrongly conclude that A was much better than B. Furthermore, TCR is not a value about which we can make strong claims regarding the performance of a single filter: it gives information about the improvement provided by using the filter, but not about how much the classifier could still be improved. In order to avoid these shortcomings, we propose the use of the Matthews correlation coefficient (MCC) [24]. MCC is used in machine learning as a measure of the quality of binary classifications and provides much more information than TCR. It returns a real value between −1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and −1, an inverse prediction [25]:

  MCC = (|TP|·|TN| − |FP|·|FN|) / sqrt((|TP| + |FP|) · (|TP| + |FN|) · (|TN| + |FP|) · (|TN| + |FN|)).
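The A-versus-B example above can be checked numerically; the following sketch (our own code, reproducing the counts from the text) computes TCR and MCC for both classifiers:

```python
import math

def tcr(tp, tn, fp, fn, lam=1.0):
    """Total cost ratio: |S| / (λ|FP| + |FN|); infinite for a perfect filter."""
    s = tp + fn
    denom = lam * fp + fn
    return float("inf") if denom == 0 else s / denom

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, bounded in [-1, +1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Classifiers A and B from the text: 450 spams + 150 legitimate messages, λ = 1.
A = dict(tp=450, tn=150, fp=0, fn=0)   # perfect prediction
B = dict(tp=447, tn=150, fp=0, fn=3)   # only 3 spams misclassified

print(tcr(**A), tcr(**B))              # inf 150.0
print(mcc(**A), round(mcc(**B), 3))    # 1.0 0.987
```

TCR jumps from 150 to infinity between two nearly identical classifiers, while MCC moves only from 0.987 to 1.000, which is the bounded behavior the text argues for.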

It provides a fairer evaluation since, in a real situation, the number of spams we receive is much higher than the number of legitimate messages; MCC therefore tends to automatically adjust how much worse a false positive error is than a false negative one. As the proportion between the number of spam and legitimate messages increases, a false positive tends to be much worse than a false negative [13]. Using the previous example, classifier A would achieve MCC_A = 1.000 and classifier B MCC_B = 0.987. Thus, we can draw correct conclusions both about the classifiers' relative predictions and about the performance each one achieves individually. Furthermore, we can combine MCC with other measures, such as precision × recall rates, in order to provide a fairer comparison.

V. EXPERIMENTAL RESULTS

We use the six public, large and well-known Enron corpora [12] in our experiments. All datasets are composed of real legitimate messages, extracted from the mailboxes of former employees of the Enron Corporation, and of spam messages selected from different sources. The composition of each corpus is shown in Table I.

TABLE I
ENRON DATASETS

Dataset   No of Legitimate   No of Spam   Total
Enron 1   3,672              1,500        5,172
Enron 2   4,361              1,496        5,857
Enron 3   4,012              1,500        5,512
Enron 4   1,500              4,500        6,000
Enron 5   1,500              3,675        5,175
Enron 6   1,500              4,500        6,000
Total     16,545             17,171       33,716
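The corpus composition above can be sanity-checked programmatically; a small sketch (counts transcribed from Table I):

```python
# (legitimate, spam) counts transcribed from Table I.
enron = {
    "Enron 1": (3672, 1500),
    "Enron 2": (4361, 1496),
    "Enron 3": (4012, 1500),
    "Enron 4": (1500, 4500),
    "Enron 5": (1500, 3675),
    "Enron 6": (1500, 4500),
}
legit = sum(l for l, s in enron.values())
spam = sum(s for l, s in enron.values())
print(legit, spam, legit + spam)  # 16545 17171 33716

# Spam fraction per corpus: Enron 1-3 are ham-heavy, Enron 4-6 spam-heavy.
ratios = {name: round(s / (l + s), 2) for name, (l, s) in enron.items()}
```

The per-corpus ratios make explicit that the benchmark covers both ham-dominated and spam-dominated class distributions.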

Tables II, III, IV, V, VI, and VII present the performance achieved by each classifier for each Enron dataset. Bold values indicate the highest score. In order to provide a fairer evaluation, we consider the most important measures to be the spam recall rate (Sre), the legitimate recall rate (Lre), and the Matthews correlation coefficient (MCC) achieved by each filter. Additionally, we present other measures, such as the legitimate and spam precision rates, the weighted accuracy (Accw), and the total cost ratio (TCR). We employed λ = 1 in order to calculate Accw and TCR, as described in Section IV. Due to space limitations, we present only the results achieved by each evaluated classifier. A comprehensive

TABLE II
ENRON 1 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        91.33   85.09   93.48   96.36   92.86    4.054   0.831
Bool         96.00   51.61   63.32   97.49   72.78    1.064   0.540
MN TF        82.00   75.00   88.86   92.37   86.87    2.206   0.691
MN Bool      82.67   62.00   79.35   91.82   80.31    1.471   0.578
MV Bern      72.00   61.71   81.79   87.76   78.96    1.376   0.516
MV Gauss     78.67   87.41   95.38   91.64   90.54    3.061   0.765
Flex Bayes   87.33   86.18   94.29   94.81   92.28    3.750   0.813
SVM          83.33   87.41   95.11   93.33   91.70    3.488   0.796

TABLE III
ENRON 2 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        80.00   97.57   99.31   93.53   94.38    4.545   0.850
Bool         95.33   81.25   92.45   98.30   93.19    3.750   0.836
MN TF        75.33   96.58   99.08   92.13   93.02    3.659   0.812
MN Bool      74.00   98.23   99.54   91.77   93.02    3.659   0.814
MV Bern      65.33   81.67   94.97   88.87   87.39    2.027   0.652
MV Gauss     62.67   94.95   98.86   88.52   89.61    2.459   0.717
Flex Bayes   68.67   98.10   99.54   90.25   91.65    3.061   0.776
SVM          90.67   90.67   96.80   96.80   95.23    5.357   0.875

TABLE IV
ENRON 3 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        57.33   100.00  100.00  86.27   88.41    2.344   0.703
Bool         99.33   99.33   99.75   99.75   99.64    75.000  0.991
MN TF        57.33   100.00  100.00  86.27   88.41    2.344   0.703
MN Bool      62.00   100.00  100.00  87.58   89.67    2.632   0.737
MV Bern      100.00  84.75   93.28   100.00  95.11    5.556   0.889
MV Gauss     52.67   89.77   97.76   84.70   85.51    1.875   0.613
Flex Bayes   52.00   96.30   99.25   84.71   86.41    2.000   0.644
SVM          91.33   96.48   98.76   96.83   96.74    8.333   0.917

TABLE V
ENRON 4 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        94.67   100.00  100.00  86.21   96.00    18.750  0.903
Bool         98.00   100.00  100.00  94.34   98.50    50.000  0.962
MN TF        93.78   100.00  100.00  84.27   95.33    16.071  0.889
MN Bool      96.89   100.00  100.00  91.46   97.67    32.143  0.941
MV Bern      98.22   100.00  100.00  94.94   98.67    56.250  0.966
MV Gauss     94.44   100.00  100.00  85.71   95.83    18.000  0.900
Flex Bayes   94.89   100.00  100.00  86.71   96.17    19.565  0.907
SVM          98.89   100.00  100.00  96.77   99.17    90.000  0.978

TABLE VI
ENRON 5 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        89.67   98.80   97.33   79.35   91.89    8.762   0.825
Bool         87.23   100.00  100.00  76.14   90.93    7.830   0.815
MN TF        88.86   100.00  100.00  78.53   92.08    8.976   0.835
MN Bool      94.29   100.00  100.00  87.72   95.95    17.524  0.909
MV Bern      98.10   92.56   80.67   94.53   93.05    10.222  0.828
MV Gauss     86.68   96.37   92.00   73.80   88.22    6.033   0.743
Flex Bayes   88.86   98.79   97.33   78.07   91.31    8.178   0.814
SVM          89.40   99.70   99.33   79.26   92.28    9.200   0.837

TABLE VII
ENRON 6 – RESULTS ACHIEVED BY EACH FILTER

Filter       Sre(%)  Spr(%)  Lre(%)  Lpr(%)  Accw(%)  TCR     MCC
Basic        86.00   98.98   97.33   69.86   88.33    6.716   0.757
Bool         66.89   99.67   99.33   50.00   75.00    3.000   0.574
MN TF        76.67   99.42   98.67   58.50   82.17    4.206   0.661
MN Bool      92.89   97.21   92.00   81.18   92.67    10.227  0.816
MV Bern      96.22   92.32   76.00   87.02   91.17    8.491   0.757
MV Gauss     92.00   94.95   85.33   78.05   90.33    7.759   0.751
Flex Bayes   89.78   98.30   95.33   75.66   91.17    8.491   0.793
SVM          89.78   95.28   86.67   73.86   90.05    6.818   0.727

set of results, including all tables and figures, is available at http://www.dt.fee.unicamp.br/~tiago/Research/Spam/spam.htm.

Regarding the anti-spam filters, the reported experiments indicate that {SVM, MN Boolean NB, Basic NB, Boolean NB} > {Flexible Bayes, MV Bernoulli NB} >> {MN Term Frequency NB, MV Gauss NB}, where ">" means "performs better than". In general, the linear SVM presented the best average performance over all analyzed databases. Moreover, it is important to emphasize that SVM was the only classifier to achieve an accuracy rate higher than 90% for every tested corpus. Consistent with the results of Schneider [19], in our experiments the filters that use real- and integer-valued attributes did not achieve better results than the Boolean ones. However, Metsis et al. [12] showed that Flexible Bayes is less sensitive to the threshold T, which indicates that it is able to attain a high spam recall even when a high legitimate recall is required. Nevertheless, it is important to note that TCR is not really an informative measurement. For instance, for Enron 4 (Table V), SVM and MV Bernoulli NB achieved similar performances (MCC_SVM = 0.978 vs. MCC_Bern = 0.966) and their precision × recall rates are very close; however, their TCR values are very different (TCR_SVM = 90.000 vs. TCR_Bern = 56.250).

VI. CONCLUSIONS AND FURTHER WORK

In this paper, we have presented a comparative study of seven different versions of Naive Bayes classifiers and the

linear Support Vector Machine employed to automatically filter e-mail spam. We have conducted empirical experiments using six well-known, large and public databases. The results indicate that the linear SVM, Boolean NB, MN Boolean NB, and Basic NB are the best choices for automatically filtering spam. However, SVM achieved the best average performance over all analyzed databases, presenting an accuracy rate higher than 90% for every tested corpus. Furthermore, we have proposed the use of the Matthews correlation coefficient (MCC) as an evaluation measurement in order to provide a fairer comparison. We have shown that MCC provides a balanced evaluation of the prediction, especially when the two classes are of different sizes. Moreover, MCC returns a value inside a predefined range, which provides more information about a classifier's performance than other measures. Currently, we are conducting more experiments using larger datasets, such as the TREC05, TREC06 and TREC07 corpora [1], in order to reinforce the validation. We also intend to compare the approaches with other commercial and open-source anti-spam filters, such as Bogofilter and SpamAssassin, among others. Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, the spammers try to evolve their spam messages in order to overreach the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect the changes of

spam features. In this way, collaborative filters could be used to assist the classifier, accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise into spam messages in order to hinder the probability estimation. Thus, the filters should have a flexible way of comparing terms in the classification task; approaches based on fuzzy logic could be employed to make the comparison of terms more flexible.

VII. ACKNOWLEDGMENTS

This work is supported by the Brazilian Council of Technological and Scientific Development (CNPq).

REFERENCES

[1] G. Cormack, "Email spam filtering: A systematic review," Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335–455, 2008.
[2] W. Cohen, "Fast effective rule induction," in Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, July 1995, pp. 115–123.
[3] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, July 1997, pp. 143–151.
[4] R. Schapire, Y. Singer, and A. Singhal, "Boosting and Rocchio applied to text filtering," in Proceedings of the 21st Annual International Conference on Information Retrieval, Melbourne, Australia, August 1998, pp. 215–223.
[5] X. Carreras and L. Marquez, "Boosting trees for anti-spam email filtering," in Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, September 2001, pp. 58–64.
[6] H. Drucker, D. Wu, and V. Vapnik, "Support vector machines for spam categorization," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048–1054, September 1999.
[7] A. Kolcz and J. Alspector, "SVM-based filtering of e-mail spam with content-specific misclassification costs," in Proceedings of the 1st International Conference on Data Mining, San Jose, CA, USA, November 2001, pp. 1–14.
[8] J. Hidalgo, "Evaluating cost-sensitive unsolicited bulk email categorization," in Proceedings of the 17th ACM Symposium on Applied Computing, Madrid, Spain, March 2002, pp. 615–620.
[9] G. Forman, M. Scholz, and S. Rajaram, "Feature shaping for linear SVM classifiers," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 2009, pp. 299–308.
[10] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," in Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI, USA, May 1998, pp. 55–62.
[11] I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to filter unsolicited commercial e-mail," National Centre for Scientific Research "Demokritos", Athens, Greece, Tech. Rep. 2004/2, March 2004.
[12] V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam filtering with naive Bayes – which naive Bayes?" in Proceedings of the 3rd International Conference on Email and Anti-Spam, Mountain View, CA, USA, July 2006, pp. 1–5.
[13] T. Almeida, A. Yamakami, and J. Almeida, "Probabilistic anti-spam filtering with dimensionality reduction," in Proceedings of the 25th ACM Symposium on Applied Computing, Sierre, Switzerland, March 2010, pp. 1–5.
[14] T. Guzella and W. Caminhas, "A review of machine learning approaches to spam filtering," Expert Systems with Applications, 2009, in press.
[15] A. Seewald, "An evaluation of naive Bayes variants in content-based learning for spam filtering," Intelligent Data Analysis, vol. 11, no. 5, pp. 497–524, 2007.
[16] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, vol. 7, pp. 2673–2698, 2006.
[17] T. Almeida, A. Yamakami, and J. Almeida, "Evaluation of approaches for dimensionality reduction applied with naive Bayes anti-spam filters," in Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, December 2009, pp. 1–6.
[18] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proceedings of the International Workshop on Learning for Text Categorization, Menlo Park, CA, USA, July 1998, pp. 41–48.
[19] K. Schneider, "On word frequency information and negative evidence in naive Bayes text classification," in Proceedings of the 4th International Conference on Advances in Natural Language Processing, Alicante, Spain, October 2004, pp. 474–485.
[20] D. Losada and L. Azzopardi, "Assessing multivariate Bernoulli models for information retrieval," ACM Transactions on Information Systems, vol. 26, no. 3, pp. 1–46, June 2008.
[21] G. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, July 1995, pp. 338–345.
[22] J. Carpinter and R. Hunt, "Tightening the net: A review of current and next generation spam filtering tools," Computers and Security, vol. 25, no. 8, pp. 566–578, 2006.
[23] G. Cormack and T. Lynam, "Online supervised spam filter evaluation," ACM Transactions on Information Systems, vol. 25, no. 3, pp. 1–11, 2007.
[24] B. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, October 1975.
[25] T. Almeida, A. Yamakami, and J. Almeida, "Filtering spams using the minimum description length principle," in Proceedings of the 25th ACM Symposium on Applied Computing, Sierre, Switzerland, March 2010, pp. 1–5.
