Nonlinear Transformation of Term Frequencies for Term Weighting in Text Categorization

Zafer Erenel (a) and Hakan Altınçay (b,*)

(a) Department of Computer Engineering, European University of Lefke, Gemikonağı - Lefke, Northern Cyprus. E-mail: [email protected]
(b) Department of Computer Engineering, Eastern Mediterranean University, Famagusta, Northern Cyprus. E-mail: [email protected]
(*) Corresponding author. E-mail: [email protected], Tel: +90 392 6302842, Fax: +90 392 3650711

Abstract

In automatic text categorization, the influence of features on the decision is set by the term weights, which are conventionally computed as the product of a term frequency factor and a collection frequency factor. The raw form of the term frequencies or their logarithmic form is generally used as the term frequency factor, whereas the leading collection frequency factors take into account the document frequency of each term. In this study, it is first shown that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under concern. Taking this observation into account, a novel collection frequency factor is proposed which considers term frequencies. Five datasets are first examined to show that the distribution of term frequency values is task dependent. The proposed method is then shown to provide better F1 scores than two recent approaches on the majority of the datasets considered. It is confirmed that the use of term frequencies in the collection frequency factor is beneficial on tasks which do not involve highly repeated terms. It is also shown that the best F1 scores are achieved on the majority of the datasets when a smaller number of features is considered.
Keywords: Text categorization, Term weighting, Term frequency, Collection frequency factor, Document length normalization
1 Introduction
Automatic text categorization aims to group electronic documents such as news articles, technical papers, corporate web sites and university web pages on the internet or intranets into a set of predefined categories with minimum possible error and hence decrease the human effort that is required for this purpose. Bag-of-words is the most widely used document representation approach. In this technique, each document is represented as a feature vector consisting of thousands of entries where each entry corresponds to a different word or term in the vocabulary (Sebastiani, 2002). Frequency based representation of each term corresponds to using the number of times a term occurs (term frequency) as the value of the relevant feature entry. On the other hand, binary representation sets the entries in the feature vector to one for terms that exist in the document under concern and zero for the rest. For a better representation, the features are generally represented using term weights that are defined as the product of term frequency factor (or, local weight) and the collection frequency factor (or, global weight) (Lan et al., 2009; Dumais, 1990). The raw form of term frequencies has been traditionally used as the term frequency factor (Lan et al., 2009) whereas the inverse document frequency (idf ) is considered as the collection frequency factor (Debole and Sebastiani, 2004; Diederich et al., 2003). While idf is appropriate for information retrieval, it is not the best choice for text classification (Soucy and Mineau, 2005) since term weighting, which is based on the idea that some terms are more discriminative compared to the others, aims to assign larger weights to the
discriminative ones (Lan et al., 2009). Researchers studied alternative collection frequency factors for replacing the idf multiplier and various schemes are proposed for this purpose where improved F1 scores are reported (Lan et al., 2009; Liu et al., 2009b; Tsai et al., 2008; Erenel et al., 2011). In computing the collection frequency, almost all term weighting schemes take into account the document frequencies. More specifically, the number of documents that contain a given term is considered rather than the overall sum of the frequency counts of a given term inside documents across the collection. Relevance frequency (RF ) based weighting (tf × RF ) that takes into account document frequencies is recently proposed and shown to deliver the best results on several benchmark datasets (Lan et al., 2009). Following this study, Erenel et al. (2011) proposed a novel scheme (exp(α, θ)) based on the ratio of term occurrence probabilities and it is shown to surpass various existing schemes including RF on four benchmark datasets. With the use of the term frequency factor, high frequency terms are expected to have greater impact on the decision compared to those having lower frequencies. In fact, this is generally considered as desirable where the direct use of tf is supported by the experimental results (Lan et al., 2009). However, it is also observed that binary representation performs better on some categories. This means that overstating high frequency terms is harmful in some cases. With the use of a transformation, large swing in the feature value of a term due to only a modest change in the tf value can be avoided (Lan et al., 2009). Consequently, a compromising representation providing a better average performance when multiple categories are considered will be obtained. The use of the logarithm function, log(1 + tf ) for this purpose is studied mainly in information retrieval and then in text categorization (Radovanovic and Ivanovic, 2006; Bisht et al., 2010). However, successful implementations in text categorization are quite limited. In fact, Lan et al. (2009) have recently argued that, when used alone with a linear SVM, there is no significant difference between tf and log(1 + tf ). On the other hand, Hassan et al. (2006) have shown that the logarithmic mapping of tf performs significantly better compared to its raw form on three other benchmark datasets. These conflicting observations clearly indicate that the effectiveness of a nonlinear mapping on tf values depends on the dataset and the distribution of tf values
in particular. It can be roughly argued that, if the documents include terms having large tf values, the system benefits from the logarithmic transformation since overstatement of these terms is avoided. However, this argument should be further explored to be verified by empirical results. The collection frequency factor is generally computed using document frequencies. More specifically, the collection frequency factor of a term depends on the number of positive and negative documents that contain it. Incorporating tf values into the collection frequency factor was considered in information retrieval by Jones (1973) and it is recently studied for text categorization by Maleki (2010). Nevertheless, that factor does not utilize the category information. Moreover, the performance achieved is shown by Maleki (2010) to be worse compared to RF . We believe that the contribution of term frequency in formulating the collection frequency factor for text categorization has not been fully investigated yet. In this study, the use of a nonlinear mapping in computing the term frequency factor is firstly addressed. The use of the raw form of term frequency (i.e. identity mapping) and log(1 + tf ) as the term frequency factors are evaluated where the document lengths are normalized using cosine normalization. In order to make sure that the characteristics of different mappings are not affected by the normalization operation, a linear approach is also applied. In particular, the logarithm of the document length is used for the normalization of log(1+tf ) as the third scheme which is previously shown to be effective in representing the relations of terms and concepts in concise semantic analysis for text categorization (Zhixing et al., 2011). The experiments conducted on five benchmark datasets namely, Reuters-21578 ModApte Top10 split, WebKB, CSTR2009 , e-News and 7-Sectors have revealed that the relative performances of these mappings depend on the distribution of tf values. It is also questioned whether the number of occurrences of the terms in individual documents is useful for computing collection frequency factors. A novel collection frequency factor which uses the term frequencies for this purpose is proposed. This novel scheme is used together with all three forms of the term frequency factors. The experiments reveal the fact that,
when compared to using the term frequency factor alone, the gain in performance by using the proposed collection frequency factor depends on the relative individual performance of the mappings used in the term frequency factors. In particular, on CSTR2009 and e-News datasets where the identity mapping is found to be superior or equivalent to the logarithmic mapping, the proposed collection frequency factor surpasses RF and exp(α, θ). On the other hand, RF and exp(α, θ) which omit tf are observed to be useful when overstating the terms having large tf values by using the identity mapping in the term frequency factor provides worse F1 scores compared to log(1 + tf ). Experiments have also shown that the proposed collection frequency factor, which takes into consideration the damped forms of the term frequencies, provides a better compromise in representation and hence the best F1 scores on four out of the five datasets when the best-fitting mapping is used in term frequency factors. Section 2 presents a brief literature review about text categorization. The main motivation of the current study is presented in Section 3. The proposed collection frequency factor is described in Section 4. In order to verify the significance of using tf in the collection frequency factor, experiments are conducted on five benchmark datasets. The datasets used, the experiments conducted and the results obtained are presented in Section 5. The last part, Section 6 summarizes the conclusions drawn from this study.
2 A Review of Related Work
The text categorization problem involves the solution of several independent binary classification subproblems. The positive class includes the documents from the target category whereas the negative class includes those from the remaining categories. In this approach, feature selection and weighting are done separately for each category. This local policy of system generation leads to locally tuned parameter sets for each category. In particular, feature selection corresponds to computing the best-fitting subset of terms for each category where a given term can receive different weights in different categories (Zheng et al., 2004). Moreover, the relative
importance of two different terms in a given category depends on the weighting scheme. This means that one term may be more important than the other according to one scheme whereas the contrary is true for another. As a matter of fact, the relative performances of different term selection and weighting schemes generally depend on the category under concern (Altınçay and Erenel, 2010). The first phase of designing an automated text categorization system is term (feature) selection. If all terms in the vocabulary are considered, the number of features in the document vectors becomes extremely large, which leads to computationally demanding classification schemes. Moreover, some terms are not informative for the classification of documents (Chen et al., 2006). Because of this, feature selection is applied to select a subset of features (Chen et al., 2009; Yang and Pedersen, 1997; Zheng et al., 2004; Chen et al., 2005). Various measures have been investigated in the last decade for this purpose. The most popular feature selection measures are mutual information (MI), chi-square (χ2) and odds ratio (OR) (Liu et al., 2009b; Mladenic and Grobelnik, 2003; Sebastiani, 2002; Yang and Pedersen, 1997; He et al., 2003). The following step is the term weighting. In order to avoid a large swing in the feature value of a term due to only a modest change in the tf value, several nonlinear transformations are proposed. Consider a term $t_j$ having the term frequency value $tf(d_k, t_j)$ for a given document $d_k$. Instead of using the binary or identity mapping (i.e. the raw form of tf) on term frequencies, a linear or nonlinear mapping M(·) is used which maps the term frequencies in the interval $[0, tf_{max}]$ onto $[0, M(tf_{max})]$, where $1 < M(tf_{max}) < tf_{max}$. With the use of such a mapping, terms having higher frequencies will contribute more to the overall decision compared to those having lower frequencies, without being overstated. The use of such transformations on tf is extensively studied in information retrieval (Manning et al., 2008; Salton and Buckley, 1988; Singhal et al., 1995). The most common approach is to define the nonlinear mapping with the use of the logarithm function. Transformations of the form log(tf), (1 + log(tf)) and log(1 + tf) are studied (Buckley et al., 1994). Another logarithm based transformation used in information
retrieval is defined as (Manning et al., 2008)

$$\widehat{tf}_j = \frac{1 + \log(tf_j)}{1 + \log(tf_{avg})}, \qquad (1)$$
where $tf_{avg}$ is the average of the $tf_j$ values of the terms within the document. It should be noted that the terms $tf$, $tf_j$ and $tf(t_j, d_k)$ will be used interchangeably in this study depending on the context in which they appear. Alternatively, $tf$ is normalized by the maximum value when all terms in the document are considered. The general form of this linear scheme is

$$\widehat{tf}_j = a + (1 - a)\,\frac{tf_j}{\max_i tf_i}, \qquad (2)$$
where the tf values are scaled to lie in the interval [a, 1] (Salton and Buckley, 1988). The constant a is generally selected as zero or 0.5. The main disadvantage of this scheme is the effect of outlier terms having a large number of occurrences which are not representative of the document under concern (Manning et al., 2008). Although the use of the raw form of tf as the term frequency factor is more popular in text categorization, some of these transformations have already been evaluated. For instance, Yang and Liu (1999) observed that the performance of log(tf) is almost indistinguishable from that of tf in text categorization. Following this study, Lan et al. (2009) have shown that, when considered individually, tf and log(1 + tf) perform equivalently for 5000 terms. The linear transformation given in Eq. 2 is studied by Liu et al. (2009b) for a = 0 in text categorization, but a comparison with the raw form of tf is not presented. On the other hand, Hassan et al. (2006) have shown that the logarithmic mapping of tf performs significantly better than its raw form on three other benchmark datasets when SVM is used. These results show that the relative performances of different mappings are categorization problem dependent. We believe that the use of log(1 + tf) should be studied on a wider set of benchmark datasets together with the state-of-the-art collection frequency factors to analyze and clarify the potential reasons for their problem dependent relative performances.
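As a concrete illustration of these transformations, the following minimal Python sketch (our own, not taken from any of the cited studies; the function and variable names are illustrative) computes the raw, logarithmic, average-normalized (Eq. 1) and max-normalized (Eq. 2) forms of the term frequencies of a single document.

```python
import math

def tf_mappings(tf_counts, a=0.5):
    """Apply the transformations discussed above to one document.

    tf_counts : dict mapping each term to its raw frequency in the document.
    a         : constant of the max-normalized scheme in Eq. 2.
    """
    terms = [t for t, f in tf_counts.items() if f > 0]
    tf_avg = sum(tf_counts[t] for t in terms) / len(terms)   # average tf within the document
    tf_max = max(tf_counts[t] for t in terms)                # maximum tf within the document
    return {
        "raw":      {t: float(tf_counts[t]) for t in terms},
        "log":      {t: math.log(1 + tf_counts[t]) for t in terms},           # log(1 + tf)
        "avg-norm": {t: (1 + math.log(tf_counts[t]))
                        / (1 + math.log(tf_avg)) for t in terms},             # Eq. 1
        "max-norm": {t: a + (1 - a) * tf_counts[t] / tf_max for t in terms},  # Eq. 2
    }

# A short document in which two terms are repeated and one occurs only once.
print(tf_mappings({"term": 4, "weighting": 4, "svm": 1}))
```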
The term frequency of a term is proportional to the length of the document under concern. More specifically, as the length of a document increases, a particular term is expected to be repeated more often in the text. In order to eliminate the effect of differences in length, document length normalization is generally considered as the next step, even if the tf values are scaled down using the techniques listed above. Cosine normalization is the most popular document length normalization scheme (Miao and Kamel, 2011). In fact, terms such as $\max_i tf_i$ and $tf_{avg}$ in Eqs. 1 and 2 are correlated with the document lengths. Consequently, further normalization for document lengths is not necessary in such cases. Both normalized and unnormalized forms are used in information retrieval. Liu et al. (2009b) have recently shown that cosine normalization may deteriorate the performance when (1 + log(tf)) is used together with idf. It is clear that, when the document lengths are roughly the same, normalization is not necessary. On the other hand, normalization is expected to be meaningful when the document lengths are different. The collection frequency factor in term weights is generally computed using the document frequencies of the terms, for which the whole collection of documents is taken into account. More specifically, the number of documents that contain the given term is taken into account in weight estimation, whereas the frequency counts of the occurring terms in each document are not used. Asymmetric weighting schemes favor positive terms (terms that appear mostly in the positive documents) over negative ones, but the symmetric schemes give high leverage to predominantly negative terms as well (Erenel et al., 2011; Chen et al., 2006). Relevance frequency (RF) is an asymmetric technique that was recently proposed and shown to deliver the best results on several benchmark datasets (Lan et al., 2009). It is defined as a function of the information elements A and C as
$$RF(t_j) = \log\left(2 + \frac{A}{\max\{1, C\}}\right), \qquad (3)$$
where A and C denote the number of positive and negative documents which contain $t_j$, respectively. Following this study, a symmetric scheme, $\exp(\alpha, \theta)$, is proposed by Erenel et al. (2011) that is based on the ratio of the term occurrence probabilities. It is shown to surpass various existing schemes including RF on four benchmark datasets. Let $p_j^+$ and $p_j^-$ denote the percentage of positive and negative documents that contain $t_j$, respectively. Then

$$\exp(\alpha, \theta_j) = \begin{cases} e^{-\alpha\,\theta_j}, & \theta_j \le \pi/4 \\ e^{-\alpha\left(\frac{\pi}{2} - \theta_j\right)}, & \theta_j > \pi/4, \end{cases} \qquad (4)$$

where $\theta_j$ is defined as $\arctan(p_j^+/p_j^-)$ and $\alpha$ is a design parameter to be tuned.
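For reference, the two collection frequency factors above can be computed from the document counts alone, as in the following sketch (our own illustration of Eqs. 3 and 4; the natural logarithm and the handling of a zero denominator via atan2 are assumptions, since neither is fixed by the formulas above).

```python
import math

def rf(A, C):
    """Relevance frequency, Eq. 3. A (C) is the number of positive (negative)
    documents containing the term; the natural logarithm is used here since
    Eq. 3 does not fix the base."""
    return math.log(2 + A / max(1, C))

def exp_alpha_theta(p_pos, p_neg, alpha=3.0):
    """exp(alpha, theta), Eq. 4. p_pos (p_neg) is the percentage of positive
    (negative) documents containing the term."""
    theta = math.atan2(p_pos, p_neg)          # arctan(p+/p-), well defined even when p_neg = 0
    if theta <= math.pi / 4:
        return math.exp(-alpha * theta)
    return math.exp(-alpha * (math.pi / 2 - theta))

# A term occurring in 40% of the positive and 5% of the negative documents.
print(rf(A=40, C=5), exp_alpha_theta(0.40, 0.05))
```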
It is specified that α = 3 is a good choice on a wide set of categories (Erenel et al., 2011). RF is not studied together with log(1 + tf) (Lan et al., 2009). It is shown that tf × RF is superior to both tf and log(1 + tf) on two datasets, but the performance of log(1 + tf) × RF is not studied (Lan et al., 2005). Both tf × idf and log(1 + tf) × idf are used for text categorization, where the performances obtained are generally worse than those of the state-of-the-art scheme, tf × RF (Lan et al., 2005). However, it is shown that log(1 + tf) × idf performs better than tf × idf on four datasets in another study (Hassan et al., 2006). These two studies had one common dataset, namely 20 Newsgroups, and both provided better scores in favor of log(1 + tf) × idf. These results show that nonlinear mappings deserve further attention in term weighting. It should also be noted that cosine normalization is a nonlinear transformation where the normalized frequency value of a term depends not only on its tf but also on the tf values of the other terms within the document. As an example, assume that $d_1$ and $d_2$ are two documents where the vocabulary includes only three terms. Let $d_1 = [3, 4, 0]^T$ and $d_2 = [3, 2, 2]^T$, which indicates that the first term occurs three times in both $d_1$ and $d_2$. After cosine normalization, we obtain $d_1 = [3/5, 4/5, 0]^T$ and $d_2 = [3/\sqrt{17}, 2/\sqrt{17}, 2/\sqrt{17}]^T$, which means that the first term is represented by a larger value in the second document. This means that, when cosine normalization is applied on transformed term frequencies, the one-to-one relation between tfs and the values obtained after the transformation no longer exists. In order to make sure that the characteristics of different mappings are not affected by the normalization operation, it can also be done in a linear way. The logarithm of the document length can be used for the normalization of the transformed tf values in the logarithmic transformation (Zhixing et al., 2011). In this study, normalization is applied on term frequency factors.
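The toy example above can be reproduced with the following sketch (our own illustration), which contrasts cosine normalization with the linear, log-document-length normalization of Zhixing et al. (2011); taking the document length as the total number of term occurrences is an assumption.

```python
import math

def cosine_normalize(vec):
    """Divide each entry of the document vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def log_length_normalize(vec):
    """Linear alternative: divide log(1 + tf) by log(1 + document length), the
    length being taken here as the total number of term occurrences."""
    length = sum(vec)
    return [math.log(1 + v) / math.log(1 + length) for v in vec]

d1 = [3, 4, 0]
d2 = [3, 2, 2]
print(cosine_normalize(d1))      # [0.6, 0.8, 0.0]
print(cosine_normalize(d2))      # [0.73, 0.49, 0.49]: the first term grows although its tf is unchanged
print(log_length_normalize(d1))  # with equal document lengths, equal tfs keep equal transformed values
print(log_length_normalize(d2))
```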
The collection frequency factors are then multiplied by the normalized term frequency factors to compute the term weights. The last step in designing a text categorization system is training the classification scheme. Experiments conducted on various domains have shown that, due to its robustness in very high dimensional feature spaces, the Support Vector Machine (SVM) is the strongest classifier for text categorization (Lan et al., 2009; Miao and Kamel, 2011).
3 Main Motivation
The performances of different mappings used in term frequency factors are domain dependent, as mentioned above. In order to explain this behavior, consider the CSTR2009 dataset where the problem is described as the classification of technical report abstracts having limited lengths. In this case, it is expected that the term frequencies will be small. If the term frequency factor is used alone as the term weight, smoothing the tfs by applying a nonlinear transformation will further reduce the differences among the terms, leading to a possible deterioration of the performance when compared to using the raw forms of the term frequencies. On the other hand, in a different categorization problem, the documents may include frequently repeated terms, as in WebKB. In this dataset, documents are formed using the web pages of computer science departments from several universities. The documents are related mainly to students, courses, projects and faculty members. Since the topics are very specific, some terms are repeated a number of times. In such a dataset, it can be argued that the overstatement of terms having large tfs when they are used in raw form will lead to worse performance. Hence, smoothing is expected to be useful. The collection frequency factors described above and others considered in the literature are based on document frequencies where the occurrence frequency of each term is taken as either zero or one for each document to calculate the information elements (Erenel et al., 2011). The main motivation for defining a new collection frequency factor that takes into account term frequencies is the categorization problem dependent performances of different mappings used in
term frequency factors as described above. In particular, if the use of the raw form of tf values in the term frequency factor provides better scores compared to the smoothed version on some datasets, it can be argued that considering term frequencies should also be useful in computing collection frequency factors in such cases. Consider two terms, $t_i$ and $t_j$. Assume that the term $t_i$ occurs only once in the documents in which it is contained, whereas $t_j$ occurs several times in every document in which it is contained. It can be argued that $t_j$ is a more important term since it cannot be considered as an incidentally occurring term. In such a case, the term frequency values will be useful in computing the collection frequency factor. To the best of our knowledge, this has not been considered in any well-known scheme. In this study, we hypothesize that the problem dependent performances of different mappings are mainly due to the differences in the distribution of term frequency values in different text categorization tasks. In particular, shrinking the term frequencies using a transformation is advantageous when they are large. We also hypothesize that the use of term frequencies in collection frequency factors is beneficial when the term frequencies are small. However, as in the case of the term frequency factor, if some terms have high frequencies, they will be overstated if included in the collection frequency factor. In such a case, using document frequencies as in conventional approaches will be more appropriate. Based on these hypotheses, we propose a novel collection frequency factor which makes use of the term frequencies. The merit of the hypotheses and the effectiveness of the proposed scheme are investigated by a rich set of experiments on five different datasets.
4 Proposed Collection Frequency Factor
The proposed factor is based on a recent study by Zhixing et al. (2011). They defined the relation between a term $t_j$ and concept $c_i$ as

$$\sum_{k=1}^{N} H(c_i, d_k)\,\frac{\log\left(1 + tf(d_k, t_j)\right)}{\log\left(1 + length(d_k)\right)}, \qquad (5)$$
where N denotes the total number of training documents and $H(c_i, d_k) = 1$ if the document $d_k$ belongs to the concept $c_i$ and zero otherwise. Instead of cosine normalization, the denominator is used to normalize the length of the documents. The length of a document is defined as the number of terms within the document. We will refer to this as term-count normalization in the following context. It can be seen that the equation involves the transformation log(1 + tf). Inspired by this expression, we define a mapping on term frequencies as

$$\widehat{tf}_{j,k} = M(tf_{j,k}) = \frac{\log\left(1 + tf_{j,k}\right)}{\log\left(1 + length(d_k)\right)}, \qquad (6)$$
where the cosine normalization is omitted. This is due to the fact that the denominator corresponds to linear document length normalization. Consequently, the one-to-one relation between tfs and term frequency factors is preserved. Hence, the characteristic of the logarithmic mapping is not modified due to normalization. Because of these properties, this mapping can also be used as a term frequency factor in term weighting. The transformed term frequency value can be re-written as

$$\widehat{tf}_{j,k} = \log_{\left(1 + length(d_k)\right)}\left(1 + tf_{j,k}\right). \qquad (7)$$
This means that the base of the logarithm is set to be $(1 + length(d_k))$. In fact, various base values such as 2, e and 10 have been used before for term weighting in text categorization (Zhou and Zhang, 2008). Since the ultimate goal is not the ranking of terms but setting their relative importance, the choice of the base is important in weight calculation. The representation given above sets the base of the logarithm to one plus the number of terms in the document. It should be noted that, for a given tf, the linearly normalized value will be the same for documents having the same length and it will not be affected by the frequencies of the other terms in the document. The collection frequency factor of each term is computed based on the transformation given in Eq. 7. The proposed scheme is symmetric, which means that terms which mainly occur in the negative class are regarded as equally valuable as terms that occur mostly in the positive class. The main motivation for this is the superiority of the symmetric schemes on four different
benchmark datasets (Erenel et al., 2011). Let $S^+$ denote the set of training samples belonging to the positive class, having cardinality $N^+$. The negative class generally involves samples from several categories. Let $S_i^-$ denote the set of negative training samples belonging to the $i$th category, having cardinality $N_i^-$. In other words, there are $N^- = \sum_i N_i^-$ samples in the negative class. The average transformed term frequency value of $t_j$ is firstly computed for the positive class:

$$\widehat{tf}_{j,avg}^{\,+} = \frac{1}{N^+} \sum_{d_k \in S^+} \widehat{tf}_{j,k}. \qquad (8)$$

Similarly, the average value of the transformed term frequency of $t_j$ is computed for each category in the negative class as follows:

$$\widehat{tf}_{j,avg}^{\,-,i} = \frac{1}{N_i^-} \sum_{d_k \in S_i^-} \widehat{tf}_{j,k}. \qquad (9)$$
The collection frequency factor is defined as

$$W(t_j) = \begin{cases} \log\left(2 + \dfrac{\widehat{tf}_{j,avg}^{\,+}}{\epsilon + \max_i \widehat{tf}_{j,avg}^{\,-,i}}\right), & \text{if } \widehat{tf}_{j,avg}^{\,+} > \max_i \widehat{tf}_{j,avg}^{\,-,i} \\[2ex] \log\left(2 + \dfrac{\max_i \widehat{tf}_{j,avg}^{\,-,i}}{\epsilon + \widehat{tf}_{j,avg}^{\,+}}\right), & \text{otherwise,} \end{cases} \qquad (10)$$

where $\max_i \widehat{tf}_{j,avg}^{\,-,i}$ denotes the maximum mapped term frequency value obtained when the categories in the negative class are considered. $\epsilon$ is a small constant which is used to avoid division by zero. It can be seen in Eq. 10 that the weight of a term grows in proportion to the number of positive documents it exists in and also in line with its frequency in each of these documents. Similarly, if the term under concern is related to a particular category in the negative class, the weight of the term grows in proportion to the number of documents in that category containing the term and also in line with its frequency in each of these documents. The main characteristic of the proposed collection frequency factor is the employment of term frequencies, which is not the case in any of the widely used schemes. The performance of the proposed scheme can be associated with the relative performance of different mappings on tf when the term frequency factor is used alone for term weighting. Assume that only the term frequency factor is used for term weighting where diminishing the term frequencies that are greater than one using log(1 + tf) is observed to perform better compared to the raw form
for a particular dataset. In other words, avoiding the overstatement of high frequency terms is found to have a significant effect. In such a case, a collection frequency factor which considers only document frequencies is a reasonable choice. However, if the term frequency values are generally small, applying a nonlinear transformation will further reduce the differences among the terms, leading to a possible deterioration of the performance when compared to the raw form of tf. In this case, the raw form of tf should be used as the term frequency factor and the use of tfs in the collection frequency factor is expected to be beneficial as well. For this reason, the proposed scheme is expected to surpass conventional collection frequency factors particularly in cases where their performances are poor. As seen in Eq. 10, the collection frequency factor is computed by analyzing the distribution of the term frequency values separately for each category in the negative class. In order to understand the intuitive reasoning behind this, consider a term which commonly exists only in one specific category of the negative class. Assume that it rarely appears in the target category. If this term is available in a test document, then it is an important indicator that the document is a negative one. Therefore, it deserves a large weight. If the categories in the negative class were not treated separately and all negative documents were considered as a whole, the average transformed term frequency would be smaller due to the large $N^-$ and hence the generated weight would also be smaller. As a result, the negative discriminative terms would be undermined. Figure 1 illustrates the nonlinear mapping given in Eq. 7 for five different document lengths in the interval [10, 50]. It can be seen in the figure that, for a given $tf_{j,k}$ value, $\widehat{tf}_{j,k}$ shrinks as the document length grows. This is the inherent normalization used in the nonlinear transformation to eliminate the effect of varying document lengths in document representation. The proposed collection frequency factor can be used together with the cosine normalized identity mapping, the cosine normalized logarithmic mapping and the term-count normalized term frequency factors. In our simulations presented in the following section, we evaluated the proposed scheme on all three types of term frequency factors.
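The complete computation of the proposed factor can be summarized by the following sketch (our own illustration of Eqs. 6-10; the representation of documents as {term: tf} dictionaries and the use of the total number of term occurrences as the document length are assumptions).

```python
import math

def mapped_tf(tf, doc_length):
    """Eq. 6/7: log(1 + tf) normalized by log(1 + document length)."""
    return math.log(1 + tf) / math.log(1 + doc_length)

def collection_frequency_factor(term, pos_docs, neg_docs_by_cat, eps=1e-3):
    """Proposed factor W(t_j) of Eq. 10.

    pos_docs        : list of positive documents, each a dict {term: tf}.
    neg_docs_by_cat : dict {category name: list of documents} for the negative class.
    eps             : small constant that avoids division by zero (0.001 in the experiments).
    """
    def avg_mapped(docs):
        # Eqs. 8 and 9: average of the mapped term frequencies over a set of documents.
        return sum(mapped_tf(d.get(term, 0), sum(d.values())) for d in docs) / len(docs)

    tf_pos = avg_mapped(pos_docs)                                            # Eq. 8
    tf_neg_max = max(avg_mapped(docs) for docs in neg_docs_by_cat.values())  # max over Eq. 9
    if tf_pos > tf_neg_max:
        return math.log(2 + tf_pos / (eps + tf_neg_max))
    return math.log(2 + tf_neg_max / (eps + tf_pos))

# A term that is frequent in the positive class and absent elsewhere receives a large weight.
pos = [{"wheat": 3, "price": 1}, {"wheat": 2, "export": 1}]
neg = {"earn": [{"profit": 4, "price": 1}], "ship": [{"port": 2, "price": 1}]}
print(collection_frequency_factor("wheat", pos, neg))
```

A term that is concentrated in the positive class, or in a single negative category, yields a large ratio inside the logarithm and hence a large weight.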
[Figure 1: The nonlinear transformation M(tf) plotted against tf for five different document lengths (length = 10, 20, 30, 40, 50).]
5 Experiments

5.1 Datasets
In this study, five widely used datasets are used for evaluating the proposed scheme. The main characteristics of these datasets are presented in Table 1. The ModApte split of the top ten classes of Reuters-21578 (Debole and Sebastiani, 2004) is the first, where the negative classes are defined to include documents which belong to one or more of the remaining nine categories. The same policy is applied on all datasets for forming the negative class. Due to its highly imbalanced category distribution, Reuters-21578 ModApte Top10 is a significant dataset among the others. WebKB is another popular dataset that is a collection of web pages which belong to seven categories (Craven et al., 1998). They were collected by the Carnegie Mellon University Text Learning Group from several universities in 1997. Four of the categories, namely "Student", "Faculty", "Course" and "Project", which contain a total of 4199 documents, are generally used in text categorization experiments (Altınçay and Erenel, 2010). The training and test sets of the Reuters-21578 dataset are predefined. For the WebKB dataset, 4-fold cross validation is applied in the experiments and the average results are reported. The number of training documents in each fold is presented in the third column of Table 1. In the same way, 4-fold cross validation
is applied on the remaining three datasets. The third dataset, 7-Sectors, contains 4581 web pages (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/). Each page belongs to one parent category and one subcategory in hierarchical order. Similar to the previous experiments in the literature (Nunzio, 2009), seven parent categories are used. The CSTR2009 dataset is composed of 625 abstracts, each belonging to a single category, from technical reports in four research areas published in the Department of Computer Science at the University of Rochester between 1991 and 2009 (http://www.cs.rochester.edu/trs/). The e-News dataset is collected from four newspapers (http://csmining.org/index.php/e-news-datasets-.html). It includes five categories, namely "Business", "Education", "Entertainments", "Sport" and "Travel". The fourth column in Table 1 presents the total number of terms in each of the datasets. When all terms are considered, the average document lengths and the means of the average term frequency values in the documents are also presented in the last two columns of the table. It should be noted that, for a given document, the average term frequency is computed as the average of the tfs of the terms that are present in the corresponding document. As seen in the table, there are considerable differences among these datasets. In particular, the web documents in WebKB and 7-Sectors are on average longer than the documents in the other three datasets. Moreover, there is a larger number of different terms in these datasets compared to the others. Although the e-News dataset includes longer documents than CSTR2009 and Reuters-21578, its average term frequency is smaller, which means that it includes documents with a larger number of different terms, each of which is repeated a smaller number of times.
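The statistics reported in Table 1 can be reproduced from the training documents along the following lines (our own sketch; the document length is taken as the total number of term occurrences, and the average term frequency of a document is the mean tf of the terms appearing in it, as stated above).

```python
def dataset_statistics(docs):
    """docs: list of training documents, each a dict {term: tf} of the terms it contains.

    Returns the vocabulary size, the average document length (total term occurrences)
    and the mean of the per-document average term frequencies, as reported in Table 1.
    """
    vocabulary = set(term for d in docs for term in d)
    doc_lengths = [sum(d.values()) for d in docs]        # total term occurrences per document
    avg_tfs = [sum(d.values()) / len(d) for d in docs]   # mean tf of the appearing terms
    return {
        "number of terms": len(vocabulary),
        "average doc. length": sum(doc_lengths) / len(docs),
        "average term freq.": sum(avg_tfs) / len(docs),
    }

print(dataset_statistics([{"wheat": 3, "price": 1}, {"profit": 4, "price": 2, "share": 1}]))
```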
5.2 Experimental Setup
In the implementation of a text categorization system, the first step is the removal of stopwords. In this study, the SMART stoplist is used for this purpose (Buckley, 1985). Then, the Porter stemming algorithm is applied (Porter, 1980). SVMlight with default parameters and a linear kernel is employed as the classification scheme (Joachims, 1998, 1999).
Table 1: The main characteristics of the datasets used. Each row corresponds to a different training fold.

Dataset         Number of    Number of        Number of   Average       Average
                categories   training docs.   terms       doc. length   term freq.
CSTR2009        4            465              3481        90.75         1.54
                             470              3746        91.33         1.53
                             470              3830        94.60         1.54
                             470              3785        89.35         1.51
e-News          5            500              20783       322.50        1.39
                             504              20958       318.10        1.39
                             500              21266       326.25        1.39
                             500              21166       322.40        1.38
Reuters-21578   10           6491             17008       68.47         1.50
WebKB           4            3148             50041       557.05        2.58
                             3149             52364       557.67        2.66
                             3149             52308       552.89        2.63
                             3151             53022       560.63        2.64
7-Sectors       7            3434             60216       695.45        2.86
                             3435             59987       662.35        2.83
                             3435             59935       676.41        2.82
                             3439             59935       689.00        2.84
For the evaluation of different approaches, precision (P), recall (R) and the F1 score are computed separately for each category. Precision is defined as the percentage of documents which are correctly labeled as positive. Recall provides the percentage of correctly classified positive documents. However, since a categorization system can be tuned to maximize either precision or recall at the expense of the other, their harmonic mean, named the F1 score, is generally used (Liu et al., 2009a), where

$$F_1 = \frac{2 \times P \times R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}. \qquad (11)$$
TP, FP and FN denote the true positives, false positives and false negatives, respectively, for the category under concern (Wu and Tsai, 2009). The macro-F1 score is then computed as the average of the individual F1 scores (Sebastiani, 2002). On the other hand, the micro-F1 score is calculated based on the total values of TP, FP and FN over all categories. In a recent study, it is observed that the F1 scores of most weighting schemes plateau after 5000 features for SVM (Lan et al., 2009). Because of this, the top 5000 features ranked by χ2 are considered in the experiments. However, in the CSTR2009 dataset, the total number of processed terms is less than 3500 in one fold. Because of this, the top 3000 terms are used for this dataset in all four folds. In order to investigate the performances using a smaller number of features (i.e. 1000 and 2000 terms), additional experimental results are presented at the end of Section 5.3. In all experiments, the value of ε in Eq. 10 is set to 0.001.
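A minimal sketch of one binary subproblem of this setup is given below, assuming a scikit-learn based implementation (SelectKBest with the chi-square score and LinearSVC) in place of the SVMlight binary used in the paper, binary 0/1 label vectors and non-negative term-weight matrices; stop-word removal, stemming and term weighting are assumed to have been applied beforehand.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

def evaluate_category(X_train, y_train, X_test, y_test, k=5000):
    """One binary subproblem: select the top-k terms by chi-square on the training
    split, train a linear SVM on the weighted document vectors and return the
    TP, FP and FN counts on the test split."""
    selector = SelectKBest(chi2, k=min(k, X_train.shape[1])).fit(X_train, y_train)
    clf = LinearSVC().fit(selector.transform(X_train), y_train)
    pred = clf.predict(selector.transform(X_test))
    tp = int(np.sum((pred == 1) & (y_test == 1)))
    fp = int(np.sum((pred == 1) & (y_test == 0)))
    fn = int(np.sum((pred == 0) & (y_test == 1)))
    return tp, fp, fn

def macro_micro_f1(counts):
    """counts: list of (TP, FP, FN) tuples, one per category (Eq. 11)."""
    f1 = lambda tp, fp, fn: 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp_all, fp_all, fn_all = (sum(c[i] for c in counts) for i in range(3))
    micro = f1(tp_all, fp_all, fn_all)
    return macro, micro
```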
5.3 Experimental Results
The first stage of the simulations covers the use of different term frequency factors on their own. Table 2 presents the macro-F1 and micro-F1 scores obtained. Note that cosine normalization is applied on tf and log(1 + tf ) values whereas the third factor has its inherent normalization. On CSTR2009 dataset, tf performs better compared to the nonlinear mappings. On e-News
Table 2: The macro-F1 and micro-F1 scores obtained using different term frequency factors alone. Note that cosine normalization is applied on tf and log(1 + tf) values whereas the third factor has its inherent normalization.

Score      Dataset          tf      log(1 + tf)   $\widehat{tf}_{j,k}$
Macro-F1   CSTR2009         82.95   81.91         80.77
           e-News           96.24   97.69         95.29
           Reuters-21578    88.39   89.14         89.87
           WebKB            85.69   91.65         91.42
           7-Sectors        77.50   91.35         90.52
Micro-F1   CSTR2009         87.10   86.52         86.06
           e-News           96.72   97.95         96.04
           Reuters-21578    94.32   94.44         94.34
           WebKB            88.08   92.81         92.62
           7-Sectors        78.67   92.14         91.52
dataset, the relative performance depends on the normalization scheme. As a matter of fact, it can be argued that there is no remarkable gain due to smoothing the tfs by applying a nonlinear transformation. On the other hand, it is evident from the table that the nonlinear transformation provides significant improvements on the other datasets, especially on WebKB and 7-Sectors. These results verify the dataset dependent performance of different mappings. It is hypothesized in Section 3 that this dependence is mainly due to the differences in the distribution of term frequency values in different text categorization problems. This difference is evident in Table 1, which is obtained using all terms in the vocabulary. In order to evaluate the validity of this hypothesis after feature selection, consider Table 3, where the means and standard deviations of the average tf values of the selected terms (3000 in CSTR2009 and 5000 in the other datasets) in the positive and negative documents are presented for the largest and smallest categories of each dataset. More specifically, for each document, the averages of the tfs of the appearing terms are firstly computed. Then, the mean and standard deviation of these values over all positive and negative documents are calculated. As seen in the table, documents in WebKB and 7-Sectors contain terms having higher frequencies compared to the others. As a matter of fact, employing the raw form of tf results in overstatement of these terms, leading to poor performance as presented in Table 2. Figures 2 and 3 provide the histograms of the number of documents as a function of the average term frequencies for all categories in e-News and 7-Sectors, respectively, for further clarification of the differences between the two categorization tasks. The histograms are computed using the training documents in the first fold. As seen in the figures, the average term frequencies in the vast majority of documents in the e-News dataset are smaller than 2, whereas the average term frequencies are above 2 in the great majority of documents in the 7-Sectors dataset. In order to evaluate the proposed collection frequency factor, the experiments are performed for all three term frequency factors given above. As the collection frequency factors, the asymmetric RF and the symmetric exp(α, θ) are also considered for reference purposes. Table 4 presents the macro-F1 and micro-F1 scores obtained when cosine normalized tf is used as the term frequency factor.
Table 3: The means and standard deviations of the average values of tfs that are greater than zero in the positive and negative documents for the largest and smallest categories of the datasets.

Dataset         Category     Mean   Standard dev.
CSTR2009        Systems      1.58   0.28
                Robotics     1.56   0.27
e-News          Business     1.45   0.22
                Education    1.50   0.29
Reuters-21578   Earn         1.51   0.37
                Corn         1.53   0.37
WebKB           Student      2.47   1.29
                Project      2.48   2.26
7-Sectors       Technology   3.09   2.17
                Utilities    2.84   1.64
[Figure 2: The number of documents corresponding to different average term frequencies for the e-News dataset. Histograms (number of documents versus average term frequency) are shown for the Business, Education, Entertainments, Sport and Travel categories.]
[Figure 3: The number of documents corresponding to different average term frequencies for the 7-Sectors dataset. Histograms (number of documents versus average term frequency) are shown for the Basic Materials, Energy, Financial, Healthcare, Technology, Transportation and Utilities categories.]
Table 4: The macro-F1 and micro-F1 scores obtained using cosine normalized tf and the collection frequency factors RF, exp(α, θ) and the proposed scheme, W.

Score      Dataset          tf × W   tf × RF   tf × exp(α, θ)
Macro-F1   CSTR2009         82.01    81.29     81.66
           e-News           96.06    95.19     95.66
           Reuters-21578    88.26    89.46     89.61
           WebKB            87.02    86.20     87.64
           7-Sectors        81.57    85.28     89.23
Micro-F1   CSTR2009         86.68    85.45     86.35
           e-News           96.49    95.77     96.30
           Reuters-21578    94.15    94.73     94.91
           WebKB            89.25    88.64     89.95
           7-Sectors        82.35    85.08     89.37
Table 5: The macro-F1 and micro-F1 scores obtained using cosine normalized log(1 + tf) and the collection frequency factors RF, exp(α, θ) and the proposed scheme, W.

Score      Dataset          log(1 + tf) × W   log(1 + tf) × RF   log(1 + tf) × exp(α, θ)
Macro-F1   CSTR2009         82.67             81.28              81.29
           e-News           97.04             96.16              96.92
           Reuters-21578    89.68             90.08              89.73
           WebKB            92.66             91.74              91.94
           7-Sectors        92.52             94.29              95.18
Micro-F1   CSTR2009         87.36             85.77              86.11
           e-News           97.27             96.54              97.24
           Reuters-21578    94.47             94.78              94.75
           WebKB            93.46             92.99              93.18
           7-Sectors        93.05             94.50              95.60
Table 6: The macro-F1 and micro-F1 scores obtained using $\widehat{tf}_{j,k}$ and the collection frequency factors RF, exp(α, θ) and the proposed scheme, W.

Score      Dataset          $\widehat{tf}_{j,k}$ × W   $\widehat{tf}_{j,k}$ × RF   $\widehat{tf}_{j,k}$ × exp(α, θ)
Macro-F1   CSTR2009         81.79    79.67    80.04
           e-News           95.48    94.37    94.16
           Reuters-21578    91.27    90.94    91.13
           WebKB            92.48    91.32    91.77
           7-Sectors        91.60    93.36    94.42
Micro-F1   CSTR2009         86.72    84.41    85.37
           e-News           96.04    95.15    95.14
           Reuters-21578    94.81    94.73    94.94
           WebKB            93.32    92.73    93.17
           7-Sectors        92.34    93.80    95.01
As can be seen in the table, on the CSTR2009 and e-News datasets where the tfs are generally smaller, using tfs in the collection frequency factor is favorable, and better F1 scores are achieved compared to RF and exp(α, θ). On the other three datasets, where the logarithmic transformation is found to be effective when the term frequency factor is used alone, RF and exp(α, θ) provide better scores. This is consistent with our hypothesis about the use of term frequencies in the collection frequency factor. The experiments are repeated for the term frequency factor log(1 + tf) and the results are presented in Table 5. The results are consistent with the case when tf is the term frequency factor. This means that, whatever the term frequency factor is, the choice of the collection frequency factor still depends on the distribution of term frequency values in the dataset. It should be noted that the proposed weighting scheme provides the best scores on WebKB as well. Although the use of the raw form of tf in the term frequency factor is found to provide inferior results compared to the logarithmic mapping, the use of tfs in the collection frequency factor is still useful. In fact, this observation is also valid for $\widehat{tf}_{j,k}$, as given in Table 6, where the best macro-F1 scores are obtained by the proposed weighting scheme on four datasets. It should be noticed that, as seen in Tables 4, 5 and 6, the highest macro-F1 scores are achieved by the proposed collection frequency factor on four datasets (82.67 on CSTR2009, 97.04 on e-News, 91.27 on Reuters-21578 and 92.66 on WebKB). In order to assess the statistical significance of the improvements in the macro-F1 scores provided by the proposed approach, hypothesis tests are performed using the t-test approach. The null hypothesis is defined as "H0 = mean of the improvement is equal to zero" and the alternative hypothesis is defined as "H1 = mean of the improvement is greater than zero". The tests are performed between the log(1 + tf) × W and the baseline tf × RF systems. The null hypothesis is rejected at a significance level of α = 0.06, with p-values 0.055, 0.030, 0.054 and 0.00052, respectively, for the CSTR2009, e-News, WebKB and 7-Sectors datasets.
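The hypothesis test can be reproduced along the following lines (our own sketch; the use of per-category F1 improvements as samples and the example values are hypothetical).

```python
import math
from scipy import stats

def one_sided_ttest(improvements):
    """H0: the mean improvement equals zero; H1: it is greater than zero."""
    n = len(improvements)
    mean = sum(improvements) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in improvements) / (n - 1))
    t = mean / (sd / math.sqrt(n))
    p = stats.t.sf(t, df=n - 1)   # upper-tail (one-sided) p-value
    return t, p

# Hypothetical per-category macro-F1 improvements of log(1 + tf) x W over tf x RF.
print(one_sided_ttest([0.4, 1.1, -0.2, 0.9, 0.6]))
```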
Table 7: The macro-F1 and micro-F1 scores obtained using a smaller number of terms.

                            $\widehat{tf}_{j,k}$ × W   $\widehat{tf}_{j,k}$ × RF   $\widehat{tf}_{j,k}$ × exp(α, θ)
Score      Dataset          1000     2000       1000     2000       1000     2000
Macro-F1   CSTR2009         81.30    81.83      80.35    79.82      80.82    80.17
           e-News           94.49    94.23      94.36    94.15      93.96    93.70
           Reuters-21578    90.34    90.62      88.99    89.65      89.64    90.03
           WebKB            92.06    92.23      91.04    91.31      91.86    92.07
           7-Sectors        91.49    91.72      93.29    93.58      93.76    94.49
Micro-F1   CSTR2009         86.54    86.85      84.61    84.42      85.59    85.40
           e-News           95.15    95.00      95.00    94.84      94.75    94.57
           Reuters-21578    94.37    94.48      93.75    94.11      94.35    94.43
           WebKB            92.96    93.02      92.49    92.67      93.10    93.26
           7-Sectors        91.31    91.95      93.09    93.81      93.56    94.86
It is already shown by Lan et al. (2009) that the use of an insufficient number of terms slightly degrades the categorization scores. In order to evaluate the performance of the proposed term weighting scheme on a smaller number of features, the experiments are repeated for 1000 and 2000 terms. The results are presented in Table 7. As seen in the table, the scores are slightly degraded for all systems, but the proposed scheme provides the best scores on the majority of the datasets in both cases. As a final remark, it should be noted that the datasets considered have distinct characteristics that can be easily exploited to determine the best-fitting type of the collection frequency factor to maximize the categorization performance. As given in Table 3, on three datasets, the means of the average term frequency values are found to be around 1.5 whereas, on the other two, the means are around 2.5 or above and the standard deviations are considerably higher. This suggests that some terms may be downgraded excessively whereas others can be overemphasized if an inappropriate transformation is selected. Consequently, after analyzing the training data, one can easily decide whether the term frequencies should be considered in the calculation of collection frequency factors or not.
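This remark can be turned into a simple data-driven rule; the sketch below is purely illustrative, and the threshold of 2.0 is our own assumption separating the two groups of means observed in Table 3 rather than a value given in this study.

```python
def prefer_tf_based_factor(avg_term_freqs, threshold=2.0):
    """Decide from the training data whether term frequencies should enter the
    collection frequency factor. avg_term_freqs holds, for each training document,
    the average tf of the terms that appear in it; the threshold of 2.0 is an
    illustrative value between the ~1.5 and ~2.5 groups observed in Table 3."""
    mean = sum(avg_term_freqs) / len(avg_term_freqs)
    return mean < threshold  # small term frequencies favour the proposed tf-based factor
```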
6 Conclusions
In this study, it is shown that each text categorization problem has different characteristics in terms of the distribution of term frequencies, which affects the performance of the mappings considered in term frequency factors. When the term frequencies are small, it is shown that raw term frequencies provide better scores than the logarithmic mapping. On the other hand, it is observed that if high frequency terms prevail, the logarithmic transformation prevents their overstatement, leading to better scores compared to using them in raw form. Based on these observations, a novel collection frequency factor is proposed which uses term frequencies. Experiments on five datasets have verified that the proposed scheme provides better scores on categorization problems where the term frequencies are usually small. It is also shown that achieving better scores is possible on problems where greater tf values appear.
The proposed collection frequency factor uses the logarithm of the term frequencies. On the other hand, the conventional approaches take into account the number of documents that contain the term. The logarithm function can be considered as an intermediate mapping between the two extremes: using the raw term frequency values or the document frequencies. The logarithm function may not be the best choice for all datasets. As future work, we believe that alternative functions, possibly in parametric form, should be investigated. It is conceivable that the parameters can be estimated from the distribution of the term frequencies in the datasets.
References

Altınçay, H. and Erenel, Z. (2010). Analytical evaluation of term weighting schemes for text categorization. Pattern Recognition Letters, 31:1310–1323.

Bisht, R. K., Srivastava, G., and Dhami, H. S. (2010). Term weighting using term dependence. International Journal of Computer Applications, 3(10):1–3.

Buckley, C. (1985). Implementation of the SMART information retrieval system. Technical report, Cornell University, Ithaca, USA.

Buckley, C., Salton, G., Allan, J., and Singhal, A. (1994). Automatic query expansion using SMART: TREC 3. In Proceedings of the Third Text Retrieval Conference, pages 69–80.

Chen, C. M., Lee, H. M., and Hwang, C. W. (2005). A hierarchical neural network document classifier with linguistic feature selection. Applied Intelligence, 23:277–294.

Chen, C. M., Lee, H. M., and Tan, C. C. (2006). An intelligent web-page classifier with fair feature-subset selection. Engineering Applications of Artificial Intelligence, 19:967–978.

Chen, J., Huang, H., Tian, S., and Qu, Y. (2009). Feature selection for text classification with naive Bayes. Expert Systems with Applications, 36:5432–5435.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 509–516, Madison, Wisconsin, United States.

Debole, F. and Sebastiani, F. (2004). An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 56(6):584–596.

Diederich, J., Kindermann, J., Leopold, E., and Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1-2):109–123.

Dumais, S. T. (1990). Enhancing performance in latent semantic indexing (LSI) retrieval. Technical Memorandum, TM-ARH-017527.

Erenel, Z., Altınçay, H., and Varoğlu, E. (2011). Explicit use of term occurrence probabilities for term weighting in text categorization. Journal of Information Science and Engineering, 27(3):819–834.

Hassan, S., Mihalcea, R., and Banea, C. (2006). Random-walk term weighting for improved text classification. In Proceedings of TextGraphs: 2nd Workshop on Graph Based Methods for Natural Language Processing, ACL, pages 53–60.

He, J., Tan, A. H., and Tan, C. L. (2003). On machine learning methods for Chinese document categorization. Applied Intelligence, 18:311–322.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142.

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA: MIT Press, pages 169–184.
Jones, K. S. (1973). Indexing term weighting. Information Storage and Retrieval, 9:619–633.

Lan, M., Sung, S. Y., Low, H. B., and Tan, C. L. (2005). A comparative study on term weighting schemes for text categorization. In Proceedings of the International Joint Conference on Neural Networks, pages 546–551.

Lan, M., Tan, C. L., Su, J., and Lu, Y. (2009). Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):721–735.

Liu, X., Wu, J., and Zhou, Z. (2009a). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39(2):539–550.

Liu, Y., Loh, H. T., and Sun, A. (2009b). Imbalanced text classification: A term weighting approach. Expert Systems with Applications, 36:690–701.

Maleki, M. (2010). Utilizing category relevancy factor for text categorization. In Second International Conference on Software Engineering and Data Mining, Chengdu, China, pages 334–339.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, USA.

Miao, Y. Q. and Kamel, M. (2011). Pairwise optimized Rocchio algorithm for text categorization. Pattern Recognition Letters, 32:375–382.

Mladenic, D. and Grobelnik, M. (2003). Feature selection on hierarchy of web documents. Decision Support Systems, 35(1):45–87.

Nunzio, G. M. D. (2009). Using scatterplots to understand and improve probabilistic models for text categorization and retrieval. International Journal of Approximate Reasoning, 50(7):945–956.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.
Radovanovic, M. and Ivanovic, M. (2006). Document representations for classification of short web-page descriptions. In Data Warehousing and Knowledge Discovery, 8th International Conference, DaWaK 2006, Krakow, Poland, September 4-8, Lecture Notes in Computer Science 4081, Springer, pages 544–553.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24:513–523.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Singhal, A., Salton, G., Mitra, M., and Buckley, C. (1995). Document length normalization. Technical report, Cornell University, Ithaca, NY, USA.

Soucy, P. and Mineau, G. W. (2005). Beyond TFIDF weighting for text categorization in the vector space model. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1130–1135, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Tsai, R. T., Hung, H., Dai, H., Lin, Y., and Hsu, W. (2008). Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles. BMC Bioinformatics, 9.

Wu, C. H. and Tsai, C. H. (2009). Robust classification for spam filtering by back-propagation neural networks using behavior-based features. Applied Intelligence, 31(2):107–121.

Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In SIGIR-99.

Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML'97, 14th International Conference on Machine Learning, pages 412–420. Morgan Kaufmann Publishers, San Francisco, US.

Zheng, Z., Wu, X., and Srihari, R. (2004). Feature selection for text categorization on imbalanced data. SIGKDD Explorations Newsletter, 6(1):80–89.
Zhixing, L., Zhongyang, X., Yufang, Z., Chunyong, L., and Kuan, L. (2011). Fast text categorization using concise semantic analysis. Pattern Recognition Letters, 32:441–448.

Zhou, X. and Zhang, H. (2008). An algorithm of text categorization based on similar rough set and fuzzy cognitive map. In Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 18-20 October, Shandong, China.