
Sequential Classifiers Combination for Text Categorization: An Experimental Study

Zheng Zhang, Shuigeng Zhou, and Aoying Zhou
Dept. of Computer Science and Engineering, Fudan University, Shanghai, 200433, China
{zhzhang1981, sgzhou, ayzhou}@fudan.edu.cn

(This work was supported by the National Natural Science Foundation of China under grant No. 60373019.)

Abstract. In this paper, we introduce Sequential Classifiers Combination (SCC) into text categorization to improve both the classification effectiveness and the classification efficiency of the combined individual classifiers. For our experimental study we apply two classifiers sequentially, where the first classifier (called the filtering classifier) generates candidate categories for the test document and the second classifier (called the deciding classifier) selects the final category from those candidates. Experimental results indicate that when combining the Boosting and kNN methods, the combined classifier outperforms the better of the two individual classifiers; when combining the Rocchio and kNN methods, the combined classifier performs as well as kNN while its efficiency is much better than kNN's and close to Rocchio's.

1 Introduction

With the rapid growth of the number of documents in digital form, how to manage these massive document repositories and retrieve what we want from them accurately and efficiently is a serious challenge to both the database and IR communities. Automated text categorization (TC, also known as text classification), the task of assigning predefined category labels to new documents based on rules learned from a training set of documents, is one method that can help solve this problem. So far, many learning algorithms have been proposed for text categorization, including the naïve Bayes classifier [1], decision trees [2], the kNN classifier [3], neural networks [4], the Rocchio classifier [5], support vector machines [6], and the boosting method [7].

In addition to developing different TC methods, other research tries to combine multiple classifiers to achieve better performance than any individual classifier can obtain. These methods are termed multiple classifiers combination. Current combination methods include majority voting (MV) [9], weighted linear combination (WLC) [9], dynamic classifier selection (DCS) [10], and adaptive classifier combination (ACC) [11]. In these methods, the individual classifiers make judgments separately on a test document according to their respective classifying algorithms, and the final (usually better) judgment is made



based on these judgments according to a pre-specified combination rule (e.g., MV, WLC, DCS, or ACC). Because the individual classifiers perform classification independently, we call these methods parallel combination methods. Different from parallel combination, Sequential Classifiers Combination (SCC) [16] employs multiple classifiers sequentially in the same categorization task, where later classifiers utilize the results of earlier ones. However, prior to this paper, Sequential Classifiers Combination had not been used in the text categorization field. We conduct an experimental study to validate the performance of this method in text categorization. In our study, the kNN classifier is combined with the Rocchio classifier and the Boosting classifier respectively. The experimental results indicate that when we combine Boosting and kNN, the SCC classifier outperforms the best individual classifier; when we combine Rocchio and kNN, the SCC classifier performs as well as kNN, but its efficiency is much better than kNN's and close to Rocchio's.

The rest of this paper is organized as follows. Section 2 reviews related work on text classifier combination methods; Section 3 introduces the sequential classifiers combination (SCC) method; Section 4 describes the design of our experimental study of SCC; Section 5 presents and summarizes the experimental results; Section 6 concludes the paper and highlights some directions for future work.

2 Related Work

Text categorization has been extensively studied in the last decade, and many methods have been proposed. Typical text categorization methods include the naïve Bayes classifier [1], decision trees [2], the kNN classifier [3], neural networks [4], the Rocchio classifier [5], support vector machines (SVM) [6], and the boosting method [7]. For individual classification approaches, we refer the reader to [8, 12].

To further improve categorization performance, approaches that combine different classifiers in text categorization have also been investigated, such as majority voting (MV) [9], weighted linear combination (WLC) [9], dynamic classifier selection (DCS) [10], and adaptive classifier combination (ACC) [11]. In majority voting (MV), the classification judgments for a test document obtained by the k classifiers are pooled together, and the decision that reaches a majority (i.e., at least (k+1)/2 votes) is taken (k obviously needs to be an odd number). In weighted linear combination (WLC), the classifiers first individually calculate a score for each category of the test document, and a weighted sum of these scores is then computed. The weight wj reflects the expected relative effectiveness of classifier Φj, and the weights should sum to 1; they are assigned by the user and typically optimized on a validation set. The category with the largest summed score is taken as the final result. In dynamic classifier selection (DCS), the member of the classifier committee that performs best on the l training documents most similar to the test document d is selected, and its judgment on d is adopted by the committee. In adaptive classifier combination (ACC), the judgments of all the classifiers in the committee are summed together, but each individual contribution is weighted by the effectiveness that the classifier has shown on the l training documents most similar to the test document d. ACC can thus be seen as a hybrid of DCS and WLC.
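To make the MV and WLC rules concrete, here is a minimal sketch under assumed interfaces: the classifier objects and their `predict`/`score` methods are hypothetical illustrations for this paper's description, not part of any cited system.

```python
from collections import Counter

def majority_vote(classifiers, doc):
    """Majority voting (MV): each of the k classifiers casts one vote and the
    category with the most votes wins (k should be odd to avoid ties)."""
    votes = Counter(clf.predict(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]

def weighted_linear_combination(classifiers, weights, doc, categories):
    """Weighted linear combination (WLC): per-category scores are summed with
    weights w_j reflecting each classifier's expected effectiveness; the
    weights are assumed to sum to 1 and to be tuned on a validation set."""
    totals = {c: 0.0 for c in categories}
    for clf, w in zip(classifiers, weights):
        for c in categories:
            totals[c] += w * clf.score(doc, c)  # classifier's confidence that doc is in c
    return max(totals, key=totals.get)
```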

3 Sequential Classifiers Combination

All existing text classifier combination methods are essentially parallel methods: the individual classifiers make their judgments on a test document using their own classifying algorithms separately, and the final decision is then made from these judgments according to a certain rule (e.g., MV, WLC, DCS, or ACC); see Fig. 1(a) for an illustration. In contrast to parallel combination, here we introduce Sequential Classifiers Combination (SCC) into the text categorization field: two or more classifiers classify the document sequentially, where each later classifier utilizes the categorization results of the earlier ones as input; see Fig. 1(b) for an illustration.

Fig. 1. Illustration of (a) parallel combination and (b) sequential combination

Fig. 1(b) illustrates the process of classifying a document with our sequential classifiers combination method. Here two different classifiers, C1 and C2, are combined. C1 is used to generate a certain number of candidate categories for a test document dj, say c1, c2, ..., ck', and C2 is then used to select the final category (say ci, 1 ≤ i ≤ k') from the candidates generated by C1. Intuitively, we call the first classifier the filtering classifier; it is responsible for filtering out the relatively large number of categories to which the test document is unlikely to belong. The second classifier is termed the deciding classifier; it makes the final categorization decision on the test document within a relatively small set of candidate categories. The number of candidate categories generated by the filtering classifier (denoted k') is a system parameter of SCC, which can be selected by the user or tuned experimentally to yield optimal performance.

To achieve good categorization performance with regard to both effectiveness and efficiency, we must be careful about the configuration of SCC, that is, how to select the filtering classifiers, the deciding classifier, and the number k' of candidate categories generated by the filtering classifiers. Intuitively, for the deciding classifier we should pay more attention to precision than to efficiency, since it makes the final categorization decision. For the filtering classifiers, however, we do not require very high precision, but their efficiency must be emphasized, because their task is only to select candidate categories (rather than the final result) from the whole category space. More formally, let the combination be SCC = {C1, C2, ..., Ck}, where C1, C2, ..., Ck-1 are filtering classifiers and Ck is the deciding classifier. Denoting by p(Ci) and e(Ci) the precision and efficiency of classifier Ci respectively, a reasonable configuration of SCC should satisfy p(C1) ≤ p(C2) ≤ ... ≤ p(Ck) and e(C1) ≥ e(C2) ≥ ... ≥ e(Ck). As for the setting of k', a general rule is: if the filtering classifier has high precision, a small value of k' can be selected; otherwise, a larger one should be chosen.
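The classification step of a two-classifier SCC can be summarized in a few lines. The sketch below is ours, not pseudocode from [16]; the `score` and `predict(..., restrict_to=...)` interfaces of the two classifiers are assumptions for illustration.

```python
import heapq

def scc_classify(filtering_clf, deciding_clf, doc, categories, k_prime):
    """Sequential Classifiers Combination with one filtering classifier."""
    # Step 1: the cheap filtering classifier keeps the k' highest-scoring
    # candidate categories, discarding the rest of the category space.
    candidates = heapq.nlargest(
        k_prime, categories, key=lambda c: filtering_clf.score(doc, c))
    # Step 2: the more precise (and usually slower) deciding classifier
    # makes the final decision within the small candidate set only.
    return deciding_clf.predict(doc, restrict_to=candidates)
```

With a cheap filtering step, the expensive deciding classifier only ever considers k' categories instead of the full category space, which is the source of the efficiency gains measured in Section 5.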

4 Sequential Classifier Combination for Text Categorization

To demonstrate the performance of the SCC method on the TC task, we conduct an experimental study. Our goal is to improve the kNN classifier's performance by combining it with two other classifiers, the Rocchio classifier and the Boosting classifier. We build two SCC classifiers: SCC1 = {Rocchio, kNN} and SCC2 = {Boosting, kNN}. That is, kNN is used as the deciding classifier, while the Rocchio classifier and the Boosting classifier play the role of the filtering classifier. Experimental results show that the SCC classifiers outperform the kNN classifier in both classification precision and classification efficiency.

kNN is one of the most effective methods in text classification, but it suffers from an efficiency problem, for two reasons. First, kNN is an example-based (or lazy) method: it defers the decision on how to generalize beyond the training data until each new query instance is encountered [8]. Second, in order to find the k documents most similar to the test document d, kNN must compare d with all the documents in the training set using the cosine formula under the TF-IDF weighting scheme. The time complexity of the kNN algorithm is O(|D|Nt), where |D| is the number of training documents and Nt is the number of test documents. In contrast, the Rocchio classifier, whose classification algorithm is quite similar to kNN's, is much more efficient, because Rocchio compares the test document d with the "profile" of each category rather than with all the documents in the training set. The time complexity of Rocchio is O(|C|Nt), where |C| is the number of categories; typically, |C| is much smaller than |D|. Although Rocchio is a very efficient method, its classification precision is rather poor compared to kNN's. Combining kNN and Rocchio can therefore merge the advantages of kNN's precision and Rocchio's efficiency. The motivation for combining kNN and Boosting is to demonstrate the general nature of the SCC method, since the learning algorithm of Boosting is quite different from those of Rocchio and kNN and its performance is well established. In the following, we briefly introduce the kNN, Rocchio, and Boosting classification methods.

4.1 kNN Classifier

The kNN (k-nearest neighbor) classifier, which is based on the term vector space model, is an effective method in text categorization. Given a test document d, kNN's classification process consists of two steps. In the first step, the classifier finds the k documents in the training set that are most similar to d; the similarity between two documents is usually calculated with the cosine formula under the TF-IDF weighting scheme [13]. In the second step, the classifier assigns a score to each category and ranks the candidate categories by score. In the single-label case, the top category in the ranking is the category kNN assigns to the test document. Formally, for a test document d, the score of category ci is calculated as follows:

$$\mathrm{score}(d, c_i) = \sum_{d_j \in kNN(d) \,\wedge\, d_j \in c_i} \mathrm{sim}(d, d_j) \;-\; b_i \qquad (1)$$

Here kNN(d) is the set of the k documents nearest to d, and bi is a per-category threshold that can be tuned on the training set.
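A direct transcription of Eq. (1) might look like the following sketch. The sparse-vector cosine similarity and the layout of `training_set` as (TF-IDF vector, label) pairs are our assumptions for illustration.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two sparse TF-IDF vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_score(d, category, training_set, k, b_i):
    """Eq. (1): sum the similarities of the k nearest neighbours of d that
    belong to `category`, then subtract the per-category threshold b_i."""
    knn_d = heapq.nlargest(k, training_set, key=lambda item: cosine(d, item[0]))
    return sum(cosine(d, vec) for vec, label in knn_d if label == category) - b_i
```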

4.2 Rocchio Classifier

The Rocchio classifier is a TC method whose roots lie exclusively in the IR tradition; it is a typical example of the profile-based classifiers, which extract an explicit profile of each category from the training set. The Rocchio method computes a classifier $(w_{1i}, w_{2i}, \ldots, w_{ri})$ for category $c_i$ by the following formula [8]:

$$w_{ki} = \frac{\beta}{|\{d_j \mid ca_{ij} = 1\}|} \sum_{\{d_j \mid ca_{ij} = 1\}} w_{kj} \;-\; \frac{\gamma}{|\{d_j \mid ca_{ij} = 0\}|} \sum_{\{d_j \mid ca_{ij} = 0\}} w_{kj} \qquad (2)$$

In this formula, caij = 1 means that document dj belongs to category ci, while caij = 0 indicates that it does not. The documents that belong to ci are called "positive" examples of ci, and those that do not are called "negative" examples; β and γ are control parameters that set the relative importance of positive and negative examples. With β = 1 and γ = 0, the profile of each category is exactly the centroid of its positive training examples. Assigning a category to a document d then comes down to calculating the similarity of d with the profile of each category and selecting the most similar one. This method is quite easy to implement, and the resulting classifier tends to be very efficient compared to the kNN classifier; in terms of classification effectiveness, however, it is not a strong TC method.
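As a minimal sketch of Eq. (2), with the same sparse-dictionary document representation assumed above, the profile of a category could be computed as follows; the defaults β = 1, γ = 0 reproduce the positive-centroid special case just mentioned.

```python
def rocchio_profile(category, training_set, beta=1.0, gamma=0.0):
    """Eq. (2): a category profile is the beta-weighted centroid of its
    positive examples minus the gamma-weighted centroid of its negative
    examples; documents are sparse dicts term -> TF-IDF weight."""
    pos = [vec for vec, label in training_set if label == category]
    neg = [vec for vec, label in training_set if label != category]
    profile = {}
    for docs, coeff in ((pos, beta / len(pos) if pos else 0.0),
                        (neg, -gamma / len(neg) if neg else 0.0)):
        for vec in docs:
            for term, w in vec.items():
                profile[term] = profile.get(term, 0.0) + coeff * w
    return profile
```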


4.3 Boosting

Boosting is originally a machine learning technique, later introduced into text classification by Schapire [7]. The main idea of boosting is to combine many simple and moderately inaccurate categorization rules into a single, highly accurate categorization rule. We must point out that the concept of combination used by boosting is quite different from the combination proposed in this paper: boosting combines classifiers in the step of constructing the classifier, whereas the combination discussed here happens in the step of classifying documents with already-trained classifiers. The boosting method builds k different simple and moderately inaccurate classifiers, trained sequentially, one after another. The training of each classifier takes into account how the previous classifiers performed on the training set, and concentrates on getting the categories right for the documents on which the previous classifiers performed worst. After all k classifiers have been constructed, a weighted linear combination rule is applied to yield the final classifier.
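As a minimal illustration of this reweighting idea, here is a sketch of binary AdaBoost; the multi-label BoosTexter algorithm actually used in [7] is more involved, and the `weak_learner` interface is our assumption.

```python
import math

def adaboost(train, weak_learner, rounds):
    """Minimal binary AdaBoost: `train` is a list of (x, y) with y in {-1, +1};
    `weak_learner(train, dist)` fits a hypothesis h(x) -> {-1, +1} under the
    distribution `dist` over training examples and returns it."""
    n = len(train)
    dist = [1.0 / n] * n              # start from a uniform distribution
    ensemble = []                     # list of (alpha, hypothesis) pairs
    for _ in range(rounds):
        h = weak_learner(train, dist)
        err = sum(w for w, (x, y) in zip(dist, train) if h(x) != y)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard the log below
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Up-weight the examples the current hypothesis got wrong, so the
        # next weak learner concentrates on the hardest documents.
        dist = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(dist, train)]
        z = sum(dist)
        dist = [w / z for w in dist]
    # The final classifier is a weighted linear combination of the k rounds.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```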

5 Experimental Results

The Chinese text collection used in our experiments consists of news reports from the People's Daily. The whole data set has 2850 documents, manually categorized into 20 groups; we call this data set G1. In order to experiment on different data sets, we divide G1 into G2 and G3. Table 1 shows the name of each category and the number of documents it contains: the 10 categories in the first column form data set G2, the 10 categories in the third column form data set G3, and all the documents together form G1.

Table 1. Experimental data set

Category (G2)    No. of documents    Category (G3)      No. of documents
Politics         617                 Mine               67
Sports           350                 Military           150
Economy          226                 Computer           109
Agriculture      86                  Electronic         55
Environment      102                 Communication      52
Space            119                 Energy             65
Art              150                 Philosophy         89
Education        120                 History            103
Medic            104                 Law                103
Transport        116                 Literature         67

In our experiments, for each category we choose 70% of the documents as the training set and the remaining 30% as the test set. Each test is repeated 5 times, selecting the training and test documents randomly each time, and we report the average over the 5 trials.

We preprocess the Chinese documents by removing low-frequency words, i.e., words that appear fewer than 2 times in a document. We do not remove stopwords, which are filtered out by our feature selection algorithm. A particular challenge is how to obtain words or features from Chinese documents. In English documents the words are naturally separated by blanks and thus easily obtained, whereas a Chinese document is a string of Chinese characters with no separators between words, which makes word extraction difficult. Generally, there are two methods to obtain words from Chinese documents. One is to use a segmentation procedure with dictionary support [13]; this method is very complex, and its accuracy is not proportionate to its complexity. The other is to use N-grams [14] as the features of Chinese documents; although the semantics of an N-gram is less precise than that of a real word, this method is quite efficient. In our experiments, we use the second method.

We use the global dimensionality reduction (DR) technique with the information gain (IG) criterion [8] to reduce the high dimensionality of the term space. Unless mentioned otherwise, the number of features in our experiments is 500. We use macro-average F1 and micro-average precision [8] as the evaluation measures for the text classifiers.

Table 2 compares the performance of the 3 classifiers on data sets G1, G2, and G3. All parameters are tuned to yield the best performance: for kNN, we set the neighborhood size k = 10 on G2 and G3, and k = 20 on G1; for Boosting, we set the number of training rounds to 400 on all three data sets. The experimental results indicate that kNN performs best on our Chinese data sets and Boosting performs worst.
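For concreteness, the character N-gram feature extraction described above can be sketched as follows; applying the same fewer-than-2-occurrences cut-off to bigrams is our reading of the preprocessing step, and N = 2 is an illustrative choice.

```python
from collections import Counter

def char_bigram_features(text, min_count=2):
    """Character bigrams (N-grams, N = 2) as features of a Chinese document,
    dropping grams that occur fewer than `min_count` times, mirroring the
    low-frequency cut-off of the preprocessing step."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}
```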

Table 2. Performance of three different classifiers

            Macro-average F1           Micro-average precision
            G1      G2      G3         G1      G2      G3
Rocchio     0.753   0.856   0.780      0.779   0.889   0.751
kNN         0.772   0.898   0.794      0.815   0.931   0.798
Boosting    0.588   0.787   0.706      0.634   0.822   0.705

Results of the sequential combination of Rocchio and kNN are shown in Figs. 2, 3, and 4. Figs. 2 and 3 show the effectiveness of the individual classifiers and the combined classifier, evaluated by macro-average F1 and micro-average precision respectively; Fig. 4 shows the efficiency of each classifier, measured by classification time. This test is over G1 (20 categories in total), but the trends on G2 and G3 are very similar. The x-axis of each figure is the number of candidate categories in SCC (denoted k'). In this experiment, Rocchio serves as the filtering classifier and kNN as the deciding classifier.


Figs. 2 and 3 indicate that when k' = 1, the effectiveness of SCC is equal to Rocchio's; when k' = 2, it improves sharply and comes close to kNN's; and when k' = 4, SCC outperforms kNN. The effectiveness of SCC for k' = 11~20 is not shown in Figs. 2 and 3, but as k' approaches 20, the effectiveness of SCC tends toward that of kNN. Fig. 4 shows that the classification time of SCC is nearly a linear function of k'. From Figs. 2, 3, and 4 we can see that at k' = 4, SCC outperforms the best individual classifier (i.e., kNN), while its classification time is about one fourth of kNN's.

Fig. 5 shows the results of combining Boosting and kNN by SCC, with Boosting as the filtering classifier and kNN as the deciding classifier. This test is done on data set G3, which has 10 categories in total. From Fig. 5 we can see that when k' = 4, SCC yields its best effectiveness and outperforms the best individual classifier (i.e., kNN) by more than 2 percentage points.

We also investigate the performance of SCC and the individual classifiers with different feature set sizes. The results indicate that SCC performs best with a medium-sized feature set. Fig. 6 shows the micro-average precision for different numbers of features on data set G1; the parameter k' in SCC is tuned to yield the best effectiveness for each number of features.

Fig. 2. Macro-average F1 of kNN, Rocchio and SCC with different k'

Fig. 3. Micro-average precision of kNN, Rocchio and SCC with different k'


Fig. 4. Classification time (in seconds) of kNN, Rocchio and SCC with different k'

Fig. 5. Macro-average F1 of kNN, Boosting and SCC with different k' values

Fig. 6. Micro-average precision of kNN, Boosting and SCC for different feature sizes

6 Concluding Remarks

In this paper, we introduced Sequential Classifiers Combination (SCC) into the text categorization field and conducted an experimental study to test its performance. The experimental results indicate that the SCC method can improve both the efficiency and the effectiveness of individual text classifiers. As for future work: first, we will test the SCC method on more text collections, including standard benchmark collections such as the Reuters collections; second, we would like to test more SCC classifiers built by combining different classifiers; and third, we plan to carry out a comprehensive comparison between the SCC method and parallel combination methods.

References

[1] Lewis, D. D. 1992. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, pp. 37-50.
[2] Lewis, D. D. and Ringuette, M. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1994), pp. 81-93.
[3] Yang, Y. 1994. Expert network: effective and efficient learning from human decisions in text categorization and retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 13-22.
[4] Wiener, E., Pedersen, J. O., and Weigend, A. S. 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1995), pp. 317-332.
[5] Hull, D. A. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, IE, 1994), pp. 282-289.
[6] Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
[7] Schapire, R. E., Singer, Y., and Singhal, A. 1998. Boosting and Rocchio applied to text filtering. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 215-223.
[8] Yang, Y. and Liu, X. 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, US, 1999), pp. 42-49.
[9] Larkey, L. S. and Croft, W. B. 1996. Combining classifiers in text classification. In Proceedings of SIGIR-96, pp. 81-93.
[10] Woods, K., Kegelmeyer, W. P., Jr., and Bowyer, K. 1997. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4): 405-410.
[11] Li, Y. H. and Jain, A. K. 1998. Classification of text documents. The Computer Journal, 41(8): 537-546.
[12] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1-47.
[13] Salton, G. and McGill, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.
[14] Zhou, S., Guan, J., and Hu, Y. 2002. Chinese documents classification based on N-grams. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), LNCS 2276, pp. 405-414.
[15] Larkey, L. S. 1998. Automatic essay grading using text categorization techniques. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 90-95.
[16] Rahman, A. F. R. and Fairhurst, M. C. 1999. Serial combination of multiple experts: a unified evaluation. Pattern Analysis and Applications, 2(4): 292-311.
