AutoPCS: A Phrase-based Text Categorization System for Similar Texts

Zhixu Li1, Pei Li1, Wei Wei1, Hongyan Liu2, Jun He1, Tao Liu1, Xiaoyong Du1
1 Information School, Renmin University of China, Beijing, China
2 Tsinghua University, Beijing, China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. Nearly all text classification methods classify texts into predefined categories according to the terms appearing in them. State-of-the-art text classifiers usually simply take a word as a term, since this performs well on some well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics in the same domain. With words alone we cannot categorize such texts effectively, since they share nearly the same word set. We therefore hypothesize that the results might improve if we also use phrases as terms. To test this hypothesis, we propose our own phrase extraction method, and select a suitable feature selection method and classifier through an experimental study on a data set of paper abstracts from the Databases field. Accordingly, we have also developed a system called AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts. Keywords: Text Classification, Phrase-based, BOP, Similar Texts

1 Motivation

When a research paper is submitted to a conference, some submission systems ask the contributor to choose several relevant topics for the paper from a given predefined topic list, so that the conference organizers can assign appropriate reviewers to judge the paper. The predefined topics may cover most of the research areas in a given domain. A snapshot of a manual relevant-topic selection system is shown in Fig. 1. This topic selection process can be regarded as manual text classification. It may be simple and accurate if the researcher is familiar with most of the topics in the domain, but it becomes unreliable and time-consuming if the researcher is a newcomer to the field who does not know most of those topics (a common situation, since a large number of contributors are students). In addition, some conferences ask not the contributors but experts in the domain to choose relevant topics for each paper. This can take a great deal of time, since the experts must at least read the papers' abstracts. When two experts

decide whether or not to classify a document under a certain category, they may disagree, and this in fact happens with relatively high frequency. Moreover, manual classification can be careless, which also introduces misclassifications. To address these problems, we would like to add an automatic relevant-topic recommendation or verification function to the manual-categorization submission system by developing an Automatic Publication Categorization System, AutoPCS.

(a) Relevant topics selection

(b) Predefined topic list

Fig. 1. Snapshot of the manual relevant topic selection in submission system of PAKDD2009

As a fundamental task in Information Retrieval and Data Mining, text categorization (or text classification) has been studied extensively over the past several decades. The classical approach to text categorization represents a document in a word-based space, i.e. as a vector in a high-dimensional Euclidean space where each dimension corresponds to a word. The main drawback of the word-based representation, however, is that it destroys the semantic relations between words by treating the words of a phrase separately [7]. A classical example proposed in [7] is "White House" or "Bill Gates". Given a BOW (Bag-Of-Words) representation of a document in which the words "bill" and "gates" occur, one might guess that the document is about accounting or gardening, but not about computer software. Given a representation that contains the phrase "bill gates", by contrast, the reader will hardly be mistaken about the topic of discussion. These fairly obvious observations led researchers to the idea of enriching the BOW representation with phrases. The Bag-Of-Bigrams (pairs of consecutive words) was first proposed in the early 1990s [8]. Since then, dozens of articles have been published on this topic. While some researchers report significant improvements in text categorization results, many show only marginal improvement or even a certain decrease. In our view, however, the word-based BOW is usually effective enough because we are typically classifying texts from different fields, such as "Corporate/Industrial", "Markets" and "Government Finance" in the Reuters dataset. Because each category has some exclusively-used domain words, adding phrases to the bag cannot improve the results much. The situation becomes quite different, however, when classifying similar texts. Similar texts are texts on similar topics that share nearly the same word set, e.g. texts from the two similar topics "Data Mining" and "Text Mining".
The three words "data", "mining" and "text" are very common in texts on either topic. Since similar texts share nearly the same word set, it is difficult to classify them by words (or BOW) alone. But different topics have their own characteristic terms, which are usually phrases; therefore, a phrase-based representation (which we call Bag-Of-Phrases, or simply BOP) is expected to be more effective. In this paper we would like to

review recent related work on the problem of using phrases for text classification, and then apply BOP to classifying similar texts in our dataset. The rest of the paper is organized as follows: Section 2 describes the text categorization problem; Section 3 briefly reviews related work on text classification, especially the most recent work on using phrases; Section 4 proposes our BOP method for incorporating words and phrases in document representations; Section 5 presents our experimental study on classifying similar texts; Section 6 gives an introduction to the automatic paper categorization system (AutoPCS) we developed for paper submission systems. Finally, we conclude in Section 7.

2 Problem Statement

In order to better state the problem in text classification, we give it a simple formulation as follows. Assume we are given a training set:

    𝑇 = {⟨𝑑𝑖, 𝑐𝑗⟩ | 𝑑𝑖 ∈ 𝐷, 𝑐𝑗 ∈ 𝐶}    (1)

In formula (1), each document 𝑑𝑖 belongs to a document set 𝐷, and the label 𝑐𝑗 is drawn from a predefined set of categories 𝐶 = {𝑐1, 𝑐2, …, 𝑐𝑚}. Each pair ⟨𝑑𝑖, 𝑐𝑗⟩ indicates that 𝑑𝑖 is labeled as 𝑐𝑗. The goal of text categorization is to devise a learning algorithm that, with the help of the training set 𝑇, generates a classifier or hypothesis ℎ: 𝐷 → 𝐶 that can accurately label each unlabeled document in 𝐷.

Designing a learning algorithm for text classification usually follows the classical approach in pattern recognition: data instances (i.e. documents) first undergo a dimensionality-reduction transformation, and then a classifier learning algorithm is applied to the low-dimensional representations. The same transformation is also performed before applying the learned classifier to unseen instances. The incentives for using dimensionality reduction are to improve classification quality and to reduce the computational complexity both of the learning algorithm and of applying the classifier to unseen documents.

Typically, dimensionality reduction techniques fall into two basic schemes: feature selection and feature generation. Feature selection, also known as feature reduction, tries to select the subset of features (words, in text classification) most useful for classification. After a suitable subset is selected, the reduced representation of a document is computed by projecting the document onto the selected words. In contrast, feature generation, also known as feature induction, tries to generate new features, not necessarily words, for the representation. Usually, the new features are synthesized from the original feature set. After feature selection or feature generation, the next step is to choose a proper classifier.
Many excellent algorithms have been proposed by researchers in this field, such as Naïve Bayes, Bayes Networks, K Nearest Neighbors, Decision Trees, Decision Rules, Neural Networks, SVMs and so forth. Different classifiers perform best in different situations.
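The select-then-project scheme described above can be illustrated with a minimal Python sketch. The function names and the document-frequency selection criterion used here are our own illustrative choices, not the paper's implementation:

```python
from collections import Counter

def df_feature_selection(docs, drop_top=5, min_df=2):
    """Select features by document frequency: drop the drop_top most
    common terms (near stop-words under Zipf's law) and terms appearing
    in fewer than min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    ranked = [t for t, _ in df.most_common()]
    return {t for t in ranked[drop_top:] if df[t] >= min_df}

def project(doc, features):
    """Reduced representation: term counts over the selected features only."""
    counts = Counter(doc.lower().split())
    return {t: counts[t] for t in features if counts[t] > 0}
```

Any classifier from the list above can then be trained on the projected representations instead of the full vocabulary.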

3 Related Works

The classical approach to text categorization has so far been to represent a document in a word-based space, i.e. as a vector in a high-dimensional Euclidean space where each dimension corresponds to a word. This method relies on classification algorithms trained in a supervised learning manner. In the early days of text categorization, classifier design advanced significantly [1], with many strong learning algorithms emerging, such as [2], [3], [4]. Later, despite numerous attempts to introduce more sophisticated techniques for document representation, the simple-minded independent word-based representation, known as bag-of-words (BOW), remained very effective. Indeed, to date the best multi-class, multi-labeled categorization results for the well-known Reuters-21578 dataset are based on the BOW representation [5] [6]. Considerable effort has been expended on devising a document representation richer than BOW. A widely explored approach uses n-grams of words (or phrases) in addition to or in place of single words (unigrams). However, after many years of unsuccessful attempts to improve text categorization results by applying n-grams (usually n=2), many researchers agree that there may be a certain limitation on the usefulness of phrases for text categorization. According to [7], this can probably be explained by two considerations: (a) the results achieved on these corpora are so high that they probably cannot be improved by any technique, because all the incorrectly classified documents are basically mislabeled; and (b) the corpora are "simple" enough that only a few extracted keywords can do the entire job of distinguishing between categories. There are mainly two kinds of approaches to incorporating word n-grams into the document representation: the first applies n-grams together with unigrams, while the second excludes unigrams from the representation and relies on n-grams only.
However, in most cases the second approach leads to a certain decrease in categorization results, while the first approach can potentially improve them. This observation indicates that the simple BOW representation is powerful enough that classification results probably cannot be improved by replacing the BOW representation, but only by extending it. We now survey the state of the art in incorporating word n-grams into the document representation. [9] uses a document representation based on Noun Phrases (obtained by shallow parsing) and Key Phrases (the most meaningful phrases, obtained by the Extractor system). The results achieved by either scheme are roughly the same as their baseline with the BOW representation. [10] uses both unigrams and bigrams as document features and extracts the top-scored features using various feature selection methods. Their results indicate that, in general, bigrams predict categories better than unigrams. However, despite the fact that bigrams represent the majority of the top-scored features, using bigrams does not yield significant improvement in categorization results with the Rocchio classifier. [11] combines BOW and Bag-Of-Ngrams (BON) as document features; by n-grams the authors mean all continuous word sequences in texts. They use several common classifiers, with the highest results obtained by the SVM classifier. [12] performs feature induction with a combination of single words and word pairs. The word pairs are of the Head/Modifier type, i.e. nouns are extracted with their modifiers. The authors show

that using pairs without BOW decreases the results of both classifiers, while using both pairs and BOW yields results marginally above the BOW baseline. For extracting bigrams, [13] uses the following method: first, words are sorted according to their document frequency and only highly ranked words are considered. Then bigrams are extracted such that at least one component belongs to those highly ranked words. After that, the authors filter the resulting bigrams according to their tf·idf and Mutual Information with respect to a category. One of the few relatively successful attempts at using bigrams is demonstrated by [14], who propose a very sophisticated feature induction technique to improve text categorization results on the Reuters and ComputerSelect datasets. They apply a string distance measure similar to the String Kernel [15]. Based on this measure, the authors introduce a score by which they rank bigrams, and then extract the highly ranked bigrams such that less than 1% of all bigrams are kept. Using the SVM classifier, the authors achieve a significant improvement on the ComputerSelect dataset, while the improvement on the Reuters dataset is again statistically insignificant. Nevertheless, the Reuters result is highly noticeable: an 88.8% break-even point is clearly a state-of-the-art result. The success of this technique may be explained by the fact that documents of the Reuters dataset are very well structured (many are not even free text but tables), and the string similarity method used by the authors manages to capture this clear structure.

4 Phrase-based Text Classification for Similar Texts

Now we propose our own method for classifying similar texts. Most of the time, we classify texts from totally different fields, each of which has some exclusively-used domain words; in that case the word-based BOW is effective enough. But the situation becomes quite different when it comes to classifying texts on similar topics.

4.1 Similar Texts Analysis

Since similar texts are texts on similar topics that share nearly the same word set, we cannot classify them by words alone. In this situation, a phrase-based representation (BOP) is expected to be much more effective than BOW. Take the example mentioned above: we collected a small document set containing 158 documents from the two similar topics "Data Mining" and "Text Mining", with 120 documents from "Data Mining" and the other 38 from "Text Mining". As shown in Table 1 below, the average numbers of occurrences of the three words "text", "data" and "mining" in documents of the two topics are not very different, but the count of "text mining" in "Text Mining" documents is clearly higher than in "Data Mining" documents. Conversely, the count of "data mining" in "Data Mining" documents is much higher than in "Text Mining" documents.

Table 1. The average number of occurrences of "text", "data", "mining", "text mining" and "data mining" in the small document set, which consists of 120 documents in "Data Mining" and 38 documents in "Text Mining".

Term          120 documents in "Data Mining"   38 documents in "Text Mining"
text          0.20                             0.51
data          0.52                             0.36
mining        0.50                             0.41
text mining   0.08                             0.36
data mining   0.42                             0.05
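The contrast Table 1 illustrates can be reproduced with a few lines of Python. The two toy document lists below are illustrative stand-ins, not the actual 158-abstract collection, so the numbers will not match Table 1:

```python
from collections import Counter

def avg_count(docs, term):
    """Average number of occurrences of a word or phrase per document."""
    n = len(term.split())
    total = 0
    for doc in docs:
        words = doc.lower().split()
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += Counter(grams)[term]
    return total / len(docs)

dm_docs = ["we study data mining on large data sets",
           "efficient data mining of transaction data"]
tm_docs = ["text mining extracts knowledge from text",
           "a survey of text mining methods"]

# The unigram "mining" hardly separates the topics...
print(avg_count(dm_docs, "mining"), avg_count(tm_docs, "mining"))
# ...but the bigrams "data mining" / "text mining" do.
print(avg_count(dm_docs, "data mining"), avg_count(tm_docs, "data mining"))
```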

The same problem also exists between other pairs of similar topics, such as "Information Retrieval" and "Web Search", or "Information Security" and "Privacy and Trust". There are still other similar topics that do not share exactly the same word set, but whose word sets have a large intersection; in this situation, BOW is not very applicable either. Therefore, to classify documents from similar topics more effectively, we propose to combine phrases with words when classifying similar texts.

4.2 Using BOP in Text Categorization

It is necessary to explain that the phrases defined in our bag-of-phrases (BOP) method differ from those in [9]. In [9], phrases are only Noun Phrases (obtained by shallow parsing) and Key Phrases (the most meaningful phrases obtained by the Extractor system); in our BOP method, phrases are frequently used continuous word sequences (including single words) in texts. To extract them, we use an n-gram word sequence extractor that obtains all word sequences no longer than n (including unigrams); only those word sequences that appear more than a threshold number of times across all texts are chosen as features. After we obtain all candidate phrases and words (a word can be seen as a single-word phrase) as features for BOP, feature selection is necessary to reduce dimensionality and overcome the statistical sparseness of document representations. After feature selection, we must choose a suitable classifier, which also strongly influences the results. Previous attempts to incorporate phrases or n-grams over the past decade lead us to the following two strategies: (a) since there are many n-gram phrases (not words) in our phrase bag, we must make sure that the phrases we use are all highly discriminative features. That means we should only choose those n-gram phrases that are "better" (i.e. more discriminative) than all of their components. For example, the phrase "data mining" is chosen only if it is "better" than both "data" and "mining"; (b) to improve on results that have already been achieved, we should enrich the existing model rather than propose a new one. This implies choosing a feature selection method and classifier from existing ones. Based on these two strategies, we first find out which phrases are "better" than their components. For each category we sort all the unigrams

according to their Mutual Information measure with respect to the category. Then we compare the rank of each n-gram phrase to the ranks of its component unigram words. If the rank of the n-gram phrase is better than those of all its components, it can be chosen; otherwise it is removed from the phrase bag. After this process, the number of phrases left in the bag decreases sharply. For the next step, we must select a feature selection (FS) method for the phrases in the bag. Many FS methods have been proposed. Some are class-independent measures, such as document frequency, the simplest feature selection method, which is based on Zipf's Law: the N most common phrases, and those terms that appear in fewer than M documents (M usually 1 or 2), are removed. However, we prefer methods that consider classes while selecting features, such as Information Gain [16], Chi-Square (CHI) [17] and Gain Ratio [18].
1) Information Gain, also known as Expected Mutual Information, tries to keep only the terms distributed most differently across the categories of the training set.
2) Chi-Square measures the lack of independence between a term t and a category ci, and can be compared to the chi-square distribution with one degree of freedom to judge extremeness.
3) Gain Ratio is commonly used as a splitting criterion in decision tree induction. It is defined whenever a data set is split into two or more subsets: the more a split helps create subsets with homogeneous classes, the better the gain ratio.
Since we do not know in advance which FS method performs better for similar texts, we use all of these methods and choose the most suitable one. Finally, we must choose a proper classifier to learn the relationships between features and predefined categories. Many excellent algorithms have been proposed, such as Naïve Bayes [19], Bayes Networks [20], K Nearest Neighbors (KNN) [21], Decision Trees [22], SVMs [23] and so forth.
1) Naïve Bayes: a very cheap but very successful classifier, which assumes that all attribute values are independent given the class label.
2) Bayes Networks: a classifier that tries to improve on Naïve Bayes by relaxing its independence assumption.
3) KNN: formalizes the intuition that the class of an unseen example is likely to be the same as that of the closest known instances; the degree of similarity is defined according to a suitable criterion.
4) Decision Trees: classify texts through a decision tree. Their most important advantage is the capability to break a complex decision-making process down into a collection of simpler decisions, providing a solution that is often easier to interpret.
5) SVMs: an SVM computes the linear separation surface with maximum margin for a given training set. Only a subset of the input vectors influences the choice of the margin; such vectors are called support vectors. When a linear separation surface does not exist, for example in the presence of noisy data, SVM algorithms with slack variables are appropriate.
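Strategy (a) above, keeping an n-gram only when it is more discriminative than every one of its components, can be sketched as follows. This is a simplified sketch: it compares pointwise-mutual-information scores directly rather than the per-category ranks the paper uses, and the toy documents (term sets paired with category labels) are illustrative assumptions:

```python
import math

def pmi(term, cat, docs):
    """Pointwise mutual information between term occurrence and a category.
    docs is a list of (set_of_terms, category) pairs."""
    n = len(docs)
    n_t = sum(1 for terms, _ in docs if term in terms)
    n_c = sum(1 for _, c in docs if c == cat)
    n_tc = sum(1 for terms, c in docs if term in terms and c == cat)
    if n_tc == 0:
        return float("-inf")
    return math.log((n_tc * n) / (n_t * n_c))

def keep_phrase(phrase, cat, docs):
    """Strategy (a): keep an n-gram only if it scores higher than
    every one of its component unigrams for this category."""
    score = pmi(phrase, cat, docs)
    return all(score > pmi(w, cat, docs) for w in phrase.split())

# Toy corpus: each document is a set of extracted terms plus its label.
docs = [
    ({"data", "mining", "data mining", "text"}, "DM"),
    ({"data", "mining", "data mining"}, "DM"),
    ({"text", "mining", "text mining", "data"}, "TM"),
    ({"text", "mining", "text mining"}, "TM"),
]
print(keep_phrase("data mining", "DM", docs))  # True
print(keep_phrase("data mining", "TM", docs))  # False
```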

5 Experimental Study

Our experimental data set contains 917 abstracts from 15 different categories. The abstracts come from conference papers in the Databases field in the ACM Digital Library1 (ACMDL). To build this data set, we collected thousands of abstracts of papers from famous Databases conferences in the ACMDL and labeled the category of each abstract according to the name of the session it belongs to; e.g. if a paper belongs to a session named "Data Mining", its abstract is labeled "Data Mining". Owing to the limitations of our resources, although we found hundreds of different session names, most sessions contain no more than 10 abstracts. We therefore chose only the popular categories, each of which contains at least 30 abstracts. All categories and the number of abstracts in each category are given in Fig. 2.

Fig. 2. Category name and the number of abstracts in each category of experimental data set.

Firstly, we use a 3-gram word sequence extractor to extract all phrases no longer than 3 words (including single words), and only those word sequences that appear more than 3 times across all texts are chosen as features for our BOP method. Secondly, we remove the less discriminative phrases from the bag following the considerations described in Section 4.2. To compare BOP with BOW, we also apply the classical BOW method to the data set with nearly the same first and third steps (the second step of BOP is not necessary for BOW). In the third step, to choose a proper feature selection (FS) method for the phrases in the bag, we apply several common FS methods: Information Gain (IG), Chi-Square (CHI) and Gain Ratio. After feature selection, we choose a classifier from the following algorithms: Naïve Bayes, Bayes Networks, K Nearest Neighbors, Decision Trees and SVMs. The results with ten-fold cross-validation on the data set are given in Tables 2 to 6 and Fig. 3.

ACM Digital Library: http://portal.acm.org/
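The first step of the experiment (a 3-gram extractor with a frequency threshold) might be sketched as follows; the function name and toy texts are our own illustration, not the system's actual code:

```python
from collections import Counter

def extract_bop_features(texts, max_n=3, min_count=3):
    """Collect every word sequence of length 1..max_n and keep those
    appearing more than min_count times across all texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {g for g, c in counts.items() if c > min_count}
```

The surviving n-grams would then pass through the discriminativeness filter of Section 4.2 before feature selection.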

Table 2. Accuracy of Naive Bayes Classifier

Feature   Information Gain    Chi-Square          Gain Ratio
Number    BOP      BOW        BOP      BOW        BOP      BOW
50        0.6375   0.5878     0.6667   0.5904     0.6247   0.5365
80        0.6308   0.5851     0.6467   0.6130     0.6313   0.5551
100       0.6414   0.5918     0.6667   0.6157     0.6260   0.5418
300       0.6282   0.5439     0.6693   0.5864     0.6366   0.4861
500       0.5976   0.5173     0.6454   0.5439     0.6340   0.4834
1000      0.5631   0.4840     0.6069   0.5040     0.5703   0.4436

Table 3. Accuracy of Bayes Networks Classifier

Feature   Information Gain    Chi-Square          Gain Ratio
Number    BOP      BOW        BOP      BOW        BOP      BOW
50        0.6866   0.5851     0.6906   0.5851     0.6830   0.5538
80        0.6985   0.5851     0.6972   0.5851     0.6910   0.5790
100       0.6972   0.5851     0.6972   0.5851     0.6910   0.5790
300       0.6960   0.5851     0.6959   0.5851     0.6910   0.5790
500       0.6960   0.5851     0.6959   0.5851     0.6910   0.5790
1000      0.6959   0.5851     0.6959   0.5851     0.6910   0.5790

Table 4. Accuracy of K Nearest Neighbors Classifier

Feature   Information Gain    Chi-Square          Gain Ratio
Number    BOP      BOW        BOP      BOW        BOP      BOW
50        0.5046   0.4295     0.5857   0.4934     0.5212   0.5086
80        0.4993   0.4335     0.5697   0.4601     0.5358   0.4688
100       0.5046   0.4069     0.5339   0.4495     0.5371   0.4462
300       0.4050   0.2939     0.4940   0.3404     0.5027   0.3732
500       0.3293   0.2606     0.4117   0.3032     0.4907   0.3559
1000      0.2059   0.1676     0.2869   0.1941     0.3912   0.3187

Table 5. Accuracy of Decision Trees Classifier

Feature   Information Gain    Chi-Square          Gain Ratio
Number    BOP      BOW        BOP      BOW        BOP      BOW
50        0.2311   0.2181     0.2311   0.2181     0.2308   0.2178
80        0.2311   0.2181     0.2311   0.2181     0.2307   0.2179
100       0.2311   0.2181     0.2311   0.2181     0.2307   0.2178
300       0.2311   0.2181     0.2311   0.2181     0.2307   0.2178
500       0.2310   0.2181     0.2311   0.2181     0.2307   0.2178
1000      0.2310   0.2181     0.2311   0.2181     0.2308   0.2178

Table 6. Accuracy of SVM Classifier

Feature   Information Gain    Chi-Square          Gain Ratio
Number    BOP      BOW        BOP      BOW        BOP      BOW
50        0.5843   0.5559     0.6096   0.5559     0.5889   0.5193
80        0.6096   0.5731     0.6282   0.5665     0.6167   0.5113
100       0.6162   0.5745     0.6162   0.5625     0.6141   0.5060
300       0.6082   0.5585     0.6003   0.5625     0.5424   0.4117
500       0.5830   0.4854     0.5790   0.4907     0.5133   0.3373
1000      0.5206   0.3670     0.5139   0.3710     0.3966   0.2364

Fig. 3. The accuracy of the BOP + Bayes Networks classifier, plotted against the number of features (50–1000) for the Information Gain, Chi-Square and Gain Ratio feature selection methods.

As we can see from Tables 2 to 6, BOP performs better than BOW in most situations, which shows that BOP is more effective on our data set. Moreover, Bayes Networks performs much better than the other classifiers on our data set; it performs best when only 80–100 words and phrases are chosen for the BOP using the Information Gain feature selection method.

6 Introduction to AutoPCS

Based on the experimental study in the last section, we decided to use BOP + Information Gain + Bayes Networks for text classification in AutoPCS. Unlike classifying an unseen text into one predefined category, AutoPCS tries to find the three most relevant topics for each text. We therefore make a small adjustment to the classifier. We still use the BOP representation with the Information Gain feature selection method to obtain about 500 highly discriminative phrases and words in the bag. However, for an unseen text, the modified Bayes Networks classifier does not output a single category, but the top three categories ranked according to their relevance values. AutoPCS is a domain-independent tool. To have AutoPCS effectively categorize a new paper into three relevant topics according to its abstract, it should be given a predefined list of topics in the domain, as well as plenty of previously accepted papers that have been categorized with their most relevant topics (only the most

relevant topic for each paper is enough) as training input. Since the topics of a domain change every year, AutoPCS can be updated simply by adding newly accepted papers with their categorizations as additional input. A snapshot of AutoPCS is shown in Fig. 4.
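The top-three adjustment can be sketched in a few lines; the relevance scores below are made-up values standing in for the Bayes Networks classifier's per-topic output:

```python
def top_k_topics(scores, k=3):
    """Return the k categories with the highest relevance scores,
    instead of the single most likely category."""
    return [c for c, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Hypothetical per-topic relevance values for one unseen abstract.
scores = {"Data Mining": 0.41, "Text Mining": 0.22,
          "Information Retrieval": 0.18, "Query Processing": 0.07}
print(top_k_topics(scores))
# ['Data Mining', 'Text Mining', 'Information Retrieval']
```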

Fig. 4. A snapshot of AutoPCS

7 Conclusion

In this paper we developed a tool named AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts. This topic selection process can be regarded as a text classification problem. We first tried the classical BOW-based text categorization method, but it failed. The reason, we concluded, is that papers submitted to the same conference all belong to one domain; although they concern different topics, they still share many common words, so it is difficult to classify them by words (or BOW) alone. But different topics have their own characteristic terms, which are usually phrases, so a phrase-based representation (BOP) is expected to be more effective. Considerable effort has in fact been devoted to using phrases or n-grams to build document representations richer than BOW, but these attempts were unsuccessful in improving text categorization results on some famous data sets. After an overview of related work, we designed our own way to extract phrases from texts, and through an experimental study we found a suitable feature selection method and classifier for our case. Finally, we decided to use BOP + Information Gain + Bayes Networks for text classification in AutoPCS. Since AutoPCS must label each new paper with its three most relevant topics, we also made some adjustments to the output of our classifier. We believe that AutoPCS can be a good helper for conference organizations.

References

1. G. Salton, M. McGill: Introduction to Modern Information Retrieval. McGraw Hill (1983).
2. R. O. Duda, P. E. Hart, and D. G. Stork: Pattern Classification (2nd ed). John Wiley & Sons, Inc., New York (2000).
3. V. N. Vapnik: Statistical Learning Theory. John Wiley & Sons, Inc., New York (1998).
4. R. E. Schapire and Y. Singer: BOOSTEXTER: a boosting-based system for text categorization. Machine Learning, 39(2/3):135–168 (2000).
5. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami: Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM'98, pages 148–155 (1998).
6. S. M. Weiss, C. Apte, F. J. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp: Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):63–69 (1999).
7. R. Bekkerman and J. Allan: Using Bigrams in Text Categorization. CIIR Technical Report, University of Massachusetts at Amherst (2004).
8. D. D. Lewis: An evaluation of phrasal and clustered representations on a text categorization task. Proceedings of SIGIR'92, pages 37–50, Kobenhavn, DK (1992).
9. S. Scott and S. Matwin: Feature engineering for text classification. In I. Bratko and S. Dzeroski, editors, Proceedings of ICML'99, pages 379–388, Bled, SL (1999).
10. M. F. Caropreso, S. Matwin, and F. Sebastiani: A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Amita G. Chin, editor, Text Databases and Document Management: Theory and Practice, pages 78–102. Idea Group Publishing, Hershey, US (2001).
11. D. Zhang and W. S. Lee: Question classification using support vector machines. In J. Callan, G. Cormack, C. Clarke, D. Hawking, and A. Smeaton, editors, Proceedings of SIGIR'03, pages 26–32, Toronto, CA (2003).
12. C. H. Koster and M. Seutter: Taming wild phrases. In F. Sebastiani, editor, Proceedings of ECIR'03, pages 161–176, Pisa, IT (2003).
13. C. M. Tan, Y. F. Wang, and C. D. Lee: The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546 (2002).
14. B. Raskutti, H. Ferra, and A. Kowalczyk: Second order features for maximising text classification performance. In L. De Raedt and P. A. Flach, editors, Proceedings of ECML'01, pages 419–430, Freiburg, DE (2001).
15. H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins: Text classification using string kernels. In Advances in Neural Information Processing Systems (NIPS), pages 563–569 (2000).
16. D. D. Lewis and M. Ringuette: A comparison of two learning algorithms for text categorization. Proceedings of SDAIR-94, pp. 81–93 (1994).
17. Y. Yang and J. O. Pedersen: A comparative study on feature selection in text categorization. Proceedings of ICML-97, pp. 412–420 (1997).
18. E. D. Wiener, J. O. Pedersen, and A. S. Weigend: A neural network approach to topic spotting. Proceedings of SDAIR-95, pp. 317–332 (1995).
19. P. Domingos and M. Pazzani: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–137 (1997).
20. N. Friedman, D. Geiger, and M. Goldszmidt: Bayesian network classifiers. Machine Learning, 29:131–163 (1997).
21. Y. Yang: Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In SIGIR '94, pp. 13–22 (1994).
22. Y. Yuan and M. J. Shaw: Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69:125–139 (1995).
23. T. Joachims: Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, pp. 137–142 (1998).