A Genetic Semi-supervised Fuzzy Clustering Approach to Text Classification

Hong Liu and Shang-teng Huang

Dept. of Computer Science, Xinjian Building 2008, Shanghai Jiaotong University, Shanghai 200030, China
{liuhongshcn,royhuang}@hotmail.com

Abstract. A genetic semi-supervised fuzzy clustering algorithm is proposed, which can learn a text classifier from both labeled and unlabeled documents. Labeled documents are used to guide the evolution process of each chromosome, which encodes a fuzzy partition on the unlabeled documents. The fitness of each chromosome is evaluated with a combination of the fuzzy within-cluster variance of the unlabeled documents and the misclassification error of the labeled documents. The structure of the clusters obtained can be used to classify future new documents. Experimental results show that the proposed approach can improve text classification accuracy significantly, compared to text classifiers trained with a small number of labeled documents only. The approach also performs at least as well as the similar approach, EM with Naïve Bayes.

1 Introduction

There is a great need for efficient content-based retrieval, searching, and filtering over the huge, unstructured online repositories on the Internet. Automated text classification [1], which automatically assigns documents to pre-defined classes according to their text content, has become a key technique for accomplishing these tasks. Various supervised learning methods have been applied to construct text classifiers from a priori labeled documents, e.g., Naïve Bayes [2] and SVM [3]. However, for complex learning tasks, providing a sufficiently large set of labeled training examples becomes prohibitive. Compared to labeled documents, unlabeled documents are usually easier to obtain, with the help of tools such as digital libraries, crawler programs, and search engines. Therefore, it is reasonable to learn a text classifier from both labeled and unlabeled documents. This learning paradigm is usually referred to as semi-supervised learning, and several approaches have been proposed to implement it, e.g., co-training [4], EM with Naïve Bayes [5], and transductive SVM [6]. While these approaches attempt to feed unlabeled data to supervised learners, other approaches consider incorporating labeled data into unsupervised learning, e.g., partially supervised fuzzy clustering [7] and ssFCM [8].


To learn a text classifier from labeled and unlabeled documents, this paper proposes a semi-supervised fuzzy clustering algorithm based on a genetic algorithm (GA) [9]. By minimizing a combination of the fuzzy within-cluster variance of the unlabeled documents and the misclassification error of the labeled documents using a GA, it attempts to find an optimal fuzzy partition on the unlabeled documents. The structure of the clusters obtained can then be used to classify future new documents. The remainder of the paper is organized as follows. Section 2 gives the problem definition. In Section 3, the major components of our algorithm are introduced, and the overall algorithm itself, Genetic Semi-Supervised Fuzzy Clustering (GSSFC), is described. Section 4 illustrates how to classify a new document with the aid of the results obtained by GSSFC. Experimental results and discussion are given in Section 5. Finally, Section 6 concludes the paper.

2 Problem Definition

We are provided with a small number of labeled documents and a large number of unlabeled documents. Using the Vector Space Model, all the labeled and unlabeled documents can be denoted in matrix form:

$$X = \big[\, \underbrace{x^l_1, \ldots, x^l_{n_l}}_{\text{labeled}} \mid \underbrace{x^u_1, \ldots, x^u_{n_u}}_{\text{unlabeled}} \,\big] = X^l \cup X^u \qquad (1)$$

Here, the superscript $l$ designates labeled documents and the superscript $u$ designates unlabeled documents. (In other contexts, $u$ may denote a membership value, when it does not appear as a superscript.) Moreover, $n_l = |X^l|$, $n_u = |X^u|$, and $n = |X| = n_l + n_u$. A matrix representation of a fuzzy $c$-partition of $X$ induced by equation (1) has the form:

$$U = \big[\, U^l \mid U^u \,\big] = \left[ \begin{array}{ccc|ccc} u^l_{11} & \cdots & u^l_{1n_l} & u^u_{11} & \cdots & u^u_{1n_u} \\ u^l_{21} & \cdots & u^l_{2n_l} & u^u_{21} & \cdots & u^u_{2n_u} \\ \vdots & & \vdots & \vdots & & \vdots \\ u^l_{c1} & \cdots & u^l_{cn_l} & u^u_{c1} & \cdots & u^u_{cn_u} \end{array} \right] \qquad (2)$$

Here, the fuzzy values of the column vectors in $U^l$ are assigned by domain experts after a careful investigation of $X^l$. In general, for $1 \le i \le c$, $1 \le h \le n_l$, $1 \le j \le n_u$, equation (2) should satisfy the following conditions:

$$u^l_{ih} \in [0,1], \quad \sum_{i=1}^{c} u^l_{ih} = 1, \qquad u^u_{ij} \in [0,1], \quad \sum_{i=1}^{c} u^u_{ij} = 1 \qquad (3)$$
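To make this representation concrete, the following minimal Python/NumPy sketch builds a random fuzzy partition satisfying the column-sum constraints of equation (3); the function name and shapes are illustrative, not from the paper. Random initialization of this kind is also how the chromosomes of Section 3.1 are created.

```python
import numpy as np

def random_fuzzy_partition(c, n, rng=None):
    """Sample a (c, n) matrix with nonnegative entries whose columns
    each sum to 1, i.e., a fuzzy c-partition satisfying eq. (3)."""
    rng = np.random.default_rng(rng)
    U = rng.random((c, n))
    return U / U.sum(axis=0, keepdims=True)  # normalize every column

# Example: a partition of 5 unlabeled documents into 3 fuzzy clusters.
U_u = random_fuzzy_partition(3, 5, rng=0)
assert np.allclose(U_u.sum(axis=0), 1.0)
```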

The goal of the problem is to construct a text classifier using $X$. Our basic idea is to find a fuzzy $c$-partition on $X^u$ that minimizes the fuzzy within-cluster variance of the unlabeled documents and the misclassification error of the labeled documents, and then to use the structure of the clusters obtained to classify future new documents.

Misclassification Error of Labeled Documents. In order to achieve good generalization performance, the text classifier to be constructed should minimize the misclassification error of the labeled documents. We use the variance of the fuzzy memberships of the labeled documents to measure the misclassification error. In detail, given a fuzzy $c$-partition on $X^u$, the $c$ cluster centers $v_1, v_2, \ldots, v_c$ can be computed as follows:

$$v_i = \frac{\sum_{k=1}^{n_l} (u^l_{ik})^m x^l_k + \sum_{k=1}^{n_u} (u^u_{ik})^m x^u_k}{\sum_{k=1}^{n_l} (u^l_{ik})^m + \sum_{k=1}^{n_u} (u^u_{ik})^m}, \quad 1 \le i \le c \qquad (4)$$

For $i = 1, 2, \ldots, c$ and $j = 1, 2, \ldots, n_l$, the fuzzy memberships of the labeled documents can be re-computed as follows:

$$u'^{l}_{ij} = \left[ \sum_{h=1}^{c} \left( \frac{\| x^l_j - v_i \|}{\| x^l_j - v_h \|} \right)^{2/(m-1)} \right]^{-1} \qquad (5)$$

Accordingly, the misclassification error of the labeled documents, denoted as $E$, can be measured as a weighted sum of the variance between $u'^{l}_{ij}$ and $u^l_{ij}$, with weights equal to $\| x^l_j - v_i \|^2$, that is,

$$E = \sum_{j=1}^{n_l} \sum_{i=1}^{c} \left( u'^{l}_{ij} - u^l_{ij} \right)^m \| x^l_j - v_i \|^2 \qquad (6)$$
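The quantities in equations (4)-(6) translate directly into NumPy. The sketch below is our reading of those formulas, with hypothetical function names; documents are assumed to be rows of X_l and X_u, memberships the (c, n) matrices of equation (2), and the np.abs in the error term is a guard we add for fractional m (the paper writes the difference raised to the power m without comment).

```python
import numpy as np

def cluster_centers(U_l, U_u, X_l, X_u, m=2.0):
    """Eq. (4): centers as membership-weighted means over all documents.
    U_l: (c, n_l), U_u: (c, n_u); X_l: (n_l, d), X_u: (n_u, d)."""
    W_l, W_u = U_l ** m, U_u ** m
    num = W_l @ X_l + W_u @ X_u                    # (c, d)
    den = W_l.sum(axis=1) + W_u.sum(axis=1)        # (c,)
    return num / den[:, None]

def recompute_labeled_memberships(X_l, V, m=2.0, eps=1e-12):
    """Eq. (5): FCM-style membership update for the labeled documents."""
    d = np.linalg.norm(X_l[None, :, :] - V[:, None, :], axis=2) + eps  # (c, n_l)
    p = d ** (-2.0 / (m - 1.0))
    return p / p.sum(axis=0, keepdims=True)        # columns sum to 1

def misclassification_error(U_l, U_l_new, X_l, V, m=2.0):
    """Eq. (6): deviation between expert-given and recomputed memberships,
    weighted by the squared distance to each cluster center."""
    d2 = np.linalg.norm(X_l[None, :, :] - V[:, None, :], axis=2) ** 2  # (c, n_l)
    return float((np.abs(U_l_new - U_l) ** m * d2).sum())
```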

Fuzzy within-Cluster Variance of Unlabeled Documents. Although minimizing the misclassification error of the labeled documents is necessary for the text classifier to achieve good generalization ability, it is not sufficient: minimizing the misclassification error of only a small set of labeled documents would very likely lead to over-fitting. Fuzzy within-cluster variance is a well-known measure of cluster quality in fuzzy clustering, defined as:

$$J_m(U^u, V) = \sum_{j=1}^{n_u} \sum_{i=1}^{c} (u^u_{ij})^m \| x^u_j - v_i \|^2 \qquad (7)$$

Here, $m > 1$ is the parameter controlling the fuzziness of the clusters. Minimizing the fuzzy within-cluster variance is equivalent to maximizing the similarity of documents within the same cluster. Thus, we argue that the fuzzy within-cluster variance of the unlabeled documents can play the role of capacity control [10] in our problem. Based on equations (7) and (6), we can state our objective function as follows:

176

H. Liu and S.-t. Huang

$$f(U^u, V) = J_m + \alpha \cdot E \qquad (8)$$

Here, $\alpha > 0$ is a regularization parameter, which maintains a balance between the fuzzy within-cluster variance of the unlabeled documents and the misclassification error of the labeled documents. The choice of $\alpha$ depends very much on the numbers of labeled and unlabeled documents. To ensure that the impact of the labeled documents is not ignored, the value of $\alpha$ should produce approximately equal weighting of the two terms in equation (8). This suggests that $\alpha$ should be proportional to the ratio $n_u / n_l$. Our problem has thus been converted to the minimization of the objective function in equation (8). In this paper, we use a GA to solve this optimization problem, because a GA uses a population-wide search instead of a point search, and its transition rules are stochastic instead of deterministic; the probability of reaching a false peak is therefore much lower than with conventional optimization methods.
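Continuing the sketch, and reusing the equation (4)-(6) helpers above, the fitness of a chromosome is equation (8). Defaulting alpha to n_u / n_l follows the proportionality argument just given; the exact constant is left open by the paper, so treat this default as an assumption.

```python
import numpy as np

def fuzzy_within_cluster_variance(U_u, X_u, V, m=2.0):
    """Eq. (7): J_m = sum_ij (u_ij)^m ||x_j - v_i||^2 over unlabeled docs."""
    d2 = np.linalg.norm(X_u[None, :, :] - V[:, None, :], axis=2) ** 2  # (c, n_u)
    return float(((U_u ** m) * d2).sum())

def fitness(U_l, U_u, X_l, X_u, m=2.0, alpha=None):
    """Eq. (8): f = J_m + alpha * E, the quantity the GA minimizes."""
    if alpha is None:
        alpha = X_u.shape[0] / X_l.shape[0]        # alpha proportional to n_u / n_l
    V = cluster_centers(U_l, U_u, X_l, X_u, m)
    U_l_new = recompute_labeled_memberships(X_l, V, m)
    J_m = fuzzy_within_cluster_variance(U_u, X_u, V, m)
    E = misclassification_error(U_l, U_l_new, X_l, V, m)
    return J_m + alpha * E
```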

3 Genetic Semi-supervised Fuzzy Clustering (GSSFC)

In this section, the major components of the GA and the overall algorithm itself are described.

3.1 Major Components of GA

• Representation and Its Initialization. In our algorithm, the matrices $U^u$, whose form is illustrated in equation (2), play the role of chromosomes. Each $U^u$ is initialized randomly.
• Fitness Function. The fitness function used is the objective function in equation (8).
• Selection. A roulette wheel selection method [9] is used for selecting population members to reproduce. Each member of the population gets a share of the roulette wheel based on its fitness.
• Crossover. Suppose $P_c$ is the probability of crossover; then, on average, $P_c \cdot b$ chromosomes undergo the crossover operation, in the following steps:
Step 1. Generate a random real number $r_c \in [0,1]$ for the given $k$th chromosome.
Step 2. Select the $k$th chromosome for crossover if $r_c < P_c$.
Step 3. Repeat Steps 1 and 2 for $k = 1, \ldots, b$, producing $P_c \cdot b$ parents on average.
Step 4. For each pair of parents, for example $U^{u1}$ and $U^{u2}$, the crossover operation produces two children $U^{u(b+1)}$ and $U^{u(b+2)}$ as follows:

$$U^{u(b+1)} = c_1 U^{u1} + c_2 U^{u2}, \qquad U^{u(b+2)} = c_2 U^{u1} + c_1 U^{u2} \qquad (9)$$

where $c_1 \in [0,1]$ is a random real number and $c_1 + c_2 = 1$.


• Mutation. Mutation is usually defined as a change in a single bit of a solution vector; in our problem, this corresponds to a change of one element $u^u_{ij}$ of a chromosome $U^u$. (A sketch of the crossover and mutation operators follows.)
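Below is a sketch of the two variation operators under stated assumptions. The crossover is exactly the convex combination of equation (9); since $c_1 + c_2 = 1$, each child's columns still sum to 1, so the constraints of equation (3) are preserved. The mutation regenerates a whole column and renormalizes it; the renormalization is our assumption, made so that mutated chromosomes still satisfy equation (3).

```python
import numpy as np

def arithmetic_crossover(U1, U2, rng=None):
    """Eq. (9): children are convex combinations of the two parents.
    Because c1 + c2 = 1, each child's columns still sum to 1."""
    rng = np.random.default_rng(rng)
    c1 = rng.random()
    c2 = 1.0 - c1
    return c1 * U1 + c2 * U2, c2 * U1 + c1 * U2

def mutate(U, Pm=0.1, rng=None):
    """With probability Pm per column, redraw the column at random and
    renormalize (renormalization is an assumption; the paper only says
    new elements are generated in the j-th column)."""
    rng = np.random.default_rng(rng)
    U = U.copy()
    c, n = U.shape
    for j in range(n):
        if rng.random() <= Pm:
            col = rng.random(c)
            U[:, j] = col / col.sum()
    return U
```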

3.2 Overall Algorithm

The overall algorithm is given in Table 1.

Table 1. Genetic Semi-Supervised Fuzzy Clustering Algorithm

Inputs: $X^l$, the set of labeled documents; $X^u$, the set of unlabeled documents; $U^l$, a fuzzy $c$-partition on $X^l$.
Outputs: $U^u$, a fuzzy $c$-partition on $X^u$ (and hence $U$, a fuzzy $c$-partition on $X$), and $c$ cluster centers $v_1, v_2, \ldots, v_c$.
Parameters: $m > 1$, the degree of fuzziness; $b$, the population size; max_gen, the number of generations; $P_c$, the probability of crossover; $P_m$, the probability of mutation; $\alpha$, the regularization parameter in the fitness function.

STEP 1 (Initialize). Initialize each $U^u$ randomly and set gen = 0.
STEP 2 (Evaluation). For $i = 1, 2, \ldots, c$, compute the current cluster center $v_i$ using equation (4). For $i = 1, 2, \ldots, c$ and $j = 1, 2, \ldots, n_l$, compute the new fuzzy membership $u'^{l}_{ij}$ of labeled document $x^l_j$ with respect to cluster $i$, using equation (5). For $k = 1, 2, \ldots, b$, compute the fuzzy within-cluster variance $J_m^k$ for $U^{uk}$ using equation (7), the misclassification error $E^k$ for $U^{uk}$ using equation (6), and the fitness $f^k$ for $U^{uk}$ using equation (8).
STEP 3 (Selection). For $k = 1, 2, \ldots, b$, generate a random real number $r_s \in [0,1]$, and if $f^{k-1} < r_s < f^k$ (taking the $f^k$ as cumulative normalized fitness values), select $U^{uk}$.
STEP 4 (Crossover). For $k = 1, 2, \ldots, b/2$, generate a random number $r_c \in [0,1]$, and if $r_c \le P_c$, perform crossover on the $l$th and $m$th chromosomes, which are selected at random.
STEP 5 (Mutation). For $k = 1, 2, \ldots, b$ and $j = 1, 2, \ldots, n_u$, generate a random number $r_m \in [0,1]$, and if $r_m \le P_m$, generate new elements in the $j$th column of the $k$th chromosome.
STEP 6 (Termination). If gen < max_gen, let gen = gen + 1 and go to STEP 2. Otherwise, the algorithm stops.
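Putting Table 1 together, here is a condensed end-to-end sketch reusing the helpers above. It is not the authors' implementation: the inverse-fitness weighting inside the roulette selection is one common way to turn the minimized $f$ of equation (8) into selection probabilities, since Table 1 does not spell that transformation out.

```python
import numpy as np

def roulette_select(population, fits, rng):
    """Roulette-wheel selection for a minimization problem: smaller f
    gets a larger slice, via inverse-fitness weights (our assumption)."""
    w = 1.0 / (np.asarray(fits) + 1e-12)
    probs = w / w.sum()
    idx = rng.choice(len(population), size=len(population), p=probs)
    return [population[i].copy() for i in idx]

def gssfc(X_l, X_u, U_l, c, m=2.0, b=50, max_gen=2000,
          Pc=0.7, Pm=0.1, alpha=None, seed=0):
    rng = np.random.default_rng(seed)
    pop = [random_fuzzy_partition(c, X_u.shape[0], rng) for _ in range(b)]  # STEP 1
    for gen in range(max_gen):                     # STEP 6: termination test
        fits = [fitness(U_l, U, X_l, X_u, m, alpha) for U in pop]  # STEP 2
        pop = roulette_select(pop, fits, rng)      # STEP 3
        for k in range(0, b - 1, 2):               # STEP 4: pairwise crossover
            if rng.random() <= Pc:
                pop[k], pop[k + 1] = arithmetic_crossover(pop[k], pop[k + 1], rng)
        pop = [mutate(U, Pm, rng) for U in pop]    # STEP 5
    fits = [fitness(U_l, U, X_l, X_u, m, alpha) for U in pop]
    U_best = pop[int(np.argmin(fits))]             # best = lowest objective
    return U_best, cluster_centers(U_l, U_best, X_l, X_u, m)
```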

4 Classification with the Aid of Results of GSSFC

The structure of the clusters learned above reflects the natural structure of the document collection, so it can be used to classify future new documents. In detail, given a new document $x$ and the $c$ cluster centers $v_1, v_2, \ldots, v_c$ obtained with GSSFC, the fuzzy membership $u_i$ of $x$ with respect to class $i$ can be computed in a similar way to equation (5):

$$u_i = \left[ \sum_{h=1}^{c} \left( \frac{\| x - v_i \|}{\| x - v_h \|} \right)^{2/(m-1)} \right]^{-1} \qquad (10)$$

Thus, $x$ is assigned to the $c$ classes with corresponding fuzzy memberships $u_1, u_2, \ldots, u_c$, respectively. In some applications, $x$ must be assigned to exactly one of the $c$ classes. For this purpose, a defuzzification method such as the Maximum Membership Rule (MMR) can be used: $x$ is assigned to class $i$, where $i = \arg\max_{h=1,2,\ldots,c} u_h$.
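Equation (10) and MMR fit in a few lines; here is a sketch continuing the NumPy conventions above (the function name is ours):

```python
import numpy as np

def classify(x, V, m=2.0, eps=1e-12):
    """Eq. (10): fuzzy memberships of a new document x to the c clusters,
    plus the Maximum Membership Rule for a hard class label."""
    d = np.linalg.norm(x[None, :] - V, axis=1) + eps  # distances to centers
    p = d ** (-2.0 / (m - 1.0))
    u = p / p.sum()                                   # memberships, eq. (10)
    return u, int(np.argmax(u))                       # MMR: i = arg max_h u_h
```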

5 Experimental Setup and Results

In our experiments, the Naïve Bayes classifier (NBC) [2] is used as a baseline classifier, trained using labeled documents only. We also compare our method to the approach proposed by Nigam [5], namely EM with Naïve Bayes.

Benchmark Document Collection 1. The 20-Newsgroups dataset consists of Usenet articles collected by K. Lang [11] from 20 different newsgroups. The task is to classify an article into the one newsgroup (of twenty) to which it was posted. From the original dataset, three different subsets are created. The labeled set contains a total of 6000 documents (300 documents per class). We create an unlabeled set of 10000 documents (500 documents per class). The remaining 4000 documents (200 documents per class) form the test set. Different numbers of labeled documents are extracted from the labeled set, with a uniform distribution of documents over the 20 classes; each size of labeled set constitutes a new trial of the experiments below. Each document is represented as a TFIDF-weighted word frequency vector and then normalized.

Benchmark Document Collection 2. The second dataset, WebKB [12], contains web pages gathered from the computer science departments of four universities. The task is to classify a web page into the appropriate one of four classes: course, faculty, student, and project. Documents not in one of these classes are deleted. After removing documents that contain a browser relocation command, 4183 examples remain. We create four test sets, each containing all the documents from one of the four complete computer science departments. For each test set, an unlabeled set of 2500 pages is formed by random selection from the remaining web pages. Labeled sets are formed by the same method as for 20-Newsgroups. Stemming and stop-word removal are not used. As with 20-Newsgroups, each document is represented as a TFIDF-weighted word frequency vector and then normalized.
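For the document representation just described (TFIDF-weighted word frequency vectors, then normalized), a minimal sketch using scikit-learn follows; the library choice and the toy documents are our assumptions, since the paper does not name its tooling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TFIDF-weighted word frequency vectors; TfidfVectorizer applies L2
# normalization by default (norm="l2"), matching the normalization step.
docs = ["a toy newsgroup posting about graphics",
        "a toy course page with a syllabus"]   # stand-ins for the real corpora
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()   # dense (n_docs, vocab) matrix
```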


The parameters in GSSFC are set as follows: the degree of fuzziness m = 2; the maximum number of generations max_gen = 2000; the population size b = 50; the probability of crossover Pc = 0.7; the probability of mutation Pm = 0.1. Figure 1 shows the classification accuracy of NBC (no unlabeled documents), the classifier constructed through GSSFC, and EM with NBC on 20-Newsgroups and WebKB, respectively. The vertical axis indicates average classification accuracy on the test sets, and the horizontal axis indicates the number of labeled documents on a log scale.

Fig. 1. Classification Accuracy

The Naïve Bayes classifier gives an idea of the performance of a baseline classifier when no unlabeled documents are used. The corresponding curves in Figure 1 show that increasing the number of labeled documents from low to high yields a significant improvement in accuracy. These results support the need for methods that learn classifiers from unlabeled documents in addition to labeled documents when labeled documents are sparse. The classifier constructed through GSSFC performs significantly better than the traditional Naïve Bayes classifier. For example, with 200 labeled documents (10 documents per class), NBC reaches an accuracy of 41.3%, while GSSFC reaches 60.5%, a gain of 19.2 percentage points in classification accuracy. Another way to view these results is to consider how unlabeled documents can reduce the need for labeled documents. For example, on 20-Newsgroups, for NBC to reach 70% classification accuracy, more than 2000 labeled documents are needed, while GSSFC needs fewer than 800. This indicates that incorporating a small number of labeled documents together with a large number of unlabeled documents can help construct a better classifier than using the small number of labeled documents alone. The essential reason is that, although unlabeled documents do not provide class label information, they do provide much structural information about the feature space of the particular problem. It is this information that helps us attain a better classifier when labeled documents are sparse. As for GSSFC versus EM with Naïve Bayes, the former performs at least as well as the latter.

6 Conclusion

In summary, we have proposed a genetic semi-supervised fuzzy clustering algorithm that can learn a text classifier from both labeled and unlabeled documents. Experiments were carried out on two separate benchmark document collections. The results indicate that, by combining labeled and unlabeled documents in the training process, the proposed algorithm can learn a better text classifier than traditional inductive text classifier learners, for instance Naïve Bayes, when labeled documents are sparse. GSSFC also performs at least as well as EM with Naïve Bayes. GSSFC is thus an effective way to construct text classifiers from labeled and unlabeled documents.

References

1. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1) (2002) 1–47
2. Tzeras, K., Hartmann, S.: Automatic indexing based on Bayesian inference networks. In: Proc. 16th Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93). (1993) 22–34
3. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning. (1998) 137–142
4. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory. (1998) 92–100
5. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39(2/3) (2000) 103–134
6. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning. (1999) 200–209
7. Pedrycz, W., Waletzky, J.: Fuzzy clustering with partial supervision. IEEE Trans. on Systems, Man, and Cybernetics 27(5) (1997) 787–795
8. Benkhalifa, M., Mouradi, A., Bouyakhf, H.: Integrating external knowledge to supplement training data in semi-supervised learning for text categorization. Information Retrieval 4(2) (2001) 91–113
9. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, New York (1996)
10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
11. Lang, K.: NewsWeeder: learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning. (1995) 331–339
12. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence 118(1–2) (2000) 69–113