CBC: Clustering Based Text Classification Requiring Minimal Labeled Data

Hua-Jun Zeng1, Xuan-Hui Wang2, Zheng Chen1, Wei-Ying Ma1

1 Microsoft Research Asia, Beijing, P. R. China. {i-hjzeng, zhengc, wyma}@microsoft.com
2 University of Science and Technology of China, Hefei, Anhui, P. R. China. [email protected]

Abstract. Semi-supervised learning methods construct classifiers using both labeled and unlabeled training samples. While unlabeled samples can help to improve the accuracy of trained models to a certain extent, existing methods still face difficulties when the labeled data is insufficient and biased with respect to the underlying data distribution. In this paper, we present a clustering based classification (CBC) approach. Under this approach, the training data, including both the labeled and unlabeled portions, is first clustered with the guidance of the labeled data. Some of the unlabeled samples are then labeled based on the resulting clusters. Discriminative classifiers can subsequently be trained with the expanded labeled dataset. The effectiveness of the proposed method is justified analytically. Related issues, such as how to expand the labeled dataset and how to interleave clustering with classification, are also discussed. Our experimental results demonstrate that CBC outperforms existing algorithms when the labeled dataset is very small.

1. Introduction

Text classification is a supervised learning task of assigning natural language text documents to one or more predefined categories or classes according to their contents. While it has been a classical problem in the field of information retrieval for half a century, it has recently attracted an increasing amount of attention due to the ever-expanding amount of text available in digital form. Its applications span a number of areas, including auto-processing of emails, filtering of junk emails, and cataloguing of Web pages and news articles. A large number of techniques have been developed for text classification, including Naive Bayes (Lewis 1998), Nearest Neighbor (Masand 1992), neural networks (Ng 1997), regression (Yang 1994), rule induction (Apte 1994), and Support Vector Machines (SVM) (Vapnik 1995, Joachims 1998). Among them, SVM has been recognized as one of the most effective text classification methods; Yang & Liu give a comparative study of many of these algorithms (Yang 1999). As supervised learning methods, most existing text classification algorithms require sufficient training data so that the obtained classification model generalizes well. When the number of training samples in each class decreases, the classification accuracy of traditional text classification algorithms

degrades dramatically. In practical applications, however, labeled documents are often very sparse because manually labeling data is tedious and costly, while unlabeled documents are abundant. As a result, exploiting unlabeled data has recently become an active research problem in text classification. The general problem of exploiting unlabeled data in supervised learning is known, in different contexts, as semi-supervised learning or the labeled-unlabeled problem. In the context of text classification, the problem can be formalized as follows. Each sample text document is represented by a vector x ∈ ℝ^d. We are given two datasets Dl and Du. Dataset Dl is a labeled dataset consisting of data samples (xi, ti), 1 ≤ i ≤ n, where ti is the class label with 1 ≤ ti ≤ c. Dataset Du is an unlabeled dataset consisting of unlabeled samples xi, n+1 ≤ i ≤ n+m. The semi-supervised learning task is to construct a classifier with small generalization error on unseen data1 based on both Dl and Du. A number of semi-supervised text classification methods have been reported recently, including Co-Training (Blum & Mitchell, 1998), Transductive SVM (TSVM) (Joachims, 1999), and EM (Nigam et al., 2000); a comprehensive review can be found in Seeger (2001). While it has been reported that those methods obtain considerable improvement over traditional supervised methods when the training dataset is relatively small, our experiments indicate that they still face difficulties when the labeled dataset is extremely small, e.g. fewer than 10 labeled examples per class. This is to be expected, as most of those methods adopt the same iterative approach, which trains an initial classifier based heavily on the distribution presented in the labeled data. When the labeled set contains very few samples, and those samples lie far from their corresponding class centers due to the high dimensionality, these methods often start from a poor initial model and accumulate more errors in subsequent iterations. On the other hand, although the far more plentiful unlabeled samples should be more representative of the data distribution, they are not fully exploited in the classification process. This observation motivated the work reported in this paper. We present CBC, a clustering based approach for classifying text documents with both labeled and unlabeled data. The philosophical difference between our approach and existing ones is that we treat semi-supervised learning as clustering aided by the labeled data, while existing algorithms treat it as classification aided by the unlabeled data. Traditional clustering is unsupervised and requires no training examples. However, labeled data can provide important hints about the latent class variables. Labeled data can also help determine the parameters associated with clustering methods, thus influencing the final clustering result. Furthermore, label information can be propagated to unlabeled data according to the clustering result, and the expanded labeled set can be used by subsequent discriminative classifiers to obtain low generalization error on unseen data. Experimental results indicate that our approach outperforms existing approaches, especially when the original labeled dataset is very small. Our contributions can be summarized as follows. (1) We propose a novel clustering based

1 A transductive setting of this problem simply uses the seen unlabeled data as the testing data.

classification approach that requires minimal labeled data in the training dataset to achieve high classification accuracy; (2) we provide analysis that gives some insight into the problem and propose various implementation strategies; (3) we conduct comprehensive experiments to validate our approach and study related issues. The remainder of the paper is organized as follows. Section 2 reviews several existing methods. Our approach is outlined in Section 3 with some analysis. The detailed algorithm is then presented in Section 4. A performance study using several standard text datasets is presented in Section 5. Finally, Section 6 concludes the paper.

2. Semi-Supervised Learning: Motivations

As defined in the previous section, semi-supervised learning uses both the labeled dataset Dl and the unlabeled dataset Du to construct a classification model. However, how the unlabeled data can help in classification is not a trivial question, and different methods have been proposed according to different views of the unlabeled data. Expectation-Maximization (EM) (Dempster et al., 1977) has a long history in semi-supervised learning. The motivation of EM is as follows. Essentially, any classification method learns a conditional probability model P(t|x,θ) from a certain model family to fit the real joint distribution P(x, t). With unlabeled data, a standard statistical approach to assessing the fitness of a learned model P(t|x,θ) is the log-likelihood

\[ \sum_{x_i \in D_l} \log P(x_i \mid t_i, \theta)\,P(t_i) \;+\; \sum_{x \in D_u} \log \sum_{t} P(x \mid t, \theta)\,P(t) \tag{1} \]

where the latent labels of the unlabeled data are treated as missing variables. Given Eq. 1, Maximum Likelihood Estimation (MLE) can be conducted to find an optimal θ. Because the form of the likelihood often makes it difficult to maximize by partial derivatives, the EM algorithm is generally used to find a locally optimal θ. For example, Nigam et al. (2000) combined EM with Naive Bayes and obtained improved performance over supervised classifiers. In theory, if a θ close to the global optimum can be found, the result will also be optimal within the given model family. However, selecting a plausible model family is difficult, and the local optimum problem is serious, especially when starting from a poor initial point. For example, in Nigam's approach, EM is initialized by a Naive Bayes classifier trained on the labeled data, which may be heavily biased when the labeled data is insufficient. A minimal sketch of this EM idea, in simplified hard-assignment form, follows.
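The sketch below illustrates the procedure with a hard-assignment variant (often called hard EM, closely related to self-training) built around scikit-learn's MultinomialNB; full EM would instead weight each unlabeled document by its class posterior. The arrays X_labeled, y_labeled, and X_unlabeled are hypothetical sparse term-count matrices and label arrays, not part of the original paper.

```python
# Hard-EM semi-supervised Naive Bayes: a simplified sketch of the idea
# behind Nigam et al. (2000). Hard label assignments replace the soft
# posteriors of full EM. X_labeled and X_unlabeled are hypothetical
# sparse term-count matrices; y_labeled holds the known class labels.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def hard_em_nb(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    clf = MultinomialNB()
    clf.fit(X_labeled, y_labeled)        # initialize from labeled data only
    for _ in range(n_iter):
        # E-step (hardened): guess labels for the unlabeled documents.
        y_guess = clf.predict(X_unlabeled)
        # M-step: refit on labeled plus pseudo-labeled data.
        X_all = vstack([X_labeled, X_unlabeled])
        y_all = np.concatenate([y_labeled, y_guess])
        clf.fit(X_all, y_all)
    return clf
```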

Co-Training and TSVM were proposed more recently and have shown superior performance over EM in many experiments. Co-Training (Blum & Mitchell, 1998) splits the feature set as x = (x1, x2) and trains two classifiers θ1 and θ2, each of which is sufficient for classification. Under the compatibility assumption, i.e. P(t|x1,θ1) = P(t|x2,θ2), Co-Training uses unlabeled data to place an additional restriction on the model parameter distribution P(θ), thus improving the estimation of the real θ. The algorithm initially constructs two classifiers based on the labeled data, and each selects several confident examples for the other in order to expand the training set. This relies on the assumptions that an initial "weak predictor" can be found and that the two feature sets are conditionally independent. However, when the labeled dataset is small, it is often heavily biased with respect to the real data distribution, and the above assumptions are seriously violated.

TSVM (Joachims, 1999) exploits the unlabeled data in a totally different way: it maximizes the margin over both the labeled and the unlabeled data. TSVM works by finding a labeling t_{n+1}, t_{n+2}, ..., t_{n+m} of the unlabeled data Du and a hyperplane <w, b> which separates both Dl and Du with maximum margin. TSVM expects to find a low-density area of the data and constructs a linear separator in this area. Although empirical results indicate the success of the method, there is a concern that the maximum margin hyperplane over the unlabeled data is not necessarily the real classification boundary (Zhang, 2000). In text classification, because of the high dimensionality and data sparseness, there are often many low-density areas between positive and negative labeled examples. Instead of using two conditionally independent feature sets as in the co-training setting, Raskutti (2002) co-trained two SVM classifiers using two feature spaces from different views: one is the original feature space, and the other is derived from clustering the labeled and unlabeled data. Nigam K. & Ghani R. (2002) proposed two hybrid algorithms, co-EM and self-training, using two randomly split feature sets in the co-training setting. After exhaustive experiments, they found that co-training algorithms are better than non-co-training algorithms such as self-training.

In summary, all existing semi-supervised methods still work in a supervised fashion: they pay most attention to the labeled dataset and rely heavily on the distribution it presents. With the help of the unlabeled data, extra information on the data distribution can improve generalization performance. However, if the labeled dataset contains extremely few samples, such algorithms may not work well, because the labeled data can hardly represent the distribution of unseen data from the beginning. This is often the case in text classification, where the dimensionality is very high and a small labeled dataset represents just a few isolated points in a huge space. In Figure 1, we depict the results of applying two such algorithms to a text classification problem. The X-axis is the number of labeled samples in each class, and the Y-axis is their performance in terms of the Micro-F1 measure defined in Section 5. We can see that the performance of both algorithms degrades dramatically when the number of labeled samples per class drops below 16.

Figure 1. Performance of two existing algorithms with different sizes of labeled data (number of classes = 5, total number of training samples = 4000)

In Figure 1 we also depict a dotted line indicating the performance of a clustering method, K-means, applied to the same training data. In this experiment the labels are ignored, so the line represents the performance of purely unsupervised learning. It is interesting to see that when the number of labeled samples per class is less than 4, unsupervised learning in fact gives better performance than both semi-supervised learning algorithms. This motivated us to develop a clustering based approach to the problem of semi-supervised learning.

3. Clustering Based Classification

Our approach first clusters the unlabeled data with the guidance of the labeled data. Given the resulting clusters, some originally unlabeled samples can be treated as labeled data with high confidence. The expanded labeled dataset can subsequently be used by classification algorithms to construct the final classification model.

3.1 The Basic Approach

CBC consists of the following two steps:
1. Clustering step: cluster the training dataset, including both the labeled and unlabeled data, and expand the labeled set according to the clustering result;
2. Classification step: train classifiers with the expanded labeled data and the remaining unlabeled data.

Figure 2 illustrates the traditional approach and our clustering based approach, respectively. The black points and grey points in the figure represent data samples of two different classes. We have a very small number of labeled samples, e.g. one for each class, represented by the points with "+" and "-" signs. Apparently, a classification algorithm trained with these two points will most likely find line A, as shown in Figure 2(a), as the class boundary; it is also rather difficult to discover the real boundary B even with the help of the unlabeled data points. Firstly, because the initial labeled samples are highly biased, they provide a poor starting point for iterative reinforcement algorithms such as Co-Training and EM. Moreover, the TSVM algorithm may also take line A as the result, because it happens to lie in a low-density area. In fact, in a feature space with high dimensionality, a single sample is often highly biased, and many low-density areas will exist.


(a) Classification with original labeled data

(b) Expand labeled set by clustering

(c) Classification with newly labeled data

Figure 2. An illustrative example of clustering based classification. The black and gray data points are unlabeled examples. The big "+" and "-" are the two initially labeled examples, and the small "+" and "-" are examples labeled by clustering.

Our clustering based approach is shown in Figures 2(b) and 2(c). During the first step, a clustering algorithm is applied to the training samples; in the example, it results in two clusters. Then we propagate the labels of the labeled samples to the unlabeled samples that are closest to the cluster centroids. As a result, we have more labeled samples, as shown in Figure 2(b). The second step of the approach is to use the expanded labeled data and the remaining unlabeled data to train a classifier. As a result, we obtain a better class boundary, as shown in Figure 2(c). From the above description, we can see that our approach aims to combine the merits of both clustering and classification methods. That is, we use a clustering method to reduce the impact of the bias caused by the initial sparse labeled data. At the same time, with sufficient expanded labeled data, we can use discriminative classifiers to achieve better generalization performance than pure clustering methods. A sketch of this two-step pipeline is given below.
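The following is a minimal sketch under stated assumptions: scikit-learn is available; X_l, y_l, and X_u are hypothetical dense, L2-normalized TF-IDF matrices and labels; k-means is seeded with the labeled class means; and a linear SVC stands in for the TSVM of our actual implementation, since scikit-learn provides no TSVM.

```python
# Two-step CBC sketch: (1) k-means seeded by labeled class means expands
# the labeled set; (2) a discriminative classifier is trained on the
# expansion. LinearSVC is a stand-in for TSVM. X_l, y_l, X_u are
# hypothetical dense, L2-normalized TF-IDF matrices and labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def cbc_two_step(X_l, y_l, X_u, expand_frac=1.0):
    classes = np.unique(y_l)
    # Step 1: cluster all data, seeding each centroid with a class mean.
    seeds = np.vstack([X_l[y_l == c].mean(axis=0) for c in classes])
    km = KMeans(n_clusters=len(classes), init=seeds, n_init=1)
    km.fit(np.vstack([X_l, X_u]))
    # Each cluster inherits the label of the seed it started from.
    dist = km.transform(X_u)                 # distances to all centroids
    labels_u = classes[np.argmin(dist, axis=1)]
    # Expand with the fraction of examples nearest their centroid
    # (selected globally here, a simplification of per-cluster p%).
    order = np.argsort(dist.min(axis=1))
    top = order[: int(expand_frac * len(labels_u))]
    X_train = np.vstack([X_l, X_u[top]])
    y_train = np.concatenate([y_l, labels_u[top]])
    # Step 2: train a discriminative classifier on the expanded set.
    return LinearSVC().fit(X_train, y_train)
```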

3.2 Benefits of the Clustering Based Approach

In this subsection, we further analyze the benefits of integrating clustering into the classification process. First, clustering methods are more robust to the bias caused by the initial sparse labeled data. Let us take k-means, the most popular clustering algorithm, as an example. In essence, k-means is a simplified version of EM working on spherical Gaussian distribution models: it can be approximately described as MLE of k spherical Gaussian distributions, where the means µ1, …, µk and the identical covariance Σ are latent variables. Thus, with the aid of labeled data, the objective is to find an optimal θ = (µ1, …, µk, Σ) maximizing the log-likelihood of Eq. 1, where P(x|ti,θ) equals

\[ P(x \mid t_i, \theta) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)\right) \tag{2} \]

When the number of labeled examples is small, their bias will not greatly affect the likelihood estimation or the search for the optimal θ. Second, our clustering method is in fact a generative classifier, i.e., it constructs a classifier derived from the generative model P(x|t,θ) of its data. Ng & Jordan theoretically and empirically analyzed the asymptotic performance of generative and discriminative classifiers (such as logistic regression, a general form of SVM), and showed that generative classifiers reach their asymptotic performance faster than discriminative ones (Ng & Jordan 2002). Thus our clustering method is more effective with small training data and can achieve high performance more easily when the labeled data is sparse. To address the problem that generative classifiers usually lead to higher asymptotic error than discriminative classifiers, a discriminative classification method such as TSVM can be used in the second step of our approach, i.e., after clustering the unlabeled data and expanding the labeled dataset. Our clustering is guided by labeled data. Generally, clustering methods address the issue of finding a partition of the available data that maximizes a certain criterion, e.g. intra-cluster similarity and inter-cluster dissimilarity. The labeled data can be used to modify this criterion. There are also parameters associated with each clustering algorithm, e.g. the number k in k-means, or the split strategy of the dendrogram in hierarchical clustering; the labeled data can also guide the selection of these parameters. In our current implementation, we use a soft-constrained version of the k-means algorithm for clustering, where k is equal to the number of classes in the given labeled dataset. The labeled data points are used to obtain the initial labeled centroids, which are used in the clustering process to constrain the clustering result. The detailed algorithm is described in Section 4.

3.3 Combining Clustering with Classification

It is interesting to note that a TSVM classifier can also provide more confident examples for clustering. That is, the two-step clustering based classification, i.e., clustering followed by classification, can be viewed as one conceptual approach, and another strategy for combining clustering and classification is iterative reinforcement. Under this strategy, we first train a clustering model L1 based on all available data, obtaining an approximately correct classifier. Afterwards, we select from the unlabeled data the examples that are confidently classified by L1 (i.e. examples with high likelihood) and combine them with the original labeled data to train a new model L2. Because more labeled data are used, the obtained L2 is expected to be more accurate and can in turn provide more confident training examples for L1; we then use the new labeled dataset to train L1 again. This process is iterated until all examples are labeled.

One key issue here is how to expand the labeled dataset. In principle, we simply assign labels to the most confident p% of examples from each of the resulting clusters; if we choose p = 100% after the first clustering pass, we recover the two-step approach. First, we need to determine the value of p. The selection of p is a tradeoff between the number of labeled samples and the possible noise introduced by labeling errors. Obviously, with a higher p, a larger labeled dataset is obtained, and in general a classifier with higher accuracy can be trained with more training samples. On the other hand, when we expand more samples, we may introduce incorrectly labeled samples into the labeled dataset, which act as noise and degrade the performance of the classification algorithm. Furthermore, a small p means more iterations in the reinforcement process. Second, we need to choose the "confident examples". Note that any learned model is an estimation of the real data model P(x, t). We can regard examples as confidently classified by a given model if a slight change of θ has no impact on them. When more examples are given, the model estimation becomes more accurate, and the number of confident examples grows. As illustrated in Figures 2(a) and 2(b), even when some of the data points are wrongly classified, the most confident data points, i.e. the ones with the largest margin under a classification model and the ones nearest to the centroid under a clustering model, are classified correctly; a slight change of the decision boundary or centroid will not affect their labels. We assume that the class labels t are uniformly distributed. Since the Gaussians are spherical, the log-likelihood of a given data point and its estimated label is

\[ \log P(x^*, t^* \mid \theta) = \log\big(P(x^* \mid t^*, \theta)\,P(t^* \mid \theta)\big) = -c_1 \left\| x^* - \mu_{t^*} \right\|^2 + c_2 \tag{3} \]

where c1 and c2 are positive constants. The most probable points in a single Gaussian distribution are therefore the points nearest to the distribution mean. To find the most confident examples in the result of TSVM, we take a probabilistic view of the TSVM. Consider logistic regression, a general form of discriminative methods, whose objective is to maximize

\[ \sum_i \log \frac{1}{1 + e^{-y_i f(x_i, \theta)}} \tag{4} \]

where f(xi, θ) is a linear function depending on the parameter θ, and θ is typically a linear combination of the training examples. Under a margin maximization classifier such as SVM, the likelihood of a given point x* with label t* = + can be derived from the above equation:

\[ P(x^*, +) = P(x^*)\left(1 - \frac{1}{1 + \exp\!\left(\sum_j t_j \big(\sum_k \beta_{jk}\big)(x_j \cdot x^*) + b\right)}\right) \tag{5} \]

which considers the points with the largest margin to be the most probable. These two confidence criteria are illustrated in the sketch below.
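A minimal sketch of the two selection rules, with our own function names and numpy only: distance to the centroid ranks confidence under the clustering model (Eq. 3), while the absolute decision value |f(x)| ranks confidence under the margin-based model (Eqs. 4 and 5).

```python
# Confidence selection sketch: Eq. 3 says the most confident points of a
# cluster are those nearest its centroid; Eq. 5 says the most confident
# points of an SVM are those with the largest margin |f(x)|.
import numpy as np

def most_confident_by_centroid(X, centroid, p):
    dist = np.linalg.norm(X - centroid, axis=1)
    k = max(1, int(p * len(X)))
    return np.argsort(dist)[:k]        # indices of the p fraction nearest

def most_confident_by_margin(decision_values, p):
    k = max(1, int(p * len(decision_values)))
    return np.argsort(-np.abs(decision_values))[:k]   # largest |margin|
```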

4. The Algorithm

In this section, we present the detailed algorithm for applying CBC to text data, which is generally represented by sparse term vectors in a high-dimensional space. Following the traditional IR approach, we tokenize all documents into terms and construct one vector component for each distinct term. Thus each document is represented by a vector (wi1, wi2, ..., wid), where wij is the TFIDF weight (Salton, 1991), i.e.

\( w_{ij} = TF_{ij} \times \log(N / DF_j) \), where N is the total number of documents. Assuming the term vectors are normalized, the cosine function is a commonly used similarity measure for two documents: \( \mathrm{sim}(doc_j, doc_k) = \sum_{i=1}^{d} w_{ij}\, w_{ik} \). This measure is also used in the clustering algorithm to calculate the distance from an example to a centroid (which is also normalized).

This simple representation has proved effective for supervised learning (Joachims, 1998); in most tasks, for example, the classes are linearly separable. For the classification step, we use a TSVM classifier with a linear kernel. The representation can be reproduced in a few lines, as sketched below, and the detailed CBC algorithm follows it.
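A minimal sketch with scikit-learn, under the assumption that TfidfVectorizer's smoothed IDF is an acceptable stand-in for the plain log(N/DF) weighting above; with L2-normalized rows, the cosine similarity reduces to a dot product.

```python
# TF-IDF vectors with L2 normalization; since the rows are unit-length,
# the cosine similarity of Section 4 reduces to a plain dot product.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["transductive svm for text", "clustering text documents"]
X = TfidfVectorizer(norm="l2").fit_transform(docs)   # rows are normalized
cosine = X[0].multiply(X[1]).sum()                   # dot product = cosine
print(cosine)
```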

Algorithm CBC
Input:
- Labeled dataset Dl
- Unlabeled dataset Du
Output:
- The fully labeled set Dl′ = Dl + (Du, Tu*)
- A classifier L

1. Initialize the current labeled and unlabeled sets: Dl′ = Dl, Du′ = Du.
2. Repeat until Du′ = ∅:
   Clustering Step
   - Calculate the initial centroids \( o_i = \sum_{\forall j,\, t_j = i} x_j \), i = 1…c, x_j ∈ Dl, and set the current centroids o_i* = o_i.
   - The labels of the centroids, t(o_i) = t(o_i*), equal the labels of the corresponding examples.
   - Repeat until the clustering result no longer changes:
     - Calculate the nearest current centroid o_j* for each o_i; if t(o_i) ≠ t(o_j*), exit the loop.
     - Assign t(o_i*) to each x ∈ Dl + Du that is nearer to o_i* than to any other centroid.
     - Update the current centroids \( o_i^* = \sum_{\forall j,\, t_j = i} x_j \), i = 1…c, x_j ∈ Dl + Du.
   - From each cluster, select the p% of examples x ∈ Du′ nearest to o_i* and add them to Dl′.
   Classification Step
   - Train a TSVM classifier based on Dl′ and Du′.
   - From each class, select the p% of examples x ∈ Du′ with the largest margin and add them to Dl′.

The algorithm implements the iterative reinforcement strategy. During each iteration, a soft-constrained version of k-means is used for clustering. We compute the centroids of the labeled data for each class (called the "labeled centroids") and use them as the initial centroids for k-means; the value of k is set to the number of classes in the labeled data. Then we run k-means on both the labeled and unlabeled data. The loop terminates when the clustering result no longer changes, or just before a labeled centroid would be assigned to a wrong cluster. This places "soft constraints" on the clustering, because the constraints are based not on the exact examples but on their centroids, and they reduce the bias in the labeled examples. Finally, unlabeled data are assigned the same labels as the labeled centroid in the same cluster. After clustering, we select the most confident examples (i.e. the examples nearest to the cluster centroids) to form a new labeled set and, together with the remaining unlabeled data, train a TSVM classifier. Then the examples with the largest margin are selected into the new labeled set for the next iteration. It should be noted that the time complexity of a TSVM classifier is much higher than that of an SVM

classifier, because TSVM repeatedly switches the estimated labels of the unlabeled data while trying to find the maximum margin hyperplane; the more unlabeled data, the more time it requires. When there is no unlabeled data, TSVM reduces to a standard SVM and runs much faster. In our algorithm, the last classification run is therefore done by a standard SVM classifier. A sketch of the soft-constrained clustering step follows.
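The clustering step's soft constraint can be sketched as follows, in our own Python and under simplifying assumptions (X is a hypothetical dense matrix of normalized vectors whose first len(y_l) rows are labeled, and clusters never become empty): the loop stops early just before any labeled centroid would fall into a cluster of a different label.

```python
# Soft-constrained k-means sketch for the clustering step of CBC.
# Centroids start at the labeled class means; the loop stops when the
# assignment is stable, or early, just before a labeled centroid would
# be captured by a cluster of a different label. The dense broadcasting
# below is for clarity, not efficiency.
import numpy as np

def soft_constrained_kmeans(X, y_l, max_iter=100):
    classes = np.unique(y_l)
    k = len(classes)
    labeled_centroids = np.vstack(
        [X[:len(y_l)][y_l == c].mean(axis=0) for c in classes])
    centroids = labeled_centroids.copy()
    assign = None
    for _ in range(max_iter):
        # Soft constraint: every labeled centroid must remain nearest to
        # the current centroid carrying its own label.
        d = np.linalg.norm(labeled_centroids[:, None] - centroids[None], axis=2)
        if np.any(d.argmin(axis=1) != np.arange(k)):
            break                       # constraint violated: stop early
        new = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        if assign is not None and np.array_equal(new, assign):
            break                       # converged
        assign = new
        # Recompute centroids (assumes no cluster becomes empty).
        centroids = np.vstack([X[assign == i].mean(axis=0) for i in range(k)])
    return classes[assign]              # cluster i inherits class i's label
```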

5. A Performance Study

To evaluate the effectiveness of our approach, a comprehensive performance study was conducted using several standard text datasets.

5.1 Datasets

Three commonly used datasets, 20-Newsgroups, Reuters-21578, and Open Directory Project (ODP) webpages, were used in our experiments. For each document, we extract features from its title and body separately for the Co-Training algorithm, and extract a single feature vector from both title and body for all other algorithms. Stop-words are eliminated and stemming is performed for all features. For body features, all words with low document frequency (less than three in the experiments) are removed. TFIDF is then used to index both titles and bodies, where the IDF is calculated only on the training dataset; words that appear only in the test dataset are discarded. Because the training time of the TSVM classifier is proportional to the number of classes, we did not use all the classes in each dataset. In the experiments, we select five classes from 20-Newsgroups (the same as Nigam K. & Ghani R. (2002)), the ten biggest classes from Reuters-21578, as in many prior studies, and the six biggest classes from ODP. From the 20-Newsgroups dataset2, we select only the five comp.* discussion groups, which form a very confusable but evenly distributed classification dataset, with almost 1000 articles in each group. We choose 80% of each group as the training set and the remaining 20% as the test set, giving 4000 training examples and 1000 test examples; this split is similar to the one used in Nigam K. & Ghani R. (2002). For each article, the subject line and body are retained and all other content is discarded. After preprocessing, there are 14171 distinct terms, with 14059 in the body features and 2307 in the title features. The Reuters-21578 dataset is downloaded from Yiming Yang's homepage3. We use the ModApte split to form the training and test sets, yielding 7769 training examples and 3019 test examples. From the whole dataset, we select the ten biggest classes: earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, and corn. After this selection, there are 6649 training examples and 2545 test examples. After preprocessing, there are only 7771 distinct terms, with 7065 in the body features and 6947 in the title features. The ODP webpages dataset used in our experiments is composed of the six biggest classes in the second level of the ODP directory4: Business/Management (858 documents), Computers/Software (2411), Shopping/Crafts (877), Shopping/Home&Garden (1170), Society/Religion&Spirituality (886), and

2 Available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/.
3 http://www-2.cs.cmu.edu/~yiming/
4 http://dmoz.org/

Society/Holidays (881). We select 50% of each category as the training data and the remainder as the test data. From the experimental results below, we see that there is only a small difference between the 90% and 50% splits, so we conclude that such a split is sufficient for training a classifier. The webpages are preprocessed by an HTML parser and the plain text is extracted. After preprocessing, there are 16818 distinct terms in the body, 3729 in the title, and 17050 in title+body. Joachims's SVM-light package5 is used for SVM and TSVM classification. We use a linear kernel and set the weight C of the slack variables to its default. The basic classifiers for the two feature sets in the Co-Training method are Naive Bayes. During each iteration, we add the 1% of examples with the maximal classification confidence, i.e. the examples with the largest margin, into the labeled set, and the performance is evaluated on the combined features.

5.2 Evaluation Metrics

We use the micro-averaged F1 measure over all classes to evaluate the classification results. Since this is a multi-class problem, the TSVM or SVM in our algorithm constructs several one-versus-rest binary classifiers. For each class i ∈ [1, c], let Ai be the number of documents whose real label is i, Bi the number of documents whose predicted label is i, and Ci the number of correctly predicted documents in class i. The precision and recall of class i are defined as Pi = Ci/Bi and Ri = Ci/Ai, respectively. The F1 measure combines precision and recall into a unified measure. Because F1 differs across classes, two averaging functions can be used to judge the combined performance. The first is macro-averaging:

\[ F1_{macro} = \frac{1}{c} \sum_{i=1}^{c} \frac{2 \times P_i \times R_i}{P_i + R_i} \tag{6} \]

The second is micro-averaging, which first calculates the overall precision and recall on all classes, P = ΣCi/ΣBi and R = ΣCi/ΣAi, and then computes

\[ F1_{micro} = \frac{2 \times P \times R}{P + R} \tag{7} \]

Because the micro-averaged F1 is in fact a weighted average over all classes, which is more appropriate for highly unevenly distributed classes, we use this measure in the following experiments. Both averages can be computed directly from the per-class counts, as sketched below.
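A short sketch in which the arrays A, B, and C follow the definitions above:

```python
# Macro- and micro-averaged F1 from per-class counts, following Eqs. 6-7.
# A[i] = documents truly in class i, B[i] = documents predicted as class i,
# C[i] = correctly predicted documents of class i (all assumed nonzero).
import numpy as np

def f1_scores(A, B, C):
    P, R = C / B, C / A                     # per-class precision and recall
    macro = np.mean(2 * P * R / (P + R))
    P_micro, R_micro = C.sum() / B.sum(), C.sum() / A.sum()
    micro = 2 * P_micro * R_micro / (P_micro + R_micro)
    return macro, micro
```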

5.3 Results

We first evaluate our algorithm and compare it with TSVM and Co-Training on the 5 comp.* newsgroups dataset (see Figure 3). In our algorithm, we set the parameter p to 100%, which means clustering once and classifying once; the selection of p is explained later.

5 Available at http://svmlight.joachims.org/.


Figure 3. Comparative study of TSVM, Co-Training, and our algorithm CBC when given small training data on the 5 comp.* newsgroups dataset.

The algorithms run several rounds with the number of labeled examples per class ranging over 1, 2, 4, 8, ..., 512. For each number of labeled examples, we randomly choose 10 sets on which to test the three methods, and plot their average in Figure 3. From Figure 3, we can see that when the number of labeled examples is very small, CBC performs significantly better than the other algorithms. For example, when there is 1 labeled example per class, our method outperforms TSVM by 5% and Co-Training by 22%; when there are 2, our method outperforms TSVM by 9% and Co-Training by 12%. Only when the number of labeled samples exceeds 64 do TSVM and Co-Training achieve slightly better performance than our method. To evaluate the performance of CBC over a larger range of labeled data, we run the same algorithm together with TSVM, Co-Training, and SVM on different percentages of labeled data on the three datasets above. Figures 4, 5, and 6 illustrate the results. The horizontal axis indicates the percentage of labeled data in all training data, and the vertical axis is the Micro-F1 measure; we vary the percentage from 0.50% to 90%. The performance of a basic SVM, which does not exploit unlabeled data, is always the worst, because the labeled data is too sparse for SVM to generalize effectively. On all datasets, CBC performs best when the labeled data percentage is less than 1%; when the number of labeled documents increases, the performance of our algorithm remains comparable to the other methods. The Co-Training algorithm performs poorly when the training data is small because it assumes that an initial "weak predictor" can be found; with very little training data, it is often heavily biased with respect to the real data distribution. However, Co-Training sometimes obtains better results than the others, especially on the Reuters dataset in Figure 4, because it exploits the feature split to form two views of the training data. In Figures 4, 5, and 6, TSVM always has an intermediate

performance between Co-Training and our algorithm.


Figure 4. Micro-F1 of four algorithms on the Reuters dataset


Figure 5. Micro-F1 of four algorithms on the ODP dataset


Figure 6. Micro-F1 of four algorithms on the 5 comp.* newsgroups dataset

Figure 7. Micro-F1 of different algorithm settings with 0.5% labeled data on the 5 comp.* newsgroups dataset.


Figure 8. Variance of the different algorithms for different labeled data sizes (each calculated over 10 randomly selected samples).

One limitation of our algorithm is that as the amount of labeled data increases, its performance grows slowly and sometimes even drops (as in Figure 4 on the Reuters dataset). This can be explained by the simple way labeled examples are integrated in the clustering step: labeled examples could not only provide constraints for the clustering but could also be used to modify the similarity measure, which we leave for future work. In Figure 7, we empirically analyze the proposed algorithm under different values of the parameter p, based on experiments on the 5 comp.* newsgroups with the same 0.5% of initially labeled data. Different p values control how much data is labeled in each iteration: the 10% curve means we increase the labeled data by 10% each time; the 100% curve means we label all the data at once; and the exp2 curve means we add examples in a sequence of powers of 2 (i.e. we sequentially add 0.5%, 1%, 2%, 4%, ... of the examples to the labeled set). As can be seen from Figure 7, although the 10% and exp2 selections improve the classification monotonically, a simpler but more effective setting of this algorithm is to just cluster once

and classify once. This can be explained by the fact that, although the clustering step provides informative examples for the classification step, the examples selected by the TSVM and SVM classifiers do not provide enough information for the subsequent clustering in the proposed algorithm. Finally, we evaluate the stability of the algorithms by the variance of the three algorithms calculated over 10 randomly selected samples for different sample sizes (see Figure 8). This experiment is also conducted on the 5 comp.* newsgroups. Generally, the TSVM algorithm has the smallest variance and Co-Training the largest. The large variance of Co-Training is natural for the aforementioned reason: the initial labeled data has a large impact on the initial weak predictor. The variance of our algorithm is lower than that of Co-Training but higher than that of TSVM, mainly because our simple version of the constrained k-means algorithm may also fall into a local minimum when given a poor starting point.

6. Conclusion and Future Work

This paper presented a clustering based approach for semi-supervised text classification. The experiments showed the superior performance of our method over existing methods such as TSVM and Co-Training when the labeled data size is extremely small; when there is sufficient labeled data, our method is comparable to TSVM and Co-Training. To the best of our knowledge, no work focused on clustering examples aided by labeled data has been reported. Some work on constrained clustering (Wagstaff et al., 2001; Klein et al., 2002) can be considered the most relevant; it uses prior knowledge in the form of cannot-link and must-link constraints to guide clustering algorithms. In our case, however, the labeled data provide not only such constraints but also label information, which can be exploited to assign labels to unlabeled data. The constrained clustering method in this paper may not be sophisticated enough to capture all the information carried in the labeled data. We plan to evaluate other clustering methods and to further adjust the similarity measure with the aid of labeled examples. Another direction is to evaluate the validity of the two general classifiers used in our framework, and to investigate the problems of example selection, confidence assessment, and noise control for further performance improvement.

References

Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM TOIS, Vol. 12, No. 3, 223-251.

Blum, A. & Mitchell, T. (1998). Combining labeled and unlabeled data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92-100).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol (Eds.), Proceedings of the European Conference on Machine Learning (pp. 137-142). Berlin: Springer.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200-209). San Francisco: Morgan Kaufmann.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the 19th International Conference on Machine Learning.

Lewis, D. D. (1998). Naive Bayes at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning (ECML'98).

Masand, B., Linoff, G., & Waltz, D. (1992). Classifying news stories using memory based reasoning. In Proceedings of the 15th ACM SIGIR Conference, 59-64.

Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14.

Ng, T. H., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th ACM SIGIR Conference.

Nigam, K. & Ghani, R. (2002). Analyzing the effectiveness and applicability of co-training. In Proceedings of the 9th International Conference on Information and Knowledge Management.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Combining clustering and co-training to enhance text classification using unlabeled data. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.

Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974-979.

Seeger, M. (2001). Learning with labeled and unlabeled data. Technical report, Edinburgh University.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Wagstaff, K., Cardie, C., Rogers, S., & Caruana, R. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning.

Yang, Y. & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM TOIS, Vol. 12, No. 3, 252-277.

Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference.

Zhang, T. & Oles, F. (2000). A probability analysis on the value of unlabeled data for classification problems. In Proceedings of the 17th International Conference on Machine Learning.