WCCI 2012 IEEE World Congress on Computational Intelligence (IEEE CEC), June 10-15, 2012 - Brisbane, Australia
Particle Swarm Optimization based Semi-Supervised Learning on Chinese Text Categorization

Shi Cheng∗†, Yuhui Shi†, Quande Qin‡
∗ Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, UK
† Department of Electrical & Electronic Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China
‡ College of Management, Shenzhen University, Shenzhen, China
[email protected], [email protected]
Abstract—For many large scale learning problems, acquiring a large amount of labeled training data is expensive and time-consuming. Semi-supervised learning is a machine learning paradigm that utilizes unlabeled data to build better classifiers. However, unlabeled data with wrong predictions will mislead the classifier. In this paper, we propose a particle swarm optimization based semi-supervised learning classifier to solve the Chinese text categorization problem. This classifier utilizes an iterative strategy, and the result of the classifier is determined by a document's previous prediction and its neighbors' information. The new classifier is tested on a Chinese text corpus and compared with the k nearest neighbor method, the k weighted nearest neighbor method, and the self-training classifier.
I. INTRODUCTION

Particle Swarm Optimization (PSO) was introduced by Eberhart and Kennedy in 1995 [1], [2]. It is a population-based stochastic algorithm modeled on social behaviors observed in flocking birds. A particle flies through the search space with a velocity that is dynamically adjusted according to its own and its companions' historical behaviors. Each particle's position represents a solution to the problem. Particles tend to fly toward better and better search areas over the course of the search process [3], [4].

For many large scale learning problems, acquiring a large amount of labeled training data is difficult and time-consuming. Semi-supervised learning (SSL) is a machine learning paradigm that utilizes unlabeled data to build better classifiers. Traditional classifiers are trained only on labeled data, i.e., on pairs of features and labels. However, labeled instances are often difficult, expensive, or time-consuming to obtain because they require the effort of experienced human experts. Unlabeled data, on the other hand, may be relatively easy to collect but hard to exploit. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better classifiers. Semi-supervised learning is of great interest both in theory and in practice because it requires less human effort and gives higher accuracy than using labeled data alone [5].

Text categorization, also termed text classification (TC), is the problem of finding the correct category (or categories) for documents, given a set of categories (subjects, topics) and a collection of text documents. Text categorization can be
considered as a mapping f : D → C from the document space D onto the set of classes C. The objective of a classifier is to obtain accurate categorization results or a high confidence in its predictions. The most common bag-of-words model simply uses all words in a document as the features, and thus the dimension of the feature space is equal to the number of different words across all of the documents. This indicates that the data in each document is not treated as text, but as a collection of words. The methods of assigning weights to the features may vary. The simplest is the binary method, in which the feature weight is one if the corresponding word is present in the document and zero otherwise [6].

Recently, particle swarm optimization has been utilized for data categorization problems [7], [8]. In these methods, PSO is only utilized to optimize the parameters of the classifier. In particle swarm optimization, a particle not only learns from its own experience, it also learns from its companions. This indicates that a particle's 'moving position' is determined by its own experience and its neighbors' experience. With this concept, we introduce a particle swarm optimization based semi-supervised learning method to solve the Chinese text categorization problem.

The rest of the paper is organized as follows. The basic PSO algorithm and some distance and similarity metrics are reviewed in Section II. In Section III the problem and process of text categorization are discussed. In Section IV, the nearest neighbor classifier and self training are reviewed, and particle swarm optimization based semi-supervised learning is introduced. In Section V, the results of the different methods on the text corpus are given, and the properties of the different methods are discussed. Finally, Section VI concludes with some remarks and future research directions.

II. PRELIMINARIES

A. Particle Swarm Optimization

The original PSO algorithm is simple in concept and easy to implement [9], [10]. The basic equations are as follows:

v_i ← w·v_i + c_1·rand()·(p_i − x_i) + c_2·Rand()·(p_g − x_i)    (1)

x_i ← x_i + v_i    (2)
where w denotes the inertia weight, which is usually less than 1, c_1 and c_2 are two positive acceleration constants, rand()
and Rand() are two random functions that generate uniformly distributed random numbers in the range [0, 1], x_i represents the ith particle's position, v_i represents the ith particle's velocity, p_i is termed the personal best, which refers to the best position found by the ith particle, and p_g is termed the local best, which refers to the position with the best fitness value found so far by the members in the ith particle's neighborhood.

The basic process of PSO is shown in Algorithm 1. A particle updates its velocity according to equation (1) and its position according to equation (2). The c_1·rand()·(p_i − x_i) part can be seen as cognitive behavior, while the c_2·Rand()·(p_g − x_i) part can be seen as social behavior. In particle swarm optimization, a particle not only learns from its own experience, it also learns from its companions. This indicates that a particle's 'moving position' is determined by its own experience and its neighbors' experience [11].

Algorithm 1 The basic process of particle swarm optimization
1: Initialize velocity and position randomly for each particle in every dimension.
2: while a "good enough" solution has not been found and the maximum number of iterations has not been reached do
3:   Calculate each particle's fitness value.
4:   Compare the fitness value of the current position with that of the best position in history (personal best, termed pbest). For each particle, if the fitness value of the current position is better than that of pbest, then update pbest to the current position.
5:   Select the particle that has the best fitness value among the current particle's neighborhood; this particle is called the neighborhood best (termed nbest). If the current particle's neighborhood includes all particles, then this neighborhood best is the global best (termed gbest); otherwise, it is the local best (termed lbest).
6:   for each particle do
7:     Update the particle's velocity and position according to equations (1) and (2), respectively.
8:   end for
9: end while
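As a concrete illustration of equations (1) and (2) and Algorithm 1, the following is a minimal Python sketch, assuming a global-best topology and a simple continuous minimization problem (the sphere function); the parameter values and the fixed iteration budget are common illustrative choices, not settings used in this paper.

import random

def pso_minimize(fitness, dim, n_particles=20, iters=100,
                 w=0.72, c1=1.49, c2=1.49, lo=-10.0, hi=10.0):
    # Global-best PSO following equations (1) and (2).
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]                    # personal best positions (p_i)
    pbest_f = [fitness(xi) for xi in x]            # personal best fitness values
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # global best (p_g)

    for _ in range(iters):                         # main loop of Algorithm 1
        for i in range(n_particles):
            for d in range(dim):
                # Equation (1): inertia + cognitive part + social part.
                v[i][d] = (w * v[i][d]
                           + c1 * random.random() * (pbest[i][d] - x[i][d])
                           + c2 * random.random() * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]                 # Equation (2)
            f = fitness(x[i])
            if f < pbest_f[i]:                     # update personal best
                pbest[i], pbest_f[i] = x[i][:], f
                if f < gbest_f:                    # update global best
                    gbest, gbest_f = x[i][:], f
    return gbest, gbest_f

# Example usage: minimize the 5-dimensional sphere function.
best_position, best_value = pso_minimize(lambda p: sum(t * t for t in p), dim=5)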
B. Distance Measure

Distances, which are dissimilarities with regard to certain properties, measure how unlike or dissimilar objects are.

1) Euclidean Distance: The most popular metric is the traditional Euclidean distance:

dis(x, y) = sqrt( Σ_{i=1}^{k} (x_i − y_i)² )

2) Manhattan Distance: The Manhattan distance, also termed the Hamming distance when the two objects have only binary attributes (i.e., between two binary vectors), counts the number of positions in which the two objects differ:

dis(x, y) = Σ_{i=1}^{k} |x_i − y_i|

C. Similarity Measure

Similarity is defined as a mapping from two vectors x and y to the interval [0, 1] (for the overlap measure, [0, +∞)). Four kinds of similarity measure are used in information retrieval: the Dice, Jaccard, Cosine, and Overlap measures [12].

1) Dice Measure: For the similarity between two vectors x and y, the Dice measure is:

sim(x, y) = 2 Σ_{i=1}^{k} x_i y_i / ( Σ_{i=1}^{k} x_i² + Σ_{i=1}^{k} y_i² )

where 0 ≤ sim(x, y) ≤ 1; if text A and text B have the same contents, the Dice similarity is 1.

2) Jaccard Measure: For the similarity between two vectors x and y, the Jaccard measure is:

sim(x, y) = Σ_{i=1}^{k} x_i y_i / ( Σ_{i=1}^{k} x_i² + Σ_{i=1}^{k} y_i² − Σ_{i=1}^{k} x_i y_i )

where 0 ≤ sim(x, y) ≤ 1; if text A and text B have the same contents, the Jaccard similarity is 1.

3) Cosine Measure: For the similarity between two vectors x and y, the Cosine measure is:

sim(x, y) = Σ_{i=1}^{k} x_i y_i / sqrt( Σ_{i=1}^{k} x_i² · Σ_{i=1}^{k} y_i² )

If x and y are two document vectors, then

sim(x, y) = (x · y) / (‖x‖ ‖y‖)

where · indicates the vector dot product, x · y = Σ_{i=1}^{k} x_i y_i, and ‖x‖ is the length of vector x, ‖x‖ = sqrt( Σ_{i=1}^{k} x_i² ) = sqrt(x · x).

The cosine similarity is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, the angle between x and y is 90°, and they do not share any terms (words).
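These measures are straightforward to compute on term-frequency vectors. The following Python sketch (written for illustration, not taken from the paper) implements the Euclidean, Manhattan, Dice, Jaccard, and Cosine measures defined so far; zero vectors are not handled, and the overlap measure of the next subsection can be added analogously.

import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def dice(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return 2 * dot / (sum(xi * xi for xi in x) + sum(yi * yi for yi in y))

def jaccard(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - dot)

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norms

For the term-frequency vectors of x = {a, a, b} and y = {a, b, b}, i.e., [2, 1] and [1, 2], these functions give a Dice and cosine similarity of 0.8, a Jaccard similarity of about 0.667, a Euclidean distance of about 1.414, and a Manhattan distance of 2, matching the worked example in the measure comparison below.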
4) Overlap Measure: For the similarity between two vectors x and y, the Overlap measure is:

sim(x, y) = Σ_{i=1}^{k} x_i y_i / min( Σ_{i=1}^{k} x_i², Σ_{i=1}^{k} y_i² )
where 0 ≤ sim(x, y); the overlap similarity can be any value greater than 0.

D. Measure Comparison

Four kinds of similarity measure are used for calculating the similarity of two vectors. The Dice similarity measures the relation between the intersection and the mean size of two vectors. The Jaccard similarity considers the relation between the intersection and the union of two vectors. The Cosine similarity measures the relation between the intersection and the geometric mean of two vectors [13].

The Vector Space Model (VSM), or term vector model, is an algebraic model for representing text documents (and objects in general) as vectors of identifiers. Similarity measures in the vector space model do not take the magnitude of the two data objects into account. A distance measure might be a better choice when magnitude is important, since it considers the magnitude of the vector difference between two document vectors. However, distance measures suffer from a drawback: two documents with very similar contents can have a significant vector difference simply because one is much longer than the other. Thus, the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger. For example, for the term vectors x = {a, a, b} and y = {a, b, b}, the cosine, Dice, and overlap similarities are 0.8 and the Jaccard similarity is 0.667, while the Euclidean distance is 1.414 and the Manhattan distance is 2. If the vectors are doubled in magnitude, x = {a, a, a, a, b, b} and y = {a, a, b, b, b, b}, the similarity measures do not change, but the Euclidean distance changes to 2.828 and the Manhattan distance changes to 4; distance measures change as the magnitude changes.

Neither the similarity nor the distance measures take the sequence of terms in a text into account: the measures do not change when the sequence is modified. This is a weak point of the vector space model. For example, content spam, which adds many unrelated keywords to text content, exploits this to cheat search engines.

III. TEXT CATEGORIZATION PROBLEMS

The task of text categorization is to classify documents into a fixed number of predefined classes. A class is considered a semantic category that groups documents that have certain properties in common. Generally, a document can be in multiple, exactly one, or no classes. Yet, with the task of information filtering in mind, i.e., the categorization of documents as either relevant or non-relevant, we assume that each document is assigned to exactly one class.
More precisely, we pose the text categorization problem as follows [14]. Assume a space of textual documents D and a fixed set of k classes C = {c_1, · · · , c_k}, which implies a disjoint, exhaustive partition of D. Text categorization is a mapping, f : D → C, from the document space onto the set of classes.

In solving the text categorization problem, each text is transformed into a collection of words. In other words, the long text needs some processing before the classifier can work. This processing includes text preprocessing, Chinese word segmentation, and feature selection.

A. Text Preprocessing

One important step in text preprocessing is tokenization, or punctuation cleaning [15]. Because of the properties of the language, a single word contains no punctuation, so changing all punctuation marks to empty spaces is a useful way to simplify the text. After punctuation cleaning, the text becomes several short sentences, and the sentences become sets of words that can be looked up in a dictionary. The content of a text then becomes a vector in which each element is the frequency of a single word. Each word is one dimension of the text space, and the text becomes a vector space model, so that distance and similarity can be measured using this model.

B. Chinese Word Segmentation

The most widely used method to segment a Chinese sentence is dictionary based. The text is split into candidate terms that contain two or three Chinese characters; if such a term is found in the dictionary, then the term is a word. The quality of the dictionary may affect the performance of the different classifiers, so constructing a proper dictionary is an important step in Chinese segmentation.
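To illustrate the preprocessing and segmentation steps just described, the following Python sketch cleans punctuation and looks up two- and three-character candidate terms in a dictionary before building a term-frequency vector. The forward maximum-matching strategy and the Counter-based vector are assumptions made for illustration; the paper only states that candidate terms are checked against a dictionary.

import re
from collections import Counter

def clean_punctuation(text):
    # Replace every punctuation mark (Chinese or Western) with a space.
    return re.sub(r"[^\w\s]|_", " ", text)

def segment(sentence, dictionary, max_len=3):
    # Dictionary-based segmentation using forward maximum matching (an assumed
    # strategy): try the longest candidate term (up to max_len characters) that
    # appears in the dictionary; otherwise emit a single character.
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            term = sentence[i:i + length]
            if length == 1 or term in dictionary:
                words.append(term)
                i += length
                break
    return words

def term_frequency_vector(text, dictionary):
    # Bag-of-words representation: each distinct word is one dimension of the
    # text space, and its value is the word's frequency in the document.
    tokens = []
    for chunk in clean_punctuation(text).split():
        tokens.extend(segment(chunk, dictionary))
    return Counter(tokens)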
IV. CATEGORIZATION METHODS

A. k Nearest Neighbor

The k Nearest Neighbor (KNN) classifier finds the k training examples that are relatively similar to the attributes of the test example. These examples, known as the nearest neighbors, can be used to determine the class label of the test example. It is important to choose the right value of k. If k is too small, the nearest-neighbor classifier may be susceptible to overfitting because of noise in the training data. On the other hand, if k is too large, the nearest-neighbor classifier may misclassify the test instance because its list of nearest neighbors may include data points that are located far away from its neighborhood.

A high-level summary of the nearest-neighbor categorization method is given in Algorithm 2. The algorithm computes the distance (or similarity) between each test example z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest-neighbor list, D_z. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the amount of computation needed to find the nearest neighbors of a test example.

Algorithm 2 The k nearest neighbor categorization algorithm
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:   Compute d(x′, x), the distance between the test example x′ and every training example x, (x, y) ∈ D.
4:   Select D_z ⊆ D, the set of the k closest training examples to x′.
5:   The prediction is y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
6: end for
Once the nearest-neighbor list is obtained, the test example is classified based on the majority class of its nearest neighbors:

Majority voting: y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)

where v is a class label, y_i is the class label of one of the nearest neighbors, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the categorization. This makes the algorithm sensitive to the choice of k. The k weighted nearest neighbor (KWNN) method applies a weight to each neighbor. Distance-weighted voting is a straightforward way to weight each neighbor: a training example located far away from an unlabeled example has a weaker impact on the categorization result than one that is close. The prediction is defined as follows:

y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} w_i × I(v = y_i)
Nearest-neighbor categorization is part of a more general technique known as instance-based learning, which uses specific training instances to make predictions without having to maintain an abstraction (or model) derived from the data. Instance-based learning algorithms require a proximity measure to determine the similarity or distance between instances, and a categorization function that returns the predicted class of a test instance based on its proximity to other instances. Lazy learners such as nearest-neighbor classifiers do not require model building. However, classifying a test example can be quite expensive because the proximity values between the test example and every training example must be computed individually. In contrast, eager learners often spend the bulk of their computing resources on model building; once a model has been built, classifying a test example is extremely fast.
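The prediction step of Algorithm 2 and its distance-weighted variant can be sketched in a few lines of Python. The representation of training examples as (vector, label) pairs and the use of the similarity itself as the weight w_i are illustrative choices consistent with the description above.

from collections import defaultdict

def knn_predict(test_vec, train_set, k, sim, weighted=False):
    # train_set: list of (vector, label) pairs; sim: similarity function
    # (higher means closer), e.g. the cosine measure from Section II.
    # Keep the k most similar training examples (the set D_z of Algorithm 2).
    neighbors = sorted(train_set, key=lambda xy: sim(test_vec, xy[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        # Majority voting adds 1 per neighbour; weighted voting adds the
        # neighbour's similarity to the test example as its weight w_i.
        votes[label] += sim(test_vec, vec) if weighted else 1.0
    # y' = argmax_v of the (weighted) indicator sum.
    return max(votes, key=votes.get)

For example, knn_predict(doc_vec, labeled_docs, k=3, sim=cosine, weighted=True) corresponds to the KWNN rule with the cosine measure, assuming doc_vec and the training vectors are aligned term-frequency vectors (both names are illustrative).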
B. Semi-Supervised Learning

In many real-world learning scenarios, acquiring a large amount of labeled training data is expensive and time-consuming. Semi-supervised learning is a machine learning paradigm that utilizes unlabeled data to build better classifiers. Semi-supervised learning is of great interest both in theory and in practice because it requires less human effort and gives higher accuracy than using labeled data alone [5].

Self training is a simple and easy-to-apply semi-supervised learning technique, characterized by the fact that the learning process uses its own predictions to teach itself [16]. The main idea is to first train with the labeled data; the unlabeled data, with their predicted labels, are then utilized to predict other unlabeled data. Self training assumes that the predictions based on the previous training tend to be correct, so that other unlabeled data can benefit from these predictions.

C. Particle Swarm Optimization based Semi-Supervised Learning

The "No Free Lunch" (NFL) theorem for optimization, introduced by Wolpert and Macready [17], claims that under certain assumptions no algorithm is better than any other on average over all problems. In semi-supervised learning, unlabeled data with predictions are utilized to train on other data. If the previous prediction has high confidence, the learning benefits from this experience; however, if the previous prediction has many errors, the later predictions will be misled.

The particle swarm optimization based semi-supervised learning method is shown in Algorithm 3. The algorithm utilizes an iterative strategy, which compares each document's previous prediction with its neighbors' information. The distance or similarity is recorded for each test example. If a test example finds a better prediction, i.e., a closer distance or a higher similarity than the record, the test example is predicted to a new category, and the record is updated to the closer distance or higher similarity. After several iterations, the error rate of the categorization can be reduced.

The fitness function of text categorization differs depending on whether similarity measures or distance measures are utilized in the classifier:

• Documents may belong to the same class if they have a higher similarity. The objective of categorization is to maximize the similarity:

  f(x) = max Σ_{i=1}^{k} sim(x_i^l, x^u)

• Documents in the same class will have a small distance. The objective of categorization is to minimize the distance:

  f(x) = min Σ_{i=1}^{k} dis(x_i^l, x^u)

where x_i^l is a labeled document and x^u is an unlabeled document.

Algorithm 3 The particle swarm optimization based semi-supervised learning
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:   Compute d(x′, x), the distance between the test example x′ and every training example x, (x, y) ∈ D.
4:   Select D_z ⊆ D, the set of the k closest training examples to z.
5:   The prediction is y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
6:   Initialize the archive: add each classified text into the archive D_a with a random probability.
7:   while the maximum number of iterations has not been reached do
8:     Compute d(x′, x) and d(x′, x_a), the distances between x′ and every example (x, y) ∈ D and every example in D_a.
9:     if the test example has a closer distance or a higher similarity than its record then
10:      Update the test example's predicted category: y′ = argmax_v Σ_{(x_i ∪ x_a, y_i) ∈ D_z} I(v = y_i)
11:    end if
12:    Update the archive: add each classified text into the archive D_a with a random probability.
13:  end while
14: end for
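The following is one possible reading of Algorithm 3 as a simplified Python sketch: predictions start from a k nearest neighbor vote, an archive of predicted documents is sampled with a random probability in every iteration, and a document's label is only updated when a neighbor closer than its recorded best is found. The archive probability, the iteration budget, and the use of the top neighbor's similarity as the record are assumptions, since the paper leaves these details open.

import random

def pso_ssl(labeled, unlabeled, k, sim, iters=10, p_archive=0.5):
    # labeled: list of (vector, label) pairs; unlabeled: list of vectors.
    def vote(vec, pool):
        # k-nearest-neighbour majority vote over pool; also return the highest
        # similarity found, which serves as the document's record.
        neighbors = sorted(pool, key=lambda xy: sim(vec, xy[0]), reverse=True)[:k]
        counts = {}
        for _, label in neighbors:
            counts[label] = counts.get(label, 0) + 1
        return max(counts, key=counts.get), sim(vec, neighbors[0][0])

    # Initial supervised prediction and record for every unlabeled document.
    preds, records = [], []
    for vec in unlabeled:
        label, best = vote(vec, labeled)
        preds.append(label)
        records.append(best)

    for _ in range(iters):
        # Archive of already-classified documents, sampled with a random probability.
        archive_idx = [j for j in range(len(unlabeled)) if random.random() < p_archive]
        for i, vec in enumerate(unlabeled):
            # A document never votes for itself.
            pool = labeled + [(unlabeled[j], preds[j]) for j in archive_idx if j != i]
            label, best = vote(vec, pool)
            if best > records[i]:   # a closer (more similar) neighbour was found
                preds[i], records[i] = label, best
    return preds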
V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Categorization Corpus

The test corpus is given in Table I; it has 10 categories and 950 news articles in total. The documents are distributed unequally across the categories: the class 'computer' has the most elements, containing 210 documents, while the class 'automobile' has only 42.

TABLE I: The test corpus used in our experimental study. This corpus has 950 texts in total, and different categories have different numbers of texts.

Item  Categories       Numbers
1     Human Resource   43
2     Sport            201
3     Health           100
4     Entertainment    107
5     Real Estate      67
6     Education        58
7     Automobile       42
8     Computer         210
9     Technology       74
10    Finance          48

B. Performance Metrics

Accuracy is a straightforward way to measure the performance of categorization:

Accuracy = correct predictions / total number of predictions

Equivalently, the performance of a model can be expressed in terms of its error rate:

Error rate = wrong predictions / total number of predictions

From the above equations, it follows directly that:

Error rate = 1 − Accuracy

The error rate metric only considers misclassified patterns; recall and precision consider the retrieved documents and the relevant documents together. From the perspective of information retrieval, the definitions are as follows [12], [15]. Recall (r) is the fraction of relevant documents that are retrieved:

Recall = correct predictions / relevant items = P(retrieved | relevant)

Precision (p) is the fraction of retrieved documents that are relevant:

Precision = correct predictions / retrieved items = P(relevant | retrieved)

The F_β metric utilizes a weight β to balance recall and precision. The formula is defined as follows:

F_β(r, p) = (β² + 1) p r / (β² p + r)

where β is the parameter allowing differential weighting of p and r. When the value of β is set to one (denoted F_1), recall and precision are weighted equally. The F_1 measure, which represents the harmonic mean of recall (r) and precision (p), utilizes an equal weight to combine these two components [18]:

F_1(r, p) = 2 / (1/r + 1/p) = 2 r p / (r + p)

C. Nearest Neighbor

The nearest neighbor classifier and the k weighted nearest neighbor classifier are tested first in the experiments.

1) k Nearest Neighbor: Table II gives the categorization results of the k = 1 nearest neighbor classifier. The number of wrongly categorized documents and the error rate are given when the training examples of each category are the first 10, 20, and 30 examples, respectively. The numbers of wrongly categorized documents are given in the '10', '20', and '30' columns of Table II and the following tables; the corresponding error rates are given in the 'rate' columns.

TABLE II: The categorization result of KNN. k is 1 in this experiment, and the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  610  0.6421  427  0.4494  334  0.3515
Manhattan  702  0.7389  538  0.5663  451  0.4747
Dice       313  0.3294  226  0.2378  183  0.1926
Jaccard    313  0.3294  226  0.2378  183  0.1926
Overlap    513  0.54    426  0.4484  314  0.3305
Cosine     292  0.3073  209  0.22    168  0.1768
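The error rates reported in Table II and the following tables are simply the number of wrong predictions divided by the 950 documents. A small sketch of the metrics defined in Section V-B, computed from gold and predicted label lists (the per-category formulation is an illustrative choice):

def error_rate(gold, pred):
    # Fraction of wrongly predicted documents (wrong predictions / total).
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, category):
    # Precision, recall and F1 for a single category.
    tp = sum(g == category and p == category for g, p in zip(gold, pred))
    retrieved = sum(p == category for p in pred)
    relevant = sum(g == category for g in gold)
    p = tp / retrieved if retrieved else 0.0
    r = tp / relevant if relevant else 0.0
    f1 = 2 * r * p / (r + p) if (r + p) else 0.0
    return p, r, f1

For instance, the 427 wrong predictions for the Euclidean measure with 20 training examples correspond to an error rate of 427 / 950 ≈ 0.4494, as listed in Table II.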
Table III gives the categorization results of the k = 3 nearest neighbor classifier. With 3 nearest neighbors, if more than one neighbor belongs to a specific class, the unlabeled document is predicted to that class; otherwise, it is assigned to the nearest neighbor's class.

TABLE III: The categorization result of KNN. k is 3 in this experiment, and the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  672  0.7073  511  0.5378  412  0.4336
Manhattan  780  0.8210  552  0.5810  482  0.5073
Dice       294  0.3094  205  0.2157  162  0.1705
Jaccard    294  0.3094  205  0.2157  162  0.1705
Overlap    468  0.4926  397  0.4178  298  0.3136
Cosine     265  0.2789  184  0.1936  150  0.1578

From the results above, we can conclude that the similarity metrics are better than the distance metrics, and that the cosine metric has the best performance among all the metrics. The error rate decreases with more training examples.

2) k Weighted Nearest Neighbor: Tables IV, V, and VI give the results of the k weighted nearest neighbor classifier with k being 3, 7, and 11, respectively. Each of the k nearest neighbors is weighted by a weight, which is the distance or similarity between the unlabeled document and the neighbor. Summing all distances or similarities, the unlabeled document is predicted to belong to the class with the closest total distance or the highest total similarity.

TABLE IV: The categorization result of KWNN. k is 3 in this experiment, and the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  708  0.7452  566  0.5957  452  0.4757
Manhattan  781  0.8221  586  0.6168  531  0.5589
Dice       294  0.3094  206  0.2168  156  0.1642
Jaccard    294  0.3094  206  0.2168  162  0.1705
Overlap    461  0.4852  374  0.3936  287  0.3021
Cosine     266  0.28    186  0.1957  151  0.1589

TABLE V: The categorization result of KWNN. k is 7 in this experiment, and the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  595  0.6263  636  0.6694  512  0.5389
Manhattan  683  0.7189  693  0.7294  583  0.6136
Dice       264  0.2778  183  0.1926  156  0.1642
Jaccard    263  0.2768  183  0.1926  156  0.1642
Overlap    406  0.4273  322  0.3389  252  0.2652
Cosine     250  0.2631  176  0.1852  129  0.1357

TABLE VI: The categorization result of KWNN. k is 11 in this experiment, and the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  760  0.8     588  0.6189  526  0.5536
Manhattan  782  0.8231  659  0.6936  498  0.5242
Dice       280  0.2947  197  0.2073  148  0.1557
Jaccard    280  0.2947  197  0.2073  148  0.1557
Overlap    379  0.3989  342  0.36    265  0.2789
Cosine     266  0.28    167  0.1757  117  0.1231

The results of these classifiers are sensitive to the choice of k. From the results above, we can conclude that performance improves as the value of k increases; however, this improvement is not monotonic, and the performance may get worse once k exceeds a certain value. For example, the result of the Cosine similarity with 10 training examples in Table V is better than the corresponding result in Table VI.

D. Semi-Supervised Learning

1) Self Training: Tables VII, VIII, IX, and X give the categorization results of self training. In this experiment, the previous prediction of the unlabeled data is based on the k weighted nearest neighbor classifier, with k being 1, 3, 7, and 11, respectively. If an unlabeled document has been predicted to belong to a class, this document is added to the example documents, and the other unlabeled documents learn from the previous prediction. If the previous prediction has high confidence, the later predictions benefit; otherwise, they are misled.

TABLE VII: The categorization result of self training. The prediction is based on the KWNN classifier, and k is 1 in this experiment; the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  690  0.7263  473  0.4978  376  0.3957
Manhattan  770  0.8105  550  0.5789  462  0.4863
Dice       356  0.3747  278  0.2926  203  0.2136
Jaccard    356  0.3747  278  0.2926  203  0.2136
Overlap    540  0.5684  387  0.4073  301  0.3168
Cosine     327  0.3442  249  0.2621  174  0.1831

TABLE VIII: The categorization result of self training. The prediction is based on the KWNN classifier, and k is 3 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  702  0.7389  531  0.5589  415  0.4368
Manhattan  737  0.7757  598  0.6294  518  0.5452
Dice       328  0.3452  243  0.2557  172  0.1810
Jaccard    328  0.3452  243  0.2557  172  0.1810
Overlap    578  0.6084  378  0.3978  275  0.2894
Cosine     300  0.3157  263  0.2768  183  0.1926
TABLE IX: The categorization result of self training. The prediction is based on the KWNN classifier, and k is 7 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  699  0.7357  638  0.6715  502  0.5284
Manhattan  782  0.8231  636  0.6694  559  0.5884
Dice       505  0.5315  275  0.2894  150  0.1578
Jaccard    505  0.5315  275  0.2894  150  0.1578
Overlap    627  0.66    481  0.5063  371  0.3905
Cosine     409  0.4305  243  0.2557  125  0.1315

TABLE X: The categorization result of self training. The prediction is based on the KWNN classifier, and k is 11 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  758  0.7978  630  0.6631  556  0.5852
Manhattan  818  0.8610  682  0.7178  586  0.6168
Dice       535  0.5631  278  0.2926  173  0.1821
Jaccard    535  0.5631  281  0.2957  173  0.1821
Overlap    654  0.6884  503  0.5294  372  0.3915
Cosine     513  0.54    276  0.2905  162  0.1705

2) Rotated Self Training: Tables XI, XII, XIII, and XIV give the categorization results of rotated self training. In this experiment, the previous prediction of the unlabeled data is based on the k weighted nearest neighbor classifier, with k being 1, 3, 7, and 11, respectively. If an unlabeled document has been predicted to belong to a class, this document is added to the training documents, and the other unlabeled documents learn from the previous prediction. After all unlabeled documents have been predicted to a class, every document is trained again.

Self training did not improve the performance of the classifier. This is because the first supervised learning step produces many wrong predictions, and the unlabeled documents with wrong categories mislead the later predictions.

TABLE XI: The categorization result of rotated self training. The prediction is based on the KWNN classifier, and k is 1 in this experiment; the training examples of each category are from 10 to 30.

Measure    10   rate    20   rate    30   rate
Euclidean  740  0.7789  502  0.5284  402  0.4231
Manhattan  772  0.8126  552  0.5810  464  0.4884
Dice       356  0.3747  274  0.2884  203  0.2136
Jaccard    356  0.3747  274  0.2884  203  0.2136
Overlap    537  0.5652  415  0.4368  327  0.3442
Cosine     305  0.3210  233  0.2452  172  0.1810

TABLE XII: The categorization result of rotated self training. The prediction is based on the KWNN classifier, and k is 3 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  792  0.8336  636  0.6694  526  0.5536
Manhattan  774  0.8147  719  0.7568  628  0.6610
Dice       324  0.3410  227  0.2389  159  0.1673
Jaccard    324  0.3410  227  0.2389  159  0.1673
Overlap    604  0.6357  371  0.3905  276  0.2905
Cosine     263  0.2768  271  0.2852  149  0.1568

TABLE XIII: The categorization result of rotated self training. The prediction is based on the KWNN classifier, and k is 7 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  650  0.6842  689  0.7252  512  0.5389
Manhattan  664  0.6989  711  0.7484  631  0.6642
Dice       519  0.5463  252  0.2652  145  0.1526
Jaccard    519  0.5463  252  0.2652  145  0.1526
Overlap    655  0.6894  528  0.5557  432  0.4547
Cosine     397  0.4178  216  0.2273  115  0.1210

TABLE XIV: The categorization result of rotated self training. The prediction is based on the KWNN classifier, and k is 11 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  669  0.7042  586  0.6168  566  0.5957
Manhattan  706  0.7431  710  0.7473  603  0.6347
Dice       537  0.5652  278  0.2926  159  0.1673
Jaccard    537  0.5652  278  0.2926  159  0.1673
Overlap    659  0.6936  559  0.5884  425  0.4473
Cosine     527  0.5547  261  0.2747  154  0.1621

E. Particle Swarm Optimization based Semi-Supervised Learning

Tables XV, XVI, XVII, and XVIII give the categorization results of particle swarm optimization based semi-supervised learning. The supervised learning method is again the k weighted nearest neighbor classifier, with k being 1, 3, 7, and 11, respectively. If an unlabeled document has been predicted to belong to a class, this document has a probability of being added to an additional archive. Other unlabeled documents learn from the training examples and the predictions in this archive. This classifier is an iterative method: if an unlabeled document has a closer distance or a higher similarity to some examples, the unlabeled document is predicted to belong to those examples' class. After several iterations, the classifier learns from the unlabeled documents, and the error rate is decreased.

TABLE XV: The categorization result of PSO based SSL. The prediction is based on the KWNN classifier, and k is 1 in this experiment. The training examples of each category are 10, 20, and 30, respectively.

Measure    10   rate    20   rate    30   rate
Euclidean  610  0.6421  427  0.4494  334  0.3515
Manhattan  702  0.7389  538  0.5663  451  0.4747
Dice       308  0.3242  231  0.2431  183  0.1926
Jaccard    309  0.3252  228  0.24    183  0.1926
Overlap    504  0.5305  419  0.4410  308  0.3242
Cosine     278  0.2926  206  0.2168  168  0.1768

TABLE XVI: The categorization result of PSO based SSL. The prediction is based on the KWNN classifier, and k is 3 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  708  0.7452  572  0.6021  459  0.4831
Manhattan  781  0.8221  590  0.6210  537  0.5652
Dice       279  0.2936  199  0.2094  160  0.1684
Jaccard    280  0.2947  200  0.2105  160  0.1684
Overlap    458  0.4821  370  0.3894  282  0.2968
Cosine     248  0.2610  178  0.1873  150  0.1578

TABLE XVII: The categorization result of PSO based SSL. The prediction is based on the KWNN classifier, and k is 7 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  589  0.62    634  0.6673  511  0.5378
Manhattan  684  0.72    693  0.7294  583  0.6136
Dice       253  0.2663  173  0.1821  153  0.1610
Jaccard    254  0.2673  173  0.1821  153  0.1610
Overlap    400  0.4210  317  0.3336  244  0.2568
Cosine     231  0.2431  174  0.1831  124  0.1305

TABLE XVIII: The categorization result of PSO based SSL. The prediction is based on the KWNN classifier, and k is 11 in this experiment.

Measure    10   rate    20   rate    30   rate
Euclidean  756  0.7957  591  0.6221  527  0.5547
Manhattan  782  0.8231  657  0.6915  499  0.5252
Dice       268  0.2821  191  0.2010  144  0.1515
Jaccard    266  0.28    191  0.2010  144  0.1515
Overlap    380  0.4     343  0.3610  261  0.2747
Cosine     250  0.2631  162  0.1705  113  0.1189

The PSO based semi-supervised learning classifier has the best performance in the experiment. This method utilizes the predictions on unlabeled data to guide other unlabeled data. An iterative strategy is utilized in the method, and different metrics can be used in different iterations.

VI. CONCLUSIONS

For many large scale learning problems, acquiring a large amount of labeled training data is expensive and time-consuming. Semi-supervised learning is a machine learning paradigm that utilizes unlabeled data to build better classifiers. However, unlabeled data with wrong predictions will mislead the classifier. In this paper, we proposed a particle swarm optimization based semi-supervised learning classifier to solve the Chinese text categorization problem. This classifier utilizes an iterative strategy, and the result of the classifier is determined by a document's previous prediction and its neighbors' information. The new classifier was tested on a Chinese text corpus. In the experiments, the performance of this classifier was better than that of the k nearest neighbor method, the k weighted nearest neighbor method, and the self-training classifier.

The error rate is utilized in this paper to measure the performance of the different classifiers. Besides the error rate, the precision, recall, and F_β metrics are often used to measure performance. These metrics all represent text categorization as a single objective optimization problem. However, for real-world problems, different problems carry different error risks (or losses), and we may need different solutions in different situations. The F_β metric utilizes a fixed value of β to balance precision and recall at a time, so less information about the classifier can be obtained from this metric. The text categorization problem can instead be solved as a multi-objective problem [19]. In multi-objective optimization, precision and recall can be considered at the same time, and a proper classifier can be found to suit different situations.

Testing the new classifier on different Chinese text corpora, comparing this method with other classification methods such as the support vector machine, and modeling text categorization as a multi-objective problem that maximizes recall and precision at the same time are our future research work.

ACKNOWLEDGMENT

The authors' work is partially supported by the National Natural Science Foundation of China under Grant No. 60975080.
REFERENCES

[1] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.
[2] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of IEEE International Conference on Neural Networks (ICNN), 1995, pp. 1942–1948.
[3] R. Eberhart and Y. Shi, "Particle swarm optimization: Developments, applications and resources," in Proceedings of the 2001 Congress on Evolutionary Computation (CEC2001), 2001, pp. 81–86.
[4] X. Hu, Y. Shi, and R. Eberhart, "Recent advances in particle swarm," in Proceedings of the 2004 Congress on Evolutionary Computation (CEC2004), 2004, pp. 90–97.
[5] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005.
[6] A. Cervantes, I. M. Galván, and P. Isasi, "AMPSO: A new particle swarm method for nearest neighborhood classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 39, no. 5, pp. 1082–1091, October 2009.
[7] Y. Jin, W. Xiong, and C. Wang, "Feature selection for Chinese text categorization based on improved particle swarm optimization," in International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), Beijing, August 2010, pp. 1–6.
[8] Y. Zhang, M. Jiang, and D. Yuan, "Chinese text mining based on distributed SMO," in IEEE 3rd International Conference on Communication Software and Networks (ICCSN), 27-29 May 2011, pp. 175–177.
[9] J. Kennedy, R. Eberhart, and Y. Shi, Swarm Intelligence, 1st ed. Morgan Kaufmann Publisher, 2001.
[10] R. Eberhart and Y. Shi, Computational Intelligence: Concepts to Implementations, 1st ed. Morgan Kaufmann Publisher, 2007.
[11] S. Cheng, Y. Shi, and Q. Qin, "Experimental study on boundary constraints handling in particle swarm optimization: From population diversity perspective," International Journal of Swarm Intelligence Research (IJSIR), vol. 2, no. 3, pp. 43–69, 2011.
[12] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.
[13] M. H. Dunham, Data Mining Introductory and Advanced Topics, 1st ed. Pearson Education, 2003.
[14] C. Lanquillon, "Enhancing text classification to improve information filtering," Ph.D. dissertation, DaimlerChrysler AG, Research & Technology, December 2001.
[15] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
[16] A. B. Goldberg, "New directions in semi-supervised learning," Ph.D. dissertation, University of Wisconsin-Madison, 2010.
[17] D. Wolpert and W. Macready, "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, April 1997.
[18] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42–49.
[19] Y. Jin and B. Sendhoff, "A systems approach to evolutionary multiobjective structural optimization and beyond," IEEE Computational Intelligence Magazine, vol. 4, no. 3, pp. 62–76, August 2009.