World Wide Web DOI 10.1007/s11280-013-0215-7
A unified framework for semi-supervised PU learning Haoji Hu · Chaofeng Sha · Xiaoling Wang · Aoying Zhou
Received: 6 October 2012 / Revised: 31 January 2013 / Accepted: 3 April 2013 © Springer Science+Business Media New York 2013
Abstract Traditional supervised classifiers use only labeled data (feature/label pairs) as the training set, while the unlabeled data is used as the testing set. In practice, labeled data is often hard to obtain, and the unlabeled data may contain instances that belong to none of the predefined categories in the labeled data. This problem has been widely studied in recent years, and semi-supervised PU learning is an efficient way to learn from positive and unlabeled examples. Among all the semi-supervised PU learning methods, it is hard to choose a single approach that fits every unlabeled data distribution. In this paper, a new framework is designed to integrate different semi-supervised PU learning algorithms in order to take advantage of existing methods. In essence, we propose an automatic KL-divergence learning method that utilizes knowledge of the unlabeled data distribution. The experimental results show that (1) data distribution information is very helpful for semi-supervised PU learning methods; (2) the proposed framework can achieve higher precision than the state-of-the-art methods. Keywords Data mining · Semi-supervised learning · PU learning
H. Hu · X. Wang (B) · A. Zhou
Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
X. Wang e-mail: [email protected]
H. Hu e-mail: [email protected]
A. Zhou e-mail: [email protected]
C. Sha
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China
C. Sha e-mail: [email protected]
1 Introduction Classification is a fundamental and well-studied data mining problem. The traditional classification procedure can be described as follows: given a set of training data containing category information, a model is learned from the training set, and this model is then used to classify unseen testing instances. There is an assumption that the new instances belong to one of the categories in the training data. This assumption, however, is not satisfied in many applications. In practice, some testing instances may not belong to any of the predefined classes; these are called negative instances. For example, assume that there exist two different classes of mail: news and entertainment. If a new email about a job description arrives, it will nevertheless be assigned to one of the given two classes. To address this problem, PU learning is studied, where only positive and unlabeled instances are given and no negative instances are provided in the training set. One application scenario of PU learning is where the input consists of more than one labeled positive training class together with unlabeled data. Previous work addresses this problem by using a large amount of unlabeled data, together with a small amount of labeled data, to build better classifiers. Table 1 compares the existing methods. All these methods follow a one-vs-all schema. The performance of the different algorithms depends on the distribution of the unlabeled data: these methods perform well when the unlabeled set contains plenty of negative data. Some methods, such as LGN, only deal with textual data. LiKL does not constrain the size of the training set, but it needs extra effort to tune some parameters. Since the unlabeled data distribution is unknown, it is hard to choose the method that best matches the given situation.
This paper studies the previous work and presents a unified framework that takes advantage of the existing methods by estimating the negative data distribution. Our contributions are as follows. 1. An approach to estimate the percentage of negative data in the unlabeled set is proposed. 2. In order to discover the hidden information in the unlabeled examples, an automatic semi-supervised method based on KL divergence is adopted. This is a general method that applies not just to text but also to other data types, such as categorical data. 3. With the support of the unlabeled distribution information, a framework that integrates existing approaches into a hybrid model is proposed for the first time. 4. A series of experiments are conducted on the 20 Newsgroups and Reuters corpora, using F-score as the measure. Experimental results show that our method consistently reaches high performance.
Table 1 The comparison of some existing methods

Method        Data type   Requirement
s-em [11]     any type    Needs quite a few unlabeled negative data
roc-svm [11]  any type    Needs a great deal of unlabeled negative data
LGN [9]       only text   Needs a great deal of unlabeled negative data
LiKL [16]     any type    No constraint on the size of unlabeled negative data
The rest of this paper is organized as follows: Section 2 presents related work; Section 3 introduces the proposed Auto CiKL approach and describes our way of estimating the proportion of negative data in the unlabeled set, which is also used to integrate existing algorithms into a more stable model; the experimental results are given in Section 4, followed by the conclusion in Section 5.
2 Related work Learning a classifier from only positive and unlabeled training examples has been an active field in the classification area [5, 10, 15, 20]. One simple approach is to completely ignore the unlabeled examples, i.e., to learn only from the labeled positive examples, e.g. one-class SVM [13], which aims to approximately cover all labeled positive examples. This approach is not always appropriate: if some reliable negative instances could be identified in the unlabeled data, the knowledge provided by the unlabeled set is discarded, and the method is prone to over-fitting. The unlabeled data usually contains useful knowledge for training, and adding it to the training set properly can produce a better classifier. These approaches fall into two categories. The more common method is (i) to use heuristics to identify unlabeled examples that are likely to be negative, and then (ii) to apply a standard learning method to these examples and the positive examples; steps (i) and (ii) may be iterated. For example, PEBL [17] first utilizes 1-DNF [4] to find quite likely negative examples and then employs SVM [2] iteratively for classification. S-EM [11] uses the spy technique to extract likely negative examples in the first step, and subsequently the EM algorithm is applied for parameter estimation. Self-training and co-training also belong to this category. In self-training, a classifier is first trained with the small amount of labeled data, then re-trained on its own most confident predictions, and the procedure repeats. Rosenberg et al. [14] apply self-training to object detection from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In co-training, each classifier classifies the unlabeled data and teaches the other classifier with the few unlabeled examples (and their predicted labels) about which it is most confident.
Each classifier is retrained with the additional training examples given by the other classifier, and the process repeats. Another common method is to modify the training model directly. One approach is to assign weights to the unlabeled examples, and then train the classifier with the unlabeled examples as weighted negative examples; this approach is used in [7, 12]. On the other hand, B-Pr [18] and W-SVM [6] take a probabilistic approach. Without extracting possible negative examples, they transform the original problem p(y|x), y ∈ {+, −}, into a simpler sample model p(t|x), t ∈ {P, U}, where P is the training set and U is the unlabeled data set. Semi-supervised learning is a method of using labeled and unlabeled data to learn a classifier [19, 21]. Zhu [23] gives a complete analysis of the existing semi-supervised learning approaches in his survey. The critical issue in semi-supervised learning is likewise how to integrate the information in the unlabeled data into the training step. The difference between semi-supervised learning and PU learning is whether all the categories in the unlabeled data appear in the labeled data: if the labeled data contains all categories, it is semi-supervised learning; otherwise, it is PU learning.
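The generic self-training loop described above can be sketched as follows. The nearest-centroid model, the toy 2-D data, and the distance-based confidence are all illustrative stand-ins; the systems cited here use SVMs, EM, or detector-specific models instead.

```python
# Minimal self-training sketch: repeatedly label the most confident
# unlabeled points and retrain (toy nearest-centroid "classifier").
from math import dist

def centroid_fit(labeled):  # labeled: list of (vector, label) pairs
    sums, counts = {}, {}
    for x, y in labeled:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def centroid_predict(model, x):
    # Return (label, confidence); confidence = negative distance to the centroid.
    label = min(model, key=lambda y: dist(model[y], x))
    return label, -dist(model[label], x)

def self_train(labeled, unlabeled, rounds=3, per_round=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        model = centroid_fit(labeled)
        preds = [(centroid_predict(model, x), x) for x in unlabeled]
        preds.sort(key=lambda p: p[0][1], reverse=True)  # most confident first
        for (y, _), x in preds[:per_round]:
            labeled.append((x, y))       # teach the model its own best guesses
            unlabeled.remove(x)
    return centroid_fit(labeled)
```

Co-training follows the same pattern, except that two classifiers trained on different feature views exchange their most confident labels instead of each classifier consuming its own.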
Semi-supervised learning uses the information in unlabeled data only to enhance the classifier, while PU learning uses the unlabeled data to obtain information about negative instances, which supplements the available knowledge. In this paper, we combine these two approaches to exploit the information contained in unlabeled data further. It has been noticed that constraining the class proportions on unlabeled data is important for semi-supervised learning: without any constraint on class proportions, semi-supervised learning algorithms tend to produce unbalanced output. Zhu et al. [22] use a heuristic class mean normalization procedure to move towards the desired class proportions; S3VM [1] methods explicitly fit the desired class proportions. However, in these methods the class proportion constraint is combined with other model assumptions. To the best of our knowledge, this paper is the first to introduce an unlabeled data distribution estimation method for the PU problem.
3 The proposed algorithm 3.1 Preliminaries The KL divergence [3] between the probability distributions P = {p_1, ..., p_n} and Q = {q_1, ..., q_n} is defined as:

KL(P \| Q) = \sum_{i=1}^{n} p_i \lg \frac{p_i}{q_i}
In this paper, the KL divergence is calculated, as in [16], between the posterior probabilities of every instance d_i in the unlabeled set and the prior probabilities of each class, and the revised formula of the KL divergence is defined as:

KL(d_i) = \sum_{j=1}^{|C|} p(c_j \mid d_i) \lg \frac{p(c_j \mid d_i)}{p(c_j)}
where p(c_j | d_i) is the class posterior probability of instance d_i belonging to the j-th class c_j; p(c_j) is the prior probability of each class; |C| is the number of known or predefined class labels appearing in the training set. The intuition of this revision is that, if an instance belongs to a labeled category, the KL divergence from the posterior probability distribution to the prior probability distribution is large; if an instance is negative, it belongs to each labeled category with roughly equal probability, which leads to a small KL divergence. For the unlabeled set (labeled as U), the KL divergence of every instance is calculated, and the instances

d_i = \arg\max_{d_i \in \{d_t \mid c_k = \arg\max_{c \in C} p(c \mid d_t)\}} KL(d_i) \quad \text{and} \quad d_j = \arg\min_{d_j \in U} KL(d_j)

are obtained as the most likely sub-positive (for class c_k) and negative examples, respectively.
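The revised KL score and the arg-max/arg-min selection above can be sketched as follows; the posterior distributions and the uniform prior are made-up toy values, not outputs of a real classifier.

```python
# Revised KL score from Section 3.1:
#   KL(d_i) = sum_j p(c_j|d_i) * lg( p(c_j|d_i) / p(c_j) )
from math import log2

def kl_score(posterior, prior):
    return sum(p * log2(p / q) for p, q in zip(posterior, prior) if p > 0)

prior = [0.5, 0.5]                 # p(c_j): toy uniform class prior
posteriors = {
    "d1": [0.95, 0.05],            # confidently one class -> large KL -> likely positive
    "d2": [0.50, 0.50],            # matches the prior -> KL = 0 -> likely negative
}
scores = {d: kl_score(p, prior) for d, p in posteriors.items()}
likely_pos = max(scores, key=scores.get)   # arg-max KL: sub-positive candidate
likely_neg = min(scores, key=scores.get)   # arg-min KL: negative candidate
```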
3.2 Problem description Given a training set (P) containing only positive examples, from multiple (at least 2) classes, without negative ones, and an unlabeled set (U) containing both positive and negative examples, our task is to construct a classifier (C) that finds all likely negative examples (U_n) hidden in the unlabeled set; it is formulated as follows: input(P, U) ⇒_C output(U_n).

3.3 Auto CiKL approach This work is based on our former work LiKL [16], which contains three steps. The algorithm is described as follows.

• In the first step, a probabilistic classifier C_P is trained from the positive set P. The posterior probabilities returned by C_P are then used to calculate the KL divergence of every instance in the unlabeled set, and the top k reliable positive instances are picked up as the set S_p.
• In the second step, another classifier C_N is trained from the combined positive set P + S_p, and reliable negative instances are obtained by using C_N.
• In the last step, with the help of the instances identified from the unlabeled set, the final classifier is trained to identify the instances that do not belong to any class in the training set.
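The three steps above can be sketched structurally as follows. The `train` and `kl_score` callables are hypothetical stand-ins supplied by the caller (the paper uses probabilistic classifiers and the revised KL divergence), and reliable negatives are taken as the smallest-KL instances, per the arg-min in Section 3.1.

```python
# Structural sketch of the three-step LiKL / Auto CiKL pipeline.
def auto_cikl(P, U, train, kl_score, k, n):
    # Step 1: train C_P on P; the top-k KL instances become reliable positives S_p.
    C_P = train(P)
    ranked = sorted(U, key=lambda d: kl_score(C_P, d), reverse=True)
    S_p, rest = ranked[:k], ranked[k:]
    # Step 2: train C_N on P + S_p; the n smallest-KL instances of the
    # remaining unlabeled set become reliable negatives S_n.
    C_N = train(P + S_p)
    S_n = sorted(rest, key=lambda d: kl_score(C_N, d))[:n]
    # Step 3: train the final classifier on P + S_p (positive) and S_n (negative).
    final = train(P + S_p + S_n)
    return S_p, S_n, final
```

With trivial stand-ins (documents as numbers, the score being the number itself), `auto_cikl([100], [9, 8, 1, 2], ..., k=2, n=2)` returns `[9, 8]` as positives and `[1, 2]` as negatives.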
Figure 1 Finding likely positive and negative examples
Figure 2 Finding likely negative examples
In this paper, an improvement of LiKL called Auto CiKL is introduced. The difference between Auto CiKL and LiKL lies in the first two steps. The improved algorithms are shown in Figures 1, 2 and 3.
Figure 3 CiKL classification algorithm
Figure 1 shows how to find the most likely positive examples. Initially, the reliable positive set is empty. Lines 2–4 initialize each category set B_c. In lines 5–9, the KL divergence of each document is calculated: the labeled set P is used to learn a classifier that provides posterior probabilities, from which the KL divergence of each instance in the unlabeled set is computed, and each document is assigned to the category with the largest posterior probability. Line 10 calls the proportion estimation function, which will be described later. In line 11, the function f(λ) simply returns 1 or 2: when λ is larger than 0.5 it returns 1, otherwise 2. After the proportion is obtained, lines 12–17 pick up the top K documents with the largest KL values and add them to the corresponding labeled categories. All the reliable positive instances are then collected in S_p. If the percentage of positive instances in the unlabeled data is unknown, it is hard to pick a proper K; this is the shortcoming of LiKL. In Auto CiKL, a function EstimateProportion(P, U) is used to estimate the proportion of negative instances, from which a proper number of positive instances can be derived. The function EstimateProportion(P, U) will be given in the next subsection. In Figure 2, plenty of labeled data are now available. In lines 4–15, the new training set (P + S_p) is used to build a new classifier F_N to classify the remaining unlabeled set (U − S_p). Based on the new posterior probabilities, the KL values of the remaining unlabeled instances are calculated, and the top N instances with the largest KL values are taken as the most likely negative instances, denoted as S_n. In Figure 3, a training set (P + S_p + S_n) is obtained which contains both positive and negative instances. It is used to train the final classifier model and to classify the instances in the set (U − S_p − S_n).
Many classification algorithms, such as SVM and logistic regression, can be used in the last step.

3.4 Distribution estimation for unlabeled data Steps 1 and 2 of Auto CiKL suffer from the problem of setting the values of K and N, respectively. It is difficult to set these two parameters without any extra information. The reasons to set the values of K and N adaptively are as follows. If a small value is given to K when the percentage of positive instances in the unlabeled set is large, too many positive instances will be left in the unlabeled set, which makes a classifier that uses the remaining unlabeled set as the negative instance set ineffective. On the contrary, if a large value is given to K when the percentage of positive instances in the unlabeled set is small, too many negative instances will be labeled as positive, polluting the labeled training set. A similar analysis holds for the value of N. It is necessary to give different values for different proportions of positive or negative instances in the unlabeled set. Unfortunately, the proportion in the unlabeled set is unknown, which means it has to be estimated. In this article, we focus on estimating the negative instance proportion; the positive instance proportion can easily be calculated from it. The intuition of our method is presented in Figure 4. The P set stands for the labeled set, while the U set represents the unlabeled set. Label the instances in P as positive and the instances in U as negative, then use all of them to learn a classifier. (a) and (b) show the situations where the U set contains a low and a high percentage of negative instances,
Figure 4 Classifiers for labeled set and unlabeled set of different proportions: (a) low density; (b) high density
respectively. It is obvious that the classifier in (b) can get a higher recall for the positive instances than that in (a). The reason is that the more real negative instances the unlabeled set contains, the more easily a classifier separates P and U properly. Based on this intuition, the detailed algorithms are shown in Figures 5 and 6. Figure 5 describes how to generate the training set for proportion estimation. Num instances are generated for different proportions; the more instances we generate, the higher the accuracy of the estimation. T represents the training set. PT represents the positive instances in the training set, for which the P set can be used directly. UT represents the unlabeled instances in the training set. The positive instances of UT are generated by the generate_instances algorithm, whose basic idea is to find the distribution of each attribute for each category in the training set P and to generate positive instances based on these attribute distributions. Each attribute of each positive instance in UT is generated one by one from the distribution found in P. The negative instances of UT can be obtained from an irrelevant corpus, from which documents can be randomly selected. In Figure 6, after the training set is produced, a model for estimating the proportion is learned, which takes the recall of P as its input.
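The estimation idea above can be sketched end-to-end as follows. Everything here is synthetic and simplified: the "classifier" is a 1-D midpoint-threshold rule, positives and negatives are drawn from made-up Gaussians (the paper samples positives per-attribute from P and negatives from an irrelevant corpus), and the learned model is reduced to a nearest-neighbour lookup on recall.

```python
# Sketch of Section 3.4: simulate unlabeled sets with known negative
# proportions, record the recall on P obtained when U is treated as
# negative, then invert that recall -> proportion mapping for new data.
import random

def recall_on_P(P, U):
    # Stand-in for "train a P-vs-U classifier and measure recall on P":
    # a midpoint-threshold classifier on 1-D features (toy assumption).
    thr = (sum(P) / len(P) + sum(U) / len(U)) / 2
    return sum(1 for x in P if x > thr) / len(P)

def build_estimator(P, make_pos, make_neg, props, num=50):
    pairs = []  # (observed recall, known negative proportion) training pairs
    for lam in props:
        n_neg = int(num * lam)
        U = [make_neg() for _ in range(n_neg)] + [make_pos() for _ in range(num - n_neg)]
        pairs.append((recall_on_P(P, U), lam))
    # "Model" = nearest-neighbour lookup on recall (the paper learns a model).
    return lambda recall: min(pairs, key=lambda pr: abs(pr[0] - recall))[1]

random.seed(0)
P = [random.gauss(5, 1) for _ in range(50)]
make_pos = lambda: random.gauss(5, 1)   # synthetic positives
make_neg = lambda: random.gauss(0, 1)   # synthetic negatives
estimate = build_estimator(P, make_pos, make_neg, props=[0.1, 0.3, 0.5, 0.7, 0.9])
U_new = [make_neg() for _ in range(35)] + [make_pos() for _ in range(15)]  # true proportion 0.7
lam_hat = estimate(recall_on_P(P, U_new))
```

The sketch preserves the paper's monotonicity argument: the more negatives U contains, the better the P-vs-U classifier separates the sets and the higher the recall on P, so recall is informative about the negative proportion.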
Figure 5 Generating the training set for proportion estimation
Figure 6 Estimating the negative instance proportion of the unlabeled set
The estimated negative instance proportion of the unlabeled data is output by this model. Once the estimated proportion of negative instances is available, proper values can be chosen for K and N, respectively.

3.5 Integrating different methods in the new framework From the experimental results shown in Figure 12, we can conclude that no single method achieves the best performance in all cases: different methods have different best-fit situations. If knowledge about the proportion can be accessed, choosing a proper approach for each situation is possible; in other words, these different methods can be integrated into a mixture method. This paper uses the proportion knowledge to integrate the KL method and the ROC-EM method into a better hybrid model, which consistently performs better than using only one of them. Based on the estimate of the unlabeled data distribution, a new framework for semi-supervised learning is proposed (Figure 7). With the knowledge of the distribution, the framework integrates methods that perform differently under different unlabeled data distributions, automatically choosing the proper approach according to the distribution knowledge. The experimental results demonstrate that the hybrid method achieves globally high performance.
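The dispatch at the heart of the framework can be sketched in a few lines. The 0.2 threshold and the two method slots are illustrative assumptions, chosen to mirror the experiments where the KL method wins at low negative proportions (0.1–0.2) and ROC-EM wins at higher ones.

```python
# Sketch of the hybrid framework: estimate the negative proportion of U,
# then dispatch to whichever PU method fits that regime. The callables
# are placeholders for the real estimator and the two PU learners.
def hybrid_classifier(P, U, estimate_proportion, kl_auto, roc_em, threshold=0.2):
    lam = estimate_proportion(P, U)      # estimated fraction of negatives in U
    method = kl_auto if lam <= threshold else roc_em
    return method(P, U)
```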
Figure 7 New framework for semi-supervised learning
4 Experimental evaluation The objective of this section is to evaluate our proposed approach in terms of learning accuracy. The experiments are conducted on an Intel 2.0 GHz PC with 2 GB of RAM. The classifiers used in our approach are implemented in the Weka1 environment, and the existing PU learning methods such as roc-em are downloaded from http://www.cs.uic.edu/∼liub/LPU/LPU-download.html.

4.1 Data sets We used the benchmark 20Newsgroups collection [24] and Reuters corpus [25] that are often used by existing methods. 20Newsgroups has approximately 20,000 documents, divided into 20 subgroups, each of which corresponds to a different topic. First, four subgroups are chosen from the computer topic and two from science, i.e. {comp.graphics, comp.ibm.hardware, comp.mac.hardware, comp.windows.x} × {sci.crypt, sci.space}, giving C(4,1) × C(2,1) = 8 pairs of different experiments. For each pair of classes, i.e. selecting one class from {graphics, ibm.hardware, mac.hardware, windows.x} × {crypt, space} as the two positive classes, e.g. graphics × crypt, an equal part of the documents is chosen randomly for training as the corresponding positive instances, and the rest serve as unlabeled positive data in the unlabeled set. Then some examples extracted randomly from the remaining 18 subgroups are treated as unlabeled negative examples in the unlabeled set; their number is α × |U|, where α is a proportion parameter giving the percentage of negative examples in the unlabeled set, and |U| is the number of all instances in the unlabeled set. A similar operation is applied to Reuters, where the topic combinations {acq-crude, acq-earn, crude-interest, crude-earn, earn-interest, interest-acq} are chosen as the positive topics.

4.1.1 Feature extraction All the documents in our experiments are modeled by the Vector Space Model. Feature selection is very important in textual classification.
We use TF-IDF for dimensionality reduction, a light but effective technique borrowed from information retrieval. TF-IDF is a measure used to reflect the importance of words in documents; the words with a high TF-IDF value for each class are retained. After removing stop words, we calculate the TF-IDF of the words under each topic and keep the top 150 words as the representatives of each topic. If several topics are used as positive classes, the feature space is the union of all the topics' representatives. The intuition of this method is that it is common to see certain words in certain topics.

4.1.2 Negative instances used for estimation In [8], Li proposes an approach that introduces an external corpus as absolutely negative instances to learn from positive and unlabeled data. Our method also calls for extra negative instances to generate the training data for proportion estimation. However, the data we need is never used to learn the model for prediction; it is
1 http://www.cs.waikato.ac.nz/ml/weka
just used to simulate different proportion situations. In the Reuters corpus, there are many articles whose TOPIC attribute is labeled NO, which indicates that in the original data the story has no entries in the TOPICS field. These articles are never used in generating negative instances for 20Newsgroups, while some of them may be used in generating negative instances for the Reuters corpus. We want to know whether the negative instances used for estimation must correlate with the actual negative instances in the unlabeled set. The following experimental results show that this is unnecessary: the negative instances used for estimation need not be similar to the original negative instances, and both 20Newsgroups and the Reuters corpus achieve good estimation results.

4.2 Experimental results We perform hold-out tests ten times with random data partitions and take the average F-score as the final result. In the experiments, α is the ratio of unlabeled negative examples to the whole unlabeled set; e.g. α = 0.1 means that the number of unlabeled negative examples is only 10 % of the unlabeled set. Performance of the auto-KL method Figure 8 compares the performance of using the proportion estimate with that of being given the real proportion; KL-Auto represents the former and KL-ByHand the latter. KL-Random means giving a random proportion. KL-Fixed is our previous work [16], in which a fixed number of instances is abstracted from the unlabeled data as reliable positive and negative instances; to avoid over-abstracting, we abstract only 50 instances each for positives and negatives. Both KL-Auto and KL-ByHand show F scores that increase as the proportion of negative instances increases. It is obvious that the performance of KL-Auto is worse than that of KL-ByHand; the reason is that our estimation method can make mistakes.
Bad proportion estimates lead to bad K and N, which causes the performance to decrease. But the result of KL-Auto is much better than KL-Random, in which N and K are chosen randomly. This means our proportion estimation approach is effective: the estimate it gives is close to the real proportion. Besides, as the proportion of negative instances increases, the performance of all the methods goes up. We believe the reason is that the more negative instances there are, the more easily reliable negative instances can be picked up. KL-Fixed
(a) 20 Newsgroups
(b) Reuters
Figure 8 Comparison between the estimated proportion and the real proportion
(a) 20 Newsgroups
(b) Reuters
Figure 9 The F scores of proposed method without reliable positive instances
is almost always the worst method. However, when the negative instance proportion is very low (0.1), KL-Fixed is a little better than KL-Auto; this is caused by the estimation error in KL-Auto. Effect of abstracting positive data Existing algorithms abstract only negative data from the unlabeled data to obtain the missing information about negative data, which completes the training set. Although positive instances are already present in the labeled data, the unlabeled data may contain more knowledge about positive instances than the labeled data, which can further enhance the training set. However, abstracting positive data is a double-edged sword: if many negative data are categorized as positive, the abstracted positive knowledge introduces pollution. Figure 9 compares our approach with the traditional approach based purely on posterior probability for abstracting positive instances in the first step of our framework. Since the labeled data always contains more than one class, a classifier can be learned directly; using this classifier, the posterior probability of each instance in the unlabeled data is obtained, all posterior probabilities are ranked, and the top k largest are selected as reliable positive instances. The real proportions of the unlabeled data are given to both approaches. In general, our approach achieves better results most of the time; the traditional approach may introduce more pollution.
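The two first-step rankings just discussed can differ, which a toy case makes concrete. The posteriors and the skewed prior below are made-up values: the baseline ranks unlabeled instances by their largest posterior probability, while our approach ranks them by the revised KL score, which also rewards divergence from the prior.

```python
# Toy contrast: max-posterior ranking vs KL-score ranking for picking
# reliable positive instances (illustrative numbers only).
from math import log2

def kl_score(posterior, prior):
    return sum(p * log2(p / q) for p, q in zip(posterior, prior) if p > 0)

def top_k(unlabeled, score, k=1):
    # unlabeled: {doc_id: posterior distribution over the labeled classes}
    return sorted(unlabeled, key=lambda d: score(unlabeled[d]), reverse=True)[:k]

prior = [0.8, 0.2]                  # skewed class prior
unlabeled = {
    "dA": [0.9, 0.1],               # high max-posterior, but close to the prior
    "dB": [0.4, 0.6],               # lower max-posterior, far from the prior
}
by_posterior = top_k(unlabeled, max)                      # baseline picks dA
by_kl = top_k(unlabeled, lambda p: kl_score(p, prior))    # KL picks dB
```

Under a skewed prior, a posterior that merely echoes the prior earns a high maximum probability but a near-zero KL score, so the two rankings diverge.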
(a) 20 Newsgroups
(b) Reuters
Figure 10 The F scores of proposed method without reliable positive instances
(a) 20 Newsgroups
(b) Reuters
Figure 11 The F scores of proposed method with different abstracting order
Figure 10 shows the comparison of the proposed KL method with a variant that skips the step of abstracting reliable positive instances from the unlabeled data; the real proportion of the unlabeled data is given to both. The result illustrates that abstracting positive instances from the unlabeled data generally enhances the proposed method. It is therefore worth using an effective way to abstract positive data from the unlabeled data, since it can enhance the final classifier. The order of abstracting instances counts In our approach, we abstract not only negative instances but also positive instances, in order to make maximal use of the knowledge contained in the unlabeled data. Reliable positive instances are abstracted first, then reliable negative instances. It is natural to think that, at a low proportion of negative instances, negative instances are difficult to find while positive instances can be found with high probability, so we should abstract positive instances first; a similar analysis applies to the opposite situation. However, the experimental results illustrated in Figure 11 do not agree with this analysis. PosNeg stands for picking up positive instances first and negative instances second, and vice versa. In fact, if we pick up reliable negative instances first, we can use the reliable negative instances and the labeled positive instances to learn a classifier that finds positive instances in the unlabeled set more precisely, which has
Table 2 F score of different parameters for negative instances proportion at 0.1

20Newsgroups:
Neg \ Pos    0.1     0.2     0.4     0.6
0.05         0.468   0.482   0.494   0.471
0.075        0.456   0.478   0.511   0.523
0.1          0.424   0.455   0.510   0.540
0.2          0.274   0.291   0.392   0.504
0.4          0.236   0.240   0.248   0.295

Reuters:
Neg \ Pos    0.1     0.2     0.4     0.6
0.05         0.325   0.311   0.273   0.238
0.075        0.336   0.329   0.297   0.268
0.1          0.338   0.335   0.311   0.283
0.2          0.322   0.323   0.322   0.314
0.4          0.239   0.247   0.278   0.321
Table 3 F score of different parameters for negative instances proportion at 0.2

20Newsgroups:
Neg \ Pos    0.1     0.2     0.4     0.6
0.05         0.647   0.646   0.634   0.582
0.075        0.676   0.674   0.683   0.664
0.1          0.680   0.684   0.705   0.689
0.2          0.567   0.622   0.703   0.727
0.4          0.396   0.415   0.456   0.648

Reuters:
Neg \ Pos    0.1     0.2     0.4     0.6
0.05         0.449   0.426   0.353   0.302
0.075        0.491   0.468   0.406   0.354
0.1          0.514   0.491   0.438   0.392
0.2          0.531   0.526   0.500   0.460
0.4          0.427   0.445   0.496   0.517
been verified by experiments. The two data sets behave oppositely: in 20Newsgroups, positive instances should be picked up first at a low proportion and negative instances first at a high proportion, while the Reuters corpus shows the opposite pattern. The reason for this is beyond the scope of this paper. The only common observation is that the best pick-up order is related to the proportion, and the low-proportion situation calls for the opposite order compared with the high-proportion one. In the following experiments, we simply abstract positive data first all the time. Performance of parameters for different proportions With the proportion known, the values of the top K and N for the first and second steps, respectively, can be set more properly. Tables 2, 3, 4, 5 and 6 show different proportions given for abstracting positive and negative instances at different negative instance proportions. For example, Table 2, which records 20Newsgroups and the Reuters corpus separately, illustrates the comparison at the 0.1 proportion. In each table, 0.1, 0.2, 0.4 and 0.6 are given as the positive instance proportion, and a range of values (e.g. 0.05, 0.075, 0.1, 0.2 and 0.4 in Table 2) as the negative instance proportion, with the F-score recorded in the cells. We do not show further records, such as 0.6 and 0.8 as the negative instance proportion, because the real negative instance proportion in the unlabeled data is 0.1, and the more negative instances we abstract, the worse the result we obtain.
Table 4 F score of different parameters for negative instances proportion at 0.4

20Newsgroups:
Neg \ Pos    0.1     0.2     0.4     0.6
0.1          0.816   0.765   0.752   0.719
0.2          0.845   0.817   0.815   0.799
0.4          0.685   0.701   0.825   0.842
0.5          0.637   0.619   0.724   0.846
0.6          0.631   0.615   0.660   0.839

Reuters:
Neg \ Pos    0.1     0.2     0.4     0.6
0.1          0.619   0.587   0.526   0.448
0.2          0.699   0.680   0.629   0.562
0.4          0.688   0.703   0.709   0.663
0.5          0.648   0.660   0.696   0.687
0.6          0.620   0.627   0.674   0.700
Table 5 F score of different parameters for negative instances proportion at 0.6

20Newsgroups:
Neg \ Pos    0.1     0.2     0.4     0.6
0.1          0.856   0.802   0.725   0.677
0.2          0.889   0.858   0.808   0.774
0.4          0.906   0.895   0.872   0.839
0.6          0.813   0.797   0.880   0.871
0.8          0.842   0.839   0.886   0.886

Reuters:
Neg \ Pos    0.1     0.2     0.4     0.6
0.1          0.652   0.612   0.527   0.404
0.2          0.757   0.727   0.657   0.554
0.4          0.822   0.813   0.777   0.697
0.6          0.782   0.791   0.816   0.761
0.8          0.758   0.789   0.849   0.806
The best F-score usually lies in the row whose negative proportion equals the real proportion, so the proportion given to the negative instances is simply the estimated proportion λ. In many situations, when the negative proportion is fixed, the F-score does not increase with the positive proportion. We believe the reasons are as follows. Abstracting positive instances always carries risk. At a low proportion, the more instances abstracted as positive, the more negative instances will be treated as positive; even though the number of real negative instances treated as positive is small, the total number of negative instances is also small, so a large fraction of the negatives may still be treated as positive, and the F-score decreases. At a high proportion, however, if the proportion given to positive instances exceeds the real proportion, many negative instances will be introduced into the positive instances, polluting the positive knowledge. Based on these analyses, we can simply use the estimated positive proportion in the high-proportion case, but use a proportion smaller than the estimate otherwise; setting half the estimated proportion in this situation already achieves good performance. Performance of different algorithms Figure 12 records the F scores of Spy-SVM, ROC-EM and KL-Auto. ROC-EM uses the Rocchio technique in the first step to identify reliable negative instances and EM in the second step to build the final classifier. Spy-SVM uses the spy technique in the first step and SVM in the second step.
Table 6  F-score of different parameters for negative instances proportion at 0.8
(Rows: positive-proportion parameter; columns: negative-proportion parameter. The two column blocks correspond to the two datasets.)

           20 Newsgroups                          Reuters
Pos\Neg    0.1    0.2    0.4    0.6    0.8       0.1    0.2    0.4    0.6    0.8
0.1      0.862  0.893  0.923  0.925  0.940     0.657  0.774  0.866  0.886  0.898
0.2      0.807  0.867  0.911  0.934  0.956     0.615  0.739  0.848  0.892  0.923
0.4      0.686  0.776  0.851  0.893  0.929     0.479  0.622  0.772  0.837  0.887
0.6      0.598  0.709  0.797  0.846  0.879     0.386  0.529  0.687  0.771  0.832
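The parameter-choice rule distilled from Tables 5 and 6 can be written down directly. The sketch below is our reading of the heuristic described in the text, not the authors' code; the cutoff of 0.4 separating the low- from the high-proportion case is a hypothetical value chosen for illustration.

```python
def choose_extraction_proportions(neg_estimate: float,
                                  pos_estimate: float,
                                  low_cutoff: float = 0.4):
    """Heuristic from the analysis of Tables 5-6: the negative-proportion
    parameter is simply the estimated proportion; the positive-proportion
    parameter is halved when negatives are scarce, because aggressively
    extracting positives then pollutes the positive set."""
    neg_param = neg_estimate
    if neg_estimate >= low_cutoff:          # high-proportion case
        pos_param = pos_estimate
    else:                                   # low-proportion case
        pos_param = pos_estimate / 2
    return neg_param, pos_param
```

With a high estimated negative proportion the full positive estimate is used; with a low one, half of it already achieves good performance per the discussion above.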
Figure 12  The F-scores of different algorithms under different negative-instance proportions ((a) 20 Newsgroups; (b) Reuters)
As shown in Figure 12, no single approach achieves the best performance all the time. KL-Auto dramatically outperforms the other two methods at proportions 0.1 and 0.2, and performs quite well in the other cases, though not as well as the methods proposed by Liu. On the contrary, ROC-EM obtains a high F-score in the high-proportion cases but performs badly at 0.1 and 0.2. The reason is that Liu's approach never extracts positive instances from the unlabeled data, while our approach takes on risk by extracting reliable positive instances. Each approach has a situation it fits best. In other words, if the proportion in the unlabeled data set is unknown and only one of these methods is chosen, the result may fall into that method's bad case.

Performance for integration of different approaches

Since the proportion of negative instances in the unlabeled data can be estimated, more than one approach can be used for a given task: methods that compensate for each other's weaknesses can be combined into a hybrid approach. Figure 13 shows the result of integrating ROC-EM and KL-Auto. The hybrid approach achieves nearly the best performance in all cases.
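The integration itself reduces to a dispatch on the estimated proportion. A minimal sketch, assuming `kl_auto` and `roc_em` are callables wrapping the two trained pipelines; the crossover value of 0.4 is a hypothetical choice (Figure 12 puts KL-Auto ahead at 0.1 and 0.2 and ROC-EM ahead at high proportions, but the exact break-even point is not stated):

```python
def hybrid_classifier(estimated_neg_proportion, kl_auto, roc_em,
                      crossover=0.4):
    """Route the PU-learning task to whichever method's best-fit region
    covers the estimated negative proportion of the unlabeled set."""
    if estimated_neg_proportion <= crossover:
        return kl_auto    # strongest when unlabeled data is mostly positive
    return roc_em         # strongest when negatives dominate the unlabeled set
```

Because each constituent method is used only inside its best-fit region, the hybrid tracks the upper envelope of the two F-score curves in Figure 13.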
Figure 13  Performance of the hybrid method ((a) 20 Newsgroups; (b) Reuters)
From the above experiments, we draw the following conclusions. (1) Using the proportion estimate, the parameters of the KL method can be chosen automatically and effectively; the KL method outperforms existing methods in the low-proportion case. (2) Knowledge of the positive instances in the unlabeled data should be used properly. (3) The order of extracting instances correlates with the proportion in the unlabeled data. (4) No single algorithm always achieves good performance; each approach has a proportion range it fits best. (5) The new framework for semi-supervised learning can always generate a high-performance hybrid method.
5 Conclusion

In this paper, the problem of learning from positive and unlabeled examples is tackled with a novel approach called Auto CiKL, and we propose a framework that integrates different algorithms to obtain a hybrid method with globally high performance. The CiKL method achieves good performance when the proportion of negative instances in the unlabeled data is low, and the hybrid method usually performs better still. In future work, we will study how to improve the precision of our estimation method and give a theoretical analysis of our approach.

Acknowledgements  This work was supported by the National Major Projects on Science and Technology under grant number 2010ZX01042-002-003-004, NSFC grants (No. 61033007, 60903014 and 61170085), 973 project (No. 2010CB328106), and the Program for New Century Excellent Talents in China (No. NCET-10-0388).
References

1. Bennett, K., Demiriz, A.: Semi-supervised support vector machines. In: NIPS 11, pp. 368–374 (1999)
2. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
3. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, Hoboken (1991)
4. Denis, F.: PAC learning from positive statistical queries. In: ALT 1998, LNCS (LNAI), vol. 1501, pp. 112–126. Springer, Heidelberg (1998)
5. Denis, F., Gilleron, R., Tommasi, M.: Text classification from positive and unlabeled examples. In: IPMU (2002)
6. Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: KDD (2008)
7. Lee, W.S., Liu, B.: Learning with positive and unlabeled examples using weighted logistic regression. In: Proceedings of the 20th International Conference on Machine Learning (2003)
8. Li, X.L., Liu, B.: Learning from positive and unlabeled examples with different data distributions. In: ECML (2005)
9. Li, X.L., Liu, B., Ng, S.K.: Learning to identify unexpected instances in the test set. In: AAAI (2007)
10. Li, X.L., Liu, B., Ng, S.K.: Negative training data can be harmful to text classification. In: EMNLP (2010)
11. Liu, B., Dai, Y., Li, X.L., Lee, W.S., Yu, P.S.: Building text classifiers using positive and unlabeled examples. In: ICDM (2003)
12. Liu, Z., Shi, W., Li, D., Qin, Q.: Partially supervised classification based on weighted unlabeled samples support vector machine. In: Proceedings of the 1st International Conference on Advanced Data Mining and Applications (2005)
13. Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. J. Mach. Learn. Res. 2, 139–154 (2002)
14. Rosenberg, C., Hebert, M., Schneiderman, H.: Semi-supervised self-training of object detection models. In: 7th IEEE Workshop on Applications of Computer Vision (2005)
15. Wang, X.L., Xu, Z., Sha, C.F., Ester, M., Zhou, A.Y.: Semi-supervised learning from only positive and unlabeled data using entropy. In: WAIM (2010)
16. Xu, Z., Sha, C.F., Wang, X.L., Zhou, A.Y.: Semi-supervised classification based on KL divergence. J. Comput. Res. Dev. 1, 81–87 (2010)
17. Yu, H., Han, J., Chang, K.C.C.: PEBL: positive example based learning for web page classification using SVM. In: KDD (2002)
18. Zhang, D., Lee, W.S.: A simple probabilistic approach to learning from positive and unlabeled examples. In: UKCI (2005)
19. Zhang, X.H., Lee, W.S.: Hyperparameter learning for graph based semi-supervised learning algorithms. In: NIPS (2006)
20. Zhou, D.Y., Huang, J.Y., Schölkopf, B.: Learning from labeled and unlabeled data on a directed graph. In: ICML (2005)
21. Zhou, Z.H., Li, M.: Semi-supervised regression with co-training style algorithms. IEEE Trans. Knowl. Data Eng. (2007)
22. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning (2003)
23. Zhu, X.J.: Semi-supervised learning literature survey. Technical Report 1530, Dept. Comp. Sci., Univ. Wisconsin-Madison (2006)
24. 20 Newsgroups data set. http://people.csail.mit.edu/jrennie/20Newsgroups
25. Reuters corpus data set. http://www.daviddlewis.com/resources/testcollections/reuters21578/