
A Semi-Supervised Learning Algorithm Based on Modified Self-training SVM

Yun Jin
Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, School of Information Science and Engineering, Southeast University, Nanjing, China
School of Physics and Electronic Engineering, Xuzhou Normal University, Xuzhou, China
Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China
Email: [email protected]

Chengwei Huang, Li Zhao
Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, School of Information Science and Engineering, Southeast University, Nanjing, China
Email: [email protected], [email protected]

Abstract—In this paper, we first review semi-supervised learning and its commonly used methods, such as generative mixture models, self-training, co-training, and transductive SVMs. We then describe a self-training semi-supervised SVM algorithm and, building on it, present a modified algorithm. To demonstrate its validity and effectiveness, we carry out experiments showing that our method outperforms the former algorithm. Using our modified self-training semi-supervised SVM algorithm, we can save much of the time needed to label unlabeled data and obtain a classifier with better performance.

Index terms—semi-supervised learning, self-training, SVM, UCI

I. INTRODUCTION

In traditional classification applications, only labeled data (features) are used for training. However, collecting labeled instances is difficult, expensive, or time-consuming[1], whereas unlabeled data are abundant and relatively easy to obtain. If only a small amount of labeled data and a large amount of unlabeled data are available, semi-supervised learning can often provide a satisfactory classifier. Text classification, genetic research, and machine vision are examples of applications where cheap unlabeled data can be added to a pool of labeled data. The literature holds a rather optimistic view that "unclassified observations should not be discarded"[2], and perhaps the most representative conclusion of recent literature arises from McCallum and Nigam (1998), who demonstrate that "by augmenting this small set of labeled samples with a large set of unlabeled data and combining the two pools with EM, we can improve our parameter estimates." In recent years, semi-supervised learning has received considerable attention due to its potential for reducing the effort of labeling data. Commonly used methods include EM with generative mixture models[3], self-training[4], co-training[5], transductive support vector machines[6], and graph-based methods[7].


Unfortunately, many experiments show that unlabeled instances are quite often detrimental to classifier performance, and that the more unlabeled data are added to a fixed set of labeled samples, the poorer the resulting classifier becomes. Some argue that unlabeled data are useful, as many experiments have also shown, and that any "degradation" must therefore come from inappropriate analysis, while others demonstrate that if the modeling assumptions are clearly unsatisfied, unlabeled data can indeed be deleterious in exceptional situations[8]. From this literature, it is clear that, provided certain conditions are satisfied, unlabeled data can provide useful information that enhances the classification rate of the classifier, while reducing the need for expensive labeled data.

In Section 2, we first present relevant facts about semi-supervised learning and then introduce some important prevailing semi-supervised learning methods, such as generative mixture models, co-training, self-training, and transductive support vector machines. In Section 3, we present a former effective self-training semi-supervised SVM algorithm[9] and then describe our modified algorithm. In Section 4, we carry out two experiments, one to select the best value of the parameter C and one to prove the effectiveness of our method. We give our conclusion in Section 5.

II. RELATED WORK

In this section, we introduce some commonly used semi-supervised learning methods. Generative models are perhaps the oldest semi-supervised learning method. They assume a model $p(x, y) = p(y)\,p(x \mid y)$, where $p(x \mid y)$ is an identifiable mixture distribution, such as a Gaussian mixture model. Using a large amount of unlabeled data, the mixture components can be identified, so ideally only one labeled example per component is needed to fully determine the mixture distribution. There have been many successful applications in different fields.
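As an illustration of this idea, the sketch below fits one Gaussian component per class, initializes the parameters from the labeled examples, and refines them with EM so that unlabeled points contribute through their posterior responsibilities. It is a minimal sketch of the general technique, not the implementation used in any of the cited works; the function name and the full-covariance choice are our own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(X_l, y_l, X_u, n_iter=50):
    # Inputs are NumPy arrays: labeled features/labels and unlabeled features.
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    classes = np.unique(y_l)
    d = X_l.shape[1]
    # One Gaussian per class; parameters initialized from the labeled data only.
    priors = np.array([(y_l == c).mean() for c in classes])
    means = np.array([X_l[y_l == c].mean(axis=0) for c in classes])
    covs = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d) for c in classes])
    for _ in range(n_iter):
        # E-step: class responsibilities for the unlabeled points.
        dens = np.column_stack([
            priors[k] * multivariate_normal.pdf(X_u, means[k], covs[k])
            for k in range(len(classes))])
        resp_u = dens / dens.sum(axis=1, keepdims=True)
        # Labeled points keep fixed, one-hot responsibilities.
        resp_l = (y_l[:, None] == classes[None, :]).astype(float)
        X = np.vstack([X_l, X_u])
        resp = np.vstack([resp_l, resp_u])
        # M-step: update priors, means, and covariances from all points.
        Nk = resp.sum(axis=0)
        priors = Nk / Nk.sum()
        means = (resp.T @ X) / Nk[:, None]
        covs = np.array([
            (resp[:, k:k + 1] * (X - means[k])).T @ (X - means[k]) / Nk[k]
            + 1e-6 * np.eye(d) for k in range(len(classes))])
    return classes, priors, means, covs
```

A new point is then assigned to the class whose component gives the largest posterior probability.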


Nigam et al. applied the EM algorithm to a mixture of multinomials for text classification, and the resulting classifiers performed better than those trained only on labeled data[1]. The co-training approach to semi-supervised learning and classification was proposed by Blum and Mitchell in 1998. Co-training was proposed for two classifiers and assumes that the input features are naturally subdivided into two sets, and that each feature subset is sufficient to train an optimal classifier provided that enough labeled data are available[10]. Two different classifiers are trained on the initial, small, labeled data set L. Each classifier is then applied to the unlabeled instances in U. For each classifier, the unlabeled data predicted with the highest confidence are added to the labeled data set L, so that both classifiers contribute to enlarging L. Both classifiers are retrained on this enlarged data set, and the steps are repeated a fixed number of times. The rationale behind co-training is that one classifier may assign correct labels to certain examples that are difficult for the other classifier; therefore, each classifier can augment the training set with examples that are very informative for the other.

In self-training, a classifier is first trained on the labeled data set L. This classifier is then used to classify the unlabeled data set U and assign pseudo-class labels, and such pseudo-labeled data are added to L. Usually, the unlabeled data classified with the highest confidence are the ones added to L. The classifier is then retrained on the augmented data set L. Because it is difficult to guarantee the convergence of this simple algorithm, the last two steps are usually repeated for a given number of times or until some heuristic convergence criterion is met.

Transductive support vector machines (TSVMs) build a connection between $p(x)$ and the discriminative decision boundary by keeping the boundary out of high-density regions. The TSVM is an extension of the standard support vector machine to unlabeled data. In a standard SVM, only the labeled data are used, and the goal is to find a maximum-margin linear boundary in the Reproducing Kernel Hilbert Space. In a TSVM, the unlabeled data are also used: the goal is to find a labeling of the unlabeled data such that a linear boundary has the maximum margin on both the original labeled data and the (now labeled) unlabeled data. The resulting decision boundary has the smallest generalization error bound on unlabeled data[9]. Intuitively, the unlabeled data guide the linear boundary away from dense regions.
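The self-training loop described above can be summarized in a few lines. The sketch below uses scikit-learn's SVC as the base classifier purely for illustration (the paper itself uses LIBSVM); the probability-based confidence score and the per-round batch size are our own assumptions, not details taken from the cited works.

```python
import numpy as np
from sklearn.svm import SVC

def self_train(X_labeled, y_labeled, X_unlabeled, rounds=5, per_round=10):
    # Generic self-training: pseudo-label the most confident unlabeled points,
    # move them into the labeled pool, retrain, and repeat for a fixed number
    # of rounds.
    X_l, y_l = np.asarray(X_labeled), np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled)
    clf = SVC(kernel="linear", probability=True)
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        confidence = proba.max(axis=1)
        idx = np.argsort(-confidence)[:per_round]          # most confident points
        pseudo_labels = clf.classes_[proba[idx].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[idx]])
        y_l = np.concatenate([y_l, pseudo_labels])
        X_u = np.delete(X_u, idx, axis=0)
    return clf.fit(X_l, y_l)                               # final retrained model
```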

III. A MODIFIED SELF-TRAINING SEMI-SUPERVISED SVM ALGORITHM

In this section, we first present a self-training semi-supervised SVM algorithm and then describe our modified algorithm.


A. Former self-training semi-supervised SVM algorithm [11]

A standard SVM classifier for the two-class problem is defined as follows:

$$\min_{w, b, \varepsilon} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \varepsilon_i \qquad (1)$$

subject to

$$y_i (w^T x_i + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0, \quad i = 1, \ldots, l,$$

where $x_i \in R^n$ is the feature vector of a training instance, $y_i \in \{-1, 1\}$ is the label of $x_i$, $i = 1, \ldots, l$, and $C > 0$ is a regularization constant.

Algorithm. Suppose that $F_I$ is a training set which includes $l_1$ samples $\{x_i, i = 1, \ldots, l_1\}$ with given labels $[y_0(1), \ldots, y_0(l_1)]$, and $F_T$ is a test set containing $l_2$ samples $\{x_{l_1+i}, i = 1, \ldots, l_2\}$ with unknown labels.

Step 1. Using $F_I$, we train an SVM and perform classification on $F_T$. We obtain the parameters of the SVM, $w^{(1)} \in R^n$, $\varepsilon^{(1)} \in R^{l_1}$, and $b^{(1)} \in R$. The predicted labels are denoted as $[y^{(1)}(1), \ldots, y^{(1)}(l_2)]$. The superscript denotes the current iteration number.

Step 2. The $k$th iteration ($k = 2, \ldots$) follows Steps 2.1-2.3.

Step 2.1. Define a new training set $F_N = F_I + F_T$, where the labels of $F_T$ are those predicted in the $(k-1)$th iteration.

Step 2.2. Using the augmented training set $F_N$, we train an SVM and perform classification again on $F_T$. The parameters of the SVM are denoted as $w^{(k)} \in R^n$, $\varepsilon^{(k)} \in R^{l_1 + l_2}$, and $b^{(k)} \in R$. The predicted labels are denoted as $[y^{(k)}(1), \ldots, y^{(k)}(l_2)]$.

Step 2.3. Calculate the objective function value in (1):

$$f(w^{(k)}, \varepsilon^{(k)}) = \frac{1}{2} w^{(k)T} w^{(k)} + C \sum_{i=1}^{l_1 + l_2} \varepsilon_i^{(k)} \qquad (2)$$

Step 3 (Termination step). Given a pre-determined positive constant $\delta_0$: if $f(w^{(k)}, \varepsilon^{(k)}) - f(w^{(k-1)}, \varepsilon^{(k-1)}) < \delta_0$, the algorithm stops after the $k$th iteration, and the predicted labels $[y^{(k)}(1), \ldots, y^{(k)}(l_2)]$ of the test set are the final classification results. Otherwise, perform the $(k+1)$th iteration.
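A compact sketch of Steps 1-3 follows. It uses scikit-learn's linear SVC in place of LIBSVM and recovers the slack variables as hinge losses of the trained model in order to evaluate the objective (2); these implementation details, along with the iteration cap, are assumptions and not part of the original description.

```python
import numpy as np
from sklearn.svm import SVC

def objective_value(clf, X, y, C):
    # f(w, eps) = 0.5 * ||w||^2 + C * sum(eps_i), eq. (2), with the slack
    # variables eps_i recovered as hinge losses of the trained classifier.
    w = clf.coef_.ravel()
    y_pm = np.where(y == clf.classes_[1], 1.0, -1.0)       # map labels to {-1, +1}
    eps = np.maximum(0.0, 1.0 - y_pm * clf.decision_function(X))
    return 0.5 * w @ w + C * eps.sum()

def former_self_training_svm(X_I, y_I, X_T, C=1.0, delta0=1e-3, max_iter=20):
    clf = SVC(kernel="linear", C=C).fit(X_I, y_I)          # Step 1
    y_T = clf.predict(X_T)
    X_N = np.vstack([X_I, X_T])                            # F_N = F_I + F_T
    prev_f = None
    for _ in range(max_iter):                              # Step 2, k = 2, 3, ...
        y_N = np.concatenate([y_I, y_T])                   # Step 2.1: labels from iteration k-1
        clf = SVC(kernel="linear", C=C).fit(X_N, y_N)      # Step 2.2: retrain, re-classify F_T
        y_T = clf.predict(X_T)
        f = objective_value(clf, X_N, y_N, C)              # Step 2.3: objective (2)
        if prev_f is not None and abs(prev_f - f) < delta0:
            break                                          # Step 3: termination
        prev_f = f
    return clf, y_T
```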

B. Modified self-training semi-supervised SVM algorithm

The modified SVM is motivated by the fact that Fisher's discriminant optimization problem for two classes is a constrained least-squares optimization problem[12].


The problem of minimizing the within-class variance has been reformulated so that it can be solved by constructing the optimal separating hyperplane for both the separable and non-separable cases, and the modified SVM has been applied successfully in several applications. To form the optimization problem of the modified SVM, we first define the within-class scatter matrix of the training set in (3), where $\mu_k^1$ and $\mu_k^2$ are the mean vectors of $U_k^1$ and $U_k^2$, respectively. We assume the within-class scatter matrix $S_w^k$ is invertible.

$$S_w^k = \sum_{g_i \in U_k^1} (g_i - \mu_k^1)(g_i - \mu_k^1)^T + \sum_{g_i \in U_k^2} (g_i - \mu_k^2)(g_i - \mu_k^2)^T \qquad (3)$$

The optimization problem of the modified SVM is

$$\min_{w_k, b_k, \varepsilon^k} \; \frac{1}{2} w_k^T S_w^k w_k + C_k \sum_{i=1}^{l} \varepsilon_i^k \qquad (4)$$

subject to the separability constraints

$$y_i^k (w_k^T g_i + b_k) \ge 1 - \varepsilon_i^k, \quad \varepsilon_i^k \ge 0, \quad i = 1, \ldots, l \qquad (5)$$

The solution of the optimization problem (4) subject to (5) is given by the saddle point of the Lagrangian

$$L(w_k, b_k, \alpha^k, \beta^k, \varepsilon^k) = \frac{1}{2} w_k^T S_w^k w_k + C_k \sum_{i=1}^{l} \varepsilon_i^k - \sum_{i=1}^{l} \alpha_i^k \left[ y_i^k (w_k^T g_i + b_k) - 1 + \varepsilon_i^k \right] - \sum_{i=1}^{l} \beta_i^k \varepsilon_i^k \qquad (6)$$

where $\alpha^k = [\alpha_1^k, \ldots, \alpha_l^k]$ and $\beta^k = [\beta_1^k, \ldots, \beta_l^k]$ are the vectors of Lagrange multipliers for the constraints (5). The vector $w_k$ can be derived from the Kuhn-Tucker (KT) conditions:

$$w_k = \frac{1}{2} (S_w^k)^{-1} \sum_{i=1}^{l} \alpha_i^k y_i^k g_i \qquad (7)$$

Instead of finding the saddle point of the Lagrangian (6), we find the maximization point of the Wolfe dual problem

$$W(\alpha^k) = \sum_{i=1}^{l} \alpha_i^k - \frac{1}{4} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i^k \alpha_j^k y_i^k y_j^k \, g_i^T (S_w^k)^{-1} g_j \qquad (8)$$

subject to

$$0 \le \alpha_i^k \le C_k, \; i = 1, \ldots, l, \qquad \sum_{i=1}^{l} \alpha_i^k y_i^k = 0 \qquad (9)$$

This quadratic optimization problem can be solved with a standard toolbox (e.g., in MATLAB). The test data set can then be separated with the following decision function:

$$f_k(g) = \mathrm{sign}(w_k^T g + b_k) = \mathrm{sign}\!\left( \frac{1}{2} \sum_{i=1}^{l} y_i^k \alpha_i^k \, g_i^T (S_w^k)^{-1} g + b_k \right) \qquad (10)$$

We substitute formula (1) with formula (4) in the self-training procedure of Section III.A, and thus obtain a modified self-training semi-supervised SVM algorithm.
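The following sketch shows one way to realize the modified SVM without a dedicated QP solver: since the objective (4) becomes a standard SVM objective after the change of variable $v_k = (S_w^k)^{1/2} w_k$, the data can be whitened with $(S_w^k)^{-1/2}$ and an ordinary linear SVM trained on the transformed features. This whitening route and the scikit-learn classifier are our own assumptions; the paper solves the dual (8)-(9) directly.

```python
import numpy as np
from sklearn.svm import SVC

def within_class_scatter(X, y):
    # Equation (3): sum of the per-class scatter matrices around the class means.
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)
    return Sw

def fit_modified_svm(X, y, C=1.0):
    # Whiten with S_w^{-1/2} (assumed invertible, as in the paper) and train a
    # standard linear SVM on the transformed features.
    Sw = within_class_scatter(X, y)
    vals, vecs = np.linalg.eigh(Sw)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T              # S_w^{-1/2}
    clf = SVC(kernel="linear", C=C).fit(X @ T, y)
    return clf, T

def predict_modified_svm(clf, T, X_new):
    # Decision rule analogous to (10), applied to the whitened features.
    return clf.predict(X_new @ T)
```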


IV. EXPERIMENTS

In this section, two experiments are presented using standard data sets from the UCI repository. The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. In the first experiment, we show how to select the C parameter that gives the best classification rate. In the second, we mainly demonstrate the validity and effectiveness of our modified algorithm. The SVM classification in our algorithm is carried out with LIBSVM[13], and linear SVMs are used in our experiments.

A. Parameter Selection

For the implementation of the SVM, the regularization parameter C must first be selected. In our experiments, we use cross-validation on the training data set to search for the C value. Our steps are as follows. We specify a possible interval for C, and every candidate value of C in the interval is tried to see which one gives the highest cross-validation accuracy. The best parameter is then used to train on the whole training set and generate the final model. The recognition rates obtained with different C values are listed in TABLE I and shown in Figure 1. From Figure 1, we can see that as C increases, the recognition rate first rises to about 83.2% and then decreases, so the best value of C is about 0.05.

TABLE I. THE RECOGNITION RATE WITH DIFFERENT C VALUES

C value             0.01    0.012   0.015   0.02    0.05    0.1
Recognition rate    60.9    68.6    77.3    83.2    83.2    80.5

C value             0.2     0.5     1       2       5       10
Recognition rate    78.6    79.5    77.7    77.7    73.6    71.8

C value             50      100     200     500
Recognition rate    70      74.1    73.6    73.6
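A sketch of this selection procedure is given below. It scores each candidate C on the grid of TABLE I by cross-validation and retrains the final model with the best value; the use of scikit-learn, the linear kernel, and 5 folds are our assumptions about details the paper does not spell out.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_C(X_train, y_train,
             grid=(0.01, 0.012, 0.015, 0.02, 0.05, 0.1,
                   0.2, 0.5, 1, 2, 5, 10, 50, 100, 200, 500)):
    # Score every candidate C by cross-validation on the training set.
    scores = {C: cross_val_score(SVC(kernel="linear", C=C),
                                 X_train, y_train, cv=5).mean()
              for C in grid}
    best_C = max(scores, key=scores.get)
    # Retrain on the whole training set with the best C to get the final model.
    final_model = SVC(kernel="linear", C=best_C).fit(X_train, y_train)
    return best_C, scores, final_model
```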


Figure 1. Selection of Parameter C

Figure 2. Convergence Rate

B. Simulation Results

In this experiment, we demonstrate the validity and effectiveness of our method using several data sets from the UCI database: "heart_scale", "iris", "wine", and "vehicle". The detailed parameters of the data sets are listed in TABLE II. For each data set, we perform a 5-fold cross-validation with our modified method and compare it to the former algorithm. In each fold, the data set is split into three parts: the first is used as the initial training set (D0), the second is first used as the test set and then used in retraining (D1), and the last is used as an independent test set (D2) for further validation of our method. After the modified algorithm is run on each data set, we obtain the accuracy rates for each of the 5 folds on the test set and on the independent test set, giving 10 accuracy rates, which are then averaged. The averaged outcomes are shown in Figure 3 to Figure 5. The convergence of the former algorithm has been theoretically proved in [8] and can be seen in Figure 2. From Figure 2, we can see that the algorithm converges very quickly: after the first iteration, the recognition rate is about 82%; with the second iteration it increases to about 93%; and after only 4 iterations it reaches about 100%. Based on many experiments, we choose 6 as the iteration number, which guarantees convergence.
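The per-fold split into D0, D1, and D2 described above can be sketched as follows; the 60/20/20 proportions are an assumption, since the paper only states that each fold's data are divided into three parts.

```python
import numpy as np

def split_fold(X, y, frac_d0=0.6, frac_d1=0.2, seed=0):
    # Shuffle and cut the data into D0 (initial labeled training set),
    # D1 (first used as test set, then added for retraining), and
    # D2 (independent test set). The proportions are assumptions.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n0 = int(frac_d0 * len(X))
    n1 = int(frac_d1 * len(X))
    d0, d1, d2 = idx[:n0], idx[n0:n0 + n1], idx[n0 + n1:]
    return (X[d0], y[d0]), (X[d1], y[d1]), (X[d2], y[d2])
```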

TABLE II. DETAILED INFORMATION OF DATA SETS

Data          Dimension    Size
heart_scale   13           270
iris          4            150
wine          13           178
vehicle       18           846

One of the experimental results is shown in Figure 3, which gives the results for the "heart_scale" data set. The lower line with "*" is the result of the former algorithm and the upper line with "+" is the result of our modified algorithm. We can see that the performance of our method is much better than that of the former algorithm. The advantages cover two aspects. The first is the recognition rate: our modified algorithm reaches a higher recognition rate than the former one. For the first iteration, the recognition rate of our modified method is about 84%, which is about 4% higher than that of the former algorithm. As the iterations proceed, the improvement shrinks because the recognition rate approaches 100%. The second is the convergence rate: from Figure 3, we can see that the former algorithm needs six iterations to converge while our modified method needs only five.

Figure 3. Result for the data set "heart_scale". The lower line with "*" shows the results of the former algorithm and the upper line with "+" shows the results of our modified algorithm


Figure 4. Result for the data set "Wine". The lower line with "*" shows the results of the former algorithm and the upper line with "+" shows the results of our modified algorithm

For each data set, we obtain the final model after several iterations. We then use the independent test set (D2) to check the classification rate, and the results are shown in TABLE III. From TABLE III, we can see that the classification rate of the final models is not always higher than that of the models trained only with D0, which we call the original models. For the iris data set, the recognition rates are 100% with both the final model and the original model, because the dimension of the iris data is only 4 and the classes are easily separated. For the wine data set, the recognition rate of the final model is lower than that of the original model, which is about 95%, so with the added retraining data the classification performance is decreased. For the heart_scale and vehicle data sets, the performance of the final models is not obviously better than that of the original models. This is because the data set is randomly split into several parts, which do not share an identical distribution; using the self-training method alone, it is therefore difficult to obviously improve the recognition rate, although much labeling time is saved. Nevertheless, our method is much better than the former algorithm. For the heart_scale and vehicle data sets, our method is almost 4% higher than the former method, while for the wine data set, our method is only 1% higher because both recognition rates are above 90%, which is already quite high. For the iris data set, our method is equal to the former method because both recognition rates are 100%. The comparison of our method with the former method is shown in Figure 6.

Figure 5. Result for the data set "Vehicle". The lower line with "*" shows the results of the former algorithm and the upper line with "+" shows the results of our modified algorithm

TABLE III. RECOGNITION RATE OF CLASSIFICATION RESULTS ON THE INDEPENDENT DATA SET D2

Data          Our Method    Former Algorithm
heart_scale   85.4          81.2
iris          100           100
wine          94.8          93.5
vehicle       89.1          85.6

Figure 6. Comparison of our method with the former algorithm

V. CONCLUSION

In this paper, we present a modified self-training semi-supervised SVM algorithm. The aim is to reduce the number of labeled training examples. We substitute the SVM of the former algorithm with a modified SVM to obtain a new method.


First, we use a parameter selection method to select the parameter C that guarantees the best performance. Then we compare our method to the former algorithm in several experiments, which show that our method is better than the former one. With a self-training semi-supervised algorithm alone, the performance is difficult to improve substantially, so how to use the unlabeled data to further improve performance is our next goal.

ACKNOWLEDGMENT

This paper is supported by the National Natural Science Foundation of China (No. 60975017, 60872073) and the Natural Science Foundation of Jiangsu Province (No. BK2008291). This paper is also supported by the Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, China. The authors are grateful for the insightful and constructive suggestions from the anonymous reviewers.

REFERENCES

[1] X. Zhu, "Semi-Supervised Learning Literature Survey," Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, 2008.
[2] T. J. O'Neill, "Normal discrimination with unclassified observations," Journal of the American Statistical Association, vol. 73, pp. 821-826, 1978.
[3] K. Nigam, et al., "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39(2/3), pp. 103-134, 2000.
[4] C. Rosenberg, et al., "Semi-supervised self-training of object detection models," Seventh IEEE Workshop on Applications of Computer Vision, 2005.
[5] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," Proc. Conf. on Computational Learning Theory, pp. 92-100, 1998.
[6] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
[7] D. Zhou, et al., "Semi-supervised learning on directed graphs," Advances in Neural Information Processing Systems, vol. 16, 2004.
[8] F. G. Cozman, et al., "Semi-Supervised Learning of Mixture Models," Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[9] V. Vapnik, Statistical Learning Theory, Springer, 1998.
[10] L. Didaci and F. Roli, "Using Co-training and Self-training in Semi-supervised Multiple Classifier Systems," Lecture Notes in Computer Science, vol. 4109, pp. 522-530, 2006.
[11] Y. Li, C. Guan, H. Li, and Z. Chin, "A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system," Pattern Recognition Letters, vol. 29, pp. 1285-1294, 2008.
[12] I. Kotsia and I. Pitas, "Facial Expression Recognition in Image Sequences Using Geometric Deformation Features and Support Vector Machines," IEEE Transactions on Image Processing, vol. 16, 2007.
[13] C.-C. Chang and C.-J. Lin, "LIBSVM: a Library for Support Vector Machines," Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2006.


Yun Jin received his BSc and MSc in Mechanical Engineering and Automation from China University of Mining and Technology in 2001 and 2004, respectively. His primary research interests are machine learning, pattern recognition, and speech emotion recognition. He is currently a PhD student in the School of Information Science and Engineering, Southeast University, Nanjing, China.

Chengwei Huang received his MSc in Signal and Communication Engineering from Southeast University in 2009. His primary research interests are pattern recognition, machine learning, and automatic speech recognition. He is currently a PhD student in the School of Information Science and Engineering, Southeast University, Nanjing, China.

Li Zhao received his MSc in Signal Processing from Southeast University in 1988 and his PhD from Kyoto Institute of Technology in 1995. His primary research interests are pattern recognition, machine learning, and automatic speech recognition. He is currently a professor in the School of Information Science and Engineering, Southeast University, Nanjing, China.