
Method for Determining Parameters of Posterior Probability SVM Based on Relative Cross Entropy

Qing-hua Xing, Fu-xian Liu, Xiang Li, and Lu Xia

Missile Institute of Air Force Engineering University, Sanyuan, 713800, China

[email protected]

Abstract. The technology of support vector machines (SVM) is widely used in many research fields, but the standard SVM does not provide the posterior probability that is needed in many uncertain classification problems. To solve this problem, a probability SVM model is built first; then cross entropy and relative cross entropy models for classification problems are constructed. Finally, a method for determining the parameters of the probability SVM model is put forward by minimizing the relative cross entropy. Experimental results show that the method of determining the model parameters is reasonable and that the posterior probability SVM model is effective.

Keywords: support vector machines, relative cross entropy, posterior probability.

1 Introduction

The support vector machine (SVM) [1,7], as a method of statistical learning theory, solves learning problems with finite samples and is widely used in pattern recognition, data mining and many other fields. However, the standard SVM only considers two extreme cases, in which a result belongs to some class with probability 1 or 0, so it cannot provide the posterior probability that is needed in many uncertain sample classification problems. Wahba [2] and Platt [3] first introduced posterior probability into the SVM to expand the capability of the standard SVM. There are mainly two kinds of methods for determining the posterior probability [4, 5]: the first uses the Bayesian framework, which first calculates the conditional probability density of every class and then computes the posterior probability by Bayes' theorem; the second fits the posterior probability directly without calculating the probability density of every class. These methods are all beneficial attempts at introducing the posterior probability into the standard SVM.

In this paper, a modeling method for a posterior probability SVM based on relative cross entropy is put forward. An optimization model is constructed with the relative cross entropy as the objective function, and the optimal parameter values of the probability SVM model are obtained by minimizing it. In this method the classification result of every SVM is given as a probability; that is, the category of a sample is determined by its posterior probability, which gives not only a qualitative explanation but also a quantitative evaluation.


2 Posterior Probability SVM Model

The standard output value of the SVM is [6]:

$$y = \operatorname{sign}(f(x)) \qquad (1)$$

where $f(x) = (w^{*} \cdot x) + b^{*}$. In the equation above, $w^{*}$ is the weight coefficient vector of the optimal hyperplane and $b^{*}$ is the classification threshold. The sample point $x$ nearest to the hyperplane (a support vector) satisfies $|f(x)| = 1$, the sample points on the hyperplane satisfy $f(x) = 0$, and the other points satisfy $f(x) = \pm r \cdot \|w^{*}\|$, where $r$ is the distance between $x$ and the hyperplane and the sign indicates which side of the hyperplane the point lies on. In this way, the distance between a support vector and the hyperplane is $r_{sv} = 1/\|w^{*}\|$, and the distance between any sample point $x$ and the hyperplane is $r_{x} = f(x)/\|w^{*}\|$. Then:

$$f(x) = r_{x} / r_{sv} \qquad (2)$$

From Equation (2), f(x) is the ratio of r_x to r_sv, which reflects the degree to which a sample point belongs to a certain class. Thus, the posterior probability model can be regarded as a function of f(x) by which the posterior probability of a sample point can be measured. Generally, the probability output function should satisfy the following requirements [2]: its value must lie in [0, 1] and it must be monotonic. A comparison among several kinds of monotone functions used as the probability output function shows that the sigmoid function with two parameters A and B has a flexible form for probability SVM modeling [4] and gives better classification accuracy in practical applications. Therefore, the sigmoid function with two parameters A and B is used as the posterior probability SVM model. For two-class classification problems, the posterior probability model of the SVM is given by this sigmoid function:





$$P(y = 1 \mid f(x)) = \frac{1}{1 + e^{A \cdot f(x) + B}}, \qquad P(y = -1 \mid f(x)) = 1 - P(y = 1 \mid f(x)) \qquad (3)$$

In Equation (3), the shape of the sigmoid function is controlled by the parameters A and B, and f(x) is the standard output value of sample x in the SVM. Thus, after probability modeling on the standard SVM, the class of sample x can be determined from the two equations above, and the degree to which a sample point belongs to a certain class is measured by the value of the posterior probability, which can also be called the reliability. For the standard SVM, by contrast, the class of sample x is determined by Equation (1), with the output given simply as y = 1 or y = -1.
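As a minimal illustration of this model, the following sketch (in Python, with illustrative parameter values; the fitting of A and B is described in Section 3) computes the posterior probabilities of Equation (3) from standard SVM outputs:

```python
import numpy as np

def posterior_probability(f, A, B):
    """Posterior probability P(y = 1 | f(x)) of Equation (3):
    a two-parameter sigmoid applied to the standard SVM output f(x)."""
    f = np.asarray(f, dtype=float)
    return 1.0 / (1.0 + np.exp(A * f + B))

# Example: decision values of three samples and placeholder parameter values.
f_values = np.array([2.1, 0.3, -1.7])
A, B = -1.8, -0.1                                # illustrative values only
p_pos = posterior_probability(f_values, A, B)    # P(y = 1 | f(x))
p_neg = 1.0 - p_pos                              # P(y = -1 | f(x))
print(p_pos, p_neg)
```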


3 Method for Determining Model Parameters of Probability SVM Based on Relative Cross Entropy

After building the posterior probability model based on the sigmoid of the standard output f(x) of the SVM, how are the parameters A and B of the probability model determined? In the following, a method of minimizing the relative cross entropy, based on the cross entropy, is put forward to determine the parameters of probability model (3).

3.1 Modeling the Cross Entropy and Relative Cross Entropy for Classification Problems

Suppose that a random variable x comes from a distribution p(x). As p(x) is unknown, it is represented by a distribution q(x), which is known in the form of some parametric model. The cross entropy between q(x) and the genuine distribution p(x) is defined as:

$$-\int p(x) \ln q(x)\, dx \qquad (4)$$

The cross entropy reaches its minimum only when q(x) equals p(x). For a two-class classification problem, let y = p(c1|x) and 1 - y = p(c2|x); the target output is t = 1 when x belongs to c1 and t = 0 when x belongs to c2. Then:

$$p(t \mid x) = y^{t} (1 - y)^{1 - t} \qquad (5)$$

It is obvious that p(t|x) obeys a Bernoulli distribution. If the training samples $(x_i, t_i)$ $(i = 1, 2, \ldots, n)$ are selected independently, their likelihood function can be written as $\prod_{i=1}^{n} p(t_i \mid x_i)$, that is:

$$\prod_{i=1}^{n} y_i^{t_i} (1 - y_i)^{1 - t_i} \qquad (6)$$

Taking the negative logarithm of the equation above gives:

$$E_1 = -\sum_{i=1}^{n} \left[ t_i \ln y_i + (1 - t_i) \ln(1 - y_i) \right] \qquad (7)$$

It can be proved that $E_1$ is the cross entropy between y(x) and the distribution of the target t. Substituting $y_i = t_i$ into Equation (7) gives the minimum of $E_1$:

$$E_{\min} = -\sum_{i=1}^{n} \left[ t_i \ln t_i + (1 - t_i) \ln(1 - t_i) \right] \qquad (8)$$

For the two-class problem, if $t_i$ is 1 or 0 then $E_{\min} = 0$, and if $t_i$ takes a continuous value in (0, 1) then $E_{\min} \neq 0$. Therefore, subtracting Equation (8) from Equation (7) gives an error function of the following form:

$$E_2 = -\sum_{i=1}^{n} \left[ t_i \ln \frac{y_i}{t_i} + (1 - t_i) \ln \frac{1 - y_i}{1 - t_i} \right] \qquad (9)$$
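A small numerical sketch of Equations (7) through (9), with illustrative continuous targets, which also checks that the relative cross entropy $E_2$ equals $E_1 - E_{\min}$:

```python
import numpy as np

def cross_entropy(t, y):
    """E1 of Equation (7): cross entropy between outputs y and targets t."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def relative_cross_entropy(t, y):
    """E2 of Equation (9): E1 minus its minimum Emin (Equation (8))."""
    return -np.sum(t * np.log(y / t) + (1 - t) * np.log((1 - y) / (1 - t)))

# Illustrative continuous targets in (0, 1) and model outputs.
t = np.array([0.9, 0.1, 0.8])
y = np.array([0.7, 0.2, 0.6])
e1 = cross_entropy(t, y)
e_min = cross_entropy(t, t)          # Emin: reached when y equals t
e2 = relative_cross_entropy(t, y)
print(np.isclose(e2, e1 - e_min))    # True: E2 = E1 - Emin
```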

This error function, which can be called the relative cross entropy, is essentially a relative entropy between the actual output $y_i$ and the theoretical output $t_i$. The smaller the error is, the closer $E_1$ is to $E_{\min}$ and the closer y(x) is to the target t.

3.2 Method for Determining Parameters of the Probability SVM Model by Minimizing Relative Cross Entropy

Suppose $(x_i, y_i)$ $(i = 1, 2, \ldots, n)$ is the training sample set of the SVM. In order to calculate the parameters A and B, another sample set $(f_i, y_i)$ $(i = 1, 2, \ldots, n)$ is taken as the training sample, in which $f_i = f(x_i)$ is the standard output of the SVM and $y_i \in \{-1, 1\}$. To avoid over-fitting when a small data set is used to fit the sigmoid function, noise is added to the original data set [3]. Namely, in the reconstructed training sample, the SVM output value of a positive sample is $f(x_i)$ with the corresponding target value $t_i = 1 - \varepsilon_+$, and the corresponding target value of a negative sample is $t_i = \varepsilon_-$. Here $\varepsilon_+ = \frac{1}{N_+ + 2}$ and $\varepsilon_- = \frac{1}{N_- + 2}$ can be estimated from the Bayes posterior probability. Then a redefined training sample set $(f_i, t_i)$ $(i = 1, 2, \ldots, n)$, in which $t_i$ is the target value of $f(x_i)$ after adding noise, is obtained as follows:

$$t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2}, & y_i = 1; \\[2mm] \dfrac{1}{N_- + 2}, & y_i = -1. \end{cases} \qquad (10)$$
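A short sketch of this target construction (the label array is illustrative):

```python
import numpy as np

def noisy_targets(y):
    """Map labels y in {-1, +1} to the noisy targets t_i of Equation (10)."""
    y = np.asarray(y)
    n_pos = np.sum(y == 1)     # N+
    n_neg = np.sum(y == -1)    # N-
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)   # target for positive samples
    t_neg = 1.0 / (n_neg + 2.0)             # target for negative samples
    return np.where(y == 1, t_pos, t_neg)

# Example with 3 positive and 2 negative samples.
labels = np.array([1, 1, -1, 1, -1])
print(noisy_targets(labels))   # approximately [0.8, 0.8, 0.25, 0.8, 0.25]
```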

To solve the model $p_i$, namely to compute the parameters A and B in $p_i$ and make the value of $p_i$ as close to $t_i$ as possible, the relative cross entropy of $p_i$ and $t_i$ is constructed as follows:

$$E = -\sum_{i=1}^{n} \left[ t_i \ln \frac{p_i}{t_i} + (1 - t_i) \ln \frac{1 - p_i}{1 - t_i} \right] \qquad (11)$$

Minimizing this relative cross entropy yields the parameters A and B of the sigmoid function. If the parameters A and B are written as the vector $Z = (A, B)^T$, the following problem is to be minimized:

$$\min_{Z = (A, B)^T} F(Z) \qquad (12)$$


where:

$$F(Z) = -\sum_{i=1}^{n} \left[ t_i \ln \frac{p_i}{t_i} + (1 - t_i) \ln \frac{1 - p_i}{1 - t_i} \right], \qquad p_i = \frac{1}{1 + e^{A f(x_i) + B}}$$

The Newton iterative algorithm [3] is used to compute the parameters A and B. The basic idea of the algorithm is as follows. First, compute the gradient $\nabla F(Z)$ and the Hessian matrix $G(Z)$ of $F(Z)$:



$$\nabla F(Z) = \begin{pmatrix} \displaystyle\sum_{i=1}^{n} \frac{\partial F}{\partial p_i} \frac{\partial p_i}{\partial A} \\[3mm] \displaystyle\sum_{i=1}^{n} \frac{\partial F}{\partial p_i} \frac{\partial p_i}{\partial B} \end{pmatrix} = \begin{pmatrix} \displaystyle\sum_{i=1}^{n} (t_i - p_i) f_i \\[3mm] \displaystyle\sum_{i=1}^{n} (t_i - p_i) \end{pmatrix}$$

$$G(Z) = \nabla^2 F(Z) = \begin{pmatrix} \displaystyle\sum_{i=1}^{n} p_i (1 - p_i) f_i^2 & \displaystyle\sum_{i=1}^{n} p_i (1 - p_i) f_i \\[3mm] \displaystyle\sum_{i=1}^{n} p_i (1 - p_i) f_i & \displaystyle\sum_{i=1}^{n} p_i (1 - p_i) \end{pmatrix}$$
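These formulas translate directly into the following short sketch (array names are illustrative):

```python
import numpy as np

def gradient_and_hessian(f, t, A, B):
    """Gradient and Hessian of F(Z) at Z = (A, B), transcribed from the formulas above.
    f: standard SVM outputs f(x_i); t: noisy targets t_i of Equation (10)."""
    p = 1.0 / (1.0 + np.exp(A * f + B))      # p_i of Equation (3)
    d = t - p
    grad = np.array([np.sum(d * f), np.sum(d)])
    w = p * (1.0 - p)
    hess = np.array([[np.sum(w * f * f), np.sum(w * f)],
                     [np.sum(w * f),     np.sum(w)]])
    return grad, hess
```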

For a given initial point $Z_0$ and a parameter $\sigma \ge 0$, ensure that $G(Z_0) + \sigma I$ is positive definite. Second, convert the problem above into solving the following iterative equation:

$$[G(Z_k) + \sigma I]\, \delta_k = -\nabla F(Z_k)$$

If $\nabla F(Z_k) = 0$, end the calculation; otherwise, select $\alpha_k$ successively from the sequence $1, \frac{1}{2}, \frac{1}{4}, \ldots$; namely, the first element of the sequence that satisfies

$$F(Z_k + \alpha_k \delta_k) \le F(Z_k) + 0.0001 \cdot \alpha_k \, \nabla F(Z_k)^T \delta_k$$

is taken as $\alpha_k$. Set $Z_{k+1} = Z_k + \alpha_k \delta_k$ and continue the iteration. In this way, the values of A and B are obtained by the iterative calculation. Then the posterior probability that sample x belongs to a given class can be determined according to Equation (3).
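The following self-contained sketch assembles the whole fitting procedure: it builds the noisy targets of Equation (10) and minimizes the relative cross entropy (11) with the damped Newton step and backtracking rule just described. The initial point, the damping value sigma, the stopping tolerance, and the iteration cap are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def fit_sigmoid_parameters(f, y, sigma=1e-3, max_iter=100, tol=1e-8):
    """Fit the parameters A, B of Equation (3) by minimizing the relative
    cross entropy (11) with the damped Newton iteration described above.
    f : standard SVM outputs f(x_i);  y : labels in {-1, +1}."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    # Noisy targets of Equation (10).
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def objective(a, b):
        p = np.clip(1.0 / (1.0 + np.exp(a * f + b)), 1e-12, 1.0 - 1e-12)
        # Relative cross entropy of Equation (11).
        return -np.sum(t * np.log(p / t) + (1 - t) * np.log((1 - p) / (1 - t)))

    A = 0.0
    B = np.log((n_neg + 1.0) / (n_pos + 1.0))   # common starting point (an assumption)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        grad = np.array([np.sum((t - p) * f), np.sum(t - p)])   # gradient of F(Z)
        if np.linalg.norm(grad) < tol:                          # stop when the gradient vanishes
            break
        w = p * (1.0 - p)
        hess = np.array([[np.sum(w * f * f), np.sum(w * f)],    # Hessian G(Z)
                         [np.sum(w * f),     np.sum(w)]])
        # Solve [G(Z_k) + sigma*I] delta_k = -grad F(Z_k).
        delta = np.linalg.solve(hess + sigma * np.eye(2), -grad)
        # Backtracking: first alpha in 1, 1/2, 1/4, ... satisfying the descent condition.
        alpha, f_old = 1.0, objective(A, B)
        while objective(A + alpha * delta[0], B + alpha * delta[1]) > \
                f_old + 1e-4 * alpha * grad.dot(delta):
            alpha *= 0.5
            if alpha < 1e-10:
                break
        A += alpha * delta[0]
        B += alpha * delta[1]
    return A, B
```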


4 Experimental Analysis

In this paper, the heart_scale, ionosphere_scale, liver-disorders_scale, and ijcnn1 data sets are used in the experiments on the probability SVM. The heart_scale set contains 260 samples (120 positive and 140 negative) with a feature dimension of 13; the ionosphere_scale set contains 340 samples (214 positive and 126 negative) with a feature dimension of 34; the liver-disorders_scale set contains 345 samples (150 positive and 195 negative) with a feature dimension of 6; in the ijcnn1 experiment the training and testing samples are separated, with 35001 training samples, 91701 testing samples and a feature dimension of 22. The classification results based on the probability SVM model and the standard SVM are shown in Table 1.

Table 1. The correctness of sample classification based on different methods

Sample                   Standard SVM       Probability SVM (minimizing relative cross entropy)
                         correct rate       correct rate       parameters
heart_scale              85.3846%           86.1538%           A = -1.81391, B = -0.0998787
ionosphere_scale         94.7059%           95.2941%           A = -7.33835, B = 5.21406
liver-disorders_scale    60%                68.1159%           A = -4.16529, B = 1.29649
ijcnn1                   91.0906%           92.3785%           A = -1.36896, B = -0.700627

From Table 1, it is obvious that the classification results of probability SVM are better than those of standard SVM.
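The paper does not state which SVM implementation was used in these experiments. Purely as an illustrative sketch of how such an experiment could be assembled, the following example trains a standard SVM with scikit-learn on synthetic data (both are assumptions standing in for the authors' setup), fits A and B on the training decision values with the fit_sigmoid_parameters routine sketched in Section 3.2, and compares the standard and probability-based classification accuracy on a held-out split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for one of the benchmark sets (an assumption).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
y = np.where(y == 1, 1, -1)                        # map labels to {-1, +1}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel='rbf')                            # standard SVM; kernel choice is an assumption
clf.fit(X_train, y_train)

f_train = clf.decision_function(X_train)           # standard outputs f(x_i)
A, B = fit_sigmoid_parameters(f_train, y_train)    # routine from the Section 3.2 sketch

f_test = clf.decision_function(X_test)
p_pos = 1.0 / (1.0 + np.exp(A * f_test + B))       # posterior P(y = 1 | f(x)), Equation (3)

standard_pred = np.where(f_test >= 0, 1, -1)       # Equation (1)
prob_pred = np.where(p_pos >= 0.5, 1, -1)          # classify by posterior probability
print("standard SVM accuracy:   ", np.mean(standard_pred == y_test))
print("probability SVM accuracy:", np.mean(prob_pred == y_test))
```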

5 Conclusion

In uncertain sample classification problems, the classification result often needs to be output in the form of a posterior probability, which the standard SVM cannot provide. Therefore, in this paper, based on cross entropy theory, a method of minimizing the relative cross entropy is used to build a posterior probability SVM model directly. With this method, not only is the classification precision of the SVM improved, but the degree of confidence in the class to which a sample belongs is also provided. Experimental results show that the classification accuracy can be effectively improved by the probability SVM based on minimizing the relative cross entropy.


References

1. Song, N.-H., Xing, Q.-H.: Multi-class Classification of Air Targets Based on Support Vector Machine. Systems Engineering and Electronics 28(8), 1279–1281 (2006)
2. Wahba, G.: Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In: Advances in Kernel Methods: Support Vector Learning, pp. 69–88. MIT Press, Massachusetts (1999)
3. Platt, J.C.: Probabilities for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers, pp. 61–74. MIT Press, Massachusetts (2000)
4. Zhang, X., Xiao, X.-L., Xu, G.-Y.: Probabilistic Outputs for Support Vector Machines Based on the Maximum Entropy Estimation. Control and Decision 21(7), 767–770 (2006)
5. Wu, G.-W., Tao, Q., Wang, J.: Support Vector Machines Based on Posterior Probability. Journal of Computer Research and Development 42(2), 196–202 (2005)
6. Lin, H.T., Lin, C.J., Weng, R.C.: A Note on Platt's Probabilistic Outputs for Support Vector Machines. National Taiwan University, Taipei (2003)
7. Wen, C.-J., Zhang, Y.-Z., Chen, C.-J.: Maximum-Margin Minimal-Volume Hypersphere Support Vector Machine. Control and Decision 25(1), 79–83 (2010)
8. Ma, Y.-L., Pei, S.-L.: Study on Parameter Optimization Algorithm for SVM Based on Improved Genetic Algorithm. Computer Simulation 27(8), 150–152 (2010)