
Posterior Probability Support Vector Machines for Unbalanced Data

Qing Tao, Gao-Wei Wu, Fei-Yue Wang, Fellow, IEEE, and Jue Wang, Senior Member, IEEE

Abstract—This paper proposes a complete framework of posterior probability support vector machines (PPSVMs) for weighted training samples using modified concepts of risks, linear separability, margin, and optimal hyperplane. Within this framework, a new optimization problem for unbalanced classification problems is formulated and a new concept of support vectors is established. Furthermore, a soft PPSVM with an interpretable parameter ν is obtained, which is similar to the ν-SVM developed by Schölkopf et al., and an empirical method for determining the posterior probability is proposed as a new approach to determine ν. The main advantage of a PPSVM classifier lies in the fact that it is closer to the Bayes optimal without knowing the distributions. To validate the proposed method, two synthetic classification examples are used to illustrate the logical correctness of PPSVMs and their relationship to regular SVMs and Bayesian methods. Several other classification experiments are conducted to demonstrate that the performance of PPSVMs is better than that of regular SVMs in some cases. Compared with fuzzy support vector machines (FSVMs), the proposed PPSVM is a natural and analytical extension of regular SVMs based on the statistical learning theory.

Index Terms—Bayesian decision theory, classification, margin, maximal margin algorithms, ν-SVM, posterior probability, support vector machines (SVMs), unbalanced data.

I. INTRODUCTION

The Bayesian decision theory is a fundamental statistical approach to classification problems [1], and its power, coherence, and analytical nature when applied in pattern recognition make it one of the most elegant formulations in science. A Bayesian approach achieves the exact minimum probability of error based entirely on evaluating the posterior probability. However, in order to perform the required calculations, a number of assumptions, including the availability of the a priori probability and the class-conditional probability, must be made.

Manuscript received October 12, 2004; revised April 1, 2005. The work of Q. Tao was supported by the Excellent Youth Science and Technology Foundation of Anhui Province of China (04042069). This work was supported in part by the National Basic Research Program (2004CB318103 and 2002CB312200) and NNSF Grants (60575001 and 60334020) of China. Q. Tao is with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China, and also with the New Star Research Institute of Applied Technology, Hefei 230031, P.R. China (e-mail: [email protected]; [email protected]). G.-W. Wu is with the Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P.R. China (e-mail: [email protected]). F.-Y. Wang is with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China, and also with the University of Arizona, Tucson, AZ 85721 USA (e-mail: [email protected]). J. Wang is with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P.R. China (e-mail: [email protected]). Digital Object Identifier 10.1109/TNN.2005.857955

Clearly, the knowledge of the density functions would allow us to solve whatever problems can be solved on the basis of the available data. But in line with Vapnik's principle of never solving a problem that is more general than the one you actually need to solve [2], [3], one should try to avoid estimating any density when solving a particular classification problem. Hence, in machine learning, the minimization of the expected risk should be achieved only in terms of the available samples. Fortunately, the statistical learning theory provides distribution-free conditions and guarantees for the good generalization performance of learning algorithms [2], [3]. It attempts to minimize the expected risk within the Probably Approximately Correct (PAC) framework (see [4]) by simultaneously minimizing the empirical risk and the model complexity [2], [3], [5]. The most significant difference between the Bayesian decision theory and the statistical learning theory might be that the former is based on distributions (deductive inference), while the latter is based on data (inductive inference).

Support vector machines (SVMs) are a machine learning method based on the statistical learning theory. The geometric interpretation of a linear SVM, known as the maximum margin algorithm, is very clear. Since SVMs need no distribution information, all training points in each class are treated equally. However, in many real applications, certain samples may be outliers and some may be corrupted by noise; thus, the influences of different samples may be unbalanced, SVMs may not be robust enough, and sometimes their performance can be affected severely by a few samples with small probabilities. How can we adapt SVMs to such situations? Inspired by the Bayesian decision theory, we try to solve the problem by weighing the label of each sample using the posterior probability $P(\omega_1 \mid x)$. Unlike those in regular SVMs, these labels may not be 1 or $-1$, and thus the data are in fact unbalanced. Such a classification problem is called unbalanced in this paper.

As far as the relationship between SVMs and the posterior probability is concerned, some researchers suggest that the output of an SVM should be a calibrated posterior probability to enable post-processing [3]. Under this circumstance, the output of the SVM will not be binary. In [3], Vapnik proposes fitting this probability with a sum of cosine functions, where the coefficients of the cosine expansion minimize a regularized functional. To overcome some limitations in [3], Platt applies a sigmoid regression to the output of SVMs to approximate the posterior probability [6].


Platt further demonstrates that the SVM+sigmoid combination produces probabilities of comparable quality to the regularized maximum likelihood kernel methods. Since the output of the SVM is used in the posterior probability estimation, the desired sparsity is achieved for the Bayesian classifiers in [3] and [6]. On the other hand, Sollich has described a framework interpreting SVMs as maximum a posteriori solutions to inference problems [7]. It should be pointed out that, in all of the previous papers, the training of the SVM is in fact investigated in the balanced case.

In order to solve unbalanced classification problems, Y. Lin et al. have presented a modified SVM [8], and C. Lin et al. have proposed a fuzzy SVM (FSVM, see [9]). Both extend the soft margin algorithm in SVMs by weighing the penalty terms of the errors. Unfortunately, the linearly separable problem, which forms the main building block for more complex SVMs, is difficult to discuss in their formulations. Obviously, how to reformulate the entire SVM framework in terms of posterior probabilities for solving unbalanced classification problems is an interesting issue.

In this paper, a complete framework of posterior probability support vector machines (PPSVMs) is presented by weighing unbalanced training samples and introducing new concepts of linear separability, margin, and optimal hyperplane. The style of PPSVMs is almost the same as that of SVMs, but the binary output of PPSVMs is now based on the posterior probability. This might constitute an interesting attempt to train an SVM to behave like a Bayesian optimal classifier. In reformulating SVMs, our main contributions come from defining the weighted margin properly and formulating a series of weighted optimization problems to determine hard margins, soft margins, and the ν-SVM (as proposed and developed by Schölkopf et al. [10]). Furthermore, the generalization bound of PPSVMs is analyzed, and the result provides PPSVMs with a solid basis in the statistical learning theory. Note that when the posterior probability of each sample is either 0 or 1, the proposed PPSVMs coincide with regular SVMs. Intuitively, by weighing the samples, a PPSVM is able to lessen the influence of outliers with small weights, which is a natural way to make the algorithm more robust against outliers. Clearly, the potential advantage of PPSVMs lies in the fact that they can be used to solve classification problems where the classes and samples are not equally distributed.

It is quite interesting to notice that the proposed approach to classification with posterior probabilities is very similar to what has been described in the Bayesian decision theory [1]. But it should be emphasized that our model is based on data rather than distributions. However, in many real applications, there is no posterior probability information available for the training samples. To make our PPSVMs applicable, we propose an empirical method for estimating the posterior probability of the samples. This empirical method of calculating posterior probabilities also provides a simple way to determine the parameter ν in ν-SVMs, which is new and interesting.

To verify the logical correctness of the proposed framework for weighing data, two classification examples are used to illustrate the relationship between our classifiers and the Bayesian classifiers. These synthetic experiments demonstrate that the proposed PPSVMs are closer to the Bayes optimal.


Additional tests have also been conducted on several real data sets, and the corresponding results demonstrate that the proposed PPSVMs are more robust than regular SVMs for some classification problems.

The remainder of this paper is arranged as follows. In Section II, the unbalanced classification problem and its loss are defined. The complete framework of PPSVMs is presented in Sections III through VI. Several numerical classification examples are given in Section VII. Finally, Section VIII concludes the paper with a few remarks.

II. UNBALANCED CLASSIFICATION PROBLEMS

Basically, a regular SVM considers the following two-category classification problem:

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n, \ y_i \in \{+1, -1\}$$

where $x_i$ is independently drawn and identically distributed, and $y_i$ is the label of $x_i$. For linearly separable cases, an SVM attempts to optimize the generalization bound by separating the data with a maximal margin classifier. Geometrically speaking, the margin of a classifier is the minimal distance of the training points from the decision boundary. The maximal margin classifier is the one with the maximum distance from the nearest patterns, called support vectors, to the boundary. It is easy to show that all training points are treated uniformly in SVMs [3]. In this paper, the following unbalanced two-category classification problem is addressed:

$$\{(x_1, P(\omega_1 \mid x_1)), (x_2, P(\omega_1 \mid x_2)), \ldots, (x_l, P(\omega_1 \mid x_l))\} \tag{1}$$

where $\omega_1$ is one of the two categories and $P(\omega_1 \mid x_i)$ is a posterior probability. Note that the degree of unbalance in (1) is reflected by the distribution of the posterior probabilities. Clearly, if the classification decision is based on complete information about the distribution function of the posterior probabilities, the expected risk (or probability of error) can be minimized exactly using the Bayesian decision theory [1]. Unfortunately, the Bayesian inference cannot be applied here because the distribution function of the samples is unknown and the only available information is the posterior probability of the training samples.

Using $\operatorname{sgn}(P(\omega_1 \mid x_i) - 1/2)$ as the label of sample $x_i$, the information on posterior probabilities is partly lost, but regular SVMs can be employed to deal with the linear classification problem

$$\{(x_1, \operatorname{sgn}(P(\omega_1 \mid x_1) - 1/2)), \ldots, (x_l, \operatorname{sgn}(P(\omega_1 \mid x_l) - 1/2))\}. \tag{2}$$

In order to fully utilize the information provided by the posterior probabilities in classification, a new continuous label is introduced to replace the original binary one for each sample. Define

$$y_i = 2P(\omega_1 \mid x_i) - 1, \quad i = 1, 2, \ldots, l.$$

Obviously, $y_i > 0$ if $P(\omega_1 \mid x_i) > 1/2$ and $y_i < 0$ if $P(\omega_1 \mid x_i) < 1/2$. For instance, a sample with $P(\omega_1 \mid x_i) = 0.9$ receives the label $y_i = 0.8$, whereas $P(\omega_1 \mid x_i) = 0.55$ gives $y_i = 0.1$. Now, a regular classification problem with $+1$ and $-1$ as labels can be regarded as a special case of problem (1). For the sake of convenience, it is assumed that $y_i \neq 0$ for all $i$. In fact, if $y_i = 0$, we can perturb $P(\omega_1 \mid x_i)$ a little and force $x_i$ to be in one of the two categories.


To develop posterior probability support vector machines (PPSVMs) for solving unbalanced classification problems, we have to modify the loss function and the risks first. Note that, in the Bayesian decision theory, $x$ is labeled 1 if $P(\omega_1 \mid x) > 1/2$ and is labeled $-1$ if $P(\omega_1 \mid x) < 1/2$. Obviously, the modified loss function should coincide with the Bayesian inference in this case. At the same time, the loss function here should be in agreement with the one in SVMs when the label of each sample is treated as $+1$ or $-1$. Based on these considerations, the loss function is redefined as follows.

Definition 1: Let $f(x, \alpha)$, $\alpha \in \Lambda$, be a hypothesis space of problem (1), where $\Lambda$ is a set of parameters and a particular choice of $\alpha$ generates a so-called trained machine. Define $L$ as follows:

$$L(x_i, y_i, f(x_i, \alpha)) = \begin{cases} 0, & \text{if } y_i f(x_i, \alpha) > 0 \\ 1, & \text{if } y_i f(x_i, \alpha) \le 0. \end{cases}$$

$L$ is called the loss function of (1). Based on this loss function, the expected and empirical risks can be defined in terms of distributions and data, respectively [2], [3]. Now the linear separability for unbalanced classification can be easily defined as follows.

Definition 2: If there exists a pair $(w, b)$, $w \in \mathbb{R}^n$, $b \in \mathbb{R}$, such that

$$y_i (w \cdot x_i + b) > 0, \quad i = 1, 2, \ldots, l \tag{3}$$

then the unbalanced classification problem (1) is called posterior probability linearly separable and $\operatorname{sgn}(w \cdot x + b)$ is called a posterior probability linear classifier.

The following remarks are useful in understanding the properties of linear separability.

• Obviously, the regular classification problem (2) is linearly separable if and only if (1) is posterior probability linearly separable.

• Since the norm of $(w, b)$ is not restricted in Definition 2 and we can multiply $(w, b)$ by a sufficiently large positive number, (3) is actually equivalent to: there exists a pair $(w, b)$ such that

$$y_i (w \cdot x_i + b) \ge y_i^2, \quad i = 1, 2, \ldots, l. \tag{4}$$

This property is important in specifying the constraint imposed by the linear separability for the purpose of optimization. Specifically, if $y_i$ is 1 or $-1$, (4) becomes

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, l.$$

This is just the canonical description of linear separability in regular SVMs.

• If $|y_i| = 1$, $\dfrac{y_i (w \cdot x_i + b)}{\|w\|}$ is the distance of $x_i$ to the hyperplane $w \cdot x + b = 0$. Therefore, for linearly separable problems, $\dfrac{y_i (w \cdot x_i + b)}{y_i^2 \|w\|}$ can be viewed as a weighted distance.

III. OPTIMAL HYPERPLANE, MARGIN, AND GENERALIZATION

Over the last decade, both theory and practice have pointed out that the concept of margin is central to the success of SVMs [5], [11], [12], [13], [14]. Generally, a large margin implies good generalization performance. Thus, to effectively deal with the unbalanced classification in (1), the concept of margin in SVMs has to be modified for use by PPSVMs. Clearly, weighing the regular margin by posterior probabilities is the most straightforward and natural way to do this. Since the significance of a sample should be weakened, and hence its margin enlarged, if its posterior probability is small, the weighted margin is defined as follows.

Definition 3: Let $\operatorname{sgn}(w \cdot x + b)$ be a linear classifier of the posterior probability linearly separable problem (1) and

$$\rho_i(w, b) = \frac{y_i (w \cdot x_i + b)}{y_i^2 \|w\|}.$$

$\rho_i(w, b)$ is called the margin of $x_i$.

$$\rho(w, b) = \min_{1 \le i \le l} \rho_i(w, b)$$

is called the margin of the linear classifier $(w, b)$. If there exists $(w^*, b^*)$ such that

$$\rho(w^*, b^*) = \max_{(w, b)} \rho(w, b)$$

then the hyperplane $w^* \cdot x + b^* = 0$ is called the optimal hyperplane for the unbalanced classification problem (1), and $\rho(w^*, b^*)$ is called the maximal margin.

As in SVMs, the maximal margin classifier is an important concept for PPSVMs, serving as a starting point for the analysis and construction of more sophisticated algorithms. Therefore, problems concerning the existence, uniqueness, implementation, and other properties of the maximal margin classifier for PPSVMs must be investigated first.

Theorem 1: There exists a unique optimal hyperplane for (1).

Proof: From the linear separability, there exists a pair $(w_0, b_0)$ with $\|w_0\| \le 1$ such that

$$y_i (w_0 \cdot x_i + b_0) > 0, \quad i = 1, 2, \ldots, l.$$

The linear separability also implies

$$\min_{1 \le i \le l} \frac{y_i (w_0 \cdot x_0 \cdot x_i + b_0)}{y_i^2} > 0.$$

Thus, we can define $\varphi(w, b) = \min_{1 \le i \le l} y_i (w \cdot x_i + b) / y_i^2$ in the region $\{(w, b) : \|w\| \le 1\}$. The existence of the maximum of $\varphi$ in the bounded region follows directly from the continuity of $\varphi$. The maximum of $\varphi$ must be achieved at $\|w\| = 1$, since if the maximum were achieved somewhere with $\|w\| < 1$, then $(w / \|w\|, b / \|w\|)$ would achieve a larger value. On the other hand, the maximum of $\varphi$ cannot be achieved at two different points $(w_1, b_1)$ and $(w_2, b_2)$, where $w_1 \ne w_2$ and $\|w_1\| = \|w_2\| = 1$. Otherwise, since the function $\varphi$ is concave, the maximum would also be achieved on the line segment that connects $(w_1, b_1)$ and $(w_2, b_2)$. Since there exists a pair with $\|w\| < 1$ on this segment, it contradicts the fact that the maximum must be achieved on $\|w\| = 1$. Thus, the theorem is proved.
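As a small numerical illustration of the weighted margin in Definition 3 (the hyperplane, samples, and posterior probabilities below are chosen by us purely for illustration and do not appear in the paper), consider two samples at the same geometric distance from a separating hyperplane but with different posterior probabilities. Take $w = (1, 0)$, $b = 0$, $x_1 = (2, 0)$ with $P(\omega_1 \mid x_1) = 1$ (so $y_1 = 1$), and $x_2 = (2, 5)$ with $P(\omega_1 \mid x_2) = 0.75$ (so $y_2 = 0.5$). Then

$$\rho_1(w, b) = \frac{1 \times 2}{1^2 \times 1} = 2, \qquad \rho_2(w, b) = \frac{0.5 \times 2}{0.5^2 \times 1} = 4.$$

Both points lie at geometric distance 2 from the hyperplane, but the low-confidence sample $x_2$ receives the larger weighted margin, so it is less likely to determine the minimum in $\rho(w, b)$ or to become a posterior probability support vector, which is exactly the weakening effect the definition is designed to produce.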


Assume $y_i (w \cdot x_i + b) \ge y_i^2$, $i = 1, 2, \ldots, l$, and $\operatorname{sgn}(w \cdot x + b)$ is a posterior probability linear classifier. Obviously, $\rho(w, b) \ge 1 / \|w\|$ follows from the linear separability. Finding a maximal margin classifier can now be formulated as solving the following optimization problem:

$$\max_{w, b} \ \min_{1 \le i \le l} \frac{y_i (w \cdot x_i + b)}{y_i^2 \|w\|}.$$

As discussed in [3], the above optimization problem is equivalent to

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \ge y_i^2, \quad i = 1, 2, \ldots, l. \tag{5}$$

In linearly separable cases, (4) can be easily satisfied by multiplying $(w, b)$ by a sufficiently large positive number. However, optimization problem (5) looks for the $(w, b)$ with the minimal norm.

Next, the generalization ability of the above maximal margin algorithm is analyzed. To this end, the well-known VC dimension bound theorem in [2], [3], and [5] is used. Based on the direct proof in [5], this bound also holds for the expected risks of classification here.

Theorem 2 ([5]): Let $h$ denote the VC dimension of the function class $F$. For any probability distribution $D$ on $X \times \{-1, 1\}$, with probability $1 - \delta$ over $l$ random examples, any hypothesis $f \in F$ that is consistent with the training set has error no more than

$$\operatorname{err}(f) \le \epsilon(l, F, \delta) = \frac{2}{l} \left( h \log \frac{2 e l}{h} + \log \frac{2}{\delta} \right)$$

provided $h \le l$ and $l \ge 2 / \epsilon$, where $\operatorname{err}(f)$ is the expected risk.

Let the hypothesis space be the set of all linear classifiers with margin at least $\rho$, and let $h$ be the corresponding VC dimension. The following inequality is given in [2] and [3]:

$$h \le \min\left( \frac{R^2}{\rho^2}, n \right) + 1$$

where $R$ is the radius of the smallest ball around the samples. Roughly speaking, SVMs attempt to minimize the expected risk within the PAC framework by minimizing the VC dimension of the function class and the empirical risk simultaneously [15], [16]. Note that in optimization problem (5), the constraints imply that the empirical risk is zero. On the other hand, if we bound the margin of a function class from below, the above inequality shows that we can control its VC dimension. Hence, the maximal margin algorithm in fact minimizes a bound on the expected risk under the condition that the corresponding empirical risk is zero. Therefore, the maximal margin algorithm here captures the insight of the statistical learning theory. However, there are ramifications of such an analysis that go beyond the scope of this paper. In fact, applying Theorem 2 to SVMs requires an a priori structuring of the hypothesis space. Bounds that rely on an effective complexity measure rather than the a priori VC dimension have been proposed by Shawe-Taylor et al. in [5] and [17]. These bounds rely only on the geometric margin. Since our margin is defined in almost the same geometric way as that in regular SVMs, the bounds in [5] and [17] can be applied to PPSVMs too. One of the main contributions here is that we have successfully modified the concept of margin in terms of posterior probabilities and developed the corresponding generalized optimization problem. Moreover, the generalization ability of the corresponding soft algorithms can be analyzed by employing the margin bound, as one can see in the sequel.

IV. POSTERIOR PROBABILITY SUPPORT VECTORS

To solve the optimization problem (5), define the Lagrangian function as

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - y_i^2 \right], \quad \alpha_i \ge 0.$$

Then, the dual representation of (5) is

$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i y_i^2 - \frac{1}{2} \sum_{i, j = 1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } \sum_{i=1}^{l} \alpha_i y_i = 0, \ \alpha_i \ge 0, \ i = 1, \ldots, l. \tag{6}$$

Let $\alpha^* = (\alpha_1^*, \ldots, \alpha_l^*)$ be a solution of (6). Using the same inference as in SVMs, the optimal $w^*$ can be written as

$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i.$$

Definition 4: If $\alpha_i^* > 0$, the corresponding $x_i$ is called a posterior probability support vector.

The following remarks are helpful for understanding the nature of the proposed PPSVMs.

• Since $w^* \ne 0$ and $w^* = \sum_i \alpha_i^* y_i x_i$, it is easy to see that there exists at least one $\alpha_j^* > 0$; therefore, $b^*$ in the optimal hyperplane can be found by the Karush-Kuhn-Tucker (KKT) complementarity conditions [18], [19], $\alpha_i^* [\, y_i (w^* \cdot x_i + b^*) - y_i^2 ] = 0$. According to Theorem 1, we will obtain the same $b^*$ for all $\alpha_i^* > 0$.

• If every $P(\omega_1 \mid x_i)$ is equal to 1 or 0, every $y_i$ will be 1 or $-1$, and all the above-mentioned definitions and optimization problems coincide with those in regular SVMs. Thus, the proposed PPSVM is an extension of regular SVMs to unbalanced classification problems.

• For each posterior probability support vector $x_i$, $y_i (w^* \cdot x_i + b^*) = y_i^2$. One of the most important differences between regular support vectors and posterior probability support vectors is that the former are the closest to the optimal hyperplane while the latter may not be. This is shown in Fig. 1.

• Note that optimization problem (6) uses inner products only; therefore, kernel techniques can be employed to solve the corresponding nonlinear problems.

• For linearly separable cases, it is interesting to notice that a vector $x$ on the optimal hyperplane satisfies $w^* \cdot x + b^* = 0$. This classification rule is similar to the Bayesian decision rule (see [1]) in principle. In particular, it is commonly known that the Bayesian discriminant function of a normal density is linear in some special cases (see [1]). Does it coincide with the proposed PPSVM classifier?


Fig. 1. Illustration of the posterior probability optimal hyperplane and support vectors, where the dotted points are posterior probability support vectors.

Unfortunately, theoretical analysis is lacking at this point, but we will demonstrate this interesting fact in Section VII through two designed examples.

• Apparently, the above classification idea looks as if we are searching for a regression function with the posterior probability label $y_i = 2P(\omega_1 \mid x_i) - 1$. However, in a regression problem, even if a sample is correctly labeled in the classification sense, its distance to the boundary always influences the regression function. Obviously, the main difference between classification and regression lies in the objective function. From the definition of the empirical and expected risks in Section II, it can be seen that we are intrinsically solving a classification problem rather than a regression one. This is the same as in regular SVM classification and regression.
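The following is a minimal sketch of how the hard-margin dual (6) could be solved numerically with a general-purpose solver, under the notation reconstructed above ($y_i = 2P(\omega_1 \mid x_i) - 1$, linear Gram matrix, recovery of $b^*$ from the KKT conditions). The function name and the choice of the SLSQP solver are ours for illustration only; they are not the authors' Matlab implementation.

# Minimal sketch (assumptions as stated in the text above): solve the
# reconstructed linear PPSVM dual (6) with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def train_ppsvm_hard(X, p):
    """X: (l, n) samples; p: (l,) posterior probabilities P(w1 | x_i)."""
    y = 2.0 * p - 1.0                        # continuous labels y_i = 2P - 1
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    c = y ** 2                               # linear term of the dual

    def neg_dual(a):                         # minimize the negative dual
        return 0.5 * a @ Q @ a - c @ a

    def neg_dual_grad(a):
        return Q @ a - c

    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    bounds = [(0.0, None)] * len(y)          # alpha_i >= 0 (hard margin)
    res = minimize(neg_dual, np.zeros(len(y)), jac=neg_dual_grad,
                   bounds=bounds, constraints=cons, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X                      # w* = sum_i alpha_i y_i x_i
    sv = np.argmax(alpha)                    # any alpha_i > 0 gives the same b*
    b = y[sv] - w @ X[sv]                    # from y_i (w . x_i + b) = y_i^2
    return w, b, alpha

In this sketch, the samples with $\alpha_i > 0$ are the posterior probability support vectors of Definition 4, and any of them recovers the same $b^*$ through the KKT condition $y_i (w^* \cdot x_i + b^*) = y_i^2$.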



V. C-SOFT MARGIN ALGORITHMS AND FUZZY SVMS

For problems in which patterns may be wrongly classified or may fall inside the margin stripe, Cortes and Vapnik proposed soft margin algorithms by introducing slack variables into the constraints [20], [2], [3]. Similarly, for posterior probability linearly inseparable classification problems, the following optimization problem for soft margins can be formulated:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge y_i^2 (1 - \xi_i), \ \xi_i \ge 0, \ i = 1, \ldots, l \tag{7}$$

where $C$ is a predefined positive real number and the $\xi_i$ are slack variables. In the constraints of (7), each slack variable is multiplied by $y_i^2$. Obviously, other choices that could result in different implementations are also possible. Note that our hard margin of $x_i$ is $\frac{y_i (w \cdot x_i + b)}{y_i^2 \|w\|} \ge \frac{1}{\|w\|}$ and it is relaxed to $\frac{y_i (w \cdot x_i + b)}{y_i^2 \|w\|} \ge \frac{1 - \xi_i}{\|w\|}$. Therefore, the softness in (7) is in fact introduced through a relaxation of the margin, along the same lines as the soft SVMs in [5] and [17]. As pointed out in [5] and [17], the advantage of such a relaxation is that each soft algorithm can be proved to be equivalent to a hard one in an augmented feature space, and its generalization ability can thus be guaranteed.

To deal with unbalanced classification problems, each sample $x_i$ is weighted by a fuzzy membership $s_i$ in [9]. Since the fuzzy membership $s_i$ is the attitude of the corresponding point toward one class and the penalty term $\sum_i \xi_i$ is a measure of error in an SVM, $\sum_i s_i \xi_i$ is a measure of error with different weighting. The optimal hyperplane problem in [9] is then considered as the solution to

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} s_i \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, l \tag{8}$$

where $y_i \in \{+1, -1\}$ and $0 < s_i \le 1$. Optimization problems (7) and (8) are very similar in format. Although it is claimed in [9] that a small $s_i$ can reduce the effect of $\xi_i$, the motivation is not clear from the point of view of statistical learning theory because the expected risk and the empirical risk are not defined. Consequently, optimization problem (8) can only be regarded as an extension of the SVM soft-margin optimization problem in its mathematical form.


The difference between PPSVMs and fuzzy SVMs is that the former is based on weighting the margin while the latter is based on weighting the penalty. It should also be pointed out that the optimization problem in [8] is almost identical to (8). Remarkably, it has the advantage of approaching the Bayesian rule of classification. The methods in [8] and [9] can be considered as another two approaches to unbalanced problems. Unfortunately, the linearly separable problem cannot be addressed if we use the idea of reducing the effect of some samples as suggested in [8] and [9]. Therefore, it is not possible to reformulate a complete SVM-like framework for unbalanced classification problems along those lines.
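To make this contrast concrete, the following is our own sketch of the dual problems that result from the reconstructed forms of (7) and (8); it is a derivation offered for illustration, not a display taken from the paper:

$$\text{from (7):}\quad \max_{\alpha}\ \sum_{i=1}^{l} \alpha_i y_i^2 - \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j), \quad \text{s.t. } \sum_{i=1}^{l}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le \frac{C}{y_i^2}$$

$$\text{from (8):}\quad \max_{\alpha}\ \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j), \quad \text{s.t. } \sum_{i=1}^{l}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le s_i C.$$

Under this reading, the posterior probability enters the PPSVM dual through both the linear term and the box constraint, that is, through the margin itself, whereas the fuzzy membership in the FSVM dual only rescales the upper bound on $\alpha_i$, that is, the penalty.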

VI. ν-SVM AND EMPIRICAL METHODS FOR POSTERIOR PROBABILITY

In order to effectively control the number of support vectors through a single parameter, Schölkopf et al. constructed a new SVM algorithm called the ν-SVM [10]. In addition, ν-SVMs have the advantage of enabling us to eliminate the regularization parameter C of C-SVMs. Specifically, ν-SVMs are formulated as

$$\min_{w, b, \xi, \rho} \ \frac{1}{2} \|w\|^2 - \nu \rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge \rho - \xi_i, \ \xi_i \ge 0, \ \rho \ge 0$$

where $\nu$ is a constant and $y_i$ is 1 or $-1$. To understand the role of $\rho$, note that for $\xi = 0$ the above constraints imply that the two classes are separated by the margin $2\rho / \|w\|$. Based on the proposed margin of PPSVMs, a corresponding ν-SVM is formulated by the following modification:

$$\min_{w, b, \xi, \rho} \ \frac{1}{2} \|w\|^2 - \nu \rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge y_i^2 \rho - \xi_i, \ \xi_i \ge 0, \ \rho \ge 0. \tag{9}$$

To obtain the dual, we consider the Lagrangian

$$L(w, b, \xi, \rho, \alpha, \beta, \delta) = \frac{1}{2} \|w\|^2 - \nu \rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - y_i^2 \rho + \xi_i \right] - \sum_{i=1}^{l} \beta_i \xi_i - \delta \rho.$$

Setting the derivatives with respect to $w$, $b$, $\xi$, and $\rho$ to zero, we obtain $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, $\alpha_i + \beta_i = 1/l$, and $\sum_i \alpha_i y_i^2 - \delta = \nu$, and arrive at the following optimization problem:

$$\max_{\alpha} \ -\frac{1}{2} \sum_{i, j = 1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } 0 \le \alpha_i \le \frac{1}{l}, \ \sum_{i=1}^{l} \alpha_i y_i = 0, \ \sum_{i=1}^{l} \alpha_i y_i^2 \ge \nu. \tag{10}$$

It can be shown that the resulting decision function takes the form

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot x) + b \right).$$

To compute $b$ and $\rho$, a point $x_s$ with $y_s > 0$ and a point $x_t$ with $y_t < 0$, both with $0 < \alpha_s, \alpha_t < 1/l$, are selected. By the KKT conditions, $y_s (w \cdot x_s + b) = y_s^2 \rho$ and $y_t (w \cdot x_t + b) = y_t^2 \rho$, from which $b$ and $\rho$ can be obtained.

Compared with the regular ν-SVM, the only change in (10) is that each $\alpha_i$ in the constraint $\sum_i \alpha_i \ge \nu$ is multiplied by $y_i^2$. Consequently, by the KKT conditions, the following theorem can be proved using the same method as that developed in Proposition 5 of [10]. This theorem indicates that the interpretation of ν in ν-PPSVMs is the same as that in regular ν-SVMs. However, it is more significant in the PPSVM framework since a new approach to determining ν can be constructed based on this interpretation. The detailed discussion is presented later in this paper.

Theorem 3: Assume that the solution of (9) satisfies $\rho > 0$; then the following statements hold. i) ν is an upper bound on the fraction of margin errors. ii) ν is a lower bound on the fraction of support vectors.

If the class prior probability and the class-conditional probability of each sample are known, the posterior probability can be easily calculated using the Bayes formula. But in many real applications, the class-conditional probability is unknown. In order to improve the performance of SVMs using PPSVMs in applications, an empirical method for estimating the class-conditional probability is proposed here.

Let $l_1$ be the number of samples in class $\omega_1$ and $l_2$ be the number of samples in class $\omega_2$. Obviously, $l_1 + l_2 = l$. Let $r > 0$ be a real number. The class-conditional probability of $x_i$ is defined as

$$p(x_i \mid \omega_1) = \frac{|\{ x_j : \|x_j - x_i\| \le r, \ x_j \in \omega_1 \}|}{l_1}, \qquad p(x_i \mid \omega_2) = \frac{|\{ x_j : \|x_j - x_i\| \le r, \ x_j \in \omega_2 \}|}{l_2}$$

where $|\cdot|$ is the number of elements in the set. Using the Bayes formula with $P(\omega_1) = l_1 / l$ and $P(\omega_2) = l_2 / l$, the posterior probability can be calculated as follows:

$$P(\omega_1 \mid x_i) = \frac{p(x_i \mid \omega_1) P(\omega_1)}{p(x_i \mid \omega_1) P(\omega_1) + p(x_i \mid \omega_2) P(\omega_2)}. \tag{11}$$

From the above empirical posterior probability, if there are some training data of the other category around a sample, the influence of this sample on the classification is weakened.


Fig. 2. The relationship between an exact PPSVM classifier and the Bayesian classifier. The solid curve is for the exact PPSVM classifier (ν = 0.10), while the dashed curve is for the Bayesian classifier.

Fig. 3. The relationship between an empirical PPSVM classifier and the Bayesian classifier. The solid curve is for the empirical PPSVM classifier (ν = 0.05, r = 0.4), while the dashed curve is for the Bayesian classifier.

The degree of the weakening is decided by the parameter $r$. Since we have only finite training samples in a classification problem, for each sample $x_i$ there must exist a sufficiently small $r$ such that there are no other samples in $\{x : \|x - x_i\| \le r\}$. In this case, $P(\omega_1 \mid x_i) = 1$ if $x_i \in \omega_1$ and $P(\omega_1 \mid x_i) = 0$ if $x_i \in \omega_2$. This indicates that PPSVMs will become regular SVMs if $r$ is small enough. If the label of a sample does not coincide with its posterior probability, it will be regarded as an error.


Fig. 4. The relationship between a ν-SVM classifier and the Bayesian classifier. The solid curve is for the ν-SVM classifier (ν = 0.15), while the dashed curve is for the Bayesian classifier.

Fig. 5. The relationship between an exact PPSVM classifier and the Bayesian classifier. The solid curve is for the exact PPSVM classifier (ν = 0.10), while the dashed curve is for the Bayesian classifier.

Let $\epsilon$ be a small positive real number. If $P(\omega_1 \mid x_i) = 1/2$ for some $x_i \in \omega_1$, we force $P(\omega_1 \mid x_i) = 1/2 + \epsilon$; if $P(\omega_1 \mid x_i) = 1/2$ for some $x_i \in \omega_2$, we force $P(\omega_1 \mid x_i) = 1/2 - \epsilon$. Intuitively, a sample can be viewed as a margin error if the label determined by its posterior probability contradicts its actual label. By Theorem 3, ν can be taken to be the fraction of such errors. Clearly, this is an interesting method to determine ν. Hence, our empirical method for calculating the posterior probability also provides a new approach to determining the interpretable parameter ν in ν-SVMs.


Fig. 6. The relationship between an empirical PPSVM classifier and the Bayesian classifier. The solid curve is for the empirical PPSVM classifier (ν = 0.1, r = 0.2), while the dashed curve is for the Bayesian classifier.

Fig. 7. The relationship between a ν-SVM classifier and the Bayesian classifier. The solid curve is for the ν-SVM classifier (ν = 0.5), while the dashed curve is for the Bayesian classifier.

A PPSVM classifier determined by this approach will henceforth be referred to as an empirical PPSVM classifier. For example, we can let $\nu = k / l$ if the number of wrong samples is $k$. Obviously, if an empirical PPSVM classifier coincides with its Bayesian classifier, the number of errors in the PPSVM classification would be almost equal to the number of samples whose labels do not coincide with their posterior probabilities. It is for this reason that this method of determining ν is justified.
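Since the empirical estimate (11) and the resulting choice of ν are described above only in prose, the following minimal sketch shows one way to implement them under the reconstruction given earlier (counts inside a ball of radius r, priors l1/l and l2/l, and ν equal to the fraction of contradicting samples). All function and variable names are ours, and the ε-perturbation of posteriors equal to 1/2 is an assumption of this sketch.

# Minimal sketch of the empirical posterior estimate and the choice of nu,
# under the assumptions stated in the text above.
import numpy as np

def empirical_posteriors(X, labels, r, eps=1e-3):
    """X: (l, n) samples; labels: (l,) in {+1, -1}; returns estimates of P(w1 | x_i)."""
    l = len(labels)
    pos, neg = labels == 1, labels == -1
    l1, l2 = pos.sum(), neg.sum()
    prior1, prior2 = l1 / l, l2 / l
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_ball = dist <= r                          # includes the sample itself
    cond1 = in_ball[:, pos].sum(axis=1) / l1     # empirical p(x_i | w1)
    cond2 = in_ball[:, neg].sum(axis=1) / l2     # empirical p(x_i | w2)
    post = cond1 * prior1 / (cond1 * prior1 + cond2 * prior2)
    # avoid y_i = 0: nudge samples whose estimate is exactly 1/2 (assumed step)
    post[(post == 0.5) & pos] += eps
    post[(post == 0.5) & neg] -= eps
    return post

def choose_nu(labels, post):
    """nu = fraction of samples whose label contradicts the estimated posterior."""
    contradict = ((post > 0.5) & (labels == -1)) | ((post < 0.5) & (labels == 1))
    return contradict.sum() / len(labels)

As the text notes, for a sufficiently small r every sample is alone in its ball, the estimated posteriors collapse to 0 or 1, and the PPSVM reduces to a regular SVM.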


Fig. 8. Illustration of an SVM classifier in the linearly separable case, where the circled points are support vectors.

In machine learning, the word "unbalanced" mainly refers to an unequal number of examples in each class of a data set. Under this circumstance, the roles of the samples are different. In this paper, the unbalance in the examples is directly reflected by the posterior probability if we adapt $P(\omega_1)$ and $P(\omega_2)$ in the empirical estimation method according to the specific application problem. This indicates that the usual unbalanced classification problems can also be solved by using PPSVMs. On the other hand, we might also employ some theoretically based methods from pattern recognition to estimate the probability [1]. But here, our estimation method relies only on the training samples, and it is easy to understand.

VII. EXAMPLES

The first two examples are designed to show the relationship between Bayesian classifiers, SVM classifiers, and PPSVM classifiers. To further demonstrate the performance of PPSVMs, a synthetic problem and several real classification problems taken from http://mlg.anu.edu.au/~raetsch/ are conducted.

Example 1: Let $P(\omega_1)$ and $P(\omega_2)$ be the class prior probabilities and let $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ be two normal class-conditional densities with given means and covariances.

A total of 100 samples are produced with labels randomly chosen according to the class prior probabilities. Obviously, if the label is 1, we calculate its class-conditional density according to $p(x \mid \omega_1)$, and if the label is $-1$, we calculate its class-conditional density according to $p(x \mid \omega_2)$. The posterior probability of each sample can then be easily derived by using the Bayes formula. A linear kernel is used in this example. The relationship between an exact PPSVM classifier and the Bayesian classifier can be seen in Fig. 2. In this case, the results indicate that they almost coincide with each other. If the proposed empirical method is used to determine the posterior probabilities and the parameter ν, an empirical PPSVM classifier is constructed. The relationship between the empirical PPSVM and Bayesian classifiers is presented in Fig. 3, while the relationship between a regular SVM and the Bayesian classifier is presented in Fig. 4. To some extent, this example illustrates that our PPSVM formulation, as well as the empirical estimation method, is reasonable and closer to the Bayes optimal.
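The data-generation step of an Example-1-style experiment can be sketched as follows. The priors, means, and covariance below are illustrative stand-ins chosen by us, not the values used in the paper; the exact posteriors are computed by the Bayes formula, exactly as the "exact PPSVM" setting assumes.

# Illustrative sketch of Example-1-style data generation (parameters are ours).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
prior1, prior2 = 0.5, 0.5
mean1, mean2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
cov = np.eye(2)                       # equal covariances give a linear Bayes boundary

l = 100
labels = np.where(rng.random(l) < prior1, 1, -1)
X = np.where(labels[:, None] == 1,
             rng.multivariate_normal(mean1, cov, size=l),
             rng.multivariate_normal(mean2, cov, size=l))

pdf1 = multivariate_normal(mean1, cov).pdf(X)
pdf2 = multivariate_normal(mean2, cov).pdf(X)
posterior1 = pdf1 * prior1 / (pdf1 * prior1 + pdf2 * prior2)   # exact P(w1 | x_i)
# The pairs (x_i, posterior1_i) form the unbalanced training set (1) for the exact PPSVM.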


Fig. 9. Illustration of a PPSVM classifier in the linearly separable case with r = 0.5 and ν = 0.1, where the circled points are support vectors.

TABLE I
AVERAGED CLASSIFICATION TEST ERRORS (%) ($r = \max\{\|x_i - x_j\|,\ i, j = 1, 2, \ldots, l\}$)

Example 2: Let $P(\omega_1)$ and $P(\omega_2)$ be the class prior probabilities and let $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ be two normal class-conditional densities with given means and covariances.

A total of 200 samples are produced with labels randomly chosen according to the class prior probabilities. Obviously, if the label is 1, we calculate its class-conditional density according to $p(x \mid \omega_1)$, and if the label is $-1$, we calculate its class-conditional density according to $p(x \mid \omega_2)$. In this example, the standard quadratic polynomial kernel is used for both SVMs and PPSVMs. The posterior probability of each sample can be easily derived from the class-conditional probabilities. The relationship between the exact PPSVM and Bayesian classifiers can be seen in Fig. 5. The relationship between an empirical PPSVM classifier and the Bayesian classifier is presented in Fig. 6, while the relationship between a regular SVM and the Bayesian classifier is presented in Fig. 7. As in Example 1, the results here indicate that our PPSVM formulation, as well as the empirical estimation method, is reasonable and closer to the Bayes optimal.
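Because the dual (6) uses inner products only, the kernel variant used in an Example-2-style experiment can be sketched by swapping the Gram matrix, as below. The quadratic polynomial kernel is written in the common (x·z + 1)² form, which is an assumption on our part; the paper's exact kernel settings may differ, and the helper names are ours.

# Hedged sketch of the kernelized decision function: replace the Gram matrix
# X @ X.T in the dual solver above by any kernel matrix, then classify with it.
import numpy as np

def poly2_kernel(A, B):
    return (A @ B.T + 1.0) ** 2

def ppsvm_decision(X_train, p_train, alpha, X_test, kernel=poly2_kernel):
    """Evaluate sign(sum_i alpha_i y_i k(x_i, x) + b); alpha must come from the
    dual solved with the same kernel matrix."""
    y = 2.0 * p_train - 1.0
    K_train = kernel(X_train, X_train)
    sv = np.argmax(alpha)                      # a support vector: alpha_sv > 0
    # KKT: y_sv (sum_i alpha_i y_i k(x_i, x_sv) + b) = y_sv^2, so solve for b
    b = y[sv] - (alpha * y) @ K_train[:, sv]
    scores = kernel(X_test, X_train) @ (alpha * y) + b
    return np.sign(scores)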

According to the resulting relationship between the PPSVM and Bayesian classifiers in the above examples, we can say that the proposed method of determining the parameter ν is reasonable.

Example 3: To intuitively illustrate the robustness of PPSVMs, the following synthetic example is designed. In Fig. 8, the point (3, 1) can be regarded as an outlier. It is easy to find that the point (3, 1) sharply affects the SVM classifier, while it does not severely influence the PPSVM classifier in Fig. 9.

Example 4: In this example, fixed values of ν and r are used. The data sets and the test errors of the RBF SVM algorithms are taken from http://mlg.anu.edu.au/~raetsch/. Since we have used the cross-validation strategy by employing the data given at http://mlg.anu.edu.au/~raetsch/, ν is fixed. In these experiments, the RBF kernel is used. Table I presents the averaged classification test errors of SVMs and PPSVMs. To a certain extent, this example indicates that the weighted data with some empirical methods can produce better results than the regular ν-SVM.


TABLE II
THE CORRESPONDING AVERAGED SUPPORT VECTOR RATIOS (%)

Table II presents the corresponding averaged support vector ratios of SVMs and PPSVMs. To a certain extent, it illustrates that the sparseness is not affected by using PPSVMs.

As a class of generalized SVMs, PPSVMs are expected to have better performance than regular SVMs in both theory and applications. Unfortunately, PPSVMs sometimes do not achieve the desired performance on some real data sets from http://mlg.anu.edu.au/~raetsch/. A major reason for this may be that there are a certain number of contradictory samples that have two labels, which makes PPSVMs unable to return to regular SVMs even if we let $r$ be arbitrarily small. On the other hand, performance problems in actual applications may be caused by the empirical method of deriving the posterior probabilities. Nevertheless, the performance of PPSVMs has already been illustrated by the fact that PPSVM classifiers are closer to the Bayes optimal. At least, PPSVMs are effective on training samples with known posterior probabilities and have good potential on unbalanced data sets.

In the above examples, all optimization problems are solved using the quadratic optimization solver in Matlab. Obviously, designing easy-to-implement fast iterative algorithms similar to those for regular SVMs is a very important next step for PPSVMs.
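For orientation, the sketches given earlier in this paper can be chained in the following way. This is an illustration only: it reuses the hypothetical helper functions defined in the previous sketches, not the authors' Matlab code.

# Illustration only: chain the hypothetical helpers sketched earlier
# (empirical_posteriors, choose_nu, train_ppsvm_hard) into one pipeline.
import numpy as np

post = empirical_posteriors(X_train, labels_train, r=0.4)
nu = choose_nu(labels_train, post)             # interpretable parameter for a nu-PPSVM
w, b, alpha = train_ppsvm_hard(X_train, post)  # or a soft / nu variant in practice
pred = np.sign(X_test @ w + b)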

VIII. CONCLUSION

This paper presents a class of posterior probability support vector machines that combines the Bayesian inference with SVMs. PPSVMs are motivated and constructed based on analytical formulations, and the proposed algorithms reduce the effect of training samples with small probabilities. Numerical classification examples have demonstrated that the proposed PPSVM formulation is reasonable and that its performance is more robust than that of regular SVMs.

ACKNOWLEDGMENT

The authors would like to thank the referees for their valuable comments.

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[2] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[3] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[4] L. G. Valiant, "A theory of the learnable," Commun. ACM, vol. 27, no. 11, pp. 1134–1142, 1984.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[6] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 1999.
[7] P. Sollich, "Bayesian methods for support vector machines: Evidence and predictive class probabilities," Machine Learning, vol. 46, no. 1, pp. 21–52, 2002.
[8] Y. Lin, Y. Lee, and G. Wahba. (2000) Support vector machines for classification in nonstandard situations. [Online]. Available: http://www.stat.wisc.edu/~yilin/papers/papers.html
[9] C. Lin and S. Wang, "Fuzzy support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 464–471, 2002.
[10] B. Schölkopf, A. J. Smola, R. Williamson, and P. Bartlett, "New support vector algorithms," Neural Comput., vol. 12, pp. 1083–1121, 2000.
[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization over data-dependent hierarchies," IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1926–1940, 1998.
[12] R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," Ann. Statist., vol. 26, no. 5, pp. 1651–1686, 1998.
[13] G. Rätsch, "Robust boosting via convex optimization," Ph.D. dissertation, Univ. Potsdam, Potsdam, Germany, 2001.
[14] G. Rätsch, S. Mika, B. Schölkopf, and K. R. Müller, "Constructing boosting algorithms from SVMs: An application to one-class classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1184–1199, 2002.
[15] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, 2001.
[16] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.
[17] J. Shawe-Taylor and N. Cristianini, "On the generalization of soft margin algorithms," IEEE Trans. Inf. Theory, vol. 48, no. 10, pp. 2721–2735, 2002.
[18] D. Kinderlehrer and G. Stampacchia, An Introduction to Variational Inequalities and Their Applications. New York: Academic, 1980.
[19] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York: Wiley, 1987.
[20] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.

Qing Tao was born in Hefei City, Anhui province, China, in 1965. He received the M.S. degree in mathematics from Southwest Normal University, Chongqing, China, in 1989, and the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 1999. From June 1999 to June 2001, he was a Postdoctoral Fellow with the University of Science and Technology of China. From June 2001 to 2003, he was a Postdoctoral Fellow with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. Currently, he is a Professor with the Institute of Automation, Chinese Academy of Sciences, and also a Professor with the New Star Research Institute of Applied Technology, Hefei. His research interests are in the areas of applied mathematics, neural networks, statistical learning theory, and SVMs.

Gao-Wei Wu was born in Jilin City, Jilin province, P. R. China, in 1975. He received the M.S. degree in circuits and systems from the Institute of Semiconductor, Beijing, China, in 2000, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2003. His research focuses on neural networks, statistical learning theory, and SVM.


Fei-Yue Wang (S'87–M'89–SM'94–F'03) received the B.S. degree in chemical engineering from the Qingdao University of Science and Technology, Qingdao, China, the M.S. degree in mechanics from Zhejiang University, Hangzhou, China, and the Ph.D. degree in electrical, computer and systems engineering from the Rensselaer Polytechnic Institute, Troy, NY, in 1982, 1984, and 1990, respectively. He joined the University of Arizona in 1990, where he is currently a Professor with the Department of Systems and Industrial Engineering and the Director of the Program for Advanced Research in Complex Systems. In 1999, he founded the Intelligent Control and Systems Engineering Center at the Institute of Automation, Chinese Academy of Sciences, Beijing, under the support of the Outstanding Overseas Chinese Talents Program. Since 2002, he has been the Director of the Key Laboratory of Complex Systems and Intelligence Science at the Chinese Academy of Sciences. His current research interests include modeling, analysis, and control mechanisms of complex systems; agent-based control systems; intelligent control systems; real-time embedded systems and application-specific operating systems (ASOS); and applications in intelligent transportation systems, intelligent vehicles and telematics, web caching and service caching, smart appliances and home systems, and network-based automation systems. He has published more than 200 books, book chapters, and papers in those areas since 1984 and has received more than $20 million and over ¥50 million RMB in funding from NSF, DOE, DOT, NNSF, CAS, MOST, Caterpillar, IBM, HP, AT&T, GM, BHP, RVSI, ABB, and Kelon. Dr. Wang was the Editor-in-Chief of the International Journal of Intelligent Control and Systems from 1995 to 2000 and the Editor-in-Charge of the Series in Intelligent Control and Intelligent Automation from 1996 to 2004, and is currently the Editor-in-Charge of the Series in Complex Systems and Intelligence Science. He is a member of Sigma Xi, ACM, ASME, ASEE, and the International Council on Systems Engineering (INCOSE). He received the Caterpillar Research Invention Award with Dr. P. J. A. Lever in 1996 for his work in robotic excavation and the National Outstanding Young Scientist Research Award from the National Natural Science Foundation of China in 2001, as well as various industrial awards for his applied research from major corporations. He is an Associate Editor for the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, the IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, IEEE INTELLIGENT SYSTEMS, and several other international journals. He is an elected member of the IEEE SMC Board of Governors and the AdCom of the IEEE Nanotechnology Council, President-Elect of the IEEE Intelligent Transportation Systems Society, and Chair of the Technical Committee on System Complexity of the Chinese Association of Automation.
He was the Program Chair of the 1998 IEEE International Symposium on Intelligent Control, the 2001 IEEE International Conference on Systems, Man, and Cybernetics, Chair for Workshops and Tutorials for 2002 IEEE International Conference on Decision and Control (CDC), the General Chair of the 2003 IEEE International Conference on Intelligent Transportation Systems, and Co-Program Chair of the 2004 IEEE International Symposium on Intelligent Vehicles, the General Chair of the 2005 IEEE International Conference on Networking, Sensing and Control, the Co-Program Chair of the 2005 IEEE International Conference on Intelligence and Security Informatics, the General Chair of the 2005 IEEE International Symposium on Intelligent Vehicles, and the General Chair of 2005 ASME/IEEE International Conference on Mechatronic and Embedded Systems and Applications. He is the President of Chinese Association for Science and Technology, USA and the Vice President and one of the major contributors of the American Zhu Kezhen Education Foundation.


Jue Wang (SM’99) received the M.S. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1984. Currently, he is a Professor with the Institute of Automation, Chinese Academy of Sciences. His research interests include ANN, GA, multiagent system, machine learning, and data mining.