IMPROVING CLASSIFICATION PERFORMANCE OF LINEAR FEATURE EXTRACTION ALGORITHMS

Moataz El Ayadi and Konstantinos N. Plataniotis
Multimedia Lab, The Edward S. Rogers Dept. of Electrical and Computer Engineering, University of Toronto, Ontario, Canada M5S 3G4

ABSTRACT

In this work, we propose a novel framework for improving the performance of linear feature extraction (LFE) algorithms, as characterized by the Bayesian error probability (BEP) in the extracted feature domain. The proposed framework relies on optimizing a tight quadratic approximation to the BEP in the transformed space with respect to the transformation matrix. Applied to many synthetic multi-class Gaussian classification problems, the proposed optimization procedure significantly improves classification performance when initialized with popular LFE matrices such as the Fisher linear discriminant analysis.

Index Terms— Bayesian error probability, linear feature extraction, multivariate Gaussian density.

1. INTRODUCTION

Linear feature extraction (LFE) has been extensively applied in signal processing applications such as face recognition [1], speech recognition [2], speaker identification [3], micro-array classification, and image analysis. The key advantages of LFE are its simplicity of implementation and the preservation of the shape of the class conditional densities after transformation. Since the introduction of the classical linear discriminant analysis (LDA) by Fisher [4], many extensions have been proposed to account for multiple classes and heteroscedastic classification [5, 6]. However, very few LFE methods consider the Bayesian error probability (BEP) as a design criterion, even though the BEP is the main performance measure of interest in most signal processing applications. In fact, the BEP in the transformed domain is generally a multi-modal function of the transformation matrix. Hence, no single LFE method can be expected to be superior to all other LFE methods for all classification problems. Therefore, rather than devising a new LFE method that is not guaranteed to outperform all other LFE methods on every problem, the main objective of this paper is to design a new framework for improving the performance of existing LFE methods in terms of the BEP in the transformed domain. We believe that this framework will increase the usability of LFE methods in signal processing applications.

The main focus of this paper is on Gaussian classification problems. The main assumptions are:
(1) the number of classes, C, is known before classification; (2) the given classes are disjoint and exhaustive, i.e., a pattern x belongs to exactly one of the C classes; and (3) all the class conditional densities and the class prior probabilities are exactly known. Since the BEP has no closed-form expression, we propose a tight quadratic approximation to it which can be efficiently optimized using gradient-based techniques.

The paper is divided into five sections. Section 1 is the introduction. The mathematical formulation of the LFE problem is given in section 2. In section 3, the proposed quadratic approximation to the BEP is derived for Gaussian classification problems. Finally, simulation results are shown in section 4 and important conclusions are drawn in section 5.

2. MATHEMATICAL FORMULATION

Consider a signal processing application in which the main task is to distinguish between C classes, e.g., a face recognition problem with C different faces. Assume that all patterns are real n-dimensional vectors, x ∈ R^n. In multi-class classification, we are interested in classifying a pattern x into one of the classes H1, ..., HC with prior probabilities P(H1), ..., P(HC). Patterns of Hi are generated according to the Gaussian density N(μi, Σi), i.e.,

    fx(x | Hi) = (2π)^{-n/2} |Σi|^{-1/2} exp( -(1/2) (x − μi)^T Σi^{-1} (x − μi) ),    (1)

where μi and Σi are the mean vector and the covariance matrix of x when Hi is true, respectively. We assume that all fx(x | Hi), i = 1, ..., C, as well as the priors P(Hi), are exactly known before classification. Further, assume that, for each class, all generated patterns are statistically independent and identically distributed. When all the class-conditional densities are known, the following Bayesian decision rule is usually used for classification [4]:

    x ∈ Ci  if  i = arg max_{j=1,...,C} P(Hj) fx(x | Hj).    (2)
A common measure of the classification performance is the BEP, given by [4]

    Pex = 1 − ∫ max_{i=1,...,C} P(Hi | x) fx(x) dx,    (3)
where P(Hi | x) is the posterior probability of Hi given x and fx(x) is the unconditional density of x. We now consider LFE to overcome the estimation and computation problems usually encountered when n is large.
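To make the quantities above concrete, the decision rule (2) and a Monte Carlo estimate of the BEP (3) for known Gaussian class parameters can be sketched as follows. This is only an illustrative sketch assuming NumPy/SciPy; it is not part of the paper's experimental code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, means, covs):
    """Decision rule (2): pick the class maximizing P(H_j) f_x(x | H_j)."""
    scores = [p * multivariate_normal.pdf(x, mean=m, cov=S)
              for p, m, S in zip(priors, means, covs)]
    return int(np.argmax(scores))

def estimate_bep(priors, means, covs, n_trials=50000, seed=None):
    """Monte Carlo estimate of the Bayesian error probability (3)."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_trials):
        true_class = rng.choice(len(priors), p=priors)
        x = rng.multivariate_normal(means[true_class], covs[true_class])
        if bayes_classify(x, priors, means, covs) != true_class:
            errors += 1
    return errors / n_trials
```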
Fig. 1. Schematic description of the considered LFE problem. The main objective is to find A that minimizes |Pey − Pex |1 .
The linearly transformed vector y is given by y = Ax, where the matrix A is of size k × n with k < n, so that y ∈ R^k. Similarly, we may define the BEP in the y-domain as

    Pey = 1 − ∫ max_{i=1,...,C} P(Hi | y) fy(y) dy,    (4)

where fy(y) is the unconditional density of y. The ultimate goal of LFE is to find the transformation matrix A for which Pey is as close as possible to Pex; i.e., find A that minimizes |Pey − Pex|1, where |.|1 denotes the l1-norm. The overall architecture for determining A is shown in Figure 1. When fx(x | Hi) is known, it is established that Pey ≥ Pex [7]. In [8, 9], it is proven that Pey = Pex if and only if the transformed vector y is a sufficient statistic for x with respect to the classification hypotheses. For Gaussian classification problems, this happens when A can be obtained through a full-rank decomposition of the following matrix [8]:

    M = [ δ2 ... δC   Σ1,2 − I ... Σ1,C − I ],    (5)

where δi = Σ1^{-1/2} (μi − μ1) and Σ1,i = Σ1^{-1/2} Σi Σ1^{-1/2} for i = 1, ..., C. That is, M = A^T G, where G is a full-rank matrix. Thus, the rank of M is the minimum value of k below which y cannot be a sufficient statistic for x with respect to the classification hypotheses. This dimensionality is called the minimal dimensionality for statistical sufficiency, kmin. If no assumption is made about Σi, the matrix M is in general of full rank and hence kmin = n. In fact, this is the case in most signal classification problems, and the desired reduced dimensionality is typically much less than kmin. In this case, y will not be a sufficient statistic for x and, as a result, Pey > Pex. Thus, minimizing |Pey − Pex|1 w.r.t. A is exactly equivalent to minimizing Pey w.r.t. the same matrix A, since Pex is not a function of A. In fact, very few of the existing LFE methods have considered minimizing Pey as a design criterion for A. The most relevant work is that of Hamsici and Martinez [6]. However, the exact version of the algorithm in [6] requires solving O(kC!) convex optimization problems, while an approximate version requires solving O(kC^2) convex optimizations. Since Pey is, in general, a multi-modal function of A, we propose starting with an initial transformation matrix A0 and updating it iteratively through optimization in order to obtain a smaller value of Pey. Since it is mathematically intractable to derive a closed-form expression for either Pey [4] or ∂Pey/∂A, the quadratic approximation of section 3 will be optimized with respect to A instead of Pey itself.
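For illustration, the matrix M in (5) and the minimal dimensionality kmin can be computed directly from the class parameters. The sketch below is not from the paper; it assumes NumPy/SciPy and uses scipy.linalg.sqrtm for the symmetric square root of Σ1, an implementation detail the paper does not specify.

```python
import numpy as np
from scipy.linalg import sqrtm

def build_M(means, covs):
    """Assemble M = [delta_2 ... delta_C, Sigma_{1,2} - I, ..., Sigma_{1,C} - I] as in (5)."""
    C, n = len(means), means[0].shape[0]
    # Sigma_1^{-1/2}; np.real guards against tiny imaginary parts returned by sqrtm.
    S1_inv_half = np.real(np.linalg.inv(sqrtm(covs[0])))
    blocks = []
    for i in range(1, C):
        delta_i = S1_inv_half @ (means[i] - means[0])       # whitened mean difference
        Sigma_1i = S1_inv_half @ covs[i] @ S1_inv_half      # whitened covariance
        blocks.append(delta_i.reshape(n, 1))
        blocks.append(Sigma_1i - np.eye(n))
    return np.hstack(blocks)

def minimal_dimensionality(means, covs):
    """k_min = rank(M): the smallest k at which y = Ax can be a sufficient statistic for x."""
    return np.linalg.matrix_rank(build_M(means, covs))
```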
3. QUADRATIC APPROXIMATION OF THE PROBABILITY OF MISCLASSIFICATION

The proposed quadratic approximation takes the form

    Qy(φ0, φ2) = φ0 + φ2 ∫ ∑_{i=1}^{C} P^2(Hi | y) fy(y) dy,    (6)

where the parameters φ0 and φ2 should be selected to minimize the absolute difference between Pey and Qy(φ0, φ2); i.e., the optimum values of φ0 and φ2 are given by

    (φ0*, φ2*) = arg min_{φ0, φ2} |Pey − Qy(φ0, φ2)|.    (7)

The motivation for the quadratic approximation is twofold. First, the derived approximation does not involve complex decision boundaries, as the BEP does, and hence it is simpler to calculate. Second, Qy can be easily differentiated with respect to the transformation matrix A, unlike Pey, whose decision boundaries are themselves functions of A. An analytical solution of (7) is not feasible. However, an approximate solution can be found by replacing the objective function in (7) with a tight upper bound which can be analytically optimized. Because of the limited size of the manuscript, only the final results are mentioned. The solution to the approximate optimization problem is φ0* = (7C + 1)/(8C) and φ2* = −1. The corresponding absolute difference is upper bounded by (C − 1)/(8C); i.e.,

    |Pey − Qy(φ0*, φ2*)| ≤ (C − 1)/(8C).    (8)

Substituting φ0* and φ2* in (6) and simplifying, the quadratic approximation is given by

    Qy(φ0*, φ2*) = (7C + 1)/(8C) − ∑_{i=1}^{C} P(Hi) ∫ P(Hi | y) fy(y | Hi) dy.    (9)

The function Qy(φ0*, φ2*) has no closed-form expression but can be approximated by any off-the-shelf Monte Carlo method such as rejection sampling or importance sampling. In this paper, we select ordinary Monte Carlo:

    Qy(φ0*, φ2*) ≈ (7C + 1)/(8C) − ∑_{i=1}^{C} (P(Hi)/Ni) ∑_{n=1}^{Ni} P(Hi | A xni),    (10)

where the patterns {xni}, n = 1, ..., Ni, are randomly generated according to the density fx(x | Hi) and Ni should be sufficiently large. Unlike the BEP, the derivative of Qy can be efficiently estimated using the Monte Carlo method via

    ∂Qy(φ0*, φ2*)/∂A ≈ − ∑_{i=1}^{C} (P(Hi)/Ni) ∑_{n=1}^{Ni} Fi(A, xni),    (11)

where

    Fi(A, x) = P(Hi | Ax) [ (∂/∂A) log fy(Ax | Hi) − ∑_{j=1}^{C} P(Hj | Ax) (∂/∂A) log fy(Ax | Hj) ].    (12)

(The proof of (11) is lengthy and hence omitted from the paper.)
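As a sanity check on (9)-(10), the ordinary Monte Carlo estimate of Qy(φ0*, φ2*) can be coded in a few lines. The sketch below is an illustration under the stated Gaussian assumptions, not the authors' implementation; the posterior P(Hi | Ax) is computed from the transformed densities fy(y | Hi) = N(Aμi, AΣiA^T).

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(Y, A, priors, means, covs):
    """P(H_i | y) for each row of Y, where y ~ N(A mu_i, A Sigma_i A^T) under H_i."""
    joint = np.column_stack([
        p * multivariate_normal.pdf(Y, mean=A @ m, cov=A @ S @ A.T)
        for p, m, S in zip(priors, means, covs)])
    return joint / joint.sum(axis=1, keepdims=True)

def q_approx(A, priors, means, covs, N_i=5000, seed=None):
    """Monte Carlo estimate (10) of Q_y(phi0*, phi2*)."""
    rng = np.random.default_rng(seed)
    C, total = len(priors), 0.0
    for i, (p, m, S) in enumerate(zip(priors, means, covs)):
        X = rng.multivariate_normal(m, S, size=N_i)      # x_{ni} ~ f_x(x | H_i)
        post = posteriors(X @ A.T, A, priors, means, covs)
        total += p * post[:, i].mean()                   # P(H_i) * (1/N_i) sum_n P(H_i | A x_{ni})
    return (7 * C + 1) / (8 * C) - total
```

By (8), the value returned by such an estimate should lie within (C − 1)/(8C) of the true Pey for the same A, which also gives a quick numerical cross-check of the bound.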
Algorithm 1 Estimation of the optimal LFE matrix for Gaussian classification problems.
Require: C, n, k, P(Hi), μi, Σi, i = 1, ..., C.
Output: Optimal transformation matrix A.
1: Calculate M in (5).
2: Find kmin ≡ rank(M).
3: if k ≥ kmin then
4:   Find A by performing a full-rank decomposition M = A^T G.
5: else
6:   Optimize Qy in (10) w.r.t. A. Initialize the optimization algorithm using any other popular LFE algorithm.
7: end if
It is easy to verify that, for the multivariate Gaussian distribution given in (1), the following expression holds:

    (∂/∂A) log fy(Ax | Hi) = 2 Σ̃i^{-1} A [ −Σi + (x − μi)(x − μi)^T (I − A^T Σ̃i^{-1} A Σi) ],    (13)
where Σ̃i = A Σi A^T. Thus, when k < kmin, we can obtain a suboptimal transformation matrix by optimizing the quadratic approximation in (10), with the gradient calculated using (11)-(13). In this work, we prefer to optimize Qy(φ0*, φ2*) using quasi-Newton methods because they offer fast convergence and reasonable memory requirements for moderate-size problems [10]. Another important issue is the proper initialization of the optimizer. As with almost all optimization algorithms, poor initialization may lead to an undesired local optimum of Pey. In order to alleviate this problem, we suggest initializing the proposed optimization algorithm using any other powerful LFE algorithm, such as the Fisher LDA [4] or an approximate decomposition of M in (5) such as the Tubbs method [8]. This way, it is guaranteed that the obtained LFE matrix is not inferior to those initializers. The overall optimization algorithm based on this strategy is depicted in Algorithm 1. In order to reduce the effect of initialization even further, different initializers can be employed and the one resulting in the lowest BEP selected.
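To make the procedure concrete, a minimal sketch of the optimization branch of Algorithm 1 is given below. It reuses the posteriors helper sketched earlier, uses SciPy's BFGS with numerical gradients instead of the analytic expressions (11)-(13) (a simplification relative to the paper), and takes the initializer A0 from any LFE method such as the Fisher LDA. Freezing the Monte Carlo samples makes the estimate (10) a deterministic function of A.

```python
import numpy as np
from scipy.optimize import minimize

def improve_lfe(A0, priors, means, covs, N_i=5000, max_iter=70, gtol=5e-4, seed=0):
    """Refine an initial k x n LFE matrix A0 by minimizing the Monte Carlo estimate (10) of Q_y."""
    k, n = A0.shape
    # Freeze the Monte Carlo samples (common random numbers) so the objective is smooth in A.
    rng = np.random.default_rng(seed)
    samples = [rng.multivariate_normal(m, S, size=N_i) for m, S in zip(means, covs)]
    C = len(priors)

    def objective(a_flat):
        A = a_flat.reshape(k, n)
        total = 0.0
        for i, (p, X) in enumerate(zip(priors, samples)):
            post = posteriors(X @ A.T, A, priors, means, covs)   # helper from the earlier sketch
            total += p * post[:, i].mean()
        return (7 * C + 1) / (8 * C) - total

    res = minimize(objective, A0.ravel(), method="BFGS",
                   options={"maxiter": max_iter, "gtol": gtol})
    return res.x.reshape(k, n)
```

In line with Algorithm 1, this branch is only required when k < kmin; otherwise A is obtained directly from a full-rank decomposition of M.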
4. SIMULATION RESULTS

We mainly have two sets of simulations. In subsection 4.1, the objective is to study the quality of the proposed approximation and to verify the theoretical bound in (8). In subsection 4.2, we demonstrate the efficacy of the proposed optimization algorithm in reducing Pey. All BEPs in all experiments are estimated using Monte Carlo simulation. The number of trials is selected such that N ≥ max(5 × 10^4, (1 − P̂e)/(0.05^2 P̂e)), where P̂e is the estimated BEP. This selection ensures that the ratio of the standard deviation of the estimate to its mean does not exceed 0.05.

4.1. Quality of the proposed quadratic approximation

In this simulation, we fixed n to 30 while C was varied from 2 to 40. For each value of C, 10 Gaussian classification problems were synthetically generated. The distributions of the generated P(Hi) and Σi are shown in Table 1 (heteroscedastic classification), and μi was generated according to N(0, 5I). For each classification problem, both Pex and its corresponding quadratic approximation Qx(φ0*, φ2*) were estimated using the Monte Carlo method. The actual values of |Pex − Qx(φ0*, φ2*)| as well as the theoretical upper bound in (8) are plotted in Figure 2.

[Fig. 2. Comparison between the actual value of |Pex − Qx(φ0*, φ2*)| and its upper bound in (8).]

We notice that the actual difference saturates at about 0.1 as C increases. Hence, even as C increases, Qx(φ0*, φ2*) should still provide a reasonable approximation to Pex. Therefore, it is expected that the performance of the proposed optimization procedure is not much affected by the increase of C. In addition, the difference between the theoretical upper bound of |Pex − Qx(φ0*, φ2*)| and its actual values does not exceed 0.025, since the theoretical upper bound does not exceed 0.125. This indicates the tightness of the derived theoretical upper bound.

4.2. Improvement achieved by the proposed framework

We considered three types of Gaussian classification problems: (1) heteroscedastic (general) Gaussian classification problems; (2) Gaussian classification problems with spherical covariance matrices, i.e., Σi = σi^2 I; and (3) homoscedastic Gaussian classification problems, Σi = Σ. For each type, 100 synthetic Gaussian classification problems were randomly generated. The distributions of the generated classification parameters are shown in Table 1. The mean vectors were generated differently for each type of Gaussian classification so as to obtain reasonably small BEPs for each type. We allowed C to take the values 10, 20, 30, and 40, while n was fixed to 30. We employed three LFE methods as initializers for our optimization procedure: the Fisher LDA, the Tubbs method [8], and the Chernoff-based LDA [5]. The quadratic approximation in (10), our objective function, was optimized using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The maximum number of iterations was 70, and the convergence termination criterion was set to 5 × 10^-4. For each generated classification problem, the following quantities were estimated: (1) PeFi1, PeTu1, PeCh1, the BEPs obtained after applying the Fisher LDA method, the Tubbs method, and the Chernoff-based method without optimization, respectively; (2) PeFi2, PeTu2, PeCh2, the BEPs obtained after optimizing the transformation matrices corresponding to PeFi1, PeTu1, PeCh1, respectively; and (3) kFi, kTu, kCh, the corresponding reduced dimensionalities.
Table 1. Distribution of the generated parameters.

    Parameter | Covariance type        | Generation density
    P(Hi)     | All                    | Dirichlet; Dir(10, ..., 10)
    μi        | Homoscedastic          | Multivariate normal; N(0, 30I)
    μi        | Spherical              | Multivariate normal; N(0, I)
    μi        | Heteroscedastic        | Multivariate normal; N(0, 50I)
    Σi        | Homoscedastic          | Wishart (50 degrees of freedom)
    Σi        | Spherical, Σi = σi^2 I | σi^2 ~ χ^2 (50n degrees of freedom)
    Σi        | Heteroscedastic        | Wishart (50 degrees of freedom)
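For reproducibility, heteroscedastic problems of the kind listed in Table 1 could be generated roughly as sketched below. This is an assumed reading of the table, not the authors' code; in particular, the Wishart scale matrix is chosen here so that E[Σi] = I, since the paper specifies only the degrees of freedom.

```python
import numpy as np
from scipy.stats import wishart

def generate_heteroscedastic_problem(C, n, seed=None):
    """Random priors, means, and covariances following the heteroscedastic rows of Table 1."""
    rng = np.random.default_rng(seed)
    priors = rng.dirichlet(10.0 * np.ones(C))                      # Dir(10, ..., 10)
    means = [rng.multivariate_normal(np.zeros(n), 50.0 * np.eye(n)) for _ in range(C)]
    # Wishart with 50 degrees of freedom; scale I/50 is an assumption so that E[Sigma_i] = I.
    covs = [wishart.rvs(df=50, scale=np.eye(n) / 50.0, random_state=rng) for _ in range(C)]
    return priors, means, covs
```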
Table 2. Homoscedastic Gaussian classification.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0050 | 0.0044 | 6.42 | 0.0051 | 0.0045 | 6.48 | 0.0050 | 0.0044 | 6.44
    20 | 0.0242 | 0.0198 | 7.00 | 0.0275 | 0.0219 | 7.00 | 0.0241 | 0.0198 | 7.00
    30 | 0.0565 | 0.0404 | 7.00 | 0.0626 | 0.0447 | 7.00 | 0.0561 | 0.0404 | 7.00
    40 | 0.0635 | 0.0535 | 7.00 | 0.0762 | 0.0565 | 7.00 | 0.0635 | 0.0535 | 7.00
Table 2 lists the average values of the quantities defined above for homoscedastic Gaussian classification over the 100 generated problems. The reduced dimensionality is determined as the one providing the minimum BEP (before optimization) within the range from 1 to 7. Generally, the three methods have very comparable classification performance before and after optimization. This is consistent with two facts for homoscedastic Gaussian classification problems: (a) the Chernoff method reduces to the Fisher LDA [5], and (b) when k ≥ kmin, the Fisher and Tubbs methods are equivalent since the left singular vectors of M in (5) are related to the Fisher eigenvectors by a nonsingular transformation. Therefore, it is very likely that these LFE methods already perform close to optimum when k < kmin, so no significant improvement in classification performance is observed. The experiments were repeated for the other two types of Gaussian classification problems; the results are given in Table 3 and Table 4. Here, the proposed optimization algorithm provides a significant reduction in the BEP compared to the homoscedastic case. This is expected since, with probability one, kmin = n. Hence, the y produced by the three LFE methods is not a sufficient statistic for x, and it is very likely that none of the three methods is optimal. In addition, the amount of reduction in the BEP increases as C increases, or equivalently, as the classification task becomes harder. This adds to the strength of our proposed optimization algorithm. It also means that as C increases, the three LFE methods become less optimal and there is more need for optimizing Pey w.r.t. A. Further, the BEPs of the three methods after optimization are very comparable to each other. That is, the final classification performance obtained after optimization is somewhat insensitive to the initializer. We believe this property is very desirable in many signal classification problems, since it is difficult to know beforehand which LFE method will yield the best classification performance. Finally, the selected dimensionality in Tables 2, 3, and 4 was either exactly 7 or very close to it. This is consistent with the intuition of the monotonically decreasing behaviour of Pey versus k when the class conditional densities are exactly known.
Table 3. Gaussian classification problems with spherical covariance matrices.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0116 | 0.0097 | 7.00 | 0.0109 | 0.0095 | 7.00 | 0.0115 | 0.0097 | 7.00
    20 | 0.0725 | 0.0544 | 7.00 | 0.0800 | 0.0562 | 7.00 | 0.0727 | 0.0544 | 7.00
    30 | 0.1341 | 0.1067 | 7.00 | 0.1507 | 0.1053 | 7.00 | 0.1346 | 0.1070 | 7.00
    40 | 0.1971 | 0.1697 | 7.00 | 0.2177 | 0.1689 | 7.00 | 0.1977 | 0.1692 | 7.00
Table 4. General Gaussian classification problems.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0081 | 0.0056 | 6.80 | 0.0299 | 0.0121 | 7.00 | 0.0071 | 0.0056 | 7.00
    20 | 0.0562 | 0.0401 | 7.00 | 0.1527 | 0.0484 | 7.00 | 0.0575 | 0.0395 | 7.00
    30 | 0.1188 | 0.0917 | 7.00 | 0.2503 | 0.1004 | 7.00 | 0.1266 | 0.0961 | 7.00
    40 | 0.1696 | 0.1352 | 7.00 | 0.3191 | 0.1392 | 7.00 | 0.1772 | 0.1426 | 7.00
5. CONCLUSIONS

In this paper, we have proposed a novel framework for improving the performance of LFE algorithms, characterized by the BEP after dimensionality reduction. The improvement has been validated experimentally for three popular LFE algorithms. Based on the performed simulations, it was concluded that, unless the transformed feature vector is a sufficient statistic for the original pattern vector, no single LFE method is guaranteed to have the best classification performance for all Gaussian classification problems. Though the focus in this paper has been on classification problems with known conditional densities, we believe the same conclusion still applies when the class conditional densities are estimated from given training data. This case will be further investigated in future publications.
References

[1] Juwei Lu, K.N. Plataniotis, and A.N. Venetsanopoulos, "Face recognition using LDA-based algorithms," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 195–200, Jan. 2003.
[2] M. Sakai, N. Kitaoka, and S. Nakagawa, "Generalization of linear discriminant analysis used in segmental unit input HMM for speech recognition," in ICASSP 2007, April 2007, vol. 4, pp. IV-333–IV-336.
[3] M.A. El-Gamal, M.F. Abu El-Yazeed, and M.M.H. El Ayadi, "Dimensionality reduction for text-independent speaker identification using Gaussian mixture model," in IEEE International Symposium on Micro-NanoMechatronics and Human Science, Dec. 2003, vol. 2, pp. 625–628.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, November 2000.
[5] R.P.W. Duin and M. Loog, "Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion," IEEE Trans. PAMI, vol. 26, no. 6, pp. 732–739, June 2004.
[6] O.C. Hamsici and A.M. Martinez, "Bayes optimality in linear discriminant analysis," IEEE Trans. PAMI, vol. 30, no. 4, pp. 647–657, April 2008.
[7] Luc Devroye, Lazlo Gyorfi, and Gabor Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag New York, Inc., 1996.
[8] J. D. Tubbs, W. A. Coberly, and D. M. Young, "Linear dimension reduction and Bayes classification with unknown population parameters," Pattern Recognition, vol. 15, no. 3, pp. 167–172, 1982.
[9] P. B. Charles, R. Richard, and P. Henry, "Characterization of linear sufficient statistics," Sankhya: The Indian Journal of Statistics, vol. 40, no. 3, pp. 303–309, 1978.
[10] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall International, 1982.