IMPROVING CLASSIFICATION PERFORMANCE OF LINEAR FEATURE EXTRACTION ALGORITHMS

Moataz El Ayadi and Konstantinos N. Plataniotis
Multimedia Lab, The Edward S. Rogers Dept. of Electrical and Computer Engineering, University of Toronto, Ontario, Canada M5S 3G4

ABSTRACT

In this work, we propose a novel framework for improving the performance of linear feature extraction (LFE) algorithms, as characterized by the Bayesian error probability (BEP) in the extracted feature domain. The proposed framework relies on optimizing a tight quadratic approximation to the BEP in the transformed space with respect to the transformation matrix. Applied to many synthetic multi-class Gaussian classification problems, the proposed optimization procedure significantly improves classification performance when initialized with popular LFE matrices such as the Fisher linear discriminant analysis.

Index Terms— Bayesian error probability, linear feature extraction, multivariate Gaussian density.

1. INTRODUCTION

Linear feature extraction (LFE) has been extensively applied in signal processing applications such as face recognition [1], speech recognition [2], speaker identification [3], micro-array classification, and image analysis. The key advantages of LFE are its simplicity of implementation and the preservation of the shape of the class conditional densities after transformation. Since the introduction of the classical linear discriminant analysis (LDA) by Fisher [4], many extensions have been proposed to account for multiple classes and heteroscedastic classification [5, 6]. However, very few LFE methods consider the Bayesian error probability (BEP) as a design criterion, even though the BEP is the main performance measure of interest in most signal processing applications. In fact, the BEP in the transformed domain is generally a multi-modal function of the transformation matrix. Hence, no single LFE method can be expected to be superior to all other LFE methods for all classification problems. Therefore, rather than devising a new LFE method that is not guaranteed to outperform all other LFE methods on every problem, the main objective of this paper is to design a new framework for improving the performance of existing LFE methods in terms of the BEP in the transformed domain. We believe that this framework will increase the usability of LFE methods in signal processing applications.

The main focus of this paper is on Gaussian classification problems. The main assumptions are:
(1) the number of classes, C, is known before classification; (2) the given classes are disjoint and exhaustive, i.e., a pattern x belongs to exactly one of the C classes; and (3) all the class conditional densities and the class prior probabilities are exactly known. Since the BEP has no closed-form expression, we propose a tight quadratic approximation to it which can be efficiently optimized using gradient-based techniques.

The paper is divided into five sections. Section 1 is the introduction. The mathematical formulation of the LFE problem is given in section 2. In section 3, the proposed quadratic approximation to the BEP is derived for Gaussian classification problems. Finally, simulation results are shown in section 4 and important conclusions are drawn in section 5.

2. MATHEMATICAL FORMULATION

Consider a signal processing application in which the main task is to distinguish between C classes, e.g., a face recognition problem with C different faces. Assume that all patterns are real n-dimensional vectors, x ∈ R^n. In multi-class classification, we are interested in classifying a pattern x into one of the classes H1, ..., HC with prior probabilities P(H1), ..., P(HC). Patterns of Hi are generated according to the Gaussian density N(μi, Σi), i.e.,

    fx(x | Hi) = (2π)^{-n/2} |Σi|^{-1/2} exp( -(1/2) (x − μi)^T Σi^{-1} (x − μi) ),    (1)

where μi and Σi are the mean vector and the covariance matrix of x when Hi is true, respectively. We assume that all fx(x | Hi), i = 1, ..., C, as well as the priors P(Hi), are exactly known before classification. Further, assume that, for each class, all generated patterns are statistically independent and identically distributed. When all the class-conditional densities are known, the following Bayesian decision rule is usually used for classification [4]:

    x ∈ Ci  if  i = arg max_{j=1,...,C} P(Hj) fx(x | Hj).    (2)
A common measure of the classification performance is the BEP, given by [4]

    Pex = 1 − ∫ max_{i=1,...,C} P(Hi | x) fx(x) dx,    (3)
where P(Hi | x) is the posterior probability of Hi given x and fx(x) is the unconditional density of x. We now consider LFE to overcome the estimation and computation problems usually encountered when n is large.
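To make the quantities above concrete, the decision rule (2) and a Monte Carlo estimate of the BEP (3) for known Gaussian class parameters can be sketched as follows. This is only an illustrative sketch assuming NumPy/SciPy; it is not part of the paper's experimental code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, priors, means, covs):
    """Decision rule (2): pick the class maximizing P(H_j) f_x(x | H_j)."""
    scores = [p * multivariate_normal.pdf(x, mean=m, cov=S)
              for p, m, S in zip(priors, means, covs)]
    return int(np.argmax(scores))

def estimate_bep(priors, means, covs, n_trials=50000, seed=None):
    """Monte Carlo estimate of the Bayesian error probability (3)."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_trials):
        true_class = rng.choice(len(priors), p=priors)
        x = rng.multivariate_normal(means[true_class], covs[true_class])
        if bayes_classify(x, priors, means, covs) != true_class:
            errors += 1
    return errors / n_trials
```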
Fig. 1. Schematic description of the considered LFE problem. The main objective is to find A that minimizes |Pey − Pex |1 .
The linearly transformed vector y is given by y = Ax, where the matrix A is of size k × n with k < n, so that y ∈ R^k. Similarly, we may define the BEP in the y-domain as

    Pey = 1 − ∫ max_{i=1,...,C} P(Hi | y) fy(y) dy,    (4)

where fy(y) is the unconditional density of y. The ultimate goal of LFE is to find the transformation matrix A for which Pey is as close as possible to Pex; i.e., find A that minimizes |Pey − Pex|1, where |.|1 denotes the l1-norm. The overall architecture for determining A is shown in Figure 1. When fx(x | Hi) is known, it is established that Pey ≥ Pex [7]. In [8, 9], it is proven that Pey = Pex if and only if the transformed vector y is a sufficient statistic for x with respect to the classification hypotheses. For Gaussian classification problems, this happens when A can be obtained through a full-rank decomposition of the following matrix [8]:

    M = [ δ2 ... δC   Σ1,2 − I ... Σ1,C − I ],    (5)

where δi = Σ1^{-1/2} (μi − μ1) and Σ1,i = Σ1^{-1/2} Σi Σ1^{-1/2} for i = 1, ..., C. That is, M = A^T G, where G is a full-rank matrix. Thus, the rank of M is the minimum value of k below which y cannot be a sufficient statistic for x with respect to the classification hypotheses. This dimensionality is called the minimal dimensionality for statistical sufficiency, kmin. If no assumption is made about Σi, the matrix M is in general of full rank and hence kmin = n. In fact, this is the case in most signal classification problems, and the desired reduced dimensionality is typically much less than kmin. In this case, y will not be a sufficient statistic for x and, as a result, Pey > Pex. Thus, minimizing |Pey − Pex|1 w.r.t. A is exactly equivalent to minimizing Pey w.r.t. the same matrix A, since Pex is not a function of A. In fact, very few of the existing LFE methods have considered minimizing Pey as a design criterion for A. The most relevant work is that of Hamsici and Martinez [6]. However, the exact version of the algorithm in [6] requires solving O(kC!) convex optimization problems, while an approximate version requires solving O(kC^2) convex optimizations. Since Pey is, in general, a multi-modal function of A, we propose starting with an initial transformation matrix A0 and updating it iteratively through optimization in order to obtain a smaller value of Pey. Since it is mathematically intractable to derive a closed-form expression for either Pey [4] or ∂Pey/∂A, the quadratic approximation of section 3 will be optimized with respect to A instead of Pey itself.
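For illustration, the matrix M in (5) and the minimal dimensionality kmin can be computed directly from the class parameters. The sketch below is not from the paper; it assumes NumPy/SciPy and uses scipy.linalg.sqrtm for the symmetric square root of Σ1, an implementation detail the paper does not specify.

```python
import numpy as np
from scipy.linalg import sqrtm

def build_M(means, covs):
    """Assemble M = [delta_2 ... delta_C, Sigma_{1,2} - I, ..., Sigma_{1,C} - I] as in (5)."""
    C, n = len(means), means[0].shape[0]
    # Sigma_1^{-1/2}; np.real guards against tiny imaginary parts returned by sqrtm.
    S1_inv_half = np.real(np.linalg.inv(sqrtm(covs[0])))
    blocks = []
    for i in range(1, C):
        delta_i = S1_inv_half @ (means[i] - means[0])       # whitened mean difference
        Sigma_1i = S1_inv_half @ covs[i] @ S1_inv_half      # whitened covariance
        blocks.append(delta_i.reshape(n, 1))
        blocks.append(Sigma_1i - np.eye(n))
    return np.hstack(blocks)

def minimal_dimensionality(means, covs):
    """k_min = rank(M): the smallest k at which y = Ax can be a sufficient statistic for x."""
    return np.linalg.matrix_rank(build_M(means, covs))
```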
3. QUADRATIC APPROXIMATION OF THE PROBABILITY OF MISCLASSIFICATION

The proposed quadratic approximation takes the form

    Qy(φ0, φ2) = φ0 + φ2 ∫ ∑_{i=1}^{C} P^2(Hi | y) fy(y) dy,    (6)

where the parameters φ0 and φ2 should be selected to minimize the absolute difference between Pey and Qy(φ0, φ2); i.e., the optimum values of φ0 and φ2 are given by

    (φ0*, φ2*) = arg min_{φ0, φ2} |Pey − Qy(φ0, φ2)|.    (7)

The motivation for the quadratic approximation is twofold. First, the derived approximation does not involve complex decision boundaries, as the BEP does, and hence it is simpler to calculate. Second, Qy can be easily differentiated with respect to the transformation matrix A, unlike Pey, whose decision boundaries are themselves functions of A. An analytical solution of (7) is not feasible. However, an approximate solution can be found by replacing the objective function in (7) with a tight upper bound which can be analytically optimized. Because of the limited size of the manuscript, only the final results are mentioned. The solution to the approximate optimization problem is φ0* = (7C + 1)/(8C) and φ2* = −1. The corresponding absolute difference is upper bounded by (C − 1)/(8C); i.e.,

    |Pey − Qy(φ0*, φ2*)| ≤ (C − 1)/(8C).    (8)

Substituting φ0* and φ2* in (6) and simplifying, the quadratic approximation is given by

    Qy(φ0*, φ2*) = (7C + 1)/(8C) − ∑_{i=1}^{C} P(Hi) ∫ P(Hi | y) fy(y | Hi) dy.    (9)

The function Qy(φ0*, φ2*) has no closed-form expression but can be approximated by any off-the-shelf Monte Carlo method such as rejection sampling or importance sampling. In this paper, we select ordinary Monte Carlo:

    Qy(φ0*, φ2*) ≈ (7C + 1)/(8C) − ∑_{i=1}^{C} (P(Hi)/Ni) ∑_{n=1}^{Ni} P(Hi | A xni),    (10)

where the patterns {xni}, n = 1, ..., Ni, are randomly generated according to the density fx(x | Hi) and Ni should be sufficiently large. Unlike the BEP, the derivative of Qy can be efficiently estimated using the Monte Carlo method via

    ∂Qy(φ0*, φ2*)/∂A ≈ − ∑_{i=1}^{C} (P(Hi)/Ni) ∑_{n=1}^{Ni} Fi(A, xni),    (11)

where

    Fi(A, x) = P(Hi | Ax) [ (∂/∂A) log fy(Ax | Hi) − ∑_{j=1}^{C} P(Hj | Ax) (∂/∂A) log fy(Ax | Hj) ].    (12)

(The proof of (11) is lengthy and hence omitted from the paper.)
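As a sanity check on (9)-(10), the ordinary Monte Carlo estimate of Qy(φ0*, φ2*) can be coded in a few lines. The sketch below is an illustration under the stated Gaussian assumptions, not the authors' implementation; the posterior P(Hi | Ax) is computed from the transformed densities fy(y | Hi) = N(Aμi, AΣiA^T).

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(Y, A, priors, means, covs):
    """P(H_i | y) for each row of Y, where y ~ N(A mu_i, A Sigma_i A^T) under H_i."""
    joint = np.column_stack([
        p * multivariate_normal.pdf(Y, mean=A @ m, cov=A @ S @ A.T)
        for p, m, S in zip(priors, means, covs)])
    return joint / joint.sum(axis=1, keepdims=True)

def q_approx(A, priors, means, covs, N_i=5000, seed=None):
    """Monte Carlo estimate (10) of Q_y(phi0*, phi2*)."""
    rng = np.random.default_rng(seed)
    C, total = len(priors), 0.0
    for i, (p, m, S) in enumerate(zip(priors, means, covs)):
        X = rng.multivariate_normal(m, S, size=N_i)      # x_{ni} ~ f_x(x | H_i)
        post = posteriors(X @ A.T, A, priors, means, covs)
        total += p * post[:, i].mean()                   # P(H_i) * (1/N_i) sum_n P(H_i | A x_{ni})
    return (7 * C + 1) / (8 * C) - total
```

By (8), the value returned by such an estimate should lie within (C − 1)/(8C) of the true Pey for the same A, which also gives a quick numerical cross-check of the bound.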
Algorithm 1 Estimation of the optimal LFE matrix for Gaussian classification problems.
Require: C, n, k, P(Hi), μi, Σi, i = 1, ..., C.
Output: Optimal transformation matrix A.
1: Calculate M in (5).
2: Find kmin ≡ rank(M).
3: if k ≥ kmin then
4:   Find A by performing a full-rank decomposition M = A^T G.
5: else
6:   Optimize Qy in (10) w.r.t. A. Initialize the optimization algorithm using any other popular LFE algorithm.
7: end if
It is easy to verify that, for the multivariate Gaussian distribution given in (1), the following expression holds:

    (∂/∂A) log fy(Ax | Hi) = 2 Σ̃i^{-1} A [ −Σi + (x − μi)(x − μi)^T (I − A^T Σ̃i^{-1} A Σi) ],    (13)
where Σ̃i = A Σi A^T. Thus, when k < kmin, we can obtain a suboptimal transformation matrix by optimizing the quadratic approximation in (10), with the gradient calculated using (11)-(13). In this work, we prefer to optimize Qy(φ0*, φ2*) using quasi-Newton methods because they offer fast convergence and reasonable memory requirements for moderate-size problems [10]. Another important issue is the proper initialization of the optimizer. As with almost all optimization algorithms, poor initialization may lead to an undesired local optimum of Pey. In order to alleviate this problem, we suggest initializing the proposed optimization algorithm using any other powerful LFE algorithm, such as the Fisher LDA [4] or an approximate decomposition of M in (5) such as the Tubbs method [8]. This way, it is guaranteed that the obtained LFE matrix is not inferior to those initializers. The overall optimization algorithm based on this strategy is depicted in Algorithm 1. In order to reduce the effect of initialization even further, different initializers can be employed and the one resulting in the lowest BEP selected.
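To make the procedure concrete, a minimal sketch of the optimization branch of Algorithm 1 is given below. It reuses the posteriors helper sketched earlier, uses SciPy's BFGS with numerical gradients instead of the analytic expressions (11)-(13) (a simplification relative to the paper), and takes the initializer A0 from any LFE method such as the Fisher LDA. Freezing the Monte Carlo samples makes the estimate (10) a deterministic function of A.

```python
import numpy as np
from scipy.optimize import minimize

def improve_lfe(A0, priors, means, covs, N_i=5000, max_iter=70, gtol=5e-4, seed=0):
    """Refine an initial k x n LFE matrix A0 by minimizing the Monte Carlo estimate (10) of Q_y."""
    k, n = A0.shape
    # Freeze the Monte Carlo samples (common random numbers) so the objective is smooth in A.
    rng = np.random.default_rng(seed)
    samples = [rng.multivariate_normal(m, S, size=N_i) for m, S in zip(means, covs)]
    C = len(priors)

    def objective(a_flat):
        A = a_flat.reshape(k, n)
        total = 0.0
        for i, (p, X) in enumerate(zip(priors, samples)):
            post = posteriors(X @ A.T, A, priors, means, covs)   # helper from the earlier sketch
            total += p * post[:, i].mean()
        return (7 * C + 1) / (8 * C) - total

    res = minimize(objective, A0.ravel(), method="BFGS",
                   options={"maxiter": max_iter, "gtol": gtol})
    return res.x.reshape(k, n)
```

In line with Algorithm 1, this branch is only required when k < kmin; otherwise A is obtained directly from a full-rank decomposition of M.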
4. SIMULATION RESULTS

We mainly have two sets of simulations. In subsection 4.1, the objective is to study the quality of the proposed approximation and to verify the theoretical bound in (8). In subsection 4.2, we demonstrate the efficacy of the proposed optimization algorithm in reducing Pey. All BEPs in all experiments are estimated using Monte Carlo simulation. The number of trials is selected such that N ≥ max(5 × 10^4, (1 − P̂e)/(0.05^2 P̂e)), where P̂e is the estimated BEP. This selection ensures that the ratio of the standard deviation of the estimate to its mean does not exceed 0.05.

4.1. Quality of the proposed quadratic approximation

In this simulation, we fixed n to 30 while C was varied from 2 to 40. For each value of C, 10 Gaussian classification problems were synthetically generated. The distributions of the generated P(Hi) and Σi are shown in Table 1 (heteroscedastic classification), and μi was generated according to N(0, 5I). For each classification problem, both Pex and its corresponding quadratic approximation Qx(φ0*, φ2*) were estimated using the Monte Carlo method. The actual values of |Pex − Qx(φ0*, φ2*)| as well as the theoretical upper bound in (8) are plotted in Figure 2.

[Fig. 2. Comparison between the actual value of |Pex − Qx(φ0*, φ2*)| and its upper bound in (8).]

We notice that the actual difference saturates at about 0.1 as C increases. Hence, even as C increases, Qx(φ0*, φ2*) should still provide a reasonable approximation to Pex. Therefore, it is expected that the performance of the proposed optimization procedure is not much affected by the increase of C. In addition, the difference between the theoretical upper bound of |Pex − Qx(φ0*, φ2*)| and its actual values does not exceed 0.025, since the theoretical upper bound does not exceed 0.125. This indicates the tightness of the derived theoretical upper bound.

4.2. Improvement achieved by the proposed framework

We considered three types of Gaussian classification problems: (1) heteroscedastic (general) Gaussian classification problems; (2) Gaussian classification problems with spherical covariance matrices, i.e., Σi = σi^2 I; and (3) homoscedastic Gaussian classification problems, Σi = Σ. For each type, 100 synthetic Gaussian classification problems were randomly generated. The distributions of the generated classification parameters are shown in Table 1. The mean vectors were generated differently for each type of Gaussian classification so as to obtain reasonably small BEPs for each type. We allowed C to take the values 10, 20, 30, and 40, while n was fixed to 30. We employed three LFE methods as initializers for our optimization procedure: the Fisher LDA, the Tubbs method [8], and the Chernoff-based LDA [5]. The quadratic approximation in (10), our objective function, was optimized using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The maximum number of iterations was 70, and the convergence termination criterion was set to 5 × 10^-4. For each generated classification problem, the following quantities were estimated: (1) PeFi1, PeTu1, PeCh1, the BEPs obtained after applying the Fisher LDA method, the Tubbs method, and the Chernoff-based method without optimization, respectively; (2) PeFi2, PeTu2, PeCh2, the BEPs obtained after optimizing the transformation matrices corresponding to PeFi1, PeTu1, PeCh1, respectively; and (3) kFi, kTu, kCh, the corresponding reduced dimensionalities.
Table 1. Distribution of the generated parameters.

    Parameter | Covariance type        | Generation density
    P(Hi)     | All                    | Dirichlet; Dir(10, ..., 10)
    μi        | Homoscedastic          | Multivariate normal; N(0, 30I)
    μi        | Spherical              | Multivariate normal; N(0, I)
    μi        | Heteroscedastic        | Multivariate normal; N(0, 50I)
    Σi        | Homoscedastic          | Wishart (50 degrees of freedom)
    Σi        | Spherical, Σi = σi^2 I | σi^2 ~ χ^2 (50n degrees of freedom)
    Σi        | Heteroscedastic        | Wishart (50 degrees of freedom)
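For reproducibility, heteroscedastic problems of the kind listed in Table 1 could be generated roughly as sketched below. This is an assumed reading of the table, not the authors' code; in particular, the Wishart scale matrix is chosen here so that E[Σi] = I, since the paper specifies only the degrees of freedom.

```python
import numpy as np
from scipy.stats import wishart

def generate_heteroscedastic_problem(C, n, seed=None):
    """Random priors, means, and covariances following the heteroscedastic rows of Table 1."""
    rng = np.random.default_rng(seed)
    priors = rng.dirichlet(10.0 * np.ones(C))                      # Dir(10, ..., 10)
    means = [rng.multivariate_normal(np.zeros(n), 50.0 * np.eye(n)) for _ in range(C)]
    # Wishart with 50 degrees of freedom; scale I/50 is an assumption so that E[Sigma_i] = I.
    covs = [wishart.rvs(df=50, scale=np.eye(n) / 50.0, random_state=rng) for _ in range(C)]
    return priors, means, covs
```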
Table 2. Homoscedastic Gaussian classification.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0050 | 0.0044 | 6.42 | 0.0051 | 0.0045 | 6.48 | 0.0050 | 0.0044 | 6.44
    20 | 0.0242 | 0.0198 | 7.00 | 0.0275 | 0.0219 | 7.00 | 0.0241 | 0.0198 | 7.00
    30 | 0.0565 | 0.0404 | 7.00 | 0.0626 | 0.0447 | 7.00 | 0.0561 | 0.0404 | 7.00
    40 | 0.0635 | 0.0535 | 7.00 | 0.0762 | 0.0565 | 7.00 | 0.0635 | 0.0535 | 7.00
Table 2 lists the average values of the quantities defined above for homoscedastic Gaussian classification over the 100 generated problems. The reduced dimensionality is determined as the one providing the minimum BEP (before optimization) within the range from 1 to 7. Generally, the three methods have very comparable classification performance before and after optimization. This is consistent with two facts for homoscedastic Gaussian classification problems: (a) the Chernoff method reduces to the Fisher LDA [5], and (b) when k ≥ kmin, the Fisher and Tubbs methods are equivalent since the left singular vectors of M in (5) are related to the Fisher eigenvectors by a nonsingular transformation. Therefore, it is very likely that these LFE methods already perform close to optimum when k < kmin, so no significant improvement in classification performance is observed. The experiments were repeated for the other two types of Gaussian classification problems; the results are given in Table 3 and Table 4. Here, the proposed optimization algorithm provides a significant reduction in the BEP compared to the homoscedastic case. This is expected since, with probability one, kmin = n. Hence, the y produced by the three LFE methods is not a sufficient statistic for x, and it is very likely that none of the three methods is optimal. In addition, the amount of reduction in the BEP increases as C increases, or equivalently, as the classification task becomes harder. This adds to the strength of our proposed optimization algorithm. It also means that as C increases, the three LFE methods become less optimal and there is more need for optimizing Pey w.r.t. A. Further, the BEPs of the three methods after optimization are very comparable to each other. That is, the final classification performance obtained after optimization is somewhat insensitive to the initializer. We believe this property is very desirable in many signal classification problems, since it is difficult to know beforehand which LFE method will yield the best classification performance. Finally, the selected dimensionality in Tables 2, 3, and 4 was either exactly 7 or very close to it. This is consistent with the intuition of the monotonically decreasing behaviour of Pey versus k when the class conditional densities are exactly known.
Table 3. Gaussian classification problems with spherical covariance matrices.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0116 | 0.0097 | 7.00 | 0.0109 | 0.0095 | 7.00 | 0.0115 | 0.0097 | 7.00
    20 | 0.0725 | 0.0544 | 7.00 | 0.0800 | 0.0562 | 7.00 | 0.0727 | 0.0544 | 7.00
    30 | 0.1341 | 0.1067 | 7.00 | 0.1507 | 0.1053 | 7.00 | 0.1346 | 0.1070 | 7.00
    40 | 0.1971 | 0.1697 | 7.00 | 0.2177 | 0.1689 | 7.00 | 0.1977 | 0.1692 | 7.00
Table 4. General Gaussian classification problems.

    C  | PeFi1  | PeFi2  | kFi  | PeTu1  | PeTu2  | kTu  | PeCh1  | PeCh2  | kCh
    10 | 0.0081 | 0.0056 | 6.80 | 0.0299 | 0.0121 | 7.00 | 0.0071 | 0.0056 | 7.00
    20 | 0.0562 | 0.0401 | 7.00 | 0.1527 | 0.0484 | 7.00 | 0.0575 | 0.0395 | 7.00
    30 | 0.1188 | 0.0917 | 7.00 | 0.2503 | 0.1004 | 7.00 | 0.1266 | 0.0961 | 7.00
    40 | 0.1696 | 0.1352 | 7.00 | 0.3191 | 0.1392 | 7.00 | 0.1772 | 0.1426 | 7.00
5. CONCLUSIONS

In this paper, we have proposed a novel framework for improving the performance of LFE algorithms, characterized by the BEP after dimensionality reduction. The improvement has been validated experimentally for three popular LFE algorithms. Based on the performed simulations, it was concluded that, unless the transformed feature vector is a sufficient statistic for the original pattern vector, no single LFE method is guaranteed to have the best classification performance for all Gaussian classification problems. Though the focus in this paper has been on classification problems with known conditional densities, we believe the same conclusion still applies when the class conditional densities are estimated from given training data. This case will be further investigated in future publications.
References

[1] Juwei Lu, K.N. Plataniotis, and A.N. Venetsanopoulos, "Face recognition using LDA-based algorithms," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 195–200, Jan. 2003.
[2] M. Sakai, N. Kitaoka, and S. Nakagawa, "Generalization of linear discriminant analysis used in segmental unit input HMM for speech recognition," in ICASSP 2007, April 2007, vol. 4, pp. IV-333–IV-336.
[3] M.A. El-Gamal, M.F. Abu El-Yazeed, and M.M.H. El Ayadi, "Dimensionality reduction for text-independent speaker identification using Gaussian mixture model," in IEEE International Symposium on Micro-NanoMechatronics and Human Science, Dec. 2003, vol. 2, pp. 625–628.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, November 2000.
[5] R.P.W. Duin and M. Loog, "Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion," IEEE Trans. PAMI, vol. 26, no. 6, pp. 732–739, June 2004.
[6] O.C. Hamsici and A.M. Martinez, "Bayes optimality in linear discriminant analysis," IEEE Trans. PAMI, vol. 30, no. 4, pp. 647–657, April 2008.
[7] Luc Devroye, Lazlo Gyorfi, and Gabor Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag New York, Inc., 1996.
[8] J. D. Tubbs, W. A. Coberly, and D. M. Young, "Linear dimension reduction and Bayes classification with unknown population parameters," Pattern Recognition, vol. 15, no. 3, pp. 167–172, 1982.
[9] P. B. Charles, R. Richard, and P. Henry, "Characterization of linear sufficient statistics," Sankhya: The Indian Journal of Statistics, vol. 40, no. 3, pp. 303–309, 1978.
[10] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall International, 1982.