A Novel Feature Fusion Method Based on Partial Least Squares Regression

Quan-Sen Sun 1,2, Zhong Jin 3,2, Pheng-Ann Heng 4, and De-Shen Xia 2

1 School of Science, Jinan University, Jinan 250022, China
[email protected]
2 Department of Computer Science, Nanjing University of Science & Technology, Nanjing 210094, China
[email protected]
3 Centre de Visió per Computador, Universitat Autònoma de Barcelona, Spain
[email protected]
4 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
[email protected]

Abstract. Partial least squares (PLS) regression is a new multivariate data analysis method. In this paper, based on the PLS model and the idea of feature fusion, a new non-iterative PLS algorithm and a novel feature fusion method are proposed. The proposed method comprises three steps: first, two sets of feature vectors are extracted from the same pattern and a PLS criterion function is established between them; then, two sets of PLS components are extracted by the PLS algorithm presented in this paper; finally, feature fusion is performed for classification using two strategies. Experimental results on the ORL face image database and the Concordia University CENPARMI database of handwritten Arabic numerals show that the proposed method is efficient. Moreover, the proposed non-iterative PLS algorithm is superior to the existing iterative PLS algorithms in computational cost and speed of feature extraction.
1 Introduction

Partial least squares (PLS) regression is a multivariate data analysis method developed from practical applications in the real world. PLS regression, originally developed by Wold, has had tremendous success in chemometrics and the chemical industries for static data analysis [1-4]. The robustness of the generated model also makes the partial least squares approach a powerful tool for regression analysis, dimension reduction and classification, and it has been applied to many other areas such as process monitoring, marketing analysis and image processing [5-7]. PLS regression provides a modeling method between two sets of data (PLS with a single dependent variable is referred to as PLS1, PLS with multiple dependent variables as PLS2); during regression modeling, both the reduction of the primitive data and the elimination of redundant information (i.e. noise) can be achieved. The PLS model is effective because it naturally integrates multivariate linear regression (MLR) [8], principal component analysis (PCA) [9] and canonical correlation analysis (CCA) [10], while it is
convenient for the analysis of multi-dimensional complex systems. The PLS method has received much attention and interest in recent years. Among PLS modeling methods, a classical algorithm, nonlinear iterative partial least squares (NIPALS), was given by Wold [11]. Based on it, several modified versions of the iterative PLS method have been given [12]. However, in practical applications iterative PLS modeling may suffer from overfitting or local minima. In this paper, based on the idea of the PLS model, we present a new PLS modeling method under an orthogonality constraint. It is a non-iterative PLS algorithm. Furthermore, a novel feature fusion method is proposed and applied to pattern classification.

The rest of this paper is organized as follows. In Section 2, the "classical" PLS algorithm is described, a new PLS modeling method is presented, and the performance of both is analyzed. In Section 3, a new PLS-based feature fusion method for pattern classification is proposed. In Section 4, we show our experimental results on the ORL face image database and the Concordia University CENPARMI database of handwritten Arabic numerals. Finally, conclusions are drawn in Section 5.
2 PLS Modeling Method

Let A and B be two feature sets on a pattern sample space $\Omega$; for any pattern sample $\xi \in \Omega \subset R^N$, the two corresponding feature vectors are $x \in A \subset R^p$ and $y \in B \subset R^q$. Given two data matrices $X \in R^{p \times n}$ on sample space A and $Y \in R^{q \times n}$ on sample space B, where n is the total number of samples, we further assume centered variables, i.e. the columns of $X^\top$ and $Y^\top$ are zero-mean. Let $S_{xy} = XY^\top$ and $S_{yx} = S_{xy}^\top$ (so that $\frac{1}{n-1} S_{xy}$ is the between-set covariance matrix of X and Y). Note that the zero-mean assumption does not impose any limitation on the methods discussed, since the mean values can easily be estimated.
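As a concrete illustration, the centered data matrices and $S_{xy}$ can be formed as in the following minimal NumPy sketch (the function name and variable layout are our own, not from the paper):

```python
import numpy as np

def center_and_cross(X, Y):
    """X: p x n, Y: q x n, columns are samples.
    Returns row-centered copies and S_xy = X Y^T, whose scaled version
    (1/(n-1)) S_xy is the between-set covariance matrix of the text."""
    Xc = X - X.mean(axis=1, keepdims=True)  # columns of X^T become zero-mean
    Yc = Y - Y.mean(axis=1, keepdims=True)
    S_xy = Xc @ Yc.T                        # p x q cross-product matrix
    return Xc, Yc, S_xy
```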
2.1 The Classical PLS Algorithm (C-PLS)
Basically, the PLS method is a multivariate linear regression algorithm that can handle correlated inputs and limited data. The algorithm reduces the dimension of the predictor variables (input matrix X) and response variables (output matrix Y) by projecting them onto the directions (input weight $\alpha$ and output weight $\beta$) that maximize the covariance between input and output variables. This projection decomposes variables of high collinearity into one-dimensional variables (input score vector t and output score vector u). The decomposition of X and Y by score vectors is formulated as follows:
$$X = \sum_{i=1}^{d} t_i p_i^\top + E = TP^\top + E, \qquad Y = \sum_{i=1}^{d} u_i q_i^\top + F = UQ^\top + F,$$

where $p_i$ and $q_i$ are loading vectors, and E and F are residuals.
More precisely, the PLS method finds a pair of directions (weight vectors) $\alpha_k$ and $\beta_k$ such that

$$\{\alpha_k; \beta_k\} = \arg\max_{\alpha^\top\alpha = \beta^\top\beta = 1} \mathrm{Cov}(X^\top\alpha, Y^\top\beta) = \arg\max_{\alpha^\top\alpha = \beta^\top\beta = 1} \alpha^\top S_{xy} \beta \qquad (1)$$
for $k = 1, 2, \ldots, d$. The nonlinear iterative partial least squares (NIPALS) algorithm [11] is an iterative process, which can be formulated as follows. Initialize $E_0 = X^\top$, $F_0 = Y^\top$ and $h = 1$, and randomly initialize the Y-score vector $u_1$. Iterate the following steps until convergence:

1. $\alpha_h^\top = u_h^\top E_{h-1} / (u_h^\top u_h)$
2. $\alpha_h = \alpha_h / \|\alpha_h\|$
3. $t_h = E_{h-1} \alpha_h$
4. $\beta_h^\top = t_h^\top F_{h-1} / (t_h^\top t_h)$
5. $\beta_h = \beta_h / \|\beta_h\|$
6. $u_h = F_{h-1} \beta_h$
After convergence, the loading vectors $p_h = E_{h-1}^\top t_h / (t_h^\top t_h)$ and $q_h = F_{h-1}^\top u_h / (u_h^\top u_h)$ can be computed by regressing $E_{h-1}$ on $t_h$ and $F_{h-1}$ on $u_h$. Although the NIPALS algorithm is efficient and robust and can be extended to some nonlinear PLS models [13], the random initialization of the score vector $u_1$ may lead to uncertain results. A faster and more stable iterative algorithm was proposed by Höskuldsson [2] and Helland [14]. As mentioned above, it can be shown that the weight vector $\alpha_h$ also corresponds to the first eigenvector of the following eigenvalue problem:

$$E_{h-1}^\top F_{h-1} F_{h-1}^\top E_{h-1} \alpha_h = \lambda_h^2 \alpha_h \qquad (2)$$

The X-scores $t$ are then given as

$$t_h = E_{h-1} \alpha_h \qquad (3)$$
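For reference, the following is a minimal NumPy sketch of the NIPALS iteration above (our own illustrative implementation; deflation uses the loadings $p_h$, $q_h$ exactly as described, and for simplicity the Y-score is initialized from a column of $F_{h-1}$ rather than randomly):

```python
import numpy as np

def nipals_pls(X, Y, d, max_iter=500, tol=1e-10):
    """NIPALS extraction of d pairs of PLS weights/scores (Section 2.1).
    X: p x n, Y: q x n with zero-mean rows; E0 = X^T, F0 = Y^T."""
    E, F = X.T.copy(), Y.T.copy()           # n x p, n x q
    Wx, Wy, Ts, Us = [], [], [], []
    for _ in range(d):
        u = F[:, [0]]                       # initialize the Y-score u_h
        for _ in range(max_iter):
            a = E.T @ u / (u.T @ u)         # step 1: alpha_h
            a = a / np.linalg.norm(a)       # step 2: normalize alpha_h
            t = E @ a                       # step 3: X-score t_h
            b = F.T @ t / (t.T @ t)         # step 4: beta_h
            b = b / np.linalg.norm(b)       # step 5: normalize beta_h
            u_new = F @ b                   # step 6: Y-score u_h
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        p = E.T @ t / (t.T @ t)             # loading p_h
        q = F.T @ u / (u.T @ u)             # loading q_h
        E = E - t @ p.T                     # deflate E_{h-1} -> E_h
        F = F - u @ q.T                     # deflate F_{h-1} -> F_h
        Wx.append(a); Wy.append(b); Ts.append(t); Us.append(u)
    return np.hstack(Wx), np.hstack(Wy), np.hstack(Ts), np.hstack(Us)
```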
Similarly, eigenvalue problems for the extraction of the $u_h$ and $\beta_h$ estimates can be derived. The above iterative algorithm yields a linear PLS modeling method; it is called the classical PLS algorithm (C-PLS) in this paper. For C-PLS it holds that $t_i^\top t_j = 0$ for $i \neq j$, $t_i^\top u_j = 0$ for $j > i$, and $\alpha_i^\top \alpha_j = \beta_i^\top \beta_j = 0$ for $i \neq j$.

2.2 A New Non-iterative PLS Modeling Method (NI-PLS)
In this section, we present a new PLS modeling method based on the idea of the PLS model. Under the criterion function (1), we state a theorem based on the following orthogonality constraint:

$$\alpha_k^\top \alpha_i = \beta_k^\top \beta_i = 0 \quad \text{for all } 1 \le i < k, \text{ where } k = 1, 2, \ldots, d. \qquad (4)$$
Theorem. Under criterion (1), the number of effective weight vector pairs satisfying constraint (4) is at most r, where r is the number of non-zero eigenvalues of the matrix $S_{xy} S_{yx}$. The $d (\le r)$ pairs of weight vectors are the eigenvectors corresponding to the first d maximum eigenvalues of the eigenequations (5) and (6):

$$S_{xy} S_{yx} \alpha = \lambda^2 \alpha \qquad (5)$$

$$S_{yx} S_{xy} \beta = \lambda^2 \beta \qquad (6)$$

and they satisfy

$$\alpha_i^\top \alpha_j = \beta_i^\top \beta_j = \delta_{ij}, \quad \alpha_i^\top S_{xy} \beta_j = \lambda_i \delta_{ij} \quad (i, j = 1, 2, \ldots, d) \qquad (7)$$

where $\lambda_i^2$ is the non-zero eigenvalue corresponding to the eigenvectors $\alpha_i$ and $\beta_i$, and $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$.
Proof. Transform Eq. (1) by the method of Lagrange multipliers:

$$L(\alpha, \beta) = \alpha^\top S_{xy} \beta - \frac{\lambda_1}{2}(\alpha^\top \alpha - 1) - \frac{\lambda_2}{2}(\beta^\top \beta - 1) \qquad (8)$$

where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers. Setting the partial derivatives to zero gives

$$\partial L(\alpha, \beta) / \partial \alpha = S_{xy} \beta - \lambda_1 \alpha = 0 \qquad (9)$$

$$\partial L(\alpha, \beta) / \partial \beta = S_{yx} \alpha - \lambda_2 \beta = 0 \qquad (10)$$

Multiplying both sides of Eqs. (9) and (10) by $\alpha^\top$ and $\beta^\top$ respectively, and considering the constraint in Eq. (1), we obtain $\alpha^\top S_{xy} \beta = \lambda_1 \alpha^\top \alpha = \lambda_1$ and $\beta^\top S_{yx} \alpha = \lambda_2 \beta^\top \beta = \lambda_2$. Since $S_{yx} = S_{xy}^\top$, we have $\lambda_1 = \lambda_1^\top = (\alpha^\top S_{xy} \beta)^\top = \beta^\top S_{yx} \alpha = \lambda_2$. It follows that maximizing $\lambda$ is the same as maximizing the criterion function (1) under constraint (4). Let $\lambda_1 = \lambda_2 = \lambda$; then eigenequations (5) and (6) can be inferred from Eqs. (9) and (10). Since both $S_{xy} S_{yx}$ and $S_{yx} S_{xy}$ are symmetric matrices and $\mathrm{rank}(S_{xy} S_{yx}) = \mathrm{rank}(S_{yx} S_{xy}) \le \mathrm{rank}(S_{xy})$, the two eigenequations (5) and (6) have the same non-zero eigenvalues. Let $\lambda_1^2 \ge \lambda_2^2 \ge \cdots \ge \lambda_r^2 > 0$ $(r \le \mathrm{rank}(S_{xy}))$; then the corresponding r pairs of eigenvectors are orthonormal,
namely $\alpha_i^\top \alpha_j = \beta_i^\top \beta_j = \delta_{ij}$. Since $\alpha_i^\top S_{xy} \beta_j = \alpha_i^\top S_{xy} (\lambda_j^{-1} S_{yx} \alpha_j) = \lambda_j^{-1} \alpha_i^\top (\lambda_j^2 \alpha_j) = \lambda_j \delta_{ij}$, conclusion (7) holds.
The above theorem yields a concise algorithm for solving the orthogonal weight vectors. Suppose the d pairs of weight vectors are $\{\alpha_i; \beta_i\}_{i=1}^{d}$; they form the two projection matrices $W_x = (\alpha_1, \alpha_2, \ldots, \alpha_d)$ and $W_y = (\beta_1, \beta_2, \ldots, \beta_d)$. By the two linear transformations $z_1 = W_x^\top x$ and $z_2 = W_y^\top y$, we obtain two sets of PLS components from the same pattern, thereby achieving dimensionality reduction. From Eqs. (9) and (10), we only need to solve for one weight vector of each pair; the other can then be obtained from Eq. (9) or Eq. (10). In general, we can choose the lower-order of the matrices $S_{xy} S_{yx}$ and $S_{yx} S_{xy}$ for the eigendecomposition, so that the computational complexity is lowered.
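A minimal sketch of the resulting non-iterative algorithm follows (our own illustration in NumPy; it solves eigenequation (5) for the $\alpha_i$ and recovers each $\beta_i$ from Eq. (10), assuming here that $p \le q$ so that $S_{xy} S_{yx}$ is the lower-order matrix):

```python
import numpy as np

def ni_pls(S_xy, d):
    """Non-iterative PLS (Section 2.2): eigendecompose S_xy S_yx,
    take the eigenvectors of the d largest eigenvalues as W_x,
    and recover W_y column-wise via beta = S_yx alpha / lambda."""
    M = S_xy @ S_xy.T                       # S_xy S_yx, symmetric p x p
    lam2, V = np.linalg.eigh(M)             # eigenvalues in ascending order
    idx = np.argsort(lam2)[::-1][:d]        # indices of the d largest lambda^2
    lam = np.sqrt(np.maximum(lam2[idx], 0.0))
    Wx = V[:, idx]                          # orthonormal alpha_1..alpha_d
    Wy = (S_xy.T @ Wx) / lam                # Eq. (10); columns are unit norm
    return Wx, Wy, lam
```

Projecting with $z_1 = W_x^\top x$ and $z_2 = W_y^\top y$ then gives the two sets of PLS components.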
2.3 Algorithm Analysis

As mentioned above, both algorithms are based on orthogonality constraints on the weight vectors. NI-PLS cannot ensure that the extracted PLS components are uncorrelated, whereas the PLS components extracted by the classical PLS of Section 2.1 are uncorrelated. Theoretically, from the viewpoint of pattern classification, the latter should classify better; in terms of algorithmic complexity, the former is better. Each has its own advantage over the other, so the choice between them depends on the nature of the problem at hand. Moreover, from the process of solving the weight vectors above, we can see that the PLS modeling method is valid for dimensionality reduction with either large or small sample sizes, since the model is unaffected by whether the total scatter matrix of the training samples is singular or not. Due to this property, the PLS model performs well in general, and it overcomes the modeling difficulties that arise when high-dimensional small-sample-size problems are processed by methods such as Fisher discriminant analysis [15], CCA and MLR.
3 Feature Fusion Strategy and Design of Classifier

3.1 Classification Based on the Correlation Feature Matrix

Through the above PLS algorithms, we acquire two sets of PLS components on sample space A and sample space B, respectively. Each pair of PLS components constitutes the following matrix:

$$M = [z_1, z_2] = [W_x^\top x, W_y^\top y] \in R^{d \times 2} \qquad (11)$$
where the matrix M is called the correlation feature matrix of the feature vectors x and y (or of the pattern sample $\xi$). The distance between any two correlation feature matrices $M_i = [z_1^{(i)}, z_2^{(i)}]$ and $M_j = [z_1^{(j)}, z_2^{(j)}]$ is defined as

$$d(M_i, M_j) = \sum_{k=1}^{2} \| z_k^{(i)} - z_k^{(j)} \|^2 \qquad (12)$$

where $\|\cdot\|$ denotes the Euclidean distance. Let $\omega_1, \omega_2, \ldots, \omega_c$ be the c known pattern classes, and assume that $\xi_1, \xi_2, \ldots, \xi_n$ are all the training samples with corresponding correlation feature matrices $M_1, M_2, \ldots, M_n$. For any test sample $\xi$ with correlation feature matrix $M = [z_1, z_2]$, if $d(M, M_l) = \min_j d(M, M_j)$ and $\xi_l \in \omega_k$, then $\xi \in \omega_k$.
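As an illustration of this fusion strategy, the decision rule of Eq. (12) reduces to a nearest-neighbour search under the squared Frobenius distance between the d×2 matrices; a sketch with hypothetical names `train_Ms` and `train_labels`:

```python
import numpy as np

def classify(M, train_Ms, train_labels):
    """Assign the test correlation feature matrix M = [z1, z2] to the
    class of the nearest training matrix under Eq. (12), which equals
    the squared Frobenius norm of the difference."""
    dists = [np.sum((M - Mi) ** 2) for Mi in train_Ms]
    return train_labels[int(np.argmin(dists))]
```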
3.2 The Quadratic Bayesian Classifier

By the following linear transformation (13), we extract a new d-dimensional combined feature from each pair of PLS components and use it for classification:

$$z = \begin{pmatrix} W_x \\ W_y \end{pmatrix}^{\top} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (13)$$

In the d-dimensional combined feature space, we use the quadratic Bayesian classifier for classification. The quadratic Bayesian discriminant function is defined as

$$g_l(x) = \frac{1}{2} \ln |\Sigma_l| + \frac{1}{2} (x - \mu_l)^\top \Sigma_l^{-1} (x - \mu_l) \qquad (14)$$

where $\mu_l$ and $\Sigma_l$ denote the mean vector and the covariance matrix of class l, respectively. The classification decision based on this discriminant function is $x \in \omega_k$ if sample x satisfies $g_k(x) = \min_l g_l(x)$.
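A sketch of this classifier on the fused features of Eq. (13) follows (our own illustration; it assumes each class has enough training vectors z for a non-singular covariance estimate):

```python
import numpy as np

def fit_quadratic_bayes(Z, labels):
    """Estimate the mean vector and covariance matrix of each class
    from the fused d-dimensional features (rows of Z)."""
    return {c: (Z[labels == c].mean(axis=0),
                np.cov(Z[labels == c], rowvar=False))
            for c in np.unique(labels)}

def predict(z, params):
    """Assign z to omega_k with g_k(z) = min_l g_l(z), g_l as in Eq. (14)."""
    def g(mu, Sigma):
        diff = z - mu
        return 0.5 * np.linalg.slogdet(Sigma)[1] \
             + 0.5 * diff @ np.linalg.solve(Sigma, diff)
    return min(params, key=lambda c: g(*params[c]))
```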
4 Experiments and Analysis

4.1 Experiment on the ORL Face Image Database

The first experiment is performed on the ORL face image database (http://www.camorl.co.uk). There are 10 different images of each of 40 individuals. For some subjects, the images were taken at different times, and the facial expressions (open/closed eyes, smiling/non-smiling) and facial details (glasses/no glasses) vary. The images were taken against a dark homogeneous background with the subjects in an upright, frontal position, with tolerance for some tilting and rotation of up to 20°. Moreover, there is some variation in scale of up to about 10%. All images are grayscale and normalized to a resolution of 112×92. Some images from ORL are shown in Fig. 1.
Fig. 1. Ten images of one person in ORL face database
In this experiment, we use five randomly selected images of each person for training and the remaining five for testing. Thus, the totals of training samples and test samples are both 200. We use the original face images to form the first training sample space $A = \{x \mid x \in R^{10304}\}$. Performing the cubic wavelet transformation on each original image, low-frequency sub-images with 28×23 resolution are obtained, and the second training sample space $B = \{y \mid y \in R^{168}\}$ is constructed. Then, combining these two groups of features and using the two PLS algorithms of Section 2, we obtain d pairs of weight vectors for each algorithm. By the linear transformations $z_1 = W_x^\top x$ and $z_2 = W_y^\top y$, the two sets of feature vectors are reduced to d-dimensional discriminant feature vectors (d varies from 1 to 168). Finally, we classify according to the feature fusion strategy and classifier of Section 3.1; the classification results and the running time(s) over ten experiments are shown in Table 1.

Table 1 shows that both PLS methods reach good classification results, with average recognition rates above 95%. However, in feature extraction and classification, the computational time of C-PLS is about 2.7 times that of NI-PLS. This matches the complexity analysis of Section 2.3. Compared with C-PLS, the NI-PLS algorithm does not reduce the recognition rate, while the speed of feature extraction is raised considerably.

Table 1. Comparison of the recognition accuracy (%) of the two PLS methods over ten experiments

Methods   1     2     3     4     5     6     7     8     9     10    Average  Time(s)
NI-PLS    95.5  95.5  96.5  92.5  95.0  94.0  96.5  94.5  96.0  94.5  95.0     189
C-PLS     95.0  95.5  92.0  97.0  96.0  94.0  96.5  95.0  94.5  95.5  95.1     516

Note: the recognition rates in this table are obtained when the number of PLS components is set to 50. The time is that of one complete experiment with the number of PLS components varying from 1 to 168, without taking the wavelet transformation into account.
4.2 Experiment on the CENPARMI Handwritten Numerals Database

The goal of this experiment is to test the validity of the proposed algorithm on a large sample set. The Concordia University CENPARMI database of handwritten Arabic numerals, popular worldwide, is adopted. In this database there are 10 classes, i.e. the 10 digits (0 to 9), with 600 samples each. Some original sample images are shown in Fig. 2. Hu et al. [16] did some preprocessing work and extracted the following four kinds of features:

− XG: 256-dimensional Gabor transformation feature,
− XL: 121-dimensional Legendre moment feature,
− XP: 36-dimensional Pseudo-Zernike moment feature,
− XZ: 30-dimensional Zernike moment feature.
Fig. 2. Some images of digits in CENPARMI handwritten numeral database
In this experiment, we use the first 300 images of each class for training and the remaining 300 for testing. Thus, the totals of training samples and test samples are both 3000. We combine any two of the above four features in the original feature space, use the two algorithms described in Section 2 together with the feature fusion strategy and classifier of Section 3.2, and obtain the classification results shown in Table 2. To compare the algorithms' performance further, Table 3 lists, as the combined feature dimension varies from 1 to 40, the classification error rates and the classification times of the two algorithms when the Gabor feature and the Legendre feature are combined.

Table 2. Comparison of classification error rates of the two methods under different feature combinations

Methods   XG-XL       XG-XP       XG-XZ       XL-XP       XL-XZ       XP-XZ
NI-PLS    0.0403(31)  0.0863(33)  0.0850(30)  0.0857(33)  0.0863(30)  0.1910(27)
C-PLS     0.0407(30)  0.0870(34)  0.0847(30)  0.0857(35)  0.0857(30)  0.1853(27)

Note: the value in parentheses denotes the dimension of the combined feature (i.e. the number of PLS components) at which the optimal result is achieved.
Table 3. Comparison of classification error rates and time(s) of the two methods on the Gabor feature and the Legendre feature

Dimension   1      5      10     15     20     25     30     35     40     Time(s)
NI-PLS      0.755  0.204  0.110  0.060  0.045  0.043  0.040  0.043  0.044  256
C-PLS       0.757  0.264  0.113  0.063  0.045  0.041  0.040  0.043  0.043  587
From Tables 1 and 2 we can see that the two algorithms achieve similar classification results for the different combinations of the four feature sets. Among them, the classification error rate obtained by combining the Gabor feature and the Legendre feature is lower than the others, and the optimal recognition rate reaches 96%. Moreover, from Table 3 we can also see that the classification error rate drops quickly with the PLS modeling method proposed in this paper: as the feature vector dimension varies from 20 to 40, the recognition rates of both algorithms are stable above 95%, while the advantage of the NI-PLS algorithm in feature extraction time is obvious.
5 Conclusion

In this paper, we bring together the PLS model and the idea of feature fusion, and create a new framework for image recognition. Experimental results on two image databases show that the proposed framework is efficient. Because PLS works whether or not the total covariance matrix of the training samples is singular, the proposed method is particularly suitable for small-sample-size classification problems in high-dimensional spaces. From the comparison of the two PLS modeling methods discussed in this paper, the proposed non-iterative PLS (NI-PLS) is superior to the classical PLS algorithm (C-PLS) in algorithmic complexity and speed of feature extraction.
Acknowledgements. This work was supported by the National Science Foundation of China under Grant No. 60473039 and by the Research Grants Council of the Hong Kong Special Administrative Region under an Earmarked Research Grant (project no. CUHK 4223/04E).
References

1. Wold, S., Martens, H., Wold, H.: The Multivariate Calibration Problem in Chemistry Solved by the PLS Method. In: Proceedings of the Conference on Matrix Pencils, Lecture Notes in Mathematics, Springer, Heidelberg (1983) 286–293
2. Höskuldsson, A.: PLS Regression Methods. Journal of Chemometrics 2 (1988) 211–228
3. Yacoub, F., MacGregor, J.F.: Product Optimization and Control in the Latent Variable Space of Nonlinear PLS Models. Chemometrics and Intelligent Laboratory Systems 70 (2004) 63–74
4. Barker, M., Rayens, W.S.: Partial Least Squares for Discrimination. Journal of Chemometrics 17 (2003) 166–173
5. Kesavan, P., Lee, J.H., Saucedo, V., Krishnagopalan, G.A.: Partial Least Squares (PLS) Based Monitoring and Control of Batch Digesters. Journal of Process Control 10 (2000) 229–236
6. Chin, W.W.: The Partial Least Squares Approach for Structural Equation Modeling. In: Marcoulides, G.A. (Ed.), Modern Methods for Business Research, Lawrence Erlbaum Associates, London (1998) 295–336
7. Nguyen, D.V., Rocke, D.M.: Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data. Bioinformatics 18 (2002) 39–50
8. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (1991) 586–591
9. Izenman, A.J.: Reduced-Rank Regression for the Multivariate Linear Model. Journal of Multivariate Analysis 5 (1975) 248–264
10. Hotelling, H.: Relations between Two Sets of Variates. Biometrika 28 (1936) 321–377
11. Wold, H.: Non-Linear Iterative Partial Least Squares (NIPALS) Modeling: Some Current Developments. In: Krishnaiah, P.R. (Ed.), Multivariate Analysis, Academic Press, New York (1973) 383–407
12. Wold, S., Trygg, J., Berglund, A., Antti, H.: Some Recent Developments in PLS Modeling. Chemometrics and Intelligent Laboratory Systems 58 (2001) 131–150
13. Tang, K.L., Li, T.H.: Comparison of Different Partial Least-Squares Methods in Quantitative Structure–Activity Relationships. Analytica Chimica Acta 476 (2003) 85–92
14. Helland, I.S.: On the Structure of Partial Least Squares. Comm. Statist. Simulation Comput. 17 (1988) 581–607
15. Belhumeur, P.N., Hespanha, J., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (1997) 711–720
16. Hu, Z.S., Lou, Z., Yang, J.Y., Liu, K., Suen, C.Y.: Handwritten Digit Recognition Based on Multi-Classifier Combination. Chinese J. Computers 22 (1999) 369–374