Regularization Versus Dimension Reduction, Which Is Better?

Yunfei Jiang(1) and Ping Guo(1,2)

(1) Laboratory of Image Processing and Pattern Recognition, Beijing Normal University, Beijing 100875, China
(2) School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
yunfeifei [email protected], [email protected]

Abstract. There exist two main solutions for the classification of high-dimensional data with small sample numbers. One is to classify the data directly in the high-dimensional space with regularization methods; the other is to reduce the data dimension first and then classify them in the feature space. But which is actually better? In this paper, comparative studies of regularization and dimension reduction approaches are carried out on two typical sets of high-dimensional real-world data: Raman spectroscopy signals and stellar spectra. Experimental results show that in most cases the dimension reduction methods obtain acceptable classification results at a lower computational cost. When the number of training samples is insufficient and the class distribution is seriously unbalanced, some regularization approaches perform better than the dimension reduction ones, but they cost more computation time.

1 Introduction

In the real world there are data, such as Raman spectroscopy and stellar spectra, for which the number of variables (wavelengths) is much higher than the number of samples. When classification (recognition) tasks are applied to such data, ill-posed problems arise. There are mainly two solutions for such ill-posed problems. One is to classify the data directly in the high-dimensional space with regularization methods [1]; the other is to classify them in a feature space after dimension reduction. Many approaches have been proposed to solve the ill-posed problem [1,2,3,4,5,6,7,8]. Among these methods, Regularized Discriminant Analysis (RDA), the Leave-One-Out Covariance matrix estimate (LOOC) and the Kullback-Leibler Information Measure based classifier (KLIM) are regularization methods. RDA [2] is based on Linear Discriminant Analysis (LDA) and adds the identity matrix as a regularization term to solve the problem in covariance matrix estimation, while LOOC [3] brings in the diagonal matrix to solve the singular matrix problem.


The KLIM estimator is derived by Guo and Lyu [4] from the Kullback-Leibler information measure. Regularized Linear Discriminant Analysis (R-LDA), Kernel Direct Discriminant Analysis (KDDA) and Principal Component Analysis (PCA) are dimension reduction methods. R-LDA was proposed by Lu et al. [6]; it introduces a regularized Fisher's discriminant criterion and, by optimizing this criterion, addresses the small sample size problem. KDDA [7] can be seen as an enhanced kernel Direct Linear Discriminant Analysis (kernel D-LDA) method. PCA [8] is a linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
In this paper, comparative studies of regularization and dimension reduction approaches are carried out on two typical sets of high-dimensional real-world data, Raman spectroscopy signals and stellar spectra. The correct classification rate (CCR) and the time cost are used to evaluate the performance of each method.
The rest of this paper is organized as follows. Section 2 gives a review of discriminant analysis. Section 3 introduces the regularization approaches. Section 4 discusses the dimension reduction approaches. Experiments are described in Section 5, followed by a discussion in Section 6. Finally, conclusions are given in the last section.

2 Discriminant Analysis

Discriminant analysis assigns an observation $x \in R^N$ with unknown class membership to one of $k$ classes $C_1, \ldots, C_k$ known a priori. There is a learning data set $A = \{(x_1, c_1), \ldots, (x_n, c_n) \mid x_j \in R^N, c_j \in \{1, \ldots, k\}\}$, where the vector $x_j$ contains $N$ explanatory variables and $c_j$ indicates the index of the class of $x_j$. This data set allows one to construct a decision rule which associates a new vector $x \in R^N$ with one of the $k$ classes. The Bayes decision rule assigns the observation $x$ to the class $C_{j^*}$ with the maximum a posteriori probability, which is equivalent, in view of the Bayes rule, to minimizing a cost function $d_j(x)$:

j^* = \arg\min_j d_j(x), \qquad j = 1, 2, \ldots, k,    (1)

d_j(x) = -2 \log(\pi_j f_j(x)),    (2)

where $\pi_j$ is the prior probability of class $C_j$ and $f_j(x)$ denotes the class conditional density of $x$, $\forall j = 1, \ldots, k$. Some classical discriminant analysis methods can be obtained by combining additional assumptions with the Bayes decision rule. For instance, Quadratic Discriminant Analysis (QDA) [1,5] assumes that the class conditional density $f_j$ for the class $C_j$ is Gaussian $N(\hat{m}_j, \hat{\Sigma}_j)$, which leads to the discriminant function

d_j(x) = (x - \hat{m}_j)^T \hat{\Sigma}_j^{-1} (x - \hat{m}_j) + \ln|\hat{\Sigma}_j| - 2\ln\hat{\alpha}_j,    (3)


where $\hat{\alpha}_j$ is the prior probability, $\hat{m}_j$ is the mean vector, and $\hat{\Sigma}_j$ is the covariance matrix of the $j$-th class. If the prior probability $\hat{\alpha}_j$ is the same for all classes, the term $2\ln\hat{\alpha}_j$ can be omitted and the discriminant function reduces to a simpler form. The parameters in the above equations can be estimated with the traditional maximum likelihood estimator:

\hat{\alpha}_j = \frac{n_j}{N},    (4)

\hat{m}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i,    (5)

\hat{\Sigma}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} (x_i - \hat{m}_j)(x_i - \hat{m}_j)^T.    (6)

In practice, this method is penalized in high-dimensional spaces since it requires estimating many parameters. In the small sample number case this leads to the ill-posed problem: the parameter estimates can be highly unstable, giving rise to high variance in classification accuracy. By employing a method of regularization, one attempts to improve the estimates by biasing them away from their sample-based values towards values that are deemed to be more "physically plausible". For this reason, particular rules of QDA exist in order to regularize the estimation of $\hat{\Sigma}_j$. We can also assume that the covariance matrices are equal, i.e. $\hat{\Sigma}_j = \hat{\Sigma}$, which yields the framework of LDA [9]. This method makes linear separations between the classes.
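To make the decision rule concrete, the following is a minimal NumPy sketch of the maximum likelihood estimates (4)-(6) and the quadratic discriminant function (3); the function names (qda_fit, qda_predict) are our own and not from the paper.

```python
import numpy as np

def qda_fit(X, y):
    """Estimate priors, means and covariances per class, Eqs. (4)-(6)."""
    classes = np.unique(y)
    N = len(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = (len(Xc) / N,                             # prior, Eq. (4)
                     Xc.mean(axis=0),                         # mean, Eq. (5)
                     np.cov(Xc, rowvar=False, bias=True))     # covariance, Eq. (6)
    return params

def qda_predict(params, x):
    """Assign x to the class minimizing the discriminant d_j(x), Eq. (3)."""
    best_c, best_d = None, np.inf
    for c, (alpha, m, S) in params.items():
        diff = x - m
        # slogdet / solve assume S is non-singular; this is exactly what
        # fails in the ill-posed, small-sample, high-dimensional case.
        sign, logdet = np.linalg.slogdet(S)
        d = diff @ np.linalg.solve(S, diff) + logdet - 2.0 * np.log(alpha)
        if d < best_d:
            best_c, best_d = c, d
    return best_c
```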

3 Regularization Approaches

Regularization techniques have been highly successful in the solution of ill-posed and poorly-posed inverse problems. RDA, LOOC and KLIM are such techniques; the crucial difference between these methods lies in their covariance matrix estimation formulas. We give a brief review of these methods below.

3.1 RDA

RDA is a regularization method proposed by Friedman [2]. It is designed for the small sample number case, where the covariance matrix in Eq. (3) takes the following form:

\hat{\Sigma}_j(\lambda, \gamma) = (1 - \gamma)\hat{\Sigma}_j(\lambda) + \frac{\gamma}{d}\,\mathrm{Trace}[\hat{\Sigma}_j(\lambda)]\, I_d,    (7)

with

\hat{\Sigma}_j(\lambda) = \frac{(1 - \lambda)\, n_j \hat{\Sigma}_j + \lambda N \hat{\Sigma}}{(1 - \lambda)\, n_j + \lambda N}.    (8)

The two parameters $\lambda$ and $\gamma$, which are restricted to the range 0 to 1, are regularization parameters selected by maximizing the leave-one-out correct classification rate (CCR). $\lambda$ controls the amount by which the $\hat{\Sigma}_j$ are shrunk towards $\hat{\Sigma}$, while $\gamma$ controls the shrinkage of the eigenvalues towards equality, since $\mathrm{Trace}[\hat{\Sigma}_j(\lambda)]/d$ is equal to the average of the eigenvalues of $\hat{\Sigma}_j(\lambda)$.
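A small sketch of the RDA covariance estimate of Eqs. (7)-(8), assuming the class covariances, the pooled covariance and the sample counts have already been computed as in Section 2; the function and variable names are ours.

```python
import numpy as np

def rda_covariance(Sigma_j, n_j, Sigma_pooled, N, lam, gamma):
    """Regularized covariance of class j, Eqs. (7)-(8), with 0 <= lam, gamma <= 1."""
    d = Sigma_j.shape[0]
    # Eq. (8): shrink the class covariance towards the pooled covariance.
    Sigma_lam = ((1 - lam) * n_j * Sigma_j + lam * N * Sigma_pooled) / \
                ((1 - lam) * n_j + lam * N)
    # Eq. (7): shrink the eigenvalues towards their average.
    return (1 - gamma) * Sigma_lam + gamma * np.trace(Sigma_lam) / d * np.eye(d)
```

As stated above, the pair (λ, γ) would then be selected by maximizing the leave-one-out CCR, e.g. over a grid on [0, 1] × [0, 1].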

3.2 LOOC

There exists another covariance matrix estimation formula, proposed by Hoffbeck and Landgrebe [3]. They examine the diagonal sample covariance matrix, the diagonal common covariance matrix, and some pair-wise mixtures of those matrices. The proposed estimator has the following form:

\hat{\Sigma}_j(\xi_j) = \xi_{j1}\,\mathrm{diag}(\hat{\Sigma}_j) + \xi_{j2}\hat{\Sigma}_j + \xi_{j3}\hat{\Sigma} + \xi_{j4}\,\mathrm{diag}(\hat{\Sigma}).    (9)

The elements of the mixing parameter $\xi_j = [\xi_{j1}, \xi_{j2}, \xi_{j3}, \xi_{j4}]^T$ are required to sum to unity: $\sum_{l=1}^{4} \xi_{jl} = 1$. In order to reduce the computation cost, they consider only three cases: $(\xi_{j3}, \xi_{j4}) = 0$, $(\xi_{j1}, \xi_{j4}) = 0$, and $(\xi_{j1}, \xi_{j2}) = 0$. The covariance matrix estimator is called LOOC because the mixing parameter $\xi_j$ is optimized by the leave-one-out cross validation method.
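A sketch of the LOOC mixture in Eq. (9); the leave-one-out search over the mixing parameter is omitted and only the estimator itself is shown, with names of our own choosing.

```python
import numpy as np

def looc_covariance(Sigma_j, Sigma_pooled, xi):
    """LOOC estimate, Eq. (9): xi = (xi_1, xi_2, xi_3, xi_4) with sum(xi) == 1."""
    assert abs(sum(xi) - 1.0) < 1e-9
    return (xi[0] * np.diag(np.diag(Sigma_j)) +        # diagonal sample covariance
            xi[1] * Sigma_j +                          # sample covariance
            xi[2] * Sigma_pooled +                      # common (pooled) covariance
            xi[3] * np.diag(np.diag(Sigma_pooled)))     # diagonal common covariance
```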

3.3 KLIM

The matrix estimation formula of KLIM is as follows:

\hat{\Sigma}_j(h) = h I_d + \hat{\Sigma}_j,    (10)

where $h$ is a regularization parameter and $I_d$ is a $d \times d$ identity matrix. This form of estimator solves the matrix singularity problem in the high-dimensional setting. In fact, as long as $h$ is not too small, $\hat{\Sigma}_j^{-1}(h)$ exists with finite values and the estimated classification rate is stable.
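The KLIM estimate of Eq. (10) is a one-line shrinkage; the short sketch below (names are ours) also illustrates that a rank-deficient sample covariance becomes invertible for any h > 0.

```python
import numpy as np

def klim_covariance(Sigma_j, h):
    """KLIM estimate, Eq. (10): add h to every eigenvalue of Sigma_j."""
    return h * np.eye(Sigma_j.shape[0]) + Sigma_j

# 5 samples in a 100-dimensional space give a singular sample covariance,
# but the KLIM estimate has full rank:
X = np.random.randn(5, 100)
S = np.cov(X, rowvar=False, bias=True)              # 100 x 100, rank <= 4
S_klim = klim_covariance(S, h=0.1)
print(np.linalg.matrix_rank(S), np.linalg.matrix_rank(S_klim))   # e.g. 4 and 100
```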

4 Dimension Reduction Approaches

Dimension reduction is another way to solve the ill-posed problem arising in the high-dimensional, small sample number setting. R-LDA, KDDA and PCA are three common dimension reduction methods. R-LDA and KDDA can be considered variations of D-LDA. R-LDA introduces a regularized Fisher's discriminant criterion; the regularization helps to decrease the importance of highly unstable eigenvectors, thereby reducing the overall variance. KDDA introduces a nonlinear mapping from the input space to an implicit high-dimensional feature space, where the nonlinear and complex distribution of patterns in the input space is "linearized" and "simplified" so that conventional LDA can be applied. PCA finds a p-dimensional subspace whose basis vectors correspond to the maximum variance directions in the original data space. We give a brief review of R-LDA and KDDA below; since PCA is a well-known method, we do not describe it further in this paper.

4.1 R-LDA

The purpose of R-LDA [6] is to reduce the high variance related to the eigenvalue estimates of the within-class scatter matrix at the expense of potentially increased bias. The regularized Fisher criterion can be expressed as follows:

\Psi = \arg\max_{\Psi} \frac{|\Psi^T S_B \Psi|}{|\eta(\Psi^T S_B \Psi) + (\Psi^T S_W \Psi)|},    (11)


where $S_B$ is the between-class scatter matrix, $S_W$ is the within-class scatter matrix, and $0 \le \eta \le 1$ is a regularization parameter. First determine the set $U_m = [u_1, \cdots, u_m]$ of eigenvectors of $S_B$ associated with the $m \le c - 1$ non-zero eigenvalues $\Lambda_B$. Define $H = U_m \Lambda_B^{-1/2}$, then compute the $M (\le m)$ eigenvectors $P_M = [p_1, \cdots, p_M]$ of $H^T S_W H$ with the smallest eigenvalues $\Lambda_W$. Finally, combining these results, we obtain $\Psi = H P_M (\eta I + \Lambda_W)^{-1/2}$, which is a set of optimal discriminant feature basis vectors.
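A rough NumPy sketch of the R-LDA feature extractor just described, assuming the scatter matrices S_B and S_W have already been formed from the training data; it follows the eigen-decomposition steps of this section rather than the authors' exact code.

```python
import numpy as np

def rlda_basis(S_B, S_W, eta, M, tol=1e-10):
    """Return Psi = H @ P_M @ (eta*I + Lambda_W)^(-1/2)."""
    # Eigenvectors of S_B with non-zero eigenvalues (at most c - 1 of them).
    lam_B, U = np.linalg.eigh(S_B)
    keep = lam_B > tol
    U_m, lam_m = U[:, keep], lam_B[keep]
    H = U_m / np.sqrt(lam_m)                      # H = U_m * Lambda_B^{-1/2}
    # Eigenvectors of H^T S_W H with the M smallest eigenvalues (eigh sorts ascending).
    lam_W, P = np.linalg.eigh(H.T @ S_W @ H)
    P_M, lam_WM = P[:, :M], lam_W[:M]
    return H @ P_M / np.sqrt(eta + lam_WM)        # regularized by eta
```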

4.2 KDDA

The KDDA method [7] implements an improved D-LDA in a high-dimensional feature space using a kernel approach. Let $R^N$ denote the input space, and let $A$ and $B$ represent the null spaces of the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$, respectively; the complement spaces of $A$ and $B$ can be written as $A' = R^N - A$ and $B' = R^N - B$. The optimal discriminant subspace sought by the KDDA algorithm is then the intersection space $(A' \cap B)$, where $A'$ is found by diagonalizing the matrix $S_B$. The feature space $F$ remains implicit through the use of kernel methods: dot products in $F$ are replaced with a kernel function evaluated in $R^N$, so that the nonlinear mapping is performed implicitly.
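We do not reproduce the full KDDA derivation here; the sketch below only illustrates the kernel substitution referred to above, replacing dot products in F by an RBF kernel evaluated in the input space (the kernel choice and names are ours, not necessarily those used in the experiments).

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    The scatter-matrix computations of D-LDA can be rewritten in terms of K,
    so the mapping to the feature space F never has to be carried out explicitly.
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
```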

5 Experiments

Two typical sets of real-world data, namely Raman spectroscopy and stellar spectra data, are used in our study. The Raman spectroscopy data set used in this work is the same as that in reference [10]. It consists of three classes of substance: acetic acid, ethanol and ethyl acetate. After preprocessing, all spectra have been cut to 134 dimensions. There are 50 samples of acetic acid, 30 samples of ethanol and 290 samples of ethyl acetate, 370 samples in total.

The stellar spectra used in the experiments are from the Astronomical Data Center (ADC) [11]. They are drawn from a standard stellar library for evolutionary synthesis. The data set consists of 430 samples and can be divided into 3 classes, with 88, 131 and 211 samples, respectively. Each spectrum has 1221 wavelength points covering the range from 9.1 to 160000 nm. The typical distribution of these spectral lines in the range from 100 nm to 1600 nm is shown in Fig. 1.

Fig. 1. Typical spectral lines of the three stellar types (intensity I versus wavelength in nm)

In the experiments, the data set is randomly partitioned into a training set and a testing set with no overlap between them. In the Raman data experiment, 15 samples are chosen randomly from each class; they are used as training samples to estimate the mean vectors and covariance matrices, and the remaining 310 samples are used as test samples to measure the classification accuracy. In the stellar data experiment, 40 samples are randomly chosen from each class for training, and the remainder for testing.

In this study, we first investigate the regularization methods, that is, classifying the data directly in the high-dimensional space with regularized classifiers. The other part of the experiments applies R-LDA, KDDA and PCA for dimension reduction; on the reduced data set we use QDA as the classifier to obtain the correct classification rate (CCR) in the feature space. The results with the PCA method are obtained on a data set reduced to 10 dimensions. All experiments are repeated for 20 runs with different random partitions of the data, and all results reported in the tables of this paper are averages over the twenty runs. In the experiments we record the CCR and the time cost for each method.

Table 1 shows the classification results with the different approaches. It should be pointed out that the dimension of the raw stellar data is too high compared with its sample number, so it is unstable to compute the CCR directly in such a high-dimensional space. We therefore first reduce the dimension of the stellar data to 100 with PCA, and consider the result still to be a sufficiently high-dimensional space for the problem under investigation. In the tables presented in this paper, the CCR is reported as a decimal fraction. The notation N/A indicates that the covariance matrix is singular, in which case reliable results cannot be obtained.

Table 1. The Classification Results with Different Approaches

Data     Evaluation   RDA      LOOC     KLIM     R-LDA    KDDA     PCA
Raman    CCR          N/A      N/A      0.8448   0.6536   0.7374   0.7625
         Time         99.399   39.782   178.57   0.2423   2.6132   0.4166
Stellar  CCR          0.9490   0.7786   0.9653   0.9677   0.9591   0.9531
         Time         150.1    40.058   194.3    0.1672   2.6678   4.157
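The PCA + QDA pipeline used for the dimension-reduction columns can be reproduced along the following lines with scikit-learn; the split sizes follow the description above, while the random seed and function names are our own and the exact figures in Table 1 will of course not be matched.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def one_run(X, y, n_train_per_class, n_components, rng):
    """One random split: PCA to n_components, then QDA; returns the CCR."""
    train_idx = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        train_idx.extend(rng.choice(idx, size=n_train_per_class, replace=False))
    train_idx = np.array(train_idx)
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False

    pca = PCA(n_components=n_components).fit(X[train_idx])
    Z_train, Z_test = pca.transform(X[train_idx]), pca.transform(X[test_mask])
    clf = QuadraticDiscriminantAnalysis().fit(Z_train, y[train_idx])
    return clf.score(Z_test, y[test_mask])    # correct classification rate

# e.g. Raman data: 15 training samples per class, 10 principal components,
# averaged over 20 random partitions as in the paper:
# rng = np.random.default_rng(0)
# ccr = np.mean([one_run(X_raman, y_raman, 15, 10, rng) for _ in range(20)])
```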

For further comparison, we use PCA to reduce the data to different dimensions before classification, and still use the same QDA classifier to compute the CCR. The dimension of the two data sets is reduced to four different levels: 40, 20, 10 and 2 dimensions. For convenience of comparison, Table 2 shows the classification results for the different dimensions of the Raman spectroscopy and stellar spectra data together.


6 Discussion

Table 1 depicts a quantitative comparison of the mean CCR and time cost obtained by classifying directly in the high-dimensional space with regularization approaches and by classifying after dimension reduction with a non-regularized classifier (QDA). As can be seen from the table, the time cost of classifying the high-dimensional data directly with regularization approaches is usually 20 to 1000 times higher than that of classification after dimension reduction. In most cases, the CCR obtained after dimension reduction is also acceptable compared to direct classification with regularization approaches.

From Table 1 we can also see that when the training samples are insufficient and the class distribution is seriously unbalanced, the ill-posed problem cannot be fully solved even with regularized classifiers if the regularization parameters are too small. This phenomenon is very obvious for the Raman data, where the RDA and LOOC classifiers still encounter the covariance matrix singularity problem. Meanwhile, we find that KLIM is a very effective regularization approach. For the Raman data, it obtains the best results among the three regularization approaches and the three dimension reduction approaches. If a data set has insufficient samples and a seriously unbalanced distribution, as the Raman data does, KLIM gives better CCR results than the dimension reduction approaches, but it costs more computation time than the other classifiers.

Table 2. The Classification Results for Different Dimensionality

Data     Evaluation   d=40     d=20     d=10     d=2
Raman    CCR          N/A      N/A      0.7625   0.6446
         Time         6.358    4.2011   0.4166   0.3014
Stellar  CCR          N/A      0.9574   0.9531   0.8963
         Time         39.2     13.913   4.157    2.2212

Does classification after dimension reduction always give more acceptable results than direct classification with regularization classifiers? The answer is not always positive. As illustrated in Table 2, the lower the reduced dimension, the less the computation time cost. We see that the results on the Raman data are worse than those on the stellar data, due to the insufficient training samples and seriously unbalanced distribution. For the Raman data, even when the data are reduced to 20 dimensions, the ill-posed problem still exists, and the classification results are much worse than those of the stellar data. From the experiments we also find that the mean classification accuracy with principal components (PCs) is still acceptable even with only 2 PCs, although the accuracy degrades noticeably. When we reduce the data dimension to 2 PCs, for the stellar data the CCR obtained with the QDA classifier is lower than the CCR obtained with RDA. We consider this to be because a reduction in the number of features leads to a loss of discriminant ability for some data sets. In order to cut down the computation time and obtain a satisfactory classification accuracy at the same time, a careful choice of the dimension level to which the data are reduced is needed. However, how to select a suitable dimension level is still an open problem.

7 Conclusions

In this paper, we presented comparative studies of regularization and dimension reduction with real-world data sets under the same working conditions. From the results, we can draw the following conclusions: (1) Dimension reduction approaches often give acceptable CCR results, and they reduce the computation time and use less memory compared to classifying directly in the high-dimensional space with regularization methods. (2) The choice of the dimension level to reduce to is very important. There exists an appropriate dimension level at which we can get satisfactory results with as little computation time and memory as possible; however, it is very difficult to choose such a level. (3) If the chosen dimension is not sufficiently low, the ill-posed problem may still not be avoided, while if the reduced dimension is too low, discriminant ability is lost and the classification accuracy degrades. (4) If the number of data samples is insufficient and the sample distribution is seriously unbalanced, as with the Raman spectroscopy data, some regularization approaches such as KLIM can be more effective than the dimension reduction approaches.

Acknowledgments

The research work described in this paper was fully supported by a grant from the National Natural Science Foundation of China (Project No. 60675011). The authors would like to thank Fei Xing and Ling Bai for their help in part of the experimental work.

References

1. Aeberhard, S., Coomans, D., De Vel, O.: Comparative Analysis of Statistical Pattern Recognition Methods in High Dimensional Settings. Pattern Recognition 27 (1994) 1065-1077
2. Friedman, J.H.: Regularized Discriminant Analysis. J. Amer. Statist. Assoc. 84 (1989) 165-175
3. Hoffbeck, J.P., Landgrebe, D.A.: Covariance Matrix Estimation and Classification with Limited Training Data. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (1996) 763-767
4. Guo, P., Lyu, M.R.: Classification for High-Dimension Small-Sample Data Sets Based on Kullback-Leibler Information Measure. In: Arabnia, H.R. (ed.): Proceedings of the 2000 International Conference on Artificial Intelligence (2000) 1187-1193
5. Webb, A.R.: Statistical Pattern Recognition. Oxford University Press, London (1994)
6. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Regularization Studies of Linear Discriminant Analysis in Small Sample Size Scenarios with Application to Face Recognition. Pattern Recognition Letters 26 (2005) 181-191
7. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face Recognition Using Kernel Direct Discriminant Analysis Algorithms. IEEE Trans. Neural Networks 14 (2003) 117-126
8. Jolliffe, I.T.: Principal Component Analysis. Springer-Verlag (1996)
9. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7 (1936) 179-188
10. Guo, P., Lu, H.Q., Du, W.M.: Pattern Recognition for the Classification of Raman Spectroscopy Signals. Journal of Electronics and Information Technology 26 (2002) 789-793 (in Chinese)
11. Stellar Data: ADC website, http://adc.gsfc.nasa.gov/adc/sciencedata.html