
Kernel Feature Selection to Improve Generalization Performance of Boosting Classifiers

Kenji NISHIDA
Neuroscience Research Institute
National Institute of Advanced Industrial Science and Technology (AIST)
Central 2, 1-1-1 Umezono, Tsukuba, IBARAKI 305-8568 JAPAN
[email protected]
Phone: 81-29-861-5879  Fax: 81-29-861-5841

Takio KURITA
Neuroscience Research Institute
National Institute of Advanced Industrial Science and Technology (AIST)
Central 2, 1-1-1 Umezono, Tsukuba, IBARAKI 305-8568 JAPAN
[email protected]
Phone: 81-29-861-5838  Fax: 81-29-861-5841

Abstract

In this paper, kernel feature selection is proposed to improve the generalization performance of boosting classifiers. Kernel feature selection performs feature selection and model selection at the same time using a simple selection algorithm. The algorithm automatically selects a subset of kernel features for each classifier and combines them according to the LogitBoost algorithm. The system employs kernel logistic regression for the base learner, and a kernel feature is selected at each stage of boosting to reduce the generalization error. The proposed method was applied to the MIT CBCL pedestrian image database, and kernel features were extracted from each pixel of the images as local features. The experimental results showed a good generalization error with local feature selection, and a further improvement with kernel feature selection.

Key words: Ensemble classifier, Boosting, Kernel method, Feature selection, Model selection.

1 Introduction

A powerful nonlinear classifier can be constructed by combining a kernel method with a linear classification method such as the support vector machine (SVM). However powerful the classifier is, its classification performance suffers when the training samples contain features that are unnecessary for classification, especially its generalization performance (the classification performance on unlearned data). Therefore, feature selection to prune the subset of features remains important even with kernel methods. In addition, because the naive kernel method lets the number of training samples determine the number of kernel bases, an overly complicated model may be determined and overlearning may result. Thus, model selection through selecting a subset of samples becomes important when there is a large number of training samples.

Previous studies described the effects of feature selection [1, 4]. In Nishida's work [1], the combination of feature selection and boosting [2] of a soft-margin support vector machine (SVM) was proposed. A base learner selected the best local feature from 100 predefined rectangular regions of the training samples. After 100 stages of boosting, about 50 local features were selected repeatedly from the 100 local features, and combining different feature extraction items (such as edge features and histogram-equalization features) improved the generalization performance. However, further improvement can be expected by increasing the variation of local features, which was limited to only 100 rectangular regions in that method.

Hotta proposed a classifier with a summation of local kernels [4] based on a convolution kernel [3] to determine a robust classifier for partially occluded images. In this method, a kernel feature was determined for each pixel, and the SVM was trained on the summation of the kernel features as a new kernel feature for the whole image (usually the kernel feature is considered to be a product of local kernels). The classifier remained robust for images with about 50% occlusion, attaining a classification ratio of about 70%. This robustness, in other words, indicates that a good classifier can be determined by selecting about half of the pixels in one image. Hotta did not apply weights to the local kernels in the summation; however, the contribution of each pixel to classification should intrinsically vary with its location, intensity, etc. Thus, the classification performance (including the generalization performance) can be improved by selecting pixels and summing them up according to their contribution to classification, as in boosting.

Model selection can be performed by selecting a subset of training samples that represents an approximation of the feature space of the training samples. Feature vector selection (FVS) [5] was proposed to approximate the kernel feature space with a linear combination of selected feature vectors. In FVS, feature vectors are selected sequentially while computing the fitness of their linear combination to the kernel space, until an adequate fitness is attained. The import vector machine (IVM) [6] introduced a similar selection method into the computation of kernel logistic regression (KLR). IVM selects the subset of feature vectors during the iterative reweighted least squares (IRLS) optimization process of KLR.

In this paper, we propose kernel feature selection, in which an adequate subset of local features and an adequate subset of samples are selected at the same time to improve the generalization performance of a classifier. In kernel feature selection, we first generate kernel feature vectors for each local feature of the input samples; then, a subset of kernel feature vectors is selected from all of the local features. In FVS and IVM, the subset of kernel features has to approximate the kernel space with a small error; we instead truncate the approximation for one base learner in boosting, since Hidaka found that weakness of the base learners can improve the generalization performance of an ensemble classifier [7].

The proposed method was evaluated on a pedestrian classification task using LogitBoost [2]. We first evaluated the effect of combining feature selection and boosting with one pixel as a local feature, compared with the summation of local kernels. The results showed that selecting 50 adequate pixels achieved better classification performance than summing up all local kernels for the 528 pixels of the pedestrian images. We next evaluated the effect of kernel feature selection on the generalization performance of LogitBoost. The results indicated that the generalization performance improved while the classification performance was preserved by selecting only a few samples. We describe kernel feature selection, after a brief description of KLR and IVM, in Section 2. In Section 3, we present our experimental results.

2 Kernel Feature Selection for LogitBoost

First, we describe the classification based on kernel logistic regression (KLR). Then, we describe a kernel selection method that achieves model selection. Finally, we describe a method of local-feature selection with a boosting procedure for KLR in combination with kernel selection.

2.1 Kernel Logistic Regression (KLR)

Kernel logistic regression produces a nonlinear classification boundary in the original input space by constructing a linear boundary in a transformed version of the original input space. The classification function for KLR is represented as

y = f(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)},    (1)

where \eta is transformed from the original input x by a kernel function, i.e.,

\eta = \sum_{i=1}^{N} \tilde{\alpha}_i K(x, x_i).    (2)
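A minimal sketch of the KLR decision function in equations (1) and (2), assuming a Gaussian (RBF) kernel and already-fitted coefficients `alpha`; the kernel choice and the variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.1):
    """Assumed kernel: K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def klr_predict(x, X_train, alpha, gamma=0.1):
    """Equations (1)-(2): eta = sum_i alpha_i K(x, x_i), y = exp(eta)/(1+exp(eta))."""
    eta = sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))
    return 1.0 / (1.0 + np.exp(-eta))   # equivalent to exp(eta) / (1 + exp(eta))
```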

The likelihood for the classification output y is determined as follows:

L = \prod_{i=1}^{N} y_i^{u_i} (1 - y_i)^{(1 - u_i)},    (3)

where u_i stands for the target value of the i-th training sample. Hence, the log-likelihood is represented as

l = \sum_{i=1}^{N} \{ u_i \log y_i + (1 - u_i) \log(1 - y_i) \}
  = \sum_{i=1}^{N} \Big[ u_i \sum_{j=1}^{N} \tilde{\alpha}_j K(x_j, x_i) - \log\{ 1 + \exp\big( \textstyle\sum_{j=1}^{N} \tilde{\alpha}_j K(x_j, x_i) \big) \} \Big],    (4)
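For reference, the log-likelihood in equation (4) can be computed directly from the kernel Gram matrix; this is a sketch with assumed variable names (`K`, `alpha`, `u`), not code from the paper.

```python
import numpy as np

def klr_log_likelihood(K, alpha, u):
    """Equation (4): l = sum_i [ u_i * eta_i - log(1 + exp(eta_i)) ], with eta = K @ alpha."""
    eta = K @ alpha                       # eta_i = sum_j alpha_j K(x_j, x_i)
    return np.sum(u * eta - np.log1p(np.exp(eta)))
```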

where N stands for the number of training samples. The classification boundary is determined by maximizing the log-likelihood l. Kernel logistic regression uses the iterative reweighted least squares (IRLS) method to determine the parameters that maximize the log-likelihood l. In IRLS, we first determine the second-order derivative of equation (4):

\frac{\partial^2 l}{\partial \tilde{\alpha}_k \partial \tilde{\alpha}_j} = - \sum_{i=1}^{N} \alpha_i K(x_i, x_k) K(x_i, x_j),    (5)

where \alpha_i = y_i (1 - y_i). The matrix representations of the first-order and second-order derivatives are

\nabla l = \sum_{i=1}^{N} (u_i - y_i) \tilde{k}_i = K^T (u - y),    (6)

\nabla^2 l = - \sum_{i=1}^{N} \alpha_i \tilde{k}_i \tilde{k}_i^T = - K^T A K,    (7)

where K stands for the Gram matrix of the input samples, \tilde{k}_i for its i-th column, and A = diag(\alpha_1, \ldots, \alpha_N). The parameters for the maximum log-likelihood can then be updated with the Newton method; the new estimate \tilde{\alpha}^* is defined as

\tilde{\alpha}^* = (K^T A K)^{-1} \big( K^T A K \tilde{\alpha} + K^T (u - y) \big)
                 = (K^T A K)^{-1} K^T A \big( K \tilde{\alpha} + A^{-1} (u - y) \big).    (8)
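One IRLS/Newton step of equation (8) might be sketched as follows; `K` is the Gram matrix, `u` the 0/1 targets, and the small ridge term added for numerical stability is an assumption of this sketch, not part of the derivation above.

```python
import numpy as np

def irls_step(K, alpha, u, ridge=1e-6):
    """Equation (8): alpha* = (K^T A K)^{-1} K^T A (K alpha + A^{-1}(u - y))."""
    eta = K @ alpha
    y = 1.0 / (1.0 + np.exp(-eta))        # current predictions, equation (1)
    a = y * (1.0 - y)                     # IRLS weights alpha_i = y_i (1 - y_i)
    A = np.diag(a)
    z = eta + (u - y) / a                 # working response K alpha + A^{-1}(u - y)
    H = K.T @ A @ K + ridge * np.eye(K.shape[1])
    return np.linalg.solve(H, K.T @ A @ z)
```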

2.2 Kernel Selection for KLR

The import vector machine (IVM) [6] and feature vector selection (FVS) [5] have been proposed to select training samples while approximating a kernel feature space. When the full model for the input sample set X = \{x_1, \cdots, x_n\} is defined as in equation (2), a sub-model that approximates the full model for KLR can be defined as

\eta \approx \sum_{x_i \in S} \tilde{\alpha}_i K(x, x_i),    (9)

where S is a subset of the training samples. The selected samples x_i \in S are called import points in IVM. While IVM and FVS select as many feature vectors as are needed for the sub-model to approximate the full model with a small error, we propose to truncate the approximation by selecting a fixed number of feature vectors and then to compensate for the resulting deterioration in classification by boosting.
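A minimal sketch of evaluating the sub-model in equation (9): the kernel matrix is restricted to a subset S of training samples, and the proposed method fixes |S| in advance instead of growing it until the approximation error is small. The array layout and the names `K_full` and `S` are assumptions for illustration.

```python
import numpy as np

def submodel_eta(K_full, alpha_S, S):
    """Equation (9): eta ~= sum_{i in S} alpha_i K(x, x_i).
    K_full: (n_eval, N) kernel values K(x, x_i) against all training samples x_i;
    S: indices of the selected samples (import points); alpha_S: their coefficients."""
    return K_full[:, S] @ alpha_S

# Truncated selection: the subset size is fixed in advance; the experiments in this
# paper use a single kernel feature vector per base learner.
J = 1
S = np.array([0])            # illustrative choice of selected indices
```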

2.3 Feature Selection and Boosting

We have already proposed a boosting algorithm with an automatic feature selection method [1], which selects an optimal local feature from 100 predefined local features. Through our experience with this feature selection method, we speculated that a combination of fine-grained local features could improve the generalization performance of the boosted classifier. Hotta proposed a sum of local kernels [4] based on the convolution kernel [3] to combine local features, and a classification function can be determined by

f(x) = \frac{1}{L} \sum_{p}^{L} \sum_{i}^{N} \alpha_i y_i K_p(x_i(p), x(p)),    (10)

where K_p(x_i(p), x(p)) is a kernel feature for a local feature, N is the number of training samples, and L is the number of local features. Equation (10) can be thought of as an example of boosting with equal significance for all classifiers; therefore, it can be redefined by introducing the classifier significance \beta, as

F(x) = \sum_{p}^{L} \beta_p f_p(x),    (11)

where

l_p = E_\alpha [ 1_{(y \neq f_p(x))} ],    (12)
\beta_p = \log( (1 - l_p)/l_p ),    (13)
\alpha_i = \alpha_i \exp[ \beta_p \cdot 1_{(y_i \neq f_p(x_i))} ].    (14)

Here, E_\alpha represents the expectation over the training data with weights \alpha, and 1_{(s)} is the indicator of the set s. Selecting M local features (M boosting steps), a classification function can be derived from (11):

F(x) = \sum_{p \in S_p} \beta_p f_p(x)
     = \sum_{p \in S_p} \sum_{i}^{N} \beta_p \alpha_i y_i K_p(x_i(p), x(p)),    (15)

where S_p stands for the selected set of local features. When we employ kernel logistic regression for the base learners, the classifier significance \beta is merged into f(x):

F(x) = \mathrm{sign}\Big[ \sum_{p \in S_p} f_p(x) \Big]
     = \mathrm{sign}\Big[ \sum_{p \in S_p} \sum_{i}^{N} \alpha_i y_i K_p(x_i(p), x(p)) \Big],    (16)

where

p(x) = \frac{\exp(F(x))}{\exp(F(x)) + \exp(-F(x))},    (17)
\alpha_i = p(x_i)(1 - p(x_i)).    (18)

By selecting kernel features as in Section 2.2, the classification function is determined as follows:

F(x) = \mathrm{sign}\Big[ \sum_{p \in S_p} f_p(x) \Big]
     = \mathrm{sign}\Big[ \sum_{p \in S_p} \sum_{i \in S_k} \alpha_i y_i K_p(x_i(p), x(p)) \Big],    (19)

where S_k stands for the set of selected kernels.
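A sketch of the summation-of-local-kernels classifier in equations (10) and (16), treating each pixel p as a local feature; the per-pixel Gaussian kernel and the array layout (samples × pixels) are assumptions for illustration, since the paper does not fix a specific kernel here.

```python
import numpy as np

def local_kernel(xi_p, x_p, gamma=0.1):
    """Assumed K_p(x_i(p), x(p)): kernel on the p-th local feature (one pixel value)."""
    return np.exp(-gamma * (xi_p - x_p) ** 2)

def sum_of_local_kernels(x, X_train, y_train, alpha, selected_pixels):
    """Equation (16): F(x) = sign[ sum_{p in S_p} sum_i alpha_i y_i K_p(x_i(p), x(p)) ]."""
    score = 0.0
    for p in selected_pixels:                            # S_p: selected local features
        for xi_p, yi, ai in zip(X_train[:, p], y_train, alpha):
            score += ai * yi * local_kernel(xi_p, x[p])
    return np.sign(score)
```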

2.4 Kernel Feature Selection for Boosting KLR

IVM selects feature vectors until the selected subset approximates their feature space with a small error; therefore, the number of selected vectors is not determined in advance. We propose selecting a predefined number of feature vectors to truncate the approximation; our classifier thereby becomes weaker than IVM. By compensating for the weakness of the base learners with boosting, we aim to improve the generalization performance of the resulting strong classifier.

1: Let N be the number of input samples, M the number of boosting stages, L the number of pixels (the number of local features), S the number of extracted features, and J the number of selected kernel feature vectors.
2: Compute a kernel Gram matrix for each pixel and each feature; N × L × S = K kernel feature vectors are determined in total. Concatenate all kernel feature vectors to determine K (N × K).
3: Set the initial weights as w_i = 1/N, i = 1, 2, ..., N, set F(x) = 0, and set the initial probability as p(x_i) = 1/2.
4: For m = 1 to M:
   4.1 K_m = [].
   4.2 For j = 1 to J (select J kernel feature vectors):
       4.2.1 For k = 1 to K:
             4.2.1.1 K_m^k = [K_m, k_k]. (Select one feature vector from K and add it to K_m.)
             4.2.1.2 z_i = (y_i^* - p(x_i)) / (p(x_i)(1 - p(x_i))).
                     (Train a classifier f_m^k(x) on K_m^k with sample weights w_i = p(x_i)(1 - p(x_i)).)
       4.2.2 Set the f_m^k(x) with the highest classification ratio as f_m(x), and set the corresponding K_m^k as K_m. (f_m(x) and K_m for the m-th base learner.)
   4.3 F(x) ← F(x) + (1/2) f_m(x),
       p(x) ← e^{F(x)} / (e^{F(x)} + e^{-F(x)}).
       (Update the strong classifier F(x) and the probability p(x).)
5: The final result is determined as sign[F(x)] = sign[ Σ_{m=1}^{M} f_m(x) ].

Figure 1. Pseudo code for Kernel Feature Selection with LogitBoost

Figure 2. Boosting with local kernel selection

Figure 1 shows the pseudo code for kernel feature selection, with J feature vectors selected. In kernel feature selection, a set of kernel feature vectors is selected from the entire group of kernel feature vectors generated from all the local features, whereas in the usual feature selection process a set of local features is selected to generate one kernel Gram matrix. Figure 2 gives a brief description of boosting with local feature selection, and Figure 3 illustrates kernel feature selection; a sketch of the selection loop is given below. By generating local kernels for all the local features in advance, the same procedure can be used to select the kernel features.
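The selection loop of Fig. 1 might be sketched as follows for J = 1 kernel feature vector per base learner. `Kfeat` is assumed to be the N × K concatenation of the per-pixel Gram matrices built in step 2 of Fig. 1, and the base learner is simplified to a weighted least-squares fit of a single column scored by weighted squared error; Fig. 1 instead selects by the highest classification ratio, so this is an approximation, not a literal transcription.

```python
import numpy as np

def kernel_feature_boosting(Kfeat, u, M=50):
    """Kfeat: (N, K) matrix of candidate kernel feature vectors; u: 0/1 targets.
    Returns the indices of the selected columns and their fitted coefficients."""
    N, n_cols = Kfeat.shape
    F = np.zeros(N)
    selected, coefs = [], []
    for m in range(M):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))          # p(x) = e^F / (e^F + e^-F)
        w = np.clip(p * (1.0 - p), 1e-10, None)      # sample weights w_i = p(1 - p)
        z = (u - p) / w                              # working response z_i (step 4.2.1.2)
        best = None
        for k in range(n_cols):                      # try every candidate kernel feature
            col = Kfeat[:, k]
            c = np.sum(w * col * z) / (np.sum(w * col * col) + 1e-12)  # weighted LS fit
            err = np.sum(w * (z - c * col) ** 2)
            if best is None or err < best[0]:
                best = (err, k, c)
        _, k_best, c_best = best
        F += 0.5 * c_best * Kfeat[:, k_best]         # F <- F + (1/2) f_m (step 4.3)
        selected.append(k_best)
        coefs.append(c_best)
    return selected, coefs
```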

Figure 3. Boosting with kernel-feature selection

Figure 4. Sample Images and Local Features

Figure 5. Error Ratio for Summing-up All Local-Features (training and test error ratio versus number of pixels / number of boosting steps)

3 Experiment

3.1 Overview of Sample Data

We employed the MIT CBCL (Center for Biological and Computational Learning) database as sample data. The database contained 926 images of pedestrians with 128×64 pixels. We also collected 2,000 images of non-pedestrians by ourselves. We reduced the resolution of all the samples to 24×11 before applying them to our system. We extracted histogram-equalization and edge features from the input images, and treated each pixel of the feature images as a local feature. Figure 4 shows the original image, the extracted features, and the local features. We selected one pixel as a local feature to generate a local kernel feature for boosting with local kernels, and we selected one kernel feature vector from all the local kernel features for kernel feature selection. We used 100 images of pedestrians and 100 images of non-pedestrians for training, and the same number and type of images to test the generalization error.
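A sketch of the preprocessing described above, assuming OpenCV for resizing and histogram equalization; the paper does not specify the exact edge operator, so a Sobel filter is used here as a placeholder, and the 24×11 (height × width) orientation is also an assumption.

```python
import cv2
import numpy as np

def extract_local_features(img_gray):
    """Resize to 24x11 and build histogram-equalized and edge feature images;
    each pixel of each feature image is treated as one local feature."""
    small = cv2.resize(img_gray, (11, 24))                 # width=11, height=24 (assumed)
    histeq = cv2.equalizeHist(small)                        # histogram-equalization feature
    edges = cv2.Sobel(small, cv2.CV_32F, 1, 0, ksize=3)     # assumed edge feature (Sobel x)
    return np.concatenate([histeq.flatten().astype(np.float32),
                           np.abs(edges).flatten()])        # 264 pixels x 2 features = 528 values
```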

3.2 Experimental Results

3.2.1 Error Ratio

Figure 5 plots the error ratio for summing up all the local kernels. The test error reached 3% after summing up 500 local kernels (local features). Figure 6 plots the error ratio over 50 boosting steps. With local feature selection, the test error reached 3% by boosting with only 50 local features. The slower convergence of the training error with kernel feature selection indicates that weaker base learners were determined compared with those of local kernel selection. This is quite reasonable, since we selected only one kernel feature vector for kernel feature selection, while one local kernel contains 200 (the number of samples) kernel feature vectors. The test error reached 2.5% with kernel feature selection. This result indicates that kernel feature selection helped improve the classification performance despite the weakness of its base learners.

3.2.2 Selected Features

Figure 7 shows the selected local features (pixels). Although the same number of pixels is selected in both methods, since the number of boosting steps is the same, kernel feature selection gave a slightly wider variety of selected pixels. Pixels on the edges of the pedestrians tend to be selected with local kernel selection, whereas pixels within the pedestrians are also selected with kernel feature selection.

Figure 6. Error Ratio after 50 Boosting Steps (training and test error for pixel selection and kernel-feature selection)

Figure 7. Selected local-features (Pixels)

4 Conclusion

We presented kernel feature selection as a method to improve the generalization performance of ensemble classifiers. The proposed method was evaluated through pedestrian detection with LogitBoost, using one pixel as one local feature and generating local kernels for each pixel. The experimental results showed good generalization performance with a test error ratio of 2.5%. Good local features (pixels) were automatically selected by the kernel feature selection. We had to limit the number of kernel feature vectors for one base learner to one, because we had limited computational time to train the classifiers. In fact, we had to select one kernel feature vector out of 105,600 (264 pixels × two features × 200 samples). It took about two hours to train one base learner (four days for 50 boosting steps) on a 3.4-GHz Pentium 4 processor, and combining several kernel feature vectors would require a number of computations growing as a power of 105,600. We plan to introduce some heuristics in sampling to reduce the computational cost of selection, and to combine a number of kernel features to enhance the performance of the base learners.

Acknowledgment

This research is a part of "Situation and Intention Recognition for Risk Finding and Avoidance: Human-Centered Technology for Transportation Safety", which is supported by Special Coordination Funds for Promoting Science and Technology provided by MEXT (Ministry of Education, Culture, Sports, Science and Technology), Japan.

References

[1] K. Nishida and T. Kurita, "Boosting Soft-Margin SVM with Feature Selection for Pedestrian Detection", Proc. of 6th International Workshop on Multiple Classifier Systems (MCS 2005), pp. 22-31, 2005.
[2] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", Stanford University Technical Report, 1998.
[3] D. Haussler, "Convolution kernels on discrete structures", Tech. Rep. UCSC-CRL-99-10, 1999.
[4] K. Hotta, "Support Vector Machine with Local Summation Kernel for Robust Face Recognition", Proc. of 17th International Conference on Pattern Recognition (ICPR 2004), pp. 482-485, 2004.
[5] G. Baudat and F. Anouar, "Feature vector selection and projection using kernels", Neurocomputing, Vol. 55, pp. 21-38, 2003.
[6] J. Zhu and T. Hastie, "Kernel Logistic Regression and the Import Vector Machine", J. of Computational and Graphical Statistics, Vol. 14, No. 1, pp. 185-205, 2005.
[7] A. Hidaka and T. Kurita, "Generalization Performance of Face Detector Constructed by AdaBoost Using Rectangle Features with Random Thresholds", Proc. of IEICE Congress, D-12101, 2005 (in Japanese).