
Neurocomputing 71 (2008) 1842–1849 www.elsevier.com/locate/neucom

Locality sensitive semi-supervised feature selection

Jidong Zhao a,*, Ke Lu a, Xiaofei He b

a School of Computer Science and Engineering, University of Electronic Science & Technology of China, Chengdu, Sichuan 610054, China
b Yahoo Inc., Burbank, CA 91506, USA

Available online 26 March 2008

Abstract

In many computer vision tasks such as face recognition and image retrieval, one is often confronted with high-dimensional data. Procedures that are analytically or computationally manageable in low-dimensional spaces can become completely impractical in a space of several hundred or thousand dimensions. Thus, various techniques have been developed for reducing the dimensionality of the feature space in the hope of obtaining a more manageable problem. The most popular feature selection and extraction techniques include the Fisher score, Principal Component Analysis (PCA), and the Laplacian score. Among them, PCA and the Laplacian score are unsupervised methods, while the Fisher score is a supervised method. None of them can take advantage of both labeled and unlabeled data points. In this paper, we introduce a novel semi-supervised feature selection algorithm which makes use of both labeled and unlabeled data points. Specifically, the labeled points are used to maximize the margin between data points from different classes, while the unlabeled points are used to discover the geometrical structure of the data space. We compare our proposed algorithm with the Fisher score and the Laplacian score on face recognition. Experimental results demonstrate the efficiency and effectiveness of our algorithm.

© 2008 Published by Elsevier B.V.

Keywords: Feature selection; Semi-supervised learning; Fisher score

1. Introduction

In many problems of computer vision and pattern recognition, one is often confronted with high-dimensional data. For example, in face recognition, a face image is typically represented as a 32 × 32 image, i.e. a 1024-dimensional vector. However, the number of labeled face images for each individual is usually limited. When the ratio of the number of dimensions to the sample size is too high, learning from examples becomes infeasible due to the "curse of dimensionality" [9]. One of the most natural ways to overcome this problem is through feature selection and extraction [1,2,11,13,14,16,27,29].

Feature extraction methods have been extensively applied to face analysis and recognition. Three of the most popular linear feature extraction techniques are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locality Preserving Projections (LPP) [15].

* Corresponding author. E-mail address: [email protected] (J. Zhao).
doi:10.1016/j.neucom.2007.06.014

PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. When used for dimensionality reduction, PCA retains those characteristics of the data set that contribute most to its variance, by keeping the lower-order principal components and ignoring the higher-order ones. LDA is a supervised learning method. It projects the data onto a space of small dimension (generally c − 1, where c is the number of classes) such that the class information is maximally preserved. Specifically, LDA seeks projection vectors such that the projected data points have maximal between-class variance and minimal within-class variance. When each class follows a Gaussian distribution with an identical covariance matrix, LDA yields the optimal Bayes classifier [9]. Both LDA and PCA have been extensively applied to face recognition [2,27], image retrieval [24,25], microarray data analysis [12], etc.
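To make the variance-maximizing description of PCA above concrete, here is a minimal NumPy sketch. It is added for this edited version and is not taken from the paper; the function name `pca` and the toy data are purely illustrative.

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the n_components directions of maximal variance."""
    X_centered = X - X.mean(axis=0)                    # remove the mean
    cov = np.cov(X_centered, rowvar=False)             # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    components = eigvecs[:, order]                     # d x n_components basis
    return X_centered @ components                     # n x n_components embedding

# toy usage: 100 samples in 10 dimensions reduced to 2
X = np.random.RandomState(0).randn(100, 10)
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

LDA can be sketched analogously by solving the generalized eigenvalue problem given in Eq. (5) of Section 2.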


Despite the fact that both PCA and LDA have been proven useful in many real-world applications, they may fail when the data points are sampled from a highly nonlinear submanifold embedded in the high-dimensional ambient space. In order to discover the manifold structure rather than the Euclidean structure, manifold learning algorithms such as ISOMAP [26], Locally Linear Embedding (LLE) [21], and Laplacian Eigenmaps [3] have been proposed. Both Laplacian eigenmap and LLE try to preserve the local structure. Laplacian eigenmap finds a map such that if two data points are sufficiently close in the original space, then they are also close in the reduced subspace. LLE finds a map such that each data point can be reconstructed from its nearest neighbors. Unlike Laplacian eigenmap and LLE, ISOMAP aims to preserve the global structure of the data manifold. Specifically, ISOMAP finds an embedding such that the geodesic distances on the manifold are preserved. One major disadvantage of these manifold learning algorithms is that they are computationally expensive. Moreover, they are defined only on the training points, and it is not clear how to evaluate the maps for test points. To overcome these limitations, LPP has recently been proposed. LPP is a linear projective map that arises by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP can be seen as an alternative to PCA. When the high-dimensional data lie on a low-dimensional manifold embedded in the ambient space, the LPP projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace–Beltrami operator on the manifold. LPP is linear and, more crucially, is defined everywhere in the ambient space rather than just on the training data points. LPP has been applied to face recognition [16], image retrieval [13], document clustering [4], etc.

In some situations where computational complexity is the major concern, one may consider feature selection methods instead of the feature extraction methods mentioned above. Feature selection methods can be classified into "wrapper" methods and "filter" methods [17]. Wrapper techniques evaluate the features using the learning algorithm that will ultimately be employed; thus, they "wrap" the selection process around the learning algorithm. Most of the existing feature selection methods are wrapper methods. Algorithms based on the filter model examine intrinsic properties of the data to evaluate the features prior to the learning task. Filter-based approaches almost always rely on the class labels, most commonly assessing correlations between features and the class label. Typical supervised filter methods include the Fisher score, Pearson correlation coefficients, and the Kolmogorov–Smirnov test. Recently, the Laplacian score was proposed for unsupervised feature selection [14]. Unlike supervised feature selection methods, the Laplacian score tries to capture the intrinsic geometrical properties of the data. Basically, it evaluates each individual feature by calculating its locality preserving power. In other words, it considers a feature as "good" if, whenever two data points are sufficiently close to each other in the original space, they are also close to each other along this dimension.


It has been shown that the Laplacian score provides a good unsupervised approximation to the Fisher score, which is supervised [14].

Motivated by recent progress in manifold learning and semi-supervised learning, in this paper we propose a novel semi-supervised feature selection algorithm, called Locality Sensitive Discriminant Feature (LSDF) selection. Unlike the Fisher score, which makes use of only labeled data points, and the Laplacian score, which makes use of only unlabeled data points, our algorithm makes use of both labeled and unlabeled data points. Our algorithm tries to discover both the geometrical and the discriminant structure in the data. Specifically, we construct two graphs, i.e. a within-class graph and a between-class graph. The within-class graph connects data points which share the same label or are sufficiently close to each other, while the between-class graph connects data points which are sufficiently close to each other but have different labels. The most locality sensitive discriminant features are selected such that the within-class and between-class graph structures can be best preserved on these features. Specifically, a feature is considered "good" if, along this dimension, nearby points, or points sharing the same label, are close to each other, while points with different labels are far apart.

It is worthwhile to highlight several aspects of the proposed approach here:

(i) Unlike previous feature selection methods such as Fisher score and Laplacian score, our approach makes use of both labeled and unlabeled data points.
(ii) Our approach is independent of any learning algorithm. Therefore, it can be applied in many real-world applications as a preprocessing step.
(iii) Compared with feature extraction methods like LDA and LPP, the computation is much faster.

The rest of the paper is organized as follows. In Section 2, we give a brief review of the Fisher score and the Laplacian score. We introduce our LSDF selection algorithm in Section 3. We provide justifications and some theoretical analysis of our algorithm in Section 4. The experimental results are presented in Section 5. Finally, we give concluding remarks and future work in Section 6.

2. Previous feature selection and extraction algorithms

In this section, we give a brief review of Fisher's criterion and the Laplacian criterion for feature selection and extraction.

2.1. Fisher's criterion

Given a set of labeled data points {x_i, y_i}, y_i ∈ {1, ..., c}, i = 1, ..., n, where c is the number of classes, let n_i denote the number of data points in class i. Let μ_i and σ_i² be the mean and variance of class i, i = 1, ..., c, corresponding to the rth feature, and let μ and σ² denote the mean and variance of the whole data set.


The Fisher score of the rth feature is defined by

F_r = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2}    (1)

where \sum_{i=1}^{c} n_i (\mu_i - \mu)^2 is the between-class variance at the rth feature, and \sum_{i=1}^{c} n_i \sigma_i^2 is the within-class variance at the rth feature [9]. Once the score is computed for each individual feature, one can simply select those features with the highest scores.

The same principle can also be applied to feature extraction, which leads to LDA. Let a be the projection vector. The objective function of LDA is as follows:

\max_a \frac{a^T S_B a}{a^T S_W a}    (2)

S_B = \sum_{i=1}^{c} n_i (\mu^{(i)} - \mu)(\mu^{(i)} - \mu)^T    (3)

S_W = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - \mu^{(i)})(x_{ij} - \mu^{(i)})^T    (4)

where \mu is the total sample mean vector, \mu^{(i)} is the mean vector of the ith class, and x_{ij} is the jth sample in the ith class. We call S_W the within-class scatter matrix and S_B the between-class scatter matrix. The optimal projection vectors are given by solving the following generalized eigenvalue problem:

S_B a = \lambda S_W a    (5)

2.2. Laplacian criterion

The Laplacian criterion aims to preserve the local geometrical structure in the data. It has been used for manifold learning and dimensionality reduction [3]. The Laplacian score was recently proposed for feature selection [14]. In the following we provide a brief description of the Laplacian score. Let L_r denote the Laplacian score of the rth feature, and let f_{ri} denote the ith sample of the rth feature, i = 1, ..., n. The method first constructs a nearest-neighbor graph G with n nodes, where the ith node corresponds to x_i. We put an edge between nodes i and j if x_i and x_j are "close". Let S be the weight matrix of G, defined as

S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t} & \text{if nodes } i \text{ and } j \text{ are connected,} \\ 0 & \text{otherwise,} \end{cases}    (6)

where t is a suitable constant. For the rth feature, we define

f_r = [f_{r1}, f_{r2}, \ldots, f_{rn}]^T, \quad D = \mathrm{diag}(S\mathbf{1}), \quad \mathbf{1} = [1, \ldots, 1]^T, \quad L = D - S,

where the matrix L is often called the graph Laplacian [8]. Let

\tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}.

The Laplacian score of the rth feature is then computed as follows:

L_r = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r}    (7)

Similarly, the Laplacian criterion can also be applied to feature extraction, which leads to LPP [15]. The LPP algorithm aims to minimize the following cost function:

\min_a \sum_{i,j=1}^{n} (a^T x_i - a^T x_j)^2 S_{ij}    (8)

It can be shown that the optimal solution to this optimization problem is given by solving the following generalized eigenvalue problem:

X L X^T a = \lambda X D X^T a    (9)

where X = (x_1, \ldots, x_n). For more details, please see [16].
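As a concrete rendering of Eqs. (1), (6), and (7), the following sketch (not from the authors; a minimal NumPy implementation under the definitions above, with `X` holding one sample per row and `y` the class labels) computes the Fisher score and the Laplacian score of every feature.

```python
import numpy as np

def fisher_scores(X, y):
    """Eq. (1): between-class variance over within-class variance, per feature."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                               # overall mean of each feature
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - mu) ** 2  # n_i (mu_i - mu)^2
        within += n_c * Xc.var(axis=0)                # n_i sigma_i^2
    return between / within

def laplacian_scores(X, k=5, t=1.0):
    """Eqs. (6)-(7): heat-kernel weights on a k-NN graph, then L_r per feature."""
    n = X.shape[0]
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise ||xi - xj||^2
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[1:k + 1]          # k nearest neighbors of x_i
        S[i, nbrs] = np.exp(-dist2[i, nbrs] / t)
    S = np.maximum(S, S.T)                            # symmetrize the graph
    D = np.diag(S.sum(axis=1))
    L = D - S
    ones = np.ones(n)
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones   # remove weighted mean
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
    return scores
```

A large Fisher score indicates a discriminative feature, while for the Laplacian score of [14] smaller values indicate better locality preservation.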

Another closely related work to our algorithm is Locality Sensitive Discriminant Analysis [6]. Here we list the major differences of our algorithm from previous works:

(i) Compared with feature extraction algorithms like PCA, LDA, LPP, and LSDA, our algorithm is much more efficient. Specifically, our algorithm does not need to solve an eigenvector problem.
(ii) Compared with traditional feature selection methods like Fisher score and Laplacian score, our algorithm is semi-supervised. It can make efficient use of both labeled and unlabeled data points, and can therefore better describe the geometrical structure in the data.

3. The algorithm

In this section, we introduce our LSDF selection algorithm. Our algorithm is fundamentally based on Locality Sensitive Discriminant Analysis [7]. We begin with a formal description of the semi-supervised feature selection problem.

3.1. The problem

The generic semi-supervised feature selection problem is as follows. Given a set of labeled points x_1, x_2, ..., x_l and a set of unlabeled points x_{l+1}, ..., x_n in R^m, let f = {f_1, ..., f_m} denote the set of m features. Find a feature subset which contains the most discriminative and informative features. In other words, these features would improve classification or clustering the most if they were used as input features.

3.2. The algorithm

The algorithmic procedure is formally stated below (a short code sketch follows the three steps):

(i) Construct the within-class graph: let G_w denote a graph with n nodes, where the ith node corresponds to the data point x_i. Put an edge between nodes i and j if they share the same label, or if one of them is unlabeled but they are sufficiently close to each other. Define a weight matrix S_w as follows:

S_{w,ij} = \begin{cases} \gamma & \text{if nodes } i \text{ and } j \text{ share the same label,} \\ 1 & \text{if node } i \text{ or } j \text{ is unlabeled, but } i \in \mathrm{KNN}(j) \text{ or } j \in \mathrm{KNN}(i), \\ 0 & \text{otherwise,} \end{cases}    (10)

where KNN(i) denotes the set of k nearest neighbors of node i, KNN(j) denotes the set of k nearest neighbors of node j, and \gamma is a suitable constant. In our experiments, \gamma is empirically set to 100 and k is set to 5.

(ii) Construct the between-class graph: let G_b denote a graph with n nodes, where the ith node corresponds to the data point x_i. Put an edge between nodes i and j if they have different labels. Define a weight matrix S_b as follows:

S_{b,ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ have different labels,} \\ 0 & \text{otherwise.} \end{cases}

(iii) Compute the graph Laplacians: the graph Laplacians of G_w and G_b can be computed as follows:

L_w = D_w - S_w, \quad D_w = \mathrm{diag}(S_w \mathbf{1}),
L_b = D_b - S_b, \quad D_b = \mathrm{diag}(S_b \mathbf{1}),

where \mathbf{1} = (1, 1, \ldots, 1)^T. Compute the importance score for the rth feature:

L_r = \frac{f_r^T L_b f_r}{f_r^T L_w f_r}

for r = 1, 2, ..., m.
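The following is a minimal sketch of steps (i)-(iii), written for this edited version rather than taken from the authors' code. It assumes `X` holds all n samples by rows and `y` holds a class label for each labeled point and -1 for each unlabeled point; γ = 100 and k = 5 follow the settings stated in step (i).

```python
import numpy as np

def lsdf_scores(X, y, k=5, gamma=100.0):
    """Importance score L_r = f_r^T L_b f_r / f_r^T L_w f_r for every feature.

    y[i] >= 0 is a class label; y[i] == -1 marks an unlabeled point.
    """
    n, m = X.shape
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    knn = np.argsort(dist2, axis=1)[:, 1:k + 1]       # indices of the k nearest neighbors

    Sw = np.zeros((n, n))                             # within-class graph, Eq. (10)
    Sb = np.zeros((n, n))                             # between-class graph
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if y[i] >= 0 and y[j] >= 0:
                if y[i] == y[j]:
                    Sw[i, j] = gamma                  # same label
                else:
                    Sb[i, j] = 1.0                    # different labels
            elif j in knn[i] or i in knn[j]:
                Sw[i, j] = 1.0                        # at least one unlabeled, but neighbors

    Lw = np.diag(Sw.sum(axis=1)) - Sw                 # graph Laplacians of G_w and G_b
    Lb = np.diag(Sb.sum(axis=1)) - Sb

    scores = np.empty(m)
    for r in range(m):
        f = X[:, r]
        scores[r] = (f @ Lb @ f) / (f @ Lw @ f)
    return scores
```

Features are then ranked by descending score, as described in Section 5.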

4. Justification

Recall that, given a data set, we construct a within-class graph G_w and a between-class graph G_b. The graph G_w connects points sharing the same label or having a high probability of sharing the same label, while the graph G_b connects points with different labels. The weight matrix S_w evaluates the similarity between data points, while the weight matrix S_b evaluates the dissimilarity between data points. Thus, the importance of a feature can be thought of as the degree to which it preserves these two graph structures.

Let f be a feature in the feature set F, f = (f_1, f_2, ..., f_n)^T, where f_i corresponds to the ith sample. Intuitively, if x_i and x_j are sufficiently close to each other or share the same label, in other words if S_{w,ij} is large, then we expect that f_i and f_j are also close to each other, i.e. |f_i - f_j| is small. Therefore, a reasonable criterion for choosing a "good" feature is to minimize the following objective function:

\min_f \sum_{ij} (f_i - f_j)^2 S_{w,ij}    (11)

By minimizing the above objective function, the bigger S_{w,ij} gets, the smaller |f_i - f_j| becomes. In this way, the graph structure of G_w can be well preserved. Let D_w be a diagonal matrix with

D_{w,ii} = \sum_j S_{w,ij},

and define L_w = D_w - S_w. L_w is often called the graph Laplacian of G_w [8]. By a simple algebraic formulation, we have

\sum_{ij} (f_i - f_j)^2 S_{w,ij} = \sum_{ij} (f_i^2 - 2 f_i f_j + f_j^2) S_{w,ij}
= 2 \sum_i f_i^2 \Big( \sum_j S_{w,ij} \Big) - 2 \sum_{ij} f_i f_j S_{w,ij}
= 2 f^T D_w f - 2 f^T S_w f
= 2 f^T L_w f.

Thus, the objective function (11) can be reduced to the following:

\min_f f^T L_w f    (12)

Similarly, the graph G_b describes the dissimilarity between data points. Therefore, if x_i and x_j have different labels, i.e. S_{b,ij} = 1, then we expect f_i and f_j to be as far apart as possible, i.e. |f_i - f_j| to be as large as possible. Thus, the following objective function on the graph G_b should be maximized:

\max_f \sum_{ij} (f_i - f_j)^2 S_{b,ij}    (13)

By maximizing (13), the bigger S_{b,ij} gets, the bigger |f_i - f_j| becomes, so the graph structure of G_b can be well preserved. As before, we define a diagonal matrix D_b and the graph Laplacian L_b as follows:

D_{b,ii} = \sum_j S_{b,ij}, \quad L_b = D_b - S_b.

Similarly, we have

\sum_{ij} (f_i - f_j)^2 S_{b,ij} = \sum_{ij} (f_i^2 - 2 f_i f_j + f_j^2) S_{b,ij}
= 2 \sum_i f_i^2 \Big( \sum_j S_{b,ij} \Big) - 2 \sum_{ij} f_i f_j S_{b,ij}
= 2 f^T D_b f - 2 f^T S_b f
= 2 f^T L_b f.

Thus, the objective function (13) can be reduced to the following:

\max_f f^T L_b f    (14)
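The quadratic-form identity used twice in this section is easy to verify numerically. The snippet below is illustrative only; a random symmetric weight matrix stands in for S_w or S_b, and the check confirms that \sum_{ij} (f_i - f_j)^2 S_{ij} = 2 f^T L f.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
S = rng.random((n, n))
S = (S + S.T) / 2                 # a symmetric weight matrix
np.fill_diagonal(S, 0.0)
f = rng.standard_normal(n)

L = np.diag(S.sum(axis=1)) - S    # graph Laplacian D - S
lhs = sum((f[i] - f[j]) ** 2 * S[i, j] for i in range(n) for j in range(n))
rhs = 2 * f @ L @ f
print(np.isclose(lhs, rhs))       # True
```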


By combining the objective functions (12) and (14), the importance score of a feature f can be computed as follows:

L(f) = \frac{f^T L_b f}{f^T L_w f}    (15)

5. Experimental results

In this section, we compare our proposed feature selection algorithm with the Fisher score and the Laplacian score on appearance-based face recognition [1,19,20,22,23]. We use three face databases, i.e. the Extended Yale B database, the CMU PIE database, and the Olivetti Research Laboratory (ORL) database. As described previously, a face image can be represented as a point in a vector space: a typical image of size m_1 × m_2 describes a point in an (m_1 × m_2)-dimensional vector space. Although the ambient dimensionality is usually very large, there is reason to believe that the intrinsic dimensionality is much lower. Before applying standard classification algorithms, such as k-nearest neighbor, Support Vector Machines [28], regression, etc., it is usually beneficial to apply some feature selection or extraction method to reduce the dimensionality. Previously, feature extraction methods have been extensively applied to face recognition; three of the most popular methods are Eigenface [27], Fisherface [2], and Laplacianface [16]. Eigenface is based on PCA, Fisherface is based on LDA, and Laplacianface is based on LPP. It has been shown that recognition performance can be significantly improved in the reduced subspace. One major disadvantage of these three methods is that all of them need to solve an eigenvector problem. In situations where computational complexity is the major concern, feature selection methods may be more suitable.

In our experiments, we first calculate the score for each feature, sort the features in descending order of score, and finally select the top N features with the highest scores. For all the feature extraction and selection algorithms, one needs to estimate the optimal dimensionality. In our experiments, we iterate over all dimensions and select the optimal one; that is, all the algorithms are compared at their optimal dimensions.
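The evaluation protocol just described can be sketched as follows. This is not the authors' code; `scores` stands for the output of any of the scoring methods, and the classifier is the plain nearest-neighbor rule mentioned in Section 5.1.

```python
import numpy as np

def nn_error_rate(X_train, y_train, X_test, y_test):
    """1-nearest-neighbor classification error."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    pred = y_train[np.argmin(d2, axis=1)]
    return np.mean(pred != y_test)

def best_error_over_dimensions(scores, X_train, y_train, X_test, y_test):
    """Rank features by score, then sweep the number of selected features N."""
    order = np.argsort(scores)[::-1]                  # descending scores
    errors = []
    for N in range(1, len(scores) + 1):
        idx = order[:N]
        errors.append(nn_error_rate(X_train[:, idx], y_train,
                                    X_test[:, idx], y_test))
    return min(errors)                                # best result over all dimensionalities
```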

5.1. Data preparation

In all the experiments, preprocessing to locate the faces was applied. Original images were normalized (in scale and orientation) such that the two eyes were aligned at the same position. Then, the facial areas were cropped into the final images for matching. The size of each cropped image in all the experiments is 32 × 32 pixels, with 256 gray levels per pixel. Thus, each image can be represented by a 1024-dimensional vector in image space. No further preprocessing is done. In this work, we apply the nearest-neighbor classifier for its simplicity. The number of nearest neighbors in our algorithm, as well as in the Laplacian score, is taken to be 4. It is important to note that the preprocessing applied here is the same as that in [5].

5.2. Recognition on the Extended Yale database B

The Extended Yale face database B contains 16,128 images of 38 human subjects under 9 poses and 64 illumination conditions [10,18]. In this experiment, we choose the frontal pose and use all the images under different illuminations, giving 64 images for each subject. Thirty sample images of one subject are presented in Fig. 1. For each subject, l (= 5, 10, 20, 30) images are randomly selected for training and the rest are used for testing. For each given l, we average the results over 10 random splits. In general, the recognition rate varies with the number of selected features. In all our experiments, we recorded the best recognition rate for each algorithm. Table 1 shows the performance of the baseline, Fisher score, Laplacian score, and our algorithm. For the baseline method, we simply perform nearest-neighbor classification in the original 1024-dimensional space.

Table 1
Face recognition error on the Extended Yale database B

                   5 train (%)   10 train (%)   20 train (%)   30 train (%)
Baseline              69.2           55.5           42.1           34.6
Fisher score          55.4           43.1           32.8           28.7
Laplacian score       62.7           49.1           37.9           31.6
LSDF                  49.4           38.3           29.8           27.2

Fig. 1. Sample face images from the Extended Yale database B. For each subject, we use 64 frontal face images under varying illumination conditions.


Fig. 2. Sample face images from the CMU PIE database. For each subject, there are 170 near-frontal face images under varying pose, illumination, and expression.

Table 2
Face recognition error on the PIE database

                   5 train (%)   10 train (%)   20 train (%)   30 train (%)
Baseline              76.6           64.8           48.6           37.9
Fisher score          61.8           48.7           37.6           24.1
Laplacian score       72.2           61.7           45.1           37.9
LSDF                  55.9           43.4           34.2           22.5

Table 1 shows the recognition rates for the different algorithms. As can be seen, our algorithm achieved a 27.2% recognition error rate when 30 face images per class were used for training, which is the best of the four algorithms. The recognition error rates for the baseline, Fisher score, and Laplacian score are 34.6%, 28.7%, and 31.6%, respectively.

5.3. Recognition on the PIE database

The CMU PIE face database contains 68 subjects with 41,368 face images in total. The face images were captured by 13 synchronized cameras and 21 flashes, under varying pose, illumination, and expression. We choose the five near-frontal poses (C05, C07, C09, C27, and C29) and use all the images under different illuminations and expressions, giving 170 images for each individual. Fig. 2 shows some of the faces with pose, illumination, and expression variations. For each individual, l (= 5, 10, 20, 30) images are randomly selected for training and the rest are used for testing. The experimental design here is the same as before. We first apply the Fisher score, the Laplacian score, and our LSDF algorithm to select the most informative and discriminative features. Recognition is then carried out using the selected features. Table 2 shows the recognition rates for these algorithms. As can be seen, our algorithm performed the best.

5.4. Recognition on the ORL database

The ORL face database is used in this test. It consists of a total of 400 face images of 40 people (10 samples per person). The images were captured at different times and exhibit different variations, including expression (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses). The images were taken with a tolerance for some tilting and rotation of the face of up to 20°. Ten sample images of one individual are displayed in Fig. 3. For each individual, l (= 2, 3, 4, 5) images are randomly selected for training and the rest are used for testing. Table 3 shows the recognition results. As can be seen, our algorithm performed the best. Fisher score performed better than Laplacian score.

5.5. Discussions

Three experiments on three databases have been systematically performed. These experiments reveal a number of interesting points:

(i) Fisher score, Laplacian score, and our LSDF algorithm all performed better on the selected features than on the whole feature set. This indicates that feature selection is useful for face recognition.
(ii) Our algorithm outperformed both Fisher score and Laplacian score. This is because our algorithm takes advantage of both labeled and unlabeled data points, while Fisher score can only use the labeled data points and Laplacian score fails to make use of label information.
(iii) As more face images are used for training, the recognition error rate decreases for all the algorithms. This shows that all the algorithms are capable of learning from examples.

6. Conclusion

A novel semi-supervised feature selection algorithm called LSDF is introduced in this paper. The algorithm is fundamentally based on manifold learning and spectral graph theory. The local geometrical structure and the discriminant structure in the data are captured by two graphs, i.e. the within-class graph and the between-class graph. The importance scores of the features are characterized by their degree of preserving these two graph structures.


Fig. 3. Sample face images from the ORL database. For each subject, there are 10 face images with different facial expressions and details.

Table 3
Face recognition error on the ORL database

                   2 train (%)   3 train (%)   4 train (%)   5 train (%)
Baseline              29.6           21.1           15.5           11.9
Fisher score          26.2           18.8           12.2            9.9
Laplacian score       28.2           20.9           14.6           10.4
LSDF                  24.7           16.7           11.3            9.2

Unlike traditional feature selection methods, such as the Fisher score, which is supervised, and the Laplacian score, which is unsupervised, our algorithm is semi-supervised. It makes use of both labeled and unlabeled data points, and can therefore better find the most discriminative and informative features. Also, our algorithm is a "filter" approach, which is independent of any learning algorithm. This makes it flexible to use as a preprocessing step for the data. We have applied our algorithm to face recognition on the Extended Yale database B, the CMU PIE database, and the ORL database. Our algorithm outperformed both the Laplacian score and the Fisher score on all three databases.

Compared with the Fisher score, our algorithm is computationally more expensive. This is mainly due to the time complexity of high-dimensional nearest-neighbor search, which is the bottleneck of our algorithm. One may consider using an ε-neighborhood to construct the graph; specifically, two data points can be considered neighbors if their distance is less than ε. However, it remains unclear how to select the optimal value of ε.

References

[1] A.U. Batur, M.H. Hayes, Linear subspace for illumination robust face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[2] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[3] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural Information Processing Systems, vol. 14, 2001.
[4] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1624–1637.
[5] D. Cai, X. He, J. Han, H.-J. Zhang, Orthogonal Laplacianfaces for face recognition, IEEE Trans. Image Process. 15 (11) (2006) 3608–3614.
[6] D. Cai, X. He, K. Zhou, J. Han, H. Bao, Locality sensitive discriminant analysis, in: International Joint Conference on Artificial Intelligence, 2007.
[7] D. Cai, X. He, K. Zhou, J. Han, H. Bao, Locality sensitive discriminant analysis, in: International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007.

[8] F.R.K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics, vol. 92, 1997.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, Hoboken, NJ, 2000.
[10] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.
[11] R. Gross, J. Shi, J. Cohn, Where to go with face recognition, in: Third Workshop on Empirical Evaluation Methods in Computer Vision, Kauai, Hawaii, December 2001.
[12] Y. Guo, T. Hastie, R. Tibshirani, Regularized linear discriminant analysis and its application in microarrays, Biostatistics 8 (1) (2007) 86–100.
[13] X. He, Incremental semi-supervised subspace learning for image retrieval, in: Proceedings of the ACM Conference on Multimedia, New York, October 2004.
[14] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Advances in Neural Information Processing Systems 18, Vancouver, Canada, 2005.
[15] X. He, P. Niyogi, Locality preserving projections, Adv. Neural Inf. Process. Syst. 16 (2003).
[16] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[17] R. Kohavi, G. John, Wrappers for feature subset selection, Artif. Intell. 97 (1–2) (1997) 273–324.
[18] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.
[19] A. Levin, A. Shashua, Principal component analysis over continuous subspaces and intersection of half-spaces, in: Proceedings of the European Conference on Computer Vision, 2002.
[20] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 228–233.
[21] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[22] T. Shakunaga, K. Shigenari, Decomposed eigenface for face recognition under various lighting conditions, in: IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, December 2001.
[23] L. Sirovich, M. Kirby, Low-dimensional procedure for the characterization of human faces, J. Opt. Soc. Am. 4 (1987) 519–524.
[24] Z. Su, S. Li, H.-J. Zhang, Extraction of feature subspace for content-based retrieval using relevance feedback, in: Proceedings of the Ninth ACM International Conference on Multimedia, 2001.
[25] D.L. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831–836.
[26] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[27] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
[28] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, 1995.
[29] X. Wang, X. Tang, Random sampling LDA for face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, 2004.

Jidong Zhao received his B.S. degree in 1998 and his M.S. degree in Computer Engineering in 2003 from the University of Electronic Science and Technology of China, where he is currently a Ph.D. student. His research interests include computer networks and multimedia information retrieval.

Ke Lu received his B.S. degree in thermal power engineering from Chongqing University, China, in 1996, and his Ph.D. in Computer Engineering in 2006 from the University of Electronic Science and Technology of China, where he is currently a lecturer. His research interests include pattern recognition and multimedia information retrieval.


Xiaofei He received his B.S. degree from Zhejiang University, China, in 2000 and his Ph.D. from the University of Chicago in 2005, both in computer science. He is currently a research scientist at Yahoo!. His research interests are machine learning, information retrieval, and computer vision.