Sparse Representation using Nonnegative Curds and Whey

Yanan Liu, Fei Wu, Zhihua Zhang, Yueting Zhuang
College of Computer Science and Technology, Zhejiang University, China
{liuyn, wufei, zhzhang, yzhuang}@cs.zju.edu.cn

Shuicheng Yan
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
[email protected]

Abstract


It has been of great interest to find sparse and/or nonnegative representations in the computer vision literature. In this paper we propose a novel method for this purpose and refer to it as nonnegative curds and whey (NNCW). The NNCW procedure consists of two stages. In the first stage we compute a set of sparse and nonnegative representations of a test image, each of which is a linear combination of the images within a certain class, by solving a set of regression-type nonnegative matrix factorization problems. In the second stage we combine these representations into a new sparse and nonnegative representation by using the group nonnegative garrote. This procedure is particularly appropriate for discriminant analysis owing to its supervised and nonnegative nature in sparsity pursuit. Experiments on several benchmark face databases and the Caltech 101 image dataset demonstrate the efficiency and effectiveness of our nonnegative curds and whey method.

1. Introduction

Figure 1. An exemplar illustration of sparse representation using nonnegative curds and whey.

The problem of finding a sparse representation of data has become a topic of great interest in computer vision and pattern recognition. The essential challenge in sparse representation is to develop an efficient approach with which each sample can be reconstructed from its sparse representation. Nonnegative matrix factorization (NMF) [12, 13] is an important technique for finding such a representation. It is well established that NMF is able to produce such a sparse representation in a collective way [10, 11]. Moreover, the nonnegativity constraint makes the representation easy to interpret, owing to the purely additive combinations of nonnegative basis vectors. The NMF technique has been successfully applied in computer vision and pattern recognition, especially for image analysis [12]. Many of these applications are under an unsupervised setting and thus ignore the correlations within the same class and the disparity between different classes. Under the supervised setting, however, NMF can be regarded as a nonnegative garrote [2].

In this paper we consider a supervised setting for image representation as well as image classification. Given the empirically validated discriminative power of sparse representation in classification, a test image outside the training set should ideally be representable in terms of the training images alone, with nonzero coefficients only for those samples that belong to the same class as the test image. That is, a valid test image can be sufficiently represented by the training samples from its own class. Sparse representation can expedite classification when the number of classes is reasonably large: the sparser the coefficients are, the more easily the test sample is accurately assigned to its class label. Therefore, when the test image is expressed as a linear superposition of all the training images, the coefficient vector is expected to be sparse and nonnegative.

In particular, we model the sparse representation by using a nonnegative curds and whey (NNCW) method. The key idea is to take advantage of the similarity within the same class and the disparity between different classes, formulating the classification problem as two consecutive linear regressions. That is to say, a test image is represented as a nonnegative weighted combination of all the training images. For this combination, we introduce two sets of sparse nonnegative weight coefficients: one for each training image within a certain class and another for each class. Our work is motivated by the recent work of Wright et al. [21], which casts the face recognition problem as a linear regression problem with sparsity constraints on the regression coefficients. To solve the regression problem, Wright et al. [21] reformulated it as the lasso problem [19]. Lasso-based sparse representation has also been used for image annotation with multiple tags [20], classification [7], and clustering [6]. For example, Wang et al. [20] proposed a multi-label sparse coding framework for automatic image annotation, which takes advantage of $\ell_1$-norm based reconstruction coefficients. In [7], an empirical Bayesian approach to sparse regression and classification is presented, which does not involve any parameters controlling the degree of sparseness. Elhamifar and Vidal [6] introduced a sparse representation-based method to cluster data drawn from multiple low-dimensional subspaces embedded in a high-dimensional space. However, the lasso does not keep the representation purely additive, so the representation may not be as interpretable as that of NMF. Moreover, the class labels, i.e., the discriminant information in the training set, are not explicitly incorporated when constructing the sparse representation, which may limit the ultimate classification accuracy. Our proposed method circumvents these limitations, since its two linear regression steps both utilize the discriminative class information and impose a nonnegativity constraint on each coefficient. Beyond the image classification problem in question, nonnegative curds and whey is also related to the group nonnegative garrote [22], a grouped extension of the conventional nonnegative garrote. In the group nonnegative garrote, the regression coefficients for individual variables are estimated by least squares, so these coefficients are not necessarily nonnegative or sparse. Figure 1 illustrates the overall procedure of the proposed NNCW method. Intuitively, lasso-based representation methods learn only one regression model, without utilizing any discriminative label information or placing nonnegativity constraints on the coefficients during convex optimization. In contrast, NNCW first obtains m independent representations (called curds), one from each class, and then uses the curds to define a new representation (called whey). These two kinds of regression models are constructed in sequence, and the latter step directly outputs the class label information.

The rest of this paper is organized as follows. In Section 2 we present the details of the nonnegative curds and whey (NNCW) method for image representation and classification. Section 3 reviews related work. Experimental results are reported in Section 4. Finally, we conclude this work in Section 5.

2. Methodology

Given a set of $n$ training samples $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, where $x_i$ is a $d$-dimensional feature vector representing an image, we assume the images are grouped into $m$ disjoint classes and each $x_i$ belongs to one and only one class. Let $n_j$ be the cardinality of the $j$th class, so that $\sum_{j=1}^{m} n_j = n$. Without loss of generality, we collect the samples of the $j$th class into a $d \times n_j$ matrix $X_j$ and accordingly form the $d \times n$ training data matrix $X = [X_1, \ldots, X_m]$. Our goal is to train a classifier such that, once the sparse representation of a test image $y \in \mathbb{R}^d$ has been constructed from the training data, we can predict its corresponding label. The basic idea is to devise a sparse representation approach for the development of classifiers.

Before formally presenting our method, we give some notation. For a $p \times 1$ vector $a = (a_1, \ldots, a_p)^T$, we denote by $\|a\|_2$ the $\ell_2$-norm of $a$ (i.e., $\|a\|_2 = \sqrt{\sum_{j=1}^{p} a_j^2}$), by $\|a\|_1$ the $\ell_1$-norm of $a$ (i.e., $\|a\|_1 = \sum_{j=1}^{p} |a_j|$), and by $\|a\|_0$ the $\ell_0$-norm of $a$ (i.e., the number of nonzero entries of $a$).

The sparse representation-based classification approach proposed in this paper learns two coupled linear regression models with nonnegativity constraints on the coefficients, under a supervised learning framework. The point of our approach is to exploit discriminative information so as to make the classifier an additive, and therefore more interpretable, model.
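To make the data layout concrete, the following sketch (our own illustrative helper, assuming NumPy; `group_by_class` is a hypothetical name) builds the per-class blocks $X_j$ from a labeled training matrix whose columns are images:

```python
import numpy as np

def group_by_class(X, labels):
    """Split a d x n data matrix into per-class blocks X_1, ..., X_m.

    X      : (d, n) array with one training image per column.
    labels : length-n integer array of class indices in {0, ..., m-1}.
    Returns the list [X_1, ..., X_m], where X_j is d x n_j.
    """
    m = int(labels.max()) + 1
    return [X[:, labels == j] for j in range(m)]

# Toy example: n = 5 two-pixel "images" in m = 2 classes.
X = np.array([[0.9, 0.8, 0.1, 0.2, 0.15],
              [0.1, 0.2, 0.9, 0.8, 0.85]])
labels = np.array([0, 0, 1, 1, 1])
X_blocks = group_by_class(X, labels)  # n_1 = 2, n_2 = 3, and n_1 + n_2 = n
```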

2.1. Nonnegative Curds and Whey Procedure

Our proposed sparse representation approach consists of two stages. In the first stage we consider $m$ linear regression models by treating $y$ as the response and each image from $X_j$ as a basis vector. That is, the $j$th regression problem is based on
$$y = X_j b_j + \epsilon_j, \tag{1}$$
where $\epsilon_j$ is an error term and $b_j = (b_{j1}, \ldots, b_{j n_j})^T \in \mathbb{R}^{n_j \times 1}$ for $j = 1, \ldots, m$. Recall that $y$ and the $x_i$ represent test and training images respectively, so they are typically encoded as nonnegative values. The idea behind nonnegative matrix factorization for learning parts-based representations [12] inspires us to impose nonnegativity on the regression vectors $b_j$.

As a result, we have the following optimization problems, for $j = 1, \ldots, m$:
$$\operatorname*{argmin}_{b_j} \; \frac{1}{2}\|y - X_j b_j\|_2^2 + \lambda_j \sum_{l=1}^{n_j} b_{jl}, \quad \text{s.t. } b_{jl} \ge 0, \; \forall l, \tag{2}$$
where the $\lambda_j \ge 0$ are tunable weighting parameters. For $j = 1, \ldots, m$, each optimization problem in (2) is a nonnegative garrote model [2]. The nonnegative garrote can be efficiently solved by classical numerical methods such as least angle regression (LARS) [5] and the pathwise coordinate method [8]. However, we follow Breiman's original implementation [2] to solve the optimization problems. The approach used in [2] is to shrink each ordinary least squares (OLS) estimated coefficient by a nonnegative factor, subject to an upper bound on the sum of these factors (the garrote).

Let $\hat{b}_j = (\hat{b}_{j1}, \ldots, \hat{b}_{j n_j})^T$ be the estimate of $b_j$. Since $\sum_{l=1}^{n_j} b_{jl} = \|b_j\|_1$ under the nonnegativity constraint, the $\hat{b}_j$ should be sparse, and this leads to $m$ sparse representations of $y$, which we express as $z_j = X_j \hat{b}_j$ for $j = 1, \ldots, m$. Since each $\hat{b}_j$ contains the reconstruction coefficients learned from the samples within the $j$th class, the $\hat{b}_j$ taken together capture the differences between the classes.

Therefore, to capture the class label information in the training samples and further exploit the disparity between different classes, we consider the following optimization problem:
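The curds stage (2) can be solved with any nonnegative solver. As a minimal sketch (our own projected-gradient routine, not Breiman's garrote implementation used in the paper; `nn_sparse_coef` is a hypothetical name):

```python
import numpy as np

def nn_sparse_coef(Xj, y, lam, n_iter=500):
    """Solve argmin_b 0.5*||y - Xj b||_2^2 + lam * sum(b) s.t. b >= 0
    by projected gradient descent (an illustrative substitute for
    Breiman's original garrote implementation).

    Xj  : (d, n_j) matrix whose columns are the class-j training images.
    y   : (d,) test image.
    lam : nonnegative weighting parameter lambda_j.
    """
    b = np.zeros(Xj.shape[1])
    # Step size 1/L, with L the Lipschitz constant of the quadratic term.
    L = np.linalg.norm(Xj, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        grad = Xj.T @ (Xj @ b - y) + lam   # gradient of the objective for b >= 0
        b = np.maximum(0.0, b - grad / L)  # project onto the nonnegative orthant
    return b
```

The $m$ sparse curds are then obtained as `z_j = Xj @ nn_sparse_coef(Xj, y, lam_j)`.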

$$\operatorname*{argmin}_{c_1, \ldots, c_m} \; \frac{1}{2}\Big\|y - \sum_{j=1}^{m} c_j z_j\Big\|_2^2 + \lambda \sum_{j=1}^{m} p_j c_j, \quad \text{s.t. } c_j \ge 0, \; \forall j, \tag{3}$$
where $\lambda \ge 0$ is a tunable weighting parameter and the $p_j > 0$ are degrees of penalty. In all experiments we set $p_j = n_j / n$. The optimization problem in (3) is also a nonnegative garrote model, which we again solve using Breiman's implementation [2]. The second stage thus refines the representation of $y$ by using the representations of $y$ obtained in the first stage. In particular, $y$ is now represented as
$$u = \sum_{j=1}^{m} \hat{c}_j z_j = \sum_{j=1}^{m} \hat{c}_j X_j \hat{b}_j. \tag{4}$$

As $\sum_{j=1}^{m} p_j c_j$ can be regarded as a weighted $\ell_1$-norm of $c = [c_1, c_2, \ldots, c_m]^T$, some of the $\hat{c}_j$ are zero, and the representation is sparse. Moreover, if $\hat{c}_j = 0$ for some $j \in \{1, \ldots, m\}$, then all samples from the $j$th class are eliminated from the representation, since $\hat{c}_j \hat{b}_j = 0$, and the test image $y$ therefore apparently does not belong to the $j$th class.

The optimization problems in (2) define $m$ independent representations of $y$, which we call "curds". The optimization problem in (3) then takes advantage of these $m$ curds to define a new representation, which we call "whey". Since we impose nonnegativity constraints on the $b_j$ as well as the $c_j$, we refer to our method for sparse image representation as nonnegative curds and whey (NNCW).

Algorithm 1 NNCW (nonnegative curds and whey)
1: procedure NNCW($\{X_1, \ldots, X_m\} \subset \mathbb{R}^d$; $y \in \mathbb{R}^d$)
2: Curds: For $j = 1, \ldots, m$, solve
   $\operatorname*{argmin}_{b_j} \frac{1}{2}\|y - X_j b_j\|_2^2 + \lambda_j \sum_{l=1}^{n_j} b_{jl}$, s.t. $b_{jl} \ge 0, \; \forall l$, and set $z_j = X_j \hat{b}_j$.
3: Whey: Solve
   $\operatorname*{argmin}_{c_1, \ldots, c_m} \frac{1}{2}\|y - \sum_{j=1}^{m} c_j z_j\|_2^2 + \lambda \sum_{j=1}^{m} p_j c_j$, s.t. $c_j \ge 0, \; \forall j$.
4: Output: the class label $k$ of the test sample $y$, with $k = \operatorname*{argmin}_j \|y - \hat{c}_j X_j \hat{b}_j\|_2^2$.
5: end procedure
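The whey problem (3) has the same form as (2), with the curds $z_j$ as predictors and a weighted penalty, so it only needs the weights $p_j$ threaded through the gradient. A sketch under the same projected-gradient assumption as above (`whey_weights` is a hypothetical name):

```python
import numpy as np

def whey_weights(Z, y, lam, p, n_iter=500):
    """Solve argmin_c 0.5*||y - Z c||_2^2 + lam * sum(p_j * c_j) s.t. c >= 0.

    Z : (d, m) matrix whose columns are the curds z_j = X_j b_hat_j.
    p : (m,) penalty weights; the paper uses p_j = n_j / n throughout.
    """
    c = np.zeros(Z.shape[1])
    L = np.linalg.norm(Z, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        grad = Z.T @ (Z @ c - y) + lam * p  # the weighted l1 term contributes lam * p
        c = np.maximum(0.0, c - grad / L)
    return c
```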

2.2. Classification Procedure

Given the test sample $y$ and its corresponding $\hat{b}_j$ and $\hat{c}_j$ for $j = 1, \ldots, m$, obtained from the NNCW method, we are now concerned with the class label of $y$. Ideally, the nonzero $\hat{c}_j$ indicates the class to which the test sample $y$ belongs. However, it is not always the case that there is only one nonzero $\hat{c}_j$. Thus, we allocate $y$ to the $k$th class with
$$k = \operatorname*{argmin}_j \; \|y - \hat{c}_j X_j \hat{b}_j\|_2^2. \tag{5}$$
We summarize the entire NNCW method in Algorithm 1.
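Putting the pieces together, a compact illustrative rendering of Algorithm 1, including the labeling rule (5), reusing the hypothetical `nn_sparse_coef` and `whey_weights` sketches from above:

```python
import numpy as np

def nncw_classify(X_blocks, y, lams, lam, n):
    """Classify a test image y with NNCW (illustrative sketch).

    X_blocks : list of per-class matrices [X_1, ..., X_m], each (d, n_j).
    lams     : per-class curds parameters lambda_j; lam : whey parameter.
    n        : total number of training samples.
    """
    # Curds: one sparse nonnegative regression per class.
    b_hats = [nn_sparse_coef(Xj, y, lj) for Xj, lj in zip(X_blocks, lams)]
    Z = np.column_stack([Xj @ bj for Xj, bj in zip(X_blocks, b_hats)])
    # Whey: sparse nonnegative class weights with penalties p_j = n_j / n.
    p = np.array([Xj.shape[1] for Xj in X_blocks]) / n
    c_hat = whey_weights(Z, y, lam, p)
    # Rule (5): assign y to the class whose weighted curd reconstructs it best.
    residuals = [np.sum((y - cj * (Xj @ bj)) ** 2)
                 for cj, Xj, bj in zip(c_hat, X_blocks, b_hats)]
    return int(np.argmin(residuals))
```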

3. Related Work

A so-called "curds and whey" method was first proposed by Breiman and Friedman as a form of multivariate shrinkage [3]. However, the main purpose of the method in [3] is to improve predictive accuracy in multiple linear regression by exploiting correlations between the response variables. As discussed above, NNCW instead first makes use of intra-class information to generate m independent representations (curds), and then utilizes inter-class information to generate a linear regression model (whey) for discriminative learning.

To some extent, our proposed NNCW can be regarded as a variant of the group lasso [22]. The group lasso is a natural extension of the lasso in which the covariates are assumed to be clustered in groups; intuitively, it drives all the weights in one group to zero together and thus leads to group selection. Unlike NNCW, which places nonnegativity constraints on the coefficients of each training sample and each class, the group lasso imposes no nonnegativity constraints on the coefficients. More specifically, NNCW is closely related to the group nonnegative garrote [22]. The main difference is that the group nonnegative garrote instead uses $z_j = X_j b_j^{LS}$, where $b_j^{LS}$ is the least squares estimate. In this case $z_j$ is not guaranteed to be nonnegative, even though $X_j$ is, so $z_j$ may no longer represent a real image. Moreover, owing to its explicit dependence on the full least squares estimates, the group nonnegative garrote may perform suboptimally when the sample size is small relative to the total number of variables. As a consequence, the group nonnegative garrote is not robust to image noise and occlusion, so we do not compare our algorithm with it and focus on sparsity-related algorithms instead.
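The contrast is easy to see numerically: in the toy sketch below (our own example), the least squares coefficients $b_j^{LS}$ can be negative, so $z_j = X_j b_j^{LS}$ need not be nonnegative even though $X_j$ and $y$ are.

```python
import numpy as np

rng = np.random.default_rng(0)
Xj = rng.random((4, 3))   # a nonnegative 4 x 3 class block
y = rng.random(4)         # a nonnegative test vector
b_ls, *_ = np.linalg.lstsq(Xj, y, rcond=None)  # unconstrained least squares
zj = Xj @ b_ls
print(b_ls)  # entries may be negative ...
print(zj)    # ... and then z_j may fail to represent a real (nonnegative) image
```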

It is worth pointing out that the sparse representation in [21] seeks to solve the following problem:
$$\operatorname*{argmin}_{\beta} \; \|\beta\|_0, \quad \text{subject to } X\beta = y, \tag{6}$$

where $\beta = (b_1, \ldots, b_m)^T$. However, this problem is NP-hard [1]. Based on the sparsity theory of Donoho [4], Wright et al. [21] thus consider the following alternative:
$$\operatorname*{argmin}_{\beta} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{m} \sum_{i=1}^{n_j} |b_{ji}|, \tag{7}$$

which is essentially the lasso model. On the one hand, this sparse model for classification does not consider the discriminative class information, which is clearly useful for classification; in many regression problems we are interested in identifying important groups of factors for predicting the categorical information, where each factor may be represented by a group of derived variables. On the other hand, since $\beta$ is not constrained to be nonnegative, such sparse representations lack the interpretability of NMF and NNCW. The strength of this work is to integrate sparse coding, nonnegative data factorization, and supervised learning in the NNCW framework. The two inseparable linear regression models encode the similarity and disparity information useful for data classification. Moreover, the nonnegativity constraints and the natural sparsity of NNCW make it more interpretable.
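For reference, the lasso baseline (7) can be reproduced with scikit-learn, assuming it is available; note that sklearn's `Lasso` divides the quadratic loss by twice the number of rows of the design matrix (here the $d$ pixels), so `alpha` must be rescaled accordingly:

```python
from sklearn.linear_model import Lasso

def lasso_code(X, y, lam):
    """Solve (7): argmin_beta 0.5*||y - X beta||_2^2 + lam*||beta||_1.

    X : (d, n) matrix of all training images as columns; for sklearn the d
        pixels play the role of samples, hence alpha = lam / d below.
    """
    model = Lasso(alpha=lam / X.shape[0], fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return model.coef_  # beta may contain negative entries, unlike NNCW
```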

4. Experiments

In this section we investigate the applications of our nonnegative curds and whey (NNCW) method to face recognition and image classification. We compare NNCW with three popular classification methods, nearest neighbor (NN), naive Bayes (NB), and linear support vector machines (SVM), as well as with the sparse representation-based classification (SRC) proposed by Wright et al. [21] and the group lasso (gLasso) [22]. The tuning parameters $\lambda$ and $\lambda_j$ ($j = 1, \ldots, m$) are selected by 10-fold cross-validation to avoid over-fitting. Four face databases and one image dataset were used. The face databases are the ORL database [17], the Extended Yale B database [9], the AR face database [16], and the CMU PIE face database [18]; these four databases focus on frontal faces, illumination variations, occlusions, and pose variations, respectively. We also conducted experiments on the Caltech 101 image dataset [15].
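A sketch of how such a selection could be organized (an illustrative grid search reusing the hypothetical `group_by_class` and `nncw_classify` helpers from Section 2; for brevity all $\lambda_j$ are tied to a single value, which the paper does not require):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_select_lambda(X, labels, grid, n_splits=10):
    """Choose lambda by 10-fold cross-validation (illustrative sketch).

    Assumes every class appears in each training fold, and ties the
    per-class lambda_j to the single candidate value lam.
    """
    best_lam, best_acc = None, -1.0
    for lam in grid:
        correct, total = 0, 0
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for tr, te in kf.split(X.T):  # split over the n training columns
            blocks = group_by_class(X[:, tr], labels[tr])
            lams = [lam] * len(blocks)
            for i in te:
                pred = nncw_classify(blocks, X[:, i], lams, lam, len(tr))
                if pred == labels[i]:
                    correct += 1
                total += 1
        if correct / total > best_acc:
            best_lam, best_acc = lam, correct / total
    return best_lam
```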

4.1. Visualization on Face Dataset

We first give a visual comparison of nonnegative curds and whey (NNCW) with lasso-based sparse representation (LSR) and the group lasso (gLasso) on a subset of the ORL dataset. We chose 10 persons from the ORL database and used 9 images per subject as training data; the one remaining image per person is treated as the test sample. Figures 2, 3 and 4 respectively show the visualization results of the sparse representations obtained by LSR, gLasso and our NNCW for a test sample from the first individual. Figure 4(a) shows the first stage of NNCW, which computes $z_j$, $1 \le j \le 10$, and Figure 4(b) illustrates the second stage of NNCW, which refines the sparse representation of $y$ from the first stage. We can see that by incorporating the class label information in the second stage of NNCW, the optimized sparse estimates of the class weights $c_j$ ($1 \le j \le 10$) lead to very sparse coefficients for the test sample $y$. Specifically, after the NNCW computation we obtain an optimized $c_1$ (i.e., 0.86) and $b_1$ (i.e., [0, 0, 0, 0, 0.58, 0, 0, 0.01, 0]) to estimate $y$; these two sets of estimated parameters reconstruct $y$ effectively. In Figure 2, except for the first class, to which the test sample belongs, the weights of the other classes calculated by LSR are not sparse enough, which is why the reconstruction of $y$ is not as good as that of NNCW. Besides, although gLasso selects the right class in Figure 3, the reconstruction result of NNCW is much better than that of gLasso. A possible explanation is that the negative coefficients of gLasso introduce negative visual effects into the representation of the images.

Figure 2. Visualization of LSR on the sampled ORL dataset. The test image y belongs to subject 1.

Figure 3. Visualization of gLasso on the sampled ORL dataset. The test image y belongs to subject 1.

4.2. Recognition on frontal faces

The ORL database consists of 400 face images of 40 people (10 samples per person). The 92×112-pixel images were captured at different times and exhibit various changes, including expression (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses). To compute the recognition rate, the images are downsampled to 48, 99, 220, and 644 feature dimensions, corresponding to downsampling ratios of 1/15, 1/10, 1/6, and 1/4, respectively. For each subject, 6 images are randomly selected for training and the rest are used for testing.
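The repeated downsampling-to-feature-vectors step can be read as per-axis subsampling followed by flattening; a minimal sketch (one plausible reading of the stated ratios, since the exact dimensions depend on rounding conventions):

```python
def downsample_image(img, k):
    """Keep every k-th pixel along each axis and flatten to a feature vector.

    img : (h, w) grayscale image array; k : per-axis subsampling factor.
    An illustrative stand-in for the downsampling used in the experiments.
    """
    return img[::k, ::k].reshape(-1)

# A 92 x 112 ORL image with k = 10 yields a 10 x 12 = 120-dimensional vector
# under this convention; the paper reports 99 dimensions for the 1/10 ratio.
```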

Figure 5. Comparison of face recognition accuracy rates on the ORL database.

Figure 5 shows the face recognition results on the ORL database using the six different classification methods. It can be seen that all six algorithms achieve good performance, since the ORL database contains almost exclusively frontal faces with little pose or illumination variation. The proposed NNCW achieves the best recognition accuracy rate of 96.53%, compared to 96.04% for gLasso, 95.47% for SRC, 93.64% for SVM, 94.38% for NB, and 93.44% for NN.

4.3. Recognition with illumination variations

The Extended Yale B database consists of 2414 frontal-face images of 38 persons. The cropped 168×192 face images were captured under various illumination conditions [14]. The illumination type is determined uniquely by azimuthal and elevation angles, where the azimuth ranges from −130° to 130° and the elevation ranges from −40° to 90°. We first randomly select half of the images (about 32 images per individual) for training and use the other half for testing, as in [21]. The images are downsampled to 30, 56, 120, and 504 feature dimensions, corresponding to downsampling ratios of 1/32, 1/24, 1/16, and 1/8, respectively. From Figure 6, we can see that NNCW improves the highest recognition accuracy rate to 95.29%, compared with 80.63% for NN, 87.25% for NB, 85.54% for SVM, 94.31% for SRC, and 94.58% for gLasso.

Figure 4. Visualization of NNCW on the sampled ORL dataset: (a) the first stage of NNCW; (b) the second stage of NNCW. The test image y belongs to subject 1.

Figure 6. Comparison of face recognition accuracy rates on the Extended Yale B face database.

Secondly, we divide the images (with 504 feature dimensions) into five subsets of increasing azimuthal illumination angle: the frontally illuminated images are used as the training set, and test sets 1, 2, 3, and 4 contain the face images captured under illumination conditions with light angles varying from 5° to 15°, from 20° to 45°, from 50° to 75°, and from 75° to 130°, respectively. Figure 7 illustrates the face recognition accuracy rates under the different illumination conditions. We can see that the recognition accuracy rates decline as the lighting angle increases, which indicates that illumination variations affect face recognition performance, especially for the nearest neighbor (NN) classifier. Our NNCW achieves recognition accuracy rates between 87.22% and 98.75%, much better than the other methods.

Figure 7. Comparison of face recognition accuracy rates under different illumination conditions on the Extended Yale B database.

4.4. Recognition with occlusions

The AR face database comprises over 4,000 color images of the faces of 126 people (70 male and 56 female). The dataset includes frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). Each person participated in two sessions separated by two weeks (14 days), and 26 pictures in total were taken. In this experiment, as in [21], we first chose a subset of the database consisting of 50 male and 50 female individuals. For each individual, the 13 images from Session 1 were selected for training and the 13 images from Session 2 for testing. The images were cropped to 120×165 pixels and converted to grayscale, then downsampled to 30, 54, 130, and 540-dimensional feature spaces, with downsampling ratios of 1/24, 1/18, 1/12, and 1/6, respectively. Figure 8 shows the recognition accuracy rates for this experiment. NNCW achieves a recognition accuracy rate of 90.15% with 540-dimensional features, higher than the other methods, e.g., 78.63% for NN, 83.32% for NB, 84.18% for SVM, 88.33% for SRC, and 88.98% for gLasso.

Figure 8. Comparison of face recognition accuracy rates on the AR database.

Moreover, we test the classification performance under different occlusions on a subset (70 males and 55 females, excluding subject women-027 owing to the corrupted image w-027-14.bmp) of the AR face database. We use 1750 unoccluded frontal face images (14 per person) as the training set. Test set 1, with sunglasses occlusion, contains 750 images (6 per person), and test set 2, with scarf occlusion, also contains 750 images (6 per person). Table 1 lists the face recognition accuracy rates of the six methods in the two occlusion scenarios (sunglasses and scarves). We can see that NNCW achieves the best recognition accuracy rates under both occlusion conditions, although with scarf occlusions the overall accuracy rates are not high.

Method   Sunglasses   Scarves
NN       69.87%       14.12%
NB       75.39%       40.66%
SVM      81.33%       45.48%
SRC      86.28%       59.21%
gLasso   86.93%       61.37%
NNCW     88.44%       62.19%

Table 1. Comparison of face recognition accuracy rates with different occlusions (sunglasses and scarves) on the AR database.

4.5. Recognition under different poses

In this experiment, we evaluate the six methods under different poses using the CMU PIE face database. We use the frontal faces (c27) as training data and four test sets with increasing pose-angle variations: test set 1 (c05, c29), test set 2 (c11, c37), test set 3 (c02, c14), and test set 4 (c22, c34).

Figure 9. Comparison of face recognition accuracy rates on the CMU PIE database.

Figure 9 shows the face recognition results of the six different methods. Test set 1 contains the most nearly frontal images, so its recognition accuracy rates are the best. The results on test sets 2 and 3 are worse, since their pose-angle variations are larger than those of test set 1. Test set 4 consists almost entirely of profile faces, which results in the worst recognition accuracy rates. The proposed NNCW method still achieves better performance than the other methods.

4.6. Image classification

The Caltech 101 image database contains 9197 images from 101 object categories, collected from Google image search by Li et al. [15]. Most objects are centered and in the foreground. In order to make a robust comparison, we selected the 50 categories containing the most sample images, with category sizes ranging from 60 to 800. We then randomly chose 50 images per category for training, and the remaining images were used as test samples. We downsampled the images to 100 (10×10), 225 (15×15), 400 (20×20), and 625 (25×25)-dimensional feature vectors. From Figure 10, we observe that NNCW outperforms the other methods for image classification on the Caltech 101 image database. The classification accuracies on Caltech 101 are not as good as those on the face databases, since general image classification is more complicated and challenging.

Figure 10. Comparison of image classification accuracies on the Caltech 101 image database.

4.7. Exploring sparseness

To verify the "sparse" property of the proposed NNCW method, we also investigate the "sparsity ratio" of the estimated $\hat{b}$, defined as
$$\text{Sparsity ratio} = \frac{\text{number of zero entries of } \hat{b}}{\text{number of entries of } \hat{b}}.$$
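Computed directly, this is a one-liner (counting numerically zero entries under a small tolerance):

```python
import numpy as np

def sparsity_ratio(b_hat, tol=1e-12):
    """Fraction of (numerically) zero entries in the estimated coefficients."""
    return float(np.mean(np.abs(b_hat) <= tol))
```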

Table 2 lists the average sparsity ratios of the estimated coefficients $\hat{b}$ on the different databases for SRC, gLasso and NNCW. As can be seen, all the sparsity ratios are larger than 0.5 for SRC and larger than 0.6 for gLasso and NNCW. This indicates that the sparse representation takes effect in the classification task and that NNCW generally achieves better sparsity than SRC and gLasso.

Database          SRC    gLasso   NNCW
ORL               0.78   0.82     0.85
Extended Yale B   0.75   0.79     0.83
AR                0.74   0.78     0.80
PIE               0.72   0.73     0.77
Caltech 101       0.59   0.61     0.64

Table 2. Comparison of the average sparsity ratios of SRC, gLasso and NNCW on different databases.

5. Conclusions and Future Work

This paper proposed a novel sparse nonnegative image representation method, called nonnegative curds and whey (NNCW). The NNCW method is attractive owing to its natural sparsity along with its nonnegativity property and discriminating capability. The NNCW method consists of a set of nonnegative garrote models, which are solved using the numerical approach developed by Breiman [2]. In recent years, more sophisticated approaches to the nonnegative garrote have appeared, such as least angle regression [5] and the pathwise coordinate method [8]. It would be interesting to implement our method via these approaches in future work.

Acknowledgements

This work is supported by the 973 Program (2009CB320801), the National Natural Science Foundation of China (90920303), the National Key Technology R&D Program (2007BAH11B01), and the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT).

References

[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 1998.
[2] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 1995.
[3] L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression (with discussion). J. R. Statist. Soc. B, 1997.
[4] D. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Comm. Pure Appl. Math., 2006.
[5] B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani. Least angle regression. Ann. Statist., 2004.
[6] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[7] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE TPAMI, 2003.
[8] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 2007.
[9] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, 2001.
[10] P. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 2004.
[11] J. Kim and H. Park. Sparse nonnegative matrix factorization for clustering. Technical Report GT-CSE-08-01, Georgia Institute of Technology, 2008.
[12] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999.
[13] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[14] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE TPAMI, 2005.
[15] F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[16] A. Martinez and R. Benavente. The AR face database. Technical Report 24, CVC, 1998.
[17] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In 2nd IEEE Workshop on Applications of Computer Vision, 1994.
[18] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE TPAMI, 2003.
[19] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 1996.
[20] C. Wang, S. Yan, L. Zhang, and H. J. Zhang. Multi-label sparse coding for automatic image annotation. In CVPR, 2009.
[21] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE TPAMI, 2009.
[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 2006.