Robust Nonnegative Matrix Factorization via L1 Norm Regularization

arXiv:1204.2311v1 [cs.LG] 11 Apr 2012

Bin Shen, Luo Si, Rongrong Ji, Baodi Liu
[email protected]
May 10, 2014

Abstract

Nonnegative Matrix Factorization (NMF) is a widely used technique in many applications such as face recognition and motion segmentation. It approximates nonnegative data in an original high-dimensional space with a linear representation in a low-dimensional space, using the product of two nonnegative matrices. In many applications, data are partially corrupted with large additive noise. When the positions of the noise are known, some existing variants of NMF can be applied by treating the corrupted entries as missing values. However, in many real-world applications the positions are unknown, which prevents the use of traditional NMF or other existing variants of NMF. This paper proposes a Robust Nonnegative Matrix Factorization (RobustNMF) algorithm that explicitly models partial corruption as large additive noise without requiring information about the positions of the noise. In practice, large additive noise can be used to model outliers. In particular, the proposed method jointly approximates the clean data matrix with the product of two nonnegative matrices and estimates the positions and values of the outliers/noise. An efficient iterative optimization algorithm with a solid theoretical justification is proposed to learn the desired matrix factorization. Experimental results demonstrate the advantages of the proposed algorithm.

1 Introduction

Nonnegative Matrix Factorization (NMF) has been widely applied in many applications such as face recognition [1] and motion segmentation [2]. NMF has received substantial attention due to its theoretical interpretability and practical performance. Several variants of NMF have been proposed recently to improve performance. Sparseness constraints have been incorporated into NMF to obtain sparse solutions [3, 4]. The NMF algorithms in [5, 6] preserve the local structure of the low-dimensional manifold(s). To be robust to outliers, [7] proposes RSNMF, which is based on an outlier-resistant objective function, and [8] maintains an outlier list in NMF for more robust performance.

Figure 1: Large Additive Noise/Partial Corruption/Outlier

In real applications, data samples are often partially corrupted (e.g., salt-and-pepper noise in images, occlusion on faces). Figure 1 shows some examples of this kind of partial corruption. Intuitively, partial corruption can be treated as large additive noise. Unfortunately, traditional methods based on least-square estimation, such as NMF and PCA, are sensitive to this kind of noise [9], since the underlying assumption of a Gaussian noise distribution is not valid. Some recent work [10, 11, 12] tries to deal with partial corruption; these methods usually assume the positions of the corruption are given in advance and then ignore the corresponding data entries. However, it is unrealistic to assume that the positions of corruption are known in many real-world applications. [13] proposes Robust PCA to recover the noise values and positions.

This paper proposes a Robust Nonnegative Matrix Factorization (RobustNMF) approach, which is able to simultaneously learn the basis matrix and the coefficient matrix and estimate the positions and values of the noise. The underlying observation is that the clean data allow a nonnegative factorization and the noise is sparse. An efficient iterative optimization algorithm with solid theoretical justification is proposed to obtain the desired solution of the RobustNMF approach. To the best of our knowledge, our work is the first NMF technique that generates robust results for data with large additive noise (partial corruption) without requiring information about the positions of the noise.

The rest of this paper is organized as follows. Section 2 reviews the traditional NMF algorithm. Section 3 proposes the RobustNMF algorithm, followed by the iterative optimization method in Section 4. Section 5 provides theoretical justification of the optimization method, and the experimental results are shown in Section 6. Finally, we conclude and discuss future work.

2 Review of Nonnegative Matrix Factorization

Given a nonnegative matrix X ∈ R^{m×n}, where each column of X represents a data sample, the NMF algorithm aims to learn two nonnegative matrices U ∈ R^{m×k} and V ∈ R^{k×n} that approximate X by their product, i.e., X ≈ UV. To learn U and V, the following objective function is minimized:

$$O = \|X - UV\|_F^2 \quad \text{s.t.}\ U \ge 0,\ V \ge 0 \tag{1}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The following iterative multiplicative updating algorithm was proposed in [14] to minimize the above objective function:

$$U_{ij} = U_{ij}\frac{(XV^T)_{ij}}{(UVV^T)_{ij}} \tag{2}$$

$$V_{ij} = V_{ij}\frac{(U^TX)_{ij}}{(U^TUV)_{ij}} \tag{3}$$
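For concreteness, updates (2) and (3) translate directly into a few lines of NumPy. The following is a minimal sketch, not code from the paper; the small constant eps added to the denominators to avoid division by zero is an implementation detail beyond the original formulation.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF via the multiplicative updates (2)-(3) of [14]."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))  # nonnegative random initialization
    V = rng.random((k, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update (2)
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update (3)
    return U, V
```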

3 Robust Nonnegative Matrix Factorization

The proposed Robust Nonnegative Matrix Factorization (RobustNMF) algorithm explicitly models the partial corruption, which is treated as large additive noise. Let the nonnegative matrix X ∈ R^{m×n} denote the observed corrupted data, where each column of X is a data sample, and let X̂ ∈ R^{m×n} denote the clean data without pollution. We have X = X̂ + E, where E ∈ R^{m×n} is the large additive noise. Note that the large additive noise E is not zero-mean Gaussian noise, which would be well handled by least-square error minimization. Moreover, we are concerned with partial corruption, where "partial" means the noise distribution is sparse; in other words, only a small portion of the entries of E are nonzero. For example, in face recognition, occlusion by glasses is an instance of this kind of noise, and it covers only a small portion of the entire face.

The clean data X̂ is approximated by UV (U ∈ R^{m×k}, V ∈ R^{k×n}) as in traditional NMF; thus we have

$$X \approx UV + E \tag{4}$$

Considering the above model and the sparseness of the large additive noise E, the objective function of RobustNMF is defined as follows:

$$O_{RobustNMF} = \|X - UV - E\|_F^2 + \lambda\sum_j \|E_{\cdot j}\|_0^2 \tag{5}$$

The first term approximates the clean data; the second term comes from the sparseness constraint on E. The parameter λ controls the tradeoff between the two terms, and thus depends on how large a portion of the entries are corrupted. However, the L0 norm in the second term makes this objective function difficult to optimize, so the L1 norm is employed to approximate it, which has been a popular strategy in prior research [15]. Substituting the L1 norm into the objective function, we have

$$O_{RobustNMF} = \|X - UV - E\|_F^2 + \lambda\sum_j \|E_{\cdot j}\|_1^2 = \left\|X - [U, I, -I]\begin{bmatrix} V \\ E^p \\ E^n \end{bmatrix}\right\|_F^2 + \lambda\sum_j \left[\|E^p_{\cdot j}\|_1 + \|E^n_{\cdot j}\|_1\right]^2 \tag{6}$$

where $E = E^p - E^n$, $E^p = \frac{|E|+E}{2}$, $E^n = \frac{|E|-E}{2}$, and $E^p \ge 0$, $E^n \ge 0$. Now we have a squared L1 norm penalty for sparseness, which has been shown to be effective and computationally convenient [4, 16, 6]. Note that E is the sparse large additive noise, whose entries can be either negative or nonnegative; we decompose E into the two nonnegative matrices E^p and E^n described above to gain the nonnegativity, which makes the optimization convenient. We also impose the constraint X − E ≥ 0, since the clean data should be nonnegative. Finally, the objective function is minimized with respect to U, V, E^p, and E^n, subject to the constraints U ≥ 0, V ≥ 0, E^p ≥ 0, E^n ≥ 0, and X − E ≥ 0.
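The decomposition E = E^p − E^n and the objective (6) are straightforward to express in code. The sketch below (illustrative names, not code from the paper) evaluates the penalized objective for given factors:

```python
import numpy as np

def split_noise(E):
    """Decompose E into its nonnegative parts: E^p = (|E|+E)/2, E^n = (|E|-E)/2."""
    return (np.abs(E) + E) / 2.0, (np.abs(E) - E) / 2.0

def robust_nmf_objective(X, U, V, Ep, En, lam):
    """Objective (6): ||X - UV - E||_F^2 + lam * sum_j (||Ep_.j||_1 + ||En_.j||_1)^2."""
    E = Ep - En
    fit = np.linalg.norm(X - U @ V - E, 'fro') ** 2
    col_l1 = Ep.sum(axis=0) + En.sum(axis=0)  # column L1 norms (entries are >= 0)
    return fit + lam * np.sum(col_l1 ** 2)
```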

4 Optimization

Since O_RobustNMF is not jointly convex in U, V, E^p, and E^n, it is difficult to find its global minimum. Instead, we aim to find a local minimum by iteratively updating U, V, E^p, and E^n, in a similar way to the algorithm in [14] for NMF.

4.1 Update U

Given V, E^p, and E^n, we update U to decrease the value of the objective function:

$$\begin{aligned} U &= \arg\min_{U \ge 0}\ \left\|X - [U, I, -I]\begin{bmatrix} V \\ E^p \\ E^n \end{bmatrix}\right\|_F^2 + \lambda\sum_j \left[\|E^p_{\cdot j}\|_1 + \|E^n_{\cdot j}\|_1\right]^2 \\ &= \arg\min_{U \ge 0}\ \|[X - E] - UV\|_F^2 \end{aligned} \tag{7}$$

The updating rule for U that reduces the objective function is as follows; it can be proven in a similar way as in [14]:

$$U_{ij} = U_{ij}\frac{(\hat{X}V^T)_{ij}}{(UVV^T)_{ij}} \tag{8}$$

where X̂ = X − E. Note that at this step E is given, and it satisfies the constraint X − E ≥ 0.
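In code, the U-step is an ordinary multiplicative update against X̂ = X − E. A sketch under the same assumptions as the earlier snippets (the clipping of X̂ is a numerical safeguard for the constraint X − E ≥ 0, not part of the paper's rule):

```python
import numpy as np

def update_U(X, U, V, E, eps=1e-10):
    """Update (8): U_ij <- U_ij (X_hat V^T)_ij / (U V V^T)_ij, with X_hat = X - E."""
    X_hat = np.maximum(X - E, 0.0)  # the paper constrains X - E >= 0; clip as a safeguard
    return U * (X_hat @ V.T) / (U @ V @ V.T + eps)
```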

4.2 Update V, E^p, and E^n

Now we decrease the objective function with respect to V, E^p, and E^n given U. Let

$$\tilde{V} = \begin{bmatrix} V \\ E^p \\ E^n \end{bmatrix}$$

The updating rule for Ṽ is:

$$\tilde{V}_{ij} = \max\!\left(0,\ \tilde{V}_{ij} - \frac{\tilde{V}_{ij}(\tilde{U}^T\tilde{U}\tilde{V})_{ij}}{(S\tilde{V})_{ij}} + \frac{\tilde{V}_{ij}(\tilde{U}^T\tilde{X})_{ij}}{(S\tilde{V})_{ij}}\right) \tag{9}$$

where

$$\tilde{X} = \begin{bmatrix} X \\ 0_{1\times n} \end{bmatrix}, \qquad \tilde{U} = \begin{bmatrix} U & I & -I \\ 0_{1\times k} & \sqrt{\lambda}e_{1\times m} & \sqrt{\lambda}e_{1\times m} \end{bmatrix}$$

and S is defined as

$$S_{ij} = |(\tilde{U}^T\tilde{U})_{ij}| \tag{10}$$
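A sketch of this joint step follows: it builds the augmented matrices X̃, Ũ, and Ṽ, applies rule (9), and thresholds at zero. This is an illustrative implementation under the stated definitions, with eps guarding the denominator; explicit enforcement of X − E ≥ 0 is omitted for brevity.

```python
import numpy as np

def update_V_Ep_En(X, U, V, Ep, En, lam, eps=1e-10):
    """Update (9) for V_tilde = [V; E^p; E^n], thresholded at zero."""
    m, n = X.shape
    k = U.shape[1]
    srl = np.sqrt(lam) * np.ones((1, m))
    # Augmented matrices from (11): X_tilde is (m+1) x n, U_tilde is (m+1) x (k+2m).
    X_t = np.vstack([X, np.zeros((1, n))])
    U_t = np.block([[U, np.eye(m), -np.eye(m)],
                    [np.zeros((1, k)), srl, srl]])
    V_t = np.vstack([V, Ep, En])
    G = U_t.T @ U_t          # contains negative entries in general
    S = np.abs(G)            # S_ij = |(U_tilde^T U_tilde)_ij|, eq. (10)
    SV = S @ V_t + eps
    V_t = np.maximum(0.0, V_t - V_t * (G @ V_t) / SV + V_t * (U_t.T @ X_t) / SV)
    return V_t[:k], V_t[k:k + m], V_t[k + m:]  # new V, E^p, E^n
```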

5 Correctness of Updating Rules

To decrease the objective function with respect to V, E^p, and E^n, we have:

$$\begin{aligned} (V, E^p, E^n) &= \arg\min_{V, E^p, E^n \ge 0}\ O_{RobustNMF} \\ &= \arg\min_{V, E^p, E^n \ge 0}\ \left\|X - [U, I, -I]\begin{bmatrix} V \\ E^p \\ E^n \end{bmatrix}\right\|_F^2 + \lambda\sum_j \left[\|E^p_{\cdot j}\|_1 + \|E^n_{\cdot j}\|_1\right]^2 \\ &= \arg\min_{V, E^p, E^n \ge 0}\ \left\|\begin{bmatrix} X \\ 0_{1\times n} \end{bmatrix} - \begin{bmatrix} U & I & -I \\ 0_{1\times k} & \sqrt{\lambda}e_{1\times m} & \sqrt{\lambda}e_{1\times m} \end{bmatrix}\begin{bmatrix} V \\ E^p \\ E^n \end{bmatrix}\right\|_F^2 \\ &= \arg\min_{V, E^p, E^n \ge 0}\ \|\tilde{X} - \tilde{U}\tilde{V}\|_F^2 \end{aligned} \tag{11}$$

where X̃, Ũ, and Ṽ are defined as in Section 4.2. Updating V, E^p, and E^n is more involved than updating U, since Ũ contains some negative values. We now prove the correctness of the updating rules for V, E^p, and E^n proposed in Section 4.

5.1 Decrease of the Objective Function

Definition 1 [14] Z(ṽ, ṽ′) is an auxiliary function for F(ṽ) if it satisfies the following conditions:

$$Z(\tilde{v}, \tilde{v}') \ge F(\tilde{v}), \qquad Z(\tilde{v}, \tilde{v}) = F(\tilde{v})$$

Lemma 1 [14] If Z is an auxiliary function, then F is nonincreasing under the update

$$\tilde{v}^{t+1} = \arg\min_{\tilde{v}} Z(\tilde{v}, \tilde{v}^t)$$

We now generalize Lemma 1 to Lemma 2.

Lemma 2 If Z is an auxiliary function, then F is nonincreasing as long as ṽ^{t+1} satisfies the following condition:

$$Z(\tilde{v}^{t+1}, \tilde{v}^t) \le Z(\tilde{v}^t, \tilde{v}^t)$$

Proof:

$$F(\tilde{v}^{t+1}) \le Z(\tilde{v}^{t+1}, \tilde{v}^t) \le Z(\tilde{v}^t, \tilde{v}^t) = F(\tilde{v}^t)\ \square$$

This generalization from Lemma 1 to Lemma 2 is similar to the generalization from EM to Generalized EM.

In our problem, Ũ contains some negative values, so the updating rules in [14] do not apply directly, and we seek new updating rules. Define a matrix S as follows:

$$S_{ij} = |(\tilde{U}^T\tilde{U})_{ij}| \tag{12}$$

Lemma 3 If K(ṽ^t) is the diagonal matrix with

$$K_{ab}(\tilde{v}^t) = \delta_{ab}(S\tilde{v}^t)_a / \tilde{v}^t_a \tag{13}$$

then

$$Z(\tilde{v}, \tilde{v}^t) = F(\tilde{v}^t) + (\tilde{v} - \tilde{v}^t)^T\nabla F(\tilde{v}^t) + \frac{1}{2}(\tilde{v} - \tilde{v}^t)^T K(\tilde{v}^t)(\tilde{v} - \tilde{v}^t) \tag{14}$$

is an auxiliary function for

$$F(\tilde{v}) = \frac{1}{2}\sum_i \left(\tilde{x}_i - \sum_a \tilde{U}_{ia}\tilde{v}_a\right)^2 \tag{15}$$

Proof: Z(ṽ, ṽ) = F(ṽ) holds obviously. We now prove that Z(ṽ, ṽ^t) ≥ F(ṽ). Comparing

$$F(\tilde{v}) = F(\tilde{v}^t) + (\tilde{v} - \tilde{v}^t)^T\nabla F(\tilde{v}^t) + \frac{1}{2}(\tilde{v} - \tilde{v}^t)^T(\tilde{U}^T\tilde{U})(\tilde{v} - \tilde{v}^t) \tag{16}$$

to Z(ṽ, ṽ^t), we find that we only need to show

$$(\tilde{v} - \tilde{v}^t)^T K(\tilde{v}^t)(\tilde{v} - \tilde{v}^t) - (\tilde{v} - \tilde{v}^t)^T(\tilde{U}^T\tilde{U})(\tilde{v} - \tilde{v}^t) \ge 0, \quad \text{i.e.,} \quad (\tilde{v} - \tilde{v}^t)^T\left[K(\tilde{v}^t) - \tilde{U}^T\tilde{U}\right](\tilde{v} - \tilde{v}^t) \ge 0 \tag{17}$$

To prove the positive semidefiniteness, consider the matrix M(ṽ^t):

$$M_{ab}(\tilde{v}^t) = \tilde{v}^t_a\left[K(\tilde{v}^t) - \tilde{U}^T\tilde{U}\right]_{ab}\tilde{v}^t_b \tag{18}$$

M is a rescaling of K(ṽ^t) − ŨᵀŨ, so M is positive semidefinite if and only if K(ṽ^t) − ŨᵀŨ is.

$$\begin{aligned} \mu^T M \mu &= \sum_{ab}\mu_a M_{ab}\mu_b \\ &= \sum_{ab} S_{ab}\tilde{v}^t_a\tilde{v}^t_b\mu_a^2 - \mu_a\tilde{v}^t_a(\tilde{U}^T\tilde{U})_{ab}\tilde{v}^t_b\mu_b \\ &= \sum_{ab} S_{ab}\tilde{v}^t_a\tilde{v}^t_b\left[\frac{1}{2}\mu_a^2 + \frac{1}{2}\mu_b^2 - \mathrm{sgn}\!\left((\tilde{U}^T\tilde{U})_{ab}\right)\mu_a\mu_b\right] \\ &= \sum_{ab}\frac{1}{2}S_{ab}\tilde{v}^t_a\tilde{v}^t_b\left[\mu_a - \mathrm{sgn}\!\left((\tilde{U}^T\tilde{U})_{ab}\right)\mu_b\right]^2 \\ &\ge 0 \end{aligned} \tag{19}$$

where

$$\mathrm{sgn}(x) = \begin{cases} -1 & x < 0 \\ 0 & x = 0 \\ 1 & x > 0 \end{cases} \tag{20}$$

Note that in our setting Ũ contains some negative values, but Ṽ is nonnegative. □

Substituting Lemma 3 into Lemma 1, the updating rule is:

$$\tilde{v}^{t+1} = \tilde{v}^t - K(\tilde{v}^t)^{-1}\nabla F(\tilde{v}^t) = \tilde{v}^t - K(\tilde{v}^t)^{-1}\tilde{U}^T\tilde{U}\tilde{v}^t + K(\tilde{v}^t)^{-1}\tilde{U}^T\tilde{x} \tag{21}$$

Writing the components explicitly, we get:

$$\tilde{v}^{t+1}_a = \tilde{v}^t_a - \frac{\tilde{v}^t_a(\tilde{U}^T\tilde{U}\tilde{v}^t)_a}{(S\tilde{v}^t)_a} + \frac{\tilde{v}^t_a(\tilde{U}^T\tilde{x})_a}{(S\tilde{v}^t)_a} \tag{22}$$

The proposed updating rule deals with negative values by explicitly considering the negative part of the large additive noise in Ũ. If Ũ ≥ 0, the first two terms in the above updating rule cancel each other, yielding the same rule as in [14]. The Ṽ obtained by (22) is made up of three parts, V, E^p, and E^n, all of which should be nonnegative. Unfortunately, the value ṽ^{t+1} given by rule (22) is not guaranteed to be nonnegative. We now discuss how to keep it nonnegative while updating the values.

In the auxiliary function Z, K(ṽ^t) is a diagonal matrix; thus the second-order terms only involve the form ṽ_a². This results in a very important property, stated in Lemma 4.

Lemma 4 If ṽ^{t+1} = arg min_ṽ Z(ṽ, ṽ^t) and ṽ^t ≥ ṽ′^{t+1} ≥ ṽ^{t+1}, then Z(ṽ′^{t+1}, ṽ^t) ≤ Z(ṽ^t, ṽ^t).

According to Lemma 2 and Lemma 4, we have F(ṽ′^{t+1}) ≤ F(ṽ^t). So, to ensure nonnegativity, we can simply threshold ṽ^{t+1} at 0. This operation introduces nonnegativity while keeping the value of F nonincreasing from ṽ^t to the thresholded ṽ^{t+1}. Thus, we obtain the updating rule described in (9).

5.2 Convergence Analysis

Since the objective function has a lower bound (e.g., 0), and the updating rules for U, V, and E all keep the objective function nonincreasing, the algorithm always converges.
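Because every step is derived from an auxiliary function, the objective value provides a built-in sanity check during iteration. Below is a sketch of the full alternating loop, reusing the hypothetical helper functions from the earlier snippets; the assertion is a numerical check, not a proof, and its tolerance allows for the eps regularization and the X − E ≥ 0 safeguard.

```python
import numpy as np

def robust_nmf(X, k, lam, n_iter=100, seed=0):
    """Alternate the U-step (8) and the joint V/E^p/E^n-step (9);
    objective (6) should be nonincreasing across iterations."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    Ep = 0.01 * rng.random((m, n))  # small positive init: multiplicative-style
    En = 0.01 * rng.random((m, n))  # updates keep exact zeros stuck at zero
    prev = np.inf
    for _ in range(n_iter):
        U = update_U(X, U, V, Ep - En)
        V, Ep, En = update_V_Ep_En(X, U, V, Ep, En, lam)
        obj = robust_nmf_objective(X, U, V, Ep, En, lam)
        assert obj <= prev * (1 + 1e-9) + 1e-8, "objective increased"
        prev = obj
    return U, V, Ep - En  # factors and estimated noise E
```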

6 Experimental Results

This section presents experimental results on two different applications: noise detection (identifying the exact positions of large additive noise) and image reconstruction/denoising.

6.1 Large Additive Noise Detection

Several algorithms have been proposed to deal with data with partial corruption. However, they usually assume the positions of the noise are known in advance. Fortunately, the proposed RobustNMF is able to locate these positions; once the corrupted entries are located, existing algorithms can also be applied. This subsection presents experiments for detecting the positions of large noise in face images and image patches. The reported results are averaged over ten runs.

The experiments are based on the ORL face dataset. Each face image is of size 32×32 and is thus represented by a 1024-dimensional vector. For each face in a randomly selected subset, 50 pixels are randomly selected and replaced with the value 255 to simulate the large additive noise. The polluted faces make up the data matrix X, each column of which corresponds to a polluted face image. We then apply the RobustNMF algorithm to this X to estimate U, V, and E. Furthermore, we scan all the entries of E: when E_ij is nonzero, we claim that the corresponding pixel is polluted. With the above procedure, we are able to detect the positions of noise by analyzing E. The performance is evaluated by precision and recall, where

$$\text{Precision} = \frac{\#\text{DetectedPollutedPixels}}{\#\text{NonzeroEntriesIn}E} \times 100\%, \qquad \text{Recall} = \frac{\#\text{DetectedPollutedPixels}}{\#\text{TotalPixelsPolluted}} \times 100\%$$

Figure 2: Noise Detection Results in Face Images

Here we only show the performance of our algorithm, since no other algorithm, to the best of our knowledge, is designed to handle this task. We attempted to compare the proposed algorithm with PCA and NMF: we first applied PCA or NMF to the noisy data X to obtain a reconstruction X̂, and then tried to detect the positions of noise by analyzing the difference X − X̂. However, it is very difficult to find an appropriate threshold for the difference, and the performance is very sensitive to it. We tried several thresholds, all of which gave poor results, probably because the partial corruption significantly skews the solutions of PCA and NMF.

In Figure 2, the left subfigure presents the precision and recall versus different numbers of face images. The algorithm attains a precision of over 90% and a recall of over 50%. The performance increases with the number of face images, which is reasonable, since more samples mean there is more information for RobustNMF to exploit. When the number is large enough, increasing it further no longer improves the performance. Here k is set to 10, and λ is set to 0.04.

The middle and right subfigures in Figure 2 investigate the relationship between performance and the parameter λ; k is still set to 10, and the number of face images is fixed at 50 in the middle subfigure and 100 in the right subfigure. Generally speaking, the algorithm attains over 90% precision and over 50% recall. With larger values of λ, the precision becomes a little higher and the recall a little lower. This is consistent with our expectation, since a larger λ makes the detected noise sparser, which often leads to higher precision and lower recall.
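In code, the evaluation reduces to comparing the support of the estimated E against the ground-truth corruption mask. A sketch with hypothetical names (mask_true marks the pixels that were set to 255):

```python
import numpy as np

def detection_metrics(E_est, mask_true, tol=1e-6):
    """Precision/recall of noise-position detection from the estimated E;
    mask_true is a boolean matrix marking the truly polluted pixels."""
    detected = np.abs(E_est) > tol  # "nonzero" entries of E, up to tolerance
    hits = np.logical_and(detected, mask_true).sum()
    precision = 100.0 * hits / max(detected.sum(), 1)
    recall = 100.0 * hits / max(mask_true.sum(), 1)
    return precision, recall
```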

6.2 Image Reconstruction/Denoising

This subsection presents the performance of RobustNMF on reconstruction/denoising. First, we simulate large additive noise in the same way as in the previous subsection. Given the polluted data X, RobustNMF learns U, V, and E, and the original images are reconstructed as UV. As discussed before, some other algorithms, such as WNMF [10], are able to handle large noise if the positions are given. Thus, we can use RobustNMF to locate the noise and then employ WNMF to recover the images. The reason for combining these two methods is that the L0 norm is approximated by the L1 norm in RobustNMF, which may cause the absolute values of the estimated E to be smaller than the truth. Let RobustNMF+WNMF denote this combined method.
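A sketch of the combination: the mask derived from RobustNMF's E is passed to a weighted NMF. The multiplicative form below is a common WNMF variant for a 0/1 weight matrix, given here under the assumption that it matches the spirit of [10]; the exact update there may differ.

```python
import numpy as np

def wnmf(X, W, k, n_iter=200, eps=1e-10, seed=0):
    """Weighted NMF for min ||W * (X - UV)||_F^2 with elementwise 0/1 weights;
    entries with W == 0 (detected noise positions) are ignored in the fit."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(n_iter):
        U *= ((W * X) @ V.T) / ((W * (U @ V)) @ V.T + eps)
        V *= (U.T @ (W * X)) / (U.T @ (W * (U @ V)) + eps)
    return U, V

# RobustNMF+WNMF pipeline sketch:
#   U, V, E = robust_nmf(X, k, lam)
#   W = (np.abs(E) <= 1e-6).astype(float)  # keep entries not flagged as noise
#   U2, V2 = wnmf(X, W, k)
```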

6.2.1 Reconstruction of Faces

A subset of faces from the ORL face dataset is selected, and large noise is added to generate a set of polluted samples, denoted as X, in a similar way as described in Section 6.1, while the original data samples form the matrix X̃. The Mean Squared Reconstruction Error (MSRE) is used to measure the reconstruction performance.

For NMF, the matrices U and V are learned from X and then used in reconstruction. Compared with the original noise-free matrix X̃, the MSRE is calculated as $\frac{1}{N}\|\tilde{X} - UV\|_F^2$, where N is the number of samples. For RobustNMF, the MSRE definition is the same as for NMF. For RobustNMF+WNMF, RobustNMF learns U, V, and E from X; since E is an indicator of whether a pixel is polluted or not, WNMF takes E as a mask and learns new matrices Ũ and Ṽ, and the MSRE is defined as $\frac{1}{N}\|\tilde{X} - \tilde{U}\tilde{V}\|_F^2$.

Experiments are conducted with varying numbers of polluted pixels and faces. In the first set of experiments, we fix the number of faces at 50 or 100 and vary the number of polluted pixels from 10 to 100 with a step of 10, which means about 1 to 10 percent of the pixels in each face are corrupted. The results are shown in the top row of Figure 3. In the second set of experiments, the number of polluted pixels is fixed at 50 or 100, meaning about 5 or 10 percent of the pixels on each face are corrupted, and experiments are conducted with various numbers of faces, from 10 to 100 with a step of 10. The results are shown in the bottom row of Figure 3.

These experiments show that both RobustNMF and RobustNMF+WNMF consistently outperform traditional NMF across varying numbers of data samples. As the amount of noise increases, the advantage of the proposed algorithms becomes even larger. This is because RobustNMF is able to detect the positions of the large-value noise, i.e., the partial corruption, which enables the application of WNMF. Because the approximation of the L0 norm by the L1 norm causes the large noise to be underestimated, we prefer RobustNMF+WNMF to pure RobustNMF, even though both methods outperform traditional NMF.
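The MSRE itself is a one-liner; the sketch below makes the normalization explicit (X_clean stands for the noise-free matrix X̃, and N is the number of columns/samples):

```python
import numpy as np

def msre(X_clean, U, V):
    """Mean squared reconstruction error: (1/N) ||X_clean - UV||_F^2."""
    N = X_clean.shape[1]  # one sample per column
    return np.linalg.norm(X_clean - U @ V, 'fro') ** 2 / N
```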

Figure 3: Face Reconstruction Results - Top: MSRE vs. # Polluted Pixels in Each Face; Bottom: MSRE vs. # Faces

6.2.2 Image Denoising (Reconstruction of Patches)

This subsection presents some experimental results on image denoising using RobustNMF. Salt-and-pepper noise is added to natural images; the noise density is set to 5%, meaning about 5% of the pixels are affected. The noisy image is converted into a set of patches, to which RobustNMF is applied, with λ set to 0.04 and k set to 10. UV is used to reconstruct the original image. Some denoising results are shown in Figure 4: the first row shows the polluted images, the second row shows the denoising results of traditional NMF, and the third row shows the results of RobustNMF. It can be seen that RobustNMF outperforms traditional NMF. Due to the space limit, the ground-truth images, the results of RobustNMF+WNMF, and more experiments on other images are given in the supplemental material.
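A sketch of the denoising pipeline follows. The paper does not specify the patch-extraction scheme, so the non-overlapping 8×8 tiling here is an illustrative assumption; robust_nmf is the hypothetical driver from the earlier sketch.

```python
import numpy as np

def denoise_image(img, patch=8, k=10, lam=0.04):
    """Denoise a grayscale image: tile it into non-overlapping patches, run
    RobustNMF on the patch matrix, and reconstruct from U V (E is discarded)."""
    h, w = img.shape
    h2, w2 = h - h % patch, w - w % patch  # crop to a multiple of the patch size
    coords = [(i, j) for i in range(0, h2, patch) for j in range(0, w2, patch)]
    X = np.stack([img[i:i + patch, j:j + patch].reshape(-1) for i, j in coords],
                 axis=1)                   # one vectorized patch per column
    U, V, _ = robust_nmf(X, k=k, lam=lam)
    R = U @ V
    out = img.astype(float)
    for idx, (i, j) in enumerate(coords):
        out[i:i + patch, j:j + patch] = R[:, idx].reshape(patch, patch)
    return out
```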

Figure 4: Image Denoising Results - Top: Polluted Images; Middle: Results by NMF; Bottom: Results by RobustNMF

7 Conclusion

Data in many real-world applications are often partially corrupted without explicit information about the positions of the noise, which prevents the use of NMF and its other existing variants. This paper proposes a RobustNMF algorithm for large additive noise, which can handle partial corruption without requiring the noise position information in advance. The proposed algorithm simultaneously locates and estimates the large additive noise and learns the basis matrix U and coefficient matrix V in the framework of NMF. It also paves the way to apply other variants of NMF (e.g., WNMF) to data with missing values by estimating the positions of the noise. An efficient optimization algorithm with a solid theoretical justification is proposed for RobustNMF. Experimental results on three different tasks demonstrate the advantages of our algorithm.

As for future research, we plan to explore a low-rank version of RobustNMF, which can automatically find an adequate low rank for the decomposed matrices. Similar to RobustPCA [13], more applications of RobustNMF can be investigated, since NMF is widely used in various areas, including computer vision, text mining, and speech analysis.

References

[1] David Guillamet and Jordi Vitrià, "Non-negative matrix factorization for face recognition," in CCIA '02: Proceedings of the 5th Catalonian Conference on AI, London, UK, 2002, pp. 336-344, Springer-Verlag.

[2] A. M. Cheriyadat and R. J. Radke, "Non-negative matrix factorization of partial track data for motion segmentation," in Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 865-872.

[3] Patrik O. Hoyer, "Non-negative matrix factorization with sparseness constraints," J. Mach. Learn. Res., vol. 5, pp. 1457-1469, 2004.

[4] Hyunsoo Kim and Haesun Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," Bioinformatics, vol. 23, pp. 1495-1502, June 2007.

[5] Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Jiawei Han, "Locality preserving nonnegative matrix factorization," in IJCAI '09: Proceedings of the 21st International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 2009, pp. 1010-1015, Morgan Kaufmann Publishers Inc.

[6] Bin Shen and Luo Si, "Nonnegative matrix factorization clustering on multiple manifolds," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010, pp. 575-580, AAAI Press.

[7] A. B. Hamza and D. J. Brady, "Reconstruction of reflectance spectra using robust nonnegative matrix factorization," Signal Processing, IEEE Transactions on, vol. 54, no. 9, pp. 3637-3642, 2006.

[8] Paul Fogel, S. Stanley Young, Douglas M. Hawkins, and Nathalie Ledirac, "Inferential, robust non-negative matrix factorization analysis of microarray data," Bioinformatics, vol. 23, no. 1, pp. 44-49, 2007.

[9] F. De la Torre and M. J. Black, "Robust principal component analysis for computer vision," in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, 2001, vol. 1, pp. 362-369.

[10] Sheng Zhang, Weihong Wang, James Ford, and Fillia Makedon, "Learning from incomplete ratings using non-negative matrix factorization," in SDM, 2006.

[11] A. Eriksson and A. van den Hengel, "Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 771-778.

[12] Kaushik Mitra, Sameer Sheorey, and Rama Chellappa, "Large-scale matrix factorization with missing data under additional constraints," in NIPS, 2010, MIT Press.

[13] John Wright, Yigang Peng, and Yi Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization," in NIPS, 2009, MIT Press.

[14] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2001, pp. 556-562, MIT Press.

[15] David L. Donoho, "For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution," Communications on Pure and Applied Mathematics, vol. 59, no. 7, pp. 907-934, 2006.

[16] Hyunsoo Kim and Haesun Park, "Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method," SIAM J. Matrix Anal. Appl., vol. 30, pp. 713-730, July 2008.