Optimal Linear Compression Under Unreliable Representation and Robust PCA Neural Models Konstantinos I. Diamantaras, Member, IEEE and Kurt Hornik, Member, IEEE and Michael G. Strintzis, Senior Member, IEEE
Abstract: In a typical linear data compression system the representation variables resulting from the coding operation are assumed totally reliable and therefore the solution in the mean-squared-error sense is an orthogonal projector to the so-called principal component subspace. When the representation variables are contaminated by additive noise which is uncorrelated with the signal, the problem is called Noisy PCA (NPCA) and the optimal MSE solution is not a trivial extension of PCA. We first show that the problem is not well defined unless we impose explicit or implicit constraints on either the coding or the decoding operator. Second, orthogonality is not a property of the optimal solution under most constraints. Third, the signal components may or may not be reconstructed depending on the noise level. As the noise power increases, we observe rank reduction in the optimal solution under most reasonable constraints. In these cases it appears that it is preferable to omit the smaller signal components rather than attempting to reconstruct them. This phenomenon has similarities with classical information theoretical results, notably the water-filling analogy, found in parallel additive Gaussian noise channels. Finally, we show that standard Hebbian-type PCA learning algorithms are not optimally robust to noise, and propose a new Hebbian-type learning algorithm which is optimally robust in the NPCA sense.
Keywords: Principal Component Analysis, Robust representation, Hebbian learning
I. Introduction
The last 20 years have seen a great surge of research interest in the area of learning models for optimal linear data compression and feature extraction, relating to classical statistical methods such as Principal Component Analysis (PCA) [16], [6], [19], [1], [3]. These rules can roughly be categorized into those derived from a Hebbian learning principle and those implemented on special types of multilayer perceptrons. The Hebbian rules are usually realized by single-layer networks, possibly with lateral weights among the output units. The multilayer networks, on the other hand, usually operate in auto-associative mode and have a "bottleneck" hidden layer of reduced size which enforces the optimal compression and decompression of the input signal.
In this paper we shall study the problem of optimal linear compression and decompression from the perspective of its robustness to noise. In section II, we introduce the Noisy PCA (NPCA) model, which assumes that both the representation and the reconstruction process are characterized by the presence of additive noise which is uncorrelated with the signal. (It turns out that, without loss of generality, the reconstruction noise can be ignored.) In the NPCA case, no finite solution minimizing the mean square reconstruction error can be found, and constraints need to be imposed on the encoding and decoding operators. In section III, we characterize the NPCA-optimal solutions for the most important constraints (orthogonality, bounds on the Frobenius norm, symmetry, and hard and soft constraints on the representation variance). It turns out that in most cases the optimal solutions are not orthogonal, and their rank (the number of features extracted) decreases with increasing noise level. The relation of these findings to results from information theory is briefly discussed in section IV. In section V, we assess the NPCA robustness of Hebbian PCA neural methods that have been proposed in the past. We show that these algorithms are not optimally robust, and propose a new Hebbian-type learning rule, the so-called robust PCA learning rule for extracting single or multiple noisy principal components, which is optimally robust in the NPCA sense (under the symmetry constraint). In fact, it turns out that for the fundamental Hebbian learning rule of Oja [16], the "breakdown" noise power level at which the algorithm stops extracting anything interesting is only half the level set by the NPCA-optimal solution and realized by our new algorithm. Our results are summarized and discussed in section VI. Proofs are deferred to the Appendix.
K. I. Diamantaras and M. G. Strintzis are with the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece. K. Hornik is with the Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria.

II. Noisy Principal Component Analysis (NPCA)
A. Standard PCA
The standard analysis of a stochastic signal $x(k) \in \mathbb{R}^n$, $k = 0, 1, \ldots$, into principal components determines a set of hidden linear representation parameters $y_1(k), \ldots, y_m(k)$, called factors or features. The features are used for the linear least mean-squared-error reconstruction of the original signal. If there is enough correlation between the observation variables $x_i(k)$, then we can reconstruct the signal with acceptable accuracy using a number of features $m$ much smaller than the data dimension $n$. We often call $n$
the superficial dimensionality of the signal, while $m$ is called the intrinsic dimensionality of the signal.
The linear mapping
$$y = Ax$$
from the data space $\mathbb{R}^n$ to the representation space $\mathbb{R}^m$ is combined with the linear reconstruction mapping
$$\hat{x} = By$$
to yield, in general, a lossy compression mechanism (if $m < n$); thus perfect reconstruction is not possible. The PCA method minimizes the mean-squared reconstruction error
$$J = E\|x - \hat{x}\|^2 \qquad (1)$$
by an optimal choice of the linear coding and decoding operators $A$ and $B$, respectively.
In signal processing PCA is known as the Karhunen-Loeve transform (KLT) and both are immediately related to the eigenvalue decomposition (ED) of the symmetric signal autocorrelation matrix
$$R_x \equiv E(xx^T) = \sum_{i=1}^{n} \lambda_{x,i}\, u_{x,i} u_{x,i}^T.$$
The eigenvectors $u_{x,i}$ are orthonormal and the corresponding eigenvalues $\lambda_{x,i}$ are nonnegative. Assuming that there are no repeated eigenvalues and that they are arranged in decreasing order, $\lambda_{x,1} > \cdots > \lambda_{x,n}$, the PCA-optimal compression and decompression maps are
$$A = T U_m^T, \qquad B = U_m T^{-1},$$
where $T$ is any invertible $m \times m$ matrix. The columns of the matrix $U_m$ are the $m$ principal eigenvectors, i.e., $U_m = [u_{x,1}, \ldots, u_{x,m}]$, and due to orthogonality we have
$$AB = T U_m^T U_m T^{-1} = I. \qquad (2)$$
The input-output transfer matrix
$$BA = U_m U_m^T$$
is an orthogonal projector onto the subspace $L_m$ spanned by the first $m$ principal eigenvectors, which is called the principal component subspace. The PCA-optimal signal reconstruction $\hat{x} = U_m U_m^T x$ lies in $L_m$.

B. Noisy Representation

In the Noisy PCA (NPCA) problem the basic assumption is the unreliability of the representation signal $y$. Without changing the fundamental linear mapping assumptions of standard PCA, the question is what is the effect of noise in the representation variables. For example, in many telecommunications applications the representation data are not reliable due to channel noise. How does the optimal reconstruction change when there is representation noise, and how is the minimum error affected by the noise level?
As with standard PCA, the dimension $m$ of the representation vector $y$ is assumed to be less than or equal to the dimension $n$ of the input $x$. For our unreliability study to be complete we assume that, in addition to the representation process, the reconstruction process is also noisy. So the general NPCA model is
$$y = Ax + e, \qquad (3)$$
$$\hat{x} = By + \nu \qquad (4)$$
(see Fig. 1). We assume that the reconstruction and representation noises are uncorrelated with the signal $x$ and with each other, so $R_{xe} = R_{x\nu} = R_{e\nu} = 0$, where $R_{uv} \equiv E(uv^T)$, and our goal is to minimize the mean-squared error
$$J = E\|\hat{x} - x\|^2 = \operatorname{tr}(BAR_xA^TB^T) - 2\operatorname{tr}(BAR_x) + \operatorname{tr}(BR_eB^T) + \operatorname{tr}(R_\nu) + \operatorname{tr}(R_x)$$
('tr' denotes the matrix trace operator). It is obvious that the reconstruction noise enters the NPCA cost function only through the constant term $\operatorname{tr}(R_\nu)$, so we can ignore $\nu$ without affecting the optimal solution. The representation noise variance, on the other hand, appears in the term $\operatorname{tr}(BR_eB^T)$, which involves the variable matrix $B$. Thus we cannot ignore the representation noise term without substantially altering the cost function. Without loss of generality we shall assume that there is no reconstruction noise, thus simplifying the NPCA cost function into the following:
$$J = E\|B(Ax + e) - x\|^2 = \operatorname{tr}(BAR_xA^TB^T) - 2\operatorname{tr}(BAR_x) + \operatorname{tr}(BR_eB^T) + \operatorname{tr}(R_x). \qquad (5)$$
We assume that the noise covariance matrix is positive definite, but otherwise we make no assumptions about the noise statistics. Let $\lambda_{e,1} \geq \cdots \geq \lambda_{e,m}$ be the noise eigenvalues arranged in decreasing magnitude order and let $u_{e,i}$ denote the corresponding orthonormal eigenvectors. We shall also find it useful to write $\lambda_{e,-i}$ and $u_{e,-i}$ for the $i$-th smallest eigenvalue and the corresponding eigenvector, respectively (i.e., $\lambda_{e,-i} = \lambda_{e,m-i+1}$ and $u_{e,-i} = u_{e,m-i+1}$).
If either the linear coding or the decoding operator is known, then the problem becomes trivial. For example, if $A$ is given then the statistics of $y$ are known and from (4) the optimal decoding matrix $B$ is simply the linear regression operator of $x$ on $y$,
$$B = R_{xy} R_y^{-1} = R_x A^T (A R_x A^T + R_e)^{-1}. \qquad (6)$$
Similarly, if $B$ is given then from (5) we find the optimal coding operator to be the pseudo-inverse
$$A = B^+. \qquad (7)$$
Note that (6) and (7) are inconsistent:
$$(6) \Rightarrow AB = A R_x A^T (A R_x A^T + R_e)^{-1}, \qquad (8)$$
$$(7) \Rightarrow AB = B^+ B = I. \qquad (9)$$
The right hand sides of (8) and (9) are not equal except for the trivial zero noise case. It follows that in the noisy case the problem has no finite solution if both coding and decoding matrices are unknown. This issue and ways to tackle it are discussed in the next section.

III. Optimization under Constraints

The minimization of $J$ is an unbounded problem if no constraints are imposed on either $B$ or $A$. If we arbitrarily increase the magnitude of $A$ we increase the power ratio (SNR) of the term $Ax$ relative to $e$, thus making the effect of the noise arbitrarily small relative to the signal. The magnitude of $B$ must be correspondingly reduced in order to balance the size of $A$ for the sake of reconstructing $x$. However, since $B$ multiplies both the noise and the signal terms, it does not influence the SNR. Take, for example, any pair of matrices $A$, $B \neq 0$, and any positive scalar $c > 1$. As
$$J(cA, c^{-1}B) = \operatorname{tr}(BAR_xA^TB^T) - 2\operatorname{tr}(BAR_x) + c^{-2}\operatorname{tr}(BR_eB^T) + \operatorname{tr}(R_x) < \operatorname{tr}(BAR_xA^TB^T) - 2\operatorname{tr}(BAR_x) + \operatorname{tr}(BR_eB^T) + \operatorname{tr}(R_x) = J(A, B),$$
the new pair $cA$, $c^{-1}B$ achieves a smaller MSE than the original pair. It follows that the minimization of $J$ leads to unbounded solutions with $A \to \infty$ and $B \to 0$. The infimum of $J$ is the PCA error, which corresponds to the zero noise case: SNR $= \infty$.
Various constraints on the coding matrix $A$, the decoding matrix $B$, or the representation vector variance have been studied in the literature. Next, we shall present the most important of these results.

A. Orthogonality Constraints

Orthogonality is a very common property of linear transformations which are optimal in some mean-squared-error sense. For example, the PCA transform is based on an orthogonal operator (see Eq. 2). Many other common signal and image processing transforms are orthogonal, such as the Discrete Cosine Transform (DCT), the Haar Transform, etc. Suppose that we impose an orthogonality constraint $AA^T = I$ on the coding operator. The following result has been shown [2], [3].
Theorem 1: The minimum of the cost function (5) under the constraint $AA^T = I$ is
$$J_A^o = \sum_{i=1}^{m} \frac{1}{1/\lambda_{x,i} + 1/\lambda_{e,-i}} + \sum_{i=m+1}^{n} \lambda_{x,i},$$
and it is attained for
$$A = \sum_{i=1}^{m} u_{e,-i} u_{x,i}^T. \qquad (10)$$
From (10) and (6) it follows that the corresponding optimal decoding operator is
$$B = \sum_{i=1}^{m} \frac{\lambda_{x,i}}{\lambda_{x,i} + \lambda_{e,-i}}\, u_{x,i} u_{e,-i}^T. \qquad (11)$$
Both coding and decoding operators are outer products between the noise and signal eigenvectors but paired in reverse order: the most significant signal eigenvector is paired with the least significant noise eigenvector, and vice versa.
Unlike PCA or other orthogonal transforms, the optimal NPCA decoding matrix is not orthogonal. Thus the total input-output transfer matrix
$$BA = \sum_{i=1}^{m} \frac{\lambda_{x,i}}{\lambda_{x,i} + \lambda_{e,-i}}\, u_{x,i} u_{x,i}^T \qquad (12)$$
is not an orthogonal projector except for the trivial zero-noise case. From (12) it becomes apparent that each signal component is reconstructed with a different accuracy level $\lambda_{x,i}/(\lambda_{x,i} + \lambda_{e,-i})$. The first principal component is most accurately reconstructed (accuracy level closest to 1), while the last principal component gets the worst reconstruction (accuracy level closest to 0).
Naturally, we may ask next what will be the solution if we impose the analogous orthogonality constraint
$$B^T B = I$$
on the decoding matrix. Using (7) we replace $A$ with $B^+ = B^T$ in (5) to obtain
$$J = \operatorname{tr}(R_x) + \operatorname{tr}(R_e) - \operatorname{tr}(B^T R_x B).$$
The minimization of $J$ leads to the orthogonal PCA matrices
$$B = U_m = [u_{x,1}, \ldots, u_{x,m}], \qquad A = B^T.$$
In this case the input-output transfer matrix is an orthogonal projector, but now the minimum error
$$J_B^o = \sum_{i=1}^{m} \lambda_{e,i} + \sum_{i=m+1}^{n} \lambda_{x,i}$$
is larger than the minimum error $J_A^o$ attained when the orthogonality constraint is imposed on the coding matrix. Therefore, the orthogonal solution corresponds to the worse of the two errors. This fact manifests a clear lack of symmetry between the coding and decoding operators.
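As a quick numerical illustration of the cost (5) and of Theorem 1, the following sketch (NumPy, with synthetic covariances of our own choosing; the helper names npca_cost and wiener_B are ours, not the paper's) compares the reverse-paired orthogonal coder with the plain PCA rows $U_m^T$, both decoded by the Wiener-optimal $B$ of (6), and checks the closed-form value $J_A^o$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3

# Synthetic covariances (illustration only, not the paper's data).
M = rng.standard_normal((n, n)); Rx = M @ M.T + 0.1 * np.eye(n)          # signal autocorrelation R_x
N = rng.standard_normal((m, m)); Re = 0.3 * (N @ N.T) + 0.05 * np.eye(m)  # noise covariance R_e

lx, Ux = np.linalg.eigh(Rx); lx, Ux = lx[::-1], Ux[:, ::-1]  # lambda_{x,1} >= ... >= lambda_{x,n}
le, Ue = np.linalg.eigh(Re)                                  # ascending: u_{e,-1}, u_{e,-2}, ...

def npca_cost(A, B):
    """Mean-squared error E||B(Ax+e) - x||^2, Eq. (5)."""
    return (np.trace(B @ A @ Rx @ A.T @ B.T) - 2 * np.trace(B @ A @ Rx)
            + np.trace(B @ Re @ B.T) + np.trace(Rx))

def wiener_B(A):
    """Optimal decoder for a given coder A, Eq. (6)."""
    return Rx @ A.T @ np.linalg.inv(A @ Rx @ A.T + Re)

A_rev = Ue @ Ux[:, :m].T   # Theorem 1: sum_i u_{e,-i} u_{x,i}^T (reverse pairing), A A^T = I
A_pca = Ux[:, :m].T        # plain PCA rows, which also satisfy A A^T = I

J_rev = npca_cost(A_rev, wiener_B(A_rev))
J_pca = npca_cost(A_pca, wiener_B(A_pca))
JAo = np.sum(1.0 / (1.0 / lx[:m] + 1.0 / le)) + np.sum(lx[m:])  # closed-form J_A^o

print(J_rev <= J_pca + 1e-9)   # reverse pairing is never worse under A A^T = I
print(np.isclose(J_rev, JAo))  # and matches the closed-form minimum
```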
B. Frobenius Norm Constraints
One way to obtain finite NPCA solutions is to use brute-force constraints on the sizes of the coding and decoding matrices. For example, since the unconstrained problem leads to an infinite coding matrix, one constraint could be to bound the size of $A$ from above using the Frobenius norm, $\|A\|_F^2 \leq s^2$. We have the following result [2], [3].
Theorem 2: The minimum error $J$ under the constraint $\|A\|_F^2 \leq s^2$ is attained for
$$A = \sum_{i=1}^{r} \alpha_i u_{e,-i} u_{x,i}^T,$$
where
$$\alpha_i = \sqrt{\beta_i(r)}, \qquad r = \max\{l : \beta_l(l) > 0\}, \qquad (13)$$
and for $1 \leq i, l \leq m$
$$\beta_i(l) = \sqrt{\lambda_{e,-i}}\,\frac{s^2 + \sum_{k=1}^{l} \lambda_{e,-k}/\lambda_{x,k}}{\sum_{k=1}^{l} \sqrt{\lambda_{e,-k}}} - \frac{\lambda_{e,-i}}{\lambda_{x,i}}. \qquad (14)$$
The corresponding optimal decoding operator is
$$B = \sum_{i=1}^{r} \frac{\alpha_i \lambda_{x,i}}{\alpha_i^2 \lambda_{x,i} + \lambda_{e,-i}}\, u_{x,i} u_{e,-i}^T. \qquad (15)$$
The solution is similar to the one obtained using the coding matrix orthogonality constraint, in the sense that both solutions are outer products between the signal and noise eigenvectors, and also that these signal and noise eigenvectors are paired in the reverse order. One additional feature of the solution we obtain from the Frobenius constraint is the rank reduction of the optimal matrix as the noise power increases. This rank reduction feature will recur in the next sections, where we study other constraints as well. Under these constraints, the increase of the noise power leads to the loss of the least significant signal components. In terms of the MSE it is better to lose these components in the reconstruction rather than to try to recover them.
The dual constraint on the Frobenius norm of the decoding matrix,
$$\|B\|_F^2 \geq s^2, \qquad (16)$$
leads to the following result [3].
Theorem 3: We have
$$\inf_{\|B\|_F^2 \geq s^2} J(B) = \sum_{i=m+1}^{n} \lambda_{x,i} + s^2 \lambda_{e,m}.$$
No matrix $B$ attains the infimum, but for
$$B = s\, u_{x,1} u_{e,-1}^T + \epsilon \sum_{i=2}^{m} u_{x,i} u_{e,-i}^T$$
the coding operator corresponding to the decoding matrix $B$, which is approximately optimal for small $\epsilon$, is its pseudo-inverse,
$$A = B^+ = s^{-1} u_{e,-1} u_{x,1}^T + \epsilon^{-1} \sum_{i=2}^{m} u_{e,-i} u_{x,i}^T,$$
and $J(B^+, B)$ converges to the infimum as $\epsilon \to 0$. The solution in this case is ill-conditioned since the size of $A$ tends to infinity as $\epsilon \to 0$.

C. Symmetry Constraint between Coding and Decoding Matrices

A mathematically tractable constraint results from forcing the coding and decoding operators to be transposes of each other instead of being orthogonal, i.e.,
$$B = A^T.$$
This constraint induces a compromise between the SNR maximization and the signal reconstruction since it is impossible to arbitrarily increase the magnitude of $A$ while at the same time decreasing the magnitude of $B$. The NPCA cost (5) becomes
$$J = \operatorname{tr}(A^T A R_x A^T A) - 2\operatorname{tr}(A R_x A^T) + \operatorname{tr}(A^T R_e A) + \operatorname{tr}(R_x). \qquad (17)$$
As shown in [4] this constraint is suitable for single-layer neural network implementation, similar to Oja's [16], [11] and Sanger's GHA [19] models. It also helps the robustness study of "standard" single-layer PCA models. We show the following result (see the appendix for the proof).
Theorem 4: The minimum of $J$ under the constraint $B = A^T$ is attained for
$$A = \sum_{i=1}^{r} \alpha_i u_{e,-i} u_{x,i}^T, \qquad (18)$$
where
$$\alpha_i^2 = \max(1 - \lambda_{e,-i}/(2\lambda_{x,i}),\, 0), \qquad r = \max\{i : \alpha_i > 0\}.$$
Once more, the solution is a weighted outer product between the signal and noise eigenvectors paired in reverse order. The necessary and sufficient condition for $A$ to have full rank is that the largest noise eigenvalue be less than twice the $m$-th signal eigenvalue. If that condition does not hold, then the optimal rank is the largest integer $r$ for which the condition $\lambda_{e,-r} < 2\lambda_{x,r}$ is satisfied. In other words, $r \equiv \operatorname{rank}(A) = m$ if $\lambda_{e,1} < 2\lambda_{x,m}$, or else $r$ satisfies the conditions
$$2\lambda_{x,r} > \lambda_{e,-r} \qquad \text{and} \qquad 2\lambda_{x,r+1} \leq \lambda_{e,-(r+1)},$$
and we have $\alpha_{r+1} = \cdots = \alpha_m = 0$. Finally, we can compute the minimum NPCA cost in this case to be
$$J_{eq} = \sum_{i=1}^{r} \lambda_{e,-i}\left(1 - \frac{\lambda_{e,-i}}{4\lambda_{x,i}}\right) + \sum_{i=r+1}^{n} \lambda_{x,i}.$$

D. Constraints on the Representation Variance

In most telecommunications applications channel bandwidth is at a premium. In these cases it is reasonable to put an upper bound on the NPCA representation vector variance, thus indirectly limiting the size of the coding matrix $A$.
Let us first consider a soft constraint on the representation vector variance using the additional term $\mu E\|y\|^2$ in the mean-squared error, where $\mu$ is a fixed, positive scalar. The error function then becomes
$$J_1(\mu) = E\|x - \hat{x}\|^2 + \mu E\|y\|^2. \qquad (19)$$
For fixed noise power, the additional term
$$E\|y\|^2 = \operatorname{tr}(A R_x A^T) + \operatorname{tr}(R_e)$$
penalizes large sizes of the coding matrix $A$. The following result holds [3].
Theorem 5: The minimum of $J_1$ is attained for
$$A = \sum_{i=1}^{r} \alpha_i u_{e,-i} u_{x,i}^T, \qquad (20)$$
$$B = \sum_{i=1}^{r} \beta_i u_{x,i} u_{e,-i}^T, \qquad (21)$$
where
$$r = \max\{i : \gamma_i > \mu\}$$
is the rank of the optimal solution,
$$\gamma_i = \lambda_{x,i} / \lambda_{e,-i}$$
are the signal to noise eigenvalue ratios in reverse pairing, and
$$\alpha_i^2 = \max(1/\sqrt{\mu\gamma_i} - 1/\gamma_i,\, 0), \qquad \beta_i^2 = \max(\sqrt{\mu\gamma_i} - \mu,\, 0).$$
The minimum of $J_1$ and the variance of the optimal hidden units' representation equal
$$2\sqrt{\mu}\sum_{i=1}^{r}\sqrt{\lambda_{x,i}\lambda_{e,-i}} + \mu\sum_{i=r+1}^{m}\lambda_{e,-i} + \sum_{i=r+1}^{n}\lambda_{x,i}$$
and
$$E\|y\|^2 = \operatorname{tr}(A R_x A^T) + \operatorname{tr}(R_e) = \sum_{i=1}^{m}\alpha_i^2\lambda_{x,i} + \sum_{i=1}^{m}\lambda_{e,i} = \frac{1}{\sqrt{\mu}}\sum_{i=1}^{r}\sqrt{\lambda_{e,-i}\lambda_{x,i}} + \sum_{i=r+1}^{m}\lambda_{e,-i}, \qquad (22)$$
respectively. Hence, the corresponding reconstruction error is
$$J = J_1(\mu) - \mu E\|y\|^2 = \sum_{i=r+1}^{n}\lambda_{x,i} + \sqrt{\mu}\sum_{i=1}^{r}\sqrt{\lambda_{x,i}\lambda_{e,-i}}.$$
Now let us turn to a hard constraint on the representation variance of the form
$$E\|y\|^2 = s^2 \qquad (23)$$
for some fixed value $s$. Using the method of Lagrange multipliers for optimizing (5) under this constraint, we define the auxiliary cost
$$J' = J + \mu(E\|y\|^2 - s^2). \qquad (24)$$
According to Theorem 5 the optimal solution for fixed $\mu$ is given by (20) and (21). Plugging (22) into (23) we obtain the optimal value for $\mu$ as
$$\mu = \frac{\left(\sum_{i=1}^{r}\sqrt{\lambda_{e,-i}\lambda_{x,i}}\right)^2}{\left(s^2 - \sum_{i=r+1}^{m}\lambda_{e,-i}\right)^2}.$$
The rank $r$ can be determined from the value of the representation variance $s^2$ by comparing the latter with the threshold levels
$$L(k) = \sqrt{\frac{\lambda_{e,-k}}{\lambda_{x,k}}}\sum_{i=1}^{k}\sqrt{\lambda_{e,-i}\lambda_{x,i}} + \sum_{i=k+1}^{m}\lambda_{e,-i}.$$
This gives
$$s^2 \leq L(1) \Rightarrow r = 0, \quad L(1) < s^2 \leq L(2) \Rightarrow r = 1, \quad L(2) < s^2 \leq L(3) \Rightarrow r = 2, \quad \ldots, \quad L(m) < s^2 \Rightarrow r = m.$$
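As a sanity check on Theorem 4 (again with synthetic covariances of our own choosing, not data from the paper), the following sketch builds the symmetric solution for increasing noise powers and illustrates both the rank reduction and the closed-form cost $J_{eq}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3

M = rng.standard_normal((n, n)); Rx = M @ M.T + 0.1 * np.eye(n)  # synthetic signal covariance
N = rng.standard_normal((m, m)); Re = N @ N.T + 0.1 * np.eye(m)  # synthetic noise covariance

lx, Ux = np.linalg.eigh(Rx); lx, Ux = lx[::-1], Ux[:, ::-1]      # lambda_{x,1} >= ... >= lambda_{x,n}
le, Ue = np.linalg.eigh(Re)                                      # lambda_{e,-1} <= ... <= lambda_{e,-m}

def cost17(A, Re_s):
    """Symmetric NPCA cost (17), i.e. Eq. (5) with B = A^T."""
    return (np.trace(A.T @ A @ Rx @ A.T @ A) - 2 * np.trace(A @ Rx @ A.T)
            + np.trace(A.T @ Re_s @ A) + np.trace(Rx))

for scale in (0.1, 1.0, 5.0, 50.0):
    le_s, Re_s = scale * le, scale * Re
    alpha = np.sqrt(np.maximum(1.0 - le_s / (2.0 * lx[:m]), 0.0))  # Theorem 4 weights, reverse pairing
    A = Ue @ np.diag(alpha) @ Ux[:, :m].T                          # sum_i alpha_i u_{e,-i} u_{x,i}^T
    r = int(np.sum(alpha > 0))                                     # rank shrinks as noise grows
    Jeq = (np.sum(np.where(alpha > 0, le_s * (1 - le_s / (4 * lx[:m])), lx[:m]))
           + np.sum(lx[m:]))
    print(f"noise x{scale:>5}: rank {r}, "
          f"J matches J_eq: {np.isclose(cost17(A, Re_s), Jeq)}")
```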
IV. Information Theoretical Connections

In the Gaussian setting there is a close relationship between information maximization and PCA. We expect that there will be a similar relation between noisy PCA and information maximization in noisy Gaussian channels as well. Indeed, the following is an old information theoretical result [7, chapter 7] which bears strong analogies with NPCA.
Theorem 6 (Parallel additive Gaussian noise channels): Consider a set of $n$ memoryless channels $y_1, \ldots, y_n$ with additive zero-mean Gaussian noise $e_1, \ldots, e_n$:
$$y_i = x_i + e_i.$$
We denote the noise variances by $\sigma_{e,i}^2 = E(e_i^2)$, $i = 1, \ldots, n$, and we define the input and output vectors $x = [x_1, \ldots, x_n]$, $y = [y_1, \ldots, y_n]$. If the energy of the input messages is constrained in a way that
$$E\|x\|^2 = \sum_{i=1}^{n} E(x_i^2) \leq s^2,$$
then the mutual information $I(x; y)$ is maximized by choosing the inputs to be statistically independent, zero-mean Gaussian random variables with variances $E(x_i^2) = \sigma_{x,i}^2$, where
$$\sigma_{x,i}^2 = \max(L - \sigma_{e,i}^2,\, 0)$$
and $L$ is chosen so that $\sum_{i=1}^{n} \sigma_{x,i}^2 = s^2$.
The interpretation of the solution is shown in Fig. 2. The system puts most of its energy into the least noisy channels and progressively less energy into the more noisy channels. This is analogous to the noisy PCA situation where the least principal noise component pairs with the most principal signal component and vice versa. The situation in the information theoretical setting resembles a container filled with water (and justifiably it is called the water-filling analogy), where the bottom of the container is shaped according to the sizes of the noise variances $\sigma_{e,i}^2$ and the quantity of the water is $s^2$. It is interesting to note that as the noise variances $\sigma_{e,i}^2$ become larger compared to $s^2$, more and more channels are left without signal. In the limit, as $\sigma_{e,i}^2 \to \infty$ and assuming that the noise variances are not equal, all the signal is concentrated in only one channel: the one with the smallest noise. In noisy PCA a similar phenomenon happens when increasing the noise variance results in rank loss.
The NPCA model is also closely related to the INFOMAX principle proposed by Linsker [14]. Linsker considers the same noisy linear data compression process $y = Ax + e$, where the noise is uncorrelated with the signal, as in NPCA (see Eq. 3), but proposes to choose $A$ in a way that "the rate of (Shannon) information transmission ... is maximized, subject to constraints and/or additional cost terms" [14, p. 113]. If both signal and noise are zero-mean Gaussian, this amounts to maximizing the criterion
$$I = \frac{1}{2} \log \frac{\det(A R_x A^T + R_e)}{\det(R_e)}.$$
He finds that under suitable additional assumptions, the rows of $A$ become increasingly collinear with increasing noise level. This is qualitatively similar to the asymptotic NPCA behavior under e.g. the Frobenius norm constraint in section III-B, where the high-noise NPCA-optimal solutions have rank 1 (note that $\beta_1(1) = s^2 > 0$). See [9] for more details.
The difference between noisy PCA and the above information theoretical results is that the noisy PCA problem applies to non-Gaussian signal and/or noise distributions as well, and hence it is more general.
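A minimal water-filling sketch follows (hypothetical noise variances; nothing here is taken from [7] beyond the allocation rule of Theorem 6). It finds the water level $L$ by bisection and shows that low-power budgets feed only the quietest channels:

```python
import numpy as np

def water_filling(noise_var, total_power):
    """Allocate sigma^2_{x,i} = max(L - sigma^2_{e,i}, 0) so that the allocations sum to total_power."""
    lo, hi = 0.0, total_power + float(np.max(noise_var))
    for _ in range(100):                          # bisection on the water level L
        L = 0.5 * (lo + hi)
        if np.maximum(L - noise_var, 0.0).sum() > total_power:
            hi = L
        else:
            lo = L
    return np.maximum(L - noise_var, 0.0)

noise = np.array([0.1, 0.5, 1.0, 3.0])            # hypothetical channel noise variances
for s2 in (0.2, 2.0, 10.0):
    print(s2, water_filling(noise, s2).round(3))  # more channels fill up as s^2 grows
```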
V. Robustness of Hebbian PCA Neural Models
A. Single Principal Component Learning
One of the earliest and most influential Hebbian learning algorithms for extracting principal components was proposed in 1982 by Oja [16]. It applies to a single linear neuron. Let $w = [w_1, \ldots, w_n]^T$ denote the synaptic weight vector of such a neuron and let $y = w^T x$ be its output given an input signal $x$. The Oja rule uses the adaptation formula
$$\Delta w \propto xy - y^2 w \qquad (25)$$
and extracts the unit-length principal eigenvector $u_1$ (the sign ambiguity is inconsequential) of the input autocorrelation matrix $R_x$. Oja's successful neural principal component extraction algorithm for a single unit was extended to multiple principal component extraction algorithms proposed by various authors (see e.g. [1], [3]). Essentially, all Hebbian PCA rules are based on Oja's rule. Next we shall study the robustness of this fundamental rule when there is noise present in the feature unit. Suppose that the cell activation is a non-deterministic function of the input signal, the non-determinism modeled as in NPCA by an additive noise term:
$$y = w^T x + e.$$
According to stochastic approximation theory [13], [15], algorithm (25) can be studied with the help of its associated ODE
$$\frac{dw}{dt} = E(xy - y^2 w) = (R_x - \sigma_e^2 I)w - (w^T R_x w)w, \qquad (26)$$
where $\sigma_e^2$ denotes the noise variance. If $\sigma_e^2$ is small enough so that the principal eigenvalue of the matrix $\tilde{R}_x \equiv R_x - \sigma_e^2 I$ is positive, then the ODE (26) extracts a principal eigenvector of $\tilde{R}_x$ [3], which is also a principal eigenvector of $R_x$. This happens if $\sigma_e^2 < \lambda_{x,1}$. In the noisy case, the norm of the steady state solution is seen to be given by
$$\|w\|^2 = 1 - \sigma_e^2 / \lambda_{x,1}.$$
Hence, unlike the noiseless PCA case, it is not unity. If on the other hand $\sigma_e^2 \geq \lambda_{x,1}$, $\tilde{R}_x$ becomes negative semidefinite, and in this case the globally stable attractor of (26) is $w = 0$.
To evaluate the robustness of rule (25) we must compare its solution with the optimal NPCA solution for an appropriate constraint. As Oja's rule employs only a single weight vector, a natural choice for a constraint is the symmetry one, where $B = A^T$. In this case, our results in section III-C yield that the optimal solution for a single vector is $w_{NPCA} = \alpha_1 u_1$ (where $\alpha_1^2 = 1 - \sigma_e^2/(2\lambda_{x,1})$) if $\sigma_e^2 < 2\lambda_{x,1}$, and zero otherwise. Hence, Oja's rule has only half the noise tolerance of the optimal NPCA solution, since the weights result in the trivial zero solution at noise levels $\lambda_{x,1}$ and $2\lambda_{x,1}$, respectively. Clearly, even for small noise variances the Oja solution is not NPCA-optimal, as shown by the disparity between the norms of Oja's solution and the optimal NPCA solution (which are $\sqrt{1 - \sigma_e^2/\lambda_{x,1}}$ and $\sqrt{1 - \sigma_e^2/(2\lambda_{x,1})}$, respectively).
The robust stochastic gradient descent law
$$\Delta w \propto (2 - \|w\|^2) xy - y^2 w \qquad (27)$$
derived from the NPCA cost function (1) under the constraint $A^T = B = w$ has recently been proposed by one of the authors [4]. We shall refer to this rule as the robust PCA rule for a single component. The norms of the steady-state vector $w$ in Oja's rule and in the robust rule are compared in Fig. 3 for different noise powers. Clearly, as the noise level tends to zero the robust rule tends to extract the unit-length standard principal component of $x$, as did the original Oja algorithm. The additional term $(1 - \|w\|^2) R_x w$ of the robust rule does not affect the final solution in the noise-free case since then $\|w\|^2 = 1$. However, it affects the dynamics of the algorithm when the noise level is non-zero.
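A minimal single-unit simulation sketch contrasting (25) and (27) follows; the data, noise level, and learning-rate schedule are our own choices rather than the experiment of section V-C, and the printed norms are only approximate:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 6, 60000
scales = np.linspace(2.0, 0.5, n)
X = rng.uniform(-0.5, 0.5, size=(T, n)) * scales   # zero-mean data with distinct eigenvalues
lam1 = (scales[0] ** 2) / 12.0                     # principal eigenvalue of the data covariance
sigma_e2 = 0.6 * lam1                              # noise power between 0 and lambda_{x,1}

w_oja = 0.1 * rng.standard_normal(n)
w_rob = w_oja.copy()
for t, x in enumerate(X):
    eta = 5.0 / (1000.0 + t)                       # decaying learning rate
    e = rng.normal(0.0, np.sqrt(sigma_e2))         # representation noise
    y = w_oja @ x + e
    w_oja += eta * (x * y - y * y * w_oja)                             # Oja's rule (25)
    y = w_rob @ x + e
    w_rob += eta * ((2.0 - w_rob @ w_rob) * x * y - y * y * w_rob)     # robust rule (27)

print("Oja    norm^2:", np.linalg.norm(w_oja) ** 2, " predicted:", 1 - sigma_e2 / lam1)
print("Robust norm^2:", np.linalg.norm(w_rob) ** 2, " predicted:", 1 - sigma_e2 / (2 * lam1))
```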
B. Multiple Component Case

A simple generalization of the approach followed in the previous subsection leads to the extraction of the learning rule for multiple components from the derivative of $J$ with respect to the matrix $W$. From the related ODE the discrete rule
$$\Delta W \propto xy^T(2I - W^T W) - W y y^T \qquad (28)$$
can be derived. This rule is analogous to the subspace rule
$$\Delta W \propto xy^T - W y y^T \qquad (29)$$
for multiple principal component extraction proposed by Karhunen and Oja [11], [17]. Similar to the single unit case, the robustness of the subspace rule is not NPCA-optimal. The steady state equations for the associated ODE of (29) in the noisy case $y = W^T x + e$ give
$$R_x W - W R_e = W W^T R_x W.$$
Multiplying by $W^T$ from the left and taking traces we obtain
$$\operatorname{tr}(W^T R_x W) - \operatorname{tr}(W R_e W^T) = \operatorname{tr}(W W^T R_x W W^T).$$
Hence, if the largest noise eigenvalue $\lambda_{e,1}$ is larger than the largest signal eigenvalue $\lambda_{x,1}$, then the left hand side is negative and the right hand side is positive unless $W = 0$. Recall that according to the NPCA theory, the optimal matrix is zero only for $\lambda_{e,1} > 2\lambda_{x,1}$. Therefore, again the subspace rule "quits" at half the NPCA noise level.
The difference between the two rules (28) and (29) is the term $xy^T(I - W^T W)$, which equals zero in the noise-free steady state. Indeed, it can be shown [3] that in this case the asymptotic behavior of rule (29) is $W^T W \to I$. Therefore, as with the single unit case, the two algorithms are identical in steady state for the noise-free case, although their dynamics differ in the transient state, thus leading to different results in the presence of noise. Of course, even if the noise level is zero, rule (28) can still be used since this rule is NPCA-optimal and PCA is a special case of NPCA.
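The fixed-point property of rule (28) versus the subspace rule (29) can be checked directly on the mean ODEs. The sketch below uses a synthetic $R_x$ and, for simplicity, an isotropic noise covariance; W_opt is the Theorem 4 optimum written as $W = B = A^T$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 3
M = rng.standard_normal((n, n)); Rx = M @ M.T + 0.1 * np.eye(n)   # synthetic signal covariance

lx, Ux = np.linalg.eigh(Rx); lx, Ux = lx[::-1], Ux[:, ::-1]
sigma_e2 = 0.3 * lx[0]
Re = sigma_e2 * np.eye(m)                                          # isotropic representation noise

alpha = np.sqrt(np.maximum(1.0 - sigma_e2 / (2.0 * lx[:m]), 0.0))
W_opt = Ux[:, :m] @ np.diag(alpha)                                 # NPCA-optimal W = B = A^T (Theorem 4)

def mean_update_robust(W):      # expected step of rule (28)
    return Rx @ W @ (2 * np.eye(m) - W.T @ W) - W @ (W.T @ Rx @ W + Re)

def mean_update_subspace(W):    # expected step of the subspace rule (29)
    return Rx @ W - W @ (W.T @ Rx @ W + Re)

print(np.linalg.norm(mean_update_robust(W_opt)))    # ~ 0: W_opt is a fixed point of (28)
print(np.linalg.norm(mean_update_subspace(W_opt)))  # > 0: W_opt is not a fixed point of (29)
```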
C. Simulation Experiments

We tested the old and the new (robust) algorithms (25) and (27) using artificial data. The input data is a set of 100 six-dimensional vectors created by a standard uniform noise generator in the range $[-0.5, 0.5]$. The signal eigenvalues are 0.115502, 0.112392, 0.080303, 0.075401, 0.066348, and 0.049583 (in decreasing order). The noise $e$ is zero-mean Gaussian with adjustable power. The learning rate constant that we used for both algorithms took values between 0.002 and 0.005. Fig. 3 shows the norms of the final weight vector for both cases. We see that the norm of Oja's rule decays to zero faster than that of the robust algorithm. At all non-zero noise levels below $2\lambda_{x,1}$, Oja's rule produces non-NPCA-optimal results. Fig. 4 shows the final cost obtained from the two algorithms. At noise levels approaching $\lambda_{x,1}$ the difference between the two costs is maximal. At noise level $2\lambda_{x,1}$ the two solutions are zero, so the two costs are identical. They are also identical for noise power equal to zero, as expected, since the NPCA problem degenerates to the PCA problem in this case.

VI. Discussion
This paper studies the effects of unreliability in the representation variables in a linear data compressor. We found that a non-zero level of additive noise in the representation vector ("Noisy PCA") introduces interesting phenomena.
First, the problem is unbounded unless some constraints are imposed on the linear coding or decoding operators. Typically the magnitude of the coding matrix must be bounded from above while the magnitude of the decoding matrix must be bounded from below (explicitly or implicitly). Various such constraints have been studied in the literature but a lot of questions still remain open.
Second, orthogonality does not appear to be a property of the optimal solution under most constraints. If we impose a total orthogonality constraint on both the coding and decoding operators then the MSE is larger than the MSE obtained when not forcing the matrices to orthogonality.
Third, under most constraints the Noisy PCA solution has a rank restriction depending on the level of noise in the neural activations. This imposes a limit on the number of components that can be meaningfully extracted from the input signal. The stronger the noise power, the fewer the components that have non-zero magnitude. This rank reduction of the optimal solution is similar to the water-filling analogy appearing in information theory [7], where it appears as absence of signal from the channels with the highest noise power. Most signal power is transmitted through the channel with the least noise power, while
the second largest signal power is transmitted through the channel with the second smallest noise power, and so on. Our Noisy PCA results, however, are more general than the information theoretical results because they apply to non-Gaussian signals and/or noises.
The similarity between PCA and NPCA is naturally the involvement of the signal eigenvectors in both solutions. After all, PCA is a special case of NPCA for zero noise level. Unlike PCA, if the noise level is non-zero, then the NPCA-optimal norm typically is not equal to 1, as is usually the case for PCA.
Finally, using the NPCA analysis as a tool, we study the robustness of Hebbian PCA learning rules that have been proposed in the literature. Most of these models are based on Oja's algorithm, which is shown to be less than optimally robust to the presence of noise. We follow our observation by proposing a new robust PCA learning rule for both single and multiple component extraction. The form of this algorithm is derived immediately from our NPCA theory as a stochastic gradient descent method on the noisy error surface. Simulation results are presented to verify the fact that the algorithm extracts the NPCA-optimal vectors at twice the noise level acceptable by the original PCA rules.
References
[1] P. Baldi and K. Hornik, "Learning in Linear Neural Networks: a Survey", IEEE Transactions on Neural Networks, vol. 6, no. 4, pp. 837-858, July 1995.
[2] K. I. Diamantaras and K. Hornik, "Noisy Principal Component Analysis", in J. Volaufova and V. Witkowsky, editors, Measurement '93, pages 25-33, Institute of Measurement Science, Slovak Academy of Sciences, Bratislava, Slovakia, May 1993.
[3] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications, John Wiley, New York, 1996.
[4] K. I. Diamantaras, "Robust Principal Component Extracting Neural Networks", in Proceedings ICNN-96, pp. 74-77, Washington DC, IEEE Press, 1996.
[5] K. I. Diamantaras and M. G. Strintzis, "Noisy PCA Theory and Application in Filter Bank Codec Design", to appear in ICASSP-97, Munich, April 1997.
[6] P. Foldiak, "Adaptive Network for Optimal Linear Feature Extraction", in Proc. Int. Joint Conf. Neural Networks, vol. 1, pp. 401-406, Washington DC, 1989.
[7] R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York, 1968.
[8] D. O. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[9] K. Hornik, "Noisy Linear Networks", in R. Mammone, editor, Artificial Neural Networks with Applications in Speech and Vision, Chapman & Hall, London, 1994.
[10] H. Hotelling, "Analysis of a Complex of Statistical Variables into Principal Components", J. Educ. Psych., vol. 24, pp. 498-520, 1933.
[11] J. Karhunen and E. Oja, "New Methods for Stochastic Approximation of Truncated Karhunen-Loeve Expansions", in Proc. 6th Int. Conf. on Pattern Recognition, pp. 550-553, Springer-Verlag, NY, 1982.
[12] S. Y. Kung, K. I. Diamantaras and J. S. Taur, "Adaptive Principal Component Extraction (APEX) and Application", IEEE Transactions on Signal Processing, vol. 42, no. 5, pp. 1202-1217, May 1994.
[13] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.
[14] R. Linsker, "An Application of the Principle of Maximum Information Preservation to Linear Systems", in D. S. Touretzky, editor, Advances in Neural Information Processing Systems, pages 186-194, Morgan Kaufmann, San Mateo, CA, 1989.
[15] L. Ljung, "Analysis of Recursive Stochastic Algorithms", IEEE Transactions on Automatic Control, vol. 22, no. 4, pp. 551-575, August 1977.
[16] E. Oja, "A Simplified Neuron Model as a Principal Component Analyzer", J. Math. Biology, vol. 15, pp. 267-273, 1982.
[17] E. Oja, "Neural Networks, Principal Components, and Subspaces", Int. J. Neural Systems, vol. 1, no. 1, pp. 61-68, 1989.
[18] M. D. Plumbley, "Efficient Information Transfer and Anti-Hebbian Networks", Neural Networks, vol. 6, no. 6, pp. 823-834, 1993.
[19] T. D. Sanger, "Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network", Neural Networks, vol. 2, no. 6, pp. 459-473, 1989.
Appendix I. Proof of Theorem 4
For positive real numbers x and y let
$$h(x, y) \equiv -2x \max(1 - y/(2x),\, 0)^2. \qquad (30)$$
We start with a simple lemma.
Lemma 1: For fixed $x_0$, $x_1$, where $x_1 > x_0$, the function $\Delta h(y) \equiv h(x_1, y) - h(x_0, y)$ is nondecreasing in $y$. Furthermore, for $y < 2x_1$ it is a strictly increasing function of $y$.
This lemma follows easily by using the fact that the partial derivative of $h$ with respect to $y$ equals $\max(2 - y/x,\, 0)$.
To establish Theorem 4, we note that Theorem 5.1 in [3, p. 136f] shows that the optimal $A$ is of the form
$$A = \sum_{i=1}^{m} \alpha_i u_{e,\pi(i)} u_{x,i}^T,$$
where $\pi$ is a permutation of $\{1, \ldots, m\}$ and
$$\alpha_i^2 = \max(1 - \lambda_{e,\pi(i)}/(2\lambda_{x,i}),\, 0);$$
hence, it only remains to show that the minimum is attained for $\pi(i) = m - i + 1$ for $1 \leq i \leq m$. It is readily verified that for $A$ as above,
$$J = \sum_{i=1}^{n} \lambda_{x,i} + \frac{1}{2}\sum_{i=1}^{m} h(\lambda_{x,i}, \lambda_{e,\pi(i)}).$$
Consider a permutation $\pi$ for which there exist indices $1 \leq i < j \leq m$ such that $1 \leq \pi(i) = k < l = \pi(j) \leq m$. Let $\pi'$ be the same as $\pi$ except for $\pi'(i) = l$ and $\pi'(j) = k$. Then
$$J_\pi - J_{\pi'} = \tfrac{1}{2}\left[(h(\lambda_{x,i}, \lambda_{e,k}) + h(\lambda_{x,j}, \lambda_{e,l})) - (h(\lambda_{x,i}, \lambda_{e,l}) + h(\lambda_{x,j}, \lambda_{e,k}))\right] = \tfrac{1}{2}\left[(h(\lambda_{x,i}, \lambda_{e,k}) - h(\lambda_{x,j}, \lambda_{e,k})) - (h(\lambda_{x,i}, \lambda_{e,l}) - h(\lambda_{x,j}, \lambda_{e,l}))\right] \geq 0$$
by Lemma 1. Iteratively applying this argument, we find that "removing" all $i < j$ with $\pi(i) < \pi(j)$ does not increase $J$, whence the theorem.
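The exchange argument can also be checked by brute force on a small example (hypothetical eigenvalues of our own choosing; the pairing cost below is just the permutation-dependent part of $J$):

```python
import numpy as np
from itertools import permutations

def h(x, y):
    """h(x, y) from (30)."""
    return -2.0 * x * max(1.0 - y / (2.0 * x), 0.0) ** 2

lam_x = np.array([5.0, 3.0, 2.0, 1.0])   # lambda_{x,1} > ... > lambda_{x,m}
lam_e = np.array([4.0, 2.5, 1.5, 0.5])   # lambda_{e,1} > ... > lambda_{e,m}

def pairing_cost(perm):
    """Permutation-dependent part of J: sum_i h(lambda_{x,i}, lambda_{e,perm(i)})."""
    return sum(h(lam_x[i], lam_e[p]) for i, p in enumerate(perm))

best_val = min(pairing_cost(p) for p in permutations(range(4)))
reverse = (3, 2, 1, 0)                   # pi(i) = m - i + 1, in 0-based indexing
print(np.isclose(pairing_cost(reverse), best_val))   # True: reverse pairing attains the minimum
```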
Fig. 1. The Noisy PCA data model.
Fig. 2. The water-filling analogy in information theory applies to parallel channels with additive Gaussian noise. The noise level determines the energy distribution of the signal: most signal energy goes to those channels that have the least noise power. Channels with too much noise may get no signal at all.
Fig. 4. The minimal mean-squared cost J obtained by the single-unit Oja rule and the single-unit robust rule (mean squared error versus noise variance $\sigma_e^2$). Clearly the robust rule yields the better error.
Fig. 3. The weight norms for the single-unit Oja's algorithm and the robust algorithm. Oja's rule decays to zero at an earlier noise level than the NPCA optimal robust rule. In the zero noise case the two rules give identical results.