IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 6, JUNE 2009
Boundedness and Convergence of Online Gradient Method With Penalty for Feedforward Neural Networks

Huisheng Zhang, Wei Wu, Fei Liu, and Mingchen Yao
Abstract—In this brief, we consider an online gradient method with penalty for training feedforward neural networks. Specifically, the penalty is a term proportional to the norm of the weights. Its roles in the method are to control the magnitude of the weights and to improve the generalization performance of the network. By proving that the weights are automatically bounded during network training with penalty, we simplify the conditions that have been required in the literature for convergence of the online gradient method. A numerical example is given to support the theoretical analysis.

Index Terms—Boundedness, convergence, feedforward neural networks, online gradient method, penalty.
I. INTRODUCTION

The convergence of the online gradient method for network training has been considered by many authors. For example, White [19] investigated the convergence of the learning process for feedforward network models with a single hidden layer by using the stochastic approximation theory in [10]. Other convergence results obtained by probabilistic asymptotic analysis include those of [5], [6], [12], and [18]. For some special online learning schemes in which the network inputs are presented as an ordered sequence, deterministic convergence results have been obtained in [1], [11], [17], [20], and [21].

One crucial condition for the convergence of the online gradient method is the boundedness of the network weights during the online learning process. Indeed, most of the aforementioned convergence results explicitly or implicitly assume that this condition holds. In practice, however, these boundedness conditions may be difficult to check, and there is no theory to guarantee them. Although the boundedness condition can be replaced by other conditions (see, e.g., [1] and [2]), it remains important, because the new conditions are also difficult to check.

Adding a penalty term to the error function (see, e.g., [9], [13]–[16], and references therein) has become a common practice for keeping the network weights bounded. (The penalty method can also improve the generalization performance of the network, but we will not elaborate on this point.) The boundedness of the weights is an obvious fact when a convergent training method (such as the quadratic programming used in support vector machines in [4]) is used to minimize the cost function with a penalty term.
When using the online gradient method to minimize the cost function with penalty, the boundedness of the weights is not so obvious, because the decrease of the cost function and the convergence of the method during the learning process are almost always obtained by first

Manuscript received October 11, 2007; revised April 22, 2008 and March 15, 2009; accepted April 05, 2009. First published May 08, 2009; current version published June 03, 2009. This work was supported by the National Natural Science Foundation of China (10871220).
H. Zhang is with the Applied Mathematics Department, Dalian University of Technology, Dalian 116023, China, and also with the Department of Mathematics, Dalian Maritime University, Dalian 116026, China (e-mail: [email protected]).
W. Wu and M. Yao are with the Applied Mathematics Department, Dalian University of Technology, Dalian 116023, China (e-mail: [email protected]; [email protected]).
F. Liu is with the Department of Statistics, University of Missouri—Columbia, Columbia, MO 65211-6100 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNN.2009.2020848
1045-9227/$25.00 © 2009 IEEE
Authorized licensed use limited to: National Taiwan University. Downloaded on June 2, 2009 at 17:40 from IEEE Xplore. Restrictions apply.
assuming the network weights are bounded. Grippo [7] designed a special convergent online gradient algorithm for neural network training under the assumption that the error function has compact level sets, which is implied by the addition of the penalty term. To understand how the penalty approach works, Hanson and Pratt [8] have given some explanations under certain conditions. But there seems to be no theoretical proof of the weight boundedness for the online gradient method with penalty. The main purpose of this brief is to fill this theoretical gap in the literature. We show that the weights are in fact deterministically bounded in the online learning process when a usual penalty term, proportional to the norm of the weights, is added. In addition, we give a convergence result for the online learning process to which this boundedness result applies. Consequently, we can remove the boundedness of the weights as a precondition.

The remainder of this brief is organized as follows. In Section II, we briefly describe a neural network model with one hidden layer and an online gradient learning method with penalty. Section III presents the main convergence theorem. A numerical example is given in Section IV to support the convergence result. We summarize the findings and conclusions in Section V. The proof of the theorem (including the boundedness results) is relegated to the Appendix.

II. ONLINE GRADIENT LEARNING METHOD WITH PENALTY

Consider a feedforward neural network with one hidden layer, $p$ input nodes, $q$ hidden nodes, and one output node. Let $w = (w_1, w_2, \ldots, w_q)^T \in \mathbb{R}^q$ be the weight vector between the output and the hidden layer, and let $V = (v_{ij})_{p \times q}$, with column vectors $V_j = (v_{1j}, v_{2j}, \ldots, v_{pj})^T \in \mathbb{R}^p$, be the weight matrix between the input and the hidden layer. Given activation functions $f: \mathbb{R} \to \mathbb{R}$ for the hidden layer and $g: \mathbb{R} \to \mathbb{R}$ for the output layer, we define a vector function $F(x) = (f(x_1), f(x_2), \ldots, f(x_q))^T$ for $x = (x_1, x_2, \ldots, x_q)^T \in \mathbb{R}^q$. For an input $\xi \in \mathbb{R}^p$, the output vector of the hidden layer can be written as $F(V^T\xi)$, and the final output of the network can be written as

$$Y = g(w \cdot F(V^T\xi)) \qquad (1)$$

where $w \cdot F(V^T\xi)$ represents the inner product of the two vectors $w$ and $F(V^T\xi)$. Letting $O \in \mathbb{R}$ be the target output with respect to an input $\xi$, we define the instant error function as

$$\hat{E}(w, V; O, \xi) = \frac{1}{2}(O - Y)^2 = \frac{1}{2}\left(O - g(w \cdot F(V^T\xi))\right)^2. \qquad (2)$$
By adding a penalty term, the modified instant error function takes the form (cf. [15])

$$E(w, V; O, \xi) = \hat{E}(w, V; O, \xi) + \frac{\lambda}{2}\|w\|^2 + \frac{\lambda}{2}\sum_{j=1}^{q}\|V_j\|^2 = \frac{1}{2}\left(O - g(w \cdot F(V^T\xi))\right)^2 + \frac{\lambda}{2}\|w\|^2 + \frac{\lambda}{2}\sum_{j=1}^{q}\|V_j\|^2 \qquad (3)$$
where $\lambda > 0$ is a penalty coefficient and $\|\cdot\|$ denotes the usual Euclidean norm.

Remark 2.1: Our results in this brief can be easily generalized to the case in which each weight vector norm in (3) has its own penalty coefficient. They can also be generalized by considering the squared distance between the current weights and some reference point; such a generalized penalty term has been used in constructing some new training methods [7]. The optimal choice of the penalty coefficient $\lambda$ remains a difficult problem, as discussed in [16], and we will not elaborate on this point in this brief.
Differentiating $E(w, V; O, \xi)$ with respect to $w$ and $V_j$, respectively, we have

$$\nabla_w E(w, V; O, \xi) = \nabla_w \hat{E}(w, V; O, \xi) + \lambda w = -\left(O - g(w \cdot F(V^T\xi))\right) g'(w \cdot F(V^T\xi))\, F(V^T\xi) + \lambda w \qquad (4a)$$

and

$$\nabla_{V_j} E(w, V; O, \xi) = \nabla_{V_j} \hat{E}(w, V; O, \xi) + \lambda V_j = -\left(O - g(w \cdot F(V^T\xi))\right) g'(w \cdot F(V^T\xi))\, w_j f'(V_j \cdot \xi)\, \xi + \lambda V_j, \quad j = 1, 2, \ldots, q. \qquad (4b)$$
Let $\{(O^n, \xi^n)\}_{n=1}^{\infty}$ be a sequence of training samples, and let $w^0 \in \mathbb{R}^q$ and $V^0$ be arbitrary initial weights. The online gradient method with penalty updates the weights as follows:

$$w^{n+1} = w^n - \eta_n \nabla_w E(w^n, V^n; O^n, \xi^n), \quad n = 0, 1, \ldots \qquad (5a)$$
$$V_j^{n+1} = V_j^n - \eta_n \nabla_{V_j} E(w^n, V^n; O^n, \xi^n), \quad j = 1, \ldots, q; \; n = 0, 1, \ldots \qquad (5b)$$

where $\eta_n > 0$ is the learning rate, which may depend on $n$.
This online method with penalty can be written as the following pseudocode.

1. Initialize $w^0$ and $V^0$.
2. FOR $n = 0, 1, \ldots$
   Obtain $\eta_n$
   Randomly choose a training sample $(O^n, \xi^n)$
   Update the network weights using (5)
   END
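The pseudocode above can be sketched in runnable form. This is a minimal illustration under Assumption A2') (linear output $g(t) = t$, $\tanh$ hidden units); the sizes, the schedule $\eta_n$, the penalty coefficient, and the synthetic training set are assumptions made here, not choices taken from the brief.

```python
import math
import random

# Minimal sketch of the online gradient method with penalty (5a)-(5b),
# assuming A2'): g(t) = t and f = tanh. All concrete settings below
# (sizes, eta_n, lam, data) are illustrative assumptions.
p, q, lam = 2, 3, 0.05
random.seed(1)
samples = [([random.uniform(-1, 1) for _ in range(p)], random.uniform(-1, 1))
           for _ in range(20)]  # finite training set of (xi, O) pairs

w = [random.uniform(-1, 1) for _ in range(q)]
V = [[random.uniform(-1, 1) for _ in range(q)] for _ in range(p)]
max_w_norm = 0.0

for n in range(2000):
    eta = 0.2 / (n + 1) ** 0.5      # decreasing rate consistent with A3)
    xi, O = random.choice(samples)  # randomly chosen training sample
    h = [math.tanh(sum(V[i][j] * xi[i] for i in range(p))) for j in range(q)]
    err = -(O - sum(w[j] * h[j] for j in range(q)))  # -(O - w.F(V^T xi))
    # (5a)-(5b): both updates use the OLD weights w^n, V^n
    w_new = [w[j] - eta * (err * h[j] + lam * w[j]) for j in range(q)]
    V = [[V[i][j] - eta * (err * w[j] * (1 - h[j] ** 2) * xi[i] + lam * V[i][j])
          for j in range(q)] for i in range(p)]
    w = w_new
    max_w_norm = max(max_w_norm, math.sqrt(sum(x * x for x in w)))

# consistent with the boundedness results: the weight norms stay bounded
assert max_w_norm < 100.0
```

In this run the largest weight norm encountered stays modest, in line with the automatic boundedness that the brief establishes for training with the penalty term.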
III. MAIN RESULTS

As has been discussed earlier, the boundedness of the network weights is a crucial condition for the convergence of the online gradient method. In this section, a convergence result for the online gradient method with penalty is presented in which the boundedness of the weights is not required as a precondition. The proof of the theorem is relegated to the Appendix. The definitions of the probability terminologies used below are the same as those in [19]. Our main theoretical result is based on the following assumptions.

A1) The training sequence $\{(O^n, (\xi^n)^T)^T\}_{n=1}^{\infty}$ is a bounded sequence of independent identically distributed (i.i.d.) random vectors.
A2) The functions $f$ and $g$ are twice continuously differentiable on $\mathbb{R}$. Moreover, $f$, $g$, $f'$, $g'$, $f''$, and $g''$ are uniformly bounded on $\mathbb{R}$.
A2') The function $f$ is twice continuously differentiable on $\mathbb{R}$. Moreover, $f$, $f'$, and $f''$ are uniformly bounded on $\mathbb{R}$, and $g(t) = t$ for all $t \in \mathbb{R}$.
A3) $\{\eta_n\}$ is a decreasing positive sequence such that a) $\sum_{n=0}^{\infty} \eta_n = \infty$; b) $\limsup_{n\to\infty} \left|\eta_n^{-1} - \eta_{n-1}^{-1}\right| < \infty$; and c) $\sum_{n=0}^{\infty} \eta_n^d < \infty$ for some $d > 1$.

Remark 3.1: These assumptions are quite general. Assumption A1) is usually satisfied in the settings of regular real-time learning, and in the settings where $\{(O^n, (\xi^n)^T)^T\}_{n=1}^{\infty}$ is randomly generated from a finite training set. Assumption A2) is satisfied by typical activation functions such as sigmoid functions. The important case in which the network output is a linear function of the hidden-layer outputs is covered by Assumption A2').
Note that Assumption A3) holds for $\eta_n = n^{-k}$ with $0 < k \leq 1$. This is more general than a commonly used assumption on $\eta_n$ (see, e.g., [1], [11], and [18]):

$$\sum_{n=0}^{\infty} \eta_n = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} (\eta_n)^2 < \infty.$$
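For instance, with $\eta_n = (n+1)^{-1/2}$ the classical square-summability condition fails, while A3) c) holds with $d = 3$. The partial sums below illustrate (but of course do not prove) this tail behavior; the cutoffs are arbitrary.

```python
# Numerical illustration (not a proof) for eta_n = (n+1)**(-1/2):
# sum eta_n^2 = sum 1/(n+1) has a growing tail (divergence), while
# sum eta_n^3 = sum (n+1)**(-1.5) has a vanishing tail (convergence),
# so A3) c) holds with d = 3 although the classical condition fails.
def partial_sum(power, terms):
    return sum((n + 1.0) ** power for n in range(terms))

tail_sq = partial_sum(-1.0, 10**6) - partial_sum(-1.0, 10**3)
tail_cube = partial_sum(-1.5, 10**6) - partial_sum(-1.5, 10**3)
assert tail_sq > 5.0      # harmonic-like tail keeps growing
assert tail_cube < 0.1    # the d = 3 tail is already negligible
```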
for all $n \geq N_0$. We proceed to prove (13) by considering the following two cases.

i) For any $n$ ($n \geq N_0$), the inequality $\|w^n\| \leq 2C_2/\lambda$ always holds. In this case, we simply set

$$C_3 = \max\left\{\|w^0\|, \|w^1\|, \ldots, \|w^{N_0}\|, 2C_2/\lambda\right\}$$

to validate (13).

ii) There exists an integer $N$ ($N \geq N_0$) such that

$$\|w^N\| > 2C_2/\lambda. \qquad (18)$$

In this case, we can prove by induction on $n$ that

$$\|w^n\| \leq \|w^N\| + C_2/\lambda, \quad \forall\, n = N, N+1, \ldots \qquad (19)$$

Equation (19) is evidently valid for $n = N$. So we suppose that (19) is valid for an integer $n$ ($n \geq N$), and we try to show that (19) is also valid for $n+1$. If $\|w^n\| \leq 2C_2/\lambda$, using (15) and (17), we have

$$\|w^{n+1}\| \leq (1 - \eta_n\lambda)\|w^n\| + \eta_n C_2 \leq \|w^n\| + \eta_n C_2 \leq \frac{2C_2}{\lambda} + \frac{C_2}{\lambda} \leq \|w^N\| + \frac{C_2}{\lambda}. \qquad (20)$$

If, on the other hand, $\|w^n\| > 2C_2/\lambda$, then a combination of (15), (17), and the induction assumption produces

$$\|w^{n+1}\| \leq (1 - \eta_n\lambda)\|w^n\| + \frac{\eta_n\lambda}{2}\|w^n\| = \left(1 - \frac{\eta_n\lambda}{2}\right)\|w^n\| < \|w^n\| \leq \|w^N\| + \frac{C_2}{\lambda}. \qquad (21)$$

Now we have shown by induction that (19) is always true for Case ii). Hence, (13) is valid for Case ii) by setting

$$C_3 = \max\left\{\|w^0\|, \|w^1\|, \ldots, \|w^{N-1}\|, \|w^N\| + C_2/\lambda\right\}. \qquad (22)$$

So (13) is true in both Cases i) and ii).

Proof to (14) of Lemma 6.2: Using (2), (4b), (13), and Assumptions A1) and A2), for all $n$, we have

$$\left\|\nabla_{V_j}\hat{E}(w^n, V^n; O^n, \xi^n)\right\| \leq \left(|O^n| + \left|g(w^n \cdot F((V^n)^T\xi^n))\right|\right) \left|g'(w^n \cdot F((V^n)^T\xi^n))\right| \|w^n\| \left|f'(V_j^n \cdot \xi^n)\right| \|\xi^n\| \leq \left(\sup|O^n| + \sup_{t\in\mathbb{R}}|g(t)|\right) \sup_{t\in\mathbb{R}}|g'(t)|\, C_3 \sup_{t\in\mathbb{R}}|f'(t)| \sup\|\xi^n\| \leq C_5. \qquad (23)$$

Now, the remaining part of the proof to (14) can copy the corresponding proof to (13). The detail is omitted.

Lemma 6.3: Suppose $a = (a_1, a_2, \ldots, a_q)^T \in \mathbb{R}^q$; then the eigenvalues of the matrix $aa^T$ are 0, with multiplicity $q - 1$, and $a_1^2 + a_2^2 + \cdots + a_q^2$.

Proof: The proof is simple and omitted.

Lemma 6.4: Suppose that Assumptions A1), A2'), and A3) are valid. Let $\{w^n\}_{n\geq 1}$ and $\{V_j^n\}_{n\geq 1}$ ($j = 1, \ldots, q$) be the sequences of weight vectors generated by (5) with arbitrary initial values $w^0$ and $V^0$. Then $\{w^n\}_{n\geq 1}$ and $\{V_j^n\}_{n\geq 1}$ are uniformly bounded, i.e., there exist positive constants $C_6$ and $C_7$ such that

$$\|w^n\| \leq C_6, \quad n = 1, 2, \ldots \qquad (24)$$
$$\|V_j^n\| \leq C_7, \quad j = 1, \ldots, q; \; n = 1, 2, \ldots \qquad (25)$$

Proof: If Assumption A2') is valid, then (4) takes the following form:

$$\nabla_w E(w, V; O, \xi) = -\left(O - w \cdot F(V^T\xi)\right) F(V^T\xi) + \lambda w \qquad (26a)$$
$$\nabla_{V_j} E(w, V; O, \xi) = -\left(O - w \cdot F(V^T\xi)\right) w_j f'(V_j \cdot \xi)\, \xi + \lambda V_j, \quad j = 1, 2, \ldots, q. \qquad (26b)$$

By (26a) and (5a), we have

$$w^{n+1} = (1 - \eta_n\lambda)w^n + \eta_n \left(O^n - w^n \cdot F((V^n)^T\xi^n)\right) F((V^n)^T\xi^n) = (1 - \eta_n\lambda)w^n - \eta_n \left(F((V^n)^T\xi^n) F^T((V^n)^T\xi^n)\right) w^n + \eta_n O^n F((V^n)^T\xi^n) = \left[(1 - \eta_n\lambda)I - \eta_n F((V^n)^T\xi^n) F^T((V^n)^T\xi^n)\right] w^n + \eta_n O^n F((V^n)^T\xi^n) \qquad (27)$$

where $I$ is the identity matrix. From Lemma 6.3, we know that the matrix $F((V^n)^T\xi^n) F^T((V^n)^T\xi^n)$ has the eigenvalues 0 (of multiplicity $q - 1$) and $\|F((V^n)^T\xi^n)\|^2$. Thus, the matrix $(1 - \eta_n\lambda)I - \eta_n F((V^n)^T\xi^n) F^T((V^n)^T\xi^n)$ has the following eigenvalues:

$$1 - \eta_n\lambda, \quad 1 - \eta_n\lambda - \eta_n\|F((V^n)^T\xi^n)\|^2. \qquad (28)$$

By Assumption A2'), we have

$$\|F(z)\| \leq \sqrt{q}\, \sup_{t\in\mathbb{R}}|f(t)| = C_1, \quad z \in \mathbb{R}^q. \qquad (29)$$

By (29) and Assumption A3), there exists an integer $N_1$ such that for all $n > N_1$

$$0 < \eta_n\lambda < 1, \quad 0 < 1 - \eta_n\lambda - \eta_n\|F((V^n)^T\xi^n)\|^2 \leq 1 - \eta_n\lambda < 1. \qquad (30)$$
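Lemma 6.3 is easy to confirm numerically; the vector below is an arbitrary example chosen for illustration.

```python
# Pure-Python check of Lemma 6.3: for a in R^q, the matrix a a^T has
# eigenvalue a_1^2 + ... + a_q^2 with eigenvector a, and eigenvalue 0
# on the (q-1)-dimensional subspace orthogonal to a.
a = [1.0, -2.0, 0.5, 3.0]
q = len(a)
M = [[a[i] * a[j] for j in range(q)] for i in range(q)]  # M = a a^T
norm_sq = sum(x * x for x in a)

# M a = ||a||^2 a  ->  a is an eigenvector with eigenvalue sum(a_i^2)
Ma = [sum(M[i][j] * a[j] for j in range(q)) for i in range(q)]
assert all(abs(Ma[i] - norm_sq * a[i]) < 1e-12 for i in range(q))

# any b orthogonal to a satisfies M b = 0  ->  eigenvalue 0
b = [2.0, 1.0, 0.0, 0.0]  # chosen so that a . b = 0
assert abs(sum(a[i] * b[i] for i in range(q))) < 1e-12
Mb = [sum(M[i][j] * b[j] for j in range(q)) for i in range(q)]
assert all(abs(x) < 1e-12 for x in Mb)
```

The trace of $aa^T$ equals $\sum_i a_i^2$, which matches the sum of the eigenvalues claimed by the lemma.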
It follows from (27)–(30) that for $n > N_1$

$$\|w^{n+1}\| \leq \left\|(1 - \eta_n\lambda)I - \eta_n F((V^n)^T\xi^n) F^T((V^n)^T\xi^n)\right\| \|w^n\| + \eta_n \left\|O^n F((V^n)^T\xi^n)\right\| = (1 - \eta_n\lambda)\|w^n\| + \eta_n \left\|O^n F((V^n)^T\xi^n)\right\| \leq (1 - \eta_n\lambda)\|w^n\| + \eta_n C_1 \sup|O^n| = (1 - \eta_n\lambda)\|w^n\| + \eta_n C_8 \qquad (31)$$

where $C_8 = C_1 \sup|O^n|$. Now, the remaining part of the proof to (24) can copy the corresponding proof to (13). The detail is omitted.

Next, we begin to prove (25). By (24), (26b), (29), and Assumptions A1) and A2'), for $j = 1, \ldots, q$, we have

$$\left\|\left(O^n - w^n \cdot F((V^n)^T\xi^n)\right) w_j^n f'(V_j^n \cdot \xi^n)\, \xi^n\right\| \leq \left(|O^n| + \|w^n\| \left\|F((V^n)^T\xi^n)\right\|\right) \|w^n\| \left|f'(V_j^n \cdot \xi^n)\right| \|\xi^n\| \leq \left(\sup|O^n| + C_6 C_1\right) C_6 \sup_{t\in\mathbb{R}}|f'(t)| \sup\|\xi^n\| \leq C_9. \qquad (32)$$

By (5b) and (26b), we have

$$V_j^{n+1} = (1 - \eta_n\lambda)V_j^n + \eta_n \left(O^n - w^n \cdot F((V^n)^T\xi^n)\right) w_j^n f'(V_j^n \cdot \xi^n)\, \xi^n. \qquad (33)$$

By virtue of the triangle inequality and (32), we have

$$\|V_j^{n+1}\| \leq |1 - \eta_n\lambda| \|V_j^n\| + \eta_n C_9. \qquad (34)$$

Now, the remaining part of the proof to (25) can copy the corresponding proof to (13).

The proof of Theorem 3.2 will proceed along the lines of [19, Proposition 3.1.a]. For the convenience of reading, we present it below as a lemma. In this lemma, we write $\theta^n \to \infty$ when $\|\theta^n\| \to \infty$.

Lemma 6.5: Let $\{Z^n\}_{n\geq 0} \subset \mathbb{R}^v$ be a sequence of bounded i.i.d. random vectors, and $m: \mathbb{R}^v \times \mathbb{R}^l \to \mathbb{R}^l$ a continuously differentiable function. Suppose that, for each $\theta$ in $\mathbb{R}^l$, the expectation $M(\theta) \equiv E(m(Z^n, \theta)) < \infty$, that there exists a twice continuously differentiable function $Q: \mathbb{R}^l \to \mathbb{R}$ such that $(\nabla Q(\theta))^T M(\theta) \leq 0$ for all $\theta$ in $\mathbb{R}^l$, and that $\{\eta_n\}_{n=1}^{\infty}$ is a sequence satisfying Assumption A3). Define the sequence $\{\theta^n\}_{n=0}^{\infty} \subset \mathbb{R}^l$ iteratively by $\theta^{n+1} = \theta^n + \eta_n m(Z^n, \theta^n)$, where $\theta^0$ is arbitrarily given. Then, either $\theta^n \to \theta^*$ with $\theta^* \in \{\theta \mid (\nabla Q(\theta))^T M(\theta) = 0\}$ or $\theta^n \to \infty$ with probability 1.

Proof to Theorem 3.2: We want to show under the assumptions of Theorem 3.2 that $Z^n$, $\theta^n$, and $\eta_n$ defined in (7) satisfy all the conditions of Lemma 6.5. Then, the conclusions of Lemmas 6.2, 6.4, and 6.5 immediately result in the conclusion of Theorem 3.2.

Assumption A1) ensures that $\{Z^n\}$ is i.i.d. and bounded. By Assumption A2) [or A2')] we see that ($\nabla$ denotes the gradient with respect to $\theta$)

$$m(Z, \theta) \equiv -\nabla d(Z, \theta) = \nabla\pi(\xi, \theta)^T \left(O - \pi(\xi, \theta)\right) - \lambda\theta \qquad (35)$$

is continuously differentiable on $\mathbb{R}^{p+1} \times \mathbb{R}^{q+pq}$. Thus, we have

$$M(\theta) = E\left(-\nabla d(Z^n, \theta)\right) = E\left(\nabla\pi(\xi^n, \theta)^T (O^n - \pi(\xi^n, \theta)) - \lambda\theta\right)$$

where $E(\cdot)$ denotes the expectation with respect to $Z^n$. By virtue of A2) [or A2')] and the definition of $\pi(\xi, \theta)$, the quantity $\nabla\pi(\xi^n, \theta)^T (O^n - \pi(\xi^n, \theta)) - \lambda\theta$ is obviously bounded for each fixed $\theta \in \mathbb{R}^{q+pq}$, and this leads to $M(\theta) < \infty$ for each fixed $\theta \in \mathbb{R}^{q+pq}$. As in Theorem 3.2, we define $Q(\theta) = E[d(Z^n, \theta)]$. According to [3, Th. 16.8], $\nabla E(d(Z^n, \theta)) = E(\nabla d(Z^n, \theta))$. Hence, for all $\theta \in \mathbb{R}^{q+pq}$, we have

$$\nabla Q(\theta) = E\left(\nabla d(Z^n, \theta)\right) = -M(\theta).$$

So we have $(\nabla Q(\theta))^T M(\theta) = -M(\theta)^T M(\theta) \leq 0$. It has been assumed that $\{\eta_n\}$ satisfies Assumption A3). Now all the conditions of Lemma 6.5 have been verified, and by the continuity of $\nabla Q(\theta)$ guaranteed by Assumption A2) [or A2')], we conclude that either $\theta^n \to \theta^* \in \{\theta \mid \nabla Q(\theta) = 0\}$ or $\theta^n \to \infty$ with probability 1. But it has been shown by (13), (14), (24), and (25) that $\|\theta^n\|$ is uniformly bounded for all $n$. This finally completes the proof.

REFERENCES
[1] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[2] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM J. Optim., vol. 3, pp. 627–642, 2000.
[3] P. Billingsley, Probability and Measure. New York: Wiley, 1979.
[4] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Adv. Comput. Math., vol. 13, pp. 1–50, 2000.
[5] T. L. Fine and S. Mukherjee, "Parameter convergence and learning curves for neural networks," Neural Comput., vol. 11, pp. 747–769, 1999.
[6] A. A. Gaivoronski, "Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods (Part I)," Optim. Methods Software, vol. 4, pp. 117–134, 1994.
[7] L. Grippo, "Convergent on-line algorithms for supervised learning in neural networks," IEEE Trans. Neural Netw., vol. 11, no. 6, pp. 1284–1299, Nov. 2000.
[8] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," Adv. Neural Inf. Process., vol. 1, pp. 177–185, 1989.
[9] E. D. Karnin, "A simple procedure for pruning back-propagation trained neural networks," IEEE Trans. Neural Netw., vol. 1, no. 2, pp. 239–242, Jun. 1990.
[10] L. Ljung, "Analysis of recursive stochastic algorithm," IEEE Trans. Autom. Control, vol. AC-22, no. 4, pp. 551–575, Aug. 1977.
[11] O. L. Mangasarian and M. V. Solodov, "Serial and parallel backpropagation convergence via nonmonotone perturbed minimization," Optim. Methods Software, vol. 4, pp. 117–134, 1994.
[12] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Trans. Neural Netw., vol. 15, no. 6, pp. 1411–1423, Nov. 2004.
[13] S. J. Nowlan and G. E. Hinton, "Simplifying neural networks by soft weight sharing," Neural Comput., vol. 4, pp. 173–193, 1992.
[14] D. C. Plaut, S. J. Nowlan, and G. E. Hinton, "Experiments on learning by back-propagation," Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-86-126, 1986.
[15] R. Reed, "Pruning algorithms: A survey," IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 740–747, Sep. 1993.
[16] K. Saito and R. Nakano, "Second-order learning algorithm with squared penalty term," Neural Comput., vol. 12, pp. 709–729, 2000.
[17] H. Shao, W. Wu, and L. Liu, "Convergence and monotonicity of an online gradient method with penalty for neural networks," WSEAS Trans. Math., vol. 6, pp. 469–476, 2007.
[18] V. Tadic and S. Stankovic, "Learning in neural networks by normalized stochastic gradient algorithm: Local convergence," in Proc. 5th Seminar Neural Netw. Appl. Electr. Eng., Yugoslavia, Sep. 2000, pp. 11–17.
[19] H. White, "Some asymptotic results for learning in single hidden-layer feedforward network models," J. Amer. Statist. Assoc., vol. 84, pp. 1003–1013, 1989.
[20] W. Wu, G. Feng, and X. Li, "Training multilayer perceptrons via minimization of sum of ridge functions," Adv. Comput. Math., vol. 17, pp. 331–347, 2002.
[21] W. Wu, G. Feng, Z. Li, and Y. Xu, "Deterministic convergence of an online gradient method for BP neural networks," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 533–540, May 2005.