Robust adaptive learning of feedforward neural networks via LMI optimizations

Xingjian Jing
Department of Mechanical Engineering, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
E-mail addresses: [email protected], [email protected]
Article history: Received 30 March 2011; received in revised form 1 March 2012; accepted 2 March 2012.

Keywords: Feed-forward neural network (FNN); Robust learning; Linear matrix inequality (LMI); Robust control approach
Abstract

Feedforward neural networks (FNNs) have been extensively applied to areas such as control, system identification, function approximation and pattern recognition. This study further develops a novel robust control approach to the learning problems of FNNs, with the aim of obtaining efficient learning algorithms that can be implemented with optimal parameter settings while accounting for noise in the data. To this end, the learning problem of an FNN is cast as a robust output feedback control problem for a discrete time-varying linear dynamic system. New robust learning algorithms with adaptive learning rates are then developed, using linear matrix inequality (LMI) techniques to find appropriate learning rates and to guarantee fast and robust convergence. Theoretical analysis and examples are given to illustrate the results.
1. Introduction

Learning algorithms for feed-forward neural networks (FNNs) have been extensively studied. A well-known one is the gradient descent or back-propagation (BP) algorithm, which has served as a basic algorithm for training FNNs in many areas, e.g., function approximation, system identification, pattern recognition and control (Chen, Hong, Harris, & Sharkey, 2004; Hagan, Demuth, & Beale, 1996a; Haykin, 1999; Jarabo-Amores, Rosa-Zurera, Gil-Pita, & Lopez-Ferreras, 2009; Wei, Billings, Zhao, & Guo, 2010). The traditional BP algorithm is known to suffer from slow convergence (and instability in some cases) and from an inability to ensure a global minimum. Various improvements, for example the introduction of a momentum term, the use of standard optimization techniques (e.g., quasi-Newton, conjugate-gradient, recursive least squares and Levenberg–Marquardt (LM) methods) and adaptive learning rates, have been reported to overcome these problems (Bilski & Rutkowski, 1998; Charalambous, 1992; Hagan et al., 1996a; Lera & Pinzolas, 2002; Osowski, Bojarczak, & Stodolski, 1996; Sarkar, 1995). To cope with noisy data and to achieve more robust convergence, the extended Kalman filter method and the H∞ filter design technique have been utilized in developing new learning algorithms (Iiguni, Sakai, & Tokumaru, 1992; Nishiyama & Suzuki, 2001). These methods, however, may encounter limitations such as dependence on heuristic knowledge, high storage and memory requirements, difficulty in scheduling the learning rate, inability to cope with colored noise or non-stationary processes, or inability to guarantee global convergence. Moreover, some methods involve too many free
parameters to tune, or their performance is very sensitive to the parameter setting, which causes considerable inconvenience in practice. To ensure the convergence of the adaptive BP algorithm, several methods based on Lyapunov stability were recently studied in Behera, Kumar, and Patnaik (2006) and Man, Wu, Liu, and Yue (2006); these, however, do not directly consider noise effects or may not be applicable to time-varying processes. Moreover, the algorithm in Man et al. (2006) requires the analytical inverse of the nonlinear activation functions of the FNN.

In this study, a novel robust control approach to the learning problems of FNNs is further investigated. The network training problem is cast as a robust control problem (Jing, 2011). This robust control approach clearly reveals some drawbacks of existing, commonly used BP-type algorithms, especially in noisy environments, and also provides a novel and powerful insight into the design of robust learning algorithms. In this scheme, new robust learning algorithms for FNNs with adaptive learning rates are developed with full consideration of noise effects, by exploiting robust control theory and LMI optimization techniques. With the new algorithms, clear convergence information on the boundedness of the estimation error can be provided; the weights of the FNN remain bounded; and non-stationary or time-varying processes with noise or disturbance can also be addressed. The advantages of the new algorithms are demonstrated, especially in system identification and function approximation, through comparisons with several existing learning algorithms. Examples are provided to illustrate the results.

2. Problem formulation

Some existing learning algorithms are first discussed from a robust control viewpoint, and new robust learning algorithms are then developed which guarantee robust and asymptotic convergence with respect to bounded noise.
The adaline model is considered first; the development is then extended to general ``linear-in-parameters'' models and the general FNN model.

2.1. Background

For convenience, consider first an adaline with p inputs, one output and no hidden units, described at each discrete time by

$$y(n) = K(n)^T X(n) \tag{1}$$

where $X(n) = [x_1(n), x_2(n), \ldots, x_p(n)]^T$ is supposed to lie in a compact set $U \subset R^p$, $y(n) \in R$, and $K(n) = [k_1(n), k_2(n), \ldots, k_p(n)]^T$ represents the weights of the network at time n. Given a series of input–output data, the objective is to update the weights K(n) at each sampling time such that the error e(n) between the predicted output y(n) and the desired output $y^*(n)$ eventually converges towards zero. Throughout the paper it is assumed that there exists an ideal bounded $K^*(n)$ at time n such that $y^*(n) = K^*(n)^T X(n) + \varepsilon(n)$ for any X(n) in U, where ε(n) is upper bounded by a constant. The term ε(n) may include the unmodeled dynamics, which can be reduced by increasing the order of the model, and may also represent measurement noise in the output data. The estimation error between the real model and the ideal model is then $e_r(n) = y(n) - K^*(n)^T X(n) = e(n) + \varepsilon(n)$, where $e(n) = y(n) - y^*(n)$ is the error that can be measured in practice. The weight update law can be written as

$$K(n+1) = K(n) + \Delta K(n) \tag{2}$$

where ΔK(n) is the weight update to be designed. Using this update law, the error dynamics of the network in terms of e(n) can be described by

$$\begin{aligned}
e(n+1) - e(n) &= K(n+1)^T X(n+1) - K^*(n+1)^T X(n+1) - \varepsilon(n+1) - K(n)^T X(n) + K^*(n)^T X(n) + \varepsilon(n) \\
&= [K(n) + \Delta K(n)]^T [X(n) + \Delta X(n)] - [K^*(n) + \Delta K^*(n)]^T X(n+1) - K(n)^T X(n) + K^*(n)^T X(n) + \varepsilon(n) - \varepsilon(n+1) \\
&= X(n)^T \Delta K(n) + \Delta X(n)^T \Delta K(n) + (K(n)^T - K^*(n)^T)\Delta X(n) - \Delta K^*(n)^T X(n+1) + \varepsilon(n) - \varepsilon(n+1) \\
&= X(n)^T \Delta K(n) + \Delta X(n)^T \Delta K(n) + (K(n)^T - K^*(n)^T)\Delta X(n) + w(n)
\end{aligned}$$

where $\Delta X(n) = X(n+1) - X(n)$, and $w(n) = -\Delta K^*(n)^T X(n+1) + \varepsilon(n) - \varepsilon(n+1)$ can be regarded as a bounded noise process (since all the terms involved are bounded). That is,

$$e(n+1) = e(n) + X(n)^T \Delta K(n) + \Delta X(n)^T \Delta K(n) + d(n) \tag{3}$$

where $d(n) = \Delta X(n)^T (K(n) - K^*(n)) + w(n)$. The dynamic evolution of the estimation error is thus determined by three terms: the change of the weights, the change of the input, and the nonlinear cross-effect of both. From a robust control point of view, ΔK(n) can be regarded as a control input, the term d(n) as a disturbance input, and the cross term $\Delta X(n)^T \Delta K(n)$ as a disturbance coupled with the input. The error dynamics is convergent if the roots of the closed-loop characteristic equation of Eq. (3) lie inside the unit circle. Therefore, any controller ``ΔK(n)'' designed without careful consideration of the disturbance terms cannot always guarantee the convergence of the error system (3).

The traditional BP algorithm can be written as

$$\Delta K(n) = -\eta X(n) e(n) \tag{4}$$

where η > 0. Substituting (4) into (3) gives

$$e(n+1) = \{1 - \eta [X(n)^T X(n) + \Delta X(n)^T X(n)]\} e(n) + d(n). \tag{5}$$

Clearly, the control law (4) cannot always guarantee the stability or convergence of system (5) from a robust control point of view, because (i) the factor $\{1 - \eta [X(n)^T X(n) + \Delta X(n)^T X(n)]\}$ is time-varying and is not guaranteed to lie in the range (−1, 1), so the time-varying convergence rate is not guaranteed to be less than 1 in magnitude; and (ii) the time-varying disturbance d(n), which may be unknown, is not properly accounted for in the control law. A small learning rate η keeps $\eta [X(n)^T X(n) + \Delta X(n)^T X(n)]$ very small, which probably yields $\{1 - \eta [X(n)^T X(n) + \Delta X(n)^T X(n)]\} < 1$ but leads to very slow convergence. A larger learning rate η, however, may yield $|1 - \eta [X(n)^T X(n) + \Delta X(n)^T X(n)]| > 1$ and thus destabilize the network.

The improved BP algorithm with a momentum term (Phansalkar & Sastry, 1994) can be written as

$$\Delta K(n) = -\eta X(n) e(n) + \alpha \Delta K(n-1) = -\eta X(n) e(n)/(1 - \alpha Z^{-1}) \tag{6}$$

where α > 0 and $Z^{-1}$ is a delay operator (i.e., $Z^{-1}[\Delta K(n)] = \Delta K(n-1)$). The error system then becomes

$$e(n+1) = (1 + \alpha - \eta (X(n) + \Delta X(n))^T X(n)) e(n) - \alpha e(n-1) + d(n) - \alpha d(n-1). \tag{7}$$

Usually α takes a small value, so the last two terms in (7) are approximately d(n). Obviously, the problems mentioned above also affect the new error system (7); the only difference is that system (7) is now a second-order system, deliberately designed to help avoid local minima. Similar conclusions can be drawn for other existing learning algorithms for FNNs in the literature, such as delta learning rules (Kuschewski, Hui, & Zak, 1993) and various BP variants (Bilski & Rutkowski, 1998; Charalambous, 1992; Lera & Pinzolas, 2002; Osowski et al., 1996). Although the Lyapunov-function-based BP algorithms developed in Behera et al. (2006) and Man et al. (2006) provide adaptive learning rates and guarantee convergence of the network, the noise effect is not fully considered there. Some other learning algorithms with adaptive learning rates can also overcome the slow convergence and some local-minima problems of BP. However, these algorithms usually either need heuristic knowledge or additional cost for scheduling the learning rate and can hardly guarantee stability theoretically (Riedmiller & Braun, 1993; Shao & Wu, 2006), or involve several free parameters whose tuning is crucial and to which the convergence performance is sensitive (Bordes, Bottou, & Gallinari, 2009; Sutton, 1992). It should be noted that most existing learning algorithms for FNNs do not, or do not fully, consider the noise or disturbance effects incurred by w(n), $\Delta X(n)^T (K(n) - K^*(n))$ and $\Delta X(n)^T \Delta K(n)$.
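To make the learning-rate sensitivity discussed above concrete, the following minimal Python sketch (not from the paper; the input distribution and the values of η are illustrative assumptions) trains an adaline with the plain BP update (4) and shows how a small η converges slowly while a large η destabilizes the error dynamics (5).

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): train an adaline
# y(n) = K(n)^T X(n) with the plain BP update (4), Delta K = -eta * X * e,
# and observe the effect of the fixed learning rate eta on the error e(n).
rng = np.random.default_rng(0)
p = 2
K_star = np.array([1.0, -0.5])           # ideal (unknown) weights

def run_bp(eta, steps=200):
    K = np.zeros(p)                      # initial weights
    errors = []
    for n in range(steps):
        X = rng.uniform(-2.0, 2.0, p)    # input in a compact set
        e = (K - K_star) @ X             # e(n) = y(n) - y*(n), noise-free
        K = K - eta * X * e              # BP update (4)
        errors.append(abs(e))
    return errors

for eta in (0.01, 0.1, 0.9):             # assumed illustrative rates
    e_hist = run_bp(eta)
    print(f"eta = {eta:4.2f}: final |e| ~ {np.mean(e_hist[-20:]):.3e}")
# Small eta converges slowly; an eta large enough that
# |1 - eta * X^T X| > 1 on average makes the error grow.
```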
2.2. The robust control approach

To overcome the above problems of existing learning algorithms for FNNs, the robust control point of view is advantageous and can provide a systematic and effective solution. To this end, a novel robust control approach is proposed in this section and new robust learning algorithms are developed in what follows.

Eq. (2) can be written as

$$K(n+1) - K^*(n+1) = K(n) - K^*(n) + \Delta K(n). \tag{8}$$

Define $\bar K(n) = K(n) - K^*(n)$ and $z(n) = [e(n)\ \ \bar K^T(n)]^T$. Combining Eqs. (3) and (8) yields

$$z(n+1) = \bar A(n) z(n) + \bar B(n)\Delta K(n) + B_1 w(n). \tag{9a}$$
Additionally, define a system output as

$$\tilde y(n) = C z(n) \tag{9b}$$

where

$$\bar A(n) = \begin{bmatrix} 1 & \Delta X^T(n) \\ 0_{p,1} & I_{p,p} \end{bmatrix}, \quad \bar B(n) = \begin{bmatrix} X^T(n) + \Delta X^T(n) \\ I_{p,p} \end{bmatrix}, \quad B_1 = [1,\ 0_{1,p}]^T, \quad C = [1,\ 0_{1,p}]. \tag{9c}$$

Here $I_{p,p}$ denotes the p × p identity matrix and $0_{1,p}$ is a zero vector of dimension p. Eq. (9a) is a typical linear time-varying discrete system with disturbance term ΔX(n), and ΔK(n) is the control law to be designed. The control objective is to drive the output ỹ(n) towards zero asymptotically. Note that $\bar K(n) = K(n) - K^*(n)$ in the state vector z(n) is not known; only the output ỹ(n) is known. Therefore, this is a typical robust output feedback control problem, which is itself a topic of active interest in the control literature (Syrmos, Abdallah, Dorato, & Grigoriadis, 1997). With this formulation, robust control theory can be adopted to develop new robust learning algorithms that cope with noise and disturbance effects systematically.

It should be emphasized that, although the dynamic error model (9a–b) is established from the adaline model (1), quite a few nonlinear models can easily be transformed into an adaline model (1) (i.e., linear-in-parameters models), for example radial basis function (RBF) neural network models, the nonlinear autoregressive model with exogenous input (NARX) (Wei & Billings, 2008), and other polynomial models and Wiener models (Nelles, 2001). This is discussed further in Section 3.4.

For multi-layer FNN models, the input–output relationship can generally be written as

$$y(n) = f(K(n), X(n)) \tag{10}$$

where X(n) is the input in a compact set $U \subset R^p$, y is the output, K(n) represents all the model parameters to be tuned, of dimension q, and f(·) is a nonlinear function with at least first-order continuous derivatives with respect to X and K. Using a Taylor series around the previous estimate and previous input, and neglecting higher-order terms, Eq. (10) can be written as

$$y(n+1) = f(K(n), X(n)) + \frac{\partial f(K(n), X(n))}{\partial K^T(n)}\Delta K(n) + \frac{\partial f(K(n), X(n))}{\partial X^T(n)}\Delta X(n) + D(n). \tag{11a}$$

In (11a), D(n) is designed as a flexible term accounting for the model approximation. For a simple first-order approximation, D(n) is set to 0, while for a more accurate approximation D(n) can be designed to include some higher-order terms. For example, letting $D(n) = \Delta X^T(n)\frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)}\Delta K(n)$, it can be verified that Eq. (11a) includes Eq. (1) as a special case (f(·) linear). This raises the question of the efficiency of the model transformation: different model transformations may result in very different learning performance of the designed algorithm. This study focuses on the case D(n) = 0, with complementary discussions on the case $D(n) = \Delta X^T(n)\frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)}\Delta K(n)$.

Similarly, for the FNN with ideal parameters (i.e., $y^*(n+1) = f(K^*(n), X(n+1)) + \varepsilon(n+1)$), the Taylor series expansion gives

$$y^*(n+1) = f(K^*(n), X(n)) + \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}\Delta X(n) + \frac{\partial f(K^*(n), X(n))}{\partial K^{*T}(n)}\Delta K^*(n) + \varepsilon(n+1). \tag{11b}$$

Considering $D(n) = \Delta X^T(n)\frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)}\Delta K(n)$ or D(n) = 0, Eqs. (11a–b) yield

$$e(n+1) = e(n) + \left[\Delta X^T(n)\frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)} + \frac{\partial f(K(n), X(n))}{\partial K^T(n)}\right]\Delta K(n) + \left[\frac{\partial f(K(n), X(n))}{\partial X^T(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}\right]\Delta X(n) + w(n) \tag{12a}$$

or

$$e(n+1) = e(n) + \frac{\partial f(K(n), X(n))}{\partial K^T(n)}\Delta K(n) + \left[\frac{\partial f(K(n), X(n))}{\partial X^T(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}\right]\Delta X(n) + w(n) \tag{12b}$$

respectively, where $w(n) = -\frac{\partial f(K^*(n), X(n))}{\partial K^{*T}(n)}\Delta K^*(n) + \varepsilon(n) - \varepsilon(n+1)$. If the system is slowly time-varying, $\Delta K^*(n) \approx 0$. In this case, applying the first-order Taylor series approximation again,

$$\frac{\partial f(K(n+1), X(n+1))}{\partial X(n+1)} - \frac{\partial f(K^*(n+1), X(n+1))}{\partial X(n+1)} \approx \frac{\partial f(K(n), X(n))}{\partial X(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X(n)} + \frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)}\Delta K(n).$$

Defining

$$\hat K(n) = \frac{\partial f(K(n), X(n))}{\partial X(n)}, \qquad \hat K^*(n) = \frac{\partial f(K^*(n), X(n))}{\partial X(n)}, \qquad f_{XK}(n) = \frac{\partial^2 f(K(n), X(n))}{\partial X(n)\partial K^T(n)},$$

this can be written as

$$\hat K(n+1) - \hat K^*(n+1) \approx \hat K(n) - \hat K^*(n) + f_{XK}(n)\Delta K(n). \tag{13}$$

Now define the state vector for FNN (10) as $z(n) = [e(n),\ \hat K(n) - \hat K^*(n)]^T$. A state-space equation similar to (9a–b) can then be obtained for the case $D(n) = \Delta X^T(n) f_{XK}(n)\Delta K(n)$, with

$$\bar A(n) = \begin{bmatrix} 1 & \Delta X^T(n) \\ 0_{p,1} & I_{p,p} \end{bmatrix}, \quad \bar B(n) = \begin{bmatrix} \dfrac{\partial f(K(n), X(n))}{\partial K^T(n)} + \Delta X^T(n) f_{XK}(n) \\ f_{XK}(n) \end{bmatrix}, \quad B_1 = [1,\ 0_{1,p}]^T, \quad C = [1,\ 0_{1,p}]. \tag{14a}$$

It can be checked that the state-space realization $(\bar A(n), \bar B(n), B_1, C)$ in (9c) for the linear adaline model is included in the realization (14a) as a special case. Note that in the realization (14a) a second-order derivative $f_{XK}(n)$ must be computed. To remove the burden of computing $f_{XK}(n)$, the first-order Taylor approximation of the FNN can be used, i.e., the case D(n) = 0.
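As a concrete illustration of the error model, the short sketch below (my own illustration, not code from the paper) assembles the adaline realization (9c) for given X(n) and X(n+1) and verifies one step of the state recursion (9a) against the raw error dynamics (3).

```python
import numpy as np

# Illustrative sketch: build the adaline state-space realization (9c)
# and check one step of z(n+1) = A_bar z(n) + B_bar dK(n) + B1 w(n).
rng = np.random.default_rng(1)
p = 3
X_n   = rng.standard_normal(p)
X_np1 = rng.standard_normal(p)
dX    = X_np1 - X_n                       # Delta X(n)

A_bar = np.block([[np.ones((1, 1)), dX.reshape(1, p)],
                  [np.zeros((p, 1)), np.eye(p)]])
B_bar = np.vstack([(X_n + dX).reshape(1, p), np.eye(p)])
B1 = np.zeros((p + 1, 1)); B1[0, 0] = 1.0
C  = np.zeros((1, p + 1)); C[0, 0] = 1.0

K_bar = rng.standard_normal(p)            # K(n) - K*(n), unknown in practice
e_n  = X_n @ K_bar                        # noise-free measurable error e(n)
z_n  = np.concatenate(([e_n], K_bar))
dK   = -0.05 * X_n * e_n                  # any candidate update, BP-like here
w_n  = 0.0                                # noise-free for the check

z_np1 = A_bar @ z_n + B_bar @ dK + (B1 * w_n).ravel()
# Direct computation of e(n+1) from the error dynamics (3):
e_np1 = e_n + X_n @ dK + dX @ dK + dX @ K_bar
print(np.isclose(z_np1[0], e_np1))        # True: the realization matches (3)
print((C @ z_np1)[0], e_np1)
```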
For most FNN models it is reasonable to suppose that $\left\|\frac{\partial f(K(n), X(n))}{\partial X^T(n)}\right\|$ is upper bounded, where ∥·∥ is the Euclidean norm (Han, Su, & Stepanenko, 2001). Therefore, in the compact set U, $\frac{\partial f(K(n), X(n))}{\partial X^T(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}$ must be upper bounded. For this reason, $\left[\frac{\partial f(K(n), X(n))}{\partial X^T(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}\right]\Delta X(n)$ can be regarded as an additional ``disturbance'' and thus included in the uncertain term w(n). That is, (12b) can simply be written as

$$e(n+1) = e(n) + \frac{\partial f(K(n), X(n))}{\partial K^T(n)}\Delta K(n) + \hat w(n)$$

where $\hat w(n) = \left[\frac{\partial f(K(n), X(n))}{\partial X^T(n)} - \frac{\partial f(K^*(n), X(n))}{\partial X^T(n)}\right]\Delta X(n) + w(n)$. Together with (8), a state-space equation similar to Eqs. (9a–b), with state $z(n) = [e(n)\ \ \bar K^T(n)]^T$, can be obtained for the case D(n) = 0, with

$$\bar A(n) = \begin{bmatrix} 1 & 0 \\ 0 & I_{p,p} \end{bmatrix}, \quad \bar B(n) = \begin{bmatrix} \dfrac{\partial f(K(n), X(n))}{\partial K^T(n)} \\ I_{p,p} \end{bmatrix}, \quad B_1 = [1,\ 0_{1,p}]^T, \quad C = [1,\ 0_{1,p}]. \tag{14b}$$

It can be concluded that the robust learning problem of FNNs can be uniformly transformed into a robust output feedback control problem for a linear discrete time-varying system described by the state-space Eqs. (9a–b), formulated in terms of the model estimation error and the errors between the estimated and ideal model parameters.

3. The LMI-based robust learning (LMI–RL) algorithms

3.1. The LMI tool

Many problems that are traditionally difficult to compute can easily be solved by casting them as a linear matrix inequality (LMI) problem (Boyd, El Ghaoui, Feron, & Balakrishnan, 1994). An LMI is any constraint of the form A(x) < 0, where $x \in R^N$ is a vector of unknown scalars representing the decision variables and A(·) is a symmetric matrix-valued function affine in x, with ``< 0'' denoting negative definiteness. Based on the problem formulation above, the following result can be obtained.

Theorem 1. Suppose the bounded noise satisfies |w(n)| ≤ ρ and let 0 < α < 1 be a prescribed convergence rate. The estimation error of the adaline model (1) is asymptotically convergent to a bounded region given by $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$, with upper bounded parameters γ(n) and q1(n), if there exist a matrix $Q(n) = \begin{bmatrix} q_1(n) & 0 \\ 0 & Q_2(n) \end{bmatrix} > 0$, a vector S(n) of dimension p, and a scalar γ(n) > 0 such that

$$\begin{bmatrix} -\alpha Q(n) & 0 & Q(n)\bar A(n)^T + (\bar B(n)S(n)C)^T \\ 0 & -\gamma(n)I & B_1^T \\ \bar A(n)Q(n) + \bar B(n)S(n)C & B_1 & -Q(n) \end{bmatrix} \le 0 \tag{15a}$$

$$-2q_1(n) < \bar B_1(n)S(n) < 0 \tag{15b}$$

where $\bar B_1(n)$ is the first row of $\bar B(n)$, $q_1(n)$ is a positive scalar and $Q_2$ is a positive definite matrix of dimension p. The dynamic weight update law is given by

$$\Delta K(n) = S(n)q_1(n)^{-1}e(n). \tag{16}$$

Moreover, the bound $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$ is uniformly upper bounded.

Proof. An outline of the proof is given here; see Appendix A for details. First, for system (9a–b), define an output feedback control law $\Delta K(n) = L(n)\tilde y(n) = L(n)Cz(n) = L(n)e(n)$, and then choose a Lyapunov function to show ΔV(n) < 0. To this end, $V(n+1) \le \alpha V(n) + \gamma(n)w(n)^T w(n)$ is obtained, which yields ΔV(n) < 0 for all $V(n) > \frac{\gamma(n)\rho^2}{1-\alpha}$, where α is the convergence rate. By equivalent LMI manipulations the main results are then obtained, and finally the uniform boundedness is proved.

Remark 1. Theorem 1 provides a general linear weight update law for the error model (9a–b), obtained by solving a simple LMI problem. This is very easy to implement using the LMI optimization toolbox in Matlab, namely by finding a feasible solution of the LMI (15) in the decision variables Q(n) > 0, S(n) and γ(n). Any feasible solution of (15) yields a weight update law that drives the estimation error to a bounded region around zero, whose radius is determined by $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$. Note that the existence of a solution to the LMI (15) implies that γ(n) and q1(n) are upper bounded, so the time-varying bound of the estimation error must be bounded. Obviously, when there is no noise this region can be made sufficiently small. Otherwise, the radius of the region can be shaped explicitly by adding two more inequalities to the LMI optimization:

$$\gamma(n) < \bar\gamma \quad \text{and} \quad q_1(n) < \bar q_1 \tag{17}$$

where $\bar\gamma$ and $\bar q_1$ are given heuristically. Smaller values of $\bar\gamma$ and $\bar q_1$ imply a smaller estimation error and faster computation in the LMI optimization. Solving the LMIs (15) and (17), the resulting weight update law drives the estimation error to a small region around zero whose radius is determined by $\rho\sqrt{\bar\gamma\bar q_1/(1-\alpha)}$. Moreover, for given ρ and α, finding the minimum of the bound $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$ is equivalent to minimizing γ(n)q1(n) subject to (15ab), γ(n) > 0 and q1(n) > 0, which is in turn equivalent to minimizing q1(n) + γ(n). The latter can be solved directly using the Matlab function mincx(): find $Q(n) = \begin{bmatrix} q_1(n) & 0 \\ 0 & Q_2(n) \end{bmatrix} > 0$, a vector S(n) of dimension p and a scalar γ(n) > 0 that minimize q1(n) + γ(n) subject to (15ab).

Remark 2. Identification of the adaline model (1) can also be addressed by the well-known least mean squares (LMS) or recursive least squares (RLS) algorithms (e.g., Subudhi & Ray, 2009). Theorem 1, however, provides a novel algorithm using LMI optimization techniques from a robust control point of view. Firstly, the identification problem is cast as a robust output feedback control problem via LMI optimization, in which the noise or disturbance incurred by the terms $\Delta X(n)^T(K(n) - K^*(n))$ and $\Delta X(n)^T\Delta K(n)$ can be systematically considered. Secondly, given a proper value 0 < α < 1, the prediction error of the estimated model converges asymptotically to a small region around zero whose radius is determined by the noise level ρ. It shall be noted that the error between the estimated and ideal weights is also asymptotically convergent at the same rate. Thirdly, the objective function optimized in Theorem 1 is $z(n)^T Q^{-1} z(n) = q_1^{-1}e(n)^2 + \bar K(n)^T Q_2^{-1}\bar K(n)$, which accounts for both the output error e(n) and the parameter error $K(n) - K^*(n)$, whereas in most existing algorithms, such as RLS, LMS, BP variants (e.g., LM–BP) and EKF, the objective function is only a quadratic function of the estimation error e(n).
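For readers without Matlab's LMI toolbox, the optimization of Remark 1 can be reproduced with any semidefinite programming solver. The sketch below is my own illustration (the paper itself uses mincx()): it uses the cvxpy package to minimize q1(n) + γ(n) subject to (15a–b) and the caps (17) for one sampling instant of the adaline realization (9c). The data X(n), X(n+1), the value of α and the cap values are assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

# Sketch: solve the LMI (15a-b) with the extra caps (17), minimizing
# q1 + gamma, for one time step of the adaline realization (9c).
p = 3
rng = np.random.default_rng(2)
X_n, X_np1 = rng.standard_normal(p), rng.standard_normal(p)
dX = X_np1 - X_n

A_bar = np.block([[np.ones((1, 1)), dX[None, :]],
                  [np.zeros((p, 1)), np.eye(p)]])
B_bar = np.vstack([(X_n + dX)[None, :], np.eye(p)])
B1 = np.zeros((p + 1, 1)); B1[0, 0] = 1.0
C  = np.zeros((1, p + 1)); C[0, 0] = 1.0

alpha = 0.5                               # prescribed convergence rate
q1    = cp.Variable(pos=True)
Q2    = cp.Variable((p, p), symmetric=True)
S     = cp.Variable((p, 1))
gam   = cp.Variable(pos=True)

Q   = cp.bmat([[cp.reshape(q1, (1, 1)), np.zeros((1, p))],
               [np.zeros((p, 1)),        Q2]])
BSC = B_bar @ S @ C
M   = cp.bmat([[-alpha * Q,           np.zeros((p + 1, 1)), Q @ A_bar.T + BSC.T],
               [np.zeros((1, p + 1)), -gam * np.eye(1),     B1.T],
               [A_bar @ Q + BSC,      B1,                   -Q]])
M = 0.5 * (M + M.T)                       # symmetrize for the PSD constraint

B1row_S = (B_bar[0:1, :] @ S)[0, 0]       # scalar B_bar_1(n) S(n) in (15b)
cons = [M << 0, Q2 >> 1e-6 * np.eye(p), q1 >= 1e-6,
        B1row_S <= -1e-6, B1row_S >= -2 * q1,   # (15b), small margins
        gam <= 0.5, q1 <= 0.5]                  # caps (17), assumed values
prob = cp.Problem(cp.Minimize(q1 + gam), cons)
prob.solve(solver=cp.SCS)

L = S.value / q1.value                    # learning-rate vector: dK = L e(n)
print(prob.status, q1.value, gam.value)
```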
Corollary 1. Under the same assumptions as Theorem 1, the estimation error of the adaline model (1) or FNN (10) is asymptotically convergent to a bounded region given by $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$, with upper bounded parameters γ(n) and q1(n), if there exist a matrix $Q(n) = \begin{bmatrix} q_1(n) & 0 \\ 0 & Q_2(n) \end{bmatrix} > 0$, any matrix S(n) of dimension p × p, and a scalar γ(n) > 0 such that

$$\begin{bmatrix} -\alpha Q(n) & 0 & Q(n)\bar A(n)^T + (\bar B(n)S(n)J(n)C)^T \\ 0 & -\gamma(n)I & B_1^T \\ \bar A(n)Q(n) + \bar B(n)S(n)J(n)C & B_1 & -Q(n) \end{bmatrix} \le 0 \tag{18a}$$

$$-2q_1(n) < \bar B_1(n)S(n)J(n) < 0 \tag{18b}$$

where $\bar B_1(n)$ is the first row of $\bar B(n)$; the weight update law is then given by

$$\Delta K(n) = S(n)q_1(n)^{-1}J(n)e(n) \tag{19}$$

where J(n) = X(n) for model (1) and $J(n) = \frac{\partial f(K(n), X(n))}{\partial K(n)}$ for model (10).

Proof. Define a new ``virtual'' system output for model (9a–b) as $y_v(n) = J(n)\tilde y(n) = J(n)Cz(n)$. Regarding this virtual output as the feedback signal (the feedback signal in Theorem 1 is $\tilde y(n) = Cz(n)$), the desired weight update law takes the form $\Delta K(n) = L(n)y_v(n) = L(n)J(n)e(n)$, where L(n) is to be designed. Following the same lines as Theorem 1, the corollary is established immediately. This completes the proof.

Remark 3. The weight update law (19) can be regarded as a generalized adaptive BP algorithm obtained via an LMI technique. Comparing (19) with the BP-type algorithm (4), the adaptive learning rate in (19) is a square matrix $S(n)q_1(n)^{-1}$; the matrix S(n) can of course be chosen as a scalar s(n). The adaptive learning rates are determined by the solution of the LMI problem, taking into account the noise effect and the convergence rate of the error dynamics under the performance index $z(n)^T Q^{-1} z(n)$. To the best of our knowledge, this may be the first result in the literature addressing the adaptive learning rate of BP via LMI techniques in the context of robust control theory. Compared with the learning algorithms based on the Kalman filter and the H∞ filter in Iiguni et al. (1992) and Nishiyama and Suzuki (2001), the noise here is only assumed to be bounded, whereas it must be a white Gaussian process for the Kalman filter method or bounded-power for the H∞ filter. Note also that several Lyapunov-function-based adaptive BP algorithms were developed recently in Behera et al. (2006) and Man et al. (2006) without direct consideration of the noise effect. In particular, compared with the adaptive BP algorithm in Man et al. (2006), the new BP algorithm in Corollary 1 does not need the inverse of the activation functions.

It should be noted that the results above are developed for networks with a single output, but they can be extended to the multiple-output case by redefining the corresponding matrices $\bar A(n)$, $\bar B(n)$, $B_1$ and C; for example, q1(n) becomes an r × r matrix in Theorem 1 for r outputs. This study focuses on FNNs with a single output.

Note that the error model (9a–b) with realizations (9c) and (14a) involves ΔX(n). Because $\Delta X(n) = X(n+1) - X(n)$ may not be known in online learning, where X(n+1) is not yet available, the results in Theorem 1 and Corollary 1 for the state-space realizations (9c) or (14a) may not be applicable. For this reason, when ΔX(n) is unknown the error model (9a–b) can be regarded as a linear time-varying model with uncertainties. That is, (9a) can be rewritten as

$$z(n+1) = (A + \Delta A(n))z(n) + (B(n) + \Delta B(n))\Delta K(n) + B_1 w(n) \tag{20}$$

where, for the adaline model with state-space realization (9c),

$$A = I_{p+1,p+1}, \quad B(n) = [X(n),\ I_{p,p}]^T, \quad \Delta A(n) = \begin{bmatrix} 0 & \Delta X^T(n) \\ 0 & 0_{p,p} \end{bmatrix}, \quad \Delta B(n) = \begin{bmatrix} \Delta X^T(n) \\ 0_{p,p} \end{bmatrix},$$

and for the FNN model with state-space realization (14a),

$$A = I_{p+1,p+1}, \quad B(n) = \begin{bmatrix} \dfrac{\partial f(K(n), X(n))}{\partial K^T(n)} \\ f_{XK}(n) \end{bmatrix}, \quad \Delta A(n) = \begin{bmatrix} 0 & \Delta X^T(n) \\ 0 & 0_{p,p} \end{bmatrix}, \quad \Delta B(n) = \begin{bmatrix} \Delta X^T(n) f_{XK}(n) \\ 0_{p,p} \end{bmatrix}.$$

Assumption 1. $|[\Delta X(n)]_i| < r_i$, where $r_i$ is a known constant and $[\Delta X(n)]_i$ denotes the i-th element of ΔX(n).

Assumption 1 requires that the difference between two successive inputs is bounded. Since the input X(n) is supposed to lie in a compact set U (an assumption commonly used in the literature), the bound $r_i$ must exist and can be estimated from heuristic knowledge (a conservative estimate is $r_i = \max_n([\Delta X(n)]_i) - \min_n([\Delta X(n)]_i)$). Under Assumption 1, ΔX(n) can be written as

$$\Delta X(n) = \alpha_{p,p}(n)\cdot R_{1,p}^T \tag{21a}$$

where $R_{1,p} = [r_1\ r_2\ \cdots\ r_p]$, $\alpha_{p,p}(n) = \mathrm{diag}\{\alpha_i(n)\}$ and $|\alpha_i(n)| \le 1$. ΔA(n) and ΔB(n) can then be rewritten as

$$\Delta A(n) = EF(n) \quad \text{and} \quad \Delta B(n) = EF(n)H \tag{21b}$$

where $E = [R_{1,p}^T\ \ 0_{p,p}]^T$, $F(n) = [0_{p,1}\ \ \alpha_{p,p}(n)]$, and $H = [0_{1,p}^T\ \ I_{p,p}]^T$ for (9c) and $H = [0_{1,p}^T\ \ f_{XK}(n)^T]^T$ for (14a). Note that $F(n)F(n)^T = \alpha_{p,p}(n)\alpha_{p,p}(n)^T \le I$.

Theorem 2. Under the same assumptions as Theorem 1 together with Assumption 1, the estimation error of the adaline model (1) or the FNN model (10) with model transformation (14a) is asymptotically convergent to a bounded region given by $|e(n)| \le \rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$, with upper bounded parameters γ(n) and q1(n), if there exist a matrix $Q(n) = \begin{bmatrix} q_1(n) & 0 \\ 0 & Q_2(n) \end{bmatrix} > 0$, a matrix S(n) of appropriate dimension, and scalars γ(n) > 0 and ζ(n) > 0 such that

$$\begin{bmatrix} -\alpha Q(n) & Q(n)A^T(n) + (B(n)S(n)C)^T & Q(n) + (HS(n)C)^T & 0 \\ * & -Q(n) + \zeta(n)EE^T & 0 & B_1 \\ * & * & -\zeta(n)I & 0 \\ * & * & * & -\gamma(n) \end{bmatrix} \le 0 \tag{22a}$$

$$-2q_1(n) < B_1(n)S(n) + R_{1,p}\bar\alpha_{p,p}f_{XK}(n)S(n) < 0 \tag{22b}$$

where $B_1(n)$ is the first row of B(n), $\bar\alpha_{p,p} = \mathrm{diag}\{\bar\alpha_i\}$ and $\bar\alpha_i = \pm 1$; the weight update law is then given by

$$\Delta K(n) = S(n)q_1(n)^{-1}e(n). \tag{23}$$

If, instead of (22ab), the following LMIs are satisfied:

$$\begin{bmatrix} -\alpha Q(n) & Q(n)A^T(n) + (B(n)S(n)J(n)C)^T & Q(n) + (HS(n)J(n)C)^T & 0 \\ * & -Q(n) + \zeta(n)EE^T & 0 & B_1 \\ * & * & -\zeta(n)I & 0 \\ * & * & * & -\gamma(n) \end{bmatrix} \le 0 \tag{24a}$$

$$-2q_1(n) < B_1(n)S(n)J(n) + R_{1,p}\bar\alpha_{p,p}f_{XK}(n)S(n)J(n) < 0 \tag{24b}$$

then the weight update law is given by

$$\Delta K(n) = S(n)q_1(n)^{-1}J(n)e(n). \tag{25}$$

Proof. The proof is outlined here; see Appendix B for details. Following the same lines as the proof of Theorem 1, for the closed-loop system with control law ΔK(n) = L(n)e(n), choose the Lyapunov function $V(n) = z(n)^T P z(n)$; then V(n) is asymptotically convergent to zero for any $V(n) > \frac{\gamma\rho^2}{1-\alpha}$ if the inequality (A.7) holds (see Appendix A), which is equivalent to

$$\begin{bmatrix} -Q(n) & (\bar A(n) + \bar B(n)L(n)C)Q(n) & B_1 \\ * & -\alpha Q(n) & 0 \\ * & * & -\gamma(n)I \end{bmatrix} \le 0.$$

By equivalent LMI transformations, this inequality is equivalent to (22) or (24).

The result in Theorem 2 is thus similar to Theorem 1: the robust weight update law is obtained by solving the LMI in (22) or (24).
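The online procedure implied by Corollary 1 and Theorem 2 can be summarized as a short training loop. The sketch below is my own outline, not the author's code: solve_lmi stands for any routine (such as the cvxpy sketch above) that returns a feasible pair (s(n), q1(n)) for the relevant LMIs, and S(n) is taken as a scalar s(n), as done later in Section 4.1.

```python
import numpy as np

# Sketch (my illustration) of one LMI-RL training pass per Corollary 1:
# at each sample an LMI is solved for (s(n), q1(n)) and the update
# Delta K = s(n) q1(n)^{-1} J(n) e(n) of (19)/(25) is applied.
def lmi_rl_train(data, predict, jacobian, solve_lmi, K0):
    K = np.array(K0, dtype=float)
    errors = []
    for X, y_star in data:                 # (X(n), y*(n)) pairs
        e = predict(K, X) - y_star         # measurable error e(n)
        J = jacobian(K, X)                 # df/dK, or X(n) for the adaline
        s, q1 = solve_lmi(X, J)            # adaptive rate from the LMIs
        K = K + (s / q1) * J * e           # robust update law (19)
        errors.append(e)
    return K, errors

# For linear-in-parameters models the LMIs depend only on the input data,
# so the rates (s, q1) computed in the first epoch can be cached and
# reused in later epochs (cf. Section 3.5).
```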
3.3. Understanding of the robust control approach

The robust control approach demonstrated above provides robust convergence of learning algorithms in the presence of bounded noise and disturbance in the data. For convenience of discussion, consider the learning of FNNs in function approximation. In many of the existing optimization-based learning algorithms mentioned before, the direction of the weight evolution is guided mainly by the first or second derivatives of the objective function. When there is noise or disturbance in the data, this derivative information fails to always provide the correct update direction, since the error ``hyperplane'' is no longer smooth. In the proposed robust control approach, by contrast, the evolution direction of the model parameters is determined in the context of robust stability rather than by derivative information alone, so that noise and disturbance are coped with systematically. That is, even with noise and disturbance in the data, the robust learning algorithms can still produce a correct direction for the weight updates and finally achieve robust convergence of the estimation error to a bounded region.

In more detail, consider system (9a). For the adaline model (1), noting that $e(n) = X(n)^T(K(n) - K^*(n)) - \varepsilon(n) = X(n)^T\bar K(n) - \varepsilon(n)$ and letting the weight update law be $\Delta K(n) = L(n)e(n)$, the closed-loop system can be written as

$$\begin{bmatrix} e(n+1) \\ \bar K(n+1) \end{bmatrix} = \begin{bmatrix} 1 + X(n+1)^T L(n) & \Delta X(n)^T \\ 0 & I + L(n)X(n)^T \end{bmatrix}\begin{bmatrix} e(n) \\ \bar K(n) \end{bmatrix} + \begin{bmatrix} w(n) \\ -L(n)\varepsilon(n) \end{bmatrix}. \tag{26}$$

It can be checked that similar results hold for the general FNN model (9ab) with (14ab). The existence of the control Lyapunov function in Theorem 1 (see (A.3) in the proof in Appendix A) implies that the eigenvalues of the system matrix of the closed-loop system lie inside the unit circle. The eigenvalue corresponding to the error dynamics is $1 + X(n+1)^T L(n)$, which must be within the unit circle for any nonzero L(n) derived from Theorem 1, while the eigenvalues corresponding to the weight errors must be within or on the unit circle. This implies that, with the robust control approach, the weights K(n) are updated in such a way that the estimation error |e(n)| is driven to become smaller and smaller, subject to the condition that the norm of the current weight error $\|\bar K(n)\|$ is not larger than the previous one $\|\bar K(n-1)\|$. Clearly, in this process the derivative of the objective function is not the only factor determining the weight evolution direction. This provides a more robust and systematic way of dealing with the learning problem of FNNs (especially in function approximation and system identification) with (non-white) noisy data.

3.4. Discussions on system identification

Although the results established in Theorem 1 and Corollary 1 can be used for any FNN model ((9a–b) with (14a) or (14b)), some specific neural networks or nonlinear models can simply be cast into an adaline model (i.e., a linear-in-parameters model) without linearization of the corresponding nonlinearity, and thus can be trained with the robust learning algorithms in Theorems 1 and 2 directly. Consider a three-layer RBF neural network with p inputs and one output, whose basis functions are denoted by $g_j(X)$ (j = 1, 2, ..., M); these can be any basis functions, e.g., $\exp(-(X - \xi_j)^T\sigma_j(X - \xi_j))$, where X ∈ U is the input, $\xi_j$ is the center of the j-th basis function, and $\sigma_j = \mathrm{diag}\{\sigma_{j1}, \sigma_{j2}, \ldots\}$ represents its width. The output can then be written as

$$y(n) = \sum_{j=1}^{M} k_j g_j(X(n)) = K^T G_X(n) \tag{27}$$

where K is the weight vector of dimension M and $G_X(n)$ is the vector consisting of the $g_j(X(n))$. Note that the positions of the M centers $\xi_j$ (j = 1, ..., M) can usually be generated regularly or predefined heuristically in the input space U, and the width $\sigma_j = \mathrm{diag}\{\sigma_{j1}, \sigma_{j2}, \ldots\}$ of each basis function can also be determined heuristically. In this case, $G_X$ can be regarded as a known ``new'' input vector, and consequently the RBF model (27) becomes an adaline model. Therefore, the new algorithms established in Theorems 1 and 2 can be used directly for the estimation of the RBF model (27). If the basis functions in (27) are chosen as linear and nonlinear functions of the input and output, for example $x_1(n-1)$, $x_1(n-2)$, $x_2(n-2)$, $y(n-1)$, $x_1(n-1)y(n-1)$, $x_1(n-1)x_1(n-2)$, ..., then (27) is a NARX model (see the NARX model in Wei & Billings, 2008), where $x_i(n-1)$ is the i-th component in X(n). Similarly, the well-known Volterra series and Wiener models can also be written in the form of (27) (see the Volterra and Wiener models in Nelles, 2001).

3.5. Discussions on computation complexity

Note that (Boyd et al., 1994) the actual performance of the interior-point method used for the LMI problem is much better than the worst-case estimate of $O(M^{2.1})$ arithmetic operations, where M is the number of free scalar variables involved. To reduce the actual computation cost of the LMI optimization, heuristic constraints such as those in (17) can be imposed on some variables to reduce the search space, and the matrix Q(n) can be defined as diagonal. Moreover, it shall be noted that for the adaline model (1) the LMIs (15), (18), (22) and (24) depend only on the input data X(n). That is, when training ``linear-in-parameters'' models, including model (1), the RBF model (27), NARX and Volterra models, the learning rates obtained from the LMI optimization after the first round of training are exactly the same as those in the first round. Another potential way to reduce the computation complexity is to look for an alternative problem transformation such that the LMI optimization either is not involved in every training step or only needs to optimize some parameters of the learning algorithm. For comparison, the orders of complexity of least mean squares, common BP and first-order stochastic gradient descent (SGD) are usually O(M), while those of recursive least squares, the Kalman filter and second-order SGD are O(M²). The complexity orders of the extended H∞ filter and extended Kalman filter methods discussed in Iiguni et al. (1992) and Nishiyama and Suzuki (2001) are O(8M⁴). The orthogonal least squares algorithm for the identification of NARX models repeatedly uses the Gram–Schmidt orthogonalization process (Wei & Billings, 2008), whose complexity is O(M³).
4. Examples

To demonstrate the new LMI-based robust learning (LMI–RL) algorithms, examples are given below, including comparisons with several existing methods. Robust learning here refers generally to robustness to noise, avoidance of local minima and fast convergence in epochs. The performance of the trained neural networks is evaluated in these respects using the root-mean-square error (RMSE). Unless otherwise specified, the convergence rate α in the LMI–RL is chosen as 0.5 by default.

4.1. System identification with non-white Gaussian noisy data

To verify the robustness of the new algorithms with respect to non-white noise, a nonlinear system identification example is borrowed from Wei and Billings (2008), i.e.,

$$y(t) = u(t-1) + 0.5u(t-2) + 0.25u(t-1)u(t-2) - 0.3u^3(t-1) + \frac{1}{1 - 0.8z^{-1}}w(t). \tag{28}$$

Here $z^{-1}$ represents a delay operator (i.e., $z^{-1}y(t) = y(t-1)$) and $w(t) \sim N(0, 0.02^2)$. Clearly, the noise process is non-white and theoretically unbounded. Furthermore, the input is chosen as a low-frequency process of the form $u(t) = 1.6u(t-1) - 0.6375u(t-2) + 0.16\zeta(t)$, with $\zeta(t) \sim N(0, 1)$. This makes it difficult to identify the true system model, but represents a more practical case in system identification. A data set of 1000 input–output samples was generated from this system; the first 500 samples were used for training and the second half for model validation. Model (27) was used in this example, with the basis function vector chosen as $G_X(n) = [u(t-1), u(t-2), u^2(t-1), u(t-1)u(t-2), u^2(t-2), u^3(t-1)]^T$. The weight vector K is to be determined; its true value is $K^* = [1\ 0.5\ 0\ 0.25\ 0\ -0.3]^T$ according to (28).

To demonstrate the theoretical results, consider the robust adaptive learning algorithm in Corollary 1. If $S(n) = s(n)I_{q\times q}$, where s(n) is a scalar, the weight update law (19) simplifies further to $\Delta K(n) = s(n)q_1(n)^{-1}X(n)e(n)$, where $s(n)q_1(n)^{-1}$ is the robust adaptive learning rate determined by solving the LMI (18). Fig. 1 shows the evolution of the RMSE in both training and validation. It can be seen that after one round of training (one pass through the training data, also referred to as one epoch), the RMSE converges quickly. Most of the model parameters converge quickly to their true values, as shown in Fig. 2. Fig. 3 shows the evolution of the estimation error in training, which confirms the fast convergence of the proposed robust adaptive learning algorithm and also indicates the bounded region (Corollary 1) to which the error asymptotically and quickly converges.

Fig. 1. The RMSEs in training and validation.

Fig. 2. The evolution of different model parameters in training (one iteration corresponds to one training sample).
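The data set used in this example is easy to reproduce. The sketch below is my illustration of the description above (the random seed and zero initial conditions are assumptions): it generates the low-frequency input, the colored noise and the outputs of system (28), and forms the regressor vector G_X(n).

```python
import numpy as np

# Sketch: generate the 1000 input-output samples of system (28) with the
# low-frequency input and the colored (non-white) noise described above.
rng = np.random.default_rng(0)                   # seed is an assumption
N = 1000
u = np.zeros(N); y = np.zeros(N); v = np.zeros(N)
w = rng.normal(0.0, 0.02, N)                     # w(t) ~ N(0, 0.02^2)
zeta = rng.normal(0.0, 1.0, N)
for t in range(2, N):
    u[t] = 1.6 * u[t-1] - 0.6375 * u[t-2] + 0.16 * zeta[t]
    v[t] = 0.8 * v[t-1] + w[t]                   # colored noise 1/(1-0.8 z^-1) w(t)
    y[t] = (u[t-1] + 0.5 * u[t-2] + 0.25 * u[t-1] * u[t-2]
            - 0.3 * u[t-1]**3 + v[t])
# Regressor vector G_X(n) of the linear-in-parameters model (27):
GX = np.column_stack([u[1:-1], u[:-2], u[1:-1]**2,
                      u[1:-1] * u[:-2], u[:-2]**2, u[1:-1]**3])
targets = y[2:]                                  # aligned desired outputs
train_GX, val_GX = GX[:500], GX[500:]
train_y, val_y = targets[:500], targets[500:]
```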
Fig. 3. The estimation error and the desired convergent bound in training, without considering (17).

As discussed in Remark 1, a tighter bound can be achieved if inequality (17) is included in the optimization of the LMI (18). Fig. 4 shows the result with γ(n) < 0.5 and q1(n) < 0.5, indicating a tighter bounded region. It shall be emphasized, however, that both cases yield very similar RMSE errors and parameter estimates over different independent runs of training, indicating the robustness of the algorithm with respect to the two additional constraints.

Fig. 4. The estimation error and a much tighter desired convergent bound obtained with (17).

Fig. 5 shows the evolution of the adaptive learning rate in training, indicating large variation of the learning rate at each sampling time, with the majority of values in the range [−0.3, 0]. The periodicity of the magnitude of the learning rate over the three rounds of training can also be seen in Fig. 5: the adaptive learning rates need not be recomputed after the first round of training, because the subsequent ones are exactly the same as the corresponding rates in the first round. After 10 rounds of training, the model parameters are estimated as K = [0.9909 0.5081 −0.0059 0.2483 0.0036 −0.2981]. Note that (see Table 1) the error is very small and comparable to the result in Wei and Billings (2008), which used an orthogonal least squares method known to be effective for offline identification of NARX models.

Fig. 5. The adaptive learning rate at each sampling time in training and its histogram.

To demonstrate the advantages of the proposed method, comparisons were conducted between the new LMI–RL algorithm and other learning algorithms, including the BP algorithm in (4), the LM–BP algorithm and RLS. Importantly, as the new LMI–RL is a Lyapunov-function-based method, two recently developed Lyapunov-function-based adaptive BP algorithms, from Behera et al. (2006) and Man et al. (2006), were used for comparison; they are referred to as LFa and LFb (mu = 1.5), respectively. In Behera et al. (2006) the LFa algorithm is shown to be faster than BP and EKF, and the LFb algorithm in Man et al. (2006) is shown to be more robust and faster than the RLS and LMS algorithms. The initial weight values for all algorithms were set to the same value (i.e., $K_i(0) = 1/M$). For the same series of data, the RMSE performance after one round of training of all algorithms, as a function of their free parameters, is shown in Fig. 6, where the learning rate for BP is tested in [0.001, 0.02], LM–BP in [0.01, 0.9], LFa in [0.1, 0.99], RLS in [0.1, 0.995] and LMI–RL in [0.1, 1], with all learning-rate ranges normalized to [0, 1] for comparison. Each algorithm was then run 50 times independently with the data corrupted by different noise series. The corresponding parameters of the compared algorithms were chosen to achieve faster convergence (for example, 0.01, 0.3, 0.5, 0.8 and 0.7 for BP, LFa, LM–BP, RLS and LMI–RL, respectively). The RMSE performance of each algorithm is shown in Fig. 7. The estimates of the model parameters obtained with the different methods are summarized in Table 1, with the stopping criterion e(n−1) − e(n) < 0.0001 or a maximum of 20 epochs.

Fig. 6. The RMSE performance with respect to free parameters.

Fig. 7. The RMSE performance with different algorithms: (a) RMSE of the best single run among 50 runs; (b) RMSE of 50 independent runs.

Fig. 6 shows that the LMI–RL and LM–BP algorithms are the least sensitive to parameter changes.
Table 1. Model parameter estimation (mean, with standard deviation in parentheses where available).

| Real values | BP | LFa (Behera et al., 2006) | LFb (Man et al., 2006) | LM–BP | RLS | IFOS–SC (Wei & Billings, 2008) | LMI–RL |
|---|---|---|---|---|---|---|---|
| 1 | 0.7233 (±0.0040) | 1.0047 (±0.0286) | 0.1793 (±0.0014) | 0.9954 (±0.0412) | 0.9993 (±0.0412) | 1.0024 | 1.0006 (±0.0192) |
| 0.5 | 0.7057 (±0.0042) | 0.4952 (±0.0251) | 0.1666 (±0.0014) | 0.5033 (±0.0375) | 0.4998 (±0.0118) | 0.4969 | 0.4956 (±0.0194) |
| 0 | 0.0592 (±0.0031) | 0.0212 (±0.0963) | 0.0426 (±0.0042) | −0.0368 (±0.0969) | −0.0085 (±0.0311) | | −0.0033 (±0.0149) |
| 0.25 | 0.0760 (±0.0008) | 0.2001 (±0.1979) | 0.0454 (±0.0031) | 0.3104 (±0.1836) | 0.2297 (±0.0621) | 0.2531 | 0.2426 (±0.0126) |
| 0 | 0.1003 (±0.0031) | 0.0246 (±0.1000) | 0.0580 (±0.0042) | −0.0251 (±0.0785) | −0.0096 (±0.0334) | | 0.0022 (±0.0110) |
| −0.3 | −0.2666 (±0.0021) | −0.2954 (±0.0299) | 0.3554 (±0.0026) | −0.2940 (±0.0288) | −0.3047 (±0.0238) | −0.29998 | −0.2987 (±0.0082) |

Similar results were obtained for different independent runs with different training data (omitted to conserve space). The RMSE of the LMI–RL grows slightly with increasing α. This is consistent with the theoretical result, since α corresponds to the convergence rate and a larger α implies a slower convergence speed (see the proof of Theorem 1). Fig. 7 indicates that the LMI–RL can be the fastest in convergence. Although RLS with a forgetting factor close to 1 performs very similarly to the LMI–RL, the LMI–RL achieves a better RMSE after several rounds of training (Fig. 7(a)). Due to its poor performance in this example, LFb is not included in Figs. 6–7. Table 1 shows that the parameter estimation with the LMI–RL is the best in terms of mean errors and corresponding standard deviations, especially for the 3rd to 5th parameters. The results obtained by the LMI–RL are also comparable to, or even better than, those obtained with the IFOS–SC orthogonal least squares method of Wei and Billings (2008). Note that although the learning theory is established under a bounded-noise assumption, it still works in this example, where the noise is theoretically unbounded.

4.2. Function approximation

Consider the 2-D Gabor function shown in Fig. 8, given by

$$g(x_1, x_2) = \frac{1}{0.5\pi}\exp\left(-(x_1^2 + x_2^2)/(2(0.5)^2)\right)\cos(2\pi(x_1 + x_2)). \tag{29}$$
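The target function (29) and the training and validation sets described next can be generated as follows (my sketch; the random seed is an assumption).

```python
import numpy as np

# Sketch: the 2-D Gabor target (29) and the data sets of Section 4.2.
def gabor(x1, x2):
    return (1.0 / (0.5 * np.pi)
            * np.exp(-(x1**2 + x2**2) / (2 * 0.5**2))
            * np.cos(2 * np.pi * (x1 + x2)))

rng = np.random.default_rng(0)                   # seed is an assumption
# 10 000 random training inputs, uniform in [-0.5, 0.5]^2:
X_train = rng.uniform(-0.5, 0.5, size=(10_000, 2))
y_train = gabor(X_train[:, 0], X_train[:, 1])
# Regular validation grid with interval 0.01 in the same range:
g = np.arange(-0.5, 0.5 + 1e-9, 0.01)
G1, G2 = np.meshgrid(g, g)
X_val = np.column_stack([G1.ravel(), G2.ravel()])
y_val = gabor(X_val[:, 0], X_val[:, 1])
print(X_train.shape, X_val.shape)
```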
Fig. 8. The 2-D Gabor function.

There are two (p = 2) inputs, $X = [x_1, x_2]^T$. The basis function for the RBF model (27) is chosen as $g_j(x_1, x_2) = \exp(-(X - \xi_j)^T\sigma_j(X - \xi_j))$. Given M hidden units, the positions of the M centers $\xi_j$ (j = 1, ..., M) can be generated regularly in the input space U (for example, if M = 25, the input space can be gridded evenly with 5 rows and 5 columns). The width is $\sigma_{j1} = \sigma_{j2} = 15$ for all j. In the simulation, 10 000 training inputs X were generated randomly with uniform distribution in the range [−0.5, 0.5], while 10 000 validation data were generated regularly in the same range with an interval of 0.01. The accuracy of the estimated RBF network is evaluated by the RMSE over the validation data. The initial weight values were generated randomly and M was chosen as 25. The simple LMI–RL algorithm in Corollary 1 was used and compared with a standard 3-layer neural network with 25 hidden units trained with the BP algorithm (best learning rate 0.2; Behera et al., 2006), with gradient descent with momentum and adaptive learning rate, and with the Levenberg–Marquardt (LM) BP algorithm (Hagan, Demuth, & Beale, 1996b). The results are summarized in Table 2 and show that the LMI–RL algorithm is much faster than these commonly used BP-type algorithms and provides better error convergence. Moreover, compared with the newly developed Lyapunov-based method in Behera et al. (2006), which was shown to outperform the BP and EKF algorithms, the new LMI–RL algorithm uses only 25 centers for the same RBF network yet achieves better performance: the Lyapunov-function-based method in Behera et al. (2006) used 40 centers and obtained an RMSE of 0.01868 for the same function approximation problem.

Table 2. Performance results for function approximation.

| Methods | Averaged RMSE after 1 round of training | Averaged rounds of training to achieve an RMSE less than 0.02 |
|---|---|---|
| BP | 0.6042 (±0.0638) | >1000 (0.0977 (±0.0224) for 1000 runs) |
| BP with momentum and adaptive learning rate (Hagan et al., 1996b) | 0.5301 (±0.0347) | >200 (0.0694 (±0.0114) for 200 runs) |
| Levenberg–Marquardt BP | 0.3809 (±0.0463) | 10 (±2) |
| LMI–RL | 0.01158 (±0.0091) | 1 |

4.3. XOR problem

A 2-4-1 network with tangent sigmoid transfer functions in the hidden-layer neurons is adopted for this benchmark problem. The network is trained from random initial weights, and training is terminated when the RMSE per epoch reaches 0.0001. The simulation results are given in Table 3. Several existing BP-type algorithms (especially those with adaptive learning rates) (Hagan et al., 1996b) are compared, with their learning rates (or initial learning rates) all set to 0.85 for faster convergence. Table 3 shows that the new LMI-based robust learning algorithm needs the smallest number of epochs on average to reach the training objective. Importantly, the standard deviation of the epoch count for the LMI–RL method is clearly much smaller, indicating that the convergence of the new robust learning algorithm is much more robust to the initial weights.

Table 3. Performance for the XOR problem.

| Algorithm | BP | BP with adaptive learning rate (Hagan et al., 1996b) | BP with momentum and adaptive learning rate | LF (Behera et al., 2006) | LM–BP | LMI–RL |
|---|---|---|---|---|---|---|
| Averaged epochs | 347 | 216 | 413 | 235 | 145 | 129 |
| STD of epochs | 103 | 108 | 104 | 100 | 39 | 32 |
| Max epochs | 680 | 450 | 650 | 519 | 260 | 171 |
| Min epochs | 190 | 85 | 260 | 38 | 41 | 57 |
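For reference, a 2-4-1 XOR network of the kind used above can be set up as in the sketch below. This is my own baseline illustration with a plain fixed-rate BP update and a linear output unit (an assumption; the paper does not specify the output activation); the LMI–RL algorithm would supply the adaptive rate in place of the fixed η.

```python
import numpy as np

# Sketch: a 2-4-1 network with tanh hidden units for the XOR benchmark,
# trained with plain batch BP; swap the fixed eta for the LMI-RL rate.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])
W1 = rng.normal(0, 0.5, (4, 2)); b1 = np.zeros(4)     # hidden layer
W2 = rng.normal(0, 0.5, 4);      b2 = 0.0             # output layer
eta = 0.85                                            # rate used in Sec. 4.3

for epoch in range(10_000):
    H = np.tanh(X @ W1.T + b1)          # hidden activations, shape (4, 4)
    y = H @ W2 + b2                     # linear output (assumed)
    e = y - T                           # per-pattern errors
    rmse = np.sqrt(np.mean(e**2))
    if rmse < 1e-4:
        break
    # Gradients of 0.5 * sum(e^2), i.e. standard back-propagation:
    gW2 = H.T @ e; gb2 = e.sum()
    dH = np.outer(e, W2) * (1 - H**2)   # back-propagated through tanh
    gW1 = dH.T @ X; gb1 = dH.sum(axis=0)
    W2 -= eta * gW2 / len(X); b2 -= eta * gb2 / len(X)
    W1 -= eta * gW1 / len(X); b1 -= eta * gb1 / len(X)
print(epoch, rmse)
```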
4.4. Tracking a non-stationary process

To check the capability of the LMI-based robust learning algorithm in tracking a non-stationary process, an example is borrowed from Sutton (1992). Consider the classical problem of incrementally estimating the parameter vector K(t) of the linear scalar system

$$y(t) = K(t)^T u(t) + \eta(t) \tag{30}$$

from given input–output data y(t) and u(t), where η(t) is white Gaussian noise with unit variance and K(t) varies slowly over time according to K(t+1) = K(t) + d(t), where d(t) is a random vector whose first 5 entries are independently uniformly distributed in [0, 1] and whose remaining entries are 0. The dimension of the system is 10 and the input signals $u_i(t)$ are independent Gaussian random variables with unit variance. The initial parameter is $K(0) = (0, \ldots, 0)^T$. In this experiment, each algorithm was run for 6000 steps to get past any transients and then for another 4000 steps to evaluate the RMSE, which is assessed with the best parameter value of each algorithm indicated in Example 1 (see Fig. 6). The results are summarized in Table 4, where the learning rate for BP is 0.1, for LFa (Behera et al., 2006) is 0.3, for RLS is 0.995 and for LMI–RL is still 0.5.

Table 4. Performance for tracking a non-stationary process.

| | BP | LFa (Behera et al., 2006) | RLS | LMI–RL |
|---|---|---|---|---|
| RMSE | 112.6 | 38.5 | 206.5 | 26.3 |

The simulation results show that the LMI–RL method can still achieve better performance in tracking the given non-stationary process than the other three methods (BP is the classical gradient descent; RLS is known to be good at tracking changing dynamics; and LFa is a recently developed Lyapunov-function-based adaptive batch method). More examples could be presented but are omitted for reasons of paper length.

In summary, the examples show that the LMI–RL is more robust in training, less sensitive to parameter settings, and converges faster than many existing algorithms. This demonstrates the effectiveness and potential of the new robust control approach for training FNNs when dealing with noisy data in system identification and control. Note that there is only one free, insensitive parameter to tune in the LMI–RL. This removes the troublesome burden of searching for a good parameter setting, as required in many existing learning algorithms such as BP, LMS, LFa, RLS, the H∞ filter method (Nishiyama & Suzuki, 2001) and some second-order stochastic gradient descent methods. The robustness of the parameter setting can be very helpful in online applications where the system input, dynamic properties or working equilibrium change or switch quickly due to task modification or strong disturbances.
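The benchmark (30) can be reproduced as follows (my sketch; the seed is an assumption, and plain gradient descent stands in for the compared algorithms).

```python
import numpy as np

# Sketch: the non-stationary tracking benchmark (30) of Section 4.4,
# tracked by plain gradient descent (BP) as a placeholder; the LMI-RL,
# LFa or RLS update would replace the marked line.
rng = np.random.default_rng(0)
dim, transient, evaluate = 10, 6000, 4000
K_true = np.zeros(dim)
K_est = np.zeros(dim)
eta = 0.1                                   # BP rate used in Table 4
sq_err = []
for t in range(transient + evaluate):
    d = np.zeros(dim)
    d[:5] = rng.uniform(0.0, 1.0, 5)        # slow drift of the true weights
    K_true = K_true + d
    u = rng.normal(0.0, 1.0, dim)           # unit-variance Gaussian inputs
    y = K_true @ u + rng.normal()           # system (30), unit-variance noise
    e = K_est @ u - y                       # prediction error
    K_est = K_est - eta * u * e             # BP update; swap in LMI-RL here
    if t >= transient:
        sq_err.append(e**2)
print("RMSE over last 4000 steps:", np.sqrt(np.mean(sq_err)))
```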
5. Conclusions

A novel robust control approach has been proposed to address the learning problems of FNNs. The advantages of the new method are demonstrated both by theoretical analysis and by simulation studies. The novelty of the approach lies in two aspects: (a) it casts the learning problem of an FNN into a robust output feedback control problem for a discrete, time-varying linear system with bounded disturbance, in which both the output estimation error and the weight estimation error can be minimized through a quadratic Lyapunov function; and (b) the LMI optimization technique is introduced to solve the adaptive learning problem of FNNs, so that the learning rate of the new algorithms can be optimized to be more robust to noise, disturbance effects and initial weight/parameter settings.

The new robust control approach provides a novel insight into the design of learning and identification algorithms based on artificial neural networks and on some commonly used nonlinear polynomial models (e.g., NARMAX and Wiener/Hammerstein models). Through the problem transformation, the present study has demonstrated that these traditional learning and identification problems can be effectively addressed from a robust control point of view. This implies that many existing, well-developed robust control methods can potentially be applied to this deliberately cast problem for different purposes such as robustness, fast convergence and unbiased estimation. Further studies will focus on improving the computation efficiency and robustness of the new LMI-based robust learning algorithm, and on extending the new approach to the identification of nonlinear block-oriented models such as NARMAX and Wiener/Hammerstein models.

Note that the learning method of this study adopts a robust control approach, which needs a quadratic loss function and requires a boundedness assumption on the noise. Some regularization-based methods (DeLisi, Fan, Kim, Kon, & Louise, 2010) are known to be able to deal with non-quadratic loss functions, and some (semi-)batch training algorithms (Behera et al., 2006; Do, Le, & Foo, 2009) can address unbounded noise with known stochastic properties. Compared with these methods, and also with the methods in Behera et al. (2006), Iiguni et al. (1992), Man et al. (2006) and Nishiyama and Suzuki (2001), our method provides clear convergence information on the boundedness of the steady-state error under Assumption 1; the convergence is globally asymptotic; the weights of the network remain bounded; and the system may be time-varying. The methods mentioned cannot achieve some or all of these properties. For example, the method in Nishiyama and Suzuki (2001): (1) addresses only time-invariant systems; (2) restricts the activation function to sigmoid functions; (3) does not guarantee bounded weights, since only the estimation error is considered; and (4) involves the computation of a matrix inverse. Further studies will also investigate unbounded noise and non-quadratic loss functions, and more effective and computationally efficient algorithms using the robust control approach for neural network learning and nonlinear system identification problems.
Acknowledgments

The author gratefully acknowledges the constructive and helpful comments and suggestions of the anonymous reviewers, and the support of the GRF project of the Hong Kong RGC (Ref. 517810) and of the Departmental General Research Funds and Internal Competitive Research Grants (B-Q24E, G-YJ13, A-PL07, A-PL74) of the Hong Kong Polytechnic University.
The following lemma will be used in the derivation of the theorem. Lemma 1 (Schur Complement Boyd et al., 1994, p. 7). Given any matrices M , L and Q of appropriate dimensions, where M = M T , L = M L
LT −Q −1
< 0.
Since only the output state is known for system (9a–b), an output feedback control law is defined as
1K (n) = L(n)˜y(n) = L(n)Cz (n) = L(n)e(n).
(A.1)
w(n)
T T
,
Θ (A¯ (n) + B¯ (n)L(n)C )T PB1 , Ξ= T ¯ B1 P (A(n) + B¯ (n)L(n)C ) −γ (n)I + BT1 PB1 Θ = (A¯ (n) − I + B¯ (n)L(n)C )T P (A¯ (n) + B¯ (n)L(n)C ) − α P .
Obviously, if
Ξ ≤0
(A.5)
then
1V (n) + (1 − α)z (n)T Pz (n) − γ (n)w(n)T w(n) = V (n + 1) − α V (n) − γ (n)w(n)T w(n) ≤ 0 which further gives V (n + 1) ≤ α V (n) + γ (n)w(n)T w(n). Thus 1V (n) < 0 (V (n) is asymptotically convergent to zero) if V (n) > By Lemma 1, (A.5) is equivalent to
−α P ∗ ∗
(A¯ (n) + B¯ (n)L(n)C )T P
0
BT1 P −P
−γ I ∗
≤ 0.
(A.6)
Pre- and post-multiplying with diag{Q (n), I , Q (n)}, where Q(n) is the inverse of the matrix P at sampling time n, (A.6) is equivalent to
−α Q (n) ∗ ∗
Appendix A. Proof of Theorem 1
LT and Q > 0, then M + LT QL < 0 if and only if
where Zn = z (n)T
γ (n)ρ 2 . 1−α
Acknowledgments
43
0
−γ (n)I ∗
Q (n)(A¯ (n) + B¯ (n)L(n)C )T ≤ 0. BT1 −Q (n)
If the matrix Q (n) is defined as Q (n) =
q1 (n) 0
0
(A.7)
Q2 (n)
, where
q1 (n) is a positive scalar and Q2 (n) a positive definite matrix, q1 (n) T T then Q (n)C L(n) = 0 L(n)T = q1 (n)C T L(n)T . Therefore, if p,1
define S (n) = L(n)q1 (n), inequality (A.7) is inequality (15), and the controllaw L(n) = S (n)/q1 (n). Because of P = Q −1 = −1 q1
0
0 −1 Q2
, V (n) = z (n)T Pz (n) = e(n)2 /q1 + K¯ (n)T Q2−1 K¯ (n).
Recalling that V (n) is asymptotically convergent to zero for any γ (n)ρ 2 , e(n)2 must be asymptotically convergent for |e(n)| > 1−α γ (n)q1 (n) . Therefore, considering the error state at all the sampling 1−α
V (n) >
The closed-loop system of (9a–b) is
ρ
z (n + 1) = (A¯ (n) + B¯ (n)L(n)C )z (n) + B1 w(n).
time, the error will asymptotically converge to the bounded region √ given by |e(n)| ≤ ρ γ (n)q1 (n)/(1 − α). Note that the existence of the solution for the √ LMI (A.7) implies that γ (n) and q1 (n) must be bounded. Therefore, ρ γ (n)q1 (n)/(1 − α) must be upper bounded.
The uniform boundedness of $\rho\sqrt{\gamma(n)q_1(n)/(1-\alpha)}$ can be further proved as follows. Consider the adaline model (9a–c); similar results hold for (9a–b) with (14a) or (14b). Noting that $e(n) = X(n)^T(K(n)-K^*(n)) - \varepsilon(n) = X(n)^T\bar{K}(n) - \varepsilon(n)$, the closed-loop system can be written as
\[
\begin{bmatrix} e(n+1) \\ \bar{K}(n+1) \end{bmatrix}
= \begin{bmatrix} 1 + X(n+1)^T L(n) & \Delta X(n)^T \\ 0 & I + L(n)X(n)^T \end{bmatrix}
\begin{bmatrix} e(n) \\ \bar{K}(n) \end{bmatrix}
+ \begin{bmatrix} w(n) \\ -L(n)\varepsilon(n) \end{bmatrix}. \tag{A.8}
\]
The existence of the control Lyapunov function (A.3) implies that the eigenvalues of the system matrix of the closed-loop system lie within the unit circle. Therefore, from (A.8), $\|z(n)\|$ must be bounded at all times due to the bounded noise. In particular, the closed-loop error dynamics is now
\[
e(n+1) = [1 + X(n+1)^T L(n)]e(n) + \Delta X(n)^T \bar{K}(n) + w(n) = \beta_n e(n) + \Gamma_n \tag{A.9}
\]
where $\beta_n = 1 + X(n+1)^T L(n)$ and $\Gamma_n = \Delta X(n)^T\bar{K}(n) + w(n)$. Note that $\beta_n$ is one of the eigenvalues, and Eq. (15b) implies that $|\beta_n| < 1$. The boundedness of $\Delta X(n)$, $\bar{K}(n)$ and $w(n)$ implies $\|\Gamma_n\| \le \max_i\{r_i\}\|\bar{K}(1)\| + \rho$, where $|[\Delta X(n)]_i| < r_i$. By mathematical induction, (A.9) gives
\[
e(n+1) = \beta_n\beta_{n-1}\cdots\beta_1 e(1) + \Gamma_n + \beta_n\Gamma_{n-1} + \beta_n\beta_{n-1}\Gamma_{n-2} + \cdots + \beta_n\beta_{n-1}\cdots\beta_2\Gamma_1. \tag{A.10}
\]
Let $\beta = \max_i(|\beta_i|)$. Note that $0 < \beta < 1$, $|\beta_n\beta_{n-1}\cdots\beta_1| \le \beta^n$, and
\[
|\Gamma_n + \beta_n\Gamma_{n-1} + \beta_n\beta_{n-1}\Gamma_{n-2} + \cdots + \beta_n\beta_{n-1}\cdots\beta_2\Gamma_1|
\le \big(\max_i\{r_i\}\|\bar{K}(1)\| + \rho\big)\big(1+\beta+\cdots+\beta^n\big)
\xrightarrow[n\to\infty]{} \frac{\max_i\{r_i\}\|\bar{K}(1)\|+\rho}{1-\beta}.
\]
It can then be obtained from (A.10) that
\[
|e(n+1)| \le \beta^n|e(1)| + \big(\max_i\{r_i\}\|\bar{K}(1)\| + \rho\big)\big(1+\beta+\cdots+\beta^n\big)
\xrightarrow[n\to\infty]{} \frac{\max_i\{r_i\}\|\bar{K}(1)\|+\rho}{1-\beta}.
\]
Therefore the estimation error will uniformly converge to an upper bounded region. This completes the proof.
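The geometric-series bound in (A.10) is easy to confirm numerically. The sketch below (arbitrary illustrative constants, not from the paper) iterates the error dynamics (A.9) with $|\beta_n| \le \beta < 1$ and $|\Gamma_n|$ bounded, and checks that $|e(n)|$ settles below the predicted ultimate bound.

import numpy as np

rng = np.random.default_rng(3)
beta_max, Gamma_max = 0.8, 0.5       # |beta_n| <= beta_max < 1, |Gamma_n| <= Gamma_max
bound = Gamma_max / (1 - beta_max)   # ultimate bound from the geometric series (A.10)

e, history = 10.0, []                # large initial error e(1)
for n in range(300):
    beta_n = rng.uniform(-beta_max, beta_max)      # time-varying contraction factor
    Gamma_n = rng.uniform(-Gamma_max, Gamma_max)   # bounded perturbation
    e = beta_n * e + Gamma_n                       # error dynamics (A.9)
    history.append(abs(e))
print(max(history[-100:]) <= bound + 1e-9)         # True: |e(n)| within the bound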
Appendix B. Proof of Theorem 2

The following lemma will be used in the derivation of the theorem.

Lemma 2 (Xie & Soh, 1995). Given any matrices $W$, $M$, $N$, $F$, $\Psi$ and $\Phi$ of appropriate dimensions, satisfying $W = W^T$, $\Phi > 0$ and $F^T F \le I$, then $W + [\Psi + MFN]^T \Phi [\Psi + MFN] \le 0$ if and only if there exists a scalar $\zeta > 0$ such that
\[
\begin{bmatrix} -\Phi^{-1} + \zeta M M^T & \Psi \\ \Psi^T & W + \zeta^{-1} N^T N \end{bmatrix} \le 0.
\]

Following the same line as in the proof of Theorem 1, for the closed-loop system with the control law $\Delta K(n) = L(n)e(n)$, choose the Lyapunov function $V(n) = z(n)^T P z(n)$; then $V(n)$ is asymptotically convergent to zero for any $V(n) > \gamma(n)\rho^2/(1-\alpha)$ if the inequality (A.7) holds, which is equivalent to
\[
\begin{bmatrix} -Q(n) & B_1 & (\bar{A}(n)+\bar{B}(n)L(n)C)Q(n) \\ * & -\gamma(n)I & 0 \\ * & * & -\alpha Q(n) \end{bmatrix} \le 0. \tag{B.1}
\]
By Lemma 1 and using (21b), (B.1) is equivalent to
\[
-Q(n) + \gamma^{-1}(n)B_1 B_1^T + \alpha^{-1}[A(n)+B(n)L(n)C + EF(n)(I+HL(n)C)]\,Q(n)\,[A(n)+B(n)L(n)C + EF(n)(I+HL(n)C)]^T \le 0 \tag{B.2}
\]
which can be written as
\[
W + [\Psi + M F^T(n) N]^T \Phi [\Psi + M F^T(n) N] \le 0 \tag{B.3}
\]
where $\Psi = (A(n)+B(n)L(n)C)^T$, $M = (I+HL(n)C)^T$, $N = E^T$, $\Phi = \alpha^{-1}Q(n)$ and $W = -Q(n) + \gamma^{-1}(n)B_1 B_1^T$. Noting that $F(n)F^T(n) \le I$ and applying Lemma 2 (with $\zeta^{-1}(n)$ in place of $\zeta$, which is equivalent since $\zeta \mapsto \zeta^{-1}$ is a bijection on positive scalars), (B.3) holds if and only if there exists $\zeta(n) > 0$ such that
\[
\begin{bmatrix} -\alpha Q^{-1}(n) + \zeta^{-1}(n)(I+HL(n)C)^T(I+HL(n)C) & (A(n)+B(n)L(n)C)^T \\ * & -Q(n) + \gamma^{-1}(n)B_1 B_1^T + \zeta(n) E E^T \end{bmatrix} \le 0.
\]
Using Lemma 1 to expand the terms $\zeta^{-1}(n)(I+HL(n)C)^T(I+HL(n)C)$ and $\gamma^{-1}(n)B_1 B_1^T$, this is equivalent to
\[
\begin{bmatrix} -\alpha Q^{-1}(n) & (A(n)+B(n)L(n)C)^T & (I+HL(n)C)^T & 0 \\ * & -Q(n)+\zeta(n)EE^T & 0 & B_1 \\ * & * & -\zeta(n)I & 0 \\ * & * & * & -\gamma(n)I \end{bmatrix} \le 0
\]
which, by pre- and post-multiplying with diag$\{Q(n), I, I, I\}$, is equivalent to
\[
\begin{bmatrix} -\alpha Q(n) & Q(n)A^T(n)+Q(n)(B(n)L(n)C)^T & Q(n)+Q(n)(HL(n)C)^T & 0 \\ * & -Q(n)+\zeta(n)EE^T & 0 & B_1 \\ * & * & -\zeta(n)I & 0 \\ * & * & * & -\gamma(n)I \end{bmatrix} \le 0.
\]
Note that $CQ(n) = q_1(n)C$. Therefore, inequality (22) is obtained. Similar to Corollary 1, if the output matrix $C$ in (9b) is defined by $X(n)C$, then the inequality above becomes (24) and the weight update law is given by (25). This completes the proof.
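As with Lemma 1, the forward direction of Lemma 2 can be spot-checked numerically. The sketch below constructs an instance for which the block inequality of Lemma 2 holds (the particular choices of $\Phi$, $\zeta$ and $W$ are arbitrary, made only to guarantee feasibility) and then verifies $W + (\Psi+MFN)^T\Phi(\Psi+MFN) \le 0$ for many random $F$ with $F^TF \le I$.

import numpy as np

rng = np.random.default_rng(4)
m = k = p = q = 3
Psi = rng.standard_normal((m, k))
M = rng.standard_normal((m, p)); M /= np.linalg.norm(M, 2)   # normalize so ||M|| = 1
N = rng.standard_normal((q, k))
Phi, zeta = np.eye(m), 0.5        # then -Phi^{-1} + zeta*M*M^T < 0

# Choose W = W^T negative enough that the block inequality of Lemma 2 holds
S = np.linalg.inv(np.linalg.inv(Phi) - zeta * M @ M.T)
W = -(Psi.T @ S @ Psi + N.T @ N / zeta) - 0.1 * np.eye(k)

block = np.block([[-np.linalg.inv(Phi) + zeta * M @ M.T, Psi],
                  [Psi.T, W + N.T @ N / zeta]])
assert np.linalg.eigvalsh(block).max() <= 1e-9               # block inequality holds

for _ in range(1000):
    F = rng.standard_normal((p, q))
    F /= max(1.0, np.linalg.norm(F, 2))                      # scale so that F^T F <= I
    G = Psi + M @ F @ N
    assert np.linalg.eigvalsh(W + G.T @ Phi @ G).max() <= 1e-9
print("Lemma 2 forward implication verified on 1000 random samples")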
References

Behera, L., Kumar, S., & Patnaik, A. (2006). On adaptive learning rate that guarantees convergence in feedforward networks. IEEE Transactions on Neural Networks, 17(5).
Bilski, J., & Rutkowski, L. (1998). A fast training algorithm for neural networks. IEEE Transactions on Circuits and Systems Part II: Analog and Digital Signal Processing, 45(6), 749–753.
Bordes, A., Bottou, L., & Gallinari, P. (2009). SGD-QN: careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 1737–1754.
Boyd, S., El Ghaoui, L., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. Philadelphia: Society for Industrial and Applied Mathematics.
Charalambous, C. (1992). Conjugate gradient algorithm for efficient training of artificial neural networks. IEE Proceedings G (Circuits, Devices and Systems), 139(3), 301–310.
Chen, S., Hong, X., Harris, C. J., & Sharkey, P. M. (2004). Sparse modeling using orthogonal forward regression with PRESS statistic and regularization. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 34(2), 898–911.
DeLisi, C., Fan, Y., Kim, S., Kon, M., & Raphael, L. (2010). Regularization techniques for machine learning on graphs and networks with biological applications. Communications in Mathematical Analysis, 8(3), 136–145.
Do, C. B., Le, Q. V., & Foo, C.-S. (2009). Proximal regularization for online and batch learning. In Proceedings of the 26th annual international conference on machine learning (ICML '09) (pp. 1–8). ACM Press.
Hagan, M. T., Demuth, H. B., & Beale, M. (1996a). Neural network design. Boston: PWS Publishing Co.
Hagan, M. T., Demuth, H. B., & Beale, M. H. (1996b). Neural network design. Boston, MA: PWS Publishing.
Han, H., Su, C. Y., & Stepanenko, Y. (2001). Adaptive control of a class of nonlinear systems with nonlinearly parameterized fuzzy approximators. IEEE Transactions on Fuzzy Systems, 9(2), 315–323.
Haykin, S. (1999). Neural networks: a comprehensive foundation. Englewood Cliffs, NJ: Prentice-Hall.
Iiguni, Y., Sakai, H., & Tokumaru, H. (1992). A real-time learning algorithm for a multilayered neural network based on extended Kalman filter. IEEE Transactions on Signal Processing, 40(4), 959–966.
Jarabo-Amores, M.-P., Rosa-Zurera, M., Gil-Pita, R., & Lopez-Ferreras, F. (2009). Study of two error functions to approximate the Neyman–Pearson detector using supervised learning machines. IEEE Transactions on Signal Processing, 57(11), 4175–4181.
Jing, X. J. (2011). An H∞ control approach to robust learning of feedforward neural networks. Neural Networks, 24(7), 759–766.
Kuschewski, J. G., Hui, S., & Zak, S. H. (1993). Application of feedforward neural networks to dynamical system identification and control. IEEE Transactions on Control Systems Technology, 1(1), 37–49.
Lera, G., & Pinzolas, M. (2002). Neighborhood based Levenberg–Marquardt algorithm for neural network training. IEEE Transactions on Neural Networks, 13(5), 1200–1203.
Man, Z., Wu, H. R., Liu, S., & Yue, X. (2006). A new adaptive backpropagation algorithm based on Lyapunov stability theory for neural networks. IEEE Transactions on Neural Networks, 17(6), 1580–1591.
Nelles, O. (2001). Nonlinear system identification. New York: Springer.
Nishiyama, K., & Suzuki, K. (2001). H∞-learning of layered neural networks. IEEE Transactions on Neural Networks, 12(6).
Osowski, S., Bojarczak, P., & Stodolski, M. (1996). Fast second order learning algorithm for feedforward multilayer neural network and its applications. Neural Networks, 9(9), 1583–1596.
Phansalkar, V. V., & Sastry, P. S. (1994). Analysis of the back-propagation algorithm with momentum. IEEE Transactions on Neural Networks, 5(3), 505–506.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In Proceedings of the IEEE international conference on neural networks: Vol. 1 (pp. 586–591). San Francisco, CA: IEEE.
Sarkar, D. (1995). Methods to speed up error back propagation learning algorithm. ACM Computing Surveys, 27(4), 519–544.
Shao, H., & Wu, W. (2006). Convergence of BP algorithm with variable learning rates for FNN training. In Proceedings of the fifth Mexican international conference on artificial intelligence. IEEE Computer Society (catalog number 0-7695-27221/06).
Subudhi, B., & Ray, P. K. (2009). Estimation of power system harmonics using hybrid RLS-Adaline and KF-Adaline algorithms. In TENCON 2009—2009 IEEE region 10 conference (pp. 1–6).
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale workshop on adaptive and learning systems (pp. 161–166).
Syrmos, V. L., Abdallah, C. T., Dorato, P., & Grigoriadis, K. (1997). Static output feedback—a survey. Automatica, 33(2), 125–137.
Wei, H. L., & Billings, S. A. (2008). Model structure selection using an integrated forward orthogonal search algorithm assisted by squared correlation and mutual information. International Journal of Modelling, Identification and Control, 3(4), 341–356.
Wei, H. L., Billings, S. A., Zhao, Y. F., & Guo, L. Z. (2010). An adaptive wavelet neural network for spatio-temporal system identification. Neural Networks, 23(10), 1286–1299.
Xie, L., & Soh, Y. C. (1995). Guaranteed cost control of uncertain discrete-time systems. Control Theory and Advanced Technology, 10(4), 1235–1251.