INTERNATIONAL JOURNAL OF ADAPTIVE CONTROL AND SIGNAL PROCESSING Int. J. Adapt. Control Signal Process. (2013) Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/acs.2432
Adaptive neural network-based optimal control of nonlinear continuous-time systems in strict-feedback form

H. Zargarzadeh¹*†, T. Dierks² and S. Jagannathan¹

¹Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, USA
²DRS Sustainment Systems, Inc., 201 Evans Lane, St Louis, MO 63121, USA
SUMMARY

This paper focuses on neural network (NN)-based optimal control of nonlinear continuous-time systems in strict-feedback form with known system dynamics, using an adaptive backstepping approach. A single NN-based adaptive scheme is designed to learn the solution of the infinite horizon continuous-time Hamilton–Jacobi–Bellman (HJB) equation, while the corresponding optimal control input that minimizes the HJB equation is calculated in a forward-in-time manner without using value and policy iterations. First, the optimal control problem is solved for a generic multi-input and multi-output nonlinear system with a state feedback approach. Then the approach is extended to single-input and single-output nonlinear systems by using output feedback via a nonlinear observer. Lyapunov techniques are used to show that all signals are uniformly ultimately bounded and that the approximated control signals approach the optimal control inputs with small bounded error for both the state and output feedback-based controller designs. In the absence of NN reconstruction errors, asymptotic convergence to the optimal control is demonstrated. Finally, simulation examples are provided to validate the theoretical results. Copyright © 2013 John Wiley & Sons, Ltd.

Received 13 November 2011; Revised 2 July 2013; Accepted 17 August 2013

KEY WORDS:
online nonlinear optimal control; neural network control; output feedback control; strict-feedback systems
1. INTRODUCTION

During the past few decades, the stabilization of nonlinear systems has been tackled in a variety of ways [1]. One objective of nonlinear stabilization is to design an adaptive controller that can accommodate structured uncertainties [2]. In addition to stabilization, it is often desired that the control law minimize a predefined performance measure [3, 4] so that optimal control laws can be generated. It is well known that the optimal control of linear systems is obtained by solving the Riccati equation [3] in a backward-in-time manner when the system dynamics are known. In contrast, the optimal control of nonlinear continuous- or discrete-time systems is a much more challenging task, even when the system dynamics are known, as it often requires solving the nonlinear Hamilton–Jacobi–Bellman (HJB) equation, which has no closed-form solution [3, 4]. Moreover, when the system dynamics are uncertain, solving the HJB equation is a major challenge, similar to solving the Riccati equation for uncertain linear systems. Thus, the optimal control of uncertain nonlinear systems is an important and difficult problem. For uncertain nonlinear systems, adaptive control methods can still be employed, not only to estimate the solution to the HJB equation but also to approximate the optimal control law [5, 6]. As a first step, researchers utilized policy iterations and Q-learning-based adaptive control schemes in [6, 7], respectively, for uncertain linear systems to derive the optimal control laws.

*Correspondence to: H. Zargarzadeh, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, USA. †E-mail: [email protected]
On the other hand, the optimal control of nonlinear continuous-time systems in affine form with known system dynamics was initially addressed with the aim of minimizing a predefined index functional by finding an approximate solution to the HJB equation [4, 6] using an iterative, offline, backward-in-time [4] methodology. Subsequently, adaptive critic designs (ACDs) that utilize a forward-in-time yet iterative methodology were proposed [8] for nonlinear systems with uncertain dynamics. The central theme of the ACD techniques is that the optimal control law and the HJB function are approximated by online parametric structures, such as neural networks (NNs), in a forward-in-time manner by using policy and value iterations [6, 8–10]. Because of the online parametric approximators, the ACD techniques [11] relax the need for full knowledge of the system dynamics. Recently, in [12], a new framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic of the ACD. While these methods [6, 8, 9] render stability, the number of iterations needed within a sampling interval for convergence is significant, and therefore such iterative schemes are not preferred for hardware implementation [11]. In addition, it has been found that an inadequate number of policy or value iterations [11] can lead to instability when using these online ACD methods. In some cases [13], mathematical proofs of convergence are not offered [6], and an initial stabilizing control is needed [8]. Additionally, ACD schemes in general require two NNs, one for approximating the cost function and the other for estimating the control input, which may be considered computationally intensive. Therefore, in [5, 14], online approximator (OLA) based ACD techniques were introduced for nonlinear continuous-time systems in affine and nonaffine forms without using value and policy iterations.

All the available ACD techniques [5, 6, 8, 9, 13] in the literature address nonlinear continuous- or discrete-time systems [11] in affine form under the assumption that the states are measurable. In addition, an initial stabilizing control input is needed in order to learn the cost or value function. On the other hand, nonlinear systems in strict-feedback form [15, 16] are an important class of nonlinear systems structurally formed by cascading affine nonlinear continuous-time systems. Despite their similarity to affine systems, strict-feedback nonlinear systems require a different control design approach. The control of nonlinear continuous-time systems in strict-feedback form with uncertain dynamics was attempted in [15] by using a standard linear-in-the-unknown-parameters (LIP) adaptive backstepping scheme without minimizing a performance index. Other papers [16–18] focus on the tracking control of unknown nonlinear continuous-time systems in strict-feedback form by using NN-based backstepping schemes without addressing optimality. More recently, the inverse optimal control of strict-feedback nonlinear systems was introduced in [19] under the assumption that the system dynamics are known. In the inverse optimal control design [19], the control law is first designed based on the known system dynamics, and then it is shown that there exists a meaningful performance functional that is minimized by this control law. This is in contrast with traditional optimal control design, where the control law is derived from a given cost function.
To the best knowledge of the authors, no adaptive optimal control scheme is available for nonlinear continuous-time systems in strict-feedback form when the dynamics are known. In addition, no output feedback-based optimal adaptive technique is available in the literature for such systems. Therefore, in this paper, a novel optimal adaptive control scheme using an NN is introduced for nonlinear continuous-time systems in strict-feedback form with known system dynamics. First, the nonlinear continuous-time system in strict-feedback form is transformed into a nonlinear tracking error system in affine form by using a backstepping-like technique, as there is no known way to directly design an optimal adaptive controller for strict-feedback nonlinear continuous-time systems. Then a single OLA (SOLA) is utilized for the tracking error system in affine form to learn its cost function, which becomes the solution to the HJB equation, in a forward-in-time manner. Lyapunov theory is utilized to demonstrate the convergence of this approximate optimal control scheme for the overall nonlinear continuous-time system while explicitly considering the approximation errors resulting from the use of the OLA in the backstepping approach. Initially, the state feedback-based optimal design is considered for multi-input and multi-output (MIMO) systems; subsequently, an output feedback controller design is addressed for single-input and single-output (SISO) nonlinear continuous-time strict-feedback systems.
In this paper, the policy and value iterations that are commonly found in ACD techniques [8, 9] are not utilized. Instead, the value and policy are updated once per sampling interval, making the scheme suitable for real-time control. An initial stabilizing control is not required, in contrast to [6, 8, 11]. It is shown that the approximated control input approaches the optimal value over time, and if the NN reconstruction errors become zero, as in the case of traditional adaptive control, asymptotic stability is demonstrated. The contribution of this paper is the development of an adaptive optimal controller for a class of nonlinear continuous-time systems in strict-feedback form, first by using state measurements and then with output feedback. An initial stabilizing control is not required, and stability is demonstrated using Lyapunov methods without using value and policy iterations. The proposed optimal adaptive NN control is a third generation of NN-based learning control, which has its own challenging issues to be solved when used in feedback control of dynamic systems. Traditionally, the first generation of NN-based controllers needed to be trained offline by using backpropagation [20], and stability was not shown. The second generation of NN-based controllers [20] met stability requirements but not optimality, while the third generation guarantees both stability and optimality objectives through online learning. Achieving optimality is often not straightforward and requires novel design methods and nonstandard NN weight update laws along with a persistence of excitation condition. By using a novel design and nonstandard NN weight update laws, the HJB solution is approximated online in this work, and optimality is demonstrated.

The paper is organized as follows. Section 2 is dedicated to the optimal control of a class of nonlinear continuous-time systems in strict-feedback form by transforming the system into an equivalent nonlinear system in affine form. Section 3 introduces an online optimal stabilization scheme for affine systems under state feedback. Next, Section 4 extends the results of Section 3 to an observer-based output feedback approach in which the state measurement requirement is relaxed. Finally, Section 5 provides numerical results for the proposed optimal controller. In the next section, a solution to the optimal tracking control of nonlinear systems in strict-feedback form is introduced.

2. THE TRACKING PROBLEM FOR STRICT-FEEDBACK SYSTEMS

Consider the MIMO nonlinear continuous-time system in the absence of disturbances described by

$$\dot{x}_i = f_i(x_1,\ldots,x_i) + g_i(x_1,\ldots,x_i)\,x_{i+1} \quad \text{for } 1 \le i \le N-1 \text{ and } N \ge 2, \tag{1}$$

$$\dot{x}_N = f_N(x_1,\ldots,x_N) + g_N(x_1,\ldots,x_N)\,u, \tag{2}$$

$$y = x_1, \tag{3}$$
where each $x_i \in \mathbb{R}^m$ denotes a state vector, $u \in \mathbb{R}^m$ represents the input vector, and $f_i(x_1,\ldots,x_i) \in \mathbb{R}^m$ and $g_i(x_1,\ldots,x_i) \in \mathbb{R}^{m\times m}$ represent nonlinear smooth functions. For the class of nonlinear systems given by (1), the next system state is treated as the virtual control input. Nonetheless, the system is ultimately controlled through the control input $u$. The following assumption is needed before we proceed.

Assumption 1
It is assumed that $g_i(x_1,\ldots,x_i) \neq 0$ ($1 \le i \le N$) is invertible and smooth.

The infinite horizon cost function to be minimized is

$$V(E(t)) = \int_t^{\infty} \big( Q(E(\tau)) + U^T(\tau) R\, U(\tau) \big)\, d\tau, \tag{15}$$

where $Q(E) \ge 0$ is the positive semidefinite penalty on the states, and $R > 0 \in \mathbb{R}^{M\times M}$ is a positive definite matrix with $M = mN$. Equations (10)–(12) along with (14) demonstrate that the optimal control of the nonlinear system in strict-feedback form can be transformed into the optimal control of an affine nonlinear system written as a function of the error vector $E$ represented in (13). Now, consider the optimal stabilization problem for an affine nonlinear continuous-time system written in terms of the error vector

$$\dot{E} = F(E) + G(X)\,U, \tag{16}$$

where $F(E) = \big[f_1^T(e_1),\ \ldots,\ f_N^T(e_1,\ldots,e_N)\big]^T$ and $G(X) = \mathrm{diag}\big[g_1(x_1),\ \ldots,\ g_N(x_1,\ldots,x_N)\big]$. It is desired that $E$ converge to zero while cost function (15) is minimized. Moreover, the control input $U$ is required to be designed such that cost function (15) is finite. We define the Hamiltonian for cost function (15) with an associated admissible control input $U$ to be [3]

$$H(E,U) = r(E,U) + V_E^T(E)\big(F(E) + G(X)U\big), \tag{17}$$
where $V_E(E)$ is the gradient of $V(E)$ with respect to $E$. In the sequel, the same notation is used for gradients; that is, for any function $(\cdot)$, $(\cdot)_\xi$ denotes the gradient of $(\cdot)$ with respect to $\xi$. Using the stationarity condition $\partial H(E,U)/\partial U = 0$, the optimal control input is revealed to be [3]

$$U^*(E) = -\tfrac{1}{2}\, R^{-1} G(X)^T V_E^*(E). \tag{18}$$
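For completeness, a one-line sketch of the intermediate step, assuming the quadratic integrand $r(E,U) = Q(E) + U^T R U$ from cost function (15):

```latex
% Stationarity of the Hamiltonian (17) with r(E,U) = Q(E) + U^T R U:
\frac{\partial H(E,U)}{\partial U} = 2RU + G(X)^T V_E(E) = 0
\;\Longrightarrow\;
U^*(E) = -\tfrac{1}{2} R^{-1} G(X)^T V_E^*(E),
% which is (18); the Hessian 2R > 0 confirms the stationary point is a minimum.
```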
By substituting (18) into the Hamiltonian (17), while noting $H(E,U^*) = 0$, the HJB equation is revealed to be

$$Q(E) + V_E^{*T}(E)\,F(E) - \tfrac{1}{4}\, V_E^{*T}(E)\, G(X) R^{-1} G(X)^T V_E^*(E) = 0, \tag{19}$$
with $V^*(0) = 0$. For linear systems, Equation (19) yields the standard algebraic Riccati equation [3]. Before proceeding, the following technical lemma is required.

Lemma 2 ([5])
Given nonlinear system (16) with associated cost function (15) and optimal control (18), let $J(E)$ be a continuously differentiable, radially unbounded Lyapunov candidate such that $\dot{J}(E) = J_E^T(E)\dot{E} = J_E^T(E)\big(F(E) + G(X)U^*\big) < 0$, where $J_E(E)$ is the radially unbounded partial derivative of $J(E)$. Moreover, let $\bar{Q}(E)$ be a positive definite matrix satisfying $\bar{Q}(E) = 0$ only if $\|E\| = 0$ and $\bar{Q}_{\min} \le \bar{Q}(E) \le \bar{Q}_{\max}$ for $\chi_{\min} \le \|E\| \le \chi_{\max}$, for positive constants $\bar{Q}_{\min}$, $\bar{Q}_{\max}$, $\chi_{\min}$, and $\chi_{\max}$. In addition, let $\bar{Q}(E)$ satisfy $\lim_{E\to\infty}\bar{Q}(E) = \infty$ as well as

$$V_E^{*T}\,\bar{Q}(E)\,J_E = r(E,U^*) = Q(E) + U^{*T} R\, U^*. \tag{20}$$

Then the following relation holds:

$$J_E^T\big(F(E) + G(X)U^*\big) = -J_E^T\,\bar{Q}(E)\,J_E. \tag{21}$$
In [9], the closed-loop dynamics $F(E) + G(X)U^*$ are required to satisfy a Lipschitz condition such that $\|F(E) + G(X)U^*\| \le K$ for a constant $K$. In contrast, in this work, the optimal closed-loop dynamics are assumed to be bounded above by a function of the system states such that

$$\|F(E) + G(X)U^*\| \le \delta(E). \tag{22}$$

The generalized bound $\delta(E)$ is taken as $\delta(E) \triangleq \sqrt{K}\,\|J_E\|$ in this work, where $\|J_E\|$ can be selected to satisfy general bounds and $K$ is a constant assumed to exist. For example, if $\delta(E) = K_1\|E\|$ for a constant $K_1$, then it can be shown that selecting $J(E) = (E^T E)^{5/2}/5$ with $J_E(E) = (E^T E)^{3/2} E^T$ satisfies the bound. The time-varying upper bound in (22) is a less stringent assumption than the constant upper bound required in [9]. The next section develops an approach to optimally stabilize the affine system, which is required for optimal tracking of the original strict-feedback system. We rewrite cost function (15) using an OLA representation as

$$V^*(E) = \Theta^T \varphi(E) + \varepsilon(E), \tag{23}$$
where $\Theta$ is the constant target OLA parameter vector, $\varphi(E)$ is the basis vector, and $\varepsilon(E)$ is the OLA reconstruction error. Here, $D = G(X) R^{-1} G(X)^T \ge 0$ is bounded such that $D_{\min} \le \|D\| \le D_{\max}$ for known constants $D_{\min}$ and $D_{\max}$, and

$$\varepsilon_{HJB} = \nabla_E\varepsilon^T \Big( F(E) - \tfrac{1}{2}\, G(X) R^{-1} G(X)^T \big( \nabla_E\varphi^T(E)\,\Theta + \nabla_E\varepsilon \big) \Big) + \tfrac{1}{4}\,\nabla_E\varepsilon^T G(X) R^{-1} G(X)^T \nabla_E\varepsilon = \nabla_E\varepsilon^T \big( F(E) + G(X) U^* \big) + \tfrac{1}{4}\,\nabla_E\varepsilon^T D\,\nabla_E\varepsilon \tag{27}$$

is the residual error due to the OLA reconstruction error. Asserting the bound on the optimal closed-loop dynamics (22) along with the boundedness of $G(X)$ and $\nabla_E\varepsilon$, the residual error $\varepsilon_{HJB}$ is bounded above on a compact set according to $|\varepsilon_{HJB}| \le \varepsilon'_M\,\delta(E) + \varepsilon'^2_M D_{\max}$. In addition, it has been shown [9] that increasing the dimension of the basis vector $\varphi(E)$, in the case of a single-layer NN, decreases the OLA reconstruction error. The OLA estimate of (15) is now written as

$$\hat{V}(E) = \hat{\Theta}^T \varphi(E), \tag{28}$$
where $\hat{\Theta}$ is the OLA estimate of the target parameter vector $\Theta$. Similarly, the estimate of the optimal control (14) is written in terms of $\hat{\Theta}$ as

$$\hat{U} = -\tfrac{1}{2}\, R^{-1} G(X)^T \nabla_E\varphi^T(E)\,\hat{\Theta}. \tag{29}$$
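As an illustration of how (29) can be evaluated in practice, the following is a minimal NumPy sketch for a two-state error system; the polynomial basis, its hand-coded Jacobian, and all numeric values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def phi_grad(E):
    """Jacobian d(phi)/dE for the assumed basis phi(E) = [e1^2, e2^2, e1*e2]^T,
    shape (L, n) with L = 3 basis functions and n = 2 states."""
    e1, e2 = E
    return np.array([[2.0 * e1, 0.0],
                     [0.0, 2.0 * e2],
                     [e2, e1]])

def control_estimate(E, G, R, theta_hat):
    """Approximate optimal control (29): U_hat = -0.5 R^{-1} G^T grad(phi)^T theta_hat."""
    grad_V = phi_grad(E).T @ theta_hat          # estimate of the cost gradient V_E(E)
    return -0.5 * np.linalg.solve(R, G.T @ grad_V)

# Example call with placeholder values (illustrative only).
E = np.array([0.5, -0.2])
G = np.eye(2)                                    # G(X) of the transformed system
R = np.eye(2)
theta_hat = np.array([1.0, 0.5, 0.1])
print(control_estimate(E, G, R, theta_hat))
```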
It is shown in [5] that an initial stabilizing control is not required to implement the proposed SOLA-based scheme, in contrast to [9, 11], which require an initial stabilizing control policy. In fact, the proposed OLA parameter tuning law described next ensures that the system states remain bounded and that (29) becomes admissible. Now, using (28), the approximate Hamiltonian can be written as

$$\hat{H}(E,\hat{\Theta}) = Q(E) + \hat{\Theta}^T \nabla_E\varphi(E)\, F(E) - \tfrac{1}{4}\, \hat{\Theta}^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\, \hat{\Theta}. \tag{30}$$
Observing the definition of the OLA approximation of the cost function (28) and the Hamiltonian function (30), it is evident that both become zero when $\|E\| = 0$. Thus, once the system states have converged to zero, the cost function approximation can no longer be updated. This can be viewed as a persistency of excitation (PE) requirement for the inputs to the cost function OLA [9, 11]. That is, the system states must be persistently exciting long enough for the OLA to learn the optimal cost function.

Recalling the HJB equation shown in (19), the OLA estimate $\hat{\Theta}$ should be tuned to minimize $\hat{H}(E,\hat{\Theta})$. However, tuning to minimize $\hat{H}(E,\hat{\Theta})$ alone does not ensure the stability of nonlinear system (16) during the OLA learning process. Therefore, the proposed OLA tuning algorithm is designed to minimize (30) while considering the stability of (16), and is written as

$$\dot{\hat{\Theta}} = -\alpha_1 \frac{\hat{\sigma}}{(\hat{\sigma}^T\hat{\sigma}+1)^2} \Big( Q(E) + \hat{\Theta}^T \nabla_E\varphi(E) F(E) - \tfrac{1}{4}\, \hat{\Theta}^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\, \hat{\Theta} \Big) + \frac{\alpha_2}{2}\, \Sigma(E,\hat{U})\, \nabla_E\varphi(E)\, G(X) R^{-1} G(X)^T J_E(E), \tag{31}$$

where $\hat{\sigma} = \nabla_E\varphi(E)\, F(E) - \tfrac{1}{2}\, \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\, \hat{\Theta}$, $\alpha_1 > 0$ and $\alpha_2 > 0$ are design constants, $J_E(E)$ is described in Lemma 1, and the operator $\Sigma(E,\hat{U})$ is given by

$$\Sigma(E,\hat{U}) = \begin{cases} 0 & \text{if } J_E^T(E)\dot{E} = J_E^T(E)\big( F(E) - \tfrac{1}{2}\, G(X) R^{-1} G(X)^T \nabla_E\varphi^T(E)\,\hat{\Theta} \big) < 0, \\ 1 & \text{otherwise} \end{cases} \tag{32}$$

(that is, if $J(E) > 0$ and $\dot{J}(E) = J_E^T(E)\dot{E} < 0$, then the states $E$ are stable). From the definition of the operator in (32), the second term in (31) is removed when nonlinear system (16) exhibits stable behavior, and learning the HJB cost function becomes the primary objective of OLA update (31). In contrast, when system (16) exhibits signs of instability (i.e., $J_E^T(E)\dot{E} > 0$), the second term of (31) is activated and tunes the OLA parameter estimates until nonlinear system (16) again exhibits stable behavior.

Now, we form the dynamics of the OLA parameter estimation error $\tilde{\Theta} = \Theta - \hat{\Theta}$. Observing $Q(E) = -\Theta^T \nabla_E\varphi(E) F(E) + \Theta^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\,\Theta/4 - \varepsilon_{HJB}$ from (25), the approximate HJB (30) can be rewritten in terms of $\tilde{\Theta}$ as

$$\hat{H}(E,\tilde{\Theta}) = -\tilde{\Theta}^T \nabla_E\varphi(E)\, F(E) + \tfrac{1}{2}\, \tilde{\Theta}^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\, \Theta - \tfrac{1}{4}\, \tilde{\Theta}^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\, \tilde{\Theta} - \varepsilon_{HJB}. \tag{34}$$
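The tuning law (31) together with the operator (32) can be coded compactly. Below is a hedged sketch of one evaluation of the right-hand side of (31); the function arguments (basis Jacobian, Lyapunov gradient, and gains) are designer-supplied assumptions of this sketch.

```python
import numpy as np

def sigma_op(J_E, E_dot):
    """Stability indicator (32): 0 while J_E^T(E) E_dot < 0, else 1."""
    return 0.0 if float(J_E @ E_dot) < 0.0 else 1.0

def theta_hat_dot(E, theta_hat, F, G, R, Q, grad_phi, J_E, a1=200.0, a2=0.01):
    """One evaluation of the SOLA tuning law (31).

    grad_phi : basis Jacobian, shape (L, n); F : F(E), shape (n,);
    G : G(X), shape (n, n); J_E : Lyapunov candidate gradient, shape (n,);
    Q : scalar state penalty Q(E). The gains a1, a2 (alpha_1, alpha_2) default
    to the values used in the paper's simulations but are free design choices.
    """
    D = G @ np.linalg.solve(R, G.T)                       # D = G R^{-1} G^T
    sigma = grad_phi @ F - 0.5 * grad_phi @ D @ grad_phi.T @ theta_hat
    H_hat = (Q + theta_hat @ grad_phi @ F
             - 0.25 * theta_hat @ grad_phi @ D @ grad_phi.T @ theta_hat)
    # First term of (31): normalized gradient descent on the Hamiltonian (30).
    update = -a1 * sigma / (sigma @ sigma + 1.0) ** 2 * H_hat
    # Second term of (31): stabilizing correction, active only when (32) flags
    # instability of the closed-loop error dynamics.
    U_hat = -0.5 * np.linalg.solve(R, G.T) @ grad_phi.T @ theta_hat
    E_dot = F + G @ U_hat
    update += 0.5 * a2 * sigma_op(J_E, E_dot) * grad_phi @ D @ J_E
    return update
```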
Next, observing $\dot{\tilde{\Theta}} = -\dot{\hat{\Theta}}$ and $\hat{\sigma} = \nabla_E\varphi(E)\big(\dot{E} + D\nabla_E\varepsilon/2\big) + \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\,\tilde{\Theta}/2$, where $\dot{E} = F(E) + G(X)U$, the error dynamics of (31) are written as

$$\dot{\tilde{\Theta}} = \frac{\alpha_1}{\rho^2}\,\nabla_E\varphi(E)\!\left( \dot{E} + \frac{D\nabla_E\varepsilon}{2} + \frac{D\,\nabla_E\varphi^T(E)\,\tilde{\Theta}}{2} \right)\!\left( \tilde{\Theta}^T \nabla_E\varphi(E)\!\left( \dot{E} + \frac{D\nabla_E\varepsilon}{2} \right) + \frac{1}{4}\,\tilde{\Theta}^T \nabla_E\varphi(E)\, D\, \nabla_E\varphi^T(E)\,\tilde{\Theta} + \varepsilon_{HJB} \right) - \frac{\alpha_2}{2}\,\Sigma(E,\hat{U})\,\nabla_E\varphi(E)\, G(X) R^{-1} G(X)^T J_E(E), \tag{35}$$

where $\rho = \hat{\sigma}^T\hat{\sigma} + 1$. Next, the stability of the SOLA-based adaptive scheme for optimal control is examined along with the stability of nonlinear system (16).

Theorem 1 (SOLA-based optimal control scheme [5])
Given the nonlinear system in affine form (13) or (16) with target HJB equation (19), let the tuning law for the SOLA be given by (31). Then there exist computable positive constants $b_{J_E}$ and $b_\Theta$ such that the OLA approximation error $\tilde{\Theta}$ and $\|J_E(E)\|$ are uniformly ultimately bounded (UUB) [20] for all $t \ge t_0 + T$ with ultimate bounds given by $\|J_E(E)\| \le b_{J_E}$ and $\|\tilde{\Theta}\| \le b_\Theta$. Further, in the presence of OLA reconstruction errors, one can show that $\|V^* - \hat{V}\| \le \varepsilon_{r1}$ and $\|U^* - \hat{U}\| \le \varepsilon_{r2}$ for some small positive constants $\varepsilon_{r1}$ and $\varepsilon_{r2}$, where $b_\Theta \triangleq \sqrt{\eta(\varepsilon)/\beta_1}$ and $b_{J_E} \triangleq \alpha_1\eta(\varepsilon)\big/\big(\alpha_2\dot{X}_{\min} - \alpha_1\beta_2\sqrt{K}\big)$ with $\beta_2$ chosen such that $\alpha_2\dot{X}_{\min} - \alpha_1\beta_2\sqrt{K} > 0$.

The stability of the SOLA-based optimal control scheme can also be examined when there are no OLA reconstruction errors, as would be the case when standard adaptive control techniques [2] are utilized. In other words, when the NN is replaced with a standard linear-in-the-unknown-parameters (LIP) adaptive structure, the parameter estimation errors and the states are globally asymptotically stable. Next, the stability of the optimal adaptive control scheme for the strict-feedback system is introduced.

Theorem 2 (Optimal adaptive control scheme for strict-feedback systems)
Given the nonlinear system in strict-feedback form (1)–(3), assume that the virtual and actual control input vector $U = [x_{2d},\ \ldots,\ x_{Nd},\ u]^T$ is designed such that $U = U^a + U^*$, where $U^a = [x_{2d}^a,\ \ldots,\ x_{Nd}^a,\ u^a]^T$ is the feedforward control input designed in (5), (7), and (9), and $U^* = [x_{2d}^*,\ \ldots,\ x_{Nd}^*,\ u^*]^T$ represents the feedback control input given by (29). Let the tuning law for the SOLA be given by (31). Then the closed-loop system is UUB.

Proof
Use Lemma 1 and Theorem 1. ∎
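Putting the pieces together, Theorems 1 and 2 suggest the forward-in-time loop sketched below, in which the weights are updated once per integration step rather than through inner policy or value iterations. It reuses `theta_hat_dot` from the earlier sketch; the Euler integrator, step sizes, and function handles are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def simulate(F_fun, G_fun, Q_fun, grad_phi_fun, J_E_fun, R,
             E0, theta0, dt=1e-3, t_final=30.0):
    """Forward-in-time SOLA learning: control (29) and tuning law (31) are
    each evaluated once per step while the error dynamics (16) evolve."""
    E, theta = E0.astype(float).copy(), theta0.astype(float).copy()
    for _ in range(int(t_final / dt)):
        Gk = G_fun(E)
        U = -0.5 * np.linalg.solve(R, Gk.T) @ grad_phi_fun(E).T @ theta  # (29)
        E = E + dt * (F_fun(E) + Gk @ U)                                 # (16)
        theta = theta + dt * theta_hat_dot(E, theta, F_fun(E), Gk, R,
                                           Q_fun(E), grad_phi_fun(E),
                                           J_E_fun(E))                   # (31)
    return E, theta
```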
4. OBSERVER-BASED OUTPUT FEEDBACK CONTROL DESIGN

In the previous section, the optimal adaptive control of a class of nonlinear continuous-time systems was introduced when the states are available for measurement. In practice, the states are not measurable for a vast class of nonlinear systems. In this section, we consider the control of strict-feedback system (1)–(3), where $f_i(\cdot)$ and $g_i(\cdot)$ are known, whereas the state vector is not measured and only the output $y = h(x)$ is available. The MIMO output feedback control of strict-feedback systems has to mitigate several additional challenges and is relegated to a future publication; for example, selecting different outputs can change the relative degree of the system, which in turn can complicate the controller design. Therefore, we consider system (1)–(3) in the SISO case. This problem is still difficult, as no known output feedback-based optimal control scheme is available in a forward-in-time manner, although some recent results have been achieved for linear systems [21], where policy and value iteration methods are used to estimate the optimal control solution. While the work in [21] uses adaptive dynamic programming and only input/output data,
this paper follows the traditional method, which introduces an observer to estimate the unmeasured states. Now, assume that (1)–(3) is represented in SISO form, that is, $x_i \in \mathbb{R}$ and $u \in \mathbb{R}$. Here, $\gamma_1 > 0$ and $\gamma_2 > 0$ are real design parameters, $\Theta_1$ is the target parameter, $J_1(\hat{E})$ is the Lyapunov candidate of Lemma 1, and $\varphi_1(\hat{E})$ is the basis function for the estimation of $V_1(\hat{E})$. Moreover, $\Sigma_1(\hat{E},\hat{U})$ is defined analogously to (32), and

$$\hat{\sigma}_1 = \nabla_{\hat{E}}\varphi_1(\hat{E})\,\omega(\hat{e}_1) - \tfrac{1}{2}\,\nabla_{\hat{E}}\varphi_1(\hat{E})\, D_1\, \nabla_{\hat{E}}\varphi_1^T(\hat{E})\,\hat{\Theta}_1, \tag{46}$$

with $D_1 = B(y) R_1^{-1} B(y)^T > 0$, where $D_{1\min} < \|D_1\| < D_{1\max}$. It is finally assumed that $\|\omega(\hat{e}_1) + B(y)U_1^*\| \le \delta_1(\hat{E}) \triangleq \sqrt{K_1}\,\|J_{1\hat{E}}\|$ with $K_1 > 0$. We can now introduce Theorem 3 for the case where the states are not measured and only the output is available.
Theorem 3 (Output feedback SOLA-based optimal control scheme)
Assume that the states of nonlinear system (1)–(3) are not measurable while only the output is available, with $m = 1$. Assume also that the $x_i$ are transformed using $\partial[x_1,\ldots,x_N]$ to $\zeta$, which forms system dynamics (36). Given nonlinear system (36), observer (37), and target HJB equation (41), let the tuning law for the SOLA be given by (44) with the cost function estimate $\hat{V}_1(\hat{E}) = \hat{\Theta}_1^T \varphi_1(\hat{E})$. Then there exist computable positive constants $b_{1J_E}$, $b_{1\Theta}$, and $b_{1\zeta}$ such that the OLA approximation error $\tilde{\Theta}_1 = \Theta_1 - \hat{\Theta}_1$, $\|J_{1\hat{E}}(\hat{E})\|$, and the observer error $\tilde{\zeta}$ are UUB for all $t \ge t_0 + T$, with ultimate bounds given by $\|J_{1\hat{E}}(\hat{E})\| \le b_{1J_E}$, $\|\tilde{\Theta}_1\| \le b_{1\Theta}$, and $\|\tilde{\zeta}\| \le b_{1\zeta}$. Further, in the presence of OLA reconstruction errors, one can show that $\|V_1^* - \hat{V}_1\| \le \bar{\varepsilon}_{r1}$ and $\|U_1^* - \hat{U}_1\| \le \bar{\varepsilon}_{r2}$ for small positive constants $\bar{\varepsilon}_{r1}$ and $\bar{\varepsilon}_{r2}$, where $b_{1\Theta} \triangleq \sqrt[4]{\gamma_1\eta_1(\varepsilon)/\Gamma_1}$, $b_{1J_E} \triangleq \gamma_1\eta_1(\varepsilon)\big/\big(\gamma_2\dot{\hat{E}}_{\min} - 2\gamma_1\sqrt{K_1}\big)$ provided $\gamma_2/\gamma_1 > 2\sqrt{K_1}/\dot{\hat{E}}_{\min}$, and $b_{1\zeta} \triangleq \sqrt{\gamma_1\eta_1(\varepsilon)/T_{\min}}$, where $\rho_1 = \hat{\sigma}_1^T\hat{\sigma}_1 + 1$. In addition, the relationship $-T = A_o^T P + P A_o$ is utilized, with $P$ and $T$ being arbitrary positive definite matrices and $T_{\min}$ the minimum eigenvalue of $T$.

Proof
See the Appendix.
The interesting point is that the observer estimates $\zeta$, which in turn is utilized to generate the feedback control input. This means that by guaranteeing the existence of $\partial(X)$, the user does not need the system model in the form given by (36).

Remark 1 (PE condition requirement for OLA convergence)
As mentioned in the description of (30) and (31), the PE condition is required for the OLA weights to converge and remain bounded. The PE condition is used in the stability proofs of adaptive systems to guarantee convergence of the estimated parameters to their target values [2]. In traditional adaptive control, such convergence is not a mandatory condition for stability, as only boundedness of the parameter estimation errors is normally shown; with the PE condition in place, the stability proof shows that the parameters converge asymptotically to their target values over time. This paper, however, requires that the OLA weights converge to a small enough bound in reasonable time. In other words, although the tracking error is proven to be stable, the optimality of the applied control input requires that $\tilde{\Theta}$ converge, while the PE condition is utilized to excite the dynamics so that they can be learned. To this end, the closed-loop system, and particularly $\varphi(E)$, should be persistently exciting. Unfortunately, there is no classical method for adaptive systems to quantify the level of PE or to guarantee convergence of the estimated parameters in finite time so that the PE condition can be turned off. Previous works in online optimal control [5, 6, 13] also require the PE condition to ensure convergence to the optimal controllers. However, the recent paper [22] shows that noise can be utilized to meet the PE condition. While this paper relies on the results of [22], the PE condition has been verified in some particular cases; for example, in [23], it is shown that with RBF networks and a recurrent desired trajectory, the PE condition is satisfied. It is finally noted that the proposed optimal control scheme can be implemented online without performing offline calculations, and stability is guaranteed.

5. NUMERICAL RESULTS

In this section, we start by applying the results to linear systems. While we have shown that the proposed online controller is able to solve the HJB equation online, it is instructive to see that it can also solve the Riccati equation online for linear systems. Then a MIMO system is considered, and a state feedback optimal approach is designed and verified in simulation. Subsequently, the output feedback-based optimal scheme is evaluated in another example. In all three studies, a probing signal is injected to satisfy the PE condition, as sketched below.
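Following Remark 1 and [22], the PE requirement in the examples is met by injecting a small probing signal. A hedged sketch is given below; the frequencies and amplitude are illustrative choices, not the paper's exact probing signal.

```python
import numpy as np

def probe_noise(t, amplitude=0.1):
    """Sum-of-sinusoids probing signal to keep the regressor persistently
    exciting while the OLA learns; injected through the control channel."""
    freqs = np.array([1.0, 3.1, 7.3, 13.7])      # incommensurate frequencies
    return amplitude * np.sum(np.sin(freqs * t))

# Applied as U_applied = U_hat + probe_noise(t), with the noise switched off
# once the Hamiltonian estimate (30) has settled near zero.
```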
5.1. Online optimal control of linear systems

In the case of linear systems, the infinite horizon optimal control problem requires the Riccati equation to be solved instead of the HJB equation. Consider the following unstable linear system:

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u, \qquad y = x_1, \tag{47}$$

which is in strict-feedback form. It is clear that with the proposed backstepping approach, the tracking error dynamics take the following form:
$$\begin{bmatrix} \dot{e}_1 \\ \dot{e}_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} U. \tag{48}$$
By solving the Riccati equation, we can easily find that the cost function is given by $V^*(E) = E^T P E = 3.924 e_1^2 + 1.461 e_2^2 + 1.854 e_1 e_2$, where $P$ is the solution to the Riccati equation when $Q = I$ and $R = I$. The design parameters are chosen as $\alpha_1 = 200$ and $\alpha_2 = 0.01$. Therefore, the SOLA should converge to this solution if we choose the basis function as $\varphi(E) = [e_1^2\ \ e_2^2\ \ e_1 e_2]^T$, as the form of the cost function is known for linear systems. A probe noise is also added to the system dynamics to provide the PE condition. Figure 1 depicts the evolution of the OLA weights during online learning and shows that the estimated cost function accurately converges to the desired one in about 8 s, although the PE condition is applied for an additional 15 min in order to show that the estimated Hamiltonian stays close to the origin. Starting from zero, the weights of the online OLA are tuned to learn the optimal cost function and converge to their exact values in $V^*(E)$. The system tracking performance is shown in Figure 2, where $x_{1d} = \sin(0.5t)$ is chosen as the desired output trajectory. Figure 3 shows the Hamiltonian $\hat{H}(E,\hat{\Theta})$ with respect to time; the Hamiltonian convergence time is shorter than that of the cost weights and the tracking error. The overshoot in Figure 2 arises mainly because the optimal controller is not initialized with an admissible controller, while plant (48) is unstable. While the update law drives the initial controller toward the optimal (and hence stabilizing) one, the plant state trajectories may temporarily overshoot or undershoot. It should be mentioned that the purpose of the presented results is to show that the proposed method converges even in this worst case; the user can observe much better results when the cost function weights are initially chosen to provide an admissible controller. The same argument is also valid for the next two sections, where the method is applied to nonlinear systems in the state feedback and output feedback cases.
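The quoted cost function can be cross-checked offline; the sketch below recomputes the Riccati solution for the error dynamics (48) under the stated weights. The matrices are taken as printed in (47)–(48), so treat them as assumptions if any signs were lost in reproduction.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [2.0, 1.0]])
B = np.eye(2)            # G is the identity for the transformed error system (48)
Q = np.eye(2)
R = np.eye(2)
P = solve_continuous_are(A, B, Q, R)
# V(E) = E^T P E, so the target weights for phi(E) = [e1^2, e2^2, e1*e2]^T
# are [P[0, 0], P[1, 1], 2 * P[0, 1]]; compare with the coefficients quoted above.
print(P)
```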
Figure 1. The evolution of the cost function weights $\hat{\Theta}(t)$ with time for the linear system.
Figure 2. The convergence of the linear system outputs to the desired trajectory.
Figure 3. The convergence of the Hamiltonian.
5.2. MIMO online optimal control

Consider the following nonlinear system in the form of (1)–(2):

$$\dot{X}_1 = \begin{bmatrix} \dot{x}_{11} \\ \dot{x}_{12} \end{bmatrix} = \begin{bmatrix} x_{11}^2\, x_{12} \\ x_{11}^2 + \dfrac{5 x_{11}}{2\,(1 + 25 x_{11}^2)} + \tan^{-1}(5 x_{11}) \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 3 + 4 x_{12}^2 \end{bmatrix} \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix} \tag{49}$$

$$\dot{X}_2 = \begin{bmatrix} \dot{x}_{21} \\ \dot{x}_{22} \end{bmatrix} = \begin{bmatrix} 4 x_{21}^2 + x_{12} - 2 x_{11} \\ 3 x_{22}^2 + 2 x_{12} x_{11} \end{bmatrix} + \begin{bmatrix} 1 + x_{21}^2 & 0 \\ 0 & 1 + \tfrac{1}{2}\cos(x_{21} + x_{11}) \end{bmatrix} u. \tag{50}$$
Using HJB cost function (15) with $Q(E) = E^T E$ and $R = I$, the basis vector for the SOLA-based scheme was selected as

$$\varphi(E) = \big[\, x_{e11}^2,\ x_{e12}^2,\ x_{e11} x_{e12},\ x_{e11}\tan^{-1}(5 x_{e11}),\ x_{e11}^3,\ x_{e21}^2,\ x_{e22}^2,\ x_{e21} x_{e22},\ x_{e21}\tan^{-1}(5 x_{e21}),\ x_{e21}^3 \,\big]^T,$$

while the tuning parameters are selected as $\alpha_1 = 200$ and $\alpha_2 = 0.01$. Moreover, $x_{e11} = x_{11} - x_{11d}$, $x_{e12} = x_{12} - x_{12d}$, $x_{e21} = x_{21} - x_{21d}$, and $x_{e22} = x_{22} - x_{22d}$. The initial conditions of the system states are taken as $[x_{11}\ x_{12}\ x_{21}\ x_{22}]^T = [2\ 2\ 2\ 2]^T$, while all NN weights are initialized to zero. That is, no initial stabilizing control was utilized in implementing this online design for the nonlinear system. Moreover, it is desired that the output track the desired trajectory $X_{1d} = [\sin(t/50)\ \ \sin(t/40)]^T$. Figure 4 depicts the evolution of the OLA weights during online learning. Starting from zero, the weights of the online OLA are tuned to learn the optimal cost function. The system output ($X_1 = [x_{11}, x_{12}]^T$) is shown in Figure 5, and noise is added to each state to ensure the PE condition. Figure 6 depicts the stability of the internal system states ($X_2 = [x_{21}, x_{22}]^T$). Figure 7 shows the control input $\hat{U}$ to the system. Finally, for the case of Figures 4–7, Figure 8 demonstrates the estimated Hamiltonian in (30).

To demonstrate the importance of the secondary stabilizing term in the tuning law given by (31), the online OLA design is attempted with $\Sigma(E,\hat{U}) = 0$; that is, the learning algorithm only seeks to minimize the auxiliary HJB residual (33) and does not consider system stability.
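For reproducibility, a sketch of the 10-term basis vector listed above and its hand-coded Jacobian follows; the term ordering mirrors the list above, and this is an illustrative reconstruction rather than the authors' code.

```python
import numpy as np

def phi(e):
    """Basis vector phi(E) of Section 5.2, e = (e11, e12, e21, e22)."""
    e11, e12, e21, e22 = e
    return np.array([
        e11**2, e12**2, e11 * e12, e11 * np.arctan(5 * e11), e11**3,
        e21**2, e22**2, e21 * e22, e21 * np.arctan(5 * e21), e21**3,
    ])

def grad_phi(e):
    """Jacobian d(phi)/dE, shape (10, 4); entries coded analytically."""
    e11, e12, e21, e22 = e
    z = np.zeros((10, 4))
    z[0, 0] = 2 * e11
    z[1, 1] = 2 * e12
    z[2, 0], z[2, 1] = e12, e11
    z[3, 0] = np.arctan(5 * e11) + 5 * e11 / (1 + 25 * e11**2)
    z[4, 0] = 3 * e11**2
    z[5, 2] = 2 * e21
    z[6, 3] = 2 * e22
    z[7, 2], z[7, 3] = e22, e21
    z[8, 2] = np.arctan(5 * e21) + 5 * e21 / (1 + 25 * e21**2)
    z[9, 2] = 3 * e21**2
    return z
```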
Figure 4. The evolution of NN weights with time.
Figure 5. The convergence of system outputs to the desired trajectory.
Figure 6. The convergence of the internal system states to their desired trajectory.
Figure 7. The actual control input to the system $\hat{U}$.
Figure 8. Approximated Hamiltonian convergence indicating optimality.
Figure 9. The system output without the OLA stabilizing term in the update.
Figure 9 shows the result of not considering the nonlinear system stability while learning the optimal HJB function. From this figure, it is clear that the system state quickly escapes to infinity, and the SOLA-based controller fails to learn the HJB function. Thus, the importance of the secondary term in (31), which ensures the stability of the system, is revealed.

5.3. Observer-based online optimal output feedback control

Consider the following nonlinear system in the form of (1)–(2):
$$\dot{\zeta}_1 = \zeta_1^2 + \frac{5\zeta_1}{2\,(1 + 25\zeta_1^2)} + \tan^{-1}(5\zeta_1) + \big(4 + \zeta_1^2\big)\,\zeta_2, \tag{51}$$

$$\dot{\zeta}_2 = 2\zeta_1^2\,\zeta_2 + \Big(1 + \tfrac{1}{2}\cos(\zeta_1)\Big)\,u, \tag{52}$$

$$y = \zeta_1, \tag{53}$$

which is in the form of system (36). Here, we repeat the experiment of Section 5.2 under the assumption that $\zeta$ is not measurable. Using HJB cost function (15) with $Q_1(\hat{E}) = \hat{E}^T\hat{E}$ and $R_1 = 1$, the basis vector for the SOLA-based scheme was selected as

$$\varphi_1(\hat{E}) = \big[\, \hat{e}_1^2,\ \hat{e}_1^3,\ \hat{e}_1^2\tan^{-1}(5\hat{e}_1),\ \hat{e}_2^2,\ \hat{e}_2^3,\ \hat{e}_2^2\tan^{-1}(5\hat{e}_2) \,\big]^T,$$

while the tuning parameters were selected as $\gamma_1 = 200$ and $\gamma_2 = 0.01$, with the remaining observer design parameters set to $0.04$ and $A_o = 0.1$. The initial conditions of the system states were taken as $[\zeta_1\ \zeta_2]^T = [2\ 2]^T$, while all NN weights were initialized to zero. That is, no initial stabilizing control was utilized in implementing this online design for the nonlinear system. Moreover, it is desired that the output track the desired trajectory $x_d = \sin(t/50)$.
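Since the observer equation (37) is specified earlier in the text, only a generic stand-in can be sketched here: a minimal Luenberger-style estimator for the SISO case, in which the gain `L_obs` and the model handles `f_hat` and `g_hat` are assumptions of this sketch, not the paper's design.

```python
import numpy as np

def observer_step(zeta_hat, y, u, f_hat, g_hat, L_obs, dt):
    """One Euler step of a Luenberger-style state estimator:
    zeta_hat_dot = f_hat(zeta_hat) + g_hat(zeta_hat) * u + L_obs * (y - zeta_hat[0]),
    with measured output y = zeta_1. Returns the updated state estimate."""
    innovation = y - zeta_hat[0]
    zdot = f_hat(zeta_hat) + g_hat(zeta_hat) * u + L_obs * innovation
    return zeta_hat + dt * zdot
```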
Figure 10. System output $\zeta_1$, the observed output $\hat{\zeta}_1$, and the desired trajectory.
Figure 11. System output $\zeta_2$, the observed output $\hat{\zeta}_2$, and the desired trajectory.
The simulation results are given in Figures 10 and 11, which depict the convergence of $\hat{\zeta}_1$ and $\hat{\zeta}_2$ to $\zeta_1$ and $\zeta_2$. We can check from (A.5) and (A.7) that, by properly choosing $T_{\min}$, the upper bound on $\|\tilde{\zeta}\|$ can be made as small as desired. Therefore, after a transient response time of about 100 s, the observed state $\hat{\zeta}$ closely matches $\zeta$, and the online optimal controller can rely on the observed value instead of the real value. The injected PE noise is vital to the cost function learning process; in fact, without a significant level of noise injected into the process, there is no guarantee that the Hamiltonian will converge to zero.

6. CONCLUSIONS

This work proposed an optimal scheme for stabilizing nonlinear MIMO strict-feedback systems by using a single OLA to solve the HJB equation forward-in-time. In the presence of known dynamics, the regulation problem was undertaken first; then, by using a backstepping approach, the control input to the nonlinear system was derived by using state measurements. Next, the scheme was extended to optimal output feedback control of SISO nonlinear systems, with a nonlinear observer designed to estimate the unknown states. The UUB stability of the overall system is guaranteed in the presence of OLA approximation errors. Simulation results were provided to verify the theoretical results. Future work should extend the output feedback results from SISO to MIMO systems.

APPENDIX

Proof of Theorem 3
Although this theorem uses Lyapunov stability arguments similar to those of Theorem 1, here the observer dynamics must also be proven stable while the tracking error converges. This fact distinguishes the proof from that of [5] and makes it more involved.
Consider the following positive definite Lyapunov candidate:

$$J_{1HJB} = \gamma_2\, J_1(\hat{E}) + \tfrac{1}{2}\,\tilde{\Theta}_1^T\tilde{\Theta}_1 + \tfrac{1}{2}\,\tilde{\zeta}^T P\,\tilde{\zeta}, \tag{A.1}$$

whose first derivative with respect to time is given by

$$\dot{J}_{1HJB} = \gamma_2\, J_{1E}^T(\hat{E})\,\dot{\hat{E}} + \tilde{\Theta}_1^T\dot{\tilde{\Theta}}_1 + \tfrac{1}{2}\big(\dot{\tilde{\zeta}}^T P\,\tilde{\zeta} + \tilde{\zeta}^T P\,\dot{\tilde{\zeta}}\big) = \gamma_2\, J_{1E}^T(\hat{E})\,\dot{\hat{E}} + \tilde{\Theta}_1^T\dot{\tilde{\Theta}}_1 + \tfrac{1}{2}\,\tilde{\zeta}^T\big(A_o^T P + P A_o\big)\tilde{\zeta}, \tag{A.2}$$

where $J_1(\hat{E})$ and $J_{1E}(\hat{E})$ are given in Lemma 1 and the last term follows from the linear observer error dynamics, so that $\tilde{\zeta}^T(A_o^T P + P A_o)\tilde{\zeta} = -\tilde{\zeta}^T T\,\tilde{\zeta}$. Following the same steps as in Theorem 1, we obtain
$$\begin{aligned} \dot{J}_{1HJB} \le{} & -\tilde{\zeta}^T T\,\tilde{\zeta} + \gamma_2\, J_{1E}^T(\hat{E})\Big(\omega(e_1) - \tfrac{1}{2}\, B(y) R_1^{-1} B(y)^T \nabla_{\hat{E}}^T\varphi_1(\hat{E})\,\hat{\Theta}_1\Big) \\ & - \frac{\gamma_2}{2}\,\Sigma_1(\hat{E},\hat{U})\,\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\, B(y) R_1^{-1} B(y)^T J_{1\hat{E}}^T(\hat{E}) - \frac{\gamma_1}{32\rho_1^2}\,\big\|\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\big\|^4 D_{1\min}^2 \\ & + \frac{\gamma_1}{2\rho_1^2}\,\Big\|\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\Big(\dot{\hat{E}} + \frac{D_1\nabla_{\hat{E}}\varepsilon}{2}\Big)\Big\|^2 + \frac{3\gamma_1}{2}\,\varepsilon_{HJB}^2. \end{aligned}$$

Now, completing the square with respect to $\|\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\|^2$ renders

$$\begin{aligned} \dot{J}_{1HJB} \le{} & -\tilde{\zeta}^T T\,\tilde{\zeta} + \gamma_2\, J_{1E}^T(\hat{E})\Big(\omega(e_1) - \tfrac{1}{2}\, B(y) R_1^{-1} B(y)^T \nabla_{\hat{E}}^T\varphi_1(\hat{E})\,\hat{\Theta}_1\Big) \\ & - \frac{\gamma_2}{2}\,\Sigma_1(\hat{E},\hat{U})\,\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\, B(y) R_1^{-1} B(y)^T J_{1\hat{E}}^T(\hat{E}) - \frac{\gamma_1}{64\rho_1^2}\,\big\|\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\big\|^4 D_{1\min}^2 \\ & + \frac{256\,\gamma_1}{D_{1\min}^2}\,\Big\|\dot{\hat{E}} + \frac{D_1\nabla_{\hat{E}}\varepsilon}{2}\Big\|^4 + \frac{3\gamma_1}{2}\,\varepsilon_{HJB}^2. \end{aligned}$$

Next, observing the bound $\|\omega(e_1) + B(y)U_1^*\| \le \delta_1(\hat{E})$, which is similar to (22), and applying the Cauchy–Schwarz inequality, $\dot{J}_{1HJB}$ is upper bounded according to

$$\dot{J}_{1HJB} \le -\tilde{\zeta}^T T\,\tilde{\zeta} + \gamma_2\, J_{1E}^T(\hat{E})\Big(\omega(e_1) - \tfrac{1}{2}\, B(y) R_1^{-1} B(y)^T \nabla_{\hat{E}}^T\varphi_1(\hat{E})\,\hat{\Theta}_1\Big) - \frac{\gamma_2}{2}\,\Sigma_1(\hat{E},\hat{U}_1)\,\tilde{\Theta}_1^T \nabla_{\hat{E}}\varphi_1(\hat{E})\, B(y) R_1^{-1} B(y)^T J_{1\hat{E}}^T(\hat{E}) - \frac{\gamma_1 \Gamma_1}{\rho_1^2}\,\|\tilde{\Theta}_1\|^4 + \gamma_1\,\eta_1(\varepsilon) + \frac{\gamma_1 \Gamma_2}{\rho_1^2}\,\delta_1^4(\hat{E}), \tag{A.3}$$

with $\Gamma_1 = \nabla\varphi_{1\min}^4\, D_{1\min}^2/64$, $\Gamma_2 = 1024/D_{1\min}^2 + 3/2$, and $\eta_1(\varepsilon) = 64\, D_{1\max}^4\,\varepsilon_{1M}'^2/D_{1\min}^2 + 3\varepsilon_{1M}^2/2 + \varepsilon_{1M}'^4 D_{1\max}^2/2$, where $0 < \nabla\varphi_{1\min} \le \|\nabla\varphi_1(\hat{E})\|$ is ensured by $\|\hat{E}\| > 0$ for a constant $\nabla\varphi_{1\min}$. Now, the cases of $\Sigma_1(\hat{E},\hat{U}_1) = 0$ and $\Sigma_1(\hat{E},\hat{U}_1) = 1$ will be considered.

Case 1
For $\Sigma_1(\hat{E},\hat{U}_1) = 0$, the first term in (A.3) is less than zero by the definition of the operator in (32). Recalling $\delta_1(\hat{E}) \triangleq \sqrt{K_1}\,\|J_{1\hat{E}}\|$ and observing $\|1/\rho_1\| \le 1$, (A.3) is rewritten as

$$\dot{J}_{1HJB} \le -\tilde{\zeta}^T T\,\tilde{\zeta} - \big(\gamma_2\,\dot{\hat{E}}_{\min} - 2\gamma_1\sqrt{K_1}\big)\,\big\|J_{1\hat{E}}(\hat{E})\big\| - \frac{\gamma_1\Gamma_1}{\rho_1^2}\,\|\tilde{\Theta}_1\|^4 + \gamma_1\,\eta_1(\varepsilon), \tag{A.4}$$

and (A.4) is less than zero provided $\gamma_2/\gamma_1 > 2\sqrt{K_1}/\dot{\hat{E}}_{\min}$ and the following inequalities hold:

$$\big\|J_{1\hat{E}}(\hat{E})\big\| > \frac{\gamma_1\,\eta_1(\varepsilon)}{\gamma_2\,\dot{\hat{E}}_{\min} - 2\gamma_1\sqrt{K_1}} \triangleq b_{1JE0}, \;\text{or}\; \|\tilde{\Theta}_1\| > \sqrt[4]{\eta_1(\varepsilon)/\Gamma_1} \triangleq b_{1\Theta 0}, \;\text{or}\; \|\tilde{\zeta}\| > \sqrt{\gamma_1\,\eta_1(\varepsilon)/T_{\min}} \triangleq b_{1\zeta}. \tag{A.5}$$
Case 2
Next, consider the case of $\Sigma_1(\hat{E},\hat{U}_1) = 1$, which implies that the OLA-based input $\hat{U}_1 = -\tfrac{1}{2}\, R_1^{-1} B(y)^T \nabla_{\hat{E}}^T\varphi_1(\hat{E})\,\hat{\Theta}_1$ may not be stabilizing. To begin, add and subtract $\gamma_2\, J_{1E}^T(\hat{E})\, D_1\big(\nabla_{\hat{E}}\varphi^T(\hat{E})\,\Theta_1 + \nabla_{\hat{E}}\varepsilon\big)/2$ in (A.3) to obtain

$$\dot{J}_{1HJB} \le -\tilde{\zeta}^T T\,\tilde{\zeta} + \gamma_2\, J_{1E}^T(\hat{E})\big(\omega(e_1) + B(y)U_1^*\big) + \frac{\gamma_2}{2}\, J_{1E}^T(\hat{E})\, D_1 \nabla_{\hat{E}}\varepsilon - \frac{\gamma_1\Gamma_1}{\rho_1^2}\,\|\tilde{\Theta}_1\|^4 + \gamma_1\,\eta_1(\varepsilon) + \frac{\gamma_1\Gamma_2\, K_1^2}{\rho_1^2}\,\big\|J_{1\hat{E}}\big\|^4.$$

Next, using Lemma 2 and recalling the boundedness of $D_1$, $\dot{J}_{1HJB}$ is rewritten as

$$\dot{J}_{1HJB} \le -\tilde{\zeta}^T T\,\tilde{\zeta} - \gamma_2\,\bar{Q}_{1\min}\,\big\|J_{1\hat{E}}(\hat{E})\big\|^2 + \frac{\gamma_2}{2}\, D_{1\max}\,\varepsilon'_{1M}\,\big\|J_{1\hat{E}}(\hat{E})\big\| + \frac{\gamma_1\Gamma_2\, K_1^2}{\rho_1^2} - \frac{\gamma_1\Gamma_1}{\rho_1^2}\,\|\tilde{\Theta}_1\|^4 + \gamma_1\,\eta_1(\varepsilon),$$

where $\bar{Q}_{1\min} > 0$ satisfies $\bar{Q}_{1\min} \le \|\bar{Q}_1(\hat{E})\|$ and is ensured by $\|\hat{E}\| > 0$. As a final step, completing the square with respect to $\|J_{1\hat{E}}(\hat{E})\|^2$ reveals

$$\dot{J}_{1HJB} \le -\tilde{\zeta}^T T\,\tilde{\zeta} - \frac{\gamma_2}{2}\,\bar{Q}_{1\min}\,\big\|J_{1\hat{E}}(\hat{E})\big\|^2 - \frac{\gamma_1\Gamma_1}{\rho_1^2}\,\|\tilde{\Theta}_1\|^4 + \gamma_1\,\eta_1(\varepsilon) + \frac{\gamma_2\, D_{1\max}^2\,\varepsilon_{1M}'^2}{4\bar{Q}_{1\min}} + \frac{\gamma_1\Gamma_2\, K_1^2}{\rho_1^2}, \tag{A.6}$$

and $\dot{J}_{1HJB} < 0$ provided the following inequalities hold:

$$\big\|J_{1\hat{E}}(\hat{E})\big\| > \sqrt{\frac{D_{1\max}^2\,\varepsilon_{1M}'^2}{2\bar{Q}_{1\min}^2}} \triangleq b_{1JE1}, \;\text{or}\; \|\tilde{\Theta}_1\| > \sqrt[4]{\frac{\eta_1(\varepsilon) + \Gamma_2 K_1^2}{\Gamma_1}} \triangleq b_{1\Theta 1}, \;\text{or}\; \|\tilde{\zeta}\| > \sqrt{\frac{\gamma_1\,\eta_1(\varepsilon) + \gamma_1\Gamma_2 K_1^2}{T_{\min}}} \triangleq b_{1\zeta 1}. \tag{A.7}$$

According to standard Lyapunov extensions [20], the inequalities in (A.7) guarantee that $\dot{J}_{1HJB}$ is less than zero outside of a compact set. Thus, $\|J_{1E}(\hat{E})\|$ as well as the OLA parameter estimation error $\|\tilde{\Theta}_1\|$ remain bounded for the case $\Sigma_1(\hat{E},\hat{U}_1) = 1$. Recalling that the Lyapunov candidate $J_1(\hat{E})$ is radially unbounded and continuously differentiable (Lemma 1), the boundedness of $\|J_{1E}(\hat{E})\|$ implies the boundedness of the states $\|\hat{E}\|$.

The overall bounds for the cases $\Sigma_1(\hat{E},\hat{U}_1) = 0$ and $\Sigma_1(\hat{E},\hat{U}_1) = 1$ are given by $\|J_{1E}(\hat{E})\| \le b_{1JE}$ and $\|\tilde{\Theta}_1\| \le b_{1\Theta}$ for computable positive constants $b_{1JE} = \max(b_{1JE0},\, b_{1JE1})$ and $b_{1\Theta} = \max(b_{1\Theta 0},\, b_{1\Theta 1})$. Note that $b_{1JE0}$ and $b_{1\Theta 1}$ in (A.5) and (A.7), respectively, can be reduced through appropriate selection of $\gamma_1$ and $\gamma_2$. To complete the proof, subtract (23), (28), and (29) from (24) to obtain

$$V_1^*(\hat{E}) - \hat{V}_1(\hat{E}) = \tilde{\Theta}_1^T\varphi_1(\hat{E}) + \varepsilon_1(x),$$
$$U_1^* - \hat{U}_1 = -\tfrac{1}{2}\, R_1^{-1} B(y)^T \nabla_{\hat{E}}\varphi_1^T(\hat{E})\,\tilde{\Theta}_1 - \tfrac{1}{2}\, R_1^{-1} B(y)^T \nabla_{\hat{E}}\varepsilon_1(\hat{E}).$$
Next, observing that the boundedness of the system states ensures the existence of positive constants $\varphi_{1M}$ and $\varphi'_{1M}$ such that $\|\varphi_1\| \le \varphi_{1M}$ and $\|\nabla_{\hat{E}}\varphi_1\| \le \varphi'_{1M}$, respectively, and taking norms and the limit as $t \to \infty$ when $\Sigma_1(\hat{E}, U_1) = 0$ reveals

$$\|V_1^* - \hat{V}_1\| \le \|\tilde{\Theta}_1\|\,\|\varphi_1(\hat{E})\| + \varepsilon_{1M} \le b_{1\Theta}\,\varphi_{1M} + \varepsilon_{1M} \triangleq \bar{\varepsilon}_{r1},$$
$$\|U_1^*(x) - \hat{U}_1(x)\| \le \tfrac{1}{2}\,\lambda_{\max}\big(R_1^{-1}\big)\, B_{1M}\big(b_{1\Theta}\,\varphi'_{1M} + \varepsilon'_{1M}\big) \triangleq \bar{\varepsilon}_{r2}. \qquad \blacksquare$$

ACKNOWLEDGEMENT
This research was supported in part by NSF grants ECCS 0624644 and ECCS 0901562.

REFERENCES

1. Khalil HK. Nonlinear Systems (3rd edn). Prentice-Hall: Upper Saddle River, NJ, 2002.
2. Narendra KS, Annaswamy AM. Stable Adaptive Systems. Prentice-Hall: Englewood Cliffs, NJ, 1989.
3. Lewis FL, Syrmos VL. Optimal Control (2nd edn). Wiley: Hoboken, NJ, 1995.
4. Beard R, Saridis G, Wen J. Improving the performance of stabilizing controls for nonlinear systems. IEEE Control Systems Magazine 1996; 16(5):27–35.
5. Dierks T, Jagannathan S. Optimal control of affine nonlinear continuous-time systems. Proceedings of the American Control Conference, Baltimore, MD, USA, 2010; 1568–1573.
6. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009; 45(2):477–484.
7. Watkins CH. Learning from Delayed Rewards. PhD Dissertation, University of Cambridge, 1989.
8. Lewis FL, Vrabie D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine 2009; 9(3):32–50.
9. Vrabie D, Vamvoudakis K, Lewis FL. Adaptive optimal controllers based on generalized policy iteration in a continuous-time framework. Proceedings of the IEEE Mediterranean Conference on Control and Automation, Thessaloniki, Greece, 2009; 1402–1409.
10. Bhasin S, Sharma N, Patre P, Dixon WE. Asymptotic tracking by a reinforcement learning-based adaptive critic controller. Journal of Control Theory and Applications 2011; 9(3):400–409.
11. Dierks T, Jagannathan S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Transactions on Neural Networks and Learning Systems 2012; 23(7):1118–1129.
12. Xu X, Hou Z, Lian C, He H. Online learning control using adaptive critic designs with sparse kernel machines. IEEE Transactions on Neural Networks and Learning Systems 2013; 24(5):762–775.
13. Vamvoudakis KG, Lewis FL. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010; 46(5):878–888.
14. Zhang H, Cui L, Zhang X, Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks 2011; 22(12):2226–2236.
15. Krstic M, Kanellakopoulos I, Kokotovic P. Nonlinear and Adaptive Control Design. John Wiley and Sons: New York, NY, 1995.
16. Ge SS, Wang C. Direct adaptive NN control of a class of nonlinear systems. IEEE Transactions on Neural Networks 2002; 13(1):214–221.
17. Wang D, Huang J. Neural network-based adaptive dynamic surface control for a class of uncertain nonlinear systems in strict-feedback form. IEEE Transactions on Neural Networks 2005; 16(1):195–202.
18. Zhang T, Ge SS, Hang CC. Adaptive neural network control for strict-feedback nonlinear systems using backstepping design. Automatica 2000; 36:1835–1846.
19. Li ZH, Krstic M. Optimal design of adaptive tracking controllers for nonlinear systems. Automatica 1997; 33(8):1459–1473.
20. Lewis FL, Jagannathan S, Yesildirek A. Neural Network Control of Robot Manipulators and Nonlinear Systems. Taylor and Francis: Philadelphia, PA, 1999.
21. Lewis FL, Vamvoudakis KG. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2011; 41(1):14–25.
22. Xu H, Jagannathan S, Lewis FL. Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses. Automatica 2012; 48(6):1017–1030.
23. Wang C, Hill DJ. Learning from neural control. IEEE Transactions on Neural Networks 2006; 17(1):130–146.