
Adaptive Dynamic Programming for a Class of Complex-Valued Nonlinear Systems

Ruizhuo Song, Member, IEEE, Wendong Xiao, Senior Member, IEEE, Huaguang Zhang, Senior Member, IEEE, and Changyin Sun, Member, IEEE

Abstract—In this brief, an optimal control scheme based on adaptive dynamic programming (ADP) is developed to solve infinite-horizon optimal control problems of continuous-time complex-valued nonlinear systems. A new performance index function is established on the basis of the complex-valued state and control. Using system transformations, the complex-valued system is transformed into a real-valued one, which effectively overcomes the Cauchy–Riemann conditions. With the transformed system and the performance index function, a new ADP method is developed to obtain the optimal control law by using neural networks. A compensation controller is developed to compensate for the approximation errors of the neural networks. Stability properties of the nonlinear system are analyzed and convergence properties of the weights of the neural networks are presented. Finally, simulation results demonstrate the performance of the developed optimal control scheme for complex-valued nonlinear systems.

Index Terms—Adaptive critic designs, adaptive dynamic programming (ADP), approximate dynamic programming, complex-valued systems, dynamic programming, optimal control, neural networks.

I. INTRODUCTION

In many scientific problems and engineering applications, the parameters and signals are complex-valued [1], [2], such as quantum systems [3] and complex-valued neural networks [4]. In [5], complex-valued filters were proposed for complex signals and systems. In [6], a complex-valued B-spline neural network was proposed to model the complex-valued Wiener system. In [7], a complex-valued pipelined recurrent neural network was proposed for nonlinear adaptive prediction of complex nonlinear and nonstationary signals. In [8], the output feedback stabilization of complex-valued reaction-advection-diffusion systems was studied. In [4], the global asymptotic stability of delayed complex-valued recurrent neural networks

was studied. In [9], a reinforcement learning algorithm with complex-valued functions was proposed. In the investigation of complex-valued systems, many system designers want to find the optimal value of the complex-valued parameters or controlled variables by optimizing a chosen performance index function [10].

During the past decades, adaptive dynamic programming (ADP), proposed by Werbos [11], [12], has demonstrated a powerful capability to find the optimal control law and solve the Hamilton-Jacobi-Bellman (HJB) equation forward in time [13]–[21]. Several synonyms are used for ADP, including adaptive critic designs [22]–[25], adaptive dynamic programming [26], [27], approximate dynamic programming [28], [29], neuro-dynamic programming [30], [31], and reinforcement learning [32], [33]. Until now, ADP has successfully solved nonlinear zero-sum/nonzero-sum differential games [34], [35], optimal tracking control problems [36], [37], multiagent control problems [38], and so on.

From the previous discussion, it can be seen that existing optimal control schemes based on ADP are constrained to real-valued systems. In many real-world systems, however, the system states and controls are complex-valued [39]. As there exist inherent differences between real-valued systems and complex-valued ones, the ADP methods for real-valued systems cannot directly solve the optimal control problems of complex-valued systems. To the best of our knowledge, there are no discussions on ADP for complex-valued systems. Therefore, a novel ADP method for complex-valued systems is eagerly anticipated. This motivates our research.

In this brief, for the first time, an ADP-based optimal control scheme for complex-valued systems is developed. First, a new performance index function is defined based on the complex-valued state and control. Second, using system transformations, the complex-valued system is transformed into a real-valued one, where the Cauchy–Riemann conditions can be effectively avoided. Then, a new ADP method is developed to obtain the optimal control law of the nonlinear systems. Neural networks, including critic and action networks, are employed to implement the developed ADP method. A compensation control method is established to overcome the approximation errors of the neural networks. It is proven that the developed control scheme makes the closed-loop system uniformly ultimately bounded (UUB). It is also shown that the weights of the neural networks converge to a finite neighborhood of the optimal weights. Finally, simulation studies are given to show the effectiveness of the developed control scheme.


II. MOTIVATIONS AND PRELIMINARIES

Consider the following complex-valued nonlinear system:

\dot{z} = f(z) + g(z)u    (1)

where z \in C^n is the system state, f(z) \in C^n, and f(0) = 0. Let g(z) \in C^{n \times n} be a bounded input gain, that is, \|g(\cdot)\| \le \bar{g}, where \bar{g} is a positive constant, and \|\cdot\| is the two-norm, unless special declaration is given. Let u \in C^n be the control vector. Let z_0 be the initial state.

For system (1), the infinite-horizon performance index function is defined as

J(z) = \int_t^{\infty} \bar{r}(z(\tau), u(\tau))\, d\tau    (2)

where the utility function \bar{r}(z, u) = z^H Q_1 z + u^H R_1 u. Let Q_1 and R_1 be diagonal positive definite matrices. Let z^H and u^H denote the complex conjugate transposes of z and u, respectively.

The aim of this brief is to obtain the optimal control of the complex-valued nonlinear system (1). To achieve this purpose, the following assumptions are necessary.

Assumption 1 [4]: Let i^2 = -1, and then z = x + iy, where x \in R^n and y \in R^n. If C(z) \in C^n is a complex-valued function, then it can be expressed as C(z) = C^R(x, y) + iC^I(x, y), where C^R(x, y) \in R^n and C^I(x, y) \in R^n.

Assumption 2: Let f(z) = (f_1(z), \ldots, f_n(z))^T and f_j(z) = f_j^R(x, y) + if_j^I(x, y), j = 1, 2, \ldots, n. The partial derivatives of f_j(z) with respect to x and y satisfy

\left\| \frac{\partial f_j^R}{\partial x} \right\|_1 \le \lambda_j^{RR}, \quad \left\| \frac{\partial f_j^R}{\partial y} \right\|_1 \le \lambda_j^{RI}, \quad \left\| \frac{\partial f_j^I}{\partial x} \right\|_1 \le \lambda_j^{IR}, \quad \left\| \frac{\partial f_j^I}{\partial y} \right\|_1 \le \lambda_j^{II}

where \lambda_j^{RR}, \lambda_j^{RI}, \lambda_j^{IR}, and \lambda_j^{II} are positive constants. Let \|\cdot\|_1 stand for the one-norm.

According to the above preparations, the system transformation for system (1) will be given. Let f(z) = f^R(x, y) + if^I(x, y), g(z) = g^R(x, y) + ig^I(x, y), and u = u^R + iu^I. Then, system (1) can be written as

\dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)\left(u^R + iu^I\right).    (3)

Let

\eta = \begin{bmatrix} x \\ y \end{bmatrix}, \quad \nu = \begin{bmatrix} u^R \\ u^I \end{bmatrix}, \quad F(\eta) = \begin{bmatrix} f^R(x, y) \\ f^I(x, y) \end{bmatrix}

and

G(\eta) = \begin{bmatrix} g^R(x, y) & -g^I(x, y) \\ g^I(x, y) & g^R(x, y) \end{bmatrix}.

Then, we can obtain

\dot{\eta} = F(\eta) + G(\eta)\nu    (4)

where \eta \in R^{2n}, F(\eta) \in R^{2n}, G(\eta) \in R^{2n \times 2n}, and \nu \in R^{2n}. From (4) we can see that F(0) = 0.

Remark 1: The system transformations between system (1) and system (4) are equivalent and reversible, which can be seen in the following equations:

\dot{\eta} = F(\eta) + G(\eta)\nu
\Leftrightarrow \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} f^R(x, y) + g^R(x, y)u^R - g^I(x, y)u^I \\ f^I(x, y) + g^I(x, y)u^R + g^R(x, y)u^I \end{bmatrix}
\Leftrightarrow \dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)u^R + \left(ig^I(x, y) + g^R(x, y)\right)iu^I
\Leftrightarrow \dot{x} + i\dot{y} = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)\left(u^R + iu^I\right)
\Leftrightarrow \dot{z} = f(z) + g(z)u.

Therefore, if the optimal control for (4) is acquired, then the optimal control problem of system (1) is also solved. In the following section, the optimal control scheme of system (4) will be developed based on the continuous-time ADP method.

Let Q = diag(Q_1, Q_1) and R = diag(R_1, R_1). According to the definition of \bar{r}(z, u), the utility function can be expressed as

r(\eta, \nu) = \eta^T Q \eta + \nu^T R \nu.    (5)

According to (5), the performance index function (2) can be expressed as

J(\eta) = \int_t^{\infty} r(\eta(\tau), \nu(\tau))\, d\tau.    (6)

For an arbitrary admissible control law \nu, if the associated performance index function J(\eta) is given in (6), then an infinitesimal version of (6) is the so-called nonlinear Lyapunov equation [40]

0 = J_\eta^T (F(\eta) + G(\eta)\nu) + r(\eta, \nu)    (7)

where J_\eta = \partial J / \partial \eta is the partial derivative of the performance index function J. Define the optimal performance index function as

J^*(\eta) = \min_{\nu \in U} \int_t^{\infty} r(\eta(\tau), \nu(\eta(\tau)))\, d\tau    (8)

where U is a set of admissible control laws. Defining the Hamiltonian function as

H(\eta, \nu, J_\eta) = J_\eta^T (F(\eta) + G(\eta)\nu) + r(\eta, \nu)    (9)

the optimal performance index function J^*(\eta) satisfies

0 = \min_{\nu \in U} \{ H(\eta, \nu, J_\eta^*) \}.    (10)

According to (9) and (10), the optimal control law can be expressed as

\nu^*(\eta) = -\frac{1}{2} R^{-1} G^T(\eta) J_\eta^*(\eta).    (11)

Remark 2: In this brief, the system transformations between (1) and (4) are necessary. We should point out that the optimal control cannot be obtained from (1) and (2) directly. For example, if the optimal control is calculated from (1) and (2), then, according to the Bellman optimality principle, we have u = -\frac{1}{2} R_1^{-1} g^H(z) J_z(z). Let z = x + iy and J = J^R + iJ^I. The partial derivative J_z = \partial J / \partial z exists only if the Cauchy–Riemann conditions are satisfied, that is, \partial J^R / \partial x = \partial J^I / \partial y and \partial J^R / \partial y = -\partial J^I / \partial x. As the performance index function J is in the real domain, J^I = 0, which means \partial J^R / \partial x = \partial J^I / \partial y = 0 and \partial J^R / \partial y = -\partial J^I / \partial x = 0. Therefore \partial J / \partial z = \partial J^R / \partial x + i\partial J^I / \partial x = \partial J^I / \partial y - i\partial J^R / \partial y = 0. Thus the optimal control of the complex-valued system (1) is u = 0, which is obviously invalid. If the complex-valued system (1) is transformed into (4), then the Cauchy–Riemann conditions are avoided. Therefore, the optimal control of system (1) can be obtained from the transformed system (4) and the performance index function (6). In the next section, the ADP-based optimal control method will be given in detail.
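To make the transformation in (3) and (4) concrete, the following minimal Python sketch builds η, F(η), and G(η) from user-supplied complex-valued f and g and checks numerically that (4) reproduces (1). The particular f, g, z, and u below are hypothetical placeholders chosen only to make the snippet self-contained; they are not the systems studied in this brief.

```python
import numpy as np

def to_real_state(z):
    """Stack real and imaginary parts: eta = [x; y] for z = x + iy."""
    return np.concatenate([z.real, z.imag])

def real_dynamics(f, g, z):
    """Return F(eta) and G(eta) of (4) from the complex-valued f(z), g(z) of (1)."""
    fz = f(z)                      # f(z) in C^n
    gz = g(z)                      # g(z) in C^{n x n}
    F = np.concatenate([fz.real, fz.imag])
    G = np.block([[gz.real, -gz.imag],
                  [gz.imag,  gz.real]])
    return F, G

# Hypothetical example data (placeholders, not from the brief).
f = lambda z: -z + 0.1j * z**2           # satisfies f(0) = 0
g = lambda z: np.diag(1.0 + 1.0j * z)    # placeholder input gain

z = np.array([0.3 - 0.2j, -0.1 + 0.4j])
u = np.array([0.05 + 0.02j, -0.03 + 0.01j])

# Complex-valued right-hand side of (1).
zdot = f(z) + g(z) @ u

# Real-valued right-hand side of (4).
F, G = real_dynamics(f, g, z)
nu = np.concatenate([u.real, u.imag])
eta_dot = F + G @ nu

# The two descriptions agree: eta_dot = [Re(zdot); Im(zdot)].
assert np.allclose(eta_dot, to_real_state(zdot))
```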

III. ADP-BASED OPTIMAL CONTROL DESIGN

In this section, neural networks are introduced to implement the optimal control method. Let the number of hidden layer neurons of the neural network be L. Let the weight matrix between the input layer and the hidden layer be Y. Let the weight matrix between the hidden layer and the output layer be W, and let the input vector be X. Then, the output of the neural network is represented as F_N(X, Y, W) = W^T \sigma(YX), where \sigma(YX) is the activation function. For convenience of analysis, only the output weight W is updated, while the hidden weight is kept fixed [41]. Hence, the neural network function can be simplified to F_N(X, W) = W^T \bar{\sigma}(X), where \bar{\sigma}(X) = \sigma(YX). There are two neural networks in the ADP method, namely the critic and action networks. In the following subsections, the detailed design methods for the critic and action networks are given.

A. Critic Network

The critic network is utilized to approximate the performance index function J(z), and the ideal critic network is expressed as J(z) = \bar{W}_c^H \bar{\varphi}_c(z) + \varepsilon_c, where \bar{W}_c \in C^{n_{c1}} is the ideal critic network weight matrix. Let \bar{\varphi}_c(z) \in C^{n_{c1}} be the activation function and let \varepsilon_c \in R be the approximation error of the critic network. Let \bar{W}_c = \bar{W}_c^R + i\bar{W}_c^I and \bar{\varphi}_c = \bar{\varphi}_c^R + i\bar{\varphi}_c^I. Then, we have

J(\eta) = \left(\bar{W}_c^R + i\bar{W}_c^I\right)^T \left(\bar{\varphi}_c^R + i\bar{\varphi}_c^I\right) + \varepsilon_c = \bar{W}_c^{RT}\bar{\varphi}_c^R + \bar{W}_c^{IT}\bar{\varphi}_c^I + i\left(\bar{W}_c^{IT}\bar{\varphi}_c^R - \bar{W}_c^{RT}\bar{\varphi}_c^I\right) + \varepsilon_c.    (12)

Let W_c = \begin{bmatrix} \bar{W}_c^R \\ \bar{W}_c^I \end{bmatrix} and \varphi_c = \begin{bmatrix} \bar{\varphi}_c^R \\ \bar{\varphi}_c^I \end{bmatrix}. As J(\eta) is a real-valued function, we can get

J(\eta) = W_c^T \varphi_c(\eta) + \varepsilon_c.    (13)

Thus the derivative of J(\eta) is written as

J_\eta = \nabla\varphi_c^T(\eta) W_c + \nabla\varepsilon_c    (14)

where \nabla\varphi_c(\eta) = \partial\varphi_c(\eta)/\partial\eta and \nabla\varepsilon_c = \partial\varepsilon_c/\partial\eta. According to (13), Hamiltonian function (9) can be expressed as

H(\eta, \nu, W_c) = W_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu) - \varepsilon_H    (15)

where \varepsilon_H is the residual error, which is expressed as \varepsilon_H = -\nabla\varepsilon_c^T (F + G\nu). Let \hat{W}_c be the estimate of W_c, and then the output of the critic network is

\hat{J}(\eta) = \hat{W}_c^T \varphi_c(\eta).    (16)

Hence, Hamiltonian function (9) can be approximated by

H(\eta, \nu, \hat{W}_c) = \hat{W}_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu).    (17)

Then, we can define the weight estimation error of the critic network as

\tilde{W}_c = W_c - \hat{W}_c.    (18)

Note that, for a fixed admissible control law \nu, Hamiltonian function (9) becomes H(\eta, \nu, J_\eta) = 0, which means H(\eta, \nu, W_c) = 0. Therefore, from (15), we have

\varepsilon_H = W_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu).    (19)

Let e_c = H(\eta, \nu, \hat{W}_c) - H(\eta, \nu, W_c). We can obtain

e_c = -\tilde{W}_c^T \nabla\varphi_c(\eta)(F(\eta) + G(\eta)\nu) + \varepsilon_H.    (20)

It is desired to select \hat{W}_c to minimize the squared residual error E_c = \frac{1}{2} e_c^T e_c. A normalized gradient descent algorithm is used to train the critic network [40]. Then, the weight update rule of the critic network is derived as

\dot{\hat{W}}_c = -\alpha_c \frac{\partial E_c}{\partial \hat{W}_c} = -\alpha_c \frac{\xi_1 \left(\xi_1^T \hat{W}_c + r(\eta, \nu)\right)}{\left(\xi_1^T \xi_1 + 1\right)^2}    (21)

where \alpha_c > 0 is the learning rate of the critic network and \xi_1 = \nabla\varphi_c (F + G\nu). This is a modified Levenberg-Marquardt algorithm, that is, (\xi_1^T \xi_1 + 1) is replaced by (\xi_1^T \xi_1 + 1)^2, which is used for normalization and will be required in the proofs [40]. Let \xi_2 = \xi_1/\xi_3 and \xi_3 = \xi_1^T \xi_1 + 1. We have

\dot{\tilde{W}}_c = \alpha_c \frac{\xi_1 \left(\xi_1^T \hat{W}_c + r\right)}{\xi_3^2} = -\alpha_c \xi_2 \xi_2^T \tilde{W}_c + \alpha_c \frac{\xi_1 \left(\xi_1^T W_c + r\right)}{\xi_3^2} = -\alpha_c \xi_2 \xi_2^T \tilde{W}_c + \alpha_c \xi_2 \frac{\varepsilon_H}{\xi_3}.    (22)
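To illustrate the critic update (21) and its role in the error dynamics (22), the following Python sketch performs one normalized-gradient-descent step. The quadratic feature map, the learning rate, and all numerical values are hypothetical placeholders used only to make the snippet self-contained; they are not the activation functions used later in the simulations.

```python
import numpy as np

def critic_features(eta):
    """Hypothetical critic activation phi_c(eta) and its Jacobian d phi_c / d eta."""
    phi = np.array([eta[0]**2, eta[0]*eta[1], eta[1]**2])
    grad = np.array([[2*eta[0], 0.0],
                     [eta[1],   eta[0]],
                     [0.0,      2*eta[1]]])
    return phi, grad

def critic_update(Wc_hat, eta, eta_dot, r, alpha_c=0.01):
    """One step of the normalized gradient descent rule (21).

    eta_dot plays the role of F(eta) + G(eta)nu, so
    xi1 = grad(phi_c) @ eta_dot approximates d/dt phi_c(eta(t)).
    """
    _, grad = critic_features(eta)
    xi1 = grad @ eta_dot                      # xi_1 = grad(phi_c)(F + G nu)
    xi3 = xi1 @ xi1 + 1.0                     # xi_3 = xi_1^T xi_1 + 1
    e_c = xi1 @ Wc_hat + r                    # Hamiltonian residual
    Wc_dot = -alpha_c * xi1 * e_c / xi3**2    # right-hand side of (21)
    return Wc_hat + Wc_dot                    # Euler step, unit step size folded into alpha_c

# Hypothetical usage.
Wc_hat = np.zeros(3)
eta = np.array([0.5, -0.2])
eta_dot = np.array([-0.4, 0.1])               # measured or simulated F + G nu
r = eta @ eta + 0.01                           # utility r(eta, nu) with Q = I and a small control cost
Wc_hat = critic_update(Wc_hat, eta, eta_dot, r)
```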


B. Action Network

The action network is used to obtain the control law u. The ideal expression of the action network is u = \bar{W}_a^T \bar{\varphi}_a(z) + \bar{\varepsilon}_a, where \bar{W}_a \in C^{n_{a1} \times n} is the ideal weight matrix of the action network. Let \bar{\varphi}_a(\eta) \in C^{n_{a1}} be the activation function and let \bar{\varepsilon}_a \in C^n be the approximation error of the action network. Let \bar{W}_a = \bar{W}_a^R + i\bar{W}_a^I, \bar{\varphi}_a = \bar{\varphi}_a^R + i\bar{\varphi}_a^I, and \bar{\varepsilon}_a = \bar{\varepsilon}_a^R + i\bar{\varepsilon}_a^I. Let

W_a^T = \begin{bmatrix} \bar{W}_a^{RT} & -\bar{W}_a^{IT} \\ \bar{W}_a^{IT} & \bar{W}_a^{RT} \end{bmatrix}, \quad \varphi_a = \begin{bmatrix} \bar{\varphi}_a^R \\ \bar{\varphi}_a^I \end{bmatrix}, \quad \varepsilon_a = \begin{bmatrix} \bar{\varepsilon}_a^R \\ \bar{\varepsilon}_a^I \end{bmatrix}.

We have

\nu = W_a^T \varphi_a(\eta) + \varepsilon_a.    (23)

The output of the action network is

\hat{\nu}(\eta) = \hat{W}_a^T \varphi_a(\eta)    (24)

where \hat{W}_a is the estimation of W_a. We can define the output error of the action network as

e_a = \hat{W}_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat{W}_c.    (25)

The objective function to be minimized by the action network is defined as

E_a = \frac{1}{2} e_a^T e_a.    (26)

The weight update law for the action network weight is a gradient descent algorithm, which is given by

\dot{\hat{W}}_a = -\alpha_a \varphi_a \left(\hat{W}_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat{W}_c\right)^T    (27)

where \alpha_a is the learning rate of the action network. Define the weight estimation error of the action network as

\tilde{W}_a = W_a - \hat{W}_a.    (28)

Then, we have

\dot{\tilde{W}}_a = \alpha_a \varphi_a \left((W_a - \tilde{W}_a)^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T (W_c - \tilde{W}_c)\right)^T = \alpha_a \varphi_a \left(-\tilde{W}_a^T \varphi_a - \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \tilde{W}_c + W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T W_c\right)^T.    (29)

As \nu = -\frac{1}{2} R^{-1} G^T J_\eta, according to (14) and (23), we have

W_a^T \varphi_a + \varepsilon_a = -\frac{1}{2} R^{-1} G^T \nabla\varphi_c^T W_c - \frac{1}{2} R^{-1} G^T \nabla\varepsilon_c.    (30)

Thus, (29) can be rewritten as

\dot{\tilde{W}}_a = -\alpha_a \varphi_a \left(\tilde{W}_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \tilde{W}_c - \varepsilon_{12}\right)^T    (31)

where \varepsilon_{12} = -\varepsilon_a - \frac{1}{2} R^{-1} G^T \nabla\varepsilon_c.

C. Design of the Compensation Controller

In this subsection, a compensation controller is designed to overcome the approximation errors of the critic and action networks. Before the detailed design method, the following assumption is necessary.

Assumption 3: The approximation errors of the critic and action networks, that is, \varepsilon_c and \varepsilon_a, satisfy \|\varepsilon_c\| \le \varepsilon_{cM} and \|\varepsilon_a\| \le \varepsilon_{aM}. The residual error is upper bounded, that is, \|\varepsilon_H\| \le \varepsilon_{HM}. Here \varepsilon_{cM}, \varepsilon_{aM}, and \varepsilon_{HM} are positive numbers. The vector of activation functions of the action network satisfies \|\varphi_a\| \le \varphi_{aM}, where \varphi_{aM} is a positive number.

Define the compensation controller as

\nu_c = -\frac{K_c G^T \eta}{\eta^T \eta + b}    (32)

where K_c \ge \|G\|^2 \varepsilon_{aM}^2 (\eta^T \eta + b)/(2\eta^T G G^T \eta), and b > 0 is a constant. Then, the optimal control law can be expressed as

\nu_{all} = \hat{\nu} + \nu_c    (33)

where \nu_c is the compensation controller and \hat{\nu} is the output of the action network. Substituting (33) into (4), we can get

\dot{\eta} = F + G\hat{W}_a^T \varphi_a + G\nu_c.    (34)

As \hat{W}_a^T \varphi_a = W_a^T \varphi_a - \tilde{W}_a^T \varphi_a = \nu - \varepsilon_a - \tilde{W}_a^T \varphi_a, we can obtain

\dot{\eta} = F + G\nu - G\varepsilon_a - G\tilde{W}_a^T \varphi_a + G\nu_c.    (35)

In the next subsection, the stability analysis will be given.
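The sketch below illustrates, under assumed dimensions and placeholder values for G, R, and the feature maps, how the action-network update (27), the compensation term (32), and the combined control (33) fit together in code. It is an illustrative reading of the equations above, not the authors' implementation.

```python
import numpy as np

def actor_update(Wa_hat, Wc_hat, phi_a, grad_phi_c, G, R_inv, alpha_a=0.01):
    """One gradient step of (27): drive e_a = Wa_hat^T phi_a + 0.5 R^{-1} G^T grad_phi_c^T Wc_hat to zero."""
    e_a = Wa_hat.T @ phi_a + 0.5 * R_inv @ G.T @ grad_phi_c.T @ Wc_hat
    Wa_dot = -alpha_a * np.outer(phi_a, e_a)     # -alpha_a * phi_a * e_a^T
    return Wa_hat + Wa_dot

def compensator(eta, G, Kc, b):
    """Compensation controller (32): nu_c = -Kc G^T eta / (eta^T eta + b)."""
    return -Kc * (G.T @ eta) / (eta @ eta + b)

# Hypothetical sizes: two real states, two real controls, three actor features.
eta = np.array([0.4, -0.3])
G = np.array([[1.0, -0.5],
              [0.5,  1.0]])                      # stands in for G(eta)
R_inv = np.eye(2)                                # R = I
phi_a = np.array([eta[0], eta[1], eta[0]*eta[1]])
grad_phi_c = np.array([[2*eta[0], 0.0],
                       [eta[1],   eta[0]],
                       [0.0,      2*eta[1]]])    # Jacobian of a quadratic critic feature map
Wa_hat = np.zeros((3, 2))
Wc_hat = np.zeros(3)

Wa_hat = actor_update(Wa_hat, Wc_hat, phi_a, grad_phi_c, G, R_inv)
nu_all = Wa_hat.T @ phi_a + compensator(eta, G, Kc=3.0, b=1.02)   # combined control law (33)
```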

D. Stability Analysis

For continuous-time ADP methods, the signals need to be persistently exciting in the learning process [40], that is, the persistence of excitation assumption.

Assumption 4: Let the signal \xi_2 be persistently exciting over the interval [t, t + T], that is, there exist constants \beta_1 > 0, \beta_2 > 0, and T > 0 such that, for all t,

\beta_1 I \le \int_t^{t+T} \xi_2(\tau) \xi_2^T(\tau)\, d\tau \le \beta_2 I    (36)

holds.

Remark 3: This assumption makes system (4) sufficiently persistently excited for tuning the critic and action networks. The persistent excitation assumption ensures \xi_{2m} \le \|\xi_2\|, where \xi_{2m} is a positive number [40].

Before giving the main result, the following preparation works are presented.

Lemma 1: For \forall x \in R^n, we have

\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2.    (37)

Proof: Let x = (x_1, x_2, \ldots, x_n)^T. As \|x\|_2^2 = \sum_{i=1}^n |x_i|^2 \le \left(\sum_{i=1}^n |x_i|\right)^2 = \|x\|_1^2, we can get \|x\|_2 \le \|x\|_1. As \|x\|_1^2 = \left(\sum_{i=1}^n |x_i|\right)^2 \le n \sum_{i=1}^n |x_i|^2 = n\|x\|_2^2, we can obtain \|x\|_1 \le \sqrt{n}\,\|x\|_2.

Theorem 1: For system (1), if f(z) satisfies Assumptions 1 and 2, then we have

\|F(\eta) - F(\eta')\|_2 \le k\|\eta - \eta'\|_2    (38)

where \eta = \begin{bmatrix} x \\ y \end{bmatrix} and \eta' = \begin{bmatrix} x' \\ y' \end{bmatrix}. Let k_j^R = \max\{\lambda_j^{RR}, \lambda_j^{RI}\} and k_j^I = \max\{\lambda_j^{IR}, \lambda_j^{II}\}, j = 1, 2, \ldots, n. Let k' = \sum_{j=1}^n k_j^R + \sum_{j=1}^n k_j^I and k = k'\sqrt{2n}.

Proof: According to Assumption 2 and the mean value theorem for multivariable functions, we have

\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le \lambda_j^{RR}\|x - x'\|_1 + \lambda_j^{RI}\|y - y'\|_1.    (39)

According to the definition of the one-norm, we have \|\eta - \eta'\|_1 = \|x - x'\|_1 + \|y - y'\|_1, and

\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le k_j^R\|\eta - \eta'\|_1.    (40)


According to (40), we can get

\|f^R(x, y) - f^R(x', y')\|_1 \le \sum_{j=1}^n k_j^R \|\eta - \eta'\|_1.    (41)

Following the idea of (40) and (41), we can also obtain

\|f^I(x, y) - f^I(x', y')\|_1 \le \sum_{j=1}^n k_j^I \|\eta - \eta'\|_1.    (42)

Therefore, we can get

\|F(\eta) - F(\eta')\|_1 = \|f^R(x, y) - f^R(x', y')\|_1 + \|f^I(x, y) - f^I(x', y')\|_1 \le k'\|\eta - \eta'\|_1.    (43)

According to Lemma 1, we have

\|F(\eta) - F(\eta')\|_2 \le \|F(\eta) - F(\eta')\|_1    (44)

and

\|\eta - \eta'\|_1 \le \sqrt{2n}\,\|\eta - \eta'\|_2.    (45)

From (43)–(45), we can obtain \|F(\eta) - F(\eta')\|_2 \le k\|\eta - \eta'\|_2. The proof is completed.

The next theorems show that the estimation errors of the critic and action networks are UUB.

Theorem 2: Let the weights of the critic network \hat{W}_c be updated by (21). If the inequality \|(\xi_1^T/\xi_3)\tilde{W}_c\| > \varepsilon_{HM} holds, then the estimation error \tilde{W}_c converges exponentially to the set \|\tilde{W}_c\| \le \beta_1^{-1}\sqrt{\beta_2 T}\,(1 + 2\rho\beta_2\alpha_c)\varepsilon_{HM}, where \rho > 0.

Proof: Let \tilde{W}_c be defined as in (18). Define the following Lyapunov function candidate:

L = \frac{1}{2\alpha_c}\tilde{W}_c^T \tilde{W}_c.    (46)

The derivative of (46) is given by

\dot{L} = \tilde{W}_c^T \left(-\frac{\xi_1 \xi_1^T}{\xi_3^2}\tilde{W}_c + \frac{\xi_1 \varepsilon_H}{\xi_3^2}\right) \le -\left\|\frac{\xi_1^T}{\xi_3}\tilde{W}_c\right\| \left(\left\|\frac{\xi_1^T}{\xi_3}\tilde{W}_c\right\| - \left\|\frac{\varepsilon_H}{\xi_3}\right\|\right).    (47)

As \xi_3 \ge 1, we have \|\varepsilon_H/\xi_3\| < \varepsilon_{HM}. If \|(\xi_1^T/\xi_3)\tilde{W}_c\| > \varepsilon_{HM}, then we can get \dot{L} \le 0. That means L decreases and \|\xi_2^T \tilde{W}_c\| is bounded. In light of [42] and Technical Lemma 2 in [40], \|\tilde{W}_c\| \le \beta_1^{-1}\sqrt{\beta_2 T}\,(1 + 2\rho\beta_2\alpha_c)\varepsilon_{HM}.

Theorem 3: Let the optimal control law be expressed as in (33). The weight update laws of the critic and action networks are given as in (21) and (27), respectively. If there exist parameters l_2 and l_3 that satisfy

l_2 > \|G\|^2    (48)

and

l_3 > \max\left\{\frac{2k + 3}{\lambda_{\min}(Q)},\ \frac{\|G\|^2}{\lambda_{\min}(R)}\right\}    (49)

respectively, then the system state \eta in (4) is UUB and the weights of the critic and action networks, that is, \hat{W}_c and \hat{W}_a, converge to finite neighborhoods of the optimal ones.

Proof: The proof can be seen in Appendix A.

Theorem 4: Let the weight updating laws of the critic and the action networks be given by (21) and (27), respectively. If Theorem 3 holds, then the control law \nu_{all} converges to a finite neighborhood of the optimal control law \nu^*.

Proof: From Theorem 3, there exist \bar{\nu}_c > 0 and \bar{\tilde{W}}_a > 0 such that \lim_{t \to \infty}\|\nu_c\| \le \bar{\nu}_c and \lim_{t \to \infty}\|\tilde{W}_a\| \le \bar{\tilde{W}}_a. From (33), we have

\nu_{all} - \nu^* = \hat{\nu} + \nu_c - \nu^* = \hat{W}_a^T \varphi_a + \nu_c - W_a^T \varphi_a - \varepsilon_a = -\tilde{W}_a^T \varphi_a + \nu_c - \varepsilon_a.    (50)

Therefore, we have

\lim_{t \to \infty}\|\nu_{all} - \nu^*\| \le \bar{\tilde{W}}_a \varphi_{aM} + \bar{\nu}_c + \varepsilon_{aM}.    (51)

As \bar{\tilde{W}}_a \varphi_{aM} + \bar{\nu}_c + \varepsilon_{aM} is finite, we can obtain the conclusion. The proof is completed.

IV. SIMULATION STUDY

Example 1: Our first example is chosen from [3, Example 3] with modifications. Consider the following nonlinear complex-valued harmonic oscillator system:

\dot{z} = \frac{-2z^5 - z^{-1} - 2}{i(2z^2 - 1)} + 10(1 + i)u    (52)

where z \in C^1, z = z^R + iz^I, and u = u^R + iu^I. The utility function is defined as \bar{r}(z, u) = z^H Q_1 z + u^H R_1 u. Let Q_1 = E and R_1 = E, where E is the identity matrix with a suitable dimension. Let \eta = [z^R, z^I]^T and \nu = [u^R, u^I]^T. Let the critic and action networks be expressed as \hat{J}(\eta) = \hat{W}_c^T \varphi_c(Y_c\eta) and \hat{\nu}(\eta) = \hat{W}_a^T \varphi_a(Y_a\eta), where Y_c and Y_a are constant matrices with suitable dimensions. The activation functions of the critic and action networks are hyperbolic tangent functions [20]. The structures of the critic and action networks are 2-8-1 and 2-8-2, respectively. The initial weights of \hat{W}_c and \hat{W}_a are selected arbitrarily from (-0.1, 0.1). The learning rates of the critic and action networks are selected as \alpha_c = \alpha_a = 0.01. Let z_0 = -1 + i. Let K_c = 3 and b = 1.02. Implementing the developed ADP method for 40 time steps, the trajectories of the control and state are shown in Figs. 1 and 2, respectively.

Fig. 1. Control trajectories.

Fig. 2. State trajectories.

The weights of the critic and action networks converge to W_c = [-0.0904; 0.0989; -0.0586; 0.0214; -0.0304; 0.0435; -0.0943; -0.0866] and W_a = [-0.0269, 0.0117; 0.0197, -0.0548; 0.0336, -0.0790; 0.0789, -0.0980; -0.0825, -0.0881; 0.0078, -0.0354; -0.0143, 0.0558; 0.0234, -0.0329], respectively. From Fig. 2, we can see that the system state is UUB, which verifies the effectiveness of the developed ADP method.
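As a rough guide to reproducing this kind of experiment, the following self-contained Python sketch runs the whole online loop: Euler integration of (4), the critic rule (21), the action rule (27), and the compensator (32) with the combined control (33). The dynamics, feature maps, horizon, and step size are hypothetical placeholders and do not reproduce the exact setup of Example 1.

```python
import numpy as np

# Hypothetical real-coordinate system (4): eta = [x, y], nu = [u_R, u_I].
def F(eta):                      # placeholder drift with F(0) = 0
    x, y = eta
    return np.array([-x + 0.5 * y, -y - 0.5 * x])

def G(eta):                      # placeholder input gain, constant here
    return np.array([[1.0, -1.0],
                     [1.0,  1.0]])

def phi_c(eta):                  # quadratic critic features and their Jacobian
    x, y = eta
    return np.array([x*x, x*y, y*y]), np.array([[2*x, 0.0], [y, x], [0.0, 2*y]])

def phi_a(eta):                  # actor features
    x, y = eta
    return np.array([x, y, np.tanh(x*y)])

Q, R = np.eye(2), np.eye(2)
alpha_c = alpha_a = 0.01
Kc, b, dt = 3.0, 1.02, 0.01
Wc = np.zeros(3)
Wa = np.zeros((3, 2))
eta = np.array([-1.0, 1.0])

for _ in range(4000):
    pc, grad_pc = phi_c(eta)
    pa = phi_a(eta)
    Geta = G(eta)

    nu_hat = Wa.T @ pa                                   # action network output (24)
    nu_c = -Kc * (Geta.T @ eta) / (eta @ eta + b)        # compensation controller (32)
    nu = nu_hat + nu_c                                   # combined control (33)

    eta_dot = F(eta) + Geta @ nu                         # transformed system (4)
    r = eta @ Q @ eta + nu @ R @ nu                      # utility (5)

    xi1 = grad_pc @ eta_dot                              # xi_1 = grad(phi_c)(F + G nu)
    xi3 = xi1 @ xi1 + 1.0
    Wc += dt * (-alpha_c * xi1 * (xi1 @ Wc + r) / xi3**2)        # critic rule (21)

    e_a = Wa.T @ pa + 0.5 * np.linalg.inv(R) @ Geta.T @ grad_pc.T @ Wc
    Wa += dt * (-alpha_a * np.outer(pa, e_a))                    # action rule (27)

    eta = eta + dt * eta_dot                             # Euler integration step
```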

Example 2: In the second example, the effectiveness of the developed ADP method is justified on a complex-valued Chen system [43]. The system can be expressed as

\dot{z}_1 = -\mu z_1 + z_2(z_3 + \alpha) + iz_1 u_1
\dot{z}_2 = -\mu z_2 + z_1(z_3 - \alpha) + 10u_2
\dot{z}_3 = 1 - 0.5(\bar{z}_1 z_2 + z_1 \bar{z}_2) + u_3    (53)

where \mu = 0.8 and \alpha = 1.8. Let z = [z_1, z_2, z_3]^T \in C^3 and g(z) = diag(iz_1, 10, 1). Let \bar{z}_1 and \bar{z}_2 denote the complex conjugates of z_1 and z_2, respectively. Define the vectors \eta = [z_1^R, z_2^R, z_3^R, z_1^I, z_2^I, z_3^I]^T and \nu = [u_1^R, u_2^R, u_3^R, u_1^I, u_2^I, u_3^I]^T. The structures of the action and critic networks are 6-8-6 and 6-6-1, respectively. Let the training rules of the neural networks be the same as in Example 1. Let z_0 = [1 + i, 1 - i, 0.5]^T. Implementing the developed ADP method for 500 time steps, the trajectories of the control and state are shown in Figs. 3 and 4, respectively.

Fig. 3. Control trajectories.

Fig. 4. State trajectories.

From the simulation results, we can say that the ADP method developed in this brief is effective and feasible.

V. CONCLUSION

In this brief, for the first time, an optimal control scheme based on the ADP method for complex-valued systems has been developed. First, the performance index function is defined based on the complex-valued state and control. Then, system transformations are used to overcome the Cauchy–Riemann conditions. With the transformed system and the corresponding performance index function, a new ADP-based optimal control method is established. A compensation controller is presented to compensate for the approximation errors of the neural networks. Finally, simulation examples are given to show the effectiveness of the developed optimal control scheme.

APPENDIX A
PROOF OF THEOREM 3

Proof: Choose the Lyapunov function candidate as

V = V_1 + V_2 + V_3    (54)

where V_1 = \frac{1}{2\alpha_c}\tilde{W}_c^T \tilde{W}_c, V_2 = \frac{l_2}{2\alpha_a}\mathrm{tr}\{\tilde{W}_a^T \tilde{W}_a\}, and V_3 = \eta^T\eta + l_3 J(\eta) with l_2 > 0, l_3 > 0. Taking the derivative of the Lyapunov function candidate (54), we can get \dot{V} = \dot{V}_1 + \dot{V}_2 + \dot{V}_3. According to (22), we have

\dot{V}_1 = -(\tilde{W}_c^T\xi_2)^T \tilde{W}_c^T\xi_2 + (\tilde{W}_c^T\xi_2)^T \frac{\varepsilon_H}{\xi_3}.    (55)

According to (31), we can get

\dot{V}_2 = -l_2\,\mathrm{tr}\left\{\tilde{W}_a^T\varphi_a\left(\tilde{W}_a^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\tilde{W}_c - \varepsilon_{12}\right)^T\right\} = -l_2(\tilde{W}_a^T\varphi_a)^T\tilde{W}_a^T\varphi_a + l_2(\tilde{W}_a^T\varphi_a)^T\varepsilon_{12} - l_2(\tilde{W}_a^T\varphi_a)^T\frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\tilde{W}_c.    (56)

The derivative of V_3 can be expressed as

\dot{V}_3 = 2\eta^T F - 2\eta^T G\tilde{W}_a^T\varphi_a + 2\eta^T G\nu - 2\eta^T G\varepsilon_a + 2\eta^T G\nu_c + l_3(-r(\eta, \nu)).    (57)

From Theorem 1, we have 2\eta^T F \le 2k\|\eta\|^2. In addition, we can obtain

-2\eta^T G\tilde{W}_a^T\varphi_a \le \|\eta\|^2 + \|G\|^2(\tilde{W}_a^T\varphi_a)^T\tilde{W}_a^T\varphi_a
2\eta^T G\nu \le \|\eta\|^2 + \|G\|^2\|\nu\|^2
-2\eta^T G\varepsilon_a \le \|\eta\|^2 + \|G\|^2\varepsilon_{aM}^2.    (58)

From (32), we can get

2\eta^T G\nu_c = 2\eta^T G\left(-\frac{K_c G^T\eta}{\eta^T\eta + b}\right) \le -\|G\|^2\varepsilon_{aM}^2.    (59)

Then, (57) can be rewritten as

\dot{V}_3 \le (2k + 3 - l_3\lambda_{\min}(Q))\|\eta\|^2 + (\|G\|^2 - l_3\lambda_{\min}(R))\|\nu\|^2 + \|G\|^2(\tilde{W}_a^T\varphi_a)^T\tilde{W}_a^T\varphi_a.    (60)

Let Z = [\eta^T, \nu^T, (\tilde{W}_c^T\xi_2)^T, (\tilde{W}_a^T\varphi_a)^T]^T and N_V = [0, 0, \varepsilon_H/\xi_3, M_{V4}]^T, where M_{V4} = -\frac{l_2}{2}R^{-1}G^T\nabla\varphi_c^T\tilde{W}_c + l_2\varepsilon_{12} and M_V = \mathrm{diag}\{(l_3\lambda_{\min}(Q) - 2k - 3),\ (l_3\lambda_{\min}(R) - \|G\|^2),\ 1,\ (l_2 - \|G\|^2)\}. Thus, we have

\dot{V} \le -Z^T M_V Z + Z^T N_V \le -\|Z\|^2\lambda_{\min}(M_V) + \|Z\|\|N_V\|.    (61)

According to (48) and (49), we can see that if \|Z\| \ge \|N_V\|/\lambda_{\min}(M_V) \equiv Z_B, then the derivative of the Lyapunov candidate satisfies \dot{V} \le 0. As M_{V4} and \varepsilon_H/\xi_3 are both upper bounded, \|N_V\| is upper bounded. Therefore, the state \eta and the weight errors \tilde{W}_c and \tilde{W}_a are UUB [44]. The proof is completed.

REFERENCES

[1] T. Adali, P. Schreier, and L. Scharf, "Complex-valued signal processing: The proper way to deal with impropriety," IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5101–5125, Nov. 2011.
[2] T. Fang and J. Sun, "Stability analysis of complex-valued impulsive system," IET Control Theory Appl., vol. 7, no. 8, pp. 1152–1159, Aug. 2013.
[3] C. Yang, "Stability and quantization of complex-valued nonlinear quantum systems," Chaos, Solitons Fractals, vol. 42, no. 2, pp. 711–723, Oct. 2009.
[4] J. Hu and J. Wang, "Global stability of complex-valued recurrent neural networks with time-delays," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 853–865, Jun. 2012.
[5] S. Huang, C. Li, and Y. Liu, "Complex-valued filtering based on the minimization of complex-error entropy," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 695–708, May 2013.
[6] X. Hong and S. Chen, "Modeling of complex-valued Wiener systems using B-spline neural network," IEEE Trans. Neural Netw., vol. 22, no. 5, pp. 818–825, May 2011.
[7] S. Goh and D. Mandic, "Nonlinear adaptive prediction of complex-valued signals by complex-valued PRNN," IEEE Trans. Signal Process., vol. 53, no. 5, pp. 1827–1836, May 2005.
[8] S. Bolognani, A. Smyshlyaev, and M. Krstic, "Adaptive output feedback control for complex-valued reaction-advection-diffusion systems," in Proc. Amer. Control Conf., Seattle, WA, USA, Jun. 2008, pp. 961–966.
[9] T. Hamagami, T. Shibuya, and S. Shimada, "Complex-valued reinforcement learning," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Taipei, Taiwan, Oct. 2006, pp. 4175–4179.
[10] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Communications. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[11] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Syst. Yearbook, vol. 22, pp. 25–38, 1977.
[12] P. J. Werbos, "A menu of designs for reinforcement learning over time," in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA, USA: MIT Press, 1991, pp. 67–95.
[13] A. Heydari and S. N. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145–157, Jan. 2013.
[14] D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, "Adaptive critic learning techniques for engine torque and air-fuel ratio control," IEEE Trans. Syst., Man, Cybern., B, Cybern., vol. 38, no. 4, pp. 988–993, Aug. 2008.
[15] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Inf. Sci., vol. 220, pp. 331–342, Jan. 2013.
[16] D. Liu and Q. Wei, "Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems," IEEE Trans. Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013.
[17] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005.
[18] Z. Wang and D. Liu, "A data-based state feedback control method for a class of nonlinear systems," IEEE Trans. Ind. Inf., vol. 9, no. 4, pp. 2284–2292, Nov. 2013.
[19] Q. Wei and D. Liu, "An iterative ε-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state," Neural Netw., vol. 32, no. 6, pp. 236–244, Aug. 2012.
[20] Q. Wei and D. Liu, "A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems," IEEE Trans. Autom. Sci. Eng., 2014, doi: 10.1109/TASE.2013.2280974.

[21] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 762–775, May 2013.
[22] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[23] J. Liang, D. D. Molina, G. K. Venayagamoorthy, and R. G. Harley, "Two-level dynamic stochastic optimal power flow control for power systems with intermittent renewable generation," IEEE Trans. Power Syst., vol. 28, no. 3, pp. 2670–2678, Aug. 2013.
[24] Z. Ni, H. He, J. Wen, and X. Xu, "Goal representation heuristic dynamic programming on maze navigation," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2038–2050, Dec. 2013.
[25] Z. Ni, H. He, and J. Wen, "Adaptive learning in tracking control based on the dual critic network design," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013.
[26] Q. Wei and D. Liu, "Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification," IEEE Trans. Autom. Sci. Eng., 2014, doi: 10.1109/TIE.2014.2301770.
[27] M. Fairbank, E. Alonso, and D. Prokhorov, "An equivalence between adaptive dynamic programming with a critic and backpropagation through time," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 12, pp. 2088–2100, Dec. 2013.
[28] X. Xu, C. Lian, L. Zuo, and H. He, "Kernel-based approximate dynamic programming for real-time online learning control: An experimental study," IEEE Trans. Control Syst. Technol., vol. 22, no. 1, pp. 146–156, Jan. 2014.
[29] D. Molina, G. K. Venayagamoorthy, J. Liang, and R. G. Harley, "Intelligent local area signals based damping of power system oscillations using virtual generators and approximate dynamic programming," IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 498–508, Jan. 2013.
[30] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[31] H. Xu and S. Jagannathan, "Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 471–484, Mar. 2013.
[32] Z. Ni, H. He, and J. Wen, "Adaptive learning in tracking control based on the dual critic network design," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013.
[33] B. Xu, C. Yang, and Z. Shi, "Reinforcement learning output feedback NN control using deterministic learning technique," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 635–641, Feb. 2014.
[34] H. Zhang, Q. Wei, and D. Liu, "An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games," Automatica, vol. 47, no. 1, pp. 207–214, Jan. 2011.
[35] K. G. Vamvoudakis and F. L. Lewis, "Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations," Automatica, vol. 47, no. 8, pp. 1556–1569, Aug. 2011.
[36] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst., Man, Cybern., B, Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.
[37] Y. Huang and D. Liu, "Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm," Neurocomputing, vol. 125, pp. 46–56, Feb. 2014.
[38] K. Hengster-Movric, K. You, F. L. Lewis, and L. Xie, "Synchronization of discrete-time multi-agent systems on graphs using Riccati design," Automatica, vol. 49, no. 2, pp. 414–423, Feb. 2013.
[39] D. P. Mandic and V. S. L. Goh, Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. New York, NY, USA: Wiley, 2009.
[40] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878–888, May 2010.
[41] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118–1129, Apr. 2012.
[42] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[43] G. M. Mahmoud, S. A. Aly, and A. A. Farghaly, "On chaos synchronization of a complex two coupled dynamos system," Chaos, Solitons Fractals, vol. 33, pp. 178–187, Jul. 2007.
[44] F. L. Lewis, S. Jagannathan, and A. Yesildirek, Neural Network Control of Robot Manipulators and Nonlinear Systems. New York, NY, USA: Taylor & Francis, 1999.