Int. J. Appl. Math. Comput. Sci., 2013, Vol. 23, No. 3, 521–537. DOI: 10.2478/amcs-2013-0040

NONPARAMETRIC INSTRUMENTAL VARIABLES FOR IDENTIFICATION OF BLOCK-ORIENTED SYSTEMS

Grzegorz Mzyk
Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
e-mail: [email protected]

A combined, parametric-nonparametric identification algorithm for a special case of NARMAX systems is proposed. The parameters of individual blocks are aggregated in one matrix (including mixed products of parameters). The matrix is estimated by an instrumental variables technique, with the instruments generated by a nonparametric kernel method. Finally, the result is decomposed to obtain the parameters of the system elements. The consistency of the proposed estimate is proved and the rate of convergence is analyzed. Also, the form of the optimal instrumental variables is established and a method of their approximate generation is proposed. The idea of nonparametric generation of instrumental variables guarantees that the IV estimate is well defined, improves the behaviour of the least-squares method, and allows reducing the estimation error. The method is simple in implementation and robust to correlated noise.

Keywords: system identification, instrumental variables, NARMAX system, Hammerstein system, Wiener system, Lur'e system, nonparametric methods.
1. Problem statement
1.1. System. The paper considers the problem of identification of a scalar, discrete-time, asymptotically stable nonlinear dynamic system, shown in Fig. 1 and described by the following equation (cf. Bai, 1998):
$$y_k = \sum_{j=1}^{p} \lambda_j \eta(y_{k-j}) + \sum_{i=0}^{n} \gamma_i \mu(u_{k-i}) + z_k, \quad (1)$$
[Fig. 1. Additive NARMAX system: the input $u_k$ drives the static nonlinearity $\mu(\cdot)$ (output $w_k$) followed by the linear FIR filter $\{\gamma_i\}_{i=0}^{n}$ (output $v_k$); the output $y_k$ is fed back through the static nonlinearity $\eta(\cdot)$ (output $w'_k$) and the linear FIR filter $\{\lambda_j\}_{j=1}^{p}$ (output $v'_k$); the noise $z_k$ enters additively at the output.]
where
$$\mu(u) = \sum_{t=1}^{m} c_t f_t(u), \qquad \eta(y) = \sum_{l=1}^{q} d_l g_l(y). \quad (2)$$
The structure is well known in the literature (see, e.g., Giri and Bai, 2010) and can be treated as a special case of the additive NARMAX model (Chen and Billings, 1989). The signals $y_k$, $u_k$ and $z_k$ are the output, the input and the noise, respectively. The system in Fig. 1 is more general than the Hammerstein system often met in the literature: the Hammerstein system is obtained when the function $\eta(\cdot)$ is linear (see Appendix A). It is also not equivalent to the Wiener–Hammerstein (sandwich) system widely considered in the literature, where two linear dynamic blocks surround one static nonlinearity. In spite of many possible applications in various domains (Haber and Keviczky, 1999; Bai, 1998; Zhang et al., 1996; Suykens et al., 1998; Sastry, 1999; Lu and Hill, 2007), relatively little attention has been paid to this structure in the literature.
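To make the setup concrete, the following sketch simulates one trajectory of the system (1)–(2). It is only an illustration: the basis functions, orders and parameter values below are assumptions chosen for the example, not taken from the paper, and the noise filter is truncated.

```python
import numpy as np

def ma_noise(N, rng, omega=(1.0, 0.5, 0.25), scale=0.05):
    # z_k = sum_i omega_i * eps_{k-i}: a (truncated) linear noise process, cf. Assumption 4
    eps = scale * rng.uniform(-1.0, 1.0, size=N)
    return np.convolve(eps, np.array(omega))[:N]

def simulate_narmax(N, gamma, lam, mu, eta, seed=0):
    # y_k = sum_j lam_j*eta(y_{k-j}) + sum_i gamma_i*mu(u_{k-i}) + z_k, cf. (1)
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, size=N)          # i.i.d. bounded input (Assumption 3)
    z = ma_noise(N, rng)                        # correlated output noise (Assumption 4)
    y = np.zeros(N)
    for k in range(N):
        s = z[k] + sum(g * mu(u[k - i]) for i, g in enumerate(gamma) if k - i >= 0)
        s += sum(l * eta(y[k - j - 1]) for j, l in enumerate(lam) if k - j - 1 >= 0)
        y[k] = s
    return u, y

# Illustrative blocks: mu(u) = u + 0.5*u^3, eta(y) = 0.4*arctan(y) (Lipschitz, eta(0) = 0)
u, y = simulate_narmax(2000, gamma=[1.0, 0.5], lam=[0.3],
                       mu=lambda v: v + 0.5 * v**3,
                       eta=lambda v: 0.4 * np.arctan(v))
```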
1.2. Assumptions. The following assumptions are made.

Assumption 1. The static nonlinear characteristics are of a given parametric form,
$$\mu(u) = \sum_{t=1}^{m} c_t f_t(u), \qquad \eta(y) = \sum_{l=1}^{q} d_l g_l(y), \quad (3)$$
where $f_1(\cdot),\dots,f_m(\cdot)$ and $g_1(\cdot),\dots,g_q(\cdot)$ are a priori known, linearly independent basis functions such that
$$|f_t(u)| \le p_{\max}, \quad (4)$$
$$|g_l(y)| \le p_{\max}, \quad (5)$$
for some constant $p_{\max}$.

Assumption 2. The linear dynamic blocks have finite impulse responses, i.e.,
$$v_k = \sum_{i=0}^{n} \gamma_i w_{k-i}, \quad (6)$$
$$v'_k = \sum_{j=1}^{p} \lambda_j w'_{k-j}, \quad (7)$$
with known orders $n$ and $p$.

Assumption 3. The input process $\{u_k\}$ is a sequence of i.i.d. bounded random variables, i.e., there exists an (unknown) $u_{\max}$ such that $|u_k| < u_{\max} < \infty$.

Assumption 4. The output noise $\{z_k\}$ is a correlated linear process. It can be written as
$$z_k = \sum_{i=0}^{\infty} \omega_i \varepsilon_{k-i}, \quad (8)$$
where $\{\varepsilon_k\}$ is some unknown zero-mean ($E\varepsilon_k = 0$), bounded ($|\varepsilon_k| < \varepsilon_{\max} < \infty$) i.i.d. process, independent of the input $\{u_k\}$, and $\{\omega_i\}_{i=0}^{\infty}$ ($\sum_{i=0}^{\infty} |\omega_i| < \infty$) is an unknown stable linear filter.

Assumption 5. The overall system is asymptotically stable.

Assumption 6. Only the input $\{u_k\}$ and the output of the whole system $\{y_k\}$ are accessible for measurements.

Let
$$\Lambda = (\lambda_1, \dots, \lambda_p)^T, \quad \Gamma = (\gamma_0, \dots, \gamma_n)^T, \quad c = (c_1, \dots, c_m)^T, \quad d = (d_1, \dots, d_q)^T \quad (9)$$
denote the true (unknown) parameters of the system.

Obviously, the input-output description of the system given by (1) and (2) is not unique. For each pair of constants $\alpha$ and $\beta$, the systems with parameters $\Lambda, \Gamma, c, d$ and $\beta\Lambda, \alpha\Gamma, c/\alpha, d/\beta$ cannot be distinguished, i.e., they are equivalent (see (1)–(2)). For the uniqueness of the solution, the following technical conditions are introduced (see Bai, 1998):

(a) the matrices $\Theta_{\Lambda d} = \Lambda d^T$ and $\Theta_{\Gamma c} = \Gamma c^T$ are not both zero;

(b) $\|\Lambda\|_2 = 1$ and $\|\Gamma\|_2 = 1$, where $\|\cdot\|_2$ is the Euclidean vector norm;

(c) the first non-zero elements of $\Lambda$ and $\Gamma$ are positive.

Let
$$\theta = (\gamma_0 c_1, \dots, \gamma_0 c_m, \dots, \gamma_n c_1, \dots, \gamma_n c_m, \lambda_1 d_1, \dots, \lambda_1 d_q, \dots, \lambda_p d_1, \dots, \lambda_p d_q)^T \quad (10)$$
$$= (\theta_1, \dots, \theta_{(n+1)m}, \theta_{(n+1)m+1}, \dots, \theta_{(n+1)m+pq})^T \quad (11)$$
be the vector of aggregated parameters obtained by inserting (2) into (1), and let $\phi_k$ be the respective generalized input vector
$$\phi_k = (f_1(u_k), \dots, f_m(u_k), \dots, f_1(u_{k-n}), \dots, f_m(u_{k-n}), g_1(y_{k-1}), \dots, g_q(y_{k-1}), \dots, g_1(y_{k-p}), \dots, g_q(y_{k-p}))^T. \quad (12)$$

Thanks to the above notation, the description (1)–(2) can be simplified to the form $y_k = \phi_k^T \theta + z_k$, which means that the system remains linear with respect to the parameters. For $k = 1, \dots, N$, we obtain
$$Y_N = \Phi_N \theta + Z_N, \quad (13)$$
where $Y_N = (y_1, \dots, y_N)^T$, $\Phi_N = (\phi_1, \dots, \phi_N)^T$, and $Z_N = (z_1, \dots, z_N)^T$. The purpose of identification is to recover the parameters in $\Lambda$, $\Gamma$, $c$ and $d$ (given by (9)), using the input-output measurements $(u_k, y_k)$ ($k = 1, \dots, N$) of the whole system.

1.3. Comments on the assumptions. The representation (1) belongs to the class of so-called "equation-error" models, while in practical situations the more complicated case of "output-error" models is often met, i.e.,
$$\bar{y}_k = \sum_{j=1}^{p} \lambda_j \eta(\bar{y}_{k-j}) + \sum_{i=0}^{n} \gamma_i \mu(u_{k-i}), \qquad y_k = \bar{y}_k + \delta_k,$$
with zero-mean disturbance $\delta_k$. Since the resulting noise $z_k$ in (1) results from nonlinear filtering of $\delta_k$, it can be of a relatively high order and may have a non-zero mean. The first problem is omitted by making Assumption 7 in
Section 5. The second one can simply be solved by appending the constant function to the basis $f_1(\cdot),\dots,f_m(\cdot)$. To simplify the presentation, it was assumed that the input process, the nonlinear characteristics and the noise are bounded. In fact, since the further analysis assumes only finite fourth-order moments of all signals, the approach can be generalized in a straightforward way to Lipschitz nonlinearities and most popular finite-variance distributions of excitations. As regards the i.i.d. restriction imposed on the input process, it can be weakened, for invertible processes, e.g., by data pre-filtering and the use of specially designed instrumental variables in parameter identification (see Mzyk, 2013).
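The aggregation of (10)–(12) is purely mechanical; the sketch below builds $\theta$ and $\phi_k$ for user-supplied basis functions and orders (names and component ordering follow (10) and (12); the basis lists themselves are illustrative assumptions).

```python
import numpy as np

def aggregate_theta(gamma, lam, c, d):
    # theta of (10): mixed products (gamma_i * c_t), then (lambda_j * d_l)
    return np.array([gi * ct for gi in gamma for ct in c]
                    + [lj * dl for lj in lam for dl in d])

def regressor(u, y, k, f_basis, g_basis, n, p):
    # phi_k of (12): f_t(u_{k-i}) for i = 0..n, then g_l(y_{k-j}) for j = 1..p
    phi = [f(u[k - i]) for i in range(n + 1) for f in f_basis]
    phi += [g(y[k - j]) for j in range(1, p + 1) for g in g_basis]
    return np.array(phi)
```

With these conventions, $\phi_k^T \theta$ reproduces the deterministic part of (1), which is easy to verify numerically on simulated data.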
1.4. Organization of the paper. In Section 2, the least-squares-based identification algorithm (see Bai, 1998) is presented for white disturbances, and the reason for its asymptotic bias under correlated noise is shown. Next, in Section 3, an asymptotically unbiased, instrumental-variables-based estimate is proposed. The idea originates from linear system theory (see, e.g., Wong and Polak, 1967; Söderström and Stoica, 1983; Sagara and Zhao, 1990; Zhao et al., 1991), where the instrumental variables technique is used for identification of simple one-element linear dynamic plants. The proposed method is then compared with the least squares. In particular, the consistency of the proposed estimate is shown in Section 4, even for correlated disturbances. The form of the optimal instrumental variables is established in Section 5, and the method of their approximate generation is described in Section 6. Also, the asymptotic rate of convergence of the estimate is analyzed.

2. Least squares and SVD approach

For comparison with the instrumental variables method proposed further, let us start from the presentation of a two-stage algorithm based on the least-squares estimation of the aggregated parameter vector and decomposition of the obtained result with the use of the SVD algorithm (see Bai, 1998; Kincaid and Cheney, 2002). Of fundamental importance for the algorithm is the form of the SVD representations of the theoretical matrices $\Theta_{\Gamma c} = \Gamma c^T$ and $\Theta_{\Lambda d} = \Lambda d^T$. Each matrix, being the product of two vectors, has rank equal to 1, and only one singular value is non-zero, i.e.,
$$\Theta_{\Gamma c} = \sum_{i=1}^{\min(n,m)} \sigma_i \mu_i \nu_i^T,$$
with $\sigma_1 \ne 0$ and $\sigma_2 = \dots = \sigma_{\min(n,m)} = 0$. Thus
$$\Theta_{\Gamma c} = \sigma_1 \mu_1 \nu_1^T, \quad (17)$$
where $\|\mu_1\|_2 = \|\nu_1\|_2 = 1$. The representation of $\Theta_{\Gamma c}$ given by (17) is obviously unique. To obtain $\Gamma$, which fulfills the condition (b), one can take $\Gamma = \mu_1$ or $\Gamma = -\mu_1$; the condition (c) guarantees the uniqueness of $\Gamma$. The remaining part of the decomposition allows computing $c$. The vectors $\Lambda$ and $d$ can be obtained from $\Theta_{\Lambda d}$ in a similar way. The algorithm has the following steps.

Algorithm 1. (LS-SVD method)

Step 1. Compute the LS estimate
$$\hat{\theta}_N^{(LS)} = (\Phi_N^T \Phi_N)^{-1} \Phi_N^T Y_N \quad (14)$$
of the aggregated parameter vector $\theta$ (see (10) and (13)), and next construct (by the plug-in method) the evaluations $\hat{\Theta}_{\Lambda d}^{(LS)}$ and $\hat{\Theta}_{\Gamma c}^{(LS)}$ of the matrices $\Theta_{\Lambda d} = \Lambda d^T$ and $\Theta_{\Gamma c} = \Gamma c^T$, respectively (see the condition (a) above).

Step 2. Perform the SVD (Singular Value Decomposition, see Appendix B) of the matrices $\hat{\Theta}_{\Lambda d}^{(LS)}$ and $\hat{\Theta}_{\Gamma c}^{(LS)}$:
$$\hat{\Theta}_{\Lambda d}^{(LS)} = \sum_{i=1}^{\min(p,q)} \hat{\delta}_i \hat{\xi}_i \hat{\zeta}_i^T, \qquad \hat{\Theta}_{\Gamma c}^{(LS)} = \sum_{i=1}^{\min(n,m)} \hat{\sigma}_i \hat{\mu}_i \hat{\nu}_i^T, \quad (15)$$
and next compute the estimates of the parameters of the particular blocks (see (9)):
$$\hat{\Lambda}_N^{(LS)} = \mathrm{sgn}(\hat{\xi}_1[\kappa_{\hat{\xi}_1}])\,\hat{\xi}_1, \qquad \hat{d}_N^{(LS)} = \mathrm{sgn}(\hat{\xi}_1[\kappa_{\hat{\xi}_1}])\,\hat{\delta}_1 \hat{\zeta}_1,$$
$$\hat{\Gamma}_N^{(LS)} = \mathrm{sgn}(\hat{\mu}_1[\kappa_{\hat{\mu}_1}])\,\hat{\mu}_1, \qquad \hat{c}_N^{(LS)} = \mathrm{sgn}(\hat{\mu}_1[\kappa_{\hat{\mu}_1}])\,\hat{\sigma}_1 \hat{\nu}_1, \quad (16)$$
where $x[k]$ denotes the $k$-th element of the vector $x$ and $\kappa_x = \min\{k : x[k] \ne 0\}$.

The singular value decomposition thus splits the aggregated parameter matrices $\hat{\Theta}_{\Gamma c}^{(LS)}$ and $\hat{\Theta}_{\Lambda d}^{(LS)}$ into products of two vectors (see (15)), estimating $\hat{\Gamma}_N^{(LS)}\hat{c}_N^{(LS)T}$ and $\hat{\Lambda}_N^{(LS)}\hat{d}_N^{(LS)T}$ according to (16). It was shown by Bai (1998) that
$$(\hat{\mu}_1, \hat{\sigma}_1\hat{\nu}_1) = \arg\min_{c\in\mathbb{R}^m,\,\Gamma\in\mathbb{R}^{n+1}} \big\|\hat{\Theta}_{\Gamma c}^{(LS)} - \Gamma c^T\big\|^2, \quad (18)$$
and for the noise-free case ($z_k \equiv 0$) the estimates (16) equal the true system parameters, i.e.,
$$\hat{\Lambda}_N^{(LS)} = \Lambda, \quad \hat{\Gamma}_N^{(LS)} = \Gamma, \quad \hat{c}_N^{(LS)} = c, \quad \hat{d}_N^{(LS)} = d. \quad (19)$$
Moreover, if the noise $\{z_k\}$ is an i.i.d. process, independent of the input $\{u_k\}$, then
$$\hat{\Lambda}_N^{(LS)} \to \Lambda, \quad \hat{\Gamma}_N^{(LS)} \to \Gamma, \quad \hat{c}_N^{(LS)} \to c, \quad \hat{d}_N^{(LS)} \to d, \quad (20)$$
with probability 1, as $N \to \infty$.

Remark 1. For a less sophisticated linear ARMAX model $y_k = \sum_{j=1}^{p} d\lambda_j y_{k-j} + \sum_{i=0}^{n} c\gamma_i u_{k-i} + z_k$, where $c$ and $d$ are scalar constants, the vector (10) reduces to $\theta = (\Theta_{\Gamma c}^T, \Theta_{\Lambda d}^T)^T$, with single-column matrices $\Theta_{\Gamma c}$ and $\Theta_{\Lambda d}$. Consequently, the estimate (14) plays the role of the standard least-squares method, and the SVD decomposition in (15) guarantees normalization, i.e., $\|\Lambda\|_2 = 1$ and $\|\Gamma\|_2 = 1$.

By taking (13) and (14) into account, the estimation error of the vector $\theta$ for the least squares can be expressed as follows:
$$\Delta_N^{(LS)} = \hat{\theta}_N^{(LS)} - \theta = (\Phi_N^T\Phi_N)^{-1}\Phi_N^T Z_N = \Big(\frac{1}{N}\sum_{k=1}^{N}\phi_k\phi_k^T\Big)^{-1}\frac{1}{N}\sum_{k=1}^{N}\phi_k z_k. \quad (21)$$
If $\{z_k\}$ is a zero-mean white noise with finite variance, independent of $\{u_k\}$, then all elements of the vector $Z_N$ are independent of the elements of the matrix $\Phi_N$, and from the ergodicity of the noise and of the process $\{\phi_k\}$ we get that $\Delta_N^{(LS)} \to 0$ with probability 1, as $N \to \infty$. Nevertheless, if $\{z_k\}$ is correlated, i.e., $Ez_k z_{k+i} \ne 0$ for some $i \ne 0$, then the LS estimate (14) of $\theta$ is not consistent, because of the dependence between $z_k$ and the values $g_l(y_{k-i})$ ($l = 1, \dots, q$ and $i = 1, \dots, p$) included in $\phi_k$. Consequently, the estimates given by (16) are not consistent, either.
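A compact sketch of Algorithm 1, assuming the helper `regressor` defined in Section 1 and numpy only; the first step implements the LS estimate (14) and the second implements the SVD factorization (15)–(16), with the sign fixed by condition (c).

```python
import numpy as np

def ls_svd(u, y, f_basis, g_basis, n, p):
    k0 = max(n, p)
    Phi = np.array([regressor(u, y, k, f_basis, g_basis, n, p)
                    for k in range(k0, len(y))])
    Y = y[k0:]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)        # LS estimate (14)
    m, q = len(f_basis), len(g_basis)
    Theta_Gc = theta[:(n + 1) * m].reshape(n + 1, m)       # plug-in estimate of Gamma c^T
    Theta_Ld = theta[(n + 1) * m:].reshape(p, q)           # plug-in estimate of Lambda d^T

    def rank1_split(Theta):
        U, s, Vt = np.linalg.svd(Theta)                    # SVD (15)
        left, right = U[:, 0], s[0] * Vt[0, :]
        sign = np.sign(left[np.nonzero(left)[0][0]])       # condition (c): first non-zero > 0
        return sign * left, sign * right                   # block estimates as in (16)

    Gamma, c = rank1_split(Theta_Gc)
    Lam, d = rank1_split(Theta_Ld)
    return Gamma, c, Lam, d
```

Under the correlated noise of Assumption 4, this LS-based version keeps the asymptotic bias derived above; the IV variant of the next section changes only the computation of theta.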
3. Instrumental variables approach

As was shown by Hasiewicz and Mzyk (2009), for any Hammerstein system the bias can be reduced by the instrumental variables method, known from linear system theory. This result was generalized by Mzyk (2013) to a correlated input. In this paper, a similar approach is proposed for more general systems, including nonlinear feedback.

Let us assume that we are given, or are able to generate, an additional matrix $\Psi_N$ of instrumental variables which fulfills (even for correlated $z_k$) the following conditions (see Wong and Polak, 1967; Finigan and Rowe, 1974; Ward, 1977; Hansen and Singleton, 1982; Söderström and Stoica, 1983; 2002; Kowalczuk and Kozłowski, 2000; Hasiewicz and Mzyk, 2009):

(C1) $\dim\Psi_N = \dim\Phi_N$, and the elements of $\Psi_N = (\psi_1, \psi_2, \dots, \psi_N)^T$, where $\psi_k = (\psi_{k,1}, \psi_{k,2}, \dots, \psi_{k,m(n+1)+pq})^T$, are jointly bounded, i.e., there exists $0 < \psi_{\max} < \infty$ such that $|\psi_{k,j}| \le \psi_{\max}$ ($k = 1, \dots, N$ and $j = 1, \dots, m(n+1)+pq$), and the $\psi_{k,j}$ are ergodic, not necessarily zero-mean, processes;

(C2) the limit $\mathrm{Plim}(\frac{1}{N}\Psi_N^T\Phi_N) = E\psi_k\phi_k^T$ exists and is not singular, i.e., $\det\{E\psi_k\phi_k^T\} \ne 0$;

(C3) $\mathrm{Plim}(\frac{1}{N}\Psi_N^T Z_N) = E\psi_k z_k$ and $E\psi_k z_k = \mathrm{cov}(\psi_k, z_k) = 0$ (see Assumption 4).

Lemma 1. A necessary condition for the existence of an instrumental variables matrix $\Psi_N$ which fulfills (C2) is the asymptotic non-singularity of $\frac{1}{N}\Phi_N^T\Phi_N$.

Proof. For the proof, see Appendix A.

Premultiplying (13) by $\Psi_N^T$, we get $\Psi_N^T Y_N = \Psi_N^T\Phi_N\theta + \Psi_N^T Z_N$. Taking the conditions (C1)–(C3) into account, a natural idea is to replace the LS estimate (14), computed in Step 1 (see Section 2), with the instrumental variables estimate
$$\hat{\theta}_N^{(IV)} = (\Psi_N^T\Phi_N)^{-1}\Psi_N^T Y_N. \quad (22)$$
Step 2 is analogous, i.e., the SVD decomposition is made for the estimates $\hat{\Theta}_{\Lambda d}^{(IV)}$ and $\hat{\Theta}_{\Gamma c}^{(IV)}$ of the matrices $\Theta_{\Lambda d}$ and $\Theta_{\Gamma c}$, obtained on the basis of $\hat{\theta}_N^{(IV)}$.
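The IV step (22) differs from (14) only in the normal equations; a minimal sketch, assuming the instrument matrix `Psi` has already been generated (e.g., nonparametrically, as in Section 6):

```python
import numpy as np

def iv_estimate(Psi, Phi, Y):
    # theta = (Psi^T Phi)^{-1} Psi^T Y, cf. (22); solvable iff det(Psi^T Phi) != 0,
    # which requires the persistent excitation of phi_k (Lemma 1, condition (C2))
    return np.linalg.solve(Psi.T @ Phi, Psi.T @ Y)
```

Step 2 (the SVD factorization of the aggregated matrices) is applied to this theta exactly as in `ls_svd` above.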
4. Limit properties

For the algorithm (22), the estimation error of the aggregated parameter vector $\theta$ has the form
$$\Delta_N^{(IV)} = \hat{\theta}_N^{(IV)} - \theta = (\Psi_N^T\Phi_N)^{-1}\Psi_N^T Z_N = \Big(\frac{1}{N}\sum_{k=1}^{N}\psi_k\phi_k^T\Big)^{-1}\frac{1}{N}\sum_{k=1}^{N}\psi_k z_k. \quad (23)$$
Theorem 1. Under (C1)–(C3), the estimate (22) converges in probability to the true parameters of the system, independently of the autocorrelation of the noise, i.e.,
$$\mathrm{Plim}_{N\to\infty}\,\Delta_N^{(IV)} = 0. \quad (24)$$

Proof. For the proof, see Appendix A.

Theorem 2. The estimation error $\Delta_N^{(IV)}$ converges to zero with the asymptotic rate $O(1/\sqrt{N})$ in probability, for each strategy of instrumental variable generation which guarantees the fulfillment of (C1)–(C3).

Proof. For the proof, see Appendix A.

5. Optimal instrumental variables

Theorem 2 gives a universal guaranteed asymptotic rate of convergence of the estimate (22). Nevertheless, for a moderate number of measurements, the error depends on the particular instruments used in a given application. In this section, an optimal form of the instruments is established for the special class of NARMAX systems which fulfill the following assumption concerning $\eta(\cdot)$ and $\{\lambda_j\}_{j=1}^{p}$.

Assumption 7. The nonlinear characteristic $\eta(\cdot)$ is a Lipschitz function, i.e.,
$$|\eta(y^{(1)}) - \eta(y^{(2)})| \le r|y^{(1)} - y^{(2)}|, \quad (25)$$
and
$$\eta(0) = 0. \quad (26)$$
Moreover, the constant $r > 0$ is such that
$$\alpha = r\sum_{j=1}^{p}|\lambda_j| < 1. \quad (27)$$

Let us consider the following conditional processes (cf. (2)):
$$G_{l,k} \triangleq E\{g_l(y_k) \mid \{u_{k-i}\}_{i=0}^{\infty}\}, \quad (28)$$
where $l = 1, 2, \dots, q$, and write
$$g_l(y_k) = G_{l,k} + \xi_{l,k}, \quad (29)$$
where the signals $\xi_{l,k} = g_l(y_k) - G_{l,k}$, for $l = 1, 2, \dots, q$ and $k = 1, 2, \dots, N$, will be interpreted as the "noise". Equation (1) can now be presented as follows:
$$y_k = \sum_{j=1}^{p}\lambda_j\eta(y_{k-j}) + \sum_{i=0}^{n}\gamma_i\mu(u_{k-i}) + z_k = A_k(\{y_{k-j}\}_{j=1}^{p}) + B_k(\{u_{k-i}\}_{i=1}^{n}) + C_k(u_k) + z_k, \quad (30)$$
where
$$A_k(\{y_{k-j}\}_{j=1}^{p}) = \sum_{j=1}^{p}\lambda_j\eta(y_{k-j}), \qquad B_k(\{u_{k-i}\}_{i=1}^{n}) = \sum_{i=1}^{n}\gamma_i\mu(u_{k-i}), \qquad C_k(u_k) = \gamma_0\mu(u_k).$$
The random variables $A_k$, $B_k$ and $z_k$ are independent of the input $u_k$ (see Assumptions 1–6). For a fixed $u_k = u$, we get $C_k(u) = \gamma_0\mu(u)$. The expectation in (28) has the following interpretation:
$$G_{l,k} = E\big\{g_l\big(C_k(u_k) + A_k(\{y_{k-j}\}_{j=1}^{p}) + B_k(\{u_{k-i}\}_{i=1}^{n}) + z_k\big) \mid \{u_i\}_{i=-\infty}^{k}\big\}, \quad (31)$$
and cannot be computed explicitly. However, as will be shown further, the relation between $G_{l,k}$ and the characteristics $\mu(\cdot)$, $\eta(\cdot)$ is not needed. The most significant are the properties below.

(P1) The "disturbances" $\{\xi_{l,k}\}_{k=1}^{N}$ given by (29) are independent of the input process $\{u_k\}$ and are all ergodic.

The mutual independence of $\{\xi_{l,k}\}_{k=1}^{N}$ and $\{u_k\}_{k=-\infty}^{\infty}$ is a direct consequence of the definition (28). On the basis of Assumptions 3–5, the output $\{y_k\}_{k=1}^{N}$ of the system is bounded and ergodic. Owing to Assumption 1, concerning the nonlinear characteristics, the processes $\{g_l(y_k)\}_{k=1}^{N}$ and $\{G_{l,k}\}_{k=1}^{N}$ ($l = 1, 2, \dots, q$) are also bounded and ergodic. Consequently, the "noises" $\{\xi_{l,k}\}_{k=1}^{N}$ ($l = 1, 2, \dots, q$), as sums of ergodic processes, are ergodic too (see (29)).

(P2) The processes $\{\xi_{l,k}\}$ are zero-mean. By the definition (29) of $\xi_{l,k}$, we have
$$E\xi_{l,k} = Eg_l(y_k) - EG_{l,k} = E_{\{u_i\}_{i=-\infty}^{k}}E\big\{g_l(y_k)\mid\{u_i\}_{i=-\infty}^{k}\big\} - E_{\{u_i\}_{i=-\infty}^{k}}E\big\{g_l(y_k)\mid\{u_i\}_{i=-\infty}^{k}\big\} = 0.$$

(P3) If the instrumental variables $\psi_{k,j}$ are generated by the nonlinear filtering
$$\psi_{k,j} = H_j(\{u_i\}_{i=-\infty}^{k}), \quad (32)$$
where the transformations $H_j(\cdot)$ ($j = 1, 2, \dots, m(n+1)+pq$) guarantee the ergodicity of $\{\psi_{k,j}\}$, then all products $\psi_{k_1,j}\xi_{l,k_2}$ ($j = 1, 2, \dots, m(n+1)+pq$, $l = 1, 2, \dots, q$) are zero-mean, i.e., $E\psi_{k_1,j}\xi_{l,k_2} = 0$. Owing to (P1) and (P2), we have
$$E[\psi_{k_1,j}\xi_{l,k_2}] = E\big[H_j(\{u_i\}_{i=-\infty}^{k_1})\,\xi_{l,k_2}\big] = EH_j(\{u_i\}_{i=-\infty}^{k_1}) \cdot E\xi_{l,k_2} = 0.$$
(P4) If the measurement noise $z_k$ and the instrumental variables $\psi_{k,j}$ are bounded (i.e., Assumption 4 and the condition (C1) are fulfilled), i.e., $|z_k| < z_{\max} < \infty$ and $|\psi_{k,j}| = |H_j(\{u_i\}_{i=-\infty}^{k})| < \psi_{\max} < \infty$ (see (32)), then
$$\frac{1}{N}\sum_{k=1}^{N}\psi_k z_k \to E\psi_k z_k, \quad (33)$$
with probability 1, as $N \to \infty$ (cf. the condition (C3)).

The product $s_{k,j} = \psi_{k,j}z_k$ of the stationary and bounded signals $\psi_{k,j}$ and $z_k$ is also stationary, with finite variance. To prove (33) making use of Lemma B.1 of Söderström and Stoica (1989), it must be shown that $r_{s_{k,j}}(\tau) \to 0$ as $|\tau| \to \infty$. Let us notice that the autocovariance function of $z_k$ ($Ez_k = 0$),
$$r_z(\tau) = E[(z_k - Ez)(z_{k+\tau} - Ez)] = Ez_k z_{k+\tau}, \quad (34)$$
as that of the output of a linear filter excited by a white noise, has the property that
$$r_z(\tau) \to 0, \quad \text{as } |\tau| \to \infty. \quad (35)$$
Moreover, the processes $\psi_{k,j} = H_j(\{u_i\}_{i=-\infty}^{k})$ are ergodic (see (P3)) and independent of $z_k$ (see Assumption 4). Thus
$$r_{s_{k,j}}(\tau) = E[(s_{k,j} - Es_{k,j})(s_{k+\tau,j} - Es_{k,j})] = E[\psi_{k,j}\psi_{k+\tau,j}z_k z_{k+\tau}] = c\,r_z(\tau), \quad (36)$$
where $c = (E\psi_{k,j})^2$ is a finite constant, $0 \le c < \infty$. Consequently,
$$r_{s_{k,j}}(\tau) \to 0, \quad (37)$$
as $|\tau| \to \infty$, and
$$\frac{1}{N}\sum_{k=1}^{N}s_{k,j} \to Es_{k,j}, \quad (38)$$
with probability 1, as $N \to \infty$.

(P5a) For the NARMAX system with the characteristic $\eta(\cdot)$ as in Assumption 7 and the order of autoregression $p = 1$ (see Eqn. (1)), it holds that
$$\frac{1}{N}\sum_{k=1}^{N}\psi_k\phi_k^T \to E\psi_k\phi_k^T, \quad (39)$$
with probability 1, as $N \to \infty$, where $\psi_k$ is given by (32); compare the condition (C2).

For $p = 1$ (for clarity of presentation, let also $\lambda_1 = 1$) the system is described by
$$y_k = \eta(y_{k-1}) + \sum_{i=0}^{n}\gamma_i\mu(u_{k-i}) + z_k, \quad (40)$$
and the nonlinearity $\eta(\cdot)$, according to Assumption 7, fulfills the condition
$$|\eta(y)| \le a|y|, \quad (41)$$
where $0 < a < 1$. Introducing the symbol
$$\delta_k = \sum_{i=0}^{n}\gamma_i\mu(u_{k-i}) + z_k, \quad (42)$$
we get
$$y_k = \eta(y_{k-1}) + \delta_k. \quad (43)$$
Since the input $\{u_k\}$ is an i.i.d. sequence, independent of $\{z_k\}$, and the noise $\{z_k\}$ has the property that $r_z(\tau) \to 0$ as $|\tau| \to \infty$ (see (35)), there holds $r_\delta(\tau) \to 0$ as $|\tau| \to \infty$. Equation (43) can be written in the following form:
$$y_k = \delta_k + \eta\{\delta_{k-1} + \eta[\delta_{k-2} + \eta(\delta_{k-3} + \dots)]\}. \quad (44)$$
Let us introduce the coefficients $c_k$ defined, for $k = 1, 2, \dots, N$, as
$$c_k = \frac{\eta(y_k)}{y_k}, \quad (45)$$
with $0/0$ treated as $0$. Owing to (41), we have $|c_k| \le a < 1$, and using $c_k$, Eqn. (44) can be rewritten as follows:
$$y_k = \delta_k + c_{k-1}\big(\delta_{k-1} + c_{k-2}\big(\delta_{k-2} + c_{k-3}(\delta_{k-3} + \dots)\big)\big), \quad (46)$$
i.e.,
$$y_k = \sum_{i=0}^{\infty}c_{k,i}\delta_{k-i},$$
where $c_{k,0} \triangleq 1$ and $c_{k,i} = c_{k-1}c_{k-2}\cdots c_{k-i}$. From (46) we conclude that
$$|c_{k,i}| < a^i. \quad (47)$$
Since for $0 < a < 1$ the sum $\sum_{i=0}^{\infty}a^i$ is finite, from (47) we get $\sum_{i=0}^{\infty}|c_{k,i}| < \infty$, and from (42) we simply conclude that for $|\tau| \to \infty$ we have $r_y(\tau) \to 0$ and $r_{g_l(y_k)}(\tau) \to 0$, where the processes $g_l(y_k)$ ($l = 1, \dots, q$) are elements of the vector $\phi_k$. Thus, for the system with the nonlinearity $\eta(\cdot)$ as in (41), the processes $\{y_k\}$ and $\{g_l(y_k)\}$ ($l = 1, \dots, q$) fulfill the assumptions of the ergodic law of large numbers, and the property (39) holds.

(P5b) Under Assumption 7, the convergence (39) takes place also for the system (1) with $p \ge 1$.

For any number sequence $\{x_k\}$, let us define the norm
$$\|\{x_k\}\| = \lim_{K\to\infty}\sup_{k>K}|x_k|, \quad (48)$$
and let us present Eqn. (1) in the form
$$y_k = \sum_{j=1}^{p}\lambda_j\eta(y_{k-j}) + \delta_k, \quad (49)$$
where $\delta_k$ is given by (42). The proof of the property (P5b) (for $p > 1$) is based on the following theorem (see Kudrewicz, 1976, p. 53).

Theorem 3. Let $\{y_k^{(1)}\}$ and $\{y_k^{(2)}\}$ be two different output sequences of the system (1) (see also (49)), and $\{\delta_k^{(1)}\}$, $\{\delta_k^{(2)}\}$ be the respective aggregated inputs (see (42)). If (25), (26) and (27) are fulfilled, then
$$\frac{\|\{\delta_k^{(1)} - \delta_k^{(2)}\}\|}{1+\alpha} \le \|\{y_k^{(1)} - y_k^{(2)}\}\| \le \frac{\|\{\delta_k^{(1)} - \delta_k^{(2)}\}\|}{1-\alpha}, \quad (50)$$
where the norm $\|\cdot\|$ is defined in (48).

From (50) and under the conditions (25)–(27), the steady state of the system (1) depends only on the steady state of the input $\{\delta_k\}$. The special case of (50) is $\delta_k^{(2)} \equiv 0$, in which $\lim_{K\to\infty}\sup_{k>K}|y_k^{(2)}| = 0$, and
$$\frac{1}{1+\alpha}\|\{\delta_k^{(1)}\}\| \le \|\{y_k^{(1)}\}\| \le \frac{1}{1-\alpha}\|\{\delta_k^{(1)}\}\|.$$
The impulse response of the system thus tends to zero as $k \to \infty$, and for an i.i.d. input the autocorrelation function of the output $\{y_k\}$ is such that $r_y(\tau) \to 0$ as $|\tau| \to \infty$. Moreover, on the basis of (1)–(4), since the process $\{y_k\}$ is bounded, it has finite moments of all orders and the ergodic theorems hold (see Söderström and Stoica, 1989, Definition B.2, Lemmas B.1 and B.2). In consequence, the convergence (39) holds.

The properties (P5a) and (P5b) (see (39), (12) and (32)) can be rewritten for the particular elements of $\psi_k$ and $\phi_k$ in the following way:
$$\frac{1}{N}\sum_{k=1}^{N}\psi_{k_1,j}\,g_l(y_{k_2}) \to E\psi_{k_1,j}\,g_l(y_{k_2}),$$
with probability 1, as $N \to \infty$. Under the property that $E[\psi_{k_1,j}\xi_{l,k_2}] = 0$ (see (P3)), for instrumental variables generated according to (32), there obviously holds that
$$E[\psi_{k_1,j}\,g_l(y_{k_2})] = E[\psi_{k_1,j}\,G_{l,k_2}].$$
Writing (cf. (12))
$$\Phi_N^{\#} = (\phi_1^{\#}, \phi_2^{\#}, \dots, \phi_N^{\#})^T,$$
$$\phi_k^{\#} \triangleq (f_1(u_k), \dots, f_m(u_k), \dots, f_1(u_{k-n}), \dots, f_m(u_{k-n}), G_{1,k-1}, \dots, G_{q,k-1}, \dots, G_{1,k-p}, \dots, G_{q,k-p})^T, \quad (51)$$
where $G_{l,k} \triangleq E\{g_l(y_k)\mid\{u_i\}_{i=-\infty}^{k}\}$ (see (28)), and making use of the ergodicity of the processes $\{\psi_{k,j}\}$ ($j = 1, \dots, m(n+1)+pq$), $\{f_t(u_k)\}$ ($t = 1, \dots, m$) and $\{G_{l,k}\}$ ($l = 1, \dots, q$) (see (32) and Assumption 3), we get
$$\frac{1}{N}\Psi_N^T\Phi_N^{\#} = \frac{1}{N}\sum_{k=1}^{N}\psi_k\phi_k^{\#T} \to E\psi_k\phi_k^{\#T} \quad \text{with prob. 1},$$
and, using (39), we get
$$\frac{1}{N}\Psi_N^T\Phi_N = \frac{1}{N}\sum_{k=1}^{N}\psi_k\phi_k^{T} \to E\psi_k\phi_k^{T} \quad \text{with prob. 1},$$
for the instruments as in (32). Directly from the definitions (28) and (51) we conclude that $E[\psi_{k_1,j}\,g_l(y_{k_2})] = E[\psi_{k_1,j}\,G_{l,k_2}]$, and hence
$$E\psi_k\phi_k^{\#T} = E\psi_k\phi_k^{T}.$$
Thus, for any choice of the instrumental variables matrix $\Psi_N$ which fulfills the property (P3) (see (32)), the following equivalence takes place asymptotically, with probability 1, as $N \to \infty$:
$$\frac{1}{N}\Psi_N^T\Phi_N^{\#} = \frac{1}{N}\Psi_N^T\Phi_N. \quad (52)$$
The estimation error (i.e., the difference between the estimate and the true value of the parameters) has the form
$$\Delta_N^{(IV)} = \hat{\theta}_N^{(IV)} - \theta = \Big(\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\frac{1}{N}\Psi_N^T Z_N.$$
Introducing
$$\Gamma_N \triangleq \Big(\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\frac{1}{\sqrt{N}}\Psi_N^T, \qquad Z_N^* \triangleq \frac{1}{z_{\max}}\cdot\frac{1}{\sqrt{N}}Z_N,$$
where $z_{\max}$ is an upper bound of the absolute value of the noise (see Assumption 4), we obtain that
$$\Delta_N^{(IV)} = z_{\max}\,\Gamma_N Z_N^*, \quad (53)$$
with the Euclidean norm of $Z_N^*$ satisfying
$$\|Z_N^*\| = \frac{1}{z_{\max}}\sqrt{\frac{1}{N}\sum_{k=1}^{N}z_k^2} \le 1.$$
Let the quality of the instrumental variables be evaluated on the basis of the following criterion (see, e.g., Wong and Polak, 1967):
$$Q(\Psi_N) = \max_{\|Z_N^*\|\le 1}\big\|\Delta_N^{(IV)}(\Psi_N)\big\|^2, \quad (54)$$
where $\|\cdot\|$ denotes the Euclidean norm, and $\Delta_N^{(IV)}(\Psi_N)$ is the estimation error obtained for the instrumental variables $\Psi_N$.
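By (53), the criterion (54) equals $z_{\max}^2\lambda_{\max}(\Gamma_N\Gamma_N^T)$, so candidate instrument matrices can be compared numerically; a sketch with $z_{\max}$ normalized to 1:

```python
import numpy as np

def q_criterion(Psi, Phi):
    # Q(Psi_N) of (54) via Gamma_N of (53): largest eigenvalue of Gamma_N Gamma_N^T
    N = Phi.shape[0]
    Gamma = np.linalg.inv(Psi.T @ Phi / N) @ (Psi.T / np.sqrt(N))
    return np.linalg.eigvalsh(Gamma @ Gamma.T).max()
```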
Theorem 4. If Assumptions 1–7 and the condition (32) hold, then the criterion $Q(\Psi_N)$ given by (54) attains a minimum for the choice
$$\Psi_N^{\#} = \Phi_N^{\#}, \quad (55)$$
i.e., for each $\Psi_N$,
$$\lim_{N\to\infty}Q(\Psi_N^{\#}) \le \lim_{N\to\infty}Q(\Psi_N) \quad \text{with prob. 1}.$$

Proof. For the proof, see Appendix A.

Obviously, the instrumental variables given by (55) fulfill the postulates (C1)–(C3).

6. Nonparametric generation of instrumental variables

The optimal matrix of instruments $\Psi_N^{\#}$ cannot be computed analytically, because of the lack of prior knowledge of the system (the probability density functions of the excitations and the values of the parameters are unknown). Estimation of $\Psi_N^{\#}$ is also difficult, because the elements $G_{l,k}$ depend on an infinite number of measurements of the input process. Therefore, the only choice is the following FIR approximation:
$$\Psi_N^{(r)\#} = (\psi_1^{(r)\#}, \psi_2^{(r)\#}, \dots, \psi_N^{(r)\#})^T,$$
$$\psi_k^{(r)\#} \triangleq (f_1(u_k), \dots, f_m(u_k), \dots, f_1(u_{k-n}), \dots, f_m(u_{k-n}), G_{1,k-1}^{(r)}, \dots, G_{q,k-1}^{(r)}, \dots, G_{1,k-p}^{(r)}, \dots, G_{q,k-p}^{(r)})^T, \quad (56)$$
where $r$ is a cut-off level in (28), i.e.,
$$G_{l,k}^{(r)} = E\{g_l(y_k) \mid \{u_{k-i}\}_{i=0}^{r}\}.$$
It is based on the intuition that the approximation $\Psi_N^{(r)\#} \cong \Psi_N^{\#}$ becomes better when $r$ is increasing (this question is treated as open). The simplest realization of the algorithm (i.e., for $r = 0$) has the form
$$\Psi_N^{(0)\#} = (\psi_1^{(0)\#}, \dots, \psi_N^{(0)\#})^T,$$
$$\psi_k^{(0)\#} \triangleq (f_1(u_k), \dots, f_m(u_k), \dots, f_1(u_{k-n}), \dots, f_m(u_{k-n}), R_1(u_{k-1}), \dots, R_q(u_{k-1}), \dots, R_1(u_{k-p}), \dots, R_q(u_{k-p}))^T, \quad (57)$$
where
$$R_l(u) = G_l^{(0)}(u) = E\{g_l(y_k) \mid u_k = u\}.$$
All elements of $\psi_k^{(0)\#}$ (functions of the white-noise input) fulfill (P3). After introducing
$$x_{l,k} = g_l(y_k),$$
the regression functions in (57) can be written as
$$R_l(u) = E\{x_{l,k} \mid u_k = u\}.$$
Both $u_k$ and $y_k$ can be measured, and $x_{l,k} = g_l(y_k)$ can be computed, because the functions $g_l(\cdot)$ are known a priori. Thus the most natural method for the generation of $\Psi_N^{(0)\#}$ is the kernel method. A traditional estimate of the regression function $R_l(u)$, computed on the basis of $M$ pairs $\{(u_i, x_{l,i})\}_{i=1}^{M}$, has the form (see, e.g., Greblicki and Pawlak, 2008)
$$\hat{R}_{l,M}(u) = \frac{\frac{1}{M}\sum_{i=1}^{M}x_{l,i}K\big(\frac{u-u_i}{h(M)}\big)}{\frac{1}{M}\sum_{i=1}^{M}K\big(\frac{u-u_i}{h(M)}\big)}, \quad (58)$$
where $K$ is a kernel function and $h$ is the bandwidth parameter. Further deliberations will be based on the following two theorems (see Greblicki and Pawlak, 2008).

Theorem 5. If $h(M) \to 0$ and $Mh(M) \to \infty$ as $M \to \infty$, and $K(v)$ is one of $\exp(-|v|)$, $\exp(-v^2)$, or $(1+|v|^{1+\delta})^{-1}$, then
$$\frac{\frac{1}{M}\sum_{i=1}^{M}y_iK\big(\frac{u-u_i}{h(M)}\big)}{\frac{1}{M}\sum_{i=1}^{M}K\big(\frac{u-u_i}{h(M)}\big)} \to E\{y_i \mid u_i = u\} \quad (59)$$
in probability as $M \to \infty$, provided that $\{(u_i, y_i)\}_{i=1}^{M}$ is an i.i.d. sequence.

Theorem 6. If both the regression $E\{y_i \mid u_i = u\}$ and the input probability density function $\vartheta(u)$ have finite second-order derivatives, then for $h(M) = O(M^{-1/5})$ the asymptotic rate of convergence in (59) is $O(M^{-2/5})$ in probability.

To apply the above theorems, let us additionally make the following assumption.

Assumption 8. The functions $g_1(y),\dots,g_q(y)$, $f_1(u),\dots,f_m(u)$ and the input probability density $\vartheta(u)$ have finite second-order derivatives for each $u \in (-u_{\max}, u_{\max})$ and each $y \in (-y_{\max}, y_{\max})$.
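A sketch of the $r = 0$ instruments (57): each $G$-component of $\psi_k^{(0)\#}$ is replaced by the kernel regression estimate (58) of $R_l(u) = E\{g_l(y_k) \mid u_k = u\}$, here with the kernel $K(v) = \exp(-v^2)$ admitted by Theorem 5 and the bandwidth $h(M) = M^{-1/5}$ suggested by Theorem 6.

```python
import numpy as np

def kernel_regression(u_train, x_train, u_query, h):
    # Nadaraya-Watson estimate (58) with K(v) = exp(-v^2)
    K = np.exp(-((u_query[:, None] - u_train[None, :]) / h) ** 2)
    return (K @ x_train) / K.sum(axis=1)

def nonparametric_instruments(u, y, f_basis, g_basis, n, p):
    # rows are psi_k^(0)# of (57) for k = max(n, p), ..., M-1
    M = len(u)
    h = M ** (-0.2)                                   # h(M) = O(M^{-1/5}), cf. Theorem 6
    R = [kernel_regression(u, g(y), u, h) for g in g_basis]
    rows = []
    for k in range(max(n, p), M):
        psi = [f(u[k - i]) for i in range(n + 1) for f in f_basis]
        psi += [R[l][k - j] for j in range(1, p + 1) for l in range(len(g_basis))]
        rows.append(psi)
    return np.array(rows)
```

The resulting matrix estimates $\Phi_N^{\#}$ and can be passed directly to `iv_estimate` from Section 3.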
In our problem, the process $\{x_{l,i}\}$ appearing in the numerator of (58) is correlated. Let us decompose the sums in the numerator and the denominator of (58) into partial sums over interleaved subsequences of the data, with $r = M/\chi(M)$, where $r \to \infty$ and $\chi(M) \to \infty$ as $M \to \infty$ (e.g., $\chi(M) = \sqrt{\log M}$), i.e.,
$$L(\{(u_i, x_{l,i})\}_{i=1}^{M}) = \frac{1}{M}\sum_{i=1}^{M}x_{l,i}K\Big(\frac{u-u_i}{h(M)}\Big) = \frac{1}{r}\sum_{t=1}^{r}s_t, \qquad W(\{u_i\}_{i=1}^{M}) = \frac{1}{M}\sum_{i=1}^{M}K\Big(\frac{u-u_i}{h(M)}\Big) = \frac{1}{r}\sum_{t=1}^{r}w_t, \quad (60)$$
with
$$s_t = \frac{1}{\chi(M)}\sum_{\{i:\,0 < ir+t \le M\}}x_{l,ir+t}K\Big(\frac{u-u_{ir+t}}{h(M)}\Big), \qquad w_t = \frac{1}{\chi(M)}\sum_{\{i:\,0 < ir+t \le M\}}K\Big(\frac{u-u_{ir+t}}{h(M)}\Big),$$
so that each partial sum collects terms spaced $r$ apart.

7. Three-stage identification

Appendix A

A1. Equivalent Hammerstein representation

For linear $\eta(\cdot)$ (see Section 1.1), the feedback loop acts as a linear filter with some impulse response $\{r_l\}_{l=0}^{\infty}$, and further
$$y_k = \sum_{q=0}^{\infty}\gamma_q\mu(u_{k-q}) + z_k, \quad (A3)$$
where $z_k = \sum_{l=0}^{\infty}r_lv_{k-l}$, $\gamma_q = \sum_{l=0}^{\infty}\sum_{i=0}^{n}r_lb_i\delta(l+i-q)$, and $\delta(\cdot)$ is a discrete impulse. Equation (A3) represents a Hammerstein system with an infinite impulse response.
A2. Necessary condition for the 3-stage algorithm

Lemma A2. If $\det(B^TA) \ne 0$ for given matrices $A, B \in \mathbb{R}^{\alpha\times\beta}$ with finite elements, then $\det(A^TA) \ne 0$.

Proof. Let $\det(A^TA) = 0$, i.e., $\mathrm{rank}(A^TA) < \beta$. From the obvious property that $\mathrm{rank}(A^TA) = \mathrm{rank}(A)$, one can conclude that there exists a non-zero vector $\xi \in \mathbb{R}^{\beta}$ such that $A\xi = 0$. Premultiplying this equation by $B^T$, we get $B^TA\xi = 0$, and hence $\det(B^TA) = 0$.

Thus, for $A = \frac{1}{\sqrt{N}}\Phi_N$ and $B = \frac{1}{\sqrt{N}}\Psi_N$, a necessary condition for $\frac{1}{N}\Psi_N^T\Phi_N$ to be of full rank is $\det(\frac{1}{N}\Phi_N^T\Phi_N) \ne 0$, i.e., a persistent excitation of $\{\phi_k\}$.

A3. Proof of Theorem 1

Proof. From the Slutzky theorem (cf. the work of Chow and Teicher (2003) and Appendix B) we have
$$\mathrm{Plim}_{N\to\infty}(\Delta_N^{(IV)}) = \Big(\mathrm{Plim}_{N\to\infty}\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\,\mathrm{Plim}_{N\to\infty}\frac{1}{N}\Psi_N^TZ_N,$$
which, owing to (C2) and (C3), equals $\{E\psi_k\phi_k^T\}^{-1}\cdot 0 = 0$.

A4. Proof of Theorem 2

Proof. Recall that $\xi_N = O(a_N)$ in probability when $P\{\frac{r_N}{a_N}|\xi_N| > \varepsilon\} \to 0$ as $N \to \infty$, for each $\varepsilon > 0$ and each $r_N \to 0$, with $a_N = 1/\sqrt{N}$ here. To prove that $\xi_N = O(1/\sqrt{N})$ in probability, it suffices to show that $\xi_N = O(1/N)$ in the mean square sense, i.e., that $E\xi_N^2 = O(1/N)$. Introducing
$$A_N = \frac{1}{N}\Psi_N^T\Phi_N = \frac{1}{N}\sum_{k=1}^{N}\psi_k\phi_k^T, \qquad B_N = \frac{1}{N}\Psi_N^TZ_N = \frac{1}{N}\sum_{k=1}^{N}\psi_kz_k,$$
and $\xi_N = \|\Delta_N^{(IV)}\|$, $\Delta_N^{(IV)} = \hat{\theta}_N^{(IV)} - \theta$, we obtain that
$$\Delta_N^{(IV)} = A_N^{-1}B_N. \quad (A5)$$
Under Assumptions 1–6, the system output $y_k$ is bounded, i.e., $|y_k| < y_{\max} < \infty$. Moreover, under the condition (C1), we have $|A_N^{i,j}| \le \psi_{\max}p_{\max} < \infty$ for $j = 1, 2, \dots, m(n+1)$, and $|A_N^{i,j}| \le \psi_{\max}p_{\max} < \infty$ for $j = m(n+1)+1, \dots, m(n+1)+pq$, so each element of $A_N$ is bounded. Similarly, one can show the boundedness of the elements of the vector $B_N$. The norm of the error $\Delta_N^{(IV)}$ given by (A5) can be evaluated as follows:
$$\xi_N = \|\Delta_N^{(IV)}\| = \Big\|\Big(\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\frac{1}{N}\Psi_N^TZ_N\Big\| \le \Big\|\Big(\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\Big\|\Big\|\frac{1}{N}\Psi_N^TZ_N\Big\| \le c\Big\|\frac{1}{N}\sum_{k=1}^{N}\psi_kz_k\Big\|,$$
where $c$ is some positive constant. Obviously, one can find $\alpha \ge 0$ such that
$$c\Big\|\frac{1}{N}\sum_{k=1}^{N}\psi_kz_k\Big\| \le \alpha c\sum_{i=1}^{\dim\psi_k}\Big|\frac{1}{N}\sum_{k=1}^{N}\psi_{k,i}z_k\Big|,$$
and hence
$$\xi_N^2 = \|\Delta_N^{(IV)}\|^2 \le \alpha^2c^2\Big(\sum_{i=1}^{\dim\psi_k}\Big|\frac{1}{N}\sum_{k=1}^{N}\psi_{k,i}z_k\Big|\Big)^2 \le \alpha^2c^2\dim\psi_k\sum_{i=1}^{\dim\psi_k}\frac{1}{N^2}\Big(\sum_{k=1}^{N}\psi_{k,i}z_k\Big)^2.$$
Moreover, for the uncorrelated processes $\{\psi_k\}$ and $\{z_k\}$ (see the condition (C3)), we have that
$$E\xi_N^2 \le \alpha^2c^2\dim\psi_k\sum_{i=1}^{\dim\psi_k}\frac{1}{N^2}E\sum_{k_1=1}^{N}\sum_{k_2=1}^{N}\psi_{k_1,i}\psi_{k_2,i}z_{k_1}z_{k_2} \le \alpha^2c^2\dim\psi_k\sum_{i=1}^{\dim\psi_k}\frac{1}{N^2}\sum_{k_1=1}^{N}\sum_{k_2=1}^{N}|E[\psi_{k_1,i}\psi_{k_2,i}]|\,|E[z_{k_1}z_{k_2}]|$$
$$\le \alpha^2c^2(\dim\psi_k)^2\psi_{\max}^2\Big[\frac{|r_z(0)|}{N} + \frac{2}{N}\sum_{\tau=1}^{N}\Big(1-\frac{\tau}{N}\Big)|r_z(\tau)|\Big] \le \frac{C}{N}\sum_{\tau=0}^{\infty}|r_z(\tau)|,$$
where
$$r_z(\tau) = \mathrm{var}\,\varepsilon\sum_{i=0}^{\infty}\omega_i\omega_{i+\tau}, \qquad C = 2\alpha^2c^2(\dim\psi_k)^2\psi_{\max}^2.$$
Since
$$\sum_{\tau=0}^{\infty}\Big|\mathrm{var}\,\varepsilon\sum_{i=0}^{\infty}\omega_i\omega_{i+\tau}\Big| \le \mathrm{var}\,\varepsilon\sum_{\tau=0}^{\infty}\sum_{i=0}^{\infty}|\omega_i||\omega_{i+\tau}| \le \mathrm{var}\,\varepsilon\Big(\sum_{i=0}^{\infty}|\omega_i|\Big)^2 < \infty,$$
we have
$$E\xi_N^2 \le D\frac{1}{N}, \qquad D = C\,\mathrm{var}\,\varepsilon\sum_{\tau=0}^{\infty}\Big|\sum_{i=0}^{\infty}\omega_i\omega_{i+\tau}\Big|.$$
A5. Proof of Theorem 4

Proof. To simplify the presentation, let $z_{\max} = 1$. From (53) we get
$$\|\Delta_N^{(IV)}(\Psi_N)\|^2 = \Delta_N^{(IV)T}(\Psi_N)\Delta_N^{(IV)}(\Psi_N) = Z_N^{*T}\Gamma_N^T\Gamma_NZ_N^*,$$
and the maximum value of the cumulated error is
$$Q(\Psi_N) = \max_{\|Z_N^*\|\le 1}\Delta_N^{(IV)T}(\Psi_N)\Delta_N^{(IV)}(\Psi_N) = \max_{\|Z_N^*\|\le 1}\big\langle Z_N^*, \Gamma_N^T\Gamma_NZ_N^*\big\rangle = \|\Gamma_N\|^2 = \lambda_{\max}(\Gamma_N^T\Gamma_N),$$
where $\|\cdot\|$ is the spectral matrix norm induced by the Euclidean vector norm, and $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix. Since (see Wong and Polak, 1967; Rao, 1973) $\lambda_{\max}(\Gamma_N^T\Gamma_N) = \lambda_{\max}(\Gamma_N\Gamma_N^T)$, from the definition of $\Gamma_N$ we obtain that
$$\max_{\|Z_N^*\|\le 1}\Delta_N^{(IV)T}(\Psi_N)\Delta_N^{(IV)}(\Psi_N) = \max_{\|\zeta\|\le 1}\big\langle\zeta, \Gamma_N\Gamma_N^T\zeta\big\rangle = \max_{\|\zeta\|\le 1}\Big\langle\zeta, \Big(\frac{1}{N}\Psi_N^T\Phi_N\Big)^{-1}\frac{1}{N}\Psi_N^T\Psi_N\Big(\frac{1}{N}\Phi_N^T\Psi_N\Big)^{-1}\zeta\Big\rangle.$$
On the basis of (52), we get
$$\max_{\|Z_N^*\|\le 1}\Delta_N^{(IV)T}(\Psi_N)\Delta_N^{(IV)}(\Psi_N) = \max_{\|\zeta\|\le 1}\Big\langle\zeta, \Big(\frac{1}{N}\Psi_N^T\Phi_N^{\#}\Big)^{-1}\frac{1}{N}\Psi_N^T\Psi_N\Big(\frac{1}{N}\Phi_N^{\#T}\Psi_N\Big)^{-1}\zeta\Big\rangle,$$
with probability 1, as $N \to \infty$, where $\Phi_N$ and $\Phi_N^{\#}$ are given by (12) and (51), respectively. Using Lemma B1 for $M_1 = \frac{1}{\sqrt{N}}\Phi_N^{\#}$ and $M_2 = \frac{1}{\sqrt{N}}\Psi_N$, we get
$$\zeta^T\Gamma_N\Gamma_N^T\zeta \ge \zeta^T\Big(\frac{1}{N}\Phi_N^{\#T}\Phi_N^{\#}\Big)^{-1}\zeta,$$
for each vector $\zeta$, and consequently
$$Q(\Psi_N) = \max_{\|\zeta\|\le 1}\zeta^T\Gamma_N\Gamma_N^T\zeta \ge \max_{\|\zeta\|\le 1}\zeta^T\Big(\frac{1}{N}\Phi_N^{\#T}\Phi_N^{\#}\Big)^{-1}\zeta.$$
For $\Psi_N = \Phi_N^{\#}$, we have
$$\max_{\|\zeta\|\le 1}\zeta^T\Gamma_N\Gamma_N^T\zeta = \max_{\|\zeta\|\le 1}\zeta^T\Big(\frac{1}{N}\Phi_N^{\#T}\Phi_N^{\#}\Big)^{-1}\zeta,$$
and the criterion $Q(\Psi_N)$ attains its minimum. The choice $\Psi_N = \Phi_N^{\#}$ is thus asymptotically optimal.
A6. Proof of Theorem 7

Proof. The estimation error (67) can be decomposed as follows:
$$\Delta_{N,M}^{(IV)} = \hat{\theta}_{N,M}^{*(IV)} - \theta = \big(\hat{\theta}_{N,M}^{*(IV)} - \hat{\theta}_N^{*(IV)}\big) + \big(\hat{\theta}_N^{*(IV)} - \theta\big),$$
where $\hat{\theta}_N^{*(IV)} = (\Psi_N^{*T}\Phi_N)^{-1}\Psi_N^{*T}Y_N$, and $\Psi_N^*$ is defined by (55) and (51). From the triangle inequality, for each norm $\|\cdot\|$ we have
$$\|\Delta_{N,M}^{(IV)}\| \le \|\hat{\theta}_{N,M}^{*(IV)} - \hat{\theta}_N^{*(IV)}\| + \|\hat{\theta}_N^{*(IV)} - \theta\|. \quad (A6)$$
On the basis of Theorem 1,
$$\|\hat{\theta}_N^{*(IV)} - \theta\| \to 0 \text{ in probability},$$
as $N \to \infty$. To prove Theorem 7, let us analyze the component $\|\hat{\theta}_{N,M}^{*(IV)} - \hat{\theta}_N^{*(IV)}\|$ in (A6) and show that, for fixed $N$, it tends to zero in probability as $M \to \infty$. Write $\varepsilon_N \triangleq 1/\big\|\big(\frac{1}{N}\Psi_N^{*T}\Phi_N\big)^{-1}\big\|$. From (65) we have that
$$\Big\|\frac{1}{N}\Psi_{N,M}^{*T}\Phi_N - \frac{1}{N}\Psi_N^{*T}\Phi_N\Big\| \to 0$$
in probability as $M \to \infty$, and particularly
$$\lim_{M\to\infty}P\Big\{\Big\|\frac{1}{N}\Psi_{N,M}^{*T}\Phi_N - \frac{1}{N}\Psi_N^{*T}\Phi_N\Big\| < \varepsilon_N\Big\} = 1 \quad (N\text{ fixed}).$$
Introducing
$$r_M \triangleq \frac{\big\|\frac{1}{N}\Psi_{N,M}^{*T}\Phi_N - \frac{1}{N}\Psi_N^{*T}\Phi_N\big\|}{\varepsilon_N\big(\varepsilon_N - \big\|\frac{1}{N}\Psi_{N,M}^{*T}\Phi_N - \frac{1}{N}\Psi_N^{*T}\Phi_N\big\|\big)},$$
and using the Banach theorem (see Kudrewicz, 1976, Theorem 5.8), we get
$$\lim_{M\to\infty}P\Big\{\Big\|\Big(\frac{1}{N}\Psi_{N,M}^{*T}\Phi_N\Big)^{-1} - \Big(\frac{1}{N}\Psi_N^{*T}\Phi_N\Big)^{-1}\Big\| \le r_M\Big\} = 1.$$
Since $r_M \to 0$ in probability as $M \to \infty$, there holds $\|\hat{\theta}_{N,M}^{*(IV)} - \hat{\theta}_N^{*(IV)}\| \to 0$ in probability, as $M \to \infty$, for each $N$.

Appendix B

Technical lemmas, theorems and definitions
B1. SVD decomposition

Theorem B1. (Kincaid and Cheney, 2002) For each $A \in \mathbb{R}^{m\times n}$ there exist unitary matrices $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ such that
$$U^TAV = \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_l), \quad (B1)$$
where $l = \min(m, n)$ and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, $\sigma_{r+1} = \dots = \sigma_l = 0$, where $r = \mathrm{rank}(A)$. The numbers $\sigma_1, \dots, \sigma_l$ are called the singular values of the matrix $A$.

Solving (B1) with respect to $A$, we obtain
$$A = U\Sigma V^T = \sum_{i=1}^{r}u_i\sigma_iv_i^T = \sum_{i=1}^{r}\sigma_iu_iv_i^T, \quad (B2)$$
where $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$, respectively.
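A quick numerical illustration of (B2) for a rank-one matrix of the type $\Gamma c^T$ (the values are arbitrary):

```python
import numpy as np

Gamma = np.array([0.8, 0.6])                 # ||Gamma||_2 = 1, first element positive
c = np.array([1.0, 0.5, 0.25])
A = np.outer(Gamma, c)                       # rank(A) = 1
U, s, Vt = np.linalg.svd(A)
assert np.allclose(s[1:], 0.0)               # only sigma_1 is non-zero
assert np.allclose(np.abs(U[:, 0]), Gamma)   # u_1 recovers Gamma up to sign
```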
B2. Factorization theorem
Theorem B2. (Rao, 1973) Each positive definite matrix $M$ can be represented in the form $M = PP^T$, where $P$ (a root of $M$) is nonsingular.

B3. Technical lemma

Lemma B1. (Wong and Polak, 1967) Let $M_1$ and $M_2$ be two matrices with the same dimensions. If $(M_1^TM_1)^{-1}$, $(M_1^TM_2)^{-1}$ and $(M_2^TM_1)^{-1}$ exist, then
$$D_N = (M_2^TM_1)^{-1}M_2^TM_2(M_1^TM_2)^{-1} - (M_1^TM_1)^{-1}$$
is nonnegative definite, i.e., for each $\zeta$, $\zeta^TD_N\zeta \ge 0$.
B4. Slutzky theorem

Theorem B3. (Rao, 1973) If $\mathrm{Plim}_{k\to\infty}\kappa_k = \kappa^{\#}$ and the function $g(\cdot)$ is continuous, then $\mathrm{Plim}_{k\to\infty}g(\kappa_k) = g(\kappa^{\#})$.

B5. Chebyshev's inequality

Lemma B2. (Chow and Teicher, 2003, p. 106) For each constant $c$, each random variable $X$ and each $\varepsilon > 0$, there holds $P\{|X - c| > \varepsilon\} \le \frac{1}{\varepsilon^2}E(X - c)^2$. In particular, for $c = EX$, $P\{|X - EX| > \varepsilon\} \le \frac{1}{\varepsilon^2}\mathrm{var}\,X$.

B6. Persistent excitation

Definition B1. A stationary random process $\{\alpha_k\}$ is strongly persistently exciting of orders $n \times m$ (denoted $SPE(n, m)$) if the matrix
$$R_\kappa(n, m) = E\big[(\kappa_k^T, \dots, \kappa_{k-n+1}^T)^T(\kappa_k^T, \dots, \kappa_{k-n+1}^T)\big],$$
where $\kappa_k = (\alpha_k, \alpha_k^2, \dots, \alpha_k^m)^T$, is of full rank.
Lemma B3. (Stoica and S¨oderstr¨om, 1982) The i.i.d. process {αk } is SP E(n, m) for each n and m. Lemma B4. (Stoica and S¨oderstr¨om, 1982) Let xk = H(q −1 )uk , H(q −1 ) be an asymptotically stable linear filter, and {uk } be a random sequence with finite variance. If the frequency function of {uk } is strictly positive in at least m + 1 distinct points, then {xk } is SP E(n, m) for each n. B7. Modified triangle inequality Lemma B5. (Chow and Teicher, 2003) If X and Y are k-dimensional random vectors, then P [X + Y ≥ ε] ≤ P [X ≥ ε/2]+P [Y ≥ ε/2] for each vector norm · and each ε > 0. Received: 6 December 2012 Revised: 29 April 2013