On the convergence rate of the unscented transformation

Kwang Woo Ahn
Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI 53226
email: [email protected]

Kung-Sik Chan
Department of Statistics and Actuarial Science, The University of Iowa, Iowa City, Iowa 52242
email: [email protected]

July 22, 2011

Abstract

Nonlinear state-space models driven by differential equations have been widely used in science. Their statistical inference generally requires computing the mean and covariance matrix of some nonlinear function of the state variables, which can be done in several ways. For example, such computations may be done approximately by Monte Carlo, which is rather computationally expensive. Linear approximation by the first-order Taylor expansion is a fast alternative, but the approximation error becomes non-negligible for strongly nonlinear functions. The unscented transformation was proposed by Julier and Uhlmann (1997) to overcome these difficulties, but it lacks theoretical justification. In this paper, we derive some theoretical properties of the unscented transformation and contrast it with the method of linear approximation. In particular, we derive the convergence rate of the unscented transformation.

Keywords: Unscented transformation; nonlinear transformation; Monte Carlo; linear approximation.

1. INTRODUCTION

Many scientific studies employ nonlinear state-space models to describe the dynamics of a continuous-time state process driven by a system of ordinary differential equations (Diekmann and Heesterbeek 2000; Simon 2006). Simulation-based methods such as sequential Monte Carlo have been popular for estimating such models, but they are computationally expensive (Doucet, de Freitas and Gordon 2001). In the engineering literature, the extended Kalman filter (EKF) has been proposed as a fast alternative, but it approximates the nonlinear system by its first-order Taylor expansion and is therefore subject to rapidly increasing approximation errors with the degree of nonlinearity. To overcome these difficulties, Julier and Uhlmann (1997) proposed the unscented Kalman filter (UKF). The UKF mimics the updating scheme of the Kalman filter, with each updating step requiring the computation of the mean and covariance matrix of some nonlinear function of the state vector, which is done via the unscented transformation (UT). In contrast with Monte Carlo methods, the UT makes use of a small number of deterministic "sigma points" in estimating the mean and covariance matrix of a nonlinearly transformed random variable. Empirical work suggests that the UKF is a promising technique with satisfactory performance; see Julier and Uhlmann (1997), Julier and Uhlmann (2004), and Wan and van der Merwe (2000). However, theoretical justification of the UKF is lacking.

Since the UT plays a pivotal role in the UKF, it is essential to study the theoretical properties of the UT. As the UT is used to approximately compute the mean and covariance matrix of some nonlinear transformation of the state vector over a small time step, denoted by $h$, the main issue we address here concerns the error rate of the UT approximation as $h \to 0$. In Section 2, we elaborate the definition of the UT. The error rates of the UT are then derived in Section 3. We conclude briefly in Section 4.

2. UNSCENTED TRANSFORMATION

The UT is an approximate scheme for computing the mean and covariance matrix of $y = f(x)$, where $x$ is a $c \times 1$ random vector with known mean $E(x)$ and covariance matrix $P$, and $f : \Omega \to \mathbb{R}^q$ is a $q \times 1$ vector function, i.e. $f = (f_1, \ldots, f_q)$, where $\mathbb{R}$ is the set of real numbers and $\Omega \subseteq \mathbb{R}^c$ is the sample space of $x$, i.e. $P(x \in \Omega) = 1$. Let $P_y$ and $P_{xy}$ be the covariance matrix of $y$ and the covariance matrix between $x$ and $y$, respectively. For a constant $\lambda > -c$, the sigma points $\hat x^{(0)}, \ldots, \hat x^{(2c)}$ are defined as follows:
\[
\hat x^{(0)} = E(x), \qquad \hat x^{(i)} = E(x) + \tilde x^{(i)}, \quad i = 1, \ldots, 2c,
\]
\[
\tilde x^{(j)} = \Big(\sqrt{(c+\lambda)P}\Big)_j^T, \qquad \tilde x^{(c+j)} = -\Big(\sqrt{(c+\lambda)P}\Big)_j^T, \quad j = 1, \ldots, c,
\]
where $\sqrt{(c+\lambda)P}$ is the matrix square root of $(c+\lambda)P$ such that
\[
\Big(\sqrt{(c+\lambda)P}\Big)^T \sqrt{(c+\lambda)P} = (c+\lambda)P,
\]
and $\big(\sqrt{(c+\lambda)P}\big)_j$ is the $j$th row of $\sqrt{(c+\lambda)P}$. Here, $\sqrt{(c+\lambda)P}$ can be obtained by the Cholesky decomposition or the singular value decomposition. The constant $\lambda$ controls the distance between the sigma points and $E(x)$: as $\lambda \to -c$, the sigma points tend to be closer to $E(x)$; as $\lambda \to \infty$, they tend to be further away from $E(x)$. Hence, $\lambda$ is a tuning parameter that controls the error between the true mean or covariance matrices and their UT approximations defined below. Let $\hat y^{(i)} = f(\hat x^{(i)})$, $i = 0, \ldots, 2c$. The UT formulas for the approximate mean $\hat y$, covariance matrix $\hat P_y$ of $y$, and covariance matrix $\hat P_{xy}$ between $x$ and $y$ are
\[
\hat y = \sum_{i=0}^{2c} W^{(i)} \hat y^{(i)}, \qquad
\hat P_y = \sum_{i=0}^{2c} W^{(i)} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T, \qquad
\hat P_{xy} = \sum_{i=0}^{2c} W^{(i)} (\hat x^{(i)} - E(x))(\hat y^{(i)} - \hat y)^T,
\tag{1}
\]
where $W^{(0)} = \lambda/(c+\lambda)$ and $W^{(i)} = 1/\{2(c+\lambda)\}$, $i = 1, \ldots, 2c$. Hence, the UT estimates are weighted sample analogues based on the sigma points; see Simon (2006). On the other hand, the linear approximation scheme used by the EKF approximates the mean and covariance matrices via the first-order Taylor expansion, resulting in the following formulas:
\[
\hat y_L = f(E(x)), \qquad \hat P_{y,L} = HPH^T, \qquad \hat P_{xy,L} = PH^T,
\tag{2}
\]
for estimating $E(y)$, $P_y$, and $P_{xy}$, respectively, where $H$ is the Jacobian matrix of $f$ evaluated at $E(x)$. Clearly the preceding UT method is computationally more efficient than Monte Carlo simulation. In addition, the UT does not require calculating the Jacobian matrix. Below, we show that the UT method provides a more accurate approximation than linear approximation; see Section 3.
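Formulas (1) and (2) are straightforward to implement. The following Python sketch is our own minimal illustration, not part of the paper: the function names are ours, and a central-difference Jacobian stands in for the analytic $H$ of (2).

    import numpy as np

    def unscented_transform(f, m, P, lam=1.0):
        # Sigma points and weights as in Section 2; the Cholesky factor L with
        # L L^T = (c + lam) P serves as the matrix square root.
        m = np.asarray(m, dtype=float)
        c = m.size
        L = np.linalg.cholesky((c + lam) * P)
        X = np.vstack([m, m + L.T, m - L.T])           # 2c + 1 sigma points
        W = np.full(2 * c + 1, 1.0 / (2.0 * (c + lam)))
        W[0] = lam / (c + lam)
        Y = np.array([np.atleast_1d(f(x)) for x in X])
        y_hat = W @ Y                                   # UT mean, equation (1)
        dY, dX = Y - y_hat, X - m
        P_y = (W[:, None] * dY).T @ dY                  # UT covariance of y
        P_xy = (W[:, None] * dX).T @ dY                 # UT cross-covariance
        return y_hat, P_y, P_xy

    def linear_approx(f, m, P, eps=1e-6):
        # EKF-style estimates from equation (2); the Jacobian is approximated
        # by central differences purely for illustration.
        m = np.asarray(m, dtype=float)
        c = m.size
        y0 = np.atleast_1d(f(m))
        H = np.empty((y0.size, c))
        for j in range(c):
            e = np.zeros(c)
            e[j] = eps
            H[:, j] = (np.atleast_1d(f(m + e)) - np.atleast_1d(f(m - e))) / (2 * eps)
        return y0, H @ P @ H.T, P @ H.T

    # Example call (an assumed test function):
    # f = lambda x: np.array([np.sin(x[0]) + x[1] ** 2, x[0] * x[1]])
    # y_hat, P_y, P_xy = unscented_transform(f, np.zeros(2), 0.01 * np.eye(2))

Note that the UT evaluates $f$ only $2c + 1$ times and needs no derivatives, whereas the linearization needs $H$ (here obtained by $2c$ extra evaluations of $f$).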

3. PROPERTIES OF UT

Define $\tilde x = x - E(x) = (x_1 - E(x_1), \ldots, x_c - E(x_c))^T$. The derivative $\partial^i f(E(x))/(\partial x_1^{k_1} \cdots \partial x_c^{k_c})$ is the derivative of $f$ evaluated at $E(x)$, where $i = \sum_{j=1}^{c} k_j$ and the $k_j$'s are non-negative integers. Assume $f$ is an analytic function over $\Omega$. Then, the Taylor series of $f$ around $E(x)$ is given as follows:
\[
f(x) = f(E(x)) + \sum_{i=1}^{\infty} \frac{1}{i!} \Big( \tilde x_1 \frac{\partial}{\partial x_1} + \cdots + \tilde x_c \frac{\partial}{\partial x_c} \Big)^i f(E(x)).
\]
Define $D_{\tilde x}^k f$ as
\[
D_{\tilde x}^k f = \Big( \sum_{i=1}^{c} \tilde x_i \frac{\partial}{\partial x_i} \Big)^k f(E(x)).
\]
We assume that $D_{\tilde x}^k f$ is integrable on $\Omega$ for any non-negative integer $k$. We also assume that there exists $Y$ with finite absolute first moment such that $\big| \sum_{i=0}^{m} D_{\tilde x}^i f / i! \big| \le Y$ a.e. on $\Omega$ for all $m$. Since $\sum_{i=0}^{m} D_{\tilde x}^i f / i! \to f$, we have $\lim_{m \to \infty} E\big( \sum_{i=0}^{m} D_{\tilde x}^i f / i! \big) = \lim_{m \to \infty} \sum_{i=0}^{m} E\big( D_{\tilde x}^i f / i! \big) = E\{f(x)\}$ by the dominated convergence theorem. In practice, these conditions are satisfied if i) there exists a constant $Q > 0$ such that $E|\tilde x_1^{k_1} \cdots \tilde x_c^{k_c}| \le Q^{\sum_{j=1}^{c} k_j}$ for any non-negative integers $k_j$; and ii) for all $j$, $1 \le j \le q$, there exists some constant $R > 0$ such that
\[
\Big| \frac{\partial^{\sum_{i=1}^{c} k_i} f_j}{\partial x_1^{k_1} \cdots \partial x_c^{k_c}} \Big| \le R^{\sum_{i=1}^{c} k_i},
\]
for any non-negative integers $k_j$, $1 \le j \le c$. Thus, we can derive the following results:



\[
E(y) = f(E(x)) + \frac{1}{2!} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \sum_{j=3}^{\infty} \frac{1}{j!} E[D_{\tilde x}^j f],
\tag{3}
\]
\[
\hat y = f(E(x)) + \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \sum_{j=2}^{\infty} \frac{1}{(2j)!} D_{\tilde x^{(i)}}^{2j} f,
\tag{4}
\]
where $P_{ij}$ is the $(i,j)$th entry of $P$. We consider the case that the random variable $x = x_h$ is indexed by a positive number $h$ such that $x = E(x) + o_p(h)$, where $E(x)$ is independent of $h$, in which case equation (3) provides a heuristic expansion of $E(y) = E\{f(x_h)\}$ with summands of orders $h^j$, $j = 0, 1, \cdots$. The order of the error rate of the UT estimator of $E(y)$ may then be studied by comparing equations (3) and (4). If $x$ has a symmetric distribution about its mean $E(x)$, we have
\[
E(y) = f(E(x)) + \frac{1}{2!} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \sum_{j=2}^{\infty} \frac{1}{(2j)!} E[D_{\tilde x}^{2j} f].
\tag{5}
\]
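Expansions (3)-(5) can be checked symbolically in simple cases. The following sketch is our own illustration under an assumed setting ($c = 1$, $x \sim N(0, p)$, $f(x) = e^x$, so $E\{f(x)\} = e^{p/2}$): it confirms that the UT mean agrees with $E(y)$ through the $O(p)$ term, with the first discrepancy at $O(p^2)$, as (4) and (5) indicate.

    import sympy as sp

    p = sp.symbols('p', positive=True)        # variance of x, c = 1
    lam = sp.symbols('lambda_', positive=True)

    true_mean = sp.exp(p / 2)                 # E{exp(x)} for x ~ N(0, p)
    # UT mean from (1): sigma points 0 and +/- sqrt((1 + lam) p).
    ut_mean = lam / (1 + lam) + sp.cosh(sp.sqrt((1 + lam) * p)) / (1 + lam)

    diff = sp.series(true_mean - ut_mean, p, 0, 3)
    print(sp.simplify(diff))                  # (1/8 - (1 + lam)/24) p**2 + O(p**3)

Incidentally, in this Gaussian example the $p^2$ coefficient vanishes when $c + \lambda = 3$, consistent with the familiar heuristic choice $c + \lambda = 3$ for Gaussian $x$ (Julier and Uhlmann 1997).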

Define $A = \{(a,b) \mid a, b \in \mathbb{N}, (a,b) \ne (1,1)\}$ and $B = \{(a,b) \mid a, b \in \mathbb{N}, a \ne 1, b \ne 1, (a,b) \ne (2,2)\}$, where $\mathbb{N}$ is the set of natural numbers. Then, we have
\[
P_y = HPH^T - \frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T
+ E\Big[ \sum_{(i,j)\in A} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big]
- \Big[ \sum_{(i,j)\in B} \frac{1}{i!j!} E(D_{\tilde x}^i f) E(D_{\tilde x}^j f)^T \Big],
\]
\[
\hat P_y = HPH^T - \frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T
+ \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \sum_{\substack{k+\ell\ \mathrm{even} \\ (k,\ell)\in A}} \frac{1}{k!\ell!} (D_{\tilde x^{(i)}}^k f)(D_{\tilde x^{(i)}}^\ell f)^T \Big]
- \sum_{(k,\ell)\in A} \Big[ \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T \Big].
\tag{6}
\]
If the distribution of $x$ is symmetric about its mean,
\[
P_y = HPH^T - \frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T
+ E\Big[ \sum_{\substack{i+j\ \mathrm{even} \\ (i,j)\in A}} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big]
- \Big[ \sum_{(i,j)\in A} \frac{1}{(2i)!(2j)!} E(D_{\tilde x}^{2i} f) E(D_{\tilde x}^{2j} f)^T \Big].
\tag{7}
\]
In addition,
\[
P_{xy} = PH^T + \sum_{i=2}^{\infty} \frac{1}{i!} E\big[ \tilde x (D_{\tilde x}^i f)^T \big], \qquad
\hat P_{xy} = PH^T + \sum_{k=1}^{\infty} \frac{1}{(2k+1)!} \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} \big( D_{\tilde x^{(i)}}^{2k+1} f \big)^T.
\tag{8}
\]
When the distribution of $x$ is symmetric about the mean $E(x)$,
\[
P_{xy} = PH^T + \sum_{i=1}^{\infty} \frac{1}{(2i+1)!} E\big[ \tilde x (D_{\tilde x}^{2i+1} f)^T \big].
\tag{9}
\]

Detailed derivations of (3)-(9) can be found in the Appendix. From (3)-(9), we have the following lemma:

Lemma 1. Assume that

1. $f$ is an analytic function;

2. $D_{\tilde x}^k f$ is integrable on $\Omega$ for any non-negative integer $k$;

3. there exists $Y$ with finite absolute first moment such that $\big| \sum_{i=0}^{m} D_{\tilde x}^i f / i! \big| \le Y$ a.e. on $\Omega$ for all $m$.

Then,

1. $\displaystyle E(y) - \hat y = \sum_{j=3}^{\infty} \frac{1}{j!} E[D_{\tilde x}^j f] - \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \sum_{j=2}^{\infty} \frac{1}{(2j)!} D_{\tilde x^{(i)}}^{2j} f$;

2. 
\[
P_y - \hat P_y = E\Big[ \sum_{(i,j)\in A} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big]
- \Big[ \sum_{(i,j)\in B} \frac{1}{i!j!} E(D_{\tilde x}^i f) E(D_{\tilde x}^j f)^T \Big]
- \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \sum_{\substack{k+\ell\ \mathrm{even} \\ (k,\ell)\in A}} \frac{1}{k!\ell!} (D_{\tilde x^{(i)}}^k f)(D_{\tilde x^{(i)}}^\ell f)^T \Big]
+ \sum_{(k,\ell)\in A} \Big[ \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T \Big];
\]

3. $\displaystyle P_{xy} - \hat P_{xy} = \sum_{i=2}^{\infty} \frac{1}{i!} E\big[ \tilde x (D_{\tilde x}^i f)^T \big] - \sum_{k=1}^{\infty} \frac{1}{(2k+1)!} \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} (D_{\tilde x^{(i)}}^{2k+1} f)^T$.

Furthermore, if the distribution of $x$ is symmetric about the mean $E(x)$, then

1. $\displaystyle E(y) - \hat y = \sum_{j=2}^{\infty} \frac{1}{(2j)!} E[D_{\tilde x}^{2j} f] - \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \sum_{j=2}^{\infty} \frac{1}{(2j)!} D_{\tilde x^{(i)}}^{2j} f$;

2. 
\[
P_y - \hat P_y = E\Big[ \sum_{\substack{i+j\ \mathrm{even} \\ (i,j)\in A}} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big]
- \Big[ \sum_{(i,j)\in A} \frac{1}{(2i)!(2j)!} E(D_{\tilde x}^{2i} f) E(D_{\tilde x}^{2j} f)^T \Big]
- \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \sum_{\substack{k+\ell\ \mathrm{even} \\ (k,\ell)\in A}} \frac{1}{k!\ell!} (D_{\tilde x^{(i)}}^k f)(D_{\tilde x^{(i)}}^\ell f)^T \Big]
+ \sum_{(k,\ell)\in A} \Big[ \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T \Big];
\]

3. $\displaystyle P_{xy} - \hat P_{xy} = \sum_{i=1}^{\infty} \frac{1}{(2i+1)!} E\big[ \tilde x (D_{\tilde x}^{2i+1} f)^T \big] - \sum_{k=1}^{\infty} \frac{1}{(2k+1)!} \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} (D_{\tilde x^{(i)}}^{2k+1} f)^T$.
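As a numerical illustration of Lemma 1 (our own example, not from the paper), take $c = 1$, $x \sim N(m, p)$, and $f(x) = \sin(x)$, so that $E\{f(x)\} = \sin(m) e^{-p/2}$. The $j = 2$ terms of the symmetric-case expansion predict a bias of approximately $\sin(m)\, p^2 \{1/8 - (1+\lambda)/24\}$, and this leading term tracks the actual bias closely for small $p$:

    import numpy as np

    # c = 1, x ~ N(m, p), f(x) = sin(x); E{f(x)} = sin(m) exp(-p/2).
    m, lam = 0.3, 1.0
    for p in (0.1, 0.05, 0.025):
        exact = np.sin(m) * np.exp(-p / 2.0)
        s = np.sqrt((1 + lam) * p)
        ut = lam / (1 + lam) * np.sin(m) \
             + (np.sin(m + s) + np.sin(m - s)) / (2 * (1 + lam))
        predicted = np.sin(m) * p ** 2 * (1.0 / 8.0 - (1 + lam) / 24.0)
        print(f"p={p:.3f}  actual bias={exact - ut:.3e}  leading term={predicted:.3e}")

Each quartering of $p$'s square cuts both columns accordingly, and their ratio tends to one as $p \to 0$.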

For the linear approximation scheme defined by (2), we have the following parallel results:

Lemma 2. Under the same conditions as Lemma 1, the linear approximation based on the first-order Taylor expansion enjoys the following properties:

1. $\displaystyle E(y) - \hat y_L = \frac{1}{2!} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \sum_{j=3}^{\infty} \frac{1}{j!} E[D_{\tilde x}^j f]$;

2. $\displaystyle P_y - \hat P_{y,L} = -\frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T + E\Big[ \sum_{(i,j)\in A} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big] - \Big[ \sum_{(i,j)\in B} \frac{1}{i!j!} E(D_{\tilde x}^i f) E(D_{\tilde x}^j f)^T \Big]$;

3. $\displaystyle P_{xy} - \hat P_{xy,L} = \sum_{i=2}^{\infty} \frac{1}{i!} E\big[ \tilde x (D_{\tilde x}^i f)^T \big]$.

If the distribution of $x$ is, furthermore, symmetric about its mean, then

1. $\displaystyle E(y) - \hat y_L = \frac{1}{2!} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \sum_{j=2}^{\infty} \frac{1}{(2j)!} E[D_{\tilde x}^{2j} f]$;

2. $\displaystyle P_y - \hat P_{y,L} = -\frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T + E\Big[ \sum_{\substack{i+j\ \mathrm{even} \\ (i,j)\in A}} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big] - \Big[ \sum_{(i,j)\in A} \frac{1}{(2i)!(2j)!} E(D_{\tilde x}^{2i} f) E(D_{\tilde x}^{2j} f)^T \Big]$;

3. $\displaystyle P_{xy} - \hat P_{xy,L} = \sum_{i=1}^{\infty} \frac{1}{(2i+1)!} E\big[ \tilde x (D_{\tilde x}^{2i+1} f)^T \big]$.

Remark 1. These results show that the UT estimator of the mean $E(y) = E\{f(x)\}$ matches the true value correctly up to the second-order terms in the expansion (3), whereas the linearization estimator $f(E(x))$ only matches the true mean up to the first-order term. If the distribution of $x$ is symmetric about its mean, the UT estimator of $E(y)$ matches the true value correctly up to the third-order terms in the expansion. The UT estimator of the covariance matrix $P_y$ matches more terms than its linear approximation counterpart. For approximating $P_{xy}$, the UT estimator and its linear approximation counterpart provide the same order of approximation. Nevertheless, in all cases, the UT estimator has other terms that compensate for the remaining higher-order terms. This compensation appears to be better when $x$ has a symmetric distribution about its mean.

Remark 2. Lemma 1 shows that if $\lambda$ is close to $-c$ or very large, the bias may be severe unless the nonlinearity of $f$ is minimal. Choosing an optimal $\lambda$ is a difficult problem because it is generally infeasible to calculate the expectations stated in Lemma 1 for a nonlinear $f$. However, they can be calculated for certain distributions, for example, normal distributions. When $x$ is normal, $E(x_1^{k_1} \cdots x_c^{k_c})$ can be expressed as a function of the components of $P$; see Isserlis (1918), Holmquist (1988), and Triantafyllopoulos (2003). So we can calculate the mean squared error (MSE) in this case. From Lemma 1, the bias can be approximated as
\[
E(y) - \hat y \approx \sum_{j=2}^{m} \frac{1}{(2j)!} E[D_{\tilde x}^{2j} f] - \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \sum_{j=2}^{m} \frac{1}{(2j)!} D_{\tilde x^{(i)}}^{2j} f
\]
for some $m$, where the first term can be expressed as a function of the components of $P$. The corresponding variance can be derived based on (1). Then, we can minimize the MSE to obtain an optimal $\lambda$, as illustrated in the sketch below.
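As a concrete, entirely illustrative version of this recipe for Gaussian $x$, the following sketch scans a grid of $\lambda$ values and picks the one minimizing the squared bias of the UT mean (the dominant part of the error here); the test function, the grid, and the use of Monte Carlo in place of the exact Gaussian-moment calculations of Isserlis (1918) are our own assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m = np.array([0.5, -1.0])
    P = np.array([[0.20, 0.05], [0.05, 0.10]])
    f = lambda x: np.array([np.exp(x[..., 0]), np.sin(x[..., 1])]).T

    xs = rng.multivariate_normal(m, P, size=200_000)
    mc_mean = f(xs).mean(axis=0)                 # "ground truth" E{f(x)}

    def ut_mean(lam):
        c = len(m)
        L = np.linalg.cholesky((c + lam) * P)
        X = np.vstack([m, m + L.T, m - L.T])
        W = np.full(2 * c + 1, 1 / (2 * (c + lam)))
        W[0] = lam / (c + lam)
        return W @ f(X)

    grid = np.linspace(-1.5, 5.0, 27)            # lambda > -c = -2
    sq_bias = [np.sum((ut_mean(l) - mc_mean) ** 2) for l in grid]
    print("best lambda on grid:", grid[int(np.argmin(sq_bias))])

With exact normal moments, the Monte Carlo step drops out and the minimization becomes deterministic.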

Next, define $D_k$ as follows:

Definition 1.
\[
D_k = \big\{ f : \Omega \subset \mathbb{R}^c \to \mathbb{R}^q \mid f \text{ is a polynomial of degree at most } k \big\}.
\]

Theorem 3.

1. If $f \in D_2$, then $\hat y = E(y)$. In addition, if the distribution of $x$ is symmetric about its mean and $f \in D_3$, then $\hat y = E(y)$.

2. If $f \in D_1$, i.e., $f$ is linear, then $\hat P_y = P_y$ and $\hat P_{xy} = P_{xy}$.

Proof. This follows from the Taylor series expansions in (3)-(9).
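A quick numerical check of Theorem 3, part 1 (our own illustration; the quadratic test function and the parameters are assumptions): for a quadratic $f$, the exact mean is $E(y) = f(E(x)) + \frac{1}{2}\sum_{ij} P_{ij}\, \partial^2 f/\partial x_i \partial x_j$, and the UT mean should reproduce it exactly for any admissible $\lambda$.

    import numpy as np

    m = np.array([1.0, 2.0])
    P = np.array([[0.3, 0.1], [0.1, 0.2]])
    A = np.array([[2.0, -1.0], [-1.0, 4.0]])      # symmetric quadratic form
    f = lambda x: np.array([x @ A @ x + 3 * x[0] - x[1]])

    # For quadratic f: E(y) = f(m) + trace(A P).
    exact = f(m) + np.trace(A @ P)

    for lam in (-1.0, 0.5, 3.0):                   # any lambda > -c = -2
        c = len(m)
        L = np.linalg.cholesky((c + lam) * P)
        X = np.vstack([m, m + L.T, m - L.T])
        W = np.full(2 * c + 1, 1 / (2 * (c + lam)))
        W[0] = lam / (c + lam)
        ut = W @ np.array([f(x) for x in X])
        assert np.allclose(ut, exact), (lam, ut, exact)
    print("UT mean exact for quadratic f at all tested lambda")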

The UKF has been applied to estimate nonlinear state-space models where the state equation is an ordinary differential equation or a stochastic differential equation with observations at times $t_1, \ldots, t_n$, where $t_{i+1} - t_i = h$ for $i = 1, \ldots, n-1$ and, with no loss of generality, $h \le 1$. To solve these differential equations, one may discretize the system equation using the Euler method or a Runge-Kutta method (Boyce and DiPrima 2004), in which case, conditionally, $x$ has covariance matrix $P = hP^*$ for some $h$-free positive definite matrix $P^*$. Thus, it is interesting to compare the error rates of the UT estimators of $E(y) = E\{f(x)\}$, $P_y$, $P_{xy}$ and their counterparts from linear approximation when $P = hP^*$. We allow that $f$ may depend on $h$. For simplicity, we assume that the distribution of $x$ is symmetric about its mean. Then, we can obtain the following theorem:

Theorem 4. Suppose $P = hP^*$ and $x$ has a symmetric distribution about its mean. Assume

1. there exists an $h$-free constant $M > 0$ such that $E|\tilde x_1^{k_1} \cdots \tilde x_c^{k_c}| \le h^{i/2} M^{i/2}$ for any non-negative integers $k_j$ with $\sum_{j=1}^{c} k_j = i$;

2. for all $j$, $1 \le j \le q$, and for all $0 < h \le 1$, there exists some $h$-free constant $K > 0$ such that
\[
\Big| \frac{\partial^i f_j(E(x))}{\partial x_1^{k_1} \cdots \partial x_c^{k_c}} \Big| \le K^i,
\]
for any non-negative integers $k_j$, $1 \le j \le c$, where $i = \sum_{j=1}^{c} k_j$.

Then,
\[
E(y) - \hat y = O(h^2), \qquad P_y - \hat P_y = O(h^2), \qquad P_{xy} - \hat P_{xy} = O(h^2),
\]
\[
E(y) - \hat y_L = O(h), \qquad P_y - \hat P_{y,L} = O(h^2), \qquad P_{xy} - \hat P_{xy,L} = O(h^2).
\]

Proof. See the Appendix.

Remark 5. Condition 1 of Theorem 4 is motivated by the fact that $\tilde x_1^{k_1} \cdots \tilde x_c^{k_c} = O_p(h^{i/2})$ because $\tilde x_j = O_p(\sqrt{h})$. Condition 2 of Theorem 4 is satisfied if the function $f$ is a polynomial.

Remark 6. Theorem 4 shows that the UT estimator of $E(y)$ has a smaller error rate than the estimator from linear approximation, although the estimators of the covariance matrices $P_y$ and $P_{xy}$ have the same error rate. This suggests that discretization methods based on the UKF would be more accurate than the EKF, which is shown rigorously by Ahn and Chan (2011).
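The rates in Theorem 4 are easy to observe empirically. The sketch below is our own illustration under assumed choices ($c = 1$, $f(x) = e^x$, $P^* = 1$, Gaussian and hence symmetric $x$): halving $h$ cuts the UT mean error roughly by four and the linearization error roughly by two.

    import numpy as np

    # x ~ N(m, h P*); UT mean error ~ O(h^2), linearization error ~ O(h).
    m = np.array([0.4])
    Pstar = np.array([[1.0]])

    for h in (0.4, 0.2, 0.1, 0.05):
        P = h * Pstar
        exact = np.exp(m[0] + P[0, 0] / 2.0)       # E{exp(x)}
        lam, c = 1.0, 1
        s = np.sqrt((c + lam) * P[0, 0])           # sigma points m, m +/- s
        ut = lam / (c + lam) * np.exp(m[0]) \
             + (np.exp(m[0] + s) + np.exp(m[0] - s)) / (2 * (c + lam))
        lin = np.exp(m[0])                          # EKF mean estimate f(E(x))
        print(f"h={h:5.2f}  UT error={abs(ut - exact):.2e}  "
              f"linear error={abs(lin - exact):.2e}")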

4. CONCLUSION

We have derived some theoretical properties of the UT and of linear approximation. The UT makes use of some deterministic sigma points, in contrast with the Monte Carlo method. In addition, it does not require calculating the Jacobian matrix, unlike the method of linear approximation. Derivations based on Taylor expansions show that the estimated mean and covariance matrix from the UT match their true values to higher-order terms than the method of linear approximation. Thus, the UT is faster than Monte Carlo and more accurate than the method of linear approximation.

ACKNOWLEDGEMENTS

This work was partially supported by the U.S. National Science Foundation.

REFERENCES

Ahn, K. W., and Chan, K. S. (2011), "Nonlinear state-space modeling via the unscented Kalman filter," Technical report, Medical College of Wisconsin.

Boyce, W. E., and DiPrima, R. C. (2004), Elementary Differential Equations and Boundary Value Problems, 8th edn, Wiley.

Diekmann, O., and Heesterbeek, J. A. P. (2000), Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation, John Wiley & Sons Ltd.

Doucet, A., de Freitas, J. F. G., and Gordon, N. J. (2001), Sequential Monte Carlo Methods in Practice, New York: Springer-Verlag.

Holmquist, B. (1988), "Moments and cumulants of the multivariate normal distribution," Stochastic Analysis and Applications, 6, 273-278.

Isserlis, L. (1918), "On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables," Biometrika, 12, 134-139.

Julier, S. J., and Uhlmann, J. K. (1997), "A new extension of the Kalman filter to nonlinear systems," in Proc. of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, Orlando, Florida, Vol. Multi Sensor Fusion, Tracking and Resource Management II.

Julier, S. J., and Uhlmann, J. K. (2004), "Unscented filtering and nonlinear estimation," Proceedings of the IEEE, 92, 401-422.

Simon, D. (2006), Optimal State Estimation, New Jersey: John Wiley & Sons, Inc.

Triantafyllopoulos, K. (2003), "On the central moments of the multidimensional Gaussian distribution," The Mathematical Scientist, 28, 125-128.

Wan, E. A., and van der Merwe, R. (2000), "The unscented Kalman filter for nonlinear estimation," in Proceedings of the IEEE 2000 Symposium on Adaptive Systems for Signal Processing, Communications, and Control.

APPENDIX A. DERIVATION OF THE RESULTS FOR LEMMA 1

For simplicity, assume the pdf of the random vector $x$ is symmetric about its mean, with known mean $E(x)$ and covariance matrix $P$, and $y = f(x)$. Assume $f$ is an analytic function. We also assume that $D_{\tilde x}^k f$ is integrable on $\Omega$ for any non-negative integer $k$ and that there exists $Y$ with finite absolute first moment such that $\big| \sum_{i=0}^{m} D_{\tilde x}^i f / i! \big| \le Y$ a.e. on $\Omega$ for all $m$. Replacing $\hat y^{(i)}$ by its Taylor expansion around $E(x)$, we get
\[
\hat y = W^{(0)} \hat y^{(0)} + \sum_{i=1}^{2c} W^{(i)} \hat y^{(i)}
= \frac{\lambda}{c+\lambda} f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( f(E(x)) + D_{\tilde x^{(i)}} f + \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \cdots \Big)
= f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( D_{\tilde x^{(i)}} f + \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \cdots \Big).
\]
Now notice that
\[
\sum_{j=1}^{2c} D_{\tilde x^{(j)}}^{2k+1} f = 0,
\]
because $\tilde x^{(j)} = -\tilde x^{(c+j)}$, $j = 1, \ldots, c$; see Simon (2006) for the details. Therefore,
\[
\hat y = f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \frac{1}{4!} D_{\tilde x^{(i)}}^4 f + \frac{1}{6!} D_{\tilde x^{(i)}}^6 f + \cdots \Big).
\]
In addition, similar to Simon (2006), we can obtain
\[
\frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \frac{1}{2!} D_{\tilde x^{(i)}}^2 f = \frac{1}{2} \sum_{k=1}^{c} \sum_{\ell=1}^{c} P_{k\ell} \frac{\partial^2}{\partial x_k \partial x_\ell} f(E(x)),
\tag{A.1}
\]
because $\tilde x^{(i)} = -\tilde x^{(c+i)}$. Thus,
\[
\hat y = f(E(x)) + \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( \frac{1}{4!} D_{\tilde x^{(i)}}^4 f + \frac{1}{6!} D_{\tilde x^{(i)}}^6 f + \cdots \Big).
\tag{A.2}
\]
Similarly,
\[
E(y) = f(E(x)) + \frac{1}{2!} E[D_{\tilde x}^2 f] + \frac{1}{4!} E[D_{\tilde x}^4 f] + \cdots.
\]
It can be easily shown that
\[
\frac{1}{2!} E[D_{\tilde x}^2 f] = \frac{1}{2!} \sum_{k=1}^{c} \sum_{\ell=1}^{c} P_{k\ell} \frac{\partial^2}{\partial x_k \partial x_\ell} f(E(x)).
\]
As a result, we have
\[
E(y) = f(E(x)) + \frac{1}{2!} \Big[ \sum_{i=1}^{c} \sum_{j=1}^{c} P_{ij} \frac{\partial^2}{\partial x_i \partial x_j} f(E(x)) \Big] + \frac{1}{4!} E[D_{\tilde x}^4 f] + \cdots.
\tag{A.3}
\]
Next, we turn our attention to the covariance structure: $P_y = E[(y - E(y))(y - E(y))^T]$. Based on the results obtained so far, we have
\[
y - E(y) = \Big[ f(E(x)) + D_{\tilde x} f + \frac{1}{2!} D_{\tilde x}^2 f + \cdots \Big] - \Big[ f(E(x)) + \frac{1}{2!} E(D_{\tilde x}^2 f) + \frac{1}{4!} E(D_{\tilde x}^4 f) + \cdots \Big]
= \Big[ D_{\tilde x} f + \frac{1}{2!} D_{\tilde x}^2 f + \cdots \Big] - \Big[ \frac{1}{2!} E(D_{\tilde x}^2 f) + \frac{1}{4!} E(D_{\tilde x}^4 f) + \cdots \Big].
\]
Then, using the definition of $A = \{(a,b) \mid a, b \in \mathbb{N}\} \setminus \{(1,1)\}$, where $\mathbb{N}$ is the set of natural numbers,
\[
P_y = E[(y - E(y))(y - E(y))^T]
= E\Big[ \Big\{ \Big( D_{\tilde x} f + \frac{1}{2!} D_{\tilde x}^2 f + \frac{1}{3!} D_{\tilde x}^3 f + \cdots \Big) - \Big( \frac{1}{2!} E(D_{\tilde x}^2 f) + \frac{1}{4!} E(D_{\tilde x}^4 f) + \frac{1}{6!} E(D_{\tilde x}^6 f) + \cdots \Big) \Big\}
\Big\{ \Big( D_{\tilde x} f + \frac{1}{2!} D_{\tilde x}^2 f + \frac{1}{3!} D_{\tilde x}^3 f + \cdots \Big) - \Big( \frac{1}{2!} E(D_{\tilde x}^2 f) + \frac{1}{4!} E(D_{\tilde x}^4 f) + \frac{1}{6!} E(D_{\tilde x}^6 f) + \cdots \Big) \Big\}^T \Big]
\]
\[
= E[(D_{\tilde x} f)(D_{\tilde x} f)^T] - \frac{1}{4} E(D_{\tilde x}^2 f) E(D_{\tilde x}^2 f)^T
+ E\Big[ \sum_{\substack{i+j\ \mathrm{even} \\ (i,j)\in A}} \frac{1}{i!j!} (D_{\tilde x}^i f)(D_{\tilde x}^j f)^T \Big]
- \Big[ \sum_{(i,j)\in A} \frac{1}{(2i)!(2j)!} E(D_{\tilde x}^{2i} f) E(D_{\tilde x}^{2j} f)^T \Big],
\tag{A.4}
\]
where all odd-powered terms in the expected value are zero. Here,
\[
E[(D_{\tilde x} f)(D_{\tilde x} f)^T]
= E\Big[ \Big( \sum_{i=1}^{c} \tilde x_i \frac{\partial f(E(x))}{\partial x_i} \Big) \Big( \sum_{i=1}^{c} \tilde x_i \frac{\partial f(E(x))}{\partial x_i} \Big)^T \Big]
= E\Big[ \sum_{i=1}^{c} \sum_{j=1}^{c} \tilde x_i \tilde x_j \frac{\partial f(E(x))}{\partial x_i} \frac{\partial f(E(x))^T}{\partial x_j} \Big]
= \sum_{i=1}^{c} \sum_{j=1}^{c} H_i E(\tilde x_i \tilde x_j) H_j^T
= \sum_{i=1}^{c} \sum_{j=1}^{c} H_i P_{ij} H_j^T = HPH^T,
\tag{A.5}
\]
where $H$ is the Jacobian matrix of $f(x)$ and $H_i$ is the $i$th column of $H$. Now we consider the approximate covariance matrix $\hat P_y$, defined as
\[
\hat P_y = \sum_{i=0}^{2c} W^{(i)} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T
= \frac{\lambda}{c+\lambda} (\hat y^{(0)} - \hat y)(\hat y^{(0)} - \hat y)^T + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T.
\]
Consider $\frac{\lambda}{c+\lambda} (\hat y^{(0)} - \hat y)(\hat y^{(0)} - \hat y)^T$ first. By the Taylor expansion, we get
\[
\hat y^{(0)} - \hat y = f(E(x)) - \Big\{ f(E(x)) + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \frac{1}{4!} D_{\tilde x^{(i)}}^4 f + \cdots \Big) \Big\}
= - \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big( \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \frac{1}{4!} D_{\tilde x^{(i)}}^4 f + \cdots \Big).
\]
Thus, we have
\[
\frac{\lambda}{c+\lambda} (\hat y^{(0)} - \hat y)(\hat y^{(0)} - \hat y)^T
= \frac{\lambda}{c+\lambda} \frac{1}{4(c+\lambda)^2} \Big( \sum_{i=1}^{2c} \frac{1}{2!} D_{\tilde x^{(i)}}^2 f \Big) \Big( \sum_{i=1}^{2c} \frac{1}{2!} D_{\tilde x^{(i)}}^2 f \Big)^T
+ \frac{\lambda}{c+\lambda} \sum_{(k,\ell)\in A} \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T
\]
\[
= \frac{\lambda}{c+\lambda} \frac{1}{4} E[D_{\tilde x}^2 f] E[D_{\tilde x}^2 f]^T
+ \frac{\lambda}{c+\lambda} \sum_{(k,\ell)\in A} \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T.
\]
Next, consider $\frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T$. Using (A.1) and the fact that
\[
\frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} (D_{\tilde x^{(i)}} f)(D_{\tilde x^{(i)}} f)^T
= \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \sum_{k=1}^{c} \sum_{\ell=1}^{c} \tilde x_k^{(i)} \tilde x_\ell^{(i)} \frac{\partial f(E(x))}{\partial x_k} \frac{\partial f(E(x))^T}{\partial x_\ell}
= \frac{1}{c+\lambda} \sum_{i=1}^{c} \sum_{k=1}^{c} \sum_{\ell=1}^{c} \tilde x_k^{(i)} \tilde x_\ell^{(i)} \frac{\partial f(E(x))}{\partial x_k} \frac{\partial f(E(x))^T}{\partial x_\ell}
\quad (\text{using } \tilde x_j^{(i)} = -\tilde x_j^{(c+i)})
\]
\[
= \sum_{k=1}^{c} \sum_{\ell=1}^{c} P_{k\ell} \frac{\partial f(E(x))}{\partial x_k} \frac{\partial f(E(x))^T}{\partial x_\ell} = HPH^T
= E[(D_{\tilde x} f)(D_{\tilde x} f)^T] \quad (\text{by (A.5)}),
\]
we can show
\[
\frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T
= \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \Big\{ \Big( D_{\tilde x^{(i)}} f + \frac{1}{2!} D_{\tilde x^{(i)}}^2 f + \frac{1}{3!} D_{\tilde x^{(i)}}^3 f + \cdots \Big)
- \frac{1}{2(c+\lambda)} \sum_{j=1}^{2c} \Big( \frac{1}{2!} D_{\tilde x^{(j)}}^2 f + \frac{1}{4!} D_{\tilde x^{(j)}}^4 f + \cdots \Big) \Big\} \Big\{ \cdots \Big\}^T \Big]
\]
\[
= E[(D_{\tilde x} f)(D_{\tilde x} f)^T] - \Big( 1 + \frac{\lambda}{c+\lambda} \Big) \frac{1}{4} E[D_{\tilde x}^2 f] E[D_{\tilde x}^2 f]^T
+ \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \sum_{\substack{k+\ell\ \mathrm{even} \\ (k,\ell)\in A}} \frac{1}{k!\ell!} (D_{\tilde x^{(i)}}^k f)(D_{\tilde x^{(i)}}^\ell f)^T \Big]
- \sum_{(k,\ell)\in A} \Big[ \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \Big( 1 + \frac{\lambda}{c+\lambda} \Big) \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T \Big].
\tag{A.6}
\]
Thus, we have
\[
\hat P_y = \frac{\lambda}{c+\lambda} (\hat y^{(0)} - \hat y)(\hat y^{(0)} - \hat y)^T + \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} (\hat y^{(i)} - \hat y)(\hat y^{(i)} - \hat y)^T
\]
\[
= E[(D_{\tilde x} f)(D_{\tilde x} f)^T] - \frac{1}{4} E[D_{\tilde x}^2 f] E[D_{\tilde x}^2 f]^T
+ \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \Big[ \sum_{\substack{k+\ell\ \mathrm{even} \\ (k,\ell)\in A}} \frac{1}{k!\ell!} (D_{\tilde x^{(i)}}^k f)(D_{\tilde x^{(i)}}^\ell f)^T \Big]
- \sum_{(k,\ell)\in A} \Big[ \frac{1}{(2k)!(2\ell)!} \frac{1}{4(c+\lambda)^2} \sum_{i=1}^{2c} \sum_{j=1}^{2c} (D_{\tilde x^{(i)}}^{2k} f)(D_{\tilde x^{(j)}}^{2\ell} f)^T \Big].
\tag{A.7}
\]
Now we consider the covariance matrix $P_{xy}$. Since the distribution of $x$ is symmetric about its mean, the terms of even total order vanish, and
\[
P_{xy} = E[(x - E(x))(y - E(y))^T] = E[\tilde x (y - E(y))^T]
= E\big[ \tilde x (D_{\tilde x} f)^T \big] + \sum_{i=1}^{\infty} \frac{1}{(2i+1)!} E\big[ \tilde x (D_{\tilde x}^{2i+1} f)^T \big]
= PH^T + \sum_{i=1}^{\infty} \frac{1}{(2i+1)!} E\big[ \tilde x (D_{\tilde x}^{2i+1} f)^T \big].
\tag{A.8}
\]
Using $\hat x^{(0)} = E(x)$ (so that the $i = 0$ term vanishes) and the cancellation of the even-order terms over the symmetric sigma points, the approximate covariance matrix $\hat P_{xy}$ equals
\[
\hat P_{xy} = \sum_{i=0}^{2c} W^{(i)} (\hat x^{(i)} - E(x))(\hat y^{(i)} - \hat y)^T
= \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} (D_{\tilde x^{(i)}} f)^T
+ \sum_{k=1}^{\infty} \frac{1}{(2k+1)!} \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} (D_{\tilde x^{(i)}}^{2k+1} f)^T
\]
\[
= PH^T + \sum_{k=1}^{\infty} \frac{1}{(2k+1)!} \frac{1}{2(c+\lambda)} \sum_{i=1}^{2c} \tilde x^{(i)} (D_{\tilde x^{(i)}}^{2k+1} f)^T.
\tag{A.9}
\]

component of the right vector (matrix). From the Taylor series in Section A of Appendix and Condtion 1 of Theorem 2, we have c c i h ∂2 1 1 XX 4 ˆ = f (E(x)) + Pij f (E(x)) + E[Dx˜ f ] + · · · E(y) − y 2! i=1 j=1 ∂xi ∂xj 4! c c h 1 XX ∂2 − f (E(x)) + f (E(x)) Pij 2 i=1 j=1 ∂xi ∂xj 2c

i X 1 1 1 + Dx4˜ (j) f + Dx6˜ (j) f · · · 2(c + λ) j=1 4! 6! ∞ ∞ 2c X X X 1 1 1 2i = E[Dx˜ f ] − Dx2i (j) f (2i)! 2(c + λ) j=1 (2i)! ˜ i=2 i=2

For 1 ≤ ` ≤ q, we have c 1 ∂ 2i 1  X 2i x˜j E[Dx˜ f` ] = f` E (2i)! (2i)! ∂xj j=1  ∂ 2i f (E(x)) i h X 1 2i ` ≤ hi M i k1 k c (2i)! 1≤k ,...,k ≤2i k1 , . . . , kc ∂x1 · · · ∂xc 1

(A.10)

c

≡ hi ai,` . Let ai = (ai,1 , . . . , ai,q ). Similarly, 2c

X 1 1 Dx2i (j) f 2(c + λ) j=1 (2i)! ˜ c

c

X 1 Xp 1 ∂ 2i = (c + λ)P `j f (c + λ) j=1 (2i)! `=1 ∂x`   c  X hi X 2i i−1 = (c + λ) (2i)! j=1 k1 , . . . , k c 1≤k1 ,...,kc ≤2i √ √ ∂ 2i f (E(x))  × ( P ∗ 1j )k1 · · · ( P ∗ cj )kc k1 ∂x1 · · · ∂xkc n

(A.11)

≡ hi bi , P i where Pij∗ is the (i, j) component of P∗ . Since ∞ i=1 K /i! is finite, Condition 2 of Theorem P∞ i P∞ i 2 implies that all components of i=1 h ai and i=1 h bi defined in (A.10) and (A.11) respectively are finite. Thus, we have ∞ ∞ X X ˆ| ≤ |E(y) − y hi (|ai | + |bi |) = h2 hi−2 (|ai | + |bi |) = O(h2 ). i=2

i=2

17

(A.12)

ˆ L = f (E(x)). It can be similarly shown that On the other hand, linearization uses y (A.13)

ˆ L = O(h). E(y) − y ˆ y . From the results so far, we get Now, we turn our attention to Py − P h ˆy = E Py − P

i h X i 1 1 2j T f )E(D f ) (Dxi˜ f )(Dxj˜ f )T − E(Dx2i ˜ ˜ x i!j! (2i!)(2j!) i+j= even X

(i,j)∈A

(i,j)∈A 2c h i X X 1 1 − (Dxj˜ (i) f )(Dx`˜ (i) f )T 2(c + λ) i=1 j+`= even j!`! (j,`)∈A

+

X h (j,`)∈A

2c X 2c i X 1 1 2j 2` T (D f )(Dx˜ (m) f ) . (2j!)(2`!) 4(c + λ)2 i=1 m=1 x˜ (i)

First of all, we consider the first term E

h

1 (Dxi˜ f )(Dxj˜ f )T i!j!

i

where i + j is even. Then, we

have  h 1 i h 1 X  i ∂ i f (E(x))  j i T E (Dx˜ f )(Dx˜ f ) = E x˜k11 · · · x˜kc c k1 i!j! i! 1≤k ,...,k ≤i k1 , . . . , kc ∂x1 · · · ∂xkc c c 1  1 X  j ∂ j f (E(x)) T . × x˜`11 · · · x˜`cc `1 j! 1≤` ,...,` ≤j k1 , . . . , kc ∂x1 · · · ∂x`cc 1

c

18

Let the ith component of a function f be fi . Then, the (a, b) component of i h i h 1 1 (Dxi˜ f )(Dxj˜ f )T , E i!j! (Dxi˜ f )(Dxj˜ f )T , satisfies: E i!j! ab

h 1 i j T i (Dx˜ f )(Dx˜ f ) E i!j! ab  h 1  i X  i k1 kc ∂ fa (E(x)) = E x˜ · · · x˜c i! 1≤k ,...,k ≤i k1 , . . . , kc 1 ∂xk11 · · · ∂xkc c c 1  1 X  ∂ j fb (E(x)) i j × x˜`11 · · · x˜`cc `1 j! 1≤` ,...,` ≤j `1 , . . . , `c ∂x1 · · · ∂x`cc c 1   h 1 X  i j = E x˜k1 +`1 · · · x˜kc c +`c i!j! 1≤k ,...,k ≤i k1 , . . . , kc `1 , . . . , ` c 1 c

1

(A.14)

1≤`1 ,...,`c ≤j i

j

∂ fa (E(x)) ∂ fb (E(x)) i × k1 ∂x1 · · · ∂xkc c ∂x`11 · · · ∂x`nc   X  h(i+j)/2 M (i+j)/2 i j ≤ i!j! k1 , . . . , kc `1 , . . . , ` c 1≤k ,...,k ≤i 1

c

1≤`1 ,...,`c ≤j

∂ i f (E(x)) ∂ j f (E(x)) a b × k1 ` 1 k ` ∂x1 · · · ∂xc c ∂x1 · · · ∂xcc ij ≡ h(i+j)/2 Rab .

h i j 1 i T Thus, E i!j! (Dx˜ f )(Dx˜ f ) ≤ h(i+j)/2 Rij . Then, we obtain h E

i 1 j i T (D f )(Dx˜ f ) i!j! x˜ i+j= even X

(i,j)∈A

(A.15)

X



(i+j)/2

h

ij

2

R =h

i+j= even (i,j)∈A

X

(i+j)/2−2

h

ij

R .

i+j= even (i,j)∈A

By (A.10), the second term satisfies X (i,j)∈A



1 2j T f ) f )E(D E(Dx2i ˜ ˜ x (2i!)(2j!)

X

(A.16) i

i

T

2

h ai (h aj ) = h

(i,j)∈A

X (i,j)∈A

19

i+j−2

h

T

ai (aj ) .

Next, let’s consider the third therm

1 2(c+λ)

P2c

j 1 f )(Dx`˜ (i) f )T i=1 j!`! (Dx ˜ (i)

where j + ` is even.

2c c X 1 1 1 X 1 j ` T (Dx˜ (i) f )(Dx˜ (i) f ) = (Dxj˜ (i) f )(Dx`˜ (i) f )T 2(c + λ) i=1 j!`! c + λ i=1 j!`! c c c ∂ j ih 1  X p ∂ ` iT 1 Xh 1 Xp (c + λ)P ri f (c + λ)P si f = c + λ i=1 j! r=1 ∂x` `! s=1 ∂x`   c X √ √ (c + λ)(j+`)/2 X h 1 j = ( P 1i )u1 · · · ( P ci )uc c+λ j! 1≤u ,...,u ≤j u1 , . . . , uc i=1 c 1 ∂ j f (E(x)) i × u1 ∂x1 · · · ∂xuc c  h1 iT ` X  √ √ ` v1 vc ∂ f (E(x)) × ( P 1i ) · · · ( P ci ) . `! 1≤v ,...,v ≤` v1 , . . . , vc ∂xv11 · · · ∂xvc c c

1

Similarly with E

h

1 (Dxi˜ f )(Dxj˜ f )T i!j!

i

, we have ab

2c

i X 1 1 (Dxj˜ (i) f )(Dx`˜ (i) f )T 2(c + λ) i=1 j!`! ab c h X X (c + λ)(j+`)/2−1 (j+`)/2 =h j!`! i=1 1≤u ,...,u

h



c ≤j 1 1≤v1 ,...,vc ≤`

j u1 , . . . , uc



` v1 , . . . , vc



i j ` √ √ u1 +v1 uc +vc ∂ fa (E(x)) ∂ fb (E(x)) ∗ ∗ × ( P 1i ) · · · ( P ni ) ∂xu1 1 · · · ∂xuc c ∂xv11 · · · ∂xvc c j` ≡ h(j+`)/2 Tab .

Thus, we obtain 2c h i X X 1 1 j ` T (D f )(Dx˜ (i) f ) 2(c + λ) i=1 j+`= even j!`! x˜ (i) (j,`)∈A

=

X

(i+j)/2

h

(A.17)

j`

2

T =h

j+`= even (j,`)∈A

X j+`= even (j,`)∈A

20

(i+j)/2−2

h

j`

T .

i h P2c P2c 2j 1 1 2` T ˆ y, P , by (D f )(D f ) For the last term of Py − P 2 (m) i=1 m=1 (j,`)∈A (2j!)(2`!) 4(c+λ) ˜ x ˜ (i) x (A.11), we have X h (j,`)∈A

=

2c X 2c i X 1 1 2j 2` T (D f )(D f ) (m) ˜ x (2j!)(2`!) 4(c + λ)2 i=1 m=1 x˜ (i)

2c X h X i=1

(j,`)∈A

=

X

2c

T i  X 1 1 1 1 2j 2` D f D (m) f (2j)! 2(c + λ) x˜ (i) (2`)! 2(c + λ) x˜ m=1 X

hj bj (h` b` )T = h2

(j,`)∈A

(A.18)

hj+`−2 bj (b` )T .

(j,`)∈A

Combining (A.15)-(A.18), we obtain ˆ y = O(h2 ), Py − P

(A.19)

P P where Condition 2 of Theorem 2 implies that all components of i+j= even h(i+j)/2 Rij , (i,j)∈A hi+j ai (aj )T , (i,j)∈A P P j+`−2 T (j+`)/2 j` bj (b` ) are finite. By (A.10), we have T , and (j,`)∈A h j+`= even h (j,`)∈A

1 E[Dx2˜ f ]E[Dx2˜ f ]T = h2 a1 (a1 )T . 4 ˆ y,L = HPx HT . Thus, it can be similarly shown that In linearization, P ˆ y,L = O(h2 ). Py − P

(A.20)

Now, we know ˆ xy = Pxy − P

∞ X i=1

We consider



2c

h  T i X  T X 1 1 1 2i+1 (j) ˜ Dx2i+1 ˜ E x f x . − D f ˜ ˜ (j) x (2i + 1)! (2i + 1)! 2(c + λ) j=1 i=1

1 E (2i+1)!

h  T i 2i+1 ˜ Dx˜ f x first.

h  T i 1 ˜ Dx2i+1 E x f ˜ (2i + 1)!   h  X 1 2i + 1 ∂ 2i+1 f (E(x)) T i ˜ = E x x˜k11 · · · x˜kc c k1 . kc (2i + 1)! k ∂x · · · ∂x 1 , . . . , kc 1 c 1≤k ,...,k ≤2i+1 1

c

21

As before, we can obtain h  T i 1 2i+1 ˜ Dx˜ f E x (2i + 1)! ab   h 2i+1 X fb (E(x)) i 2i + 1 1 k1 kc ∂ x˜ · · · x˜c = E x˜a (2i + 1)! k1 , . . . , kc 1 ∂xk11 · · · ∂xkc c 1≤k1 ,...,kc ≤2i+1   h 2i+1 X fb (E(x)) i 1 2i + 1 k1 ka +1 kc ∂ = E · · · x ˜ · · · x ˜ x ˜ c a (2i + 1)! k1 , . . . , kc 1 ∂xk11 · · · ∂xkc c 1≤k1 ,...,kc ≤2i+1 h 2i + 1  ∂ 2i+1 f (E(x)) i X hi+1 M i+1 b ≤ k1 (2i + 1)! 1≤k ,...,k ≤2i+1 k1 , . . . , kc ∂x1 · · · ∂xkc c c 1 ≡ hi+1 Ui,ab . Thus, we get ∞ X i=1





h  T i X X 1 2i+1 ˜ Dx˜ f hi+1 Ui = h2 hi−1 Ui . E x ≤ (2i + 1)! i=1 i=1

Furthermore, h

2c

 T i X 1 1 ˜ (j) Dx2i+1 x f ˜ (j) (2i + 1)! 2(c + λ) j=1 ab

c hi+1 (c + λ)i+1 X √ ∗ ( P aj ) = (2i + 1)! c + λ j=1    2i+1 X √ √ fb (E(x))  2i + 1 k1 kc ∂ ∗ ∗ × ( P 1j ) · · · ( P cj ) k1 , . . . , kc ∂xk11 · · · ∂xkc c 1≤k ,...,k ≤2i+1 1

c

≡ hi+1 Vi,ab . Thus, we have ∞ X i=1



2c



 T X X X 1 1 2i+1 (j) ˜ x Dx˜ (j) f = hi+1 Vi = h2 hi−1 Vi . (2i + 1)! 2(c + λ) j=1 i=1 i=1

Therefore, we obtain ∞ X ˆ |Pxy − Pxy | =

h  T i 1 ˜ Dx2i+1 E x f ˜ (2i + 1)!

i=1 ∞ X



≤ h2

i=1 ∞ X

2c  T X 1 1 ˜ (j) Dx2i+1 x f ˜ (j) (2i + 1)! 2(c + λ) j=1

hi−1 (|Ui | + |Vi |) = O(h2 ),

i=1

22

(A.21)

where Condition 2 of Theorem 2 implies that all components of

P∞

i=1

hi+1 Ui and

P∞

i=1

hi+1 Vi

ˆ xy,L = Px HT . Thus, it can be similarly shown defined in (A.21) are finite. In linearization, P that ˆ xy,L | ≤ h2 |Pxy − P

∞ X i=1

23

hi−1 |Ui | = O(h2 ).

(A.22)