
Computational Statistics & Data Analysis 50 (2006) 311–331

The asymptotic convexity of the negative likelihood function of GARCH models

W.C. Ip (a), Heung Wong (a), J.Z. Pan (b,*), D.F. Li (c)

(a) Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
(b) LMAM and Department of Financial Mathematics, School of Mathematical Sciences, Peking University, Beijing 100871, China
(c) LMAM and Department of Probability and Statistics, School of Mathematical Sciences, Peking University, Beijing 100871, China

Received 26 February 2004; received in revised form 21 August 2004; accepted 31 August 2004. Available online 25 September 2004.

Abstract

We prove the convexity of the negative likelihood function in the asymptotic sense for GARCH models. This property provides assurance for the convergence of numerical optimization algorithms for maximum likelihood estimation of GARCH. A simulation study is conducted in order to compare the performance of several different iteration algorithms. An example based on the log-returns of foreign exchange rates is also given.
© 2004 Elsevier B.V. All rights reserved.

Keywords: GARCH; Convexity; Maximum likelihood estimation; Iterative algorithm; Convergence; Foreign exchange rates

1. Introduction

The autoregressive conditional heteroscedastic (ARCH) model was first proposed by Engle (1982). It is now very popular with econometricians and financial researchers. The ARCH model has many generalized forms, including GARCH (Bollerslev, 1986; Taylor, 1986). For GARCH modeling, maximum likelihood estimation (MLE) of the parameters is a basic statistical problem (see Weiss, 1986; Lee and Hansen, 1994; Ling and Li, 1998, among others). Several algorithms for an approximate MLE can be used for GARCH. However, little work has been reported on the convergence and comparison of these algorithms in the GARCH case. Mak et al. (1997) compared the quasi-Newton method and an algorithm proposed by Mak (1993), which is an iteratively weighted least squares method, in the case of ARCH models. In this paper, we present some research on this aspect for the GARCH case. More importantly, the convexity of the negative likelihood function is proved in the asymptotic sense, providing assurance for the convergence of numerical optimization algorithms in large samples.

Suppose {e_t} ∼ GARCH(p, q), i.e.,

\[
e_t = h_t^{1/2}\varepsilon_t, \qquad
h_t = \alpha_0 + \alpha_1 e_{t-1}^2 + \cdots + \alpha_p e_{t-p}^2 + \beta_1 h_{t-1} + \cdots + \beta_q h_{t-q}, \qquad (1.1)
\]
\[
\alpha_0 > 0, \quad \alpha_i \ge 0,\ i = 1, \ldots, p, \quad \beta_j \ge 0,\ j = 1, \ldots, q, \quad
\varepsilon_t \sim \text{iid } N(0,1), \quad \varepsilon_t \text{ and } e_{t-s},\ s > 0, \text{ are independent.}
\]
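For readers who want to experiment with model (1.1), the following Python sketch simulates a GARCH(p, q) series. It is our own illustration; the function name, burn-in length and seed are arbitrary choices, not part of the paper.

```python
import numpy as np

def simulate_garch(alpha, beta, n, burn=500, seed=0):
    """Simulate n observations of the GARCH(p, q) process (1.1).

    alpha = (alpha_0, alpha_1, ..., alpha_p), beta = (beta_1, ..., beta_q),
    with alpha_0 > 0, the remaining coefficients nonnegative and
    sum(alpha[1:]) + sum(beta) < 1, so that a stationary solution exists.
    """
    alpha = np.asarray(alpha, float)
    beta = np.asarray(beta, float)
    p, q = len(alpha) - 1, len(beta)
    rng = np.random.default_rng(seed)
    total = n + burn
    e = np.zeros(total)
    # start h at the unconditional variance; the burn-in removes the effect
    h = np.full(total, alpha[0] / (1.0 - alpha[1:].sum() - beta.sum()))
    for t in range(max(p, q), total):
        h[t] = (alpha[0]
                + np.dot(alpha[1:], e[t - p:t][::-1] ** 2)   # alpha_i * e_{t-i}^2
                + np.dot(beta, h[t - q:t][::-1]))            # beta_j * h_{t-j}
        e[t] = np.sqrt(h[t]) * rng.standard_normal()
    return e[burn:]

# Example: the GARCH(1,1) design used in the simulations of Section 3,
# theta = (alpha_0, alpha_1, beta_1) = (2, 0.3, 0.4)
e = simulate_garch(alpha=[2.0, 0.3], beta=[0.4], n=1600)
```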

The parameter vector is denoted by
\[
\theta = (\theta_1, \ldots, \theta_{p+q+1})^T := (\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q)^T .
\]
Here x^T denotes the transpose of a vector x. Let
\[
D = \Big\{ \theta \in \mathbb{R}^{p+q+1} : \alpha_0 > 0,\ \alpha_i \ge 0,\ i = 1, \ldots, p,\
\beta_j \ge 0,\ j = 1, \ldots, q,\ \sum_{i=1}^{p}\alpha_i + \sum_{j=1}^{q}\beta_j < 1 \Big\} .
\]

It is well known that if θ ∈ D, then there is a unique strictly stationary solution {e_t} to the GARCH equation (1.1), and in this case E e_t^2 < ∞. See Nelson (1990) for the GARCH(1,1) case and Bougerol and Picard (1992) for the general case. See also Ling and Li (1997) for further discussion and the causal representation of h_t.
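The stationarity condition behind this result is the negativity of a top Lyapunov exponent (see the proof of Lemma A.1 in Appendix A). As an illustration only, the sketch below estimates that exponent by Monte Carlo for a GARCH(1,1) parameter point; the function name and simulation length are our own choices, not part of the paper.

```python
import numpy as np

def lyapunov_garch11(alpha1, beta1, n=100000, seed=0):
    """Monte Carlo estimate of the top Lyapunov exponent of the random
    matrices A_t = [[alpha1*eps_t^2, beta1*eps_t^2], [alpha1, beta1]]
    appearing in the GARCH(1,1) recurrence X_t = A_t X_{t-1} + B_t
    (cf. Appendix A).  A strictly stationary solution exists when the
    exponent is negative."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 1.0])
    log_growth = 0.0
    for _ in range(n):
        z2 = rng.standard_normal() ** 2
        A = np.array([[alpha1 * z2, beta1 * z2],
                      [alpha1,      beta1     ]])
        x = A @ x
        s = np.linalg.norm(x)
        log_growth += np.log(s)
        x /= s                      # renormalize to avoid overflow
    return log_growth / n

# alpha_1 + beta_1 < 1 (theta inside D) gives a negative estimate:
print(lyapunov_garch11(0.3, 0.4))
```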

Note that for θ ∈ D,
\[
h_t = \frac{1}{1 - \sum_{j=1}^{q}\beta_j L^j}\,
      \big(\alpha_0 + \alpha_1 e_{t-1}^2 + \cdots + \alpha_p e_{t-p}^2\big),
\]
where L denotes the lag operator. Thus, h_t depends on all past values of the process {e_t}.¹

¹ Since the process is only observed during a limited time period t = 1, . . . , n, we often make the assumption that the starting values of e_t and h_t are e_{1-p} = · · · = e_0 = 0 and h_{1-q} = · · · = h_0 = 0.

Given the σ-field σ(e_t, t ≤ 0), the conditional log-likelihood function of the sample

observations e_1, . . . , e_n is
\[
\ln L_n(\theta, \omega)
 = \sum_{t=1}^{n}\Big( -\tfrac{1}{2}\ln h_t(\theta,\omega) - \frac{e_t^2(\omega)}{2 h_t(\theta,\omega)} \Big) - \frac{n}{2}\ln(2\pi)
 =: \sum_{t=1}^{n} l_t(\theta,\omega) - \frac{n}{2}\ln(2\pi), \qquad (1.2)
\]
where ω is a point of the probability space on which {e_t} is defined. The MLE of the parameters of the GARCH model on D, denoted by θ̂(ω) = (α̂_0(ω), . . . , α̂_p(ω), β̂_1(ω), . . . , β̂_q(ω))^T, is obtained by maximizing the log-likelihood function, that is,
\[
\ln L_n(\hat\theta, \omega) = \max_{\theta \in D}\, \ln L_n(\theta, \omega) \qquad (1.3)
\]
for almost every ω. Let
\[
f_{i,n}(\theta,\omega) = -\frac{1}{n}\,\frac{\partial \ln L_n(\theta,\omega)}{\partial \theta_i},
\qquad i = 1, \ldots, p+q+1 .
\]
A sufficient condition for (1.3) is that
\[
f_n(\hat\theta,\omega) := \big(f_{1,n}(\hat\theta,\omega), \ldots, f_{p+q+1,n}(\hat\theta,\omega)\big)^T = 0 \qquad (1.4)
\]
and that
\[
f_n'(\hat\theta,\omega) := \left( \frac{\partial f_{i,n}(\theta,\omega)}{\partial \theta_j} \right)_{i,j = 1, \ldots, p+q+1}\Bigg|_{\theta = \hat\theta} \qquad (1.5)
\]
is positive definite for all θ ∈ D. Here, for each fixed ω, the vector function f_n(θ,ω) is the gradient of −(1/n) ln L_n(θ,ω), and f_n'(θ,ω) is the Hessian matrix of this objective function. Eq. (1.4) is usually solved by numerical iteration (cf. Thisted, 1988; Mak, 1993). In the Newton–Raphson (NR) method, starting from a suitably chosen point θ^{(0)}, we use the iteration
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k \big[f_n'(\theta^{(k)},\omega)\big]^{-1} f_n(\theta^{(k)},\omega) \qquad (1.6)
\]
to obtain an approximate value of the solution θ̂ of (1.3), where α_k is a step-length scalar. One interpretation of this method is that at each iterate a direction −[f_n'(θ^{(k)},ω)]^{-1} f_n(θ^{(k)},ω) is computed, along which −(1/n) ln L_n(θ,ω) descends, and the next iterate is obtained from the current one by taking a step in this direction. A drawback of the NR method is the necessity of computing (p + q + 1)(p + q + 2)/2 derivatives and the inverse of the Hessian matrix f_n'(θ,ω) at each step.



This drawback may induce algebraic and numerical problems, especially for multivariate GARCH or high-order models with many parameters (cf. Ling and McAleer, 2003; Wong and Li, 2002, among others). The main problem is the large number of mixed partial derivatives
\[
\frac{\partial^2 \ln L_n(\theta,\omega)}{\partial \theta_i\,\partial \theta_j},
\qquad i, j = 1, \ldots, p+q+1,\ i \ne j,
\]
in the case of GARCH(p, q). A less known but useful iterative method for solving nonlinear systems of equations is the nonlinear Gauss–Seidel (GS) iteration (cf. Thisted, 1988; Ortega and Rheinboldt, 1970). This method needs only the diagonal elements of the Hessian matrix f_n'(θ,ω). The GS iteration can also be used as an algorithm for maximum likelihood estimation of GARCH parameters.

To our knowledge, the convergence of these algorithms for the MLE in GARCH modeling has possibly not been sufficiently discussed so far. This convergence rests on the convexity of the negative likelihood function −(1/n) ln L_n(θ,ω), that is, on the positive definiteness of the Hessian matrix f_n'(θ,ω) defined by (1.5) on D. Theorem 1 in Section 2 of this paper establishes such convexity in an asymptotic sense (for a large enough sample size). It is a basic result for discussing the convergence of numerical optimization iterations for the MLE of GARCH models.

The arrangement of the paper is as follows. Section 2 discusses the asymptotic convexity of the negative likelihood function for GARCH models. Section 3 gives some simulation results, including a comparison of the NR algorithm, the quasi-Newton algorithm and the GS algorithm. Section 4 gives an example in which the three algorithms are used to model financial data by GARCH. Section 5 makes some concluding remarks.
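To make the objective function concrete before turning to the theory, here is a minimal Python sketch (our own illustration, not the authors' code) of −(1/n) ln L_n(θ, ω), using the zero starting values of the footnote above and omitting the constant term (1/2) ln(2π), which does not affect the optimization.

```python
import numpy as np

def neg_loglik(theta, e, p, q):
    """-(1/n) * sum_t l_t(theta) for the GARCH(p, q) model, cf. (1.2).

    theta = (alpha_0, alpha_1..alpha_p, beta_1..beta_q); presample values of
    e_t and h_t are set to zero, as in the footnote of Section 1."""
    theta = np.asarray(theta, float)
    a0 = theta[0]
    a = theta[1:p + 1]               # alpha_1, ..., alpha_p
    b = theta[p + 1:p + q + 1]       # beta_1, ..., beta_q
    e = np.asarray(e, float)
    n = len(e)
    e_pad = np.concatenate([np.zeros(p), e])   # e_pad[p + t] corresponds to e_{t+1}
    h_pad = np.zeros(q + n)                    # h_pad[q + t] corresponds to h_{t+1}
    loglik = 0.0
    for t in range(n):
        h_t = (a0
               + np.dot(a, e_pad[p + t - 1::-1][:p] ** 2)   # e_t^2, ..., e_{t+1-p}^2
               + np.dot(b, h_pad[q + t - 1::-1][:q]))       # h_t, ..., h_{t+1-q}
        h_pad[q + t] = h_t
        loglik += -0.5 * np.log(h_t) - e[t] ** 2 / (2.0 * h_t)
    return -loglik / n

# Example (with e from the simulation sketch above):
# neg_loglik([2.0, 0.3, 0.4], e, p=1, q=1)
```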

2. Asymptotic convexity of the negative likelihood function for the GARCH model

As we saw in Section 1, if we denote
\[
E_{t-1} := \big(1, e_{t-1}^2, \ldots, e_{t-p}^2\big)^T, \qquad
H_{t-1} := \big(h_{t-1}, \ldots, h_{t-q}\big)^T,
\]
\[
Y_t = \big(1, e_t^2, \ldots, e_{t-p+1}^2, h_t, \ldots, h_{t-q+1}\big)^T = \big(E_t^T, H_t^T\big)^T,
\qquad
\theta = (\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q)^T,
\]
then h_t = θ^T Y_{t-1} and
\[
l_t = -\frac{1}{2}\ln h_t - \frac{e_t^2}{2h_t}
    = -\frac{1}{2}\Big( \ln\big(\theta^T Y_{t-1}\big) + \frac{e_t^2}{\theta^T Y_{t-1}} \Big) .
\]


Note that
\[
\frac{\partial l_t(\theta)}{\partial \theta}
   = \frac{1}{2 h_t}\Big(\frac{e_t^2}{h_t} - 1\Big)\frac{\partial h_t}{\partial \theta},
\]
\[
\frac{\partial^2 l_t(\theta)}{\partial \theta\,\partial \theta^T}
   = \frac{1}{2 h_t^2}\Big(1 - \frac{2 e_t^2}{h_t}\Big)
     \frac{\partial h_t}{\partial \theta}\frac{\partial h_t}{\partial \theta^T}
   + \frac{1}{2 h_t}\Big(\frac{e_t^2}{h_t} - 1\Big)
     \frac{\partial^2 h_t}{\partial \theta\,\partial \theta^T}.
\]
We then have
\[
f_n(\theta) = -\frac{1}{n}\sum_{t=1}^{n}\frac{\partial l_t(\theta)}{\partial \theta}
            = \frac{1}{2n}\sum_{t=1}^{n}\frac{1}{h_t}\Big(1 - \frac{e_t^2}{h_t}\Big)\frac{\partial h_t}{\partial \theta},
\qquad (2.1)
\]
\[
f_n'(\theta) = -\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2 l_t(\theta)}{\partial \theta\,\partial \theta^T}
             = \frac{1}{2n}\sum_{t=1}^{n}\frac{1}{h_t^2}\Big(\frac{2 e_t^2}{h_t} - 1\Big)
               \frac{\partial h_t}{\partial \theta}\frac{\partial h_t}{\partial \theta^T}
             + \frac{1}{2n}\sum_{t=1}^{n}\frac{1}{h_t}\Big(1 - \frac{e_t^2}{h_t}\Big)
               \frac{\partial^2 h_t}{\partial \theta\,\partial \theta^T}.
\qquad (2.2)
\]

But
\[
\frac{\partial h_t}{\partial \theta}
  = Y_{t-1} + \beta_1 \frac{\partial h_{t-1}}{\partial \theta} + \cdots + \beta_q \frac{\partial h_{t-q}}{\partial \theta},
\]
\[
\frac{\partial^2 h_t}{\partial \theta\,\partial \theta^T}
  = \sum_{k=1}^{q}\Big( u_{p+1+k}\,\frac{\partial h_{t-k}}{\partial \theta^T}
    + \frac{\partial h_{t-k}}{\partial \theta}\, u_{p+1+k}^T \Big)
  + \sum_{k=1}^{q} \beta_k \frac{\partial^2 h_{t-k}}{\partial \theta\,\partial \theta^T},
\]
where u_i = (0, . . . , 0, 1, 0, . . . , 0)^T with the 1 in the i-th position.

For θ ∈ D, we can write
\[
\frac{\partial h_t}{\partial \theta} = \psi(L)\, Y_{t-1}, \qquad (2.3)
\]
\[
\frac{\partial^2 h_t}{\partial \theta\,\partial \theta^T}
  = \psi(L)\sum_{k=1}^{q}\Big( u_{p+1+k}\,\frac{\partial h_{t-k}}{\partial \theta^T}
    + \frac{\partial h_{t-k}}{\partial \theta}\, u_{p+1+k}^T \Big), \qquad (2.4)
\]
where
\[
\psi(L) = \Big(1 - \sum_{j=1}^{q}\beta_j L^j\Big)^{-1} =: \sum_{j=0}^{\infty}\psi_j L^j,
\]
and the coefficients {ψ_j} are determined by

the identity \((1 - \sum_{j=1}^{q}\beta_j z^j)(\sum_{j=0}^{\infty}\psi_j z^j) = 1\), i.e.,
\[
\psi_0 = 1, \qquad
\psi_k - \sum_{i=1}^{\min(k,\,q)} \beta_i\, \psi_{k-i} = 0, \qquad k = 1, 2, \ldots . \qquad (2.5)
\]
Therefore, it is easy to see that each ψ_j is a polynomial function of the parameter θ on D, and
\[
\sum_{j=0}^{\infty} |\psi_j| < \infty . \qquad (2.6)
\]
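A few lines of Python (our own illustration) make the recursion (2.5) concrete; coefficients β_i with i > q are treated as zero.

```python
import numpy as np

def psi_coefficients(beta, n_terms):
    """First n_terms coefficients psi_0, ..., psi_{n_terms-1} of
    psi(z) = 1 / (1 - beta_1 z - ... - beta_q z^q), via recursion (2.5):
    psi_0 = 1,  psi_k = sum_{i=1}^{min(k, q)} beta_i * psi_{k-i}."""
    beta = np.asarray(beta, float)
    q = len(beta)
    psi = np.zeros(n_terms)
    psi[0] = 1.0
    for k in range(1, n_terms):
        m = min(k, q)
        psi[k] = np.dot(beta[:m], psi[k - m:k][::-1])
    return psi

# Example: for GARCH(1,1) with beta_1 = 0.4 the coefficients are psi_j = 0.4**j,
# and their sum approaches 1 / (1 - 0.4), consistent with (2.6) when sum(beta) < 1.
print(psi_coefficients([0.4], 6))
```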

Let D_0 be the open interior of D, i.e.
\[
D_0 = \Big\{ \theta \in \mathbb{R}^{p+q+1} : \theta_i > 0,\ i = 1, \ldots, p+q+1,\
\sum_{i=2}^{p+q+1}\theta_i < 1 \Big\} .
\]
We will show that the objective function −(1/n) ln L_n(θ) is convex on any compact, convex subset of D_0, e.g.
\[
D_\rho = \Big\{ \theta \in \mathbb{R}^{p+q+1} : \rho \le \theta_1 \le M,\
\theta_i \ge \rho,\ i = 2, \ldots, p+q+1,\ \sum_{i=2}^{p+q+1}\theta_i \le 1 - \rho \Big\}
\]
with ρ > 0 and M > ρ, for large n with probability one. Specifically, we have the following theorem.

Theorem 1. Suppose D_1 is an arbitrary compact, convex subset of D_0 and f_n'(θ,ω) is defined in (2.2). Then there exist a constant C > 0 and a set Ω̃ with P(Ω̃) = 1 such that for each ω ∈ Ω̃ and θ ∈ D_1, there is a positive integer N(θ,ω) such that
\[
g^T f_n'(\theta,\omega)\, g \ge C\, g^T g, \qquad \forall\, n \ge N(\theta,\omega),\ g \in \mathbb{R}^{p+q+1} . \qquad (2.7)
\]

The uniform convexity of the objective function is essential for the convergence of an iterative algorithm to the minimizing point. Thus, Theorem 1 is a basic result for using iterative methods to compute the maximum likelihood estimates of the parameters of GARCH models. But, as shown in the theorem, the sample size N(θ,ω) required to guarantee a property such as (2.7) of the sample Hessian matrix f_n'(θ,ω) may vary with the parameter point. This may cause problems in practical computation. We give some remarks as follows.

Remark 1. If N(θ,ω) in Theorem 1 is bounded on θ ∈ D_1, i.e.
\[
N(\omega) := \sup_{\theta \in D_1} N(\theta,\omega) < \infty,
\]



then for fixed n ≥ N(ω), f_n'(θ,ω) is uniformly positive definite:
\[
g^T f_n'(\theta,\omega)\, g \ge C\, g^T g, \qquad \forall\, g \in \mathbb{R}^{p+q+1} .
\]
In this case, if there is a parameter point θ̂_n(ω) in D_1 such that f_n(θ̂_n,ω) = 0, i.e., the MLE θ̂_n ∈ D_1, then, from Subsections 3.4.6, 4.1.6, 4.3.6 and 14.6.6 of Ortega and Rheinboldt (1970), as the number of steps k → ∞ the sequence
\[
\theta_n^{(k)}(\omega) = \big(\alpha_0^{(k)}(\omega), \ldots, \alpha_p^{(k)}(\omega),
\beta_1^{(k)}(\omega), \ldots, \beta_q^{(k)}(\omega)\big)^T,
\]
given by the nonlinear GS iteration (described in Section 3), starting from a point θ^{(0)} ∈ D_1, converges to θ̂_n(ω), which is the unique maximizer of the likelihood function
\[
L_n\big(\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q;\ e_1(\omega), \ldots, e_n(\omega)\big) .
\]
If, in addition, the MLE of the GARCH parameters is strongly consistent, i.e.
\[
\hat\theta_n(\omega) \to \theta \qquad \text{as } n \to \infty
\]
for ω ∈ Ω_0 with P(Ω_0) = 1, then
\[
\lim_{n\to\infty}\lim_{k\to\infty} \theta_n^{(k)}(\omega) = \theta, \qquad \theta \in D_1, \qquad (2.8)
\]

for ω ∈ Ω̃ ∩ Ω_0, which means that (2.8) holds almost everywhere.

Remark 2. Problems occur when we cannot be sure that N(θ,ω) is bounded on θ ∈ D_1. For each sample path ω ∈ Ω̃, N(θ,ω) may be very different at different parameter points of D_1. In this case, for a fixed sample size, even a very large one, we cannot be sure that the Hessian matrix f_n'(θ,ω) is positive definite at every point θ ∈ D_1. But in practice the sample size is always fixed at first, so the iteration procedure sometimes cannot be continued. These phenomena have been observed in our simulation study. In our experience, the problem often appears at parameter points near the boundary of D_0 or when the starting point θ^{(0)} is far away from the minimizer. Our way of dealing with this phenomenon is to add a small, suitably chosen λ > 0 to the diagonal elements of the Hessian matrix f_n'(θ,ω) (see Section 4 of this paper), and to try to choose the initial values of the iteration near the true parameters.

Remark 3. Is there a way of proving that we can choose a bounded N(θ,ω) on θ ∈ D_1 in Theorem 1? The problem is to prove that there are a positive constant C and a set Ω̃ of sample points with P(Ω̃) = 1 such that, for each ω ∈ Ω̃, there is an N(ω) such that when n ≥ N(ω),
\[
g^T f_n'(\theta,\omega)\, g \ge C\, g^T g, \qquad \forall\, g \in \mathbb{R}^{p+q+1}
\ \text{and}\ \forall\, \theta \in D_1 .
\]
The question is then: does the Hessian matrix f_n'(θ,ω) computed from sample observations converge to its expectation f'(θ) (i.e. the Fisher information matrix) uniformly for θ ∈ D with probability one? The strong approximation theory of random fields is probably needed to elucidate this problem. This will be one of the topics of our future research.



We even believe that the convergence problem of iterative methods for GARCH models has possibly not been fully discussed in the literature.

Remark 4. Some authors have begun to consider ML estimation of the ARCH model under partial ordering constraints on the parameters, such as
\[
\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_p,
\]
which has an intuitive economic meaning (cf. Wang et al., 2002). We can consider the convergence problem of the iterative algorithm for the MLE of GARCH parameters under the partial ordering constraints
\[
\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_p, \qquad \beta_1 \ge \beta_2 \ge \cdots \ge \beta_q,
\]
and, of course, (α_0, α_1, . . . , α_p, β_1, . . . , β_q)^T ∈ D.

3. Simulation experiments and numerical comparison of some algorithms

Note that Theorem 1 is only an asymptotic result and may not hold for finite sample sizes. In order to see the performance of different iterative algorithms in finite samples, we conducted simulation experiments to compare three such methods: the quasi-Newton method used in the AUTOREG procedure of the SAS software, which is a modified form of BFGS called DBFGS; the NR method; and the GS method. The NR iteration was described in Section 1. We now describe the GS method briefly. For a specific series of sample observations e_1, . . . , e_n, we choose a starting point
\[
\theta^{(0)} = \big(\theta_1^{(0)}, \ldots, \theta_{p+q+1}^{(0)}\big)^T
            = \big(\alpha_0^{(0)}, \alpha_1^{(0)}, \ldots, \alpha_p^{(0)},
                   \beta_1^{(0)}, \ldots, \beta_q^{(0)}\big)^T .
\]
The GS iteration proceeds as follows.

Cycle k. Given θ^{(k−1)}, θ^{(k)} is calculated componentwise by
\[
\theta_i^{(k)} = \theta_i^{(k-1)}
  - \left[ \frac{\partial f_{i,n}(\theta)}{\partial \theta_i} \right]^{-1}
    f_{i,n}(\theta)\,\Bigg|_{\theta = (\theta_1^{(k)}, \ldots, \theta_{i-1}^{(k)},\, \theta_i^{(k-1)}, \ldots, \theta_{p+q+1}^{(k-1)})},
\qquad (3.1)
\]
where θ_i^{(k)} denotes the i-th component of θ^{(k)}, i = 1, . . . , p + q + 1, k = 1, 2, . . . . Simply speaking, the i-th component of f_n(θ,ω) is considered as a function of the i-th component of θ only; all other components are treated as fixed at their most recently computed values. The above algorithm can also be described as
\[
\text{Solve } f_{i,n}\big(\theta_1^{(k)}, \ldots, \theta_{i-1}^{(k)},\, \theta_i,\, \theta_{i+1}^{(k-1)}, \ldots, \theta_{p+q+1}^{(k-1)}\big) = 0 \ \text{for } \theta_i ;
\quad \text{set } \theta_i^{(k)} = \theta_i \text{ (the solution)}, \qquad i = 1, \ldots, p+q+1,\ k = 1, 2, \ldots .
\]
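As an illustration only (not the authors' implementation), one GS cycle for minimizing the objective f_n = −(1/n) ln L_n can be written with finite-difference derivatives; `obj` could be the `neg_loglik` sketch of Section 1 wrapped to fix the data and orders.

```python
import numpy as np

def gs_cycle(theta, obj, eps=1e-5):
    """One nonlinear Gauss-Seidel cycle, cf. (3.1): each coordinate in turn
    receives a univariate Newton step, with the first and second partial
    derivatives of obj approximated by central finite differences."""
    theta = np.array(theta, float)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        f0, fp, fm = obj(theta), obj(theta + step), obj(theta - step)
        grad_i = (fp - fm) / (2.0 * eps)          # d f / d theta_i
        hess_ii = (fp - 2.0 * f0 + fm) / eps**2   # d^2 f / d theta_i^2
        theta[i] -= grad_i / hess_ii              # other coordinates stay at latest values
    return theta

# Repeat gs_cycle until max|theta_new - theta| falls below the tolerance
# (1e-5 in the experiments of this section).
```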



This reduces the i-th equation to a univariate one, which is then solved using the univariate Newton method. Note that in this algorithm we only need to compute the derivatives
\[
\frac{\partial f_{i,n}(\theta,\omega)}{\partial \theta_i}
 = -\frac{1}{n}\,\frac{\partial^2 \ln L_n(\theta,\omega)}{\partial \theta_i^2},
\qquad i = 1, 2, \ldots, p+q+1,
\]
which are the diagonal elements of the Hessian matrix
\[
\left( -\frac{1}{n}\,\frac{\partial^2 \ln L_n(\theta,\omega)}{\partial \theta_i\,\partial \theta_j} \right)_{i,j = 1, \ldots, p+q+1},
\]
and we do not need to compute the inverse of the full Hessian matrix. The GS iteration is one of the coordinate descent methods, and its drawback is that the convergence rate may be slow.

Because, for a fixed sample size n, the Hessian matrix of the objective function is not always positive definite at all parameter points, in practice the iteration sometimes does not converge. To overcome this difficulty, we have added a damping factor to the NR update (1.6) and to the GS update (3.1), which become
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k \big[\lambda_k I + f_n'(\theta^{(k)},\omega)\big]^{-1} f_n(\theta^{(k)},\omega) \qquad (3.2)
\]
and
\[
\theta_i^{(k)} = \theta_i^{(k-1)}
  - \left[ \lambda_k + \frac{\partial f_{i,n}(\theta)}{\partial \theta_i} \right]^{-1}
    f_{i,n}(\theta)\,\Bigg|_{\theta = (\theta_1^{(k)}, \ldots, \theta_{i-1}^{(k)},\, \theta_i^{(k-1)}, \ldots, \theta_{p+q+1}^{(k-1)})},
\qquad (3.3)
\]

respectively, where λ_k > 0 is a set of appropriately chosen numbers; here we used λ_k = λ_0/(1 + k²). A small numerical sketch of the damped update (3.2) is given after the list of comparison statistics below. In order to keep the parameter estimates inside the region D, we made some adjustments to the algorithms: a run is considered to have failed if the search leaves the region D, and to have converged if no further progress can be made according to some tolerance level, here 1 × 10⁻⁵.

We considered five GARCH(p, q) models, with orders (1, 1), (3, 2), (2, 3), (4, 4) and (6, 6) respectively. For each model we simulated 1000 series and compared the three iterative MLE methods at three sample sizes: 400, 1600 and 6400. We compared the following statistics of the algorithms:
• Success rate of convergence.
• Number of iteration passes. We give the median and the inter-quartile range (IQR) of this number. Note that the SAS software does not report this information.
• Time used (in seconds) to find the optimum. The median and the IQR of this number are given.
• The average relative error (ARE) of the converged estimates.

If the unknown parameter is θ = (θ_1, . . . , θ_m) and the estimate is θ̂ = (θ̂_1, . . . , θ̂_m), then the ARE is defined as
\[
\mathrm{ARE} = \frac{1}{m}\sum_{i=1}^{m}\frac{|\hat\theta_i - \theta_i|}{\theta_i} .
\]
We give the average of this number over the converged replications.
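The following Python sketch of the damped NR update (3.2) is our own illustration: the gradient and Hessian are approximated numerically (the helper names are hypothetical), and the step length α_k is simply taken as 1.

```python
import numpy as np

def num_grad(obj, theta, eps=1e-5):
    """Central-difference gradient of obj at theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (obj(theta + d) - obj(theta - d)) / (2.0 * eps)
    return g

def num_hess(obj, theta, eps=1e-4):
    """Finite-difference Hessian of obj at theta (symmetrized)."""
    m = len(theta)
    H = np.zeros((m, m))
    for i in range(m):
        d = np.zeros_like(theta)
        d[i] = eps
        H[:, i] = (num_grad(obj, theta + d, eps) - num_grad(obj, theta - d, eps)) / (2.0 * eps)
    return 0.5 * (H + H.T)

def damped_nr(obj, theta0, lam0=1.0, max_iter=200, tol=1e-5):
    """Damped Newton-Raphson iteration (3.2) with lam_k = lam0 / (1 + k^2)."""
    theta = np.array(theta0, float)
    for k in range(max_iter):
        g = num_grad(obj, theta)
        H = num_hess(obj, theta)
        lam = lam0 / (1.0 + k ** 2)
        theta_new = theta - np.linalg.solve(lam * np.eye(len(theta)) + H, g)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```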



Table 1
GARCH(1, 1): comparison of the quasi-Newton method (SAS), the GS method and the NR method

  n      Method   Success    Iteration passes     Time (s)           ARE
                  rate (%)   Median    IQR        Median    IQR      Average (%)
  400    SAS      100        N.A.      N.A.       0.04      0.01     27.8
         GS        99          21        14       0.02      0.01     24.1
         NR       100           9         2       0.03      0.01     26.5
  1600   SAS      100        N.A.      N.A.       0.09      0.01     12.7
         GS       100          19        13       0.07      0.05     11.8
         NR       100           9         2       0.05      0.01     12.6
  6400   SAS      100        N.A.      N.A.       0.29      0.05      6.3
         GS       100          19         8       0.27      0.11      6.7
         NR       100           9         1       0.15      0.02      6.3

True parameter θ = (2, 0.3, 0.4); starting point (1.5, 0.2, 0.4).

Initial values: The iterative algorithms need an initial parameter point to start with. The SAS system has a built-in method for choosing the initial values. For the GS and NR methods we compared several different initial points and found that the initial values have only a minor influence on convergence, as long as they are not too far from the true parameter. Therefore, we report results from one set of initial values only. How the SAS software chooses its initial values is unknown to us; clearly it has no prior knowledge about the true values, and this affects its performance.

Convergence criteria: For the SAS quasi-Newton method, convergence is declared when the changes in the parameter estimates become smaller than 10⁻⁵, the default criterion of the software. For comparability we use the same convergence criterion for the GS and NR methods.

Table 1 gives the results of our simulation comparison for a GARCH(1,1) model. From the results we can see that for this low-order GARCH model all three methods perform well, and the sample size requirement is not very large; for example, n = 1600 already gives a relative error of around 13%.

Table 2 lists the results of our comparison of MLE for GARCH(3,2). To get reasonable estimation accuracy we need very large sample sizes; in fact, even at a sample size of 6400 the results are not very accurate. This time the GS method is seen to have a high failure rate at small sample sizes. In terms of computation speed the NR method is the best, which is no surprise since it uses more information about the objective function and requires fewer iteration passes. Table 3 lists the results for GARCH(2,3), and they are very similar to the GARCH(3,2) case.

Tables 4 and 5 give results for two high-order models. We can still see that the NR method is faster, and none of the three methods produces satisfactory estimates even at sample size 6400. The accuracy of the SAS method is weaker, but it should be mentioned that the SAS method does not use any prior knowledge in selecting initial values, whereas we selected initial values near the true parameter point for the NR and GS methods.



Table 2
GARCH(3, 2): comparison of the quasi-Newton method (SAS), the GS method and the NR method

  n      Method   Success    Iteration passes     Time (s)           ARE
                  rate (%)   Median    IQR        Median    IQR      Average (%)
  400    SAS      100        N.A.      N.A.       0.09      0.02     65.0
         GS        34          16        20       0.03      0.04     47.0
         NR       100           6         3       0.03      0.01     62.6
  1600   SAS      100        N.A.      N.A.       0.28      0.06     50.5
         GS        80          20        20       0.19      0.19     32.1
         NR       100           8         2       0.09      0.03     52.9
  6400   SAS      100        N.A.      N.A.       0.95      0.23     34.3
         GS        98          11         4       0.42      0.15     20.3
         NR       100           9         3       0.29      0.07     35.1

True parameter θ = (2, 0.2, 0.1, 0.1, 0.1, 0.2); starting point (1.7, 0.18, 0.07, 0.07, 0.07, 0.15).

Table 3
GARCH(2, 3): comparison of the quasi-Newton method (SAS), the GS method and the NR method

  n      Method   Success    Iteration passes     Time (s)           ARE
                  rate (%)   Median    IQR        Median    IQR      Average (%)
  400    SAS      100        N.A.      N.A.       0.09      0.02     72.9
         GS        26          21        26       0.05      0.05     45.2
         NR       100           5         1       0.03      0.01     59.4
  1600   SAS      100        N.A.      N.A.       0.27      0.07     50.4
         GS        69          28        25       0.28      0.25     29.7
         NR       100           7         3       0.08      0.02     48.4
  6400   SAS      100        N.A.      N.A.       0.96      0.20     30.4
         GS        98          13        16       0.52      0.62     18.1
         NR       100           9         3       0.28      0.06     31.8

True parameter θ = (2, 0.1, 0.2, 0.2, 0.1, 0.1); starting point (1.7, 0.07, 0.15, 0.18, 0.07, 0.07).

What we can say is that the selection of initial values has an important effect on the success of MLE for high-order models. From these simulations we make the following observations.
• All the methods perform well for the GARCH(1,1) model.
• Very large samples are needed to estimate high-order GARCH models, and the selection of initial parameter values has a great impact on the accuracy of estimation.
• The DBFGS method in SAS performs reasonably well in low-order GARCH estimation, but it needs a better initial-value selection method for high-order models.
• The GS method is simple to implement, since it does not compute the whole Hessian matrix or its inverse, but it suffers from a high failure rate.



Table 4
GARCH(4, 4): comparison of the quasi-Newton method (SAS), the GS method and the NR method

  n      Method   Success    Iteration passes     Time (s)           ARE
                  rate (%)   Median    IQR        Median    IQR      Average (%)
  400    SAS      100        N.A.      N.A.       0.15      0.05     104.3
         GS         5           5         7       0.02      0.03      52.2
         NR       100           4         1       0.04      0.01      57.7
  1600   SAS      100        N.A.      N.A.       0.63      0.21      83.8
         GS        50          13        10       0.25      0.19      38.0
         NR       100           5         2       0.09      0.02      57.0
  6400   SAS      100        N.A.      N.A.       2.66      0.84      75.2
         GS        96          11         8       0.79      0.59      26.0
         NR       100           5         1       0.31      0.08      50.7

True parameter θ = (2, 0.2, 0.1, 0.05, 0.05, 0.1, 0.07, 0.03, 0.04); starting point (1.7, 0.18, 0.08, 0.05, 0.05, 0.08, 0.05, 0.05, 0.05).

Table 5
GARCH(6, 6): comparison of the quasi-Newton method (SAS), the GS method and the NR method

  n      Method   Success    Iteration passes     Time (s)           ARE
                  rate (%)   Median    IQR        Median    IQR      Average (%)
  400    SAS      100        N.A.      N.A.       0.34      0.11     134.8
         GS         0.2         2         1       0.01      0.02      78.8
         NR       100           3         1       0.04      0.01      62.9
  1600   SAS      100        N.A.      N.A.       1.56      0.57     105.7
         GS        11           3         9       0.10      0.37      53.1
         NR       100           4         1       0.13      0.03      57.8
  6400   SAS      100        N.A.      N.A.       7.47      2.60      89.6
         GS        50          13        15       1.96      2.34      38.5
         NR       100           5         1       0.51      0.12      47.8

True parameter θ = (2, 0.2, 0.05, 0.03, 0.04, 0.05, 0.02, 0.1, 0.07, 0.04, 0.03, 0.02, 0.02); starting point (1.7, 0.18, 0.05, 0.05, 0.05, 0.05, 0.05, 0.08, 0.05, 0.05, 0.05, 0.05, 0.05).

• The NR method is faster and needs fewer iteration passes. Its disadvantage lies in the large amount of analytical calculation involved in finding the second derivatives.
• It is observed in our experiments that the initial-value effect is not obvious in low-order models but becomes prominent in high-order models. If we choose an initial parameter point far away from the true parameters, the Hessian matrix can be very far from positive definite, so it is hard for the algorithms to reach the global minimum of the objective function.



Choosing larger values for the damping factors {λ_k} in (3.2) and (3.3) can sometimes alleviate this effect, but it is hard to decide how large they should be, because iterations with large λ_k need many more passes to converge.

4. Applications in modeling financial returns by GARCH

As an application, a GARCH model is fitted to a real data set, the daily log-return series of the exchange rate of the US dollar against the Deutsche mark for the period from March 1, 1995 to November 3, 1998. The exchange rate data are shown in Fig. 1, and the log-return series (in percent) is shown in Fig. 2. We used the AIC and BIC to select the orders of the GARCH model.

Fig. 1. US dollar to Deutsche Mark price.

Fig. 2. Log-returns of the exchange price.



Table 6
AIC and BIC values of GARCH(p, q) models for the log-return data

         AIC                           BIC
  q      p = 1    p = 2    p = 3       p = 1    p = 2     p = 3
  1      73.59    79.58    98.60       88.33    99.23     123.17
  2      72.23    74.84    84.95       91.88    99.40     114.43
  3      71.29    72.68    77.55       95.85    102.16    111.93
  4      70.73
  5      72.50

Table 6 gives the AIC and BIC values of GARCH models with various (p, q) combinations for the log-return data. The AIC suggested a GARCH(1,4) model, while the BIC suggested a GARCH(1,1).

We first fitted the GARCH(1,1) model using the algorithms discussed earlier. In order to start the GS and NR iterations, we have to select an appropriate starting value θ^{(0)}. An arbitrary choice of θ^{(0)} = (1.5, 0.2, 0.2) led to a failure of the iteration, so a better method is needed to obtain a crude initial parameter estimate. With no other hint, we first fitted an ARCH(1) model to the series by linear least squares and obtained α̂_0 = 0.0544, α̂_1 = 0.200; we thus chose θ^{(0)} = (0.0544, 0.200, 0.05). Starting from this initial value, both the GS algorithm and the NR algorithm gave the MLE α̂_0 = 0.00506, α̂_1 = 0.0905, β̂_1 = 0.836. The DBFGS quasi-Newton method in SAS gave similar, but not identical, results.

How well does the estimated GARCH(1,1) model describe the log-return series? We calculated the estimate of ε_t in (1.1), denoted by ε̂_t. If the model is appropriate for the series, the ε̂_t series should behave like an iid N(0, 1) series. To check whether the ε̂_t are independent, we applied Ljung–Box chi-square tests for white noise to the series ε̂_t, ε̂_t², ε̂_t³ and ε̂_t⁴ at lag 10. The p-values of these tests are 0.39, 0.63, 0.63 and 0.99, respectively, suggesting independence of the residuals ε̂_t. The sample mean and sample standard deviation of ε̂_t are 0.0043 and 1.0005, very close to 0 and 1, and the p-value for the test of zero mean is 0.90. All of these facts indicate that the fitted model is acceptable.

We also fitted a GARCH(1,4) model, as suggested by the AIC. The initial parameter values were chosen similarly, using an ARCH model. To be safe, we tried several values of λ_0 in (3.2) and (3.3), and found that λ_0 = 100 gives the smallest minimized objective function value. This time the GS and NR estimates were not similar, but NR gave a better objective function value, so we adopted the NR estimates. The parameter estimates are α̂_0 = 0.0069, α̂_1 = 0.1456, β̂_1 = 0.3518, β̂_2 = 0.0629, β̂_3 = 8.35 × 10⁻¹¹ and β̂_4 = 0.3354. We carried out similar white-noise tests as before and arrived at similar conclusions, so the GARCH(1,4) model is also reasonable. Using the principle of parsimony, we consider the GARCH(1,1) model adequate on this occasion.
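The Ljung–Box statistic used for these residual checks is easy to compute directly; the sketch below is our own illustration (not the code used in the paper), applied to a standardized residual series and its powers.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box_pvalue(x, lag=10):
    """Ljung-Box portmanteau test of whiteness at the given lag:
    Q = n(n+2) * sum_{k=1}^{lag} r_k^2 / (n - k), compared with chi-square(lag)."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    r = np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(1, lag + 1)])
    Q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, lag + 1)))
    return 1.0 - chi2.cdf(Q, df=lag)

# eps_hat = e / np.sqrt(h_hat)   # standardized residuals of the fitted model
# for power in (1, 2, 3, 4):
#     print(power, ljung_box_pvalue(eps_hat ** power, lag=10))
```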



5. Discussion

First, when the sample size is sufficiently large, the nonlinear GS, NR and quasi-Newton iterations converge. The convergence of these algorithms for the GARCH model had not been discussed theoretically before; Theorem 1 provides a result in this direction.

Second, the sample size required to guarantee the positive definiteness of the Hessian matrix (i.e. the convexity of the objective function) may vary with the parameter point. In practice the sample size is fixed in advance, so the iterative procedure may fail to converge in some situations. Using a damped version of the algorithms can improve their chance of convergence.

Finally, among the three algorithms used in the simulations, the quasi-Newton method needs the least amount of analytical calculation but the most programming skill. The NR method is the opposite: it involves the greatest amount of analytical evaluation but the simplest programming. The GS method strikes a balance between these two aspects, but unfortunately it has a serious non-convergence problem for high-order models. Generally speaking, the quasi-Newton algorithm seems to be a good choice for high-order models, which agrees with current practice in this area.

Acknowledgements

We thank the Editor and the referees for their constructive comments, which led to improvements in the paper. The first two authors' research was supported by a grant from the Research Committee of The Hong Kong Polytechnic University. The third author's research was supported by the National Natural Science Foundation of China (Grant No. 10471005).

Appendix A. Proof of Theorem 1

In order to prove Theorem 1, we need some lemmas.

Lemma A.1. Suppose {e_t} ∼ GARCH(p, q), defined by (1.1). If the parameter vector θ = (α_0, α_1, . . . , α_p, β_1, . . . , β_q)^T ∈ D, then the sequence X_t = (e_t^2, . . . , e_{t-p+1}^2, h_t, . . . , h_{t-q+1})^T of (p + q)-dimensional random vectors is strongly mixing with geometric rate, that is, there exist a constant K > 0 and a ∈ (0, 1) such that
\[
\sup_{A \in \sigma(X_s,\, s \le 0),\ B \in \sigma(X_s,\, s \ge l)}
|P(A \cap B) - P(A)P(B)| \le K a^{l} .
\]

Proof. Note that the squared process {e_t^2} satisfies the stochastic recurrence equation
\[
X_t = A_t X_{t-1} + B_t,
\]



where
\[
X_t = \big(e_t^2, \ldots, e_{t-p+1}^2, h_t, \ldots, h_{t-q+1}\big)^T,
\]
\[
A_t = \begin{pmatrix}
\alpha_1 \varepsilon_t^2 & \cdots & \alpha_{p-1}\varepsilon_t^2 & \alpha_p \varepsilon_t^2 & \beta_1 \varepsilon_t^2 & \cdots & \beta_{q-1}\varepsilon_t^2 & \beta_q \varepsilon_t^2 \\
1 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 1 & 0 & 0 & \cdots & 0 & 0 \\
\alpha_1 & \cdots & \alpha_{p-1} & \alpha_p & \beta_1 & \cdots & \beta_{q-1} & \beta_q \\
0 & \cdots & 0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}, \qquad (A.1)
\]
\[
B_t = \big(\alpha_0 \varepsilon_t^2,\, 0, \ldots, 0,\, \alpha_0,\, 0, \ldots, 0\big)^T .
\]
But (α_0, α_1, . . . , α_p, β_1, . . . , β_q)^T ∈ D implies that the top Lyapunov exponent associated with {A_t},
\[
\gamma := \inf\Big\{ \frac{1}{n}\, E \ln \|A_1 \cdots A_n\|,\ n = 1, 2, \ldots \Big\},
\]
is negative, where ‖A‖ is the operator norm of a matrix A, i.e. ‖A‖ = sup_{‖x‖ = 1, x ∈ R^{p+q}} ‖Ax‖ (cf. Bougerol and Picard, 1992). Then, from Proposition 4.1 in Davis et al. (1999), {X_t} is strictly stationary and strongly mixing with geometric rate. □

Lemma A.2. Suppose {z_i, i = 1, 2, . . .} is strictly stationary and strongly mixing with rate α(k). If E|z_t|^m < ∞ for some m > 1 and, for any fixed δ > 0,
\[
\alpha(k) =
\begin{cases}
O\big(k^{-\frac{m}{2m-2}-\delta}\big), & \text{if } 1 < m < 2,\\
O\big(k^{-\frac{2}{m}-\delta}\big), & \text{if } m \ge 2,
\end{cases}
\]
then
\[
\frac{1}{n}\sum_{t=1}^{n}\big(z_t - E z_t\big) \xrightarrow{\ \mathrm{a.s.}\ } 0
\qquad \text{as } n \to \infty . \qquad (A.2)
\]
This is given in Chen and Wu (1989). Now we prove Theorem 1.



Proof of Theorem 1. Fix m > 1. Note that, for any θ ∈ D_0, α_i > 0, i = 1, . . . , p, β_j > 0, j = 1, . . . , q, and ε_t ∼ N(0, 1). Therefore, for any θ ∈ D_0,
\[
E\,\Big|\frac{e_{t-i}^2}{h_t}\Big|^{m}\Big|\frac{e_{t-j}^2}{h_t}\Big|^{m}
\le E\,\Big|\frac{e_{t-i}^2}{\alpha_0 + \alpha_i e_{t-i}^2}\Big|^{m}
      \Big|\frac{e_{t-j}^2}{\alpha_0 + \alpha_j e_{t-j}^2}\Big|^{m}
\le \frac{1}{(\alpha_i \alpha_j)^{m}} < \infty \qquad (A.3)
\]
for i, j = 1, 2, . . . , p. Similarly,
\[
E\,\Big|\frac{h_{t-i}}{h_t}\Big|^{m}\Big|\frac{h_{t-j}}{h_t}\Big|^{m} < \infty
\]
for i, j = 1, . . . , q, and
\[
E\,\Big|\frac{e_{t-i}^2}{h_t}\Big|^{m}\Big|\frac{h_{t-j}}{h_t}\Big|^{m} < \infty .
\]

There exist M_{t,1}(ω), M_{t,2}(ω) > 0 such that
\[
\|u_t(\theta,\omega) - u_t(\theta_m,\omega)\|_{\infty} \le M_{t,1}(\omega)\,\|\theta_m - \theta\|
\]
and
\[
\|v_t(\theta,\omega) - v_t(\theta_m,\omega)\|_{\infty} \le M_{t,2}(\omega)\,\|\theta_m - \theta\|
\]
for θ, θ_m ∈ D_1. So, for θ, θ_m ∈ D_1 and ω ∈ Ω_1,
\[
\|f_n'(\theta,\omega) - f_n'(\theta_m,\omega)\|_{\infty}
\le \|\theta_m - \theta\|\,\frac{1}{2n}\sum_{t=1}^{n} z_t(\omega),
\]
where z_t(ω) = |2ε_t^2(ω) − 1|\,M_{t,1}(ω) + |1 − ε_t^2(ω)|\,M_{t,2}(ω). Note that the sequences M_{t,1} and M_{t,2} can be chosen to be strictly stationary, strongly mixing with a geometric rate and dependent only on h_t. Moreover, E(M_{t,1})^m < ∞, E(M_{t,2})^m < ∞



for some m > 1 because of the facts (A.6) and (A.7). Thus, {z_t} satisfies the conditions of Lemma A.2. Denoting M = E z_t and
\[
\Omega^{*} = \Big\{\omega : \omega \in \Omega_1,\
\frac{1}{n}\sum_{t=1}^{n} z_t(\omega) \to M \Big\},
\]
we have P(Ω^{*}) = 1 by Lemma A.2. Then
\[
\lim_{n\to\infty}\,\|f_n'(\theta,\omega) - f_n'(\theta_m,\omega)\|_{\infty}
\le M\,\|\theta_m - \theta\| .
\]
Therefore, for each ω ∈ Ω^{*} and θ ∈ D_1, taking θ_m such that ‖θ_m − θ‖ < 1/(Mε) and (A.12) holds, we can find a large N(θ,ω) such that when n ≥ N(θ,ω),
\[
\|f_n'(\theta,\omega) - f'(\theta)\|_{\infty}
\le \|f_n'(\theta,\omega) - f_n'(\theta_m,\omega)\|_{\infty}
 + \|f_n'(\theta_m,\omega) - f'(\theta_m)\|_{\infty}
 + \|f'(\theta_m) - f'(\theta)\|_{\infty}
< 3\varepsilon .
\]
Note that λ_min(θ) is also uniformly continuous on D_1 and attains its minimum on D_1. Denoting λ_min = min_{θ∈D_1} λ_min(θ) > 0 and taking 0 < ε < (1/6)λ_min, we have, for ω ∈ Ω^{*}, θ ∈ D_1 and any g ∈ R^{p+q+1}\{0},
\[
\frac{g^T f_n'(\theta,\omega)\, g}{g^T g}
= \frac{g^T f'(\theta)\, g}{g^T g}
 + \frac{g^T \big(f_n'(\theta,\omega) - f'(\theta)\big) g}{g^T g}
\ge \lambda_{\min}(\theta) - \big\|f_n'(\theta,\omega) - f'(\theta)\big\|_{\infty}
> \lambda_{\min} - 3\varepsilon > \tfrac{1}{2}\lambda_{\min}
\]
for n ≥ N(θ,ω). Therefore, the result of Theorem 1 holds for C = (1/2)λ_min. □



References

Bollerslev, T., 1986. Generalized autoregressive conditional heteroscedasticity. J. Econometrics 31, 307–327.
Bougerol, P., Picard, N., 1992. Stationarity of GARCH processes and of some nonnegative time series. J. Econometrics 52, 115–127.
Chen, X.R., Wu, Y.H., 1989. Strong law for a mixing sequence. Acta Math. Appl. Sinica 5, 367–371.
Davis, R.A., Mikosch, T., Basrak, B., 1999. The sample ACF of solutions to multivariate stochastic recurrence equations with application to GARCH. Technical Report, University of Groningen, http://www.math.ku.dk/mikosch.
Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of variance of United Kingdom inflations. Econometrica 50, 987–1007.
Lee, S.W., Hansen, B.E., 1994. Asymptotic theory for the GARCH(1,1) quasi-maximum likelihood estimator. Econometric Theory 10, 29–52.
Ling, S., Li, W.K., 1997. On fractionally integrated autoregressive moving-average time series models with conditional heteroskedasticity. J. Amer. Statist. Assoc. 92, 1184–1194.
Ling, S., Li, W.K., 1998. Limiting distribution of maximum likelihood estimators for unstable ARMA models with GARCH errors. Ann. Statist. 26, 84–125.



Ling, S., McAleer, M., 2003. Asymptotic theory for a new vector ARMA-GARCH model. Econometric Theory 19, 280–310.
Mak, T.K., 1993. Solving nonlinear estimation equations. J. Roy. Statist. Soc. B, 945–955.
Mak, T.K., Wong, H., Li, W.K., 1997. Estimation of nonlinear time series with conditional heteroscedastic variances by iteratively weighted least squares. Comput. Statist. Data Anal. 24, 169–178.
Nelson, D.B., 1990. Stationarity and persistence in the GARCH(1,1) model. Econometric Theory 6, 318–334.
Ortega, J.M., Rheinboldt, W.C., 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
Taylor, S.J., 1986. Modelling Financial Time Series. Wiley, New York.
Thisted, R.A., 1988. Elements of Statistical Computing: Numerical Computation. Chapman & Hall, New York.
Wang, D.H., Song, L.X., Shi, N.Z., 2002. Estimation and test of the ARCH(0,2) model under order restriction. Chinese J. Appl. Probab. Statist. 18 (3), 244–254.
Weiss, A.A., 1986. Asymptotic theory for ARCH models: estimation and testing. Econometric Theory 2, 107–131.
Wong, H., Li, W.K., 2002. Detecting and diagnostic checking multivariate conditional heteroscedastic time series models. Ann. Inst. Statist. Math. 54 (1), 45–59.