Time Series Models for Forecasting: Testing or Combining?

Zhuo Chen
Department of Economics, Heady Hall 260, Iowa State University, Ames, Iowa 50011
Phone: 515-294-5607  Email: [email protected]

Yuhong Yang
Department of Statistics, Snedecor Hall, Iowa State University, Ames, IA 50011-1210
Phone: 515-294-2089  Fax: 515-294-4040  Email: [email protected]

November, 2002
Abstract

In this paper we compare the forecasting performance of hypothesis testing procedures with that of a model combining algorithm called AFTER. Testing procedures are commonly used in practice to select a model on which forecasts are then based. However, besides the well-known difficulty in dealing with multiple tests, the testing approach has a potentially serious drawback: controlling the probability of Type I error can excessively favor the null, which can be problematic for the purpose of forecasting. In addition, as shown in this work, testing procedures can be very unstable, which results in high variability in the forecasts. Based on empirical evidence and theoretical considerations, we advocate the use of AFTER for the purpose of forecasting when multiple candidate models are considered and there is non-negligible uncertainty in finding the best model, because AFTER tends to do better or much better in such cases.
Keywords: Combining forecasts, ARIMA modeling, hypothesis testing, model selection, instability in model selection

Biographies: Zhuo Chen received his BS and MS degrees in Management Science from the University of Science and Technology of China in 1996 and 1999 respectively. He recently graduated from the Department of Statistics at Iowa State University with an MS degree and is continuing his Ph.D. study in the Department of Economics at Iowa State University. Yuhong Yang (corresponding author) received his Ph.D. in Statistics from Yale University in 1996. He then joined the Department of Statistics at Iowa State University as an assistant professor and became associate professor in 2001. His research interests include nonparametric curve estimation, pattern recognition, and combining procedures. He has published papers in statistics and related journals including the Annals of Statistics, Journal of the American Statistical Association, Bernoulli, Statistica Sinica, Journal of Multivariate Analysis, and IEEE Transactions on Information Theory.
1 Introduction
How to choose the “best” model among candidate models (e.g., ARMA models with different orders) is one of the most important issues associated with time series modelling. It is interesting to see that statisticians and econometricians tend to have different preferences on this matter (Chatfield 2001, p. 47). Statisticians seem to rely more on information criteria such as AIC (Akaike, 1973), while econometricians are inclined to apply a series of testing procedures. At this time, to our knowledge, there does not seem to be a clear understanding of the advantages and disadvantages of these approaches to model selection. Although identification of the “best” model is an important task, for the purpose of forecasting, which is our focus in this work, combining forecasts from different models is a reasonable alternative for consideration.
1.1 Model Selection Procedures
Let $Y_1, Y_2, \ldots$ be a time series. At time $n$, for $n \ge 1$, we are interested in one-step ahead point forecasting of the next value $Y_{n+1}$ based on the observed values of $Y_1, \ldots, Y_n$. Statistical modelling usually involves a group of candidate models. Those who believe that one of these models is true or approximately true naturally tend to use a model selection procedure to find it, which can then be used for forecasting. Information criteria for model selection are typically based on minus the maximized log-likelihood plus a penalty that is a function of the number of parameters in the model. Classical pair-wise testing is an alternative. Bauer, Pötscher & Hackl (1988) proposed a multiple test procedure for inferring the dimension of a general finite-parameter model. The procedure reduces to simultaneously testing each regression parameter by a t-test in the case of the standard linear regression model. Pötscher (1991) investigated the asymptotic properties of parameter estimators that are based on a model selected by means of a multiple testing procedure, which tests $M(p-1)$ against $M(p)$, $M(p-2)$ against $M(p-1)$, and so on, until a test produces a significant p-value. Here $M(p)$ denotes the model with $p$ being the dimension of the parameter space, e.g., the AR or MA order. In our view, the testing approach to model selection has the following drawbacks for forecasting. First, testing procedures usually assume that there is a true model among the candidates and that it performs better than the other models. As Granger, King & White (1995) pointed out, the customary practice of controlling the probability of Type I error causes testing procedures to unduly favor the null hypothesis. While there is no objective guideline for the choice of the size of the test, the trade-off between Type I and Type II errors has an unclear effect on forecasting accuracy. Second, one also faces the challenging issue of multiple testing. Due to the sequential nature of the tests, different sequences of tests often produce different results, and unfortunately there is no well-grounded guideline in the literature on the choice of the sequence of the tests. Generally, there is little
one can say about the overall probabilities of error. Another drawback associated with testing is its instability. With a small or moderate number of observations, as expected, models close to each other are usually hard to distinguish. Testing procedures then have to make a tough decision with a borderline p-value. A slight disturbance of the data may result in the choice of a model different from the original or the “true” one. Forecasts based on these procedures can thus be quite unstable, as will be seen in section 3. The use of information criteria shares this drawback as well. The above considerations suggest that we might need to go back to the original question: what should we do with the candidate models? Is selecting a “best” model the best approach?
1.2 Combining: An Alternative?
An alternative is to combine the candidate models. Yang (2001b) proposed an algorithm, AFTER, to combine forecasts from the candidate models. The goal of this algorithm is that, with an appropriate weighting scheme, the combined forecast has smaller variability so that forecasting accuracy can be improved relative to the use of a model selection method. The rationale is that combining can reduce the prediction risk when it is difficult for a testing procedure to choose the true or the best model. Note that this motivation for combining is quite different from the one typically considered in the literature, which focuses on improving on the candidate forecasts (see Clemen (1989) for a review). In that direction, for instance, econometricians have been combining structural and non-structural models; e.g., Liang (1995) made a critical evaluation of several macroeconomic forecasting procedures and showed that the forecast accuracy of several candidate models can be improved significantly when they are combined. In a different direction, Yang (2001a,b) argued that combining (mixing) forecasts from very similar models also has its own value. While the true (or the best) model can never be revealed with certainty, combining can reduce the variability that arises from the forced action of selecting a single model. Indeed, the goal of forecasting does not have to share the possible pain of deciding the “best” model. This paper does not denounce the usefulness of testing procedures. Testing hypotheses to identify the true model (when it makes good sense) is important for understanding structural relationships. Forecast combining will not be beneficial when the null model and the alternative are so clearly distinguished that testing procedures can make a sound decision with ease. Yang and Zou (2002) studied the difference between combining and model selection based on information criteria. This paper compares the performance of testing procedures and the AFTER algorithm. In section 2, we present the testing and combining methods to be used in this work. Some stability measures of testing procedures are studied in section 3. We present results of several simulation studies on the comparison of testing and combining in the simple setting of white noise vs. AR(1) in section 4. More general cases and examples of real data are studied in section 5. Section 6 gives our conclusions and possible new research directions.
2 Some Preliminaries
2.1 Evaluation of Forecasting Accuracy
For $i \ge 1$, suppose the conditional distribution of $Y_i$ given the previous observations $Y^{i-1} = \{Y_j\}_{j=1}^{i-1}$ has mean $m_i$ and variance $v_i$. Let $\hat{Y}_i$ be a predicted value of $Y_i$ based on $Y^{i-1}$. Then the one-step ahead mean square prediction error $E(Y_i - \hat{Y}_i)^2$ can be decomposed as
$$ E\left(Y_i - \hat{Y}_i\right)^2 = E\left(m_i - \hat{Y}_i\right)^2 + E v_i. $$
For comparing forecasts, we can remove the second term since it is the same for all forecasting procedures. Therefore, we may consider the loss function
$$ L(Y_i, \hat{Y}_i) = \left(m_i - \hat{Y}_i\right)^2 $$
and the corresponding risk
$$ E\left(m_i - \hat{Y}_i\right)^2. $$
Suppose $\delta$ is a forecasting procedure that yields forecasts $\hat{Y}_1, \hat{Y}_2, \ldots$ at times 1, 2, and so on. The sequential average net mean square prediction error $ANMSEP(\delta; n_0; n)$ for forecasting $Y$ between times $n_0 + 1$ and $n + 1$ is defined by Yang & Zou (2002) as
$$ ANMSEP(\delta; n_0; n) = \frac{1}{n - n_0 + 1} \sum_{i=n_0+1}^{n+1} E\left(m_i - \hat{Y}_i\right)^2. $$
When $n_0 = n$, it is denoted $NMSEP(\delta; n)$. $ANMSEP$ and $NMSEP$ will be used to evaluate forecasting performance in our simulation studies. When evaluating forecasting procedures based on a real data set of size $n$, with the first $n_0$ values used as the training data, we will consider the (sequential) average square error in prediction
$$ ASEP(\delta; n_0; n) = \frac{1}{n - n_0} \sum_{i=n_0+1}^{n} \left(Y_i - \hat{Y}_i\right)^2 $$
as a performance measure of a forecasting procedure. ASEP can be computed from the data alone.
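As a small illustration of how the ASEP criterion above could be computed from a series and its one-step forecasts, here is a sketch; the function name and array conventions are ours, not the paper's.

```python
import numpy as np

def asep(y, y_hat, n0):
    """ASEP(delta; n0; n): average squared one-step prediction error over
    times n0+1, ..., n (0-indexed arrays here, so positions n0, ..., n-1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y[n0:] - y_hat[n0:]) ** 2)
```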
2.2 Testing Procedures

2.2.1 Several Commonly Used Testing Procedures
Testing theory provides many varieties of methods, even when we restrict attention to time series models. One of the most widely used tests of white noise against AR(1) is the Durbin-Watson test (Durbin & Watson, 1971). It is reasonably powerful, with the drawback of having an inconclusive region. The Lagrange multiplier test proposed by Breusch (1978) is a less restrictive procedure. It tests the null hypothesis of no auto-correlation against the alternative of an AR or MA model of a given order.
Another less restrictive testing procedure, which is asymptotically equivalent to the Lagrange multiplier test, is the Portmanteau Q test (Box and Pierce, 1970), based on an approximate chi-square statistic. Ljung and Box (1979) provided a refinement of this test using a better approximated chi-square statistic. These tests have been widely used, but all of them share the difficulty of deciding the order of the alternative model. The likelihood ratio test can be applied when the models are nested (e.g., Pötscher, 1991). A similar idea exists in Whittle's variance-ratio test (Whittle, 1952), which is based on estimating both the null hypothesis and a more general or overfitted alternative. McAleer, McKenzie & Hall (1988) developed "separate tests" for testing an AR(p) null against a separate MA(q) alternative. Based on Monte Carlo simulations, Hall and McAleer (1989) compared the robustness of the Durbin-Watson test, Whittle's ratio test, the Q test and the Ljung-Box Q test with several choices of p. They found that when the errors are normally distributed, the Q test is unreliable and the other tests are reasonably accurate for most sample sizes for testing an AR null hypothesis; Whittle's ratio test, however, is generally unreliable for testing an MA null hypothesis.
2.2.2 Testing Procedures to be Applied in the Paper
In this work, we do not consider the Lagrange multiplier test. The Durbin-Watson test is ruled out by its inability to reach a decision when the test statistic falls in the inconclusive region. We next discuss the testing procedures that will be applied in our work.

Testing white noise against an AR(1) model. Though this problem looks simple, it is important since AR(1) represents a large family of auto-correlated series, just as ARMA(1,1) was used in Andrews and Ploberger (1996) to provide "parsimonious representations of a broad class of stationary time series". Let $\gamma$ denote the auto-correlation coefficient and $\hat{\gamma}$ its sample version. We formalize the N test of $H_0: \gamma = 0$ vs. $H_1: \gamma \neq 0$ as: reject the null when $|\hat{\gamma}|$ is greater than $z_{1-\alpha/2}/\sqrt{n}$; otherwise conclude that the series is white noise. Here $\alpha$ denotes the pre-determined significance level. This test is in fact equivalent to a Q test with $p$ set to one, due to the obvious relationship between the chi-square and normal distributions. The Ljung-Box modified version of the Q test is: reject the null when $n(n+2)\hat{\gamma}^2/(n-1)$ is greater than $\chi^2_{1,1-\alpha/2}$.
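The two tests just described can be coded in a few lines. The sketch below is our reading of the definitions above, not the authors' code; in particular, the sample lag-1 autocorrelation is computed in the usual mean-corrected way.

```python
import numpy as np
from scipy import stats

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation gamma_hat."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return np.sum(y[1:] * y[:-1]) / np.sum(y ** 2)

def n_test_rejects(y, alpha=0.05):
    """N test: reject white noise when |gamma_hat| > z_{1-alpha/2} / sqrt(n)."""
    n, g = len(y), lag1_autocorr(y)
    return abs(g) > stats.norm.ppf(1.0 - alpha / 2.0) / np.sqrt(n)

def lb_q_test_rejects(y, alpha=0.05):
    """Ljung-Box version: reject when n(n+2) gamma_hat^2 / (n-1) exceeds the chi-square cutoff."""
    n, g = len(y), lag1_autocorr(y)
    return n * (n + 2) * g ** 2 / (n - 1) > stats.chi2.ppf(1.0 - alpha / 2.0, df=1)
```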
Testing AR(p) and MA(q) models. Suppose we are trying to find the best model among white noise, AR(1), ..., AR(p) ($p > 1$) using testing procedures. We consider three natural ways to carry out the multiple tests. The first starts by testing AR(p−1) against AR(p); if the test gives a significant p-value, we reject the null and conclude that AR(p) is the true model, otherwise we proceed to test AR(p−2) against AR(p), and so on until we find a significant p-value. We conclude that the series is white noise if none of the tests turns out to be significant. We denote this method LRT1 hereafter.
The second compares the models sequentially downward, testing AR(p−1) against AR(p), AR(p−2) against AR(p−1), and so on until there is a significant p-value. We conclude that the series is white noise if none of the tests rejects the null hypothesis. This procedure is denoted LRT2 and was studied, e.g., in Pötscher (1991). We can also compare the candidate models sequentially upward, testing white noise against AR(1), AR(1) against AR(2), and so on, until we reach AR(p−1) against AR(p) or one of the nulls is accepted. We call this method LRT3. Similarly, we can define LRT1, LRT2 and LRT3 for MA(q) models.

Our plan. We will compare the testing procedures and AFTER in three scenarios. The first is white noise versus AR(1), where we compare the N test, the Ljung-Box Q test (LB-Q test), and the likelihood ratio test with AFTER. The second concerns the more flexible AR(p) and MA(q) models, where we use the three likelihood ratio test procedures mentioned above to select a model, while using AFTER to combine all the p+1 models. Monte Carlo simulations are used to evaluate the forecasting performance. We also compare the performance of testing procedures and AFTER on several real data sets.
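As a concrete illustration of the sequential schemes LRT2 and LRT3 described above (LRT1 is analogous, with the alternative fixed at AR(p)), here is a sketch. Model fitting is delegated to statsmodels' ARIMA, which is our choice for illustration and not necessarily the authors' implementation; each pairwise comparison is a likelihood ratio test with one degree of freedom.

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.arima.model import ARIMA

def ar_loglik(y, p):
    """Maximized log-likelihood of a zero-mean AR(p) model (p = 0 is white noise)."""
    return ARIMA(y, order=(p, 0, 0), trend="n").fit().llf

def rejects(y, p_small, p_big, alpha=0.05):
    """Likelihood ratio test of AR(p_small) against AR(p_big)."""
    lr = 2.0 * (ar_loglik(y, p_big) - ar_loglik(y, p_small))
    return lr > stats.chi2.ppf(1.0 - alpha, df=p_big - p_small)

def lrt2(y, max_p, alpha=0.05):
    """LRT2: compare AR(p-1) vs AR(p), AR(p-2) vs AR(p-1), ...; stop at the first rejection."""
    for p in range(max_p, 0, -1):
        if rejects(y, p - 1, p, alpha):
            return p          # first significant comparison: keep the larger order
    return 0                  # nothing rejected: conclude white noise

def lrt3(y, max_p, alpha=0.05):
    """LRT3: compare white noise vs AR(1), AR(1) vs AR(2), ...; stop when a null is accepted."""
    p = 0
    while p < max_p and rejects(y, p, p + 1, alpha):
        p += 1
    return p
```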
2.3 Algorithm AFTER for Combining
Yang (2001b) proposed the Aggregated Forecast Through Exponential Re-weighting (AFTER) algorithm to combine different forecasts. He examined its theoretical convergence properties and proved that it asymptotically performs as well as the (unknown) best candidate procedure. In this paper, we apply AFTER to the case where multiple models of a similar nature (e.g., AR models of different orders) are considered for forecasting, and examine its performance relative to testing procedures. To combine the forecasting procedures $\Delta = \{\delta_1, \delta_2, \ldots, \delta_J\}$, at each time $n$ the AFTER algorithm looks at their past performance and assigns weights adaptively as given below. Based on $Y_1, \cdots, Y_k$, let $\hat{v}_{j,k}$ denote the estimate of the error variance and $\hat{y}_{j,k}$ the forecast of $Y_k$ by procedure $\delta_j$. Let $W_{j,1} = 1/J$ and, for $n \ge 2$, let
$$ W_{j,n} = \frac{W_{j,n-1}\, \hat{v}_{j,n-1}^{-1/2} \exp\!\left(-\frac{(Y_{n-1} - \hat{y}_{j,n-1})^2}{2\hat{v}_{j,n-1}}\right)}{\sum_{j'} W_{j',n-1}\, \hat{v}_{j',n-1}^{-1/2} \exp\!\left(-\frac{(Y_{n-1} - \hat{y}_{j',n-1})^2}{2\hat{v}_{j',n-1}}\right)}. \qquad (1) $$
Then combine the forecasts by $\hat{y}_n^* = \sum_{j=1}^{J} W_{j,n}\, \hat{y}_{j,n}$. Note that the combined forecast is a convex combination of the forecasts made by the candidate models. The weighting in equation (1) has a Bayesian interpretation. If we view $W_{j,n-1}$, $j \ge 1$, as the prior probabilities on the procedures before observing $Y_{n-1}$, then $W_{j,n}$ is the posterior probability of $\delta_j$ after $Y_{n-1}$ is seen. However, AFTER is not a formal Bayesian procedure in nature, as no prior probability distributions are placed on the parameters.
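To make the recursion in (1) concrete, here is a minimal numerical sketch. It assumes that each candidate procedure supplies, at every time point, a one-step forecast and an error-variance estimate (how those are produced is up to the user), and all names are ours.

```python
import numpy as np

def after_combine(y, forecasts, variances):
    """AFTER weighting of equation (1).

    y         : observed series, length n
    forecasts : (n, J) array; forecasts[t, j] is candidate j's forecast of y[t]
    variances : (n, J) array; variances[t, j] is candidate j's error-variance estimate
    Returns the combined forecasts and the weight path.
    """
    n, J = forecasts.shape
    W = np.full(J, 1.0 / J)                 # W_{j,1} = 1/J: uniform initial weights
    combined, weights = np.empty(n), np.empty((n, J))
    for t in range(n):
        weights[t] = W
        combined[t] = W @ forecasts[t]      # convex combination of the candidate forecasts
        # W_{j,t+1} is proportional to W_{j,t} * v^{-1/2} * exp(-(y_t - yhat_{j,t})^2 / (2 v))
        unnorm = W * variances[t] ** -0.5 * np.exp(
            -((y[t] - forecasts[t]) ** 2) / (2.0 * variances[t]))
        W = unnorm / unnorm.sum()
    return combined, weights
```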
The initial uniform weighting can be modified to improve the forecasting efficiency. The idea is to assign more weight to those procedures that have been predicting more accurately in the past (assuming training data are available). Simulations show that by doing so, the overall forecasting efficiency can be improved. See section 4.2 for some details. A theoretical result similar to Theorem 2 of Yang (2001b) holds. We mention briefly that under the conditions for Theorem 2 in Yang (2001b), assuming the conditional variance of the error $e_i$ is a constant $\sigma^2$ for all $i$, which is the case for the time series models considered in this paper, we have
$$ ANMSEP(\delta^*; 1, n) \le c \inf_{1 \le j \le J} \left( \frac{E \log \frac{1}{W_{j,1}}}{n} + ANMSEP(\delta_j; 1, n) + \frac{1}{n}\sum_{i=1}^{n} E(\hat{v}_{j,i} - \sigma^2)^2 \right), $$
where $\delta^*$ denotes the combining procedure and $c$ is a constant. When the initial weighting puts a higher weight on the best procedure, this risk bound improves on the one in Yang (2001b).
3 Measuring Stability of Testing
Testing as an approach to model selection can be very unstable. When there is substantial uncertainty in finding the best model by testing, alternative methods for forecasting, such as model combining, are potentially advantageous. Yang and Zou (2002) proposed two approaches for measuring the instability of model selection, which we borrow here in the context of testing, with further developments. A testing procedure is targeted at finding the "true" data generating process; therefore, if the testing procedure's selection is unstable, its ability to choose the "true" model is subject to doubt.
3.1 Sequential stability
One idea of measuring stability of testing is to examine its consistency in selection at slightly different data sizes. Suppose that the model $\hat{k}_n$ is accepted by a testing procedure based on all the observations $\{Y_i\}_{i=1}^{n}$. Let $L$ be an integer between 1 and $n-1$. For each $j$ in $\{n-L, n-L+1, \ldots, n-1\}$, let $\hat{k}_j$ denote the model selected by the testing procedure based on the data $\{Y_i\}_{i=1}^{j}$. Then let $\kappa$ be the relative frequency with which the same model ($\hat{k}_n$) is chosen, i.e.,
$$ \kappa = \frac{\sum_{j=n-L}^{n-1} I\{\hat{k}_j = \hat{k}_n\}}{L}, $$
where $I\{\cdot\}$ denotes the indicator function. The rationale behind $\kappa$ is that removing a few observations should not cause any significant change for a stable procedure. Thus, with a relatively small $L$, if $\kappa$ is substantially smaller than 1, it indicates instability of the testing procedure. We give a simple property of the $\kappa$ measure here. Recall that a model selection procedure is said to be consistent if its probability of selecting the correct model goes to 1 as the sample size goes to $\infty$.

Proposition 1: For any choice of $1 \le L \le n-1$, if a model selection procedure is consistent, then $\kappa$ converges to 1 in probability as $n \to \infty$.
The proof of this proposition is given in the appendix. Based on this result, if $\kappa$ is much smaller than 1, we have a strong reason to doubt that the selected model is true. Testing procedures can be very unstable in terms of the $\kappa$ measure for some data sets. In our simulations, we sometimes get values of $\kappa$ equal to zero. This phenomenon seems to be related to the significance level: when a testing procedure chooses a certain model based on the original data set with a p-value very close to the significance level, it is possible to reject all the models based on the shortened data sets. The $\kappa$ measure gives very useful information regarding how much we can trust the selected model.
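A sketch of the κ computation is given below; select_order stands for whatever selection rule (e.g., one of the LRT schemes in section 2) is being assessed, and is a placeholder of ours rather than something defined in the paper.

```python
import numpy as np

def kappa(y, select_order, L):
    """Sequential stability: the fraction of the L shortened series that lead to the
    same selected model as the full series of length n."""
    n = len(y)
    k_full = select_order(y)                  # model chosen from all n observations
    same = [select_order(y[:j]) == k_full     # refit on Y_1, ..., Y_j
            for j in range(n - L, n)]
    return float(np.mean(same))
```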
Figure 1: κ Measure of LRT with AR models

Figure 1 shows the κ measure of LRT1, LRT2 and LRT3 with AR models up to order 8 as the candidates, where the true models are AR(1) with autocorrelation coefficient γ varying from zero to one and series lengths of 20 and 50. L is set to 20% of the series length. The figure shows that the testing procedures are more stable for longer series or for series generated with autocorrelation coefficients near zero or one, as intuition suggests. In this case, LRT3 is comparatively better, possibly because it tests upward, so having more candidate models does not appear to cause a big loss in stability as measured by κ. For data set 1 in section 5.2, κ (L = 15) equals 1 for all three testing procedures when choosing from white noise and AR models up to order 8. The fact that it has 127 observations might have ensured its sequential stability. For the much shorter data set 2, κ (L = 5) is 0.4, 1 and 1 for LRT1, LRT2 and LRT3 respectively when choosing from white noise and AR models up to order 5, and 0.4, 0.4 and 1 when choosing from white noise and AR models up to order 8. It is interesting to notice that LRT1 and LRT2 are less sequentially stable than LRT3 in this case and that adding more candidate models increases the instability.
3.2 Perturbation stability
Another approach is to measure stability in model selection through perturbation. The idea is that if a model selection procedure is stable, a minor perturbation of the data should not change the outcome dramatically. Consider ARMA models $\phi(B)Y_i = \theta(B)e_i$, where $B$ is the backward shift operator, $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$. We generate a time series following the model $\hat{\phi}(B)W_i = \hat{\theta}(B)\eta_i$ selected by the testing procedure, with the ARMA parameters estimated from the original data. Let $\eta_i$ be i.i.d. $\sim N(0, \tau^2\hat{\sigma}^2)$ with $\tau > 0$ and $\hat{\sigma}^2$ an estimate of $\sigma^2$ based on the selected model. Consider $\tilde{Y}_i = Y_i + W_i$ for $1 \le i \le n$ and apply the testing procedure to the new data $\{\tilde{Y}_i\}_{i=1}^{n}$. If the testing procedure is stable for the data, then when $\tau$ is small, the newly accepted model is most likely the same as before and the corresponding forecast should not change much. Repeat the perturbation a large number of times.

Stability in acceptance (or selection). For each $\tau$, we can record the percentage of times that the originally selected model is chosen again under the perturbations, and plot this percentage versus $\tau$. If the percentage decreases sharply in $\tau$, it indicates that the testing procedure is not able to reach a stable decision, which in turn casts doubt on its ability to choose the best model for forecasting. We can summarize stability in acceptance by minus the slope of the perturbation plot above at $\tau = 0$. For computing the slope, we may consider $k$ equally spaced values of $\tau$ in an interval, e.g. $[0, \frac{1}{3}]$, and run a linear regression on the first $k$ observations of the percentage. In this work, we chose $k$ to be 15. The regression is forced to pass through the point (0, 1). We define minus the fitted slope as the $\phi_S$ measure. If the $\phi_S$ measure is too big, the results of the testing procedure are subject to doubt, since it chooses different models (whether "true" or not) under even a small perturbation. Testing procedures are possibly quite sensitive to the size: when we tried to use 0.05 as the significance level for LRT2 on data set 2, it failed to choose the same model even at a very small perturbation.

Instability in forecasting. Stability in acceptance (or selection) of testing does not necessarily capture the stability in forecasting, since different models may perform similarly well or badly in prediction. By averaging $|\tilde{y}_{n+1} - \hat{y}_{n+1}|/\hat{\sigma}$ over a large number (e.g. 200) of independent perturbations at size $\tau$, where $\tilde{y}_{n+1}$ is obtained by applying the testing procedure again to the perturbed data, and $\hat{y}_{n+1}$ and $\hat{\sigma}$ are the forecast of $Y_{n+1}$ and the estimate of the error standard deviation based on the original data, Yang and Zou (2002) defined the forecast perturbation instability at perturbation size $\tau$ for the given data. Similarly to the $\phi_S$ measure, we can summarize the instability in forecasting by regressing the first $k$ instability values on the perturbation size and define the slope of the fitted line as the $\phi_F$ measure. The $\phi_S$ and $\phi_F$ measures are mostly determined by the length of the series, the parameter pattern of the true data generating process, the specific testing procedure applied, and the size of the testing procedure. For all the empirical studies given below, the tests use white noise and AR models up to order 8 as candidate models, and the size of the tests is always set at 0.05.
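A sketch of the perturbation scheme for AR candidates follows. It assumes a selection rule returning an AR order, uses statsmodels to estimate the selected model and to simulate the perturbation series W (both our choices for illustration, not the authors' code), and computes φS as described above: a regression of the acceptance percentage on the first k perturbation sizes, forced through (0, 1).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

def selection_stability(y, select_order, taus, n_rep=200, seed=0):
    """For each perturbation size tau, the fraction of perturbed series Y + W that
    lead to the same selected AR order as the original data."""
    rng = np.random.default_rng(seed)
    n, k0 = len(y), select_order(y)
    fit = ARIMA(y, order=(k0, 0, 0), trend="n").fit()   # parameters of the selected model
    phi, sigma2 = fit.arparams, fit.params[-1]          # AR coefficients, innovation variance
    frac = []
    for tau in taus:
        same = 0
        for _ in range(n_rep):
            # W follows the selected AR model with innovation s.d. tau * sigma_hat
            w = arma_generate_sample(np.r_[1.0, -phi], [1.0], nsample=n,
                                     scale=tau * np.sqrt(sigma2),
                                     distrvs=rng.standard_normal)
            same += select_order(y + w) == k0
        frac.append(same / n_rep)
    return np.asarray(frac)

def phi_s(taus, frac, k=15):
    """phi_S: minus the slope of a line through (0, 1) fitted to the first k points."""
    t, f = np.asarray(taus[:k], float), np.asarray(frac[:k], float)
    slope = np.sum(t * (f - 1.0)) / np.sum(t ** 2)      # least squares with intercept fixed at 1
    return -slope
```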
Figure 2: Perturbation Stability of Data set 1 (n=126)

Figures 2 and 3 are the perturbation plots for data sets 1 and 2, respectively. Figure 2 shows that for data set 1 the three procedures are quite stable, with LRT3 the best. The $\phi_S$ measures for the tests are 1.8748, 1.6540 and 0.0809, while the $\phi_F$ measures are 1.3630, 1.2519 and 0.8409, respectively. Figure 3 shows the selection stability and forecasting instability of the testing procedures for data set 2, which has only 26 observations. The $\phi_S$ measures for the tests are 1.6038, 7.3279 and 3.2614 respectively, and the $\phi_F$ measures are 1.8561, 5.764×10^7 and 5.8151, respectively. Clearly LRT2 is having great difficulty (in fact, it produced a non-stationary model, which caused the peculiar behavior in the perturbation plots). This corresponds well with the much worse performance of LRT2 compared to LRT1 and LRT3, as shown in Table 6 in section 5.2 with max p = 8. In our simulation with n = 30, when the true model is AR(1), the $\phi_F$ measure is increasing in γ (see the first row of Figure 4), which means the procedures are less stable against perturbation when γ is larger. The reason is that the measure reflects not only the instability in selection, but also the instability due to parameter re-estimation. As for the $\phi_S$ measure, similarly to the κ measure, our simulation shows that with a true model of AR(1), the testing procedures are less stable in terms of $\phi_S$ when γ is in the center of the [0, 1] interval (see the second row of Figure 4). In our setting, LRT3 seems to be more stable than LRT1 and LRT2. For a more complicated simulated AR(4) series with coefficients -0.579241, 0.696305, 0.181805 and -0.104232, the testing procedures are quite unstable by both the $\phi_S$ and $\phi_F$ measures, as shown in Figure 5. The $\phi_S$ measures are 1.1790, 1.2194 and 0.4913, while the $\phi_F$ measures are 2.0425, 2.0409 and 1.9127, respectively. We made several replications and a similar pattern appears.
Figure 3: Perturbation Stability of Data set 2 (n=26)
4 Comparing Testing with AFTER for White Noise vs. AR(1)
As mentioned in section 2, the problem of testing white noise against AR(1) is not trivial, and the simplicity is also helpful for understanding the difference between testing and combining.
4.1 Do we need to identify the true model for forecasting?
Given two candidate models to fit a time series, when the two models are nested it seems most natural to employ a testing procedure to assess which is more likely to be the one that generated the data (assuming one of them is correct). Even in this case, a serious concern with the testing approach is that identifying the true model is not necessarily the right goal for the purpose of forecasting. Since the mean square error is composed of the squared bias and the variance, a significant reduction in variance traded against a slight increase in bias might be better in terms of forecasting performance. This is illustrated in the comparison of white noise against an AR(1) model, i.e., model 1: $Y_n = e_n$ versus model 2: $Y_n = \gamma Y_{n-1} + e_n$, where $\{e_i\}$ are i.i.d. normally distributed with mean zero and variance $\sigma^2$. Here we consider $0 \le \gamma < 1$. We will see that, at a given sample size, when the auto-correlation coefficient falls in a certain interval the true model does not perform better. Under white noise, the mean square prediction risk is clearly minimized when $\hat{Y}_{n+1} = 0$. If the actual $\gamma$ is nonzero, then the MSEP is
$$ E\left(\hat{Y}_{n+1} - Y_{n+1}\right)^2 = \sigma^2 + \gamma^2 E Y_n^2. $$
For model 2, with $\gamma$ estimated by the sample correlation $\hat{\gamma}_n$, a natural forecast is $\tilde{Y}_{n+1} = \hat{\gamma}_n Y_n$ and then the MSEP is
$$ E\left(\tilde{Y}_{n+1} - Y_{n+1}\right)^2 = \sigma^2 + E(\hat{\gamma}_n - \gamma)^2 Y_n^2. $$
Figure 4: Comparison of Stability: φF and φS for AR(1)

For comparing the performance of the two models, we can examine the ratio
$$ \frac{\gamma^2 E Y_n^2}{E(\hat{\gamma}_n - \gamma)^2 Y_n^2}. \qquad (2) $$
We take $\sigma^2 = 1$ in the following analysis. From the appendix, based on a heuristic argument, there exists an approximate interval $(0, \frac{1}{\sqrt{n+1}})$ for $\gamma$ on which the true model performs worse than the false one. The actual interval for a finite sample seems even wider, though it is hard to identify analytically. Based on Monte Carlo simulations with 1000 replications for each choice of $\gamma$ with $n = 20, 50, 100$, Figure 6 gives the graphs of the ratio defined above with respect to $\gamma$, which confirm the existence of the aforementioned interval. Under $NMSEP(\cdot; 20)$, the upper left graph plots the risk ratio against $\gamma$, and the other graphs focus on $\gamma$ in the range from 0 to 0.45 with $n = 20$, 50 and 100. From the graphs, it is clear that when $\gamma$ is small, the wrong model performs better; the true model works better when $\gamma$ grows larger. We also see that when $n = 20$, the critical value for $\gamma$ is about 0.25, which is larger than $0.218 = \frac{1}{\sqrt{n+1}}$ with $n = 20$. When $n = 50$, the critical value is closer to the value $0.14 = \frac{1}{\sqrt{n+1}}$ with $n = 50$. The expression provides a very close approximation of the critical value, about 0.1, when $n$ equals 100.
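This critical-value behavior is easy to reproduce with a short Monte Carlo along the lines of the design just described (a sketch of our own, not the authors' code): simulate an AR(1) series, forecast Y_{n+1} by 0 under the white-noise model and by γ̂ Y_n under the AR(1) model, and compare the risks E(m_{n+1} − Ŷ_{n+1})² with m_{n+1} = γ Y_n.

```python
import numpy as np

def false_over_true_risk(gamma, n, n_rep=1000, seed=0):
    """Monte Carlo estimate of ratio (2): E(gamma*Y_n)^2 / E((gamma_hat - gamma) Y_n)^2."""
    rng = np.random.default_rng(seed)
    num = den = 0.0
    for _ in range(n_rep):
        # zero-mean AR(1) series of length n with unit innovation variance, stationary start
        y = np.empty(n)
        y[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - gamma ** 2))
        for t in range(1, n):
            y[t] = gamma * y[t - 1] + rng.normal()
        g_hat = np.sum(y[1:] * y[:-1]) / np.sum(y ** 2)   # sample lag-1 autocorrelation
        m = gamma * y[-1]                                 # conditional mean of Y_{n+1}
        num += m ** 2                     # loss of the white-noise forecast (0)
        den += (m - g_hat * y[-1]) ** 2   # loss of the AR(1) forecast gamma_hat * Y_n
    return num / den                      # values below 1 mean the false model forecasts better

# e.g. false_over_true_risk(0.1, 20) is typically below 1, consistent with Figure 6
```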
4.2 Comparing Testing and AFTER
The existence of such an interval raises an interesting question: should testing procedures pick out the true model, as we usually presume to be the goal of testing, or should they choose the "best" model for the purpose of forecasting? When the auto-correlation coefficient is within this interval, testing procedures are in a dilemma, since the true model is not the same as the "best" model; they lose forecasting efficiency when they choose the true model.
Figure 5: Perturbation Stability of Simulated AR(4) series (n=50)

We conducted a Monte Carlo simulation with 1000 replications and n equal to 30. Figure 7 shows the performance of the testing and combining procedures when the size is 0.01. We also present the results when the size is set to 0.05 and 0.1, in Figure 8 and Figure 9 respectively. When the size is set to 0.01, testing procedures are too wary to reject the null; as a consequence, they excessively favor the null, and it is hard for those procedures to select the right model in this setting. Under ANMSEP(;20,30), the testing procedures performed significantly worse than AFTER. The N test and LB-Q test performed better than AFTER only when γ is approximately less than 0.14, which means that they selected the wrong model, an action they are not supposed to take. Yang and Zou (2002) noticed in their simulation comparing AFTER with the model selection criteria AIC, BIC and HQ that when the average risk ANMSEP is considered instead of NMSEP, the advantage of AFTER is stronger. They reasoned that the smaller sample sizes involved in ANMSEP might account for this difference. Figure 7 shows that the advantage region of AFTER over the test procedures measured by ANMSEP is larger than that measured by NMSEP. However, there is a noticeable difference compared to their case: for the LRT test, the situation is different when γ is close to one. Figures 8 and 9 show a similar pattern when the size is 0.05 or 0.1. That is, the degree to which LRT outperforms AFTER when γ is close to one is greater under ANMSEP than under NMSEP. It seems that, compared to the N test and LB-Q test, the LRT test is more stable. The results also suggest that, as model identification methods, testing and the use of information criteria can be quite different in forecasting performance. Note that the advantage region of AFTER shifts to the left as the test size becomes larger; clearly, this is related to the trade-off between the two types of probabilities of error. So far, for AFTER, we have used equal initial weights for the candidate models.
Figure 6: Comparing True and False models (n = 20, 50, 100)

One can consider other initial weighting schemes that try to take advantage of the data available at the beginning. Figure 10 compares the performance of testing and the AFTER algorithm with such a choice. Following the idea of an encompassing test (e.g., Clements & Hendry, 1998, p. 232), we performed a regression of the true values on the forecasts based on the AR(1) model for the first 10 periods. We assign more weight (0.75) to model 2 if the slope is greater than 0.75; less weight (0.25) to model 2 if the slope is less than 0.25; and equal weight if the slope is between 0.25 and 0.75. This procedure is denoted M-AFTER. Figure 10 compares the performance in terms of ANMSEP(;10,20), with the size of the testing procedures set at 0.05. It shows that the alternative initial weighting improves the performance for large enough γ (roughly γ > 0.25), at the expense of worse performance for smaller γ.
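A sketch of this initial-weighting rule is below; the 10 training periods and the 0.25/0.75 thresholds are as described above, while the use of an ordinary least-squares slope (with intercept) for the encompassing regression is our reading and an assumption.

```python
import numpy as np

def m_after_initial_weights(y_train, ar1_forecasts):
    """Initial AFTER weights for (model 1 = white noise, model 2 = AR(1)) based on a
    regression of the observed values on the AR(1) forecasts over the training periods."""
    x = np.asarray(ar1_forecasts, dtype=float)
    z = np.asarray(y_train, dtype=float)
    slope = np.polyfit(x, z, 1)[0]      # OLS slope of actuals on AR(1) forecasts (assumption)
    if slope > 0.75:
        w2 = 0.75                       # AR(1) forecasts track the data well
    elif slope < 0.25:
        w2 = 0.25                       # AR(1) forecasts add little
    else:
        w2 = 0.5                        # otherwise start from equal weights
    return np.array([1.0 - w2, w2])     # (weight on white noise, weight on AR(1))
```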
4.3 Conditional usage of AFTER
In the previous subsection we found, by simulation, an interval of γ where AFTER is superior to the testing procedures. Naturally, one can then consider applying AFTER when the estimate of γ falls into that region. We programmed such a conditional usage of AFTER in a simulation. First we randomly select a γ in [0.01, 0.91] and simulate a time series of 30 observations; then we calculate $\hat{\gamma}$. We obtain forecasts using AFTER, LB-Q, N and LRT. We also apply a conditional AFTER, which uses the forecast of AFTER if $\hat{\gamma}$ falls into the interval obtained in the earlier simulation (here we use a slightly shrunken approximation, [0.15, 0.45]), on which AFTER was shown to be superior to the testing procedures; otherwise, the forecast of LRT is used. This simulation is repeated 10000 times, and the size of the tests is 0.05. Table 1 gives the ANMSEP risk measure of the testing procedures and of the AFTER and conditional AFTER algorithms when $n_0$ is 20 and 30 respectively. The numbers in parentheses are the corresponding standard errors. The table shows that the conditional AFTER improves on the performance of AFTER.
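The conditional rule itself is just a switch on γ̂; a sketch follows, with after_forecast and lrt_forecast standing in for the two forecasting procedures (they are placeholders, not spelled out here).

```python
import numpy as np

def conditional_after(y, after_forecast, lrt_forecast, lo=0.15, hi=0.45):
    """Use the AFTER forecast when the estimated lag-1 autocorrelation falls in the
    region where AFTER was found advantageous; otherwise use the LRT forecast."""
    yc = np.asarray(y, dtype=float) - np.mean(y)
    g_hat = np.sum(yc[1:] * yc[:-1]) / np.sum(yc ** 2)
    return after_forecast(y) if lo <= g_hat <= hi else lrt_forecast(y)
```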
Figure 7: Testing Procedures vs. AFTER (Size=0.01)

           AFTER            LB-Q             N                LRT              Conditional AFTER
n0 = 20    0.0925 (0.0018)  0.1229 (0.0028)  0.1368 (0.0028)  0.0766 (0.0012)  0.0647 (0.0010)
n0 = 30    0.0735 (0.0032)  0.0837 (0.0028)  0.0905 (0.0031)  0.0611 (0.0019)  0.0532 (0.0015)

Table 1: Comparing Testing Procedures to Conditional Combining (I)

Generally speaking, when the best model is quite easy to detect, AFTER does not necessarily have an advantage over testing procedures. Thus the overall performance of AFTER when γ is randomly chosen on the interval [0.01, 0.91] may not be better than that of the testing procedures. However, conditional AFTER performs better, since we take into account the disadvantageous region of γ by using the forecast of LRT when $\hat{\gamma}$ falls in that region.
To illustrate the fact that AFTER is better when γ is in the center of the (0,1) interval, we programmed another simulation. First, γ is randomly selected from a uniform distribution on [0.05, 0.71], and then we simulate a time series of 30 observations. After $\hat{\gamma}$ is calculated, we apply conditional AFTER if the estimate falls in the approximate interval [0.15, 0.45]. This simulation is repeated 10000 times. We obtained Table 2, in the same notation as Table 1, which shows that AFTER is superior to the testing procedures and that the conditional application of AFTER improves on the performance of AFTER.
5 Comparing Testing and AFTER for AR and MA Models
We consider more flexible AR(p) and MA(q) models in this section. Real data sets are also to be used.
Figure 8: Testing Procedures vs. AFTER (Size 0.05)

           AFTER            LB-Q             N                LRT              Conditional AFTER
n0 = 20    0.0681 (0.0009)  0.1136 (0.0015)  0.1229 (0.0016)  0.0814 (0.0011)  0.0644 (0.0009)
n0 = 30    0.0620 (0.0017)  0.0889 (0.0023)  0.0960 (0.0025)  0.0682 (0.0017)  0.0581 (0.0016)

Table 2: Comparing Testing Procedures to Conditional Combining (II)
5.1 AR(p) models
Here the candidate models are AR models with orders up to 5 or 8. We fixed the true error variance at 1 (but this is unknown to the forecasting procedures) and focus on the three likelihood ratio testing procedures stated in section 2. AFTER begins weighting at the 20th observation. The length of the series is 50 and n0 is set to 20 or 50, corresponding to ANMSEP(δ; 20; 50) and NMSEP(δ; 50) respectively. The probabilities of choosing the right order are based on all the decisions that the testing procedures made in the simulation. The results for the four cases are presented in Table 3, based on 1000 replications.
5.1.1 Case 1
The true model is AR(1) with coefficient 0.8. When max p is 8, the probabilities of choosing the right order for LRT1, LRT2 and LRT3 are 0.5785, 0.7674 and 0.9213 respectively. When max p is 5, the probabilities become 0.8109, 0.8320 and 0.9356 respectively. In this case, AFTER has no advantage over the testing procedures, especially compared to LRT3. The reason is that with the coefficient 0.8 it is fairly easy for a testing procedure to pick out the true model.
Figure 9: Testing Procedure vs. AFTER (Size=0.1)

Note that enlarging the list of candidate models had a damaging effect on the testing approaches. The effect was surprisingly severe for LRT1 and LRT2, while LRT3 and AFTER were much less affected.
5.1.2 Case 2
The true model is AR(2) with coefficients (0.5864, -0.15). When max p is 8, the probabilities of choosing the right order for LRT1, LRT2 and LRT3 are 0, 0.0976 and 0.0859 respectively. When max p is 5, the probabilities become 0, 0.0870 and 0.0767 respectively. In this case, AFTER has a significant advantage over the testing procedures (except under the NMSEP measure with maximum order 8 for LRT3). The testing procedures performed very poorly in picking out the right model. Note that including more models again greatly affected the forecasting performance of LRT1 and LRT2.
5.1.3 Case 3
The true model is AR(3) with coefficients (0.7, -0.3, 0.2). When max p is 8, the probabilities of choosing the right order for LRT1, LRT2 and LRT3 are 0, 0.1020 and 0.0116 respectively. When max p is 5, the probabilities become 0, 0.1339 and 0.0153 respectively. Clearly, AFTER has an advantage over the testing procedures. It is interesting to note that although LRT3 chooses the right model much less frequently than LRT2, it performs slightly better than LRT2.
5.1.4 Case 4
The true model is AR(4) with coefficients (0.9, -0.5, 0.2, 0.2). When max p is 8, the probabilities of choosing the right order for LRT1, LRT2 and LRT3 are 0, 0.0235 and 0.003 respectively. When max p is 5, the probabilities are 0, 0.0415 and 0.005 respectively. AFTER again performs better than the likelihood ratio tests in this setting.
Figure 10: Comparison of M-AFTER with AFTER and Testing Procedures
5.1.5 Random Models
The setting here compares testing procedures with AFTER based on randomly chosen AR models with n = 40 and n0 = 20, with 200 replications for each model. The order is generated from a discrete uniform distribution between 1 and 6, and the coefficients are then generated from continuous uniform distributions on [-2, 2] (the case is discarded and the coefficients re-generated if they do not yield stationarity, until the stationarity condition is satisfied for the randomly picked order). Three hundred models are generated in this way. Table 4 compares AFTER with LRT1, LRT2 and LRT3. We consider candidate models from white noise to AR(8). The results clearly show that AFTER performs better than LRT1, LRT2 and LRT3. Note that the standard error associated with LRT3 is much larger than those for LRT1 and LRT2. Further examination shows that LRT3 performed extremely badly for some models, which accounted for the huge variability; most of those models have one or more roots of the characteristic equation close to unity. AFTER has an advantage over LRT3 for 67.7% of all 300 models. The percentage is 98% and 75.3% for LRT1 and LRT2, respectively.
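The random-model design can be sketched as follows; checking stationarity via the roots of the AR characteristic polynomial (all roots outside the unit circle) is a standard criterion and our implementation choice, not necessarily the authors'.

```python
import numpy as np

def random_stationary_ar(max_order=6, rng=None):
    """Draw an AR order uniformly from 1..max_order and coefficients uniformly on [-2, 2],
    re-drawing the coefficients until the resulting AR model is stationary."""
    rng = np.random.default_rng() if rng is None else rng
    p = int(rng.integers(1, max_order + 1))
    while True:
        phi = rng.uniform(-2.0, 2.0, size=p)
        # phi(z) = 1 - phi_1 z - ... - phi_p z^p; stationary iff all roots lie outside the unit circle
        roots = np.roots(np.r_[-phi[::-1], 1.0])
        if np.all(np.abs(roots) > 1.0):
            return phi
```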
5.2 Data examples
We compare the three LRT procedures with AFTER using three data sets. AFTER starts weighting at n0.
                            AFTER            LRT1             LRT2             LRT3
case 1  max p = 8  n0 = 20  0.1198 (0.0041)  0.5971 (0.0137)  0.1955 (0.0112)  0.0732 (0.0032)
                   n0 = 50  0.0636 (0.0088)  0.1004 (0.0151)  0.0780 (0.0073)  0.0319 (0.0031)
        max p = 5  n0 = 20  0.1006 (0.0036)  0.1821 (0.0064)  0.0953 (0.0039)  0.0568 (0.0026)
                   n0 = 50  0.0480 (0.0053)  0.0436 (0.0050)  0.0446 (0.0041)  0.0246 (0.0019)
case 2  max p = 8  n0 = 20  0.1173 (0.0027)  0.3767 (0.0063)  0.2377 (0.0070)  0.1479 (0.0032)
                   n0 = 50  0.0707 (0.0057)  0.2455 (0.0163)  0.1154 (0.0091)  0.0694 (0.0051)
        max p = 5  n0 = 20  0.0921 (0.0023)  0.2627 (0.0041)  0.1539 (0.0040)  0.1276 (0.0031)
                   n0 = 50  0.0545 (0.0040)  0.1256 (0.0091)  0.0720 (0.0052)  0.0621 (0.0047)
case 3  max p = 8  n0 = 20  0.1498 (0.0035)  0.4708 (0.0071)  0.2717 (0.0070)  0.1955 (0.0039)
                   n0 = 50  0.1017 (0.0069)  0.2412 (0.0190)  0.1420 (0.0107)  0.1668 (0.0035)
        max p = 5  n0 = 20  0.1255 (0.0030)  0.2938 (0.0047)  0.1805 (0.0038)  0.1668 (0.0035)
                   n0 = 50  0.0958 (0.0058)  0.1576 (0.0091)  0.1191 (0.0065)  0.1133 (0.0056)
case 4  max p = 8  n0 = 20  0.2493 (0.0045)  0.6874 (0.0080)  0.3834 (0.0070)  0.3510 (0.0046)
                   n0 = 50  0.2130 (0.0120)  0.3597 (0.0193)  0.2717 (0.0142)  0.2554 (0.0125)
        max p = 5  n0 = 20  0.2499 (0.0045)  0.4481 (0.0067)  0.3137 (0.0050)  0.3456 (0.0054)
                   n0 = 50  0.1876 (0.0094)  0.2473 (0.0125)  0.2178 (0.0106)  0.2324 (0.0110)

Table 3: Testing vs. AFTER with AR models
5.2.1 Data Set 1: U.S. Income Data
This data set is the disposable income data from "National Income and Product Accounts Survey of Current Business: Business Statistics" by the U.S. Department of Commerce, Bureau of Economic Analysis. We obtained it via Greene (2000). The data set has 128 observations. We differenced the data to remove the trend and did not find a significant seasonal component. The differenced data appear to be an AR(4) series based on graphical inspection. We fit this data set with AR models with order up to 8, use LRT1, LRT2 and LRT3 to choose a model, and use AFTER to combine those models. The sequential stability κ (L = 15) is 1 for each of LRT1, LRT2 and LRT3. We take n0 = 110. Table 5 shows that AFTER is slightly better than the testing approaches.
                            LRT1            LRT2            LRT3
Risk Ratio to AFTER         2.883 (0.0688)  1.403 (0.0310)  2.968 (0.2974)
Risk Difference with AFTER  0.558 (0.0186)  0.042 (0.0074)  1.178 (0.3062)

Table 4: Testing vs. AFTER with Random AR models
           AFTER    LRT1     LRT2     LRT3
max p = 8  93.680   124.349  114.099  96.634
max p = 5  89.220   96.634   96.634   96.634
max p = 3  84.108   96.634   96.634   88.946

Table 5: ASEP(110, 127) for the U.S. Income Data
5.2.2 Data Set 2: U.S. Agriculture Production Data
This data set contains yearly data on U.S. agricultural output from 1960 to 1986. The source is the Council of Economic Advisors, Economic Report of the President, U.S. Government Printing Office. This was also obtained from Greene (2000). The data set has 27 observations. To get a stationary series, we applied a logarithm transformation and differenced the series to remove the trend. Graphical inspection suggested an AR(2) model. We fit AR models with order up to 8. In this setting, the sequential stability κ (L = 5) is 0.4, 0.4 and 1 for LRT1, LRT2 and LRT3 respectively. Table 6 shows that the testing procedures perform worse than AFTER. Note that AFTER has the smallest risk increment when max p increases from 3 to 8, which might suggest that AFTER is preferable when the true order is uncertain and more candidate models need to be considered.
           AFTER   LRT1    LRT2    LRT3
max p = 8  0.0298  0.0408  0.0581  0.0326
max p = 5  0.0298  0.0353  0.0406  0.0326
max p = 3  0.0296  0.0326  0.0306  0.0307

Table 6: ASEP(18, 26) for the U.S. Agricultural Output Data
5.2.3 Data Set 3: Gasoline Overshorts Data (MA models)
This data set consists of 57 consecutive daily overshorts from an underground gasoline tank at a filling station in Colorado. It was obtained from Brockwell & Davis (1996), who suggested an MA(1) model for the data. We fit these data with white noise and MA(q) models with order q up to 5 or 8, use the likelihood ratio tests to choose among those models, and combine those models with AFTER. The sequential stability κ (L = 15) is 1 for all three likelihood ratio test criteria on these data when max q = 5; it is 0.27, 0.6 and 1 when max q = 8. Note that when max q increases, which means that there are more candidate models, LRT1 and LRT2 perform worse. The percentages of risk reduction of AFTER over LRT1, LRT2 and LRT3 are 54%, 36.7% and 41.9% for the case of max q = 8, and 36.4%, 36% and 41.7% for the case of max q = 5.
           AFTER    LRT1     LRT2     LRT3
max q = 8  230.416  501.873  363.989  396.697
max q = 5  231.217  363.503  361.431  396.697

Table 7: ASEP(40, 57) measure based on the Overshort Data
6 Conclusions
Testing plays an important role in assessing hypotheses. However, it may not be the best choice for forecasting when multiple models are present. The reason is that a testing procedure may excessively favor the null when it controls the probability of Type I error and, furthermore, testing procedures can be very unstable, which casts doubt on their ability to pick out the "true" model as intended. We analyzed a simple situation in which the true model may not be more efficient than the false model in the sense of forecasting. This puts the testing approach for forecasting into an awkward position. The main contributions of the paper are:

• We studied the instability of testing procedures via several sensible measures. It was shown that the testing approach can be very unstable in terms of selection as well as forecasting.

• We compared the performance of testing procedures and AFTER. We found that AFTER performs better in most of the cases in our study in which the testing procedures are unstable.

• The comparisons of the three LRT testing procedures on AR models are also informative. LRT1 and LRT2 are much more sensitive to the choice of candidate models. Our simulations suggest that LRT3 should be preferred unless one or more roots of the characteristic equation of the true AR model are close to unity. In our simulation, LRT1 is inferior for forecasting.

The study leaves several questions open. Should a preliminary analysis be performed to eliminate bad models before combining? If so, how should it be done? How can the instability measures be used quantitatively to decide whether the instability due to model selection is too large? Comparison of testing procedures and combining in time series models with explanatory variables is also of interest.
7 Appendix
Advantageous Region of the False Model: In section 4.1, we were trying to find out when the false model is better for the purpose of forecasting. By Bartlett's formula (Brockwell & Davis, 1996, p. 59), for an AR(1) series the estimate $\hat{\gamma}$ is approximately distributed for large $n$ as $N(\gamma, n^{-1}w)$, where
$$ w = (\gamma^{-1} - \gamma)^2 \frac{\gamma^2}{1 - \gamma^2}. \qquad (3) $$
Heuristically, it seems that when $n$ is very large, $\hat{\gamma}$ is asymptotically independent of $Y_n$ (formally speaking, this needs a more rigorous statement and proof, but that is beyond this work). Then we expect the ratio in (2) to be close to
$$ \frac{\gamma^2 E Y_n^2}{E(\hat{\gamma} - \gamma)^2 Y_n^2} \approx \frac{\gamma^2}{E(\hat{\gamma} - \gamma)^2} \approx \frac{\gamma^2}{\mathrm{var}(\hat{\gamma})} \approx \frac{\gamma^2}{n^{-1} w}. \qquad (4) $$
Therefore we can calculate (4) approximately as
$$ \frac{\gamma^2}{n^{-1}(\gamma^{-1} - \gamma)^2 \frac{\gamma^2}{1 - \gamma^2}} = \frac{n\gamma^2}{1 - \gamma^2}. \qquad (5) $$
Setting the ratio in (5) equal to unity, we get an approximate boundary for $\gamma$:
$$ \gamma^* = \frac{1}{\sqrt{n+1}}. \qquad (6) $$
Furthermore, it is easy to see that the ratio in (5) is monotonically increasing in $\gamma$. Therefore, based on this approximation, there exists an interval $[0, \frac{1}{\sqrt{n+1}})$ for $\gamma$ on which the true model performs worse than the false one.

Proof of Proposition 1: Since obviously $\kappa \le 1$, we only need to work in the lower bound direction. Let $k^*$ denote the true model. Then
$$ \kappa \ge \frac{\sum_{j=n-L}^{n-1} I\{\hat{k}_j = k^*\}\, I\{\hat{k}_n = k^*\}}{L} = \begin{cases} \frac{\sum_{j=n-L}^{n-1} I\{\hat{k}_j = k^*\}}{L} & \text{when } \hat{k}_n = k^*, \\ 0 & \text{when } \hat{k}_n \ne k^*. \end{cases} $$
It follows that for any $0 < \varepsilon < 1$,
$$ P(\kappa \le 1 - \varepsilon) \le P(\hat{k}_n \ne k^*) + P\!\left( \frac{\sum_{j=n-L}^{n-1} I\{\hat{k}_j = k^*\}}{L} \le 1 - \varepsilon \right) = P(\hat{k}_n \ne k^*) + P\!\left( \frac{\sum_{j=n-L}^{n-1} I\{\hat{k}_j \ne k^*\}}{L} \ge \varepsilon \right) \le P(\hat{k}_n \ne k^*) + \frac{\sum_{j=n-L}^{n-1} P(\hat{k}_j \ne k^*)}{L\varepsilon}, $$
where the last inequality follows from the Markov inequality. Under the consistency assumption on the model selection criterion, we have $P(\hat{k}_j \ne k^*) \to 0$ as $j \to \infty$. It can then easily be shown that $\frac{1}{L}\sum_{j=n-L}^{n-1} P(\hat{k}_j \ne k^*) \to 0$ for any choice of $1 \le L \le n - 1$. The conclusion of Proposition 1 follows.
8 Acknowledgments
The work of the second author was supported by the United States National Science Foundation CAREER Award Grant DMS-00-94323.
References

[1] Akaike H. 1973. Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, pp. 267-281, eds. B.N. Petrov and F. Csaki, Akademia Kiado, Budapest.

[2] Andrews DW, Ploberger W. 1996. Testing for Serial Correlation Against an ARMA(1,1) Process. Journal of the American Statistical Association, 91(435), pp. 1331-1342.

[3] Bauer P, Pötscher BM, Hackl P. 1988. Model Selection by Multiple Test Procedures. Statistics, 19, pp. 39-44.

[4] Box G, Pierce D. 1970. Distribution of Residual Auto-correlations in Autoregressive Moving Average Time Series Models. Journal of the American Statistical Association, 65, pp. 1509-1526.

[5] Breusch T. 1978. Testing for Auto-correlation in Dynamic Linear Models. Australian Economic Papers, 17, pp. 334-355.

[6] Brockwell PJ, Davis RA. 1996. Introduction to Time Series and Forecasting. Springer-Verlag.

[7] Chatfield C. 2001. Time-series Forecasting. Chapman & Hall, New York.

[8] Clemen RT. 1989. Combining forecasts: a review and annotated bibliography. International Journal of Forecasting, 5, pp. 559-583.

[9] Clements MP, Hendry DF. 1998. Forecasting Economic Time Series. Cambridge University Press.

[10] Durbin J, Watson G. 1971. Testing for Serial Correlation in Least Squares Regression III. Biometrika, 58, pp. 1-42.

[11] Granger CWJ, King M, White H. 1995. Comments on testing economic theories and the use of model selection criteria. Journal of Econometrics, 67, pp. 173-187.

[12] Greene WH. 2000. Econometric Analysis, 4th edition. Prentice Hall.

[13] Hall AD, McAleer M. 1989. A Monte Carlo Study of Some Tests of Model Adequacy in Time Series Analysis. Journal of Business & Economic Statistics, 7(1), pp. 95-106.

[14] Liang KY. 1995. A Critical Evaluation of Quarterly Macroeconomic Forecasting in Taiwan. Taiwan Economic Review, 23(1), pp. 43-82.

[15] Ljung GM, Box GEP. 1979. The Likelihood Function of Stationary Autoregressive-Moving Average Models. Biometrika, 66(2), pp. 265-270.

[16] McAleer M, McKenzie CR, Hall AD. 1988. Testing separate time series models. Journal of Time Series Analysis, 9, pp. 169-189.

[17] Pötscher BM. 1991. Effects of model selection on inference. Econometric Theory, 7, pp. 163-185.

[18] Whittle P. 1952. Tests of Fit in Time Series. Biometrika, 39, pp. 309-318.

[19] Yang Y. 2001a. Adaptive regression by mixing. Journal of the American Statistical Association, 96, pp. 574-588.

[20] Yang Y. 2001b. Combining forecasting procedures: some theoretical results. Revised for Econometric Theory.

[21] Yang Y, Zou H. 2002. Combining time series models for forecasting. Accepted by International Journal of Forecasting.