
DRAFT FOR DISCUSSION AND REVIEW, NOT TO BE CITED OR QUOTED

USING MISSPECIFICATION TESTS ON WITHIN-SAMPLE MODELS AND CONSIDERING THE TRADE-OFF BETWEEN LEVEL OF SIGNIFICANCE AND POWER TO PROMOTE WELL-CALIBRATED, POST-SAMPLE PREDICTION INTERVALS

Bernard J. Morzuch, Department of Resource Economics, University of Massachusetts, Amherst, MA 01003
P. Geoffrey Allen, Department of Resource Economics, University of Massachusetts, Amherst, MA 01003

Abstract

We applied the naïve no-change method to the 3003 series in the M3-Competition. This method is optimal for the random walk model. Based on within-sample misspecification tests, a series was placed in one of two groups: either it passed our entire battery of four tests or it failed one or more tests. Distributions of ex post forecasts from groups that passed all tests were better calibrated than distributions from the other groups. To improve the delineation between the two groups, we developed overall significance levels to reflect the joint nature of the tests rather than treat the tests independently. We analyze the change in calibration of the distributions as a result of paying attention to the joint nature of the tests. In addition, recognizing that increasing the level of significance increases the power of the test, we analyze the change in calibration of the forecast distributions when manipulating the significance level.

Keywords: Prediction intervals, calibration, significance, power

Two Objectives

Will within-sample misspecification tests on the residuals from a model detect whether or not the model is any good for post-sample forecasting? Suppose we become more stringent about accepting our model by increasing the level of significance, which also increases the power of the test. Will such a model lead to even better-calibrated, post-sample forecasts?

Model Chosen For The Study

We chose the random walk (naïve no-change), one of the simplest models, since the purpose is not to engage in model selection but to determine whether a model selected as appropriate for a series on the basis of within-sample misspecification tests is able to produce good forecasts.
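As a concrete illustration (our sketch in Python, not code from the study), the naïve no-change forecast simply repeats the last within-sample observation at every horizon; under a random walk data generating process this is the optimal (minimum mean-squared-error) point forecast.

```python
import numpy as np

def naive_no_change_forecast(y, horizon):
    """Repeat the last observed value for the next `horizon` periods
    (the optimal point forecast when the series is a random walk)."""
    y = np.asarray(y, dtype=float)
    return np.full(horizon, y[-1])

# Example: six-step-ahead forecasts for a short annual series.
series = [102.0, 101.5, 103.2, 104.0, 103.6]
print(naive_no_change_forecast(series, 6))   # [103.6 103.6 103.6 103.6 103.6 103.6]
```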

In forecast competitions, the naïve no-change method has performed well against more complex models. But the random walk model is not expected to forecast well for seasonal series.

Hypotheses

We are interested in the following hypotheses.
Null: Our within-sample model is appropriate.
Alternative: Our within-sample model is not appropriate.

Because of lack of success in ex post forecasting, the current and conventional wisdom appears to be that all forecasting methods perform equally badly. The corollary is not even to bother with within-sample testing. Fildes and Makridakis (1995), echoing Chatfield (1993), call for further research to overcome the inadequacies of selection methods based on within-sample fit and the notion of a single true model. Within-sample misspecification testing appears to have been overlooked as a selection method. Using it raises a number of issues to which we now turn.

There are at least three issues that need to be addressed in order to use within-sample testing as a selection method. They are: choosing the appropriate size versus power tradeoff, establishing the actual size of the test battery, and determining the size distortions when test assumptions are violated. There are also the larger questions of which tests and how many of them should be used.

In deciding the size-power tradeoff we assume, in the present case, that the random walk is the DGP. What we can control is the size, or level of significance, of the battery of misspecification tests, and if we make this too high, we will reject many series for which the naïve no-change method is suitable (i.e., series assumed to be generated by a random walk). At the same time, the overall power of the battery of tests should be reasonably high. That is, we want to exclude a large proportion of the series for which the naïve no-change method is unsuitable from the group of series on which we plan to use the method. Equivalently, the probability of making a Type II error (failing to reject the null hypothesis when it is false) should be small. For a given size and DGP, power increases with sample size. We could attempt to control power by increasing the level of significance of the tests for series with fewer observations, but we take the more familiar route of keeping the level of significance constant across all series and examining the consequences of using different levels of significance.

McGuirk, Driscoll and Alwang (1993) used a Monte Carlo approach to compare a number of different misspecification-testing strategies. The strategies included various subsets of six individual and four joint misspecification tests. Samples of n = 47 were generated from the true model, which was based on a fitted linear demand equation with four explanatory variables. The variables were fixed in repeated samples, and errors were normally distributed and homoscedastic. Correct (linear) and incorrect (e.g., log-linear) model specifications were estimated. Strategies that included more misspecification tests had better size-power tradeoffs than those using fewer tests. They concluded (p. 154) that in order to achieve decent power, an overall size of 20% to 25% may be necessary. Below 10% there is a steep decline in power. The situation will be even less appealing with smaller sample sizes.

The next issue is how the size of the individual tests relates to the overall size of the battery of tests. The overall test is a finite induced test. For example, we will accept the overall null hypothesis that a residual series is normally and homoscedastically distributed if we can accept (or fail to reject) each null hypothesis, first of normality, then of homoscedasticity, separately. Suppose each separate test is conducted at a typical significance level, e.g., α = .05, on 100 series that are normally distributed and homoscedastic. The expected number of rejections for each test is 5. If each test is independent, it will randomly select from the same 100 series, and the probability of a series being accepted by both tests is (1 − .05)² = .9025; the overall size of the battery of two tests is 1 − .9025 = .0975. This is the calculation for the Sidák critical value. It produces a slightly sharper overall critical region than the more familiar Bonferroni critical value, which is calculated as 1 − .05(2) = .90, giving an overall size of .10.
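A minimal sketch of this arithmetic in Python (the function names are ours): for k independent tests each conducted at individual level α, the Sidák overall size is 1 − (1 − α)^k, while the Bonferroni bound is kα; either formula can also be inverted to choose an individual level that achieves a target overall size.

```python
def sidak_overall_size(alpha, k):
    """Overall size of k independent tests, each run at individual level alpha."""
    return 1.0 - (1.0 - alpha) ** k

def bonferroni_overall_size(alpha, k):
    """Bonferroni upper bound on the overall size of k tests at level alpha."""
    return min(1.0, k * alpha)

# The two-test example above, with each test at alpha = .05:
print(sidak_overall_size(0.05, 2))        # about 0.0975
print(bonferroni_overall_size(0.05, 2))   # 0.10

# Individual levels needed for a 5% overall size with a battery of four tests:
print(1.0 - (1.0 - 0.05) ** 0.25)         # Sidak: about 0.0127
print(0.05 / 4)                           # Bonferroni: 0.0125
```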

The final issue is size distortion. The actual size of the test and its nominal size will coincide if all the required assumptions are met. For example, a test for heteroscedasticity may assume a normal distribution. Also, many tests are correctly sized only in large samples, and their behavior in samples typically found in applied work may be unknown or, at best, have been investigated by Monte Carlo methods in only a limited number of situations. Usually, the actual size of the test is larger than the nominal size (so that the test tends to overreject series that are in fact acceptable under the null hypothesis).

McGuirk, Driscoll and Alwang (1993) provide some information that simultaneously addresses both multiple tests and size distortions. They show, for the model described earlier and for two of the testing strategies they investigated, that both the Bonferroni and Sidák test criteria are reasonable approximations to overall actual size for overall sizes of less than 20%. The overall actual sizes, as discovered by Monte Carlo simulation, should lie entirely within the critical regions calculated by the Bonferroni or Sidák formulas, since the formulas are inequalities. That they do not for a range of nominal sizes of the individual tests shows that there is some size distortion and the tests are rejecting more often than they should. We will adopt the Bonferroni test criterion as a way of dealing simultaneously with both of these issues.

Data

Data used in the study are from the 3003 series of the M3 competition. The complete data set is publicly available. There are 645 yearly data series. After excluding those quarterly and monthly series that display seasonality, based on the existence of non-zero seasonal factors, there are 459 quarterly and 863 monthly nonseasonal series. (We thank Michele Hibon for providing the data containing her computations of seasonal factors.) We could have assumed the existence of a particular form of seasonality and introduced it into the model. To do justice to all seasonal series, we would have needed to add tests to discover the kind of seasonality for each series and, as a minimum, diagnostic tests for parameter constancy. We viewed this as a major project in itself and chose to conduct our analyses on nonseasonal series only. Following the procedures of the M-competitions, the last six observations of the annual series, the last eight of the quarterly series and the last 18 of the monthly series were reserved for post-sample testing.

Method

We begin by applying the random walk model to the within-sample observations of each series. Taking the first difference of each observation in a series results in a set of residuals for the series.
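A sketch of this preparation step in Python (the holdout lengths follow the M-competition convention stated above; the helper name and dictionary are ours): split off the post-sample observations by data frequency and take first differences of the within-sample observations as the residuals of the naïve no-change model.

```python
import numpy as np

HOLDOUT = {"annual": 6, "quarterly": 8, "monthly": 18}   # post-sample observations per frequency

def within_sample_residuals(y, frequency):
    """Split a series into within-sample and post-sample parts and return the
    within-sample first differences, i.e. the residuals of the naive no-change model."""
    y = np.asarray(y, dtype=float)
    h = HOLDOUT[frequency]
    within, post_sample = y[:-h], y[-h:]
    residuals = np.diff(within)              # e_t = y_t - y_{t-1}
    return within, post_sample, residuals
```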

Four misspecification tests are conducted at a given level of significance: (1) for non-zero mean, which distinguishes the model from a random walk with drift, a t-test; (2) for non-normality, the D'Agostino-Pearson K² omnibus test; (3) for (dynamic) heteroscedasticity, the ARCH test; (4) for autocorrelation, the Ljung-Box Q-statistic. We keep those series that pass all within-sample tests and use them for post-sample analysis.

An example of the situation is shown in Table 1, for individual tests conducted at the 5% level of significance. We also looked at the effect of using both seasonal and non-seasonal quarterly and monthly series and found that the number of series that pass all tests is almost the same as shown in Table 1. Almost all the seasonal series fail the autocorrelation test, the expected result when the naïve no-change method is applied to a series with a seasonal pattern.

Table 1: Number of series passing all within-sample diagnostic tests, or all but zero mean, and failing each test for different data types, using first differences of the original data.

Data type | Total number of series | Pass all tests | Fail: zero mean | Fail: normality | Fail: autocorrelation | Fail: heteroscedasticity
Annual
Quarterly nonseasonal
Monthly nonseasonal

Tests (and distribution under the null hypothesis):
Zero mean: sample mean equals zero (t)
Normality: D'Agostino K² statistic based on third and fourth moments about the mean (χ²(2))
Autocorrelation: Ljung-Box Q statistic for autocorrelation (χ²(12))
Heteroscedasticity: Engle's ARCH test for dynamic heteroscedasticity on 4 lags of squared differenced observations (χ²(4))
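The battery can be sketched as follows in Python, using SciPy and a recent version of statsmodels (the helper names and the decision rule of comparing every p-value with the same individual level are ours; the lag choices follow the test descriptions above).

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox, het_arch

def misspecification_pvalues(residuals):
    """p-values of the four within-sample tests applied to the differenced series."""
    r = np.asarray(residuals, dtype=float)
    return {
        "zero mean": stats.ttest_1samp(r, 0.0).pvalue,                         # t-test
        "normality": stats.normaltest(r).pvalue,                               # D'Agostino-Pearson K^2
        "autocorrelation": acorr_ljungbox(r, lags=[12])["lb_pvalue"].iloc[0],  # Ljung-Box Q, 12 lags
        "heteroscedasticity": het_arch(r, nlags=4)[1],                         # Engle's ARCH LM, 4 lags
    }

def passes_battery(residuals, alpha=0.05):
    """Keep a series only if every individual test accepts at level alpha."""
    return all(p > alpha for p in misspecification_pvalues(residuals).values())
```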

Determining Whether Forecast Errors Are Calibrated

1. Forecast errors should follow the theoretical forecast error distribution. This distribution depends on the scale of the series and the forecast horizon. To aggregate over series, the error distribution is scaled to standard deviation units.

2. The theoretical forecast error distribution for the random walk is well known. For each series, calculate the within-sample standard deviation of the error. The standard deviation of the forecast error for h steps ahead is the within-sample standard deviation times √h.

3. Take a forecast and observe whether it calibrates. We obtain only one out-of-sample, h-steps-ahead forecast for each series. To determine whether the misspecification strategy is effective, we need many series. Since we have each series' standard deviation of the forecast error, we can calculate boundary values for equi-probable deciles of the forecast error distribution. We note the decile into which the forecast error falls, for each series, at each forecast horizon. Forecasts for a particular forecast horizon are well-calibrated if 10% of forecast errors over all series of a given frequency of data fall into each decile. (See Figure 1 and the sketch that follows.)

Figure 1: Construction of decile intervals for the forecast error distribution (expected) and illustration of the calibration for a set of time series (actual).
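A sketch of this bookkeeping in Python (ours; it assumes normally distributed innovations, so that the theoretical h-step error is N(0, hσ²), the standard result for a Gaussian random walk): scale each post-sample error by σ√h, find the decile of the theoretical distribution it falls into, and check that roughly 10% of series land in each decile.

```python
import numpy as np
from scipy.stats import norm

DECILE_EDGES = norm.ppf(np.arange(0.1, 1.0, 0.1))   # nine interior decile boundaries of N(0, 1)

def decile_index(actual, forecast, sigma_within, h):
    """Decile (0..9) of the theoretical N(0, h*sigma^2) forecast-error distribution
    into which the realised h-step-ahead error falls."""
    z = (actual - forecast) / (sigma_within * np.sqrt(h))   # error in standard-deviation units
    return int(np.searchsorted(DECILE_EDGES, z))

def calibration_profile(series_results, h):
    """series_results: iterable of (actual, forecast, sigma_within) tuples, one per series.
    Returns the share of series in each decile; well-calibrated forecasts give about 0.10 each."""
    counts = np.zeros(10)
    for actual, forecast, sigma in series_results:
        counts[decile_index(actual, forecast, sigma, h)] += 1
    return counts / counts.sum()
```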

Expected Results

We chose a simple model. It may not be appropriate for a large number of series. We expect to produce a relationship showing the tradeoff between the level of significance and the degree of calibration. As the level of significance increases, the number of retained (i.e., acceptable, or not rejected) series decreases, and the degree of miscalibration decreases. Equivalently, the remaining series are better calibrated. This is the result of more power to reject misspecified series. As the forecast horizon increases, the degree of miscalibration increases. If the series truly has a random-walk data generating process and stays that way, miscalibration should not occur. (See Figure 2.)

Figure 2: Expected relation between level of significance of individual tests and number of retained series, and between level of significance and calibration performance at different forecast horizons (1-step through 6-step; probability plotted against significance level).

Results

As the level of significance increases, the number of retained series declines somewhat exponentially. The misspecification tests show some evidence of being able to discriminate, so that retained series are, in fact, those that are appropriate for the model being used. Measured by degree of miscalibration, there is some evidence for annual series that higher significance levels lead to selection of a set of series that are reasonably well-calibrated. Results are much less satisfactory for quarterly and monthly series.

Figure 3: Relation between level of significance of individual tests and number of retained series, and between level of significance and calibration performance at different forecast horizons (1-step through 6-step; probability plotted against significance level), M3 competition annual series.

Figure 4: Relation between level of significance of individual tests and number of retained series, and between level of significance and calibration performance at different forecast horizons (1-step through 8-step; probability plotted against significance level), M3 competition quarterly non-seasonal series.

Figure 5: Relation between level of significance of individual tests and number of retained series, and between level of significance and calibration performance at different forecast horizons (1-step through 4-step and 5- to 18-step; probability plotted against significance level), M3 competition monthly non-seasonal series.

References

Chatfield, C., 1993, Calculating interval forecasts, Journal of Business and Economic Statistics, 11.

D'Agostino, R.B., A. Belanger and R.B. D'Agostino, Jr., 1990, A suggestion for using powerful and informative tests of normality, American Statistician, 44.

Engle, R.F., 1982, Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation, Econometrica, 50.

Fildes, R. and S. Makridakis, 1995, The impact of empirical accuracy studies on time series analysis and forecasting, International Statistical Review, 63.

Ljung, G.M. and G.E.P. Box, 1978, On a measure of lack of fit in time series models, Biometrika, 65.

McGuirk, A.M., P. Driscoll and J. Alwang, 1993, Misspecification testing: A comprehensive approach, American Journal of Agricultural Economics, 75.