Flash Estimates for Short Term Indicators - Data cleaning with X12 Arima


1 Flash Estimates for Short Term Indicators - Data cleaning with X12 Arima
Markus Froehlich, Alexander Kowarik, Statistics Austria
Work Session on Statistical Data Editing, Budapest, September 2015
Froehlich, Kowarik (Statistics Austria) 1 / 25 Budapest, 2015

2 Short term indicators - industry and production
Short term statistics:
- Monthly survey
- About 12,000 enterprises with reporting obligation
- Very detailed questionnaire asking for many variables (turnover, new orders, production, number of employees, salaries, ...)
- Survey data published 85 days after the end of the reference period
- Basis for short term indicators (index of production, turnover index, etc.)
- Data cleaning/editing done by subject matter experts; various administrative sources available

3 Short term indicators - industry and production
Short term indicators:
- Indicators published 55 days after the end of the reference period (T+55)
- Data cleaning as for the survey, only with shorter time limits
- Missing data (10-20 per cent) are replaced using imputation methods
- Data are revised with every new data publication
- Shorter publication deadlines are intended for the near future (T+45 and T+30?)

4 Short term indicators - industry and production
Publishing at T+30 (30 days after the end of the reference period):
- Since 2013, preliminary short term indicators have been published at T+30
- For selected indices (turnover index, index of employees, index of hours worked, production index) at a higher aggregation level
- Flash estimates are calculated with a multivariate time series model based on early respondents
- Major issue: data editing, outlier adjustment

5 Short term indicators - industry and production
Publishing at T+30, ... continued:
- No routine data cleaning procedure is possible because of the short time limits
- Outlier identification and replacement has to be done (almost) fully automatically
- Moreover, missing values in past periods have to be imputed because of the time-series approach to estimation (the reporting obligation is linked to reporting thresholds, which causes yearly gaps in the time series)
- Seasonality in the time series should not be destroyed

6 Response for short term survey
[Figure: Registered data, reporting time for big units (size proportional to turnover), January to June 2012. Legend: Share (from, to); Arrived after... days; T+30, T+55, T+85.]

7 X12-Arima
- State-of-the-art seasonal adjustment program, developed by the US Bureau of the Census (open source)
- Offers a sophisticated model identification and preadjustment procedure (outliers, trading days, etc.) based on TRAMO
- The program handles different types of outliers, such as additive outliers, level shifts, ramps, transitory changes and seasonal outliers
- Outlier procedure: stepwise regression approach based on Chang and Tiao
- Provides replacement values for missing data, forecasts and backcasts with corresponding standard errors

8 The R-package x12
- Access to X12-ARIMA directly from within R (no spc, out, ... files)
- Class-oriented command line interface
- Change tracking for the X12-ARIMA parameters and output
- Batch processing of multiple time series at once (in parallel)
- Easy generation of graphical output
- Import of parameter settings from spc files into R

9 Outlier adjustment in X12 - program limits
- Minimum number of observations for X12: 36 observations for monthly series (3 years), 16 observations for quarterly series (4 years)
- Restrictions on missing values: no month (quarter) may have missing values in all years
- Warning if more than half of all values for a month (quarter) are missing (in this case other methods for missing data replacement are suggested before running X12)
- Number of regressors limited to 80
- Maximum length of the input series is 600 observations
- The sensitivity of the outlier adjustment in X12-Arima can be set by varying the critical t-value (the default values in X12 depend on the series length, e.g. t = 3.55 for 36 observations, t = 4.07 for 360 observations)
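
The length and missing-value limits listed on this slide can be checked before a series is handed to X12. The following Python sketch is only an illustration: the function name `check_x12_limits` and its messages are hypothetical, missing values are represented as `None`, the series length is assumed to count missing entries, and calendar positions are counted from the first observation.

```python
from typing import List, Optional, Sequence


def check_x12_limits(values: Sequence[Optional[float]], period: int = 12) -> List[str]:
    """Return a list of problems; an empty list means these checks passed.

    Hypothetical helper, not part of X12 or the R-package x12.
    """
    if period not in (12, 4):
        raise ValueError("period must be 12 (monthly) or 4 (quarterly)")
    min_obs = 36 if period == 12 else 16  # limits quoted on the slide
    n = len(values)
    problems: List[str] = []
    if n < min_obs:
        problems.append(f"too short: {n} observations, at least {min_obs} required")
    if n > 600:
        problems.append(f"too long: {n} observations, at most 600 allowed")
    # Look at each month (quarter) across all years.
    for pos in range(period):
        same_period = values[pos::period]
        if not same_period:
            continue
        missing = sum(v is None for v in same_period)
        if missing == len(same_period):
            problems.append(f"position {pos + 1}: all values missing")
        elif missing > len(same_period) / 2:
            problems.append(f"position {pos + 1}: warning, more than half missing")
    return problems
```

A series that returns a non-empty list would, in the spirit of the slides, be routed to a different cleaning method instead of X12.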

10 Outlier adjustment in X12
- X12 is a procedure intended for seasonal series; however, series that do not exhibit seasonality can also be treated with the program
- Interpolation of missing values, backcasting and forecasting are done with the identified (or predefined) Arima model
- Regression variables can be included, particularly trading day and holiday effects (± 2 days in a month, compared with the same month a year earlier, could cause a difference of ± 6-10 percentage points in the series)
- Systematic overestimation of missing values in the first periods after a unit drops out of the survey (and re-enters it in a later period)
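
The size of the trading day effect quoted above can be made plausible with simple arithmetic. The sketch below rests on a simplifying assumption that is not stated on the slide: the monthly value is taken to be proportional to the number of days worked. The function name `working_day_effect` is hypothetical.

```python
def working_day_effect(days_this_year: int, days_last_year: int) -> float:
    """Year-on-year change in per cent caused purely by the number of days.

    Assumes output is proportional to the number of days worked
    (a simplification made for this illustration only).
    """
    return 100.0 * (days_this_year - days_last_year) / days_last_year


# Two additional days on top of 20 to 30 days in the same month a year earlier.
for base in (20, 22, 25, 30):
    print(base, round(working_day_effect(base + 2, base), 1))
```

Under this assumption, two extra days on a base of 20 to 30 days give a change of roughly 6.7 to 10 per cent, which is of the same order as the range mentioned on the slide.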

11 Problematic series
Any incorrect values?
[Figure: example series with missing values]

12 Outlier adjustment for short term indicators
- Outlier adjustment was performed for each variable separately
- Time-series objects were constructed for all units obliged to report for the reference period (based on the units obliged to report for the latest month before the reference month)
- Different variables had to be treated differently (e.g. turnover vs. number of employees)
- The total was divided into:
  (a) Long series (more than 35 observations)
  (b) Series of first-time reporting units: units sampled for the first time in the current year (at most 12 observations)
  (c) Rest: series with 1 to 35 observations not in the first two categories
- Units that are no longer sampled remain relevant for the construction of index numbers
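
The split into the three classes can be written down as a small decision rule. This is a minimal Python sketch of the rule as described on the slide; the function name `classify_series` and the boolean flag `first_time_reporter` are hypothetical, and how first-time reporters are actually identified in the survey data is not covered here.

```python
def classify_series(n_obs: int, first_time_reporter: bool) -> str:
    """Assign a series to class 'a', 'b' or 'c' as described on the slide.

    n_obs: number of observations in the series.
    first_time_reporter: True if the unit was sampled for the first time
    in the current year (assumed to be known from the survey frame).
    """
    if n_obs < 1:
        raise ValueError("a series needs at least one observation")
    if n_obs > 35:
        return "a"  # long series, candidate for X12-Arima
    if first_time_reporter and n_obs <= 12:
        return "b"  # first-time reporting unit
    return "c"      # rest: 1 to 35 observations


print(classify_series(60, False), classify_series(5, True), classify_series(20, False))
```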

13 Performance of automatic data cleaning
Class (a): series with at least 3 years of observations, treated with X12-Arima (R-package x12)
- Total of 8,245 series out of 9,840 (for the variable turnover)
- The R-package x12 offers parallel processing; the adjustment was therefore run on 14 cores
- Processing time for all 8,245 series (per variable): 3-5 minutes
- One or more outliers were identified in 124 series (turnover) and 49 series (number of employees, with different settings)

14 Performance of automatic data cleaning
Class (a): ... continued
- 123 (205) series could not be processed, mostly because of too many missing values (or because of low variability in the case of number of employees); these series were transferred to class (c)
- Output: time series with filled gaps (missing values interpolated), outliers removed and replaced, and forecasted values
- Caution: negative values were created, which were replaced by zeros
- Zero values in the survey for the very last observation could indicate missing values
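
The two cautions on this slide amount to simple pre- and post-processing steps. The Python sketch below uses hypothetical helper names; treating a trailing zero as missing is shown as an option only, since the slide says such zeros could indicate missing values, not that they always do.

```python
from typing import List, Optional


def clip_negatives(values: List[float]) -> List[float]:
    """Replace negative values created by the model with zeros."""
    return [max(0.0, v) for v in values]


def trailing_zero_to_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Mark a zero in the very last period as missing (None).

    Whether a reported zero is really a missing value is a judgement call;
    this helper only illustrates the option.
    """
    if values and values[-1] == 0:
        return values[:-1] + [None]
    return list(values)


print(clip_negatives([120.0, -3.5, 98.0]))
print(trailing_zero_to_missing([120.0, 115.0, 0.0]))
```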

15 Automatic data-cleaning in X12 - Settings
Time-series settings selected for X12:
- Automatic transformation of the time series (additive or multiplicative model) was chosen
- Automatic model identification was disabled because of processing time; the Airline model was selected for all series
- The sensitivity of the outlier adjustment for short term indicators should be low, to avoid (as far as possible) eliminating unusual but correct data; the t-value for outlier identification was therefore set to 13 (20 for the variable number of employees)
- Could allowing for more outliers be beneficial for time-series forecasts, because of smoother time series?

16 Performance of automatic data cleaning
Time-series settings for X12: ... continued
- Outlier identification could be limited to the very last observation only (because historic data have already been edited carefully). In our case the whole length of each series was checked for outliers.
- Only additive outliers were allowed. Outliers extending over a longer period of time (level shifts, ramps, etc.) were not detected, as they were not considered plausible.
- Caution: in the case of a reorganisation of enterprises, a level shift in one series could induce a level shift (in the opposite direction) in another series
- No trading-day regressor included yet

17 Examples in X12
[Figure: four panels, each titled "Outlier Series"]

18 Automatic data cleaning - Alternatives
Selected alternatives for outlier identification, estimation of missing values and forecasting:
- TError (Tramo for Errors): Tramo-based program able to handle thousands of series in batch processing; very similar to the X12-Arima pre-treatment, which is based on Tramo. Not embedded in R yet.
- The R-package tsoutliers (Javier Lopez-de-Lacalle) offers detection of all kinds of outliers and removal of outliers, as well as creating and including trading day regressors. Theoretical connection to Tramo?
- The R-package forecast (Rob J Hyndman et al.) offers many functionalities, including identification of outliers, interpolation of missing values and forecasts for time series. Treatment of shorter series is possible.
- The R-package zoo (Achim Zeileis et al.) offers different functions for interpolating missing values.

19 Comparison: performance of X12 versus tsclean
[Figure: panels "Outlier Series", "Data cleaning with X12" and "Data cleaning with tsclean", shown for two example series]

20 Performance of automatic data cleaning
Class (b): series of first-time reporters (at most 12 observations)
- Total of 384 series out of 9,840 (for the variable turnover)
- Treated separately because of a very specific pattern (the series start low, often with zeros, and reach a lasting level only after a few periods)
- Particular attention to the last two observations (which are not considered final in the survey data)

21 Performance of automatic data cleaning
Class (b): ... continued
- Heuristic procedure for identifying outliers, based essentially on medians and the IQR
- Series with one or two observations are also treated (one observation: highest absolute value plausible (reporting threshold))
- Special attention is paid to so-called euro-reporters
- Forecasts are generated with HoltWinters; this is especially important if the last two observations are outliers
- Visual inspection is advisable: most cases are negligible because of low absolute values, so attention is paid to the cases with the highest values (euro-reporters and restructured units)
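
The slides do not spell out the heuristic beyond saying that it depends on medians and the IQR. The Python sketch below therefore shows only one common rule of this kind, not the authors' procedure: a value is flagged if its distance from the median exceeds a multiple of the IQR. The function name, the multiplier `k = 3.0` and the decision not to flag anything in series with fewer than four observations are all assumptions made for this illustration.

```python
import statistics
from typing import List


def flag_outliers_median_iqr(values: List[float], k: float = 3.0) -> List[bool]:
    """Flag values whose distance from the median exceeds k times the IQR.

    Illustrative rule only; k and the minimum length of 4 are assumptions.
    """
    if len(values) < 4:
        return [False] * len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [False] * len(values)
    med = statistics.median(values)
    return [abs(v - med) > k * iqr for v in values]


turnover = [100.0, 110.0, 105.0, 95.0, 102.0, 900.0]
print(flag_outliers_median_iqr(turnover))
```

Because median and IQR are robust statistics, a single extreme value such as the last one in the example does not distort the limits used to detect it.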

22 Data cleaning for new reporting units
[Figure: four panels titled "New reporting units", monthly values from January to June]

23 Performance of automatic data cleaning
Class (c): series with 1 to 35 observations not in the first two categories
- Total of 1,... series out of 9,840 (for the variable turnover)
- The tsclean function (forecast package) was used for data cleaning (forecasts are generated if future periods are set to missing (NA))
- Residuals are identified by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data
- Residuals are labelled as outliers if they lie outside the range $\pm 2(q_{0.9} - q_{0.1})$, where $q_p$ is the $p$-quantile of the residuals (Rob J Hyndman)
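
The quantile rule quoted on this slide can be illustrated in a few lines. The Python sketch below is a literal reading of the rule as stated here, applied to residuals that are assumed to be already available (the loess or STL step is not reproduced); it is not the code of the forecast package, whose implementation may differ in detail. The names `quantile` and `flag_residual_outliers` are hypothetical, and the quantile uses linear interpolation between order statistics.

```python
from typing import List


def quantile(sorted_vals: List[float], p: float) -> float:
    """p-quantile with linear interpolation between order statistics."""
    h = (len(sorted_vals) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (h - lo) * (sorted_vals[hi] - sorted_vals[lo])


def flag_residual_outliers(residuals: List[float], p_low: float = 0.1,
                           p_high: float = 0.9, factor: float = 2.0) -> List[bool]:
    """Flag residuals outside +/- factor * (q_high - q_low)."""
    if not residuals:
        return []
    s = sorted(residuals)
    width = factor * (quantile(s, p_high) - quantile(s, p_low))
    return [abs(r) > width for r in residuals]


residuals = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.2, -0.2, 0.1, -0.1, 10.0]
print(flag_residual_outliers(residuals))
```

The quantile probabilities are parameters, so a wider pair of quantiles gives wider limits and flags fewer residuals.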

24 Performance of automatic data cleaning
Class (c): ... continued
- To change these limits, the tsclean function has to be adapted directly; in this case we changed the limits to $\pm 2(q_{0.95} - q_{0.05})$ in order to concentrate on outliers that are really unusual
- To estimate missing values, linear interpolation is used for non-seasonal series, and a periodic STL decomposition is used for seasonal series (Rob J Hyndman)
- Special attention should be paid to short series with fewer than 24 observations (extrapolation of trends)
- 14 series (out of 1,334) could not be processed (6 of these 14 contain only zeros)
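
For the non-seasonal case, the linear interpolation of missing values mentioned above can be sketched as follows. This is a plain Python illustration of the technique, not the code used in tsclean; the function name `interpolate_linear` is hypothetical, missing values are represented as `None`, and gaps at the start or end of the series are deliberately left untouched because they would require extrapolation.

```python
from typing import List, Optional


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) on the straight line between known neighbours.

    Leading and trailing gaps stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


print(interpolate_linear([10.0, None, None, 16.0, None]))
```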

25 Performance of tsclean
[Figure: four panels, each titled "Data cleaning with tsclean"]