Decision 411: Forecasting Spring 2007 Term 4 Homework Assignment #1

On an afternoon in February 2007, executives of the Weisbud Brewery gathered in the board room, with its glass wall overlooking the brew kettles, to discuss the need for an expansion of the main brewery. Weisbud markets a full line of beers and consistently has roughly an 8% share of the U.S. market for beer sales for home consumption. The brewery was currently running at 95% of capacity. The brewery expansion, if undertaken, will boost production capacity by 50% and will take 2 years to complete.

Augie Weisbud, the CEO, said: "If we hit our production capacity and can't meet demand, we'll lose market share and it will be very hard to get back."

Claudia Stein, the head of marketing, said: "Based on the trends we saw over the last decade, I believe that sales should grow by around 5% per year. We'll probably hit that limit a year from now."

Clyde Wingtip, the head of sales, said: "Well, I'm not so sure. Sales have not always gone up when we expected: look at what happened in [year missing]. There are a lot of factors that affect demand, and there could be more surprises."

Ernst Dinkelacker, the brewmeister, said: "Yeah, remember the big hit we took in 1991!"

Susan Howe, the legal counsel, said: "I doubt that will happen under the current administration, but on the other hand, there's a lot of political pressure to not make alcohol any more affordable or accessible to minors."

Tom Malthus, the consultant, added: "People are going to keep drinking, and more of them every year (it's simple demographics), but let's hope a majority of them continue to choose beer. Vodka is making a comeback, and Sideways is getting people talking about wine again."

Your mission:

Part A: Do some exploratory analysis of the data going all the way back to 1959, and see what you can learn about factors that have affected beer sales.
In particular, answer the following questions:

(i) Are people drinking more or less beer, on average, than they used to? What are some possible explanations for the long-term trend? Do you expect this trend to continue?
(ii) Is beer cheaper or more expensive in relative terms than it used to be?
(iii) Are consumers sensitive to the price of beer?
(iv) How fast is the drinking-age population growing?
(v) How much does the average adult American spend on beer each year for at-home consumption? Does this number surprise you or not? (How much do YOU typically spend on beer in a whole year?)
(vi) What specific events have had big impacts on beer sales at various points during the last 45 years?

(Each of these questions could be addressed by one, or at most two, Powerpoint slides featuring an annotated chart or table.)

Part B: Assuming that Weisbud will continue to hold a constant share of the U.S. market, forecast the percentage increase in sales (in terms of quantity of beer sold) over the next two years. (In other words, generate a 24-month-ahead forecast, and compute the percentage increase of the January 2009 forecast over the last observed value in January 2007.) Report your point forecast and a 50% confidence interval for the percentage increase in demand. In particular, compare (minimally) the following three models:

(i) Random walk (i.e., random walk with the constant box unchecked)
(ii) Random walk with drift
(iii) Linear trend

In addition to comparing different models, some other things to consider are (a) whether it is necessary or desirable to use a data transformation of some kind, (b) how much past data is relevant for the estimation of the model, and (c) whether there are any recent developments that might affect the growth in beer consumption. Your comparison of models should be based not only on their one-month-ahead forecast error statistics (as shown in the model comparison report), but also on whether you feel their underlying assumptions are reasonable in light of what you have learned from your exploratory data analysis. Pay particular attention to what is assumed regarding the estimation of the trend (if any) that is expected over the next several years, because the trend assumption is especially critical for a 24-month-ahead forecast. Your writeup should present your results and explain the reasons behind your choice of model.

How to proceed:

Step 1: Download data from Economagic. Go to Economagic and download (at least) the following 4 monthly time series, all of which have histories back to 1959 or earlier:

A. Beer and ale, at home: Personal Consumption Expenditures by Type of Product: Millions of dollars; quarters and months are SAAR (monthly)
B. Beer and ale, at home: Chain-Type Price Indexes for Personal Consumption Expenditures: 2000=100; quarters and months are SA (monthly)
C. CPI: U.S. city average; All items; =100; NSA
D. Population level, 20 yrs & over; (Thousands): NSA

(Note: in order to use Economagic from off-campus and still be recognized as a Duke subscription user, you must either turn on your Fuqua VPN or else configure your browser to use the Duke proxy server. See the link on the course home page for instructions on how to do this.)

The beer expenditures series only measures at-home consumption, but for the purposes of this assignment you may assume that total beer consumption is proportional to at-home consumption. (Optionally, to see what has happened in the rest of the alcoholic beverage industry over the same period, you may want to look up the corresponding expenditures and price series for wine and distilled spirits.) Also, note that the beer series are seasonally adjusted (SA) while the population and CPI series are not seasonally adjusted (NSA). Normally it would be good to try to use variables that have been similarly transformed, but in this case CPI and population do not have very significant seasonality, and we will not be looking for seasonal patterns in this assignment anyway, so we won't worry about that.

A detailed tutorial on how to use Economagic is given in video file #7 on CD#1, but here are the key steps. After locating a particular series in Economagic, either by browsing or searching, click on its link to see the raw data on the screen. You can get an instant chart of the series by clicking the "GIF chart" button on the menu while the data is displayed. In the options area below the chart, you may want to change the range of dates and then hit the "Make chart" button to redraw it. The charting options also include the capability to superimpose bands indicating past recessions as well as smoothed percentage changes. If you click on the chart, the GIF file will be displayed in its own browser window, and from there you can save it to disk as a separate file if you wish to be able to insert it into Word or Powerpoint documents later.
Now click the "Save series to personal workspace" button to save the chosen series to your personal workspace. After you have saved one or more series to your personal workspace, you can click the "View workspace" button at any time to see the current contents. From there you can generate charts with multiple series as well as save all the series to a single Excel file for importing into Statgraphics.

The fastest way to locate the above series is to click the Search link on the menu at the top of the screen, and then search on keywords such as "beer", "CPI", "population", etc. Here I suggest that you use the Economic Data Series Search window rather than the Google option: it will allow you to use multiple keywords (e.g., "beer" and "consumption" separately, or "population" and "20yrs" separately), and it will also give you a more compact list of the results. Note that there are actually many more series available through the search engine than are listed on the browseable menu. For example, if you make the mistake of searching on the keyword "population" alone, you will get a list of over 6000 time series (including every U.S. county).

Note: there is also an already-deflated beer expenditures series called "Beer and ale, at home: Personal Consumption Expenditures in Millions of Chained 2000 Dollars: Millions of chained 2000 dollars; quarters and months are SAAR (monthly)". This series is logically equivalent to series A (above) divided by series B. However, its history only goes back to 1990, so you should not bother to download it.

Step 2: Clean up the data in Excel. In Excel, open the XLS or CSV file that you created in Economagic. Enter more descriptive column headings where needed, e.g., DATE, YEAR, MONTH, BEER, BEERPRICE, or whatever. Then delete the extraneous comment rows at the top of the file, and also delete (if necessary) any rows of data prior to January 1959. After doing all this, you should end up with column headings (variable names) in row 1, with data from January 1959 onward beginning in row 2. Save the file and CLOSE IT before going to Statgraphics. (If you don't close the Excel file before trying to open it in Statgraphics, you will get an "unable to import file" message.)

Step 3: Import the data into Statgraphics. In Statgraphics, use the File/Open/Open_Data_Source command to open the file you previously saved. (In the Open Data Source dialog box, you will need to check External file to find the file, since it is not yet a Statgraphics file.) After finding and clicking your filename, you should see a dialog box saying Read Excel File. Don't change anything here: just click OK to continue, and the variable names will automatically be read from row 1 of the data file. To make sure that the file was imported successfully, you should immediately take a look at the Statgraphics datasheet by clicking the Databook icon at the left of the screen. Scroll all the way to the bottom of the datasheet to make sure everything looks OK. If all is well, you should then immediately save the data file again as a Statgraphics data file (i.e., an *.SF6 file) by using the File/Save As/Save Data File As command.
If you downloaded the data in XLS format, the DATE column will just look like a bunch of big numbers, which you will probably want to convert to Statgraphics date format. However, before you can make any changes to the data on the datasheet, you will need to change the databook properties to turn off the read-only setting. (When you first import a data file, its default setting is read-only.) Click once on any cell on the datasheet, then click the right mouse button and select Databook Properties, and then un-check the read-only setting for datasheet A, where your data is now stored. Click OK to return to the datasheet, then click the DATE column heading, which will highlight the entire column. Now click the right mouse button, choose Convert from Excel date-time, and choose the Month option. You should now see the dates in month/year format.

For later plotting purposes, you will also find it helpful to create a new time index variable on the datasheet called TIME, using the formula YEAR+(MONTH-1)/12. Highlight an unused column on the datasheet, click the right mouse button, choose Modify Column, then assign the name TIME (or some other new name), click the Formula button at the bottom, and enter the formula YEAR+(MONTH-1)/12, assuming that you already have variables named YEAR and MONTH. (Note: Statgraphics is NOT case-sensitive about variable names. I've just used caps here for emphasis.)
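For readers who want to check the date handling outside Statgraphics, here is a minimal Python sketch of the two calculations described above. It assumes the spreadsheet uses Excel's default 1900 date system, in which (for dates after February 1900) a serial number counts days since 1899-12-30. The function names are illustrative and are not part of Statgraphics or Economagic.

```python
from datetime import date, timedelta

# Assumption: Excel's default 1900 date system. For dates after
# February 1900, the serial number is the day count since 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel serial day number (the 'big numbers') to a date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

def time_index(year, month):
    """Fractional-year index from the handout: YEAR + (MONTH - 1) / 12."""
    return year + (month - 1) / 12

# 21551 is the serial number Excel assigns to 1 January 1959.
first_obs = excel_serial_to_date(21551)
print(first_obs, time_index(first_obs.year, first_obs.month))
```

With this index, January of a year maps to the whole number and July to the half-year mark, which is why plots against TIME get evenly spaced year labels.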

Step 4: Exploratory analysis for Part A of the assignment. Next, draw some plots to get a feel for the overall patterns in beer sales and prices over the last few decades, as well as what has happened recently. Go to the Plots/Scatterplots/Multiple X-Y Plot procedure in order to make some plots showing several variables (e.g., beer sales and the price index or population) at once. This procedure allows you to plot some variables on the left axis and some on the right, if they are measured on very different scales. If you use the TIME variable as the X-axis variable, you will get plots vs. time with the year numbers nicely formatted. If you type an expression such as YEAR>1980 in the Select box on the Data Input panel, you can zoom in on more recent data. You may also wish to draw some X-Y scatterplots in which the X-axis variable is something other than TIME. The following mathematical expressions involving the variables may be of interest:

BEER/BEERPRICE: beer consumption in real terms (i.e., quantity of beer consumed, measured in units of year 2000 beer dollars, i.e., the amount of beer that a dollar would have bought in the year 2000)
BEER/POPULATION: per capita expenditures on beer (SAAR)
BEER/(BEERPRICE*POPULATION): per capita beer consumption in real terms
BEERPRICE/CPI: relative price of beer compared to consumer goods

and so on. Be sure to keep track of the units of the variables so that you can determine the appropriate units of each result. What trends or patterns do you see in the historical record? What explanations can you provide for what you see? (Hint: a few quick searches of Google and/or library resources might be helpful in shedding light on historical events.)

Note: when the Multiple X-Y plot is first displayed, it plots the data with points instead of lines, which is appropriate for cross-sectional data but not for plots of data versus time.
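As a units check on the expressions listed in Step 4, the following Python sketch works through one hypothetical monthly observation. The numbers are made up for illustration and are not Economagic data. It assumes the units stated for the downloaded series: BEER in millions of dollars at an annual rate, BEERPRICE as an index with 2000=100, and POPULATION in thousands of persons. The factor of 100 in the real-terms calculation matches the deflation formula 100*BEER/BEERPRICE given in Step 5, and the factor of 1000 converts (millions of dollars)/(thousands of persons) into dollars per person.

```python
# One hypothetical monthly observation (illustrative values, NOT real data).
beer = 60_000.0         # BEER: millions of dollars, annual rate (SAAR)
beerprice = 120.0       # BEERPRICE: price index, 2000 = 100
cpi = 200.0             # CPI: price index on its own base period
population = 220_000.0  # POPULATION: thousands of persons aged 20 and over

def real_beer(beer, beerprice):
    """Millions of year-2000 beer dollars per year: 100 * BEER / BEERPRICE."""
    return 100.0 * beer / beerprice

def per_capita_dollars(beer, population):
    """Dollars per adult per year: millions of $ / thousands of persons * 1000."""
    return 1000.0 * beer / population

def relative_price(beerprice, cpi):
    """Ratio of two indexes; only its changes over time are meaningful."""
    return beerprice / cpi

print("real consumption (millions of 2000 $/yr):", real_beer(beer, beerprice))
print("nominal $ per adult per year:", per_capita_dollars(beer, population))
print("real $ per adult per year:",
      per_capita_dollars(real_beer(beer, beerprice), population))
print("relative price of beer:", relative_price(beerprice, cpi))
```

Because the beer series are already expressed at annual rates, no multiplication by 12 is needed to answer the per-year spending question in Part A.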
To switch from points to lines, click the right mouse button and choose Pane Options, then specify Lines rather than Points. You may also wish to change the width of some of the lines: click on a line, hit the right mouse button, and choose Graphics Options. You should find yourself on the Lines tab with the appropriate line code selected; just move the slider to make the line thicker. It is especially important to fine-tune the formatting of any graphs that you plan to copy to Powerpoint slides.

Step 5: Data transformations that might be useful as part of a forecasting model. The Describe/Time_Series/Descriptive Methods procedure is handy for experimenting with transformations such as logging and differencing. The default plot options in this procedure are a time series plot and an autocorrelation plot, although other plots are also available. If you click the right mouse button and choose Analysis Options, you get an options menu for data transformations. Try choosing the natural log transformation to see how it affects the trends in the data. When the log transformation is in effect, a linear trend on the time series plot corresponds to an exponential (compound growth) trend in the original series. In fact, the slope of the trend line in the logged series is approximately the average percentage growth per period in the original series (expressed as a fraction, so a slope of 0.004 means about 0.4% per month). You can also display the difference (i.e., period-to-period change) in the series by changing the Nonseasonal order [of differencing] setting from 0 to 1. A (possibly logged) series is a good candidate for a random walk model if a plot of its first difference looks like pure noise with a constant variance over time and no significant autocorrelation.

Note: on the input screen for the Descriptive Methods procedure, you are asked to specify the sampling interval, the start date, and the seasonality (number of periods per season). If you have used all the data from 1959 onward, you should specify that the interval is monthly, that the start date is 1/59, and that the seasonality is 12. (CAUTION: hit the Monthly radio button BEFORE typing 1/59 in the starting date box, otherwise the new starting date will not be processed.) To deflate by the beer price index or CPI, you must divide the input variable by the price index (and also multiply by 100 to get back to units of dollars). For example, you would type 100*BEER/BEERPRICE to deflate by the beer price index, assuming these were the variable names you used. If, on the other hand, you wish to use a log transformation or deflate at a fixed rate, you can do this by using the buttons on the Analysis Options panel for the Descriptive Methods procedure. (To get the Analysis Options panel for the Descriptive Methods procedure, click on the Analysis Summary report for this procedure and then hit the right mouse button. Differencing transformations can also be specified on the Analysis Options panel.)

Step 6: Truncate the data set? After studying the plots, you may decide that you don't need to use the entire 48-year data history. If you want to eliminate some of the earlier data from all the subsequent analysis, the simplest way to do this is to make a truncated version of the data file. First, save the existing data file to capture any changes or additional columns you have added thus far.
Then highlight the rows that you wish to delete, click the right mouse button, and choose the Delete option (exactly as you would do to delete rows from an Excel spreadsheet). Then use the File/Save As/Save Data File As command to save the truncated data file under a new name. (Since the old and new data files use the same variable names, you can always reload the original data file and it will work with any analysis that you may subsequently perform. One of the nice features of Statgraphics is that you can perform the same analysis on different data files by loading a new file with the same variable names but different rows of data.)

Step 7: Comparison of forecasting models. After you have explored the data and perhaps formed some hypotheses about the kinds of transformation (if any) that might be useful, what kind of forecasting models seem plausible, and how much data is relevant, you should go to the Forecast/User-Specified Model procedure. The input screen for this procedure is similar to the input screen in Descriptive Methods: you must specify the variable name (divided by a price index, if necessary), sampling interval, starting date, and seasonality. (Note: if you have truncated the data set, be sure to also change the starting date on this screen so the plots will be labeled correctly.) Leave the Seasonality box blank to disable the seasonal modeling options. Also, here you must specify how many observations at the end of the series are to be held out for validation, and how many forecasts should be extrapolated into the future. For the purposes of this assignment, you should hold out 24 observations for validation and also generate 24 forecasts for the future.

When you first enter the Forecasting procedure, it fits an assortment of 5 default models. When the Seasonality box is left blank the defaults are: random walk with drift, mean, linear trend, simple average of 3 terms, simple exponential smoothing, and linear exponential smoothing. (We will study exponential smoothing models in more detail next week.) This procedure also shows, by default, only the Analysis Summary and Model Comparison text reports, and only the time series plot of the original series (including forecasts) and the autocorrelation plot of the residuals. In addition to these, you should also look at the Forecast Table text report and the plot of residuals versus time, so click the Tabular Options and Graphical Options icons on the Analysis Window Toolbar (the second and third buttons from the left), and turn on the Forecast Table report and the Residual Plots. You should now see three text panes on the left and three graphics panes on the right of your screen. The default residual plot is a residual time sequence plot, but one of the right-mouse-button pane options for this plot is to switch between a time series plot and a normal probability plot. Note: if the panes in your analysis window get out of shape or bunched up at any point, just click the Tables and Graphs icons on the toolbar to redraw them.

To view or change the specification of the default models, click the right mouse button and choose Analysis Options. You will then see the Model Specification panel, which includes the same transformations that were available in Descriptive Methods plus many additional fields for specifying elements of a forecasting model. In the upper left are five radio buttons (labeled A through E) that control the selection of five models that may be compared side-by-side.
By clicking on different radio buttons you can view and/or change the specifications of the different models. Two of the three models you are supposed to analyze in this assignment are already specified as models A and B, respectively: the random-walk-with-drift and linear trend models. You should also analyze a random walk model without drift, which you may wish to define as model C. To do this, click the C button and choose Random walk as the model type, then click OFF the box that is labeled Constant (in the lower right of the Model Specification panel). You can suppress models D and E from the reports, if you wish, by selecting these models and specifying "none" as the model type, or you can leave them in for purposes of comparison, if you are curious about what some of the other models look like. (We will study the other model types in another week or two.)

Once all of your models have been specified, you should look at the Analysis Summary report for each model, plots of the forecasts and residuals of each model, and the Model Comparison report that compares the estimation-period and validation-period statistics of all models. (Most of the reports and graphs refer to the model whose radio button was pushed last. To change from one model to another, just click the right mouse button, choose Analysis Options, hit the radio button for the desired model, and click OK. All the reports and graphs will then be redrawn for the chosen model.)
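To make the three models concrete, here is a rough Python sketch of how their 24-month-ahead forecasts differ when fitted to a logged series. It is not a reproduction of the Statgraphics calculations. It runs on a made-up series, and the interval for the random walk models uses the textbook rule that the forecast standard error grows with the square root of the horizon, ignoring uncertainty in the estimated drift. All function names are illustrative.

```python
import math

Z_50 = 0.6745  # standard normal quantile for a central 50% interval

def fit_random_walk(logs, with_drift):
    """Return (drift, sigma) estimated from one-step changes in the logs."""
    diffs = [b - a for a, b in zip(logs, logs[1:])]
    drift = sum(diffs) / len(diffs) if with_drift else 0.0
    dof = len(diffs) - (1 if with_drift else 0)
    sigma = math.sqrt(sum((d - drift) ** 2 for d in diffs) / dof)
    return drift, sigma

def random_walk_forecast(logs, horizon, with_drift, z=Z_50):
    """(lower, point, upper) forecast of the log series `horizon` steps ahead."""
    drift, sigma = fit_random_walk(logs, with_drift)
    point = logs[-1] + horizon * drift
    half_width = z * sigma * math.sqrt(horizon)  # error grows like sqrt(h)
    return point - half_width, point, point + half_width

def linear_trend_forecast(logs, horizon):
    """Least-squares line through (t, log y), extrapolated `horizon` steps."""
    n = len(logs)
    t_bar = (n - 1) / 2
    y_bar = sum(logs) / n
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(logs))
    slope = sxy / sxx
    return y_bar + slope * (n - 1 + horizon - t_bar)

def pct_increase(log_forecast, last_log):
    """Percentage increase of the forecast over the last observed value."""
    return 100.0 * (math.exp(log_forecast - last_log) - 1.0)

# Made-up monthly series (NOT real beer data): steady growth plus a wiggle.
series = [1000.0 * 1.002 ** t * (1.0 + 0.01 * math.sin(t)) for t in range(120)]
logs = [math.log(y) for y in series]
horizon = 24

for name, with_drift in [("random walk", False), ("random walk with drift", True)]:
    bounds = random_walk_forecast(logs, horizon, with_drift)
    print(name, [round(pct_increase(v, logs[-1]), 2) for v in bounds])
print("linear trend",
      round(pct_increase(linear_trend_forecast(logs, horizon), logs[-1]), 2))
```

The sketch shows why the trend assumption dominates a 24-month-ahead forecast: the random walk without drift predicts no change, the drift model extrapolates the average historical growth from the last observed value, and the linear trend model extrapolates from the fitted line rather than from the last observation.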

What to look for in the results: Your objective is to find the forecasting model which is best for forecasting beer expenditures in the sense of being more accurate and also (hopefully) more intuitively reasonable than the other models. The ideal properties of a best model are the following:

It should have the smallest or nearly the smallest average errors in the estimation period, as measured by RMSE, MAE, and MAPE. (Don't split hairs, though: a 10% reduction in RMSE is probably a significant improvement, but a 2% reduction probably is not, particularly if it comes at the expense of increased model complexity or counterintuitive assumptions.)

It should also have among the smallest errors in the validation period, and ideally the errors in the validation period should be similar in magnitude to those in the estimation period. (This suggests that you have not overfitted the data.) However, remember that the validation period is a fairly small sample, so some sampling variation is to be expected. Also, if there have been unusual events near the end of the series, within the validation period, some models may get lucky in fitting these events even if they are otherwise inferior. So, although the validation statistics are important, you shouldn't automatically assume that the best model is the one with the absolute best validation period statistics. Mainly you are looking for consistency between estimation and validation period statistics for a given model, i.e., not too much increase in the size of a typical error in the validation period, especially as measured by the MAPE statistic, which is not affected by inflationary growth.

The plot of its forecasts (extrapolated into the future) should look intuitively reasonable, i.e., it should agree with your theory about where the series is headed. (This is very important.)
The plot of the residuals (errors) should look like stationary white noise, i.e., no trend, no significant increase in variance from beginning to end, no horrendous outliers, and no significant autocorrelations at any lags. The residuals ideally should pass most or all of the tests for randomness and goodness of fit (these are summarized in the Model Comparison report and more details are provided in the Residual Diagnostics report). However, don't be concerned if you don't get perfect test results on this assignment. The objective of this assignment is to become familiar with several of the simplest forecasting models, not to generate a "perfect" forecast for beer sales.

You should be able to explain in plain English how your model works, i.e., what assumptions is it making about patterns and trends in the past data that are expected to recur in the future?

What to hand in: Please submit three files using the HW#1 link on the Course Outline web page: your Statgraphics data file (with sf6 extension), your Statgraphics statfolio file (with sgp extension), and a Powerpoint (or Word) file containing your presentation. The Statgraphics files will provide an audit trail where I can trace the results shown in your presentation, if necessary. Your presentation should describe: (a) what you learned about beer sales from your analysis, (b) why you chose the forecasting model that you did, and (c) the 2-year-ahead point forecast and confidence interval from your final model, in terms that a layman would understand. Most of your slides should feature graphs or reports copied from Statgraphics, with titles or bullet points or other annotations (e.g., boxes and arrows) as appropriate. Try to follow good principles of statistical graphics, i.e., show the data and try to make its message as clear as possible. Feel free to comment on what you feel might be the underlying causes for the patterns you have observed (economic events and demographic trends).

Your presentation should begin with one or two slides that highlight your key conclusions and recommendations (especially the bottom-line forecast that was asked for). Next, it should include some slides that address the background questions asked in Part A of the assignment, illustrated with appropriate graphs from your exploratory analysis. (Be sure to include a slide or two describing the data variables, the units in which they were measured, and where they were obtained.) Finally, it should include a few slides that illustrate your solution to Part B of the assignment, showing the forecasting models you tested and the rationale for your best model. Be sure to include the Model Comparison report that compares your various models, as well as the time series plot of the data with extrapolated forecasts and confidence intervals from your final model. You should also include a portion of the Analysis Summary report showing the estimated coefficients of the model. (The easiest way to copy a portion of a report into Powerpoint or Word is to first use the Copy analysis to Statreporter command, and then copy and paste the desired part of the report from the Statreporter window.)

Your presentation of your final model should also include a slide showing the following three plots of the residuals: (i) the residuals vs. time plot, (ii) the (vertical) normal probability plot (this is a pane option behind the residual time series plot in the Forecasting procedure), and (iii) the residual autocorrelation plot. (These should all fit on one slide, one above the other, if you copy them directly from the multiple-pane view of your analysis in Statgraphics, in which the graphs are wider than they are tall.)
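As a reading aid for the Model Comparison statistics and the residual autocorrelation plot requested above, the following Python sketch shows the usual textbook definitions of RMSE, MAE, MAPE, and the sample autocorrelations of the residuals. Statgraphics may differ in details such as degrees-of-freedom adjustments. The band of 1.96 divided by the square root of n is the common approximate 95% limit for white noise. The data values are made up, and the function names are illustrative.

```python
import math

def error_stats(actual, forecast):
    """Return (RMSE, MAE, MAPE in percent) for paired actuals and forecasts."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return rmse, mae, mape

def autocorrelations(residuals, max_lag):
    """Sample autocorrelations of the residuals at lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

def white_noise_band(n):
    """Approximate 95% limits for autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)

# Made-up values (NOT real beer data). A random walk without drift
# forecasts each value by the one before it.
actual = [100.0, 102.0, 101.0, 104.0, 106.0, 105.0, 108.0, 110.0]
one_step = actual[:-1]
rmse, mae, mape = error_stats(actual[1:], one_step)
residuals = [a - f for a, f in zip(actual[1:], one_step)]

print("RMSE, MAE, MAPE:", rmse, mae, mape)
print("autocorrelations:", autocorrelations(residuals, 3))
print("white-noise band: +/-", white_noise_band(len(residuals)))
```

Autocorrelations that fall well outside the band at low lags suggest that the residuals still contain a pattern the model has not captured. MAPE is unit-free, which is why it is the easiest of the three error statistics to compare across periods with different price levels.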