Linear model to forecast sales from past data of Rossmann drug stores


Group id: G3

Abstract

In recent years, the explosive growth of data has created a need for new tools that process data into knowledge and help people make decisions [1]. Here we develop a method that automatically fits models to drug stores' past data in order to predict future sales. We built both linear and non-linear models and examined their performance.

Keywords: Data mining, Sales, Linear regression, Non-linear model

Background

Rossmann operates over 3,000 drug stores in 7 European countries. Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation [2].

Given sales data for 1,115 stores from Jan. 2013 to Jul. 2015, our goal is to predict six weeks of future sales for each store, starting from Aug. 2015. To achieve this, we first chose linear models, since they are a useful tool for predicting a quantitative response [3]. We considered the following questions: how to deal with missing data; which part of the data to use for training and testing; which variables to select for prediction; how well the model performs; and how to apply linear models to 1,115 stores automatically. We eventually resolved these questions and were able to make predictions using our method. The following is an overview of our dataset.

Table 1: Example rows of dataset

Predictor      Description
Store          a unique Id for each store
Sales          the turnover for any given day (this is what we are predicting)
Customers      the number of customers on a given day
Open           an indicator for whether the store was open: 0 = closed, 1 = open
Promo          indicates whether a store is running a promo on that day
StateHoliday   indicates a state holiday
SchoolHoliday  indicates if the (Store, Date) was affected by the closure of public schools

Table 2: Description of predictors
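For concreteness, the dataset can be loaded and inspected in R as in the minimal sketch below. The file names train.csv and store.csv are assumed from the Kaggle competition download and are not stated in the report body; adjust paths as needed.

# Sketch: load the Kaggle Rossmann data and inspect the columns above.
train <- read.csv("train.csv", stringsAsFactors = FALSE)
store <- read.csv("store.csv", stringsAsFactors = FALSE)
str(train)   # Store, DayOfWeek, Date, Sales, Customers, Open, Promo, ...
head(train)  # example rows, as in Table 1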

Methods

As an overview, our linear method follows this algorithm:

Algorithm 1: Sales prediction
For each store i = 1, 2, ..., n:
1. Get training data:
   a) Select the subset of the whole dataset belonging to store i.
   b) Remove rows in which the store is closed (Sales = 0).
   c) Select the previous years' month-k data as the training set, where k is the month for which sales will be predicted.
2. Perform exhaustive subset selection and choose the best model (minimum BIC).
3. Use the best model to predict store i's sales for the future month k.

Get training data

To select training data, we explored the dataset and found an interesting phenomenon: sales are strongly affected by the month. There is a wave-like pattern throughout the year, with a significant increase in sales at the end of the year due to holiday-season (Christmas) shopping. Because of this pattern, we use the same month of former years as the training set to fit models, thus minimizing the impact of the month; a sketch of this step follows below.

Figure 1: Boxplot of sales vs month
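The following is a minimal R sketch of step 1 of Algorithm 1. The helper name get_train_data is ours for illustration and is not necessarily the name used in our script.

# Parse the date and add the month/year columns used throughout.
train$Date  <- as.Date(train$Date)
train$month <- as.numeric(format(train$Date, "%m"))
train$year  <- as.numeric(format(train$Date, "%Y"))

# Step 1 of Algorithm 1: training data for store i and target month k,
# using only years before the year we predict.
get_train_data <- function(data, i, k, before_year) {
  sub <- data[data$Store == i & data$Open == 1 & data$Sales > 0, ]  # a), b)
  sub[sub$month == k & sub$year < before_year, ]                    # c)
}

train_1_aug <- get_train_data(train, i = 1, k = 8, before_year = 2015)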

Select predictors

We explored each predictor to see whether it has an impact on sales. Three important predictors are Promo, DayOfWeek and SchoolHoliday.

1. Promo: promotions have a positive effect on sales. The boxplot shows that sales increase significantly on days with a promo. A Wilcoxon rank-sum test gives a p-value < 2.2e-16, so we reject the null hypothesis that the two populations have the same continuous distribution (a sketch of this test appears after the model equation below).

Figure 2: Boxplot of sales vs promo

2. Day of week: the day of the week has a significant effect on sales. It has 7 levels, and we consider each level as an independent predictor.

3. School holiday: a Wilcoxon test on the two populations (0 and 1) gives a small p-value, so we also consider SchoolHoliday an informative predictor.

4. State holiday: normally all stores, with few exceptions, are closed on state holidays, so for now we do not use it as a predictor in our model.

Figure 3: Boxplot of sales vs promo

We also used boosting to assess the influence of the predictors, as Figure 4 shows; it agrees with our earlier exploration.

Figure 4: Influence of predictors

In sum, the equation of our linear model is

Sales = β0 + β1·Promo + β2·DayOfWeek + β3·SchoolHoliday + ε

where β0, β1, β2, β3 are parameters and ε is the error term.
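The following sketch shows the Promo test from item 1 and a fit of the model above for one store's training data; factor() makes R expand the 7 levels of DayOfWeek into 6 dummy variables.

# Wilcoxon rank-sum test for the Promo effect (item 1 above).
wilcox.test(Sales ~ Promo, data = train_1_aug)

# The linear model of the equation above, fit on one store's training data.
fit <- lm(Sales ~ Promo + factor(DayOfWeek) + SchoolHoliday,
          data = train_1_aug)
summary(fit)  # 6 DayOfWeek dummies, plus Promo and SchoolHoliday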

Fit linear models and select the best model

Notice that DayOfWeek has 7 levels, some of which are redundant variables that add noise to the estimation of Sales. To improve both prediction accuracy and model interpretability, the redundant predictors should be removed first; given that the number of predictors is small, we need not worry about computational efficiency. Since DayOfWeek is a qualitative predictor, we can represent its 7 levels with 6 dummy variables. Because the impact of each level of DayOfWeek on Sales differs from store to store, we perform best subset selection for each store separately, and then choose the best model for each store based on criteria such as adjusted R² and BIC. In the algorithm below, step 2 identifies the best model of each subset size on the training data; to select a single best model, we then simply choose among these p+1 options.

Algorithm 2: Best subset selection
1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p:
   a) Fit all models that contain exactly k predictors.
   b) Pick the best among them and call it Mk. Here "best" is defined as having the smallest RSS, or equivalently the largest R².
3. Select a single best model from among M0, ..., Mp using adjusted R².

Apply linear models to each store

The goal is to predict sales for every store, but so far we have built a linear model for only one store. The next step is to apply this process to all stores automatically: for each store, we fit a unique model and use it to predict.

Figure 5: Workflow of our method (get training and testing data of store i → fit models and select the best → make prediction → go to next store, i = i + 1)

Linear model validation

To apply the process to all stores, we wrote a function (which we called predict.regsubsets) that automates fitting the models and selecting the best subset for all stores. For each store, the function considers all possible subsets of the linear model, chooses the model whose subset has the maximum adjusted R², and then predicts that store's sales. This function let us look at each store independently and find a different subset selection for each store, as sketched below.
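A sketch of this step with the leaps package follows. The predict.regsubsets helper below follows the pattern in James et al. [2]; our actual function may differ in detail.

library(leaps)

# Exhaustive best subset search over the 8 candidate variables
# (Promo, 6 DayOfWeek dummies, SchoolHoliday).
regfit  <- regsubsets(Sales ~ Promo + factor(DayOfWeek) + SchoolHoliday,
                      data = train_1_aug, nvmax = 8)
reg_sum <- summary(regfit)
best_adjr2 <- which.max(reg_sum$adjr2)  # size of the model maximizing adjusted R^2
best_bic   <- which.min(reg_sum$bic)    # size of the model minimizing BIC

# Prediction method for regsubsets objects (pattern from James et al. [2]).
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}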

As shown in Table 1, the dataset has a Date column giving the day, month and year. We added a year column, which shows only the year, and a month column, which shows only the month; Table 3 shows these columns.

Table 3: Example of the month and year columns

To evaluate the performance of our linear model, we split the dataset into three sets: the data of 2013 (January through December), the data of 2014 (January through December), and the data of 2015 (January through July). Since we were to predict the 2015 sales of August (month 8) and the first 2 weeks of September (month 9), we used the month-8 data of 2013 as training data for store i (i = 1, 2, ..., 1115, where 1115 is the number of stores) to fit our linear model (based on best subset selection), and then applied that model to predict the month-8 sales of 2014 for store i.

We applied two subset selection criteria to each store: the first based on minimum BIC and the second based on maximum adjusted R². We then calculated the MSE of the two selected models and chose the one with the smaller MSE. We also compared it with the MSE of the linear regression that uses all predictors.
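Putting the pieces together, one round of this validation for a single store looks roughly like the sketch below, reusing get_train_data and predict.regsubsets from the earlier sketches.

# Train on month 8 of 2013, test on month 8 of 2014, for store 1.
train_i <- get_train_data(train, i = 1, k = 8, before_year = 2014)
test_i  <- train[train$Store == 1 & train$month == 8 &
                 train$year == 2014 & train$Open == 1, ]

regfit  <- regsubsets(Sales ~ Promo + factor(DayOfWeek) + SchoolHoliday,
                      data = train_i, nvmax = 8)
reg_sum <- summary(regfit)

# MSE of the adjusted-R^2 model, the min-BIC model, and the full model.
mse <- function(y, yhat) mean((y - yhat)^2)
c(adjr2 = mse(test_i$Sales,
              predict(regfit, test_i, id = which.max(reg_sum$adjr2))),
  bic   = mse(test_i$Sales,
              predict(regfit, test_i, id = which.min(reg_sum$bic))),
  full  = mse(test_i$Sales,
              predict(lm(Sales ~ Promo + factor(DayOfWeek) + SchoolHoliday,
                         data = train_i), test_i)))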

Preliminary exploration of a non-linear model

For interpretability and visualization, we also make predictions using a decision tree. Trees are very easy to explain and can be displayed graphically, and tree methods such as bagging, random forests and boosting improve predictive accuracy. The basic idea is as follows: we select a tree size that minimizes the cross-validated error, build the tree with the recursive partitioning method, set the maximum depth to six to avoid over-fitting, and finally prune the tree to obtain the optimized classification. We make use of boosting in the implementation. The relative performance of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation-set approach.

Results

Linear model validation

The results for 3 stores are shown in Table 4, which reports for each store the MSE of the model selected by maximum adjusted R², the MSE of the model selected by minimum BIC, and the MSE of the full linear model.

Table 4: MSE based on adjusted R², minimum BIC, and the full linear model

The MSE of the full linear model is either the same as that of best subset selection or larger, so using best subset selection improves our predictions.

In Figure 6, we compare actual sales with the sales predicted by the model selected by minimizing BIC; in Figure 7, we compare actual sales with those predicted by the model selected by maximizing adjusted R².

Figure 6: Model selected with BIC
Figure 7: Model selected with adjusted R²

Non-linear model exploration

We fit a tree model on the training data and calculate its misclassification error. The rules can become too specific (over-fitting), and pruning the tree can improve prediction accuracy to an extent. It is essential to regroup factor variables among the predictors to guarantee a maximum of 32 levels. In Figure 9, the lower x-axis is the number of terminal nodes and the upper x-axis is the number of folds (the number of pieces the data is split into) in the cross-validation. Given a choice of 3-4 terminal nodes, all of which give the same misclassification error, 4 terminal nodes should be the first choice.

Step 1: Build the tree
Step 2: Prune the tree
Step 3: Re-calculate the misclassification error with the pruned tree
Step 4: Predict

Figure 8: Regression tree
Figure 9: Cross-validation error
Figure 10: Pruned tree
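A hedged sketch of steps 1-3 with the rpart package (recursive partitioning) follows; the figures in this report may come from slightly different settings, and the boosting step is omitted here. The mse helper is reused from the validation sketch above.

library(rpart)

# Step 1: grow a regression tree, capped at depth 6 to limit over-fitting.
tree_fit <- rpart(Sales ~ Promo + DayOfWeek + SchoolHoliday,
                  data = train_i, method = "anova",
                  control = rpart.control(maxdepth = 6, cp = 0.001))
printcp(tree_fit)  # cross-validated error (xerror) by subtree size

# Step 2: prune at the complexity value minimizing cross-validated error.
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)

# Steps 3-4: re-check the error on held-out data and predict.
tree_pred <- predict(pruned, test_i)
mse(test_i$Sales, tree_pred)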

Discussion

Learning through our path:

1. Grouping: Starting with forming the group, we set regular meetings and discussed ideas with each other. After several meetings, we became familiar with one another. Coming from different places, we sometimes talked about each other's lives, such as food, languages and habits, which broadened our view of the world. Getting together as friends is our greatest treasure.

2. Picking an interesting dataset: We collected various datasets on different topics from multiple sources. After comparing them, we all agreed to use the dataset above. Through this process, we formed a basic view of whether a dataset is good for analysis; for example, a dataset with too many or too few observations is not a good one. As a team, we also know many more sources of datasets than we would working alone.

3. Exploring the dataset: We mainly used boxplots to explore our dataset. From the boxplots, we found it very interesting that some variables, such as month, customers, day of week and Promo, have a significant impact on sales. These accord with common sense and turned out to be very useful in our prediction. Our model is guided by the results of this exploration.

4. Building a linear model: The biggest problem we met in this process was understanding why DayOfWeek, with 7 levels, yields 6 predictors. We solved it by reading the chapters on qualitative predictors in linear regression. We also met problems selecting the best model. Whenever we had a problem, we found the answer in the textbook and made it clear. Through this process, we reviewed the main points of linear regression.

5. Applying to all stores: To fill in future sales, we wrote a script that automatically builds a personalized model for each store and then predicts sales for all stores. We came up with this idea together, and it works out very well.

6. Exploring a non-linear model: We explored a non-linear model in the end, and it shows good performance as well. From this, we learned tree-based methods; in the future, we will keep working on them to make better predictions.

Future steps to increase the accuracy of the model:
1. From store.csv, add Promo2 and CompetitionDistance as predictors.
2. Use interactions of predictors in the linear model.
3. Try different approaches to finding the optimal model, such as Cp, BIC, etc.
4. Build non-linear models such as tree-based models and SVMs.

References
1. Lee, Sang Jun, and Keng Siau. "A review of data mining techniques." Industrial Management & Data Systems 101.1 (2001): 41-46.
2. James, Gareth, et al. An Introduction to Statistical Learning. New York: Springer, 2013.

Appendix

Explore store data

store.csv is another data file provided by Kaggle. We did not use it in our model because it helps little in predicting future sales: for example, when we predict August sales of store 1, predictors such as competition distance and store type take only one value for that store. However, it contains information that is interesting for understanding the features of the stores, and variables such as CompetitionDistance and Promo2 could be used in the future to increase the accuracy of our model.
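For reference, plots like Figures 11 and 12 below can be reproduced with a sketch along these lines; StoreType and CompetitionDistance are column names from Kaggle's store.csv, and the data frames come from the loading sketch at the start of this report.

# Mean sales per store, joined with the store.csv attributes.
mean_sales <- aggregate(Sales ~ Store, data = train[train$Open == 1, ],
                        FUN = mean)
store_info <- merge(store, mean_sales, by = "Store")

plot(store_info$CompetitionDistance, store_info$Sales,
     xlab = "Competition distance", ylab = "Mean sales")  # cf. Figure 11
boxplot(Sales ~ StoreType, data = store_info,
        xlab = "Store type", ylab = "Mean sales")         # cf. Figure 12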

Figure 11: Competition distance vs mean sales
Figure 12: Store type vs mean sales

The core code of our model

Figure 13: Linear model code