Mail Volume Forecasting


UPTEC IT 165
Examensarbete 30 hp
April 2016

Mail Volume Forecasting - an Evaluation of Machine Learning Models

Markus Ebbesson

Institutionen för informationsteknologi
Department of Information Technology


Abstract

Mail Volume Forecasting - an Evaluation of Machine Learning Models

Markus Ebbesson

Teknisk-naturvetenskaplig fakultet, UTH-enheten

This study applies machine learning models to mail volumes with the goal of making sufficiently accurate forecasts to minimise the problem of under- and overstaffing at a mail operating company. The most suitable model for this context is identified by evaluating input features and three different models: Auto Regression (AR), Random Forest (RF) and a Neural Network (NN) in the form of a Multilayer Perceptron (MLP). The results show greatly improved forecasting accuracy compared to the model that is currently in use. The RF model is recommended as the most practically applicable model for the company, although the RF and NN models provide similar accuracy. This study serves as an initiative, since the field lacks previous research on producing mail volume forecasts with machine learning. The outcomes are expected to be applicable for mail operators all over Sweden and the world.

Supervisor (Handledare): Lisa Stenkvist
Subject reader (Ämnesgranskare): Sholeh Yasini
Examiner (Examinator): Lars-Åke Nordén
UPTEC IT 165
Printed by (Tryckt av): Reprocentralen ITC


Table of Contents

List of Abbreviations
List of Figures
List of Tables
Introduction
  Bring Citymail Sweden
  Motivation
  Research Questions
Background
  Time Series
  Random Forest and Decision Trees
  Neural Networks and the Multilayer Perceptron
  Related Work
Methodology and Implementation
  Data
    Data Extraction
    Training and Test Data
    Features
    Data Preparation
    Performance Measurement
  Auto Regression
  Random Forest
  Multilayer Perceptron
    The Backpropagation Algorithm
    Avoiding Overfitting with Cross-Validation
  Limitations
Results
  Prediction Results
  Results per Mail Segment
  Feature Importances
  Feature Importances per Mail Segment
Discussion
  Future Work
Conclusion
Bibliography

List of Abbreviations

AR     Auto Regression
ARIMA  Auto Regression Integrated Moving Average
BCMS   Bring Citymail Sweden
MA     Moving Average
MAPE   Mean Absolute Percentage Error
MLP    Multilayer Perceptron
NN     Neural Network
OOB    Out-Of-Bag
RF     Random Forest
RMSE   Root Mean Squared Error

List of Figures

2.1 Example of how a tree with 7 leaves gives an output y based on input parameters x1, x2, x3, and x5. The output of the Random Forest (RF) will be the average of all trees' outputs for the same input.
2.2 A fully connected feed-forward Multilayer Perceptron (MLP) with one hidden layer.
3.1 Illustrates the similarities of the mail volume behaviour between the destinations.
3.2 Total mail volume per day for destination Stockholm during the time period.
3.3 Mail volume measured in number of postal items per segment for destination Stockholm during the time period. Together, the six segments make up the total volume for the destination.
- Shows how inputs turn into an output of a unit in a Multilayer Perceptron (MLP) via the propagation rule, the activation rule and the output rule.
- Illustration of the overfitting phenomenon.
- Prediction outputs (number of postal items) for the time period from a selection of the models, plotted in green. The actual mail volume outcomes are drawn in black.
- Actual volume and model prediction outputs per segment for the time period, measured in number of postal items.
- Feature importance for a model trained on the total mail volume: the Out-Of-Bag (OOB) error increase for each feature as the values of that feature are permuted across the OOB data. A larger permutation error increase suggests higher relative feature importance.
- Feature importance per segment as indicated by the Out-Of-Bag (OOB) error increase as the values of each feature are permuted across the OOB data. A larger value suggests higher feature importance.

List of Tables

3.1 List of considered features. The features' respective importance (prediction capabilities) will be evaluated and they will be included or excluded thereafter.
- Model accuracies measured in Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). Each model's performance is measured on 1-5 days, 22-26 days and 1-26 days prediction.
- Model accuracies measured in Mean Absolute Percentage Error (MAPE) for each segment separately. Each model's performance is measured on 1-26 days prediction, except models NN E and RF E, which predict 1-5 days.
- Positive and negative aspects of each method.

1 Introduction

This report pioneers the area in which the subjects of forecasting using Machine Learning and the daily work of mail operators meet. The efficient use of resources in a company is assessed by applying forecasting models to its mail volumes. Despite growing digitalisation and the usage of e-mail, traditional mail will always be needed and the domain is therefore sustained. While some mail categories decline with digitalisation, the mail operators are forced to adapt accordingly. This phenomenon puts the mail operators in a competitive environment, and efficient handling of resources is consequently important. The study evaluates and compares three machine learning models to find out whether they are sufficiently powerful in predicting the mail volume for the near future. It also suggests which model is the most accurate and the most practically applicable in the context of the company.

1.1 Bring Citymail Sweden

Bring Citymail Sweden (BCMS) delivers mail from businesses to businesses (B2B) and from businesses to consumers (B2C). BCMS covers (i.e. delivers to) major areas of Sweden, divided into five destinations based on four terminals located in Stockholm, Malmö, Gothenburg and Örebro. The five destinations are Stockholm, Malmö, Gothenburg, Örebro and Visby. Mail that will be delivered in destination Visby passes through terminal Stockholm. The terminals are where so-called post producers deliver the customers' mail once it has been printed. The terminals also serve as centralised sorting stations where the mail is sorted down to the exact order in which each Cityman (mailman) will do his or her round of deliveries. The sorted mail is then transported to the appropriate distribution office, at which the Citymen work and proceed to deliver the mail to its recipients. BCMS handles large volumes of mail of different kinds and of varying size every day.
The amount of mail being passed through depends on the number of people that live in each area, but it also has more advanced underlying influences affecting it.

1.2 Motivation

The current predictions are based solely on the information about what the customers have booked, making them very inaccurate. These orders are summarised in a web-based tool that the schedule-setting staff can consult. The current booking situation, displayed in the tool, can be far from the truth. Orders can change in volume and date, and new orders can be made as late as the day before arrival, making the booking situation, and therefore the predictions, highly preliminary. Making a good prediction based solely on

the current prediction tool therefore requires expert knowledge, long experience, guesswork and luck. Staff that sort and deliver mail (Citymen) are scheduled as far as three weeks in advance. However, this can be altered to some extent a few days in advance using employees working on hour-based salaries. Despite this, there is currently a problem at BCMS with both under- and overstaffing because of the short booking time span. Looking at today's tools for future planning is hopeless for an inexperienced person. The varying number of orders, and orders switching dates, makes them hard to interpret. With experience, it is easier to handle the current way of predicting the future. However, a proper prediction model alongside the current booking situation would be more useful, intuitive and accurate for both inexperienced and experienced staff schedulers. The varying nature of the mail volume, along with inaccurate forecasts and orders arriving late and being changed, makes it difficult to plan and schedule resources. An accurate forecast of this volume could therefore help the company plan its use of resources more efficiently.

1.3 Research Questions

This thesis addresses the following questions:

1. Can a supervised machine learning model be used to produce sufficiently accurate mail volume forecasts?
2. Comparing Auto Regression, Random Forest and the Multilayer Perceptron, which model is in this context...
   (a) the most accurate?
   (b) the most applicable?
3. Can forecasts be improved by...
   (a) splitting up the forecast into the six mail segments?
   (b) incorporating future booked volume information?
4. Which features are most important for predicting mail volume?

For a forecasting model to be reliable and trusted it needs to be accurate enough. In the company's interest, however, a model is sufficiently accurate as long as it is more accurate than the current predictions.
The best model could be the most accurate model or the most practically applicable one, or a combination of the two. A more accurate model is one that produces forecasted values closer to the actual outcome. A practically applicable model is a model whose properties make it suitable to implement as a finished product to be used in daily operation in the BCMS context and for this problem. The most accurate model could also be most accurate in forecasting either 1 to 5 days in advance or 3 to 4 weeks in advance, matching the timespans on which staff scheduling is done. Different models are expected to be better in different aspects.

2 Background

BCMS's mail volume varies a lot from day to day. Furthermore, the booked volume for each day changes up until as close as one day before delivery. The nature of this problem lies in the agreements with the customers, which for different reasons cannot be changed. Therefore, the sensible approach to this problem is to make predictions based on the time series data that is the mail volume history. The problem will also be approached with more advanced models, namely Random Forests (RFs) and Neural Networks (NNs).

2.1 Time Series

Time series forecasting is a way of predicting the next step of a data set based on previous values. Common methods include Moving Average (MA), exponential smoothing, Auto Regression (AR), Auto Regression Integrated Moving Average (ARIMA), et cetera. The correlation of the current value to the previous values is defined in different ways in each model. The simplest models usually use an identical or a decaying importance (weight) with each step backward in time, while more advanced time series models find other relationships through regression. MA is the method of calculating the average of part of the data, such as the few previous time steps. The MA model gives a slight delay and does not produce powerful predictions, but rather makes it easier to find shorter trends. Exponential smoothing is another method of smoothing time series data. The AR model attempts to find patterns by estimating how a time step depends on a number of previous time steps. It can therefore have more predictive power, if it finds suitable parameters. The ARIMA model is a combination of AR, MA and I, which stands for Integrated. A non-stationary time series can be made stationary by differencing and is then called an integrated version of the stationary series. AR and/or MA can then be applied once the time series is stationary.
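The smoothing and differencing steps described above can be sketched as follows (an illustrative sketch in Python; the thesis does not publish its implementation):

```python
def moving_average(series, window):
    """Trailing moving average; the first window-1 points have no value."""
    out = []
    for i in range(window - 1, len(series)):
        out.append(sum(series[i - window + 1:i + 1]) / window)
    return out

def difference(series, lag=1):
    """First-order differencing, y'_t = y_t - y_{t-lag}; the 'I' step of ARIMA."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

volumes = [100, 120, 90, 110, 130, 95, 105]   # made-up daily volumes
smoothed = moving_average(volumes, 3)          # shorter by window-1 points
diffed = difference(volumes)                   # removes a linear trend component
```

The smoothed series illustrates the slight delay mentioned above: each value summarises the preceding window rather than the current day alone.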
2.2 Random Forest and Decision Trees

Introduced by Breiman in 2001 [1], RF is a way of bootstrap aggregating, or bagging, decision trees. It thereby attacks the infamous bias-variance tradeoff by minimising variance without affecting the bias. Although originally introduced for classification trees, RFs can similarly be applied to regression trees, and are then sometimes referred to as Regression Forests. Regression Forests perform nonlinear multiple regression and are therefore applicable to the problem at hand. The output of a RF is defined as the average of the outputs of all trees in the forest as they are exposed to the same inputs. Each tree is trained in a similar fashion, but with two random elements. First, the selection of data points. Through sampling with replacement, each tree is trained on a subset

of the total training data, the so-called bag or in-bag data. The variance is therefore minimised by training each tree with slightly different data, while keeping the robustness and bias of decision trees. As a result, each tree can also be evaluated on its Out-Of-Bag (OOB) error. Second, the use of a subset of input parameters (features). For each split, a subset of the parameters is selected and potential splits for each of these parameters are evaluated. Each tree is trained to a maximum depth or a maximum number of leaves by recursively evaluating the information gain of each selected parameter. The split with the most information gain is selected and the node is split into two based on the value of this parameter. Figure 2.1 shows an example of such a tree as a part of a forest.

Figure 2.1: Example of how a tree with 7 leaves gives an output y based on input parameters x1, x2, x3, and x5. The output of the Random Forest (RF) will be the average of all trees' outputs for the same input.

A random forest, therefore, consists of a relatively large number of individual trees, usually in the order of hundreds to thousands. It is possible to take advantage of this property and gain information through analysing the individual, trained trees. For example, there have been applications where the individual trees have been used to find prediction intervals, i.e. confidence intervals for each prediction [2, 3].
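The two random elements described above can be sketched as follows (illustrative only; real trees are grown by recursive splitting, which is stubbed out here with fixed functions):

```python
import random

def bootstrap_split(data, rng):
    """Sample len(data) points with replacement (the bag); the rest are out-of-bag."""
    in_bag = [rng.choice(data) for _ in data]
    oob = [d for d in data if d not in in_bag]
    return in_bag, oob

def forest_predict(trees, x):
    """A regression forest's output is the mean of its trees' outputs."""
    return sum(tree(x) for tree in trees) / len(trees)

rng = random.Random(0)
data = list(range(10))
in_bag, oob = bootstrap_split(data, rng)          # each tree gets its own bag
trees = [lambda x: x + 1.0, lambda x: x - 1.0, lambda x: x]   # stand-in "trees"
print(forest_predict(trees, 5.0))  # mean of 6.0, 4.0, 5.0 -> 5.0
```

The out-of-bag points are exactly the data a given tree never saw during training, which is what makes the OOB error a built-in validation measure.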

2.3 Neural Networks and the Multilayer Perceptron

Although using neuron-like units to solve simple mathematics was introduced as early as the 1940s [4], the explosion of interest in NNs merely dates back to the 1980s. NNs have steadily grown in popularity since the credit assignment problem (error propagation problem) for multilayer networks was solved around the mid-1980s [5]. NNs use a set of neuron-like units and a pattern of weighted connections between the units, its so-called architecture, which represents and interprets information by manipulating the weighted connections. The most recognised and implemented NN structure is the feedforward Multilayer Perceptron (MLP). In this approach, the input given to the network is passed through the input layer, to one or more hidden layers and finally to the output layer to create an output, hence feedforward. The connections between the layers can be fully connected, where all units in one layer are connected to all units in the next layer, or partially connected in different ways. A fully connected feedforward MLP with one hidden layer is depicted in figure 2.2. This network has N input units, some hidden units and one output unit. Symbols in the figure represent the propagation rule and the activation function. The network takes N inputs (x1, ..., xN) and provides one output, y. Connection weights connect all units in one layer to all units in the next layer, making it fully connected. It also uses bias units, which always output 1.

Figure 2.2: A fully connected feed-forward Multilayer Perceptron (MLP) with one hidden layer.

A NN is typically implemented with an activation rule, an output rule, a propagation rule and a learning rule. The activation rule defines the unit's state, or activation, based on its current input. The output rule determines the output of a unit based on its current activation, and is usually set to equal the activation.
A unit's input is decided by the propagation rule on the basis of the outputs and weights of the units connecting to it. Connection weights are manipulated using the learning rule to correct the network towards a desired behaviour.
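The computation of a single unit can be sketched as follows (a minimal sketch, assuming a log-sigmoid activation as an example; the thesis does not fix the activation function here):

```python
import math

def unit_output(inputs, weights, bias_weight):
    # Propagation rule: weighted sum of incoming outputs; the bias unit always outputs 1.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias_weight
    # Activation rule: log-sigmoid squashes the net input into (0, 1).
    activation = 1.0 / (1.0 + math.exp(-net))
    # Output rule: here the output simply equals the activation.
    return activation
```

For a zero net input the sigmoid gives 0.5, which is why the bias weight matters: it shifts where the unit is "undecided".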

The activation function is used to scale the input of one unit to make its output. Activation functions may or may not be continuously differentiable. A continuous function, linear or non-linear, is necessary to allow gradient-based learning methods. Non-linear activation functions are therefore the most powerful; examples of such functions are the logistic (log-sigmoid) function and the hyperbolic tangent (tan-sigmoid) function. In order to run an input pattern through a MLP, such as the one in figure 2.2, the input units are set to assume the value of each respective input. The units in the hidden layer thereafter acquire an input according to the propagation rule. The hidden units assume the value given by the activation function from this input and output according to the output rule. The output unit thereafter calculates its output in the same way as the hidden units, according to the input from the hidden layer. The output of the output unit is the output of the perceptron. The perceptron may have more than one output unit and consequently more outputs. The perceptron is trained through supervised learning via the learning rule. Supervised learning means that the perceptron is exposed to an input pattern, and the output is compared to an expected value, also known as a teacher. The pattern error describes how wrong the perceptron is, and the learning rule alters the weights in the network to minimise this error. The network is run with all input patterns in a training set, the population. Learning once from each training pattern is called an epoch. The network can change the weights after each pattern, called sequential or online learning, or make a summed change of the weights from all pattern errors at the end of the epoch, called batch learning. Typically, it takes hundreds or thousands of epochs for the perceptron to learn, depending on the complexity of the problem.
Since the network's representation of what it has learnt lies in its architecture and weights, it does not have to save all input patterns that it has learnt from. Furthermore, it can be continuously taught as new data becomes available. However, this so-called online learning introduces the problem of catastrophic forgetting. Catastrophic forgetting means that when a NN is exposed to new training data, this can drastically affect what it has previously learnt. Therefore, online learning should be applied with caution. A sufficiently large MLP can be proved to learn any computable function (Turing equivalence) with the introduction of non-linear activation functions and bias units (see figure 2.2), which allow shifting of these activation functions. However, multiple layers introduce the credit assignment problem, which means that it is difficult to find a neuron or connection in a fully connected multilayered network to blame for an error in the output. The most famous learning rule to approach the credit assignment problem is the generalised delta rule, also known as backpropagation. It is a supervised learning rule, which means that it compares the output of a network with a desired output (the error) and corrects the network towards producing a closer value next time. The error is passed backwards through the network to assign credit to weights and alter them accordingly. The continuous activation function lets the backpropagation algorithm find partial derivatives and step-wise correct output errors by altering the weights of the network. These errors are further propagated backwards through the network to assign credit deeper into the network.
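A minimal sketch of one backpropagation step, on a hypothetical 1-1-1 network (one input, one sigmoid hidden unit, one linear output unit) with made-up parameters, not the thesis's model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, target, w1, w2, lr=0.1):
    # Forward pass.
    h = sigmoid(w1 * x)            # hidden activation
    y = w2 * h                     # linear output
    err = y - target               # pattern error
    # Backward pass: the chain rule assigns credit to each weight.
    grad_w2 = err * h
    grad_w1 = err * w2 * h * (1.0 - h) * x   # uses sigmoid'(z) = h * (1 - h)
    return w1 - lr * grad_w1, w2 - lr * grad_w2, err ** 2

w1, w2 = 0.5, 0.5
for _ in range(200):               # repeated epochs shrink the squared error
    w1, w2, loss = backprop_step(1.0, 0.8, w1, w2)
```

The squared error falls towards zero as both weights are nudged along their partial derivatives; the hidden weight's gradient carries the extra sigmoid-derivative factor, which is exactly the "credit passed backwards" described above.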

2.4 Related Work

During the research phase of this project it appeared that there are no previous studies on mail volume forecasting. Therefore, this is the first time such a problem is being considered. On the other hand, there are many available papers comparing different forecasting models. The rise of NNs has inspired many researchers to compare NNs to more traditional methods. They are found to be promising in areas such as railway passenger demand forecasting [6], airline passenger and car sales forecasting [7] and even the Makridakis Competition data set [8]. Comparing a NN with a time-series forecasting model using parameters from 1 through 13 months back in time, it was found that NNs can be superior in forecasting [7]. Furthermore, NNs perform as well as, and occasionally better than, statistical models [9]. The same study concluded that NNs performed especially well when non-linear elements are present. A subset of the same authors later found that NNs outperformed traditional statistical and human judgement methods when forecasting monthly and quarterly time series data [8]. Another positive aspect of NNs is that they can be viewed as more robust than time series models [7]. Time series models are more sensitive to noise. There are also some studies that find other models superior to the NN. For example, the Box-Jenkins algorithm seems to be more accurate than NNs in short term forecasting [7]. However, traditional statistical models were also found to be comparable to NNs on annual data [8]. These contradicting conclusions may imply that the problem is context specific. Therefore, different data sets are best fit by different models. On the RF side, some research has been conducted producing forecasts with high accuracy. One example is electricity load forecasting [10]. A comparative study measured higher accuracy from its RF technique than from the NN in predicting dementia progression [11].
Ease of use is another aspect to keep in mind when selecting a forecasting model that will be implemented in a finished product, a forecasting tool. There has been some research on what methods companies use and what is important to them when selecting a model. While it argues that many companies could benefit from using more advanced models rather than naive implementations for the sake of accuracy, it also finds that ease of use is of great importance [12].

3 Methodology and Implementation

This chapter describes the data, features, models and implementation. The acquisition and preparation of the appropriate data results in different input data sets and teachers for the models. The implementations of the models are then described. The performance of the models is then improved by estimating optimal parameters and settings.

3.1 Data

To analyse this data as a time series problem, it is organised on a mail-volume-per-day basis. The mail volume is measured in number of postal items, such as letters, envelopes, papers, magazines, et cetera. BCMS wishes to make forecasts on so-called destination level. There are currently five destinations that BCMS has divided its coverage area into: Stockholm, Malmö, Gothenburg, Örebro and Visby. Each of these covers an area of Sweden in the proximity of the city it is named after. Further division of the total mail volume can be made on varieties of mail. BCMS handles six different categories of mail, so-called mail segments. These are:

1. Administrative routines: Invoices, bank account statements, credit records, et cetera.
2. Mailbox sized packages: Trackable postal items.
3. Direct advertising: Addressed direct advertisements, campaigns and offers.
4. Office mail: Unsorted mail collected directly from BCMS's customers, i.e. it does not go through a post producer.
5. Unaddressed advertising: Free papers and civic information.
6. Magazines: Subscription papers that are delivered at least four times per year. Requires a Publication License.

Data Extraction

Data is extracted from the time period 1st of March 2014 until the 18th of September 2015. It contains all orders that have been delivered in this time period. The information about each order includes delivery dates, split date, volume (number of postal items), destination, mail segment, item format, average weight, et cetera. A delay at a sorting station will affect many distribution offices since they are centralised.
Therefore, the focus is on making forecasts for the date on which the mail will be sorted, referred to in the order as the order's split date.
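The aggregation described above, from order rows to mail volume per split date and segment, can be sketched as follows (field names and values are hypothetical; the actual order format is not published):

```python
from collections import defaultdict

# Hypothetical order rows; each order contributes its volume to its split date.
orders = [
    {"split_date": "2014-03-03", "segment": "Administrative routines", "volume": 1200},
    {"split_date": "2014-03-03", "segment": "Direct advertising",      "volume": 800},
    {"split_date": "2014-03-04", "segment": "Administrative routines", "volume": 500},
]

# Volume per (split date, mail segment) pair.
volume_per_day_segment = defaultdict(int)
for order in orders:
    volume_per_day_segment[(order["split_date"], order["segment"])] += order["volume"]

# Total volume per split date: the sum over all six segments.
volume_per_day = defaultdict(int)
for (day, _segment), vol in volume_per_day_segment.items():
    volume_per_day[day] += vol
```

Keeping both aggregations around is what later allows the total volume and the six per-segment series to be forecast separately.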

The destinations handle differing volumes because of the population differences between the areas. However, the volume varies proportionally similarly over time, as shown in figure 3.1. The figure shows mail variations for each destination over the time period. A moving average has been applied to decrease complexity and the result has been transformed to lie between 0 and 1. In reality, destination Stockholm handles much greater volumes than the Visby destination, for example. However, the transformed volumes behave similarly over time. Therefore, the rest of this report will focus on the Stockholm destination and assume that the forecasting models are similarly applicable to the other destinations.

Figure 3.1: Illustrates the similarities of the mail volume behaviour between the destinations (Stockholm, Malmö, Gothenburg, Örebro, Visby).

The true total volume for destination Stockholm within the time period is displayed in figure 3.2. Non-operating days, such as weekends and public holidays, have been removed since they show a volume of 0. The graph shows a lowered volume during the two summers, as well as a slight increase before Christmas. There is also a lowered volume at the very end of the year. Further, there is a slight indication of a monthly pattern. The mail segment of each postal item is also of interest. In contrast to the total volume (of one destination), as shown in figure 3.2, the same volume is split up into each of the six mail segments and plotted in the six graphs of

Figure 3.2: Total mail volume per day (number of postal items) for destination Stockholm during the time period.

figure 3.3. It becomes apparent that the mail segment a postal item belongs to affects the way it behaves in the time series. For example, the mail segment Administrative routines has a monthly pattern that is not present in the Direct advertising segment. It is therefore a good idea to analyse these segments separately. The data is broken down into mail volume, measured in number of postal items, per split date and mail segment.

Training and Test Data

The one-year time period 1st of March 2014 to 1st of March 2015 is used to train the models. It consists of 248 data points, once all non-operating days are removed. Each data point presents the mail volume, measured in number of postal items, per day. The test data, used to evaluate the trained models, consists of the following 1-26 days, i.e. the following four weeks. This effectively simulates standing on the 1st of March 2015 and making predictions for the following four weeks based on previous volumes. The upcoming 1 to 5 days into the future

Figure 3.3: Mail volume measured in number of postal items per segment for destination Stockholm during the time period. Together, the six segments make up the total volume for the destination. Panels: (a) Administrative routines, (b) Mailbox sized packages, (c) Direct advertising, (d) Office mail, (e) Unaddressed advertising, (f) Magazines.

and 22 to 26 days into the future will also be evaluated separately, as motivated in section 1.2. These will tell how good each model is at predicting short-term versus long-term mail volumes.

Features

For the RF and NN approaches, the key is to find a correlation between the inputs and the output, the mail volume. These inputs are feature parameters that can be anything that changes with, affects or is somehow otherwise related to the mail volume. For example, the mail volume for the Administrative routines segment seems to be highly related to the time of the month, as seen in figure 3.3a.

Feature Engineering

Finding and creating features is a process that requires expert knowledge and, to some extent, creativity. Some features are included because they are expected to be related to the behaviour of the mail volume (expert knowledge), while others are experimental (creativity). The following features will act as inputs for the NN and RF models. They will be evaluated and used in different combinations to improve prediction performance. The considered features are listed in table 3.1.

Time index [0, ] - Days since the first data point.
Day of the week [1, 5] - Day of the week. Weekends removed.
Day of the month [1, 31] - Day of the month.
Day of the year [1, 365] - Day of the year.
Month [1, 12] - Month index (January - December).
Time difference [1, 6] - Difference from the last operating day. Mail accumulates when there are many public holidays in a row.
Summer [0, 1] - Dummy variable. 1 between the 15th of June and the 31st of July; 0 otherwise.
1 week ahead booked volume [0, ] - Booked mail volume for this date as known 1 week ago.
4 weeks ahead booked volume [0, ] - Booked mail volume for this date as known 4 weeks ago.
1 week ahead confirmed booked volume [0, ] - Booked mail volume labelled confirmed for this date as known 1 week ago.
4 weeks ahead confirmed booked volume [0, ] - Booked mail volume labelled confirmed for this date as known 4 weeks ago.

Table 3.1: List of considered features. The features' respective importance (prediction capabilities) will be evaluated and they will be included or excluded thereafter.

The Time index feature is incremented for each day since the first data point. It could possibly help the model find long term trends through mapping

the feature to a long-term increase or decrease in mail volume. Day of the week, Day of the month and Day of the year describe the input's weekly, monthly and yearly status. Day of the week ranges from 1 to 5 since weekends are non-operating days. These features might help the models find weekly, monthly and yearly patterns, if they exist. The feature Month is further used to detect possible yearly patterns. Time difference is a feature that explores the effect of non-operating days, more specifically many non-operating days in a row. The idea is that mail can accumulate while mail is not being handled. The dummy variable Summer is set to 1 for dates between the 15th of June and the 31st of July, and 0 otherwise. This feature is included to assist the model in learning the known phenomenon of the decrease in volume during summertime. Other known information that is available about the future is the booked volume for each day. Booked orders have different assurance codes, such as preliminary or confirmed. This information is included in the predictions using the last four features: 1 week ahead booked volume, 4 weeks ahead booked volume, 1 week ahead confirmed booked volume and 4 weeks ahead confirmed booked volume. Section 1.2 describes the scheduling routines of the company and that the time spans of interest are a few days into the future and three weeks into the future. Therefore, the features 1 week ahead booked volume and 4 weeks ahead booked volume describe the total booked volume as it was known 1 week ago and 4 weeks ago, respectively. 1 week ahead confirmed booked volume defines how much of this total 1 week ahead booked volume was labelled confirmed, and 4 weeks ahead confirmed booked volume does the same for 4 weeks ahead.

Feature Sets

Evaluating and selecting a subset of features is the first approach to ensuring prediction accuracy. The complexity, and therefore the risk of overfitting, is decreased by excluding non-contributing features.
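The calendar features in table 3.1 can be derived directly from a date, as sketched below (a minimal sketch; the start date and exact encodings are assumptions for illustration):

```python
from datetime import date

START = date(2014, 3, 1)  # assumed first data point

def calendar_features(d, prev_operating_day):
    """Calendar-based features for one operating day, named as in table 3.1."""
    summer = 1 if (6, 15) <= (d.month, d.day) <= (7, 31) else 0
    return {
        "time_index": (d - START).days,
        "day_of_week": d.isoweekday(),           # 1-5 once weekends are removed
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "month": d.month,
        "time_difference": (d - prev_operating_day).days,
        "summer": summer,                        # dummy: 1 from 15 June to 31 July
    }

f = calendar_features(date(2014, 6, 16), date(2014, 6, 13))
```

In the example, Monday 16 June 2014 follows a Friday, so the time difference of 3 days captures mail accumulating over the weekend, and the summer dummy is active.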
Furthermore, fewer features increase the model's performance as they decrease the amount of computation. The RF and the NN were trained on three subsets of features:

1. All features excluding booked volumes, i.e. Time index, Day of the week, Day of the month, Day of the year, Month, Time difference and Summer.
2. Same as 1, plus 1 week ahead booked volume.
3. Same as 1, plus 4 weeks ahead booked volume.

To see if booked volume labelled confirmed further increases accuracy, RF was also trained on:

4. Same as 2, plus 1 week ahead confirmed booked volume.
5. Same as 3, plus 4 weeks ahead confirmed booked volume.

Feature Importance Estimation

The features listed in table 3.1 will be evaluated using the OOB error in the RF model. Although there are more advanced and reliable feature selection methods (such as Principal Component Analysis and Support Vector Machines),

the OOB error gives an indication of how important each feature is for producing an accurate output. This method of estimating feature importance finds the increase or decrease in output accuracy by evaluating the different trees in the forest. Each tree is trained on a different sample of the inputs, the so-called in-bag data. The data points that are not included are called out-of-bag. The OOB error, therefore, is the tree's error on training data that it has not been trained on. Each feature is evaluated by calculating the error increase as the values of that feature are permuted across the OOB data. This rearrangement should not affect the error if the feature is of low importance, so the magnitude of the error increase gives an indication of how important each feature is.

Data Preparation

To achieve as good results as possible, the data needs pre-processing. This helps the models to learn faster. The data in this study is of high quality, containing very few erroneous values. Therefore, the only removed data points are weekends and public holidays, where the mail volume is always zero. The mail volume and each input feature are normalised using min-max scaling to lie in the range [0, 1]. The minimum and maximum values are taken from the simulated previous one-year period.

Performance Measurement

The Root Mean Squared Error (RMSE) is a common way of measuring how well a model performs. The squared term in this equation penalises far-off forecasts heavily. For the application of these models, whose outputs will be interpreted by a human and schedules made thereafter, a model that occasionally outputs far-off values will not be trusted and is therefore no good. The RMSE is therefore preferred in this case. The RMSE for outputs ŷ_i, when the actual outcomes are y_i, is defined in equation 3.1.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 )    (3.1)

To complement the RMSE, the Mean Absolute Percentage Error (MAPE) will also be measured for each model.
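Both error measures follow directly from their definitions in equations 3.1 and 3.2. A minimal Python sketch, for illustration (the thesis itself used Matlab and C#):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error (equation 3.1); squaring penalises far-off forecasts."""
    n = len(actual)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean Absolute Percentage Error (equation 3.2), in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs(y - yhat) / abs(y) for y, yhat in zip(actual, predicted))
```

Note that the MAPE is undefined when an actual value is zero, which is one more reason the zero-volume weekends and public holidays are removed from the data.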
The MAPE measures how close to the actual values the forecasting model generally comes. Equation 3.2 defines the MAPE for predictions ŷ_i when the actual values are y_i.

MAPE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i| × 100 %    (3.2)

3.2 Auto Regression

The AR model is a simple time series model in which the output at time t, y_t, depends on the p previous values, as described in equation 3.3. Its

parameters, ϕ_i, are estimated using least squares. The noise term ε_t is assumed to be white noise.

y_t = c + Σ_{i=1}^{p} ϕ_i y_{t−i} + ε_t    (3.3)

The capabilities of the AR model were measured in Matlab, using the Statistical Learning toolbox.

3.3 Random Forest

A RF approach was implemented in Matlab using the built-in TreeBagger function. This allowed getting results quickly and left more room for experimenting with the number of trees, the maximum number of leaves and the parameter subset size. The number of trees should be large, ranging from a hundred to thousands. The model showed no indication of overfitting when experimenting with the number of trees in the range 100-1,000. It is confirmed by the original author that overfitting should not be a problem [1]. However, the model stopped improving in OOB prediction performance as the number of trees surpassed around 500. Therefore, 500 trees were used to avoid excess calculations. The OOB error, as mentioned in section 2.2, was measured for different settings of the maximum number of leaves; 50 leaves per tree proved to be sufficient.

3.4 Multilayer Perceptron

A fully connected feedforward MLP was implemented in C#. The program represents such a NN of three layers, one input, one hidden and one output layer, each of arbitrary size, similar to the NN in figure 2.2. The propagation rule in equation 3.4 states that the output o_i of one unit i, multiplied by the weight w_ij that connects units i and j, constitutes an input input_ji for the unit j in the next layer.

input_ji = o_i w_ij    (3.4)

The net input net_j of a unit j, however, depends on all units i in the previous layer of size N and the bias unit θ_j, as shown in equation 3.5.

net_j = Σ_{i=1}^{N} input_ji + θ_j    (3.5)

The non-linear activation function used in this implementation is defined in equation 3.6.

activation = 1 / (1 + e^(−net))    (3.6)

The output rule in equation 3.7 defines the output of a unit to be its activation.
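Taken together, equations 3.4-3.7 describe the forward computation of a single unit. A minimal sketch in Python, for illustration only (the thesis implementation is in C#):

```python
import math

def unit_output(prev_outputs, weights, bias):
    """Forward computation of one MLP unit, following equations 3.4-3.7.

    Propagation rule (3.4): each incoming value is o_i * w_ij.
    Net input (3.5): the sum of these inputs plus the bias term.
    Activation rule (3.6): the logistic function 1 / (1 + e^-net).
    Output rule (3.7): the unit's output is its activation.
    """
    net = sum(o * w for o, w in zip(prev_outputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))
```

Applying this to every unit in the hidden layer, and then to the output unit, gives the full forward pass of the three-layer network.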

output = activation    (3.7)

The behaviour of one unit in the implemented MLP is illustrated in figure 3.4, along with the propagation, activation and output rules. In this case, the propagation rule defines the unit's input to be the sum of the products of each previous unit's output and its connection weight (equation 3.4). The activation rule in this example is the logistic activation function (equation 3.6), and the output of a unit is its activation (equation 3.7).

Figure 3.4: Shows how inputs turn into an output of a unit in a Multilayer Perceptron (MLP) via the propagation rule, the activation rule and the output rule.

The number of input units depends on the number of inputs, or input features, that the network is learning from. There is always one input unit per input feature. Only one output unit is required for learning the mail volume data, representing the mail volume associated with the current input. The number of hidden layers was kept at one to avoid unnecessary computation time and risk of overfitting; more hidden layers will rarely improve results [13]. The number of hidden units varied from the same as the number of input units to one or two fewer. Fewer hidden units force the network to squeeze the representations through the hidden layer. This complexity reduction can improve generalisation and reduce overfitting. However, too few hidden units could make a network that is not powerful enough.

3.4.1 The Backpropagation Algorithm
The weights are updated as defined in equation 3.8, where Δw_ij is the weight change (positive or negative) for the weight that connects units i and j, η is the learning constant, δ_j is the error term for unit j and o_i is the output of unit i. The learning constant η is typically set to around 0.1.

Δw_ij = η δ_j o_i    (3.8)

The error term δ_j is calculated as described in equation 3.9. Here, t_j is the target, or desired, output of unit j and o_j is the actual output. f′_j(net_j) is the derivative of the activation function and net_j is the net input of unit j, as defined in equation 3.5.

δ_j = (t_j − o_j) f′_j(net_j)    (3.9)

The backpropagation algorithm can also utilise momentum, defined by the momentum constant, for faster learning. The momentum constant also reduces the risk of the algorithm getting stuck in a local minimum. Momentum means that the weight update from the previous epoch, Δw_ij(n−1), affects the change in weight strength for the current epoch, Δw_ij(n), as described in equation 3.10. α is the momentum constant, which lies between 0 and 1, typically 0.9.

Δw_ij(n) = η δ_j o_i + α Δw_ij(n−1)    (3.10)

3.5 Avoiding Overfitting with Cross-Validation

The overfitting phenomenon is crucial to understand when applying machine learning. Overfitting happens when a model learns a training set too well, i.e. it is trained too much on the same data. The model has then not learned the general correlation between the features and the output, but rather acts like a look-up table of the specific training data. It therefore loses the ability to generalise to new cases that it has not been introduced to. This phenomenon is illustrated in figure 3.5. In this figure, a model is trained on a training data set, so the training data population error keeps decreasing. The population error on the test data set, which the model is not trained on, will eventually start increasing. The model is then losing its ability to generalise, i.e. overfitting the training data. Since the RF has its OOB error estimate, there is no need to perform cross-validation for the RF model. A special form of cross-validation is needed to work with the AR model.
We cannot shuffle or randomly pick out validation data points, since the model needs the data for each step in chronological order. Instead, the data is split up into 10 ordered and numbered segments. The model's parameters are then estimated on different training segments and evaluated on the remaining validation segments, as follows:

1. Training segment 1, validation segments 2-10.
2. Training segments 1-2, validation segments 3-10.
3. Training segments 1-3, validation segments 4-10.
4. Training segments 1-4, validation segments 5-10.
5. Training segments 1-5, validation segments 6-10.
6. Training segments 1-6, validation segments 7-10.

Figure 3.5: Illustration of the overfitting phenomenon (the training data population error keeps decreasing with epochs of training, while the test data population error eventually starts increasing).

7. Training segments 1-7, validation segments 8-10.
8. Training segments 1-8, validation segments 9-10.
9. Training segments 1-9, validation segment 10.

The average error over all of the validation data is the estimated error of the model. Different numbers of parameters are evaluated using this method, and the estimated optimal number of parameters is the one that minimises this error.

Estimation of the MLP's optimal input features, number of hidden units and learning rate is made with k-fold cross-validation. This way, the model's complexity is reduced to protect the model from overfitting the training data. k-fold cross-validation splits the training data into k parts, or folds. Each fold is used as validation data for a model trained on the remaining k − 1 folds. This way, all of the data is used in both training and validation. Once k models have been trained on different data, the average validation error is the estimate of how good the model is. The optimal behaviour is found by applying cross-validation systematically over different input features, architectures and parameters. The model can then be trained on the complete training set once the preferred settings have been found.
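The ordered, expanding-window scheme used for the AR model can be sketched as follows. This is an illustrative Python sketch, not the thesis code (which ran in Matlab); `fit` and `error` are placeholders for the AR parameter estimation and an error measure such as the RMSE:

```python
def expanding_window_splits(n_segments=10):
    """Yield (training, validation) segment numbers for the scheme above:
    step k trains on segments 1..k and validates on segments k+1..n_segments."""
    for k in range(1, n_segments):
        yield list(range(1, k + 1)), list(range(k + 1, n_segments + 1))

def estimate_error(segments, fit, error):
    """Average validation error over all steps, the estimated error of the model."""
    errors = []
    for train_ids, val_ids in expanding_window_splits(len(segments)):
        model = fit([segments[i - 1] for i in train_ids])
        errors.append(error(model, [segments[i - 1] for i in val_ids]))
    return sum(errors) / len(errors)
```

Because training always precedes validation in time, no future information leaks into the parameter estimates, which is exactly why ordinary shuffled k-fold cross-validation cannot be used here.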

3.6 Limitations

This is a study of the possibility of producing sufficiently accurate forecasts of mail volume in the BCMS context. The implementations are therefore prototypes rather than finished, continuously learning (updating) models for everyday use. Further, the models perform predictions 1-26 days into the future from the simulated date and are trained on the simulated previous one-year period. This way, all models are trained on the same data and can be compared against the actual outcome and against each other. However, it would be interesting to apply these models and perform predictions on other dates. Another interesting aspect would be to explore how using longer time periods of training data affects, and possibly increases, prediction performance, and how that might require additional weighting of the importance of older data points. The study is also limited to three models: AR, RF and NN. There are other possible methods, such as Multiple Linear Regression, Autoregressive Integrated Moving Average (ARIMA) and Support Vector Machines (SVM). The booked volume features are also limited in two ways. First, they give the total booked volume rather than the booked volume per segment. Second, they are limited to the booked volume 4 weeks in advance and 1 week in advance. These could be updated for each day as the date comes closer, but this was not implemented for simplicity reasons.

4 Results

The following sections describe each feature's importance and the optimal usage of each model, i.e. parameter settings, after which the actual prediction results of each model are presented. The goal is to achieve as good accuracy as possible; however, as long as a model performs better than the current model, it is an improvement. The forecasts of the current model have been logged in archives, allowing evaluation of predictions made on the same time period as the proposed models are tested on. Given the standpoint of the 1st of March 2015, the model that is currently in use predicted the future as drawn in figure 4.1a. This represents a MAPE of % for the first five days, % for days into the future and % for the overall 1-26 days prediction. These figures will be used for comparison against the proposed models in section 4.1. Each model is trained on both the total volume and on each segment separately. The models are also trained using different parameter settings, and the NN and RF are trained on several different selections of features. These selections and adjustments are described more carefully in chapter 3.

4.1 Prediction Results

The performance of each model is presented in table 4.1. The table contains the MAPE and RMSE for predictions made 1-5 days into the future, predictions made days into the future, and overall 1-26 days into the future, for each model. The table also contains these error measurements for the forecasting model that the company currently uses to assist in scheduling. All predictions are made from the simulated date of the 1st of March 2015, based on training data from the preceding year. Table 4.1 provides an overview of how the models compare against each other and against the current model. The baseline AR model in this project notably outperforms the model that is currently in use, with an overall MAPE of % compared to the current model's %. Furthermore, the RF and NN approaches improve the forecasts further, reaching around and below 20 %.
The mail segments, such as Administrative routines, Direct advertising, et cetera, have in some approaches had one model trained on each separately. These approaches are in practice six parallel models that each produce an output for one mail segment; the sum of these outputs makes up the total prediction. The table indicates that prediction accuracy improves when the volume is split up per mail segment for the AR model, reducing the overall MAPE from % to %. However, in some cases the split can confuse the model and make it perform worse overall, as in the case of RF A (19.64 %) versus RF B (20.83 %). In the NN case, there is a slight improvement from NN A (21.65 %) to NN B (19.6 %). These results are further detailed in the section Results per Mail Segment.

Table 4.1 also shows that the booked mail volume can be used to further improve the forecast accuracies. Both the RF and NN predictions improve when

CURR: Current forecasts.
AR A: Auto Regression.
AR B: Auto Regression for each segment separately. Combined result for total volume.
RF A: Random Forest.
RF B: A Random Forest for each segment separately. Combined result for total volume.
RF C: Random Forest with 1 week ahead booked volume.
RF D: Random Forest with 4 weeks ahead booked volume.
RF E: Random Forest for each segment separately and with 1 week ahead booked volume. Combined result for total volume.
RF F: Random Forest for each segment separately and with 4 weeks ahead booked volume. Combined result for total volume.
RF G: Random Forest with 1 week ahead booked volume and 1 week ahead confirmed booked volume.
RF H: Random Forest with 4 weeks ahead booked volume and 4 weeks ahead confirmed booked volume.
NN A: Neural Network.
NN B: A Neural Network for each segment separately. Combined result for total volume.
NN C: Neural Network with 1 week ahead booked volume.
NN D: Neural Network with 4 weeks ahead booked volume.
NN E: Neural Network for each segment separately and with 1 week ahead booked volume. Combined result for total volume.
NN F: Neural Network for each segment separately and with 4 weeks ahead booked volume. Combined result for total volume.

          1-5 days             days                 Overall (1-26 days)
Model     MAPE      RMSE       MAPE      RMSE       MAPE      RMSE
CURR      %         867,       %         428,       %         649,631
AR A      3.81 %    289,       %         381,       %         281,45
AR B      %         35,        %         361,       %         281,94
RF A      %         281,       %         228,       %         224,365
RF B      2. %      272,       %         272,       %         221,712
RF C      7.1 %     63,23      n/a       n/a        n/a       n/a
RF D      24.9 %    345,       %         76,        %         223,94
RF E      9.5 %     113,425    n/a       n/a        n/a       n/a
RF F      2.63 %    324,       %         6,         %         196,676
RF G      1.1 %     72,98      n/a       n/a        n/a       n/a
RF H      %         27,        %         89,        %         195,567
NN A      %         271,       %         264,       %         246,822
NN B      %         291,       %         262,       %         222,485
NN C      3.31 %    29,343     n/a       n/a        n/a       n/a
NN D      %         34,        %         92,        %         227,683
NN E      9.6 %     153,638    n/a       n/a        n/a       n/a
NN F      %         312,       %         66,        %         222,873

Table 4.1: Model accuracies measured in Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE).
Each model's performance is measured on 1-5 days prediction, days prediction and 1-26 days prediction.

a model is provided with the 4 weeks ahead booked volume feature, from % to 18.1 % when comparing RF A to RF D, and from % to % in the NN case. The 1 week ahead booked volume reduces the error further


More information

Utilizing Optimization Techniques to Enhance Cost and Schedule Risk Analysis

Utilizing Optimization Techniques to Enhance Cost and Schedule Risk Analysis 1 Utilizing Optimization Techniques to Enhance Cost and Schedule Risk Analysis Colin Smith, Brandon Herzog SCEA 2012 2 Table of Contents Introduction to Optimization Optimization and Uncertainty Analysis

More information

When to Book: Predicting Flight Pricing

When to Book: Predicting Flight Pricing When to Book: Predicting Flight Pricing Qiqi Ren Stanford University qiqiren@stanford.edu Abstract When is the best time to purchase a flight? Flight prices fluctuate constantly, so purchasing at different

More information

WATER LEVEL PREDICTION BY ARTIFICIAL NEURAL NETWORK IN THE SURMA-KUSHIYARA RIVER SYSTEM OF BANGLADESH

WATER LEVEL PREDICTION BY ARTIFICIAL NEURAL NETWORK IN THE SURMA-KUSHIYARA RIVER SYSTEM OF BANGLADESH WATER LEVEL PREDICTION BY ARTIFICIAL NEURAL NETWORK IN THE SURMA-KUSHIYARA RIVER SYSTEM OF BANGLADESH Robin Kumar Biswas MEE0877 Supervisor: Prof. A. W. Jayawardena ABSTRACT The nonlinear relationship

More information

Application of Artificial Neural Networks in Prediction: A Case Study in Consumer Propensity to Buy Insurance

Application of Artificial Neural Networks in Prediction: A Case Study in Consumer Propensity to Buy Insurance Application of Artificial Neural Networks in Prediction: A Case Study in Consumer Propensity to Buy Insurance Rosmayati Mohemad 609-6683106 rosmayati@umtedumy Noor Azliza Che Mat 609-6683106 azliza@umtedumy

More information

Using Decision Tree to predict repeat customers

Using Decision Tree to predict repeat customers Using Decision Tree to predict repeat customers Jia En Nicholette Li Jing Rong Lim Abstract We focus on using feature engineering and decision trees to perform classification and feature selection on the

More information

Forecasting Electricity Consumption with Neural Networks and Support Vector Regression

Forecasting Electricity Consumption with Neural Networks and Support Vector Regression Available online at www.sciencedirect.com Procedia - Social and Behavioral Sciences 58 ( 2012 ) 1576 1585 8 th International Strategic Management Conference Forecasting Electricity Consumption with Neural

More information

Wastewater Effluent Flow Control Using Multilayer Neural Networks

Wastewater Effluent Flow Control Using Multilayer Neural Networks Wastewater Effluent Flow Control Using Multilayer Neural Networks Heidar A. Malki University of Houston 304 Technology Building Houston, TX 77204-4022 malki@uh.edu Shayan Mirabi University of Houston 304

More information

Artificial neural network application in modeling revenue returns from mobile payment services in Kenya

Artificial neural network application in modeling revenue returns from mobile payment services in Kenya American Journal of Theoretical and Applied Statistics 2014; 3(3): 60-64 Published online May 10, 2014 (http://www.sciencepublishinggroup.com/j/ajtas) doi: 10.11648/j.ajtas.20140303.11 Artificial neural

More information

Preface to the third edition Preface to the first edition Acknowledgments

Preface to the third edition Preface to the first edition Acknowledgments Contents Foreword Preface to the third edition Preface to the first edition Acknowledgments Part I PRELIMINARIES XXI XXIII XXVII XXIX CHAPTER 1 Introduction 3 1.1 What Is Business Analytics?................

More information

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest 1. Introduction Reddit is a social media website where users submit content to a public forum, and other

More information

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy AGENDA 1. Introduction 2. Use Cases 3. Popular Algorithms 4. Typical Approach 5. Case Study 2016 SAPIENT GLOBAL MARKETS

More information

Operations and Supply Chain Management Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras

Operations and Supply Chain Management Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Operations and Supply Chain Management Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Module - 01 Lecture - 08 Aggregate Planning, Quadratic Model, Demand and

More information

Analysis of Boiler Operational Variables Prior to Tube Leakage Fault by Artificial Intelligent System

Analysis of Boiler Operational Variables Prior to Tube Leakage Fault by Artificial Intelligent System MATEC Web of Conferences 13, 05004 (2014) DOI: 10.1051/ matecconf/ 201413 05004 C Owned by the authors, published by EDP Sciences, 2014 Analysis of Boiler Operational Variables Prior to Tube Leakage Fault

More information

Demand based price determination for electricity consumers in private households

Demand based price determination for electricity consumers in private households TVE 16 036 maj Examensarbete 15 hp Juni 2016 Demand based price determination for electricity consumers in private households Lisa Borggren Rebecca Grill Susanna Lykken Maria Nilsson Abstract Demand based

More information

Seasonal Adjustment in ONS Quality through collaboration. Workshop on Handbook on Seasonal Adjustment Budapest 9 November 2006

Seasonal Adjustment in ONS Quality through collaboration. Workshop on Handbook on Seasonal Adjustment Budapest 9 November 2006 Seasonal Adjustment in ONS Quality through collaboration Workshop on Handbook on Seasonal Adjustment Budapest 9 November 2006 Quality through collaboration Structure of presentation 1. Requirement for

More information

Machine Learning Models for Sales Time Series Forecasting

Machine Learning Models for Sales Time Series Forecasting Article Machine Learning Models for Sales Time Series Forecasting Bohdan M. Pavlyshenko SoftServe, Inc., Ivan Franko National University of Lviv * Correspondence: bpavl@softserveinc.com, b.pavlyshenko@gmail.com

More information

Supervised Learning Using Artificial Prediction Markets

Supervised Learning Using Artificial Prediction Markets Supervised Learning Using Artificial Prediction Markets Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, FSU Dept. of Scientific Computing 1 Main Contributions

More information

A Data-driven Prognostic Model for

A Data-driven Prognostic Model for A Data-driven Prognostic Model for Industrial Equipment Using Time Series Prediction Methods A Data-driven Prognostic Model for Industrial Equipment Using Time Series Prediction Methods S.A. Azirah, B.

More information

LCA in decision making

LCA in decision making LCA in decision making 1 (13) LCA in decision making An idea document CHAINET LCA in decision making 2 (13) Content 1 INTRODUCTION 2 EXAMPLE OF AN INDUSTRIAL DEVELOPMENT PROCESS 2.1 General about the industrial

More information

Predicting Corporate 8-K Content Using Machine Learning Techniques

Predicting Corporate 8-K Content Using Machine Learning Techniques Predicting Corporate 8-K Content Using Machine Learning Techniques Min Ji Lee Graduate School of Business Stanford University Stanford, California 94305 E-mail: minjilee@stanford.edu Hyungjun Lee Department

More information

Copyright 2013, SAS Institute Inc. All rights reserved.

Copyright 2013, SAS Institute Inc. All rights reserved. IMPROVING PREDICTION OF CYBER ATTACKS USING ENSEMBLE MODELING June 17, 2014 82 nd MORSS Alexandria, VA Tom Donnelly, PhD Systems Engineer & Co-insurrectionist JMP Federal Government Team ABSTRACT Improving

More information

Fraud Detection for MCC Manipulation

Fraud Detection for MCC Manipulation 2016 International Conference on Informatics, Management Engineering and Industrial Application (IMEIA 2016) ISBN: 978-1-60595-345-8 Fraud Detection for MCC Manipulation Hong-feng CHAI 1, Xin LIU 2, Yan-jun

More information

Profit Optimization ABSTRACT PROBLEM INTRODUCTION

Profit Optimization ABSTRACT PROBLEM INTRODUCTION Profit Optimization Quinn Burzynski, Lydia Frank, Zac Nordstrom, and Jake Wolfe Dr. Song Chen and Dr. Chad Vidden, UW-LaCrosse Mathematics Department ABSTRACT Each branch store of Fastenal is responsible

More information

SAP APO DP (Demand Planning) Sample training content and overview.. All rights reserved Copyright 2005 TeknOkret Services. All Rights Reserved.

SAP APO DP (Demand Planning) Sample training content and overview.. All rights reserved Copyright 2005 TeknOkret Services. All Rights Reserved. SAP APO DP (Demand Planning) Sample training content and overview Sample course content Demand Planning Concepts Importance of Demand Planning SAP APO Demand Planning terminology APO Data structure: Source

More information

WATER QUALITY PREDICTION IN DISTRIBUTION SYSTEM USING CASCADE FEED FORWARD NEURAL NETWORK

WATER QUALITY PREDICTION IN DISTRIBUTION SYSTEM USING CASCADE FEED FORWARD NEURAL NETWORK WATER QUALITY PREDICTION IN DISTRIBUTION SYSTEM USING CASCADE FEED FORWARD NEURAL NETWORK VINAYAK K PATKI, S. SHRIHARI, B. MANU Research scholar, National Institute of Technology Karnataka (NITK), Surathkal,

More information

Ensemble Learning for Large-Scale Workload Prediction

Ensemble Learning for Large-Scale Workload Prediction Received 2 July 2013; revised 3 January 2014; accepted 10 January 2014. Date of publication 7 April 2014; date of current version 30 July 2014. Digital Object Identifier 10.1109/TETC.2014.2310455 Ensemble

More information

If you make a mistake, press Stop and re- try.

If you make a mistake, press Stop and re- try. Market Vector Auto Regression app for iphone Version 1.0 This app models the (close- open) price direction (today's price travel) of a given target stock, as a function of prior days opening and closing

More information

New Methods and Data that Improves Contact Center Forecasting. Ric Kosiba and Bayu Wicaksono

New Methods and Data that Improves Contact Center Forecasting. Ric Kosiba and Bayu Wicaksono New Methods and Data that Improves Contact Center Forecasting Ric Kosiba and Bayu Wicaksono What we are discussing today Purpose of forecasting (and which important metrics) Humans versus (Learning) Machines

More information

Linear model to forecast sales from past data of Rossmann drug Store

Linear model to forecast sales from past data of Rossmann drug Store Abstract Linear model to forecast sales from past data of Rossmann drug Store Group id: G3 Recent years, the explosive growth in data results in the need to develop new tools to process data into knowledge

More information

Can Cascades be Predicted?

Can Cascades be Predicted? Can Cascades be Predicted? Rediet Abebe and Thibaut Horel September 22, 2014 1 Introduction In this presentation, we discuss the paper Can Cascades be Predicted? by Cheng, Adamic, Dow, Kleinberg, and Leskovec,

More information

Artificial Neural Network Model Prediction of Glucose by Enzymatic Hydrolysis of Rice Straw

Artificial Neural Network Model Prediction of Glucose by Enzymatic Hydrolysis of Rice Straw Journal of Engineering Science, Vol. 10, 85 94, 2014 Artificial Neural Network Model Prediction of Glucose by Enzymatic Hydrolysis of Rice Straw Erniza Mohd Johan Jaya, Nur Atiqah Norhalim and Zainal Ahmad

More information

Insights from the Wikipedia Contest

Insights from the Wikipedia Contest Insights from the Wikipedia Contest Kalpit V Desai, Roopesh Ranjan Abstract The Wikimedia Foundation has recently observed that newly joining editors on Wikipedia are increasingly failing to integrate

More information

Measuring long-term effects in marketing P.M Cain

Measuring long-term effects in marketing P.M Cain Measuring long-term effects in marketing P.M Cain Conventional marketing mix models are typically used to measure short-term marketing ROI and guide optimal budget allocation. However, this is only part

More information

CHAPTER 5 IMPROVING ON TIME, IN FULL DELIVERY

CHAPTER 5 IMPROVING ON TIME, IN FULL DELIVERY 70 CHAPTER 5 IMPROVING ON TIME, IN FULL DELIVERY 5.1 INTRODUCTION The orders which are not yet shipped or invoiced to customer are called backlog orders. The various reasons for accumulation of stock,

More information

New restaurants fail at a surprisingly

New restaurants fail at a surprisingly Predicting New Restaurant Success and Rating with Yelp Aileen Wang, William Zeng, Jessica Zhang Stanford University aileen15@stanford.edu, wizeng@stanford.edu, jzhang4@stanford.edu December 16, 2016 Abstract

More information

A Systematic Approach to Performance Evaluation

A Systematic Approach to Performance Evaluation A Systematic Approach to Performance evaluation is the process of determining how well an existing or future computer system meets a set of alternative performance objectives. Arbitrarily selecting performance

More information

A Mirsepassi: Application of Chemical Dosage Alum, Polymer Lime, Chlorine Mixers Chemical Dosage Chlorine, Fluoride Carbon dioxide, Lime Filter Media

A Mirsepassi: Application of Chemical Dosage Alum, Polymer Lime, Chlorine Mixers Chemical Dosage Chlorine, Fluoride Carbon dioxide, Lime Filter Media Iranian J Env Health Sci Eng, 2004, Iranian Vol.1, No.2, J Env pp.51-57 Health Sci Eng, 2004, Vol.1, No.2, pp.51-57 Application of Intelligent System for Water Treatment Plant Operation *A Mirsepassi Dept.

More information

Choosing the Right Type of Forecasting Model: Introduction Statistics, Econometrics, and Forecasting Concept of Forecast Accuracy: Compared to What?

Choosing the Right Type of Forecasting Model: Introduction Statistics, Econometrics, and Forecasting Concept of Forecast Accuracy: Compared to What? Choosing the Right Type of Forecasting Model: Statistics, Econometrics, and Forecasting Concept of Forecast Accuracy: Compared to What? Structural Shifts in Parameters Model Misspecification Missing, Smoothed,

More information

Short-Term Forecasting with ARIMA Models

Short-Term Forecasting with ARIMA Models 9 Short-Term Forecasting with ARIMA Models All models are wrong, some are useful GEORGE E. P. BOX (1919 2013) In this chapter, we introduce a class of techniques, called ARIMA (for Auto-Regressive Integrated

More information

Proposing a New Software Cost Estimation Model Based on Artificial Neural Networks

Proposing a New Software Cost Estimation Model Based on Artificial Neural Networks Proposing a New Software Cost Estimation Model Based on Artificial Neural Networks Iman Attarzadeh, Siew Hock Ow Department of Software Engineering Faculty of Computer Science & Information Technology

More information

NEURAL NETWORK SIMULATION OF KARSTIC SPRING DISCHARGE

NEURAL NETWORK SIMULATION OF KARSTIC SPRING DISCHARGE NEURAL NETWORK SIMULATION OF KARSTIC SPRING DISCHARGE 1 I. Skitzi, 1 E. Paleologos and 2 K. Katsifarakis 1 Technical University of Crete 2 Aristotle University of Thessaloniki ABSTRACT A multi-layer perceptron

More information

Selection of a Forecasting Technique for Beverage Production: A Case Study

Selection of a Forecasting Technique for Beverage Production: A Case Study World Journal of Social Sciences Vol. 6. No. 3. September 2016. Pp. 148 159 Selection of a Forecasting Technique for Beverage Production: A Case Study Sonia Akhter**, Md. Asifur Rahman*, Md. Rayhan Parvez

More information

Predicting Restaurants Rating And Popularity Based On Yelp Dataset

Predicting Restaurants Rating And Popularity Based On Yelp Dataset CS 229 MACHINE LEARNING FINAL PROJECT 1 Predicting Restaurants Rating And Popularity Based On Yelp Dataset Yiwen Guo, ICME, Anran Lu, ICME, and Zeyu Wang, Department of Economics, Stanford University Abstract

More information

MATHEMATICAL MODELLING

MATHEMATICAL MODELLING 334 MATHEMATICS MATHEMATICAL MODELLING A2 A2.1 Introduction An adult human body contains approximately 1,50,000 km of arteries and veins that carry blood. The human heart pumps 5 to 6 litres of blood in

More information

CASE STUDY OF APPLYING DIFFERENT ENERGY USE MODELING METHODS TO AN EXISTING BUILDING

CASE STUDY OF APPLYING DIFFERENT ENERGY USE MODELING METHODS TO AN EXISTING BUILDING Proceedings of Building Simulation 211: CASE STUDY OF APPLYING DIFFERENT ENERGY USE MODELING METHODS TO AN EXISTING BUILDING Bin Yan, Ali M. Malkawi, and Yun Kyu Yi T.C. Chan Center for Building Simulation

More information

CONCRETE MIX DESIGN USING ARTIFICIAL NEURAL NETWORK

CONCRETE MIX DESIGN USING ARTIFICIAL NEURAL NETWORK CONCRETE MIX DESIGN USING ARTIFICIAL NEURAL NETWORK Asst. Prof. S. V. Shah 1, Ms. Deepika A. Pawar 2, Ms. Aishwarya S. Patil 3, Mr. Prathamesh S. Bhosale 4, Mr. Abhishek S. Subhedar 5, Mr. Gaurav D. Bhosale

More information

Forecasting erratic demand of medicines in a public hospital: A comparison of artificial neural networks and ARIMA models

Forecasting erratic demand of medicines in a public hospital: A comparison of artificial neural networks and ARIMA models Int'l Conf. Artificial Intelligence ICAI' 1 Forecasting erratic demand of medicines in a public hospital: A comparison of artificial neural networks and ARIMA models A. Molina Instituto Tecnológico de

More information

Comparative study on demand forecasting by using Autoregressive Integrated Moving Average (ARIMA) and Response Surface Methodology (RSM)

Comparative study on demand forecasting by using Autoregressive Integrated Moving Average (ARIMA) and Response Surface Methodology (RSM) Comparative study on demand forecasting by using Autoregressive Integrated Moving Average (ARIMA) and Response Surface Methodology (RSM) Nummon Chimkeaw, Yonghee Lee, Hyunjeong Lee and Sangmun Shin Department

More information

Weka Evaluation: Assessing the performance

Weka Evaluation: Assessing the performance Weka Evaluation: Assessing the performance Lab3 (in- class): 21 NOV 2016, 13:00-15:00, CHOMSKY ACKNOWLEDGEMENTS: INFORMATION, EXAMPLES AND TASKS IN THIS LAB COME FROM SEVERAL WEB SOURCES. Learning objectives

More information

Classification of DNA Sequences Using Convolutional Neural Network Approach

Classification of DNA Sequences Using Convolutional Neural Network Approach UTM Computing Proceedings Innovations in Computing Technology and Applications Volume 2 Year: 2017 ISBN: 978-967-0194-95-0 1 Classification of DNA Sequences Using Convolutional Neural Network Approach

More information

STATISTICAL TECHNIQUES. Data Analysis and Modelling

STATISTICAL TECHNIQUES. Data Analysis and Modelling STATISTICAL TECHNIQUES Data Analysis and Modelling DATA ANALYSIS & MODELLING Data collection and presentation Many of us probably some of the methods involved in collecting raw data. Once the data has

More information

Convex and Non-Convex Classification of S&P 500 Stocks

Convex and Non-Convex Classification of S&P 500 Stocks Georgia Institute of Technology 4133 Advanced Optimization Convex and Non-Convex Classification of S&P 500 Stocks Matt Faulkner Chris Fu James Moriarty Masud Parvez Mario Wijaya coached by Dr. Guanghui

More information

Forecasting fruit demand: Intelligent Procurement

Forecasting fruit demand: Intelligent Procurement 2012 Forecasting fruit demand: Intelligent Procurement FCAS Final Project Report Predict fruit sales for a 2 day horizon to efficiently manage procurement logistics. Dinesh Ganti(61310071) Rachna Lalwani(61310845),

More information