ADVANCED DATA ANALYTICS

MBB essay by Marcel Suszka
17 August 2018
PROJECTSONE, De Corridor 12L, 3621 ZB Breukelen

MBB Essay Advanced Data Analytics

Outline

This essay describes a statistical investigation for a logistics company. After a short introduction and the explanation of the case, the statistical investigation is presented in a logical order. First the data are explored graphically to show possible correlations in pictures. After that, hypothesis tests are performed to test the possible predictors for significance and relevance. Finally, different models are fitted to the data. The statistical investigation is set up so that the most important outcomes and conclusions appear at the beginning of the document. Further graphs and hypothesis tests of the less important predictors are summarized in an appendix at the end of the essay.

Introduction

Data analysis provides many different ways of investigating your data. Various mainstream packages are available for this purpose, such as Minitab, SPSS, Stata and R. Most of these packages do not have specific routines incorporating machine learning methods. For that, a programme like SPM (Salford's Predictive Modeller) can be useful. With the increasing number of statistical methods and algorithms that are now available, one may wonder what is best to use. This applies in particular when the conclusions from different methods do not fully agree with each other.

The course towards becoming a Master Black Belt at ProjectsOne includes a four-day training Advanced Data Analytics. In this training you learn how to use other statistical tools than those taught in the Six Sigma Green and Black Belt trainings. Especially the part about predictive modelling is very interesting. The essence of this approach is to evaluate and compare different methods by their predictive performance. This essay is about modelling a large dataset to illustrate the challenges that one comes across. The software used to investigate the data is Minitab and R. The literature used is the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

The case

The case concerns a logistics company specialized in agricultural products, selling and reselling everything that can be used in an agricultural environment. This company has affiliates throughout Western Europe. The dataset is derived from the Dutch branch. Every day thousands of goods are transported to the inbound department; after being unloaded, checked and booked they are ready to be stocked. The lead time of the process from unloading to scanning the goods at a sellable location must be as short as possible. An order placed before 18:30 must be delivered correctly the next day. If a product is not in stock, this promise cannot be met. For this reason, products that are actually in-house must be put at a picking location fast (within 24 hours at most), implying reduced stock waiting times. A lot of solutions are obvious, like using full capacity, first in first out during the whole process, tools that make work easy, etc. But still the problem persists of not being able to meet the 24-hour standard, although some products do and others don't. This is where statistics comes in, because

when improving the obvious doesn't lead to the necessary result, deeper investigation is needed. What do the data tell?

Three datasets are provided, from three different weeks: a week with high performance percentages, a mediocre week and a week with low percentages. Each dataset consists of single articles being on time or not on time in the warehouse. This makes the data attribute data, dichotomous to be more specific. Hence, in Six Sigma language, the internal CTQ (Y) is binary. The dataset used for this case is the set with the low performance percentages.

Exploring the data

Before constructing a model, learning from the existing data is mandatory. Without proper knowledge of the dataset, there is a risk of overfitting the data (fitting noise in the data). So we want to know which predictors (X's) contribute significantly to the outcome (Y). The first step of this analysis is to make pictures of the data, to determine whether a possible predictor can be visualized. Initially there were 18 predictors. The following things stood out:

- X's with a lot of categories
- X's with impossible results (probably errors)
- X's with unknown or unclear meaning
- collinear X's

So further investigation of the X's was needed. An appointment with the procurement manager (as procurement is the source of the dataset) was made and all questions were answered, which led to a trimmed and validated dataset with 8 possible predictors. This finally resulted in a sample size of almost data points.

Analysing the data

The next step of the investigation is to determine whether the different X's contribute to the outcome Y and whether their influence is big enough to be improved. As stated before, we want to make some graphs first, so we start with exploratory data analysis, accompanied by hypothesis tests to prove significance. The following predictors have been explored (a sketch of how such an exploration could look in R follows after the table).

Predictor | Explanation
Incoterm | Terms of delivery; who is paying for the freight costs
QTYPO | Quantity per Purchase Order (order unit)
AantalPO | Number of articles in a packing unit
Lead-time | Duration of shipment
Unit | Description of the delivery unit
Price per piece | Indication of the price in three categories
Row status | Indication of the location in the process of the goods in the warehouse
Currency | Currency on the invoice
Area | Location in the warehouse
Location | Main building or secondary building
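As an illustration of this first exploration step, the sketch below shows how it could be done in R. The file name and the column names (OnTime, Area, Unit, QTYPO) are assumptions made for the example, not the actual names in the company's dataset.

# Minimal exploratory sketch in R (assumed file and column names)
warehouse <- read.csv("inbound_week_low.csv", stringsAsFactors = TRUE)

# Overall performance of the binary CTQ (on time yes/no)
prop.table(table(warehouse$OnTime))

# Cross-tabulate a categorical predictor against the outcome
round(prop.table(table(warehouse$Area, warehouse$OnTime), margin = 1), 2)

# Quick visual checks per predictor
barplot(table(warehouse$OnTime, warehouse$Unit), legend.text = TRUE,
        main = "On time vs packing unit")
boxplot(QTYPO ~ OnTime, data = warehouse,
        main = "Quantity per purchase order by outcome")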

Predictor | Type of data | Expected influence on Y | Test | Significant | Relevant
Incoterm | Categorical | No influence | Chi-square | Yes | Yes
QTYPO | Continuous | The larger the QTYPO, the more likely the order unit is not on time in the warehouse | Binary logistic regression | Yes | Yes
AantalPO | Continuous | No influence | Binary logistic regression | Yes | Yes
Lead-time | Continuous | No influence | Binary logistic regression | No | No
Unit | Categorical | One unit might be more often on time in the warehouse | Chi-square | Yes | Yes
Price per piece | Categorical | No influence | Chi-square | No | No
Row status | Categorical | No influence | Chi-square | Yes | No
Currency | Categorical | No influence | Chi-square | No | No
Area | Categorical | No influence | Chi-square | Yes | Yes
Location | Categorical | No influence | Chi-square | No | No

* The level of significance has been set at α = 0,01, meaning that the null hypothesis was rejected for p-values below 0,01.

Figure 1, The significance and relevance of the possible influencers on the outcome of the process

Incoterm

Incoterms are agreements between the seller and the buyer on who bears the risks of the shipped goods and up to which physical point. Here we distinguish three different terms:

- EXW (Ex Works): the goods are collected from the premises of the seller.
- Franco: the seller bears the risks until delivery at the premises of the warehouse.
- Normale vrachtkosten (normal freight costs): as Franco, with extra costs.

Analysing the complete dataset results in a p-value close to 0, which indicates a significant contribution of this X. It is interesting to learn how this predictor influences Y, because no influence was expected.

QTYPO

Quantity per Purchase Order (QTYPO) sums up the number of order units in a purchase order. It could be that the size of the received order leads to longer lead times to process the order. The impact of QTYPO on the outcome (Y) was tested in two ways (binary logistic regression and Mann-Whitney). The binary logistic regression analysis resulted in a p-value of 0,000 and a relevance R² of 4,97%.
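The two test types in Figure 1 can be reproduced in R roughly as follows. This is a minimal sketch, assuming the warehouse data frame and column names introduced in the earlier example; the deviance-based R-squared is only an approximation of the relevance figure reported above.

# Chi-square test for a categorical predictor against the binary outcome
chisq.test(table(warehouse$Incoterm, warehouse$OnTime))

# Binary logistic regression for a continuous predictor
# (OnTime is assumed to be a two-level factor, e.g. Ja/Nee)
fit <- glm(OnTime ~ QTYPO, data = warehouse, family = binomial)
summary(fit)                          # p-value of the QTYPO coefficient

# Deviance-based (pseudo) R-squared, used here as the relevance measure
1 - fit$deviance / fit$null.deviance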

Unit

The unit of a purchase order is the way the goods are packed. The p-value of the Chi-square test was 0,000. The categories that performed better than expected were Cans, Drums and Meters. This knowledge led to the following hypotheses:

A: Cans, Drums and Meters perform better than other packing units
B: Cans, Drums and Meters are delivered in specific areas
C: These areas outperform other areas
D: The outperforming areas are situated on a different location (warehouse)

Ad A (Cans, Drums and Meters perform better than other packing units):
The first hypothesis holds: there is a significant relation between being on time in the warehouse and the packing unit, with a p-value of 0,000. Although we must consider that some of the expected counts are under 5, there is a large difference between expected and observed counts; a sketch of how to check those expected counts follows below.

Figure 7, shows the percentages of being on time or not on time in the warehouse per packing unit (above) and how frequently the packing units are used (below)
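A minimal sketch of that check, again assuming the warehouse data frame and column names from the earlier examples:

# Chi-square test of packing unit vs the outcome, with a check on the
# expected counts mentioned above
unit_tab  <- table(warehouse$Unit, warehouse$OnTime)
unit_test <- chisq.test(unit_tab)

unit_test                          # test statistic and p-value
sum(unit_test$expected < 5)        # number of cells with expected count below 5
round(prop.table(unit_tab, 1), 2)  # on-time share per packing unit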

Units

Ad B (Cans, Drums and Meters are delivered in specific areas):
The pie chart in Figure 9 of Appendix D shows that Cans (CAN), Drums (DRU) and Meters (MTR) are delivered in a relatively small number of areas. Due to the large number of combinations (degrees of freedom) yielded by these two categorical variables, it is not possible to create a valid hypothesis test. What we can say, however, is that Cans and Drums are only processed in Mag 20, and Meters in Mag 8, 9 and 19. The chi-square contributions are enormously high, meaning that they are indeed delivered in specific areas more often than expected.

Area

Ad C (These areas outperform other areas):
The bar chart (Figure 11) shows that the areas mag8, mag9 and mag20 are among the high performing areas and mag19 is above average. However, there are more high performing areas. The Chi-square test yields a significant p-value of 0,000, which means that some areas perform significantly better than others. (Mags 24, 32, 40, 41 and SPE were left out of the test because of low counts.) When we highlight the best (green) and the worst (red) performers (Figure 10 in Appendix D), it stands out that two areas (Mag 21 and Mag 22) are underperforming, and Mag 21 is the busiest area. Of course the performance of the other areas is still not above 95% on time, but mag 21 is so large that on its own it can tilt the total performance to a positive level. The follow-up research should be in mag 21.

Figure 11, shows the percentages of being on time or not on time in the warehouse per warehouse (above) and how frequently these warehouses are used (below)

Location

Ad D (The outperforming areas are situated on a different location (warehouse)):
As we can see in Figure 12 of Appendix D, the physical location at which the goods arrive does not seem to matter for the performance. What stands out is that the main warehouse (hoofdgebouw) processes almost 90% of all goods. This is because mag 21 is only processed in the main warehouse and not at the Lundia location.

Modelling with discrete CTQ's

The data have been explored and hypothesis tests have been conducted, so there is now more knowledge of how the individual predictors influence the outcome Y. Unknown is how these predictors interact. Are there predictors that have a linear relation with each other, so that they explain some of the same variance of Y (collinearity), or are there predictors that have a larger impact on Y when they are combined (interaction)? And to what extent can the outcome be explained (predicted)? Creating a model can provide answers to these questions. There are a lot of statistical methods that can be applied to the dataset (e.g. logistic regression, GLM, CART, Random Forests, etc.); we choose the method with the largest predictive power.

Classification And Regression Trees (CART)

Models based on trees split the data using nested if-then logic, for both discrete and continuous predictors. The splits are based on rules (an if-then condition) and the partitions are called nodes (rectangular shaped). In our case the top node contains the internal CTQ: a certain percentage of the goods are in the warehouse on time and the others are not. This gives a Yes group and a No group, and these groups (nodes) can subsequently be split into further nodes by a predictor, until a prediction tree is created.

Figure 14, Simple regression tree

The results

The top node shows the overall performance (12% On time and 88% Not on time). The first cut is made on several area codes: if the goods have to be stored in the stated areas, the probability of being on time is 77% (this applies to 3% of all cases); if in other areas, the probability drops to 10%.
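Before following the tree further down, the sketch below gives an impression of how such a classification tree can be grown in R with the rpart package; the formula, column names and settings are illustrative assumptions and not the exact model behind Figure 14.

library(rpart)
library(rpart.plot)

# Classification tree for the binary CTQ (assumed column names)
tree <- rpart(OnTime ~ Incoterm + QTYPO + AantalPO + Unit + Area + Location,
              data = warehouse, method = "class",
              control = rpart.control(cp = 0.01))

rpart.plot(tree)            # nodes show class, probability and share of cases
tree$variable.importance    # importance of the predictors, cf. Figure 15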

Further down the tree, within the group of remaining areas: if the goods are delivered in area mag10, mag11 or SPE, the probability of being on time in the warehouse is 90%; this covers only 2% of the cases. When not in these areas but in mag09, mag20 or mag28, the probability of being on time is 71% when AantalPO is less than 5; if it is equal to or over 5, the probability is 30%. The predictive power of this model is about 89%.

Figure 15, Variable importance CART

Binary Logistic Regression

In our case we are dealing with a discrete Y. The outcome of every row can either be On time or Not on time. From the perspective of being On time, the outcome is 100% On time or 0% On time (in fractions 1 or 0). However, 1 or 0 cannot be the outcome of a predictive model; there is always a chance. So what we actually want to do is predict a probability of the goods being On time, between 0 and 1. In other words, how big is the chance that our products arrive on time in our warehouse? To do this, a function of this form is created (for example):

log(p / (1 - p)) = β0 + β1·QTYPO + β2·AREA + ... + βp·AantalPO

The left-hand side of this equation is the log odds; the right-hand side is the predictive function constructed from the coefficients of the different predictors in the model. With this equation the result corresponds to a probability between 0 and 1. The only thing left to do is isolate p from the rest of the function:

p = 1 / (1 + e^-(β0 + β1·QTYPO + β2·AREA + ... + βp·AantalPO))
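Numerically, that last step is a one-liner in R; the coefficient values below are invented purely to illustrate the conversion from log odds to a probability.

# From log odds to probability (hypothetical coefficient values)
b0 <- -2.0; b_qtypo <- 0.01; b_aantal <- 0.05

log_odds <- b0 + b_qtypo * 120 + b_aantal * 4   # one hypothetical order line
1 / (1 + exp(-log_odds))                        # isolating p, as in the formula above
plogis(log_odds)                                # the same, using base R's plogis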

The binary logistic regression is performed in Minitab. Area has been replaced by Area2, which has only two distinct values: the better performing areas from the regression tree and the not so well performing areas.

Figure 15, Coefficients and Standard Errors Table

Figure 16, Regression equation

After applying backwards elimination it turns out that several predictors contribute significantly to the outcome Y. According to the odds ratios and coefficients, QTYPO and Area2 have the highest contribution to the outcome Y. Goods in the not-so-good areas have 21 times more chance of not being on time in the warehouse than goods in the good areas. QTYPO has the highest coefficient relative to the possible range of this specific predictor.
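The Minitab analysis can be approximated in R as sketched below. Note that step() eliminates terms by AIC rather than by p-value, so it is only an analogue of the backwards elimination used above; Area2 and the other column names are assumptions.

# Full model with the validated predictors, then backwards elimination
full <- glm(OnTime ~ Incoterm + QTYPO + AantalPO + Unit + Area2,
            data = warehouse, family = binomial)
reduced <- step(full, direction = "backward", trace = FALSE)

summary(reduced)
exp(coef(reduced))             # odds ratios per predictor
exp(confint.default(reduced))  # Wald confidence intervals for the odds ratios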

Figure 16, the effects per predictor

As said, there is more than one road that leads to Rome, so a Random Forest was also fitted to see whether the results are similar.

Random Forest

A random forest combines multiple simple trees into one model, which yields a more precise model that can predict more accurately. It is random because each tree uses only a part of all the variables and of the data in the dataset. The chance of overfitting the model with noise shrinks to a minimum. Random Forest is a machine learning algorithm. A random forest was fitted to our dataset; the results are in the section below.

The results

Figure 17, Importance of the predictors (Random Forest)
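A sketch of such a fit with the randomForest package in R; formula, column names and settings are illustrative assumptions, not the exact model behind Figure 17.

library(randomForest)

set.seed(2018)  # make the random sampling of variables and data reproducible
rf <- randomForest(OnTime ~ Incoterm + QTYPO + AantalPO + Unit + Area + Location,
                   data = warehouse, ntree = 500, importance = TRUE)

rf              # the out-of-bag error gives an estimate of the predictive power
importance(rf)  # variable importance, cf. Figure 17
varImpPlot(rf)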

The forest shows a different variable importance than the single tree. Area is still considered the factor with the largest predictive power, but QTYPO is chosen as the second best predictor instead of Incoterm. The predictive power is slightly higher than that of the other two models, about 90%.

Conclusions

The different methods show to some extent the same results. We can say that the Area in which the goods are stocked is the most explanatory variable in all methods. QTYPO and the number of articles in a PO (AantalPO) are also important. Incoterm, Unit and Location have limited effect on the outcome Y.

Further investigation of the effect of the various independent variables is needed. The performance of mag 21 is much worse than that of mag 10 and 11; it would be interesting to find out why. Cans are always on time, EA almost never; how is this possible? These investigations will not be part of this essay, but will be discussed with the customer.

The purpose of the essay was to show the reader what you will encounter when taking on a problem statistically, and to give insight into different data analysis methods.

Marcel Suszka

Appendix A, Incoterm

Figure 2, shows the percentages of being on time or not on time in the warehouse per incoterm (above) and how frequently these incoterms are used (below)

EXW, Franco and Normale vrachtkosten are by far the most frequent incoterms; the other incoterms are left out of scope.

[Pie charts of Incoterm (EXW, Franco, Normale vrachtkosten), panelled by On time (Ja / Nee)]
Figure 3, The pie chart shows the shares of the incoterms for goods that arrived on time in the warehouse (Ja) and not on time (Nee)

Franco seems to perform better than Normale vrachtkosten (normal freight costs), even though we pay extra for those shipments. EXW (where we collect the goods ourselves) stays approximately the same.

Appendix B, QTYPO

[Main effects plot for On time, fitted probabilities: probability of Nee versus QTYPO]
Figure 3, Main effects plot of the quantity per purchase order and not being on time in the warehouse

The graph shows that the probability of getting the goods On time in the warehouse depends on the size of the purchase order. Due to the overall capacity of the process for this week (12% On time in the warehouse), the probability scale starts at 80%.

[Boxplot of QTYPO versus the internal CTQ (In time / Not in time)]
Figure 4, Boxplot shows the different results of being on time or not, related to the quantity

The box plot shows a lot of outliers, but also a difference in outcome: the probability of being On time seems to decrease as the QTYPO increases. The Mann-Whitney test gives a p-value of 0,000, which indicates a significant difference (median of On time at 45, of Not On time at 135). The sketch below shows how this test could be run in R.
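A minimal sketch, assuming the warehouse data frame and column names used in the earlier examples; wilcox.test with two samples is R's implementation of the Mann-Whitney test.

# Mann-Whitney test of QTYPO between the on-time groups
wilcox.test(QTYPO ~ OnTime, data = warehouse)

# Group medians, to accompany the test result
tapply(warehouse$QTYPO, warehouse$OnTime, median)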

Appendix C, AantalPO

The number of articles in a packing unit could say something about the size of the packing unit and the location where the packing unit is stocked, and therefore about the underlying stocking process.

[Main effects plot for On time, fitted probabilities: probability of Nee versus AantalPO]
Figure 5, Main effects plot of the number of articles in a packing unit and being on time

The main effects plot shows an interesting relation between the number of articles in a packing unit and being in the warehouse on time. The relation is significant at a p-value of 0,000, but explains little, with an R-squared value of 0,5%.

[Boxplot of AantalPO versus On time (Ja / Nee)]
Figure 6, Boxplot shows the different results of being on time or not, related to the number of articles in a packing unit

Appendix D, Hypotheses

UNIT

[Minitab chi-square output; rows: On time (Ja / Nee), columns: UNIT (BOX, CAN, DRU, EA, KIT, MTR, PAC, PAI, ROL, SET); cell contents: count, expected count, contribution to chi-square. Pearson chi-square approximately 282 with p-value 0,000; likelihood ratio approximately 196 with p-value 0,000; 3 cells with expected counts less than 5.]
Figure 8, Chi-square hypothesis test of units vs on time

Area

[Pie charts of UNIT (BOX, CAN, DRU, EA, KIT, MTR, PAC, PAI, ROL, SET, SPE), panelled by AREA]
Figure 9, The pie chart of which packing units are received in which warehouse

Location

[Bar chart of the count per Location (Hoofdgebouw, Lundia), split by On time (Ja / Nee)]
Figure 12, Bar chart of the location where the goods arrive and the performance

Chi-Square Test for Association: On time; AREA

[Minitab chi-square output; rows: On time (Ja / Nee), columns: AREA; cell contents: count, expected count, contribution to chi-square. Pearson chi-square approximately 2319 with p-value 0,000; likelihood ratio approximately 1464 with p-value 0,000; 2 cells with expected counts less than 5.]
Figure 10, Chi-square hypothesis test

Appendix E, Insignificant factors

Lead-time

Longer lead-times indicate a longer distance between supplier and receiver. This could give some information about where the suppliers are situated; maybe there is a difference in the way the goods are packed that increases the lead time. The logistic regression yielded a p-value of 0,013, which is above our significance level of 0,01, and an irrelevant R² of 0,05%; lead-time is therefore not a significant factor.

Price per unit

The purchase price of the articles makes no difference to whether the goods are in the warehouse On time or not.

Row status

Row status seems to matter. After further investigation it turns out that goods with status 931 are further advanced in the process, which means that their probability of being On time in the warehouse is larger.

Figure 13, shows the percentages of being on time or not on time in the warehouse per row status (above) and how frequently these row statuses are used (below)

Currency

The currency the goods are paid in says something about where the goods come from. Goods from inside Europe are paid in Euros, while goods from outside Europe are paid in Dollars. No significant difference is found between the currencies, so currency is left out of the model.