Untangling Correlated Predictors with Principal Components


David R. Roberts, Marriott International, Potomac MD

Introduction:

Often when building a mathematical model, one encounters predictor variables that are highly correlated. These variables can be thought of as confounding (from Webster's: "confusing, bewildering, failing to distinguish"). Because they tend to move together, they confuse the analysis. In a regression model, the existence of correlated predictors is called multicollinearity. Multicollinearity can invalidate a regression model by skewing the individual coefficients as well as the associated tests of significance. While multicollinearity does not invalidate the full-model parameters (goodness-of-fit, F-test, etc.), the individual parameters can be rendered useless.

When one encounters this, the most common response is to remove individual predictors from the model until there is no significant correlation among the remaining predictors. Often, this works. However, there are cases where removing variables is inappropriate. In this paper, I will describe one such example and introduce an alternative technique. Instead of removing variables that are correlated, we can transform them into useful predictors that are uncorrelated. Borrowing some techniques from a field of statistics called Principal Component Analysis, we can retain most or all of the information in the original variables while circumventing multicollinearity.

An Example:

This topic is best discussed in the context of an example. Let's consider a hypothetical 100-room hotel called the Paradise Inn. We want to analyze the relationship between demand and revenue. We have one year of daily data for this hotel, including the amount paid by each guest as well as an estimate of what each potential guest who was turned away would have paid. The combination of these actual and potential guests can be thought of as representing a demand function for the Paradise Inn.
The following is a daily plot of the total demand and the total revenues.

[Figure: Revenue vs. "Demand" - daily scatter plot]

The chart above shows that these two variables are certainly related, but the relationship does not appear to be linear. There are a variety of reasons for this, and a variety of fixes. The fixes are a prerequisite for the topic of this paper, in the sense that a linear model is only appropriate for modeling linear relationships. For simplicity, I've used a local regression to establish a linear relationship. Basically, this involves taking a small range of predictors where slight curvature can be approximated by a straight line. (Please refer to Appendix I for an explanation of local regression as well as other options for handling non-linear data.)

Imagine taking a small portion of the plot above and generating a linear regression model. For this example, I'll use the 30% of observations with the lowest demand. The result of this is shown below:

R-Sq = 77%
F-Stat = 1,163
Intercept (Coeff, T-stat) = ( , -0.35)
Demand (Coeff, T-stat) = (78.56, 34.10)

According to this regression model, each incremental unit of demand is worth $78.56 (at this property, for this time frame, for the specified local range of predictors). Note: each unit of incremental demand is worth somewhat less than the asking price; for a variety of reasons, the hotel cannot (or should not) capture all of the demand. Also note that the overall slope decreases as demand increases. This is because each unit of incremental demand is worth less as demand increases. So, now I can model the relationship between demand and revenue. (I'd actually string together a series of local regressions, or use some other non-linear technique as described in Appendix I, to model the full range of predictors. For this paper, I use a simple local regression, since the focus of the paper is how to deal with multicollinearity rather than non-linear transformations.)
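To make the local regression idea concrete, here is a minimal sketch in Python with numpy. The data are synthetic (the paper's hotel data are not published), and the concave revenue curve is an assumption for illustration: select the 30% of days with the lowest demand and fit an ordinary least-squares line to just that slice.

```python
import numpy as np

# Sketch of the "local regression" idea: fit an OLS line only to the 30%
# of observations with the lowest demand, where slight curvature can be
# approximated by a straight line. Synthetic data, illustrative only.
rng = np.random.default_rng(0)

demand = rng.uniform(50, 400, size=365)                            # daily demand
revenue = 9000 * np.log1p(demand / 40) + rng.normal(0, 500, 365)   # concave, noisy

# Keep the lowest-demand 30% of days and fit a line to that local range.
cutoff = np.quantile(demand, 0.30)
local = demand <= cutoff
slope, intercept = np.polyfit(demand[local], revenue[local], deg=1)

print(f"local slope: ${slope:.2f} per unit of demand")
```

Over the local range the line approximates the curve well; stringing together several such local fits (or a non-linear transform, per Appendix I) covers the full range of predictors.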

But demand in the abstract has limited value. I don't just want to know how many people want to stay here; I want to know how many want to stay and are willing to pay $40, $60, ... $200 to do so. For simplicity, I've broken down the demand into "premium" demand (those at a high price point) and "discount" demand (those at a low price point). This type of analysis could be done for any slices of demand (transient vs. group, cuts by booking source, etc., which in many cases are themselves highly correlated). In most real examples, two buckets would not be sufficient, but in this case two is enough to illustrate my point.

I want to know what a unit of premium demand is worth and what a unit of discount demand is worth. I could try to measure this with a regression model using premium demand and discount demand as inputs. The results of this are below (using the same range of predictors as before):

R-Sq = 89%
F-Stat = 1,420
Intercept (Coeff, T-stat) = ( , -1.45)
Prem_Dmd (Coeff, T-stat) = (171.74, 34.39)
Disc_Dmd (Coeff, T-stat) = (0.46, 0.11)

According to this model, a unit of premium demand is worth nearly $172, while a unit of discount demand is worth only $0.46. Furthermore, discount demand is not even statistically significant as a predictor! It turns out that the levels of premium demand and discount demand are roughly equal in size in this sample dataset. One could interpret this as meaning that premium demand is worth $172, discount demand is worth $0, and on average, demand is worth about $86 (assuming equal weighting of premium and discount). This interpretation would be supported by the first regression model, which yielded a coefficient of approximately $79 for demand. Well, if this were the end of the analysis, I would not be writing this paper.

One final step with any regression model with more than one predictor is to verify that the predictors are not highly correlated.
For this model, however, the correlation between the predictors turns out to be 76%! What does this mean? It means that premium demand and discount demand tend to move together, and it means that the coefficients (and significance tests) for both are invalid. So premium demand is NOT worth $172, discount demand is NOT worth $0.46, and either one may or may not be statistically significant anyway. The full-model parameters are still valid, and these indicate that the model is statistically significant (note the very large F-stat), but we cannot distinguish between premium and discount demand. If we consider all demand together, we're back to the original model with one predictor.

What happens if we throw out one of the predictors (recall that this is the most common solution to multicollinearity)? Using only premium demand (and then only discount demand) as a predictor gives the two results shown below:

R-Sq = 89%
F-Stat = 2,849
Intercept (Coeff, T-stat) = ( , -1.48)
Prem_Dmd (Coeff, T-stat) = (172.16, 53.38)

R-Sq = 52%
F-Stat = 377
Intercept (Coeff, T-stat) = ( , 4.82)
Disc_Dmd (Coeff, T-stat) = (112.58, 19.42)

Independently, both premium demand and discount demand are significant predictors. Premium demand appears to be worth $172, while discount demand appears to be worth $113. What's going on? It looks like premium demand is worth about $172 whether or not we also consider discount demand. Is this really valid? Is discount demand worth $0.46, $113, or some other value?

What's going on here is that each predictor (premium demand and discount demand) is partially included in the other. Tossing one of them out, then, does NOT fully remove its effect! As an analogy, think about modeling revenues vs. demand from left-handers. Each unit of demand from left-handers would be worth several thousand dollars.
This is not because left-handers actually pay that much (I'm in that demographic group; we're pretty cheap); rather, it is because each unit of left-hander demand is typically matched by nine units of right-hander demand (based on the approximate distribution in the population). If a model only includes left-hander demand, it will be misleading. A model that includes only premium demand (or only discount demand) would be similarly misleading; the value of each would be overstated.

So: if I simply use demand as a predictor, the model is valid, but I cannot distinguish between premium and discount. If I include premium and discount predictors, the correlation between these two predictors renders both of them useless. If I use only premium (or only discount), the value will be overstated for the reasons described in the previous paragraph. What can we do?
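The overstatement is easy to reproduce. In the sketch below (Python/numpy, synthetic data), the "true" worths are assumed to be $150 per unit of premium demand and $50 per unit of discount demand; because the two series share a common driver, a premium-only regression inflates the premium coefficient well past $150:

```python
import numpy as np

# Why dropping one of two correlated predictors overstates the survivor's
# value. Synthetic data; assumed true worths: premium $150, discount $50.
rng = np.random.default_rng(1)

common = rng.normal(100, 20, size=365)            # shared driver -> correlation
premium = common + rng.normal(0, 8, 365)
discount = common + rng.normal(0, 8, 365)
revenue = 150 * premium + 50 * discount + rng.normal(0, 300, 365)

print(f"corr(premium, discount) = {np.corrcoef(premium, discount)[0, 1]:.2f}")

# Single-predictor model: premium alone. Its slope absorbs much of
# discount's effect, so it comes out well above the true $150.
b1, b0 = np.polyfit(premium, revenue, 1)
print(f"premium-only slope: ${b1:.0f}")
```

The same mechanism is at work in the left-hander analogy: the lone predictor gets credit for the correlated demand it drags along.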

Principal Components:

For now, let's put our dilemma on hold while we discuss the concept of Principal Components. Principal Component Analysis is a Variable Reduction technique (see Appendix II for an explanation of Variable Reduction). It is similar to, and often confused with, a better-known branch of statistics called Factor Analysis. Instead of using the original predictors, we can generate Principal Components. For every model with one dependent variable and one or more independent variables, the number of Principal Components is equal to the number of independent variables. (In our simplified example, there are two independent variables - premium demand and discount demand - therefore, there are two Principal Components.)

Each Principal Component is a linear combination of the original variables. The first Principal Component is the unique linear combination of original predictors that maximizes the explained variance in the observed variables. For example, the first Principal Component could be written as W1X1 + W2X2 + ... + WNXN, where there are N variables and the W's represent the weights. The weights are selected such that no other set of weights can account for more of the variance in the original variables. The second Principal Component is the linear combination of original predictors THAT IS ORTHOGONAL TO THE FIRST COMPONENT that maximizes the REMAINING explained variance. This procedure is repeated until the number of Principal Components is equal to the number of original variables.

Once the Principal Components are generated, they collectively contain ALL of the information from the original variables. For example, if you had a model with 10 variables, and you ran a regression against: a) the 10 variables, and b) the 10 calculated Principal Components, you would find that the full-model parameters (R-square, F, SSR, etc.) are IDENTICAL between the two models. You can test a few examples using your own data (or you can just take my word for it).
Furthermore, if you ran a regression with each Principal Component by itself, you'd see that the parameter estimates (coefficient, t-ratio) are EXACTLY the same as they are for that Principal Component in the full model! Why? Recall that each of these Principal Components is orthogonal to all others (i.e., they are all mutually uncorrelated). The only way that a new variable changes the parameters of variables that are already in the model is if there is SOME correlation. In practice, there is ALWAYS correlation among some of the variables (even spurious correlation counts here) unless the variables are Principal Components, which are calculated explicitly to avoid correlation. So, using Principal Component Analysis, we can take k predictor variables that may or may not be correlated, and generate k Principal Components that are completely uncorrelated but contain ALL of the original information.

The reader should be asking at this point: why should I not ALWAYS do this? If I can eliminate correlation (and multicollinearity) without losing any information, where's the downside? There must be some downside, right? The unfortunate answer is: yes, there is a downside. The downside is that these Principal Components are not open to interpretation. They are purely mathematical constructs. For example, Principal Component One might be something like 0.6X1 - 1.8X2 + ... (where the X's are the original variables). If the model shows that this Principal Component is hugely significant, with a coefficient of 2.41, how can we interpret this? We can't!

If we can't interpret the results, why would anyone waste their time doing this? Principal Component Analysis was never intended as a solution to multicollinearity. It was developed as a Variable Reduction technique (as an alternative to Variable Clustering, for example - see Appendix II). The purpose of my paper is to show how the concept of Principal Component Analysis can be applied to a problem with multicollinearity.
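The paper's analysis was done in SAS, but the two claims above (components are mutually uncorrelated; a regression on all k components fits exactly as well as one on the k originals) can be checked in a few lines of Python/numpy on synthetic data. The components are the centered data projected onto the eigenvectors of the covariance matrix:

```python
import numpy as np

# Generate principal components by hand and verify the two claims:
# they are uncorrelated, and they preserve the full-model fit.
rng = np.random.default_rng(2)

common = rng.normal(100, 20, size=365)
X = np.column_stack([common + rng.normal(0, 8, 365),
                     common + rng.normal(0, 8, 365)])    # two correlated predictors
y = 150 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 300, 365)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pcs = Xc @ eigvecs                        # the two principal components

# Claim 1: the components are (numerically) uncorrelated.
print(np.corrcoef(pcs[:, 0], pcs[:, 1])[0, 1])

def r_squared(preds, y):
    A = np.column_stack([preds, np.ones(len(y))])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

# Claim 2: identical full-model fit from originals and from components.
print(r_squared(X, y), r_squared(pcs, y))
```

The two R-square values agree to machine precision because the components span exactly the same space as the original predictors.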
The concept is to transform correlated predictors into uncorrelated predictors.

A Principal Component-based Solution:

Now back to our sample problem: how to use premium demand and discount demand in the same model. The solution here is to keep both variables in the model, but transform them in such a way that: 1) they are relatively uncorrelated, and 2) they can be interpreted. Note the slight twist: Principal Components are completely uncorrelated and not interpretable. I want components that are interpretable, and I'm willing to tolerate a small amount of correlation. This is where some art comes into play. In many cases this can be quite complicated (hopefully not so in this paper's simplified example).

Consider the first plot on the next page. The x-axis represents premium demand, while the y-axis represents discount demand. The high correlation of these two predictors is seen quite clearly in the plot; an upward-sloping line would fit this data pretty well (recall that the calculated correlation was 76%). Using a transformation, I'm going to keep the exact same data points, but rotate the axes. Consider the second of the two plots on this page:

[Figure: "Example of 'Principal Components'" - premium demand (x-axis) vs. discount demand (y-axis), original axes]

[Figure: "Example of 'Principal Components'" - the same data points, with the rotated axes drawn as arrows]

Note that the data points are unchanged. Now I'm going to make the new x-axis the black arrow pointing to the upper right. I'll make the new y-axis the shaded arrow pointing to the upper left. This is a visual representation of what Principal Components accomplish. What have I done? The new x-axis will capture the magnitude of their movement together, while the new y-axis will capture the extent to which they deviate.

The next step is to define new variables that roughly correspond to these new axis definitions, but that can be easily interpreted. For our sample problem, to capture the magnitude of movement together, I'll use overall demand. Premium demand plus discount demand equals total demand (in my simple model), so total demand becomes the co-movement variable. Now I need a deviation variable. Typically differences or ratios work here. I prefer ratios for a deviation variable, since they tend to be less correlated with the co-movement variable. In this example, I used the ratio of premium demand to total demand ("premium mix"). For this sample data, this ratio was virtually uncorrelated with total demand (correlation was approximately -9%). So now we have two new predictors: total demand and premium mix.

The following regression output shows our parameter estimates:

R-Sq = 85%
F-Stat = 1,004
Intercept (Coeff, T-stat) = (-13,440, )
Demand (Coeff, T-stat) = (80.86, 43.64)
Pr_Mix (Coeff, T-stat) = (27,331.5, 13.98)

Remember that our original goal was to get valid values for premium demand and discount demand (for the chosen range of predictors). According to this model, one unit of demand is worth about $81, while a unit of premium mix is worth $27,332. Since premium mix is measured as a ratio of premium demand to total demand, this ratio must be between zero and one.
According to this model, increasing this mix by one percentage point (1/100th of total), at a fixed level of demand, is worth about $273. To estimate the value of premium demand and discount demand, we can pick a starting point and then add incremental units of premium (or discount) demand. For example, if premium demand and discount demand are both 200, total demand is 400 and pr_mix is 50%. The model would then predict revenues of $32,670. Adding one unit of premium demand increases revenues by $115, while adding one unit of discount demand increases revenues by $47 (plugging the demand values into the model). We can then conclude that premium demand is worth $115, and discount demand is worth $47 (for this set of inputs, for this time frame, for this hotel, etc.). Recall that our original model suggested that a unit of demand was worth approximately $79. Given the relatively equal weighting in the dataset, these estimates for premium and discount seem reasonable, since they average close to that $79 value.

Note that, depending on the increment selected, these may be estimates over an interval rather than the actual slope of a line (arc estimates as opposed to point estimates). As such, they may be subject to error as the arc gets large. These values may be highly sensitive to the initial conditions entered. It is crucial that these initial conditions represent reasonable values for the predictors chosen (like most models, this one is only valid over a relevant range of predictors). For example, since the initial conditions I selected were 200 units of premium demand and 200 units of discount demand, it is critical that the initial local regression is centered on a demand level near 400. If not, the model loses its basis for prediction. Note the very large coefficients for the intercept and Pr_Mix. As these get larger, the model is MORE sensitive to initial conditions, and it becomes MORE important to focus on the relevant range of predictors.
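The arc-estimate arithmetic follows directly from the reported coefficients and can be checked in a few lines of Python (the coefficients are from the regression output above; the one-unit increments are the ones used in the text):

```python
# Arc estimates implied by the reported coefficients
# (intercept -13,440; demand 80.86; premium mix 27,331.5),
# starting from 200 units each of premium and discount demand.
def predicted_revenue(prem, disc):
    total = prem + disc
    return -13440 + 80.86 * total + 27331.5 * (prem / total)

base = predicted_revenue(200, 200)
value_premium = predicted_revenue(201, 200) - base    # ~ $115
value_discount = predicted_revenue(200, 201) - base   # ~ $47
print(f"premium: ${value_premium:.0f}, discount: ${value_discount:.0f}")
```

Each extra premium unit adds both a demand effect (about $81) and a mix effect (the ratio rises), while an extra discount unit adds the demand effect but pushes the mix down, which is why the two values straddle the $81 per-unit demand coefficient.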
One alternative is to force the intercept to zero, if that is appropriate for your specific application (it would be appropriate for this example).
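The co-movement/deviation transformation itself is easy to experiment with. The sketch below (Python/numpy, synthetic demand data, so the correlations will differ from the paper's 76% and -9%) builds total demand and premium mix and confirms that the transformed pair is far less correlated than the original pair:

```python
import numpy as np

# Co-movement / deviation transformation: total demand captures joint
# movement, the premium-mix ratio captures deviation. Synthetic data.
rng = np.random.default_rng(3)

common = rng.normal(200, 40, size=365)                        # shared driver
premium = np.maximum(common + rng.normal(0, 15, 365), 1)
discount = np.maximum(common + rng.normal(0, 15, 365), 1)

total = premium + discount        # co-movement variable
pr_mix = premium / total          # deviation variable (a ratio, between 0 and 1)

corr_raw = np.corrcoef(premium, discount)[0, 1]
corr_new = np.corrcoef(total, pr_mix)[0, 1]
print(f"raw correlation: {corr_raw:.2f}, transformed: {corr_new:.2f}")
```

A difference (premium minus discount) would also work as the deviation variable, but, as noted above, ratios tend to be less correlated with the co-movement variable.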

It is important to point out that these transformed predictors do NOT need to be linear. In fact, in our example, the co-movement variable is actually non-linear (it appears to be linear only because of our initial local regression).

Conclusions:

Before we looked at a Principal Component-based solution, we could not determine the value of premium demand and discount demand; now we can. Correlation among predictor variables does NOT have to mean a reduced quality of analysis. Concepts from Principal Component Analysis may be able to help. Complex problems will call for more elaborate and clever variable definitions, particularly when using more than two buckets. Some problems may call for a trial-and-error approach. Though not every problem can be addressed this way, this concept can be a valuable addition to your toolbox of analytical approaches.

Contact Information:

The author may be contacted at david.roberts@marriott.com.

APPENDIX I - Handling Non-Linear Data

There are many different ways to handle non-linear data. A local regression, like the one used in this paper, is one approach. Another approach is to transform the data to produce a linear relationship among variables. Many types of functions are useful here: polynomials, exponential, normal, trigonometric, etc. I prefer to use trigonometric functions because they are so easy to shape by varying the period, amplitude, and horizontal and vertical offsets. There does not need to be anything trigonometric about your data to use this; it is just a curve-fitting exercise. The best fit may be determined by minimizing the sum of the vertical deviations from the curve (or the square of the deviations, like a regression line). I don't know of any software that does this explicitly, but you can set up some repeating macros to home in on the best fit.

APPENDIX II - Variable Reduction

Variable Reduction becomes important when analyzing survey data with variable overlap (i.e., if you have more variables than you really need).
Sometimes variables are clustered into new variables, each of which may represent several original variables. Sometimes variables that are deemed redundant are removed. Sometimes Principal Components are generated with the idea that only the first few will contain useful information.
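The "repeating macros" approach to curve fitting described in Appendix I can be sketched as a brute-force grid search (Python/numpy; the sine form, parameter grids, and synthetic data are illustrative assumptions, and the horizontal offset is omitted for brevity):

```python
import numpy as np

# Appendix I idea as a sketch: fit a sine curve to non-linear data by
# searching over amplitude, period, and vertical offset, minimizing the
# sum of squared vertical deviations. Synthetic data, illustrative grids.
rng = np.random.default_rng(4)

x = np.linspace(0, 400, 200)
y = 9000 * np.log1p(x / 40) + rng.normal(0, 300, 200)   # concave "revenue" curve

best = (np.inf, None)
for amp in np.linspace(5000, 30000, 26):
    for period in np.linspace(400, 2000, 33):
        for v_off in np.linspace(0, 10000, 21):
            fit = amp * np.sin(2 * np.pi * x / period) + v_off
            sse = np.sum((y - fit) ** 2)
            if sse < best[0]:
                best = (sse, (amp, period, v_off))

baseline = np.sum((y - y.mean()) ** 2)                  # SSE of a flat mean line
print(f"SSE reduced from {baseline:.3g} to {best[0]:.3g}")
```

In practice one would refine the grid around the best combination found ("homing in"), exactly as the appendix suggests doing with repeating macros.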