More Multiple Regression

Model Building

The main difference between multiple and simple regression is that, now that we have so many predictors to deal with, the concept of "model building" must be considered in a fairly formal way. There are two major things to consider:

1) Transformations
   a) of the predictors
   b) of the response
2) Which variables to include in the model.

Transformations are used to help fit the data to the model. In particular, they can help
- "linearize" a trend, that is, transform a trend so that it is more linear
- correct the problem of non-constant variance
- normalize, so that the residuals are normally distributed.

Obviously (or maybe not so obviously), you must make sure the assumptions behind the model are satisfied before you go about deciding which variables to keep and which to throw out. Therefore, you must do (1) before you do (2).

Transformations

I have no good advice to give as to whether you should explore transforming the predictors or the response, except to say that often you will want to do both. But consider this: changes to the response affect the entire set of relationships, while changes to a single predictor may be less all-encompassing. So one strategy is to transform the response to get the effect you need (e.g. stable variance) and then investigate transformations of the predictors. At least one book, "Applied Regression Including Computing and Graphics" (Cook and Weisberg), gives a fairly comprehensive treatment of this subject and provides several strategies. But that is more detail than we need to give here. Instead, some general advice.

The problem of non-constant variance can often be fixed by using one of these transformations:

square-root      if the data are counts (e.g. Poisson distributed)
log              if the variance at x is proportional to the square of the mean response at x
inverse (1/y)    if the variance is proportional to the fourth power of the mean response, or if the response values are mostly close to 0 with occasional large values
sin^-1(sqrt(y))  the arcsine square-root transform is useful for responses that are restricted to lie between 0 and 1, or to a limited range that can first be rescaled to lie between 0 and 1

A useful technique for making the residuals more normally distributed -- and a technique that often linearizes as well -- is to use the Box-Cox family of transformations. This family consists of the set of transformations indexed by a parameter lambda and given by:

    y^(lambda) = (y^lambda - 1)/lambda   if lambda is not 0
    y^(lambda) = log(y)                  if lambda = 0

The trick is to fiddle with lambda until the data look normal. One can write a program that facilitates this, or even develop an "automatic" method. One drawback of the automatic method actually suggests a useful strategy for a non-automatic one: the "best" value of lambda might dictate transforming y to some strange power, say sqrt(pi). It is better to use a power that has some intuitive value, say an integer or at least a simple fraction (e.g. 1/3 or 1/2). This suggests that one can just try transformations in a certain order -- log, square root, the square, the cube, etc. -- and see what works (an R sketch of this kind of search appears below).

Variable Selection

One of the interesting -- and frustrating -- things about multiple regression is that the statistical significance of predictors often depends on what other predictors are present in the model. Thus, if you include only, say, temperature in your model to predict ozone, you will find it significant. But if you also include humidity, perhaps temperature is no longer significant. And if you add another variable, maybe humidity will no longer be significant. Partly this is due to correlations among the predictors. But it is also due to a "signal-to-noise" effect: the strongest signals (those with strong correlations with the response) can swamp out weaker signals.

Suppose we have fit a model with only temp and visibility,

    Mean(ozone) = b0 + b1*temp + b2*visibility

Both terms are statistically significant, and R-squared is about 68%. Now if we add any new term, even if it's meaningless (say, the Dow Jones index), the R-squared will go up by some small amount. Remember that R-squared measures the proportion of variation explained, which means here that about 32% of the variation in ozone is still unaccounted for. To be statistically significant, therefore, a new predictor must decrease the unexplained variation by more than an arbitrary, meaningless predictor would. Its correlation with the residuals, therefore, must be strong enough to significantly *improve* the fit.
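Here is the R sketch promised in the discussion of transformations above: try a couple of intuitive powers by hand, then profile lambda with boxcox() from the MASS package. The data frame name ozone and the columns ozone, temp, and visibility are taken from the output further below; any model with a strictly positive response would work the same way.

library(MASS)                      # provides boxcox()

# The untransformed fit and two "intuitive" transformations of the response.
fit      <- lm(ozone ~ temp + visibility, data = ozone)
fit_log  <- lm(log(ozone)  ~ temp + visibility, data = ozone)
fit_sqrt <- lm(sqrt(ozone) ~ temp + visibility, data = ozone)

# Compare residual plots: look for roughly constant spread and no bend.
par(mfrow = c(1, 3))
plot(fitted(fit),      resid(fit),      main = "no transform")
plot(fitted(fit_log),  resid(fit_log),  main = "log(y)")
plot(fitted(fit_sqrt), resid(fit_sqrt), main = "sqrt(y)")

# Profile the Box-Cox log-likelihood over a grid of lambda values.
boxcox(fit, lambda = seq(-2, 2, by = 0.1))

The boxcox() plot marks a confidence interval for lambda; rather than using the exact maximizer, pick an intuitive power inside that interval (0 corresponds to the log, 1/2 to the square root, and so on).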

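Before reading the output below, it may help to see the "small" two-predictor model, and the useless-predictor point, in R. This is only a sketch: the object name small, the data frame ozone, and the columns temp and visibility come from the output that follows, while noise is a made-up column standing in for something like the Dow Jones index.

# The two-predictor model discussed above; the notes quote an
# R-squared of about 68% for this fit.
small <- lm(ozone ~ temp + visibility, data = ozone)
summary(small)

# Adding even a meaningless predictor cannot lower R-squared; it creeps
# up slightly, but the new term is not significant.
ozone$noise <- rnorm(nrow(ozone))          # pure noise (the "Dow Jones index")
bigger_noise <- update(small, . ~ . + noise)
summary(bigger_noise)$r.squared            # a little larger than for "small"
summary(bigger_noise)$coefficients         # the noise row has a large p-value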
If we now add humidity, for example, the signal-to-noise ratio is not strong enough for humidity to sufficiently decrease the unexplained variation:

> bigger <- update(small, . ~ . + humidty)
> summary.lm(bigger)

Call:
lm(formula = ozone ~ temp + visibility + humidty, data = ozone)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-11 ***
temp                                     < 2e-16 ***
visibility                                        ***
humidty
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error:  on 137 degrees of freedom
F-statistic:  on 3 and 137 DF,  p-value: 0

(Note the use of the update command, which lets you modify an existing model more quickly. The syntax is update(old lm object, new formula), and in the new formula a "." means "the same as last time." Thus the call above says: keep the same term on the left-hand side, keep the same terms on the right-hand side, and add a new term, humidty.)

However, if we add new terms, humidty again becomes useful, as you can see from the full model:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
temp                                        e-05 ***
inversionht
pressure
visibility                                          *
height
humidty                                             *
temp
windspeed

So what to do? There are two main strategies, with many variations: "forward selection" and "backward selection". In most situations, backward selection makes more sense.

In forward selection, you add one variable at a time to the model. Actually, at each step you try adding each of the remaining variables, one at a time, and choose to "keep" the one that improves R-squared the most. You continue in this fashion until none of them improves R-squared sufficiently (which means that you must set some pre-defined tolerance for what "sufficiently" means to you). You don't have to use R-squared; other measures of goodness of fit are available, too, and each has particular advantages and disadvantages over the others.

In backward selection, you start with the "full" model containing all of the variables. You then try removing each variable, one at a time, and discard the one whose removal lowers R-squared the least. You continue until removing any of the remaining terms would noticeably lower R-squared.

In practice, I prefer a modification of backward selection: remove the term with the largest p-value. In practice, this almost always works out to be the same as the algorithm in the last paragraph. (A short R sketch of this approach appears at the end of these notes.)

The reason I find backward selection in general to be an attractive approach is that usually you collect data on variables because you think they're important. And if you think they're important, shouldn't they be given a fair chance at being included in the model? Forward selection is better suited to situations in which you are exploring the possible relevance of certain variables, perhaps to decide whether it's worth your time and money to collect more data on them.

There is no theoretical reason why both methods should result in the same model, but many times they do. Not always, but often. Although they sound labor intensive, there are computational shortcuts (such as the one I described in the last paragraph). Even more labor-intensive variations are available, thanks to the wonders of the computer. For example, there's "best subset selection": here, rather than choosing one variable at a time, we consider all pairs of variables, then all triplets, and so on.

All of these strategies can be lumped into the general category of "automatic selection" routines, which means that a computer can be programmed to do the selection for you.

This assumes, of course, that all of the assumptions of the model are satisfied. But you should be wary of automation in general, and in particular you shouldn't discard a term that your theory says is meaningful just because the statistics say you should. In fact, it is not unusual to discard none of the variables, particularly in an "exploratory" setting, and let the readers interpret as they wish. Automatic selection procedures have come into their own in the field of "data mining", in which the goal is to find the smallest subset of predictors that does the best possible job of predicting, regardless of (or despite) any scientific interpretation.
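To close, here is a rough R sketch of the backward-selection variant preferred above: repeatedly drop the term with the largest p-value until every remaining term is significant. The predictor names are read off the full-model output above and may not match the original data set exactly, and the 0.05 cutoff is just one possible choice of tolerance. The last two lines show the fully automatic alternative, base R's step(), which selects by AIC rather than by p-values.

# The full model; the variable list is taken from the output above and
# is assumed, not guaranteed, to match the original data set.
full <- lm(ozone ~ temp + inversionht + pressure + visibility +
             height + humidty + windspeed, data = ozone)

current <- full
repeat {
    tests <- drop1(current, test = "F")            # F-test for dropping each term
    pvals <- tests[-1, "Pr(>F)"]                   # skip the "<none>" row
    if (length(pvals) == 0 || all(pvals < 0.05))   # 0.05 is an arbitrary tolerance
        break
    worst   <- rownames(tests)[-1][which.max(pvals)]
    current <- update(current, as.formula(paste(". ~ . -", worst)))
}
summary(current)                                   # the hand-rolled backward selection

# "Automatic selection" as mentioned above: backward elimination by AIC.
reduced <- step(full, direction = "backward")
summary(reduced)

As the notes caution, treat any such result as a suggestion: a term that your theory says is meaningful can be kept even if the procedure would drop it.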