Correlation and Simple Linear Regression


Linear Regression Scenario

Let's imagine that we work in a real estate business and we're attempting to understand whether there's any association between the square footage of a house and its final selling price within a particular neighborhood. To determine this, we randomly collect data from 40 houses that were sold over the past 3 months. The size of these houses ranged from as small as 800 square feet to as large as 2,700 square feet. Once we have this data, we're ready to begin the process of determining whether these two continuous variables are correlated.

Defining Correlation

Defined, correlation helps us measure the extent to which two continuous variables are related in a linear manner. The first step of a correlation study is to collect data and create a scatterplot of the data.

Linear Regression GembaAcademy.com

While each scenario will be different, we recommend attempting to collect at least 40 data points prior to conducting a Correlation Analysis. In our example, it's clear that as square footage increases, so does the selling price. In other words, there's positive correlation between these two variables. We also see that the observed data points in red are tightly distributed around the blue line, meaning there's strong correlation.

In this scatterplot, we see an example where the red data points are more random, making it difficult to conclude there's any correlation at all between these two factors. This is why the blue line is nearly flat.

The second step of the Correlation Study is to calculate the Pearson Correlation Coefficient, which is noted as a lowercase r. This statistic helps us quantify the strength of the correlation if one exists. The null hypothesis, or Ho, is that r, the Correlation Coefficient, is 0, meaning there's no correlation between the two variables. The alternate hypothesis, or Ha, is that r does not equal 0, meaning there's at least some correlation between the two variables. The Alpha value is typically set to 0.05, meaning we're willing to accept a 5% chance we could incorrectly reject the null hypothesis.

Square Footage Versus Selling Price Study

Here's the Correlation Minitab output for our Square Footage versus Selling Price Study. The first thing we check is the P-value, which is 0, meaning we'll reject Ho and conclude there's at least some correlation between selling price and square footage. We then examine the Pearson Correlation statistic, which in this example is 0.990. This r statistic will always range between -1 and 1, and while every organization will need to determine this for themselves, especially since some processes are far more critical than others, an r value greater than 0.85 is typically accepted as describing a strong correlation. If you're producing extremely critical medical equipment, you may want to see an r value above 0.99 before concluding there's a strong correlation.

The interesting thing about the r statistic is that it also tells us the direction of the correlation. An r value of 1 means there's perfect, positive correlation, while an r value of -1 means there's perfect, negative correlation. Again, in our selling price versus square footage example, the r value is 0.99, which means there definitely seems to be strong, positive correlation between the size of the house and its selling price.

Once we've completed the correlation study, it's time to dig even deeper, which we'll do by performing Simple Linear Regression, the third step of the process. Simple Linear Regression allows us to formulate an equation corresponding to the line that best fits the data. In other words, Simple Linear Regression is very similar to the Correlation Study we just completed, only it gives us much more information. When we speak of the best fit line, we're talking about the blue line that's sandwiched in between the individual data points. This line is created using something called the Least Squares method. All this really means is that the blue regression line is drawn where it minimizes the sum of the squared vertical distances from each observed data point to the fitted line.

Null and Alternate Hypotheses

The null and alternate hypotheses for Simple Linear Regression are very similar to those of the Correlation Test we did earlier. The null hypothesis, or Ho, is that the slope of the line is zero, meaning there's no correlation, while the alternate hypothesis, or Ha, is that the slope of the line is not zero, meaning there's at least some correlation.
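As a sketch of steps two and three, here's how the correlation test and least-squares fit might look in Python, with numpy and scipy standing in for Minitab. The data below is hypothetical (the study's actual 40 observations aren't reproduced in the transcript):

```python
import numpy as np
from scipy import stats

# Hypothetical square-footage / selling-price data standing in for
# the study's 40 houses (the real observations aren't in the text).
rng = np.random.default_rng(seed=1)
sqft = rng.uniform(800, 2700, size=40)
price = 8_000 + 100 * sqft + rng.normal(0, 5_000, size=40)

# Step two: Pearson correlation coefficient r and its P-value.
# Ho: r = 0 (no correlation); Ha: r != 0; Alpha = 0.05.
r, p_value = stats.pearsonr(sqft, price)
if p_value < 0.05:
    print(f"reject Ho: r = {r:.3f}")  # strong positive correlation

# Step three: least-squares estimates of intercept b0 and slope b1:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
x_bar, y_bar = sqft.mean(), price.mean()
b1 = np.sum((sqft - x_bar) * (price - y_bar)) / np.sum((sqft - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"best fit line: price = {b0:,.0f} + {b1:.2f} * sqft")
```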

R Squared and R Squared Adjusted

Here's what a Fitted Line Plot looks like. As you can see, it's nothing more than a scatterplot with a blue fitted line down the middle. We also see some additional information, including R Squared and R Squared Adjusted. As we learned in our ANOVA modules, the R Squared and R Squared Adjusted statistics tell us how well the variables being studied explain the variation in the process. In this example, both values are at least 98%, meaning this model has done a superb job explaining the variation in the process.

Next, perhaps the most powerful aspect of regression is how it allows us to build predictive equations. Here we see an equation of the form Selling Price = b0 + b1 x Square Footage, where the intercept and slope values come from the Minitab output. Now let's explore this equation a little, and to do so we need to travel back a bit to our Algebra and Geometry days, which I'm sure brings back extremely nice memories for most of you! Another way of writing this same equation is Y = b0 + b1x. Y, in this case, is the response we're trying to predict. b0 is the intercept, or the value of Y when X is 0. b1 is the slope, or the amount Y changes when X changes by one unit. X is the explanatory variable. With this equation, we're able to plug a square footage value in, such as 1,000, and when we work out the math, we're able to predict what the selling price will be. In this example, when we plug 1,000 into the equation, we learn that the Selling Price should be around $108,000.
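These ideas can be sketched as follows; the data and fitted coefficients here are illustrative, not the study's actual Minitab output:

```python
import numpy as np

# Hypothetical square-footage / selling-price data.
sqft = np.array([900, 1100, 1400, 1700, 2000, 2300, 2600], dtype=float)
price = 20_000 + 95 * sqft + np.array([3, -2, 4, -5, 2, -1, -1]) * 1000.0

b1, b0 = np.polyfit(sqft, price, 1)   # slope b1, intercept b0
fitted = b0 + b1 * sqft
resid = price - fitted

# R-Squared: 1 - SS_residual / SS_total (coefficient of determination).
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_sq = 1 - ss_res / ss_tot

# Adjusted R-Squared also accounts for the number of predictors p.
n, p = len(sqft), 1
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

# Predict the selling price of a 1,000-square-foot house with
# Y = b0 + b1 * X, as in the transcript's example.
predicted = b0 + b1 * 1000
print(f"R-sq = {r_sq:.1%}, R-sq(adj) = {r_sq_adj:.1%}, "
      f"prediction = ${predicted:,.0f}")
```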

There are a few important caveats to this predictive capability of Regression. First, if our R-Squared values are low, say less than 85%, the predictive capabilities of the model are weaker. Second, we must be very careful if we attempt to make predictions outside of the tested boundaries. In our study, if we attempted to plug a Square Footage value of 3,500 into the equation, we'd be outside of the tested boundary, since 2,700 was the highest Square Footage examined in the study. Attempting to predict outside of a tested boundary is called Extrapolation and should be done with extreme caution. In fact, many statisticians strongly recommend against any and all extrapolation. Finally, no matter how strong we feel our Regression model is, we must always, and I do stress always, confirm the results by predicting the response with our equation and then actually performing an experiment to see if the prediction comes true. In our real estate example, we could wait to see what the next ten houses sell for and then plug their square footages into the equation to see how close the actual selling prices are to the predicted prices.

With all this said, if we feel like we have a strong model well suited for predictions, Minitab allows us to add 95% Confidence and Prediction Intervals to the Fitted Line Plot.

Difference Between Confidence and Prediction Intervals

We want to explain the differences between these two intervals. First, the red dashed lines represent the 95% confidence interval of the mean response. In other words, we can feel 95% confident that the fitted line will fall between these two red lines, so long as something doesn't fundamentally change with the process. The dashed green lines represent the 95% Prediction Intervals. These lines represent the range within which individual data points are likely to fall, and since it's harder to predict where individual data points will fall, the prediction intervals will always be wider.

In addition to the Fitted Line Plot, Minitab also provides a Regression Analysis that must be examined in harmony with the Fitted Line Plot. First, we see an ANOVA output with a P-value. This P-value simply tells us whether or not there's evidence of correlation. Again, the null hypothesis for Simple Linear Regression is that the slope of the line is equal to zero, and since our P-value is 0, we will reject Ho and conclude that there's at least some correlation between the variables being studied. We also see the same R-Squared and R-Squared Adjusted values we saw in the Fitted Line Plot. As we learned in our ANOVA module, R-Squared is the coefficient of determination: it tells us how much of the variation in the response is explained by the factors in the model. Again, while you have to be careful with stating what a good R-Squared value is, values above 85% generally mean the factors in the study have done a good job explaining the variation in the model. Adjusted R-Squared is similar to R-Squared but also accounts for the number of factors, making it useful for comparing models with different numbers of factors, such as we'll see when we learn about multiple regression.

Last but most certainly not least, the fourth step of the Correlation Study process is to check the residuals assumptions, just like we did when performing ANOVA. The residuals in a Regression model simply represent the distance between each observed data point and the fitted line.

Residuals Assumptions

The residual for a given data point is the vertical distance between it and the fitted line. The residuals assumptions for ANOVA and Regression are the same. First, the residuals must be random and independent. We check this with the residuals versus order graph.
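The difference between the confidence and prediction intervals described above can be sketched numerically. The formulas below are the standard simple-regression interval formulas, applied to hypothetical data rather than the study's; in the course itself, Minitab draws these bands on the Fitted Line Plot:

```python
import numpy as np
from scipy import stats

# Hypothetical square-footage / selling-price data.
sqft = np.array([800, 1100, 1400, 1700, 2000, 2300, 2600, 2700],
                dtype=float)
price = 15_000 + 100 * sqft + np.array([2, -3, 1, 4, -2, 3, -4, -1]) * 1000.0

n = len(sqft)
b1, b0 = np.polyfit(sqft, price, 1)
resid = price - (b0 + b1 * sqft)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard error
x_bar = sqft.mean()
sxx = np.sum((sqft - x_bar) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)       # 95%, two-sided

x0 = 1500.0                                  # inside the tested range
y0 = b0 + b1 * x0

# Confidence interval: where the MEAN response at x0 is likely to be.
se_mean = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
ci = (y0 - t_crit * se_mean, y0 + t_crit * se_mean)

# Prediction interval: where an INDIVIDUAL observation at x0 is
# likely to fall -- note the extra "1 +", which makes it wider.
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
pi = (y0 - t_crit * se_pred, y0 + t_crit * se_pred)
print(f"95% CI: {ci}\n95% PI: {pi}")
```

The "1 +" term inside the prediction-interval square root is exactly why the green lines always sit outside the red ones.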

The data points seem to be randomly distributed. The second assumption is that the residuals are normally distributed. We check this with the normal probability plot and histogram, and as we see here, everything looks normal, since the red data points fall close to the blue line and the histogram is symmetric about the mean. The third assumption is that the residuals should have constant variance across all levels. We check this with the residuals versus fits graph. As we see, the variance seems to be evenly distributed. If any of these assumptions is violated, we should be very careful, since the regression analysis may not be accurate and could lead to inaccurate predictive equations. If we do fail an assumption, we'll want to look for things such as data entry mistakes, unstable conditions during data collection, and loose measurement systems.

Conclusion

The last topic that we'd like to discuss is an extremely important one. As it turns out, when lean and six sigma practitioners first learn about tools like regression, they can become very excited and even anxious to use the tool. While there isn't anything wrong with this, there's an important phrase we all must understand: Correlation does not imply Causation. For example, it can be shown that umbrella sales and traffic accidents are highly correlated. In other words, as sales of umbrellas go up, so do traffic accidents. Does this mean we should push to decrease the number of umbrella sales? Of course not. A little common sense tells us that umbrella sales and traffic accidents both increase because it's raining outside. While the tools of lean and six sigma are extremely useful, we must never lose sight of the most powerful tool we have, and that's good old fashioned common sense.
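Returning to the three residual checks described earlier: Minitab's residual plots are visual, but rough numeric stand-ins can be sketched as below. The data and thresholds are hypothetical and illustrative, not formal tests:

```python
import numpy as np

# Hypothetical data, assumed collected in run order.
rng = np.random.default_rng(seed=7)
sqft = rng.uniform(800, 2700, size=40)
price = 12_000 + 100 * sqft + rng.normal(0, 4_000, size=40)
b1, b0 = np.polyfit(sqft, price, 1)
resid = price - (b0 + b1 * sqft)

# 1. Random and independent: lag-1 autocorrelation of the residuals
#    in run order should be near zero (residuals-versus-order check).
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# 2. Normally distributed: as a rough proxy, the residuals should be
#    roughly symmetric about zero (normal plot / histogram check).
skew_proxy = np.mean(resid ** 3) / np.std(resid) ** 3

# 3. Constant variance: residual spread should be similar across the
#    lower and upper halves of the fits (residuals-versus-fits check).
order = np.argsort(b0 + b1 * sqft)
low, high = resid[order[:20]], resid[order[20:]]
spread_ratio = np.std(low) / np.std(high)

print(f"lag-1 autocorr {lag1:+.2f}, skew {skew_proxy:+.2f}, "
      f"spread ratio {spread_ratio:.2f}")
```

Values far from 0 (for the first two) or far from 1 (for the ratio) would prompt the closer look at data entry, stability, and measurement systems that the transcript recommends.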