Chapter 10 Regression Analysis

Goal: To become familiar with how to use Excel 2007/2010 for Correlation and Regression.

Instructions: You will be using CORREL, FORECAST and Regression. CORREL and FORECAST are found on the Stat Menu and Regression is found in the Data Analysis Group.

CORREL

Select CORREL from the Stat Menu and here is what you should see:

Typically, the data is organized so that Array1 is in Column 1 and Array2 is in Column 2. After you have selected the data, the tool returns the Correlation Coefficient. This is a number between -1 and +1 and is explained further in the notes below.

FORECAST

Select FORECAST from the Stat Menu. You should see:

Be careful how you enter data into the FORECAST tool. Pay close attention to which column of data you select for your Known_y's and for your Known_x's. Don't mix them up. In the height and weight example discussed in the notes below, the weight data would be the Known_y's. Once you've entered the input data, you can make predictions about various males by entering values for their height into the input field labeled X. The tool returns the predicted value, in this case, for weight.

REGRESSION

Select Regression from the Data Analysis Menu. You should see:

You just have to be a little careful when you enter the data. Take note that it asks for the Y data first. Data is entered in the typical way by selecting the desired data. If you have labels at the top of your data columns and want the labels to carry over, check the Labels box. As before, select Output Range and then select some cell on the worksheet. Lastly, check Line Fit Plots so you get a chart showing the data and the regression line.

Here is a typical output with the important fields highlighted:

Multiple R: This is the correlation coefficient (reported without its sign).
R Square: This is the coefficient of determination.
Standard Error: This is the standard error of the estimate, a measure of how far the data values typically fall from the regression line.
Significance F: This plays the same role that the p-value did in hypothesis testing. If this value is less than alpha, then the regression is significant.
Intercept: This is the y-intercept of the linear regression equation.
Slope: This is the slope of the linear regression equation.

The following is the chart output. The markers in blue are the actual data values and the red line is the linear regression line.
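For readers who want to check the spreadsheet results outside Excel, most of the quantities above can be reproduced by hand. The following is a minimal Python sketch, not Excel's own implementation. The function names (`correl`, `fit_line`, `forecast`, `regression_summary`) and the height and weight numbers are hypothetical, chosen only for illustration. Significance F is left out because it needs the F distribution, which is not in the Python standard library.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient r, the quantity CORREL reports."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def forecast(x, known_ys, known_xs):
    """Predicted y at x. The Y data comes before the X data, as in FORECAST."""
    intercept, slope = fit_line(known_xs, known_ys)
    return intercept + slope * x

def regression_summary(xs, ys):
    """A few of the fields shown in the Regression output (one X variable)."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_square = 1 - ss_resid / ss_total
    return {
        "Multiple R": math.sqrt(r_square),                  # |r| for one X variable
        "R Square": r_square,                               # coefficient of determination
        "Standard Error": math.sqrt(ss_resid / (n - 2)),    # standard error of the estimate
        "Intercept": intercept,
        "Slope": slope,
    }

# Hypothetical sample: heights in inches (X) and weights in lbs (Y).
heights = [64, 66, 68, 70, 72, 74]
weights = [130, 145, 150, 165, 180, 190]

print("r =", correl(heights, weights))
print("predicted weight at 72 inches =", forecast(72, weights, heights))
for name, value in regression_summary(heights, weights).items():
    print(name, "=", value)
```

Note that `forecast` takes the Y data before the X data, so swapping the two columns gives a different (and wrong) prediction, just as it would in the spreadsheet.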

Regression Analysis involves the study of ordered pairs of data, such as (X, Y). If a strong linear relationship exists between X and Y, then given a value of X, we can make a prediction about what Y should be.

Consider height and weight. There is a strong correlation between the two as well as a significant linear relationship, i.e. we can express weight as a linear function of height. Let's say that the average weight of all males in the United States is 170 lbs. If we were to pick some male at random, without seeing the person, the best guess of this person's weight would be 170 lbs, the population mean. However, if we knew of a linear relationship between height and weight, and we knew the person's height, we could make a better guess of the selected male's weight.

Let's say that weight is related to height by the following linear equation: [equation missing] where W is in lbs and H is in inches. Let's say that we know that the selected male is 72 inches tall. We would then estimate the person's weight to be [value missing] lbs, which would be a better guess than 170, the average of the whole male population. Knowing the linear relationship that exists between weight and height enables us to make better predictions than just knowing the population mean.

Correlation

A correlation exists between two variables when there is a relationship between the two. In other words, one can be used to predict the value of the other. In this class we will study correlations in which the relationship is a linear one, i.e. one variable can be expressed as a linear function of the other. The following formula shows a linear relationship between y and x:

$y = \beta_0 + \beta_1 x$

$\beta_0$ and $\beta_1$ are constant numbers, for example, -2 and 10:

$y = -2 + 10x$

In this example, if x were to equal 4 then y would equal:

$y = -2 + 10(4) = 38$

Since we are dealing with random data, $\beta_0$ and $\beta_1$ are population parameters. Given a sample, our goal is to estimate $\beta_0$ and $\beta_1$. Let's take a look at an example.
The following table shows the costs of subway fare and a slice of pizza in New York City from 1960 through 2000:

Cost of Pizza    Subway Fare

It certainly looks like there exists a linear relationship between the two sets of data. We can measure the strength of this relationship using the Correlation Coefficient r. The Excel tool we use is CORREL. For the data in this example, CORREL returns a value of [value missing]. Use the following table of r values to determine if the correlation is statistically significant. The correlation is significant if the absolute value of r exceeds the table value for your sample size:

Sample Size    r

Correlation can be positive, negative or near zero. The following figures show the relationship between the sign of the correlation coefficient and the arrangement of the data. Observe how the data is randomly scattered when r is 0.

Now that we have seen that the cost of pizza is highly correlated with the cost of subway fare, we can use one to predict the other. The Excel tool that we use for this is FORECAST. For example, if we input a value for the cost of pizza, such as 1.00, the tool would return the predicted value of [value missing]. Notice that this does not equal the 1.00 in the table above for the cost of Subway Fare. Remember, FORECAST uses the data in the table to make estimates of the intercept and slope of the line, and it uses these estimates to calculate predicted values. There's no reason to believe that a predicted value would equal an actual value in a sample, but it is close.

FORECAST makes use of a set of equations we call Regression Analysis. Based on the sample data, Regression Analysis comes up with the straight line that best fits the data. See the figure below.

The data is from a random sample, so one would expect it to be scattered around a bit, but see how the straight line does a pretty good job of fitting the data. FORECAST does its job by taking the differences between the actual values and the predicted values, squaring them, and choosing the straight line that makes the sum of these squared differences as small as possible. This is called the Least Squares method.

Regression Analysis lets us make better predictions if we know something about the data. For example, let's suppose that during the work week, it takes you an average of 5.6 minutes to get out of bed. If I wanted to predict how long it would take you to get out of bed on Thursday, the best I could do would be to say 5.6 minutes. Now, suppose that there exists a linear relationship between how long it takes you to get out of bed and the day of the work week. Consider the following table:

Day of Week    Time (mins)

The correlation coefficient is [value missing]. Now, let's see what we would predict for Thursday. FORECAST returns 3.5 minutes, which is a much better estimate than 5.6 minutes. Therefore, you can see that if the data has a linear relationship associated with it, we can use that relationship to make better predictions than just using the sample mean.
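The Least Squares idea can be checked numerically: fit the line, then nudge its intercept or slope and confirm that the sum of squared differences only gets larger. The Python sketch below does this under stated assumptions. The getting-out-of-bed times are made up for illustration (they are not the values from the table, so the results will not match the 5.6 and 3.5 minutes quoted above), the days are coded Monday = 1 through Friday = 5, and the function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sum_squared_errors(intercept, slope, xs, ys):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: day of the work week (Monday = 1) and minutes to get up.
days = [1, 2, 3, 4, 5]
minutes = [8.0, 7.0, 5.0, 4.5, 3.0]

intercept, slope = least_squares_line(days, minutes)
best = sum_squared_errors(intercept, slope, days, minutes)
print("fitted line: minutes =", intercept, "+", slope, "* day")
print("sum of squared errors for the fitted line:", best)

# Moving the line in any direction makes the fit worse.
for d_intercept, d_slope in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    nudged = sum_squared_errors(intercept + d_intercept, slope + d_slope, days, minutes)
    print("nudged by", (d_intercept, d_slope), "->", nudged)

# Predicting with the mean alone is the same as using a flat line.
mean_minutes = sum(minutes) / len(minutes)
mean_only = sum_squared_errors(mean_minutes, 0.0, days, minutes)
thursday = intercept + slope * 4
print("mean-only prediction:", mean_minutes, "with squared error", mean_only)
print("regression prediction for Thursday:", thursday)
```

The mean-only comparison mirrors the argument in the text: when the data has a linear trend, the flat line at the mean has a larger sum of squared errors than the fitted line, which is why the regression prediction is the better guess.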