10.2 Correlation. Plotting paired data points leads to a scatterplot. Each data pair becomes one dot in the scatterplot.


10.2 Correlation

Note: You will be tested only on material covered in these class notes. You may use your textbook as supplemental reading. At the end of this document you will find practice problems similar to the ones you can expect on the exam.

Important Concepts: understand positive, negative, and zero correlation; scatterplot; be able to compute and interpret the correlation coefficient r. You need to be able to use either your calculator or the formula provided in the book to compute the value of r.

In this chapter we will try to answer the following questions:
Are two variables linearly related?
If so, what is the strength of the relationship?
What kind of predictions can be made from the relationship?

Correlation exists between two variables when the values of one are associated in some way with the values of the other.

Independent variable: The variable on the x-axis; it can be controlled and manipulated. (Also called the explanatory variable.)
Dependent variable: The variable on the y-axis; it cannot be controlled or manipulated. (Also called the response variable.)

Plotting paired data points leads to a scatterplot. Each data pair becomes one dot in the scatterplot (= data cloud). A scatterplot reveals the type of correlation that exists. It also makes outliers apparent.

Note: use the calculator instructions at the end of these notes to construct a scatterplot using your TI-83/84.

The strength and type of a linear correlation

The Correlation Coefficient r measures the direction and strength of the linear relationship between the paired values in a sample. The stronger the correlation, the closer the absolute value of r is to the value 1.

1. The closer r is to +1 (or -1), the less variation there is in the data around the line of best fit that we are about to graph through our data cloud.
2. When the variation in the plot increases (i.e., the cloud is more spread out), r gets closer to zero.
3. A value of r close to +1 or -1 does NOT necessarily imply a causal relationship. There can be a third (lurking) variable in the background connecting the two. (The question of cause can be tricky. It takes some insight into the context to decide whether one of the variables causes the change in the other or whether they are connected in some other way.) Example: Gas prices versus number of traffic deaths.

How to Interpret r

Range of Values    Strength of Linear Relationship    Direction of Linear Relationship
-1 to -0.8         Strong                             Negative
-0.8 to -0.6       Moderate                           Negative
-0.6 to -0.3       Weak                               Negative
-0.3 to 0.3        None                               None
0.3 to 0.6         Weak                               Positive
0.6 to 0.8         Moderate                           Positive
0.8 to 1           Strong                             Positive

Requirements:
1. The sample of paired (x, y) data is a simple random sample of quantitative data.
2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern.
3. Outliers must be removed if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the outliers included.

Note: We will be using calculators to compute r.

Formula (not used in this course):
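The course computes r on a calculator. For reference only, one standard computational form of the sample correlation coefficient is r = [n Σxy - (Σx)(Σy)] / [sqrt(n Σx² - (Σx)²) · sqrt(n Σy² - (Σy)²)]; it may not be the exact form printed in the textbook. The Python sketch below is an illustration, not part of the course. It evaluates that form and then reads strength and direction off the "How to Interpret r" table. The function names and the sample data are hypothetical, and sending a boundary value such as 0.8 to the stronger category is an assumption, because the ranges in the table share their endpoints.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    spread_x = n * sum_x2 - sum_x ** 2
    spread_y = n * sum_y2 - sum_y ** 2
    if spread_x <= 0 or spread_y <= 0:
        raise ValueError("r is undefined when x or y has no variation")
    return (n * sum_xy - sum_x * sum_y) / (math.sqrt(spread_x) * math.sqrt(spread_y))

def interpret_r(r):
    """Return (strength, direction) following the 'How to Interpret r' table.

    Assumption: a boundary value (0.3, 0.6, 0.8) goes to the stronger category.
    """
    size = abs(r)
    if size < 0.3:
        return ("None", "None")
    if size < 0.6:
        strength = "Weak"
    elif size < 0.8:
        strength = "Moderate"
    else:
        strength = "Strong"
    return (strength, "Positive" if r > 0 else "Negative")

# Hypothetical paired data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 57, 70, 74]

r = correlation_coefficient(hours_studied, exam_score)
print(round(r, 3), interpret_r(r))
```

A strong value of r here would still say nothing about cause; it only describes how tightly the points follow a line.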

Properties of the Linear Correlation Coefficient r

1. -1 ≤ r ≤ 1
2. r measures the strength of a linear relationship.
3. r is very sensitive to outliers; they can dramatically affect its value.

A sketch can reveal a strong linear correlation, a weak linear correlation, or no linear correlation. In theory, there is also a perfect linear correlation if all points lie exactly on a line. The stronger the linear correlation, the less scattered the data cloud appears.

Note: in the real world there is no perfect linear correlation.

A positive correlation exists when both variables increase (or decrease) at the same time. Example: A child's height and weight: the taller the child, generally, the more the child weighs. A positive linear correlation has a positive correlation coefficient and a data cloud that goes from the lower left to the upper right.

In a negative correlation, as one variable increases, the other variable decreases (and vice versa). Example: Amount of money I spent in a month versus the amount of money I put in savings. A negative linear correlation has a negative correlation coefficient and a data cloud that goes from the upper left to the lower right.

r = 1 indicates a perfect positive linear correlation
r = -0.9 indicates a strong negative linear correlation
r = 0.8 indicates a fairly strong positive linear correlation
r = -0.5 indicates a weak negative correlation
r = 0.3 indicates a very weak positive linear correlation
r = 0 indicates no correlation
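Property 3 can be checked numerically. The Python sketch below is an illustration, not part of the course, which uses the TI-83/84. It computes r for a small data set with and without one outlying point. The data are invented for the example, and the helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# One invented outlier far below the line.
outlier_x, outlier_y = 6, 1

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [outlier_x], ys + [outlier_y])

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```

With these numbers a single added point moves r from 1 to about 0.14, which is why r should be calculated with and without any outlier that is not a known error.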

10.3 Regression

Important Concepts to understand: the basic concepts of regression; use your calculator to find the regression equation; use the regression equation for predictions.

The regression line, often less formally called the line of best fit, is the line fit through the data cloud in such a way that the sum of the squared vertical distances from the points to the line is as small as possible. There is a mathematical method to calculate this line, but we are not covering this method.

The regression line is used to predict (estimate) a y-value for a given value of x within the interval of x-values that were studied. It can also be used in reverse to make a prediction for x with a given y. This type of prediction within the domain is called interpolation.

Shoe size and predicted height

1. Graph the data and look for (and remove) any outliers.
2. Find the value of the Linear Correlation Coefficient. Is there a linear correlation?
   a) Interpret.
3. Explain the Variation using the value of r². We conclude that ____ of the variation in height can be explained by the linear relationship between shoe sizes and heights. This implies that about ____ of the variation in heights cannot be explained by shoe sizes. In addition to shoe size, other factors contribute to the height variation.
4. Use LinReg to find the equation of the regression line and then:
   a) determine the best predicted height for an adult male with shoe size 8.5.
   b) determine the best predicted shoe size for an adult male who is 200 cm tall.
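The course does not cover how the line is calculated, so the following Python sketch is optional background rather than the method used in class. It assumes the standard least-squares formulas: slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept a = ȳ - b·x̄, and predicted value ŷ = a + b·x. The shoe sizes and heights are invented for illustration and are not the class data; all function names are hypothetical. The reverse prediction simply solves the same line for x, which is one reading of "used in reverse" above; fitting a separate line of x on y would generally give a different answer.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x. Returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return a, b, r

def predict_y(a, b, x):
    """Predicted y for a given x."""
    return a + b * x

def predict_x(a, b, y):
    """Reverse prediction: solve y = a + b*x for x."""
    return (y - a) / b

def is_interpolation(xs, x):
    """True when x lies inside the interval of x-values that were studied."""
    return min(xs) <= x <= max(xs)

# Hypothetical data, invented for illustration only.
shoe_size = [7, 8, 9, 10, 11, 12]
height_cm = [165, 170, 174, 179, 183, 189]

a, b, r = fit_line(shoe_size, height_cm)
print(f"regression line: height = {a:.2f} + {b:.2f} * shoe_size")
print(f"r squared = {r ** 2:.1%} of the variation in height is explained")
print(f"predicted height at size 8.5: {predict_y(a, b, 8.5):.1f} cm")
print(f"size 8.5 is inside the studied range: {is_interpolation(shoe_size, 8.5)}")
print(f"reverse prediction for 180 cm: size {predict_x(a, b, 180):.1f}")
```

The range check matters because the notes restrict predictions to the interval of x-values that were studied; a prediction for a value outside that interval is not interpolation and deserves less trust.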

5. Does Drinking Small Amounts of Red Wine Reduce Your Risk of Heart-Disease Death?

There is some evidence that drinking moderate amounts of red wine reduces the risk of heart attacks. The table gives data on wine consumption and deaths from heart disease in 19 developed countries. Wine consumption is measured as liters of alcohol from drinking wine per person. The heart-disease death rate is deaths per 100,000 people.

Country            Alcohol from Wine    Heart-Disease Death Rate
Australia
Austria
Belg./Luxe
Canada
Denmark
Finland
France
Germany (West)
Iceland
Ireland
Italy
Netherlands
New Zealand
Norway
Spain
Sweden
Switzerland
U.K.
United States

Source: New York Times, December 28, 1994

a) Make a scatterplot of these data on the provided graph paper or using your TI-83 to show the possible influence of wine consumption on heart-disease deaths.

[Graph paper. Horizontal axis: Alcohol from wine (liters per person). Vertical axis: Heart-Disease Death Rate (per 100,000 people).]

b) Use your calculator to find the regression equation and the coefficient r.
c) Is there a correlation? What does the coefficient r tell us about the strength and direction of the correlation?
d) Graph this line in the scatterplot constructed earlier, either manually or using the TI-83/84.
e) Use the regression line to predict the heart-disease death rate in a country where annual wine consumption is 5 liters of alcohol per person. (The number you find is deaths per 100,000 people.)
f) In your opinion, do the data give good reason to think that increasing the wine consumption of Americans (say from 1.2 to 8 liters per person per year) would reduce the heart-disease death rate in the United States? Explain your answer!

6. With the growth of Internet Service providers, a researcher decides to examine whether there is a relationship between the cost of internet per month and the degree of customer satisfaction (on a scale of 1-10, with 1 being not at all satisfied and 10 being extremely satisfied). Only programs with comparable types of services were included in the study. A simple random sample of the data is provided below.

Dollars Spent    Customer Satisfaction

a. Graph the data. Look to see if there is a linear relationship.
b. What is the correlation coefficient?
c. What does this statistic mean concerning the relationship between the amount of money spent per month and the level of customer satisfaction?
d. What percent of the variability is accounted for by the relationship between the two variables, and what does this statistic mean?
e. Does it make sense to use the regression line to make predictions about customer satisfaction based on the amount of money spent for internet service? Which service should you pick in this case?

7. It is assumed that Achievement Test Scores should be correlated with student classroom performance. One would expect that students who consistently perform well in the classroom (tests, quizzes, etc.) would also perform well on a Standardized Achievement Test (Scores 0-100). A teacher decides to examine this hypothesis. At the end of the academic year, she computes a correlation between the students' Achievement Test Scores and the overall GPA for each student over the entire year. The data for her class are provided below.

GPA    Standardized Test Score

a. Graph the data. Look to see if there are any outliers.
b. Compute the correlation coefficient.
c. What does this statistic mean concerning the relationship between Achievement Test Performance and GPA?
d. What percent of the variability is accounted for by the relationship between the two variables? What percent of the variability is not accounted for?
e. What is the Equation of the Regression Line? Does it make sense to use the regression line to predict the Standardized Test score from the GPA?
f. Determine the best predicted GPA with a Standardized Test score of 80.
g. Determine the best predicted Standardized Test score with a GPA of

8. Researchers interested in determining if there is a relationship between death anxiety and religiosity conducted the following study. Students completed a death anxiety survey (high score = high anxiety) and a checklist designed to measure an individual's degree of religiosity: belief in a religion, regular attendance at religious services, number of times per week they regularly pray, etc. (high score = greater religiosity). A data sample is provided below.

Death Anxiety    Religiosity

a. Graph the data. Look to see if there are any outliers.
b. Compute the correlation coefficient.
c. Use table A-6 to interpret.
d. Explain the Variation using the value of r².
e. Use LinReg to find the equation of the regression line and then:
   1) determine the best predicted religiosity score if a person's death anxiety score is 35.
   2) determine the best predicted death anxiety score if a person's religiosity score is 10.