A Research Note on Correlation


Instructor: Jagdish Agrawal, CSU East Bay

1. When to use Correlation:

Research questions assessing the relationship between pairs of variables that are measured at least at the ordinal level:

1. What is the nature of the relationship between volume of sales and amount of advertising dollars?
2. Is there any association between market size and size of the sales force in different geographical regions?
3. Are consumer perceptions of relative brand qualities related to prices of brands in a product category? Or do prices of brands reflect their relative qualities?

2. Different types of correlation:

2.1 Pearson Correlation: A measure of linear association between two variables measured at the interval or ratio scale. If used on ordinal data, it treats the variables as if they were measured at least at the interval scale. Since the Pearson correlation measures linear association, it is very important to examine the nature of the relationship between the two variables by plotting the data. A non-linear relationship, although it could be very significant and meaningful, will not be captured by the Pearson correlation.

2.2 Spearman's Rank Correlation (Spearman's Rho): A measure of association between two variables measured at the ordinal level. Useful when several objects (brands) are ranked on some characteristic (quality). It considers the data only at the ordinal level (ranks of brands on quality and prices) rather than the absolute values of the variables (the size of quality differences measured on some scale, or actual price differences across brands). Since it considers only the ordered nature of both variables, linearity is not an issue as it is in the Pearson correlation. Here the objective is to assess the direction of the relationship (monotonic or not) rather than its linearity (straight line).
Instructor: Jagdish Agrawal, CSU East Bay, Date: September 3,
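The contrast between the Pearson and Spearman correlations described above can be illustrated with a small sketch. The data below are made up for illustration (they do not come from this note), and the function names `pearson`, `simple_ranks`, and `spearman_no_ties` are hypothetical helpers. The rank-difference formula used here is the textbook shortcut and applies only when there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simple_ranks(values):
    """Ranks 1..n for values with no ties (smallest value gets rank 1)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(simple_ranks(x), simple_ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up data: y always rises with x, but along a curve, not a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

r = pearson(x, y)
rho = spearman_no_ties(x, y)
print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Because the ordering of y matches the ordering of x exactly, rho equals 1, while r falls below 1 because the points do not lie on a straight line.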

2.3 Kendall's Rank Correlation (Kendall's tau b): A measure of association between two variables measured at the ordinal level. Useful when a limited number of objects are ranked on some characteristic. Similar to Spearman's Rank Correlation, it recognizes the values of the variables at the ordinal level only. It does not consider the interval or ratio characteristics of the data even if one or both of the variables are measured at the interval or ratio scale. Since it considers only the ordered nature of both variables, linearity is not an issue as it is in the Pearson correlation.

The correlation coefficient ranges from -1.0 to +1.0.

An Example of the Pearson Correlation:

Suppose the following data indicate the Sales (in $ 000) and the average number of TV spots per month in different geographical areas, identified by their area numbers. We are interested in knowing whether sales and TV spots are positively correlated.

[Table: Area #, Sales in $, TV spots per month]

Generate a graph of data for review:

Since both TV spots and Sales in units are measured at ratio scales, the Pearson correlation is appropriate. Since the Pearson correlation assesses the linear relationship, it is important to review the data graphically to get a feel for the data.

[Graph 1.1: Graph of Sales in units and Number of TV Spots across Geographical Regions. Axes: Sales in thousand units, TV spots per month.]

In SPSS, choose:

Analyze > Correlate > Bivariate

The above commands will open the following window.

[Window 1.1: Bivariate Correlations]

The list on the left-hand side of the window shows the variables in the SPSS file. Choose the variables to be correlated by moving them to the right-hand side under "Variables:". More than two variables can be selected; SPSS then generates bivariate correlations for two variables at a time.

There are three different correlations in this window: Pearson, Kendall's tau b, and Spearman.

Let us suppose that we want to examine the relationship between Sales (measured in thousand units) across different geographical regions and the number of TV spots used per month in those geographical areas. Since both Sales and TV spots are measured at the ratio scale, choose the Pearson correlation.
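Outside SPSS, the idea of correlating more than two variables "two at a time" can be sketched as follows. This is only an illustration: the variable names and all values in `data` are hypothetical and do not come from this note, and `pearson` is a hand-written helper rather than any SPSS routine.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL values for five regions; none of these numbers are from the note.
data = {
    "sales": [10, 12, 15, 19, 24],
    "tv_spots": [1, 2, 3, 4, 5],
    "efficiency": [3, 1, 4, 2, 5],
}

# One Pearson correlation for every pair of selected variables.
pairwise = {
    (a, b): pearson(data[a], data[b])
    for a, b in combinations(data, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

With three variables selected, three pairwise correlations are produced. Each one is computed from its two variables only.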

Test of significance: Two-tailed or One-tailed

If the researcher has a hypothesis about the direction of the relationship (e.g., the number of TV spots positively affects the amount of sales), then choose the one-tailed test. Otherwise, the two-tailed test is chosen by default. The following window shows that we selected two ratio-scaled variables, the Pearson correlation, and the two-tailed test to examine the significance of the correlation.

[Window 1.2: Bivariate Correlations]

Clicking the OK button generated the following Table 1.1.

[Table 1.1: Correlations. Pearson correlation between Sales in thousand units and TV spots per month. **. Correlation is significant at the 0.01 level (2-tailed).]

Results indicate that Sales and the Number of TV spots per month are positively (r = 0.898) and significantly (p < .01) correlated. This suggests that there is a positive linear relationship between these two variables. The size of the correlation ranges from -1.0 through 0 to +1.0. Given this possible range, a correlation coefficient of 0.898 indicates a strong linear relationship between the two variables.

Table 1.2 presents an example of a Pearson correlation that is not statistically significant. It shows the correlation between Sales (in thousand units) measured across different geographical areas and the wholesaler efficiency index, measured on a 5-point scale where 1 means that the wholesaler in that geographical area is not efficient and 5 means that the wholesaler in that area is very efficient. We expect a positive correlation between the wholesaler efficiency index and sales across geographical territories. Therefore, the following hypotheses are presented:

Null hypothesis: There is no significant linear relationship between Sales and the Wholesaler Efficiency Index.
Alternate hypothesis: There is a significant positive relationship between Sales and the Wholesaler Efficiency Index.

Table 1.2 shows the correlation coefficient and its significance level (p-value). Since the calculated p-value is greater than our rule-of-thumb threshold of 0.05, we retain the null hypothesis and reject the alternate hypothesis. It means that these two variables are not significantly correlated in the target population from which the data came, regardless of the size of the sample correlation. Since the correlation is not significant, the negative sign of the observed correlation is not very meaningful.

[Table 1.2: Correlations. Pearson correlation between Sales in thousand units and Wholesaler efficiency index (5 = very efficient).]
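The significance decisions above rest on a test of the null hypothesis that the population correlation is zero. A common way to carry out such a test for a Pearson correlation is the t transformation t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below applies it to the reported r = 0.898. The sample size `n = 10` is an assumption made only for illustration, since the number of areas is not stated here, and the critical value is taken from a standard t table for that assumed sample size.

```python
import math

def t_statistic(r, n):
    """t statistic for testing H0: population correlation = 0 (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.898   # sample correlation reported for Sales and TV spots
n = 10      # ASSUMPTION: the number of areas is not given in the note

t = t_statistic(r, n)

# Two-tailed critical value of t at the 0.01 level with 8 degrees of freedom,
# read from a standard t table (it depends on the assumed n above).
T_CRITICAL = 3.355

print(f"t = {t:.3f}, df = {n - 2}")
print("significant at the 0.01 level (two-tailed):", abs(t) > T_CRITICAL)
```

The same r can be significant in a larger sample and not significant in a smaller one, because the t statistic grows with n.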

It must be noted that a statistically insignificant correlation coefficient does not mean that the researcher cannot draw any conclusion regarding the nature and degree of association between the two variables in the target population. It simply means that we retain the null hypothesis of no significant linear (Pearson correlation) relationship between these two variables in the target population.

An example of correlation on Ordinal level data:

Research question: Do the prices of brands reflect their relative quality?

Table 1.3 presents the data on brands (marked as A to Z), their quality ranks (a lower rank number indicates better quality), and their prices in $. Since a higher number for quality rank indicates poorer quality whereas a higher number for price indicates a higher price, we expect a negative correlation between these two variables if prices reflect the brands' relative quality. (We expect to pay more for higher quality brands, and a lower rank number means higher quality.)

Here the Quality rank is measured at the ordinal level whereas the Price is measured at the ratio level. Since one of the two variables is measured at the ordinal level, we cannot use the Pearson correlation. We have to use a correlation that considers ordinal level data.

Table 1.3: Brands, Quality Ranks and Prices

Brand   Quality rank   Price ($)
C       3              95
E       5              85
F       6              90
G       7              80
H       8              70
I       9              70

[Brands A, B, D, and J through Z: values not preserved]

We selected these two variables in the following Window for correlations. Instead of selecting the Pearson correlation, we selected Kendall's and Spearman's correlations to generate correlations of these data.

[Window 1.3: Bivariate Correlations]

This selection generated the following Table 1.4.

[Table 1.4: Kendall's tau b Correlations. Kendall's tau_b correlation coefficient between quality ranks (lower ranks mean better quality) and prices in dollars. **. Correlation is significant at the 0.01 level (2-tailed).]

The correlation coefficient is -0.67, which is significant at p < .05. It indicates that there is a strong negative correlation between the quality ranks of the brands and their relative prices. In other words, prices tend to reflect the relative quality of the brands, although this relationship is not perfect.
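To make the idea behind Kendall's tau b concrete, the following sketch counts concordant and discordant pairs by hand. It uses only the six rows of Table 1.3 whose values are legible here (brands C, E, F, G, H, I), so its result is not expected to match the coefficient reported for the full set of brands. The function name `kendall_tau_b` is a hypothetical helper, not the SPSS procedure.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)),
    where Tx and Ty count pairs tied on x only and on y only."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied on both variables: contributes nothing
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1          # both variables move in the same direction
        else:
            discordant += 1          # the variables move in opposite directions
    denominator = math.sqrt((concordant + discordant + ties_x_only)
                            * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denominator

# Legible rows of Table 1.3 only (brands C, E, F, G, H, I).
quality_rank = [3, 5, 6, 7, 8, 9]
price = [95, 85, 90, 80, 70, 70]

tau = kendall_tau_b(quality_rank, price)
print(f"Kendall's tau-b on the six legible rows = {tau:.3f}")
```

In this subset, 13 of the 15 pairs are discordant (the brand with the worse rank has the lower price), one is concordant, and one is tied on price, which is why the coefficient is strongly negative.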

We also generated Spearman's correlation as shown in the following Window 1.4.

[Window 1.4: Bivariate Correlations]

This selection generated the following Table 1.5.

[Table 1.5: Spearman's Rho Correlations. Spearman's rho correlation coefficient between quality ranks (lower ranks mean better quality) and prices in dollars. **. Correlation is significant at the 0.01 level (2-tailed).]

Similar to Kendall's tau b, Spearman's rho also shows a highly significant (p < .05) negative correlation between the quality ranks of brands and their relative prices. It shows that brands that are judged to be better in quality tend to charge higher prices. In other words, prices of brands tend to reflect their relative quality to a great extent.
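Spearman's rho can be viewed as a Pearson correlation computed on ranks, with tied values sharing the average of their ranks. The sketch below follows that approach on the six legible rows of Table 1.3 (brands C, E, F, G, H, I), so its result will differ from the coefficient obtained from the complete data. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1     # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Legible rows of Table 1.3 only (brands C, E, F, G, H, I).
quality_rank = [3, 5, 6, 7, 8, 9]
price = [95, 85, 90, 80, 70, 70]

rho = spearman(quality_rank, price)
print(f"Spearman's rho on the six legible rows = {rho:.3f}")
```

The two brands priced at $70 are tied, so each receives the rank 1.5 for price. Only the ordering of the prices enters the calculation, not the dollar differences between them.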

A note of caution:

1. Correlation does not demonstrate or prove causality by itself. What causes what has to come from theory. It is a symmetrical measure, i.e., a correlation simply shows the linear relationship between a pair of variables. The correlation between X and Y is the same as the correlation between Y and X.
2. The correlation measures the linear (straight line) or monotonic (positive or negative direction) relationship. If the relationship is curvilinear, the correlation will underestimate the relationship between those variables and may produce results that show the correlation to be insignificant.
3. The correlation is unlikely to tell the true relationship between variables if the nature of the relationship varies depending upon the values of the variables. Graph 1.1 shows this kind of relationship, where the relationship tends to be different at lower values than at higher values of Sales and TV spots.
4. A correlation between two variables (such as income and intelligence) may be spurious (artificial). A high correlation between these two variables may simply be an artifact of their relationship with another variable such as education. This is further evidence that a correlation does not measure causality.
5. It is important to note that correlation is not a measure of percentage, i.e., a correlation of 0.80 does not mean that the correlation between two variables is 80%. The correlation shows the strength of the relationship between two variables given a possible range of -1.0 to +1.0.
6. A statistically insignificant correlation coefficient does not mean an inconclusive result. The conclusion is that the two variables are not (linearly or monotonically, depending upon the correlation used) related in the target population.
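The second caution can be illustrated with a deliberately extreme, made-up case: a relationship in which one variable is completely determined by the other, yet the Pearson correlation is zero because the relationship is curvilinear. The data and the `pearson` helper below are illustrative assumptions, not part of the examples in this note.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y is an exact function of x, but the curve is U-shaped.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(f"Pearson r for the U-shaped relationship = {r:.3f}")
```

The falling left half of the curve cancels the rising right half, so the linear measure reports no association even though the relationship is perfect. A plot of the data would reveal this immediately.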