Degree Topics in Mathematics

Chi-Squared distribution

Chi is a Greek letter, which is pronounced "khi" and is written χ. The symbol χ² is used to represent a probability distribution with some particular characteristics. The shape of the distribution depends on a parameter ν (the Greek letter nu), which represents the number of degrees of freedom; this is explained later. (Here, the letter k is used instead of ν.)

For k = 1 and k = 2 the curve has a J shape. For k = 3 and above the distribution is positively skewed. The distribution becomes more symmetrical as the value of k increases. When k is very large, the distribution is approximately Normal.

Image taken from http://en.wikipedia.org/wiki/chi-square_distribution

One of the uses of the χ² test is to consider situations where individuals are classified according to two sets of characteristics and to determine whether the characteristics are independent. For example, we might want to know if the scores on GCSE Spanish speaking examinations and written examinations are independent of each other. We use a hypothesis test to determine whether the characteristics are independent.

Never used a Hypothesis Test?

Hypothesis tests are used extensively in statistical disciplines to determine whether something that has occurred is within the usual range of expected outcomes or is an unusual occurrence. If you are familiar with the Binomial distribution, you will find this short video introduction to Hypothesis Testing useful: Examsolutions, "Hypothesis Testing for the Binomial Distribution". If you haven't studied any statistics at A level, you might like to watch the video "An Introduction to Hypothesis Testing", which discusses the general principles of Hypothesis Testing and introduces the main terminology.
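The description of how the shape of the χ² distribution changes with k can be checked numerically. The following is a minimal sketch, assuming the standard density formula for a χ² distribution with k degrees of freedom and the standard skewness formula √(8/k). The function names chi2_pdf and chi2_skewness are invented for this illustration.

```python
import math


def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom, for x > 0."""
    half_k = k / 2
    return x ** (half_k - 1) * math.exp(-x / 2) / (2 ** half_k * math.gamma(half_k))


def chi2_skewness(k):
    """Skewness of the chi-squared distribution (assumed standard result: sqrt(8 / k))."""
    return math.sqrt(8 / k)


# Print a few density values and the skewness for several values of k.
for k in (1, 2, 3, 10, 100):
    densities = [round(chi2_pdf(x, k), 4) for x in (0.5, 1, 2, 4)]
    print(f"k={k}: densities at x=0.5, 1, 2, 4 -> {densities}, skewness={chi2_skewness(k):.3f}")
```

For k = 1 and k = 2 the printed densities fall steadily as x increases (the J shape), for k = 3 they rise and then fall, and the skewness shrinks towards 0 as k grows, in line with the approach to a Normal shape.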
Task 1

The management team of a zoo want to find out if there is any link between the age of a person and the things that they prefer to visit at the zoo. The team carry out a survey of the tickets purchased by 500 visitors on one day and classify them by two criteria:

Criterion 1              Criterion 2
Child ticket             Theme park only
Adult ticket             Zoo only
Senior Citizen ticket

The data is summarised in a table as follows:

                   Child ticket   Adult ticket   Senior Citizen ticket
Theme park only    22             15             17
Zoo only           45             39             26
                   70             80             34
                   56             78             18

Is there an association between age group and ticket type?

Solution

First we state the Null and Alternative hypotheses. Statements of this form are always used as the Null and Alternative Hypotheses in this type of question.

H₀: There is no association between age group and type of ticket
H₁: There is an association between age group and type of ticket

We now carry out some calculations using the assumption that there is no association between age group and type of ticket, i.e. that they are independent variables.
                   Child ticket   Adult ticket   Senior Citizen ticket   Total
Theme park only    22             15             17                      54
Zoo only           45             39             26                      110
                   70             80             34                      184
                   56             78             18                      152
Total              193            212            95                      500

Based on the totals, we would expect 54/500 of the tickets to be for the theme park only and 193/500 of the tickets to be for children. If the two characteristics are independent, we can multiply the probabilities:

p(child, theme park only) = (54/500) × (193/500) = 0.041688

and 0.041688 × 500 = 20.844. So in a sample of 500 the expected frequency for child, theme park only tickets would be 20.844. The same calculations can be carried out for all of the other combinations of ticket type and age group.

Expected values

                   Child ticket   Adult ticket   Senior Citizen ticket
Theme park only    20.844         22.896         10.26
Zoo only           42.46          46.64          20.9
                   71.024         78.016         34.96
                   58.672         64.448         28.88

We can now calculate the test statistic

χ² = Σ (O − E)² / E

where O = observed frequency and E = expected frequency. This statistic approximately follows a χ² distribution. The calculation (O − E)² / E is carried out for each of the twelve cells of the tables above, and then the twelve values are summed.
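The expected-frequency step above (row total × column total ÷ grand total, which is the same as multiplying the two probabilities and then multiplying by 500) can be sketched in Python. This is an illustrative sketch only: the function name expected_frequencies and the list-of-lists layout are assumptions made for the example, and the observed values are those of Task 1.

```python
# Observed frequencies from Task 1.
# Rows: the four ticket types; columns: child, adult, senior citizen.
observed = [
    [22, 15, 17],
    [45, 39, 26],
    [70, 80, 34],
    [56, 78, 18],
]


def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 3) for value in row])
```

The first printed row should correspond to the theme park only row of the expected values table.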
e.g. for theme park only and child the calculation would be

(22 − 20.844)² / 20.844 = 0.064111303

Test statistic calculations

                   Child ticket   Adult ticket   Senior Citizen ticket
Theme park only    0.0641         2.7230         4.4276
Zoo only           0.1519         1.2515         1.2445
                   0.0148         0.0505         0.0264
                   0.1217         2.8497         4.0988

Summing these values gives χ² = 17.02.

We can now test whether this value is significant by using statistical tables. If you do not have a copy of these, you can find it on page 23 of this book of statistical tables.

To calculate the number of degrees of freedom we need to work out how many of the values in the table need to be known in order for all of the others to be known. In this table:

                   Child ticket   Adult ticket   Senior Citizen ticket   Total
Theme park only    22             15             17                      54
Zoo only           45             39             26                      110
                   70             80             34                      184
                   56             78             18                      152
Total              193            212            95                      500

we only need to know six of the values, for example the first two entries in each of the first three rows (22, 15, 45, 39, 70 and 80), as the other values could then be calculated by using the row/column totals. Hence the number of degrees of freedom is 6.

A quick way to calculate the number of degrees of freedom is

(number of rows − 1) × (number of columns − 1)

In this case, (4 − 1) × (3 − 1) = 3 × 2 = 6.
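The per-cell calculation, the summation and the degrees-of-freedom shortcut can be sketched together in Python. This is a minimal sketch, assuming the Task 1 observed table. The function names are invented for the illustration, and the result is computed without intermediate rounding, so it may differ slightly from a hand calculation that rounds each cell.

```python
# Observed frequencies from Task 1.
# Rows: the four ticket types; columns: child, adult, senior citizen.
observed = [
    [22, 15, 17],
    [45, 39, 26],
    [70, 80, 34],
    [56, 78, 18],
]


def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


statistic = chi_squared_statistic(observed)
dof = degrees_of_freedom(observed)
print(f"chi-squared statistic: {statistic:.2f}, degrees of freedom: {dof}")
```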
Testing at the 5% level, the tables show the critical value to be 12.59. The test statistic is greater than this critical value and so the result is significant. We therefore reject H₀ and conclude that there is evidence of an association between age group and ticket type.

Repeat these processes on the following problem:

Task 2

A market research company want to investigate whether some mobile phone operators are more popular with people in certain age groups. 500 people were asked which mobile phone operator they used. The results were as follows.

               Vodafone   O2    T-Mobile   Orange   Virgin   Other   Total
Under 18       23         22    18         36       5        7       111
18-29          30         24    26         29       10       8       127
30-39          32         18    23         14       8        7       102
40-59          26         14    13         25       6        10      94
60 and over    11         18    15         8        6        8       66
Total          122        96    95         112      35       40      500

Test, at the 5% level, whether there is any association between mobile phone operator and age group.
Solution

Following the same steps as in Task 1, you should have obtained the following tables:

Expected frequencies

               Vodafone   O2       T-Mobile   Orange   Virgin   Other   Total
Under 18       27.084     21.312   21.09      24.864   7.77     8.88    111
18-29          30.988     24.384   24.13      28.448   8.89     10.16   127
30-39          24.888     19.584   19.38      22.848   7.14     8.16    102
40-59          22.936     18.048   17.86      21.056   6.58     7.52    94
60 and over    16.104     12.672   12.54      14.784   4.62     5.28    66
Total          122        96       95         112      35       40      500

Here there is an expected frequency of less than 5 (4.62, for Virgin in the 60 and over group). In cases such as this the χ² approximation is less accurate, and so two categories are combined to ensure that all expected frequencies are 5 or more. In this case we could combine Virgin with Other to give the following expected frequencies table.

Expected frequencies

               Vodafone   O2       T-Mobile   Orange   Other   Total
Under 18       27.084     21.312   21.09      24.864   16.65   111
18-29          30.988     24.384   24.13      28.448   19.05   127
30-39          24.888     19.584   19.38      22.848   15.3    102
40-59          22.936     18.048   17.86      21.056   14.1    94
60 and over    16.104     12.672   12.54      14.784   9.9     66
Total          122        96       95         112      75      500

The same two columns of the observed frequency table would need to be combined too, before carrying out the test statistic calculations.

Test statistic calculations

               Vodafone   O2       T-Mobile   Orange   Other
Under 18       0.6158     0.0222   0.4527     4.9876   1.2986
18-29          0.0315     0.0060   0.1449     0.0107   0.0579
30-39          2.0323     0.1281   0.6762     3.4264   0.0059
40-59          0.4093     0.9079   1.3225     0.7388   0.2560
60 and over    1.6177     2.2402   0.4826     3.1130   1.6980

Degrees of freedom = (5 − 1) × (5 − 1) = 16.

χ² = 26.7 and the critical value in the statistical tables is 26.296. The value is significant and so we reject H₀ and conclude there is an association between age and mobile phone operator.
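The whole of Task 2, including the step of combining Virgin with Other, can be sketched in Python as a check on the hand calculation. This is an illustrative sketch under stated assumptions: the function names are invented, the merge is hard-coded to the last two columns because that is the choice made in the solution, and the critical value is entered as a constant read from the statistical tables rather than computed.

```python
# Critical value for 16 degrees of freedom at the 5% level,
# taken from the statistical tables quoted in the solution.
CRITICAL_VALUE_5_PERCENT_16_DF = 26.296

# Observed frequencies from Task 2.
# Rows: Under 18, 18-29, 30-39, 40-59, 60 and over.
# Columns: Vodafone, O2, T-Mobile, Orange, Virgin, Other.
observed = [
    [23, 22, 18, 36, 5, 7],
    [30, 24, 26, 29, 10, 8],
    [32, 18, 23, 14, 8, 7],
    [26, 14, 13, 25, 6, 10],
    [11, 18, 15, 8, 6, 8],
]


def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def merge_last_two_columns(table):
    """Combine the final two categories (here Virgin and Other) into one column."""
    return [row[:-2] + [row[-2] + row[-1]] for row in table]


def chi_squared_statistic(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Smallest expected frequency before and after combining the two columns.
smallest_before = min(min(row) for row in expected_frequencies(observed))
merged = merge_last_two_columns(observed)
smallest_after = min(min(row) for row in expected_frequencies(merged))

statistic = chi_squared_statistic(merged)
dof = (len(merged) - 1) * (len(merged[0]) - 1)
reject_h0 = statistic > CRITICAL_VALUE_5_PERCENT_16_DF

print(f"smallest expected frequency before merging: {smallest_before:.2f}")
print(f"smallest expected frequency after merging: {smallest_after:.2f}")
print(f"chi-squared statistic: {statistic:.2f}, degrees of freedom: {dof}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The statistic only just exceeds the critical value, so the decision here is sensitive to how the categories are combined.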