Chapter 5. Statistical Reasoning

1 Chapter 5 Statistical Reasoning

2 Measures of Central Tendency Back in Grade 7, data was described using the measures of central tendency and range. Central tendency refers to the middle value, or a typical value, of the data. Mean, median and mode were classified as measures of central tendency. Range was used to measure dispersion.

3 Mean The mean is the most commonly used measure of central tendency. When we take an average we are calculating a mean. It is simply the sum of the numbers divided by the number of numbers in a set of data. Find the mean of the following exam marks: 83, 51, 23, 77, 4, 88, 66, 56, 77, 55, 92

4 Median Median is the number in the middle when the numbers in a set of data are arranged in ascending or descending order. Find the median of the following exam marks: 83, 51, 23, 77, 4, 88, 66, 56, 77, 55, 92 If the number of numbers in a data set is even, then the median is the mean of the two middle numbers. Find the median of the following: 4, 7, 7, 8, 10, 11, 11, 12

5 Mode Mode is the value that occurs most frequently in a set of data. Find the mode of the following exam marks: 83, 51, 23, 77, 4, 88, 66, 56, 77, 55, 92 Find the mode of the following: 4, 7, 7, 8, 10, 11, 11, 12
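
The three measures above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only and are not part of the original slides.

```python
from statistics import mean, median, multimode

exam_marks = [83, 51, 23, 77, 4, 88, 66, 56, 77, 55, 92]
second_set = [4, 7, 7, 8, 10, 11, 11, 12]

# Mean: the sum of the values divided by the number of values.
exam_mean = mean(exam_marks)

# Median: the middle value of the sorted data; for an even count it is
# the mean of the two middle values.
exam_median = median(exam_marks)
second_median = median(second_set)

# multimode returns every value tied for the highest frequency,
# so a data set with two modes reports both.
exam_modes = multimode(exam_marks)
second_modes = multimode(second_set)

print("Exam marks: mean", round(exam_mean, 1), "median", exam_median, "mode", exam_modes)
print("Second set: median", second_median, "modes", second_modes)
```

Note that the second data set has two modes, because 7 and 11 each occur twice.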

6 Example Joe has taken an aptitude test 8 times and his scores are: 96, 98, 98, 105, 36, 87, 95 and 93. Is mean or median the better measure of central tendency?

7 Solution Mean is a better measure of central tendency if there is no outlier for the data. Is there an outlier in our data? 36 is the outlier of the data as it is far apart from other data values. So, it may skew, or throw off the central tendency. As the outlier influences the mean, the median is the better measure of the central tendency, in this case.
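
To see how much the outlier pulls the mean, Joe's scores can be summarized with and without the value 36. This is an illustrative sketch; dropping the outlier is done here only for comparison and is not a step the slides ask for.

```python
from statistics import mean, median

scores = [96, 98, 98, 105, 36, 87, 95, 93]

# Remove the suspected outlier purely to compare the two summaries.
without_outlier = [s for s in scores if s != 36]

mean_all, median_all = mean(scores), median(scores)
mean_trimmed, median_trimmed = mean(without_outlier), median(without_outlier)

print("All scores:      mean", mean_all, "median", median_all)
print("Without outlier: mean", mean_trimmed, "median", median_trimmed)
```

The mean moves by several points when 36 is removed, while the median barely changes, which is why the median is the safer summary here.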

11 Dispersion Suppose you wanted to describe how spread out a set of data is. For example, if you were describing the heights of students in your class to a friend, they might want to know how much the heights vary. Are all the men about 5 feet 11 inches, give or take an inch or so? Or is there a lot of variation, where some men are 5 feet and others are 6 feet 5 inches?

12 Dispersion Measures of dispersion like the range and standard deviation tell you about the spread of scores in a data set. Like central tendency, they help you summarize a bunch of numbers with one or just a few numbers. Data that is uniform is clustered around the central values, resulting in a dispersion that will be small.

13 Which person was most consistent in their jumping heights? Brendan Which person would have the most dispersion in their jump heights? Alex

14 Dispersion The range, based on the two extreme values of the data set, is one measure of dispersion. Range = Max value - Min value It gives a general idea about the total spread of the data but gives no weight to the central values of the data.

15 Dispersion Mean = 70 Median = 70 Mode = n/a This tells us that they both have the same average. Is there any difference in their marks?

16 Range (Tim) = 20 Range (Luke) = 4 This tells us that Tim's grades were more spread out than Luke's grades. Which person has a greater variation from the mean? Tim's marks (range 20) are more spread out than Luke's marks (range 4). In other words, Luke's marks are more clustered around the mean than Tim's marks.

17 Dispersion Range is commonly used as a preliminary indicator of dispersion. However, because it takes into account only the scores that lie at the two extremes, it is of limited use. Later in this unit, a more complete measure of dispersion known as standard deviation will be introduced. It takes into account every score in a distribution.

18 In a science experiment, students tested whether compost helped plants grow faster by counting the number of leaves on each plant. The following results were obtained in the table above. (i) Calculate the mean, median and mode for each group. (ii) Calculate the range for each group. (iii) Describe the dispersion in the data for each group. (iv) Which group of plants grew better? Justify your decision.

19 Calculate the range of each group of numbers. Then explain why the range, by itself, can be a misleading measure of dispersion. Group A: 8, 13, 13, 14, 14, 14, 15, 15, 20 Group B: 7, 7, 8, 9, 11, 13, 15, 15, 17, 18
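
One way to see why the range can mislead is to compare it with the standard deviation, the fuller measure of dispersion that this unit introduces later. The sketch below is illustrative only; `value_range` is a helper name chosen for this example.

```python
from statistics import pstdev

group_a = [8, 13, 13, 14, 14, 14, 15, 15, 20]
group_b = [7, 7, 8, 9, 11, 13, 15, 15, 17, 18]

def value_range(data):
    """Range = maximum value minus minimum value."""
    return max(data) - min(data)

range_a, range_b = value_range(group_a), value_range(group_b)

# Population standard deviation uses every value, not just the extremes.
sd_a, sd_b = pstdev(group_a), pstdev(group_b)

print("Group A: range", range_a, "standard deviation", round(sd_a, 2))
print("Group B: range", range_b, "standard deviation", round(sd_b, 2))
```

Group A has the larger range only because of its two extreme values (8 and 20); most of its values sit between 13 and 15, so its standard deviation is smaller than Group B's.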

20 Histograms Histograms are useful for grouping data to show the distribution of the data. A histogram is basically a bar graph with the bars touching, and it represents continuous data. The data has to be sorted into evenly spaced frequency classification intervals (or bins).

21 When choosing the bin width we should have no more than 10 bins. To determine an appropriate bin width, find the range of the data, divide the range by 10, and round the quotient up to a whole number (that is, take the smallest integer greater than or equal to the quotient).

22 Consider the data below: 18, 2, 6, 3, 5, 11, 5, 27, 10, 10, 15, 11, 12, 10, 3, 16, 1, 20, 9, 1, 31, 11, 9 What is the range of the data? Range = largest value - smallest value Range = 31 - 1 = 30

23 The bin size is selected by dividing the range into equal amounts. Here we will divide the range by 10: 30/10 = 3 Thus we will have 10 equally spaced bins, each 3 units wide. As the intervals are required to be non-overlapping, the convention is that the lower limit of each interval includes the number. For example, the data value 3 would be placed in the 3-6 interval and not in the 0-3 interval.
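
The binning procedure can be sketched in a few lines. This is an illustrative example, not the slides' own method: it assumes the bins start at 0 (as the 0-3 and 3-6 intervals suggest) and that the quotient is rounded up when it is not a whole number; `bin_start` is a hypothetical helper name.

```python
import math
from collections import Counter

data = [18, 2, 6, 3, 5, 11, 5, 27, 10, 10, 15, 11, 12, 10, 3,
        16, 1, 20, 9, 1, 31, 11, 9]

data_range = max(data) - min(data)

# Assumption: round the quotient up; here 30 / 10 is exactly 3 either way.
bin_width = math.ceil(data_range / 10)

def bin_start(value, width):
    """Lower limit of the interval containing value (lower limit inclusive)."""
    return (value // width) * width

# Tally how many data values fall in each interval.
frequency = Counter(bin_start(x, bin_width) for x in data)

for start in sorted(frequency):
    print(f"{start}-{start + bin_width}: {frequency[start]}")
```

With bins starting at 0, the largest value (31) falls in a 30-33 interval, so it is worth checking where the extreme values land before fixing the number of bins.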

24 Sort the data into a frequency table Bins Tally Frequency

25 Construct a Histogram Refer to your notes!

26 Note: The bars on a histogram are touching. This shows that the bars represent a range of values, not just one number

27 Advantages They are good for showing distributions Disadvantages They are not good for comparing between 2 sets of data

28 Example

35 Frequency Polygon A graph made by joining the middle-top points of the columns of a frequency histogram They serve the same purpose as histograms, but are especially helpful for comparing sets of data.

36 The frequency polygon below shows the number of concert tickets sold in one day; ticket prices vary with seat location.
1. What is the frequency for the $20 price? a. 4 b. 5 c.
2. What is the cumulative frequency for the $30 price? a. 3 b. 16 c.
3. Which price has a frequency of 2? a. $15 b. $30 c. $35
4. Which price has the highest frequency? a. $10 b. $20 c. $25

37 Frequency Polygons Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.

38 The data come from a task in which the goal is to move a computer mouse to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The two distributions (one for each target) are plotted together in a frequency polygon.

39 What conclusion can be reached about the time it took to move the mouse on a small target? It generally took longer to move the mouse to the small target than to the large one.

40 Drawing Frequency Polygons people were interviewed in each of two communities and asked how many people lived in their household. The results are given in the table below. Draw a frequency polygon to illustrate these results and state any conclusions that can be reached. Number of people in household Community A Community B

41 Refer to your notes!

42 2. Coleman's recorded the number of items bought by each of 100 customers in two of its stores and recorded the data in the following table. Draw frequency polygons to illustrate these results and state any conclusions that can be reached. Number of items bought Store A Store B

43 Refer to your Notes!

44 Standard Deviation Range is a measure of dispersion. However range is only a measure of how spread out the extreme values are. So it does not provide any information about the variation within the data values themselves. Another measure of dispersion is standard deviation. It is useful for comparing two or more sets of data.

45 Standard Deviation Standard deviation is a measure indicating how the data is spread out or dispersed about the mean. It is a calculation that identifies the distance each data value is from the mean. A small standard deviation shows that most of the data is located close to the mean. A large standard deviation shows the data is more spread out.

46 Standard Deviation Consider the unit test marks for Steve and Sally Unit 1 Unit 2 Unit 3 Unit 4 Unit 5 Steve Sally Calculate the mean and range for each student Mean Range Steve: Sally: Whose marks are more dispersed? The students have a standard deviation of 7.1 & 6.4. Which student has which standard deviation? 7.1: 6.4: Whose marks were more consistent over the 5 unit tests?

47 If the data is more clustered around the mean, there is less variation in the data and the standard deviation is lower. If the data is more spread out over a large range of values, there is greater variation and the standard deviation would be higher.

48 Symbols used for descriptive statistics Mean (x with a line over it) Standard deviation (σ) Size of the data sample (n) Size means the number of data values

49 Calculating standard deviation (σ) Consider the following eight values: 2, 4, 4, 4, 5, 5, 7, 9 What is the size of the population? n = 8 To calculate the standard deviation, first compute the mean Mean = 5

50 Next we find the difference between each data point and the mean (i.e. subtract the mean from each data point) and square the difference. It is easier to do the calculation if we organize the numbers in a table:

Data (x) | Mean | Data - Mean | (Data - Mean)^2
2 | 5 | -3 | 9
4 | 5 | -1 | 1
4 | 5 | -1 | 1
4 | 5 | -1 | 1
5 | 5 | 0 | 0
5 | 5 | 0 | 0
7 | 5 | 2 | 4
9 | 5 | 4 | 16

The sum of the (Data - Mean)^2 column is 32. The symbol that we use for this sum is Σ.

51 We then find the average of these squares. Add them together and divide by the number of differences (i.e. divide by the size of the data, n): 32/8 = 4 The last step is to take the square root of our last calculation above: σ = square root of 4 = 2 So the standard deviation of our data is 2

52 Note: To show the sum of the squared differences we use: $\sum (x - \bar{x})^2$ To show the average we do: $\frac{\sum (x - \bar{x})^2}{n}$ For the last step of finding the standard deviation we use: $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
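
The calculation can be followed step by step in code. This is a minimal sketch of the population formula used in these slides (dividing by n); the function name `population_std_dev` is chosen for illustration.

```python
import math

def population_std_dev(values):
    """Standard deviation computed by dividing by n, as in the slides."""
    n = len(values)
    mean = sum(values) / n
    # Squared difference between each data value and the mean.
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Average of the squared differences.
    variance = sum(squared_diffs) / n
    # The square root of that average is the standard deviation.
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_std_dev(data)
print(sigma)
```

For the eight values above, the squared differences sum to 32, their average is 4, and the square root is 2.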

53 Example 1. Xander has a small class of math students. They wrote a quiz where they got the following marks marked out of 10: 8, 4, 5, 7, 6, 7, 7, 4 A)What is the size of the sample of data? B)What is the mean? C)What is the standard deviation?

54 Example 2. Xavier has 3 classes of academic grade 12 math. He took a random sample of the marks his students got on the midterm exam. 30, 60, 55, 23, 2, 70, 90, 62, 45, 73 A)What is the size of the sample of data? B)What is the mean? C)What is the standard deviation?

55 Which teacher had more dispersion in their grades? How do you know?
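
The two examples can be compared with Python's `statistics.pstdev`, which divides by n like the formula in this unit. This is an illustrative sketch; note that Xander's quiz is marked out of 10 while Xavier's exam marks appear to be on a larger scale (an assumption, since the slides do not state the exam total), so the raw standard deviations are not on the same scale.

```python
from statistics import mean, pstdev

xander = [8, 4, 5, 7, 6, 7, 7, 4]                   # quiz marks out of 10
xavier = [30, 60, 55, 23, 2, 70, 90, 62, 45, 73]    # midterm exam marks

summary = {}
for teacher, marks in (("Xander", xander), ("Xavier", xavier)):
    # Size of the sample, mean, and population standard deviation.
    summary[teacher] = (len(marks), mean(marks), pstdev(marks))
    n, avg, sd = summary[teacher]
    print(f"{teacher}: n = {n}, mean = {avg}, standard deviation = {sd:.2f}")
```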

56 For you to try!

59 Standard deviation and the Normal Distribution Curve

60 Normal Distribution Data can be "distributed" (spread out) in different ways. It can be spread out more on the left... or more on the right

61 Or it can be all jumbled up

62 But there are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a "Normal Distribution" like this: A Normal Distribution

63 The "Bell Curve" is a Normal Distribution. And the yellow histogram shows some data that follows it closely, but not perfectly (which is usual). It is often called a "Bell Curve" because it looks like a bell.

64 Bin Frequency

65 Check notes for frequency polygon! The frequency polygon is made by placing a dot at the center of each bar on the histogram that is made from the frequency table and connecting the dots with a curved line. The line is known as a frequency polygon Is the data normally distributed? In other words, is the frequency polygon bell shaped?

66 The Normal Distribution has: mean = median = mode at the centre of the curve and all of the other data is symmetric about the centre. 50% of values are less than the mean and 50% are greater than the mean

67 Many things closely follow a Normal Distribution: heights of people Heart rates errors in measurements blood pressure marks on a test We say the data is "normally distributed".

68 Which normal distribution curve has the largest standard deviation? Explain your reasoning A) B) C) D)

69 ANSWER: B) has the largest standard deviation because its curve is spread out the most.

70 Normal Distribution Approximately 68% of the data falls within one standard deviation of the mean. Approximately 95% of the data falls within two standard deviations of the mean. Approximately 99.7% of the data falls within three standard deviations of the mean.
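
These three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; `share_within` is a helper name chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The computed values are slightly more precise than the rounded 68%, 95% and 99.7% figures quoted in the slides.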

71 Normal Distribution Divide the normal bell shaped histogram into sections based on the mean and the standard deviations labelling the percentage of the population (or sample) that fits into each section

72 Check your notes for a more detailed version of the above graph!

73 Problems What percentage of a normally distributed population would be in the following intervals? A) Between μ and μ + σ: 34% B) Between μ - 2σ and μ: 47.5% C) Between μ - σ and μ + 3σ: 83.85%

74 Three different normal distribution curves representing different populations. Check your notes for the graphs!

75 1. Which population has the largest population mean? Explain how you arrived at your answer. 2. Which population has the largest population standard deviation? Explain how you arrived at your answer. 3. Which population has the smallest population standard deviation? 4. Which population has the smallest population mean?

76 1) Graph 3 because it has the largest middle value (8) (ie the mean) 2) Graph 1 because it is spread out the most as a result of it having the lowest peak 3) Graph 2 because it's spread out the least as a result of having the highest peak 4) Graph 1 because it has the smallest middle value (4) (ie the mean)

77 Heights The distribution of heights of American women aged 18 to 24 is approximately normally distributed with mean 65.5 inches and standard deviation 2.5 inches. 68% of these American women have heights between 65.5 - 2.5 (i.e. 63) and 65.5 + 2.5 (i.e. 68) inches. 95% of these American women have heights between 65.5 - 2(2.5) (i.e. 60.5) and 65.5 + 2(2.5) (i.e. 70.5) inches.

78 Heights Put those values on a normal curve. Check your notes for the graph. How many of those girls (if there were 20) should be taller than 68 inches? Above 68 inches --> 50% - 34% = 16% 0.16 x 20 = 3.2, so about 3 girls should be taller than 68 inches.

80 Graph 2 has the largest mean because it has the highest center value (63) Graph 3 has the smallest mean because it has the smallest center value (51) Graph 4 has the highest standard deviation because it is spread out the most (ie it has the most dispersion) Graph 2 has the smallest standard deviation because it is spread out the least (ie it has the least dispersion)

82 #20: Check your notes for the graph. Mean = 178 cm; standard deviation = 16 cm; number of years = 30. 162 cm to 194 cm --> 34% + 34% = 68% --> 0.68 x 30 = 20.4 146 cm to 210 cm --> 95% --> 0.95 x 30 = 28.5 Less than 130 cm --> 0.15% --> 0.0015 x 30 = 0.045

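83 #21: Mean = 3.0 kg; standard deviation = 200 g = 0.2 kg; number of babies born = 620. 2.8 kg to 3.2 kg: 34% + 34% = 68% --> 0.68 x 620 = 421.6 2.6 kg to 3.4 kg: 95% --> 0.95 x 620 = 589 (interval missing in source) kg: 13.5% + 2.35% = 15.85% --> 0.1585 x 620 = 98.27 Less than 2.6 kg: 2.35% --> 0.0235 x 620 = 14.57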

84 A data set of 50 items is given below with a standard deviation of 1.8 What is the mean, median and mode of this data? Is the data normally distributed? Why?

85 Mean = 5.96 Median = 6 Mode = 6 This data can be considered normally distributed because the mean, median and mode are all very close to 6 (i.e. the three values are not exactly equal to each other, but they are close enough for the data to be considered normally distributed).

86 Example The assessed value of housing in Bay Roberts is normally distributed. There are 1200 houses with a mean assessed value of $ The standard deviation is $ What percentage of houses have an assessed value between $ and $ ? How many houses have an assessed value between $ and $ ? Solution Draw the normal distribution curve and fill in the percentages, mean and standard deviation values. Percentage = 81.5% Number of houses = 0.815 x 1200 = 978

87 Example The graph below shows the scores on a standardized test, normally distributed, for two classes. What do these graphs have in common? Same mean, but different standard deviations. Which graph has the higher standard deviation? B, because it is spread out more. If a student scored 70, then what class are they likely in? B, because class A doesn't contain a mark of 70 according to the graph.

88 Z-SCORES

89 The importance of data analysis and its implications in the real world We may want to talk about particular scores within a set of data. We may want to tell other people about whether or not a score is above or below average. We may want to indicate how far away a particular score is from the average. We might also want to compare values from different sets of data and figure out which value is farther from the norm.

90 Example Jessica has a height of 1.75 m and lives in a city where the average height is 1.60 m and the standard deviation is 0.20 m. Maddison is 1.80 m and lives in a city where the average height is 1.70 m and the standard deviation is 0.15 m. Identify which of the two is considered to be taller compared to their fellow citizens.

91 We can't really answer the question by comparing the two heights directly, since the values come from two different normal distributions. However, there is a statistical tool called z-scores that can be used to determine which person is taller relative to their fellow citizens.

92 Z-Scores A z-score is a standardized value that indicates the number of standard deviations a data value is from the mean. z-score formula: $z = \frac{x - \mu}{\sigma}$ where x is the data value, μ is the mean, σ is the standard deviation, and z is the number of standard deviations that x is from the mean. Note: μ is the same as x with a bar over it for mean

93 Note: If z is positive the data value is to the right of the mean (bigger). If z is negative the data value is to the left of the mean (smaller).

94 Examples 1. The previous example with Jessica's height Jessica is considered taller 2. On her first math test, Susan scored 70%. The mean class score was 65% with a standard deviation of 4%. On her second test she received 76%. The mean class score was 73% with a standard deviation of 10%. On which test did Susan perform better with respect to the rest of her class? She did better on the first test
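
Both examples come down to computing and comparing z-scores. A minimal sketch, with `z_score` as an illustrative helper name:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Example 1: height relative to each person's own city.
jessica = z_score(1.75, 1.60, 0.20)    # about 0.75
maddison = z_score(1.80, 1.70, 0.15)   # about 0.67

# Example 2: Susan's two math tests relative to her class.
test_1 = z_score(70, 65, 4)            # 1.25
test_2 = z_score(76, 73, 10)           # 0.3

print(f"Jessica z = {jessica:.2f}, Maddison z = {maddison:.2f}")
print(f"Test 1 z = {test_1:.2f}, Test 2 z = {test_2:.2f}")
```

The larger z-score in each pair marks the value that stands further above its own mean, which matches the answers given: Jessica, and Susan's first test.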

95 Example IQ tests are normally distributed with a mean of 100 and a standard deviation of 15. What percentage of students achieved less than the 130 mark? Draw the normal distribution curve, labeling the mean and standard deviation. Verify your answer using z-scores.

96 When we convert normally distributed scores into z- scores, we can determine the probabilities of obtaining specific ranges of scores using a z-score table. This table is based on a normal distribution that has a mean of 0 and a standard deviation of 1.

97 Z-Score Table Page 580 in Text Values determined in the z-score table represent the percentage (as a decimal) of the data that is less than the z-score (standard deviations). To determine the percent of data with a z-score equal to or less than a specific value, locate the z-score on the left side of the table and match it with the appropriate second decimal place at the top of the table. For example, when z = 2 the table value is 0.9772, so 97.72% of the data is less than a value 2 standard deviations above the mean.

98 We are no longer restricted to data values that are exactly one, two or three standard deviations from the mean. We can now use z-scores to determine the percent of data that is also less than or greater than any particular data value or even between two data values.

99 Example IQ tests are normally distributed with a mean of 100 and a standard deviation of 15. 110 is the average IQ for people under 40 nowadays. What percentage of people have an IQ less than 110? Find the z-score and the percentage in the table. What percentage of people have an IQ greater than 110? Find the percentage by subtracting the value in the table from 1. This gives the percentage that is greater than the z-score.

100 Example IQ tests are normally distributed with a mean of 100 and a standard deviation of 15. 115 to 125 is the average IQ of a bright university student. What percentage of the population could be university educated? Calculate the z-scores for 115 and 125. Look up the values in the table. To find the percentage, subtract the smaller percent from the larger.

101 Example IQ tests are normally distributed with a mean of 100 and a standard deviation of 15. In order to be a member of Mensa you need to be in the top 2% of population. What would be the minimum IQ to be in Mensa? Look up the corresponding z-score for the closest percent that is less than or equal to 98% Rearrange the z-score formula to solve for x
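
The three IQ examples can be worked without a printed table by letting `statistics.NormalDist` play the role of the z-score table. This is a sketch under the stated assumptions (mean 100, standard deviation 15); the computed values may differ slightly from answers read off a table, because a table rounds z to two decimal places.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Percentage below 110: this is the "table value" for the z-score of 110.
below_110 = iq.cdf(110)

# Percentage above 110: subtract the table value from 1.
above_110 = 1 - below_110

# Percentage between 115 and 125: subtract the smaller value from the larger.
between_115_and_125 = iq.cdf(125) - iq.cdf(115)

# Minimum IQ for the top 2%: the score with 98% of the population below it.
mensa_cutoff = iq.inv_cdf(0.98)

print(f"below 110: {below_110:.2%}, above 110: {above_110:.2%}")
print(f"between 115 and 125: {between_115_and_125:.2%}")
print(f"top 2% starts near an IQ of {mensa_cutoff:.1f}")
```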

102 Application of standard deviation for making decisions Standard deviation is used in society to influence decisions such as: length of warranties provided by a company cost of insurance quality control and opinion polls. Since z-scores are affected by standard deviation, they will also play a role in such situations.

103 1. Consider the following example using two different automobiles, a Chevrolet Corvette and a Honda Civic. It is known that the mean value of repairs, due to an accident, for both cars is $3500. The standard deviation for the Corvette is $1200, while the standard deviation for the Civic is only $800. If the cost of repairs is normally distributed, determine the probability that the repairs costs will be over $5000 for both cars.

104 Calculate the z-scores (formula) for each: Corvette: z = (5000 - 3500)/1200 = 1.25 Civic: z = (5000 - 3500)/800 = 1.88 (rounded) Using the z-score table, determine the probability that the repairs will cost over $5000. Corvette: 1 - 0.8944 = 0.1056 = 10.56% Civic: 1 - 0.9699 = 0.0301 = 3.01%

105 You should recognize that the Corvette is about 3.5 times as likely as the Civic to have damages exceeding $5000. Therefore, its insurance premiums will be adjusted accordingly.
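
The insurance comparison can be reproduced as follows. This is an illustrative sketch with `prob_above` as a hypothetical helper name; it uses unrounded z-scores, so the Civic figure differs slightly in the last digit from the table-based 3.01%.

```python
from statistics import NormalDist

def prob_above(threshold, mean, std_dev):
    """Probability that a normally distributed cost exceeds threshold."""
    return 1 - NormalDist(mean, std_dev).cdf(threshold)

corvette = prob_above(5000, 3500, 1200)
civic = prob_above(5000, 3500, 800)

print(f"Corvette: {corvette:.2%}")
print(f"Civic:    {civic:.2%}")
print(f"Corvette is about {corvette / civic:.1f} times as likely to exceed $5000")
```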

106 2. Athletes should replace their running shoes before the shoes lose their ability to absorb shock. Running shoes lose their shock absorption after a mean distance of 640 km, with a standard deviation of 160 km. Mr. Sanders is an elite runner and wants to replace his shoes at a distance when only 25% of people would replace their shoes. At what distance should he replace his shoes?

107 Look up 25% or 0.25 in the z-score table: z = -0.67 Rearrange the z-score formula to solve for x: x = μ + zσ = 640 + (-0.67)(160) = 532.8 The distance at which the shoes should be replaced is 532.8 km.
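
Working backwards from a percentage to a data value is an inverse table lookup. The sketch below uses `NormalDist.inv_cdf` in place of the table; because it does not round z to two decimal places, its answer is a little under the table-based 532.8 km.

```python
from statistics import NormalDist

shoe_life = NormalDist(mu=640, sigma=160)

# z-score with 25% of the distribution below it (the table gives about -0.67).
z = NormalDist().inv_cdf(0.25)

# Rearranged z-score formula: x = mean + z * standard deviation.
replace_at_km = shoe_life.mean + z * shoe_life.stdev

print(f"z = {z:.4f}, replace shoes at about {replace_at_km:.1f} km")
```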

108 Solving a quality control problem 3. The ABC Company produces bungee cords. When the manufacturing process is running well, the lengths of the bungee cords produced are normally distributed, with a mean of 45.2 cm and a standard deviation of 1.3 cm. Bungee cords that are shorter than 42.0 cm or longer than 48.0 cm are rejected by the quality control workers. If 20,000 bungee cords are manufactured each day, how many bungee cords would you expect the quality control workers to reject?

109 Find the z-scores for 42 cm and 48 cm: z = -2.46 and z = 2.15 respectively. Look up the values in the z-score table for the z-scores: 0.0069 and 0.9842 respectively. The difference between those numbers is the percent of bungee cords that get kept: 0.9842 - 0.0069 = 0.9773, so 97.73% get kept and 100% - 97.73% = 2.27% get rejected. 20,000 x 0.0227 = 454 get rejected every day.
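
The rejection count can be estimated directly from the two cut-off lengths. This is a sketch assuming 20,000 cords per day, the figure used in the solution; it works with unrounded z-scores, so the result lands a few cords below the table-based 454.

```python
from statistics import NormalDist

lengths = NormalDist(mu=45.2, sigma=1.3)
cords_per_day = 20_000

# Cords rejected for being shorter than 42.0 cm.
too_short = lengths.cdf(42.0)

# Cords rejected for being longer than 48.0 cm.
too_long = 1 - lengths.cdf(48.0)

rejected_share = too_short + too_long
expected_rejects = rejected_share * cords_per_day

print(f"rejected: {rejected_share:.2%} of cords, about {expected_rejects:.0f} per day")
```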

110 Determining warranty periods 4. A manufacturer of personal music players has determined that the mean life of the players is 32.4 months, with a standard deviation of 6.3 months. What length of warranty should be offered if the manufacturer wants to restrict repairs to less than 1.5% of all the players sold?

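111 Look in the z-score table for the z-score that corresponds with 1.5% (0.0150): z = -2.17 Use the z-score formula and solve for x: x = 32.4 + (-2.17)(6.3) = 18.7 months (rounded) Round down and offer an 18-month warranty; a 19-month warranty would let slightly more than 1.5% of the players fail under warranty.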
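
A candidate warranty length can be checked against the 1.5% target. The sketch below is illustrative; `share_failing_within` is a hypothetical helper name, and the model simply assumes player lifetimes are normally distributed with the stated mean and standard deviation.

```python
from statistics import NormalDist

player_life = NormalDist(mu=32.4, sigma=6.3)

# Lifetime below which 1.5% of players fail.
cutoff_months = player_life.inv_cdf(0.015)

def share_failing_within(months):
    """Fraction of players expected to fail before the given number of months."""
    return player_life.cdf(months)

print(f"1.5% of players fail within about {cutoff_months:.1f} months")
for months in (18, 19):
    print(f"{months}-month warranty: {share_failing_within(months):.2%} need repair")
```

Comparing whole-month warranties on either side of the cut-off shows which one keeps repairs under 1.5%.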

112 5.6 Confidence Intervals An entire population is often difficult to study, so we often use a representative selection from the population called a sample. If the sample is truly representative, then the statistics generated from the sample will be the same as the information gathered from the population as a whole. Surveys are used to draw conclusions from a sample.

113 It is unlikely, however, that a truly representative sample will be selected. We need to make predictions of how confident we can be that the statistics from our sample are representative of the entire population. How well the sample represents the larger population depends on two important statistics, the margin of error and confidence level.

114 Example A rent-a-car company surveys customers and finds that 50 percent of the respondents say its customer service is very good. The confidence level is cited as 95 percent and the margin of error is ± 3 percent. This information indicates that if the survey were conducted 100 times, then 95 times out of 100, the percent of people who say service is very good will range ± 3 percent from the mean of 50 percent. The confidence interval is from 47% to 53%.

115 Confidence Level The confidence level describes the uncertainty of a sampling method. The confidence level describes the likelihood that a particular sampling method will produce a confidence interval that includes the true population value. It is often given as percentage (90%) or as 9 times out of 10

116 Example What is the percentage confidence level for a result that is accurate 19 times out of 20? 19/20 = 0.95, and 0.95 x 100% = 95%

117 Confidence Interval The confidence interval gives a range of values, based on a sample, in which a population value may be found. It is calculated by using the value from a sample and the margin of error to obtain the lower and upper limits of the interval.

118 Confidence Interval Confidence Interval: Lower Limit to Upper Limit To calculate the limits we use: Lower limit = Sample value - margin of error Upper limit = Sample value + margin of error

119 Example 1 A brand of battery has a mean life expectancy of 12.6 hours with a margin of error of 0.7 hours. Calculate the confidence interval. Lower limit = 12.6 - 0.7 = 11.9 Upper limit = 12.6 + 0.7 = 13.3 The confidence interval is 11.9 to 13.3 hours.

120 Example 2 A report claims that the average family income in a large city is $32,000. It states the results are accurate 19 times out of 20 and have a margin of error of ±$2,500. What is the confidence level in the situation? 19/20 = 95% Calculate the confidence interval: 32,000 - 2,500 to 32,000 + 2,500, so $29,500 to $34,500 is the confidence interval.

121 Determining the mean and margin of error from the confidence interval Suppose the expiration time of milk is said to occur between 13.5 and 14.7 days. The mean expiration time would be the midpoint of the interval. We find this by taking the average of the upper and lower limits: (Max + min)/2 = (14.7 + 13.5)/2 = 14.1 We can then calculate the margin of error by subtracting the mean from the upper limit, or by subtracting the lower limit from the upper limit and dividing by 2: 14.7 - 14.1 = 0.6 OR (Max - min)/2 = (14.7 - 13.5)/2 = 0.6

122 Example A botanist collects a sample of 50 iris petals and measures the length of each. He reports that he is confident, 19 times out of 20, that the average petal length is between 5.39 cm and 5.71 cm. Determine: A) The confidence level: 19/20 = 95% B) The confidence interval: 5.39 cm to 5.71 cm C) The mean of the sample: (5.39 + 5.71)/2 = 5.55 cm D) The margin of error: (5.71 - 5.39)/2 = 0.16 cm
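
Both directions of the calculation (limits from a sample value and margin, or sample value and margin from the limits) can be written as two small helpers. The function names are illustrative, not from the slides.

```python
def confidence_interval(sample_value, margin_of_error):
    """Return (lower limit, upper limit) for a sample value and its margin of error."""
    return (sample_value - margin_of_error, sample_value + margin_of_error)

def mean_and_margin(lower, upper):
    """Recover the sample mean (midpoint) and margin of error (half the width)."""
    return ((upper + lower) / 2, (upper - lower) / 2)

battery_low, battery_high = confidence_interval(12.6, 0.7)
milk_mean, milk_margin = mean_and_margin(13.5, 14.7)
petal_mean, petal_margin = mean_and_margin(5.39, 5.71)

print(f"battery life: {battery_low:.1f} to {battery_high:.1f} hours")
print(f"milk: mean {milk_mean:.1f} days, margin of error {milk_margin:.1f} days")
print(f"petals: mean {petal_mean:.2f} cm, margin of error {petal_margin:.2f} cm")
```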

123 Effects of Sample Size on Confidence Interval The sample size of a random sample will affect the margin of error and confidence interval. What effect does a larger sample size have on the margin of error? The increase in sample size would decrease the margin of error and, in turn, decrease the range on the confidence interval, thus, getting us closer to the actual result.

124 Note If all the population was sampled in a survey, then there would be no margin of error

125 Effects of Confidence Level on Confidence Interval. The required confidence level will affect the margin of error and confidence interval. What effect does a larger confidence level have on the margin of error? The increase in confidence level would increase the margin of error and, in turn, increase the range on the confidence interval, thus, better guaranteeing that the confidence interval contains the actual result.

126 In a national survey of 400 Canadians from the ages of 20 to 35, 37.5% of those interviewed claimed they exercise for at least four hours a week. The results were considered accurate within 4%, 9 times out of 10. Answer the following questions: (i) Are you dealing with a 90%, 95%, or 99% confidence interval? How do you know? (9/10 = 90%) (ii) How many people in the survey claimed to exercise at least four hours a week? (37.5% of 400 = 150) (iii) What is the margin of error? (4%) (iv) What is the confidence interval? 33.5% to 41.5%

127 (v) How would the confidence interval change if the sample size was increased to 1000 but the sample proportion remained the same? (gets smaller) (vi) If the writers of the article created a 99% confidence interval based on this data, how would it be different? How would it be the same? The confidence interval would get wider The mean (37.5%) would not change

128 Other problems 1. A recent study reports that 61% of students at Xavier High own a cell phone. The results of the study are reported to be accurate, 19 times out of 20, with a margin of error of 3.6 percent. (i) What is the confidence level? (19/20 = 95%) (ii) What is the confidence interval? 57.4% to 64.6% (iii) According to the study, if there are 258 students at Xavier High, what is the range of students who could own a cell phone? 148 to 167
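
Turning a percentage confidence interval into a range of people is a matter of applying both limits to the population size. A minimal sketch, with `count_range` as an illustrative helper name and rounding to whole people as an assumption:

```python
def count_range(percent, margin, population):
    """Apply the lower and upper percentage limits to a population size."""
    low = (percent - margin) / 100 * population
    high = (percent + margin) / 100 * population
    # Assumption: report whole people, rounded to the nearest person.
    return round(low), round(high)

xavier_low, xavier_high = count_range(61, 3.6, 258)
print(f"{xavier_low} to {xavier_high} students could own a cell phone")
```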

129 Other problems 2. The results of a survey show that 72% of residents in Mount Pearl own cell phones. The margin of error for the survey was 2.3%. If there are 24,600 people in Mount Pearl, what range of people own cell phones? 17,146 to 18,278