Statistics is the area of Math that is all about 1: collecting, 2: analysing and 3: reporting about data.

Unit 2: Statistical Analysis

Statistics is the area of math that is all about (1) collecting, (2) analysing and (3) reporting on data.

1: Collecting Data

Data is collected by carrying out surveys, either of the whole population or of just a sample of it.
* The population is the total group being studied, whether people, animals or things.
* Usually, information is collected only from a sample (part of the population).
* Collecting survey information from the whole population is called a census.
* Every sample is at risk of being biased (prejudiced) in some way. We try to avoid bias by picking our sample at random.
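Picking a sample at random can be sketched in Python as below. This is only an illustration: the population of 500 ID numbers and the sample size of 25 are made-up values, and the fixed seed is there only so that the run is repeatable.

```python
import random

# Hypothetical population: 500 ID numbers (made up for illustration).
population = list(range(1, 501))

# Fixed seed so the example is repeatable; leave it out in real use.
rng = random.Random(42)

# Draw 25 members without replacement, so nobody is picked twice
# and every member has the same chance of being chosen.
sample = rng.sample(population, 25)

print(sorted(sample))
```

Because the computer does the choosing, the person running the survey cannot favour one part of the population, even by accident.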

2: Analysing Data

Once we have collected some information, we have to start making sense of it all. This is done by analysing the data to look for patterns that can represent the large data set in a simple, easy-to-understand way. There are a few basic tools that can describe any data:
* Range: the highest value minus the lowest value. It shows how spread out the data is.
* Mean: the average of the scores. Add all the numbers and divide by how many numbers you have (n).
* Median: the middle number. Rank all values in order and pick the middle one. With an even number of values, take the average of the two middle values.
* Mode: the most common value in the set. There may be no mode, or there could be more than one mode, depending on which values repeat.

Mean, Mode and Median are all called Measures of Central Tendency, because they all try to find the normal, central type of score in a group of data.
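As a quick check on these definitions, here is a small Python sketch that works out all four measures. The scores are made up for illustration and do not come from the slides.

```python
import statistics

# Made-up scores for illustration (not from the slides).
scores = [4, 8, 6, 8, 3, 10, 6, 7]

ordered = sorted(scores)               # [3, 4, 6, 6, 7, 8, 8, 10]
data_range = ordered[-1] - ordered[0]  # highest - lowest = 10 - 3 = 7
mean = sum(scores) / len(scores)       # 52 / 8 = 6.5
median = statistics.median(scores)     # even count: average of 6 and 7 = 6.5
modes = statistics.multimode(scores)   # all values tied for most common: 6 and 8

print(data_range, mean, median, sorted(modes))
```

This data set is bimodal (6 and 8 each appear twice), which is why `multimode` is used here: it returns a list of every mode instead of just one.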

First... rewrite the data in order.
Range (highest - lowest): 30 cm
Mode (most common): bimodal, 157 cm & 173 cm
Median (middle): 173 cm
Mean (average): 170 cm

The Problems of Central Tendency

Mean, Mode and Median all try to describe the full set of data by finding a 'normal' score. They are called the Measures of Central Tendency. Each measure of central tendency can be accurate or thrown off (skewed), depending on the data set.
* Mode: there can be no mode, or multiple modes, in a set of data. The mode could also be a very high or very low score in the data and not show the normal trend.
* Mean: a couple of very high or low scores could pull the average up or down, even when all the other data is grouped close together.
* Median: the median is always the middle number and is a little harder to skew. But if you had a lot of numbers in a low range and a lot of numbers in a high range, the median would likely be either high or low, while the mean would be somewhere in the middle of the range.

The point is... any of the measures of central tendency can be thrown off, so you need to calculate all three and compare them in order to get a good understanding.
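To see how one extreme score pulls the mean much more than the median, here is a small Python sketch. The marks are made up for illustration.

```python
import statistics

# Made-up marks: five close together plus one very low score.
marks = [78, 80, 81, 82, 84, 20]

mean_all = statistics.mean(marks)      # 425 / 6, about 70.8
median_all = statistics.median(marks)  # average of 80 and 81 = 80.5

# The same marks with the very low score left out.
typical = [m for m in marks if m != 20]
mean_typical = statistics.mean(typical)      # 405 / 5 = 81
median_typical = statistics.median(typical)  # 81

print(round(mean_all, 1), median_all)
print(mean_typical, median_typical)
```

The single low mark drags the mean down by about 10 points, while the median moves by only half a point.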

Practice: Mean, Mode & Median

All 3 are good measures of central tendency, but the mean is probably the best because it is in the center of the three.

The mode here has the problem of being split, with one mode lower and one mode higher than the other two measures. Mean or Median would be the best to use here.

Weighted Mean

When not all pieces of data are equally important to the total, a weighted mean gives each one a fairer balance.

Ex: polling across Canada would have to weight Ontario and Quebec more than PEI and NL because of their much larger populations.

* Each category must be given a weight.

[Table: columns Province, Weighting, Score; rows Ont, Que, NL, AB; each row is worked as weighting x score = weighted score]

To calculate a weighted mean, you take each original score and multiply it by its weighting to get scores that can work together at the same ... The weighted scores are then added up and divided by the total of the weightings.
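A minimal Python sketch of the calculation follows. The weightings and scores are hypothetical, since the values from the table are not available here. The weightings are picked only to show larger provinces counting for more.

```python
# Hypothetical weightings and scores (the table's real values are not available).
# Here each weighting is a rough share out of 100.
weights = {"Ont": 50, "Que": 30, "AB": 15, "NL": 5}
scores = {"Ont": 60, "Que": 70, "AB": 40, "NL": 80}

# Multiply each score by its weighting, then add the results.
weighted_total = sum(weights[p] * scores[p] for p in weights)  # 6100

# Divide by the total of the weightings, not by the number of provinces.
weighted_mean = weighted_total / sum(weights.values())         # 6100 / 100 = 61.0

# For comparison: the ordinary mean treats every province equally.
plain_mean = sum(scores.values()) / len(scores)                # 250 / 4 = 62.5

print(weighted_mean, plain_mean)
```

The weighted mean ends up closer to the Ontario and Quebec scores, because those two carry most of the weight.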

A Trimmed Mean

There are times when you may want to remove some of the extreme scores from the data set so that you get a better picture of the normal scores. This would work for a student who normally scores really well but failed one assignment. The failing score would be an outlier for the data. By removing the outlier(s), the central tendency would better show the student's normal achievement. Removing the outliers and then calculating the average is called a Trimmed Mean.

Bob's scores on 3202 work: 92, 90, 86, 95, 100, 35, 91

The mean of the complete data set is...

The 35 is an outlier for this group. We would trim the mean by removing the highest and lowest scores (sorry, but this means losing the 100 too!) and calculating the mean from the rest of the scores...

NOTE: Trimming the mean should not change the mode or the median of the data. Because you took away the top and bottom scores, the middle number (median) should still be the same. And if your top or bottom number was the mode, it wasn't a very good mode to begin with!
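Bob's example can be worked through in Python as a check. This sketch trims exactly one score from each end, as described above. Other trimming rules exist and would give different results.

```python
import statistics

# Bob's scores from the example above.
scores = [92, 90, 86, 95, 100, 35, 91]

full_mean = sum(scores) / len(scores)       # 589 / 7, about 84.1

# Trim by dropping the single lowest and the single highest score.
trimmed = sorted(scores)[1:-1]              # [86, 90, 91, 92, 95]
trimmed_mean = sum(trimmed) / len(trimmed)  # 454 / 5 = 90.8

print(round(full_mean, 1), trimmed_mean)
print(statistics.median(scores), statistics.median(trimmed))  # 91 both times
```

The trimmed mean (90.8) is much closer to Bob's usual marks than the full mean (about 84.1), and the median stays at 91 either way.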

3: Reporting Your Findings

The best way to report on data is with a chart or graph. Any good graph should include:
* scales (x axis, y axis)
* a title at the top and a label on each axis
* the best style of graph for the data used

[Diagram: graph layout with the graph title at the top, and a scale and label on both the x axis and the y axis]

Percentile Ranking

Scores in a group can be reported according to how they rank against the rest of the group, and not by how they actually scored. This is done by using a Percentile.

Percentile: the percentage of scores that are below a certain mark. The top score in the group is treated as the 100th percentile and the bottom score as the 1st percentile (strictly, the formula below gives a little under 100 for the top score and 0 for the bottom score). Even if someone scored 90% on the assignment, if it was the lowest mark in the class it counts as the 1st percentile.

Example: Alison was in the 80th percentile with a mark of 72%. This means that 80% of the people scored lower than her.

Percentile Rank: the percentage of scores at or below the given score.

Percentiles are often used to compare results to the general population. For example, 10 minutes after you were born, your length and weight were measured and ranked against the general size of all other Canadian babies. If you were in the 90th percentile you were a huge kid, and if you were in the 20th percentile you were on the small side. Any time you score close to the 50th percentile you are average. The median is always exactly the 50th percentile.

To calculate the percentile, divide the number of scores that are lower than yours by the total number of scores in the data set, and then multiply by 100.

NOTE: It's best to rewrite the scores in order before working on percentiles.

Percentile = (number of scores lower than yours / total number of scores (n)) x 100

Ex: 22, 47, 57, 65, 66, 84, 75, 80, 88, 89, 91, 98
Percentile rank (89)

To find the 25th percentile: take the median (middle) of the bottom half of the data.
To find the 75th percentile: take the median of the top half of the data.
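The percentile formula above can be tried on the example scores with a short Python sketch. The function names `percentile_below` and `percentile_rank` are made up for this illustration. The first counts scores strictly lower than the given score (the formula above), and the second counts scores at or below it (the Percentile Rank definition).

```python
# The example scores from above (order does not matter for counting).
scores = [22, 47, 57, 65, 66, 84, 75, 80, 88, 89, 91, 98]

def percentile_below(data, score):
    """Percent of scores strictly lower than `score`."""
    lower = sum(1 for x in data if x < score)
    return lower / len(data) * 100

def percentile_rank(data, score):
    """Percent of scores at or below `score`."""
    at_or_below = sum(1 for x in data if x <= score)
    return at_or_below / len(data) * 100

print(percentile_below(scores, 89))          # 9 of 12 scores are lower: 75.0
print(round(percentile_rank(scores, 89), 1)) # 10 of 12 are at or below: 83.3
```

The two definitions give different answers for the same score, so it is worth stating which one is being used when reporting a percentile.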

[Figure: percentile chart for all baby boys born in North America]

A newborn baby is compared to all other children, not to a weight or height goal.

How do you find the percentiles?

First write the data in ascending order...
* the median is the 50th percentile
* the 'lower median' is the 25th percentile
* the 'upper median' is the 75th percentile
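Here is a Python sketch of the median-of-halves method, reusing the twelve scores from the percentile example. Leaving the middle value out of both halves when the count is odd is an assumption made here, since the slides do not cover that case.

```python
import statistics

# The twelve scores from the percentile example.
scores = [22, 47, 57, 65, 66, 84, 75, 80, 88, 89, 91, 98]

ordered = sorted(scores)  # first, write the data in ascending order
half = len(ordered) // 2

lower_half = ordered[:half]
# For an odd count, skip the middle value so it sits in neither half
# (an assumption; with 12 scores nothing is skipped).
upper_half = ordered[half + len(ordered) % 2:]

p50 = statistics.median(ordered)     # average of 75 and 80 = 77.5
p25 = statistics.median(lower_half)  # average of 57 and 65 = 61.0
p75 = statistics.median(upper_half)  # average of 88 and 89 = 88.5

print(p25, p50, p75)
```

Statistical software often uses slightly different interpolation rules for percentiles, so its answers may not match this hand method exactly.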

Section 2.3: Scatterplots

A scatterplot is a 2-axis graph that compares 2 variables.
* Each piece of data makes one point on the graph.
* We look for a trend (pattern) in the data.

[Diagram: scatterplot with "The line of best fit" drawn through the points; axes labelled dependent variable and independent variable; regions marked interpolate and extrapolate]

The Line of Best Fit
* doesn't have to include the origin
* should have equal numbers of points above and below the line
* should try to fit the flow of the dots
* You can predict values beyond the range of the data: extrapolating.
* You can predict values that are within the range of the data: interpolating.
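In class the line of best fit is drawn by eye. As a rough computer equivalent, the Python sketch below uses the least-squares method, which is a different technique from eyeballing the line. The forearm and hand measurements are made up for illustration, and `predict` is a helper name invented here.

```python
# Made-up measurements in cm, for illustration only.
forearm = [20, 22, 24, 26, 28]  # independent variable (x)
hand = [15, 17, 17, 19, 20]     # dependent variable (y)

n = len(forearm)
mean_x = sum(forearm) / n
mean_y = sum(hand) / n

# Least-squares slope and intercept of the line y = intercept + slope * x.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(forearm, hand))
         / sum((x - mean_x) ** 2 for x in forearm))
intercept = mean_y - slope * mean_x

def predict(x):
    """Read a y value off the fitted line for a given x."""
    return intercept + slope * x

inside = predict(25)   # interpolating: 25 is within the data range 20 to 28
outside = predict(32)  # extrapolating: 32 is beyond the largest x in the data

print(round(slope, 2), round(intercept, 2))
print(round(inside, 1), round(outside, 1))
```

Extrapolated values deserve more caution than interpolated ones, because there is no data out there to confirm that the trend continues.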

Describing the Pattern (Correlation)

To describe the pattern we can use a correlation term, but we can also describe it in terms of the variables:
"As the forearm length (independent variable) increases, the hand length (dependent variable) increases."