Descriptive Statistics Tutorial

Measures of Central Tendency

Mean, Median, and Mode

Statistics is an important aspect of most fields of science, and toxicology is certainly no exception. Statistics matters in a field such as this because no two individuals are the same: there is a lot of variation in a population. To make sense of this variation, it is sometimes necessary to generalize, and this is often done by providing an average. Given the range of salaries in the safety and health profession, for instance, one might argue that the average salary in the US is around $70,000 per year. There are some questions that might be asked about this figure, however. For instance, how was this average determined? There are actually a number of ways to determine an average, and each is used for a different reason.

Usually when somebody discusses salaries, they use an average known as the median. The median is determined by lining up all the salaries in numerical order and picking the middle number. If there is an even number of values in the sample, the median equals the sum of the middle two values divided by two. The median is used for salaries because salary data often contain extreme outliers. There may be safety professionals, for instance, who have made millions of dollars as a result of owning a very successful consulting firm. If a person is looking for a measure of central tendency here, the median is largely unaffected by these extreme values because one is simply lining up the numbers and picking the one in the middle.

Here is an example. Calculate the median for the following data, which represent the number of hours rodents in a sample went before positively responding to a specific experimental treatment:

2, 11, 14, 15, 15, 17, 17, 17, 17, 19, 19, 20, 20, 21, 22, 378

There are 16 numbers here. The two middle numbers (the 8th and 9th) are 17 and 17.
17 + 17 = 34, and 34 / 2 = 17, so the median is 17.

Notice that if we were to calculate the average known as the mean, which is found by adding all the numbers together and dividing by the number of values in the sample, we would get 624 / 16 = 39. Notice how the one large number (378) moved the average from the middle value of 17 to something much larger. This is why, when we know there might be unusual outliers, we often report the median as the measure of central tendency. Sometimes outliers are also removed in order to conduct standard statistical analysis. Perhaps something unusual was going on with the rodent that failed to respond for 378 hours, something that did not really reflect the response of the general population.

One other measure of central tendency that is sometimes used is the mode. This is used when a person wants to know which value is repeated the most in a sample. Looking at our example above, for instance, we can see that the number 17 is repeated 4 times. No other number is repeated that many times, so the mode is 17. Note that a sample can have more than one mode if two or more values tie for the most repetitions.

Measures of Dispersion

The Range

When we talk about measures of dispersion, we are referring to how widely the data are spread. There are a number of ways to measure this. One way is to report the range, which is the difference between the highest value and the lowest value. Let's look at our sample above again with the outlier removed:

2, 11, 14, 15, 15, 17, 17, 17, 17, 19, 19, 20, 20, 21, 22

In this situation, the highest number is 22 and the lowest number is 2, so the range is 20. Note that if we kept the outlier in the sample, the range would be much larger (378 - 2 = 376). The range is the most basic measure of dispersion and does not convey a lot of information on its own. It is like asking someone to describe his or her daily driving habits and getting a response like "Sometimes I do not drive at all, and the most I drive is 20 miles on a given day." This does not convey much information about driving habits, does it? However, if the person indicated he or she drove an average of 10 miles a day and also reported the range, we would have a much better idea of those driving habits.

Two more commonly used measures of dispersion are the variance and the standard deviation. They are related because the latter is the square root of the former. That is, take the square root of the variance and you get the standard deviation.

In order to discuss these concepts further, it is important to first consider the normal distribution, commonly known as the bell curve. Chances are you have seen something like this, or at least heard of the bell curve, somewhere along the way:

[Figure: bell curve (normal distribution)]

Often when we take a measurement of different subjects in a sample, the distribution is best represented as a bell curve as depicted above. Let's consider a variable like weight, for example, and its relationship with blood alcohol dosages. If we weigh 3,000 randomly selected adult males, we will likely note that there are a few individuals who are much lighter than average and a few who are much heavier, but most people will be somewhere in the middle, much closer to the average. This is why the bell curve is shaped the way it is. The left tail represents the few very light individuals in the population and the right tail represents the few very heavy individuals, but most males fall somewhere in the middle, around the mean, which is represented by the middle and highest point of the curve.
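Before moving on to the standard deviation, the measures covered so far (median, mean, mode, and range) can be checked against the rodent data. The following is only a minimal sketch using Python's standard `statistics` module; the names `hours`, `trimmed`, and `value_range` are illustrative choices, not part of the original tutorial, and `multimode` assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Hours before response, copied from the rodent example in the text.
hours = [2, 11, 14, 15, 15, 17, 17, 17, 17, 19, 19, 20, 20, 21, 22, 378]

# The same sample with the 378-hour outlier removed, as in the range example.
trimmed = [h for h in hours if h != 378]


def value_range(values):
    """Range: the highest value minus the lowest value."""
    return max(values) - min(values)


# Central tendency: the median ignores the outlier, the mean does not.
print("median:", median(hours))
print("mean with outlier:", mean(hours))
print("mean without outlier:", mean(trimmed))

# multimode returns every value tied for most frequent, so a sample
# with more than one mode yields a list with more than one entry.
print("mode(s):", multimode(hours))

# Dispersion: the range is very sensitive to the single outlier.
print("range without outlier:", value_range(trimmed))
print("range with outlier:", value_range(hours))
```

Using `multimode` rather than `mode` is a deliberate choice here, since it reports ties instead of silently picking one value.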

Variance and Standard Deviation

The measure of dispersion most commonly used in the sciences is the standard deviation. This value roughly represents the typical distance of the data from the mean. Let's take a look at the normal distribution diagram below, where we set our mean to zero. Please note that the Greek letter σ, or sigma, is used here to represent the population standard deviation.

[Figure: normal distribution with the mean at zero and distances from the mean marked in units of σ. Source: Standard Deviation (n.d.)]

In a normally distributed sample, one standard deviation on either side of the mean accounts for about 68.2% of the observations. Considering our adult male sample above, that would mean that 68.2% of all males, or about 2,046 of the 3,000 individuals, weighed in within one standard deviation of the mean.

It is also important to note that the standard deviation is calculated from the data in the sample, so its value changes with the amount of variation within the sample. Let's say we have two samples of 3,000 individuals from different countries whom we will weigh for a study. We calculate a standard deviation of 35 pounds for group A and 10 pounds for group B. What these two standard deviations tell us is that there is a lot more variation in group A than in group B. In group B, most people (68.2%, in fact) are within 10 pounds of the average. In group A, however, the same share of people is spread over a much wider band, within 35 pounds of the average. This is why the standard deviation is frequently reported along with the mean: it gives the reader an idea of how much variation exists in the population or sample being considered. If one hears, for instance, that the average weight of a group is 170 pounds with a standard deviation of 10.7 pounds, that gives a much better picture than the mean alone.
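The one-standard-deviation rule can be checked numerically. The sketch below assumes a perfectly normal distribution and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `fraction_within` and `one_sd_band` are hypothetical and introduced only for illustration. The exact share is closer to 68.27%, which the tutorial quotes as 68.2%, so the headcount comes out slightly above 68.2% of 3,000.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1, as in the diagram.
standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def one_sd_band(mean, sd):
    """Interval from one standard deviation below to one above the mean."""
    return mean - sd, mean + sd


sample_size = 3000  # the adult male sample from the text
share = fraction_within(1)

print(f"share within one standard deviation: {share:.2%}")
print("expected individuals out of 3,000:", round(share * sample_size))

# The reported group from the text: mean 170 pounds, standard deviation 10.7.
low, high = one_sd_band(170, 10.7)
print(f"about {share:.0%} of the group weighs between {low:.1f} and {high:.1f} pounds")
```

A real sample will only approximate these figures, since measured weights are never exactly normally distributed.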
Calculating Standard Deviation

We already know how to calculate the mean, but as indicated above, it is also good to be able to determine the standard deviation of a sample. Calculating the standard deviation takes quite a bit of work, but it is doable if undertaken step by step. The first step is to calculate the variance, which is the square of the standard deviation. Once the variance is calculated, one only needs to press the square root button on the calculator to get the standard deviation.

Also, above we pointed out that the standard deviation of a population is typically depicted with the Greek letter σ. When dealing with samples (as opposed to an entire population), the value is reported as an italicized letter s. Since the sample variance is simply the square of the sample standard deviation, it is commonly written as s². For the purposes of this tutorial, we will not get into too much discussion regarding the usefulness of the variance except to indicate that it needs to be calculated first in order to determine the standard deviation.

Here is how we calculate the variance of a sample, where each x_i is one value, x̄ is the sample mean, and N is the sample size:

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$

And here is the formula for calculating the standard deviation of a sample:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$

(Formulas from Standard Deviation (n.d.))

Both of these look like fairly scary formulas, but they are nothing to be overly concerned about. For our purposes, we will calculate the variance using a step-by-step process and save deciphering these seemingly complex equations for another time. Here are the steps for calculating a sample standard deviation (the formulas above actually instruct us to perform these steps):

1. Calculate the mean.
2. List all of the values of the sample in a column.
3. Subtract the mean from the value in each row.
4. Square the result in each row.
5. Add all of these squared values together.
6. Divide the sum of the squared values by the number of values in the sample minus 1 to get the variance.
7. Take the square root of the variance to get the standard deviation.

OK, now that you know the steps, let's give this a whirl. Say we are going to do a preliminary study on the diameter of a skin rash exhibited on rodents after being treated with a very small quantity of chemical A. Since we are just trying to get a general idea of the response, we only treat eight individuals. We get the following values in millimeters:

10, 8, 10, 8, 6, 4, 12, 6

The first step is to find the mean. So we add the values together and divide the total by the number in the sample (8). We end up with a mean of 8 mm (64 / 8 = 8). The next step is to list each value in the sample and subtract the mean:

10 - 8 = 2
8 - 8 = 0
10 - 8 = 2
8 - 8 = 0
6 - 8 = -2
4 - 8 = -4
12 - 8 = 4
6 - 8 = -2

The next step is to square each result we just obtained:

2² = 4
0² = 0
2² = 4
0² = 0
(-2)² = 4
(-4)² = 16
4² = 16
(-2)² = 4

The next step is to add these results:

4 + 0 + 4 + 0 + 4 + 16 + 16 + 4 = 48

Finally, to get our variance, we divide this total by N - 1 (the sample size minus 1):

48 / 7 = 6.86

So 6.86 is our variance. Now do you remember how to determine the sample standard deviation from the sample variance? Correct! You just hit the square root button on your calculator:

√6.86 = 2.62

Based on this information, we would report our sample mean as 8.0 mm and our standard deviation as 2.62 mm. This is, of course, a very small sample, and it clearly does not reflect a perfectly normal distribution. A larger sample needs to be obtained, and it is likely that a much larger sample would be much closer to normally distributed. But regardless, you now know the steps for calculating the mean, variance, and standard deviation.

References

Standard Deviation. (n.d.). Wikipedia. Retrieved from: https://en.wikipedia.org/wiki/standard_deviation

Note: Although using Wikipedia sources is typically discouraged as they have the potential to be unreliable, this tutorial utilized such sources primarily to obtain images. However, the writer of this tutorial is well versed in the use of statistics and therefore able to evaluate the reliability of the images used.
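Supplementary sketch: as a cross-check on the skin rash example worked through in this tutorial, the seven steps for the sample standard deviation can be written out in Python. This is one possible arrangement of the steps rather than a definitive implementation; the names `diameters_mm` and `sample_variance_by_steps` are illustrative, and the comparison against `statistics.variance` and `statistics.stdev` is an added sanity check that is not part of the original tutorial.

```python
from math import isclose, sqrt
from statistics import stdev, variance

# Rash diameters in millimeters for the eight treated rodents.
diameters_mm = [10, 8, 10, 8, 6, 4, 12, 6]


def sample_variance_by_steps(values):
    """Follow steps 1 to 6 of the tutorial and return the sample variance."""
    n = len(values)
    mean = sum(values) / n                     # step 1: the mean
    deviations = [x - mean for x in values]    # steps 2-3: subtract the mean
    squared = [d ** 2 for d in deviations]     # step 4: square each result
    total = sum(squared)                       # step 5: add the squares
    return total / (n - 1)                     # step 6: divide by N - 1


s_squared = sample_variance_by_steps(diameters_mm)
s = sqrt(s_squared)                            # step 7: square root

print("variance:", round(s_squared, 2))
print("standard deviation:", round(s, 2))

# The standard library uses the same N - 1 divisor for sample statistics.
print("matches statistics.variance:", isclose(s_squared, variance(diameters_mm)))
print("matches statistics.stdev:", isclose(s, stdev(diameters_mm)))
```

Dividing by N - 1 instead of N is what distinguishes the sample formulas from the population versions (`statistics.pvariance` and `statistics.pstdev`).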