How to do Statistics in Excel

Size: px

Start display at page:

Download "How to do Statistics in Excel"

Kelley Walsh
6 years ago
Views:

1 How to do Statistics in Excel Introduction The best answer is Don t. Get someone else to do it. But since there is no American who has a life and also likes statistics, this may not be easy. And it likely won t be cheap. On the other hand, you could do it yourself. Admittedly, that is not a fun choice. This tutorial, though, may reduce the pain and even enable you to understand what question you are asking that needs statistics. The most common flaw in reasoning that people have when they turn to statistics is that they don t quite know how numbers are going to help them answer their question. And that causes fear and avoidance. Overview. There is help. Statistics is a branch of Probability theory. Probability is actually the underlying base of knowledge in all the sciences, including social sciences. We seldom can say, This causes that, with certainty. We can, though, say When this happens, the odds are that this other thing is going to happen. And the odds may be pretty good. How good is the question statistics will help answer. (But remember E. G. Bulwer-Lytton s comment, Fate laughs at probabilities, or in other words, the improbable MUST occur at some time.) (S.A. 5/14 80) To do statistics, you do not need to understand them, any more than you have to be able to fix a car in order to drive. You do, though, have to know or at least suspect when they can mislead you. As British Prime Minister Benjamin Disraeli allegedly said, There are lies, damned lies and statistics. (1) What do you study in statistics? A lot of numbers with something in common. Usually they come from a larger set of numbers called a population. The numbers are associated with each member of the population in some way that matters. For instance, students GPAs. You choose a group containing less than all the population to study, usually because the population is too large or inaccessible to study everything in it. The smaller group is called a sample. If the properties of the total population matter, then the properties of a sample can sometimes give you a good approximation of the properties of the whole population. So, if you want to know something about a lot of things, take some of the population and measure the thing you care about. Most likely what you learn from the sample will tell you something fairly reliable about the whole population. At this point, if somebody says anything about the binomial distribution or the Normal distribution, smile at them and thank them and go away from them. Unless they volunteer to do this job for you. Note: The spreadsheets listed in the tutorial will not work on web servers. If you want them, me at rfarnham@rxf3.com and I will send them. 1

2 What Excel has Too much. And it all looks frightening. Don t try to learn it, you are not and don t want to be a spreadsheet jockey. But the nice thing is that people have created spreadsheets that solve tasks you might need solved, so your job is reduced to finding something someone has built for you. Now come the examples. Some populations have most of their members near a central value, e.g. humans have height and there is an Average height. If you are a sauna manufacturer, the saunas must have higher ceilings in Scandinavia than in Japan. But knowing this central value or Average for height does not help you pick the height of your saunas made for Scandinavia, because bigger costs more and you don t want the cost to turn away buyers just because you made them big enough for the tallest Scandinavians. So, you need to know how tall the tallest are, and how many are that tall. Since you are interested in only the extreme values in the population, specifically the tallest, you don t need to bother with complicated statistical concerns such as Normal Distribution and Central Tendencies. Instead use: Non-Parametric Statistics. (The name comes because all the statistics you have learned stressed the importance of a Normal Frequency Distribution, the shape of which is specified by four Parameters, values that define some properties of the Normal Distribution. You don t need any of that for this statistical task.) 1. Start Excel and load the file named Sort. xslx The column of numbers represents 500 measurements of the height of male Scandinavians. Some are very tall, but not many. How many can you not accomodate in order to keep your costs down, while still keeping most people as potential customers? You need to know how many are how tall, so sort the heights in this column, highest to lowest. To do this, click on the green shaded choice at the top right labeled Sort 2

Even if the frequency distribution of heights is Normal and you could get the answer another way, the answer you got is still the right answer.

3 The box that appears offers you choices. You could choose what you want, but it is set now to sort the values in Columns 1. Click OK. If you want to sell saunas in Scandinavia that will accommodate 97 percent of the population, you are done. Even if the frequency distribution of heights is Normal and you could get the answer another way, the answer you got is still the right answer. If that s all you needed to know, go make saunas and don t spend any more time on statistics. (For a copy of any of the spreadsheets, me at rfarnham@rxf3.com and I will send it. They can be re-used with different numbers of your choice.) 3

4 Parametric statistics Central tendencies Now comes this Normal Distribution stuff. Among the properties of populations, there is often a tendency of values to cluster more in the middle than at the extremes. This central tendency is what we saw in the Scandinavian study, but we did not need it. Many times, though, this central tendency and the spread of values around it are valuable information. The measures you will use, then, are: Average, take all the heights of everybody and divide the sum by the number of people. Median, the height with the most people. Dispersion, = how many are close to the Average and how many are very different. Excel has the tools to find these numbers if you have the data about the population, or sample of the population, but remember you do not want to do it, you want someone to do it for you. I live to serve, so I have made a spreadsheet that does it for you. You still have to type in the numbers for the heights, but the calculations are automatic. 2. Load the file called FreqDistr.xlsx (usually also named a Bell Curve ) and if you want it personally, send me an at rfarnham@uncc.edu and I will send it to you. Why the name? Because the number of people with heights near the central value (Median or Average, depending) should be more than the number of people who are much taller or shorter. If we make a graph ( histogram ) where we plot the number of people at each height ( Frequency ) using the data from the first example, it should look like this: The frequency of each height is in the table Freq Distr. The histogram above graphs that data. 4

5 Now the value with the most people is visible (the Median), and an idea of the dispersion around the central values is visible. We need a measure for this dispersion, named the Standard Deviation and it is calculated for you below. (You probably heard from someone that the use of a Standard Deviation depends on whether the frequency distribution is Normal. I told you already not to go near that person. But now I will discuss it as a matter of personal privilege.) Figure 4 Skewness and Kurtosis can be used as tests for the likelihood that the distribution is Normal. The four parameters of the Normal Frequency Distribution are: Average. In a population with a Normal distribution, the Average and the Median ought to be about the same. Standard Deviation. If the population has a Normal distribution, the Standard Deviation has a useful property, discussed below. Skewness. This should be around zero in a population with a Normal distribution. Kurtosis. This should also be about zero in a population with a Normal distribution. If your distribution has the correct Skewness and Kurtosis, it probably is a Normal distribution. Do not worry about how these things are calculated. Just think of Excel as magic. The graph above does not show a very Normal distribution. Heights are in fact not Normally distributed. Further, Scandinavians consist of two populations, Teutonic and Lapp, who are shorter on average. Here is a truer picture of the distribution of heights in Sweden. 5

6 3. Load the file named FreqDistr2pops.xlsx Note: The Average and Median are close The Skewness is close to zero. The Kurtosis is a little low. But the distribution is NOT Normal. There are in fact two populations shown, Teutonics and the shorter Lapps, i.e. two populations distributions of heights have been mingled. Lapps Teutonics This shows why you have to always be suspicious of statistics. The problem is always in the assumptions, and one might have thought heights were a single population until one looks at the graph. Eyes are scientific instruments. But in sauna manufacturing, all you needed was the top three percent value and didn t care if more than one population was present. 6

7 Let s put this stuff to work. 4. Load the spreadsheet named QC.xlsx Example: If you have a machine that measures something, it will not always give you the exact same answer on the same item. There is always a little change in the machine each time that leads to a slightly different result. It is caused by things like a minor temperature or humidity change or a truck passing by and rattling the floor, and that is called Random Error. It is not Systematic Error, which it would be if you could find what caused the error and fix it. Random Error always has a Normal Distribution, and the Normal distribution has special properties. The most important property of a Normal distribution is that the number of values within: 1 Standard Deviation from the Average will be 68% of the total number of values. 2 Standard Deviations will be 95 percent of the total number of measurements. 3 Standard Deviations will be 99.7 percent. This is useful. Suppose you have a Control specimen of known value, say 100, that you are going to make the machine measure three every day, and after some days doing that you then calculate the Average and Standard Deviation of those measurements, you know pretty much how accurate the measurements can be with this machine. If you need more accuracy, you will have to get a different machine. Remember, this assumes a Normal Distribution of Random Error, which is true by definition. If the distribution of Error is not Normal, then there is Systematic Error, and you should fix the machine. Here is an example of Random Error: 7

8 8

9 We now have measures of the machine s accuracy and its inherent error. To use this, we follow the machine s analyses for the next 30 days, to see if it starts to show more error than before: +2 SD Average (Mean) - 2 SD (Green lines represent 2 Standard Deviations above and below the mean.) But the next 23 days: 9

Correlation: Sometimes you are measuring things with one instrument, but then you move to another place and are stuck with a different machine that is still reliable but does give different numerical

10 Correlation: Sometimes you are measuring things with one instrument, but then you move to another place and are stuck with a different machine that is still reliable but does give different numerical measurements from the old one. For instance, the old instrument measures pressure in pounds per square inch and the new one measures it in kilopascals by a different method. How do you know if the second one is as accurate as the old one? You compare their performance by measuring the same item on each one. But there is still Random Error in each measurement from each machine, so you need to measure how close their measurements are the same items on average and then decide if that is close enough. This process is called Regression Analysis. The most common form is Linear Regression, in which the assumption is that Random Error on one machine is a multiple of the Random Error on the other machine. So if the second machine gives a measurement 2.4 times the measurement of the first machine, then we can easily use the second machine and divide its answers by 2.4 to find out what the first machine would have gotten. Here is a tool to do this: 5. Load the file named Statistics Lin Rgrs begin.xlsx, remembering that the graph of a linear relationship will be straight: This is the empty template. 10

This is one populated with measurement data from an old machine and from a newer one: First, this is Linear Regression, and the Regression Line certainly looks linear.

While the calculations are complex and hidden, the regression line is characterized by this Slope multiplier and the Intercept, the number the new machine would show if the old machine showed zero.

11 This is one populated with measurement data from an old machine and from a newer one: First, this is Linear Regression, and the Regression Line certainly looks linear. Remember, that means the results on the new machine are reliably some multiple of the results on the old machine, and that multiplier is the Slope above, While the calculations are complex and hidden, the regression line is characterized by this Slope multiplier and the Intercept, the number the new machine would show if the old machine showed zero. The new machine could certainly be used, although the expected Reference Ranges that define normal would have to be multiplied by in order to be interpreted correctly. This is done for you at the upper right. 6. Load the file named Statistics Lin Rgrs Tilt.xlsx 11

The clients also will have difficulty interpreting these non-linear results in terms of their prior experience with the old machine s results. They might not like that. The machine is non-returnable.

12 Note that the higher value points seem to fall below the Regression Line. Uh Oh, something is not Linear. That means that at the upper levels, the results on the new machine are NOT multiples of those on the old machine, and our elegant calculations for new Reference Ranges are wrong. The clients also will have difficulty interpreting these non-linear results in terms of their prior experience with the old machine s results. They might not like that. The machine is non-returnable. Can we use it somehow? The graph looks like it starts to slant around 20 units on the old machine, 32 on the new one. Below that it looks linear. Check this, by erasing all the points that are above 20 on the old machine. (This has been done for you below, and the points deleted are encircled in red.) Good. We can report results that are less than 32 on the new machine with confidence, our clients can reliably interpret those results with the new reference ranges, and if we get results higher than 32 on the new machine, we can dilute the test specimen in half and then measure it. If the result is now in the linear range under 32, then we can double that diluted result and report that value safely. We will of course badmouth the machine among our peers, but it is not a total loss. 7. Load the file named Lin Rgrs Outlier begin.xlsx ) 12

Two points are clearly way outside the Regression Line. Is this data usable? Remember the question you want an answer to: Do the machines reliably correlate?

13 Two points are clearly way outside the Regression Line. Is this data usable? Remember the question you want an answer to: Do the machines reliably correlate? The data shows that in at least two cases one or the other machine has significantly erred. Are you asking if the machines always give the same answer on the same item? Not at all, there are better tests than this to answer that question. So the best way to find the correlation is to remove these points that cloud the issue, since they are clearly errors. 8. Do it now. Hover the mouse arrow over the erroneous points (called Outliers ) and the data item will be shown. Look in the data list and delete those points. Result is below: Great straight line. The machines have a linear relationship, if we know one machine s result then we know the other machine s as well. Work done. Discarding data is always risky in statistical analysis and must be done judiciously (= correctly ) but there are times when it needs to be done or the true underlying relationships are obscured. Always ask yourself first if doing this might misleadingly affect the answer you seek. Or, better, find someone who knows. Conclusion: There are many other tasks needing statistics, and one of the most common not mentioned here is the task of telling whether one population is different from each other, e.g did the patients taking the medicine do any better than the ones with the same disease who didn t take the medicine? Someone likely has told you that this is a test of the Null Hypothesis. Avoid them. But find someone somewhere who has built a spreadsheet application like the ones here, and borrow it, using it like the car you drive but can t fix. Or you may have more than two variables to deal with, with each having an effect on the other. This needs a method called Analysis of Variance, or ANOVA. Don t look it up in a book, for you will not be able to function for a week. At some point you encounter tasks that you yourself simply cannot do. So look helpless, beg, or just tell your boss you can t do it. That s better than trying to do it and thinking you were successful when you weren t. Some people have been known to try prayer, some alcohol, some Mountain Dew or Red Bull. Results vary. Good luck. 13

14 Glossary: Average. Divide the sum of all data items by the number of items. Synonym: Mean Correlation: A mathematical relationship between populations, such that when you pick a measurement from the first, you have a reliable approximation of what the correlating measurement should be on an item from the second population. Dispersion: How many data items are close to the Average and how many are very different. Linear Regression: An attempt to find a linear relationship between two populations of data. Linear means that one item is the multiple of the other, together with an added constant number. Mean: The data item that has equal numbers of other measurements above it as below it. Non-Parametric Statistics: Analysis of data that does not have a Normal distribution. Normal: An alternate definition of a valued and respected normal English Word. The Normal Distribution is not particularly normal, but as a definition in Statistics, it has some useful features. Parametric statistics: Analysis of data that has a Normal distribution. The parameters are Average, Standard Deviation, Skewness and Kurtosis, and they must meet specific numerical criteria to establish that the population does in fact have a Normal Distribution. Random Error: Error whichis unpredictable and for which no explanation can be found. It mathematically will average out to zero, and is a property of a system that cannot be changed Reference Range: A property of a population such that measurements outside the Reference Range are defined as abnormal. Standard Deviation: If the population has a Normal distribution, the Standard Deviation has a useful property, discussed below. Systematic Error: Error that is due to a change in the system and which persists, implying a duty to find the source and fix it. Quicksand terms: (If you act like you know them, you will be given unwanted tasks.) Analysis of Variance: An attempt to find correlations for populations that may have an effect on measurements of the other populations. binomial distribution: Do not seek this knowledge. Kurtosis: Normal: A measure of the Peakedness of the frequency distribution curve of a population. An alternate definition of a valued and respected normal English Word. It is a common frequency distribution that is neither normal or abnormal. Get over it. It does have some useful properties, as discussed in the text Null Hypothesis: The statement that two samples of data represent the same population. Parameter: A calculable property of a data population. There are four, called moments, and are useful for inference about a population that iis Normally distributed. Skewness: A measure of the asymmetrical slant of the frequency distribution curve of a population. Both skewness and kurtosis should be also be about zero in a population with a Normal distribution. But this alone is not absolute proof the distribution is Normal. There are other tests for this. 14

15 Illustrations: Kurtosis is the degree of peakedness of a distribution. A normal distribution is a mesokurtic distribution. A pure leptokurtic distribution has a higher peak than the normal distribution and has heavier tails. A pure platykurtic distribution has a lower peak than a normal distribution and lighter tails. Skewness is the degree of departure from symmetry of a distribution. A positively skewed distribution has a tail which is pulled in the positive direction. A negatively skewed distribution has a tail which is pulled in the negative direction. 15

Descriptive Statistics Tutorial

Descriptive Statistics Tutorial Measures of central tendency Mean, Median, and Mode Statistics is an important aspect of most fields of science and toxicology is certainly no exception. The rationale behind