The Kruskal-Wallis Test with Excel 2007 In 3 Simple Steps Kilem L. Gwet, Ph.D.
Copyright © 2011 by Kilem Li Gwet, Ph.D. All rights reserved.

Published by Advanced Analytics, LLC

A single copy of this document may be printed, and the printed copy may be shared with other interested parties. However, this document is NOT to be transmitted in any other form (electronic or mechanical, photocopying, recording, or information storage and retrieval system) without prior written permission from the publisher.

Advanced Analytics, LLC
PO BOX 2696
Gaithersburg, MD 20886-2696
e-mail: info@advancedanalyticsllc.com

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. However, it is sold with the understanding that the publisher assumes no responsibility for errors, inaccuracies or omissions. The publisher is not engaged in rendering any professional services. A competent professional person should be sought for expert assistance.

Publisher's Cataloguing in Publication Data:

Gwet, Kilem Li
The Kruskal-Wallis Test with Excel 2007 in 3 Simple Steps
A Practical Guide for Students and Professionals / By Kilem Li Gwet
p. cm.
1. Biostatistics 2. Statistical Methods 3. Statistics - Study - Learning.
I. Title.
Preface

The purpose of this document is to show you a few simple steps for implementing the Kruskal-Wallis test using Excel 2007¹. The proposed solution uses a user-friendly Excel macro program called KruskalWallis2007.xlsm, which requires no installation at all and does all the work for you. It is provided in the form of a stand-alone Excel workbook that you may use to input your data, perform your analysis, and save your results independently of the current configuration of MS Office. The only requirement is to have MS Office 2007 installed on your Windows system.

Section 1 provides a detailed description of the Excel solution proposed for implementing the Kruskal-Wallis test, along with screenshots, and assumes that you already possess a working knowledge of this statistical procedure. In case you need a technical description of the statistical techniques that underlie the Kruskal-Wallis test, you will find it in section 2. In addition to describing the Kruskal-Wallis test, section 2 shows interested readers the specific equations that were programmed.

The Kruskal-Wallis implementation proposed here is practical, intuitive, and easy to use. You use it by following the specific and detailed instructions provided in this document. This is the ideal solution for students and researchers who use the Kruskal-Wallis test only occasionally.

The free version of this document has the trial version of the macro program KruskalWallis2007.xlsm attached to it. This trial version can process a maximum of 5 observations per variable to allow you to evaluate the program.
The purchased version of this document contains the following:

• The full version of the macro program KruskalWallis2007.xlsm
• A complimentary PDF file of chapter 10 of my Statistics book² on the Analysis of Variance and related tests such as the Kruskal-Wallis and Friedman tests, or the ANOVA for dependent and independent samples.

Attachments to all versions of this document can be retrieved as shown in Figure 1.3 of section 1.

Feel free to experiment with the trial version of the macro program following the instructions provided in this document. You will discover how easy and practical the proposed solution is. If you have comments or questions, do not hesitate to contact the author using one of the following 2 methods:

E-Mail: info@advancedanalyticsllc.com
Mail: Advanced Analytics, LLC
      PO BOX 2696
      Gaithersburg, MD 20886-2696

Kilem Li Gwet, Ph.D.

¹ Although the solution proposed here has been successfully tested with Excel 2007 and not with older versions of Excel, I expect it to work smoothly with Excel 2003, and perhaps with some older versions of Excel as well.
² Gwet, K.L. (2011), The Practical Guide to Statistics: Basic Concepts, Methods and Meaning. Application With MS Excel, R, and OpenOffice Calc.
1 The Kruskal-Wallis Test Implementation with Excel 2007

This section shows you step by step how to go from the dataset to the Kruskal-Wallis test results and their interpretation using Excel 2007. The introductory section 1.1 provides a high-level overview of the proposed Excel solution, while section 1.2 describes the output you will obtain after conducting the test. In section 1.3, you will see the specific steps to follow for solving the problem.

1.1 Introduction

Excel itself does not implement the Kruskal-Wallis test. Even the Data Analysis ToolPak add-in, which is the collection of data analysis tools that comes with Excel, does not offer a module for conducting the Kruskal-Wallis test. That is why I developed the KruskalWallis2007.xlsm macro as a convenient tool for Excel users wanting to implement the Kruskal-Wallis test. Even people who do not normally use Excel, but have access to Excel 2007, could use the proposed method with no anticipated difficulty.

To illustrate how the KruskalWallis2007.xlsm macro works, I will use the data shown in Table 1.1. This table shows the productivity of 3 employees at a sporting goods store. Productivity is measured by the number of customers served on different days. The days were randomly selected independently for each employee. It appears for example that Jennifer served 74 customers on day 2. You as manager would like to know whether Peter, Jennifer, and Kosta all have the same productivity level. Table 1.1 provides a limited amount of information on the employees' productivity, based on a few days that you randomly selected. This will result in a measure of productivity that is subject to
sampling error. That is why I must use a statistical test that accounts for this error before I can determine whether all 3 employees are equally productive or not. To compare these 3 employees with respect to their performance measures, I decided to use the Kruskal-Wallis test rather than the traditional ANOVA, due to the concern that the observations may not follow the Normal distribution required to implement ANOVA.

Table 1.1: Number of customers served by employee

    Peter    Jennifer    Kosta
    57       65          45
    53       74          52
    49       69          47
    56       70          50
    68       49
             71

1.2 Output of the Kruskal-Wallis2007.xlsm Macro

The KruskalWallis2007.xlsm macro will produce the results shown in Figures 1.1 and 1.2 on the same Output worksheet. The first output table of Figure 1.1 displays basic summary statistics computed from Table 1.1. The second output table of Figure 1.1, on the other hand, displays the summary statistics based on the ranks¹, as well as the final test results highlighted in yellow.

The H statistic is the Kruskal-Wallis test statistic without the tie correction (see sub-section 2.2 of section 2 for more details about the correction factor, its purpose and its use).

The tie-corrected H statistic is what is typically used in most professional statistical packages.

The significance level (α) is a value you must supply.

The critical value is the threshold that must be exceeded by the test statistic for the null hypothesis of equality in performance between the 3 employees to be rejected. For this particular example, the adjusted and unadjusted test statistics both exceed the critical value, leading to the rejection of the null hypothesis.

¹ Note that the ranks are calculated with respect to the entire series of 15 data points.
The P-value. In this example, the p-value equals 2.66E-02, which is the scientific notation for $2.66/10^2 = 2.66/100 = 0.0266$. This quantity is smaller than the significance level, which will also cause the rejection of the null hypothesis. KruskalWallis2007.xlsm calculates the p-value based on the adjusted test statistic.

The third table on Figure 1.2 summarizes the pairwise comparison, often called the post-hoc analysis. This analysis is conducted when the null hypothesis of equality of means is rejected. Its objective is to identify the specific pair or pairs of factors with a mean rank difference that is statistically significant, and which may have caused the rejection of the global null hypothesis².

Figure 1.1. Results of the Kruskal-Wallis Test on Table 1.1 Data

² Sub-section 2.3 of section 2 provides a mathematical description of the pairwise comparison for interested readers.
The first 2 columns list the 2 factors being compared. The third column contains the absolute value of the difference in mean ranks between the two factors, while the fourth column shows the threshold to be exceeded before the absolute mean rank difference is deemed statistically significant. The last column reports the statistical significance of the respective pairwise comparisons.

Figure 1.2 shows the ranks associated with the 3 factors (Peter, Jennifer, and Kosta) in columns I through K. The KruskalWallis2007.xlsm macro displays these ranks for you to see the data that went into the calculation of the test statistic H.

Figure 1.2. Ranks Associated with Table 1.1 Data
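The ranks that feed the H statistic can also be reproduced outside Excel. The following is a minimal Python sketch, not the macro's own code. It assumes that the 15 values of Table 1.1 split into 5 observations for Peter, 6 for Jennifer and 4 for Kosta, an assignment chosen here because it is consistent with the p-value of 2.66E-02 reported in section 1.2. The function name `midranks` and the variable names are illustrative only.

```python
# Table 1.1 data. The 5/6/4 split of the 15 values across employees is an assumption.
samples = {
    "Peter": [57, 53, 49, 56, 68],
    "Jennifer": [65, 74, 69, 70, 49, 71],
    "Kosta": [45, 52, 47, 50],
}


def midranks(values):
    """Rank values in ascending order from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # sorted positions are 0-based, ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


# Rank the pooled series of all observations, then hand the ranks back to each employee.
pooled = [x for values in samples.values() for x in values]
pooled_ranks = midranks(pooled)

ranks_by_employee = {}
offset = 0
for name, values in samples.items():
    ranks_by_employee[name] = pooled_ranks[offset:offset + len(values)]
    offset += len(values)

mean_ranks = {name: sum(r) / len(r) for name, r in ranks_by_employee.items()}

for name in samples:
    print(name, ranks_by_employee[name], "mean rank:", mean_ranks[name])
```

The two observations equal to 49 are tied, so each receives the average rank 3.5 instead of ranks 3 and 4.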
1.3 Using the Kruskal-Wallis2007.xlsm Macro

You need to open the Kruskal-Wallis2007.xlsm Excel workbook attached to this PDF file. This is accomplished by clicking on the paper clip picture at the bottom left side of the PDF file you are reading, as shown in Figure 1.3. The first time you open this workbook, you may get a security warning message notifying you that some active content has been disabled. In this case, you should select Enable Content.

Figure 1.3. Opening the Kruskal-Wallis2007.xlsm Workbook (click on the paper clip picture to see attachments)

Once you open the workbook, you will see 4 worksheets named K-W(Input), K-W(Output), Sheet1, and Sheet2, as shown in Figure 1.4. Never change the names of the first 2 worksheets, K-W(Input) and K-W(Output), as this may cause the program to stop working. However, you can modify the names Sheet1 and Sheet2, or even add more worksheets as you like.
Figure 1.4. Launching the Kruskal-Wallis2007.xlsm Excel Macro

To conduct the Kruskal-Wallis test, follow the instructions in the next 3 simple steps:

1 Launching the Kruskal-Wallis2007.xlsm Excel Macro

Populate the K-W(Input) worksheet with the data that you want to analyze, as shown in Figure 1.4. Then click the gray Kruskal-Wallis Test button to launch the program. The program expects 3 columns of data or more. You may use any worksheet in the Kruskal-Wallis2007.xlsm workbook to capture your data, except K-W(Output).

2 Fill out the Dialog Form

Launching the program will display the dialog form of Figure 1.5.

(1) Select the worksheet containing your input data from the Worksheets list box control.
(2) Click inside the Input Range RefEdit control with the computer mouse.
(3) Using the mouse, select all the data to be analyzed, including column labels, as shown in Figure 1.5.
(4) Select the Columns radio button if your data is organized columnwise as in Figure 1.5, or select the Rows radio button otherwise.
(5) Select the Labels in First Row checkbox if the first row in the selected range contains the labels.
If the first row contains numeric values to be analyzed, then leave that box unchecked.
(6) Specify the significance level of the test, if you have one. Otherwise, a significance level of 0.05 will be used by default.

Figure 1.5. Completing the Kruskal-Wallis Dialog Form

3 Execute the Macro and Interpret the Results

After filling out the dialog form in step 2, you must click the Execute button, and look at the results in the K-W(Output) worksheet. The results will be similar to what is displayed in Figures 1.1 and 1.2.
2 The Kruskal-Wallis Test Procedure

The single-factor ANOVA requires the law of probability underlying the data to be reasonably close to the Normal distribution, and the population variances to show some homogeneity. You may not feel comfortable with one or both of these assumptions. An alternative approach widely used by researchers is the Kruskal-Wallis test. It was suggested by Kruskal (1952) and by Kruskal and Wallis (1952), and requires the data to be at least ordinal. This test was designed primarily to compare population medians, and not population means. Should this be a problem? Not really. In fact, if the underlying populations are symmetric, population means and medians become identical. Otherwise, the median is what you should be interested in.

The Kruskal-Wallis test generalizes the Mann-Whitney test¹ to 3 populations or more. Both tests are based on ranks instead of raw data, which makes them applicable to a variety of data types. In that sense, the Kruskal-Wallis test is another nonparametric test, which does not require the data to follow a particular law of probability. The downside is the loss of power due to the use of ranks in place of actual measurements.

2.1 Kruskal-Wallis Test Statistic

Assuming that you want to compare k population means, the hypotheses would be formulated as follows:

$$
\begin{cases}
H_0: \mu_1 = \mu_2 = \cdots = \mu_k, \\
H_a: \text{Not all } \mu_i \text{ are equal.}
\end{cases}
\tag{2.1}
$$

Whether $\mu_i$ represents the population mean or the population median is irrelevant when defining the Kruskal-Wallis test. You will not be estimating these parameters. Instead, you will merely be comparing them.

¹ The Mann-Whitney test is used for testing two population means when the t-test assumptions are violated.
Test Statistic

The Kruskal-Wallis test statistic will be denoted by $H$. It is calculated by first ranking all $n_t$ observations (all samples combined) in ascending order from 1 to $n_t$. Let $\bar{R}_i$ be the average of the ranks associated with sample $i$, and $\bar{R}$ the average of all ranks. The $H$ statistic is calculated as follows:

$$
H = \frac{12}{n_t(n_t+1)} \sum_{i=1}^{k} n_i \left(\bar{R}_i - \bar{R}\right)^2,
\tag{2.2}
$$

where $n_i$ is the number of observations in sample $i$. The law of probability associated with this test statistic is well approximated by the Chi-square distribution with $k-1$ degrees of freedom, when the null hypothesis is true.

If the null hypothesis is true, then all population distributions would be the same. Any random sample taken from any population would yield approximately the same mean rank $\bar{R}_i$. This would lead to a small value for the $H$ statistic, which tells you how far you expect the mean rank from any given sample to be from the overall mean rank.

Let $\chi^2_{\alpha,k-1}$ be the $100(1-\alpha)$th percentile of the chi-square distribution with $k-1$ degrees of freedom, where $\alpha$ is the significance level of the test. The decision rule for the Kruskal-Wallis test is formulated as follows:

$$
\text{Reject } H_0 \text{ if } H_{obs} \geq \chi^2_{\alpha,k-1}.
\tag{2.3}
$$

Validity Conditions

Several authors, including Conover (1980, 1999), Daniel (1990), and Marascuilo and McSweeney (1977), mentioned a series of assumptions upon which they indicated the Kruskal-Wallis test is based. These are:

(a) Each of the k samples is randomly selected from the population it represents. (A key aspect of this assumption is the need to avoid a large number of duplicates in the observations.)
(b) The k samples under study are independent. (A key aspect of this assumption is to avoid using the Kruskal-Wallis test on repeated measurements taken on several occasions. Special repeated-measure procedures should be used; see Gwet (2011).)
(c) The analytic variable used for ranking is continuous. (This assumption is not critical to ensuring the validity of the Kruskal-Wallis test, and is often ignored by practitioners. The concern here is that if the null hypothesis is true, you want all k populations to be homogeneous, which may not quite be the case with discrete variables. If you are dealing with an unusual dataset, you may want to pay attention to this issue.)
(d) The probability distributions underlying the sample data are identical in their shape. (The purpose of this assumption is to ensure that a rejection of the null hypothesis can only be attributed to the difference in means.)
(e) Each independent sample has a size of 5 or more. (This ensures the validity of the chi-square approximation.)

2.2 Tie Correction for the Kruskal-Wallis Test

Several sources in the literature recommend that the Kruskal-Wallis test statistic be divided by an adjustment factor $C$ to correct it for the presence of tied scores. Because $C$ never exceeds 1, the tie correction will increase the test statistic slightly, making the test more powerful. The correction factor $C$ is defined as follows:

$$
C = 1 - \frac{\sum_{i=1}^{s} \left(t_i^3 - t_i\right)}{n_t^3 - n_t},
\tag{2.4}
$$

where $s$ is the number of different series of ties, and $t_i$ the number of tied scores within the $i$th series. The resulting tie-corrected Kruskal-Wallis statistic $H_c$ is defined as follows:

$$
H_c = H/C.
\tag{2.5}
$$

The law of probability of the tie-corrected statistic is expected to be closer to the chi-square distribution with $k-1$ degrees of freedom than the law of probability of the uncorrected statistic.
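Equations (2.2) through (2.5) can be checked numerically with a short Python sketch such as the one below. It is an illustration under stated assumptions, not the code of the KruskalWallis2007.xlsm macro. The data are the Table 1.1 values under the assumed split of 5, 6 and 4 observations for Peter, Jennifer and Kosta, and the function name `kruskal_wallis` is hypothetical. The p-value and critical value use the closed form of the chi-square distribution with 2 degrees of freedom, so that part of the sketch only applies when k = 3.

```python
import math
from collections import Counter


def kruskal_wallis(samples):
    """Return (H, C, H_c) following equations (2.2), (2.4) and (2.5)."""
    pooled = [x for sample in samples for x in sample]
    n_t = len(pooled)

    # Midranks: a value tied t times that occupies sorted positions p+1..p+t
    # receives the average rank p + (t + 1) / 2.
    counts = Counter(pooled)
    rank_of, position = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2
        position += t

    overall_mean_rank = (n_t + 1) / 2  # average of the ranks 1..n_t
    spread = 0.0
    for sample in samples:
        mean_rank = sum(rank_of[x] for x in sample) / len(sample)
        spread += len(sample) * (mean_rank - overall_mean_rank) ** 2
    h = 12 / (n_t * (n_t + 1)) * spread                 # equation (2.2)

    tie_sum = sum(t ** 3 - t for t in counts.values())  # untied values contribute 0
    c = 1 - tie_sum / (n_t ** 3 - n_t)                  # equation (2.4)
    return h, c, h / c                                  # equation (2.5)


# Table 1.1 data; the 5/6/4 split across employees is an assumption.
peter = [57, 53, 49, 56, 68]
jennifer = [65, 74, 69, 70, 49, 71]
kosta = [45, 52, 47, 50]

H, C, H_c = kruskal_wallis([peter, jennifer, kosta])

# With k = 3 samples there are 2 degrees of freedom, for which the chi-square
# upper-tail probability is exp(-x / 2) and the 100(1 - alpha)th percentile
# is -2 * ln(alpha).
alpha = 0.05
critical_value = -2 * math.log(alpha)
p_value = math.exp(-H_c / 2)
reject = H_c >= critical_value                          # decision rule (2.3)

print(f"H = {H:.4f}, C = {C:.6f}, H_c = {H_c:.4f}")
print(f"critical value = {critical_value:.4f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Under the assumed data layout, the sketch gives H of about 7.24 and a tie-corrected H of about 7.25, both above the critical value of about 5.99, with a p-value of about 0.0266, in line with the 2.66E-02 reported in section 1.2.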
2.3 Pairwise Comparisons

After rejecting the null hypothesis of equality of means with the Kruskal-Wallis test, it becomes important to determine which means are different from one another. This goal is achieved with the pairwise comparisons. For the sake of comparing two means $\mu_a$ and $\mu_b$, the pairwise test will be implemented as follows:

(i) The test statistic is

$$
z = \frac{\bar{R}_a - \bar{R}_b}{\sqrt{n_t(n_t+1)\left(1/n_a + 1/n_b\right)/12}},
\tag{2.6}
$$

where $\bar{R}_a$ and $\bar{R}_b$ are respectively the mean ranks associated with samples $a$ and $b$.

(ii) The law of probability associated with the $z$ test statistic is approximated with the standard Normal distribution.

The typical number of pairwise comparisons to be made is $c = k(k-1)/2$. If $\alpha$ is the significance level used with the global test, the significance level that should be used with the pairwise comparisons is $\alpha' = \alpha/c$. The difference $\bar{R}_a - \bar{R}_b$ is statistically significant if $|z| > z_{\alpha'/2}$, where $z_{\alpha'/2}$ is the $100(1-\alpha'/2)$th percentile of the standard Normal distribution. In other words, statistical significance is achieved when $|\bar{R}_a - \bar{R}_b|$ exceeds the threshold of

$$
z_{\alpha'/2}\sqrt{n_t(n_t+1)\left(1/n_a + 1/n_b\right)/12}.
$$
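The pairwise procedure above can be sketched in Python as follows. This is an illustration only, not the macro's implementation. It assumes the Table 1.1 values split into 5, 6 and 4 observations for Peter, Jennifer and Kosta, a global significance level of 0.05, and the adjusted level α/c described above. All variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import NormalDist

# Table 1.1 data; the 5/6/4 split across employees is an assumption.
samples = {
    "Peter": [57, 53, 49, 56, 68],
    "Jennifer": [65, 74, 69, 70, 49, 71],
    "Kosta": [45, 52, 47, 50],
}

# Midranks over the pooled series: a value tied t times at sorted positions
# p+1..p+t receives the average rank p + (t + 1) / 2.
pooled = [x for values in samples.values() for x in values]
n_t = len(pooled)
counts = Counter(pooled)
rank_of, position = {}, 0
for value in sorted(counts):
    t = counts[value]
    rank_of[value] = position + (t + 1) / 2
    position += t

mean_rank = {
    name: sum(rank_of[x] for x in values) / len(values)
    for name, values in samples.items()
}

alpha = 0.05                       # significance level of the global test
k = len(samples)
c = k * (k - 1) // 2               # number of pairwise comparisons
alpha_prime = alpha / c            # significance level for each comparison
z_crit = NormalDist().inv_cdf(1 - alpha_prime / 2)  # 100(1 - alpha'/2)th percentile

results = {}
for a, b in combinations(samples, 2):
    n_a, n_b = len(samples[a]), len(samples[b])
    std_err = math.sqrt(n_t * (n_t + 1) * (1 / n_a + 1 / n_b) / 12)
    difference = abs(mean_rank[a] - mean_rank[b])
    threshold = z_crit * std_err
    results[(a, b)] = (difference, threshold, difference > threshold)
    print(f"{a} vs {b}: |diff| = {difference:.3f}, "
          f"threshold = {threshold:.3f}, significant = {difference > threshold}")
```

Under these assumptions, only the Jennifer versus Kosta comparison is significant: its absolute mean rank difference of 7.75 exceeds a threshold of about 6.91, while the two comparisons involving Peter stay below their thresholds.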
Bibliography

[1] Conover, W.J. (1980). Practical Nonparametric Statistics (2nd ed.). New York: John Wiley & Sons, Inc.
[2] Conover, W.J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: John Wiley & Sons, Inc.
[3] Daniel, W.W. (1990). Applied Nonparametric Statistics (2nd ed.). Boston: PWS-Kent Publishing Company.
[4] Gwet, K.L. (2011). The Practical Guide to Statistics: Basic Concepts, Methods and Meaning. Application With MS Excel, R, and OpenOffice Calc (2nd ed.). Advanced Analytics, LLC.
[5] Kruskal, W.H. (1952). A nonparametric test for the several sample problem. Annals of Mathematical Statistics, 23, 525-540.
[6] Kruskal, W.H. and Wallis, W.A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.
[7] Marascuilo, L.A. and McSweeney, M. (1977). Nonparametric and Distribution-Free Methods for the Social Sciences. Monterey, CA: Brooks/Cole Publishing Company.
Printed Books by Kilem L. Gwet

• The Practical Guide to Statistics: Basic Concepts, Methods and Meaning. Application With MS Excel, R, and OpenOffice Calc
• HANDBOOK OF INTER-RATER RELIABILITY (Second Edition): The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters
• INTER-RATER RELIABILITY USING SAS: A Practical Guide for Nominal, Ordinal, and Interval Data

e-documents by Kilem L. Gwet

• The Friedman Test with Excel 2007 & 2010 in 3 Simple Steps
• The Kruskal-Wallis Test with Excel 2010 in 3 Simple Steps
• The Kruskal-Wallis Test with Excel 2007 in 3 Simple Steps
• The Wilcoxon Nonparametric Tests with Excel in 3 Simple Steps
• How to Compute Intraclass Correlation Using Excel: A Practical Guide to Inter-Rater Reliability for Quantitative Data
• Confidence Intervals in Statistics with Excel 2010: 75 Problems & Detailed Solutions
The Kruskal-Wallis Test with Excel 2007 in 3 Simple Steps
Kilem Li Gwet, Ph.D.

The KruskalWallis2007.xlsm Excel macro is what you need when the Analysis ToolPak and ANOVA cannot be used.

This booklet shows you step by step how the Excel macro program KruskalWallis2007.xlsm implements the Kruskal-Wallis test with Excel 2007.

About the author

Kilem L. Gwet, Ph.D., is a statistical consultant, researcher and instructor. He has over 15 years of experience in various industries, and has several publications on inter-rater and intra-rater reliability assessment in peer-reviewed journals.

Advanced Analytics, LLC
PO Box 2696
Gaithersburg, MD 20886-2696