Using Excel's Analysis ToolPak Add-In

Bijay Lal Pradhan, PhD

Introduction

I am strongly of the opinion that many kinds of quantitative analysis, including statistical analysis, can be performed in Excel. It is powerful and it has a very simple user interface. The purpose of this document is to acquaint you with the free Analysis ToolPak add-in: how it works and what it can do.

Excel provides a set of data analysis tools called the Analysis ToolPak, which you can use to save steps when you develop complex statistical analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical macro functions and then displays the results in an output table. Some tools generate charts in addition to output tables. This document illustrates the use of Excel's Analysis ToolPak add-in for data analysis.

Analysis ToolPak is bundled with Excel, so it is free for anyone who owns Excel. It has been part of Excel since the early days, at least 20 years ago, and its current form is almost identical to its form then. Admittedly, Microsoft has recently revised many of Excel's statistical functions to make them more accurate numerically and to provide a more consistent naming convention, but the functionality and user interface of Analysis ToolPak have changed hardly at all.

If the Data Analysis command is available on the Tools menu (Excel 2003 and earlier) or on the Data tab (Excel 2007 onwards), then the Analysis ToolPak is installed on your system. If the Data Analysis command is not there, you need to install the Analysis ToolPak by doing the following.

For Excel 2003 and previous versions: On the Tools menu, click Add-Ins... If Analysis ToolPak is not listed in the Add-Ins dialog box, click Browse and locate the drive, folder name, and file name for the Analysis ToolPak add-in, Analys32.xll, usually located in the Program Files\Microsoft Office\Office\Library\Analysis folder. Once you find the file, select it and click OK.
For Excel 2007 and above: On the File tab, click Options. If you're using Excel 2007, click the Microsoft Office Button instead, and then click Excel Options. In the window that opens (Figure 1), click the Add-Ins category. In the Manage box, select Excel Add-ins and then click Go. In the Add-Ins available box (Figure 2), check the Analysis ToolPak check box, and then click OK. If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it. If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.

Figure 1 Excel Option

The Data Analysis ToolPak has a Descriptive Statistics tool that provides you with an easy way to calculate summary statistics for a set of sample data. The summary statistics include Mean, Standard Error, Median, Mode, Standard Deviation, Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, and Count. This tool eliminates the need to type individual functions to find each of these results.

Step 1. Select the Tools pull-down menu. If you see Data Analysis, click on this option; otherwise, click on the Add-Ins... option to install the Analysis ToolPak.
Step 2. Click on the Data Analysis option.
Step 3. Choose Descriptive Statistics from the Analysis Tools list.
Step 4. When the dialog box appears, enter A1:A4 in the Input Range box. Here A1 refers to the value in column A and row 1, which in this case is 2. Using the same technique, enter the other values until you reach the last one. If a sample consists of 20 numbers, you can select A1, A2, A3, and so on down to A20 as the input range.
Step 5. Select an output range, in this case B1. Check Summary statistics to see the results, and then click OK. The results appear in the selected range.

Figure 2 Add-Ins List

Once Analysis ToolPak is loaded, you will see a Data Analysis item on the Data ribbon. In fact, if you have also loaded the Solver add-in, the Data Analysis button is right below the Solver button, as shown in Figure 3.

Figure 3 Data Ribbon

When you click the Data Analysis button, you see the list of tools available, some of which appear in Figure 4. These tools are described in subsequent sections of this document.

Figure 4 Data Analysis Tools

Descriptive Statistics

You can obtain summary measures of numeric variables by selecting Descriptive Statistics from the Data Analysis Tools list in Figure 4. Here is an example based on the file bank confidance.xlsx (see Figure 5).

Figure 5 confidence score data

When you select Descriptive Statistics, you see the dialog box in Figure 6. It guesses correctly that the only numeric data are in the range C1:C71, although you have to check the box indicating that labels are in the first row. The Grouped By: option should usually be Columns, meaning that each variable is in a column, not a row. There are three options for the location of the results, and if you choose the New Worksheet option, you can provide a name for this new worksheet. Finally, you can check any of the four options at the bottom, although none are checked by default.

Figure 6 Descriptive Statistics Dialog Box
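To make the summary measures concrete, here is a minimal Python sketch that computes most of the quantities listed by the Descriptive Statistics tool. It is only an illustration, not what Excel runs internally; the function name describe and the six sample values are assumptions made up for this example, and Kurtosis and Skewness are left out to keep it short.

```python
import math
import statistics


def describe(values):
    """Return summary measures similar to the Descriptive Statistics output."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / math.sqrt(n),  # standard error of the mean
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


# Hypothetical sample, not taken from bank confidance.xlsx
sample = [7000, 12500, 9200, 22000, 9200, 5000]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:20s}{value:>14.2f}")
```

The Standard Error line shows the one relationship that is easy to forget: it is the sample standard deviation divided by the square root of the count.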

Figure 7 Summary Statistics of Salaries

Salary
Mean                        13906.49
Standard Error              767.3394
Median                      12800
Mode                        9200
Standard Deviation          6420.022
Sample Variance             41216686
Kurtosis                    -1.01838
Skewness                    0.455332
Range                       22000
Minimum                     5000
Maximum                     27000
Sum                         973454
Count                       70
Largest(1)                  27000
Smallest(1)                 5000
Confidence Level(95.0%)     1530.8

The output for the Salary variable appears in Figure 7. What do you get if you check the Confidence Level for Mean option in Figure 6? You get the last row of the output, but unfortunately no interpretation of the 1530.8 value is provided. It turns out that this is the value that should be subtracted from and added to the sample mean to get a 95% confidence interval for the mean. Here the interval is 13906.49 ± 1530.8, or roughly 12375.7 to 15437.3.

Histograms

The Histogram option in Analysis ToolPak allows you to create a frequency table and accompanying chart of a numeric variable. However, the Histogram dialog box asks not only for the range of the data variable but also for a Bins range, a column containing the upper limit of each bin. In this case of age, there will be 7 bins: less than or equal to 20, greater than 20 but less than or equal to 40, and so on, up to greater than 60.

Figure 8 Histogram Dialog Box
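The binning rule described above (each bin holds values greater than the previous limit and less than or equal to its own limit, with a final bin for everything larger) can be sketched in Python as follows. This is only an illustration of the counting rule: the ages and the three bin limits are hypothetical and are not the data or bins of Figure 9, and bin_frequencies is a name chosen for this example.

```python
from bisect import bisect_left


def bin_frequencies(values, bins):
    """Count values per bin; bins must be sorted upper limits.

    Bin i holds values v with bins[i-1] < v <= bins[i]; the last
    count is the overflow bin for values above the largest limit.
    """
    counts = [0] * (len(bins) + 1)
    for v in values:
        # bisect_left returns the first index whose limit is >= v,
        # which makes each upper limit inclusive.
        counts[bisect_left(bins, v)] += 1
    return counts


ages = [18, 20, 21, 35, 40, 41, 59, 60, 61, 72]  # hypothetical ages
bins = [20, 40, 60]                              # hypothetical bin limits
counts = bin_frequencies(ages, bins)
labels = [f"<= {b}" for b in bins] + ["More"]
for label, count in zip(labels, counts):
    print(f"{label:8s}{count}")
```

Note that a value exactly equal to a bin limit, such as 40 here, is counted in the bin that ends at that limit, not in the next one.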

Figure 9 Age Data with Bins

The results appear in Figure 10. They include the table of bin frequencies and the corresponding chart. If you prefer the bars in the histogram to be right next to one another, you can right-click any bar, select Format Data Series, and choose a Gap Width of 0. You can also delete the Frequency legend.

Figure 10 Age Data with Bins and Histogram

Correlation (and Covariance)

It is easy to create a table of correlations with Analysis ToolPak, as illustrated here with three selected columns of data in the file bank confidance.xlsx (see Figure 11). You choose Correlation from the Data Analysis Tools list and fill out the dialog box as shown in Figure 12. The resulting table of correlations follows Figure 12.

Figure 11 bank confidence data
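For readers who want to see what sits behind each cell of a correlation table, the following Python sketch computes Pearson correlations and arranges them in the same lower-triangular layout. The three short columns are hypothetical values invented for this example, not the contents of bank confidance.xlsx, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns):
    """Return column names and the lower triangle of the correlation matrix."""
    names = list(columns)
    rows = []
    for i, name in enumerate(names):
        rows.append([pearson(columns[name], columns[names[j]]) for j in range(i + 1)])
    return names, rows


# Hypothetical data for illustration only
columns = {
    "Salary": [7000, 12500, 9200, 22000, 18200],
    "Age": [25, 31, 28, 45, 40],
    "Qualification": [1, 3, 2, 3, 4],
}
table_names, table_rows = correlation_table(columns)
for name, row in zip(table_names, table_rows):
    print(f"{name:15s}" + "  ".join(f"{r:8.4f}" for r in row))
```

Only the lower triangle is shown because the table is symmetric, and every diagonal entry is 1 because each variable is perfectly correlated with itself.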

Figure 12 Correlations Dialog Box

                 Salary      Age         Qualification
Salary           1
Age              0.349988    1
Qualification    0.192526    0.122485    1

The covariances can be found in the same way, by choosing Covariance from the Data Analysis Tools list.

Rank and Percentile

Analysis ToolPak has a Rank and Percentile procedure that you might find useful. You select a column of numeric data, and the procedure essentially sorts the data from high to low. Figure 13 shows the results of doing this for the Salary variable in the file. The Rank column is equivalent to using Excel's RANK (or the newer RANK.EQ) function. The Percent column lists the approximate percentage of salaries at or below each given salary. It is equivalent to using the PERCENTRANK.INC function (available starting in Excel 2010).

Figure 13 Ranks of salary

Point    Column1    Rank    Percent
27       27000      1       98.50%
40       27000      1       98.50%
34       25000      3       95.60%
46       25000      3       95.60%
68       24665      5       94.20%
38       24300      6       92.70%
14       24000      7       91.30%
44       23670      8       89.80%

Hypothesis Tests

Analysis ToolPak has five separate procedures for implementing common hypothesis tests. Three of these are for testing the difference between two sample means when the two samples are independent. Another tests the difference between two sample means when the two samples are paired. Finally, there is a test for equality of two sample variances.

Tests for Difference between Two Sample Means: Independent Samples

These procedures are labeled z-test: Two Sample for Means, t-test: Two-Sample Assuming Equal Variances, and t-test: Two-Sample Assuming Unequal Variances. The first assumes the population variances are known, whereas the last two make no such assumption.

Analysis ToolPak requires the data to be unstacked. As an example, the data in the file bank confidance.xlsx are stacked (see Figure 17). There is a categorical variable, sex, and a numeric variable, salary. Indeed, this is the usual data arrangement in such data sets. However, to use any of the Analysis ToolPak tests for comparing the mean salary of males and females, the data must first be unstacked, as shown in Figure 18, where the two column lengths for the unstacked variables are not necessarily the same. Here we have to unstack the data manually (by sorting on sex and then copying and pasting).

Figure 17 Stacked Exercise Data

       salary    sex
1      7000      53
2      12500     20
3      9200      26
4      22000     32
5      18200     40
6      5000      21
7      11500     29
8      17400     40
9      12200     26
10     12800     29
11     7000      42
12     7500      31
...
71     ...

Figure 18 Unstacked Data

       salary male    salary female
1      7000           11500
2      12500          17400
3      9200           7000
4      22000          7500
5      18200          22000
6      5000           24000
7      12200          17000
8      12800          14000
9      18000          9200
10     15000          9500
11     7800           10000
12     6800           16200
...
71     ...

In any case, once you have the unstacked data, all three of the procedures are straightforward and similar. For example, the dialog box for the t-test: Two-Sample Assuming Equal Variances procedure is shown in Figure 19.

Figure 19 Two-Sample Test Dialog Box
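The unstacking step described above is done by hand in Excel, but the idea is simple enough to sketch in Python: group the numeric values by the category they belong to, giving one list per category. The records below are hypothetical and the function name unstack is an assumption made for this illustration.

```python
from collections import defaultdict


def unstack(rows, group_key, value_key):
    """Split stacked records into one list of values per category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return dict(groups)


# Hypothetical stacked records: one row per person
stacked = [
    {"salary": 7000, "sex": "male"},
    {"salary": 11500, "sex": "female"},
    {"salary": 12500, "sex": "male"},
    {"salary": 17400, "sex": "female"},
    {"salary": 9200, "sex": "male"},
]
unstacked = unstack(stacked, group_key="sex", value_key="salary")
for sex, salaries in unstacked.items():
    print(sex, salaries)
```

As in Figure 18, the resulting columns need not have the same length; here the male list has three values and the female list has two.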

After widening the columns appropriately, the results appear in Figure 20. Although the dialog box asks for a significance level (alpha), the output uses it only to compute the critical values; it does not say whether the hypothesis is rejected. However, you can compare your alpha level to the p-value shown in cell B11 for a one-tailed test or in cell B13 for a two-tailed test.

Figure 20 Two-Sample Test Results

t-test: Two-Sample Assuming Equal Variances

                                salary male     salary female
Mean                            13217.97297     14678.45455
Variance                        47057190.92     34771460.07
Observations                    37              33
Pooled Variance                 41275670.52
Hypothesized Mean Difference    0
df                              68
t Stat                          -0.949418795
P(T<=t) one-tail                0.172884911
t Critical one-tail             1.667572281
P(T<=t) two-tail                0.345769822
t Critical two-tail             1.995468931

Test for Equality of Variances

Given the test results in Figure 20, you might want to check whether the equal-variance assumption is reasonable. You can do this with the F-Test Two-Sample for Variances procedure, again using the unstacked data. The dialog box is filled out exactly as for the t-test, and the results appear in Figure 21. The p-value of about 0.19 indicates that there is no convincing evidence that the two variances are unequal, so the equal-variance assumption appears reasonable.

Figure 21 Equal Variance Test Results

F-Test Two-Sample for Variances

                        salary male     salary female
Mean                    13217.97297     14678.45455
Variance                47057190.92     34771460.07
Observations            37              33
df                      36              32
F                       1.353328012
P(F<=f) one-tail        0.194021106
F Critical one-tail     1.783845791
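As a check on how the test statistics in Figures 20 and 21 are built, the following Python sketch recomputes the pooled variance, the t statistic, and the F statistic from the means, variances, and counts reported in those figures. It uses the standard textbook formulas for the equal-variance t-test and the variance-ratio F-test; the p-values and critical values are not recomputed because they need distribution functions that are not in the Python standard library.

```python
import math

# Summary values copied from Figures 20 and 21
mean_m, var_m, n_m = 13217.97297, 47057190.92, 37   # salary male
mean_f, var_f, n_f = 14678.45455, 34771460.07, 33   # salary female

# Equal-variance t-test: pool the two sample variances, weighted by their df
df = n_m + n_f - 2
pooled_var = ((n_m - 1) * var_m + (n_f - 1) * var_f) / df
t_stat = (mean_m - mean_f) / math.sqrt(pooled_var * (1 / n_m + 1 / n_f))

# F-test for equal variances: ratio of the two sample variances
f_stat = var_m / var_f

print(f"df              = {df}")
print(f"Pooled Variance = {pooled_var:.2f}")
print(f"t Stat          = {t_stat:.6f}")
print(f"F               = {f_stat:.6f}")
```

The recomputed values agree with the df, Pooled Variance, t Stat, and F rows of the two figures, which confirms that both procedures work only from these summary quantities.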

Test for Difference between Two Sample Means: Paired Samples

If you are comparing two samples that are paired in some natural way, you should use the t-test: Paired Two Sample for Means procedure. As an example, the husband and wife ratings are naturally paired, assuming that the reactions of husbands and wives to bank confidence are correlated (see Figure 22). These data are already unstacked, as Analysis ToolPak requires, so the Paired Sample dialog box can be filled in directly, as shown in Figure 23. The results appear in Figure 24.

Figure 22 Paired Confidence Ratings

Figure 23 Paired-Sample Test Dialog Box

Figure 24 Paired-Sample Test Results

t-test: Paired Two Sample for Means

                                confidance Score Husband    Confidance Score Wife
Mean                            4.444444444                 5.055555556
Variance                        2.614379085                 2.408496732
Observations                    18                          18
Pearson Correlation             0.786607714
Hypothesized Mean Difference    0
df                              17
t Stat                          -2.500455885
P(T<=t) one-tail                0.011463202
t Critical one-tail             1.739606726
P(T<=t) two-tail                0.022926405
t Critical two-tail             2.109815578
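The paired test works on the husband-minus-wife differences, and the Pearson Correlation row in Figure 24 is what links the two columns. The Python sketch below recovers the t statistic from the summary values in Figure 24 alone, using the identity Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y). This is an illustrative check, not a description of Excel's internal code; with the raw ratings you would compute the differences directly.

```python
import math

# Summary values copied from Figure 24
mean_h, var_h = 4.444444444, 2.614379085   # husband
mean_w, var_w = 5.055555556, 2.408496732   # wife
n = 18
r = 0.786607714                            # Pearson correlation

# Covariance recovered from the correlation and the two variances
cov = r * math.sqrt(var_h * var_w)

# Variance of the paired differences, then the usual one-sample t statistic
var_diff = var_h + var_w - 2 * cov
t_stat = (mean_h - mean_w) / math.sqrt(var_diff / n)
df = n - 1

print(f"Variance of differences = {var_diff:.6f}")
print(f"t Stat = {t_stat:.6f}, df = {df}")
```

The positive correlation shrinks the variance of the differences well below the sum of the two variances, which is why pairing gives a larger t statistic in magnitude than an independent-samples test on the same numbers would.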

Analysis of Variance (ANOVA) Procedures

Single-Factor ANOVA

Single-factor ANOVA, also called one-way ANOVA, is an extension of the two-sample t-test (with independent samples) to more than two samples. It tests whether the means of all samples are equal. The Analysis ToolPak's Anova: Single Factor procedure implements this test, again assuming unstacked data. As an example, the file data2.xlsx lists cereal sales at Bhatbhateni superstore for five different shelf heights (see Figure 25). To run the analysis, fill out the dialog box as shown in Figure 26.

Figure 25 Cereal Sales Data

Figure 26 Single-Factor ANOVA Dialog Box

The results appear in Figure 27. The sample mean and variance for each shelf height are listed, followed by the ANOVA table for the test. In this case, the p-value (0.308 > 0.05) indicates that mean cereal sales do not differ significantly across shelf positions.
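To show where the F value in the single-factor ANOVA table comes from, here is a Python sketch that rebuilds the between-groups and within-groups sums of squares from the counts, averages, and variances reported in the SUMMARY block of Figure 27. It applies the standard one-way ANOVA formulas; the p-value is not recomputed because the F distribution is not available in the Python standard library.

```python
# (count, average, variance) per shelf height, copied from Figure 27
groups = {
    "Lowest": (25, 379.32, 9217.81),
    "Next-to-lowest": (25, 411.76, 12520.61),
    "Middle": (25, 402.76, 10373.44),
    "Next-to-highest": (25, 411.80, 11262.83),
    "Highest": (25, 443.20, 10722.58),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between groups: spread of the group means around the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within groups: each sample variance times its degrees of freedom
ss_within = sum((n - 1) * var for n, _, var in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f} (df {df_between})")
print(f"SS within  = {ss_within:.2f} (df {df_within})")
print(f"F          = {f_stat:.6f}")
```

The result matches the Between Groups and Within Groups rows of the ANOVA table up to the rounding of the reported variances.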

Figure 27 ANOVA Results for Cereal Data

Anova: Single Factor

SUMMARY
Groups             Count    Sum      Average    Variance
Lowest             25       9483     379.32     9217.81
Next-to-lowest     25       10294    411.76     12520.61
Middle             25       10069    402.76     10373.44
Next-to-highest    25       10295    411.8      11262.83
Highest            25       11080    443.2      10722.58

ANOVA
Source of Variation    SS          df     MS          F           P-value     F crit
Between Groups         52549.71    4      13137.43    1.214241    0.308373    2.447237
Within Groups          1298335     120    10819.45
Total                  1350884     124

Two-Factor ANOVA with Replication

Analysis ToolPak's Anova: Two-Factor With Replication procedure is an extension of the single-factor ANOVA procedure. Now there are two factors, and observations are made for each combination of the two factor levels. The test is again basically a test of equal means, or equivalently, of equal factor-level effects. There is no easy way to rearrange the data as required other than by copying and pasting. The arranged data and the resulting output are shown in the sheet ANOVA_R.

Figure 31 Two-Factor ANOVA with Replications Results

Two-Factor ANOVA without Replication

Analysis ToolPak also has an Anova: Two-Factor Without Replication procedure. It is like the With Replication procedure except that there is only one observation per factor-level combination. As an example, the production worksheet of the file data2.xlsx has one production observation for each combination of four machines and four operators. With this arrangement, we can run Two-Factor ANOVA without replication directly.

Figure 32 Rearranged Soap Sales Data

             Machines
Operators    Machine1    Machine2    Machine3    Machine4
Operator1    10          12          14          16
Operator2    12          11          13          16
Operator3    14          15          12          11
Operator4    16          10          17          17

The question is whether differences in average productivity are due to differences among the operators and among the machines. The alternative hypotheses are:
a. Mean production differs significantly across machines.
b. Mean production differs significantly across operators.

The results appear in Figure 33.
Summary statistics are listed for each level of each of the two factors, and the ANOVA table shows the results of the tests. Both p-values are greater than 0.05, which indicates that productivity does not differ significantly across machines or across operators. (The null hypothesis of equal means is not rejected in either case.)
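The sums of squares behind this conclusion can be reproduced from the 4 x 4 production table with a few lines of Python. The sketch below applies the standard formulas for two-factor ANOVA without replication, with operators as rows and machines as columns; the variable names are chosen for this illustration, and p-values are again omitted because the F distribution is not in the Python standard library.

```python
# Production by operator (rows) and machine (columns), from the production table
data = {
    "Operator1": [10, 12, 14, 16],
    "Operator2": [12, 11, 13, 16],
    "Operator3": [14, 15, 12, 11],
    "Operator4": [16, 10, 17, 17],
}

rows = list(data.values())
r, c = len(rows), len(rows[0])
grand_mean = sum(map(sum, rows)) / (r * c)

row_means = [sum(row) / c for row in rows]          # operator means
col_means = [sum(col) / r for col in zip(*rows)]    # machine means

ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
ss_error = ss_total - ss_rows - ss_cols             # what the two factors leave unexplained

df_rows, df_cols = r - 1, c - 1
df_error = df_rows * df_cols
ms_error = ss_error / df_error
f_rows = (ss_rows / df_rows) / ms_error
f_cols = (ss_cols / df_cols) / ms_error

print(f"Rows:    SS = {ss_rows:.0f}, F = {f_rows:.5f}")
print(f"Columns: SS = {ss_cols:.0f}, F = {f_cols:.6f}")
print(f"Error:   SS = {ss_error:.0f}, df = {df_error}")
```

Both F values are well below the F crit value of 3.862548 reported in the ANOVA table, which is the same conclusion the p-values give.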

Figure 33 ANOVA Results for Soap Sales

Anova: Two-Factor Without Replication

SUMMARY      Count    Sum    Average    Variance
Operator1    4        52     13         6.666667
Operator2    4        52     13         4.666667
Operator3    4        52     13         3.333333
Operator4    4        60     15         11.33333

Machine1     4        52     13         6.666667
Machine2     4        48     12         4.666667
Machine3     4        56     14         4.666667
Machine4     4        60     15         7.333333

ANOVA
Source of Variation    SS    df    MS          F           P-value     F crit
Rows                   12    3     4           0.62069     0.619097    3.862548
Columns                20    3     6.666667    1.034483    0.422775    3.862548
Error                  58    9     6.444444
Total                  90    15

Regression Analysis

One of the favorite Analysis ToolPak procedures is its Regression procedure. For example, suppose we want to regress income on age and education level using our bank confidence data.

Figure 34 Bank confidence data

Fortunately, these data are in the form Analysis ToolPak requires (the independent variables are in contiguous columns), so the Regression dialog box can be filled out as shown in Figure 35. You can decide which of the five check boxes at the bottom to check (including none of them) for diagnostic analysis of the residuals.

Figure 35 Regression Dialog Box

The regression output shown in Figure 36 is standard. It includes the regression summary statistics at the top, the ANOVA table for checking whether the regression has any significance as a whole, and the information on the individual regression coefficients. One curious feature is that you automatically get two versions of the confidence intervals for the coefficients, and you get them regardless of whether you check the Confidence Level box in Figure 35. If you check the box and enter a 90% confidence level, as in Figure 35, then both the 95% and the 90% intervals are shown in the output summary, as in Figure 36.

Figure 36 Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.38109
R Square             0.14523
Adjusted R Square    0.119714
Standard Error       6023.492
Observations         70

ANOVA
              df    SS               MS              F       Significance F
Regression    2     413026782.72     206513391.36    5.69    0.01
Residual      67    2430924538.77    36282455.80
Total         69    2843951321.49

                 Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%    Lower 90.0%    Upper 90.0%
Intercept        4980.41         2743.26           1.82      0.07       -495.16      10455.98     404.89         9555.93
Age              201.26          69.12             2.91      0.00       63.29        339.22       85.97          316.54
Qualification    659.19          493.76            1.34      0.19       -326.36      1644.75      -164.36        1482.75

The residuals and residual plots requested in Figure 35 are shown in Figure 37. (Actually, all charts overlap one another, so you will probably want to move them around.) These residual plots let you see whether there are any obvious violations of the regression assumptions.
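Several entries in the regression output are simple functions of others, and recomputing them is a useful way to read the table. The Python sketch below derives R Square, Adjusted R Square, F, the Standard Error, and the t statistics from the sums of squares, coefficients, and coefficient standard errors reported in Figure 36. It uses the standard regression identities and is meant as an illustrative check, not as a replacement for the Regression procedure.

```python
import math

# Values copied from Figure 36
ss_regression = 413026782.72
ss_residual = 2430924538.77
n_obs = 70          # observations
k = 2               # independent variables: Age and Qualification

ss_total = ss_regression + ss_residual
df_residual = n_obs - k - 1

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
ms_regression = ss_regression / k
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)   # the "Standard Error" of the regression

# (coefficient, standard error) pairs from the coefficient table
coefficients = {
    "Intercept": (4980.41, 2743.26),
    "Age": (201.26, 69.12),
    "Qualification": (659.19, 493.76),
}
t_stats = {name: b / se for name, (b, se) in coefficients.items()}

print(f"R Square          = {r_squared:.5f}")
print(f"Adjusted R Square = {adj_r_squared:.6f}")
print(f"F                 = {f_stat:.2f}")
print(f"Standard Error    = {std_error:.3f}")
for name, t in t_stats.items():
    print(f"t Stat ({name}) = {t:.2f}")
```

Each t Stat is just the coefficient divided by its standard error, and each confidence interval in the table is the coefficient plus or minus a t multiplier times that same standard error.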

Figure 37 Residuals and Residual Plots

Sampling

The Sampling tool in Analysis ToolPak allows you to choose a random sample from a larger population of values, and this population can be very large. For example, we might want to choose a random sample from a large set of records. In this case, we have to choose one column only; sampling from multiple columns at once is not available in Excel.

Conclusion

Analysis ToolPak lacks many features and the overall professional look of other statistical software packages, but it has basic statistical tools that are easy to use and easily accessible. If you want to perform standard statistical analyses and you have only Excel, you can certainly get by with Analysis ToolPak. You might have to rearrange data, widen output columns, reformat graphs, and possibly do a few other things, but you will be able to get the basic results you need fairly quickly and easily. An alternative way to do these statistical analyses is through another add-in (possibly MegaStat), which we will discuss in another session.