A Visual Basic Program for Estimating Missing Cell Frequencies in Chi Square Tests for Association


Running head: MISSING FREQUENCIES IN CHI SQUARE TESTS

Richard G. Graf, Edward F. Alf, Jr., and Steve Williams
San Diego State University

Ifeanyi Okolo
Nnamdi Azikiwe University

It is usually easy to calculate the chi square test for association in a table with r rows and c columns; but if the frequency data are missing for one or more cells in the table, the analysis can be complicated. One class of such tables arises in the analysis of transaction flows (Savage & Deutsch, 1960), such as the analysis of export-import data, emigration data, or psychological interaction data, where the diagonal entries in the table are missing. Goodman (1968) presents an excellent summary of methods for analyzing contingency tables with missing elements, including missing diagonals. Wagner (1970) develops a maximum likelihood solution for estimating expected cell frequencies when the diagonal elements are missing. The present paper offers a slightly different solution to this problem. Each missing cell frequency is replaced with the value that would be expected if the null hypothesis of no association between rows and columns were true. In this way, the missing cells make no contribution to the resulting chi square. Further, replacing the missing values with their expectations yields the same maximum likelihood solution given by Wagner (1970). A computer program is available to estimate the missing diagonal cells.

1. Introduction and Summary

It is usually easy to calculate the chi square test for association in a table with r rows and c columns; but if the frequency data are missing for one or more cells in the table, the analysis can be complicated. One class of such tables arises in the analysis of transaction flows (Savage & Deutsch, 1960), such as the analysis of export-import data, emigration data, or psychological interaction data, where the diagonal entries in the table are missing. Wagner (1970) discusses an example of such a table arising in the analysis of sexual displays among three monkeys. Each row in Wagner's table corresponds to a sender, and each column corresponds to a receiver, of a sexual display; the cell entries are the number of times the monkey in the row made a sexual display to the monkey in the column. There are zeroes in the diagonal, since the monkeys cannot display to themselves. Analogous problems arise in many fields of research.

Goodman (1968) developed a computer program to solve the chi-square analysis arising in the analysis of transaction flows, as well as in the analysis of other contingency tables in which one or more cell entries might be missing. These include tables where (1) some nondiagonal cells are known a priori to have zero entries, (2) some nondiagonal cells are known a priori to have entries that should be excluded from consideration because their inclusion would bias the parameter estimates obtained, (3) some nondiagonal cell entries are unknown (the data may be unavailable), or (4) the analysis of the data is focused on a specified subset of the cells, including possibly some diagonal cells and excluding possibly some nondiagonal cells. Goodman (1968) presents an excellent summary of methods for analyzing contingency tables with missing elements, including missing diagonals.

Wagner (1970) develops a maximum likelihood solution for estimating expected cell frequencies when the diagonal elements are missing. The present paper offers a slightly different solution to this problem. We replace each missing cell frequency with the value that would be expected if the null hypothesis of no association between rows and columns were true. In this way, the missing cells make no contribution to the resulting chi square. Further, replacing the missing values with their expectations yields the same maximum likelihood solution given by Wagner (1970).

2. Analyzing a Complete Square.

2.1 Terminology

Consider an r x c chi square table, in which we wish to test the hypothesis that the r rows and c columns are independent. We define:

$N$ = the total number of observations,
$n_{i\cdot}$ = the number of observations in row i,
$n_{\cdot j}$ = the number of observations in column j, and
$n_{ij}$ = the number of observations in the cell at the intersection of row i and column j.

2.2 Analysis.

Let $E_{ij}$ be the expected value in cell i,j. The numerical value of $E_{ij}$ is:

$$E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N} \qquad [1]$$

After all the $E_{ij}$ values are computed, the calculated chi square is given by:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}} \qquad [2]$$

with $(r - 1)(c - 1)$ degrees of freedom.

3. Analyzing a Square with One Missing Cell.

3.1 Terminology

Assume the frequency at the intersection of row i and column j to be missing. We designate this missing frequency as $x_{ij}$. The frequency totals for row i, for column j, and the over-all total will also be incomplete, and will be designated as $n'_{i\cdot}$, $n'_{\cdot j}$, and $N'$ respectively. Then the true row, column, and over-all totals will be:

$$n_{i\cdot} = n'_{i\cdot} + x_{ij} \qquad [3]$$

$$n_{\cdot j} = n'_{\cdot j} + x_{ij} \qquad [4]$$

$$N = N' + x_{ij} \qquad [5]$$

Our procedure is to estimate $x_{ij}$ to be its expected value, $E_{ij}$, under the null hypothesis of independence of the rows and columns. Substituting [3], [4], and [5] in [1] yields:

$$E_{ij} = \frac{(n'_{i\cdot} + E_{ij})(n'_{\cdot j} + E_{ij})}{N' + E_{ij}} \qquad [6]$$

Solving [6] for $E_{ij}$ yields:

$$E_{ij} = \frac{n'_{i\cdot}\, n'_{\cdot j}}{N' - n'_{i\cdot} - n'_{\cdot j}} \qquad [7]$$
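As a minimal sketch of equations [1], [2], and [7], the following Python fragment computes expected frequencies and the chi square for a complete table, and estimates a single missing cell from the incomplete row, column, and over-all totals. The function names and the counts in the example are hypothetical and are not taken from the paper.

```python
def expected(table):
    """Expected frequencies under independence (equation [1])."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Chi square for a complete table (equation [2])."""
    exp = expected(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))


def estimate_single_missing(table, i, j):
    """Estimate the one missing cell (i, j) with equation [7].

    The value stored at table[i][j] is ignored; every other cell
    must hold a known frequency.
    """
    row_rest = sum(v for b, v in enumerate(table[i]) if b != j)
    col_rest = sum(row[j] for a, row in enumerate(table) if a != i)
    total_rest = sum(v for a, row in enumerate(table)
                     for b, v in enumerate(row) if (a, b) != (i, j))
    return row_rest * col_rest / (total_rest - row_rest - col_rest)


# Hypothetical 3 x 3 table with the top-left frequency missing.
counts = [[None, 20, 30],
          [10, 40, 50],
          [20, 60, 70]]
estimate = estimate_single_missing(counts, 0, 0)

filled = [row[:] for row in counts]
filled[0][0] = estimate

# One degree of freedom is lost for the estimated cell.
df = (len(counts) - 1) * (len(counts[0]) - 1) - 1
print(estimate, chi_square(filled), df)
```

In the filled table the estimated cell equals its own expected value, so it adds nothing to the chi square sum.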

Note that the estimate $E_{ij}$ need not be a whole number. $E_{ij}$ can then be substituted for the missing frequency $x_{ij}$ in [3], [4] and [5] to obtain estimates of the row, column, and over-all totals $n_{i\cdot}$, $n_{\cdot j}$, and $N$. These estimates can then be substituted into [1] to obtain expected cell frequencies. These in turn can be substituted into [2] to obtain the calculated chi square. The resulting chi square will have $(r - 1)(c - 1) - 1$ degrees of freedom, because one degree of freedom is lost in estimating $x_{ij}$.

4. Analyzing a Table with Several Missing Cells.

When the frequencies in several cells are missing, it is not possible to apply equation [7] directly, because for a given cell, $N'$, and depending on where the other missing cells are located, possibly even $n'_{i\cdot}$ and $n'_{\cdot j}$, could be unknown. To solve for all the unknown cell frequencies, it is possible to use the method of iteration (Scarborough, 1966). To use this method, we first substitute initial frequency estimates for all the unknown cell frequencies. To assure convergence in all cases, these initial estimates should be zero. Then, in turn, each missing cell frequency is estimated, treating the estimated cell frequencies as if they were the actual cell frequencies. As each cell frequency is estimated, it enters into the estimation process for all subsequent cells. After all cell frequencies have been estimated, the process is repeated once again. The process is repeated iteratively until the estimated cell frequencies stabilize to some desired accuracy.

4.1 Terminology.

We define:

$E_{ij}$ = The unknown expected frequency in the cell we are estimating,
$e_{ij}$ = The just previous estimated frequency for this cell,
$S_{i\cdot}$ = The sum of the estimates at this time for the unknown cell frequencies in row i,
$S_{\cdot j}$ = The sum of the estimates at this time for the unknown cell frequencies in column j, and
$S$ = The sum of the estimates at this time for all the unknown cell frequencies in the table.

With $n'_{i\cdot}$, $n'_{\cdot j}$, and $N'$ now denoting the row, column, and over-all totals of the known frequencies, we may then re-write equation [7] for more efficient use in the estimation process:

$$E_{ij} = \frac{(n'_{i\cdot} + S_{i\cdot} - e_{ij})(n'_{\cdot j} + S_{\cdot j} - e_{ij})}{N' + S - n'_{i\cdot} - S_{i\cdot} - n'_{\cdot j} - S_{\cdot j} + e_{ij}} \qquad [8]$$

4.2 Numerical Example.

Table 1 is an example taken from Wagner (1970), and will serve to illustrate the procedure. Substituting the data for the diagonal elements in turn into [8] yields:

=============================
Insert Table 1 About Here
=============================
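The iteration described in Section 4 can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' program: each missing cell is re-estimated in turn from the totals of the known frequencies plus the current estimates, and every new estimate is used immediately for the cells that follow. The names `sweep` and `estimate_missing`, the use of `None` to mark a missing cell, and the example counts are assumptions made for this sketch; the counts are not Wagner's data.

```python
def sweep(known, estimates):
    """Re-estimate every missing cell once, in row-major order.

    known     -- table of frequencies, with None marking a missing cell
    estimates -- dict {(i, j): current estimate}, updated in place
    Returns the total absolute change in the estimates.
    """
    n_rows, n_cols = len(known), len(known[0])
    row_known = [sum(v for v in row if v is not None) for row in known]
    col_known = [sum(known[i][j] for i in range(n_rows)
                     if known[i][j] is not None)
                 for j in range(n_cols)]
    total_known = sum(row_known)

    change = 0.0
    for (i, j) in sorted(estimates):
        prev = estimates[(i, j)]
        # Sums of the current estimates in this row, this column, the table.
        s_row = sum(v for (a, _), v in estimates.items() if a == i)
        s_col = sum(v for (_, b), v in estimates.items() if b == j)
        s_all = sum(estimates.values())
        numerator = ((row_known[i] + s_row - prev)
                     * (col_known[j] + s_col - prev))
        denominator = (total_known + s_all
                       - row_known[i] - s_row
                       - col_known[j] - s_col + prev)
        estimates[(i, j)] = numerator / denominator
        change += abs(estimates[(i, j)] - prev)
    return change


def estimate_missing(known, tol=0.01, max_sweeps=1000):
    """Iterate from zero initial estimates until the change is within tol."""
    estimates = {(i, j): 0.0
                 for i, row in enumerate(known)
                 for j, v in enumerate(row) if v is None}
    for _ in range(max_sweeps):
        if sweep(known, estimates) <= tol:
            break
    return estimates


# Hypothetical 3 x 3 table with a missing diagonal.
known = [[None, 10, 20],
         [30, None, 40],
         [50, 60, None]]

first = {(0, 0): 0.0, (1, 1): 0.0, (2, 2): 0.0}
sweep(known, first)            # estimates after the first iteration
final = estimate_missing(known, tol=1e-10)
print(first, final)
```

The `max_sweeps` cap is a safeguard added for the sketch; the text itself only asks that the estimates stabilize to the desired accuracy.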

The procedure is continued in the same way into the second iteration. The data for the diagonal elements are substituted in turn into [8], including the approximations from the first iteration. When only the diagonal elements are missing, then the previous estimate $e_{ii}$, the row sum $S_{i\cdot}$, and the column sum $S_{\cdot i}$ will all be identical. Equation [8] under these circumstances will be reduced to:

$$E_{ii} = \frac{n'_{i\cdot}\, n'_{\cdot i}}{N' + S - n'_{i\cdot} - n'_{\cdot i} - e_{ii}} \qquad [9]$$

where $n'_{i\cdot}$, $n'_{\cdot i}$, and $N'$ are the row, column, and over-all totals of the known frequencies, and $S$ is the sum of the current estimates for all the unknown cells. For example, the second approximation to a diagonal element becomes:

A similar process is followed until the expected diagonal frequencies stabilize to a satisfactory degree. Table 2 lists the diagonal values for the first 12 iterations. These entries have been carried to ten significant figures, but two-decimal accuracy should be sufficient for most practical purposes.

=============================
Insert Table 2 About Here
=============================

5. Discussion.

It should be noted that the above procedure provides the maximum likelihood estimates for the missing cell frequencies. Note that the final values in Table 2 are identical, within rounding error, to those obtained by Wagner (1970) for the same data.

Before starting the iteration process, any row and/or column having all known frequencies equal to zero should be eliminated, since it will make no contribution to the chi square, and its terms would involve division by zero.

Even for small tables, the calculations become tedious. A Basic program that will estimate the missing cells, calculate the chi square value, and provide the degrees of freedom for any missing data table can be found in Appendix A. A Visual Basic program is also available for download.

References

Goodman, Leo A. (1968). "The Analysis of Cross-Classified Data: Independence, Quasi-independence, and Interactions in Contingency Tables With or Without Missing Values," Journal of the American Statistical Association, 63.

Savage, I. Richard, and Deutsch, Karl W. (1960). "A Statistical Model of the Gross Analysis of Transaction Flows," Econometrica, 28.

Scarborough, James B. (1966). Numerical Mathematical Analysis, Baltimore: The Johns Hopkins Press.

Wagner, S. S. (1970). "The Maximum-Likelihood Estimate for Contingency Tables with Zero Diagonal," Journal of the American Statistical Association, 65.


Table 1
Chi-Square Table with Missing Diagonal Elements*
============================================================
Row: Column: Sum: Sum:
============================================================
*Note: An entry of -1 designates a missing cell frequency.

Table 2
Successive Iterations for the Diagonals in Table 1
============================================================
Diagonal Element: Iteration:
============================================================

APPENDIX A

The following program is written in Microsoft BASIC. Be sure to turn on your printer before running the program.

10 'MISSCHI.BAS: This program estimates the missing
20 '             frequencies in an rxc chi square
30 '             matrix, and calculates the resulting
40 '             chi square for association. It will
50 '             also calculate the chi square for
60 '             association when there are no missing
70 '             cells.
80 '
90 CLS
100 INPUT "HOW MANY ROWS";M
110 INPUT "HOW MANY COLUMNS";N
120 PRINT " "
130 ' X = entries and running totals, O = observed or estimated, E = expected
140 DIM X(M,N),O(M,N),E(M,N)
150 '
160 PRINT "Enter the frequency for the cell in the row"
170 PRINT "and column specified at each prompt. Enter"
180 PRINT "a frequency of -1 if the cell value is missing."
190 PRINT " "
200 ' Read the table; K counts the missing cells
210 FOR I = 1 TO M
220 FOR J = 1 TO N
230 PRINT "THE FREQUENCY IN ROW ";I;
240 PRINT ", COLUMN ";J;" = ";
250 INPUT X(I,J)
260 IF X(I,J) = -1 THEN LET K = K + 1
270 NEXT J
280 NEXT I
290 ' Total the known frequencies in X(0,0), X(I,0), and X(0,J)
300 FOR I = 1 TO M
310 FOR J = 1 TO N
320 IF X(I,J) = -1 THEN 360
330 X(0,0) = X(0,0) + X(I,J)
340 X(I,0) = X(I,0) + X(I,J)
350 X(0,J) = X(0,J) + X(I,J)
360 NEXT J
370 NEXT I
380 ' Print the table as entered
390 LPRINT "INITIAL MATRIX:"
400 LPRINT "(A frequency of -1 denotes a missing cell.)"
410 LPRINT " "
420 FOR I = 1 TO M
430 FOR J = 1 TO N
440 LPRINT USING "####.# ";X(I,J);
450 NEXT J
460 LPRINT " "
470 NEXT I
480 LPRINT " "
490 ' Copy the known frequencies to O; missing cells start at zero
500 FOR I = 1 TO M
510 FOR J = 1 TO N
520 IF X(I,J) = -1 THEN 540
530 O(I,J) = X(I,J)
540 NEXT J
550 NEXT I
560 ' One iteration: re-estimate each missing cell and update the totals
570 D = 0
580 FOR I = 1 TO M
590 FOR J = 1 TO N
600 IF X(I,J)<>-1 THEN 680
610 F = (X(I,0)-O(I,J))*(X(0,J)-O(I,J))
620 F = F/(X(0,0) - X(I,0) - X(0,J) + O(I,J))
630 X(0,0) = X(0,0) - O(I,J) + F
640 X(I,0) = X(I,0)-O(I,J)+F
650 X(0,J) = X(0,J)-O(I,J)+F
660 D = D + ABS(O(I,J) - F)
670 O(I,J) = F
680 NEXT J
690 NEXT I
700 ' Repeat until the total change D is no more than .01
710 IF D >.01 THEN 570
720 ' Print the completed table
730 LPRINT "FINAL MATRIX:"
740 LPRINT " "
750 FOR I = 1 TO M
760 FOR J = 1 TO N
770 LPRINT USING "####.# ";O(I,J);
780 NEXT J
790 LPRINT " "
800 NEXT I
810 LPRINT " "
820 ' Totals of the completed table
830 FOR I = 1 TO M
840 FOR J = 1 TO N
850 O(0,0) = O(0,0) + O(I,J)
860 O(I,0) = O(I,0) + O(I,J)
870 O(0,J) = O(0,J) + O(I,J)
880 NEXT J
890 NEXT I
900 ' Expected values under independence
910 FOR I = 1 TO M
920 FOR J = 1 TO N
930 E(I,J) = O(I,0)*O(0,J)/O(0,0)
940 NEXT J
950 NEXT I
960 '
970 LPRINT "EXPECTED VALUES:"
980 LPRINT " "
990 FOR I = 1 TO M
1000 FOR J = 1 TO N
1010 LPRINT USING "####.# ";E(I,J);
1020 NEXT J
1030 LPRINT " "
1040 NEXT I
1050 LPRINT " "
1060 ' Chi square
1070 FOR I = 1 TO M
1080 FOR J = 1 TO N
1090 X2 = X2 + (O(I,J) - E(I,J))^2/E(I,J)
1100 NEXT J
1110 NEXT I
1120 ' One degree of freedom is lost for each estimated cell
1130 DF = (M-1)*(N-1) - K
1140 '
1150 LPRINT "CHI SQUARE = ";X2
1160 LPRINT "DEGREES OF FREEDOM = ";DF
1170 LPRINT " "
1180 END
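For readers without a BASIC interpreter, the listing above can be approximated by the following Python sketch. It is a hedged port, not the authors' Visual Basic program: it keeps the listing's conventions (a frequency of -1 marks a missing cell, running totals are adjusted as each estimate changes, and iteration stops once the total change is no more than .01), but the function name `misschi`, its parameters, and the example tables are hypothetical, and results are returned instead of printed.

```python
def misschi(table, tol=0.01, missing=-1):
    """Estimate missing cells, then return (completed table, chi square, df)."""
    m, n = len(table), len(table[0])
    k = sum(1 for row in table for v in row if v == missing)

    # Known frequencies are copied; missing cells start at zero.
    obs = [[0.0 if v == missing else float(v) for v in row] for row in table]
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    total = sum(row_tot)

    while True:
        change = 0.0
        for i in range(m):
            for j in range(n):
                if table[i][j] != missing:
                    continue
                old = obs[i][j]
                new = ((row_tot[i] - old) * (col_tot[j] - old)
                       / (total - row_tot[i] - col_tot[j] + old))
                # Keep the running totals in step with the new estimate.
                row_tot[i] += new - old
                col_tot[j] += new - old
                total += new - old
                change += abs(new - old)
                obs[i][j] = new
        if change <= tol:
            break

    # Recompute the totals of the completed table, as the listing does.
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    total = sum(row_tot)
    chi2 = 0.0
    for i in range(m):
        for j in range(n):
            e = row_tot[i] * col_tot[j] / total
            chi2 += (obs[i][j] - e) ** 2 / e

    df = (m - 1) * (n - 1) - k
    return obs, chi2, df


# Hypothetical tables: one with a missing diagonal, one complete.
completed, chi2_missing, df_missing = misschi([[-1, 10, 20],
                                               [30, -1, 40],
                                               [50, 60, -1]])
_, chi2_complete, df_complete = misschi([[10, 20],
                                         [20, 10]])
print(chi2_missing, df_missing, chi2_complete, df_complete)
```

As in the listing, a table with no missing cells passes through the estimation loop once with no change, so the same function also gives the ordinary chi square for association.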