A New Method for Estimating the Risk Ratio in Studies Using Case-Parental Control Design


American Journal of Epidemiology, Vol. 148, No. 9
Copyright 1998 by The Johns Hopkins University School of Hygiene and Public Health. All rights reserved. Printed in U.S.A.

Fengzhu Sun,1 W. Dana Flanders,2 Quanhe Yang,3 and Muin J. Khoury4

The authors describe a new, simple, noniterative, yet efficient method to estimate the risk ratio in studies using case-parental control design. The new method is compared with two other noniterative methods, Khoury's method and Flanders and Khoury's method, and with a maximum likelihood-based method of Schaid and Sommer. The authors found that the variance of the new estimator is usually smaller than that of Khoury's method or Flanders and Khoury's method and that it is slightly larger than that of the maximum likelihood-based method of Schaid and Sommer. Despite the slightly larger variance of the new estimator compared with that of the maximum likelihood-based method, the simplicity of the new estimator and its variance makes the new method appealing. When genotypic information for only one parent is available, the authors also describe a method to estimate the risk ratio without assuming Hardy-Weinberg equilibrium or random mating. A simple formula for the variance of the estimator is given. Am J Epidemiol 1998;148:902-9.

case-control studies; gene frequency; genes; genetic markers; genetics; odds ratio; risk

Association studies are widely used in epidemiology. One such approach is the case-control design, in which case subjects and appropriate control subjects are selected from the population, and the fraction of case subjects with a risk factor is then compared with the fraction of control subjects with that risk factor. A potential problem with the case-control design involves the selection of appropriate control subjects. Rubinstein et al.
(1) and Falk and Rubinstein (2) proposed a case-parental control design to avoid this problem. With the case-parental control design, case subjects are randomly sampled and the hypothetical control subjects are assumed to carry the genotypes formed by the nontransmitted parental alleles. If the candidate locus for disease susceptibility has alleles "M" and "N", with "M" the susceptible allele, the fraction of case subjects with allele "M" is compared with the fraction of hypothetical control subjects with that allele. The haplotype relative risk is the odds ratio used with the case-parental control design. Ott (3) studied the statistical properties of the haplotype relative risk under the recessive model, and Knapp et al. (4) showed that the distribution of the estimator for haplotype relative risk is stochastically smaller than that for risk ratio in the presence of positive linkage disequilibrium between a marker and disease allele. For epidemiologic purposes, it is important to estimate the risk ratio of individuals carrying genotypes "MM" and "MN" versus those carrying "NN." Schaid and Sommer (5) gave maximum likelihood approaches to estimate the risk ratio with and without the Hardy-Weinberg equilibrium assumption.

Received for publication October 29, 1997, and accepted for publication March 25.
Abbreviations: CPG, conditional on parental genotype.
1 Department of Genetics, Emory University, Atlanta, GA.
2 Department of Epidemiology, Emory University, Atlanta, GA.
3 Birth Defects and Genetic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA.
4 Office of Genetics and Disease Prevention, Centers for Disease Control and Prevention, Atlanta, GA.
Reprint requests to Fengzhu Sun, Department of Genetics, Emory University School of Medicine, 1462 Clifton Road, 4th Floor, Atlanta, GA.

Because the main purpose of the case-parental control design is to avoid population stratification, the Hardy-Weinberg equilibrium assumption may not hold in some situations. Schaid and Sommer (5) proposed a maximum likelihood-based conditional on parental genotype (CPG) method to estimate the risk ratio without using the Hardy-Weinberg equilibrium assumption. Knapp et al. (6) showed how to estimate the relative risks based on the CPG method without using iterative algorithms. Khoury (7) and Flanders and Khoury (8) proposed noniterative methods to estimate the risk ratio. Both methods provide approximately unbiased estimates. Here we propose another simple noniterative method to estimate the risk ratio. The variance of the new estimator is generally smaller than those obtained using Khoury's method and Flanders and Khoury's method, and it is slightly larger than that obtained with the CPG method. We also give a simple formula to estimate the variance of the new estimator. The simplicity of the new estimator and its variance makes the new method appealing.

When genotypic information from only one parent is available, Schaid and Sommer (5) proposed an estimation method for the risk ratio assuming both Hardy-Weinberg equilibrium and random mating. Flanders and Khoury (8) proposed another method that only assumed random mating. The method we propose here requires neither of these two assumptions and is thus applicable to more general situations.

The organization of this paper is as follows. First, we describe the case-parental control design and present a new 4 × 4 table that summarizes relevant data by distinguishing the paternal and maternal alleles for case subjects and the hypothetical control subjects formed by the two nontransmitted parental alleles. Second, we calculate the probabilities for the 16 cells. Third, we present Khoury's method, Flanders and Khoury's method, and the CPG method of Schaid and Sommer in terms of the 16 cells. Fourth, we present the new estimation method and the variance of the new estimator. Finally, we compare the new method with the other three methods using Monte-Carlo simulations.

MATERIALS AND METHODS

The case-parental control design

In the case-parental control design, case subjects are randomly sampled from all new case subjects in the population, and both the case subjects and their parents are genotyped at the candidate locus. Relevant data thus consist of the genotypes of case subjects and their parents. To facilitate analysis, we summarize the data using a new 4 × 4 table (table 1) formed by the genotypes of case subjects and the genotypes of hypothetical control subjects carrying the nontransmitted parental alleles. In this table, we distinguish between paternal and maternal alleles; that is, the first allele is the paternal allele, and the second allele is the maternal allele.
We also use the following notation: NN = 0, NM = 1, MN = 1', and MM = 2, with the numbers corresponding to the number of susceptible alleles "M" a genotype has. Let $F_{ij}$ be the number of case subjects in cell $(i, j)$. The genotypes of parents are completely determined in each cell and are given in parentheses. The first genotype in each set of parentheses is the father's genotype, and the second is the mother's genotype. The first allele in each parental genotype is the transmitted allele, and the second allele is the nontransmitted allele. All the estimating methods are based on this table.

Calculation of cell probabilities. First let us calculate the cell probabilities for table 1. We do not assume the Hardy-Weinberg equilibrium nor do we assume random mating. There are a total of nine mating types in the parental generation that can be identified by distinguishing paternal and maternal genotypes: (NN, NN), (NN, MN), (NN, MM), (MN, NN), (MN, MN), (MN, MM), (MM, NN), (MM, MN), and (MM, MM). As above, we denote NN = 0, MN = 1, and MM = 2, corresponding to the number of mutant alleles in the genotype. Let $q_{ij}$ be the probability of mating type $(i, j)$ in the parental generation. Let $p_i$ be the risk of the disease among people with genotype $i$. Then it is possible to calculate the cell probabilities in table 1 conditional on the case subjects being affected. Table 2 gives the cell probabilities corresponding to table 1 multiplied by $P(D)$, the overall risk of the disease in the general population. The risk ratio for individuals having genotype "MN" versus those having genotype "NN" is defined as $\lambda_1 = p_1/p_0$, and the risk ratio for individuals having genotype "MM" versus those having genotype "NN" is defined as $\lambda_2 = p_2/p_0$. We also define $\lambda_{-1} = p_1/p_2$. The objective of this study is to obtain accurate noniterative methods for estimating $\lambda_1$ and $\lambda_2$.

The Khoury, Flanders and Khoury, and CPG methods

To simplify the presentation, we define the following variables.
$$m_{01} = F_{01} + F_{01'}, \quad m_{02} = F_{02}, \quad m_{10} = F_{10} + F_{1'0}, \quad m_{11} = F_{11'} + F_{1'1},$$
$$m_{12} = F_{12} + F_{1'2}, \quad m_{20} = F_{20}, \quad m_{21} = F_{21} + F_{21'}.$$

TABLE 1. Case-parental control design by transmitted and nontransmitted genotypes, distinguishing paternal and maternal alleles

                 Control genotype
Case genotype    NN (0)                 NM (1)                 MN (1')                 MM (2)
NN (0)           $F_{00}$ (NN, NN)      $F_{01}$ (NN, NM)      $F_{01'}$ (NM, NN)      $F_{02}$ (NM, NM)
NM (1)           $F_{10}$ (NN, MN)      $F_{11}$ (NN, MM)      $F_{11'}$ (NM, MN)      $F_{12}$ (NM, MM)
MN (1')          $F_{1'0}$ (MN, NN)     $F_{1'1}$ (MN, NM)     $F_{1'1'}$ (MM, NN)     $F_{1'2}$ (MM, NM)
MM (2)           $F_{20}$ (MN, MN)      $F_{21}$ (MN, MM)      $F_{21'}$ (MM, MN)      $F_{22}$ (MM, MM)
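As a rough illustration of how the 16 cell counts of table 1 collapse into the seven variables $m_{ij}$, a minimal Python sketch follows. The dictionary layout, the key "1p" standing for genotype 1', and the function name `collapse_cells` are assumptions made for this example and do not come from the paper.

```python
from typing import Dict, Tuple

# Genotype codes as in table 1: "0" = NN, "1" = NM, "1p" = MN (1'), "2" = MM.
# F maps (case genotype, control genotype) to the number of case subjects.
Cell = Tuple[str, str]


def collapse_cells(F: Dict[Cell, int]) -> Dict[str, int]:
    """Collapse the 4 x 4 table into the seven counts m_ij (hypothetical helper)."""
    return {
        "01": F[("0", "1")] + F[("0", "1p")],
        "02": F[("0", "2")],
        "10": F[("1", "0")] + F[("1p", "0")],
        # Only heterozygous cases with two heterozygous parents enter m_11;
        # cells (1, 1) and (1', 1') come from NN x MM matings and are not used.
        "11": F[("1", "1p")] + F[("1p", "1")],
        "12": F[("1", "2")] + F[("1p", "2")],
        "20": F[("2", "0")],
        "21": F[("2", "1")] + F[("2", "1p")],
    }
```

The cells $F_{00}$, $F_{11}$, $F_{1'1'}$, and $F_{22}$ are left out because the parental genotypes in those cells allow only one offspring genotype.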

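TABLE 2. Cell probabilities multiplied by $P(D)$ corresponding to table 1

                 Control genotype
Case genotype    NN (0)                     NM (1)                     MN (1')                    MM (2)
NN (0)           $q_{00}p_0$                $\frac12 q_{01}p_0$        $\frac12 q_{10}p_0$        $\frac14 q_{11}p_0$
NM (1)           $\frac12 q_{01}p_1$        $q_{02}p_1$                $\frac14 q_{11}p_1$        $\frac12 q_{12}p_1$
MN (1')          $\frac12 q_{10}p_1$        $\frac14 q_{11}p_1$        $q_{20}p_1$                $\frac12 q_{21}p_1$
MM (2)           $\frac14 q_{11}p_2$        $\frac12 q_{12}p_2$        $\frac12 q_{21}p_2$        $q_{22}p_2$

All the estimation methods depend on these seven variables. Khoury's estimator for $\lambda_1$ and $\lambda_2$ is given by

$$\hat\lambda_1 = \frac{m_{10}}{m_{01}}, \qquad \hat\lambda_2 = \frac{m_{20}}{m_{02}}.$$

By comparing table 1 above with table 1 in the paper by Flanders and Khoury (8), we calculate Flanders and Khoury's estimator as

$$\hat\lambda_1 = \frac{m_{10}/(m_{01} + m_{10}) + m_{11}/(m_{02} + m_{11} + m_{20})}{m_{01}/(m_{01} + m_{10}) + 2m_{02}/(m_{02} + m_{11} + m_{20})}, \qquad \hat\lambda_2 = \frac{m_{20}}{m_{02}}.$$

Note here that the estimator of $\lambda_2$ is the same as that given by Khoury (7). Flanders and Khoury (8) also proposed a modified method to estimate $\lambda_2$ in which $\lambda_{-1} = p_1/p_2$ can be estimated in a similar manner as $\lambda_1$:

$$\hat\lambda_{-1} = \frac{m_{12}/(m_{12} + m_{21}) + m_{11}/(m_{02} + m_{11} + m_{20})}{m_{21}/(m_{12} + m_{21}) + 2m_{20}/(m_{02} + m_{11} + m_{20})},$$

and then $\lambda_2$ can be estimated by $\hat\lambda_1/\hat\lambda_{-1}$.

Similarly, by comparing table 1 above with table 1 in Schaid and Sommer (5), the log-likelihood function to be maximized is given by

$$L(\lambda_1, \lambda_2) = (m_{20} + m_{21})\log(\lambda_2) + (m_{10} + m_{11} + m_{12})\log(\lambda_1) - (m_{12} + m_{21})\log(\lambda_1 + \lambda_2) - (m_{02} + m_{11} + m_{20})\log(\lambda_2 + 2\lambda_1 + 1) - (m_{01} + m_{10})\log(\lambda_1 + 1).$$

The maximum point of this log-likelihood function yields the CPG estimator. We use the approach of Knapp et al. (6) to calculate the CPG estimator.

The new estimation method

When the number of case subjects is large, table 1 approximately equals a constant times table 2. From tables 1 and 2, we obtain one approximately unbiased estimator for $\lambda_{-1}$ and $\lambda_1$, similar to that of the Khoury method, based on $m_{01}$, $m_{10}$, $m_{12}$, and $m_{21}$:

$$\hat\lambda_{-1}^{(1)} = \frac{m_{12}}{m_{21}}, \qquad \hat\lambda_1^{(1)} = \frac{m_{10}}{m_{01}}.$$

We can also obtain another approximately unbiased estimator for $\lambda_{-1}$ and $\lambda_1$ based on $m_{02}$, $m_{11}$, and $m_{20}$:

$$\hat\lambda_{-1}^{(2)} = \frac{m_{11}}{2m_{20}}, \qquad \hat\lambda_1^{(2)} = \frac{m_{11}}{2m_{02}}.$$

Now we have two estimators for both $\lambda_{-1}$ and $\lambda_1$. The weighted average of $\hat\lambda_{-1}^{(1)}$ with weight $m_{21}$ and $\hat\lambda_{-1}^{(2)}$ with weight $2m_{20}$ gives our estimator $\hat\lambda_{-1}$.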
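Similarly, the weighted average of $\hat\lambda_1^{(1)}$ with weight $m_{01}$ and $\hat\lambda_1^{(2)}$ with weight $2m_{02}$ gives our estimator $\hat\lambda_1$. The final estimator is

$$\hat\lambda_{-1} = \frac{m_{12} + m_{11}}{2m_{20} + m_{21}}, \qquad \hat\lambda_1 = \frac{m_{10} + m_{11}}{2m_{02} + m_{01}}.$$

Finally, we estimate $\lambda_2$ by $\hat\lambda_1/\hat\lambda_{-1}$. From table 2, it can be shown that the above estimator is an approximately unbiased estimator of the risk ratio. We note that Flanders and Khoury's estimator of $\lambda_{-1}$ can be regarded as a weighted average of $\hat\lambda_{-1}^{(1)}$ with weight $m_{21}/(m_{12} + m_{21})$ and $\hat\lambda_{-1}^{(2)}$ with weight $2m_{20}/(m_{02} + m_{11} + m_{20})$. Similarly, their estimator of $\lambda_1$ is the weighted average of $\hat\lambda_1^{(1)}$ with weight $m_{01}/(m_{01} + m_{10})$ and $\hat\lambda_1^{(2)}$ with weight $2m_{02}/(m_{02} + m_{11} + m_{20})$.

Next we treat $(m_{ij};\ i, j = 0, 1, 2)$ as a multinomial random vector. Using the delta method, the variance of the logarithm of $\hat\lambda_{-1}$ and $\hat\lambda_1$ can be approximated by

$$\mathrm{Var}(\ln \hat\lambda_{-1}) \approx \frac{1}{m_{12} + m_{11}} + \frac{1}{2m_{20} + m_{21}} + \frac{2m_{20}}{(2m_{20} + m_{21})^2},$$

$$\mathrm{Var}(\ln \hat\lambda_1) \approx \frac{1}{m_{10} + m_{11}} + \frac{1}{2m_{02} + m_{01}} + \frac{2m_{02}}{(2m_{02} + m_{01})^2}.$$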
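The estimator and its delta-method variances can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the dictionary keys, and the default normal quantile are assumptions for illustration, and the expression used for the variance of $\ln\hat\lambda_2$ is the one obtained by applying the same delta-method argument to $\ln\hat\lambda_1 - \ln\hat\lambda_{-1}$, where $m_{11}$ is the only count shared by the two numerators.

```python
import math
from typing import Dict, Tuple


def new_estimator(m: Dict[str, float]) -> Dict[str, float]:
    """Risk ratio estimates and approximate log-scale variances from the seven m_ij."""
    a1 = m["10"] + m["11"]          # numerator of lambda_1
    b1 = 2 * m["02"] + m["01"]      # denominator of lambda_1
    a_1 = m["12"] + m["11"]         # numerator of lambda_{-1}
    b_1 = 2 * m["20"] + m["21"]     # denominator of lambda_{-1}

    lam1 = a1 / b1
    lam_minus1 = a_1 / b_1
    lam2 = lam1 / lam_minus1

    var1 = 1 / a1 + 1 / b1 + 2 * m["02"] / b1 ** 2
    var_minus1 = 1 / a_1 + 1 / b_1 + 2 * m["20"] / b_1 ** 2
    # Covariance correction: m_11 appears in both numerators.
    var2 = var1 + var_minus1 - 2 * m["11"] / (a1 * a_1)

    return {
        "lam1": lam1, "lam_minus1": lam_minus1, "lam2": lam2,
        "var_log_lam1": var1, "var_log_lam_minus1": var_minus1,
        "var_log_lam2": var2,
    }


def confidence_limits(estimate: float, var_log: float, z: float = 1.96) -> Tuple[float, float]:
    """Limits obtained by treating log(estimate) as normally distributed."""
    half_width = z * math.sqrt(var_log)
    return estimate * math.exp(-half_width), estimate * math.exp(half_width)
```

For example, `new_estimator({"01": 20, "02": 10, "10": 80, "11": 80, "12": 20, "20": 100, "21": 50})` uses made-up counts chosen so that the point estimates are easy to check by hand. All seven counts must be positive for the ratios and variances to be defined.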

The variance of the logarithm of $\hat\lambda_2$ can be approximated by

$$\mathrm{Var}(\ln \hat\lambda_2) \approx \mathrm{Var}(\ln \hat\lambda_{-1}) + \mathrm{Var}(\ln \hat\lambda_1) - \frac{2m_{11}}{(m_{10} + m_{11})(m_{12} + m_{11})}.$$

As previously shown (2, 8), we can approximate the confidence limits of the risk ratio by treating the logarithm of the estimated risk ratio as though it is normally distributed.

Only one parental genotype is known. Sometimes genotypic information from only one parent is available. We summarize the relevant data as in table 3, where $A_{ij}$ is the number of genotype $i$ case subjects whose available parent has genotype $j$. We define the following estimators of $\lambda_{-1}$ and $\lambda_1$, respectively:

$$\hat\lambda_{-1} = \frac{A_{11} + A_{12} - A_{10}}{2A_{21}}, \qquad \hat\lambda_1 = \frac{A_{11} + A_{10} - A_{12}}{2A_{01}}.$$

Then $\lambda_2$ is estimated by $\hat\lambda_1/\hat\lambda_{-1}$. It can be shown that the estimator is an approximately unbiased estimator of the corresponding risk ratio when $\{q_{ij}\}$ are symmetric. It is also an approximately unbiased estimator of the corresponding risk ratio if father or mother is missing with equal probability 1/2. Using the delta method, the variance of the logarithm of the estimated risk ratio is given by

$$\mathrm{Var}(\ln \hat\lambda_{-1}) \approx \frac{1}{A_{21}} + \frac{1}{A_{11} + A_{12} - A_{10}} + \frac{2A_{10}}{(A_{11} + A_{12} - A_{10})^2},$$

$$\mathrm{Var}(\ln \hat\lambda_1) \approx \frac{1}{A_{01}} + \frac{1}{A_{11} + A_{10} - A_{12}} + \frac{2A_{12}}{(A_{11} + A_{10} - A_{12})^2},$$

$$\mathrm{Var}(\ln \hat\lambda_2) \approx \mathrm{Var}(\ln \hat\lambda_{-1}) + \mathrm{Var}(\ln \hat\lambda_1) - \frac{2(A_{11} - A_{10} - A_{12})}{A_{11}^2 - (A_{10} - A_{12})^2}.$$

When $\{q_{ij}\}$ are not symmetric, $\hat\lambda_{-1}$ and $\hat\lambda_1$ given above are not approximately unbiased estimators of $\lambda_{-1}$ and $\lambda_1$. Next we give another estimation method that is unbiased even if $\{q_{ij}\}$ are not symmetric. Let $P$ and $M$ be the numbers of available fathers and mothers, respectively. Let $P_{ij}$ ($M_{ij}$) be the number of genotype $i$ case subjects whose father (mother) has genotype $j$. Define $A'_{ij} = MP_{ij} + PM_{ij}$. The new estimator is given as above by replacing $A_{ij}$ with $A'_{ij}$. From table 1 we can obtain the number of fathers and mothers with genotypes 0, 1, and 2 given the genotypes of case subjects, as shown in table 4. From table 2, the corresponding cell probabilities multiplied by $P(D)$ for table 4 can be obtained as given in table 5.
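The one-parent estimators above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names `one_parent_estimator` and `pooled_counts` and the dictionary layout keyed by (case genotype, parental genotype) are hypothetical, and the log-scale variances returned apply only to the unweighted counts $A_{ij}$, not to the weighted counts $A'_{ij}$.

```python
from typing import Dict, Tuple

Counts = Dict[Tuple[int, int], float]  # (case genotype i, parental genotype j) -> count


def one_parent_estimator(A: Counts) -> Dict[str, float]:
    """Estimates of lambda_1, lambda_{-1}, lambda_2 from a table 3 style layout."""
    n1 = A[(1, 1)] + A[(1, 0)] - A[(1, 2)]        # numerator of lambda_1
    n_minus1 = A[(1, 1)] + A[(1, 2)] - A[(1, 0)]  # numerator of lambda_{-1}
    lam1 = n1 / (2 * A[(0, 1)])
    lam_minus1 = n_minus1 / (2 * A[(2, 1)])
    # Delta-method variances; valid for raw counts A_ij only.
    var1 = 1 / A[(0, 1)] + 1 / n1 + 2 * A[(1, 2)] / n1 ** 2
    var_minus1 = 1 / A[(2, 1)] + 1 / n_minus1 + 2 * A[(1, 0)] / n_minus1 ** 2
    cov_term = 2 * (A[(1, 1)] - A[(1, 0)] - A[(1, 2)]) / (n1 * n_minus1)
    return {
        "lam1": lam1,
        "lam_minus1": lam_minus1,
        "lam2": lam1 / lam_minus1,
        "var_log_lam1": var1,
        "var_log_lam_minus1": var_minus1,
        "var_log_lam2": var1 + var_minus1 - cov_term,
    }


def pooled_counts(P_counts: Counts, M_counts: Counts) -> Counts:
    """Weighted counts A'_ij = M * P_ij + P * M_ij for asymmetric mating probabilities.

    P_counts and M_counts must list every observed cell, so that their totals
    equal the numbers of available fathers (P) and mothers (M).
    """
    P = sum(P_counts.values())
    M = sum(M_counts.values())
    cells = set(P_counts) | set(M_counts)
    return {c: M * P_counts.get(c, 0) + P * M_counts.get(c, 0) for c in cells}
```

When fathers and mothers are available in unequal numbers, `one_parent_estimator(pooled_counts(P_counts, M_counts))` gives the point estimates based on $A'_{ij}$; the variance entries of that result should be ignored, since the paper states that the variance of the weighted estimator is more complicated.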
From tables 4 and 5, we can easily calculate the expected values of $A'_{ij}$. Substitution of the expected values of $A'_{ij}$ into the expression for the estimated risk ratio shows approximate unbiasedness. We assume that the availability of the maternal genotype is independent of the individual genotype, as is the availability of the paternal genotype. The variance of this estimator is much more complicated and is omitted from the paper. Interested readers can request the exact formula from the first author.

TABLE 3. Case-parental control design when only one parental genotype is available

                 Parental genotype
Case genotype    NN (0)       MN (1)       MM (2)
NN (0)           $A_{00}$     $A_{01}$
MN (1)           $A_{10}$     $A_{11}$     $A_{12}$
MM (2)                        $A_{21}$     $A_{22}$

TABLE 4. Case-parental control design when only one parental genotype is available

                 Parental genotype
Case genotype    NN (0)                   MN (1)                                        MM (2)
When only father is available
NN (0)           $F_{00} + F_{01}$        $F_{01'} + F_{02}$
MN (1)           $F_{10} + F_{11}$        $F_{11'} + F_{12} + F_{1'0} + F_{1'1}$        $F_{1'1'} + F_{1'2}$
MM (2)                                    $F_{20} + F_{21}$                             $F_{21'} + F_{22}$
When only mother is available
NN (0)           $F_{00} + F_{01'}$       $F_{01} + F_{02}$
MN (1)           $F_{1'0} + F_{1'1'}$     $F_{1'1} + F_{1'2} + F_{10} + F_{11'}$        $F_{11} + F_{12}$
MM (2)                                    $F_{20} + F_{21'}$                            $F_{21} + F_{22}$

TABLE 5. Cell probabilities multiplied by $P(D)$ corresponding to table 4

                 Parental genotype
Case genotype    NN (0)                              MN (1)                                     MM (2)
When only father is available
NN (0)           $\frac12(2q_{00} + q_{01})p_0$      $\frac14(2q_{10} + q_{11})p_0$
MN (1)           $\frac12(q_{01} + 2q_{02})p_1$      $\frac12(q_{12} + q_{11} + q_{10})p_1$     $\frac12(2q_{20} + q_{21})p_1$
MM (2)                                               $\frac14(q_{11} + 2q_{12})p_2$             $\frac12(q_{21} + 2q_{22})p_2$
When only mother is available
NN (0)           $\frac12(2q_{00} + q_{10})p_0$      $\frac14(2q_{01} + q_{11})p_0$
MN (1)           $\frac12(q_{10} + 2q_{20})p_1$      $\frac12(q_{21} + q_{11} + q_{01})p_1$     $\frac12(2q_{02} + q_{12})p_1$
MM (2)                                               $\frac14(q_{11} + 2q_{21})p_2$             $\frac12(q_{12} + 2q_{22})p_2$

Comparisons of the estimation methods

Next we use Monte-Carlo simulations to compare the four estimation methods: the CPG method of Schaid and Sommer (5), Khoury's method (7), Flanders and Khoury's method (8), and the new method. Under the Hardy-Weinberg equilibrium and random mating, the nine mating probabilities are determined by a single parameter, $f$, the prevalence of allele "N" in the parental population. The following equations hold under the assumption of the Hardy-Weinberg equilibrium and random mating.

$$q_{00} = f^4, \quad q_{01} = q_{10} = 2f^3(1 - f), \quad q_{02} = q_{20} = f^2(1 - f)^2,$$
$$q_{11} = 4f^2(1 - f)^2, \quad q_{12} = q_{21} = 2f(1 - f)^3, \quad q_{22} = (1 - f)^4.$$

To simplify the presentation, we assumed the Hardy-Weinberg equilibrium and random mating in the parental population in our simulations, although our results should hold even if these conditions are not present. We also fixed the three risks $p_0$, $p_1$, and $p_2$ for individuals harboring genotypes 0, 1, and 2, respectively. For all four estimators to be meaningful, none of the $m_{ij}$ should equal zero. We chose the number of case subjects such that the minimum expected value of $m_{ij}$ is about 1 for given $f$, the prevalence of allele "N" in the general population, and for given risks $p_0$, $p_1$, and $p_2$. For that number of case subjects, we ran the simulations 1, times and recorded the estimated risk ratios calculated by each of the four methods. We calculated the average of the estimated risk ratios and the square root of the average squared differences of the estimated risk ratios from the true risk ratio for each method. Table 6 gives the averages of the estimated risk ratios from 1, simulations using Khoury's method, Flanders and Khoury's method, the new method, and the CPG method. Table 7 gives the square roots of the average squared differences of the estimated risk ratios from the true risk ratio. In both table 6 and table 7, we let $\lambda_1 = p_1/p_0 = 4$, $\lambda_2 = p_2/p_0 = 10$, and $f$ = .2, .5, and .8.

TABLE 6. The average estimated risk ratios through 1, simulations using Khoury's method (K), Flanders and Khoury's method (FK), the new method (N), and the conditional on parental genotype method (CPG), with $\lambda_1 = 4$ and $\lambda_2 = 10$ and $f$, allele frequency of N
Columns: $f$; $\hat\lambda_1$ (K, FK, N, CPG); $\hat\lambda_2$ (K, FK, N, CPG)

TABLE 7. The square roots of the average squared differences of estimated risk ratios with the true risk ratio through 1, simulations, with $\lambda_1 = 4$ and $\lambda_2 = 10$, using Khoury's method (K), Flanders and Khoury's method (FK), the new method (N), and the conditional on parental genotype method (CPG)
Columns: $f$; $\hat\lambda_1$ (K, FK, N, CPG); $\hat\lambda_2$ (K, FK, N, CPG)

RESULTS

In these simulations, the average number of cases was 55, 48, and 6,4 for $f$ = .2, .5, and .8, respectively. Some general patterns emerge from our simulation results (tables 6 and 7), although they may not hold for all parameters. As seen in table 6, all simulations yield risk ratios that are close to the true value on average, although all averages are slightly above the true value with these sample sizes, and the new and CPG methods are the closest. The variances of the four estimators are in the order of Khoury's method > Flanders and Khoury's method > the new method > CPG. The variance of the new estimator is only slightly larger than that of the CPG method. We also simulated the process using other parameters (data not shown). The patterns are similar.

We also calculated the variance and the lower and upper confidence limits of the logarithm of the estimated risk ratio. We used $\lambda_1 = 4$, $\lambda_2 = 10$, and $f$ = .2 in these simulations. The histograms of $\mathrm{Var}(\log \hat\lambda_1)$ and $\mathrm{Var}(\log \hat\lambda_2)$ are given in figure 1. The real simulated variances of $\log \hat\lambda_1$ and $\log \hat\lambda_2$ are .026 and .039, respectively. From figure 1, we see that the estimated variances of $\log \hat\lambda_1$ and $\log \hat\lambda_2$ using our formulas center around the true variances. The histograms for the upper 1 percent confidence limits of $\log \lambda_1$ and $\log \lambda_2$ using the proposed approach are given in figure 2. The real values of the upper 1 percent confidence limits of $\log \lambda_1$ and $\log \lambda_2$ are 1.67 and 2.65, respectively. Thus, the estimated confidence limits also center around the true values. In the simulations, $\log \lambda_1 = \log 4 \approx 1.39$, and $\log \lambda_2 = \log 10 \approx 2.30$. From the simulated data, we see that in 94.8 percent of the simulations, the estimated upper 1 percent confidence limit of $\log \lambda_1$ is greater than its true value. Similarly, in 95.2 percent of the simulations, the estimated upper 1 percent confidence limit of $\log \lambda_2$ is greater than its true value.

FIGURE 1. The histograms for the estimated variances of $\log \hat\lambda_1$ (a) and $\log \hat\lambda_2$ (b). $\lambda_1 = 4$, $\lambda_2 = 10$, and $f$ = .2. The real variances are .026 (a) and .039 (b), respectively. Horizontal axis: estimated variance.

FIGURE 2. The histograms for the estimated upper 1 percent confidence limits for $\log \lambda_1$ (a) and $\log \lambda_2$ (b). $\lambda_1 = 4$, $\lambda_2 = 10$, and $f$ = .2. The real upper 1 percent confidence limits are 1.67 (a) and 2.65 (b), respectively. Horizontal axis: estimated upper 1 percent confidence limit.
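The simulation design described in this section can be sketched with the Python standard library. This is a simplified illustration rather than a reproduction of the study: it compares only Khoury's estimator and the new estimator of $\lambda_1$, and the function names, the seed, the number of case subjects, and the number of replicates are arbitrary choices for the example, not the settings used for tables 6 and 7. Cell probabilities follow table 2 under the Hardy-Weinberg equilibrium and random mating, with $f$ the frequency of allele "N".

```python
import math
import random
from collections import Counter

CELLS = ["01", "02", "10", "11", "12", "20", "21", "other"]


def cell_probabilities(f, p0, p1, p2):
    """Probabilities of the seven m_ij cells (plus unused cells) among case subjects."""
    g = [f * f, 2 * f * (1 - f), (1 - f) ** 2]       # genotype frequencies NN, MN, MM
    q = [[gi * gj for gj in g] for gi in g]          # q[i][j]: father i, mother j
    w = {
        "01": 0.5 * (q[0][1] + q[1][0]) * p0,
        "02": 0.25 * q[1][1] * p0,
        "10": 0.5 * (q[0][1] + q[1][0]) * p1,
        "11": 0.5 * q[1][1] * p1,
        "12": 0.5 * (q[1][2] + q[2][1]) * p1,
        "20": 0.25 * q[1][1] * p2,
        "21": 0.5 * (q[1][2] + q[2][1]) * p2,
        # Cells F_00, F_11, F_1'1', F_22, which no estimator uses.
        "other": q[0][0] * p0 + (q[0][2] + q[2][0]) * p1 + q[2][2] * p2,
    }
    total = sum(w.values())                          # equals P(D)
    return {k: v / total for k, v in w.items()}


def simulate(f, lam1, lam2, n_cases, n_reps, seed=0):
    """Mean and root mean squared error of two estimators of lambda_1."""
    rng = random.Random(seed)
    probs = cell_probabilities(f, 1.0, lam1, lam2)   # risks relative to p0 = 1
    weights = [probs[c] for c in CELLS]
    estimates = {"khoury": [], "new": []}
    for _ in range(n_reps):
        m = Counter(rng.choices(CELLS, weights=weights, k=n_cases))
        if min(m[c] for c in CELLS[:-1]) == 0:
            continue                                 # skip replicates with an empty cell
        estimates["khoury"].append(m["10"] / m["01"])
        estimates["new"].append((m["10"] + m["11"]) / (m["01"] + 2 * m["02"]))
    summary = {}
    for name, values in estimates.items():
        if values:
            mean = sum(values) / len(values)
            rmse = math.sqrt(sum((v - lam1) ** 2 for v in values) / len(values))
            summary[name] = (mean, rmse)
    return summary
```

A call such as `simulate(0.5, 4.0, 10.0, n_cases=2000, n_reps=200, seed=1)` returns, for each method, the average estimate and the square root of the average squared difference from the true $\lambda_1$, the two summaries reported in tables 6 and 7.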

DISCUSSION

In this paper, we propose a new simple noniterative method for estimating the risk ratio in studies using case-parental control design. The new estimation method has several important features. First, it is an approximately unbiased estimator of the risk ratio. Second, the variance of the new estimator is smaller than that of Khoury's estimator and Flanders and Khoury's estimator, and it is roughly the same as that of the maximum likelihood-based method of Schaid and Sommer. Third, there exists a simple approximate formula for the variance of the logarithm of the estimated risk ratio. The simplicity of the new estimator and its variance makes the new method appealing. When information from only one parent is available, we propose a method to estimate the risk ratio without assuming the Hardy-Weinberg equilibrium or random mating, thus relaxing the conditions in earlier studies (5, 8). Therefore, the new method is applicable to more general situations.

ACKNOWLEDGMENTS

This research is partly supported by a grant from the Research Council of Emory University and NIH FIRST award R29-DK53392 (to F. Sun). The authors thank Dr. Schaid for bringing their attention to the noniterative solution of the CPG method by Dr. Knapp et al. (6). They also acknowledge Dr. Knapp for suggestions that led to the improvement of the presentation.

REFERENCES

1. Rubinstein P, Walker M, Carpenter C, et al. Genetics of HLA disease associations: the use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates in juvenile diabetes from three racial groups. (Abstract). Hum Immunol 1981;3:
2. Falk CT, Rubinstein P. Haplotype risk ratio: an easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet 1987;51:
3. Ott J. Statistical properties of the haplotype relative risk. Genet Epidemiol 1989;6:
4. Knapp M, Seuchter SA, Baur MP. The haplotype-relative-risk (HRR) method for analysis of association in nuclear families. Am J Hum Genet 1993;52:
5. Schaid DJ, Sommer SS. Genotype risk ratio: methods for design and analysis of candidate-gene association studies. Am J Hum Genet 1993;53:
6. Knapp M, Wassmer G, Baur MP. The relative efficiency of the Hardy-Weinberg equilibrium-likelihood and the conditional on parental genotype-likelihood methods for candidate-gene association studies. Am J Hum Genet 1995;57:
7. Khoury MJ. Case-parental control method in the search for disease susceptibility genes. Am J Hum Genet 1994;55:
8. Flanders WD, Khoury MJ. Analysis of case-parental control studies: method for the study of associations between disease and genetic markers. Am J Epidemiol 1996;144: