MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE


DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Jie Ding, B.S., M.S.

The Ohio State University, 2008

Dissertation Committee: Shili Lin, Adviser; Laura Kubatko; Joseph Verducci

Graduate Program in Biostatistics

© Copyright by Jie Ding 2008

ABSTRACT

Family-based association testing is one way of mapping disease susceptibility genes: it tests for association between marker genotypes and disease phenotypes in family data. Real datasets usually contain missing genotypes. We proposed the Monte Carlo pedigree disequilibrium test (MCPDT) to test for association using general pedigree data with missing genotypes. It generates Monte Carlo samples of the missing genotypes conditional on the observed genotypes and then calculates test statistics from those samples. In a simulation study, it performed better than other family-based association tests. Since MCPDT uses estimates of population marker allele frequencies when generating Monte Carlo samples, population structure may bias MCPDT statistics. To adjust for population structure in MCPDT, a Markov chain Monte Carlo algorithm was designed to infer the structure from pedigree data with multiple null markers, and the inferred structure was then used in MCPDT. Simulation studies were done to evaluate the performance of this method.

ACKNOWLEDGMENTS

I'd like to thank Dr. Shili Lin for her guidance over the past five years, Dr. Laura Kubatko and Dr. Joseph Verducci for serving on my dissertation committee, and Dr. Chris Hans for serving on my candidacy exam committee.

VITA

B.S., Biochemistry and Molecular Biology, Peking University, China
Ph.D. student, Molecular, Cellular and Developmental Biology, The Ohio State University
M.S., Statistics, The Ohio State University
2004-present: Graduate research associate with Dr. Shili Lin, Department of Statistics, The Ohio State University

PUBLICATIONS

Research Publications

Lin, S., Ding, J., Dong, C., Liu, Z., Ma, Z.J., Wan, S. and Xu, Y. Comparisons of methods for linkage analysis and haplotype reconstruction using extended pedigree data. BMC Genetics, 6 (Suppl 1): S76, 2005

Ding, J., Lin, S. and Liu, Y. Monte Carlo pedigree disequilibrium test for markers on the X chromosome. American Journal of Human Genetics, 79: , 2006

Ding, J. and Lin, S. XMCPDT does have correct type I error rate. American Journal of Human Genetics, 82: , 2008

Lin, S. and Ding, J. Integration of ranked lists via cross entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics, DOI: /j x, 2008

FIELDS OF STUDY

Major Field: Biostatistics

Studies in:
Linkage analysis with linkage disequilibrium (Dr. Shili Lin)
Family-based association analysis with missing genotypes (Dr. Shili Lin)
Rank aggregation via cross entropy Monte Carlo (Dr. Shili Lin)

TABLE OF CONTENTS

Abstract
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   Basic concepts in genetics
   Genetic mapping
   Association mapping
   Population structure
   Family-based methods
   Organization of this dissertation

2. Monte Carlo pedigree disequilibrium test
   MCPDT statistics
   Expectation of $D_{MC}$
   Simulation study
   Bias in $D_{MC}$
   Discussions

3. Impact of population structure on MCPDT
   Bias for a family trio
   Bias for 3-generation pedigrees
   MCPDT with two subpopulations

4. Estimating population structure
   Basic idea of STRUCTURE
   Family-based STRUCTURE
   Simulation study
   Errors in subpopulation origin estimates
   Errors in allele frequency estimates
   Choosing values of K in MCMC

5. Adjusting for population structure in MCPDT
   Methods of combining f-structure and MCPDT
   Simulation study

6. Summary and future directions
   Summary
   Future directions

Bibliography

LIST OF TABLES

2.1 Simulation settings for XMCPDT
Type I error rates of XPDT, XRCTDT, XAPL and XMCPDT
Expectations and standard deviations (in parentheses) of $D_{MC}$
Bias for a family trio with one parent ungenotyped
Bias for a family trio with both parents ungenotyped
Simulation settings for MCPDT
Simulation settings for MCPDT with subpopulations
Simulation settings for f-structure
Choosing K based on BIC
Descriptive statistics of two subpopulations under the second setting

LIST OF FIGURES

2.1 Power of XPDT, XRCTDT, XAPL and XMCPDT
Two pedigree structures
$D_{MC}$ for marker M
$D_{MC}$ for marker D
Type I error rates of MCPDT with two subpopulations
Power of MCPDT with two subpopulations
Mean error rates of subpopulation origin estimates for two subpopulations
Mean error rates of subpopulation origin estimates for three subpopulations
Mean absolute error in allele frequency estimates under Setting
Mean absolute error in allele frequency estimates under Setting
Mean absolute error in allele frequency estimates under Setting
Mean absolute error in allele frequency estimates under Setting
Likelihood of the data under Setting
Likelihood of the data under Setting
4.9 Likelihood of the data under Setting
Likelihood of the data under Setting
Type I error rates of MCPDT and MCMCPDT
Power of MCPDT and MCMCPDT

CHAPTER 1

INTRODUCTION

1.1 Basic concepts in genetics

In most human cells, genetic information is stored in 22 pairs of chromosomes plus two X chromosomes in females and one X and one Y chromosome in males. The X and Y chromosomes are called sex chromosomes and the other 22 pairs of chromosomes are called autosomes. Each chromosome consists of a long sequence of DNA, which carries genetic information encoded in nucleotides. The two chromosomes in each pair carry similar information, but normally there are differences at many locations on the two chromosomes.

A locus (plural loci) is a site on a chromosome. If at a locus there are two or more forms of sequences in a population, we say there are polymorphisms at that locus. Each form of sequence is called an allele at that locus. For every locus on an autosome (or on the X chromosome in a female), there are two alleles in each person, one on each chromosome in that pair. These two alleles form a genotype. The process of determining the genotypes is called genotyping. If the two alleles in a genotype are of the same type, the genotype is homozygous; otherwise it is heterozygous. The combination of alleles at two or more loci on a chromosome is called a haplotype.

One person inherits one set of chromosomes from each parent. A female gets 22 autosomes and one X chromosome from each parent. A male gets 22 autosomes and one X chromosome from his mother and 22 autosomes and one Y chromosome from his father. During this process, crossovers happen between each pair of chromosomes (22 pairs of autosomes and one pair of X chromosomes) in each parent, which cause the two chromosomes in each pair to exchange some segments before being transmitted to the offspring. This is called recombination. Hence each chromosome inherited by an offspring is a mixture of two grandparental chromosomes from each parent. Because of recombination, alleles at two loci very close to each other on the same chromosome have a higher chance of being passed together to one's offspring than alleles at two loci that are far apart. The probability of recombination happening between two loci is called the recombination fraction, which is between 0 and 1/2 under certain assumptions.[1] It reflects the distance between these two loci.

Genes are functional segments of sequences on chromosomes. Many of them can influence observable traits called phenotypes. Many genes are related to diseases. People with certain alleles of some genes may have a higher risk of getting some diseases. The probability of getting a particular disease given the genotype at a gene locus is called the penetrance of that genotype.

Markers are characterized loci on the chromosomes. Normally the location and the number and types of alleles of a marker are known through genotyping individuals in a population. They may not have any function by themselves, but since their locations are known, they are commonly used in genetic mapping. A common type of marker is the single nucleotide polymorphism (SNP), which is a sequence variation at

a single nucleotide location on a chromosome. For a SNP marker, an allele is just a single nucleotide (A, G, C or T) at the marker locus.

1.2 Genetic mapping

Genetic mapping is the process of locating genetic variants in the human genome that contribute to the risks of diseases. There are two common approaches to achieve this goal. One approach is linkage mapping, which identifies genomic regions that are co-inherited by genetically related individuals with a disease more often than expected by chance alone. As mentioned above, alleles that are close on a chromosome are more likely to be passed to the same offspring. Hence the genomic regions that are more often co-inherited by affected relatives have higher chances of containing the genes contributing to the risk of that disease. Since the discovery of human DNA sequence variants that can be assayed directly and used as genetic markers,[2] linkage mapping has been successful in identifying genes underlying many Mendelian diseases and traits. A Mendelian disease refers to a disease for which certain genetic variants at a single locus increase the risk greatly, resulting in an almost one-to-one correspondence between the genotype at that locus and the phenotype of the disease (e.g. affected or unaffected).

However, most common diseases, including diabetes, heart diseases and cancers, are complex diseases, in the sense that they may be related to genetic variants at many different loci, each of which contributes only modestly to the overall genetic risk. Although one genetic variant may cause only a small increase in the risk, if this variant is relatively common in the population, it can still contribute to a large number of patients and hence is still worth studying. It has been shown that for such genetic variants, linkage mapping

has very low power, and may need impractically large sample sizes to detect those genetic variants.[3]

The other common approach to genetic mapping is association mapping, which identifies genotypes that are associated with the phenotype of a disease in a population. For complex diseases or traits, association mapping potentially has higher power than linkage mapping.[3] Association studies can be done using a candidate-gene approach, where a panel of genetic markers in one or a few candidate genes is tested for association. Alternatively, genome-wide association studies test a large number of genetic markers covering a large part or all of the human genome.[4]

Unless the genetic variants underlying diseases are tested directly, association mapping relies on linkage disequilibrium between marker loci and causal loci. Linkage disequilibrium exists when there is association between genotypes at two different loci, which normally happens for loci that are very close to each other on a chromosome. Genotypes at a marker locus in linkage disequilibrium with a disease gene can also show association with the disease phenotypes, with the strength of association depending on the level of linkage disequilibrium between the marker and the disease loci.

When a disease variant was created by a mutation, it was associated with the genetic variants on the same chromosome where the mutation happened. As that disease variant was passed through generations, recombination events would gradually separate it from other genetic variants. After a number of generations, the disease variant would be associated only with other genetic variants very close to it. Because of this, linkage disequilibrium normally decays rapidly as the distance between two loci increases, so a very dense map of DNA markers is needed to cover a region on a chromosome or the whole genome through linkage disequilibrium. Thanks

to the advancement of low-cost and high-throughput genotyping technologies, a vast number of single nucleotide polymorphism (SNP) markers have now been identified in the human genome. The dbSNP database includes more than 14 million SNP markers.[5] Phase I of the HapMap project genotyped more than 1 million SNPs in 269 DNA samples from 4 different populations.[6] Phase II of the HapMap project increased the number of genotyped SNPs to more than 3.1 million, which included 25-35% of common SNP variation in those populations.[7] The HapMap project also estimated the linkage disequilibrium structure in the human genome, which can be used to select the SNP markers needed to cover certain chromosome regions or the whole genome through linkage disequilibrium in association studies.

Although association mapping is a potentially powerful method of genetic mapping, there are still various problems and limitations, some of which are discussed in the following sections.

1.3 Association mapping

There are two different study designs used in association studies: population-based studies and family-based studies. In population-based studies, unrelated cases with a disease and controls without the disease are sampled from a population. The frequencies of alleles, genotypes or haplotypes at the test loci are compared between cases and controls to find variants that are associated with the disease. In family-based studies, genotype data from pedigrees with affected offspring are collected. Association tests are based on the transmissions of DNA variants from parents to affected offspring,[8,9] comparisons of DNA variants between affected and unaffected siblings,[10] or both.[11]

Some power comparisons between these types of approaches showed that the differences in statistical power are generally small when the use of family trios is compared to the case-control design.[12] The main advantage of population-based designs is that it is more economical and less time-consuming to collect unrelated cases and controls than to collect family data, especially for late-onset diseases, where parents of patients may not be available at the time of study. Also, more individuals may need to be genotyped in family-based designs to achieve similar power. But population-based designs are prone to confounding, which means the data cannot distinguish between associations due to true causal loci and associations due to other sampling factors. Confounding can be caused by poor matching of cases and controls or by population structure. In contrast, since family-based designs use family members as controls, they are generally robust against population structure. Also, it is possible to test other genetic effects in family-based studies, such as maternal effect, where the genotypes of mothers affect the phenotypes of offspring, or imprinting, where the parental origins of genetic variants affect the phenotypes of offspring.[13]

1.4 Population structure

Population structure can cause spurious associations in a population-based study. This arises when cases are overrepresented in one subgroup and some markers used in the association study have different allele frequencies in that subgroup than in the general population where the controls come from. Those markers will show spurious associations with the disease, even if there are no genuine associations between them. Although care can be taken to match cases and controls in terms of ethnicity and/or

some other criteria, there might still be undetected population stratification that can cause inflated type I error rates of a population-based association test.

There has been a great deal of research on controlling or eliminating the effects of population structure on association studies. Family-based design is one approach, which is discussed in detail in the next section. Many methods have also been proposed for population-based designs. The general idea is to use information from markers assumed not to be associated with the disease (null markers) to control the effects of population structure. Genomic control (GC)[14,15] is a commonly used method, where association test statistics are calculated at each of the null SNP markers and then used to adjust the test statistics at the candidate SNPs. If population structure is inflating the test statistics at the null SNPs, the structure effect on the statistics at the candidate SNPs may be canceled out after the adjustment. It performs well in many situations, but it only applies to simple markers like SNPs and may be conservative or anti-conservative depending on the situation.[16]

Structured association methods take another approach,[17,18] using null marker information to infer population structure and then testing for association conditional on the inferred structure. Unlike GC, these methods explicitly model population structure, which may be useful in some other applications. But they are computationally expensive compared to GC. Also, it is generally difficult to estimate the number of underlying subpopulations.

Another way to use null DNA markers is to include them as covariates in logistic regression analysis.[19] This method is more general than GC because the effects of other covariates can also be included in the model. When many null markers are available,

principal components analysis can be used to explicitly model ancestry differences between cases and controls.[20] A recently proposed method uses null markers to model the odds of disease and calculates a stratification score for each individual. Association tests at the test loci are then done within groups of individuals with similar scores.[21]

1.5 Family-based methods

The transmission/disequilibrium test (TDT)[8,9] was originally used to test for linkage in the presence of association. But since both linkage and association need to be present to reject the null hypothesis in TDT, it can also be used as a test for association. TDT uses family trios, each of which consists of two parents and one affected offspring. The test is based on comparing the observed number of alleles of a certain type transmitted to the offspring with the number expected under Mendelian law. If affected offspring in the dataset inherited more copies of a certain allele at the test marker locus than expected, it suggests that association may exist between the test marker and the disease.

Since TDT only uses family trios and needs complete genotype data for all three individuals in each trio, many extensions have been proposed to handle more general family structures and/or missing genotypes. When parental genotypes are not available, sibling TDT (STDT) uses the genotypes of phenotypically discordant sibships (i.e. some siblings are affected and some are not) to test for association.[10] It compares the numbers of a certain allele at the test marker locus in affected siblings and in unaffected siblings. If affected siblings on average have significantly more (or fewer) copies of that allele than unaffected siblings in a dataset, it suggests association between the test marker and the disease.
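The transmission comparison behind TDT can be sketched in a few lines of Python. This is only an illustration, not the software used in this dissertation: it assumes a biallelic marker, genotypes coded as the number of M1 copies (0, 1 or 2), complete trios, and the standard McNemar-type form of the statistic, $(b - c)^2 / (b + c)$, where $b$ and $c$ count heterozygous parents transmitting M1 and M2. The function names `tdt_counts` and `tdt_statistic` are hypothetical.

```python
from math import erfc, sqrt


def tdt_counts(trios):
    """Count transmissions from heterozygous parents.

    Each trio is (father, mother, affected_child), coded as copies of M1.
    Returns (b, c): heterozygous parents transmitting M1 and M2.
    """
    b = c = 0
    for father, mother, child in trios:
        parents = (father, mother)
        n_het = sum(1 for g in parents if g == 1)
        # A homozygous M1/M1 parent always contributes one M1 copy.
        from_hom = sum(1 for g in parents if g == 2)
        m1_from_het = child - from_hom
        if not 0 <= m1_from_het <= n_het:
            raise ValueError("trio is not Mendelian-consistent")
        b += m1_from_het
        c += n_het - m1_from_het
    return b, c


def tdt_statistic(b, c):
    """McNemar-type chi-square (1 df) and its p-value."""
    if b + c == 0:
        raise ValueError("no heterozygous parents, test is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical toy data, for illustration only.
trios = [(1, 0, 1), (1, 2, 2), (1, 1, 2), (1, 1, 1), (0, 2, 1)]
b, c = tdt_counts(trios)
chi2, p_value = tdt_statistic(b, c)
```

Only heterozygous parents contribute, which is why the last toy trio (two homozygous parents) adds nothing to either count.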

For extended pedigrees, the pedigree disequilibrium test (PDT) uses all informative family trios and discordant sibling pairs (DSPs) in each extended pedigree.[11,22] It can use all types of families, but when there are many missing genotypes, PDT may not be able to use all the information and hence may have low power.

Many family-based methods have been proposed to handle missing genotypes directly. In a nuclear family with multiple offspring and missing parental genotypes, it is sometimes possible, depending on the offspring's genotypes, to reconstruct the parental genotypes from them. Based on this, reconstruction-combined TDT (RCTDT) uses reconstructed genotypes to calculate the test statistics conditioned on the event that reconstruction can be done.[23,24] Also, when parental genotypes cannot be reconstructed unambiguously, the distribution of possible parental genotypes can be estimated from the offspring's genotypes, assuming Mendelian inheritance and allele frequencies when there is no linkage between the marker locus and the disease locus. This is done in TRANSMIT, which uses a score test for association.[25] But since it does not consider linkage, the estimates may be biased when there are multiple affected offspring in a nuclear family. Another method, APL, takes into account the linkage between the marker locus and the disease locus in the inference of missing parental genotypes, but it only applies to nuclear families.[26]

FBAT uses another approach to deal with missing parental genotypes: instead of inferring missing genotypes, it constructs the test statistic conditioned on a sufficient statistic for any genetic information about the founders in a family.[27] Although FBAT has many desirable statistical properties, it has been shown to have lower power than some other methods under simulation settings.[28]

It is worth mentioning that although the original TDT and many of the newer family-based methods are robust against population structure, population structure could still cause inflated type I error rates for methods that use population parameters, such as allele or genotype frequencies, in their calculations.

1.6 Organization of this dissertation

This dissertation is organized as follows. Chapter 2 introduces the test statistics of the Monte Carlo pedigree disequilibrium test (MCPDT) and compares MCPDT with a few other methods through a simulation study. Chapter 3 discusses the impact of population structure on MCPDT if the structure is ignored. Chapter 4 proposes a Markov chain Monte Carlo (MCMC) procedure to estimate population structure. Chapter 5 combines MCPDT from Chapter 2 with the MCMC procedure from Chapter 4. Chapter 6 gives a summary and some possible future directions.

CHAPTER 2

MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST

The original motivation for the Monte Carlo pedigree disequilibrium test was a multiple sclerosis dataset collected at The Ohio State University, referred to here as the OSUMS dataset.[29] It contains family genotype data at multiple marker loci on the X chromosome. We wanted to test for association between these X chromosomal markers and multiple sclerosis in this dataset. At that time there were few methods dealing with both X chromosomal markers and general pedigree data. So we proposed a new method which could analyze both autosomal and X chromosomal markers in a family dataset with missing genotypes.

Among the family-based methods mentioned in Chapter 1, PDT[11] is a good candidate to base our method on, since it can analyze any family structure and is easy to implement. A simple modification to PDT made it suitable for X chromosomal markers. Adding another extension yielded MCPDT, which can deal with missing genotypes.

2.1 MCPDT statistics

MCPDT[30] is designed to test associations in general pedigrees with missing genotypes. It is an extension of PDT.[11] As in PDT, for each pedigree, a statistic $D$ is

calculated from all family trios and DSPs. Suppose all markers are SNPs and the genotypes of all individuals in a family are known. For one autosomal marker, denote the two alleles as M1 and M2. For each family trio with an affected offspring, define

$$X_T = (\#\,\text{M1 transmitted to the offspring}) - (\#\,\text{M1 not transmitted to the offspring}).$$

This statistic can be nonzero only if at least one of the parents is heterozygous at the marker locus. For each DSP, define

$$X_S = (\#\,\text{M1 in affected sibling}) - (\#\,\text{M1 in unaffected sibling}).$$

This statistic is nonzero only if the two siblings have different genotypes at the marker locus. Under the null hypothesis of no association between the marker and the disease, $E(X_T) = 0$ for all family trios since, under Mendelian law, the offspring has an equal chance of inheriting either of the two alleles in each parent. Also $E(X_S) = 0$ for all DSPs. The statistic for the whole pedigree is

$$D = \sum_{j=1}^{n_T} X_{Tj} + \sum_{j=1}^{n_S} X_{Sj},$$

where $n_T$ is the number of family trios with an affected offspring and $n_S$ is the number of DSPs. $E(D) = 0$ under the null hypothesis since every term on the right-hand side of the formula has expectation 0. We also have $\mathrm{Var}(D) = E((D - E(D))^2) = E(D^2)$. For a total of $N$ pedigrees in a dataset, define

$$T = \frac{\sum_{i=1}^{N} D_i}{\sqrt{\sum_{i=1}^{N} D_i^2}}$$

where $D_i$ is the $D$ statistic from the $i$th pedigree. Assuming all pedigrees are independent, $T$ asymptotically follows a standard normal distribution by the central limit theorem. PDT is a two-sided test using the statistic $T$ and the Normal(0,1) distribution, since it is not assumed beforehand which allele is more likely to be associated with the disease. Note that if all the statistics are calculated using the other allele (M2) at the SNP locus, $T$ will have the same absolute value with the opposite sign, since all $X_T$'s and $X_S$'s will have opposite signs. Hence PDT will give the same result for a SNP marker regardless of which allele is designated as M1.

When applying this method to marker loci on the X chromosome, only transmissions from mothers need to be considered. This is because each father has only one X chromosome and transmits that X chromosome to all of his daughters and none of his sons, hence there is no uncertainty about the transmissions. Also, only DSPs consisting of siblings of the same sex are used. This modification yielded XPDT.[30]

When genotypes of some individuals in a pedigree are missing, PDT does not use family trios and DSPs with missing genotypes, which causes a loss of information. MCPDT was proposed to handle missing genotypes in the following way. Define a new statistic

$$D_{MC} = \sum_{G_m \in \mathcal{G}_m} D(G_m, G_o, A) \Pr(G_m \mid G_o)$$

where $G_o$ and $G_m$ denote observed and missing genotypes, respectively, $\mathcal{G}_m$ denotes the set of all possible genotypes for individuals with missing genotypes and $A$ denotes the observed phenotypes of the disease, i.e. the affection statuses of all non-founders. Here $D$ is the same statistic defined above, either for an autosomal or an X chromosomal marker, written here as a function to show explicitly the information used in the calculation.
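As a rough illustration of the complete-data PDT statistic defined above (not of MCPDT itself), the following Python sketch computes $X_T$, $X_S$, $D$ and $T$ for an autosomal SNP. It assumes genotypes are coded as the number of M1 copies (0, 1 or 2), so that transmitted minus non-transmitted M1 alleles in a trio equals twice the child's count minus the parents' total. The function names and the toy pedigrees are hypothetical.

```python
from math import erfc, sqrt


def x_trio(father, mother, child):
    """X_T: M1 transmitted minus M1 not transmitted to an affected child."""
    return 2 * child - (father + mother)


def x_dsp(affected, unaffected):
    """X_S: M1 copies in the affected minus the unaffected sibling."""
    return affected - unaffected


def pedigree_d(trios, dsps):
    """D for one pedigree: sum of X_T over trios plus X_S over DSPs."""
    return sum(x_trio(*t) for t in trios) + sum(x_dsp(*s) for s in dsps)


def pdt_statistic(d_values):
    """T = sum(D_i) / sqrt(sum(D_i^2)) and its two-sided normal p-value."""
    denom = sqrt(sum(d * d for d in d_values))
    if denom == 0:
        raise ValueError("all D_i are zero, T is undefined")
    t = sum(d_values) / denom
    p_value = erfc(abs(t) / sqrt(2))  # equals 2 * (1 - Phi(|t|))
    return t, p_value


# Hypothetical toy pedigrees: (list of trios, list of DSPs).
pedigrees = [
    ([(1, 1, 2)], [(2, 1)]),
    ([(1, 0, 0)], []),
    ([(1, 2, 2)], [(2, 1)]),
]
d_values = [pedigree_d(trios, dsps) for trios, dsps in pedigrees]
t, p_value = pdt_statistic(d_values)
```

Recoding every genotype `g` as `2 - g` (that is, counting M2 instead of M1) flips the sign of each `D_i` and therefore of `t`, while the two-sided p-value is unchanged, matching the remark above about the choice of M1.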

Note that the definition of $D_{MC}$ is not the same as the conditional expectation of $D$ given $G_o$ and $A$, which is

$$E(D \mid G_o, A) = \sum_{G_m \in \mathcal{G}_m} D(G_m, G_o, A) \Pr(G_m \mid G_o, A).$$

Since $A$ is not included in the conditional distribution of $G_m$ in the definition of $D_{MC}$, there is no need to estimate a disease model or other disease-related parameters when calculating $D_{MC}$. But because of this omission, for fixed $A$, the conditional expectation of $D_{MC}$ is

$$E(D_{MC} \mid A) = \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} D(G_m, G_o, A) \Pr(G_m \mid G_o) \Pr(G_o \mid A),$$

where $\mathcal{G}_o$ denotes the set of all possible genotypes for individuals with observed genotypes. This expectation does not equal the conditional expectation of $D$ given $A$, which is

$$E(D \mid A) = \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} D(G_m, G_o, A) \Pr(G_m \mid G_o, A) \Pr(G_o \mid A)$$

and equals 0 under the null hypothesis of no association between the marker and the disease. If the affection status patterns in a dataset are the same among all pedigrees, e.g. the data collection process requires all pedigrees to have two children who are both affected, $D_{MC}$ may be biased and have a non-zero expectation. On the other hand, if we can assume all pedigrees are drawn from an underlying population of pedigrees with at least one affected offspring, the affection status pattern $A$ can also be treated as random and the expectation of $D_{MC}$ can be taken over both $G_o$ and $A$. This expectation is equal to 0, as shown in the next section.

Depending on the pedigree structure and missing genotype pattern, it may be difficult to calculate $D_{MC}$ directly. So in MCPDT, $D_{MC}$ is estimated by taking the

average over a set of Monte Carlo samples, i.e.

$$D_{MC} \approx \frac{1}{K} \sum_{k=1}^{K} D(G_{mk}, G_o, A),$$

where $G_{mk}$, $k = 1, \dots, K$, are independent samples from $\Pr(G_m \mid G_o)$. For multiple pedigrees, $T_{MC}$ is calculated using the $D_{MC}$ statistics from all pedigrees as before. MCPDT is based on this $T_{MC}$ statistic and its asymptotic standard normal distribution.

2.2 Expectation of $D_{MC}$

To show that $E(D_{MC}) = 0$, we first note that for a pedigree with complete genotypes, the $D$ statistic can be written as a weighted sum of contributions from all offspring, with the weights depending on the affection status pattern. Suppose there are $N$ offspring in the pedigree. For the $j$th offspring and his/her parents, regardless of whether the $j$th offspring is affected or not, define

$$X_j = (\#\,\text{M1 transmitted to the offspring}) - (\#\,\text{M1 not transmitted to the offspring}),$$

which is the same as the definition of $X_T$ in the previous section. Equivalently,

$$X_j = 2\,(\#\,\text{M1 in the } j\text{th offspring}) - (\#\,\text{M1 in the parents}).$$

It is apparent that the $X_j$'s are functions of the genotypes of the trio only. Then for the whole pedigree, we can write

$$D(G_m, G_o, A) = \sum_{j=1}^{N} C_j(A)\, X_j(G)$$

where $G = (G_m, G_o)$ and $C_j$ is a coefficient for the $j$th individual depending only on $A$. Specifically, $C_j(A)$ can be decomposed into the sum of coefficients from family trios and DSPs.
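The Monte Carlo approximation of $D_{MC}$ can be made concrete for the smallest possible case. The sketch below is an illustration under strong simplifying assumptions, not the sampler used in this work: a single trio with an affected child, the father ungenotyped, an autosomal SNP, and Hardy-Weinberg proportions with a known M1 frequency `p` for the missing founder. Under these assumptions $\Pr(G_m \mid G_o)$ can be enumerated exactly, so the Monte Carlo average can be compared with the exact sum. All function names are hypothetical.

```python
import random


def hwe_prior(p):
    """Founder genotype probabilities (copies of M1) under Hardy-Weinberg."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}


def child_prob(child, father, mother):
    """Mendelian probability of the child's genotype given both parents."""
    tf, tm = father / 2, mother / 2  # chance each parent transmits M1
    return {0: (1 - tf) * (1 - tm),
            1: tf * (1 - tm) + (1 - tf) * tm,
            2: tf * tm}[child]


def father_posterior(mother, child, p):
    """Pr(G_m | G_o): missing father given observed mother and child."""
    prior = hwe_prior(p)
    weights = {g: prior[g] * child_prob(child, g, mother) for g in (0, 1, 2)}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observed genotypes are not Mendelian-consistent")
    return {g: w / total for g, w in weights.items()}


def d_trio(father, mother, child):
    """D for one trio with an affected child (equals X_T)."""
    return 2 * child - (father + mother)


def d_mc_exact(mother, child, p):
    """Exact D_MC: sum of D over missing genotypes, weighted by Pr(G_m | G_o)."""
    post = father_posterior(mother, child, p)
    return sum(d_trio(g, mother, child) * pr for g, pr in post.items())


def d_mc_estimate(mother, child, p, n_samples, rng):
    """Monte Carlo D_MC: average of D over draws from Pr(G_m | G_o)."""
    post = father_posterior(mother, child, p)
    genotypes = list(post)
    draws = rng.choices(genotypes, weights=[post[g] for g in genotypes],
                        k=n_samples)
    return sum(d_trio(g, mother, child) for g in draws) / n_samples


exact = d_mc_exact(mother=1, child=2, p=0.3)
approx = d_mc_estimate(mother=1, child=2, p=0.3, n_samples=20000,
                       rng=random.Random(0))
```

In this toy case the sampling distribution conditions only on the observed genotypes, not on the affection status, in line with the definition of $D_{MC}$. For real pedigrees the conditional distribution generally cannot be enumerated this way, which is why dedicated pedigree simulation software is needed.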

Denote the set of offspring who are siblings of the $j$th offspring as $S_j$. If the $j$th offspring is affected, his/her contribution through the $X_T$ statistic defined previously is the same as $X_j$; otherwise it is 0. If the $j$th offspring is affected and the $k$th offspring is one of his/her unaffected siblings, then for this DSP the $X_S$ statistic defined previously is

$$X_S = (\#\,\text{M1 in the } j\text{th offspring}) - (\#\,\text{M1 in the } k\text{th offspring}) = \tfrac{1}{2}(X_j - X_k).$$

Conversely, if the $j$th offspring is unaffected and the $k$th offspring is affected, we have $X_S = \tfrac{1}{2}(X_k - X_j)$. Combining all these possibilities, we have

$$C_j(A) = I\{j\text{th offspring is affected}\} + \frac{1}{2} \sum_{k \in S_j} I\{j\text{th offspring is affected, } k\text{th offspring is unaffected}\} - \frac{1}{2} \sum_{k \in S_j} I\{j\text{th offspring is unaffected, } k\text{th offspring is affected}\},$$

where $I$ is the indicator function, which equals 1 if the event is true and 0 otherwise.

For $D_{MC}$ of the whole pedigree, let $D_{MCj}$ denote the contribution from the $j$th offspring. Then

$$\begin{aligned}
E(D_{MCj}) &= \sum_{A \in \mathcal{A}} \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} D_j(G_m, G_o, A) \Pr(G_m \mid G_o) \Pr(G_o \mid A) \Pr(A) \\
&= \sum_{A \in \mathcal{A}} \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} C_j(A)\, X_j(G_m, G_o) \Pr(G_m \mid G_o) \Pr(A \mid G_o) \Pr(G_o) \\
&= \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} X_j(G_m, G_o) \Pr(G_m \mid G_o) \Pr(G_o) \sum_{A \in \mathcal{A}} C_j(A) \Pr(A \mid G_o),
\end{aligned}$$

where $\mathcal{A}$ is the set of all possible affection status patterns. Note that, with our specification of the $C_j$'s, we have

$$\begin{aligned}
\sum_{A \in \mathcal{A}} C_j(A) \Pr(A \mid G_o) = {} & \Pr(j\text{th offspring is affected} \mid G_o) \\
& + \frac{1}{2} \sum_{k \in S_j} \Pr(j\text{th offspring is affected, } k\text{th offspring is unaffected} \mid G_o) \\
& - \frac{1}{2} \sum_{k \in S_j} \Pr(j\text{th offspring is unaffected, } k\text{th offspring is affected} \mid G_o).
\end{aligned}$$

Under the null hypothesis of no association,

$$\Pr(j\text{th offspring is affected, } k\text{th offspring is unaffected} \mid G_o) = \Pr(j\text{th offspring is unaffected, } k\text{th offspring is affected} \mid G_o)$$

for all $j, k = 1, \dots, N$, $j \neq k$. Thus

$$\sum_{A \in \mathcal{A}} C_j(A) \Pr(A \mid G_o) = \Pr(j\text{th offspring is affected} \mid G_o) = \Pr(j\text{th offspring is affected}) \; (= a_j).$$

The second equality is true under the null hypothesis, and $a_j$ is independent of the genotypes at the test marker. Consequently,

$$E(D_{MCj}) = \sum_{G_o \in \mathcal{G}_o} \sum_{G_m \in \mathcal{G}_m} X_j(G_m, G_o) \Pr(G_m \mid G_o) \Pr(G_o)\, a_j = a_j E(X_j) = 0,$$

where $E(X_j) = 0$ under Mendelian law, i.e. the offspring has an equal chance of getting either of each parent's two alleles at the test marker locus. Finally we have

$$E(D_{MC}) = \sum_{j=1}^{N} E(D_{MCj}) = 0.$$

Using the same method, it is straightforward to show that $E(D_{MC}) = 0$ also holds for marker loci on the X chromosome.

2.3 Simulation study

To evaluate the performance of MCPDT compared to some other methods, a simulation study was done based on the pedigree structure and missing data pattern of the OSUMS dataset.[29] After the removal of pedigrees with no genotyped offspring, 81 pedigrees were kept in the simulation, which included both nuclear families and extended pedigrees. The total number of individuals used was 386, with 102 having missing genotypes. The simulation was done for marker loci on the X chromosome. We wanted to compare XMCPDT (the X chromosome version of MCPDT) to other family-based association test methods applicable to markers on the X chromosome, including

XPDT [30], XRCTDT [31] and XAPL [32], which are X chromosome versions of PDT [11,22], RCTDT [23,24] and APL [26], respectively.

Two X chromosomal SNP markers, denoted as B and C, were simulated. B and C had the same allele frequencies and were in complete linkage and linkage equilibrium. The third locus, D, was a two-allele disease locus assumed to be in complete linkage with B and C, in linkage equilibrium with C and in linkage disequilibrium with B. Marker C was used to estimate type I error rates, and marker B was used to estimate power.

Nine different combinations of marker allele frequencies, haplotype frequencies and penetrances were used in the simulation, as shown in Table 2.1. There were three different minor allele frequencies for markers B and C and a single minor allele frequency for disease locus D. The linkage disequilibrium was specified through the frequencies of the four haplotypes formed by the alleles of B and D. The haplotype frequencies were set to achieve the maximal linkage disequilibrium between B and D permitted by their allele frequencies. The high-risk allele at locus D (D1) was always more strongly associated with the minor allele at locus B (B1). The penetrances in the table are for female individuals with genotypes D1D1, D1D2 and D2D2 at the disease locus. Those for males were set to be the same as for the corresponding homozygous females. There were three different disease models with different penetrances. The nine settings were ordered so that, based on preliminary simulations, tests for association at marker locus B were expected to have increasing power from setting 1 to setting 9.

[...] replicates were generated under each of the nine settings. 100 Monte Carlo samples of missing genotypes were generated for each pedigree in each replicate using

software package SLINK [33,34]. Either the actual allele frequencies used in the simulations or those estimated from all genotyped founders in each replicate were used in the Monte Carlo sampling. XPDT, XRCTDT, XMCPDT and XAPL were applied to the same simulated data to compare their results. The nominal level of significance was set to 0.05 for all tests.

Table 2.1: Simulation settings for XMCPDT. Columns: setting; allele frequencies (B1, C1; D1); haplotype frequencies (D1B1, D2B1, D1B2, D2B2); penetrances (D1D1, D1D2, D2D2).

The type I error rates are shown in Table 2.2. XRCTDT appeared to be conservative, with all of its type I error rates below the nominal level of 0.05 and most of those below [...]. In contrast, the type I error rates of XAPL were mostly above the nominal level, suggesting that XAPL might be anti-conservative under these simulation settings. Since APL is designed for nuclear families, it may produce inflated type I error rates when applied to extended pedigrees, as in this simulation study. XMCPDT with either true allele frequencies (XMCPDT$_T$) or estimated allele frequencies (XMCPDT$_E$) had type I error rates around the nominal level.
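As a rough illustration of how empirical rates such as those in Table 2.2 can be read against the nominal level, the sketch below computes a rejection rate from per-replicate p-values together with the binomial Monte Carlo standard error of that rate. The function names, the replicate count and the p-values are hypothetical and are not taken from the study.

```python
import math
import random


def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_replicates):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)


def within_two_se(rate, n_replicates, alpha=0.05):
    """True if the estimate lies within two standard errors of alpha,
    with the standard error evaluated at the nominal level."""
    return abs(rate - alpha) <= 2.0 * monte_carlo_se(alpha, n_replicates)


# Hypothetical example: under the null, p-values of a valid test are
# roughly uniform on (0, 1). The replicate count here is an assumption.
rng = random.Random(0)
p_null = [rng.random() for _ in range(2000)]
rate = rejection_rate(p_null)
print(rate, monte_carlo_se(rate, len(p_null)), within_two_se(rate, len(p_null)))
```

Applied to marker C this kind of summary estimates the type I error rate; applied to marker B the same rejection rate estimates power.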

Table 2.2: Type I error rates of XPDT, XRCTDT, XAPL and XMCPDT. Columns: setting; XRCTDT; XAPL; XPDT; XMCPDT$_E$; XMCPDT$_T$.

The power is shown in Figure 2.1. Under almost all settings, XRCTDT had the lowest power while XMCPDT$_T$ had the highest power. XMCPDT$_E$ had power close to that of XAPL. These two methods both had higher power than XPDT and XRCTDT under all settings. Comparing XMCPDT to XPDT, the increases in power ranged from 9% to 30% when using estimated allele frequencies and from 19% to 45% when using true allele frequencies. So simulating missing genotypes did produce a substantial increase in power.

2.4 Bias in $D_{MC}$

As mentioned before, $D_{MC}$ is not the conditional expectation of $D$ given the observed genotypes and disease affection status pattern for a fixed pedigree structure. So the expectation of $D_{MC}$ given $A$ may not be zero for a fixed affection status pattern $A$. If, in the data collection process, the only requirement for a family to be included is that at least one offspring is affected, $D_{MC}$ does have expectation 0, since $A$ can be treated as random with all possible values. If, in the data collection process, the ascertainment

Figure 2.1: Power of XPDT, XRCTDT, XAPL and XMCPDT (vertical axis: power; horizontal axis: setting; curves: XMCPDT$_T$, XAPL, XMCPDT$_E$, XPDT, XRCTDT)

criterion requires more than that, e.g. that at least two offspring in a family are affected, $D_{MC}$ may not have expectation 0, since there is a constraint on the possible values of $A$.

To investigate the magnitude of the bias in $D_{MC}$ in such situations, we looked at some simple family structures that are frequently used in genetic epidemiological studies. For nuclear families with 1, 2, 3 or 4 offspring, assuming neither parent was genotyped, we calculated the expectations and standard deviations of $D_{MC}$ over all possible genotypes of the offspring given fixed numbers of affected offspring. The results are shown in Table 2.3. For a family trio with one offspring, the bias is 0. For the other three types of families, the largest bias in $D_{MC}$ occurs when all offspring are affected. But even the largest biases are very small, with much larger standard deviations.

Table 2.3: Expectations and standard deviations (in parentheses) of $D_{MC}$. Rows and columns: number of offspring, number of affected offspring. Entries: (0) (1.36) (0.77) (1.92) (2.04) (0.95) (2.43) (2.92) (2.66) (1.20)

To show how these biases would affect the type I error rates of MCPDT, simulations with these family structures were done under the ninth setting in Table 2.1. Three types of datasets were simulated. One included 500 nuclear families with two offspring, both affected. One included 500 nuclear families with three offspring, all affected. The other included 500 nuclear families with four offspring, all affected. [...] replications were simulated for each type of dataset. The nominal level of significance was set to be [...]. The estimated type I error rates were 0.051, [...]

and [...] for datasets with two-offspring, three-offspring and four-offspring families, respectively, while the estimated power was 1 for all three types of datasets. This was expected, since the standard deviations were more than 100 times larger than the expectations for all three types of nuclear families. Similar results should hold for most other types of families. So when the assumption of randomness of affection status is not satisfied in a real dataset, it is likely that the expectation of $D_{MC}$ will still be very close to 0 and that the small bias in $D_{MC}$ will not lead to much inflation in the type I error rates of MCPDT.

2.5 Discussions

MCPDT is an extension of PDT that can handle X chromosomal markers and missing genotypes. It deals with missing genotypes by generating Monte Carlo samples of those missing genotypes conditioned on observed genotypes. Simulation results show that when a dataset includes general pedigrees and a large amount of missing genotypes, MCPDT can substantially increase power over PDT while maintaining adequate type I error rates. Among other methods designed to deal with missing genotypes, APL tends to have large type I error rates with general pedigrees, at least under the simulation settings shown above, possibly because it is designed for nuclear families. RCTDT appeared to be conservative and also had lower power than the other methods.

There are a few possible problems in MCPDT. MCPDT uses the sample allele frequencies in genotyped founders as the estimate of the population allele frequencies. If a dataset contains only a small number of genotyped founders, the frequency estimate may not be reliable and MCPDT may have an inflated type I error rate. The use of

asymptotic normal distribution also requires a large enough sample size. In the simulation study shown above, there were 81 families, 386 individuals and 88 genotyped founders, and the type I error rates for MCPDT with estimated allele frequencies were very close to the nominal level. So an adequate sample size for MCPDT should not be difficult to achieve in practical situations.

Another issue with MCPDT is also related to the fact that population allele frequencies are used in the calculation. When estimating allele frequencies in MCPDT, it is assumed that all pedigrees in a dataset are from the same population, with a single set of frequencies for each marker. In reality, a dataset may contain pedigrees from different subpopulations with different allele frequencies at some test marker loci. The subpopulation origins of these pedigrees may not be easily distinguishable based on available covariates, such as race. In this case, using a single estimate of test marker allele frequencies creates bias in the estimate, which may affect the validity and power of MCPDT. Hence MCPDT may not be robust against population structure, while the original PDT is. However, if we understand how biases in the allele frequency estimates affect MCPDT, we should be able to fix this problem in an appropriate manner, which is the topic of the following chapters.
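The effect of pooling can be illustrated with a small sketch. Assuming, hypothetically, that founders are drawn from several subpopulations in fixed proportions, the expected pooled estimate is the proportion-weighted mean of the subpopulation frequencies, so each subpopulation's true frequency differs from it regardless of how many founders are genotyped. The function names and the numbers below are illustrative only.

```python
def pooled_frequency(freqs, proportions):
    """Expected pooled allele frequency: proportion-weighted mean."""
    total = sum(proportions)
    return sum(f * w for f, w in zip(freqs, proportions)) / total


def frequency_discrepancies(freqs, proportions):
    """True minus pooled frequency for each subpopulation."""
    pooled = pooled_frequency(freqs, proportions)
    return [f_true - pooled for f_true in freqs]


# Hypothetical two-subpopulation example with equal mixing proportions.
subpop_freqs = [0.2, 0.6]
mixing = [0.5, 0.5]
print(pooled_frequency(subpop_freqs, mixing))
print(frequency_discrepancies(subpop_freqs, mixing))
```

In this example the pooled value sits between the two subpopulation frequencies, so one subpopulation's frequency is overestimated and the other's is underestimated by the same amount.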

CHAPTER 3

IMPACT OF POPULATION STRUCTURE ON MCPDT

Since MCPDT needs to use test marker allele frequencies to generate Monte Carlo samples, biases in frequency estimates may lead to biases in MCPDT statistics. If all pedigrees belong to the same population with the same allele frequencies and the number of genotyped founders used to estimate frequencies is not too small, the bias in MCPDT statistics should be very small. But if those pedigrees are from different underlying subpopulations with different allele frequencies at test marker loci, a single set of allele frequencies estimated from all genotyped founders will be biased no matter how large the sample size is. Using the biased frequency estimates in MCPDT may inflate type I error rates and/or decrease test power. To investigate how population structure affects the validity and power of MCPDT, we turn to investigating how biases in allele frequency estimates translate into biases in MCPDT statistics.

3.1 Bias for a family trio

Consider a family trio with an affected child and assume that the two parents in the trio are unrelated. Suppose the test marker is a SNP marker with alleles M1 and M2 and that it is not associated with the disease. Denote the true frequency of allele

M1 as $f_t$ and the estimated frequency as $f_e$. If one parent is ungenotyped, then for all possible combinations of genotypes of the other parent and the child, the probabilities of those combinations and the corresponding exact values of $D_{MC}$ are listed in Table 3.1.

| Genotype of one parent | Genotype of the child | Probability | $D_{MC}$ using $f_e$ |
|---|---|---|---|
| M1M1 | M1M1 | $f_t^3$ | $1 - f_e$ |
| M1M1 | M1M2 | $f_t^2(1 - f_t)$ | $-f_e$ |
| M1M2 | M1M1 | $f_t^2(1 - f_t)$ | $2 - f_e$ |
| M1M2 | M1M2 | $f_t(1 - f_t)$ | $1 - 2f_e$ |
| M1M2 | M2M2 | $f_t(1 - f_t)^2$ | $-1 - f_e$ |
| M2M2 | M1M2 | $f_t(1 - f_t)^2$ | $1 - f_e$ |
| M2M2 | M2M2 | $(1 - f_t)^3$ | $-f_e$ |

Table 3.1: Bias for a family trio with one parent ungenotyped

Using these quantities, it is straightforward to show that

$$E(D_{MC}) = (1 + f_t(1 - f_t))(f_t - f_e).$$

Here the expectation is taken over all possible combinations of observed genotypes. When both parents have missing genotypes, using Table 3.2, we have

$$E(D_{MC}) = 2(f_t - f_e),$$

which is about twice as much as the expectation when only one parent is missing. Based on these results, for a family trio under the null hypothesis, the bias in $D_{MC}$ is proportional to the difference between the true and the estimated allele frequencies. The bias increases when more parents have missing genotypes. For a trio with one parent genotyped, the bias is also related to the true allele frequency: since the factor $1 + f_t(1 - f_t)$ is largest at $f_t = 0.5$, a true allele frequency of 0.5 produces the largest bias when the difference between the true and the estimated allele frequencies is fixed. If it is not assumed that the child is

affected and the expectation is also taken over the affection status, the bias needs to be multiplied by the disease prevalence. But in a real dataset consisting of family trios, only trios with affected children would be included, hence disease prevalence would not play a role in that case.

| Genotype of the child | Probability | $D_{MC}$ using $f_e$ |
|---|---|---|
| M1M1 | $f_t^2$ | $2 - 2f_e$ |
| M1M2 | $2f_t(1 - f_t)$ | $1 - 2f_e$ |
| M2M2 | $(1 - f_t)^2$ | $-2f_e$ |

Table 3.2: Bias for a family trio with both parents ungenotyped

3.2 Bias for 3-generation pedigrees

For more complex family structures, a similar exact calculation would be much more tedious than for family trios. To investigate empirically the effect of biases in allele frequency estimates on MCPDT statistics, a simulation study was done. The two different 3-generation pedigree structures shown in Figure 3.1 were used. These two structures have the same number of founders but different numbers of offspring. Two markers were simulated, with one (marker D) being the disease gene itself and the other (marker M) being a marker in complete linkage and linkage equilibrium with the disease locus. Marker D had two alleles, D1 and D2. Marker M also had two alleles, M1 and M2. Five different simulation settings were used, as shown in Table 3.3. The settings differed in the number of parents with missing genotypes, the disease penetrances, the number of offspring and the allele frequencies ($f_t$). [...] pedigrees with at least one affected offspring were simulated under

each setting. $D_{MC}$ was calculated using allele frequencies ($f_e$) ranging from 0.05 to 0.95 for the first allele (M1 or D1). Then averages were taken over 5000 pedigrees for each setting, each marker and each value of $f_e$.

Table 3.3: Simulation settings for MCPDT. Columns: pedigree; frequencies (M1, D1); penetrances (D1D1, D1D2, D2D2); ungenotyped founders; comment. Settings, in order: baseline (pedigree A), less missingness (pedigree A), lower prevalence (pedigree A), fewer offspring (pedigree B), lower frequency (pedigree A).

Results for marker M are shown in Figure 3.2. Notice that when $f_e = f_t$, the biases did not always have the smallest absolute values among all specified $f_e$ values. This was probably because the numbers of simulated pedigrees were still not large enough. Under all settings, the biases in $D_{MC}$ were essentially linear in the differences between $f_t$ and $f_e$, possibly with slight downward curvature. The slopes were smaller for the settings with fewer founders missing genotypes and with lower disease prevalence. These were consistent with the results for family trios. But the slope for the setting with 50% lower prevalence was larger than one half of the slope under the baseline setting. The reason is that in the simulations each pedigree was required to have at least one affected offspring. Hence the actual prevalences in the simulated data under these two settings were much closer. The slopes were similar for the two different $f_t$'s used in the simulation, suggesting that the effect of $f_t$ was small. Also, the slope was smaller for the setting with fewer offspring. This is because, for those founders with missing genotypes, fewer offspring means fewer

Figure 3.1: Two pedigree structures (Pedigree A and Pedigree B)

affected offspring, which decreases the bias in $D_{MC}$ for a pedigree, since a nuclear family without any affected offspring does not contribute to $D_{MC}$.

Figure 3.2: $D_{MC}$ for marker M (vertical axis: $D_{MC}$; horizontal axis: $f_e$; curves: baseline, less missingness, lower prevalence, fewer offspring, lower frequency)

Similar results can be found in Figure 3.3 for marker D, which was the underlying disease gene. Additionally, the lines for the two loci were essentially parallel under each setting in which the two markers had the same allele frequencies. This suggests that biases in allele frequency estimates have a similar effect for a marker associated with the disease.
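The closed-form trio results of Section 3.1, with which these simulation results are compared, can be checked numerically. The sketch below encodes the rows of Tables 3.1 and 3.2 as (probability, $D_{MC}$) pairs, with the minus signs written out explicitly as the derivation requires (an assumption about the layout of the original tables), and compares the resulting expectations with $(1 + f_t(1 - f_t))(f_t - f_e)$ and $2(f_t - f_e)$. The function names and the grid of frequencies are illustrative.

```python
def trio_one_parent_missing(f_t, f_e):
    """Rows of Table 3.1 as (probability, D_MC) pairs."""
    p, q = f_t, 1.0 - f_t
    return [
        (p**3,     1.0 - f_e),        # parent M1M1, child M1M1
        (p**2 * q, -f_e),             # parent M1M1, child M1M2
        (p**2 * q, 2.0 - f_e),        # parent M1M2, child M1M1
        (p * q,    1.0 - 2.0 * f_e),  # parent M1M2, child M1M2
        (p * q**2, -1.0 - f_e),       # parent M1M2, child M2M2
        (p * q**2, 1.0 - f_e),        # parent M2M2, child M1M2
        (q**3,     -f_e),             # parent M2M2, child M2M2
    ]


def trio_both_parents_missing(f_t, f_e):
    """Rows of Table 3.2 as (probability, D_MC) pairs."""
    p, q = f_t, 1.0 - f_t
    return [
        (p**2,        2.0 - 2.0 * f_e),  # child M1M1
        (2.0 * p * q, 1.0 - 2.0 * f_e),  # child M1M2
        (q**2,        -2.0 * f_e),       # child M2M2
    ]


def expectation(rows):
    """Probability-weighted mean of D_MC over the table rows."""
    return sum(prob * value for prob, value in rows)


def max_abs_error(grid):
    """Largest gap between the table-based and closed-form expectations."""
    worst = 0.0
    for f_t in grid:
        for f_e in grid:
            one = expectation(trio_one_parent_missing(f_t, f_e))
            both = expectation(trio_both_parents_missing(f_t, f_e))
            worst = max(worst,
                        abs(one - (1.0 + f_t * (1.0 - f_t)) * (f_t - f_e)),
                        abs(both - 2.0 * (f_t - f_e)))
    return worst


grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
print(max_abs_error(grid))
```

Both expectations vanish when $f_e = f_t$ and are linear in $f_t - f_e$, which is the trio behaviour that the near-linear curves for the 3-generation pedigrees are being compared against.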