ABSTRACT

DICKSON, SAMUEL PRICE. Improving Discovery of Causal Variants in Genetic Association Studies. (Under the direction of committee, Dr. Greg Gibson and Dr. Marie Davidian).

In recent years population-based association studies have been advocated as the most powerful method of discovering genetic loci that are associated with heritable traits, particularly for complex traits that are likely caused by a variety of factors including environmental effects and multiple genetic loci. Genome-wide association studies (GWAS) have already yielded a large number of such associations, but there is growing concern that the results of these studies are not explaining as much genetic variation as they were expected to. Chapter 2 discusses tagging and imputation to leverage the information available on commercial genotyping chips to make inferences about variants found in large reference samples such as those made available by the International HapMap Consortium. Transferability of multi-marker tagging is assessed. Tagging and imputation are compared, and a method of using tagging to select a reduced tag set to be used for imputation is proposed. Chapter 3 details how multiple low-frequency causal variants can create synthetic associations among more common variants and may be responsible for many of the genome-wide associations that have already been observed. Examples of synthetic associations are demonstrated in congenital deafness and sickle-cell anemia. Chapter 4 examines issues related to combining samples of diverse genetic ancestry for analysis in genetic association studies. Through simulation it is shown that type I error can be controlled and power increased using statistical methods to account for differences in populations.

Improving Discovery of Causal Variants in Genetic Association Studies
by Samuel P. Dickson

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Bioinformatics and Statistics
Raleigh, North Carolina 2 July 2009

APPROVED BY:
Greg Gibson, Committee Chair
Marie Davidian, Committee Co-chair
David Bird
Matthew R. Nelson
Jung-Ying Tzeng

DEDICATION

To my beautiful wife and my children

BIOGRAPHY

Sam Dickson was born in Bakersfield, California in He lived in Lake Forest, California for his entire childhood. From 1998 to 2000 he served a mission for the Church of Jesus Christ of Latter-day Saints in Santiago, Chile. Sam married his beautiful wife, Lindsay, in He graduated from Brigham Young University in 2004 with a B.S. in Biostatistics and a minor in Mathematics just months after the birth of his first son. That same year he began work on his PhD in Bioinformatics and Statistics at North Carolina State University. While finishing his doctorate he has completed a summer internship at Pfizer, enjoyed working with Matt Nelson in the Statistical Genetics Development group at GlaxoSmithKline through a graduate industrial traineeship, and has worked at the Center for Human Genome Variation in the Institute for Genome Sciences and Policy at Duke University under David Goldstein. During this time he and his wife added another son to their family and were expecting their first daughter at the time this was written. Sam's interests include population and statistical genetics, pharmacogenetics, and Bayesian statistics. He does not particularly enjoy mandatory Biography sections or writing about himself in the third person.

ACKNOWLEDGMENTS

Completing my doctorate may not have been possible at all without the guidance and extreme patience of Matt Nelson. I enjoyed working with him for several years at GlaxoSmithKline. He may have enjoyed some of that time too. He is responsible for keeping me going when it seemed like I might never finish. I gratefully express my appreciation to North Carolina State University and everyone on the committee that decided to accept me into the program and to all the professors who have taught me along the way, especially my committee members: Greg Gibson, Marie Davidian, David Bird, and Jung-Ying Tzeng. I owe my gratitude to David Goldstein for believing in me and my abilities and for providing training and resources that were essential in completing my research. Most of all I thank my wife and children. They have sacrificed a lot of time with me so that I could finish school. My wife is my best motivator and has made me a much better person than I ever could have been without her. Thank you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
    REFERENCES
COMPARISON OF TAGGING AND IMPUTATION TO INFER GENOTYPES
    ABSTRACT
    INTRODUCTION
    METHODS
        Subjects and genotype data
        Tag selection and imputation
        LD and prediction accuracy
        Panel design experiment
    RESULTS
    DISCUSSION
    REFERENCES
RARE VARIANTS CREATE SYNTHETIC GENOME-WIDE ASSOCIATIONS
    ABSTRACT
    INTRODUCTION
    RESULTS
    DISCUSSION
    METHODS
    REFERENCES
IMPACT OF SAMPLE DIVERSITY ON TYPE I AND TYPE II ERROR IN GENETIC ASSOCIATION STUDIES
    ABSTRACT
    INTRODUCTION
    METHODS
        Simulation study
        Statistical tests
        Recursive partitioning
    RESULTS
    DISCUSSION
    REFERENCES
CONCLUSION
    REFERENCES

LIST OF TABLES

TABLE 2.1 COVERAGE OF STUDY SAMPLES BY THE AFFYMETRIX 500K PANEL ON ALL CHROMOSOME 21 HAPMAP3 SNPS
TABLE 3.1 LIST OF VARIANTS IN RECENT GENOME-WIDE ASSOCIATION STUDIES SHOWING EVIDENCE OF A DIFFERENCE IN EFFECT BETWEEN POPULATIONS
TABLE 4.1 OVERVIEW OF TYPE I ERROR AND POWER

LIST OF FIGURES

FIGURE 2.1 EXPECTED GENOTYPE PREDICTION ACCURACY BY ALLELE FREQUENCY
FIGURE 2.2 COMPARISON OF PREDICTION ACCURACY, ADJUSTED ACCURACY, AND r² BY MINOR ALLELE FREQUENCY
FIGURE 2.3 DISTRIBUTION OF r² BY NUMBER OF SNPS IN THE TAG
FIGURE 2.4 CHANGE IN r² FROM THE REFERENCE SAMPLE
FIGURE 2.5 ADJUSTED ACCURACY
FIGURE 2.6 AVERAGE ADJUSTED ACCURACY FOR REDUCED TAG SETS
FIGURE 3.1 EXAMPLE GENEALOGIES SHOWING CAUSAL VARIANTS AND THE BEST COMMON ASSOCIATION
FIGURE 3.2 THE PROPORTION OF SIMULATIONS WITH A VARIANT OF GENOME-WIDE SIGNIFICANCE
FIGURE 3.3 THE PROPORTION OF SIMULATIONS WITH A VARIANT OF GENOME-WIDE SIGNIFICANCE SEPARATED BY DISEASE CLASS
FIGURE 3.4 MEAN AND VARIANCE OF LD BETWEEN RARE AND COMMON SITES AS A FUNCTION OF RATE OF RECOMBINATION
FIGURE 3.5 ALLELE FREQUENCY DISTRIBUTIONS OF ALL HAPMAP SNPS, ILLUMINA 1M SNPS, AND GWAS ASSOCIATIONS IN CEU, AND SIMULATED SYNTHETIC ASSOCIATIONS
FIGURE 3.6 SIMULATED MANHATTAN PLOTS IN A 10 MB REGION
FIGURE 3.7 THE 2.5 MB GENOMIC REGION ON CHR11P15.4 CONTAINING 179 GENOME-WIDE SIGNIFICANT SYNTHETIC ASSOCIATIONS WITH SICKLE CELL ANEMIA IN AFRICAN AMERICANS
FIGURE 3.8 OVERVIEW OF THE GJB2/GJB6 LOCUS ON 13Q12.11 IN THE DEAFNESS GWAS
FIGURE 4.1 ALLELE FREQUENCY DIFFERENCES BETWEEN UTAH RESIDENTS WITH NORTHERN AND WESTERN EUROPEAN ANCESTRY FROM THE CEPH COLLECTION (CEU) AND EACH OTHER HAPMAP3 SAMPLE
FIGURE 4.2 TYPE I ERROR FOR ALL SIMULATIONS
FIGURE 4.3 ADJUSTED POWER BY PENETRANCE
FIGURE 4.4 POWER BY RISK ALLELE FREQUENCY
FIGURE 4.5 EFFECT OF
FIGURE 4.6 TYPE I ERROR AND POWER WEIGHTED BY ALLELE FREQUENCIES IN THE CATALOG OF GENOME-WIDE ASSOCIATIONS

Chapter 1
Introduction

We are driven to understand all aspects of life so that we may attempt to improve it. The study of human genetics is much more than fascination with identifying which maternal and paternal traits are present in their offspring. The study of genetics has helped increase the food supply, control the spread of disease, and improve treatment for disease. Such advancements are reliant on our ability to identify the genetic components of observed traits. Modern genetics was formed in the image of Gregor Mendel's experiments with pea plants, where observed traits were isolated and inheritance of these traits was observed in subsequent generations. Archibald Garrod brought this concept into the realm of human disease in 1902 when he noticed that alkaptonuria behaved in a Mendelian fashion [1]. Since then thousands of Mendelian traits have been identified in humans [2]. While it has been understood since the beginning of the twentieth century that Mendel's laws underlie the transmission of heritable traits in humans, the tools necessary to begin mapping traits to specific genes were not developed beyond rudimentary methods until the 1980s [3], when the concept of DNA sequence polymorphisms as genetic markers to map the location of genes was introduced [4]. Initially, linkage studies were the primary mapping tool. Linkage studies have been used to identify disease susceptibility loci in humans since the mid-1970s [5, 6]. A linkage analysis tracks inheritance through family-based data and attempts to discover chromosomal segments that are shared in affected siblings. The requirement for affected siblings is a major limitation of linkage analyses and prevents finding variants responsible for a genetic trait when the genotypic relative risk (GRR) is 2 or less [7].

The transmission disequilibrium test (TDT), introduced in human genetic analysis in 1993, has been advocated as a way of relaxing the requirement for affected siblings in favor of trio data (two unaffected parents with their affected offspring). TDT tests genetic markers for linkage with disease susceptibility loci in the presence of association [8]. Family-based studies have been effective at identifying regions containing causal variants for many rare diseases caused by a single gene, but have been less effective at identifying genetic loci for more common complex diseases [9-11]. Association studies, where a group of affected cases is compared to unaffected controls in order to detect association between the trait and specific genetic loci, were effectively used to identify associations in regions identified by linkage analysis or in candidate gene studies, where association was tested only for variants in specific genetic loci thought a priori to be possible candidates to influence the trait. Prior to the 1990s association studies had been used to identify genes accounting for some of the variation in peptic ulcers, autoimmune and infectious diseases, and Alzheimer's disease [3]. Until recently, a major obstacle in isolating genetic variants responsible for heritable traits was our inadequate knowledge of the map of the human genome. The Human Genome Project was initiated in 1990 to bridge this gap [12]. Through the efforts of this project most of the human genome has been mapped and technologies have been developed to assay polymorphisms throughout. The common disease-common variant hypothesis, which states that the heritable component of complex common diseases is the result of multiple common variants with low to moderate penetrance, led to the push to

create a database containing as many known common single nucleotide polymorphisms (SNPs) as possible in multiple populations [7, 13, 14]. Late 2005 saw the publication of what has been called a significant advance in biomedical research, when researchers announced the completion of the first phase of the International HapMap project and the public availability of genotype data for 269 individuals in multiple populations for about one million SNPs. The International HapMap Consortium also published their analysis of these data, including information about allele frequencies, linkage disequilibrium (LD), recombination and recombination hotspots, common haplotypes, and even an exploratory investigation of selection. Most importantly, they touted that their data would enable tagging, the concept of using LD to obtain information about untyped SNPs in the genome from the genotypes of a smaller collection of tag SNPs [15]. Since 2005, data have been made available for millions of additional SNPs. The International HapMap project took three years and approximately $140 million to complete [16]. There were many criticisms and concerns at the onset of HapMap: patterns of LD would vary too much between populations, variants overlooked in HapMap samples could not be represented, and tag SNPs would not be transferable from the reference samples to independent samples. These concerns have been mostly overcome [17]. Many new methods have been developed to make HapMap data more useful. Tagging has been extended by a variety of means to glean as much information as possible from the data. One of the main applications of tagging is to provide reassurance that a set of

SNPs is getting proper coverage of the genome. That is, a fixed panel is analyzed using HapMap to determine what proportion of known SNPs are in high LD with at least one marker on the panel. Associations found in genome-wide association studies lead investigators to a region. Investigators will research any known function of associated SNPs to determine if there is a plausible explanation why the associated SNP could affect the trait being studied, and the same will be done with any SNP in high LD with the associated SNPs that is not included in the genotyping panel. Depending on what they discover and the resources available, investigators may then resequence the regions surrounding the associations in search of polymorphisms that have stronger associations or that lie in more meaningful regions, such as in genes essential to the studied trait. Another use of tagging, proposed by de Bakker et al. [18] and extended by Pe'er et al. [19], is to use 2- and 3-marker haplotypes to tag untyped SNPs that cannot be tagged by a single SNP on a fixed panel. These methods rely on HapMap data to determine the LD that is used to choose these multi-marker tags, and they have been shown to increase the information available in a fixed panel such as the Affymetrix 500K GeneChip in independent samples. Another possible use for tag SNPs is that the number of SNPs tagged by a tag SNP can be used as prior information in a Bayesian analysis, following the logic that a tag SNP that tags a greater number of untyped SNPs should a priori be more likely to be associated with the outcome of interest [19]. More recently methods have been developed using all available SNPs in HapMap to help impute untyped genotypes. These methods represent a vast improvement in information

available for genome-wide association studies, as millions of additional genotypes can be imputed relatively accurately from genotyping panels that represent only a fraction of the SNPs available in HapMap [20]. The mapping of traits to the human genome experienced a significant boost with the advent of large commercial genotyping panels. Companies like Affymetrix and Illumina have made it easier to assay markers throughout the genome. These so-called high-throughput genome-wide genotyping panels have been steadily increasing in size. Affymetrix released their GeneChip Human Mapping 10K Array Set in 2003 (11,555 SNPs), their GeneChip Human Mapping 100K Array Set in 2004 (116,204 SNPs), their GeneChip Human Mapping 500K Array Set in 2005 (500,568 SNPs), and their Genome-Wide Human SNP Array 6.0 in 2007 (906,600 SNPs). Likewise, Illumina offered genotyping arrays containing 109,000 SNPs (2005), 318,000 SNPs (2006), 555,000 SNPs (2006), and over 1 million SNPs (2007). One of the large beneficiaries of HapMap has been Illumina, which has used the data collected by HapMap to select which tags will be included on its genotyping panels to assure the most complete coverage possible. This philosophy can make genotyping much more efficient. These advances have finally made possible genome-wide association studies (GWAS), representing a major step forward towards the goal of mapping the major genetic components of heritable traits. The success of candidate gene studies depends on correctly identifying all possible regions that can influence a heritable trait. With GWAS, identifying the correct genetic regions beforehand is not necessary. It's possible

to identify genetic variants in the vast intergenic regions that may be neglected otherwise, assuming that such a variant is included in the genome-wide panel or can be accurately inferred from variants that are. The first major success from a GWAS was announced shortly after the completion of the first phase of HapMap in 2005. Three independent research groups reported simultaneously on an association found between complement factor H (CFH) and adult-onset macular degeneration (AMD). One group identified the association between CFH and AMD by using the Affymetrix 100K array set on 96 cases and 50 controls who self-identified as white, not of Hispanic origin, then followed up by more densely genotyping around the highest association, which was in the 1q31 region [21]. Another group used a previous linkage analysis to narrow the search space to a region on chromosome 1, then genotyped additional SNPs in 224 cases and 134 controls to identify AMD-associated variants in CFH [22], while the third group performed additional linkage analyses in the 1q31 region followed by additional genotyping of SNPs in 182 families and 495 cases and 185 controls [23]. While two of the three reports used linkage analyses to identify the candidate region, the report by Klein et al. is generally heralded as the first major success of a GWAS and demonstrates that GWAS can identify such associations without requiring family data. Additionally, Klein et al. were able to use HapMap data to narrow the search to a reasonable area for additional genotyping, and they provided further biological support showing that variations in CFH could contribute to AMD [21].

Another example of the type of success that could be achieved through GWAS comes from a group of researchers that applied an imputation method as part of their investigation into the genetic components of type 2 diabetes (T2D). Over 1,000 cases and 1,000 controls were genotyped from the Finland-United States Investigation of Non-Insulin-Dependent Diabetes Mellitus Genetics (FUSION) and Finrisk 2002 studies. FUSION had been studied since the 1990s using linkage analyses and other techniques with few strong results prior to using a GWAS approach with over 315,000 SNPs from the Illumina HumanHap300 BeadChip. The study identified five new T2D susceptibility loci and confirmed five other previously identified T2D susceptibility loci. While imputation did not produce results in regions that would not have contained significant associations otherwise, the imputed genotypes did provide additional support for the regions of association, and in many instances the imputed genotypes were more highly associated than any of the typed genotypes, which gives researchers candidates that can be followed up in the hope of finding a causal variant [24]. A major landmark in GWAS was the 2007 study published by the Wellcome Trust Case Control Consortium (WTCCC) for seven different diseases. The study used 3,000 independent controls with 2,000 cases each for bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, and type 2 diabetes. Results of the study included the identification of between one and nine associations in six out of the seven diseases. Imputation was implemented as described by Marchini et al. [20]. One of the more intriguing findings was two associations found for type 1

diabetes which had no significant typed SNPs but did have significantly associated imputed SNPs [25]. The WTCCC study will surely be seen as one of the standards against which other association studies are measured and has shown that potentially interesting SNPs can be identified using HapMap that would have been overlooked otherwise. Since the first successful GWAS association in 2005 more than 1400 associations have been identified using GWAS. These associations come from studies ranging in sample size from fewer than 100 individuals to more than 80,000 for traits related to disease, drug response, and other phenotypes in populations across the world [26]. While much has been discovered using this powerful tool, much remains to be discovered, and there is disagreement on how best to proceed. Some have argued that continuing to increase the size of genotyping panels will not be as effective as increasing sample sizes [27], but it has also been pointed out that despite the many associations discovered by GWAS, for many traits they may not be sufficient to discover the genetic variants that explain even most of the observable genetic variation [28]. For example, 40 polymorphisms have been found to be associated with height, accounting for approximately 5% of the variation of height, yet heritability of height is usually estimated to be between 80% and 90% [29]. Since variants with the largest effects are the most likely to be discovered, if we assume that the variants that have been found have the largest effects and all other variants follow the same diminishing pattern of effect, it would require over 90,000 variants to explain 80% of the variation in height as expected

by the estimates of heritability. Even a more conservative estimate that assumes that many of the undiscovered variants have effects in the range of those that have already been discovered can only reduce the number of variants necessary to explain the observed heritability to a still unmanageable 1500 SNPs. David Goldstein explains, "If common variants are responsible for most genetic components of type 2 diabetes, height, and similar traits, then genetics will provide relatively little guidance about the biology of these conditions, because most genes are 'height genes' or 'type 2 diabetes genes.'" [28] This gap between the variation that has been explained by GWAS associations and what is expected based on estimates of heritability has several possible explanations. One explanation is that the associations that have been found represent nearby rare, highly penetrant variants, so that what appear in the genotyped SNPs as weak associations are weak only because of the discrepancy between the allele frequency of the genotyped SNPs and that of the nearby causal variant. Such variants might be identifiable through linkage analyses coupled with targeted sequencing in the identified region. Another explanation is similar to the first but would be harder to identify: the causal variants have a low enough frequency that they won't be identified through GWAS but a high enough frequency that their penetrance may still be low enough to avoid detection in most linkage analyses [29]. It's also possible that other types of polymorphisms such as copy number variants (CNVs) or epigenetic effects like methylation explain some of the variation. Many CNVs have little correlation with SNPs in the commercial genotyping panels [30]. Other explanations include gene-gene or gene-environment interactions

where genetic effects may be enhanced by the presence of other genetic or environmental factors that are difficult to detect without a reasonable guess beforehand. It's also possible that traits we believe are common are actually multiple rarer conditions that have been mistakenly grouped together, which would serve to diminish the observable effects of all conditions [29]. There are a variety of ways to improve the discovery of causal variants in future genetic studies. Chapter 2 compares tagging to imputation in European and African samples in order to increase understanding of the properties of both methods. Imputation is a method that brings more complete genotyping to more researchers when costs of genotyping may otherwise inhibit the discovery of important associations. Tagging methods may increase the efficiency of imputation. In chapter 3, a possible explanation for the missing heritability is described. Genealogies are used to explain how multiple low-frequency variants can occur by chance on the same haplotype as more common variants in such a way that common variants may have significant associations with a studied trait, and this phenomenon is studied through simulation. An explanation for long-range LD between common and rare variants, which complicates local resequencing efforts, is provided. Chapter 4 describes the obstacles associated with analyzing samples containing individuals of diverse genetic heritage. Through simulation, different statistical tests involving samples containing subpopulations are compared and the circumstances under

which various tests perform better than the others are described. These results are valid for all genetic association studies, including candidate gene studies and GWAS.

REFERENCES

1. Weaver, R.F., Molecular biology. 3rd ed. 2005, Boston: McGraw-Hill. xvii, p
2. OMIM, OMIM Statistics for June 18,
3. Altshuler, D., M.J. Daly, and E.S. Lander, Genetic mapping in human disease. Science, (5903): p
4. Botstein, D., et al., Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet, (3): p
5. Cudworth, A.G. and J.C. Woodrow, Evidence for HL-A-linked genes in "juvenile" diabetes mellitus. Br Med J, (5976): p
6. de Vries, R.R., et al., HLA-linked genetic control of host response to Mycobacterium leprae. Lancet, (7999): p
7. Risch, N. and K. Merikangas, The future of genetic studies of complex human diseases. Science, (5281): p
8. Spielman, R.S., R.E. McGinnis, and W.J. Ewens, Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet, (3): p
9. Risch, N.J., Searching for genetic determinants in the new millennium. Nature, (6788): p
10. Botstein, D. and N. Risch, Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet, Suppl: p
11. Consortium, T.I.H., The International HapMap Project. Nature, (6968): p
12. Collins, F.S., M. Morgan, and A. Patrinos, The Human Genome Project: lessons from large-scale biology. Science, (5617): p
13. Collins, F.S., M.S. Guyer, and A. Chakravarti, Variations on a theme: cataloging human DNA sequence variation. Science, (5343): p
14. Lander, E.S., The new genomics: global views of biology. Science, (5287): p

15. Consortium, T.I.H., A haplotype map of the human genome. Nature, (7063): p
16. Schmidt, C., Latest HapMap update aims to direct researchers to genetic basis of disease. J Natl Cancer Inst, (22): p
17. Need, A.C. and D.B. Goldstein, Genome-wide tagging for everyone. Nat Genet, (11): p
18. de Bakker, P.I., et al., Efficiency and power in genetic association studies. Nat Genet, (11): p
19. Pe'er, I., et al., Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet, (6): p
20. Marchini, J., et al., A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, (7): p
21. Klein, R.J., et al., Complement factor H polymorphism in age-related macular degeneration. Science, (5720): p
22. Edwards, A.O., et al., Complement factor H polymorphism and age-related macular degeneration. Science, (5720): p
23. Haines, J.L., et al., Complement factor H variant increases the risk of age-related macular degeneration. Science, (5720): p
24. Scott, L.J., et al., A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, (5829): p
25. Consortium, W.T.C.C., Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, (7145): p
26. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A,
27. Nannya, Y., et al., Evaluation of genome-wide power of genetic association studies based on empirical data from the HapMap project. Hum Mol Genet, (20): p
28. Goldstein, D.B., Common genetic variation and human traits. N Engl J Med, (17): p

29. Maher, B., Personal genomes: The case of the missing heritability. Nature, (7218): p
30. Cooper, G.M., et al., Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet, (10): p

Chapter 2
Comparison of tagging and imputation to infer genotypes
Samuel P. Dickson and Matthew Nelson

Abstract

Tagging and imputation have been developed to make inferences about unmeasured genetic variants. While these methods attempt to address the same problem, their ability to make these inferences has never been directly compared. We compare the efficiency of inference by tagging, using pairwise and multi-marker tags, with that of imputation. Multi-marker tagging is shown to transfer well from a reference population to genetically similar populations in European and African samples, though pairwise tags maintain their predictive power better than multi-marker tags in independent samples. Imputation makes more accurate inferences than tagging, especially compared to multi-marker tags. Imputation is less efficient when imputing SNPs with no tags than when imputing SNPs that have tags, so a method is proposed to use tagging to select a tag set for imputation using lower r² thresholds and impute the remaining SNPs with this tag set. This method is shown to produce accurate results with fewer SNPs than traditional tag selection methods.

Introduction

Genetic variation influences many complex traits, including risk of many diseases and drug response. Much of the genetic variation in humans is accounted for by single nucleotide polymorphisms (SNPs), yet current genome-wide genotyping products can directly assay only a fraction of common SNPs. Fortunately, linkage disequilibrium

(LD), a correlation structure among genetic variants, allows inferences to be drawn for SNPs that have not been genotyped based on nearby genotyped SNPs [1-3]. Two methods developed to make such inferences are tagging and imputation. Tagging refers to the method of using one or more SNPs that have been genotyped as a proxy for an untyped SNP. To select tags, a reference sample that has been densely genotyped and is assumed to have patterns of LD that are similar to the study population is used to measure LD between all SNPs within a region. This reference sample is often derived from the International Haplotype Map (HapMap) project, but can be from any sample that has been extensively genotyped [4]. One SNP tags another if LD between the two exceeds a given threshold (commonly r² > 0.8) [5, 6]. Coverage is the proportion of a set of SNPs that is either typed by a tag set or is in LD above a given threshold with at least one SNP in the tag set. High coverage by a tag set provides a high likelihood that if an effect is present among the variants in the reference sample, it will be detected by a SNP in the tag set. de Bakker et al. proposed using SNPs in a haplotype as multi-marker tags to reach a given coverage with a reduced tag set [7]. Pe'er et al. extended the method to extract additional information from a fixed panel about SNPs not genotyped or tagged by a single SNP in that panel [3]. More recently several genotype imputation methods have been introduced that infer unmeasured genotypes using a set of reference haplotypes or genotypes [8-10]. Imputed genotypes are typically accompanied by some measure of the prediction quality.
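The definitions of pairwise tagging and coverage given above can be illustrated with a minimal sketch. It assumes genotypes coded as minor-allele counts (0, 1, 2) and uses the squared correlation of genotype counts as a stand-in for r²; LD is usually estimated from phased haplotypes, so this is a simplification. The function names `r_squared` and `coverage`, the SNP names, and the toy genotypes are hypothetical and are not taken from the study or from any tagging software.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def coverage(reference, tag_set, threshold=0.8):
    """Proportion of reference SNPs that are typed (in tag_set) or have
    r^2 above the threshold with at least one SNP in tag_set."""
    covered = 0
    for name, genotypes in reference.items():
        if name in tag_set:
            covered += 1  # typed directly
        elif any(r_squared(genotypes, reference[t]) > threshold for t in tag_set):
            covered += 1  # tagged by at least one typed SNP
    return covered / len(reference)


# Toy reference sample: four SNPs genotyped in eight individuals (made-up data).
reference = {
    "snp_a": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp_b": [0, 1, 2, 1, 0, 2, 1, 0],  # identical to snp_a
    "snp_c": [0, 1, 2, 1, 0, 2, 0, 0],  # differs from snp_a in one individual
    "snp_d": [2, 0, 1, 0, 2, 1, 0, 1],  # only weakly correlated with snp_a
}

print(coverage(reference, {"snp_a"}))  # 0.75
```

In this toy sample a panel containing only snp_a covers three of the four SNPs at the 0.8 threshold: snp_a is typed, snp_b and snp_c are tagged, and snp_d is left without any inference, which is the situation multi-marker tagging and imputation are meant to address.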

Imputation has been used to increase the number of markers that can be tested for association using a fixed genotyping panel as well as to combine samples that have been genotyped on different panels [11, 12]. Imputation has been used in genome-wide association studies to provide additional support for associations found with typed variants and has even identified associations with inferred variants with no significantly associated typed variants nearby [13, 14]. In one light, imputation methods are a more sophisticated implementation of panel extension with multi-marker tagging as proposed by Pe'er et al. Inference using multi-marker tagging consists of identifying a haplotype from the alleles of two or more typed SNPs that can be used as a proxy for an allele of an untyped SNP. In an analysis using multi-marker tagging, the haplotype of tags can be tested for association. Imputation, on the other hand, estimates a genotype for the untyped SNP but doesn't provide the haplotypes that were used to infer the genotype. One important difference between the two methods is that multi-marker tagging will only provide inferences for SNPs in sufficient LD with a tag, so many SNPs will have no inferences at all. While imputation may use a quality score to set a threshold for the quality of the SNPs imputed, the method provides the best estimate for SNPs in the reference dataset above a specified quality score, whereas tagging has emphasized providing the greatest number of covered SNPs with the fewest tags and will therefore not always provide the best haplotype available. This makes it possible to impute all genotypes with confidence that the quality will be low only when necessary. Specifying a low r² threshold in the tag selection process, on

the other hand, will provide lower quality estimates even when higher quality estimates would be available with the same number of typed SNPs, because the tag selection process in Tagger is designed to minimize the number of tags that clear the threshold rather than to maximize the r² with each SNP. The philosophy behind tagging puts the onus on the user to choose a threshold that will yield acceptable results.

When completing an association study that relies on tag SNPs, the most highly associated SNPs are treated as potential proxies for the causal variants and are used to identify a subset of untyped SNPs that can be the target of downstream analyses. Association analyses performed using imputation, on the other hand, can identify associations with both typed and untyped SNPs, providing improved guidance about the location of potential causal SNPs. Both multi-marker tags and imputation can provide improvements in power for a fixed genotyping panel [3, 7-10].

Portability of imputation between populations has been explored using both similar and dissimilar reference samples and using various combinations of reference samples [15]. It has also been shown that pairwise tagging maintains reasonable coverage when applied in an independent sample from a similar population [16-19]. However, there are no published reports evaluating the portability of multi-marker tags and no direct comparisons of the performance of imputation and multi-marker tagging.

We designed an experiment to compare multi-marker tagging for panel extension to imputation. We used the markers included in the Affymetrix 500K SNP panel on

chromosome 21 to make inferences on the remaining SNPs on chromosome 21 available in HapMap phase 3 (HapMap3) with multi-marker tagging and imputation in European and African populations. The HapMap3 data include samples from multiple similar populations, which allowed us to investigate the performance of these methods in the reference sample, in a sample from the same population, and in a sample from another population with the same continental origin. We first investigate the portability of multi-marker tagging and then compare the accuracy of the two methods in predicting untyped SNPs. Since multi-marker tagging does not provide inferences for all untyped SNPs, the comparisons focus primarily on tagged SNPs. However, based on a comparison of imputation results for SNPs inferred by multi-marker tagging with those for SNPs not inferred by multi-marker tagging, we also evaluate multi-marker tagging with lower r² thresholds as a panel design method that provides smaller tag sets for use with imputation.

Methods

Subjects and genotype data

Genotype data were drawn from HapMap3, including two European samples (CEPH Europeans from Utah [CEU] and Toscans in Italy [TSI]) and two African samples (Luhya in Webuye, Kenya [LWK] and Yoruba in Ibadan, Nigeria [YRI]). The CEU and YRI samples were each subdivided into two groups consisting of the original HapMap

participants (hereafter referred to as CEU and YRI) and the additional subjects included in phase 3 (CEU3 and YRI3). All computations used the phased haplotypes from chromosome 21 from HapMap phase 3 build 36, obtained from the HapMap website. The subjects included in subsequent analyses were limited to unrelated individuals, including 56 CEU, 57 CEU3, 88 TSI, 55 YRI, 58 YRI3 and 90 LWK. Each SNP included in the phased haplotypes was required to satisfy the following quality criteria: Hardy-Weinberg p-value > 10^-6, missing genotype rate < 0.05 per population sample, fewer than three Mendel errors (in samples with family data), a valid dbSNP RefSNP identifier, and a map to a unique genomic location. Chromosome 21 haplotypes include 19,306 SNPs. CEU and YRI were used as the reference samples for tagging and imputation, and we excluded all SNPs that were monomorphic within each. This resulted in 17,698 polymorphic SNPs in CEU used for the European samples and 18,281 in YRI used for the African samples.

Tag selection and imputation

In this experiment we used the subset of SNPs on chromosome 21 that are present in the Affymetrix 500K panel to tag or impute the remainder of the HapMap3 SNPs on that chromosome. There were 5,878 SNPs on the Affymetrix 500K panel that were present in the phased data, resulting in 11,820 SNPs in European samples and 12,403 in African samples that were to be inferred. To establish which SNPs were captured (i.e. which SNPs were in high LD with at least one tag) by the Affymetrix 500K panel, a 500 kb

sliding window was used with a minimum r² cutoff of 0.8. A LOD score greater than 3 was required for multi-marker tags. Identification of tags was carried out using Haploview version 4.1 with the aggressive tagging option with up to three markers in a tag [20]. Imputation was performed in MACH version using 20 iterations of the Markov sampler and 100 possible haplotypes per individual with the greedy option. The 112 CEU haplotypes were used as the reference sample for CEU3 and TSI, and the 110 YRI haplotypes were used as the reference sample for YRI3 and LWK. Comparisons between tagging and imputation were made based only on the SNPs that were not tags.

LD and prediction accuracy

LD was estimated by r² based on phased haplotypes. Coverage was estimated as the proportion of SNPs that are included as a tag or have r² above the 0.8 threshold with at least one pairwise or multi-marker tag. Prediction accuracy for imputation was estimated as the average proportion of measured genotypes that were accurately predicted by imputed genotypes. Similarly, the prediction accuracy of tag SNPs and haplotypes was defined as the proportion of genotypes accurately predicted by the tag when alleles between the marker to be inferred and the tag SNP were matched according to maximum concordance.

One limitation to the use of prediction accuracy as a measure of imputation or tagging performance is its dependence on allele frequency. A comparison of the expected

genotype prediction accuracy and minor allele frequency is shown in blue in Figure 2.1. As minor allele frequency approaches zero, genotype prediction accuracy approaches one when all subjects are arbitrarily assigned the most common genotype. Assuming Hardy-Weinberg equilibrium, at minor allele frequencies greater than 1/3 this corresponds to assigning a heterozygous genotype to every subject. For minor allele frequencies below 1/3, all subjects are assigned the genotype homozygous for the major allele. Using this rudimentary form of imputation as a lower limit on acceptable accuracy of imputation allows for a simple transformation to better compare the accuracy across minor allele frequencies. If x is the genotype prediction accuracy for a given SNP, p is the minor allele frequency estimate, and

\[ f(p) = \begin{cases} (1-p)^2 & \text{for } 0 \le p \le 1/3 \\ 2p(1-p) & \text{for } 1/3 \le p \le 1/2 \end{cases} \]

gives the expected accuracy of the rudimentary imputation, then the transformation is

\[ \frac{x - f(p)}{1 - f(p)}. \]

We refer to this frequency-adjusted measure as the adjusted accuracy. We find that adjusted accuracy is well correlated with r² (see Figure 2.2). Ultimately, we are concerned with power to detect associations. Adjusted accuracy gives greater penalty to errors that cause a greater decrease in power.
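As a worked illustration, the adjusted accuracy can be computed directly from its definition. The sketch below assumes Hardy-Weinberg proportions, as in the text; the function names and the handling of p = 0 (where the denominator vanishes) are choices made for this example rather than part of the original analysis.

```python
def baseline_accuracy(p):
    """Expected accuracy f(p) of assigning every subject the most common genotype.

    p is the minor allele frequency (0 <= p <= 0.5). Under Hardy-Weinberg
    equilibrium the most common genotype is the major-allele homozygote
    when p <= 1/3 and the heterozygote otherwise.
    """
    if not 0.0 <= p <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    if p <= 1 / 3:
        return (1 - p) ** 2
    return 2 * p * (1 - p)


def adjusted_accuracy(x, p):
    """Frequency-adjusted accuracy (x - f(p)) / (1 - f(p)).

    x is the observed genotype prediction accuracy for a SNP. Negative
    values mean the inference did worse than the rudimentary baseline.
    """
    f = baseline_accuracy(p)
    if f == 1.0:
        raise ValueError("adjusted accuracy is undefined for a monomorphic SNP")
    return (x - f) / (1 - f)
```

With these definitions, a raw accuracy of 0.90 at p = 0.1 adjusts to (0.90 - 0.81) / 0.19, a little under 0.5, whereas the same raw accuracy at p = 0.5 adjusts to 0.8. This is the frequency dependence that the transformation is meant to remove.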

Figure 2.1. Expected genotype prediction accuracy by allele frequency. The line represents the expected accuracy if imputation is applied by declaring all subjects to have the most probable genotype given allele frequency.

Panel design experiment

We propose a method of reducing the number of SNPs that must be genotyped, with little loss of imputation accuracy, by selecting a reduced set of tags with r² thresholds set lower than commonly used and imputing all unmeasured SNPs. To evaluate the efficiency of this method, tags were selected from CEU with the r² threshold ranging from 0.02 to 0.90 in increments of 0.02 using aggressive tagging with up to three-marker tags without forcing the inclusion of any set of tags. Each reduced tag set was used to impute the full set of SNPs for chromosome 21, accuracy was estimated as previously described, and the mean adjusted accuracy for imputation was reported for each r² threshold.

Results

We selected pairwise and multi-marker tags from among the Affymetrix 500K SNPs to tag the remaining HapMap3 chromosome 21 SNPs within CEU and YRI and assessed the portability of these tags in the remaining European (CEU3 and TSI) and African (YRI3 and LWK) samples, respectively. In CEU, 2,153 SNPs were selected as pairwise tags of 5,136 SNPs, 253 two-marker haplotypes were selected to tag 386 SNPs, 325 three-marker haplotypes were selected to tag 497 SNPs, and 3,725 SNPs tagged only themselves. 5,801 SNPs (49%) were not tagged at the r² threshold of 0.8. In YRI, 1,575 SNPs were selected as pairwise tags for 2,558 SNPs, 193 two-marker haplotypes were selected to tag 253 SNPs, 177 three-marker haplotypes were selected to tag 215 SNPs, and 4,303 SNPs tagged only themselves. 9,377 SNPs (76%) were not tagged. Basic

Figure 2.2. Comparison of prediction accuracy, adjusted accuracy, and r² by minor allele frequency. Here r² is compared first to prediction accuracy then to adjusted accuracy variant by variant within CEU3. The comparison between r² and prediction accuracy shows that as minor allele frequency decreases prediction accuracy grows closer to 1 compared to r². The plot comparing r² and adjusted accuracy does not demonstrate this bias.

coverage by tagging of the entire set of SNPs in the six samples is summarized in Table 2.1. As expected, there is higher coverage in European samples than in African samples. The use of multi-marker tags compared to pairwise tags alone modestly increases coverage, as has been shown previously [3]. Within the three types of tags, coverage of SNPs with single-marker tags remains higher than coverage of SNPs with multi-marker tags when evaluated in an independent sample from the same population. Coverage of multi-marker tags decays at a higher rate than that of single-marker tags when evaluated in genetically more distant populations from the same continent.

Table 2.1. Coverage of study samples by the Affymetrix 500K panel on all chromosome 21 HapMap3 SNPs. Coverage for SNPs in the multi-marker column includes those with pairwise and multi-marker tags. Within-tag-type coverage was only evaluated for those SNPs with the given tag type and does not include the tag SNPs.
Columns: Sample; Including tags (Pairwise, Multi-marker); Excluding tags (Pairwise, Multi-marker); Within tag type (1 marker, 2 markers, 3 markers).
Rows: CEU, CEU3, TSI, YRI, YRI3, LWK.

The distributions of r² estimates between single or multi-marker tags and the SNPs they tag are shown in Figure 2.3. Tag SNPs (i.e. r² always equals one) were excluded from these distributions. While both single-marker and multi-marker tags were selected with r² estimates above 0.8 within CEU and YRI, most SNPs with pairwise tags had r² estimates close to one while SNPs with multi-marker tags varied nearly uniformly from

0.8 to one. When applied to independent samples, all distributions shifted towards zero. The distributions of the difference in r² values from each reference sample to the independent samples are shown in Figure 2.4. Quantile regression demonstrates an approximately linear decrease in median change in r² as the number of markers in the tags increases, and the slope of this line increases with genetic distance from the reference sample.

The adjusted accuracy, as described in the Methods, is used to measure and compare the performance of both tagging and imputation. The distributions of the adjusted accuracy for the subset of SNPs that are tagged at the r² = 0.8 threshold are shown in Figure 2.5a. The adjusted accuracy distributions for tagging are very similar to the r² distributions (Figure 2.3), but adjusted accuracy creates negative scores that represent cases where the inference performs more poorly than the rudimentary imputation described in the Methods, which is a justifiable penalty. Imputation shows improved accuracy over tagging for all three types of tags, but especially over two- and three-marker tags. Pairwise tags actually had higher median accuracy in TSI and LWK than imputation, though greater skew in the accuracy of pairwise tags brought the mean accuracy of pairwise tags below that of imputation. This result is due largely to the fact that the selection of pairwise tags was concentrated heavily near r² values of one. As expected, prediction accuracy for SNPs that are not tagged is greatly reduced compared to tagged SNPs (Figure 2.5b). This demonstrates that SNPs that can be tagged with higher r² values are more accurately imputed.

Figure 2.3. Distribution of r² by number of SNPs in the tag. This violin plot shows the box-whisker summaries for the estimates of r² within each sample by tag type with the density of these values represented by the black lines around each box-whisker summary. Tagging SNPs are excluded.

Figure 2.4. Change in r² from the reference sample. The distribution of the difference between the original estimate of r² between each SNP that is tagged and its tag and the estimate of r² in the independent sample is shown grouped by the number of SNPs in the tag. The line through the medians is the median regression slope estimate. Tagging SNPs are excluded.

Though unsurprising, the superior performance of imputation of SNPs with tags in a panel extension context suggests that combining tagging and imputation can be used in panel selection to reduce the number of SNPs and improve study power over tagging alone. By adjusting the r² threshold in tag selection from as low as 0.02 up to 0.90, tag sets of varying size, each with total coverage at the selected threshold, were used to impute the remaining SNPs to reveal the properties of this method. The mean adjusted accuracy for each r² threshold is shown in Figure 2.6. With the 5,878 SNPs on chromosome 21 from the Affymetrix 500K panel genotype prediction accuracy for imputation in CEU3 had a mean of and in TSI was This average was surpassed using an r² threshold of 0.64 using only 4,957 SNPs. Lower thresholds caused a slow decrease in accuracy until 0.4. Values of r² lower than that caused more rapid decreases in prediction accuracy.

Discussion

Multi-marker tagging provides a modest improvement in coverage relative to pairwise tagging, as noted by Pe'er et al. [3]. We have shown that the coverage improvement does apply to independent samples. However, coverage degrades as more markers are required to tag a single SNP, and the rate of degradation increases as more genetically distant populations are considered. This is due in part to lower overall r² distributions for SNPs with multi-marker tags, which makes it easier for these estimates to fall below a coverage threshold because of the drop expected for all tags due to selection bias (i.e. winner's curse). In addition to lower initial estimates, it is also apparent that as the number of

Figure 2.5. Adjusted accuracy. a) Distributions of adjusted accuracy are separated within samples by the number of SNPs in each tag. b) Adjusted accuracy is shown for tagging SNPs with either pairwise or multi-marker tags compared to imputation prediction accuracy for the same set and for SNPs with no tags.

Figure 2.6. Average adjusted accuracy for reduced tag sets. Tag sets were reduced by using Haploview to select sets of tags using CEU with different r² thresholds under aggressive tagging. Shown here is the average adjusted accuracy for the different thresholds.

markers required to tag a SNP increases, there is a greater decrease in the LD relative to the sample in which the tag was selected. This result indicates a trend that carries over into imputation, because those SNPs that cannot be tagged by one-, two-, or three-marker haplotypes by definition require more markers in order to be accurately inferred. This requirement implies that we should expect greater decreases in prediction accuracy outside the reference sample. While the ability to use more markers improves the prediction accuracy for SNPs with tags, it also indicates that the accuracy falls as more markers are required for prediction. As previously noted, however, imputation maintains its accuracy better than tagging even as the number of markers required to tag a SNP above a given r² threshold increases.

The concept of tagging was introduced as a justification to use LD to bridge the gap between the information we are capable of gathering and the information necessary to discover genetic variants responsible for disease and other heritable traits of interest [21]. It has been widely used as a means to reduce the set of SNPs that must be genotyped to capture common genetic variation, but we are not aware that SNP selection and imputation have been used together to minimize the number of required SNPs. Based on the HapMap3 data on chromosome 21, we found that a tag SNP selection (with multi-marker tags) threshold of 0.6 provided higher prediction accuracy than tagging alone at a threshold of 0.8, with a panel reduction of 20%. Where panel reduction is an important consideration in the design of an experiment, this approach could be applied in selecting

the most efficient marker set. Software to facilitate such a selection is currently unavailable, but the selection can be accomplished through scripting and batch processing with available tools such as Haploview and MACH.

This study has shown that multi-marker tags can be used across closely related populations and that imputation can be used with tagging to reduce marker panel sizes while maintaining reasonable accuracy. Additionally, it is important to recognize that the use of imputation on commercial genotyping arrays will continue to grow; therefore, great benefit can be achieved by methods that reduce tag sets with imputation in mind.
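The kind of threshold sweep described for panel design can be sketched as follows. This is only a schematic stand-in: it uses a simple greedy set-cover selection of pairwise tags, which is not the algorithm implemented in Haploview, it omits multi-marker tags, and it stops before the imputation step that would be run externally. All function names and the toy data in the usage note are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs coded 0/1 on phased haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # monomorphic SNP: LD undefined, treated as 0
    return (p_ab - p_a * p_b) ** 2 / denom


def greedy_tags(haplotypes, threshold):
    """Greedy pairwise tag selection: pick the SNP capturing the most
    still-uncaptured SNPs until every SNP is captured."""
    snps = list(haplotypes)
    captures = {
        s: {s} | {t for t in snps
                  if r_squared(haplotypes[s], haplotypes[t]) >= threshold}
        for s in snps
    }
    uncovered = set(snps)
    tags = []
    while uncovered:
        best = max(snps, key=lambda s: len(captures[s] & uncovered))
        tags.append(best)
        uncovered -= captures[best]
    return tags


def sweep(haplotypes, thresholds):
    """Tag-set size at each r^2 threshold. In a full pipeline each tag set
    would then be passed to an imputation program and scored."""
    return {t: len(greedy_tags(haplotypes, t)) for t in thresholds}


# Thresholds from 0.02 to 0.90 in steps of 0.02, as in the panel design experiment.
THRESHOLDS = [round(0.02 * k, 2) for k in range(1, 46)]
```

Calling `sweep(haps, THRESHOLDS)` on a dictionary of phased haplotypes returns the number of tags needed at each threshold. The resulting curve of tag-set size against threshold is what would be weighed against the mean adjusted accuracy obtained after imputation.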

REFERENCES

1. Magi, R., et al., Evaluating the performance of commercial whole-genome marker sets for capturing common genetic variation. BMC Genomics.
2. Barrett, J.C. and L.R. Cardon, Evaluating coverage of genome-wide association studies. Nat Genet, (6).
3. Pe'er, I., et al., Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet, (6).
4. Stram, D.O., Tag SNP selection for association studies. Genet Epidemiol, (4).
5. Gabriel, S.B., et al., The structure of haplotype blocks in the human genome. Science, (5576).
6. Carlson, C.S., et al., Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet, (1).
7. de Bakker, P.I., et al., Efficiency and power in genetic association studies. Nat Genet, (11).
8. Marchini, J., et al., A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, (7).
9. Servin, B. and M. Stephens, Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet, (7).
10. Nicolae, D.L., Testing untyped alleles (TUNA)-applications to genome-wide association studies. Genet Epidemiol, (8).
11. Howie, B.N., P. Donnelly, and J. Marchini, A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, (6).
12. Guan, Y. and M. Stephens, Practical issues in imputation-based association mapping. PLoS Genet, (12).
13. Scott, L.J., et al., A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, (5829).
14. Consortium, W.T.C.C., Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, (7145).

15. Huang, L., et al., Genotype-imputation accuracy across worldwide human populations. Am J Hum Genet, (2).
16. Montpetit, A., et al., An evaluation of the performance of tag SNPs derived from HapMap in a Caucasian population. PLoS Genet, (3).
17. Gonzalez-Neira, A., et al., The portability of tagSNPs across populations: a worldwide survey. Genome Res, (3).
18. Lundmark, P.E., et al., Evaluation of HapMap data in six populations of European descent. Eur J Hum Genet, (9).
19. Xing, J., et al., HapMap tagSNP transferability in multiple populations: general guidelines. Genomics, (1).
20. Barrett, J.C., et al., Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, (2).
21. Consortium, T.I.H., A haplotype map of the human genome. Nature, (7063).

Chapter 3

Rare variants create synthetic genome-wide associations

Samuel P. Dickson, Kai Wang, Ian Krantz, Hakon Hakonarson, and David B. Goldstein

Abstract

Genome-wide association studies (GWAS) have now identified at least 2000 common variants that appear associated with common diseases or related traits [1] (accessed 8/24/09), hundreds of which have been convincingly replicated. It is generally thought that the associated markers reflect the effect of a nearby common (minor allele frequency > 0.05) causal site, which is associated with the marker [2-6], leading to extensive resequencing efforts to find causal sites [7-11]. We propose as an alternative explanation that variants much less common than the associated one may create synthetic associations by occurring, stochastically, more often in association with one of the alleles at the common site versus the other allele. While synthetic associations are an obvious theoretical possibility, they have never been systematically explored as a possible explanation for GWAS findings. Here we use simple computer simulations to show the conditions under which such synthetic associations will arise and how they may be recognized. We show that they are not only possible, but inevitable, and that under simple but reasonable genetic models they are likely to account for or contribute to many of the recently identified signals reported in genome-wide association studies. We also illustrate the behavior of synthetic associations in real data sets by showing that rare causal mutations responsible for both hearing loss and sickle cell anemia create genome-wide significant synthetic associations, in the latter case extending over a 2.5 Mb interval encompassing scores of blocks of associated variants. In conclusion, uncommon or rare genetic variants can easily create

synthetic associations that are credited to common variants, and this possibility requires careful consideration in the interpretation and follow-up of GWAS signals.

Introduction

Efforts to fine map the causal variants responsible for GWAS signals have been largely predicated on the common disease, common variant theory, which postulates a common variant as the culprit for observed associations. This has led to extensive resequencing efforts that have been largely unsuccessful [7-11]. Here we explore the possibility that part of the reason for this may be that one or more variants much less common than the associated ones may underlie some of the signals reported in recent GWAS, a phenomenon we call synthetic association. For convenience, these less common variants will be referred to here as rare, but we emphasize that we use this term loosely only to refer to variants less common than those routinely studied in GWAS. The basic idea of how synthetic associations emerge in this model is illustrated in Figure 3.1, which shows how rare variants, by chance, can occur disproportionately in some parts of a gene genealogy. Any variant higher up in the genealogy that partitions those parts of the genealogy containing more disease variants than average will be identified as disease-associated. It is well appreciated that a non-causal variant will show association with a causal variant if the two are in strong linkage disequilibrium. We introduce the term synthetic association, however, to describe how such indirect association can occur between a common variant and at least one and possibly many rarer causal variants.

Using the term "synthetic" as opposed to "indirect" emphasizes that the properties of the association signal are very different when the responsible variant or variants are much less frequent than the marker that carries the signal, as we detail below.

To assess the tendency of rare disease-causing variants to create synthetic signals of association that are credited to single polymorphisms that are much more common in the population than the causal variants, we have simulated 10,000 haplotypes based on a coalescent model in a region either with or without recombination (methods). We assumed that gene variants that influence disease have an allele frequency between and 0.02, which is generally below the range of reliable detection (or representation) using the genome-wide association platforms currently in use. We assumed a baseline probability of disease of for individuals with none of the rare genetic risk factors. The presence of at least one rare risk allele at the locus increased the probability of disease from to. We considered two values of (0.01, 0.1) and chose values of the penetrance such that the genotypic relative risk (GRR) of the rare causal variants varied incrementally between 2 and 6, where GRR is the ratio. These values were chosen to explore the space around a GRR of 4, a threshold above which consistent linkage signals would be expected [3]. We simulated scenarios with 1, 3, 5, 7, and 9 rare causal variants.
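The disease model described above can be illustrated with a short sketch. It assumes, as is conventional, that GRR is the probability of disease in carriers divided by the baseline probability in non-carriers; the function names, argument names, and 0/1 allele coding are hypothetical and are not taken from the simulation software used in this chapter.

```python
import random


def is_carrier(hap_pair, causal_sites):
    """True if either haplotype of an individual carries a risk allele (1)
    at any of the rare causal sites."""
    return any(hap[i] == 1 for hap in hap_pair for i in causal_sites)


def disease_probability(baseline, grr, carrier):
    """Probability of disease: the baseline for non-carriers and
    baseline * GRR for carriers (assumed definition of GRR)."""
    prob = baseline * grr if carrier else baseline
    if not 0.0 <= prob <= 1.0:
        raise ValueError("baseline * grr must be a valid probability")
    return prob


def simulate_status(individuals, causal_sites, baseline, grr, rng):
    """Draw case (True) or control (False) status for each individual.

    individuals is a list of (haplotype, haplotype) pairs and rng is a
    random.Random instance supplied by the caller for reproducibility.
    """
    return [
        rng.random() < disease_probability(
            baseline, grr, is_carrier(pair, causal_sites)
        )
        for pair in individuals
    ]
```

For example, with a baseline of 0.01 and GRR = 4, carriers of at least one rare risk allele develop disease with probability 0.04. At the upper end of the range considered, a baseline of 0.1 and GRR = 6 give 0.6, so every combination explored remains a valid probability.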

Figure 3.1. Example genealogies showing causal variants and the strongest association for a common variant. a) A genealogy with 10,000 original haplotypes was generated with 3000 cases and 3000 controls, genotype relative risk (γ) = 4, and 9 causal variants. The branches containing the strongest synthetic association are indicated in blue. The branches containing the rare causal variants are in red. b) A second genealogy was generated using the same parameters. These genealogies demonstrate two scenarios with genome-wide significant synthetic associations: the first had a high risk allele frequency (RAF = 0.49) and the second had a low RAF (0.08).

Results

Across the conditions we have studied, not only is it possible to achieve genome-wide significance for common variants when one or more rare variants are the only contributors to disease, it is often the likely outcome (Figure 3.2). Overall, 30 percent of the simulations were able to detect an association with a common SNP at genome-wide significance (p < 10^-8). Three factors (GRR, sample size, and the number of rare causal variants) had a notable impact on power to detect an association with a common SNP. As expected, greater proportions of synthetic associations were created when GRR increased for the rare causal variants and when sample size increased. As the number of rare causal variants increased, the probability of creating a synthetic association did as well. One possible explanation for this increase is that adding more causal variants increases the size of the disease class, which is the proportion of haplotypes that carry one or more disease alleles [5]. The size of the disease class varied in the simulations both because the frequency of causal variants was allowed to vary and because the disease class increases on average with the number of causal variants. To investigate the effect of the disease class on synthetic associations, we separated the results by size of disease class and found first that the larger the disease class, the higher the chance of a significant synthetic association. We also find, however, that within a disease class size, the likelihood of significant synthetic associations decreases with the number of causal variants (Figure 3.3).
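The size of the disease class, as defined above, is straightforward to compute from a set of haplotypes. The following is a minimal sketch under assumed conventions: haplotypes are lists of 0/1 alleles, with 1 marking the disease allele, and `causal_sites` is a hypothetical list of column indices.

```python
def disease_class_size(haplotypes, causal_sites):
    """Proportion of haplotypes carrying one or more disease alleles.

    haplotypes is a list of equal-length 0/1 lists; causal_sites lists the
    positions of the rare causal variants.
    """
    carrying = sum(
        1 for hap in haplotypes if any(hap[i] == 1 for i in causal_sites)
    )
    return carrying / len(haplotypes)
```

Because a haplotype is counted once no matter how many causal alleles it carries, adding causal sites can only keep or increase the disease class, which is consistent with the observation that the disease class grows on average with the number of causal variants.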

Figure 3.2. The proportion of simulations with a variant of genome-wide significance. Results for rare variants are shown in red, for the top hit among common variants are shown in black, and in blue are the results for the next best hit for common variants after including the top hit in the regression model. At the bottom of each graph the simulation parameters are represented graphically. Results across all parameters with no recombination are shown in (a) with the shaded region representing the area where linkage analysis just begins to be a possible method for detecting rare effects (GRR = 4). Results for simulations that included recombination are shown in (b). The shaded region in (b) is the same as the shaded region in (a), with the rate of recombination for the same parameters increasing along the x-axis.

Figure 3.3. The proportion of simulations with a variant of genome-wide significance separated by disease class.

Importantly, association with a causal variant in individual simulations was stronger than with the strongest common synthetic association in 98 percent of the simulations, and for each combination of parameters the proportion of simulations with genome-wide significant associations was always higher for causal variants than for synthetic

associations. Of particular importance, except for the case of GRR = 2, all conditions considered here produced a non-negligible proportion of simulations with significant common variants. It is also noteworthy that significant signals of association can be credited to common variants even when there is only a single rare causal site. A control simulation was run by testing the common variants from one genealogy against phenotypes generated by a separate genealogy with the same parameter settings, and not a single test produced a p-value below the genome-wide significance threshold of 10^-8 in any simulation. This shows that significant synthetic associations depend on the associations that occur within a single gene genealogy (or correlated ones in a recombination graph) and that sites undergoing free recombination cannot create genome-wide significant synthetic associations.

Intuitively, it seems obvious that when rare variants are the cause of the associations there should be multiple common variants that carry significant independent associations. To evaluate this expectation we took those genealogies that produced a genome-wide significant association and asked what the strongest association was when the top genome-wide significant association was first incorporated in the model. We found that almost 40 percent of genealogies with a genome-wide significant variant had secondary, independent associations that also achieved genome-wide significance. We also found that fewer than 10 percent of genealogies had no further significant associations (at α = 0.05). These results demonstrate a clear tendency of rare variants to create multiple independent signals of synthetic association. One essential question about synthetic

associations is whether they are expected to be robust to the presence of recombination. Surprisingly, not only does recombination fail to eliminate synthetic associations, low rates of recombination can enhance them compared with no recombination (Figure 3.2b). For example, for GRR = 4 and 9 risk alleles, and a sample size of 3000 cases and 3000 controls, we find the proportion of trees showing significance for zero recombination is When we introduce a recombination rate of (ten times the genome-wide average for 500 bp) between segments, however, we find that the proportion increases to When recombination is increased further the expected decline in the synthetic association is observed. Importantly, however, even at exceptionally high recombination through the region ( between segments) we find that almost 30% of the simulations show a significant common variant, and recombination must increase to to reduce the proportion to below 1%. Importantly, the simulations involving recombination prohibit evaluation of any common variant that has a rare causal site within the same segment. Thus the synthetic associations emerging in these simulations occur between sites that are separated by a minimum recombination distance of that between segments, which is to

It is counter-intuitive that recombination would increase synthetic associations since recombination reduces the average linkage disequilibrium (LD) in a region. The observation can be explained, however, by the effect of recombination on the distribution of association amongst sites within a genomic region. While the average LD declines as recombination increases, it is not known how higher moments behave and these moments can influence the proportion of pairs of sites that exceed some given threshold level of association.

Figure 3.4. Mean and variance of LD between rare and common sites as a function of the rate of recombination. 100,000 simulations of two loci with multiple variants in each locus show how the mean and variance of estimates of LD between rare and common variants are affected by recombination. While the mean is a nonincreasing function of recombination, the variance increases and then decreases, which shows why the maximum LD between rare and common variants can increase with low amounts of recombination in a region.

We tested this as the explanation for the capacity of recombination to enhance associations by directly evaluating the mean and the variance of the association between rare and common variants in a simplified simulation. We considered two regions separated by a specified recombination rate. In each simulation we calculated the average pairwise association between rare and common variants and the variance of the pairwise LD between rare and common variants, and we evaluated both quantities as a function of recombination. We found that while the mean is nonincreasing, the variance first increases and then decreases (Figure 3.4), suggesting that increases in recombination can widen the distribution of LD among sites sufficiently to increase the density in the tail and thereby create stronger synthetic associations.
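The two summary statistics described above can be sketched as follows. This is a minimal illustration that assumes LD is measured as r^2 between biallelic sites coded 0/1 per haplotype; the text does not state which LD measure was used, and the toy haplotypes and function names are hypothetical.

```python
from statistics import mean, pvariance


def r_squared(site_a, site_b):
    """r^2 between two biallelic sites, each a list of 0/1 alleles per haplotype."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(a & b for a, b in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom


def rare_common_ld_summary(rare_sites, common_sites):
    """Mean and (population) variance of r^2 over all rare-common site pairs."""
    values = [r_squared(r, c) for r in rare_sites for c in common_sites]
    return mean(values), pvariance(values)


# Toy data: 10 haplotypes, two lower-frequency sites and two common sites.
rare = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
common = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]

ld_mean, ld_var = rare_common_ld_summary(rare, common)
print(f"mean r^2 = {ld_mean:.4f}, variance = {ld_var:.4f}")
```

Repeating this calculation over many simulated data sets at each recombination rate would give curves of the kind summarized in Figure 3.4.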

These patterns make clear that so long as a given genomic region has one or more rare variants that contribute to disease, these rare variants can generate synthetic associations that are credited to much more common polymorphisms. Under ideal conditions, such synthetic associations can be detected with sample sizes far smaller than those routinely used in genome-wide association studies. Under less ideal conditions (for example, higher prevalence attributable to environment or to other genetic factors outside of the locus being considered, or lower penetrance for the local rare variants) the sample size must be larger. One essential quality of synthetic associations is that, while they are likely to be created when multiple rare variants exist in a region, there are certain conditions under which very little association will be detected even with very large sample sizes and large effects of the causal variants, because the causal alleles segregate with opposite common alleles. In other words, no common variant will be able to partition the rare variants on a genealogy so as to create an imbalance large enough to produce an association.

We also investigated trends in association with causal variants and found that, even though our model specified that only derived alleles at causal sites are deleterious, more than a third of the most highly associated common SNPs showed a higher penetrance for the ancestral allele, which is consistent with observed patterns [12]. Another important trend is that if only rare variants are contributing to the disease class in a region, the risk allele frequency of the most significant synthetic association will tend to be lower (median = 0.10), though over 20 percent of genome-wide significant synthetic associations had a risk allele frequency above 0.25 (Figure 3.5). Of course, this trend is noted when all common variants in a region are included, which is not the case with the available

commercial genotyping chips, which have a greater probability of including more common variants. In this case, the skew towards lower-frequency variants would be smaller.

We next attempted to determine the expected genomic distances over which rare variants could create synthetic associations. To do so we simulated a 10 Mb region with a typical recombination rate (1 cM/Mb), 9 rare causal variants, 2000 cases and 2000 controls, and GRR = 4. We then identified the most distal causal variant that was confirmed to contribute to the signal of synthetic association. We did this by finding the most distal variant whose statistical removal (by incorporation as a covariate into the regression) weakened the p-value of the synthetic association by at least one log unit. We found that when a synthetic association reached genome-wide significance, the most distant causal variant that affected the significance of the synthetic association was closer than 2 Mb from the synthetic association in fewer than 13 percent of the simulations and at least 9 Mb away in 4 percent of the simulations. The median distance of the most distant causal variant was 5 Mb. A simulated Manhattan plot showing a 10 Mb region with average recombination and 9 causal variants with GRR = 4 shows an example of a signature created by synthetic association (Figure 3.6).
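The one-log criterion can be sketched as follows, assuming the conditional p-values have already been obtained by refitting the regression with each causal variant as a covariate. The function names, positions, and p-values are hypothetical and serve only to illustrate the selection rule.

```python
import math


def log10_drop(p_marginal, p_conditional):
    """Orders of magnitude by which the p-value weakens after conditioning."""
    return math.log10(p_conditional) - math.log10(p_marginal)


def most_distal_contributor(tag_position, p_marginal, causal_results, min_drop=1.0):
    """Return (distance, position) of the farthest contributing causal variant.

    causal_results holds (position_bp, p_conditional) pairs, where p_conditional
    is the p-value of the synthetic association after adjusting for that variant.
    Returns None if no variant meets the min_drop criterion.
    """
    contributors = [
        (abs(position - tag_position), position)
        for position, p_conditional in causal_results
        if log10_drop(p_marginal, p_conditional) >= min_drop
    ]
    return max(contributors) if contributors else None


# Hypothetical example: a synthetic association at 5.0 Mb with p = 1e-10.
results = [
    (4_900_000, 1e-6),   # weakens by 4 log units: contributes
    (7_500_000, 3e-9),   # weakens by about 1.5 log units: contributes
    (9_800_000, 2e-10),  # weakens by about 0.3 log units: does not contribute
]
distal = most_distal_contributor(5_000_000, 1e-10, results)
print(distal)
```

In this example the variant at 9.8 Mb is the farthest away but is not counted, because conditioning on it changes the p-value by less than one log unit.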

Figure 3.5. Allele frequency distributions of all HapMap SNPs, Illumina 1M SNPs, and GWAS associations in CEU, and simulated synthetic associations. The allele frequencies show both minor and major allele frequencies. GWAS associations have a clear tendency towards the center, representing greater power to detect association with variants with higher minor allele frequencies.

Figure 3.6. Simulated Manhattan plots in a 10 Mb region. a) This region has 9 rare causal variants selected at random with GRR = 4 and 3000 cases and 3000 controls. b) The same region with permuted phenotypes shows what the region would look like without any association.

Finally, we evaluated the genomic pattern of synthetic associations using two real-world examples: hearing loss and sickle cell anemia. These two examples represent two possible extremes for synthetic associations. Sickle cell anemia is a serious Mendelian disease in which the body makes sickle-shaped red blood cells. The prevalence of the disease is approximately 1 in 5000 in the US, mostly affecting subjects with African ancestry [13]. It is known to be caused by autosomal recessive mutations in HBB, and the frequency of the most common causal variant (the Hb S allele) is ~3.6% in Americans of African ancestry [14]. In comparison, hearing loss is a relatively common and complex human disease, occurring in one per 1000 newborns on average [15]. More than two dozen causal genes have been identified for autosomal recessive non-syndromic hearing

loss [16, 17], but mutations in the GJB2/GJB6 locus account for about half of the cases of European ancestry [15, 18]. Among hundreds of known causal mutations in the GJB2/GJB6 locus [17], the 35delG mutation in GJB2 is the most common, with an allele frequency of 1.25% in European Americans [19], but hundreds of other point mutations in GJB2 as well as a 342-kb deletion encompassing GJB6 also represent known causal variants [20, 21].

For sickle cell anemia, a total of 179 SNPs reached genome-wide significance (p < ), encompassing a ~2.5 Mb region on chromosome 11p15.4 (from 3.59 Mb for rs to 5.98 Mb for rs997433). The region contains dozens of genes and dozens of visually discernible LD blocks in the HapMap YRI population. The top association signal (rs , p = ) is 9 kb from OR51V1, which is very near the causal gene, HBB (Figure 3.7). Clearly, highly significant association signals can travel across multiple LD blocks to distant genomic regions.

Figure 3.7. The 2.5 Mb genomic region on chr11p15.4 containing 179 genome-wide significant synthetic associations with sickle cell anemia in African Americans. The -log10(p) values for all genome-wide significant SNPs are displayed in the upper track, while the LD patterns based on the HapMap YRI population are displayed in the lower track. The region contains dozens of genes spanning several discernible LD blocks.

The three most significantly associated SNPs for hearing loss are all located at the GJB2/GJB6 locus on 13q12.11 (Figure 3.8), including rs near GJB6 (p = , OR = 1.69), rs within GJB2 (p = , OR = 1.63) and rs within GJA3 (p = , OR = 1.68). The three SNPs have weak LD with each other (pairwise r^2 values range from 0.02 to 0.62), but all of them are common variants. For example, rs has a minor allele frequency (MAF) of 18.7% in controls and 28.0% in cases. To evaluate the independence of the association signals from the three SNPs, we tested association again after incorporating rs in a logistic regression model, and still found residual association for rs (p = 4.3 × 10^-6) but not rs (p = 0.33), consistent with the expectations derived above for the behavior of synthetic associations. The locus has been extensively resequenced in numerous studies, and there is no common causal variant at the locus with ~18.7% allele frequency similar to rs Therefore, rare variants at the locus create multiple independent association signals captured by common tagging SNPs.
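As a consistency check on the frequencies quoted above, the allelic odds ratio implied by a MAF of 28.0% in cases and 18.7% in controls can be computed directly. The sketch assumes both frequencies refer to the same allele and that the odds ratio is a simple allelic one without covariate adjustment; the function name is hypothetical.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Odds ratio for carrying the allele, computed from allele frequencies."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Minor allele frequencies quoted in the text for the example SNP.
odds_ratio = allelic_odds_ratio(freq_cases=0.280, freq_controls=0.187)
print(round(odds_ratio, 2))  # 1.69
```

Under these assumptions the result agrees with the odds ratio of 1.69 reported for the first of the three SNPs.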

Figure 3.8. Overview of the GJB2/GJB6 locus on 13q12.11 in the deafness GWAS. The three most significantly associated SNPs have weak LD with one another. Although the most common causal variant (35delG) within GJB2 has a frequency of only 1.25% in European Americans, the locus can still be identified by GWAS with common tagging SNPs.