SAC review Haplotype mapping in human disease


Author: Linda Morgan

Key content:
Many obstetric and gynaecological disorders result from complex interactions between genetic and environmental factors.
Mapping of the patterns of variation in the human genome by the International HapMap project has made it possible to screen for common genetic variants in the human genome rapidly and economically.
Susceptibility genes for complex disorders typically have small effects.
Many thousands of samples from well-phenotyped cases are required to detect susceptibility genes with confidence.
National and international consortia provide a necessary research infrastructure for genetic studies of complex disorders.

Learning objectives:
To understand the principles and applications of linkage disequilibrium, haplotype tagging and genome-wide association screening.
To be able to access the bioinformatic resources available to researchers studying the genetic basis of complex disorders.

Ethical issues:
Research investment in haplotype mapping is leading to the identification of susceptibility genes for complex disorders, largely related to populations of white, western European descent.
It is important to ensure that these benefits extend to all populations, including those in developing countries.

Keywords: genome-wide association screening / International HapMap project / linkage disequilibrium / tagSNPs

Please cite this article as: Morgan L. Haplotype mapping in human disease. The Obstetrician & Gynaecologist.

Author details: Linda Morgan DM FRCPath, Senior Lecturer and Consultant Chemical Pathologist, Clinical Chemistry, Nottingham University Hospitals NHS Trust, QMC Campus, Derby Road, Nottingham NG7 2UH, UK. linda.morgan@nottingham.ac.uk (corresponding author)

This article was commissioned by the Scientific Advisory Committee (SAC)

Introduction
The majority of the 3000 million DNA bases that comprise the human genome are shared by all members of the human race. Nevertheless, genetic variation is abundant and accounts in some measure for the differences in individual susceptibility to disease. Painstaking analysis of affected pedigrees has unravelled many of the Mendelian disorders, conditions attributable to mutant alleles of single genes with high penetrance. Research efforts are now focusing on complex disorders: diseases that result from interactions between multiple factors, both genetic and environmental. In contrast with the rare monogenic diseases, complex disorders affect a large proportion of the population and consequently impose a considerable global public health burden.

Genetic factors are believed to contribute to many disorders in the field of obstetrics and gynaecology, including polycystic ovary syndrome, cervical cancer, uterine fibroids, endometriosis, recurrent miscarriage, preterm delivery, pre-eclampsia and fetal growth restriction. [1–8] First-degree relatives of women affected by these disorders typically have a two- to three-fold increase in disease risk compared with the female population as a whole. The search for susceptibility genes is likely to prove fruitful in many of these conditions.

Identifying genes in complex disorders
The remarkable success in identifying genes that cause monogenic disorders has depended on linkage analysis, which tracks the co-inheritance of DNA markers with disease in families with multiple affected subjects. Frustratingly, linkage analysis has proved to be a blunt tool in the search for susceptibility genes in complex disorders.
The lack of success can be attributed to the relatively small effect of individual genetic variants on disease risk: linkage analysis lacks statistical power to detect susceptibility genes other than those with large effects. Genetic association studies, which compare the frequency of genetic variants in unrelated affected and unaffected individuals, provide a more powerful strategy for gene discovery in complex disorders. This approach has parallels with research studying exposure to environmental agents such as smoking: exposure to a genetic variant is tested as a risk factor for disease.

Initially, genetic association studies of complex disorders were limited to candidate genes suggested by existing knowledge of the pathophysiology of the disease. With about genes in the human genome, many of unknown function, the candidate gene approach risks missing important susceptibility genes. Within the last few years, advances in genotyping technology, coupled with enhanced understanding of the patterns of human genetic variation, have made it possible to scan the entire genome using the technique of genome-wide association (GWA) screening. (See Table 1 for a glossary of terms used in this article.) GWA screening requires no prior knowledge of the function or genomic location of a susceptibility gene and has consequently been termed an 'agnostic' approach to disease gene discovery. It may therefore provide novel insights into disease pathways, a major advantage over the candidate gene approach.

Introducing single nucleotide polymorphisms (SNPs)
Genome-wide association screening has focused on polymorphisms involving a single base change, known as single nucleotide polymorphisms or SNPs (pronounced 'snips'). Single nucleotide polymorphisms are usually biallelic: there are two possible variants. Over 10 million SNPs in which the less common (minor) allele occurs at a frequency of at least 1% have been reported to the SNP database (dbSNP).
[9] Some SNPs lie within genes and may affect the amino acid sequence of the encoded protein. The majority of SNPs lie within noncoding regions of the genome; their functional effects are often unknown but experimental evidence indicates that many are involved in the regulation of gene expression. Comprehensive screening of the genome must therefore take into account the possibility that any SNP may have a functional effect. Fortunately it is not necessary to genotype all 10 million SNPs directly, due to the correlation between nearby SNPs, a phenomenon known as linkage disequilibrium.

Linkage disequilibrium, the lack of independence in the distribution of alleles of nearby SNPs, arises because a novel SNP usually results from a single mutational event on an ancestral chromosome bearing a distinctive repertoire of alleles at multiple polymorphic sites. The full complement of alleles along a length of chromosome is known as its haplotype. The haplotypic background on which a novel SNP arises is transmitted largely unchanged from generation to generation. It can be altered by further new mutations, which are relatively rare events, or by chromosomal recombination during meiosis. The probability that two chromosomal loci will be separated by recombination is greater for distant loci than for loci that are in close proximity. Consequently, over multiple generations the genetic distance that displays linkage disequilibrium with an individual SNP becomes progressively shortened.

Table 1 Glossary
allele: One variant of a genetic polymorphism
biallelic: Describes a genetic polymorphism with two alternative DNA sequences
CEU: HapMap panel derived from Utah residents of northern or western European descent
CHB: HapMap panel derived from Chinese subjects in Beijing
dbSNP: SNP database
ENCODE: A project to resequence and genotype ten 500 kb regions of the genome in HapMap subjects
founder effect: Loss of genetic variation when a small number of individuals from a large population establish a new population
GWA screen: Genome-wide association screen
haplotype: The combination of alleles at multiple polymorphic sites in a chromosomal region
HapMap project: The International HapMap project, which aims to map human genetic variation in multiple populations as a tool for research into human disease and drug discovery
JPT: HapMap panel derived from Japanese subjects in Tokyo
kb: kilobase(s)
linkage disequilibrium: Correlation between genetic polymorphisms
locus (plural loci): A position on a chromosome
minor allele: The less common allele at a biallelic locus
proxy: A SNP that can be used as a surrogate for another SNP, due to the tight linkage disequilibrium between them
recombination: Exchange of chromosomal material as a result of crossover during meiosis
tagSNP: One of a minimum set of SNPs that captures genetic variation in a region
SNP: Single nucleotide polymorphism; a polymorphism resulting in the alteration of a single DNA base
YRI: HapMap panel derived from Yoruba subjects in Ibadan, Nigeria

The International HapMap project
The need for a detailed human haplotype map as a tool for genetic research inspired the International HapMap project. Initiated in 2002, the HapMap Consortium aimed to map genetic variation in a total of 270 individuals from four geographically diverse ethnic groups: 30 mother–father–offspring trios of northern and western European ancestry from Utah, USA (CEU); 30 trios from the Yoruba in Ibadan, Nigeria (YRI); 45 unrelated Han Chinese individuals in Beijing (CHB) and 45 unrelated Japanese individuals in Tokyo (JPT).
The first phase of the project genotyped 1 million SNPs with a minor allele frequency of ≥0.05 at an average density of 1 per 5 kilobases (kb), and published its findings in 2005. [10] The second phase of HapMap, published in 2007, [11] genotyped over 3 million SNPs, increasing the mapping density to one SNP per kilobase. This remarkable online resource is freely available to researchers via the HapMap website. [12]

In a parallel project, ENCODE, complete DNA sequencing of 48 HapMap subjects was undertaken in ten 500 kb regions of the genome, a total of 5 million base pairs. [13] All SNPs in these regions were then genotyped in all the HapMap individuals. This has provided an inventory of total variation in the selected regions, against which the completeness of coverage by HapMap SNPs can be assessed. The ENCODE data revealed an average SNP density of one per 279 base pairs, of which almost half were rare (minor allele frequency <0.05). Rare alleles are more common in the YRI samples than in the CEU, JPT or CHB panel; this is consistent with the hypothesis that out-of-Africa populations are derived from a relatively small number of founders.

The HapMap project has provided valuable insights into the patterns of linkage disequilibrium and chromosomal recombination. Recombination does not occur at random across the genome: multiple short regions of the genome, usually 5 kb in length, contain sequences of nucleotides that favour recombination (Figure 1). These recombination hotspots separate stretches displaying high levels of linkage disequilibrium known as haplotype blocks, averaging 5–22 kb in length depending on the method used to define the blocks and the HapMap panel. Haplotype blocks are, on average, longer in the CEU, JPT and CHB panels than in the YRI panel. The combination of alleles of multiple SNPs within each block produces a number of different haplotypes (Figure 2). The theoretical maximum number of combinations of alleles at n biallelic SNPs is 2ⁿ.
For example, there are mathematically 1024 (2¹⁰) possible haplotypes in a block containing 10 SNPs. In practice, due to the lack of independence between SNPs, the number of haplotypes observed is substantially lower than this, with 4–6 common haplotypes containing most of the observed variation within each block.

Figure 1 Linkage disequilibrium in a 500 kb segment of chromosome 3. Image generated from the HapMap website showing a 500 kilobase region of chromosome 3 in the CEU panel. The uppermost horizontal line provides the coordinates of this segment on the human genome map. A histogram depicts the number of SNPs per 20 kb genotyped by the HapMap project. The blue grid represents the linkage disequilibrium between each pair of SNPs; deeper intensities of blue indicate greater correlation. Recombination hotspots separating haplotype blocks are clearly evident, with minimal linkage disequilibrium between blocks.

Figure 2 Limited haplotype diversity. A hypothetical haplotype block containing eight SNPs is shown. In the absence of linkage disequilibrium, these could generate 256 (2⁸) haplotypes. Due to the lack of independence between nearby SNPs, only five common haplotypes are observed, accounting for 97% of the haplotypic variation in this population. Note that SNPs 1, 2, 4 and 6 form a perfect proxy set: genotyping one SNP from this set provides full information about the genotype at the other three SNPs. Similarly, SNPs 3 and 5 are perfect proxies, as are SNPs 7 and 8. Consequently, genotyping just three SNPs captures all the common variation in this haplotype block.

The HapMap project continues to provide novel insights into the structure, function and evolution of the human genome. Perhaps, however, the area that has the most exciting implications for clinicians is the application of HapMap data to genetic studies of complex human disorders.

Applications to genetic association studies
One of the principles underlying genetic association studies of complex disorders is that a causal allele can be identified either by direct genotyping, or by genotyping a SNP with which it is in linkage disequilibrium. Data from the HapMap and ENCODE projects have shown that as many as 80% of common SNPs in the CEU analysis panel are perfectly correlated with between 1 and 20 or more other SNPs, described as 'perfect proxies'. Only one tagSNP from each of these perfect proxy sets needs to be genotyped to provide complete genotypic information about all the other SNPs in the set (Figure 2). One measure of the strength of linkage disequilibrium between a pair of SNPs is the square of the correlation coefficient, r², which can take a value from 0 to 1; for perfect proxies, r² = 1.
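The r² measure and the perfect proxy sets of Figure 2 can be illustrated with a short calculation. The following is a minimal sketch, not the method used by the HapMap tools: the five haplotypes are hypothetical, chosen only to reproduce the proxy sets described in the Figure 2 caption and given equal frequency for simplicity, and the function names `r_squared` and `pick_tag_snps` are illustrative. Alleles are coded '0' (common) and '1' (minor), and tagSNPs are chosen by a simple greedy rule.

```python
# Hypothetical phased haplotypes for a block of eight SNPs (assumption:
# equal frequencies). SNPs 1, 2, 4, 6 vary together, as do 3, 5 and 7, 8.
HAPLOTYPES = [
    "00000000",
    "11010100",
    "00101000",
    "00000011",
    "11111111",
]


def r_squared(haplotypes, i, j):
    """Squared correlation between the alleles at SNP columns i and j."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j  # deviation from independence
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return d * d / denom if denom > 0 else 0.0


def pick_tag_snps(haplotypes, threshold=1.0, tol=1e-9):
    """Greedy tagging: repeatedly pick the SNP that is a proxy
    (r^2 >= threshold) for the largest number of still-untagged SNPs."""
    def proxies(s, pool):
        return {t for t in pool
                if r_squared(haplotypes, s, t) >= threshold - tol}

    untagged = set(range(len(haplotypes[0])))
    tags = {}
    while untagged:
        best = max(sorted(untagged), key=lambda s: len(proxies(s, untagged)))
        covered = proxies(best, untagged) | {best}
        tags[best] = sorted(covered)
        untagged -= covered
    return tags


tags = pick_tag_snps(HAPLOTYPES)
for tag, covered in tags.items():
    # Report SNPs using the 1-based numbering of the figure caption.
    print(f"tagSNP {tag + 1} captures SNPs {[c + 1 for c in covered]}")
print(f"{len(tags)} tagSNPs for {len(HAPLOTYPES[0])} SNPs; "
      f"{2 ** len(HAPLOTYPES[0])} haplotypes are theoretically possible")
```

With these assumed haplotypes the greedy rule selects three tagSNPs, one from each perfect proxy set. Lowering `threshold` below 1 would allow imperfect proxies to be grouped under one tag, at the cost of weaker correlation with the untyped SNPs.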
Employing tagSNPs showing lesser correlation also provides useful information for genetic association studies, although this carries the penalty of reduced statistical power for the detection of untyped susceptibility variants. Increasing the study size by a factor of 1/r² provides equivalent statistical power to direct genotyping of the causative SNP. For example, if 1000 cases and 1000 controls provide adequate statistical power for disease gene detection where the causal variant is directly genotyped, 1250 cases and 1250 controls provide equivalent power using a tagSNP which is in partial linkage disequilibrium, r² = 0.8, with the causal SNP.

Most of the common variation across the genome in the CEU, JPT and CHB panels can be captured by tagSNPs. Up to 1 million tagSNPs are required to capture the majority of variation in the YRI panel, where linkage disequilibrium extends over shorter distances (Figure 3). Arrays designed using HapMap data that are capable of genotyping on this scale are commercially available and are now being used for GWA screening for complex disorders. The principles of haplotype mapping can, of course, be applied to more restricted regions of the genome and researchers can make use of HapMap resources to select a customised panel of tagSNPs for the region of interest (Figure 4).

A notable early example of the success of GWA screening has been the study of seven common complex disorders undertaken by the Wellcome Trust Case Control Consortium (WTCCC). [14] This group genotyped SNPs in 3000 UK controls: 1500 from the 1958 Birth Cohort and 1500 healthy blood donors. These were compared with the genotypes of 2000 patients affected by each of seven disorders: coronary artery disease, essential hypertension, Crohn's disease, type I and type II diabetes, bipolar disorder and rheumatoid arthritis. Between one and nine strong SNP association signals (P ) were detected for six of these disorders (the exception being essential hypertension).
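Two calculations described in this section lend themselves to a short sketch: the comparison of allele frequencies between cases and controls, and the 1/r² scaling of study size. The code below is illustrative only. The allele counts are invented (they assume 2000 cases and 3000 controls, each contributing two alleles), the function names are hypothetical, and a simple allelic chi-square test on a 2 × 2 table stands in for whatever test a given study actually applies.

```python
import math


def allelic_association(case_minor, case_major, control_minor, control_major):
    """Odds ratio, 1-df chi-square statistic and P value for a 2x2 table
    of allele counts (minor/major allele by case/control status)."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


def scaled_sample_size(n_direct, r2):
    """Subjects needed when a tagSNP with correlation r2 to the causal SNP
    is typed, for the same power as n_direct with direct genotyping."""
    return math.ceil(n_direct / r2 - 1e-9)  # small offset guards float error


# Invented counts: minor allele frequency 0.34 in cases, 0.30 in controls.
odds_ratio, chi2, p_value = allelic_association(1360, 2640, 1800, 4200)
print(f"odds ratio {odds_ratio:.2f}, chi-square {chi2:.1f}, P = {p_value:.1e}")

# The worked example from the text: 1000 subjects at r2 = 0.8.
print(scaled_sample_size(1000, 0.8))
```

The second call reproduces the figure of 1250 cases (and 1250 controls) given above for a tagSNP with r² = 0.8.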
Some hits confirmed previously reported genetic associations, but the majority were novel. A further 58 SNPs with moderate associations with disease were identified (P ), some of which have already been confirmed by other investigators. [15]

Figure 3 Linkage disequilibrium in CEU and YRI panels. An enlarged HapMap image of 150 kilobases of chromosome 3 including the largest haplotype block shown in Figure 1. SNPs genotyped in the HapMap project are shown as small triangles beneath the chromosomal coordinates. Linkage disequilibrium between pairs of SNPs is shown in the blue grids, representing the CEU panel (upper grid) and the YRI panel (lower grid). Although there are clear similarities in the pattern of linkage disequilibrium in the two panels, correlation is weaker in the YRI panel and there is evidence of an additional recombination hotspot in the midpoint of the block.

Some important principles have been established by the WTCCC and similar studies. First, the odds ratio for disease conferred by individual SNPs is generally small, usually 1.5, implying that only large, well-powered studies will successfully identify disease genes in complex disorders: 2000 cases is regarded as a minimum number for genome-wide screening. Second, due to the multiple statistical comparisons undertaken in a GWA screen, a stringent threshold for declaring statistical significance must be applied to reduce false positive results to an acceptable level: P appears to meet this requirement. Adequately powered independent replication studies are equally important in establishing the reliability of hits in GWA screens.

The WTCCC approach has shown that the use of population-based control genotypes is a valid strategy, provided that the disease of interest does not have a high incidence. Genome-wide data from thousands of population controls are now deposited in databases freely available to researchers on application and they provide an economical and powerful resource for GWA screening in a wide range of complex disorders. [16–18]

Pooling of data from multiple GWA screens has proved to be immensely valuable in increasing the power to detect variants with small effects. A good example is the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) consortium, which undertook a meta-analysis of three GWA screens for type II diabetes susceptibility genes; together these provided over 4500 cases and 5500 controls.
[19] Single nucleotide polymorphisms giving the strongest signals were carried forward for replication studies in over cases and controls. This study detected six previously unidentified susceptibility loci for type II diabetes, all with odds ratios for disease of The small effect size of these loci accounts for the failure to detect them in individual GWA studies and demonstrates the power of meta-analysis in gene-discovery efforts. It is worth noting that genetic variants that confer a small increase in disease risk may nevertheless provide novel insights into pathophysiological mechanisms.

Figure 4 TagSNP selection. Images generated from the HapMap website depicting the haplotypes defined by 117 SNPs in the 150 kilobase segment of chromosome 3 shown in Figure 3. The upper figure is generated from 120 unrelated YRI subjects; the lower panel represents haplotypes from 120 CEU subjects. Chromosomes are stacked horizontally; the commoner allele at each SNP is coded blue, the minor (less common) allele is coded yellow. Note the greater variability and higher number of rare haplotypes in the YRI panel. The majority of haplotypic variability is captured by a subset of tagSNPs displayed beneath each panel, identified by the dbSNP reference numbers (rs#). The 39 tagSNPs selected for the YRI panel comprise perfect proxies (r² = 1) for all SNPs with a minor allele frequency Typically, a smaller number of SNPs is required to tag haplotypes from out-of-Africa populations. Twenty tagSNPs are shown for the CEU panel, reflecting both the stronger linkage disequilibrium and a relaxation of the selected linkage disequilibrium criteria in the choice of tagSNPs to r

From genome-wide association screening to molecular mechanisms
Having identified an association between a SNP and disease, researchers are faced with the task of establishing the molecular basis of the association. The typed SNP may be the causative variant, or may act as a marker for a causal variant in linkage disequilibrium. Fine mapping of all variants in the surrounding region of the genome is needed, making use of data available through initiatives like the ENCODE project, or through further extensive DNA sequencing. Due to linkage disequilibrium, it is likely that multiple SNPs will show association with disease and resolving which of these is the causal variant presents a considerable challenge. It may not even be possible to state confidently which of several genes is responsible for the disease association. Finer resolution may be possible by studying the locus of interest in a population with shorter haplotype blocks.
Ultimately, laboratory studies of the functional effects of individual SNPs are needed to define their mechanism of action.

The future
Haplotype mapping has provided genetic researchers with an unparalleled opportunity to understand the genetic basis of complex disorders and the harvest has only just begun. There is increasing interest in the functional effects of copy number variation (CNV) in the human genome: inherited differences in the number of copies of extended segments of DNA up to many megabases in length. [20,21] Copy number variation is much more common than was previously appreciated and may underlie much of human phenotypic variation. Whilst some CNVs are captured adequately by tagSNPs, alternative technologies specifically targeting genomic regions known to display CNV are now incorporated into GWA arrays.

It seems likely that affordable whole-genome sequencing will become available within the next decade. Whether this will make the tagSNP approach obsolete may depend on economic considerations. It is certain that the bioinformatic and statistical platforms that have been developed for the current generation of GWA scanning will remain key to the success of future efforts at disease gene discovery.

If the benefits of this latest revolution in molecular genetics are to be extended to the fields of obstetrics and gynaecology, the collection of large DNA banks coupled with carefully collected phenotypic data is vital and depends heavily on the active involvement of clinicians and the engagement of patients. The expectation that insights into the molecular basis of common disorders will lead to novel therapeutic strategies provides a strong motivation for the effort involved.

Recommended website
HapMap Tutorials [

References
1 Vink JM, Sadrzadeh S, Lambalk CB, Boomsma DI. Heritability of polycystic ovary syndrome in a Dutch twin-family study. J Clin Endocrinol Metab 2006;91: doi: /jc
2 Magnusson PKE, Lichtenstein P, Gyllensten UB. Heritability of cervical tumours. Int J Cancer 2000;88: doi: / ( )88:5<698::aid-ijc3>3.0.co;2-j
3 Ligon AH, Morton CC. Leiomyomata: heritability and cytogenetic studies. Hum Reprod Update 2001;7:8–14. doi: /humupd/
4 Bischoff FZ, Simpson JL. Heritability and molecular genetic studies of endometriosis. Hum Reprod Update 2000;6: doi: /humupd/
5 ESHRE Capri Workshop Group. Genetic aspects of female reproduction. Hum Reprod Update 2008;14: doi: /humupd/dmn009
6 Ward K, Argyle V, Meade M, Nelson L. The heritability of preterm delivery. Obstet Gynecol 2005;106:
7 Chappell S, Morgan L. Searching for genetic clues to the causes of pre-eclampsia. Clin Sci 2006;110: doi: /cs
8 Clausson B, Lichtenstein P, Cnattingius S. Genetic influence on birthweight and gestational length determined by studies in offspring of twins. BJOG 2000;107: doi: /j tb13234.x
9 dbSNP website [
10 The International HapMap Consortium. A haplotype map of the human genome. Nature 2005;437: doi: /nature
11 The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 2007;449: doi: /nature
12 International HapMap Project [
13 HapMap ENCODE Project [
14 The Wellcome Trust Case Control Consortium. Genome-wide association study of seven common diseases and shared controls. Nature 2007;447: doi: /nature
15 Parkes M, Barrett JC, Prescott NJ, Tremelling M, Anderson CA, Fisher SA, et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat Genet 2007;39: doi: /ng
16 Wellcome Trust Case Control Consortium [
17 European Genome-phenome Archive [
18 Sanger Institute [ access_to_data.shtml]
19 Zeggini E, Scott LJ, Saxena R, Voight BF. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet 2008;40: doi: /ng
20 Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305: doi: /science
21 McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39:s doi: /ng