Linkage Disequilibrium
Biostatistics 666
Logistics: Office Hours
- Office hours on Mondays at 4 pm
- Room 4614, School of Public Health Tower
Previously
- Basic properties of a locus
  - Allele frequencies
  - Genotype frequencies
- Hardy-Weinberg equilibrium
  - Relationship between allele and genotype frequencies that holds for most genetic markers
- Exact tests for Hardy-Weinberg equilibrium
Today
- We'll consider properties of pairs of alleles
  - Haplotype frequencies
  - Linkage equilibrium
  - Linkage disequilibrium
Let's consider the history of two neighboring alleles
Alleles that exist today arose through ancient mutation events
[Figure: chromosome sequences before and after a mutation, with the mutated site marked]
One allele arose first, and then the other
[Figure: chromosomes before and after the second mutation]
Recombination generates new arrangements for ancestral alleles
[Figure: chromosomes before and after recombination, showing the recombinant haplotype]
Linkage Disequilibrium
- Chromosomes are mosaics
- Extent and conservation of mosaic pieces depends on
  - Recombination rate
  - Mutation rate
  - Population size
  - Natural selection
- Combinations of alleles at very close markers reflect ancestral haplotypes
[Figure: ancestor chromosomes and the present-day mosaics derived from them]
Why is linkage disequilibrium important for gene mapping?
Association Studies and Linkage Disequilibrium
- If all polymorphisms were independent at the population level, association studies would have to examine every one of them
- Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies
Tagging SNPs
- In a typical short chromosome segment, there are only a few distinct haplotypes
- Carefully selected SNPs can determine status of other SNPs
[Figure: five haplotypes, Haplo1 to Haplo5, with frequencies 30%, 20%, 20%, 20%, and 10%]
Basic Descriptors of Linkage Disequilibrium
Commonly Used Descriptors
- Haplotype frequencies
  - The frequency of each type of chromosome
  - Contain all the information provided by other summary measures
- Commonly used summaries
  - D
  - D'
  - r² or Δ²
Haplotype Frequencies

                     Locus B
Locus A      B            b            Totals
A            $p_{AB}$     $p_{Ab}$     $p_A$
a            $p_{aB}$     $p_{ab}$     $p_a$
Totals       $p_B$        $p_b$        1.0
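Linkage Equilibrium: Expected for Distant Loci
$$
\begin{aligned}
p_{AB} &= p_A p_B \\
p_{Ab} &= p_A p_b = p_A (1 - p_B) \\
p_{aB} &= p_a p_B = (1 - p_A)\, p_B \\
p_{ab} &= p_a p_b = (1 - p_A)(1 - p_B)
\end{aligned}
$$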
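Linkage Disequilibrium: Expected for Nearby Loci
$$
\begin{aligned}
p_{AB} &\neq p_A p_B \\
p_{Ab} &\neq p_A p_b = p_A (1 - p_B) \\
p_{aB} &\neq p_a p_B = (1 - p_A)\, p_B \\
p_{ab} &\neq p_a p_b = (1 - p_A)(1 - p_B)
\end{aligned}
$$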
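Disequilibrium Coefficient D
$$
\begin{aligned}
D &= p_{AB} - p_A p_B \\
p_{AB} &= p_A p_B + D \\
p_{Ab} &= p_A p_b - D \\
p_{aB} &= p_a p_B - D \\
p_{ab} &= p_a p_b + D
\end{aligned}
$$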
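The definition of D can be checked numerically. The following is a minimal sketch, assuming four hypothetical haplotype frequencies chosen only for illustration; the function name `disequilibrium_coefficient` is not from the lecture.

```python
def disequilibrium_coefficient(p_AB, p_Ab, p_aB, p_ab):
    """Return (p_A, p_B, D) from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab      # allele A is carried by haplotypes AB and Ab
    p_B = p_AB + p_aB      # allele B is carried by haplotypes AB and aB
    D = p_AB - p_A * p_B   # excess of AB over its equilibrium expectation
    return p_A, p_B, D


# Hypothetical frequencies (an assumption for illustration, not lecture data).
freqs = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
p_A, p_B, D = disequilibrium_coefficient(
    freqs["AB"], freqs["Ab"], freqs["aB"], freqs["ab"]
)
p_a, p_b = 1 - p_A, 1 - p_B

# Each haplotype frequency equals its equilibrium value plus or minus D.
rebuilt = {
    "AB": p_A * p_B + D,
    "Ab": p_A * p_b - D,
    "aB": p_a * p_B - D,
    "ab": p_a * p_b + D,
}
```

With these inputs both allele frequencies are 0.6, so D = 0.5 - 0.36 = 0.14, and the four rebuilt frequencies match the inputs up to rounding error.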
D is hard to interpret
- Sign is arbitrary
  - A common convention is to set A, B to be the common alleles and a, b to be the rare alleles
- Range depends on allele frequencies
  - Hard to compare between markers
What is the range of D?
- What are the maximum and minimum possible values of D when
  - $p_A = 0.3$ and $p_B = 0.3$
  - $p_A = 0.2$ and $p_B = 0.1$
- Can you derive a general formula for this range?
D': A scaled version of D
$$
D' =
\begin{cases}
\dfrac{D}{\min(p_A p_B,\; p_a p_b)} & D < 0 \\[2ex]
\dfrac{D}{\min(p_A p_b,\; p_a p_B)} & D > 0
\end{cases}
$$
- Ranges between -1 and +1
  - More likely to take extreme values when allele frequencies are small
  - ±1 implies at least one of the possible haplotypes was not observed
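The range of D follows from requiring all four haplotype frequencies to be non-negative, and D' divides D by the bound that has the same sign. A minimal sketch under that reasoning; the function names `d_bounds` and `d_prime` are illustrative, not from the lecture.

```python
def d_bounds(p_A, p_B):
    """Smallest and largest D compatible with the given allele frequencies."""
    p_a, p_b = 1 - p_A, 1 - p_B
    d_min = -min(p_A * p_B, p_a * p_b)   # keeps p_AB and p_ab non-negative
    d_max = min(p_A * p_b, p_a * p_B)    # keeps p_Ab and p_aB non-negative
    return d_min, d_max


def d_prime(D, p_A, p_B):
    """Scale D by the most extreme value it could take with the same sign."""
    if D == 0:
        return 0.0
    d_min, d_max = d_bounds(p_A, p_B)
    limit = -d_min if D < 0 else d_max
    if limit == 0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return D / limit


# The two allele-frequency pairs posed in the range question.
range_equal = d_bounds(0.3, 0.3)     # approximately (-0.09, 0.21)
range_unequal = d_bounds(0.2, 0.1)   # approximately (-0.02, 0.08)
```

The bounds are asymmetric around zero unless the allele frequencies happen to balance, which is one reason raw D values are hard to compare between markers.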
More on D'
- Pluses:
  - D' = 1 or D' = -1 means no evidence for recombination between the markers
  - If allele frequencies are similar, high D' means the markers are good surrogates for each other
- Minuses:
  - D' estimates inflated in small samples
  - D' estimates inflated when one allele is rare
Δ² (also called r²)
$$
\Delta^2 = \frac{D^2}{p_A (1 - p_A)\, p_B (1 - p_B)} = \frac{\chi^2}{2n}
$$
- Ranges between 0 and 1
  - 1 when the two markers provide identical information
  - 0 when they are in perfect equilibrium
- Expected value is 1/(2n)
More on r²
- r² = 1 implies the markers provide exactly the same information
- The measure preferred by population geneticists
- Measures loss in efficiency when marker A is replaced with marker B in an association study
  - With some simplifying assumptions (e.g. see Pritchard and Przeworski, 2001)
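The link between r² and the chi-square statistic can be illustrated with haplotype counts. This is a minimal sketch, assuming hypothetical counts on 100 chromosomes and a plain Pearson chi-square without continuity correction; the function names are illustrative.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r² (also written Δ²) computed from haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab    # number of chromosomes, 2n
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    D = n_AB / total - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r² is undefined when a locus is monomorphic")
    return D * D / denom


def pearson_chi_square(n_AB, n_Ab, n_aB, n_ab):
    """Pearson chi-square for independence in the 2 x 2 haplotype table."""
    total = n_AB + n_Ab + n_aB + n_ab
    row = (n_AB + n_Ab, n_aB + n_ab)     # counts of alleles A and a
    col = (n_AB + n_aB, n_Ab + n_ab)     # counts of alleles B and b
    observed = ((n_AB, n_Ab), (n_aB, n_ab))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts (an assumption for illustration, not lecture data).
counts = (50, 10, 10, 30)
r2 = r_squared(*counts)
chi2 = pearson_chi_square(*counts)
n_chromosomes = sum(counts)
```

For these counts r² is about 0.34 and the chi-square statistic is about 34.03, which equals r² multiplied by the number of chromosomes.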
When does linkage equilibrium hold?
Equilibrium or Disequilibrium?
- We will present a simple argument for why linkage equilibrium holds for most loci
- Balance of factors
  - Genetic drift (a function of population size)
  - Random mating
  - Distance between markers
Why Equilibrium is Reached
- Eventually, random mating and recombination should ensure that mutations spread from the original haplotype to all haplotypes in the population
- Simple argument:
  - Assume fixed allele frequencies over time
Generation t, Initial Configuration

        B                 b
A       $p_A p_B + D$     $p_A p_b - D$
a       $p_a p_B - D$     $p_a p_b + D$

- Assume arbitrary values for the allele frequencies and disequilibrium coefficient
Generation t+1, Without Recombination

        B                 b
A       $p_A p_B + D$     $p_A p_b - D$
a       $p_a p_B - D$     $p_a p_b + D$

- Haplotype frequencies remain stable over time
- Outcome has probability 1 - R
Generation t+1, With Recombination

        B             b
A       $p_A p_B$     $p_A p_b$
a       $p_a p_B$     $p_a p_b$

- Haplotype frequencies are a function of allele frequencies
- Outcome has probability R
Generation t+1, Overall

        B                       b
A       $p_A p_B + (1 - R) D$   $p_A p_b - (1 - R) D$
a       $p_a p_B - (1 - R) D$   $p_a p_b + (1 - R) D$

- Disequilibrium decreases
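The overall table follows from weighting the two outcomes by their probabilities. As a sketch for the AB haplotype, writing $D_t$ for the disequilibrium coefficient in generation $t$ (a subscript added here for clarity) and holding allele frequencies fixed:

$$
\begin{aligned}
p_{AB}(t+1) &= (1 - R)\,(p_A p_B + D_t) + R\, p_A p_B \\
            &= p_A p_B + (1 - R)\, D_t \\
D_{t+1} &= p_{AB}(t+1) - p_A p_B = (1 - R)\, D_t
\end{aligned}
$$

The same weighting applied to the Ab and aB haplotypes gives $p_A p_b - (1 - R) D_t$ and $p_a p_B - (1 - R) D_t$, and the ab haplotype gives $p_a p_b + (1 - R) D_t$.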
Recombination Rate (R)
- Probability of an odd number of crossovers between two loci
- Proportion of time alleles from two different grandparents occur in the same gamete
- Increases with physical (base-pair) distance, but rate of increase varies across genome
Decay of D with Time
[Figure: disequilibrium (0 to 1) against generations (1 to 1000), with one decay curve for each of R = 0.0001, 0.001, 0.01, 0.1, and 0.5]
Predictions
- Disequilibrium will decay each generation
  - In a large population
- After t generations
$$
D_t = (1 - R)^t D_0
$$
- A better model should allow for changes in allele frequencies over time
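The prediction can be evaluated directly. A minimal sketch, assuming a large population with fixed allele frequencies as in the argument above; the function names and the half-life helper are illustrative additions, not part of the lecture.

```python
import math


def d_after(d0, R, t):
    """Disequilibrium after t generations, D_t = (1 - R)^t * D_0."""
    return d0 * (1 - R) ** t


def half_life(R):
    """Generations needed for D to fall to half of its starting value.

    Solves (1 - R)^t = 1/2 for t. R is a recombination fraction,
    so it is restricted here to 0 < R <= 0.5.
    """
    if not 0 < R <= 0.5:
        raise ValueError("R must satisfy 0 < R <= 0.5")
    return math.log(0.5) / math.log(1 - R)


# Decay from complete disequilibrium for the recombination rates in the plot.
decay_table = {
    R: [d_after(1.0, R, t) for t in (1, 10, 100, 1000)]
    for R in (0.0001, 0.001, 0.01, 0.1, 0.5)
}
```

Unlinked loci (R = 0.5) lose half of their disequilibrium every generation, whereas at R = 0.01 the half-life is roughly 69 generations.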
Linkage Equilibrium
- In a large random mating population, haplotype frequencies converge to a simple function of allele frequencies
Some Examples of Linkage Disequilibrium Data
Summary of Disequilibrium in the Genome
- How much disequilibrium is there?
- What are good predictors of disequilibrium?
Raw D' data from Chr22
[Figure: D' (0 to 1) against physical distance (0 to 1000 kb)]
Dawson et al, Nature, 2002
Raw Δ² data from Chr22
[Figure: r² (0 to 1) against physical distance (0 to 1000 kb)]
Dawson et al, Nature, 2002
Summarizing Disequilibrium
Comparing Populations: Empirical LD (chromosome 2, R²)
[Figure: LD statistic (0 to 0.6) against distance (0 to 250 kb), with curves for Han and Japanese, Yoruba, and CEPH]
- LD extends further in CEPH and the Han/Japanese than in the Yoruba
International HapMap Consortium, Nature, 2005
Variation in Linkage Disequilibrium Along The Genome
Comparing Genomic Regions
- Rather than compare curves directly, it is convenient to pick a summary for the decay curves
- One common summary is the distance at which the curve crosses a threshold of interest (say 0.50)
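One way to compute such a summary is to interpolate between the two sampled distances that bracket the threshold. This is a minimal sketch, assuming the decay curve is given as LD values at increasing distances and that linear interpolation is acceptable; the function name and the sample curve are hypothetical.

```python
def crossing_distance(distances_kb, ld_values, threshold=0.5):
    """Distance at which a decay curve first drops below the threshold.

    Uses linear interpolation between neighboring points. Returns None
    if the curve never crosses from at-or-above to below the threshold.
    """
    points = list(zip(distances_kb, ld_values))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= threshold > y1:
            return x0 + (y0 - threshold) * (x1 - x0) / (y0 - y1)
    return None


# Hypothetical decay curve (an assumption for illustration, not lecture data).
distances = [0, 10, 20, 40]
ld = [1.0, 0.7, 0.4, 0.2]
extent = crossing_distance(distances, ld, threshold=0.5)
```

Here the curve falls from 0.7 at 10 kb to 0.4 at 20 kb, so the interpolated crossing of 0.5 lies about two thirds of the way between them, near 16.7 kb.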
Extent of Linkage Disequilibrium
[Figure: half-life of R² (axis 0 to 12) for each of chromosomes 1 to 22]
- LD extends further in the larger chromosomes, which have lower recombination rates
But within each chromosome, there is still huge variability!
Extent of LD Across the Genome (in 1Mb Windows)
[Figure: histogram of the proportion of genome (0.00 to 0.10) against extent of LD (r², axis 0 to 50)]
- Average extent: 11.9 kb
- Median extent: 7.8 kb
- 10th percentile: 3.5 kb
- 90th percentile: 20.9 kb
Genomic Variation in LD (CEPH)
[Figure: expected r² at 30kb along the genome; bright red > 0.88, dark blue < 0.12]
Smith et al, Genome Research, 2005
Genomic Variation in LD (JPT + CHB)
[Figure: expected r² at 30kb along the genome; bright red > 0.88, dark blue < 0.12]
Smith et al, Genome Research, 2005
Genomic Variation in LD (YRI)
[Figure: expected r² at 10kb along the genome; bright red > 0.88, dark blue < 0.12]
Smith et al, Genome Research, 2005
Genomic Distribution of LD (YRI)
[Figure: expected r² at 30kb along the genome; bright red > 0.88, dark blue < 0.12]
Smith et al, Genome Research, 2005
Sequence Composition vs. LD (some selected comparisons)
Genome quartiles are defined using LD: Q1 is low LD, Q4 is high LD.

Feature                                   Genome   Q1     Q2     Q3     Q4     Trend
Basic Sequence Features
  GC Bases (%)                            40.8     43.5   41.0   39.6   39.0   Decreases with LD
  Bases in CpG Islands (%)                0.7      0.9    0.7    0.6    0.6    Decreases with LD
  Polymorphism (Π × 10,000)               10.1     11.9   10.6   9.6    8.3    Decreases with LD
Genes and Related Features
  Known Genes (per 1000 kb)               6.4      6.6    6.1    6.2    6.7    U shaped
  Genic Bases (Exon, Intron, UTR, %)      38.5     37.6   34.6   36.0   45.9   U shaped
  Coding Bases (%)                        1.2      1.1    1.0    1.1    1.4    U shaped
  Conserved Non-Coding Sequence (%)       1.4      1.6    1.5    1.3    1.1    Decreases with LD
Repeat Content
  Total Bases in Repeats (%)              47.9     44.2   46.4   48.6   52.3   Increases with LD
  Bases in LINE repeats (%)               20.9     16.5   19.9   22.4   24.9   Increases with LD
  Bases in SINE repeats (%)               13.6     14.7   13.1   12.6   14.0   U shaped

Smith et al, Genome Research, 2005
Gene Function in Regions of High and Low Disequilibrium

Annotated Gene Function (GO Term)         Genes   Low LD   High LD   χ²      P-value
All Swissprot Entries Examined            7520    2305     2045      -       -
DNA Metabolism                            366     74       139       35.37   <.0001
Immune Response                           622     232      94        34.36   <.0001
Cell Cycle                                493     119      177       26.24   <.0001
Protein Metabolism                        1193    318      375       23.33   <.0001
Organelle Organization and Biogenesis     444     107      152       19.65   <.0001
Intracellular Transport                   263     56       95        19.61   <.0001
Organogenesis                             805     294      162       16.49   0.00005
Cell Organization and Metabolism          545     138      178       16.43   0.00005
RNA Metabolism                            208     41       71        15.33   0.00009

Results from a comparison of the distribution of the 40 most common gene classifications in the Gene Ontology Database
Implications for Association Studies
Linkage Disequilibrium in Association Studies: Psoriasis and IL23 Example
- Multiple nearby SNPs show evidence for association with psoriasis
- The SNPs are all in linkage disequilibrium, so it is hard to pinpoint the causal SNP
Nair et al, Nature Genetics, 2009
Linkage Disequilibrium in Association Studies: Tag SNP Picking
- Many nearby SNPs will typically provide similar evidence for association
- To decrease genotyping costs, most association studies will examine selected tag SNPs in each region
- The most common tagging strategy focuses on pairwise r² between SNPs
Linkage Disequilibrium in Association Studies: Pairwise Tagging Algorithm
- Select an r² threshold, typically 0.5 or 0.8
  - SNPs with r² above threshold can serve as proxies for each other
- For each marker being considered, count the number of SNPs with r² above threshold
- Genotype the SNP with the largest number of pairwise proxies
- Remove that SNP and all the SNPs it tags from consideration
- Repeat the previous three steps until there are no more SNPs to pick or the genotyping budget is exhausted
Carlson et al, AJHG, 2004
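The steps above can be sketched as a short greedy loop. This is an illustrative sketch rather than the implementation of Carlson et al: it assumes a precomputed symmetric matrix of pairwise r² values, treats values at or above the threshold as proxies, and breaks ties by picking the earliest SNP. The function name, SNP names, and r² values are hypothetical.

```python
def greedy_tag_snps(names, r2, threshold=0.8, budget=None):
    """Pick tag SNPs greedily from a symmetric matrix of pairwise r² values.

    Returns a dict mapping each chosen tag SNP to the SNPs it tags.
    """
    remaining = set(range(len(names)))
    tags = {}

    def proxies(i):
        # SNPs still under consideration that SNP i can stand in for.
        return {j for j in remaining if j != i and r2[i][j] >= threshold}

    while remaining and (budget is None or len(tags) < budget):
        # Count proxies for every remaining SNP and take the best one.
        best = max(sorted(remaining), key=lambda i: len(proxies(i)))
        tagged = proxies(best)
        tags[names[best]] = sorted(names[j] for j in tagged)
        # Remove the tag SNP and everything it tags, then repeat.
        remaining -= tagged | {best}
    return tags


# Hypothetical r² matrix for five SNPs (an assumption for illustration).
snp_names = ["snp1", "snp2", "snp3", "snp4", "snp5"]
r2_matrix = [
    [1.00, 0.90, 0.85, 0.10, 0.10],
    [0.90, 1.00, 0.70, 0.10, 0.10],
    [0.85, 0.70, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.95],
    [0.10, 0.10, 0.10, 0.95, 1.00],
]
selected = greedy_tag_snps(snp_names, r2_matrix, threshold=0.8)
```

In this example snp1 has two proxies and is picked first, after which snp4 is picked to cover snp5, so two genotyped SNPs capture all five.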
Potential Number of tag SNPs
- Current tag SNP panels typically examine 500,000 - 1,000,000 SNPs for a cost of $200 - $500 per sample
- It is anticipated that future panels will allow for as many as 5 million SNPs
The International HapMap Consortium, Nature, 2007
Today
- Basic descriptors of linkage disequilibrium
- Learn when linkage disequilibrium is expected to hold (or not!)
Additional Reading I
- Dawson E et al (2002). A first-generation linkage disequilibrium map of human chromosome 22. Nature 418:544-548
- The International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437:1299-1320
- Carlson CS et al (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 74:106-120
Additional Reading II
- Cardon and Bell (2001). Association study designs for complex diseases. Nature Reviews Genetics 2:91-99
  - Surveys important issues in analyzing population data
  - Illustrates shift from focus on linkage to association mapping for complex traits