Linkage Disequilibrium. Biostatistics 666

Similar documents
Genes in Populations: Hardy Weinberg Equilibrium. Biostatistics 666

Genotypes, Phenotypes and Hardy Weinberg Equilibrium. Biostatistics 666 Lecture II

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Linkage Disequilibrium. Adele Crane & Angela Taravella

Analysis of genome-wide genotype data

Haplotypes, linkage disequilibrium, and the HapMap

Algorithms for Genetics: Introduction, and sources of variation

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Genetic Association Analysis Using Data from Triads and Unrelated Subjects

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

The Whole Genome TagSNP Selection and Transferability Among HapMap Populations. Reedik Magi, Lauris Kaplinski, and Maido Remm

Course Announcements

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

5/18/2017. Genotypic, phenotypic or allelic frequencies each sum to 1. Changes in allele frequencies determine gene pool composition over generations

Office Hours. We will try to find a time

B) You can conclude that A 1 is identical by descent. Notice that A2 had to come from the father (and therefore, A1 is maternal in both cases).

Understanding genetic association studies. Peter Kamerman

Association studies (Linkage disequilibrium)

The Firm and the Market

THE FIRM AND THE MARKET

LD Mapping and the Coalescent

Supplementary Note: Detecting population structure in rare variant data

The Firm and the Market

Human linkage analysis. fundamental concepts

The Human Genome Project has always been something of a misnomer, implying the existence of a single human genome

Genome-Wide Association Studies (GWAS): Computational Them

Human linkage analysis. fundamental concepts

Division Ave. High School AP Biology

Nature Genetics: doi: /ng Supplementary Figure 1. Neighbor-joining tree of the 183 wild, cultivated, and weedy rice accessions.

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

Population stratification. Background & PLINK practical

Gregor Mendel. Genetics & The Work of Mendel. Mendel s work. Terminology. " Heritable features that vary among individuals are called characters.

Three-dimensional design against fatigue failure and the implementation of a genetic algorithm

Why do we need statistics to study genetics and evolution?

Basic Concepts of Human Genetics

Computational Workflows for Genome-Wide Association Study: I

Basic Concepts of Human Genetics

ARTICLE Maximizing the Power of Principal-Component Analysis of Correlated Phenotypes in Genome-wide Association Studies

The Evolution of Populations

Identifying Genes Underlying QTLs

Statistical Methods for Quantitative Trait Loci (QTL) Mapping

AP Biology. Gregor Mendel. Chapter 14. Mendel & Genetics. Mendel s work. What did Mendel s findings mean? Looking closer at Mendel s work

Bioinformatic Analysis of SNP Data for Genetic Association Studies EPI573

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat

Genome-wide association studies (GWAS) Part 1

TEST FORM A. 2. Based on current estimates of mutation rate, how many mutations in protein encoding genes are typical for each human?

Mendel s work. AP Biology. Genetics & The Work of Mendel. What did Mendel s findings mean? Looking closer at Mendel s work

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus.

An introduction to genetics and molecular biology

Genotyping Technology How to Analyze Your Own Genome Fall 2013

Population Genetics II. Bio

On the Power to Detect SNP/Phenotype Association in Candidate Quantitative Trait Loci Genomic Regions: A Simulation Study

Human Genetics and Gene Mapping of Complex Traits

Genetic data concepts and tests

Take Home Message. Molecular Imaging Genomics. How to do Genetics. Questions for the Study of. Two Common Methods for Gene Localization

The HapMap Project and Haploview

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Multistage Cross-Sell Model of Employers in the Financial Industry

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

Two-locus models. Two-locus models. Two-locus models. Two-locus models. Consider two loci, A and B, each with two alleles:

Familial Breast Cancer

Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift

Introduction to Population Genetics. Spezielle Statistik in der Biomedizin WS 2014/15

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Population Genetics Sequence Diversity Molecular Evolution. Physiology Quantitative Traits Human Diseases

Genetic Drift and Natural Selection:

PERSPECTIVES. A gene-centric approach to genome-wide association studies

Applied Bioinformatics

PP x pp. Looking closer at Mendel s work. Gregor Mendel a monk named Gregor Mendel documented inheritance in peas in the mid-1800s

Cross Haplotype Sharing Statistic: Haplotype length based method for whole genome association testing

Supplementary Material online Population genomics in Bacteria: A case study of Staphylococcus aureus

SNPs - GWAS - eqtls. Sebastian Schmeier

POPULATION GENETICS studies the genetic. It includes the study of forces that induce evolution (the

COMPUTER SIMULATIONS AND PROBLEMS

Evolutionary Mechanisms

ABSTRACT STRATEGIES IN OLIGOPOLIES. This dissertation is part of the effort to contribute to our understanding of Price

Summary. Introduction

Lecture 19: Hitchhiking and selective sweeps. Bruce Walsh lecture notes Synbreed course version 8 July 2013

Efficient Association Study Design Via Power-Optimized Tag SNP Selection

What is genetic variation?

Human Genetics and Gene Mapping of Complex Traits

Using the Association Workflow in Partek Genomics Suite

Linkage Disequilibrium

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

HST.161 Molecular Biology and Genetics in Modern Medicine Fall 2007

Permanent City Research Online URL:

7-1. Read this exercise before you come to the laboratory. Review the lecture notes from October 15 (Hardy-Weinberg Equilibrium)

Petar Pajic 1 *, Yen Lung Lin 1 *, Duo Xu 1, Omer Gokcumen 1 Department of Biological Sciences, University at Buffalo, Buffalo, NY.

Resources at HapMap.Org

Genetics Effective Use of New and Existing Methods

Structure, Measurement & Analysis of Genetic Variation

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

PopGen1: Introduction to population genetics

An Introduction to Population Genetics

Concepts: What are RFLPs and how do they act like genetic marker loci?

Planning and Design of Flex-Route Transit Services

Whole Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. & Mario P. L. Calus ᶲ

Variation Chapter 9 10/6/2014. Some terms. Variation in phenotype can be due to genes AND environment: Is variation genetic, environmental, or both?

Transcription:

Linkage Disequilibrium iostatistics 666

Logistics: Office Hours Office hours on Mondays at 4 m. Room 4614 School of Public Health Tower

Previously asic roerties of a locus llele Frequencies Genotye Frequencies Hardy-Weinberg Equilibrium Relationshi between allele and genotye frequencies that holds for most genetic markers Exact Tests for Hardy-Weinberg Equilibrium

Today We ll consider roerties of airs of alleles Halotye frequencies Linkage equilibrium Linkage disequilibrium

Let s consider the history of two neighboring alleles

lleles that exist today arose through ancient mutation events efore Mutation C G T G T C G T C G T G T C G C G fter Mutation C G T G T C G T C G T G T C G C G C G T C G T C Mutation G T C G T G T C G C G

lleles that exist today arose through ancient mutation events efore Mutation fter Mutation C Mutation

One allele arose first, and then the other efore Mutation G C G fter Mutation G C G C C Mutation

Recombination generates new arrangements for ancestral alleles C C efore Recombination G G C fter Recombination G C G C C C Recombinant Halotye

Linkage Disequilibrium Chromosomes are mosaics Extent and conservation of mosaic ieces deends on Recombination rate Mutation rate Poulation size Natural selection ncestor Present-day Combinations of alleles at very close markers reflect ancestral halotyes

Why is linkage disequilibrium imortant for gene maing?

ssociation Studies and Linkage Disequilibrium If all olymorhisms were indeendent at the oulation level, association studies would have to examine every one of them Linkage disequilibrium makes tightly linked variants strongly correlated roducing cost savings for association studies

Tagging SNPs In a tyical short chromosome segment, there are only a few distinct halotyes Carefully selected SNPs can determine status of other SNPs Halo1 T T T Halo2 T T T Halo3 T T T Halo4 T T T 30% 20% 20% 20% Halo5 T T T 10%

asic Descritors of Linkage Disequilibrium

Commonly Used Descritors Halotye Frequencies The frequency of each tye of chromosome Contain all the information rovided by other summary measures Commonly used summaries D D r 2 or Δ 2

Halotye Frequencies Locus b Totals Locus a a b ab a Totals b 1.0

Linkage Equilibrium Exected for Distant Loci ) )(1 (1 ) (1 ) (1 b a ab a a b b = = = = = = =

Linkage Disequilibrium Exected for Nearby Loci ) )(1 (1 ) (1 ) (1 b a ab a a b b = = =

Disequilibrium Coefficient D b a ab a a b b D D D D D + = = = + = =

D is hard to interret Sign is arbitrary common convention is to set, to be the common allele and a, b to be the rare allele Range deends on allele frequencies Hard to comare between markers

What is the range of D? What are the maximum and minimum ossible values of D when = 0.3 and = 0.3 = 0.2 and = 0.1 Can you derive a general formula for this range?

D scaled version of D D' D min( = D min( b,, a a b ) ) D D < 0 > 0 Ranges between 1 and +1 More likely to take extreme values when allele frequencies are small ±1 imlies at least one of the observed halotyes was not observed

More on D Pluses: D = 1 or D = -1 means no evidence for recombination between the markers If allele frequencies are similar, high D means the markers are good surrogates for each other Minuses: D estimates inflated in small samles D estimates inflated when one allele is rare

² (also called r 2 ) ² = = χ ² 2n (1 2 D ) (1 Ranges between 0 and 1 1 when the two markers rovide identical information 0 when they are in erfect equilibrium Exected value is 1/2n )

More on r 2 r 2 = 1 imlies the markers rovide exactly the same information The measure referred by oulation geneticists Measures loss in efficiency when marker is relaced with marker in an association study With some simlifying assumtions (e.g. see Pritchard and Przeworski, 2001)

When does linkage equilibrium hold?

Equilibrium or Disequilibrium? We will resent simle argument for why linkage equilibrium holds for most loci alance of factors Genetic drift (a function of oulation size) Random mating Distance between markers

Why Equilibrium is Reached Eventually, random mating and recombination should ensure that mutations sread from original halotye to all halotyes in the oulation Simle argument: ssume fixed allele frequencies over time

Generation t, Initial Configuration b a b a a b D D a D D b + + ssume arbitrary values for the allele frequencies and disequilibrium coefficient

Generation t+1, Without Recombination b a b a a b D D a D D b + + Halotye Frequencies Remain Stable Over Time Outcome has robability 1- R

Generation t+1, With Recombination b b a b a b a b Halotye Frequencies re Function of llele Frequencies Outcome has robability R

Generation t+1, Overall b a b a b b D r D r a D r D r b ) (1 ) (1 ) (1 ) (1 + + Disequilibrium Decreases

Recombination Rate (R) Probability of an odd number of crossovers between two loci Proortion of time alleles from two different grand-arents occur in the same gamete Increases with hysical (base-air) distance, but rate of increase varies across genome

Decay of D with Time 1.000 0.900 0.800 Ө = 0.0001 R = 0.0001 Disequilibrium 0.700 0.600 0.500 0.400 0.300 Ө = 0.1 R = 0.1 R = 0.01 Ө = 0.01 Ө = 0.001 R = 0.001 0.200 0.100 R = 0.5 Ө = 0.5 0.000 1 10 100 1000 Generations

Predictions Disequilibrium will decay each generation In a large oulation fter t generations D t = (1-R) t D 0 better model should allow for changes in allele frequencies over time

Linkage Equilibrium In a large random mating oulation halotye frequencies converge to a simle function of allele frequencies

Some Examles of Linkage Disequilibrium Data

Summary of Disequilibrium in the Genome How much disequilibrium is there? What are good redictors of disequilibrium?

Raw D data from Chr22 1.00 0.80 0.60 D' 0.40 0.20 0.00 0 200 400 600 800 1000 Physical Distance (kb) Dawson et al, Nature, 2002

Raw ² data from Chr22 1.00 0.80 0.60 r 2 0.40 0.20 0.00 0 200 400 600 800 1000 Physical Distance (kb) Dawson et al, Nature, 2002

Summarizing Disequilibrium

Comaring Emirical of LD Poulations (chromosome 2, R²) 0.6 0.4 Statistic Han, Jaanese Yoruba CEPH 0.2 0 0 50 100 150 200 250 Distance (kb) LD extends further in CEPH and the Han/Jaanese than in the Yoruba International HaMa Consortium, Nature, 2005

Variation in Linkage Disequilibrium long The Genome

Comaring Genomic Regions Rather than comare curves directly, it is convenient to a ick a summary for the decay curves One common summary is the distance at which the curve crosses a threshold of interest (say 0.50)

Extent of Linkage Disequilibrium 12 Half-Life of R² 8 4 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Chromosome LD extends further in the larger chromosomes, which have lower recombination rates

ut within each chromosome, there is still huge variability!

Proortion of Genome 0.00 0.02 0.04 0.06 0.08 0.10 Extent of LD Extent of LD cross the Genome (in 1Mb Windows) 0 10 20 30 40 50 verage Extent: Median Extent: 10 th ercentile: 90 th ercentile: Extent of LD (r²) 11.9 kb 7.8 kb 3.5 kb 20.9 kb

Genomic Variation in LD (CEPH) Exected r 2 at 30kb right Red > 0.88 Dark lue < 0.12 Smith et al, Genome Research, 2005

Genomic Variation in LD (JPT + CH) Exected r 2 at 30kb right Red > 0.88 Dark lue < 0.12 Smith et al, Genome Research, 2005

Genomic Variation in LD (YRI) Exected r 2 at 10kb right Red > 0.88 Dark lue < 0.12 Smith et al, Genome Research, 2005

Genomic Distribution of LD (YRI) Exected r 2 at 30kb right Red > 0.88 Dark lue < 0.12 Smith et al, Genome Research, 2005

Sequence Comosition vs. LD (some selected comarisons) Genome Quartiles, Defined Using LD (Low LD) (High LD) Genome Q1 Q2 Q3 Q4 Trend asic Sequence Features GC ases (%) 40.8 43.5 41.0 39.6 39.0 Decreases With LD ases in CG Islands (%) 0.7 0.9 0.7 0.6 0.6 Decreases With LD Polymorhism ( Π 10,000 ) 10.1 11.9 10.6 9.6 8.3 Decreases With LD Genes and Related Features Known Genes (er 1000 kb) 6.4 6.6 6.1 6.2 6.7 U shaed Genic ases (Exon, Intron, UTR, %) 38.5 37.6 34.6 36.0 45.9 U shaed Coding ases (%) 1.2 1.1 1.0 1.1 1.4 U shaed Conserved Non-Coding Sequence (%) 1.4 1.6 1.5 1.3 1.1 Decreases with LD Reeat Content Total ases in Reeats (%) 47.9 44.2 46.4 48.6 52.3 Increases with LD ases in LINE reeats (%) 20.9 16.5 19.9 22.4 24.9 Increases with LD ases in SINE reeats (%) 13.6 14.7 13.1 12.6 14.0 U shaed Smith et al, Genome Research, 2005

Gene Function in Regions of High and Low Disequilibrium nnotated Gene Function (GO Term) Genes Low LD High LD χ² P-value ll Swissrot Entries Examined 7520 2305 2045 - - DN metabolism 366 74 139 35.37 <.0001 Immune Resonse 622 232 94 34.36 <.0001 Cell cycle 493 119 177 26.24 <.0001 Protein Metabolism 1193 318 375 23.33 <.0001 Organelle Organization and iogenesis 444 107 152 19.65 <.0001 Intracellular Transort 263 56 95 19.61 <.0001 Organogenesis 805 294 162 16.49 0.00005 Cell Organization and Metabolism 545 138 178 16.43 0.00005 RN Metabollism 208 41 71 15.33 0.00009 Results from a comarison of the distribution of 40 most common gene classifications in the GENE Ontology Database

Imlications for ssociation Studies

Linkage Disequilibrium in ssociation Studies: Psoriasis and IL23 Examle Multile nearby SNPs show evidence for association with soriasis. The SNPs are all in linkage disequilibrium, so it is hard to inoint causal SNP. Nair et al, Nature Genetics, 2009

Linkage Disequilibrium in ssociation Studies: Tag SNP Picking Many nearby SNPs will tyically rovide similar evidence for association To decrease genotying costs, most association studies will examine selected tag SNPs in each region The most common tagging strategy focuses on airwise r 2 between SNPs

Linkage Disequilibrium in ssociation Studies: Pairwise Tagging lgorithm Select an r 2 threshold, tyically 0.5 or 0.8 SNPs with r 2 above threshold can serve as roxies for each other For each marker being considered, count the number of SNPs with r 2 above threshold Genotye SNP with the largest number of airwise roxies Remove SNP and all the SNPs it tags from consideration Reeat the revious three stes until there are no more SNPs to ick or genotying budget is exhausted Carlson et al, JHG, 2004

Potential Number of tag SNPs Current tag SNP anels tyically examine 500,000 1,000,000 SNPs for a cost of $200 - $500 er samle. It is anticiated that future anels will allow for as many as 5 million SNPs. The International HaMa Consortium, Nature, 2007

Today asic descritors of linkage disequilibrium Learn when linkage disequilibrium is exected to hold (or not!)

dditional Reading I Dawson E et al (2002). first-generation linkage disequilibrium ma of human chromosome 22. Nature 418:544-548 The International HaMa Consortium. (2005). halotye ma of the human genome. Nature 437:1299-320 Carlson CS et al (2004). Selecting a maximally informative set of single-nucleotide olymorhisms for association analyses using linkage disequilibrium. m J Hum Genet 74:106-120

dditional Reading II Cardon and ell (2001) ssociation study designs for comlex diseases. Nature Reviews Genetics 2:91-99 Surveys imortant issues in analyzing oulation data. Illustrates shift from focus on linkage to association maing for comlex traits.