EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

1 EPIB 668 Genetic association studies Aurélie LABBE - Winter 2011 1 / 71

2 OUTLINE Linkage vs association Linkage disequilibrium Case control studies Family-based association 2 / 71

3 RECAP ON GENETIC VARIANTS 3 / 71

4 DNA variation More than 99.9% of the sequence is identical between any two chromosomes. Even so, because the genome sequence is so long (about 3 billion base pairs), there are still many variations. Some DNA variations are responsible for biological changes, others have no known function. Alleles are the alternative forms of a DNA segment at a given genetic location. Genetic polymorphism: DNA segment with at least 2 common alleles. 4 / 71

5 SNPs SNPs - DNA sequence variations that occur when a single nucleotide is altered Alleles at this SNP are G and T SNPs are the most common form of variation in the human genome SNPs are catalogued in several databases 5 / 71

6 Genotype and haplotype Genotype: pair of alleles (one paternal, one maternal) at a locus Genotype for this individual is GT Haplotype: sequence of alleles along a single chromosome Genotypes for this individual (vertical) : CA and TT Haplotypes (horizontal): CT and AT 6 / 71

7 LINKAGE VERSUS ASSOCIATION 7 / 71

8 Linkage vs association LINKAGE: It is possible to encounter a situation where an allele cosegregates with the disease within a family, whereas another allele cosegregates with the disease in another family. This is because the linked marker is NOT causal, but it is LINKED with the causal marker. ASSOCIATION: We aim to demonstrate the presence of an excess of a given allele in affected subjects, compared to a control group. 8 / 71

9 Linkage vs association [Figure: two panels, LINKAGE WITHOUT ASSOCIATION (CASE 1) and LINKAGE WITH ASSOCIATION (CASE 2).] 9 / 71

10 Linkage vs association [Figure: magnitude of effect plotted against frequency in population.] Rare, high-penetrance mutations: use linkage. Common, low-penetrance variants: use association. 10 / 71

11 Goals of association studies Short-term Goal: Identify genetic variants that explain differences in phenotype among individuals in a study population Qualitative (disease status, presence/absence of congenital defect) or quantitative (blood glucose levels, % body fat...) If association found, then further study can follow to Understand mechanism of action and disease etiology in individuals Characterize relevance and/or impact in more general population Long-term goal: to inform process of identifying and delivering better prevention and treatment strategies 11 / 71

12 LINKAGE DISEQUILIBRIUM 12 / 71

13 Linkage disequilibrium Non-random association of alleles at two or more loci, not necessarily on the same chromosome. It is not the same as linkage, which describes the association of two or more loci on a chromosome with limited recombination between them 13 / 71

14 Linkage disequilibrium Alleles that exist today arose through ancient mutation events. [Figure: before the mutation, only allele A is present; after the mutation, alleles A and C are both present.] 14 / 71

15 Linkage disequilibrium One mutation at one locus arose first, and then the other at another locus... [Figure: before the second mutation, haplotypes A-G and C-G; after it, haplotypes A-G, C-G and C-C.] 15 / 71

16 Linkage disequilibrium If the two loci are far apart, recombination generates new arrangements for ancestral alleles. [Figure: before recombination, haplotypes A-G, C-G and C-C; after recombination, the recombinant haplotype A-C also appears.] 16 / 71

17 Linkage disequilibrium If the two loci are close to each other, recombination does NOT generate new arrangements for ancestral alleles. Here, the A allele at the first locus is always associated with a G allele at the second locus. [Figure: haplotypes A-G, C-G and C-C are present both before and after recombination.] 17 / 71

18 Linkage disequilibrium Chromosomes are mosaics. The extent and conservation of mosaic pieces depend on recombination rate, mutation rate, population size, and natural selection. Combinations of alleles at very close markers reflect ancestral haplotypes. [Figure: ancestor chromosomes versus present-day chromosomes.] 18 / 71

19 LD: summary In general, LD between two SNPs decreases with physical distance Extent of LD varies greatly depending on region of genome If LD is strong, need fewer SNPs to capture variation in a region 19 / 71

20 How do we measure LD? Compute haplotype frequencies at 2 loci. Here, C never shows with G: association between alleles at separate loci due to their proximity on the same chromosome. [Table: haplotype frequencies, SNP1 (rows C, T, Total) by SNP2 (columns A, G, Total).] 20 / 71

21 How do we measure LD? Under equilibrium, $P(AC) = P(C)P(A)$ (independence between alleles at separate loci). Under LD, $P(AC) \neq P(C)P(A)$. Here, $0.66 \neq (0.66)(0.79)$. [Table: haplotype frequencies, SNP1 (rows C, T, Total) by SNP2 (columns A, G, Total).] One measure of LD is $D = P(AC) - P(A)P(C)$. 21 / 71
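The definition of D above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `ld_measures` is hypothetical, and D' and r² are two standard normalizations of D that the slide does not mention. The input frequencies are the ones quoted on the slide (P(AC) = 0.66, P(C) = 0.66, P(A) = 0.79).

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab     : frequency of the haplotype carrying allele a (locus 1) and b (locus 2)
    p_a, p_b : marginal frequencies of alleles a and b (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # D = P(ab) - P(a)P(b)
    # Largest |D| that is possible given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Frequencies quoted on the slide: P(AC) = 0.66, P(A) = 0.79, P(C) = 0.66
d, d_prime, r2 = ld_measures(0.66, 0.79, 0.66)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

With these inputs D is about 0.14 and D' equals 1, which matches the slide's remark that C never appears with G: one of the four possible haplotypes is absent.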

22 LD and association Q: Why is linkage disequilibrium important for gene mapping? If all polymorphisms were independent at the population level, association studies would have to examine every one of them... Linkage disequilibrium makes tightly linked variants strongly correlated producing cost savings for association studies 22 / 71

23 Tagging SNPs Tagging makes it possible to analyze a small, well-chosen set of SNPs. In a typical short chromosome segment, there are only a few distinct haplotypes. Carefully selected SNPs can determine the status of other SNPs. 23 / 71
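One common way to pick tag SNPs is a greedy search on pairwise r² values. The sketch below is an assumption about how this could be done, not the method used in the course: the function name `greedy_tag_snps`, the r² threshold of 0.8, and the toy r² values are all hypothetical.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tag-SNP selection.

    r2 : dict mapping each SNP name to a dict of pairwise r^2 values with other SNPs.
    A SNP is 'tagged' once a selected tag SNP has r^2 >= threshold with it
    (a selected SNP always tags itself).
    """
    untagged = set(r2)
    tags = []
    while untagged:
        def covered(snp):
            # Untagged SNPs that this candidate would capture
            return {t for t in untagged
                    if t == snp or r2[snp].get(t, 0.0) >= threshold}
        # Pick the candidate capturing the most untagged SNPs (ties: alphabetical)
        best = max(sorted(r2), key=lambda s: len(covered(s)))
        tags.append(best)
        untagged -= covered(best)
    return tags


# Toy r^2 values (hypothetical): snp1-snp3 form one LD block, snp4 is independent
r2 = {
    "snp1": {"snp2": 0.95, "snp3": 0.90, "snp4": 0.10},
    "snp2": {"snp1": 0.95, "snp3": 0.85, "snp4": 0.05},
    "snp3": {"snp1": 0.90, "snp2": 0.85, "snp4": 0.10},
    "snp4": {"snp1": 0.10, "snp2": 0.05, "snp3": 0.10},
}
tags = greedy_tag_snps(r2)
print(tags)
```

In this toy example two tag SNPs are enough to capture all four SNPs, which is the cost saving the slide describes.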

24 Tagging SNPs: the HapMap project Goal is to compare the genome sequences of several individuals in order to identify regions where we observe genetic variations. Multi-country effort to identify and catalog common human genetic variants. Developed to better understand and catalogue LD patterns across the genome in several populations. Genotyped 4 million SNPs on samples of African, East Asian, and European ancestry. All genotype data are in a publicly available database and can be downloaded. 24 / 71

25 Tagging SNPs: the HapMap project Able to examine LD patterns across the genome. Can estimate approximate coverage of a given SNP chip. Can represent 80-90% of common SNPs with 300,000 tag SNPs for European or Asian samples, and with 500,000 tag SNPs for African samples. 25 / 71

26 HapMap 26 / 71

27 Genetic rationale for association studies Assuming that one locus is the disease susceptibility locus (DSL), which is unknown Finding an allele at the marker locus with a higher frequency in cases (assumed to have disease susceptibility allele at DSL) than in controls suggests: that there is LD between the DSL and the marker, i.e. that the marker locus is close to the DSL OR that the marker locus IS the causal locus OR that the marker locus is a false positive... This is the association we are talking about. 27 / 71

28 Genetic rationale for association studies 28 / 71

29 ASSOCIATION STUDIES: INTRODUCTION 29 / 71

30 Association: context Association Search for genetic risk factors of a disease Using candidate markers Using candidate genes In a candidate region (from linkage study): fine mapping On the whole genome (GWAs) 30 / 71

31 Types of design Cohort studies Case-control studies Familial studies The choice is open, depending on the objectives and budget/time constraints. Each design is subject to different types of bias. 31 / 71

32 Common approach to all designs Analysis of the current literature Generate new hypotheses Avoid systematic errors (selection bias, measurement bias, etc.) Collect groups of individuals and collect data in order to compare the groups Assess whether there is a statistical link between the genetic exposure and the occurrence of the disease Quantify this link by an index Estimate its accuracy Discuss the causal interpretation of the findings 32 / 71

33 ASSOCIATION STUDIES: COHORT DESIGN 33 / 71

34 Cohort studies Follow a population (sample) over time Here, the exposure is the genetic exposure to a specific allele or genotype (i.e. having this allele or genotype) 34 / 71

35 Genotype association test: one SNP
We consider one SNP, with alleles 1 and 2 (or A and T)

SNP genotype    1/1    1/2    2/2    Total
Disease +       n_0    n_1    n_2    n
Disease -       m_0    m_1    m_2    m

We test the association between the rows and columns of the table above
Identify a reference genotype (usually the most frequent one, e.g. 2/2)
Compute the Relative Risks (RR):
$P(\text{Disease}+ \mid 1/1) \,/\, P(\text{Disease}+ \mid 2/2)$
$P(\text{Disease}+ \mid 1/2) \,/\, P(\text{Disease}+ \mid 2/2)$
35 / 71
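As a rough illustration of the relative-risk computation for a cohort design, the sketch below estimates each genotype's risk as cases divided by all individuals with that genotype, then divides by the risk in the reference genotype. The function name `genotype_relative_risks` and all counts are hypothetical.

```python
def genotype_relative_risks(diseased, healthy, reference="2/2"):
    """Relative risk of disease for each genotype versus the reference genotype.

    diseased, healthy : dicts mapping genotype -> number of individuals
                        (the n_i and m_i of the table).
    """
    # Risk per genotype: P(Disease + | genotype) = n_i / (n_i + m_i)
    risk = {g: diseased[g] / (diseased[g] + healthy[g]) for g in diseased}
    return {g: risk[g] / risk[reference] for g in risk if g != reference}


# Hypothetical cohort counts
diseased = {"1/1": 30, "1/2": 60, "2/2": 60}
healthy = {"1/1": 70, "1/2": 240, "2/2": 540}
rr = genotype_relative_risks(diseased, healthy)
for genotype, value in rr.items():
    print(f"RR({genotype} vs 2/2) = {value:.2f}")
```

Here the risks are 0.30, 0.20 and 0.10 for 1/1, 1/2 and 2/2, so the relative risks are about 3 and 2.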

36 Allelic association test: one SNP
We consider one SNP, with alleles 1 and 2 (or A and T)

Number of alleles    1      2      Total
Disease +            n_1    n_2    n
Disease -            m_1    m_2    m

We test the association between the rows and columns of the table above
Identify a reference allele (usually the most frequent one, e.g. 2)
Compute the Relative Risk (RR):
$P(\text{Disease}+ \mid 1) \,/\, P(\text{Disease}+ \mid 2)$
36 / 71

37 ASSOCIATION STUDIES: CASE-CONTROL DESIGN 37 / 71

38 Case-control studies Identify a group of cases and a group of controls Again, the exposure is the genetic exposure to a specific allele or genotype (i.e. having this allele or genotype) 38 / 71

39 Genotype association test: one SNP
We consider one SNP, with alleles 1 and 2 (or A and T)

SNP genotype    1/1    1/2    2/2    Total
Disease +       n_0    n_1    n_2    n
Disease -       m_0    m_1    m_2    m

We test the association between the rows and columns of the table above
Identify a reference genotype (usually the most frequent one, e.g. 2/2)
Compute the Odds Ratios (OR) with respect to genotype 2/2:
$\mathrm{Odds}(1/1 \mid \text{Case}) \,/\, \mathrm{Odds}(1/1 \mid \text{Control}) = (n_0/n_2) \,/\, (m_0/m_2)$
$\mathrm{Odds}(1/2 \mid \text{Case}) \,/\, \mathrm{Odds}(1/2 \mid \text{Control}) = (n_1/n_2) \,/\, (m_1/m_2)$
39 / 71

40 Allelic association test: one SNP
We consider one SNP, with alleles 1 and 2 (or A and T)

Number of alleles    1      2      Total
Disease +            n_1    n_2    n
Disease -            m_1    m_2    m

We test the association between the rows and columns of the table above
Identify a reference allele (usually the most frequent one, e.g. 2)
Compute the Odds Ratio (OR) with respect to allele 2:
$\mathrm{Odds}(1 \mid \text{Case}) \,/\, \mathrm{Odds}(1 \mid \text{Control}) = (n_1/n_2) \,/\, (m_1/m_2)$
40 / 71
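The case-control odds ratios and the allelic test can be sketched as follows. This is a minimal example under stated assumptions: the helper names and the genotype counts are hypothetical, the test is the ordinary Pearson chi-square on the 2x2 allele table (1 degree of freedom, no continuity correction), and treating alleles as independent observations implicitly assumes Hardy-Weinberg equilibrium.

```python
import math


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed / reference counts in cases;
    c, d = exposed / reference counts in controls."""
    return (a * d) / (b * c)


def allele_counts(n11, n12, n22):
    """Convert genotype counts (1/1, 1/2, 2/2) into allele counts (1, 2)."""
    return 2 * n11 + n12, n12 + 2 * n22


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df) and p-value for a 2x2 table."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi2 with 1 df
    return stat, p_value


# Hypothetical genotype counts (1/1, 1/2, 2/2)
cases = (20, 80, 100)
controls = (10, 60, 130)

# Genotype odds ratios, reference genotype 2/2
or_11 = odds_ratio(cases[0], cases[2], controls[0], controls[2])
or_12 = odds_ratio(cases[1], cases[2], controls[1], controls[2])

# Allelic odds ratio and test, reference allele 2
a1, a2 = allele_counts(*cases)
c1, c2 = allele_counts(*controls)
or_allele = odds_ratio(a1, a2, c1, c2)
stat, p_value = chi2_2x2(a1, a2, c1, c2)
print(f"OR(1/1) = {or_11:.2f}, OR(1/2) = {or_12:.2f}, allelic OR = {or_allele:.2f}")
print(f"allelic chi2 = {stat:.2f}, p = {p_value:.4f}")
```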

41 Extensions Can consider quantitative phenotype Can account for the effect of covariates, such as Age, sex, etc... Can handle several SNPs (if they are not in LD) 41 / 71

42 Potential bias: population stratification In the example below, the distribution of genotypes differs between cases and controls We might conclude that allele A (or genotype AA) is related to disease 42 / 71

43 Potential bias: population stratification If cases and controls are not well matched ancestrally: We may see an unequal distribution of non-disease-related alleles between cases and controls. Any allele that is more common in a population with increased risk of disease may appear to be associated with disease. 43 / 71

45 Potential bias: population stratification Unequal distribution of alleles may result from: Sample made up of more than one distinct population Sample made up of individuals with differing levels of admixture 45 / 71
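A small numerical sketch may help show how stratification creates a spurious association. All numbers below are invented for illustration: two populations of equal size differ in both carrier frequency and disease risk, and within each population the allele has no effect on disease.

```python
def expected_counts(n, carrier_freq, disease_risk):
    """Expected (case carriers, case non-carriers, control carriers, control non-carriers)
    when disease risk is identical for carriers and non-carriers."""
    carriers = n * carrier_freq
    non_carriers = n - carriers
    return (carriers * disease_risk, non_carriers * disease_risk,
            carriers * (1 - disease_risk), non_carriers * (1 - disease_risk))


def odds_ratio(table):
    case_c, case_n, ctrl_c, ctrl_n = table
    return (case_c * ctrl_n) / (case_n * ctrl_c)


# Hypothetical populations: population 1 has both a higher carrier frequency
# and a higher disease risk than population 2
pop1 = expected_counts(10_000, carrier_freq=0.8, disease_risk=0.10)
pop2 = expected_counts(10_000, carrier_freq=0.2, disease_risk=0.02)
pooled = tuple(x + y for x, y in zip(pop1, pop2))

or_pop1 = odds_ratio(pop1)
or_pop2 = odds_ratio(pop2)
or_pooled = odds_ratio(pooled)
print(f"OR within population 1: {or_pop1:.2f}")
print(f"OR within population 2: {or_pop2:.2f}")
print(f"OR in the pooled sample: {or_pooled:.2f}")
```

The odds ratio is 1 within each population but well above 1 in the pooled sample, so an analysis that ignores ancestry would wrongly suggest that the allele is related to disease.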

46 Potential bias: population stratification Illustration: Parra et al., AJHG 63:1839 46 / 71

47 ASSOCIATION STUDIES: FAMILY-BASED DESIGN 47 / 71

48 Family-based association Avoid population stratification bias: natural matched design for ethnicity Different designs: Case-sibling Trios (most common) Extended families 48 / 71

49 Case-sibling association Allows adjustment for ethnicity. Compare the frequency of alleles in cases and in their unaffected siblings (controls). [Figure: pedigrees showing A/a genotypes of cases and their siblings.] 49 / 71

50 Association in trios Sample a set of trios: one case and his/her two parents Look at the transmission disequilibrium (TD test): one allele may be transmitted more often to the child (case) 50 / 71

51 Association in trios Transmission Disequilibrium Test Use a Chi2 test to test for the over-transmission of an allele 51 / 71

52 Association in trios Example of 5 trios. The A allele is transmitted to affected offspring four times out of five. [Table: transmitted and not-transmitted counts for alleles A, B and C, with totals.] Use a chi-square test to test for over-transmission. 52 / 71
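For a single allele, the TDT statistic is usually written as a McNemar-type chi-square, (b - c)² / (b + c), where b and c count how often heterozygous parents transmit or do not transmit the allele. The sketch below applies it to the "four out of five" example, assuming those five transmissions all come from heterozygous parents (the slide's table values are not available). The function name `tdt` is hypothetical.

```python
import math


def tdt(transmitted, not_transmitted):
    """Transmission disequilibrium test for one allele.

    transmitted / not_transmitted : number of times heterozygous parents did /
    did not pass the allele to the affected child.
    Returns the chi-square statistic (1 df) and its asymptotic p-value.
    """
    b, c = transmitted, not_transmitted
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi2 with 1 df
    return stat, p_value


# Allele A transmitted four times out of five
stat, p_value = tdt(4, 1)
print(f"TDT chi2 = {stat:.2f}, p = {p_value:.3f}")
```

With only five transmissions the chi-square approximation is crude; an exact binomial test would be more appropriate in practice, and real studies use many more trios.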

53 Association in extended families The TDT can be generalized to extended families 53 / 71

54 ASSOCIATION STUDIES: SEVERAL SNPs 54 / 71

55 Association at several SNPs In case-control studies, one can use a logistic regression model to test several SNPs as well as their interactions. This is problematic if the SNPs are correlated (i.e. in LD). One way to solve this issue is to perform the association analysis on haplotypes. 55 / 71

57 Haplotype analysis: general idea Create an artificial polymorphism from the SNPs studied Haplotypes (horizontal) of this individual: CT and AT Suppose that alleles at SNP1 are C or A Suppose that alleles at SNP2 are G or T The four possible haplotypes are CG, CT, AG, AT Create a new marker (artificial) with 4 alleles, such that 1=CG, 2=CT, 3=AG and 4=AT Do a standard association analysis with this new polymorphic marker 57 / 71
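The recoding step described above can be sketched in Python. The allele labels and the numbering 1=CG, 2=CT, 3=AG, 4=AT come from the slide; the function name `haplotype_codes` and the case and control haplotypes are hypothetical, and the haplotypes are assumed to be already phased.

```python
from collections import Counter
from itertools import product


def haplotype_codes(snp1_alleles, snp2_alleles):
    """Number every possible two-SNP haplotype: the alleles of the new artificial marker."""
    pairs = product(snp1_alleles, snp2_alleles)
    return {a + b: code for code, (a, b) in enumerate(pairs, start=1)}


codes = haplotype_codes("CA", "GT")  # SNP1 alleles C/A, SNP2 alleles G/T

# The individual from the slide carries haplotypes CT and AT
individual = ("CT", "AT")
recoded = tuple(codes[h] for h in individual)
print(codes, recoded)

# Hypothetical phased haplotypes observed in cases and controls
case_haplotypes = ["CT", "AT", "CT", "CG"]
control_haplotypes = ["AG", "AT", "CG", "AG"]
case_counts = Counter(codes[h] for h in case_haplotypes)
control_counts = Counter(codes[h] for h in control_haplotypes)
print(case_counts, control_counts)
```

The two counters form the rows of a cases-by-haplotype contingency table, which can then be analyzed like any multi-allelic marker.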

58 Haplotype analysis: issues Good alternative to a multiple logistic model. Can account for the correlation between SNPs. Main issue: the phase (haplotype) has to be inferred. Rare haplotypes contribute to the degrees of freedom in the model but don't increase power. 58 / 71

59 How to choose the SNPs for haplotype analysis? Can use HapMap to define blocks of SNPs that are in linkage disequilibrium 59 / 71

60 GENOMEWIDE ASSOCIATION STUDIES 60 / 71

61 GWAs: rationale Linkage analysis using families takes an unbiased look at the whole genome, but is underpowered for the size of genetic effects we expect to see for many complex genetic traits. Candidate gene association studies have greater power to identify smaller genetic effects, but rely on a priori knowledge about disease etiology. Genome-wide association studies combine the genomic coverage of linkage analysis with the power of association, giving a much better chance of finding complex trait susceptibility variants. 61 / 71

62 GWAs: why is it possible now? Genotyping technology: Now have ability to type hundreds of thousands (or millions) of SNPs in one reaction on a SNP chip. The cost can be as low as per person. Two primary platforms: Affymetrix and Illumina. Design and analysis: Availability of SNP databases, HapMap, and other resources to identify the SNPs and design SNP chips. Faster computers to carry out the millions of calculations make implementation possible. 62 / 71

63 SNP Chips: Number and Placement of SNPs A typical SNP chip has at least 317,000 SNPs distributed across the genome. Newest: 1 million. The newest chips can also measure (directly or indirectly) some types of copy number variation. We do not directly measure genotypes at all genetic polymorphisms, but rely on association between the polymorphisms we do assay and those which we do not assay. SNP-SNP association, or linkage disequilibrium, is fundamental to our ability to sample the whole genome with relatively few SNPs. 63 / 71

64 GWA Question of interest: Are the alleles or genotypes at a genetic marker associated with disease status? Use the usual statistical machinery to estimate measures of association and to test for association at each SNP. One typical approach: test for association between disease status and having 0, 1 or 2 copies of the rare allele at a SNP, using the Cochran-Armitage test for trend. 64 / 71
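A minimal sketch of the Cochran-Armitage trend test for one SNP is given below, using the usual additive scores 0, 1, 2 for the number of copies of the rare allele. The function name and the counts are hypothetical, and the variance uses N rather than N - 1, which is one of two common conventions.

```python
import math


def cochran_armitage(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage test for trend.

    cases, controls : counts of individuals with 0, 1, 2 copies of the rare allele.
    Returns the chi-square statistic (1 df) and its p-value.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)   # all individuals
    r = sum(cases)    # all cases
    sum_x_cases = sum(x * c for x, c in zip(scores, cases))
    sum_x_all = sum(x * t for x, t in zip(scores, totals))
    sum_xx_all = sum(x * x * t for x, t in zip(scores, totals))
    stat = (n * (n * sum_x_cases - r * sum_x_all) ** 2
            / (r * (n - r) * (n * sum_xx_all - sum_x_all ** 2)))
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi2 with 1 df
    return stat, p_value


# Hypothetical counts for 0, 1, 2 copies of the rare allele
cases = (100, 80, 20)
controls = (130, 60, 10)
stat, p_value = cochran_armitage(cases, controls)
print(f"trend chi2 = {stat:.2f}, p = {p_value:.4f}")
```

In a GWAS the same computation is repeated for every SNP on the chip, which is what creates the multiple testing problem.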

65 GWA Genomewide Association Analysis of Coronary Artery Disease 65 / 71

66 GWAs: multiple testing issue It is common in GWAs to test hundreds of thousands of SNPs in a single study. Consideration of multiple comparisons is an essential part of determining statistical significance. There are many methods available (Bonferroni, FDR, permutation, etc.). Threshold often considered appropriate: pvalue = / 71
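Two of the corrections named above can be sketched briefly. The numbers are illustrative assumptions, not values from the slide: a family-wise level of 0.05, one million tests, and a toy list of five p-values. The FDR procedure shown is Benjamini-Hochberg, one common choice among several.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of the hypotheses rejected by the Benjamini-Hochberg FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below rank * q / m
        if p_values[i] <= rank * q / m:
            n_rejected = rank
    return sorted(order[:n_rejected])


# Illustrative numbers only
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"Bonferroni threshold for one million tests: {threshold:.1e}")

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
rejected = benjamini_hochberg(p_values, q=0.05)
print("Rejected by Benjamini-Hochberg:", rejected)
```

Bonferroni is conservative when SNPs are in LD, because correlated tests are counted as if they were independent; permutation-based methods account for that correlation at a higher computational cost.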

67 GWA
Hirschhorn et al., 2011 (Annual Review of Medicine)
Table 1. Characteristics of genome-wide association (GWA) discoveries for several exemplar common diseases and quantitative traits

Disease/trait | Number of loci in GWA studies (August 2010) | Heritability explained (August 2010) | Overlap with Mendelian disorders | Drug targets within loci? | Multiple variants per locus? | Comments
Inflammatory bowel disease | >40 | 20% | N/A | No | Yes | Several clear, novel pathways guiding new therapeutic research; variants might aid in distinguishing UC versus CD
Type 2 diabetes | 32 | 10% | Yes | Yes | Yes | Cell cycle only statistically significant pathway, but genes in loci suggest multiple known and novel processes
Autism | 1-2 | Low | No | N/A | No | Rare variants, especially structural variants, more prominent in initial genetic studies
Lipid levels | 95 across 3 traits | 25-30% | Near complete | Yes | Yes | Previously unsuspected genes validated as regulating lipid levels, providing new potential drug targets
Height | (value missing) | (value missing) % | Many | Yes | Yes | Additional pathways emerge after identification of >100 loci; direct evidence for role of common variants
Hemoglobin F levels | 3 | 35% | Yes | Yes | N/A | Identification of previously unsuspected key regulator of globin switching, new potential drug targets

Abbreviations: N/A: not applicable, as precise sites of action of therapies are not known or no therapies are available, or no known Mendelian forms of disease exist. UC: ulcerative colitis; CD: Crohn's disease.
67 / 71

68 GWA: what's next [Figure: flowchart starting from "New genetic loci discovered". Boxes: Identification of the relevant variants; Identification of the relevant genes; Prediction models; Clinically useful prediction; Understanding of underlying biology; Clinically useful therapies. Annotations: short, short-to-medium, medium-to-long and long time frames; "Depends on predictive power and available clinical options".] 68 / 71

69 GWAs In recent years, it was hoped that GWAs would bring definitive evidence for gene effects. GWAs revealed much less than hoped. GWAS papers have reported a couple of hundred genetic variants that show statistically significant associations with a few traits, but these associations typically do not replicate across studies. Even when they do replicate, they never explain more than a tiny fraction of any interesting trait. In fact, classical Mendelian genetics based on family studies has identified far more disease-risk genes with larger effects than GWAS research has so far. 69 / 71

70 GE: what is next? Where is the missing heritability? The missing heritability may reflect limitations of DNA-chip design. GWAS methods so far focus on relatively common genetic variants in regions of DNA that code for proteins. They under-sample rare variants and DNA regions transcribed into non-coding RNA, which seems to orchestrate most organic development in vertebrates. At worst, each human trait may depend on hundreds of thousands of genetic variants that add up through gene-expression patterns of mind-numbing complexity. Next step: next-generation sequencing. 70 / 71

71 References Lecture notes from G. Abecasis on Linkage Disequilibrium. Lecture notes from T. Fingerlin on Design and Analysis of Genome-Wide Association Studies. 71 / 71