Genome-Wide Association Studies: History, Current Approaches, and Future Opportunities. Addie Thompson Genomics,


1 Genome-Wide Association Studies: History, Current Approaches, and Future Opportunities. Addie Thompson Genomics,

2 Outline: History and terminology; Statistics and breeding; Linkage and association analysis, SNP data, and imputation; Genome-Wide Association Studies (GWAS); Considerations and limitations (population structure and kinship, multiple test corrections, minor allele frequency, and power); Traditional and extended methods; Exciting outcomes

3 How are traits inherited?

4 and what regions of the genome control* them? *or now, statistically significantly explain some portion of their variation

5 ( years later) 2007

6 Studying linkage

7 Published Genome-Wide Association Studies

8 Terminology: Locus = location in genome. Gene = unit of DNA that codes for a functional product. Allele = form or variant of a gene. Polymorphism = allele present in the population at >5%. Mutation = allele present in the population at <5%. Phenotype = measurable outcome. QTL = Quantitative Trait Loci = region that contributes to a phenotype (best to use QTL for linkage studies, and association for association studies).

9 Types of Traits (= Phenotypes): Discrete: one or two genes; categorical classes. Polygenic: multiple genes (2-5, ish); compound distributions. Quantitative: many genes (may be hundreds or more); continuous distribution.


11 Population [sample] Descriptors: mean, standard deviation, variance, standard error
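As a minimal sketch of the four descriptors listed on this slide, the following computes them for a small sample. The function name `describe` and the example values are illustrative assumptions, and the sample (n - 1) form of the variance is used.

```python
import math

def describe(sample):
    """Return mean, sample variance, standard deviation and standard error."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance divides by n - 1 (Bessel's correction).
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    sd = math.sqrt(variance)
    # Standard error of the mean shrinks with the square root of sample size.
    se = sd / math.sqrt(n)
    return {"mean": mean, "variance": variance, "sd": sd, "se": se}

# Hypothetical plant heights (cm), made up for illustration.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
stats = describe(heights)
print(stats)
```

The standard deviation describes the spread of individuals, while the standard error describes the uncertainty of the estimated mean.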

12 Partitioning Variance: ANOVA (Statistics.laerd.com)

13 Partitioning Variance: ANOVA (sites.nicholas.duke.edu)

14 Partitioning Variance: Phenotype = Genotype + Environment (+ G x E). Phenotypic variance as the sum of genetics and environment: Vp = Vg + Ve (+ Vgxe). Genetic variance comes from additive and dominance variance: Vg = Va + Vd (+ Vaxd). Quantitative genetics tends to focus on additive genetic variation because it is operated on by natural selection.

15 Heritability: A value of a population (not an individual). Particular to population and environment. Impacts ability to map traits. Broad-sense = Vg/Vp = Vg/(Vg + Ve): proportion of phenotypic variance explained by genetic variance. Narrow-sense = Va/Vp = Va/(Va + Vd + Vaxd + Ve): degree to which offspring resemble parents; fraction of variation visible to selection; predicts how phenotypes change with selection.
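A small worked example may make the two ratios concrete. This is only a sketch: the variance components are made-up numbers, the function name is hypothetical, G x E variance is ignored, and Vp is taken as the total Vg + Ve, consistent with the broad-sense definition.

```python
def heritability(va, vd, vaxd, ve):
    """Return (broad-sense, narrow-sense) heritability from variance components."""
    vg = va + vd + vaxd   # total genetic variance
    vp = vg + ve          # phenotypic variance (G x E ignored in this sketch)
    return vg / vp, va / vp

# Made-up variance components for illustration only.
broad, narrow = heritability(va=4.0, vd=1.0, vaxd=0.0, ve=5.0)
print(f"H2 = {broad:.2f}, h2 = {narrow:.2f}")
```

Narrow-sense heritability can never exceed broad-sense heritability, because Va is only one part of Vg.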

16 Linkage: Relationship between alleles at two loci. Recombination (figure credit: David Eccles) happens at different frequencies for different pairs of loci.

17 Linkage Disequilibrium (LD): Different from linkage. Linkage : Location, LD : Alleles. Either one can exist without the other! Reasons for LD: mutation, population structure, genetic drift, lack of recombination, selection, nonrandom mating, linkage.

18 Xu et al 2012, Nat Biotech, Resequencing 50 accessions of cultivated and wild rice

19 Linkage Mapping: Using observed loci (markers) to draw inferences about unobserved loci (those controlling traits). Requires some amount (but not too much!) of LD. Power to detect QTL relies on LD between a SNP and the causative variant, but resolution in mapping requires lots of recombination events. Tends to underestimate the number of involved loci, and overestimate their effects (the Beavis Effect). Typically shown as a Logarithm of Odds (LOD) score for linkage.
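The slide does not give a formula for LD. As a sketch using the standard two-locus measures D and r² (not taken from the slides), LD can be computed from haplotype and allele frequencies. The function name and the frequencies below are assumptions for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    # D is the excess of the AB haplotype over its expectation under independence.
    d = p_ab - p_a * p_b
    # r2 rescales D so that 0 means no association and 1 means complete LD.
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Made-up frequencies: AB haplotypes are more common than expected (0.25).
d, r2 = ld_stats(p_ab=0.4, p_a=0.5, p_b=0.5)
print(d, r2)
```

Nothing in the calculation refers to physical position, which is why LD can exist between unlinked loci.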

20 Linkage Mapping Rosenquist 2011

21 Association Analysis: Different from linkage mapping: not biparental; instead, uses more diverse lines to take advantage of historical recombinations, but uses biallelic SNPs (3 possible genotypes). Usually more SNPs, more recombination, higher resolution. Associate genotypic and phenotypic variation in order to: identify functional variants; find loci or genes contributing to the trait(s); learn about the genetic architecture, e.g. how many loci are contributing, what their allele frequencies and interactions are, and the distribution of the effect sizes.

22 Manhattan plot: Styrkarsdottir et al 2014

23 Genotypic Data: SNPs. Now mostly obtained by Genotyping-By-Sequencing (GBS). Limitations and biases: ascertainment in sequence generation; alignment (sequence not in reference, duplicated regions, unmapped regions). Consider necessary coverage based on LD decay, which changes based on population and species and can drop to 0 within 500 bp. Must have LD between marker and causative loci to detect.

24 LD Decay Xu et al 2012 Nat Gen

25 Marker Imputation: Filling in missing marker data based on knowledge from other lines. Software: IMPUTE, MACH, BIMBAM, BEAGLE, TASSEL, others. Can measure quality of imputation by seeing how well imputations agree with actual values (closer to 1 = better). Minor allele frequency and population structure are concerns. Review on genotype imputation for GWAS: Marchini and Howie 2010 Nature Rev Genet.

26 GWAS: Considerations and Limitations: population structure, kinship, multiple testing, minor allele frequency, power

27 GWAS: Considerations and Limitations: Cannot use a naïve simple linear model of association because of population structure. LD can arise from different allele frequencies among subpopulations rather than actual linkage: ancestry predicts genotype AND phenotype, yet genotype is unrelated to phenotype. Example: "Beware the chopsticks gene" (Hamer and Sirota 2000). SNPs can be linked but not causative; association is not causation.
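One common way to check imputation quality is to mask genotype calls that are actually known, impute them, and compare. The sketch below assumes the imputed calls already exist (it does not call any of the programs named above) and reports two agreement measures that approach 1 as imputation improves. The function name and the toy calls are hypothetical.

```python
import numpy as np

def imputation_accuracy(true_calls, imputed_calls):
    """Compare imputed genotype calls (0/1/2 dosages) with masked true calls."""
    true_calls = np.asarray(true_calls, dtype=float)
    imputed_calls = np.asarray(imputed_calls, dtype=float)
    # Concordance: fraction of calls imputed exactly right.
    concordance = float(np.mean(true_calls == imputed_calls))
    # Pearson correlation between true and imputed dosages.
    r = float(np.corrcoef(true_calls, imputed_calls)[0, 1])
    return concordance, r

# Toy example: six masked calls, one of which was imputed incorrectly.
true_calls = [0, 1, 2, 2, 0, 1]
imputed_calls = [0, 1, 2, 1, 0, 1]
concordance, r = imputation_accuracy(true_calls, imputed_calls)
print(concordance, r)
```

Concordance can look high for low-MAF markers simply because almost everyone carries the major allele, which is one reason the correlation is often reported alongside it.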

28 Accounting for Subpopulations: Y = u + vq + bw + e, where u is the overall mean, q is subpopulation membership and v its effect, w is marker allele dosage and b its effect, and e is the residual. Matrix notation: Y = 1u + Qv + Wb + e

29 Accounting for Subpopulations: Direct knowledge (pedigree). Clustering: STRUCTURE or ADMIXTURE, Bayesian approaches to assign individuals probabilistically (try to obtain Hardy-Weinberg Equilibrium within the subpopulations). Principal Components. Multidimensional Scaling. EMMA = Efficient Mixed Model Association = eigenvalue decomposition to solve the mixed model and avoid taking the direct inverse of the covariance matrix (computationally intensive).
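To illustrate why the subpopulation term matters, the sketch below simulates a trait that depends only on subpopulation membership, with a marker whose allele frequency differs between subpopulations, and then fits the model with and without the Q term by ordinary least squares. All parameter values, variable names and the helper `marker_effect` are assumptions for illustration; real analyses would use a dedicated GWAS package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# q: subpopulation membership (0 or 1); a single column is used because,
# with an intercept, one of the two membership columns is redundant.
q = rng.integers(0, 2, size=n)

# w: marker allele dosage (0/1/2); the allele is more common in subpopulation 1.
freq = np.where(q == 1, 0.7, 0.2)
w = rng.binomial(2, freq)

# The trait depends on subpopulation only; the true marker effect b is 0.
y = 10.0 + 3.0 * q + rng.normal(0.0, 0.5, size=n)

def marker_effect(y, columns):
    """Fit Y = 1u + (columns) + e by least squares; return the last coefficient."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[-1]

b_naive = marker_effect(y, [w])   # Y = u + bw + e
b_q = marker_effect(y, [q, w])    # Y = u + vq + bw + e
print(f"naive b = {b_naive:.2f}, b with Q term = {b_q:.2f}")
```

The naive model reports a sizeable marker effect that is purely a product of ancestry, while the model with the Q term estimates an effect close to zero.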
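Of the options on this slide, principal components are the simplest to sketch. The example below simulates two subpopulations with diverged allele frequencies (all simulation settings are assumptions), centers the genotype matrix, and takes the leading components from a singular value decomposition. It is an illustration, not the procedure used by any particular package.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group, m = 30, 300

# Allele frequencies for subpopulation 1, and drifted frequencies for subpopulation 2.
p1 = rng.uniform(0.1, 0.9, size=m)
p2 = np.clip(p1 + rng.normal(0.0, 0.2, size=m), 0.05, 0.95)

# Genotype matrix of 0/1/2 dosages: rows are individuals, columns are markers.
G = np.vstack([rng.binomial(2, p1, size=(n_per_group, m)),
               rng.binomial(2, p2, size=(n_per_group, m))]).astype(float)
group = np.repeat([0, 1], n_per_group)

# Center each marker, then take principal components from the SVD.
Gc = G - G.mean(axis=0)
U, S, Vt = np.linalg.svd(Gc, full_matrices=False)
pcs = U[:, :2] * S[:2]                  # scores on the first two PCs
var_explained = S**2 / np.sum(S**2)     # proportion of variance per PC

# How strongly does PC1 track the true subpopulation label?
pc1_group_corr = abs(np.corrcoef(pcs[:, 0], group)[0, 1])
print(var_explained[:2], pc1_group_corr)
```

The leading PC scores can then be supplied as the Q covariates in the association model.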


31 Accounting for Kinship: Add random effect K for the kinship matrix, calculated with genome-wide markers. Model the relationship of all pairs of individuals in a panel: marker similarity matrix (number of times two individuals share an allele at a locus); realized genomic additive relationship matrix (scaled and standardized version of the previous); pedigree additive relationship matrix (requires deep knowledge). See Price 2010; Yu and Buckler 2009.

32 Multiple Test Corrections: In 100 tests, about 5 are expected to have p < 0.05 by chance; extend this to hundreds of thousands of tests. The threshold used to correct is a major hurdle. Bonferroni = too conservative, because it assumes independent tests. Permutation = good for linkage, but generally not GWAS because family structure is not preserved. FDR (Benjamini and Hochberg 1995) = calculate the expected proportion of declared QTL that are false positives, and just live with it. Can calculate the effective number of tests (Li and Ji, using eigenvalue decomposition of the correlation matrix). Some have tried 2-stage designs (screen for suggestive SNPs, then test the subset in an independent population), but joint analysis is almost always more powerful.

33 Minor Allele Frequency (MAF) Problem: In a diversity panel with hundreds of lines, some large-effect alleles are only present in 2-5 individuals. Low-MAF SNPs don't work well with many test statistics because they violate the large-sample assumption. A common cut-off is to drop SNPs with MAF < 0.05, but this may be too stringent depending on the size of your population; how many genotypes (different measures of an allelic effect) would it take to be considered believable?

34 Power: Accounting for population structure reduces power by removing variation generated by the QTL. Bernardo 2010: power vs FDR tradeoff. Fst measure = how different the subpopulations are at that locus. Need high MAF and low Fst.

35 Nuts and Bolts: What you need to run an analysis: a matrix of genotypes, with SNP information and calls for each individual in the population; phenotypes scored for each member of the population, preferably as BLUP (Best Linear Unbiased Prediction) values to account for environmental effects. Usually optional: supply Q and K.
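The slide does not say which scaling is used for the realized genomic relationship matrix. The sketch below assumes one widely used form, in which dosages are centered by twice the allele frequency and the cross-product is divided by 2 Σ p(1 - p). The function name and the toy genotypes are hypothetical.

```python
import numpy as np

def genomic_relationship(G):
    """Realized additive relationship matrix from a 0/1/2 dosage matrix.

    G: individuals x markers. Assumed scaling: K = ZZ' / (2 * sum(p * (1 - p))),
    where Z holds dosages centered by twice the allele frequency.
    """
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)        # monomorphic markers carry no information
    Z = G[:, keep] - 2.0 * p[keep]
    return Z @ Z.T / (2.0 * np.sum(p[keep] * (1.0 - p[keep])))

# Toy panel: individuals 0 and 1 are identical; individual 2 is their opposite.
G = np.array([[0, 1, 2, 0],
              [0, 1, 2, 0],
              [2, 1, 0, 2],
              [1, 2, 1, 1]], dtype=float)
K = genomic_relationship(G)
print(np.round(K, 2))
```

Because allele frequencies are estimated from the same panel, relationships are expressed relative to the panel average, so unrelated-looking pairs can have negative values.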
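The difference between the Bonferroni and Benjamini-Hochberg thresholds can be seen on a handful of p-values. This is a minimal sketch with made-up p-values; the function names are illustrative.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject tests whose p-value is below alpha divided by the number of tests."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals < alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling FDR at level alpha."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Compare the i-th smallest p-value with alpha * i / m.
    below = ranked <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank that passes
        reject[order[:k + 1]] = True       # reject it and everything smaller
    return reject

# Made-up p-values for ten markers.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonf = int(bonferroni(pvals).sum())
n_bh = int(benjamini_hochberg(pvals).sum())
print(n_bonf, n_bh)
```

With these values Bonferroni declares one marker significant and Benjamini-Hochberg declares two, reflecting the less conservative FDR criterion.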
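A MAF filter is straightforward to sketch from a dosage matrix. The 0.05 cut-off follows the slide; the function name, the toy panel of 20 individuals and the choice to ignore missing calls via `nanmean` are assumptions.

```python
import numpy as np

def minor_allele_frequency(G):
    """MAF per marker from a 0/1/2 dosage matrix (individuals x markers)."""
    p = np.nanmean(G, axis=0) / 2.0   # frequency of the counted allele
    return np.minimum(p, 1.0 - p)     # fold so the minor allele is reported

# Toy panel of 20 individuals and 3 markers.
G = np.zeros((20, 3))
G[0, 0] = 1      # marker 0: a single heterozygote
G[:10, 1] = 1    # marker 1: ten heterozygotes
G[:, 2] = 2      # marker 2: counted allele nearly fixed...
G[:3, 2] = 1     # ...except for three heterozygotes

maf = minor_allele_frequency(G)
keep = maf >= 0.05
print(maf, keep)
```

Marker 0 is dropped because its minor allele is carried by a single individual, which is exactly the situation in which one observation would drive the estimated effect.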


37 Tools for GWAS: TASSEL, PLINK, EIGENSTRAT, EMMAX, GAPIT, GenABEL, GWASTools, FaST-LMM, FarmCPU. Compare approaches using QQ plots of p-value distributions.

38 Rules of Thumb: At least 300 individuals; 600 or more is better. Marker number depends on your species and can range from 1,000 to 10,000,000. Q+K model, in some fashion. Treating GWAS as a hypothesis-generating approach eases disappointment caused by low power or resolution, or stringent multiple testing corrections.

39 GWAS Live Demos! Examples of more recent publications. Applications of GWAS results and serious implications. Software for mapping and analysis.
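The points of a QQ plot can be computed without any of the packages above. The sketch below pairs sorted observed p-values with the quantiles expected under a uniform null, on the -log10 scale. The function name, the plotting positions (i - 0.5)/n and the simulated null p-values are assumptions for illustration.

```python
import numpy as np

def qq_points(pvals):
    """Return (expected, observed) -log10 p-values for a QQ plot."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    # Expected quantiles of a uniform distribution, offset to avoid log10(0).
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    return expected, observed

# Simulated p-values with no true associations.
rng = np.random.default_rng(2)
null_p = rng.uniform(size=10_000)
expected, observed = qq_points(null_p)
print(np.median(expected), np.median(observed))
```

Plotting `observed` against `expected` should follow the diagonal when a model is well calibrated; an early, genome-wide departure upward suggests uncorrected structure or kinship, while a departure only in the extreme tail is what true associations look like.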

40 The Use of High Density Genotyping in Animal Health The sequencing of genomes, such as that of the cow, has led to the discovery of thousands of single nucleotide polymorphisms (SNPs). By combining this knowledge with new methods that can genotype thousands of SNPs efficiently, it has become possible to carry out genome-wide association studies in domestic animals to map genes for complex traits, including disease resistance, using the linkage disequilibrium between the SNPs and the unknown genes affecting the trait of interest. Although experiments using 10,000 SNPs and 384 animals have found many significant associations, power calculations suggest that we need >50,000 SNPs and >1000 animals to map genes explaining most of the genetic variance for complex traits. Such experiments are now underway and the results will have two applications. Firstly, they will lead to panels of SNPs that can be used to accurately select animals with high breeding value for desired traits leading to a great increase in the rate of genetic improvement. Secondly, they will form the first step in identifying the genes and mutations that cause variation in complex traits. A collaborative approach to achieving this second goal is proposed.

41 Near-Term Future of GWAS: Post-GWAS prioritization: how to choose which candidate genes to follow? Informed GWAS: using networks or other information. Haplotype mapping: using a series of alleles rather than a single allele in regression; becomes Cochran-Armitage or other trend test for trends across genotype categories. QTL meta-analysis.

42 GWAS Pathway Analysis: ID-ing pathways: KEGG, GO/DAVID/MESH/PANTHER. How to assign SNPs to genes in order to label as a pathway? Currently, by distance. Address bias from differences in pathway size, gene size, SNP density. Difficult to form null distribution; use permutations.

43 Conclusions: Quantitative genetics is an advanced and rapidly changing field. GWAS has already begun to change the landscape of science, agriculture, and healthcare. Account for: environmental effects, population structure, kinship, multiple testing, minor allele frequency. Directions of future research will determine further applicability of results.