General aspects of genome-wide association studies Abstract number 20201 Session 04 Correctly reporting statistical genetics results in the genomic era Pekka Uimari University of Helsinki Dept. of Agricultural Sciences 29.11.2013 1
Quality check of the SNPs Call rate Low call rate indicates genotyping problems All SNPs with CR < 90% are usually excluded Minor allele frequency Eliminates non-polymorphic SNPs MAF limit of 1% or 5% are commonly used Number of animals in each genotyping class Hardy-Weinberg equilibrium May indicate genotyping problems Other causes are selection, recent mutation, random drift, small population size etc. Some variation in practice exist e.g. P- value limit from 0.05 to 10-6 CR, MAF and HWE of the best SNPs (the smallest P-value in the GWA) are checked with more detail Also Illumina quality control parameter can be used Illumina Beadstudio allele calling scatter plots 2
Quality check of the samples Call rate Commonly used limit for call rate is 90% Duplication IBS/IBD is appr. 100% Know relatedness, parentage test IBS/IBD estimation Parent-offspring: proportion of SNPs with IBD=1 should be close to 100% depending on the level of inbreeding Full-sibs expectation is IBD=0 25% IBD=1 50% IBD=2 25% Population test 3
Population structure Sample structure can be studies using e.g. PLINK multidimensional scaling, Eigenstrat by Patterson, PLOS 2006 or other methods and software Population clustering Finnish Landrace 0.2 0.15 0.1 Crossbred 0.05 Finnish Yorkshire 0-0.1-0.05 0 0.05 0.1 0.15-0.05-0.1 Dept. of Agricultural Sciences 4
Imputation of genotypes Estimates missing genotypes, the most probably is usually used as an imputed genotype Increases power of association analysis Is done simultaneously together with haplotyping In animal genetics the most commonly used software seems to be Beagle (Browning and Browning, 2009) Faster than for example older fastphase Imputation error is usually very low appr. 2-3% 5
Choice of Animals Purebred animals commonly used Genetic homogeneity Usually high LD For populations with high LD less markers are needed compared to populations with low LD Synthetic breeds / breed crosses Allele and locus heterogeneity 1 0.9 0.8 0.7 0.6 D' / r2 0.5 0.4 0.3 0.2 0.1 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 kbp 6
Number of Animals In human genetic studies the number of studied / genotyped individuals runs from 100 1000 10000 or more Effects the power of the study Power of the study design can be estimated before hand but usually several assumptions about the mode of inheritance must be made, thus these estimates are seldom for any use For simple Mendelian traits 10 cases and 10 controls can be enough More realistic picture about the power can be achieved comparing the results of GWAS of similar or bigger sample size Remember that even with small sample sizes you can achieve P-values < 10-7, some of those are false positives 7
Choice of Phenotypes The most common phenotype is based on estimated breeding values of males used in AI (AI-bulls, AI-boars) Possible only for traits that are measured in a national recording scheme (or similar type of data recording from performance testing stations etc.) For other traits observations of the genotyped animals are used Usually EBV s are deregressed prior to GWAS (Garrick et al. 2009) Removes the effect of parents on EBVs Deregression Calculation of weight of observations for GWAS 8
Statistical methods Single SNP model The most commonly used model Test for association one SNP at the time Multiple SNP model Several or all SNPs analysed simultaneously Oversaturation (number of markers >> number of individuals) needs to be handled Selecting a subset of variables Shrinking the estimates towards zero 9
Genomic control (GC) is based on the idea a majority of the markers are not associated with the trait and the test statistics should follow the null hypothesis distribution Q-Q plot Population stratification / relationship between the samples 10
Q-Q plot a) No association no population stratification or relatedness b) No association but indication of population stratification or relatedness c) Evidence for association and population stratification or relatedness d) Evidence for association but no population stratification or relatedness McCarthy et al, Nature Reviews Genetic, 2008, 9:356-369 11
Genomic control If there are indications (based on Q-Q plot) of population stratification or relatedness of samples test statistics can be adjusted using genomic control λ (Devlin & Roeder, 1999, Biometrics 55:997-1004) A simple estimate of λ is the mean of the obtained tests statistics or the median divided by 0.456 (0.456 is the expected median for chi-square distribution with df=1) Pearson and Manolio, 2008, JAMA 299:1335-2150 Dept. of Agricultural Sciences sciences Kuvat: 29.11.2013 17.5.2010 12
Population stratification / relationship between the samples The most commonly used way to control relatedness is to include the pedigree structure into a single marker mixed model Mixed linear model: y i = μ + b*x i + a i + e i, y i is the deregressed EBV x i is the number of minor alleles (0, 1, or 2) of the tested SNP b is the corresponding regression coefficient a i is a random polygenic effect with a i ~ N(0, Aσ 2 a), where A is the additive relationship matrix and σ 2 a is the polygenic variance e i is a random residual effect with e i ~ N(0, Iσ 2 e/w i ), where I is an identity matrix, σ 2 e is the residual variance, and w i is the weight 13
Relationship matrix Based either on pedigree (A) or genotypes (G) Genomic relationship matrix (G) Most commonly used the method presented by VanRaden (2008, J. Dairy Sci. 91, 4414-4423) G = ZZ /k, where k=2 p i (1-p i ) Or weight markers by reciprocals of their expected variance G = ZDZ, where D ii = 1/(mk i ) where k i = 2p i (1-p i ) and m=number of markers 14
P-values Test statistic of the SNP-effect from the mixed linear model: t-test F-test Wald test Squared t-test statistic has an exact F(1,n 1) distribution The Wald statistic can be used to test a simple hypothesis H 0 :θ = θ 0 on the entire parameter vector, θ θ T I θ θ θ has x 2 -distribution with 1 df When n Wald-test with 1 df the square of the t -test statistic 15
Manhattan plot 16
Commonly used programs Variance-covariance estimation packages DMU, ASREML etc Fits mixed model equation and estimates variance components for each SNP takes long time to analyze all markers Use either A or G-matrix Methods and programs that take into account population and family structure (approximations) GenABEL (GRAMMAR), Aulchenko et al, 2007, Genetics EMMAX, Kang et al, 2010, Nature Genetics Tassel, Zhang et al, 2010 Nature Genetics GEMMA, Zhou and Stephens 2012 Nature Genetics Should be faster than DMU and ASREML 17
Multiple SNP model number of markers >> number of individuals Bayesian LASSO y = μ + Xb + u + e X is the matrix on m most correlated marker genotypes b is a vector of marker effects u polygenic effect with G genomic relationship matrix computed from the rest of the markers b j σ j 2 ~N(0, σ j 2 ) σ j 2 λ~exp(λ 2 /2) λ~gamma(κ, ξ) Kärkkäinen & Sillanpää (2012, Genetics) 18
Multiple SNP model number of markers >> number of individuals Heteroscedastic Ridge Regression y = μ + Xb + e X is the matrix on marker genotypes b is a vector of random marker effects First round: Shrinkage factor λ~σ e 2 /σ b 2 Second round: Shrinkage factor λ~σ 2 e /σ 2 bj where σ 2 bj is calculated based on an estimate of a marker effect b j from the first round bigrr R-package (Shen et al. 2013, Genetics) 19
Comparison of the methods Single SNP methods 20
Comparison of the methods Multiple SNP methods 21
Post GWAS (see review by Wojcik et al. 2015 BMC Genetics) Aggregate markers into biologically relevant units, gene or pathway Increase power: combine multiple weak or moderate signals Allow for allelic or locus heterogeneity Gene-level analyses Combines independent signals within a gene Should take LD into account E.g. VEGAS (Liu et al. AJHG 2010) Pathway-level (gene-set) analyses Related collection of genes with similar biological function Assess if strong associations cluster within a gene set compared to genes outside of the pathway (or gene set) 22