GCTA/GREML. Rebecca Johnson. March 30th, 2017


2 Motivation for method

We know from twin studies and other methods that genetic variation contributes to complex traits such as height, BMI, and educational attainment.

Narrow-sense heritability: the proportion of variation in a phenotype like height that is due to additive genetic variance, where

$V_{height} = V_{gene} + V_{env} + V_{gene:environment}$

Additive genetic variance is one component of the $V_{gene}$ part of the equation (e.g., $V_{gene} = V_{additive} + V_{dominance} + V_{epistatic}$).

The missing heritability problem: the variation we are able to explain/predict in a phenotype like height using validated SNPs is less than the narrow-sense heritability.

Different classes of explanations for the missing heritability problem:

Variation in phenotype due to genetic variance is driven by non-additive genetic variance. The authors reject this explanation because the gap is defined in terms of narrow-sense heritability, which is already restricted to additive genetic variance.

Discussed in group: the authors might dismiss this too readily; some argue that twin and other methods for capturing narrow-sense heritability may not fully restrict the estimates to additive genetic variance.

3 Authors' preferred explanation: the missing heritability problem is, at least in part, an artifact of existing methods for testing for associations between SNPs and complex traits.

Standard method the authors are criticizing: regress the phenotype on each SNP separately and then, for inference, set a stringent significance threshold for saying that a SNP plays a causal role in that trait. In math (n = number of SNPs):

1. Step one: run association testing on each SNP separately
   1.1 Test 1: $height_i = \alpha + \beta \, snp1_i + e_i$
   1.2 Test 2: $height_i = \alpha + \beta \, snp2_i + e_i$
   1.3 Test 3: $height_i = \alpha + \beta \, snp3_i + e_i$
   ...
   Test n: $height_i = \alpha + \beta \, snpn_i + e_i$
2. Step two: set a stringent p-value threshold for inference on $\hat{\beta}$ from the step 1 regressions to determine which SNPs are associated with height

The authors argue that many variants may be true causal variants, but because their additive effects are so small, this method misses them.

Not hidden heritability, but missed heritability using existing association testing methods.
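The two-step procedure above can be sketched in R on simulated data. This is only an illustration: the sample size, allele frequency, effect sizes, and the Bonferroni threshold are all assumptions made for the example, not values from the paper.

```r
## Simulated data: every name and number here is hypothetical
set.seed(1)
n_people <- 500
n_snps <- 20

## allele counts (0, 1, or 2); rows = people, columns = SNPs
snps <- matrix(rbinom(n_people * n_snps, size = 2, prob = 0.3),
               nrow = n_people, ncol = n_snps)

## every SNP has the same small additive effect on height
height <- as.numeric(170 + snps %*% rep(0.2, n_snps) +
                       rnorm(n_people, sd = 5))

## Step one: a separate regression of height on each SNP
p_values <- sapply(seq_len(n_snps), function(i) {
  fit <- lm(height ~ snps[, i])
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})

## Step two: stringent threshold (Bonferroni is one common choice)
threshold <- 0.05 / n_snps
significant <- which(p_values < threshold)
```

Comparing `length(significant)` with `n_snps` shows how many of the truly causal simulated SNPs pass the threshold; with small per-SNP effects, many typically do not, which is the authors' point.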

4 Authors' proposed solution: in words

Distinguish between two different goals of testing for associations between SNPs and complex traits like height:

1. One goal (not the authors' goal in the present paper): identify a particular causal SNP associated with a complex trait like height
   Method: may make sense to run separate tests for each SNP and then set a stringent significance threshold
2. Different goal: identify the total additive genetic contribution to a complex trait like height
   Method: rather than conducting a separate association test for each SNP, find a way to estimate the simultaneous effect of all SNPs in the same step/regression

What do the authors mean by all SNPs? Two options:

1. All measured SNPs on the genome
2. All measured SNPs on a particular chromosome

General approach: decompose variation in a phenotype into two components: variation due to random genetic relatedness and residual variation. Genetic relatedness is captured by a matrix of relatedness between study participants, from which participants above a relatedness threshold (i.e., relatedness that comes from being siblings, relatives, etc.) have been removed.

5 Authors' proposed solution: in math

Notation:

n: number of individuals/samples; these seem to be indexed by j rather than i

$y_j$: single outcome phenotype/trait (will use the working example of height) for observation j

$X_j$ and $\beta$: X = observed covariates relevant to the trait, like age and gender in the height case (can also include, maybe, ancestry PCs); $\beta$ = coefficients on those observed covariates

$w_{ij}$ and $u$: $w_{ij}$ can be thought of as a particular way of structuring our information about individual j's SNPs i (more detail on next slide); $u$ can be thought of as the coefficients on those SNPs, i.e., the SNP effects

$\epsilon_j$: residual variation in height

6 Focus on the $w_{ij}$ term

Dimensions: i × j, where i = row for each SNP, j = column for each individual

More notation:

$x_{ij}$: for person j, number of copies of the reference allele for SNP i. Note that these reference alleles differ across SNPs, e.g. we might have:
   rs 1: T
   rs 2: G
dbSNP info: the choice of reference allele is arbitrary (often the same allele as the minor allele in dbSNP) but should be consistent between test and target sample

$p_i$: frequency of the reference allele for SNP i (as observed in the data)

Putting these together, each cell in $w_{ij}$ is basically the observed count of the reference allele, adjusted by a measure of the reference allele's population proportion:

$w_{ij} = \frac{x_{ij} - 2p_i}{\sqrt{2p_i(1 - p_i)}}$

7 Focus on the $w_{ij}$ term

Creating a fake $w_{ij}$ to help us visualize:

    ## fake count of reference alleles for 4 SNPs (rows)
    ## and 3 individuals (columns)
    ref_allele_count <- data.frame(rebecca = c(2, 1, 1, 1),
                                   adam = c(0, 0, 1, 0),
                                   bruce = c(0, 1, 0, 2))
    ## calculate p (freq of ref allele) using observed counts:
    ## total copies per SNP / (2 alleles * 3 individuals)
    p <- rowSums(ref_allele_count) / (2 * 3)
    ## numerator of w_ij
    w_num <- ref_allele_count - 2 * p
    ## denominator of w_ij
    w_denom <- sqrt(2 * p * (1 - p))
    wij <- w_num / w_denom

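8 Focus on the $w_{ij}$ term

Result ($w_{ij}$ for the fake data on the previous slide, rounded to three decimals; rows are SNPs 1 to 4):

             Rebecca     Adam    Bruce
    SNP 1      2.000   -1.000   -1.000
    SNP 2      0.500   -1.000    0.500
    SNP 3      0.500    0.500   -1.000
    SNP 4      0.000   -1.414    1.414

What we then do with the $w_{ij}$ matrix:

Calculate a measure of genetic relatedness between each pair of individuals (each pair of columns), described on slide 10.

Would then exclude individuals above a certain relatedness threshold so that the resulting matrix only contains genetic relatedness due to random allele assignment.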

9 General form: $y = X\beta + Wu + \epsilon$

Putting the terms together into a regression of our phenotype on the matrix containing all SNPs:

1. Regressing the phenotype on observed non-genetic covariates like age and gender and on the observed matrix of SNPs. Example with height:

   $height = age \, \beta_1 + gender \, \beta_2 + Wu + \epsilon$

2. Specifically interested in how the variance of the complex trait, V, is partitioned into variance explained by the observed matrix of SNPs ($\sigma_u^2$) and residual variance ($\sigma_\epsilon^2$: variance explained neither by the matrix of SNPs nor by observed covariates like age and gender):

   $var(y) = V = WW^T\sigma_u^2 + I\sigma_\epsilon^2$
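A short derivation may help show where the variance partition comes from. It rests on assumptions that are not spelled out on the slide: the SNP effects $u$ and the residuals $\epsilon$ are treated as independent random effects with mean zero, $X\beta$ is treated as fixed, and $W$ is oriented with individuals as rows and SNPs as columns (the transpose of the layout on slide 6).

```latex
% Assumed: u ~ (0, I\sigma_u^2), \epsilon ~ (0, I\sigma_\epsilon^2), u independent of \epsilon
\begin{aligned}
\operatorname{var}(y)
  &= \operatorname{var}(X\beta + Wu + \epsilon) \\
  &= W \operatorname{var}(u) W^T + \operatorname{var}(\epsilon) \\
  &= W W^T \sigma_u^2 + I \sigma_\epsilon^2 .
\end{aligned}
% With N SNPs, defining A = W W^T / N and \sigma_g^2 = N \sigma_u^2
% gives the equivalent form V = A \sigma_g^2 + I \sigma_\epsilon^2 .
```

The second form is the one that links the SNP matrix to a genetic relationship matrix $A$ between individuals.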

10 Once we have this general model for partitioning the variance in a complex trait into variance from all SNPs and residual variance, there are different applications:

1. Use $w_{ij}$ to estimate the genetic relationship between pairs of individuals in the data (similar to cov(X, Y)), where N is the number of SNPs:

   $A_{Rebecca,Bruce} = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_{i,Rebecca} - 2p_i)(x_{i,Bruce} - 2p_i)}{2p_i(1 - p_i)}$

   Use to exclude close relatives from certain analyses.

2. Focus of the height paper and (probably?) the most substantive application: estimate the variance explained by genome-wide or chromosome-wide SNPs

   Outcome variable form: $y = X\beta + \sum_{i=1}^{r} g_i + \epsilon$

   Variance partition form: $V = \sum_{i=1}^{r} A_i\sigma_i^2 + I\sigma_\epsilon^2$
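As a minimal sketch, the relatedness measure in application 1 can be computed for the fake data from slide 7. The matrix form below assumes the slide 6 layout (rows = SNPs, columns = individuals), so pairs of individuals are compared through `t(w) %*% w`. The cutoff value is a hypothetical choice for illustration.

```r
## same fake counts as slide 7: rows = SNPs, columns = individuals
ref_allele_count <- data.frame(rebecca = c(2, 1, 1, 1),
                               adam = c(0, 0, 1, 0),
                               bruce = c(0, 1, 0, 2))
x <- as.matrix(ref_allele_count)
n_snps <- nrow(x)

## observed reference allele frequency for each SNP
p <- rowSums(x) / (2 * ncol(x))

## standardize each SNP (row) to get w_ij
w <- (x - 2 * p) / sqrt(2 * p * (1 - p))

## relatedness matrix: average over SNPs of products of w_ij
A <- crossprod(w) / n_snps

## the pairwise formula from the slide, for one pair
A_rebecca_bruce <- mean((x[, "rebecca"] - 2 * p) * (x[, "bruce"] - 2 * p) /
                          (2 * p * (1 - p)))

## hypothetical relatedness cutoff for flagging close relatives
cutoff <- 0.025
too_related <- which(A > cutoff & upper.tri(A), arr.ind = TRUE)
```

The entry `A["rebecca", "bruce"]` matches `A_rebecca_bruce`, so the matrix product is just the pairwise formula applied to every pair at once. With only four SNPs the values are far too noisy to interpret; real analyses use hundreds of thousands of SNPs.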

11 Those equations look complicated! How do we estimate?

Restricted maximum likelihood estimation (REML)

General method used when maximum likelihood estimation leads to biased estimates of variance components (which are the main quantities of interest here).

Basic gist of REML: apply likelihood estimation to residuals rather than to the full data.

Their specific implementation of REML: the average information algorithm

What is the average information algorithm? A type of quasi-Newton method (quasi-Newton: think BFGS) that uses first and second derivatives of the likelihood (hence Newton) but, for the second derivative, uses an approximate rather than the full form of the Hessian (hence quasi).

Average information refers to the specific method for approximating the Hessian (the average of the observed and expected information matrices).
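To make the REML idea concrete, here is a rough sketch for a single genetic variance component on simulated data. It does not implement the average information algorithm: it hands the restricted log-likelihood to R's general-purpose `optim()` with the quasi-Newton `L-BFGS-B` method instead. The sample sizes, true variances, and bounds are all assumptions chosen for the example.

```r
## Simulated data: all values below are hypothetical
set.seed(42)
n <- 200          # individuals
n_snps <- 500     # SNPs
freq <- runif(n_snps, 0.1, 0.9)
x <- sapply(freq, function(f) rbinom(n, 2, f))   # n x n_snps counts

## standardize each SNP (column) using observed frequencies
p <- colMeans(x) / 2
w <- sweep(sweep(x, 2, 2 * p), 2, sqrt(2 * p * (1 - p)), "/")
A <- tcrossprod(w) / n_snps                      # n x n relatedness

## phenotype with genetic variance 0.5 and residual variance 0.5
u <- rnorm(n_snps, sd = sqrt(0.5 / n_snps))
y <- as.numeric(1 + w %*% u + rnorm(n, sd = sqrt(0.5)))
X <- matrix(1, n, 1)                             # intercept only

## negative restricted log-likelihood (up to a constant),
## parameterized by log variances to keep them positive
neg_reml <- function(log_var) {
  V <- exp(log_var[1]) * A + exp(log_var[2]) * diag(n)
  V_inv <- solve(V)
  XtViX <- t(X) %*% V_inv %*% X
  P <- V_inv - V_inv %*% X %*% solve(XtViX) %*% t(X) %*% V_inv
  0.5 * (as.numeric(determinant(V)$modulus) +
           as.numeric(determinant(XtViX)$modulus) +
           as.numeric(t(y) %*% P %*% y))
}

## bounded quasi-Newton search (not the average information algorithm)
fit <- optim(c(0, 0), neg_reml, method = "L-BFGS-B",
             lower = rep(-6, 2), upper = rep(3, 2))
est <- exp(fit$par)               # (genetic variance, residual variance)
h2_snp <- est[1] / sum(est)       # share of variance explained by SNPs
```

The matrix `P` is what makes this restricted rather than ordinary maximum likelihood: it projects out the fixed effects, so the likelihood is evaluated on residuals. With a sample this small, `h2_snp` is a noisy estimate of the simulated value.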

12 A couple of points discussed in group

If we are interested in making claims about differences between groups (for instance, higher versus lower SES individuals) in the variation in a phenotype explained by variation in the complete genotype matrix, could we construct a group-specific $w_{ij}$ and perform separate decompositions? E.g., separate the sample into lower and higher SES and find:

1. $w_{ij}$ for low SES
2. $w_{ij}$ for high SES
3. With #1 and #2 having separate $p_i$ (frequency of the reference allele for SNP i) if the $x_{ij}$ differ between the groups

While it is a good proof of concept to show that the method works on high-$R^2$ phenotypes like height/BMI, are there outcomes that fall in the sweet spot where the claim that there is x genetic contribution to the outcome is interesting to social scientists? If we choose an outcome like religiosity, is the method's assumption (that after dyads with high relatedness are removed from $w_{ij}$, the remaining genetic variation is as good as random) more difficult to defend?