Authors: Yumin Xiao. Supervisor: Xia Shen

Incorporating gene annotation information into fine-mapping quantitative trait loci in genome-wide association studies: a hierarchical generalized linear model approach

Author: Yumin Xiao
Supervisor: Xia Shen

Master Thesis in Statistics, Spring
School of Technology and Business Studies, Dalarna University, Sweden.

Yumin Xiao
Statistics Unit, School of Technology and Business Studies, Dalarna University, Borlänge, Sweden.
v10yumxi@du.se

Abstract

Genomic evaluation models, estimated by e.g. GBLUP and Bayesian methods, are used in genome-wide association (GWA) studies to take all genetic markers into account. Other shrinkage methods such as the LASSO are also applied to select among the massive number of variables in GWA data. However, these methods shrink using genotypic information only. They cannot effectively fine-map causal genes in a complex associated region, because the large amount of known gene annotation information is not used. This paper introduces a hierarchical generalized linear model (HGLM) approach for fine-mapping quantitative trait loci (QTL) that incorporates annotation information into a structured random-effect dispersion. Using a real GWA dataset, we tested five candidate genes against their corresponding phenotypes. The p-values of the likelihood ratio test for these genes varied with the length of the surrounding SNP region, and simulations showed that the test has a reliable false positive rate. We show that the proposed method is capable of fine-mapping highly significant signals.

INTRODUCTION

Pioneered by human geneticists as a potential solution to the challenging problem of finding the genetic basis of common human diseases, genome-wide association (GWA) studies have become an obvious general approach for studying the genetics of natural variation and traits of agricultural importance, because dense marker genotypes along the genome can now be obtained affordably by typing single nucleotide polymorphism (SNP) markers. Classic GWA methods are based on simple, repeated single-marker tests across the genome. To achieve more powerful mapping and better prediction, a unified model including multiple or even all SNPs in the genome is preferred. Such models have been estimated using the genomic best linear unbiased predictor (GBLUP) (ROBINSON 1991; DAETWYLER et al. 2010), Bayesian methods (e.g. MEUWISSEN et al. 2001; XU 2003; YI and XU 2008), and the recently proposed double hierarchical generalized linear model (DHGLM) (LEE and NELDER 2006), which also enables estimation of marker-specific variances (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011). For many of the phenotypes analyzed, these methods made it possible to predict which genes could be important in regulating phenotypic variation. One can obtain suggestive peaks that are sharply defined and contain a small number of clearly identified genes.

However, there are still problems that need to be addressed. First, detected signals are usually diffuse and may cover hundreds of genes without a clear center, owing to the existence of complex peaks of association. Second, shrinkage estimation in multiple-marker models, e.g. the LASSO (TIBSHIRANI 1996; PARK and CASELLA 2008), performs variable selection by purely mathematical shrinkage and does not use additional information to resolve such diffuse signals. Third, the genomes of many species are well annotated, so that SNPs are typed in segments with different annotations, yet such information has not been used in GWA analyses.

In GWA studies, deciding which associations are worth following up is highly important. The strongest associations do not always correspond to the causal polymorphisms for a given phenotype. There is therefore a strong need for a unified approach that shrinks non-causal associations while taking gene annotation information into account. The aim of this paper is to introduce a hierarchical generalized linear model (HGLM; LEE and NELDER 1996) approach for fine-mapping quantitative trait loci (QTL) in GWA analysis, incorporating annotation information into a structured random-effect dispersion. The method is capable of testing whether polymorphisms with a certain known annotation contribute more than the others in a complex associated SNP region. We use simulations to examine the false positive rates and power of the test. We also show, using a published GWA dataset, how the method performs.

METHODS

Data

ATWELL et al. (2010) performed GWA studies for 107 phenotypes of Arabidopsis thaliana and successfully detected a set of candidate genes. SNPs were genotyped for 199 ecotypes sampled from different locations around the world. In this paper, we select and analyze the following quantitative traits and candidate gene regions according to the published candidate gene list. FRI gene expression (FRI) is the response for candidate gene FRI. Leaf number at flowering time under growth conditions of 10 °C and 16 hours of daylight (LN10) is supposed to be regulated by candidate gene DOG1. Trichome density of plants treated with jasmonic acid water (JA) is supposed to be associated with genes TCL1 and ETC3. Sodium concentration (Na) has the candidate gene HKT1.

Models

An HGLM with structured dispersion provides a unified analysis for our purpose. The data are modeled on two levels: a mean model for the phenotype with random SNP effects, and a dispersion model for the SNP effects with annotation variables as predictors. The phenotype $y$ (an $n \times 1$ vector) is postulated to follow the random effect model

$$ y = X\beta + Zg + e \tag{1} $$

where $g \sim N(0, \mathrm{diag}(\lambda))$ are the SNP effects, $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ are the variances of the SNP effects, and the residuals are $e \sim N(0, \sigma^2 I)$. The fixed effects $\beta$ include an intercept. The SNP variances $\lambda$ are modeled as

$$ \log \lambda = B\gamma \tag{2} $$

where $B$ contains the annotation variables and $\gamma$ are the effects of gene annotation information on the variances of the SNPs.

Fitting algorithm

According to the extended likelihood principle, inference on the random SNP effects $g$ should be drawn through the h-likelihood (LEE and NELDER 1996), on the fixed effects $\beta$ through the marginal likelihood, and on the variance components $\lambda$ and $\sigma^2$ through the adjusted profile likelihood (LEE et al. 2007). For efficient estimation, however, we propose to initialize the variance components and iterate the following steps until convergence (see also SHEN et al. 2011).

Algorithm.

1. Solve the following weighted least squares (WLS) problem for $\hat{\beta}$ and $\hat{g}$:

$$ T_M^{\top} \Sigma_M^{-1} T_M \begin{pmatrix} \hat{\beta} \\ \hat{g} \end{pmatrix} = T_M^{\top} \Sigma_M^{-1} \begin{pmatrix} y \\ 0 \end{pmatrix} \tag{3} $$

where $T_M = \begin{pmatrix} X & Z \\ 0 & I \end{pmatrix}$ and $\Sigma_M = \begin{pmatrix} \sigma^2 I & 0 \\ 0 & \mathrm{diag}(\lambda) \end{pmatrix}$. The subscript $M$ stands for mean.

2. Update $\sigma^2$ by fitting the deviance residuals $d_{M1} = \hat{e}_{M1}^2 / (1 - q_{M1})$ using an intercept-only gamma GLM with prior weights $w_{M1} = (1 - q_{M1})/2$, where $\hat{e}_M = (\hat{e}_{M1}, \hat{e}_{M2})$ are the residuals of (3) and $q_M = (q_{M1}, q_{M2})$ are the diagonal elements of $T_M (T_M^{\top} \Sigma_M^{-1} T_M)^{-1} T_M^{\top} \Sigma_M^{-1}$. The subscripts 1 and 2 refer to the plants (elements 1 to $n$) and the SNPs (elements $n+1$ to $n+m$), respectively.

3. Update $\lambda$ by fitting the deviance residuals $d_{M2} = \hat{e}_{M2}^2 / (1 - q_{M2})$ using a gamma GLM with independent variable $B$ and prior weights $w_{M2} = (1 - q_{M2})/2$.

Testing type-of-covariate effects (annotation information)

A common method for testing a variance component is the likelihood ratio test. Here, the likelihood ratio test is performed by calculating the profile likelihood values under two models: one is the HGLM with structured dispersion, which is the model used in this paper, and the other is a simple normal-normal HGLM (i.e. a linear mixed model). From (1), the marginal distribution of $y$ is normal with mean $X\beta$ and variance

$$ V = R + ZGZ^{\top} \tag{4} $$

where $R = \sigma^2 I$ and $G = \mathrm{diag}(\lambda)$. The variance component parameter is $\theta = (\sigma^2, \lambda)$. Omitting an additive constant, the log-likelihood of the parameters $(\beta, \theta)$ is

$$ \log L(\beta, \theta) = -\tfrac{1}{2} \log |V| - \tfrac{1}{2} (y - X\beta)^{\top} V^{-1} (y - X\beta) \tag{5} $$

The profile likelihood of $\theta$ is

$$ \log L(\theta) = -\tfrac{1}{2} \log |V| - \tfrac{1}{2} (y - X\hat{\beta})^{\top} V^{-1} (y - X\hat{\beta}) \tag{6} $$

The modified profile likelihood (LEE et al. 2006)

$$ \log L_m(\theta) = \log L(\theta) - \tfrac{1}{2} \log |X^{\top} V^{-1} X| \tag{7} $$

takes into account the estimation of $\beta$ and matches the restricted maximum likelihood (REML) first described by PATTERSON and THOMPSON (1971). The adjustment term does not involve $\hat{\beta}$. Variance components are estimated by optimizing

$$ l(\theta) = -\tfrac{1}{2} \log |V| - \tfrac{1}{2} (y - X\hat{\beta})^{\top} V^{-1} (y - X\hat{\beta}) - \tfrac{1}{2} \log |X^{\top} V^{-1} X| \tag{8} $$

which is exactly the profile likelihood used in the likelihood ratio test, given by

$$ -2 \log \Lambda = -2 \log \frac{L_0}{L_1} = -2 (l_0 - l_1) \tag{9} $$

where $l_0$ is the optimized likelihood value of the simple normal-normal HGLM and $l_1$ is the optimized likelihood value of the HGLM with structured dispersion. The standard theory regarding the asymptotic distribution of $-2 \log \Lambda$ is that it is distributed as $\chi^2$, with degrees of freedom equal to the difference between the two models in the number of parameters estimated. However, it has been shown that for a variance component, the asymptotic distribution of the likelihood ratio statistic is a 50:50 mixture of $\chi^2(0)$ and $\chi^2(1)$ (e.g. MILLER 1977; SELF and LIANG 1987). Hence, the p-values used for the likelihood ratio test are half of the p-values calculated from $\chi^2(1)$.
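The iterative fitting scheme described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. It assumes that B consists of an intercept and a single 0/1 indicator for SNPs inside the candidate gene. In that special case the fitted mean of each weighted gamma GLM is a weighted mean of the deviance residuals, which simplifies to the sum of squared residuals divided by the sum of (1 - q) within each group. The function name `fit_hglm_two_group`, the starting values, the variance floor, the stopping rule and the simulated data are all hypothetical choices made for this sketch.

```python
import numpy as np


def fit_hglm_two_group(y, X, Z, in_gene, n_iter=500, tol=1e-8, floor=1e-6):
    """Iterate the WLS step and the two variance updates for
    y = X b + Z g + e, with one SNP variance inside the gene and one outside."""
    n, p = X.shape
    m = Z.shape[1]
    in_gene = np.asarray(in_gene, dtype=bool)
    # Augmented design T_M = [[X, Z], [0, I]] and augmented response (y, 0).
    T = np.block([[X, Z], [np.zeros((m, p)), np.eye(m)]])
    y_aug = np.concatenate([y, np.zeros(m)])
    sigma2 = float(np.var(y))            # crude starting values (assumption)
    lam = np.full(m, sigma2 / m)
    for _ in range(n_iter):
        inv_var = np.concatenate([np.full(n, 1.0 / sigma2), 1.0 / lam])
        TtS = T.T * inv_var              # T' Sigma^{-1}
        A = TtS @ T
        coef = np.linalg.solve(A, TtS @ y_aug)   # stacked (beta_hat, g_hat)
        e2 = (y_aug - T @ coef) ** 2
        # Leverages q: diagonal of T (T' S^-1 T)^-1 T' S^-1.
        q = np.einsum("ij,ji->i", T, np.linalg.solve(A, TtS))
        r = 1.0 - q
        # Weighted gamma fits with weights (1-q)/2 on d = e^2/(1-q):
        # for an intercept-only or two-group design the fitted mean is
        # the weighted mean, i.e. sum(e^2) / sum(1-q) within the group.
        new_sigma2 = e2[:n].sum() / r[:n].sum()
        new_lam = np.empty(m)
        for group in (in_gene, ~in_gene):
            rows = n + np.flatnonzero(group)
            new_lam[group] = max(e2[rows].sum() / r[rows].sum(), floor)
        change = abs(new_sigma2 - sigma2) + np.abs(new_lam - lam).max()
        sigma2, lam = new_sigma2, new_lam
        if change < tol:
            break
    return coef[:p], coef[p:], sigma2, lam


# Simulated toy data (hypothetical): 40 SNPs, 5 of them inside the gene.
rng = np.random.default_rng(1)
n, m = 150, 40
Z = rng.binomial(2, 0.5, size=(n, m)).astype(float)
in_gene = np.zeros(m, dtype=bool)
in_gene[18:23] = True
g_true = np.zeros(m)
g_true[in_gene] = [1.0, -0.8, 1.2, -1.1, 0.9]
y = 2.0 + Z @ g_true + rng.normal(0.0, 0.5, size=n)
X = np.ones((n, 1))                      # intercept only

beta_hat, g_hat, sigma2_hat, lam_hat = fit_hglm_two_group(y, X, Z, in_gene)
print("variance inside gene:", lam_hat[in_gene][0])
print("variance outside gene:", lam_hat[~in_gene][0])
```

With a two-group design, all SNPs in the same group share one fitted variance. A general B would need a proper gamma GLM fit with log link in place of the grouped ratios.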
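The halving rule for the p-value can be illustrated with a short Python sketch that uses only the standard library. It assumes that the two optimized REML log-likelihood values are already available. The function name `mixture_lrt_pvalue` and the numbers passed to it are hypothetical, and the statistic is clipped at zero to guard against numerical noise, which is a choice of this sketch rather than something stated in the text.

```python
import math


def mixture_lrt_pvalue(l0, l1):
    """Likelihood ratio statistic and p-value under a 50:50 mixture of
    chi-square(0) and chi-square(1).

    l0: optimized log-likelihood of the null model (normal-normal HGLM)
    l1: optimized log-likelihood of the HGLM with structured dispersion
    """
    stat = max(0.0, -2.0 * (l0 - l1))    # -2 log(Lambda), clipped at zero
    if stat == 0.0:
        return stat, 1.0
    # Upper tail of chi-square with 1 d.f.: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_chi2_1 = math.erfc(math.sqrt(stat / 2.0))
    return stat, 0.5 * p_chi2_1


# Hypothetical log-likelihood values, for illustration only.
stat, p = mixture_lrt_pvalue(l0=-105.0, l1=-103.0)
print(round(stat, 3), round(p, 5))       # prints: 4.0 0.02275
```

In this made-up example the statistic of 4.0 would give p of about 0.0455 under a plain chi-square(1) reference, and the mixture halves it to about 0.0228.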

RESULTS

Fine-mapping FRI region

Gene FRI is known to be associated with FRI gene expression (FRI), which we use as a standard example to test our model. First, we took 100 SNPs to the left of FRI and another 100 SNPs to its right, creating a testing window, but the calculated p-value showed that FRI was not significant (P = ). Second, we increased the number of SNPs on each side of the window from 100 to 200, and FRI was still not significant (P = ). Finally, we added 50 more SNPs to each side of the window, and FRI became significant (P = ). For the window containing 500 SNPs outside FRI and 5 FRI polymorphisms, the effect of each SNP was estimated by the HGLM with structured dispersion (Figure 1a). The model strongly shrank the estimated effects of the SNPs outside FRI towards zero, while the SNPs in FRI were highlighted. Four of the five polymorphisms in gene FRI had negative effects on FRI expression and the other one had a positive effect. The SNPs outside FRI shared one low variance value and the SNPs in gene FRI shared another, high variance value. The likelihood ratio test showed that this shift in variance was statistically significant.

Significance is related to window size

The process of testing gene FRI for FRI expression showed that significance was related to the length of the SNP region. Hence, to test whether polymorphisms with a certain known annotation contribute more than the others in a complex associated SNP region, we took SNP regions of different lengths harboring the known candidate genes. With a known functional polymorphism located in the center of the SNP region, the number of SNPs on each side of the polymorphism was varied evenly, starting from 50. Four more potential genes and their corresponding phenotypes were analyzed with our method: DOG1 (8 polymorphisms) for trait LN10, TCL1 (10 polymorphisms) and ETC3 (4 polymorphisms) for trait JA, and HKT1 (12 polymorphisms) for trait Na.
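The construction of a testing window and of the annotation predictor can be sketched as follows. This is an illustrative Python sketch rather than the code used for the analyses. It assumes that SNPs are indexed in chromosomal order and that B has an intercept column plus a 0/1 indicator for SNPs inside the candidate gene. The function names, the SNP indices and the values of gamma are hypothetical.

```python
import math


def build_window(gene_start, gene_end, n_flank, n_total):
    """Return window SNP indices, the 0/1 gene indicator and the rows of B.

    The gene occupies indices gene_start..gene_end (inclusive) and the
    window adds up to n_flank SNPs on each side, clipped to the chromosome.
    """
    lo = max(0, gene_start - n_flank)
    hi = min(n_total - 1, gene_end + n_flank)
    snps = list(range(lo, hi + 1))
    in_gene = [1 if gene_start <= j <= gene_end else 0 for j in snps]
    B = [[1, a] for a in in_gene]        # columns: intercept, annotation
    return snps, in_gene, B


def snp_variances(B, gamma):
    """SNP variances implied by the log-linear dispersion model log(lambda) = B gamma."""
    return [math.exp(sum(b * c for b, c in zip(row, gamma))) for row in B]


# Hypothetical example: a gene with 5 polymorphisms and 250 SNPs per side.
snps, in_gene, B = build_window(gene_start=1000, gene_end=1004,
                                n_flank=250, n_total=50000)
lam = snp_variances(B, gamma=(-6.0, 4.0))
print(len(snps), sum(in_gene))           # prints: 505 5
```

Under this coding, exp(gamma[0]) is the variance shared by SNPs outside the gene and exp(gamma[0] + gamma[1]) is the variance shared by SNPs inside it, so testing the annotation effect amounts to testing whether gamma[1] is zero.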
Figure 2 shows that DOG1 was statistically significant when the number of SNPs in the region was less than 200, but became non-significant as the number of SNPs increased. According to the analysis by ATWELL et al. (2010), DOG1 had a rank of 6 or 7, but its p-value did not reach the significance threshold. However, in the supplementary information of the original publication, the authors indicated that DOG1 is an essential candidate gene for many traits, especially those related to flowering time. Using our approach, it is possible to fine-map this gene within a small window (Figure 3a), but the significance was lost after increasing the window size.

Figure 1: a, Estimated SNP effects and variances by modeling FRI annotation in a 500-SNP region for FRI expression. The blue dots represent the SNP effects, whose axis is on the left. The purple line represents the SNP variances, whose axis is on the right. Gene FRI is located in the grey shaded area. b, Absolute values of the correlation matrix of the SNPs in the FRI region. FRI is centered in the region, with 100 SNPs on each side. The negative/positive labels indicate the left/right side of FRI.

The red and purple lines in Figure 2 fluctuated sharply when the number of SNPs was less than 200, and then both kept decreasing and became significant for large windows. HKT1 was not significant in a small region, but became significant when the region became larger. The case of FRI was similar, and the number of SNPs needed to reach the threshold was even larger. In the results of ATWELL et al. (2010), HKT1 and FRI had the top ranks with high significance for traits Na and FRI, respectively. The results based on our HGLM showed that it is able to fine-map them in a large window (Figure 3b). The green line in Figure 2 was entirely below the threshold line, which showed that gene ETC3 was easily fine-mappable for trait JA (Figure 3c). The orange line, representing gene TCL1, was also significant once the number of SNPs exceeded 100. These results are consistent with those of ATWELL et al. (2010), where both ETC3 and TCL1 ranked very high.

Since the credibility of the p-values was influenced by the length of the testing region, we ran simulations to evaluate the false positive rates (FPR) for these particular examples. The FPR is the proportion of tests that falsely reject the null hypothesis among all the tests performed. From 100 simulations for each point in Figure 2, we obtained 20 FPRs for each candidate gene. All the FPRs ranged from 0 to 0.03, which is conservative compared with the 5% threshold that we used in the likelihood ratio test. This indicates that the p-values are reliable.

DISCUSSION

Based on what is known about FRI, we tried to work out why its line in Figure 2 fluctuated sharply in SNP regions with fewer than 200 SNPs. A likely explanation is the complex pattern of associations in the region. A heatmap of the absolute values of the correlation matrix supports this speculation (Figure 1b). Correlations among the SNPs surrounding FRI are generally not high, but there are some segments with high correlation, for instance [-25, FRI], [25, 50] and [60, 80]. These broken interval segments are associated with the changes of the red line in Figure 2. The p-values of the likelihood ratio test can be strongly influenced by newly entered SNPs that are highly correlated.
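The quantity displayed in such a heatmap, the absolute pairwise correlation between SNP genotype columns, can be computed as in the following Python sketch. It is only an illustration: the function name `abs_correlation_matrix` and the toy genotypes are hypothetical, and every SNP is assumed to be polymorphic in the sample so that no column has zero variance.

```python
import math


def abs_correlation_matrix(columns):
    """Absolute Pearson correlations between SNP genotype columns.

    columns: list of equal-length lists, one list of genotype codes per SNP.
    """
    def centered(col):
        mean = sum(col) / len(col)
        return [v - mean for v in col]

    cols = [centered(c) for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    k = len(cols)
    out = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            dot = sum(u * v for u, v in zip(cols[a], cols[b]))
            out[a][b] = abs(dot) / (norms[a] * norms[b])
    return out


# Toy genotypes (0/1/2 coding) for three SNPs in six plants.
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [2, 1, 0, 1, 2, 0]    # mirror image of snp1
snp3 = [0, 2, 1, 1, 2, 0]
R = abs_correlation_matrix([snp1, snp2, snp3])
print(R[0][1], R[0][2])      # prints: 1.0 0.25
```

Blocks of entries close to 1 in this matrix correspond to segments of highly correlated SNPs.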
Similar correlation blocks also occurred in the HKT1 example.

The trends shown by the lines in Figure 2 might seem strange. However, the data offer some explanations. First of all, regarding the DOG1 region, the increasing curve actually indicates that it is not possible to fine-map DOG1 using such data, since we can never be certain that DOG1 contributes significantly more than a large polygenic region. For the other examples, the genes are fine-mappable given a large enough window size. The minimum window size for attaining significance differs because the correlation structure in each gene region shows a different pattern. It is straightforward to understand that for a highly correlated region, a larger window size is required in order to distinguish the contribution of the candidate gene from the others.

Figure 2: Relationship between the number of SNPs in the region (horizontal axis) and the p-value (vertical axis) for each potential causal gene. The red line is FRI for FRI expression, the orange line is TCL1 for JA, the green line is ETC3 for JA, the blue line is DOG1 for LN10, and the purple line is HKT1 for Na. The dashed line marks the threshold p-value.

Figure 3: Estimated SNP effects and variances by modeling gene annotation in a testing region. Each panel plots SNP effect and SNP variance against the positions of the SNPs on the chromosome. a, DOG1 in a 200-SNP region for LN10. b, HKT1 in a 300-SNP region for Na. c, ETC3 in a 300-SNP region for JA. d, TCL1 in a 300-SNP region for JA.

In general, the function of a causal gene should stand out among all the SNPs, even though it is hard to distinguish it from its surrounding SNPs in a small associated region. The ranks and significances of the candidate polymorphisms suggested by ATWELL et al. (2010) were also based on the whole genome, which provides us with a good list to test. Genes like FRI are not obvious in small regions because of the many highly correlated non-causal SNPs. If the p-values in large regions have a zero asymptote, this suggests promising significance of the tested gene. On the other hand, genes like DOG1 are difficult to fine-map but might be worth following up in further research if a larger sample size is available.

Another problem worth attention is testing gene annotation information for binary trait loci. Many phenotypes in the data from ATWELL et al. (2010) are in fact binary responses. The HGLM with structured dispersion can be applied to binary data simply by adding a link function and changing the forms of the deviances d_M1 and d_M2 in the algorithm. Nevertheless, calculating the likelihood and deriving the asymptotic distribution of the likelihood ratio test statistic for a binary HGLM with structured dispersion is still an open problem and beyond the scope of this paper.

LITERATURE CITED

ATWELL, S., Y. S. HUANG, B. J. VILHJÁLMSSON, G. WILLEMS, M. HORTON, et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465:
DAETWYLER, H. D., R. PONG-WONG, B. VILLANUEVA, and J. A. WOOLLIAMS, 2010 The impact of genetic architecture on genome-wide evaluation methods. Genetics 185:

LEE, Y., and J. A. NELDER, 1996 Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society, Series B 58:
LEE, Y., and J. A. NELDER, 2006 Double hierarchical generalized linear models. Applied Statistics 55:
LEE, Y., J. A. NELDER, and M. NOH, 2007 H-likelihood: problems and solutions. Statistics and Computing 17:
LEE, Y., J. A. NELDER, and Y. PAWITAN, 2006 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman & Hall/CRC.
MEUWISSEN, T. H. E., B. J. HAYES, and M. E. GODDARD, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:
MILLER, J. J., 1977 Asymptotic properties of maximum likelihood estimates in the mixed model of the analysis of variance. Annals of Statistics 5:
PARK, T., and G. CASELLA, 2008 The Bayesian lasso. Journal of the American Statistical Association 103.
PATTERSON, H., and R. THOMPSON, 1971 Recovery of inter-block information when block sizes are unequal. Biometrika 58:
ROBINSON, G. K., 1991 That BLUP is a good thing: the estimation of random effects. Statistical Science 6:
RÖNNEGÅRD, L., and Y. LEE, 2010 Hierarchical generalized linear models have a great potential in genetics and animal breeding. Proc. WCGALP, Leipzig, Germany.
SELF, S. G., and K.-Y. LIANG, 1987 Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 82:
SHEN, X., L. RÖNNEGÅRD, and Ö. CARLBORG, 2011 Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings 5(Suppl 3).
TIBSHIRANI, R., 1996 Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58:
XU, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163:
YI, N., and S. XU, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179: