A Bayesian Graphical Model for Genome-wide Association Studies (GWAS)

Size: px

Start display at page:

Download "A Bayesian Graphical Model for Genome-wide Association Studies (GWAS)"

Erica Sparks
6 years ago
Views:

1 A Bayesian Graphical Model for Genome-wide Association Studies (GWAS) Laurent Briollais 1,2, Adrian Dobra 3, Jinnan Liu 1, Hilmi Ozcelik 1 and Hélène Massam 4 Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, 60, Murray Street, Toronto, M5T 3L9, Canada laurent@lunenfeld.ca Abstract: The present paradigm to analyze GWAS is to perform an exhausting testing of all single SNP associations with the response variable with the major drawback that the selected subset of SNPs has in general a very low predictive value. As a shift to the usual approach, we propose here a general statistical framework for GWAS based on Bayesian graphical modeling and able to: 1) Assess the joint effect of multiple SNPs (linked or unlinked); 2) Explore the model space efficiently using the Mode Oriented Stochastic Search (MOSS) algorithm; We illustrate our new methodology through an application to the CGEM breast cancer data. Our algorithm selected several SNPs embedded in multi-locus models with high posterior probabilities. Most of the selected SNPs have a biological interest. Interestingly, several of them would not have been detected in the single-snp testing approach. Finally, we discuss the impact of informative priors in this statistical framework through simulations. Keywords and phrases: Keywords: Graphical model, Bayesian, Stochastic Search, GWAS, SNP, Breast Cancer. 1. Introduction The emergence of new high-throughput technologies for SNP genotyping and their application to large scale genome-wide association studies (GWAS) have generated great promises that the genetic basis of many common human diseases could be elucidated [18, 22, 23, 25, 32, 33]. These expectations have been partially fulfilled and indeed, GWAS have identified hundreds of genetic variants associated with common human diseases and complex traits, providing valuable insights into their genetic mechanisms [16, 17]. The rationale underlying GWAS is that common genetic variants (i.e. present in more that 1-5% of the population) could explain most of the attributable risk of common human diseases. This is also referred to as the common disease, common variant (CDCV) hypothesis. The present paradigm for designing and analyzing GWAS is relatively straightforward and involves the collection of more than 1,000 cases and 1,000 Research supported by a MITACS seed project Corresponding auhtor 1

2 L. Briollais et al./bayesian approach for GWAS data 2 controls and an exhaustive screen of > 500K SNPs using simple univariate association tests (sometimes adjusted for confounding risk factors). It follows a ranking of the genetic variants based on the test statistics (or corresponding p-values) and determination of significance threshold using appropriate correction for multiple testing. Genetic variants with p-values below the threshold are declared significant and often followed-up in replication studies. In many GWAS, this threshold is in the order of 10 8 to 10 7 [32], although no consensus exists on how this should be determined. Despite its relative merits at identifying new genetic variants, GWAS have also given rise to many criticisms and increasing skepticism. Most of the attempts to constructing multi-variable risk models (i.e. multi-locus/multiple-snps), that could support the CDCV hypothesis, have used univariate rankings of the SNPs (that individually measure the dependency between each candidate SNP and the response) to select those to be considered into the final model. However, there is no theoretical justification of the implicit claim that the resulting set of genetic variants can be converted into a good predictive model. This is clearly illustrated by the GWAS on breast cancer, where the panel of SNPs associated with breast cancer had a low attributable risk and low predictive ability [12, 40]. The actual limitations of GWAS analysis methods lie in their incapacity to account for the joint effects of multiple genetic variants that could be linked or unlinked on the genome, and also to estimate the effect of rare variants with minor allele frequency (MAF) lower than 5%. Attempts to remedy these problems have been proposed. A recent paper demonstrated theoretically the feasibility of genome-scale testing of the joint effect of two genetic markers [29] but did not propose any particular statistical framework for performing the variable search. The development of variable selection and model search methods in the context of high-dimensional data is still in the early stage and encompasses two main statistical approaches: penalized regression and Bayesian selection methods. The most popular penalized regression method is the LASSO [38], further extended for GWAS analysis [19]. The principle consists in estimating a vector regression coefficients associated with the SNP effects by maximizing a penalized log-likelihood function. The LASSO performs a variable selection and parameter estimation simultaneously, the SNPs not selected in the final model having regression coefficients set to zero. In the Bayesian framework, various competing approaches have been developed. A common feature is to propose a regression model for the SNP joint effect on the response and to specify a prior for the regression coefficients. The best multiple SNP models can be determined by evaluating the posterior probability of the model. Since the model space has a high dimension, efficient search algorithms are often used based on MCMC approaches or variants of those. Models are sampled from their posterior probability and the final model choice includes models with the highest posterior probabilities after convergence of the MCMC algorithm. Expert prior information, whenever available, can be included to guide the model search. Adjustment for the multiplicity problem, i.e. for the number of models evaluated, in the Bayesian framework has also been proposed [34]. Although promising, some of these approaches have been limited to low-dimension models with two SNPs [44], SNPs that are in link-

3 L. Briollais et al./bayesian approach for GWAS data 3 age disequilibrium [39], continuous data [13] or sometime need a heavy first step selection to reduce the high SNP dimensionality [42]. Therefore there is a compelling need to develop appropriate methodologies for GWAS, allowing an efficient evaluation of a large number of multi-snp models, where the SNPs can be linked or unlinked. In this paper, we propose a general framework based on discrete Bayesian graphical model able to 1) Assess the joint effect of multiple SNPs (linked or unlinked); 2) Explore the model space efficiently using the Mode Oriented Stochastic Search (MOSS) algorithm [9]. We illustrate our new methodology through simulation and an application to the CGEM breast cancer GWAS data. 2. Discrete Bayesian Graphical models for Modeling the Joint Effect of SNPs in GWAS 2.1. Model notations An undirected graph G is a pair (V, E) where V is a finite set of vertices, and E, the set of edges, is a subset of the set V V of unordered pairs of distinct vertices {i, j}, i V, j V. In GWAS, the vertices represent the disease status and the SNPs and the edges the associations between the variables, i.e. the SNPs and the disease status and between the SNPs themselves. Let X = {X 1,..., X r } be a vector of random variables and let G = (V, E) be a graph with vertex set V = {1, 2,..., r}. Each variable X i is represented by the vertex i. For A V, X A indicates the collection of random variables {X i, i A}. For a given graph G and any probability distribution of X, a probability distribution for X is said to be Markov with respect to G if for any two non-adjacent vertices i, j V, X i is independent of X j given X V \i,j. A graphical model is a family of probability distributions which are Markov with respect to a given graph G. A discrete graphical model is a graphical model where the random variables are discrete. Let X be the random vector of interest. Suppose that a random sample x = (x 1,..., x n ) of values of X has been observed. Let us assume that we choose to do a model search in the family of models M 1,..., M k. We write the models as M j = {p(x ϑ), ϑ Θ j }, j = 1,..., k (1) where ϑ is a parameter in the parameter set Θ j and p(x ϑ) is a probability density function. In the particular case where the model is a graphical model, the parameter space is defined by the underlying graph G and we identify models M j with their underlying graph G j. In a Bayesian framework we assume a prior probability P (M j ), j = 1,..., k on the set of models (M 1,..., M k ) and a prior probability on the parameters ϑ, and want to derive the posterior model probabilities P (M j x) for each one of

4 L. Briollais et al./bayesian approach for GWAS data 4 the models M 1,..., M k, that is, the conditional distribution of M j given the data. The Bayesian solution is to choose the model with the highest posterior probability. According to the Bayes theorem, the posterior probability for M j is P (M j x) = P (x M j )P (M j ) k i=1 P (x M i)p (M i ) (2) The term k i=1 P (x M i)p (M i ) in (2) is a constant. Therefore we can write P (M j x) P (x M j ) }{{} (the Marginal Likelihood) P (M j ) }{{} (the Model P rior) (3) Suppose that the random vector of interest X consists of r binary factors {X 1,..., X r }, each having two levels: 0 and 1. Take N individuals, and classify them accordingly to the r binary factors in V = {1, 2,..., r}, let E be the collection of all non empty subsets of V and E 0 the collection of possible subsets of V including, then the elements F in E 0 are in 1-1 correspondence with the cells in the contingency table and we can use p F to denote the cell probability p F = P (X v = 1, v F, X v = 0, v F ). (4) Similarly, we can use n F, F E 0 to denote the cell counts. We know that since N is fixed, the cell counts follow a multinomial distribution with the following distribution function f((n), p) = ( ) N p N F E n F (n) F E p n F F (5) An alternative representation of the multinomial distribution is to write it in a natural exponential family form using loglinear parameters instead of cell probabilities. We use the following loglinear parameters Let θ E = log It can be shown (See [30]) that p ( 1) E\F F E,F E 0 F with θ = log p. (6) D = {D E D is complete in G}. (7) X l X s X V \{l,s} (8) is equivalent to exp θ E = p ( 1) E\F F E,F E 0 F = 1, E D (9)

5 L. Briollais et al./bayesian approach for GWAS data 5 that is, the generalized odds ratio exp θ E, E D are equal to 1 or equivalently θ E = 0, E D. (10) Marginal cell counts refer to the counts of outcomes when we classify subjects on a subset of the variables of the original contingency table. So, if we use y to denote the marginal cell counts, for any given cell F, we have y F = n D (11) D F,D E 0 and clearly y = N. For a given contingency table, the transformation from the cell counts {n F } to the marginal cell counts {y F } is a simple linear transformation with Jacobian equal to 1. If we assume a multinomial distribution for the cell counts, we readily obtain, up to a multiplicative constant, the following natural exponential family distribution for the marginal cell counts y = (y E, E E 0 ) as a natural exponential family: ( f(y; θ, G) = exp exp( ) θ D )) θ D y D N log(1 + D D E E D G E with θ = (θ D, D D) (12) The conjugate priors for distributions in the exponential family have density π G (θ s, α) = I G (s, α) 1 exp{ ( θ D s D α log 1 + exp( ) θ D ) }, D D E E D G E (13) for some s = (s D, D D) R D and α R, where I G (s, α) is the normalizing constant Specification of the prior A method to construct hyperparameters of a proper prior π D (θ D (s, α)) is to start with a fictive prior contingency table with all cell counts n F positive, not necessarily integers. With α denoting the total count in the given fictive contingency table, y D denoting the marginal cell counts, we can take as hyperparameters α = N and s D = y D, D D. Lack of prior information can be expressed through what is sometimes called a flat prior by taking all the fictive cell entries to be equal and equal to α I. We used this latter prior specification in our application to the breast cancer GWAS Posterior of a model The posterior of G is proportional to the ratio of the two normalizing constants: P (G Y ) I G (y + s, n + α)/i G (s, α). (14)

6 L. Briollais et al./bayesian approach for GWAS data 6 For G decomposable, the prior π(θ α, s) is identical to the hyper Dirichlet (see [30]). It therefore follows that the normalizing constants I G can be computed analytically when the graph G is decomposable. When G is non decomposable, I G needs to be computed numerically Regression induced by a graphical model In GWAS, we are interested in modelling the response variable (i.e. case control status) as a function of the SNP variables. Let Y = X r, r V be a response variable and X A, A V \ {r} be a set of explanatory variables (i.e. the SNPs). Denote by (n) A {r} and (n) A the corresponding marginal tables. Here (n), (n) A {r} and (n) A are cross-classifications involving X V, X A {r} and X A, respectively. The connection between log-linear models and the regressions derived from them has been explored in [9]. They showed that the marginal likelihood of the regression [Y X A ] can be expressed as the ratio between the marginal likelihoods of the saturated log-linear models for (n) A {r} and (n) A. 3. SNP selection with MOSS in GWAS setting A major challenge in GWAS is to select a subset of SNPs associated with the response of interest among thousands of candidate. The Bayesian paradigm for model selection and model determination involves choosing models m with high posterior probability Pr(m (n)) selected from a set M of competing models. We denote by (n) the available data. We associate with each candidate model m M a neighborhood nbd(m) M. Any two models m, m M are connected through at least a path m = m 1, m 2,..., m l = m such that m j nbd(m j 1 ) for j = 2,..., l. The neighborhoods are defined with respect to the class of models considered. In a regression model m containing k regressors, the neighborhood is obtained by: (i) addition moves: individually including in m any variable that is not in m. The resulting neighbors contain k + 1 regressors; (ii) deletion moves: individually deleting from m any variable that belongs to m. The resulting neighbors contain k 1 regressors; and (iii) replacement moves: replacing any one variable in m with any one variable that is not in m. The resulting neighbors contain k regressors. In a graphical model m, the neighborhood consists of those graphical models obtained from m by adding or deleting one edge ([10] and [6]). The mode oriented stochastic search (MOSS, henceforth) is an algorithm [9] that identifies models in { } M(c) = m M : Pr(m (n)) c max m M Pr(m (n)), (15) where c (0, 1). As proposed in [26], models with a low posterior probability compared to the highest posterior probability model are discarded. This choice

7 L. Briollais et al./bayesian approach for GWAS data 7 drastically reduces the size of the target space from M to M(c). For suitable choices of c, M(c) can be exhaustively enumerated. MOSS makes use of a current list S of models that is updated during the search. Define the subset S(c) of S in the same way we defined M(c) based on M. Define S(c ) with 0 < c < c so that S(c) S(c ). Let q be the probability of pruning the models in S \ S(c). A model m is called explored if all its neighbors m nbd(m) have been visited. A model in S can be explored or unexplored. MOSS proceeds as follows: PROCEDURE MOSS(c,c,q) END. (a) Initialize the starting list of models S. For each model m S, calculate and record its posterior probability Pr(m (n)). Mark m as unexplored. (b) Let L be the set of unexplored models in S. Sample a model m L according to probabilities proportional with Pr(m (n)) normalized within L. Mark m as explored. (c) For each m nbd(m), check if m is currently in S. If it is not, evaluate and record its posterior probability Pr(m (n)). If m S(c ), include m in S and mark m as unexplored. If m is the model with the highest posterior probability in S, eliminate from S the models in S \ S(c ). (d) With probability q, eliminate from S the models in S \ S(c). (e) If all the models in S are explored, eliminate from S the models in S \ S(c) and STOP. Otherwise go back to step (b). In a MOSS search, the model explored at each iteration is selected from the most promising models identified so far. In MC 3 [26, 27] or SSS [21], this model is selected from the neighbors of the model evaluated at the previous iteration. This feature allows MOSS to move faster towards regions of high posterior probability in the models space. 4. Discretizing the SNP Variables The conjugate priors in [30] are appropriate for the analysis of general multiway tables in which each variable is allowed to have any number of categories. In order to increase the mean number of observations per cell, we transform the polychotomous data into dichotomous data. The response variable y in our analysis of the CGEM data is dichotomous by design (i.e. case control status). Diallelic SNPs can be represented as three-category discrete variables, or can be dichotomized as presence of 0 vs. absence of 0, or as presence of 1 vs. absence of 1. Coding of the latter type was used by transforming the categorical variables as follows. A segregating SNP site has three possible genotypes: 0/0, 0/1 and 1/1, where 0 is the wild type and 1 is the mutant allele. We choose the splits τ having the largest marginal likelihood of the regression of y on x τ, where x τ

8 L. Briollais et al./bayesian approach for GWAS data 8 is the binary variable obtained after splitting the continuous variable x at the value τ. Ideally, we would like to consider multiple choices of splits over all available predictors and choose those splits that lead to large values of the marginal likelihood of regressions of y on one, two or more categorical predictors. Each predictor will have several categorical versions associated with various splits sets. Unfortunately the datasets resulting from genome-wide studies precludes us to perform such large scale computations, and we need to choose splits sequentially for each covariate. 5. Risk Estimation and Prediction Based on Bayesian Model Averaging Once MOSS has identified a set of regression models S for some c (0, 1), one can estimate the risk associated with the selected SNPs and perform risk prediction. This is done using Bayesian Model Averaging. Let us consider a regression model m j S(c), which is [Y X Aj ] with A j V \ {r} and j B. Here B is a set of indices for the collection of models over which we are averaging. The regression model of Y on the remaining variables is a weighted average of the regression models in S, where the weights represent the posterior probability of each regression model (see for example [43]): Pr(Y = y (n)) = j B Pr(Y = y (n) Aj ) Pr(m j (n)). (16) Since we assumed that all models are apriori equally likely, the posterior probability of each regression is equal to its marginal likelihood normalized over all the models in S: Pr(m j (n)) = Pr(r A j ) Pr(r A l ). It is shown in [26] that the weighted average of regressions in (16) has a better predictive performance than any individual model in S. The revelance of each predictor X j can be quantified by its posterior inclusion probability defined as the sum of the posterior probabilities of all the models that include X j. To estimate the parameter θ of a given graphical model, we use the algorithm described in [9], called the Bayesian iterative proportional fitting (Bayesian IPF, henceforth) algorithm, which samples from the joint posterior (eq. 14) of these θ s. The resulting posterior samples can be used to estimate the coefficients of the regressions in S(c) and thus the probability of having the response of interest (e.g. the disease) given the SNPs selected. The coefficients for the Bayesian model averaging regression Pr(Y = y (n)) in (16) are estimated by sampling from the joint posterior distribution of the regression coefficients corresponding to a given graphical model m j = [Y X Aj ], j B, for a number of iterations proportional with Pr(m j (n)). l B

9 L. Briollais et al./bayesian approach for GWAS data 9 6. Simulation Study To assess the properties of our novel Bayesian graphical model approach, we simulated data sets that mimic the breast cancer GWAS data analyzed in Section 7. We considered four simulation scenarios that represent various multi-snp models commonly studied in genetic association studies. In our Scenario 1, we consider a 5-SNP main effects model with an interaction between SNP2 and SNP3. For each individual, the case-control status was generated from a Bernouilli trial with probability p of being a case given by the following model logit(p) = β 0 + β i SNP i + β 23 SNP 2 SNP 3. i=1..5 We chose the β s to reflect the range of associations detected in our breast cancer GWAS in section 7. More specifically, we fixed the regression coefficients at β 1 = 0.405, β 2 = 0.916, β 3 = 0.182, β 4 = 0.405, β 5 = and β 23 = β 2 β 3, corresponding to odds-ratio for the genetic association of 1.5, 3.0, 1.2, 0.67, 5.0, 1.92, respectively. The parameter β 0 was determined so that we have the same number of cases and controls as in our real data analysis, that is 1, 145 cases and 1, 142 controls. For each SNP, we generated two genotypes with probability f i, 1 f i, i = 1,..., 5 from a Bernoulli trial. We fixed f i for the five SNPs at 0.10, 0.10, 0.36, 0.36 and 0.02, respectively. These genotype frequencies correspond to a minor allele frequency (MAF) of 0.05, 0.05, 0.2, 0.2 and 0.025, respectively, under a dominant genetic model, and thus represent a wide spectrum of uncommon, common and rare SNPs. Scenario 2 corresponds to a SNP-SNP interactions model with three 2-way interactions without main effects. For each individual, the case-control status was generated from a Bernouilli trial with probability p of being a case given by the following model logit(p) = β 0 + β 23 SNP 2 SNP 3 + β 34 SNP 3 SNP 4 + β 45 SNP 4 SNP 5. The MAF of the SNPs was 0.25, 0.25, 0.36, 0.36 and 0.15, respectively for the five SNPs, and the regression coefficients for the interactions were β 23 = 1.099, β 34 = and β 45 = Scenario 3 corresponds to a fine mapping problem where a causal SNP (SNP 2 ) is associated with a disease status Y, and where SNP 2 is in linkage disequilibrium (LD) with two other SNPs, SNP 1 and SNP 3. These two other SNPs are conditionally independent of the disease status given SNP 2 (Figure 1), so that there is only one causal locus in the region and this causal locus is

10 L. Briollais et al./bayesian approach for GWAS data 10 observed. The 3 SNPs constitute a cluster of SNPs and are denoted X 1, X 2, and X 3 for simplicity. The distribution of the four discrete variables in the graph was generated from a multinomial distribution and can be represented by a 4way contingency table, where the joint cell probabilities are given by the log-linear model: logp ijkl = θ Y i + θ X1 j + θ X2 k + θ X3 l + θ Y X2 ik + θ X1X2 jk + θ X2X3 kl where the subscripts i, j, k, l {0, 1} index the levels of the variables Y, X 1, X 2, and X 3, respectively. We chose as parameters θ1 Y = 0, θ X1 1 = θ X2 1 = θ X3 l = ln(0.2) = 1.6, θ Y X2 11 = 0.717, θ X1X2 11 = θ X2X3 11 = All the other parameters were set to 0. The association between SNPs is equivalent to a level of LD of Q = 0.71 and D close to 0.55 [7], that is a moderate to strong level of LD as generally observed among SNPs in a same haplotype block. The MAF was 0.20 for all three SNPs and we assumed a dominant model for SNP 2 main effect. [Fig 1 about here.] Scenario 4 is very similar to Scenario 3 but corresponds to a situation where the 3 SNPs are functional and conditionally associated with the disease status (Figure 1). The 3 SNPs could be in different haplotype blocks, so we assumed a lesser extent of LD compared to Scenario 3. The corresponding log-linear model is now: logp ijkl = θ Y i + θ X1 j + θ X2 k + θ X3 l + θ Y X1 ij + θ Y X2 ik + θ Y X3 il + θ X1X2 jk + θ X2X3 kl The log-linear parameters were chosen as θ1 Y = 0, θ X2 1 = ln(0.05) = 3.0, θ X1 1 = θ X3 1 = ln(0.25) = 1.4, θ Y X1 11 = θ Y X2 11 = θ Y X3 11 = 0.6 and θ X1X2 11 = θ X2X3 11 = 1.1 for the SNP main effects and the associations between SNPs. This is equivalent to a level of LD between the SNPs of Q = 0.5 and D close to 0.33 [7], that is a moderate level of LD as generally observed across haplotype blocks. The MAF was 0.25 for SNP1 and SNP3 and 0.05 for SNP2 and we assumed a dominant model for each of them. For each simulation scenario, we also added to each simulated data set, 53K random SNPs taken from the original breast cancer GWAS in section 7. This represents approximately 10% of the total number of SNPs available in this study. These SNPs were not associated with the simulated case-control status and thus correspond to noise variables. Since these SNPs were extracted from a real SNP array, they represent a real genome-wide correlation structure. Control of False and True Discovery Rate For our different simulation scenarios, we estimated the False Discovery Rate (FDR) [2], given by the proportion of non causal SNPs discovered by each approach among all the SNPs discovered by this approach. FDR estimates were only based on the models main effects. The main effect FDR (FDRm) for a SNP j is defined for a particular method as

11 L. Briollais et al./bayesian approach for GWAS data 11 F DR m = k=1 N d I( SNP j is discovered in data set k SNP j / causal SNPs ) k=1 N d I( SNP j is discovered in data set k) where N d is the number of simulated data sets and I(.) is the indicator function. For the scenario 3 of the simulations, we also computed the cluster FDR (FDRc), where the cluster c included the 3 SNPs in linkage disequilibrium (with only one of them associated with the outcome), defined as F DR c = k=1 N d I( SNP j is discovered in data set k SNP j / cluster c ) k=1 N d I( SNP j is discovered in data set k) We also computed the true discovery rate (TDR) for each individual SNP j associated with the outcome as k=1 N T DR = d I( SNP j is discovered in data set k) N d and an overall TDR as T DR overall = k=1 N d j=1 N s I( SNP j is discovered in data set k) N d N s where N s the number of causal SNPs associated with the outcome. In scenarios 1 and 2 of our simulations, we also computed a TDR for each specific pair of SNPs (j, j ) involved in an interaction term of our simulated model as T DR pair = k=1 N d I( SNPs j and j are discovered in data set k) N d Note that the models of analysis did not include any interaction between the SNPs but is only based on the model main effects. The FDR and TDR results are presented in table 1 and table 2, respectively. We tried to control the FDRm at the same level for each method compared and compute the TDR at this FDRm level. However, this was not always possible with the Bayesian variable selection regression (BVSR) approach in simulation scenarios 2 and 4, the BVSR method being more designed to optimize the SNP predictive performance than controlling the FDR or type I error. In these two situations, even with a more relaxed control of FDRm, BVSR had the worst TDR among all the other methods, so we left the level of FDRm as it was for this approach, concluding it had the worst performance.

12 L. Briollais et al./bayesian approach for GWAS data 12 Additionally to each FDR and TDR result, for each SNP associated with the disease, we also included its rank based on an univariate test statistic with the χ 2 and the Bayes Factor (BF), where the BF is comparing a model with a single SNP main effect and a model without SNP, these statistics being averaged over all the simulated data sets. Method comparison For method comparison, we also fitted the penalized regression method LASSO [38], the BVRS approach [13] and a Markov blanket-based method (DASSO- MB) [14] to our simulated GWAS data sets. For the LASSO, we used a formulation proposed for GWAS data in [19], which is based on shrinkage priors. In this version of the LASSO, each regression coefficient is assigned independent shrinkage priors with a density that is sharply peaked at zero. The authors proposed to use the double exponential (DE) distribution or the normal exponential gamma (NEG) distribution as prior density function. Parameter estimates are obtained by maximizing the posterior density p(β X, y) over β, where X is the normalized genotype data and y the response variable (i.e. the case-control status). Taking logarithms in Bayes Theorem, the problem can be thought of as maximizing the penalized log-likelihood function: logp(β y, X) = L(β) f(β) + const where L is the log-likelihood for the logistic regression model and f is the logprior density with a minus sign to allow f to be interpreted as a penalty function. With the DE prior, the maximization of the penalized log-likelihood is equivalent to the LASSO procedure. We used the DE distribution prior which leads to a maximization equivalent to the LASSO ([38]). To control the FDRm (see above definition) at a given level, we can change the hyper-parameter of the prior DE distribution, so we chose the hyper-parameter that ensures the FDRm to be at the same level as the other approaches in each simulation scenario. The BVSR approach defines some normal priors for the regression coefficients associated with the SNP effects and also for the probability of each regression coefficient to be zero in the model, i.e. controlling the sparsity of the model. The originality of this approach is to have a variability parameter in the regression coefficient normal prior that depends on the proportion of genetic variability explained (PVE) by the selected set of SNPs in the model (which itself is function of the sparsity parameter). This prevents the risk that more complex models explained substantially higher PVE. The induced prior on the PVE given the sparsity parameter is therefore assumed flat in the range (0,1). BVSR was initially developed to find models associated with high PVE for a continuous response. An extension to binary responses was also proposed using a probit link function. The inference is conducted using MCMC and models with highest posterior PVE are identified. The model size for BVSR could vary between 1 to 5

13 L. Briollais et al./bayesian approach for GWAS data 13 and the hyper-parameters were chosen so that the control of FDRm was similar to the other approaches in the various simulation scenarios. However, this control of FDR could not be achieved in the scenarios 2 and 4 of the simulations. The DASSO-MB approach uses tests of conditional independence among discrete variables (disease variables and SNP variables) based on a frequentist probabilistic graphical model framework. The tests of conditional dependence are based on G 2 statistics for contingency tables. The algorithm starts with a forward step selecting the SNP with the highest G 2 statistic (conditionally on the SNPs already selected) and dependent on the response variable. Each time a SNP is added to the model, a backward step is applied to test the conditional dependence of all other SNP variables with the response conditional on the rest of variables until the variables selected do not change. When no other SNP is added to the selection, a final backward step is performed to test the conditional dependence of each selected SNP with the response conditional on all possible subsets of SNPs selected in forward step. A significance threshold for the forward and backward steps need to be specified, we use the default option of 0.01 for both steps and we ensured in our simulations that it controls the FDRm at the same level as the other approaches. For MOSS and BVSR, we fitted two-snp main effects models as we did in our real application. Fitting more complex models would be possible, e.g. with three SNPs, but given the sample size available the variable selection would be very unstable. With the LASSO and DASSO, although the model size could be theoretically larger than 2, in most situations the models identified by these two approaches had a model size of 1 or 2 SNPs. To control FDRm with MOSS, we vary the parameters c and c, which control the number of models and variables selected and we verified in our simulation studies that the FDRm was controlled at the same level as the other methods. Results The results related to FDR estimates are presented in Table 1. In scenario 1, all the four methods have a level of FDRm varying between 15.4(MOSS) to 19.4% (DASSO). For the same type of simulation problem, previous papers reported some very similar control of FDR level ([19], [15]). In the scenario 2, all methods control FDRm at a very low level (i.e <3%) except BVSR which has a very inflated FDRm of 53%. The scenario 3 is probably the most complex since the goal is to find a causal variant among a small chromosomal region comprised of 3 SNPs. The FDRm was close de 50% for all the methods, DASSO having the highest rate of 59%. Finally in scenario 4, the control of FDRm was about 14% for all the methods but BVSR which has an inflated rate of 27%. In scenario 3, the use of FDRc did not improve the control of FDR over the use of FDRm. The results related to TDR estimates are presented in Table 2. We recall that all the methods ensure almost the same control of FDRm except BVSR, which

14 L. Briollais et al./bayesian approach for GWAS data 14 has an inflated rate in scenarios 2 and 4. When considering the TDR over all the SNPs, i.e. T DR overall, MOSS surpasses the other approaches. The T DR overall statistics are 38%, 54%, 100% and 64% in the scenarios 1, 2, 3 and 4, respectively, for MOSS. The advantage of MOSS is clearly seen in scenario 2 where the second best method (DASSO) has only a T DR overall of 71%. Overall, the second best approach is the DASSO except in the scenario 4. The T DR m statistics confirm the superiority of MOSS for the individual SNPs results and these results are consistent when considering SNPs with different effect sizes and MAFs. In scenario 1, DASSO results are close to those from MOSS with a noticeable difference for SNP2 and SNP5, which are two rare SNPs. Both the LASSO and BVSR have worst performances and can even have some T DR m of 0% for particular SNPs. In scenario 2, the superiority of MOSS is confirmed by the T DR m statistics. Interestingly, the T DR pair results also show that MOSS can detect with higher probability the models including two SNPs that were involved in a SNP SNP interaction term. We remind that the T DR pair statistic is just based on evaluating models with main effects and not on a test of interaction. In scenario 3, MOSS reaches 100% T DR m to detect the causal variant, which is remarkable since it is a complex situation with high LD between the SNPs. The second best method was DASSO (T DR m of 71%) and LASSO (T DR m of 36%). Finally, in scenario 4, the two best methods were MOSS and LASSO, with T DR m varying between 22 to 99% for MOSS and 8 to 100% for LASSO. We therefore conclude that MOSS is superior to its competitors based on our simulation results in various situations. The advantages of MOSS appear more striking when considering rare SNPs, SNPs in high LD and SNPs that could interact with each other. The performances of MOSS were also very good in other situations. We also noticed that the performances of MOSS in terms of FDR and TDR remain unchanged for various specifications of the priors (results not shown). In particular, defining the prior cell counts to be all 1 or proportional to the sample size of the observed cell counts with various possible proportions, did not change our main conclusions. [Table 1 about here.] [Table 2 about here.] 7. Analysis of the CGEM breast cancer GWAS data 7.1. The breast cancer paradigm In most Western populations, approximately one in ten women develop breast cancer. Epidemiological studies have shown that women who have first-degree relatives with a history of breast cancer have a two-fold increase in risk of the

15 L. Briollais et al./bayesian approach for GWAS data 15 disease [4]. The risk ratio increases with increasing the number of affected firstdegree relatives. Twin studies have indicated that most of the excess familial risk is due to inherited predisposition [31]. Particularly, BRCA1 and BRCA2 are the most important susceptibility genes conferring, when mutated, high lifetime risks of breast cancer [3, 36]. Mutations in BRCA1 and BRCA2 account for 16% of the familial risk of breast cancer [1]. Mutations in other genes (TP53, PTEN, STK11, CDH) are also associated with elevated risks but it is unlikely that mutations in these six genes account for more than 20% of the familial risk of the disease. Therefore the remaining 80% of the familial risk remains to be explained. The search for this missing heritability has led to the identification of other high-penetrant mutations in candidate genes such as CHEK2, ATM, BRIP1 and PALB2. However, they still confer a small contribution to the familial risk of breast cancer [37]. Alternatively, common low-penetrant alleles have been sought through genome-wide association studies (GWAS). So far only a small number of such variants have been identified and confirmed in different populations but their inclusion just modestly improved the performance of risk models for breast cancer [12, 40]. The bulk of breast cancer genetic susceptibility thus remains to be determined. Our working hypothesis is that the combination of common and rare genetic variants contribute to the risk of breast cancer. To illustrate this point, we use MOSS to search for the most important combinations of genetic variants in the CGEM GWAS data The CGEM study We used the CGEM breast cancer data. The genome-wide association studies (GWAS) for breast cancer has been completed in the Nurses Health Study (NHS) with nearly 550,000 SNPs genotyped. The analysis includes 1,145 individuals who developed breast cancer during the observational period and 1,142 agematched individuals who did not develop breast cancer during the same time period. Both the genotype data and the pre-computed analyses based on the genotype data were retrieved from the following website ( The project team has developed easy access to pre-computed results of the data including SNP frequencies and single SNP association analyses Data pre-processing Our initial dataset included 555,341 SNPs and 2,287 observations (1,145 affected individuals and 1,142 controls). After exclusion of SNPs with inaccurate genotypes (missing rate 10%), we had 546,540 SNPs left. For the remaining SNPs, we tried to impute the missing values using the program MACH [24]. We were able to impute accurately 546,253 SNPs. We also assessed the presence of population stratification using the program EIGENSTRAT [11], which is based on principal component analysis (PCA). Using projections on the two

16 L. Briollais et al./bayesian approach for GWAS data 16 first principal components, we found 20 individuals (9 cases and 11 controls) who appear to be outliers and were removed from our analysis. We did not exclude any SNPs due to low frequency since we wanted to evaluate how our Bayesian method was able to detect them. We did not exclude SNPs based on Hardy-Weinberg disequilibrium since there was no evidence for any deviation in our data [20] Previous results from breast cancer GWAS The first GWAS study using the CGEM breast cancer data identified several SNPs within the gene F GF R2 [20] and this result has been replicated in many independent studies following the initial result. A SNP close to the gene BUB3 was also very significant in the initial study but has not been replicated yet. To our knowledge, none of the GWAS performed thus far on breast cancer has accounted for the joint effect of SNPs associated with the disease Analyses with MOSS Table 3 displays the most important SNPs found by MOSS. We employed MOSS with c = 0.5, c = , a pruning probability of 1.0 and ran it 1, 000 times with 100 starting values each. We searched for regression models containing at most 2 predictors explaining the response variable among 546, 255 SNP variables, which gives a total number of possible regressions of The number of models evaluated by MOSS in each of the 1, 000 instances were considerably smaller and varied between 209, 445 and 837, 778, with a mean of 497, 065. The eight regressions in the resulting S(0.5) involve twelve SNPs embedded or very close to known genes (Table 3). MOSS selected 12 SNPs with MAF varying between 0% and 42%. Three SNPs have a MAF lower than 5%. In general, frequentist approaches applied to GWAS would not be able to perform a test statistic for these SNPs. Most of the SNPs detected by MOSS have a high rank when using the more conventional univariate p value criteria. The two SNPs in the gene F GF R2 were previously identified from univariate analysis of the CGEM data [20] and have been replicated in multiple studies. The SNP in the gene BUB3 was also identified in the initial analysis of the CGEM data but no follow-up studies have confirmed its association with breast cancer. It is therefore noteworthy that MOSS was able to replicate some initial findings in these data. Additionally, several novel candidate SNPs emerge from our analysis. An example, is the SNP rs associated with the highest posterior probability. To our knowledge, this SNP has never been identified by previous GWAS on breast cancer. We also noticed that the SNP rs in the gene APC which has a very low rank based on univariate analysis and p-value ranking, would have never been picked by a

17 L. Briollais et al./bayesian approach for GWAS data 17 classical approach. This SNP has been selected by MOSS because it has a joint effect with the SNP in the gene BUB3. While MOSS is able to detect more SNPs associated with the disease of interest in GWAS, the question remains to know whether these results have any biological validity. In the next tables and figures, we give more insights into the biological interpretation of our findings. [Table 3 about here.] In table 4, we give a more detailed description of the genes associated or potentially associated with the most important SNPs selected by MOSS. Interestingly, 4 out of 17 genes in our list have been previously implicated in breast cancer, including BUB3, NT SR1, F GF R2 and AP C. Furthermore, eight genes have previous relation to cancers, which suggests an enrichment of cancer genes in the MOSS selection. The most interesting gene found by MOSS is the gene C6orf15. This gene is located in the HLA region and does not have a very clear function. However it is located in a region characterized by a dense cluster of genes which has been found over-expressed in many cancer types. This is therefore a region that would be worth sequencing to find potential causal variants associated with breast cancer or other cancers. [Table 4 about here.] Table 5 displays the best eight two-snp models identified by MOSS and their associated marginal likelihood and Bayes Factor (BF). We can first notice that the BFs for these models are much higher (from to 17.18) than any of the BF for the single SNP models in Table 3 (i.e. the maximum value was 4.50 for the SNP rs close to gene BUB3). There is also a certain level of internal replication. Indeed, two pairs of models (1, 3) and (5, 7) appear almost identical since they involve the same two genes though not necessarily the same SNPs. The two SNPs that belonged to same gene were in linkage disequilibrium (LD) in both cases. It is therefore remarkable that MOSS was able to identify SNPs strongly in LD through the selection of the best models. The parameter estimates for these models were obtained by generating 10, 000 samples from the posterior distributions of the induced logistic regression parameters using Bayesian Iterative Proportional Fitting (IPF). In some models, the interaction term between the two SNPs was not included. In most instances, the best models include one strong marginal SNP effect (log odds > 1) and a weaker one ( log odds < 1), the sign of the coefficient for this latter being either positive (risk effect) or negative (protective effect). In terms of allele frequency, we will denote a rare, uncommom and common SNPs whenever the MAF is < 5%, 5% and < 10% and 10%, respectively. Among the eight models detected by MOSS, four of them involve two common SNPs, two include one common and one uncommon SNPs and the last two models entail one rare and one common SNPs. It is remarkable that MOSS was able to select these latter two models since most common approaches for GWAS are limited to the detection of common SNPs.

18 L. Briollais et al./bayesian approach for GWAS data 18 [Table 5 about here.] After trying to understand the joint effect of two SNPs in each of our models, we analyzed the possible implication of the SNPs and genes identified in relation to known genetic pathways. We used the bioinformatic software Ingenuity to perform this analysis. The results are presented in Figure 2. Several gene interactions within this network were identified by MOSS. [Fig 2 about here.] 7.6. Risk prediction with Bayesian Model Averaging The model prediction was obtained by Bayesian model averaging of the eight regression models using 500 iterations of two-fold cross-validation. The area under the ROC curve (AUC) was estimated at 63.5%. By comparison, the prediction obtained from the set of best seven common SNPs identified through previous GWAS on breast cancer (based on univariate analysis) was only 57.4% [12]. A selection of the best 10 SNPs combined with the major known epidemiological risk factors for breast cancer resulted in an AUC of 61.3% ([40]). Therefore, MOSS improves substantially the AUC and the addition of known epidemiological and clinical factors (which were not available for this study) to our model could provide even better predictive ability Significance of the Bayes Factor (BF) and Multiplicity Adjustment Our main interest in this study was to construct predictive models for breast cancer based on the subset of SNPs selected by MOSS. In many studies, however, it is also of interest to assess the significance of the results, i.e. in our context the significance of the multi-snp models identified by MOSS. In general, there is no simple connection between Bayes Factor and p values. Under general assumptions, it is possible to compute an upper bound for the BF for a given p value but this represents an optimistic BF [35]. Alternatively, we can compute an empirical distribution for the BF under the complete null hypothesis H M that no two-snp model is associated with breast cancer. This is done through Monte-Carlo simulations (See section 6). The observed BF for a given two-snp model found by MOSS is then compared to the empirical distribution of the BFs under the null hypothesis to determine the significance. To adjust for the multiplicity problem, i.e. the number of two-snp models evaluated by MOSS, we used a single-step maxt procedure [41], where maxt corresponds to the maximum BF over the m models evaluated by MOSS. If we denote by t i, i = 1,..., 8 the observed BF for one of the eight two-snp models found by MOSS and T l the BF under the null distribution for a model l, l = 1..m, then an adjusted p value can be computed as

19 L. Briollais et al./bayesian approach for GWAS data 19 p i = P r(max 1 l m T l t i H M ), i = 1,..., 8. Following this principle, we found that the eight models selected by MOSS will have adjusted p values smaller than 5 %. 8. Summary and Discussion GWAS has emerged as one of the most spectacular advances in genetic research in the past five years with thousands of novel genetic variants discovered in many complex traits and diseases [16]. Despite this success, the clinical relevance of these findings still remains to be determined. Breast cancer illustrates well the current limitations of GWAS. The predictive models based on clinical factors barely improved when including genetic variants discovered through GWAS [12, 40]. One of the most challenging issues in GWAS analysis is to investigate the full spectrum of the genetic variability including rare variants (most likely to be functional), the joint effects of multiple variants on the response, combined effect of genetics and the environmental factors, etc. To shift the usual paradigm in genome-wide analysis and fully exploit the richness of the genetic information generated through GWAS experiments, we propose a novel Bayesian approach. Its theoretical foundations, based on discrete Bayesian graphical models [9, 30], allows to assess the joint effect of multiple SNPs (that could be linked or unlinked) in a GWAS. The high-dimensional space of multi-snps models is explored efficiently using the stochastic search algorithm MOSS [9]. Additionally, a prior contingency table representing the joint distribution of disease status and SNPs is used within this framework and can enhance the detection of rare functional genetic variants. The use of the Bayesian framework offers several advantages in the context of GWAS over the more classical frequentist approach. First, coupled with efficient stochastic algorithms such as MOSS or MCMC methods, this framework allows to explore the model space very quickly. This is done by sampling from the joint posterior distribution of the multi-snps models. Second, because it is very unlikely that a single model can represent well the complexity of the data, the use of the Bayesian model averaging paradigm acknowledges the high degree of model uncertainty in high-dimensional discrete data [26]. Third, sampling methods such as MOSS have a better ability to deal with missing and sparse data by sampling graphical models from the model space and specifying a prior distribution for the cell counts. This was clearly illustrated by the detection of rare genetic variants in our real analysis of the breast cancer GWAS. Our simulation results demonstrated the superiority of MOSS over recently proposed approaches for multi-snp analyses of GWAS data. The advantage was substantial in all the scenarios investigated including the detection of marginal SNP effects, interaction effects (with or without SNP main effects), causal variants, rare variants, SNPs that are linked or unlinked. This makes MOSS a very general approach for the analysis of a broad spectrum of SNP effects in GWAS. Our real application to a breast cancer GWAS data set confirms the added

20 L. Briollais et al./bayesian approach for GWAS data 20 of our novel approach and its relevance for genetic research. We found 12 SNPs embedded in 8 two-snps models associated with breast cancer. These two-snp models included both common and rare variants. We replicated some known associations, e.g. with SNPs in the FGFR2 gene, but also discovered new ones that seem biologically promising. Many of these genetic associations would not have been found by conventional approaches, which generally cannot evaluate multi-snp models and rare variants. This is the case of the two-snp model that involves SNPs in the genes APC and BUB3. The association with the SNP in BUB3 and breast cancer was not even reported in the original analysis of this GWAS data set since it corresponds to a rare SNP [20]. Biological information about BUB3 shows that it interacts physically with APC, thus validating biologically one of our major results. We should also emphasize that MOSS can also analyze the SNPs as categorical variables instead of binary variables as implemented in our R package GenMOSS, although for our data analysis, we did not find any clear advantage in doing so. This could be due to the fact that this would lead to sparser contingency tables and most of the SNPs found in our best models had dominant or recessive effect. MOSS can also provide estimates of effect sizes for the SNPs discovered by the method. This is illustrated in our real analysis and explained in section 5. An R package called GenMOSS and implementing the MOSS algorithm can be downloaded from ( A more complete version called GenMOSSplus is also available on CRAN and has additional steps for GWAS data pre-processing. Acknowlegments We would like to thank Olia Vesselova for her development of the R package. References [1] Anglian Breast Cancer Study Group(2000) Prevalence and penetrance of BRCA1 and BRCA2 in a population based series of breast cancer cases The British Journal of Cancer [2] Benjamini, Y. and Hochberg, Y.(1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing Journal of the Royal Statistical Society: Series B [3] The Breast Cancer Linkage Consortium(1999) Cancer risks in BRCA2 mutation carriers Journal of the National Cancer Institute [4] Collaborative Group on Hormonal Factors in Breast Cancer(2002) Lancet

21 L. Briollais et al./bayesian approach for GWAS data 21 [5] Darroch, J. N. and Speed, T. P.(1983) Additive and Multiplicative Models and Interaction The Annals of Statistics [6] Dellaportas, P. and Forster, J. J.(1999) Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical log-linear models Biometrika [7] Devlin, B. and Risch, N.(1995) A comparison of linkage disequilibrium measures for fine-scale mapping Genomics [8] Dobra, A.(2009) Variable selection and dependency networks for genomewide data. Biostatistics [9] Dobra, A. and Massam, H.(2010) The Mode Oriented Stochastic Search (MOSS) for Log-linear Models With Conjugate Priors Statistical Methodology [10] Edwards, D. E. and Havranek, T.(1985) A Fast Procedure for Model Search in Multidimensional Contingency Tables Biometrika [11] Price, A. L. and Patterson, N. J. and Plenge, R. M. and Weinblatt, M. E. and Shadick, N. A. and Reich, D.(2006) Principal components analysis corrects for stratification in genome-wide association studies Nature Genetics [12] Gail, M.(2008) Discriminatory accuracy from single nucleotide polymorphisms in models to predict breast cancer risk Journal of the National Cancer Institute [13] Guan, Y. and Stephens, M. (2011) Bayesian Variable Selection Regression for Genome-wide Association Studies, and other Large-Scale Problems. Annals of Applied Statistics To Appear. [14] Han, B. and Park, M. and Chen, X.W.(2010) A Markov blanketbased method for detecting causal SNPs in GWAS. BMC Bioinformatics 11 Suppl. 3 S5. [15] He, Q. and Lin, D-Y(2011) A variable selection method for genome-wide association studies. Bioinformatics [16] Hindorff, L., A. and Junkins,H., A. and Hall,P., N. and Mehta,J., P. and Manolio,T., A.(2009) A Catalog of Published Genome-Wide Association Studies. Available at: [17] Hindorff, L., A. and Sethupathy,P. and Junkins,H., A. and Ramos,E., M. and Mehta,J., P. and Collins,F., S. and Manolio,T., A. (2009) Potential etiologic and functional implications of genomewide association loci for human diseases and traits Proc Natl Acad Sci USA [18] Hirschhorn,J. N. and Daly,M. J(2005) Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics [19] Hoggart, C. J. and Whittaker, J. C. and De Iorio, M. D. and Balding, D. J. (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics 4 e [20] Hunter, D. J. and Kraft,P., Jacobs, K. B., Cox, D. G., Yeager, M., Hankinson, S. E., Wacholder, S. et al. (2007) A genome-wide

22 L. Briollais et al./bayesian approach for GWAS data 22 association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer Nature Genetics [21] Jones, B. and Carvalho, C. and Dobra, A. and Hans, C. and Carter, C. and West, M.(2005) Experiments in Stochastic Computation for High-dimensional Graphical Models Statistical Science [22] Kingsmore,S. F. and Linquist, I. E. and Mudge, J. and Gessler, D. D. and Beavis,W. D.(2008) Genome-wide association studies: progress and potential for drug discovery and development. Nature Reviews [23] Kruglyak, L.(2008) The road to genome-wide association studies Nature Genetics [24] Li, Y. and Willer, C. J. and Ding, J. and Scheet, P. and Abecasis, G. R.(2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes Genetic Epidemiology [25] McCarthy,M. I. and Hirschhorn,J. N.(2008) Genome-wide association studies: potential next steps on a genetic journey. Human Molecular Genetics 17 R156 R165. [26] Madigan, D. and Raftery, A. E. (1994) Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam s Window Journal of the American Statistical Association [27] Madigan, D. and York, J. (1995) Bayesian Graphical Models for Discrete Data International Statistical Review [28] Madigan, D. and York, J. (1997) Bayesian Methods for Estimation of the Size of a Closed Population Biometrika [29] Marchini, J. and Donnelly, P. and Cardon, L. R.(2005) Genomewide strategies for detecting multiple loci that influence complex diseases Nature Genetics [30] Massam, H. and Liu, J. and Dobra, A.(2009) A Conjugate Prior for Discrete Hierarchical Log-linear Models Annals of Statistics [31] Peto, J. and Mack, T. M. (2000) High constant incidence in twins and other relatives of women with breast cancer Nature Genetics [32] Risch,N. and Merikangas,K.(1996) The future of genetic studies of complex human diseases. Science [33] Risch,N. J.(2000) Searching for genetic determinants in the new millenium. Nature [34] Scott, J. G. and Berger, J. O.(2010) Bayes and empirical-bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics [35] Sellke, T., Bayarri, M. J., and Berger, J. O.(2001) Calibration of p values for testing precise null hypotheses. American Statistician [36] Thompson, D. and Easton, D.F.(2002) Cancer incidence in BRCA1 mutation carriers Journal of the National Cancer Institute [37] Thompson, D. and Easton, D.F.(2004) The genetic epidemiology of breast cancer genes Journal of Mammary Gland Biolology and Neoplasia [38] Tibshirani, R.(1996) Regression shrinkage and selection via the lasso.

23 L. Briollais et al./bayesian approach for GWAS data 23 Journal of the Royal Statistical Society, Series B [39] Verzilli, C. J. and Stallard, N. and Whittaker, J. C. (2006) Bayesian Graphical Models for Genomewide Association Studies American Journal of Human Genetics [40] Wacholder, S. and Hartge, P. and Prentice, R. and Garcia- Closas, M. and Feigelson, H. S., et al. (2010) Performance of common genetic variants in breast-cancer risk models New England Journal of Medicine [41] Westfall, P. H., Young, S. S. (1993) Resampling-based multiple testing. Examples and methods for p-value adjustement. John Wiley & Sons, New York [42] Wilson, M. A., Iversen, E. S., Clyde, M. A., Schmidler, S. C., and Schildkraut, J. M. (2010) Bayesian model search and multilevel inference for SNP association studies Annals of Applied Statistics [43] Yeung, K.Y. and Bumgarner, R. E. and Raftery, A. E. (2005) Bayesian Model Averaging: Development of an Improved Multi-class, Gene Selection and Classification Tool for Microarray Data Bioinformatics [44] Zhang, Y. and Liu, J. S.(2007) Bayesian inference of epistatic interactions in case-control studies Nature Genetics

24 List of Figures L. Briollais et al./bayesian approach for GWAS data 24 1 Conditional dependence between three SNPs and the disease status Y in scenario 3 (left) and scenario 4 (right) of the simulations 25 2 Gene network analysis

25 FIGURES/FIGURES 25 Y SNP 1 SNP 2 SNP 3 Y SNP 1 SNP 2 SNP 3 Fig 1. Conditional dependence between three SNPs and the disease status Y in scenario 3 (left) and scenario 4 (right) of the simulations

26 FIGURES/FIGURES 26 Fig 2. Gene network analysis

Genomic models in bayz

Genomic models in bayz Luc Janss, Dec 2010 In the new bayz version the genotype data is now restricted to be 2-allelic markers (SNPs), while the modeling option have been made more general. This implements