A Bayesian Graphical Model for Genome-wide Association Studies (GWAS)


Laurent Briollais 1,2, Adrian Dobra 3, Jinnan Liu 1, Hilmi Ozcelik 1 and Hélène Massam 4

Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, 60 Murray Street, Toronto, M5T 3L9, Canada
laurent@lunenfeld.ca

Research supported by a MITACS seed project. Corresponding author.

Abstract: The present paradigm for analyzing GWAS is to perform exhaustive testing of all single-SNP associations with the response variable, with the major drawback that the selected subset of SNPs has in general a very low predictive value. As a shift from the usual approach, we propose here a general statistical framework for GWAS based on Bayesian graphical modeling and able to: 1) assess the joint effect of multiple SNPs (linked or unlinked); 2) explore the model space efficiently using the Mode Oriented Stochastic Search (MOSS) algorithm. We illustrate our new methodology through an application to the CGEM breast cancer data. Our algorithm selected several SNPs embedded in multi-locus models with high posterior probabilities. Most of the selected SNPs are of biological interest. Interestingly, several of them would not have been detected by the single-SNP testing approach. Finally, we discuss the impact of informative priors in this statistical framework through simulations.

Keywords and phrases: Graphical model, Bayesian, Stochastic Search, GWAS, SNP, Breast Cancer.

1. Introduction

The emergence of new high-throughput technologies for SNP genotyping and their application to large-scale genome-wide association studies (GWAS) have raised great hopes that the genetic basis of many common human diseases could be elucidated [18, 22, 23, 25, 32, 33]. These expectations have been partially fulfilled: GWAS have identified hundreds of genetic variants associated with common human diseases and complex traits, providing valuable insights into their genetic mechanisms [16, 17]. The rationale underlying GWAS is that common genetic variants (i.e. present in more than 1-5% of the population) could explain most of the attributable risk of common human diseases. This is also referred to as the common disease, common variant (CDCV) hypothesis.

The present paradigm for designing and analyzing GWAS is relatively straightforward and involves the collection of more than 1,000 cases and 1,000 controls and an exhaustive screen of more than 500K SNPs using simple univariate association tests (sometimes adjusted for confounding risk factors). The genetic variants are then ranked by their test statistics (or corresponding p-values), and a significance threshold is determined using an appropriate correction for multiple testing. Genetic variants with p-values below the threshold are declared significant and are often followed up in replication studies. In many GWAS, this threshold is of the order of $10^{-8}$ to $10^{-7}$ [32], although no consensus exists on how it should be determined.

Despite their relative merits in identifying new genetic variants, GWAS have also given rise to many criticisms and increasing skepticism. Most attempts to construct multi-variable risk models (i.e. multi-locus or multiple-SNP models) that could support the CDCV hypothesis have used univariate rankings of the SNPs (which individually measure the dependency between each candidate SNP and the response) to select those to be considered in the final model. However, there is no theoretical justification for the implicit claim that the resulting set of genetic variants can be converted into a good predictive model. This is clearly illustrated by the GWAS on breast cancer, where the panel of SNPs associated with breast cancer had a low attributable risk and low predictive ability [12, 40]. The actual limitations of GWAS analysis methods lie in their inability to account for the joint effects of multiple genetic variants that could be linked or unlinked on the genome, and to estimate the effect of rare variants with minor allele frequency (MAF) lower than 5%. Attempts to remedy these problems have been proposed. A recent paper demonstrated theoretically the feasibility of genome-scale testing of the joint effect of two genetic markers [29] but did not propose any particular statistical framework for performing the variable search.
The development of variable selection and model search methods for high-dimensional data is still at an early stage and encompasses two main statistical approaches: penalized regression and Bayesian selection methods. The most popular penalized regression method is the LASSO [38], further extended for GWAS analysis [19]. The principle consists in estimating a vector of regression coefficients associated with the SNP effects by maximizing a penalized log-likelihood function. The LASSO performs variable selection and parameter estimation simultaneously, the SNPs not selected in the final model having regression coefficients set to zero.

In the Bayesian framework, various competing approaches have been developed. A common feature is to propose a regression model for the joint effect of the SNPs on the response and to specify a prior for the regression coefficients. The best multiple-SNP models can be determined by evaluating the posterior probability of each model. Since the model space has a high dimension, efficient search algorithms are often used, based on MCMC approaches or variants of those. Models are sampled according to their posterior probability, and the final model choice includes the models with the highest posterior probabilities after convergence of the MCMC algorithm. Expert prior information, whenever available, can be included to guide the model search. Adjustment for the multiplicity problem, i.e. for the number of models evaluated, has also been proposed in the Bayesian framework [34]. Although promising, some of these approaches have been limited to low-dimensional models with two SNPs [44], SNPs that are in linkage disequilibrium [39] or continuous data [13], or sometimes need a heavy first selection step to reduce the high SNP dimensionality [42]. There is therefore a compelling need to develop appropriate methodologies for GWAS that allow an efficient evaluation of a large number of multi-SNP models, where the SNPs can be linked or unlinked.

In this paper, we propose a general framework based on discrete Bayesian graphical models and able to 1) assess the joint effect of multiple SNPs (linked or unlinked); 2) explore the model space efficiently using the Mode Oriented Stochastic Search (MOSS) algorithm [9]. We illustrate our new methodology through simulation and an application to the CGEM breast cancer GWAS data.

2. Discrete Bayesian Graphical Models for Modeling the Joint Effect of SNPs in GWAS

2.1. Model notations

An undirected graph $G$ is a pair $(V, E)$ where $V$ is a finite set of vertices and $E$, the set of edges, is a subset of the set $V \times V$ of unordered pairs of distinct vertices $\{i, j\}$, $i \in V$, $j \in V$. In GWAS, the vertices represent the disease status and the SNPs, and the edges represent the associations between the variables, i.e. between the SNPs and the disease status and between the SNPs themselves.

Let $X = \{X_1, \dots, X_r\}$ be a vector of random variables and let $G = (V, E)$ be a graph with vertex set $V = \{1, 2, \dots, r\}$. Each variable $X_i$ is represented by the vertex $i$. For $A \subseteq V$, $X_A$ denotes the collection of random variables $\{X_i, i \in A\}$. For a given graph $G$, a probability distribution for $X$ is said to be Markov with respect to $G$ if, for any two non-adjacent vertices $i, j \in V$, $X_i$ is independent of $X_j$ given $X_{V \setminus \{i, j\}}$. A graphical model is a family of probability distributions which are Markov with respect to a given graph $G$. A discrete graphical model is a graphical model where the random variables are discrete.

Let $X$ be the random vector of interest.
Suppose that a random sample $x = (x_1, \dots, x_n)$ of values of $X$ has been observed. Let us assume that we choose to do a model search in the family of models $M_1, \dots, M_k$. We write the models as

$$M_j = \{p(x \mid \vartheta),\ \vartheta \in \Theta_j\}, \quad j = 1, \dots, k, \tag{1}$$

where $\vartheta$ is a parameter in the parameter set $\Theta_j$ and $p(x \mid \vartheta)$ is a probability density function. In the particular case where the model is a graphical model, the parameter space is defined by the underlying graph $G$ and we identify the models $M_j$ with their underlying graphs $G_j$.

In a Bayesian framework we assume a prior probability $P(M_j)$, $j = 1, \dots, k$, on the set of models $(M_1, \dots, M_k)$ and a prior probability on the parameters $\vartheta$, and we want to derive the posterior model probability $P(M_j \mid x)$ for each of the models $M_1, \dots, M_k$, that is, the conditional distribution of $M_j$ given the data. The Bayesian solution is to choose the model with the highest posterior probability. According to Bayes' theorem, the posterior probability of $M_j$ is

$$P(M_j \mid x) = \frac{P(x \mid M_j)\, P(M_j)}{\sum_{i=1}^{k} P(x \mid M_i)\, P(M_i)}. \tag{2}$$

The term $\sum_{i=1}^{k} P(x \mid M_i)\, P(M_i)$ in (2) is a constant. Therefore we can write

$$P(M_j \mid x) \propto \underbrace{P(x \mid M_j)}_{\text{marginal likelihood}}\ \underbrace{P(M_j)}_{\text{model prior}}. \tag{3}$$

Suppose that the random vector of interest $X$ consists of $r$ binary factors $\{X_1, \dots, X_r\}$, each having two levels, 0 and 1. Take $N$ individuals and classify them according to the $r$ binary factors in $V = \{1, 2, \dots, r\}$. Let $\mathcal{E}$ be the collection of all non-empty subsets of $V$ and $\mathcal{E}_0$ the collection of all subsets of $V$ including $\emptyset$. The elements $F$ of $\mathcal{E}_0$ are then in one-to-one correspondence with the cells of the contingency table, and we can use $p_F$ to denote the cell probability

$$p_F = P(X_v = 1,\ v \in F;\ X_v = 0,\ v \notin F). \tag{4}$$

Similarly, we can use $n_F$, $F \in \mathcal{E}_0$, to denote the cell counts. Since $N$ is fixed, the cell counts follow a multinomial distribution with probability function

$$f((n); p) = \binom{N}{(n)}\, p_\emptyset^{\,N - \sum_{F \in \mathcal{E}} n_F} \prod_{F \in \mathcal{E}} p_F^{\,n_F}. \tag{5}$$

An alternative representation of the multinomial distribution is to write it in natural exponential family form using log-linear parameters instead of cell probabilities. We use the following log-linear parameters:

$$\theta_E = \log \prod_{F \subseteq E,\ F \in \mathcal{E}_0} p_F^{(-1)^{|E \setminus F|}}, \quad \text{with } \theta_\emptyset = \log p_\emptyset. \tag{6}$$

Let

$$\mathcal{D} = \{D \in \mathcal{E} \mid D \text{ is complete in } G\}. \tag{7}$$

It can be shown (see [30]) that the conditional independences

$$X_l \perp X_s \mid X_{V \setminus \{l, s\}} \tag{8}$$

for the pairs of vertices $l$, $s$ that are not adjacent in $G$ are equivalent to

$$\exp \theta_E = \prod_{F \subseteq E,\ F \in \mathcal{E}_0} p_F^{(-1)^{|E \setminus F|}} = 1, \quad E \notin \mathcal{D}, \tag{9}$$

that is, the generalized odds ratios $\exp \theta_E$, $E \notin \mathcal{D}$, are equal to 1 or, equivalently,

$$\theta_E = 0, \quad E \notin \mathcal{D}. \tag{10}$$

Marginal cell counts are the counts obtained when we classify subjects on a subset of the variables of the original contingency table. If we use $y$ to denote the marginal cell counts, then for any given cell $F$ we have

$$y_F = \sum_{D \supseteq F,\ D \in \mathcal{E}_0} n_D, \tag{11}$$

and clearly $y_\emptyset = N$. For a given contingency table, the transformation from the cell counts $\{n_F\}$ to the marginal cell counts $\{y_F\}$ is a simple linear transformation with Jacobian equal to 1. If we assume a multinomial distribution for the cell counts, we readily obtain, up to a multiplicative constant, the following natural exponential family distribution for the marginal cell counts $y = (y_E, E \in \mathcal{E}_0)$:

$$f(y; \theta, G) = \exp\Big\{ \sum_{D \in \mathcal{D}} \theta_D\, y_D - N \log\Big(1 + \sum_{E \in \mathcal{E}} \exp \sum_{D \subseteq E,\ D \in \mathcal{D}} \theta_D \Big) \Big\}, \quad \text{with } \theta = (\theta_D, D \in \mathcal{D}). \tag{12}$$

The conjugate priors for distributions in the exponential family have density

$$\pi_G(\theta \mid s, \alpha) = I_G(s, \alpha)^{-1} \exp\Big\{ \sum_{D \in \mathcal{D}} \theta_D\, s_D - \alpha \log\Big(1 + \sum_{E \in \mathcal{E}} \exp \sum_{D \subseteq E,\ D \in \mathcal{D}} \theta_D \Big) \Big\}, \tag{13}$$

for some $s = (s_D, D \in \mathcal{D}) \in \mathbb{R}^{|\mathcal{D}|}$ and $\alpha \in \mathbb{R}$, where $I_G(s, \alpha)$ is the normalizing constant.

2.2. Specification of the prior

A method to construct the hyperparameters of a proper prior $\pi_G(\theta \mid s, \alpha)$ is to start with a fictive prior contingency table with all cell counts $n_F$ positive, though not necessarily integers. With $N$ denoting the total count in this fictive contingency table and $y_D$ its marginal cell counts, we can take as hyperparameters $\alpha = N$ and $s_D = y_D$, $D \in \mathcal{D}$. Lack of prior information can be expressed through what is sometimes called a flat prior, by taking all the fictive cell entries to be equal to $\alpha / |I|$, where $|I|$ is the number of cells in the table. We used this latter prior specification in our application to the breast cancer GWAS.

2.3. Posterior of a model

The posterior of $G$ is proportional to the ratio of two normalizing constants:

$$P(G \mid y) \propto I_G(y + s, N + \alpha) / I_G(s, \alpha). \tag{14}$$
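For intuition about the ratio of normalizing constants in (14), consider the special case of a saturated model (complete graph) with a flat prior. Assuming that in this case the conjugate prior reduces to an ordinary Dirichlet distribution on the cell probabilities with every parameter equal to α divided by the number of cells, the ratio has a closed form in gamma functions. The sketch below evaluates its logarithm. The function name `log_marginal_saturated` and the toy counts are illustrative and do not come from the paper.

```python
import math


def log_marginal_saturated(counts, alpha=1.0):
    """Log of I(y + s, N + alpha) / I(s, alpha) for a saturated model.

    Assumption: flat Dirichlet prior, each cell receiving alpha / n_cells
    fictive observations. `counts` is the flattened contingency table.
    """
    cells = list(counts)
    a = alpha / len(cells)          # fictive count per cell
    n_total = sum(cells)            # N, the observed sample size
    out = math.lgamma(alpha) - math.lgamma(alpha + n_total)
    for n in cells:
        out += math.lgamma(a + n) - math.lgamma(a)
    return out


# Toy 2 x 2 table (case/control status by a dichotomized SNP), flattened.
toy_counts = [30, 20, 25, 25]
toy_value = log_marginal_saturated(toy_counts, alpha=1.0)
```

Working on the log scale with `math.lgamma` avoids overflow for realistic sample sizes, where the gamma functions themselves are far too large to represent.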

For $G$ decomposable, the prior $\pi(\theta \mid \alpha, s)$ is identical to the hyper Dirichlet (see [30]). It follows that the normalizing constants $I_G$ can be computed analytically when the graph $G$ is decomposable. When $G$ is non-decomposable, $I_G$ needs to be computed numerically.

2.4. Regression induced by a graphical model

In GWAS, we are interested in modelling the response variable (i.e. case-control status) as a function of the SNP variables. Let $Y = X_r$, $r \in V$, be a response variable and $X_A$, $A \subseteq V \setminus \{r\}$, be a set of explanatory variables (i.e. the SNPs). Denote by $(n)_{A \cup \{r\}}$ and $(n)_A$ the corresponding marginal tables. Here $(n)$, $(n)_{A \cup \{r\}}$ and $(n)_A$ are cross-classifications involving $X_V$, $X_{A \cup \{r\}}$ and $X_A$, respectively. The connection between log-linear models and the regressions derived from them has been explored in [9], where it is shown that the marginal likelihood of the regression $[Y \mid X_A]$ can be expressed as the ratio between the marginal likelihoods of the saturated log-linear models for $(n)_{A \cup \{r\}}$ and $(n)_A$.

3. SNP selection with MOSS in GWAS setting

A major challenge in GWAS is to select a subset of SNPs associated with the response of interest among thousands of candidates. The Bayesian paradigm for model selection and model determination involves choosing models $m$ with high posterior probability $\Pr(m \mid (n))$ from a set $\mathcal{M}$ of competing models, where $(n)$ denotes the available data. We associate with each candidate model $m \in \mathcal{M}$ a neighborhood $\mathrm{nbd}(m) \subset \mathcal{M}$. Any two models $m, m' \in \mathcal{M}$ are connected through at least one path $m = m_1, m_2, \dots, m_l = m'$ such that $m_j \in \mathrm{nbd}(m_{j-1})$ for $j = 2, \dots, l$. The neighborhoods are defined with respect to the class of models considered. In a regression model $m$ containing $k$ regressors, the neighborhood is obtained by:

(i) addition moves: individually including in $m$ any variable that is not in $m$. The resulting neighbors contain $k + 1$ regressors;
(ii) deletion moves: individually deleting from $m$ any variable that belongs to $m$. The resulting neighbors contain $k - 1$ regressors; and
(iii) replacement moves: replacing any one variable in $m$ with any one variable that is not in $m$. The resulting neighbors contain $k$ regressors.

In a graphical model $m$, the neighborhood consists of those graphical models obtained from $m$ by adding or deleting one edge ([10] and [6]).

The mode oriented stochastic search (MOSS, henceforth) is an algorithm [9] that identifies the models in

$$\mathcal{M}(c) = \Big\{ m \in \mathcal{M} : \Pr(m \mid (n)) \ge c \cdot \max_{m' \in \mathcal{M}} \Pr(m' \mid (n)) \Big\}, \tag{15}$$

where $c \in (0, 1)$. As proposed in [26], models with a low posterior probability compared to the highest posterior probability model are discarded. This choice drastically reduces the size of the target space from $\mathcal{M}$ to $\mathcal{M}(c)$. For suitable choices of $c$, $\mathcal{M}(c)$ can be exhaustively enumerated.

MOSS makes use of a current list $S$ of models that is updated during the search. Define the subset $S(c)$ of $S$ in the same way we defined $\mathcal{M}(c)$ based on $\mathcal{M}$. Define $S(c')$ with $0 < c' < c$, so that $S(c) \subseteq S(c')$. Let $q$ be the probability of pruning the models in $S \setminus S(c)$. A model $m$ is called explored if all its neighbors $m' \in \mathrm{nbd}(m)$ have been visited. A model in $S$ can be explored or unexplored. MOSS proceeds as follows:

PROCEDURE MOSS(c, c', q)
(a) Initialize the starting list of models $S$. For each model $m \in S$, calculate and record its posterior probability $\Pr(m \mid (n))$. Mark $m$ as unexplored.
(b) Let $L$ be the set of unexplored models in $S$. Sample a model $m \in L$ according to probabilities proportional to $\Pr(m \mid (n))$, normalized within $L$. Mark $m$ as explored.
(c) For each $m' \in \mathrm{nbd}(m)$, check whether $m'$ is currently in $S$. If it is not, evaluate and record its posterior probability $\Pr(m' \mid (n))$. If $m' \in S(c')$, include $m'$ in $S$ and mark $m'$ as unexplored. If $m'$ is the model with the highest posterior probability in $S$, eliminate from $S$ the models in $S \setminus S(c')$.
(d) With probability $q$, eliminate from $S$ the models in $S \setminus S(c)$.
(e) If all the models in $S$ are explored, eliminate from $S$ the models in $S \setminus S(c)$ and STOP. Otherwise go back to step (b).
END.

In a MOSS search, the model explored at each iteration is selected from the most promising models identified so far. In $\mathrm{MC}^3$ [26, 27] or SSS [21], this model is selected from the neighbors of the model evaluated at the previous iteration. This feature allows MOSS to move faster towards regions of high posterior probability in the model space.

4. Discretizing the SNP Variables

The conjugate priors in [30] are appropriate for the analysis of general multi-way tables in which each variable is allowed to have any number of categories.
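Returning briefly to the MOSS procedure of Section 3, steps (a) to (e) can be sketched for regression models as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a model is a frozenset of regressor indices, `log_post` is any user-supplied function returning an unnormalized log posterior, `max_size` caps the number of regressors, and, as a simplification that guarantees termination, a model that has been explored once is never explored again even if it is pruned and later re-enters the list. All names are hypothetical.

```python
import math
import random


def neighbors(model, variables):
    """Addition, deletion and replacement neighbours of a regression model."""
    outside = [v for v in variables if v not in model]
    nbd = {model | {v} for v in outside}                            # (i) addition
    nbd |= {model - {v} for v in model}                             # (ii) deletion
    nbd |= {(model - {u}) | {v} for u in model for v in outside}    # (iii) replacement
    return nbd


def moss(log_post, variables, start, c=0.5, c_prime=0.01, q=0.1,
         max_size=2, seed=0):
    """Return {model: log posterior} for the models kept in S(c)."""
    rng = random.Random(seed)
    S = {frozenset(m): log_post(frozenset(m)) for m in start}       # step (a)
    explored = set()

    def prune(ratio):
        # Keep only models whose posterior is at least ratio * best posterior.
        best = max(S.values())
        for m in [m for m, lp in S.items() if lp < best + math.log(ratio)]:
            del S[m]

    while True:
        unexplored = [m for m in S if m not in explored]
        if not unexplored:                                          # step (e)
            prune(c)
            return S
        # Step (b): sample an unexplored model with probability
        # proportional to its posterior, normalized within L.
        best = max(S[m] for m in unexplored)
        weights = [math.exp(S[m] - best) for m in unexplored]
        m = rng.choices(unexplored, weights=weights, k=1)[0]
        explored.add(m)
        # Step (c): visit the neighbours of the sampled model.
        for m2 in sorted(neighbors(m, variables), key=sorted):
            if len(m2) > max_size or m2 in S:
                continue
            lp = log_post(m2)
            top = max(S.values())
            if lp >= top + math.log(c_prime):
                S[m2] = lp
                if lp > top:
                    prune(c_prime)
        if rng.random() < q:                                        # step (d)
            prune(c)
```

In practice `log_post` would return the log marginal likelihood of the regression plus the log model prior; the search itself only needs posterior ratios, so the normalizing constant in (2) is never required.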
In order to increase the mean number of observations per cell, we transform the polychotomous data into dichotomous data. The response variable $y$ in our analysis of the CGEM data is dichotomous by design (i.e. case-control status). Diallelic SNPs can be represented as three-category discrete variables, or can be dichotomized as presence of 0 vs. absence of 0, or as presence of 1 vs. absence of 1. Coding of the latter type was used, transforming the categorical variables as follows. A segregating SNP site has three possible genotypes: 0/0, 0/1 and 1/1, where 0 is the wild-type allele and 1 is the mutant allele. We choose the split $\tau$ having the largest marginal likelihood of the regression of $y$ on $x_\tau$, where $x_\tau$ is the binary variable obtained after splitting the original variable $x$ at the value $\tau$. Ideally, we would like to consider multiple choices of splits over all available predictors and choose those splits that lead to large values of the marginal likelihood of regressions of $y$ on one, two or more categorical predictors. Each predictor would then have several categorical versions associated with various sets of splits. Unfortunately, the size of the datasets resulting from genome-wide studies precludes us from performing such large-scale computations, and we need to choose splits sequentially for each covariate.

5. Risk Estimation and Prediction Based on Bayesian Model Averaging

Once MOSS has identified a set of regression models $S(c)$ for some $c \in (0, 1)$, one can estimate the risk associated with the selected SNPs and perform risk prediction. This is done using Bayesian model averaging. Let us consider a regression model $m_j \in S(c)$, which is $[Y \mid X_{A_j}]$ with $A_j \subset V \setminus \{r\}$ and $j \in B$. Here $B$ is a set of indices for the collection of models over which we are averaging. The regression model of $Y$ on the remaining variables is a weighted average of the regression models in $S$, where the weights are the posterior probabilities of each regression model (see for example [43]):

$$\Pr(Y = y \mid (n)) = \sum_{j \in B} \Pr(Y = y \mid (n)_{A_j}) \cdot \Pr(m_j \mid (n)). \tag{16}$$

Since we assumed that all models are a priori equally likely, the posterior probability of each regression is equal to its marginal likelihood normalized over all the models in $S$:

$$\Pr(m_j \mid (n)) = \frac{\Pr(r \mid A_j)}{\sum_{l \in B} \Pr(r \mid A_l)}.$$

It is shown in [26] that the weighted average of regressions in (16) has a better predictive performance than any individual model in $S$. The relevance of each predictor $X_j$ can be quantified by its posterior inclusion probability, defined as the sum of the posterior probabilities of all the models that include $X_j$.

To estimate the parameters $\theta$ of a given graphical model, we use the algorithm described in [9], called the Bayesian iterative proportional fitting (Bayesian IPF, henceforth) algorithm, which samples from the joint posterior (eq. 14) of these $\theta$'s. The resulting posterior samples can be used to estimate the coefficients of the regressions in $S(c)$ and thus the probability of having the response of interest (e.g. the disease) given the selected SNPs. The coefficients of the Bayesian model averaging regression $\Pr(Y = y \mid (n))$ in (16) are estimated by sampling from the joint posterior distribution of the regression coefficients corresponding to a given graphical model $m_j = [Y \mid X_{A_j}]$, $j \in B$, for a number of iterations proportional to $\Pr(m_j \mid (n))$.
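The model-averaging quantities of this section can be illustrated with a few lines of code. The sketch below assumes equal prior model probabilities, as in the text, and takes the log marginal likelihood of each regression as given. The SNP names, the numerical values and the function names are hypothetical and serve only to show the arithmetic of (16) and of the posterior inclusion probabilities.

```python
import math


def posterior_model_probs(log_marglik):
    """Normalize log marginal likelihoods into posterior model probabilities
    (equal model priors assumed)."""
    top = max(log_marglik.values())
    weights = {m: math.exp(v - top) for m, v in log_marglik.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def inclusion_probs(post):
    """Posterior inclusion probability of each predictor: the sum of the
    posterior probabilities of the models that contain it."""
    incl = {}
    for model, p in post.items():
        for v in model:
            incl[v] = incl.get(v, 0.0) + p
    return incl


def bma_predict(post, model_preds):
    """Weighted average of per-model predictions Pr(Y = 1 | model), as in (16)."""
    return sum(post[m] * model_preds[m] for m in post)


# Two hypothetical regressions, the second three times better supported.
log_ml = {("snp1",): math.log(1.0), ("snp1", "snp2"): math.log(3.0)}
post = posterior_model_probs(log_ml)        # 0.25 and 0.75
incl = inclusion_probs(post)                # snp1: 1.0, snp2: 0.75
risk = bma_predict(post, {("snp1",): 0.2, ("snp1", "snp2"): 0.4})
```

Subtracting the largest log marginal likelihood before exponentiating keeps the normalization stable when the marginal likelihoods are extremely small, which is the usual situation with thousands of subjects.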

6. Simulation Study

To assess the properties of our novel Bayesian graphical model approach, we simulated data sets that mimic the breast cancer GWAS data analyzed in Section 7. We considered four simulation scenarios that represent various multi-SNP models commonly studied in genetic association studies.

In Scenario 1, we consider a 5-SNP main effects model with an interaction between SNP2 and SNP3. For each individual, the case-control status was generated from a Bernoulli trial with probability $p$ of being a case given by the model

$$\mathrm{logit}(p) = \beta_0 + \sum_{i=1}^{5} \beta_i\, \mathrm{SNP}_i + \beta_{23}\, \mathrm{SNP}_2\, \mathrm{SNP}_3.$$

We chose the $\beta$'s to reflect the range of associations detected in our breast cancer GWAS in Section 7. More specifically, we fixed the regression coefficients at $\beta_1 = 0.405$, $\beta_2 = 0.916$, $\beta_3 = 0.182$, $\beta_4 = -0.405$, $\beta_5 =$ and $\beta_{23} = \beta_2 \beta_3$, corresponding to odds ratios for the genetic association of 1.5, 3.0, 1.2, 0.67, 5.0 and 1.92, respectively. The parameter $\beta_0$ was determined so that we have the same number of cases and controls as in our real data analysis, that is, 1,145 cases and 1,142 controls. For each SNP, we generated two genotypes with probabilities $f_i$ and $1 - f_i$, $i = 1, \dots, 5$, from a Bernoulli trial. We fixed $f_i$ for the five SNPs at 0.10, 0.10, 0.36, 0.36 and 0.02, respectively. These genotype frequencies correspond to a minor allele frequency (MAF) of 0.05, 0.05, 0.2, 0.2 and 0.025, respectively, under a dominant genetic model, and thus represent a wide spectrum of uncommon, common and rare SNPs.

Scenario 2 corresponds to a SNP-SNP interaction model with three two-way interactions and no main effects. For each individual, the case-control status was generated from a Bernoulli trial with probability $p$ of being a case given by the model

$$\mathrm{logit}(p) = \beta_0 + \beta_{23}\, \mathrm{SNP}_2\, \mathrm{SNP}_3 + \beta_{34}\, \mathrm{SNP}_3\, \mathrm{SNP}_4 + \beta_{45}\, \mathrm{SNP}_4\, \mathrm{SNP}_5.$$

The MAF of the five SNPs was 0.25, 0.25, 0.36, 0.36 and 0.15, respectively, and the regression coefficients for the interactions were $\beta_{23} = 1.099$, $\beta_{34} =$ and $\beta_{45} =$

Scenario 3 corresponds to a fine mapping problem where a causal SNP (SNP2) is associated with a disease status $Y$, and where SNP2 is in linkage disequilibrium (LD) with two other SNPs, SNP1 and SNP3. These two other SNPs are conditionally independent of the disease status given SNP2 (Figure 1), so that there is only one causal locus in the region and this causal locus is observed. The 3 SNPs constitute a cluster of SNPs and are denoted $X_1$, $X_2$ and $X_3$ for simplicity. The distribution of the four discrete variables in the graph was generated from a multinomial distribution and can be represented by a four-way contingency table, where the joint cell probabilities are given by the log-linear model

$$\log p_{ijkl} = \theta^{Y}_{i} + \theta^{X_1}_{j} + \theta^{X_2}_{k} + \theta^{X_3}_{l} + \theta^{Y X_2}_{ik} + \theta^{X_1 X_2}_{jk} + \theta^{X_2 X_3}_{kl},$$

where the subscripts $i, j, k, l \in \{0, 1\}$ index the levels of the variables $Y$, $X_1$, $X_2$ and $X_3$, respectively. We chose as parameters $\theta^{Y}_{1} = 0$, $\theta^{X_1}_{1} = \theta^{X_2}_{1} = \theta^{X_3}_{1} = \ln(0.2) = -1.6$, $\theta^{Y X_2}_{11} = 0.717$, $\theta^{X_1 X_2}_{11} = \theta^{X_2 X_3}_{11} =$ All the other parameters were set to 0. The association between SNPs is equivalent to a level of LD of Q = 0.71 and D close to 0.55 [7], that is, a moderate to strong level of LD as generally observed among SNPs in the same haplotype block. The MAF was 0.20 for all three SNPs and we assumed a dominant model for the SNP2 main effect.

[Fig 1 about here.]

Scenario 4 is very similar to Scenario 3 but corresponds to a situation where the 3 SNPs are functional and conditionally associated with the disease status (Figure 1). The 3 SNPs could be in different haplotype blocks, so we assumed a lesser extent of LD than in Scenario 3. The corresponding log-linear model is now

$$\log p_{ijkl} = \theta^{Y}_{i} + \theta^{X_1}_{j} + \theta^{X_2}_{k} + \theta^{X_3}_{l} + \theta^{Y X_1}_{ij} + \theta^{Y X_2}_{ik} + \theta^{Y X_3}_{il} + \theta^{X_1 X_2}_{jk} + \theta^{X_2 X_3}_{kl}.$$

The log-linear parameters were chosen as $\theta^{Y}_{1} = 0$, $\theta^{X_2}_{1} = \ln(0.05) = -3.0$, $\theta^{X_1}_{1} = \theta^{X_3}_{1} = \ln(0.25) = -1.4$, $\theta^{Y X_1}_{11} = \theta^{Y X_2}_{11} = \theta^{Y X_3}_{11} = 0.6$ for the SNP main effects and $\theta^{X_1 X_2}_{11} = \theta^{X_2 X_3}_{11} = 1.1$ for the associations between SNPs. This is equivalent to a level of LD between the SNPs of Q = 0.5 and D close to 0.33 [7], that is, a moderate level of LD as generally observed across haplotype blocks. The MAF was 0.25 for SNP1 and SNP3 and 0.05 for SNP2, and we assumed a dominant model for each of them.
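As an illustration of how such a table can be generated, the sketch below builds the 16 cell probabilities of the Scenario 4 log-linear model and draws one multinomial table. Several points are assumptions rather than statements from the paper: the parameters are read with a baseline coding in which a term contributes only when all the variables involved are at level 1, the interaction values 0.6 and 1.1 are taken as positive, the intercept is absorbed by normalization, and the sample size 2,287 is the sum of the case and control counts quoted above. Unlike the design described in the text, this simple draw does not fix the numbers of cases and controls. Function and variable names are hypothetical.

```python
import itertools
import math

import numpy as np

NAMES = ["Y", "X1", "X2", "X3"]

# Scenario 4 parameters as read from the text; all other terms are 0.
# Assumption: signs of the logs are negative, interactions are positive.
MAIN = {"Y": 0.0, "X1": math.log(0.25), "X2": math.log(0.05), "X3": math.log(0.25)}
PAIR = {("Y", "X1"): 0.6, ("Y", "X2"): 0.6, ("Y", "X3"): 0.6,
        ("X1", "X2"): 1.1, ("X2", "X3"): 1.1}


def cell_probabilities():
    """Joint probabilities p[i, j, k, l] for (Y, X1, X2, X3)."""
    p = np.zeros((2, 2, 2, 2))
    for cell in itertools.product([0, 1], repeat=4):
        level = dict(zip(NAMES, cell))
        log_p = sum(MAIN[v] for v in NAMES if level[v] == 1)
        log_p += sum(t for (a, b), t in PAIR.items()
                     if level[a] == 1 and level[b] == 1)
        p[cell] = math.exp(log_p)
    return p / p.sum()          # normalization plays the role of the intercept


def simulate_table(n_subjects=2287, seed=1):
    """Draw one 2 x 2 x 2 x 2 contingency table from the multinomial model."""
    rng = np.random.default_rng(seed)
    p = cell_probabilities()
    counts = rng.multinomial(n_subjects, p.ravel())
    return counts.reshape(p.shape)
```

Under these assumptions the conditional odds ratio between X1 and X2 is exp(1.1), about 3.0. If Q denotes Yule's Q, this gives (3.0 - 1) / (3.0 + 1) = 0.5, which agrees with the LD level stated for this scenario.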
For each simulation scenario, we also added to each simulated data set 53K random SNPs taken from the original breast cancer GWAS in Section 7. This represents approximately 10% of the total number of SNPs available in this study. These SNPs were not associated with the simulated case-control status and thus correspond to noise variables. Since these SNPs were extracted from a real SNP array, they represent a real genome-wide correlation structure.

Control of False and True Discovery Rate

For our different simulation scenarios, we estimated the False Discovery Rate (FDR) [2], given by the proportion of non-causal SNPs discovered by each approach among all the SNPs discovered by this approach. FDR estimates were based only on the model main effects. The main effect FDR (FDRm) for a SNP $j$ is defined for a particular method as

$$\mathrm{FDR}_m = \frac{\sum_{k=1}^{N_d} I(\mathrm{SNP}_j \text{ is discovered in data set } k \text{ and } \mathrm{SNP}_j \notin \text{causal SNPs})}{\sum_{k=1}^{N_d} I(\mathrm{SNP}_j \text{ is discovered in data set } k)},$$

where $N_d$ is the number of simulated data sets and $I(\cdot)$ is the indicator function. For Scenario 3 of the simulations, we also computed the cluster FDR (FDRc), where the cluster $c$ included the 3 SNPs in linkage disequilibrium (with only one of them associated with the outcome), defined as

$$\mathrm{FDR}_c = \frac{\sum_{k=1}^{N_d} I(\mathrm{SNP}_j \text{ is discovered in data set } k \text{ and } \mathrm{SNP}_j \notin \text{cluster } c)}{\sum_{k=1}^{N_d} I(\mathrm{SNP}_j \text{ is discovered in data set } k)}.$$

We also computed the true discovery rate (TDR) for each individual SNP $j$ associated with the outcome as

$$\mathrm{TDR} = \frac{\sum_{k=1}^{N_d} I(\mathrm{SNP}_j \text{ is discovered in data set } k)}{N_d}$$

and an overall TDR as

$$\mathrm{TDR}_{overall} = \frac{\sum_{k=1}^{N_d} \sum_{j=1}^{N_s} I(\mathrm{SNP}_j \text{ is discovered in data set } k)}{N_d\, N_s},$$

where $N_s$ is the number of causal SNPs associated with the outcome. In Scenarios 1 and 2 of our simulations, we also computed a TDR for each specific pair of SNPs $(j, j')$ involved in an interaction term of our simulated model as

$$\mathrm{TDR}_{pair} = \frac{\sum_{k=1}^{N_d} I(\mathrm{SNPs}\ j \text{ and } j' \text{ are discovered in data set } k)}{N_d}.$$

Note that the models of analysis did not include any interaction between the SNPs but are based only on the model main effects.

The FDR and TDR results are presented in Table 1 and Table 2, respectively. We tried to control the FDRm at the same level for each method compared and to compute the TDR at this FDRm level. However, this was not always possible with the Bayesian variable selection regression (BVSR) approach in simulation Scenarios 2 and 4, the BVSR method being designed more to optimize the predictive performance of the SNPs than to control the FDR or type I error. In these two situations, even with a more relaxed control of FDRm, BVSR had the worst TDR among all the methods, so we left the level of FDRm as it was for this approach, concluding that it had the worst performance.
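The discovery rates defined above reduce to simple counting once the set of SNPs discovered in each simulated data set is recorded. The sketch below is one possible reading of these definitions: the FDRm is pooled over all discovered SNPs and data sets, following the verbal definition (proportion of non-causal SNPs among all discovered SNPs). The function names, SNP labels and toy discoveries are hypothetical.

```python
def fdr_main(discoveries, causal):
    """Share of all discoveries, pooled over data sets, that are not causal."""
    total = sum(len(d) for d in discoveries)
    false = sum(len(set(d) - set(causal)) for d in discoveries)
    return false / total if total else 0.0


def tdr_snp(discoveries, snp):
    """Fraction of data sets in which a given causal SNP is discovered."""
    return sum(snp in d for d in discoveries) / len(discoveries)


def tdr_overall(discoveries, causal):
    """Average of the per-SNP TDR over the causal SNPs."""
    return sum(tdr_snp(discoveries, s) for s in causal) / len(causal)


def tdr_pair(discoveries, snp_a, snp_b):
    """Fraction of data sets in which both SNPs of a pair are discovered."""
    return sum(snp_a in d and snp_b in d for d in discoveries) / len(discoveries)


# Four toy data sets: each entry is the set of SNPs a method selected.
toy_discoveries = [{"SNP1", "SNP2"}, {"SNP1", "noise7"}, {"SNP2"}, set()]
toy_causal = {"SNP1", "SNP2"}
toy_fdr = fdr_main(toy_discoveries, toy_causal)          # 1 false out of 5
toy_tdr = tdr_overall(toy_discoveries, toy_causal)
toy_pair = tdr_pair(toy_discoveries, "SNP1", "SNP2")
```

The cluster FDR of Scenario 3 is obtained from the same `fdr_main` function by passing the three SNPs of the cluster in place of the causal set.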

In addition to each FDR and TDR result, for each SNP associated with the disease we also report its rank based on a univariate test statistic, using the $\chi^2$ statistic and the Bayes Factor (BF), where the BF compares a model with a single SNP main effect to a model without any SNP. These statistics are averaged over all the simulated data sets.

Method comparison

For method comparison, we also fitted the penalized regression method LASSO [38], the BVSR approach [13] and a Markov blanket-based method (DASSO-MB) [14] to our simulated GWAS data sets.

For the LASSO, we used a formulation proposed for GWAS data in [19], which is based on shrinkage priors. In this version of the LASSO, each regression coefficient is assigned an independent shrinkage prior with a density that is sharply peaked at zero. The authors proposed to use the double exponential (DE) distribution or the normal exponential gamma (NEG) distribution as prior density function. Parameter estimates are obtained by maximizing the posterior density $p(\beta \mid X, y)$ over $\beta$, where $X$ is the normalized genotype data and $y$ the response variable (i.e. the case-control status). Taking logarithms in Bayes' theorem, the problem can be thought of as maximizing the penalized log-likelihood function

$$\log p(\beta \mid y, X) = L(\beta) - f(\beta) + \text{const},$$

where $L$ is the log-likelihood of the logistic regression model and $f$ is the log-prior density with a minus sign, which allows $f$ to be interpreted as a penalty function. We used the DE prior, for which the maximization of the penalized log-likelihood is equivalent to the LASSO procedure [38]. The FDRm (defined above) can be controlled at a given level by changing the hyper-parameter of the DE prior, so we chose the hyper-parameter that keeps the FDRm at the same level as the other approaches in each simulation scenario.
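The equivalence between the DE prior and the LASSO penalty can be made explicit with a short derivation. The lines below assume independent DE priors with a common rate parameter, written here as $\lambda$, on each of $J$ regression coefficients. The symbols $\lambda$ and $J$ are introduced for illustration only, and the parameterization used in [19] may differ.

```latex
% Independent double exponential (Laplace) priors with common rate \lambda
p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda |\beta_j|\right),
\qquad j = 1, \dots, J.

% Penalty: minus the log-prior density
f(\beta) = -\sum_{j=1}^{J} \log p(\beta_j \mid \lambda)
         = \lambda \sum_{j=1}^{J} |\beta_j| - J \log\frac{\lambda}{2}.

% The last term does not depend on \beta, so the posterior mode is the
% L1-penalized (LASSO) estimate
\hat{\beta} = \arg\max_{\beta} \Big\{ L(\beta) - \lambda \sum_{j=1}^{J} |\beta_j| \Big\}.
```

Under this parameterization a larger $\lambda$ means a prior more sharply peaked at zero and hence fewer selected SNPs, which is why tuning the hyper-parameter changes the FDRm.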
The BVSR approach places normal priors on the regression coefficients associated with the SNP effects and a prior on the probability that each regression coefficient is zero in the model, which controls the sparsity of the model. The originality of this approach is that the variance parameter of the normal prior on the regression coefficients depends on the proportion of genetic variability explained (PVE) by the set of SNPs selected in the model (which itself is a function of the sparsity parameter). This guards against the risk that more complex models explain a substantially higher PVE. The induced prior on the PVE given the sparsity parameter is therefore assumed flat in the range (0, 1). BVSR was initially developed to find models associated with high PVE for a continuous response. An extension to binary responses was also proposed using a probit link function. The inference is conducted using MCMC, and the models with the highest posterior PVE are identified. The model size for BVSR could vary between 1 and 5, and the hyper-parameters were chosen so that the control of FDRm was similar to that of the other approaches in the various simulation scenarios. However, this control of FDR could not be achieved in Scenarios 2 and 4 of the simulations.

The DASSO-MB approach uses tests of conditional independence among discrete variables (disease variables and SNP variables) based on a frequentist probabilistic graphical model framework. The tests of conditional dependence are based on $G^2$ statistics for contingency tables. The algorithm starts with a forward step selecting the SNP that has the highest $G^2$ statistic (conditionally on the SNPs already selected) and is dependent on the response variable. Each time a SNP is added to the model, a backward step is applied to test the conditional dependence of all other SNP variables with the response, conditional on the rest of the variables, until the selected variables do not change. When no other SNP is added to the selection, a final backward step is performed to test the conditional dependence of each selected SNP with the response, conditional on all possible subsets of the SNPs selected in the forward step. A significance threshold for the forward and backward steps needs to be specified; we used the default option of 0.01 for both steps and ensured in our simulations that it controls the FDRm at the same level as the other approaches.

For MOSS and BVSR, we fitted two-SNP main effects models, as we did in our real application. Fitting more complex models would be possible, e.g. with three SNPs, but given the sample size available the variable selection would be very unstable. With the LASSO and DASSO, although the model size could theoretically be larger than 2, in most situations the models identified by these two approaches had a model size of 1 or 2 SNPs.
To control FDRm with MOSS, we varied the parameters c and c′, which control the number of models and variables selected, and we verified in our simulation studies that the FDRm was controlled at the same level as for the other methods.

Results

The results related to the FDR estimates are presented in Table 1. In scenario 1, all four methods have an FDRm varying between 15.4% (MOSS) and 19.4% (DASSO). For the same type of simulation problem, previous papers reported very similar control of the FDR level [15, 19]. In scenario 2, all methods control FDRm at a very low level (i.e. < 3%) except BVSR, which has a very inflated FDRm of 53%. Scenario 3 is probably the most complex, since the goal is to find a causal variant within a small chromosomal region comprising 3 SNPs. The FDRm was close to 50% for all the methods, DASSO having the highest rate of 59%. Finally, in scenario 4, the FDRm was about 14% for all the methods except BVSR, which has an inflated rate of 27%. In scenario 3, the use of FDRc did not improve the control of FDR over the use of FDRm.

The results related to the TDR estimates are presented in Table 2. We recall that all the methods ensure almost the same control of FDRm except BVSR, which

has an inflated rate in scenarios 2 and 4. When considering the TDR over all the SNPs, i.e. TDRoverall, MOSS surpasses the other approaches. The TDRoverall statistics for MOSS are 38%, 54%, 100% and 64% in scenarios 1, 2, 3 and 4, respectively. The advantage of MOSS is clearly seen in scenario 3, where the second best method (DASSO) has a TDRoverall of only 71%. Overall, the second best approach is DASSO, except in scenario 4. The TDRm statistics confirm the superiority of MOSS for the individual SNP results, and these results are consistent across SNPs with different effect sizes and MAFs. In scenario 1, the DASSO results are close to those of MOSS, with a noticeable difference for SNP2 and SNP5, which are two rare SNPs. Both the LASSO and BVSR perform worse and can even have a TDRm of 0% for particular SNPs. In scenario 2, the superiority of MOSS is confirmed by the TDRm statistics. Interestingly, the TDRpair results also show that MOSS detects with higher probability the models including two SNPs that were involved in a SNP × SNP interaction term. We recall that the TDRpair statistic is based only on evaluating models with main effects and not on a test of interaction. In scenario 3, MOSS reaches a TDRm of 100% for the causal variant, which is remarkable since this is a complex situation with high LD between the SNPs. The second best method was DASSO (TDRm of 71%), followed by LASSO (TDRm of 36%). Finally, in scenario 4, the two best methods were MOSS and LASSO, with TDRm varying between 22% and 99% for MOSS and between 8% and 100% for LASSO.

We therefore conclude that MOSS is superior to its competitors in various situations, based on our simulation results. The advantages of MOSS appear more striking when considering rare SNPs, SNPs in high LD and SNPs that could interact with each other. The performance of MOSS was also very good in other situations.
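As a rough illustration of how such discovery rates can be tallied over simulation replicates, the sketch below assumes the simplest reading of the statistics: the FDR is the share of selected SNPs that are not causal, pooled over replicates, the per-SNP TDR is the share of replicates in which a given causal SNP is selected, and the pair TDR is the share of replicates in which both SNPs of a pair are selected. These definitions, the function names and the toy selections are assumptions made for illustration; the definitions given with the simulation design take precedence.

```python
def fdr(selections, causal):
    """Share of selected SNPs that are not causal, pooled over replicates."""
    causal = set(causal)
    discoveries = sum(len(s) for s in selections)
    false_discoveries = sum(len(set(s) - causal) for s in selections)
    return false_discoveries / discoveries if discoveries else 0.0


def tdr_per_snp(selections, causal):
    """For each causal SNP, share of replicates in which it was selected."""
    n_replicates = len(selections)
    return {snp: sum(snp in s for s in selections) / n_replicates
            for snp in causal}


def tdr_pair(selections, pair):
    """Share of replicates in which both SNPs of the pair were selected."""
    first, second = pair
    return sum(first in s and second in s for s in selections) / len(selections)


# Toy example: four replicates, each giving the set of SNPs a method selected.
selections = [{"snp1", "snp9"}, {"snp1", "snp2"}, {"snp2"}, set()]
causal = ["snp1", "snp2"]
print(fdr(selections, causal))
print(tdr_per_snp(selections, causal))
print(tdr_pair(selections, ("snp1", "snp2")))
```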
We also noticed that the performance of MOSS in terms of FDR and TDR remains unchanged for various specifications of the priors (results not shown). In particular, setting the prior cell counts all equal to 1, or proportional to the sample size of the observed cell counts with various possible proportions, did not change our main conclusions.

[Table 1 about here.]

[Table 2 about here.]

7. Analysis of the CGEM breast cancer GWAS data

7.1. The breast cancer paradigm

In most Western populations, approximately one in ten women develop breast cancer. Epidemiological studies have shown that women who have first-degree relatives with a history of breast cancer have a two-fold increase in risk of the

disease [4]. The risk ratio increases with the number of affected first-degree relatives. Twin studies have indicated that most of the excess familial risk is due to inherited predisposition [31]. In particular, BRCA1 and BRCA2 are the most important susceptibility genes, conferring, when mutated, high lifetime risks of breast cancer [3, 36]. Mutations in BRCA1 and BRCA2 account for 16% of the familial risk of breast cancer [1]. Mutations in other genes (TP53, PTEN, STK11, CDH) are also associated with elevated risks, but it is unlikely that mutations in these six genes account for more than 20% of the familial risk of the disease. Therefore the remaining 80% of the familial risk remains to be explained. The search for this missing heritability has led to the identification of other high-penetrance mutations in candidate genes such as CHEK2, ATM, BRIP1 and PALB2. However, they make only a small contribution to the familial risk of breast cancer [37]. Alternatively, common low-penetrance alleles have been sought through genome-wide association studies (GWAS). So far only a small number of such variants have been identified and confirmed in different populations, and their inclusion only modestly improved the performance of risk models for breast cancer [12, 40]. The bulk of breast cancer genetic susceptibility thus remains to be determined. Our working hypothesis is that the combination of common and rare genetic variants contributes to the risk of breast cancer. To illustrate this point, we use MOSS to search for the most important combinations of genetic variants in the CGEM GWAS data.

7.2. The CGEM study

We used the CGEM breast cancer data. The genome-wide association study (GWAS) for breast cancer was completed in the Nurses' Health Study (NHS), with nearly 550,000 SNPs genotyped.
The analysis includes 1,145 individuals who developed breast cancer during the observational period and 1,142 age-matched individuals who did not develop breast cancer during the same time period. Both the genotype data and the pre-computed analyses based on the genotype data were retrieved from the following website ( ). The project team provides easy access to pre-computed results of the data, including SNP frequencies and single-SNP association analyses.

7.3. Data pre-processing

Our initial dataset included 555,341 SNPs and 2,287 observations (1,145 affected individuals and 1,142 controls). After exclusion of SNPs with inaccurate genotypes (missing rate ≥ 10%), we had 546,540 SNPs left. For the remaining SNPs, we imputed the missing values using the program MACH [24]. We were able to impute 546,253 SNPs accurately. We also assessed the presence of population stratification using the program EIGENSTRAT [11], which is based on principal component analysis (PCA). Using projections on the two

first principal components, we found 20 individuals (9 cases and 11 controls) who appeared to be outliers and were removed from our analysis. We did not exclude any SNPs because of low frequency, since we wanted to evaluate how well our Bayesian method was able to detect them. We did not exclude SNPs based on Hardy-Weinberg disequilibrium since there was no evidence for any deviation in our data [20].

7.4. Previous results from breast cancer GWAS

The first GWAS using the CGEM breast cancer data identified several SNPs within the gene FGFR2 [20], and this result has been replicated in many independent studies since. A SNP close to the gene BUB3 was also very significant in the initial study but has not been replicated yet. To our knowledge, none of the GWAS performed thus far on breast cancer has accounted for the joint effect of SNPs associated with the disease.

7.5. Analyses with MOSS

Table 3 displays the most important SNPs found by MOSS. We employed MOSS with c = 0.5, c′ = , a pruning probability of 1.0 and ran it 1,000 times with 100 starting values each. We searched for regression models containing at most 2 predictors explaining the response variable among 546,255 SNP variables, which gives a total number of possible regressions of . The number of models evaluated by MOSS in each of the 1,000 instances was considerably smaller and varied between 209,445 and 837,778, with a mean of 497,065. The eight regressions in the resulting S(0.5) involve twelve SNPs embedded in or very close to known genes (Table 3). MOSS selected 12 SNPs with MAF varying between 0% and 42%. Three SNPs have a MAF lower than 5%. In general, frequentist approaches applied to GWAS would not be able to compute a test statistic for these SNPs. Most of the SNPs detected by MOSS have a high rank when using the more conventional univariate p-value criterion.
The two SNPs in the gene FGFR2 were previously identified from univariate analysis of the CGEM data [20] and have been replicated in multiple studies. The SNP in the gene BUB3 was also identified in the initial analysis of the CGEM data, but no follow-up studies have confirmed its association with breast cancer. It is therefore noteworthy that MOSS was able to replicate some initial findings in these data. Additionally, several novel candidate SNPs emerge from our analysis. An example is the SNP rs associated with the highest posterior probability. To our knowledge, this SNP has never been identified by previous GWAS on breast cancer. We also noticed that the SNP rs in the gene APC, which has a very low rank based on univariate analysis and p-value ranking, would never have been picked by a

classical approach. This SNP has been selected by MOSS because it has a joint effect with the SNP in the gene BUB3. While MOSS is able to detect more SNPs associated with the disease of interest in GWAS, the question remains whether these results have any biological validity. In the next tables and figures, we give more insights into the biological interpretation of our findings.

[Table 3 about here.]

In Table 4, we give a more detailed description of the genes associated or potentially associated with the most important SNPs selected by MOSS. Interestingly, 4 out of 17 genes in our list have been previously implicated in breast cancer, namely BUB3, NTSR1, FGFR2 and APC. Furthermore, eight genes have previously been related to cancers, which suggests an enrichment of cancer genes in the MOSS selection. The most interesting gene found by MOSS is the gene C6orf15. This gene is located in the HLA region and does not have a very clear function. However, it is located in a region characterized by a dense cluster of genes that has been found over-expressed in many cancer types. This is therefore a region that would be worth sequencing to find potential causal variants associated with breast cancer or other cancers.

[Table 4 about here.]

Table 5 displays the best eight two-SNP models identified by MOSS and their associated marginal likelihoods and Bayes Factors (BF). We first notice that the BFs for these models are much higher (from to 17.18) than any of the BFs for the single SNP models in Table 3 (i.e. the maximum value was 4.50 for the SNP rs close to gene BUB3). There is also a certain level of internal replication. Indeed, two pairs of models, (1, 3) and (5, 7), appear almost identical since they involve the same two genes, though not necessarily the same SNPs. The two SNPs that belonged to the same gene were in linkage disequilibrium (LD) in both cases.
It is therefore remarkable that MOSS was able to identify SNPs in strong LD through the selection of the best models. The parameter estimates for these models were obtained by generating 10,000 samples from the posterior distributions of the induced logistic regression parameters using Bayesian Iterative Proportional Fitting (IPF). In some models, the interaction term between the two SNPs was not included. In most instances, the best models include one strong marginal SNP effect (log odds > 1) and a weaker one (|log odds| < 1), the sign of the coefficient for the latter being either positive (risk effect) or negative (protective effect). In terms of allele frequency, we call a SNP rare, uncommon or common when its MAF is < 5%, ≥ 5% and < 10%, or ≥ 10%, respectively. Among the eight models detected by MOSS, four involve two common SNPs, two include one common and one uncommon SNP, and the last two entail one rare and one common SNP. It is remarkable that MOSS was able to select these latter two models, since most common approaches for GWAS are limited to the detection of common SNPs.
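The frequency classes used above can be made concrete with a small sketch. It computes the minor allele frequency from the three genotype counts of a biallelic SNP and applies the thresholds stated in the text (rare below 5%, uncommon from 5% to below 10%, common from 10%). The function names and the genotype counts in the example are hypothetical and do not come from the CGEM data.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the genotype counts of a biallelic SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (n_ab + 2 * n_bb) / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq_b, 1.0 - freq_b)


def maf_class(maf):
    """Frequency class using the thresholds given in the text."""
    if maf < 0.05:
        return "rare"
    if maf < 0.10:
        return "uncommon"
    return "common"


# Hypothetical genotype counts (AA, AB, BB) for three SNPs.
for counts in [(980, 19, 1), (900, 90, 10), (500, 400, 100)]:
    maf = minor_allele_frequency(*counts)
    print(counts, round(maf, 4), maf_class(maf))
```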

[Table 5 about here.]

After trying to understand the joint effect of two SNPs in each of our models, we analyzed the possible implication of the SNPs and genes identified in relation to known genetic pathways. We used the bioinformatics software Ingenuity to perform this analysis. The results are presented in Figure 2. Several gene interactions within this network were identified by MOSS.

[Fig 2 about here.]

7.6. Risk prediction with Bayesian Model Averaging

The model prediction was obtained by Bayesian model averaging of the eight regression models using 500 iterations of two-fold cross-validation. The area under the ROC curve (AUC) was estimated at 63.5%. By comparison, the AUC obtained from the set of the best seven common SNPs identified through previous GWAS on breast cancer (based on univariate analysis) was only 57.4% [12]. A selection of the best 10 SNPs combined with the major known epidemiological risk factors for breast cancer resulted in an AUC of 61.3% [40]. Therefore, MOSS substantially improves the AUC, and the addition of known epidemiological and clinical factors (which were not available for this study) to our model could provide even better predictive ability.

7.7. Significance of the Bayes Factor (BF) and Multiplicity Adjustment

Our main interest in this study was to construct predictive models for breast cancer based on the subset of SNPs selected by MOSS. In many studies, however, it is also of interest to assess the significance of the results, i.e. in our context the significance of the multi-SNP models identified by MOSS. In general, there is no simple connection between Bayes Factors and p values. Under general assumptions, it is possible to compute an upper bound for the BF for a given p value, but this represents an optimistic BF [35].
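To give a sense of the size of such an upper bound, the sketch below assumes that the bound in question is the calibration of Sellke, Bayarri and Berger [35], under which the Bayes factor in favour of the null is at least −e p log(p) for p < 1/e, so that the Bayes factor in favour of the alternative is at most 1/(−e p log(p)). The function name is hypothetical, and the calibration is applied here only as an illustration, not as part of the analysis reported in this paper.

```python
import math


def max_bayes_factor(p_value):
    """Upper bound on the Bayes factor for the alternative given a p value.

    Uses the -e * p * log(p) calibration, which applies for 0 < p < 1/e.
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the calibration applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p_value * math.log(p_value))


for p in (0.05, 0.01, 0.001):
    print(p, round(max_bayes_factor(p), 2))
```

Under this calibration a p value of 0.05 corresponds to a Bayes factor of at most about 2.5 in favour of the alternative, which illustrates why the bound is described as optimistic.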
Alternatively, we can compute an empirical distribution for the BF under the complete null hypothesis $H_M$ that no two-SNP model is associated with breast cancer. This is done through Monte Carlo simulations (see Section 6). The observed BF for a given two-SNP model found by MOSS is then compared with the empirical distribution of the BFs under the null hypothesis to determine its significance. To adjust for multiplicity, i.e. the number of two-SNP models evaluated by MOSS, we used a single-step maxT procedure [41], where maxT corresponds to the maximum BF over the $m$ models evaluated by MOSS. If we denote by $t_i$, $i = 1, \dots, 8$, the observed BF for one of the eight two-SNP models found by MOSS and by $T_l$ the BF under the null distribution for a model $l$, $l = 1, \dots, m$, then an adjusted p value can be computed as

$$p_i = \Pr\left( \max_{1 \le l \le m} T_l \ge t_i \mid H_M \right), \quad i = 1, \dots, 8.$$

Following this principle, we found that the eight models selected by MOSS have adjusted p values smaller than 5%.

8. Summary and Discussion

GWAS has emerged as one of the most spectacular advances in genetic research in the past five years, with thousands of novel genetic variants discovered in many complex traits and diseases [16]. Despite this success, the clinical relevance of these findings still remains to be determined. Breast cancer illustrates well the current limitations of GWAS. The predictive models based on clinical factors barely improved when genetic variants discovered through GWAS were included [12, 40]. One of the most challenging issues in GWAS analysis is to investigate the full spectrum of genetic variability, including rare variants (most likely to be functional), the joint effects of multiple variants on the response, the combined effects of genetic and environmental factors, etc.

To shift the usual paradigm in genome-wide analysis and fully exploit the richness of the genetic information generated through GWAS experiments, we propose a novel Bayesian approach. Its theoretical foundations, based on discrete Bayesian graphical models [9, 30], make it possible to assess the joint effect of multiple SNPs (that could be linked or unlinked) in a GWAS. The high-dimensional space of multi-SNP models is explored efficiently using the stochastic search algorithm MOSS [9]. Additionally, a prior contingency table representing the joint distribution of disease status and SNPs is used within this framework and can enhance the detection of rare functional genetic variants.

The use of the Bayesian framework offers several advantages in the context of GWAS over the more classical frequentist approach. First, coupled with efficient stochastic algorithms such as MOSS or MCMC methods, this framework makes it possible to explore the model space very quickly.
This is done by sampling from the joint posterior distribution of the multi-SNP models. Second, because it is very unlikely that a single model can represent the complexity of the data well, the use of the Bayesian model averaging paradigm acknowledges the high degree of model uncertainty in high-dimensional discrete data [26]. Third, sampling methods such as MOSS are better able to deal with missing and sparse data by sampling graphical models from the model space and specifying a prior distribution for the cell counts. This was clearly illustrated by the detection of rare genetic variants in our real analysis of the breast cancer GWAS.

Our simulation results demonstrated the superiority of MOSS over recently proposed approaches for multi-SNP analyses of GWAS data. The advantage was substantial in all the scenarios investigated, including the detection of marginal SNP effects, interaction effects (with or without SNP main effects), causal variants, rare variants, and SNPs that are linked or unlinked. This makes MOSS a very general approach for the analysis of a broad spectrum of SNP effects in GWAS. Our real application to a breast cancer GWAS data set confirms the added

value of our novel approach and its relevance for genetic research. We found 12 SNPs embedded in 8 two-SNP models associated with breast cancer. These two-SNP models included both common and rare variants. We replicated some known associations, e.g. with SNPs in the FGFR2 gene, but also discovered new ones that seem biologically promising. Many of these genetic associations would not have been found by conventional approaches, which generally cannot evaluate multi-SNP models and rare variants. This is the case of the two-SNP model that involves SNPs in the genes APC and BUB3. The association between the SNP in BUB3 and breast cancer was not even reported in the original analysis of this GWAS data set since it corresponds to a rare SNP [20]. Biological information about BUB3 shows that it interacts physically with APC, thus validating biologically one of our major results.

We should also emphasize that MOSS can analyze the SNPs as categorical variables instead of binary variables, as implemented in our R package GenMOSS, although for our data analysis we did not find any clear advantage in doing so. This could be because it would lead to sparser contingency tables and because most of the SNPs found in our best models had a dominant or recessive effect. MOSS can also provide estimates of effect sizes for the SNPs discovered by the method. This is illustrated in our real analysis and explained in Section 5.

An R package called GenMOSS, which implements the MOSS algorithm, can be downloaded from ( ). A more complete version called GenMOSSplus is also available on CRAN and has additional steps for GWAS data pre-processing.

Acknowledgments

We would like to thank Olia Vesselova for her development of the R package.

References

[1] Anglian Breast Cancer Study Group (2000). Prevalence and penetrance of BRCA1 and BRCA2 in a population based series of breast cancer cases. The British Journal of Cancer.
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B.
[3] The Breast Cancer Linkage Consortium (1999). Cancer risks in BRCA2 mutation carriers. Journal of the National Cancer Institute.
[4] Collaborative Group on Hormonal Factors in Breast Cancer (2002). Lancet.
[5] Darroch, J. N. and Speed, T. P. (1983). Additive and Multiplicative Models and Interaction. The Annals of Statistics.
[6] Dellaportas, P. and Forster, J. J. (1999). Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Log-linear Models. Biometrika.
[7] Devlin, B. and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics.
[8] Dobra, A. (2009). Variable selection and dependency networks for genomewide data. Biostatistics.
[9] Dobra, A. and Massam, H. (2010). The Mode Oriented Stochastic Search (MOSS) for Log-linear Models With Conjugate Priors. Statistical Methodology.
[10] Edwards, D. E. and Havranek, T. (1985). A Fast Procedure for Model Search in Multidimensional Contingency Tables. Biometrika.
[11] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics.
[12] Gail, M. (2008). Discriminatory accuracy from single nucleotide polymorphisms in models to predict breast cancer risk. Journal of the National Cancer Institute.
[13] Guan, Y. and Stephens, M. (2011). Bayesian Variable Selection Regression for Genome-wide Association Studies, and other Large-Scale Problems. Annals of Applied Statistics. To appear.
[14] Han, B., Park, M. and Chen, X. W. (2010). A Markov blanket-based method for detecting causal SNPs in GWAS. BMC Bioinformatics 11 Suppl. 3 S5.
[15] He, Q. and Lin, D.-Y. (2011). A variable selection method for genome-wide association studies. Bioinformatics.
[16] Hindorff, L. A., Junkins, H. A., Hall, P. N., Mehta, J. P. and Manolio, T. A. (2009). A Catalog of Published Genome-Wide Association Studies. Available at:
[17] Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S. and Manolio, T. A. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA.
[18] Hirschhorn, J. N. and Daly, M. J. (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics.
[19] Hoggart, C. J., Whittaker, J. C., De Iorio, M. D. and Balding, D. J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics 4 e
[20] Hunter, D. J., Kraft, P., Jacobs, K. B., Cox, D. G., Yeager, M., Hankinson, S. E., Wacholder, S. et al. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics.
[21] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. and West, M. (2005). Experiments in Stochastic Computation for High-dimensional Graphical Models. Statistical Science.
[22] Kingsmore, S. F., Linquist, I. E., Mudge, J., Gessler, D. D. and Beavis, W. D. (2008). Genome-wide association studies: progress and potential for drug discovery and development. Nature Reviews.
[23] Kruglyak, L. (2008). The road to genome-wide association studies. Nature Genetics.
[24] Li, Y., Willer, C. J., Ding, J., Scheet, P. and Abecasis, G. R. (2010). MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology.
[25] McCarthy, M. I. and Hirschhorn, J. N. (2008). Genome-wide association studies: potential next steps on a genetic journey. Human Molecular Genetics 17 R156-R165.
[26] Madigan, D. and Raftery, A. E. (1994). Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window. Journal of the American Statistical Association.
[27] Madigan, D. and York, J. (1995). Bayesian Graphical Models for Discrete Data. International Statistical Review.
[28] Madigan, D. and York, J. (1997). Bayesian Methods for Estimation of the Size of a Closed Population. Biometrika.
[29] Marchini, J., Donnelly, P. and Cardon, L. R. (2005). Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics.
[30] Massam, H., Liu, J. and Dobra, A. (2009). A Conjugate Prior for Discrete Hierarchical Log-linear Models. Annals of Statistics.
[31] Peto, J. and Mack, T. M. (2000). High constant incidence in twins and other relatives of women with breast cancer. Nature Genetics.
[32] Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science.
[33] Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature.
[34] Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics.
[35] Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. American Statistician.
[36] Thompson, D. and Easton, D. F. (2002). Cancer incidence in BRCA1 mutation carriers. Journal of the National Cancer Institute.
[37] Thompson, D. and Easton, D. F. (2004). The genetic epidemiology of breast cancer genes. Journal of Mammary Gland Biology and Neoplasia.
[38] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B.
[39] Verzilli, C. J., Stallard, N. and Whittaker, J. C. (2006). Bayesian Graphical Models for Genomewide Association Studies. American Journal of Human Genetics.
[40] Wacholder, S., Hartge, P., Prentice, R., Garcia-Closas, M., Feigelson, H. S. et al. (2010). Performance of common genetic variants in breast-cancer risk models. New England Journal of Medicine.
[41] Westfall, P. H. and Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons, New York.
[42] Wilson, M. A., Iversen, E. S., Clyde, M. A., Schmidler, S. C. and Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. Annals of Applied Statistics.
[43] Yeung, K. Y., Bumgarner, R. E. and Raftery, A. E. (2005). Bayesian Model Averaging: Development of an Improved Multi-class, Gene Selection and Classification Tool for Microarray Data. Bioinformatics.
[44] Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics.

List of Figures

1 Conditional dependence between three SNPs and the disease status Y in scenario 3 (left) and scenario 4 (right) of the simulations
2 Gene network analysis

[Figure 1: two graphs, each on the nodes Y, SNP 1, SNP 2 and SNP 3.]

Fig 1. Conditional dependence between three SNPs and the disease status Y in scenario 3 (left) and scenario 4 (right) of the simulations

Fig 2. Gene network analysis


More information

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts Proceedings 59th ISI World Statistics Congress, 5-30 August 013, Hong Kong (Session CPS03) p.3947 A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL Li Yi 1,4, Jong-Joo

More information

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer T. M. Murali January 31, 2006 Innovative Application of Hierarchical Clustering A module map showing conditional

More information

Graphical Modelling in Genetics and Systems Biology

Graphical Modelling in Genetics and Systems Biology Graphical Modelling in Genetics and Systems Biology m.scutari@ucl.ac.uk Genetics Institute October 30th, 2012 Current Practices in Bayesian Networks Modelling Current Practices in Bayesian Networks Modelling

More information

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era Anthony Green Sr. Genotyping Sales Specialist North America 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx,

More information

Human linkage analysis. fundamental concepts

Human linkage analysis. fundamental concepts Human linkage analysis fundamental concepts Genes and chromosomes Alelles of genes located on different chromosomes show independent assortment (Mendel s 2nd law) For 2 genes: 4 gamete classes with equal

More information
