Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Similar documents

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Title: Powerful SNP Set Analysis for Case-Control Genome Wide Association Studies. Running Title: Powerful SNP Set Analysis. Hill, NC. MD.

General aspects of genome-wide association studies

Genome-wide association studies (GWAS) Part 1

Runs of Homozygosity Analysis Tutorial

ARTICLE Insights into Colon Cancer Etiology via a Regularized Approach to Gene Set Analysis of GWAS Data

Basic statistical analysis in genetic case-control studies

Genomic models in bayz

Computational Workflows for Genome-Wide Association Study: I

Nature Genetics: doi: /ng Supplementary Figure 1. QQ plots of P values from the SMR tests under a range of simulation scenarios.

A FAST ALGORITHM FOR DETECTING GENE GENE INTERACTIONS IN GENOME-WIDE ASSOCIATION STUDIES

Advanced Introduction to Machine Learning

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip

MoGUL: Detecting Common Insertions and Deletions in a Population

arxiv: v1 [stat.ap] 3 Feb 2015

Supplementary Note: Detecting population structure in rare variant data

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Imputation-Based Analysis of Association Studies: Candidate Regions and Quantitative Traits

Gene-centric Genomewide Association Study via Entropy

Testing Gene-Environment Interaction in Large Scale Case-Control Association Studies: Possible Choices and Comparisons

Principal Component Analysis in Genomic Data

Improved Power by Use of a Weighted Score Test for Linkage Disequilibrium Mapping

Author's response to reviews

H3A - Genome-Wide Association testing SOP

ARTICLE ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

An introductory overview of the current state of statistical genetics

Exploring the Genetic Basis of Congenital Heart Defects

Whole-Genome Genetic Data Simulation Based on Mutation-Drift Equilibrium Model

Probabilistic Graphical Models

KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies

A Bayesian shared component model for genetic association studies

PERSPECTIVES. A gene-centric approach to genome-wide association studies

A Bayesian Graphical Model for Genome-wide Association Studies (GWAS)

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

SNPs - GWAS - eqtls. Sebastian Schmeier

Computational methods for the analysis of rare variants

JOINT ANALYSIS OF SNP AND GENE EXPRESSION DATA IN GENETIC ASSOCIATION STUDIES OF COMPLEX DISEASES 1 arxiv: v1 [stat.

Genetic load. For the organism as a whole (its genome, and the species), what is the fitness cost of deleterious mutations?

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

Characterization of Allele-Specific Copy Number in Tumor Genomes

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Chapter 11: Genome-Wide Association Studies

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Genomic Selection Using Low-Density Marker Panels

Statistical Methods for Genome Wide Association Studies

PLINK gplink Haploview

Genome-Wide Association Studies (GWAS): Computational Them

Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes

ARTICLE Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach

Intro Logistic Regression Gradient Descent + SGD

Evaluation of random forest regression for prediction of breeding value from genomewide SNPs

Genetics and Bioinformatics

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

C 2014 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.

Midterm 1 Results. Midterm 1 Akey/ Fields Median Number of Students. Exam Score

Genomic Selection using low-density marker panels

Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015

Genome-wide association studies and the genetic dissection of complex traits

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

BTRY 7210: Topics in Quantitative Genomics and Genetics

Accuracy of whole genome prediction using a genetic architecture enhanced variance-covariance matrix

Experimental Design and Sample Size Requirement for QTL Mapping

Why can GBS be complicated? Tools for filtering, error correction and imputation.

BAYESIAN LARGE-SCALE MULTIPLE REGRESSION WITH SUMMARY STATISTICS FROM GENOME-WIDE ASSOCIATION STUDIES 1

A Statistical Framework for Joint eqtl Analysis in Multiple Tissues

Ning Yang 1., Yanli Lu 1,2., Xiaohong Yang 3, Juan Huang 1, Yang Zhou 1, Farhan Ali 1, Weiwei Wen 1, Jie Liu 1, Jiansheng Li 3, Jianbing Yan 1 *

REVIEWS GENOME-WIDE ASSOCIATION STUDIES FOR COMMON DISEASES AND COMPLEX TRAITS. Joel N. Hirschhorn* and Mark J. Daly*

Strategy for applying genome-wide selection in dairy cattle

Mapping and Mapping Populations

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes

ARTICLE Sherlock: Detecting Gene-Disease Associations by Matching Patterns of Expression QTL and GWAS

Extracting Sequence Features to Predict Protein-DNA Interactions: A Comparative Study

Mapping Genes Controlling Quantitative Traits Using MAPMAKER/QTL Version 1.1: A Tutorial and Reference Manual

Lecture 8: Sequencing and SNP. Sept 15, 2006

Flexible Regression and Smoothing

Analysing the Immune System with Fisher Features

Linking Genetic Variation to Important Phenotypes

A GENOTYPE CALLING ALGORITHM FOR AFFYMETRIX SNP ARRAYS

Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies

Harbingers of Failure: Online Appendix

Exact Multipoint Quantitative-Trait Linkage Analysis in Pedigrees by Variance Components

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

Quality Control Report for Exome Chip Data University of Michigan April, 2015

Measurement of Molecular Genetic Variation. Forces Creating Genetic Variation. Mutation: Nucleotide Substitutions

Predictive Modelling of Recruitment and Drug Supply in Multicenter Clinical Trials

Identifying Selected Regions from Heterozygosity and Divergence Using a Light-Coverage Genomic Dataset from Two Human Populations

Package MADGiC. July 24, 2014

Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Semiparametric analysis of case control genetic data in the presence of environmental factors

Quality Measures for CytoChip Microarrays

LINKAGE DISEQUILIBRIUM MAPPING USING SINGLE NUCLEOTIDE POLYMORPHISMS -WHICH POPULATION?

A powerful and efficient set test for genetic markers that handles confounders

Human linkage analysis. fundamental concepts

MAS refers to the use of DNA markers that are tightly-linked to target loci as a substitute for or to assist phenotypic screening.

Package FSTpackage. June 27, 2017

Transcription:

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 1/20 Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies David J. Balding Centre for Biostatistics Imperial College London d.balding@ic.ac.uk from 1/10/09 moving to: Institute of Genetics, University College London

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 2/20 Simultaneous analysis of all SNPs Why try to do it? additional power to detect true positives: multiple true predictors in model less residual variation better prediction of phenotype capture all main effects for epistatic interactions look for G E joint/interaction effects with all the G signal at a potential false +ve weakened by true +ves better localisation and interpretability. Counter-arguments: massive optimization problem unlikely to find global maximum so may lose some of the above advantages.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 3/20 HyperLASSO algorithm Data: cases and controls GW genotypes (+ covariates) Model: logistic regression logit(y i ) = β 0 + j β j x ij i individuals; j SNPs; x ij genotype {0, 1, 2}. Problem: too may predictors - overfitting. Solution: prior/penalty strongly rewards β j = 0 for each j:

Normal-Exponential-Gamma (NEG) NEG is a generalisation of the DE (Laplace): DE(β ξ) = 0 NEG(β λ, γ) = N(β 0,σ 2 )G(σ 2 1,ξ 2 /2) dσ 2 = ξ 2 exp{ ξ β } 0 0 { β 2 } = κexp 4γ 2 N(β 0,σ 2 )G(σ 2 1,ψ)G(ψ λ,γ 2 ) dσ 2 dψ D 2λ 1 ( β γ ), If λ and γ with ξ = 2λ/γ constant, NEG DE. prior interpretation: most effects are zero, agnostic about size of non-zero effects; flatter tails means less shrinkage so sparser models initial over-estimate of effect size penalised lightly. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 4/20

log density of NEG prior 4 3 2 1 0 1 NEG : λ = 0.1 NEG : λ = 1 NEG : λ = 10 DE 1.0 0.5 0.0 0.5 1.0 x Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 5/20

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 6/20 Choice of prior parameters Some prior quantiles for three choices of λ and γ: λ 1.0 1.4 1.8 γ 0.0012 0.006 0.015 P(x > 0.05) 5.8 10 4 3.5 10 3 1.7 10 2 P(x > 0.1) 1.4 10 4 5.3 10 4 2.1 10 3 P(x > 0.2) 3.6 10 5 7.8 10 5 2.0 10 4 P(x > 0.4) 9.0 10 6 1.1 10 5 1.7 10 5 P(x > 1) 1.4 10 6 8.5 10 7 6.2 10 7 Larger shape parameter λ gives more weight to small, non-zero effects. Below, we choose λ to be very small (usually 0.05) to have very little shrinkage. Choose scale parameter γ to give desired type 1 error for given λ (see below).

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 7/20 The optimisation algorithm Cyclic Co-ordinate Descent algorithm: start with all β j = 0. update order is allocated randomly but fixed in each run. Newton-Raphson update step: β new j = β j L (β) f (β j ) L (β) f (β j ) where each denotes derivative wrt β j. Key shortcut: use computationally-fast bounds to avoid expensive computation of L for all but a few SNPs. Seek local optimum in 100 runs; choose best solution: may not be global optimum but very similar model.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 8/20 SNP selection Given shrinkage prior: SNPs with non-zero posterior mode (log-lik) > (log-prior) not a Bayesian procedure: assess using type-1 error use Bayesian language for penalised likelihood ( shrinkage regression ) rescaling genotype scores alters prior; e.g. if genotypes are standardised to mean zero and unit variance then type-1 error invariant with MAF for 1 SNP, asymptotically Armitage trend test (ATT). supports larger effect sizes for lower MAF.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 9/20 Type 1 error; Multiple genetic models explicit approximation for type 1 error in terms of derivative of log-prior asymptotically correct for 1 SNP, otherwise conservative avoids need for permutation choose desired λ then assign γ to control type-1 error Can also consider dominant, recessive and heterozygous (1 df over-dominant) models, in addition to codominant (additive). only one regression coefficient per SNP no type-1 error approximation, but empirically the extra terms approximately double type-1 error.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 10/20 Main simulation study 1K cases, 1K controls; 80K SNPs on 20 20 Mb chromosomes; 6 Causal SNPs all on one chromosome; 500 replicates, so 3K causal SNPs total; α = 10 5 ; additive-only model. Causal False positives SNPs SNPs min. separation (Kb) Method selected tagged 0 40 100 HyperLASSO 2097 1576 368 368 366 ATT 6810 1554 696 486 441 A causal variant is tagged" if 1 selected SNP has r 2 > 0 05 with it.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 11/20 Main simulation study # causal SNPs tagged (/500) by MAF and risk ratio MAF and allelic risk ratio 15% 5% 2% Method 1.4 1.5 1.8 2.2 2.5 3.0 HyperLASSO 252 360 209 370 146 239 ATT 244 353 209 370 143 235 54 SNPs tagged by HyperLASSO and not ATT; 32 SNPs tagged by ATT and not HyperLASSO; p = 0 11

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 12/20 Tagging SNPs per causal SNP NEG analysis Frequency 0 500 1000 1500 0 5 10 15 Number of selected SNPs tagging each causal variant ATT analysis Frequency 0 500 1000 1500 0 5 10 15 Number of selected SNPs tagging each causal variant

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 13/20 Genome-wide simulation study 1K cases, 1K controls; 120 20 Mb chromosome; 480K SNPs; 10 causal variants on one chromosome, each MAF = 0 15 and risk ratio = 2; α = 5 10 7 ; additive-only model; compute time 1 hour per mode. Results: Both HyperLASSO and ATT tag all 10 causal SNPs; HyperLASSO selects 14 SNPs; ATT selects 35 SNPs.

10 15 20 25 10 15 20 25 Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 14/20 Genome-wide simulation (a) 10 15 20 25 Zoomed regions Causal SNPs Sig. p values Selected by NEG 0 5 10 15 20 Position (Mb) (b) 8.0 8.2 8.4 8.6 8.8 Position (Mb) (c) 11.0 11.1 11.2 11.3 11.4 11.5 11.6 Position (Mb)

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 15/20 Resequencing simulation study 1K cases, 1K controls; all 192K polymorphic sites on one 20 Mb chromosome; 6 Causal SNPs on the chromosome same disease model as main study; 10 replicates; α = 10 5 ; Results: additive-only model; Both ATT and HyperLASSO tagged 54 of 60 causals; HyperLASSO selected 64 SNPs ATT selected 599.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 16/20 Selected SNPs: best r 2 with causal NEG analysis Frequency 0 500 1000 1500 Frequency 0 10 20 30 40 50 0.0 0.2 0.4 0.6 0.8 1.0 Highest r 2 with a causal variant for each selected SNP ATT analysis 0.0 0.2 0.4 0.6 0.8 1.0 Highest r 2 with a causal variant for each selected SNP

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 17/20 T2D GWAS Sladek et al. 2007 694 cases, 654 controls; 300K Illumina Hap300 genotyping platform; 42 SNPs tagging 32 loci significant at 5 10 5 and were progressed to stage 2 (plus 15 from Hap100 chip not re-analysed here). NEG re-analysis using additive, dominant and recessive models: 26 SNPs tagging 25 loci; tagged all 5 loci confirmed in stage 2 with same model (3 dominant, 2 additive); looking at sub-optimal modes generated 3 new SNPs, each in high LD with a selected SNP.

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 18/20 Prediction of case/control status much interest in prediction of phenotype, but widespread view that prospects are poor, e.g. NEJM Nov 08: 18 confirmed SNPs add little to prediction of T2D from known risk factors Nat Genet 08: 20 confirmed SNPs for human height explain 3% of variation BUT these only include SNPs significant at very stringent levels. Many more true causal SNPs exist. relax penalty for prediction larger models greater shrinkage of effect sizes

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 19/20 NEG prior for prediction NEG density 0 20 40 60 80 lam = 1, gam = 0.005 lam = 3, gam = 0.02 lam = 5, gam = 0.05 Larger values of shape parameter λ gives more curvature and greater density away from origin, so more shrinkage and more non-zero modes. 0.00 0.02 0.04 0.06 0.08 x

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies p. 20/20 Acknowledgments funding from UK Medical Research Council HyperLASSO software by Clive Hoggart, with help from Maria De Iorio and John Whittaker http://www.ebi.ac.uk/projects/bargen/ Hoggart et al., PLoS Genetics 2008