Probabilistic Graphical Models

Similar documents
Probabilistic Graphical Models

Advanced Introduction to Machine Learning

Statistical Methods for Network Analysis of Biological Data

Workshop on Data Science in Biomedicine

Data Mining and Applications in Genomics

Multi-SNP Models for Fine-Mapping Studies: Application to an. Kallikrein Region and Prostate Cancer

Identifying Genetic Associations with MRI-derived Measures via Tree-Guided Sparse Learning

Course Announcements

BTRY 7210: Topics in Quantitative Genomics and Genetics

Abstract. Introduction

Applied Bioinformatics

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

Modeling & Simulation in pharmacogenetics/personalised medicine

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

SNPs - GWAS - eqtls. Sebastian Schmeier

Appendix 5: Details of statistical methods in the CRP CHD Genetics Collaboration (CCGC) [posted as supplied by

Including prior knowledge in shrinkage classifiers for genomic data

Principal Component Analysis in Genomic Data

Lecture: Genetic Basis of Complex Phenotypes Advanced Topics in Computa8onal Genomics

Introduction to Quantitative Genomics / Genetics

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

Detecting gene-gene interactions in high-throughput genotype data through a Bayesian clustering procedure

Understanding genetic association studies. Peter Kamerman


FINDING GENOME-TRANSCRIPTOME-PHENOME ASSOCIATION WITH STRUCTURED ASSOCIATION MAPPING AND VISUALIZATION IN GENAMAP

LD Mapping and the Coalescent

FFGWAS. Fast Functional Genome Wide Association AnalysiS of Surface-based Imaging Genetic Data

Supplementary Note: Detecting population structure in rare variant data

Computational Genomics

Genomic Selection with Linear Models and Rank Aggregation

BTRY 7210: Topics in Quantitative Genomics and Genetics

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Computational Workflows for Genome-Wide Association Study: I

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

Using Visualization and Automation to Accelerate Genetics Discovery

Gene Mapping in Natural Plant Populations Guilt by Association

Inferring Gene-Gene Interactions and Functional Modules Beyond Standard Models

Linkage Disequilibrium

Genome-wide association studies with high-dimensional phenotypes

Linking Genetic Variation to Important Phenotypes

Authors: Yumin Xiao. Supervisor: Xia Shen

Association studies (Linkage disequilibrium)

The Future of IntegrOmics

Technical Challenges to Implementation of Genomic Selection

Introduction to Add Health GWAS Data Part I. Christy Avery Department of Epidemiology University of North Carolina at Chapel Hill

Lecture 2: Height in Plants, Animals, and Humans. Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Genomic Selection in Breeding Programs BIOL 509 November 26, 2013

Crash-course in genomics

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

Trudy F C Mackay, Department of Genetics, North Carolina State University, Raleigh NC , USA.

Genome-wide association studies (GWAS) Part 1

Axiom Biobank Genotyping Solution

Prostate Cancer Genetics: Today and tomorrow

Genotype Prediction with SVMs

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

Population differentiation analysis of 54,734 European Americans reveals independent evolution of ADH1B gene in Europe and East Asia

A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods *

QTL Mapping Using Multiple Markers Simultaneously

arxiv: v1 [stat.ap] 31 Jul 2014

Genome-Wide Association Studies (GWAS): Computational Them

Familial Breast Cancer

Benno Pütz. MPI of Psychiatry

Polygenic Influences on Boys & Girls Pubertal Timing & Tempo. Gregor Horvath, Valerie Knopik, Kristine Marceau Purdue University

Linkage Disequilibrium. Adele Crane & Angela Taravella

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus.

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat

Multiple Sclerosis: Recent Insights from Genomics. Bruce Cree, MD, PhD, MAS University of California San Francisco. Disclosure

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

DNA Based Disease Prediction using pathway Analysis

Bioinformatics for Biologists

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Using Genomics to Guide Immunosuppression Therapy David A. Baran, MD, FACC, FSCAI System Director, Advanced HF, Transplant and MCS, Sentara Heart

Module 1 Principles of plant breeding

Bayesian Networks as framework for data integration

Why do we need statistics to study genetics and evolution?

Genome-wide analyses in admixed populations: Challenges and opportunities

Brown University. Linear Methods for SNP Selection

Péter Antal Ádám Arany Bence Bolgár András Gézsi Gergely Hajós Gábor Hullám Péter Marx András Millinghoffer László Poppe Péter Sárközy BIOINFORMATICS

AN EVALUATION OF POWER TO DETECT LOW-FREQUENCY VARIANT ASSOCIATIONS USING ALLELE-MATCHING TESTS THAT ACCOUNT FOR UNCERTAINTY

ARTICLE Sherlock: Detecting Gene-Disease Associations by Matching Patterns of Expression QTL and GWAS

Genome-Wide Association Studies. Ryan Collins, Gerissa Fowler, Sean Gamberg, Josselyn Hudasek & Victoria Mackey

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

SNP calling and Genome Wide Association Study (GWAS) Trushar Shah

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA

High-density SNP Genotyping Analysis of Broiler Breeding Lines

Genome Wide Association Study for Binomially Distributed Traits: A Case Study for Stalk Lodging in Maize

Understanding the Genetic Architecture of Complex Human Diseases: Biomedicine, Statistics, and Large Genomic Data Sets Josée Dupuis

Shrunken Dissimilarity Measure for Genome-wide SNP Data Classification

Human Genetics and Gene Mapping of Complex Traits

1 why study multiple traits together?

1b. How do people differ genetically?

Detecting gene gene interactions that underlie human diseases

Detecting gene gene interactions that underlie human diseases

Haplotype Based Association Tests. Biostatistics 666 Lecture 10

A FAST ALGORITHM FOR DETECTING GENE GENE INTERACTIONS IN GENOME-WIDE ASSOCIATION STUDIES

Transcription:

School of Computer Science Probabilistic Graphical Models Graph-induced structured input/output models - Case Study: Disease Association Analysis Eric Xing Lecture 25, April 16, 2014 Reading: See class website 1

Genetic Basis of Diseases ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA Healthy TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA Sick Single nucleotide polymorphism (SNP) Causal (or "associated") SNP 2

Genetic Association Mapping Data Genotype Phenotype Standard Approach A.... T.. G..... C...... T..... A.. G. A.... A.. C..... C...... T..... A.. G. causal SNP T.... T.. G..... G...... T..... T.. C. A.... A.. C..... C...... T..... T.. C. A.... T.. G..... G...... A..... A.. G. T.... T.. C..... G...... A..... A.. C. A.... A.. C..... C...... A..... T.. C. a univariate phenotype: e.g., disease/control Cancer: Dunning et al. 2009. Diabetes: Dupuis et al. 2010. Atopic dermatitis: Esparza-Gordillo et al. 2009. Arthritis: Suzuki et al. 2008 3

Genetic Basis of Complex Diseases ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA Healthy TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA Cancer Causal SNPs 4

Genetic Basis of Complex Diseases ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA Healthy ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA Biological mechanism ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA Cancer Causal SNPs 5

Genetic Basis of Complex Diseases Association to intermediate phenotypes Intermediate Phenotype Gene expression ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA Healthy ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA Causal SNPs Clinical records Cancer 6

Structured Association Traditional Approach causal SNP ACGTTTTACTGTACAATT Association with Phenome ACGTTTTACTGTACAATT a univariate phenotype: gene expression level Multivariate complex syndrome (e.g., asthma) age at onset, history of eczema genome wide expression profile 7

Goal: Inferring Structured Association Standard Approach Consider one phenotype & one genotype at a time vs. New Approach Consider multiple correlated phenotypes & genotypes jointly Phenotypes Phenome 8

Sparse Associations Pleotropic effects Epistatic effects 9

Sparse Learning Linear Model: Lasso (Sparse Linear Regression) [R.Tibshirani 96] Why sparse solution? penalizing constraining 10

Multi-Task Extension Multi-Task Linear Model: Input: Output: Coefficients for k-th task: Coefficient Matrix: Coefficients for a variable (2 nd ) Coefficients for a task (2 nd ) 11

Outline Background: Sparse multivariate regression for disease association studies Structured association a new paradigm Association to a graph-structured phenome Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009) Association to a tree-structured phenome Tree-guided group lasso (Kim & Xing, ICML 2010) 12

Genetic Association for Asthma Clinical Traits TCGACGTTTTACTGTACAATT Subnetworks for lung physiology Statistical challenge How to find associations to a multivariate entity in a graph? Subnetwork for quality of life 13

Multivariate Regression for Single- Trait Association Analysis Trait Genotype T G A A C C A T G A A G T A 2.1 x = Association Strength? y = X x β 14

Multivariate Regression for Single- Trait Association Analysis Trait Genotype T G A A C C A T G A A G T A 2.1 x = Association Strength Many non-zero associations: Which SNPs are truly significant? 15

Lasso for Reducing False Positives (Tibshirani, 1996) Trait Genotype Association Strength 2.1 = T G A A C C A T G A A G T A x Lasso Penalty for sparsity J + β j j 1 Many zero associations (sparse results), but what if there are multiple related traits? 16

Multivariate Regression for Multiple- Trait Association Analysis Genotype Association Strength (3.4, 1.5, 2.1, 0.9, 1.8) Allergy Lung physiology = T G A A C C A T G A A G T A LD Synthetic lethal x Association strength between SNP j and Trait i: β j,i + β j,i i, j How to combine information across multiple traits to increase the power? 17

Multivariate Regression for Multiple- Trait Association Analysis Trait Genotype Association Strength (3.4, 1.5, 2.1, 0.9, 1.8) Allergy Lung physiology = T G A A C C A T G A A G T A x Association strength between SNP j and Trait i: β j,i + β j,i i, j + We introduce graph-guided fusion penalty 18

Multiple-trait Association: Graph-Constrained Fused Lasso Step 1: Thresholded correlation graph of phenotypes Step 2: Graph constrained fused lasso ACGTTTTACTGTACAATT Fusion Lasso Penalty Graph-constrained fusion penalty 19

Fusion Penalty SNP j ACGTTTTACTGTACAATT Association strength between SNP j and Trait k: β jk Association strength between SNP j and Trait m: β jm Trait m Trait k Fusion Penalty: β jk - β jm For two correlated traits (connected in the network), the association strengths may have similar values. 20

Graph-Constrained Fused Lasso Overall effect ACGTTTTACTGTACAATT Fusion effect propagates to the entire network Association between SNPs and subnetworks of traits 21

Multiple-trait Association: Graph-Weighted Fused Lasso Overall effect ACGTTTTACTGTACAATT Subnetwork structure is embedded as a densely connected nodes with large edge weights Edges with small weights are effectively ignored 22

Estimating Parameters Quadratic programming formulation Graph-constrained fused lasso Graph-weighted fused lasso Many publicly available software packages for solving convex optimization problems can be used 23

Improving Scalability Original problem Equivalently Using a variational formulation Iterative optimization Update β k Update d jk s, d jml s 24

Previous Works vs. Our Approach Previous approach PCA-based approach (Weller et al., 1996, Mangin et al., 1998) Extension of module network for eqtl study (Lee et al., 2009) Network-based approach (Chen et al., 2008, Emilsson et al., 2008) Implicit representation of trait correlations Hard to interpret the derived traits Average traits within each trait cluster Loss of information Separate association analysis for each trait (no information sharing) Single-trait association are combined in light of trait network modules Our approach Explicit representation of trait correlations Original data for traits are used Joint association analysis of multiple traits 25

Simulation Results 50 SNPs taken from HapMap chromosome 7, CEU population 10 traits SNPs Trait Thresholded Trait Correlation Correlation Network Matrix Phenotypes High association No association True Regression Coefficients Single SNP-Single Trait Test Significant at α = 0.01 Lasso Graph-guided Fused Lasso 26

Simulation Results 27

Asthma Trait Network Subnetwork for Asthma symptoms Phenotype Correlation Network Subnetwork for lung physiology Subnetwork for quality of life 28

Phenotypes Results from Single-SNP/Trait Test Phenotypes Lung physiology related traits I Baseline FEV1 predicted value: MPVLung Pre FEF 25-75 predicted value Average nitric oxide value: online Body Mass Index Postbronchodilation FEV1, liters: Spirometry Baseline FEV1 % predicted: Spirometry Baseline predrug FEV1, % predicted Baseline predrug FEV1, % predicted Trait Network Q551R SNP Codes for amino acid changes in the intracellular signaling portion of the receptor Exon 11 SNPs High association No association Single Marker Single Trait Test Permutation test α = 0.05 Permutation test α = 0.01 29

Phenotypes Comparison of Gflasso with Others Phenotypes Lung physiology related traits I Baseline FEV1 predicted value: MPVLung Pre FEF 25-75 predicted value Average nitric oxide value: online Body Mass Index Postbronchodilation FEV1, liters: Spirometry Baseline FEV1 % predicted: Spirometry Baseline predrug FEV1, % predicted Baseline predrug FEV1, % predicted Trait Network Q551R SNP Codes for amino acid changes in the intracellular signaling portion of the receptor Exon 11 SNPs? High association No association Single Marker Single Trait Test Lasso Graph guided Fused Lasso 30

Linkage Disequilibrium Structure in IL-4R gene SNP rs3024622 SNP rs3024660 SNP Q551R r 2 =0.64 r 2 =0.07 31

IL4R Gene Chr16: 27,325,251-27,376,098 introns exons IL4R SNPs in intron SNPs in exon SNPs in promotor region Q551R rs3024622 rs3024660 32

Gene Expression Trait Analysis TCGACGTTTTACTGTACAATT Samples Genes Statistical challenge How to find associations to a multivariate entity in a tree? 33

TCGACGTTTTACTGTACAATT Tree-guided Group Lasso Why tree? Tree represents a clustering structure Scalability to a very large number of phenotypes Graph : O( V 2 ) edges Tree : O( V ) edges Expression quantitative trait mapping (eqtl) Agglomerative hierarchical clustering is a popular tool 34

Tree-Guided Group Lasso In a simple case of two genes h h Genotypes Low height Tight correlation Joint selection Genotypes Large height Weak correlation Separate selection 35

Tree-Guided Group Lasso In a simple case of two genes h Select the child nodes jointly or separately? Tree-guided group lasso L 1 penalty Lasso penalty Separate selection L 2 penalty Group lasso Joint selection Elastic net 36

Tree-Guided Group Lasso For a general tree h 2 h 1 Select the child nodes jointly or separately? Tree-guided group lasso Joint selection Separate selection 37

Tree-Guided Group Lasso For a general tree h 2 h 1 Select the child nodes jointly or separately? Tree-guided group lasso Joint selection Separate selection 38

Balanced Shrinkage Previously, in Jenatton, Audibert & Bach, 2009 39

Estimating Parameters Second-order cone program Many publicly available software packages for solving convex optimization problems can be used Also, variational formulation 40

Illustration with Simulated Data Phenotypes SNPs High association No association True association strengths Lasso Tree guided group lasso 41

Genome Structure Linkage Disequilibrium Phenome Structure Graph Stochastic block regression (Kim & Xing, UAI, 2008) Population Structure Structured Association Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009) Tree Multi-population group lasso (Puniyani, Kim, Xing, Submitted) Epistasis ACGTTTTACTGTACAATT Tree-guided fused lasso (Kim & Xing, Submitted) Dynamic Trait Group lasso with networks (Lee, Kim, Xing, Submitted) Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted) 42

Computation Time 43

Proximal Gradient Descent Original Problem: Approximation Problem: Gradient of the Approximation: 44

Geometric Interpretation Smooth approximation Uppermost Line Nonsmooth Uppermost Line Smooth 45

Convergence Rate Theorem: If we require and set, the number of iterations is upper bounded by: Remarks: state of the art IPM method for for SOCP converges at a rate 46

Multi-Task Time Complexity Pre-compute: Per-iteration Complexity (computing gradient) Tree: IPM for SOCP Proximal-Gradient Graph: IPM for SOCP Proximal-Gradient Proximal-Gradient: Independent of Sample Size Linear in #.of Tasks 47

Experiments Multi-task Graph Structured Sparse Learning (GFlasso) 48

Experiments Multi-task Tree-Structured Sparse Learning (TreeLasso) 49

Conclusions Novel statistical methods for joint association analysis to correlated phenotypes Graph-structured phenome : graph-guided fused lasso Tree-structured phenome : tree-guided group lasso Advantages Greater power to detect weak association signals Fewer false positives Joint association to multiple correlated phenotypes Other structures In phenotypes: dynamic trait In genotypes: linkage disequilibrium, population structure, epistasis 50

Reference Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B 58:267 288. Weller J, Wiggans G, Vanraden P, Ron M (1996) Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in a multi-trait experiment. Theoretical and Applied Genetics 92:998 1002. Mangin B, Thoquet B, Grimsley N (1998) Pleiotropic QTL analysis. Biometrics 54:89 99. Chen Y, Zhu J, Lum P, Yang X, Pinto S, et al. (2008) Variations in DNA elucidate molecular networks that cause disease. Nature 452:429 35. Lee SI, Dudley A, Drubin D, Silver P, Krogan N, et al. (2009) Learning a prior on regulatory potential from eqtl data. PLoS Genetics 5:e1000358. Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of gene expression and its effect on disease. Nature 452:423 28. Kim S, Xing EP (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics 5(8): e1000587. Kim S, Xing EP (2008) Sparse feature learning in high-dimensional space via block regularized regression. In Proceedings of the 24th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 325-332. AUAI Press. Kim S, Xing EP (2010) Exploiting a hierarchical clustering tree of gene-expression traits in eqtl analysis. Submitted. Kim S, Howrylak J, Xing EP (2010) Dynamic-trait association analysis via temporally-smoothed lasso. Submitted. Puniyani K, Kim S, Xing EP (2010) Multi-population GWA mapping via multi-task regularized regression. Submitted. Lee S, Kim S, Xing EP (2010) Leveraging genetic interaction networks and regulatory pathways for joint mapping of epistatic and marginal eqtls. Submitted. Dunning AM et al. (2009) Association of ESR1 gene tagging SNPs with breast cancer risk. Hum Mol Genet. 18(6):1131-9. Esparza-Gordillo J et al. (2009) A common variant on chromosome 11q13 is associated with atopic dermatitis. Nature Genetics 41:596-601. Suzuki A et al. (2008) Functional SNPs in CD244 increase the risk of rheumatoid arthritis in a Japanese population. Nature Genetics 40:1224-1229. Dupuis J et al. (2010) New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics 42:105-116. 51