Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Shane T. Jensen, Department of Statistics, The Wharton School, University of Pennsylvania (stjensen@wharton.upenn.edu)
Gary Chen and Christian Stoeckert, Jr., Department of Bioengineering and Department of Genetics, University of Pennsylvania
March 5, 2008

Motivation
- Genes are long sequences of DNA that are transcribed and eventually translated into protein.
- Near-identical genetic material can lead to many different cell types and species.
- A critical aspect of cellular function is how genes are regulated and which genes are regulated together.

Gene Regulatory Networks
- Genes are regulated by transcription factor (TF) proteins that bind directly to the DNA sequence near a gene.
- The bound protein affects the amount of transcription, thereby affecting the amount of protein produced.
- The collection of TFs and their target genes is often called the gene regulatory network.
- The goal is to elucidate the regulatory network: which genes are targeted for regulation by a particular TF?

Different Data Types
- Gene expression data: microarray chips are used to measure the amount of mRNA present for each gene across many conditions.
- ChIP binding data: antibodies are used to identify areas of the genome physically bound by a particular TF.
- Promoter element data: binding sites for a TF are discovered by a sequence search algorithm.

Gene Expression Data
- Gene expression: a measure of whether a gene is turned on or turned off at a specific time.
- Genes with similar expression across time or in different conditions may be coregulated.
- We detect groups of genes that have correlated gene expression across many conditions.

ChIP Binding Data
- Chromatin immunoprecipitation (ChIP) experiments: antibodies are used to pull out the parts of the genomic sequence that are physically bound to a particular TF.
- Genes in close proximity to a TF binding site are possibly regulatory targets of that TF.

Promoter Element Data
- Some promoter elements are known: the set of sequence binding sites recognized by a particular TF.
- Promoter elements are highly conserved but not identical, so they are summarized by a matrix of nucleotide probabilities (rows = nucleotides, columns = positions in the binding site):

       1     2     3     4     5     6
  A  0.05  0.02  0.85  0.02  0.21  0.06
  C  0.04  0.02  0.03  0.93  0.05  0.06
  G  0.06  0.94  0.06  0.04  0.70  0.11
  T  0.85  0.02  0.06  0.01  0.04  0.77

- Example genomic sequences to be scanned:

  atgacgtctagcatcgaaatcgacgacgatcgacgactagctactctacgatcg
  aaaacatcgattgacgtttggtcgtaactttggcacgatcagcgatcgatcact
  aacagctatgacgtcgaaatcgaacatcgagacggacggcaacgtctacgatcg
  aaaacatcagctagcagcactagctaggattgacgtttggtcgtaactttggct
  aattatgctacgtgacgtacacgtacgtgacggactaagtcagctagcgtagct
  aattatgctacgtacgcggctcgctacactgacggagcatcaggtatttgacgt
  aaaaggcatcagctagcagcactagctaggtgacctggtcgtaactttggct
  aattatgctacgtggcgtacacgtacgtgacggactaagtcagctagcgtagct

- The matrix is used to scan genomic sequences for putative promoter elements, which are then used to predict regulated genes.
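The scanning step can be illustrated with a minimal sketch. It scores every 6-base window of a sequence by its log-odds under the matrix above against a background model. The uniform background (0.25 per nucleotide), the score threshold of 4.0, and the function names `window_score` and `scan` are assumptions made for illustration; the slides do not say which search algorithm or cutoff was actually used.

```python
import math

# Nucleotide probabilities copied from the slide (one list entry per motif position).
PWM = {
    "A": [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],
    "C": [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],
    "G": [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],
    "T": [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],
}
BACKGROUND = 0.25  # assumption: uniform background nucleotide frequency


def window_score(window):
    """Log-odds score of one window (same width as the matrix) vs. the background."""
    return sum(
        math.log(PWM[base][pos] / BACKGROUND)
        for pos, base in enumerate(window.upper())
    )


def scan(sequence, threshold=4.0):
    """Return (start, window, score) for each window scoring above the threshold.

    The threshold is a hypothetical cutoff chosen for this example only.
    """
    width = len(PWM["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = window_score(window)
        if score > threshold:
            hits.append((start, window, score))
    return hits


# First example sequence from the slide.
hits = scan("atgacgtctagcatcgaaatcgacgacgatcgacgactagctactctacgatcg")
```

The most probable base in each column spells TGACGT, so windows matching that string receive the highest possible score. In the model described later, such scores would feed into the promoter element probabilities rather than being thresholded directly.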

Problem with Standard Methods
- These data sources, when used by themselves, provide only partial information about regulation:
  - Expression data gives evidence only of co-expression, not necessarily co-regulation.
  - ChIP binding data gives evidence only of physical TF binding, and binding is not necessarily functional.
  - Promoter element data gives only the possibility of a TF binding site, and the site may not be functional.
- We need a principled approach to combine these complementary but heterogeneous sources of information.

Available Data
- Data: expression, ChIP binding, and promoter element data for 106 TFs in yeast.
- Gene expression data across $T$ different experiments:
  - $g_{it}$ = log-expression of gene $i$ in experiment $t$
  - $f_{jt}$ = log-expression of TF $j$ in experiment $t$
- ChIP binding data for each gene $i$ and TF $j$:
  - $b_{ij}$ = probability that TF $j$ physically binds near gene $i$
- Promoter element data for each gene $i$ and TF $j$:
  - $m_{ij}$ = probability that gene $i$ has a binding site for TF $j$

Regulatory Indicators
- The regulatory network is formulated as a set of unknown indicators: $C_{ij} = 1$ if gene $i$ is actually regulated by TF $j$, and $C_{ij} = 0$ otherwise.
- These $C_{ij}$ variables give the edges that connect TFs and their target genes on a regulatory graph.
- $C$ will be inferred using a Bayesian hierarchical model, a principled framework for combining heterogeneous data sources by using informed prior distributions.

Likelihood Model
- The first model level involves target gene expression $g_{it}$ as a linear function of TF expression:
  $$g_{it} = \alpha_i + \sum_j \beta_j C_{ij} f_{jt} + \epsilon_{it}$$
- The error term is normally distributed: $\epsilon_{it} \sim \mathrm{Normal}(0, \sigma^2)$.
- The regulation indicators $C_{ij}$ perform variable selection: only TFs $j$ with $C_{ij} = 1$ are involved in the expression of target gene $i$.
- Biological reality: the simultaneous action of multiple TFs is often needed to change target gene expression.

Likelihood Model II
- We allow for synergistic relationships between pairs of TFs by also including interaction terms in our model:
  $$g_{it} = \alpha_i + \sum_j \beta_j C_{ij} f_{jt} + \sum_{j \neq k} \gamma_{jk} C_{ij} C_{ik} f_{jt} f_{kt} + \epsilon_{it}$$
- The sign of each interaction coefficient $\gamma_{jk}$ is unrestricted, so we allow for both synergistic and antagonistic relationships between pairs of TFs.
- Non-informative priors are used for the parameters $\alpha, \beta, \gamma, \sigma^2$.
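As a rough illustration of how the indicators act as a variable selector, the sketch below computes the mean of $g_{it}$ under the interaction model for one gene and one experiment. The function name `expected_expression`, the nested-list layout of `gamma`, and the choice to sum over all ordered pairs $j \neq k$ are assumptions made for this example.

```python
def expected_expression(alpha_i, beta, gamma, c_i, f_t):
    """Mean of g_it for one gene i and one experiment t.

    alpha_i : intercept for gene i
    beta    : list of main-effect coefficients, one per TF
    gamma   : square nested list of interaction coefficients gamma[j][k]
    c_i     : list of 0/1 regulation indicators C_ij for gene i
    f_t     : list of TF log-expressions f_jt in experiment t
    """
    n_tf = len(beta)
    mean = alpha_i
    # Main effects: a TF contributes only when its indicator is 1.
    for j in range(n_tf):
        mean += beta[j] * c_i[j] * f_t[j]
    # Pairwise interactions: a pair contributes only when both indicators are 1.
    # Assumption: the sum runs over ordered pairs with j != k.
    for j in range(n_tf):
        for k in range(n_tf):
            if j != k:
                mean += gamma[j][k] * c_i[j] * c_i[k] * f_t[j] * f_t[k]
    return mean


# Hypothetical numbers: three TFs, one interaction between TF 0 and TF 1.
beta = [0.5, -0.2, 0.3]
gamma = [[0.0, 0.25, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
f_t = [2.0, 1.0, 4.0]
both_on = expected_expression(1.0, beta, gamma, [1, 1, 0], f_t)
one_on = expected_expression(1.0, beta, gamma, [1, 0, 0], f_t)
```

Switching the second indicator off removes both that TF's main effect and its interaction term, which is the sense in which $C_{ij}$ selects variables.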

Informed Prior Distribution
- The second model level is an informed prior distribution for our unknown regulation indicators $C_{ij}$ that involves both ChIP binding data $b_{ij}$ and promoter element data $m_{ij}$:
  $$p(C_{ij} \mid m_{ij}, b_{ij}) \propto \left[ b_{ij}^{C_{ij}} (1 - b_{ij})^{1 - C_{ij}} \right]^{w_j} \left[ m_{ij}^{C_{ij}} (1 - m_{ij})^{1 - C_{ij}} \right]^{1 - w_j}$$
- The weight $w_j$ balances prior ChIP binding information $b_{ij}$ against prior promoter element information $m_{ij}$.
- The weights $w_j$ are TF-specific and reflect the relative quality of ChIP binding data vs. promoter element data for TF $j$.
- Each $w_j$ is treated as an unknown variable with a uniform prior.
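To make the role of the weight concrete, here is a small sketch that normalizes the prior over the two possible values of $C_{ij}$. The function name `prior_prob_regulated` is hypothetical, and the sketch assumes $b_{ij}$ and $m_{ij}$ lie strictly between 0 and 1 so that the normalizing sum is positive.

```python
def prior_prob_regulated(b_ij, m_ij, w_j):
    """Normalized prior probability that C_ij = 1.

    Evaluates the weighted product from the slide at C_ij = 1 and C_ij = 0
    and divides by their sum. Assumes 0 < b_ij < 1 and 0 < m_ij < 1.
    """
    on = (b_ij ** w_j) * (m_ij ** (1.0 - w_j))
    off = ((1.0 - b_ij) ** w_j) * ((1.0 - m_ij) ** (1.0 - w_j))
    return on / (on + off)


# Hypothetical inputs: strong ChIP evidence, weak promoter element evidence.
chip_only = prior_prob_regulated(0.8, 0.1, 1.0)    # weight entirely on ChIP
motif_only = prior_prob_regulated(0.8, 0.1, 0.0)   # weight entirely on motif
balanced = prior_prob_regulated(0.8, 0.2, 0.5)     # equal weight
```

With $w_j = 1$ the prior reduces to the ChIP probability, with $w_j = 0$ it reduces to the promoter element probability, and intermediate weights give a geometric compromise between the two.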

Network Sparsity
- The probabilities from both ChIP binding data and promoter element data are mostly near zero.
[Figure: density of ChIP binding probabilities and sequence motif probabilities; x-axis: values of b or m (0.0 to 1.0), y-axis: density.]
- The prior implication is that the network is quite sparse: each TF regulates only a small proportion of genes.

Implementation
Draws from the joint posterior distribution are obtained using a Gibbs sampling strategy:
1. Sample $\alpha, \beta, \gamma, \sigma^2$ given $C, w, g, f, b, m$: a standard random effects model.
2. Sample each $C_{ij}$ given $\alpha, \beta, \gamma, \sigma^2, w, g, f, b, m$: an easy 0-1 posterior probability calculation for each $C_{ij}$.
3. Sample each $w_j$ given $C, \alpha, \beta, \gamma, \sigma^2, g, f, b, m$: a grid sampler over the (0,1) range.
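Steps 2 and 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the log-likelihoods of the expression data under $C_{ij} = 1$ and $C_{ij} = 0$ are taken as given inputs, the prior on each $C_{ij}$ is normalized over its two values before being used to update $w_j$, and the 99-point grid and all function names are hypothetical.

```python
import math
import random


def conditional_prob_c(log_lik_on, log_lik_off, b_ij, m_ij, w_j):
    """Step 2: conditional posterior probability that C_ij = 1.

    log_lik_on / log_lik_off are the log-likelihoods of gene i's expression
    data with C_ij set to 1 or 0 (computed elsewhere from the linear model).
    """
    log_on = log_lik_on + w_j * math.log(b_ij) + (1 - w_j) * math.log(m_ij)
    log_off = (log_lik_off + w_j * math.log(1 - b_ij)
               + (1 - w_j) * math.log(1 - m_ij))
    diff = log_off - log_on
    if diff > 700.0:  # avoid overflow in exp; probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def w_grid_weights(c_col, b_col, m_col, grid_size=99):
    """Step 3: unnormalized posterior weights for w_j on a grid in (0, 1).

    Assumes a uniform prior on w_j and a prior on each C_ij that has been
    normalized over its two values (an assumption of this sketch).
    """
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    log_post = []
    for w in grid:
        total = 0.0
        for c, b, m in zip(c_col, b_col, m_col):
            log_on = w * math.log(b) + (1 - w) * math.log(m)
            log_off = w * math.log(1 - b) + (1 - w) * math.log(1 - m)
            top = max(log_on, log_off)
            log_norm = top + math.log(
                math.exp(log_on - top) + math.exp(log_off - top))
            total += (log_on if c == 1 else log_off) - log_norm
        log_post.append(total)
    best = max(log_post)
    weights = [math.exp(lp - best) for lp in log_post]
    return grid, weights


def sample_w(c_col, b_col, m_col, rng):
    """Draw one value of w_j from the grid, proportional to the weights."""
    grid, weights = w_grid_weights(c_col, b_col, m_col)
    return rng.choices(grid, weights=weights, k=1)[0]


# Hypothetical column for one TF: ChIP agrees with C, the motif is uninformative.
c_col = [1, 0, 1, 0]
b_col = [0.9, 0.1, 0.9, 0.1]
m_col = [0.5, 0.5, 0.5, 0.5]
grid, weights = w_grid_weights(c_col, b_col, m_col)
draw = sample_w(c_col, b_col, m_col, random.Random(0))
```

In this toy column the ChIP probabilities track the indicators while the motif probabilities carry no information, so the grid weights increase toward $w_j = 1$. Because $w_j$ enters the model only through the prior on $C$, its conditional distribution in this sketch does not involve the expression data.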

Inference
- Inference 1: posterior samples of $C_{ij}$ are used to infer target genes for each TF $j$. Gene $i$ is called a target of TF $j$ if $P(C_{ij} = 1 \mid Y) > 0.5$.
- Inference 2: posterior samples of the interaction coefficients $\gamma_{jk}$ are used to find TF pairs with a significant relationship.
- Inference 3: posterior samples of the weights $w_j$ are used to infer the quality of ChIP vs. promoter element data for different TFs.
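Inference 1 amounts to averaging the sampled indicators and applying the 0.5 cutoff. A minimal sketch, assuming the retained Gibbs draws are stored as a list of 0/1 matrices (genes by TFs); the function names and the toy draws are hypothetical.

```python
def posterior_inclusion(draws):
    """Estimate P(C_ij = 1 | Y) as the fraction of draws in which C_ij = 1.

    draws : list of 0/1 matrices (nested lists, genes x TFs), one per
            retained Gibbs iteration.
    """
    n_draws = len(draws)
    n_genes = len(draws[0])
    n_tfs = len(draws[0][0])
    return [
        [sum(draw[i][j] for draw in draws) / n_draws for j in range(n_tfs)]
        for i in range(n_genes)
    ]


def call_targets(probs, cutoff=0.5):
    """Return (gene, TF) index pairs whose posterior probability exceeds the cutoff."""
    return [
        (i, j)
        for i, row in enumerate(probs)
        for j, p in enumerate(row)
        if p > cutoff
    ]


# Toy example: four draws for two genes and two TFs.
draws = [
    [[1, 0], [0, 1]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
    [[0, 0], [0, 1]],
]
probs = posterior_inclusion(draws)
targets = call_targets(probs)
```

Here gene 0 is called a target of TF 0 and gene 1 a target of TF 1, since each of those indicators equals 1 in three of the four draws.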

Comparison of Predictions
- The primary goal is prediction of target genes based on the estimated posterior probability $P(C_{ij} = 1 \mid Y) > 0.5$.
- We can compare to several other current approaches:
  1. MA-Networker: Gao et al. 2004
  2. GRAM: Bar-Joseph et al. 2003
  3. ReMoDiscovery: Lemmens et al. 2006
- Two external measures are used for validation:
  1. similarity of MIPS functions between target genes
  2. response of target genes to TF knockout

MIPS Functional Categories
- Each gene in yeast has an assigned MIPS functional category from the Munich Information Center for Protein Sequences (MIPS).
- Gene targets with similar functions are more likely to be in the same biological pathway, which validates the inference that they are regulated by a common transcription factor.
- We calculated the fraction of inferred target genes that shared similar functional categories for each TF, and then averaged across all TFs.

Fraction of Target Genes with Similar Functional Category
[Figure: bar chart of the fraction of target genes with similar functional category (axis 0.0 to 0.5). Our Model: All 3, Exp+ChIP, Exp Only. Previous Methods: MA-Networker, GRAM, ReMoDiscovery. Thresholded Data: Binding, Expression.]
- Gene targets from our full model have slightly higher functional similarity than those from other methods.
- All integration methods do better than a single data source.

Knockout Experiments
- Knockout experiments are the gold standard for the regulatory activity of individual TFs.
- A knockout strain of yeast is created with a specific TF removed from the genome.
- Gene targets of the knocked-out TF should show a large response between wild-type and knockout strains.
- We calculated the t-statistic of the response to TF knockout for inferred target genes in the 4 available knockout experiments.
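The slides do not state which form of t-statistic was used. As one plausible reading, the sketch below computes a one-sample t-statistic on the per-gene differences in log-expression between wild-type and knockout strains across a TF's inferred targets. The function name and the example numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def knockout_t_statistic(wild_type, knockout):
    """One-sample t-statistic on per-gene differences (wild-type minus knockout).

    wild_type, knockout : log-expression of the same inferred target genes
                          in the two strains, in matching order.
    Assumption: this paired form is an illustration, not necessarily the
    statistic behind the reported results.
    """
    diffs = [w - k for w, k in zip(wild_type, knockout)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Hypothetical log-expression values for four inferred target genes.
t_stat = knockout_t_statistic([2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 3.0, 2.0])
```

A larger absolute value indicates that the inferred targets respond more consistently to removal of the TF.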

T-statistic for Knockout Response
[Figure: four bar charts, one per knockout experiment (GCN4, SWI4, YAP1, SWI5). Bar values as labeled in the figure:]

  Group             Method     GCN4   SWI4   YAP1   SWI5
  Our Model         All 3      8.13   5.56   3.77   3.24
  Our Model         Exp+ChIP   8.38   5.52   3.3    3.95
  Our Model         Exp        4.2    1.45   0.02   1.75
  Previous Methods  MANet      7.3    4.79   2.11   3.04
  Previous Methods  GRAM       7.21   4.4    1.3    2.5
  Previous Methods  ReMo       3.81   0.35   0.65   0.58
  Thresholded Data  Bind       3.73   1.3    1.67   1.83
  Thresholded Data  Exp        0.1    2.36   0.87   0.1

- Our gene targets show greater response to TF knockout across all 4 knockout experiments.

Inference for Weight Variables
[Figure: posterior distributions of the $w_j$ variables (axis 0.2 to 1.0) for the same 39 TFs: ABF1, ACE2, BAS1, CAD1, CBF1, FKH1, FKH2, GAL4, GCN4, GCR1, GCR2, HAP2, HAP3, HAP4, HSF1, INO2, LEU3, MBP1, MCM1, MET31, MSN4, NDD1, PDR1, PHO4, PUT3, RAP1, RCS1, REB1, RLM11, RME1, ROX1, SKN7, SMP1, STB1, STE12, SWI4, SWI5, SWI6, YAP1.]
- The posterior distributions are centered substantially higher than 0.5, which suggests that ChIP binding data is generally superior to promoter element data.

Interactions between TFs
- Many recent papers have focused on combinatorial relationships between TFs: which pairs of TFs bind to the same set of target genes?
- We can address this question by examining the posterior distribution of each interaction effect $\gamma_{jk}$.
- Positive values of $\gamma_{jk}$ suggest a synergistic relationship, whereas negative values suggest an antagonistic relationship.
- In our yeast application, we found that 84 TF pairs have significant $\gamma_{jk}$ coefficients.

Interactions between TFs (continued)
- Many predicted interactions are known and involved in several important pathways.
[Figure: network graph in which nodes are TFs and edges are significant interactions.]

Mouse Application
- We also applied our model to one mouse TF, C/EBP-β, which has all three data types available.
- We identified 14 of 16 validated C/EBP-β targets; more targets were missed when using only a single data source.
- Our model also potentially reduces false positives: we predict 38 target genes, compared to 72 predicted from expression data alone or 779 from ChIP data alone.
- The estimated weight is $w = 0.92$, favoring ChIP binding data over promoter element data: promoter element data is useful in some instances, but generally has less discriminative power than ChIP data.

Summary
- Combining multiple data sources (expression, ChIP binding, and promoter element data) leads to improved predictions.
- A Bayesian hierarchical model is a natural framework for integrating heterogeneous data sources.
- Most Bayesian variable selection approaches use non-informative priors for the selection indicators.
- Our approach uses informed priors for the selection indicators, based on additional data sources.

Summary II
- Fully probabilistic approach: no reliance on pre-clustering of data and no dependence on arbitrary parameter cutoffs.
- Flexibility for genes to belong to multiple regulatory clusters and for pairs of transcription factors to interact.
- The variable weight methodology achieves an appropriate balance of priors: we confirm the common belief that promoter element data is less reliable, but useful in some cases.

References
- Chen, G., Jensen, S.T. and Stoeckert, C. (2007). "Clustering of Genes into Regulons using Integrated Modeling." Genome Biology 8: R4.
- Jensen, S.T., Chen, G. and Stoeckert, C. (2007). "Bayesian Variable Selection and Data Integration for Biological Regulatory Networks." Annals of Applied Statistics 1: 612-633.