Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Shane T. Jensen
Department of Statistics, The Wharton School, University of Pennsylvania
stjensen@wharton.upenn.edu

Gary Chen and Christian Stoeckert, Jr.
Department of Bioengineering and Department of Genetics, University of Pennsylvania

March 5, 2008
Motivation

Genes are long sequences of DNA that are transcribed into RNA and eventually translated into protein.
Near-identical genetic material can give rise to many different cell types and species.
A critical aspect of cellular function is how genes are regulated and which genes are regulated together.
Gene Regulatory Networks

Genes are regulated by transcription factor (TF) proteins that bind directly to the DNA sequence near a gene.
The bound protein affects the amount of transcription, thereby affecting the amount of protein produced.
The collection of TFs and their target genes is often called the gene regulatory network.
Goal: elucidate the regulatory network, i.e. which genes are targeted for regulation by a particular TF?
Different Data Types

Gene expression data: microarray chips used to measure the amount of mRNA present for each gene across many conditions.
ChIP binding data: antibodies used to identify areas of the genome physically bound by a particular TF.
Promoter element data: binding sites for a TF discovered by a sequence search algorithm.
Gene Expression Data

Gene expression: a measure of whether a gene is turned on or turned off at a specific time.
Genes with similar expression across time or in different conditions may be coregulated.
Detect groups of genes that have correlated gene expression across many conditions.
ChIP Binding Data

Chromatin immunoprecipitation (ChIP) experiments.
Antibodies are used to pull out parts of the genomic sequence that are physically bound to a particular TF.
Genes in close proximity to a TF binding site are possibly regulatory targets of that TF.
Promoter Element Data

Some known promoter elements: the set of sequence binding sites recognized by a particular TF.
Promoter elements are highly conserved but not identical, so they are summarized by a matrix of base probabilities at each of the six motif positions:

A  0.05  0.02  0.85  0.02  0.21  0.06
C  0.04  0.02  0.03  0.93  0.05  0.06
G  0.06  0.94  0.06  0.04  0.70  0.11
T  0.85  0.02  0.06  0.01  0.04  0.77

atgacgtctagcatcgaaatcgacgacgatcgacgactagctactctacgatcg
aaaacatcgattgacgtttggtcgtaactttggcacgatcagcgatcgatcact
aacagctatgacgtcgaaatcgaacatcgagacggacggcaacgtctacgatcg
aaaacatcagctagcagcactagctaggattgacgtttggtcgtaactttggct
aattatgctacgtgacgtacacgtacgtgacggactaagtcagctagcgtagct
aattatgctacgtacgcggctcgctacactgacggagcatcaggtatttgacgt
aaaaggcatcagctagcagcactagctaggtgacctggtcgtaactttggct
aattatgctacgtggcgtacacgtacgtgacggactaagtcagctagcgtagct

The matrix is used to scan genomic sequences for putative promoter elements, which are then used to predict regulated genes.
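The scanning step can be illustrated with a minimal sketch. The probabilities are the ones shown on the slide; the uniform background frequency, the log-odds scoring rule, the threshold of 4.0, and the function names are assumptions made for illustration, not the search algorithm used by the authors.

```python
import math

# Position-specific base probabilities from the slide; one entry per motif position.
PWM = {
    "a": [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],
    "c": [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],
    "g": [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],
    "t": [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],
}
WIDTH = 6
BACKGROUND = 0.25  # assumption: uniform background base frequencies


def log_odds(window):
    """Log-odds score of a 6-base window: motif model versus background."""
    return sum(math.log(PWM[base][pos] / BACKGROUND) for pos, base in enumerate(window))


def scan(sequence, threshold=4.0):
    """Return (start, window, score) for every window scoring at least `threshold`.

    The threshold is a hypothetical cutoff chosen for this example.
    """
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        score = log_odds(window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# First example sequence from the slide.
first_sequence = "atgacgtctagcatcgaaatcgacgacgatcgacgactagctactctacgatcg"
for start, window, score in scan(first_sequence):
    print(start, window, round(score, 2))
```

Windows that match the consensus tgacgt receive the highest possible score under this matrix; weaker matches fall below the cutoff and are ignored.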
Problem with Standard Methods

These data sources, when used by themselves, provide only partial information about regulation:
  Expression data gives only evidence of co-expression, not necessarily co-regulation.
  ChIP binding data gives only evidence of physical TF binding, but binding is not necessarily functional.
  Promoter element data gives only the possibility of a TF binding site, but the site may not be functional.
A principled approach is needed to combine these complementary, but heterogeneous, sources of information.
Available Data

Data: expression, ChIP binding, and promoter element data for 106 TFs in yeast.
Gene expression data across $T$ different experiments:
  $g_{it}$ = log-expression of gene $i$ in experiment $t$
  $f_{jt}$ = log-expression of TF $j$ in experiment $t$
ChIP binding data for each gene $i$ and TF $j$:
  $b_{ij}$ = probability that TF $j$ physically binds near gene $i$
Promoter element data for each gene $i$ and TF $j$:
  $m_{ij}$ = probability that gene $i$ has a binding site for TF $j$
Regulatory Indicators

The regulatory network is formulated as a set of unknown indicators:
  $C_{ij} = 1$ if gene $i$ is actually regulated by TF $j$
  $C_{ij} = 0$ otherwise
These $C_{ij}$ variables give the edges that connect TFs and their target genes on a regulatory graph.
$C$ will be inferred using a Bayesian hierarchical model: a principled framework for combining heterogeneous data sources by using informed prior distributions.
Likelihood Model

The first model level expresses target gene expression $g_{it}$ as a linear function of TF expression:

$$g_{it} = \alpha_i + \sum_j \beta_j C_{ij} f_{jt} + \epsilon_{it}$$

The error term is normally distributed: $\epsilon_{it} \sim \mathrm{Normal}(0, \sigma^2)$.
The regulation indicators $C_{ij}$ perform variable selection: only TFs $j$ with $C_{ij} = 1$ are involved in the expression of target gene $i$.
Biological reality: often the simultaneous action of multiple TFs is needed to change target gene expression.
Likelihood Model II

We allow for synergistic relationships between pairs of TFs by also including interaction terms in our model:

$$g_{it} = \alpha_i + \sum_j \beta_j C_{ij} f_{jt} + \sum_{j \neq k} \gamma_{jk} C_{ij} C_{ik} f_{jt} f_{kt} + \epsilon_{it}$$

The sign of each interaction coefficient $\gamma_{jk}$ is unrestricted, so we allow for both synergistic and antagonistic relationships between pairs of TFs.
Non-informative priors are used for the parameters $\alpha, \beta, \gamma, \sigma^2$.
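As a minimal sketch of how the mean of this model could be evaluated for one gene, the function below computes the main-effect and pairwise interaction terms with NumPy. The function name, the array layout, and the choice to sum over all pairs with j different from k are assumptions for illustration; the error term is omitted.

```python
import numpy as np


def expected_expression(alpha_i, beta, gamma, c_i, f):
    """Mean log-expression of gene i in each of T experiments.

    alpha_i : float, gene-specific intercept
    beta    : (J,) main-effect coefficients, one per TF
    gamma   : (J, J) interaction coefficients; the diagonal is ignored
    c_i     : (J,) 0/1 regulation indicators for gene i
    f       : (J, T) log-expression of each TF in each experiment
    """
    active = c_i[:, None] * f                      # rows of unselected TFs become zero
    main = beta @ active                           # sum_j beta_j C_ij f_jt
    off_diag = gamma * (1.0 - np.eye(len(beta)))   # keep only j != k terms
    interaction = np.einsum("jk,jt,kt->t", off_diag, active, active)
    return alpha_i + main + interaction


# Hypothetical toy values: 3 TFs, 2 experiments, TFs 0 and 2 selected.
beta = np.array([1.0, 2.0, 3.0])
gamma = np.zeros((3, 3))
gamma[0, 2] = 0.5
c_i = np.array([1, 0, 1])
f = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(expected_expression(0.1, beta, gamma, c_i, f))
```

Because the indicators multiply the TF expression values, any TF with $C_{ij} = 0$ drops out of both the main-effect and the interaction sums, which is the variable selection behaviour described on the slide.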
Informed Prior Distribution

The second model level is an informed prior distribution for the unknown regulation indicators $C_{ij}$ that involves both the ChIP binding data $b_{ij}$ and the promoter element data $m_{ij}$:

$$p(C_{ij} \mid m_{ij}, b_{ij}) \propto \left[ b_{ij}^{C_{ij}} (1 - b_{ij})^{1 - C_{ij}} \right]^{w_j} \left[ m_{ij}^{C_{ij}} (1 - m_{ij})^{1 - C_{ij}} \right]^{1 - w_j}$$

The weight $w_j$ balances prior ChIP binding information $b_{ij}$ against prior promoter element information $m_{ij}$.
The weights $w_j$ are TF-specific and reflect the relative quality of ChIP binding data vs. promoter element data for TF $j$.
Each $w_j$ is treated as an unknown variable with a uniform prior.
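A small sketch shows how the weight moves the prior between the two data sources. It normalizes the expression above over the two values of $C_{ij}$; the function name and the input values are hypothetical.

```python
def prior_probability(b_ij, m_ij, w_j):
    """Normalized P(C_ij = 1 | b_ij, m_ij, w_j) under the weighted prior."""
    on = (b_ij ** w_j) * (m_ij ** (1.0 - w_j))                   # C_ij = 1 term
    off = ((1.0 - b_ij) ** w_j) * ((1.0 - m_ij) ** (1.0 - w_j))  # C_ij = 0 term
    return on / (on + off)


# Hypothetical inputs: strong ChIP evidence, weak promoter-element evidence.
for w in (0.0, 0.5, 1.0):
    print(w, round(prior_probability(0.9, 0.1, w), 3))
```

With $w_j = 1$ the prior probability equals the ChIP probability $b_{ij}$, with $w_j = 0$ it equals the promoter element probability $m_{ij}$, and intermediate weights give a geometric compromise between the two.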
Network Sparsity

The probabilities from both ChIP binding data and promoter element data are mostly near zero.
[Figure: density curves of the ChIP binding probabilities and the sequence motif probabilities; x-axis: values of b or m, from 0.0 to 1.0; y-axis: density, from 0 to 40.]
Prior implication: the network is quite sparse, with each TF regulating only a small proportion of genes.
Implementation

Draws from the joint posterior distribution are obtained with a Gibbs sampling strategy:
1. Sample $\alpha, \beta, \gamma, \sigma^2$ given $C, w, g, f, b, m$: a standard random effects model.
2. Sample each $C_{ij}$ given $\alpha, \beta, \gamma, \sigma^2, w, g, f, b, m$: an easy 0-1 posterior probability calculation for each $C_{ij}$.
3. Sample each $w_j$ given $C, \alpha, \beta, \gamma, \sigma^2, g, f, b, m$: a grid sampler over the (0,1) range.
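Steps 2 and 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the prior for each indicator is assumed to be normalized over its two values, the log-likelihoods of gene $i$'s expression with $C_{ij}$ set to 1 and to 0 are assumed to be computed elsewhere, all $b$ and $m$ values are assumed to lie strictly between 0 and 1, and the function names and the 99-point grid are hypothetical.

```python
import math
import numpy as np


def prior_prob_one(b, m, w):
    """Normalized prior P(C = 1) implied by the weighted ChIP/promoter prior."""
    on = b ** w * m ** (1.0 - w)
    off = (1.0 - b) ** w * (1.0 - m) ** (1.0 - w)
    return on / (on + off)


def sample_indicator(loglik_one, loglik_zero, b, m, w, rng):
    """Step 2: draw C_ij from its 0-1 conditional posterior.

    loglik_one / loglik_zero: log-likelihood of gene i's expression with
    C_ij = 1 and C_ij = 0, all other quantities held fixed.
    """
    prior = prior_prob_one(b, m, w)
    log_odds = (loglik_one - loglik_zero) + math.log(prior) - math.log1p(-prior)
    # Numerically stable logistic transform of the posterior log-odds.
    if log_odds >= 0:
        prob_one = 1.0 / (1.0 + math.exp(-log_odds))
    else:
        prob_one = math.exp(log_odds) / (1.0 + math.exp(log_odds))
    return int(rng.random() < prob_one), prob_one


def sample_weight(c_j, b_j, m_j, rng, grid_size=99):
    """Step 3: draw w_j from a discrete grid approximation on (0, 1).

    With a uniform prior on w_j, the conditional is proportional to the
    product over genes of the prior probability of each observed C_ij.
    """
    grid = np.linspace(0.01, 0.99, grid_size)
    p1 = prior_prob_one(b_j[None, :], m_j[None, :], grid[:, None])  # (grid, genes)
    log_post = (c_j * np.log(p1) + (1 - c_j) * np.log1p(-p1)).sum(axis=1)
    probs = np.exp(log_post - log_post.max())  # subtract max to avoid underflow
    probs /= probs.sum()
    return rng.choice(grid, p=probs), grid, probs


rng = np.random.default_rng(0)
c_j = np.array([1, 1, 0, 0])
b_j = np.array([0.9, 0.9, 0.1, 0.1])   # hypothetical: ChIP agrees with C
m_j = np.array([0.5, 0.5, 0.5, 0.5])   # hypothetical: promoter data uninformative
w_draw, grid, probs = sample_weight(c_j, b_j, m_j, rng)
print(w_draw, grid[np.argmax(probs)])
```

In this toy example the ChIP probabilities match the indicators while the promoter element probabilities carry no information, so the grid posterior for $w_j$ puts its mode at the upper end of the range.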
Inference

Inference 1: posterior samples of $C_{ij}$ are used to infer target genes for each TF $j$; gene $i$ is a target of TF $j$ if $P(C_{ij} = 1 \mid Y) > 0.5$.
Inference 2: posterior samples of the interaction coefficients $\gamma_{jk}$ are used to find TF pairs with a significant relationship.
Inference 3: posterior samples of the weights $w_j$ are used to infer the quality of ChIP vs. promoter element data for different TFs.
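Inference 1 amounts to averaging the sampled indicators and thresholding at 0.5. A minimal sketch, assuming the Gibbs draws are stored in an array of shape (draws, genes, TFs); the function name and the toy draws are hypothetical.

```python
import numpy as np


def infer_targets(c_samples, threshold=0.5):
    """Estimate P(C_ij = 1 | Y) by the sample mean and flag inferred targets."""
    posterior_prob = c_samples.mean(axis=0)   # (genes, TFs)
    return posterior_prob, posterior_prob > threshold


# Toy draws: 4 Gibbs iterations, 2 genes, 1 TF.
c_samples = np.array([[[1], [0]], [[1], [1]], [[1], [0]], [[0], [0]]])
posterior_prob, is_target = infer_targets(c_samples)
print(posterior_prob.ravel(), is_target.ravel())
```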
Comparison of Predictions

The primary goal is prediction of target genes based on the estimated posterior probability $P(C_{ij} = 1 \mid Y) > 0.5$.
We can compare to several other current approaches:
1. MA-Networker: Gao et al. (2004)
2. GRAM: Bar-Joseph et al. (2003)
3. ReMoDiscovery: Lemmens et al. (2006)
Two external measures are used for validation:
1. Similarity of MIPS functions between target genes
2. Response of target genes to TF knockout
MIPS Functional Categories

Each gene in yeast has an assigned functional category from the Munich Information Center for Protein Sequences (MIPS).
Gene targets with similar functions are more likely to be in the same biological pathway, which supports the inference that they are regulated by a common transcription factor.
For each TF, we calculated the fraction of inferred target genes that shared similar functional categories, and then averaged across all TFs.
Fraction of Target Genes with Similar Functional Category

[Figure: bar chart of the fraction of target genes with similar functional category (axis from 0.0 to 0.5). Bars are grouped as Our Model (All 3, Exp+ChIP, Exp Only), Previous Methods (MA-Networker, GRAM, ReMoDiscovery), and Thresholded Data (Binding, Expression).]
Gene targets from our full model have slightly higher functional similarity than those from other methods.
All integration methods perform better than a single data source.
Knockout Experiments

Knockout experiments are the gold standard for assessing the regulatory activity of individual TFs.
A knockout strain of yeast is created with a specific TF removed from the genome.
Gene targets of the knocked-out TF should show a large difference in expression between the wild-type and knockout strains.
We calculated the t-statistic of the response to TF knockout for inferred target genes in the 4 available knockout experiments.
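The slide does not spell out which t-statistic was used. As one plausible reading, the sketch below computes a one-sample t-statistic on the per-gene differences in log-expression between wild-type and knockout strains, restricted to the inferred targets of the knocked-out TF. The function name, this choice of statistic, and the toy numbers are all assumptions.

```python
import math
import statistics


def knockout_t_statistic(wild_type, knockout):
    """One-sample t-statistic of the wild-type minus knockout differences.

    wild_type, knockout: log-expression values of the inferred target genes,
    in the same gene order (at least two genes are required).
    """
    diffs = [w - k for w, k in zip(wild_type, knockout)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Hypothetical log-expression values for four inferred targets.
print(knockout_t_statistic([2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0]))
```

A large absolute value indicates that the inferred targets shift consistently when the TF is removed, which is the behaviour expected of true regulatory targets.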
T-statistic for Knockout Response

[Figure: four bar charts, one per knockout experiment (GCN4, SWI4, YAP1, SWI5), showing the t-statistic of knockout response for each method. The bar values, read in plotted order, are tabulated here.]

Group             Method   GCN4  SWI4  YAP1  SWI5
Our Model         All 3    8.13  5.56  3.77  3.24
Our Model         ExpChIP  8.38  5.52  3.3   3.95
Our Model         Exp      4.2   1.45  0.02  1.75
Previous Methods  MANet    7.3   4.79  2.11  3.04
Previous Methods  GRAM     7.21  4.4   1.3   2.5
Previous Methods  ReMo     3.81  0.35  0.65  0.58
Thresholded Data  Bind     3.73  1.3   1.67  1.83
Thresholded Data  Exp      0.1   2.36  0.87  0.1

Our gene targets show greater response to TF knockout across all 4 knockout experiments.
Inference for Weight Variables

Posterior distributions of the $w_j$ variables for the same 39 TFs:
[Figure: posterior distribution of $w_j$ for each TF, on an axis from 0.2 to 1.0. TFs shown: ABF1, ACE2, BAS1, CAD1, CBF1, FKH1, FKH2, GAL4, GCN4, GCR1, GCR2, HAP2, HAP3, HAP4, HSF1, INO2, LEU3, MBP1, MCM1, MET31, MSN4, NDD1, PDR1, PHO4, PUT3, RAP1, RCS1, REB1, RLM11, RME1, ROX1, SKN7, SMP1, STB1, STE12, SWI4, SWI5, SWI6, YAP1.]
The distributions are centered substantially higher than 0.5, which suggests that ChIP binding data is generally superior to promoter element data.
Interactions between TFs

Many recent papers have focused on combinatorial relationships between TFs.
Which pairs of TFs bind to the same set of target genes?
We can address this question by examining the posterior distribution of each interaction effect $\gamma_{jk}$.
Positive $\gamma_{jk}$ values suggest a synergistic relationship, whereas negative $\gamma_{jk}$ values suggest an antagonistic relationship.
In our yeast application, we found that 84 TF pairs have significant $\gamma_{jk}$ coefficients.
Interactions between TFs (continued)

Many predicted interactions are known and are involved in several important pathways.
[Figure: network of predicted TF interactions; nodes = TFs, edges = significant interactions.]
Mouse Application

We also applied our model to one mouse TF, C/EBP-β, which has all three data types available.
We identified 14 of 16 validated C/EBP-β targets.
More targets are missed when using only a single data source.
Our model also potentially reduces false positives: we predict 38 target genes, compared to 72 predicted from expression data alone or 779 from ChIP data alone.
The estimated weight of $w = 0.92$ favors ChIP binding data over promoter element data: promoter element data is useful in some instances, but generally has less discriminative power than ChIP data.
Summary

Combining multiple data sources (expression, ChIP binding, and promoter element data) leads to improved predictions.
A Bayesian hierarchical model is a natural framework for integrating heterogeneous data sources.
Most Bayesian variable selection approaches use non-informative priors for the selection indicators.
Our approach uses informed priors for the selection indicators, based on additional data sources.
Summary II

Fully probabilistic approach: no reliance on pre-clustering of data or dependence on arbitrary parameter cutoffs.
Flexibility for genes to belong to multiple regulatory clusters and for pairs of transcription factors to interact.
The variable weight methodology achieves an appropriate balance of priors: we confirm the common belief that promoter element data is less reliable, but useful in some cases.
References

Chen, G., Jensen, S.T., and Stoeckert, C. (2007). "Clustering of Genes into Regulons using Integrated Modeling." Genome Biology 8:R4.
Jensen, S.T., Chen, G., and Stoeckert, C. (2007). "Bayesian Variable Selection and Data Integration for Biological Regulatory Networks." Annals of Applied Statistics 1: 612-633.