Comparative eQTL analyses within and between seven tissue types suggest mechanisms underlying cell type specificity of eQTLs


1 Comparative eQTL analyses within and between seven tissue types suggest mechanisms underlying cell type specificity of eQTLs, Duke University; Christopher D Brown, University of Pennsylvania; November 9th, 2012

2 Motivation: Predicting functional SNPs. Most functional nucleotides in vertebrate genomes are non-coding, and more than 85% of common disease associations are with non-coding SNPs. We would like to know whether any non-coding SNP is biochemically functional in a cell type of interest, in order to study: genome-wide association study hits; de novo mutations involved in highly penetrant disease; somatic mutations involved in cancer. Current functional SNP analyses are limited by our narrow understanding of the functional constraints on most of the genome.

3 Functional SNPs: Expression Quantitative Trait Loci. eQTLs are genetic variants that are associated with differences in mRNA transcription levels. Current eQTL studies do not go far enough: cell specificity across relevant cell types is unclear; LD-linked SNPs instead of the causal SNP; often only one local, most significantly associated eQTL SNP. Study goal: quantify, identify possible mechanisms for, and predict cell type specific eQTLs. Results will enable functional interpretation of SNPs in a cell type specific way.

4 Comparison of eQTLs: eleven studies, seven cell types. Used gene expression and genotype data from 11 publicly available studies on 7 different cell types. The analysis pipeline was uniform across studies: remapped expression probes to unique genes in Ensembl; removed unexpressed probes and probes containing SNPs; removed principal components to account for study-specific confounders; imputed genotypes to the CEPH HapMap phase 2 panel; evaluated eQTLs using Bayes factors (BFs); used a single permutation to evaluate FDR; only considered cis-eQTLs (SNPs within 1 Mb of the TSS or TES).
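
The cis-eQTL restriction at the end of the pipeline can be sketched as a simple distance filter. This is only an illustration, not the authors' code: the `Gene` record and the `is_cis` helper are hypothetical names, and the sketch assumes that "within 1 Mb of the TSS or TES" means the SNP lies on the gene's chromosome and no more than 1 Mb from whichever of the two landmarks is nearer.

```python
from dataclasses import dataclass

CIS_WINDOW_BP = 1_000_000  # 1 Mb, the window stated on the slide


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: chromosome plus TSS and TES coordinates."""
    chrom: str
    tss: int  # transcription start site (bp)
    tes: int  # transcription end site (bp)


def is_cis(snp_chrom: str, snp_pos: int, gene: Gene,
           window: int = CIS_WINDOW_BP) -> bool:
    """Return True if the SNP is within `window` bp of the gene's TSS or TES.

    Assumption: the distance is measured to the nearer of the two landmarks.
    """
    if snp_chrom != gene.chrom:
        return False
    nearest = min(abs(snp_pos - gene.tss), abs(snp_pos - gene.tes))
    return nearest <= window


example_gene = Gene(chrom="chr1", tss=5_000_000, tes=5_050_000)
print(is_cis("chr1", 4_200_000, example_gene))
```

Only SNP-gene pairs passing such a filter would then be scored with Bayes factors.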

5 eQTLs across studies: by the numbers
Study      Code  Tissue              N    N genes
CAP        CPL   LCLs
HapMap 2   STL   LCLs
Harvard    HCE   Cerebellum
Harvard    HPC   Prefrontal cortex
Harvard    HVC   Visual cortex
GenCord    GCF   Blood fibroblasts
GenCord    GCL   LCLs
GenCord    GCT   Blood T cells
UChicago   CLI   Liver
Merck      MLI   Liver
Myers      MBR   Brain

6 Sample size versus fraction of genes with eQTLs. [Figure: genes with eQTL [%] versus number of samples, one point per study (CLI, GC*, STL, MBR, MLI, CPL, HVC, HPC, HCE).] Studies with duplicate arrays have substantially more power. Study size and replicate arrays account for 98% of the variability in the fraction of genes with eQTLs.

7 Allelic heterogeneity. Allelic heterogeneity: variants at a genomic locus independently regulate the same biological process. ENCODE: > 400,000 regulatory elements for 23,000 genes. The most significant eQTL is often not the only eQTL. Used an LD-block method to identify allelic heterogeneity, followed by a test for independent effects.

8 Allelic heterogeneity across eleven studies. [Figure: eQTL count by study (HPC, HCE, CPL, HVC, MLI, STL, CLI, MBR, GCL, GCT, GCF).] Sample size is well correlated with levels of allelic heterogeneity. Gene Ontology analysis shows no distinction between genes with primary eQTLs and those with secondary or more eQTLs. We hypothesize that allelic heterogeneity is ubiquitous.

9 eQTLs across cell types: locations. eQTLs are enriched relative to background at the TSS and TES. TSS and TES enrichment extends to eQTLs in all tiers. [Figure: eQTL count versus position (kb) relative to the TSS and TES.]

10 Replication within and between cell types. eQTL replication entails log10 BF > 1.0 in the target data set for all eQTLs in the discovery data at FDR < 5%. Blue lines show within cell type replication; red lines show between cell type replication. [Figure: replication [%] versus log BF; panels LCL + LCL, LCL + Liver, Liver + Liver, Liver + LCL, Brain + Brain, Brain + LCL.] False positives: small percentage of replicating eQTLs. False negatives: due to study design, lack of power, etc.
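
The replication criterion above can be written down as a short function. This is a minimal sketch under stated assumptions: `replication_rate` is a hypothetical helper, eQTLs are keyed by (gene, SNP) pairs, the discovery list is assumed to already be filtered at FDR < 5%, and pairs that were not tested in the target study are skipped rather than counted as failures (the slide does not say how such pairs were handled).

```python
import math


def replication_rate(discovery_eqtls, target_log10_bf, threshold=1.0):
    """Percent of discovery eQTLs whose log10 BF in the target exceeds `threshold`.

    discovery_eqtls: iterable of (gene, snp) pairs called at FDR < 5%.
    target_log10_bf: dict mapping (gene, snp) -> log10 Bayes factor in the target.
    Assumption: pairs absent from the target study are ignored.
    """
    tested = [pair for pair in discovery_eqtls if pair in target_log10_bf]
    if not tested:
        return math.nan
    replicated = sum(1 for pair in tested if target_log10_bf[pair] > threshold)
    return 100.0 * replicated / len(tested)


# Toy values, made up purely to exercise the function.
discovery = [("G1", "rs1"), ("G2", "rs2"), ("G3", "rs3"), ("G4", "rs4")]
target = {("G1", "rs1"): 2.5, ("G2", "rs2"): 0.4, ("G3", "rs3"): 1.0}
print(replication_rate(discovery, target))
```

Running the same function with discovery and target drawn from the same or from different cell types gives the within and between cell type comparisons.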

11 Incorporating ENCODE data: functional interpretability. The ENCODE project has extensive genomic data for cell type specific genomic features. Goal: understand how an eQTL regulates transcription. [Figure from the ENCODE project.]

12 Allelic heterogeneity and insulators. CTCF is the best characterized insulator protein, conserved in function across metazoans. If two SNPs independently regulate transcription, we might expect an enrichment of CTCF between them. In Drosophila melanogaster, recent work showed insulators are enriched between alternative promoters [Negré, 2010]. We see this same enrichment in humans. [Figure 3: intervening CTCF [%] versus SNP-SNP distance [kb], for independent eQTL SNPs and background SNPs.] Figure 3. Insulators are enriched between SNPs independently associated with a gene expression trait.
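
One way to compute the "intervening CTCF [%]" quantity is to check, for each SNP pair, whether any CTCF site lies strictly between the two positions. The sketch below is illustrative only: the function names are hypothetical, CTCF sites are reduced to single coordinates on one chromosome, and it does not reproduce the distance matching implied by the SNP-SNP distance axis of the figure.

```python
from bisect import bisect_left, bisect_right


def has_intervening_site(pos_a, pos_b, sorted_sites):
    """True if at least one site lies strictly between the two SNP positions.

    sorted_sites must be sorted ascending (single chromosome assumed).
    """
    lo, hi = sorted((pos_a, pos_b))
    first_after_lo = bisect_right(sorted_sites, lo)  # first index with site > lo
    first_at_hi = bisect_left(sorted_sites, hi)      # first index with site >= hi
    return first_at_hi > first_after_lo


def percent_intervening(snp_pairs, sites):
    """Percent of SNP pairs separated by at least one site."""
    if not snp_pairs:
        raise ValueError("need at least one SNP pair")
    sorted_sites = sorted(sites)
    hits = sum(has_intervening_site(a, b, sorted_sites) for a, b in snp_pairs)
    return 100.0 * hits / len(snp_pairs)


# Toy coordinates, made up for illustration.
ctcf_sites = [300, 100, 200]
pairs = [(150, 250), (200, 300), (10, 50), (250, 120)]
print(percent_intervening(pairs, ctcf_sites))
```

Applying the same function to independent eQTL SNP pairs and to background SNP pairs at similar distances would give the two curves compared in the figure.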

13 eQTLs and overlap with DHS sites. DNase I hypersensitive (DHS) sites indicate histone-depleted open chromatin, a classic feature of active regulatory elements. Clear enrichment in eQTL overlap with DHS sites. Significant enrichment for replicating eQTLs versus non-replicating eQTLs (not shown). Significant enrichment for LCL eQTLs in DHS sites in LCLs versus DHS sites in HepG2 cells (not shown). [Figure: SNP-CRE overlap [%] for eQTL SNPs and background SNPs.]
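
The SNP-CRE overlap percentage in the figure can be sketched as an interval lookup. This is a hedged illustration, not the pipeline used in the talk: `merge_intervals` and `percent_overlap` are hypothetical helpers, intervals are treated as half-open [start, end) on a single chromosome, and the coordinates below are invented.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching [start, end) intervals; returns a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def percent_overlap(snp_positions, intervals):
    """Percent of SNP positions that fall inside any interval."""
    if not snp_positions:
        raise ValueError("need at least one SNP position")
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    hits = 0
    for pos in snp_positions:
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos < merged[i][1]:
            hits += 1
    return 100.0 * hits / len(snp_positions)


# Toy data: made-up DHS intervals and SNP positions.
dhs = [(100, 200), (150, 250), (400, 500)]
eqtl_pct = percent_overlap([120, 249, 250, 300, 450], dhs)
background_pct = percent_overlap([10, 20, 110, 600], dhs)
print(eqtl_pct, background_pct, eqtl_pct / background_pct)
```

The ratio of the two percentages is one simple way to express enrichment; the same computation with heterochromatin intervals would express depletion as a ratio below one.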

14 eQTLs and overlap with heterochromatin. Heterochromatin (facultative): tightly packed, cell specific form of DNA; regulatory elements in heterochromatin regions are inaccessible to transcriptional regulators. Clear depletion in eQTL overlap with heterochromatin. Significant depletion for replicating eQTLs versus non-replicating eQTLs (not shown). Significant depletion for LCL eQTLs in heterochromatin in LCLs versus heterochromatin in HepG2 cells (not shown). [Figure: SNP-CRE overlap [%] for background SNPs and eQTL SNPs.]

15 Predicting replication of eQTLs. Built a random forest classifier to predict whether a specific eQTL would replicate in a second study. The class was whether an eQTL replicated or not. Features included: genomic information (e.g., distance from the SNP to the TSS); non-cell type specific regulatory elements (e.g., GERP scores); cell type specific regulatory elements (e.g., DHS sites, TFBS). Considered predicting replication: within cell type, using cell type specific CRE information; between cell types, using target cell type specific CRE data. Validated accuracy using 10-fold cross validation.
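
The 10-fold cross validation step can be sketched independently of the classifier. The generator below is a minimal, assumption-laden illustration: `k_fold_indices` is a hypothetical helper, the split is a plain shuffled partition (the slide does not say whether folds were stratified by class), and the random forest itself is not implemented here.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Every item lands in exactly one test fold; the shuffle is seeded
    so that the split is reproducible.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal slices
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


# Each eQTL (one row of features plus a replicated / not replicated label)
# is scored by a model that never saw it during training.
for train_idx, test_idx in k_fold_indices(25, k=10):
    pass  # fit the classifier on train_idx, predict on test_idx
```

Pooling the held-out predictions across the ten folds gives one score per eQTL, which is what an accuracy or ROC summary would then be computed from.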

16 Predicting replication of eQTLs: ROC curves. Receiver Operating Characteristic (ROC) curves compare the rate of false positives versus the rate of true positives as the cutoff moves from most to least restrictive. Red lines: within cell type replicability; blue lines: between cell type replicability. [Figure: TPR versus FPR; panels LCL + LCL, LCL + Liver, Liver + Liver, Liver + LCL, Brain + Brain, Brain + LCL.] Area under the ROC curve (AUC): quantifies improvement over random guessing. For LCL eQTLs, AUCs are 0.79 and 0.73, respectively, for within LCL and between LCL and liver eQTL replication.
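
The AUC has a convenient rank interpretation: it is the probability that a randomly chosen replicating eQTL receives a higher score than a randomly chosen non-replicating one, with ties counted as one half. The sketch below computes it that way; `roc_auc` is a hypothetical helper written for clarity rather than speed, and the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: truthy for replicating eQTLs, falsy otherwise.
    scores: classifier scores, higher meaning "more likely to replicate".
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.79 and 0.73 should be read.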

17 Predicting replication of eQTLs: Gini scores. How predictive is each feature for whether the eQTL replicates? Across all training sets, the biggest contributors are eQTL discovery significance, SNP to TSS distance, and gene expression level. Cis-regulatory elements vary considerably in the degree to which they are useful in predicting replication. Intervening insulators contribute substantially to within cell type predictions. Heterochromatin states contribute substantially to between cell type predictions.
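
In random forests, a feature's Gini score is usually the decrease in Gini impurity produced by splits on that feature, accumulated over the trees; the sketch below assumes that is what the slide means and shows the quantity for a single split. The helper names and the toy feature values are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a binary label set: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y) / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting one feature at `threshold`.

    Samples with value <= threshold go left, the rest go right; the child
    impurities are weighted by the share of samples in each child.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) * gini_impurity(left)
                         + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - weighted_children


# Toy example: a made-up feature that separates the classes perfectly at 2.
feature = [1, 2, 3, 4]
replicated = [0, 0, 1, 1]
print(gini_decrease(feature, replicated, threshold=2))  # 0.5
```

Features whose splits produce larger decreases on average rank higher, which is the sense in which discovery significance, SNP to TSS distance, and expression level are the biggest contributors.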

18 Summary and Conclusions. We leveraged eQTLs found both within and between cell types, together with extensive ENCODE data, in this large comparative study to quantify, describe mechanistically, and predict cell type specific eQTL SNPs. Given a SNP and a cell type of interest: identify an eQTL well correlated (in high LD) with the hit; compute the probability that it will replicate in the cell type of interest; consider the location of the hit relative to cell type specific and prediction-informative CREs; make a more informed hypothesis about the mechanism of the phenotype (validate via experiments).

19 Acknowledgements. Casey Brown (UChicago, Penn), Lara Mangravite (Sage Bionetworks), Matthew Stephens (University of Chicago), Greg Crawford (Duke University), all the ENCODE data. eQTL studies: GenCord, CAP, Harvard Brain, HapMap phase 2, Merck liver, Myers brain, UChicago liver. Funding: NIH NHGRI K99/R00. Paper on arXiv, Haldane's Sieve. Graphics: R package ggplot2.