Epigenetics and DNase-Seq

Similar documents
Genome 541! Unit 4, lecture 3! Genomics assays

Genome 541 Gene regulation and epigenomics Lecture 3 Integrative analysis of genomics assays

2/10/17. Contents. Applications of HMMs in Epigenomics

NGS Approaches to Epigenomics

Applications of HMMs in Epigenomics

2/19/13. Contents. Applications of HMMs in Epigenomics

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background

Shin Lin CS229 Final Project Identifying Transcription Factor Binding by the DNase Hypersensitivity Assay

Advanced Bioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2018

Deep learning frameworks for regulatory genomics and epigenomics

CS273B: Deep learning for Genomics and Biomedicine

Measuring transcriptomes with RNA-Seq

Characterizing DNA binding sites high throughput approaches Biol4230 Tues, April 24, 2018 Bill Pearson Pinn 6-057

Linking Genetic Variation to Important Phenotypes

ChIP-seq analysis 2/28/2018

The ChIP-Seq project. Giovanna Ambrosini, Philipp Bucher. April 19, 2010 Lausanne. EPFL-SV Bucher Group

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

Decoding Chromatin States with Epigenome Data Advanced Topics in Computa8onal Genomics

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

The ENCODE Encyclopedia. Zhiping Weng University of Massachusetts Medical School ENCODE 2016: Research Applications and Users Meeting June 8, 2016

The ENCODE Encyclopedia. & Variant Annotation Using RegulomeDB and HaploReg

A Brief History. Bootstrapping. Bagging. Boosting (Schapire 1989) Adaboost (Schapire 1995)

Supplementary Table 1: Oligo designs. A list of ATAC-seq oligos used for PCR.

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies

What Can the Epigenome Teach Us About Cellular States and Diseases?

Methods and tools for exploring functional genomics data

ChIP. November 21, 2017

Homework 4. Due in class, Wednesday, November 10, 2004

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Ensembl Funcgen: A Database and API for Epigenomics and Gene Regulation Data.

MODULE TSS1: TRANSCRIPTION START SITES INTRODUCTION (BASIC)

The Next Generation of Transcription Factor Binding Site Prediction

Understanding transcriptional regulation by integrative analysis of transcription factor binding data

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts.

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

DOUBLE-STRAND DNA BREAK PREDICTION USING EPIGENOME MARKS AT KILOBASE RESOLUTION

Lecture 5: Regulation

ChIP-seq and RNA-seq

Bioinformatics of Transcriptional Regulation

Transcription factor binding site prediction in vivo using DNA sequence and shape features

ChIP-seq and RNA-seq. Farhat Habib

Motif Finding: Summary of Approaches. ECS 234, Filkov

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Machine Learning. HMM applications in computational biology

Integrating diverse datasets improves developmental enhancer prediction

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues

V23: the histone code

Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs

Chapter 17 Lecture. Concepts of Genetics. Tenth Edition. Regulation of Gene Expression in Eukaryotes

Control of Eukaryotic Gene Expression (Learning Objectives)

Analysis of ChIP-seq data with R / Bioconductor

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Differential Gene Expression

Discovery of Transcription Factor Binding Sites with Deep Convolutional Neural Networks

Discovering similar (epi)genomics feature patterns in multiple genome browser tracks

Bio5488 Practice Midterm (2018) 1. Next-gen sequencing

Discovering gene regulatory control using ChIP-chip and ChIP-seq. An introduction to gene regulatory control, concepts and methodologies

A Supervised Learning Approach to the Prediction of Hi-C Data

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Integrating Diverse Datasets Improves Developmental Enhancer Prediction

Bi8 Lecture 19. Review and Practice Questions March 8th 2016

Personal and population genomics of human regulatory variation

What we ll do today. Types of stem cells. Do engineered ips and ES cells have. What genes are special in stem cells?

Do engineered ips and ES cells have similar molecular signatures?

Next- genera*on Sequencing. Lecture 13

Exploring Long DNA Sequences by Information Content

Analyzing ChIP-seq data. R. Gentleman, D. Sarkar, S. Tapscott, Y. Cao, Z. Yao, M. Lawrence, P. Aboyoun, M. Morgan, L. Ruzzo, J. Davison, H.

University of Groningen

Molecular Cell Biology - Problem Drill 06: Genes and Chromosomes

Charles Girardot, Furlong Lab. MACS, CisGenome, SISSRs and other peak calling algorithms: differences and practical use

Lecture 7 Motif Databases and Gene Finding

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

Plant Molecular and Cellular Biology Lecture 9: Nuclear Genome Organization: Chromosome Structure, Chromatin, DNA Packaging, Mitosis Gary Peter

Figure S1. nuclear extracts. HeLa cell nuclear extract. Input IgG IP:ORC2 ORC2 ORC2. MCM4 origin. ORC2 occupancy

Differential Gene Expression

Lecture 21: Epigenetics Nurture or Nature? Chromatin DNA methylation Histone Code Twin study X-chromosome inactivation Environemnt and epigenetics

Chromatin. Structure and modification of chromatin. Chromatin domains

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

RNA-SEQUENCING ANALYSIS

Molecular Biology (BIOL 4320) Exam #1 March 12, 2002

Motifs. BCH339N - Systems Biology / Bioinformatics Edward Marcotte, Univ of Texas at Austin

A survey on computational methods for enhancer and enhancer target predictions Introduction

Gene function prediction. Computational analysis of biological networks. Olga Troyanskaya, PhD


L8: Downstream analysis of ChIP-seq and ATAC-seq data

Computational Analysis of Ultra-high-throughput sequencing data: ChIP-Seq

Enhancing motif finding models using multiple sources of genome-wide data

Introduction to gene expression microarray data analysis

Integrative Genomics 3b. Systems Biology and Epigenetics

Huijuan Feng, Shining Ma,Chao Ye & Zhixing Feng

Lecture 14. ENCODE. Michael Schatz. March 15, 2017 JHU : Applied Comparative Genomics

Measuring Protein-DNA interactions

Y1 Biology 131 Syllabus - Academic Year

CollecTF Documentation

THE ANALYSIS OF CHROMATIN CONDENSATION STATE AND TRANSCRIPTIONAL ACTIVITY USING DNA MICROARRAYS 1. INTRODUCTION

Why DNA Isn t Your Destiny

Transcription:

Epigenetics and DNase-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Anthony Gitter, Mark Craven, and Colin Dewey

2 Goals for lecture Key concepts Importance of epigenetic data for understanding transcriptional regulation Predicting transcription factor binding sites Gaussian process models

Introduction to epigenetics 3

C 4 Defining epigenetics Formally: attributes that are in addition to genetic sequence or sequence modifications Informally: experiments that reveal the context of DNA sequence DNA has multiple states and modifications G A C T A G T G C G T T A C T vs. modification inaccessible G T G C G T T A C T Histones

5 Importance of epigenetics Better understand DNA binding and transcriptional regulation Differences between cell and tissue types Development and other important processes Non-coding genetic variants (next lecture)

6 PWMs are not enough Genome-wide motif scanning is imprecise Transcription factors (TFs) bind < 5% of their motif matches Same motif matches in all cells and conditions

PWMs are not enough DNA looping can bring distant binding sites close to transcription start sites Which genes does an enhancer regulate? Enhancer: DNA binding site for TFs, can be far from affected gene Promoter: DNA binding site for TFs, close to gene transcription start site Nature Education 2010 7

8 Mapping regulatory elements genome-wide Can do much better than motif scanning with additional data ChIP-seq measures binding sites for one TF at a time Shlyueva Nature Reviews Genetics 2014 Epigenetic data suggests where some TF binds

DNase I hypersensitivity Regulatory proteins bind accessible DNA DNase I enzyme cuts open chromatin regions that are not protected by nucleosomes Nucleosome: DNA wrapped around histone proteins Wang PLoS ONE 2012 9

Histone modifications Mark particular regulatory configurations Two copies of histone proteins H2A, H2B, H3, H4 Shlyueva Nature Reviews Genetics 2014 H3 (protein) K27 (amino acid) ac (modification) Latham Nature Structural & Molecular Biology 2007; Katie Ris-Vicari 10

DNA methylation Reversible DNA modification Represses gene expression OpenStax CNX 11

3d organization of chromatin Algorithms to predict long range enhancer-promoter interactions Or measure with chromosome conformation capture (3C, Hi-C, etc.) Rao Cell 2014 12

3d organization of chromatin Hi-C produces 2d chromatin contact maps 500 kb 50 kb Learn domains, enhancerpromoter interactions 5 kb Rao Cell 2014 13

Large-scale epigenetic maps Epigenomes are condition-specific Roadmap Epigenomics Consortium and ENCODE surveyed over 100 types of cells and tissues Roadmap Epigenomics Consortium Nature 2015 14

Genome annotation Combinations of epigenetic signals can predict functional state ChromHMM: Hidden Markov model Segway: Dynamic Bayesian network Roadmap Epigenomics Consortium Nature 2015 15

16 Genome annotation States are more interpretable than raw data Ernst and Kellis Nature Methods 2012

17 Predicting TF binding with DNase-Seq

18 DNase I hypersensitive sites Arrows indicate DNase I cleavage sites Obtain short reads that we map to the genome Wang PLoS ONE 2012

DNase I footprints Distribution of mapped reads is informative of open chromatin and specific TF binding sites ChIP-Seq peak I Read depth at each position Nucleosome free open chromatin Zoom in TF binding prevents DNase cleavage leaving Dnase I footprint, only consider 5 end Neph Nature 2012 19

20 DNase I footprints to TF binding predictions DNase footprints suggest that some TF binds that location We want to know which TF binds that location Two ideas: Search for DNase footprint patterns, then match TF motifs Search for motif matches in genome, then model proximal DNase-Seq reads We ll consider this approach

Protein Interaction Quantification (PIQ) Sherwood et al. Nature Biotechnology 2014 Given: TF motifs and DNase-Seq reads Do: Predict binding sites of each TF Rieck and Wright Nature Biotechnology 2014 21

22 PIQ main idea With no TF binding, DNase-Seq reads come from some background distribution TF binding changes read density in a TFspecific way TF effects Background

23 PIQ main idea Shape of DNase peak and footprint depend on the TF TF A TF B Sherwood Nature Biotechnology 2014

24 PIQ features We ll discuss Modeling the DNase-Seq background distribution How TF binding impacts that distribution Priors on TF binding We ll skip Modeling multiple replicates or conditions, crossexperiment and cross-strand effects Expectation propagation TF hierarchy: pioneers, settlers, migrants

25 Algorithm preview Identify candidate binding sites with PWMs Build a probabilistic model of the DNase-Seq reads Estimate TF binding effects Estimate which candidate binding sites are bound Predict pioneer, settler, and migrant TFs

26 DNase-Seq background Each replicate is noisy, don t want to overinterpret this noise Only counting density of 5 ends of reads Manage two competing objectives Smooth some of the noise Don t destroy base pair resolution signal

27 Gaussian processes Can model and smooth sequential data Bayesian approach Jupyter notebook demonstration

28 TF DNase profile Adjust the log-read rate by a TF-specific effect at binding sites DNase profile for factor l Whether site m is bound DNase log-read rate adjusted for binding of factor l μ l = μ i + β i j,l y m j W and I m = 1 0 otherwise Location of binding site m DNase log-read rate at position i from Gaussian process Window size

29 TF DNase profile DNase profiles represented as a vector for each TF DNase profile for factor l μ l = μ i + β i j,l y m j W and I m = 1 0 otherwise Can t be too far apart μ y m i I β W l = W

30 Priors on TF binding TF binding event be more likely when s j motif score is high c j I j should DNase counts are high Example only, not realistic data Isotonic (monotonic) regression f s j s j Wikipedia log(p(i j = 1)) = f s j + g(c j )

31 Full algorithm Given: TF motifs and DNase-Seq reads Do: Predict binding sites of each TF Identify candidate binding sites with PWMs Fit Gaussian process parameters for background Estimate TF binding effects β i j,l Iterate until parameters converge Estimate Gaussian process posterior with expectation propagation Estimate expectation of which candidate binding sites are bound Update monotonic regression functions for binding priors

32 TF binding hierarchy Pioneer, settler, and migrant TFs Sherwood Nature Biotechnology 2014

33 Evaluation: confusion matrix Compare predictions to actual ground truth (gold standard) Lever Nature Methods 2016

Evaluation: ChIP-Seq gold standard Sung Molecular Cell 2014 34

35 Evaluation: ROC curve Calculate receiver operating characteristic curve (ROC) True Positive Rate versus False Positive Rate Summarize with area under ROC curve (AUROC) TPR TP P TP TP FN FPR FP N FP FP TN Includes true negatives Reason to prefer precision-recall for class imbalanced data

36 Evaluation: ROC curve TPR and FPR are defined for a set of positive predictions Need to threshold continuous predictions Rank predictions ROC curve assesses all thresholds Candidate P(bound) binding site 764 0.99 47 0.96 942 0.91 157 0.87 79 0.83 202 0.72 356 0.66 679 0.51 291 0.43 810 0.40 Calculate TPR and FPR at all thresholds t t Positive predictions Negative predictions

PIQ ROC curve for mouse Ctcf Compare predictions to ChIP-Seq Full PIQ model improves upon motifs or DNase alone Sherwood Nature Biotechnology 2014 37

PIQ evaluation Compare to two standard methods 303 ChIP-Seq experiments in K562 cells Centipede, digital genomic footprinting Compare AUROC PIQ has very high AUROC Mean 0.93 Corresponds to recovering median of 50% of binding sites Sherwood Nature Biotechnology 2014 38

DNase-Seq benchmarking PIQ among top methods in large scale DNase benchmarking study HMM-based model HINT was top performer Gusmao Nature Methods 2016 39

Downside of AUROC for genome-wide evaluations Almost all methods look equally good when using full ROC curve AUROC close to 1.0 Precision-recall curve or truncated ROC curve differentiate methods Gusmao Nature Methods 2016 40

41 PIQ summary Smooth noisy DNase-Seq data without imposing too much structure Combine DNase-Seq and motifs to predict condition-specific binding sites Supports replicates and multiple related conditions (e.g. time series)