ChIP. November 21, 2017

functional signals: is DNA enough? what is the smallest number of letters used by a written language?

DNA is only one part of the functional genome
DNA is heavily bound by proteins, in any cell:
- nucleosomes
- transcription factors
- transcription suppressors
- scaffolding
Proteins can bind to specific DNA sequences
Some proteins have fairly nonspecific binding (e.g. nucleosomes)

chromatin immunoprecipitation (ChIP) workflow:
1) crosslink DNA and proteins
2) shear or digest DNA into fragments
3) use tagged antibody to isolate protein of interest
4) reverse protein-DNA crosslinks
5) sequence DNA and align to a reference genome, or hybridize to a microarray

[Figure: ChIP workflow. Cross link; sonicate or digest; add antibody (or no antibody as a control); immunoprecipitate, reverse cross links, purify DNA; total input.]

after aligning to a reference genome

pathognomonic ChIP peak shape
[Figure labels: forward strand sequenced from this side only; reverse strand sequenced from this side only]

after aligning to a reference genome: IGV, with plus strand in pink, minus strand in blue

characteristic fragments Kharchenko et al, Nat Biotech 2008

MACS (Model-based Analysis of ChIP-Seq)
two issues in peak calling:
- resolution (how finely can the peak be defined)
  ChIP-seq tags only come from the ends of a fragment! so the exact position of a bound protein must be inferred to a resolution smaller than the fragment size
- detection above background noise
  because of sequencing biases, chromatin structure, copy number variation, and mapping biases, the baseline isn't flat

MACS (Model-based Analysis of ChIP-Seq)
assume there is no strand bias (tags are no more likely to come from one strand than the other)
then sample 1000 regions with more than mfold enrichment relative to a random distribution, and compare Watson vs Crick peak positions
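The comparison of Watson and Crick peak positions can be sketched in a few lines. This is a minimal illustration, not the MACS implementation: the function names, the 10 bp binning, and the use of a median across regions are all assumptions made for the example. Each region is given as a pair of lists holding the 5' end positions of plus-strand and minus-strand tags. The estimated distance d between the two strand peaks is then used to shift every tag by d/2 toward its 3' end.

```python
from statistics import median


def strand_mode(positions, bin_size=10):
    """Return the centre of the most populated bin of tag positions."""
    counts = {}
    for pos in positions:
        bin_index = pos // bin_size
        counts[bin_index] = counts.get(bin_index, 0) + 1
    # Ties are broken in favour of the leftmost bin.
    best = max(counts, key=lambda b: (counts[b], -b))
    return best * bin_size + bin_size // 2


def estimate_d(regions, bin_size=10):
    """Median distance between minus-strand and plus-strand peak modes.

    regions: list of (plus_positions, minus_positions) for enriched regions.
    """
    offsets = []
    for plus, minus in regions:
        if not plus or not minus:
            continue  # a region needs tags on both strands
        offset = strand_mode(minus, bin_size) - strand_mode(plus, bin_size)
        if offset > 0:
            offsets.append(offset)
    return median(offsets)


def shift_tags(plus, minus, d):
    """Shift tags by d/2 toward their 3' ends so both strands pile up together."""
    half = int(d) // 2
    return sorted([p + half for p in plus] + [m - half for m in minus])
```

With d estimated from many regions, `shift_tags` moves the two strand-specific peaks onto the inferred binding position.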

MACS (Model-based Analysis of ChIP-Seq) cross-correlation of signals from the two strands is highest when the shift distance matches the size of the binding site
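A strand cross-correlation profile can be computed as below. This is a sketch under simple assumptions: tags are given as 0-based 5' end positions on a single short region, the function names are made up for the example, and no smoothing or normalisation for mappability is attempted. For each candidate shift, the minus-strand profile is moved left by that distance and compared with the plus-strand profile using Pearson correlation.

```python
def coverage(positions, length):
    """Per-base count of tag 5' ends."""
    counts = [0] * length
    for pos in positions:
        counts[pos] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def strand_cross_correlation(plus, minus, length, max_shift):
    """Correlation between the strands for every shift from 0 to max_shift."""
    plus_cov = coverage(plus, length)
    minus_cov = coverage(minus, length)
    return [
        pearson(plus_cov[: length - shift], minus_cov[shift:])
        for shift in range(max_shift + 1)
    ]


def best_shift(plus, minus, length, max_shift):
    """Shift distance with the highest cross-correlation."""
    scores = strand_cross_correlation(plus, minus, length, max_shift)
    return max(range(len(scores)), key=scores.__getitem__)
```

In a toy case where every minus-strand tag sits exactly 150 bp to the right of a plus-strand tag, the profile peaks at a shift of 150.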

MACS (Model-based Analysis of ChIP-Seq)
options for removing background noise:
1) use a Poisson distribution ($\lambda_{bg}$) to define a cutoff number of tags
2) use the total input to estimate local background
MACS uses a dynamic Poisson parameter, $\lambda_{local}$, defined separately for each candidate peak as:
$\lambda_{local} = \max(\lambda_{bg}, [\lambda_{1k},]\ \lambda_{5k}, \lambda_{10k})$
$\lambda_{1k}$, $\lambda_{5k}$, $\lambda_{10k}$ are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample
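The dynamic parameter and the resulting Poisson test can be illustrated as follows. This is a sketch, not MACS code: the function names are hypothetical, the window counts are assumed to be control-sample tag counts already scaled to the ChIP sequencing depth, and each window count is converted to an expected count per peak by multiplying by peak width over window width. The bracketed 1 kb term is treated as optional through the `use_1k` flag.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(lambda_bg, control_counts, peak_width, use_1k=False):
    """Largest of the genome-wide and local background estimates.

    control_counts maps window size in bp (1000, 5000, 10000) to the number
    of control tags in the window of that size centred on the peak.
    """
    candidates = [lambda_bg]
    for window in (1000, 5000, 10000):
        if window == 1000 and not use_1k:
            continue
        candidates.append(control_counts[window] * peak_width / window)
    return max(candidates)


def peak_p_value(chip_tags, lambda_bg, control_counts, peak_width, use_1k=False):
    """Probability of seeing at least chip_tags tags under the local background."""
    lam = local_lambda(lambda_bg, control_counts, peak_width, use_1k)
    return poisson_sf(chip_tags, lam)
```

Taking the maximum makes the test conservative: a candidate peak sitting in a region where the control is locally high has to clear the higher local rate rather than the genome-wide one.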

common ChIP problems
- size range of the binding phenomenon is unknown (e.g. some repressive histone marks can occupy many kb of DNA)
- sequencing depth in control and IP samples influences peak finding

histoneHMM
expression data indicate that the huge repressive peak is real!

histoneHMM
Classifies data into four states:
- modified in both samples
- unmodified in both samples
- sample A is modified
- sample B is modified
where the read counts are presumed to come from a mixture of background & signal
what are the observed and hidden states?

what next?
after finding ChIP peaks...
- look for motifs, to figure out binding site
- correlate binding with structural or functional assays (gene expression, chromatin conformation)
- use ChIP peaks for different marks to profile genes

Meta-clustering identifies combinatorial subprofiles for chromatin marks.

viewing and describing motifs
PWM (position weight matrix)
ACCGCTG
AGCGCTG
TCCGCAG
TCCCGTG
ACCGCTG
AGCGCTG
AGCGCTG
TCCGCAG
pos.  A  C  G  T
0     5  0  0  3
1     0  5  3  0
2     0  8  0  0
3     0  1  7  0
4     0  7  1  0
5     2  0  0  6
6     0  0  8  0
consensus sequence: ACCGCTG
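The matrix on this slide can be rebuilt from the eight aligned sites by counting letters column by column. The sketch below does that and reads off the consensus; the function names are made up for the example. Note that the table holds raw counts, and that converting them to weights (for example log-odds against a background) is a separate step not shown here.

```python
from collections import Counter

ALPHABET = "ACGT"

# The eight aligned sites shown on the slide.
SITES = [
    "ACCGCTG", "AGCGCTG", "TCCGCAG", "TCCCGTG",
    "ACCGCTG", "AGCGCTG", "AGCGCTG", "TCCGCAG",
]


def count_matrix(sites):
    """One row per motif position, one column per letter of ALPHABET."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    rows = []
    for column in zip(*sites):
        counts = Counter(column)
        rows.append([counts[base] for base in ALPHABET])
    return rows


def consensus(matrix):
    """Most frequent letter at each position (first letter wins a tie)."""
    return "".join(ALPHABET[row.index(max(row))] for row in matrix)


if __name__ == "__main__":
    for pos, row in enumerate(count_matrix(SITES)):
        print(pos, *row)
    print("consensus sequence:", consensus(count_matrix(SITES)))
```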

viewing and describing motifs
(same count matrix and consensus sequence, ACCGCTG)
[Figure: probability of each nucleotide (y-axis, 0 to 1) at each position (x-axis, 1 to 7)]

viewing and describing motifs
(same count matrix and consensus sequence, ACCGCTG)
[Figure: sequence logo for test_sequences; information content (y-axis, 0 to 2) at each position (x-axis, 1 to 7)]

viewing and describing motifs
seqlogo
[Figure: sequence logo for test_sequences; information content (y-axis, 0 to 2) at each position (x-axis, 1 to 7)]
Information content: measure of tolerance to substitutions
IC of 2 bits means only one nucleotide is allowed at that position.
IC of 0 means that all nucleotides occur with equal frequency at that position.

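seqlogo
information content for position w in the motif, where J is the size of the alphabet for the motif (4 for DNA, 20 for protein):
$IC(w) = \log_2 J - H(w)$, where the entropy is $H(w) = -\sum_{j=1}^{J} p_{w,j} \log_2 p_{w,j}$ and $p_{w,j}$ is the frequency of letter j at position w
AAAAAAAAAAA has zero entropy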
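The information content formula can be checked on single columns of counts. This is a minimal sketch with a hypothetical function name. It uses the plain formula from the slide and omits the small-sample correction that some logo tools apply.

```python
import math


def information_content(counts):
    """IC in bits for one motif position, given letter counts at that position.

    The alphabet size J is taken to be len(counts); letters with a zero
    count contribute nothing to the entropy.
    """
    total = sum(counts)
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )
    return math.log2(len(counts)) - entropy


if __name__ == "__main__":
    print(information_content([0, 8, 0, 0]))  # one letter only: 2 bits
    print(information_content([2, 2, 2, 2]))  # uniform column: 0 bits
    print(information_content([5, 0, 0, 3]))  # a mixed column, between 0 and 2
```

A column with 5 A and 3 T out of 8 sites gives roughly 1.05 bits, between the two extremes.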

short side trip into entropy
Measure of how close to uniform the distribution is (~unpredictability)...
Like variance in a way, but not the same thing
Entropy of random DNA (Wootton and Federhen definition):
ACAGGTTTCT
AAAAAAAAAA

Entropy: most useful when calculated in windows
ACTGACTGATCGACGTACGTACGTACGTACGT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Entropy
Computing in windows is critical to assessing landscape
ACTGACTGAAAAACGTACGTATTTCCCGTACGT
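Windowed entropy is easy to compute directly. The sketch below uses Shannon entropy of the base composition, which is an assumption made for illustration and may differ from the exact Wootton and Federhen measure named earlier. The function names and the window size are also choices made for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits) of the letter composition of seq."""
    n = len(seq)
    # Written as 0.0 - sum(...) so a homopolymer gives 0.0 rather than -0.0.
    return 0.0 - sum(
        (count / n) * math.log2(count / n) for count in Counter(seq).values()
    )


def windowed_entropy(seq, window, step=1):
    """List of (start, entropy) for each window of the given size."""
    return [
        (start, shannon_entropy(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    seq = "ACTGACTGAAAAACGTACGTATTTCCCGTACGT"
    for start, h in windowed_entropy(seq, window=4):
        print(start, seq[start:start + 4], round(h, 3))
```

Over the whole example sequence the entropy stays close to 2 bits, while 4 bp windows drop to 0 inside the AAAAA run, which is the landscape that a single global value hides.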

motif finding workflow
get a bunch of sequences
- predicted
- experimental
find candidate motifs de novo or starting from a known motif
- word-based algorithms
- probabilistic algorithms
test whether motifs are functional
- chromatin binding assay
- reporter assay
- phylogeny
- gene set analysis

sequence sources
predicted
- transcription factors typically bind within 1 kb of a promoter, so search in those intervals for motifs
- genes in a pathway are often regulated by the same transcription factors, so their upstream sequences may have the same motifs
experimental
- bind a transcription factor to DNA and digest all DNA that is not bound (DNA footprinting)
- collect sequences near or bound by particular proteins (ChIP-seq)
as a control, include sequences not known to be bound by the factor or not known to have the effect

word-based algorithms
simplest approach: assume that the binding site is n bp. Count occurrences of all n-bp sequences in the dataset and compare to the expected distribution:
word: AAAAAA  AAAAAC  AAAAAG  ...  CGCCCT  CGCCGA  ...  TTTTTT
obs:  20      50      41      ...  98      104     ...  9
exp:  85      84      84      ...  72      72      ...  85
expected distribution is based on GC/AT content of the sequences.
Calculate a z-score for the observed frequency of a motif. If it is significantly overrepresented, look at all 1-base edits:
CGCCCT: AGCCCT, GGCCCT, TGCCCT...
are these motifs overrepresented, as a group?
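The word-counting approach can be sketched as follows. The function names are hypothetical, and the expected count uses a simple independence model built from single-base frequencies with a binomial standard deviation. That is an assumption made for the example, and it ignores the extra variance caused by overlapping occurrences of self-similar words.

```python
import math
from collections import Counter


def count_words(seqs, n):
    """Count every overlapping n-bp word across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def base_frequencies(seqs):
    """Fraction of each base over the whole dataset."""
    counts = Counter("".join(seqs))
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def z_score(word, observed, freqs, positions):
    """z-score of an observed word count against the composition-based expectation.

    positions is the number of places a word could start in the dataset.
    """
    p = math.prod(freqs[b] for b in word)
    expected = positions * p
    sd = math.sqrt(positions * p * (1 - p))
    return (observed - expected) / sd if sd > 0 else 0.0


def one_base_edits(word):
    """All words that differ from word at exactly one position."""
    return [
        word[:i] + base + word[i + 1:]
        for i in range(len(word))
        for base in "ACGT"
        if base != word[i]
    ]
```

For a 6-mer such as CGCCCT, `one_base_edits` returns 18 neighbours, whose counts can be summed and tested as a group in the same way.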

PWM-based algorithms
use publicly available position weight matrices, look for
- range of scores of alignments, then scores above the distribution
- multiple matches in one sequence
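Scanning a sequence with a matrix might look like the sketch below. It reuses the count matrix from the earlier motif slide instead of a database entry, and the pseudocount of 0.5, the uniform background, and the function names are assumptions made for illustration. Every window gets a log-odds score, and windows at or above a chosen threshold are reported, so several matches in one sequence are returned together.

```python
import math

BASES = "ACGT"

# Count matrix from the earlier slide (rows are positions; columns are A, C, G, T).
COUNTS = [
    [5, 0, 0, 3],
    [0, 5, 3, 0],
    [0, 8, 0, 0],
    [0, 1, 7, 0],
    [0, 7, 1, 0],
    [2, 0, 0, 6],
    [0, 0, 8, 0],
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2(probability / background) weights."""
    pwm = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        pwm.append({
            base: math.log2((count + pseudocount) / total / background)
            for base, count in zip(BASES, row)
        })
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(col[seq[start + j]] for j, col in enumerate(pwm)))
        for start in range(len(seq) - width + 1)
    ]


def matches(seq, pwm, threshold):
    """All windows scoring at or above the threshold."""
    return [(start, score) for start, score in scan(seq, pwm) if score >= threshold]
```

In practice the threshold would be set from the distribution of scores on background or shuffled sequences, so that only scores above that distribution are kept.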

probabilistic algorithms
de novo motif finding
simplest: from the set of sequences, find the n bp motif with the highest information content (greedy approach)
look for similarities to motif in the other sequences; usually require that every instance of the motif has at least one common site with the first motif
sometimes useful to allow one and only one match of the motif per sequence, to minimize bad matches and overweighting by long sequences

probabilistic algorithms
de novo motif finding
Gibbs Sampler: get best motif from a set of sequences
1) select random short subsequences from the set, call these the patterns
2) choose another short subsequence at random. Its score is p(generated by the patterns)/p(generated by background); add high scoring subsequences to the pattern
By starting with a known pattern specification this can not only find other instances of the pattern but can also improve the pattern specification
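A compact Gibbs sampler in this spirit is sketched below. It follows the common one-site-per-sequence formulation, which is an assumption and not necessarily the exact variant on the slide: one sequence is held out in turn, a profile is built from the current patterns in the other sequences, and a new site in the held-out sequence is drawn with probability proportional to p(generated by the patterns)/p(generated by background). The function names, the uniform background, and the pseudocount of 1 are illustrative choices. Sequences are assumed to contain only A, C, G and T.

```python
import random

BASES = "ACGT"


def build_profile(sites, pseudocount=1.0):
    """Per-position base probabilities for equal-length sites."""
    profile = []
    for column in zip(*sites):
        counts = {base: pseudocount for base in BASES}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in BASES})
    return profile


def likelihood_ratio(word, profile, background):
    """p(word | patterns) / p(word | background)."""
    ratio = 1.0
    for column, base in zip(profile, word):
        ratio *= column[base] / background[base]
    return ratio


def gibbs_motif_search(seqs, width, sweeps=100, seed=0):
    """Return one sampled motif start per sequence after the given sweeps."""
    rng = random.Random(seed)
    background = {base: 0.25 for base in BASES}
    # Step 1: random initial patterns, one per sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Profile from every sequence except the held-out one.
            patterns = [
                s[p:p + width]
                for k, (s, p) in enumerate(zip(seqs, starts))
                if k != i
            ]
            profile = build_profile(patterns)
            # Step 2: score each candidate site, then sample in proportion.
            weights = [
                likelihood_ratio(seq[p:p + width], profile, background)
                for p in range(len(seq) - width + 1)
            ]
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because sites are sampled, not chosen greedily, different seeds can give different answers; running several seeds and keeping the alignment with the highest information content is a common safeguard.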

extensions to Gibbs sampling
- explicitly account for AT% of DNA from the organism
- consider both strands of DNA
- mask motifs that have been found so that other motifs can be uncovered
- use specific models to look for dyad sites and palindromes
- add random jumps to avoid local maxima
- add structure-based constraints
- account for motif families
- allow gapped motifs
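Of these extensions, considering both strands is the simplest to show. The sketch below is a hypothetical helper using exact string matching, not a profile, to keep it short: it searches for a motif and its reverse complement and reports hits in forward-strand coordinates. A palindromic motif equals its own reverse complement, so it is reported once.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(seq, motif):
    """Exact matches of motif on either strand as (start, strand) pairs.

    start is always given on the forward strand; "-" means the reverse
    complement of the motif was found at that position.
    """
    rc_motif = reverse_complement(motif)
    width = len(motif)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc_motif:
            hits.append((start, "-"))
    return hits
```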

testing motifs
- use a test set, if available
- ChIP in another cell type or another organism
- reporter assay
- phylogenetic comparisons
- gene set analysis

reporter assay

phylogeny

gene set/pathway analysis
looking for enhancers!
- no well-defined location
- no well-defined binding sites

resources
JASPAR (free) and TRANSFAC (licensed) databases
Both are collections of experimentally validated transcription factor binding sites and PWMs.

MEME
http://meme.nbcr.net/meme/
older, well-used program, now part of a suite of motif finding tools
uses Expectation Maximization (Multiple EM for Motif Elicitation)

input: promoter sequences for all yeast genes (~6000)
>chr1.fa.33249.33449
TTAATGCTTTTGATAAAATGTATATAAAGGCTGTCGTAATGTGCAGTAGTAAGGACCTGA
CTGTGTTTGTGGTTCTCTTCATTCTTGAACCTTGTCATTGGTAAAAGACCATCGTCAAGA
TATTTGAAAGTTAATAGACAGTTAACAATAATAACAACAGCAATAAGAATAACAATAAAT
TCATTGAACATATTTCAGAAT
>chr1.fa.34956.35156
TGTTTCTCTTGATATGATAATAGGTGGAAACGTAGAAAAAAAAATCGACATATAAAAGTG
GGGCAGATACTTCGTGTGACAATGGCCAATTCAAGCCCTTTGGGCAGATGTTGCCCTTCT
TCTTTCTTAAAAAGTCTTAGTACGATTGACCAAGTCAGAAAAAAAAAAAAAAAGGAACTA
AAAAAAGTTTTAATTAATTAT
>chr1.fa.36310.36510
AATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTTT
ATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTTT
TTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAAG
GACCCTTTAGAAGTATTGAAT
>chr1.fa.37265.37465
TTTTTTATATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTACAACTTTT
TCAATATCAAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGAT
GCATGTCGTTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAAC
GGCCACACAAAGTAGTACTAT

ChIPMunk
May 2012 release
optimized for lots of very long sequences; searches for the motif with the highest information content, then aligns to motifs with high information content and constructs a PWM from that alignment.
A series of PWMs is tested within ChIP-seq peaks, taking peak shape into consideration.

other algorithms
main variations
- add background k-mer frequency in genome
- add negative set information
- for training set, weight the sequences
- account for related motifs from other TF family members
- rely on known signals (TRANSFAC & JASPAR)
- look for low probability clusters of signals
- look for repeated clusters in a set of sequences

useful sites
http://www.gene-regulation.com/pub/programs.html (older programs)
http://molbiol-tools.ca/transcriptional_factors.htm (newer programs & databases)
http://alggen.lsi.upc.edu/recerca/menu_recerca.html