ChIP November 21, 2017
functional signals: is DNA enough? what is the smallest number of letters used by a written language?
DNA is only one part of the functional genome. DNA is heavily bound by proteins in any cell: nucleosomes, transcription factors, transcription suppressors, scaffolding. Proteins can bind to specific DNA sequences; some proteins have fairly nonspecific binding (e.g. nucleosomes).
chromatin immunoprecipitation (ChIP) workflow: 1) crosslink DNA and proteins 2) shear or digest DNA into fragments 3) use tagged antibody to isolate protein of interest 4) reverse protein-DNA crosslinks 5) sequence DNA and align to a reference genome, or hybridize to a microarray
[Figure: ChIP workflow. Cross-link; sonicate or digest; add antibody (or no antibody); immunoprecipitate, reverse cross-links, purify DNA. Total input is also shown.]
after aligning to a reference genome
pathognomonic ChIP peak shape [Figure labels: forward strand sequenced from this side only; reverse strand sequenced from this side only]
after aligning to a reference genome: IGV, with plus strand in pink, minus strand in blue
characteristic fragments Kharchenko et al, Nat Biotech 2008
MACS (Model-based Analysis of ChIP-Seq) two issues in peak calling: 1) resolution (how finely can the peak be defined): ChIP-seq tags only come from the ends of a fragment, so the exact position of a bound protein must be inferred to a resolution smaller than the fragment size. 2) detection above background noise: because of sequencing biases, chromatin structure, copy number variation, and mapping biases, the baseline isn't flat.
MACS (Model-based Analysis of ChIP-Seq): assume there is no strand bias (tags are not more likely to come from one strand than the other), then sample 1000 regions where there is more than mfold enrichment relative to a random distribution, and look at Watson vs. Crick peak positions.
MACS (Model-based Analysis of ChIP-Seq) cross-correlation of signals from the two strands is highest when the shift distance matches the size of the binding site
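The strand cross-correlation idea can be sketched in a few lines. This is a minimal illustration, not the MACS or Kharchenko et al. implementation: it uses an unnormalized cross-correlation (the dot product of the plus-strand tag profile with the shifted minus-strand profile), and the function name, the simulated binding sites, the 150 bp offset and the jitter are all assumptions made up for the example.

```python
import random
from collections import Counter

def strand_cross_correlation(plus_pos, minus_pos, max_shift=300):
    """Return (best_shift, scores) for 5' tag positions on the two strands.

    scores[s] is the unnormalized cross-correlation at shift s:
    sum over x of plus_count[x] * minus_count[x + s].
    """
    plus = Counter(plus_pos)
    minus = Counter(minus_pos)
    scores = []
    for shift in range(max_shift + 1):
        scores.append(sum(n * minus.get(x + shift, 0) for x, n in plus.items()))
    best_shift = max(range(max_shift + 1), key=scores.__getitem__)
    return best_shift, scores

# Simulated data (hypothetical): binding sites every 1 kb, plus-strand tags
# about 75 bp upstream and minus-strand tags about 75 bp downstream of each site.
rng = random.Random(0)
centers = range(1_000, 29_000, 1_000)
plus_pos = [c - 75 + rng.randint(-10, 10) for c in centers for _ in range(40)]
minus_pos = [c + 75 + rng.randint(-10, 10) for c in centers for _ in range(40)]

best_shift, scores = strand_cross_correlation(plus_pos, minus_pos)
print("shift with the highest cross-correlation:", best_shift)
```

With real data the tag positions would come from the aligned reads, and a normalized correlation coefficient is often preferred so that profiles from different samples are comparable.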
MACS (Model-based Analysis of ChIP-Seq) options for removing background noise: 1) use a Poisson distribution ($\lambda_{bg}$) to define a cutoff number of tags 2) use the total input to estimate local background. MACS uses a dynamic Poisson parameter, $\lambda_{local}$, defined separately for each candidate peak as: $\lambda_{local} = \max(\lambda_{bg}, [\lambda_{1k},] \lambda_{5k}, \lambda_{10k})$, where $\lambda_{1k}$, $\lambda_{5k}$, $\lambda_{10k}$ are $\lambda$ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample.
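A small sketch of the dynamic Poisson parameter may help. It is an illustration under stated assumptions, not MACS code: the function names, the `use_1k` switch for the bracketed 1 kb term, the example counts, and the choice to scale each window count to a peak-sized region are all assumptions made for the example.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the first k terms of the pmf."""
    if k <= 0:
        return 1.0
    cdf = 0.0
    term = math.exp(-lam)          # P(X = 0)
    for i in range(k):
        cdf += term                # add P(X = i)
        term *= lam / (i + 1)      # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

def local_lambda(lambda_bg, control_counts, peak_width, use_1k=False):
    """lambda_local = max(lambda_bg, [lambda_1k,] lambda_5k, lambda_10k).

    control_counts maps window size (bp) to the number of control tags in
    that window centered on the candidate peak (hypothetical input format).
    Each count is scaled to the expected number of tags in peak_width bp.
    """
    candidates = [lambda_bg]
    windows = [5_000, 10_000] + ([1_000] if use_1k else [])
    for w in windows:
        candidates.append(control_counts[w] * peak_width / w)
    return max(candidates)

# Hypothetical candidate peak: 200 bp wide, 12 ChIP tags.
control = {1_000: 12, 5_000: 30, 10_000: 40}
lam = local_lambda(lambda_bg=2.0, control_counts=control, peak_width=200, use_1k=True)
p_value = poisson_sf(12, lam)
print(lam, p_value)
```

Taking the maximum means a locally noisy control region raises the bar for calling a peak, while a quiet region never lowers it below the genome-wide background.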
[Figure: $\lambda_{1k}$, $\lambda_{5k}$, $\lambda_{10k}$]
common ChIP problems: 1) the size range of the binding phenomenon is unknown (e.g. some repressive histone marks can occupy many kb of DNA) 2) sequencing depth in control and IP samples influences peak finding
histoneHMM: expression data indicate that the huge repressive peak is real!
histoneHMM classifies data into four states: modified in both samples; unmodified in both samples; only sample A is modified; only sample B is modified. The read counts are presumed to come from a mixture of background & signal. What are the observed and hidden states?
what next? after finding ChIP peaks: look for motifs, to figure out the binding site; correlate binding with structural or functional assays (gene expression, chromatin conformation); use ChIP peaks for different marks to profile genes
Meta-clustering identifies combinatorial subprofiles for chromatin marks.
viewing and describing motifs: PWM (position weight matrix)
aligned sites: ACCGCTG AGCGCTG TCCGCAG TCCCGTG ACCGCTG AGCGCTG AGCGCTG TCCGCAG
pos.  A  C  G  T
0     5  0  0  3
1     0  5  3  0
2     0  8  0  0
3     0  1  7  0
4     0  7  1  0
5     2  0  0  6
6     0  0  8  0
consensus sequence: ACCGCTG
viewing and describing motifs [Figure: nucleotide probability (0 to 1) at each position (1 to 7) for the count matrix above; consensus sequence: ACCGCTG]
viewing and describing motifs [Figure "test_sequences": information content (0 to 2) at each position (1 to 7) for the same count matrix; consensus sequence: ACCGCTG]
viewing and describing motifs: seqLogo [Figure "test_sequences": information content (0 to 2) at each position (1 to 7)] Information content: measure of tolerance to substitutions. IC of 2 means only one nucleotide is allowed at that position. IC of 0 means that all nucleotides occur with equal frequency at that position.
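seqLogo information content for position w in the motif, where J is the size of the alphabet for the motif (4 for DNA, 20 for protein): $IC(w) = \log_2(J) - \mathrm{entropy}(w)$, with $\mathrm{entropy}(w) = -\sum_{b} p_b(w) \log_2 p_b(w)$ summed over the letters b observed at position w. AAAAAAAAAAA has zero entropy.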
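The count matrix, consensus and per-position information content can be recomputed from the eight aligned sites on the PWM slide. This is a minimal sketch using plain Shannon entropy with no pseudocounts or small-sample correction; the function and variable names are illustrative and not taken from seqLogo.

```python
import math
from collections import Counter

# The eight aligned sites from the PWM slide.
SITES = ["ACCGCTG", "AGCGCTG", "TCCGCAG", "TCCCGTG",
         "ACCGCTG", "AGCGCTG", "AGCGCTG", "TCCGCAG"]

def count_matrix(sites):
    """One Counter of letter counts per alignment column."""
    return [Counter(column) for column in zip(*sites)]

def information_content(counts, alphabet_size=4):
    """IC = log2(J) - entropy, with entropy computed from observed frequencies."""
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values() if n)
    return math.log2(alphabet_size) - entropy

counts = count_matrix(SITES)
consensus = "".join(c.most_common(1)[0][0] for c in counts)
ic = [information_content(c) for c in counts]

for pos, (c, bits) in enumerate(zip(counts, ic)):
    print(pos, [c[b] for b in "ACGT"], round(bits, 3))
print("consensus:", consensus)
```

Positions where all eight sites agree reach the 2-bit maximum for DNA, while the mixed positions fall between 0 and 2.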
short side trip into entropy: a measure of how close to uniform the distribution is (~unpredictability). Like variance in a way, but not the same thing. Entropy of random DNA (Wootton and Federhen definition): ACAGGTTTCT AAAAAAAAAA
Entropy: most useful when calculated in windows ACTGACTGATCGACGTACGTACGTACGTACGT AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Entropy: computing in windows is critical to assessing the landscape ACTGACTGAAAAACGTACGTATTTCCCGTACGT
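A sliding-window entropy profile for the example sequence above can be sketched as follows. Note the assumption: this uses plain Shannon entropy of the base frequencies in each window, not the Wootton and Federhen complexity measure, and the window size of 8 is an arbitrary choice for illustration.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (bits) of the letter frequencies in seq."""
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(seq).values())

def windowed_entropy(seq, window=8, step=1):
    """Entropy of each window of the given size, moving by step."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

seq = "ACTGACTGAAAAACGTACGTATTTCCCGTACGT"
profile = windowed_entropy(seq)
for start, h in enumerate(profile):
    print(start, seq[start:start + 8], round(h, 2))
```

The whole-sequence entropy hides the low-complexity stretch; the per-window values drop where the run of A's sits and recover afterwards.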
motif finding workflow: 1) get a bunch of sequences (predicted or experimental) 2) find candidate motifs, de novo or starting from a known motif (word-based algorithms, probabilistic algorithms) 3) test whether motifs are functional (chromatin binding assay, reporter assay, phylogeny, gene set analysis)
sequence sources. predicted: transcription factors typically bind within 1 kb of a promoter, so search in those intervals for motifs; genes in a pathway are often regulated by the same transcription factors, so their upstream sequences may have the same motifs. experimental: bind a transcription factor to DNA and digest all DNA that is not bound (DNA footprinting); collect sequences near or bound by particular proteins (ChIP-seq). as a control, include sequences not known to be bound by the factor or not known to have the effect.
word-based algorithms. simplest approach: assume that the binding site is n bp. Count occurrences of all n-bp sequences in the dataset and compare to the expected distribution:
word:  AAAAAA  AAAAAC  AAAAAG  ...  CGCCCT  CGCCGA  ...  TTTTTT
obs:   20      50      41           98      104          9
exp:   85      84      84           72      72           85
The expected distribution is based on GC/AT content of the sequences. Calculate a z-score for the observed frequency of a motif. If it is significantly overrepresented, look at all 1-base edits: CGCCCT: AGCCCT, GGCCCT, TGCCCT... are these motifs overrepresented, as a group?
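The word-counting approach can be sketched as below. Several things here are assumptions for illustration: the expected count treats positions as independent draws from the GC/AT content, the standard deviation uses a binomial approximation that ignores overlapping words, input is uppercase A/C/G/T only, and the test data are simulated random sequences with CGCCCT planted once in each.

```python
import math
import random
from collections import Counter

def kmer_zscores(sequences, k=6):
    """z-score of each observed k-mer against a GC/AT-content expectation."""
    total = sum(len(s) for s in sequences)
    gc = sum(s.count("G") + s.count("C") for s in sequences) / total
    freq = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    observed = Counter()
    windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            observed[seq[i:i + k]] += 1
            windows += 1
    z = {}
    for kmer, obs in observed.items():
        p = math.prod(freq[b] for b in kmer)      # chance of this word at one position
        expected = windows * p
        sd = math.sqrt(windows * p * (1 - p))     # binomial approximation
        z[kmer] = (obs - expected) / sd
    return z

def one_base_edits(kmer):
    """All words that differ from kmer at exactly one position."""
    return [kmer[:i] + b + kmer[i + 1:]
            for i in range(len(kmer)) for b in "ACGT" if b != kmer[i]]

# Simulated input (hypothetical): 50 random 100-bp sequences, each with one planted CGCCCT.
rng = random.Random(1)
motif = "CGCCCT"
sequences = []
for _ in range(50):
    bases = [rng.choice("ACGT") for _ in range(100)]
    pos = rng.randrange(100 - len(motif) + 1)
    bases[pos:pos + len(motif)] = motif
    sequences.append("".join(bases))

z = kmer_zscores(sequences)
top = max(z, key=z.get)
print(top, round(z[top], 1))
print(one_base_edits(top)[:3])
```

To ask whether the 1-base edits are overrepresented as a group, one option is to sum their observed and expected counts and compute a single z-score for the pooled set.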
PWM-based algorithms: use publicly available position weight matrices and look for 1) the range of alignment scores, then scores above that distribution 2) multiple matches in one sequence
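Scanning a sequence with a PWM can be sketched as follows. The matrix here is built from the eight example sites on the earlier PWM slide rather than downloaded from a database, and the pseudocount of 0.5, the uniform background of 0.25, the threshold of 8 and the test sequence are all assumptions chosen for illustration. Only the forward strand is scanned.

```python
import math

SITES = ["ACCGCTG", "AGCGCTG", "TCCGCAG", "TCCCGTG",
         "ACCGCTG", "AGCGCTG", "AGCGCTG", "TCCGCAG"]

def log_odds_pwm(sites, pseudocount=0.5, background=0.25):
    """Per-position log2 odds of each base versus a uniform background."""
    pwm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pwm.append({b: math.log2((column.count(b) + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm, threshold):
    """Return (start, word, score) for every window scoring at or above threshold."""
    w = len(pwm)
    hits = []
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        score = sum(col[b] for col, b in zip(pwm, word))
        if score >= threshold:
            hits.append((i, word, score))
    return hits

pwm = log_odds_pwm(SITES)
test_seq = "TTTTACCGCTGTTTTAGCGCTGTT"   # hypothetical sequence with two matches
hits = scan(test_seq, pwm, threshold=8.0)
for start, word, score in hits:
    print(start, word, round(score, 2))
```

In practice the threshold would come from the score distribution, for example from scanning shuffled or control sequences, and finding several hits in one sequence is itself informative.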
probabilistic algorithms: de novo motif finding. simplest (greedy approach): from the set of sequences, find the n-bp motif with the highest information content; look for similarities to the motif in the other sequences, usually requiring that every instance of the motif has at least one common site with the first motif. It is sometimes useful to allow one and only one match of the motif per sequence, to minimize bad matches and overweighting by long sequences.
probabilistic algorithms: de novo motif finding. Gibbs sampler: get the best motif from a set of sequences. 1) select random short subsequences from the set; call these the patterns 2) choose another short subsequence at random; its score is p(generated by the patterns)/p(generated by background) 3) add high-scoring subsequences to the pattern. By starting with a known pattern specification, this can not only find other instances of the pattern but also improve the pattern specification.
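A compact Gibbs sampler along these lines is sketched below. It follows the common leave-one-out site sampler (one site per sequence, each site resampled in proportion to the pattern/background ratio), which is a slightly more specific scheme than the steps on the slide. The function name, pseudocount, iteration count, seed and toy sequences are assumptions for illustration, and input is assumed to be uppercase A/C/G/T.

```python
import random
from collections import Counter

def gibbs_motif_search(seqs, w, iterations=200, seed=0, pseudocount=1.0):
    """Return one sampled motif start per sequence after the given number of sweeps."""
    rng = random.Random(seed)
    # 1) start from one random window per sequence: the current "patterns"
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    base_counts = Counter("".join(seqs))
    total = sum(base_counts.values())
    background = {b: (base_counts[b] + 1) / (total + 4) for b in "ACGT"}
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the pattern profile from every sequence except sequence i.
            others = [s[p:p + w] for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            denom = len(others) + 4 * pseudocount
            profile = []
            for column in zip(*others):
                counts = Counter(column)
                profile.append({b: (counts[b] + pseudocount) / denom for b in "ACGT"})
            # 2) score each window: p(window | patterns) / p(window | background)
            weights = []
            for p in range(len(seq) - w + 1):
                ratio = 1.0
                for col, b in zip(profile, seq[p:p + w]):
                    ratio *= col[b] / background[b]
                weights.append(ratio)
            # 3) resample the site in sequence i in proportion to its score
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy input (hypothetical): short sequences that each contain ACCGCTG or a close variant.
seqs = ["TTGATACCGCTGAATC", "GGAGCGCTGTTTACAA", "CATTTCCGCAGGATAT",
        "AATCCCGTGCATTAGA", "TACCGCTGAAGTCTTA"]
starts = gibbs_motif_search(seqs, w=7)
for s, p in zip(seqs, starts):
    print(p, s[p:p + 7])
```

Because the sampler is stochastic, different seeds can settle on different sites; running several restarts and keeping the alignment with the highest information content is a common safeguard.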
extensions to Gibbs sampling: explicitly account for AT% of DNA from the organism; consider both strands of DNA; mask motifs that have been found so that other motifs can be uncovered; use specific models to look for dyad sites and palindromes; add random jumps to avoid local maxima; add structure-based constraints; account for motif families; allow gapped motifs
testing motifs: use a test set, if available; ChIP in another cell type or another organism; reporter assay; phylogenetic comparisons; gene set analysis
reporter assay
phylogeny
gene set/pathway analysis. looking for enhancers! no well-defined location, no well-defined binding sites
resources: JASPAR (free) and TRANSFAC (licensed) databases. Both are collections of experimentally validated transcription factor binding sites and PWMs.
MEME (Multiple EM for Motif Elicitation) http://meme.nbcr.net/meme/ an older, well-used program, now part of a suite of motif finding tools; uses Expectation Maximization
input: promoter sequences for all yeast genes (~6000)
>chr1.fa.33249.33449
TTAATGCTTTTGATAAAATGTATATAAAGGCTGTCGTAATGTGCAGTAGTAAGGACCTGA
CTGTGTTTGTGGTTCTCTTCATTCTTGAACCTTGTCATTGGTAAAAGACCATCGTCAAGA
TATTTGAAAGTTAATAGACAGTTAACAATAATAACAACAGCAATAAGAATAACAATAAAT
TCATTGAACATATTTCAGAAT
>chr1.fa.34956.35156
TGTTTCTCTTGATATGATAATAGGTGGAAACGTAGAAAAAAAAATCGACATATAAAAGTG
GGGCAGATACTTCGTGTGACAATGGCCAATTCAAGCCCTTTGGGCAGATGTTGCCCTTCT
TCTTTCTTAAAAAGTCTTAGTACGATTGACCAAGTCAGAAAAAAAAAAAAAAAGGAACTA
AAAAAAGTTTTAATTAATTAT
>chr1.fa.36310.36510
AATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTTT
ATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTTT
TTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAAG
GACCCTTTAGAAGTATTGAAT
>chr1.fa.37265.37465
TTTTTTATATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTACAACTTTT
TCAATATCAAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGAT
GCATGTCGTTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAAC
GGCCACACAAAGTAGTACTAT
ChIPMunk (May 2012 release): optimized for lots of very long sequences. Searches for the motif with the highest information content, then aligns to motifs with high information content and constructs a PWM from that alignment. A series of PWMs is tested within ChIP-seq peaks, taking peak shape into consideration.
other algorithms, main variations: add background k-mer frequency in the genome; add negative set information; for the training set, weight the sequences; account for related motifs from other TF family members; rely on known signals (TRANSFAC & JASPAR); look for low-probability clusters of signals; look for repeated clusters in a set of sequences
useful sites: http://www.gene-regulation.com/pub/programs.html (older programs); http://molbiol-tools.ca/transcriptional_factors.htm (newer programs & databases); http://alggen.lsi.upc.edu/recerca/menu_recerca.html