Genome 541, Unit 4, Lecture 3: Genomics assays

1 Genome 541, Unit 4, Lecture 3: Genomics assays

2
- I'd like a bit more background on the assays and bioterminology.
- The phantom peak concept was confusing. I didn't quite understand what the phantom peak is.
- I liked the interactive part of the convex section. Is each row in the 2nd derivative of multiple variables a partial derivative? Need more explanation on generic form of quadratic equation.
- The examples on convexity were a little basic. It would be interesting to do a nontrivial example. Maybe with regularization?

3 Translate reads to the inferred center of the sequencing fragment
[Figure: correlation between strands as a function of strand shift]
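A minimal sketch of the strand-shift idea, assuming per-base read-start counts are already available for each strand. The function names (`pearson`, `strand_cross_correlation`) and the toy data are illustrative, not taken from any particular tool. The shift that maximizes the correlation between the two strands estimates the fragment length, and moving each strand's reads by half that shift toward the other strand places them at the inferred fragment center.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts with minus-strand counts moved upstream
    by each candidate shift in 0..max_shift."""
    scores = {}
    for shift in range(max_shift + 1):
        scores[shift] = pearson(plus[:len(plus) - shift], minus[shift:])
    return scores


# Toy data (assumed): two binding sites, minus-strand reads 7 bp downstream.
plus = [0] * 50
minus = [0] * 50
for site, depth in [(10, 5), (30, 3)]:
    plus[site] += depth
    minus[site + 7] += depth

scores = strand_cross_correlation(plus, minus, max_shift=20)
best_shift = max(scores, key=scores.get)
print(best_shift)  # the strand shift with the highest correlation
```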

4 The phantom peak results from mappability islands
[Figure: unmappable and mappable regions; the phantom peak appears at a strand shift equal to the read length]
ChIP-seq measure of quality: relative strand correlation (RSC)

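5 Multivariate derivatives
First derivative of a function of multiple variables (the gradient): $\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$
Second derivative (the Hessian): $\nabla^2 f(x)$, the matrix whose $(i, j)$ entry is the second partial derivative $\frac{\partial^2 f}{\partial x_i \partial x_j}$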

6 Regularization
L2 regularization: $f'(x) = f(x) + \|x\|^2$

7 This lecture
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Applications of genomics assays

8 ChIP-exo has better spatial resolution than ChIP-seq

9 DamID measures TF binding through a fusion protein
- Dam+TF fusion protein
- Measure methylation at GATCs
DamID vs. ChIP-seq:
- DamID can be easier
- ChIP requires a (specific) antibody
- DamID requires a fusion protein
- DamID can't query post-translational modifications (histone mods)
- ChIP has better spatial resolution
- ChIP is limited by cross-linking bias
- DamID is limited by GATC content and Dam reactivity
- ChIP has better temporal resolution: Dam acts over ~24 hours

10 DNase-seq and ATAC-seq measure DNA accessibility

11 High-depth DNase-seq (DNase-DGF) measures TF binding

12 Paired-end DNase and ATAC-seq measure nucleosome architecture

15 DNase-seq and ATAC-seq measure DNA accessibility
Raj and McVicker. Nature Methods 2014

16 High-depth DNase-seq (DNase digital genomic footprinting, DGF) measures bp-level TF binding
Neph et al. Nature 2012

17 Paired-end DNase (DNase-FLASH) and ATAC-seq measure nucleosome architecture
Vierstra et al. Nature Methods 2014

18 Converting genomics data sets into signal tracks
- Extend reads according to fragment length
- Sum the extended reads at each position
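The extend-and-sum step can be sketched as follows. This assumes each read is given as a hypothetical `(position, strand)` pair where `position` is the 5' end of the read; the function name `signal_track` and the input layout are illustrative choices, not a specific file format or tool.

```python
def signal_track(reads, fragment_length, chrom_length):
    """Extend each read to the full fragment length and count, per base,
    how many extended fragments cover that base."""
    track = [0] * chrom_length
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_length          # extend rightward
        else:
            start, end = pos - fragment_length + 1, pos + 1  # extend leftward
        for i in range(max(start, 0), min(end, chrom_length)):
            track[i] += 1
    return track


# One plus-strand read starting at 2 and one minus-strand read ending at 6.
reads = [(2, "+"), (6, "-")]
track = signal_track(reads, fragment_length=4, chrom_length=10)
print(track)
```

A difference array with a running sum would avoid the inner loop on real data; the direct loop is kept here for readability.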

19 Accounting for biases
Expected signal: average of the control experiment in (for example) a 1kb window.
Two representations of signal data:
- Fold enrichment: observed / expected
- Poisson p-value: $-\log_{10}\left( \mathrm{Pois}(\text{observed} \mid \text{expected}) \,/\, \mathrm{Pois}(\text{observed} \mid \text{observed}) \right)$
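A small sketch of the two signal representations, reading the Poisson formula on the slide as a ratio of Poisson probabilities of the observed count under the expected rate and under the observed rate. That reading, the function names, and the requirement that `observed` be a non-negative integer and `expected` be positive are assumptions.

```python
import math


def log_poisson_pmf(k, lam):
    """Natural log of the Poisson probability of count k at rate lam."""
    if lam == 0:
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fold_enrichment(observed, expected):
    """Observed signal divided by expected (control) signal; expected > 0."""
    return observed / expected


def poisson_score(observed, expected):
    """-log10 of Pois(observed | expected) / Pois(observed | observed)."""
    log_ratio = (log_poisson_pmf(observed, expected)
                 - log_poisson_pmf(observed, observed))
    return -log_ratio / math.log(10)


print(fold_enrichment(10, 2))
print(poisson_score(10, 2))
```

The score is zero when observed equals expected and grows as the observed count moves away from the expected one.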

20 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Applications of genomics assays

21 Problem: Predict TF binding at motifs from DNase/ATAC-seq data

22 Several methods have been developed to address this problem
- FIMO+prior: Use prior based on DNase for motif scanning.
- CENTIPEDE: Logistic regression.
- Deep neural network.
- PIQ: Protein Interaction Quantitation. Generative model.

23 Bound factors leave identifiable DNase-seq profiles
Individual binding site prediction is difficult.
[Figure: DNase-seq profile at an individual CTCF site vs. the aggregate over CTCF sites. Source: MIT]

24 Idea: Learn a characteristic profile for each TF
[Figure: CTCF motif (bound?) and DNase read counts; binding effect window = 400bp]

25 Idea: Nearby factors contribute to profile
[Figure: CTCF and Oct4 motifs (bound?) along genomic position, with DNase read counts; binding effect window = 400bp]

26 Overdispersion model: Poisson over normal
Latent signal strength ($\mu$): $\mu_i \sim \mathcal{N}(\mu_0, \sigma^2)$
DNase read counts ($x$): $x_i \sim \mathrm{Pois}(\exp(\mu_i))$

28 Binding model
[Figure: graphical model for a CTCF motif. Nodes: binding indicator ($I$), binding effect ($\beta_{\mathrm{CTCF}}$), background accessibility, signal strength ($\mu$), DNase read counts ($x$). Binding effect window = 400bp]

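29 Full model
[Figure: graphical model for CTCF and Oct4 motifs. Nodes: binding indicator ($I$), binding effect ($\beta_{\mathrm{CTCF}}$), latent signal strength ($\mu$), DNase read counts ($x$)]
$\hat{\mu}_i = \mu_i + \sum_{m=1}^{M} I_m T_m(i)$
$P(X_{1 \ldots N} \mid \mu_{1 \ldots N}, I_{1 \ldots M}) = \prod_{i=1}^{N} \mathrm{Pois}(X_i \mid \exp(\hat{\mu}_i))$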
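To make the full model concrete, here is a hedged sketch that adds each bound motif's effect profile to the latent signal and then scores the read counts with a Poisson log-likelihood. It assumes the per-motif term in the adjusted signal is a fixed profile centered on the motif; the names `adjusted_signal`, `log_likelihood`, `motif_centers`, and `profiles` are hypothetical, and this is not the PIQ implementation.

```python
import math


def adjusted_signal(mu, motif_centers, bound, profiles):
    """Return mu_hat: mu plus the profile of every bound motif.

    profiles[m] is a list of effects centered on motif_centers[m]."""
    mu_hat = list(mu)
    for center, is_bound, profile in zip(motif_centers, bound, profiles):
        if not is_bound:
            continue
        half = len(profile) // 2
        for offset, effect in enumerate(profile):
            i = center - half + offset
            if 0 <= i < len(mu_hat):
                mu_hat[i] += effect
    return mu_hat


def log_likelihood(counts, mu_hat):
    """Sum over positions of log Pois(x_i | exp(mu_hat_i))."""
    total = 0.0
    for x, m in zip(counts, mu_hat):
        # log Pois(x | lam) with lam = exp(m), so log(lam) = m
        total += x * m - math.exp(m) - math.lgamma(x + 1)
    return total


mu = [0.0] * 5
counts = [1, 2, 3, 2, 1]
profile = [[0.5, 1.0, 0.5]]
ll_bound = log_likelihood(counts, adjusted_signal(mu, [2], [1], profile))
ll_unbound = log_likelihood(counts, adjusted_signal(mu, [2], [0], profile))
print(ll_bound > ll_unbound)  # peaked counts favor the bound configuration
```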

30 Problem: How do we learn the parameters of PIQ?

31 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Applications of genomics assays

32 Gradient descent
Gradient descent update step: $x^{(k+1)} \leftarrow x^{(k)} - t \nabla f(x^{(k)})$
$t$: Learning rate. Must be chosen.
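A minimal sketch of the update step on a toy two-variable quadratic. The objective, the starting point, and the learning rate 0.1 are illustrative assumptions. Each step moves against the gradient, and each coordinate is updated from its own partial derivative.

```python
def gradient_descent(grad, x0, t, steps):
    """Repeat x <- x - t * grad(x) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x


def grad_f(x):
    # Gradient of the toy objective f(x) = (x0 - 1)^2 + 2 * (x1 + 2)^2
    return [2 * (x[0] - 1), 4 * (x[1] + 2)]


x_min = gradient_descent(grad_f, [0.0, 0.0], t=0.1, steps=200)
print(x_min)  # approaches the minimizer (1, -2)
```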

33 Convergence properties of gradient descent
- Gradient descent is guaranteed to converge if $t$ is low enough.
- The value of $t$ depends on the curvature of the function.
If $f(x)$ satisfies $\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2$ for all $x$ and $y$, then gradient descent with $t \le 1/L$ is guaranteed to converge.
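As a quick illustration of the step-size condition, take the toy function f(x) = x^2, whose gradient 2x has Lipschitz constant L = 2. The function, the starting point, and the two step sizes are assumptions chosen for the demonstration. A step size below 1/L shrinks x toward the minimum at 0, while a step size well above it makes the iterates grow.

```python
def run_gradient_descent(t, steps=50, x=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x (L = 2)."""
    for _ in range(steps):
        x = x - t * 2 * x
    return x


converged = run_gradient_descent(t=0.4)   # t <= 1/L = 0.5
diverged = run_gradient_descent(t=1.1)    # t far above 1/L
print(abs(converged), abs(diverged))
```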

34 Pros and cons of gradient descent
Pros:
- Very simple and easy to implement.
- Requires computing only first derivative.
- Updates to each variable can be computed independently.
Cons:
- Requires tuning learning rate.
- Slower than other methods.

35 Variant: Backtracking line search
Pros:
- Converges for any initial learning rate.
- Sometimes faster than gradient descent.
Cons:
- More complicated; updates are no longer independent.
Source: Stanford EE364a
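One common form of backtracking line search uses a sufficient-decrease (Armijo) test: start each iteration from a large trial step and shrink it until the function value drops enough. The sketch below is one-dimensional and uses assumed constants (`alpha=0.3`, `beta=0.5`, initial step 10); it is an illustration of the idea rather than the exact variant shown in lecture.

```python
def backtracking_descent(f, grad, x, t0=10.0, alpha=0.3, beta=0.5, steps=50):
    """Gradient descent where each step size is found by backtracking."""
    for _ in range(steps):
        g = grad(x)
        t = t0
        # Shrink t until the step gives a sufficient decrease in f.
        # The lower bound on t guards against an endless loop from rounding.
        while f(x - t * g) > f(x) - alpha * t * g * g and t > 1e-12:
            t *= beta
        x = x - t * g
    return x


def f(x):
    return (x - 3) ** 2


def grad_f(x):
    return 2 * (x - 3)


x_min = backtracking_descent(f, grad_f, x=0.0)
print(x_min)  # approaches 3 even though the initial step of 10 is far too large
```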

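36 Aside: Optimizing quadratic functions
$f(x) = \frac{1}{2} x^T P x + q^T x + r$, where $P$ is positive semi-definite.
$\frac{df}{dx} = Px + q$
$\frac{df}{dx} = 0 \implies x = -P^{-1} q$ (when $P$ is invertible)
Optimum can be computed in closed-form.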
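A worked two-variable instance of the closed-form optimum, assuming the quadratic has the form (1/2) x^T P x + q^T x with an invertible symmetric P. The matrix, the vector, and the helper names are made up for illustration. The minimizer solves the linear system P x = -q.

```python
def solve_2x2(P, rhs):
    """Solve P x = rhs for a 2x2 matrix P using Cramer's rule."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("P is singular; no unique optimum")
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def quadratic_minimizer(P, q):
    """Minimizer of 0.5 * x^T P x + q^T x: the solution of P x = -q."""
    return solve_2x2(P, [-q[0], -q[1]])


P = [[2.0, 1.0], [1.0, 3.0]]   # symmetric positive definite
q = [-1.0, 2.0]
x_star = quadratic_minimizer(P, q)
gradient = [P[0][0] * x_star[0] + P[0][1] * x_star[1] + q[0],
            P[1][0] * x_star[0] + P[1][1] * x_star[1] + q[1]]
print(x_star, gradient)  # the gradient P x + q vanishes at the optimum
```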

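37 Newton's method
Idea: minimize second-order Taylor expansion of $f$, where $x'$ is the step away from $x$:
$f(x + x') \approx \frac{1}{2} x'^T \nabla^2 f(x) x' + \nabla f(x)^T x' + f(x)$
Newton's method update step: $x^{(k+1)} \leftarrow x^{(k)} - (\nabla^2 f(x^{(k)}))^{-1} \nabla f(x^{(k)})$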
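In one dimension the Newton update divides the first derivative by the second. The sketch below applies it to an assumed toy function, f(x) = exp(x) - 2x, whose minimizer is log(2). The function and the starting point are illustrative choices.

```python
import math


def newton(grad, hess, x, steps=10):
    """One-dimensional Newton's method: x <- x - f'(x) / f''(x)."""
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x


def grad_f(x):
    return math.exp(x) - 2   # first derivative of exp(x) - 2x


def hess_f(x):
    return math.exp(x)       # second derivative of exp(x) - 2x


x_min = newton(grad_f, hess_f, x=0.0)
print(x_min, math.log(2))
```

Near the optimum the error is roughly squared at every step, which is why ten iterations are more than enough here.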

38 Pros and cons of Newton's method
Pros:
- Extremely fast.
- Guaranteed to converge (no hyperparameters).
Cons:
- Requires second derivative.
- Updates to each variable are not independent.

39 Variant: Coordinate descent
Update each variable (or subset of variables) separately (using any method).
Works best when each subset has closed-form updates.
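A small sketch of coordinate descent with closed-form updates, using an assumed toy objective f(x, y) = x^2 + y^2 + xy - 3x. Setting each partial derivative to zero with the other variable held fixed gives the two update rules in the loop.

```python
def coordinate_descent(sweeps=30):
    """Alternate exact minimization over x and over y."""
    x, y = 0.0, 0.0
    for _ in range(sweeps):
        x = (3 - y) / 2   # solves df/dx = 2x + y - 3 = 0 for x
        y = -x / 2        # solves df/dy = 2y + x = 0 for y
    return x, y


x_min, y_min = coordinate_descent()
print(x_min, y_min)  # approaches the joint minimizer (2, -1)
```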

40 Variant: Stochastic gradient descent
Choose a random example (or subset of examples) for each update step.
- Usually much faster than gradient descent.
- Requires decreasing the learning rate with each iteration in order to guarantee convergence.
- Alternatively: Increase batch size with each iteration.
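A hedged sketch of stochastic gradient descent with a decreasing learning rate, on an assumed toy problem: estimating the value w that minimizes the average of 0.5 * (w - d)^2 over a data set, which is the mean of the data. The 1/k schedule, the seed, and the data are illustrative.

```python
import random


def sgd_mean(data, steps, seed=0):
    """SGD on the average of 0.5 * (w - d)^2 with learning rate 1/k."""
    rng = random.Random(seed)
    w = 0.0
    for k in range(1, steps + 1):
        d = rng.choice(data)      # one random example per update
        grad = w - d              # gradient of 0.5 * (w - d)^2
        w -= (1.0 / k) * grad     # learning rate decreases each iteration
    return w


data = [1.0, 2.0, 3.0, 4.0, 5.0]
w = sgd_mean(data, steps=20000)
print(w)  # close to the data mean of 3.0
```

With a fixed learning rate the estimate would keep jittering around the mean; the shrinking rate is what lets it settle.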

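41 Update equations for $x - \sqrt{x}$
Function: $f(x) = x - \sqrt{x}$
Closed-form: $\frac{df(x)}{dx} = 1 - \frac{1}{2} \frac{1}{\sqrt{x}} = 0 \implies x = \frac{1}{4}$
Gradient descent: $x^{(k+1)} \leftarrow x^{(k)} - t \left( 1 - \frac{1}{2\sqrt{x^{(k)}}} \right)$
Newton's method: $\frac{d^2 f(x)}{dx^2} = \frac{1}{4} \frac{1}{x^{3/2}}$, so $x^{(k+1)} \leftarrow x^{(k)} - \frac{1 - \frac{1}{2\sqrt{x^{(k)}}}}{\frac{1}{4 (x^{(k)})^{3/2}}}$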
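Both update rules for f(x) = x - sqrt(x) can be checked numerically against the closed-form optimum x = 1/4. The starting points, the learning rate 0.1, and the iteration counts below are assumptions. Newton's method is started at 0.2 because, for this function, a start too far to the right sends the iterate out of the domain x > 0.

```python
import math


def grad_f(x):
    return 1 - 1 / (2 * math.sqrt(x))   # first derivative of x - sqrt(x)


def hess_f(x):
    return 1 / (4 * x ** 1.5)           # second derivative of x - sqrt(x)


def gradient_descent(x, t=0.1, steps=300):
    for _ in range(steps):
        x = x - t * grad_f(x)
    return x


def newton(x, steps=10):
    for _ in range(steps):
        x = x - grad_f(x) / hess_f(x)
    return x


x_gd = gradient_descent(1.0)
x_newton = newton(0.2)
print(x_gd, x_newton)  # both approach 0.25
```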

42 What are the update equations for x log(x)?
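One way to work this exercise, assuming the natural logarithm, the domain x > 0, a learning rate t, and an iteration index k:

```latex
\begin{align*}
f(x) &= x \log x \\
\frac{df(x)}{dx} &= \log x + 1 = 0 \implies x = e^{-1} \\
\frac{d^2 f(x)}{dx^2} &= \frac{1}{x} \\
\text{Gradient descent:}\quad x^{(k+1)} &\leftarrow x^{(k)} - t \left( \log x^{(k)} + 1 \right) \\
\text{Newton's method:}\quad x^{(k+1)} &\leftarrow x^{(k)} - x^{(k)} \left( \log x^{(k)} + 1 \right) = -x^{(k)} \log x^{(k)}
\end{align*}
```

As a check, substituting x = 1/e into the Newton update returns 1/e, so the closed-form optimum is a fixed point.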

43 Today's class
- Genomics assays
- Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data?
- Convex optimization (without constraints)
- Applications of genomics assays

44 Semi-automated genome annotation algorithms partition and label the genome on the basis of functional genomics tracks
[Figure: H3K36me3, DNase1, and RNA-seq tracks with the resulting annotation]
- HMMSeg: Day et al. Bioinformatics, 2007
- ChromHMM: Ernst, J. and Kellis, M. Nature Biotechnology, 2010
- Segway: Hoffman, M. et al. Nature Methods, 2012

45 Semi-automated genome annotation algorithms use dynamic Bayesian network models
[Figure: dynamic Bayesian network with a hidden random variable (segment label) and observed random variables (DNase1, H3K27me3, RNA-seq)]

46 Semi-automated genome annotation recovers known types of genome elements
[Figure: annotation examples labeled Enhancer, Gene, CTCF]
Hoffman et al. Nucleic Acids Research 2012.

47 Semi-automated genome annotation at coarse resolution discovers chromatin domain types
- QUI: Quiescent domains
- CON: Constitutive heterochromatin
- FAC: Facultative heterochromatin
- SPC: Specific expression domains
- BRD: Broad expression domains
- Regulatory elements
[Figure: domain types shown with H3K9me3 and H3K27me3 tracks]

48 Problem: Can we impute the output of missing experiments?
[Figure: matrix of experiments performed, 316 assay types by 346 cell types. Assay types shown: DnaseSeq, H3K4me3, H3K27me3, H3K36me3, H3K9me3, H3K4me1, H3K27ac, H3K9ac, CTCF, DnaseDgf. Cell types shown: K562, GM12878, H1-hESC, HepG2, HeLa-S3, A549, IMR90, HUVEC, NHEK, H9ES, MCF-7. Sources: ENCODE, Roadmap Epigenomics]

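49 Prediction function
[Figure: cell types $c_t'$ and $c_t$, mark $m_t$, and imputed signal $\bar{o}_{c_t, m_t, p}$]
$f_{c_t'}(o_{c_t', m_t, p} \mid F_{c_t', m_t, p})$
Machine learning predictor: regression trees
$\bar{o}_{c_t, m_t, p} = \frac{1}{|C_{m_t}|} \sum_{c_t' \in C_{m_t}} f_{c_t'}(o_{c_t', m_t, p} \mid F_{c_t', m_t, p})$
Ernst and Kellis. Nature 2015.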

50 Features
- Other mark (same sample) features
- Other samples (same mark) features
[Figure: feature sources for cell types $c_t'$ and $c_t$ and mark $m_t$]
Ernst and Kellis. Nature 2015.

51 Imputed data has high correlation with observed data Ernst and Kellis. Nature 2015.

52 Imputed data recovers promoters and TSSs better than observed data Ernst and Kellis. Nature 2015.

53 Goal: model relationship between H3K27me3 and motif occurrences
- Features: predicted TFBSs around TSSs (TF motifs)
- Target: H3K27me3 level in 3 cell types
- Regression model
- Output: parameters of regression model

54 Epi-MARA: Epigenome-motif activity response analysis
Model terms:
- H3K27me3 signal at promoter p in sample s
- Basal H3K27me3 signal at promoter p
- Number of sites for motif m at promoter p
- Activity of motif m in sample s
- Noise
Model size:
- ~30,000 * 3 data points
- ~30,000 basal level parameters
- 207 * 3 activity parameters
[Figure: promoters (p1, p2, ...), motifs (m1, m2, ...), and samples (s1, s2, s3), with entries such as M(s1, p1)]
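A rough sketch of how the listed terms could combine, assuming a linear model in which the predicted signal at a promoter is its basal level plus the sum, over motifs, of motif count times motif activity in that sample. The additive form, the function name `predict_signal`, and the tiny example arrays are assumptions for illustration, not the published Epi-MARA code.

```python
def predict_signal(basal, motif_counts, activity):
    """Predicted signal[p][s] = basal[p] + sum_m motif_counts[p][m] * activity[m][s].

    basal: one value per promoter
    motif_counts: promoters x motifs
    activity: motifs x samples
    """
    n_motifs = len(activity)
    n_samples = len(activity[0])
    return [
        [basal[p] + sum(motif_counts[p][m] * activity[m][s]
                        for m in range(n_motifs))
         for s in range(n_samples)]
        for p in range(len(basal))
    ]


basal = [1.0, 2.0]                      # 2 promoters
motif_counts = [[1, 0], [2, 1]]         # 2 promoters x 2 motifs
activity = [[0.5, -0.5, 0.0],           # 2 motifs x 3 samples
            [1.0, 0.0, 2.0]]
pred = predict_signal(basal, motif_counts, activity)
print(pred)
```

Fitting would then choose the basal levels and activities that minimize the squared difference between these predictions and the measured signal.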

55 Regression model identifies REST as a regulator of H3K27me3

56 Administrivia
- Homework 7 is up online. Due Friday.
- Homework 8 will be up by the end of the week.
- Next class: Chromatin 3D architecture.
- Please write 1-minute responses.