Genome 541 Gene regulation and epigenomics Lecture 3 Integrative analysis of genomics assays


1 Genome 541 Gene regulation and epigenomics Lecture 3 Integrative analysis of genomics assays

2 Please consider both the forward and reverse strands (i.e. the reverse complement sequence). You do not need to consider background base pairs; this corresponds to using a uniform background distribution.

P(sequence | motif starts at position i, PSFM) = ∏_{j=i}^{i+l} PSFM(sequence_j, j−i) · ∏_{j∉[i,i+l]} background(sequence_j)

Do you have to consider whether there is a plateau at the minimum of a convex function?

Notation: ‖x‖ = √(xᵀx) = √(Σ_i x_i²); ℝ₊₊: positive real numbers.
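As a sketch of the scoring formula above, the following stdlib-only Python scores a motif placement under a PSFM with positions outside the motif scored under a uniform background; the toy PSFM, sequence, and function names are assumptions for illustration, not part of the assignment.

```python
import math

def motif_log_prob(seq, psfm, start, background=None):
    """Log-probability that a motif of length len(psfm) starts at `start`.

    psfm[k][base] gives P(base at motif position k); positions outside the
    motif window are scored under a background distribution (uniform by
    default, matching the uniform-background assumption above)."""
    l = len(psfm)
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    logp = 0.0
    for j, base in enumerate(seq):
        if start <= j < start + l:
            logp += math.log(psfm[j - start][base])
        else:
            logp += math.log(background[base])
    return logp

def revcomp(seq):
    """Reverse complement, so the reverse strand can be scored too."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

# Toy 2-position motif strongly preferring "AC"
psfm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
seq = "GACT"
# Score the forward strand and the reverse complement at the same offset
fwd = motif_log_prob(seq, psfm, 1)
rev = motif_log_prob(revcomp(seq), psfm, 1)
```

Taking the placement (strand, position) with the highest score then answers "where does the motif most plausibly sit?" for this toy example.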

3 This lecture Genomics assays Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data? Convex optimization (without constraints) Applications of genomics assays

4 ChIP-exo has better spatial resolution than ChIP-seq

5 DamID measures TF binding through a fusion protein. Dam+TF fusion protein; measure methylation at GATCs. DamID vs. ChIP-seq: DamID can be easier: ChIP requires a (specific) antibody; DamID requires only a fusion protein. DamID can't query post-translational modifications (histone mods). ChIP has better spatial resolution. ChIP is limited by cross-linking bias; DamID is limited by GATC content and Dam reactivity. ChIP has better temporal resolution: Dam acts over ~24 hours.

6 DNase-seq and ATAC-seq measure DNA accessibility

7 High-depth DNase-seq (DNase-DGF) measures TF binding

8 Paired-end DNase and ATAC-seq measure nucleosome architecture

9 ChIP-exo has better spatial resolution than ChIP-seq

10 Converting genomics data sets into signal tracks Extend according to fragment length Sum

11 Accounting for biases. Expected signal: average of a control experiment in (for example) a 1 kb window. Two representations of signal data: Fold enrichment: observed / expected. Poisson p-value: −log10(Pois(observed | expected) / Pois(observed | observed)).
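A minimal sketch of the two signal representations, computing the Poisson score exactly as written on the slide (a pmf ratio comparing the expected rate against the best-fit rate λ = observed); the function names and example numbers are assumptions.

```python
import math

def pois_pmf(k, lam):
    """Poisson probability mass function, stdlib only."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def signal_representations(observed, expected):
    """Fold enrichment, and the -log10 Poisson score from the slide:
    -log10( Pois(observed | expected) / Pois(observed | observed) )."""
    fold = observed / expected
    score = -math.log10(pois_pmf(observed, expected) / pois_pmf(observed, observed))
    return fold, score

# 20 reads observed where the control predicts 5: enriched 4x
fold, score = signal_representations(observed=20, expected=5.0)
```

When observed equals expected the pmf ratio is 1 and the score is 0, so larger scores indicate stronger enrichment over the control.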

12 Today s class Genomics assays Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data? Convex optimization (without constraints) Applications of genomics assays

13 Problem: Predict TF binding at motifs from DNase/ATAC-seq data

14 Several methods have been developed to address this problem. FIMO+prior: use a prior based on DNase for motif scanning. CENTIPEDE: logistic regression. Deep neural network. PIQ (protein interaction quantitation): generative model.

15 Bound factors leave identifiable DNase-seq profiles. Individual binding site prediction is difficult: the DNase profile at an individual CTCF site is noisy, while the aggregate CTCF profile is clear. MIT

16 Idea: Learn a characteristic profile for each TF CTCF bound? CTCF DNase read counts Binding effect window = 400bp

17 Idea: Nearby factors contribute to profile Genomic position CTCF Oct4 Bound? DNase read counts Binding effect window = 400bp

18 Overdispersion model: Poisson over normal. Latent signal strength (μ); DNase read counts (x). μ_i ~ N(μ_0, σ²); x_i ~ Pois(exp(μ_i))

19 Overdispersion model: Poisson over normal. Latent signal strength (μ); DNase read counts (x). μ_i ~ N(μ_0, σ²); x_i ~ Pois(exp(μ_i))
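As a sanity check on the overdispersion idea, this sketch simulates x_i ~ Pois(exp(μ_i)) with μ_i ~ N(μ_0, σ²) and confirms the counts have variance well above their mean (a plain Poisson would have variance equal to the mean). The parameter values and the Knuth-style sampler are assumptions for illustration.

```python
import math
import random

random.seed(0)

def poisson_sample(lam):
    """Knuth's multiplicative algorithm; fine for the moderate rates here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

mu0, sigma = 1.0, 0.7          # assumed latent mean and spread
draws = []
for _ in range(20000):
    mu_i = random.gauss(mu0, sigma)               # latent signal strength
    draws.append(poisson_sample(math.exp(mu_i)))  # read count

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# The latent normal inflates the variance beyond the Poisson's var == mean.
```

This extra variance is exactly what the model uses to avoid over-calling binding from noisy high-count positions.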

20 Binding model. CTCF. Binding indicator (I); binding effect (β_CTCF); background accessibility; signal strength (μ); DNase read counts (x). Binding effect window = 400 bp.

21 Full model. CTCF, Oct4. Binding indicator (I); binding effect (β_CTCF); latent signal strength (μ); DNase read counts (x). Binding effect window = 400 bp.

μ̂_i = μ_i + Σ_{m=1}^{M} I_m β_{T_m}(i)

P(X_{1…N} | μ_{1…N}, I_{1…M}) = ∏_{i=1}^{N} Pois(X_i | exp(μ̂_i))

22 Problem: How do we learn the parameters of PIQ?

23 Today s class Genomics assays Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data? Convex optimization (without constraints) Applications of genomics assays

24 Gradient descent. Gradient descent update step: x^{(k+1)} ← x^{(k)} − t ∇f(x^{(k)}). t: learning rate. Must be chosen.
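The update step above is a one-liner in code. A minimal sketch (the test function f(x) = (x − 3)² and all names are assumptions):

```python
def gradient_descent(grad, x0, t, steps):
    """Plain gradient descent with a fixed learning rate t."""
    x = x0
    for _ in range(steps):
        x = x - t * grad(x)   # x^(k+1) = x^(k) - t * grad f(x^(k))
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); optimum at x = 3.
xmin = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, t=0.1, steps=100)
```

With t = 0.1 each step multiplies the error by 0.8, so 100 steps drive it far below any practical tolerance; a much larger t would diverge, which is exactly the tuning burden the slide mentions.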

25 Convergence properties of gradient descent. Gradient descent is guaranteed to converge if t is low enough. The value of t depends on the curvature of the function. If f(x) satisfies ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, then gradient descent with t ≤ 1/L is guaranteed to converge.

26 Pros and cons of gradient descent Pros: Very simple and easy to implement. Requires computing only first derivative. Updates to each variable can be computed independently. Cons: Requires tuning learning rate. Slower than other methods.

27 Variant: Backtracking line search. Pros: Converges without hand-tuning a learning rate. Sometimes faster than fixed-step gradient descent. Cons: More complicated; updates are no longer independent. Stanford EE364a
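A sketch of backtracking in one dimension, using the standard Armijo condition with assumed parameters α = 0.3, β = 0.5 (the quadratic test function is also an assumption): each step starts from t = 1 and halves t until the sufficient-decrease test passes.

```python
def backtracking_gd(f, grad, x0, alpha=0.3, beta=0.5, steps=50):
    """Gradient descent where each step size is found by backtracking:
    shrink t by beta until f(x - t*g) <= f(x) - alpha * t * g^2 (1-D)."""
    x = x0
    for _ in range(steps):
        g = grad(x)
        t = 1.0
        while f(x - t * g) > f(x) - alpha * t * g * g:
            t *= beta
        x = x - t * g
    return x

# Minimize f(x) = (x - 2)^2; no learning rate needs to be chosen up front.
xmin = backtracking_gd(lambda x: (x - 2.0) ** 2, lambda x: 2 * (x - 2.0), x0=0.0)
```

The inner while-loop is why the updates are "no longer independent": each coordinate's step size depends on re-evaluating f at trial points.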

28 Aside: Optimizing quadratic functions. f(x) = (1/2) xᵀPx + qᵀx + r, where P is positive semi-definite. df/dx = Px + q; df/dx = 0 ⟹ x* = −P⁻¹q (when P is invertible, i.e. positive definite). Optimum can be computed in closed form.
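The closed form x* = −P⁻¹q can be checked on a tiny example; this sketch hard-codes a 2×2 inverse (the matrix values are assumptions) and verifies the gradient vanishes at the solution.

```python
def solve_2x2(P, q):
    """Minimize f(x) = 0.5 * x^T P x + q^T x for a 2x2 positive-definite P:
    set the gradient Px + q to zero, i.e. x* = -P^{-1} q."""
    (a, b), (c, d) = P
    det = a * d - b * c
    # inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det
    x0 = -(d * q[0] - b * q[1]) / det
    x1 = -(-c * q[0] + a * q[1]) / det
    return [x0, x1]

P = [[2.0, 0.0], [0.0, 4.0]]
q = [-2.0, -8.0]
xstar = solve_2x2(P, q)
# gradient at the optimum: P @ xstar + q should be the zero vector
grad0 = P[0][0] * xstar[0] + P[0][1] * xstar[1] + q[0]
grad1 = P[1][0] * xstar[0] + P[1][1] * xstar[1] + q[1]
```

For large P one would solve the linear system Px = −q directly rather than forming P⁻¹, but the principle is the same.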

29 Newton's method. Idea: minimize the second-order Taylor expansion of f: f̂(x′) = (1/2) x′ᵀ ∇²f(x) x′ + ∇f(x)ᵀ x′ + f(x). Newton's method update step: x^{(k+1)} ← x^{(k)} − (∇²f(x^{(k)}))⁻¹ ∇f(x^{(k)}).

30 Pros and cons of Newton s method Pros: Extremely fast. Guaranteed to converge (no hyperparameters). Cons: Requires second derivative. Updates to each variable are not independent.

31 Variant: Coordinate descent Update each variable (or subset of variables) separately (using any method). Works best when each subset has closed-form updates.
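A sketch of coordinate descent on a coupled quadratic, where each 1-D subproblem has a closed-form update (the slide's "works best" case); the test matrix is an assumption.

```python
def coordinate_descent(P, q, x, sweeps=100):
    """Minimize f(x) = 0.5 x^T P x + q^T x by cycling through coordinates.
    Holding the others fixed, the exact minimizer over coordinate j is
    x_j = -(q_j + sum_{k != j} P[j][k] * x_k) / P[j][j]."""
    n = len(x)
    for _ in range(sweeps):
        for j in range(n):
            s = sum(P[j][k] * x[k] for k in range(n) if k != j)
            x[j] = -(q[j] + s) / P[j][j]
    return x

# A coupled 2-D quadratic; the true minimizer solves Px = -q, i.e. (0.8, 1.4).
P = [[2.0, 1.0], [1.0, 3.0]]
q = [-3.0, -5.0]
x = coordinate_descent(P, q, [0.0, 0.0])
```

Because each inner update is exact, no learning rate appears anywhere; for this positive-definite system the sweeps contract the error geometrically (this is Gauss–Seidel iteration in disguise).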

32 Variant: Stochastic gradient descent Choose a random example (or subset of examples) for each update step. Usually much faster than gradient descent. Requires decreasing the learning rate with each iteration in order to guarantee convergence. Alternatively: Increase batch size with each iteration.
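A tiny sketch of SGD with the decreasing learning rate t_k = 1/(k+1) that the slide calls for; the toy objective (estimating a mean by minimizing expected squared error) and all names are assumptions.

```python
import random

random.seed(1)

def sgd_mean(data, steps=5000):
    """Estimate the mean of `data` by SGD on f(m) = E[(m - x)^2] / 2:
    each step uses one random example and the decreasing rate 1/(k+1)."""
    m = 0.0
    for k in range(steps):
        x = random.choice(data)   # one random example per update step
        t = 1.0 / (k + 1)         # decreasing learning rate
        m -= t * (m - x)          # stochastic gradient of the loss at m
    return m

data = [1.0, 2.0, 3.0, 4.0]
estimate = sgd_mean(data)         # should approach the true mean, 2.5
```

With this particular rate schedule the iterate is exactly the running average of the samples seen so far, which makes the convergence guarantee easy to believe; a constant rate would instead bounce around the optimum forever.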

33 Update equations for x − √x. Function: f(x) = x − √x.
Closed form: df/dx = 1 − 1/(2√x) = 0 ⟹ x* = 1/4.
Gradient descent: x^{(t+1)} ← x^{(t)} − t (1 − 1/(2√x^{(t)})).
Newton's method: d²f/dx² = (1/4) x^{−3/2}, so x^{(t+1)} ← x^{(t)} − (1 − 1/(2√x^{(t)})) / ((1/4) (x^{(t)})^{−3/2}).
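The two update rules can be run side by side; in this sketch (starting points and step counts are assumptions) Newton's method reaches the optimum x* = 1/4 in a handful of iterations while fixed-rate gradient descent needs far more. Note the chosen starting points matter: from x = 1.0 the full Newton step would land at a negative x, outside the domain of √x.

```python
import math

def grad(x):
    return 1 - 0.5 / math.sqrt(x)      # f'(x) for f(x) = x - sqrt(x)

def hess(x):
    return 0.25 * x ** (-1.5)          # f''(x)

x_gd = 1.0
for _ in range(200):
    x_gd -= 0.1 * grad(x_gd)           # gradient descent, fixed rate t = 0.1

x_nt = 0.5                             # start close enough for Newton
for _ in range(10):
    x_nt -= grad(x_nt) / hess(x_nt)    # Newton's method

# Both should approach the closed-form optimum x* = 1/4.
```

This illustrates both slides' pros/cons lists at once: Newton converges quadratically but is sensitive to the starting point, while gradient descent is robust but slow and requires a tuned rate.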

34 What are the update equations for x log(x)?

35 PIQ optimization

ℓ(β) = log P(X | μ, I) = Σ_{i=1}^{N} [log(1/X_i!) + X_i μ̂_i − exp(μ̂_i)], where μ̂_i = μ_i + Σ_{m=1}^{M} I_m β_{T_m}(i)

∂ℓ/∂β_T(j) = Σ_{i=1}^{N} (X_i − exp(μ̂_i)) · ∂μ̂_i/∂β_T(j)

∂μ̂_i/∂β_T(j) = 1 if position i is the j-th position of a motif of T, 0 otherwise.
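A stripped-down sketch of this optimization for a single binding-effect parameter β at one motif: gradient ascent on the Poisson log-likelihood above, where ∂ℓ/∂β sums (X_i − exp(μ̂_i)) over motif positions. The toy counts, background μ, and function names are assumptions; the real PIQ model has many TFs, profiles, and positions.

```python
import math

def fit_binding_effect(counts, mu, in_motif, steps=2000, t=0.01):
    """Gradient ascent on l(beta) = sum_i [x_i * mu_hat_i - exp(mu_hat_i)
    - log(x_i!)] with mu_hat_i = mu_i + beta * [i in motif], so
    dl/dbeta = sum over motif positions of (x_i - exp(mu_hat_i))."""
    beta = 0.0
    for _ in range(steps):
        g = sum(counts[i] - math.exp(mu[i] + beta)
                for i in range(len(counts)) if in_motif[i])
        beta += t * g                   # ascent: maximize the likelihood
    return beta

# Toy track: background rate exp(0) = 1; motif positions have elevated counts.
counts = [1, 1, 8, 9, 7, 1, 1]
mu = [0.0] * 7
in_motif = [False, False, True, True, True, False, False]
beta = fit_binding_effect(counts, mu, in_motif)
```

Here the gradient has a closed-form zero, exp(β) = (8 + 9 + 7) / 3 = 8, so the ascent should settle at β = log 8, which makes the toy easy to verify.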

36 PIQ predicts ChIP-seq peaks accurately

37 Today s class Genomics assays Problem of the day: Can we predict TF binding at motifs based on DNA sensitivity data? Convex optimization (without constraints) Applications of genomics assays

38 Semi-automated genome annotation algorithms partition and label the genome on the basis of functional genomics tracks (H3K36me3, DNaseI, RNA-seq → annotation). HMMSeg: Day et al. Bioinformatics, 2007. ChromHMM: Ernst, J. and Kellis, M. Nature Biotechnology, 2010. Segway: Hoffman, M. et al. Nature Methods, 2012.

39 Semi-automated genome annotation algorithms use dynamic Bayesian network models. Segment label (hidden random variable); DNaseI, H3K27me3, RNA-seq (observed random variables).

40 Semi-automated genome annotation recovers known types of genome elements Enhancer Gene CTCF Hoffman et al. Nucleic Acids Research 2012.

41 Semi-automated genome annotation at coarse resolution discovers chromatin domain types: quiescent domains (QUI); constitutive heterochromatin (CON; H3K9me3); facultative heterochromatin (FAC; H3K27me3); specific expression domains (SPC); broad expression domains (BRD); regulatory elements.

42 Problem: Can we impute the output of missing experiments? A matrix of 316 assay types × 346 cell types, with each entry marked if the experiment was performed. Assays include DnaseSeq, H3K4me3, H3K27me3, H3K36me3, H3K9me3, H3K4me1, H3K27ac, H3K9ac, CTCF, DnaseDgf; cell types include K562, GM12878, H1-hESC, HepG2, HeLa-S3, A549, IMR90, HUVEC, NHEK, H9ES, MCF-7. Data from ENCODE and Roadmap Epigenomics.

43 Prediction function: to impute mark m in cell type c at position p, apply a learned function f to features computed from other marks in the same cell type and from the same mark in other cell types. Machine learning predictor: regression trees. Ernst and Kellis. Nature Biotechnology, 2015.

44 Features: other-mark (same sample) features; other-sample (same mark) features. Ernst and Kellis. Nature Biotechnology, 2015.

45 Imputed data has high correlation with observed data. Ernst and Kellis. Nature Biotechnology, 2015.

46 Imputed data recovers promoters and TSSs better than observed data. Ernst and Kellis. Nature Biotechnology, 2015.

47 Goal: model relationship between H3K27me3 and motif occurrences TF motifs Output: parameters of regression model Regression model Features: predicted TFBSs around TSSs Target: H3K27me3 level in 3 cell types

48 Epi-MARA: epigenome-motif activity response analysis.

E(p, s) = basal(p) + Σ_m N(m, p) · A(m, s) + noise

E(p, s): H3K27me3 signal at promoter p in sample s; basal(p): basal H3K27me3 signal at promoter p; N(m, p): number of m motifs at promoter p; A(m, s): activity of motif m in sample s. ~30,000 × 3 data points; ~30,000 basal-level parameters; 207 × 3 activity parameters.
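A tiny sketch of fitting this kind of model with a single motif: E[p][s] ≈ basal[p] + N[p]·A[s], estimated by alternating closed-form least-squares updates over the basal and activity blocks. The solver choice, names, and toy data (generated from basal = [1, 2], N = [1, 3], A = [0.5, −0.5, 0]) are assumptions, not the Epi-MARA implementation.

```python
def fit_epimara(E, N, sweeps=200):
    """Alternating least squares for E[p][s] ~ basal[p] + N[p] * A[s]
    (one motif for simplicity): each update is the exact minimizer of the
    squared error with the other parameter block held fixed."""
    P, S = len(E), len(E[0])
    basal = [0.0] * P
    A = [0.0] * S
    for _ in range(sweeps):
        for p in range(P):   # best basal given activities: mean residual
            basal[p] = sum(E[p][s] - N[p] * A[s] for s in range(S)) / S
        for s in range(S):   # best activity given basal: 1-D least squares
            num = sum(N[p] * (E[p][s] - basal[p]) for p in range(P))
            den = sum(N[p] ** 2 for p in range(P))
            A[s] = num / den
    return basal, A

N = [1.0, 3.0]
E = [[1.5, 0.5, 1.0], [3.5, 0.5, 2.0]]
basal, A = fit_epimara(E, N)
```

Note the model has a shift degeneracy (adding c to every A[s] while subtracting c·N[p] from each basal[p] leaves the fit unchanged), so in practice activities are typically constrained, e.g. to average zero across samples, as the recovered A does here.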

49 Regression model identifies REST as a regulator of H3K27me3

50 Problem: Different background models result in wildly different false discovery rate estimates.

51 Idea: Control reproducibility of peaks between biological replicates. Irreproducible discovery rate (IDR): expected fraction of peaks that are not reproducible between biological replicates.

52 IDR can handle varying quality levels

53 Administrivia Homework 1 is due Thursday. Homework 2 will be up by the end of the week. Next class: Chromatin 3D architecture. Please write 1-minute responses.