Lecture 5: Regulation

1 Machine Learning in Computational Biology CSC 2431 Lecture 5: Regulation Instructor: Anna Goldenberg

2 Central Dogma of Biology

3 Transcription: DNA → RNA → protein. The process of producing RNA from DNA. Constitutive: happens all the time (example: production of glycolytic enzymes; true of housekeeping genes). Regulated: otherwise (specific to tissue, gene, time).

4 Gene

5 Regulated Transcription. Transcription Factor (TF): a protein that binds to a specific DNA sequence and thereby controls the rate at which DNA is transcribed into RNA. Basal factors = general transcription factors (GTFs).

6 Regulated Transcription basal factors = general transcription factors

7 RNA Pol II

8 Note: RNA editing. A transcript can differ from the underlying genomic sequence due to post-transcriptional editing. Relatively rare. Includes insertions, deletions, and base substitutions in already produced RNA.

9 Transcription termination. In eukaryotes, not fully understood. Cleavage of the new transcript. Polyadenylation: addition of a poly(A) tail to the 3' end (predominant, but not the only way).

10 Questions we can ask. How do we know which TFs will bind where? What is the relation between where a TF binds and its function (effect on expression level)? How do TFs work together? What are all the players that may be involved in transcription? What about chromatin: DNA is packed into chromatin, so how is it opened? How many enhancers/repressors are there? Splicing: which transcripts do we see and why?

11 TF binding. Problem: TFs bind to motifs that are short! Representation: Position Weight Matrices (PWM) or Position Specific Scoring Matrices (PSSM).

13 TF binding. Problem: TFs bind to motifs that are short! Representation: Position Weight Matrices (PWM) or Position Specific Scoring Matrices (PSSM). $\mathrm{Score} = \log \frac{P(S \mid M)}{P(S \mid Bg)}$, where $P(S \mid M)$ is the probability of observing sequence S given the matrix model M (i.e., the product of the nucleotide frequencies in the PSSM at each position of S), and $P(S \mid Bg)$ is the probability of observing sequence S given the background (null) model Bg.
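The log-odds score can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the 4-position matrix `PWM`, the uniform `BACKGROUND`, and the function names `log_odds_score` and `best_hit` are all hypothetical, chosen only to make the formula concrete. Because positions are treated as independent, the log of the product becomes a sum of per-position log ratios.

```python
import math

# Hypothetical 4-position motif: one dict of nucleotide probabilities per
# position (each sums to 1). These numbers are made up for illustration.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed background (null) model: uniform nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_score(seq, pwm=PWM, background=BACKGROUND):
    """Return log(P(seq | M) / P(seq | Bg)) assuming independent positions."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match motif length")
    # Sum of per-position log ratios equals the log of the ratio of products.
    return sum(
        math.log(column[base] / background[base])
        for column, base in zip(pwm, seq)
    )


def best_hit(sequence, pwm=PWM, background=BACKGROUND):
    """Slide the motif along a longer sequence; return (score, offset) of the best window."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (log_odds_score(sequence[i:i + width], pwm, background), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    print(log_odds_score("AGCT"))   # consensus of the toy motif: positive score
    print(log_odds_score("TTTT"))   # poor match: negative score
    print(best_hit("TTAGCTTT"))     # best window and its offset
```

A positive score means the sequence is more likely under the motif model than under the background; a negative score means the reverse. In practice pseudocounts are usually added so that no matrix entry is exactly zero, which this sketch omits.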

14 Problems with PSSM. Because positions are assumed independent, a PSSM assigns non-zero probability to sequences that may never occur as binding sites. Sharon, Lubliner, Segal, PLoS Comput Biol 2008

16 Problems with PSSM. Because positions are assumed independent, a PSSM assigns non-zero probability to sequences that may never occur as binding sites. Fu, Ray, Xing, Bioinformatics 2009
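A toy example makes the independence problem concrete. The sketch below is an illustration under stated assumptions, not taken from the cited papers: the site list `SITES` is invented, and the helper names are hypothetical. The two positions are perfectly correlated in the observed sites (A is always followed by T, G by C), yet the per-column model spreads probability onto the never-observed combinations.

```python
from collections import Counter

# Hypothetical observed binding sites with fully dependent positions:
# A is always followed by T, and G is always followed by C.
SITES = ["AT", "AT", "GC", "GC"]


def column_frequencies(sites):
    """Estimate per-position nucleotide frequencies (no pseudocounts)."""
    width = len(sites[0])
    columns = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        columns.append({base: counts[base] / len(sites) for base in "ACGT"})
    return columns


def pwm_probability(seq, columns):
    """P(seq | M) under the independence assumption: product of column frequencies."""
    p = 1.0
    for column, base in zip(columns, seq):
        p *= column[base]
    return p


def empirical_probability(seq, sites):
    """Fraction of observed sites that are exactly seq."""
    return sites.count(seq) / len(sites)


if __name__ == "__main__":
    columns = column_frequencies(SITES)
    for seq in ["AT", "GC", "AC", "GT"]:
        print(seq, pwm_probability(seq, columns), empirical_probability(seq, SITES))
```

Here the matrix model gives "AC" and "GT" the same probability as the true sites "AT" and "GC", even though neither was ever observed. Models that capture dependencies between positions are one way to address this.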

17 Determining an enhancer Shlyueva et al, Transcriptional enhancers: from properties to genome-wide predictions, 2014

18 Super-enhancers. The term 'super-enhancer' describes groups of putative enhancers in close genomic proximity with unusually high levels of Mediator binding, as measured by ChIP-seq. Conclusion: it is not clear whether super-enhancers are more than the sum of their enhancer parts. Pott and Lieb, What are super-enhancers? Nature Genetics 2015

19 Cusanovich et al., The Functional Consequences of Variation in Transcription Factor Binding. PLOS Genetics 2014. Is the binding site functional? Using TF knockouts, ChIP-seq and DNase data for 56 TFs, it was found that: (1) functional binding is associated with stronger binding motifs and greater levels of factor binding near differentially expressed genes; (2) for 19 of the 56 factors there was a significant enrichment of binding in strong enhancers near differentially expressed genes, i.e. functional binding occurs further from the TSS (not in promoter regions); (3) many of the factors function as repressors at least as often as they function as activators.

20 Note: non-coding RNA. A large portion of the human genome is transcribed into RNAs without known protein-coding functions, far outnumbering coding transcription units. E.g. siRNA, miRNA, tRNA, snoRNA. These can play critical roles in regulating gene expression, development, and disease, acting both as transcriptional activators and repressors. Most recent: eRNAs (transcribed enhancer RNA).

21 Chromatin: asymmetric in the direction of transcription. Clustered Aggregation Tool (CAGT). Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res 2012

22 Alternative Splicing. 95% of multi-exon genes are alternatively spliced (increases diversity). Example: exon skipping (or mutually exclusive exons).

23 Alternative Splicing. A gene can express multiple isoforms simultaneously. Isoforms are expressed at different levels. Usually one isoform dominates for a given condition (at least 30%). S. Djebali et al, Landscape of transcription in human cells, Nature 2012
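As a small worked example of isoform dominance, the sketch below computes each isoform's share of a gene's total expression and picks the largest. It assumes that "at least 30%" refers to the dominant isoform's fraction of total gene expression, which is an interpretation of the slide. The expression values and the names `isoform_fractions` and `dominant_isoform` are hypothetical.

```python
def isoform_fractions(expression):
    """Convert per-isoform expression levels into fractions of the gene total."""
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("total expression must be positive")
    return {name: level / total for name, level in expression.items()}


def dominant_isoform(expression):
    """Return (name, fraction) of the most highly expressed isoform."""
    fractions = isoform_fractions(expression)
    name = max(fractions, key=fractions.get)
    return name, fractions[name]


if __name__ == "__main__":
    # Made-up expression levels (arbitrary units) for one gene in one condition.
    gene = {"isoform_1": 12.0, "isoform_2": 3.0, "isoform_3": 5.0}
    name, fraction = dominant_isoform(gene)
    print(name, fraction)
    # Assumed threshold taken from the slide's "at least 30%".
    print("dominant share >= 30%:", fraction >= 0.30)
```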

24 Splicing code. Machine learning has been used quite successfully to predict whether an exon will be spliced out (Xiong et al, Science 2014).

25 ENCODE: Encyclopedia Of DNA Elements. Novel elements from RNA-seq: 78% coverage of intronic regions; 34% coverage of intergenic regions. Predicted 94,800 exons (19% increase), 69,052 splice junctions (22%), 73,325 transcripts (45%) and 41,204 genes (80% increase); 22,000 novel splice sites from PET and mass spectrometry. Most novel transcripts seem to lack protein-coding capacity. S. Djebali et al, Landscape of transcription in human cells, Nature 2012

26 ENCODE From encodeproject.org website

27 Next class. Papers: (1) Cell-type specific TF binding: Arvey, Aaron, et al. "Sequence and chromatin determinants of cell-type specific transcription factor binding." Genome Research 22.9 (2012): (Te Cheng). (2) Integration of multiple types of regulation data: Cheng, Chao, et al. "Understanding transcriptional regulation by integrative analysis of transcription factor binding data." Genome Research 22.9 (2012): (Marat).