Analysis of Biological Sequences SPH


1 Analysis of Biological Sequences SPH

2 nuts and bolts: meet Tuesdays & Thursdays, 3:30-4:50. no exam; grade derived from 3-4 homework assignments plus a final project (open book, open note, collaborations allowed as long as work is not copied). no single recommended textbook; the website has a few recommendations with guidance for choosing a resource. I will try to keep it updated with upcoming lecture notes, a daily dozen for each lecture, and homework assignments. the daily dozen is just some questions (probably not always 12!) that you don't have to turn in but that you should be able to answer easily after each lecture.

3 Course objectives: describe the algorithms used in estimating function of biological sequences; determine which methods are appropriate for analyzing sequences derived from different experiments; design analysis pipelines that are biologically meaningful and mathematically rigorous

4 concepts covered: algorithms, including HMM, MCMC, dynamic programming, heuristic methods, enrichment of spatial associations; experimental methods: ChIP, RNAseq, bisulfite, RRBS, MBDseq, MeDIP, variant calling, HiC & similar structural methods

5 waaaay back: prebiotic soup/primordial sandwich. early Earth was too hot for stable molecules, but as the atmosphere cooled, molecules formed at random. many hypotheses about what happened next... but eventually molecules appeared that had catalytic capabilities and could replicate themselves.

6 amazing property of nucleotides

7 RNA: single stranded but self-complementary, so complex 3D structures with enzymatic capacity are possible

8 RNA: amino acids were likely also present in the prebiotic soup

9 Next steps: an RNA that gained a permanent function could out-reproduce other RNAs; proteins are much more stable than RNA; proteins are linear arrangements of information (like RNA)

10 RNA encodes proteins

11 the genetic code is a wobbling degenerate

12 protein synthesis (translation): unsurprisingly, protein synthesis involves large RNA/protein complexes (ribosomes)

13 protein synthesis: translation is energetically expensive; highly regulated; ribosomes have proofreading functions; all components are recycled

14 becoming a useful protein

15 and then? RNA and proteins were working well, and there were probably many genetic codes... but the RNA that won was the one that invented a stable version of itself

16 DNA

17 transcription: a usually short-lived RNA copy of the DNA is created through transcription; RNA is exported to the cytoplasm to encode proteins; some types of RNA do not encode proteins

18 transcription: the cell knows where to start! (classical eukaryotic promoter) transcription is expensive and potentially damaging, so it is highly regulated at many levels: signal sequences (activating or repressive); chromatin structure; polymerase control (elongation speed, etc.); cleavage of nascent RNA. biological signals: how do we find these signal sequences in a big sequence?

19 motif finding. Simplest example: look for exact matches to a known motif. Next example: imperfect matches to a known motif. Finally: finding enriched motifs in a pile of sequences. additional questions: conservation throughout evolution, coordinated changes

20 exact match to a known motif: TATA. the TATA box is one of many signals in DNA sequences that mark the location for transcriptional initiation in a large percentage of eukaryotic genes. does my sequence contain a TATA box? ACGCTAGCGCATATAGCATGACTAGTATAGCTAGACGAGCTAGCATATCCGAT

21 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT [Animation: the motif TATA is slid along the sequence, one start position at a time]

22 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT TATA how many comparisons are needed? (hint: 4 comparisons for each position x # positions to be compared)
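The hint on this slide can be checked with a minimal sketch of the naive scan. The function name `naive_search` is made up for illustration, and the sketch assumes every motif position is compared at every start site (no early exit on a mismatch), which is what the "4 comparisons for each position" hint implies.

```python
def naive_search(sequence, motif):
    """Slide the motif along the sequence; return match positions and comparison count."""
    hits = []
    comparisons = 0
    n_starts = len(sequence) - len(motif) + 1  # last start site leaves room for the whole motif
    for start in range(n_starts):
        matched = True
        for offset in range(len(motif)):  # assumption: no early exit, so always len(motif) comparisons
            comparisons += 1
            if sequence[start + offset] != motif[offset]:
                matched = False
        if matched:
            hits.append(start)
    return hits, comparisons


SEQ = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
hits, comparisons = naive_search(SEQ, "TATA")
print(hits, comparisons)  # prints [11] 200 (positions are 0-based)
```

With 53 nucleotides and a 4-letter motif there are 50 start sites, so the scan makes 50 x 4 = 200 comparisons. Stopping at the first mismatch would reduce that number without changing the result.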

23 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT TATA are there ways to speed this up?

24 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT (each T flagged with *) reduce search space by flagging all Ts: how many comparisons are made? find and catalog all 4mers in advance: how do we store that information? lots of options: big text table, hash table, tree structure

25 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT I found a gene!! or did I? p(TATA) = ?
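Both speed-ups can be sketched briefly. This is one possible reading of the slide, not a reference implementation: `search_from_flagged` and `build_kmer_index` are hypothetical names, and a Python dictionary stands in for the hash table option.

```python
from collections import defaultdict


def search_from_flagged(sequence, motif):
    """Only test start sites whose base equals the motif's first base (the flagged Ts)."""
    last_start = len(sequence) - len(motif)
    candidates = [i for i in range(last_start + 1) if sequence[i] == motif[0]]
    hits = [i for i in candidates if sequence[i:i + len(motif)] == motif]
    return candidates, hits


def build_kmer_index(sequence, k):
    """Catalog every k-mer in advance: hash table mapping k-mer -> list of start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


SEQ = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
candidates, hits = search_from_flagged(SEQ, "TATA")
print(len(candidates), hits)

index = build_kmer_index(SEQ, 4)
print(index["TATA"])  # a single dictionary lookup replaces the scan
```

Flagging costs one pass over the sequence and then limits the full motif comparison to the flagged sites. The index costs one pass and extra memory up front, after which any 4mer can be looked up without rescanning.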

26 exact match to a known motif: TATA ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT p(TATA at any site) = p(T)*p(A)*p(T)*p(A) and, assuming that the nucleotides are equally represented (which isn't true), p(TATA) = 0.25^4 ≈ 0.0039. our sequence is 53 nucleotides long, so we have 50 possible start sites. expect 50 * 0.0039 ≈ 0.2 occurrences of TATA. so is our result surprising (do we have a gene)? What if we're working with a genome that is 80% AT?
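The arithmetic on this slide, and the 80% AT question, can be explored with a short sketch. It assumes independent positions, and for the AT-rich case it assumes A and T are each 0.4 and C and G are each 0.1 (the slide only gives the 80% total). The function names are illustrative.

```python
def motif_probability(motif, base_freqs):
    """Probability of the motif at one site, assuming independent positions."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p


def expected_occurrences(seq_length, motif, base_freqs):
    """Expected number of motif occurrences: (number of start sites) x p(motif)."""
    n_starts = seq_length - len(motif) + 1
    return n_starts * motif_probability(motif, base_freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}  # assumption: 80% AT split evenly

print(round(expected_occurrences(53, "TATA", uniform), 3))
print(round(expected_occurrences(53, "TATA", at_rich), 3))
```

Under the uniform model the expectation is about 0.2 occurrences in 53 nucleotides. Under the assumed AT-rich composition it rises to about 1.3, so a single TATA would no longer be surprising.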

27 imperfect motifs Many proteins bind DNA or RNA with less strict sequence preferences. good example: splicing Our understanding is still very incomplete... but a cell knows how to do it!

28 [Figure: gene structure, drawn 5' to 3'. Labels: enhancer; promoter (CAAT, TATA); 5' UTR; exon, intron, exon, intron, exon; polyA signal; 3' UTR]

29 [Figure: same gene structure. Added labels: start point for transcription; start point for translation (AUG); terminator for translation (UGA, UAA, UAG)]

30 [Figure: same gene structure. Added: transcription yields the pre-mRNA]

31 [Figure: same gene structure. Added: transcription yields the pre-mRNA; splicing yields the mRNA]

32 [Figure: same gene structure. Added: transcription yields the pre-mRNA; splicing yields the mRNA; translation]

33 Alternative splicing

34 Splice site and branch site consensus sequences. The problem: consensus 5' and 3' splice site sequences and branch site sequences occur frequently in any genome. what is the probability of finding a GT sequence? More information is necessary to define bona fide exons.

35 Splice site and branch site consensus sequences
UGACAUUACUGUGAGUAAAAC
UUGUUUUCAGGUACAGUAGUC
GCAAGUCAUGGUAAGUCCUCU
GACUUAACAGGUACUAUAUAU
AAAGGAUUAGGUAUGUAUACC
UUCAACACAGGUAACUGACUU
GGGGCUGCAGGUACAGUCAUG
AGUCAUGUCUGUAUCCUUUUG
ACCUUACAGUGUGAUGGGCAG
AGAGGAUGAUGUAAGUAAUGG
AUCAUUCGGGGUGAGUAUUUU
CAAAAUGGGGGUAAGAAGACU
UUCAACAAAGGUAAGACCAUU
CAAAAAUAAGGUGAUUGGCAC
UAUGAAUUAGGUAAGAACUAU
UGCGUAACAGGUGAGGCCCUU
CGAGCAGAAGGUGAGAACUGA
CUGGAGCAAGGUAAUUGUGAG
UAUGAUGAAGGUAAAUCUUUA
CAAACUGGAGGUACUUCAAUU
UCUUUUUAGGGUUUCACUAAG
[Sequence logo legend: A, C, G, U]

36 Splice site and branch site consensus sequences U1 U2 U2AF Cartegni et al interpretation of sequence logos: If the letters occupy the entire vertical space, the height of each letter is the proportion of sequences with that base at that position. If the letters do not occupy the entire vertical space, the height of each letter typically signifies information content.
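Both readings of a logo can be computed from the aligned sites on the previous slide. The sketch below makes two assumptions that the slide does not state: information content is measured against a uniform background (so the maximum is 2 bits per position), and no small-sample correction is applied. The function names are illustrative.

```python
import math
from collections import Counter

# The 21 aligned 5' splice site sequences from the previous slide.
SITES = [
    "UGACAUUACUGUGAGUAAAAC", "UUGUUUUCAGGUACAGUAGUC", "GCAAGUCAUGGUAAGUCCUCU",
    "GACUUAACAGGUACUAUAUAU", "AAAGGAUUAGGUAUGUAUACC", "UUCAACACAGGUAACUGACUU",
    "GGGGCUGCAGGUACAGUCAUG", "AGUCAUGUCUGUAUCCUUUUG", "ACCUUACAGUGUGAUGGGCAG",
    "AGAGGAUGAUGUAAGUAAUGG", "AUCAUUCGGGGUGAGUAUUUU", "CAAAAUGGGGGUAAGAAGACU",
    "UUCAACAAAGGUAAGACCAUU", "CAAAAAUAAGGUGAUUGGCAC", "UAUGAAUUAGGUAAGAACUAU",
    "UGCGUAACAGGUGAGGCCCUU", "CGAGCAGAAGGUGAGAACUGA", "CUGGAGCAAGGUAAUUGUGAG",
    "UAUGAUGAAGGUAAAUCUUUA", "CAAACUGGAGGUACUUCAAUU", "UCUUUUUAGGGUUUCACUAAG",
]


def position_frequencies(sites):
    """Per-position base proportions (the 'letters fill the column' logo reading)."""
    freqs = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        freqs.append({base: counts[base] / len(sites) for base in "ACGU"})
    return freqs


def information_content(column):
    """Bits of information at one position: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


freqs = position_frequencies(SITES)
bits = [information_content(col) for col in freqs]
for pos, (col, b) in enumerate(zip(freqs, bits)):
    consensus = max(col, key=col.get)  # most frequent base at this position
    print(pos, consensus, round(b, 2))
```

In this alignment the GU at 0-based positions 10 and 11 is invariant, so those two columns carry the full 2 bits, while the flanking positions carry less.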

37 splice signals... not much information. how many times would the sequence GT occur in a 3 Gb (3 billion base) genome with 60% AT? how about the sequence CAGGTAAG?

38 so how does the cell know how to splice things? as we'll see, eukaryotic gene prediction is a tricky problem! but cells know where their exons are... what are some approaches that we can take?
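One way to work these two questions is sketched here. The assumptions are not in the slide: positions are independent, only one strand is counted, the genome has N = 3 x 10^9 start sites, and 60% AT means p(A) = p(T) = 0.3 and p(C) = p(G) = 0.2.

```latex
\begin{align*}
p(\mathrm{GT}) &= p(G)\,p(T) = 0.2 \times 0.3 = 0.06 \\
E[\#\mathrm{GT}] &\approx N \, p(\mathrm{GT}) = 3 \times 10^{9} \times 0.06 = 1.8 \times 10^{8} \\[4pt]
p(\mathrm{CAGGTAAG}) &= p(C)\,p(G)^{3}\,p(A)^{3}\,p(T) = 0.2^{4} \times 0.3^{4} \approx 1.3 \times 10^{-5} \\
E[\#\mathrm{CAGGTAAG}] &\approx 3 \times 10^{9} \times 1.3 \times 10^{-5} \approx 3.9 \times 10^{4}
\end{align*}
```

Under these assumptions even the 8-base consensus is expected tens of thousands of times by chance, which is the sense in which the splice signals carry little information.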

39 finding signals in eukaryotic genes: additional tools and approaches evolutionary conservation: take advantage of history open reading frames (we know what a protein-coding gene should look like!)

40 evolutionary conservation: where are mutations tolerated? this doesn't look like a random coincidence...

41 evolutionary conservation probability of seeing a mutation is the product of the probability of mutation occurrence (mutation rate) and the probability of retaining the mutation (selection) surrogate for biological importance?

42 evolutionary conservation to think about this in a meaningful way we need metrics. How do you define surprisingly conserved?

43 open reading frames if you are looking for protein-coding genes, you also want to look for open reading frames.

44 open reading frames: There are 64 codons and 3 of them are stop codons. If the codons are equally likely to appear, how long would you expect an ORF to be in random sequence? frequency of stop codon = 3/64, so the expected probability of a stop codon at any position in random sequence is p = 3/64

45 open reading frames: frequency of stop codon = 3/64; expected probability of stop codon in random sequence p = 3/64. the length of ORFs in random sequence follows a geometric distribution (the negative binomial with r = 1), with mean 1/p (about 21.3 codons, or 64 nucleotides, counting the stop codon)
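The 1/p figure can be checked by simulation. The sketch below is illustrative only: it assumes the four bases are equally likely and independent, uses DNA letters for the stop codons, and counts the stop codon itself in the length. The function names and the trial count are made up.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_until_stop(rng):
    """Draw random codons until a stop appears; return the count, stop codon included."""
    count = 0
    while True:
        count += 1
        codon = "".join(rng.choice("ACGT") for _ in range(3))
        if codon in STOP_CODONS:
            return count


def mean_random_orf_length(n_trials=20000, seed=0):
    """Average number of codons to the first stop over many random reading frames."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(codons_until_stop(rng) for _ in range(n_trials)) / n_trials


simulated = mean_random_orf_length()
print(round(simulated, 1), "codons simulated vs", round(64 / 3, 1), "expected")
```

An ORF much longer than about 21 codons is therefore unlikely in random sequence, which is why long ORFs are a useful signal for protein-coding genes.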

46 put the signals together to make a gene:

               intron              exon                     intergenic
ORF            mean length 20aa    spans exon               mean length 20aa
splice site    random occurrence   at boundaries + random   random occurrence
conservation   low                 high                     low

this is, of course, an immense oversimplification and ignores lots of biological entities (pseudogenes, conserved noncoding regions etc). Also, these are noisy signals!

47 highly conserved locus

48 finding signals in DNA: lots of data available for various organisms, to infer conservation, binding sites, function; look for known motifs, known sequence signals; mine experimental data for overrepresented sequences and motifs; laboratory approaches for exploration (e.g. mutagenesis) and confirmation; gene finding algorithms may assess as many data modalities as possible, with weighting

49 First generation sequencing; Pairwise sequence alignment; Dot plots; Needleman-Wunsch and Smith-Waterman; BLAST, gapped BLAST; Phylogeny; Multiple sequence alignment; Next (and nextnext) generation sequencing; Short read and not-so-short read alignment; Hidden Markov Models; ChIPseq; variant calling; Gene expression: approaches and statistics; Functional analysis in genome space; Alignment-free sequence comparison; Metagenomics; Visualizing big data: Circos, Hive plots