Genome 373: Gene Prediction I. Doug Fowler


1 Genome 373: Gene Prediction I, Doug Fowler

2 Outline: Review of gene structure. Scale of the problem. Solutions: empirical methods, ab initio prediction.

3 What is a gene? A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions. Pearson H (May 2006). "Genetics: what is a gene?". Nature 441 (7092):

4 Simple structure of a gene. [Figure labels: promoter, transcriptional start site, 5' untranslated region, ATG, open reading frame, TAA, 3' untranslated region, transcriptional termination site] What are we missing here?

5 Introns! [Figure: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3] Introns: interrupt coding sequence; contain splice donor/acceptor motifs; are removed by splicing from the final message.

6 Gene Structure Varies by Organism. Eubacteria, archaea, and single-celled fungi have simple genes: no or few introns (and these are short); transcription start and stop sites relatively well-defined in primary sequence. Plants and animals have complex genes: many introns (and these can be long); transcription start/stop sites are poorly defined (hard to find) in primary sequence; intron/exon boundaries are poorly defined in primary sequence.

7 A Sample from S. cerevisiae (yeast): 10 kb region. No introns, genes tightly packed. (There actually are introns in yeast, but very few, and none for these particular genes.)

8 A Sample from C. elegans (worm) 10 kb region Note many introns, variable length.

9 A Sample from Homo sapiens: 200 kb region. A single gene, CFTR (defects cause cystic fibrosis). Large number of long introns, highly variable length (note different scale from previous slide).

10 Outline: Review of gene structure. Scale of the problem. Solutions: empirical methods, ab initio prediction.

11 How Big are Genomes? Columns: Organism, Genome size (Mb), # of Genes. Epstein-Barr virus (EBV): 0.17. Yeast: 12.5. Worm. Human: 3,300. Guesses about the number of genes?

12 How Big are Genomes? Organism Genome size (Mb) # of Genes Epstein-Barr virus (EBV) Yeast ,770 Worm ,733 Human 3,300 21,000

13 What About Introns? Organism Genome size (Mb) # of Genes Epstein-Barr virus (EBV) Yeast ,770 Worm ,733 Human 3,300 21,000 Humans have many introns, relative to most other organisms. Deutsch M, Long M. Intron-exon structures of eukaryotic model organisms. Nucleic Acids Res Aug 1;27(15): PMCID: PMC148551

14 What About Introns? Organism Genome size (Mb) # of Genes Epstein-Barr virus (EBV) Yeast ,770 Worm ,733 Human 3,300 21,000 Human introns are much longer than other organisms' introns. Deutsch M, Long M. Intron-exon structures of eukaryotic model organisms. Nucleic Acids Res Aug 1;27(15): PMCID: PMC148551

15 Why such diversity in intron number/size?

16 Why such diversity in intron number/size? Related to how much selection there is for a compact genome. Strong selection = few introns, short introns. Fraction of resources expended on DNA replication and maintenance. Generally, rapid growth = compact genome. The extreme is in viruses, where there are essentially no intergenic spacers (some genes even overlap each other). Selection for compact genomes in major groups: viruses > bacteria = archaea > fungi > nematodes > insects = plants > chordates (as a rule of thumb).

17 Why such diversity in intron number/size? Related to how much selection there is for a compact genome. Strong selection = few introns, short introns. Fraction of resources expended on DNA replication and maintenance. Generally, rapid growth = compact genome. The extreme is in viruses, where there are essentially no intergenic spacers (some genes even overlap each other). So, genes are complex structures with common features (e.g. promoters, exons, introns) whose characteristics vary across species.

18 And Remember, We're Sequencing New Genomes Rapidly

19 When We Assemble A New Genome, Finding the Genes is a Key Problem. [Figure: sequencing data, then genome assembly, giving Chr 1, Chr 2, Chr 3] Assembling a genome is just the beginning...

20 When We Assemble A New Genome, Finding the Genes is a Key Problem. [Figure: sequencing data, then genome assembly (a genome: Chr 1, Chr 2, Chr 3), then annotation (locations of genes, promoters, etc.)] ...because if we want to do much with it we have to annotate it, identifying the location of functional elements including genes.

21 When We Assemble A New Genome, Finding the Genes is a Key Problem. This is part of the original human genome paper. You can see six chromosomes. Each little annotation at the bottom is a gene.

22 When We Assemble A New Genome, Finding the Genes is a Key Problem. So, genome annotation generally, and gene finding specifically, is a big problem!

23 Outline: Review of gene structure. Scale of the problem. Solutions: empirical methods, ab initio prediction.

24 The Goal of Gene Finding: to create a model for every gene in a genome. [Figure: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3] A gene model describes the structure of a gene, comprising the locations in DNA sequence of all key gene features (introns and exons, transcription start, etc.).
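A gene model of this kind can be written down as a small data structure. The sketch below is only an illustration: the class name GeneModel, its field names, and the example coordinates are hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    """Hypothetical container for the key features of one gene."""
    tss: int                      # transcription start site
    tts: int                      # transcription termination site
    exons: List[Tuple[int, int]]  # sorted (start, end) pairs, half-open

    def introns(self) -> List[Tuple[int, int]]:
        """Introns are the gaps between consecutive exons."""
        return [(prev_end, next_start)
                for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])]

    def spliced_length(self) -> int:
        """Length of the message once the introns are removed."""
        return sum(end - start for start, end in self.exons)


# Made-up coordinates for a three-exon gene.
model = GeneModel(tss=100, tts=1000, exons=[(100, 250), (400, 520), (800, 1000)])
print(model.introns())
print(model.spliced_length())
```

Storing only the exons and deriving the introns keeps the two from ever disagreeing, which is one reasonable design choice among several.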

25 How Would You Find Genes?

26 Experimental Methods for Finding Genes. You can sequence the mRNA to figure out where genes must be.

27 Experimental Methods for Finding Genes. You can sequence the mRNA to figure out where genes must be. Start by capturing mRNA using the poly(A) tail.

28 Experimental Methods for Finding Genes. You can sequence the mRNA to figure out where genes must be. Fragment, then synthesize the cDNA strand with reverse transcriptase.

29 Experimental Methods for Finding Genes. You can sequence the mRNA to figure out where genes must be. Synthesize the second strand with DNA polymerase, then sequence.

30 Experimental Methods for Finding Genes. You can sequence the mRNA to figure out where genes must be. Synthesize the second strand with DNA polymerase, then sequence. The resulting short sequences are sometimes called Expressed Sequence Tags (ESTs). These tell you where genes are by inspection.

31 Homology-Based Gene Finding. With EST-based gene models in hand, you can go looking for homologous sequences in other organisms.

32 Homology-Based Gene Finding. With EST-based gene models in hand, you can go looking for homologous sequences in other organisms. In principle, this should solve the problem...

33 Problems With Empirical Approaches. The empirical approach has issues, though. Can anyone think of some?

34 Problems With Empirical Approaches. The empirical approach has issues, though. Can anyone think of some? Scale is the first big one: there are massive numbers of organisms (especially bacteria) that we want to annotate. We can't even culture them all, and even if we could, we couldn't afford to sequence all their mRNA.

35 EST count from a small region on C. elegans chr. IV (not all shown). The second is the large dynamic range of mRNA concentration (> 10^6).

36 EST count from a small region on C. elegans chr. IV (not all shown). Even the power of high-throughput DNA sequencing cannot solve this problem.

37 EST count from a small region on C. elegans chr. IV (not all shown). This means that we will miss many low-abundance mRNAs.
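A back-of-the-envelope calculation shows why low-abundance mRNAs get missed. The sketch assumes, as a simplification not stated on the slides, that each sequenced read is an independent draw from the mRNA pool, so a transcript making up a fraction f of the pool is missed entirely with probability (1 - f)^N after N reads. The function name and the read depth are illustrative choices.

```python
def prob_missed(fraction: float, n_reads: int) -> float:
    """Probability that a transcript making up `fraction` of the mRNA pool
    receives zero reads among n_reads independent draws."""
    return (1.0 - fraction) ** n_reads


n_reads = 1_000_000  # illustrative sequencing depth, not a figure from the slides
for fraction in (1e-4, 1e-6, 1e-7):
    print(f"fraction {fraction:.0e}: P(missed) = {prob_missed(fraction, n_reads):.3f}")
```

Under these assumptions a transcript at one part in a million is missed more than a third of the time even with a million reads, and one at one part in ten million is missed about nine times out of ten.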

38 Problems With Empirical Approaches. Finally, homology-based approaches suffer because sequences diverge. Any diverged gene will be missed. You'll never find new/unexpressed genes that way (a big problem in multicellular organisms: think development)!

39 Computational Methods for Finding Genes (e.g. Gene Prediction). We cannot experimentally determine the location of all genes, but we can predict them!
TGAATCAAGTTAGAAGTTATGGAGCATAATAACATGT
GGATGGCCAGTGGTCGGTTGCTACACCCCTGCCGCAA
CGTTGAAGGTCCCGGATTAGACTGGCTGGATCTATGC
CGTGACACCCGTTATACTCCATTACCGTCTGTGGGTC
ACAGCTTGTTGTGGACTGGATTGCCATTCTCTCAGTG
TATTACGCAGGCCGGCGCACGGGTCCCATATAAACCT
GTCATAGCTTACCTGACTCTACTTGGAAATGTGGCTA
GGCCTTTGCCCACGCACCTGATCGGTCCTCGTTTGCT
TTTTAGGACCGGATGAACTACAGAGCATTGCAAGAAT
CTCTACCTGCTTTACAAAGTGCTGGATCCTATTCCAG
CGGGATGTTTTATCTAAACACGATGAGAGGAGTATTC
GTCAGGCCACATGGCTTTCTTGTTCTGGTCGGATCCA
TCGTTGGCGCCCGACCCCCCCATTCCATAGTGAGTTC
TTCGTCCGAGCCATTGTATGCCAGATCGACAGACAGA
TAGCGGATCCAGTATATCCCTGGAAACTATAGACGCA
CAGGTTGGAATCTTAAGTGAAGTCGCGCGTCCAAACC
CAGCTCTATTTTAGTGGTCATGGGTTCTGGTCCCCCC
GAGCCGCGGAACCGATTAGGACCATGTACAACAATAC
TTATTAGTCATCTTTTAGACACAATCTCCCTGCTCAG
TGGTATATGGTTTTTGCTATAATTAGCCACCCTCATA
AGTTGCACTACTTCTGCGACCCAAATGCACCCTTACC
ACGAAGACAGGATTGTCCGATCCTATATTACGACTTT

40 Computational Methods for Finding Genes (e.g. Gene Prediction). Given a sequence, we want to be able to predict the major features of genes in the sequence (e.g. create gene models). [Figure: the raw sequence from slide 39, followed by the annotated version below; feature labels on the slide: Start, Exon 1, Intron 1, Exon 2, Stop]
TGAATCAAGTTAGAAGTTATGGAGCATAATAACATGT
GGATGGCCAGTGGTCGGTTGCTACACCCCTGCCGCAA
CGTTGAAGGTCCCGGATTATGCTGGCTGGATCTATGC
CGTGACACCCGTTATACTCCATTACCGTCTGTGGGTC
ACAGCTTGTTGTGGACTGGATTGCCATTCTCTCAGTG
TATTACGCAGGCCGGCGCACGGGTCCCATATAAACCT
GTCATAGCTTACCTGACTCTACTTGGAAATGTGGCTA
GGCCTTTGCCCACGCACCTGATCGGTCCTCGTTTGCT
TTTTAGGACCGGATGAACTACAGAGCATTGCAAGAAT
CTCTACCTGCTTTACAAAGTGCTGGATCCTATTCCAG
CGGGATGTTTTATCTAAACACGATAGAGGGAGTATTC
GTCAGGCCACATGGCTTTCTTGTTCTGGTCGGATCCA
TCGTTGGCGCCCGACCCCCCCATTCCATAGTGAGTTC
TTCGTCCGAGCCATTGTATGCCAGATCGACAGACAGA
TAGCGGATCCAGTATATCCCTGGAAACTATAGACGCA
CAGGTTGGAATCTTAAGTGAAGTCGCGCGTCCAAACC
CAGCTCTATTTTAGTGGTCATGGGTTCTGGTCCCCCC
GAGCCGCGGAACCGATTAGGACCATGTACAACAATAC
TTATTAGTCATCTTTTAGACACAATCTCCCTGCTCAG
TGGTATATGGTTTTTGCTATAATTAGCCACCCTCATA
AGTTGCACTACTTCTGCGACCCAAATGCACCCTTACC
ACGAAGACAGGATTGTCCGATCCTATATTACGACTTT

41 Sites We Need to Predict: translation start, translation stop, splice donor site, splice acceptor site.

42 Ab Initio Gene Prediction. Here, we define sequence features of real genes based on experimental evidence.

43 Ab Initio Gene Prediction. Here, we define sequence features of real genes based on experimental evidence: open reading frame model; splice donor sequence model; splice acceptor sequence model; intron/exon length distribution; requirement that introns maintain the reading frame. Then, we use these sequence features to obtain the best interpretation of where genes are in any region from sequence alone. Ab initio = from first principles.
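The reading-frame requirement is easy to state as a check on a candidate gene model: once the introns are removed, the coding sequence should start with ATG, have a length divisible by three, and contain no stop codon before its last codon. The sketch below is a minimal illustration; the function names, the toy sequence, and the 0-based half-open exon coordinates are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(seq, exons):
    """Concatenate coding exon segments given as 0-based, half-open (start, end)."""
    return "".join(seq[start:end] for start, end in exons)


def frame_is_maintained(seq, exons):
    """True if the spliced coding sequence starts with ATG, has a length
    divisible by three, and its only in-frame stop codon is the last codon."""
    cds = splice(seq, exons)
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return codons[-1] in STOP_CODONS and not any(c in STOP_CODONS for c in codons[:-1])


# Toy sequence (made up): exon 1 = ATGGC, intron = GTAAGTTTCAG, exon 2 = CTAA
toy = "ATGGC" + "GTAAGTTTCAG" + "CTAA"
print(frame_is_maintained(toy, [(0, 5), (16, 20)]))  # correct acceptor position
print(frame_is_maintained(toy, [(0, 5), (15, 20)]))  # acceptor shifted by one base
```

Shifting a single splice boundary by one base breaks the frame, which is why this constraint is useful for ruling out candidate introns.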

44 Example #1: Open Reading Frames. What sequence features should ORFs have? Starts with ATG. Ends with a stop codon (TGA/TAA/TAG). There will be many short sequences that fit this bill, so we can take advantage of one other fact: the probability of not having a stop codon in a particular reading frame decays rapidly with increasing length.
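Both ideas can be sketched in a few lines. Under the simplifying assumption of random sequence with equal, independent base frequencies (an assumption made here, not a claim from the slides), 3 of the 64 codons are stops, so the chance that n consecutive codons contain no stop is (61/64)^n. The function names are hypothetical, and the scan covers the forward strand only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def prob_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop codon,
    assuming uniform, independent bases."""
    return (61 / 64) ** n_codons


def find_orfs(seq: str, min_codons: int = 0):
    """Return (start, end) of forward-strand ORFs: an ATG up to the first
    in-frame stop. Coordinates are 0-based, half-open, and include the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


for n in (10, 100, 300):
    print(n, f"{prob_no_stop(n):.2e}")
print(find_orfs("CCATGAAATTTTAGCC"))
```

With these assumptions an open frame of 100 codons arises by chance less than 1% of the time, so a minimum-length filter such as `min_codons` removes most spurious short ORFs.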

45 Example #2: Splice Donors and Acceptors. Splice donor and acceptor sites have characteristic sequences. [Figure: donor and acceptor sequence motifs]
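One common way to turn a characteristic sequence into a "strength of match" is a position weight matrix scored as log-odds against background. The sketch below is illustrative only: the four-position donor matrix and its frequencies are invented, apart from reflecting the nearly invariant GT at the start of an intron, and the background assumes equal base frequencies.

```python
import math

# Hypothetical base frequencies for a 4-position donor motif. Values are invented
# for illustration; positions 1-2 reflect the nearly invariant GT at the intron start.
DONOR_PWM = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]
BACKGROUND = 0.25  # assumes equal base frequencies


def log_odds(site: str, pwm=DONOR_PWM) -> float:
    """Sum of per-position log2(motif frequency / background frequency)."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, site))


def scan(seq: str, pwm=DONOR_PWM):
    """Score every window of the motif's width; return (position, score) pairs."""
    width = len(pwm)
    return [(i, log_odds(seq[i:i + width], pwm)) for i in range(len(seq) - width + 1)]


seq = "CAGGTAAGTCCGTCC"  # made-up sequence
best_pos, best_score = max(scan(seq), key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

Every GT in the sequence gets a score, but only the one followed by well-matching bases scores highly, which is the kind of graded evidence a gene finder combines with the other features.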

46 Where Would You Infer Introns? Size of arrows indicates strength of match to donor/acceptor motif sequence.

49 How Do We Actually Accomplish This Task? It turns out that we can make a Hidden Markov Model that, given a particular sequence, can return the most likely gene model. [Figure: the raw sequence from slide 39 as input and the annotated sequence from slide 40 as output, labeled Start, Exon 1, Intron 1, Exon 2, Stop]
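To give a feel for how an HMM returns a most likely labeling, here is a deliberately tiny sketch using the Viterbi algorithm. It is not the gene-finding model itself: it has only two states ("intergenic" and "coding"), distinguishes them by base composition alone, and every probability in it is invented for illustration. A real gene finder would add states for start and stop codons, splice sites, introns, and reading frame.

```python
import math

STATES = ("intergenic", "coding")
# All probabilities below are invented for illustration.
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},      # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


seq = "ATTATAAT" + "GCGGCCGCGG" + "TATTAATA"  # made-up sequence
path = viterbi(seq)
print(seq)
print("".join("C" if s == "coding" else "-" for s in path))
```

The printed track marks the GC-rich stretch as coding and the flanks as intergenic. The transition probabilities act as a penalty for switching states, so isolated bases do not flip the label.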