AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG



1 AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT

2
>my favorite protein
MSFNNALSGVNAAQKDLNVTANNIANVNTTGFKESRAEFADVYANSIFVNAKTQV
TGAVAQQFHQGALQFTNNALDLSIQGNGFFVTSDGLTNLDRTFTRAGAFKLNENS
QGNYLQGYEINTDGTPKAVSINATKPIQIPDRAGEPKMTELVEASFNLSIESKTKPT
AFDPTNSATFAHSTSVTIYDSLGAPHVITKYFVRHEDPAAPGTPLTPGVKMTFTSG
TLTVPVDPIKTVALGTTAGIINNGADPTQTLEIRLGDVTQYSSPFNVTKLTQDGATV
TKVEITPDGIVSATYSNATTLKVAMVALAKFANSQGLTQVGDTSWRQSLLSGDAL
SGTLGSIKSSALEQSNVDLTSQLVNLITAQRNFQANSRSLEVNSSLQQTILQI
>my favorite transcript
ATGAAAGTTAGTTTTGAAAGAATAATTCCAAGTGAAAAAAGCTCTTTCCGCACA
AATAACTCTCCTATTTCTGAATTTAAATGGGAGTATCATTATCATCCGGAAATAG
GTATGTGTAATTTCGGGAAGTGGCACACGGCATGTAGGCTACCATAAAAGCAA
AACGGAGATCTTGTGTTAATAGGTTCAAACATTCCACATTCCGGATTTGGACTG
GTTGATCCGCATGAAGAAATAGTACTTCAGTTCAGGGAAGAGATTTTGCATTTT
CAGGAAGTTGAAACAAGAGCCGTGAAAGATCTACTGGAACGCTCTAAATATGG
TATAGTACAGCTACAAAAAAGCTGCTCATGCCGAAACTAAAAAAGCTTCTGGA
GGCTACAAAAGATACTTACTACTTCTGGAGATTCTCTTCGAACTTTCTTTGTGC
TATGAATTGTTGAACAAAGAAATTATGCCTTATACCATAATCTCTAAAAATAAAA
CTGGAAAATATCTTTACCTATGTGGAACATCATTACGATAAGGAAATAAATATA
GTTGCAAAGCTGGCTAATCTTACTCTTCCTGCATTTTGTAATTTTTTTAAAAAAG
CAGATTACCTTTACAGAATTTGTCAACCGTTACCGTATTAATAAAGCCTGCCTT
ACTCAGGATAAAACAATATCCGAATGCAGCTACAGTTGTGGCTTTAACAATGTT
TTCAACAGAATGTTTAAAAAATATACCAATAAAACGCCATCAGAATTT

3 What could cause such wide differences of opinion in the community?
-- Most estimates were made prior to whole-genome assembly.
-- Lack of understanding of the caveats associated with gene prediction.
-- Lack of understanding of the caveats associated with using EST data.
-- Differences between prediction and annotation are not well understood.

4 Eukaryotic Gene Identification
Learn about finding genes rather than gene finders.
Approach: Focus on practical use rather than algorithmic details. Develop an intuition for how these algorithms behave in the real world.
Explore: The influence of sequence content, length, and quality on ab-initio gene prediction, and the pitfalls of homology-based gene prediction.
Understand: The differences between gene annotation and gene prediction.
A few key concepts:
-- Gene prediction is a misnomer: usually it's a CDS that is predicted, not a gene. ncRNA genes?
-- Gene annotation is more complex; it tries to deal with issues such as alternative splicing, locus assignment, UTRs, pseudogenes, evidence trails, nomenclature issues, known vs. novel, etc.
-- A gene prediction is the output of an executable and resides in a flat file.
-- A gene annotation is a complex object residing in a database that is versioned, graphically accessible, alterable, obsolete-able, and human own-able.
-- Genscan and FgenesH predict CDSs. Ensembl and Otto create and manage annotations.

5 Genomic DNA -> Gene predictors (run-time) -> Gene predictions -> Gene annotators (after-the-fact) -> Gene annotations
Gene predictors (run-time): e.g. Twinscan, Genscan, Genomescan, Genie
Gene annotators (after-the-fact): e.g. Ensembl, Otto, human beings
Accessory computes: ESTs, mRNAs; protein homology; other genomic sequences; databases
3 classes of gene prediction:
1. Ab-initio (de-novo): Genscan, Grail*, FgenesH*, Genie*, GeneId*, Genefinder, Glimmer, etc.
2. Homology based: GeneID, Genomescan, Twinscan, etc.
3. Identity based: Genewise, Sim4, Spidey, pair-HMMs

6 Automated gene annotation systems: Ensembl, Genome Channel, RiceGAAS, EGRET, Otto, etc. (Gadfly, DAS)
Ab-initio prediction:
[Figure: two predicted exons drawn on the sequence CCGTGATGCGGTGGCGCGTAAGGCGCAGTGGAAAGTGTAAGA]
Example: Genscan, etc.

7 Homology assisted prediction:
[Figure: an EST and two exons drawn on the sequence CCGTGATGCGGTGGCGCGTAAGGCGCAGTGGAAAGTGTAAGA]
Example: Genie, Grail, GeneID, etc.
Identity based gene prediction:
[Figure labels: known mRNA, prediction]
Example: esttogenome, Sim4, etc.

8 [Figure labels: homology, Genscan, Human prediction]
Example: humans, Otto, etc.
The real data is more complex than one might imagine.

9 [Figure labels: Sim4, dbEST, Genewise, nr.aa, Grail, Genscan, FgenesH, Ensembl, Otto]
Part 1: Exploring ab-initio gene prediction

10 Experiment #1
Purpose: explore the influence of sequence length on ab-initio gene prediction.
1. Take 250 KB of real human genomic DNA.
2. CHUNK it into pieces of 0.5 KB, 1 KB, 5 KB, 10 KB, 25 KB, 50 KB, 100 KB and 250 KB.
3. Run Genscan on each chunk.
4. Sew the results back together.
5. Examine the results.
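The chunk-and-sew steps of Experiment #1 can be sketched as follows. This is a minimal illustration, not the code used for the slides: the function names `chunk_sequence` and `to_genome_coords` are hypothetical, coordinates are assumed to be 0-based and half-open, and the gene predictor itself (Genscan in the experiment) is left out.

```python
def chunk_sequence(seq, chunk_size):
    """Split seq into consecutive, non-overlapping chunks.

    Returns (offset, chunk) pairs so that each chunk remembers where it
    started in the original sequence. The last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, seq[start:start + chunk_size])
            for start in range(0, len(seq), chunk_size)]


def to_genome_coords(offset, features):
    """Shift chunk-local (start, end) features back onto the whole sequence.

    This is the 'sew results back together' step: a feature predicted at
    (2, 8) in a chunk that began at offset 30 lies at (32, 38) overall.
    """
    return [(offset + start, offset + end) for start, end in features]
```

Running each chunk size in turn (0.5 KB up to 250 KB) and comparing the sewn-together predictions is then a loop over `chunk_size` values.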

11 Interference
[Figure: predictions for 50 Kb chunks and 100 Kb chunks]
Interference is a surprisingly complex phenomenon.

12 Experiment #2
Purpose: What do Ns do to gene finders?
Procedure: Genscan -> examine results.
Ns can cause squelching.
[Figure: a run of Ns (NNNNNNNNNN); caption: exon squelching]
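One way to set up this kind of experiment is to overwrite a chosen window of the sequence with Ns before handing it to the gene finder. The sketch below assumes 0-based, half-open coordinates; the helper name `mask_with_ns` is hypothetical and is not taken from the slides.

```python
def mask_with_ns(seq, start, end):
    """Return seq with the interval [start, end) replaced by Ns.

    The sequence length is unchanged, so predictions made on the masked
    sequence can be compared position by position with the unmasked run.
    """
    if not 0 <= start <= end <= len(seq):
        raise ValueError("interval lies outside the sequence")
    return seq[:start] + "N" * (end - start) + seq[end:]
```

Comparing the predictions before and after masking shows which exons were squelched.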

13 Squelching can produce less obvious errors
[Figure: NNNNNNNNNN] Remaining exons are no longer frame compatible; a new, bogus exon is inserted to fix the frame.
Ns can cause squelching
[Figure: NNNNNNNNNN] Squelching perturbs 3 exons.

14 Ns can cause squelching
[Figure: NNNNNNNNNN] Squelching causes the entire prediction to be lost.
Ns can cause squelching
[Figure: NNNNNNNNNN] Squelching causes the prediction to be split.

15 Experiment #2: Ns cause squelching
Procedure: Genscan -> examine results.
Conclusion: Ns decrease the performance of ab-initio predictors.
Factoid: Genscan is pretty tolerant of Ns, FgenesH less so. If possible, run RepeatMasker with the no-low (-nolow) option.
Experiment #3: A negative control
Purpose: always do your negative controls!
Procedure: random sequence generator (A=T=G=C=0.25) -> Genscan, CpG island finder -> examine results.
Truthfully, there are no reference annotations, so there is no perfect positive control.
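The random sequence generator for the negative control can be as simple as drawing each base independently with A=T=G=C=0.25. The sketch below is an assumption about how such a generator might look, not the one used for the slides; the name `random_dna` and the `seed` parameter are illustrative.

```python
import random


def random_dna(length, seed=None):
    """Generate a DNA string with each base drawn uniformly from A, C, G, T.

    A fixed seed makes the negative control reproducible, so the same
    random sequence can be fed to several gene finders.
    """
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))
```

Any exon that a gene finder reports on this output is, by construction, a false positive.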

16 [Figure: CpG and Genscan tracks along the DNA; x-axis: 0 Kb to 600 Kb]
[Figure: Genscan exon number, real human genomic DNA vs. randomly generated sequence; x-axis: number of exons (0 to 40); y-axis: frequency]

17 Experiment #3
Procedure: random sequence generator (A=T=G=C=0.25) -> Genscan, CpG island finder -> examine results.
Conclusion: Genscan overcalls; DNA is strange stuff!

18 Why are there genes in random sequence?
It seems that gene-like structures are present/latent within random DNA sequences, and this has some interesting evolutionary implications and some very practical ones.
At the very least, this fact makes it clear that one should never put much faith in an ab-initio gene prediction.
Part 2: Exploring homology assisted gene prediction

19 Real DNA is full of real and faux exons (why Ensembl & Otto work)
[Figure labels: true structure, predicted exons, faux exon, real exon, Genscan prediction]
The real exon is not necessarily the best exon.
Exploiting homology to avoid faux exons (why Ensembl & Otto work)
[Figure labels: true structure, predicted exons, homology, faux exon, real exon, NNNNNN, annotation]
The real exon is not necessarily the best exon.

20 2 basic ways to use homology in the gene prediction process
-- Run-time: use homology to inform a gene prediction, e.g. Twinscan, Genomescan (pair HMMs); new.
-- After-the-fact: use homology to confirm and revise a gene prediction, e.g. Ensembl, Otto.
The Genscan family of homology assisted gene predictors (run-time homology assisted annotation)
[Figure labels: Genscan; Genomescan (likes to use protein data); Genscan++; Twinscan (strength is using mouse genomic reads)]

21 Run-time homology assisted gene prediction
1. Take the genomic sequence.
2. Blast it against some database.
3. Use the blast result to label each nucleotide in the genomic sequence as intron, exon, conserved, not conserved.
4. Feed the homology information and the genomic sequence to a Genscan-like algorithm that has been trained so that it can optimally exploit homology information.
Twinscan and Genomescan are significant steps forward, but homology can also misinform gene prediction.
[Figure labels: Twinscan, GenomeScan, Genscan, RefSeq]
Homology can cause all of the same problems for after-the-fact predictors.
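The nucleotide-labelling step of run-time homology assisted prediction can be illustrated with a two-state sketch: every position covered by at least one blast hit is marked conserved, everything else not conserved. This is a simplification made for illustration; the real programs use their own, richer encodings, and the function name `label_conservation` and the `|` / `.` symbols are assumptions, not part of Twinscan or Genomescan.

```python
def label_conservation(seq_length, hits):
    """Label each position of a genomic sequence from blast hit intervals.

    hits is an iterable of (start, end) pairs, 0-based and half-open, in
    genomic coordinates. Returns a string of seq_length symbols:
    '|' for positions covered by any hit, '.' for uncovered positions.
    """
    labels = ["."] * seq_length
    for start, end in hits:
        # Clip hits that run off either end of the sequence.
        for i in range(max(0, start), min(seq_length, end)):
            labels[i] = "|"
    return "".join(labels)
```

The resulting label string has the same length as the genomic sequence, so the two can be passed side by side to a predictor trained to use both.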

22 Homology can also misinform: tandem duplications
[Figure labels: Blast result, reality, prediction]
In general, segmentation is the toughest thing to figure out: it's difficult to sort out at the grammatical (ab-initio) level and it's difficult to sort out at the level of a blast report, but probably less so!
Homology can also misinform: pseudogenes
[Figure labels: Blast result, reality, prediction]
Ab-initio predictors will often split exons to avoid frame shifts, or insert bogus exons to fix the frame. Like Ns, stops cause squelching. It is not easy to distinguish sequencing errors from pseudogenes. Genewise helps.

23 Homology can also misinform: forced calls in bad places
[Figure labels: Blast result, reality, Genscan, naïve homology assisted prediction]
Often a bad EST or low-complexity blast hit will engender a prediction.
Homology can also misinform: when the homology contradicts, can it really confirm?
[Figure labels: BlastN result, BlastX result; human, mouse, yeast; EST; human sequence]
Whether or not homology supports or contradicts a gene prediction is more than just a matter of coordinates!

24 Query and subject source are as important as coordinates!
[Figure labels: human, mouse, fly; EST, protein; homology; Genscan; Human prediction]
Code that de-convolutes homology data is essential for accurate annotation.
[Figure labels: dbEST, other ESTs, nr.aa, Genscan, RefSeq, Otto]

25 ESTs are a very problematic data source for gene prediction and annotation. EST chaos!

26 [Figure labels: dbEST, RefSeq, Otto, weird ESTs]
So what are those weird ESTs?
1. Real biology; unknown function
2. Splicing may be noisy; perhaps more so following death!
3. Genomic contamination

27 [Figure: poly-T primer (TTTTTTTTTTTTTT) shown against AAAAAGAAAGAAAA within an ALU element; 3' and 5' ends marked]
Clone cDNA, sequence the resulting clone.
5' reads contain no sign of ALU. Trimmed 3' reads don't either. In any case everyone knows that human and mouse genes have ALUs in their 3' UTRs, right?
But shouldn't a contaminant be unique?
-- nuclease resistant sites
-- tracking problems: in the real world plates are mistakenly sequenced more than once, etc.
-- a library with few primary recombinants will contain many copies of every primary recombinant after amplification.
-- many libraries have been heavily sequenced
Related topic: which strand is that EST on?
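The poly-T primer picture above suggests one simple screen for genomic contamination: check whether the genome is A-rich immediately downstream of where an EST's 3' end aligns, as in the AAAAAGAAAGAAAA stretch. The sketch below is an illustrative heuristic only; the function name, the window size and the 0.7 threshold are assumptions, not values from the slides, and the EST is assumed to align to the forward strand.

```python
def looks_internally_primed(genome, align_end, window=20, min_a_fraction=0.7):
    """Flag an EST whose 3' alignment end is followed by an A-rich stretch.

    genome    : genomic sequence (forward strand)
    align_end : 0-based position just past the last aligned genomic base
    Returns True when at least min_a_fraction of the next `window` bases
    are A, which is where a poly-T primer could have annealed.
    """
    downstream = genome[align_end:align_end + window].upper()
    if not downstream:
        return False
    return downstream.count("A") / len(downstream) >= min_a_fraction
```

A flag from a check like this is a reason to look more closely at an EST, not proof that it is a contaminant.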

28 Where would missing genes lie?
~50% of the genome is intergenic (61,971,014 bp)
[Figure: alternating genic and intergenic regions]

29 Distribution of intergenic lengths
[Figure: distribution of intergenic lengths expected if genes were distributed randomly within the genome, and the actual distribution; label: 26,346,479 million bp]
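The comparison on this slide, actual intergenic lengths against those expected if genes were distributed randomly, can be sketched as follows. Both functions are hypothetical illustrations: coordinates are assumed to be 0-based and half-open, and the random model simply drops each gene at a uniformly chosen start (overlaps allowed), which may differ from the null model behind the figure.

```python
import random


def intergenic_lengths(genes, chrom_length):
    """Lengths of the gaps between genes on one chromosome.

    genes is a list of (start, end) intervals. Overlapping genes are
    merged implicitly; the gaps before the first gene and after the
    last gene are included.
    """
    gaps = []
    cursor = 0  # end of the genic region seen so far
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append(start - cursor)
        cursor = max(cursor, end)
    if chrom_length > cursor:
        gaps.append(chrom_length - cursor)
    return gaps


def random_placement(gene_lengths, chrom_length, seed=None):
    """Place genes of the given lengths at uniformly random positions."""
    rng = random.Random(seed)
    placed = []
    for length in gene_lengths:
        start = rng.randrange(0, chrom_length - length + 1)
        placed.append((start, start + length))
    return placed
```

Histogramming `intergenic_lengths` for the real gene set and for several `random_placement` runs gives the two curves to compare.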

30 ~62 megabases of DNA: run Genscan on every intergenic region -> predictions
1,167 non-overlapping FgenesH predictions (V. Solovyev)
1800 non-overlapping new genes from Hild et al.
159 control annotations
-> new gene predictions. How many are real?
In-silico analyses

31 ~62 megabases of DNA -> 14,797 new gene predictions
Standardized validation procedure:
1. Pool mRNA from 6 different stages
2. RVT with T15 TAGGED primer
3. PCR w/ exon-specific primers (but which two exons?)
4. Sequence PCR product
5. Realign to genome
6. Examine in genome browser
[Figure labels: GENE PREDICTION, PCR PRODUCT]

32 But which two exons?
1. Choose the longest possible pair of exons.
2. Choose a pair that are separated by an intron of the right length.
3. All things being equal, choose the pair closest to the 3' end of the prediction.
Fuzzy Logic Chooser
[Figure: Fuzzy Logic Chooser example, Priority Score 85]
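To make the three rules above concrete, here is a toy weighted scoring of adjacent exon pairs. It is not the Fuzzy Logic Chooser from the slides, whose internals are not described here: the weights (0.5, 0.3, 0.2), the intron range, the `full_length` cap, and the restriction to adjacent exons on the forward strand are all assumptions made for illustration.

```python
def pair_score(exon_a, exon_b, pred_start, pred_end,
               intron_range=(50, 2000), full_length=400):
    """Score an exon pair from 0 to 100 (hypothetical weights).

    Exons are (start, end), 0-based half-open, with exon_a upstream of
    exon_b on the forward strand.
    """
    combined = (exon_a[1] - exon_a[0]) + (exon_b[1] - exon_b[0])
    intron = exon_b[0] - exon_a[1]
    # Rule 1: longer exon pairs score higher, capped at full_length.
    length_term = min(1.0, combined / full_length)
    # Rule 2: the intervening intron should be of the right length.
    intron_term = 1.0 if intron_range[0] <= intron <= intron_range[1] else 0.0
    # Rule 3: prefer pairs close to the 3' end of the prediction.
    three_prime_term = 1.0 - (pred_end - exon_b[1]) / (pred_end - pred_start)
    return round(100 * (0.5 * length_term
                        + 0.3 * intron_term
                        + 0.2 * three_prime_term))


def choose_pair(exons):
    """Return (score, exon_a, exon_b) for the best adjacent exon pair."""
    exons = sorted(exons)
    if len(exons) < 2:
        raise ValueError("need at least two exons")
    pred_start, pred_end = exons[0][0], exons[-1][1]
    return max((pair_score(a, b, pred_start, pred_end), a, b)
               for a, b in zip(exons, exons[1:]))
```

Because the score depends only on the exon structure, the same function can be applied to predictions from any program, which is the point of a program-independent priority score.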

33 [Figures: Fuzzy Logic Chooser examples, Priority Score 72 and Priority Score 17]

34 [Figure: Fuzzy Logic Chooser example, Priority Score 85]
Benefits of this approach:
1. Provides a rational, standardized and generic procedure for primer design.
2. A universal scoring scheme for gene predictions.
3. Absolutely necessary for large-scale & coordinated studies.
Problems it solves:
1. What FgenesH score is equivalent to a Genscan score of 56?
2. Scores and probabilities assigned by a single program change from version to version & with changing training data.
3. Potentially allows us to test the best predictions first; cost effective; hence the term priority score.
For example:
[Figure: Priority Score for Genscan, FgenesH and 3rd-party predictions; values shown: 85, 85, 42]

35 ~62 megabases of DNA -> new gene predictions
Sub-categorization seemed advised; homology seemed a logical criterion.
Split the gene models into 5 different sets:
1. One or none set: 293
2. Two or more set (all)
3. Splice junction (GT...AG) conserved set: 207 (all)
4. Heidelberg set (new genes from Hild et al.): 196
5. Control set (Platinum annotations): 159

36 Split the gene models into 5 different sets:
1. One or none set
2. Two or more set: 339 (all)
3. Splice junction (GT...AG) conserved set: 207 (all)
4. Heidelberg set (new genes from Hild et al.): 196
5. Control set (Platinum annotations): 159


38 Hild et al. verification scheme
2400 new genes, many with developmentally regulated expression patterns.
Split the gene models into 5 different sets:
1. One or none set: 293
2. Two or more set (all)
3. Splice junction (GT...AG) conserved set: 207 (all)
4. Heidelberg set (new genes from Hild et al.): 196
5. Control set (Platinum annotations): 159


40 Split the gene models into 5 different sets:
1. One or none set: 293; 2%
2. Two or more set: 339 (all); 9%
3. Splice junction (GT...AG) conserved set: 207 (all); 34%
4. Heidelberg set (new genes from Hild et al.): 196; 7%
5. Control set (Platinum annotations): %
Using priority scores as a generic means to score a gene prediction
[Figure: Priority Score for Genscan, FgenesH and 3rd-party predictions]
Just how predictive were these scores that a prediction would verify?

41 All in all, our results suggest around 400 new protein-coding genes.
[Figure labels: control set, all verified predictions, new genes from Hild et al.]
How to reconcile our results with those reported in Hild et al.?
Cliff-notes version of eukaryotic transcription
[Figure: transcripts drawn from the 5' end to a poly-A tail (aaaaa)]

42 More of the genome is transcribed than previously thought.
[Figure: many transcripts, each drawn from the 5' end to a poly-A tail (aaaaa)]
-- ncRNAs?
-- Is transcription leaky, messy?
What isn't understood often seems nonsensical.

43 [Figure: transcripts, each drawn from the 5' end to a poly-A tail (aaaaa)]
-- Confirmation of expression is not confirmation of existence.
-- At the very least show that it's spliced, or failing that, discrete.
-- Determining the true structure of the transcriptome is the next logical step for annotation:
   -- accurate annotation of ncRNA
   -- accurate annotation of each protein-coding gene's intron-exon structure
   -- accurate annotation of every alternate transcript
   -- extend in-situ information to individual alternative transcripts

44 Eukaryotic Gene Identification (Conclusions)
Approach: Focus on practical use rather than algorithmic details. Develop an intuition for how these algorithms behave in the real world.
Explore: The influence of sequence content, length, and quality on ab-initio gene prediction, and the pitfalls of homology-based gene prediction.
Understand: The differences between gene annotation and gene prediction.
Acknowledgements
Comparative Analysis project: Chris Mungall, Simon Prochnik, Chris Smith, Josh Kaminker, George Hartzell, Ian Holmes, Eric Smith
Annotation verification project: Suzi Lewis, Gerald M. Rubin, Sima Misra, Adina Bailey, Colin Wiel, ShengQiang Shu, Joe Carlson, Martha Evans-Holmes, Pavel Tomancak, Sue Celniker
