Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

Similar documents
Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Annotation of a Drosophila Gene

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

Small Exon Finder User Guide

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

Transcription Start Sites Project Report

Annotation of contig62 from Drosophila elegans Dot Chromosome

Annotation of Contig8 Sakura Oyama Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio 434W May 2, 2016

Lab Week 9 - A Sample Annotation Problem (adapted by Chris Shaffer from a worksheet by Varun Sundaram, WU-STL, Class of 2009)

ab initio and Evidence-Based Gene Finding

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Drosophila ficusphila F element

Outline 08/2018. Searching for Transcription Start Sites in Drosophila. TSS of F element genes show lower levels of H3K9me3 and HP1a

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

MODULE 5: TRANSLATION

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

MODULE TSS1: TRANSCRIPTION START SITES INTRODUCTION (BASIC)

Aaditya Khatri. Abstract

Annotation of Drosophila erecta Contig 14. Kimberly Chau Dr. Laura Hoopes. Pomona College 24 February 2009

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Annotating the D. virilis Fourth Chromosome: Fosmid 99M21

MODULE TSS2: SEQUENCE ALIGNMENTS (ADVANCED)

Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M.

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

Gene Identification in silico

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

user s guide Question 1

Chimp Sequence Annotation: Region 2_3

Motif Discovery in Drosophila

Chimp BAC analysis: Adapted by Wilson Leung and Sarah C.R. Elgin from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. Michael R.

ORTHOMINE - A dataset of Drosophila core promoters and its analysis. Sumit Middha Advisor: Dr. Peter Cherbas

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

COMPUTER RESOURCES II:

Hands-On Four Investigating Inherited Diseases

BME 110 Midterm Examination

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Sequence Based Function Annotation

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)


Browser Exercises - I. Alignments and Comparative genomics

FINDING GENES AND EXPLORING THE GENE PAGE AND RUNNING A BLAST (Exercise 1)

Gene Finding Genome Annotation

Homework 4. Due in class, Wednesday, November 10, 2004

RNA-Seq Analysis. August Strand Genomics, Inc All rights reserved.

Finding Genes, Building Search Strategies and Visiting a Gene Page

FUNCTIONAL BIOINFORMATICS

Finding Genes, Building Search Strategies and Visiting a Gene Page

CS313 Exercise 1 Cover Page Fall 2017

Mapping strategies for sequence reads

MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES.

Supplementary Information

Interpreting RNA-seq data (Browser Exercise II)

Gene Annotation Project. Group 1. Tyler Tiede Yanzhu Ji Jenae Skelton

Gene Model Annotations for Drosophila melanogaster: Impact of High throughput Data

MAKER: An easy to use genome annotation pipeline. Carson Holt Yandell Lab Department of Human Genetics University of Utah

Identifying Regulatory Regions using Multiple Sequence Alignments

Guided tour to Ensembl

Gene Prediction. Mario Stanke. Institut für Mikrobiologie und Genetik Abteilung Bioinformatik. Gene Prediction p.

Tutorial for Stop codon reassignment in the wild

Investigating Inherited Diseases

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Why learn sequence database searching? Searching Molecular Databases with BLAST

Computational gene finding

Using the Genome Browser: A Practical Guide. Travis Saari

Bacterial Genome Annotation

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq

Annotating the Genome (H)

Assessing De-Novo Transcriptome Assemblies

Basic Bioinformatics: Homology, Sequence Alignment,

Computational gene finding

Systematic evaluation of spliced alignment programs for RNA- seq data

The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks.

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

HUMAN GENOME BIOINFORMATICS. Tore Samuelsson, Dec 2009

RNA-Sequencing analysis

Genome annotation & EST

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Analysis of RNA-seq Data

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

user s guide Question 3

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

User s Manual Version 1.0

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Biotechnology Explorer

Gene-centered resources at NCBI

Lecture 7 Motif Databases and Gene Finding

Exercise I, Sequence Analysis

Transcription:

Agenda Annotation of Drosophila January 2018 Overview of the GEP annotation project GEP annotation strategy Types of evidence Analysis tools Web databases Annotation of a single isoform (walkthrough) Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT TAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGA GTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGA AATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATC GATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGAT ACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGG GCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATT TAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACG CCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAG CGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCA TAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGG TGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAG CCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATC ATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCT CAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGA GCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGA ACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAA TTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAA TATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACA GGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donor Splice acceptor UTR Annotation: Adding labels to a sequence Genes: Novel or known genes, pseudogenes Regulatory Elements: Promoters, enhancers, silencers Non-coding RNA: trnas, mirnas, sirnas, snornas Repeats: Transposable elements, simple repeats Structural: Origins of replication Experimental Results: DNase I Hypersensitive sites ChIP-chip and ChIP-Seq datasets (e.g., modencode) GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. bipectinata D. ananassae D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published TSS projects for Fall 2017 / Spring 2018 Annotation projects for Fall 2017 / Spring 2018 New species sequenced by modencode Phylogenetic tree produced by Thom Kaufman as part of the modencode project Muller element nomenclature Muller elements Species A B C D E F D. simulans X 2L 2R 3L 3R 4 D. sechellia X 2L 2R 3L 3R 4 D. melanogaster X 2L 2R 3L 3R 4 D. yakuba X 2L 2R 3L 3R 4 D. erecta X 2L 2R 3L 3R 4 D. ananassae XLXR 3R 3L 2R 2L 4L4R D. pseudoobscura XL 4 3 XR 2 5 D. persimilis XL 4 3 XR 2 5 D. willistoni XL 2R 2L XR 3 D. mojavensis X 3 5 4 2 6 D. virilis X 4 5 3 2 6 D. grimshawi X 3 2 5 4 6 1

Nomenclature for Drosophila genes Drosophila gene names are case-sensitive Lowercase initial letter = recessive mutant phenotype Uppercase initial letter = dominant mutant phenotype Every D. melanogaster gene has an annotation symbol Begins with the prefix CG (Computed Gene) Some genes have a different gene symbol (e.g., mav) Suffix after the gene symbol denotes different isoforms mrna = -R; protein = -P mav-ra = Transcript for the A isoform of mav mav-pa = Protein product for the A isoform of mav GEP annotation strategy Technique optimized for projects with a moderately close, well annotated neighbor species Example: D. melanogaster Need to apply different strategies when annotating genes in other species: Example: corn, parrot GEP annotation goals Identify and annotate all the genes in your project For each gene, identify and precisely map (accurate to the base pair) all coding exons (CDS) Do this for ALL isoform Annotate the initial transcribed exon and transcription start site (TSS) Optional curriculum not submitted to GEP Clustal analysis (proteins, promoter regions) Transposons and other repeats Synteny Non-coding genes Evidence-based annotation Human-curated analysis Much higher accuracy than standard ab initio and evidence-based gene finders Goal: collect, analyze, and synthesize all the available evidence to create the best-supported gene model: Example: 4591-4688, 5157-5490, 5747-6001 Collect, analyze, and synthesize Collect: Genome Browser Conservation (BLAST searches) Analyze: Interpreting Genome Browser evidence tracks Interpreting BLAST results Synthesize: Construct the best-supported gene model based on potentially contradictory evidence Evidence for gene models (in general order of importance) 1. Conservation Sequence similarity to genes in D. melanogaster Sequence similarity to other Drosophila species (Multiz) 2. Expression data RNA-Seq, EST, cdna 3. Computational predictions Open reading frames; gene and splice site predictions 4. Tie-breakers of last resort See the Annotation Instruction Sheet 2

Expression data: RNA-Seq Positive results very helpful Negative results are less informative Lack of transcription no gene Evidence tracks: RNA-Seq coverage (read depth) TopHat splice junction predictions Assembled transcripts (Cufflinks, Oases) GEP curriculum: RNA-Seq Primer Browser-Based Annotation and RNA-Seq Data Gene structure - terminology Primary mrna Protein Gene span Exons UTR s CDS s Basic annotation workflow 1. Identify the likely ortholog in D. melanogaster 2. Determine the gene structure of the ortholog 3. Map each CDS of ortholog to the project sequence Use BLASTX to identify conserved region Note position and reading frame 4. Use these data to construct a gene model Identify the exact start and stop base position for each CDS 5. Use the Gene Model Checker to verify the gene model 6. For each additional isoform, repeat steps 2-5 Annotation workflow (graphically) Contig Feature BLASTP search of feature against the D. melanogaster proteins database D. melanogaster gene model (1 isoform, 5 CDS) BLASTX search of each D. melanogaster CDS against the contig Reading frame 3 1 3 2 1 Alignment Contig Feature Annotation workflow (graphically) BLASTX search of each D. melanogaster CDS against the contig Reading frame 3 1 3 2 1 Alignment Contig Identify the exact coordinates of each CDS using the Genome Browser M GT AG GT Reading frame 3 1 Use the Gene Model Checker to verify the final CDS coordinates Gene model 1245 1383 1437 1678 1740 2081 2159 2337 2397 2511 Coordinates: 1245-1383, 1437-1678, 1740-2081, 2159-2337, 2397-2511 UCSC Genome Browser Provide a graphical view of genomic regions Sequence conservation Gene and splice site predictions RNA-Seq data and splice junction predictions BLAT BLAST-Like Alignment Tool Map protein or nucleotide sequence against an assembly Faster but less sensitive than BLAST Table Browser Access raw data used to create the graphical browser 3

UCSC Genome Browser overview Genomic sequence Two different versions of the UCSC Genome Browser Official UCSC Version http://genome.ucsc.edu Evidence tracks Published data, lots of species, whole genomes; used for Chimp Chunks GEP Version http://gander.wustl.edu GEP projects, parts of genomes; used for annotation of Drosophila species Additional training resources Training section on the UCSC web site http://genome.ucsc.edu/training.html User guides and tutorials Mailing lists OpenHelix tutorials and training materials http://www.openhelix.com/ucsc Pre-recorded tutorials Reference cards Four web sites used by the GEP annotation strategy Open 4 tabs on your web browser: 1. GEP UCSC Genome Browser (http://gander.wustl.edu) Genome Browser D. virilis Mar. 2005 chr10 2. FlyBase (http://flybase.org) Tools è Genomic/Map Tools è BLAST 3. Gene Record Finder GEP web site è Projects è Annotation Resources Information on the D. melanogaster gene structure 4. NCBI BLAST (https://blast.ncbi.nlm.nih.gov) BLASTX è select the checkbox: Navigate to the Genome Browser for the project region Configure the Genome Browser Gateway page: clade = Insect genome = D. virilis assembly = Mar. 2005 (GEP/Annot. D. virilis ppt) position = chr10 Initial survey of a genomic region Investigate gene prediction chr10.4 in a fosmid project (chr10) from D. virilis using the GEP UCSC Genome Browser Click submit 4

Control how evidence tracks are displayed on the Genome Browser Five different display modes: Hide: track is hidden Dense: all features appear on a single line Squish: overlapping features appear on separate lines Features are half the height compared to full mode Pack: overlapping features appear on separate lines Features are the same height as full mode Full: each feature is displayed on its own line Set Base Position track to Full to see the amino acid translations Initial assessment of fosmid project chr10 Seven gene predictions (features) from Genscan Need to investigate each feature if one were to annotate this entire project Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes Investigate gene prediction chr10.4 Enter chr10:15000-21000 in the position box and click go to navigate to this region Click on the feature and select Predicted Protein to retrieve the predicted protein sequence Select and copy the sequence Computational evidence Assumption: there are recognizable signals in the DNA sequence that the cell uses; it should be possible to detect these signals computationally Many programs designed to detect these signals Use machine learning and Bayesian statistics These programs do work to a certain extent The information they provide is better than nothing The predictions have high error rates Accuracy of the different types of computational evidence DNA sequence analysis: ab initio gene predictors exons vs. genes Splice site predictors high sensitivity but low specificity Multi-species alignment low specificity BLASTX protein alignment: WARNING! Students often over-interpret the BLASTX alignment track (D. mel Proteins); use with caution Data on the Genome Browser is incomplete Most evidence has already been gathered for you and they are shown in the Genome Browser tracks Annotator still needs to generate conservation data 1. Assign orthology 2. Look for regions of conservation using BLAST Collect the locations (position, frame, and strand) of conservation to the orthologous protein If no expression data are available, these searches might need to be exhaustive (labor intensive) 5

Detect sequence similarity with BLAST BLAST = Basic Local Alignment Search Tool Why is BLAST popular? Provide statistical significance for each match Good balance between sensitivity and speed Identify local regions of similarity Common BLAST programs Except for BLASTN, all alignments are based on comparisons of protein sequences Alignment coordinates are relative to the original sequences Decide which BLAST program to use based on the type of query and subject sequences: Program Query Database (Subject) BLASTN Nucleotide Nucleotide BLASTP Protein Protein BLASTX Nucleotide Protein Protein TBLASTN Protein Nucleotide Protein TBLASTX Nucleotide Protein Nucleotide Protein Where can I run BLAST? Accessing TBLASTX at NCBI NCBI BLAST web service https://blast.ncbi.nlm.nih.gov/blast.cgi EBI BLAST web service https://www.ebi.ac.uk/tools/sss/ FlyBase BLAST (Drosophila and other insects) http://flybase.org/blast/ Detect conserved D. melanogaster coding exons with BLASTX FlyBase Database for the Drosophila research community Coding sequences evolve slowly Exon structure changes very slowly Use sequence similarity to infer homology D. melanogaster very well annotated Use BLAST to find similarity at amino acid level As evidence accumulates indicating the presence of a gene, we could justify spending more time and effort looking for conservation Key features: Lots of ancillary data for each gene in Drosophila Curation of literature for each gene Reference Drosophila annotations for all other databases Including NCBI, EBI, and DDBJ Fast release cycle (~6-8 releases per year) 6

Web databases and tools Many genome databases available Be aware of different annotation releases Release 6 versus release 5 assemblies Use FlyBase as the canonical reference Web databases are being updated constantly Update GEP materials before semester starts Potential discrepancies in exercise screenshots Minor changes in search results Let us know about errors or revisions Finding the ortholog BLASTP search of the chr10.4 gene prediction against the set of D. melanogaster proteins at FlyBase BLASTP results show a significant hit to the mav gene Note the large change in E-value from mav-pa to the next best hit gbb-pb; good evidence for orthology Results of ortholog search Degree of similarity consistent with comparison of typical proteins from D. virilis versus D. melanogaster Regions of strong similarity interspersed with regions of little or no similarity Same Muller element supports orthology We have a probable ortholog: mav FlyBase naming convention: <gene symbol>-[rp]<isoform> R = mrna, P = Protein mav-pa = Protein product from the A isoform of the gene mav Constructing a gene model We need more information on mav: Determine the gene function and structure in D. melanogaster If this is the ortholog, we also need the amino acid sequence of each CDS Use two web sites to learn about the gene structure: FlyBase Graphical representation of the gene; detailed information on each gene Gene Record Finder Table list of isoforms, retrieve exon sequences Obtain the gene structure model Using the Gene Record Finder and FlyBase to determine the gene structure of the D. melanogaster gene mav 7

Retrieve the Gene Record Finder record for mav Type mav into the search box, then press [Enter] Gene structure of mav Both the graphical and table view shows mav has a single isoform (mav-ra) in D. melanogaster This isoform has two transcribed exons Both exons contain translated regions Click on the View in GBrowse link to get a graphical view of the gene structure on FlyBase FlyBase GBrowse graphical format Gene Record Finder table format Investigate exons Search each CDS from D. melanogaster against the project sequence (e.g., fosmid / contig) Identify regions within the project sequence that code for amino acids similar to the D. melanogaster CDS Best to search with the entire project DNA sequence Easier to keep track of positions and reading frames Identify the conserved regions Use BLASTX to map each D. melanogaster CDS from the mav gene against the D. virilis chr10 fosmid sequence Retrieve CDS sequences In the Polypeptide Details section of the Gene Record Finder, select a row in the CDS table The corresponding CDS sequence will appear in the Sequence viewer window Retrieve the fosmid sequence Go back to the Genome Browser (first tab) and navigate to chr10 Click on the DNA link under the View menu, then click on the get DNA button 8

Compare the CDS against the fosmid sequence with BLASTX Copy and paste the genomic sequence from tab 1 into the Enter Query Sequence textbox Copy and paste the sequence for the CDS 1_9541_0 from tab 2 into the Enter Subject Sequence textbox Expand the Algorithm parameters section: Verify the Word size is set to 3: Turn off compositional adjustments Turn off the low complexity filter Click BLAST Adjust the Expect threshold E-value is negatively correlated with alignment length Difficult to find small exons with BLAST Sometimes BLAST cannot find regions with significant similarity: Use the Query subrange field to restrict the search region Change the Expect threshold to 1000 and try again Keep increasing the Expect threshold until you get hits This strategy will identify regions with very weak similarity but they can be better than nothing Examine the BLASTX alignment BLASTX result shows a weak alignment 50 identities and 94 similarities (positives) Low degree of similarity is not unusual when comparing single exons from these two species Location: 16866-17504, frame: +3, missing first 92 aa Map the second CDS (2_9541_0) Perform similar BLASTX search with CDS 2_9541_0 Location: chr10:18,476-19,747; Frame: +2; Alignment include the entire CDS except for first amino acid For larger genes, continue mapping each exon with BLASTX (adjust Expect threshold as needed) Record location, frame and missing amino acids BLAST might not find very small exons Move on and come back later Plot the location of each CDS might help Note the parts of the CDS s missing from the alignments Identify missing parts of CDS BLASTX CDS alignments for mav on D. virilis 16,866 17,504 18,476 19,747 93 271 2 431 Frame +3 Frame +2 Building a gene model Determine the exact start and end positions of each exon for the putative mav-pa ortholog 9

Basic biological constraints (inviolate rules*) Find the missing start codon Coding regions start with a methionine Coding regions end with a stop codon Gene should be on only one strand of DNA Exons appear in order along the DNA (collinear) Intron sequences should be at least 40 bp Intron starts with a GT (or rarely GC) Intron ends with an AG * There are known exceptions to each rule Only a single start codon (16,515-16,517 in frame +3) upstream of conserved region before the stop codon Start codon at 16,977-16,979 would truncate conserved region CDS alignment starts at 16,866 but 92 amino acids missing 16,866 - (92 * 3) = 16,590 A genomic sequence has 6 different reading frames Splice donor and acceptor phases Frames 1 2 3 Phase: Number of bases between the complete codon and the splice site Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC) Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon Frame: Base to begin translation relative to the first base of the sequence Phase depends on the reading frame of the CDS Phase depends on the reading frame Phase of donor and acceptor sites must be compatible Extra nucleotides from donor and acceptor phases will form an additional codon Donor phase + acceptor phase = 0 or 3 Splice Acceptor CCA AAT G GT AG TT CTC GAT Phase of acceptor site: Phase 2 relative to frame +1 Phase 0 relative to frame +2 Phase 1 relative to frame +3 Translation: CCA AAT GTT CTC GAT P N V L D 10

Incompatible donor and acceptor phases results in a frame shift Picking the best acceptor site CCA AAT G GT GT AG TT CTC GAT Reading frame (+2) dictated by BLASTX result Translation: CCA AAT GGT TTC P N G F S TCG AT Phase 0 donor is incompatible with phase 2 acceptor; use prior GT, which is a phase 1 donor. Two splice acceptor candidates: 18,471-18,472 (phase 0) 18,482-18,483 (phase 1), remove conserved amino acids 18,471-18,472 is the better candidate Pick different candidate if no phase 0 donor in previous exon Picking the best donor site Create the final gene model Pick ATG (M) at the start of the gene Reading frame (+3) dictated by BLASTX result Two splice donor candidates: 17,505-17,506 (phase 0) 17,518-17,519 (phase 1) Phase 0 donor (17,505-17,506) is compatible with phase 0 acceptor (18,471-18,472) For each putative intron/exon boundary, use the Genome Browser to locate the exact first and last base of the exon After splicing, conserved amino acids are linked together in a single long open reading frame CDS coordinates for mav: 16515-17504, 18473-19744 Additional strategies for identifying splice donor and acceptor sites For many genes, the location of donor and acceptor sites can be identified easily based on: Locations and quality of the CDS alignments Evidence of expression from RNA-Seq TopHat splice junctions When amino acid conservation is absent, other evidence and techniques must be considered. See the following documents for help: Annotation Instruction Sheet Annotation Strategy Guide Verify the final gene model using the Gene Model Checker Examine the checklist and explain any errors or warnings in the GEP Annotation Report View your gene model in the context of the other evidence tracks on the Genome Browser Examine the dot plot and explain any discrepancies in the GEP Annotation Report Look for large vertical and horizontal gaps See the Quick Check of Student Annotations document on the GEP web site 11