Small Exon Finder User Guide


Author
Wilson Leung

Document History
Initial Draft: 01/09/2011
First Revision: 08/03/2014
Current Version: 12/29/2015

Table of Contents
Author
Document History
Introduction
Acknowledgements
Questions about the Small Exon Finder
Availability
Using the Small Exon Finder (Tutorial)
    Retrieve the CDS record for unc-13 from the Gene Record Finder
    Define the region to search for the CDS 1_2159_0
        Define the start of the next gene (Rad23)
        Define the end of the next CDS of unc-13 (5_2159_0)
    Search for candidates using the Small Exon Finder
    Iterative search for the best CDS candidate
Small Exon Finder Configuration Options

Introduction

One of the key steps in the recommended annotation strategy for the GEP Drosophila annotation projects is to map each unique coding exon (CDS) in the D. melanogaster gene model onto the contig sequence. This CDS mapping is typically done using BLAST 2 Sequences (bl2seq) with either the blastx or tblastn program. The annotator can then examine each conserved region and elucidate the correct splice site boundaries for each coding exon using the GEP UCSC Genome Browser.

This strategy works well for coding exons that are relatively large (> 10 amino acids) and relatively well conserved. However, the 5' and 3' coding exons of many D. melanogaster genes tend to be much shorter than the internal coding exons. The heuristics used by the BLAST algorithm (e.g., word size) make it difficult to identify these small CDSs using either blastx or tblastn. In addition, BLAST alignments for small exons tend to have high E-values because short regions of sequence similarity are more likely to occur by chance. Consequently, we need to apply a different search strategy to identify small exons in a gene model.

One of the key premises of the GEP comparative Drosophila annotation strategy is that coding regions are under stronger selective pressure and tend to accumulate changes at a slower rate than other regions of the genome. Hence we seek to minimize the number of changes compared to the D. melanogaster ortholog when we create gene models in other Drosophila species (i.e., parsimony). Our past annotation experience suggests that the CDS size in the other Drosophila species tends to remain similar to the size of the orthologous CDS in D. melanogaster.

The Small Exon Finder is designed to look for open reading frames that satisfy a set of biological constraints. These constraints include the locations of adjacent CDSs and genes, the type of CDS (i.e., initial, internal, or terminal CDS), the phase of the donor or acceptor site, and the expected CDS size according to the D. melanogaster model. The Small Exon Finder will search the contig region defined by the user and report a list of open reading frames that satisfy these constraints.

Acknowledgements

The Small Exon Finder was developed by Wilson Leung at Washington University in St. Louis as part of the Genomics Education Partnership (GEP) project.

Questions about the Small Exon Finder

Please contact Wilson at wleung@wustl.edu if you have any questions or encounter any problems with the Small Exon Finder.

Availability

The "Small Exon Finder" is available in the "Annotation Resources" section under the "Projects" menu at the GEP web site.

Using the Small Exon Finder (Tutorial)

In this tutorial, we will use the Small Exon Finder to find the first CDS of unc-13 in contig12 from the D. mojavensis dot chromosome (Sep assembly) in order to illustrate some of the key features of the Small Exon Finder tool.

Retrieve the CDS record for unc-13 from the Gene Record Finder

Before we can annotate the first CDS of unc-13 in our D. mojavensis contig, we need to obtain a better understanding of the gene structure of unc-13 in D. melanogaster. We can retrieve this information using the Gene Record Finder (Figure 1). Detailed documentation for the Gene Record Finder is available at the GEP web site (under Help → Documentations → Web Frameworks).

Figure 1 Retrieve the unc-13 gene record from the Gene Record Finder.

The gene record shows that the first CDS (1_2159_0) of the A, C, D, F, and G isoforms of unc-13 consists of only two amino acids (MT, Figure 2).

Figure 2 The first CDS of the A, C, D, F, and G isoforms of unc-13 contains only two amino acids.

Because NCBI blastx and tblastn have a minimum word size of two, the bl2seq search strategy will not find this CDS even if these amino acids were completely conserved in D. mojavensis.

Define the region to search for the CDS 1_2159_0

When searching for small CDSs, we typically expect to find a large number of candidates within our contig sequence. Thus it is important to impose additional constraints on our search criteria in order to reduce the number of candidates we need to examine.

Define the start of the next gene (Rad23)

First, we need to define the region to search for the CDS. From the Genome Browser view of contig12, we see that Rad23 is adjacent to unc-13 (Figure 3). Because nested genes are rare in Drosophila, we can define one end of the search boundary based on where Rad23 starts in contig12.

Based on the results in the blastx alignment track, we know that Rad23 is in the same relative orientation as the contig12 sequence. Hence we can use the 5' end of the A and C isoforms of Rad23 to define one end of the search boundary.

Figure 3 Genome Browser view of the beginning of contig12 shows that Rad23 is located next to unc-13.

According to the Gene Record Finder, the first CDS of the A and C isoforms of Rad23 is 1_2346_0 (Figure 4). A blastx search of CDS 1_2346_0 against contig12 shows that this CDS begins at 23,720 (Figure 5). Consequently, the first CDS of unc-13 is likely to be located within the region 1-23,719 of contig12.

Figure 4 Gene record for Rad23 shows that CDS 1_2346_0 is the first CDS of the A and C isoforms.

Figure 5 Results of blastx search of contig12 (query) against the Rad23 CDS 1_2346_0 (subject).

Define the end of the next CDS of unc-13 (5_2159_0)

We can impose additional restrictions on the possible locations of the initial CDS of unc-13 (1_2159_0) by determining the location of the adjacent CDS (5_2159_0) and the phase of its splice acceptor site. A blastx search of CDS 5_2159_0 against contig12 results in a partial alignment where the first 7 amino acids of the CDS are missing from the alignment (Figure 6).

Figure 6 The first 7 aa are missing from the blastx alignment of CDS 5_2159_0 against contig12.

Based on the blastx alignment, we will examine the region surrounding 21,912 (i.e., 3 × 7 = 21 bases from the boundary of the alignment, to account for the 7 missing amino acids) for possible splice acceptor sites. Examination of the region contig12:21,897-21,926 using the Genome Browser reveals a splice acceptor site at 21,917-21,916 that is supported by multiple computational predictions and RNA-Seq data (Figure 7). Because the end of the CDS is at 21,915 and the blastx alignment is in frame -1, this acceptor site is in phase 0. This means that CDS 1_2159_0 must have a phase 0 donor site.

Figure 7 Phase 0 splice acceptor site for CDS 5_2159_0 in contig12 (relative to frame -1).
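The phase reasoning above can be written as a short calculation. The sketch below is illustrative only and is not part of the Small Exon Finder. It assumes the usual convention that the donor phase counts the bases left over after the last complete codon of the upstream CDS, the acceptor phase counts the bases before the first complete codon of the downstream CDS, and the two must sum to 0 or 3. The function names are hypothetical.

```python
def bases_for_missing_residues(missing_aa: int) -> int:
    """Number of nucleotides needed to encode the unaligned amino acids."""
    return 3 * missing_aa


def compatible_donor_phase(acceptor_phase: int) -> int:
    """Donor phase that pairs with a given acceptor phase.

    Assumed convention: donor phase + acceptor phase must equal 0 or 3,
    so that the codon split by the intron is restored after splicing.
    """
    if acceptor_phase not in (0, 1, 2):
        raise ValueError("acceptor phase must be 0, 1, or 2")
    return (3 - acceptor_phase) % 3


# Seven amino acids missing from the alignment correspond to 21 bases.
print(bases_for_missing_residues(7))
# A phase 0 acceptor requires a phase 0 donor on the preceding CDS.
print(compatible_donor_phase(0))
```

Under this convention a phase 1 acceptor would pair with a phase 2 donor, and a phase 2 acceptor with a phase 1 donor.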

Collectively, the analysis above suggests that the CDS 1_2159_0 is located within the region 21,916-23,719 in contig12 and that it must have a phase 0 donor site.

Search for candidates using the Small Exon Finder

Using the information we have collected so far, we will search for candidate open reading frames using the Small Exon Finder with the following settings:

Field                       Value
Sequence File               contig12.fasta
Coding Exon Type            Initial Exon (with start codon)
Position to Begin Search
Position to End Search
Strand                      Minus
CDS Size (aa)               2
Donor Site                  GT
Donor Phase                 0

The Small Exon Finder identified three potential candidates that satisfy our search criteria (Figure 8). Among these candidates, only one (at 21,990-21,995) contains the same amino acid sequence (MT) as the orthologous CDS in D. melanogaster.

Figure 8 Small Exon Finder identifies three candidates that satisfy our search constraints.
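To make the kind of search described in this section concrete, here is a minimal sketch of how candidates for an initial CDS could be enumerated. It is an assumption about the general approach, not the actual implementation of the Small Exon Finder: it looks for a start codon, the requested number of codons with no stop codon, and the donor site immediately downstream, on the chosen strand. The function names and the toy sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_initial_exon_candidates(contig, start, end, strand, size_aa,
                                 donor="GT", donor_phase=0):
    """Return (low, high) 1-based contig coordinates of candidate initial CDSs.

    start and end are 1-based inclusive; strand is "+" or "-".
    """
    region = contig[start - 1:end].upper()
    if strand == "-":
        region = reverse_complement(region)
    cds_len = 3 * size_aa + donor_phase  # complete codons plus leftover bases
    hits = []
    for i in range(len(region) - cds_len - len(donor) + 1):
        cds = region[i:i + cds_len]
        if not cds.startswith("ATG"):
            continue  # an initial CDS must begin with a start codon
        codons = [cds[j:j + 3] for j in range(0, 3 * size_aa, 3)]
        if any(codon in STOP_CODONS for codon in codons):
            continue  # no stop codon allowed inside the CDS
        if region[i + cds_len:i + cds_len + len(donor)] != donor:
            continue  # the donor site must follow the CDS directly
        if strand == "-":
            high = end - i  # map back to contig coordinates
            hits.append((high - cds_len + 1, high))
        else:
            low = start + i
            hits.append((low, low + cds_len - 1))
    return hits


# Toy contig: the minus strand reads ...ATG ACC GT..., i.e. M-T then GT.
toy = "CCCCACGGTCATCCCC"
print(find_initial_exon_candidates(toy, 1, 16, "-", 2))
```

On a real contig such a scan typically returns several candidates, which is why the tutorial narrows the search region first and then compares the candidates against the D. melanogaster sequence and the Genome Browser evidence.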

Furthermore, when we examine the three possible candidates using the GEP UCSC Genome Browser, we find that multiple gene predictors (e.g., Genscan, SGP Gene) and the RNA-Seq evidence (e.g., TopHat splice junction predictions) strongly favor the candidate at 21,990-21,995 compared to the two other candidates (Figure 9). Consequently, we will place the CDS 1_2159_0 of unc-13 at this location in contig12.

Figure 9 The selected candidate is supported by sequence conservation with D. melanogaster, multiple gene predictions, and RNA-Seq evidence.

Iterative search for the best CDS candidate

In some cases, the size of the CDS might have changed compared to D. melanogaster. In that case, the GEP annotation strategy would prefer the candidate that minimizes the change in CDS size (i.e., parsimony). To help you identify the most parsimonious candidate, the Small Exon Finder will automatically run an iterative search, changing the target size by 1 aa at each step, if it fails to find a candidate with the requested CDS size. This iterative search terminates when it finds the first candidate that satisfies the rest of the search criteria or when the target CDS size has decreased to 0. The result of this iterative search is listed under the "Matches with the smallest changes in CDS size" section.

For example, the D. melanogaster CDS 19_959_0 for the gene bt consists of 13 amino acids (Figure 10). A Small Exon Finder search in contig41 of the D. biarmipes dot chromosome (Aug assembly) did not find a candidate with 13 aa, but it did find a candidate with a CDS size of 12 aa (Figure 11).
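The iterative search can be pictured with the sketch below. It is a hedged illustration, not the tool's code. In particular, it assumes that at each step both the smaller and the larger target size are tried; the guide only states that the target size changes by 1 aa per step and that the search stops once the size reaches 0. The `search` callable, the function name, and the coordinates in the example are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[int, int]  # (low, high) contig coordinates


def iterative_search(search: Callable[[int], List[Candidate]],
                     requested_size: int
                     ) -> Tuple[Optional[int], List[Tuple[int, Candidate]]]:
    """Return (size_change, [(size_aa, candidate), ...]) for the smallest change.

    search(size_aa) is any function that returns the candidates for one
    target CDS size, with all other constraints held fixed.
    """
    exact = search(requested_size)
    if exact:
        return 0, [(requested_size, c) for c in exact]
    # The smaller target shrinks by 1 aa per step and stops before reaching 0.
    for delta in range(1, requested_size):
        found = []
        for size in (requested_size - delta, requested_size + delta):
            found.extend((size, c) for c in search(size))
        if found:
            return delta, found
    return None, []


# Made-up search function: candidates exist only for a 12 aa CDS.
def fake_search(size_aa: int) -> List[Candidate]:
    return [(100, 135)] if size_aa == 12 else []


print(iterative_search(fake_search, 13))
```

With a requested size of 13 aa and a hit only at 12 aa, the sketch reports a size change of 1, which mirrors the bt example above.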

Figure 10 The CDS 19_970 of the bt gene has 13 aa in D. melanogaster.

Figure 11 Results of the Small Exon Finder iterative search for candidates with the smallest change in CDS size (12 aa) compared to the requested CDS size (13 aa).

Small Exon Finder Configuration Options

Parameter           Description
Sequence file       Unmasked sequence file in FASTA format
Coding Exon Type    Initial Exon (with start codon), Internal Exon, or Terminal Exon (with stop codon)
Start Position      Start coordinate of the search region
End Position        End coordinate of the search region
Strand              Orientation to search relative to the query sequence
CDS Size (aa)       Target size of the open reading frame (e.g., based on the size of the CDS in D. melanogaster)
Donor Site          The splice donor site sequence (GT or GC) [Only available for initial and internal exons]
Acceptor Phase      The phase of the splice acceptor site (0, 1, 2, or Any) [Only available for internal and terminal exons]
Donor Phase         The phase of the splice donor site (0, 1, 2, or Any) [Only available for initial and internal exons]
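The dependencies between these options (donor settings apply only to initial and internal exons, the acceptor phase only to internal and terminal exons) can be captured in a small settings object. The sketch below is purely illustrative: the class, its field names, and the error messages are hypothetical and do not describe how the web tool stores or checks its form fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Phase = Union[int, str]          # 0, 1, 2, or "Any"
VALID_PHASES = (0, 1, 2, "Any")


@dataclass
class SearchSettings:
    sequence_file: str
    exon_type: str               # "initial", "internal", or "terminal"
    start: int
    end: int
    strand: str                  # "plus" or "minus"
    cds_size_aa: int
    donor_site: Optional[str] = None
    acceptor_phase: Optional[Phase] = None
    donor_phase: Optional[Phase] = None

    def problems(self) -> List[str]:
        """List the settings that conflict with the option table."""
        issues = []
        if self.exon_type not in ("initial", "internal", "terminal"):
            issues.append("unknown exon type")
        if self.start > self.end:
            issues.append("start position is after end position")
        has_donor = self.exon_type in ("initial", "internal")
        has_acceptor = self.exon_type in ("internal", "terminal")
        if has_donor:
            if self.donor_site not in ("GT", "GC"):
                issues.append("donor site must be GT or GC")
            if self.donor_phase not in VALID_PHASES:
                issues.append("donor phase must be 0, 1, 2, or Any")
        elif self.donor_site is not None or self.donor_phase is not None:
            issues.append("donor settings do not apply to terminal exons")
        if has_acceptor:
            if self.acceptor_phase not in VALID_PHASES:
                issues.append("acceptor phase must be 0, 1, 2, or Any")
        elif self.acceptor_phase is not None:
            issues.append("acceptor phase does not apply to initial exons")
        return issues


# Tutorial search, using the region 21,916-23,719 derived in the tutorial.
tutorial = SearchSettings("contig12.fasta", "initial", 21916, 23719,
                          "minus", 2, donor_site="GT", donor_phase=0)
print(tutorial.problems())
```

An empty list means the combination of options is consistent with the table; each string in a non-empty list names one conflicting setting.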