Annotation of Drosophila erecta Contig 14. Kimberly Chau Dr. Laura Hoopes. Pomona College 24 February 2009



Table of Contents

I. Overview
   A. Introduction
   B. Final Gene Model
II. Genes
   A. Initial Predictions
   B. UCSC Genome Browser
   C. BLASTx Initial Search
   D. ENSEMBL and tblastn
   E. Gene Model Checker
III. Clustal Analysis
   A. Alignment of CG11148
   B. Alignment of CG11152
   C. Alignment of Sox102F
IV. Synteny
V. Repeats
VI. Discussion
VII. Appendix

I. Overview

A. Introduction: Gene annotation involves taking sequences, determining the elements contained within them, and attaching biological information to those elements. This paper describes the annotation of contig 14 in Drosophila erecta. This contig is bp in length and three complete genes have been identified on it (Figure 1). The three genes are CG11148, CG11152 and Sox102F, in that order, and all are little-studied protein-coding genes. A ClustalW analysis with sequences from other Drosophila species revealed moderate conservation and did not identify any regulatory elements or other important conserved regions. The synteny analysis revealed that the genes on the contig lie in the same relative positions as their orthologs on the fourth chromosome of Drosophila melanogaster (Figure 2). There are slight changes in exon number and length, and one gene differs in orientation relative to its ortholog in D. melanogaster. Lastly, approximately 14.52% of the contig sequence consists of repetitive elements, mainly of the DNA/Helitron type, none of which were 1360 transposons.

B. Final Gene Model: Using various database searches and sequence alignments, the approximate location of contig 14 in D. erecta was determined and a final gene model was constructed.

Figure 1: Completed gene model for contig 14 containing all three predicted genes with exons represented. Large repeats (greater than 300 bp) found by RepeatMasker, all DNA transposons, are also included.

Figure 2: Approximate predicted location of contig 14 on chromosome 4, using the model for D. melanogaster shown in Ensembl.

II. Genes

Contig 14.1-CG11148 (3 isoforms)
Contig 14.2-CG11152
Contig 14.3-Sox102F (2 isoforms)

A. Initial Predictions

The first step after receiving the contig was to review all the data on contig 14 previously assembled by the Genomics Education Partnership at Washington University in St. Louis. The gene predictor Augustus, developed by Mario Stanke and Stephan Waack in 2003, indicated three features in contig 14, all of which appear to be genes. All three were predicted to be transcribed in the positive direction, with the first having three isoforms, the second one isoform and the third three isoforms. The gene predictor Genscan presented a similar picture of the contig, but it predicted only two isoforms for the last feature (Figures 3 and 4).

Figure 3: An initial Genscan search was run to identify all possible features on the sequence. Three features were predicted on the contig.

Figure 4: Genscan predictions for contig 14 presented in the form of a gene model.

B. UCSC Genome Browser

The contig sequence was also examined using the UCSC Genome Browser mirror at goose.wustl.edu. The search turned up three RefSeq genes that closely matched the sequence of contig 14. The first feature, contig 14.1, was shown to have 8 exons, the second feature 3 exons and the last feature 7 exons (Figure 5, Track B). These counts differ from those predicted by both Augustus and Genscan, so the exon numbers alone were not very useful. However, several other pieces of information gathered from the Genome Browser were critical to the progress of the project.

First, all three features were oriented in the same direction, as predicted by Augustus and Genscan. After analyzing each feature exon by exon, it was determined that the three features were probably actual genes rather than pseudogenes, given the lack of premature stop codons. Each exon was examined at the codon level to see whether there was a reading frame free of stop codons in the middle of the exon; that frame is the one in which the exon is likely read. Each intron was also examined to make sure it possessed the proper splice sites (GT at the 5' donor site and AG at the 3' acceptor site).

The presence of ESTs suggested the possibility of isoforms, or differently spliced transcripts of the same gene, for the three genes (Figure 5, Track C). The track for the Drosophila melanogaster gene sequence (Track A) shows that the features in contig 14 correspond to three genes in D. melanogaster: CG11148, CG11152 and Sox102F. These are most likely the orthologs of the genes found in the contig. From this track, the first feature appears to have three isoforms, the second one isoform, and the third two isoforms. This corresponds to the earlier Genscan predictions.
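The codon-level check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used with the Genome Browser: the function name `open_forward_frames` and the toy sequence are hypothetical, and only the three forward frames are scanned because all features on contig 14 were reported on the positive strand.

```python
# Standard stop codons in DNA notation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_forward_frames(exon_seq):
    """Return the forward frames (1, 2 or 3) that contain no stop codon."""
    seq = exon_seq.upper()
    open_frames = []
    for offset in range(3):
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            open_frames.append(offset + 1)
    return open_frames


# Hypothetical toy sequence, not taken from contig 14.
toy_exon = "ATGGCTTAACCGGTA"
print(open_forward_frames(toy_exon))
```

For an internal exon, a frame that survives this scan is a candidate reading frame. The last exon of a gene is expected to end in a stop codon, so it would need to be checked with its final codon excluded.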

Figure 5: UCSC Genome Browser search results for contig 14. Track A depicts the D. melanogaster sequence corresponding to the contig. Track B is the contig 14 sequence itself. Track C indicates the presence of ESTs, which would signify the possibility of isoforms.

C. BLASTx Initial Search

To get a better idea of what the three features might code for, an initial BLASTx search was done using the contig 14 sequence as the query against the Drosophila melanogaster protein database. Hits for entire genomes and/or partial genes were ignored. Restricting the results to the top hits, with E values from 0 to 1e-130, returned matches to the isoforms of CG11148, CG11152 and Sox102F (Figure 6). This reaffirms the results from the UCSC Genome Browser.

Figure 6: Top BLASTx hits corresponding to genes in Drosophila melanogaster.

Each result was then examined to assess the quality of the match between the query and subject sequences. For CG11148, all of the isoforms had identities of 80% or higher with only 0-1% gaps. These searches had been run without the low-complexity filter, yet the results did not contain many low-complexity regions, which would have been marked by lower-case grey letters. The segments of the gene sequence all had positive frames, which matches the previous predictions. This was a strong indication that CG11148 is a good candidate for a gene on the contig. The same was done for the isoforms of CG11152 and Sox102F. Similar results were found, indicating that both genes are also likely features of the contig.

D. ENSEMBL and tblastn

The three potential gene matches were then searched in the gene database ENSEMBL. CG11148 was reported to be on chromosome 4:875, ,652 on the reverse strand in D. melanogaster. Three transcripts, or isoforms, were shown for the gene, named simply Isoforms A, B and C (Figure 7). It is known to be a protein-coding gene.

Figure 7: Location of all three isoforms of CG11148 on chromosome 4.

Using the protein sequence provided on ENSEMBL, each exon was then aligned against the contig 14 sequence using a two-sequence tblastn search. Only the best-matched sections of sequence were counted as part of the exon. The length of the section and the number of mismatched residues determined the soundness of the match. The start and end of each section were recorded to determine the length and location of the sequence with respect to the contig. For example, for the first exon, of the sections that turned up in the tblastn alignment, the piece with 91% identity and a +1 frame was determined to be the correct exon. This criterion of high identity and positive reading frame was used for the rest of the exons. While some of the smaller exons had 100% identity, even the longer ones had at least 80% identity. Similarly, while the reading frame for each exon sometimes shifted to +2 or +3, it never changed direction. Thus, CG11148 aligned very well with the contig, and all three of its isoforms had 8 exons. A preliminary exon map was constructed (Table 1). While Isoforms B and C have the same splice sites, they differ in their transcript sequences. Thus it appears to have been beneficial, for some reason, for D. erecta to have evolved two different transcripts that code for the same protein.

Table 1: Exon map based on tblastn two-sequence alignment results for all three isoforms of CG11148. Values are the number of complete codons in each exon.

Exon #   Isoform A   Isoform B   Isoform C
1        112         112         112
2        49          56          56
3        99          99          99
4        874         874         874
5        316         316         316
6        25          25          25
7        57          57          57
8        41          41          41

CG11152 was reported to be on chromosome 4: 857, , 634 on the reverse strand in D. melanogaster (Figure 8). It has only one transcript and is a known protein-coding gene. The exon from the ENSEMBL protein sequence was aligned against the contig 14 sequence using a two-sequence tblastn search in exactly the same manner as for the first gene. CG11152 in D. erecta also appears to have one exon, with an 88% identity match and a frame of +1. The exon comprises 599 complete codons and is located at on the contig.

Figure 8: Location of the one isoform of CG11152 on chromosome 4.

The last gene that surfaced in the BLASTx search was Sox102F. In ENSEMBL, it was reported to be on chromosome 4: 821, ,639 on the reverse strand in D. melanogaster (Figure 9). There are two transcripts, Isoforms A and B. It is also known to be a protein-coding gene. It is shown to be near CG11152, which fits the previously made gene models.

Figure 9: Location of both isoforms of Sox102F on chromosome 4.
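The selection rule applied to the tblastn results (prefer high identity, require a positive reading frame) can be written down as a small filter. The sketch below is only an illustration under stated assumptions: the `Hsp` record, the function `best_exon_hit`, the 80% default threshold (borrowed from the lowest identity mentioned above) and all coordinates are hypothetical, and real tblastn output would first have to be parsed into such records.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One aligned section from a two-sequence tblastn search (hypothetical record)."""
    start: int       # first aligned base on the contig
    end: int         # last aligned base on the contig
    identity: float  # percent identity of the aligned section
    frame: int       # translation frame, +1..+3 or -1..-3


def best_exon_hit(hsps, min_identity=80.0):
    """Pick the positive-frame section with the highest identity (ties: the longest)."""
    candidates = [h for h in hsps if h.frame > 0 and h.identity >= min_identity]
    if not candidates:
        return None
    return max(candidates, key=lambda h: (h.identity, h.end - h.start))


# Hypothetical sections for one exon; the coordinates are invented for illustration.
sections = [
    Hsp(start=100, end=435, identity=91.0, frame=1),
    Hsp(start=5000, end=5100, identity=95.0, frame=-2),  # wrong strand, rejected
    Hsp(start=700, end=800, identity=60.0, frame=2),     # identity too low, rejected
]
print(best_exon_hit(sections))
```

Sorting on identity first and length second mirrors the emphasis in the text; a stricter rule could also require consecutive exons to appear in increasing order along the contig.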

Each exon from the ENSEMBL protein sequence was aligned with the contig 14 sequence, as was done for the previous two genes. There were 2 exons for Isoform A and 3 exons for Isoform B. The first isoform had exons with identities of 85% and 90% and a reading frame of +3. The second isoform has the same two exons plus an extra exon with 90% identity and a frame of +1. A preliminary exon map was made (Table 2).

Table 2: Exon map based on tblastn two-sequence alignment results for both isoforms of Sox102F. Values are the number of complete codons in each exon.

Exon #   Isoform A   Isoform B
1        305         33
2        275         310
3                    275

E. GEP Gene Model Checker

After creating the basic exon maps for all three genes, it became possible to arrange the genes in the correct order on the contig and to check the soundness of the model. To do this, the UCSC Genome Browser was used once again (Figure 5). The locations of the exons were compared to the model shown on the Browser for contig 14. While the first gene (CG11148) matched very well in the number of exons and in its location on the model, the other two did not. The model had predicted three exons for the second gene (CG11152), yet the alignment yielded only one. This result does match the number of exons in the D. melanogaster ortholog, though. To double check, the protein sequence for that gene was re-aligned against the contig sequence using tblastn. Because similar results were obtained, it was concluded that the predicted gene model on the browser was simply wrong and that CG11152 in D. erecta does have only one exon.

The last gene had the most conflicting data. The UCSC Genome Browser predicted seven exons for the feature, yet the two isoforms of Sox102F seem to have only two and three exons respectively. The D. melanogaster orthologs predict one isoform to have four exons and the other to have eight. Again, to clear up this discrepancy, the sequences for each of the D. melanogaster gene isoforms were aligned via tblastn to the contig sequence. Once more, the alignment produced results similar to the exon map. Thus, it was accepted that the model was incorrect and that the isoforms of Sox102F in D. erecta have fewer exons than predicted.

All of the genes also had their start and stop locations double-checked, noting the location of codons in the Genome Browser. The predicted splice sites for the genes were also recorded from the browser and cross-referenced with those used in the previously formulated exon maps. When all of this was done, the gene models for all isoforms of each of the three genes were examined using the GEP Gene Model Checker (Figure 10). If missing start codons, stop codons or splice sites were detected, the model was not considered complete. While the model for the second gene, CG11152, was declared valid without extra modifications, the other models required some refining. This was done by comparing the beginning and end locations of each exon reported on the Genome Browser and systematically adding and subtracting bases in the model. Sometimes the exact value reported on the Browser was enough to fix the model. Occasionally, however, the coordinates had to be adjusted by trial and error until the Model Checker was satisfied. All the models eventually passed, and the resulting exon locations were deemed the most reliable and were used to create the final maps (Table 3).

Figure 10: Gene Model Checker results for one isoform of CG

Table 3: Final exon map created from the results of the Gene Model Checker. Values are the number of complete codons in each exon.

CG11148
Exon #   Isoform A   Isoform B   Isoform C
1        112         112         112
2        49          56          56
3        99          99          99
4        874         874         874
5        316         316         316
6        25          25          25
7        57          57          57
8        41          41          41

Sox102F
Exon #   Isoform A   Isoform B
1        305         33
2        275         310
3                    275

CG11152
Exon #   Single isoform
1        599

From this, the complete gene map presented at the start of this paper (Figure 1) was created.
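The kinds of checks described in this section (start codon, stop codon, splice sites) can be approximated with a short script. The sketch below is not the GEP Gene Model Checker and makes no claim about how that tool works; `check_gene_model`, its messages and the toy contig are hypothetical, and it assumes a plus-strand gene whose exons are given as 1-based inclusive coordinates with canonical GT...AG introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(contig, exons):
    """Return a list of problems for a plus-strand model; an empty list means none found.

    exons is an ordered list of (start, end) pairs, 1-based and inclusive.
    """
    problems = []
    cds = "".join(contig[start - 1:end] for start, end in exons).upper()
    if not cds.startswith("ATG"):
        problems.append("missing start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    # Every codon before the final one must be a sense codon.
    internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("premature stop codon")
    # Each intron should begin with GT (donor) and end with AG (acceptor).
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = contig[end1:start2 - 1].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"non-canonical splice sites in intron {end1 + 1}-{start2 - 1}")
    return problems


# Hypothetical toy contig: exon (1-6), intron (7-18), exon (19-24).
toy_contig = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
print(check_gene_model(toy_contig, [(1, 6), (19, 24)]))  # correct boundaries
print(check_gene_model(toy_contig, [(1, 6), (20, 24)]))  # second exon starts one base late
```

Shifting an exon boundary by a single base, as in the second call, breaks both the reading frame and the splice site, which is why adding or subtracting bases one at a time can bring a nearly correct model into agreement.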

III. Clustal Alignments

Sequence alignments were done for the coding regions of all three genes in the contig against D. melanogaster and four other Drosophila species. Because the contig 14 DNA sequence includes all the coding and non-coding regions contained within it, the DNA sequence for each separate gene in the contig was obtained from the UCSC Genome Browser. The same was done to obtain the DNA sequence of the corresponding ortholog in D. melanogaster. Using FlyBase, four orthologs were chosen from the database list and the DNA sequence was taken for each. The last sequence used was the known DNA sequence for the ortholog in D. erecta, also from FlyBase. It was included to see how well the predicted contig sequence for the gene would match a listed sequence for the same gene. A list of sequences from contig 14, D. melanogaster, D. ananassae, D. erecta, D. mojavensis, D. virilis, and D. yakuba was compiled and then input into the ClustalW program.

A. Alignment of CG11148

Alignment of the first gene on contig 14 was done using the gene-coding DNA sequences from the following seven sources, input in the order listed below:

Sequence 1-D. melanogaster (CG11148)
Sequence 2-Contig 14.1
Sequence 3-D. ananassae (GF22673)
Sequence 4-D. erecta (GG16434)
Sequence 5-D. mojavensis (GI14083)
Sequence 6-D. virilis (GJ15866)
Sequence 7-D. yakuba (GE14500)

ClustalW aligns all seven sequences against one another and then produces a score showing the percentage of conservation between each pair of sequences (Figure 11). As shown in the table and the cladogram (Figure 12) below, the contig 14.1 sequence appears to align best with its ortholog in D. melanogaster but is very poorly conserved relative to all other species.

Figure 11: ClustalW alignment scores for contig 14.1 (CG11148) against the five ortholog sequences and the predicted D. erecta sequence demonstrate poor conservation across species except for D. melanogaster.
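To make the pairwise score concrete, the sketch below computes a percent identity from two already-aligned sequences, counting identical positions over the positions where neither sequence has a gap. This is believed to match how ClustalW reports its pairwise scores, but that is an assumption here, not something taken from the alignment output; the function name `pairwise_identity` and the toy sequences are hypothetical.

```python
def pairwise_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gap columns are excluded from the comparison
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical toy alignment, not taken from the contig 14 data.
print(round(pairwise_identity("ATGC-TA", "ATGAATA"), 1))
```

Under this definition a score can only be as good as the alignment behind it, so a very low score between two close relatives may point to a problem with the input sequences or the alignment rather than to true divergence.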
Two puzzling results emerge from this search. First, the pairwise alignment of contig 14.1 against the same predicted gene in D. erecta, the ortholog of CG11148 in D. melanogaster, yields a poor match. It was first thought that the sequence extracted from the UCSC Genome Browser may not have been the most refined sequence to use and might therefore have caused a bad alignment. The sequence produced using the Gene Model Checker was therefore extracted for the gene and used in a realignment with the other six sequences. The realignment yielded practically the same results, however, with no improvement in the alignment between the two D. erecta sequences. It is possible that the sequence on FlyBase was not based on research as extensive as that behind the one provided by the Genomics Education Partnership. However, the nature of the two sequences should be investigated further before definitive conclusions are drawn.

The second inconsistent result, which may be a product of the first, is contig 14's alignment with the D. yakuba sequence. FlyBase places D. erecta within the same melanogaster subgroup as D. melanogaster and D. yakuba. It would therefore make sense for the contig 14.1 sequence to show higher conservation with those two species than with the rest of the orthologs. The predicted D. erecta gene does, in fact, show good alignment with the D. yakuba sequence. However, contig 14 does not align with that sequence at all, as the alignment score is 0. This is especially odd because searching the same contig 14 sequence using BLASTx yields D. yakuba GE14500 as the top hit, with an E value of 0 and an identity of 84%. These ClustalW alignments may be worth further investigation in order to understand what is happening.

Examining the alignment, conservation appears to be better in the middle of the sequences than at the ends, with the regions near the N-terminus having poorer conservation than those at the C-terminus (Figure 12). Because CG11148 is a protein-coding gene with no known functions beyond that, it is possible that this lack of conservation emerged from selective pressures shaping the specific protein encoded. The start of the conserved middle region appears around bps.

Figure 12: ClustalW alignment of contig 14.1 (CG11148) showing A) low conservation in the regions closer to the N-terminus, B) a transition towards good conservation in the middle of the sequence and C) somewhat poorer conservation towards the C-terminus.

A BLASTx search done using the same contig 14 DNA sequence for CG11148 against the nr protein database provided further support. No results with significant E values showed alignments beginning before the 32 bp mark between the contig sequence and sequences from other Drosophila species.

B. Alignment of CG11152

To investigate the issues that arose during the alignment of the first gene, a ClustalW alignment was also performed for the second gene. This was done to see whether the problem lay with the technique or with the particular sequence used. As for the first gene, alignment of the second gene on contig 14 was done using the gene-coding DNA sequences from the following seven sources, input in the order listed below:

Sequence 1-D. melanogaster (CG11152)
Sequence 2-Contig 14.2
Sequence 3-D. ananassae (GF10343)
Sequence 4-D. erecta (GG16435)
Sequence 5-D. mojavensis (GI14084)
Sequence 6-D. virilis (GJ15974)
Sequence 7-D. yakuba (GE14501)

ClustalW aligns all seven sequences against one another and then produces a score showing the percentage of conservation between each pair of sequences (Figure 13). As shown in the table below, the contig 14.2 sequence appears to align best with its orthologs in D. melanogaster and D. yakuba.

Figure 13: ClustalW alignment scores for contig 14.2 (CG11152) against the five ortholog sequences and the predicted D. erecta sequence demonstrate medium conservation across species, with good conservation for D. melanogaster and D. yakuba.

In this alignment, the problems observed with CG11148 were not as much of an issue. While the contig sequence does not have a perfect score in the pairwise alignment with the predicted D. erecta sequence, it has a very good one of 91. This indicates that there are perhaps differences between the two sequences, once more due to the amount of refinement done on them. Also, the contig 14 sequence aligns very well with that of D. yakuba, as expected. This suggests that there may be an issue with one or more of the sequences used in the first alignment, and that any further ClustalW work on this contig should focus on the first gene.

Unlike CG11148, although CG11152 is also a protein-coding gene, its N-terminus seems to be fairly conserved between species. In fact, the alignment transitions at around to a poorly conserved middle region that lasts all the way to the C-terminus (Figure 14). This is probably also an effect of evolutionary pressures that shape the gene to the specific needs of the species. Because the N-terminal region seemed decently conserved, the area was examined more closely to see whether any conserved regulatory elements could be found. The upstream sequences of all the homologous genes were examined, but no meaningful evidence for a 5' UTR was discovered. There does not appear to be a common regulatory element sequence conserved across species within this contig.

Figure 14: ClustalW alignment of contig 14.2 (CG11152) showing A) high conservation in the regions closer to the N-terminus, B) a transition towards poor conservation in the middle of the sequence and C) very poor conservation towards the C-terminus.

Once again, a BLASTx search was done using the CG11152 sequence against the nr protein database. The results of the Clustal alignment were reaffirmed, since no hits with significant E values had alignments against the query extending past 1662 bp.

C. Alignment of Sox102F

As before, the alignment of the third gene on contig 14 was done using the gene-coding DNA sequences from the following seven sources, input in the order listed below:

Sequence 1-D. melanogaster (Sox102F)
Sequence 2-Contig 14.3
Sequence 3-D. ananassae (GF20041)
Sequence 4-D. erecta (GG16436)
Sequence 5-D. mojavensis (GI14086)
Sequence 6-D. virilis (GJ16151)
Sequence 7-D. yakuba (GE14503)

ClustalW aligns all seven sequences against one another and then produces a score showing the percentage of conservation between each pair of sequences (Figure 15). As was observed for the previous two genes, the contig 14.3 sequence appears to align best with its ortholog in D. melanogaster, with medium conservation in the other species. Because there was perfect alignment between the contig sequence and the predicted FlyBase D. erecta sequence, the pairwise alignment was interpreted with less caution than for the other two genes, since the data seemed consistent.

Figure 15: ClustalW alignment scores for contig 14.3 (Sox102F) against the five ortholog sequences and the predicted D. erecta sequence demonstrate medium conservation across species but perfect alignment with D. melanogaster.

Sox102F also showed a different pattern of alignment. The gene is also protein-coding but, unlike the other two, the middle of the sequence is not so well conserved (Figure 16). There are stretches of sequence where the alignment is poor and others where it is very good. Also, while the N-terminus was poorly conserved, the C-terminus appeared to be highly conserved, with no clear transition point observed.

Figure 16: ClustalW alignment of contig 14.3 (Sox102F) showing A) poor conservation in the regions closer to the N-terminus, B) an uneven distribution of conservation in the middle of the sequence and C) good conservation towards the C-terminus.

Running a BLASTx search using the contig 14 sequence for Sox102F against the nr protein database yielded consistent results. Among the hits with the best E values, no alignments began before the 8800 bp mark, which is where the well-conserved region begins.

IV. Synteny

The three genes in contig 14 appear in the same region of the dot chromosome, chromosome 4, as in D. melanogaster. The relative locations and ordering of the genes have been preserved, provided one takes into account that the D. melanogaster sequence is on the reverse strand and is read from the 5' to the 3' end (Figure 17).

Figure 17: The ordering of the three genes from the 5' to the 3' end on chromosome 4 in D. melanogaster is the same as in the contig model.

In terms of orientation, however, the three genes on contig 14 appear to be read in exactly the same direction. This follows from the fact that all of the exons have positive reading frames and are shown as transcribed in the same direction on the UCSC Genome Browser (Figure 5). This differs from the picture of the three genes in D. melanogaster, where CG11148 appears to run in the reverse direction (Figure 18).

Figure 18: CG11148 is shown as read in the reverse direction in D. melanogaster, in comparison to CG11152 and Sox102F, on the UCSC Genome Browser.

There are fewer exons for CG11148 and Sox102F in D. erecta than in D. melanogaster for all of the isoforms (Table 4). The exons are of roughly the same length, with notable differences highlighted. Sox102F contains a large intron after the first exon in isoform B of the contig sequence. This is also apparent in the D. melanogaster sequence for the same isoform.
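The ordering argument can be checked with a small comparison. The sketch below is an illustration only: the positions are the leading digits of the chromosome 4 coordinates reported from ENSEMBL earlier in this paper (875, 857 and 821, read here as approximate kb values because the remaining digits are not available), and the helper `order_along_strand` is a hypothetical name.

```python
def order_along_strand(positions_kb, strand):
    """List gene names in 5'-to-3' order for the given strand ('+' or '-')."""
    # On the reverse strand, 5'-to-3' runs from high to low coordinates.
    descending = strand == "-"
    ranked = sorted(positions_kb.items(), key=lambda item: item[1], reverse=descending)
    return [name for name, _ in ranked]


# Approximate D. melanogaster positions on chromosome 4, in kb (leading digits only).
dmel_kb = {"CG11148": 875, "CG11152": 857, "Sox102F": 821}

# Gene order on contig 14 (features 14.1, 14.2, 14.3).
contig14_order = ["CG11148", "CG11152", "Sox102F"]

print(order_along_strand(dmel_kb, "-") == contig14_order)
```

Reading the D. melanogaster region along the forward strand instead would reverse the list, which is why the strand has to be taken into account before concluding that gene order is conserved.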

Table 4: Comparison of exon number and length in contig 14 in D. erecta and the same region in D. melanogaster.

D. erecta                           D. melanogaster (cDNA)
Gene     Isoform  Exon #  Length    Gene     Isoform  Exon #  Length
CG11148  A        1       336       CG11148  A        1       57
CG11148  A        2       147       CG11148  A        2       364
CG11148  A        3       295       CG11148  A        3       150
CG11148  A        4       2529      CG11148  A        4       295
CG11148  A        5       966       CG11148  A        5       2619
CG11148  A        6       75        CG11148  A        6       948
CG11148  A        7       155       CG11148  A        7       75
CG11148  A        8       123       CG11148  A        8       170
                                    CG11148  A        9       282
CG11148  B        1       336       CG11148  B        1       57
CG11148  B        2       168       CG11148  B        2       364
CG11148  B        3       295       CG11148  B        3       171
CG11148  B        4       2529      CG11148  B        4       295
CG11148  B        5       966       CG11148  B        5       2619
CG11148  B        6       75        CG11148  B        6       948
CG11148  B        7       155       CG11148  B        7       75
CG11148  B        8       123       CG11148  B        8       170
                                    CG11148  B        9       282
CG11148  C        1       336       CG11148  C        1       51
CG11148  C        2       168       CG11148  C        2       364
CG11148  C        3       295       CG11148  C        3       171
CG11148  C        4       2529      CG11148  C        4       295
CG11148  C        5       966       CG11148  C        5       2619
CG11148  C        6       75        CG11148  C        6       948
CG11148  C        7       155       CG11148  C        7       75
CG11148  C        8       123       CG11148  C        8       170
                                    CG11148  C        9       282
CG11152  A        1       1812      CG11152  A        1       1800
Sox102F  A        1       909       Sox102F  A        1       160
Sox102F  A        2       825       Sox102F  A        2       930
                                    Sox102F  A        3       828
Sox102F  B        1       99        Sox102F  B        1       635
Sox102F  B        2       924       Sox102F  B        2       66
Sox102F  B        3       825       Sox102F  B        3       118
                                    Sox102F  B        4       930
                                    Sox102F  B        5       1681

V. Repeats

After determining the nature of the genes found within contig 14, further work was done to uncover other elements contained within the sequence. RepeatMasker was run on the contig sequence to determine the amount of repetitive sequence it contains. The output revealed that 14.52%, or 9437 bp, of the contig was composed of repetitive sequence (Figure 19). The majority of the repeats were DNA transposons, which made up 13.69% of the sequence; however, no 1360 transposons were found.
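The repeat figures in this section can be cross-checked with simple arithmetic. The short script below uses only numbers reported in this paper (9437 bp, 14.52%, 13.69%, and the 4181 bp of repeat sequence reported later in this section for the first intron of Sox102F isoform B); the derived contig length and DNA transposon total are back-calculated estimates, not reported values, and the variable names are arbitrary.

```python
repeat_bp = 9437           # repetitive sequence reported by RepeatMasker
repeat_fraction = 0.1452   # 14.52% of the contig
dna_fraction = 0.1369      # DNA transposons, 13.69% of the contig
intron_repeat_bp = 4181    # repeat sequence inside the large Sox102F intron

# Contig length implied by the two repeat figures (an estimate, not a reported value).
implied_contig_bp = repeat_bp / repeat_fraction

# Approximate amount of DNA transposon sequence, derived from that estimate.
dna_transposon_bp = dna_fraction * implied_contig_bp

# Share of all repeat sequence that falls inside the intron.
intron_share = intron_repeat_bp / repeat_bp

print(f"implied contig length: about {implied_contig_bp:,.0f} bp")
print(f"DNA transposon sequence: about {dna_transposon_bp:,.0f} bp")
print(f"share of repeats in the intron: {intron_share:.1%}")
```

The last figure agrees with the roughly 44% quoted for the intron, which suggests the 9437 bp and 4181 bp values are internally consistent.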
Simple repeats (0.47%), low complexity regions (0.92%) and unclassified repeats (0.39%) also make up a smaller part of the contig. In addition, one retroelement, a LINE element of the class L2/CR1/Rex, makes up 0.44% of the sequence. A table was compiled to examine the nature of the larger repeats, those longer than 300 bp (Table 5). They are all DNA transposons and fall in the non-coding regions of the sequence or within the large intron of Sox102F.

Table 5: Repeat elements larger than 300 bp are all DNA transposons. Columns: SW score, percent, percent deletion, percent insertion, sequence class, begin, end, length. Sequence classes, in order: DNAREP1_DM, DNAREP1_DM, DNAREP1_DM, DNAREP1_DM, DNAREP1_DM, Helitron _DYak, DNAREP1_DM.

Figure 19: RepeatMasker output for contig 14 revealing the large proportion of DNA transposons in the sequence.

One of the biggest differences between the D. melanogaster sequence and that of contig 14 occurs in the exon makeup of Sox102F. Isoform B of the gene in the contig has a large first intron that spans to 59379, bps long. Of those 2022 bps, 4181 bps are repeat sequence, so that roughly 44% of the total repeat sequence falls within the intron. This is an intriguing find and may be worth examining more closely.

VI. Discussion

Contig 14 in D. erecta contains three features, all of which appear to be complete orthologs of genes found in D. melanogaster. CG11148, CG11152 and Sox102F lie in a syntenic stretch like that of the dot chromosomes in the other species. The gene models (Figure 1 and Table 3) constructed for all three genes appear to be sound after a series of careful BLAST searches, exon alignments, and determination of splice sites.

ClustalW analysis was done for all three genes found on the contig. While there is good conservation between D. melanogaster and D. erecta for all three genes, the sequences were not conserved as well across other Drosophila species. It might be worthwhile to attempt such an alignment against orthologs of these genes in more distantly related organisms, such as mice or even humans. The ClustalW alignment done for the first gene, CG11148, should be reviewed because of the unexpectedly poor alignments against the FlyBase predicted D. erecta sequence and the D. yakuba sequence. Also, the ClustalW analysis done to find conserved upstream regulatory elements in CG11152 was not very thorough. Despite the lack of evidence, it would perhaps be worthwhile to re-examine the region just to be certain.

RepeatMasker also helped characterize the contig beyond the three main features. It showed that 14.52% of the sequence was repeats and that the majority of those were DNA/Helitron elements. About 44% of the repeat sequence appears to lie within the first intron of isoform B of Sox102F. It is uncertain at this point what this might mean, but this region might be of interest when determining the heterochromatin makeup of D. erecta chromosome four.

VII. Appendix

GENES *Extracted from Gene Model Checker Results*

A.
CG ) Isoform A Protein Sequence Fasta >Dere2_contig14_Dere/CG11148_CG11148-PA_pep MTDSMKFGPEWLRNMSAEPSSSPSTYNVGTGAQNISIGGHNLGNNTTAST SRNLFPEYRYGREEMLSLFDRNCLLPHILPSFRKLFVEKVQYPLALTPSS EEDTNQNSLGNNRSPGGFGSASRGSGRGGTVDRGRMRGKSAYHPIYQRPS GLYDESLSVISAERTWSDRNGTGDSAATTTSTSGPGGIDWNGTPSSSPRK DYSSHHRNLENWRRTRNEDGSGDGPATSGSIGGPDIAGWRSGVVGGSTST SFGTNSHRWVGTLGTDRDRTGNSKGSGMGVAEPGGSTSHPRLSSQLWTVN SAGGVDVDENLPEWAMENPSELGGSFDASGAFHGDTDLKLNKSSHILKTE SLNSDNDVTNQKRKDLSDADSVKDKTSETLLTKDSNSAAVQEEVESSLSP KSSTTTKKEEIHGDISERIKEVADEVEKLIMDDDHKSSANQSELQNDDRF TAALPSLAAIEISIEPSVTGVQQQAPSTMPIRVPNTITDVAHPAHQHPGV SFSDHETLQHHNMHLPHFPMLPTPHMINSNLNELWFYRDPQANVQGPFSA IEMTEWYRAGYFNENLFVRRYSENRFRPLGELIKFCHGNMPFTHSHLLPS PIDLENLSVGQIPTPLTASLSITPHKPSPIPIALSVVEQQLQQQRDEHLK ENVTATAESLSAAIKGNFSGNSISNTSHLLTMRFQMLQDQYLQHQEYQIL AELSKNECFQRLSAAEQETVVRRKVQMLVLPEYLISLNGLSNSLSVLNPV AGRQLYSTVVEQAKKDQQHIFANNSEHQRSVGNLLDANNFILNAQIMHQQ SQQEVGALAASVDCIMQGGTAPDLSKPNEQPRNELDLINEYNLRMLLRGQ PTSTQQHPPSLPNSANENLSGVDFLTETQLLQRQNLMIPIWLPPNKQQQS DQQWAGMTNAEASLWGVSHLNEERNDDQQLYVQKSSEACFVDTKKDVKIS PLLQVQSGDIVKHSTSGDLDQTAENLKNSHNQKIVKSLVSDIQQNHKEQN SHQHQAKQANKQNLNTKQNATQPALVKQINEDDRKREQTEEKKKQKEERK RQQLEDEKRRALHESEERARQIREEKERQQQIQAQRRKALIGNAESVQSG TPGTFASAQGNRNDAAKTAEPQASSRLPSTSVAPWSLQSPNSMSTAPGLA EIQKAERRERRADQQRHQELLDKQLRANAAAAAEANDALLKWQSTPASAP

VISLAEIQAEEARRLANDLVDQQRRRELEHHQQAPLSSAVLVASATSNIW GNANKAWSSSASQSLSLRTSSGTGLWDEPNALGSIQPIYGSGTSCASSVT AAAVLAGGLNSTSKSNLQAQNKSSALFASPRNLRKSQTVPALNNPGKANK SGPGQRPEKQNLAQIRSKGPPVSVEEKEKERKTNVKSHVQQSSTDQVISK VNEYENEFTSWCIKSLDNMSAKVDVPTFVAFLQDLEAPYEVKDYVRIYLG DGKDSLDFAKQFLERRSKYKSLQRAQNAHNDDMCKPAPAITPSANDYADS KNKQKKIKKNKMTKMDARILGFSVTAAEGRINVGIRDYVEGP*
Nucleotide Sequence Fasta
>Dere2_contig14_Dere/CG11148_CG11148-PA_cds
ATGACAGATTCAATGAAATTTGGTCCGGAATGGTTACGCAATATGTCAGC CGAGCCTTCGAGCTCTCCCAGTACCTACAACGTTGGTACTGGTGCTCAAA ACATCTCGATTGGAGGGCACAACCTGGGAAACAACACGACAGCATCCACT TCGCGTAATCTATTTCCAGAATACCGGTACGGACGCGAGGAAATGCTGTC CTTGTTCGATCGGAATTGCCTACTGCCTCATATCCTACCATCGTTCAGAA AGCTCTTCGTGGAGAAGGTCCAGTACCCGCTTGCACTAACACCGAGCTCG GAGGAGGACACCAACCAAAACTCGCTTGGCAATAACCGCTCGCCCGGTGG ATTTGGTAGTGCCTCCCGAGGATCTGGACGTGGTGGAACAGTCGACCGGG GCCGAATGCGCGGAAAATCTGCATATCATCCAATATACCAACGCCCAAGC GGTCTATATGATGAAAGCTTATCGGTAATATCAGCCGAACGCACATGGAG CGATCGCAACGGAACTGGAGATTCTGCGGCTACCACCACCTCCACTAGTG GCCCTGGTGGTATAGATTGGAACGGAACGCCAAGCTCAAGTCCTCGAAAA GATTATTCTAGTCATCATCGCAACTTGGAAAATTGGCGACGAACACGTAA CGAAGATGGATCCGGAGATGGTCCAGCTACCAGCGGTTCCATCGGAGGAC CCGATATTGCTGGTTGGCGGAGCGGTGTCGTCGGTGGAAGTACAAGCACA AGTTTTGGTACCAATAGCCATCGCTGGGTTGGGACTTTAGGTACAGATCG AGACCGCACTGGAAACAGTAAAGGATCTGGAATGGGAGTTGCAGAGCCGG GAGGAAGCACATCGCACCCACGCTTGTCGAGCCAATTATGGACTGTTAAT AGTGCAGGCGGTGTTGATGTCGACGAAAATCTTCCCGAATGGGCAATGGA AAATCCATCGGAGTTGGGTGGCAGTTTTGATGCTAGTGGAGCGTTTCATG GAGATACCGATCTAAAACTTAACAAAAGTTCGCATATTTTAAAAACCGAA AGCTTAAACTCCGATAACGATGTGACAAACCAAAAAAGAAAAGACCTGTC TGATGCAGATAGTGTTAAAGATAAAACCTCAGAAACCTTATTAACGAAAG ATTCCAATTCAGCAGCTGTGCAAGAAGAAGTTGAAAGCAGTTTATCCCCA AAAAGCTCTACAACGACAAAGAAGGAAGAGATCCATGGAGATATTTCAGA ACGGATCAAAGAGGTCGCCGATGAAGTAGAAAAACTTATAATGGATGATG ATCATAAAAGCTCGGCGAATCAAAGCGAACTTCAAAATGATGACCGGTTT ACAGCTGCACTACCAAGTCTAGCAGCCATTGAGATAAGCATAGAACCGAG TGTTACAGGAGTGCAACAGCAAGCGCCTTCCACCATGCCCATACGAGTTC CCAATACAATTACAGATGTCGCTCATCCAGCCCATCAGCATCCAGGTGTT
TCGTTTTCGGATCATGAGACTTTGCAACATCATAACATGCATCTACCACA TTTTCCGATGCTTCCTACACCGCATATGATCAATTCAAATCTGAATGAAT TGTGGTTTTACCGGGATCCGCAGGCAAATGTACAGGGGCCATTCAGTGCC

ATTGAGATGACCGAATGGTATCGCGCTGGCTACTTCAATGAGAACCTCTT TGTACGCCGGTACTCTGAGAATAGGTTTAGACCACTGGGAGAGCTTATAA AATTTTGTCATGGTAACATGCCATTTACGCACAGTCACTTGCTTCCTTCG CCTATAGACCTAGAGAACCTTTCTGTTGGTCAAATACCAACCCCTCTTAC AGCGTCCCTCTCAATTACACCCCATAAGCCATCACCAATTCCTATCGCAC TGTCTGTTGTTGAACAGCAGTTGCAGCAGCAAAGAGATGAGCATCTGAAG GAAAATGTAACCGCAACCGCTGAATCTCTAAGTGCTGCAATAAAAGGAAA TTTTAGCGGAAATAGCATTAGTAATACATCTCATTTGCTTACAATGCGGT TTCAAATGCTTCAGGATCAGTACTTACAGCACCAGGAATACCAAATACTA GCTGAGCTGTCGAAAAATGAATGCTTTCAGCGGCTTTCGGCTGCCGAGCA GGAAACAGTTGTTCGTCGGAAAGTTCAAATGCTGGTTCTTCCTGAGTATT TGATTAGTTTAAACGGATTAAGCAACTCCTTGTCCGTACTGAACCCCGTC GCCGGAAGACAGTTATACAGTACAGTGGTTGAGCAGGCCAAGAAAGATCA GCAACATATTTTTGCAAACAACAGCGAGCATCAACGTTCAGTGGGCAATT TACTAGATGCTAATAATTTTATTCTAAACGCCCAAATAATGCATCAGCAA TCGCAACAAGAGGTAGGTGCCTTGGCAGCATCCGTTGATTGTATTATGCA AGGTGGAACTGCACCCGACCTTAGTAAGCCTAATGAACAGCCAAGGAATG AGTTGGACTTAATTAATGAATACAACTTACGGATGCTTTTGCGGGGCCAA CCAACAAGTACTCAACAGCATCCACCTTCGCTGCCGAACTCTGCTAACGA AAATCTCTCTGGAGTGGATTTTTTAACTGAAACACAATTGTTACAGAGGC AAAATTTAATGATTCCTATCTGGTTACCCCCTAACAAGCAACAACAATCC GACCAACAGTGGGCTGGAATGACTAACGCGGAAGCATCATTATGGGGAGT GAGCCACTTAAATGAAGAGCGTAATGACGATCAGCAACTATATGTGCAGA AGTCTTCTGAGGCGTGCTTTGTAGATACAAAAAAAGATGTGAAAATTTCA CCATTATTACAAGTTCAATCGGGAGATATTGTTAAACACAGCACTTCAGG GGATTTAGATCAGACCGCTGAAAATTTAAAGAATTCACATAATCAAAAAA TAGTCAAATCCCTTGTTTCCGACATTCAACAGAATCATAAGGAACAAAAT TCACATCAGCACCAGGCAAAGCAGGCTAATAAGCAGAATCTGAATACAAA ACAGAATGCGACACAGCCAGCCTTAGTTAAGCAAATTAATGAAGATGATC GCAAAAGAGAGCAGACAGAAGAAAAAAAAAAGCAAAAGGAGGAACGCAAG CGCCAGCAATTGGAAGATGAAAAACGTAGGGCGTTGCATGAATCTGAAGA ACGAGCCCGCCAAATTCGAGAGGAAAAGGAGAGGCAACAGCAAATACAAG CCCAACGTCGAAAGGCATTAATAGGCAATGCTGAATCAGTTCAAAGTGGG ACTCCAGGAACATTCGCGTCTGCACAAGGCAACAGGAACGACGCAGCCAA AACAGCAGAGCCGCAAGCATCCTCTCGCTTACCATCTACATCCGTAGCGC CTTGGTCTCTGCAGTCTCCAAATTCTATGAGCACTGCGCCTGGTCTCGCA GAGATACAAAAGGCAGAACGTCGAGAGCGTCGCGCAGACCAGCAGCGACA TCAAGAGCTATTAGATAAGCAATTGCGTGCCAATGCTGCAGCTGCAGCTG AAGCCAATGATGCTCTGCTCAAATGGCAGTCAACGCCAGCGTCGGCCCCC
GTAATAAGTCTAGCCGAGATTCAAGCGGAAGAGGCAAGACGGTTGGCCAA CGACCTTGTGGATCAGCAGCGTCGACGCGAATTGGAACATCACCAACAAG CTCCTCTATCATCAGCGGTTTTGGTAGCAAGTGCAACTTCCAACATCTGG GGTAACGCTAATAAAGCATGGAGCTCGTCCGCTTCTCAATCACTTTCATT

AAGAACAAGTTCTGGAACTGGTCTATGGGACGAACCGAATGCACTAGGTT CTATTCAACCAATTTATGGATCTGGAACAAGCTGTGCCAGTTCCGTAACT GCGGCGGCAGTCCTGGCAGGAGGATTGAACTCAACTAGTAAATCCAATCT ACAAGCTCAAAATAAGTCTTCGGCTTTATTTGCGTCGCCTCGAAATTTGC GCAAGAGTCAAACAGTGCCAGCCTTAAATAACCCAGGAAAGGCAAATAAA AGTGGACCAGGACAACGGCCAGAGAAACAAAATTTGGCCCAAATTCGTTC AAAAGGTCCACCTGTCTCAGTTGAAGAGAAGGAAAAAGAAAGAAAGACGA ATGTAAAAAGTCATGTGCAGCAAAGCAGCACCGACCAAGTCATTAGCAAG GTTAATGAGTATGAAAACGAGTTCACTAGCTGGTGCATAAAGAGCTTAGA TAATATGTCCGCTAAAGTCGATGTACCCACGTTCGTGGCATTCTTGCAGG ACTTGGAAGCGCCATATGAAGTAAAGGACTATGTCCGAATATACCTTGGT GATGGAAAAGATTCTTTGGATTTCGCAAAACAGTTTTTGGAGCGACGTAG CAAATACAAAAGCTTGCAACGTGCCCAAAATGCACACAATGACGATATGT GCAAACCGGCTCCTGCTATTACTCCATCTGCGAACGACTATGCTGACAGC AAGAACAAGCAGAAAAAGATTAAAAAGAATAAGATGACTAAGATGGACGC CCGCATTCTTGGATTTTCAGTAACAGCTGCCGAGGGTCGCATAAATGTTG GCATTCGAGACTATGTCGAAGGACCA
GFF (See Folder for file)
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP start_codon gene_id "Dere/CG11148"; trans_id "CG11148-PA"
contig14 GEP stop_codon gene_id "Dere/CG11148"; trans_id "CG11148-PA";
2) Isoform B
Protein Sequence Fasta
>Dere2_contig14_Dere/CG11148_CG11148-PB_pep
MTDSMKFGPEWLRNMSAEPSSSPSTYNVGTGAQNISIGGHNLGNNTTAST SRNLFPEYRYGREEMLSLFDRNCLLPHILPSFRKLFVEKVQYPLALTPSS

EEDTNQNSLGNNSRPAWLQRSPGGFGSASRGSGRGGTVDRGRMRGKSAYH PIYQRPSGLYDESLSVISAERTWSDRNGTGDSAATTTSTSGPGGIDWNGT PSSSPRKDYSSHHRNLENWRRTRNEDGSGDGPATSGSIGGPDIAGWRSGV VGGSTSTSFGTNSHRWVGTLGTDRDRTGNSKGSGMGVAEPGGSTSHPRLS SQLWTVNSAGGVDVDENLPEWAMENPSELGGSFDASGAFHGDTDLKLNKS SHILKTESLNSDNDVTNQKRKDLSDADSVKDKTSETLLTKDSNSAAVQEE VESSLSPKSSTTTKKEEIHGDISERIKEVADEVEKLIMDDDHKSSANQSE LQNDDRFTAALPSLAAIEISIEPSVTGVQQQAPSTMPIRVPNTITDVAHP AHQHPGVSFSDHETLQHHNMHLPHFPMLPTPHMINSNLNELWFYRDPQAN VQGPFSAIEMTEWYRAGYFNENLFVRRYSENRFRPLGELIKFCHGNMPFT HSHLLPSPIDLENLSVGQIPTPLTASLSITPHKPSPIPIALSVVEQQLQQ QRDEHLKENVTATAESLSAAIKGNFSGNSISNTSHLLTMRFQMLQDQYLQ HQEYQILAELSKNECFQRLSAAEQETVVRRKVQMLVLPEYLISLNGLSNS LSVLNPVAGRQLYSTVVEQAKKDQQHIFANNSEHQRSVGNLLDANNFILN AQIMHQQSQQEVGALAASVDCIMQGGTAPDLSKPNEQPRNELDLINEYNL RMLLRGQPTSTQQHPPSLPNSANENLSGVDFLTETQLLQRQNLMIPIWLP PNKQQQSDQQWAGMTNAEASLWGVSHLNEERNDDQQLYVQKSSEACFVDT KKDVKISPLLQVQSGDIVKHSTSGDLDQTAENLKNSHNQKIVKSLVSDIQ QNHKEQNSHQHQAKQANKQNLNTKQNATQPALVKQINEDDRKREQTEEKK KQKEERKRQQLEDEKRRALHESEERARQIREEKERQQQIQAQRRKALIGN AESVQSGTPGTFASAQGNRNDAAKTAEPQASSRLPSTSVAPWSLQSPNSM STAPGLAEIQKAERRERRADQQRHQELLDKQLRANAAAAAEANDALLKWQ STPASAPVISLAEIQAEEARRLANDLVDQQRRRELEHHQQAPLSSAVLVA SATSNIWGNANKAWSSSASQSLSLRTSSGTGLWDEPNALGSIQPIYGSGT SCASSVTAAAVLAGGLNSTSKSNLQAQNKSSALFASPRNLRKSQTVPALN NPGKANKSGPGQRPEKQNLAQIRSKGPPVSVEEKEKERKTNVKSHVQQSS TDQVISKVNEYENEFTSWCIKSLDNMSAKVDVPTFVAFLQDLEAPYEVKD YVRIYLGDGKDSLDFAKQFLERRSKYKSLQRAQNAHNDDMCKPAPAITPS ANDYADSKNKQKKIKKNKMTKMDARILGFSVTAAEGRINVGIRDYVEGP*
Nucleotide Sequence Fasta
>Dere2_contig14_Dere/CG11148_CG11148-PB_cds
ATGACAGATTCAATGAAATTTGGTCCGGAATGGTTACGCAATATGTCAGC CGAGCCTTCGAGCTCTCCCAGTACCTACAACGTTGGTACTGGTGCTCAAA ACATCTCGATTGGAGGGCACAACCTGGGAAACAACACGACAGCATCCACT TCGCGTAATCTATTTCCAGAATACCGGTACGGACGCGAGGAAATGCTGTC CTTGTTCGATCGGAATTGCCTACTGCCTCATATCCTACCATCGTTCAGAA AGCTCTTCGTGGAGAAGGTCCAGTACCCGCTTGCACTAACACCGAGCTCG GAGGAGGACACCAACCAAAACTCGCTTGGCAATAACTCTCGTCCTGCCTG GTTGCAGCGCTCGCCCGGTGGATTTGGTAGTGCCTCCCGAGGATCTGGAC
GTGGTGGAACAGTCGACCGGGGCCGAATGCGCGGAAAATCTGCATATCAT CCAATATACCAACGCCCAAGCGGTCTATATGATGAAAGCTTATCGGTAAT ATCAGCCGAACGCACATGGAGCGATCGCAACGGAACTGGAGATTCTGCGG CTACCACCACCTCCACTAGTGGCCCTGGTGGTATAGATTGGAACGGAACG

CCAAGCTCAAGTCCTCGAAAAGATTATTCTAGTCATCATCGCAACTTGGA AAATTGGCGACGAACACGTAACGAAGATGGATCCGGAGATGGTCCAGCTA CCAGCGGTTCCATCGGAGGACCCGATATTGCTGGTTGGCGGAGCGGTGTC GTCGGTGGAAGTACAAGCACAAGTTTTGGTACCAATAGCCATCGCTGGGT TGGGACTTTAGGTACAGATCGAGACCGCACTGGAAACAGTAAAGGATCTG GAATGGGAGTTGCAGAGCCGGGAGGAAGCACATCGCACCCACGCTTGTCG AGCCAATTATGGACTGTTAATAGTGCAGGCGGTGTTGATGTCGACGAAAA TCTTCCCGAATGGGCAATGGAAAATCCATCGGAGTTGGGTGGCAGTTTTG ATGCTAGTGGAGCGTTTCATGGAGATACCGATCTAAAACTTAACAAAAGT TCGCATATTTTAAAAACCGAAAGCTTAAACTCCGATAACGATGTGACAAA CCAAAAAAGAAAAGACCTGTCTGATGCAGATAGTGTTAAAGATAAAACCT CAGAAACCTTATTAACGAAAGATTCCAATTCAGCAGCTGTGCAAGAAGAA GTTGAAAGCAGTTTATCCCCAAAAAGCTCTACAACGACAAAGAAGGAAGA GATCCATGGAGATATTTCAGAACGGATCAAAGAGGTCGCCGATGAAGTAG AAAAACTTATAATGGATGATGATCATAAAAGCTCGGCGAATCAAAGCGAA CTTCAAAATGATGACCGGTTTACAGCTGCACTACCAAGTCTAGCAGCCAT TGAGATAAGCATAGAACCGAGTGTTACAGGAGTGCAACAGCAAGCGCCTT CCACCATGCCCATACGAGTTCCCAATACAATTACAGATGTCGCTCATCCA GCCCATCAGCATCCAGGTGTTTCGTTTTCGGATCATGAGACTTTGCAACA TCATAACATGCATCTACCACATTTTCCGATGCTTCCTACACCGCATATGA TCAATTCAAATCTGAATGAATTGTGGTTTTACCGGGATCCGCAGGCAAAT GTACAGGGGCCATTCAGTGCCATTGAGATGACCGAATGGTATCGCGCTGG CTACTTCAATGAGAACCTCTTTGTACGCCGGTACTCTGAGAATAGGTTTA GACCACTGGGAGAGCTTATAAAATTTTGTCATGGTAACATGCCATTTACG CACAGTCACTTGCTTCCTTCGCCTATAGACCTAGAGAACCTTTCTGTTGG TCAAATACCAACCCCTCTTACAGCGTCCCTCTCAATTACACCCCATAAGC CATCACCAATTCCTATCGCACTGTCTGTTGTTGAACAGCAGTTGCAGCAG CAAAGAGATGAGCATCTGAAGGAAAATGTAACCGCAACCGCTGAATCTCT AAGTGCTGCAATAAAAGGAAATTTTAGCGGAAATAGCATTAGTAATACAT CTCATTTGCTTACAATGCGGTTTCAAATGCTTCAGGATCAGTACTTACAG CACCAGGAATACCAAATACTAGCTGAGCTGTCGAAAAATGAATGCTTTCA GCGGCTTTCGGCTGCCGAGCAGGAAACAGTTGTTCGTCGGAAAGTTCAAA TGCTGGTTCTTCCTGAGTATTTGATTAGTTTAAACGGATTAAGCAACTCC TTGTCCGTACTGAACCCCGTCGCCGGAAGACAGTTATACAGTACAGTGGT TGAGCAGGCCAAGAAAGATCAGCAACATATTTTTGCAAACAACAGCGAGC ATCAACGTTCAGTGGGCAATTTACTAGATGCTAATAATTTTATTCTAAAC GCCCAAATAATGCATCAGCAATCGCAACAAGAGGTAGGTGCCTTGGCAGC ATCCGTTGATTGTATTATGCAAGGTGGAACTGCACCCGACCTTAGTAAGC CTAATGAACAGCCAAGGAATGAGTTGGACTTAATTAATGAATACAACTTA
CGGATGCTTTTGCGGGGCCAACCAACAAGTACTCAACAGCATCCACCTTC GCTGCCGAACTCTGCTAACGAAAATCTCTCTGGAGTGGATTTTTTAACTG AAACACAATTGTTACAGAGGCAAAATTTAATGATTCCTATCTGGTTACCC CCTAACAAGCAACAACAATCCGACCAACAGTGGGCTGGAATGACTAACGC GGAAGCATCATTATGGGGAGTGAGCCACTTAAATGAAGAGCGTAATGACG ATCAGCAACTATATGTGCAGAAGTCTTCTGAGGCGTGCTTTGTAGATACA AAAAAAGATGTGAAAATTTCACCATTATTACAAGTTCAATCGGGAGATAT

TGTTAAACACAGCACTTCAGGGGATTTAGATCAGACCGCTGAAAATTTAA AGAATTCACATAATCAAAAAATAGTCAAATCCCTTGTTTCCGACATTCAA CAGAATCATAAGGAACAAAATTCACATCAGCACCAGGCAAAGCAGGCTAA TAAGCAGAATCTGAATACAAAACAGAATGCGACACAGCCAGCCTTAGTTA AGCAAATTAATGAAGATGATCGCAAAAGAGAGCAGACAGAAGAAAAAAAA AAGCAAAAGGAGGAACGCAAGCGCCAGCAATTGGAAGATGAAAAACGTAG GGCGTTGCATGAATCTGAAGAACGAGCCCGCCAAATTCGAGAGGAAAAGG AGAGGCAACAGCAAATACAAGCCCAACGTCGAAAGGCATTAATAGGCAAT GCTGAATCAGTTCAAAGTGGGACTCCAGGAACATTCGCGTCTGCACAAGG CAACAGGAACGACGCAGCCAAAACAGCAGAGCCGCAAGCATCCTCTCGCT TACCATCTACATCCGTAGCGCCTTGGTCTCTGCAGTCTCCAAATTCTATG AGCACTGCGCCTGGTCTCGCAGAGATACAAAAGGCAGAACGTCGAGAGCG TCGCGCAGACCAGCAGCGACATCAAGAGCTATTAGATAAGCAATTGCGTG CCAATGCTGCAGCTGCAGCTGAAGCCAATGATGCTCTGCTCAAATGGCAG TCAACGCCAGCGTCGGCCCCCGTAATAAGTCTAGCCGAGATTCAAGCGGA AGAGGCAAGACGGTTGGCCAACGACCTTGTGGATCAGCAGCGTCGACGCG AATTGGAACATCACCAACAAGCTCCTCTATCATCAGCGGTTTTGGTAGCA AGTGCAACTTCCAACATCTGGGGTAACGCTAATAAAGCATGGAGCTCGTC CGCTTCTCAATCACTTTCATTAAGAACAAGTTCTGGAACTGGTCTATGGG ACGAACCGAATGCACTAGGTTCTATTCAACCAATTTATGGATCTGGAACA AGCTGTGCCAGTTCCGTAACTGCGGCGGCAGTCCTGGCAGGAGGATTGAA CTCAACTAGTAAATCCAATCTACAAGCTCAAAATAAGTCTTCGGCTTTAT TTGCGTCGCCTCGAAATTTGCGCAAGAGTCAAACAGTGCCAGCCTTAAAT AACCCAGGAAAGGCAAATAAAAGTGGACCAGGACAACGGCCAGAGAAACA AAATTTGGCCCAAATTCGTTCAAAAGGTCCACCTGTCTCAGTTGAAGAGA AGGAAAAAGAAAGAAAGACGAATGTAAAAAGTCATGTGCAGCAAAGCAGC ACCGACCAAGTCATTAGCAAGGTTAATGAGTATGAAAACGAGTTCACTAG CTGGTGCATAAAGAGCTTAGATAATATGTCCGCTAAAGTCGATGTACCCA CGTTCGTGGCATTCTTGCAGGACTTGGAAGCGCCATATGAAGTAAAGGAC TATGTCCGAATATACCTTGGTGATGGAAAAGATTCTTTGGATTTCGCAAA ACAGTTTTTGGAGCGACGTAGCAAATACAAAAGCTTGCAACGTGCCCAAA ATGCACACAATGACGATATGTGCAAACCGGCTCCTGCTATTACTCCATCT GCGAACGACTATGCTGACAGCAAGAACAAGCAGAAAAAGATTAAAAAGAA TAAGATGACTAAGATGGACGCCCGCATTCTTGGATTTTCAGTAACAGCTG CCGAGGGTCGCATAAATGTTGGCATTCGAGACTATGTCGAAGGACCA
GFF (See Folder for file)
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP exon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"

contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP CDS gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP start_codon gene_id "Dere/CG11148"; trans_id "CG11148-PB"
contig14 GEP stop_codon gene_id "Dere/CG11148"; trans_id "CG11148-PB";
3) Isoform C
Protein Sequence Fasta
>Dere2_contig14_Dere/CG11148_CG11148-PC_pep
MTDSMKFGPEWLRNMSAEPSSSPSTYNVGTGAQNISIGGHNLGNNTTAST SRNLFPEYRYGREEMLSLFDRNCLLPHILPSFRKLFVEKVQYPLALTPSS EEDTNQNSLGNNSRPAWLQRSPGGFGSASRGSGRGGTVDRGRMRGKSAYH PIYQRPSGLYDESLSVISAERTWSDRNGTGDSAATTTSTSGPGGIDWNGT PSSSPRKDYSSHHRNLENWRRTRNEDGSGDGPATSGSIGGPDIAGWRSGV VGGSTSTSFGTNSHRWVGTLGTDRDRTGNSKGSGMGVAEPGGSTSHPRLS SQLWTVNSAGGVDVDENLPEWAMENPSELGGSFDASGAFHGDTDLKLNKS SHILKTESLNSDNDVTNQKRKDLSDADSVKDKTSETLLTKDSNSAAVQEE VESSLSPKSSTTTKKEEIHGDISERIKEVADEVEKLIMDDDHKSSANQSE LQNDDRFTAALPSLAAIEISIEPSVTGVQQQAPSTMPIRVPNTITDVAHP AHQHPGVSFSDHETLQHHNMHLPHFPMLPTPHMINSNLNELWFYRDPQAN VQGPFSAIEMTEWYRAGYFNENLFVRRYSENRFRPLGELIKFCHGNMPFT HSHLLPSPIDLENLSVGQIPTPLTASLSITPHKPSPIPIALSVVEQQLQQ QRDEHLKENVTATAESLSAAIKGNFSGNSISNTSHLLTMRFQMLQDQYLQ HQEYQILAELSKNECFQRLSAAEQETVVRRKVQMLVLPEYLISLNGLSNS LSVLNPVAGRQLYSTVVEQAKKDQQHIFANNSEHQRSVGNLLDANNFILN AQIMHQQSQQEVGALAASVDCIMQGGTAPDLSKPNEQPRNELDLINEYNL RMLLRGQPTSTQQHPPSLPNSANENLSGVDFLTETQLLQRQNLMIPIWLP PNKQQQSDQQWAGMTNAEASLWGVSHLNEERNDDQQLYVQKSSEACFVDT KKDVKISPLLQVQSGDIVKHSTSGDLDQTAENLKNSHNQKIVKSLVSDIQ QNHKEQNSHQHQAKQANKQNLNTKQNATQPALVKQINEDDRKREQTEEKK KQKEERKRQQLEDEKRRALHESEERARQIREEKERQQQIQAQRRKALIGN AESVQSGTPGTFASAQGNRNDAAKTAEPQASSRLPSTSVAPWSLQSPNSM STAPGLAEIQKAERRERRADQQRHQELLDKQLRANAAAAAEANDALLKWQ STPASAPVISLAEIQAEEARRLANDLVDQQRRRELEHHQQAPLSSAVLVA SATSNIWGNANKAWSSSASQSLSLRTSSGTGLWDEPNALGSIQPIYGSGT SCASSVTAAAVLAGGLNSTSKSNLQAQNKSSALFASPRNLRKSQTVPALN
NPGKANKSGPGQRPEKQNLAQIRSKGPPVSVEEKEKERKTNVKSHVQQSS TDQVISKVNEYENEFTSWCIKSLDNMSAKVDVPTFVAFLQDLEAPYEVKD YVRIYLGDGKDSLDFAKQFLERRSKYKSLQRAQNAHNDDMCKPAPAITPS