Drosophila ficusphila F element

CONTIG52
Drosophila ficusphila F element
Vahag Kechejian
BIO434W
5/2/2016

Abstract

Contig52 is a 35,000 bp region located on the F element of Drosophila ficusphila. Genscan predicts six features in the region, all of which are supported by BLASTx matches. The first feature was found to be an ortholog of the Kif3C gene in D. melanogaster, but it is a partial gene and contains only three of the five exons of Kif3C. The other two exons are located in the neighboring Contig51. The remaining features were found to be orthologs of the D. melanogaster genes pho, CG33521, PIP4K, Mitf, and Arf102F. Contig52 is 30.63% repetitive sequence as determined by RepeatMasker and contains four repetitive sequences 500 bp in length or greater. The TSS for pho has been tentatively placed at 11,238 bp, based only on a BLASTn alignment. The pho gene was found to have two highly conserved protein motifs among Drosophila species when it was aligned using ClustalW2. The synteny of the genes in this region was preserved except for Arf102F, which had its orientation flipped and was moved to the 5′ end of Mitf. The purpose of this project is to annotate the genes in this region so that the data can be used to investigate how genes are expressed in a largely heterochromatic environment.

Introduction

Despite being mostly heterochromatic, the F element in Drosophila exhibits active transcription of the approximately 80 genes present on this chromosome. How these genes maintain active transcription in a heterochromatic environment is still largely unknown. By annotating sections of the Drosophila F element, it may be possible to find conserved regulatory motifs or transposable elements that promote active transcription in these regions. In this project, a region of the D. ficusphila F element, Contig52, was annotated. Genscan predicted six features within Contig52. Each prediction was annotated following the process detailed below for the first prediction, Contig52.1.

Figure 1 Contig52. From top to bottom: BLASTx alignment to D. melanogaster proteins, Genscan predictions, modENCODE RNA-Seq alignment summary, evolutionary conservation in seven Drosophila species based on a phylogenetic hidden Markov model, and repeating elements annotated by RepeatMasker.

Annotation of Contig52.1

First, the Genscan-predicted amino acid sequence for Contig52.1 was used as the query in a FlyBase BLASTp search, with the Annotated Proteins database for Drosophila melanogaster as the subject. The best match to Contig52.1 is Kif3C-PA with an E-value of 0 (Figure 2).

Figure 2 Contig52.1 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.1 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.1 is Kif3C-PA with an E-value of 0.

The alignment shows that amino acids 1 through 123 and 621 through 649 are missing from the subject (Figure 3). Next, the Kif3C Gene Record was examined, and Kif3C was found to be on the fourth chromosome (F element) of D. melanogaster. The polypeptide details showed that Kif3C in D. melanogaster has five exons and only one protein isoform (Figure 4). However, Genscan predicted only four exons in the region, so Contig52.1 is likely a partial gene.

Figure 3 BLASTp Alignment of Contig52.1. FlyBase Annotated Proteins database as the subject and the Genscan-predicted Contig52.1 amino acid sequence as the query. Amino acids 1-123 and 621-649 are missing from the subject in the alignment. Note the location on chromosome 4.

Figure 4 The Gene Record Finder entry for Kif3C in D. melanogaster. Kif3C has five exons and one isoform. Note that the length of exon 5 is 2,574 bp, and the length of CDS 5 is 76 aa, or 228 bp; most of exon 5 is untranslated in D. melanogaster.
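The exon 5 figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration (the function names are hypothetical): it converts the 76 aa CDS to base pairs and subtracts it from the 2,574 bp exon to obtain the untranslated length.

```python
CODON_LENGTH = 3  # bases per amino acid


def cds_length_bp(n_amino_acids: int) -> int:
    """Length in bp of a coding segment that encodes n_amino_acids residues."""
    return n_amino_acids * CODON_LENGTH


def untranslated_bp(exon_bp: int, cds_aa: int) -> int:
    """Bases of an exon that are not accounted for by its coding segment."""
    return exon_bp - cds_length_bp(cds_aa)


# Values reported for Kif3C exon 5 in D. melanogaster
exon5_bp = 2574
cds5_aa = 76

print(cds_length_bp(cds5_aa))               # 228
print(untranslated_bp(exon5_bp, cds5_aa))   # 2346
```

The result, 2,346 bp, is the expected length of the 3′ UTR measured from the end of the last CDS.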

Figure 5 Region where the putative fifth exon is located. If Kif3C is orthologous to its counterpart in D. melanogaster, the 3′ UTR should extend 2,346 bp from the end of the last CDS (shown by the blue arrow). The putative exon (shown by the red arrow) is approximately 700 bp away from the end of CDS 5, which would place it well within this 3′ UTR.

The fourth exon predicted by Genscan was found to be within the 3′ UTR of Kif3C and likely corresponds to alternative splicing in the 3′ UTR. This project is not concerned with features within the 3′ UTR, so that exon will not be examined at this time (Figure 5). A BLASTp search of the D. melanogaster annotated proteins database with this putative exon's amino acid sequence as the query showed no reasonable alignments, which suggests a miscall by Genscan. The D. melanogaster CDS amino acid sequences were then used as the subject in an NCBI BLASTx search with the Contig52 DNA sequence as the query. Only the amino acid sequences of the last three exons matched the Contig52 DNA sequence. CDS 3 matched with 83% identity and was missing the first nine amino acids from the subject (Figure 6). CDS 4 matched with 67% identity with coverage of the whole subject (Figure 7). CDS 5 matched with 51% identity and was missing the last ten amino acids from the subject (Figure 8). The amino acid sequences of the first two exons were then used as the subject in a BLASTx search with the neighboring Contig51 DNA sequence as the query. CDS 1 matched with 75% identity, CDS 2 matched with 77% identity, and both had the whole subject covered (Figures 9, 10). All of the results of the BLASTx searches are compiled in Table 1.

Figure 6 BLASTx alignment of Exon 3. Exon 3 amino acid sequence as the subject and Contig52 DNA sequence as the query. Amino acids 1-9 are missing from the subject.

Figure 7 BLASTx alignment of Exon 4. Exon 4 amino acid sequence as the subject and Contig52 DNA sequence as the query, with complete coverage.

Figure 8 BLASTx alignment of Exon 5. Exon 5 amino acid sequence as the subject and Contig52 DNA sequence as the query. The last ten amino acids are missing from the subject.

Figure 9 BLASTx alignment of Exon 1. Exon 1 amino acid sequence as the subject and Contig51 DNA sequence as the query, with complete coverage.

Figure 10 BLASTx alignment of Exon 2. Exon 2 amino acid sequence as the subject and Contig51 DNA sequence as the query.

Table 1
FlyBase ID  CDS Size  Query Range  Reading Frame  Subject Range  Identity
1_9597_  *  %
2_9597_  *  %
3_9597_  %
4_9597_  %
5_9597_  %
*Range in Contig51

Putative Exons 3 through 5 were then examined using the GEP D. ficusphila Genome Browser. For Exon 3, the acceptor site was found to be bp with phase 0. This site is supported by the Geneid Genes, SGP, and Augustus gene predictors, and it is a high quality acceptor site according to the Predicted Splice Sites track (Figure 11). Having the splice site here also accounts for the missing nine amino acids from the BLASTx alignment, so size is conserved even though the exact amino acid sequence is not (Figure 6). Note that the RNA-Seq data is particularly noisy in this region because it is close to the edge of the contig.

Figure 11 Exon 3 Acceptor Site. The acceptor site was found in phase 0, reading frame +1, from bp shown here by the blue arrow. The site is supported by the Geneid Genes, SGP, and Augustus gene predictors. It is also a high quality acceptor site according to the Predicted Splice Sites track, shown here by the yellow arrow. This choice conserves the size of the exon.

The donor site for Exon 3 was found to be bp in phase 0. The site is supported by alignment data and gene predictors (Figure 12). The acceptor site for exon 4 was found to be at bp in phase 0. This site is supported by alignment data and gene predictors (Figure 12).

Figure 12 Exon 3 Donor Site and Exon 4 Acceptor Site. For Exon 3, the donor site is in phase 0, frame +1 from bp shown here by the red arrow. The site is supported by BLASTp, BLASTx, Genscan, Geneid, N-SCAN, SGP, Augustus, GlimmerHMM, and TopHat. It is also a high confidence donor site according to the Predicted Splice Sites track, shown here by the orange arrow. For Exon 4, the acceptor site is in phase 0, frame +1 from bp shown here by the blue arrow. This site is supported by Genscan, Geneid, N-SCAN, SGP, Augustus, GlimmerHMM, and TopHat. It is also a high confidence acceptor site according to the Predicted Splice Sites track, shown here by the yellow arrow.

The donor site for exon 4 was found to be at bp in phase 2. The site is supported by alignment data and gene predictors (Figure 13). The acceptor site for exon 5 was found to be at bp in phase 1. The site is supported by alignment data and gene predictors (Figure 13).
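The phases reported for these junctions can be sanity-checked. The sketch below assumes the usual convention that the donor phase of one CDS and the acceptor phase of the next must add up to 0 or 3, so that the bases left over at the donor and the bases preceding the first complete codon at the acceptor form exactly zero or one codon. The function name is hypothetical.

```python
def phases_compatible(donor_phase: int, acceptor_phase: int) -> bool:
    """Return True when the two splice phases keep the reading frame intact.

    Assumed convention: donor phase = bases after the last complete codon,
    acceptor phase = bases before the first complete codon. Together they
    must make up zero bases or one whole codon.
    """
    for phase in (donor_phase, acceptor_phase):
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase}")
    return (donor_phase + acceptor_phase) % 3 == 0


# (donor phase, acceptor phase) as reported for the Kif3C junctions
junctions = {
    "exon 3 / exon 4": (0, 0),
    "exon 4 / exon 5": (2, 1),
}

for name, (donor, acceptor) in junctions.items():
    print(name, phases_compatible(donor, acceptor))
```

Both junctions pass this check, which is consistent with the splice sites chosen above.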

Figure 13 Exon 4 Donor Site and Exon 5 Acceptor Site. The donor site for exon 4 is in phase 2, frame +2 at bp shown here by the red arrow. The site is supported by data from BLASTp, BLASTx, Genscan, Geneid, SGP, Augustus, GlimmerHMM, and TopHat. The acceptor site for exon 5 is in phase 1, frame +2 at bp shown here by the blue arrow. The site is supported by BLASTp, BLASTx, Genscan, Geneid, N-SCAN, SGP, Augustus, GlimmerHMM, and TopHat. It is also a high confidence acceptor site according to the Predicted Splice Sites track, shown here by the yellow arrow.

There is a lack of conservation data in this region, as well as poor RNA-Seq alignment due to the proximity to the edge of the contig, which led to the decision to extend the exon until it reached a stop codon. The stop codon for exon 5 was thus placed at 4,340-4,342 bp (Figure 14). Placing the stop codon here accounted for the ten missing amino acids from the BLASTx alignment, but it also added 14 more amino acids (Figure 8). At the end of the 10 missing amino acids, D. ficusphila has an AGA codon, which codes for arginine, while D. melanogaster has a UGA codon, which codes for a stop. All the acceptor and donor sites for exons 3 to 5 are compiled in Table 2. Since Exons 1 and 2 are in the neighboring Contig51, they are not part of this project and will not be discussed at this time.
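Extending an exon until it reaches a stop codon amounts to walking the sequence codon by codon in the chosen reading frame. The following is a minimal sketch of that idea using a made-up toy sequence, not the Contig52 sequence; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def first_stop_offset(seq: str, start: int = 0):
    """Return the 0-based offset of the first in-frame stop codon.

    Codons are read from `start` in steps of three. Returns None when no
    stop codon is found in that frame.
    """
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Toy sequence: ATG AGA GCT TGA CCC. AGA (arginine) is read through,
# while TGA ends the open reading frame.
toy = "ATGAGAGCTTGACCC"
print(first_stop_offset(toy))     # 9
print(first_stop_offset(toy, 1))  # 1, a TGA that is only a stop in another frame
```

The second call shows why the frame matters: the same bases contain a TGA at offset 1 that is invisible in the frame starting at offset 0.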

Figure 14 Exon 5 Stop Codon. The stop codon for exon 5 was placed in frame +2 at 4,340-4,342 bp, shown here by the red arrow.

Table 2
Feature  Beginning (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame
CDS 3  2,500  3,
CDS 4  3,166  3,
CDS 5  4,075  4,339  1  N/A  +2
Stop  4,340  4,342  N/A  N/A  +2

Figure 15 Dot plot of Kif3C-PA versus the gene model found by this project. Recall that the gene is missing the first two exons in Contig52. The large gap in the last exon is in an area of very little conservation between Drosophila species. Note also the change in size of the final exon.

The proposed gene model was then checked using Gene Model Checker and passed all the tests. The dot plot (Figure 15) and the alignment (Figure 16) of the Kif3C-PA amino acid sequence versus the proposed gene model's sequence both show that there is a significant lack of conservation between the sequences, especially in exon 4 from about aa. However, the exon boundaries are exactly the same for the two, which supports the gene model proposed by this project.

Figure 16 Alignment of the Kif3C-PA amino acid sequence with the gene model generated by this project. Note that despite the molecular differences within the exons, the exon boundaries are the same except for the extension of exon 5. The first two exons are not present in the gene model since they are within Contig51 and are not part of this project. The small section that matched in exon 1 is simply by chance and is not meaningful; those amino acids should be placed with exon 3.

Figure 17 Final Kif3C-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

The annotated model was then compiled into a GFF file and opened on the GEP Browser (Figure 17). The rest of the features in the project were annotated following the same protocol as was used for Contig52.1.
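As an illustration of what one record in such a file can look like, the sketch below formats the CDS 5 coordinates from Table 2 as a GFF3-style line with nine tab-separated columns. Several things here are assumptions rather than details from this project: the sequence name, the source label, the attribute IDs, the use of GFF3 specifically, and the mapping of the acceptor phase directly onto the GFF3 phase column.

```python
def gff3_line(seqid, source, ftype, start, end, strand, phase, attributes):
    """Format one GFF3-style record (nine tab-separated columns)."""
    if start > end:
        # GFF3 always lists the smaller coordinate first; strand carries orientation
        raise ValueError("start must be <= end")
    attr_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    phase_text = "." if phase is None else str(phase)
    columns = [seqid, source, ftype, str(start), str(end), ".", strand,
               phase_text, attr_text]
    return "\t".join(columns)


# CDS 5 of the Kif3C model (Table 2); names and IDs below are hypothetical
line = gff3_line(
    seqid="contig52",
    source="manual_annotation",
    ftype="CDS",
    start=4075,
    end=4339,
    strand="+",
    phase=1,
    attributes={"ID": "Kif3C-PA_cds5", "Parent": "Kif3C-PA"},
)
print(line)
```

For genes on the minus strand, the two coordinates would be swapped so that the smaller one comes first, with "-" in the strand column.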

Annotation of Contig52.2

The BLASTp search for Contig52.2 revealed that it is the pho ortholog in D. ficusphila. It has two identical protein isoforms, pho-PA and pho-PB.

Table 3
Feature  Beginning (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame
CDS 1  10,725  10,623  N/A  1  -3
CDS 2  8,212  8,
CDS 3  8,007  7,
CDS 4  7,186  6,
CDS 5  6,202  6,
CDS 6  6,024  5,883  0  N/A  -1
Stop  5,882  5,880  N/A  N/A  -1

Figure 18 Contig52.2 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.2 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.2 is pho with an E-value of 1.25e

Figure 19 Alignment of the pho-PA amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites all match up perfectly, while much of the amino acid sequence within the exons does not.

Figure 20 Dot plot of pho-PA versus the gene model found by this project. The large gaps in the dot plot can be attributed to divergent evolution.

Figure 21 Final pho-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Annotation of Contig52.3

The BLASTp search for Contig52.3 revealed that it is the CG33521 ortholog in D. ficusphila. It has four total protein isoforms, with three unique coding sequences. CG33521-PC and CG33521-PD are identical to one another, while CG33521-PA and CG33521-PG are unique.

Table 4
Feature  Beginning (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame  PC, PD  PA  PG
CDS 1  13,622  13,738  N/A  0  +2  Exon 1  Exon 1  Exon 1
CDS 2  14,273  14,  Exon 2  Exon 2
CDS 3  14,273  14,  Exon 2
CDS 4  14,479  14,  Exon 3  Exon 3  Exon 3
CDS 5  14,960  15,  Exon 4  Exon 4  Exon 4
CDS 6  15,642  16,  Exon 5  Exon 5  Exon 5
CDS 7  16,348  16,  Exon 6  Exon 6  Exon 6
CDS 8  16,533  16,  Exon 7  Exon 7
CDS 9  16,764  16,  Exon 7
CDS 10  18,199  18,  Exon 8  Exon 8  Exon 8
CDS 11  18,395  18,476  1  N/A  +3  Exon 9  Exon 9  Exon 9
Stop  18,492  18,494  N/A  N/A  +3  Exon 10  Exon 10  Exon 10

Figure 22 Contig52.3 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.3 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.3 is CG33521 with an E-value of 0.

Figure 23 Alignment of the CG33521-PC amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 24 Dot plot of CG33521-PC versus the gene model found by this project. The large gaps in the dot plot can be attributed to divergent evolution.

Figure 25 Final CG33521-PC Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation. Note that isoform PD has the identical amino acid sequence.

Figure 26 Alignment of the CG33521-PA amino acid sequence with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 27 Dot plot of CG33521-PA versus the gene model found by this project. The small gaps in the dot plot can be attributed to divergent evolution.

Figure 28 Final CG33521-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Figure 29 Alignment of the CG33521-PG amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 30 Dot plot of CG33521-PG versus the gene model found by this project. The small gaps in the dot plot can be attributed to divergent evolution.

Figure 31 Final CG33521-PG Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Annotation of Contig52.4

The BLASTp search for Contig52.4 revealed that it is the PIP4K ortholog in D. ficusphila. It has two identical protein isoforms, PIP4K-PA and PIP4K-PB.

Table 5
Feature  Beginning (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame
CDS 1  21,951  21,814  N/A  0  -3
CDS 2  21,755  21,
CDS 3  21,585  21,
CDS 4  21,281  21,
CDS 5  21,092  20,
CDS 6  20,735  20,
CDS 7  20,526  20,
CDS 8  18,918  18,
CDS 9  18,799  18,686  0  N/A  -2
Stop  18,685  18,683  N/A  N/A  -2

Figure 32 Contig52.4 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.4 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.4 is PIP4K with an E-value of 0.
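PIP4K lies on the minus strand, so each feature in Table 5 begins at a larger coordinate than it ends. The sketch below (hypothetical helper names) shows one way to read such rows: the length is inclusive of both endpoints, and the strand follows from the order of the two coordinates.

```python
def feature_length(begin: int, end: int) -> int:
    """Inclusive length in bp, regardless of which coordinate is larger."""
    return abs(begin - end) + 1


def strand_of(begin: int, end: int) -> str:
    """Infer strand from coordinate order: begin > end means minus strand."""
    return "+" if begin <= end else "-"


# Rows of Table 5 whose coordinates are complete
cds9 = (18799, 18686)
stop = (18685, 18683)

print(strand_of(*cds9), feature_length(*cds9))  # - 114
print(strand_of(*stop), feature_length(*stop))  # - 3
```

CDS 9 comes out at 114 bp, a multiple of three, which fits an acceptor phase of 0 followed directly by the 3 bp stop codon.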

Figure 33 Alignment of the PIP4K-PA amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites and almost all of the amino acid sequences match up perfectly.

Figure 34 Dot plot of PIP4K-PA versus the gene model found by this project. The one gap in the dot plot can be attributed to divergent evolution.

Figure 35 Final PIP4K-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Annotation of Contig52.5

The BLASTp search for Contig52.5 revealed that it is the Mitf ortholog in D. ficusphila. It has four total protein isoforms, with three unique coding sequences. Mitf-PA and Mitf-PB are identical to one another, and Mitf-PD and Mitf-PC are unique.

Table 6
Feature  Beginning (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame  PA, PB  PD  PC
CDS 1  24,439  24,614  N/A  0  +2  Exon 1  Exon 1
CDS 2  25,784  26,  Exon 1
CDS  Exon 2  Exon 2  Exon 2
CDS  Exon 3  Exon 3  Exon 3
CDS  Exon 4  Exon 4
CDS 6  27,011  27,  Exon 4
CDS  Exon 5
CDS  Exon 6  Exon 5  Exon 5
CDS  Exon 7  Exon 6  Exon 6
CDS  Exon 8  Exon 7  Exon 7
CDS  N/A  +3  Exon 9  Exon 8  Exon 8
Stop  30,093  30,094  N/A  N/A  +3  Exon 10  Exon 9  Exon 9

Figure 36 Contig52.5 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.5 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.5 is Mitf-PC with an E-value of 3.42e

Figure 37 Alignment of the Mitf-PC amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 38 Dot plot of Mitf-PC versus the gene model found by this project.

Figure 39 Final Mitf-PC Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Figure 40 Alignment of the Mitf-PD amino acid sequence with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 41 Dot plot of Mitf-PD versus the gene model found by this project. The large gaps can be attributed to divergent evolution.

Figure 42 Final Mitf-PD Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Figure 43 Alignment of the Mitf-PA amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites all match up perfectly, while some of the amino acid sequences within the exons do not.

Figure 44 Dot plot of Mitf-PA versus the gene model found by this project.

Figure 45 Final Mitf-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

Annotation of Contig52.6

The BLASTp search for Contig52.6 revealed that it is the Arf102F ortholog in D. ficusphila. It has two identical protein isoforms, Arf102F-PA and Arf102F-PB.

Table 7
Feature  Start (bp)  End (bp)  Acceptor Phase  Donor Phase  Frame
CDS  N/A  1  -3
CDS
CDS
CDS  N/A  -1
Stop  33,683  33,681  N/A  N/A  -1

Figure 46 Contig52.6 BLASTp Search. The Genscan-predicted amino acid sequence for Contig52.6 was used as the query in a FlyBase BLASTp search, with the annotated proteins database as the subject. The best match to Contig52.6 is Arf102F with an E-value of 3.28e-89.

Figure 47 Alignment of the Arf102F-PA amino acid sequence of D. melanogaster with the gene model generated by this project. The splice sites and almost all of the amino acid sequences match up perfectly.

Figure 48 Dot plot of Arf102F-PA versus the gene model found by this project.

Figure 49 Final Arf102F-PA Model. From top to bottom: proposed model, BLASTx alignment, Genscan predictions, RNA-Seq alignment, and conservation.

TSS Estimate

The D. ficusphila pho ortholog was chosen to have its transcription start site (TSS) annotated. Both isoforms of pho share the same TSS, but isoform A has a longer 5′ UTR. Thus, the first exon of isoform A was chosen as the subject in a BLASTn search with the Contig52 DNA sequence as the query. The best match placed the TSS at 11,238 bp in Contig52 (Figure 50). RNA-Seq data in the region places the search region at 11,042 to 11,411 bp (Figure 51). The orthologous TSS in D. melanogaster is peaked. The motifs found in this search region and in the orthologous region in D. melanogaster can be found in Table 8. None of these motifs supported the TSS proposed by the BLASTn search. This, coupled with the lack of sequence conservation in the region and RNA Pol III data, made the TSS at 11,238 bp the best estimate that could be made at this time.

Figure 50 BLASTn Alignment of pho-RA exon 1. Subject is the pho-RA exon 1 DNA sequence and query is the Contig52 DNA sequence. The last nine bases are missing from the subject.
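A motif search over a region like this one can be sketched as a scan for a degenerate consensus string. The example below expands IUPAC nucleotide codes into a regular expression and reports every match position, including overlapping ones. The consensus used (TCAKTY, a commonly cited Drosophila Inr consensus) and the toy sequence are assumptions for illustration; they are not the motif definitions or the sequence used in this project.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(seq: str, consensus: str, offset: int = 1):
    """Return the coordinates of every match of an IUPAC consensus in seq.

    `offset` is the coordinate of the first base of seq, so passing the
    start of a search region yields contig coordinates. A lookahead is
    used so that overlapping matches are all reported.
    """
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile(f"(?=({body}))")
    return [m.start() + offset for m in pattern.finditer(seq.upper())]


# Toy region (made up); TCAKTY matches TCAGTC and TCATTC
toy_region = "GGTCAGTCAATTTCATTCGG"
print(find_motif(toy_region, "TCAKTY"))                # [3, 13]
print(find_motif(toy_region, "TCAKTY", offset=11042))  # [11044, 11054]
```

This scans only the strand it is given. For a gene on the minus strand, such as pho, the reverse complement of the region would be scanned and the coordinates converted back.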

Figure 51 The search region for the pho TSS. RNA-Seq data in the region places the search region at 11,042 to 11,411 bp. The putative TSS based on the BLASTn results is at 11,238 bp, shown here by the red arrow. None of the BREd, MTE, or DPE motifs support this site.

Table 8
Core promoter motif  D. ficusphila  D. melanogaster
TATA Box  NA
BREd
Inr  NA
MTE  NA
DPE

Gene Evolution

Using ClustalW2, an analysis was done comparing the pho-PB isoform of D. ficusphila to the PB isoform of eight other Drosophila species. The Drosophila species were D. ficusphila, D. melanogaster, D. sechellia, D. erecta, D. yakuba, D. willistoni, D. grimshawi, D. virilis, and D. mojavensis. This revealed two highly conserved motifs within the pho-PB isoform. One motif is approximately 30 amino acids in length and the other is approximately 120 amino acids in length; the motifs are represented below with asterisks under the matching sequences. Orange is small nonpolar, green is hydrophobic, pink is polar, red is acidic, and blue is basic amino acids. This high level of conservation among species is a solid indication of functional significance. A BLASTp search of the 30 amino acid motif revealed no known conserved domains. However, a search of the 120 amino acid motif revealed that it contains two zinc finger double domains (Figure 52).

[ClustalW2 alignment of the pho-PB protein sequences from the nine species, labeled DFIC, DMEL, DSEC, DERE, DYAK, DWIL, DGRI, DVIR, and DMOJ, with conservation symbols beneath each block.]
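In ClustalW2 output, an asterisk under a column means every sequence has the same residue at that position. The sketch below recomputes this for one 60-column block of the pho-PB alignment that falls in the highly conserved region. The assignment of rows to species follows the order of the labels in the alignment and should be treated as an assumption; the function name is hypothetical.

```python
# One 60-column block from the pho-PB alignment; eight rows are identical
CONSENSUS_ROW = "FVESSKLKRHQLVHTGEKPFQCTFEGCGKRFSLDFNLRTHVRIHTGDRPFVCPFDACNKK"

block = {
    "DFIC": CONSENSUS_ROW,
    "DMEL": CONSENSUS_ROW,
    # Single substitution (N to S) near the end of the block
    "DSEC": "FVESSKLKRHQLVHTGEKPFQCTFEGCGKRFSLDFNLRTHVRIHTGDRPFVCPFDACSKK",
    "DERE": CONSENSUS_ROW,
    "DYAK": CONSENSUS_ROW,
    "DWIL": CONSENSUS_ROW,
    "DGRI": CONSENSUS_ROW,
    "DVIR": CONSENSUS_ROW,
    "DMOJ": CONSENSUS_ROW,
}


def identical_columns(rows):
    """Return 0-based indices of columns where all rows share one residue."""
    rows = list(rows)
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return [
        i for i, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


conserved = identical_columns(block.values())
total = len(CONSENSUS_ROW)
variable = [i + 1 for i in range(total) if i not in conserved]

print(f"{len(conserved)} of {total} columns identical")  # 59 of 60 columns identical
print("variable columns (1-based):", variable)           # [58]
```

Only one column in this block differs across the nine species, which illustrates the level of conservation described above.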

Figure 52 BLASTp Conserved Domain Search. BLASTp results of a search with the 120 amino acid motif in NCBI Conserved Domain Search, revealing the presence of two zinc finger double domains.

Repeats

Contig52 is 30.63% repetitive sequence as determined by RepeatMasker and contains four repetitive sequences 500 bp or greater. One of these large repeats is classified as LINE, another as DNA/MITE, another as RC/Helitron, and the last one has an unknown classification. More information about these large repeats is summarized in Table 9, and the location of the large repeats relative to the annotated genes is shown in Figure 53.

Table 9
Repeat Name  Repeat Class/Family  Strand  Contig52 Start  Contig52 End  Size (bp)
rnd-1_family-563  LINE
rnd-1_family-108  DNA/MITE
rnd-5_family-34  RC/Helitron
rnd-1_family-12  Unknown
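A repeat percentage of this kind can be recomputed from a soft-masked sequence, in which repetitive bases are written in lowercase and the rest in uppercase. The sketch below assumes such a soft-masked input and uses a made-up toy sequence; it is not how the 30.63% figure was obtained in this project, and the function name is hypothetical.

```python
def repeat_fraction(soft_masked_seq: str) -> float:
    """Fraction of bases that are lowercase (masked as repeats)."""
    bases = [base for base in soft_masked_seq if base.isalpha()]
    if not bases:
        raise ValueError("sequence contains no bases")
    masked = sum(1 for base in bases if base.islower())
    return masked / len(bases)


# Toy sequence: 6 of 20 bases are lowercase
toy = "ACGTacgtacGTACGTACGT"
print(f"{repeat_fraction(toy):.2%}")  # 30.00%
```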

Figure 53 The location of the large repeats relative to the annotated genes. All the large repeats are within introns.

Synteny

Synteny between D. ficusphila Contig52 and the orthologous D. melanogaster region is not entirely conserved. In D. ficusphila, the genes Kif3C, pho, CG33521, PIP4K, and Mitf all have their orientation and gene order preserved. However, in D. ficusphila, Arf102F has a flipped orientation and is in a different position relative to the surrounding genes. In D. melanogaster, Arf102F is on the positive strand and is bordered by Cals on the 5′ end and CG11155 on the 3′ end. In D. ficusphila, Arf102F is on the negative strand and is bordered by Dyrk3 on the 5′ end and Mitf on the 3′ end (Figure 54).
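The comparison described above can be expressed as a small lookup. In the sketch below, the D. ficusphila strands are taken from the reading frames in the annotation tables, and the D. melanogaster strands for the five preserved genes are simply assumed to be the same, following the statement that their orientation is preserved. The function name is hypothetical.

```python
# Strand of each gene on D. ficusphila Contig52 (from the sign of the frames)
dfic_strand = {
    "Kif3C": "+", "pho": "-", "CG33521": "+",
    "PIP4K": "-", "Mitf": "+", "Arf102F": "-",
}

# Assumption: identical to D. ficusphila except Arf102F, which is on the
# positive strand in D. melanogaster
dmel_strand = dict(dfic_strand, Arf102F="+")

# Genes bordering Arf102F on its 5' and 3' ends in each species
arf102f_neighbors = {
    "D. melanogaster": ("Cals", "CG11155"),
    "D. ficusphila": ("Dyrk3", "Mitf"),
}


def flipped_genes(reference: dict, query: dict) -> list:
    """Genes present in both maps whose strand differs."""
    shared = reference.keys() & query.keys()
    return sorted(gene for gene in shared if reference[gene] != query[gene])


print(flipped_genes(dmel_strand, dfic_strand))  # ['Arf102F']

neighbors_match = (
    set(arf102f_neighbors["D. melanogaster"])
    == set(arf102f_neighbors["D. ficusphila"])
)
print(neighbors_match)  # False
```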

Figure 54 Comparison of D. ficusphila Contig52 to the orthologous region in D. melanogaster. D. melanogaster is on the top and D. ficusphila is on the bottom. The genes Kif3C, pho, CG33521, PIP4K, and Mitf all have their orientation and gene order preserved. In D. melanogaster, Arf102F is on the positive strand and is bordered by Cals on the 5′ end and CG11155 on the 3′ end. In D. ficusphila, Arf102F is on the negative strand and is bordered by Dyrk3 on the 5′ end and Mitf on the 3′ end.

Conclusion

In total, one partial gene and five full genes were annotated in this region. Starting from the left side of Contig52, one finds the last three exons of Kif3C-PA, followed by the two pho isoforms, PA and PB. Next are the four CG33521 isoforms, PC, PD, PA, and PG. The next gene has the two PIP4K isoforms, PA and PB. That gene is followed by the four Mitf isoforms, PA, PB, PC, and PD. The last gene has the two Arf102F isoforms, PA and PB (Figure 55). Contig52 is 30.63% repetitive sequence as determined by RepeatMasker and contains four repetitive sequences 500 bp or greater. The TSS for pho has been tentatively placed at 11,238 bp, but further experiments would be necessary to determine the exact location. The pho protein was found to have two highly conserved motifs among Drosophila species. The synteny in this region was largely preserved except for Arf102F, which had its orientation flipped and was moved to the 5′ end of Mitf.

Figure 55 Contig52 with all Genes Annotated. From top to bottom: full annotation of genes, Genscan predictions, BLASTx alignment, RNA-Seq data, repeats larger than 500 bp, all repeats, and conservation with seven other Drosophila species.