Annotating 7G24-63


Justin Richner, May 4, 2005

Figure 1: Map of my sequence (tracks: Zfh2 exons, Thd1 exons, Pur-alpha exons; scale bar = 1 kb; legend: LINE, Penelope; DNA/Transib, Transib1; DINE; Novel Repeat; LTR/PAO, Diver2 I; LTR/Gypsy, Invader; Transposon, Tel1; DNA, DNAREP1 DM)

I was given 80,940 bases of sequence to annotate from the Drosophila virilis dot chromosome. This consisted of two approximately 40 kb fosmids joined together: 7G24 and 63. Fosmid 7G24 comprises bases 1 to 39,070. Fosmid 63 was annotated last year (Figure 1), and three genes were found: zfh2, thd1, and pur-alpha. I also found and annotated the same three genes.

Zfh2 is zinc finger homeodomain protein 2, a probable transcription factor that is required for wing development. Zfh2 stretches from to and contains nine exons. Thd1 is mismatch-dependent uracil/thymine DNA glycosylase, which removes mismatched uracil or thymine in double-stranded DNA. Thd1 stretches from to and contains five exons. Pur-alpha is purine-rich binding protein-α, a single-stranded DNA binding protein thought to be involved in DNA replication. Pur-alpha begins at and extends past the end of my sequence. Two of the Pur-alpha exons are within my sequence.

The entire sequence contains 32 repeated segments, one of which is a novel repeat, and five of which are DINEs. The protein Zfh2 is conserved across species in the zinc finger binding domain. No conserved non-genic regions were found. This segment of the dot chromosome has high synteny with the fourth chromosome of D. melanogaster.

Figure 2: Gene map from last year's submitted paper

Genes:

I first tried to identify genes using the Twinscan output on the Goose server within the UCSC genome browser format (Figure 3). The first gene predicted (chr6001.1) is the tel1 gene, a protein involved in transposable elements. I will look at this gene more closely in the Repeat section.

Figure 3: UCSC output on Goose server

The next predicted feature I analyzed was chr . Twinscan predicts this to be a single-exon feature, but Genscan and mRNA data suggest that there are multiple exons. When Blast was performed against the nr database, the feature showed very good homology to the Zfh2 protein. However, the Zfh2 protein was much longer than the one-exon gene predicted by Twinscan. I did a Blast search with the next predicted feature, chr , and again found high homology to Zfh2. I decided that these were most likely exons of the same gene and attempted to find the rest of the exons.

At this point, I did not know how to use Ensembl or FlyBase, so to look for the exons, I blasted my entire repeat-masked sequence against the nr database and looked for the exons using Herne on the Blast output file. The results were unexpected. I had the first four exons transcribed in the forward direction from around to bases (Figure 4), and the last five exons transcribed in the reverse direction from the very end of my sequence to about bases (Figure 5).

Figure 4: Two of the exons for Zfh2 transcribed in the forward direction.

Figure 5: Three of the exons for Zfh2 transcribed in the reverse direction.

I realized that my sequence was not assembled correctly, and that XAAA63 should have been oriented in the opposite direction before it was joined with 7G24. Chris corrected my sequence but could not put the corrected sequence into the UCSC output on the Goose server. All of the numbers in the second half of my sequence were incorrect

when looking at data on the UCSC output, and I continually had to do Blast2 alignments in order to find the proper numbers. Also, the Twinscan output was wrong for Zfh2.

After performing a Blast search with the corrected sequence file, I looked at the hits to Zfh2. With an e-value of 0.0, predicted exons for nearly all of the amino acids, no stop codons within the predicted exons, and last year's data, I concluded that zfh2 is a real gene. I then began searching for exons.

The first exon predicted by Twinscan was much shorter than the first exon in D. melanogaster, obtained from the Ensembl database. However, I noticed that the exon could extend for quite some distance in the +2 frame without encountering a stop codon, as shown by the green arrow in Figure 6. I hypothesized that the exon actually continued through the first three exons predicted by Genscan, as shown in Figure 6.

Figure 6: UCSC output of first exon of zfh2

I performed a Blast2 alignment of my hypothesized exon against the D. melanogaster first exon and obtained a good match (Figure 7). I hypothesize that this region, from to 24577, is the first exon of zfh2.

Figure 7: D. melanogaster vs. predicted zfh2 first exon

Figure 8: Blast2 of D. melanogaster 2nd exon with my sequence.

At this point I realized two things: Twinscan and Genscan are not reliable, and the method used to find the first exon was highly inefficient. I began to search for exons

much more quickly by performing Blast2 with the D. melanogaster exons from Ensembl and my entire sequence (Figure 8). Later, I came back to exon 1 and examined the intron/exon boundaries to determine the exact end of this exon. The beginning of exon 1 was moved farther back to base 22793 because of mRNA data (Figure 9), and the exon now has a 5' untranslated region. The end of exon 1 had to be moved forward a couple of bases to because all introns begin with the bases GT (Figure 10).

Figure 9: Beginning of exon 1; Red arrow = old boundary; Green arrow = new boundary

Figure 10: End of exon 1

Exons 2, 3, and 4 were found without much difficulty. When searching for exon 5, only half of the corresponding D. melanogaster exon matched my sequence. I joined exons 5 and 6 of D. melanogaster, performed a Blast2 alignment with my sequence, and found a complete exon encompassing both predicted exons without any internal stop codons (Figure 11). I hypothesize that exons 5 and 6 from D. melanogaster have combined to form one exon in D. virilis.
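The observation that the first exon stays open in the +2 frame can be checked mechanically by scanning one reading frame for the first stop codon. The following is a minimal sketch, not the procedure used in this report: the function name `first_stop`, the 1-to-3 frame numbering, and the toy sequence are all assumptions made for illustration.

```python
# Illustrative sketch: find the first in-frame stop codon on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(seq, frame, start=0):
    """Return the 0-based index of the first stop codon in the given forward
    frame (1, 2 or 3) at or after `start`, or None if the frame stays open."""
    seq = seq.upper()
    # First position at or after `start` that lies in the requested frame.
    pos = start + ((frame - 1 - start) % 3)
    for i in range(pos, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

# Toy sequence (hypothetical, not taken from the fosmid).
toy = "ATGAAATAGCCC"
for frame in (1, 2, 3):
    print(frame, first_stop(toy, frame))
```

A frame that returns None over a long stretch is a candidate for an extended exon, which is the pattern the green arrow in Figure 6 points to; the alignment against the D. melanogaster exon is still needed to confirm it.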

Figure 11: Exons 5 and 6 of D. melanogaster aligned with my sequence

Exons 6, 7, 8, and 9 were all fairly straightforward and matched the exons from D. melanogaster. Because exon 9 is the last exon in the ORF, it ends with a stop codon. I was unable to find any 3' untranslated region for zfh2. Table 1 shows all the identified exons for Zfh2.

Table 1: Zfh2 exons; capital letters are exons
Exon  Start base  Sequence            End base  Sequence          Length (bases)
1                 tgctaacgacggct                GTGCTCGgtaagttc
2                 tttgttacagctgcg               GGCAGgtacgtttt
3                 ccgttccaggccaa                CTGAAGgtatgtc
4                 aatttcagatcca                 AGCTTgtcgatct
5                 gcagtcccccca                  ACCCAGgtaagtcg
6                 tagcaacaatt                   GAAGgtaccacgtcga
7                 atattcaaacagggttg             TACAAgtaagtcaa
8                 gggctttcacaggtttgg            TCACCGgtaagaatt
9                 cgtaaaacaagacacg              GACTAAacgaaatt    89
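The rule used above, that every intron begins with GT and that the last exon ends with a stop codon, can be applied to the end-of-exon sequences listed in Table 1. This is a sketch under the assumption that upper-case letters are exon and lower-case letters are the following intron, as the table caption states; the helper names are hypothetical.

```python
# Illustrative sketch: check donor sites (intron starts with "gt") and the
# final stop codon using the end-of-exon flanks from Table 1.
STOP_CODONS = {"TAA", "TAG", "TGA"}

ZFH2_END_FLANKS = [
    "GTGCTCGgtaagttc", "GGCAGgtacgtttt", "CTGAAGgtatgtc",
    "AGCTTgtcgatct", "ACCCAGgtaagtcg", "GAAGgtaccacgtcga",
    "TACAAgtaagtcaa", "TCACCGgtaagaatt", "GACTAAacgaaatt",
]

def split_boundary(flank):
    """Split a flank into (exon, downstream) at the upper/lower-case change."""
    i = 0
    while i < len(flank) and flank[i].isupper():
        i += 1
    return flank[:i], flank[i:]

def check_exon_ends(flanks):
    """Return (row number, passed) per row: internal exons must be followed
    by 'gt'; the last exon must end with a stop codon."""
    results = []
    for n, flank in enumerate(flanks, start=1):
        exon, downstream = split_boundary(flank)
        if n == len(flanks):
            ok = exon[-3:] in STOP_CODONS
        else:
            ok = downstream.startswith("gt")
        results.append((n, ok))
    return results

for row, ok in check_exon_ends(ZFH2_END_FLANKS):
    print(row, "ok" if ok else "check boundary")
```

A row flagged "check boundary" would indicate an exon end that is off by a base or more, which is exactly the kind of error a manual boundary inspection is meant to catch.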

To ensure the accuracy of the predicted exons, I joined all of the exons into one file, forming the coding DNA sequence of the protein. Using the Translate tool on ExPASy, I translated this DNA sequence. If the intron/exon boundaries are incorrect, then the translated protein will be full of stop codons, as occurred on the initial attempt with Zfh2 (Figure 12).

Figure 12: Translated Zfh2 with predicted exons

I had set the intron boundaries incorrectly between the 5th and 6th exons, which caused a frame shift. Between exons, the annotator has to be sure to stay in the same frame. When comparing Figure 13 to Figure 12, it becomes apparent that I was in frame 3 instead of the desired frame 1. This problem resulted from the end of exon 5, where I was off by just one base (Figure 14).

Figure 13: Frame shift in exon 6

Figure 14: Wrong exon boundary at the end of exon 5

After fixing this, I recompiled the exons and translated the sequence. The result was exactly what I wanted (Figure 15). I confirmed that this was the correct sequence by blasting the translated amino acid sequence against Zfh2, and got a nearly perfect alignment.

Figure 15: Zfh2 with correct exons
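The join-and-translate check described above can also be scripted. The sketch below is not the ExPASy tool; it assumes 1-based inclusive exon coordinates on the forward strand and the standard genetic code, and the toy genome and coordinates are invented for illustration. It shows how an exon end that is off by one base changes the downstream frame.

```python
# Illustrative sketch: join exons, translate, and test for a clean ORF.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def join_exons(genome, exons):
    """Concatenate exons given as (start, end), 1-based inclusive."""
    return "".join(genome[s - 1:e] for s, e in exons)

def translate(dna):
    """Translate in frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def clean_orf(protein):
    """True if the only stop codon is the final one."""
    return protein.endswith("*") and "*" not in protein[:-1]

# Toy genome (hypothetical): exon 1 = bases 3-10, intron = 11-21, exon 2 = 22-31.
genome = "cc" + "ATGGCTAA" + "gtaagtttcag" + "AGAACTGTAA" + "tt"

good = translate(join_exons(genome, [(3, 10), (22, 31)]))
bad = translate(join_exons(genome, [(3, 11), (22, 31)]))  # exon 1 end off by one
print(good, clean_orf(good))
print(bad, clean_orf(bad))
```

In this toy case the shifted version loses its terminal stop codon; in a long gene such as Zfh2 the same one-base error shows up as stop codons scattered through the translation.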

The next feature I analyzed was Twinscan output chr6.009.1. When I performed a Blast against the nr database with this feature, a hit to CG1981 appeared with an e-value of e^-100. FlyBase showed this gene to be thd1. I assumed this gene to be real because it was annotated last year, and because when I ran Blast with my entire sequence against the nr database, I matched this gene with multiple exons and no internal stop codons. Thd1 clearly contains more exons than just the one predicted by Twinscan.

When attempting to find the first exon, I could not match the first 144 amino acids of the protein, even with a high e-value threshold and the filter turned off (Figure 16). Because I could not find the start site by using Blast, I used the first methionine upstream of the area that matched in Figure 16. Fortunately, the methionine was about 140 amino acids away.

Figure 16: Blast2 with D. melanogaster exon 1 and my sequence

When looking at the first exon, I noticed that the score improves the more the raw sequence is used instead of filtered data. In Figure 17, all panels show the output from the same Blast2 as in Figure 16. The top panel shows the score using my sequence after Repeat Masker was run and with the filter on the Blast2 website turned on. The middle panel shows the same alignment with the filter turned off. The bottom panel shows the same alignment with the filter off and using my unmasked sequence.

The rest of the exons for Thd1 were not difficult to find, and Table 2 shows all of the exons. I compiled the exons as before and attempted to translate the predicted sequence of thd1. The first attempt failed, but after making adjustments to account for the gene running in the opposite direction, I was successful (Figure 18).
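The adjustment for a gene running in the opposite direction amounts to taking the reverse complement of each exon before joining. The following is a minimal sketch, assuming 1-based inclusive coordinates where a minus-strand exon has a start coordinate larger than its end coordinate; the function names and the toy genome are hypothetical.

```python
# Illustrative sketch: extract and join exons of a reverse-strand gene.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def minus_strand_exon(genome, start, end):
    """Return the exon as read on the minus strand; start > end,
    both 1-based and inclusive on the forward-strand numbering."""
    return reverse_complement(genome[end - 1:start])

def join_minus_strand(genome, exons):
    """Join exons listed in transcript order (highest coordinates first)."""
    return "".join(minus_strand_exon(genome, s, e) for s, e in exons)

# Toy genome (hypothetical): the gene reads ATG then TAA on the minus strand.
genome = "TTACCCAAACAT"
coding = join_minus_strand(genome, [(12, 10), (3, 1)])
print(coding)
```

Joining the forward-strand slices without this step yields a sequence that translates to nonsense, which is consistent with the failed first attempt described above.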

Figure 17: Progression of score when decreasing filtering

Table 2: Thd1 exons
Exon  Start base  Sequence            End base  Sequence             Length (bases)
1     62505       aggcacgaagatggc     60884     AAGGTTgtgagtaacgtat
2                 atattattgcagaacac             ACAATGgtgagttcctat
3                 atcttgaaacagcggcgg            TTATAgtgagttgtaaa
4                 aaaaaccctgcaggtcgg  58399     ATACTgtaagcatattt    363
5     56912       aatttcagtatatct               TCTGAtggcagcagcag    2556

Figure 18: Thd1 translated

The next feature to investigate was chr6.006.1, a predicted single-exon gene. I performed Blast on this feature and searched for EST, cDNA, CDS, and mRNA data, and found no hits to the region around or including this feature. This suggests a false prediction by Twinscan. Chr6.005.1 was the next feature predicted by Twinscan. This feature, like chr6.006.1, had no hits to any actual data.

After this, I completely gave up on Twinscan and used the Blast file, with my sequence and the nr database, to see that there was only one other hit with a good e-value: the gene CG1507, Pur-alpha (Figure 19). This protein has several different splicing patterns according to Ensembl.

Figure 19: Herne view of Blast output with my sequence and nr database, zoomed in at the end

I could not locate the first exon for this gene, so I used the available mRNA data (Figure 20). The gene starts at around in the figure and is in frame 3. The blue area is where my sequence and exon 2 of D. melanogaster aligned. I hypothesize that the first exon is the one shown by the mRNA data in Figure 20, and that the area prior to the methionine is 5' untranslated region.

Figure 20: Pur-alpha exon 1

Exon 2 was found using Ensembl and mRNA data. The rest of pur-alpha extends past my sequence. Table 3 shows the exon information. I compiled the exons, translated them, and got the desired translation.

Table 3: Pur-alpha exons
Exon  Start base  Sequence         End base  Sequence           Length (bases)
1                 tcttttattttcaga            GGTATgttataaaaaaa
2                 cagccgtcagtgcag            GGCCGAGgtaaatata   106

Conserved Non-Genic Regions:

I searched for, but could not find, any CNG regions.

Repeats:

The large table below contains all the repeats in my sequence. The black entries are the repeats found by Repeat Masker. All of the red entries indicate repeats found upon further analysis. Repetitive features from this table make up 16.9% of my sequence. Repeat Masker run without the -nolow option found 74 additional regions of low complexity or simple repeats.

Repeat ID#  Position on Sequence  Repeat Family  Repeat
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  Novel???       Probably end of Penelope
                                  LINE           PENELOPE
                                  DNA            DNAREP1 DM
                                  DINE
                                  LTR/Pao        DIVER2 I
                                  LTR/Pao        BATUMI I
                                  Transposon     Tel
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  DINE
                                  DNA/Transib    TRANSIB
                                  Novel???       Probably end of Transib
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  DINE
                                  LINE           PENELOPE
                                  DINE
                                  LINE           PENELOPE
                                  LINE           PENELOPE
                                  Novel???       Probably joins entries 22 and
                                  LINE           PENELOPE
                                  DINE
                                  Novel
                                  LTR/Gypsy      INVADER3 I
                                  LTR/Gypsy      INVADER2 I
                                  DNA            DNAREP1 DM
                                  DNA            DNAREP1 DM
                                  LINE           PENELOPE
                                  LINE           PENELOPE
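A repeat fraction such as the 16.9% quoted above is the number of bases covered by at least one repeat divided by the sequence length (80,940 bases here, so 16.9% corresponds to roughly 13,700 bases). Because the positions in the table did not survive, the intervals below are invented; this is only a sketch of how overlapping entries can be merged so that no base is counted twice, not the calculation actually used in this report.

```python
# Illustrative sketch: percentage of a sequence covered by repeat intervals.

def covered_bases(intervals):
    """Count bases covered by at least one (start, end) interval,
    1-based inclusive, merging overlaps so no base is counted twice."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def percent_covered(intervals, seq_length):
    return 100.0 * covered_bases(intervals) / seq_length

# Hypothetical repeat positions, for illustration only.
example = [(1, 10), (5, 20), (30, 39)]
print(covered_bases(example), percent_covered(example, 100))
```

Merging matters for entries such as the "Novel???" rows, which are described as probable extensions of neighbouring repeats and may overlap them.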

When searching for proteins through the Twinscan output, the first feature analyzed hit perfectly to tel1 when run on Blast against the nr database. Tel1 is a protein involved in transposable elements. Tel1 lifts a region out of a DNA sequence and places it elsewhere. Tel1 is adjacent to repeat #8 on the table, and possibly lifts this section out of the DNA sequence. Tel1 is not a novel repeat and should have been recognized by Repeat Masker. Tel1 is on the table of repeating elements under entry #10.

I found five DINEs in my sequence by performing a Blast2 alignment with my sequence and the generic DINE sequence supplied by Libby. After the initial matches, I performed a Blast2 with the suspected DINE regions and the known DINE sequences from different sources. The suspected DINEs had significant matches to all of the different types of DINEs in the exact same areas. The characteristic common to all DINEs is two highly conserved regions of DNA separated by a non-conserved region, as is shown in Figure 23.

Figure 21: DINE with two sections of conserved sequence

To find novel repeats, or repeats not known by Repeat Masker, I performed a BlastN search with my sequence against the rest of the dot chromosome of D. virilis, and found four potential novel repeats. Three of the potential novel repeats were very close to either end of repeats found by Repeat Masker, and are probably extensions of the known repeats. Repeat Masker often will not recognize the end of a repeat within a sequence due to the program's method of scoring. The other novel repeat had no matches to any known protein, and I hypothesize this to be truly novel. Interestingly, this novel repeat is found within an intron of Thd1. The four potential novel repeats are found on the table under entries #4, #15, #24, and #27, with #27 being the truly novel repeat.

ClustalW:

For the Clustal analysis, I compared Zfh2 with different zinc finger proteins from a wide range of species. Organisms and the proteins that I used include: Zfh2 from D. melanogaster, Zinc finger homeodomain 4 from Homo sapiens, Zinc finger homeodomain from Caenorhabditis elegans, and the Homeobox protein from Arabidopsis thaliana. The Clustal analysis with all of the species did not show any conservation except in a small area, and this was not good conservation. I hypothesized that conservation would be more evident without A. thaliana because of the great evolutionary distance between it and the other species. I ran another Clustal analysis without A. thaliana and

found a much more highly conserved sequence in the same region that showed little conservation before (Figure 22). The conserved sequence represents the zinc finger domain. This domain is conserved across animal species, but it appears not to be conserved in plants.

Figure 24: Clustal without A. thaliana

Synteny:

My sequence has high synteny to the D. melanogaster dot chromosome, in that all the genes are in the same order and orientation. Figure 25 shows the region on the dot chromosome of D. melanogaster, and Figure 26 shows my region with just the genes.

Figure 25: Ensembl map of region on 4th chromosome of D. melanogaster

Figure 26: Map with just my genes

In my sequence, about 17.5 kilobases separate the first translated exons of Thd1 and Pur-alpha, compared to 4 kilobases in D. melanogaster. This is a very large difference and is unexpected considering that D. virilis is more genetically dense than D. melanogaster in the dot chromosome. There is a large repeat section in my sequence that could account for some of the difference in spacing. Between the last translated exons of Thd1 and Zfh2, both D. virilis and D. melanogaster contain about 8.5 kilobases of sequence. The region before Zfh2 does not contain any known genetic features for more than 30 kilobases in both species. Both of these regions show high synteny between D. virilis and D. melanogaster. The region in front of Zfh2 is hypothesized to contain an important element of Zfh2, be it a 5' untranslated region or a promoter. When a P-element is inserted into this empty region, the fly does not survive. Unfortunately, I did not have enough time to analyze this section of sequence.
