Annotating 7G24-63
Justin Richner
May 4, 2005

Figure 1: Map of my sequence. Tracks: Zfh2 exons, Thd1 exons, Pur-alpha exons. Legend: LINE, Penelope; DNA/Transib, Transib1; DINE; Novel Repeat; LTR/Pao, Diver2_I; LTR/Gypsy, Invader; Transposon, Tel1; DNA, DNAREP1_DM.

I was given 80,940 bases of sequence to annotate from the Drosophila virilis dot chromosome. This consisted of two approximately 40 kb fosmids joined together: 7G24 and 63. Fosmid 7G24 comprises bases 1 to 39,070. Fosmid 63 was annotated last year (Figure 1), and three genes were found: zfh2, thd1, and pur-alpha. I also found and annotated the same three genes.

Zfh2 is zinc finger homeodomain protein 2, a probable transcription factor that is required for wing development. Zfh2 stretches from 22793 to 45865 and contains nine exons. Thd1 is mismatch-dependent uracil/thymine DNA glycosylase, which removes mismatched uracil or thymine in double-stranded DNA. Thd1 stretches from 62505 to 54357 and contains five exons. Pur-alpha is purine-rich binding protein-α, a single-stranded DNA binding protein thought to be involved in DNA replication. Pur-alpha begins at 80071 and extends past the end of my sequence. Two of the Pur-alpha exons are within my sequence.

The entire sequence contains 32 repeated segments, one of which is a novel repeat and five of which are DINEs. The protein Zfh2 is conserved across species in the zinc finger binding domain. No conserved non-genic regions were found. This segment of the dot chromosome has high synteny with the fourth chromosome of D. melanogaster.

Figure 2: Gene map from last year's submitted paper

Genes:

I first tried to identify genes using the Twinscan output on the Goose server, displayed in the UCSC Genome Browser format (Figure 3). The first gene predicted (chr6.001.1) is the tel1 gene, a protein involved in transposable elements. I look at this gene more closely in the Repeats section.

Figure 3: UCSC output on Goose server

The next predicted feature I analyzed was chr6.002.1. Twinscan predicts this to be a single-exon feature, but Genscan and mRNA data suggest that there are multiple exons. When I ran a Blast search against the nr database, the feature showed very good homology to the Zfh2 protein. However, the Zfh2 protein was much longer than the one-exon gene predicted by Twinscan. I did a Blast search with the next predicted feature, chr6.003.1, and again found high homology to Zfh2. I decided that these were most likely exons of the same gene and attempted to find the rest of the exons.

At this point I did not know how to use Ensembl or FlyBase, so to look for the exons I blasted my entire repeat-masked sequence against the nr database and looked for the exons using Herne on the Blast output file. The results were not what I expected. The first four exons were transcribed in the forward direction, from around base 20000 to base 40000 (Figure 4), and the last five exons were transcribed in the reverse direction, from the very end of my sequence to about base 60000 (Figure 5).

Figure 4: Two of the exons for Zfh2 transcribed in the forward direction.
Figure 5: Three of the exons for Zfh2 transcribed in the reverse direction.

I realized that my sequence was not assembled correctly: XAAA63 should have been oriented in the opposite direction before it was joined with 7G24. Chris corrected my sequence but could not put the corrected sequence into the UCSC output on the Goose server. All of the coordinates in the second half of my sequence were therefore incorrect when looking at data on the UCSC output, and I continually had to do Blast2 alignments to find the proper coordinates. The Twinscan output was also wrong for Zfh2.

After performing a Blast search with the corrected sequence file, I looked at the hits to Zfh2. With an e-value of 0.0, predicted exons for nearly all of the amino acids, no stop codons within the predicted exons, and last year's data, I concluded that zfh2 is a real gene. I then began searching for exons. The first exon predicted by Twinscan was much shorter than the first exon in D. melanogaster, obtained from the Ensembl database. However, I noticed that the exon could extend for quite some distance in the +2 frame without encountering a stop codon, as shown by the green arrow in Figure 6. I hypothesized that the exon actually continued through the first three exons predicted by Genscan, as shown in Figure 6.

Figure 6: UCSC output of first exon of zfh2

I performed a Blast2 alignment of my hypothesized exon against the D. melanogaster first exon and obtained a good match (Figure 7). I hypothesize that this region, from 22805 to 24577, is the first exon of zfh2.

Figure 7: D. melanogaster vs. predicted zfh2 first exon
Figure 8: Blast2 of D. melanogaster 2nd exon with my sequence.

At this point I realized two things: Twinscan and Genscan are not reliable, and the method I used to find the first exon was highly inefficient. I began to search for exons much more quickly by performing Blast2 with the D. melanogaster exons from Ensembl against my entire sequence (Figure 8).

Later, I came back to exon 1 and examined the intron/exon boundaries to determine the exact end of this exon. The beginning of exon 1 was moved farther back to base 22793 because of mRNA data (Figure 9), so the exon now includes a 5' untranslated region. The end of exon 1 had to be adjusted to 24576 because all introns begin with the bases GT (Figure 10).

Figure 9: Beginning of exon 1; Red arrow = old boundary; Green arrow = new boundary
Figure 10: End of exon 1

Exons 2, 3, and 4 were found without much difficulty. When searching for exon 5, only half of the corresponding D. melanogaster exon matched my sequence. I joined exons 5 and 6 of D. melanogaster, performed a Blast2 alignment with my sequence, and found a complete exon encompassing both D. melanogaster exons without any internal stop codons (Figure 11). I hypothesize that exons 5 and 6 from D. melanogaster have combined to form one exon in D. virilis.
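The frame check used for exon 1 above, asking how far a candidate exon can run before it meets an in-frame stop codon, can be sketched in Python. This is only a minimal illustration and not the tool used for the annotation: the function name `first_stop` and the toy sequence are hypothetical, and the standard nuclear genetic code is assumed.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(seq, start):
    """Return the 0-based index of the first in-frame stop codon at or
    after `start`, or None if the frame stays open to the end of `seq`."""
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Hypothetical toy sequence, not taken from the fosmid.
toy = "CCATGAAACCCGGGTTTTAGCC"

# Compare the three forward frames, as one would when asking how far
# an exon could extend before hitting a stop codon.
for offset in range(3):
    stop = first_stop(toy, offset)
    if stop is None:
        print(f"frame +{offset + 1}: open to the end of the sequence")
    else:
        print(f"frame +{offset + 1}: first stop codon at index {stop}")
```

A long open stretch in one frame is only suggestive; the alignment to the D. melanogaster exon is what supports the boundary.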

Figure 11: Exons 5 and 6 of D. melanogaster aligned with my sequence

Exons 6, 7, 8, and 9 were all straightforward and matched the exons from D. melanogaster. Because exon 9 is the last exon in the ORF, it ends with a stop codon. I was unable to find any 3' untranslated region for zfh2. Table 1 shows all the identified exons for Zfh2.

Table 1: Zfh2 exons; capital letters are exons

Exon  Start base  Sequence             End base  Sequence           Length (bases)
1     22793       tgctaacgacggct       24576     GTGCTCGgtaagttc    1784
2     27214       tttgttacagctgcg      27395     GGCAGgtacgtttt     182
3     28530       ccgttccaggccaa       30179     CTGAAGgtatgtc      1650
4     37078       aatttcagatcca        38573     AGCTTgtcgatct      1496
5     39179       gcagtcccccca         39865     ACCCAGgtaagtcg     687
6     39938       tagcaacaatt          40084     GAAGgtaccacgtcga   147
7     40174       atattcaaacagggttg    43161     TACAAgtaagtcaa     2988
8     44801       gggctttcacaggtttgg   45470     TCACCGgtaagaatt    670
9     45777       cgtaaaacaagacacg     45865     GACTAAacgaaatt     89
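The coordinates and end sequences in Table 1 can be checked mechanically. The sketch below is an illustration only, with hypothetical helper names (`split_at_boundary`, `check_exons`). It assumes coordinates are 1-based and inclusive, that uppercase letters are exon bases, and that the last three exon bases shown for the final exon are its stop codon. It verifies that each length equals end minus start plus one, that every intron after an internal exon begins with GT, and that the last exon ends in a stop codon.

```python
# Rows copied from Table 1: (exon, start, end, sequence at end, length).
# Uppercase letters are exon bases, lowercase letters are flanking bases.
ZFH2_EXONS = [
    (1, 22793, 24576, "GTGCTCGgtaagttc", 1784),
    (2, 27214, 27395, "GGCAGgtacgtttt", 182),
    (3, 28530, 30179, "CTGAAGgtatgtc", 1650),
    (4, 37078, 38573, "AGCTTgtcgatct", 1496),
    (5, 39179, 39865, "ACCCAGgtaagtcg", 687),
    (6, 39938, 40084, "GAAGgtaccacgtcga", 147),
    (7, 40174, 43161, "TACAAgtaagtcaa", 2988),
    (8, 44801, 45470, "TCACCGgtaagaatt", 670),
    (9, 45777, 45865, "GACTAAacgaaatt", 89),
]
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_at_boundary(flank):
    """Split a flank such as 'GGCAGgtacg' into (exon part, downstream part)."""
    cut = next((i for i, ch in enumerate(flank) if ch.islower()), len(flank))
    return flank[:cut], flank[cut:]


def check_exons(rows):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    last = rows[-1][0]
    for exon, start, end, flank, length in rows:
        if end - start + 1 != length:  # inclusive 1-based coordinates
            problems.append(f"exon {exon}: length mismatch")
        exon_part, downstream = split_at_boundary(flank)
        if exon == last:
            # Assumes the final three exon bases shown are the stop codon.
            if exon_part[-3:] not in STOP_CODONS:
                problems.append(f"exon {exon}: no terminal stop codon")
        elif not downstream.startswith("gt"):
            problems.append(f"exon {exon}: intron does not start with GT")
    return problems


print(check_exons(ZFH2_EXONS) or "all checks pass")
```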

To check the accuracy of the predicted exons, I joined all of the exons into one file, forming the DNA sequence that codes for the protein. Using the Translate tool on ExPASy, I translated this DNA sequence. If the intron/exon boundaries are incorrect, then the translated protein will be full of stop codons, as occurred on my initial attempt with Zfh2 (Figure 12).

Figure 12: Translated Zfh2 with predicted exons

I had placed the intron boundary incorrectly between the 5th and 6th exons, which caused a frame shift. Between exons, the annotator has to be sure to stay in the same frame. When comparing Figure 13 to Figure 12, it becomes apparent that I was in frame 3 instead of the desired frame 1. This problem arose at the end of exon 5, where I was off by just one base (Figure 14).

Figure 13: Frame shift in exon 6
Figure 14: Wrong exon boundary at the end of exon 5

After fixing this, I rejoined the exons and translated the sequence. The result was exactly what I wanted (Figure 15). I confirmed that this was the correct sequence by blasting the translated amino acid sequence against Zfh2, which gave a nearly perfect alignment.

Figure 15: Zfh2 with correct exons
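The effect described above, where a boundary that is off by one base fills the translation with stop codons, can be reproduced on a toy example. This is a minimal sketch rather than the ExPASy tool itself: the exon sequences are hypothetical, the helper names `translate` and `internal_stops` are made up for illustration, and the standard genetic code with only A, C, G, and T in the input is assumed.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence in frame 1; '*' marks a stop codon.
    Trailing bases that do not fill a codon are ignored."""
    cds = cds.upper()
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def internal_stops(protein):
    """Count stop codons other than those at the very end."""
    return protein.rstrip("*").count("*")


# Hypothetical two-exon gene; the second version ends exon 1 one base early.
correct = ["ATGGCCAAG", "GATTTGACCTAA"]
shifted = ["ATGGCCAA", "GATTTGACCTAA"]

for label, exons in (("correct", correct), ("shifted", shifted)):
    protein = translate("".join(exons))
    print(label, protein, "internal stops:", internal_stops(protein))
```

With real exons, a frame shift usually produces many internal stops downstream of the bad boundary, which is what points to the exon where the error lies.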

The next feature I analyzed was Twinscan output chr6.009.1. When I performed a Blast search against the nr database with this feature, a hit to CG1981 appeared with an e-value of e^-100. FlyBase showed this gene to be thd1. I assumed this gene to be real because it was annotated last year, and because when I ran Blast with my entire sequence against the nr database, I matched this gene with multiple exons and no internal stop codons. Thd1 clearly contains more exons than just the one predicted by Twinscan.

When attempting to find the first exon, I could not match the first 144 amino acids of the protein, even with a high e-value cutoff and the filter turned off (Figure 16). Because I could not find the start site using Blast, I used the first methionine upstream of the area that matched in Figure 16. Fortunately, this methionine was about 140 amino acids away.

Figure 16: Blast2 with D. melanogaster exon 1 and my sequence

When looking at the first exon, I noticed that the score improves the more the raw sequence is used instead of filtered data. In Figure 17, all panels show the output from the same Blast2 alignment as in Figure 16. The top panel shows the score using my sequence after RepeatMasker was run, with the filter on the Blast2 website turned on. The middle panel shows the same alignment with the filter turned off. The bottom panel shows the same alignment with the filter off and using my unmasked sequence.

The rest of the exons were not difficult to find for Thd1, and Table 2 shows all of the exons. I joined the exons as before and attempted to translate the predicted sequence of thd1. The first attempt failed, but after making adjustments to account for the gene running in the opposite direction, I was successful (Figure 18).
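The rule used above for the Thd1 start site, taking the first methionine upstream of the aligned region, can be written as a short search. This is a hypothetical sketch, not the procedure actually run: the function name `upstream_start` and the toy sequences are made up, the input is assumed to be in the gene's transcribed orientation (for Thd1, the reverse complement of the assembled sequence), and the search gives up if it meets an in-frame stop codon first.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_start(seq, match_start):
    """Walk upstream from `match_start` (0-based index of an in-frame
    codon) one codon at a time. Return the index of the nearest in-frame
    ATG, or None if a stop codon or the sequence start is reached first."""
    seq = seq.upper()
    i = match_start - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            return None
        if codon == "ATG":
            return i
        i -= 3
    return None


# Hypothetical toy sequences; the aligned region starts at the index given.
print(upstream_start("TTATGGCACCCGGGAAATTT", 11))  # ATG found upstream
print(upstream_start("ATGTAACCCGGG", 9))           # blocked by a stop codon
```

The nearest upstream methionine is only a working hypothesis for the true start; other evidence such as mRNA data would be needed to confirm it.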

Figure 17: Progression of score when decreasing filtering

Exon  Start base  Sequence             End base  Sequence              Length (bases)
1     62505       aggcacgaagatggc      60884     AAGGTTgtgagtaacgtat   1622
2     60325       atattattgcagaacac    59633     ACAATGgtgagttcctat    693
3     59011       atcttgaaacagcggcgg   58855     TTATAgtgagttgtaaa     157
4     58761       aaaaaccctgcaggtcgg   58399     ATACTgtaagcatattt     363
5     56912       aatttcagtatatct      54357     TCTGAtggcagcagcag     2556

Table 2: Thd1 exons

Figure 18: Thd1 translated
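Because Thd1 runs in the opposite direction, its start coordinates in Table 2 are larger than its end coordinates, and each exon has to be reverse-complemented before the exons are joined and translated. The sketch below illustrates that adjustment under stated assumptions: coordinates are 1-based and inclusive, the helper names (`reverse_complement`, `exon_length`, `extract_exon`) and the toy genome are hypothetical, and only unambiguous bases are handled.

```python
# Rows copied from Table 2: (exon, start, end, length). Thd1 is on the
# reverse strand, so each start coordinate is larger than its end.
THD1_EXONS = [
    (1, 62505, 60884, 1622),
    (2, 60325, 59633, 693),
    (3, 59011, 58855, 157),
    (4, 58761, 58399, 363),
    (5, 56912, 54357, 2556),
]

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def exon_length(start, end):
    """Length in bases for inclusive 1-based coordinates on either strand."""
    return abs(end - start) + 1


def extract_exon(genome, start, end):
    """Return an exon in its transcribed orientation from a forward-strand
    sequence, using inclusive 1-based coordinates."""
    if start <= end:
        return genome[start - 1:end]
    return reverse_complement(genome[end - 1:start])


# The lengths listed in Table 2 agree with the coordinates.
print(all(exon_length(s, e) == n for _, s, e, n in THD1_EXONS))

# Hypothetical toy genome: a forward exon and a reverse-strand exon.
toy_genome = "ATGCCCTAA"
print(extract_exon(toy_genome, 1, 3))  # forward strand, bases 1 to 3
print(extract_exon(toy_genome, 9, 7))  # reverse strand, bases 9 down to 7
```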

The next feature to investigate was chr6.006.1, a predicted single-exon gene. I performed a Blast search on this feature and searched for EST, cDNA, CDS, and mRNA data, and found no hits to the region around or including this feature. This suggests a false prediction by Twinscan. Chr6.005.1 was the next feature predicted by Twinscan. This feature, like chr6.006.1, had no hits to any actual data.

After this, I gave up on Twinscan completely and used the Blast file of my sequence against the nr database to see that there was only one other hit with a good e-value: the gene CG1507, Pur-alpha (Figure 19). This protein has several different splicing patterns according to Ensembl.

Figure 19: Herne view of Blast output with my sequence and nr database, zoomed in at the end

I could not locate the first exon for this gene by alignment, so I used the available mRNA data (Figure 20). The gene starts at around 399940 in the figure and is in frame 3. The blue area is where my sequence and exon 2 of D. melanogaster aligned. I hypothesize that the first exon is the one shown by the mRNA data in Figure 20, and that the area prior to the methionine is 5' untranslated region.

Figure 20: Pur-alpha exon 1

Exon 2 was found using Ensembl and mRNA data. The rest of pur-alpha extends past my sequence. Table 3 shows the exon information. I joined the exons, translated them, and got the desired translation.

Exon  Start base  Sequence          End base  Sequence            Length (bases)
1     80071       tcttttattttcaga   80141     GGTATgttataaaaaaa   71
2     80725       cagccgtcagtgcag   80830     GGCCGAGgtaaatata    106

Table 3: Pur-alpha exons

Conserved Non-Genic Regions:

I searched for, but could not find, any CNG regions.

Repeats:

The large table below contains all the repeats in my sequence. The black entries are the repeats found by RepeatMasker. All of the red entries indicate repeats found upon further analysis. Repetitive features from this table make up 16.9% of my sequence. RepeatMasker run without the nolow option found 74 additional regions of low complexity or simple repeats.

Repeat ID#  Position on Sequence  Repeat Family  Repeat
1           258-330               LINE           PENELOPE
2           2844-2971             LINE           PENELOPE
3           4800-4913             LINE           PENELOPE
4           6081-6553             Novel???       Probably end of Penelope
5           6547-6674             LINE           PENELOPE
6           6578-6704             DNA            DNAREP1_DM
7           6601-6643             DINE
8           11147-15982           LTR/Pao        DIVER2_I
9           13026-13198           LTR/Pao        BATUMI_I
10          15983-16432           Transposon     Tel1
11          17343-17391           LINE           PENELOPE
12          17672-18049           LINE           PENELOPE
13          17893-17953           DINE
14          19530-19638           DNA/Transib    TRANSIB1
15          19598-19863           Novel???       Probably end of Transib1
16          30807-30855           LINE           PENELOPE
17          30926-31395           LINE           PENELOPE
18          35377-35767           LINE           PENELOPE
19          35483-35535           DINE
20          36154-36544           LINE           PENELOPE
21          36260-36312           DINE
22          44036-44326           LINE           PENELOPE
23          52191-52233           LINE           PENELOPE
24          52276-52725           Novel???       Probably joins entries 23 and 25
25          52740-53098           LINE           PENELOPE
26          52971-53031           DINE
27          58028-58204           Novel
28          63599-66132           LTR/Gypsy      INVADER3_I
29          66800-67794           LTR/Gypsy      INVADER2_I
30          72622-72723           DNA            DNAREP1_DM
31          73274-73402           DNA            DNAREP1_DM
32          80310-80372           LINE           PENELOPE
33          80468-80602           LINE           PENELOPE
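Several entries in the repeat table overlap one another (for example, entries 4 to 7), so a coverage total has to merge overlapping intervals to avoid counting the same bases twice. How the 16.9% figure was calculated is not stated in this report; the sketch below shows one possible way to total coverage, using only a small subset of the table. The function names `merge_intervals` and `covered_bases` are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
SEQUENCE_LENGTH = 80940  # total bases in the annotated sequence

# Subset of the repeat table (entries 4-7 and 12-13), for illustration only.
SAMPLE_REPEATS = [
    (6081, 6553),    # entry 4
    (6547, 6674),    # entry 5
    (6578, 6704),    # entry 6
    (6601, 6643),    # entry 7
    (17672, 18049),  # entry 12
    (17893, 17953),  # entry 13
]


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Count the bases covered by at least one interval."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


naive = sum(end - start + 1 for start, end in SAMPLE_REPEATS)
merged = covered_bases(SAMPLE_REPEATS)
print("summed lengths:", naive)
print("merged coverage:", merged)
print("fraction of sequence:", merged / SEQUENCE_LENGTH)
```

For this subset the summed lengths exceed the merged coverage, which shows how much double counting the overlaps would otherwise introduce.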

When searching for proteins through the Twinscan output, the first feature analyzed hit perfectly to tel1 when run on Blast against the nr database. Tel1 is a protein involved in transposable elements. Tel1 lifts a region out of a DNA sequence and places it elsewhere. Tel1 is adjacent to repeat #8 in the table, and possibly lifts this section out of the DNA sequence. Tel1 is not a novel repeat and should have been recognized by RepeatMasker. Tel1 is in the table of repeats under entry #10.

I found five DINEs in my sequence by performing a Blast2 alignment of my sequence against the generic DINE sequence supplied by Libby. After the initial matches, I performed a Blast2 alignment of the suspected DINE regions against known DINE sequences from different sources. The suspected DINEs had significant matches to all of the different types of DINEs in exactly the same areas. The characteristic common to all DINEs is two highly conserved regions of DNA separated by a non-conserved region, as shown in Figure 23.

Figure 21: DINE with two sections of conserved sequence

To find novel repeats, that is, repeats not known to RepeatMasker, I performed a BlastN search with my sequence against the rest of the dot chromosome of D. virilis and found four potential novel repeats. Three of the potential novel repeats were very close to either end of repeats found by RepeatMasker and are probably extensions of the known repeats. RepeatMasker often does not recognize the end of a repeat within a sequence because of the program's scoring method. The other novel repeat had no matches to any known protein, and I hypothesize that it is truly novel. Interestingly, this novel repeat is found within an intron of Thd1. The four potential novel repeats are in the table under entries #4, 15, 24, and 27, with #27 being the truly novel repeat.

ClustalW:

For the Clustal analysis, I compared Zfh2 with different zinc finger proteins from a wide range of species. The organisms and proteins that I used were: Zfh2 from D. melanogaster, Zinc finger homeodomain 4 from Homo sapiens, Zinc finger homeodomain from Caenorhabditis elegans, and the Homeobox protein from Arabidopsis thaliana. The Clustal analysis with all of the species did not show any conservation except in a small area, and even there the conservation was weak. I hypothesized that conservation would be more evident without A. thaliana because of the great evolutionary distance between it and the other species. I ran another Clustal analysis without A. thaliana and found much higher conservation in the same region that showed little conservation before (Figure 22). The conserved sequence represents the zinc finger domain. This domain is conserved across animal species, but it appears not to be conserved in plants.

Figure 24: Clustal without A. thaliana

Synteny:

My sequence has high synteny to the D. melanogaster dot chromosome, in that all the genes are in the same order and orientation. Figure 25 shows the region on the dot chromosome of D. melanogaster, and Figure 26 shows my region with just the genes.

Figure 25: Ensembl map of region on 4th chromosome of melanogaster
Figure 26: Map with just my genes

In my sequence, about 17.5 kilobases separate the first translated exons of Thd1 and Pur-alpha, compared to 4 kilobases in D. melanogaster. This is a very large difference and is unexpected, considering that D. virilis is more gene-dense than D. melanogaster in the dot chromosome. There is a large repeat section in my sequence that could account for some of the difference. Between the last translated exons of Thd1 and Zfh2, both D. virilis and D. melanogaster contain about 8.5 kilobases of sequence. The region before Zfh2 does not contain any known genetic features for more than 30 kilobases in both species. Both of these regions show high synteny between D. virilis and D. melanogaster.

The region in front of Zfh2 is hypothesized to contain an important element of Zfh2, be it a 5' untranslated region or a promoter. When a P-element is inserted into this empty region, the fly does not survive. Unfortunately, I did not have enough time to analyze this section of sequence.
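The intergenic distances quoted above follow from the exon coordinates in Tables 1 to 3. As a small worked check, with the constant and function names being hypothetical and the distance taken simply as the difference between the two coordinates:

```python
# Coordinates copied from Tables 1 to 3 (1-based positions in the
# annotated sequence).
ZFH2_LAST_EXON_END = 45865
THD1_LAST_EXON_END = 54357
THD1_FIRST_EXON_START = 62505
PUR_ALPHA_FIRST_EXON_START = 80071


def gap_kb(left, right):
    """Distance between two coordinates, in kilobases."""
    return (right - left) / 1000


# Thd1 first exon to Pur-alpha first exon (text: about 17.5 kb).
print(gap_kb(THD1_FIRST_EXON_START, PUR_ALPHA_FIRST_EXON_START))

# Zfh2 last exon to Thd1 last exon (text: about 8.5 kb).
print(gap_kb(ZFH2_LAST_EXON_END, THD1_LAST_EXON_END))
```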