Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.


Bio4342 Exercise 1 Answers: Detecting and Interpreting Genetic Homology
(Answers prepared by Wilson Leung)

Question 1: Low complexity DNA can be described as sequences that consist primarily of one or two of the four possible nucleotides. Because of its simple structure, we would expect to find low complexity DNA in the fast-annealing fraction of the genome in a Cot curve. Similarly, low complexity sequences tend to align with each other in a BLAST search. Therefore, while it may not have a highly repetitive structure, low complexity DNA causes the same problem as simple repeats and other repetitious elements: an increase in the number of spurious matches.

The generic BLAST algorithm looks first for short perfect matches and then attempts to extend the alignment in both directions. A blastn search compares a nucleotide query sequence with a nucleotide database. Removing low-complexity DNA (and other repetitious elements) decreases the number of spurious matches in a blastn search. It can also significantly decrease the running time of the BLAST search, because there are fewer initial perfect matches that fail to produce significant alignments in the extension phase.

However, it is generally a bad idea to remove low-complexity DNA from a sequence before running blastx. In a blastx search, a nucleotide query is translated in all six reading frames and compared against a protein database. Regions that are low complexity at the DNA level may still be significant to the protein at the amino acid level. For example, proteins such as collagen have highly regular sequences that may appear to be low complexity at the DNA level but are essential for the protein to function properly. Other structural proteins, such as elastin, also have a highly regular amino acid sequence (i.e., rich in glycine, valine, alanine, and proline).
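The idea of low complexity in Question 1 can be made concrete with a small sketch. The code below scores each window of a nucleotide sequence by the Shannon entropy of its base composition and flags windows dominated by one or two bases. This is only an illustration: it is not the algorithm used by the BLAST filters, and the function names, window size, and threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per base) of the base composition of a window."""
    counts = Counter(window.upper())
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq: str, size: int = 20, threshold: float = 1.2):
    """Return half-open (start, end) windows whose entropy is below threshold.

    size and threshold are illustrative values, not taken from any BLAST filter.
    """
    flagged = []
    for start in range(0, len(seq) - size + 1):
        window = seq[start:start + size]
        if shannon_entropy(window) < threshold:
            flagged.append((start, start + size))
    return flagged


# A window using all four bases equally has entropy 2 bits; a window made of
# only two alternating bases has entropy 1 bit and is flagged.
print(shannon_entropy("ACGT" * 5))
print(low_complexity_windows("AT" * 10))
```

A sequence that uses only two bases scores at most 1 bit per base, which is why a cutoff between 1 and 2 bits separates the two toy cases above.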
Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

Question 3: No known repeats are found in this 4500 bp sequence (lab1seq2.fna). We would generally not expect the same result from primates, since repetitive elements (particularly Alu) are more prevalent in primate genomes. It should also be noted, however, that transposon-free regions spanning more than 10 kb have been found in both the human and the mouse genome.

Question 4: A total of 24 hits are reported by blastx. The best E-value is 1e-148 while the worst E-value is 8.7. The matches with poor E-values are inconsistent with those with better E-values (i.e., 1e-148 and 6e-81). The two hits with the best E-values are matches to the SWALLOW protein, while the remaining hits are matches to a diverse array of proteins. These poor matches range from the extensin precursor to the human collagen alpha I(IX) chain precursors.

Question 5: The two most reliable matches, according to blastx, suggest that our sequence shows a high degree of similarity to the SWALLOW proteins in D. melanogaster and D. pseudoobscura. Furthermore, the small E-values for both matches indicate that the probability of obtaining alignments this good or better by random chance is extremely low (assuming that the evolutionary model used by BLAST is correct).

The potential pitfalls of this interpretation come from the inductive reasoning that is used in a BLAST search (comparison software -> similar sequence -> conservation -> negative selection -> conserved function).

The first problem with our reasoning is in the second step (similar sequence -> conservation). We know that the Expect value (E-value) depends on the size of our search space and on our scoring system. If an inappropriate scoring system is used, the BLAST search will produce significant hits with low E-values that are biologically meaningless. A more serious problem with the second step is that sequences may appear similar because of convergent evolution or simply by chance (improbable given the small E-values in this case, but not impossible).

The third step in our chain of inference is conservation -> negative selection. However, sequences such as inactive transposable elements can remain conserved simply because there has been insufficient time or selective pressure for mutations to accumulate. Another problem with this inference is the possibility of a pseudogene, where our sequence no longer represents a functional copy of the SWALLOW gene but insufficient time has elapsed for the pseudogene to diverge significantly from the real SWALLOW gene.

The final step (negative selection -> conserved function) is also problematic because negative selection can still produce similar proteins that have very different functions (examples include the different functions of adh1 and adh2 in yeast).
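The dependence of the E-value on search space and scoring system mentioned in Question 5 can be sketched with the standard Karlin-Altschul expression E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw alignment score. The values of lambda and K depend on the scoring system; the numbers used below are illustrative placeholders rather than parameters taken from this exercise, and the function name is hypothetical.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Illustrative placeholder parameters (assumptions, not from the exercise).
LAM, K = 0.3, 0.1

small_db = expect_value(score=50, query_len=100, db_len=1_000, lam=LAM, k=K)
large_db = expect_value(score=50, query_len=100, db_len=2_000, lam=LAM, k=K)

# The same raw score is twice as likely to arise by chance in a search space
# that is twice as large.
print(small_db, large_db)
```

The sketch shows why an E-value cannot be interpreted without knowing the database size and the scoring parameters behind it.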
In addition, homologous sequences can be either orthologous or paralogous.

Question 6: The SWALLOW gene "has a role in localizing bicoid mRNA at the anterior margin of the oocyte during oogenesis, and a poorly characterized role in nuclear divisions in early embryogenesis" according to the Swiss-Prot database. The BLAST output matches SWALLOW genes from two species: Drosophila melanogaster and Drosophila pseudoobscura. According to the GenBank record for SWA_DROME, we should cite Chao, Y.C., Donahue, K.M., Pokrywka, N.J. and Stephenson, E.C. as the group who first characterized this gene. We should cite Huang, Z., Pokrywka, N.J., Yoder, J.H. and Stephenson, E.C. as the individuals who characterized SWA_DROPS.

Question 7: The SWALLOW gene is in the opposite orientation relative to the query sequence.

Question 8: We note that the matches are to the following regions in the subject (SWA_DROME) sequence:

1-77, , , ,

This is most easily visualized by drawing the identified fragments on a map of the full-length protein. We know from our original blastx output that the SWA_DROME SWALLOW protein has a total length of 548 amino acids. Therefore, the entire protein is not matched. The region of the protein that is missing is at residues

Figure 1. Positions of HSPs relative to the SWA_DROME protein in blastx search (with SEG filter)

The basic unit of a BLAST output is the High-scoring Segment Pair (HSP). An HSP denotes the optimal local alignment in a region whose alignment score is above a certain threshold. Multiple HSPs between our query and SWA_DROME overlap each other in this BLAST alignment. Regions that showed two hits include , , and Regions that have three hits include (Figure 1). In this case, there has been a partial gene duplication event that leads to these overlapping HSPs. Overlapping HSPs can also occur because BLAST may overextend an alignment.

Question 9: According to the Swiss-Prot database, the region has the sequence QEDEDDYDEDVD. This sequence is rich in both glutamic acid (E) and aspartic acid (D), with a few glutamine, tyrosine, and valine residues. Hence this region is rich in a few highly charged amino acids. By default, NCBI BLAST automatically filters low complexity regions (using the program SEG for blastx). Given the repeated occurrence of E and D, this region may have been filtered prior to the alignment. To test this hypothesis, blastx is run with the filter turned off. The following hits are found: , , , 1-91, 46-91,

We find that when we turn off the SEG filter, we obtain an alignment to the entire protein (Figure 2).

Blastx with SEG filter:
Blastx without SEG filter:
Figure 2. blastx hits relative to the query sequence (with versus without filtering low complexity regions).
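The mapping of HSPs onto the full-length protein described in Questions 8 and 9 can also be done programmatically. The sketch below counts how many HSPs cover each residue and then reports the uncovered and multiply covered ranges. The function names are hypothetical, and the coordinates in the example are toy values chosen for illustration, not the hit coordinates from this exercise.

```python
def coverage_depth(hsps, protein_len):
    """Count how many HSPs cover each residue (1-based, inclusive ranges)."""
    depth = [0] * (protein_len + 1)  # index 0 is unused
    for start, end in hsps:
        for pos in range(start, end + 1):
            depth[pos] += 1
    return depth


def ranges_with_depth(depth, predicate):
    """Collapse consecutive positions satisfying predicate into (start, end)."""
    ranges, run_start = [], None
    for pos in range(1, len(depth)):
        if predicate(depth[pos]):
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            ranges.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        ranges.append((run_start, len(depth) - 1))
    return ranges


# Toy example: three hypothetical HSPs on a 100-residue protein.
toy_hsps = [(1, 40), (30, 60), (80, 100)]
depth = coverage_depth(toy_hsps, protein_len=100)

print(ranges_with_depth(depth, lambda d: d == 0))  # residues with no match
print(ranges_with_depth(depth, lambda d: d >= 2))  # residues hit more than once
```

Running the same two queries with and without the low complexity filter would show directly which residues are recovered when filtering is turned off.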

Question 10: From the blastx results of Question 9, we note that amino acid position 91 represents a cutoff point for the various HSPs. We note that the reading frame for 1-91 is -3 while the reading frame for is -1. This change in reading frame suggests the presence of an intron in this region. In particular, the matches to the protein at 1-91 correspond to in our query sequence while the matches to the protein at correspond to Hence there may be a potential intron that spans from (178 bp).

BLAST did not include residues in the protein alignment because alignment to a masked base (X) still incurs a score penalty. (See Figure 13 of the Problem Set for an example of an alignment with masked bases.) Since BLAST is looking for optimal local alignments, extending the alignment to include the masked bases would simply lower its score. In other words, adding negative-scoring residue pairs to the end of an alignment produces a lower-scoring alignment, so the optimal local alignment is the one that excludes the masked bases.

Question 11: Two distinct features seem to be present in this locus (Figure 2). Region 1 spans from while region 2 spans from Based on the matches, region 2 is most likely the true SWALLOW gene since we have matches to most of the full-length protein (Figure 3).

Figure 3. Hits (without SEG filter) in region 2 match to the full-length protein.

Region 1 could arise from a tandem duplication of the real gene. Since a functional copy is still present in the genome (in region 2), the sequences in region 1 can then mutate at a relatively neutral rate. Based on the number of sections that are missing in region 1, it probably does not encode a functional copy of the SWALLOW protein (Figure 4).

Figure 4. Hits (without SEG filter) in region 1 do not match to full-length protein.
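The argument in Question 10 that extending an alignment into masked residues only lowers its score can be illustrated with a short sketch. Given the scores of successive residue pairs in an extension, the code keeps the prefix with the highest cumulative score. This is a simplified model and not the actual BLAST extension heuristic; the function name and the example scores (including the penalty of -1 per masked pair) are assumptions made for illustration.

```python
def best_extension(pair_scores):
    """Return (length, score) of the highest-scoring prefix of an extension.

    pair_scores lists the score of each successive aligned residue pair.
    Extending by zero pairs (score 0) is always allowed.
    """
    best_len, best_score, running = 0, 0, 0
    for i, s in enumerate(pair_scores, start=1):
        running += s
        if running > best_score:
            best_len, best_score = i, running
    return best_len, best_score


# Two well-matched pairs followed by three masked (X) pairs, each assumed to
# cost -1, and one weak match. The cumulative scores are 5, 9, 8, 7, 6, 8, so
# the best alignment stops before the masked residues.
print(best_extension([5, 4, -1, -1, -1, 2]))
```

Under this model the alignment crosses a masked stretch only when the residues beyond it score highly enough to repay the accumulated penalty.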
Question 12: If our query has a repetitive element such as a transposon that is not masked, blastn would report many matches where the only region of similarity is the transposon sequence. The increased noise makes it more difficult to identify biologically meaningful matches. In other words, we will have many more false positives in our BLAST results if we forget to mask the repeats within the query sequence.

Question 13: The best RefSeq match (identified by ref in the accession number) is to the Drosophila melanogaster CG3429-PA (swa) mRNA (ref NM_ ). The hits that show very high sequence identity (100%) to the mRNA sequence map to region 2, the region that we have previously hypothesized to be the real SWALLOW gene in our query (Figure 5). The alignment indicates there are three exons and two introns in our homolog to the SWALLOW gene. The introns are at and in the query, corresponding to and in the protein coding sequence.

Figure 5. blastn alignments between the RefSeq hit and the query sequence.

Question 14: The RefSeq hits to region 2 extend to sequences that are outside of the protein matches on both sides of SWALLOW from the previous blastx search (4259 in blastx versus 4372 in blastn; 2347 in blastx versus 2091 in blastn). These extended regions may represent untranslated regions (UTRs) of the mature SWALLOW mRNA transcript. Since the blastx search uses only the amino acid sequence of the SWALLOW protein, it would not have revealed these UTRs. They can only be found when we search against the mRNA sequence at the nucleotide level using blastn.

We also notice some small discrepancies between the potential splice sites identified using blastx against Swiss-Prot and those identified using blastn against nt. Since the RefSeq hit represents the full transcript, we would trust the mRNA more for ascertaining the potential splice sites. In addition, when we examine the blastx alignment (Figure 2) from 2943 to 2347 in the query, we note that blastx might have extended the alignment too far, as there are a few stop codons at the beginning of the alignment (Figure 6).

Figure 6. blastx alignment may have extended too far in region 2

There are also additional matches, in much smaller fragments, that are clustered at the beginning of our query sequence. Two main explanations could account for these additional matches in blastn relative to blastx.

First, blastx is more sensitive to insertions and deletions (indels) than blastn. An indel in a blastn alignment will only incur a small gap penalty. However, since blastx translates the query into amino acid sequences prior to the alignment, an indel can cause a frameshift that leads to alignments with stop codons. A residue aligned with a stop codon is heavily penalized and will quickly terminate the alignment.

Second, random nucleotide matches are more probable than random amino acid sequence matches. Assuming a uniform independent, identically distributed (IID) model, the probability of finding a random nucleotide match of length N is (1/4)^N. However, the probability of finding a random amino acid match of length N is approximately (1/20)^N. Hence we are more likely to detect local regions of similarity (spurious matches) when we compare nucleotide sequences.

Question 15:

Figure 7. Comparison of RefSeq hits to region 2 (top) versus region 1 (bottom) of query sequence

The blastn RefSeq hits to region 1 are significantly worse than the hits to region 2. We find multiple (approximately 7) gaps relative to the RefSeq mRNA sequence (Figure 7). We see some evidence of potential frameshift mutations in this region (a 7-base gap between and a 1-base gap at base 1768 relative to the subject sequence) (Figure 8).
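Two of the quantitative points above can be checked with a few lines of code: the random match probabilities under the uniform IID model, and the observation that gaps of 7 and 1 bases would shift the reading frame. The sketch below is a minimal illustration; the function names are hypothetical, and the word length of 11 used in the comparison is an assumption chosen for the example.

```python
def random_match_probability(alphabet_size: int, length: int) -> float:
    """Probability of a specific random match of given length (uniform IID)."""
    return (1 / alphabet_size) ** length


def causes_frameshift(gap_length: int) -> bool:
    """A gap shifts the reading frame unless its length is a multiple of 3."""
    return gap_length % 3 != 0


# Compare nucleotide (4 letters) and amino acid (20 letters) matches of the
# same length. The ratio of the two probabilities is 5**N.
N = 11  # assumed length, for illustration only
print(random_match_probability(4, N))
print(random_match_probability(20, N))

# The 7-base and 1-base gaps reported for region 1 are not multiples of 3,
# so either one would disrupt the downstream reading frame.
for gap in (7, 1, 3):
    print(gap, causes_frameshift(gap))
```

The frameshift check treats each gap independently; two gaps whose lengths sum to a multiple of 3 would restore the frame downstream of the second gap.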

Figure 8. Gaps in alignments of region 1

We also notice multiple gaps in the blastx alignment. The overall quality of the alignments is much worse in region 1 than in region 2. In addition, we notice three stop codons (*) in the HSP from 1499 to 456 of the query ( of subject) (Figure 9).

Figure 9. Presence of stop codons (*) indicative of pseudogene in region 1 of our query sequence

Based on the available data, we conclude that region 1 probably represents a pseudogene that is derived from the SWALLOW gene.

Question 16: I would annotate two features in this region: a gene in region 2 and a pseudogene in region 1. The gene in region 2 is probably orthologous to the SWALLOW gene in D. melanogaster and has 3 exons (2 introns). Potential UTRs and a more precise definition of the splice sites in region 2 can be determined using the mRNA blastn alignment. The blastn alignment suggests the presence of 7 bp and 1 bp gaps in region 1 relative to the mRNA sequence. These deletions could cause frameshift mutations. Furthermore, examination of the blastx alignment indicates the presence of stop codons in the reading frame of region 1. Hence our evidence suggests that region 1 is a pseudogene derived from the SWALLOW gene.

Last Update: 06/15/2006


The C-value paradox. PHAR2811: Genome Organisation. Is there too much DNA? C o t plots. What do these life forms have in common? PHAR2811: enome Organisation Synopsis: -value paradox, different classes of DA, repetitive DA and disease. If protein-coding portions of the human genome make up only 1.5% what is the rest doing? The -value

More information

Finishing of DELE Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina

Finishing of DELE Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina Sarah Swiezy Dr. Elgin, Dr. Shaffer Bio 434W 27 February 2015 Finishing of DELE8596009 Abstract Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina technology. DELE8596009,

More information

Additional Practice Problems for Reading Period

Additional Practice Problems for Reading Period BS 50 Genetics and Genomics Reading Period Additional Practice Problems for Reading Period Question 1. In patients with a particular type of leukemia, their leukemic B lymphocytes display a translocation

More information

Sequence Databases and database scanning

Sequence Databases and database scanning Sequence Databases and database scanning Marjolein Thunnissen Lund, 2012 Types of databases: Primary sequence databases (proteins and nucleic acids). Composite protein sequence databases. Secondary databases.

More information

Lecture 20: Drosophila melanogaster

Lecture 20: Drosophila melanogaster Lecture 20: Drosophila melanogaster Model organisms Polytene chromosome Life cycle P elements and transformation Embryogenesis Read textbook: 732-744; Fig. 20.4; 20.10; 20.15-26 www.mhhe.com/hartwell3

More information

Gene-centered resources at NCBI

Gene-centered resources at NCBI COURSE OF BIOINFORMATICS a.a. 2014-2015 Gene-centered resources at NCBI We searched Accession Number: M60495 AT NCBI Nucleotide Gene has been implemented at NCBI to organize information about genes, serving

More information

MOLECULAR GENETICS PROTEIN SYNTHESIS. Molecular Genetics Activity #2 page 1

MOLECULAR GENETICS PROTEIN SYNTHESIS. Molecular Genetics Activity #2 page 1 AP BIOLOGY MOLECULAR GENETICS ACTIVITY #2 NAME DATE HOUR PROTEIN SYNTHESIS Molecular Genetics Activity #2 page 1 GENETIC CODE PROTEIN SYNTHESIS OVERVIEW Molecular Genetics Activity #2 page 2 PROTEIN SYNTHESIS

More information

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers Web resources: NCBI database: http://www.ncbi.nlm.nih.gov/ Ensembl database: http://useast.ensembl.org/index.html UCSC

More information

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004 Analysis of large deletions in human-chimp genomic alignments Erika Kvikstad BioInformatics I December 14, 2004 Outline Mutations, mutations, mutations Project overview Strategy: finding, classifying indels

More information

Guided tour to Ensembl

Guided tour to Ensembl Guided tour to Ensembl Introduction Introduction to the Ensembl project Walk-through of the browser Variations and Functional Genomics Comparative Genomics BioMart Ensembl Genome browser http://www.ensembl.org

More information

Microbial Genetics. Chapter 8

Microbial Genetics. Chapter 8 Microbial Genetics Chapter 8 Structure and Function of Genetic Material Genome A cell s genetic information Chromosome Structures containing DNA that physically carry hereditary information Gene Segments

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Version 1.11.0 Related Documents... 1 Changes to Version 1.11.0... 2 Changes to Version 1.10.0... 6 Changes to Version 1.9.0... 10 Changes

More information

Analysis of Biological Sequences SPH

Analysis of Biological Sequences SPH Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,

More information

Understanding Genes & Mutations. John A Phillips III May 16, 2005

Understanding Genes & Mutations. John A Phillips III May 16, 2005 Understanding Genes & Mutations John A Phillips III May 16, 2005 Learning Objectives Understand gene structure Become familiar with genetic & mutation databases Be able to find information on genetic variation

More information

Pre-Lab Questions. 1. Use the following data to construct a cladogram of the major plant groups.

Pre-Lab Questions. 1. Use the following data to construct a cladogram of the major plant groups. Pre-Lab Questions Name: 1. Use the following data to construct a cladogram of the major plant groups. Table 1: Characteristics of Major Plant Groups Organism Vascular Flowers Seeds Tissue Mosses 0 0 0

More information

BLASTing through the kingdom of life

BLASTing through the kingdom of life Information for teachers Description: In this activity, students copy unknown DNA sequences and use them to search GenBank, the main database of nucleotide sequences at the National Center for Biotechnology

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Genome Annotation. Stefan Prost 1. May 27th, States of America. Genome Annotation

Genome Annotation. Stefan Prost 1. May 27th, States of America. Genome Annotation Genome Annotation Stefan Prost 1 1 Department of Integrative Biology, University of California, Berkeley, United States of America May 27th, 2015 Outline Genome Annotation 1 Repeat Annotation 2 Repeat

More information

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks Introduction to Bioinformatics CPSC 265 Thanks to Jonathan Pevsner, Ph.D. Textbooks Johnathan Pevsner, who I stole most of these slides from (thanks!) has written a textbook, Bioinformatics and Functional

More information

Create a model to simulate the process by which a protein is produced, and how a mutation can impact a protein s function.

Create a model to simulate the process by which a protein is produced, and how a mutation can impact a protein s function. HASPI Medical Biology Lab 0 Purpose Create a model to simulate the process by which a protein is produced, and how a mutation can impact a protein s function. Background http://mssdbio.weebly.com/uploads/1//7/6/17618/970_orig.jpg

More information

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important! Themes: RNA is very versatile! RNA and RNA Processing Chapter 14 RNA-RNA interactions are very important! Prokaryotes and Eukaryotes have many important differences. Messenger RNA (mrna) Carries genetic

More information

Unit 6: Molecular Genetics & DNA Technology Guided Reading Questions (100 pts total)

Unit 6: Molecular Genetics & DNA Technology Guided Reading Questions (100 pts total) Name: AP Biology Biology, Campbell and Reece, 7th Edition Adapted from chapter reading guides originally created by Lynn Miriello Chapter 16 The Molecular Basis of Inheritance Unit 6: Molecular Genetics

More information

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading: 132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, 214 1 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading: Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 155 12 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

Genie Gene Finding in Drosophila melanogaster

Genie Gene Finding in Drosophila melanogaster Methods Gene Finding in Drosophila melanogaster Martin G. Reese, 1,2,4 David Kulp, 2 Hari Tammana, 2 and David Haussler 2,3 1 Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology,

More information

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome Lectures 30 and 31 Genome analysis I. Genome analysis A. two general areas 1. structural 2. functional B. genome projects a status report 1. 1 st sequenced: several viral genomes 2. mitochondria and chloroplasts

More information

Theoretische Biologie

Theoretische Biologie Theoretische Biologie Prof. Computational EvoDevo, University of Leipzig SS 2017 Two Gene Concepts in Comparison Gerstein-Snyder gene definition Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel

More information

Gene mutation and DNA polymorphism

Gene mutation and DNA polymorphism Gene mutation and DNA polymorphism Outline of this chapter Gene Mutation DNA Polymorphism Gene Mutation Definition Major Types Definition A gene mutation is a change in the nucleotide sequence that composes

More information

Genes and Proteins in Health. and Disease

Genes and Proteins in Health. and Disease Genes and Health and I can describe the structure of proteins All proteins contain the chemical elements Carbon, Hydrogen, Oxygen and Nitrogen. Some also contain sulphur. Proteins are built from subunits

More information

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. Gene expression Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. In a protein, the sequence of amino acid determines its which determines the protein s A protein with an enzymatic

More information

Why Use BLAST? David Form - August 15,

Why Use BLAST? David Form - August 15, Wolbachia Workshop 2017 Bioinformatics BLAST Basic Local Alignment Search Tool Finding Model Organisms for Study of Disease Can yeast be used as a model organism to study cystic fibrosis? BLAST Why Use

More information

Lecture #8 2/4/02 Dr. Kopeny

Lecture #8 2/4/02 Dr. Kopeny Lecture #8 2/4/02 Dr. Kopeny Lecture VI: Molecular and Genomic Evolution EVOLUTIONARY GENOMICS: The Ups and Downs of Evolution Dennis Normile ATAMI, JAPAN--Some 200 geneticists came together last month

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

BIO 311C Spring Lecture 36 Wednesday 28 Apr.

BIO 311C Spring Lecture 36 Wednesday 28 Apr. BIO 311C Spring 2010 1 Lecture 36 Wednesday 28 Apr. Synthesis of a Polypeptide Chain 5 direction of ribosome movement along the mrna 3 ribosome mrna NH 2 polypeptide chain direction of mrna movement through

More information

The goal of this project was to prepare the DEUG contig which covers the

The goal of this project was to prepare the DEUG contig which covers the Prakash 1 Jaya Prakash Dr. Elgin, Dr. Shaffer Biology 434W 10 February 2017 Finishing of DEUG4927010 Abstract The goal of this project was to prepare the DEUG4927010 contig which covers the terminal 99,279

More information

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014 Single Nucleotide Variant Analysis H3ABioNet May 14, 2014 Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools GATK VCF File Format Let s call variants! Single Nucleotide

More information

Genomic region (ENCODE) Gene definitions

Genomic region (ENCODE) Gene definitions DNA From genes to proteins Bioinformatics Methods RNA PROMOTER ELEMENTS TRANSCRIPTION Iosif Vaisman mrna SPLICE SITES SPLICING Email: ivaisman@gmu.edu START CODON STOP CODON TRANSLATION PROTEIN From genes

More information

Basic Bioinformatics: Homology, Sequence Alignment,

Basic Bioinformatics: Homology, Sequence Alignment, Basic Bioinformatics: Homology, Sequence Alignment, and BLAST William S. Sanders Institute for Genomics, Biocomputing, and Biotechnology (IGBB) High Performance Computing Collaboratory (HPC 2 ) Mississippi

More information

The use of bioinformatic analysis in support of HGT from plants to microorganisms. Meeting with applicants Parma, 26 November 2015

The use of bioinformatic analysis in support of HGT from plants to microorganisms. Meeting with applicants Parma, 26 November 2015 The use of bioinformatic analysis in support of HGT from plants to microorganisms Meeting with applicants Parma, 26 November 2015 WHY WE NEED TO CONSIDER HGT IN GM PLANT RA Directive 2001/18/EC As general

More information

Protein Synthesis: Transcription and Translation

Protein Synthesis: Transcription and Translation Review Protein Synthesis: Transcription and Translation Central Dogma of Molecular Biology Protein synthesis requires two steps: transcription and translation. DNA contains codes Three bases in DNA code

More information

32 Gene regulation in Eukaryotes Lecture Outline 11/28/05. Gene Regulation in Prokaryotes and Eukarykotes

32 Gene regulation in Eukaryotes Lecture Outline 11/28/05. Gene Regulation in Prokaryotes and Eukarykotes 3 Gene regulation in Eukaryotes Lecture Outline /8/05 Gene regulation in eukaryotes Chromatin remodeling More kinds of control elements Promoters, Enhancers, and Silencers Combinatorial control Cell-specific

More information

Combined Evidence Annotation of Transposable Elements in Genome Sequences

Combined Evidence Annotation of Transposable Elements in Genome Sequences Combined Evidence Annotation of Transposable Elements in Genome Sequences Hadi Quesneville 1*, Casey M. Bergman 2, Olivier Andrieu 1, Delphine Autard 1, Danielle Nouaud 1, Michael Ashburner 2, Dominique

More information

CH 17 :From Gene to Protein

CH 17 :From Gene to Protein CH 17 :From Gene to Protein Defining a gene gene gene Defining a gene is problematic because one gene can code for several protein products, some genes code only for RNA, two genes can overlap, and there

More information

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro Philip Morris International R&D, Philip Morris Products S.A., Neuchatel, Switzerland Introduction Nicotiana sylvestris

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Ch 17 Practice Questions MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) Garrod hypothesized that "inborn errors of metabolism" such as alkaptonuria

More information

BIOINFORMATICS IN BIOCHEMISTRY

BIOINFORMATICS IN BIOCHEMISTRY BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses on the analysis of molecular sequences (DNA, RNA, and

More information

Chapter 13. From DNA to Protein

Chapter 13. From DNA to Protein Chapter 13 From DNA to Protein Proteins All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequenceof a gene The Path From Genes to

More information

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly Supplementary Tables Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly Library Read length Raw data Filtered data insert size (bp) * Total Sequence depth Total Sequence

More information