Gene Annotation Project. Group 1. Tyler Tiede, Yanzhu Ji, Jenae Skelton

Outline
- Tools
- Overview of 150 kb region
- Overview of annotation process
- Characterization of 5 putative gene regions
- Analysis of masked regions

Annotation Tools
- Sequence analyses: EMBOSS tools, dot plot, word frequency, RepeatMasker, nucleotide density, and CpG island
- Gene predictions: FGeneSH, AUGUSTUS, GeneMark. In the end only FGeneSH and GeneMark were used because AUGUSTUS did not add information.
- Alignments: NCBI, TIGR, Gramene (blastn, blastx, blastp)
- Genome viewer: MaizeGDB

Gramene BLAST result
- The 150 kb region maps to Chr8:138800001..138950000 in the maize reference genome

Intrinsic Sequence Analysis
- GC content: 49.77%
- 99,668 bp (66.45%) of bases masked
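The two figures on this slide can be recomputed from the raw and the masked sequence. The following is a minimal sketch, not the EMBOSS or RepeatMasker implementation; the function names are hypothetical, and it assumes masked bases appear either as N/X (hard masking) or as lowercase letters (soft masking).

```python
def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T), case-insensitive."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    acgt = gc + s.count("A") + s.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def masked_fraction(masked_seq: str) -> float:
    """Percent of bases that are masked (N/X or lowercase)."""
    if not masked_seq:
        return 0.0
    masked = sum(1 for b in masked_seq if b in "NnXx" or b.islower())
    return 100.0 * masked / len(masked_seq)


# Consistency check on the slide's numbers: Chr8:138800001..138950000
# spans 150,000 bp, and 99,668 masked bases is 66.45% of that.
region_length = 138_950_000 - 138_800_001 + 1
print(region_length, round(100.0 * 99_668 / region_length, 2))
```

The last line only checks that the reported masked base count and percentage agree with a 150,000 bp region; it does not reproduce the masking itself.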

Gene 1
- 4 exons
- Reverse strand
- 844 bp coding sequence

Gene Model
Exon  Start         End           Exon Length
4     45029 (1022)  45349 (1342)  320
3     45431 (1424)  45593 (1586)  160
2     45673 (1666)  45796 (1789)  123
1     45862 (1855)  46100 (2093)  238

Evidence for Start
Exon 4: Gene1:EST1.5; 37375054
Exon 3: "Exon 4" Prediction; Gene1:cDNA1.4; Gene1:cDNA3.4; Gene1:cDNA4.4
Exon 2: Gene1:cDNA1.0; Gene1:cDNA2.0; Gene1:cDNA3.0; Gene1:cDNA4.0
Exon 1: "Exon 2" Prediction; Gene1:cDNA2.2; Gene1:cDNA4.2

Evidence for End
Exon 4: Gene1:cDNA1.5; Gene1:cDNA3.5; "Exon 5" Predicted End
Exon 3: "Exon 4" Prediction; Gene1:cDNA1.4; Gene1:cDNA3.4; Gene1:cDNA4.4
Exon 2: Gene1:cDNA2.0; Gene1:cDNA3.0; Gene1:cDNA4.0
Exon 1: Gene1:cDNA2.2; 211384078; and weak support by Gene1:EST1.2

              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     -       46157   46173
2     -       45862   45979
3     -       45649   45656
4     -       45431   45593       45431   45550
5     -       45230   45349       45230   45349

Coding sequence of gene model

Gene 1 cont.
- Predicted exons 1 and 3 supported by EST and cDNA
- Exon 2 not predicted by either software
- Predicted exon 4 partially supported by EST and cDNA
- Overall, expression supported by ESTs in MaizeGDB and NCBI

cDNA/EST summary from NCBI blastn
Accession ID    Query          Range      Relation to Predicted Exons  % Match  E-value
gb|FL442439.1   Gene1:cDNA1.5  1031-1343  5                            99       6e-156
                Gene1:cDNA1.4  1423-1585  4                            98       e-72
                Gene1:cDNA1.0  1667-1767  -                            99       2e-41
gb|FL471335.1   Gene1:cDNA2.2  1855-2094  2                            95       e-102
                Gene1:cDNA2.0  1667-1790  -                            98       e-52
                Gene1:cDNA2.4  1433-1537  4                            98       e-42
gb|FK984278.1   Gene1:cDNA3.4  1423-1585  4                            98       e-72
                Gene1:cDNA3.5  1194-1343  5                            97       2e-61
                Gene1:cDNA3.0  1667-1790  -                            98       e-52
gb|CO446956.1   Gene1:cDNA4.4  1423-1585  4                            96       e-65
                Gene1:cDNA4.0  1667-1790  -                            98       7e-51
                Gene1:cDNA4.2  1855-1977  2                            97       e-48
TA216465_4577   Gene1:EST1.5   1023-1361  5                            96       6.1e-122
                Gene1:EST1.0   1662-1794  -                            95       1.3e-112
                Gene1:EST1.2   1849-2032  2                            83       6.1e-122
                Gene1:EST1.1   2115-2441  1                            63       1.9e-114
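To rank hits like those in the cDNA/EST summary, the E-values first have to be turned into numbers. The sketch below is illustrative only: the function names and the thresholds (95% identity, E-value at most 1e-50) are assumptions, not cutoffs used by the group. It accepts the shorthand seen in BLAST reports, where a bare exponent such as e-72 means 1e-72, with or without a caret.

```python
def parse_evalue(text: str) -> float:
    """Parse E-values written as '6e-156', '6e^-156', 'e-72' or '0'."""
    t = text.strip().replace("^", "").replace(" ", "")
    if t.lower().startswith("e"):
        t = "1" + t  # bare exponent: 'e-72' means 1e-72
    return float(t)


def strong_hits(hits, min_identity=95.0, max_evalue=1e-50):
    """Keep hits meeting both thresholds (thresholds are illustrative)."""
    return [
        name
        for name, identity, evalue in hits
        if identity >= min_identity and parse_evalue(evalue) <= max_evalue
    ]


# Three rows taken from the Gene 1 summary table: (query, % match, E-value).
hits = [
    ("Gene1:cDNA1.5", 99, "6e^-156"),
    ("Gene1:cDNA1.0", 99, "2e^-41"),
    ("Gene1:EST1.1", 63, "1.9e^-114"),
]
print(strong_hits(hits))
```

With these assumed thresholds only Gene1:cDNA1.5 passes: Gene1:cDNA1.0 has a weaker E-value and Gene1:EST1.1 has low identity despite its small E-value.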

Gene 1 cont.
NCBI blastx with model sequence
NCBI blastp with FGeneSH predicted protein as query
- The 4-exon gene model is better supported by cDNA and EST than the FGeneSH prediction
- Expression supported by ESTs and some cDNA
- blastx: highest hit is a 64% match (e-30) to a hypothetical protein; lesser hits also include hypothetical proteins
- blastp of the FGeneSH predicted amino acid sequence yielded worse results (E-values > 2)
- tblastx of the model coding sequence provided no results
- Conclusion: the region codes for an ncRNA or a novel protein not yet characterized

Gene 2
- 3 exon model
- Forward strand
- 1248 bp coding sequence
- Possible homolog to candidate gene: 1-aminocyclopropane-1-carboxylate oxidase 1 (1268 bp)

Gene Model
Exon  Exon Start   Exon Stop     Exon Length
1     61225 (35)   61426 (235)   201
2     61545 (355)  61789 (599)   244
3     61911 (721)  62714 (1524)  803

Evidence for Start
Exon 1: Gene2:mRNAcds1.1
Exon 2: Gene2:mRNAcds1.2; Gene Predictions
Exon 3: Gene2:mRNAcds1.3

Evidence for Stop
Exon 1: Gene2:mRNAcds1.1; Gene Predictions; 195627159 from MaizeGDB
Exon 2: Gene2:mRNAcds1.2; Gene Predictions; 195627159 from MaizeGDB
Exon 3: Gene2:mRNAcds1.3; Gene Predictions; 195627159 from MaizeGDB

              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     +       61318   61426       61318   61425
2     +       61545   61789       61545   61789
3     +       61911   62499       61911   62499

FGeneSH predicted coding sequence, 942 bp:
ATGGAGATTCCGGTGATCGATCTCGGCGGCCTCAACGGCGGCGGCGAGGAGAG
GTCGCGGACCTTGGCGGAGCTCCACGACGCCTGCAAGGACTGGGGCTTCTTCTG
GGTGGAGAACCACGGCGTGGACGCGCCGCTGATGGACGAGGTCAAGCGCTTCG
TCTACGGCCACTACGAGGAGCACCTGGAGGCCAAGTTCTACGCCTCCGCCCTCG
CCATGGACCTCGAGGCCGCCACCAGAGGTGACACTGATGAGAAGCCCTCCGAC
GAGGTGGACTGGGAGTCCACCTACTTCATCCAGCACCACCCCAAGACCAACGTC
GCCGACTTCCCAGAGATCACGCCGCCGACACGAGAGACGCTGGACGCGTACGT
CGCGCAGATGGTGTCCCTCGCGGAGCGTCTGGCCGAGTGCATGAGCCTCAACCT
GGGCCTCCCCGGGGCCCACGTCGCCGCCACCTTCGCGCCGCCGTTCGTGGGCAC
CAAGTTCGCCATGTACCCGTCCTGCCCGCGCCCGGAGCTGGTGTGGGGCCTGCG
CGCGCACACCGACGCCGGCGGCATCATCCTGCTCCTCCAGGACGACGTCGTGGG
CGGCCTCGAGTTCCTCAGGGCCGGCGCCCACTGGGTCCCCGTCGGCCCCACCAA
GGGGGGCAGGCTCTTCGTCAACATCGGGGACCAGATCGAGGTCCTCAGCGCCG
GCGCCTACCGGAGCGTCCTGCACCGCGTCGCGGCCGGGGACCAGGGCCGCCGC
CTGTCCGTGGCCACGTTCTACAACCCTGGCACCGACGCCGTGGTCGCGCCGGCG
CCCCGCAGGGATCAGGACGCCGGCGCCGCGGCGTACCCCGGTCCCTACAGGTTC
GGGGACTACCTCGACTACTACCAGGGCACCAAGTTCGGCGACAAGGACGCCAG
GTTCCAGGCCGTCAAGAAGCTGCTCGGCTAA
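A predicted coding sequence such as the FGeneSH one for Gene 2 can be given a quick structural sanity check before it is used as a BLAST query. This is a minimal sketch under the usual assumptions for a complete CDS (ATG start, a terminal stop codon, length divisible by three, no in-frame internal stop); the function name check_cds is hypothetical and the check is not part of FGeneSH or GeneMark.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> list[str]:
    """Return a list of problems found in a candidate coding sequence."""
    s = "".join(seq.split()).upper()  # drop whitespace left by line wrapping
    problems = []
    if len(s) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not s.startswith("ATG"):
        problems.append("does not start with ATG")
    # Split into complete codons only; a trailing partial codon is ignored here.
    codons = [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("contains an in-frame internal stop codon")
    return problems


# Toy inputs (not taken from the slides):
print(check_cds("ATG GCC TAA"))   # a clean three-codon CDS
print(check_cds("ATGTAAGCCTAA"))  # internal stop in frame
```

An empty list means no structural problem was detected; it says nothing about whether the exon boundaries themselves are correct.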

Gene 2 cont.
- High match (almost 100%, E-value essentially 0) to maize 1-aminocyclopropane-1-carboxylate oxidase 1
- Many ESTs (well over 10) align to the region, which suggests that gene 2 is expressed
- Many blastx and blastp alignments to the candidate gene in many other species; top 8 in the table below
- Gene 2 may be a homolog to the candidate gene

Gene Model Match to Candidate Gene
Accession ID: Gene ID: 100283053; 1-aminocyclopropane-1-carboxylate oxidase 1
Query             Range     Relation to Predicted Exons  % Match  E-value
Gene2:mRNAcds1.3  720-1524  3                            99       0
Gene2:mRNAcds1.2  354-599   2                            100      5e-124
Gene2:mRNAcds1.1  35-237    1                            100      4e-100

cDNA from MaizeGDB (below)
- 3 exons
- Coordinates: 35-1524

blastx results

Top Hits from blastp and blastx to 1-aminocyclopropane-1-carboxylate oxidase 1
Organism                         % Match  E-value
Arabidopsis thaliana             50%      2e-105
Clove pink                       43%      5e-86
Indian rice                      45%      5e-85
Japanese rice                    45%      3e-85
Kiwifruit                        43%      7e-85
Arabidopsis thaliana (L.) Heynh  43%      4e-83
Apple                            42%      e-82
Tomato                           42%      4e-82

Gene 3
- 8 or 9 exons; possible alternative splicing
- 4150 bp of coding sequence for model 1; 3589 bp for model 2
- Forward strand
- Candidate gene: lycopene epsilon cyclase 1 (lyce1)

              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start   End         Start   End
1     +       82817   83131       82817   83131
2     +       83224   83315       83270   83315
3     +       83394   83460       83420   83460
4     +       84163   84168       84163   84168
5     +       84287   84458       84287   84458
6     +       84568   84703       84568   84703
7     +       84808   85021       84919   85021
8     +       85172   85315       85406   85513
9     +       85406   85513       85611   85682
10    +       85586   85682       85986   86048
11    +       85759   85887       87022   87223
12    +       87029   87508       87302   87396
13    +       87665   88190       87494   87615
14    +       88289   89626       87686   88190
15    +       -       -           88289   89626

Exons 1-7 of models match MaizeGDB model, whose CDS is below:

Gene 3 cont.
- cDNA and EST support for expression of exons
- Potential alternative splicing

mRNA evidence      Start  End   Associated Predicted Exon(s) (FGSH/GM)  % Match  E-value

Accession ID: gb|BT037027.1; GENE ID: 100216601 LOC100216601
Gene3:cDNA1.14/15  5892   7376  14 and 15                               99       0
Gene3:cDNA1.13/14  5191   5794  13 and 14                               100      0

Accession ID: gb|BT067056.1
Gene3:cDNA2.14/15  5892   7386  14 and 15                               93       0
Gene3:cDNA2.13/14  5290   5794  13 and 14                               88       2e-164
Gene3:cDNA2.0/0    4448   4817  none and none                           87       6e-110
Gene3:cDNA2.0/13   5097   5218  none and 13                             94       e-42
Gene3:cDNA2.0/12   4905   5007  none and 12                             95       e-35

Accession ID: gb|BT063754.1; GENE ID: 100280448 lyce1
Gene3:cDNA3.11/0   3360   4245  11 and none                             100      0
Gene3:cDNA3.1/1    329    734   1 and 1                                 100      0
Gene3:cDNA3.7/7    2409   2626  7 and 7                                 100      e-107
Gene3:cDNA3.5/5    1888   2061  5 and 5                                 100      3e-83
Gene3:cDNA3.8/0    2773   2919  8 and none                              100      3e-68
Gene3:cDNA3.6/6    2169   2306  6 and 6                                 100      3e-63
Gene3:cDNA3.9/8    3007   3117  9 and 8                                 100      3e-48
Gene3:cDNA3.10/9   3187   3289  10 and 9                                99       4e-42

Accession ID: gb|EU924262.1 / lcye W22 allele (**B73 allele supports model)
Gene3:cDNA4.1/1    281    734   1 and 1                                 99       0
Gene3:cDNA4.7/7    2409   2626  7 and 7                                 100      e-107
Gene3:cDNA4.0/0    3828   4017  none and none                           100      4e-92
Gene3:cDNA4.5/5    1888   2061  5 and 5                                 100      3e-83
Gene3:cDNA4.8/0    2773   2919  8 and none                              100      3e-68
Gene3:cDNA4.6/6    2169   2306  6 and 6                                 100      3e-63
Gene3:cDNA4.0/10   3588   3722  none and 10                             100      e-61
Gene3:cDNA4.11/0   3360   3490  11 and none                             100      2e-59
Gene3:cDNA4.9/8    3007   3117  9 and 8                                 100      3e-48
Gene3:cDNA4.10/9   3187   3289  10 and 9                                99       4e-42

GENE ID: 100216601 LOC100216601 lycopene epsilon cyclase1 [Zea mays]

Gene 3 cont.
blastx using model 2 CDS as query
- When expanded, the "NADB_Rossmann superfamily" domains (blue bars) in all three reading frames line up exactly with the domains of lyce1.
- Model 1 is similar to model 2 except that the NADB_Rossmann domain is truncated at the 3' end.

blastx of MaizeGDB gene model
Organism              % Match  E-value
Arabidopsis thaliana  67%      0
Tomato                72%      0
Tobacco               38%      4e-89

blastp using MaizeGDB lyce1 protein sequence as query

Conclusion:
- blastp of the MaizeGDB lyce1 protein sequence resulted in a perfect match to Zea mays lcye1 (E-value = 0).
- The PKc_like superfamily domain at the 3' end of the model sequences suggests that exons 9 and 10 of model 1 (10 and 11 of model 2) can themselves form a separate gene model for a PKc_like superfamily protein.
- cDNA and EST evidence and a blastx match support the gene model and suggest that GENE ID: 100216601 (LOC100216601) may have been mistakenly named.

Gene 4
- 4 exons
- Forward strand
- 1820 bp coding sequence
- Expression and exon positions supported by cDNA and ESTs

              FGeneSH Prediction  GeneMark Prediction
Exon  Strand  Start  End   Size   Start  End   Size
1     +       -      -     -      1      130   130
2     +       537    738   201    509    738   229
3     +       1310   1570  260    1310   1570  260
4     +       1657   2547  890    1657   2547  890
5     +       3024   3493  469    3024   3493  469

EST evidence below

FGeneSH Model
Exon  Start          End            Size
1     109800 (537)   110001 (738)   201
2     110573 (1310)  110833 (1570)  260
3     110920 (1657)  111810 (2547)  890
4     112287 (3024)  112756 (3493)  469
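The Gene 4 tables list each exon both in coordinates local to the submitted query and in coordinates of the 150 kb region, and the two differ by a constant offset. The sketch below recomputes the regional coordinates and exon sizes from the FGeneSH local coordinates. The offset is inferred from the slide (109800 - 537), the variable names are hypothetical, and sizes follow the slide's convention of end minus start, which is one less than an inclusive base count.

```python
# Offset between query-local and regional coordinates, inferred from
# the first exon on the slide: 109800 (537).
OFFSET = 109_800 - 537

# FGeneSH exons 2-5 in local coordinates, as listed on the slide.
local_exons = [(537, 738), (1310, 1570), (1657, 2547), (3024, 3493)]

# Shift every exon into regional coordinates.
regional_exons = [(start + OFFSET, end + OFFSET) for start, end in local_exons]

# Exon sizes using the slide's convention (end - start).
sizes = [end - start for start, end in local_exons]

for (start, end), size in zip(regional_exons, sizes):
    print(start, end, size)
print("total:", sum(sizes))
```

The shifted coordinates reproduce the FGeneSH Model table, and the sizes sum to the 1820 bp reported for the coding sequence.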

Gene 4 cont.
- blastx of the model coding sequence results in a hit to a PKc_like superfamily domain
- 10+ blastx hits with E-values ranging from 6e-39 to 5e-43, ~35% identity
- cDNA hits, while strong matches, do not provide any additional information

cDNA hits

Gene 5
- 2 exon model
- Reverse strand
- 604 bp coding sequence

      FGeneSH                       GeneMark
Exon  Start          Stop           Start          Stop
1     142412 (753)   142456 (797)   -              -
2     142968 (1309)  143528 (1869)  142968 (1309)  143528 (1869)

- cDNA support for exon 2
- However, upstream, around 141,659-143,000, the query matches cDNA of transposons and cDNA and ESTs of random gene fragments
- cDNA and blastp (of FGeneSH predicted protein) match to the HLH superfamily (cDNA: 81%, 3e-123)
- According to NCBI, HLH is common in DNA-binding proteins such as transcription factors

Repetitive Region Validation
- Skip the regions with predicted genes
- Database search
  - DNA level: Maize TE database
  - Protein level: Swiss-Prot
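The first step, skipping masked regions that contain predicted genes, amounts to an interval overlap test. The following is a minimal sketch, not the group's actual procedure: the function names are hypothetical, the gene spans are approximate extents taken from the coordinates on the gene slides, and the third masked interval is invented purely to show a region being skipped.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def regions_without_genes(masked, genes):
    """Keep only masked regions that overlap no predicted gene."""
    return [m for m in masked if not any(overlaps(m, g) for g in genes)]


# Approximate gene extents from the gene slides (genes 1 to 5).
genes = [
    (45029, 46100),
    (61225, 62714),
    (82817, 89626),
    (109800, 112756),
    (142412, 143528),
]

masked = [
    (62729, 82396),    # Region 3 on the slides
    (113683, 138555),  # Region 5 on the slides
    (88000, 90000),    # hypothetical interval overlapping gene 3
]

print(regions_without_genes(masked, genes))
```

Only the two gene-free regions remain; these are the ones that would go on to the BLASTn search against the maize TE database and the BLASTx search against Swiss-Prot.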

BLASTn against maize TE database

Region 3: 62729-82396
BLASTx against Swiss-Prot
- 1-aminocyclopropane-1-carboxylate oxidase 1
- Transposon-related protein

Region 5: 113683-138555
- BLASTx: 30S ribosomal protein S4, chloroplast
- BLASTn (Maize TE database)

Region 5 cont.: 113683-138555
- BLASTx, nr

Summary
- 5 regions harboring genes predicted
  - 1 possible ncRNA coding region
  - 2 candidate gene hits
    - 1-aminocyclopropane-1-carboxylate oxidase 1 homolog
    - Lycopene epsilon cyclase 1, with a PKc_like superfamily domain slightly downstream
  - 1 PKc_like superfamily hit
  - 1 region likely resulting from helitron insertion; potential expression of a transcription factor
- Further analyses on repetitive regions support RepeatMasker results