Prokaryotic Annotation Pipeline SOP HGSC, Baylor College of Medicine

Similar documents
Gene Prediction Group

Gene Prediction Final Presentation

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Gene Prediction: Preliminary Results

Gene Prediction. Lab & Preliminary Results. Faction 2 Saturday, March 11, 2017

Functional Annotation - Faction 2 Background and Strategy

Applied bioinformatics in genomics

ab initio and Evidence-Based Gene Finding

Gene Prediction Background & Strategy Faction 2 February 22, 2017

Gene Prediction Background & Strategy. February 24, 2016

Bacterial Genome Annotation

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

BME 110 Midterm Examination

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Bioinformatic tools for metagenomic data analysis

What s New for School Year in Phage Genome Annotation

Tutorial for Stop codon reassignment in the wild

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Glossary of Commonly used Annotation Terms

Gene-centered resources at NCBI

COMPUTER RESOURCES II:

Probiotic Strain Isolated from the Vagina of Healthy Women

FUNCTIONAL ANNOTATION PRELIMINARY RESULTS. Compgenomics 2010

From Infection to Genbank

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

ELE4120 Bioinformatics. Tutorial 5

Small Genome Annotation and Data Management at TIGR

Analysis Report. Institution : Macrogen Japan Name : Macrogen Japan Order Number : 1501APB-0004 Sample Name : 8380 Type of Analysis : De novo assembly

Functional annotation of metagenomes

Sequence Based Function Annotation

Computational Genomics Final Presentation. BIOL 7210 Spring 2015

From assembled genome to annotated genome

Post-assembly Data Analysis

G4120: Introduction to Computational Biology

I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador

Why learn sequence database searching? Searching Molecular Databases with BLAST

Genome Annotation - 2. Qi Sun Bioinformatics Facility Cornell University

HMP Data Set Documentation

RiceGAAS: an automated annotation system and database for rice genome sequence

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Download the Lectin sequence output from

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Gene Identification in silico

Computational analysis of non-coding RNA. Andrew Uzilov BME110 Tue, Nov 16, 2010

Community-assisted genome annotation: The Pseudomonas example. Geoff Winsor, Simon Fraser University Burnaby (greater Vancouver), Canada

IGS Annota*on Engine and Manatee. Michelle Gwinn Giglio Pathway Tools Workshop October 2010

Genome sequence of Acinetobacter baumannii MDR-TJ

Biology 4100 Minor Assignment 1 January 19, 2007

Functional Annotation: Preliminary Results

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

Lecture 10. Ab initio gene finding

WSSP-10 Chapter 9 Determine ORF and BLASTP

G4120: Introduction to Computational Biology

Three-Way Comparison and Investigation of Annotated Halorhabdus utahensis Genome

RNA Genomics. BME 110: CompBio Tools Todd Lowe May 14, 2010

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

FUNCTIONAL BIOINFORMATICS

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Lecture 7 Motif Databases and Gene Finding

Data Retrieval from GenBank

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL

Bioinformatic analysis of phage AB3, a phikmv-like virus infecting Acinetobacter baumannii

Annotation and the analysis of annotation terms. Brian J. Knaus USDA Forest Service Pacific Northwest Research Station

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Textbook Reading Guidelines


GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition

A Robust Method for Finding the Automated Best Matched Genes Based on Grouping Similar Fragments of Large-Scale References for Genome Assembly

MAKER: An easy to use genome annotation pipeline. Carson Holt Yandell Lab Department of Human Genetics University of Utah

Chimp Sequence Annotation: Region 2_3

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

This software/database/presentation is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part

Functional analysis using EBI Metagenomics

Genome annotation & EST

Corynebacterium pseudotuberculosis genome sequencing: Final Report

NCBI web resources I: databases and Entrez

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Aaditya Khatri. Abstract

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Annotation. (Chapter 8)

Bioinformatics for Microbial Biology

NCBI & Other Genome Databases. BME 110/BIOL 181 CompBio Tools

Complete Genome Sequence of Pathogenic Bacterium

Practical Bioinformatics for Life Scientists. Week 14, Lecture 27. István Albert Bioinformatics Consulting Center Penn State

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

MetaGO: Predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

BIO 4342 Lecture on Repeats

Designing Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler

GENOME ANNOTATION INTRODUCTION TO CONCEPTS AND METHODS. Olivier GARSMEUR & Stéphanie SIDIBE-BOCS

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

NCBI Molecular Biology Resources

Title: Genome sequence of lineage III Listeria monocytogenes strain HCC23

Transcription:

1 Abstract A prokaryotic annotation pipeline was developed to automatically annotate draft and complete bacterial genomes. The protein coding genes in the genomes are predicted by the combination of Glimmer and GeneMark, while trna, rrna and other non-coding RNAs are predicted by trnascan, RNAmmer and Rfam, respectively. Conflict of gene predictions at each locus is resolved by empirical rules described in this SOPs. Gene product names are assigned based on sequence homology to known genes in the public databases. Functions of protein coding genes are also assigned according Pfam and COG and KEGG categories. 2 Introduction This SOP describes the prokaryotic annotation pipeline at the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine. This pipeline identifies protein coding sequences and RNA genes in complete and draft bacterial genome sequences, and assigns gene product names and functional annotation based on sequence homology to known genes. The main user of this pipeline is the automatic annotation pipeline for HMP bacterial reference genomes at the HGSC at Baylor College of Medicine. 3 Requirements 3.1 Data requirements A draft or complete bacterial genome sequence file in FASTA file format. The FASTA file can contain more than one FASTA sequence. 3.2 Software requirements Numerous software packages and data sets are required as referenced in the Procedure section. 3.3 Compute requirements This protocol requires a Linux or Unix operating system to execute. Linux or Unix compute clusters are also required. 4 Procedure 4.1 Fasta file formatting The DNA sequence file used for data generation from the pipeline must first be formatted correctly, by running though a program that confirms or corrects it to the FASTA specifications. If the input file is a multifasta file and a scaffold file in AGP format (see: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_specification.shtml ) is provided, the FASTA sequences will be linearized according to the scaffold file. If the input file is a multifasta file but no scaffold file is provided, the FASTA sequences will be concatenated into one Posted on the DACC website on July 10, 2009 by xqin Page 1 of 5

pseudo FASTA file with 200 Ns between the FASTA sequences. Sequence with fewer than 500 bp will not be annotated in the current version v1.10. 4.2 RNA prediction We apply the following three RNA prediction tools to the entire genome sequence: trnascan, RNAmmer, and RFAM/infernal. Together these tools provide highly accurate predictions for trnas, rrnas, and other non-coding RNA genes. The software version and parameters used for RNA predictions: Software Version Parameter Prediction trnascan 1.11 default parameters, -P option for trna prokaryotic sequences RNAmmer 1.2 default parameters rrna Rfam infernal-0.7 default parameters non-coding RNA The programs can be downloaded from the following websites: ftp://selab.janelia.org/pub/software/trnascan-se/ http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?rnammer http://rfam.sanger.ac.uk/help?tab=helpftpblock 4.3 Gene prediction programs Glimmer and GeneMark are used for ab initio protein coding gene prediction. The software version and parameters used are as the following: Software Version Parameter Prediction Glimmer 3.01 --gene_len=90 --max_olap=200 --codontable=11 --linear GeneMark 2.4 default parameters Use closest related genome as model or use heuristic model The programs can be downloaded from the following websites: http://www.cbcb.umd.edu/software/glimmer/ http://opal.biology.gatech.edu/genemark/ ab initio protein coding gene prediction ab initio protein coding gene prediction 4.4 BLAST BLAST is used for sequence similarity search of known genes using the NCBI NR and Pfam databases. The BLAST results are used as evidence in the small ORF filtering process and in resolving conflicts involving multiple gene predictions at one locus. BLAST is also used to find Posted on the DACC website on July 10, 2009 by xqin Page 2 of 5

missing genes in ab initia gene predictions, and to name gene products and assign functional annotations to protein coding genes. BLAST parameters used in the pipeline: Program Parameter Database BLASTx -F F -e 1e-10 NR BLASTp -F F -e 1e-5 Pfam BLAST programs and the NCBI NR and Pfam databases can be downloaded from these websites: ftp://ftp.ncbi.nih.gov/blast/executables/release/ ftp://ftp.ncbi.nih.gov/blast/db/ http://pfam.janelia.org/ 4.5 Gene calling The best gene call at each locus is determined by the following empirical rules: 4.51 Minimum length cutoff for protein coding ORFs ORF with BLAST evidence ORF without BLAST evidence 60 bp 120 bp 4.52 Protein coding ORFs in the same reading frame at the same locus ORFs in the same reading frame with different length at the same locus are resolved based on the best BLAST evidence. When there is not BLAST evidence for both ORFs, the longest ORF will be chosen. If all the BLAST evidence has the same best score, the longest ORF will be kept. 4.53 Overlaps between protein coding ORFs and functional RNAs ORFs within functional RNAs are excluded. ORFs without evidence that contain functional RNAs are excluded. ORFs with evidence that overlap functional RNAs by less than 30% of the ORF length are allowed. ORFs with evidence that overlap functional RNAs by more than 30% of the ORF length are excluded. 4.54 Overlaps between two protein coding ORFs If both ORFs do not have BLAST evidence, keep the longest ORF. If only one ORF has BLAST evidence, keep the ORF with evidence. If both ORFs have BLAST evidences and the overlap is less than 30% of the length of the shorter ORF and less than 200 bp, keep both ORFs. Otherwise, keep the longer ORF. An ORF within another ORF will be excluded. 4.55 Pseudogene and frameshift detection This version does not have automatic frame shift or pseudogene detection. This function will be added in the future. Posted on the DACC website on July 10, 2009 by xqin Page 3 of 5

4.6 Gene product naming and other features 4.61 Gene product naming Gene products are named based on BLAST matches to known genes in HGSC curated microbial database or NR database using a 30% identity and 60 % subject length cutoff. ORFs with matches to known genes below the cutoffs will be named with possible. ORFs with matches to hypothetical proteins will be named as conserved hypothetical protein. 4.62 EC number and gene name assignment EC number and gene name assignments are based on the BLAST matches to NR database and Swissprot Enzyme nomenclature database (http://ca.expasy.org/enzyme/) 4.63 Controlled vocabulary At the HGSC we currently use in-house programs to check gene product names in order to make the gene product names as consistent as possible. 4.7 Gene functional annotations Program Parameter cut-off Database Purpose HMMER -e 1e-5 Pfam database Pfam/InterProScan BLASTp -e 1e-10 COG database March 2003 COG number 25% identity 50% subject length BLASTp -e 1e-5 KEGG database Feb. 2009 KEGG number 30 % identity 50% subject length InterProScan default parameters Pfam database InterPro number The programs and databases can be downloaded from the following websites: http://hmmer.janelia.org/ http://pfam.janelia.org/ http://www.genome.jp/kegg/ http://www.ebi.ac.uk/tools/interproscan/ ftp://ftp.ncbi.nih.gov/pub/cog/cog/ 4 Implementation This is a SIGS requirement; to be included in HMP SOPs when applicable. 5 Discussion This is a SIGS requirement, but optional for HMP SOP submission. Posted on the DACC website on July 10, 2009 by xqin Page 4 of 5

6 Related Documents & References This is a SIGS requirement; to be included in HMP SOPs when applicable SIGS Guideline: Software tools and data sets should be referenced to a publication (or alternatively a project web site) whenever possible. Technical references that are internal to an institution and not public should be excluded. Examples include intranet URLs, file system paths, and computer server names that are not publicly accessible on the Internet. 7 Revision History This is an HMP_specific requirement, not included in the SIGS submission. Please be sure to update this when any changes as made, to help the DACC organize SOPs. Version Author/Reviewer Date Change Made 1.01 Xiang Qin, Kim Worley, Lan Zhang 7/10/2009 Establish SOP Posted on the DACC website on July 10, 2009 by xqin Page 5 of 5