Gene Prediction: Preliminary Results


Outline
- Preliminary Pipeline
- Programs
- Program Comparison
  - Tests
  - Metrics
- Gene Prediction Tools: Usage + Results
  - GeneMarkS
  - Glimmer 3.0
  - Prodigal
  - BLAST
- ncRNA Prediction Tools: Usage + Results
  - tRNAscan-SE
  - RNAmmer
  - Rfam
- Further Steps

Preliminary Pipeline

Programs Tested
- One program initially seemed redundant, but the test metrics suggest we should add it to our pipeline.
- EasyGene was tested at the web-interface stage, but we did not get to test it in depth before this presentation. We plan to do so.
- During initial research we steered clear of programs originally designed for non-bacterial genomes, since training is a major part of ab initio gene prediction.
- Since homology-based options were not available for our species, we decided it was best to focus on BLAST for homology-based gene prediction.

Log-Odds
Log-odds is a common scoring metric used for predicted genes.
- The 1st matrix gives the probability of each possible alignment pair occurring at random.
- The 2nd matrix gives the probability of each possible alignment pair in your query.
- The final matrix is the ratio 2nd/1st: how much more (or less) likely the pair is than expected at random.
- Take the log (usually ln) of each ratio: values between 0 and 1 become negative, values greater than 1 become positive.
http://www.bio.brandeis.edu/interpgenes/project/align16.htm
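The log-odds calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `log_odds` and the frequencies are invented for the example and do not come from the slides.

```python
import math

def log_odds(observed: float, expected: float, base: float = math.e) -> float:
    """Return log(observed / expected).

    Positive when the pair is seen more often than chance predicts,
    negative when it is seen less often, zero when the two agree.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(observed / expected, base)

# Hypothetical frequencies for two aligned-pair types (illustrative only).
background = {("A", "A"): 0.0625, ("A", "G"): 0.0625}   # chance: 0.25 * 0.25
observed = {("A", "A"): 0.125, ("A", "G"): 0.03125}

for pair in background:
    score = log_odds(observed[pair], background[pair])
    print(pair, round(score, 3))
```

The sign of the score does not depend on the base of the logarithm, so the same reading (positive = more likely than random) holds for ln, log2 or log10.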

Methods Used
- RefSeq FTP was used to obtain the annotation for FAM18.
- A .gff file was obtained with the start, stop, strand and type information for portions of the genome.

FAM18 summary
- Species: Neisseria meningitidis
- Serogroup: C

            Genes    CDS   Exon   tRNA   rRNA
Total        2046   1954     71     59     12
+ Strand     1035    982     34     31      3
- Strand     1011    972     37     28      9
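Counts by feature type and strand, like those in the summary above, can be tallied directly from a GFF file. The sketch below is a hedged illustration: it assumes the standard nine tab-separated GFF columns (type in column 3, strand in column 7), and the sample records are invented rather than taken from the FAM18 annotation.

```python
from collections import Counter

def count_features(gff_lines):
    """Count GFF records by (feature type, strand)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        counts[(fields[2], fields[6])] += 1
    return counts

# Hypothetical records (illustrative only).
sample = [
    "##gff-version 3",
    "contig1\tRefSeq\tgene\t100\t400\t.\t+\t.\tID=gene0",
    "contig1\tRefSeq\tCDS\t100\t400\t.\t+\t0\tID=cds0;Parent=gene0",
    "contig1\tRefSeq\tgene\t900\t1200\t.\t-\t.\tID=gene1",
    "contig1\tRefSeq\ttRNA\t1500\t1575\t.\t-\t.\tID=rna0",
]

counts = count_features(sample)
for (feature_type, strand), n in sorted(counts.items()):
    print(feature_type, strand, n)
```

Because the function only iterates over lines, an open file handle can be passed in place of the list.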

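Comparing Tools - Tests
Each tool's predicted set is compared with the annotated set:
- True Positive (TP): in both sets (rightly predicted)
- False Positive (FP): in the predicted set only (over-predicted)
- False Negative (FN): in the annotated set only (under-predicted)
- True Negative (TN): in neither set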
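Comparing a predicted set against the annotated set maps naturally onto set operations. The sketch below is a minimal illustration under one loudly stated assumption: a prediction counts as correct only when contig, strand, start and end all match an annotated gene exactly. The slides do not state the matching rule that was used, and the coordinates here are invented.

```python
def compare_sets(predicted, annotated):
    """Return (TP, FP, FN) counts for two collections of gene tuples.

    Assumption: genes match only on identical (contig, strand, start, end).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = predicted & annotated    # in both sets
    false_positives = predicted - annotated   # over-predicted
    false_negatives = annotated - predicted   # under-predicted
    return len(true_positives), len(false_positives), len(false_negatives)

# Hypothetical gene coordinates (illustrative only).
annotated = [
    ("contig1", "+", 100, 400),
    ("contig1", "-", 900, 1200),
    ("contig1", "+", 2000, 2300),
]
predicted = [
    ("contig1", "+", 100, 400),
    ("contig1", "-", 900, 1200),
    ("contig1", "+", 5000, 5300),
]

tp, fp, fn = compare_sets(predicted, annotated)
print("TP:", tp, "FP:", fp, "FN:", fn)
```

A looser rule, such as matching on the stop coordinate only, would change the counts; the choice of rule should be reported alongside the metrics.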

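Comparing Tools - Metrics
- Sensitivity = TP / (TP + FN): the fraction of annotated genes that are predicted (few false negatives).
- Precision = TP / (TP + FP): the fraction of predictions that are correct (few false positives).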

Summary

PROGRAM        Entries   Predicted (TP)   Unpredicted (FN)   Overpredicted (FP)
AMIGene           1098              956                 79                  142
GeneMarkS         2214             1672                374                  542
Prodigal          1024              941                 94                   83
tRNAscan-SE         59               33                 26                   26
Rfam               232               31                 40                  201

Summary

PROGRAM        Sensitivity   Precision
AMIGene               0.92        0.87
GeneMarkS             0.82        0.75
Prodigal              0.91        0.92
tRNAscan-SE           0.56        0.56
Rfam                  0.44        0.13
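The two metrics can be recomputed from the TP, FN and FP counts in the first summary table. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), which agree with the tabulated values to within rounding. The function and variable names are chosen for this example.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated genes that were predicted."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Fraction of predictions that match an annotated gene."""
    return tp / (tp + fp)

# (TP, FN, FP) per program, copied from the summary table.
counts = {
    "AMIGene": (956, 79, 142),
    "GeneMarkS": (1672, 374, 542),
    "Prodigal": (941, 94, 83),
    "tRNAscan-SE": (33, 26, 26),
    "Rfam": (31, 40, 201),
}

for program, (tp, fn, fp) in counts.items():
    print(program, round(sensitivity(tp, fn), 2), round(precision(tp, fp), 2))
```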

GeneMarkS Usage
- Sequence Type: Prokaryotic, Intronless Eukaryotic, Virus, Phage, EST/cDNA
- Output Format (-format): GFF/LST
- Omit Overlaps (-offover)
Add to path: /home/yasvanth3/gm/genemark_suite_linux_64/gmsuite/
Example: gmsn.pl -prok <inputfilename.fasta> -format <output format>

GeneMarkS Output
- GFF file
- Fields: Gene Name, Start, Stop, Gene ID, Length, Gene Score

Glimmer 3.0 Usage
Input file: sequence.fa
Add to path: /home/vvenkat6/bin/
4 steps:
  1. long-orfs -n sequence.fa sequence.orf
  2. extract sequence.fa sequence.orf > sequence.train
  3. build-icm -r sequence.icm < sequence.train
  4. glimmer3 sequence.fa sequence.icm out
Output files: i) out.predict ii) out.detail
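The four Glimmer steps can be chained from a script so that the intermediate files are named consistently. The sketch below is one possible wrapper, not part of Glimmer itself: the helper names `glimmer_steps` and `run_steps` are invented, the commands mirror the four shown above, and it assumes the Glimmer binaries are already on the PATH.

```python
import subprocess

def glimmer_steps(fasta, prefix):
    """Return the four steps as (argv, stdin file, stdout file) tuples."""
    orf, train, icm = f"{prefix}.orf", f"{prefix}.train", f"{prefix}.icm"
    return [
        (["long-orfs", "-n", fasta, orf], None, None),
        (["extract", fasta, orf], None, train),      # stdout -> training set
        (["build-icm", "-r", icm], train, None),     # stdin  <- training set
        (["glimmer3", fasta, icm, "out"], None, None),
    ]

def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()

# Show the commands without executing them.
for argv, stdin_path, stdout_path in glimmer_steps("sequence.fa", "sequence"):
    print(" ".join(argv), "<", stdin_path, ">", stdout_path)
```

Calling `run_steps(glimmer_steps("sequence.fa", "sequence"))` would execute the pipeline; it is left uncalled here because it needs the Glimmer installation.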

out.detail

out.predict

Prodigal
- Implements simple log-likelihood scoring functions, unlike the previous programs, which use complicated HMMs and IMMs
- Performs well for high-GC-content genomes
- Trade-off between the number of FPs and TPs

Command used:
  prodigal.linux -i input_file_name -o output_file_name -f output_format -d nucleotide_sequences_of_all_genes -a protein_sequences_of_all_genes -s potential_genes_with_scores
The mode can also be specified with the -p flag.
Output formats:
- gbk: GenBank-like format (default)
- gff: GFF format
- sqn: Sequin feature table format
- sco: Simple coordinate output
Total number of genes predicted in the CISA_all file: 645 + 771 = 1416
  grep -w "-" out_cisa_all.gff | wc -l
  645
  grep -w "+" out_cisa_all.gff | wc -l
  771

Output File Generated

BLAST
Step 1: Create the BLAST database
  makeblastdb -in FAM18.fasta -dbtype 'nucl' -out FAM18_db
  RESOURCES:
  MAKEBLAST: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/makeblastdb
  Input file: /home/rnagilla3/assignment_data/fam18.fasta
Step 2: Run blastn
  blastn -db FAM18_db -query CISA_all.fa -outfmt 6 -out BLAST_OUTPUT
  RESOURCES:
  BLASTN: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/blastn
  QUERY file: /home/yasvanth3/gm/cisa_all.fa

BLAST output format
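With -outfmt 6, blastn writes one tab-separated line per hit using its default twelve columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below shows one way to read that table and keep the best-scoring hit per query. The helper names and the sample lines are invented for illustration; the sample is not real output from the FAM18 search.

```python
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")

def parse_blast6(lines):
    """Parse default -outfmt 6 lines into a list of dicts."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) < len(BLAST6_FIELDS):
            continue  # skip incomplete rows
        hit = dict(zip(BLAST6_FIELDS, values))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits

def best_hit_per_query(hits):
    """Keep the highest-bitscore hit for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best

# Hypothetical hits (illustrative only).
sample = [
    "q1\tFAM18_ref\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t360",
    "q1\tFAM18_ref\t85.00\t100\t15\t0\t1\t100\t9001\t9100\t1e-20\t100",
    "q2\tFAM18_ref\t99.00\t150\t1\t0\t1\t150\t1200\t1051\t2e-70\t270",
]

for query, hit in best_hit_per_query(parse_blast6(sample)).items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"], hit["bitscore"])
```

A subject start greater than the subject end, as in the third sample line, indicates a hit on the reverse strand of the reference.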

Non-coding RNA Prediction

tRNAscan-SE command line
/home/tmi7/bin/trnascan-se
Some options:
- -B: search for bacterial tRNAs (use bacterial tRNA model)
- -C: search using Cove analysis only (slow, sensitive)
- -o: save tabular result to...
- -f: save tRNA secondary structures to...
- -m: save statistics summary to...

tRNAscan-SE result on CISA_all.fa

tRNAscan-SE result on CISA_all.fasta

RNAmmer 1.2
/home/akelley35/bin
rnammer [-S kingdom] [-m molecules] [-xml xml-file] [-gff gff-file] [-h hmmreport] [-f fasta-file] [sequence]
- -S: specifies the super kingdom of the input sequence: euk, bac or arc
- -m: molecule type: 'tsu' for 5S/8S rRNA, 'ssu' for 16S/18S rRNA, 'lsu' for 23S/28S rRNA, or any combination separated by commas
- -xml, -gff, -h, -f: the types of output generated

Infernal (Rfam)
Path to the installed files: /home/tmi7/bin/
Step 1: Create a CM database: download the flatfile from Rfam
Step 2: Compress and index the flatfile with cmpress
  cmpress <cmdb>
Step 3: Search the CM database with cmscan
  cmscan --noali -E <x> -o <f> <cmdb> <seqfile>
  --noali : don't output alignments
  -E <x>  : report sequences <= this E-value threshold in output
  -o <f>  : direct output to file <f>

Infernal (Rfam) Results

ncRNA prediction results

Prediction Tools        FAM18   CISA_all.fasta
RNAmmer                    12                3
tRNAscan-SE                59               37
Rfam cmscan (rRNA)         12                8
Rfam cmscan (tRNA)         62               46
Rfam cmscan (other)        22               25

Pipeline Changes
- AMIGene
- EasyGene?
- RSAT: lower priority. Adds non-coding regulatory portions of the genome, but we want to focus on coding portions first.

Challenges
- Use GenePRIMP to combine GeneMarkS and Prodigal results
- Find a way to combine outputs to minimize False Positives and False Negatives
  - Use a confidence system
  - Highest confidence: genes confirmed by all relevant outputs that do not contradict
- Resolve conflicting results
  - Use database references
  - Compare conflicts and pick the higher-scoring or more likely gene prediction
- How to determine likely vs unlikely theoretical genes (so as not to over-predict)
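The confidence system listed above could start from a simple tally of how many tools report each gene. The sketch below is a rough illustration, not the merging method the group adopted: it assumes two predictions are the same gene only when contig, strand, start and end are identical, the tier names are invented, and the coordinates are made up.

```python
from collections import defaultdict

def support_by_gene(predictions):
    """Map each gene to the sorted list of tools that report it.

    predictions: dict of tool name -> iterable of (contig, strand, start, end).
    """
    support = defaultdict(set)
    for tool, genes in predictions.items():
        for gene in genes:
            support[gene].add(tool)
    return {gene: sorted(tools) for gene, tools in support.items()}

def confidence_tiers(predictions):
    """Group genes by how many of the tools agree on them."""
    n_tools = len(predictions)
    tiers = {"all_tools": [], "some_tools": [], "single_tool": []}
    for gene, tools in sorted(support_by_gene(predictions).items()):
        if len(tools) == n_tools:
            tiers["all_tools"].append(gene)      # highest confidence
        elif len(tools) > 1:
            tiers["some_tools"].append(gene)
        else:
            tiers["single_tool"].append(gene)    # candidate false positive
    return tiers

# Hypothetical predictions (illustrative only).
predictions = {
    "GeneMarkS": [("contig1", "+", 100, 400), ("contig1", "-", 900, 1200),
                  ("contig1", "+", 2000, 2300)],
    "Prodigal": [("contig1", "+", 100, 400), ("contig1", "-", 900, 1200)],
    "Glimmer": [("contig1", "+", 100, 400)],
}

for tier, genes in confidence_tiers(predictions).items():
    print(tier, genes)
```

Genes that share a stop codon but differ in their predicted start would land in separate entries under this exact-match rule, which is one of the conflicts the list above says still has to be resolved.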

Further Steps
- Continue testing/comparing programs
- Schema for naming genes
  - Derive from GeneID, contig # and Sample ID
- Finalize method for merging results
  - i.e. investigate GenePRIMP (Gene Prediction Improvement Pipeline) for ideas
- Use metrics mentioned previously to filter results of individual programs
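A naming schema built from the Sample ID, contig number and GeneID, as proposed in the list above, might look like the following. The exact layout (separators, zero-padding widths, the `c` and `g` prefixes) and the sample ID are hypothetical, since the slides leave the schema undecided.

```python
def gene_name(sample_id: str, contig_number: int, gene_id: int) -> str:
    """Build a gene label from sample ID, contig number and gene ID.

    Hypothetical layout: <sample>_c<contig, 3 digits>_g<gene, 5 digits>.
    """
    if contig_number < 0 or gene_id < 0:
        raise ValueError("contig_number and gene_id must be non-negative")
    return f"{sample_id}_c{contig_number:03d}_g{gene_id:05d}"

# Hypothetical sample ID (illustrative only).
print(gene_name("M12345", 7, 42))
```

Zero-padding keeps the names sortable as plain text, so genes on the same contig list in order.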

References
1. Lagesen, Karin, et al. "RNAmmer: consistent and rapid annotation of ribosomal RNA genes." Nucleic Acids Research 35.9 (2007): 3100-3108.
2. Besemer, John, Alexandre Lomsadze, and Mark Borodovsky. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research 29.12 (2001): 2607-2618.
3. Burge, Sarah W., et al. "Rfam 11.0: 10 years of RNA families." Nucleic Acids Research (2012): gks1005.
4. Schattner, Peter, Angela N. Brooks, and Todd M. Lowe. "The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs." Nucleic Acids Research 33.suppl 2 (2005): W686-W689.
5. Delcher, Arthur L., et al. "Improved microbial gene prediction with GLIMMER." Nucleic Acids Research 27.23 (1999): 4636-4641.
6. Delcher, Arthur L., et al. "Identifying bacterial genes and endosymbiont DNA with Glimmer." Bioinformatics 23.6 (2007): 673-679.

LAB ACTIVITY!
- In class:
  - BLAST
  - GeneMarkS
  - Rfam
- Run the programs. Create a folder on your server account labeled GenePredHWOut and place the output files there.
- Parts done in groups: one person should email me before you leave with the files generated together, who worked together, and where I can find the output files.
- Write answers/observations in a text file marked appropriately. Email this to me (email below) with the subject Gene Prediction HW Answers.
  - Name the file FIRSTNAME_LASTNAME.txt
  - rachel.kutner06@gmail.com
Instructions and a grading scheme will be provided for those absent or in case you don't finish. Due next Friday at midnight.

Grading
- Attendance in class: 10%
- Completion of exercise: 30%
- Proper answers: 40% (each question is 5 points)
- Proper output files: 20%

BLAST
You will have to run BLAST using an unknown query sequence against a known reference database sequence. So you have to create a BLAST database with reference.fasta, BLAST the query against this database, and submit the results. Both the query files and the database reference files are located in /home/rnagilla3/assignment_data/
[ Please write to Roopa, roopareddynagilla@gatech.edu for any permission issues ]
MAKEBLAST: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/makeblastdb
BLASTN: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/blastn
Query sequence: query.fasta
Reference sequence: reference.fasta (N. meningitidis)
1. How is the output sorted?
2. What is the e-value and why is it significant?
3. Pick one of the top homologous sequences for FAM18. Which species do you think the sequence is most closely related to?

GeneMarkS
You will have to run GeneMarkS with the FAM18 fasta file. The FAM18 file can be found in /home/yasvanth3/fam18.fasta and GeneMarkS (gmsn.pl) can be run from /home/yasvanth3/gm/genemark_suite_linux_64/gmsuite/
Assume the species is unknown and use the appropriate command and parameters to produce a GFF file.
- Is RBS (Ribosomal Binding Site) True or False?
- Describe the format of the output and list the types of scores that are available.

Infernal
You will have to run the cmscan program for the FAM18 sequence against the Rfam CM database.
- The cmscan program is in /home/tmi7/bin/cmscan
- The FAM18 sequence can be found in /home/tmi7/fam18.fasta
- The Rfam CM database is in /home/tmi7/cms/rfam.cm
Use the no-alignment option (--noali), an E-value of 1E10 (-E), and save an output file (-o).
- How many cmscan hits did you get?
- How many ribosomal RNAs are there in your output? (There may be some redundant results.)