Gene Prediction: Preliminary Results

Outline
- Preliminary Pipeline
- Programs
- Program Comparison
  - Tests
  - Metrics
- Gene Prediction Tools: Usage + Results
  - GeneMarkS
  - Glimmer 3.0
  - Prodigal
  - BLAST
- ncRNA Prediction Tools: Usage + Results
  - tRNAscan-SE
  - RNAmmer
  - Rfam
- Further Steps

Preliminary Pipeline

Programs Tested
- One program looked redundant during initial research, but the test metrics suggest we should add it to our pipeline.
- EasyGene was tested only through its web interface; we did not get to test it in depth before this presentation. We plan to test it.
- During initial research, we wanted to steer clear of programs originally designed for non-bacterial genomes, since training is a major part of ab initio gene prediction.
- Since no homology-based options were available for our species, we decided it was best to focus on BLAST for homology-based gene prediction.

Log-Odds
Log-odds is a common scoring metric used for predicted genes.
- 1st matrix: probability of each possible alignment pair at random
- 2nd matrix: probability of each possible alignment pair in your query
- Final matrix: the ratio 2nd/1st, i.e. how much more likely a pair is than it would be at random
- Take the log (usually ln) of these ratios
  - any value between 0 and 1 becomes negative
  - any value greater than 1 becomes positive
http://www.bio.brandeis.edu/interpgenes/project/align16.htm
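The log-odds steps above can be sketched in a few lines of Python. The probabilities below are made-up illustrative values, not taken from any real scoring matrix, and the function name `log_odds` is hypothetical.

```python
import math

def log_odds(observed, background):
    """Return ln(observed / background) for every pair in the observed table."""
    return {pair: math.log(observed[pair] / background[pair]) for pair in observed}

# Illustrative (made-up) probabilities for two alignment pairs.
background = {("A", "A"): 0.0625, ("A", "G"): 0.0625}   # 1st matrix: pairs at random
observed = {("A", "A"): 0.25, ("A", "G"): 0.015625}     # 2nd matrix: pairs in the query

scores = log_odds(observed, background)
for pair, score in scores.items():
    # Ratios above 1 give positive scores, ratios below 1 give negative scores.
    print(pair, round(score, 3))
```

Here the A/A pair is four times more frequent than at random and scores positive, while the A/G pair is four times less frequent and scores negative.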

Methods Used
- Used the RefSeq FTP site to obtain the annotation for FAM18
- Obtained a .gff file with the start, stop, strand and type information for annotated portions of the genome

FAM18 summary
- Species: Neisseria meningitidis
- Serogroup: C

            Genes    CDS   Exon   tRNA   rRNA
Total        2046   1954     71     59     12
+ Strand     1035    982     34     31      3
- Strand     1011    972     37     28      9
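A summary like this can be tallied directly from the .gff file. The sketch below assumes the standard 9-column, tab-separated GFF layout (feature type in column 3, strand in column 7). The function name `count_features` and the example records are hypothetical, not taken from the FAM18 annotation.

```python
from collections import Counter

def count_features(gff_lines):
    """Count GFF records by (feature type, strand), skipping comments and blanks."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GFF record
        feature_type, strand = fields[2], fields[6]
        counts[(feature_type, strand)] += 1
    return counts

# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "seq1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene0",
    "seq1\tRefSeq\tCDS\t1\t900\t.\t+\t0\tID=cds0",
    "seq1\tRefSeq\tgene\t1200\t1275\t.\t-\t.\tID=gene1",
    "seq1\tRefSeq\ttRNA\t1200\t1275\t.\t-\t.\tID=rna0",
]

counts = count_features(example)
for (feature_type, strand), n in sorted(counts.items()):
    print(feature_type, strand, n)
```

To run it on a real annotation, pass an open file handle instead of the list, for example `count_features(open("annotation.gff"))`, where the file name is a placeholder.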

Comparing Tools - Tests
Metrics, comparing the Predicted Set against the Annotated Set:
- True Positive (Rightly Predicted)
- False Positive (Over Predicted)
- False Negative (Under Predicted)
- True Negative
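The comparison of a predicted set against an annotated set can be sketched with plain set operations. This assumes that a prediction counts as correct only when its (start, stop, strand) matches an annotated gene exactly; the slides do not state the matching rule, so that criterion, the function name `compare_sets` and the coordinates are all hypothetical.

```python
def compare_sets(predicted, annotated):
    """Split genes into true positives, false positives and false negatives."""
    predicted, annotated = set(predicted), set(annotated)
    true_pos = predicted & annotated    # rightly predicted
    false_pos = predicted - annotated   # over predicted
    false_neg = annotated - predicted   # under predicted
    return true_pos, false_pos, false_neg

# Made-up (start, stop, strand) tuples for illustration only.
annotated = [(1, 900, "+"), (1200, 1275, "-"), (2000, 2600, "+")]
predicted = [(1, 900, "+"), (2000, 2600, "+"), (3000, 3300, "-")]

tp, fp, fn = compare_sets(predicted, annotated)
print(len(tp), len(fp), len(fn))
```

A looser rule, such as matching on stop coordinate and strand only, would change which predictions count as true positives.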

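Comparing Tools - Metrics
- Sensitivity = TP / (TP + FN): the fraction of annotated genes that the tool predicted
- Precision = TP / (TP + FP): the fraction of predictions that match an annotated gene, i.e. the ability to exclude false positives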

Summary
PROGRAM       Entries   Predicted (TP)   Unpredicted (FN)   Overpredicted (FP)
AMIGene          1098              956                 79                  142
GeneMarkS        2214             1672                374                  542
Prodigal         1024              941                 94                   83
tRNAscan-SE        59               33                 26                   26
Rfam              232               31                 40                  201

Summary
PROGRAM       Sensitivity   Precision
AMIGene              0.92        0.87
GeneMarkS            0.82        0.75
Prodigal             0.91        0.92
tRNAscan-SE          0.56        0.56
Rfam                 0.44        0.13
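As a check, the two metrics can be recomputed from the TP, FN and FP counts in the first summary table, taking sensitivity as TP / (TP + FN) and precision as TP / (TP + FP). With these formulas the reported values are reproduced to two decimals, except GeneMarkS precision, which comes out at about 0.755 against the reported 0.75. The function names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of annotated genes that were predicted."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of predictions that match an annotated gene."""
    return tp / (tp + fp)

# (TP, FN, FP) counts copied from the summary table.
counts = {
    "AMIGene": (956, 79, 142),
    "GeneMarkS": (1672, 374, 542),
    "Prodigal": (941, 94, 83),
    "tRNAscan-SE": (33, 26, 26),
    "Rfam": (31, 40, 201),
}

for program, (tp, fn, fp) in counts.items():
    print(program, f"{sensitivity(tp, fn):.3f}", f"{precision(tp, fp):.3f}")
```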

Summary

GeneMarkS Usage
- Sequence Type: Prokaryotic, Intronless Eukaryotic, Virus, Phage, EST/cDNA
- Output Format (-format): GFF or LST
- Omit Overlaps (-offover)
- Add to path: /home/yasvanth3/gm/genemark_suite_linux_64/gmsuite/
- Example: gmsn.pl -prok <inputfilename.fasta> -format <output format>

GeneMarkS Output
- GFF File
  - Gene Name, Start, Stop, Gene ID, Length, Gene Score

Glimmer 3.0 Usage
- Input file: sequence.fa
- Add to path: /home/vvenkat6/bin/
- 4 steps:
  1. long-orfs -n sequence.fa sequence.orf
  2. extract sequence.fa sequence.orf > sequence.train
  3. build-icm -r sequence.icm < sequence.train
  4. glimmer3 sequence.fa sequence.icm out
- Output files:
  i) out.predict
  ii) out.detail

out.detail

out.predict
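A minimal sketch for reading out.predict, assuming the usual Glimmer 3 layout: a ">" header line per input sequence followed by whitespace-separated rows of ORF id, start, end, reading frame and score. The function name `parse_glimmer_predict` and the sample rows are hypothetical, so check the column order against your own output before relying on it.

```python
def parse_glimmer_predict(lines):
    """Parse Glimmer-style .predict lines into a list of gene dictionaries."""
    genes = []
    contig = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            contig = parts[0] if parts else None  # sequence name from the header
            continue
        orf_id, start, end, frame, score = line.split()[:5]
        genes.append({
            "contig": contig,
            "id": orf_id,
            "start": int(start),
            "end": int(end),
            "strand": frame[0],   # the frame looks like +1 or -3; its sign is the strand
            "score": float(score),
        })
    return genes

# Made-up rows for illustration only.
sample = [
    ">contig_1",
    "orf00001      445     1548  +1     9.61",
    "orf00002     2612     1716  -3    11.29",
]

genes = parse_glimmer_predict(sample)
print(genes[0]["id"], genes[0]["strand"], genes[1]["strand"])
```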

Prodigal
- Implements simple log-likelihood scoring functions, unlike the previous programs, which use complicated HMMs and IMMs
- Performs well for high GC content genomes
- Trade-off between the number of FPs and TPs

Command Used:
prodigal.linux -i input_file_name -o output_file_name -f output_format -d nucleotide_sequences_of_all_genes -a protein_sequences_of_all_genes -s potential_genes_with_scores
- The mode can also be specified using the -p flag.
- Different output formats can be specified:
  - gbk: Genbank-like format (Default)
  - gff: GFF format
  - sqn: Sequin feature table format
  - sco: Simple coordinate output
Total No. of Genes Predicted: 645 + 771 = 1416 in the CISA_all file
grep -w "-" out_cisa_all.gff | wc -l
645
grep -w "+" out_cisa_all.gff | wc -l
771

Output File Generated

BLAST
Step 1: Create the BLAST database
makeblastdb -in FAM18.fasta -dbtype 'nucl' -out FAM18_db
RESOURCES:
- MAKEBLAST: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/makeblastdb
- Input file: /home/rnagilla3/assignment_data/fam18.fasta
Step 2: Run blastn
blastn -db FAM18_db -query CISA_all.fa -outfmt 6 -out BLAST_OUTPUT
RESOURCES:
- BLASTN: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/blastn
- QUERY File: /home/yasvanth3/gm/cisa_all.fa

BLAST output format
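With -outfmt 6, blastn writes one tab-separated line per hit. By default the twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The sketch below reads such lines into dictionaries; the function name `parse_outfmt6` and the sample hit are made up for illustration.

```python
import csv
import io

OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")

def parse_outfmt6(handle):
    """Read default BLAST tabular output (outfmt 6) into a list of dicts."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits

# One made-up hit line for illustration only.
sample = "contig_1\tref_1\t99.50\t200\t1\t0\t1\t200\t5001\t5200\t1e-100\t364\n"

hits = parse_outfmt6(io.StringIO(sample))
print(hits[0]["qseqid"], hits[0]["pident"], hits[0]["evalue"])
```

For a real run, pass the output file instead, for example `parse_outfmt6(open("BLAST_OUTPUT"))`.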

Non-coding RNA Prediction

tRNAscan-SE command line
/home/tmi7/bin/trnascan-se
Some Options:
  -B: search for bacterial tRNAs (use bacterial tRNA model)
  -C: search using Cove analysis only (slow, sensitive)
  -o: save tabular result to...
  -f: save tRNA secondary structures to...
  -m: save statistics summary to...

trnascan-se result on CISA_all.fa

trnascan-se result on CISA_all.fasta

RNAmmer 1.2
/home/akelley35/bin
rnammer [-S kingdom] [-m molecules] [-xml xml-file] [-gff gff-file] [-h hmmreport] [-f fasta-file] [sequence]
  -S: specifies the super kingdom of the input sequence: euk, bac, arc
  -m: molecule type; 'tsu' for 5/8s rRNA, 'ssu' for 16/18s rRNA, 'lsu' for 23/28s rRNA, or any combination separated by commas
  -xml, -gff, -h, -f: the types of output generated

Infernal (Rfam)
Path to the installed files: /home/tmi7/bin/
Step 1: Obtain a CM database flatfile (download from Rfam)
Step 2: Compress and index the flatfile with cmpress
cmpress <cmdb>
Step 3: Search the CM database with cmscan
cmscan --noali -E <x> -o <f> <cmdb> <seqfile>
  --noali: don't output alignments
  -E <x>: report sequences <= this E-value threshold in output
  -o <f>: direct output to file <f>

Infernal (Rfam) Results

ncRNA prediction results
Prediction Tools        FAM18   CISA_all.fasta
RNAmmer                    12                3
tRNAscan-SE                59               37
Rfam cmscan (rRNA)         12                8
Rfam cmscan (tRNA)         62               46
Rfam cmscan (other)        22               25

Pipeline Changes
- AMIGene
- EasyGene?
- RSAT: lower priority
  - Adds non-coding regulatory portions of the genome, but we want to focus on coding portions first

Challenges
- Use GenePRIMP to combine GeneMarkS and Prodigal results
- Find a way to combine outputs to minimize False Positives and False Negatives
  - Use a confidence system
    - Highest confidence is genes confirmed by all relevant outputs that do not contradict
  - Resolve conflicting results
    - use database references
    - compare conflicts and pick the higher score or more likely gene prediction
- How to determine likely vs unlikely Theoretical Genes (not over predict)
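One possible shape for the confidence system is to count how many tools report each gene. This is only a sketch: the tier names, the exact-coordinate matching and the function names are hypothetical choices, not a method the group has settled on, and the coordinates are made up.

```python
from collections import defaultdict

def support_by_gene(predictions_by_tool):
    """Map each predicted gene to the set of tools that reported it."""
    support = defaultdict(set)
    for tool, genes in predictions_by_tool.items():
        for gene in genes:
            support[gene].add(tool)
    return dict(support)

def confidence_tiers(support, n_tools):
    """Assign a hypothetical tier: all tools agree, some agree, or only one."""
    tiers = {}
    for gene, tools in support.items():
        if len(tools) == n_tools:
            tiers[gene] = "high"
        elif len(tools) > 1:
            tiers[gene] = "medium"
        else:
            tiers[gene] = "low"
    return tiers

# Made-up (start, stop, strand) predictions for illustration only.
predictions = {
    "GeneMarkS": [(1, 900, "+"), (2000, 2600, "+")],
    "Prodigal": [(1, 900, "+"), (2000, 2600, "+"), (3000, 3300, "-")],
    "Glimmer": [(1, 900, "+"), (3000, 3300, "-"), (4000, 4500, "+")],
}

support = support_by_gene(predictions)
tiers = confidence_tiers(support, n_tools=len(predictions))
for gene in sorted(tiers):
    print(gene, tiers[gene])
```

Genes in the lower tiers are the ones that would need the conflict-resolution steps listed above, such as a database reference or a score comparison.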

Further Steps
- Continue testing/comparing programs
- Schema for Naming Genes
  - derive from GeneID, contig # and Sample ID
- Finalize Method for Merging Results
  - i.e. investigate GenePRIMP (Gene Prediction Improvement Pipeline) for ideas
- Use metrics mentioned previously to filter results of individual programs

References
1. Lagesen, Karin, et al. "RNAmmer: consistent and rapid annotation of ribosomal RNA genes." Nucleic Acids Research 35.9 (2007): 3100-3108.
2. Besemer, John, Alexandre Lomsadze, and Mark Borodovsky. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research 29.12 (2001): 2607-2618.
3. Burge, Sarah W., et al. "Rfam 11.0: 10 years of RNA families." Nucleic Acids Research (2012): gks1005.
4. Schattner, Peter, Angela N. Brooks, and Todd M. Lowe. "The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs." Nucleic Acids Research 33.suppl 2 (2005): W686-W689.
5. Delcher, Arthur L., et al. "Improved microbial gene prediction with GLIMMER." Nucleic Acids Research 27.23 (1999): 4636-4641.
6. Delcher, Arthur L., et al. "Identifying bacterial genes and endosymbiont DNA with Glimmer." Bioinformatics 23.6 (2007): 673-679.

LAB ACTIVITY!
- In class:
  - BLAST
  - GeneMarkS
  - Rfam
- Run the programs. In your server folder, create a directory labeled GenePredHWOut and place the output files there
- Parts done in groups: one person should email me before you leave with the files generated together, who worked together, and where I can find these output files
- Write answers/observations in a text file marked appropriately. Email this to me (email below) with the subject Gene Prediction HW Answers
  - Name the file FIRSTNAME_LASTNAME.txt
  - rachel.kutner06@gmail.com
- I will provide instructions and a grading scheme for those absent or in case you don't finish.
- Due next Friday at midnight.

Grading
- Attendance in class: 10%
- Completion of exercise: 30%
- Proper answers: 40% (each question is 5 points)
- Proper output files: 20%

BLAST
Run BLAST with the unknown query sequence against a known reference database: create a BLAST database from reference.fasta, blast the query against this database, and submit the results.
Both the query files and the database reference files are located in /home/rnagilla3/assignment_data/
[ Please write to Roopa, roopareddynagilla@gatech.edu for any permission issues ]
- MAKEBLAST: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/makeblastdb
- BLASTN: /home/rnagilla3/bin/blast/ncbi-blast-2.2.30+/bin/blastn
- Query sequence: query.fasta
- Reference sequence: reference.fasta (N. meningitidis)
Questions:
1. How is the output sorted?
2. What is the e-value and why is it significant?
3. Pick one of the top homologous sequences for FAM18. Which species do you think the sequence is most related to?

GeneMarkS
Run GeneMarkS with the FAM18 fasta file.
- The FAM18 file can be found in /home/yasvanth3/fam18.fasta
- GeneMarkS (gmsn.pl) can be run from /home/yasvanth3/gm/genemark_suite_linux_64/gmsuite/
Assume the species is unknown and use the appropriate command and parameters to produce a GFF file.
Questions:
- Is RBS (Ribosomal Binding Site) True or False?
- Describe the format of the output and list the types of scores that are available.

Infernal
Run the cmscan program for the FAM18 sequence against the Rfam CM database.
- The cmscan program is in /home/tmi7/bin/cmscan
- The FAM18 sequence can be found in /home/tmi7/fam18.fasta
- The Rfam CM database is in /home/tmi7/cms/rfam.cm
Use the no-alignment option (--noali), an E-value of 1E10 (-E), and save an output file (-o).
Questions:
- How many cmscan hits did you get?
- How many ribosomal RNAs are there in your output? (There may be some redundant results)