Functional Annotation: Preliminary Results

Vani Rajan, Gena Tang, Neha Varghese, Kevin Lee, Gabriel Mitchell, Tripp Jones, Robert Petit, Shaupu Qin

Outline
- Motivation
- Naming scheme
- Preliminary program results
  - Database created
  - BLAST
  - InterPro
  - CDD
  - Ab initio
  - Pathway Tools

The GaTech/CDC collaboration
- What sort of questions are we trying to answer?
- What annotations do we need to answer them?
[Diagram labels: questions, annotations, discussion]

Classification Schemes
"So the first question I would want to ask is how Hhae strains are related to NTHi phylogenetically? Can you identify any features that would tell NTHi from Hhae without any uncertainty?" ~Xin Wang
[Figure labels: NTHi, Hhae, genetic feature A, genetic feature B, functional behavior B]
- To develop classification schemes that extend beyond standard MLST and 16S rRNA, we will need annotations of genes.
- Developing new classification protocols based on metabolic activity and the like will require annotation of functional networks.

Explaining Phenotypes
"The second question is what is the genetic basis for Hhae hemolysis and pathogenicity." ~Xin Wang
- We need annotations of genes (and gene functions) and of the putative genetic networks that determine behaviors.

An aside about Pathogenicity
- Many kinds of potential genetic signatures are associated with pathogenicity (there are databases; our collaborators have a list of candidates as well).
- Genetic requirements may not be enough. The environmental context matters too.
- Mobile elements are of particular interest:
  - Phages (as prophages; there are tools to detect them)
  - Plasmids (they can integrate)
  - Transposons
  - Insertion sequences

Virulence Factor Database
- CDC has a preliminary list of target genes
- Virulence factors in many bacterial species
- Searching our genomes:
  1. Download all factors from VFDB
  2. Create a blastdb of them
  3. BLAST all genomes against them
  4. Record the hits with E < 0.001
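The last search step (keeping only hits with E < 0.001) can be sketched in Python. This is a minimal illustration, not the group's actual script: it assumes the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), and the function name `significant_hits` and the two example rows are made up for the demonstration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(handle, max_evalue=1e-3):
    """Yield each hit (as a dict) whose E-value is below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] < max_evalue:
            yield hit


# Hypothetical rows for illustration only; identifiers are invented.
example = (
    "contig_1\tVF_example_1\t95.2\t310\t15\t0\t1200\t2129\t1\t310\t1e-150\t540\n"
    "contig_2\tVF_example_2\t31.0\t58\t40\t0\t10\t183\t5\t62\t0.5\t25.1\n"
)
hits = list(significant_hits(io.StringIO(example)))
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

With a real results file, the same function can be called on an open file handle instead of the in-memory example.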

Virulence Factor Database
- A track in the browser
- Integral to comparative analysis

Mobile Elements: Insertion sequences
- Allow for recombination, genomic rearrangement, and plasmid formation
- Structure: inverted terminal repeats flanking a transposase (and an activity modulator)
- Search using ISfinder (web interface)
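To make the "inverted terminal repeats" structure concrete, here is a small Python sketch that checks whether the two ends of a candidate element are reverse complements of each other. It is only an illustration of the definition and is not how ISfinder searches; the function names and the toy sequence are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_repeat_identity(element, repeat_len):
    """Fraction of positions at which the left end matches the
    reverse complement of the right end (1.0 = perfect inverted repeat)."""
    if repeat_len <= 0 or len(element) < 2 * repeat_len:
        raise ValueError("element too short for the requested repeat length")
    left = element[:repeat_len].upper()
    right = reverse_complement(element[-repeat_len:]).upper()
    matches = sum(a == b for a, b in zip(left, right))
    return matches / repeat_len


# Toy element: an 8 bp repeat, a stand-in "transposase" region, and the
# reverse complement of the repeat at the other end.
repeat = "GGTACCTA"
toy_element = repeat + "ATGAAACCC" + reverse_complement(repeat)
identity = terminal_repeat_identity(toy_element, len(repeat))
print(identity)
```

Real insertion sequences often have imperfect repeats, which is why the function reports a fraction rather than a yes/no answer.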

Naming Scheme
Previously used scheme (Kislyuk et al., 2010):
- UniProt result with > 91% aa identity and E < 10^-9: use its name
- If hypothetical in UniProt: use the InterPro domain
- Genes with unknown function in other genomes: "conserved hypothetical"
- Else: "putative uncharacterized protein"
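The decision order of the naming scheme can be sketched as a small Python function. This is one possible reading of the slide, not the published implementation: the `Evidence` fields, the keyword test for "hypothetical", and the "domain-containing protein" wording are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of the evidence collected for one gene."""
    uniprot_name: Optional[str] = None   # name of the best UniProt hit, if any
    identity: float = 0.0                # percent amino-acid identity
    evalue: float = 1.0                  # E-value of that hit
    interpro_domain: Optional[str] = None
    seen_in_other_genomes: bool = False  # unknown-function homologs elsewhere


def is_hypothetical(name):
    """Assumed keyword test for an uninformative UniProt name."""
    lowered = name.lower()
    return "hypothetical" in lowered or "uncharacterized" in lowered


def assign_name(ev):
    strong = (ev.uniprot_name is not None
              and ev.identity > 91.0 and ev.evalue < 1e-9)
    if strong and not is_hypothetical(ev.uniprot_name):
        return ev.uniprot_name
    if strong and ev.interpro_domain:
        # UniProt hit is hypothetical: fall back to the InterPro domain.
        return f"{ev.interpro_domain} domain-containing protein"
    if ev.seen_in_other_genomes:
        return "conserved hypothetical protein"
    return "putative uncharacterized protein"


print(assign_name(Evidence("Adenylate kinase", 98.0, 1e-50)))
print(assign_name(Evidence(seen_in_other_genomes=True)))
```

Keeping the thresholds in one place makes it easy to rerun the naming step if the identity or E-value cutoffs change.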

Preliminary Results BLAST AND DATABASE CREATED

Blastp
- Goals:
  - Find hits with identity > 91% and E < 10^-9
  - Assign name
- Database:
  - Build database from organisms related to H. haemolyticus
  - From UniProt
- Sanity check:
  - Pasteurellaceae family

Summary
- 1030 significant hits in H. influenzae (76%)
- 10 significant hits in other organisms
- About 1 in 4 genes with no hits (313)
- Thresholds: identity > 91%, E < 10^-9
- Pasteurellaceae database: no extra hits
- To come: BLAST against other databases
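The percentages in the summary can be checked with a few lines of Python. The sketch assumes that the three categories (hit in H. influenzae, hit in another organism, no hit) partition the gene set; the slide does not state this explicitly, and the dictionary labels are illustrative.

```python
# Counts taken from the summary; the assumption is that they do not overlap.
counts = {
    "H. influenzae hit": 1030,
    "other organism hit": 10,
    "no hit": 313,
}

total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}

print(f"total genes: {total}")
for label, fraction in fractions.items():
    print(f"{label}: {counts[label]} ({fraction:.1%})")
```

Under that assumption the total matches the 1353 genes used as the denominator in the CDD results.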

Preliminary Results INTERPRO SCAN

Degree Distribution of Database Hits

Sample InterProScan Output

InterPro Entry Link

Preliminary Results CONSERVED DOMAIN DATABASE

Conserved Domain DB
- Search by RPS-BLAST, which uses the query sequence to search a database of pre-calculated PSSMs
- Helps give more information about function for genes which come up as putative, conserved, or unknown
- Allows scripted data downloads
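To illustrate what scoring a query against a PSSM means, here is a toy Python sketch that slides a position-specific scoring matrix along a protein sequence and reports the best ungapped match. It is a deliberate simplification: RPS-BLAST itself uses a much more elaborate search with gapped alignments and statistical evaluation, and the matrix values, default penalty, and function name below are invented for the example.

```python
def best_pssm_match(sequence, pssm, default=-4):
    """Return (start, score) of the best ungapped window.

    pssm is a list with one dict per position, mapping residue -> score;
    residues missing from a column receive the default penalty.
    """
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    best_start, best_score = None, None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Invented three-column matrix, for illustration only.
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 4, "T": 4},
]
start, score = best_pssm_match("MAGKSLL", toy_pssm)
print(start, score)
```

Because each column has its own scores, a PSSM can reward residues that are conserved at one position of a domain while tolerating variation at another, which a single substitution matrix cannot do.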

Sample output (GFF)
- 57/1353 genes have no hit in CDD

Web Server Output

Preliminary Results AB INITIO

SignalP

SignalP prediction
- 1352 protein sequences in total
- 238 predicted by the HMM to contain a signal peptide
- 175 predicted by the neural network to contain a signal peptide

LipoP
- Lipoproteins: 44/1352 predicted

TMHMM
- 275/1352 with at least one transmembrane helix predicted

Preliminary Results The Working Concept of Pathway Tools

Components
- Pathway/Genome Navigator
- Pathway/Genome Editors
- PathoLogic

Input and PGDB Schema
- Input: annotated genome of the organism (genome sequence, locations of identified genes, identified functions of gene products)
- Sequence -> FASTA format
- Annotation -> GenBank or PathoLogic format
- PGDB representation of the input file

Metabolic Pathway Prediction
Two steps:
1. Creates the reactome using MetaCyc
2. Imports the corresponding pathways from MetaCyc and then carries out a pruning process

Operon Predictor
- Identifies operon boundaries by examining pairs of adjacent genes A and B
- Features considered: intergenic distance, functional relationship between A and B, membership in the same pathway, membership in the same multimeric protein complex, and so on
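The idea of examining adjacent gene pairs can be sketched in Python using only one of the listed features, intergenic distance, plus strand. This is not the Pathway Tools algorithm, which combines several kinds of evidence; the `Gene` class, the 50 bp cutoff, and the example coordinates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # 1-based start coordinate
    end: int     # end coordinate (inclusive)
    strand: str  # "+" or "-"


def group_into_operons(genes, max_gap=50):
    """Group adjacent same-strand genes whose intergenic gap is <= max_gap."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            same_strand = gene.strand == previous.strand
            close = gene.start - previous.end <= max_gap
            if same_strand and close:
                operons[-1].append(gene)
                continue
        operons.append([gene])  # start a new candidate operon
    return operons


# Invented coordinates for illustration.
genes = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 920, 1500, "+"),
    Gene("geneC", 1900, 2500, "+"),
    Gene("geneD", 2510, 3000, "-"),
]
operons = group_into_operons(genes)
for operon in operons:
    print([g.name for g in operon])
```

The other features from the slide (shared pathway, shared protein complex) could be added as extra conditions on each adjacent pair once functional annotations are available.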

Pathway Hole Filler
- Pathway holes: 200-300 seen, suggesting under-annotated genes
- PHFiller uses UniProt to identify potential hole fillers

Pathway Tools Cellular Overview

Next steps
- On receiving complete gene prediction results, perform all the annotation steps on one strain, then automate the process and run it for the other five.