Functional Annotation: Preliminary Results. Vani Rajan, Gena Tang, Neha Varghese, Kevin Lee, Gabriel Mitchell, Tripp Jones, Robert Petit, Shaupu Qin
Outline: Motivation; Naming scheme; Preliminary program results (database created, BLAST, InterPro, CDD, ab initio, Pathway Tools)
The GaTech/CDC collaboration: What sort of questions are we trying to answer? What annotations do we need to answer them? [Figure labels: annotations, questions, discussion]
Classification Schemes: "So the first question I would want to ask is how Hhae strains are related to NTHi phylogenetically? Can you identify any features that would tell NTHi from Hhae without any uncertainty?" ~Xin Wang. [Figure labels: NTHi, Hhae, genetic feature A, genetic feature B, functional behavior B] To develop classification schemes that extend beyond standard MLST and 16S rRNA, we will need annotations of genes. Developing new classification protocols based on metabolic activity and the like will require annotation of functional networks.
Explaining Phenotypes: "The second question is what is the genetic basis for Hhae hemolysis and pathogenicity." ~Xin Wang. We need annotations of genes (and gene functions) and of the putative genetic networks that determine behaviors.
An aside about Pathogenicity: Many kinds of potential genetic signatures are associated with pathogenicity (there are databases; our collaborators have a list of candidates as well). Genetic requirements may not be enough; the environmental context matters too. Mobile elements are of particular interest: phages (as prophages; there are tools to detect them), plasmids (they can integrate), transposons, and insertion sequences.
Virulence Factor Database: CDC has a preliminary list of target genes. VFDB covers virulence factors in many bacterial species. Searching our genomes: (1) download all factors from VFDB; (2) create a blastdb of them; (3) BLAST all genomes against them; (4) record the hits with e < 0.001.
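The last step of the search above (recording hits with e < 0.001) can be sketched as follows. This is a minimal illustration, assuming the BLAST search was run with tabular output (`-outfmt 6`), whose default column order is used below. The function name `filter_hits` and the example rows, including the gene and VFDB identifiers, are made up for illustration and are not results from this project.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(tabular_text, max_evalue=1e-3):
    """Return the hits whose e-value is below max_evalue, one dict per hit."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["evalue"]) < max_evalue]

# Hypothetical rows for illustration only (not real hits).
example = (
    "gene_0001\tVFG000001\t95.2\t300\t14\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "gene_0002\tVFG000002\t30.1\t80\t50\t3\t5\t84\t10\t89\t0.5\t28.1\n"
)

kept = filter_hits(example)
print([hit["qseqid"] for hit in kept])
```

Keeping each hit as a dictionary keyed by column name makes it straightforward to write the surviving hits out later, for example as a browser track.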
Virulence Factor Database: a track in the browser; integral to comparative analysis.
Mobile Elements: Insertion sequences allow for recombination, genomic rearrangement, and plasmid formation. Structure: inverted terminal repeats flanking a transposase (and an activity modulator). Search using ISfinder (web interface).
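To make the "inverted terminal repeats" structure concrete: the left end of an insertion sequence reads the same as the reverse complement of its right end. The sketch below measures the length of a perfect terminal inverted repeat in a candidate sequence. It is only an illustration of the structure, not how ISfinder searches; real repeats are often imperfect, and the function names and the toy sequence are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def terminal_inverted_repeat_length(seq, max_len=50):
    """Length of the longest perfect inverted repeat at the two ends of seq.

    The first n bases of seq are compared with the first n bases of its
    reverse complement, i.e. with the reverse complement of the last n bases.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n

# Toy element: a 5-base repeat flanking a stand-in "transposase" region.
left = "GGTCA"
toy_element = left + "AAAAAAAA" + reverse_complement(left)
print(terminal_inverted_repeat_length(toy_element))
```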
Mobile Elements Insertion sequences
Naming Scheme: Previously used scheme (Kislyuk et al., 2010). Use the UniProt result with > 91% aa identity and e < 10^-9. If hypothetical in UniProt, use the InterPro domain. Genes with unknown function in other genomes: conserved hypothetical. Else: putative uncharacterized protein.
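The naming rules above form a simple decision cascade, sketched below. This is one possible reading of the slide, not the implementation from Kislyuk et al.: the function name, the shape of the `uniprot_hit` dictionary, and the choice to return the InterPro domain description unchanged as the name are all assumptions made for illustration.

```python
def assign_name(uniprot_hit=None, interpro_domain=None,
                unknown_in_other_genomes=False):
    """Pick a product name following the cascade on the slide.

    uniprot_hit: None, or a dict with 'name', 'identity' (percent amino-acid
    identity) and 'evalue' (hypothetical structure for this sketch).
    """
    if (uniprot_hit is not None
            and uniprot_hit["identity"] > 91
            and uniprot_hit["evalue"] < 1e-9):
        name = uniprot_hit["name"]
        if "hypothetical" not in name.lower():
            return name                     # informative UniProt name
        if interpro_domain:
            return interpro_domain          # fall back to the InterPro domain
    if unknown_in_other_genomes:
        return "conserved hypothetical protein"
    return "putative uncharacterized protein"

# Hypothetical inputs for illustration only.
print(assign_name({"name": "DNA gyrase subunit A",
                   "identity": 98.0, "evalue": 1e-50}))
print(assign_name({"name": "Hypothetical protein",
                   "identity": 95.0, "evalue": 1e-20},
                  interpro_domain="ABC transporter-like"))
print(assign_name(unknown_in_other_genomes=True))
print(assign_name())
```

Ordering the checks from most to least informative means a gene only receives a generic name when every more specific source has failed.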
Preliminary Results BLAST AND DATABASE CREATED
Blastp. Goals: find hits with identity > 91% and E < 10^-9; assign name. Database: build a database from organisms related to H. haemolyticus, from UniProt; sanity check: Pasteurellaceae family.
Summary (identity > 91%, E < 10^-9): 1030 significant hits in H. influenzae (76%); 10 significant hits in other organisms; 1 in 4 genes with no hits (313). Pasteurellaceae database: no extra hits. To come: BLAST against other databases.
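The percentages in this summary can be checked with a few lines of arithmetic. The sketch assumes the three categories on the slide are disjoint and together cover every predicted gene; the variable names are illustrative.

```python
# Counts taken from the summary slide.
hits_h_influenzae = 1030
hits_other_organisms = 10
no_hits = 313

# Assumption: the three categories are disjoint and cover all genes.
total_genes = hits_h_influenzae + hits_other_organisms + no_hits

pct_h_influenzae = 100 * hits_h_influenzae / total_genes
pct_no_hits = 100 * no_hits / total_genes

print(f"total genes: {total_genes}")
print(f"hits in H. influenzae: {pct_h_influenzae:.1f}%")
print(f"no hits: {pct_no_hits:.1f}%")
```

Under that assumption the total comes to 1353 genes, about 76% with a significant H. influenzae hit and about 23% (roughly 1 in 4) with no hit.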
Preliminary Results INTERPRO SCAN
Degree Distribution of Database Hits
Sample InterProScan Output
InterPro Entry Link
Preliminary Results CONSERVED DOMAIN DATABASE
Conserved Domain DB: Search by RPS-BLAST, which uses the query sequence to search a database of pre-calculated PSSMs. Helps give more information about function for genes that come up as putative, conserved, or unknown. Allows scripted data downloads.
Sample output (GFF): 57/1353 genes have no hit in CDD.
Web Server Output
Preliminary Results AB INITIO
SignalP
SignalP prediction: 1352 protein sequences in total. 238 predicted by the HMM to contain a signal peptide; 175 predicted by the neural network to contain a signal peptide.
LipoP: 44/1352 predicted to be lipoproteins.
TMHMM: 275/1352 predicted to have at least one transmembrane helix.
Preliminary Results The Working Concept of Pathway Tools
Components: Pathway/Genome Navigator; Pathway/Genome Editors; PathoLogic
Input and PGDB Schema: The input is the annotated genome of the organism: genome sequence, locations of identified genes, and identified functions of gene products. Sequence -> FASTA format; annotation -> GenBank or PathoLogic format. The PGDB is a representation of the input file.
Metabolic Pathway Prediction: 2 steps: (1) creates the reactome using MetaCyc; (2) imports the corresponding pathways from MetaCyc and then carries out a pruning process.
Operon Predictor: Identifies operon boundaries by examining pairs of adjacent genes A and B. Features considered include intergenic distance, functional relationship between A and B, membership in the same pathway, membership in the same multimeric protein complex, and so on.
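The pairwise idea can be sketched as follows. This is a simplified illustration, not the Pathway Tools predictor: it uses only strand, intergenic distance, and shared pathway membership, and the `Gene` class, the function names, the distance thresholds, and the example genes are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str
    pathways: frozenset = frozenset()

def same_operon(a, b, max_gap=50, relaxed_gap=200):
    """Hypothetical rule for adjacent genes a and b (a upstream of b).

    Same strand is required. A short intergenic gap is enough on its own;
    a shared pathway allows a somewhat longer gap.
    """
    if a.strand != b.strand:
        return False
    gap = b.start - a.end - 1
    if gap <= max_gap:
        return True
    return bool(a.pathways & b.pathways) and gap <= relaxed_gap

def predict_operons(genes):
    """Group genes, sorted by position, into runs of linked adjacent pairs."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons and same_operon(operons[-1][-1], gene):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return operons

# Made-up genes for illustration only.
genes = [
    Gene("A", 1, 900, "+"),
    Gene("B", 920, 1800, "+"),
    Gene("C", 2500, 3300, "+", frozenset({"pathway_1"})),
    Gene("D", 3451, 4200, "+", frozenset({"pathway_1"})),
    Gene("E", 4210, 5000, "-"),
]
operon_names = [[g.name for g in operon] for operon in predict_operons(genes)]
print(operon_names)
```

An operon boundary falls wherever an adjacent pair fails the rule, which is why only neighboring genes ever need to be compared.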
Pathway Hole Filler: Pathway holes (200-300 seen) point to under-annotated genes. PHFiller uses UniProt to identify potential hole fillers.
Pathway Tools Cellular Overview
Next steps: On receiving the complete gene prediction results, perform all the annotation steps on one strain, then automate the process and run it for the other five.