Functional Annotation: Preliminary Results

1 Functional Annotation: Preliminary Results. Vani Rajan, Gena Tang, Neha Varghese, Kevin Lee, Gabriel Mitchell, Tripp Jones, Robert Petit, Shaupu Qin

2 Outline: Motivation; Naming scheme; Preliminary program results (database created, BLAST, InterPro, CDD, ab initio, Pathway Tools)

3 The GaTech/CDC collaboration: What sort of questions are we trying to answer? What annotations do we need to answer them? [Diagram labels: annotations, questions, discussion]

4 Classification Schemes: "So the first question I would want to ask is how Hhae strains are related to NTHi phylogenetically? Can you identify any features that would tell NTHi from Hhae without any uncertainty?" ~Xin Wang. [Figure: NTHi vs. Hhae, with labels "genetic feature A", "genetic feature B", "functional behavior B"] In order to develop classification schemes that extend beyond standard MLST and 16S rRNA, we will need annotations of genes. Developing new classification protocols based on metabolic activity and the like will require annotation of functional networks.

5 Explaining Phenotypes: "The second question is what is the genetic basis for Hhae hemolysis and pathogenicity." ~Xin Wang. We need to have annotations of genes (and gene functions) and putative genetic networks that determine behaviors.

6 An aside about Pathogenicity: Many kinds of potential genetic signatures are associated with pathogenicity (there are databases; our collaborators have a list of candidates as well). Genetic requirements may not be enough; the environmental context matters too. Mobile elements of particular interest: phages (as prophages; there are tools to detect them), plasmids (they can integrate), transposons, insertion sequences.

7 Virulence Factor Database: CDC has a preliminary list of target genes. VFDB covers virulence factors in many bacterial species. Searching our genomes: (1) download all factors from VFDB; (2) create a BLAST database of them; (3) BLAST all genomes against them; (4) record the hits with E < 0.001.
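
The last step of the search (keeping only hits below the E-value cutoff) can be sketched as follows. This is a minimal illustration, not the group's actual script: it assumes the BLAST run was written in the default 12-column tabular format (`-outfmt 6`), where the E-value is the 11th column, and the contig and VFDB identifiers in the example are made up.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-3  # threshold from the slide: record hits with E < 0.001


def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (query, subject, evalue) for rows whose E-value is below cutoff.

    Assumes default BLAST tabular columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue < cutoff:
            hits.append((row[0], row[1], evalue))
    return hits


# Hypothetical two-row example: one strong hit, one weak hit.
example = "\n".join([
    "contig1\tVFG000001\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t300",
    "contig2\tVFG000002\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20",
])
print(filter_hits(example))
```

Only the first row passes the cutoff; the second (E = 0.5) is dropped.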

10 Virulence Factor Database: A track in the browser; integral to comparative analysis.

11 Mobile Elements: Insertion sequences. Allow for recombination, genomic rearrangement, and plasmid formation. Structure: inverted terminal repeats flanking a transposase (and an activity modulator). Search using ISfinder (web interface).

15 Naming Scheme: Previously used scheme (Kislyuk et al., 2010). (1) Use the UniProt result with > 91% aa identity and E < 10^-9. (2) If hypothetical in UniProt, use the InterPro domain. (3) Genes with unknown function in other genomes: "conserved hypothetical". (4) Else: "putative uncharacterized protein".
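
The naming rules can be read as a small decision cascade. The sketch below is one possible reading, not the published implementation: the `UniProtHit` type, the function names, and the way "hypothetical" is detected (a substring test on the hit name) are all assumptions made for illustration, and the slide does not say what happens when a hypothetical hit has no InterPro domain (here it falls through to the later rules).

```python
from typing import NamedTuple, Optional


class UniProtHit(NamedTuple):
    """Hypothetical container for the best UniProt BLAST hit of a gene."""
    name: str
    identity: float  # percent amino-acid identity
    evalue: float


def is_significant(hit: Optional[UniProtHit]) -> bool:
    # Thresholds from the slide: identity > 91% and E < 10^-9.
    return hit is not None and hit.identity > 91.0 and hit.evalue < 1e-9


def assign_name(hit: Optional[UniProtHit],
                interpro_domain: Optional[str] = None,
                seen_in_other_genomes: bool = False) -> str:
    if is_significant(hit):
        if "hypothetical" not in hit.name.lower():
            return hit.name          # rule 1: take the UniProt name
        if interpro_domain:
            return interpro_domain   # rule 2: fall back to the InterPro domain
    if seen_in_other_genomes:
        return "conserved hypothetical"        # rule 3
    return "putative uncharacterized protein"  # rule 4


print(assign_name(UniProtHit("DNA gyrase subunit A", 98.2, 1e-120)))
print(assign_name(UniProtHit("hypothetical protein", 95.0, 1e-40), "ABC transporter domain"))
print(assign_name(None, seen_in_other_genomes=True))
print(assign_name(None))
```

The four calls exercise the four rules in order; the protein and domain names are invented examples.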

16 Preliminary Results: BLAST AND DATABASE CREATED

17 Blastp. Goals: find hits with identity > 91% and E < 10^-9, then assign a name. Database: build a database from organisms related to H. haemolyticus, from UniProt. Sanity check: Pasteurellaceae family.

21 Summary: 1030 significant hits in H. influenzae (76%). 10 significant hits in other organisms. 1 in 4 genes with no hits (313). Criteria: identity > 91%, E < 10^-9. Pasteurellaceae database: no extra hits. To come: BLAST against other databases.
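
As a quick consistency check on these counts, the arithmetic can be reproduced directly. The sketch assumes the three categories on the slide (hits in H. influenzae, hits in other organisms, no hits) partition the full gene set; the variable names are ours.

```python
hits_h_influenzae = 1030
hits_other_organisms = 10
no_hits = 313

# Assumption: the three categories do not overlap and cover every gene.
total_genes = hits_h_influenzae + hits_other_organisms + no_hits

pct_h_influenzae = round(100 * hits_h_influenzae / total_genes, 1)
pct_no_hits = round(100 * no_hits / total_genes, 1)

print(total_genes)       # 1353
print(pct_h_influenzae)  # 76.1
print(pct_no_hits)       # 23.1
```

Under that assumption the total is 1353 genes, which matches the denominator quoted on the CDD slide (57/1353), and "1 in 4" is a rounding of roughly 23%.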

22 Preliminary Results: INTERPRO SCAN

23 Degree Distribution of Database Hits

24 Sample InterProScan Output

25 InterPro Entry Link

26 Preliminary Results: CONSERVED DOMAIN DATABASE

27 Conserved Domain DB: Search by RPS-BLAST, which uses the query sequence to search a database of pre-calculated PSSMs. Helps give more information about function for genes that come up as putative, conserved, or unknown. Allows scripted data downloads.
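
To make the idea of searching against a PSSM concrete, the toy sketch below scores every window of a query against a single position-specific scoring matrix and reports the best one. It is not RPS-BLAST: the real tool uses seeding heuristics, gapped alignment, and E-value statistics. The three-column matrix, its scores, and the default penalty are invented for illustration.

```python
DEFAULT_SCORE = -2  # hypothetical penalty for residues not listed in a column


def score_window(pssm, window):
    """Sum the per-position scores of `window` against `pssm`."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(pssm, window))


def best_match(pssm, seq):
    """Return (start, score) of the highest-scoring ungapped window, or None."""
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        score = score_window(pssm, seq[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy PSSM: one dict of residue -> score per motif position.
toy_pssm = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 5, "T": 4},
]

print(best_match(toy_pssm, "MAGKSL"))  # (2, 14)
```

The window "GKS" at offset 2 scores 5 + 4 + 5 = 14, so it is reported as the best match.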

28 Sample output (GFF). 57/1353 genes have no hit in CDD.

29 Web Server Output

30 Preliminary Results: AB INITIO

31 SignalP

32 SignalP prediction: 1352 protein sequences in total. 238: predicted by the HMM to contain a signal peptide. 175: predicted by the neural network to contain a signal peptide.

33 LipoP: lipoproteins, 44/1352 predicted.

34 TMHMM: 275/1352 with at least one transmembrane helix predicted.

35 Preliminary Results: The Working Concept of Pathway Tools

36 Components: Pathway/Genome Navigator; Pathway/Genome Editors; PathoLogic.

37 Input and PGDB Schema: The input is the annotated genome of the organism: genome sequence, locations of identified genes, identified functions of gene products. Sequence -> FASTA format. Annotation -> GenBank or PathoLogic format. PGDB: representation of the input file.

38 Metabolic Pathway Prediction: 2 steps. 1. Creates the reactome using MetaCyc. 2. Imports the corresponding pathways from MetaCyc and then carries out a pruning process.

39 Operon Predictor: Identifies operon boundaries by examining pairs of adjacent genes A and B. Features considered: intergenic distance, functional relationship between A and B, membership in the same pathway, membership in the same multimeric protein complex, and so on.
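
The pairwise idea can be illustrated with a deliberately simplified rule. This is not Pathway Tools' actual model: it uses only two of the listed features (intergenic distance and shared pathway membership) plus a same-strand requirement, and the 50 bp gap threshold, the `Gene` type, and the gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str
    pathways: frozenset = frozenset()


def same_operon(a, b, max_gap=50):
    """Toy pair rule: same strand AND (small intergenic gap OR shared pathway)."""
    if a.strand != b.strand:
        return False
    gap = b.start - a.end - 1  # bases between the two genes
    return gap <= max_gap or bool(a.pathways & b.pathways)


def group_operons(genes, max_gap=50):
    """Walk genes in coordinate order; a failed pair test marks an operon boundary."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons and same_operon(operons[-1][-1], gene, max_gap):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return [[g.name for g in operon] for operon in operons]


genes = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 420, 900, "+"),                      # 19 bp after geneA
    Gene("geneC", 1500, 2000, "+", frozenset({"P1"})),  # far from geneB
    Gene("geneD", 2300, 2800, "+", frozenset({"P1"})),  # shares pathway P1
    Gene("geneE", 2810, 3000, "-"),                     # opposite strand
]
print(group_operons(genes))
```

With these made-up coordinates the genes group as geneA+geneB (short gap), geneC+geneD (shared pathway), and geneE alone (strand change).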

40 Pathway Hole Filler: Pathway holes are seen where genes are under-annotated. PHFiller uses UniProt to identify potential hole fillers.
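
What counts as a "hole" can be shown with a small sketch: a reaction that belongs to a predicted pathway but has no gene assigned to it in the annotation. This covers only the detection step, not PHFiller's search for candidate fillers, and the pathway and reaction identifiers are invented.

```python
def find_pathway_holes(pathway_reactions, annotated_reactions):
    """Map each pathway to its reactions that lack an annotated enzyme.

    pathway_reactions: dict of pathway name -> ordered list of reaction IDs.
    annotated_reactions: set of reaction IDs with at least one gene assigned.
    Pathways without holes are omitted from the result.
    """
    holes = {}
    for pathway, reactions in pathway_reactions.items():
        missing = [r for r in reactions if r not in annotated_reactions]
        if missing:
            holes[pathway] = missing
    return holes


# Hypothetical predicted pathways and annotated reactions.
predicted = {
    "pathway_1": ["rxn_a", "rxn_b", "rxn_c"],
    "pathway_2": ["rxn_d", "rxn_e"],
}
annotated = {"rxn_a", "rxn_c", "rxn_d", "rxn_e"}

print(find_pathway_holes(predicted, annotated))  # {'pathway_1': ['rxn_b']}
```

Each reported hole would then be a candidate for a homology search to find the missing enzyme.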

41 Pathway Tools Cellular Overview

43 Next steps: On receiving complete gene prediction results, perform all the annotation steps on one strain, then automate the process and run it for the other five.