FUNCTIONAL ANNOTATION PRELIMINARY RESULTS. Compgenomics 2010

1 FUNCTIONAL ANNOTATION PRELIMINARY RESULTS Compgenomics 2010

2 Workflow: FIRST LEVEL OF ANNOTATION. Pathways: BLASTP, KEGG. OPERONS: BLASTP, OPERON_DB, DOOR. CONSENSUS SCRIPT. SECOND LEVEL OF ANNOTATION

3 Operon prediction Introduction

4 What is an operon? Operon: a family of co-regulated genes. Adjacent; same orientation; not separated by promoters/terminators; related functions. Strong selective pressure, so conserved. Knowledge of operon -> FUNCTION

5 Operon DB: database of predicted operons in microbial genomes. Computer prediction of operon structures. 500 genomes.

6 Computational Approach. Gene pair: adjacent, same strand, intergenic length separation. P(gene pair in operon) = 1 - P_chance, where P_chance = [P(conserved | D) x P(SN | S)] / P(conserved | S). D, S = sets of gene pairs. P_chance: P(conserved | S has homologs in other genomes)
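The gene-pair criteria on this slide can be sketched in Python as follows. This is a minimal illustration, not the Operon DB implementation: the `Gene` record, the `max_gap` threshold of 200 bp, and the example coordinates are all hypothetical, and `operon_probability` encodes one reading of the slide's formula (1 minus the chance term), which is an assumption.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def candidate_pairs(genes, max_gap=200):
    """Return adjacent, same-strand gene pairs with intergenic gap <= max_gap.

    max_gap is a hypothetical threshold; the slide does not state one.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # negative when the genes overlap
        if left.strand == right.strand and gap <= max_gap:
            pairs.append((left.name, right.name, gap))
    return pairs


def operon_probability(p_cons_given_d, p_sn_given_s, p_cons_given_s):
    """One reading of the slide: 1 - P(cons|D) * P(SN|S) / P(cons|S)."""
    if p_cons_given_s <= 0:
        raise ValueError("P(conserved | S) must be positive")
    p_chance = p_cons_given_d * p_sn_given_s / p_cons_given_s
    return 1.0 - p_chance


# Hypothetical genes, for illustration only.
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1050, 2000, "+"),
    Gene("geneC", 2100, 3000, "-"),
    Gene("geneD", 5000, 6000, "-"),
]
print(candidate_pairs(genes))
```

Only geneA/geneB qualify here: geneB/geneC are on opposite strands and geneC/geneD are too far apart under the assumed threshold.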

7 Algorithm. Identification of conserved pairs. Identification of orthologs using BLAST. Finding conserved gene clusters: Homology Teams software. Evolutionary distance: D(G1, G2) = [n(g1) + n(g2) - h(g1; G2) - h(g2; G1)] / [n(g1) + n(g2)]. Larger distance = greater probability of conservation
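A small sketch of the evolutionary distance, assuming that n(g) is the number of genes in a genome and h(g1; G2) is the number of genes in the first genome that have a homolog in the second. Those interpretations are not spelled out on the slide, and the function and parameter names are made up for illustration.

```python
def evolutionary_distance(n1, n2, h12, h21):
    """Fraction of genes in two genomes that lack a homolog in the other.

    n1, n2: gene counts of the two genomes (assumed meaning of n).
    h12:    genes of genome 1 with a homolog in genome 2 (assumed meaning of h).
    h21:    genes of genome 2 with a homolog in genome 1.
    """
    total = n1 + n2
    if total == 0:
        raise ValueError("at least one genome must contain genes")
    return (total - h12 - h21) / total


# Identical gene content gives 0.0; no shared genes gives 1.0.
print(evolutionary_distance(2000, 2000, 2000, 2000))
print(evolutionary_distance(2000, 1500, 0, 0))
```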

8 What is to be done? Genome library (*.faa & *.ptt files): N. meningitidis MC58, N. meningitidis Z2491, N. meningitidis FAM18, N. meningitidis 053442, N. meningitidis α14, N. gonorrhoeae FA 1090, N. gonorrhoeae NCCP11945. Query: predicted M13519.faa and M16917.faa, searched with blastp. faa = FASTA amino acid sequences; ptt = information about the function and coordinates

9 faa file
>gi ref YP_ UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase [Neisseria meningitidis FAM18]
MLQRTLAKSISVTGVGLHSGERVALTLHPAPENSGISFRRTDLDGEMGEQIKLTPYLINDTRLSSTIVTD
KGVRVGTIEHIMSALSAYGIDNALIELNAPEIPIMDGSSLPFIYLLQDAGVVDQKAQKRFLKILKPVEIK
EAGKWVRFTPYDGFKVTLTIEFDHPAFNRSSPTFEIDFAGKSYIDEIARARTFGFMHEVEMMRAHNLGLG
GNLNNAIVIDDTDVLNPEGLRYPDEFVRHKILDAIGDLYIVGHPIVGAFEGYKSGHAINNALLRAVLADE
TAYDRVEFADSDDLPDAFHELNIRNCG
>gi ref YP_ truncated pilin [Neisseria meningitidis FAM18]
MASSGVNNEIKDKKLSLWAKRQDGSVKWFCGLPVARTDKATDDVKAATANGTDDKINTKHLPSTCRDDSS
TGCIETPRADFKHFQKISRYRVLPESRQMAEKLRHSRKSGNLGLSAQKLIG
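For reference, a .faa file like the one above can be read with a few lines of Python. This is a minimal sketch using only the standard library; the function name `read_fasta` and the demo records are hypothetical, and a real pipeline might prefer an established parser.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dict.

    The header is stored without the leading '>'; wrapped sequence
    lines are joined into a single string.
    """
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record input, for illustration only.
demo = [">seq1 demo protein", "MLQRTL", "AKSISV", "", ">seq2", "MASSGV"]
records = read_fasta(demo)
lengths = {name: len(seq) for name, seq in records.items()}
print(lengths)
```

With a file on disk, the same function can be called as `read_fasta(open(path))`, since a file object iterates over its lines.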

10 ptt file: Neisseria meningitidis FAM18, complete genome. Columns: Product Name, Start, End, Strand, Length, Gi, GeneID, Locus, Locus_tag, COG(s). Rows: UDP-3-O, lpxc, NMC0001, COG0774M; truncated pilin, pils2, NMC0003, COG4969NU

11 Operon prediction BLAST information (*.blast file) Operon Prediction based on the algorithm

12 Database of prOkaryotic OpeRons (DOOR): No stand-alone version/code available. Can't be automated. New query runs not possible. Workflow: BLASTP, OPERONS, OPERON_DB, DOOR, CONSENSUS SCRIPT, SECOND LEVEL OF ANNOTATION

13 KEGG Introduction and preliminary Results

14 KEGG Kyoto Encyclopedia of Genes and Genomes KEGG/KAAS Applications Preliminary Results

15 KEGG: Visualizes the functions of proteins in a genome by mapping them onto biosynthetic pathways. Consists of many databases, of which the most relevant to our purposes are GENES and PATHWAY

16 KAAS: KEGG Automatic Annotation Server. Provides annotations through BLAST comparisons against the KEGG GENES database. KEGG GENES: database of functionally annotated genes. GENES contains 5.3 million genes from various genomes: 129 eukaryotic, 971 bacterial, 74 archaeal

17 KEGG PATHWAY database: Graphical representations of biosynthetic pathways. Predicts pathways by comparing the proteins found in a genome with reference pathways. Information in the GENES database is linked to the information in the PATHWAY database through KO identifiers. If pathways are incomplete, the missing proteins are visually detectable
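The idea of spotting missing proteins can also be expressed as a set comparison between the KO identifiers of a reference pathway and those assigned to a genome. The sketch below is illustrative only: the function names are made up, and the KO identifiers are placeholders rather than entries of any real pathway.

```python
def missing_in_pathway(reference_kos, genome_kos):
    """Return the reference-pathway KO identifiers absent from the genome."""
    return sorted(set(reference_kos) - set(genome_kos))


def pathway_completeness(reference_kos, genome_kos):
    """Return the fraction of reference KO identifiers found in the genome."""
    reference = set(reference_kos)
    if not reference:
        return 0.0
    return len(reference & set(genome_kos)) / len(reference)


# Placeholder identifiers, not a real KEGG pathway.
reference = ["K90001", "K90002", "K90003", "K90004"]
genome = ["K90001", "K90003", "K95555"]

print(missing_in_pathway(reference, genome))
print(pathway_completeness(reference, genome))
```

Running the same comparison for two genomes against one reference pathway would give a simple tabular counterpart to the visual comparison.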

18 KEGG: Why use it? It's a tool for visualizing proteins of a genome in biosynthetic pathways. Aids in checking annotations. Visually detect missing proteins. Can aid the comparative genomics group: comparisons of protein pathways between genomes

19 M13519 Preliminary Results

20 M13519 Preliminary Results

21 FAM18 Preliminary Results

22 Preliminary Results M13519 FAM18

23 Preliminary Results. Aim: to complete the biosynthetic pathways. Use pathways to: determine missing proteins; aid the comparative genomics group

24 First level of annotation Function prediction preliminary results

26 NEMESYS PRELIMINARY RESULTS

27 NeMeSys Database. Protein sequences from: Neisseria gonorrhoeae FA 1090, Neisseria gonorrhoeae NCCP11945, Neisseria lactamica ST640, Neisseria meningitidis, Neisseria meningitidis FAM18, Neisseria meningitidis MC58, Neisseria meningitidis NEM8013, Neisseria meningitidis Z2491, Neisseria meningitidis alpha14. Tables: BLASTP, NeMeSys Annotation, COG

28 M13519 (2,157 sequences): BLASTP against NeMeSys proteins. NeMeSys annotation fields: GENE ID, LABEL, TYPE, LENGTH, EVIDENCE, GENE, PRODUCT, CLASS, PRODUCT TYPE, LOCALIZATION, PUBMED ID. Total hits: 10,045. Best hits: 1,512 match the NeMeSys annotation table. 1,512 sequences annotated

29 LIPOP PRELIMINARY RESULTS

30 PRELIMINARY RESULTS LIPOP: Proteins cleaved by Signal Peptidase II are predicted as lipoproteins by LipoP. Lipoproteins were predicted in 69 genes. Other details, such as cleavage site location, scores, margin, and the amino acids present at +/-5 positions with respect to the cleavage site, are also displayed in the output. GFF format output tabulated as shown

32 UNIPROT PRELIMINARY RESULTS BlastP

33 The Process: Update UniProt database. Extract the SwissProt part. BLASTP: formatdb (input db, name, protein/nuc), then blastall (blastp, dbname, input, output, output format). Calculate similarity and coverage for hits. Parse files: keep the best hit with >40% identity and >80% coverage
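The final parsing step might look like the sketch below. It assumes 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) and defines coverage as the aligned fraction of the query; both are assumptions, since the slide does not say which output format or coverage definition was used. The function name and demo hits are hypothetical.

```python
import csv


def best_hits(blast_lines, query_lengths, min_identity=40.0, min_coverage=80.0):
    """Keep, per query, the highest-scoring hit above both thresholds.

    blast_lines:   iterable of tab-separated BLAST tabular lines (assumed format).
    query_lengths: {query_id: protein length}, e.g. from the query .faa file.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        identity = float(row[2])
        q_start, q_end = int(row[6]), int(row[7])
        bitscore = float(row[11])
        # Query coverage in percent (assumed definition of "coverage").
        coverage = 100.0 * (abs(q_end - q_start) + 1) / query_lengths[query]
        if identity <= min_identity or coverage <= min_coverage:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore, identity, coverage)
    return best


# Hypothetical hits, for illustration only.
demo = [
    "q1\tsA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\tsB\t50.0\t90\t45\t0\t1\t90\t1\t90\t1e-20\t90",
    "q2\tsC\t35.0\t100\t65\t0\t1\t100\t1\t100\t1e-5\t50",
    "q3\tsD\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t100",
]
hits = best_hits(demo, {"q1": 100, "q2": 100, "q3": 100})
print(hits)
```

Here the thresholds are applied before choosing the best hit, so a query whose top-scoring hit fails a threshold can still be annotated by a lower-scoring hit that passes. Applying the thresholds only to the single top hit is the other plausible reading of the slide.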

35 InterProScan Preliminary Results

36 How to run InterProScan? Precise installation of InterProScan (instructions on Wiki). Once in /iprscan/bin, the command to be used is: $ nohup ./iprscan -cli -i <name (path) of input sequence file> -o <name (path) of output file> -format <xml, html, raw, txt> -iprlookup -goterms -email <address for result notification> -seqtype <n,p> -verbose -h
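When several genomes need to be processed, the command line could be assembled in a script. The sketch below only builds the argument list from options that appear on the slide and prints it; the helper name `iprscan_command` and the file names are hypothetical, and nothing is executed.

```python
import shlex


def iprscan_command(input_path, output_path, fmt="raw", seqtype="p"):
    """Build the iprscan argument list from options shown on the slide."""
    if fmt not in {"xml", "html", "raw", "txt"}:
        raise ValueError("unsupported output format: " + fmt)
    if seqtype not in {"n", "p"}:
        raise ValueError("seqtype must be 'n' or 'p'")
    return [
        "./iprscan", "-cli",
        "-i", input_path,
        "-o", output_path,
        "-format", fmt,
        "-iprlookup", "-goterms",
        "-seqtype", seqtype,
    ]


# Hypothetical file names, for illustration only.
cmd = iprscan_command("M13519.faa", "M13519.raw")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd=...)` with the working directory set to the InterProScan bin directory.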

37 Output Formats HTML XML TXT RAW

38 GFF3 Format: obtained using /iprscan/bin/converter.pl: ./converter.pl -input <name (path) of iprscan RAW output file> -format <gff3, xml, ebixml, txt, html> > <name of output file.gff3>

39 How to use these files? Use XML output files to import in a database using various Perl modules Use TXT files for consensus scripts, since information is structured in a user-friendly format.

40 KEGG extra slides for presentation

41 Enzyme Commission (EC) numbers: EC numbers identify enzyme-catalyzed reactions, NOT enzymes. If several enzymes catalyze the same reaction, they receive the same EC number. Can be used to identify proteins on a metabolic pathway, but not on reference pathways such as cellular processes.

42 KEGG Orthology (KO): Orthologs are genes with common ancestry and the same function in different organisms. They tend to have a similar sequence and location in genomes. Using ortholog identifiers leads to a more specific classification system, since many related enzymes share the same EC number. KO identifiers link information in the GENES and PATHWAY databases

43 KEGG GENES

45 Preliminary Results. M13519: genes matched to KO numbers; 1125 genes not matched. FAM18: genes matched to KO numbers; 733 genes not matched
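Counts like these can be tallied from a gene-to-KO assignment listing. The sketch below assumes a two-column, tab-separated layout in which each line holds a gene identifier followed by a KO identifier, or by nothing when no KO was assigned. That layout, the function name, and the demo identifiers are assumptions for illustration.

```python
def count_ko_assignments(lines):
    """Return (matched, unmatched) gene counts from gene<TAB>KO lines."""
    matched = 0
    unmatched = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0].strip():
            continue  # skip blank lines
        if len(fields) > 1 and fields[1].strip():
            matched += 1
        else:
            unmatched += 1
    return matched, unmatched


# Hypothetical listing with placeholder identifiers.
demo = ["gene_0001\tK90001", "gene_0002", "gene_0003\tK90003", "", "gene_0004\t"]
print(count_ko_assignments(demo))
```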