From Sequence to Knowledge

3 A helping hand through the annotation bottleneck. From Sequence to Knowledge: Assembly, Annotation, and Analysis of Phage Genomes from Isolated Phages and Metagenomic Data Sets. Ramy K. Aziz, Professor & Chair, Microbiology and Immunology, Faculty of Pharmacy, Cairo University

4 PRELUDE

5 A bit of history: Since 2009, the Genomics Workshop has become an essential part of the Evergreen phage meeting. The challenge is always how to meet needs and expectations that are so many and so diverse in ~4 hours. The answer is:

6 A bit of history

7 Past workshops

8 Past workshops 9 July 2018 Phage Genomics - VoM 2018

9 MOTIVATION

10 The analysis bottleneck. Observation: We generate more data than we can analyze; more precisely, we generate sequence data faster than we can analyze it. Opinion: Bottlenecks are not created equal! It is important to define the question(s) before working on the answer(s)!

11 The analysis bottleneck Roux et al. Submitted

12 The analysis bottleneck The Lavigne paradox (2013)

13 The analysis bottleneck The Lavigne paradox (2015)

14 Audience EXPECTATIONS

15 Attendees' expectations. Who (how many) among you have: annotated at least one phage genome? worked on a viral metagenome? used the command line (Unix, Linux, Mac Terminal) for sequence analysis? To optimize the content, let's take this survey on SOCRATIVE. Enter ROOM: AZIZ15

16 Activity: think, pair, share! Defining the question(s): Introduce yourself, your institution, and your favorite phage/virus. Do you have a genome sequenced? Planning to? Why have you sequenced your phage genome, or why do you want to sequence it? What is the single most pressing question you want answered from genome analysis? What are your top wishes for analysis tools that are not in the current programs?

17 Input from group activity

18 Begin with the end in mind DEFINING THE QUESTION(S)

19 What you want... is: from genome, from metagenome; complete, accurate; frameshift; incomplete, faulty assembly. Credits: Andrew Kropinski, Bas Dutilh.

21 A process of reconstruction

22 A process of reconstruction. Experimentally (DNA): TGATTGTGTGTTTGCGCAATGCG TGATTGGTCTNNNTCTCTTGCGCAATGCG ATGTGTATATATAGTGAGCTTGCCC GTCTCTCTNNNTCTCTTG

23 A process of reconstruction. Experimentally (DNA): TGATTGTGTGTTTGCGCAATGCG TGATTGGTCTNNNTCTCTTGCGCAATGCG ATGTGTATATATAGTGAGCTTGCCC GTCTCTCTNNNTCTCTTG. Computationally: TGATTGTGTGTTTGCGCAATGCG TGATTGGTCTNNNTCTCTTGCGCAATGCG ATGTGTATATATAGTGAGCTTGCCC GTCTCTCTNNNTCTCTTG

24 A process of reconstruction. Experimentally (eDNA): TGATTGTGTGTTTGCGCAATGCG TGATTGGTCTNNNTCTCTTGCGCAATGCG ATGTGTATATATAGTGAGCTTGCCC GTCTCTCTNNNTCTCTTG. Computationally (any phage one can get!): TGATTGTGTGTTTGCGCAATGCG TGATTGGTCTNNNTCTCTTGCGCAATGCG ATGTGTATATATAGTGAGCTTGCCC GTCTCTCTNNNTCTCTTG

25 What will be covered? 1. Annotation overview. 2. Using the RAST family for genome annotation: optimizing RAST for phages; command-line/batch options. 3. Introducing PATRIC and resources in development: therapeutic phage database; assembly; variation analysis; metagenome binning.

26 From Sequence to Knowledge: from raw sequence data to genome submission/publication. Assembly; validation; tRNA calling; orienting; fixing frameshifts; gene finding/ORF calling; annotation (assigning functions); introns and inteins; subsystem assignment; refinement/secondary annotation loop; special purpose (toxins, morons, integrases, lifestyle prediction); regulatory elements (promoters, terminators); output (files and graphics).

27 Countless tools

28 Authority figures Andrew Kropinski Rob Lavigne Rob Edwards

29 Authority figures Rodney Brister Bonnie Hurwitz

30 The Kropinski wisdom

31 The Kropinski wisdom. 1. Always use more than one tool. 2. Never blindly trust any automated (or manual) process. 3. Use your eyes and hands: visual inspection, manual proofreading, re-annotation. Every apparently complicated file is still editable in your favorite text editor (e.g., Notepad). 4. If you don't know a gene's function (if you can't convince your grandma), better keep it unnamed than contribute to error propagation.
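
Rule 1 can be made concrete by comparing the gene calls of two tools. The following is a minimal sketch, assuming each tool's output has already been parsed into (start, stop, strand) records. The `Gene` type, the `compare_calls` helper, and all coordinates are hypothetical and invented for illustration; matching calls on strand plus stop coordinate is one common convention, not something the slides prescribe.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # coordinate of the first base of the start codon
    stop: int    # coordinate of the last base of the stop codon
    strand: str  # "+" or "-"


def compare_calls(calls_a, calls_b):
    """Sort gene calls from two tools into agreement classes.

    Two calls are treated as the same gene when strand and stop coordinate
    match; the start codon is a place where gene finders often disagree.
    """
    a = {(g.strand, g.stop): g for g in calls_a}
    b = {(g.strand, g.stop): g for g in calls_b}
    shared = [k for k in a if k in b]
    return {
        "identical": [a[k] for k in shared if a[k].start == b[k].start],
        "different_start": [(a[k], b[k]) for k in shared if a[k].start != b[k].start],
        "only_a": [g for k, g in a.items() if k not in b],
        "only_b": [g for k, g in b.items() if k not in a],
    }


# Invented coordinates, for illustration only.
tool_a = [Gene(100, 400, "+"), Gene(500, 900, "+"), Gene(1500, 1000, "-")]
tool_b = [Gene(100, 400, "+"), Gene(530, 900, "+"), Gene(2000, 2300, "+")]

report = compare_calls(tool_a, tool_b)
for label, calls in report.items():
    print(label, calls)
```

Calls in the "different_start" and "only_*" classes are the ones worth inspecting by eye, in the spirit of rules 2 and 3.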

32 Material/Resources

33 Material/Resources. Data & links: slides; old tutorials (more detailed, but missing latest): Evergreen 2011 (by Karin Holmfeldt), Evergreen 2013, Evergreen 2015, VoM 2016, Evergreen 2017.

34 ANNOTATION OVERVIEW

35 Desired outcome: a well-characterized genome in which, ideally, we know: the location and function of all the genes; the location of promoters and terminators; the correct taxonomy (Viruses; dsDNA viruses, no RNA stage; Caudovirales; Siphoviridae; T1virus). [Figure: genome map with PstI sites, scale in kb]

36 Desired outcome: create a GenBank submission. A complete, accurate description of the genome and its taxonomy.

37 Desired outcome (2)

38 Desired outcome (3)

39 Desired outcome (4). Protein products of concern, particularly for those interested in phage therapy: integrases, toxins. [Figure: genome map with PstI sites, scale in kb]

40 Processes and Steps. I. Primary analysis (QC/pre-annotation proofreading, e.g., orient with BLASTN). II. Genome annotation: gene finding (ORF calling); automated annotation; massaging (editing, functional assignment). III. Second dimension (regulatory elements). IV. Comparative genomics. V. Metadata. VI. Visualization.
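
To illustrate what gene finding (ORF calling) means at its simplest, here is a naive six-frame ORF scan. It is only a sketch under stated assumptions: the start codons ATG/GTG/TTG and stop codons TAA/TAG/TGA follow the bacterial genetic code, the `min_len` cutoff is arbitrary, and the function names are hypothetical. Real gene finders, including those run inside RAST, use statistical models and will give different results.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement; characters other than ACGT pass through."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=90):
    """Return (start, end, strand) tuples, 1-based inclusive, in forward coordinates.

    For each stop codon, the ORF begins at the first start codon seen since
    the previous in-frame stop (i.e., the longest candidate is reported).
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif codon in STOPS:
                    if start is not None and i + 3 - start >= min_len:
                        if strand == "+":
                            orfs.append((start + 1, i + 3, "+"))
                        else:
                            # Map reverse-strand indices back to the forward strand.
                            orfs.append((n - (i + 3) + 1, n - start, "-"))
                    start = None
    return orfs


# Toy sequence, invented for illustration: one 15-nt ORF on the forward strand.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_len=15))
```

Running the same scan on `reverse_complement(toy)` reports the same coordinates on the "-" strand, which is also why orienting the genome first (step I) makes downstream comparisons easier.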

41 From Sequence to Knowledge: from raw sequence data to genome submission/publication. Assembly; tRNA calling; validation (segmenter); orienting; fixing frameshifts; gene finding/ORF calling; annotation (assigning functions); introns and inteins; subsystem assignment; refinement/secondary annotation loop; special purpose (toxins, morons, integrases, lifestyle prediction); regulatory elements (promoters, terminators); output (files and graphics).

42 Classification: the phage sequence space (Lima-Mendez et al.); the phage proteomic tree (Edwards & Rohwer); new: ViPTree.

43 II. Genome Annotation AUTOMATED ANNOTATION

44 RAST (subsystems-based tools) will be the major focus of this short tutorial. The goals are: 1. A quick demo of how to use RAST. 2. Optimizing RAST for phage annotation. 3. The new RAST implementation in the PATRIC database. 4. PATRIC features and future development.

45 Nomenclature sins. Naming a hypothetical protein "DNA polymerase" with no or poor-quality evidence is far worse than naming a DNA polymerase "hypothetical protein". Be cautious about using BLASTP hits in naming gps: is there additional evidence to back the designation up?

46 Consistent nomenclature. All of these describe homologs of the product of the coliphage T4 rIIA gene! rIIA protector from prophage-induced early lysis; protector from prophage-induced early lysis; protector from prophage-induced early lysis; rIIA membrane-associated affects host membrane ATPase; rIIA membrane-associated affects host membrane ATPase; phage rIIA lysis inhibitor; rIIA protector; rIIA; rIIA protein; membrane integrity protector; hypothetical protein; unnamed protein product!!!!!!; protein of unknown function.

47 What do I call my gene product (i.e., protein)? "phage hypothetical protein": redundant. "gp87" (gp = gene product); "hypothetical protein". "gp200" describes radically different proteins in Listeria, Enterococcus, Mycobacterium, Rhodococcus, Sphingomonas, Pseudomonas, Bacillus and Synechococcus phage genomes. Add /note="similar to gp43 of Escherichia coli phage T4".
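
The naming advice on this slide can be turned into a small check that runs over product names before submission. This is a sketch of one reading of the slide, not an official rule set: the `check_product` and `similarity_note` helpers are hypothetical, and treating a bare "gpNN" name as "hypothetical protein" plus a /note qualifier is an assumption drawn from the slide's example.

```python
import re

# Bare "gp" + number names (e.g., gp87, gp200) carry no functional meaning.
GP_NAME = re.compile(r"^gp\d+$", re.IGNORECASE)


def check_product(name):
    """Return (suggested_product, warning) for a product name; warning may be None."""
    cleaned = " ".join(name.split())
    if cleaned.lower() == "phage hypothetical protein":
        return "hypothetical protein", "redundant 'phage' prefix"
    if GP_NAME.match(cleaned):
        return "hypothetical protein", f"'{cleaned}' is ambiguous across phages; use a /note"
    return cleaned, None


def similarity_note(gp, phage):
    """Format a GenBank /note qualifier recording similarity to a named gene product."""
    return f'/note="similar to {gp} of {phage}"'


for name in ["phage hypothetical protein", "gp200", "DNA polymerase"]:
    print(name, "->", check_product(name))
print(similarity_note("gp43", "Escherichia coli phage T4"))
```

A check like this only flags names; deciding whether the evidence supports a functional name remains a manual step.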

49 Bottom line: manual vs. automated. "Turtles know the road better than rabbits" (Khalil Gibran), but they may never reach the end! The best approach? Human expert-based annotation.

50 IV. COMPARATIVE GENOMICS

51 Genomic pairwise comparisons: EMBOSS Stretcher (N.B. genomes must be collinear); BLASTN (NCBI); ANI (Average Nucleotide Identity); GGDC 2.0 (Genome-to-Genome Distance Calculator); JSpeciesWS ANI.
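
The tools on this slide all report some form of nucleotide identity. As a simplified illustration, the sketch below computes percent identity from two rows of an existing global alignment (such as a pairwise aligner's output). It is not an implementation of ANI or GGDC, which fragment genomes and apply their own formulas; the `percent_identity` name, the toy alignment, and the choice to count gap columns in the denominator are all assumptions.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent of alignment columns with identical bases.

    Columns where only one row has a gap count as mismatches;
    columns where both rows have a gap are ignored.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy alignment, invented for illustration: 4 matches in 6 columns.
print(round(percent_identity("ACGT-A", "ACCTTA"), 2))
```

Because the calculation assumes a single end-to-end alignment, it inherits the same caveat as Stretcher: the two genomes must be collinear.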

52 Proteomic pairwise comparisons: CoreGenes; TBLASTX. Remember: protein sequence is more conserved than DNA sequence, so these comparisons are probably more useful for more distant relationships.
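
Why protein sequence is more conserved than DNA can be shown with a toy example: synonymous changes alter the DNA but not the encoded protein. The sketch below uses the standard genetic code; the two short "genes" are invented for illustration and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Percent identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Invented sequences differing only at third codon positions.
gene_a = "ATGGCTCTGAAACGTTAA"
gene_b = "ATGGCACTTAAGCGCTAA"

print("DNA identity:    ", round(identity(gene_a, gene_b), 1))
print("Protein identity:", identity(translate(gene_a), translate(gene_b)))
```

Here four of eighteen nucleotides differ while the translated proteins are identical, which is the effect that translated searches such as TBLASTX exploit for more distant relatives.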

53 V. METADATA

54 Standards: coming soon (Roux et al.)

55 VI. POLISH IT TO PUBLISH IT

56 From Sequence to Knowledge: from raw sequence data to genome submission/publication. Assembly; tRNA calling; validation (segmenter); orienting; fixing frameshifts; gene finding/ORF calling; annotation (assigning functions); introns and inteins; subsystem assignment; refinement/secondary annotation loop; special purpose (toxins, morons, integrases, lifestyle prediction); regulatory elements (promoters, terminators); output (files and graphics).

57 Servers & software: BLAST Ring Image Generator; CGView; CGView Comparison Tool; Circos; DNAPlotter; Easyfig; GenomeVx; GView Server; progressiveMauve and ACT.

58 Easyfig

59 CGView Comparison Tool

60 BLAST Ring Image Generator