iVirus: a Cyberinfrastructure for Viral Ecology

Overview:

The cost of sequencing has decreased more than a million-fold in the last several years, causing a rapid influx of molecular data associated with microbes in diverse environments, sampled across both time and space. These datasets are available through community metagenomic data repositories, yet integrating and analyzing them together with new data is cumbersome and often requires data duplication. Further, given the rapid development of new strategies for analyzing metagenomic datasets alongside advances in big-data computing, cutting-edge tools for viral ecology research remain the purview of bioinformatics-heavy labs or are scattered sparsely across code repositories. These challenges are magnified for labs without access to bioinformatics personnel or facilities. To meet the growing needs of the broader community of viral researchers, tools and important datasets need to co-exist in a common cyberinfrastructure. This is particularly important for advancing our knowledge of viral ecology given its technical challenges. Specifically: (i) viral metagenomes (viromes) are difficult to produce owing to low quantities of DNA and the specialized techniques required, (ii) the vast majority of viral proteins are unknown (usually >90% [1]), and (iii) new tools for comparative and functional metagenomics are rapidly developing. To meet these needs, viral datasets need to be shared in a common cyberinfrastructure that allows comparative metagenomic analyses across diverse environments to identify new genes and functions. Moreover, new tools need to be captured in the cyberinfrastructure, where they can be used immediately and continually developed and adapted by the community to keep pace with rapid tool innovation.
To bridge this gap, we have begun developing a cyberinfrastructure for viral ecology called iVirus that leverages the pre-existing cyberinfrastructure of the iPlant Collaborative. This effort lays the foundation for ongoing development of shared resources for viral datasets, metadata, and tools by the viral community. Specifically, we are actively working on projects to (1) collate important viral datasets in an iVirus Data Commons, (2) develop standardized ontologies to promote data discovery within the iPlant cyberinfrastructure, and (3) develop new Apps for existing metagenomic tools and pipelines within the iPlant cyberinfrastructure. The resulting comparative and functional metagenomic toolkit will be generalizable to viral ecologists across disciplines and will drive big data analytics spanning viruses, their hosts, and their environment.

About iPlant:

iPlant provides several technologies that help manage and share data, tools, analyses, and complicated software installations:

1. Discovery Environment: iPlant's system for managing and running tools, analyses, and workflows. The iPlant Discovery Environment (DE) connects the components of iPlant's cyberinfrastructure to create an easy-to-use web-based interface for analyzing data. The DE contains a growing list of integrated applications ("Apps") for scientific workflows and a mechanism for users to integrate their own applications.
2. Data Store: iPlant's cloud data store for managing and sharing large volumes of data. The store is accessible from anywhere, enabling easy access to your data.
3. Atmosphere: iPlant's cloud services for installing complicated software packages and obtaining on-demand computational power. Atmosphere lets you launch and manage instances of Linux-based servers with custom software stacks pre-installed.
4. Additional Web Resources: iPlant continues to build and deploy a variety of resources that are accessible through the web. These include iPlant's Foundational APIs, Agave, and a variety of systems built with iPlant collaborators to process image data, resolve taxonomic names, compare genomes, reconcile phylogenetic trees, and annotate gene models.

Signing up for an iPlant account:

1. To register for an iPlant account, go to the iPlant registration page.
2. Complete the information on the form.

3. The iPlant Support team will create your account and send you a confirmation email. Click on the validation link in the email to begin using your iPlant account.

Link to iVirus:
Link to iPlant:
Link to the iPlant Confluence site for documentation:
List of all available iPlant materials: Documentation%2C+and+User+Help+Options

iMicrobe and iVirus App Development:

Comparative and functional metagenomic tools are essential for discovering novel viral gene function, understanding host-virus interactions such as co-evolution, and determining evolutionary and phylogenetic relationships. Viral genomes contain a reservoir of enormous genetic diversity that they transmit to their hosts via horizontal gene transfer [2]. Because viruses encode and express host metabolism genes, they can confer niche-specific adaptations from one host to another [3–5]. Despite this incredible genetic diversity, viruses lack a conserved marker gene, such as the 16S rRNA gene in bacteria, for assessing species diversity in different environments. Moreover, more than 90% of viral metagenomic sequences frequently lack homology to proteins in public databases [1]. As such, new comparative metagenomic approaches are required to identify similarities and differences among viral communities in diverse environments. Strides have been made to develop new computationally efficient methods for comparative metagenomics [6], yet these tools and others need to be made available to the broader community in a nontechnical fashion.

With seed funding for a half-time graduate student from the College of Agriculture and Life Sciences at the University of Arizona, we initiated a pilot project over the past year called iMicrobe. As part of this project, 13 new microbe- and metagenome-specific Apps were developed in iPlant, expanding the toolkit already available there (Table 1). Over the upcoming year, we plan to extend this pilot study by adding tools and datasets specific to viruses and functional metagenomics across all stages of the metagenomic data analysis life cycle. Common steps for viral metagenomic analyses include: assembly, gene calling, taxonomic profiling, functional profiling, ecological interaction network construction, comparative metagenomics, and statistical tests for correlation between genomic content and environmental variables. Apps for assembly, gene calling, taxonomic profiling, and functional profiling have already been incorporated for iVirus through the iPlant infrastructure (Figure 1).

Table 1. Overview of useful viral software tools and data processing pipelines currently in iPlant for comparative and functional metagenomics.

Method        Tool          Description                                     Ref.
Assembly      ABySS         De novo sequence assembly                       7
Assembly      ALLPATHS-LG   De novo sequence assembly                       8
Assembly      khmer         Probabilistic de Bruijn graphs                  9
Assembly      Meta-IDBA     De Bruijn graph multiple alignments             10
Assembly      MetaVelvet    De Bruijn graph coverage and connectivity       11
Assembly      Newbler       De novo assembly based on read overlap          12
Assembly      SOAPdenovo    Single-genome assembler tuned for metagenomics  13
Assembly      SPA           Short peptide assembly for metagenomes          14
Assembly      Velvet        De Bruijn graph coverage and connectivity       15
Gene Calling  FragGeneScan  Ab initio gene prediction                       16
Gene Calling  Glimmer       Ab initio gene prediction                       17
Gene Calling  Prodigal      Ab initio gene prediction                       18
Gene Calling  MetaGene      Ab initio gene prediction                       19
Gene Calling  MetaGeneMark  Ab initio gene prediction                       20
Analyses      PCPipe        Protein clustering pipeline and annotation      1
Analyses      VirSorter     Find viral contigs in metagenome
Analyses      InterProScan  Protein domain identifier
Analyses      BLAST         Compare primary sequence information            21

Figure 1. An overview of assembly, gene prediction, and annotation Apps in iPlant that can be used for viral ecology research. *Miscellaneous tools, such as sequence format converters, can be applied to one App's results to make them compatible with another App's input.

The Protein Cluster Pipeline (PCPipe):

With the emergence of quantitative viral metagenomics (reviewed in [22]), inter-comparability across global viromics datasets is now possible even with varied library preparation methods and/or sequencing platforms [23]. Informatically, however, viral metagenomic analyses are stymied by the fact that they are dominated (up to 95%) by completely novel sequences (reviewed in [1]). One way forward is our PCPipe App, which derives protein clusters (PCs) from each new metagenome by matching ORFs against a growing PC database and self-clustering the left-over ORFs. New PCs are then annotated using the Similarity Matrix of Proteins (SIMAP) [24]. Moreover, the resulting PCs are a powerful organizing tool, as they provide (i) a universal diversity metric, something currently problematic owing to reliance on quantification derived from assembly output not yet tuned for metagenomic datasets, (ii) a stable scaffold for iterative functional annotations, and (iii) input units for ecological comparisons using new and expanding community tools (e.g., PC output can feed into QIIME, which is available in iPlant's Atmosphere). Such PCs have provided information on viral roles in ecosystem function and niche differentiation in the Pacific Ocean, and have yielded 456K PCs, doubling the number known prior to our work [1]. PCPipe was developed through the iMicrobe project and was the first high-performance compute pipeline built in the iPlant cyberinfrastructure. This work lays the foundation for future pipeline development for common viral metagenomic analyses. Further, because all code is open source and can be easily copied within iPlant, pipelines can be refined and re-used by the community in the iPlant cyberinfrastructure.

References:

1. Hurwitz, B. L. & Sullivan, M. B. The Pacific Ocean Virome (POV): A Marine Viral Metagenomic Dataset and Associated Protein Clusters for Quantitative Viral Ecology. PLoS One 8, e57355 (2013).

2. Breitbart, M. Marine Viruses: Truth or Dare. Ann. Rev. Mar. Sci. 4, (2012).
3. Hurwitz, B. L., Brum, J. R. & Sullivan, M. B. Depth-stratified functional and taxonomic niche specialization in the core and flexible Pacific Ocean Virome. ISME J (2014). doi: /ismej
4. Hurwitz, B. L., Hallam, S. J. & Sullivan, M. B. Metabolic reprogramming by viruses in the sunlit and dark ocean. Genome Biol. 14, R123 (2013).
5. Clokie, M. R. J. et al. Transcription of a photosynthetic T4-type phage during infection of a marine cyanobacterium. Environ. Microbiol. 8, (2006).
6. Hurwitz, B. L., Westveld, A. H., Brum, J. R. & Sullivan, M. B. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc. Natl. Acad. Sci. 111, (2014).
7. Simpson, J. T. et al. ABySS: A parallel assembler for short read sequence data. Genome Res. 19, (2009).
8. Butler, J. et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18, (2008).
9. Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. 109, (2012).
10. Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 27, i94–i101 (2011).
11. Namiki, T., Hachiya, T., Tanaka, H. & Sakakibara, Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 40, e155–e155 (2012).
12. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, (2005).
13. Li, Y., Hu, Y., Bolund, L. & Wang, J. State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum. Genomics 4, (2010).
14. Yang, Y. & Yooseph, S. SPA: A short peptide assembler for metagenomic data. Nucleic Acids Res. 41, 1–10 (2013).
15. Zerbino, D. R. & Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, (2008).

16. Rho, M., Tang, H. & Ye, Y. FragGeneScan: Predicting genes in short and error-prone reads. Nucleic Acids Res. 38, 1–12 (2010).
17. Aggarwal, G. & Ramaswamy, R. Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER. J. Biosci. 27, 7–14 (2002).
18. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
19. Noguchi, H., Park, J. & Takagi, T. MetaGene: Prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 34, (2006).
20. Zhu, W., Lomsadze, A. & Borodovsky, M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 38, e132–e132 (2010).
21. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, (1997).
22. Duhaime, M. B. & Sullivan, M. B. Ocean viruses: Rigorously evaluating the metagenomic sample-to-sequence pipeline. Virology 1–5 (2012).
23. Solonenko, S. A. et al. Sequencing platform and library preparation choices impact viral metagenomes. BMC Genomics 14, 320 (2013).
24. Rattei, T. SIMAP: the similarity matrix of proteins. Nucleic Acids Res. 34, D252–D256 (2006).
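
Illustrative sketch of PCPipe-style cluster assignment:

The two-step logic described in the Protein Cluster Pipeline section (match ORFs against an existing PC database, then self-cluster the left-over ORFs) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the PCPipe implementation: the function names, the `threshold` value, and the k-mer Jaccard similarity are all hypothetical stand-ins for whatever sequence-homology search and clustering method the actual pipeline uses.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets.

    Assumption: a toy stand-in for a real homology search.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def assign_orfs(
    orfs: Dict[str, str],
    pc_db: Dict[str, str],
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign ORFs to existing PCs, then self-cluster the leftovers.

    orfs:  ORF id -> protein sequence
    pc_db: PC id  -> representative protein sequence
    Returns (ORF id -> PC id, new PC id -> representative sequence).
    """
    assignments: Dict[str, str] = {}
    leftovers: List[str] = []

    # Step 1: match each ORF against the existing PC database.
    for orf_id, seq in orfs.items():
        best_pc, best_score = None, 0.0
        for pc_id, rep in pc_db.items():
            score = similarity(seq, rep)
            if score > best_score:
                best_pc, best_score = pc_id, score
        if best_pc is not None and best_score >= threshold:
            assignments[orf_id] = best_pc
        else:
            leftovers.append(orf_id)

    # Step 2: greedily self-cluster the ORFs that matched no existing PC.
    new_pcs: Dict[str, str] = {}
    for orf_id in leftovers:
        seq = orfs[orf_id]
        for pc_id, rep in new_pcs.items():
            if similarity(seq, rep) >= threshold:
                assignments[orf_id] = pc_id
                break
        else:
            # No new cluster is similar enough: this ORF founds one.
            pc_id = f"new_PC_{len(new_pcs) + 1}"
            new_pcs[pc_id] = seq
            assignments[orf_id] = pc_id

    return assignments, new_pcs


if __name__ == "__main__":
    pc_db = {"PC_1": "MKTAYIAKQRQISFVK"}
    orfs = {
        "orf_a": "MKTAYIAKQRQISFVK",  # identical to the PC_1 representative
        "orf_b": "GGSWLLPPHHDDEENN",  # novel
        "orf_c": "GGSWLLPPHHDDEENQ",  # near-identical to orf_b
        "orf_d": "CCCCVVVVTTTTYYYY",  # novel and unrelated
    }
    assignments, new_pcs = assign_orfs(orfs, pc_db)
    print(assignments)
    print(sorted(new_pcs))
```

In a growing-database workflow of the kind the text describes, the new clusters returned here would be merged into `pc_db` before the next metagenome is processed, so that later datasets can match against them.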