Genomics and High Performance Computing. Folker Meyer Argonne National Laboratory and University of Chicago



Brief intro
- I am a computer scientist turned computational biologist
- My CS friends tell me I am a biologist
- My BIO friends tell me I am a CS person
- So clearly take my comments with a grain of salt
- My research interests: metagenomics to study the microbial role in
  - geochemical cycling (climate, remediation)
  - human health (Human Microbiome)
- Strong emphasis on technology development to allow study: algorithms, integrations, tools, genomics tech

Why microbes? Source: Rob Knight, U. Colorado

Biology changed. From this
[Images: biology.png, biology.gif and biology.jpg, from http://www.the-aps.org/education/ http://www.ferrum.edu/majors/biology.jpg http://www.oneocean.org/ambassadors/track_a_turtle/biology]

to this. (in ~2003)

DNA technology advances FAST

Year | Technology | Read length | Mbp/hr | Mbp/run | Cost/run
1977 | Sanger | ~100 | ~0.0001 | ~0.001? |
1987 | ABI 370 | ~500 | 0.01 | 0.05 | ~$1,000
~2001 | ABI 3700 | ~1000 | 0.1 | 0.5 | $200
2005 | 454 pyroseq | 120-450 | 25-100 | 500 | $13,000
2007 | Solexa | 50-125bp (1) | ~1000 | 50,000-200,000 | $15,000
2007 | ABI Solid | 35-50bp | 75-250 | 6,000-20,000 | ~$5,000
2010 | Ion Torrent (2) | ~200 | 100 | 50 | $500
2010? | Helicos | 25-50 | ~50? | | $20,000
2010? | PacBio | 700 | 120 | 30 | $100

(1) Solexa paired-end now 200-220bp
(2) Machine cost ~$50k

Ongoing change

1) Democratization of data generation
- From factory to bench-top
- And >70% of Illumina machines go to small customers (1)
(1) From Illumina at 2010 GIA meeting

2) Data set size
- Instrument output went from <<1 GB to 200 GB in <5 years
- The 24-48 month trend looks interesting
- To put this into perspective:
  - All genomics knowledge (RefSeq) was 57 GB when I last checked
  - The large Global Ocean Survey study by Craig Venter in 2004 was 600 megabases
- When biologists speak of big data, it frequently fits on an iPod

3) Computing costs dominate
[Figure: bioinformatics cost vs. sequencing cost in $ against data set size in GB, for 454, GAIIx and HiSeq2000]
- 95 GB == 195,600 node hours (on Nehalem 8-core, 16 GB)
- Illumina HiSeq2000 = 2x100 GB/run, available today
- Cost is purely BLASTX (no storage or transfer cost) on Amazon EC2
- Source: Wilkening et al., Proceedings IEEE Cluster09, 2009
- Note: 10x or 100x improvements over BLASTX will help, but not solve the problem
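To make the node-hour figure concrete, the slide's single data point (95 GB == 195,600 node hours) can be scaled to a full HiSeq2000 run. This is a rough sketch only: it assumes BLASTX cost grows linearly with input size, and the function name is made up for illustration.

```python
# Data point quoted on the slide: 95 GB of reads == 195,600 node hours of BLASTX.
NODE_HOURS_FOR_95_GB = 195_600


def blastx_node_hours(data_gb, speedup=1.0):
    """Estimate node hours for data_gb of reads.

    Assumes linear scaling with input size (an assumption of this sketch);
    speedup models a faster similarity search tool.
    """
    return data_gb * NODE_HOURS_FOR_95_GB / 95.0 / speedup


hiseq_run_gb = 2 * 100  # HiSeq2000 = 2x100 GB/run, as stated on the slide

for speedup in (1, 10, 100):
    hours = blastx_node_hours(hiseq_run_gb, speedup)
    print(f"speedup {speedup:>3}x: {hours:>10,.0f} node hours per run")
```

Even under a 100x faster search, one run still needs several thousand node hours, which is the point of the "will help, but not solve" remark.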

What does genomics look like?

Example: Analysis of pathogen: B. subtilis

Sequencing a genome
- Where are we?
- Closing gaps between contigs
- Validating the sequence via a map

Genome sequencing is now routine Figures: Nikos Kyrpides DOE Joint Genome Institute, G.O.L.D.

Genome sequencing: a success story
- Genome assembly used to require large machines and significant manual effort
- More data and novel codes make this much easier: Velvet, Newbler, ALLPATHS, ...
- Bacterial genome sequencing and assembly sometimes in a day
- While some quality issues still exist, complete automation on low-cost machines is likely to happen in the next 12-24 months

A genome: CGGGGGAGCCCTCCAGAATACCCATCATATAGCCCCTGAGGTGGCATGGGATGTCTCCATGAGGGAACCCCTTCCCACTTCATACTGTC ACGTATATCATAGTGTTCTTGACTGGGCCATTCATCTAAGATGGGATTTACCCTGTGAAACAGGGAGAAGACTTATGGACCCCAAGCATCAT TTCAAGTTGAAGTTGAGTTTTTAAAAGCCATCCATGCAAAGTTCCTTTGCTTTGGACCCTCTGCATTATTAAAGCTGCTGTATTGCTAACCC AGAACTGCTCCAGTGTCTTGACTGATCATCATGGCTTCAGTTTGGAAGAGACTGCAGCGTGTGGGAAAACATGCATCCAAGTTCCAGTTTGT GGCCTCCTACCAGGAGCTCATGGTTGAGTGTACGAAGAAATGGTAACCAGATAAACTGGTGGTAGATGAAGACATGCAAAGTTTGGCTAGTT TGGTGAGTATGAAGCAGGCTGACATTGGCAATTTAGATGACTTCGAAGAAGATAATGAAGATGATGATGAGAACAGAGTGAACCAAGAAGAA AAGGCAGCTAAAATTACAGAGCTTATCAACAAACTTAACTTTTTGGATGAAGCAGAAAAGGACTTGGCCACCGTGAATTCAAATCCATTTGA TGATCCTGATGCTGCAGAATTAAATCCATTTGGAGATCCTGACTCAGAAGAACCTATCACTGAAACAGCTTCACCTAGAAAAACAGAAGACT CTTTTTATAATAACAGCTATAATCCCTTTAAAGAGGTGCAGACTCCACAGTATTTGAACCCATTCGATGAGCCAGAAGCATTTGTGACCATA AAGGATTCTCCTCCCCAGTCTACAAAAAGAAAAAATATAAGACCTGTGGATATGAGCAAGTACCTCTATGCTGATAGTTCTAAAACTGAAGC AGAGCTTAGTGATCTGAAGCGGGAGCCTGAACTACAACAGCCTATCAGCGGAGCGTGACAGGTACGTGATGCTAGCTTTTATCAGGCAGCGG TATGCGCGATCAATGCGCGCGGCTATATGATCTGCATGCGGCGCGATTACTCTTCGGAGCTTATTTCTGCGGCGGGCCGGGGGAGCCCTCCA GAATACCCATCATATAGCCCCTGAGGTGGCATGGGATGTCTCCATGAGGGAACCCCTTCCCACTTCATACTGTCACGTATATCATAGTGTTC TTGACTGGGCCATTCATCTAAGATGGGATTTACCCTGTGAAACAGGGAGAAGACTTATGGACCCCAAGCATCATTTCAAGTTGAAGTTGAGT TTTTAAAAGCCATCCATGCAAAGTTCCTTTGCTTTGGACCCTCTGCATTATTAAAGCTGCTGTATTGCTAACCCAGAACTGCTCCAGTGTCT TGACTGATCATCATGGCTTCAGTTTGGAAGAGACTGCAGCGTGTGGGAAAACATGCATCCAAGTTCCAGTTTGTGGCCTCCTACCAGGAGCT CATGGTTGAGTGTACGAAGAAATGGTAACCAGATAAACTGGTGGTAGATGAAGACATGCAAAGTTTGGCTAGTTTGGTGAGTATGAAGCAGG CTGACATTGGCAATTTAGATGACTTCGAAGAAGATAATGAAGATGATGATGAGAACAGAGTGAACCAAGAAGAAAAGGCAGCTAAAATTACA GAGCTTATCAACAAACTTAACTTTTTGGATGAAGCAGAAAAGGACTTGGCCACCGTGAATTCAAATCCATTTGATGATCCTGATGCTGCAGA ATTAAATCCATTTGGAGATCCTGACTCAGAAGAACCTATCACTGAAACAGCTTCACCTAGAAAAACAGAAGACTCTTTTTATAATAACAGCT ATAATCCCTTTAAAGAGGTGCAGACTCCACAGTATTTGAACCCATTCGATGAGCCAGAAGCATTTGTGACCATAAAGGATTCTCCTCCCCAG 
TCTACAAAAAGAAAAAATATAAGACCTGTGGATATGAGCAAGTACCTCTATGCTGATAGTTCTAAAACTGAAGCAGAGCTTAGTGATCTGAA GCGGGAGCCTGAACTACAACAGCCTATCAGCGGAGCGTGACAGGTACGTGATGCTAGCTTTTATCAGGCAGCGGTATGCGCGATCAATGCGC GCG

From genome to information: Annotation
Bioinformatics
Source: A. Becker, U of Freiburg, Germany

Annotation
In the old days:
- Find every possible gene
- Run every tool known to mankind: BLAST, HMMER, ...
- Against every known database: RefSeq, Pfam, InterPro, KEGG, COG, ...
- Have humans interpret the results
Several drawbacks:
- Computationally expensive: fixable with $$
- Requires lots of FTEs: fixable with $$$$
- Subjective factors come into play: fixable with standards? Still an open debate
HTGA

Resulting compute requirements
- Bacterial genomes contain ~1000 genes per megabase
- BLAST vs NCBI NR search takes >10 min per gene
- Annotations often require EC (Enzyme) numbers
- Protein domains (Pfam) help gain confidence
- Genomic context comparison ("Clusters")
- = 10,000 min + 10,000 min + 50,000 min + 200,000 min
- Clusters and Pfam have highest confidence; BLAST is viewed as error prone
- CPU investment and quality are correlated: more computing helps
- Most groups cannot pay the price
Source: Informal survey of ~20 manual annotators
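The sum on this slide can be worked through as a small calculation. The sketch below assumes the four figures are per megabase and map, in slide order, to BLAST, EC numbers, Pfam and Clusters; only the first figure (1000 genes x 10 min) is directly derivable from the slide, and the 4 Mbp genome size is a hypothetical example.

```python
# CPU minutes per megabase of bacterial genome. The assignment of each figure
# to a step follows slide order and is an assumption of this sketch.
CPU_MINUTES_PER_MBP = {
    "BLAST vs NCBI NR": 10_000,   # ~1000 genes/Mbp x >10 min per gene
    "EC numbers": 10_000,
    "Pfam domains": 50_000,
    "Clusters (genomic context)": 200_000,
}


def annotation_cpu_minutes(genome_mbp):
    """Total CPU minutes to annotate a genome of genome_mbp megabases."""
    return genome_mbp * sum(CPU_MINUTES_PER_MBP.values())


per_mbp = annotation_cpu_minutes(1)
print(f"per Mbp: {per_mbp:,} min = {per_mbp / 60:,.0f} CPU hours")

genome_mbp = 4  # hypothetical bacterial genome size
total = annotation_cpu_minutes(genome_mbp)
print(f"{genome_mbp} Mbp genome: {total / 60:,.0f} CPU hours "
      f"= {total / 60 / 24:,.0f} CPU days")
```

Under these assumptions a single bacterial genome costs on the order of CPU-years when every step is run, which is why most groups cannot pay the price.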

Things change.
- More sequences are being annotated
- Database grows
- Human annotator expertise shrinks relatively
[Figure: DNA space vs. relative annotator expertise (100%, 50%, 1%) over time: Sanger, WGS, Pyrosequencing]

Ecosystem is unsustainable
- As sequencing becomes so cheap, analysis is the bottleneck
- Community needs a scalable solution
- Many view standards as the solution:
  - It is unclear to me how standards alone can help this problem
- As we get more data, computes take longer and everything becomes more complex

Annotation: Another success story
- Many expensive solutions exist
- But there is also a novel approach: RAST server (Aziz et al., BMC Genomics, 2008)
  - Team of CS and Bio experts developed a novel approach
  - Subsystem technology
  - Combining domain knowledge from both areas
  - Integrates data curation and annotation using novel approaches requiring far fewer resources, with better accuracy
- Annotate several genomes in a day on a laptop
- Server has processed over 12k genomes since 2008
- Note: Works only for bacteria
  - Extension to other areas possible
- Try it: http://rast.nmpdr.org

The future..
- Bacterial genomics has become easy
- Larger genomes remain harder
- But plummeting sequencing cost will help traditional genomics
- Sequencer output is compressed to a few contigs
  - Via assembly to a fraction of the size
  - Human genome only has 20,000 genes
- Imagine what would happen w/o assembly
  - Every sequence a gene
- The next big thing: Metagenomics

Cost per base Source: Rob Knight, UColorado

Metagenomics needs the magic wand..
- == shotgun genomics applied directly to various environments: shotgun metagenomics
- != sequencing of BAC clones with env. DNA: functional metagenomics
- != sequencing single genes (16S rDNA): gene surveys
- What are they doing? Who are they?

Community Structure and Metabolism
Gene W. Tyson, Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar & Jillian F. Banfield
Nature, vol. 428, 4 March 2004, www.nature.com/nature

Size of flagship projects

Date | Metagenome | Size | Type
2004 | Acid Mine Drainage | 76 Mbp | Sanger
2004 | GOS-I | 700 Mbp | Sanger
2009 | ANL-Subsurface | 12 Gbp | Solexa 2x75 PE
2009 | JGI-Cow-rumen | 17 Gbp | Solexa
2010 | JGI-Cow-rumen | 255 Gbp | Solexa
2010 | MetaHIT | 500 Gbp | Solexa
2010 | DeepSoil | 70 Gbp | Solexa 180bp (125x2)
2010 | HMP | 5.7 Tbp | Solexa 100x2

Metagenome tasks
- Metabolic functions
- Community members
- Who is doing what
- Binning genes to genomes

Sync
We are here:
- 8492 metagenomes from >500 groups
- Over 10 GB per week (rapid growth)
- Many centers produce data

MG-RAST: metagenomics RAST server
- Open access, web based
- Users upload data sets
- >3500 data submitters
[Figures: data growing over time; Web UI]

Scaling up MG-RAST v2
- ~3,500 users (data submitters)
- ~200 daily users (>10 minutes)
- V2.0 was a typical bioinformatics app (next slide)
- Throughput was becoming a major problem: approaching 1 GB per week in late 2008
- Need a mechanism allowing more throughput

Technology choices (typical for BIO)
- Tightly integrated system
- Pleasantly parallel code
- NFS for data movement
- Central database server
- Workflow management via SGE
- Running on ~50 machines locally
- 40-node dual-PPC cluster shares NFS filesystem with all systems
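"Pleasantly parallel" here means that each read can be analyzed without looking at any other read, so the input can be cut into independent chunks. A minimal sketch of that pattern follows; the GC-content function is only a stand-in for a real per-read analysis such as a similarity search, and all names and the toy reads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def split_into_chunks(records, chunk_size):
    """Cut the record list into independent chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def analyze_chunk(chunk):
    """Stand-in analysis: GC fraction per read (no cross-read dependencies)."""
    return {name: (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
            for name, seq in chunk}


fasta = ">read1\nGGCC\n>read2\nATAT\n>read3\nAT\nGC\n"
chunks = split_into_chunks(parse_fasta(fasta), chunk_size=2)

# Chunks share no state, so they can run on any number of workers or machines.
results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    for partial in pool.map(analyze_chunk, chunks):
        results.update(partial)
```

Threads are used only to keep the sketch self-contained; the same chunking works across cluster nodes because no chunk needs data from another.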

Performance analysis
[Figure: run time in hours of MG-RAST jobs, large (avg. ~0.1 Gb) and small; a snapshot with little wait time]
- Most time is spent in SIMS
- Short jobs spend a long time in create jobs
- Careful analysis of all computations
- IO to CPU ratio determines suitability of platforms

Redesign workflow
- Enable use of distributed computing platforms
  - Including e.g. BG/P, EC2, Azure and local clusters
- Enable users to contribute resources
- Be robust, scalable, fault tolerant
- Enable replacement of algorithms with more efficient ones
- Enable support for staged database updates
- Built a prototype workflow engine: Argonne Workflow Environment (A.W.E.)

A.W.E.
[Diagram: AWE server (web server on Facebook's Tornado + SQLAlchemy, with db) splits an input fna into work units A, B, C; diverse AWE clients send work requests over a RESTful interface and return results]
- Single analysis operation results in a series of work units
- Client requests a lease on work, with a timeout
- Results reported to the server, with failures resulting in lease expiration
- REST interface scales well and minimizes prerequisites
- Client can size requests to local computational capacity
- Tested up to ~500 clients
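The lease mechanism described on this slide can be sketched as a small in-memory queue. This is not the A.W.E. code: the class, method names and chunk file names are hypothetical, and the REST layer is left out so that only the lease logic is shown. Time is passed in explicitly to keep the behaviour easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorkUnit:
    unit_id: int
    payload: str
    lease_owner: Optional[str] = None
    lease_expires: float = 0.0
    result: Optional[str] = None


class LeaseQueue:
    """Hands out work units under time-limited leases (illustrative only)."""

    def __init__(self, payloads: List[str], lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.units: Dict[int, WorkUnit] = {
            i: WorkUnit(i, p) for i, p in enumerate(payloads)
        }

    def request_work(self, client: str, max_units: int, now: float) -> List[WorkUnit]:
        """Lease up to max_units unfinished units; the client sizes the request."""
        granted = []
        for unit in self.units.values():
            if len(granted) >= max_units:
                break
            if unit.result is not None:
                continue  # already finished
            if unit.lease_owner is not None and unit.lease_expires > now:
                continue  # another client still holds a valid lease
            unit.lease_owner = client
            unit.lease_expires = now + self.lease_seconds
            granted.append(unit)
        return granted

    def report_result(self, client: str, unit_id: int, result: str, now: float) -> bool:
        """Accept a result only from the current holder of an unexpired lease."""
        unit = self.units[unit_id]
        if unit.result is not None or unit.lease_owner != client or unit.lease_expires <= now:
            return False
        unit.result = result
        return True

    def all_done(self) -> bool:
        return all(u.result is not None for u in self.units.values())


queue = LeaseQueue(["chunk-A.fna", "chunk-B.fna", "chunk-C.fna"], lease_seconds=60)
first = queue.request_work("client-1", max_units=2, now=0)    # leases A and B
second = queue.request_work("client-2", max_units=2, now=1)   # only C is free
ok_c = queue.report_result("client-2", second[0].unit_id, "done", now=10)
ok_a = queue.report_result("client-1", first[0].unit_id, "done", now=20)
# client-1 fails silently on B; its lease runs out at t=60 and B is re-issued.
retry = queue.request_work("client-2", max_units=2, now=61)
late = queue.report_result("client-1", retry[0].unit_id, "late", now=62)
ok_b = queue.report_result("client-2", retry[0].unit_id, "done", now=70)
```

A failed client needs no special handling: it simply never reports, and the expired lease makes the unit available to the next request.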

What will the future bring?

I lost my crystal ball
BUT:
- A lot more data seems a safe bet
- A lot more computing is certain
- The computing will be non-coupled codes (pleasantly parallel)
- I hope for better algorithms
- More standards, ah, best practice

M5 -- a metagenomics data sharing infrastructure for a democratized sequencing world M5 = metagenomics, metadata, meta-analysis, models, and metainfrastructure

M5: the initial goal
- Enable similarity searches against one non-redundant database covering: GenBank, RefSeq, KEGG, UniProt, SEED, IMG, eggNOG
- One computation can be used for all tools
- Store and transport similarities to avoid recomputing
- Proposed MTF: Metagenome Transport Format v0.1
- Benefit: compute once, use in ALL tools
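One way to picture a non-redundant database is to store each distinct protein sequence once and keep cross-references back to every source database that contains it. The sketch below keys sequences by an MD5 checksum; that choice, the function name and the accessions are assumptions for illustration, not details given in the talk.

```python
import hashlib
from collections import defaultdict


def build_nonredundant(records):
    """Collapse identical sequences from several databases into one entry.

    records: iterable of (source_db, accession, protein_sequence).
    Returns (nr, xrefs): checksum -> sequence, and
    checksum -> list of (source_db, accession).
    """
    nr = {}
    xrefs = defaultdict(list)
    for source_db, accession, sequence in records:
        normalized = sequence.strip().upper()
        key = hashlib.md5(normalized.encode("ascii")).hexdigest()
        nr.setdefault(key, normalized)
        xrefs[key].append((source_db, accession))
    return nr, dict(xrefs)


# Made-up accessions and sequences, for illustration only.
records = [
    ("RefSeq", "refseq_0001", "MKTAYIAKQR"),
    ("UniProt", "uniprot_0001", "mktayiakqr"),  # same protein, different source
    ("KEGG", "kegg_0001", "MSLNEQW"),
]
nr, xrefs = build_nonredundant(records)
print(f"{len(records)} source records -> {len(nr)} sequences to search against")
```

A similarity computed once against an entry in `nr` can then be translated into any of the source databases through `xrefs`, which is the "compute once, use in ALL tools" idea.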

M5: the long-term goal
- Establish community best practice
- It will also lead to the ability to outsource some computing tasks for the community
- Maintaining large metagenomic data sets will overwhelm all bio computing centers I know
- Searches against published metagenomes will become impossible if we don't establish a curated body of metagenomes

(A part of) the proposed M5 Platform
[Diagram labels: define; e.g. OLCF, ALCF, TeraGrid, OSG, HPC; Reference data set; TeraGrid; CLOUD; export MTF; Standards in Genomics; SOP/Workflow repository; IMG; MG-RAST; CAMERA; Xyz..]
USERS (large scale):
- Environmental metagenomics: GOS, TerraGenome
- Microbiota in human health: HMP
- Many smaller user groups

Back to Summary

Summary
- Overall, genomics is not limited by lack of cycles
- Lack of good codes and best practice is more limiting
- Adjusting to large data
- Will become important for HPC community to recognize good use of cycles
  - And help avoid stupid computes
- Remember Bioinformatics

Interesting novel questions
- Which microbes are where on the planet
  - Microbial weather
- Which genes are where
  - Gene migration patterns
- Combinations of genes
  - Where do pathways or clusters travel
- Integration of climate data
  - Predictive models
  - How will X impact the microbes

Evolution of computing infrastructure for BIO
[Figure: abundance of machines over time: before genomics, early genomics, genomics era, 2010]

Why should I care?
- Microbes determine the climate on the planet! (e.g. Falkowski et al., Science 2008)
- Microbes impact human health (e.g. obesity, Turnbaugh et al., Nature 2006)
- Computations are pleasantly parallel, but there are a lot of them
- Example: Oil spill, integrate gene inventory data with oil spill patterns