Genomics and High Performance Computing
Folker Meyer
Argonne National Laboratory and University of Chicago

Brief intro:
I am a computer scientist turned computational biologist
  My CS friends tell me I am a biologist
  My BIO friends tell me I am a CS person
  So clearly take my comments with a grain of salt
My research interests:
  Metagenomics to study the microbial role in
    geochemical cycling (climate, remediation)
    human health (human microbiome)
  Strong emphasis on technology development to allow study
    Algorithms, integrations, tools, genomics tech

Why microbes? Source: Rob Knight, U. Colorado

Biology changed. From this...
Image sources: http://www.the-aps.org/education/ http://www.ferrum.edu/majors/biology.jpg http://www.oneocean.org/ambassadors/track_a_turtle/biology

to this. (in ~2003)

DNA technology advances FAST
Year    Technology        Read length     Mbp/hr    Mbp/run           Cost/run
1977    Sanger            ~100            ~0.0001   ~0.001?
1987    ABI 370           ~500            0.01      0.05              ~$1,000
~2001   ABI 3700          ~1000           0.1       0.5               $200
2005    454 pyroseq       120-450         25-100    500               $13,000
2007    Solexa            50-125bp (1)    ~1000     50,000-200,000    $15,000
2007    ABI Solid         35-50bp         75-250    6,000-20,000      ~$5,000
2010    Ion Torrent (2)   ~200            100       50                $500
2010?   Helicos (3)       25-50           ~50?                        $20,000
2010?   PacBio            700             120       30                $100
1) Solexa paired-end now 200-220bp
2) Machine cost ~$50k
3) Only one throughput figure is given for Helicos; it is unclear whether it is Mbp/hr or Mbp/run
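To make the table easier to compare, here is a minimal sketch that turns the per-run figures into cost per megabase. It only uses rows where both Mbp/run and cost/run are single values in the table; the costs marked "~" are approximate, and the dictionary and function names are illustrative, not part of the talk.

```python
# Per-run figures copied from the table above: (Mbp per run, cost per run in USD).
# Rows with a missing, ranged or ambiguous value are left out on purpose.
RUNS = {
    "ABI 370 (1987)": (0.05, 1_000),      # cost is approximate (~$1,000)
    "ABI 3700 (~2001)": (0.5, 200),
    "454 pyroseq (2005)": (500, 13_000),
    "Ion Torrent (2010)": (50, 500),
    "PacBio (2010?)": (30, 100),
}


def cost_per_mbp(mbp_per_run, cost_per_run):
    """Return the sequencing cost in USD per megabase for one run."""
    return cost_per_run / mbp_per_run


if __name__ == "__main__":
    for name, (mbp, cost) in RUNS.items():
        print(f"{name:20s} ${cost_per_mbp(mbp, cost):,.2f} per Mbp")
```

Under these numbers the cost per megabase drops from hundreds of dollars on the ABI 3700 to tens of dollars on 454, which is the trend the slide is pointing at.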

Ongoing change

1) Democratization of data generation
  From factory to bench-top
  And >70% of Illumina machines go to small customers (1)
(1) Source: Illumina at the 2010 GIA meeting

2) Data set size
  Instrument output went from <<1 GB to 200 GB in <5 years
  The 24-48 month trend looks interesting
To put this into perspective:
  All genomics knowledge (RefSeq) was 57 GB when I last checked
  The large Global Ocean Survey study by Craig Venter in 2004 was 600 megabases
  When biologists speak of big data, it frequently fits on an iPod

3) Computing costs dominate
[Figure: bioinformatics cost vs. sequencing cost in $ as data set size in GB grows, for 454, GAIIx and HiSeq2000]
  95 GB == 195,600 node hours (on Nehalem 8-core, 16 GB)
  Illumina HiSeq2000 = 2x100 GB/run, available today
  Cost is purely BLASTX (no storage or transfer cost) on Amazon EC2
  Source: Wilkening et al., Proceedings IEEE Cluster09, 2009
  Note: 10x or 100x improvements over BLASTX will help, but will not solve the problem
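A back-of-the-envelope sketch of the BLASTX figure above. It assumes node hours scale linearly with data size, which the slide does not state, and the price per node hour is a placeholder parameter rather than an actual EC2 rate.

```python
# Figure from the slide: BLASTX on 95 GB took 195,600 node hours.
REFERENCE_GB = 95
REFERENCE_NODE_HOURS = 195_600


def node_hours(data_gb):
    """Estimate node hours for data_gb, assuming linear scaling (an assumption)."""
    return REFERENCE_NODE_HOURS / REFERENCE_GB * data_gb


def blastx_cost(data_gb, usd_per_node_hour, speedup=1.0):
    """Estimate compute cost; usd_per_node_hour is a hypothetical price."""
    return node_hours(data_gb) * usd_per_node_hour / speedup


if __name__ == "__main__":
    run_gb = 200  # one HiSeq2000 run, 2x100 GB as on the slide
    print(f"{node_hours(run_gb):,.0f} node hours for one {run_gb} GB run")
    for speedup in (1, 10, 100):
        cost = blastx_cost(run_gb, usd_per_node_hour=1.0, speedup=speedup)
        print(f"speedup {speedup:>3}x: ${cost:,.0f} at a placeholder $1 per node hour")
```

Even with a 100x faster search, one run still needs several thousand node hours under these assumptions, which matches the slide's note that faster codes help but do not make the problem go away.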

What does genomics look like?

Example: Analysis of pathogen: B. subtilis

Sequencing a genome
  Where are we?
  Closing gaps between contigs
  Validating the sequence via a map

Genome sequencing is now routine
Figures: Nikos Kyrpides, DOE Joint Genome Institute, G.O.L.D.

Genome sequencing: a success story
  Genome assembly used to require large machines and significant manual effort
  More data and novel codes make this much easier
    Velvet, Newbler, ALLPATHS, ...
  Bacterial genome sequencing and assembly is sometimes done in a day
  While some quality issues still exist, complete automation on a low-cost machine is likely to happen in the next 12-24 months

A genome: CGGGGGAGCCCTCCAGAATACCCATCATATAGCCCCTGAGGTGGCATGGGATGTCTCCATGAGGGAACCCCTTCCCACTTCATACTGTC ACGTATATCATAGTGTTCTTGACTGGGCCATTCATCTAAGATGGGATTTACCCTGTGAAACAGGGAGAAGACTTATGGACCCCAAGCATCAT TTCAAGTTGAAGTTGAGTTTTTAAAAGCCATCCATGCAAAGTTCCTTTGCTTTGGACCCTCTGCATTATTAAAGCTGCTGTATTGCTAACCC AGAACTGCTCCAGTGTCTTGACTGATCATCATGGCTTCAGTTTGGAAGAGACTGCAGCGTGTGGGAAAACATGCATCCAAGTTCCAGTTTGT GGCCTCCTACCAGGAGCTCATGGTTGAGTGTACGAAGAAATGGTAACCAGATAAACTGGTGGTAGATGAAGACATGCAAAGTTTGGCTAGTT TGGTGAGTATGAAGCAGGCTGACATTGGCAATTTAGATGACTTCGAAGAAGATAATGAAGATGATGATGAGAACAGAGTGAACCAAGAAGAA AAGGCAGCTAAAATTACAGAGCTTATCAACAAACTTAACTTTTTGGATGAAGCAGAAAAGGACTTGGCCACCGTGAATTCAAATCCATTTGA TGATCCTGATGCTGCAGAATTAAATCCATTTGGAGATCCTGACTCAGAAGAACCTATCACTGAAACAGCTTCACCTAGAAAAACAGAAGACT CTTTTTATAATAACAGCTATAATCCCTTTAAAGAGGTGCAGACTCCACAGTATTTGAACCCATTCGATGAGCCAGAAGCATTTGTGACCATA AAGGATTCTCCTCCCCAGTCTACAAAAAGAAAAAATATAAGACCTGTGGATATGAGCAAGTACCTCTATGCTGATAGTTCTAAAACTGAAGC AGAGCTTAGTGATCTGAAGCGGGAGCCTGAACTACAACAGCCTATCAGCGGAGCGTGACAGGTACGTGATGCTAGCTTTTATCAGGCAGCGG TATGCGCGATCAATGCGCGCGGCTATATGATCTGCATGCGGCGCGATTACTCTTCGGAGCTTATTTCTGCGGCGGGCCGGGGGAGCCCTCCA GAATACCCATCATATAGCCCCTGAGGTGGCATGGGATGTCTCCATGAGGGAACCCCTTCCCACTTCATACTGTCACGTATATCATAGTGTTC TTGACTGGGCCATTCATCTAAGATGGGATTTACCCTGTGAAACAGGGAGAAGACTTATGGACCCCAAGCATCATTTCAAGTTGAAGTTGAGT TTTTAAAAGCCATCCATGCAAAGTTCCTTTGCTTTGGACCCTCTGCATTATTAAAGCTGCTGTATTGCTAACCCAGAACTGCTCCAGTGTCT TGACTGATCATCATGGCTTCAGTTTGGAAGAGACTGCAGCGTGTGGGAAAACATGCATCCAAGTTCCAGTTTGTGGCCTCCTACCAGGAGCT CATGGTTGAGTGTACGAAGAAATGGTAACCAGATAAACTGGTGGTAGATGAAGACATGCAAAGTTTGGCTAGTTTGGTGAGTATGAAGCAGG CTGACATTGGCAATTTAGATGACTTCGAAGAAGATAATGAAGATGATGATGAGAACAGAGTGAACCAAGAAGAAAAGGCAGCTAAAATTACA GAGCTTATCAACAAACTTAACTTTTTGGATGAAGCAGAAAAGGACTTGGCCACCGTGAATTCAAATCCATTTGATGATCCTGATGCTGCAGA ATTAAATCCATTTGGAGATCCTGACTCAGAAGAACCTATCACTGAAACAGCTTCACCTAGAAAAACAGAAGACTCTTTTTATAATAACAGCT ATAATCCCTTTAAAGAGGTGCAGACTCCACAGTATTTGAACCCATTCGATGAGCCAGAAGCATTTGTGACCATAAAGGATTCTCCTCCCCAG 
TCTACAAAAAGAAAAAATATAAGACCTGTGGATATGAGCAAGTACCTCTATGCTGATAGTTCTAAAACTGAAGCAGAGCTTAGTGATCTGAA GCGGGAGCCTGAACTACAACAGCCTATCAGCGGAGCGTGACAGGTACGTGATGCTAGCTTTTATCAGGCAGCGGTATGCGCGATCAATGCGC GCG

From genome to information: Annotation
Bioinformatics
Source: A. Becker, U of Freiburg, Germany

Annotation
In the old days:
  Find every possible gene
  Run every tool known to mankind
    BLAST, HMMer, ...
  Against every known database
    RefSeq, PFAM, InterPro, KEGG, COG, ...
  Have humans interpret the results
Several drawbacks:
  Computationally expensive: fixable with $$
  Requires lots of FTEs: fixable with $$$$
  Subjective factors come into play: fixable with standards?
    Still an open debate
HTGA

Resulting compute requirements
  Bacterial genomes contain ~1000 genes per megabase
  A BLAST search vs. NCBI NR takes >10 min per gene
  Annotations often require EC (Enzyme) numbers
  Protein domains (Pfam) help gain confidence
  Genomic context comparison (Clusters)
  = 10,000 min + 10,000 min + 50,000 min + 200,000 min
  Clusters and Pfam have the highest confidence; BLAST is viewed as error-prone
CPU investment and quality are correlated
  More computing helps
  Most groups cannot pay the price
Source: Informal survey of ~20 manual annotators
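The sum on this slide can be worked through as follows. This is a sketch: pairing the four terms with the four analysis steps in the order they are listed is my reading of the slide, and the function name is illustrative.

```python
# Minutes of CPU time per megabase of bacterial genome (~1000 genes),
# paired with the steps in the order listed on the slide (an assumption).
MINUTES_PER_MBP = {
    "BLAST vs. NCBI NR": 10_000,   # ~1000 genes x >10 min per gene
    "EC numbers": 10_000,
    "Pfam domains": 50_000,
    "Clusters (genomic context)": 200_000,
}

TOTAL_MINUTES_PER_MBP = sum(MINUTES_PER_MBP.values())


def cpu_days(genome_mbp):
    """CPU days needed to annotate a genome of genome_mbp megabases."""
    return TOTAL_MINUTES_PER_MBP * genome_mbp / (60 * 24)


if __name__ == "__main__":
    print(f"{TOTAL_MINUTES_PER_MBP:,} CPU minutes per Mbp")
    print(f"{cpu_days(1):.1f} CPU days per Mbp")
    print(f"{cpu_days(4):.1f} CPU days for a 4 Mbp genome")
```

At roughly half a CPU year per megabase, a typical bacterial genome of a few megabases needs CPU years of work, which is why most groups cannot pay the price.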

Things change.
  More sequences are being annotated
  Database grows
  Human annotator expertise shrinks relatively
[Figure: DNA space vs. relative annotator expertise over time (Sanger, WGS, pyrosequencing); axis marks at 100%, 50% and 1%]

Ecosystem is unsustainable
  As sequencing becomes so cheap, analysis is the bottleneck
  Community needs a scalable solution
  Many view standards as the solution:
    It is unclear to me how standards alone can help with this problem
  As we get more data, computes take longer and everything becomes more complex

Annotation: Another success story
  Many expensive solutions exist
  But there is also a novel approach:
    RAST server (Aziz et al., BMC Genomics, 2008)
    A team of CS and Bio experts developed a novel approach
      Subsystem technology
      Combining domain knowledge from both areas
    Integrates data curation and annotation using novel approaches
      requiring far fewer resources, with better accuracy
    Annotate several genomes in a day on a laptop
    The server has processed over 12k genomes since 2008
  Note: Works only for bacteria
    Extension to other areas is possible
  Try it: http://rast.nmpdr.org

The future..
  Bacterial genomics has become easy
  Larger genomes remain harder
    But plummeting sequencing cost will help traditional genomics
  Sequencer output is compressed to a few contigs
    Via assembly to a fraction of the size
    The human genome only has 20,000 genes
  Imagine what would happen w/o assembly
    Every sequence a gene
  The next big thing: Metagenomics

Cost per base
Source: Rob Knight, U. Colorado

Metagenomics needs the magic wand..
  Shotgun metagenomics == shotgun genomics applied directly to various environments
  != functional metagenomics (sequencing of BAC clones with env. DNA)
  != gene surveys (sequencing single genes, 16S rDNA)
What are they doing? Who are they?
[Figure: data]

Community Structure and Metabolism
Gene W. Tyson, Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar & Jillian F. Banfield
Nature, vol. 428, 4 March 2004, www.nature.com/nature

Size of flagship projects
Date   Metagenome           Size       Type
2004   Acid Mine Drainage   76 Mbp     Sanger
2004   GOS-I                700 Mbp    Sanger
2009   ANL-Subsurface       12 Gbp     Solexa 2x75 PE
2009   JGI-Cow-rumen        17 Gbp     Solexa
2010   JGI-Cow-rumen        255 Gbp    Solexa
2010   MetaHIT              500 Gbp    Solexa
2010   DeepSoil             70 Gbp     Solexa 180bp (125x2)
2010   HMP                  5.7 Tbp    Solexa 100x2

Metagenome tasks
  Metabolic functions
  Community members
  Who is doing what
    binning genes to genomes

Sync
We are here:
  8492 metagenomes from >500 groups
  Over 10 GB per week (rapid growth)
  Many centers produce data

MG-RAST
  Metagenomics RAST server
  Open access, web based
  Users upload data sets
  >3500 data submitters
[Figure: data growing over time; Web UI]

Scaling up MG-RAST v2
  ~3,500 users (data submitters)
  ~200 daily users (>10 minutes)
  V2.0 was a typical bioinformatics app (next slide)
  Throughput was becoming a major problem
    approaching 1 GB per week in late 2008
  Need a mechanism allowing more throughput

Technology choices (typical for BIO)
  Tightly integrated system
  Pleasantly parallel code
  NFS for data movement
  Central database server
  Workflow management via SGE
  Running on ~50 machines locally
  40-node dual-PPC cluster shares the NFS filesystem with all systems

Performance analysis
[Figure: run time in hours of MG-RAST jobs, large (avg. ~0.1 Gb) and small; a snapshot with little wait time]
  Most time is spent in SIMS
  Short jobs spend a long time in create jobs
  Careful analysis of all computations
  IO to CPU ratio determines suitability of platforms

Redesign workflow
  Enable use of distributed computing platforms
    Including e.g. BG/P, EC2, Azure and local clusters
  Enable users to contribute resources
  Be robust, scalable, fault tolerant
  Enable replacement of algorithms with more efficient ones
  Enable support for staged database updates
Built a prototype workflow engine
  Argonne Workflow Environment (A.W.E.)

A.W.E.
[Figure: AWE server (webserver + db, built on Facebook's Tornado + SQLAlchemy, RESTful); work units A, B, C (fna); diverse AWE clients send work requests and return results]
  A single analysis operation results in a series of work units
  A client requests a lease on work, with a timeout
  Results are reported to the server, with failures resulting in lease expiration
  The REST interface scales well and minimizes prerequisites
  A client can size requests to local computational capacity
  Tested up to ~500 clients
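The lease idea can be illustrated with a small in-memory sketch. This is not the A.W.E. code or its REST API: the class name LeaseQueue, its methods and the injectable clock are all hypothetical, chosen only to show how leases with timeouts give fault tolerance without the server tracking client health.

```python
import itertools
import time
from collections import deque


class LeaseQueue:
    """Hypothetical in-memory stand-in for a server leasing out work units."""

    def __init__(self, work_units, clock=time.monotonic):
        self._pending = deque(work_units)
        self._leases = {}   # lease id -> (work unit, expiry time)
        self.results = {}   # work unit -> reported result
        self._ids = itertools.count(1)
        self._clock = clock

    def _reclaim_expired(self):
        """Put work units whose lease has timed out back into the queue."""
        now = self._clock()
        for lease_id, (unit, expiry) in list(self._leases.items()):
            if expiry <= now:
                del self._leases[lease_id]
                self._pending.append(unit)

    def checkout(self, max_units, timeout):
        """Lease up to max_units; a client sizes this to its own capacity."""
        self._reclaim_expired()
        granted = []
        while self._pending and len(granted) < max_units:
            unit = self._pending.popleft()
            lease_id = next(self._ids)
            self._leases[lease_id] = (unit, self._clock() + timeout)
            granted.append((lease_id, unit))
        return granted

    def report(self, lease_id, result):
        """Store a result if the lease is still valid; return True on success."""
        self._reclaim_expired()
        lease = self._leases.pop(lease_id, None)
        if lease is None:
            return False  # lease expired (or unknown): the work was re-queued
        unit, _expiry = lease
        self.results[unit] = result
        return True

    def done(self):
        """True once nothing is pending and no lease is outstanding."""
        return not self._pending and not self._leases
```

A client that crashes simply never reports; once its lease times out, the work unit goes back into the queue and another client picks it up. That keeps the server logic small, which fits the slide's point about minimizing prerequisites.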

What will the future bring?

I lost my crystal ball
BUT:
  A lot more data seems a safe bet
  A lot more computing is certain
    The computing will be non-coupled codes (pleasantly parallel)
  I hope for better algorithms
  More standards, ah, best practice

M5 -- a metagenomics data sharing infrastructure for a democratized sequencing world M5 = metagenomics, metadata, meta-analysis, models, and metainfrastructure

M5: the initial goal
  Enable similarity searches against one non-redundant database covering:
    GenBank, RefSeq, KEGG, UniProt, SEED, IMG, eggNOG
  One computation can be used for all tools
  Store and transport similarities to avoid recomputing
    Proposed MTF: Metagenome Transport Format v0.1
  Benefit: compute once, use in ALL tools
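One way to picture "compute once, use in ALL tools" is a non-redundant table in which each distinct protein sequence is stored once and cross-referenced to every source database that contains it. The sketch below keys sequences by an MD5 checksum; that choice, the function names and the record layout are assumptions for illustration. The slide does not specify how the database is built, and the MTF format is not modeled here.

```python
import hashlib
from collections import defaultdict


def build_nonredundant(records):
    """Collapse identical sequences from several databases into one entry.

    records: iterable of (source_db, identifier, protein_sequence) tuples.
    Returns (sequences, xrefs), where sequences maps checksum -> sequence and
    xrefs maps checksum -> list of (source_db, identifier).
    """
    sequences = {}
    xrefs = defaultdict(list)
    for source_db, identifier, sequence in records:
        normalized = sequence.strip().upper()
        key = hashlib.md5(normalized.encode("ascii")).hexdigest()
        sequences[key] = normalized
        xrefs[key].append((source_db, identifier))
    return sequences, dict(xrefs)


def identifiers_for(key, xrefs, source_db):
    """Translate one similarity hit into the identifiers of a chosen database."""
    return [ident for db, ident in xrefs.get(key, []) if db == source_db]


if __name__ == "__main__":
    # Made-up identifiers and sequences, for illustration only.
    records = [
        ("RefSeq", "refseq_0001", "MKVLAAGIT"),
        ("KEGG", "kegg_0042", "mkvlaagit"),
        ("SEED", "seed_0007", "MSTNPKPQR"),
    ]
    sequences, xrefs = build_nonredundant(records)
    print(len(sequences), "distinct sequences from", len(records), "records")
```

A similarity search is then run once against the distinct sequences, and each tool looks up its own database's identifiers for the hits instead of repeating the search.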

M5: the long term goal
  Establish community best practice
  It will also lead to the ability to outsource some computing tasks for the community
  Maintaining large metagenomic data sets will overwhelm all bio computing centers I know
  Searches against published metagenomes will become impossible if we don't establish a curated body of metagenomes

(A part of) the proposed M5 Platform
[Figure: components of the proposed platform]
  HPC, e.g. OLCF, ALCF, TeraGrid, OSG
  TeraGrid, cloud
  Reference data set
  Export MTF
  Standards in Genomics (define)
  SOP/workflow repository
  IMG, MG-RAST, CAMERA, Xyz..
  Users (large scale)
    Environmental metagenomics: GOS, Terragenome
    Microbiota in human health: HMP
  Many smaller user groups

Back to Summary

Summary
  Overall, genomics is not limited by lack of cycles
    Lack of good codes and best practice is more limiting
    Adjusting to large data
  It will become important for the HPC community to recognize good use of cycles
    And help avoid stupid computes
  Remember Bioinformatics

Interesting novel questions
  Which microbes are where on the planet
    Microbial weather
  Which genes are where
    Gene migration patterns
  Combinations of genes
    Where do pathways or clusters travel
  Integration of climate data
    Predictive models
    How will X impact the microbes

Evolution of computing infrastructure for BIO
[Figure: abundance of machines across eras: before genomics, early genomics, genomics era, 2010]

Why should I care?
  Microbes determine the climate on the planet! (e.g. Falkowski et al., Science 2008)
  Microbes impact human health (e.g. obesity: Turnbaugh et al., Nature 2006)
  Computations are pleasantly parallel, but there are a lot of them
  Example: oil spill, integrate gene inventory data with oil spill patterns