From Lab Bench to Supercomputer: Advanced Life Sciences Computing. John Fonner, PhD Life Sciences Computing

Size: px
Start display at page:

Download "From Lab Bench to Supercomputer: Advanced Life Sciences Computing. John Fonner, PhD Life Sciences Computing"

Transcription

1 From Lab Bench to Supercomputer: Advanced Life Sciences Computing John Fonner, PhD Life Sciences Computing

2 A Decade s Progress in DNA Sequencing 2003: ABI 3730 Sequencer Human Genome: $2.7 Billion, 13 Years 2012: Oxford Nanopore MiniION Human Genome: $900, 6 Hours

3 The Problem of Big Data in Life Sciences BGI, based in China, is the world s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day. BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.

4 Genomic Pipelines Modular workflows repeated many times Pre-processing: demultiplexing, reverse complement, trimming, quality filtering, collapsing, etc. Mapping or de novo assembly Variant calling, annotation, RNA expression, etc. Visualization

5 Traditional HPC User Writing code is a primary expression of their research interest Accustomed to the command line interface Writes performance conscious, parallel code (C, Fortran) Simulation based Massively Parallel

6 Life Sciences User Lab protocols are a primary expression of their research interest Accustomed to GUI Writes quick and dirty, non-optimized scripts (Perl, Python) Data driven High Throughput

7 HPC for Life Scientists TACC s mission is to enable discoveries that advance science and society through the application of advanced computing technologies. How can we equip scientists to leverage HPC? What HPC resources do life scientists need? How can we make HPC more accessible to life sciences users?

8 Equipping Life Scientists Training classes (webcast and in-person) provided at no cost for academic researchers Recordings are available now at: Bring your own code style training is very effective 1-on-1 consulting

9 Life Sciences Software Stack TACC maintains a quarterly release cycle for a core set of biology related applications that: Have high impact and broad appeal to life sciences research Are highly requested by the user community Compiler optimized installations using fast math libraries when possible 80+ packages installed for genomics, phylogenetics, and computational chemistry

10 Life Sciences Software Stack Abyss BEDTools BioPerl Bismark Blat Bowtie BWA Cufflinks FASTX-Toolkit GATK Genomics GSNAP HMMER Libgtextutils MAFFT Maq mpiblast MUSCLE NCBI BLAST+ Newbler Oases Picard SAMtools SHRiMP SOAPdenovo SRA toolkit SSAKE Tophat TrinityRNASeq Velvet

11 User-Specific Software What if I need software not already installed? Will it run on Linux? Is the source code available? Are there licensing restrictions? Try installing it in your personal directory (root access is not required) Submit a consulting ticket on the TACC Portal for help

12 Client-Side Software SSH clients SSH Secure Shell Putty SFTP capable text editors Notepad++ TextWrangler File management GUIs Cyberduck

13 HPC Resources for Life Scientists Interactive sessions (if possible) for initial preprocessing High throughput capability Threading more important than MPI Large memory nodes for de novo sequencing Long-term data storage

14 Lonestar 1,888 Dell M610 PowerEdge blade servers, each with two six-core Intel Xeon 5600 "Westmere" processors total 22,656 cores Tuned for powerful per-node performance 302 teraflops peak performance 1TB large memory nodes and GPU nodes available

15 Stampede Base Cluster (2+ Petaflops): Intel Sandy Bridge (8 core) processors Dell dual-socket nodes w/32gb RAM More than 6,000 nodes More than 100,000 cores Co-Processors (7+ Petaflops): Intel Xeon Phi co-processors Special release 61 core version Total concurrency approaching 500,000 cores

16 Corral Data Repository UT System funded resource 5 petabytes of geo-replicated storage Data centers in Austin and in Arlington Parallel GPFS file system on high-speed network Allocations up to 5TB are free of charge for UT System researchers Synergistic with Lonestar/Stampede for computing on big data

17 idev tool for interactive sessions

18 launcher high throughput tool Each task is listed in a text file User submits a single job to the queue Once started, launcher uses all available resources to complete the tasks in the list Multi-threading is supported

19 Making HPC more accessible iplant Collaborative RepServer immune repertoire analysis

20 What the Future May Hold File compression (please!) Better large-file transfer utilities Large, cross-institution collaborations HIPAA requirements Better standards More refined software More bioinformaticians Even more impressive sequencers

21 John Fonner For more information: