Compute- and Data-Intensive Analyses in Bioinformatics


1 Compute- and Data-Intensive Analyses in Bioinformatics
Wayne Pfeiffer, SDSC/UCSD, August 8, 2012

2 Questions for today
How big is the flood of data from high-throughput DNA sequencers?
What bioinformatics codes are installed at SDSC?
What are typical compute- and data-intensive analyses in bioinformatics?
What are their computational requirements?

3 Size matters: how much data are we talking about?
3.1 GB for human genome
  Fits on a flash drive; assumes FASTA format (1 B per base)
>100 GB/day of reads in FASTQ format (2.5 B per base) from a single Illumina HiSeq
300 GB to >1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage
  300 GB for 40x coverage
  1 TB for 130x coverage
Multiple TB needed for subsequent analysis
  45 TB on disk at SDSC for W115 project (~10,000x single genome)
  Multiple genomes per person
May only be looking for kB or MB in the end
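The sizes above follow from simple arithmetic; here is a sketch, assuming the per-base byte counts quoted on the slide (function and constant names are invented for the example):

```python
# Back-of-the-envelope storage estimates for human genome sequencing data.
# Assumes 1 byte/base in FASTA and ~2.5 bytes/base in FASTQ (sequence plus
# quality scores and headers), as quoted on the slide.
GENOME_BASES = 3.1e9          # haploid human genome, ~3.1 Gbases
FASTA_BYTES_PER_BASE = 1.0
FASTQ_BYTES_PER_BASE = 2.5

def read_data_gb(coverage):
    """GB of FASTQ reads needed for a given coverage of the genome."""
    return GENOME_BASES * coverage * FASTQ_BYTES_PER_BASE / 1e9

print(f"Genome in FASTA: {GENOME_BASES * FASTA_BYTES_PER_BASE / 1e9:.1f} GB")
print(f"Reads at 40x coverage:  {read_data_gb(40):.0f} GB")      # ~300 GB
print(f"Reads at 130x coverage: {read_data_gb(130) / 1000:.1f} TB")  # ~1 TB
```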

4 Market-leading DNA sequencers come from Illumina & Life Technologies (both SD County companies)
Illumina HiSeq 2000
  Big; $690,000 list price
  High throughput
  Low error rate
  100-bp paired-end reads
Life Technologies Ion PGM
  Small; $50,000 list price
  Low throughput
  Modest error rate
  250-bp reads

5 Cost of DNA sequencing is dropping much faster (1/10 in 2 y) than cost of computing (1/2 in 2 y); this is producing the flood of data

6 What does this mean?
Growth of read data is roughly inversely proportional to drop in sequencing cost
  >100 GB/day of reads from a single Illumina HiSeq 2000 now
  1 TB/day of reads from a sequencer likely by 2014
Analysis & quality control will dominate the cost
  <$10,000 for sequencing human genome now
  $1,000 for sequencing human genome in 2013 or 2014
  $10,000 for analysis & quality control of human genome sequence now, and decreasing relatively slowly
Analysis improvements are needed to take advantage of new sequencing technology
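The widening gap between the two cost curves can be projected from the rates quoted above (sequencing cost drops 10x every 2 years, computing cost 2x every 2 years); the function name here is invented for the example:

```python
# Sketch of the widening gap between sequencing and computing costs,
# using the rates quoted above: sequencing cost falls 10x every 2 years,
# computing cost falls 2x every 2 years.
def cost_factor(drop_per_2y, years):
    """Fraction of the starting cost remaining after `years` years."""
    return drop_per_2y ** -(years / 2.0)

for years in (2, 4, 6):
    seq = cost_factor(10, years)
    comp = cost_factor(2, years)
    print(f"after {years} y: sequencing at {seq:.3f}x cost, computing at "
          f"{comp:.2f}x cost; compute/sequencing cost ratio up {comp / seq:.0f}x")
```

At these rates the compute cost per genome sequenced grows 5x every two years, which is why analysis, not sequencing, comes to dominate.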

7 Many widely used bioinformatics codes are installed on Triton, Trestles, & Gordon
Pairwise sequence alignment: ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
Multiple sequence alignment (via CIPRES gateway): ClustalW, MAFFT
RNA-Seq analysis: TopHat, Cufflinks
De novo assembly: ABySS, SOAPdenovo, Velvet
Phylogenetic tree inference (via CIPRES gateway): BEAST, GARLI, MrBayes, RAxML
Tool kits: BEDTools, GATK, SAMtools

8 Computational requirements for some codes & data sets can be substantial
Columns: Code & data set, Input (GB), Output (GB), Memory (GB), Time (h), Cores / computer
  BFAST 0.6.4c / Dash: 52M 100-bp reads
  SOAPdenovo / Triton P+C: 1.7B 100-bp reads
  Velvet / Triton PDAF: 562M 50-bp reads
  MrBayes / Gordon: DNA data, 40 taxa, 16k patterns
  RAxML / Trestles: amino acid data, 1.6k taxa, 8.8k patterns

9 Benchmark tests were run on various computers, some with large shared memory
Gordon from Appro at SDSC
  16-core nodes with 2.6-GHz Intel Sandy Bridge processors
  64 GB of memory per node + vSMP
Trestles from Appro at SDSC
  32-core nodes with 2.4-GHz AMD Magny-Cours processors
  64 GB of memory per node
Triton CC & Dash from Appro at SDSC
  8-core nodes with 2.4-GHz Intel Nehalem processors
  24 & 48 GB of memory per node + vSMP on Dash
Triton PDAF from Sun at SDSC
  32-core nodes with 2.5-GHz AMD Shanghai processors
  256 & 512 GB of memory per node
Blacklight from SGI at PSC
  2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
  16 TB of memory per NUMA node

10 Typical projects involve multiple codes, some with multiple steps, combined in workflows
HuTS: Human Tumor Study
  Search for genome variants between blood and tumor tissue
  Start from Illumina 100-bp paired-end reads
  Use BWA & GATK on Triton to find SNPs & short indels
  Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
W115: Study of a 115-year-old woman's genomes (Hendrikje van Andel-Schipper)
  Search for genome variants between blood and brain tissue
  Start from SOLiD 50-bp reads
  Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
  Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels

11 Computational workflows for common bioinformatics analyses
DNA reads in FASTQ format feed two paths:
  De novo assembly (SOAPdenovo, Velvet, ...) -> contigs & scaffolds in FASTA format
  Read mapping, i.e., pairwise alignment (BFAST, BWA, ...) against a reference genome in FASTA format -> alignment info in BAM format -> variant calling (GATK, ...) -> variants: SNPs, indels, others
Contigs & scaffolds in turn feed two paths:
  Pairwise alignment (ATAC, BLAST, ...) against a reference genome -> alignment info in various formats -> variant calling (GATK, ...) -> variants
  Multiple sequence alignment (ClustalW, MAFFT, ...) -> aligned sequences in various formats -> phylogenetic tree inference (MrBayes, RAxML, ...) -> tree in various formats

12 Computational workflow for read mapping & variant calling
DNA reads in FASTQ format -> read mapping, i.e., pairwise alignment (BFAST, BWA, ...) against a reference genome in FASTA format -> alignment info in BAM format -> variant calling (GATK, ...) -> variants: SNPs, indels, others
Goal: identify simple variants, e.g.,
  single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)
    CACCGGCGCAGTCATTCTCATAAT
    CACCGGCGCAGACATTCTCATAAT
  short insertions & deletions (indels)
    CACCGGCGCAGTCATTCTCATAAT
    CACCGGCGCA---ATTCTCATAAT
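The idea behind SNP calling can be illustrated with a toy pileup: tally the read bases aligned to each reference position and call a variant where a non-reference base dominates. This is only a sketch (the function, thresholds, and reads are invented for the example; real callers like GATK model base qualities, mapping qualities, and diploid genotypes):

```python
from collections import Counter

# Toy SNP caller over a pileup: at each reference position, count the
# read bases aligned there; call a SNP when a non-reference base is seen
# in at least `min_fraction` of reads at sufficient depth.
def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """aligned_reads: list of (start, sequence) pairs with 0-based starts."""
    snps = []
    for pos, ref_base in enumerate(reference):
        bases = [seq[pos - start] for start, seq in aligned_reads
                 if start <= pos < start + len(seq)]
        if len(bases) < min_depth:
            continue                      # too shallow to call anything
        base, count = Counter(bases).most_common(1)[0]
        if base != ref_base and count / len(bases) >= min_fraction:
            snps.append((pos, ref_base, base))
    return snps

reference = "CACCGGCGCAGTCATTCTCATAAT"    # the slide's example sequence
reads = [(0, "CACCGGCGCAGACATT"),         # each read carries the T->A variant
         (4, "GGCGCAGACATTCTCA"),
         (8, "CAGACATTCTCATAAT")]
print(call_snps(reference, reads))        # [(11, 'T', 'A')]
```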

13 Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in KRAS gene; this means that cetuximab is not effective for chemotherapy
BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

14 BFAST took about 8 hours & 17 GB of memory to map a small set of reads; speedup was 3.7 on 8 cores
Parallelization is typically done by
  Separate runs for each lane of reads
  Threads within a run
Columns: Step, 1-thread time (h), 8-thread time (h), Speedup, 8-thread memory (GB)
  Steps: Match, Align, Postprocess, Total
Tabulated results are for
  One lane of Illumina 100-bp paired-end reads: 52 million reads
  One index with k=22 on reference human genome (done previously)
  One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
  26 GB input, half for reads & half for index; 19 GB output
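The quoted numbers imply a modest parallel efficiency; a sketch of the arithmetic (the ~8 h 8-thread time and 3.7x speedup are from the slide, function names invented):

```python
# Speedup and parallel efficiency from the BFAST numbers quoted above:
# ~8 h total on 8 threads with a 3.7x speedup over 1 thread.
def speedup(t1, tn):
    return t1 / tn

def efficiency(t1, tn, cores):
    return speedup(t1, tn) / cores

t8 = 8.0          # hours on 8 threads (from the slide)
t1 = t8 * 3.7     # implied 1-thread time, ~29.6 h
print(f"1-thread time ~{t1:.1f} h, parallel efficiency "
      f"{efficiency(t1, t8, 8):.2f} on 8 cores")
```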

15 Computational workflow for de novo assembly & variant calling
DNA reads in FASTQ format -> de novo assembly (SOAPdenovo, Velvet, ...) -> contigs & scaffolds in FASTA format -> pairwise alignment (ATAC, BLAST, ...) against a reference genome in FASTA format -> alignment info in various formats -> variant calling (GATK, ...) -> variants: SNPs, indels, others
Goal: identify more complex variants, e.g.,
  large indels
  duplications
  inversions
  translocations

16 Key conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory
2. Merge overlapping, good reads into longer contigs, typically by simplifying the graph
3. Link contigs to form scaffolds using paired-end information
Diagrams from Serafim Batzoglou, Stanford

17 A de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding an Eulerian path through the graph
Diagram from Michael Schatz, Cold Spring Harbor
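Steps 1 and 2 of the assembly outline above can be sketched with a toy de Bruijn graph: each k-mer in a read adds an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and walking unambiguous edges merges reads into a contig. This is illustrative only (reads, k, and function names are invented; real assemblers add error correction, coverage tracking, and graph simplification):

```python
from collections import defaultdict

# Toy de Bruijn graph: nodes are (k-1)-mers; each k-mer in a read adds
# an edge from its prefix to its suffix.
def build_graph(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_contig(graph, start):
    """Follow unambiguous edges from `start`, merging nodes into a contig."""
    contig, node, seen = start, start, set()
    while len(set(graph[node])) == 1 and node not in seen:
        seen.add(node)                  # guard against cycles
        node = graph[node][0]
        contig += node[-1]              # each step extends the contig by 1 base
    return contig

reads = ["AGACTT", "GACTTC", "CTTCGA"]  # overlapping reads from one region
graph = build_graph(reads, k=4)
print(walk_contig(graph, "AGA"))        # reconstructs "AGACTTCGA"
```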

18 SOAPdenovo & Velvet are two leading assemblers that use the de Bruijn graph algorithm
SOAPdenovo is from BGI
  Code has four steps: pregraph, contig, map, & scaffold
  pregraph & map are parallelized with Pthreads, but not reproducibly
  pregraph uses the most time & memory
Velvet is from EMBL-EBI
  Code has two steps: hash & graph
  Both are parallelized with OpenMP, but not reproducibly
  Either step can use more time or memory depending upon problem & computer
k-mer size is an adjustable parameter
  Typically it is adjusted to maximize the N50 length of scaffolds or contigs
  N50 length is the central measure of the distribution weighted by lengths
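The N50 length used as the tuning target above is easy to compute from a list of contig or scaffold lengths; a sketch (the function name and example lengths are invented):

```python
# N50 of an assembly: the length L such that contigs of length >= L
# together contain at least half of the total assembled bases.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

contigs = [100, 90, 50, 40, 30, 20, 10]   # hypothetical contig lengths (bp)
print(n50(contigs))   # 90: the 100 & 90 bp contigs hold >= half of the 340 bp
```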

19 SOAPdenovo & Velvet each have their strengths
Quality of assembly: both give similar assemblies
Speed: SOAPdenovo is faster
Memory: SOAPdenovo uses much less memory
vSMP: Velvet often runs well with vSMP, whereas SOAPdenovo does not
Reads: both work with Illumina reads, but only Velvet works with SOLiD reads

20 Graph step of Velvet works well on Gordon with vSMP; Gordon, Blacklight, & Triton PDAF have similar speeds when memory for hash step is small

21 Hash step of Velvet runs much slower on Gordon with vSMP & somewhat slower on Blacklight when memory for hash step is large; graph step still works well on Gordon with vSMP

22 What is going on?
Memory access for graph step of Velvet is fairly regular
  This is efficient with vSMP
  Performance improved significantly last year through tuning of vSMP by ScaleMP
Memory access for hash step of Velvet is nearly random
  This is inefficient with vSMP
Memory access for pregraph step of SOAPdenovo (not shown) is also nearly random
  Since pregraph step uses the most memory, large-memory SOAPdenovo runs are slow with vSMP
vSMP allows analyses otherwise possible on only a few computers

23 Computational workflow for de novo assembly followed by phylogenetic analyses
DNA reads in FASTQ format -> de novo assembly (SOAPdenovo, Velvet, ...) -> contigs & scaffolds in FASTA format -> multiple sequence alignment (ClustalW, MAFFT, ...) -> aligned sequences in various formats -> phylogenetic tree inference (MrBayes, RAxML, ...)
A multiple sequence alignment is a matrix of taxa vs. characters:
  Human       AAGCTTCACCGGCGCAGTCATTCTCATAAT...
  Chimpanzee  AAGCTTCACCGGCGCAATTATCCTCATAAT...
  Gorilla     AAGCTTCACCGGCGCAGTTGTTCTTATAAT...
  Orangutan   AAGCTTCACCGGCGCAACCACCCTCATGAT...
  Gibbon      AAGCTTTACAGGTGCAACCGTCCTCATAAT...
Final output is a phylogeny, i.e., a tree with the taxa at its tips: ((((Human, Chimpanzee), Gorilla), Orangutan), Gibbon)

24 Scalability of RAxML & MrBayes was improved during past three years by Stamatakis, Goll, & Pfeiffer
Hybrid MPI/Pthreads version of RAxML was developed
  MPI code was added to previous Pthreads-only code
  Parallelization is multi-grained as well as hybrid
  Change in algorithm often leads to better solution
Hybrid MPI/OpenMP version of MrBayes was developed
  OpenMP code was added to previous MPI-only code
  Parallelization is multi-grained as well as hybrid
Memory-efficient code called RAxML-Light was developed
  This allows very large trees to be analyzed together with RAxML
Single-node runs are more efficient than before; multi-node runs with more cores are possible
  Scalability before was limited to about 8 cores for typical analyses
  Hybrid codes now scale well to 10s of cores for typical analyses
  Scripted version of RAxML-Light scales even further

25 RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*; speedup is superlinear for comprehensive analysis at some core counts; scalability improves with number of patterns
* Number of patterns = number of unique columns in multiple sequence alignment
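Per the footnote, a pattern is a unique column of the multiple sequence alignment; counting them is a one-liner once the alignment is transposed (function name and toy alignment invented for the example):

```python
# Number of patterns = number of unique columns in a multiple sequence
# alignment (per the footnote above). Rows must have equal length.
def count_patterns(alignment):
    columns = zip(*alignment)     # transpose rows into columns
    return len(set(columns))

alignment = ["AAGCTT",            # toy 3-taxon, 6-column alignment
             "AAGCTA",
             "AAGTTT"]
print(count_patterns(alignment))  # 5 unique columns out of 6
```

Duplicate columns contribute identical likelihood terms, which is why pattern count, not raw alignment length, governs the per-core work and hence scalability.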

26 RAxML run time for a DNA analysis went from >3 days on 1 core to ~1.3 hours on 60 cores; large amino acid analysis was solved in 4.4 days on 160 cores
Columns: Taxa, Characters, Patterns, Bootstraps, Data type, Time (h) & cores before, Time (h) & cores after, Speedup
  Data sets range from a 150-taxon RNA alignment to amino acid data with 10,301 characters solved in 106 h on 160 cores
Tabulated results are for
  Comprehensive analysis with number of bootstrap searches determined automatically, followed by 10 or 20 thorough searches
  32-core nodes of Trestles with 2.4-GHz AMD Magny-Cours processors
  10 MPI processes & 6 threads/process using 60 cores (which gives better performance than using 64 cores)
  20 MPI processes & 8 threads/process using 160 cores

27 MrBayes runs 1.6x to 3.3x faster on Gordon than Trestles depending upon the size of the data set; speedup is greater for larger data sets that are not partitioned

28 The CIPRES gateway lets biologists run parallel versions of tree inference codes via a browser interface on the Trestles & Gordon supercomputers at SDSC

29 Questions & answers about analyzing DNA sequence data
How big is the flood of data from high-throughput DNA sequencers?
  >100 GB per day from a single Illumina sequencer now
  1 TB/day from a sequencer likely by 2014
What are three compute- and data-intensive analyses of DNA sequence data?
  Mapping of short reads against a reference genome
  De novo assembly of short reads
  Phylogenetic tree inference

30 So how compute- and data-intensive are the three bioinformatics analyses we considered?
Here is a qualitative summary:

Analysis                    Compute-intensive  Memory-intensive*  I/O-intensive
Read mapping                x                                     x
De novo assembly            x                  x                  x
Tree inference (usually)    x
Tree inference (sometimes)  x                  x

* I.e., large memory per node is needed for shared-memory implementations