High performance computing infrastructure for bioinformatics

1 High performance computing infrastructure for bioinformatics
Scott Hazelhurst, University of the Witwatersrand, December 2009

3 What we need
- Skills, time
- Fast network
- Lots of storage
- Compute nodes

4 Who we are; What we do; Needs; Example project

5 SA bioinformatics community
- Universities
- Biotech Regional Innovation Centres (DST)
- Industry
- National Bioinformatics Network:
  - Promote discipline at universities
  - Start a pipeline of students
  - Training and education
  - Establish a network of people
  - Platforms & Service to industry

6 Going forward...
- Ring-fenced research funding through the NRF
- Education and training funds
- Service platform (BRICs, universities, industry)
- Genomics
- Proteomics
- Bioinformatics

7 Bioinformatics as a discipline
The use of computational and mathematical techniques for solving problems in molecular biology and genetics.
- Large data sets
- Highly computationally demanding
- Very varied
- Dynamic
HPC crucial for many key scientific programmes!

11 Examples
- Large databases (e.g. GenBank, Ensembl)
- Sequence homology (e.g. BLAST, Smith-Waterman)
- Phylogenetic studies
- Protein structure prediction
- Molecular dynamics
- Sequence clustering and assembly
- Machine learning/pattern classification
- Simulation, text mining, other...
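To give a feel for why sequence homology searches are computationally demanding, here is a minimal sketch of Smith-Waterman local alignment scoring. It is an illustration only: the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for the example, and production tools use substitution matrices, affine gap penalties and heavy optimisation. The point is that the work grows with the product of the two sequence lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table, so memory is O(len(b)) while the
    running time remains O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (hypothetical data) sharing the local region "ACG".
score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Comparing one query against every entry of a large database repeats this quadratic computation once per entry, which is why heuristics such as BLAST and parallel hardware both matter.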

12 So what does this mean for computing needs?
- Heterogeneous demands: many projects, different needs
- Hard to predict
Tim Hubbard, head of informatics at the Sanger Institute: "... we overestimated need for computers by about 2x but underestimated disk usage by about 4x: large numbers, but we think that's not too bad!"

13 Summary of needs
- access large databases
- move data around
- share data, securely
- dominant computing need: run many single-threaded jobs
- but a range of other needs
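The dominant need named above, running many independent single-threaded jobs, can be sketched as follows. This is a minimal illustration rather than a recommended scheduler: the function names `run_job` and `run_many` are made up for the example, and the "analysis" commands are stand-in Python one-liners. On a real cluster a batch system would normally play this role.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(args):
    """Run one independent single-threaded command; return (returncode, stdout)."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def run_many(commands, max_workers=4):
    """Run the commands concurrently, at most max_workers at a time.

    Each worker thread only waits on its own external process, so the
    jobs themselves run in parallel as separate operating-system processes.
    Results come back in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


# Stand-in jobs (hypothetical workload): each prints the square of n.
commands = [[sys.executable, "-c", f"print({n} * {n})"] for n in range(5)]
results = run_many(commands)
```

Because the jobs do not communicate, throughput scales with the number of cores available, which is why this workload suits clusters and clouds of ordinary nodes.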

16 Network speed
Bandwidth, 2008:
- Meraka C4: 50 KB/s (11 h)
- CHPC: 40 KB/s (12 h)
- Amazon EC2: 300 KB/s (1 h 50)
Bandwidth, end 2009:
- Amazon EC2: 400 KB/s
- Meraka C4: 5 MB/s (6 min)
- CHPC: 40 KB/s
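The relationship between bandwidth and transfer time in the figures above can be checked with a short calculation. The slide does not state how much data was transferred; the 2 GB used below is an assumption, chosen only because it is roughly consistent with most of the quoted times. Decimal units (1 KB = 1000 bytes) are also an assumption.

```python
KB = 1000          # decimal units are an assumption
MB = 1000 * KB
GB = 1000 * MB

# Hypothetical data set size; not given in the slides.
DATASET_BYTES = 2 * GB


def transfer_time_hours(size_bytes, rate_bytes_per_s):
    """Time in hours to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


# Rates taken from the slide; the resulting times depend on DATASET_BYTES.
rates = {
    "Meraka C4 (2008)": 50 * KB,
    "CHPC (2008)": 40 * KB,
    "Amazon EC2 (2008)": 300 * KB,
    "Meraka C4 (end 2009)": 5 * MB,
}

for site, rate in rates.items():
    hours = transfer_time_hours(DATASET_BYTES, rate)
    print(f"{site}: {hours:.2f} h")
```

Under these assumptions, a hundredfold increase in bandwidth turns an overnight transfer into one that takes minutes, which changes what kinds of remote computing are practical.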

17 Data Storage
Disk space is not the issue:
- management
- access

18 Computing power
- Need a range of equipment
- Private, public clusters
- Private, public clouds
- Range of memory requirements
- Multicore? Grid?? GPU??? FPGA????

20 Need the right software environment:
- Bioinformatics applications often pipelined
- May be multi-language
- May require access to internet
- Varied programming paradigms: parameter sweeping, MPI, Grid, Hadoop, OpenMP...
- Huge challenge for centralised systems.
- Virtualisation a key software technology
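Of the paradigms listed above, parameter sweeping is the simplest to illustrate: the same program is run once for every combination of parameter values, and each run is an independent job. The sketch below only generates the job list; the parameter names and values (`word_size`, `evalue`, `database`) are hypothetical and do not refer to any particular tool's options.

```python
from itertools import product


def sweep(grid):
    """Yield one dict of parameter settings per point in the grid.

    grid maps each parameter name to the list of values to try; the
    result is the Cartesian product of all the value lists.
    """
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Hypothetical parameter grid: 2 x 2 x 2 = 8 independent jobs.
grid = {
    "word_size": [7, 11],
    "evalue": [1e-5, 1e-10],
    "database": ["db_a", "db_b"],
}
jobs = list(sweep(grid))
```

The number of jobs is the product of the list lengths, so even a modest grid can produce thousands of independent runs to schedule.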

21 African Genome Project
- Africa has most diverse populations
- Least studied
- Goal of project:
  - complete sequence of hundreds of Africans
  - more restricted studies of thousands of individuals
  - phenotype information

27 Example project
Platform for high-throughput sequence analysis
Large collaborative project: UCT, UP, UWC, Wits
- Assembly
- Transcriptome analysis
- SNP detection
- Functional annotation
- Backend work