Grid & Cloud Computing in Bioinformatics

1 Grid & Cloud Computing in Bioinformatics
AptaMEMS-ID & Microbase (October 2010)
Keith Flanagan, School of Computing Science, Newcastle University

2 AptaMEMS-ID
It currently takes 2-3 days to identify the presence of bacterial strains, e.g. Staphylococcus aureus (MRSA), and involves manual lab-based tests.
AptaMEMS-ID aims to build a device capable of detecting specific bacterial strains within minutes:
- utilises micro-electro-mechanical sensors (MEMS)
- aptamers attached to each sensor

3 [Figure: aptamers attached to a sensor chip; bacterial proteins bind to the aptamers. Image sources: aptamems-id-technologies, algorithm/?page_id=229, File:Average_prokaryote_cell-_en.svg]

4 Identification of target proteins
Requires computationally intensive analysis of available genomic data to identify suitable proteins that are:
- located on the surface of the bacterial cell (accessible to the sensor);
- unique to a strain, or to a particular group of strains

5 Our requirements of a distributed compute platform
High-level orchestration of individual bioinformatics applications:
- construct arbitrary workflows from existing bioinformatics software
- ability to add new data incrementally
- ability to add new tools without re-computing existing analyses
Balance jobs across available hardware resources, e.g.:
- Condor
- Grid
- Amazon EC2
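The incremental-recomputation requirement above can be sketched as a simple bookkeeping rule: record each (tool, input) pair that has already been analysed, so that adding a new genome or a new tool only triggers the missing jobs. This is an illustrative sketch, not the actual Microbase implementation; the names and structures are assumptions.

```python
# Sketch (not Microbase itself) of incremental workflow scheduling:
# each (tool, input) pair is computed at most once, so new data or new
# tools only trigger the analyses that are actually missing.

completed = set()  # (tool_name, input_id) pairs already analysed

def pending_jobs(tools, inputs):
    """Return the (tool, input) pairs that still need to run."""
    return [(t, i) for t in tools for i in inputs if (t, i) not in completed]

def run_all(tools, inputs):
    """Dispatch outstanding jobs and mark them complete."""
    jobs = pending_jobs(tools, inputs)
    for tool, inp in jobs:
        # In the real system this would dispatch to Condor / EC2;
        # here we simply record completion.
        completed.add((tool, inp))
    return jobs

# Initial batch: 2 tools x 2 genomes -> 4 jobs.
first = run_all(["blast", "signalp"], ["genome1", "genome2"])
# Adding one genome runs only the 2 new jobs, not the original 4.
second = run_all(["blast", "signalp"], ["genome1", "genome2", "genome3"])
```

Existing analyses are never re-run; the same rule handles a newly added tool, which would produce one pending job per already-known input.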

6 Our requirements of a distributed compute platform (2)
Manage machine configurations:
- temporary software installation
- transfer of potentially large files, e.g.:
  - 4 GB BLAST database files
  - full InterProScan installation: 25 GB

7 Microbase architecture
[Architecture diagram. Responders (with event listeners) subscribe to a notification system. A "new data" notification triggers a task splitter, which publishes task-description notifications to the task/job scheduler; "new task" and "task completion" notifications flow back through the same event system. A job server answers job-description requests from Microbase clients running under Condor or in Amazon EC2 VMs; clients return job completion reports and fetch resource files from a resource storage system via BitTorrent transfers.]
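The notification-driven flow in the diagram is essentially a publish/subscribe pattern: responders register for event topics, and one event (e.g. "new data") fans out into further events (e.g. one task per genome). The sketch below illustrates the pattern only; class and topic names are hypothetical, not the actual Microbase API.

```python
# Hypothetical sketch of the publish/subscribe flow described above.
# Names (NotificationSystem, topic strings) are illustrative.

class NotificationSystem:
    def __init__(self):
        self.listeners = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.listeners.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.listeners.get(topic, []):
            callback(message)

bus = NotificationSystem()
scheduled = []  # tasks seen by the "scheduler"

def split_task(msg):
    # The task splitter: emit one task notification per genome.
    for genome in msg["genomes"]:
        bus.publish("new-task", {"genome": genome})

bus.subscribe("new-data", split_task)
bus.subscribe("new-task", scheduled.append)

# A "new data" event fans out into per-genome task notifications.
bus.publish("new-data", {"genomes": ["g1", "g2"]})
```

Because responders only see topics they subscribe to, new tools can be attached as additional listeners without modifying the existing pipeline, which matches the extensibility requirements on the earlier slide.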

8 Analysis workflow
Large-scale bioinformatics workloads: over 3 million protein sequences to be analysed.
- sub-cellular localisation prediction tools
- sequence similarity searches
- protein tokenisation
- protein clustering
- structural prediction tools

9 AptaMEMS workflow
[Workflow diagram. A file scanner monitors a genome FTP site and loads genome files into a genome pool; resource files feed the workflow steps (BLAST, TMHMM, SignalP, LipoP, InterProScan), whose outputs pass through data integration and machine learning / pattern discovery into the AptaMEMS extracellular protein database. Legend: workflow step = Microbase responder; storage = distributed file store and relational database; arrows = data transfer.]

10 Typical computational workloads
Performing analyses:
- executing thousands of small jobs
- each job typically equivalent to one execution of a single analysis tool
- embarrassingly parallel
- jobs sized to be suitable for execution on typical desktop hardware
Large-scale data management:
- storing resulting data in a distributed database (MongoDB)
- requires a cluster of larger machines for adequate performance
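The "embarrassingly parallel" job shape above can be sketched with a standard worker pool: each job is independent, wraps a single tool invocation, and imposes no ordering constraints. In the real system the workers would be Condor or EC2 nodes rather than local threads, and `analyse` is a stand-in for launching an actual tool; both are assumptions for illustration.

```python
# Sketch of the embarrassingly-parallel pattern: many independent small
# jobs, each equivalent to one analysis-tool run, farmed out to a pool.
from concurrent.futures import ThreadPoolExecutor

def analyse(batch_id):
    # Placeholder for launching one tool run, e.g.
    # subprocess.run(["blastp", "-query", batch_id + ".fasta", ...]);
    # here we just return a token result.
    return batch_id, len(batch_id)

# Thousands of small, mutually independent jobs.
protein_batches = [f"batch{i:04d}" for i in range(1000)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # No job depends on another, so plain map-style dispatch suffices.
    results = list(pool.map(analyse, protein_batches))
```

Threads are used here only because the heavy lifting in practice happens in external tool processes; the scheduling logic is identical when the pool is a compute grid.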

11 Compute job characteristics
Small jobs:
- typically execute within 2-3 minutes
- modest disk / RAM requirements
CPU-intensive jobs:
- InterProScan: approx. 45 minutes to analyse 100 proteins
Data-intensive jobs:
- large database files (~4 GB BLAST databases)
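The InterProScan figure above makes a back-of-envelope total easy to check against the summary slide. Using the stated rate (~45 minutes per 100 proteins) and the ~3 million sequences mentioned earlier:

```python
# Back-of-envelope estimate from the figures on the slides.
proteins = 3_000_000
minutes_per_batch = 45   # InterProScan, per 100 proteins
batch_size = 100

total_minutes = (proteins / batch_size) * minutes_per_batch
total_years = total_minutes / (60 * 24 * 365)
# Roughly 2.6 CPU-years for InterProScan alone, which is consistent
# with the ~5 years of total compute (across all tools) reported in
# the summary.
```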

12 Summary
Flexible processing pipelines:
- expandable with new applications
- expandable with new data
100,000s of jobs executed; ~5 years of compute time in 3 months.
Distributed processing across multiple geographic locations: Newcastle, Amazon EU & US.

13 Acknowledgements
Professor Anil Wipat
Sirintra Nakjang
EPSRC / BBSRC
Professor Colin Harwood
Bioinformatics Support Unit
Newcastle Digital Institute
Kurt Messersmith, Amazon EC2