HRG Insight: IBM & Intel: Intelligent Choice for Life Sciences

IBM eX5 server technology significantly advances the state of Life Sciences research applications such as Bioinformatics, Genomic Research, and Translational Medicine. Workloads in Life Sciences, such as genomic sequencing, assembly, alignment, and secondary data analysis, gain significant benefits from the available memory capacity of IBM eX5 in combination with the improvements in transaction throughput enabled by the new Intel Xeon processor 7500 series. IBM customers will realize significant competitive advantage in faster ROI, reduced TCO, and improved time to result.

While today's institutions and organizations conducting full genome sequence experiments do provide primary analysis, they may not do the secondary analysis. It is this focused secondary analysis that looks for the very small percentage of DNA variants that form the basis for research discoveries and insights. This is a precursor to the development and delivery of clinical applications to identify, diagnose, prevent, and cure disease. Secondary data analysis is emerging as the area where right-sized compute- and data-intensive solutions like IBM's eX5, powered by the Intel Xeon processor 7500 series, start to provide the computational throughput and very large memory capacity needed to make genomic data useful to Life Sciences companies, physicians, and patients.

Copyright 2010 Harvard Research Group, Inc.
Selected HPC Life Science Disciplines & Workloads

The Life Sciences industry segment is characterized by exponential growth in the volumes of raw and processed data, in addition to an insatiable demand for compute power and throughput. Today, High Performance Computing (HPC) techniques and technologies in Life Sciences are being applied to workloads and solutions for large-scale problems. The term workload is used here to describe the work being done, the relevant data characteristics, and the software used to manipulate the data. The following table provides a perspective for the discussion of HPC workload requirements for Life Sciences and in particular for genomic sequencing.

| Discipline | Solutions | Data/Application Characteristics | Major Applications |
|---|---|---|---|
| Bioinformatics: Sequence Analysis | Searching, alignment & pattern matching of biological sequences (DNA & protein) | Structured data. Integer dominant, frequency dependent; large caches & memory bandwidth not critical; some algorithms are suited to SIMD acceleration | NCBI BLAST, wublast, ClustalW, FASTA, HMMER |
| Bioinformatics: Sequence Assembly | Align & merge DNA fragments to reconstruct the original sequence | Usually have large memory footprint | Phrap/phred, CAP3/PCAP, Velvet, ABySS, SOAPdenovo, MAQ, BOWTIE, BFAST, SOAP, Eland, SHRiMP, GAP, pgap (TAMU) |
| Biochemistry: Drug Discovery | Screening of large database libraries of potential drugs for ones with desired biological activity | Mostly floating point, very compute intensive, highly parallel | Autodock, GLIDE, Dock, Flexx, FTDock, LigandFit |
| Computational Chemistry: Molecular Modeling & Quantum Mechanics | Modeling of biological molecules using Molecular Dynamics & Quantum Mechanics techniques | Very floating point intensive, latency critical, frequency dependent, scalable to low 100s | AMBER, NAMD, CHARMM / CHARMm, Desmond, GROMACS, Gaussian, GAMESS, Jaguar, NWCHEM |
| Proteomics | Interpreting mass spectrometry data and matching the spectra to protein database | Mostly integer dominant, frequency dependent. Not communication intensive | Mascot, Sequest, ProteinProspector, X!Tandem, OMSSA |

(Source: IBM)

Sequencing the Human Genome

Sequencing the human genome (there are nearly three billion DNA base pairs in an individual human genome) is laborious, costly, and time consuming. Today the genome sequencing / biosciences industry segment is dominated by research institutes that generate a tremendous amount of data. These organizations require significant computation capabilities and data storage capacity in order to process and analyze that data.
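To give a rough sense of scale for the storage requirement described above, the following Python sketch estimates the size of one genome's data. Only the figure of roughly three billion base pairs comes from the text; the encodings (2 bits per base for a packed sequence, 2 bytes per base for raw reads with quality scores) and the 30x coverage depth are illustrative assumptions, not figures from IBM or HRG.

```python
# Back-of-the-envelope estimate of storage for one human genome.
# GENOME_BASES comes from the text ("nearly three billion" base pairs);
# every other number below is an illustrative assumption.

GENOME_BASES = 3_000_000_000


def packed_size_bytes(bases: int, bits_per_base: int = 2) -> int:
    """Bytes needed to store `bases` symbols at a fixed bit width.

    Two bits are enough to distinguish A, C, G, and T.
    """
    total_bits = bases * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


def raw_read_size_bytes(bases: int, coverage: int, bytes_per_base: int = 2) -> int:
    """Bytes for raw reads when each position is read `coverage` times.

    bytes_per_base=2 assumes one byte for the base call and one byte
    for its quality score (record headers are ignored).
    """
    return bases * coverage * bytes_per_base


if __name__ == "__main__":
    packed = packed_size_bytes(GENOME_BASES)
    raw = raw_read_size_bytes(GENOME_BASES, coverage=30)
    print(f"2-bit packed sequence: {packed / 1e9:.2f} GB")
    print(f"raw reads at 30x coverage: {raw / 1e9:.0f} GB")
```

Under these assumptions a single packed genome is well under a gigabyte, while the raw reads behind it are a few hundred times larger, which is why sequencing many genomes quickly becomes a storage and memory problem.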
"The $10 million X PRIZE for Genomics prize purse will be awarded to the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome." (Source http://genomics.xprize.org/archon x prize for genomics/prize overview ). Since 2003 when the cost to sequence a single human genome was roughly $3 billion, dramatic reductions in the cost per genome sequenced have continued. Much of this cost reduction is due to applied technology such as the new HPC Life Science solution enabling capabilities provided by IBM ex5 systems. When the cost per complete genome sequenced gets down to $10,000 the shift from research to the clinical / commercial application of this science should begin and personalized medicine can then become a reality. In genome sequencing raw "DNA base" data produced by the sequencing appliance results in the generation of extreme volumes of data (multiple terabytes) that must then be stored for further analysis. After the assembled and aligned genomic image data has been stored, secondary analysis and research on the data can be performed. This area of secondary analysis is a key focus for researchers and technology providers. Copyright 2010 Harvard Research Group, Inc page 3
Genome Sequencing

Whole or full genome sequencing of the DNA bases or segments that constitute each human genome produces very large volumes of data, which then have to be stored for secondary processing and analysis such as the identification of genetic markers that serve as indicators for a specific disease. This type of secondary analysis requires significant high performance compute power, memory, and storage capacity.

High throughput full genome sequencing is becoming a reality with the development and ongoing refinement of technologies such as IBM's nanopore "DNA transistor". The promise of nanopore-based genomic sequencing is to sequence whole strands of DNA, dramatically increase sequencing throughput and accuracy, and move the cost of sequencing a single human genome below $1,000. The result will be a significant increase in the volume of available genome sequence data. Populating a human genome data repository will dramatically improve the identification of the genetic and proteomic basis for disease before a disease presents itself. This in turn will facilitate proactive prevention and treatment through lifestyle counseling, personalized medicine, and even unique custom personalized drug therapies. The patent for Nanopore technology is held by Harvard University and Oxford Nanopore Technologies.

Second Level Analysis

The second level analysis of clean, error-free, deduplicated genomic data presents the next big challenge that Life Science researchers will need to meet. This type of workload, typified by high volume in-memory data analysis, will drive HPC symmetric multiprocessor (SMP) server and clustered system utilization. The primary emerging computational requirement driven by this type of workload is the ability to load and process massive amounts of data by performing large in-memory data manipulation.
This big data workload requirement drives the consumption of flash drive based data stores and high memory capacity systems. An additional requirement will be for higher frequency, higher throughput multi-core chips to deal with multi-threading SMP related throughput requirements. Overall IT requirements are driven by the need to produce meaningful results in the least possible amount of time and for the lowest possible cost.

Personalized Medicine

Genomic sequencing is a prerequisite to the development of personalized medicine, which will be based on the combination of data derived from patient clinical history and genomic sequencing. Getting the cost of sequencing a complete human genome under $10,000 is a requirement for the field of personalized medicine to open up. As the repository of genome sequencing data is established, researchers will be able to analyze anomalies or differences of one genome compared to the base level mapping of the entire human genome or an equivalent proxy. Through this type of analysis researchers will identify and decipher disease-specific genetic markers, resulting in individual / personalized preventative and therapeutic medical applications. The field of personalized medicine will not realize its full potential until these genomic and clinical history data repositories have been established and populated. Security, privacy, disaster recovery, availability, and other emerging concerns will have to be addressed in order to pave the way for fulfilling the promise that personalized medicine holds.

Personalized medicine will be based on the combination of personal, historic, clinical, and genomic data in a single source. This could take the form of a machine-readable card, subcutaneous RFID chip, or similar device that an individual could carry with them. This device will have the individual's unique genome encoded along with other relevant personal information.
When this individual goes to a pharmacy, for example, a pharmacist will be able to cross-check the encoded genome, clinical historical data, and existing prescriptions in order to ensure that any new prescription does not conflict with existing conditions or treatments. This type of information, when available, will be a game changer in terms of enhancing the physician's clinical therapeutic efficacy.

Cluster and/or SMP

IBM System eX5 servers with the Intel Xeon processor 7500 series can run either SMP-type HPC workloads or applications using MPI HPC code. This results in higher levels of system utilization by helping avoid situations where clusters sit idle because enabling an application for a cluster costs more than enabling it for an SMP system. Life Sciences applications such as ABySS (MPI code) are specifically written for distributed memory environments and for workloads of this type. HRG expects to see small workgroup cluster systems of up to eight nodes being replaced by next generation SMP/MPI capable servers such as IBM's eX5 systems. The versatility of these new systems will allow customers to run in either SMP mode or MPI mode, enabling significantly higher overall system utilization and making the IBM eX5 server a true all-in-one solution.

IBM eX5

IBM's eX5 servers bring to market major solution building block elements (x3850 X5, BladeCenter HX5, and x3690 X5) designed to meet the continually increasing Life Sciences HPC workload requirements of labs, universities, and corporate commercial R&D. The eX5 builds on the Intel Xeon processor 7500 series with increased memory capacity, flexible storage, virtualization, and system reliability for the Life Sciences HPC market. This offering delivers the compute power, memory capacity, and bandwidth to solve big science problems faster. The new eX5 HPC compute infrastructure servers, in combination with Intel's Hyper-Threading multi-core processors, satisfy Life Science application large memory, SMP, and parallelism requirements.
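The difference between the SMP and MPI modes described above can be illustrated with a toy k-mer count, a basic building block in sequence assembly. This is a minimal Python sketch and not how ABySS or any other named application is implemented; the function names and the round-robin partitioning are assumptions made for illustration, and the "nodes" are simulated sequentially in one process. In the shared-memory (SMP) style one table is built over all reads; in the distributed-memory (MPI) style each node counts only its own partition and the partial tables are merged afterward.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Shared-memory style: one table holding every length-k substring count."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def partition(reads: List[str], nodes: int) -> List[List[str]]:
    """Deal reads round-robin into `nodes` partitions (one per simulated node)."""
    return [reads[n::nodes] for n in range(nodes)]


def count_kmers_distributed(reads: List[str], k: int, nodes: int) -> Counter:
    """Distributed style: count each partition separately, then merge.

    Partitions are processed one after another here; on a real cluster each
    would run on its own node and the merge would travel over the network.
    """
    total: Counter = Counter()
    for part in partition(reads, nodes):
        total.update(count_kmers(part, k))  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT", "TACGTA", "CGTACG"]
    shared = count_kmers(reads, k=3)
    merged = count_kmers_distributed(reads, k=3, nodes=2)
    print("tables agree:", shared == merged)
```

Both styles produce the same table. What differs is where the data lives and how much merging traffic is needed, which is the trade-off a server capable of both modes lets a site decide per workload.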
Expanded memory capabilities for faster results

IBM silicon allows processors on eX5 systems to access extended memory very quickly and delivers the largest memory capacity in the industry. The IBM Enterprise X-Architecture chip is in its fifth generation with eX5 and leverages decades of IBM experience in integrating microelectronics to create first-of-a-kind silicon solutions. One component of IBM's extended memory solution is the external MAX5 memory chassis, which decouples server memory from system processors. The MAX5 for eX5 racks and blades enables systems to more than double the number of addressable memory DIMMs per processor, providing up to twice the memory capacity currently available in the industry.

The new eX5 systems with IBM eXFlash solid state disk drives and MAX5 memory expansion represent a new class of energy-efficient, cost-effective, high-performance compute engines. The eX5 supports both the VMware ESXi and the open source KVM-based Red Hat RHEV-H virtualization hypervisors, enabling data center consolidation and high density compute configurations. With the IBM eX5 MAX5 memory expansion, complete databases can be held in memory, accelerating system performance and enhancing throughput by avoiding the latency associated with traditional page swapping. One case in point: a two-socket eX5 system with MAX5 installed can support up to 320 virtual machines. This magnitude of virtualization conserves power, saves money on licensing costs, and significantly reduces environmental conditioning (HVAC and power) and space requirements.
IBM's eXFlash technology is an environmentally friendly replacement for older hard disk drive storage subsystems. eXFlash can reduce storage costs by up to 97% and deliver up to 30x more local database performance. With the addition of IBM's Systems Director capabilities, customers can pre-configure servers, remotely re-purpose systems, and set up automatic updates and recoveries.

Conclusion

With increased emphasis on massively parallel / high throughput sequencing and the resultant extreme volumes of data that will be generated, it makes sense to offload that data from the sequencing appliance to a nearby SMP HPC server like an eX5 running Intel Xeon 7500 series processors. This solution is purpose built to maintain high throughput, provide large memory capacity, and provide access to high volumes of data for assembly, alignment, and secondary analysis such as that required for the identification of disease markers.

In the case of high-throughput whole genome sequencers, the base calling and other quality functions are typically performed on the raw genome (DNA base) sequence data while the data is resident in the appliance's memory. The reduced and mostly error-free data can then be moved to a nearby server like an IBM eX5 for assembly and alignment. As the Life Sciences industry reaches and passes the X PRIZE target of sequencing 100 human genomes in ten days, offloading this data from the sequencing appliance will become a necessity. HRG believes that an IBM eX5 system with expanded memory and high speed solid state disk drives is an ideal choice for satisfying these emerging high-throughput, large data Life Science analysis requirements.
Harvard Research Group is an information technology market research and consulting company. The company provides highly focused market research and consulting services to vendors and users of computer hardware, software, and services.

For more information please contact Harvard Research Group:

Harvard Research Group
PO Box 297
Harvard, MA 01451 USA
Tel. (978) 456 3939
Tel. (978) 925 5187
e-mail: hrg@hrgresearch.com
http://www.hrgresearch.com