Computational Challenges of Medical Genomics


2 Talk at the VSC User Workshop, Neusiedl am See, 27 February

3 Introducing myself to Vienna's scientific computing community
Background:
- Bachelor/Master in computer science at the Universities of Mannheim and Heidelberg (IWR)
- PhD in bioinformatics at the Max Planck Institute for Informatics
- Postdoc at the Broad Institute of MIT and Harvard, working on the human epigenome project
In Vienna:
- Just started my group at the CeMM Research Center for Molecular Medicine
- Focus on medical genomics, both computationally and experimentally
- Also coordinating the CeMM-MUW next-generation sequencing facility
Goals:
- Contribute to personalized medicine using genomic/bioinformatic methods
- Provide value-added sequencing technology for CeMM and neighbors

4 Outline
1. Next-generation sequencing for medicine: A revolution in the making
2. The human epigenome: Unexpected complexity beyond the DNA sequence
3. Efficient indexing algorithms enable live exploration of multidimensional (epi-)genomes
4. Challenges for scientific computing in bioinformatics and genomic medicine

5 Why does everybody talk about next-generation sequencing?
[Figure: cost per genome in US dollars, comparing Moore's law (IT cost) with DNA sequencing (cost per genome)]
In 2012, medical genome sequencing will produce more data than high-energy physics

6 Next-generation sequencing (NGS) will transform biology and medicine
- NGS is not just another incremental technical advance
- Probably the fastest progressing technology in history (1000x cost/throughput improvement since 2007)
- Becoming a universal tool of biomedical research (almost like PCs)
- Huge pressure on medical practice to embrace NGS in meaningful ways
- What will you do when patients bring their personal genomes on a USB stick?
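To put the 1000x figure next to Moore's law, the improvement can be converted into an equivalent doubling time. The sketch below is only illustrative arithmetic: the 5-year window (2007 to 2012) and the 24-month doubling period for IT hardware are assumptions, not numbers taken from the slides.

```python
import math

def doubling_time_months(improvement_factor, elapsed_years):
    """Months per doubling needed to reach improvement_factor in elapsed_years."""
    doublings = math.log2(improvement_factor)
    return elapsed_years * 12 / doublings

# Assumption: the 1000x improvement "since 2007" spans roughly 5 years.
ELAPSED_YEARS = 5
# Assumption: Moore's law taken as one doubling every 24 months.
MOORE_DOUBLING_MONTHS = 24

ngs_doubling = doubling_time_months(1000, ELAPSED_YEARS)
moore_factor = 2 ** (ELAPSED_YEARS * 12 / MOORE_DOUBLING_MONTHS)

print(f"NGS doubling time: about {ngs_doubling:.1f} months")
print(f"Moore's law improvement over the same period: about {moore_factor:.1f}x")
```

Under these assumptions sequencing doubles in cost efficiency roughly every six months, while hardware improves by less than an order of magnitude over the same period.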

7 DNA sequencing & computational methods will enable personalized medicine
The cancer checkup of the future (2020+):
1. Blood sampling (blood contains cell-free DNA of any tumor)
2. DNA sequencing: genome, epigenome, transcriptome
3. Computational assembly into patient-specific map (~100 GB)
4. Computer-aided interpretation in support of clinical decision making
Challenges for computer science research:
- Algorithms for efficiently analyzing terabytes of DNA sequencing data
- Enabling live search of large-scale personal genome maps
- Robust therapy predictions based on high-dimensional and noisy datasets

8 CeMM-MUW Joint Sequencing Facility Overview
Goals:
- Facilitate competitive biomedical research at CeMM and MUW
- Build competences for the future of genomic medicine in Vienna & Austria
- Realize economies of scale, providing best value per EUR of grant money
- Provide process support: study design, sequencing, analysis, translation
Status quo:
- Two Illumina HiSeq 2000 machines up and running
- Accepting samples: exome-seq, whole-genome seq, RNA-seq, ChIP-seq, DNA methylation seq, custom protocols
- All basic workflows are in place, with ongoing extensions and optimization to improve quality of service
- Current focus on large-scale projects and power users with continuous sample flow (who will serve as multipliers)


10 The human genome is much more complex than we thought
[Panels: Human Genome in 2000; Human Genome in 2012]
>hg19_dna range=chr8:
GACCCCCGAGCTGTGCTGCTCGCGGCCGCCACCGCCGGGCCCCGGCCGTC
CCTGGCTCCCCTCCTGCCTCGAGAAGGGCAGGGCTTCTCAGAGGCTTGGC
GGGAAAAAGAACGGAGGGAGGGATCGCGCTGAGTATAAAAGCCGGTTTTC
GGGGCTTTATCTAACTCGCTGTAGTAATTCCAGCGAGAGGCAGAGGGAGC
GAGCGGGCGGCCGGCTAGGGTGGAAGAGCCGGGCGAGCAGAGCTGCGCTG
CGGGCGTCCTGGGAAGGGAGATCCGGAGCGAATAGGGGGCTTCGCCTCTG
GCCCAGCCCTCCCGCTGATCCCCCAGCCAGCGGTCCGCAACCCTTGCCGC
ATCCACGAAACTTTGCCCATAGCAGCGGGCGGGCACTTTGCACTGGAACT
TACAACACCCGAGCAAGGACGCGACTCTCCCGACGCGGGGAGGCTATTCT
GCCCATTTGGGGACACTTCCCCGCCGCTGCCAGGACCCGCTTCTCTGAAA
GGCTCTCCTTGCAGCTGCTTAGACGCTGGATTTTTTTCGGGTAGTGGAAA
ACCAGGTAAGCACCGAAGTCCACTTGCCTTTTAATTTATTTTTTTATCAC
TTTAATGCTGAGATGAGTCGAATGCCTAAATAGGGTGTCTTTTCTCCCAT
TCCTGCGCTATTGACACTTTTCTCAGAGTAGTTATGGTAACTGGGGCTGG
GGTGGGGGGTAATCCAGAACTGGATCGGGGTAAAGTGACTTGTCAAGATG
GGAGAGGAGAAGGCAGAGGGAAAACGGGAATGGTTTTTAAGACTACCCTT
TCGAGATTTCTGCCTTATGAATATATTCACGCTGACTCCCGGCCGGTCGG
ACATTCCTGCTTTATTGTGTTAATTGCTCTCTGGGTTTTGGGGGGCTGGG
GGTTGCTTTGCGGTGGGCAGAAAGCCCCTTGCATCCTGAGCTCCTTGGAG
TAGGGACCGCATATCGCCTGTGTGAGCCAGATCGCTCCGCAGCCGCTGAC
TTGTCCCCGTCTCCGGGAGGGCATTTAAATTTCGGCTCACCGCATTTCTG
ACAGCCGGAGACGGACACTGCGGCGCGTCCCGCCCGCCTGTCCCCGCGGC
GATTCCAACCCGCCCTGATCCTTTTAAGAAGTTGGCATTTGGCTTTTTAA
AAAGCAATAATACAATTTAAAACCTGGGTCTCTAGAGGTGTTAGGACGTG
GTGTTGGGTAGGCGCAGGCAGGGGAAAAGGGAGGCGAGGATGTGTCCGAT
TCTCCTGGAATCGTTGACTTGGAAAAACCAGGGCGAATCTCCGCACCCAG

11 The epigenome constitutes the DNA's second code
Why do cells need a second code on top of the genomic code?
- All cells of the body share the same genome (set of genes)
- Cell-type specific epigenomes distinguish tissues (e.g. brain, heart, liver cells)
- The epigenome provides access control software to the genomic hardware
[Figure: 200 human cell types (brain, skin, heart, liver, muscle, white blood cells), with labels "No insulin", "Many antibodies", "No hemoglobin"]

12 Human epigenome projects are in full swing around the globe
International Human Epigenome Consortium:
- NIH Roadmap Epigenomics: Focus on pluripotent cells and fetal tissues
- EU BLUEPRINT: Focus on the blood, leukemia and diabetes
- Additional projects in Canada, Germany, Italy, Japan and South Korea
Need to make large-scale datasets accessible for biomedical research


14 How can we use reference epigenome data to guide small-scale biology?
Approach 1: Computational data mining (computer leads the way)
- Goal: Let the computer discover patterns in the data
- Example: EpiGRAPH web server for feature enrichment in epigenome region sets
- Problem: acceptance issues, potential to stifle creativity
Approach 2: Interactive data exploration (computer helps navigate)
- Goal: Empower the researcher to discover patterns
- Researcher-driven, the computer assists and guides
- Example: Exploring epigenome datasets in real time using EpiExplorer

15 EpiExplorer: interactive hypothesis generation and live exploration of large genomic datasets
Inspired by web search engines such as Google
[Figure: screenshot panels A to E]

16 EpiExplorer utilizes text search to enable epigenome data exploration
Efficient text search with interactive refinement:
- Top-k search engine with live results display
- Auto-completion providing suggestions for query refinement
- Built-in relational database functionality (AND, OR, joins)
- Very fast due to optimized prefix indexing (work by Hannah Bast: SPIRE'06, SIGIR'06)
Adaptation to epigenome data:
- Each genomic region corresponds to one document
- Epigenome annotations are encoded as text attributes in a hierarchical prefix format (for auto-completion)
- Diagrams generated in real time to guide analysis
- Highly scalable: dedicated index server for each region set
[Figure: "EpiExplorer. Powered by" logo; example regions R1 to R10 annotated with CpG island, Exon, H3K4me3, H3K27me3 and conserved]
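The idea of treating each genomic region as a document and each annotation as a hierarchical text attribute can be illustrated with a toy prefix index. This is a minimal sketch and not the engine EpiExplorer actually uses: the token format (`overlap:histone:H3K4me3`), the class name `PrefixIndex`, and the four example regions are all hypothetical, and a sorted token list with binary search stands in for the optimized prefix index cited on the slide.

```python
from bisect import bisect_left

class PrefixIndex:
    """Toy inverted index: annotation token -> set of region ids."""

    def __init__(self):
        self.postings = {}
        self.sorted_tokens = []

    def add(self, region_id, tokens):
        for token in tokens:
            self.postings.setdefault(token, set()).add(region_id)

    def finalize(self):
        # Sorting once lets every prefix lookup start with a binary search.
        self.sorted_tokens = sorted(self.postings)

    def complete(self, prefix):
        """Auto-completion: all indexed tokens that start with prefix."""
        i = bisect_left(self.sorted_tokens, prefix)
        matches = []
        while i < len(self.sorted_tokens) and self.sorted_tokens[i].startswith(prefix):
            matches.append(self.sorted_tokens[i])
            i += 1
        return matches

    def search(self, *prefixes):
        """AND query: regions with at least one matching token per prefix."""
        result = None
        for prefix in prefixes:
            hits = set()
            for token in self.complete(prefix):
                hits |= self.postings[token]
            result = hits if result is None else result & hits
        return sorted(result) if result is not None else []

# Hypothetical example data: four regions with made-up annotations.
index = PrefixIndex()
index.add("R1", ["overlap:cgi", "overlap:histone:H3K4me3"])
index.add("R2", ["overlap:histone:H3K27me3"])
index.add("R3", ["overlap:cgi", "overlap:histone:H3K4me3", "overlap:histone:H3K27me3"])
index.add("R4", ["overlap:exon"])
index.finalize()

suggestions = index.complete("overlap:histone:")
bivalent_cgi = index.search("overlap:cgi", "overlap:histone:H3K27")
any_histone = index.search("overlap:histone:")
print(suggestions, bivalent_cgi, any_histone)
```

Because the attribute names share prefixes, typing a partial query such as `overlap:histone:` both yields completion suggestions and selects every region carrying any histone mark, which is the behavior the hierarchical prefix format is meant to enable.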


18 Challenges for scientific computing in bioinformatics and genomic medicine
Data volume:
- At the current rate, CeMM will have to add 50 TB every 6 months just to survive
- We're on an exponential curve that is much steeper than IT hardware advances
Data processing:
- Raw sequencing data needs to be mapped to the genome to become useful
- Mapping is performed in-house using a highly automated pipeline
- Typical mapping task: 100 serial jobs, each with 16 GB memory and 12 hours runtime
Data analysis:
- Most analyses are prepared interactively using the R statistics software
- Analysis task 1: 5000 serial jobs, each with 1-16 GB memory and hours runtime
- Analysis task 2: 10 serial jobs, each with 48 GB memory and 5 days runtime
- Increasing parallelization of R provides opportunities, but most analyses are done serially
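The job profiles above translate into cluster requirements with some simple arithmetic. The sketch below uses only the figures stated for the mapping task and analysis task 2; the number of available job slots (25) is a hypothetical value chosen for illustration, and the yearly storage figure assumes a constant rate, which the slide itself suggests is a lower bound.

```python
import math

def core_hours(jobs, hours_per_job):
    """Total serial compute time, one core per job."""
    return jobs * hours_per_job

def wall_clock_hours(jobs, hours_per_job, slots):
    """Elapsed time when at most `slots` equal-length jobs run at once."""
    return math.ceil(jobs / slots) * hours_per_job

# Figures from the slide.
mapping_core_hours = core_hours(100, 12)        # 100 jobs x 12 h
analysis2_core_hours = core_hours(10, 5 * 24)   # 10 jobs x 5 days
mapping_peak_memory_gb = 100 * 16               # if all mapping jobs run concurrently
analysis2_peak_memory_gb = 10 * 48

# Hypothetical: 25 job slots available on the cluster.
mapping_wall_clock = wall_clock_hours(100, 12, slots=25)

# 50 TB every 6 months, assuming the rate stays constant (a lower bound).
storage_tb_per_year = 50 * 2

print(mapping_core_hours, analysis2_core_hours)
print(mapping_peak_memory_gb, analysis2_peak_memory_gb)
print(mapping_wall_clock, storage_tb_per_year)
```

The two tasks cost the same number of core hours but differ in shape: the mapping task is wide and short, while analysis task 2 is narrow, long-running and memory-heavy per job.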

19 Challenges for scientific computing in bioinformatics and genomic medicine
Data exchange:
- Researchers need to submit their raw data to public repositories (GEO, EGA)
- Aspera is gaining significant market share for high-speed data transfer
- Bringing the code to the data: guest accounts are becoming increasingly common
Biological applications:
- User-friendly web tools empower biologists to perform analyses over the Internet
- Genome browsers and web-based analysis tools aggregate 100s of tools and datasets
- Need to have 10s of TB available over the Internet for block-wise HTTP download
Medical applications:
- Need to be able to deliver within a defined time frame (e.g. 5 days)
- Medical legislation requires high levels of data protection
- Very conservative approach needed, carefully avoiding any mistakes
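Block-wise HTTP download is typically done with the standard `Range` request header, which lets a client such as a genome browser fetch only the part of a large file it needs. The sketch below shows how the byte ranges could be computed; the URL, file size and block size are hypothetical, and no request is actually sent.

```python
import urllib.request

def byte_ranges(total_size, block_size):
    """Yield inclusive (start, end) byte offsets covering the whole file."""
    for start in range(0, total_size, block_size):
        end = min(start + block_size, total_size) - 1
        yield start, end

def range_header(start, end):
    """HTTP Range header for one block; both offsets are inclusive."""
    return {"Range": f"bytes={start}-{end}"}

# Hypothetical file of 10 bytes fetched in blocks of 4 bytes.
blocks = list(byte_ranges(10, 4))
headers = [range_header(start, end) for start, end in blocks]

# Hypothetical URL; the request object is built but never opened.
request = urllib.request.Request(
    "https://example.org/data/sample.bam", headers=headers[0]
)
print(blocks)
print(request.get_header("Range"))
```

A server that supports range requests answers each such request with status 206 (Partial Content) and only the requested bytes, so the storage behind it has to handle many small random reads rather than a few sequential ones.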

20 END OF PRESENTATION