Computational Challenges in Life Sciences Research Infrastructures

Similar documents
Elixir: European Bioinformatics Research Infrastructure. Rolf Apweiler

Future of the Project. BioMedBridges 3rd Annual General Meeting 18 February 2015, Munich Janet Thornton

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Bioinformatics: Present & Future from an EBI Perspective. Janet Thornton. UCL February 14 th 2012

Elixir: Overview, Progress and Futures

European Genome phenome Archive at the European Bioinformatics Institute. Helen Parkinson Head of Molecular Archives

Interpreting Genome Data for Personalised Medicine. Professor Dame Janet Thornton EMBL-EBI

Introduction to Bioinformatics

ELIXIR: data for molecular biology and points of entry for marine scientists

Solutions will be posted on the web.

Introduction to 'Omics and Bioinformatics

A framework for quality management in the biomedical research infrastructures (BMS RIs)

Bridging the Gap Between Basic and Clinical Research. Julio E. Celis Danish Cancer Society

Introduction and Public Sequence Databases. BME 110/BIOL 181 CompBio Tools

Pharmacogenetics: A SNPshot of the Future. Ani Khondkaryan Genomics, Bioinformatics, and Medicine Spring 2001

Pathology Facilities and Potential for Collaboration. The University of Western Australia

Ethical Governance Framework

Professor Ewan Birney FRS Director, EMBL-EBI

Minimum Information About a Microarray Experiment (MIAME) Successes, Failures, Challenges

Machine Learning. HMM applications in computational biology

Processing Data from Next Generation Sequencing

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

Von Mäusen und Menschen Datenbrücken verbessern den Zugang zu Forschungsdaten in INFRAFRONTIER

Two Mark question and Answers

Computational Challenges of Medical Genomics

NGS Approaches to Epigenomics

TREE CODE PRODUCT BROCHURE

The EMBL-Bioinformatics and Data-Intensive Informatics

Course Presentation. Ignacio Medina Presentation

EESI Final Conference

Introduction to BIOINFORMATICS

Introduction to NGS. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

ACCELERATING GENOMIC ANALYSIS ON THE CLOUD. Enabling the PanCancer Analysis of Whole Genomes (PCAWG) consortia to analyze thousands of genomes

EMMA - European Mouse Mutant Archive Michael Hagn

Agilent NGS Solutions : Addressing Today s Challenges

Centro Nacional de Análisis Genómico. Where are the Bottlenecks of Genome Analysis Today? Teratec. Ecole Polytechnique, Palaiseau, F.

Linking Genetic Variation to Important Phenotypes: SNPs, CNVs, GWAS, and eqtls

NEXT GENERATION SEQUENCING. Farhat Habib

Fundamentals of Clinical Genomics

Genetics and Bioinformatics

B I O I N F O R M A T I C S

Genomics. Data Analysis & Visualization. Camilo Valdes

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

- OMICS IN PERSONALISED MEDICINE

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome

Next Generation Sequencing. Dylan Young Biomedical Engineering

Using human genetics to make new medicines

Overview of the Luxembourg biomedical and health research ecosystem FNR FRANK GLOD LUXCOR HEALTH EVENT MONDAY 19 JUNE 2017

NCBI web resources I: databases and Entrez

Genomes contain all of the information needed for an organism to grow and survive.

THE WHITE HOUSE Office of the Vice President

Complex Adaptive Systems Forum: Transformative CAS Initiatives in Biomedicine

From genome-wide association studies to disease relationships. Liqing Zhang Department of Computer Science Virginia Tech

EMBL-EBI Overview EMBL-EBI Overview

Introduction to EMBL-EBI.

Lesson Overview. Studying the Human Genome. Lesson Overview Studying the Human Genome

The EORTC Molecular Screening programme SPECTA

The Human Genome Project

Personalized Human Genome Sequencing

CSE/Beng/BIMM 182: Biological Data Analysis. Instructor: Vineet Bafna TA: Nitin Udpa

Introduc)on to Databases and Resources Biological Databases and Resources

DNA. Clinical Trials. Research RNA. Custom. Reports CLIA CAP GCP. Tumor Genomic Profiling Services for Clinical Trials

Introduction to Bioinformatics

Computational methods in bioinformatics: Lecture 1

What Can the Epigenome Teach Us About Cellular States and Diseases?

Christoph Bock ICPerMed First Research Workshop Milano, 26 June 2017

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

NGS in Pathology Webinar

Introduction to BIOINFORMATICS

Genetics Lecture 21 Recombinant DNA

Global Biomolecular Information Infrastructure and Australia. Graham Cameron Director The EMBL Australia Bioinformatics Resource

Molecular Biology. IMBB 2017 RAB, Kigali - Rwanda May 02 13, Francesca Stomeo

MICROARRAYS: CHIPPING AWAY AT THE MYSTERIES OF SCIENCE AND MEDICINE

Image adapted from: National Human Genome Research Institute

Genomes: What we know and what we don t know

The European Bioinformatics Institute - Cambridge. Overview

Our website:

IBBL: Introduction to Biobanks and their services

EMBL-EBI and pan-national scale bioinformatics: Relevance to Australia

Data representation for clinical data and metadata

Solutions will be posted on the web.

Cory Brouwer, Ph.D. Xiuxia Du, Ph.D. Anthony Fodor, Ph.D.

WP2 Legal, Governance & Ethical Issues

Vision, aims and strategies. Department of Immunology, Genetics and Pathology Uppsala University

QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd

Data Intensive Scientific Discovery Vijay Chandru

A whole-system approach to delivering personalised medicine and health in Leeds. Mike Messenger

Welcome to the NGS webinar series

Crash-course in genomics

Computers in Biology and Bioinformatics

Corporate Medical Policy

Whole genome sequencing in the UK Biobank

THE CAMPAIGN FOR UC SANTA CRUZ THE GENOMICS INSTITUTE

After the draft sequence, what next for the Human Genome Mapping Project Resource Centre?

Introduction to NGS. Simon Rasmussen Associate Professor DTU Bioinformatics Technical University of Denmark 2018

Computational Genomics. Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall

Lecture #1. Introduction to microarray technology

1.0 Background to the organisation

Transcription:

Computational Challenges in Life Sciences Research Infrastructures Alvis Brazma European Bioinformatics Institute European Molecular Biology Laboratory

European Bioinformatics Institute (EBI) EBI is in Hinxton, ~10 miles South of Cambridge, UK Wellcome Trust Genome Campus EBI is part of EMBL, ~like CERN for molecular biology ~500 scientific and IT staff at EBI Hosting the ELIXIR node (details to follow)

Molecular Biology The study of how life works at a molecular level Key molecules: DNA Information store (Disk) RNA Key information transformer, also does stuff (RAM) Proteins The business end of life (Chip, robotic arms) Metabolites Fuel and signalling molecules (electricity) Theories of how these interact no theories of how to predict what they are Instead we determine attributes of molecules and store them in globally accessible, open databases for mining, exploration and model building

DNA molecule is a perfect medium for recording information The particular sequence of A, T, G, C in DNA has little effect on its structure it will usually be the same the double-helix Each of the 4 different letters can encode 2 bits of information The physical distance between nucleotide pairs is about 0.34 nm Thus the information storage density in DNA is roughly 6x10 8 bits/cm - approximately 12.5 CD-Roms per cm

We can routinely read small fragments of DNA 1977-1990 500 bp, manual tracking 1990-2000 500 bp, computational tracking, 1D, capillary 2007-now 20-100bp, 2D systems very cheaply, ( 2 nd Generation or NGS) Soon >5kb, Real time 3 rd Generation Fred Sanger, inventor of terminator DNA sequencing

Costs have come exponentially down

DNA is where the genome information is encoded Nature 448, 548-549

The human genome contains DNA of the total length of 2 x 3 000 000 000 letters packed in 23 chromosome pairs Every cell in human has two copies of full genome Roughly 2500 CD-roms

Human Genome project 1989 2000 sequencing the human genome Just 1 individual actually a mosaic of about 24 individuals but as if it was one Old school technologies, a bit epic Now Same data volume generated in ~3mins in a current large scale centre It s all about the analysis As the first step of analysis we need to assemble the short fragments into full genomes But that s only the beginning of the analysis, e.g., we need to identify genes, their variants, function, etc

Understanding the genome The sequencing of the 6 billion chemical letters of human DNA was completed in draft in 2000 and in final form in 2003. But clinical benefits have arrived more slowly than the initial hype suggested. This is mainly because the human genome actually works in a much more complex way than predicted by the late-20thcentury model.

Understanding the genome Twenty-first-century The sequencing of the research 6 billion shows chemical that we have letters only of 21,000 human genes, DNA was one-fifth completed of the number in draft predicted in 2000 and when in final the project form in started, 2003. But and clinical that just 1.5 benefits per cent have of the arrived genome more consists slowly than of the initial conventional hype suggested. protein-coding This is mainly genes. because Efforts the are under human way genome to understand actually the works vital in regulatory a much more and complex other way functions than predicted of the non-coding by the late-20thcentury genome, model. once dismissed wrongly as junk regions of the DNA.

ENCODE Dimensions 182 Cell Lines/ Tissues Cells 3,010 Experiments 5 TeraBases 1716x of the Human Genome Methods/Factors 164 Assays (114 different Chip)

The International Cancer Genome Consortium (ICGC) was launched to coordinate large-scale cancer genome studies in tumours from 50 different cancer types and/or subtypes that are of clinical and societal importance across the globe. Systematic studies of more than 25,000 cancer genomes at the genomic, epigenomic and transcriptomic levels will reveal the repertoire of oncogenic mutations, uncover traces of the mutagenic influences, define clinically relevant subtypes for prognosis and therapeutic management, and enable the development of new cancer therapies

Data size Cancer genomics projects ~30 PB (= 10 15 B) of sequencing data However, the problem is not the size of the data, but how to analyse and interpret it!

Looking for fusion genes in breast cancer using 1000 genome RNAdata for control Mapping 391 tumor samples samples to the genome 2172 hours = 90.5 days CPU time 11730 GB memory Mapping 668 RNA samples from the 1000 genome project 3711 hours = 155 days CPU time and 20040 GB memory This is just one small specific one postdoc project!

Infrastructures are critical

But we only notice them when they go wrong

Biology already relies on an information infrastructure For the human genome ( and the mouse, and the rat, and x 150 now, 1000 in the future!) - Ensembl For the function of genes and proteins For all genes, in text and computational UniProt and GO For all 3D structures To understand how proteins work PDBe For where things are expressed The differences and functionality of cells ArrayExpress, Expression Atlas

..But this keeps on going We have to scale across all of (interesting) life There are a lot of species out there! We have to improve our chemical understanding Of biological chemicals Of chemicals which interfere with Biology We have to handle new areas, in particular medicine A set of European haplotypes for good imputation A set of actionable variants in germline and cancers

How? Fully Centralised Fully Distributed Pros: Stability, reuse, Learning ease Cons: Hard to concentrate Expertise across of life science Geographic, language placement Bottlenecks and lack of diversity Pros: Responsive, Geographic Language responsive Cons: Internal communication overhead Harder for end users to learn Harder to provide multi-decade scability

How? Robust network with a strong hub Node Domain Specific hub General Hub (EBI) National European Life Sciences Infrastructure - ELIXIR

ELIXIR s mission To build a sustainable European infrastructure for biological information, supporting life science research and its translation to: medicine environment bioindustries society 27

Overlapping Networks

ELIXIR memorandum or understanding signed by

Other infrastructures needed for biology EuroBioImaging Cellular and whole organism Imaging BioBanks (BBMRI) European populations in particular for rare diseases, but also for specific sub types of common disease Mouse models and phenotypes (Infrafrontier) A baseline set of knockouts and phenotypes in our most tractable mammalian model Robust molecular assays in a clinical setting (EATRIS) The ability to reliably use state of the art molecular techniques in a clinical research setting

BioMedBridges - Ten new biomedical sciences research infrastructures: stronger through common links

WP11: Technology Watch Comprises representatives of GÉANT, DANTE, EGI.eu, PRACE & CERN as well as technical experts from the ESFRI BMS RIs Brings together the technical experts Facilitates adoption of e-infrastructure technologies Communicates advice from the ICT Infrastructures and the e- Infrastructures to the BioMedBridges partners

From Molecules to Medicine 33

Funding and acknowledgements Ewan Birney, who provided many of the slides Functional Genomics Group at EBI IT Systems Group at EBI Funders EMBL member countries European Commission FP7 grants EurocanPlatform, CAGEKID, ELIXIR, BioMedBridges, GEUVADIS, IMI EMIF