EMBL-EBI and pan-national scale bioinformatics: Relevance to Australia


1 EMBL-EBI and pan-national scale bioinformatics: Relevance to Australia. Paul Flicek, Vertebrate Genomics, European Molecular Biology Laboratory; Wellcome Trust Sanger Institute

2 European Molecular Biology Laboratory. EMBL is an intergovernmental organisation founded in 1974 and consisting of more than 21 member states, associate and prospect members. The European Bioinformatics Institute (EMBL-EBI) is one of five EMBL sites. Other major EBI funders: European Commission, National Institutes of Health, Research Councils UK, Wellcome Trust.

3 EMBL infrastructure and services. Structural Biology: > 3,000 user visits per year, many users of complementary services. Bioinformatics at EMBL-EBI: ~ 16,000,000 web visits per day. Core Facilities: > 1,200 internal and external users per year.

4 EMBL International PhD Programme: joint PhD degree with 25 universities in 17 countries, incl. Group of Eight, Australia; > 200 students from over 40 countries. Postdoctoral programmes: EMBL Interdisciplinary Postdocs (EIPODs); Classical Postdoctoral Scheme; EMBL-EBI Sanger Postdocs (ESPODs); EMBL-EBI BRC* Postdocs (EBPOD). (*BRC: NIHR Cambridge Biomedical Research Centre)

5 EBI Summary. Based on the Wellcome Genome Campus near Cambridge, England. Europe's hub for biological data services and research. 600 members of staff from 57 nations.

6 Overview of EBI activities. Archive resources: record of scientific publications and output; ensure stability and reproducibility in the future; submission records have the concept of ownership and can only be updated by the owner; normally have counterpart databases in other parts of the world (e.g. NCBI). Example: European Nucleotide Archive / GenBank / DDBJ. Value-added resources: usually built on the data contained in the archive resources; data analysis by world leaders (normally in collaborations); enabling science. Examples: Ensembl & UniProt. Research & special projects: 10 dedicated research groups across computational biology; projects include the 1000 Genomes, ENCODE & others.
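The ownership rule for archive submissions can be illustrated with a small sketch. This is not how any EMBL-EBI archive is implemented; the class name, fields, accession string and the idea of keeping earlier versions in a history list are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubmissionRecord:
    """Hypothetical archive record that only its owner may update."""
    accession: str          # illustrative identifier, not a real accession
    owner: str              # the submitter who owns the record
    content: str
    history: List[str] = field(default_factory=list)

    def update(self, requester: str, new_content: str) -> None:
        # Only the owning submitter may change the record.
        if requester != self.owner:
            raise PermissionError(
                f"{requester} does not own record {self.accession}"
            )
        # Assumption: earlier versions are retained rather than overwritten,
        # so that analyses citing an older version stay reproducible.
        self.history.append(self.content)
        self.content = new_content


record = SubmissionRecord("EXAMPLE0001", owner="lab_a", content="v1")
record.update("lab_a", "v2")        # accepted: requester is the owner
try:
    record.update("lab_b", "v3")    # rejected: requester is not the owner
except PermissionError as err:
    print(err)
```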

7 Our mission: Services, Research, Training, Industry, Coordination

8 Our research: neurons in Parkinson's disease; protein targets for new drugs; stem cell differentiation; DNA data storage; cancer genome structure

9 Our services

10 Provide tools. EMBL-EBI's data cycle: Archive, Analyse, Share, Classify

11 European Bioinformatics Institute. Photo credit: Des Higgins

12 Lots of data

13 Ensembl: designed for the human genome; now used for all species. Regulation, Gene, Allele, Conserved sequence. Gene annotation: splice variants, proteins, non-coding RNA. Small and large scale sequence variation, phenotype associations. Whole genome alignments, protein trees and homologous genes. Potential promoters and enhancers, DNA methylation. User upload, custom data. Aken et al. (NAR, 2017)

14 Ensembl Variant Effect Predictor (VEP). McLaren et al. (Genome Biol, 2016)

15 Infrastructures are critical

16 But we only notice them when they go wrong

17 Informatics is Infrastructure: Vision and Opportunities. Data Archives and Standards; Data Coordination and Analysis; Distribution and Compute; Expectations and Ambition

18 Informatics builds things and makes things work for genomics and big biology

19 Vertebrate Genomics at EMBL-EBI. We are responsible for the reference human genome assembly. We create the standard human gene set (and the gene sets for most vertebrate species used in genomics). We make tools and data resources to enable genomic science. We give everything away for free as soon as it is complete. We work openly: our software and our data are visible to everyone in real time.

20 Data growth [Chart: bytes (log scale, 1E+10 to 1E+16) against date for EGA, ENA, PRIDE, MetaboLights and ArrayExpress, annotated with 3, 4, 12 and 18 month doubling times]
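To make the doubling times quoted on the data growth slide (3, 4, 12 and 18 months) more concrete, the sketch below converts a doubling time into a growth factor over a chosen period. It assumes clean exponential growth, which real archives follow only approximately, and the function names are illustrative. It does not say which archive has which doubling time, since the transcription does not preserve that pairing.

```python
import math


def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Factor by which data volume grows after elapsed_months,
    assuming it doubles every doubling_months."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_grow(doubling_months: float, factor: float) -> float:
    """Months needed to grow by `factor` at the given doubling time."""
    return doubling_months * math.log2(factor)


# Growth over one year for each doubling time quoted on the slide.
for months in (3, 4, 12, 18):
    print(f"{months:>2}-month doubling: x{growth_factor(months, 12):.2f} per year")
```

Under this model a 12-month doubling time means a twofold increase per year, while a 3-month doubling time means a sixteenfold increase per year, which is why storage planning differs so sharply between resources.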

21 Data compression: horizontal and vertical. Jong-Seok Lee et al. (2009), http://mmspg.epfl.ch/files/content/sites/mmspl/files/shared/lee_icme.pdf. Photograph from MichaelMaggs, http://en.wikipedia.org/wiki/File:Amanita_muscaria_(fly_agaric).jpg

22 In how many ways can you say female? 18-day pregnant females female (lactating) individual female worker caste (female) 2 yr old female female (pregnant) lgb*cc females sex: female 400 yr. old female female (outbred) mare female, other adult female female parent female (worker) female child asexual female female plant monosex female femal castrate female female with eggs ovigerous female 3 female cf.female female worker oviparous sexual females female (phenotype) cystocarpic female female, 6-8 weeks old worker bee female mice dikaryon female, virgin female enriched female, spayed dioecious female female, worker pseudohermaprhoditic female femlale diploid female female(gynoecious) remale metafemale f femele semi-engorged female sterile female famale female, pooled sexual oviparous female normal female f femalen sterile female worker sf female females strictly female vitellogenic replete female female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female (based on morphology) female (note: this sample was originally provided as a "male" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a female individual)
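The free-text values on this slide show why archives need controlled vocabularies. As a rough illustration, the sketch below maps a submitted value to the single term "female" when it contains a recognised token, and returns None otherwise so that a curator can review it. The token lists are taken from spellings visible on the slide. The function name, the rules and the choice to leave unmatched values for manual review are assumptions, not a description of any EMBL-EBI curation pipeline.

```python
import re
from typing import Optional

# Spellings of "female" that appear on the slide, including misspellings.
FEMALE_TOKENS = {
    "female", "females", "femal", "femlale", "remale",
    "femele", "famale", "femalen",
}
# Species-specific words from the slide that imply a female animal.
FEMALE_SYNONYMS = {"mare", "hen"}


def normalise_sex(value: str) -> Optional[str]:
    """Return "female" if the free-text value contains a known token,
    otherwise None so the value can be reviewed by a curator."""
    tokens = re.findall(r"[a-z]+", value.lower())
    if any(t in FEMALE_TOKENS or t in FEMALE_SYNONYMS for t in tokens):
        return "female"
    return None


for raw in ("female (lactating)", "femlale", "cf.female", "mare", "thelytoky"):
    print(f"{raw!r} -> {normalise_sex(raw)}")
```

A token lookup like this deliberately leaves values such as "metafemale", "f" or "thelytoky" unresolved. Deciding what those mean needs biological judgement, which is the point the slide makes about the cost of unstructured metadata.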

23 Informatics is Infrastructure: Vision and Opportunities. Data Archives and Standards; Data Coordination and Analysis; Distribution and Compute; Expectations and Ambition

24 Data Coordination: The 1000 Genomes Project and beyond. An orchestration role that enables highly efficient large-scale science. From raw data to project analysis to data release and community support. Methods adapted for epigenetics, agricultural and iPS projects. [Image: cover of Nature, 28 October 2010, Vol. 467, "A Thousand Genomes: Pilot studies prepare the way for population-scale gene sequencing"] Clarke et al. (Nat Meth, 2013)

25 Managing the 1000 Genomes data: What it felt like in April 2008; First major data transfer; Today

26 Distributed production: sequence data submission

27 Distributed consumption: sequence data access

28 Embassy Cloud

29 Embassy Cloud. Direct network access to the EMBL-EBI data, services and compute. Don't replicate data sets, use them. Complete control. An embassy is sovereign territory in a host country; the Embassy Cloud workspace is an isolated, secure environment. Traffic from an Embassy Cloud workspace to EMBL-EBI's public data resources and services is retained within our own network infrastructure, but neither EMBL-EBI nor any other cloud client can access it. Currently have academic and commercial users.

30 Informatics is Infrastructure: Vision and Opportunities. Data Archives and Standards; Data Coordination and Analysis; Distribution and Compute; Expectations and Ambition

31 Some things we often hear: "It should work just like Google or Apple"; "I want a button that does all of my analysis automatically"; "We have a critical deadline"; "The cloud will solve this problem"

32 Big problems need solutions. Source: Guardian.co.uk, Graphic: Matt Pike. Source: News Limited

33 Solutions are often possible, but rarely free

34 Infrastructure enables discovery. Interesting, ground-breaking ideas; necessary (if conceptually unexciting) data management

35 Standards, processes and methods for ethical data sharing. Extensive engagement across the group related to data infrastructure, security, variation annotation, BRCA. Major relevance to EBI's genomics data archives and all users of them.

36 Relevance to Australia: data and compute access via traditional methods and the Embassy Cloud; archive and presentation of key data sets; transfer of technical expertise and best practices; specialised bioinformatics training; doctoral and postdoctoral training; research and service collaborations

37 Conclusions. Informatics is about building high-quality, integrated tools and resources to enable genomic science, create stability and facilitate reproducibility. Understanding biology requires viewing problems from multiple vantage points. Connecting the views from big projects to small projects drives results and creates insights.

38 From Sequence to Knowledge