Access to Information from Molecular Biology and Genome Research

Future Needs for Research Infrastructures in Biomedical Sciences
DG Research: Brussels, March 2005

User Community for this information is large and varied
- All biologists of every persuasion: biomedicine, zoology, botany, genetics, phylogeny, physiology, ecology, biochemistry, biodiversity
- Academics, pharma, biotechnology
- Health, diagnostics, patents, agriculture, foods, agrochemicals, forestry, brewing, fisheries
- Very large numbers of users

e.g. Usage of EBI's web site (hits, including Ensembl): 1.3 million hits/day; 20,000 distinct users/month
[Figure: hits plotted by quarter, 1st quarter 1999 to 4th quarter 2004; vertical axis 0 to 2,000,000]

Four Aspects of Bioinformatics Infrastructures
- Standards
- Databases: collect, curate and archive experimental data and re-distribute (core & specialist)
- Annotations (manual & automated)
- Search & Retrieval Tools and Services
Most data resources are involved in all 4 aspects

Standards & Ontologies
- For exchange and integration of any data, defined standards and ontologies are necessary
- This usually involves the whole community in discussions/negotiations, with data providers taking the lead
- All large data resources have well-defined standards and ontologies, which evolve with time, for example:
  - Genes: Gene Ontology
  - Structures: mmCIF
  - Expression Data: MIAME
  - Proteomic Data: PSI
  - Metabolome Data: in progress
  - Systems Biology: SBML

Data Resources
For molecular biology and genome research it is possible to define 3 key distinct types of data resources:
- Core Resources
- Model Organism Resources
- Specialist Resources

What defines a Core Data Resource?
Some combination of the following:
- Data collection and curation
- Collects experimental data from the public (now often high-throughput, HTP)
- Comprehensive
- In public domain
- Designated site for data collection agreed by community
- Curates collected data into archive
- Provides searchable archive
- Journal agreement to require submission of data
- International collaboration, involving regular exchange of data
- 24/7 service
- Maturity & stability (of data and resource!)
- High usage

Global Context
For example:
- EMBL/DDBJ/NCBI
- wwPDB: RCSB/PDBj/MSD
- UniProt: EBI/SIB/PIR
Europe, USA, Japan

Core Molecular Biology Data Resources in Europe
- EMBL-Bank: DNA sequences
- Reactome
- ArrayExpress: Microarray Expression Data
- UniProt: Protein Sequences
- Ensembl: Genome Annotation
- IntAct: Protein Interactions
- MSD: Macromolecular Structure Data

Model Organism Data Resources
- FlyBase - Drosophila
- WormBase - C. elegans
- MGD - mouse
- YPD - yeast
- Human?
- Arabidopsis (UK & US)
Most are funded by the US government

Other Molecular Data Resources (Non-core/Specialist)
These data resources have some or all of the following characteristics:
- Based on Core Data Resources
- Confined to one laboratory or small group of individuals
- Do not collect data from external laboratories
- Small
- Computationally derived data
- Less mature & stable
- No commitment to be comprehensive
- No explicit commitment to community (e.g. standards)
- More specialised (e.g. one protein family)

Molecular Biology Database Collection
Galperin (2005) NAR 33, D5-D24
- 719 databases listed (+171 since last year!)
- 14 major categories
- ~30% in Europe
- First databases this year from Estonia, Greece, Hungary, Turkey, Malaysia, Taiwan, Brazil & Cuba

Annotations (Manual & Automated)
Manual
- Manual annotations within Core Data Resources (e.g. UniProt)
- Third-party annotations in Core Data Resources: established mechanisms to allow experimentalists/theoreticians to enter their own annotations in core data resources, e.g. EMBL; distributed annotations via DAS (Ensembl)
EU Network of Excellence for Genome Annotation - BioSapiens
- Virtual Institute for Genome Annotation, i.e. distributed
- 26 contractors / 14 countries, mainly expert bioinformatics laboratories
- Based on core reference sequence/structure data
- Develops new tools for annotations
Automated
- Distributed compute (clients and servers)

Data Processing, Search & Retrieval Tools
- Search and retrieval systems within Core Data Resources, e.g. MSD
- Many laboratories develop novel search and retrieval tools for sequences and structures
- EU Network of Excellence EMBRACE
  - Data integration
  - Development of tools and programming interfaces to exploit information
  - Tracking and exploiting novel IT/GRID developments

Current status of Bioinformatics in Europe: Infrastructure Funding I: EU
- Currently limited funding for bioinformatics under the Research Infrastructures Action
  - EuroCarbDB Design Study
  - Felics I3 (just submitted to FP6 by EBI/SIB/EPO/Uni Köln)
- FP6 TEMBLOR under Quality of Life Programme: major grant to EBI & partners (€20 million for 3 years)
  - Funds MSD, ArrayExpress, IntAct [finishing June 2005]
- EMBnet: 26 nodes through Europe, many now closed
- Networks of Excellence
  - BioSapiens: Genome Annotations (€12 million for 5 years)
  - EMBRACE: Search & Retrieval Tools (€8 million for 5 years)
- Many other research grants with a small informatics component

Current status of Bioinformatics in Europe: Infrastructure Funding II: EMBL
- EMBL funds €8 million per annum for Core Molecular Data Resources at EBI

Current status of Bioinformatics in Europe: Infrastructure Funding: National
- National funding: to my knowledge there is no national funding for core bioinformatics data resources (except Arabidopsis)
- One-off grants for specialist data resources awarded to individuals
- Some national search & training facilities in Europe:
  - UK CCP11: Collaborative Computing Programme to foster bioinformatics (software collection - EMBOSS)
  - UK MRC Rosalind Franklin Centre for Genomic Research (closing July)
  - UK e-Science efforts: GRID research (mainly middleware)
  - French Genopoles (attached to experimental labs)
  - German NBCC Bioinformatik-Zentrum in Deutschland & HNB - Helmholtz Network Bioinformatics
  - Swiss Institute of Bioinformatics (SIB & ISB)
  - Finnish CSC (Supercomputer Centre): provides life-sciences services & support

Challenge: Funding Services provided by the Core Resources
- Bioinformatics is international
- Core resources are used by everyone
- Who should pay?
  - Individual countries do not want to take responsibility for an international effort
  - The US has been very pro-active in supporting public data resources and making data freely available to all through the web; therefore a commercial solution is not an available option
  - The EU to date has only funded new projects
  - EMBL provides core funding but has a limited budget

US Bioinformatics Infrastructure Funding
- Infrastructure for data resources supported by rolling funding provided to government laboratories & designated core resources
  - NCBI (for GenBank, Entrez, OMIM)
  - UniProt (inc. PIR)
  - RCSB (for PDB)
- Bioinformatics infrastructure for large-scale analyses (National Partnership for Advanced Computational Infrastructure, NPACI)

Future Needs: From Molecules, to Cell, to Organisms, to Physiology
[Figure labels: p53 tumour suppressor; Genome; Protein; Cell; Embryo; Fruitfly; Mouse; Human; Development, Ageing; Disease]

Needs for future Service Providers
To fulfil these demands Service Providers need:
- Support for Core Resources
  - To handle the data explosion
    - more data - better resources
    - new data - new standards & resources
    - mechanism to identify and support new core resources as they are needed
  - To link to model organism data resources and specialist resources
  - To link to related discipline data resources
  - To integrate data for Systems Biology
  - To enhance services to provide easier, quicker access to data, incorporating advances in GRID technologies
  - Adequate virtual environment and installation equipment, systems administration and DBAs to serve data to community
  - To provide scientific and technical support to run data resources & to provide training

Needs for future Service Providers (cont)
- Support for Specialist Resources
  - Coordination of Specialist Resources
  - Links to Core Resources
  - Mechanism to evolve into Core Resource, if appropriate
- Support for Distributed Annotation
- Support for Distributed Software Libraries & Coordinated Web Services

Proposals to meet these needs
In FP7, Infrastructure Funds to support, in order of priority:
- Core Databases - Enhancements/Services
- New Core Resources
- Standards
- Distributed networks of specialised data resources
- Distributed Annotation Network(s)
- Tools/libraries & Distributed Web Services?

Challenges of Current I3 Infrastructure Funding Instrument for Bioinformatics
- Current guidelines are based on physical, not virtual, infrastructures; large instrumentation is not needed
- Data are rapidly evolving and diversifying and hence need rapidly evolving platforms for data collection, storage & dissemination
- Peer review neither possible nor desirable for web-based infrastructures
- Difficult to raise adequate support for centralised DBs and the services they provide, although everyone uses them
- No support for maintenance, only creation and enhancements
- Cumulative cost comparable with Physics, but spent differently: ongoing rather than upfront investment

Funding Mechanisms I: Core Data Resources
Aim: To support Core Data Resources
- To provide a contribution to creation and enhancement of databases
- To fund services & access through the web (needs installation/equipment/dbs)
- To fund technical support & training to use resources properly
Model: one/limited number of sites for DB, but thousands of depositors/users across Europe
Instrument: I3 with modified access & enhancement criteria
Prioritise: Through strong links with Life Sciences Thematic Priorities, especially for new resources

Funding Mechanisms II: Standards
Aim: To support establishment of standards for new data
Model: Led by a limited number of laboratories, but involving the whole community
Instrument: Design Study in Infrastructures
Prioritise: Through strong links with Life Sciences Thematic Priorities

Funding Mechanisms III: Specialist Data Resources
Aim: To develop Clusters of Specialist Data Resources, e.g. individual protein families; immunology databases; Model Organism data resources (e.g. bacteria, pathogenic organisms)
Model: distributed databases, one in each laboratory (say ~6 on average)
Instrument: Integrated Project; one partner may provide core infrastructure?
Prioritise: By a strong link between infrastructure funding and research priorities in Life Sciences Thematic Priorities

Funding Mechanisms IV: Distributed Annotation/Tools/Web services
Aim: Support for Bioinformatics Laboratories to employ their best tools to provide information to the whole biological community
Model: Distributed effort in many expert laboratories; essentially an open system with an infrastructure hub provided by one partner
Instrument: Network of Excellence, perhaps with fewer participants
Prioritise: By research themes; by topic (e.g. transcriptome data) etc.