Access to Information from Molecular Biology and Genome Research

Future Needs for Research Infrastructures in Biomedical Sciences
Access to Information from Molecular Biology and Genome Research
DG Research: Brussels, March 2005

User Community for this information is large and varied
- All biologists of every persuasion! Biomedicine, zoology, botany, genetics, phylogeny, physiology, ecology, biochemistry, biodiversity
- Academics, pharma, biotechnology
- Health, diagnostics, patents, agriculture, foods, agrochemicals, forestry, brewing, fisheries
- Very large numbers of users

e.g. Usage of EBI's web site (hits, including Ensembl)
[Chart: hits plotted by quarter from 1st quarter 1999 to 4th quarter 2004; vertical axis from 0 to 2,000,000]
- 1.3 million hits/day
- 20,000 distinct users/month

Four Aspects of Bioinformatics Infrastructures
- Standards
- Databases: collect, curate and archive experimental data and re-distribute (core & specialist)
- Annotations (manual & automated)
- Search & Retrieval Tools and Services
Most Data Resources are involved in all 4 aspects

Standards & Ontologies
- For exchange and integration of any data, defined Standards and Ontologies are necessary
- This usually involves the whole community in discussions/negotiations, with data providers taking the lead
- All large data resources have well-defined standards and ontologies, which evolve with time, for example:
  - Genes: Gene Ontology
  - Structures: mmCIF
  - Expression Data: MIAME
  - Proteomic Data: PSI
  - Metabolome Data: in progress
  - Systems Biology: SBML

Data Resources
For molecular biology and genome research it is possible to define 3 key distinct types of data resources:
- Core Resources
- Model Organism Resources
- Specialist Resources

What defines a Core Data Resource?
Some combination of the following:
- Data collection and curation
- Collects experimental data from public (now often HTP)
- Comprehensive
- In public domain
- Designated site for data collection agreed by community
- Curates collected data into archive
- Provides searchable archive
- Journal agreement to require submission of data
- International collaboration, involving regular exchange of data
- 24/7 service
- Maturity & stability (of data and resource!)
- High usage

Global Context
For example:
- EMBL/DDBJ/NCBI
- wwPDB: RCSB/PDBj/MSD
- UniProt: EBI/SIB/PIR
Europe, USA, Japan

Core Molecular Biology Data Resources in Europe
- EMBL-Bank: DNA sequences
- Reactome
- ArrayExpress: Microarray Expression Data
- UniProt: Protein Sequences
- Ensembl: Genome Annotation
- IntAct: Protein Interactions
- MSD: Macromolecular Structure Data

Core Molecular Biology Data Resources in Europe
- EMBL-Bank: DNA sequences
- Reactome
- ArrayExpress: Microarray Expression Data
- UniProt: Protein Sequences
- Ensembl: Genome Annotation
- IntAct: Proteomics & Interactions
- MSD: Macromolecular Structure Data
Funded in part by EU

Model Organism Data Resources
- FlyBase: Drosophila
- WormBase: C. elegans
- MGD: mouse
- YPD: yeast
- Human?
- Arabidopsis (UK & US)
Most funded by US government

Other Molecular Data Resources (Non-core/Specialist)
These data resources have some or all of the following characteristics:
- Based on Core Data Resources
- Confined to one laboratory or small group of individuals
- Do not collect data from external laboratories
- Small
- Computationally derived data
- Less mature & stable
- No commitment to be comprehensive
- No explicit commitment to community (e.g. standards)
- More specialised (e.g. one protein family)

Molecular Biology Database Collection
Galperin (2005) NAR 33, D5-D24
- 719 databases listed (+171 since last year!)
- 14 major categories
- ~30% in Europe
- First databases this year from Estonia, Greece, Hungary, Turkey, Malaysia, Taiwan, Brazil & Cuba

Annotations (Manual & Automated)
- Manual Annotations within Core Data Resources (e.g. UniProt)
- Third Party Annotations in Core Data Resources
  - Established mechanisms to allow experimentalists/theoreticians to enter their own annotations in core data resources, e.g. EMBL; Distributed Annotations via DAS (Ensembl)
- EU Network of Excellence for Genome Annotation: BioSapiens
  - Virtual Institute for Genome Annotation, i.e. distributed
  - 26 contractors / 14 countries, mainly expert bioinformatics laboratories
  - Based on core reference sequence/structure data
  - Develops new tools for annotations
  - Automated
  - Distributed compute (clients and servers)

Data Processing, Search & Retrieval Tools
- Search and Retrieval Systems within Core Data Resources, e.g. MSD
- Many laboratories develop novel search and retrieval tools for sequences and structures
- EU Network of Excellence EMBRACE
  - Data Integration
  - Development of tools and programming interfaces to exploit information
  - Tracking and exploiting novel IT/GRID developments

Current status of Bioinformatics in Europe: Infrastructure Funding I: EU
- Currently limited funding for Bioinformatics under Research Infrastructures Action
  - EuroCarbDB Design Study
  - Felics I3 (just submitted to FP6 by EBI/SIB/EPO/Uni Köln)
- FP6 TEMBLOR under Quality of Life Programme: major grant to EBI & partners (€20 million for 3 years)
  - Funds MSD, ArrayExpress, IntAct [finishing June 2005]
- EMBnet: 26 nodes through Europe, many now closed
- Networks of Excellence
  - BioSapiens: Genome Annotations (€12 million for 5 years)
  - EMBRACE: Search & Retrieval Tools (€8 million for 5 years)
- Many other research grants with small informatics component

Current status of Bioinformatics in Europe: Infrastructure Funding II: EMBL
- EMBL funds 8 million euros per annum for Core Molecular Data Resources at EBI

Current status of Bioinformatics in Europe: Infrastructure Funding: National
- National Funding: to my knowledge there is no national funding for core bioinformatics data resources (except Arabidopsis)
- One-off grants for specialist data resources awarded to individuals
- Some national search & training facilities in Europe:
  - UK CCP11: Collaborative Computing Programme to foster Bioinformatics (software collection: EMBOSS)
  - UK MRC Rosalind Franklin Centre for Genomic Research (closing July)
  - UK e-Science efforts: GRID research (mainly middleware)
  - French Genopoles (attached to experimental labs)
  - German NBCC (Bioinformatik-Zentrum in Deutschland) & HNB (Helmholtz Network Bioinformatics)
  - Swiss Institute of Bioinformatics (SIB & ISB)
  - Finnish CSC (Supercomputer Centre): provides life-sciences services & support

Challenge: Funding Services provided by the Core Resources
- Bioinformatics is international
- Core resources are used by everyone
- Who should pay?
  - Individual countries do not want to take responsibility for an international effort
  - The US has been very pro-active in supporting public data resources and making data freely available to all through the web; a commercial solution is therefore not an available option
  - The EU to date has only funded new projects
  - EMBL provides core funding but has a limited budget

US Bioinformatics Infrastructure Funding
- Infrastructure for data resources supported by rolling funding provided to government laboratories & designated core resources
  - NCBI (for GenBank, Entrez, OMIM)
  - UniProt (inc. PIR)
  - RCSB (for PDB)
- Bioinformatics Infrastructure for Large-Scale Analyses (National Partnership for Advanced Computational Infrastructure, NPACI)

Future Needs: From Molecules, to Cell, to Organisms, to Physiology
[Figure labels: p53 tumour suppressor; Genome, Protein, Cell, Embryo; Fruitfly, Mouse, Human; Development, Ageing, Disease]

Needs for future Service Providers
To fulfil these demands Service Providers need:
- Support for Core Resources
  - To handle the data explosion
    - more data: better resources
    - new data: new standards & resources
    - mechanism to identify and support new core resources as they are needed
  - To link to model organism data resources and specialist resources
  - To link to related discipline data resources
  - To integrate data for Systems Biology
  - To enhance services to provide easier, quicker access to data, incorporating advances in GRID technologies
  - Adequate virtual environment and installation equipment, systems administration and DBAs to serve data to community
  - To provide scientific and technical support to run data resources & to provide training

Needs for future Service Providers (cont.)
- Support for Specialist Resources
  - Coordination of Specialist Resources
  - Links to Core Resources
  - Mechanism to evolve into Core Resource, if appropriate
- Support for Distributed Annotation
- Support for Distributed Software Libraries & Coordinated Web Services

Proposals to meet these needs
In FP7, Infrastructure Funds to support, in order of priority:
1. Core Databases: Enhancements/Services
2. New Core Resources
3. Standards
4. Distributed networks of specialised data resources
5. Distributed Annotation Network(s)
6. Tools/libraries & Distributed Web Services?

Challenges of Current I3 Infrastructure Funding Instrument for Bioinformatics
- Current guidelines are based on physical, not virtual, infrastructures; large instrumentation not needed
- Data are rapidly evolving and diversifying and hence need rapidly evolving platforms for data collection, storage & dissemination
- Peer review neither possible nor desirable for web-based infrastructures
- Difficult to raise adequate support for centralised DBs and the services they provide, although everyone uses them
- No maintenance, only creation and enhancements
- Cumulative cost comparable with Physics, but spent differently: ongoing rather than upfront investment

Funding Mechanisms I: Core Data Resources
Aim: To support Core Data Resources
- To provide contribution to creation and enhancement of databases
- To fund services & access through web (needs installation/equipment/dbs)
- To fund technical support & training to use resources properly
Model: one/limited number of sites for DB, but thousands of depositors/users across Europe
Instrument: I3 with modified access & enhancement criteria
Prioritise: through strong links with Life Sciences Thematic Priorities, especially for new resources

Funding Mechanisms II: Standards
Aim: To support establishment of standards for new data
Model: led by limited number of laboratories, but involving whole community
Instrument: Design Study in Infrastructures
Prioritise: through strong links with Life Sciences Thematic Priorities

Funding Mechanisms III: Specialist Data Resources
Aim: To develop Clusters of Specialist Data Resources
- E.g. individual protein families; immunology databases; Model Organism data resources (e.g. bacteria, pathogenic organisms)
Model: distributed databases, one in each laboratory (say ~6 on average)
Instrument: Integrated Project; one partner may provide core infrastructure?
Prioritise: by a strong link between infrastructure funding and research priorities in Life Sciences Thematic Priorities

Funding Mechanisms IV: Distributed Annotation/Tools/Web Services
Aim: Support for Bioinformatics Laboratories to employ their best tools to provide information to the whole biological community
Model: distributed effort in many expert laboratories; essentially an open system with infrastructure hub provided by one partner
Instrument: Network of Excellence, perhaps with fewer participants
Prioritise: by research themes; by topic (e.g. transcriptome data) etc.