Translating Biological Data Sets Into Linked Data

Similar documents
Two Mark question and Answers

ELE4120 Bioinformatics. Tutorial 5

Types of Databases - By Scope

Biological databases an introduction

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL

Introduction to Bioinformatics

Biological databases an introduction

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

Protein Bioinformatics Part I: Access to information

Product Applications for the Sequence Analysis Collection

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Sequence Databases and database scanning

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Introduction to BIOINFORMATICS

Chapter 2: Access to Information

Textbook Reading Guidelines

The Gene Ontology Annotation (GOA) project application of GO in SWISS-PROT, TrEMBL and InterPro

CHEM 436 / 630. Molecular modelling of proteins. Winter 2018 Term. Instructor: Guillaume Lamoureux Concordia University, Montréal, Canada

Powering Translational Medicine with Semantic Web technologies

NCBI web resources I: databases and Entrez

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Web based Bioinformatics Applications in Proteomics. Genbank

Visualizing Linked Data

B I O I N F O R M A T I C S

Introduction to 'Omics and Bioinformatics

KDD Cup Task 1 Information Extraction from Biomedical Articles

Overview of Health Informatics. ITI BMI-Dept

The University of California, Santa Cruz (UCSC) Genome Browser

ONLINE BIOINFORMATICS RESOURCES

VALLIAMMAI ENGINEERING COLLEGE

Genome Sequence Assembly

ELIXIR: data for molecular biology and points of entry for marine scientists

Sequence Based Function Annotation

Engineering Genetic Circuits

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Global Biomolecular Information Infrastructure and Australia. Graham Cameron Director The EMBL Australia Bioinformatics Resource

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

Function Prediction of Proteins from their Sequences with BAR 3.0

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Bioinformatics. Ingo Ruczinski. Some selected examples... and a bit of an overview

GREG GIBSON SPENCER V. MUSE

Big picture and history

DNA and the Production of Proteins Course Notes. Cell Biology. Sub-Topic 1.3 DNA and the Production of Proteins

Hidden Markov Models. Some applications in bioinformatics

Bioinformatics Practical Course. 80 Practical Hours

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases

Using semantic web technology to accelerate plant breeding.

Bioinformatics for Proteomics. Ann Loraine

Improving information management and interoperability for national roads authorities

Applied Bioinformatics

What is Genetics? Genetics The study of how heredity information is passed from parents to offspring. The Modern Theory of Evolution =

The knowledge-driven exploration of integrated biomedical knowledge sources facilitates the generation of new hypotheses

Improving information management and interoperability for national roads authorities. Aonghus O Keeffe CitA BIM Gathering 23 rd November 2017

Introduction to BIOINFORMATICS

BIMM 143: Introduction to Bioinformatics (Winter 2018)

Glossary of Commonly used Annotation Terms

Bioinformatics for Cell Biologists

Introduction to Genome Biology

Textbook Reading Guidelines

Functional analysis using EBI Metagenomics

Ontologies - Useful tools in Life Sciences and Forensics

I nternet Resources for Bioinformatics Data and Tools

Bioinformatics for Cell Biologists

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

Worksheet for Bioinformatics

Large Scale Enzyme Func1on Discovery: Sequence Similarity Networks for the Protein Universe

Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized

translation The building blocks of proteins are? amino acids nitrogen containing bases like A, G, T, C, and U Complementary base pairing links

Genomic region (ENCODE) Gene definitions

European Genome phenome Archive at the European Bioinformatics Institute. Helen Parkinson Head of Molecular Archives

Lecture 7 Motif Databases and Gene Finding

Introduc)on to Databases and Resources Biological Databases and Resources

Computational Biology and Bioinformatics

Biomedical Informatics in BIG DATA Era

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Computers in Biology and Bioinformatics

Web-based Bioinformatics Applications in Proteomics

Compiled by Mr. Nitin Swamy Asst. Prof. Department of Biotechnology

Comparative Genomics. Page 1. REMINDER: BMI 214 Industry Night. We ve already done some comparative genomics. Loose Definition. Human vs.

GENETIC ENGINEERING WORKSHOP. Geering Up 2017

I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador

9/19/13. cdna libraries, EST clusters, gene prediction and functional annotation. Biosciences 741: Genomics Fall, 2013 Week 3

Data representation for clinical data and metadata

How do we know what the structure and function of DNA is? - Double helix, base pairs, sugar, and phosphate - Stores genetic information

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Introduction to Computational Genomics

Open PHACTS: Semantic interoperability for drug discovery

Introduction to Bioinformatics and Gene Expression Technology

GNET 744 Biological Sequence Analysis, Protein Structure and Genome-Wide Data

This practical aims to walk you through the process of text searching DNA and protein databases for sequence entries.

Introduction to Bioinformatics for Medical Research. Gideon Greenspan TA: Oleg Rokhlenko. Lecture 1

Explainer: What is a gene?

1.1 What is bioinformatics? What is computational biology?

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Sequence Analysis. Introduction to Bioinformatics BIMMS December 2015

Transcription:

Translating Biological Data Sets Into Linked Data Mark Tomko Simmons College, Boston MA The Broad Institute of MIT and Harvard, Cambridge MA September 28, 2011

Overview Why study biological data? UniProt & Pfam Translating Pfam into linked data Challenges for representing biological data

Perspective I am not a biologist I am not an expert on linked data I m a software engineer interested in: Scientific data and metadata Scientific data sharing Biology and bioinformatics This talk takes a pragmatic approach

Why study biological data? Quantity Diversity Changes Challenges

How much data? Growth of GenBank Genetic Sequence Database, 1982-2008 http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

What kind of data? Sequence data (nucleotide & amino acid) Genomic data Proteomic data Neuronal wiring Cell fates Phylogenic information Homologous molecules And so on http://www.mpg.de/english/illustrationsdocumentation/documentation/pressreleases/2002/news0204_bild1.jpg

What kinds of changes? Gregor Mendel Charles Darwin James D. Watson & Francis Crick Gordon Moore 1800 1900 2000 In vivo In vitro In silico

Challenges Very large data sets In a multitude of data formats Widely distributed Full of semantic linkages Yet: Computational techniques are increasingly important Experts may not have extensive backgrounds in: Data modeling Data management Programming

Goals Organize the body of biological knowledge Link related knowledge Connect facts with research Make bioinformatics data discoverable and interoperable Facilitate data sharing between researchers

Existing linked biological data Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

A bit of biology Proteins are characterized by amino acid sequence and folding Similarity between analogous proteins suggests: Functional similarity Common evolutionary origin --KAHGKKVLGA-- Single letter code (Hemoglobin β chain)

UniProtKB Online repository of annotated protein sequences Derived from scientific literature & other databases Links to over 100 other data sets Supported by EBI, PIR, NIH Data available in several formats: Online (HTML) Flat files RDF http://www.uniprot.org/

Pfam Organizes proteins from UniProt into families based on similarities in amino acid sequences Computes similarities via sequence alignments and Hidden Markov Models (HMMs) Organizes families into higher-level groups called clans Clan membership is based on similarity between the characteristic sequences of the member families http://pfam.sanger.ac.uk

Pfam organizes UniProt Provides alternate access points for UniProt sequences Clusters UniProt entries Domain specific similarity metrics Non-obvious without domain knowledge Cluster membership helps to predict useful properties Function Evolutionary origin Shapes & features

Pfam families and clans

Pfam provides context Collects findings from disparate literature Establishes critical links that cannot be inferred automatically without specific domain knowledge Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Pfam references other data Pfam links to existing databases such as: InterPro (http://www.ebi.ac.uk/interpro/) SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) PROSITE (http://prosite.expasy.org/) HOMSTRAD (http://tardis.nibio.go.jp/homstrad/) Also links to related publications in PubMed http://www.ncbi.nlm.nih.gov/pubmed/

Example Pfam data

Pfam sequence alignments Alignment for PF00271 (Helicase C domain): Alignment range Each row corresponds to a protein Each letter is an amino acid Colored bands indicate amino acids of particular types that are shared by many proteins

Bio2RDF Pfam translation

This work Retains original UniProt and Pfam URIs Captures: Clans Families Annotations Sequence alignments Links to PubMed Database references (InterPro and PROSITE) Uses existing vocabularies (in some cases, this may not be a feature)

Vocabulary usage Uses SKOS vocabulary narrower / broader preflabel / altlabel definition / scopenote related Uses UniProt core vocabulary Citations Sequence alignments

Clan & family structure

Modeling Pfam entities

Class membership

Membership Broader/narrower might not sufficiently precise Protein is an example of a family Protein is a subtype of a family Protein belongs to a family SKOS no longer contains instantive relationships

Sequence alignments

Automated translation Developed translation program in Scala Input: Pfam-C (clans) ( 360 kb) Pfam-A (curated families) ( 190 mb) uniprot_sprot (proteins) ( 350 mb compressed) Output: Single RDF file ( 333 mb) Source is available on GitHub http://github.com/mtomko/pfamskos http://www.scala-lang.org/

Open problems for Pfam Need vocabulary for class membership: skos:narrowerinstantive and skos:broaderinstantive Deprecated after SKOS-Core 1.0 Guide Need a better model for sequence alignments Both of these could be easily defined using OWL But should they have to be? Does something similar already exist? How do I find it?

Future work Extract all external database references Capture HMM parameters Infrastructure improvements Hosting Separate URLs for entities Improved codebase

Problems for biology data Existing linked data vocabularies are too general or too specific Vocabularies are hard to find Insufficient or inadequate software tools Linked data specifications are daunting to outsiders

Acknowledgements Simmons College: Kathy Wisser and Candy Schwartz Graduate School of Library and Information Science Tufts University: Caroline L. Dahlberg Sackler School of Graduate Biomedical Sciences The Broad Institute: Tom Green and David Root The Broad Institute RNAi Platform

Thanks! Project Code Contact http://web.simmons.edu/~tomko/pfam http://github.com/mtomko/pfamskos mark.tomko@simmons.edu Adapted from Lehninger, 3 rd Ed. Structure image from the PDB http://www.broadinstitute.org/ http://www.simmons.edu/gslis/

How much data? Human genome contains 20-25K genes Human DNA contains 3 billion base pairs (A,G,C,T) UniProt/TrEMBL database contains 16,504,022 protein annotations as of August 2011 Pfam contains: 458 clans 12,273 families 8,729,906 sequences (http://www.intel.om)