Glycomics Project Overview


1 Wright State University CORE Scholar
Kno.e.sis Publications, The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis), 2007
Glycomics Project Overview
Satya S. Sahoo, Wright State University - Main Campus
Part of the Bioinformatics Commons, Communication Technology and New Media Commons, Databases and Information Systems Commons, OS and Networks Commons, and the Science and Technology Studies Commons
Repository Citation: Sahoo, S. S. (2007). Glycomics Project Overview.
This Presentation is brought to you for free and open access by The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at CORE Scholar. It has been accepted for inclusion in Kno.e.sis Publications by an authorized administrator of CORE Scholar. For more information, please contact corescholar@

3 Bioinformatics Research Overview

4 Outline
Biomedical Ontologies
  o GlycO
  o EnzyO
  o ProPreO
Scientific Workflow for analysis of Proteomics Data
Framework for Semantic Provenance Annotation
Biological Services Registry
Demo of User Interface
T.cruzi Knowledge Base

5 GlycO is a focused ontology for the description of glycomics
  o models the biosynthesis, metabolism, and biological relevance of complex glycans
  o models complex carbohydrates as sets of simpler structures that are connected with rich relationships
An ontology for the structure and function of glycopeptides
Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)
See: GlycoDoc, GlycO

6 GlycO Challenge
Model hundreds of thousands of complex carbohydrate entities
But the differences between the entities are small (e.g., just one component)
How to model all the concepts but
  o preclude redundancy
  o ensure maintainability and scalability

7 GlycO population
Assumption: with a large body of background knowledge, learning and extraction techniques can be used to assert facts.
Asserted facts are compositions of individual building blocks.
Because the building blocks are richly described, the extracted larger structures will be of high quality.

8 GlycO Population
Multiple data sources used in populating the ontology
  o KEGG - Kyoto Encyclopedia of Genes and Genomes
  o SWEETDB
  o CARBBANK Database
Each data source has a different schema for storing data
There is significant overlap of instances in the data sources
Hence, entity disambiguation and a common representational format are needed

9 Diverse Data From Multiple Sources Assures Quality
Democratic principle
Some sources can be wrong, but not all will be
Correct data are more likely to agree across sources than erroneous data
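The democratic principle can be illustrated with a simple majority vote. This is a minimal sketch, not the project's actual procedure: the function name `consensus`, the threshold `min_share`, and the source records are all hypothetical and made up for illustration.

```python
from collections import Counter


def consensus(records, min_share=0.5):
    """Return the value asserted by more than min_share of the sources, else None.

    records maps a source name to the value that source asserts.
    """
    counts = Counter(records.values())
    value, votes = counts.most_common(1)[0]
    return value if votes / len(records) > min_share else None


# Hypothetical records for one residue; the values are illustrative only.
records = {
    "source_a": "b-D-Manp",
    "source_b": "b-D-Manp",
    "source_c": "a-D-Manp",
}
print(consensus(records))
```

A vote like this is only meaningful if the sources were produced independently of one another.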

10 Ontology population workflow
[Flowchart. Steps: Semagix Freedom knowledge extractor; Instance Data; IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Insert into KB. Decisions: Has CarbBank ID? (YES / NO); Already in KB? (YES: next instance / NO).]
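The decisions in this flowchart can be sketched as a short loop. The branch order is an assumption, since the arrows did not survive transcription: here an instance with a CarbBank ID is compared by that ID, and any other instance is converted (IUPAC to LINUCS to GLYDE) and compared by the converted form. The function `populate`, the dictionary keys, and the two converter stand-ins are hypothetical and are not the real converters.

```python
def populate(instances, kb, to_linucs, to_glyde):
    """Insert instances that are not yet in kb; return the keys that were inserted.

    kb is a dict keyed by a comparison key (assumed design, not the real KB).
    """
    inserted = []
    for inst in instances:
        if inst.get("carbbank_id"):
            # Assumption: a CarbBank ID is enough to compare against the KB.
            key = ("carbbank", inst["carbbank_id"])
        else:
            # Otherwise normalize the structure first: IUPAC -> LINUCS -> GLYDE.
            key = ("glyde", to_glyde(to_linucs(inst["iupac"])))
        if key in kb:
            continue  # "Already in KB? YES: next instance"
        kb[key] = inst  # "Already in KB? NO: insert into KB"
        inserted.append(key)
    return inserted


# Placeholder converters, for illustration only.
to_linucs = lambda s: s.strip()
to_glyde = lambda s: s.lower()

kb = {}
instances = [
    {"carbbank_id": "1", "iupac": "X"},
    {"carbbank_id": None, "iupac": " Man "},
    {"carbbank_id": "1", "iupac": "X"},   # duplicate by CarbBank ID
    {"carbbank_id": None, "iupac": "man"},  # duplicate after normalization
]
inserted = populate(instances, kb, to_linucs, to_glyde)
print(inserted)
```

Comparing on a normalized form is what lets overlapping instances from differently structured sources collapse into one knowledge base entry.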

11 Ontology population workflow
[Same flowchart as slide 10, with the output of the "IUPAC to LINUCS" step shown as an example:]
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}
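The nesting in the LINUCS string above can be read as a tree. The following is a minimal sketch of a recursive parser, assuming the grammar implied by this one example (each node is `[linkage][name]{children}`); it is not a full LINUCS implementation, and `parse_linucs` and `count_nodes` are hypothetical names.

```python
def parse_linucs(text):
    """Parse a LINUCS-style string into nested (linkage, name, children) tuples."""
    pos = 0

    def bracketed():
        # Read one "[...]" token and return its contents.
        nonlocal pos
        if text[pos] != "[":
            raise ValueError(f"expected '[' at position {pos}")
        end = text.index("]", pos)
        token = text[pos + 1:end]
        pos = end + 1
        return token

    def node():
        nonlocal pos
        linkage = bracketed()
        name = bracketed()
        if text[pos] != "{":
            raise ValueError(f"expected '{{' at position {pos}")
        pos += 1
        children = []
        while text[pos] != "}":
            children.append(node())
        pos += 1  # consume the closing brace
        return (linkage, name, children)

    root = node()
    if pos != len(text):
        raise ValueError("trailing characters after the root node")
    return root


def count_nodes(tree):
    """Count a node and all of its descendants."""
    return 1 + sum(count_nodes(child) for child in tree[2])


# The example string from the slide, split by nesting level for readability.
linucs = (
    "[][Asn]{"
    "[(4+1)][b-D-GlcpNAc]{"
    "[(4+1)][b-D-GlcpNAc]{"
    "[(4+1)][b-D-Manp]{"
    "[(3+1)][a-D-Manp]{"
    "[(2+1)][b-D-GlcpNAc]{}"
    "[(4+1)][b-D-GlcpNAc]{}"
    "}"
    "[(6+1)][a-D-Manp]{"
    "[(2+1)][b-D-GlcpNAc]{}"
    "}"
    "}}}}"
)
tree = parse_linucs(linucs)
print(tree[1], count_nodes(tree))
```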

12 Ontology population workflow
[Same flowchart as slide 10, with the output of the "LINUCS to GLYDE" step shown as an example:]
<Glycan>
  <aglycon name="asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="glcnac">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="glcnac">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="d" monosaccharide="man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="glcnac">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="glcnac">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="d" monosaccharide="man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="d" monosaccharide="glcnac">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
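A nested residue tree can be serialized into XML of this shape with the Python standard library. This is a sketch only: it mirrors the element and attribute names visible on the slide, fixes `anomeric_carbon` to "1" and `chirality` to "d" as in the example, and makes no claim about the full GLYDE schema. The tuple layout and the names `add_residue` and `build_glycan` are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_residue(parent, residue):
    """Append one <residue> element (and its children) under parent."""
    link, anomer, monosaccharide, children = residue
    elem = ET.SubElement(parent, "residue", {
        "link": link,
        "anomeric_carbon": "1",  # fixed, as in the slide's example
        "anomer": anomer,
        "chirality": "d",        # fixed, as in the slide's example
        "monosaccharide": monosaccharide,
    })
    for child in children:
        add_residue(elem, child)
    return elem


def build_glycan(aglycon, core):
    """Build a <Glycan> element from an aglycon name and a residue tree."""
    root = ET.Element("Glycan")
    ET.SubElement(root, "aglycon", {"name": aglycon})
    add_residue(root, core)
    return root


# The structure from the slide as (link, anomer, monosaccharide, children).
core = ("4", "b", "glcnac", [
    ("4", "b", "glcnac", [
        ("4", "b", "man", [
            ("3", "a", "man", [
                ("2", "b", "glcnac", []),
                ("4", "b", "glcnac", []),
            ]),
            ("6", "a", "man", [
                ("2", "b", "glcnac", []),
            ]),
        ]),
    ]),
])
glycan = build_glycan("asn", core)
print(ET.tostring(glycan, encoding="unicode"))
```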

13 Ontology population workflow
[Flowchart repeated from slide 10.]

14 Diverse Data From Multiple Sources Assures Quality
Holds only when the data in each source are independent
In the case of GlycO, the sources that were meant to assure quality were not diverse
  o One original source (CarbBank) was copied by several databases without curation
  o Errors in the original propagated
  o Errors in KEGG and CarbBank are the same
Cannot use these sources for comparison
Needs curation by the expert community

15 GlycoTree
β-D-GlcpNAc-(1-2)-α-D-Manp-(1-6)+
                                 β-D-Manp-(1-4)-β-D-GlcpNAc-(1-4)-β-D-GlcpNAc
β-D-GlcpNAc-(1-4)-α-D-Manp-(1-3)+
β-D-GlcpNAc-(1-2)+
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15:

16 Pathway Steps - Glycan
[Figure: abundance of this glycan in three experiments.]
Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

18 ProPreO ontology
An ontology for capturing process and lifecycle information related to proteomic experiments
Two aspects of glycoproteomics:
  o What is it? (identification)
  o How much of it is there? (quantification)
Heterogeneity in data generation process, instrumental parameters, formats
Need data and process provenance: ontology-mediated provenance
Hence, ProPreO models both the glycoproteomics experimental process and attendant data
Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)

19 N-Glycosylation Process (NGP)
[Flowchart.]
Sample processing: Cell Culture -> (extract) -> Glycoprotein Fraction -> (proteolysis) -> Glycopeptides Fraction -> (Separation technique I) -> n Glycopeptides Fractions -> (PNGase) -> n Peptide Fractions -> (Separation technique II) -> n*m Peptide Fractions -> (Mass spectrometry)
Data products: ms data, ms peaklist, N-dimensional array; ms/ms data, ms/ms peaklist, Peptide list
Data processing steps: Data reduction, binning, Signal integration, Data correlation, Peptide identification, Glycopeptide identification and quantification

20 Workflow based on Web Services = Web Process

21 ProPreO: Ontology-mediated provenance
Mass Spectrometry (MS) Data: ms/ms peaklist data
  o parent ion m/z
  o parent ion charge
  o parent ion abundance
  o fragment ion m/z
  o fragment ion abundance

22 ProPreO: Ontology-mediated provenance
Semantically Annotated MS Data (element and attribute names are ontological concepts)
<ms-ms_peak_list>
  <parameter instrument= micromass_qtof_2_quadropole_time_of_flight_mass_spectrometer mode= ms-ms />
  <parent_ion m-z= abundance= z= 2 />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
  <fragment_ion m-z= abundance= />
</ms-ms_peak_list>

23 ISiS: Integrated Semantic Information and Knowledge System
Semantic Web Process to incorporate provenance
[Diagram. Agent-driven steps: Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Preprocess; DB Search (Mascot/Sequest); Results Postprocess (ProValt). Each step has inputs (I) and outputs (O).]
Data passed between steps: Raw Data, Standard Format Data, Filtered Data, Search Results, Final Output
Other labels: Semantic Annotation Applications, Storage, Biological Information

24 Integrated Semantic Information and Knowledge System (ISiS)
User questions: Have I made an error? Is the result erroneous? Give me all result files from a similar organism, cell, preparation, and mass spectrometric conditions, and compare results.
[Diagram. Components: ProPreO ontology; SPARQL query-based User Interface; Experimental Data; Semantic Annotation; Metadata File; Semantic Metadata Registry; PROTEOMECOMMONS.]
PROTEOMICS WORKFLOW: Raw -> (Raw2mzXML) -> mzXML -> (mzxml2pkl) -> pkl -> (Pkl2pSplit) -> pSplit -> (MASCOT Search) -> MASCOT result -> (ProValt) -> ProValt result
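Recording provenance along the proteomics workflow can be sketched as a chain in which every step logs its tool, input, and output. This is an illustration under stated assumptions, not the ISiS implementation: the tool names come from the slide, while the function `run_workflow`, the record fields, and the file suffixes are hypothetical, and no real conversion is performed.

```python
def run_workflow(raw_name, steps):
    """Walk a file name through the steps, logging one provenance record per step.

    steps is a list of (tool, output_suffix) pairs. The tools are not executed;
    only the chain of inputs and outputs is recorded.
    """
    provenance = []
    current = raw_name
    for tool, suffix in steps:
        stem = current.rsplit(".", 1)[0]
        produced = f"{stem}.{suffix}"
        provenance.append({"tool": tool, "input": current, "output": produced})
        current = produced
    return current, provenance


# Tool names as on the slide; the suffixes are made up for illustration.
steps = [
    ("Raw2mzXML", "mzxml"),
    ("mzxml2pkl", "pkl"),
    ("Pkl2pSplit", "psplit"),
    ("MASCOT Search", "mascot"),
    ("ProValt", "provalt"),
]
final, provenance = run_workflow("sample.raw", steps)
for record in provenance:
    print(record["tool"], record["input"], "->", record["output"])
```

Because each record's input is the previous record's output, any final result can be traced back to the raw file it came from.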

25 Semantic Annotation Facilitates Complex Queries
  o Evaluate the specific effects of changing a biological parameter: Retrieve abundance data for a given protein expressed by three different cell types of a specific organism.
  o Retrieve raw data supporting a structural assignment: Find all the raw ms data files that contain the spectrum of a given peptide sequence having a specific modification and charge state.
  o Detect errors: Find and compare all peptide lists identified in Mascot output files obtained using a similar organism, cell-type, sample preparation protocol, and mass spectrometry conditions.
A Web Service Must Be Invoked
[Legend: ProPreO concepts highlighted in red]
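The "detect errors" query can be sketched as a filter over annotation metadata. The slides describe a SPARQL-based interface over ProPreO annotations; the version below swaps in plain Python dictionaries so that it runs on its own, and it treats "similar" as exact equality on the chosen fields. The field names, file names, and values are hypothetical.

```python
def similar_runs(annotations, reference, keys):
    """Return the files whose annotation equals the reference on every key."""
    return [
        a["file"]
        for a in annotations
        if a["file"] != reference["file"]
        and all(a.get(k) == reference.get(k) for k in keys)
    ]


# Made-up annotation records, for illustration only.
annotations = [
    {"file": "run1.mascot", "organism": "organism_x", "cell_type": "cell_a", "ms_mode": "ms-ms"},
    {"file": "run2.mascot", "organism": "organism_x", "cell_type": "cell_a", "ms_mode": "ms-ms"},
    {"file": "run3.mascot", "organism": "organism_x", "cell_type": "cell_b", "ms_mode": "ms-ms"},
]
keys = ("organism", "cell_type", "ms_mode")
matches = similar_runs(annotations, annotations[0], keys)
print(matches)
```

The peptide lists in the matching files could then be compared with the reference run to spot disagreements.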

26 [Architecture diagram. Components: Knowledge Base (ProPreO and GlycO Ontology); Data Generation; Annotation; ISiS; GlycoVault; Data Browsing & Querying API; WWW.]

27 Semantic Biological Web Services Registry

28 Data, ontologies, more publications at Biomedical Glycomics project web site:
Thank You