MS bioinformatics analysis for proteomics. Protein annotations

1 MS bioinformatics analysis for proteomics. Protein annotations. UCO - Córdoba. Organized by: ProteoRed, EUPA and Seprot. Alberto Medina. January 23rd, 2009

2 Summary: Introduction. Some issues. Software: Fatigo - Babelomics, GeneCodis, GOTM. Information retrieval and knowledge (PIKE).

3 From protein ID toward biology. Proteomics experiment goal: a set of proteins (identified, differentially expressed, quantified, etc.) according to a sequence database or de novo sequencing. Data = Information? It is necessary to provide methods and functions to retrieve information and create knowledge from protein IDs. DB id / Name: gi, Heat shock protein; gi, FKB9_HUMAN; gi, Tubulin alpha, Homo sapiens; gi, Ubiquitin specific protease; gi, vinculin isoform VCL, Homo sapiens; gi, selenium binding protein 1, Homo sapiens.

4 What can we do? By hand? Retrieve the available information from the database used to identify the proteins.

5 Several databases Troubles

6 Several repositories and databases. DB Catalog: dbEST: EMBL: GenBank: MSDB: ftp://ftp.ebi.ac.uk/pub/databases/massspecdb/ NCBInr: OWL: PDB: PIR: PRF: Swiss-Prot:

7 What can we do? Using search engines, we only get the information from one site.

8 Troubles Several Databases Cross references (gi ).

9 Can we do it? Sometimes it is difficult to relate the data (gi: ).

10 IPI: International Protein Index. Features: effectively maintains a database of cross references between the primary data sources; provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript); maintains stable identifiers (with incremental versioning) to allow the tracking of sequences between IPI releases. Kersey P. J., Duarte J., Williams A., Karavidopoulou Y., Birney E., Apweiler R. The International Protein Index: An integrated database for proteomics experiments. Proteomics 4(7): (2004).

11 IPI

12 HPRD hprd Human protein reference database. Peri, S. et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research. 13:

13 Troubles Several Databases Cross references (gi ). Database keywords mapping

14 Semantic integration? How can I relate data provided by database 1 to data provided by database 2? Human albumin ID? NCBI Id: CAA0606, name: albumin [Homo sapiens]. SwissProt Id: P02768, name: ALBU_HUMAN. Is it the same? Yes, it is. But
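
The albumin question on this slide can be sketched as a small cross-reference lookup. This is only a minimal illustration in Python, assuming a hand-built table: the `XREFS` dictionary, the helper names `canonical_id` and `same_protein`, and the choice of the Swiss-Prot accession as the shared key are all hypothetical and are not taken from IPI, HPRD or PIKE.

```python
# Hypothetical cross-reference table: (database, identifier) -> shared key.
# The two entries are the albumin identifiers quoted on the slide.
XREFS = {
    ("NCBI", "CAA0606"): "P02768",
    ("SwissProt", "P02768"): "P02768",
}


def canonical_id(database, identifier):
    """Return the shared key for an identifier, or None if it is unmapped."""
    return XREFS.get((database, identifier))


def same_protein(entry_a, entry_b):
    """Two (database, identifier) pairs match if they share a known key."""
    key_a = canonical_id(*entry_a)
    key_b = canonical_id(*entry_b)
    return key_a is not None and key_a == key_b


print(same_protein(("NCBI", "CAA0606"), ("SwissProt", "P02768")))
```

In practice such a table would have to come from a maintained cross-reference resource rather than being typed by hand, which is the role the slides assign to IPI.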

15 HPRD hprd Peri, S. et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research. 13:

16 Troubles Several Databases Cross references (gi ). Database keywords mapping Annotations?

17 Annotations. For instance: "Cellular location = Membrane"; "Membrane location"; "Protein XXX is located within membrane"; "Protein XXX together.. are into the membrane". We need a CV (controlled vocabulary) or ontologies to cluster similar expressions under the same concept.
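
The idea of clustering similar expressions under one concept can be sketched as follows. This is a minimal illustration in Python, assuming a tiny hand-written vocabulary; the `CV_PATTERNS` table and the `to_cv_terms` function are hypothetical, and real annotation pipelines would rely on a curated ontology rather than a single regular expression.

```python
import re

# Hypothetical controlled vocabulary: one concept, several free-text patterns.
CV_PATTERNS = {
    "membrane": [
        re.compile(r"\bmembrane\b", re.IGNORECASE),
    ],
}


def to_cv_terms(annotation):
    """Return the sorted CV concepts whose patterns match the free text."""
    return sorted(
        concept
        for concept, patterns in CV_PATTERNS.items()
        if any(p.search(annotation) for p in patterns)
    )


# Free-text expressions taken from the slide.
examples = [
    "Cellular location = Membrane",
    "Membrane location",
    "Protein XXX is located within membrane",
]
for text in examples:
    print(text, "->", to_cv_terms(text))
```

All three expressions collapse onto the same concept, so they can be counted and compared as one annotation.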

18 Gene Ontology org/ Gene Ontology: tool for the unification of biology. Nat Genet ; 25: Biological process: a series of events accomplished by one or more ordered assemblies of molecular functions (e.g. cellular physiological process or signal transduction). Molecular function: activities, such as catalytic or binding activities, that occur at the molecular level (e.g. catalytic activity, transporter activity, or binding). Cellular component: a component of a cell (e.g. rough endoplasmic reticulum or nucleus).

19 Troubles Several Databases Cross references (gi ). Database keywords mapping Annotations What will happen if I have a large set of proteins?

20 What can I do? By hand Protein accession gi Information CoPaste (Copy & Paste)

21 Tools (more than 40 for genes). Fatigo (CIPF, Valencia, Spain) is a web tool to extract Gene Ontology terms that are significantly over- or under-represented in sets of genes within the context of a genome-scale experiment (DNA microarray, proteomics, etc.). GeneCodis (CNB, Madrid, Spain) is a web tool to extract Gene Ontology, KEGG pathways and SwissProt keywords. Gene Ontology Tree Machine (GOTM) is a web-based platform for interpreting microarray data or other interesting gene sets using Gene Ontology.
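
What "significantly over-represented" can mean is sketched below with an upper-tail hypergeometric probability. This is a generic illustration in Python, not necessarily the statistic used by any of the tools listed; the function name `enrichment_p_value`, its parameters and the numbers in the usage line are invented for the example.

```python
from math import comb


def enrichment_p_value(total, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    total     -- proteins in the reference set
    annotated -- reference proteins carrying the term
    sample    -- proteins in the experimental list
    hits      -- proteins in the list carrying the term
    """
    upper = min(annotated, sample)
    denominator = comb(total, sample)
    # Sum the probabilities of observing `hits` or more annotated proteins.
    numerator = sum(
        comb(annotated, k) * comb(total - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return numerator / denominator


# Invented numbers: 20 reference proteins, 5 carry the term,
# and a list of 5 proteins in which 4 carry it.
print(enrichment_p_value(total=20, annotated=5, sample=5, hits=4))
```

A small probability suggests the term appears in the list more often than chance would explain. When many terms are tested at once, a multiple-testing correction would also be needed.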

22 Fatigo org/

23 GeneCodis es

24 GOTM edu/gotm/

25 Troubles Several Databases Cross references (gi ). Database keywords mapping What will happen if I have a large set of proteins? What can I do to show information from several sites at once?

26 PIKE: Protein Information and Knowledge Extractor. DB id / Protein name: gi, Heat shock protein; gi, FKB9_HUMAN; gi, Tubulin alpha, Homo sapiens; gi, Ubiquitin specific protease; gi, vinculin isoform VCL, Homo sapiens; gi, selenium binding protein 1, Homo sapiens.

27 PIKE

28 PIKE

29 PIKE Information asked by user

30 PIKE

31 Troubles Several Databases Cross references (gi ). Database keywords mapping What will happen if I have a large set of proteins? What can I do to show information from several sites at once? How is it possible to export the information?

32 PIKE Now we know how to retrieve information from Internet databases and how to generate knowledge. Sharing that information with other applications is as important as retrieving it, so we need standards to store the information.

33 PIKE Input: TXT and PRIDE files. The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PIKE uses PRIDE files as input to relate the experimental information to biological information.

34 PIKE output: PRIDE, CSV and TXT. CSV: to export to Excel. PRIDE: biological information retrieved by PIKE is annotated according to: User information, if the information is annotated as natural language (UniProt/Swiss-Prot comments); CV information, if the information follows a CV or ontology such as GO, KEGG pathways or OMIM IDs; MzData, if the data have been submitted previously within the PRIDE file.
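
The CSV export path can be sketched with Python's standard `csv` module. This is only an illustration of writing annotation records in a form Excel can open; the column names, the `to_csv` helper and the placeholder records are hypothetical and do not reproduce PIKE's actual output layout.

```python
import csv
import io

# Hypothetical column layout; not PIKE's real export format.
FIELDS = ["protein", "annotation_type", "vocabulary", "value"]

# Placeholder records built from the "Protein XXX" example in the slides.
RECORDS = [
    {"protein": "Protein XXX", "annotation_type": "CV",
     "vocabulary": "GO", "value": "membrane"},
    {"protein": "Protein XXX", "annotation_type": "User",
     "vocabulary": "", "value": "located within membrane, free-text comment"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(RECORDS, FIELDS))
```

The writer quotes the free-text value automatically because it contains a comma, which keeps natural-language comments in a single spreadsheet cell.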

35 PIKE output. CSV [Chart: Series1 and Series2 plotted over Feature 1 to Feature 5]

36 PIKE output. PRIDE

37 Troubles Several Databases Cross references (gi ). Database keywords mapping What will happen if I have a large set of proteins? What can I do to show information from several sites at once? Is it possible to retrieve the same data from several sites? Workflows? How is it possible to export the information? What about graphical representation?

38 PIKE

39 Troubles without answer. Tools are web-based applications. Connection problems. Front-end schemas. Restricted access to some data. Is the information within the databases up to date? Are the database links up to date?

40 ?????

41 Acknowledgements. CNB Proteomics facility & ProteoRed: Salvador Martínez, Alberto Paradela, Rosana Navajas, Antonio Ramos, Ana Beloso, Fernando Roncal, Marisol Fernández, Sergio Ciordia, Silvia Juárez, Juan Pablo Albar.

42 Acknowledgements. CNB Bioinformatics facility, Madrid, Spain: Mónica Chagoyen, Pedro Carmona, José María Carazo. EMBL-EBI, Hinxton, UK: Antony Quinn, Henning Hermjakob. Bruker Daltonics, Bremen, Germany: Marcus Macht, Herbert Thiele. Center for Biotechnology, Turku, Finland: Anni Vehmas, Garry Corthals.