Function Prediction of Proteins from their Sequences with BAR 3.0

Open Access
Annals of Proteomics and Bioinformatics
Short Communication

Giuseppe Profiti 1,2, Pier Luigi Martelli 2 and Rita Casadio 2 *
1 ELIXIR-IIB, National Research Council, Italy
2 Bologna Biocomputing Group, University of Bologna, Italy

*Address for Correspondence: Dr. Rita Casadio, Bologna Biocomputing Group, University of Bologna, Italy, Tel: +39 3495577461; Email: casadio@biocomp.unibo.it

Submitted: 06 June 2017
Approved: 21 June 2017
Published: 23 June 2017

Copyright: 2017 Profiti G, et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

SUMMARY

Protein functional annotation requires time and effort, while sequencing technologies are fast and cheap. For this reason, software tools that predict protein function from sequence can help in protein annotation. In this paper, we describe how to use our recently implemented Bologna Annotation Resource (BAR) version 3.0, a tool based on over 30 million protein sequences for protein structural and functional annotation. In BAR 3.0, sequences are arranged in a similarity graph and then clustered together when they share at least 40% sequence identity over 90% of the sequence alignment, for a total of 1,361,773 clusters. Protein sequences with known function transfer their annotation to other sequences in the same cluster after statistical validation. Sequences with unknown function and new sequences entering a cluster inherit its statistically validated annotations. The method compares well with other techniques in the Critical Assessment of protein Function Annotation algorithms (CAFA). The CAFA experiment tests the performance of different predictors on a dataset that accumulates annotations over time.
BAR predictions have been submitted to all the editions of CAFA through the years (BAR Plus in CAFA, BAR++ in CAFA2 and BAR 3.0 in CAFA3). The benchmarking indicates that improvement in the field is still possible and that BAR scores among the top performing methods. This work focuses on how the tool can transfer statistically significant features to poorly annotated or new sequences derived from transcriptomics or proteomics experiments.

How to cite this article: Profiti G, Martelli PL, Casadio R. Function Prediction of Proteins from their Sequences with BAR 3.0. Ann Proteom Bioinform. 2017; 1: 001-005.
HTTPS://WWW.HEIGHPUBS.ORG

INTRODUCTION

Cheap and fast sequencing technologies are widespread, and they constantly produce a large volume of biosequence data (DNA, RNA, proteins). Protein sequences are stored in the reference UniProtKB database [1]. Then the attribution of structural and functional features to a protein sequence (the annotation process) starts. Structural and functional features are evaluated using experimental techniques that require time and a range of technologies. This has created a large gap between the number of protein sequences whose biochemical and structural characteristics are documented and the vast majority of deposited sequences (presently some 85 million). It is worth considering that more than 60 million protein sequences are labelled as predicted in UniProtKB (http://www.uniprot.org/statistics/trembl).

To overcome the gap, sequences are filtered with bioinformatics tools specifically suited to predict functional and structural features. The tools exploit the available knowledge to infer properties of the new sequences, using different approaches such as machine learning and similarity searches [2,3].

The system we developed for protein functional annotation is the Bologna Annotation Resource (BAR) [4-7]. The method transfers statistically validated annotation through a clustering mechanism based on strict similarity requirements. BAR is built on a graph representation of the sequence space from UniProtKB: each protein sequence is a node, and edges represent pairwise similarity. Only edges representing a sequence identity of at least 40% over 90% of the alignment length are kept. Connected nodes are then grouped into the same cluster. After the clusters are identified, the Gene Ontology (GO) [8] and PFAM [9] annotations associated with proteins in UniProtKB are statistically validated to identify over-represented terms. Statistical validation is performed via a Bonferroni-corrected Fisher test, and validated terms, which become cluster specific, are spread to all the sequences in the cluster. Protein Data Bank (PDB) [10] structures associated with proteins in a cluster are structurally aligned and then used to build structural models for the sequences in that cluster.

Predictions of the 2011 version of BAR (BAR Plus) were validated in the Critical Assessment of protein Function Annotation algorithms (CAFA), reaching top scores when compared to over 50 state-of-the-art methods [2]. The 2013 version (BAR++) showed a good performance for some targets, highlighting the need for an update [3]. The present version (BAR 3.0) is both an update and an improvement on the functionalities of the system. Prediction quality was tested on the CAFA2 dataset [3]: the performance of BAR 3.0 was compared to that of the previous version and of state-of-the-art techniques [7]. The new version performs at the state of the art in all the Gene Ontology branches. Furthermore, new features of the system include information about KEGG pathways [11] and cross-cluster links, with protein-protein interactions from IntAct [12] and physical interactions of protein complexes. Another improvement is the possibility to query not only by sequence, but also by annotation. We would like to propose BAR 3.0 as a useful tool for protein annotation in transcriptomics and proteomics experiments.
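
The clustering rule described above (keep an edge only at 40% identity or more over 90% or more of the alignment, then group connected nodes) can be illustrated with a small union-find sketch. This is only an illustration under stated assumptions, not the BAR implementation: the function name `cluster_sequences`, the tuple layout of `pairwise_hits` and the toy identifiers are hypothetical, and the pairwise alignments are assumed to have been computed beforehand.

```python
from collections import defaultdict

MIN_IDENTITY = 0.40  # from the text: at least 40% sequence identity
MIN_COVERAGE = 0.90  # from the text: over at least 90% of the alignment


def cluster_sequences(sequence_ids, pairwise_hits):
    """Split sequences into clusters (connected components) and singletons.

    pairwise_hits is a hypothetical iterable of
    (id_a, id_b, identity, coverage) tuples with values in [0, 1].
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairwise_hits:
        # Keep the edge only if it passes both thresholds.
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for s in sequence_ids:
        groups[find(s)].add(s)

    clusters = [g for g in groups.values() if len(g) > 1]
    singletons = [next(iter(g)) for g in groups.values() if len(g) == 1]
    return clusters, singletons


# Toy data (invented for illustration only).
hits = [
    ("P1", "P2", 0.55, 0.95),
    ("P2", "P3", 0.42, 0.91),
    ("P3", "P4", 0.80, 0.50),  # rejected: coverage below 90%
    ("P4", "P5", 0.35, 0.99),  # rejected: identity below 40%
]
clusters, singletons = cluster_sequences(["P1", "P2", "P3", "P4", "P5"], hits)
print(clusters)
print(sorted(singletons))
```

Note that P1 and P3 end up in the same cluster although they are never compared directly: membership follows from connectivity in the graph, not from pairwise similarity to every other member.
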
THE METHOD

BAR 3.0 [7] contains 32,268,689 sequences, either grouped in clusters or isolated as singletons. There are 28,869,663 sequences in 1,361,773 clusters, while 3,399,026 are singletons. 97% of SwissProt sequences fall in clusters, allowing the transfer of annotation. Statistical validation of annotation led to 674,431 clusters having some validated annotation. These clusters contain 25,447,079 sequences, about 88% of all clustered sequences in BAR 3.0. About 39% of sequences are in clusters with statistically validated GO terms, PFAM families and a PDB structure. Importantly, 11,206,902 UniProtKB sequences receive a statistically validated annotation that they did not previously have. Singletons, on the other hand, mostly lack any type of annotation: 43% of them are not associated even with electronically transferred annotations, and they may offer a subset of proteins that deserve attention from experimental approaches.

While the performance of previous BAR versions has been benchmarked in the CAFA and CAFA2 experiments [2,3], BAR 3.0 predictions are still under assessment by the CAFA3 committee. We tested BAR 3.0 on the CAFA2 targets that accumulated experimental annotation between January 2014 and September 2014 and found that on this set BAR 3.0 scores similarly to or outperforms other state-of-the-art methods [7]. The numbers of correctly predicted (true positive), wrongly assigned (false positive) and wrongly unassigned (false negative) terms are shown in Table 1. A comparison with the state-of-the-art methods is listed in a recent paper [7].

Table 1: Prediction statistics for Gene Ontology terms.

GO branch            True Positive (TP)   False Positive (FP)   False Negative (FN)   F1 score
Biological Process   7790                 26156                 12465                 0.35
Cellular Component   4063                 8381                  3364                  0.43
Molecular Function   2099                 3449                  840                   0.54

When a new sequence is pasted in the query page (bar.biocomp.unibo.it), its alignment against the BAR database determines whether it can enter an annotated cluster. Entry is constrained by the alignment result: the query must match a sequence in the cluster with at least 40% identity over 90% of the alignment coverage. Upon insertion in the cluster, the sequence inherits all the statistically validated annotation (Figure 1).

Figure 1: Inherited GO terms and 3D structure for human sequence B7Z9I1.

RESULTS AND DISCUSSION

Users of BAR 3.0 can access the annotations using different approaches. The most common one is to search for a UniProtKB accession or to enter a sequence in FASTA format. In this case, the query sequence is aligned against the ones already present in the system. The cluster or singleton that contains the matching sequence, or any sequence that shares at least 40% sequence identity over 90% of the sequence alignment, is returned, if any.

The information page contains statistics about the cluster: number of sequences, average length and taxonomic domains. Structural information is shown as a list of the PDB entries, when present, associated with sequences in the cluster. For each PDB chain, any ligands are also specified. A Hidden Markov Model (HMM) derived from the structures in the cluster can be downloaded from this section and adopted to model the protein structure. The alignment of the query sequence against the cluster HMM is available in PIR format, to be used with common modelling tools. When the PDB chain forms a complex with another one falling in a different cluster, this physical interaction is indicated, allowing navigation across different clusters.

Interaction and cross-cluster information is derived from IntAct protein-protein interactions. When a sequence in the cluster is marked as interacting with another one, both are listed in the Protein-protein interactions section, along with their respective clusters. The same section indicates when the organism of the query sequence is present in the cluster containing the interacting sequence.

Gene Ontology annotations comprise the three main branches: Biological Process, Molecular Function and Cellular Component. For each GO term, its p-value and distance from the ontology root are computed. PFAM domains are also associated with a p-value. Information about pathways involving sequences in the cluster is presented in the KEGG Pathways section.

As an example (Figure 1), we may consider a human unreviewed sequence in UniProtKB, with evidence at protein level, with the submitted name Medium-chain-specific acyl-CoA dehydrogenase, mitochondrial (B7Z9I1). It falls into BAR cluster #6075, which contains 32355 sequences, 68 of which are from SwissProt. Sequences in this cluster are from over 4000 different species, comprising 176 Archaea, 504 Eukaryotes and 3755 Bacteria. The cluster contains 57 sequences with PDB structures, four of which form complexes with PDB structures associated with other clusters. There are also 6 known interactions of proteins from this cluster. For GO terms, there are 132 validated Biological Processes, 29 Molecular Functions and 32 Cellular Components. BAR 3.0 transfers a more specific Biological Process GO term than the one electronically assigned by InterPro (GO:0033539, fatty acid beta-oxidation using acyl-CoA dehydrogenase), and it suggests possible new specific Molecular Function terms for dehydrogenase activity.
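
The p-values reported for GO terms and PFAM domains come from the Bonferroni-corrected Fisher test mentioned in the Introduction. A minimal sketch of such an over-representation test follows, using only the Python standard library. It rests on several assumptions that the text does not confirm: the test is taken to be one-sided (hypergeometric upper tail), the Bonferroni factor is taken to be the number of terms tested in the cluster, and the threshold `alpha`, the function names and the toy counts are hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(k, n, K, N):
    """One-sided Fisher exact test, computed as a hypergeometric upper tail.

    k: annotated sequences in the cluster that carry the term
    n: annotated sequences in the cluster
    K: annotated sequences in the whole database that carry the term
    N: annotated sequences in the whole database
    Returns P(X >= k) when n sequences are drawn at random from N.
    """
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def validated_terms(cluster_counts, n, background_counts, N, alpha=0.01):
    """Return {term: corrected p-value} for terms passing the threshold.

    alpha is a hypothetical significance level chosen for illustration.
    """
    n_tests = len(cluster_counts)  # assumed Bonferroni factor
    validated = {}
    for term, k in cluster_counts.items():
        p = fisher_enrichment_pvalue(k, n, background_counts[term], N)
        p_corrected = min(1.0, p * n_tests)
        if p_corrected < alpha:
            validated[term] = p_corrected
    return validated


# Toy counts (invented): a cluster of 20 annotated sequences in a
# database of 1000 annotated sequences.
cluster_counts = {"TERM_A": 15, "TERM_B": 1}
background_counts = {"TERM_A": 30, "TERM_B": 500}
result = validated_terms(cluster_counts, 20, background_counts, 1000)
print(sorted(result))
```

In this toy case TERM_A is strongly over-represented in the cluster and passes the corrected threshold, while TERM_B, which is common across the whole database, does not. Only terms that pass would be spread to the other members of the cluster.
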
For B7Z9I1, the experimentally assigned Cellular Component (mitochondrion) matches the BAR 3.0 prediction. With the cluster HMM, it is possible to model a 3D structure for the sequence. One of the known interactions involves Q92947, also a human dehydrogenase, suggesting possible interactions for the query sequence as well.

Besides offering a statistically validated annotation system, BAR 3.0 offers a unique opportunity for users to query for specific annotation terms (GO, PFAM, PDB), for ligands and for organisms. These searches return a list of all the clusters containing the query term. For GO terms and PFAM, only the clusters associated with the term in a statistically validated way are listed. For PDB, ligand and organism, all the clusters containing a sequence associated with the query term are shown. The result is presented as a table, where each row contains information about a cluster: number of sequences, number of PDB entries, number of validated GO terms (per branch) and number of validated PFAM families. If the query term was a GO term or a PFAM family, the associated p-value is also available. The list of resulting clusters can be narrowed further by entering a taxonomy identifier: in this way, the user can look for clusters containing a specific term and sequences from a specific organism. From the list, annotation pages for each cluster can be reached.

BAR complements any other available annotation page of the sequence, particularly for poorly annotated and predicted sequences, with the possibility of linking information across different clusters and of fully understanding the role of the sequence in the complex landscape of the cell.

ACKNOWLEDGEMENTS

Funding for the open access charge was provided by the University of Bologna [RFO] to PLM and RC. G.P. thanks ELIXIR-IIB and ELIXIR Europe for supporting his research.

REFERENCES

1. UniProt Consortium. UniProt: A hub for protein information. Nucleic Acids Res. 2015; 43: 204-212. Ref.: https://goo.gl/yrmgua
2. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, et al. A large-scale evaluation of computational protein function prediction. Nat Meth. 2013; 10: 221-227. Ref.: https://goo.gl/xg6dfk
3. Jiang Y, Oron TR, Clark WT, Bankapur AR, D'Andrea D, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology. 2016; 17: 184. Ref.: https://goo.gl/lqhgpn
4. Bartoli L, Montanucci L, Fronza R, Martelli PL, Fariselli P, et al. The Bologna annotation resource: a non hierarchical method for the functional and structural annotation of protein sequences relying on a comparative large-scale genome analysis. J Proteome Res. 2009; 8: 4362-4371. Ref.: https://goo.gl/dlrvmk
5. Piovesan D, Martelli PL, Fariselli P, Zauli A, Rossi I, et al. BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences. Nucleic Acids Res. 2011; 39: 197-202. Ref.: https://goo.gl/9it5mu
6. Piovesan D, Martelli PL, Fariselli P, Profiti G, Zauli A, et al. How to inherit statistically validated annotation within BAR+ protein clusters. BMC Bioinformatics. 2013; 3: 4. Ref.: https://goo.gl/zm9buz
7. Profiti G, Martelli PL, Casadio R. The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation. Nucleic Acids Res. 2017. Ref.: https://goo.gl/gvwsiw
8. Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015; 43: 1049-1056. Ref.: https://goo.gl/kw74s7
9. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016; 44: 279-285. Ref.: https://goo.gl/avdlfi
10. Rose PW, Prlic A, Bi C, Bluhm WF, Christie CH, et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015; 43: 345-356. Ref.: https://goo.gl/az5rmf
11. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017; 45: 353-361. Ref.: https://goo.gl/zqm1iq
12. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, et al. The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014; 42: 358-363. Ref.: https://goo.gl/gwjftw