Protein function prediction using sequence motifs: A research proposal


Asa Ben-Hur

Abstract

Protein function prediction, i.e. the classification of protein sequences according to their biological function, is an important task in bioinformatics. We propose a study of function prediction that uses protein sequence motifs as features. The result will be a system with state-of-the-art performance in function prediction whose results are interpretable and can provide biological insight into the relationships between different classes of proteins. As a proof of concept we will focus on enzymes, which have a well-developed system of classification, namely the Enzyme Commission (EC) numbering system.

1 Background

Advances in DNA sequencing are yielding a wealth of sequenced genomes, and yet understanding the function of the proteins coded by a specific genome is still an unsolved problem. The function of genes and gene products is determined mainly on the basis of sequence similarity (homology) [1]. Comparing entire proteomes to a database of known sequences leaves the function of a large percentage of genes undetermined: close to 40% of the known human genes do not have a functional classification by homology [2, 3]. Most homology-based methods depend on either global similarity to a known, closely related protein, or homology to a family of known proteins using BLAST, PSI-BLAST, profiles, or HMM methods [4, 5, 6]. The function of genes that have arisen recently by exon shuffling, or that have no closely related ortholog, remains undetermined by these approaches [7].

Motifs, on the other hand, represent short, highly conserved regions of proteins. Very often these motifs correspond to functional regions of a protein: catalytic sites, binding sites, structural motifs, etc. The presence of such motifs often reveals important clues to a protein's role even if it is not globally similar to any known protein.
The motifs for most catalytic sites and binding sites are conserved over much larger taxonomic distances and evolutionary time than the rest of the sequence. However, a single motif is often not sufficient to determine the function of a protein. The catalytic site or binding site of a protein might be composed of several regions that are not contiguous in sequence but are close in the folded protein structure (for example, in serine proteases). In addition, a motif representing a binding site might be common to several protein families that bind the same substrate. Therefore, a pattern of motifs (a fingerprint) is in general required to classify a protein into a certain family of proteins. Manually constructed fingerprints are provided by the PRINTS database [8]. In this proposal we suggest a method that constructs such fingerprints automatically, and in a discriminative manner, using tools of supervised learning.

Most motif methods extract conserved elements (blocks) from multiple sequence alignments of groups of proteins. These conserved elements are then represented as discrete sequence motifs, Position Specific Scoring Matrices (PSSMs), or profiles. In this proposal we focus on discrete ungapped motifs and PSSMs. Several databases of blocks and motifs are publicly available, including PROSITE, BLOCKS+, PRINTS, and emotif [9, 10, 8, 11, 12]. Using a motif database as the basis for constructing a classifier will enable us to compare the merit of different databases and methods for constructing motifs according to how well the resulting classifier performs.

2 Enzyme Classification

Enzymes represent a substantial fraction of the proteins in the SwissProt database [13] and have a well-established system of annotation. The current version of SwissProt contains over 35,000 enzymes. The function of an enzyme is specified by a name given to it by the Enzyme Commission (EC) [14]. The name corresponds to an EC number of the form n1.n2.n3.n4, e.g. 1.1.3.13 for alcohol oxidase. The first number is between 1 and 6 and indicates the general type of chemical reaction catalyzed by the enzyme; the main categories are oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. The remaining numbers have meanings that are particular to each category. Consider, for example, the oxidoreductases (EC numbers starting with 1), which catalyze reactions in which hydrogen or oxygen atoms, or electrons, are transferred between molecules.
In these enzymes, n2 specifies the chemical group of the (electron) donor molecule, n3 specifies the (electron) acceptor, and n4 specifies the substrate. The EC classification system specifies over 750 enzyme names, and a particular protein can have several enzymatic activities. As a machine learning problem, enzyme classification is therefore not a standard multi-class problem, since each pattern can have more than one class label; this type of problem is sometimes called a multi-label problem [15]. However, in our preliminary experiments we found that this multi-label problem can be reduced to a regular multi-class problem by treating a group of enzymes that share the same set of activities as a class of its own. For example, there are 22 enzymes that have both EC numbers 1.1.1.1 and 1.2.1.1, and these can be perfectly distinguished from enzymes with the single EC number 1.1.1.1 by a classifier that uses the motif content of the proteins. On inspecting the SwissProt database we then found that these two groups are indeed recognized as distinct.
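The reduction from a multi-label to a multi-class problem described above can be sketched as follows. This is a minimal illustration, not the code used in the preliminary experiments: the protein identifiers are hypothetical, and joining the EC numbers with "+" is simply one convenient way to name a combined class.

```python
MAIN_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
}


def parse_ec(ec):
    """Split an EC number such as '1.1.3.13' into its four integer fields."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError("expected n1.n2.n3.n4, got %r" % ec)
    return tuple(int(p) for p in parts)


def to_multiclass(annotations):
    """Map each protein's set of EC numbers to one combined class label.

    Proteins with the same set of activities receive the same label, so a
    protein with two activities falls in a class of its own rather than in
    either single-activity class.
    """
    labels = {}
    for protein, ec_numbers in annotations.items():
        ordered = sorted(set(ec_numbers), key=parse_ec)
        labels[protein] = "+".join(ordered)
    return labels


# Hypothetical protein identifiers, for illustration only.
annotations = {
    "prot1": ["1.1.1.1"],
    "prot2": ["1.2.1.1", "1.1.1.1"],
    "prot3": ["1.1.1.1", "1.2.1.1"],
}
print(MAIN_CLASSES[parse_ec("1.1.3.13")[0]])
print(to_multiclass(annotations))
```

Sorting the EC numbers before joining them makes the combined label independent of the order in which the activities are listed in an annotation.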

3 Methods

We propose to use the motif composition of a protein to define a similarity measure, or kernel function, that can be used with kernel-based classification methods such as Support Vector Machines (SVMs). In a recent study we found that a bag-of-motifs representation of a protein sequence provides state-of-the-art performance in detecting remote homologs [16]. The motif database we plan to use contains over a million discrete sequence motifs [12]. The motif composition vector of a protein sequence is therefore high dimensional and very sparse, much like the bag-of-words representation of text documents. However, unlike words in text, motifs are highly predictive features: we found that with methods such as Recursive Feature Elimination (RFE) [17] we could reduce the number of motifs to a few tens at most while keeping the same level of classification accuracy. A classifier based on a handful of features has the advantage of interpretability and can provide biological insight into what differentiates classes of enzymes from each other. During the development of the system we plan to benchmark various classifiers and feature selection methods.

Recall that protein function prediction is a classification problem with a large number of classes (hundreds of classes in the EC classification alone). This poses a computational challenge when using a two-class classifier such as an SVM: the one-against-one method, which appears to perform better than the one-against-rest method of multi-class classification [18], can be too computationally intensive for this application. However, because the motif representation is sparse, only a small number of classes are likely to show any degree of similarity to a given query. In practice, therefore, only a small number of candidate classes will need to be discriminated against each other.
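The bag-of-motifs representation and the shortlisting of candidate classes can be sketched as below. This is an illustration under simplifying assumptions rather than the proposed system: motifs are treated as literal ungapped strings, the kernel is a plain dot product of motif counts, and the motifs, sequences, and class names are all hypothetical.

```python
def motif_vector(sequence, motifs):
    """Sparse bag-of-motifs vector: {motif: number of occurrences in sequence}."""
    vector = {}
    for motif in motifs:
        count = sum(
            1
            for start in range(len(sequence) - len(motif) + 1)
            if sequence.startswith(motif, start)
        )
        if count:
            vector[motif] = count
    return vector


def kernel(x, y):
    """Linear kernel between two sparse vectors, iterating over the smaller one."""
    if len(x) > len(y):
        x, y = y, x
    return sum(value * y.get(motif, 0) for motif, value in x.items())


def candidate_classes(query, training):
    """Classes with at least one training protein that shares a motif with the query."""
    return sorted({label for vector, label in training if kernel(query, vector) > 0})


def n_pairwise(k):
    """Number of two-class problems in a one-against-one scheme over k classes."""
    return k * (k - 1) // 2


# Hypothetical motifs, sequences, and class names, for illustration only.
MOTIFS = ["GDSGG", "HCQ", "WWK"]
TRAINING = [("AAGDSGGAA", "class_A"), ("TTHCQTT", "class_B"), ("PPWWKPP", "class_C")]
training_vectors = [(motif_vector(seq, MOTIFS), label) for seq, label in TRAINING]

query = motif_vector("MGDSGGLLHCQ", MOTIFS)
shortlist = candidate_classes(query, training_vectors)
print(shortlist, n_pairwise(len(TRAINING)), n_pairwise(len(shortlist)))
```

Because the number of one-against-one problems grows as k(k-1)/2, restricting the comparison to the shortlisted classes is where the saving comes from: 500 classes would require 124,750 pairwise classifiers, whereas a shortlist of 5 requires 10.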
The motifs in the database we use can be highly redundant in their pattern of occurrence in a group of proteins. Because of this redundancy, simple feature selection methods that rank individual features cannot be used to find a small subset of motifs. We found wrapper methods such as RFE to be effective in reducing the dimensionality while maintaining the accuracy of the resulting classifier. Preliminary results of our experiments will be presented at the workshop on Feature Selection at NIPS 2003.

4 Deliverables

A suite of tools for the classification of protein sequences will be provided as modules for the PyML machine learning environment (current version available at http://cmgm.stanford.edu/~asab/pyml.html).

A web-based interface for performing protein classification will also be provided. For a given query the tool will show the component motifs in the query, the motifs that contributed to the classification of the query, and the pattern of occurrence of those motifs across different enzyme classes.
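The pattern of occurrence of motifs across enzyme classes that the interface is meant to display could be tabulated along the following lines. This is only a sketch of one possible summary: the function name, the motif names, and the class names are hypothetical, and it is not part of PyML.

```python
from collections import defaultdict


def occurrence_by_class(proteins):
    """Fraction of proteins in each class that contain each motif.

    proteins: list of (set of motifs, class label) pairs.
    Returns {motif: {class label: fraction}}.
    """
    class_sizes = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for motifs, label in proteins:
        class_sizes[label] += 1
        for motif in motifs:
            counts[motif][label] += 1
    return {
        motif: {
            label: counts[motif].get(label, 0) / class_sizes[label]
            for label in sorted(class_sizes)
        }
        for motif in sorted(counts)
    }


# Hypothetical motif and class names, for illustration only.
proteins = [
    ({"m1", "m2"}, "class_A"),
    ({"m1"}, "class_A"),
    ({"m2", "m3"}, "class_B"),
]
for motif, row in occurrence_by_class(proteins).items():
    print(motif, row)
```

A motif that occurs in every member of one class and in no member of the others, such as m1 in this toy data, is the kind of feature that makes a small motif-based classifier easy to interpret.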

References

[1] F.S. Domingues and T. Lengauer. Protein function from sequence and structure. Applied Bioinformatics, 2(1):3-12, 2003.
[2] E.S. Lander, L.M. Linton, and B. Birren. Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921, 2001.
[3] J.C. Venter, M.D. Adams, E.W. Myers, and P.W. Li. The sequence of the human genome. Science, 291(5507):1304-1351, 2001.
[4] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997.
[5] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84:4355-4358, 1987.
[6] E.L. Sonnhammer, S.R. Eddy, and E. Birney. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, 26(1):320-322, 1998.
[7] M. Ruvolo. Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol., 14(3):248-265, 1997.
[8] T.K. Attwood, M. Blythe, D.R. Flower, A. Gaulton, J.E. Mabey, N. Maudling, L. McGregor, A. Mitchell, G. Moulton, K. Paine, and P. Scordis. PRINTS and PRINTS-S shed light on protein ancestry. Nucleic Acids Research, 30(1):239-241, 2002.
[9] L. Falquet, M. Pagni, P. Bucher, N. Hulo, C.J. Sigrist, K. Hofmann, and A. Bairoch. The PROSITE database, its status in 2002. Nucleic Acids Research, 30:235-238, 2002.
[10] S. Henikoff, J.G. Henikoff, and S. Pietrokovski. Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15(6):471-479, 1999.
[11] J.Y. Huang and D.L. Brutlag. The emotif database. Nucleic Acids Research, 29(1):202-204, 2001.
[12] Q. Su, S. Saxonov, and D.L. Brutlag. eblocks: an automated database of protein conserved regions maximizing sensitivity and specificity, 2003.
[13] C. O'Donovan, M.J. Martin, A. Gattiker, E. Gasteiger, A. Bairoch, and R. Apweiler. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinform., 3:275-284, 2002.
[14] Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme Nomenclature. Recommendations 1992. Academic Press, 1992.
[15] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, 2001.
[16] A. Ben-Hur and D. Brutlag. Remote homology detection: A motif based approach. In Proceedings, Eleventh International Conference on Intelligent Systems for Molecular Biology, volume 19 suppl. 1 of Bioinformatics, pages i26-i33, 2003.
[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, 2002.
[18] C.W. Hsu and C.J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415-425, 2002.