JPred and Jnet: Protein Secondary Structure Prediction.

Similar documents
Simple jury predicts protein secondary structure best

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

Sequence Based Function Annotation

MOL204 Exam Fall 2015

G4120: Introduction to Computational Biology

CS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004

G4120: Introduction to Computational Biology

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction

Combining PSSM and physicochemical feature for protein structure prediction with support vector machine

Bioinformatics Practical Course. 80 Practical Hours

AB INITIO PROTEIN STRUCTURE PREDICTION ALGORITHMS

NNvPDB: Neural Network based Protein Secondary Structure Prediction with PDB Validation

Methods and tools for exploring functional genomics data

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Consensus Prediction of Protein Secondary Structures

Protein Structure Prediction. christian studer , EPFL

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics

Protein 3D Structure Prediction

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Molecular Modeling Lecture 8. Local structure Database search Multiple alignment Automated homology modeling

Artificial Intelligence in Prediction of Secondary Protein Structure Using CB513 Database

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Textbook Reading Guidelines

Protein Secondary Structure Prediction Using RT-RICO: A Rule-Based Approach

Prediction of Secondary Structures of Human Prion Proteins using Miscellaneous Methods

BMC Structural Biology

Protein backbone angle prediction with machine learning. approaches

Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm

Representation in Supervised Machine Learning Application to Biological Problems

Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices

COMBINING SEQUENCE AND STRUCTURAL PROFILES FOR PROTEIN SOLVENT ACCESSIBILITY PREDICTION

Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction

Protein Structure Analysis

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

A Model for Protein Secondary Structure Prediction Meta - Classifiers

SSP2: A Novel Software Architecture for Predicting Protein Secondary Structure

SECONDARY STRUCTURE AND OTHER 1D PREDICTION MICHAEL TRESS, CNIO

Structure & Function. Ulf Leser

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Secondary Structure Prediction

MetaGO: Predicting Gene Ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping

PCA and SOM based Dimension Reduction Techniques for Quaternary Protein Structure Prediction

Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets

Secondary Structure Prediction. Michael Tress CNIO

Secondary Structure Prediction. Michael Tress, CNIO

Designing and benchmarking the MULTICOM protein structure prediction system

Analysis and prediction of RNA-binding residues using. sequence, evolutionary conservation, and predicted

Protein Secondary Structures: Prediction

Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction. N. von Öhsen, I. Sommer, R. Zimmer

Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure

CFSSP: Chou and Fasman Secondary Structure Prediction server

Supplement for the manuscript entitled

Database Searching and BLAST Dannie Durand

Dynamic Programming Algorithms

Product Applications for the Sequence Analysis Collection

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Discovering Sequence-Structure Motifs from Protein Segments and Two Applications. T. Tang, J. Xu, and M. Li

OPTIMIZING LONG INTRINSIC DISORDER PREDICTORS WITH PROTEIN EVOLUTIONARY INFORMATION

[Loganathan *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor

Textbook Reading Guidelines

BIOINFORMATICS ORIGINAL PAPER

Practical lessons from protein structure prediction

Advanced topics in bioinformatics

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch

The Effect of Using Different Neural Networks Architectures on the Protein Secondary Structure Prediction

Multi-class protein Structure Prediction Using Machine Learning Techniques

Extracting Database Properties for Sequence Alignment and Secondary Structure Prediction

A Protein Secondary Structure Prediction Method Based on BP Neural Network Ru-xi YIN, Li-zhen LIU*, Wei SONG, Xin-lei ZHAO and Chao DU

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS*

This practical aims to walk you through the process of text searching DNA and protein databases for sequence entries.

Protein Structure Prediction

Computational Methods for Protein Structure Prediction

Protein Structural Motifs Search in Protein Data Base

Geoff Barton: Multiple alignment and Analysis Session 2: Alignment & alignment analysis. Session 3: Annotating sequences & alignments

How to use Variant Effects Report

Introduction to Bioinformatics Finish. Johannes Starlinger

Prediction of Protein Function and Functional Sites From Protein Sequences

Creation of a PAM matrix

BIOINFORMATICS Sequence to Structure

Bayesian Inference using Neural Net Likelihood Models for Protein Secondary Structure Prediction

Exploring Similarities of Conserved Domains/Motifs

Supervised Machine Learning Algorithms for Protein Structure Classifica;on. Pooja Jain February 10, 2009 IMA Seminar

CAP 5510/CGS 5166: Bioinformatics & Bioinformatic Tools GIRI NARASIMHAN, SCIS, FIU

A NOVEL FRAMEWORK FOR PROTEIN STRUCTURE PREDICTION. A Dissertation presented to the Faculty of the Graduate School University of Missouri-Columbia

CAFASP2:TheSecondCriticalAssessmentofFully AutomatedStructurePredictionMethods

proteins Prediction of RNA binding sites in a protein using SVM and PSSM profile Manish Kumar, 1 M. Michael Gromiha, 2 and G. P. S.

UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Key-Words: Case Based Reasoning, Decision Trees, Bioinformatics, Machine Learning, Protein Secondary Structure Prediction. Fig. 1 - Amino Acids [1]

Homework 4. Due in class, Wednesday, November 10, 2004

Protein structure. Predictive methods

Cascaded walks in protein sequence space: Use of artificial sequences in remote homology detection between natural proteins

Data Mining for Biological Data Analysis

Prediction of Interface Residues in Protein Protein Complexes by a Consensus Neural Network Method: Test Against NMR Data

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

In silico Protein Recombination: Enhancing Template and Sequence Alignment Selection for Comparative Protein Modelling

ONLINE BIOINFORMATICS RESOURCES

Towards protein function annotations for matching remote homologs

A Neural Network Method for Identification of RNA-Interacting Residues in Protein

Transcription:

JPred and Jnet: Protein Secondary Structure Prediction www.compbio.dundee.ac.uk/jpred

...A I L E G D Y A S H M K... FUNCTION? Protein Sequence a-helix b-strand Secondary Structure Fold

What is the difference between JPred and Jnet??? JNet refers to the prediction engine that does the work. The current version of this is Version 2.3.1 JPred refers to the website. This uses JNet and other tools to do predictions and present them in different ways.

History of JPred/JNet 1987: Zpred: First predictor that used multiple sequence alignment 1999: Jpred 1: Did prediction by combining prediction methods developed by different groups that worked from multiple sequence alignments 2000: JNet 1: Multiple neural network predictor replaced all other methods in JPred 2002: JPred 2: Retraining JNet improved accuracy 2009: JPred 3: Retraining JNet, algorithm improvements to Jnet and website refresh 2015: Jpred 4: Retraining JNet, major website improvements

Neural Network??? Machine learning method

Neural Networks Inductive method of learning Input Nodes Hidden Nodes Output Nodes Supervised learning Inputs and outputs provided Dependent on quality of observations Representative Unbiased (non-redundant)

Training and Testing JPred4/Jnet 2.3.1 You need training data where you know the answer. We use a set of PDB domains of known structure from the SCOP domains database Testing 1. Cross-validation on 1208 domains 2. Blind test on 150 domains not used in training

Neural Network Inputs Generate alignments for each sequence by searching UniRef90 with PSI-BLAST Make profiles: Position-Specific Scoring Matrix (PSSM) Hidden Markov Model (HMMer3) Earlier versions of JNet/Jpred had more inputs.

Profiles give positionspecific scoring Aligning to Glycine at position 11 scores +6.5 Aligning to Glycine at position 23 scores -1.51 This emphasises position-specific features of the protein family Compared to Gly-Gly score of 0.6 in the BLOSUM62 matrix.

Neural Network Outputs DSSP definitions of secondary structure reduced from 8- to 3-state H: Helix E or B: Extended strand Everything else: coil

Query --------KMLTQRAEIDRAFEEAAGSAETLSVERLVTFLQHQQR Seq 2 --------KILTKREEIDVIYGEYAKTDGLMSANDLLNFLLTEQR Seq 3 --------KALTKRAEVQELFESFSADGQKLTLLEFLDFLREEQK Seq 4 --------RELLRRPELDAVFIQYSANGCVLSTLDLRDFLSD-QG DSSP } --HHHHHHHHH------HHHHHHHHHHHHH------- Each position in the input vector is the score for an amino acid within the window # Input 1 0 0 0 0 0 0 0 0 0.9820 0.1192 0.28689... # Output 1 1 0 0

Query --------KMLTQRAEIDRAFEEAAGSAETLSVERLVTFLQHQQR Seq 2 --------KILTKREEIDVIYGEYAKTDGLMSANDLLNFLLTEQR Seq 3 --------KALTKRAEVQELFESFSADGQKLTLLEFLDFLREEQK Seq 4 --------RELLRRPELDAVFIQYSANGCVLSTLDLRDFLSD-QG DSSP } --HHHHHHHHH------HHHHHHHHHHHHH------- Each position in the input vector is the score for an amino acid within the window # Input 2 0 0 0 0 0 0 0 0.9820 0.1192 0.28689 0.0474... # Output 2 1 0 0

Query --------KMLTQRAEIDRAFEEAAGSAETLSVERLVTFLQHQQR Seq 2 --------KILTKREEIDVIYGEYAKTDGLMSANDLLNFLLTEQR Seq 3 --------KALTKRAEVQELFESFSADGQKLTLLEFLDFLREEQK Seq 4 --------RELLRRPELDAVFIQYSANGCVLSTLDLRDFLSD-QG DSSP } --HHHHHHHHH------HHHHHHHHHHHHH------- Each position in the input vector is the score for an amino acid within the window # Input 3 0 0 0 0 0 0 0.9820 0.1192 0.28689 0.0474 0.1192... # Output 3 0 0 1

Two-Layer Ensemble Sequence to Structure Structure to Structure E-EEEEH-------HHH-HHH-HHH-HE- --EEEEE-------HHHHHHHHHHHHHH- Actually, both have hundreds of inputs and three outputs only two outputs shown for simplicity

Training and Testing JPred4/Jnet 2.3.1 You need training data where you know the answer. We use a set of PDB domains of known structure from the SCOP domains database Testing 1. Cross-validation on 1358 domains 2. Blind test on 150 domains not used in training

Cross-validation training Blind data - removed subset k-fold Cross-Validation (887 seqs) Divide training data into k groups, train on k-1 and test on remainder. Do this k times. Train Test

Jnet Version 1: Multiple methods of presenting alignment information. If alternative networks do not agree, predict with network trained on difficult to predict regions.

JNet Version 1: Effect of Different Alignment Inputs Jury/No Jury network Average of HMMER and PSSM PSIBLAST 76.9 76.5 PSSM PSIBLAST HMMER profile ClustalW 74.4 75.2 Frequency profile PSIBLAST Frequency profile ClustalW Blosum 62 profile ClustalW 70.8 71.6 72.1 67 68 69 70 71 72 73 74 75 76 77 78 Average Percentage Accuracy (7-fold cross-validation on 480 proteins) JNet also accurately predicts whether amino acids will be buried or exposed in the folded structure of the protein.

Blind Test JNet Version 1

Comparison of JNet Version 1.0 to other Prediction Methods in a Blind Test - (406 proteins) 70.6 73.3 72.3 70.7 74.6 76.4 62.0 Cuff, J. A. & Barton, G. J., (2000), Proteins 40: 502-511.

That was in 2000 What has happened since?

Average Prediction Accuracy is Rising, but Flattening off Current JPred

Confidence in prediction. JNet 1.0 vs Jnet 2.0

Latest Jnet Mean Q3 prediction accuracy curves Residue Coverage curves Dr. Alexey Drozdetskiy

Introducing the Practical At last!

Limits of protein Secondary Structure