CMPS 6630: Introduction to Computational Biology and Bioinformatics. Secondary Structure Prediction

Similar documents
Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction

Protein Structure Analysis

CFSSP: Chou and Fasman Secondary Structure Prediction server

[Loganathan *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor

CS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm

Representation in Supervised Machine Learning Application to Biological Problems

MOL204 Exam Fall 2015

BIOINFORMATICS Sequence to Structure

A Protein Secondary Structure Prediction Method Based on BP Neural Network Ru-xi YIN, Li-zhen LIU*, Wei SONG, Xin-lei ZHAO and Chao DU

BIOLOGY 200 Molecular Biology Students registered for the 9:30AM lecture should NOT attend the 4:30PM lecture.

BIRKBECK COLLEGE (University of London)

The relationship between relative solvent accessible surface area (rasa) and irregular structures in protean segments (ProSs)

Protein structure. Wednesday, October 4, 2006

Structure & Function. Ulf Leser

Textbook Reading Guidelines

Prot-SSP: A Tool for Amino Acid Pairing Pattern Analysis in Secondary Structures

Protein Structure Prediction. christian studer , EPFL

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Combining PSSM and physicochemical feature for protein structure prediction with support vector machine

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics

Protein Secondary Structures: Prediction

Protein Folding Problem I400: Introduction to Bioinformatics

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 5 Protein Structure - III

A STUDY OF INTELLIGENT TECHNIQUES FOR PROTEIN SECONDARY STRUCTURE PREDICTION. Hanan Hendy, Wael Khalifa, Mohamed Roushdy, Abdel Badeeh Salem

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

Molecular Structures

Molecular Structures

Polypeptide backbone (or the main chain)

Molecular Modeling Lecture 8. Local structure Database search Multiple alignment Automated homology modeling

All Rights Reserved. U.S. Patents 6,471,520B1; 5,498,190; 5,916, North Market Street, Suite CC130A, Milwaukee, WI 53202

Supplement for the manuscript entitled

JPred and Jnet: Protein Secondary Structure Prediction.

Proteins Higher Order Structures

Molecular biology WID Masters of Science in Tropical and Infectious Diseases-Transcription Lecture Series RNA I. Introduction and Background:

BETA STRAND Prof. Alejandro Hochkoeppler Department of Pharmaceutical Sciences and Biotechnology University of Bologna

Simple jury predicts protein secondary structure best

I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure

Computational methods in bioinformatics: Lecture 1

COMP364. Introduction to RNA secondary structure prediction. Jérôme Waldispühl School of Computer Science, McGill

Structural bioinformatics

Textbook Reading Guidelines

Problem Set #2

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Homework 4. Due in class, Wednesday, November 10, 2004

Protein Structure Prediction

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS*

COMP598: Advanced Computational Biology Methods & Research

Astronomy picture of the day (4/21/08)

Analysis and prediction of RNA-binding residues using. sequence, evolutionary conservation, and predicted

Bayesian Inference using Neural Net Likelihood Models for Protein Secondary Structure Prediction

Computational Methods for Protein Structure Prediction

Hmwk 6. Nucleic Acids

Secondary Structure Prediction. Michael Tress CNIO

Structural Bioinformatics (C3210) DNA and RNA Structure

Protein Structural Motifs Search in Protein Data Base

Exploring Suboptimal Sequence Alignments and Scoring Functions in Comparative Protein Structural Modeling

BMB/Bi/Ch 170 Fall 2017 Problem Set 1: Proteins I

RNA does not adopt the classic B-DNA helix conformation when it forms a self-complementary double helix

G4120: Introduction to Computational Biology

Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices

Biomolecules: nucleotides

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases

Hmwk # 8 : DNA-Binding Proteins : Part II

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

AP Biology Book Notes Chapter 3 v Nucleic acids Ø Polymers specialized for the storage transmission and use of genetic information Ø Two types DNA

Week6 The molecular basis for growth and reproduc9on

Prediction of DNA-Binding Proteins from Structural Features

Protein Secondary Structure Prediction Using RT-RICO: A Rule-Based Approach

Protein backbone angle prediction with machine learning. approaches

Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction

Programme Good morning and summary of last week Levels of Protein Structure - I Levels of Protein Structure - II

Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification Read intro to HMM on blackboard

TALENs (Transcription Activator-Like Effector Nucleases)

BIOINFORMATICS ORIGINAL PAPER

Creation of a PAM matrix

Bioinformatics Practical Course. 80 Practical Hours

SECONDARY STRUCTURE AND OTHER 1D PREDICTION MICHAEL TRESS, CNIO

Protein Structure. Protein Structure Tertiary & Quaternary

Database Searching and BLAST Dannie Durand

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

NNvPDB: Neural Network based Protein Secondary Structure Prediction with PDB Validation

BASIC MOLECULAR GENETIC MECHANISMS Introduction:

What Are the Chemical Structures and Functions of Nucleic Acids?

Chromatin Structure. a basic discussion of protein-nucleic acid binding

Artificial Intelligence in Prediction of Secondary Protein Structure Using CB513 Database

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Structural Bioinformatics GENOME 541 Spring 2018

BIOINFORMATICS Introduction

Predicting Protein Structure and Examining Similarities of Protein Structure by Spectral Analysis Techniques

Nucleic Acids, Proteins, and Enzymes

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

Secondary Structure Prediction. Michael Tress, CNIO

STRUCTURAL BIOLOGY. α/β structures Closed barrels Open twisted sheets Horseshoe folds

Prediction of protein disorder Zsuzsanna Dosztányi

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Transcription:

CMPS 6630: Introduction to Computational Biology and Bioinformatics Secondary Structure Prediction

Secondary Structure Annotation Given a macromolecular structure Identify the regions of secondary structure Assumptions: High resolution structure There is a reasonable and consistent way to define Secondary Structure Petsko, Ringe, Prot Struct and Function, 2004

DSSP (Defined Secondary Structure of Proteins) A gold standard for secondary structure identification (NOT prediction) Used by the PDB database for secondary structure annotation Based on Hydrogen-Bonding Pa/erns -q 2 N H +q 2 -q 1 Quality of H-bond is function of: Distance d between donor and acceptor Alignment angle +q 1 E = f q xq y O C r x y d 1 E = q 1 q 2 + 1 1 1 f r r r ON CH OH r CN

DSSP Defines a Polar Interaction as a H-bond with computed energy < -0.5 kcal/mol An ideal H-bond has o d = 2.9 A = 0 E = -3.0 kcal/mol E = -0.5 kcal/mol decreasing energy

DSSP n-turn An n-turn is present at residue i if there is an H-bond from CO(i) to NH(i+n). Two or more form a helix. alpha-turn(i) = Hbond(i,i+4) (most common) 3-turn(i) = Hbond(i,i+3) (310 -helix) pi-turn(i) = Hbond(i,i+5) (very rare) Bridge A bridge is formed from two non-overlapping stretches of 3 residues if one of the following H-bond patterns is seen: Parallel Bridge (i, j) = [Hbond(i-1,j) and Hbond(j,i+1)] OR [Hbond(j-1,i) and Hbond(i,j+1)] An5Parallel Bridge (i, j) = [Hbond(i,j) and Hbond(j,i)] OR [Hbond(i-1,j+1) and Hbond(j-1,i+1)]

Directionality Hbond(i,j) = H-bond from CO(i) to NH(j) R O Proteins written from N-terminus to C-terminus H N C A C H H i j

DSSP An5Parallel Bridge (i, j) = [Hbond(i,j) and Hbond(j,i)] OR [Hbond(i-1,j+1) and Hbond(j-1,i+1)] Hbonds i j, j i i j

DSSP An5Parallel Bridge (i, j) = [Hbond(i,j) and Hbond(j,i)] OR [Hbond(i-1,j+1) and Hbond(j-1,i+1)] Hbonds i-1 j+1, j-1 i+1 i+1 i-1 j-1 j+1

DSSP Parallel Bridge (i, j) = [Hbond(i-1,j) and Hbond(j,i+1)] OR [Hbond(j-1,i) and Hbond(i,j+1)] Hbonds j-1 i, i j+1 i j-1 j j+1

Secondary Structure Prediction Input: Primary Sequence only Output: Annotation of alpha-helices, beta-strands, loops Petsko, Ringe, Prot Struct and Function, 2004

Chou-Fasman (First-Generation) Compute propensities for each amino acid, structural conformation s j {,, } Pr[A = a i S = s j ] Pr[A = a i ] a i in Categorize each AA as helix-former, helix-breaker, helixindifferent and sheet-former, sheet-breaker, sheetindifferent. Keep chaining residues to SS element while average propensity is above some threshold.

Accuracy? 3-State Accuracy: Percent of residues for which a method s predicted secondary structure (Helix ( ), Strand ( ), Neither ( )) is correct. Chou-Fasman: 50-60% 3-State Accuracy Priors: 30% Helices 20% Strands 50% Neither If you always predicted Neither you would have 50% 3-State Accuracy Need to incorporate additional information!

GOR Method What if we look at blocks of sequence to determine secondary structure? (Garnier-Osguthorpe-Robson, late 70s) s j = f(r j k,...,r j+k ) GAVLIFYWMVLLAGIFFST

GOR Method Main Idea: Look at the mutual information between nearby amino acids to assess whether they influence secondary structure at a particular position. I(x; y) = X X Pr[x, y] Pr[x, y] log Pr[x]Pr[y] x y = X X Pr[x y]pr[y] Pr[x, y] log Pr[x]Pr[y] x y = X X Pr[x y] Pr[x, y] log Pr[x] x y

GOR Method Consider mutual information between structural class and sequence: I(s j ; r j 8,...,r j+8 ) = X s j X hr j 8,...,r j+8 i Select the structural class which maximizes: Pr[s j,r j 8,...,r j+8 ] log Pr[s j r j 8,...,r j+8 ] Pr[s j ] I(s j = x; r j 8,...,r j+8 ) I(s j = x; r j 8,...,r j+8 ) Problem: Each conditional probability needs to be computed. Pr[s j r j 8,...,r j+8 ] 3 20 17 table size: Assume independence : I(s j ; r j 8,...,r j+8 ) = 8 k= 8 I(s j = x; r j+k )

GOR Method Extension Larger training set (513 vs 267 sequences) Because Neither regions were over predicted, they added a margin by which Neither must be preferred over Helix or Coil Use of doublet and triplet statistics Optimized window length - 13 better than 17... Previously: I(s j ; r j 8,...,r j+8 ) = I(s j ; r j 8,...,r j+8 ) = I(s j = x; r j )+ 8 k= 8 I(s j = x; r j+k ) Incorporating conditional information on residue : 8 k= 8,k =0 r j I(s j = x; r j+k r j )

Other Learning Methods Neural Networks 17 Residue Window Each Residue represented with 21-bits One for each AA, one for beginning/end of sequence Each sample has 17x21 bit feature vector with 17 non-zero entries Support Vector Machines (SVM) Linear Discriminant Analysis (LDA) Returns H, E, or C with confidence or probability. Requires k sequential residues with confidence above some threshold.

Third-Generation Observation: Protein structure is more conserved than protein sequence. Russell et al, J. Mol. Biol, 269, 1997 Two proteins with >30% sequence identity are likely to have similar structures.

Multiple Sequence Alignment We saw that multiple primary sequences of DNA, RNA, or protein identify similarity that may be a consequence of functional, structural, or evolutionary relationships. Idea: Predict secondary structure type by using information from homologs, then combine predictions.

Multiple Sequence Alignment Have: List of similar (but different) sequences Assumption: These aligned sequences fold into the same structure Consider the probability distribution over all amino acids at residue i, a 20-dimensional vector Examine 13-residue window Each residue represented by the 20-dimensional probability vector Use favorite classification technique (NN, LDA, SVM)

Accuracy Limits 100 90 Scores (%) 80 70 60 50 40 30 CF GOR I GOR III PHD PSIpred What is the gold standard? (~12% variability) Puts upper bound at 88% accuracy Sequence Structure Degeneracy Some 11-long AA sequences can fold into both an alpha-helix and a beta-strand

Chameleon Sequences Chameleon Sequences in the PDB, 1998 7-Residue: LSLAVAG Many 5-residue chameleons Fewer 6-residue chameleons Few 7-residue chameleons

7-Residue: LITTAHA Chameleon Sequences in the PDB, 1998

7-Residue: KGLEWVS Chameleon Sequences in the PDB, 1998

Prions Prion: from proteinaceous infectious The prion protein is naturally found in the body. The infectious agent is the same protein but with a different fold. Disease fold induces normal copies of the protein to fold into disease form. Accumulation of disease folds forms cytotoxic aggregates. Normal PrP C Spontaneous Conversion, or Inoculation Interaction between PrP C and PrP Sc Conversion PrP C (NMR) www.wikipedia.org PrP Sc (Proposed) Accumulation of PrP Sc

Prions Prion: from proteinaceous infectious The prion protein is naturally found in the body. The infectious agent is the same protein but with a different fold. Disease fold induces normal copies of the protein to fold into disease form. Accumulation of disease folds forms cytotoxic aggregates. Normal PrP C Spontaneous Conversion, or Inoculation Interaction between PrP C and PrP Sc Conversion Sunde et al, J Mol Biol, 273(3), 1997. Amyloidosis leads to tissue damage. Accumulation of PrP Sc