RNA Secondary Structure Prediction Computational Genomics Seyoung Kim

Similar documents
RNA secondary structure prediction and analysis

RNA Secondary Structure Prediction

RNA folding & ncrna discovery

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

90 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 4, 2006

Algorithmic Approaches to Modelling and Predicting RNA Structure

COMP598: Advanced Computational Biology Methods & Research

RNA Genomics II. BME 110: CompBio Tools Todd Lowe & Andrew Uzilov May 17, 2011

Computational analysis of non-coding RNA. Andrew Uzilov BME110 Tue, Nov 16, 2010

A Genetic Algorithm for Predicting RNA Pseudoknot Structures

RNA Genomics. BME 110: CompBio Tools Todd Lowe May 14, 2010

Molecular biology WID Masters of Science in Tropical and Infectious Diseases-Transcription Lecture Series RNA I. Introduction and Background:

Protein Synthesis: From Gene RNA Protein Trait

Joint Loop End Modeling Improves Covariance Model Based Non-coding RNA Gene Search

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE

Genes and How They Work. Chapter 15

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

RNA is a single strand molecule composed of subunits called nucleotides joined by phosphodiester bonds.

1/4/18 NUCLEIC ACIDS. Nucleic Acids. Nucleic Acids. ECS129 Instructor: Patrice Koehl

NUCLEIC ACIDS. ECS129 Instructor: Patrice Koehl

There are four major types of introns. Group I introns, found in some rrna genes, are self-splicing: they can catalyze their own removal.

RNA-Sequencing analysis

STOCHASTIC CONTEXT-FREE GRAMMARS METHOD AIDED CALCULATION OF THE LOCAL FOLDING POTENTIAL OF TARGET RNA

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

Prediction of noncoding RNAs with RNAz

COMP364. Introduction to RNA secondary structure prediction. Jérôme Waldispühl School of Computer Science, McGill

Nucleic Acids. OpenStax College. 1 DNA and RNA

Forest alignment with affine gaps and anchors

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

RNA Structure Prediction and Comparison. RNA Biology Background

RNA Structure Prediction and Comparison. RNA Biology Background

a small viral insert or the like). Both types of breaks are seen in the pyrobaculum snornas.

Genotype-Phenotype Maps: from molecules to simple cells

BIOCHEMISTRY REVIEW. Overview of Biomolecules. Chapter 13 Protein Synthesis

RNA secondary structure

RNA does not adopt the classic B-DNA helix conformation when it forms a self-complementary double helix

BASIC MOLECULAR GENETIC MECHANISMS Introduction:

Textbook Reading Guidelines

Algorithms in Bioinformatics

Methods for the Identification of Common RNA Motifs. by Benedikt Löwes

Outline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools

produces an RNA copy of the coding region of a gene

Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes

Thebiotutor.com A2 Biology OCR Unit F215: Control, genomes and environment Module 1.1 Cellular control Answers

Novel representation of RNA secondary structure used to improve prediction algorithms

From DNA to Protein: Genotype to Phenotype

Accuracy-based identification of local RNA elements

Improving the Algorithm to Predict RNA Structures for Frameshifting in the Expression of Overlapping Viral Genes

Chapter 13. From DNA to Protein

DNA & Genetics. Chapter Introduction DNA 6/12/2012. How are traits passed from parents to offspring?

Ch. 10 Notes DNA: Transcription and Translation

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

BIOLOGY - CLUTCH CH.17 - GENE EXPRESSION.

MATH 5610, Computational Biology

BIOL 300 Foundations of Biology Summer 2017 Telleen Lecture Outline

Monte-Carlo Tree Search

Videos. Lesson Overview. Fermentation

Machine Learning. HMM applications in computational biology

The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Genetic Code. Genes and How They Work

Chapter 12: Molecular Biology of the Gene

Control of Eukaryotic Genes

ANALYZING THE AMBIGUITY IN RNA STRUCTURE USING PROBABILISTIC APPROACH

RNA : functional role

8/21/2014. From Gene to Protein

Unit 3c. Microbial Gene0cs

Control of Eukaryotic Genes. AP Biology

Computational aspects of ncrna research. Mihaela Zavolan Biozentrum, Basel Swiss Institute of Bioinformatics

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

RNA GENE DISCOVERY THROUGH THE CLASSIFICATION OF RNA SECONDARY STRUCTURE ELEMENTS

Feedback D. Incorrect! No, although this is a correct characteristic of RNA, this is not the best response to the questions.

MATH 5610, Computational Biology

From DNA to Protein: Genotype to Phenotype

Notes: (Our Friend) DNA. DNA Structure DNA is composed of 2 chains of repeating. A nucleotide = + +

Chapter 14: From DNA to Protein

Effective Annotation of Noncoding RNA Families Using Profile Context-Sensitive HMMs

Hello! Outline. Cell Biology: RNA and Protein synthesis. In all living cells, DNA molecules are the storehouses of information. 6.

Unit 1: DNA and the Genome. Sub-Topic (1.3) Gene Expression

Genome Resources. Genome Resources. Maj Gen (R) Suhaib Ahmed, HI (M)

Videos. Bozeman Transcription and Translation: Drawing transcription and translation:

A Progressive Folding Algorithm for RNA Secondary Structure Prediction

Creation of a PAM matrix

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot

CH 17 :From Gene to Protein

Nucleic Acids: DNA and RNA

Efficient and Accurate Analysis of non coding RNAs with InSyBio ncrnaseq

A Dynamic Programming Algorithm for Finding the Optimal Segmentation of an RNA Sequence in Secondary Structure Predictions

CONSTRUCTIVE INDUCTION IN BIO-SEQUENCES

DNA Structure & the Genome. Bio160 General Biology

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

9/3/2009. DNA RNA Proteins. DNA Genetic program RNAs Ensure synthesis of proteins Proteins Ensure all cellular functions Carbohydrates (sugars) Energy

Flow of Genetic Information_ Genetic Code, Mutation & Translation (Learning Objectives)

Lecture 15. Promoters, TFs. #15_Sept26

Nucleic Acids. By Sarah, Zach, Joanne, and Dean

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Division Ave. High School AP Biology

Flow of Genetic Information_Translation (Learning Objectives)

Transcription:

RNA Secondary Structure Prediction 02-710 Computational Genomics Seyoung Kim

Outline RNA folding Dynamic programming for RNA secondary structure prediction Covariance model for RNA structure prediction

RNA Basics RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U wobble pairing Bases can only pair with one other base.

RNA Basics transfer RNA (trna) messenger RNA (mrna) ribosomal RNA (rrna) small interfering RNA (sirna) micro RNA (mirna) small nucleolar RNA (snorna) http://www.genetics.wustl.edu/eddy/trnascan-se/

RNA Secondary Structure Pseudoknot Stem Interior Loop Single-Stranded Bulge Loop Hairpin loop Junction (Multiloop) Image Wuchty

Pseudoknots Pseudoknots: a nucleic acid secondary structure containing at least two stem-loop structures which half of one stem is intercalated between the two halves of another stem.

Sequence Alignment as a method to determine structure Bases pair in order to form backbones and determine the secondary structure Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Simple Example: Maximizing Base Pairing S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Simple Example: Maximizing Base Pairing Base pair at i and j

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Simple Example: Maximizing Base Pairing Unmatched at i

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Simple Example: Maximizing Base Pairing Umatched at j

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Simple Example: Maximizing Base Pairing Bifurcation

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Initialize first two diagonal arrays to 0

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Fill in squares sweeping diagonally

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Bases cannot pair, similar to unmatched alignment

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Bases can pair, similar to matched alignment

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Dynamic Programming possible paths S(i + 1, j 1) +1

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension S(i, j 1) Dynamic Programming possible paths

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension S(i + 1, j) Dynamic Programming possible paths

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Bifurcation add values for all k k = 0 : Bifurcation max in this case S(i,k) + S(k + 1, j)

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Bifurcation add values for all k Reminder: For all k S(i,k) + S(k + 1, j)

Images Sean Eddy Base Pair Maximization Dynamic Programming Algorithm Alignment Method Align RNA strand to itself Score increases for feasible base pairs Each score independent of overall structure Bifurcation adds extra dimension Bifurcation add values for all k Reminder: For all k S(i,k) + S(k + 1, j)

Base Pair Maximization - Drawbacks Base pair maximization will not necessarily lead to the most stable structure May create structure with many interior loops or hairpins which are energetically unfavorable Comparable to aligning sequences with scattered matches not biologically reasonable

Energy Minimization Thermodynamic Stability Estimated using experimental techniques Theory : Most Stable is the Most likely No Pseudknots due to algorithm limitations Uses Dynamic Programming alignment technique Attempts to maximize the score taking into account thermodynamics MFOLD and ViennaRNA

Energy Minimization Results Linear RNA strand folded back on itself to create secondary structure Circularized representation uses this requirement Arcs represent base pairing Images David Mount

Energy Minimization Results Images David Mount Linear All loops RNA must strand have folded at back least on 3 itself bases to create in them secondary structure Equivalent to having 3 base pairs between all arcs Circularized representation uses this requirement Exception: Arcs Location represent where base the pairing beginning and end of RNA come together in circularized representation

Trouble with Pseudoknots Pseudoknots cause a breakdown in the Dynamic Programming Algorithm. In order to form a pseudoknot, checks must be made to ensure base is not already paired this breaks down the recurrence relations Images David Mount

Energy Minimization Drawbacks Compute only one optimal structure Usual drawbacks of purely mathematical approaches Similar difficulties in other algorithms Protein structure Exon finding

Alternative Algorithms - Covariaton Incorporates Similarity-based method Evolution maintains sequences that are important Change in sequence coincides to maintain structure through base pairs (Covariance) Cross-species structure conservation example trna Manual and automated approaches have been used to identify covarying base pairs Models for structure based on results Ordered Tree Model Stochastic Context Free Grammar

Alternative Algorithms - Covariaton Expect areas of base pairing in trna to be covarying between various species

Alternative Algorithms - Covariaton Base pairing creates same stable trna structure in organisms

Alternative Algorithms - Covariaton Base Mutation Expect pairing areas in one creates of base same yields pairing stable pairing in trna trna to be structure impossible covarying in between and organisms breaks down various structure species

Alternative Algorithms - Covariaton Base Mutation Covariation Expect pairing areas in ensures one creates of base same yields ability pairing stable pairing to in base trna trna pair to be is structure impossible maintained covarying between and organisms breaks RNA down structure various structure species is conserved

Binary Tree Representation of RNA Secondary Structure Representation of RNA structure using Binary tree Nodes represent Base pair if two bases are shown Loop if base and gap (dash) are shown Traverse root to leaves, from left to right Pseudoknots still not represented Tree does not permit varying sequences Mismatches Insertions & Deletions Images Eddy et al.

Covariance Model HMM which permits flexible alignment to an RNA structure emission and transition probabilities Model trees based on finite number of states Match states sequence conforms to the model: MATP State in which bases are paired in the model and sequence MATL & MATR State in which either right or left bulges in the sequence and the model Deletion State in which there is deletion in the sequence when compared to the model Insertion State in which there is an insertion relative to model Transitions have probabilities Varying probability Enter insertion, remain in current state, etc Bifurcation no probability, describes path

Alignment to CM Algorithm Calculate the probability score of aligning RNA to CM Three dimensional matrix O(n³) Align sequence to given subtrees in CM For each subsequence calculate all possible states Subtrees evolve from Bifurcations For simplicity Left singlet is default Images Eddy et al.

Alignment to CM Algorithm Images Eddy et al. For each calculation take into account the Transition (T) to next state Emission probability (P) in the state as determined by training data

Alignment to CM Algorithm Images Eddy et al. For each calculation take into account the Transition (T) to next state Emission probability (P) in the state as determined by training data

Alignment to CM Algorithm Images Eddy et al. For each calculation take into account the Transition (T) to next state Emission probability (P) in the state as determined by training data

Alignment to CM Algorithm Images Eddy et al. For each calculation take into account the Transition (T) to next state Emission probability (P) in the state as determined by training data Deletion does not have an emission probability (P) associated with it

Alignment to CM Algorithm Images Eddy et al. For each calculation take into account the Transition (T) to next state Emission probability (P) in the state as determined by training data Bifurcation does not have a probability associated with the state

Model Training

Covariance Model (CM) Training Algorithm S(i,j) = Score at indices i and j in RNA when aligned to the Covariance Model Frequency of seeing the symbols (A, C, G, T) together in locations i and j depending on symbol. Independent frequency of seeing the symbols (A, C, G, T) in locations i or j depending on symbol. Frequencies obtained by aligning model to training data consists of sample sequences Reflect values which optimize alignment of sequences to model

Mutual information for RNA Secondary Structure Prediction

Covariance Model Drawbacks Needs to be well trained Not suitable for searches of large RNA Structural complexity of large RNA cannot be modeled Runtime Memory requirements

References How Do RNA Folding Algorithms Work?. S.R. Eddy. Nature Biotechnology, 22:1457-1458, 2004.