RNA folding & ncrna discovery

Similar documents
RNA secondary structure prediction and analysis

RNA Secondary Structure Prediction Computational Genomics Seyoung Kim

RNA Secondary Structure Prediction

90 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 4, 2006

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

RNA Genomics. BME 110: CompBio Tools Todd Lowe May 14, 2010

Computational analysis of non-coding RNA. Andrew Uzilov BME110 Tue, Nov 16, 2010

RNA secondary structure

RNA Genomics II. BME 110: CompBio Tools Todd Lowe & Andrew Uzilov May 17, 2011

Prediction of noncoding RNAs with RNAz

Joint Loop End Modeling Improves Covariance Model Based Non-coding RNA Gene Search

Computational aspects of ncrna research. Mihaela Zavolan Biozentrum, Basel Swiss Institute of Bioinformatics

COMP598: Advanced Computational Biology Methods & Research

There are four major types of introns. Group I introns, found in some rrna genes, are self-splicing: they can catalyze their own removal.

Algorithmic Approaches to Modelling and Predicting RNA Structure

RNA Structure and the Versatility of RNA. Mitesh Shrestha

Structural Bioinformatics GENOME 541 Spring 2018

Applications of mfold/rnafold

STOCHASTIC CONTEXT-FREE GRAMMARS METHOD AIDED CALCULATION OF THE LOCAL FOLDING POTENTIAL OF TARGET RNA

RNA is a single strand molecule composed of subunits called nucleotides joined by phosphodiester bonds.

Novel representation of RNA secondary structure used to improve prediction algorithms

Accuracy-based identification of local RNA elements

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

1/4/18 NUCLEIC ACIDS. Nucleic Acids. Nucleic Acids. ECS129 Instructor: Patrice Koehl

NUCLEIC ACIDS. ECS129 Instructor: Patrice Koehl

An approach to selecting putative RNA motifs using MDL principle

Cardioviral RNA structure logo analysis: entropy, correlations, and prediction

RNA secondary structure prediction: Analysis of Saccharomyces cerevisiaer RNAs

Regulation of bacterial gene expression

Methods for the Identification of Common RNA Motifs. by Benedikt Löwes

Chapter 8 Lecture Outline. Transcription, Translation, and Bioinformatics

RNA Structure Prediction and Comparison. RNA Biology Background

We can now identify three major pathways of information flow in the cell (in replication, information passes from one DNA molecule to other DNA

RNA Structure Prediction and Comparison. RNA Biology Background

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Feedback D. Incorrect! No, although this is a correct characteristic of RNA, this is not the best response to the questions.

small RNAseq data analysis mirna detection P. Bardou, C. Gaspin, S. Maman, J. Mariette & O. Rué

Introduction to RNA Bioinformatics

DNA Structure and Replication, and Virus Structure and Replication Test Review

Machine Learning. HMM applications in computational biology

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

ANALYZING THE AMBIGUITY IN RNA STRUCTURE USING PROBABILISTIC APPROACH

Classification of Non-Coding RNA Using Graph Representations of Secondary Structure. Y. Karklin, R.F. Meraz, and S.R. Holbrook

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Efficient and Accurate Analysis of non coding RNAs with InSyBio ncrnaseq

DOWNLOAD OR READ : RNA INTERFERENCE CURRENT TOPICS IN MICROBIOLOGY AND IMMUNOLOGY PDF EBOOK EPUB MOBI

Genes and How They Work. Chapter 15

#26 - Gene Prediction 10/22/07

Biotechnology Unit 3: DNA to Proteins. From DNA to RNA

From DNA to Proteins

RNA-Sequencing analysis

RNA folding and its importance. Mitesh Shrestha

Chapter 8 From DNA to Proteins. Chapter 8 From DNA to Proteins

Textbook Reading Guidelines

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE

Learning Objectives. Define RNA interference. Define basic terminology. Describe molecular mechanism. Define VSP and relevance

DNA & Genetics. Chapter Introduction DNA 6/12/2012. How are traits passed from parents to offspring?

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Chapter 31. Transcription and RNA processing

Molecular biology WID Masters of Science in Tropical and Infectious Diseases-Transcription Lecture Series RNA I. Introduction and Background:

RNA-SEQUENCING ANALYSIS

Division Ave. High School AP Biology

RNA genes. Functional non-coding RNAs (ncrna) Jan 31 st 2018.

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University

Nucleic Acids. OpenStax College. 1 DNA and RNA

DNA is the genetic material. DNA structure. Chapter 7: DNA Replication, Transcription & Translation; Mutations & Ames test

A novel theophylline triggered RNaseP riboswitch. Stefan Hammer Bled 2018

Maximizing Expected Base Pair Accuracy in RNA Secondary Structure Prediction by Joining Stochastic Context-Free Grammars Method

Module 6 Microbial Genetics. Chapter 8

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

RNA : functional role

Sections 12.3, 13.1, 13.2

Expression of the genome. Books: 1. Molecular biology of the gene: Watson et al 2. Genetics: Peter J. Russell

COMP364. Introduction to RNA secondary structure prediction. Jérôme Waldispühl School of Computer Science, McGill

DNA and RNA. Chapter 12

Designing Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler

Computational DNA Sequence Analysis

MOLECULAR STRUCTURE OF DNA

Chapter 13 - Concept Mapping

Unit 3c. Microbial Gene0cs

DNA. translation. base pairing rules for DNA Replication. thymine. cytosine. amino acids. The building blocks of proteins are?

RNA GENE DISCOVERY THROUGH THE CLASSIFICATION OF RNA SECONDARY STRUCTURE ELEMENTS

Big Idea 3C Basic Review

Resources. How to Use This Presentation. Chapter 10. Objectives. Table of Contents. Griffith s Discovery of Transformation. Griffith s Experiments

From Gene to Protein

Lesson 8. DNA: The Molecule of Heredity. Gene Expression and Regulation. Introduction to Life Processes - SCI 102 1

Unit 1: DNA and the Genome. Sub-Topic (1.3) Gene Expression

Protein Synthesis. DNA to RNA to Protein

The Flow of Genetic Information

Therapeutic & Prevention Application of Nucleic Acids

Videos. Lesson Overview. Fermentation

Introduction to human genomics and genome informatics

DNA Function: Information Transmission

Gene Prediction Background & Strategy Faction 2 February 22, 2017

Algorithms in Bioinformatics

Control of Eukaryotic Gene Expression (Learning Objectives)

Transcription:

I519 Introduction to Bioinformatics RNA folding & ncrna discovery Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB

Contents Non-coding RNAs and their functions RNA structures RNA folding Nussinov algorithm Energy minimization methods microrna target identification

RNAs have diverse functions ncrnas have important and diverse functional and regulatory roles that impact gene transcription, translation, localization, replication, and degradation Protein synthesis (rrna and trna) RNA processing (snorna) Gene regulation RNA interference (RNAi) Andrew Fire and Craig Mello (2006 Nobel prize) DNA-like function Virus RNA world

Non-coding RNAs A non-coding RNA (ncrna) is a functional RNA molecule that is not translated into a protein; small RNA (srna) is often used for bacterial ncrnas. trna (transfer RNA), rrna (ribosomal RNA), snorna (small RNA molecules that guide chemical modifications of other RNAs) micrornas (mirna, µrna, single-stranded RNA molecules of 21-23 nucleotides in length, regulate gene expression) sirnas (short interfering RNA or silencing RNA, double-stranded, 20-25 nucleotides in length, involved in the RNA interference (RNAi) pathway, where it interferes with the expression of a specific gene. ) pirnas (expressed in animal cells, forms RNA-protein complexes through interactions with Piwi proteins, which have been linked to transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells) long ncrnas (non-protein coding transcripts longer than 200 nucleotides)

What s riboswitch Riboswitch mechanism Riboswitch Image source: Curr Opin Struct Biol. 2005, 15(3):342-348

Structures are more conserved Structure information is important for alignment (and therefore gene finding) CGA GCU CAA GUU

Features of RNA RNA typically produced as a single stranded molecule (unlike DNA) Strand folds upon itself to form base pairs & secondary structures Structure conservation is important RNA sequence analysis is different from DNA sequence

Canonical base pairing H N H O N N H N N O H N N N H O H N H N N H N N O N N Watson-Crick base pairing Non-Watson-Crick base pairing G/U (Wobble)

trna structure

RNA secondary structure Pseudoknot Stem Single-Stranded Interior Loop Bulge Loop Junction (Multiloop) Hairpin loop

Complex folds

Pseudoknots i j i i i j j j?

RNA secondary structure 2D Circle plot Dot plot Mountain Parentheses Tree model representation ((( )))..((.))

Main approaches to RNA secondary structure prediction Energy minimization dynamic programming approach does not require prior sequence alignment require estimation of energy terms contributing to secondary structure Comparative sequence analysis using sequence alignment to find conserved residues and covariant base pairs. most trusted Simultaneous folding and alignment (structural alignment)

Assumptions in energy minimization approaches Most likely structure similar to energetically most stable structure Energy associated with any position is only influenced by local sequence and structure Neglect pseudoknots

Base-pair maximization Find structure with the most base pairs Only consider A-U and G-C and do not distinguish them Nussinov algorithm (1970s) Too simple to be accurate, but stepping-stone for later algorithms

Nussinov algorithm Problem definition Given sequence X=x 1 x 2 x L, compute a structure that has maximum (weighted) number of base pairings How can we solve this problem? Remember: RNA folds back to itself! S(i,j) is the maximum score when x i..x j folds optimally S(1,L)? S(i,i)? S(i,j) 1 i j L

Grow from substructures (1) (2) (3) (4) 1 i i+1 k j-1 j L

Dynamic programming Compute S(i,j) recursively (dynamic programming) Compares a sequence against itself in a dynamic programming matrix Three steps

Initialization G G G A A A U C C G 0 G 0 0 G 0 0 A 0 0 A 0 0 A 0 0 U 0 0 C 0 0 C 0 0 Example: GGGAAAUCC the main diagonal L: the length of input sequence the diagonal below

Fill up the table (DP matrix) -- diagonal by diagonal Recursion G G G A A A U C C G 0 0 0 0 G 0 0 0 0 0 G 0 0 0 0 0 A 0 0 0 0? i A 0 0 0 1 A 0 0 1 1 U 0 0 0 0 C 0 0 0 C 0 0 j

Traceback G G G A A A U C C G 0 0 0 0 0 0 1 2 3 G 0 0 0 0 0 0 1 2 3 G 0 0 0 0 0 1 2 2 A 0 0 0 0 1 1 1 A 0 0 0 1 1 1 A 0 0 1 1 1 U 0 0 0 0 C 0 0 0 C 0 0 The structure is: What are the other optimal structures?

An exercise Input: AUGACAU Fill up the table Trace back Give the optimal structure What s the size of the hairpin loop A U G A C A U A U G A C A U

Energy minimization methods Nussinov algorithm (base pair maximization) is too simple to be accurate Energy minimization algorithm predicts secondary structure by minimizing the free energy (ΔG) ΔG calculated as sum of individual contributions of: loops stacking

Free energy computation U U +5.9 4nt loop A A G C G C +3.3 1nt bulge A -2.9 stacking G C -1.8 stacking U A A U 5 dangling C G A U -0.3 A 3-0.3 A 5-1.1 mismatch of hairpin -2.9 stacking -0.9 stacking -1.8 stacking -2.1 stacking ΔG = -4.6 KCAL/MOL

Loop parameters (from Mfold) DESTABILIZING ENERGIES BY SIZE OF LOOP! SIZE INTERNAL BULGE HAIRPIN! -------------------------------------------------------! 1. 3.8.! 2. 2.8.! 3. 3.2 5.4! 4 1.1 3.6 5.6! 5 2.1 4.0 5.7! 6 1.9 4.4 5.4!.!.! 12 2.6 5.1 6.7! 13 2.7 5.2 6.8! 14 2.8 5.3 6.9! 15 2.8 5.4 6.9! Unit: Kcal/mol

Stacking energy (from Vienna package) # stack_energies! /* CG GC GU UG AU UA @ */! -2.0-2.9-1.9-1.2-1.7-1.8 0! -2.9-3.4-2.1-1.4-2.1-2.3 0! -1.9-2.1 1.5 -.4-1.0-1.1 0! -1.2-1.4 -.4 -.2 -.5 -.8 0! -1.7 -.2-1.0 -.5 -.9 -.9 0! -1.8-2.3-1.1 -.8 -.9-1.1 0! 0 0 0 0 0 0 0!

Mfold versus Vienna package Mfold http://frontend.bioinfo.rpi.edu/zukerm/download/ http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rnaform1.cgi Suboptimal structures The correct structure is not necessarily structure with optimal free energy Within a certain threshold of the calculated minimum energy Vienna -- calculate the probability of base pairings http://www.tbi.univie.ac.at/rna/

Mfold energy dot plot

Mfold algorithm (Zuker & Stiegler, NAR 1981 9(1):133)

Inferring structure by comparative sequence analysis Need a multiple sequence alignment as input Requires sequences be similar enough (so that they can be initially aligned) Sequences should be dissimilar enough for covarying substitutions to be detected Given an accurate multiple alignment, a large number of sequences, and sufficient sequence diversity, comparative analysis alone is sufficient to produce accurate structure predictions (Gutell RR et al. Curr Opin Struct Biol 2002, 12:301-310)

RNA variations Variations in RNA sequence maintain base-pairing patterns for secondary structures (conserved patterns of base-pairing) When a nucleotide in one base changes, the base it pairs to must also change to maintain the same structure CGA GCU CAA GUU Such variation is referred to as covariation.

If neglect covariation In usual alignment algorithms they are doubly penalized GA UC GA UC GA UC GC GC GA UA

Covariance measurements Mutual information (desirable for large datasets) Most common measurement Used in CM (Covariance Model) for structure prediction Covariance score (better for small datasets)

Mutual information : frequency of a base in column i : joint (pairwise) frequency of a base pair between columns i and j Information ranges from 0 and? bits If i and j are uncorrelated (independent), mutual information is 0

Mutual information plot

Structure prediction using MI S(i,j) = Score at indices i and j; M(i,j) is the mutual information between i and j The goal is to maximize the total mutual information of input RNA The recursion is just like the one in Nussinov Algorithm, just to replace w(i,j) (1 or 0) with the mutual information M(i,j)

RNAalifold Covariance-like score Hofacker et al. JMB 2002, 319:1059-1066 Desirable for small datasets Combination of covariance score and thermodynamics energy

Covariance-like score calculation The score between two columns i and j of an input multiple alignment is computed as following:

Covariance model A formal covariance model, CM, devised by Eddy and Durbin A probabilistic model A Stochastic Context-Free Grammer Generalized HMM model A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus Provides very accurate results Very slow and unsuitable for searching large genomes

CM training algorithm Unaligned sequence alignment Multiple alignment EM Covariance model Parameter re-estimation Modeling construction

Binary tree representation of RNA secondary structure Representation of RNA structure using Binary tree Nodes represent Base pair if two bases are shown Loop if base and gap (dash) are shown Pseudoknots still not represented Tree does not permit varying sequences Mismatches Insertions & Deletions Images Eddy et al.

Overall CM architecture MATP emits pairs of bases: modeling of base pairing BIF allows multiple helices (bifurcation)

Covariance model drawbacks Needs to be well trained (large datasets) Not suitable for searches of large RNA Structural complexity of large RNA cannot be modeled Runtime Memory requirements

ncrna gene finding De novo ncrna gene finding Folding energy Number of sub-optimal RNA structures Homology ncrna gene searching Sequence-based Structure-based Sequence and structure-based

Rfam & Infernal Rfam 9.1 contains 1379 families (December 2008) Rfam 10.0 contains 1446 families (January 2010) Rfam is a collection of multiple sequence alignments and covariance models covering many common non-coding RNA families Infernal searches Rfam covariance models (CMs) in genomes or other DNA sequence databases for homologs to known structural RNA families http://rfam.janelia.org/

An example of Rfam families TPP (a riboswitch; THI element) RF00059 is a riboswitch that directly binds to TPP (active form of VB, thiamin pyrophosphate) to regulate gene expression through a variety of mechanisms in archaea, bacteria and eukaryotes

Simultaneous structure prediction and alignment of ncrnas The grammar emits two correlated sequences, x and y http://www.biomedcentral.com/1471-2105/7/400

References How Do RNA Folding Algorithms Work? Eddy. Nature Biotechnology, 22:1457-1458, 2004 (a short nice review) Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Durbin, Eddy, Krogh and Mitchison. 1998 Chapter 10, pages 260-297 Secondary Structure Prediction for Aligned RNA Sequences. Hofacker et al. JMB, 319:1059-1066, 2002 (RNAalifold; covariance-like score calculation) Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information. Zuker and Stiegler. NAR, 9(1):133-148, 1981 (Mfold) A computational pipeline for high throughput discovery of cisregulatory noncoding RNAs in Bacteria, PLoS CB 3(7):e126 Riboswitches in Eubacteria Sense the Second Messenger Cyclic Di-GMP, Science, 321:411 413, 2008 Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline, Nucl. Acids Res. (2007) 35 (14): 4809-4819. CMfinder a covariance model based RNA motif finding algorithm. Bioinformatics 2006;22:445-452

Understanding the transcriptome through RNA structure 'RNA structurome Genome-wide measurements of RNA structure by high-throughput sequencing Nat Rev Genet. 2011 Aug 18;12(9):641-55