Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Similar documents
Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Next Generation Genome Annotation with mgene.ngs

Fully Automated Genome Annotation with Deep RNA Sequencing

RNA-Seq Blog Poll Results

oqtans A Galaxy-Integrated Workflow for Quantitative Transcriptome Analysis from NGS Data

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Methods for Transcriptome Analysis with Tiling Arrays and mrna-seq

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch

Regina Bohnert. Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany. July 18, 2008

RNA-Sequencing analysis

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Mapping strategies for sequence reads

RNA-SEQUENCING ANALYSIS

measuring gene expression December 5, 2017

Gene Expression Technology

Analysis of RNA-seq Data

RNA-Seq with the Tuxedo Suite

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

measuring gene expression December 11, 2018

Machine Learning. HMM applications in computational biology

Motivation From Protein to Gene

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Gene Prediction 10/21/05

Bi 8 Lecture 5. Ellen Rothenberg 19 January 2016

Systematic evaluation of spliced alignment programs for RNA- seq data

Genome annotation & EST

Applications of short-read

Eucalyptus gene assembly

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

Intro to RNA-seq. July 13, 2015

Wheat Genome Structural Annotation Using a Modular and Evidence-combined Annotation Pipeline

How to deal with your RNA-seq data?

CSE 549: RNA-Seq aided gene finding

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq


Problem formulations. Wikipedia :: Multidisciplinary design optimization

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Transcriptome analysis

MODULE 5: TRANSLATION

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

Microarrays & Gene Expression Analysis

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

02 Agenda Item 03 Agenda Item

Single-Cell Whole Transcriptome Profiling With the SOLiD. System

MITIE: Simultaneous RNA-Seq-based Transcript Identification and Quantification in Multiple Samples

Finding Genes with Genomics Technologies

RNA-Seq data analysis course September 7-9, 2015

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

De novo assembly in RNA-seq analysis.

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Measuring transcriptomes with RNA-Seq

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

RNA standards v May

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

1. Introduction Gene regulation Genomics and genome analyses

Chapter 1. from genomics to proteomics Ⅱ

High-throughput Transcriptome analysis

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis

review Expression Microarrays Tiling genomic microarrays Sequencing methods Riassunto puntate precedenti RNA transcripts

Introduction of RNA-Seq Analysis

RNA-Seq Analysis. Simon Andrews, Laura v

Methods and tools for exploring functional genomics data

Genome 373: Gene Predic/on I. Doug Fowler

The Iso-Seq Method: Transcriptome Sequencing Using Long Reads

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

How to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling

Genomic resources. for non-model systems

Identification of individual motifs on the genome scale. Some slides are from Mayukh Bhaowal

Introduction to RNA sequencing

RNA-Seq Software, Tools, and Workflows

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Applied Bioinformatics - Lecture 16: Transcriptomics

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016

NEXT GENERATION SEQUENCING. Farhat Habib

Measuring and Understanding Gene Expression

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs

An introduction to RNA-seq. Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy

Performance comparison of five RNA-seq alignment tools

Expressed genes profiling (Microarrays) Overview Of Gene Expression Control Profiling Of Expressed Genes

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

RNA

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies

Transcription:

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 1 / 43

Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 43

Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 43

Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Two different classes of observations c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Inferred classification rule c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models 1 Large scale sequence classification 2 Analysis and explanation of learning results 3 Sequence segmentation & structure prediction k mer Length 8 7 6 5 4 3 2 1 30 20 10 0 10 20 30 Position Log-intensity 10 5 0 transcript c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

Motivation Computational Transcriptome Prediction (a.k.a. Computational Gene Finding) DNA genic intergenic pre-mrna exon intron exon intron exon mrna cap 5' UTR 3' UTR polya Protein Given a piece of DNA sequence Predict protein-coding mrnas Less well developed for non-coding RNAs c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 43

Motivation Computational Transcriptome Prediction (a.k.a. Computational Gene Finding) DNA genic intergenic pre-mrna exon intron exon intron exon mrna cap 5' UTR 3' UTR polya Protein Given a piece of DNA sequence Predict protein-coding mrnas Less well developed for non-coding RNAs c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 43

Motivation Experimental Characterization of the Transcriptome Getting closer to reality... DNA Microarrays cdna Sequencing Oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA. [Wikipedia] Experimental methods have there inherent limitations and biases Learning methods help to overcome them c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 43

Motivation Experimental Characterization of the Transcriptome Getting closer to reality... DNA Microarrays cdna Sequencing Oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA. [Wikipedia] Experimental methods have there inherent limitations and biases Learning methods help to overcome them c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 43

Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads Read Alignment Expression level ~ #reads Significance Testing Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment Transcript Reconstruction TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 43

RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 43

PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes More info: http://.mpg.de/raetsch/suppl/palmapper c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes QPALMA for spliced read alignments: GenomeMapper identifies seed regions Spliced alignments by QPALMA [De Bona et al., 2008] DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT TCGCGCGCAACG Read More info: http://.mpg.de/raetsch/suppl/palmapper c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? 1 0.9 C. elegans recall precision 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 43

PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? 1 0.9 C. elegans recall precision 1 0.9 H. sapiens (chromosome 1) recall precision 0.8 0.7 0.6 0.5 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads 0.4 0.3 0.2 0.1 Palmapper Comparison of PALMapper with other alignment programs within the RGASP project (preliminary) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 43

PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Source of information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : (Σ R) Σ R [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 43

PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 43

PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true false Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Use technique motivated by SVMs ( large-margin ) Idea: Enforce a margin between correct and incorrect examples One has to solve a big quadratic problem c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 43

PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 43

PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position 8nt 12nt 12nt 8nt Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

PALMapper Read Alignment Reasons for High Accuracy Step 2: Transcript Prediction Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts A. Coverage segmentation algorithm mtim for general transcripts (no coding bias/assumption) B. Extension of the mgene gene finding system to use NGS data for protein coding transcript prediction (mgene.ngs) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 18 / 43

mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) 100 10 0 0 5 10 Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) 100 10 0 0 5 10 Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic annotated gene Idea read coverage (log-scale) 100 10 0 0 5 10 Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic feature score 1 0.5 0 0.4 0.2 0 1 0.8 0.6 0.4 0.2 0 5 6 7 8 9 10 11 12 13 14 read coverage 5 6 7 8 9 10 11 12 13 14 read coverage 5 6 7 8 9 10 11 12 13 14 read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

mtim Coverage Segmentation Approach The mtim Segmentation Approach don S E I acc intergenic exonic intronic feature score 1 0.5 0 0.4 0.2 0 1 0.8 0.6 0.4 0.2 0 5 6 7 8 9 10 11 12 13 14 read coverage 5 6 7 8 9 10 11 12 13 14 read coverage 5 6 7 8 9 10 11 12 13 14 read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

mtim Coverage Segmentation Approach The mtim Segmentation Approach Idea: Assume uniform read coverage within exons of same transcript annotated genes read coverage (log-scale) 100 10 0 0 5 10 15 20 25 Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 43

............ mtim Coverage Segmentation Approach The mtim Segmentation Approach don E 1 I 1 acc Discrete expression level 1 E 2 don I 2 2 S acc don E Q I Q acc intergenic exonic intronic Q Carry expression level information between exons of same transcript (G. Zeller et al., 2008; G. Zeller et al., in prep., 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 43

mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 70 80 90 100 expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 43

mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 70 80 90 100 expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 43

Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS polya/cleavage pre-mrna Splice Donor Splice Acceptor Splice Splice Donor Acceptor mrna TIS Stop cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein TSS TIS Stop cleave Don Acc c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model 2 3 4 5 STEP 1: SVM Signal Predictions tss tis acc don stop genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model 2 3 4 5 STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model 2 3 4 5 Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio 82.3 82.6 82.5 43.1 49.5 46.1 ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7 ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9 Gene prediction in C. elegans (CDS evaluation) Behr et al., in pre., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 43

Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction genomic position True gene model 2 3 4 5 Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 43

Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction position True gene model 2 3 4 5 Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop Expression Evidence STEP 2: Integration transform SVM outputs with PLiF large margin score c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 43

Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio 82.3 82.6 82.5 43.1 49.5 46.1 ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7 ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9 RNA-Seq trained 84.6 84.9 84.8 49.1 55.2 52.0 RNA-Seq/ESTs trained 84.7 86.9 85.8 50.3 60.5 54.9 Gene prediction in C. elegans (CDS evaluation) Behr et al., in prep., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 29 / 43

Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 0.5 mgene ab initio mgene.ngs 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 70 80 90 100 expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 43

Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 0.5 mtim mgene ab initio mgene.ngs 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 70 80 90 100 expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 43

Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

Transcript Quantitation with rquant RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 43

AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage 80 60 40 20 0 Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX001872 dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 43

AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage 80 60 40 20 0 Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX001872 dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 43

Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 A M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 Long transcript read coverage relative transcript position 5 -> 3 M i = w A A i + w B B i min wa,w B i l (M i, R i ) A B c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage genome position 5 -> 3 Long transcript read coverage genome position 5 -> 3 A B M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage read coverage genome position 5 -> 3 genome position 5 -> 3 A B M M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage expected observed read coverage genome position 5 -> 3 genome position 5 -> 3 A B M R M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand 99,960 100,368 100,776 101,184 101,592 70 60 50 read counts 40 30 20 10 0 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand 27.20 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 70 60 50 read counts 40 30 20 27.20 10 0 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G01240.2 436 + 190 1,178 4.80 27.20 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 70 60 50 read counts 40 30 20 27.20 10 0 4.80 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G01240.3 + 1,941 AT1G01240.2 436 + 190 1,178 6.00 4.80 27.20 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 70 60 50 read counts 40 30 20 27.20 10 0 6.00 4.80 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G01240.3 + 1,941 AT1G01240.2 436 + 190 1,178 6.00 4.80 27.20 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 70 60 50 read counts 40 30 20 38.00 27.20 10 0 6.00 4.80 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G01240.3 + 1,941 AT1G01240.2 436 + 190 1,178 6.00 4.80 27.20 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 1.8 1.6 70 60 50 profile weight 1.4 1.2 1 0.8 0.6 read counts 40 30 20 38.00 27.20 0.4 0.2 0 50 200 450 800 1250 1800 2450 2450 1800 1250 800 450 200 50 0 distance to transcript start/end 10 6.00 4.80 0 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G01240.3 + 1,941 AT1G01240.2 436 + 190 1,178 4.95 8.35 23.99 625 AT1G01240.1 + 110 1,178 99,960 100,368 100,776 101,184 101,592 1.8 1.6 70 60 50 profile weight 1.4 1.2 1 0.8 0.6 read counts 40 30 37.29 0.4 0.2 0 50 200 450 800 1250 1800 2450 2450 1800 1250 800 450 200 50 0 20 23.99 distance to transcript start/end 10 8.35 4.95 0 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation 0.6 0.5 0.4 0.3 0.2 0.1 0 Segment-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation 0.6 0.5 0.4 0.3 0.2 0.1 0 Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation 0.6 0.5 0.4 0.3 0.2 0.1 0 Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles rquant Position-wise, w/ profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

Transcript Quantitation with rquant Results Galaxy-based Web Services for NGS Analyses Galaxy-based web service http://galaxy..mpg.de PALMapper http://.mpg.de/raetsch/suppl/palmapper mgene http://mgene.org/web mtim http://.mpg.de/raetsch/suppl/mtim (in prep.) rquant http://.mpg.de/raetsch/suppl/rquant/web (Rätsch et al., in preparation, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 38 / 43

Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 40 / 43

Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 40 / 43