Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Gunnar Rätsch
Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany
NGS Bioinformatics Meeting, Paris (March 24, 2010)
© Gunnar Rätsch (FML, Tübingen), Methods for Transcriptome Analysis, NGS Bioinformatics, Paris

Motivation: Discovery of Nuclein (Friedrich Miescher, 1869)
- Tübingen, around 1869
- Discovery of nuclein: from lymphocytes and salmon; a multi-basic acid (≥ 4)
- "If one ... wants to assume that a single substance ... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein." (Miescher, 1874)

Motivation: Learning about the Transcriptome
What is encoded on the genome and how is it processed?
- Computational point of view: How can we learn to predict what these processes accomplish? How well can we predict it from the available information?
- Biological view: What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes?

Motivation: Machine Learning
Learning from empirical observations
- Given: observations of some complex phenomenon
- Goal: learn from data and build predictive models
- Example: two different classes of observations, from which a classification rule is inferred
1. Large-scale sequence classification
2. Analysis and explanation of learning results
3. Sequence segmentation and structure prediction
[Figures: k-mer length (1 to 8) versus position; log-intensity along a transcript]

Motivation: Computational Transcriptome Prediction (a.k.a. Computational Gene Finding)
[Figure: DNA (genic, intergenic), pre-mRNA (exons, introns), mRNA (cap, 5' UTR, 3' UTR, polyA), protein]
- Given a piece of DNA sequence, predict protein-coding mRNAs
- Less well developed for non-coding RNAs

Motivation: Experimental Characterization of the Transcriptome
Getting closer to reality ...
- DNA microarrays: oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA. [Wikipedia]
- cDNA sequencing
- Experimental methods have their inherent limitations and biases
- Learning methods help to overcome them

Motivation: Key Research Questions
- Characterize an organism's full complement of genes
  - Find new (possibly noncoding) genes
  - Compare genes among organisms
- Characterize transcript isoforms
  - Find new alternative splice forms / transcript ends
- Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress)
  - Identify significant expression changes
- Understand transcriptome regulation
  - Knock-out / knock-down analysis of regulators
  - Identify regulated targets with significant expression changes
  - Identify binding sites used in regulation (e.g. ChIP-on-chip)
  - QTL mapping of transcriptome phenotypes

RNA-Seq Pipeline: Deep RNA Sequencing (RNA-Seq)
RNA-Seq allows ...
- High-throughput transcriptome measurements
- Qualitative studies
- New transcripts
- Altered expression across tissues, conditions, etc.
- Quantitative studies at high resolution
[Figure adapted from Wikipedia: pre-mRNA (exons, introns), mRNA, short reads and junction reads, reference genome]
Motivation: obtain the complete transcriptome for further analyses

RNA-Seq Pipeline: Common RNA-Seq Analysis Steps
[Flow chart with the following elements]
- RNA-Seq reads
- De novo assembly
- Read alignment
- Transcript reconstruction
- TARs and transcripts
- Expression level ~ #reads
- Transcript quantitation
- Significance testing
- Differentially processed transcripts/segments

RNA-Seq Pipeline: Overview
- Read alignment (PALMapper): short reads to alignments
- Transcript finding (mtim segmentation, mgene.ngs): alignments to transcripts
- Quantitation (rquant): transcripts to transcripts with quantitation

PALMapper Read Alignment: Overview
Step 1: PALMapper read alignment (PALMapper = QPALMA + GenomeMapper)
GenomeMapper for (unspliced) read mapping:
- Alignments based on GenomeMapper, developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009]
- k-mer based index, well suited for smaller genomes
- Example: the read TCGCGCGCAACG is cut into the k-mers TCGC, CGCG, GCGC, CGCG, GCGC, CGCA, GCAA, CAAC, AACG, which are matched against the DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT...
QPALMA for spliced read alignments:
- GenomeMapper identifies seed regions
- Spliced alignments by QPALMA [De Bona et al., 2008]
More info: http://.mpg.de/raetsch/suppl/palmapper
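The k-mer seeding idea can be sketched in a few lines. This is only an illustration of a k-mer index lookup, not GenomeMapper's actual data structure; the function names are hypothetical, and the toy genome fills the elided "..." stretches of the slide's DNA string with arbitrary bases.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def seed_hits(read, index, k):
    """Return (read_offset, genome_position) pairs for every matching k-mer."""
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            hits.append((offset, pos))
    return hits

# Toy genome (assumption): the slide's fragments joined with filler bases.
genome = "ACCGTCGCGCGCGTTTTCGGCGAGAACGCT"
read = "TCGCGCGCAACG"
index = build_kmer_index(genome, 4)
hits = seed_hits(read, index, 4)
```

The start of the read seeds at one genomic location and its end (AACG) at another, further downstream. This is the situation in which a spliced aligner such as QPALMA would take over from the seed regions.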

PALMapper Read Alignment: Accuracy Evaluation
How accurately can PALMapper identify introns?
[Figure: intron recall and precision on C. elegans for TopHat, TopHat sensitive, and PALMapper; PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads]
[Figure: intron recall and precision on H. sapiens (chromosome 1); comparison of PALMapper with other alignment programs within the RGASP project (preliminary)]

PALMapper Read Alignment: QPALMA Extended Smith-Waterman Scoring
- Classical scoring: $f : \Sigma \times \Sigma \to \mathbb{R}$
- Quality scoring: $f : (\Sigma \times \mathbb{R}) \times \Sigma \to \mathbb{R}$ [De Bona et al., 2008]
Sources of information:
- Sequence matches
- Computational splice site predictions
- Intron length model
- Read quality information

PALMapper Read Alignment: Scoring Parameter Inference
- What are optimal parameters?
- How do we jointly optimize the 336 parameters?

PALMapper Read Alignment: Learning Algorithm (Cartoon: Maximize the Margin)
[Figure: scores of true and false alignments]
- The correct alignment is not the highest-scoring one
- The correct alignment is the highest-scoring one
- Can we do better?
- Use a technique motivated by SVMs ("large margin")
- Idea: enforce a margin between correct and incorrect examples
- One has to solve a big quadratic problem

PALMapper Read Alignment: How Can We Generate Data for Training?
How do we obtain true alignments for training QPALMA?
Strategy: simulate realistic transcriptome reads with known origin
1. Estimate the relationship between quality score and error probability from the given reads
2. Use the annotation of a few genes to simulate spliced reads
3. Introduce errors according to the error model, using quality strings from the given read set
4. Train QPALMA on the generated read set with known alignments
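Steps 2 and 3 of this strategy can be sketched as follows. The helper names are hypothetical, and the standard Phred relation between quality and error probability is used as a stand-in for the relationship that step 1 would estimate from the data.

```python
import random

def phred_to_error_prob(q):
    """Assumed error model: standard Phred relation p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def splice_read(genome, exon1_end, exon2_start, left_len, right_len):
    """Join the last left_len bases of exon 1 with the first right_len bases
    of exon 2 (0-based coordinates, exon1_end exclusive)."""
    left = genome[exon1_end - left_len:exon1_end]
    right = genome[exon2_start:exon2_start + right_len]
    return left + right

def add_errors(read, qualities, rng):
    """Substitute each base with the probability implied by its quality score."""
    noisy = []
    for base, q in zip(read, qualities):
        if rng.random() < phred_to_error_prob(q):
            base = rng.choice([b for b in "ACGT" if b != base])
        noisy.append(base)
    return "".join(noisy)

# Toy genome (assumption): exon 1 ends at position 8, exon 2 starts at 12.
clean = splice_read("AAAACCCCGGGGTTTT", 8, 12, 3, 3)
noisy = add_errors(clean, [30] * len(clean), random.Random(0))
```

Because the splice position of every simulated read is known, the pair (noisy read, true alignment) can serve directly as a training example.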

PALMapper Read Alignment: Reasons for High Accuracy (QPALMA RNA-Seq Read Alignment)
Generate a set of artificially spliced reads:
- Genomic reads with quality information
- Genome annotation for artificially splicing the reads
- Use 10,000 reads for training and 30,000 for testing
Alignment error rate [De Bona et al., 2008]:
  SmithW                       14.19%
  Intron                        9.96%
  Intron + Splice               1.94%
  Intron + Splice + Quality     1.78%
[Figure: error vs. intron position]

Step 2: Transcript Prediction
Pipeline: read alignment (PALMapper), transcript finding (mtim segmentation, mgene.ngs), quantitation (rquant)
A. Coverage segmentation algorithm mtim for general transcripts (no coding bias/assumption)
B. Extension of the mgene gene finding system to use NGS data for protein-coding transcript prediction (mgene.ngs)

mtim Coverage Segmentation: Read Coverage Segmentation
Goal: characterize each base as intergenic, exonic, or intronic
[Figure (idea): read coverage (log-scale) over 10 Kbp, shown together with the annotated gene]

mtim Coverage Segmentation: The mtim Segmentation Approach
- States: S (intergenic), E (exonic), I (intronic), with donor (don) and acceptor (acc) splice sites on the transitions between E and I
- Learn to associate a state with each position given its read coverage and local context
- HM-SVM training: optimize the transformations from signal to score
- Extension: score spliced reads and splice sites
[Figure: feature score as a function of read coverage, one panel per state]
(G. Zeller et al., 2008; G. Zeller et al., in prep., 2009)

mtim Coverage Segmentation: The mtim Segmentation Approach
Idea: assume uniform read coverage within exons of the same transcript
[Figure: read coverage (log-scale) over 25 Kbp, shown together with the annotated genes]

mtim Coverage Segmentation: The mtim Segmentation Approach
- State model with Q discrete expression levels: S (intergenic), plus an exonic state E_q and an intronic state I_q for each level q = 1, ..., Q, connected through donor (don) and acceptor (acc) sites
- Carry expression level information between exons of the same transcript
(G. Zeller et al., 2008; G. Zeller et al., in prep., 2010)

mtim Coverage Segmentation: Discriminative Training of HM-SVMs
- The predictor $f : \mathbb{R}^* \to \Sigma^*$, given a sequence of hybridization measurements $\chi \in \mathbb{R}^*$, predicts a state sequence (path) $\sigma \in \Sigma^*$.
- Discriminant function $F_\theta : \mathbb{R}^* \times \Sigma^* \to \mathbb{R}$ such that, for decoding, $f(\chi) = \operatorname{argmax}_{\sigma \in \Sigma^*} F_\theta(\chi, \sigma)$.
- Training: for each training example $(\chi^{(i)}, \sigma^{(i)})$, enforce a large margin of separation $F_\theta(\chi^{(i)}, \sigma^{(i)}) - F_\theta(\chi^{(i)}, \sigma) \ge \rho$ between the correct path $\sigma^{(i)}$ and any other wrong path $\sigma \ne \sigma^{(i)}$.
- A quadratic programming problem (QP) is solved to optimize $\theta$. [???]

mtim Coverage Segmentation: Preliminary Evaluation (C. elegans)
[Figure: CDS (precision+recall)/2 of mtim, plotted against expression percentiles [%]]
- Sensitivity heavily depends on read density

Next Generation Gene Finding: Computational Gene Finding (Labeling the Genome)
[Figure: DNA, pre-mRNA, mRNA (cap, polyA), and protein, annotated with the signals TSS, TIS, splice donor (Don), splice acceptor (Acc), Stop, and polyA/cleavage site (cleave)]

Next Generation Gene Finding: mgene-based Transcript Prediction (Idea)
- Step 1: SVM signal predictions (tss, tis, acc, don, stop) along the genomic position
- Step 2: Integration: transform features into a score F(x, y), trained so that the true gene model is separated from wrong gene models by a large margin

Next Generation Gene Finding: Modeling Uncertainty (mgene-based Transcript Prediction)
- Step 1: SVM signal predictions (tss, tis, acc, don, stop), complemented by expression evidence
- Step 2: Integration: transform SVM outputs with PLiFs (piecewise-linear functions) into a score; large-margin training separates the true gene model from wrong gene models

Next Generation Gene Finding: Learning to Use Expression Measurements
Two approaches:
- Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions
- Directly use evidence during learning, so that the method learns how to weight its importance appropriately

                          Exon level            Transcript level
                          SN     SP     F       SN     SP     F
  ab initio               82.3   82.6   82.5    43.1   49.5   46.1
  ESTs heuristic          85.3   84.7   85.0    49.5   56.4   52.7
  ESTs trained            84.8   85.8   85.3    50.5   57.8   53.9
  RNA-Seq trained         84.6   84.9   84.8    49.1   55.2   52.0
  RNA-Seq/ESTs trained    84.7   86.9   85.8    50.3   60.5   54.9

Gene prediction in C. elegans (CDS evaluation); Behr et al., in prep., 2010
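The slide does not define the F column. It appears consistent with the harmonic mean of SN and SP, which is an assumption that the short check below tests against the reported numbers (small deviations are expected because SN and SP are themselves rounded).

```python
def f_measure(sn, sp):
    """Harmonic mean of SN and SP (assumed definition of the F column)."""
    return 2 * sn * sp / (sn + sp)

# (SN, SP, F) triples copied from the table: exon level, then transcript level.
table = {
    "ab initio":            [(82.3, 82.6, 82.5), (43.1, 49.5, 46.1)],
    "ESTs heuristic":       [(85.3, 84.7, 85.0), (49.5, 56.4, 52.7)],
    "ESTs trained":         [(84.8, 85.8, 85.3), (50.5, 57.8, 53.9)],
    "RNA-Seq trained":      [(84.6, 84.9, 84.8), (49.1, 55.2, 52.0)],
    "RNA-Seq/ESTs trained": [(84.7, 86.9, 85.8), (50.3, 60.5, 54.9)],
}

# Largest gap between the recomputed and the reported F value.
max_gap = max(
    abs(f_measure(sn, sp) - f)
    for rows in table.values()
    for sn, sp, f in rows
)
```

Under this reading, the transcript-level gain of the combined RNA-Seq/ESTs model over the ab initio model comes mostly from SP (49.5 to 60.5), with a smaller gain in SN (43.1 to 50.3).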

Next Generation Gene Finding: Preliminary Evaluation (C. elegans)
[Figure: CDS (precision+recall)/2 of mtim, mgene ab initio, and mgene.ngs, plotted against expression percentiles [%]]

Next Generation Gene Finding: Alternative Transcripts
mtim and mgene.ngs predict single transcripts:
- mtim exploits the uniformity of read coverage among exons of the same transcript
- mgene.ngs uses more assumptions on the structure of transcripts
Alternative transcripts: spliced reads for splicing graph completion
- Transcript prediction (result of mgene/mtim)
- Spliced reads (result of QPALMA)
- Inferred edges
- Paths through the splicing graph define alternative transcripts
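The last point can be made concrete with a small sketch: exons are nodes, predicted and inferred splice junctions are edges, and every path from the first to the last exon is a candidate transcript. The exon names and the function are hypothetical, and the graph is assumed to be acyclic.

```python
def enumerate_transcripts(edges, source, sink):
    """List all source-to-sink paths in an acyclic splicing graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    paths = []

    def walk(node, path):
        if node == sink:
            paths.append(path)
            return
        for nxt in graph.get(node, []):
            walk(nxt, path + [nxt])

    walk(source, [source])
    return paths

# Edges of a single predicted transcript (assumed toy example) ...
predicted = [("exon1", "exon2"), ("exon2", "exon3")]
# ... plus one edge inferred from a spliced read that skips exon2.
inferred = [("exon1", "exon3")]
transcripts = enumerate_transcripts(predicted + inferred, "exon1", "exon3")
```

A single inferred edge already doubles the number of candidate transcripts here, which is why a quantitation step is needed afterwards to decide how much each path contributes.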

Transcript Quantitation with rquant: RNA-Seq Pipeline Overview
Pipeline: read alignment (PALMapper), transcript finding (mtim segmentation, mgene.ngs), quantitation (rquant)

Transcript Quantitation with rquant: RNA-Seq Biases and Quantitation
[Figure: from transcripts (mRNA) via fragmentation and reverse transcription (RT) to the sequence library, sequencer, and short sequence reads; aligned reads on the genome comprise exonic reads and junction reads]
Biases due to ...
- cDNA library construction
- Sequencing
- Read mapping
[Figure: read coverage versus relative transcript position (5' -> 3'), averaged over annotated transcripts of length 1 kb for the C. elegans SRX001872 dataset]

Transcript Quantitation with rquant: Basic Idea
- Expected read coverage along the genome position (5' -> 3') for a short transcript (A) and a long transcript (B)
- Mixture of transcripts: expected coverage $M_i = w_A A_i + w_B B_i$ at position $i$
- Fit the expected coverage $M$ to the observed coverage $R$: $\min_{w_A, w_B} \sum_i \ell(M_i, R_i)$
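For two transcripts and a quadratic loss, the minimisation has a closed form via the normal equations. The quadratic loss is an assumption (the slide leaves the loss unspecified), and the function name and toy profiles are illustrative only.

```python
def fit_two_transcripts(A, B, R):
    """Least-squares weights (w_A, w_B) minimising sum_i (w_A*A_i + w_B*B_i - R_i)^2."""
    aa = sum(a * a for a in A)
    bb = sum(b * b for b in B)
    ab = sum(a * b for a, b in zip(A, B))
    ar = sum(a * r for a, r in zip(A, R))
    br = sum(b * r for b, r in zip(B, R))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("profiles are collinear; weights are not identifiable")
    w_a = (ar * bb - br * ab) / det
    w_b = (br * aa - ar * ab) / det
    return w_a, w_b

# Toy profiles (assumption): both transcripts share the first two positions,
# A alone covers the next two, B alone covers the last two.
A = [1, 1, 1, 1, 0, 0]
B = [1, 1, 0, 0, 1, 1]
R = [5, 5, 3, 3, 2, 2]  # observed coverage, here exactly 3*A + 2*B
w_a, w_b = fit_two_transcripts(A, B, R)
```

The positions covered by only one transcript are what make the weights identifiable; if A and B were identical, the determinant would vanish and any split of the total coverage would fit equally well.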

Transcript Quantitation with rquant: Iterative Algorithm
1. Optimise transcript weights: $\min_{w} \sum_i \ell\big(\sum_t w^{(t)} p_i^{(t)}, R_i\big)$
2. Optimise profile weights: $\min_{p} \sum_i \ell\big(\sum_t w^{(t)} p_i^{(t)}, R_i\big)$
3. Repeat 1 and 2 until convergence.
[Figure: gene AT1G01240 (chromosome 1, forward strand) with transcripts AT1G01240.1, AT1G01240.2, and AT1G01240.3; read counts over genome positions 99,960 to 101,592 with the fitted transcript weights; profile weight versus distance to transcript start/end]
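The alternation between the two steps can be illustrated on a deliberately simplified model: each transcript is treated on its own (no isoform mixtures), coverage is binned by relative position, all transcripts share one profile, and the loss is quadratic without regularisation. All of these are assumptions made for the sketch; rquant itself is more general.

```python
def alternate_fit(R, iterations=20):
    """Alternating least squares for R[t][b] ~ w[t] * p[b].
    R: coverage per transcript t and position bin b (not all zero)."""
    n_t, n_b = len(R), len(R[0])
    p = [1.0] * n_b
    w = [0.0] * n_t
    for _ in range(iterations):
        # Step 1: fix the profile, solve for each transcript weight.
        pp = sum(x * x for x in p)
        w = [sum(p[b] * R[t][b] for b in range(n_b)) / pp for t in range(n_t)]
        # Step 2: fix the weights, solve for each profile value.
        ww = sum(x * x for x in w)
        p = [sum(w[t] * R[t][b] for t in range(n_t)) / ww for b in range(n_b)]
        # Remove the scale ambiguity: keep the mean profile value at 1.
        mean_p = sum(p) / n_b
        p = [x / mean_p for x in p]
        w = [x * mean_p for x in w]
    return w, p

# Toy data (assumption): two transcripts with weights 2 and 5 that share a
# profile rising from 5' to 3'.
R = [[1.0, 2.0, 3.0], [2.5, 5.0, 7.5]]
w, p = alternate_fit(R)
```

Each step is an ordinary least-squares problem once the other set of variables is fixed, which is what makes the alternation attractive. The normalisation is needed because scaling the profile up and the weights down leaves the fit unchanged.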

Transcript Quantitation with rquant: Evaluation I
rquant (position-wise, with profiles, estimating library and mapping bias) compared to:
- Position-wise, without profiles
- Segment-wise, without profiles (e.g., Jiang and Wong [2009])
- Segment-wise, with profiles (e.g., Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances:
- Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
- Subset of alternatively spliced genes
Evaluation: Spearman correlation between simulated RNA expression level and predicted transcript weights
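For reference, the evaluation metric can be computed as the Pearson correlation of ranks. The sketch below is a plain-Python version with average ranks for ties; the example values are made up and are not results from the talk.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical simulated expression levels and predicted transcript weights.
simulated = [1.0, 4.0, 9.0, 16.0]
predicted = [0.8, 2.1, 2.0, 7.5]
rho = spearman(simulated, predicted)
```

Because only ranks enter, the measure rewards getting the ordering of transcript abundances right and is insensitive to any monotone rescaling of the predicted weights.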

Transcript Quantitation with rquant: Evaluation II
[Figure: Spearman correlation, across transcripts and within genes (mean), for four methods: segment-wise w/o profiles; segment-wise w/ profiles; position-wise w/o profiles; rquant (position-wise w/ profiles)]
(Bohnert et al., submitted, 2010)

Galaxy-based Web Services for NGS Analyses
- Galaxy-based web service: http://galaxy..mpg.de
- PALMapper: http://.mpg.de/raetsch/suppl/palmapper
- mgene: http://mgene.org/web
- mtim: http://.mpg.de/raetsch/suppl/mtim (in prep.)
- rquant: http://.mpg.de/raetsch/suppl/rquant/web
(Rätsch et al., in preparation, 2010)

Summary
- PALMapper
  - Splice site predictions improve alignment performance
  - Outperforms many other read mappers in intron accuracy
- mtim
  - High specificity; sensitivity depends on read coverage
  - Better for identifying transcripts specific to experimental data
- mgene
  - High sensitivity (also for lowly expressed genes)
  - Also identifies non-expressed genes, which is good for annotation
- rquant
  - Models library preparation, sequencing, and alignment biases
  - Accurately quantifies transcripts

Acknowledgements
- Fabio De Bona: alignments
- Jonas Behr: gene finding
- Georg Zeller: segmentation
- Regina Bohnert: quantitation
RGASP team: Jonas Behr (FML), Georg Zeller (FML & MPI), Regina Bohnert (FML)
Funding by DFG & Max Planck Society.
Thank you for your attention!