Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction


Gunnar Rätsch, Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany
NGS Bioinformatics Meeting, Paris (March 24, 2010)
© Gunnar Rätsch (FML, Tübingen), Methods for Transcriptome Analysis, NGS Bioinformatics, Paris

Motivation: Discovery of Nuclein (Friedrich Miescher, 1869)
Tübingen, around 1869. Discovery of nuclein: isolated from lymphocytes and salmon; a multi-basic acid (≥ 4).
"If one ... wants to assume that a single substance ... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein" (Miescher, 1874)

Motivation: Learning about the Transcriptome
What is encoded on the genome and how is it processed?
Computational point of view:
- How to learn to predict what these processes accomplish?
- How well can we predict it from the available information?
Biological view:
- What can we not predict yet? What is missing?
- Can we derive a deeper understanding of these processes?

Motivation: Machine Learning
Learning from empirical observations.
Given: observations of some complex phenomenon.
Goal: learn from data and build predictive models.
Example: two different classes of observations, followed by the inferred classification rule. [Figures not preserved]
1. Large-scale sequence classification
2. Analysis and explanation of learning results
3. Sequence segmentation and structure prediction
[Figure axis labels: k-mer length, position, log-intensity, transcript]

RNA-Seq Pipeline: Deep RNA Sequencing (RNA-Seq)
RNA-Seq allows...
- High-throughput transcriptome measurements
- Qualitative studies: new transcripts, improved gene models
- Quantitative studies at high resolution: differential expression in tissues, conditions, genotypes, etc.
[Figure adapted from Wikipedia; labels: pre-mRNA, exon, intron, mRNA, short reads, junction reads, reference genome]
Goal: Obtain the complete transcriptome for further analyses.

RNA-Seq Pipeline: Common RNA-Seq Analysis Steps
[Flowchart; node labels: RNA-Seq Reads; De novo Assembly; Read Alignment; Transcript Reconstruction; TARs and Transcripts; Expression level ~ #reads; Transcript Quantitation; Significance Testing; Differentially Processed Transcripts/Segments]

RNA-Seq Pipeline: Overview
[Diagram: short reads -> read alignment (PALMapper) -> alignments -> transcript finding (mTIM segmentation, mGene.ngs) -> transcripts -> quantitation (rQuant) -> transcripts with quantitation]

PALMapper Read Alignment: Overview
Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper)
GenomeMapper for (unspliced) read mapping:
- Alignments based on GenomeMapper, developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009]
- k-mer based index, well suited for smaller genomes
QPALMA for spliced read alignments:
- GenomeMapper identifies seed regions
- Spliced alignments by QPALMA [De Bona et al., 2008]
[Figure: the read TCGCGCGCAACG is split into k-mers (TCGC, CGCG, GCGC, CGCG, GCGC, CGCA, GCAA, CAAC, AACG), which are matched against the DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT...]
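The k-mer seeding idea on this slide can be sketched in a few lines. This is only an illustration of the general technique, not GenomeMapper's actual index or code: the function names, the toy genome, and the voting scheme (counting matching k-mers per candidate start position) are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def seed_hits(read, index, k):
    """Count, per candidate genome start position, how many read k-mers agree with it."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)

# Toy data (hypothetical): the read occurs unspliced at genome position 4.
genome = "ACCGTCGCGCGCGT"
read = "TCGCGCGC"
index = build_kmer_index(genome, 4)
hits = seed_hits(read, index, 4)
best_start = max(hits, key=hits.get)
```

Repetitive sequence (here the CG repeat) produces several candidate positions; the position supported by the most k-mers is taken as the seed region, which a spliced aligner could then extend.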

PALMapper Read Alignment: PALMapper Accuracy Evaluation
How accurately can PALMapper identify introns?
[Plots: recall vs. precision for C. elegans (TopHat, TopHat sensitive, PALMapper) and for H. sapiens chromosome 1 (PALMapper)]
- C. elegans: PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads
- H. sapiens (chromosome 1): comparison of PALMapper with other alignment programs within the RGASP project (preliminary)

PALMapper Read Alignment: QPALMA Extended Smith-Waterman Scoring
Sources of information:
- Sequence matches
- Computational splice site predictions
- Intron length model
- Read quality information
Classical scoring: $f : \Sigma \times \Sigma \to \mathbb{R}$
Quality scoring: $f : (\Sigma \times \mathbb{R}) \times \Sigma \to \mathbb{R}$ [De Bona et al., 2008]
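To make the two signatures concrete, here is a minimal sketch contrasting a classical substitution score with a score that also takes the read's base quality. The linear down-weighting by quality is purely an illustrative assumption; QPALMA learns its scoring parameters, and the names `classical_score`, `quality_score`, and `max_quality` are hypothetical.

```python
def classical_score(read_base, dna_base, match=1.0, mismatch=-1.0):
    """Classical scoring f: Sigma x Sigma -> R (quality is ignored)."""
    return match if read_base == dna_base else mismatch

def quality_score(read_base, quality, dna_base, max_quality=40.0):
    """Quality scoring f: (Sigma x R) x Sigma -> R.

    Illustrative assumption: the classical score is scaled by the base's
    confidence, so a low-quality mismatch is penalised less than a
    high-quality one.
    """
    confidence = min(max(quality, 0.0), max_quality) / max_quality
    return confidence * classical_score(read_base, dna_base)

high_quality_mismatch = quality_score("A", 40, "C")
low_quality_mismatch = quality_score("A", 20, "C")
```

The point of the extra argument is that a mismatch at an unreliable read position carries less evidence against an alignment than a mismatch at a reliable one.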

PALMapper Read Alignment: Scoring Parameter Inference
What are optimal parameters?
How do we jointly optimize the 336 parameters?

PALMapper Read Alignment: Learning Algorithm. Cartoon: Maximize the Margin
[Cartoon with true and false alignments ordered by score: in one case the correct alignment is not the highest-scoring one, in the other it is]
Can we do better?
- Technique motivated by SVMs ("large-margin")
- Enforce a margin between correct and incorrect examples
- One has to solve a big quadratic problem
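The margin in the cartoon can be written down as a small computation: the score of the correct alignment minus the score of the best incorrect one. The sketch below assumes that alignment scores are already available as numbers; the function names and the target margin `rho` are illustrative and not taken from the slides.

```python
def margin(correct_score, wrong_scores):
    """Gap between the correct alignment and the best-scoring wrong alignment."""
    return correct_score - max(wrong_scores)

def hinge_loss(correct_score, wrong_scores, rho=1.0):
    """Zero if the correct alignment wins by at least rho, otherwise the shortfall."""
    return max(0.0, rho - margin(correct_score, wrong_scores))

loss_small_gap = hinge_loss(3.0, [1.0, 2.5])  # correct wins, but only by 0.5
loss_wrong_top = hinge_loss(1.0, [2.0])       # a wrong alignment scores highest
```

A positive loss in the first case shows why "correct alignment is the highest-scoring one" is not yet enough: large-margin training also asks for a safety gap.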

PALMapper Read Alignment: How Can We Generate Data for Training?
How do we obtain true alignments for training QPALMA?
Strategy: simulate realistic transcriptome reads with known origin.
1. Estimate the relationship between quality score and error probability from the given reads
2. Use the annotation of a few genes to simulate spliced reads
3. Introduce errors according to the error model, using quality strings from the given read set
4. Train QPALMA on the generated read set with known alignments
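Steps 2 and 3 of this strategy can be sketched as follows. The slides estimate the quality-to-error relationship empirically from the reads; the sketch substitutes the standard Phred formula as a stand-in, and all names, coordinates, and the toy genome are hypothetical.

```python
import random

def phred_error_probability(quality):
    """Stand-in error model (assumption): Phred definition p = 10^(-Q/10)."""
    return 10 ** (-quality / 10)

def spliced_read(genome, exons, start, length):
    """Cut a read out of the spliced transcript defined by half-open exon intervals."""
    transcript = "".join(genome[s:e] for s, e in exons)
    return transcript[start:start + length]

def add_errors(read, qualities, rng, error_prob=phred_error_probability):
    """Substitute each base with probability given by its quality value."""
    mutated = []
    for base, quality in zip(read, qualities):
        if rng.random() < error_prob(quality):
            base = rng.choice([b for b in "ACGT" if b != base])
        mutated.append(base)
    return "".join(mutated)

# Toy example: two exons separated by an intron (CCCC); the read spans the junction.
genome = "AAAACCCCGGGGTTTT"
exons = [(0, 4), (8, 12)]
clean_read = spliced_read(genome, exons, start=2, length=4)
rng = random.Random(0)
noisy_read = add_errors(clean_read, [30, 30, 10, 10], rng)
```

Because the exon coordinates are known, the true spliced alignment of every simulated read is known as well, which is what makes such reads usable as training examples.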

PALMapper Read Alignment: Reasons for High Accuracy (QPALMA RNA-Seq Read Alignment)
Generate a set of artificially spliced reads:
- Genomic reads with quality information
- Genome annotation for artificially splicing the reads
- Use 10,000 reads for training and 30,000 for testing
Alignment error rate [De Bona et al., 2008]:
Model                       Alignment error rate
SmithW                      14.19%
Intron                      9.96%
Intron + Splice             1.94%
Intron + Splice + Quality   1.78%
[Plot: error vs. intron position; axis marks at 8nt and 12nt]

Step 2: Transcript Prediction
A. Coverage segmentation algorithm mTIM for general transcripts (no coding bias/assumption)
B. Extension of the mGene gene finding system to use NGS data for protein-coding transcript prediction (mGene.ngs)

mTIM Coverage Segmentation: Read Coverage Segmentation
Goal: Characterize each base as intergenic, exonic, or intronic.
[Figure (idea): read coverage (log-scale) along the genome (Kbp), shown together with an annotated gene]

mTIM Coverage Segmentation: The mTIM Segmentation Approach
[State diagram: S (intergenic), E (exonic), I (intronic), with donor (don) and acceptor (acc) transitions between E and I; each state has a feature-score curve over read coverage]
- Learn to associate a state with each position given its read coverage and local context
- HM-SVM training: optimize the transformations signal -> score
- Extension: score spliced reads and splice sites
(G. Zeller et al., 2008; G. Zeller et al., in prep., 2009)

mTIM Coverage Segmentation: The mTIM Segmentation Approach
Idea: Assume uniform read coverage within exons of the same transcript.
[Figure: annotated genes and read coverage (log-scale) along the genome (Kbp)]

mTIM Coverage Segmentation: The mTIM Segmentation Approach
[State diagram: intergenic state S plus exonic/intronic state pairs (E_1, I_1), (E_2, I_2), ..., (E_Q, I_Q), one pair per discrete expression level 1..Q, linked by donor (don) and acceptor (acc) transitions]
Carry expression level information between exons of the same transcript.
(G. Zeller et al., 2008; G. Zeller et al., in prep., 2010)
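Decoding a state sequence from read coverage can be illustrated with a small Viterbi sketch over the three basic states S (intergenic), E (exonic), and I (intronic). Everything numeric here is a hand-set assumption for illustration: the coverage threshold, the per-state scores, and the transition scores. In the approach described on the slides these transformations are learned by HM-SVM training, and the expression-level states E_q/I_q are omitted here.

```python
NEG_INF = float("-inf")

# Hypothetical transition scores; state pairs that are absent are forbidden.
TRANSITIONS = {
    ("S", "S"): 0.0, ("S", "E"): -3.0,
    ("E", "E"): 0.0, ("E", "I"): 0.0, ("E", "S"): -3.0,
    ("I", "I"): 0.0, ("I", "E"): 0.0,
}

def emission(coverage, threshold=5):
    """Hand-set per-state scores for one position (learned in the real system)."""
    covered = coverage >= threshold
    return {
        "S": -1.0 if covered else 1.0,
        "E": 1.0 if covered else -1.0,
        "I": -1.0 if covered else 0.5,
    }

def viterbi(coverages, start="S", end="S"):
    """Return the highest-scoring state path that starts and ends in the given states."""
    states = ["S", "E", "I"]
    first = emission(coverages[0])
    best = {s: (first[s] if s == start else NEG_INF) for s in states}
    backpointers = []
    for coverage in coverages[1:]:
        scores = emission(coverage)
        new_best, pointers = {}, {}
        for s in states:
            candidates = [(best[p] + TRANSITIONS[(p, s)], p)
                          for p in states if (p, s) in TRANSITIONS]
            score, prev = max(candidates)
            new_best[s] = score + scores[s]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the required end state to recover the best path.
    state = end
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

path = viterbi([0, 0, 9, 8, 0, 0, 9, 9, 0, 0])
```

With these toy scores the short uncovered stretch between two covered blocks is labelled intronic rather than intergenic, because leaving and re-entering a transcript is penalised.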

mTIM Coverage Segmentation: Discriminative Training of HM-SVMs
$f : \mathbb{R}^* \to \Sigma^*$: given a sequence of hybridization measurements $\chi \in \mathbb{R}^*$, predicts a state sequence (path) $\sigma \in \Sigma^*$.
Discriminant function $F_\theta : \mathbb{R}^* \times \Sigma^* \to \mathbb{R}$ such that for decoding:
$f(\chi) = \operatorname{argmax}_{\sigma \in S^*} F_\theta(\chi, \sigma)$
Training: for each training example $(\chi^{(i)}, \sigma^{(i)})$, enforce a large margin of separation
$F_\theta(\chi^{(i)}, \sigma^{(i)}) - F_\theta(\chi^{(i)}, \sigma) \ge \rho$
between the correct path $\sigma^{(i)}$ and any other wrong path $\sigma \ne \sigma^{(i)}$.
A quadratic programming problem (QP) is solved to optimize $\theta$.
[Altun et al., 2003; Rätsch et al., 2007; Zeller et al., 2008b]
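One common way to turn such margin constraints into a QP is the soft-margin form used in the structured-output literature. The slide does not spell out the objective, so the regulariser, the slack variables $\xi_i$, and the constant $C$ below are assumptions; the constraint itself is the one stated on the slide, relaxed by $\xi_i$. The problem is a QP when $F_\theta$ is linear in $\theta$.

```latex
\begin{aligned}
\min_{\theta,\ \xi \ge 0}\quad
  & \tfrac{1}{2}\lVert \theta \rVert^2 + C \sum_{i} \xi_i \\
\text{s.t.}\quad
  & F_\theta(\chi^{(i)}, \sigma^{(i)}) - F_\theta(\chi^{(i)}, \sigma) \ge \rho - \xi_i
  \qquad \forall i,\ \forall \sigma \ne \sigma^{(i)}
\end{aligned}
```

Since the number of wrong paths grows exponentially with sequence length, such problems are typically solved by iteratively adding only the most violated constraints, found by decoding.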

mTIM Coverage Segmentation: Preliminary Evaluation (C. elegans)
[Plot: CDS (precision+recall)/2 for mTIM across expression percentiles [%]]
Sensitivity heavily depends on read density.

Next Generation Gene Finding: Computational Gene Finding (Labeling the Genome)
[Figure: DNA -> pre-mRNA -> mRNA -> protein, annotated with the signals TSS, splice donor, splice acceptor, polyA/cleavage site, TIS, stop, cap, and polyA tail; label sequence along the genome: TSS, TIS, Don, Acc, Stop, cleave]

Next Generation Gene Finding: mGene-based Transcript Prediction (Idea)
[Figure along genomic position: true gene model vs. wrong gene model]
STEP 1: SVM signal predictions (tss, tis, acc, don, stop)
STEP 2: Integration (transform features; scoring function F(x, y); large margin between true and wrong gene model)

Next Generation Gene Finding: Learning to Use Expression Measurements
Two approaches:
- Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions
- Directly use evidence during learning to learn to appropriately weight its importance
[Table: gene prediction in C. elegans (CDS evaluation); columns SN, SP, F at exon level and at transcript level; rows: ab initio, ESTs heuristic, ESTs trained; values not preserved] (Behr et al., in prep., 2010)

Next Generation Gene Finding: Modeling Uncertainty (mGene-based Transcript Prediction)
[Figure along genomic position: true gene model vs. wrong gene model]
STEP 1: SVM signal predictions (tss, tis, acc, don, stop), plus expression evidence
STEP 2: Integration (transform SVM outputs with PLiFs into a score; large margin)
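A PLiF (piecewise linear function) transforms a raw signal, such as an SVM output or a read-coverage value, into a score contribution. The sketch below shows only the evaluation of such a function by linear interpolation between support points; the support points and values are made up, and in training the values at the supports would be the parameters being optimised.

```python
from bisect import bisect_right

def plif(x, supports, values):
    """Evaluate a piecewise linear function given sorted supports and their values.

    Inputs outside the support range are clamped to the boundary values.
    """
    if x <= supports[0]:
        return values[0]
    if x >= supports[-1]:
        return values[-1]
    j = bisect_right(supports, x)          # supports[j-1] <= x < supports[j]
    x0, x1 = supports[j - 1], supports[j]
    t = (x - x0) / (x1 - x0)
    return values[j - 1] + t * (values[j] - values[j - 1])

# Hypothetical transformation of an SVM output in [-1, 1] into a score.
supports = [-1.0, 0.0, 1.0]
values = [-2.0, 0.0, 0.5]
score_negative = plif(-0.5, supports, values)
score_positive = plif(0.5, supports, values)
```

Because the function is linear in its values at the support points, the total score stays linear in the parameters, which keeps the large-margin training problem a QP.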

Next Generation Gene Finding: Learning to Use Expression Measurements
Two approaches:
- Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions
- Directly use evidence during learning to learn to appropriately weight its importance
[Table: gene prediction in C. elegans (CDS evaluation); columns SN, SP, F at exon level and at transcript level; rows: ab initio, ESTs heuristic, ESTs trained, RNA-Seq trained, RNA-Seq/ESTs trained; values not preserved] (Behr et al., in prep., 2010)
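For readers unfamiliar with the SN/SP/F columns, here is a minimal sketch of an exon-level evaluation. It assumes the usual gene-finding conventions, which the slide does not define: an exon counts as correct only if both boundaries match, SN is the fraction of annotated exons recovered, SP is the fraction of predicted exons that are correct, and F is their harmonic mean. The exon coordinates are invented.

```python
def exon_level_metrics(predicted, annotated):
    """Return (SN, SP, F) for exact-match exon prediction."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sn = true_positives / len(annotated) if annotated else 0.0
    sp = true_positives / len(predicted) if predicted else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f

# Hypothetical exons as (start, end) pairs.
annotated = [(1, 10), (20, 30), (40, 50), (60, 70)]
predicted = [(1, 10), (20, 30), (45, 50)]
sn, sp, f = exon_level_metrics(predicted, annotated)
```

The third predicted exon overlaps an annotated one but has a wrong start, so under exact matching it counts as a false positive.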

Next Generation Gene Finding: Preliminary Evaluation (C. elegans)
[Plot: CDS (precision+recall)/2 across expression percentiles [%] for mTIM, mGene ab initio, and mGene.ngs]

Digestion, Next Generation Gene Finding: Alternative Transcripts
- mTIM and mGene.ngs predict single transcripts
- mTIM exploits uniformity of read coverage among exons of the same transcript
- mGene.ngs uses more assumptions on the structure of transcripts
Alternative transcripts: spliced reads for splicing graph completion
[Figure: transcript prediction (result of mGene/mTIM), spliced reads (result of QPALMA), and inferred edges]
Paths through the splicing graph define alternative transcripts.
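The splicing graph idea can be sketched as follows: exons are nodes, the predicted transcript contributes edges between consecutive exons, spliced reads add further edges, and every path from the first to the last exon is a candidate transcript. The exon names and the junction are hypothetical, and the enumeration assumes an acyclic graph.

```python
def build_splicing_graph(transcript_exons, junction_reads):
    """Combine one predicted transcript with exon-exon junctions seen in spliced reads."""
    graph = {exon: set() for exon in transcript_exons}
    for upstream, downstream in zip(transcript_exons, transcript_exons[1:]):
        graph[upstream].add(downstream)
    for upstream, downstream in junction_reads:
        # Inferred edge supported by a spliced read.
        graph.setdefault(upstream, set()).add(downstream)
        graph.setdefault(downstream, set())
    return graph

def enumerate_transcripts(graph, source, sink):
    """List all paths from source to sink (depth-first, assumes no cycles)."""
    if source == sink:
        return [[sink]]
    paths = []
    for successor in sorted(graph[source]):
        for tail in enumerate_transcripts(graph, successor, sink):
            paths.append([source] + tail)
    return paths

# One predicted transcript plus a spliced read that skips exon e2.
graph = build_splicing_graph(["e1", "e2", "e3", "e4"], [("e1", "e3")])
transcripts = enumerate_transcripts(graph, "e1", "e4")
```

A single additional junction already doubles the number of candidate transcripts here, which is why a subsequent quantitation step is useful for deciding which paths are actually expressed.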

Transcript Quantitation with rQuant: RNA-Seq Pipeline Overview
[Pipeline diagram: read alignment (PALMapper), transcript finding (mTIM segmentation, mGene.ngs), quantitation (rQuant); short reads -> alignments -> transcripts -> transcripts with quantitation]

Transcript Quantitation with rQuant: RNA-Seq Biases and Quantitation
[Figure: RNA-Seq workflow with mRNA, fragmentation, RT, sequence library, sequencer, short sequence reads, and aligned reads (exonic reads, junction reads) on genome and transcripts]
Biases due to...
- cDNA library construction
- Sequencing
- Read mapping
[Plot: read coverage vs. relative transcript position (5' -> 3'), averaged over annotated transcripts of length 1kb for the C. elegans SRX dataset]

79 Transcript Quantitation with rquant: Basic Idea (Approach)
Figure: read coverage against genome position (5' -> 3') for a short transcript (A), a long transcript (B), the mixture of transcripts (expected coverage M), and the observed coverage (R).
The expected coverage at position $i$ is $M_i = w_A A_i + w_B B_i$, and the transcript weights are found by solving $\min_{w_A, w_B} \sum_i \ell(M_i, R_i)$.
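For two transcripts the fit can be written down directly. The slide does not say which loss $\ell$ is used; the sketch below assumes a squared loss, for which the minimiser follows from the 2x2 normal equations. The function name is hypothetical, and no non-negativity constraint on the weights is enforced here.

```python
def fit_two_transcripts(A, B, R):
    """Least-squares weights (w_A, w_B) so that w_A*A_i + w_B*B_i approximates R_i.

    A, B: expected per-position coverage of the two transcripts at unit abundance
    R:    observed per-position read coverage
    Assumes squared loss; the loss used in the talk is not specified.
    """
    aa = sum(a * a for a in A)
    bb = sum(b * b for b in B)
    ab = sum(a * b for a, b in zip(A, B))
    ar = sum(a * r for a, r in zip(A, R))
    br = sum(b * r for b, r in zip(B, R))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("A and B are collinear; the weights are not identifiable")
    # Solve [[aa, ab], [ab, bb]] @ [w_A, w_B] = [ar, br] by Cramer's rule.
    w_a = (ar * bb - br * ab) / det
    w_b = (br * aa - ar * ab) / det
    return w_a, w_b


# The short transcript covers the first three positions, the long one all six.
A = [1, 1, 1, 0, 0, 0]
B = [1, 1, 1, 1, 1, 1]
R = [2 * a + 5 * b for a, b in zip(A, B)]  # noise-free mixture
print(fit_two_transcripts(A, B, R))
```

The example works because the positions unique to the long transcript pin down $w_B$, after which the shared positions determine $w_A$. If both transcripts covered exactly the same positions with the same profile, the determinant would vanish and the weights could not be separated.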

86 Transcript Quantitation with rquant: Iterative Algorithm (Approach)
1. Optimise transcript weights: $\min_{w} \sum_i \ell\left(\sum_t w^{(t)} p^{(t)}_i, R_i\right)$
2. Optimise profile weights: $\min_{p} \sum_i \ell\left(\sum_t w^{(t)} p^{(t)}_i, R_i\right)$
3. Repeat 1. and 2. until convergence.
Figure: read counts along gene AT1G01240 (chromosome 1, forward strand) with its annotated transcripts, and profile weight against distance to transcript start/end.
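The alternating scheme in steps 1 to 3 can be sketched in a simplified form. Everything specific below is an assumption for illustration and should not be read as the rquant implementation: a squared loss, non-negative weights, a single profile `q` shared by all transcripts and indexed by a per-position bin, so that $p^{(t)}_i$ = `masks[t][i] * q[bins[t][i]]`, a fixed number of iterations in place of a convergence test, and a rescaling that fixes the mean of `q` to 1 to remove the scale ambiguity between `w` and `q`.

```python
def coordinate_descent(X, R, theta, sweeps=50):
    """Minimise sum_i (sum_k theta[k]*X[k][i] - R[i])**2 subject to theta >= 0."""
    theta = list(theta)
    n = len(R)
    M = [sum(theta[k] * X[k][i] for k in range(len(X))) for i in range(n)]
    for _ in range(sweeps):
        for k, row in enumerate(X):
            norm = sum(x * x for x in row)
            if norm == 0:
                continue
            # Exact minimiser for coordinate k with all others held fixed,
            # clipped at zero to keep the weight non-negative.
            grad = sum(x * (R[i] - M[i]) for i, x in enumerate(row))
            new = max(0.0, theta[k] + grad / norm)
            delta = new - theta[k]
            if delta:
                for i, x in enumerate(row):
                    M[i] += delta * x
                theta[k] = new
    return theta


def fit_alternating(masks, bins, R, n_bins, iterations=20):
    """Alternate between transcript weights w and shared profile weights q.

    masks[t][i]: 1 if position i is exonic in transcript t, else 0
    bins[t][i]:  profile bin of position i in transcript t (any valid
                 index where the mask is 0)
    """
    T, n = len(masks), len(R)
    w = [1.0] * T
    q = [1.0] * n_bins
    for _ in range(iterations):
        # Step 1: profile fixed, optimise transcript weights.
        Xw = [[masks[t][i] * q[bins[t][i]] for i in range(n)] for t in range(T)]
        w = coordinate_descent(Xw, R, w)
        # Step 2: transcript weights fixed, optimise profile weights.
        Xq = [[sum(w[t] * masks[t][i] for t in range(T) if bins[t][i] == d)
               for i in range(n)] for d in range(n_bins)]
        q = coordinate_descent(Xq, R, q)
        # Fix the mean of q to 1; w absorbs the scale, the model is unchanged.
        scale = sum(q) / n_bins
        if scale > 0:
            q = [v / scale for v in q]
            w = [v * scale for v in w]
    return w, q


# Two transcripts, flat profile (one bin): reduces to a plain mixture fit.
A = [1, 1, 1, 0, 0, 0]
B = [1, 1, 1, 1, 1, 1]
R = [7, 7, 7, 5, 5, 5]
w, q = fit_alternating([A, B], [[0] * 6, [0] * 6], R, n_bins=1)
print(w, q)
```

Each sub-problem is convex once the other set of weights is fixed, which is what makes the alternation attractive. The joint problem is not convex, so the result can depend on the starting point.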

87 Transcript Quantitation with rquant: Evaluation I (Results)
rquant (position-wise, with profiles, estimating library and mapping bias) is compared to:
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009])
Segment-wise, with profiles (e.g., Flux Capacitor [Sammeth, 2009a])
Task: estimate transcript abundances, using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) on a subset of alternatively spliced genes.
Evaluation: Spearman correlation between the simulated RNA expression level and the predicted transcript weights.
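The evaluation measure can be sketched in plain Python. Spearman correlation is the Pearson correlation of the rank vectors, with tied values sharing their mean rank. The helper names are hypothetical, and the `mean_within_gene` function reflects one plausible reading of a per-gene average; how the talk's per-gene figure was actually computed is an assumption. Inputs with no variation (all values equal) are not handled and would cause a division by zero.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        stop = start
        # Extend the block while the next sorted value is tied with this one.
        while stop + 1 < len(order) and values[order[stop + 1]] == values[order[start]]:
            stop += 1
        shared = (start + stop) / 2 + 1
        for k in range(start, stop + 1):
            ranks[order[k]] = shared
        start = stop + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def mean_within_gene(simulated_by_gene, predicted_by_gene):
    """Average the per-gene Spearman correlation over all genes (assumed definition)."""
    scores = [spearman(simulated_by_gene[g], predicted_by_gene[g])
              for g in simulated_by_gene]
    return sum(scores) / len(scores)


# Toy values: simulated expression levels against predicted transcript weights.
simulated = [5.0, 1.0, 3.0, 8.0]
predicted = [4.2, 0.7, 3.9, 9.1]
print(spearman(simulated, predicted))
```

Because only ranks enter the score, the measure is insensitive to any monotone rescaling of the predicted weights, which suits a comparison between methods whose outputs are on different scales.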

95 Transcript Quantitation with rquant: Evaluation II (Results)
Figure: Spearman correlation, across transcripts and within genes (mean), for four methods: segment-wise, w/o profiles; segment-wise, w/ profiles; position-wise, w/o profiles; rquant (position-wise, w/ profiles). (Bohnert et al., submitted, 2010)

96 Transcript Quantitation with rquant: Galaxy-based Web Services for NGS Analyses (Results)
Galaxy-based web service: http://galaxy..mpg.de
PALMapper: http://.mpg.de/raetsch/suppl/palmapper
mgene: http://mgene.
mtim (in prep.)
rquant
(Rätsch et al., in preparation, 2010)

97 Summary
PALMapper: splice site predictions improve alignment performance; outperforms many other read mappers in intron accuracy.
mtim: high specificity, sensitivity depends on read coverage; better for identifying transcripts specific to the experimental data.
mgene: high sensitivity (also for lowly expressed genes); also identifies non-expressed genes, which is good for annotation.
rquant: models library preparation, sequencing, and alignment biases; accurately quantifies transcripts.
Galaxy instance: easy use of these tools.

102 Acknowledgements
Fabio De Bona (alignments), Jonas Behr (gene finding), Georg Zeller (segmentation), Regina Bohnert (quantitation)
RGASP Team: Jonas Behr (FML), Georg Zeller (FML & MPI), Regina Bohnert (FML)
Funding by DFG & Max Planck Society.
Thank you for your attention!

104 References I
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10,
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. URL http: // Oral presentation at the CSHL Genome Informatics Meeting, September
RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836): , ISSN (Electronic). doi: /science
F. De Bona, S. Ossowski, K. Schneeberger, and G. Rätsch. QPALMA: Optimal spliced alignments of short sequence reads. Bioinformatics, 24:i174–i180,
Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8): , April 2009.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schoelkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press,

105 References II
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June
G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. Sommer, and B. Schölkopf. Improving the C. elegans genome annotation using machine learning. PLoS Computational Biology, 3(2):e20, URL
M. Sammeth. The Flux Capacitor. Website, 2009a.
M. Sammeth. The Flux Simulator. Website, 2009b.
Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biol, 10(9):R98, Jan doi: /gb r98. URL
Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mgene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, URL Advance access June 29,
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks,
