Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Size: px
Start display at page:

Download "Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction"

Transcription

1 Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 1 / 43

2 Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 43

3 Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 43

4 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

5 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

6 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

7 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 43

8 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

9 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Two different classes of observations c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

10 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Inferred classification rule c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

11 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models 1 Large scale sequence classification 2 Analysis and explanation of learning results 3 Sequence segmentation & structure prediction k mer Length Position Log-intensity transcript c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 43

12 Motivation Computational Transcriptome Prediction (a.k.a. Computational Gene Finding) DNA genic intergenic pre-mrna exon intron exon intron exon mrna cap 5' UTR 3' UTR polya Protein Given a piece of DNA sequence Predict protein-coding mrnas Less well developed for non-coding RNAs c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 43

13 Motivation Computational Transcriptome Prediction (a.k.a. Computational Gene Finding) DNA genic intergenic pre-mrna exon intron exon intron exon mrna cap 5' UTR 3' UTR polya Protein Given a piece of DNA sequence Predict protein-coding mrnas Less well developed for non-coding RNAs c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 43

14 Motivation Experimental Characterization of the Transcriptome Getting closer to reality... DNA Microarrays cdna Sequencing Oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA. [Wikipedia] Experimental methods have there inherent limitations and biases Learning methods help to overcome them c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 43

15 Motivation Experimental Characterization of the Transcriptome Getting closer to reality... DNA Microarrays cdna Sequencing Oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA. [Wikipedia] Experimental methods have there inherent limitations and biases Learning methods help to overcome them c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 43

16 Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

17 Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

18 Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

19 Motivation Key Research Questions Characterize an organism s full complement of genes Find new (possibly noncoding) genes Compare genes among organisms Characterize transcript isoforms Find new alternative splice forms / transcript ends Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress) Identify significant expression changes Understand transcriptome regulation Knock-out / knock-down analysis of regulators Identify regulated targets with significant expression changes Identify binding sites used in regulation (e.g. ChIP-on-chip) QTL-Mapping of transcriptome-phenotypes c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 43

20 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

21 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

22 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

23 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... High-throughput transcriptome measurements Qualitative studies New transcripts Altered expression across tissues, conditions etc. Quantitative studies at high resolution pre-mrna exon mrna short reads junction reads reference genome intron Figure adapted from Wikipedia Motivation: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 43

24 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads Read Alignment Expression level ~ #reads Significance Testing Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

25 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

26 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment Transcript Reconstruction TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 43

27 RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 43

28 RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 43

29 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes More info: c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

30 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

31 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

32 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes QPALMA for spliced read alignments: GenomeMapper identifies seed regions Spliced alignments by QPALMA [De Bona et al., 2008] DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT TCGCGCGCAACG Read More info: c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 43

33 PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? C. elegans recall precision TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 43

34 PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? C. elegans recall precision H. sapiens (chromosome 1) recall precision TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads Palmapper Comparison of PALMapper with other alignment programs within the RGASP project (preliminary) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 43

35 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

36 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

37 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

38 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Source of information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : (Σ R) Σ R [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 43

39 PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 43

40 PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 43

41 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true false Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

42 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

43 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Use technique motivated by SVMs ( large-margin ) Idea: Enforce a margin between correct and incorrect examples One has to solve a big quadratic problem c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 43

44 PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 43

45 PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 43

46 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

47 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

48 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position 8nt 12nt 12nt 8nt Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 43

49 PALMapper Read Alignment Reasons for High Accuracy Step 2: Transcript Prediction Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts A. Coverage segmentation algorithm mtim for general transcripts (no coding bias/assumption) B. Extension of the mgene gene finding system to use NGS data for protein coding transcript prediction (mgene.ngs) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 18 / 43

50 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

51 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

52 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic annotated gene Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 43

53 mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

54 mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic feature score read coverage read coverage read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

55 mtim Coverage Segmentation Approach The mtim Segmentation Approach don S E I acc intergenic exonic intronic feature score read coverage read coverage read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 43

56 mtim Coverage Segmentation Approach The mtim Segmentation Approach Idea: Assume uniform read coverage within exons of same transcript annotated genes read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 43

57 mtim Coverage Segmentation Approach The mtim Segmentation Approach don E 1 I 1 acc Discrete expression level 1 E 2 don I 2 2 S acc don E Q I Q acc intergenic exonic intronic Q Carry expression level information between exons of same transcript (G. Zeller et al., 2008; G. Zeller et al., in prep., 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 43

58 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

59 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

60 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [???] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 43

61 mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 43

62 mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 43

63 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS polya/cleavage pre-mrna Splice Donor Splice Acceptor Splice Splice Donor Acceptor mrna TIS Stop cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

64 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

65 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein TSS TIS Stop cleave Don Acc c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 43

66 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model STEP 1: SVM Signal Predictions tss tis acc don stop genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

67 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

68 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 43

69 Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio ESTs heuristic ESTs trained Gene prediction in C. elegans (CDS evaluation) Behr et al., in pre., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 43

70 Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction genomic position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 43

71 Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop Expression Evidence STEP 2: Integration transform SVM outputs with PLiF large margin score c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 43

72 Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio ESTs heuristic ESTs trained RNA-Seq trained RNA-Seq/ESTs trained Gene prediction in C. elegans (CDS evaluation) Behr et al., in prep., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 29 / 43

73 Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/ mgene ab initio mgene.ngs expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 43

74 Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/ mtim mgene ab initio mgene.ngs expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 43

75 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

76 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

77 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

78 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

79 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 43

80 Transcript Quantitation with rquant RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 43

81 AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 43

82 AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 43

83 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 A M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

84 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 Long transcript read coverage relative transcript position 5 -> 3 M i = w A A i + w B B i min wa,w B i l (M i, R i ) A B c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

85 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage genome position 5 -> 3 Long transcript read coverage genome position 5 -> 3 A B M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

86 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage read coverage genome position 5 -> 3 genome position 5 -> 3 A B M M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

87 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage expected observed read coverage genome position 5 -> 3 genome position 5 -> 3 A B M R M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 43

88 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

89 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

90 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

91 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

92 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

93 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , profile weight read counts distance to transcript start/end c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

94 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , profile weight read counts distance to transcript start/end c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 43

95 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

96 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

97 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

98 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

99 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

100 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 43

101 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

102 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

103 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles rquant Position-wise, w/ profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 43

104 Transcript Quantitation with rquant Results Galaxy-based Web Services for NGS Analyses Galaxy-based web service PALMapper mgene mtim (in prep.) rquant (Rätsch et al., in preparation, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 38 / 43

105 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

106 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

107 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

108 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 43

109 Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 40 / 43

110 Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 40 / 43

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010)

More information

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany StatSeq Workshop, Gent (Mai 20, 2010) c Gunnar

More information

Next Generation Genome Annotation with mgene.ngs

Next Generation Genome Annotation with mgene.ngs Next Generation Genome Annotation with mgene.ngs Jonas Behr, 1 Regina Bohnert, 1 Georg Zeller, 1,2 Gabriele Schweikert, 1,2,3 Lisa Hartmann, 1 and Gunnar Rätsch 1 1 Friedrich Miescher Laboratory of the

More information

Fully Automated Genome Annotation with Deep RNA Sequencing

Fully Automated Genome Annotation with Deep RNA Sequencing Fully Automated Genome Annotation with Deep RNA Sequencing Gunnar Rätsch Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany Friedrich Miescher Laboratory of the Max Planck Society

More information

RNA-Seq Blog Poll Results

RNA-Seq Blog Poll Results RNA-Seq Blog Poll Results What is the greatest immediate need facing the RNA Sequencing community? More sequencing capacity Further advancement in sequencing technology Continued cost reduction Better

More information

oqtans A Galaxy-Integrated Workflow for Quantitative Transcriptome Analysis from NGS Data

oqtans A Galaxy-Integrated Workflow for Quantitative Transcriptome Analysis from NGS Data oqtans A Galaxy-Integrated Workflow for Quantitative Transcriptome Analysis from NGS Data Géraldine Jean Sebastian J. Schultheiss Machine Learning in Biology, Rätsch

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Abundance RNA is... Diverse Dynamic Central DNA rrna Epigenetics trna RNA mrna Time Protein Abundance

More information

Methods for Transcriptome Analysis with Tiling Arrays and mrna-seq

Methods for Transcriptome Analysis with Tiling Arrays and mrna-seq Methods for Transcriptome Analysis with Tiling Arrays and mrna-seq Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society Tübingen, Germany Talk at the University of Toronto July 17, 2008 Gunnar

More information

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ), Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ), 2012-01-26 What is a gene What is a transcriptome History of gene expression assessment RNA-seq RNA-seq analysis

More information

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch Machine Learning Methods for Discovering Common Sequence Variation in A. thaliana Gunnar Rätsch Friedrich Miescher Laboratory, Max Planck Society, Tübingen Technical University Berlin March 31, 2008 Current

More information

Regina Bohnert. Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany. July 18, 2008

Regina Bohnert. Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany. July 18, 2008 Revealing Sequence Variation Patterns in Rice with Machine Learning Methods Regina Bohnert Friedrich Miescher Laboratory of the Max Planck Society Tübingen, Germany July 18, 2008 Regina Bohnert (FML) Sequence

More information

RNA-Sequencing analysis

RNA-Sequencing analysis RNA-Sequencing analysis Markus Kreuz 25. 04. 2012 Institut für Medizinische Informatik, Statistik und Epidemiologie Content: Biological background Overview transcriptomics RNA-Seq RNA-Seq technology Challenges

More information

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility 2018 ABRF Meeting Satellite Workshop 4 Bridging the Gap: Isolation to Translation (Single Cell RNA-Seq) Sunday, April 22 Basics of RNA-Seq (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly,

More information

Mapping strategies for sequence reads

Mapping strategies for sequence reads Mapping strategies for sequence reads Ernest Turro University of Cambridge 21 Oct 2013 Quantification A basic aim in genomics is working out the contents of a biological sample. 1. What distinct elements

More information

RNA-SEQUENCING ANALYSIS

RNA-SEQUENCING ANALYSIS RNA-SEQUENCING ANALYSIS Joseph Powell SISG- 2018 CONTENTS Introduction to RNA sequencing Data structure Analyses Transcript counting Alternative splicing Allele specific expression Discovery APPLICATIONS

More information

measuring gene expression December 5, 2017

measuring gene expression December 5, 2017 measuring gene expression December 5, 2017 transcription a usually short-lived RNA copy of the DNA is created through transcription RNA is exported to the cytoplasm to encode proteins some types of RNA

More information

Gene Expression Technology

Gene Expression Technology Gene Expression Technology Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu Gene expression Gene expression is the process by which information from a gene

More information

Analysis of RNA-seq Data

Analysis of RNA-seq Data Analysis of RNA-seq Data A physicist and an engineer are in a hot-air balloon. Soon, they find themselves lost in a canyon somewhere. They yell out for help: "Helllloooooo! Where are we?" 15 minutes later,

More information

RNA-Seq with the Tuxedo Suite

RNA-Seq with the Tuxedo Suite RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop The Basic Tuxedo Suite References Trapnell C, et al. 2009 TopHat: discovering splice junctions with

More information

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment Zhaojun Zhang, Shunping Huang, Jack Wang, Xiang Zhang, Fernando Pardo

More information

measuring gene expression December 11, 2018

measuring gene expression December 11, 2018 measuring gene expression December 11, 2018 Intervening Sequences (introns): how does the cell get rid of them? Splicing!!! Highly conserved ribonucleoprotein complex recognizes intron/exon junctions and

More information

Machine Learning. HMM applications in computational biology

Machine Learning. HMM applications in computational biology 10-601 Machine Learning HMM applications in computational biology Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Biological data is rapidly

More information

Motivation From Protein to Gene

Motivation From Protein to Gene MOLECULAR BIOLOGY 2003-4 Topic B Recombinant DNA -principles and tools Construct a library - what for, how Major techniques +principles Bioinformatics - in brief Chapter 7 (MCB) 1 Motivation From Protein

More information

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Transcriptomics analysis with RNA seq: an overview Frederik Coppens Transcriptomics analysis with RNA seq: an overview Frederik Coppens Platforms Applications Analysis Quantification RNA content Platforms Platforms Short (few hundred bases) Long reads (multiple kilobases)

More information

Gene Prediction 10/21/05

Gene Prediction 10/21/05 Gene Prediction 1/21/5 1/21/5 Gene Prediction Announcements Eam 2 - net Friday Posted online: Eam 2 Study Guide 544 Reading Assignment (2 papers) (formerly Gene Prediction - ) 1/21/5 D Dobbs ISU - BCB

More information

Bi 8 Lecture 5. Ellen Rothenberg 19 January 2016

Bi 8 Lecture 5. Ellen Rothenberg 19 January 2016 Bi 8 Lecture 5 MORE ON HOW WE KNOW WHAT WE KNOW and intro to the protein code Ellen Rothenberg 19 January 2016 SIZE AND PURIFICATION BY SYNTHESIS: BASIS OF EARLY SEQUENCING complex mixture of aborted DNA

More information

Systematic evaluation of spliced alignment programs for RNA- seq data

Systematic evaluation of spliced alignment programs for RNA- seq data Systematic evaluation of spliced alignment programs for RNA- seq data Pär G. Engström, Tamara Steijger, Botond Sipos, Gregory R. Grant, André Kahles, RGASP Consortium, Gunnar Rätsch, Nick Goldman, Tim

More information

Genome annotation & EST

Genome annotation & EST Genome annotation & EST What is genome annotation? The process of taking the raw DNA sequence produced by the genome sequence projects and adding the layers of analysis and interpretation necessary

More information

Applications of short-read

Applications of short-read Applications of short-read sequencing: RNA-Seq and ChIP-Seq BaRC Hot Topics March 2013 George Bell, Ph.D. http://jura.wi.mit.edu/bio/education/hot_topics/ Sequencing applications RNA-Seq includes experiments

More information

Eucalyptus gene assembly

Eucalyptus gene assembly Eucalyptus gene assembly ACGT Plant Biotechnology meeting Charles Hefer Bioinformatics and Computational Biology Unit University of Pretoria October 2011 About Eucalyptus Most valuable and widely planted

More information

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics RNA Sequencing T TM variation genetics validation SNP ncrna metagenomics private trio de novo exome mendelian ChIP-seq RNA DNA bioinformatics custom target high-throughput resequencing storage ncrna comparative

More information

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1 CAP 5510-9 BIOINFORMATICS Su-Shing Chen CISE 10/5/2005 Su-Shing Chen, CISE 1 Basic BioTech Processes Hybridization PCR Southern blotting (spot or stain) 10/5/2005 Su-Shing Chen, CISE 2 10/5/2005 Su-Shing

More information

Intro to RNA-seq. July 13, 2015

Intro to RNA-seq. July 13, 2015 Intro to RNA-seq July 13, 2015 Goal of the course To be able to effectively design, and interpret genomic studies of gene expression. We will focus on RNA-seq, but the class will provide a foothold into

More information

Wheat Genome Structural Annotation Using a Modular and Evidence-combined Annotation Pipeline

Wheat Genome Structural Annotation Using a Modular and Evidence-combined Annotation Pipeline Wheat Genome Structural Annotation Using a Modular and Evidence-combined Annotation Pipeline Xi Wang Bioinformatics Scientist Computational Life Science Page 1 Bayer 4:3 Template 2010 March 2016 17/01/2017

More information

How to deal with your RNA-seq data?

How to deal with your RNA-seq data? How to deal with your RNA-seq data? Rachel Legendre, Thibault Dayris, Adrien Pain, Claire Toffano-Nioche, Hugo Varet École de bioinformatique AVIESAN-IFB 2017 1 Rachel Legendre Bioinformatics 27/11/2018

More information

CSE 549: RNA-Seq aided gene finding

CSE 549: RNA-Seq aided gene finding CSE 549: RNA-Seq aided gene finding Finding Genes We ll break gene finding methods into 3 main categories. ab initio latin from the beginning w/o experimental evidence comparative make use of knowledge

More information

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq Sequencing applications Applications of short-read sequencing: RNA-Seq and ChIP-Seq BaRC Hot Topics March 2013 George Bell, Ph.D. http://jura.wi.mit.edu/bio/education/hot_topics/ RNA-Seq includes experiments

More information

Background Wikipedia Lee and Mahadavan, JCB, 2009 History (Platform Comparison) P Park, Nature Review Genetics, 2009 P Park, Nature Reviews Genetics, 2009 Rozowsky et al., Nature Biotechnology, 2009

More information

Problem formulations. Wikipedia :: Multidisciplinary design optimization

Problem formulations. Wikipedia :: Multidisciplinary design optimization Problem formulations Problem formulation is normally the most difficult part of the process. It is the selection of design variables, constraints, objectives, and models... Wikipedia :: Multidisciplinary

More information

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017 Annotation Annotation for D. virilis Chris Shaffer July 2012 l Big Picture of annotation and then one practical example l This technique may not be the best with other projects (e.g. corn, bacteria) l

More information

Transcriptome analysis

Transcriptome analysis Statistical Bioinformatics: Transcriptome analysis Stefan Seemann seemann@rth.dk University of Copenhagen April 11th 2018 Outline: a) How to assess the quality of sequencing reads? b) How to normalize

More information

MODULE 5: TRANSLATION

MODULE 5: TRANSLATION MODULE 5: TRANSLATION Lesson Plan: CARINA ENDRES HOWELL, LEOCADIA PALIULIS Title Translation Objectives Determine the codons for specific amino acids and identify reading frames by looking at the Base

More information

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018 Annotation Annotation for D. virilis Chris Shaffer July 2012 l Big Picture of annotation and then one practical example l This technique may not be the best with other projects (e.g. corn, bacteria) l

More information

Microarrays & Gene Expression Analysis

Microarrays & Gene Expression Analysis Microarrays & Gene Expression Analysis Contents DNA microarray technique Why measure gene expression Clustering algorithms Relation to Cancer SAGE SBH Sequencing By Hybridization DNA Microarrays 1. Developed

More information

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 23, 2017 Differential Gene Expression Generalized Workflow File Types

More information

02 Agenda Item 03 Agenda Item

02 Agenda Item 03 Agenda Item 01 Agenda Item 02 Agenda Item 03 Agenda Item SOLiD 3 System: Applications Overview April 12th, 2010 Jennifer Stover Field Application Specialist - SOLiD Applications Workflow for SOLiD Application Application

More information

Single-Cell Whole Transcriptome Profiling With the SOLiD. System

Single-Cell Whole Transcriptome Profiling With the SOLiD. System APPLICATION NOTE Single-Cell Whole Transcriptome Profiling Single-Cell Whole Transcriptome Profiling With the SOLiD System Introduction The ability to study the expression patterns of an individual cell

More information

MITIE: Simultaneous RNA-Seq-based Transcript Identification and Quantification in Multiple Samples

MITIE: Simultaneous RNA-Seq-based Transcript Identification and Quantification in Multiple Samples Bioinformatics Advance Access published August 25, 2013 MITIE: Simultaneous RNA-Seq-based Transcript Identification and Quantification in Multiple Samples Jonas Behr, 1,2, André Kahles, 1 Yi Zhong, 1 Vipin

More information

Finding Genes with Genomics Technologies

Finding Genes with Genomics Technologies PLNT2530 Plant Biotechnology (2018) Unit 7 Finding Genes with Genomics Technologies Unless otherwise cited or referenced, all content of this presenataion is licensed under the Creative Commons License

More information

RNA-Seq data analysis course September 7-9, 2015

RNA-Seq data analysis course September 7-9, 2015 RNA-Seq data analysis course September 7-9, 2015 Peter-Bram t Hoen (LUMC) Jan Oosting (LUMC) Celia van Gelder, Jacintha Valk (BioSB) Anita Remmelzwaal (LUMC) Expression profiling DNA mrna protein Comprehensive

More information

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

De novo assembly in RNA-seq analysis.

De novo assembly in RNA-seq analysis. De novo assembly in RNA-seq analysis. Joachim Bargsten Wageningen UR/PRI/Plant Breeding October 2012 Motivation Transcriptome sequencing (RNA-seq) Gene expression / differential expression Reconstruct

More information

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica The Ensembl Database Dott.ssa Inga Prokopenko Corso di Genomica 1 www.ensembl.org Lecture 7.1 2 What is Ensembl? Public annotation of mammalian and other genomes Open source software Relational database

More information

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist Whole Transcriptome Analysis of Illumina RNA- Seq Data Ryan Peters Field Application Specialist Partek GS in your NGS Pipeline Your Start-to-Finish Solution for Analysis of Next Generation Sequencing Data

More information

Measuring transcriptomes with RNA-Seq

Measuring transcriptomes with RNA-Seq Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2017 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC

More information

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies Introduction to Bioinformatics and Gene Expression Technologies Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 1 1 Vocabulary Gene: hereditary DNA sequence at a

More information

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies Vocabulary Introduction to Bioinformatics and Gene Expression Technologies Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 1 Gene: Genetics: Genome: Genomics: hereditary

More information

RNA standards v May

RNA standards v May Standards, Guidelines and Best Practices for RNA-Seq: 2010/2011 I. Introduction: Sequence based assays of transcriptomes (RNA-seq) are in wide use because of their favorable properties for quantification,

More information

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University RNA-Seq Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University joshua.ainsley@tufts.edu Day five Alternative splicing Assembly RNA edits Alternative splicing

More information

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data Jan Hellemans 7th international qpcr & NGS Event - Freising March 24 th, 2015 Therapeutics lncrna oncology

More information

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project

More information

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction Lecture 8 Reading Lecture 8: 96-110 Lecture 9: 111-120 DNA Libraries Definition Types Construction 142 DNA Libraries A DNA library is a collection of clones of genomic fragments or cdnas from a certain

More information

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8 Bi 8 Lecture 4 DNA approaches: How we know what we know Ellen Rothenberg 14 January 2016 Reading: from Alberts Ch. 8 Central concept: DNA or RNA polymer length as an identifying feature RNA has intrinsically

More information

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

From reads to results: differential. Alicia Oshlack Head of Bioinformatics From reads to results: differential expression analysis with ihrna seq Alicia Oshlack Head of Bioinformatics Murdoch Childrens Research Institute Benefits and opportunities ii of RNA seq All transcripts

More information

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Measuring transcriptomes with RNA-Seq. BMI/CS 776  Spring 2016 Anthony Gitter Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu Overview RNA-Seq technology The RNA-Seq quantification problem Generative

More information

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013) Genome annotation Erwin Datema (2011) Sandra Smit (2012, 2013) Genome annotation AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAAT AATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAA

More information

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly SCIENCE CHINA Life Sciences SPECIAL TOPIC February 2013 Vol.56 No.2: 156 162 RESEARCH PAPER doi: 10.1007/s11427-013-4444-x Comparative analysis of de novo transcriptome assembly CLARKE Kaitlin 1, YANG

More information

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs Gene Finding GenBank Growth GenBank Growth In 2003 ~ 31 million sequences ~ 37 billion base pairs GenBank: Exponential Growth Growth of GenBank in billions of base pairs from release 3 in April of 1994

More information

1. Introduction Gene regulation Genomics and genome analyses

1. Introduction Gene regulation Genomics and genome analyses 1. Introduction Gene regulation Genomics and genome analyses 2. Gene regulation tools and methods Regulatory sequences and motif discovery TF binding sites Databases 3. Technologies Microarrays Deep sequencing

More information

Chapter 1. from genomics to proteomics Ⅱ

Chapter 1. from genomics to proteomics Ⅱ Proteomics Chapter 1. from genomics to proteomics Ⅱ 1 Functional genomics Functional genomics: study of relations of genomics to biological functions at systems level However, it cannot explain any more

More information

High-throughput Transcriptome analysis

High-throughput Transcriptome analysis High-throughput Transcriptome analysis CAGE and beyond Dr. Rimantas Kodzius, Singapore, A*STAR, IMCB rkodzius@imcb.a-star.edu.sg for KAUST 2008 Agenda 1. Current research - PhD work on discovery of new

More information

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis Experimental Design Dr. Matthew L. Settles Genome Center University of California, Davis settles@ucdavis.edu What is Differential Expression Differential expression analysis means taking normalized sequencing

More information

review Expression Microarrays Tiling genomic microarrays Sequencing methods Riassunto puntate precedenti RNA transcripts

review Expression Microarrays Tiling genomic microarrays Sequencing methods Riassunto puntate precedenti RNA transcripts Riassunto puntate precedenti Expression Microarrays Tiling genomic microarrays Sequencing methods RNA transcripts Depend on kind of RNA prep from cells: Total RNA Poly(A) + fraction Long RNA Small RNA.bound

More information

Introduction of RNA-Seq Analysis

Introduction of RNA-Seq Analysis Introduction of RNA-Seq Analysis Jiang Li, MS Bioinformatics System Engineer I Center for Quantitative Sciences(CQS) Vanderbilt University September 21, 2012 Goal of this talk 1. Act as a practical resource

More information

RNA-Seq Analysis. Simon Andrews, Laura v

RNA-Seq Analysis. Simon Andrews, Laura v RNA-Seq Analysis Simon Andrews, Laura Biggins simon.andrews@babraham.ac.uk @simon_andrews v2018-10 RNA-Seq Libraries rrna depleted mrna Fragment u u u u NNNN Random prime + RT 2 nd strand synthesis (+

More information

Methods and tools for exploring functional genomics data

Methods and tools for exploring functional genomics data Methods and tools for exploring functional genomics data William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington Outline Searching for

More information

Genome 373: Gene Predic/on I. Doug Fowler

Genome 373: Gene Predic/on I. Doug Fowler Genome 373: Gene Predic/on I Doug Fowler Outline Review of gene structure Scale of the problem Solu;ons Empirical methods Ab ini&o predic;on What is a gene? A locatable region of genomic sequence, corresponding

More information

The Iso-Seq Method: Transcriptome Sequencing Using Long Reads

The Iso-Seq Method: Transcriptome Sequencing Using Long Reads The Iso-Seq Method: Transcriptome Sequencing Using Long Reads Elizabeth Tseng, Ph.D. Senior Staff Scientist FIND MEANING IN COMPLEXITY For Research Use Only. Not for use in diagnostic procedures. Copyright

More information

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

TIGR THE INSTITUTE FOR GENOMIC RESEARCH Introduction to Genome Annotation: Overview of What You Will Learn This Week C. Robin Buell May 21, 2007 Types of Annotation Structural Annotation: Defining genes, boundaries, sequence motifs e.g. ORF,

More information

How to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling

How to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling How to design an HMM for a new problem Architecture/topology design: What are the states, observation symbols, and the topology of the state transition graph? Learning/Training: Fully annotated or partially

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Identification of individual motifs on the genome scale. Some slides are from Mayukh Bhaowal

Identification of individual motifs on the genome scale. Some slides are from Mayukh Bhaowal Identification of individual motifs on the genome scale Some slides are from Mayukh Bhaowal Two papers Nature 423, 241-254 (15 May 2003) Sequencing and comparison of yeast species to identify genes and

More information

Introduction to RNA sequencing

Introduction to RNA sequencing Introduction to RNA sequencing Bioinformatics perspective Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden November 2017 Olga (NBIS) RNA-seq November 2017 1 / 49 Outline Why sequence

More information

RNA-Seq Software, Tools, and Workflows

RNA-Seq Software, Tools, and Workflows RNA-Seq Software, Tools, and Workflows Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 1, 2016 Some mrna-seq Applications Differential gene expression analysis Transcriptional profiling Assumption:

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

Applied Bioinformatics - Lecture 16: Transcriptomics

Applied Bioinformatics - Lecture 16: Transcriptomics Applied Bioinformatics - Lecture 16: Transcriptomics David Hendrix Oregon State University Feb 15th 2016 Transcriptomics High-throughput Sequencing (deep sequencing) High-throughput sequencing (also

More information

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016 Introduction to RNAseq Analysis Milena Kraus Apr 18, 2016 Agenda What is RNA sequencing used for? 1. Biological background 2. From wet lab sample to transcriptome a. Experimental procedure b. Raw data

More information

NEXT GENERATION SEQUENCING. Farhat Habib

NEXT GENERATION SEQUENCING. Farhat Habib NEXT GENERATION SEQUENCING HISTORY HISTORY Sanger Dominant for last ~30 years 1000bp longest read Based on primers so not good for repetitive or SNPs sites HISTORY Sanger Dominant for last ~30 years 1000bp

More information

Measuring and Understanding Gene Expression

Measuring and Understanding Gene Expression Measuring and Understanding Gene Expression Dr. Lars Eijssen Dept. Of Bioinformatics BiGCaT Sciences programme 2014 Why are genes interesting? TRANSCRIPTION Genome Genomics Transcriptome Transcriptomics

More information

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs Nadia Pisanti University of Pisa & Leiden University Outline New Generation Sequencing (NGS), and the importance of detecting

More information

An introduction to RNA-seq. Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy

An introduction to RNA-seq. Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy An introduction to RNA-seq Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy The central dogma Genome = all DNA in an organism (genotype) Transcriptome = all RNA (molecular

More information

Performance comparison of five RNA-seq alignment tools

Performance comparison of five RNA-seq alignment tools New Jersey Institute of Technology Digital Commons @ NJIT Theses Theses and Dissertations Spring 2013 Performance comparison of five RNA-seq alignment tools Yuanpeng Lu New Jersey Institute of Technology

More information

Expressed genes profiling (Microarrays) Overview Of Gene Expression Control Profiling Of Expressed Genes

Expressed genes profiling (Microarrays) Overview Of Gene Expression Control Profiling Of Expressed Genes Expressed genes profiling (Microarrays) Overview Of Gene Expression Control Profiling Of Expressed Genes Genes can be regulated at many levels Usually, gene regulation, are referring to transcriptional

More information

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing Gene Regulation Solutions Microarrays and Next-Generation Sequencing Gene Regulation Solutions The Microarrays Advantage Microarrays Lead the Industry in: Comprehensive Content SurePrint G3 Human Gene

More information

RNA

RNA RNA sequencing Michael Inouye Baker Heart and Diabetes Institute Univ of Melbourne / Monash Univ Summer Institute in Statistical Genetics 2017 Integrative Genomics Module Seattle @minouye271 www.inouyelab.org

More information

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia RNA-Seq Workshop AChemS 2017 Sunil K Sukumaran Monell Chemical Senses Center Philadelphia Benefits & downsides of RNA-Seq Benefits: High resolution, sensitivity and large dynamic range Independent of prior

More information

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian, Genscan The Genscan HMM model Training Genscan Validating Genscan (c) Devika Subramanian, 2009 96 Gene structure assumed by Genscan donor site acceptor site (c) Devika Subramanian, 2009 97 A simple model

More information

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies Discovering gene regulatory control using ChIP-chip and ChIP-seq Part 1 An introduction to gene regulatory control, concepts and methodologies Ian Simpson ian.simpson@.ed.ac.uk http://bit.ly/bio2links

More information