Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Size: px
Start display at page:

Download "Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction"

Transcription

1 Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 1 / 36

2 Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 36

3 Motivation Discovery of the Nuclein (Friedrich Miescher, 1869) Tu bingen, around 1869 Discovery of Nuclein: from lymphocyte & salmon multi-basic acid ( 4) If one... wants to assume that a single substance... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein (Miescher, 1874) c Gunnar Ra tsch (FML, Tu bingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 2 / 36

4 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

5 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

6 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

7 Motivation Learning about the Transcriptome What is encoded on the genome and how is it processed? Computational Point of View How to learn to predict what these processes accomplish? How well can we predict it from the available information? Biological View What can we not predict yet? What is missing? Can we derive a deeper understanding of these processes? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 3 / 36

8 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

9 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Two different classes of observations c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

10 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models Example: Inferred classification rule c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

11 Motivation Machine Learning Learning from empirical observations Given: Observations of some complex phenomenon Goal: Learn from data & build predictive models 1 Large scale sequence classification 2 Analysis and explanation of learning results 3 Sequence segmentation & structure prediction k mer Length Position Log-intensity transcript c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 4 / 36

12 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... intron pre-mrna High-throughput exon transcriptome measurements Qualitative studies New transcripts mrna Improved gene models short reads Quantitative studies at high resolution junction reads Differential expression in tissues, conditions, reference genome genotypes, etc. Figure adapted from Wikipedia Goal: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

13 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... intron pre-mrna High-throughput exon transcriptome measurements Qualitative studies New transcripts mrna Improved gene models short reads Quantitative studies at high resolution junction reads Differential expression in tissues, conditions, reference genome genotypes, etc. Figure adapted from Wikipedia Goal: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

14 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... intron pre-mrna High-throughput exon transcriptome measurements Qualitative studies New transcripts mrna Improved gene models short reads Quantitative studies at high resolution junction reads Differential expression in tissues, conditions, reference genome genotypes, etc. Figure adapted from Wikipedia Goal: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

15 RNA-Seq Pipeline Deep RNA Sequencing (RNA-Seq) RNA-Seq allows... intron pre-mrna High-throughput exon transcriptome measurements Qualitative studies New transcripts mrna Improved gene models short reads Quantitative studies at high resolution junction reads Differential expression in tissues, conditions, reference genome genotypes, etc. Figure adapted from Wikipedia Goal: Obtain complete transcriptome for further analyses c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 5 / 36

16 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads Read Alignment Expression level ~ #reads Significance Testing Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

17 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

18 RNA-Seq Pipeline Common RNA-Seq Analysis Steps RNA-Seq Reads De novo Assembly Read Alignment Transcript Reconstruction TARs and Transcripts Expression level ~ #reads Significance Testing Transcript Quantitation Differentially Processed Transcripts/ Segments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 6 / 36

19 RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 36

20 RNA-Seq Pipeline RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 7 / 36

21 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes More info: c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

22 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

23 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT... matching k-mers... TCGCGCGCAACG TCGC k-mers CGCG GCGC CGCG GCGC CGCA GCAA Read CAAC AACG c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

24 PALMapper Read Alignment Overview Step 1: PALMapper Read Alignment (PALMapper = QPALMA + GenomeMapper) GenomeMapper for (unspliced) read mapping: Alignments based on GenomeMapper developed in Tübingen for the 1001 plant genome project [Schneeberger et al., 2009] k-mer based index, well suited for smaller genomes QPALMA for spliced read alignments: GenomeMapper identifies seed regions Spliced alignments by QPALMA [De Bona et al., 2008] DNA ACCGTCGCGCGCGT...TCGGCG...AGAACGCT TCGCGCGCAACG Read More info: c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 8 / 36

25 PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? C. elegans recall precision TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 36

26 PALMapper Read Alignment Accuracy PALMapper Accuracy Evaluation How accurately can PALMapper identify introns? C. elegans recall precision H. sapiens (chromosome 1) recall precision TopHat TopHat sensitive Palmapper PALMapper (3.5h) and TopHat (3.5h/10h) aligning 24M reads Palmapper Comparison of PALMapper with other alignment programs within the RGASP project (preliminary) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 9 / 36

27 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

28 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

29 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Classical scoring f : Σ Σ R Source of information Sequence matches Computational splice site predictions Intron length model Read quality information c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

30 PALMapper Read Alignment Accuracy QPALMA: Extended Smith-Waterman Scoring Source of information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : (Σ R) Σ R [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 10 / 36

31 PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 36

32 PALMapper Read Alignment Learning Algorithm Scoring Parameter Inference What are optimal parameters? How do we jointly optimize the 336 parameters? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 11 / 36

33 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true false Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

34 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Correct alignment is not highest scoring one Correct alignment is highest scoring one Can we do better? c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

35 PALMapper Read Alignment Learning Algorithm Cartoon: Maximize the Margin false true Technique motivated by SVMs ( large-margin ) Enforce a margin between correct and incorrect examples One has to solve a big quadratic problem c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 12 / 36

36 PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 36

37 PALMapper Read Alignment Learning Algorithm How Can We Generate Data for Training? Strategy: How do we obtain true alignments for training QPalma? Simulate realistic transcriptome reads with known origin 1 Estimate relationship between quality score and error probability from given reads 2 Use annotation of a few genes to simulate spliced reads 3 Introduce errors according to error model using quality strings from given read set 4 Train QPalma on generated read set with known alignments c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 13 / 36

38 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

39 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

40 PALMapper Read Alignment Reasons for High Accuracy QPALMA RNA-Seq Read Alignment Generate set of artificially spliced reads Genomic reads with quality information Genome annotation for artificially splicing the reads Use 10, 000 reads for training and 30, 000 for testing Error vs. intron position 8nt 12nt 12nt 8nt Alignment Error Rate 14.19% 9.96% 1.94% 1.78% SmithW Intron Intron+ Splice Intron+ Splice+ Quality [De Bona et al., 2008] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 14 / 36

41 PALMapper Read Alignment Reasons for High Accuracy Step 2: Transcript Prediction Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts A. Coverage segmentation algorithm mtim for general transcripts (no coding bias/assumption) B. Extension of the mgene gene finding system to use NGS data for protein coding transcript prediction (mgene.ngs) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 15 / 36

42 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

43 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

44 mtim Coverage Segmentation mtim: Read Coverage Segmentation Goal: Characterize each base as intergenic, exonic, or intronic annotated gene Idea read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 16 / 36

45 mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

46 mtim Coverage Segmentation Approach The mtim Segmentation Approach S E I intergenic exonic intronic feature score read coverage read coverage read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

47 mtim Coverage Segmentation Approach The mtim Segmentation Approach don S E I acc intergenic exonic intronic feature score read coverage read coverage read coverage Learn to associate a state with each position given its read coverage and local context HM-SVM training: Optimize transformations: signal score Extension: Score spliced reads and splice sites (G. Zeller et al., 2008; G. Zeller et al., in prep., 2009) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 17 / 36

48 mtim Coverage Segmentation Approach The mtim Segmentation Approach Idea: Assume uniform read coverage within exons of same transcript annotated genes read coverage (log-scale) Kbp c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 18 / 36

49 mtim Coverage Segmentation Approach The mtim Segmentation Approach don E 1 I 1 acc Discrete expression level 1 E 2 don I 2 2 S acc don E Q I Q acc intergenic exonic intronic Q Carry expression level information between exons of same transcript (G. Zeller et al., 2008; G. Zeller et al., in prep., 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 19 / 36

50 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [Altun et al., 2003, Rätsch et al., 2007, Zeller et al., 2008b] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

51 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [Altun et al., 2003, Rätsch et al., 2007, Zeller et al., 2008b] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

52 mtim Coverage Segmentation Approach Discriminative training of HM-SVMs f : R Σ given a sequence of hybridization measurements χ R predicts a state sequence (path) σ Σ Discriminant function F θ : R Σ R such that for decoding: f (χ) = argmax F θ (χ, σ). σ S Training: For each training example (χ (i), σ (i) ), enforce a large margin of separation F θ (χ (i), σ (i) ) F θ (χ (i), σ) ρ between the correct path σ (i) and any other wrong path σ σ (i). A quadratic programming problem (QP) is solved to optimize θ. [Altun et al., 2003, Rätsch et al., 2007, Zeller et al., 2008b] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 20 / 36

53 mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 36

54 mtim Coverage Segmentation Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/2 0.6 mtim expression percentiles [%] Sensitivity heavily depends on read density c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 21 / 36

55 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS polya/cleavage pre-mrna Splice Donor Splice Acceptor Splice Splice Donor Acceptor mrna TIS Stop cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

56 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

57 Next Generation Gene Finding Idea Computational Gene Finding Labeling the Genome DNA TSS Donor Acceptor Donor Acceptor polya/cleavage pre-mrna TIS Stop mrna cap polya Protein TSS TIS Stop cleave Don Acc c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 22 / 36

58 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model STEP 1: SVM Signal Predictions tss tis acc don stop genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

59 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

60 Next Generation Gene Finding mgene-based Transcript Prediction Idea genomic position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 23 / 36

61 Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio ESTs heuristic ESTs trained Gene prediction in C. elegans (CDS evaluation) Behr et al., in pre., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 24 / 36

62 Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction genomic position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop STEP 2: Integration transform features F(x,y) large margin genomic position c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 36

63 Next Generation Gene Finding Modeling Uncertainty mgene-based Transcript Prediction position True gene model Wrong gene model STEP 1: SVM Signal Predictions tss tis acc don stop Expression Evidence STEP 2: Integration transform SVM outputs with PLiF large margin score c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 25 / 36

64 Next Generation Gene Finding Modeling Uncertainty Learning to use Expression Measurements Two approaches: Heuristic to incorporate ESTs/reads/tiling array measurements to refine predictions Directly use evidence during learning to learn to appropriately weight its importance Exon Level Transcript Level SN SP F SN SP F ab initio ESTs heuristic ESTs trained RNA-Seq trained RNA-Seq/ESTs trained Gene prediction in C. elegans (CDS evaluation) Behr et al., in prep., 2010 c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 26 / 36

65 Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/ mgene ab initio mgene.ngs expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 36

66 Next Generation Gene Finding Results Preliminary Evaluation (C. elegans) 0.7 CDS (precision+recall)/ mtim mgene ab initio mgene.ngs expression percentiles [%] c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 27 / 36

67 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

68 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

69 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

70 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

71 Digestion Next Generation Gene Finding Alternative Transcripts mtim and mgene.ngs predict single transcripts mtim exploits uniformity of read coverage among exons of same transcript mgene.ngs uses more assumptions on structure of transcripts Alt. Transcripts: Spliced reads for splicing graph completion: Transcript prediction (result of mgene/mtim) Infered edges Spliced reads (result of QPALMA) Paths through splicing graph define alternative transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 28 / 36

72 Transcript Quantitation with rquant RNA-Seq Pipeline Overview Read alignment Transcript finding Quantitation mtim Segmentation PALMapper rquant mgene.ngs Short Reads Transcripts w/ Quantitation Alignments Transcripts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 29 / 36

73 AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 36

74 AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA AAAAAAA Transcript Quantitation with rquant Biases RNA-Seq Biases and Quantitation fragmentation Sequencer genome transcripts sequence library aligned reads RT junction reads mrna fragmentation RT short sequence reads exonic reads Biases due to... cdna library construction read coverage Sequencing Read mapping relative transcript position 5 -> 3 (average over annotated transcripts of length 1kb for the C. elegans SRX dataset) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 30 / 36

75 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 A M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

76 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage relative transcript position 5 -> 3 Long transcript read coverage relative transcript position 5 -> 3 M i = w A A i + w B B i min wa,w B i l (M i, R i ) A B c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

77 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage genome position 5 -> 3 Long transcript read coverage genome position 5 -> 3 A B M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

78 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage read coverage genome position 5 -> 3 genome position 5 -> 3 A B M M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

79 Transcript Quantitation with rquant rquant Basic Idea Short transcript Approach read coverage Mixture of transcripts genome position 5 -> 3 Long transcript read coverage expected observed read coverage genome position 5 -> 3 genome position 5 -> 3 A B M R M i = w A A i + w B B i min wa,w B i l (M i, R i ) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 31 / 36

80 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

81 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

82 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

83 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

84 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , read counts c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

85 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , profile weight read counts distance to transcript start/end c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

86 Transcript Quantitation with rquant Approach rquant Iterative Algorithm 1 Optimise transcript weights: min w i l ( t w (t) p (t) i, R i ) 2 Optimise profile weights: min p i l ( t w (t) p (t) i, R i ) 3 Repeat 1. and 2. until convergence. gene AT1G01240 chromosome 1, forward strand AT1G ,941 AT1G , AT1G ,178 99, , , , , profile weight read counts distance to transcript start/end c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 32 / 36

87 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

88 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

89 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

90 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

91 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

92 Transcript Quantitation with rquant rquant Evaluation I rquant: Position-wise with profiles compared to Position-wise, without profiles Results (estimating library and mapping bias) Segment-wise, without profiles (e.g., Jiang and Wong [2009] ) Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a]) Estimate transcript abundances Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b]) Subset of alternatively spliced genes Evaluation: Spearman correlation between Simulated RNA expression level and Predicted transcript weights c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 33 / 36

93 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

94 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

95 Transcript Quantitation with rquant rquant Evaluation II 0.9 Across transcripts 0.8 Within genes (mean) 0.7 Results Spearman Correlation Segment-wise, w/o profiles Segment-wise, w/ profiles Position-wise, w/o profiles rquant Position-wise, w/ profiles (Bohnert et al., submitted, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 34 / 36

96 Transcript Quantitation with rquant Results Galaxy-based Web Services for NGS Analyses Galaxy-based web service PALMapper mgene mtim (in prep.) rquant (Rätsch et al., in preparation, 2010) c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 35 / 36

97 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts Galaxy instance Easy use of these tools c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

98 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts Galaxy instance Easy use of these tools c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

99 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts Galaxy instance Easy use of these tools c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

100 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts Galaxy instance Easy use of these tools c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

101 Summary Summary PALMapper Splice site predictions improve alignment performance Outperforms many other read mappers in intron accuracy mtim High specificity, sensitivity depends on read coverage Better for identifying transcripts specific to experimental data mgene High sensitivity (also for lowly expressed genes) Identifies also non-expressed genes good for annotation rquant Models library prep., sequencing, alignment biases Accurately quantifies transcripts Galaxy instance Easy use of these tools c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 36 / 36

102 Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 36

103 Summary Acknowledgements Fabio De Bona Jonas Behr Georg Zeller Regina Bohnert Alignments Gene finding Segmentation Quantitation RGASP Team Jonas Behr (FML) Georg Zeller (FML & MPI) Regina Bohnert (FML) Funding by DFG & Max Planck Society. Thank you for your attention! c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 37 / 36

104 Summary References I Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3 10, J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. Rna-seq and tiling arrays for improved gene finding. URL http: // Oral presentation at the CSHL Genome Informatics Meeting, September RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in arabidopsis thaliana. Science, 317(5836): , ISSN (Electronic). doi: /science F. De Bona, S. Ossowski, K. Schneeberger, and G. Rätsch. Qpalma: Optimal spliced alignments of short sequence reads. Bioinformatics, 24:i174 i180, Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8): , April G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda B. Schoelkopf and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 38 / 36

105 Summary References II G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369 i377, June G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. Sommer, and B. Sch lkopf. Improving the c. elegans genome annotation using machine learning. PLoS Computational Biology, 3(2):e20, URL M. Sammeth. The Flux Capacitor. Website, 2009a. M. Sammeth. The Flux Simulator. Website, 2009b. Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biol, 10(9):R98, Jan doi: /gb r98. URL Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mgene: Accurate svm-based gene finding with an application to nematode genomes. Genome Research, URL Advance access June 29, S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, c Gunnar Rätsch (FML, Tübingen) Methods for Transcriptome Analysis NGS Bioinformatics, Paris 39 / 36

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010)

More information

Next Generation Genome Annotation with mgene.ngs

Next Generation Genome Annotation with mgene.ngs Next Generation Genome Annotation with mgene.ngs Jonas Behr, 1 Regina Bohnert, 1 Georg Zeller, 1,2 Gabriele Schweikert, 1,2,3 Lisa Hartmann, 1 and Gunnar Rätsch 1 1 Friedrich Miescher Laboratory of the

More information

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch

Discovering Common Sequence Variation in A. thaliana. Gunnar Rätsch Machine Learning Methods for Discovering Common Sequence Variation in A. thaliana Gunnar Rätsch Friedrich Miescher Laboratory, Max Planck Society, Tübingen Technical University Berlin March 31, 2008 Current

More information

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Abundance RNA is... Diverse Dynamic Central DNA rrna Epigenetics trna RNA mrna Time Protein Abundance

More information

RNA-Sequencing analysis

RNA-Sequencing analysis RNA-Sequencing analysis Markus Kreuz 25. 04. 2012 Institut für Medizinische Informatik, Statistik und Epidemiologie Content: Biological background Overview transcriptomics RNA-Seq RNA-Seq technology Challenges

More information

TRANSCRIPT NORMALIZATION AND SEGMENTATION OF TILING ARRAY DATA

TRANSCRIPT NORMALIZATION AND SEGMENTATION OF TILING ARRAY DATA TRANSCRIPT NORMALIZATION AND SEGMENTATION OF TILING ARRAY DATA GEORG ZELLER Friedrich Miescher Laboratory of the Max Planck Society & Max Planck Institute for Developmental Biology, Dept. for Molecular

More information

RNA-Seq with the Tuxedo Suite

RNA-Seq with the Tuxedo Suite RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop The Basic Tuxedo Suite References Trapnell C, et al. 2009 TopHat: discovering splice junctions with

More information

Gene Expression Technology

Gene Expression Technology Gene Expression Technology Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu Gene expression Gene expression is the process by which information from a gene

More information

Mapping strategies for sequence reads

Mapping strategies for sequence reads Mapping strategies for sequence reads Ernest Turro University of Cambridge 21 Oct 2013 Quantification A basic aim in genomics is working out the contents of a biological sample. 1. What distinct elements

More information

ARTS: Accurate Recognition of Transcription Starts in human

ARTS: Accurate Recognition of Transcription Starts in human ARTS: Accurate Recognition of Transcription Starts in human Sören Sonnenburg, Alexander Zien,, Gunnar Rätsch Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany Friedrich Miescher Laboratory of the

More information

measuring gene expression December 5, 2017

measuring gene expression December 5, 2017 measuring gene expression December 5, 2017 transcription a usually short-lived RNA copy of the DNA is created through transcription RNA is exported to the cytoplasm to encode proteins some types of RNA

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013) Genome annotation Erwin Datema (2011) Sandra Smit (2012, 2013) Genome annotation AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAAT AATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAA

More information

Data Mining in Bioinformatics Day 6: Classification in Bioinformatics

Data Mining in Bioinformatics Day 6: Classification in Bioinformatics Data Mining in Bioinformatics Day 6: Classification in Bioinformatics Karsten Borgwardt February 25 to March 10 Bioinformatics Group MPIs Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page

More information

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data Jan Hellemans 7th international qpcr & NGS Event - Freising March 24 th, 2015 Therapeutics lncrna oncology

More information

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist Whole Transcriptome Analysis of Illumina RNA- Seq Data Ryan Peters Field Application Specialist Partek GS in your NGS Pipeline Your Start-to-Finish Solution for Analysis of Next Generation Sequencing Data

More information

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project

More information

Gene Finding Genome Annotation

Gene Finding Genome Annotation Gene Finding Genome Annotation Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics Population biology & evolution Medical genomics

More information

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

TIGR THE INSTITUTE FOR GENOMIC RESEARCH Introduction to Genome Annotation: Overview of What You Will Learn This Week C. Robin Buell May 21, 2007 Types of Annotation Structural Annotation: Defining genes, boundaries, sequence motifs e.g. ORF,

More information

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian, Genscan The Genscan HMM model Training Genscan Validating Genscan (c) Devika Subramanian, 2009 96 Gene structure assumed by Genscan donor site acceptor site (c) Devika Subramanian, 2009 97 A simple model

More information

Introduction to RNA sequencing

Introduction to RNA sequencing Introduction to RNA sequencing Bioinformatics perspective Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden November 2017 Olga (NBIS) RNA-seq November 2017 1 / 49 Outline Why sequence

More information

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics RNA Sequencing T TM variation genetics validation SNP ncrna metagenomics private trio de novo exome mendelian ChIP-seq RNA DNA bioinformatics custom target high-throughput resequencing storage ncrna comparative

More information

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading: 132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, 214 1 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading: Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 155 12 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

Gene Identification in silico

Gene Identification in silico Gene Identification in silico Nita Parekh, IIIT Hyderabad Presented at National Seminar on Bioinformatics and Functional Genomics, at Bioinformatics centre, Pondicherry University, Feb 15 17, 2006. Introduction

More information

Measuring transcriptomes with RNA-Seq

Measuring transcriptomes with RNA-Seq Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2017 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC

More information

RNA-Seq Software, Tools, and Workflows

RNA-Seq Software, Tools, and Workflows RNA-Seq Software, Tools, and Workflows Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 1, 2016 Some mrna-seq Applications Differential gene expression analysis Transcriptional profiling Assumption:

More information

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia RNA-Seq Workshop AChemS 2017 Sunil K Sukumaran Monell Chemical Senses Center Philadelphia Benefits & downsides of RNA-Seq Benefits: High resolution, sensitivity and large dynamic range Independent of prior

More information

MAKER: An easy to use genome annotation pipeline. Carson Holt Yandell Lab Department of Human Genetics University of Utah

MAKER: An easy to use genome annotation pipeline. Carson Holt Yandell Lab Department of Human Genetics University of Utah MAKER: An easy to use genome annotation pipeline Carson Holt Yandell Lab Department of Human Genetics University of Utah Introduction to Genome Annotation What annotations are Importance of genome annotations

More information

Eukaryotic Gene Prediction. Wei Zhu May 2007

Eukaryotic Gene Prediction. Wei Zhu May 2007 Eukaryotic Gene Prediction Wei Zhu May 2007 In nature, nothing is perfect... - Alice Walker Gene Structure What is Gene Prediction? Gene prediction is the problem of parsing a sequence into nonoverlapping

More information

Computational gene finding. Devika Subramanian Comp 470

Computational gene finding. Devika Subramanian Comp 470 Computational gene finding Devika Subramanian Comp 470 Outline (3 lectures) The biological context Lec 1 Lec 2 Lec 3 Markov models and Hidden Markov models Ab-initio methods for gene finding Comparative

More information

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction Lecture 8 Reading Lecture 8: 96-110 Lecture 9: 111-120 DNA Libraries Definition Types Construction 142 DNA Libraries A DNA library is a collection of clones of genomic fragments or cdnas from a certain

More information

Gene Signal Estimates from Exon Arrays

Gene Signal Estimates from Exon Arrays Gene Signal Estimates from Exon Arrays I. Introduction: With exon arrays like the GeneChip Human Exon 1.0 ST Array, researchers can examine the transcriptional profile of an entire gene (Figure 1). Being

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

Measuring transcriptomes with RNA-Seq. BMI/CS 776  Spring 2016 Anthony Gitter Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu Overview RNA-Seq technology The RNA-Seq quantification problem Generative

More information

Bacterial Genome Annotation

Bacterial Genome Annotation Bacterial Genome Annotation Bacterial Genome Annotation For an annotation you want to predict from the sequence, all of... protein-coding genes their stop-start the resulting protein the function the control

More information

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018 Outline Overview of the GEP annotation projects Annotation of Drosophila Primer January 2018 GEP annotation workflow Practice applying the GEP annotation strategy Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT

More information

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme Illumina (Solexa) Current market leader Based on sequencing by synthesis Current read length 100-150bp Paired-end easy, longer matepairs harder Error ~0.1% Mismatch errors dominate Throughput: 4 Tbp in

More information

Analytics Behind Genomic Testing

Analytics Behind Genomic Testing A Quick Guide to the Analytics Behind Genomic Testing Elaine Gee, PhD Director, Bioinformatics ARUP Laboratories 1 Learning Objectives Catalogue various types of bioinformatics analyses that support clinical

More information

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 ChIP-Seq Data Analysis J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind DNA

More information

Methods and Algorithms for Gene Prediction

Methods and Algorithms for Gene Prediction Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. ccwei@sjtu.edu.cn http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C

More information

Predictive and Causal Modeling in the Health Sciences. Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics

Predictive and Causal Modeling in the Health Sciences. Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics Predictive and Causal Modeling in the Health Sciences Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics 1 Exponentially Rapid Data Accumulation Protein Sequencing

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies Introduction to Bioinformatics and Gene Expression Technologies Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 1 1 Vocabulary Gene: hereditary DNA sequence at a

More information

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015 Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015 The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn Bioinformatics bottleneck

More information

RNA Sequencing Analyses & Mapping Uncertainty

RNA Sequencing Analyses & Mapping Uncertainty RNA Sequencing Analyses & Mapping Uncertainty Adam McDermaid 1/26 RNA-seq Pipelines Collection of tools for analyzing raw RNA-seq data Tier 1 Quality Check Data Trimming Tier 2 Read Alignment Assembly

More information

Gene Structure & Gene Finding Part II

Gene Structure & Gene Finding Part II Gene Structure & Gene Finding Part II David Wishart david.wishart@ualberta.ca 30,000 metabolite Gene Finding in Eukaryotes Eukaryotes Complex gene structure Large genomes (0.1 to 10 billion bp) Exons and

More information

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing

Gene Regulation Solutions. Microarrays and Next-Generation Sequencing Gene Regulation Solutions Microarrays and Next-Generation Sequencing Gene Regulation Solutions The Microarrays Advantage Microarrays Lead the Industry in: Comprehensive Content SurePrint G3 Human Gene

More information

SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA

SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA The most sensitive cdna synthesis technology, combined with next-generation

More information

Interpreting RNA-seq data (Browser Exercise II)

Interpreting RNA-seq data (Browser Exercise II) Interpreting RNA-seq data (Browser Exercise II) In previous exercises, you spent some time learning about gene pages and examining genes in the context of the GBrowse genome browser. It is important to

More information

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity Genome Annotation Genome Sequencing Costliest aspect of sequencing the genome o But Devoid of content Genome must be annotated o Annotation definition Analyzing the raw sequence of a genome and describing

More information

Introduction to RNA-Seq

Introduction to RNA-Seq Introduction to RNA-Seq Monica Britton, Ph.D. Sr. Bioinformatics Analyst March 2015 Workshop Overview of RNA-Seq Activities RNA-Seq Concepts, Terminology, and Work Flows Using Single-End Reads and a Reference

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet May 2013 Standard sequence library generation Illumina

More information

Introduction to genome biology

Introduction to genome biology Introduction to genome biology Lisa Stubbs We ve found most genes; but what about the rest of the genome? Genome size* 12 Mb 95 Mb 170 Mb 1500 Mb 2700 Mb 3200 Mb #coding genes ~7000 ~20000 ~14000 ~26000

More information

Regulation of eukaryotic transcription:

Regulation of eukaryotic transcription: Promoter definition by mass genome annotation data: in silico primer extension EMBNET course Bioinformatics of transcriptional regulation Jan 28 2008 Christoph Schmid Regulation of eukaryotic transcription:

More information

GREG GIBSON SPENCER V. MUSE

GREG GIBSON SPENCER V. MUSE A Primer of Genome Science ience THIRD EDITION TAGCACCTAGAATCATGGAGAGATAATTCGGTGAGAATTAAATGGAGAGTTGCATAGAGAACTGCGAACTG GREG GIBSON SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc.

More information

Serial Analysis of Gene Expression

Serial Analysis of Gene Expression Serial Analysis of Gene Expression Cloning of Tissue-Specific Genes Using SAGE and a Novel Computational Substraction Approach. Genomic (2001) Hung-Jui Shih Outline of Presentation SAGE EST Article TPE

More information

Outline and learning objectives. From Proteomics to Systems Biology. Integration of omics - information

Outline and learning objectives. From Proteomics to Systems Biology. Integration of omics - information From to Systems Biology Outline and learning objectives Omics science provides global analysis tools to study entire systems How to obtain omics - What can we learn Limitations Integration of omics - In-class

More information

Lecture #1. Introduction to microarray technology

Lecture #1. Introduction to microarray technology Lecture #1 Introduction to microarray technology Outline General purpose Microarray assay concept Basic microarray experimental process cdna/two channel arrays Oligonucleotide arrays Exon arrays Comparing

More information

Baum-Welch and HMM applications. November 16, 2017

Baum-Welch and HMM applications. November 16, 2017 Baum-Welch and HMM applications November 16, 2017 Markov chains 3 states of weather: sunny, cloudy, rainy Observed once a day at the same time All transitions are possible, with some probability Each state

More information

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence Annotating 7G24-63 Justin Richner May 4, 2005 Zfh2 exons Thd1 exons Pur-alpha exons 0 40 kb 8 = 1 kb = LINE, Penelope = DNA/Transib, Transib1 = DINE = Novel Repeat = LTR/PAO, Diver2 I = LTR/Gypsy, Invader

More information

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute Introduction to Microarray Data Analysis and Gene Networks Alvis Brazma European Bioinformatics Institute A brief outline of this course What is gene expression, why it s important Microarrays and how

More information

SNPs - GWAS - eqtls. Sebastian Schmeier

SNPs - GWAS - eqtls. Sebastian Schmeier SNPs - GWAS - eqtls s.schmeier@gmail.com http://sschmeier.github.io/bioinf-workshop/ 17.08.2015 Overview Single nucleotide polymorphism (refresh) SNPs effect on genes (refresh) Genome-wide association

More information

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017 Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017 Agenda What is Functional Genomics? RNA Transcription/Gene Expression Measuring Gene

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 21 2011, pages 2957 2963 doi:10.1093/bioinformatics/btr507 Genome analysis Advance Access publication September 7, 2011 : fast length adjustment of short reads

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Changhui (Charles) Yan Old Main 401 F http://www.cs.usu.edu www.cs.usu.edu/~cyan 1 How Old Is The Discipline? "The term bioinformatics is a relatively recent invention, not

More information

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput Chapter 11: Gene Expression The availability of an annotated genome sequence enables massively parallel analysis of gene expression. The expression of all genes in an organism can be measured in one experiment.

More information

Original article: IDENTIFYING DNA SPLICE SITES USING PATTERNS STATISTICAL PROPERTIES AND FUZZY NEURAL NETWORKS

Original article: IDENTIFYING DNA SPLICE SITES USING PATTERNS STATISTICAL PROPERTIES AND FUZZY NEURAL NETWORKS Original article: IDENTIFYING DNA SPLICE SITES USING PATTERNS STATISTICAL PROPERTIES AND FUZZY NEURAL NETWORKS Essam Al-Daoud Computer Science Department, Faculty of Science and Information Technology,

More information

Form for publishing your article on BiotechArticles.com this document to

Form for publishing your article on BiotechArticles.com  this document to Your Article: Article Title (3 to 12 words) Article Summary (In short - What is your article about Just 2 or 3 lines) Category Transcriptomics sequencing and lncrna Sequencing Analysis: Quality Evaluation

More information

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) PROGRAM TITLE DEGREE TITLE Master of Science Program in Bioinformatics and System Biology (International Program) Master of Science (Bioinformatics

More information

Genomic region (ENCODE) Gene definitions

Genomic region (ENCODE) Gene definitions DNA From genes to proteins Bioinformatics Methods RNA PROMOTER ELEMENTS TRANSCRIPTION Iosif Vaisman mrna SPLICE SITES SPLICING Email: ivaisman@gmu.edu START CODON STOP CODON TRANSLATION PROTEIN From genes

More information

Microarrays: since we use probes we obviously must know the sequences we are looking at!

Microarrays: since we use probes we obviously must know the sequences we are looking at! These background are needed: 1. - Basic Molecular Biology & Genetics DNA replication Transcription Post-transcriptional RNA processing Translation Post-translational protein modification Gene expression

More information

De Novo Assembly of High-throughput Short Read Sequences

De Novo Assembly of High-throughput Short Read Sequences De Novo Assembly of High-throughput Short Read Sequences Chuming Chen Center for Bioinformatics and Computational Biology (CBCB) University of Delaware NECC Third Skate Genome Annotation Workshop May 23,

More information

user s guide Question 3

user s guide Question 3 Question 3 During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers.

More information

Introduction to the UCSC genome browser

Introduction to the UCSC genome browser Introduction to the UCSC genome browser Dominik Beck NHMRC Peter Doherty and CINSW ECR Fellow, Senior Lecturer Lowy Cancer Research Centre, UNSW and Centre for Health Technology, UTS SYDNEY NSW AUSTRALIA

More information

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE ACCELERATING PROGRESS IS IN OUR GENES AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE GENESPRING GENE EXPRESSION (GX) MASS PROFILER PROFESSIONAL (MPP) PATHWAY ARCHITECT (PA) See Deeper. Reach Further. BIOINFORMATICS

More information

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H Introduction to ChIP Seq data analyses Acknowledgement: slides taken from Dr. H Wu @Emory ChIP seq: Chromatin ImmunoPrecipitation it ti + sequencing Same biological motivation as ChIP chip: measure specific

More information

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter Hot off the press January 27, 2009 06:00 AM Eastern Time llumina Launches Suite of Next-Generation Sequencing

More information

CSE182-L16. LW statistics/assembly

CSE182-L16. LW statistics/assembly CSE182-L16 LW statistics/assembly Silly Quiz Who are these people, and what is the occasion? Genome Sequencing and Assembly Sequencing A break at T is shown here. Measuring the lengths using electrophoresis

More information

Long and short/small RNA-seq data analysis

Long and short/small RNA-seq data analysis Long and short/small RNA-seq data analysis GEF5, 4.9.2015 Sami Heikkinen, PhD, Dos. Topics 1. RNA-seq in a nutshell 2. Long vs short/small RNA-seq 3. Bioinformatic analysis work flows GEF5 / Heikkinen

More information

Course Presentation. Ignacio Medina Presentation

Course Presentation. Ignacio Medina Presentation Course Index Introduction Agenda Analysis pipeline Some considerations Introduction Who we are Teachers: Marta Bleda: Computational Biologist and Data Analyst at Department of Medicine, Addenbrooke's Hospital

More information

RNA-Seq analysis workshop

RNA-Seq analysis workshop RNA-Seq analysis workshop Zhangjun Fei Boyce Thompson Institute for Plant Research USDA Robert W. Holley Center for Agriculture and Health Cornell University Outline Background of RNA-Seq Application of

More information

SIMS2003. Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School. Introduction to Microarray Technology.

SIMS2003. Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School. Introduction to Microarray Technology. SIMS2003 Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School Introduction to Microarray Technology. Lecture 1 I. EXPERIMENTAL DETAILS II. ARRAY CONSTRUCTION III. IMAGE ANALYSIS Lecture

More information

Variant calling in NGS experiments

Variant calling in NGS experiments Variant calling in NGS experiments Jorge Jiménez jjimeneza@cipf.es BIER CIBERER Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain) 1 Index 1. NGS workflow 2. Variant calling

More information

Surely Better Target Enrichment from Sample to Sequencer and Analysis

Surely Better Target Enrichment from Sample to Sequencer and Analysis sureselect TARGET ENRIChment solutions Surely Better Target Enrichment from Sample to Sequencer and Analysis Agilent s market leading SureSelect platform provides a complete portfolio of catalog to custom

More information

Make the protein through the genetic dogma process.

Make the protein through the genetic dogma process. Make the protein through the genetic dogma process. Coding Strand 5 AGCAATCATGGATTGGGTACATTTGTAACTGT 3 Template Strand mrna Protein Complete the table. DNA strand DNA s strand G mrna A C U G T A T Amino

More information

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09

Molecular Biology Primer. CptS 580, Computational Genomics, Spring 09 Molecular Biology Primer pts 580, omputational enomics, Spring 09 Starting 19 th century What do we know of cellular biology? ell as a fundamental building block 1850s+: ``DNA was discovered by Friedrich

More information

If Dna Has The Instructions For Building Proteins Why Is Mrna Needed

If Dna Has The Instructions For Building Proteins Why Is Mrna Needed If Dna Has The Instructions For Building Proteins Why Is Mrna Needed if a strand of DNA has the sequence CGGTATATC, then the complementary each strand of DNA contains the info needed to produce the complementary

More information

Inference of Isoforms from Short Sequence Reads (Extended Abstract)

Inference of Isoforms from Short Sequence Reads (Extended Abstract) Inference of Isoforms from Short Sequence Reads (Extended Abstract) Jianxing Feng 1, Wei Li 2, and Tao Jiang 3 1 State Key Laboratory on Intelligent Technology and Systems Tsinghua National Laboratory

More information

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7 Computational requirements

More information

Student Learning Outcomes (SLOS)

Student Learning Outcomes (SLOS) Student Learning Outcomes (SLOS) KNOWLEDGE AND LEARNING SKILLS USE OF KNOWLEDGE AND LEARNING SKILLS - how to use Annhyb to save and manage sequences - how to use BLAST to compare sequences - how to get

More information

Differential gene expression analysis using RNA-seq

Differential gene expression analysis using RNA-seq https://abc.med.cornell.edu/ Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, August 2017 Friederike Dündar with Luce Skrabanek & Ceyda Durmaz Day 3 QC of aligned reads

More information

Tree Depth in a Forest

Tree Depth in a Forest Tree Depth in a Forest Mark Segal Center for Bioinformatics & Molecular Biostatistics Division of Bioinformatics Department of Epidemiology and Biostatistics UCSF NUS / IMS Workshop on Classification and

More information

Local assembly and pre-mrna splicing analyses by high-throughput sequencing data

Local assembly and pre-mrna splicing analyses by high-throughput sequencing data Graduate Theses and Dissertations Graduate College 2012 Local assembly and pre-mrna splicing analyses by high-throughput sequencing data Hsien-chao Chou Iowa State University Follow this and additional

More information

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding UCSC Genome Browser Introduction to ab initio and evidence-based gene finding Wilson Leung 06/2006 Outline Introduction to annotation ab initio gene finding Basics of the UCSC Browser Evidence-based gene

More information

FAST AND ACCURATE GENE PREDICTION BY PROTEIN HOMOLOGY

FAST AND ACCURATE GENE PREDICTION BY PROTEIN HOMOLOGY FAST AND ACCURATE GENE PREDICTION BY PROTEIN HOMOLOGY by Rong She Master of Science, Simon Fraser University, 2003 Bachelor of Engineering, Shanghai Jiaotong University, 1993 THESIS SUBMITTED IN PARTIAL

More information

Introduction to Bioinformatics and Gene Expression Technology

Introduction to Bioinformatics and Gene Expression Technology Vocabulary Introduction to Bioinformatics and Gene Expression Technology Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 1.1 Gene: Genetics: Genome: Genomics: hereditary DNA

More information

Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences

Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences Michiel O. Noordewier Computer Science Rutgers University New Brunswick, NJ 08903 Geoffrey G. Towell Computer Sciences University

More information

Wednesday, November 22, 17. Exons and Introns

Wednesday, November 22, 17. Exons and Introns Exons and Introns Introns and Exons Exons: coded regions of DNA that get transcribed and translated into proteins make up 5% of the genome Introns and Exons Introns: non-coded regions of DNA Must be removed

More information