Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013
RNA is...
[Figure panels: Abundance (rRNA, tRNA, mRNA); Diverse; Dynamic (time); Central (DNA, epigenetics, RNA, protein).]
Qualitative, quantitative, integrative. Understand the molecular basis of gene function. Classify and transform cellular states.
RNA studies involve...
Biological system, questions, project, available resources, technology (DB, ~/bin).
This talk focuses on reference-based mammalian RNA-seq analysis.
Transcriptional Complexity
[Figure: genomic DNA locus with several transcription start sites (TSS) and polyadenylation signals (pA), spliced introns, protein-coding regions, non-coding regions, translation start sites and polyadenylation. Annotated features: microRNAs (miRNA), tiRNA, PASR, Alu, mutations, allelic expression, RNA editing.]
RNA-seq
[Figure: locus (TSS, pA, tiRNA, PASR, Alu, miRNA) overlaid with RNA-seq reads: non-spliced reads, junction reads, strand-specific reads and reads carrying mutations.] Cloonan et al. Nat Methods 2008; 5:613-619
Advantages of RNA-seq
Discovery: genes, exons, junctions, UTRs, fusions (present and future).
Dynamic range. [Figure: plot, labels not recoverable.] Mortazavi et al. Nat. Methods 2008; 5:621-628
Nucleotide specific.
Typical experiment workflow
[Figure: workflow spanning Field / Clinic, Wet Lab and Dry Lab. Steps shown: Design Experiment, Run Experiment, Sample Acquisition (Field / Clinic / Lab), Obtain RNA, Make Library, Sequencing, Base Calling, Mapping, Library QC, Analysis, Interpretation, Verification, Validation, Publish.]
Library Construction
[Figure: cellular RNA composition: rRNA (80%), tRNA (15%), remaining 5%.]
Steps: target RNA (deplete rRNA, enrich polyA RNA, profile ribosomes, or capture with tiling arrays), fragment, ds-cDNA synthesis, ligate adaptors + amplify, sequencing.
RNA-seq Mapping
Challenge #1: Introns. Align to a database of junctions or to the transcriptome; split-read alignments. Wood et al. Bioinformatics 2011; 27:580-581. Trapnell et al. Bioinformatics 2009; 25:1105-11
Challenge #2: Correctness. Sufficient overlap; sufficient evidence.
Challenge #3: Multi-mappers. Sequence similarity; align to the transcriptome.
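Challenge #2 asks for sufficient overlap and sufficient evidence before a junction is trusted. The following is a minimal sketch of one way to apply both checks, assuming each junction-spanning read has already been summarised as a junction identifier plus the number of bases aligned on either side of the junction. The function name, the input layout and both thresholds (8 nt overhang, 2 supporting reads) are hypothetical illustrations, not values taken from X-MATE or TopHat.

```python
from collections import defaultdict


def supported_junctions(junction_reads, min_overhang=8, min_reads=2):
    """Return junctions that pass both checks, with their supporting read counts.

    junction_reads: iterable of (junction_id, left_overhang, right_overhang).
    min_overhang and min_reads are illustrative thresholds (assumptions).
    """
    support = defaultdict(int)
    for junction_id, left, right in junction_reads:
        # Sufficient overlap: the read must extend far enough into both exons.
        if min(left, right) >= min_overhang:
            support[junction_id] += 1
    # Sufficient evidence: keep junctions seen in enough well-anchored reads.
    return {j: n for j, n in support.items() if n >= min_reads}


# Toy input: j1 has two well-anchored reads, j2 only one, j3 none.
example_reads = [
    ("j1", 20, 30), ("j1", 10, 40), ("j1", 3, 47),
    ("j2", 25, 25),
    ("j3", 2, 48), ("j3", 4, 46),
]
kept = supported_junctions(example_reads)
```

Raising either threshold trades sensitivity for fewer false junctions, so the values should be chosen with read length and depth in mind.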
RNA-seq Mapping
Pipeline: data QC (clipping); align to filter set (exclude; flag and exclude); align to genome; align to junctions; split-read alignment; choose alignments, disambiguate. Tophat: Trapnell et al. Bioinformatics 2009; 25:1105-11
Output: BAM files, used for alignment filtering, library QC and analysis.
Decisions: rRNA, tRNA? reference? diploid? gene model? ESTs? algorithm?
Library Quality Control (QC)
[Figure: cellular RNA composition: rRNA (80%), tRNA (15%), remaining 5%.]
Target RNA (deplete rRNA, enrich polyA RNA, profile ribosomes, capture with tiling arrays): affects RNA content (expression quantification).
Fragment: affects insert size (transcript identification).
ds-cDNA synthesis: affects strand specificity.
Ligate adaptors + amplify: affects library complexity (tag uniqueness).
Sequencing (paired-end?): affects mapping rate.
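The QC properties listed on this slide can each be summarised as a simple ratio once reads have been aligned and counted. The sketch below is illustrative only: the function name, the argument names and the exact definitions (for example, complexity as unique alignment start positions per mapped read) are assumptions, not definitions given in the talk.

```python
def library_qc(total_reads, mapped_reads, rrna_reads,
               sense_reads, antisense_reads, unique_start_positions):
    """Summarise a library with four ratios (definitions are assumptions)."""
    if total_reads <= 0 or mapped_reads <= 0:
        raise ValueError("total_reads and mapped_reads must be positive")
    stranded = sense_reads + antisense_reads
    return {
        # Fraction of sequenced reads that aligned anywhere.
        "mapping_rate": mapped_reads / total_reads,
        # Fraction of sequenced reads that hit the rRNA filter set.
        "rrna_fraction": rrna_reads / total_reads,
        # Fraction of gene-overlapping reads on the annotated strand.
        "strand_specificity": sense_reads / stranded if stranded else None,
        # Unique start positions per mapped read, a proxy for tag uniqueness.
        "complexity": unique_start_positions / mapped_reads,
    }


qc = library_qc(total_reads=1000, mapped_reads=800, rrna_reads=50,
                sense_reads=720, antisense_reads=80,
                unique_start_positions=600)
```

Comparing these ratios across libraries in the same project is usually more informative than judging any single value in isolation.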
Calculate Gene Expression
Gene A: 3500 nt, 700 reads, RPKM = 2.0. Gene B: 400 nt, 160 reads, RPKM = 4.0.
RPKM: Reads Per Kilobase per Million.
$\mathrm{RPKM} = \dfrac{R \times 10^{3} \times 10^{6}}{L \times N}$, where R = gene read count, L = length of gene, N = library size. Mortazavi et al. Nat. Methods 2008; 5:621-628
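The Gene A and Gene B values can be reproduced with a few lines of code. This is a minimal sketch: the function name is hypothetical, and the library size of 100 million reads is an assumption inferred from the slide's numbers (it is the value of N that yields RPKM 2.0 and 4.0), since the slide does not state N.

```python
def rpkm(read_count, length_nt, library_size):
    """Reads Per Kilobase per Million: R * 10^3 * 10^6 / (L * N)."""
    if length_nt <= 0 or library_size <= 0:
        raise ValueError("length_nt and library_size must be positive")
    return read_count * 1e3 * 1e6 / (length_nt * library_size)


# Assumed library size (not stated on the slide): 100 million reads.
LIBRARY_SIZE = 100_000_000

gene_a = rpkm(read_count=700, length_nt=3500, library_size=LIBRARY_SIZE)
gene_b = rpkm(read_count=160, length_nt=400, library_size=LIBRARY_SIZE)
```

Gene A collects more reads than Gene B but has the lower RPKM, because its reads are spread over a much longer transcript.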
Further Normalisation
Normalise to mappable gene length. [Figure: gene containing a repeat.] Koehler et al. Bioinformatics 2010
Scale expression values by TMM. [Figure: cellular RNA and RPKM compared between Cond. 1 and Cond. 2.] Robinson et al. Genome Biology 2010; 11:R25
Normalise to GC content of region. Benjamini et al. NAR; 2012
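TMM (trimmed mean of M-values) scaling can be sketched as follows: compute per-gene log ratios (M) and mean log abundances (A) between a sample and a reference, trim the extremes of both, and take a precision-weighted mean of the remaining M-values. The code below is a simplified illustration written from the published description of the method, assuming trim fractions of 30% for M and 5% for A. It is not the edgeR implementation, and the function name and toy counts are hypothetical.

```python
import math


def tmm_factor(sample, reference, trim_m=0.3, trim_a=0.05):
    """Simplified TMM scaling factor for `sample` relative to `reference`.

    sample, reference: read counts for the same genes in the same order.
    """
    n_s, n_r = sum(sample), sum(reference)
    rows = []  # (M, A, weight) for genes observed in both libraries
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue
        p_s, p_r = y_s / n_s, y_r / n_r
        variance = (n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r)
        if variance <= 0:
            continue
        m = math.log2(p_s / p_r)         # log fold change
        a = 0.5 * math.log2(p_s * p_r)   # mean log abundance
        rows.append((m, a, 1.0 / variance))
    n = len(rows)
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    by_m = sorted(range(n), key=lambda i: rows[i][0])
    by_a = sorted(range(n), key=lambda i: rows[i][1])
    # Keep genes that survive trimming on both M and A.
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        return 1.0
    total_w = sum(rows[i][2] for i in keep)
    mean_m = sum(rows[i][2] * rows[i][0] for i in keep) / total_w
    return 2.0 ** mean_m


# Toy data: 18 unchanged genes plus 2 genes strongly induced in condition 2.
cond1 = [100] * 20
cond2 = [100] * 18 + [1100, 1100]
factor = tmm_factor(cond2, cond1)
effective_library_size = sum(cond2) * factor
```

In this toy case the two induced genes double the library size of condition 2, so library-size scaling alone would make the 18 unchanged genes look halved. The TMM factor rescales the effective library size so that those genes compare as unchanged.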
Calculate Feature Expression
Calculate RPKM for any feature: exonic region, exon junction, intronic region, exon boundary, intergenic region, extended 3' UTR, retained intron.
Calculate Transcript Expression
Approach #1: Expression calculated using diagnostic features. ALEXA-seq: Griffith et al. Nat. Methods 2010; 7:843-847
Strengths: strong evidence; easy to calculate.
Limitations: sampling variability; lacks statistical robustness; dependent on gene model; excludes transcripts.
Calculate Transcript Expression
Approach #2: Expression estimated. Constructs a bipartite graph, then finds a minimum path cover. Cufflinks: Trapnell et al. Nat. Biotech. 2010; 28:511-515
Strengths: estimates expression for all transcripts; incorporates ambiguous reads; more statistically robust.
Limitations: model can fail in complex / highly expressed regions; error rate largely unknown.
Expressed or not?
[Figure: frequency histogram of log2(expression) for Cond. 1, Cond. 2 and Cond. 3, showing not-expressed and expressed populations.]
Need to determine an expression cut-off value.
Expressed or not?
1 Expressed if > 1 RPKM. Has literature support; lacks sensitivity; arbitrary.
2 Expressed if above intergenic background. Cut-off based on empirical evidence; still somewhat arbitrary. [Figure: frequency versus log2 expression, 95th percentile marked.]
3 Incorporate replicate information. Based on observed reproducibility; requires replicates. [Figure: npIDR versus log2(expression) bins for Rep 1 vs Rep 2, Rep 2 vs Rep 1 and their mean, cut-off marked.]
Choose what is reasonable for your experiment, be consistent!
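Option 2, a cut-off at the 95th percentile of intergenic background, can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate and give slightly different values), and the RPKM values are made-up toy data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def expressed_genes(gene_rpkm, intergenic_rpkm, pct=95):
    """Call a gene expressed if its RPKM exceeds the background percentile."""
    cutoff = percentile(intergenic_rpkm, pct)
    expressed = {gene for gene, value in gene_rpkm.items() if value > cutoff}
    return cutoff, expressed


# Toy background: RPKM of 20 intergenic regions (0.1, 0.2, ..., 2.0).
background = [i / 10 for i in range(1, 21)]
genes = {"geneA": 2.5, "geneB": 1.9, "geneC": 0.3}
cutoff, expressed = expressed_genes(genes, background)
```

Because a percentile depends only on rank order, the same genes are selected whether the cut-off is computed on RPKM or on log2(RPKM).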
Nucleotide-Resolution Analysis
Applications: imprinting (ICR), eQTL, sQTL, complex traits.
Allelic fraction. [Figure: SNPs A, B and C.]
Reference bias. [Figure: density of the fraction of RNA-seq reads matching the reference allele, with expected and observed means marked.] Degner et al. Bioinformatics 2009
Map to a diploid genome. AlleleSeq: Rozowsky et al. Mol. Sys. Bio 2011
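The allelic fraction and the reference-bias check can be illustrated with a short calculation. The sketch assumes per-SNP counts of reads matching the reference and alternate alleles are already available, and that the expected mean reference fraction is 0.5 for heterozygous sites without allele-specific expression. The function names and the counts are hypothetical.

```python
def reference_fraction(ref_reads, alt_reads):
    """Fraction of reads at one SNP that match the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("SNP has no covering reads")
    return ref_reads / total


def mean_reference_fraction(snp_counts):
    """Mean reference fraction over (ref_reads, alt_reads) pairs."""
    fractions = [reference_fraction(ref, alt) for ref, alt in snp_counts]
    return sum(fractions) / len(fractions)


# Toy counts for three heterozygous SNPs (A, B, C).
snps = [(60, 40), (55, 45), (50, 50)]
observed_mean = mean_reference_fraction(snps)
expected_mean = 0.5  # assumption: no true allelic imbalance on average
reference_bias = observed_mean - expected_mean
```

A positive value of `reference_bias` across many SNPs suggests that reads carrying the alternate allele are being lost during alignment, which is the situation that mapping to a diploid genome is meant to address.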
The future of RNA-seq (now)
Single cell. Shalek, et al. Nature 2013
Huge cohort. Genotype-Tissue Expression project (GTEx): 900 donors, 30,000 RNA-seq data sets! Lonsdale, et al. Nature Genetics 2013
Summary
1 Choose an alignment approach suitable for your experiment, available resources and tools.
2 Assess library quality, specifically rRNA contamination, insert size, strand specificity and library complexity.
3 Gene and feature expression can be calculated using count data, and normalised by length, library size and GC content.
4 Transcript expression calculation requires alternative approaches and algorithms which, although common, are largely unproven.
5 RNA-seq can interrogate nucleotide-specific questions, but be careful of alignment biases (diploid mapping can help here).
Questions and References
Cloonan et al. Nat Methods 2008; Stem cell transcriptome profiling via massive-scale mRNA sequencing
Mortazavi et al. Nat. Methods 2008; Mapping and quantifying mammalian transcriptomes by RNA-Seq
Wood et al. Bioinformatics 2011; X-MATE: A flexible system for mapping short read data
Trapnell et al. Bioinformatics 2009; TopHat: discovering splice junctions with RNA-Seq
Koehler et al. Bioinformatics 2010; The Uniqueome: A mappability resource for short-tag sequencing
Robinson et al. Genome Biology 2010; A scaling normalization method for differential expression analysis of RNA-seq data
Benjamini et al. NAR 2012; Summarizing and correcting the GC content bias in high-throughput sequencing
Griffith et al. Nat. Methods 2010; Alternative expression analysis by RNA sequencing
Trapnell et al. Nat. Biotech. 2010; Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform
Degner et al. Bioinformatics 2009; Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing
Rozowsky et al. Mol. Sys. Bio 2011; AlleleSeq: analysis of allele-specific expression and binding in a
Shalek, et al. Nature 2013; Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells
Lonsdale, et al. Nature Genetics 2013; The Genotype-Tissue Expression (GTEx) project