CS-E5870 High-Throughput Bioinformatics RNA-seq analysis

Size: px

Start display at page:

Download "CS-E5870 High-Throughput Bioinformatics RNA-seq analysis"

Dale Jennings
5 years ago
Views:

1 CS-E5870 High-Throughput Bioinformatics RNA-seq analysis Harri Lähdesmäki Department of Computer Science Aalto University September 30, 2016 Acknowledgement for J Salojärvi and E Czeizler for the previous version of the slides Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis

tid=15&sid=22 Alternative splicing A process of making alternative mrna molecules from the

2 Gene transcription A process of making an RNA copy of a gene sequence in DNA Figure from Alternative splicing A process of making alternative mrna molecules from the same precursor RNA In humans, 95% of multi-exonic genes are are alternatively spliced Figure from

Alternative splicing mechanisms Alternative splicing happens co-transcriptionally and is largely regulated by splicing factors (proteins) that bind RNA motifs

3 Alternative splicing mechanisms Alternative splicing happens co-transcriptionally and is largely regulated by splicing factors (proteins) that bind RNA motifs located in the pre-mrna Figure from (Chen & Manley, 2009) Different types of alternative splicing Basic modes of alternative splicing Figure from (Cartegni et al, 2002)

RNA-seq High-throughput sequencing of RNA (or cdna) to provide a comprehensive picture of the transcriptome Types of RNA molecules ribosomal RNA rrna 85-90% transfer RNA trna 10% mrna messenger RNA

4 RNA-seq High-throughput sequencing of RNA (or cdna) to provide a comprehensive picture of the transcriptome Types of RNA molecules ribosomal RNA rrna 85-90% transfer RNA trna 10% mrna messenger RNA 1-5% micro RNA and other mirna, pirna, etc rest RNA-seq basic pipeline 1 RNA population is converted to a library of cdna fragments with adaptors attached to one or both ends 2 High-throughput sequencing for the cdna fragment library (single-end or paired-end), read length bp 3 Alignment against reference genome or transcriptome, transcriptome reconstruction, expression quantification, etc Figure from (Wang et al, 2009)

5 RNA-seq vs microarrays RNA-seq measures any transcripts in a sample while microarrays measure only genes corresponding to predetermined probes on a microarray With RNA-seq, there is no need to identify/synthesize probes prior to a measurement RNA-seq provides count data which can be a better proxy for the amount of mrna in a sample than the fluorescence measurements produced with microarray technology RNA-seq provides information about transcript sequence in addition to information about transcript abundance Thus, with RNA-seq it is possible to measure the expression of different alternatively spliced transcripts that has turned out to be difficult with microarray technology Sequence information also permits the identification of SNPs, allele specific expression, single nucleotide polymorphisms (SNPs), and other forms of sequence variation What can we do with RNA-seq data? Transcript assembly Construct full-length transcript sequences from the RNA-seq data (either with or without the knowledge of the reference genome) Identify variants Transcript quantification Given transcript sequence annotations (reference), estimate the gene expression or abundances of all different transcripts (alternative transcript isoforms for a gene) Differential expression Statistical inference for differential gene expression or alternative splicing

6 Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis RNA-seq read alignment If full-length transcript annotations are known, then reads can be aligned exactly as aligning against a reference genome If transcript annotations are not known, still similar approaches as for aligning DNA sequence reads will work with some modifications Transcriptomic reads can span exon junctions Transcriptomic reads can contain poly(a) ends (from post-transcriptional RNA processing) Exon A Exon B Exon C Processed mrna Mapping to genome Figure from (Trapnell & Salzberg, 2009)

7 TopHat pipeline We will look at TopHat, a commonly used tool for RNA-seq alignment All reads are mapped to the reference genome using Bowtie Reads that do not map to the genome are set aside as initially unmapped reads (IUM reads) Figure from (Trapnell et al, 2009) TopHat pipeline Consensus assemble of initially mapped reads with Maq assemble For low-quality or low-coverage positions, use reference genome to call the base Consensus exons are likely missing some amount of sequence at ends TopHat considers flanking sequences from reference genome (default=45bp) Merge neighboring exons with very short gap to a single exon (low transcribed genes can Figure from (Trapnell et al, 2009) have coverage gap)

8 TopHat pipeline To map reads to splice junctions: Enumerate all canonical donor and acceptor sites in consensus exons Consider all possible pairings (allowed intron length is an adjustable parameter) For each candidate splice junction, find initially unmapped reads that span them: seed-and-extend approach 234 _ Figure from (Trapnell et al, 2009) TopHat pipeline 004 _ B3gat1 B3gat B3gat1 BC Seed-and-extend: Pre-compute an index of reads: lookup table based on 2k-mer keys, which are associated with reads that contain the 2k-mer in their high-quality region (default k = 5) For candidate splice junction, concatenate the k bases downstream of the acceptor to the k bases upstream Query this 2k-mer against the read index (exact seed match, no mismatch allowed) Align remaining part of read left and right of the exact match (allowing fixed number of mismatches) erlapped by the 5 -UTR of another transcript Both isoforms are present in the brain tissue RNA sample The top track is the le read coverage reported by ERANGE for this region (Mortazavi et al, 2008) The lack of a large coverage gap causes TopHat ining both exons TopHat looks for introns within single islands in order to detect this junction The figure shows the normalized coverage of ns by uniquely mappable reads as reported by pts are clearly present in the RNA-Seq sample, region as a single island In order to detect such performance and specificity, TopHat looks for deeply sequenced During the island extraction rithm computes the following statistic for each to j in the map: j m=i d m 1 j i nm=0 (1) d m erage at coordinate m in the Bowtie map, and ce genome When scaled to range [0, 1000], malized depth of coverage for an island We nctions tend to fall within islands with high D s looks for junctions contained in islands with er can be changed by the user A high D -value Fig 3 The seed and Figure extendfrom alignment (Trapnell usedet to match al, 2009) reads to possible splice sites For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor Downloaded from bioinformaticsoxfordjournalsorg at Helsinki University of Technology on

9 Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis Transcriptome assembly Goal: define precise map of all transcripts and isoforms that are expressed in a particular sample Challenges Gene expression spans several orders of magnitude, with some genes represented by only few reads Reads can originate from mature mrna or from incompletely spliced precursor RNA For short reads, hard to determine from which isoform they were produced Two main classes of methods Genome-independent (de Brujin graph, see previous lecture) Genome-guided (after read alignment)

10 Transcriptome reconstruction with Cufflinks Genome-guided: takes TopHat spliced alignments as input With paired-end RNA-seq data, Bowtie and TopHat produce alignments where paired reads of the same fragment are treated together as single alignment a Map paired cdna fragment sequences to genome TopHat Spliced fragment alignments Figure from (Trapnell et al, 2010) Transcriptome reconstruction with Cufflinks Connect fragments in an overlap graph Each fragment (read pair) corresponds to a node Directed edge from node x to node y if the alignment for x started at a lower coordinate than y, the alignments overlapped in the genome, and the fragments were compatible (every implied intron in one fragment matched an identical implied intron in the other fragment) If two reads originate from different isoforms they are likely incompatible b Assembly Overlap graph Mutually incompatible fragments Figure from (Trapnell et al, 2010)

Transcriptome reconstruction with Cufflinks Construct from the overlap graph minimal set of transcript isoforms explaining all the fragments Minimum path cover problem Minimum path cover Dilworth s

11 Transcriptome reconstruction with Cufflinks Construct from the overlap graph minimal set of transcript isoforms explaining all the fragments Minimum path cover problem Minimum path cover Dilworth s theorem: maximum number of mutually incompatible fragments equals minimum number of paths covering the whole graph (=minimum number of transcripts needed to explain all the fragments) Transcripts Figure from (Trapnell et al, 2010) Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis

12 Gene expression quantification Expression of a gene: sum of the expression of all its isoforms Computing isoform abundances can be computationally challenging Simplified counting schemes without computing isoform abundances Exon union method: count reads mapped to any of the exons Exon intersection method: count reads mapped to constitutive exons c Isoform 1 Exon union method Isoform 2 Exon intersection method Figure from (Garber et al, 2011) Gene expression quantification REVIEW Cost of the simplified models REVIEW The union model tends to underestimate expression for verage across each path to decide alternatively a most likely to originate from the spliced genes The intersection 1 can 2 Low High reduce power for differential expression e similar computational requirepersonal computer Both assem- Short transcript Long transcript analysis 3 4 igh expression levels but differ sed transcripts where Cufflinks b d 10 ersus 25,000) most of which do 4 Exon union model Isoform 1 Transcript model 10 ance threshold used by Scripture 3 Isoform 2 Transcript expression method 10 pplementary Fig 3) In ccontrast, 2 10 s per locus (average of 16 veronly for a handful of transcripts 0 Exon union method 1 Isoform he most extreme case, Scripture 1 Isoform 2 a single locus whereas Cufflinks e gene Likelihood of isoform 2 0% 25% 25% Read count Exon intersection Isoform method 1 Isoform 2 100% Estimated FPKM FPKM True FPKM 10 4 truction Rather than mapping first, genome-independent tranrithms such as transabyss 53 use nsus transcripts Consensus ed to a genome or aligned to a nnotation purposes The central dent approaches is to partition Confidence interval Figure 3 An overview of gene expression quantification with RNA-seq Figure from (Garber et al, 2011) (a) Illustration of transcripts c of different lengths with different read coverage levels (left) as well as total read counts observed for each transcript (middle) and FPKM-normalized read counts (right) (b) Reads from alternatively spliced Isoform genes 1may be attributable to a single isoform or more than one isoform Reads are color-coded when their isoform of origin is clear Black reads indicate reads with uncertain origin Isoform expression methods estimate isoform abundances that best explain the Isoform 2 observed read counts under a generative model Samples near the original maximum likelihood estimate (dashed line) improve the robustness of the estimate and provide a confidence interval around each isoform s abundance Exon union method

13 Gene expression quantification Basic idea: read count corresponds to the expression level Basic assumption θ i = P(sample a read from transcript i) = 1 Z µ il i, where µi is the expression level of transcript i l i is the length of transcript i Normalizing constant is Z = i µ il i Gene expression quantification Use the frequency estimator to estimate the probability that a read originates from a given transcript i ˆθ i = k i N, where k i is the number of reads mapping to transcript i N is the total number of mapped reads Convert the estimates into expression values by normalizing by the transcript length ˆµ i ˆθ i l i

Gene expression quantification Common units to measure gene expression RPKM: reads per kilobases per million reads RPKM i = k i l i = 10 9 k i N 10 l 3 10 6 i N RPKM is for single-end reads, FPMK is

14 Gene expression quantification Common units to measure gene expression RPKM: reads per kilobases per million reads RPKM i = k i l i = 10 9 k i N 10 l i N RPKM is for single-end reads, FPMK is for paired-end reads Gene expression quantification Consider the 4 four transcripts with different lengths and expression levels illustrated below (left) On the right panel: the read counts normalized by the transcript length using the FPKM metric Transcripts 2 and 4 have comparable read-counts, transcript 2 has a significantly higher normalized expression level After normalization, transcripts 3 and 4 have similar expression values a 1 3 Low Short transcript 2 4 High Long transcript Read count FPKM Figure from (Garber et al, 2011) When the same gene is compared between conditions, the read counts (normalized by sequencing depth) are just fine

15 Gene expression quantification The above formulation assumes that all reads can be assigned uniquely to a single transcript That is generally not true Genes belonging to the same gene families have similar genome/rna sequence For gene expression estimate, different transcript isoforms can share a large fraction of their exons Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis

16 Gene expression quantification Recall the microarray based differential expression analysis using t-tests Two aspects Expression difference: how large is the expression difference between two groups? Statistical significance: how sure are we that there is a true difference? The latter is a statistical question: hypothesis testing Gene expression quantification Sequence count data is found to have non-gaussian distribution t-test based methods are not appropriate Recall: gene expression quantification using the frequency estimate For a single sample, assume read counts per unique transcripts have a multinomial (sampling) distribution Multinomial distribution can be approximated by Poisson distribution Read counts across replicates is observed to have a larger variance than what Poisson model suggests So-called overdispersed noise Negative binomial has been found to provide a good fit to the data

17 Gene expression quantification Assume that the number of aligned reads in sample j that are assigned to gene i can be modelled by negative binomial distribution K ij NB(µ ij, σ 2 ij) An alternative parametrization: the number of successes in a sequence of independent and identically distributed Bernoulli trials with probability p before a specified (non-random) number r of failures occurs K ij NB(r, p), where p = µ ij σ 2 ij r = µ2 ij σ 2 ij µ The parametric form P(K = k) = ( ) k + r 1 p r (1 p) k r 1 Gene expression quantification Estimation of mean and variance can be difficult because number of replicates is typically small Make further modeling assumptions (Anders & Huber, 2010) 1 Mean is modeled as product of (unknown) expected concentration of fragments from gene i under the condition of sample j and the sequencing depth for sample j µ ij = q i,ρ(j) s j 2 Variance is sum of shot noise and raw variance term σ 2 ij = µ ij + s 2 j v i,ρ(j) 3 The raw variance v i,ρ(j) is assumed to be a smooth function of q i,ρ(j) v i,ρ(j) = v ρ (q i,ρ(j) ) The last assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation

18 Gene expression quantification The model of (Anders & huber, 2010) can be fitted to the data Assume we have m A replicate samples for biological condition A and m B samples for condition B We would like to test the null hypothesis of q i,a = q i,b q i,a is the expression strength for condition A (see previous page) Define a test statistic K is = K ia + K ib,where K ia = K ij, K ib = j:ρ(j)=a j:ρ(j)=b K ij Gene expression quantification Denote and assume independence (under the null hypothesis) p(a, b) =P(K ia = a, K ib = b) =P(K ia = a)p(k ib = b) The p-value of a pair of observed count sums (k ia, k ib )isthen the sum of all probabilities less or equal to p(k ia, k ib ), given that the overall sum is k is : p(a, b) p i = a + b = k is p(a, b) p(k ia, k ib ) p(a, b) a+b=k is (1) Correct p-values for multiple testing

19 1 000! 002! An example (Anders Wolfgang, 2010) Anders andfrom Huber Genome Biology& 2010, 11:R106 x7 RNA-Seq in fly embryos in two conditions, A and B Variance dependence on the mean: Poisson model (purple), orange (DEseq), dashed orange (edger) density! Gene expression quantification log10 mean 10^2 10^3 10^4 mean e SCV for the neural cell data Comes 1 and S9 See caption to Figure 1 Figure S6: Same plot as Figure 4, now for the neural cell data Again, red is DESeq s and blue edger s result (light blue, using read count sum for library size adjustment; dark blue, using Equation (5); solid, with common dispersion; dashed, with tagwise dispersion) Figure from (Anders & Wolfgang, 2010) Figure 1 Dependence of the variance on the mean for condition A in the fly RNA-Seq data (a) The scatter plot sh sample variances (Equation (7)) plotted against the common-scale means (Equation (6)) The orange line is the fit w(q) The p variance implied by the Poisson distribution for each of the two samples, that is, ^s j q^ i, A The dashed orange line is the varia edger (b) Same data as in (a), with the y-axis rescaled to show the squared coefficient of variation (SCV), that is all quantities square of the mean In (b), the solid orange line incorporated the bias correction described in Supplementary Note C in Add only shows SCV values in the range [0, 02] For a zoom-out to the full range, see Supplementary Figure S9 in Additional file Gene expression quantification! Figure S7: Type-I error control for the neural cell data Compare Volcano with plotfigure for 2the fly RNA-seq data from (Anders & Wolfgang, 2010) ential expression between conditions ll data Compare to Figure Empirical CDF obtained from each library var36 millions A good fraction of ample, from 32% to 53%) could ed to annotated genes, and End the data in a table of counts six samples and 18,760 rows, he di erential expression analye four GNS samples (excluding pathology and using only one of he same patient) is taken from a di erent sub- 008 Empirical CDF dealing with very di erent noise same analysis here for the Tagm et al [18] The data set comures derived from glioblastomas (condition GNS ) with two om non-cancerous neural stem p value Figure from (Anders & Wolfgang, 2010) p value 0 Figure S8: Volcano plot for comparison of conditions A and Figure 2 Type-I error control Thethe panels show empirical cumulative distribution functions (ECDFs) for P values from B in the flycondition RNA-Seq data replicate from A of the fly RNA-Seq data with the other one No genes are truly differentially expressed, and the should remain below the diagonal (gray) Panel (a): top row corresponds to DESeq, middle row to edger and bottom row test The right column shows the distributions for all genes, for genes the left and middle columns show them separately mean of 100 Panel (b) shows the same data, but zooms into the range of small P values The plots indicate that edger an 3 error at (and in fact slightly below) the nominal rate, while the Poisson-based c2 test fails to do so edger has an excess o counts: the blue line lies above the diagonal This excess is, however, compensated by the method being more conservat

20 Gene expression quantification MA plot for the fly RNA-seq data from (Anders & Wolfgang, 2010) MA plot: a scatter plot where x-axis shows mean and y-axis shows log Signal in A Signal in B Figure from (Anders & Wolfgang, 2010) Contents Background Alignment of RNA-seq data Transcriptome assembly Gene expression quantification Differential gene expression analysis Transcript-level analysis

21 Transcript-level expression quantification Let us assume that each gene i is associated with J i transcripts indexed by j, then θ ij = P(sample a read from transcript j associated with gene i) = 1 Z µ ijl ij, where µ ij is the expression level of transcript j associated with gene j lij is the length of transcript j of gene i Normalizing constant is Z = ij µ ijl ij The true expression level of gene i is µ i = J i j=1 µ ij Transcript-level expression quantification Lets denote the aligned RNA-seq reads as R 1, R 2,,R N Let us also make an unrealistic assumption that all reads are assigned uniquely to one of the transcripts Then the frequency estimator gives us ˆθ ij = k ij N, where k ij is the number of reads assigned uniquely to µ ij Correspondingly, we can convert the estimates into expression values by normalizing by the transcript length ˆµ ij j ˆθ ij l ij = j k ij l ij N

22 Transcript-level expression quantification Recall the union method for estimating the gene expression level k i = k ij j and the frequency estimator ˆθ i = k i l i, where l i is the length of the gene i Union method tends to underestimate the gene expression level because ˆθ i = j k ij l i = k i1 l i k i1 l i1 + + k ij i l iji, + + k ij i l i where l i l ij Transcript-level expression quantification Consider a simple case of skipped exon Intron Skipped exon Constitutive exons Inclusion reads Constitutive reads Exclusion reads Constitutive reads Figure from (Katz et al, 2010) We can use eg the reads in the skipped exon and the inclusion and exclusion reads together with the frequency estimator to estimate the relative expression of the two transcripts

23 Transcript-level expression quantification With paired end reads we can try to use all (non-uniquely) aligned reads assuming we can estimate insert variability c Insert variability: = d Insert length (nt) Paired-end estimate, MISO Figure from (Katz et al, 2010) Estimation can be done Markov chain Monte Carlo (MCMC) sampling (Katz et al, 2010) References Simon Anders & Wolfgang Huber, Differential expression analysis for sequence count data, Genome Biology, 11:R106, 2010 Cartegni, L, Chew, S L, & Krainer, A R Listening to silence and understanding nonsense: exonic mutations that affect splicing Nature Reviews Genetics 3, (2002) Mo Chen & James L Manley, Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches Nature Reviews Molecular Cell Biology 10, , 2009 Manuel Garber, et al, Computational methods for transcriptome annotation and quantification using RNA-seq, Nature Methods 8, , 2011 Katz Y, et al, Analysis and design of rna sequencing experiments for identifying isoform regulation, Nature Methods, 7(12): , 2010 Cole Trapnell, Lior Pachter and Steven L Salzberg, TopHat: discovering splice junctions with RNA-Seq, Bioinformatics, 25(9): , 2009 Cole Trapnell & Steven L Salzberg, How to map billions of short reads onto genomes, Nature Biotechnology 27, , 2009 Cole Trapnell, et al, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, , 2010 Zhong Wang, Mark Gerstein & Michael Snyder, RNA-Seq: a revolutionary tool for transcriptomics Nature Reviews Genetics 10, 57-63, 2009

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ), 2012-01-26 What is a gene What is a transcriptome History of gene expression assessment RNA-seq RNA-seq analysis