Deep sequencing of transcriptomes

Deep sequencing of transcriptomes: An introduction to RNA-seq
Michael Dondrup, UNI BCCS
2 November 2010

Transcriptomics by Ultra-Fast Sequencing
- Microarrays have been the primary high-throughput tool for transcriptomics for almost a decade.
- New approach: sequence the transcriptome
- Millions of short read fragments ( bp) from NGS machines
Remarks:
- SAGE/CAGE and other tag-based methods are not suitable for bacteria
- Reverse transcription: RNA → cDNA
- Most papers use a reference genome
- Count read fragments per bin (CDS, ORF, exon, intergenic region, window of size n)

Outline
- Introduction
- Lab procedures
- Bioinformatics analysis
  - Workflow
  - Application examples
- Statistics of DE analysis
  - Normalization
  - Trials and distributions
  - Statistical testing

RNA-seq, a revolutionary tool?

RNA-seq, a revolutionary tool!

Applications
- Gene structure (e.g. exons/introns)
- Finding novel transcripts:
  - non-annotated (pseudo)genes
  - non-coding RNA (ncRNA, sRNA)
  - antisense RNA
- Transcription start sites (TSS)
- Operon structure
- De novo transcriptome assembly (if no reference exists)
- Metagenomics approach (sampling of bacterial communities by RNA-seq)
- (Semi-)quantitative approach: differential expression (DE)

Overview
Workflow (from the slide's diagram): sample → RNA extraction and purification → total RNA → depletion of rRNA, amplification of mRNA → mRNA / small RNA → reverse transcription, fragmentation → cDNA → high-throughput sequencing → short sequence reads (ACTGATGTGAT, ACTGGTCCAAAAATGAT, AATCCGCTTATGTGAT, ACTTTCCCGTGAT)
Remarks:
- DNase I treatment
- prokaryotes don't have poly(A)
- rRNA depletion: 90% removal; rRNA/RNA before: 97-99%, after: 90%; induces bias
- no depletion e.g. for meta-RNA-seq
- directional by: ss-cDNA or adapter ligation

cDNA sequencing
- Illumina/Solexa
- ABI/SOLiD
- Roche/454
- Direct RNA sequencing is under development
- High coverage (Illumina, SOLiD) is preferred over read length (454) if no transcriptome assembly is required

Overview
Elements of the workflow diagram:
- Sequence data (FASTA, FASTQ)
- Quality control & filtering
- Reference sequence
- Aligning reads to reference
- Transcriptome assembly
- Alignment statistics & filtering
- Transcript variant analysis (splicing, intron/exon, etc.)
- Genome annotation
- Binning (genes, intergenic regions, etc.)
- Compute coverage
- DE analysis
- Visualization
- Transcription start sites / operons
- Search for novel transcripts

Filtering of reads
- by base qualities
- by length
- Duplicate reads (identical sequence):
  - removal
  - condense into a single read
  - probabilistic (have not seen this applied)
  - duplicates are suspected to be artifacts
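
The filtering options above can be illustrated with a minimal Python sketch. It assumes reads are given as (sequence, quality string) pairs with Phred+33 encoding; the function names and the cutoffs (min_len, min_mean_q) are hypothetical choices for illustration, not values from the talk.

```python
from collections import Counter

def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 is an assumption)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)

def filter_reads(reads, min_len=20, min_mean_q=20.0):
    """Keep (sequence, quality) pairs that pass the length and mean-quality cutoffs."""
    return [(seq, qual) for seq, qual in reads
            if len(seq) >= min_len and mean_phred(qual) >= min_mean_q]

def collapse_duplicates(reads):
    """Condense reads with identical sequence into one entry with a copy count."""
    return Counter(seq for seq, _ in reads)

reads = [
    ("ACGTACGTACGTACGTACGTAC", "I" * 22),  # long enough, high quality (Q40)
    ("ACGTACGTACGTACGTACGTAC", "I" * 22),  # exact duplicate of the first read
    ("ACGT", "IIII"),                      # too short
    ("TTGCATTGCATTGCATTGCATT", "#" * 22),  # low quality (Q2)
]
kept = filter_reads(reads)
unique = collapse_duplicates(kept)
```

Keeping the copy count rather than discarding duplicates outright leaves both options from the list open: drop the extra copies, or treat the group as one condensed read.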

Alignments
Challenges are the same as with all NGS data:
- millions/billions of reads
- rather short reads
- sequencing errors
- sequencing bias
(Short-)read alignment programs used in RNA-seq:
- BWA
- Bowtie
- SHRiMP
- SOAP
- ELAND
- BLAT
- BLAST (not so good!)

Challenges begin after the alignment
Mapping fragments to the genome

Filtering possibilities
- unique alignments
- sequence identity
- alignment quality score
- alignment length
- proportion of read length aligned

Pseudo coverage
- count of alignments spanning a genomic position
- can be computed with 1 bp resolution or for larger windows
- used in visualization
- "pseudo" because we do not know the size of the transcriptome
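
As a rough sketch of how such a pseudo coverage could be computed, the Python below counts alignments per position with a difference array and then averages over windows. The interval convention (0-based, end inclusive) and the function names are assumptions made for this example.

```python
def coverage(alignments, genome_length):
    """Per-base count of alignments spanning each position.

    alignments: iterable of (start, end), 0-based, end inclusive (assumed convention).
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1      # coverage rises at the alignment start
        diff[end + 1] -= 1    # and drops just past its end
    cov, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        cov.append(running)
    return cov

def window_means(cov, width):
    """Mean coverage in consecutive, non-overlapping windows of the given width."""
    means = []
    for i in range(0, len(cov), width):
        chunk = cov[i:i + width]
        means.append(sum(chunk) / len(chunk))
    return means

cov = coverage([(0, 3), (2, 5), (2, 2)], genome_length=8)
windows = window_means(cov, 4)
```

The difference array keeps the cost linear in the number of alignments plus the genome length, which matters with millions of reads.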

Examples

Interval binning
- DE analysis requires binning
- typical bins: exons, CDS, transcripts, introns, genes
- reads and genes are represented as genomic intervals [start, end]
- fast interval overlap algorithms:
  - sorting-based methods
  - interval-tree-based methods (as in the IRanges Bioconductor package)
  - nested containment lists (Alekseyenko & Lee, Bioinformatics 2007)
- Result: a read count $n_i \in \mathbb{N}_0$ of reads for each bin $i$
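
One possible sorting-based counting scheme is sketched below in Python. It is an illustration under stated assumptions, not the method of any package named above: intervals are closed [start, end], a read counts for every bin it overlaps, and the names count_reads_per_bin, geneA, geneB and gap are made up.

```python
from bisect import bisect_left, bisect_right

def count_reads_per_bin(reads, bins):
    """Count reads overlapping each bin; all intervals are closed [start, end]."""
    starts = sorted(start for start, _ in reads)
    ends = sorted(end for _, end in reads)
    counts = {}
    for name, (bin_start, bin_end) in bins.items():
        begun = bisect_right(starts, bin_end)    # reads starting at or before the bin end
        finished = bisect_left(ends, bin_start)  # reads ending before the bin start
        counts[name] = begun - finished          # what remains overlaps the bin
    return counts

reads = [(1, 10), (5, 14), (12, 21), (30, 39), (35, 44)]
bins = {"geneA": (8, 20), "geneB": (32, 50), "gap": (22, 29)}
counts = count_reads_per_bin(reads, bins)
```

After the two sorts, each bin costs two binary searches, so the read list is never scanned per bin.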

Search for ncRNAs
- define an arbitrary coverage threshold $c_0$
- search for continuous intergenic regions (seeds) with $c > c_0$
- possibly extend over small gaps
- "intergenic": regions that do not overlap a CDS (why?)
- extract sequence and search databases
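
The seed search can be sketched in Python as follows. The gap handling (merging seeds separated by at most max_gap positions) is one plausible reading of "extend over small gaps"; the function names and example values are hypothetical.

```python
def find_seeds(cov, c0, max_gap=0):
    """Return [start, end] runs with coverage > c0, merging runs at most max_gap apart."""
    regions, start = [], None
    for i, c in enumerate(cov):
        if c > c0 and start is None:
            start = i                        # a run above the threshold begins
        elif c <= c0 and start is not None:
            regions.append([start, i - 1])   # the run ended at the previous position
            start = None
    if start is not None:
        regions.append([start, len(cov) - 1])
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = region[1]        # bridge the small gap
        else:
            merged.append(region)
    return merged

def is_intergenic(region, cds_list):
    """True if the region overlaps none of the CDS intervals (closed intervals)."""
    start, end = region
    return all(end < cds_start or start > cds_end for cds_start, cds_end in cds_list)

cov = [0, 0, 5, 6, 7, 1, 6, 6, 0, 0, 0, 9, 9, 0]
seeds_strict = find_seeds(cov, c0=2)
seeds_merged = find_seeds(cov, c0=2, max_gap=1)
candidates = [r for r in seeds_merged if is_intergenic(r, [(10, 13)])]
```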

TSS and operon detection
- try to find out where coverage changes quickly (a TSS candidate)
- compute first-order differences $d(c_i)$ for each genomic position $i$ (aka differentiation) using a sliding window of width $w$
- look for maxima/minima upstream of a gene (max for + strand, min for - strand)
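
A minimal Python sketch of this idea follows. The exact definition of the windowed difference is an assumption here (mean coverage of the w positions starting at i minus the mean of the w positions before i), and the search range [lo, hi] stands in for "upstream of a gene".

```python
def windowed_difference(cov, w):
    """d[i] = mean coverage in [i, i+w) minus mean coverage in [i-w, i); 0.0 near the ends."""
    d = [0.0] * len(cov)
    for i in range(w, len(cov) - w + 1):
        d[i] = (sum(cov[i:i + w]) - sum(cov[i - w:i])) / w
    return d

def tss_candidate(cov, lo, hi, w, strand="+"):
    """Position in [lo, hi] with the steepest rise (+ strand) or steepest drop (- strand)."""
    d = windowed_difference(cov, w)
    pick = max if strand == "+" else min
    return pick(range(lo, hi + 1), key=lambda i: d[i])

cov_plus = [0, 0, 0, 0, 1, 9, 10, 10, 10, 10, 10, 10]
cov_minus = cov_plus[::-1]   # mirror image, as seen for a gene on the - strand
tss_plus = tss_candidate(cov_plus, 0, 8, w=2, strand="+")
tss_minus = tss_candidate(cov_minus, 3, 11, w=2, strand="-")
```

A wider window smooths out noise in the coverage but blurs the position of the jump, so w is a parameter to explore.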

Why data normalization for RNA-seq data?
Account for systematic technical errors/bias:
- different library sizes: different numbers of reads
- different gene lengths
- some sequencing methods seem to prefer longer transcripts even more
- GC content might have an influence too
- limited read capacity: highly expressed genes "steal" reads
This is not trivial (e.g. common 5' preference).
Differentiate between biological effects and technical effects.
Btw.: isn't this a déjà vu?

Normalization methods
- RPKM (Mortazavi et al., 2008)
- housekeeping or constant reference gene (e.g. POLR2A)
- upper-quartile (75th percentile) normalization
- quantile normalization (as for microarray (MA) data, see Bolstad et al., 2003)
See: Bullard et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics (2010)

RPKM
Reads per kilobase of exon per million mapped sequence reads:
$$\mathrm{rpkm}(n_i, l_i, N) = \frac{n_i}{l_i \, N} \quad [b]$$
- $n_i$: number of mapped reads
- $l_i$: length of gene sequence
- $N$: total number of mapped reads
- $b$: genomic base position (as a unit, this is not really a unit!)
As the name implies, the formula takes $l_i$ in kilobases and $N$ in millions of reads.
A very simple gene-specific scaling factor. Some publications also took log values of similarly scaled data.
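
A small Python sketch of the RPKM scaling, assuming the gene length is given in bases and N as the raw number of mapped reads. Converting to kilobases and millions then contributes a factor of 10^3 * 10^6 = 10^9. The example numbers are invented.

```python
def rpkm(n_i, l_i, N):
    """Reads per kilobase per million mapped reads.

    n_i: reads mapped to gene i
    l_i: gene length in bases (assumed unit)
    N:   total number of mapped reads (raw count, assumed unit)
    """
    return 1e9 * n_i / (l_i * N)

# Hypothetical gene: 500 reads, 2000 bases long, in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)

# Twice the length with twice the reads gives the same RPKM.
same = rpkm(1000, 4000, 10_000_000)
```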

Quantile normalization
- no assumption about the original data distribution
- afterwards, the values of all samples follow the same distribution
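
The procedure can be sketched in a few lines of Python. This is a simplified illustration: each sample is a list of values for the same genes, ties are broken by gene order rather than averaged, and the function name is made up.

```python
def quantile_normalize(samples):
    """Give every sample the same value distribution.

    samples: list of equal-length lists, one list of per-gene values per sample.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest values across samples.
    reference = [sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda g: s[g])  # gene indices, lowest value first
        out = [0.0] * n
        for rank, g in enumerate(order):
            out[g] = reference[rank]                  # gene keeps its rank, takes reference value
        normalized.append(out)
    return normalized

normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

Each gene keeps its rank within its sample; only the values attached to the ranks are replaced.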

Method comparison by Bullard et al.
- gold standard: qRT-PCR data (MAQC project)
- divide the gold standard data into DE, non-DE, and no-call
- try to reproduce the classification based on RNA-seq data (+ MA data) using statistical tests
Results:
- housekeeping, upper-quartile, and quantile normalization are preferable over RPKM and total counts
- the choice of the normalization method had a larger effect on performance than the choice of the statistical model (test)

What is significant here?
Is a gene significantly differentially expressed under two conditions (DE analysis)?
Remarks:
- We need a model for the data; more precisely, we need a model that can explain the variance in the data.
- Significance: the probability of rejecting the null hypothesis in a statistical test setting just by chance, while there is in reality no effect, is low.
- We are dealing with count data, thus we are dealing with discrete statistics (for now).

Bernoulli trials
- A trial with two outcomes: success and failure
- $p$: probability of success
- probability of failure: $\bar{p} = 1 - p$

Binomial distribution
The number of successes in a series of $n$ i.i.d. Bernoulli trials follows a binomial distribution.
Figure: probability mass function

Poisson distribution
The distribution of rare events.
- we know a rate, but not the probability of a success in a series of independent Bernoulli trials occurring in space/time
- the number of trials is large while each individual trial has a low probability of success
- e.g.:
  - number of phone calls received in a call center per hour
  - number of defective devices on an assembly line per day
$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
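
To make the formula concrete, here is a short Python sketch that evaluates the Poisson probability mass function and checks numerically that mean and variance both equal the rate. The rate 3.0 and the truncation at k < 60 are arbitrary choices for the example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected number is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 3.0
ks = range(60)  # the tail beyond k = 60 is negligible for lam = 3
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
```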

Remarks
It is tempting to use a Poisson model and set $\lambda$ as the (pseudo-)coverage.
$$E(f; \lambda) = \sigma^2 = \lambda$$
Preconditions:
- Independence: trials do not depend on previous events
- Lack of clustering: the probability of two simultaneous events is low
- Rate is constant over space/time

Poisson distribution...
- is not a suitable model for RNA-seq data in general
- has been found sufficient for technical variation in RNA-seq data
- biological variance > technical variance
- keep in mind: we do not know the length of the transcriptome; bp is not a unit!
- two-stage random process (a bit sloppy!): C(x) = Sequencing(Transcription(x))
- overdispersion

Negative binomial distribution
- $k$: number of failures before $r$ successes, each with probability $p$, occur
- used for overdispersion problems, with a dispersion parameter
- $X \sim \mathrm{NB}(r, p)$, alternatively written as $\mathrm{NB}(\mu, \sigma^2)$
$$f(k) \equiv \Pr(X = k) = \binom{k + r - 1}{r - 1} \, p^r \, (1 - p)^k \quad \text{for } k = 0, 1, 2, \ldots$$
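
The Python sketch below evaluates the negative binomial probability mass function under the convention stated in the text (k failures before the r-th success, success probability p), which is also the convention of R's dnbinom(k, r, p). The helper nb_mean_var is a made-up name; its formulas are the standard moments for this parameterization.

```python
from math import comb

def nb_pmf(k, r, p):
    """P(X = k): k failures before the r-th success, success probability p (integer r)."""
    return comb(k + r - 1, r - 1) * p ** r * (1 - p) ** k

def nb_mean_var(r, p):
    """Mean and variance for the same parameterization."""
    mean = r * (1 - p) / p
    var = r * (1 - p) / p ** 2
    return mean, var

total = sum(nb_pmf(k, 2, 0.5) for k in range(200))
mean, var = nb_mean_var(2, 0.5)
```

For r = 2 and p = 0.5 the variance (4) is twice the mean (2), which is the kind of overdispersion a Poisson model cannot represent.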

Figure: probability mass function of the negative binomial, dnbinom(k, r, p), plotted against k for r=2, p=0.5; r=5, p=0.5; r=10, p=0.5; r=20, p=0.5; r=10, p=

How do we assess significance?
- We have discrete probabilities
- We can enumerate all possibilities
- Principle: enumerate all outcomes which are equally or more extreme than the given one

Example: Fisher's exact test
- Exact test for contingency tables with small sample sizes
- the probability of a single table follows the hypergeometric distribution
- for large samples: approximation by the chi-square test
- the dieting example
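
The enumeration principle can be written out for a 2x2 table in Python as below. The example table is an assumption: it uses the counts of the widely quoted textbook dieting example (1 and 9 in the first row, 11 and 3 in the second), which may or may not be the numbers shown on the original slide. The function names are made up.

```python
from math import comb

def table_prob(a, row1, col1, n):
    """Hypergeometric probability of a 2x2 table with top-left cell a and fixed margins."""
    return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)

def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables that are at most as likely as the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = table_prob(a, row1, col1, n)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    probs = [table_prob(x, row1, col1, n) for x in range(lo, hi + 1)]
    # The small tolerance keeps tables with the same probability despite rounding.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Assumed example table: rows = dieting / not dieting, columns = men / women.
p_value = fisher_exact_two_sided(1, 9, 11, 3)
```

With the margins fixed, the top-left cell determines the whole table, so enumerating its possible values enumerates all tables.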

edgeR and DESeq
Model: $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^2)$, gene $i$, sample $j$ in a replicated experiment*
Mean $\mu$ and variance $\sigma^2$ must be estimated from the replicates*
- edgeR: $\sigma^2 = \mu + \alpha \mu^2$
- DESeq: $\mu_{ij} = q_{i,\rho(j)} \, s_j$
- DESeq: $\sigma_{ij}^2 = \mu_{ij} + s_j^2 \, v_{i,\rho(j)}$
Both packages use different approaches for parameter fitting and testing.
* DESeq also works with no or few replicates, but with reduced power

Testing in DESeq
- similar to Fisher's test
- As test statistic: the total counts in the two conditions, $K_{iA}$ and $K_{iB}$
- Now we need a p-value, where $k_{iS} = k_{iA} + k_{iB}$:
$$p_i = \frac{\sum_{a + b = k_{iS},\; p(a,b) \le p(k_{iA}, k_{iB})} p(a, b)}{\sum_{a + b = k_{iS}} p(a, b)}$$
- Now the model comes in:
$$p(a, b) = \Pr(K_{iA} = a) \, \Pr(K_{iB} = b)$$
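
The p-value above can be sketched in Python as follows. This is not DESeq's implementation: it only mirrors the two formulas, taking the mean and variance of each condition as given inputs (in practice they would be estimated from the data). The conversion from (mean, variance) to negative binomial parameters assumes variance > mean, and all names and example values are hypothetical.

```python
from math import exp, lgamma, log

def nb_pmf_mu_var(k, mu, var):
    """Negative binomial probability of count k for mean mu and variance var (var > mu)."""
    p = mu / var
    r = mu * mu / (var - mu)   # may be non-integer, hence the gamma functions
    log_pmf = (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))
    return exp(log_pmf)

def deseq_style_pvalue(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """Share of probability mass on splits of k_a + k_b that are at most as likely as (k_a, k_b)."""
    k_s = k_a + k_b

    def joint(a, b):
        return nb_pmf_mu_var(a, mu_a, var_a) * nb_pmf_mu_var(b, mu_b, var_b)

    p_obs = joint(k_a, k_b)
    probs = [joint(a, k_s - a) for a in range(k_s + 1)]
    extreme = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return extreme / sum(probs)

# Hypothetical null model: both conditions have mean 50 and variance 50 + 0.1 * 50**2 = 300.
p_balanced = deseq_style_pvalue(50, 50, 50.0, 300.0, 50.0, 300.0)
p_moderate = deseq_style_pvalue(40, 60, 50.0, 300.0, 50.0, 300.0)
p_extreme = deseq_style_pvalue(10, 90, 50.0, 300.0, 50.0, 300.0)
```

As in Fisher's test, the sum of the two counts is held fixed and only the ways of splitting it between the conditions are enumerated.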

Summary and outlook (my 50 cents)
- RNA-seq is as promising as it is complex
- read mapping and binning are working fine, though parameters need to be explored
- normalization and statistical models and tests need more work
- more aggressive normalization should be explored
- many methods are a bit ad hoc or use arbitrary thresholds
- no framework for within-sample significance testing