measuring gene expression December 11, 2018

2 Intervening Sequences (introns): how does the cell get rid of them? Splicing! A highly conserved ribonucleoprotein complex (the spliceosome) recognizes intron/exon junctions and guides intron excision. This process is responsible for much of the diversity of proteins and is closely regulated.

3 Intervening Sequences (introns): what are they good for? When introducing a sequence into a cell system for overexpression, expression tends to work better if the sequence has an intron. Nonsense-mediated decay requires introns. Introns may be a buffer for mutation or a way to shuffle protein domains. They create variation through alternative splicing.

4 Alternative Splicing

5 Alternative Splicing

6 gene expression analysis: What genomic regions are being transcribed? Which transcripts are being made from those regions? What is the rate/level of transcription from a region? Does expression from those regions change under certain conditions? Remember, though, that gene expression != cell function.

7 challenges in studying gene expression: transcript abundance; mapping; annotation; quantitation; comparing across experiments; technical variability; biological variability.

8 annotation: how can we find genes? Observe the RNA product; observe the protein product; orthology (reciprocal best hit or similar method); de novo prediction.
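The reciprocal best hit idea can be sketched in a few lines. This is a minimal illustration, assuming alignment scores between genes of two species are already available as dictionaries; the function names, gene names, and scores are hypothetical and do not come from any particular tool.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict mapping (query, subject) -> alignment score (hypothetical input).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores, made up for illustration only.
a_vs_b = {
    ("geneA1", "geneB1"): 250,
    ("geneA1", "geneB2"): 90,
    ("geneA2", "geneB2"): 180,
    ("geneA3", "geneB1"): 120,
}
b_vs_a = {
    ("geneB1", "geneA1"): 245,
    ("geneB1", "geneA3"): 110,
    ("geneB2", "geneA2"): 175,
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here geneA3 finds geneB1 as its best hit, but geneB1 prefers geneA1, so geneA3 is not called an ortholog under this rule.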

9 laboratory methods

method                  abundance measurement   variety of transcripts   throughput
protein methods         quantitative            low                      low
Northern blot           qualitative             predetermined            low
cDNA subtraction        poorly quantitative     comparative              low
differential display    poorly quantitative     comparative              low
ESTs/cDNA sequencing    qualitative             moderate                 moderate
SAGE                    quantitative            moderate                 moderate
RT-PCR                  quantitative            predetermined            moderate
microarray              quantitative            predetermined            high
RNAseq                  quantitative            high                     high

10 SAGE (serial analysis of gene expression): generate a 9-10 bp sequence tag for each transcript, concatenate the tags, and sequence them. Tags will be near the 3' end of genes, increasing the specificity of the method.

12 SAGE (serial analysis of gene expression) key points: ditags should be unique, so multiple observations of the same ditag are assumed to be PCR duplicates. Identification of tagged genes relies on having good annotation.
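The tag and ditag logic can be sketched as follows. This is an illustration only: it assumes NlaIII (recognition site CATG) as the anchoring enzyme and a 10 bp tag, and it treats a ditag as an unordered pair of tags. The function names and sequences are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumption: NlaIII used as the anchoring enzyme
TAG_LENGTH = 10       # assumption: 10 bp tags


def sage_tag(transcript):
    """Return the tag just downstream of the 3'-most anchor site, or None."""
    pos = transcript.rfind(ANCHOR_SITE)
    if pos == -1:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = transcript[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


def count_tags(ditags):
    """Collapse repeated ditags (assumed PCR duplicates), then count each tag."""
    unique_ditags = {tuple(sorted(ditag)) for ditag in ditags}
    counts = Counter()
    for tag_a, tag_b in unique_ditags:
        counts[tag_a] += 1
        counts[tag_b] += 1
    return counts


tag = sage_tag("AAACATGTTTTCCCCGGGGCATGACGTACGTAATTTT")
counts = count_tags([("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")])
```

In the second call the three observations of the A/B ditag collapse to one, so tag A is counted twice (once from A/B, once from A/C) rather than four times.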

13 gene expression analysis by microarray

14 gene expression analysis by microarray: ~100M oligonucleotides fixed to a microscope slide. Labeled cDNA is hybridized to the array and scanned. Because the background isn't consistent, signal intensity is typically defined by the foreground:background ratio (F:B) for each spot.
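A per-spot signal based on the foreground:background ratio might be computed like this. The log2 transform is an assumption added here because ratios are commonly analyzed on a log scale; the slide itself only specifies the ratio, and the function name is hypothetical.

```python
import math


def spot_signal(foreground, background):
    """Return log2(foreground / background) for one spot.

    Both intensities must be positive; real pipelines need a policy for
    spots at or below background, which this sketch does not provide.
    """
    if foreground <= 0 or background <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(foreground / background)


signal = spot_signal(800.0, 100.0)  # ratio of 8, i.e. 3 on the log2 scale
```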

15 gene expression analysis by microarray. Advantages: relatively inexpensive (lots of replicates possible); statistical properties are well described. Disadvantages: requires high input quantity; limited dynamic range; limited range of genomic targets; typically not useful for spliceoform detection.

16 gene expression analysis by sequencing (RNAseq). Advantages: allows very, very low input quantity; excellent dynamic range; genomic targets are not preselected; theoretically extraordinarily sensitive for splice site detection. Disadvantages: expensive; statistical properties not at all clear.

17 RNAseq: total RNA (rRNA, mRNA, tRNA, microRNA, etc.); library preparation and sequencing (rRNA depletion, strand specificity, paired end vs. single end, read length, coverage); mapping (splice-aware, with or without annotation); transcript count assignment and normalization; comparison of transcript abundance between samples.

19 alignment approaches: map to transcriptome (RSEM and others); splice-aware alignment (TopHat, STAR and others); transcriptome assembly; kmer/word counting without alignment (Sailfish and others); segment genome by differentially expressed regions (derfinder).
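To make the kmer/word counting idea concrete, here is a toy sketch that finds which transcripts are compatible with a read purely by shared k-mers, with no alignment step. It is not how Sailfish or any other tool is implemented; the function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    candidates = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        candidates = set(hits) if candidates is None else candidates & hits
        if not candidates:
            return set()
    return candidates or set()


transcripts = {"t1": "ACGTACGGTTCA", "t2": "ACGTACGCCTGA"}
index = build_kmer_index(transcripts, k=4)
shared = compatible_transcripts("CGTACG", index, k=4)     # lies in the common prefix
specific = compatible_transcripts("ACGGTT", index, k=4)   # only present in t1
```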

20 RNAseq alignment challenge: reads are not contiguous with the reference genome. A read that spans an exon junction in the transcript does not map contiguously to the reference genome, and paired ends spanning junctions may map very far apart on the reference genome.
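The sketch below illustrates why a junction-spanning read splits into separate genomic blocks. It assumes a plus-strand transcript described by sorted, half-open exon intervals; the function name and coordinates are hypothetical.

```python
def transcript_to_genome(exons, read_start, read_length):
    """Project a read from transcript coordinates onto genomic blocks.

    exons: sorted list of half-open (genome_start, genome_end) intervals.
    read_start: 0-based offset of the read within the spliced transcript.
    Returns the list of genomic (start, end) blocks the read covers.
    """
    if read_start < 0:
        raise ValueError("read_start must be non-negative")
    blocks = []
    offset = 0               # transcript coordinate of the current exon's first base
    pos = read_start         # next transcript position still to be placed
    remaining = read_length  # bases of the read not yet placed
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        if remaining > 0 and pos < offset + exon_len:
            local = pos - offset
            take = min(remaining, exon_len - local)
            blocks.append((g_start + local, g_start + local + take))
            pos += take
            remaining -= take
        offset += exon_len
    if remaining:
        raise ValueError("read extends past the end of the transcript")
    return blocks


exons = [(1000, 1100), (5000, 5100)]
within_exon = transcript_to_genome(exons, read_start=10, read_length=50)
spanning = transcript_to_genome(exons, read_start=80, read_length=50)
```

The second read covers 20 bases at the end of the first exon and 30 at the start of the second, so on the genome its two pieces sit about 3.9 kb apart even though the read itself is only 50 bases long.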

21 RNAseq and alternative splicing some reads can be unambiguously assigned to a transcript, but others cannot.

22 RSEM: RNAseq by expectation maximization. Used by TCGA. Can use a reference genome with or without annotation. In either mode, multimapping reads are explicitly considered and the transcript abundance is derived at every position using a maximum likelihood model. Finally, Bayesian estimates of transcript abundance are provided. Often paired with EBSeq.
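The expectation-maximization idea for multimapping reads can be shown with a deliberately simplified sketch. This is not RSEM's actual model, which also accounts for factors such as transcript length and read position; here every transcript is assumed to have the same length and each read is summarized only by the set of transcripts it is compatible with. The function name and read data are hypothetical.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: list of lists; each inner list holds the indices of the
    transcripts one read is compatible with.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts,
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: the new abundance is the expected share of reads.
        theta = [count / len(read_compat) for count in expected]
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1, 4 ambiguous.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundance(reads, n_transcripts=2)
```

With these toy reads the estimates converge to roughly 0.75 and 0.25: the ambiguous reads end up split 3:1, matching the ratio of the uniquely assigned reads.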

23 EBSeq

24 EBSeq

25 uncertainty is proportional to the number of spliceoforms

26 TopHat: splice-aware aligner

28 TopHat

29 splice-aware alignment: STAR

30 splice-aware alignment: STAR

31 used in TCGA as second processing step

35 Scripture: hybrid alignment/transcriptome assembly

39 HMM-based method: segment genomic regions according to sequencing coverage, then estimate abundance from observations (a mixture of background signal, measurement error, and true signal). Analyses are done across multiple samples, without considering annotation.

41 steps in RNAseq analysis: alignment and transcript assignment; quantitation; comparison among experiments (differential expression).

42 normalization: When comparing two RNAseq experiments, read depth is a critical factor (a nonbiological effect). Options for normalizing for read depth: 1) reads per kilobase per million reads (RPKM), which normalizes for read depth and gene size; 2) trimmed mean of M-values (TMM); 3) DESeq size factor; 4) quantile-based normalizations such as upper quartile normalization.
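As a worked example of the first option, RPKM divides a gene's read count by its length in kilobases and by the total number of mapped reads in millions. The function name and the numbers below are illustrative only.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total reads must be positive")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene, in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)

# The same gene at twice the sequencing depth yields twice the reads
# but the same RPKM, which is the point of the normalization.
deeper = rpkm(1000, 2000, 20_000_000)
```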

43 upper quartile normalization. Table 1: counts per gene (ENST identifiers) in condition 1 and condition 2. How do you know whether the increased counts in condition 2 for the first gene reflect higher transcription? It's possible that there were just more reads for this experiment. Idea: gene expression measurements are more robust for highly expressed genes. Find the normalization factor for these genes and apply it to all genes measured.

44 upper quartile normalization: 1) remove genes that have zero counts in all experiments; 2) rank genes by expression, for each experiment separately; 3) identify the gene at the 75th percentile in each experiment; its expression level will be the size factor for that experiment; 4) divide expression levels for all genes by the expression of the gene at the 75th percentile, for each experiment; 5) optionally, multiply by the mean expression level of the top quartile to restore counts to larger numbers if needed.
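The steps above can be sketched as follows. Two choices here are assumptions rather than part of the slide: the 75th percentile uses the nearest-rank convention, and the optional rescaling multiplies by the mean of the per-experiment size factors. Function names and counts are hypothetical.

```python
import math


def upper_quartile(values):
    """75th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(values)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def upper_quartile_normalize(counts, rescale=True):
    """counts: dict mapping gene -> list of counts, one per experiment."""
    # Step 1: drop genes with zero counts in every experiment.
    kept = {gene: c for gene, c in counts.items() if any(c)}
    if not kept:
        return {}
    n_samples = len(next(iter(kept.values())))
    # Steps 2-3: per-experiment size factor = count at the 75th percentile.
    factors = [upper_quartile([c[i] for c in kept.values()])
               for i in range(n_samples)]
    if any(f == 0 for f in factors):
        raise ValueError("75th percentile is zero in at least one experiment")
    # Step 5 (optional): rescale so values stay on a count-like scale.
    scale = sum(factors) / n_samples if rescale else 1.0
    # Step 4: divide each experiment by its size factor.
    return {gene: [c[i] / factors[i] * scale for i in range(n_samples)]
            for gene, c in kept.items()}


# Toy data: condition 2 has exactly twice the depth of condition 1.
counts = {"g1": [10, 20], "g2": [20, 40], "g3": [30, 60],
          "g4": [40, 80], "g5": [0, 0]}
normalized = upper_quartile_normalize(counts)
```

In this toy case the size factors are 30 and 60, so after normalization each gene has the same value in both conditions, and g5 is dropped because it has no counts anywhere.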

45 Tables: the same genes (ENST identifiers) with counts for condition 1 and condition 2, shown sorted by expression and then normalized.

46 normalization: Other nonbiological factors include gene length and GC content. These are gene-specific and are often assumed to cancel out if comparisons are done gene by gene.

48 normalization: GC content doesn't cancel out

49 normalization: BASP1 isoform levels vary by center!?

50 methods for comparing expression levels: Let's assume that RNAseq reads can be modeled with a Poisson distribution (drawing randomly from all possible RNA fragments). Then, for each gene, the mean expression is measured across replicates, and the variance is set to be equal to the mean (the lambda parameter). Comparing the expression of the gene between two conditions is then fairly straightforward.
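One quick way to check the Poisson assumption for a gene is to compare the sample variance across replicates with the sample mean; under a Poisson model the ratio should be near 1. The sketch below uses made-up replicate counts and a hypothetical function name.

```python
import statistics


def dispersion_index(replicate_counts):
    """Sample variance divided by sample mean for one gene's replicates."""
    mean = statistics.mean(replicate_counts)
    if mean == 0:
        raise ValueError("mean count is zero")
    return statistics.variance(replicate_counts) / mean


# Hypothetical counts for one gene across four biological replicates.
ratio = dispersion_index([100, 150, 60, 210])
```

For these counts the mean is 130 and the sample variance is 4200, so the ratio is far above 1, which is the kind of result that argues against a plain Poisson model.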

51 methods for comparing expression levels. Problem: the variance in gene expression is usually much greater than the mean (overdispersion). Solution: use a negative binomial model. This can be derived as a gamma-Poisson mixture model, assuming that technical replicates follow a Poisson distribution and biological replicates follow a gamma distribution (which accounts for overdispersion). DESeq and DESeq2 are excellent implementations of this method.
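The mean-variance relationship of the gamma-Poisson mixture can be written out explicitly. The symbols are introduced here for illustration: K is the read count for a gene, lambda its underlying rate, mu the mean rate, and alpha a dispersion parameter.

```latex
\begin{align*}
K \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\text{shape} = \tfrac{1}{\alpha},\ \text{scale} = \alpha\mu\right) \\
\mathrm{E}[K] &= \mathrm{E}[\lambda] = \mu \\
\mathrm{Var}(K) &= \mathrm{E}\big[\mathrm{Var}(K \mid \lambda)\big] + \mathrm{Var}\big(\mathrm{E}[K \mid \lambda]\big)
  = \mu + \alpha\mu^{2}
\end{align*}
```

Setting alpha to 0 recovers the Poisson case, where the variance equals the mean; any positive alpha adds a term that grows with the square of the mean, which is the overdispersion seen between biological replicates.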

52 methods for comparing expression levels: Many other methods exist, including Bayesian approaches, beta-binomial estimation, and nonparametric methods. Different methods are often optimized for particular types of data.