Quantifying with sequencing. Week 14, Lecture 27. RNA-Seq concepts. RNA-Seq goals 12/1/ Unknown transcriptome. 2. Known transcriptome


BMMB 852D: Applied Bioinformatics
Quantifying with sequencing
Week 14, Lecture 27
István Albert
Biochemistry and Molecular Biology and Bioinformatics Consulting Center, Penn State

Samples consist of varying amounts of DNA that correlate with different genomic locations.
Observed abundance → sample abundance → biological significance
Note: currently we sequence DNA only!
RNA needs to be reverse transcribed into DNA.
mRNA needs to be purified from total RNA.
Each step introduces its own biases and challenges.

RNA-Seq concepts

RNA-Seq goals
An RNA-Seq analysis has three separate yet equally important segments:
1. Identifying transcripts
2. Estimating abundances per transcript
3. Comparing abundances → differential expression
There is enormous disagreement about how each of these steps should be performed → hence a very large number of options.
There are workflows that do all three. There are workflows that mix and match from methods.

1. Unknown transcriptome: identify transcripts (then go to 2)
2. Known transcriptome: discover new isoforms, transcript variation, differential expression

RNA-Seq is still a new, emerging field
With that come the challenges:
The majority of tools and techniques are immature and often targeted to specific organisms.
It is a complicated process → leads to the rise of the black box: a complex tool with a fancy name, with instructions that need to be followed rigorously step by step. In return it promises an easy answer.
Historically speaking, new releases of these black boxes produce results that are only partially concordant (say 50% of genes are identical).

Do black box RNA-Seq analyses work?
Put data in → press a button (or run a routine workflow) → get a usable result.
1. They work when your problem is typical (falls into a generic case that the tool can deal with)
2. They fail completely otherwise
How to tell which case you are in? Not that easy.

RNA: Ribonucleic Acid

Early publications
1. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods (2008)
2. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genetics (2009)
3. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods 7, (2010)

RNA-Seq: how do we even compare gene expression levels?
1. Genes have various levels of expression → higher expression levels produce more reads for that gene
2. Genes have various lengths → longer genes produce more reads for that gene
3. Sequencing coverage determines the abundance of the rarest transcript that can be detected

How many reads do I need?
Short answer: 35 million (human/mouse size genomes) + at least 4 replicates per sample.
Single end reads and more replicates for quantification, paired end reads for transcriptome assembly.
Long answer: it depends on many, many factors. Run pilot studies.

How to quantify expression?
Simple counts
Counts (in the example, COUNT = 100)
CPM → counts per million
RPKM/FPKM → reads per kilobase of exons per million reads mapped
TPM → transcripts per million
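Points 1 and 2 above can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the expected read count is proportional to abundance times transcript length; the function names and the `depth` factor are hypothetical and do not come from any tool.

```python
def expected_reads(abundance, length_bp, depth=1):
    """Toy model (an assumption): reads scale with abundance * length.

    depth is a hypothetical sequencing-depth factor,
    in reads per kilobase per transcript copy.
    """
    return abundance * length_bp * depth / 1000


def reads_per_kb(reads, length_bp):
    """Divide out transcript length so genes of different sizes are comparable."""
    return reads / (length_bp / 1000)


# Same expression level, different lengths.
short_gene = expected_reads(abundance=100, length_bp=1000)
long_gene = expected_reads(abundance=100, length_bp=5000)
# Higher expression level, short length.
high_gene = expected_reads(abundance=500, length_bp=1000)

# The long gene and the highly expressed gene give the same raw count.
print(short_gene, long_gene, high_gene)  # 100.0 500.0 500.0

# After length normalization only the highly expressed gene stands out.
print(reads_per_kb(short_gene, 1000),
      reads_per_kb(long_gene, 5000),
      reads_per_kb(high_gene, 1000))  # 100.0 100.0 500.0
```

Under this assumption, raw counts alone cannot separate "long gene" from "highly expressed gene", which is the motivation for the length-normalized measures.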

CPM: counts per million
Total mapped reads → $N = 20$ million $= 2 \times 10^{7}$
Reads mapped to gene A → $N_A = 100$
$\mathrm{CPM} = 10^{6} \cdot N_A / N = 5$ (the fraction times a million)

RPKM: reads per kilobase of exons per million
Total mapped reads → $N = 20$ million $= 2 \times 10^{7}$
Length of gene A → $L_A = 5000$ bp
$\mathrm{RPKM} = 10^{9} \cdot N_A / L_A / N = 1$
(length may need to be scaled to the method → effective length)

TPM: transcripts per million
Example: $N_A = 100$, $N_B = 200$, $N_C = 300$, all are 5 kb long
Sum of the length-normalized counts over all transcripts → $T = \sum_i T_i = \sum_i N_i / L_i = 100/5000 + 200/5000 + 300/5000 = 0.12$
$\mathrm{TPM}_A = 10^{6} \cdot N_A / L_A / T = 10^{6} \cdot 0.02 / 0.12 \approx 166{,}667$
Note that TPM does not depend on the total number of mapped reads $N$.
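The arithmetic of this example can be checked with a minimal Python sketch. The function names are illustrative, not taken from any library, and the inputs are the slide's numbers: N = 20 million mapped reads, counts of 100, 200 and 300 for genes A, B and C, each 5000 bp long. With these values, 10^6 × (100/5000) / 0.12 comes out at about 166,667.

```python
def cpm(count, total_mapped):
    """Counts per million: the fraction of mapped reads, times a million."""
    return count * 1e6 / total_mapped


def rpkm(count, length_bp, total_mapped):
    """Reads per kilobase per million mapped reads."""
    return 1e9 * count / length_bp / total_mapped


def tpm(counts, lengths_bp):
    """Transcripts per million for every transcript in one sample."""
    rates = [n / l for n, l in zip(counts, lengths_bp)]  # N_i / L_i
    total = sum(rates)                                   # T
    return [1e6 * r / total for r in rates]


N = 20_000_000               # total mapped reads
counts = [100, 200, 300]     # N_A, N_B, N_C
lengths = [5000, 5000, 5000]

print(cpm(counts[0], N))               # 5.0
print(rpkm(counts[0], lengths[0], N))  # 1.0
tpm_values = tpm(counts, lengths)
print([round(v) for v in tpm_values])  # about 166667, 333333, 500000
```

A real analysis would typically use effective lengths rather than annotated lengths, as the slide notes; this sketch uses the plain lengths only.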

How to pick what to use? Some rules of thumb
1. Counts → raw measurement → requires the use of another statistical tool to infer anything
2. CPM → normalized by the total data obtained
3. CPM → RPKM → normalized by length and total data → most appropriate to compare transcripts within a sample
4. TPM → normalized by transcript size → most appropriate to compare between samples

Homework 27
Download a strand-specific RNA-Seq dataset from SRA. You may use the paper that we discussed or any other publication.
Align your data.
Pick a gene of interest that has some read coverage.
Compute the coverage, RPK and TPM for that gene.
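One way to see why TPM is the usual choice for between-sample comparison: TPM values sum to one million in every sample by construction, while the sum of RPKM values changes with the composition of the sample. The sketch below uses two made-up samples with the same three hypothetical transcripts; all counts and lengths are invented for illustration, and the total mapped reads is assumed to be the sum of the counts shown.

```python
def rpkm_all(counts, lengths_bp):
    """RPKM per transcript; total mapped reads assumed to be sum(counts)."""
    total_mapped = sum(counts)
    return [1e9 * n / l / total_mapped for n, l in zip(counts, lengths_bp)]


def tpm_all(counts, lengths_bp):
    """TPM per transcript: length-normalize first, then scale to a million."""
    rates = [n / l for n, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


lengths = [1000, 2000, 5000]   # hypothetical transcript lengths in bp
sample_1 = [100, 200, 300]     # hypothetical read counts
sample_2 = [300, 200, 100]     # same depth, different composition

rpkm_sum_1 = sum(rpkm_all(sample_1, lengths))
rpkm_sum_2 = sum(rpkm_all(sample_2, lengths))
tpm_sum_1 = sum(tpm_all(sample_1, lengths))
tpm_sum_2 = sum(tpm_all(sample_2, lengths))

# RPKM totals differ between the samples; TPM totals are both one million.
print(round(rpkm_sum_1), round(rpkm_sum_2))
print(round(tpm_sum_1), round(tpm_sum_2))
```

Because the TPM total is fixed, a given TPM value represents the same share of the transcript pool in every sample, which is not true of RPKM.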