RNA Ribonucleic Acid. Week 14, Lecture 28. RNA- seq is a new, emerging field. Two major domains applica:on 12/4/ When the transcriptome is known

Size: px
Start display at page:

Download "RNA Ribonucleic Acid. Week 14, Lecture 28. RNA- seq is a new, emerging field. Two major domains applica:on 12/4/ When the transcriptome is known"

Transcription

1 BMMB 852D: Applied Bioinforma:cs RNA Ribonucleic Acid Week 14, Lecture 28 István Albert Biochemistry and Molecular Biology and Bioinforma:cs Consul:ng Center Penn State Two major domains applica:on 1. When the transcriptome is known discover new isoforms transcript varia:on differen:al expression 2. Unknown Transcriptome iden:fy transcripts RNA- seq is a new, emerging field With that come the challenges: The majority of tools techniques are immature and owen targeted to specific organisms It is a complicated process à leads to the rise of the black box, a complex tool with a fancy name with instruc:ons that need to be followed rigorously step by step, in return it promises an easy answer Historically speaking new releases of these black boxes produce results that are only par:ally concordant (say 50% of genes are iden:cal) 1

2 Important publica:ons Seminal papers 1. Mapping and quan<fying mammalian transcriptomes by RNA- Seq. Nat. Methods (2008) 2. RNA- Seq: a revolu<onary tool for transcriptomics Nat. Rev. Gene:cs (2009) RNA- Seq: gene expression levels Sample composi:on 1. Genes have various levels of expression à higher expression levels produce more reads for that gene 2. Genes of various lengths à longer lengths produce more reads for that gene 3. Sequencing coverage determines the abundance of the rarest transcript that can be detected How many readd do I need? Short answer: 35 million (human/mouse size genomes) + 4 replicates per sample (single end reads). Long answer: it depends on many/many factors For non- human size transcriptome scale it to human transcriptome(!) size. Run pilot studies. How to quan:fy expression CPM à counts per million RPKM/FPKM à reads per kilobase of exons per million reads mapped TPM à transcripts per million 2

3 Helpful blog posts: see links on course page CPM: counts per million CPM = N A 1/N 10 6 = 5 (the frac:on :mes a million) RPKM: reads per kilobase exons per million TPM: transcripts per million Length of Gene A à L A = 5000 bp RPKM = N A /L A 1/N 10 9 = 1 (length may need to be scaled to the method à effec:ve length) Sum of all transcript rates à T = 0.12 TPM = N A /L A 1/T 10 6 = 16,666 Example: N A =100, N B =200, N C =300, all are 5kb long T = Sum(T i ) = Sum(N i /L i ) = 100/ / /5000 = 3

4 RNA- seq quan:fica:on troubles Most scien:sts were mislead into believing that the RNA- seq measure they know about (CPM, RPKM, TMP) is the objec:ve way to quan:fy gene expression levels Tools readily produce one type of output and thus is very convenient to use these values without a second thought and plug them into other methods even when it is inappropriate to do so. How to pick what to use? Some rules of thumb 1. CPM à most appropriate for compu:ng sta:s:cal measures 2. RPKM à normalized by length and total data à most appropriate to compare transcripts within sample 3. TPM à normalized by transcript size à most appropriate to compare between samples Tuxedo Suite Tuxedo Suite: tophat Tophat (splice mapping) accepted_hits.bam Tophat splice mapper Cuffcompare Cuffdiff (differen:al expression) Cufflinks (assemble transcripts) Cuffmerge (merge transcripts) transcript.gn merged.gn differen:al expression with other sowware 1. Uses Bow:e to map reads to genome 2. Takes unmapped reads and maps them to junc:ons 3. Iden:fies junc:ons from an external file or by finding poten:al junc:on sites (GT- AG) splice sites From a CUBELP tutorial (now offline) 4

5 Tophat output creates a new output directory with files accepted_hits.bam various bed files: junc<ons, inser<ons, dele<ons Tuxedo suite: cuffdiff, cuffcompare Cuffcompare: track transcripts across mul:ple experiments Cuffdiff: calculate expression levels à differen:al expression Each tool generates a very large number of output files Understanding the Tuxedo suite is almost like a full <me job. It has lots of documenta:on but also makes substan:al tacit assump:ons. Special Homework 28 Example output Worth double points due Thursday Dec 11 Generate a cuffdiff report that compares the control samples to the uvb samples. Load the alignments into IGV From the cuffdiff reports pick a highly expressed gene that is sta:s:cally significant and take a screenshot of it in IGV so that it shows the differen:al expression levels. 5