Computational & Quantitative Biology Lecture 6 RNA Sequencing

1 Computational & Quantitative Biology, Lecture 6: RNA Sequencing
Peter A. Sims
Dept. of Systems Biology, Dept. of Biochemistry & Molecular Biophysics, Sulzberger Columbia Genome Center
October 27, 2014

2 We Have Many Tools for RNA Analysis & Quantification
  RT-qPCR
  cDNA Microarray
  RNA Fluorescence in situ Hybridization (RNA-FISH)
  Northern Blot
  RNA-Seq
  RNA in situ Hybridization (RNA-ISH)
  Serial Analysis of Gene Expression (SAGE)

3 Lecture 6: RNA Sequencing
  I. The Sequencer
  II. RNA-Seq: Defining a Biological Question and Making a Library
  III. Experimental Challenges and Bias: How to Recognize Them and What To Do
  IV. Computational Tools for Processing RNA-Seq Data

4 I. The Sequencer

5 NGS Glossary
  Read: A DNA sequence that a sequencer determines from a single DNA fragment.
  Paired-End Read: A DNA sequence read obtained by sequencing from opposite ends of a single DNA fragment.
  Single-Read Raw Accuracy: The base-calling accuracy obtained from sequencing a single read, without leveraging the consensus of multiple reads at the same position in a genome.
  Coverage: The number of reads corresponding to a sequence position on a genome.
  Amplicons: DNA fragments that result from enzymatic amplification of a template DNA molecule.
  Primer: A DNA oligonucleotide that hybridizes to a DNA template and allows DNA polymerase to initiate replication of a new DNA strand that is complementary to the template.
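
To make the glossary definition of coverage concrete, here is a minimal Python sketch that counts how many reads overlap each position of a toy reference. The function name `per_base_coverage`, the 0-based half-open interval convention, and the example reads are illustrative assumptions, not part of the lecture.

```python
def per_base_coverage(reads, genome_length):
    """Return the number of reads covering each reference position.

    `reads` holds (start, end) alignments, 0-based and half-open
    (an assumed convention for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1

    # A running sum of the differences gives the depth at each position.
    coverage = []
    depth = 0
    for change in delta[:genome_length]:
        depth += change
        coverage.append(depth)
    return coverage


toy_reads = [(0, 4), (2, 6), (3, 8)]  # hypothetical alignments
toy_coverage = per_base_coverage(toy_reads, genome_length=10)
mean_depth = sum(toy_coverage) / len(toy_coverage)
```

The mean depth equals the total number of aligned bases divided by the reference length, which is the sense in which a figure such as "30x coverage" is usually quoted.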

6 Current State of NGS at Genome Center
Illumina HiSeq 2500, High Output Mode:
  Throughput: 1 Tb/run, 4 billion reads
  Read Length: 2x125 base paired-end
  Turnaround: 6 days, ~6.9 Gb/hour
Illumina HiSeq 2500, Rapid Run Mode:
  Throughput: 180 Gb/run, 0.6 billion reads
  Read Length: 2x150 base paired-end
  Turnaround: 1.7 days, ~4.5 Gb/hour
Illumina NextSeq 500, High Output Mode:
  Throughput: 120 Gb/run, 0.4 billion reads
  Read Length: 75 base single-end or 150 base paired-end
  Turnaround: ~1 day, 2.7 Gb/hour
Bottom Line: You Can Re-Sequence Your Genome with >30x Coverage in ~1 Day for ~$1,600
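
The bottom-line claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes a human genome of roughly 3.1 billion bases (an assumption, not a figure from the slide) and treats every sequenced base as usable, so it gives an upper bound that ignores duplicates, unmapped reads, and quality filtering.

```python
HUMAN_GENOME_BASES = 3.1e9  # assumed genome size, not stated on the slide


def mean_coverage(run_output_gb, genome_bases=HUMAN_GENOME_BASES):
    """Upper-bound mean coverage if every sequenced base maps to the genome."""
    return run_output_gb * 1e9 / genome_bases


def gb_per_hour(run_output_gb, turnaround_days):
    """Average output rate over the whole run."""
    return run_output_gb / (turnaround_days * 24)


nextseq_coverage = mean_coverage(120)        # NextSeq 500 high output run
hiseq_rapid_coverage = mean_coverage(180)    # HiSeq 2500 rapid run
hiseq_high_output_rate = gb_per_hour(1000, 6)

print(f"NextSeq 500: ~{nextseq_coverage:.0f}x")
print(f"HiSeq 2500 rapid: ~{hiseq_rapid_coverage:.0f}x")
print(f"HiSeq 2500 high output: ~{hiseq_high_output_rate:.1f} Gb/hour")
```

Under these assumptions a single 120 Gb run exceeds 30x for one human genome, which is consistent with the slide's bottom line.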

7 Two Revolutionary Ideas
Sequencing in the Flow Regime
  Expose multiple template molecules to sequencing chemistry repeatedly.
  Reactions take place in a flow cell with immobilized DNA amplicons.
  Use automated laminar fluidics to exchange reagents between cycles.
  Detect incorporated bases after each cycle.
  Multiplex clonal amplification without bacterial cloning.
Sequencing at the Diffraction Limit
  Image thousands of sequencing reactions simultaneously with a fluorescence microscope.
  Sample density is limited by diffraction.
  Sequencing reactions become two-dimensional; reagent volumes become negligible.

8 Illumina Bridge Amplification
Shendure, Nature Biotech, 2008
On-Chip Amplification by Solid-Phase PCR
  Forward and reverse primers are immobilized on the surface.
  Amplification is isothermal: chemical melting is used to reset between cycles.
  Fully automated, on-chip clonal amplification with almost zero hands-on time.
  The simplicity of bridge PCR is essential to Illumina's competitiveness and relevance to the future of NGS in clinical applications.

9 Illumina Reversible Terminator Chemistry
[Figure: chemical structure of a reversible terminator nucleotide. Labels: Triphosphate; Fluorophore on Base; Reversible Terminator at 3'-OH]

10 Illumina Reversible Terminator Chemistry
[Figure labels: polymerase; chemicals]
What limits Illumina's read length?
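
One commonly discussed contributor to the read-length question is phasing: in each cycle a small fraction of the strands in a cluster fails to extend or deblock, so the cluster's signal becomes progressively mixed. The sketch below is a deliberately simplified model that assumes a constant per-cycle failure probability; the value 0.005 is a hypothetical illustration, not a measured Illumina specification, and the model ignores other effects such as strands running ahead and signal decay.

```python
def in_phase_fraction(cycles, failure_rate):
    """Fraction of strands in a cluster still in phase after `cycles` cycles.

    Assumes each strand independently fails to advance with probability
    `failure_rate` per cycle and never recovers (a simplification).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** cycles


# Hypothetical per-cycle failure rate, chosen only for illustration.
for cycles in (50, 125, 250):
    print(cycles, round(in_phase_fraction(cycles, 0.005), 3))
```

Because the in-phase fraction decays geometrically, even a small per-cycle failure rate eventually leaves too little coherent signal for reliable base calls.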

11 Illumina's HiSeq System Under the Hood
Key Features
  Laser-induced fluorescence line-scanner automated microscopy.
  Dual-surface imaging system.
  Dual flow cell system that is always imaging (acquisition-limited throughput).
  Time-delay integration charge-coupled device cameras (TDI-CCD).
  Automated fluidics, mechanical scanner, and autofocuser that are engineered to within an inch of their lives.
Illumina

12 Architecture of an Illumina Sequencing Library
[Figure: library molecule drawn as ILMN adapter, DNA fragment, ILMN adapter. Adapter elements labeled: P5; Read 1 Sequencing Primer; Read 2 Sequencing/Index Primer; Index; P7]

13 Sequencing of an Illumina Sequencing Library

14 II. RNA-Seq: Defining a Biological Question and Making a Library

15 RNA-Seq for Expression Profiling in Mammalian Cells
[Figure: total RNA containing rRNA and polyadenylated mRNA]
Isolate mRNA with oligo(dT) beads

16 Fragment with Alkaline Solution
Prime with Random Hexamers/Oligo(dT)
Reverse Transcribe
[Figure: poly(A) RNA fragments annealed to random hexamer (NNNNNN) and oligo(dT) primers]

17 Second-Strand Synthesis (RNase H/DNA Pol I)
End-Repair (Klenow fragment DNA pol/T4 DNA pol)
[Figure: random hexamer- and oligo(dT)-primed cDNA before and after second-strand synthesis]

18 Phosphorylate and A-tail with PNK and Klenow fragment polymerase
Ligate Library Enrichment Adapters
Amplify by PCR
[Figure: phosphorylated (P), A-tailed fragments and T-overhang adapters]

19 RNA-Seq with Strand Specificity
Why is this important?
Insert dUTP during Second-Strand Synthesis
End Repair, Phosphorylation, Ligation
Uracil DNA Glycosylase, Endonuclease VIII
[Figure: double-stranded cDNA in which the second strand carries U residues]
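
Because the dUTP-marked second strand is destroyed before amplification, only the first-strand cDNA (antisense to the original RNA) is sequenced, and the strand of the transcript can be recovered from the alignment. The sketch below assumes the common paired-end orientation for dUTP libraries, in which read 1 aligns antisense to the transcript and read 2 aligns sense; the function name and the "+"/"-" encoding are illustrative assumptions.

```python
def transcript_strand(read_number, alignment_strand):
    """Infer the strand of the original RNA from a dUTP-library alignment.

    Assumes read 1 is antisense to the transcript and read 2 is sense
    (the usual orientation for dUTP protocols; verify for your own kit).
    """
    if read_number not in (1, 2):
        raise ValueError("read_number must be 1 or 2")
    if alignment_strand not in ("+", "-"):
        raise ValueError("alignment_strand must be '+' or '-'")

    opposite = {"+": "-", "-": "+"}
    if read_number == 1:
        # Read 1 comes from first-strand cDNA, so flip its strand.
        return opposite[alignment_strand]
    return alignment_strand


example = transcript_strand(read_number=1, alignment_strand="+")
```

Without this information, reads from overlapping genes on opposite strands, or from antisense transcripts, cannot be assigned unambiguously.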

20 RNA-Seq for Expression Profiling in Prokaryotes: The Case of E. coli
Two Unique Challenges:
  Polycistronic messages
  No poly(A) tail, so rRNA is a problem
[Figure: rRNA and mRNA]
rRNA Depletion

21 [Figure: rRNA and mRNA]
RT with random hexamers, make standard library

22 Profiling Small RNAs
[Figure: rRNA, mRNA, and small RNA]

23 Ligate 3'-Adapter (T4 RNA Ligase 2 truncated)
Ligate 5'-Adapter (T4 RNA Ligase 1)
[Figure: rRNA, mRNA, and small RNA]

24 Reverse Transcribe using Adapter
Enrich Library by PCR
[Figure: rRNA, mRNA, and small RNA]

25 Gel-Purify Small RNA Library
Sequence

26 Nascent RNA-Seq: Global Run-On (GRO-Seq)
Metabolic Labeling with Br-UTP
RNA Polymerase Run-on
[Figure: RNA polymerase on gDNA producing Br-labeled nascent RNA]
Immunoprecipitate Nascent RNA with Anti-BrdU Antibody
Fragment, Decap, Generate Library using Small RNA Procedure

27 Nascent RNA-Seq: Native Elongating Transcript (NET-Seq)
Lyse Cells and Preserve Ternary Complex
Immunoprecipitate RNA Polymerase

28 3'-End of fragment maps location of RNA pol
Ligate 3'-Adapter (T4 RNA Ligase 2 truncated)
Fragment, Reverse Transcribe, and Destroy RNA
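
Since the 3'-end of each nascent fragment marks the polymerase position, downstream analysis reduces every alignment to a single coordinate. A minimal sketch, assuming alignments are reported in the orientation of the RNA with 0-based half-open coordinates; the function name and conventions are assumptions made for illustration.

```python
def polymerase_position(start, end, strand):
    """Return the genomic coordinate of the RNA 3'-end.

    `start` and `end` are 0-based, half-open alignment coordinates and
    `strand` is the strand of the nascent RNA (assumed conventions).
    """
    if start >= end:
        raise ValueError("alignment must span at least one base")
    if strand == "+":
        return end - 1   # last aligned base on the forward strand
    if strand == "-":
        return start     # leftmost base is the 3'-end on the reverse strand
    raise ValueError("strand must be '+' or '-'")


plus_site = polymerase_position(1000, 1030, "+")
minus_site = polymerase_position(1000, 1030, "-")
```

Counting these single positions along a gene gives a polymerase density profile at nucleotide resolution.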

29 Circularize with CircLigase
Amplify by PCR

30 RNA-Protein Interactions: Crosslinking Immunoprecipitation (CLIP-Seq)
[Figure: protein-RNA complex in the cell]
UV-crosslink cells
Extract RNA, Digest with RNase T1
Immunoprecipitate Target Protein

31 Proteolyze Target Protein
Ligate adapter, RT, sequence

32 RNA-Protein Interactions: Photoactivatable Ribonucleoside Enhanced CLIP (PAR-CLIP)
Metabolic Labeling with 4SU
RNA polymerase Run-on
CLIP-Seq ( x higher capture efficiency)
4-Thiouridine crosslinking at 365 nm enhances capture efficiency and binding site resolution
What are the disadvantages of PAR-CLIP?

33 Proteomics with RNA-Seq: Ribosome Profiling
Treat Cells with Cycloheximide, Lyse
Digest with RNase I
Isolate Monosomes with Sucrose Gradient

34 [Figure: sucrose gradient profile of the digested sample, OD at 254 nm from top to bottom of the gradient, with 40S, 60S, and 80S peaks]
Gel Purify Ribosomal Footprints
Ligate adapter, RT, Sequence

35 Single Cell Transcriptomics: A Zoo of New Approaches to Single Cell RNA-Seq
Why would you want to quantify the transcriptomes of individual cells?
What biological scenarios constitute true single cell problems?

36 Quantifying Full-Length Transcripts in Single Cells
Switch Mechanism at the 5'-end of RNA Transcripts (SMART-Seq)
Reverse Transcribe and Tail with MMLV-RT
MMLV-RT Undergoes Template Switching in the Presence of Complementary Template
[Figure: oligo(dT)-primed cDNA with a CCC tail paired to a GGG template]

37 Pre-amplify by PCR
Fragment, End Repair, Ligate Illumina Adapters (or insert with transposase)
Enrich Library by PCR

38 III. Experimental Challenges and Bias: How to Recognize Them and What To Do

39 Quantification Bias in RNA-Seq
Class Exercise with Real-Life Stories from the Sims Lab
Consider a standard poly(A)+ RNA-Seq experiment from mammalian tissue
What is going on here?
What experimental challenges might underlie this phenomenon?

40 Consider a microRNA-seq experiment
[Figure: normalized histogram showing the occurrence of all possible 3-base read termini relative to their occurrence in the transcriptome]
What is wrong with this?
What can be done to fix it?
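
A histogram like the one described on this slide can be computed by comparing how often each 3-base sequence ends a read with how often it occurs in a background set of sequences. The sketch below assumes the 3'-terminus of each read is the end of interest and uses tiny hypothetical sequences; the function name and inputs are illustrative, not the lab's actual analysis.

```python
from collections import Counter
from itertools import product


def terminal_kmer_enrichment(reads, background, k=3):
    """Ratio of each k-mer's frequency at read 3'-ends to its background frequency."""
    observed = Counter(read[-k:] for read in reads if len(read) >= k)
    expected = Counter(
        seq[i:i + k] for seq in background for i in range(len(seq) - k + 1)
    )
    n_observed = sum(observed.values())
    n_expected = sum(expected.values())
    if n_observed == 0 or n_expected == 0:
        raise ValueError("need at least one read and one background k-mer")

    enrichment = {}
    for kmer in ("".join(bases) for bases in product("ACGT", repeat=k)):
        if expected[kmer] == 0:
            continue  # ratio is undefined when the background lacks the k-mer
        enrichment[kmer] = (observed[kmer] / n_observed) / (expected[kmer] / n_expected)
    return enrichment


# Hypothetical toy data, for illustration only.
toy_reads = ["ACGTAAA", "GGGTAAA", "CCCTACG"]
toy_background = ["ACGTAAAACG"]
toy_enrichment = terminal_kmer_enrichment(toy_reads, toy_background)
```

In an unbiased library every ratio would be close to 1; strong departures point to sequence-dependent steps such as adapter ligation.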

41 Consider a carefully calibrated RNA-Seq experiment where the input copy number is known for each molecule

42 Single Molecule Barcoding can Mitigate PCR Bias
This approach can also increase sequencing accuracy. How?
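
A minimal sketch of how molecular barcodes are typically used: reads that share a gene and a barcode are treated as PCR copies of one input molecule, so molecules are counted as unique barcodes rather than reads, and the copies can be combined into a consensus that outvotes isolated sequencing errors. The function names, the (gene, barcode) input format, and the toy data are assumptions for illustration; barcode sequencing errors and barcode collisions are ignored.

```python
from collections import Counter, defaultdict


def count_molecules(reads):
    """Count unique barcodes per gene; `reads` is an iterable of (gene, barcode)."""
    barcodes_per_gene = defaultdict(set)
    for gene, barcode in reads:
        barcodes_per_gene[gene].add(barcode)
    return {gene: len(barcodes) for gene, barcodes in barcodes_per_gene.items()}


def consensus_sequence(sequences):
    """Majority vote at each position among equal-length reads sharing a barcode."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Hypothetical reads: three PCR copies of one GeneA molecule plus two others.
toy_reads = [
    ("GeneA", "AAC"), ("GeneA", "AAC"), ("GeneA", "AAC"),
    ("GeneA", "GTT"), ("GeneB", "CCA"),
]
molecule_counts = count_molecules(toy_reads)
consensus = consensus_sequence(["ACGT", "ACGT", "ACTT"])
```

Here five reads collapse to three molecules, and the single discordant base in the third copy is outvoted in the consensus.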

43 IV. Computational Tools for Processing RNA-Seq Data

44 Computational Tasks Involved in RNA-Seq Data Processing
  Demultiplexing: CASAVA, bcl2fastq
  Trimming: fastx toolkit, Picard
  Clipping: fastx toolkit, Picard
  rRNA removal: RSeQC, Bowtie
  Mapping to a reference genome: Bowtie, STAR, BWA
  Mapping to a reference transcriptome: Tophat, Bowtie, STAR, BWA
  Discovering alternative splicing: Tophat, OLego, MapSplice
  Read counting: Cufflinks, HTSeq, BEDTools
  Differential expression analysis: Cuffdiff, DESeq, edgeR
  de novo transcriptome assembly

45 Simple Case: Two Groups, Replicates
Map to the Reference Transcriptome and Compare
Map to the Reference Transcriptome with Tophat
WHAT YOU NEED:
  1) Raw Data: fastq file containing raw reads and quality scores from the sequencer
  2) Genome: fasta file containing the nucleotide sequence of each chromosome
  3) Transcriptome Annotation: gtf file containing the location of each exon of each gene
  4) Bowtie: program that uses the Burrows-Wheeler transform for fast mapping of short reads
WHAT WILL HAPPEN:
  1) Tophat will use Bowtie to map your reads to the reference genome
  2) Reads that do not map will then be mapped to annotated splice junctions
  3) Tophat will output a bam file containing the alignments to the reference
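
The inputs listed above translate into a single Tophat invocation. The sketch below only assembles the command line in Python and does not run it. It uses Tophat's `-G` (annotation gtf), `-o` (output directory), and `-p` (threads) options; all file names are hypothetical placeholders. Note that Tophat takes the base name of a Bowtie index built from the genome fasta rather than the fasta file itself.

```python
def tophat_command(bowtie_index, fastq, gtf, output_dir, threads=1):
    """Build the argument list for a single-end Tophat run (not executed here)."""
    return [
        "tophat",
        "-p", str(threads),   # number of threads
        "-G", gtf,            # transcriptome annotation
        "-o", output_dir,     # directory that will receive the bam output
        bowtie_index,         # base name of the Bowtie index for the genome
        fastq,                # raw reads
    ]


# Hypothetical file names, for illustration only.
cmd = tophat_command(
    bowtie_index="genome_index",
    fastq="sample1.fastq",
    gtf="genes.gtf",
    output_dir="sample1_tophat",
    threads=4,
)
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Running one such command per sample yields one alignment file per replicate, which is the input to the counting step.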

46 Simple Case: Two Groups, Replicates
Map to the Reference Transcriptome and Compare
Count the Mapped Reads for Each Gene with HTSeq
WHAT YOU NEED:
  1) Alignments: bam file containing genome and transcriptome alignments from Tophat
  2) Transcriptome Annotation: gtf file containing the location of each exon of each gene
WHAT WILL HAPPEN:
  1) HTSeq is a simple Python script that will count reads that uniquely map to the transcriptome for each gene
  2) HTSeq will output a text file containing each gene and an integer number of counts
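
The counting rule can be illustrated with a toy re-implementation. This is not HTSeq's code: it is a simplified sketch of the idea that a read is counted for a gene only when it overlaps the exons of exactly one gene, with other reads tallied as having no feature or as ambiguous. The data structures, coordinates, and gene names are hypothetical, and spliced alignments and strand are ignored.

```python
from collections import Counter


def genes_overlapping(read, gene_exons):
    """Return the set of genes with an exon overlapping the read."""
    chrom, start, end = read  # 0-based, half-open (assumed convention)
    hits = set()
    for gene, exons in gene_exons.items():
        for exon_chrom, exon_start, exon_end in exons:
            if exon_chrom == chrom and start < exon_end and exon_start < end:
                hits.add(gene)
                break
    return hits


def count_reads(reads, gene_exons):
    """Count reads per gene, assigning a read only if exactly one gene matches."""
    counts = Counter({gene: 0 for gene in gene_exons})
    for read in reads:
        hits = genes_overlapping(read, gene_exons)
        if len(hits) == 1:
            counts[next(iter(hits))] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation and alignments.
toy_exons = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
toy_alignments = [
    ("chr1", 120, 170), ("chr1", 310, 360),
    ("chr1", 370, 420), ("chr1", 600, 650),
]
toy_counts = count_reads(toy_alignments, toy_exons)
```

The result is an integer count per gene, which is exactly the kind of table the differential expression step expects.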

47 Simple Case: Two Groups, Replicates
Map to the Reference Transcriptome and Compare
Differential Expression Analysis with DESeq
WHAT YOU NEED:
  1) Read counts: text file from HTSeq containing read counts for each sample
  2) Sample grouping: a table grouping each sample by condition
WHAT WILL HAPPEN:
  1) DESeq is an R program that will normalize read counts for each sample and produce a matrix of counts
  2) DESeq will attempt to model the noise in the data using a negative binomial distribution
  3) DESeq will then use its estimate of the read count distribution to test for differential expression and report an FDR-corrected p-value
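
The normalization step can be sketched in a few lines. DESeq itself is an R package; the Python below is only an illustration of the median-of-ratios idea described by Anders and Huber (2010), in which each sample's size factor is the median, over genes, of that sample's count divided by the gene's geometric mean across samples. The function name and toy counts are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; `counts` maps gene -> list of raw counts."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) <= 0:
            continue  # geometric mean is zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for sample, count in enumerate(gene_counts):
            log_ratios[sample].append(math.log(count) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10], "g4": [0, 3]}
factors = size_factors(toy_counts)
normalized = {
    gene: [c / f for c, f in zip(values, factors)]
    for gene, values in toy_counts.items()
}
```

Using the median makes the estimate robust to a minority of genes that are genuinely differentially expressed, which is why it is preferred over dividing by total read count.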

48 RNA-Seq Data are Overdispersed
RNA sequencing is a Poisson process, but the data are overdispersed relative to a Poisson model
[Figure panels: Poisson Distribution Model; DESeq with Bias Correction and Negative Binomial Model]
Anders and Huber, Genome Biology, 2010
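
Overdispersion can be seen directly in replicate counts. Under a Poisson model the variance equals the mean, whereas the negative binomial adds a quadratic term, Var = mu + alpha * mu^2. The sketch below estimates alpha for one gene by the method of moments; it is a crude single-gene illustration with hypothetical counts, not DESeq's procedure, which shares information across genes.

```python
from statistics import mean, variance


def dispersion_estimate(replicate_counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mu = mean(replicate_counts)
    var = variance(replicate_counts)  # sample variance across replicates
    if mu == 0:
        return 0.0
    # A Poisson process would give var close to mu, hence alpha close to 0.
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical normalized counts for one gene across five replicates.
toy_replicates = [80, 100, 120, 140, 60]
alpha = dispersion_estimate(toy_replicates)
```

In this toy example the sample variance is ten times the mean, so a Poisson test would badly overstate the significance of a difference between conditions.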