RNA Sequencing: Next-gen insight into transcriptomes. 05-06-2013, Elio Schijlen

Transcriptome: the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome. The key aims of transcriptomics are: to catalogue all species of transcripts, including mRNAs, non-coding RNAs and small RNAs; to determine the transcriptional structure of genes, in terms of their start sites, 5' and 3' ends, splicing patterns and other post-transcriptional modifications; and to quantify the changing expression levels of each transcript during development and under different conditions.

Recently, the development of novel high-throughput DNA sequencing methods has provided a new way of determining, mapping and quantifying transcriptomes. This method, termed RNA-Seq (RNA sequencing), has clear advantages over previous approaches and is revolutionizing the manner in which eukaryotic transcriptomes are analysed.

Illumina HiSeq2000, 454, SOLiD 5500, Ion Proton, PacBio RS

From next to 3rd generation sequencing: detection chemistry
Illumina HiSeq: fluorescent nucleotide scanning
SOLiD: ligation of fluorescent oligos
454: pyrosequencing
Ion Proton: hydrogen detection
PacBio: real-time fluorescent detection

From next to 3rd generation sequencing: sequence template
Illumina HiSeq: ssDNA sequence template, clonally amplified into clusters on a glass slide (flow cell)
SOLiD 5500: idem
454: ssDNA sequence template, clonally amplified on beads (emPCR)
Ion Proton: idem
PacBio: dsDNA sequence template, single molecule/polymerase molecule complex

Illumina HiSeq2000 (instrument figure; labelled parts: syringe pumps, optics, flow cell access door, reagents compartment; flow cell with 8 channels)

Illumina HiSeq2000 workflow (figure): DNA (0.1-5.0 μg), library preparation, single molecule array, cluster growth, sequencing (cycles 1 to 9), image acquisition, base calling

Eusol BACs: 177.14 M PF clusters; 33.8 Gb >Q30

Lane  Sample ID  Sample Ref       Index         Yield (Mbases)  % PF   # Reads     % of raw clusters per lane
1     lane1      unknown          Undetermined  3,234           87.47  36,608,108   9.74
1     plate10    EUsol_fill_gaps  TAGCTT        3,359           94.77  35,088,534   9.34
1     plate1     EUsol_fill_gaps  ATCACG        4,150           95.35  43,091,246  11.47
1     plate2     EUsol_fill_gaps  CGATGT        3,480           95.66  36,020,422   9.59
1     plate3     EUsol_fill_gaps  TTAGGC        3,496           95.27  36,331,200   9.67
1     plate4     EUsol_fill_gaps  TGACCA        4,674           95.40  48,508,022  12.91
1     plate5     EUsol_fill_gaps  ACAGTG        2,305           93.65  24,365,574   6.49
1     plate6     EUsol_fill_gaps  GCCAAT        1,895           94.83  19,783,144   5.27
1     plate7     EUsol_fill_gaps  CAGATC        3,366           94.90  35,115,836   9.35
1     plate8     EUsol_fill_gaps  ACTTGA        2,592           95.29  26,934,126   7.17
1     plate9     EUsol_fill_gaps  GATCAG        3,232           94.59  33,829,830   9.01

Description column: "Clusters with unmatched barcodes for lane 1" for the lane1/Undetermined row; empty for all other rows.
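The "% of raw clusters per lane" column in the lane 1 report above can be recomputed from the "# Reads" column. The following is a minimal sketch of that check, assuming the percentage is simply each barcode's read count divided by the lane total; the function name `read_shares` is made up for illustration and is not part of any Illumina software.

```python
# Read counts per barcode, copied from the lane 1 report above.
reads_per_sample = {
    "Undetermined": 36_608_108,  # clusters with unmatched barcodes
    "plate10": 35_088_534,
    "plate1": 43_091_246,
    "plate2": 36_020_422,
    "plate3": 36_331_200,
    "plate4": 48_508_022,
    "plate5": 24_365_574,
    "plate6": 19_783_144,
    "plate7": 35_115_836,
    "plate8": 26_934_126,
    "plate9": 33_829_830,
}


def read_shares(counts):
    """Return each sample's share of the lane as a percentage of all reads."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = read_shares(reads_per_sample)

# Print samples from most to least represented to spot unbalanced pooling.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{pct:6.2f} %")
```

A listing like this makes uneven pooling visible at a glance: in this lane the share per plate ranges from roughly 5% to 13%, and just under 10% of reads could not be assigned to a barcode.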

SOLiD 5500

454 sequencing technology & workflow

NGS - 454 pyrosequencing raw read GCTAAG

Ion semiconductor sequencing

Ion Torrent PGM & Proton

3rd gen sequencing: PacBio SMRT sequencing. Kb read length; <50,000 reads; <100 Mb

PacBio sequencing
Phospholinked fluorophore, cleaved by the DNA polymerase: the fluorophore is clipped off by the polymerase
The DNA synthesized is natural
No steric hindrance or accumulation of background signal
ZMW: zero-mode waveguide

Sequence read length (raw), quality
Illumina HiSeq: fixed 50 or 100 nt, SR and PE
SOLiD 5500: fixed 75 nt
454: range 50-1,000 nt (av ~750)
Ion Torrent: range 50-200 nt (av ~170)
PacBio: range 50-20,000 nt (av ~3-4 kb)

Sequence read quality
Illumina HiSeq: HQ reads, systematic errors; lower quality 3' ends; low GC coverage
SOLiD: very HQ reads; lower quality 3' ends
454: HQ reads, systematic errors; homopolymer problems; clonality; lower quality 3' ends
Ion Torrent: idem, but lower overall quality
PacBio: low quality (0.8-0.85); random errors; no decrease in read quality at the 3' end
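To compare quality figures quoted on different scales, it can help to convert between accuracy and Phred quality scores (Q = -10 log10 of the error probability). The sketch below assumes that the PacBio figure of 0.8-0.85 is a per-base raw read accuracy and that "Q30" in the Illumina run report is a Phred score; both readings are assumptions, and the function names are illustrative.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred score."""
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_error(q):
    """Convert a Phred score to the per-base error probability."""
    return 10.0 ** (-q / 10.0)


# Q30 corresponds to one expected error per 1,000 bases.
print(f"Q30 error probability: {phred_to_error(30):.4f}")

# Assumed interpretation: 0.80-0.85 is raw per-base accuracy.
for acc in (0.80, 0.85):
    print(f"accuracy {acc:.2f} is about Q{accuracy_to_phred(acc):.1f}")
```

Under these assumptions the raw PacBio reads sit well below Q10, while a Q30 base has an error probability of 0.001.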

Sequence reads & throughput/run
Illumina HiSeq: 1.5 E+09 reads (full flow cell), 12 days/run; up to 550 Gb (2 cells)
SOLiD 5500XL: 1.5 E+09 reads (full flow cell), 6 days/run; up to 240 Gb (2 flow chips)
454: 1 E+06 reads (full PTP), 1 day/run; up to 1 Gb
Ion Torrent: 60-80 E+06 reads (Ion PI chip), 4 hours/run; up to 10 Gb
PacBio: 300,000 reads (8 cell strip), 1 day/run; up to 0.75 Gb
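Because run times differ so much between platforms, yield per run is easier to compare when normalized by run duration. The following rough sketch uses only the "up to" yields and run times listed above, so the results are upper bounds rather than typical values; it ignores library preparation, instrument setup and analysis time.

```python
# (maximum yield in Gb per run, run time in days), taken from the list above.
runs = {
    "Illumina HiSeq": (550.0, 12.0),
    "SOLiD 5500XL": (240.0, 6.0),
    "454": (1.0, 1.0),
    "Ion Torrent": (10.0, 4.0 / 24.0),  # 4 hours expressed in days
    "PacBio": (0.75, 1.0),
}

# Upper-bound throughput in Gb per day of instrument time.
gb_per_day = {name: gb / days for name, (gb, days) in runs.items()}

for name, rate in sorted(gb_per_day.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15}{rate:8.2f} Gb/day")
```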

Transcript coverage

Samples for sequencing (figure): genomic DNA, mRNA, small RNA, active chromatin (ChIP-sequencing), other applications. Library preparation: ligate adapters to both ends of the fragmented nucleic acid.

RNA input requirements
RNA: DNA-free, RNase-free, non-degraded, no contaminants (proteins, polysaccharides)

Protocol variations
Fragmentation methods. RNA: nebulization, hydrolysis. cDNA: sonication, DNase I treatment.
Depletion of highly abundant transcripts. Positive selection of mRNA: poly(A) selection or target-specific. Negative selection: RiboMinus, RNase H.
Strand specificity. Most RNA sequencing is not strand-specific.
Single-end or paired-end sequencing.

(Illumina) RNA seq workflow

Aligning the millions of reads to a "reference genome". Many tools are available for aligning genomic reads to a reference genome (sequence alignment tools). However, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes that contain introns. As discussed above, the sequence libraries are created by extracting mRNA via its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. The library and the short reads obtained from it therefore do not contain intronic sequences. When these short reads are aligned to a reference genome with a standard aligner, only reads lying entirely inside an exon will be matched, while reads that span an exon-exon junction will not. Several software packages exist for short read alignment, and specialized tools for transcriptome data have recently been developed, e.g. TopHat (spliced read alignment) and Cufflinks (transcript assembly and quantification).
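The junction problem can be illustrated with a toy example. The sketch below is not how TopHat or any real aligner works internally; it uses exact string matching on a made-up 34 nt "genome" with one intron, and the sequences and function names are hypothetical. It only shows why a contiguous search fails for a junction-spanning read while a split search succeeds.

```python
def align_contiguous(read, genome):
    """Return the 0-based start of an exact, ungapped match, or -1 if none."""
    return genome.find(read)


def align_spliced(read, genome, min_anchor=4):
    """Split the read in two and look for both parts separated by a gap.

    Returns (left_start, left_length, right_start) for the first split that
    places both parts exactly, with at least one skipped base between them,
    or None if no such placement exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            # The right part must start after the left part plus a gap.
            right_start = genome.find(right, left_start + split + 1)
            if right_start != -1:
                return left_start, split, right_start
            left_start = genome.find(left, left_start + 1)
    return None


# Hypothetical gene: two exons separated by an intron (GT...AG).
exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTTTTTTCAG", "CTGACCGGTA"
genome = exon1 + intron + exon2
mrna = exon1 + exon2  # spliced transcript, intron removed

exonic_read = mrna[0:8]     # lies entirely inside exon 1
junction_read = mrna[5:15]  # spans the exon 1 / exon 2 junction

print(align_contiguous(exonic_read, genome))    # 0
print(align_contiguous(junction_read, genome))  # -1
print(align_spliced(junction_read, genome))     # (5, 5, 24)
```

In the last result, the first 5 bases of the read are placed at genome position 5 and the remainder at position 24; the skipped positions 10 to 23 are exactly the intron.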

Sequence coverage
A. thaliana: approx. 60 E+06 mapped reads result in a plateau in the number of unique gene models expressed (approx. 20,000)
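A plateau like this is usually found by subsampling the mapped reads at increasing depths and counting how many distinct genes are detected at each depth. The sketch below shows the idea on simulated data; the gene names, the skewed expression weights and the depths are all made up, and a real analysis would draw the subsamples from the actual read-to-gene assignments.

```python
import random


def saturation_curve(read_genes, depths, seed=0):
    """Count distinct genes seen in random subsamples of increasing size.

    read_genes: one gene identifier per mapped read.
    depths: subsample sizes, in increasing order.
    Returns a list of (depth, number_of_detected_genes).
    """
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)
    # Nested prefixes of one shuffle give consistent, growing subsamples.
    return [(depth, len(set(shuffled[:depth]))) for depth in depths]


# Simulated data: 200 hypothetical genes with strongly skewed expression.
sim = random.Random(1)
genes = [f"gene{i}" for i in range(200)]
weights = [1.0 / (i + 1) for i in range(200)]
read_genes = sim.choices(genes, weights=weights, k=20_000)

depths = [100, 1_000, 5_000, 20_000]
curve = saturation_curve(read_genes, depths)
for depth, detected in curve:
    print(f"{depth:>6} reads: {detected} genes detected")
```

The number of detected genes rises quickly at low depth and flattens as only rare transcripts remain to be found, which is the behaviour the A. thaliana figure describes.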

Multi-mapped reads
Multi-mapped 50 nt SR reads (A. thaliana: ~5%) can cause inaccurate expression estimates.
Example (figure): Tubulin B chain reads mapped to the reference genome (gray). Blue lines: intron-spanning reads. Histograms: read coverage (blue: contributed by multi-mapped reads; green: contributed by uniquely mapped reads). Including multi-mapped reads artificially increases the expression value.
Example (figure): read mapping for 2 genes that share a genome region at their 3' ends on opposite strands. Multi-mapped reads derived from the + strand would severely overestimate expression of the - strand gene.
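The effect of multi-mapped reads on counts can be sketched with a small example. The three counting modes below (unique reads only, every hit counted in full, and each read split equally over its hits) are common conventions rather than anything prescribed in these slides, and the gene names and read numbers are hypothetical.

```python
from collections import Counter


def count_reads(alignments, mode):
    """Count reads per gene.

    alignments: one tuple of gene names per read (all genes the read maps to).
    mode: "unique"     - use only reads that map to a single gene
          "all"        - every hit of a multi-mapped read counts as 1
          "fractional" - a read with n hits adds 1/n to each gene
    """
    counts = Counter()
    for genes in alignments:
        if mode == "unique":
            if len(genes) == 1:
                counts[genes[0]] += 1
        elif mode == "all":
            for gene in genes:
                counts[gene] += 1
        elif mode == "fractional":
            for gene in genes:
                counts[gene] += 1 / len(genes)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return counts


# Hypothetical paralogs: 8 reads unique to TUB_A, 2 unique to TUB_B,
# and 4 reads that map equally well to both.
alignments = [("TUB_A",)] * 8 + [("TUB_A", "TUB_B")] * 4 + [("TUB_B",)] * 2

for mode in ("unique", "all", "fractional"):
    print(mode, dict(count_reads(alignments, mode)))
```

Counting every hit in full triples the apparent expression of the weakly expressed gene (6 instead of 2 reads), which is the kind of artificial increase described above.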

Ekblom et al., 2012 Comparative and Functional Genomics doi:10.1155/2012/281693

Wenger and Galliot BMC Genomics 2013, 14:204 doi:10.1186/1471-2164-14-204

Some considerations
The information gathered by RNA-seq has similar limitations as other RNA expression analysis pipelines.
RNA status dependent. Biological variables: tissue specific; time dependent. Triplicates! During a cell's lifetime and context, its gene expression levels change.
Strongly RNA quality dependent.
Library prep method dependent.
Sequencing technology dependent.
Analysis method dependent.
Because of this, care must be taken when drawing conclusions from the sequencing experiment. Results must be verified using an independent technology.