The Iso-Seq Method: Transcriptome Sequencing Using Long Reads


1 The Iso-Seq Method: Transcriptome Sequencing Using Long Reads. Elizabeth Tseng, Ph.D., Senior Staff Scientist. FIND MEANING IN COMPLEXITY. For Research Use Only. Not for use in diagnostic procedures. Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

2 Transcription Variation; Proteomic/Gene Complexity (slide from G. Sheynkman, ASMS talk)

3 A Single Gene Locus, Many Transcripts (slide from G. Sheynkman, ASMS talk)

4 Short reads cannot accurately assemble complex transcripts. "the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination"; "assembly of complete isoform structures poses a major challenge even when all constituent elements are identified"; "Ultimately, the evolution of RNA-seq will move toward single-pass determination of intact transcripts." Steijger et al. (2013) Assessment of transcript reconstruction methods for RNA-Seq. Nature Methods doi: /nmeth.2714.

5 Iso-Seq Method: PacBio Transcriptome Sequencing. Single-molecule observation: one read = one transcript. Sequence transcript in full length: 0 15 full-length transcripts, no assembly required. The term "Iso-Seq method" can refer to any transcriptome (cDNA) sequencing using the PacBio System, including sequencing that does not follow the recommended library preparation or the Iso-Seq bioinformatics pipeline (ICE + Quiver, later slides).

6 Iso-Seq Library Workflow: Total RNA; PCR Optimization; Optional Poly-A Selection; polyA+ RNA; Full Length 1st Strand cDNA; Reverse Transcription (SMARTScribe RT); Large Scale Amplification (Phusion DNA Polymerase); Amplified cDNA; Re-Amplification (Phusion DNA Polymerase); SMRTbell Template Preparation; Optional Size Selection (BluePippin); Size Selection (gel / BluePippin / SageELF); SMRT Sequencing. Size cuts can be arbitrary. Current max FL transcript seen: 15

7 Full-Length (FL) read identification. Full-Length = 5' primer seen, polyA tail seen, 3' primer seen. Identify and remove primers and polyA/T tail. Identify transcript strandedness.
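The full-length rule on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the classifier used in the Iso-Seq pipeline: the primer sequences `FIVE_PRIME` and `THREE_PRIME` are hypothetical placeholders, matching is exact (real reads would need error-tolerant matching), and the names `classify_read` and `min_polya` are invented for illustration.

```python
import re

# Hypothetical placeholder primers; substitute the primers used in library prep.
FIVE_PRIME = "ACGTACGTAC"
THREE_PRIME = "TTGCATTGCA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_read(read, min_polya=8):
    """Call a read full-length if the 5' primer, a polyA tail, and the
    3' primer are all seen, checking the read and its reverse complement.
    Exact matching is an assumption made to keep the sketch short."""
    for strand, seq in (("+", read), ("-", revcomp(read))):
        if not (seq.startswith(FIVE_PRIME) and seq.endswith(THREE_PRIME)):
            continue
        # Remove both primers, then look for a polyA run at the 3' end.
        insert = seq[len(FIVE_PRIME):len(seq) - len(THREE_PRIME)]
        tail = re.search(r"A{%d,}$" % min_polya, insert)
        if tail:
            return {"full_length": True, "strand": strand,
                    "transcript": insert[:tail.start()]}
    return {"full_length": False, "strand": None, "transcript": None}
```

A read observed on the opposite strand shows a polyT stretch near its start; taking the reverse complement turns that into the polyA case, which is how the sketch recovers strandedness.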

8 Bioinformatics Challenge. SAMPLE INPUT / SEQUENCING OUTPUT: TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT. Need to recover the original sequence: Error Correction.

9 Bioinformatics Challenge. SAMPLE INPUT / SEQUENCING OUTPUT: TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT. POST-ERROR CORRECTION: TATAGGCAAGTAACGTT. Need to recover the original sequence: Error Correction.

10 Bioinformatics Challenge. SAMPLE INPUT / SEQUENCING OUTPUT: TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT. POST-ERROR CORRECTION: : 3 : 2 TATAGGCAAGTAACGTT: 2. Need to recover the original sequence: Error Correction.
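The idea of recovering original sequences and their counts from noisy reads can be illustrated with a toy example. This is only a sketch under stated assumptions and is not ICE or Quiver: reads are grouped greedily by edit distance to the first read of each group, and each group is reported by its medoid (the read closest to all others) rather than by a true consensus. The function names and the `max_dist` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def cluster_reads(reads, max_dist=2):
    """Greedy clustering: a read joins the first cluster whose seed
    (first member) is within max_dist edits, else it starts a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if edit_distance(read, cluster[0]) <= max_dist:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def representative(cluster):
    """Medoid: the member with the smallest total distance to all members."""
    return min(cluster, key=lambda r: sum(edit_distance(r, o) for o in cluster))


def correct_and_count(reads, max_dist=2):
    """Map each cluster's representative sequence to its read count."""
    return {representative(c): len(c) for c in cluster_reads(reads, max_dist)}
```

With reads taken from the slide, for example two copies of TATAGGCAAGTAACGTT plus TATAGGCAAGTACGTT (one deletion) and TATAGGGGCAAGTAACGTT (two insertions), all four land in one cluster represented by TATAGGCAAGTAACGTT. The greedy scheme depends on read order, which is one reason a real pipeline uses a more careful clustering step.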

11 Error Correction: Three Approaches
Tool | Author | Genome-Guided | Hybrid (long + short reads) | Abundance Inference
ToFU (RS_IsoSeq) | Liz T. | N | N | (not really)
CONVEX | Meisam R. (David T.) | N | N | Y
LSC + IDP | Kin Fai A. | Y | Y | Y

12 ToFU: The ICE + Quiver error correction pipeline

13 Transcript isoforms: Full-length and Unassembled. Methods are available in the paper supplement. De novo (no reference genome required); no assembly; can handle any read length; works for mixed accuracy; post-Quiver: % accuracy. ToFU is available through SMRT Analysis (RS_IsoSeq) and GitHub (ToFU).

14 ToFU pipeline: classify, cluster (ICE), Quiver polishing. Per-molecule reads (ReadsOfInsert, aka CCS reads): non-FL reads and full-length (FL) reads. Clusters of transcript alignments using FL + nFL reads (Transcript 1, Transcript 2, Transcript 3). Isoform-level clusters: ICE. Final transcript consensus: Quiver (Transcript 1, Transcript 2, Transcript 3).
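The role of non-FL reads in this pipeline, adding supporting reads to isoform-level clusters before polishing, can be sketched as follows. This is an illustrative toy, not the ToFU implementation: a non-FL read is assigned to every isoform that contains it as an exact substring, whereas a real pipeline would align reads with error tolerance. The name `assign_non_fl` and the isoform labels are hypothetical.

```python
def assign_non_fl(isoforms, non_fl_reads):
    """Assign each non-FL read to every isoform that contains it exactly.

    isoforms: dict mapping an isoform name to its sequence.
    Returns (support, unassigned), where support maps each isoform name
    to the list of non-FL reads consistent with it.
    """
    support = {name: [] for name in isoforms}
    unassigned = []
    for read in non_fl_reads:
        hits = [name for name, seq in isoforms.items() if read in seq]
        if hits:
            # A fragment from a shared region can support several isoforms.
            for name in hits:
                support[name].append(read)
        else:
            unassigned.append(read)
    return support, unassigned
```

A fragment that falls entirely within sequence shared by two isoforms supports both, which shows why partial reads alone cannot distinguish isoforms and why the full-length reads define the clusters first.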

15 ToFU reveals transcriptional complexity in P. crispa. Top: short-read mapping. Bottom: PacBio transcripts. Gray: single-gene transcripts. Green: polycistronic transcripts that span 2+ genes. Gordon & Tseng, 2015

16 From Novel Transcripts to Novel Proteins. PacBio public MCF-7 dataset: ~90% of predicted ORFs matched a mass spec peptide; 251 novel ORFs found unique to MCF-7. Sheynkman, ASMS talk 2014
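Predicting an ORF from a full-length transcript can be illustrated with a simple longest-ORF scan. This is a minimal sketch and an assumption about how such a prediction might look, not the method used for the MCF-7 analysis: it scans the three forward frames only (strand is taken as already known from full-length classification), requires an ATG start and an in-frame stop codon, and uses the standard genetic code. The name `longest_orf` is hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(transcript):
    """Return the longest protein (ATG to in-frame stop) found in the
    three forward frames, or an empty string if there is none."""
    best = ""
    for frame in range(3):
        protein = None  # None means we are not inside an ORF
        for i in range(frame, len(transcript) - 2, 3):
            aa = CODON_TABLE.get(transcript[i:i + 3], "X")  # X: unknown codon
            if protein is None:
                if aa == "M":
                    protein = "M"
            elif aa == "*":
                if len(protein) > len(best):
                    best = protein
                protein = None
            else:
                protein += aa
    return best
```

Predicted proteins of this kind are what would then be compared against mass spectrometry peptides.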

17 For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.