de novo Transcriptome Assembly. Nicole Cloonan, 1st July 2013, Winter School, UQ



2 de novo transcriptome assembly
"De novo" is a Latin expression meaning "from the beginning". In bioinformatics, we often use de novo to mean "without reference to existing knowledge". In this case, de novo transcriptome assembly is the reconstruction of the transcriptome without a reference genome.
Queensland Institute of Medical Research

3 Pros and Cons of de novo assembly
Pros
- no reference genome required
- can detect exogenous transcripts (e.g. viral infections)
- splice sites from exon-exon junctions are no longer a problem
- can assemble fusion genes or trans-spliced transcripts
Cons
- often massive computational resources required
- higher sequencing depth required (>30X vs <10X with a reference)
- difficult to correct for sequencing errors in poorly expressed genes
- no real way to distinguish between sequence artifacts and true trans-spliced genes
Reviewed in Martin & Wang, Nature Reviews Genetics 12

4 The Jigsaw Puzzle Analogy

5 The Jigsaw Puzzle Analogy (continued)
- There's more than one solution (alternative splicing)
- There's no way to know if you are correct
- There's no box with a picture
- There are no colours
- More than 100 million pieces
- Many pieces are redundant (but not equally redundant, expression differs)
- Many pieces fit in multiple places
- There are no edge or corner pieces
- All pieces are different sizes
- The puzzle is double sided (and you are working on both at the same time)
- Some pieces are fused to other pieces (but they shouldn't be, and you can't tell which ones)
- Some pieces are from another puzzle (but you can't tell which ones by looking)
- Some pieces are broken (but you can't tell which ones by looking)
- Lots of missing pieces

6 What are the tools available? How do they work? Which one performs the best? Can we believe the results?

7 What tools are available?
- Trans-ABySS
- Rnnotator (Velvet)
- Trinity
- Oases (Velvet)
- SOAPdenovo-Trans
- qnovo
- EBARDenovo
All except EBARDenovo are de Bruijn graph assemblers. Not a comprehensive list!
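Since most of the tools listed are de Bruijn graph assemblers, a toy illustration of the underlying data structure may help. The sketch below is not taken from any of these tools; the function name `build_de_bruijn` and the example reads are made up for illustration, and it ignores sequencing errors, reverse complements, and coverage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read; each k-mer becomes one edge
        # from its (k-1)-base prefix to its (k-1)-base suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy, error-free reads sampled from one short hypothetical transcript.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Here every node has exactly one successor, so the transcript can be read off as a single path. Real data produce branches (from splicing, paralogues, and errors), which is where the assemblers differ.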

8 What are the tools available? How do they work? Which one performs the best? Can we believe the results?

9 Trans-ABySS
- Based on the ABySS genome assembler
- Assembly across a range of k-mer sizes (lower k = more sensitive; higher k = more accurate)
- Comes with a cool visual explorer of graphs
[Figure: ABySS-Explorer representations of the assembly of the EZH2 gene, illustrating a novel alternative splicing event.]
Birol et al., Bioinformatics (2009) 25 (21)
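The trade-off between low and high k can be shown on toy sequences. This is a rough sketch under simplifying assumptions (no errors, single strand), not Trans-ABySS code; the helper names and the example sequences are hypothetical.

```python
def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reads_connect(read_a, read_b, k):
    """True if the two reads share at least one (k-1)-mer node in the graph."""
    return bool(kmers(read_a, k - 1) & kmers(read_b, k - 1))


def branching_nodes(seqs, k):
    """Count (k-1)-mer nodes with more than one distinct successor."""
    successors = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            successors.setdefault(kmer[:-1], set()).add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# Sensitivity: two reads from a poorly covered transcript overlap by only
# 4 bases ("TGCA"). A small k still joins them; a large k does not.
read_a, read_b = "ACGTTGCA", "TGCAGGTC"
print(reads_connect(read_a, read_b, k=5), reads_connect(read_a, read_b, k=7))

# Accuracy: two unrelated sequences share the short repeat "GAT". A small k
# creates ambiguous branches between them; a larger k keeps them apart.
t1, t2 = "AAGATCC", "TTGATGG"
print(branching_nodes([t1, t2], k=3), branching_nodes([t1, t2], k=5))
```

Assembling at several values of k and merging the results is one way to get some of the benefit of both ends of this trade-off.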

10 Rnnotator
- Based on Velvet
- Can use alternative assembler of choice
- Complete pipeline
- Multiple k-mer assemblies
- Does not explicitly deal with alternatively spliced isoforms
Martin et al., BMC Genomics 2010, 11:663

11 Trinity
- Inchworm performs greedy k-mer extension
- Chrysalis builds de Bruijn graphs from pooled contigs
- Butterfly compacts graphs and extracts sequence
- Can deal well with alternative transcription
- Effectively a single k-mer analysis
Grabherr et al., Nature Biotechnology 29 (2011)
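To give a feel for what "greedy k-mer extension" means, here is a heavily simplified sketch. It is not Trinity's Inchworm implementation: the function names are hypothetical, it builds only one contig, and it ignores reverse complements, error pruning, and tie-breaking rules.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_contig(counts, k):
    """Grow one contig from the most abundant k-mer, always taking the most
    abundant unused k-mer that overlaps the current end by k-1 bases."""
    seed = max(counts, key=counts.get)
    contig, used = seed, {seed}
    # Extend to the right, one base at a time.
    while True:
        options = [contig[-(k - 1):] + base for base in "ACGT"]
        options = [o for o in options if o in counts and o not in used]
        if not options:
            break
        best = max(options, key=counts.get)
        used.add(best)
        contig += best[-1]
    # Extend to the left in the same way.
    while True:
        options = [base + contig[:k - 1] for base in "ACGT"]
        options = [o for o in options if o in counts and o not in used]
        if not options:
            break
        best = max(options, key=counts.get)
        used.add(best)
        contig = best[0] + contig
    return contig


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # toy, error-free reads
contig = greedy_contig(count_kmers(reads, k=4), k=4)
print(contig)
```

Because each step commits to the single most abundant extension, a greedy contig follows only one path at a branch; resolving the alternatives is left to the later graph-based stages.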

12 Oases
- Based on the Velvet genome assembler
- Merges multiple k-mer assemblies and borrows Trinity's approach to splicing
Schulz M H et al. Bioinformatics 2012;28

13 SOAPdenovo-Trans
- Based on the SOAPdenovo2 genome assembler
- Borrowed the things that worked well for Trinity
- Also borrowed things that work well from Oases (which borrowed from Trinity)
Xie et al., (pre-publication)

14 EBARDenovo
- NOT a de Bruijn graph assembler
- Uses full reads
- Uses consensus to build highly accurate contigs whilst being robust to sequencing errors
- Also resistant to chimeric errors (from both paralogues and sequence artifacts)
Chu H et al. Bioinformatics 2013;29

15 What are the tools available? How do they work? Which one performs the best? Can we believe the results?

16 Completion Time (hours)
Xin et al., Science China Life Sciences Vol.56 No.2

17 Memory Usage
Xin et al., Science China Life Sciences Vol.56 No.2

18 Contig Length
Xin et al., Science China Life Sciences Vol.56 No.2

19 Length isn't everything
Metric: Description
Accuracy: The percentage of the correctly assembled bases, estimated using the set of expressed reference transcripts (N).
Completeness: The percentage of expressed reference transcripts covered by all the assembled transcripts.
Contiguity: The percentage of expressed reference transcripts covered by a single, longest-assembled transcript.
Chimerism: The percentage of chimaeras that occur owing to misassemblies among all of the assembled transcripts.
Variant Resolution: The percentage of transcript variants assembled.
Martin & Wang, Nature Reviews Genetics 12
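As a rough illustration of how completeness and contiguity differ, the sketch below computes both from contig-to-reference alignments. The input layout, the function names, and the 80% coverage threshold are all assumptions made for this example; the slide does not specify a threshold or a data format.

```python
def covered_fraction(length, intervals):
    """Fraction of a reference covered by the union of half-open (start, end)
    intervals, assumed to lie within [0, length]."""
    covered, last_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / length


def completeness_and_contiguity(references, alignments, threshold=0.8):
    """references: {ref_id: length}
    alignments: {ref_id: {contig_id: [(start, end), ...]}}
    Returns (completeness %, contiguity %) over all references."""
    complete = contiguous = 0
    for ref_id, length in references.items():
        per_contig = alignments.get(ref_id, {})
        # Completeness: all contigs together cover enough of the reference.
        pooled = [iv for ivs in per_contig.values() for iv in ivs]
        if covered_fraction(length, pooled) >= threshold:
            complete += 1
        # Contiguity: one single contig covers enough of the reference.
        best_single = max(
            (covered_fraction(length, ivs) for ivs in per_contig.values()),
            default=0.0,
        )
        if best_single >= threshold:
            contiguous += 1
    n = len(references)
    return 100 * complete / n, 100 * contiguous / n


# Hypothetical example: tx1 is covered by one contig, tx2 only by two
# fragments together, tx3 not at all.
references = {"tx1": 1000, "tx2": 500, "tx3": 800}
alignments = {
    "tx1": {"c1": [(0, 950)]},
    "tx2": {"c2": [(0, 250)], "c3": [(200, 480)]},
}
completeness, contiguity = completeness_and_contiguity(references, alignments)
print(round(completeness, 1), round(contiguity, 1))
```

A fragmented assembly can score well on completeness while scoring poorly on contiguity, which is why the two are reported separately.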

20 Assembler metrics
Xin et al., Science China Life Sciences Vol.56 No.2

21 What are the tools available? How do they work? Which one performs the best? Can we believe the results?

22 Do you have biological replicates? If not... Well, probably not. Assembly is a guess. An educated guess, but still a guess.

23 What are you going to see?
- A ridiculously high number of transcripts (contigs/transfrags). Published numbers have been >9 million. Filtering is essential.
- A large number of fragmented transcripts (at repetitive regions, alternative splicing, or variations in coverage).
- A large number of highly similar contigs (with only a single nt difference, or minor changes in start or end sites).
- Depending on the assembler, some stuff that is "inferred" (that is, made up). ALWAYS compare back to raw reads.
- Lots of noise from chimeric transcripts, poor quality sequencing reads, or low coverage.
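One very basic form of the filtering mentioned above is to drop short contigs and contigs that are wholly contained in a longer one. The sketch below is only an illustration: the function names, the minimum length, and the toy contigs are assumptions, it only catches exact containment (not single-nucleotide differences), and it does not replace checking contigs against the raw reads.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def filter_contigs(contigs, min_length=200):
    """Drop contigs shorter than min_length and contigs that are exactly
    contained (on either strand) in a longer contig that was kept."""
    kept = []
    # Visit the longest contigs first so that shorter ones are tested
    # against everything that could contain them.
    by_length = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        if len(seq) < min_length:
            continue
        rc = reverse_complement(seq)
        if any(seq in other or rc in other for _, other in kept):
            continue
        kept.append((name, seq))
    return dict(kept)


# Toy contigs; min_length is set very low only so the example stays readable.
contigs = {
    "c1": "ATGGCGTGCAA",
    "c2": "GGCGTG",      # contained in c1
    "c3": "TTGCACGCC",   # reverse complement is contained in c1
    "c4": "ATG",         # too short
    "c5": "CCCCCGGGTT",  # unrelated, kept
}
filtered = filter_contigs(contigs, min_length=5)
print(sorted(filtered))
```

The pairwise containment check is quadratic in the number of kept contigs, so for millions of contigs a dedicated clustering tool would be needed instead.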

24 Conclusions
- The SOAP, Trinity, and Oases teams really, really need to talk to each other more often.
- There will ALWAYS be a trade-off between competing needs; the best assembler will depend on how you rate these needs.
- Having said that, performance is also highly dependent on the individual data set.
- A healthy dose of scepticism and common sense should be applied when considering the output of any assembly software.

25 Thank you