De novo transcriptome assembly Does anybody even need it?

Size: px
Start display at page:

Download "De novo transcriptome assembly Does anybody even need it?"

Transcription

1 De novo transcriptome assembly Does anybody even need it? Andrey Prjibelski Center for Algorithmic Biotechnology SPbSU

2 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?

3 Reference based RNA-Seq analysis Spliced alignment o TopHat (Trapnell et al., Bioinf., 2009, 3684) o TopHat2 (Kim et al., Gen. Biol., 2013, 1052) o STAR (Dobin et al., Bioinf., 2013, 558) Expression analysis o DeSeq (Anders et al., Gen. Biol., 2010, 3265) o DeSeq2 (Love et al., Gen. Biol., 2014, 310) o Cuffdiff (Trapnell et al., Nat. Biotech., 2013, 578)

4 A lot of species do not have finished reference

5 De novo transcriptome assemblers Trans-ABySS (Robertson et al., Nat. Met., 2010) Trinity (Grabherr et al., Nat. Biotech., 2011) Oases (Schulz et al., Bioinf., 2012) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) IDBA-tran (Peng et al., Bioinf., 2014)...

6 De novo transcriptome assemblers citations Trans-ABySS (Robertson et al., 387) Trinity (Grabherr et al., 2333) Oases (Schulz et al., 537) SOAPdenovo-Trans (Xie et al., 102) IDBA-tran (Peng et al., 18)

7 De novo RNA-Seq studies Reference genome is unsequenced, poorly assembled or not annotated Reference genome is too large to sequence The goal is to find fusion/novel genes

8 Ethanol production Sequenced brown algae A.cruciatus Created synthetic yeast platform for ethanol production from mannitol and DEHU Achieved up to 83% of the maximum theoretical yield from consumed sugars Enquist-Newman et al., Nature, 2014.

9 Associative transcriptomics Markers in crops are hard to identify due to high ploidy Used Tetraploid Brassica napus sequencing data to identify deleted gene which controls aliphatic glucosinolate biosynthesis Harper et al., Nat. biotech., 2012.

10 Cistanche deserticola assembly Identified key enzymes for synthesis of Lignin Phenylethanoid glycosides Li et al., PLoS one, 2015

11 Biosynthesis of capsaicinoids Assembly and annotation of Capsicum frutescens Several SSR and SNP markers are predicted 3 novel genes are predicted which filled gaps of the capsaicinoid biosynthetic pathway Shaoqun Liu et al., PLoS one, 2013

12 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?

13 De Bruijn graph ACGTCCGTAA

14 De Bruijn graph ACGTCCGTAA k=2 AC

15 De Bruijn graph ACGTCCGTAA k=2 AC CG

16 De Bruijn graph ACGTCCGTAA k=2 AC CG GT

17 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TC

18 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC

19 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC

20 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC

21 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TA CC TC

22 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TA CC TC AA

23 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG GT TA CC TC AA

24 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CC TC AA

25 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA GTC CC TC AA

26 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA GTC CC TCC TC AA

27 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CCG GTC CC TCC TC AA

28 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CCG GTC CC TCC TC AA

29 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC CC TCC TC AA

30 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA

31 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA

32 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA

33 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA

34 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG GTC CC TCC TC AA

35 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG GTC CC TCC TC AA

36 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG CC GTCC AA

37 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG CC GTCC AA

38 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA GTCCG AA

39 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA GTCCG AA

40 So, what s the difference?

41 Heuristics Sequencing errors

42 Sequencing errors CCGTTG CGTTAC GTTGCA TGCAGG

43 Sequencing errors CCG CGT CCGTTG CGTTAC GTTGCA TGCAGG GTT TTA TAC TTG TGC GCA CAG AGG

44 What about sequencing errors? CCGTTG CGTTAC GTTGCA TGCAGG GTTAC TAC CCG CCGTT GTT GTTGCAGG AGG

45 Sequencing errors CCG CGT CCGTTG CGTTACAG GTTGCA TGCAGG GTT TTA TAC ACA TTG TGC GCA CAG AGG

46 Sequencing errors CCGTTG CGTTACAG GTTGCA TGCAGG GTTACAG CCG CCGTT GTT CAG CAGG AGG GTTGCAG

47 Real life

48 Real life

49 Heuristics Sequencing errors Isoform detection

50 Isoform detection

51 Isoform detection

52 Isoform detection

53 Isoform detection

54 Isoform detection

55 Isoform detection

56 Isoform detection

57 Isoform detection

58 De Bruijn graph

59 De Bruijn graph

60 De Bruijn graph Ideal case: 1 gene 1 component

61 De Bruijn graph Ideal case: 1 gene 1 component Reality o Short repeats o Paralogous genes o Introns o Intergenic sequences o Poly-A tails (caused by polyadenylation)

62 De Bruijn graph Ideal case: 1 gene 1 component Reality o Short repeats o Paralogous genes o Introns o Intergenic sequences o Poly-A tails (caused by polyadenylation) Varying expression levels result in different coverage o Harder to detect sequencing errors

63 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?

64 SPAdes (Saint Petersburg Assembler)

65 SPAdes now > 600 citations (Bankevich et al., JCB, 2012) ~ 3.5K unique users downloaded the latest SPAdes 3.6 One of the most popular apps on BaseSpace One of two best assemblers in GAGE-B study by Salzberg s lab (Magoc et al., Bioinf., 2013) The best bacterial genome assembler in the recent poll by acgt.me

66 Why to create new assembler? Conventional bacterial sequencing Metagenomics Single cell

67 Conventional bacterial sequencing

68 Metagenomics >99% bacteria cannot be cultured Metagenomics: sequencing of whole bacterial community o Reads from dozens of different genomes mixed in one data set o Different coverage for different bacteria o Presence of different strains o Conservative genomic regions Hard to assemble and classify resulting sequences

69 Single-cell sequencing via MDA Multiple Displacement Amplification (Lasken, Curr. Op. Microbiology., 2007) Random hexamer primers Phi29 DNA polymerase strand displacement

70 Challenges in single-cell assembly E. coli isolate dataset E.coli single-cell dataset

71 SPAdes and its offsprings Originally designed as a single-cell assembler Can deal with highly uneven coverage and MDAimposed chimeric reads Became the base for other assembly projects: o hybridspades (for Illumina + PacBio assembly) o dipspades (for highly polymorphic genomes) o truspades (for Illumina True Synthetic Long Reads) o metaspades (for metagenome assemblies) o rnaspades (for RNA-Seq data)

72 De novo transcriptome assemblers Trans-ABySS (Robertson et al., Nat. Met., 2010) Trinity (Grabherr et al., Nat. Biotech., 2011) Oases (Schulz et al., Bioinf., 2012) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) IDBA-tran (Peng et al., Bioinf., 2014) Who needs yet another?

73 De novo transcriptome assemblers De novo assemblers are important However, reference-based assemblers perform better than de novo assemblers (Martin & Zhang, Nat. Rev. Genetics., 2011) o Means there is space for improving de novo transcriptome assemblers

74 New de novo transcriptome assemblers IDBA-tran (Peng et al., Bioinf., 2014) IDBA-MTP (Peng et al., RECOMB 2014) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) Fu et al., ICCABS, 2014 StringTie (Pertea et al., Nat. Biotech., 2015) Bermuda (Tang et al., ACM, 2015)

75 SPAdes on RNA-Seq? SPAdes is a single-cell assembler o Single-cell data sets have highly uneven coverage o So does RNA-Seq data sets (different expression levels)

76 SPAdes on RNA-Seq M.musculus RNA-Seq dataset SRX (22740) genes, (47618) isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity SPAdes Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript

77 From SPAdes to rnaspades Raw reads Bayes-Hammer SPAdes De Bruijn Graph construction De Bruijn graph Corrected reads Graph simplification Assembly graph Paired read alignment Paired index Repeat resolution & scaffolding Contigs & scaffolds

78 From SPAdes to rnaspades Raw RNA-Seq reads Bayes-Hammer rnaspades De Bruijn Graph construction De Bruijn graph Corrected reads Graph simplification Assembly graph Paired read alignment Paired index Isoform identification Transcripts

79 Results M.musculus RNA-Seq dataset SRX (22740) genes, (47618) isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity SPAdes rnaspades Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript

80 Transcripts with low coverage exon genome

81 Transcripts with low coverage exon genome

82 Transcripts with low coverage ASSEMBLY

83 Results on strand-specific data S.cerevisiae RNA-Seq dataset SRR genes, 7126 isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity rnaspades (k=55) Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript

84 Results on strand-specific data Z.mays RNA-Seq dataset SRR genes, isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity rnaspades Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript

85 RNA-Seq QC RSeQC (Wang et al., Bioinf., 2012) a tool for reference based RNA-Seq QC Strand-specificity Insert size Reads distribution Reads duplication and many more

86 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?

87 rnaquast (QUality ASsessment Tool) You cannot develop an assembler without having an assembly quality assessment tool

88 rnaquast (QUality ASsessment Tool) You cannot develop an assembler without having an assembly quality assessment tool Based on our experience with SPAdes and QUAST, developing such tool is not an easy task (Gurevich et al., Bioinf., 2013)

89 But why?

90 RNA-Seq assembly evaluation RNA-Seq assemblers benchmark (Martin & Zhang, Nat. Rev. Genetics., 2011) Evaluating 454 RNA-Seq assemblies (Mundry et al., PloS one, 2012) RGASP-consortium (Steijger et al., Nat. Met., 2013) Metrics assessment (O Neil et al., BMC, 2013) Detonate RSEM-EVAL/REF-EVAL (Li et al., Gen. Biol., 2014)

91 RNA-Seq assembly evaluation Every paper describing new transcriptome assembler uses its own metrics and methods No tool yet have set a standard for assessing quality of transcriptome assemblies

92 rnaquast pipeline

93 Misassembly search {a 1,, a n } alignments of the transcript T A j = (a k1, a k2,, a j ) alignment union ending at a j Best alignment union is detected via dynamic programming algorithm o S(A j ) = max {score(a i, a j ) } for all i such that last position of a i - first position of a j < o Score includes mismatches, indels and alignment locality A transcript with 2 or more discordant partial alignments from best alignment union is considered as misassembly candidate

94 BLAT alignment

95 BLAST alignment

96 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015)

97 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015) De novo gene finding o GeneMarkS-T (Tang et al., Nucl. ac. res., 2015)

98 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015) De novo gene finding o GeneMarkS-T (Tang et al., Nucl. ac. res., 2015) Quality assessment using initial reads o Detonate RSEM-EVAL (Li et al., Gen. Bio., 2014)

99 Acknowledgements Dmitry Antipov Elena Bushmanova SPAdes team Pavel Pevzner Alla Lapidus Vladimir Suvorov

100 Thank you! Questions?