De novo transcriptome assembly Does anybody even need it?
|
|
- Alvin Norris
- 5 years ago
- Views:
Transcription
1 De novo transcriptome assembly Does anybody even need it? Andrey Prjibelski Center for Algorithmic Biotechnology SPbSU
2 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?
3 Reference based RNA-Seq analysis Spliced alignment o TopHat (Trapnell et al., Bioinf., 2009, 3684) o TopHat2 (Kim et al., Gen. Biol., 2013, 1052) o STAR (Dobin et al., Bioinf., 2013, 558) Expression analysis o DeSeq (Anders et al., Gen. Biol., 2010, 3265) o DeSeq2 (Love et al., Gen. Biol., 2014, 310) o Cuffdiff (Trapnell et al., Nat. Biotech., 2013, 578)
4 A lot of species do not have finished reference
5 De novo transcriptome assemblers Trans-ABySS (Robertson et al., Nat. Met., 2010) Trinity (Grabherr et al., Nat. Biotech., 2011) Oases (Schulz et al., Bioinf., 2012) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) IDBA-tran (Peng et al., Bioinf., 2014)...
6 De novo transcriptome assemblers citations Trans-ABySS (Robertson et al., 387) Trinity (Grabherr et al., 2333) Oases (Schulz et al., 537) SOAPdenovo-Trans (Xie et al., 102) IDBA-tran (Peng et al., 18)
7 De novo RNA-Seq studies Reference genome is unsequenced, poorly assembled or not annotated Reference genome is too large to sequence The goal is to find fusion/novel genes
8 Ethanol production Sequenced brown algae A.cruciatus Created synthetic yeast platform for ethanol production from mannitol and DEHU Achieved up to 83% of the maximum theoretical yield from consumed sugars Enquist-Newman et al., Nature, 2014.
9 Associative transcriptomics Markers in crops are hard to identify due to high ploidy Used Tetraploid Brassica napus sequencing data to identify deleted gene which controls aliphatic glucosinolate biosynthesis Harper et al., Nat. biotech., 2012.
10 Cistanche deserticola assembly Identified key enzymes for synthesis of Lignin Phenylethanoid glycosides Li et al., PLoS one, 2015
11 Biosynthesis of capsaicinoids Assembly and annotation of Capsicum frutescens Several SSR and SNP markers are predicted 3 novel genes are predicted which filled gaps of the capsaicinoid biosynthetic pathway Shaoqun Liu et al., PLoS one, 2013
12 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?
13 De Bruijn graph ACGTCCGTAA
14 De Bruijn graph ACGTCCGTAA k=2 AC
15 De Bruijn graph ACGTCCGTAA k=2 AC CG
16 De Bruijn graph ACGTCCGTAA k=2 AC CG GT
17 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TC
18 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC
19 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC
20 De Bruijn graph ACGTCCGTAA k=2 AC CG GT CC TC
21 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TA CC TC
22 De Bruijn graph ACGTCCGTAA k=2 AC CG GT TA CC TC AA
23 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG GT TA CC TC AA
24 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CC TC AA
25 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA GTC CC TC AA
26 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA GTC CC TCC TC AA
27 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CCG GTC CC TCC TC AA
28 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT TA CCG GTC CC TCC TC AA
29 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC CC TCC TC AA
30 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA
31 De Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA
32 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA
33 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTA TA CCG GTC TAA CC TCC TC AA
34 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG GTC CC TCC TC AA
35 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG GTC CC TCC TC AA
36 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG CC GTCC AA
37 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA CCG CC GTCC AA
38 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA GTCCG AA
39 Condensed de Bruijn graph ACGTCCGTAA k=2 AC ACG CG CGT GT GTAA GTCCG AA
40 So, what s the difference?
41 Heuristics Sequencing errors
42 Sequencing errors CCGTTG CGTTAC GTTGCA TGCAGG
43 Sequencing errors CCG CGT CCGTTG CGTTAC GTTGCA TGCAGG GTT TTA TAC TTG TGC GCA CAG AGG
44 What about sequencing errors? CCGTTG CGTTAC GTTGCA TGCAGG GTTAC TAC CCG CCGTT GTT GTTGCAGG AGG
45 Sequencing errors CCG CGT CCGTTG CGTTACAG GTTGCA TGCAGG GTT TTA TAC ACA TTG TGC GCA CAG AGG
46 Sequencing errors CCGTTG CGTTACAG GTTGCA TGCAGG GTTACAG CCG CCGTT GTT CAG CAGG AGG GTTGCAG
47 Real life
48 Real life
49 Heuristics Sequencing errors Isoform detection
50 Isoform detection
51 Isoform detection
52 Isoform detection
53 Isoform detection
54 Isoform detection
55 Isoform detection
56 Isoform detection
57 Isoform detection
58 De Bruijn graph
59 De Bruijn graph
60 De Bruijn graph Ideal case: 1 gene 1 component
61 De Bruijn graph Ideal case: 1 gene 1 component Reality o Short repeats o Paralogous genes o Introns o Intergenic sequences o Poly-A tails (caused by polyadenylation)
62 De Bruijn graph Ideal case: 1 gene 1 component Reality o Short repeats o Paralogous genes o Introns o Intergenic sequences o Poly-A tails (caused by polyadenylation) Varying expression levels result in different coverage o Harder to detect sequencing errors
63 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?
64 SPAdes (Saint Petersburg Assembler)
65 SPAdes now > 600 citations (Bankevich et al., JCB, 2012) ~ 3.5K unique users downloaded the latest SPAdes 3.6 One of the most popular apps on BaseSpace One of two best assemblers in GAGE-B study by Salzberg s lab (Magoc et al., Bioinf., 2013) The best bacterial genome assembler in the recent poll by acgt.me
66 Why to create new assembler? Conventional bacterial sequencing Metagenomics Single cell
67 Conventional bacterial sequencing
68 Metagenomics >99% bacteria cannot be cultured Metagenomics: sequencing of whole bacterial community o Reads from dozens of different genomes mixed in one data set o Different coverage for different bacteria o Presence of different strains o Conservative genomic regions Hard to assemble and classify resulting sequences
69 Single-cell sequencing via MDA Multiple Displacement Amplification (Lasken, Curr. Op. Microbiology., 2007) Random hexamer primers Phi29 DNA polymerase strand displacement
70 Challenges in single-cell assembly E. coli isolate dataset E.coli single-cell dataset
71 SPAdes and its offsprings Originally designed as a single-cell assembler Can deal with highly uneven coverage and MDAimposed chimeric reads Became the base for other assembly projects: o hybridspades (for Illumina + PacBio assembly) o dipspades (for highly polymorphic genomes) o truspades (for Illumina True Synthetic Long Reads) o metaspades (for metagenome assemblies) o rnaspades (for RNA-Seq data)
72 De novo transcriptome assemblers Trans-ABySS (Robertson et al., Nat. Met., 2010) Trinity (Grabherr et al., Nat. Biotech., 2011) Oases (Schulz et al., Bioinf., 2012) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) IDBA-tran (Peng et al., Bioinf., 2014) Who needs yet another?
73 De novo transcriptome assemblers De novo assemblers are important However, reference-based assemblers perform better than de novo assemblers (Martin & Zhang, Nat. Rev. Genetics., 2011) o Means there is space for improving de novo transcriptome assemblers
74 New de novo transcriptome assemblers IDBA-tran (Peng et al., Bioinf., 2014) IDBA-MTP (Peng et al., RECOMB 2014) SOAPdenovo-Trans (Xie et al., Bioinf., 2014) Fu et al., ICCABS, 2014 StringTie (Pertea et al., Nat. Biotech., 2015) Bermuda (Tang et al., ACM, 2015)
75 SPAdes on RNA-Seq? SPAdes is a single-cell assembler o Single-cell data sets have highly uneven coverage o So does RNA-Seq data sets (different expression levels)
76 SPAdes on RNA-Seq M.musculus RNA-Seq dataset SRX (22740) genes, (47618) isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity SPAdes Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript
77 From SPAdes to rnaspades Raw reads Bayes-Hammer SPAdes De Bruijn Graph construction De Bruijn graph Corrected reads Graph simplification Assembly graph Paired read alignment Paired index Repeat resolution & scaffolding Contigs & scaffolds
78 From SPAdes to rnaspades Raw RNA-Seq reads Bayes-Hammer rnaspades De Bruijn Graph construction De Bruijn graph Corrected reads Graph simplification Assembly graph Paired read alignment Paired index Isoform identification Transcripts
79 Results M.musculus RNA-Seq dataset SRX (22740) genes, (47618) isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity SPAdes rnaspades Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript
80 Transcripts with low coverage exon genome
81 Transcripts with low coverage exon genome
82 Transcripts with low coverage ASSEMBLY
83 Results on strand-specific data S.cerevisiae RNA-Seq dataset SRR genes, 7126 isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity rnaspades (k=55) Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript
84 Results on strand-specific data Z.mays RNA-Seq dataset SRR genes, isoforms Trans-ABySS IDBA-tran SOAPdenovo Trans Trinity rnaspades Transcripts Aligned Unaligned Gene database coverage, % Partially-assembled isoforms (>50%) Fully-assembled isoforms (>95%) Misassemblies Avg. mismatches per transcript
85 RNA-Seq QC RSeQC (Wang et al., Bioinf., 2012) a tool for reference based RNA-Seq QC Strand-specificity Insert size Reads distribution Reads duplication and many more
86 Outline 1.Who assembles transcriptomes de novo? 2.How do assemblers work? 3.What is SPAdes and how does it work on RNA- Seq? 4.How do we evaluate transcriptome assemblies?
87 rnaquast (QUality ASsessment Tool) You cannot develop an assembler without having an assembly quality assessment tool
88 rnaquast (QUality ASsessment Tool) You cannot develop an assembler without having an assembly quality assessment tool Based on our experience with SPAdes and QUAST, developing such tool is not an easy task (Gurevich et al., Bioinf., 2013)
89 But why?
90 RNA-Seq assembly evaluation RNA-Seq assemblers benchmark (Martin & Zhang, Nat. Rev. Genetics., 2011) Evaluating 454 RNA-Seq assemblies (Mundry et al., PloS one, 2012) RGASP-consortium (Steijger et al., Nat. Met., 2013) Metrics assessment (O Neil et al., BMC, 2013) Detonate RSEM-EVAL/REF-EVAL (Li et al., Gen. Biol., 2014)
91 RNA-Seq assembly evaluation Every paper describing new transcriptome assembler uses its own metrics and methods No tool yet have set a standard for assessing quality of transcriptome assemblies
92 rnaquast pipeline
93 Misassembly search {a 1,, a n } alignments of the transcript T A j = (a k1, a k2,, a j ) alignment union ending at a j Best alignment union is detected via dynamic programming algorithm o S(A j ) = max {score(a i, a j ) } for all i such that last position of a i - first position of a j < o Score includes mismatches, indels and alignment locality A transcript with 2 or more discordant partial alignments from best alignment union is considered as misassembly candidate
94 BLAT alignment
95 BLAST alignment
96 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015)
97 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015) De novo gene finding o GeneMarkS-T (Tang et al., Nucl. ac. res., 2015)
98 De novo evaluation Core gene detection o CEGMA (Parra et al., Bioinf., 2007, no longer supported) o BUSCO (Simão et al., Bioinf., 2015) De novo gene finding o GeneMarkS-T (Tang et al., Nucl. ac. res., 2015) Quality assessment using initial reads o Detonate RSEM-EVAL (Li et al., Gen. Bio., 2014)
99 Acknowledgements Dmitry Antipov Elena Bushmanova SPAdes team Pavel Pevzner Alla Lapidus Vladimir Suvorov
100 Thank you! Questions?