TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)


1 TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR). Anton Bankevich, Center for Algorithmic Biotechnology, SPbSU

2 Sequencing costs 1. Sequencing cost does not follow Moore's law anymore

3 Sequencing costs 1. Sequencing cost does not follow Moore's law anymore 2. Sequencing cost does not consistently go down anymore

4 Third generation sequencing: view from a different angle. BioNano: Do we need all these letters? PacBio: Are all those errors really that important? Nanopore: Do you think PacBio reads are long?

5 TSLR: TruSeq Synthetic Long Reads Developed by Illumina, the leader in DNA sequencing technologies The first technology generating long AND accurate reads at moderate cost Promises to revolutionize detection of complex human variation and metagenomics.

6 TSLR: from sample to data. DNA is sheared into fragments about 10 Kb long. Fragments are distributed among 384 pools. Pools are barcoded and sequenced by Illumina HiSeq.

7 TSLR: from sample to data Reads from each pool are assembled separately Resulting in virtual long reads Virtual long reads can be further assembled

8 Accuracy of TSLR technology Barcode assembly results in TSLRs Very low mismatch rate ACCGACCTTACCCCGAGGGCC ACCGTCTTTACCCCGAGGCGC ACCGTTTTTACCTCGAGGCCC ACTGTCTTTACCACGAGGCCC ACCGTCTTTACCCCGAGGCAC ACCGTCTTTACCCCGAGGCCC Some assembly errors

9 TSLR: algorithmic challenges and applications Accurate barcode assembly Bankevich, Pevzner (2016) Metagenomic analysis Sharon et al (2015) Kuleshov et al (2015) Structural variation detection???

10 Barcode assembly

11 TruSPAdes: Barcode assembly challenges Complex repeat structure inherited from target genome Inter-strand chimeric connections Fragmentation of barcode span Uneven coverage by reads

12 From SPAdes to truspades. We introduced several changes into the SPAdes pipeline to adapt it to TSLR data: increased number of iterations in iterative assembly; new tip trimming procedure; additional analysis of alignments of paired-end reads to contigs; various parameter changes optimized specifically for TSLR data.

13 Barcode assembly results (human data) [Table; numeric values are missing from the transcript.] Columns: Illumina, Ray, SPAdes, truspades, Ideal. Rows: #contigs, pb*; #contigs (>8000 bp), pb; Total length (Mb), pb; N; NGA; #N's per 100 Kbp; Misassemblies, pb; Mismatches per 100 Kbp. *pb (per barcode): average among all barcodes in dataset.

14 Structural variation detection

15 Why are we searching for variations?

16 Identification of variations with short reads: detection target genome: sequencing NGS reads: reference genome: deletion

17 Identification of variations with short reads: statistical analysis target genome: sequencing NGS reads: reference genome: deletion

18 Identification of variations with short reads: challenges False alignments to repeats Chimeric paired-end reads Coverage bias Large insertions and rearrangements

19 Identification of variations with short reads: challenges False alignments to repeats Chimeric paired-end reads Coverage bias Large insertions and rearrangements What about long reads?

20 SV detection with long reads target genome: sequencing long reads: reference genome: deletion

21 Any long read challenges? Accurate sequencing Accurate alignment Simple statistical analysis

22 Variation detection: short reads vs long reads. Short reads: cheap; require high coverage; can detect SNPs, deletions, CNV in tandem repeats. Long reads: expensive; do not require high coverage; can detect SNPs, deletions, insertions (length almost up to read size), variations in complex repeat structures.

23 Variation detection: short reads vs TSLRs. Short reads: cheap; require high coverage; can detect SNPs, deletions, CNV in tandem repeats. TSLRs: reasonable cost; do not require high coverage; can detect SNPs, deletions, insertions (length almost up to read size); normally can handle complex repeat structures; typically contain 1-4 misassemblies per barcode; difficulties in resolving variations in complex repeat structures.

24 What do you do when you are not happy with accuracy of your reads?

25 Sequencing workflow Reads Sequencing machine Analysis

26 Sequencing workflow improves! Reads Analysis Sequencing machine Raw signal

27 Success story: nanopore SNP calling in Ebola study Oxford Nanopore reads ACCGACCTTACCACGAGGGCC ACCGTCTTTACCACGAGGCGC ACCGTTTTTACCACGAGGCCC ACTGTCTTTACCACGAGGCCC ACCGTCTTTACCACGAGGCAC ACCGTCTTTACCCCGAGGCCC SNP calling MinION Raw Oxford Nanopore signal Quick et al. Real-time, portable genome sequencing for Ebola surveillance, Nature (2016)

28 TSLR analysis workflow TSLRs Analysis Illumina sequencing machine barcoded reads

29 TSLR analysis workflow TSLRs Illumina sequencing machine Analysis barcoded reads De Bruijn graph

30 De Bruijn graphs

31 DeBruijnGraph(Genome) Vertices: k-mers from the genome Edges: (k+1)-mers from the genome k=2: 3-mer ACG results in an edge AC -> CG genome

32 DeBruijnGraph(Reads) Vertices: k-mers from reads Edges: (k+1)-mers from reads k=2: 3-mer ACG results in an edge AC -> CG reads genome

33 De Bruijn Graphs 1. Collapses repetitive regions (longer than k) 2. Genome corresponds to a path in the graph 3. Error-prone reads introduce errors in DeBruijnGraph(Reads)
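
The vertex and edge definitions above can be sketched in a few lines of Python. This is only an illustration of the textbook construction, not the truSPAdes data structure; the function name `de_bruijn_graph` and the adjacency-dictionary representation are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a de Bruijn graph as {k-mer: set of successor k-mers}.

    Every (k+1)-mer found in the input sequences contributes one edge
    from its first k letters to its last k letters.
    Hypothetical helper for illustration only.
    """
    graph = defaultdict(set)
    for seq in sequences:
        # Slide a window of length k+1 over the sequence.
        for i in range(len(seq) - k):
            kplus1_mer = seq[i:i + k + 1]
            graph[kplus1_mer[:-1]].add(kplus1_mer[1:])
    # Vertices without outgoing edges do not appear as keys.
    return dict(graph)


# The example from the slides: with k=2 the 3-mer ACG gives the edge AC -> CG.
example = de_bruijn_graph(["ACG"], 2)
```

Passing a single genome string gives DeBruijnGraph(Genome); passing a list of reads gives DeBruijnGraph(Reads). Identical (k+1)-mers from different reads map to the same edge, which is how repeats longer than k collapse.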

34 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode

35 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode De Bruijn graph

36 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode Tip De Bruijn graph Tip Tips are removed since in most cases they represent errors in reads
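
A minimal sketch of how tips could be located, assuming a tip is a short unbranched path that starts at a branching vertex and ends in a dead end. The function `find_tips`, the edge-list input, and the `max_tip_len` threshold are hypothetical; the actual tip trimming procedure in truSPAdes is not described in these slides.

```python
from collections import defaultdict


def find_tips(edges, max_tip_len):
    """Return dead-end paths of at most max_tip_len edges.

    edges: iterable of (source, target) vertex pairs.
    Each returned path is a list of vertices from the branching
    vertex to the dead end. Illustrative assumption, not truSPAdes code.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    # Dead ends have incoming edges but no outgoing edges.
    dead_ends = [v for v in pred if v not in succ]
    tips = []
    for end in sorted(dead_ends):
        path = [end]
        # Walk backwards while the path stays short enough.
        while len(path) <= max_tip_len:
            cur = path[-1]
            if len(pred.get(cur, ())) != 1:
                break  # a source or a merge: not a simple tip
            (prev,) = pred[cur]
            path.append(prev)
            if len(succ[prev]) > 1:
                tips.append(path[::-1])  # hangs off a branching vertex
                break
    return tips


# Main path A -> B -> C -> D -> E with a one-edge tip B -> X.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "X")]
short_tips = find_tips(toy_edges, max_tip_len=1)
```

The length threshold matters: with a large `max_tip_len` the end of the main path would also be reported, since it is a dead end reachable from the same branching vertex.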

37 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode De Bruijn graph As a result, the red/blue/green chimeric sequence is indistinguishable from the correct orange sequence

38 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode De Bruijn graph misassembly Assembly

39 de Bruijn graph of a barcode Fragments of the reference genome with the same barcode De Bruijn graph Tips are supported by fragments of reference genome.

40 Genome-Graph alignment

41 Alignment of genome to graph. An elementary alignment is an alignment of a genome fragment to a single edge of the de Bruijn graph. Two elementary alignments perfectly fit if they correspond to two adjacent genome fragments aligning to two adjacent edges in the de Bruijn graph. A genome-graph alignment is a sequence of elementary alignments. Ideally, consecutive elementary alignments in a genome-graph alignment are perfectly fit.
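
The "perfectly fit" condition can be written down as a small predicate. This is a sketch under simplifying assumptions: the `Edge` and `ElementaryAlignment` classes, their field names, and the half-open coordinate convention are all hypothetical, and the k-letter overlap between consecutive de Bruijn graph edges is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start_vertex: str
    end_vertex: str
    length: int


@dataclass(frozen=True)
class ElementaryAlignment:
    genome_start: int  # half-open interval [genome_start, genome_end)
    genome_end: int
    edge: Edge
    edge_start: int    # half-open interval on the edge
    edge_end: int


def perfectly_fit(a, b):
    """True if b continues a both in the genome and in the graph."""
    adjacent_in_genome = a.genome_end == b.genome_start
    adjacent_in_graph = (
        a.edge.end_vertex == b.edge.start_vertex  # edges share a vertex
        and a.edge_end == a.edge.length           # a reaches the end of its edge
        and b.edge_start == 0                     # b starts at the beginning of its edge
    )
    return adjacent_in_genome and adjacent_in_graph


e1 = Edge("u", "v", 100)
e2 = Edge("v", "w", 50)
first = ElementaryAlignment(0, 100, e1, 0, 100)
second = ElementaryAlignment(100, 150, e2, 0, 50)
fits = perfectly_fit(first, second)
```

Any pair of consecutive elementary alignments for which this predicate fails is a candidate for a coverage break, an insertion/deletion, or a breakpoint.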

42 From genome to genome fragments. The human genome is too large for efficient comparison with the graph. We extract fragments of the genome using alignment of Illumina reads to the reference. As a result, we analyse 300 fragments of length 10 Kb instead of a single fragment of length 3 Gb.

43 Genome-graph alignment problem Input: De Bruijn graph of a single barcode A fragment of the genome along with its alignment to the de Bruijn graph Output: All breakpoints/insertions/deletions in the given genome fragment

44 Genome-graph alignment problem Input: De Bruijn graph of a single barcode A fragment of the genome along with its alignment to the de Bruijn graph Output: All breakpoints/insertions/deletions in the given genome fragment Note: In the ideal case, all consecutive elementary alignments in the genome alignment are perfectly fit. Question: Should we report all breaks in elementary alignments as SVs?

45 de Bruijn graph reveals structural variations Insertion Reference: Target: De Bruijn graph: Deletion Breakpoint

46 ...but read artifacts make the problem difficult Coverage break Reference: Target: De Bruijn graph for ideal coverage: De Bruijn graph with coverage break: Coverage break + repeat SNP + repeat + bad luck

47 How should we deal with artifacts? Some erroneous alignments must be discarded Coverage breaks are usually marked by tips on both sides All other unexpected events represent structural variations

48 Genome-Graph Alignment problem (new version) Input: sequence of elementary alignments A1, A2,..., An Find: subsequence of elementary alignments a1, a2,..., am with minimum penalty Penalty is assigned: For skipped elementary alignments (low) For coverage breaks (low) For insertions/deletions (medium) For breakpoints (high)

49 Genome-Graph Alignment problem (new version) Input: sequence of elementary alignments A1, A2,..., An Find: subsequence of elementary alignments a1, a2,..., am with minimum penalty Penalty is assigned: For skipped elementary alignments (low) For coverage breaks (low) For insertions/deletions (medium) For breakpoints (high) This problem can be solved using dynamic programming

50 Dynamic programming for Genome-Graph alignment Dynamic subproblem: find the best alignment for a segment of the initial alignment sequence, given that the last elementary alignment is part of the chosen subsequence.
procedure Solve(A1, A2,..., An)
    if n = 1 return 0
    best = infinity
    for k = 1 to n-1
        candidate = Solve(A1, A2,..., Ak) + ScoreSkip(Ak+1, Ak+2,..., An-1) + ScoreVariation(Ak, An)
        best = min(best, candidate)
    return best
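
The recurrence above can be evaluated bottom-up in quadratic time. The following is a sketch rather than the truSPAdes implementation, and it makes several assumptions not stated on the slides: ScoreSkip is modelled as a constant `skip_penalty` per skipped alignment, skipping a prefix or suffix of the input is allowed at the same cost, and `score_variation` is a caller-supplied function.

```python
def best_alignment_penalty(alignments, score_variation, skip_penalty=1.0):
    """Choose a subsequence of elementary alignments with minimum penalty.

    dp[i] is the best penalty for alignments[0..i] given that
    alignments[i] is part of the chosen subsequence.
    Returns (total penalty, indices of the chosen alignments).
    """
    n = len(alignments)
    if n == 0:
        return 0.0, []
    dp = [0.0] * n
    back = [None] * n
    for i in range(n):
        dp[i] = skip_penalty * i  # assumption: all earlier alignments skipped
        for k in range(i):
            candidate = (dp[k]
                         + skip_penalty * (i - k - 1)  # ScoreSkip for k+1..i-1
                         + score_variation(alignments[k], alignments[i]))
            if candidate < dp[i]:
                dp[i], back[i] = candidate, k
    # Assumption: trailing alignments may be skipped as well.
    end = min(range(n), key=lambda i: dp[i] + skip_penalty * (n - 1 - i))
    total = dp[end] + skip_penalty * (n - 1 - end)
    chosen = []
    i = end
    while i is not None:  # follow back-pointers to recover the subsequence
        chosen.append(i)
        i = back[i]
    return total, chosen[::-1]


# Toy model: alignments are integers, consecutive values fit perfectly,
# anything else is scored as a breakpoint (high penalty).
def toy_score(a, b):
    return 0 if b == a + 1 else 10


result = best_alignment_penalty([1, 2, 50, 3, 4], toy_score)
```

In the toy input the alignment `50` is spurious: skipping it costs one low penalty, whereas keeping it would introduce two breakpoints, so the minimum-penalty subsequence leaves it out.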

51 ScoreVariation(A1, A2) Coverage break A1 Insertion/Deletion/SNP Breakpoint A2 A1 A2 A1 A2

52 Verification by TSLRs [Figure: reference, TSLR] TSLRs represent information from paired-end reads. If a structural variation is supported by a TSLR, it is given a significantly lower penalty.

53 TruSPAdes variation detection pipeline. Inputs: reference genome, barcoded reads. Stages: de Bruijn graph, genome-graph alignment, statistical analysis of alignments, list of SVs. The de Bruijn graph is used as a representation of barcoded reads. Barcoded reads can be used to validate SVs found in TSLRs. We analyse alignments of the reference genome and TSLRs to the de Bruijn graph constructed from each TSLR pool of barcoded reads. We filter out most false positive SVs that were caused by misassemblies in TSLRs and find variations in regions that are not covered by assembled TSLRs.

54 Thank you