TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Anton Bankevich, Center for Algorithmic Biotechnology, SPbSU

Sequencing costs
1. Sequencing cost does not follow Moore's law anymore
2. Sequencing cost does not consistently go down anymore

Third-generation sequencing: a view from a different angle
- BioNano: Do we need all these letters?
- PacBio: Are all those errors really that important?
- Nanopore: Do you think PacBio reads are long?

TSLR: TruSeq Synthetic Long Reads
- Developed by Illumina, the leader in DNA sequencing technologies
- The first technology generating long AND accurate reads at moderate cost
- Promises to revolutionize detection of complex human variation and metagenomics

TSLR: from sample to data
- DNA is sheared into fragments about 10 kb long
- Fragments are distributed among 384 pools
- Pools are barcoded and sequenced on an Illumina HiSeq

TSLR: from sample to data
- Reads from each pool are assembled separately
- The result is a set of virtual long reads
- Virtual long reads can be further assembled

Accuracy of TSLR technology
- Barcode assembly results in TSLRs
- Very low mismatch rate
    ACCGACCTTACCCCGAGGGCC
    ACCGTCTTTACCCCGAGGCGC
    ACCGTTTTTACCTCGAGGCCC
    ACTGTCTTTACCACGAGGCCC
    ACCGTCTTTACCCCGAGGCAC
    ACCGTCTTTACCCCGAGGCCC
- Some assembly errors

TSLR: algorithmic challenges and applications
- Accurate barcode assembly: Bankevich, Pevzner (2016)
- Metagenomic analysis: Sharon et al. (2015), Kuleshov et al. (2015)
- Structural variation detection: ???

Barcode assembly

TruSPAdes: barcode assembly challenges
- Complex repeat structure inherited from the target genome
- Inter-strand chimeric connections
- Fragmentation of barcode span
- Uneven coverage by reads

From SPAdes to TruSPAdes
We introduced several changes into the SPAdes pipeline to adapt it to TSLR data:
1. Increased number of iterations in iterative assembly
2. New tip trimming procedure
3. Additional analysis of alignments of paired-end reads to contigs
4. Various parameter changes optimized specifically for TSLR data

Barcode assembly results (human data)
Metric                    Illumina     Ray  SPAdes  TruSPAdes   Ideal
#contigs, pb*                  452     414     677        387     300
#contigs (>8000 bp), pb        117      83     108        140     300
Total length (Mb), pb          2.3     1.8     2.7        2.4       3
N50                          7,579   6,222   6,235      8,677  10,000
NGA50                        5,235   2,511   4,770      7,274  10,000
#N's per 100 Kbp               0.9   3,083     242        0.2       0
Misassemblies, pb              1.2       7      47        2.0       0
Mismatches per 100 Kbp          69      84     190         90       0
* pb (per barcode): average among all barcodes in the dataset
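For readers unfamiliar with the N50 row, here is a minimal sketch of the standard definition: the largest length L such that contigs of length at least L cover at least half of the total assembly length. The function name `n50` is chosen for illustration only; it is not part of TruSPAdes or of any evaluation tool.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths, not data from the talk.
    print(n50([2, 3, 4, 5, 6]))
```

NGA50 follows the same idea, but is computed on contigs split at misassemblies and measured against the reference genome length rather than the assembly length.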

Structural variation detection

Why are we searching for variations?

Identification of variations with short reads: detection
[Figure labels: target genome; sequencing; NGS reads; reference genome; deletion]

Identification of variations with short reads: statistical analysis
[Figure labels: target genome; sequencing; NGS reads; reference genome; deletion]

Identification of variations with short reads: challenges
- False alignments to repeats
- Chimeric paired-end reads
- Coverage bias
- Large insertions and rearrangements
What about long reads?

SV detection with long reads
[Figure labels: target genome; sequencing; long reads; reference genome; deletion]

Any long read challenges?
- Accurate sequencing
- Accurate alignment
- Simple statistical analysis

Variation detection: short reads vs long reads
Short reads
- Cheap
- Require high coverage
- Can detect: SNPs; deletions; CNV in tandem repeats
Long reads
- Expensive
- Do not require high coverage
- Can detect: SNPs; deletions; insertions (length almost up to read size); variations in complex repeat structures

Variation detection: short reads vs TSLRs
Short reads
- Cheap
- Require high coverage
- Can detect: SNPs; deletions; CNV in tandem repeats
TSLRs
- Reasonable cost
- Do not require high coverage
- Can detect: SNPs; deletions; insertions (length almost up to read size)
- Normally can handle complex repeat structures
- Typically contain 1-4 misassemblies per barcode
- Difficulties in resolving variations in complex repeat structures

What do you do when you are not happy with accuracy of your reads?

Sequencing workflow
[Figure labels: sequencing machine; reads; analysis]

Sequencing workflow improves!
[Figure labels: sequencing machine; raw signal; reads; analysis]

Success story: nanopore SNP calling in Ebola study
[Figure labels: MinION; raw Oxford Nanopore signal; Oxford Nanopore reads; SNP calling]
    ACCGACCTTACCACGAGGGCC
    ACCGTCTTTACCACGAGGCGC
    ACCGTTTTTACCACGAGGCCC
    ACTGTCTTTACCACGAGGCCC
    ACCGTCTTTACCACGAGGCAC
    ACCGTCTTTACCCCGAGGCCC
Quick et al. Real-time, portable genome sequencing for Ebola surveillance, Nature (2016)

TSLR analysis workflow
[Figure labels: Illumina sequencing machine; barcoded reads; TSLRs; analysis]

TSLR analysis workflow
[Figure labels: Illumina sequencing machine; barcoded reads; de Bruijn graph; TSLRs; analysis]

De Bruijn graphs

DeBruijnGraph(Genome)
- Vertices: k-mers from the genome
- Edges: (k+1)-mers from the genome
- Example (k=2): the 3-mer ACG results in an edge AC -> CG
[Figure labels: genome]

DeBruijnGraph(Reads)
- Vertices: k-mers from reads
- Edges: (k+1)-mers from reads
- Example (k=2): the 3-mer ACG results in an edge AC -> CG
[Figure labels: reads; genome]
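The construction above can be sketched in a few lines. This is a minimal illustration, not the SPAdes data structure: the function name `de_bruijn_graph` is hypothetical, edge multiplicities (coverage) are dropped, reverse complements are ignored, and a vertex with no outgoing edge does not appear as a key.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Map each k-mer vertex to the set of k-mers reachable via a (k+1)-mer edge."""
    graph = defaultdict(set)
    for seq in sequences:
        # Every window of length k+1 is one edge of the graph.
        for i in range(len(seq) - k):
            edge = seq[i:i + k + 1]
            graph[edge[:-1]].add(edge[1:])  # prefix k-mer -> suffix k-mer
    return dict(graph)


if __name__ == "__main__":
    # The example from the slide: with k=2, the 3-mer ACG gives the edge AC -> CG.
    print(de_bruijn_graph(["ACG"], 2))
```

The same function covers both DeBruijnGraph(Genome) and DeBruijnGraph(Reads): pass a single genome string in a list, or a list of reads.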

De Bruijn graphs
1. Repetitive regions (longer than k) are collapsed
2. The genome corresponds to a path in the graph
3. Error-prone reads introduce errors in DeBruijnGraph(Reads)

de Bruijn graph of a barcode
[Figure labels: fragments of the reference genome with the same barcode; de Bruijn graph]

de Bruijn graph of a barcode
[Figure labels: fragments of the reference genome with the same barcode; de Bruijn graph; two tips]
Tips are removed, since in most cases they represent errors in reads.
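To make the notion of a tip concrete, here is a simplified sketch that finds short dead-end paths hanging off a branching vertex. It is a generic illustration and not the tip trimming procedure used in SPAdes or TruSPAdes: the function name `find_tips`, the `max_tip_len` threshold, and the restriction to forward tips (paths ending in a vertex with no outgoing edge) are all assumptions made for this example. The graph is a plain dictionary mapping each vertex to a set of successor vertices.

```python
def find_tips(graph, max_tip_len):
    """Return dead-end paths of at most max_tip_len edges that start at a branching vertex."""
    indegree = {}
    predecessor = {}
    for u, successors in graph.items():
        for v in successors:
            indegree[v] = indegree.get(v, 0) + 1
            predecessor[v] = u  # only consulted when indegree[v] == 1

    # Dead ends: vertices that are reached by an edge but have no outgoing edge.
    sinks = set(indegree) - {u for u in graph if graph[u]}

    tips = []
    for sink in sinks:
        path = [sink]
        # Walk backwards while the path is short enough and unbranched.
        while len(path) - 1 < max_tip_len and indegree.get(path[-1], 0) == 1:
            u = predecessor[path[-1]]
            path.append(u)
            if len(graph[u]) > 1:  # reached the vertex where the tip branches off
                tips.append(path[::-1])
                break
    return tips


if __name__ == "__main__":
    # Toy graph: main path A -> B -> C -> D -> E with a one-edge side branch B -> X.
    toy = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"}}
    print(find_tips(toy, max_tip_len=1))
```

A length threshold alone cannot tell an erroneous branch from a genuine fragment end, which is exactly the difficulty the following slides raise for barcode graphs.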

de Bruijn graph of a barcode
[Figure labels: fragments of the reference genome with the same barcode; de Bruijn graph]
As a result, the red/blue/green chimeric sequence is indistinguishable from the correct orange sequence.

de Bruijn graph of a barcode
[Figure labels: fragments of the reference genome with the same barcode; de Bruijn graph; assembly; misassembly]

de Bruijn graph of a barcode
[Figure labels: fragments of the reference genome with the same barcode; de Bruijn graph]
Tips are supported by fragments of the reference genome.

Genome-Graph alignment

Alignment of genome to graph
- An elementary alignment is an alignment of a genome fragment to a single edge of the de Bruijn graph.
- Two elementary alignments are said to perfectly fit if they correspond to two adjacent genome fragments aligned to two adjacent edges of the de Bruijn graph.
- A genome-graph alignment is a sequence of elementary alignments.
- Ideally, consecutive elementary alignments in a genome-graph alignment perfectly fit.
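The definitions above translate into a small data model. The sketch below is an assumption-laden illustration: the class `ElementaryAlignment`, its fields, and the helper names are hypothetical, genome coordinates are taken to be half-open intervals recorded so that adjacent fragments share a boundary, and each graph edge is identified only by its start and end vertices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementaryAlignment:
    genome_start: int  # inclusive position in the genome fragment
    genome_end: int    # exclusive position in the genome fragment
    edge_from: str     # vertex where the graph edge starts
    edge_to: str       # vertex where the graph edge ends


def perfectly_fit(a, b):
    """True if b continues a both in the genome and in the graph."""
    adjacent_in_genome = a.genome_end == b.genome_start
    adjacent_in_graph = a.edge_to == b.edge_from
    return adjacent_in_genome and adjacent_in_graph


def count_breaks(alignments):
    """Number of consecutive pairs in a genome-graph alignment that do not perfectly fit."""
    return sum(not perfectly_fit(a, b) for a, b in zip(alignments, alignments[1:]))


if __name__ == "__main__":
    # Toy alignment: the third piece skips 50 genome positions, so there is one break.
    path = [
        ElementaryAlignment(0, 100, "v1", "v2"),
        ElementaryAlignment(100, 250, "v2", "v3"),
        ElementaryAlignment(300, 400, "v3", "v4"),
    ]
    print(count_breaks(path))
```

In the ideal case `count_breaks` returns zero; every non-zero break is a candidate event that still has to be classified as an artifact or a variation.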

From genome to genome fragments
- The human genome is too large for efficient comparison with the graph.
- We extract fragments of the genome using alignments of Illumina reads to the reference.
- As a result, we analyse 300 fragments of length 10 kb instead of a single fragment of length 3 Gb.

Genome-graph alignment problem
Input:
- De Bruijn graph of a single barcode
- A fragment of the genome along with its alignment to the de Bruijn graph
Output:
- All breakpoints/insertions/deletions in the given genome fragment
Note: In the ideal case, all consecutive elementary alignments in the genome alignment are perfectly fit.
Question: Should we report all breaks in elementary alignments as SVs?

de Bruijn graph reveals structural variations
[Figure: reference, target, and de Bruijn graph shown for three cases: insertion; deletion; breakpoint]

...but read artifacts make the problem difficult
[Figure: reference, target, de Bruijn graph for ideal coverage, and de Bruijn graph with coverage break, shown for three cases: coverage break; coverage break + repeat; SNP + repeat + bad luck]

How should we deal with artifacts?
1. Some erroneous alignments must be discarded
2. Coverage breaks are usually marked by tips on both sides
3. All other unexpected events represent structural variations

Genome-Graph Alignment problem (new version)
Input: sequence of elementary alignments A1, A2, ..., An
Find: subsequence of elementary alignments a1, a2, ..., am with minimum penalty
A penalty is assigned:
- For skipped elementary alignments (low)
- For coverage breaks (low)
- For insertions/deletions (medium)
- For breakpoints (high)

Dynamic programming for Genome-Graph alignment
Dynamic subproblem: find the best alignment for a prefix of the initial alignment sequence, given that the last elementary alignment of the prefix is part of the chosen subsequence.
procedure Solve(A1, A2, ..., An)
    if n = 1: return 0    (base case, assumed: a single alignment incurs no penalty)
    best = infinity
    for k = 1 to n-1
        candidate = Solve(A1, A2, ..., Ak) + ScoreSkip(Ak+1, Ak+2, ..., An-1) + ScoreVariation(Ak, An)
        best = min(best, candidate)
    return best
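The recursion above can be written as a bottom-up table so that each prefix is solved only once, which takes quadratic rather than exponential time. The sketch below makes several assumptions that are not stated in the talk: the first and the last elementary alignments are always kept, skipping costs a fixed `skip_penalty` per skipped alignment, and `score_variation` is a caller-supplied function. All names and the penalty values in the demo are hypothetical.

```python
def best_alignment(alignments, score_variation, skip_penalty):
    """Return (minimum penalty, indices of the chosen elementary alignments)."""
    n = len(alignments)
    if n == 0:
        return 0, []
    # best[j]: minimum penalty for the prefix ending at j, with alignment j chosen.
    best = [0] + [float("inf")] * (n - 1)
    prev = [None] * n
    for j in range(1, n):
        for i in range(j):
            candidate = (best[i]
                         + skip_penalty * (j - i - 1)  # alignments strictly between i and j
                         + score_variation(alignments[i], alignments[j]))
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    # Trace the chosen subsequence back from the last alignment.
    chosen = []
    j = n - 1
    while j is not None:
        chosen.append(j)
        j = prev[j]
    return best[-1], chosen[::-1]


if __name__ == "__main__":
    # Toy model: an alignment is just a number; consecutive numbers fit perfectly.
    def toy_score(a, b):
        return 0 if b == a + 1 else 10  # made-up penalties for illustration

    print(best_alignment([0, 1, 50, 2, 3], toy_score, skip_penalty=1))
```

In the toy run the spurious alignment 50 is skipped for a small penalty instead of being reported as two high-penalty events.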

ScoreVariation(A1, A2)
[Figure: three cases for a pair of elementary alignments A1, A2: coverage break; insertion/deletion/SNP; breakpoint]

Verification by TSLRs
[Figure labels: reference; TSLR]
- TSLRs represent information from paired-end reads.
- If a structural variation is supported by a TSLR, it is given a significantly lower penalty.

TruSPAdes variation detection pipeline
[Figure labels: reference genome; barcoded reads; de Bruijn graph; genome-graph alignment; statistical analysis of alignments; list of SVs]
- Barcoded reads can be used to validate SVs found in TSLRs.
- The de Bruijn graph is used as a representation of barcoded reads.
- We analyse alignments of the reference genome and TSLRs to the de Bruijn graph constructed from each TSLR pool of barcoded reads.
- We filter out most false-positive SVs caused by misassemblies in TSLRs and find variations in regions that are not covered by assembled TSLRs.

Thank you