Haploid Assembly of Diploid Genomes: Challenges, Trials, Tribulations
  13 October 2011
  İnanç Birol
Assembly By Short Sequencing
  IEEE InfoVis 2009
ABySS in Literature
  ~40 citations on tool comparisons
  ~20 citations on using ABySS for a biology study
  Crowded field: 17 teams in Assemblathon 1
  Overlap-Layout-Consensus: ARACHNE, CAP3, Celera assembler, MIRA, Newbler, Phred/Phrap, SGA
  De Bruijn Graph: Euler, Velvet, ABySS, SOAPdenovo, ALLPATHS
Assembly Problem
       TCGATCGATTTTCGGCCTAA                          read1
              ATTTTCGGCCTAATATTAGG                   read2
    GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC
  A partial and unambiguous read-to-read alignment extends the length of sequence information.
  The first stage of an assembly algorithm is to find such alignments.
  Assembly algorithms differ in the way they find and use these alignments.
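As a minimal sketch of the read-to-read alignment idea on this slide, the Python below finds the longest exact suffix-prefix overlap between read1 and read2 and merges the two reads. The names longest_overlap and merge_reads and the min_overlap parameter are illustrative assumptions, not part of ABySS; a real assembler also has to deal with sequencing errors and reverse complements.

```python
def longest_overlap(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left, right, min_overlap=1):
    """Merge two reads on their longest exact overlap, or return None."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Sequences copied from the slide.
read1 = "TCGATCGATTTTCGGCCTAA"
read2 = "ATTTTCGGCCTAATATTAGG"
longer_sequence = "GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC"

overlap = longest_overlap(read1, read2)
merged = merge_reads(read1, read2)
```

For the two reads on the slide the overlap is 13 bp, and the merged 27 bp sequence is a substring of the longer sequence shown beneath them.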
Algorithm
  SE Assembly: k-mer extension on a de Bruijn graph
  PE Assembly: search for unambiguous contig merging along paths (figure labels: d=6±5, d=5±4)
  Scaffolding: search for unambiguous linkage across distant contigs (figure labels: d=12±5, d=26±9)
Software
De Bruijn Graph
  Description of read-to-read overlaps
  2×4 possible extensions of every k-mer
  Provides an O(n) algorithm for SE assembly
  Example, k = 5:
    seq1: GACATTGC
    seq2: GACATTAT
    k-mers: GACAT, ACATT, CATTG, CATTA, ATTGC, ATTAT
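The k = 5 example on this slide can be reproduced with a small Python sketch. It stores each k-mer as a node and infers an edge whenever a successor k-mer that overlaps by k-1 bases is also present. The function names are illustrative assumptions, and the sketch deliberately ignores reverse complements and k-mer counts, so it is not a description of ABySS internals.

```python
def kmers(seq, k):
    """All k-mers of `seq`, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(seqs, k):
    """Map each observed k-mer to its observed successors.

    A successor shares the node's last k-1 bases and adds one base, so each
    node has at most 4 outgoing (and, symmetrically, 4 incoming) extensions.
    """
    nodes = set()
    for seq in seqs:
        nodes.update(kmers(seq, k))
    graph = {}
    for node in sorted(nodes):
        candidates = [node[1:] + base for base in "ACGT"]
        graph[node] = [c for c in candidates if c in nodes]
    return graph


graph = build_de_bruijn(["GACATTGC", "GACATTAT"], k=5)
branching = {node: succ for node, succ in graph.items() if len(succ) > 1}
```

Here the two sequences share GACAT and ACATT, and the graph branches at ACATT into CATTA and CATTG, which is where an unambiguous single-end extension has to stop.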
Adjacency Graph
  Description of contig overlaps
  Built during SE assembly: overlap = k-1 bp
  Generalized during PE assembly: arbitrary overlap
Linkage Graph
  Built through read pairs aligned to different contigs
  PE reads from a tight fragment length distribution: reliable distance estimates
  MP reads from a broader insert length distribution: noisy data
  Used in PE assembly (PE) and scaffolding (PE and MP) stages
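To illustrate why a tight fragment length distribution gives reliable distance estimates, here is a naive Python sketch that estimates the gap between two contigs from linking read pairs. It is an assumption-laden illustration, not the estimator used by ABySS: it takes a plain mean over pairs, assumes the library mean and standard deviation are already known, and ignores the bias from fragments too short or too long to link the contigs. The function name and coordinate conventions are hypothetical.

```python
import math


def estimate_gap(pairs, len_a, frag_mean, frag_sd):
    """Naive gap estimate between contig A and a downstream contig B.

    pairs     -- list of (start_a, end_b): 0-based start of the forward read
                 on A, and the coordinate on B where its mate's fragment ends
    len_a     -- length of contig A
    frag_mean -- mean fragment length of the library (assumed known)
    frag_sd   -- standard deviation of the fragment length (assumed known)

    Returns (gap, error), where error is the standard error of the mean.
    """
    if not pairs:
        raise ValueError("at least one linking pair is required")
    # fragment length = bases on A + gap + bases on B, solved for the gap
    gaps = [frag_mean - (len_a - start_a) - end_b for start_a, end_b in pairs]
    gap = sum(gaps) / len(gaps)
    error = frag_sd / math.sqrt(len(gaps))
    return gap, error


gap, error = estimate_gap([(900, 50), (920, 60), (880, 40)],
                          len_a=1000, frag_mean=200, frag_sd=30)
```

The error term shrinks with a smaller standard deviation and with more linking pairs, which is the intuition behind treating PE estimates as reliable and MP estimates as noisy.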
Anchor
  Scrubbing homozygous variations
    Indel
    SNPs
Anchor
  Local directional assembly
    scaffold gap filling (bridging)
    extension (planking)
Case Study
  Mountain Pine Beetle Genome Assembly
Mountain Pine Beetle Genome
  Assembly statistics
                        contigs      scaffolds
  n                     1,128,463    1,103,221
  n:500bp               33,591       11,657
  n:n50                 4,324        82
  N50 (bp)              11,220       541,443
  Max (bp)              276,135      3,583,207
  Reconstruction (Mb)   201.9        200.4
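For readers unfamiliar with the N50 and n:n50 rows, the following Python sketch computes both from a list of sequence lengths. It uses the common definition (the length at which the longest sequences first cover at least half of the total), and the min_length cutoff mirrors the 500 bp threshold in the table; whether the slide's figures were computed in exactly this way is an assumption.

```python
def n50(lengths, min_length=0):
    """Return (N50, count) for sequences of at least `min_length` bases.

    N50 is the length of the sequence at which the running total, taken from
    longest to shortest, first reaches half of the summed length. `count` is
    how many sequences were needed to get there.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences pass the length cutoff")
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count


toy_lengths = [100, 80, 50, 30, 20, 10]
value, count = n50(toy_lengths)
```

In the toy example the total is 290, so the two longest sequences (100 + 80) already cover half and the N50 is 80. Read the same way, the scaffold column above says that 82 scaffolds of 541,443 bp or longer cover half of the assembly.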
Assembly As a Hairball
  ABySS v1.2.7
  PE/MP information disambiguates short contig extensions
  Node connectivity*
  in \ out   1        2       3       4      5      6+
  1          15822    7354    1882    530    109    1
  2          7354     9814    1817    456    72     3
  3          1882     1817    1074    238    31     1
  4          530      456     238     126    13     1
  5          109      72      31      13     10     0
  6+         1        3       1       1      0      0
  * For contigs 2 kb
Scaffolding
Quality Assessment
  Alignment of 81,047,980 reads
                Before Anchor          After Anchor           Change
  Mapped        65,624,456 (80.97%)    66,949,341 (82.60%)    +1,324,885
  Paired        43,207,118 (53.31%)    44,732,320 (55.19%)    +1,525,202
  Single-end    9,536,178 (11.77%)     8,846,977 (10.92%)     -689,201
  Gene alignments
                2,180 ESTs             248 Conserved Genes
                Complete   Partial     Complete   Partial
  Contigs       968        1169        212        18
  Scaffolds     1,481      619         228        5
Date            ABySS Version   Data                          n:500     N50       Max         Sum
August 2009     1.0.11          3x GAiix                      81,431    1,526     20,755      107.3e6
November 2009   1.0.15          +2x GAiix                     104,958   2,333     55,845      195.8e6
February 2010   1.1.1           +4x GAiix                     157,081   2,790     136,637     346.3e6
July 2010       1.2.0           +2x GAiix                     146,313   3,354     129,008     376.2e6
November 2010   1.2.4           +1x GAiix, +1x GAiix (MP)     100,690   4,474     294,323     268.8e6
May 2011        1.2.7           --                            18,660    108,158   1,908,773   201.4e6
July 2011       1.2.7           +1x HiSeq, +1x HiSeq (MP)     11,657    541,443   3,583,207   200.4e6
August 2011     1.2.7           --                            11,523    561,847   3,746,698   206.5e6
Transcriptome Assembly
Transcriptome Sequencing
  RNA-seq protocol
  Brings information on how a genome acts
    Expression levels
    Allelic expression
    Present isoforms
    Gene fusions
    Other transcriptional events
    Post-transcriptional RNA editing
  Rodrigo Goya
Transcriptome Assembly
  Transcriptome assembly is different from genome assembly
    varying coverage levels
    varying expression levels
    split assembly paths
    isoforms/splice variants
    small contig sizes
    small product sizes
  Transcript models
What Overlap to Choose?
Selection of k
What Overlap to Choose?
  Selection of parameter k depends on read coverage depth
  Expression levels vary over 5 orders of magnitude
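One way to see the dependence of k on coverage depth is the standard relation between base coverage C, read length L and k-mer coverage, C_k = C (L - k + 1) / L. The Python sketch below evaluates it for a few values of k. The relation is a general property of k-mer counting under error-free reads, not something stated on the slide, and the function name and example numbers are illustrative assumptions.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base coverage and read length.

    A read of length L contributes L - k + 1 k-mers, so larger k values
    lower the effective coverage available to the de Bruijn graph.
    """
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Hypothetical example: 100 bp reads, a well-expressed transcript at 30x
# and a weakly expressed one at 3x.
high = {k: kmer_coverage(30, 100, k) for k in (31, 51, 91)}
low = {k: kmer_coverage(3, 100, k) for k in (31, 51, 91)}
```

With these numbers the 3x transcript drops below one expected k-mer occurrence at k = 91, while the 30x transcript is still covered. That is the tension a single k cannot resolve when expression spans several orders of magnitude.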
Assembly Merging
  [Figure labels: buried, parent, untouched]
Multi-k Assembly
  We capture a wide range of expression levels
  Gray: all transcripts with a read alignment
  Blue: at least 80% of a transcript in a single contig
  Red: at least 80% of a transcript is reconstructed
Trans-ABySS
  A versatile tool for
    Transcript reconstruction
    Gene identification
    InDel and SNV discovery
    Chimeric transcript discovery
      Gene fusions
      Trans-splicing
    Expression analysis
Transcriptome Assembly
  Trans-ABySS: de novo assembly based on ABySS
  Cufflinks 0.8.3, Scripture: reference-based assembly based on TopHat alignments
  [Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]
Events
  + chimeric transcripts
Performance
  Compared to mapping-based analysis tools, Trans-ABySS reconstructs as many transcripts, with better sensitivity and specificity
  [Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]
Case Study
  Acute Myeloid Leukemia Transcriptome Assembly
Fusions
  Assembled transcriptome contigs span multiple genes
  Break point (usually) corresponds to exon boundaries
  Break point is supported by
    Spanning reads
    Read pairs linking regions
  Gene fusions are often drivers in AML and define subtypes (e.g. PML/RARα and M3 subtype)
  Lucas Swanson, Readman Chiu and Gordon Robertson
AML Gene Fusions
  71 events in 65/173 (38%) patients
  30 different gene fusions identified
    Known AML fusion events (12)
    Known polymorphism (1)
    Novel fusion event (17)
  94% validation by RT-PCR sequencing
  [Figure: number of patients per candidate fusion event; annotations: 9%, 5%, 4%, MLL fusions, low frequency (<1%)]
  Karen Mungall
Validation of a Novel Fusion
  Chr 17p13.1: DNA directed RNA polymerase II polypeptide A (POLR2A); 5' UTR, Exon 1, 2
  Chr 19p13.2: Fibrillin 3 (FBN3); Exon 47, 48
  POLR2A-FBN3: 5' UTR, Exon 1, Exon 48, Exon 63; EGF-like, calcium binding domains
  [Gel image: M: 1kb plus DNA ladder; 1: A00160 (2938); 505bp]
  Andy Mungall
Internal Tandem Duplications
  Contig alignments result in
    Query gaps
    Contiguous target blocks
  Read support on break point(s)
  Aberrant read pair distances
  Known AML ITDs:
    29/173 (17%) harbour partial FLT3 exon 14 duplication
    6/173 (3.5%) harbour partial WT1 exon 7 duplication
  Nakao et al., Leukemia 1996; Christiansen et al., Leukemia 2001
Known ITD in FLT3
  A 33 bp duplication in exon 14
  CTCCCATttgagatcatattcatattctctgaaatcaacgTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAA
  Karen Mungall
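The duplicated unit in the sequence above can be located with a brute-force Python sketch that searches for the longest block immediately followed by an identical copy. This is only an illustration of what a tandem duplication looks like at the sequence level; Trans-ABySS detects ITDs from contig alignments and read support, as the previous slide describes. The function name and the min_unit parameter are assumptions.

```python
def find_tandem_duplication(seq, min_unit=10):
    """Return (start, unit_length) of the longest exact tandem duplication.

    Comparison is case-insensitive. Longer units are tried first and, for a
    given unit length, the leftmost start wins. Returns None if no unit of
    at least `min_unit` bases is duplicated.
    """
    s = seq.upper()
    n = len(s)
    for unit in range(n // 2, min_unit - 1, -1):
        for start in range(0, n - 2 * unit + 1):
            if s[start:start + unit] == s[start + unit:start + 2 * unit]:
                return start, unit
    return None


# Sequence copied from the slide; lowercase marks the first copy.
flt3_contig = ("CTCCCATttgagatcatattcatattctctgaaatcaacg"
               "TTGAGATCATATTCATATTCTCTGAAATCAACGTAGAA")
hit = find_tandem_duplication(flt3_contig)
```

On the slide's sequence the search reports a 33 bp unit starting right after the CTCCCAT prefix, matching the lowercase block and its uppercase copy.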
Partial Tandem Duplications
  Usually coexist with the wild-type
  PTD event manifested in a particular contig type
    A short contig with 50/50 split alignment
  Break point is supported by
    Spanning reads
    Read pairs in opposite orientation
  Known AML PTD: 10/173 (5.8%) harbour duplication of MLL exons 2-10 (Dorrance et al., Blood 2008)
  Identified 88 genes with PTDs
Novel PTD in Arid1a
  Exons 2-4 tandemly repeated in 5 AML libraries (figure labels: WT, CT)
  Recurrent across tissues and species
  Source           Observations
  AML              5/173 Libraries
  LBC              5/54 Libraries
  Normal           3/7 Libraries
  mouse NCBI EST   colon_ins, placenta_normal
Summary
ABySS Team: Shaun Jackman, Tony Raymond, Rod Docking
Beetle Project: Joerg Bohlmann, Chris Keeling, Nancy Liao, Greg Taylor, Simon Chan, Diana Palmquist
Trans-ABySS Team: Readman Chiu, Karen Mungall, Gordon Robertson, Ka Ming Nip, Jenny Qian, Rong She, Lucas Swanson
AML Project: Richard Moore, Yongjun Zhao, Andy Mungall, Aly Karsan
GSC: Sequencing Team, Library Core, Systems Team, Steven Jones, Marco Marra
Final Hairball
  ABySS v1.2.7
  Read pairs and inferred distances allow for scaffolding
                        contigs      scaffolds
  n                     1,128,463    1,103,221
  n:500bp               33,591       11,657
  n:n50                 4,324        82
  N50 (bp)              11,220       541,443
  Max (bp)              276,135      3,583,207
  Reconstruction (Mb)   201.9        200.4
Biotin Read-Through
  circularized insert
Triage of MP Reads
  Challenge: A B or B A. Which one?
  Information:
    Distances from contig ends
    Base mismatches on read ends
    Inferred contig orientations
Triage of MP Reads
  [Figure: Read 1 and Read 2 alignments with mismatches marked x, each pair labelled MP-like or PE-like]