Genome Assembly. Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD

Size: px

Start display at page:

Download "Genome Assembly. Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD"

Trevor Henry
5 years ago
Views:

1 Genome Assembly Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD

2 Introduction

Platform: 3730 Model length reads bases ABI 3730 800 96 80K Applied Biosystems 3730xl DNA

90 min run makes 96 reads. 500 t0 800 bp/read. Paired end range 2Kbp - 40Kbp. Mate rate almost 100%.

3 Platform: 3730 Model length reads bases ABI K Applied Biosystems 3730xl DNA amplification in E. coli. Sanger dye-terminator PCR. Capillary gel electrophoresis. 90 min run makes 96 reads. 500 t0 800 bp/read. Paired end range 2Kbp - 40Kbp. Mate rate almost 100%. Reads include vector at 5 end. Quality drops gradually at 3 end. Traces available for inspection. ABI 3730 Trace from 3730

Platform: FLX Model length reads bases ABI 3730 800 96 80K 454 FLX 400 1M 400M Roche 454 FLX Titanium Successor to GS 20 and FLX Standard Pyrosequencing:

single circles (in lab), Size options are 3Kbp or 8Kbp or 20Kbp, Linker-positive reads cut into pairs by software, 50% of reads are linker-positive.

4 Platform: FLX Model length reads bases ABI K 454 FLX 400 1M 400M Roche 454 FLX Titanium Successor to GS 20 and FLX Standard Pyrosequencing: Emulsion PCR, Bead capture, Transfer to picoliter plate, sequencing by synthesis of complementary strand, Image processing Paired-ends: Reads are cut from single circles (in lab), Size options are 3Kbp or 8Kbp or 20Kbp, Linker-positive reads cut into pairs by software, 50% of reads are linker-positive. Run time is 9hr. Clear ranges of 80bp-500bp. Problems: Inaccurate counts of repeated bases, duplicate reads, lower output on paired-end. 454 Beads on a picotitre plate

Platform: Illumina Model length reads bases ABI 3730 800 96 80K 454 FLX 400 1M 400M Illumina GA 100 200M 20G Illumina Solexa Genome Analyzer Bridge

5 Platform: Illumina Model length reads bases ABI K 454 FLX 400 1M 400M Illumina GA M 20G Illumina Solexa Genome Analyzer Bridge hybridization on flat surface, Cluster formation, Sequencing by synthesis, Image Processing. One 2x run in 10 days. Lengths have grown from 35 to 2x100. Reads have identical length initially, trimmed by software, Substitution more than indel, error accumulates at read end. Paired ends, sequencing both ends of a fragment, range of 35bp-700bp. Mate pairs, sequencing fragment cut from circle, range of 2Kbp-8Kbp. Illumina Flow cell with 8 lanes

Platform: SOLiD Model length reads bases ABI 3730 800 96 80K 454 FLX 400 1M 400M Illumina GA 100 200M 20G ABI SOLiD 50 >1G 50G Applied Biosystems SOLiD System 4 succeeds 1 thru 3.

6 Platform: SOLiD Model length reads bases ABI K 454 FLX 400 1M 400M Illumina GA M 20G ABI SOLiD 50 >1G 50G Applied Biosystems SOLiD System 4 succeeds 1 thru 3. Sequencing by reversible ligation, 5-mers query 2 consecutive bases per luminous interrogation, light signal is di-base encoded. Up to 16 days per run. Reads vary from perfect to unusable due to high error. Color space output, requires special software, allows detection and correction of some errors, vendor claims higher accuracy. ABI SOLiD Colors represent di-base encoding

Emerging Platforms Helicos Method: Single molecule sequencing, 35bp reads Application:

coli genome per channel PacBio Single molecule with polymerase Light detection from small

short, high-error Ion Torrent Electronic chip Charge detection, no photonics Machine is

7 Emerging Platforms Helicos Method: Single molecule sequencing, 35bp reads Application: whole-genome resequencing Capacity: 50 channels per 8-day run, 50X coverage of one E.coli genome per channel PacBio Single molecule with polymerase Light detection from small chambers Runs are fast (15m) but low-throughput Reads of 10Kbp possible Current reads are short, high-error Ion Torrent Electronic chip Charge detection, no photonics Machine is compact and cheap ($50K) Runs are fast (1hr) and cheap ($500) Millions of reads/run, hundreds of bases/read

8 Illumina Output s_3_1_0083_qseq.txt! HWUSI-EAS TAAACAACCCCGGGCTCACCGAGCTGAGTCTGTAGC ^IV`J^JaT\F\D^DFN ORH\GYIM\SSTWXF\T 0! HWUSI-EAS CAGCAGAAGCAAGGCGGCTGCAGAGAGCAGAGAGAC X]\LaaJ`a\MZWWGP_]IZZWZa\IPFIMFPa_TX 0! HWUSI-EAS AACCTCCCTTGCCAGCTTCCCCCGATAGTTAATGGG ab\fhew]`rt`vwikpr^]qrx[gfziqfr]_bbb 0! HWUSI-EAS AGGCTGTTAACTGAGTATGGCCGCTCCTATTTGTTT WSabaaVIHYH[]MKUZBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS CGTTTAGCTCTATCTTACACCACTTCCTCTCCTTTC O HYJXI_^GZ`bbbb^RN]IGRVI`K\YGZUH\[ 0! HWUSI-EAS AGCGAAGGGGCAGTGAAGGGACGAAAAGCCGAGCAG ZGHZ_^^OO_^BBBBBBBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS TCCTCTCCGAACGGTTGACAGCGCTCAGCATTGTAC a]b_b_bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb 0! HWUSI-EAS TGCCCATTGGAGGAAAAAAAATTACATAGATCTCCA ab_agydmzx^]`z^]hwyd[_w\\hqwumhwwax_ 0! HWUSI-EAS GCCCCGGC.TACGAATGCGGTCAGCGGAAAAAGCAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0! HWUSI-EAS CGGCAATT.TTCTTATCAACACCACCCGAACGACTG abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb 0! HWUSI-EAS AAATCTG..AATGATTCTCCCTTCTGACGTACCACA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0 Machine ID Lane 1 or 2 of pair Base calls Quality values run=090401! TAAACAACCCCGGGCTCACCGAGCTGAGTCTGTAGC! +HWUSI-EAS541:3:83:1787:876#0/1! ^IV`J^JaT\F\D^DFN run=090401! CAGCAGAAGCAAGGCGGCTGCAGAGAGCAGAGAGAC! +HWUSI-EAS541:3:83:1787:673:0#0/1! X]\LaaJ`a\MZWWGP_]IZZWZa\IPFIMFPa_TX! Alternate format, same data phred Quality Values, now used by Illumina QV=20 means 1/10 2 chance of error. QV=30 means 1/10 3 chance of error. QV=40 means 1/10 4 chance of error. Solexa Quality Values, formerly used by Illumina QV=2 is encoded as ascii( 2+64)= B. QV=20 is encoded as ascii(20+64)= T. QV=30 is encoded as ascii(30+64)= ^. QV=40 is encoded as ascii(40+64)= h. *64 is an arbitrary constant. Sanger files used 33.!

9 454 Output Example of using unix utilities from 454. Shown is sffinfo. See also sfffile.! $ sffinfo! Usage: sffinfo [options...] [- sfffile] [accno...]! Options:! -a or -accno Output just the accessions! -s or -seq Output just the sequences! -q or -qual Output just the quality scores! -f or -flow Output just the flowgrams! -t or -tab Output the seq/qual/flow as tab-delimited lines! -n or -notrim Output the untrimmed sequence or quality scores! Dump 1 read from SFF file for half-plate #2 Each flow yields 0 or more bases $ sffinfo FFPMSHM02.sff FFPMSHM02B382V! # of Flows: 400! # of Bases: 271! Clip Qual Left: 5! Clip Qual Right: 90 Flow Chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG! Flowgram: ! Bases: tcagggggggggaaccactgtaacaaaggaaagtaggttggtctcgcggtgagtg! Quality Scores: ! The first 4 bases (lower case) are the key sequence for calibration. On this slide, the 4 colors show correspondence between flows and bases.

10 Color Space ABI SOLiD The reads have colors not bases. encoded as numbers 0 to 3 ABI claims colors improve accuracy. Software should align in color space. convert to bases after alignment ABI software catalog is skimpy. A followed by T is coded red. T followed by G is coded green. Note T is interrogated twice. The color code.

11 Sanger-era Paired Ends 1. Randomly shear copies of the genome 2. Build libraries of similar-length fragments 3. Amplify individual fragments 4. Sequence two ends of each fragment Insert size estimate given as µ ± σ per library. µ ± σ Oriented reads with distance estimate.

12 Sanger and 454 Paired Ends Sanger sequencing of insert in vector 454 sequencing across linker Mate (blue) is same as read (brown) Mate (blue) is reverse complement of read (brown) Sanger mate pair from both forward sequences, vector trimmed Sanger-style mate pair from opposite strands, linker trimmed

13 Sequencer Summary Sequencing Machines ABI 3730xl 454 FLX with Titanium: variable-length reads Illumina GA (Solexa): pairs of 100bp reads ABI SOLiD system 4: reads of 50 colors Helicos Heliscope: 35bp from single molecule Pacific Biosciences: variable-length reads from single molecule Ion Torrent: not yet available Sequencing Approaches Whole-genome shotgun Mate pair and paired end Bar coding Contract sequencing

14 de novo Assemblers Greedy Graph Algorithms for short reads SSAKE, SHARCGS, VCAKE Traditional Overlap/Layout/Consensus Algorithms Developed for Sanger reads Celera Assembler (maintained at JCVI, for reads of 100bp+) Newbler (proprietary from Roche, tuned for 454 reads) Others: TIGR, phrap, Arachne, PCAP, Phusion, Atlas, Jazz de Bruijn Graph Algorithms EULER: originally developed for Sanger reads Velvet: the first widely used assembler for short reads AllPaths: version 3 may scale up to human genome ABySS: distributed memory, distributed processing, first assembly of human from short (<100bp) reads SOAP de novo: best assembly of human from short reads, running on large-ram computer CLC: commercial assembler designed for mixed platform fragments, uses many processors and low memory

15 Algorithms: Basic Steps Read processing Filter erroneous reads and trim erroneous bases Graph construction Reads and their pair-wise overlaps, or K-mers and their K-1 overlaps Graph reduction Remove spurs (erroneous read ends) Collapse bubbles (minor polymorphisms) Tease apart collapsed repeats (use reads or mates) Form contigs but break at repeats Scaffold with mate pairs Consensus from multiple sequence alignment

16 Assembly Components Consensus AGGCATGACGGCTAGGCGCGTANNNNNNNNNCCGCGAATACGAG Scaffold = Contigs + Gaps (gap lengths derived from pairings) Contigs = Unitigs + Pairing Unitigs = Reads + Overlaps (up to apparent repeat boundaries) Overlaps Reads & Pairing

17 Algorithms: Techniques Allowing for sequencing error complicates everything Use sequence similarity as guide and paired-ends as constraint Techniques for extra speed Use indexes to avoid all-against-all pair-wise alignment K-mers (exact match substrings) are suitable for fast computes In De Bruijn Graph assembly (Velvet), identical K-mers collapse to graph nodes In traditional assembly (Celera Assembler) K-mers seed overlaps Arbitrary thresholds Minimum overlap length or percent identity Maximum repetitiveness tolerated in a read Sub-optimal algorithms Greedy algorithms (keep enlarging my contig without concern for global picture) Heuristic algorithms (use coverage ratios that seem to work well on real data) Save consensus for last Multiple pair-wise alignments is faster than one multiple sequence alignment Must choose a strategy for reads from genomic repeats Report 0 reads/repeat (leave gap or provide consensus with 0X coverage) Report 1 read/repeat (chosen randomly for even coverage) Report all reads/repeat (repeat reads are multiply placed)

18 Algorithms: Graphs aaccgg ccggtt Assembly algorithms construct graphs, which are nodes connected by edges. In traditional assemblers (left) nodes are reads, edges are overlaps. In de Bruijn assemblers (right) nodes are K-mers, edges are exact matches of K-1. aacc accg ccgg cggt ggtt Assembly algorithms reduce graph complexity. Boxes are reads or K-mers. Edges are overlaps or matches. Edge thickness can indicate amount of support in reads. Left: A spur is induced by bad sequence at a read end or low coverage and polymorphism. Middle: A bubble is induced by polymorphism or sequencing error. Right: A collapsed repeat might get teased apart or isolated or multiply placed.

19 Coverage: Impact on Assembly Sequence this genome by the random shotgun method. 1. Initially, all new reads make contigs. 2. Later, new reads fill gaps. Contigs enlarge and coalesce. 3. Eventually, new reads add little to nothing. Gaps remain due to chance or bias. 4. At very high coverage, error accumulates, assemblies fracture (not shown).

20 Lander-Waterman The Lander-Waterman equations use Poisson statistics to describe an unbiased and unpaired whole-genome shotgun experiment on a non-repetitive genome assuming {detected overlaps}={true overlaps}. Your results may vary! Average Contig Length Total Number of Contigs Read coverage: 0X 8X Read coverage: 0X 8X Parameters: read length = 800, minimum detected overlap = 40. Early reads nucleate contigs. Later reads are mostly redundant.

21 Read-Length Effects 1. Overlap Effect: For same number of sequenced bases, shorter reads require more coverage to achieve comparable N50. Assembly 1. 9 reads of length = 30bp. Total sequenced bases = 270. Assemble with min overlap = 20bp. Result = 7 contigs. Same inputs Different outputs Assembly 2. 3 read length = 90. Total sequenced bases = 270. Assemble with min overlap = 20bp. Result = 1 contig. 2. Repeat Effect: Shorter reads resolve fewer repeats. Repeat length = 600bp. Read length = 800bp (Sanger). Reads span the repeats. Repeat length = 600bp. Read length = 400bp (454). Reads bridge the repeats. Repeat length = 600bp. Read length = 75bp (Solexa). Repeats not resolved.??

22 Mate Pair Effects 1. Variety of insert sizes will span variety of repeats. Inserts that span the repeat will enable scaffolds. High coverage in mates will tile the repeat. Larger repeats require larger insert sizes. 2. Mates can resolve repeats even if not possible to tile with reads. Mates guide multiple placement of repeat contig.

23 Assembly Metrics Number of contigs, number of scaffolds Average number of contigs per scaffold Total bases in scaffolds Total span of scaffolds (including the gap lengths) Average gap length (for intra-scaffold gaps) Contig N50, Scaffold N50, Gap N50 Sum of bases in contigs>10kbp Sum of span in scaffolds>2kbp Min, max, and mean span of a scaffold Min, max, and mean bases in a contig Fraction of paired-end constraints satisfied in scaffolds Average read coverage in contigs

24 The N50 Metric Problem: How to represent a distribution of intervals by one number? Solution: N50 1. Measure the total span. 2. Sort the set by size. 3. Start walking from the largest interval. 4. Stop walking at the 50% mark. 5. Measure the interval on which you stopped. Start at largest. Intervals may represent contigs, scaffolds, gaps, alignments, etc. 0 10% 30% 50% 70% 90% This interval is your N50. Extra credit: What is N90? Why is N50 unstable?

25 Typical Assembly Output Scaffolds.fasta Each scaffold contains one or more contigs. Gap lengths may be represented by strings of N. Contigs.fasta These are the contigs from scaffolds, without the gaps. These may be filtered by a minimum size criterion. UnassembledReads.fasta Solo reads may be called singletons. Their mates may or not be assembled. UnassembledContigs.fasta These are contigs not placed in scaffolds. May be repeats, plasmids, contamination, short contigs.

26 Common Assembly Errors Chimera Assembly joins unrelated sequences, usually across a low-copy repeat Collapsed tandem repeat Copy number is reduced in the assembly Missed join Contigs not assembled despite sequence similarity Assembly Forensics* * Phillippy, Schatz, Pop (2007) Genome assembly forensics: finding the elusive mis-assembly. Genome Biology.

27 Consensus Problems How does 1 sequencing error propagate to the consensus? GGCGT-AGCA GGCGT-AGCA GGCGTTAGCA GGCG-TAGCA GGCG-TAGCA GGCGTTAGCA These alignments were valid. So were these. Each consensus base has at least 3 bases supporting it. Even sophisticated software can generate alignments that the human eye recognizes as wrong. The root cause is that multiple sequence alignment is solved by a series of pair-wise alignments. Celera Assembler reviews its final alignments to compress gap-rich regions. Celera Assembler fixes this particular problem, but it misses more complex ones.

28 Single Cell Assembly Challenges

Uneven Coverage A 120K subset of a scaffold Middle (180K-220K) has high coverage Is this high coverage a repeat (and an assembly mistake) Is it due to

29 Uneven Coverage A 120K subset of a scaffold Middle (180K-220K) has high coverage Is this high coverage a repeat (and an assembly mistake) Is it due to amplification bias (and a correct assembly) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.!

30 Uneven Coverage Aligning to reference High coverage region ( K) has good alignment No discrepancy with reference The high coverage is due to amplification bias Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.! Aligned to reference using MUMmer.!

31 Coverage in scaffolds including gaps and surrogates. Bar at 0 is truncated; value is 152.9M. Count at each position reflected sum of spanning reads, regardless of alignment gaps. (Bonobo i7 Assembly, Bonobo Consortium)

32 Uneven Coverage (cont) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler!

33 Uneven Coverage (cont) Flavobacteria bacterium MS024-2A assembly.! Scaffold generated by Celera Assembler!

34 Chimera 85% of chimeras correspond to a sequence inversion (A, C). 15% of chimeras resulted from the joining of two segments in direct orientation (B, D). The order of the two segments could be reversed. The first segment in the chimera (black arrows) can join to a segment that is downstream (A, B) or upstream (C, D).

35 Bioinformatics Challenges

36 Assembly Tools Bioinformatics tools are often not user-friendly They must be compiled from code and give cryptic error messages Require Unix knowledge/it support

37 Bioinformatics Availability A collection of bioinformatics tools Can be run on the Amazon EC2 Designed to skip the complexity of finding/ downloading/building tools

38 Preparing Inputs to Assembly

39 Hands On: Preparing Data

40 Assembling with Celera Assembler

41 Hands On: Assembling with CA

42 Celera Assembler Output Assembly.scf.fasta >scf138 CTTAATTTTGGAATGAGCAAATCAAACGGCTCTCTTTCTTCTCTCTTCCGACGATCATTTCTCTCCTTCA TCCTTTTTCTCTATTGAATTTATGGTTGTGAGACACTTGCCTATGCCGTTCAAAATTTTACATTAGAGTG GATCTAAACCAAATTGGACACTGTGTTATCTGGTTTTAGCTTGAAGTAATTTCTAACAAGAATGGTTGCT TCAAAAAGACAAATATAAAGTGAAATTGATTGCCTCATTCTCTTCACCCTCATTCAAACTAAAATTGAAG Assembly.asm CTTAAAGAAAGAAGTAAAGGAATCAAAAATGGCAGCCTGCACGTATGAGATGACGAATTCTAAAATTTCA {MDI AGCTCTGCGTTATATTTATTTTGAGTTATACGTTATCTTAATTAATTGGGTTGCTATAATTATAACTTAT ref:(20kb,1) TTCGCTCCACCTAGATTTTTAACGGCCAAAGTGAGTTTAGTTGGTGGTATTAAAATGACTTTTTTATTTA Assembly.posmap.frgscf mea: AAAATCGAAGGTTTAGTTTTTACTATGCAATTAGTGTACCAAAAGAAATTTTTATACCTATTTAAATCAA Assembly.qc FPQMBNN02SL55F f std: [Scaffolds] FPMLNGL01BTUZ f min:1539 TotalScaffolds=4127 FO77AEH01COVYY f max:31474 TotalContigsInScaffolds=8653 FPKRC2D02RHQ3C r buc:28 MeanContigsPerScaffold=2.10 FPDPMZT01DTV0D f his: MinContigsPerScaffold=1 FPBG22E02Q51Y r 8 MaxContigsPerScaffold=73 FPMX3IK01CS2XJ f 0 TotalBasesInScaffolds= FPMLNGL02QFBKX f 1 MeanBasesInScaffolds=48621 FO6AT3P02QPVK r 52 MinBasesInScaffolds=446 FO6AT3P02TCI7F r 69 MaxBasesInScaffolds= FO77AEH01A5XKF r 59 N25ScaffoldBases= FO77AEH01CKY2Z r 104 N50ScaffoldBases= FO77AEH01BD0T r 146 N75ScaffoldBases= FO76IZV02R25OI f 85 ScaffoldAt = FO76IZV02P6FZH f 66 ScaffoldAt =1027 FPDPEIT01BA86D r

43 Dealing with Coverage Bias Shaded node is a repeat Links together distinct regions of a genome Most assemblers start by disconnecting repetitive nodes

44 Coverage in CA Celera Assembler uses the A-Statistic P(read starting at base) = arrival rate = α = P(k starts in ρ bases) is Binomial We approximate via Poisson with giving us For a 2-copy repeat, the arrival rate is doubled, giving us A-Statistic = log P(unique)/P(2-copy) = When A-Statistic is negative, consider node a repeat Read coverage is not perfect Sequencing bias due to GC-content Borderline cases Incorrect coverage expectation Polyploid genomes Metagenomes

45 Fixing Repeat Detection: Toggling Celera Assembler pipeline: default reads overlap unitig scaffold scaffolds Celera Assembler pipeline: toggling reads overlap unitig scaffold Find large repeat unitigs placed once in scaffolds. Mark them unique. scaffold scaffolds

46 Oil Palm - Improvement in Assembly N50 Toggling (flagging repeats as trusted) Default Biggest Contigs Biggest Scaffolds Satisfied Paired-End Unsatisfied Paired-End

47 Hands on: How To Toggle dotoggle At the end of a successful assembly, search for placed surrogates and toggle them to be unique unitigs. Re-run the assembly starting from scaffolder Either 0 or 1, default is 0 Toggling Useful Options Output toggleunitiglength Minimum length for a surrogate to be toggled. Any number >= 1, default is 2000 togglenuminstances Number of instances for a surrogate to be toggled. If 0 is specified, all non-singleton unitigs are toggled to unique status. Any number >=0, default is 1. In future version of CA, setting it to 0 will toggle all non-singleton unitigs >= toggleunitiglength Pre-toggling results: asm/9-teminator Post-toggling results: asm/10-toggled/9-terminator Same contents as from a regular assembly

48 Incorporating Finishing Reads Goals of finishing Close sequencing gaps by generating new reads Get higher coverage in low-coverage sections Generate reads to resolve repeat instances Manual finishing is expensive Finishing reads generated from PCR product or clone Constrain possible assembly: between PCR product ends or clone ends Formulate as a constraint the scaffolder should satisfy

49 Finishing Reads in CA Given the set of bounded finishing reads F, and WGS reads W Bounded algorithm Incorporate unitigs into gaps Place reads in repeat instances

50 Finishing Reads Basic Assembly Bounded Assembly (a) (b) (c) Celera Assembler s two algorithms for assembling shotgun reads with finishing reads. The basic method treats both read types equally. The bounded algorithm attempts to assemble finishing reads consistently with their bounding constraints. For each algorithm, the figure shows its construction of a scaffold from contigs (rectangles) with 2X in shotgun reads (black lines). Each finishing read (colored line) has a corresponding pair of PCR primer sites (arrows of same color). External to the scaffold is a unitig (grey area) deemed repetitive due to high coverage. (a) A mate pair constraint (curve) localizes one read and the unitig to this gap. Nevertheless, the basic algorithm cannot tile this gap with reads. The bounded algorithm localizes two finishing reads by their primer sites. The bounded algorithm does tile the gap with reads, enabling a more accurate consensus sequence. (b) The basic cannot localize the unitig or any reads to this gap. It does not close the gap. The bounded algorithm localizes the unitig by finishing reads and their primer sites. It tiles the gap with finishing reads from the unitig. (c) Both algorithms assemble finishing reads from a gap that is not a genomic repeat.

51 Results * Koren, Miller, Walenz, Sutton (2010) An algorithm for automated closure during assembly. BMC Bioinformatics.

Specifying Finishing Reads Special record in FRG

must be defined in an {FRG X, Y and Z can appear

52 Specifying Finishing Reads Special record in FRG file Only supported in FRG v2 Example of a PLC: {PLC } act:a frg:x frg:y frg:z Finishing read ID Boundary #1 ID Boundary #2 ID Reads X, Y, and Z must be defined in an {FRG X, Y and Z can appear in an {LKG Y (and Z) can be either the left or right bound

53 Hands On: Finishing Reads in CA

54 Assembling with Newbler

55 Hands On: Assembling with Newbler

56 Hands On: Finishing Reads in Newbler

57 Newbler Output 454Scaffolds.fna >scaffold00001 length=48466 tttgtctaaagtaaatcttagcttaaggcagtgatatcgatggattagaaaggatttgga TTGGGGCATTGGTGTAAAAGAAAAACCCTCCGCTTGGGAGGGCTAAATGGCTTtATTTTG TTGAACCTATAaTTTCCCAAATTTCAATAGCTTTAAAACCGTTTGGCATCAAATCGCCCA TCTTGGTTTTATTCGCAATAGAAGCAGGtAATGGAAATTCTCCTGCTAGATACTTCATAA 454Contigs.ace CGTTAGTCATTGAGCCAAAAGCGTCCCAAGTCAGTATTGTTGATGAGTCCTCGTAATATG AS 389 GAGCAATTcTTTAAAACCCGTCCTCTCTCAGTGtATTTtCAATGTTACAATCTTAGAAAA TCACATAGACGCGAGTTTTTAGTTATGAGTACTATTCTATTGAATTTCTATATTCTGAAG 454NewblerMetrics.txt 454ReadStatus.txt CO contig00001 GGGTAACTGATTTtATTTTTTtAAatACTCTATTAAAATTTTCTTGGGAATTAAACCCTG pairedreadstatus U tttgtctaaagtaaatcttagcttaaggcagtgatatcgatggattagaa ATTTTTCTACAAGTGCCACAATTGTATATTTATCTAAAAATTTATTTTGGATTAAATCAC numberwithbothmapped = 2124; aggatttggattggggcattggtgtaaaagaaaaaccctccgcttgggag numberwithoneunmapped = 69; GGCTAAATGGCTTt*ATTTTGTTGAACCTATAa*TTTCCCAAATTTCAAT numbermultiplymapped = 1; AGCTTTAAAACCGTTTGGCATCAAATCGCCCATCTTGGTTTTATTCGCAA numberwithbothunmapped = 23; TAGAAGCAGGt*AATGG*AAATTCT*CCTGCTAGATACTTCATAACGTTA library { GT*CATTGAGCCAAAAGCGTCCCAAGTCAGTATTGTTGATGAGTCCTCGT libraryname = "sanger3kb_fios"; EM9Q5SV01BCVA3 Singleton AATATGGAGCAATT*c*TTTAAAACCCGTCCTCTCTCAGTGt*ATTTtCA pairdistanceavg = ; ATGTTACAATCTTAGAAAATCACATAGACGCG*AGTTTTTAGTTATGAGT pairdistancedev = ; ACTATTCTATTGAATTTCTATATTCTGAAGGGGTAACTGATTTtATTTTT } TtAAa*t*ACTCTATTAAAATTTTCTTGGGAATTAAACCCTG*ATTTTTC scaffoldmetrics { EM9Q5SV01CKSJH Singleton TACAAGTGCCACAATTGTATATTTAT*CTAAAAATTTATTTTGGATTAAA numberofscaffolds = 86; TCACAAGCGAAATCTATTTTCAAAGAATTGATGTAAAATGGTGTAGTGAC numberofbases = ; ATTTAACTTTTGTTTGATAGAATCACGTATATCATTTTGAGTTATTCCTG avgscaffoldsize = 18218; TATCTACTGACAAATTTACTATAGAATAATTATTTTGAAGGAATGCTTTT N50ScaffoldSize = ; GTTTCTTTAAGGTAATGGTTAAaGTTTtGAATTAATTCCTT largestscaffoldsize = ; EM9Q5SV01A4E99 Assembled contig contig EM9Q5SV01ECMLE Assembled contig contig EM9Q5SV01A835M Assembled contig contig EM9Q5SV01APR37 Assembled contig contig EM9Q5SV01CUR52 Assembled contig contig EM9Q5SV01CMCSJ Assembled contig contig EM9Q5SV01CCIN0 Assembled contig contig EM9Q5SV01BXHGV Assembled contig contig EM9Q5SV01B8QV8 Assembled contig contig EM9Q5SV01B5IA1 Assembled contig contig EM9Q5SV01CMFK1 Assembled contig contig EM9Q5SV01EORKA Assembled contig contig EM9Q5SV01A44JD Assembled contig contig EM9Q5SV01CS1HJ Assembled contig contig EM9Q5SV01A8ZAK Assembled contig contig EM9Q5SV01DGAKL Assembled contig contig EM9Q5SV01AVH7L PartiallyAssembled contig contig EM9Q5SV01AX7FC Assembled contig contig EM9Q5SV01EDHFV Assembled contig contig EM9Q5SV01ANK9N Assembled contig contig EM9Q5SV01AW150 Assembled contig contig EM9Q5SV01AZYCP Assembled contig contig EM9Q5SV01AWB3A Assembled contig contig EM9Q5SV01EABU1 Assembled contig contig EM9Q5SV01AVZ12 Assembled contig contig EM9Q5SV01EVB54 Assembled contig contig EM9Q5SV01CFUI9 Assembled contig contig EM9Q5SV01BH1OF Assembled contig contig EM9Q5SV01CI83R Assembled contig contig EM9Q5SV01CQC2E Assembled contig contig

58 Assembly Analysis

Assembly Alignments Forward strand Reverse strand Assembly Reference Alignment of a problematic bacterial assembly. Note double coverage at most reference positions.

59 Assembly Alignments Forward strand Reverse strand Assembly Reference Alignment of a problematic bacterial assembly. Note double coverage at most reference positions. Further analysis revealed the presence of a second strain whose sequence similarity to reference was lower. Alignment by nucmer. Drawing by mummerplot. Both part of the mummer package (

60 Two Assemblies from 454 Cucumber 27M Titanium reads 3K, 8K, 20K inserts 199M in scaffolds 797K scaffold N50 87K contig N50 Read fates: 65% contig 31% degen 1% singleton Bonobo 250M Titanium reads 3K, 8K, 20K inserts 2.9G in scaffolds 9.6M scaff N50 67K contig N50 Read fates: 88% contig 7% degen 3% singleton Cucumber assembly by CA 5.3. Consortium led by 454 Life Sciences. Bonobo assembly by CA 5.4. Consortium led by Max Planck Institute.

61 Cucumber Reads 3Kbp library 20Kbp library Unpaired library * Read lengths equal the clear ranges reported by sffinfo. For mated reads, the read length includes both mates and the linker. The data was produced at 454 on the XLR Titanium.

62 Cucumber Mates Kbp library Count (log scale) Kbp library Separation (Kbp) * The mate separation was determined by Celera Assembler on data provided by 454.

63 Library Effect Large-insert libraries are overrepresented in degenerates. Large-insert reads were more often rejected for being too short or perfect prefix.

64 Cucumber Comparison Celera Assembler Newbler Contigs, total bases (Mbp) Largest contig (Kbp) Contig N50 (Kbp)* Contig mean (Kbp)* Scaffolds, total span (Mbp)* Largest scaffold span (Mbp) * Not apples-to-apples! N50 depends on assembler s genome size estimate. Mean depends on small-contig threshold. Total span depends on small-scaffold threshold.

65 Cucumber assembly Max N10 N10 N25 N25 N50 N50 N95 N95 Total Combined length count length count length count length count length count length Scaffolds 4,649, ,717, ,506, , ,997 3, ,566,885 Alignments 3,813, ,292, ,098, , ,463 3, ,183,263 Contigs 522, , , ,421 3,121 10,379 7, ,010,660 Alignments of CA assembly to the Newbler assembly approach the scaffold size, which is the maximum.

66 Cucumber Degens A degenerate is Two or more reads in a unitig (high-confidence contig) That could not be promoted to contig Negative A-stat due to short length or high coverage That could not be incorporated into scaffolds Would need 2 consistent mates at least Lots of small degenerates Number of degenerates= 665,311 Combined length= 332 Mbp Equivalent to 165% of scaffold bases! Largest= 21 Kbp. Average length= 499 bp Lots of reads in degenerates Reads in degenerate= 40% of all reads Reads in degenerate and so is mate= 2.9% of reads Non-degen reads with degen mate= 0.3% of reads

67 One scaffold, aligned to itself. Select scaffold with most degen alignments. Scaffold : 7 contigs (5 shown). Length= 96Kbp (first 50K shown). Alignments by nucmer (mummer) reveal no intra-scaffold repeats. Red: forward strand. Green: approximate contig layout.

Degenerates aligned to the scaffold. Select degens with fewer than 2 mated reads. Of those, select degens whose reads have pair-wise overlaps to reads in this scaffold.

68 Degenerates aligned to the scaffold. Select degens with fewer than 2 mated reads. Of those, select degens whose reads have pair-wise overlaps to reads in this scaffold. There are 646 such degens, more than for any other scaffold. Align with nucmer, minimum 95% identity. Some degens have multiple alignments. Red: forward strand. Blue: reverse strand. Green: approximate contig layout.

69 Cucumber Improvement i1 i2 i3 i4 Contigs, total bases (Mbp) Largest contig (Kbp) Contig N50 (Kbp) Contig mean (Kbp) Scaffolds, total span (Mbp) Largest scaffold span (Mbp) Degen bases (vs scaffolds) 165% 165% 132% 88% Degen reads 40% 40% 38% 31%

70 Illumina Assemblies The NAT experiment at JCVI Bacterial DNA from E. coli and Y. pestis Test Illumina 100bp PE at various coverage Test addition of 454 fragment or PE Test all assemblers

71 Cumulative Contig Length (bp)

72 Results: Performance Metrics For E. coli Assemblies 1600 %CPU Used x trim 50x trim 100x trim Readset 150x trim CLC illumina Velvet illumina SOAP illumina MIRA illumina Virtual Memory (kb) Log scale EULER illumina x trim 50x trim 100x trim Readset 150x trim CLC illumina Velvet illumina SOAP illumina MIRA illumina EULER illumina RAM (kb) log scale CLC illumina Velvet illumina SOAP illumina MIRA illumina EULER illumina x trim 50x trim 100x trim 150x trim

73 Hands On: Assembly Comparison

74 Assembly Viewers

75 Assembly Error? Green: mate pairs satisfied by the assembly.! Coverage discontinuity.! Prochlorococcus from GOS III metagenomic assembly.! Scaffold generated by Celera Assembler, displayed with Hawkeye viewer.! Investigation by Doug Rusch, JCVI.!

76 Another Viewpoint 100% identity! Color: collection site.! 50% identity! Recruitment discontinuity! corresponds to 16s-23S rrna operon! Prochlorococcus fragment recruitment to scaffold.! GOS reads aligned to GOS III assembly, displayed with JCVI Fragment Recruitment Viewer.! Investigation by Doug Rusch, JCVI.!

77 Chimera Detected By Viewers

78 Introduction to Viewers: Consed

79 Introduction to Viewers: Hawkeye

80 Hands On: Viewing Using Hawkeye

81 Introduction to Viewers: Gap4

82 Introduction to Viewers: Gap5

83 Hands On: Viewing Using Gap4/5

84 Questions

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7 Computational requirements