Supplementary Methods


Calculation of completed genomes

The number of completed genomes reported in the main text was obtained from summary statistics provided by the Genomes OnLine Database (GOLD). The total number of complete and permanent draft genomes (3200 = 2079 (published) + 27 (unpublished)) was divided by the number of all genomes with sequencing data, excluding genomes annotated as still awaiting DNA or sequencing (10506 (incomplete projects) − 1126 (awaiting DNA) − 276 (awaiting sequencing)), yielding an estimate of 26%.

Southern Blot

Chromosomal DNA was prepared from V. cholerae using the Qiagen DNeasy kit. Southern blotting followed the procedures of Sambrook et al. (1989) [1]. Intact and BglII/BstEII-digested chromosomal DNA was separated on a 0.7% agarose gel and then denatured, neutralized, and transferred to a nylon membrane. The 32P-labeled hybridization probe was prepared by PCR amplification of two products on either side of a unique BstEII site in the TLC Cri gene. Oligonucleotides 5′ to the BstEII site (5′-CCACACTTCACCGCAATATG-3′ and 5′-ATAAGCGACCGCTAGGTGTG-3′) produced the 199 nt probe 1, and primers 3′ to the BstEII site (5′-GATTGCGGTTTTGTGGACTT-3′ and 5′-TCTCTGCTAAGCCGACATCC-3′) produced the 360 nt probe 2. The products were labeled by random-primer extension using random hexamers, [α-32P]-dCTP, and the Klenow fragment of DNA polymerase I. After hybridization and washing, the blot was exposed to a phosphorimager screen and visualized with ImageQuant software (Molecular Dynamics).

PCR validation and sequencing of long products

Regions of the V. cholerae chromosome, 5 to 15 kb in length, that encode the CTX prophage, RS1, and TLC were PCR amplified from DNA isolated from H1 (cholera from a Haitian patient) and N5 (cholera strain N16961). Phusion DNA polymerase (NEB) was used for amplification. Primers are listed in Supplementary Table 9. To confirm the larger, tandemly duplicated TLC region, the sequenced N16961 amplicons were compared to the canonical N16961 reference (Supplementary Figure 6). Due to the length of this region (>15 kb), a longer protocol was run, leading to reads exceeding 17 kb with mapped subreads exceeding 12 kb. In both the Haitian cholera outbreak strain and the N16961 samples, other species were observed in the PCR product, indicating either PCR artifacts or real heterogeneity in the samples. Specifically, in both samples there were sequences from PCR products indicating an excision of sequence similar to the pTLC element previously shown to insert into V. cholerae [2]. Illumina reads from the Haitian cholera outbreak strain were mapped back to the TLC tandem junction (between CDC contigs 7 and 55), confirming the presence of the tandem cycle in this sample.

DNA fragment extraction and library preparation

Three different types of SMRTbell libraries were constructed for the two isolates (H1: V. cholerae from a Haitian patient, and N5: V. cholerae strain N16961). Specifically, for each sample, long-insert, strobe, and whole-amplicon libraries were constructed using the commercially available Pacific Biosciences Template Prep Kit, similarly to what has been previously described. For strobe libraries, abasic sites were engineered into the SMRTbell hairpins to prevent the polymerase from wrapping. For each library, a sequencing primer was annealed to each SMRTbell, and the libraries were bound to DNA polymerase using 3 nM of the respective SMRTbell library and a 3X excess of DNA polymerase (9 nM), using the commercially available Pacific Biosciences DNA/Polymerase Binding Kit 1.0, similarly to what has been previously described [3].
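The 26% estimate quoted under "Calculation of completed genomes" can be checked with a few lines of arithmetic. The sketch below is not taken from the source. It assumes that "all genomes with sequencing data" means the complete and permanent draft genomes plus the incomplete projects that are no longer awaiting DNA or sequencing; the variable names are illustrative.

```python
# Counts quoted in the text (GOLD summary statistics).
complete_or_permanent_draft = 3200
incomplete_projects = 10506
awaiting_dna = 1126
awaiting_sequencing = 276

# Assumption (not stated explicitly in the source): genomes with sequencing
# data = finished genomes + incomplete projects that already have data.
with_sequencing_data = (
    complete_or_permanent_draft
    + incomplete_projects
    - awaiting_dna
    - awaiting_sequencing
)

fraction_complete = complete_or_permanent_draft / with_sequencing_data
print(f"{with_sequencing_data} genomes with data, {fraction_complete:.0%} complete")
```

Under this assumption the ratio rounds to the 26% reported above.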

DNA/polymerase complexes were immobilized into zero-mode waveguides (ZMWs) for 60 minutes. After thorough washing to remove unbound complexes, standard sequencing was conducted on a PacBio RS sequencer using 75 to 120 minute continuous collection times and an R&D version of the Pacific Biosciences DNA Sequencing 1.0 kit, similar to what was previously reported [1]. In addition, H1 and N5 strobe libraries were sequenced using the commercially available Pacific Biosciences Strobe Sequencing Kit 1.0 on a lower-throughput R&D instrument. The R&D collection protocol for strobe sequencing included strobe series consisting of a 4-minute subread followed by a 48-minute advance time and ending with an 8-minute subread, as well as strobes composed of a 4-minute subread followed by a 52-minute advance time and ending with a 4-minute subread. Note that the lower-throughput machines had higher variability as well as lower throughput (5% of the throughput of the RS).

The 454 sequencing for N16961 was performed on a Roche 454 GS-FLX+ sequencer using Titanium FLX+ chemistry and the standard Roche shotgun library protocol (454 Life Sciences/Roche Applied Science). Genomic DNA was extracted using the Promega Wizard Genomic DNA purification kit. DNA was then sheared to ~250 to 350 bp by sonication. The Illumina library for isolate H1 was constructed with standard Illumina adapters and PCR primers using the New England Biolabs (NEB) NEBNext DNA library preparation kit. Sequencing of 40 nt reads was performed on the Illumina GAIIx sequencer at the Harvard Biopolymers Facility, and reads were processed using the Illumina pipeline 1.5.

The sequencing data for the H1 and N16961 experimental assemblies were deposited in the Short Read Archive under SRP as a part of BioProject: PRJNA. Additionally, the N16961 assembly based on simulated data used SRX.

Sanger Validation

To test the validity of the scaffold, we designed primer pairs for a subset of gaps to check our contig pairing and the resulting fill-in. Flanking sequence upstream and downstream of each gap was extracted for all regions in the assembly corresponding to gaps between CDC contigs. Sequence not corresponding to the flanking CDC contigs was eliminated prior to primer design. Therefore, any amplified primer pair corresponded to a correct CDC contig pairing. Flanking sequence was obtained for the 78 breakpoints where adequate flanking sequence was available. To maximize the chance of spanning a breakpoint with Sanger sequencing, only breakpoints spanning less than 600 bp were selected for primer design, resulting in primer sets for the smallest 56 regions (with breaks between 1 and 560 bases). For each breakpoint, upstream and downstream flanking sequences were concatenated with an appropriate number of Ns inserted between them to represent the sequence to verify. Concatenated sequences were fed into Primer3 to search for appropriate PCR amplification primers. The resulting primers were ordered from Integrated DNA Technologies (Coralville, IA). Previously isolated genomic DNA from Haitian cholera strain H1 was diluted 1:10, and 1 µL was used as template in a 50 µL PCR reaction with the Advantage 2 PCR kit (Clontech) according to the manufacturer's recommendations. Out of 56 reactions, 55 yielded amplicons. In order to obtain 96 sequencing reactions, the top 48 PCR reactions were selected (those that did not contain non-specific amplification products that would hinder downstream processing). A total of 48 PCR products purified using AMPure XP beads (Agencourt) were sent to MCLabs (South San Francisco) for Sanger sequencing in both forward and reverse directions (96 reactions). Of these, we were unable to obtain a Sanger product for one.
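
The template construction described above (flanks joined by a run of Ns, smallest breakpoints first) can be sketched as follows. This is an illustrative sketch rather than the authors' script: the function names, the dictionary input, and the returned target coordinates are assumptions, and no Primer3 call is made.

```python
def build_template(upstream: str, downstream: str, gap_length: int):
    """Join two flanks with gap_length Ns standing in for the sequence to verify.

    Returns the template and the (start, length) of the N run, i.e. the
    region that an amplicon would need to span.
    """
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    template = upstream + "N" * gap_length + downstream
    return template, (len(upstream), gap_length)


def select_breakpoints(gap_lengths: dict, max_gap: int = 600, limit: int = 56):
    """Return up to `limit` breakpoint IDs with gaps under max_gap, smallest first."""
    eligible = [(length, bp_id) for bp_id, length in gap_lengths.items()
                if length < max_gap]
    return [bp_id for _, bp_id in sorted(eligible)[:limit]]


# Toy inputs; real flanks would be hundreds of bases long.
template, target = build_template("ACGTACGT", "TTGACCAA", 5)
chosen = select_breakpoints({"bp1": 560, "bp2": 1, "bp3": 900})
print(template, target, chosen)
```

The defaults of 600 bp and 56 regions mirror the thresholds quoted in the text; the template string and target interval are the kind of input a primer-design tool would then consume.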
To assess whether the failed amplicon and Sanger products were due to primer failure or represented true misassemblies, the respective regions were compared to both the MJ1236 and CIRS101 references. Both regions had flanking sequences consistent with the two references (upstream regions mapped at 99.9% and 100% accuracy and downstream regions at 100% accuracy, respectively), suggesting that the assembly is most likely correct in these regions.

Assembly Accuracy Calculation and Gap Determination

Accuracy was calculated for all assemblies as described previously in Rasko et al. (nucmer --maxmatch --nosimplify; delta-filter -1) [3]. Gaps in assemblies were identified using BLASR with default parameters.

Supplementary Results

N16961 Control Simulated Assembly

To determine whether our AHA pipeline could reconstruct a correct assembly given nearly perfect starting contigs, we simulated contigs based on the known N16961 reference (Supplementary Table 4). These contigs were constructed to have properties similar to the Haitian outbreak strain 454/Illumina input contigs. Specifically, we split the N16961 reference sequence on repeats >100 bp. We then excluded regions <100 bp. We determined repeats using the MUMmer tool nucmer with the following options: "-l maxmatch --nosimplify". All regions deemed to have multiple hits in the genome were then labeled as repeats. Overlapping repeats were merged. Only contigs >100 bp were retained. The resulting assembly comprised 186 scaffolds (188 contigs), with 111 scaffolds covering >99% of the genome at an accuracy of 99.99% (Supplementary Table 4). Full results of the N16961 study are shown in Supplementary Tables 3 and 4 and Supplementary Figures 8 and 9.

After layering in the PacBio long and strobe read data for N16961, the hybrid assembly procedure resulted in an assembly comprising 11 scaffolds (176 contigs; contigs were rigorously defined as the longest contiguous stretches of DNA sequence not containing any Ns) containing 14,218 uncalled bases, with only 3 scaffolds covering 99% of the genome, and an accuracy of 99.99% (Supplementary Table 4).
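The contig simulation described above (merge overlapping repeats, split the reference at repeats longer than 100 bp, keep only contigs longer than 100 bp) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: intervals are half-open, repeats are merged before the length filter is applied, repeat regions themselves are dropped rather than emitted as contigs, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def split_on_repeats(genome_length, repeats, min_repeat=100, min_contig=100):
    """Return the non-repeat contigs left after cutting out long repeats."""
    long_repeats = [(s, e) for s, e in merge_intervals(repeats)
                    if e - s > min_repeat]
    contigs, pos = [], 0
    for start, end in long_repeats:
        if start - pos > min_contig:      # keep only contigs > min_contig
            contigs.append((pos, start))
        pos = end
    if genome_length - pos > min_contig:  # trailing contig after last repeat
        contigs.append((pos, genome_length))
    return contigs


# Toy genome of 1000 bp with two overlapping repeats and one short repeat.
contigs = split_on_repeats(1000, [(200, 350), (300, 450), (600, 650)])
print(contigs)
```

In the toy example the two overlapping repeats merge into one 250 bp repeat that splits the sequence, while the 50 bp repeat is too short to cause a split.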

These results are qualitatively consistent with the results we achieved on the Haitian outbreak strain genome.

N16961 Control Experimental Assembly

Corrected errors in original input contigs

The high-quality input contigs for the control assembly contained one large gap internal to the contigs (a 440 bp deletion). The error-correction approach (re-calling the consensus sequence using a multiple sequence alignment of error-corrected PacBio reads) adjusted this region such that it was consistent with the reference. Accuracies for raw error-corrected reads (prior to consensus) are shown in Supplementary Table 2. Below are alignments of the corresponding reference region from N16961 (Chromosome 2, coordinates 272kb-274kb) to the raw contigs and to the error-corrected contigs. The amount of deleted sequence is reduced from 440 bp to 2 bp.

NC_ (coordinates: 272kb-274kb) vs Raw CDC Contigs:
nmatch: 1560 nmismatch: 0 nins: 440 ndel: 0 %sim:

GCTGTGCGGCGCGATCACTTCTCCTCCTGCGAGCTTAAACGGCTGCGCGC
GCTGTGCGGCGCGATCACTTCTCCTCCTGCGAGCTTAAACGGCTGCGCGC 850

TAAAATTCGCACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACT
TAAAATTCGCACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACT 900

TGGCCTTGCTTATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAA
TGGCCTTGCTTATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAA 950

CTGCTCATGCAAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGA
CTGCTCATGCAAGGGCAAAAACCCTTGTTGATA--G--T--GCT--T

TACGCGCTGGCGGCACTCAAGCAAGCGGTCATCCAATATGATGCCGAGCC

GCATAACCCATTGAGCTAGCTTTTCATTGCTTCAGCACAAGGCTGTGACC

CCATAAGGTCACAGCCTTGATTGAACTACCCAATCGCGCGGTTTTCGCAC

GCGATGGTTATTCACTTGATTGTTATTTGTCCGATTTCTATTTAGGTTGC

CATTGATGGGTTTGCCCATTCGCCAATTGAGCCAATAGGCTGTGTGGCGC

GATCACTTCTCCTCCTGCGAGCTTAAACGCGGCTGCGCGCTAAAATTCGC

ACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACTTGGCCTTGCT

TATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAACTGCTCATGC

AAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGATACGCGCGGA
                             GAGCTGCTCGATACGCGCGGA 1450

GTTTTTACGCGTTTCATCGCGGCTTAAATGCACCATCGCCGGAGTGTTAA
GTTTTTACGCGTTTCATCGCGGCTTAAATGCACCATCGCCGGAGTGTTAA 1500

ACAGCATGGCACGCAAATCACGCACGGCTTTGACATTCGAAAACTTCAAA
ACAGCATGGCACGCAAATCACGCACGGCTTTGACATTCGAAAACTTCAAA...

NC_ (coordinates: 272kb-274kb) vs Error-corrected AHA scaffold:
nmatch: 1998 nmismatch: 0 nins: 2 ndel: 0 %sim:

GCTGTGCGGCGCGATCACTTCTCCTCCTGCGAGCTTAAACGGCTGCGCGC
GCTGTGCGGCGCGATCACTTCTCCTCCTGCGAGCTTAAACGGCTGCGCGC 850

TAAAATTCGCACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACT
TAAAATTCGCACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACT 900

TGGCCTTGCTTATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAA
TGGCCTTGCTTATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAA 950

CTGCTCATGCAAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGA
CTGCTCATGCAAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGA 1000

TACGCGCTGGCGGCACTCAAGCAAGCGGTCATCCAATATGATGCCGAGCC
TACGCGCTGGCGGCACTCAAGCAAGCGGTCATCCAATATGATGCCGAGCC 1050

GCATAACCCATTGAGCTAGCTTTTCATTGCTTCAGCACAAGGCTGTGACC
GCATAACCCATTGAGCTAGCTTTTCATTGCTTCAGCACAAGGCTGTGACC 1100

CCATAAGGTCACAGCCTTGATTGAACTACCCAATCGCGCGGTTTTCGCAC
CCATAAGGTCACAGCCTTGATTGAACTACCCAATCGCGCGGTTTTCGCAC 1150

GCGATGGTTATTCACTTGATTGTTATTTGTCCGATTTCTATTTAGGTTGC
GCGATGGTTATTCACTTGATTGTTATTTGTCCGATTTCTATTTAGGTTGC 1200

CATTGATGGGTTTGCCCATTCGCCAATTGAGCCAATAGGCTGTGTGGCGC
CATTGATGGGTTTGCCCATTCGCCAATTGAGCCAATAGGCTGTGTGGCGC 1250

GATCACTTCTCCTCCTGCGAGCTTAAACGCGGCTGCGCGCTAAAATTCGC
GATCACTTCTCCTCCTGCGAGCTTAAA--CGGCTGCGCGCTAAAATTCGC 1300

ACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACTTGGCCTTGCT
ACTGAGAATCGAGCCATCACTAAAGCGCGTTTGTTGCACTTGGCCTTGCT 1350

TATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAACTGCTCATGC
TATCCAGCCAGCGAAAATCGACCAAGGCTTTATCCCACAACTGCTCATGC 1400

AAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGATACGCGCGGA
AAGGGCAAAAACCCTTGTTGATAGTGCTTGAGCTGCTCGATACGCGCGGA 1450

GTTTTTACGCGTTTCATCGCGGCTTAAATGCACCATCGCCGGAGTGTTAA
GTTTTTACGCGTTTCATCGCGGCTTAAATGCACCATCGCCGGAGTGTTAA 1500

ACAGCATGGCACGCAAATCACGCACGGCTTTGACATTCGAAAACTTCAAA
ACAGCATGGCACGCAAATCACGCACGGCTTTGACATTCGAAAACTTCAAA 1550

CTGTCGCTGTGCCAGTGGTGGCTACTGATCAGCTCGTCATGAAACACCGC
CTGTCGCTGTGCCAGTGGTGGCTACTGATCAGCTCGTCATGAAACACCGC...

Possible true differences from N16961 reference

Subsequent to error-correction and resequencing, a large difference relative to the N16961 reference (353 base pairs of insertions and deletions) remained. This region corresponded to a tandem repeat on Chromosome 1 between coordinates 1736kb-1739kb. Below is the alignment of our final N16961 contig relative to this region. Several reads confirm our control assembly at this location (Supplementary Table 5 and Supplementary Figure 7).

NC_ (coordinates: 1,735kb-1740kb) vs Error-corrected and resequenced AHA scaffold:
nmatch: 4990 nmismatch: 1 nins: 9 ndel: 344 %sim:

GACCCGCGATTACTAAAGACGCTAAGCCGTCAATAGAAGTAAAACTGAAA
GACCCGCGATTACTAAAGACGCTAAGCCGTCAATAGAAGTAAAACTGAAA 1550

CTCCCAACTTTGCTAAGAGCAGAGTGATCTGGCGACGAACCACCTGAAAG
CTCCCAACTTTGCTAAGAGCAGAGTGATCTGGCGACGAACCACCTGAAAG 1600

ATTAGATTCATGGATAGTTAACTCAGCATTAGCTCCATTCAACCCTGAAA
ATTAGATTCATGGATAGTTAACTCAGCATTAGCTCCATTCAACCCTGAAA 1650

TACTAAGCTCTTCATCAA T-G
TACTAAGCTCTTCATCAATGTTCTGCTCGTTGAGCTTGACTGTGATGTCG

GTGGTCTTCACGCCCCCCAGACCTGCGTCTTCGGTCGCAGTCACCACCAA

GCTGTGCACGTTGGCCAGCGCCTCAAAGTCGTTCGCCGCCGCCTCTGCAC

CTTTCGCAGTCAGGGTGATCACGCCCGTGGTCGCATCAATCGCAAACCAG

CCGTTGTCGTTGCCTGACTTGATGCTGTAGGTCACCGTCTCTTTATCCGC

ATCGGTGGCCTTGACCGTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGT

CGTAGCTGAAGCTGTACTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCG

            TTCTGCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTT
TTGTCATCGAGGTTCTGCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTT 1708

CACGCCCCCCAGACCTGCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCA
      *
CACGCCACCCAGACCTGCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCA 1758

CGTTGGCCAGCGCCTCAAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCA
CGTTGGCCAGCGCCTCAAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCA 1808

GTCAGGGTGATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTC
GTCAGGGTGATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTC 1858

GTTGCCTGACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGG
GTTGCCTGACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGG 1908

CCTTGACCGTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTG
CCTTGACCGTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTG 1958

AAGCTGTACTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATC
AAGCTGTACTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATC 2008

GAGGTTCTGCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCC
GAGGTTCTGCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCC 2058

CCAGACCTGCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCACGTTGGCC
CCAGACCTGCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCACGTTGGCC 2108

AGCGCCTCAAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCAGTCAGGGT
AGCGCCTCAAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCAGTCAGGGT 2158

GATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTG
GATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTG 2208

ACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACC
ACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACC 2258

GTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTA
GTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTA 2308

CTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCT
CTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCT 2358

GCTCGCTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCCCCAGACCT
GCTCGCTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCCCCAGACCT 2408

GCGTCTTCGGTCGCAGTCACCACCAAGCTGTGTCACGTTGGCCAGCGCCT
GCGTCTTCGGTCGCAGTCACCACCAAGCTGTG-CACGTTGGCCAGCGCCT 2458

CGAAA--CGTTGCGCGCGCGCGTCACCTCTGCACCTTTCGCAGTCAGGGT
C-AAAGTCGTT-CGC-CGC-CG---CCTCTGCACCTTTCGCAGTCAGGGT 2506

GATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTG
GATCACGCCCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTG 2556

ACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACC
ACTTGATGCTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACC 2606

GTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTA
GTGCCCAGTACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTA 2656

CTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCT
CTCACCGTCGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCT 2706

GCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCCCCAGACCT
GCTCGTTGAGCTTGACTGTGATGTCGGTGGTCTTCACGCCCCCCAGACCT 2756

GCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCACGTTGGCCAGCGCCTC
GCGTCTTCGGTCGCAGTCACCACCAAGCTGTGCACGTTGGCCAGCGCCTC 2806

AAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCAGTCAGGGTGATCACGC
AAAGTCGTTCGCCGCCGCCTCTGCACCTTTCGCAGTCAGGGTGATCACGC 2856

CCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTGACTTGATG
CCGTGGTCGCATCAATCGCAAACCAGCCGTTGTCGTTGCCTGACTTGATG 2906

CTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACCGTGCCCAG
CTGTAGGTCACCGTCTCTTTATCCGCATCGGTGGCCTTGACCGTGCCCAG 2956

TACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTACTCACCGT
TACGGTGTCCGCCGCGCTGTTTTCGTCGTAGCTGAAGCTGTACTCACCGT 3006

CGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCTGCTCGTTG

CGGTGGTGCCTTCAAACTTCGGTGCGTTGTCATCGAGGTTCTGCTCGTTG...

Ensuring an H1-specific assembly and clonality among Haitian outbreaks

In general, we sought to ensure that the assembly was not biased by the input strain in ways that would lead to a faulty assembly. To address concerns about the clonality of the CDC and PacBio Haitian cholera strains, we performed a series of experiments to demonstrate that the PacBio sequences were consistent with the CDC contigs, and that the CDC contigs were themselves self-consistent. All PacBio long reads were aligned to three targets: our fragmented CDC contig consensus, each CDC isolate reference, and our fragmented CDC contig consensus with a synthetic breakpoint formed by erroneously merging two contigs (as a positive control to demonstrate the pipeline's ability to detect misassemblies). For each read we accepted only its top alignment and discarded any read whose top alignment had less than 80% accuracy relative to the selected reference.

We first sought to address the notion of clonality. We evaluated PacBio reads to determine whether they were consistent with the 454 assemblies. Because the 454 reads were <350 bp on average (aligned length to their corresponding initial contig set) across the three Haitian assemblies, for every base pair of the reference we calculated the coverage of PacBio reads that overhang that position by at least 200 bp (necessitating a >400 bp alignment across each base pair, similar to the alignments one might expect from high-depth 454 data across the contigs). Requiring 200 bp implies that misassemblies within 200 bp of the ends of contigs will not be detected. Also, due to the PacBio error rate, not all reads are guaranteed to align to the edge of the contigs. To account for this, we ignored zero-coverage regions lying within 220 bp of either edge of the contigs (10% more than the 200 bp overhang), which also resulted in contigs <440 bp being ignored. Ignoring these contigs is reasonable because contigs shorter than 1 kb were not used during AHA scaffolding. The largest zero-coverage region in the fragmented contigs was 203 bp. This indicates that the method worked as intended and that the fragmented CDC contigs are fairly robust at their ends (at least within the last 200 bp). As mentioned, the same analysis was performed on the raw CDC contigs from each isolate (1786, 1792, 1798). There were slightly larger zero-coverage regions near the contig edges (two additional regions of 204 bp at ends). Finally, the synthetic misassembly led to an internal zero-coverage gap of 367 bp centered at the misassembly junction, confirming our method's ability to detect misassemblies. This gap corresponds to an erroneous overhang of <17 bp on average, indicating that any contamination from incorrectly overhanging reads is marginal.

Even if the strains are clonal, idiosyncrasies of the assembly algorithm can lead to structural problems with the assembled contigs. Second, we therefore sought to ascertain the structural consistency of the assemblies using additional information: a three-way comparison between the strains, and alignment of PacBio reads requiring 500 bp (>1 kb aligned length) and 1000 bp (>2 kb aligned length) of overhang across each of the breakpoints. This was to conservatively evaluate every position in the selected contig set for structural consistency with the PacBio raw reads. A three-way comparison was done between the isolates. For each pair of assembled contig sets (from isolates denoted A and B), we checked all alignment pairs. For each contig, we checked that its alignments fell into one of the following cases:

1.) Contig A was contained in contig B
2.) Contig B was contained in contig A
3.) Contig A overlapped contig B on the left side
4.) Contig A overlapped contig B on the right side

Each alignment was allowed 500 bp of slack (unaligned sequence) so that it could be directly compared to the PacBio overhang analyses. The only structurally inconsistent regions corresponded to repeat compression or expansion, suggesting that these discrepancies are a result of assembly errors in the initial CDC contigs as opposed to true differences between strains. The same repeat-driven events were captured by the PacBio reads with a required overhang of 1000 bp. With one exception, these events were located at the edge of both contigs A and B. The single exception was internal to one contig and at the edge of other contigs in both other assemblies. We tested whether this was a real event or an algorithmic artifact by mapping the 454 reads from the isolate (1786) back to the isolate's contigs. We then plotted the coverage of reads with 99% accuracy overhanging each base pair by at least 25 bp. Supplementary Figure 4 shows the coverage across this breakpoint with 5 kb of flanking sequence on each side (other events did not have at least 5 kb of flanking sequence). The clear drop in coverage down to zero indicates that this region has little structural support even from the isolate's own raw reads, suggesting that it is likely a misassembly. We further examined these contigs with PacBio reads, requiring a 1 kb overhang. An additional breakpoint was identified, and manual inspection indicated that this breakpoint appeared to be due to lack of coverage in the region.

As described in the main text, the final scaffolds were recalled using error-corrected PacBio reads with Illumina reads from H1 as well as the PacBio resequencing pipeline. This ensured that the final consensus was composed entirely of reads from the H1 isolate.
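The overhang-coverage test used in the clonality analysis can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: alignments are reduced to half-open (start, end) intervals on one contig, the 200 bp overhang and 220 bp edge margin follow the values in the text, and the function names are hypothetical.

```python
def overhang_coverage(contig_length, alignments, overhang=200):
    """Per-base count of reads extending at least `overhang` bp on both sides.

    alignments: iterable of half-open (start, end) intervals on the contig.
    """
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        lo, hi = start + overhang, end - overhang  # positions with full overhang
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    coverage, running = [], 0
    for delta in diff[:contig_length]:
        running += delta
        coverage.append(running)
    return coverage


def internal_zero_runs(coverage, edge=220):
    """Zero-coverage runs, ignoring everything within `edge` bp of either end."""
    runs, run_start = [], None
    stop = len(coverage) - edge
    for i in range(edge, stop):
        if coverage[i] == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if run_start is not None:
        runs.append((run_start, stop))
    return runs


# Two reads that overlap by only 100 bp around position 950: no read
# overhangs the junction by 200 bp, so an internal zero-coverage run appears.
cov = overhang_coverage(2000, [(0, 1000), (900, 2000)])
print(internal_zero_runs(cov))
```

Because the scan starts `edge` bp in from each end, any contig shorter than twice the edge margin yields no runs, which matches the statement that contigs under 440 bp were effectively ignored.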

Supplementary Figures

Supplementary Figure 1. Strobe Span Distribution. Strobe span distribution for H1. The distribution combines data from all runs used for scaffolding H1, leading to a broad span distribution. The small bump below 1 kb is likely due to small inserts in which strobes did not properly terminate at the adaptor site, leading to subreads mapping from overlapping passes around a SMRTbell.

Supplementary Figure 2. Read Length and Accuracy Distributions for C2 and original long read data. Accuracies for the C2 reads (A) are approximately the same as those of the original C1 data (C), while C2 read lengths (B) are significantly greater than the C1 read lengths (D).

Supplementary Figure 3. V. cholerae Hybrid Assembly Pipeline. A) The pipeline starts with five inputs: the CDC contigs, and the raw 454, Illumina and PacBio (long and strobe) reads. PacBio long reads and strobe reads were used to scaffold the CDC contigs (right) with the AHA scaffolding algorithm. The PacBio long reads were also corrected with the 454 and Illumina reads, and these corrected reads were used to either error-correct or fill in gaps in the AHA scaffolds at various points of the pipeline (left). Finally, the resulting scaffolds were resequenced using the PacBio long reads, and consensus was called using these reads, ensuring that the final sequence came from a single clonal source (bottom). A Makefile that applies this pipeline and the PacBio SMRT Analysis software suite is available at
B) The AHA scaffolding portion of the pipeline takes as input high-confidence contigs (or scaffolds) and PacBio reads (long or strobe). The reads are aligned to input contigs, a scaffold graph is built, and the graph is untangled to produce a set of linearized scaffolds (including Ns with the expected span distance between contigs). This process is run iteratively, with the output scaffolds of untangling in one step used as the input contigs for the next step, until a final output scaffold is produced at the last iteration.

Supplementary Figure 4. Overhang coverage of 454 reads around structurally inconsistent region in isolate. The x-axis corresponds to the putative breakpoint region in CDC contig AELIO with 5 kb of sequence upstream and downstream. A clear drop in 454-read coverage (aligning at 99% accuracy with at least 25 base pairs of overhanging aligned sequence) is observed in the structurally inconsistent interval (positions ).

Supplementary Figure 5. Southern blots for H1 and N16961 TLC. Southern blot hybridization with two probes internal to TLC (A) and the predicted BglII and BstEII restriction sites and generated fragments within the constructed CTX prophage and TLC region (B). Chromosomal DNA isolated from N16961 (N5) and H1 was digested with BglII and then with BstEII. The observed fragment sizes validate the reconstructed regions in both strains and the duplication of TLC. Fragment A 11 in H1 replaces fragment A in N5 due to the additional BglII site in contig 11. No additional bands that would indicate a detectable proportion of extrachromosomal TLC can be seen in any lane. Numbers to the left label the marker sizes.

Supplementary Figure 6. N16961 TLC Validation. A) Strobes and B) continuous reads mapped to the N16961 assembly. C) The CDC contigs from H1 were mapped to N16961 in order to directly compare the references, highlighting the shuffling of elements upstream and downstream of the CTX. C) PCR primers were designed to validate the TLC structure. D) PCR products were sequenced and mapped back to confirm the structure; a sampling of subreads (>8 kb) that aligned to the products is shown.

Supplementary Figure 7. Raw PacBio reads spanning tandem repeat difference in N16961 assembly (NC_ : 1,736,000 to 1,739,000) support an additional copy of the tandem repeat. Longest (A) and second longest (B) reads spanning the tandem repeat in N16961. The x-axis corresponds to N16961 and the y-axis corresponds to the PacBio read in each subplot.

Supplementary Figure 8. Distribution of alignment gaps between assemblies and N16961 reference. Histogram of alignment gap lengths between our hybrid assemblies and the N16961 reference, with counts on a log scale, as determined by BLASR alignment. Here, gaps are defined as contiguous inserted or deleted bases. Proximal gaps separated by single bases (or more) of identity are counted separately; for example, the <350 base pair gap corresponds to a set of insertions/deletions spanning 353 bases. (A) Assembly based on scaffolding and fill-in of the 454 reads. (B) Assembly after error-correction of reads with simulated Illumina reads and resequencing with PacBio raw reads. The largest event in A corresponds to the 454 assembly error eliminated after error-correction and resequencing.
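
The gap definition in the caption above (contiguous inserted or deleted bases, with runs separated by even one base of identity counted separately) can be sketched as below. This is an illustrative sketch that operates on a pair of gapped alignment strings; it assumes that adjacent insertion and deletion columns belong to the same gap, and it does not parse BLASR output.

```python
import re


def gap_lengths(ref_aln: str, qry_aln: str):
    """Lengths of contiguous gapped runs in a pairwise alignment.

    A column is gapped when either sequence has '-'; adjacent insertion and
    deletion columns are treated as a single gap (an assumption).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    mask = "".join("g" if r == "-" or q == "-" else "."
                   for r, q in zip(ref_aln, qry_aln))
    return [len(m.group()) for m in re.finditer(r"g+", mask)]


# Three gaps; the first two are separated by a single matching base and are
# therefore counted separately.
lengths = gap_lengths("ACGT--ACGTA", "AC-TGGAC-TA")
print(lengths)
```

A histogram of these lengths over a whole-genome alignment would correspond to the quantity plotted in the figure.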

Supplementary Figure 9. N16961 Assembly Test. The outermost track (salmon) represents the assembly of N16961 with PacBio reads relative to N16961 coordinates. The inner track (blue) indicates the position of contigs simulated to be analogous to the CDC contigs.

Supplementary Figure 10. Structural Comparison to Nepalese Strains. The boxplot shows concordance for all Nepalese strains, split by their respective Nepal group. Hendriksen et al. describe four different Nepalese subgroups; phylogenetic analysis revealed MJ1236 as the closest completed genome to all four subgroups [4]. For each Nepalese group, the strain (H1, MJ1236, or N16961) with the maximum number of concordant mappings was set to 1.0 (in each case the H1 strain). The number of concordant hits for each Nepal strain to H1, MJ1236, and N16961 is given relative to this value. The boxplots represent the spread of values for each strain within a group. Note the low variance of the concordance within each group, indicating a high degree of structural similarity within groups.

Supplementary Figure 11. Cumulative repeat length distribution for N16961 cholera and Arabidopsis. The shaded green boxes correspond to repeats with lengths between 1 and 7 kb. If we assume 454/Sanger sequencing can achieve reads of 1 kb, this suggests the putative benefit of adding PacBio long reads of up to 7 kb. Repeat length analysis for A) the N16961 reference sequence and B) Arabidopsis.

Supplementary Figure 12. ICE Assembly Validation. A) C2 reads mapped to the H1 ICE assembly. B) Strobes and C) continuous reads were used to assemble across ICE. Concordant strobes (with spans between 5.5 and 7 kb) are shown over the region. D) The CDC contigs as they appear in the assembly, with the positions of genes. E) Comparing the assembly to Ind5, only 5 SNVs are observed.

Supplementary Figure 13. Examples of subgraph untangling. The first column shows the graph before a particular untangling operation, the second after that operation. A) The scaffold link between contigs S and K contains the smaller internal contig I. This spanning link can be eliminated, leading to a simple linear path. B) Multiple contigs exist between S and K. Since all internal contigs (I_1 to I_m) are connected to both S and K, we can order them in a direct path from S to K based on their layout. C) A repeat contig R is resolved with a scaffolding edge between S and K. Contig R is duplicated and its remaining edges are removed from the original contig R and passed on to the duplicated node. D) A link between S and K exists but the internal nodes are not completely connected to either S or K (or both). In this case edges are inferred between the source and sink nodes, and all internal nodes, based on the span distributions of linking edges and the lengths of the internal nodes.
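
Case A of the caption above (a spanning link S to K made redundant by a path through an internal contig I) can be illustrated with a small graph sketch. This is a deliberately simplified stand-in, not the AHA implementation: it checks connectivity only and ignores the span distances and contig lengths that the real untangling step would have to reconcile. The function name and edge representation are assumptions.

```python
def remove_spanning_links(edges):
    """Drop a directed link (s, k) when some contig i has links (s, i) and (i, k).

    edges: iterable of (source, sink) scaffold links. Span distances are
    ignored in this sketch.
    """
    edges = set(edges)
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    redundant = {
        (s, k)
        for s, k in edges
        if any(i != k and k in successors.get(i, ()) for i in successors[s])
    }
    return edges - redundant


# S links to both I and K, and I links to K: the S-K link is spanning.
links = remove_spanning_links([("S", "I"), ("I", "K"), ("S", "K")])
print(sorted(links))
```

After removal, the remaining links form the simple linear path S, I, K described for panel A.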

Supplementary Tables

Supplementary Table 1. Estimated memory and run time for AHA (single 32GB 8-core Nehalem blade).
Columns: Size (MB); Nodes (contigs); Edges (linking reads); Step; Memory (max GB); Time (h)
5 1e3 1e4 Total Alignment Scaffolding Untangling 0.05 < e5 1e6 Total Alignment Scaffolding Untangling 4.6 < e7 1e8 Total Alignment Scaffolding Untangling 460 <0.1

Supplementary Table 2. Effect of error-correction by different platforms on long read accuracy.
Columns: Platforms; Average Accuracy*; Average Subread Readlength*
PacBio bp PacBio bp PacBio Illumina bp
* The average for the top mapping for each subread returned by BLASR, with mappings required to have at least 75% accuracy.

Supplementary Table 3. Sequencing statistics for the N16961 strain of V. cholerae.
Columns: Dataset; Number of mapped reads / strobe reads; Mapped coverage; Mean mapped read length / strobe span; Mean subread accuracy
PacBio continuous reads X %
PacBio continuous reads (C2) X %
PacBio strobe reads X* 5.52 kb 84.22%
454 Reads bp
* Mapped physical coverage for strobe reads including the strobe span length.

Supplementary Table 4. Assembly statistics for N16961
Columns: Dataset; Number of scaffolds > 1kb; Total number of scaffolds; Number of scaffolds covering 99% of genome; N50; Total number of contigs; Total number of Ns in scaffolds; Consensus Accuracy
Simulated contigs kb >99.99** Simulated contigs kb >99.99** + PacBio continuous (long + C2 long reads) Simulated contigs + PacBio continuous reads (long + C2 long reads) + PacBio strobe Mb >99.99** 454 only contigs kb PacBio Mb * continuous reads (long + C2 long reads) + PacBio strobe Simulated Illumina + PacBio continuous reads (long + C2 long reads) + PacBio strobe Mb *
* Gaps are broken down by position and type in Supplementary Tables 6 and 7.
** Because we did not have any short read data to generate error-corrected PacBio reads for these contigs, we did not gapfill. Hence accuracies in all cases are very close to the original reference.

Supplementary Table 5. Identity and alignment length of all raw PacBio reads spanning tandem repeat difference in N16961 assembly (NC_ : 1,736,000 to 1,739,000)
Columns: Spanning Read Id; Identity to PacBio N16961; Identity to Reference N16961; Alignment Length to PacBio N16961; Alignment Length to Reference N16961
c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/ c _s1_p0/

Supplementary Table 6. Distribution of gaps in N16961 control intermediate assembly after strobe scaffolding and gap-fill.
Columns: Scaffold ID; Gap start position; Gap end position; Gap length; Comment
scaffold0/ Gap induced by strobe reads.
scaffold0/ Gap spanned by lower quality long reads.
scaffold0/ Gap induced by strobe reads.
scaffold0/ Gap induced by strobe reads.

Supplementary Table 7. Distribution of gaps in N16961 control final assembly after resequencing with PacBio long reads.
Columns: Scaffold ID; Gap start position; Gap end position; Gap length; Comment
scaffold0/ Gap induced by strobe reads.
scaffold0/ Gap spanned by lower quality long reads.
scaffold0/ Gap induced by strobe reads.
scaffold0/ Gap induced by strobe reads.

Supplementary Table 8. De novo assembly of C1 long reads.
Columns: Error Rate (%); Number contigs; Sum contig length; Max contig length; Average contig length; N50 length; N99 length
CDC contigs Fails to assemble

Supplementary Table 9. Oligonucleotide primers used for extended-length PCR products
ID      Sequence
WR1     5'-TCGAGTGGCAAAGAAAATCA-3'
WR3     5'-TCTGGTTCAAGCGATGAGTG-3'
WR8     5'-GTAACCAAACGCCTCGACAT-3'
WR9     5'-CTTGTGAAAAACGGGGTTTG-3'
WR10    5'-ATGCCTATCGACGTTCTGCT-3'
WR11    5'-TAGAAATCAACGCCCCAAAC-3'

Supplementary References

1. Sambrook, J., Fritsch, E.F. & Maniatis, T. (eds.) Molecular cloning: a laboratory manual, 2nd edn. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; 1989).
2. Rubin, E.J., Lin, W., Mekalanos, J.J. & Waldor, M.K. Replication and integration of a Vibrio cholerae cryptic plasmid linked to the CTX prophage. Mol Microbiol 28, (1998).
3. Rasko, D.A. et al. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N Engl J Med 365, (2011).
4. Hendriksen, R.S. et al. Population genetics of Vibrio cholerae from Nepal in 2010: evidence on the origin of the Haitian outbreak. MBio 2, e (2011).