High-throughput genome scaffolding from in vivo DNA interaction frequency

Size: px
Start display at page:

Download "High-throughput genome scaffolding from in vivo DNA interaction frequency"

Transcription

1 correction notice Nat. Biotechnol. 31, (213) High-throughput genome scaffolding from in vivo DNA interaction frequency Noam Kaplan & Job Dekker In the version of this supplementary file originally posted online, a section entitled Supplementary Methods was inadvertently included in the file and the legends for tables and figures omitted. The errors have been corrected in this file as of 4 December 213. nature biotechnology

2 Supplementary Discussion DNA triangulation is a novel approach to genome scaffolding, consisting of an ensemble of methods which share a common underlying theme: the use of high-throughput in vivo genome-wide chromatin interaction data to infer genomic location. We chose to evaluate our approach in extreme large-gap scenarios, in which contigs of 1 kb are surrounded by 1 Mb gaps. Such gaps are unbridgeable using standard long-insert libraries, and would remain so even if the size of long-insert molecules increases by an order of magnitude. Additionally, Hi-C data is exponentially sparser at long ranges, providing a more challenging setting for evaluation. Even so, our approach is able to estimate genomic position with remarkable accuracy. We also show our approach can scaffold an actual dataset of assembled contigs, without the use of a long-insert library. Interestingly, we find that incorrect predictions tend to be associated mostly with centromeric, pericentromeric or telomeric regions, all of which are highly-repetitive and difficult to assemble. However, it is difficult to tell whether this is due to sparse data, mapping difficulties, unique interaction patterns, or previous assembly errors. Recently, Genovese et al. 1 have devised an admixture mapping method that performs high-throughput genome augmentation based on extensive population SNP data from the 1 Genomes Project and applied it to previously unassembled contigs in the human genome. Using our method, we were able to predict the locations of 65 yet unplaced human contigs, largely in agreement with the Genovese et al. predictions and FISH data. Importantly, since our method requires simpler data that is easier to obtain, it is easily extendible to more contigs and to other species. Importantly, we each of the four computational problems that we present can be mapped to a wellstudied problem in the field of machine learning. Scaffold augmentation generally fits problems in the supervised learning framework, where chromosome and locus prediction are akin to classification and regression problems, respectively. De novo scaffolding generally fits problems in the unsupervised learning framework, where karyotyping and chromosome scaffolding are akin to clustering and multidimensional scaling (or manifold learning) problems, respectively. Each of these established problems is supported by an extensive theoretical background, several algorithms for solution, and strategies for evaluating results. Thus, given this framework it should be straightforward to improve the performance of our approach. However, these are by no means the only possible strategies. For example, an alternative approach to de-novo chromosome scaffolding could be first applying de-novo chromosome scaffolding to a subset of large contigs, producing a high confidence, but incomplete, scaffold and then applying scaffold augmentation to place the remaining smaller contigs that have less interaction data and are thus more difficult to place. While we focus in this paper on evaluating DNA triangulation in isolation from other data types, we do not necessarily suggest this should be the case for finalized genome assembly tools. Specifically, our approach is not aimed to be an alternative to the use of long-insert libraries for scaffolding. Rather, we suggest combining the long-range strength of Hi-C interaction data with the short-range accuracy of Nature Biotechnology: doi:1.138/nbt.2768

3 long-insert libraries. Indeed, long-insert library scaffolding will benefit greatly from the accurate karyotyping and mostly complete, but approximate, scaffolds produced by DNA triangulation. Knowing the chromosomal assignment and approximate location of each contig could significantly reduce the computational complexity of long-insert scaffolding, as each contig need only be paired with contigs in its vicinity. This could help resolve ambiguous contig joining, and reduce some of the most critical largescale assembly errors where contigs which are located at distant regions of a chromosome or on different chromosomes, are incorrectly joined. Alternatively, our method can be used as an independent unbiased evaluation of classically assembled scaffolds. In parallel with the many opportunities for further development of the computational methods, there are ways in which the data can be improved as well. While we have restricted ourselves in most scenarios to use only a fraction of available reads in order to demonstrate the robustness of our approach, our results can easily be improved simply by using higher amounts of reads. Additionally, the presence of cell-type or locus-specific structures such as promoter-enhancer loops 2 or chromosome compartments 3,4 and domains 5,6 may limit the accuracy of our approach. One way to address this is by pooling Hi-C data from multiple cell types and multiple conditions (computationally or experimentally), which may strengthen the canonical interaction patterns by smoothing out cell-type specific structures in the data. Another way of eliminating locus-specific structure is to perform the Hi-C experiment on cells in metaphase, where locus-specific chromatin structure is not apparent and only the canonical patterns are retained 7. Standard genome assembly methods have reached a point where additional sequencing depth does little to bridge fragmented genomes. This is unfortunate, due to increasing ease of obtaining massive amounts of sequencing reads. The approach we present here resuscitates the usefulness of additional sequence reads, since its resolution and accuracy are largely a factor of the strength of the interaction patterns, and thus of the number of reads. Our approach builds on data measured in an established, simple experiment that can be performed on any species. The power of our method may be attributed not to the sophistication of the computational tools, which are purposefully simple, but rather to two canonical interaction patterns. The fact that these patterns are strong, consistent across the genome, and ubiquitous in all species, cell types and conditions observed to date, suggests that this method is widely applicable. In fact, it is straightforward to apply our method to existing Hi-C data in additional species, such as mouse, in order to improve on current genome assemblies. Finally, we note that we addressed only two out of several possible applications along similar lines. Applications include targeted assembly (e.g. by using 4C 8,9 or 5C 1 ), detection of assembly errors, resolution of non-unique genomic sequences and detection of chromosomal aberrations. References 1. Genovese, G. et al. Using population admixture to help complete maps of the human genome. Nat. Genet. 45, 46 14, 414e1 2 (213). Nature Biotechnology: doi:1.138/nbt.2768

4 2. Sanyal, A., Lajoie, B., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, (212). 3. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, (29). 4. Yaffe, E. & Tanay, A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, (211). 5. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, (212). 6. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 1 5 (212). 7. Naumova, N. et al. Organization of the Mitotic Chromosome. Science (213). doi:1.1126/science Zhao, Z. et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet. 38, (26). 9. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet. 38, (26). 1. Dostie, J. et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, (26). Nature Biotechnology: doi:1.138/nbt.2768

5 a 1 b 25 Predicted position (scaled) Predicted position (ranks) Chr. 1 genomic position (Mb) Chr. 1 genomic position (ranks) c 1 d 18 Predicted position (scaled) Predicted position (ranks) 2 18 Chr. 5 genomic position (Mb) Chr. 5 genomic position (ranks) Supplemental Figure 1. Accurate de novo chromosome scaffolding with interaction frequencies. We retained every tenth 1-kb contig in the genome, leaving.9-mb gaps between contigs. We then estimated the positions of all contigs, without using any prior knowledge regarding their positions. We arbitrarily scaled the predicted positions to the interval [,1]. Note that the slope, which reflects scaling and orientation, is arbitrary. (a) Scaled predicted contig positions versus actual contig positions on chromosome 1. (b) Ranks of predicted contig positions versus ranks of actual contig positions on chromosome 1. (c) Scaled predicted contig positions versus actual contig positions on chromosome 5. (d) Ranks of predicted contig positions versus ranks of actual contig positions on chromosome 5. Nature Biotechnology: doi:1.138/nbt.2768

6 Contig Chr. Position Band Genovese FISH (primary/secondary) from predict prediction prediction band pred. Genovese et al. chr1_gl191_random chr1 24,35,1 1p p36.11 chr1 chr chr1 123,425,767 1p11.1 1q21.1 1q11 16q11.2 7cen / 2p11.2 3p p11 chr chr1 128,55,638 1q11 1p11.2 1q11 16q21 / 2p12 7q12 chr chr1 128,771,438 1q11 chr chr1 131,22,948 1q12 1q21.1 chr chr1 133,851,195 1q12 chr7_gl195_random chr1 134,63,189 1q12 chr7 / 15p13 14p13 13p13 22p13 21p13 chr chr q12 14q cen 14cen 15cen 2cen 21cen 22cen chr chr1 142,41,328 1q12 chr chr q21.1 chr chr q21.1 chrun_gl225 chr q21.1 chrun_gl224 chr q21.1 chr chr q21.1 chr chr1 143,15,1 1q q p11.2 / 2cen 9p13 14cen 15cen chr1_gl192_random chr1 143,75,1 1q21.1 1q21.1 chr1 chr chr1 4,676,617 1q11.1 chr chr12 132,5,1 12q24.33 chr chr13 13,84,954 13q q34 13q34 chr chr13 15,77,915 13q q34 13q33 chr chr15 2,5,1 15q11.2 chr chr15 2,5,1 15q q11.2 chr chr15 2,25,1 15q q11.2 2cen / 15cen 13cen 22cen chr chr16 33,75,1 16p11.2 chr chr17 21,85,1 17p q p11 chr chr17 22,5,1 17p p11.1 chr chr17 22,228,64 17p p cen / 17p chr chr17 22,25,1 17p p11.2 7cen 22cen 16cen / 1cen 2cen 17cen chr chr17 24,229,586 17q p11.2 chr17_gl24_random chr17 77,447,792 17q q25.3 chr17 chr19_gl28_random chr19 23,98,855 19p12 chr19 / 9q32-9q34.11 chr chr19 24,45,1 19p12 chr chr2 97,85,1 2q11.2 chr chr2 112,35,1 2q13 chrun_gl219 chr2 19,85,1 2p11.23 chr chr2 28,612,693 2q11.1 2p11.1 chr chr2 28,87,556 2q11.1 9cen 13cen 15cen 2cen 21cen 22cen / 9q2 14cen 4q3 Nature Biotechnology: doi:1.138/nbt.2768

7 chr chr2 29,96,239 2q11.1 2p p13 2cen 22cen 14cen / 2q13 chr chr2 29,45,1 2q q q11 / 9qh 13cen 14cen 15cen 21cen 22cen chr chr21 9,45,1 21p11.2 chr chr21 9,45,1 21p11.2 1q42 chr chr21 9,45,1 21p q11.2 chrun_gl216 chr21 9,65,1 21p11.2 chr4_gl194_random chr21 9,85,1 21p11.2 chr4 chrun_gl214 chr21 1,15,1 21p11.2 chr chr21 1,15,1 21p11.2 chrun_gl22 chr21 1,15,1 21p q25 chrun_gl212 chr21 1,65,1 21p q11 13cen 22cen 2p11.2-2p12 chr chr21 11,15,1 21p11.1 1q12 16p13 15q11 / 14cen 13cen chrun_gl218 chr21 11,269,37 21p11.1 chr4_gl193_random chr21 11,97,35 21p q11.2 chr4 / 13p p q q11.2 4q28 9q12 14p p11.2 4q35 9p12 chr17_gl25_random chr21 41,85,1 21q22.2 chr17 / 15p13 14p13 13p13 22p13 21p13 chr chr22 16,85,1 22q11.1 chrun_gl222 chr3 75,95,1 3p12.3 3p12.3 chr chr3 9,45,1 3p11.1 3p11.1 6p12.1 3p14 chrun_gl221 chr3 93,933,811 3q11.1 chr chr5 49,45,1 5q11.1 chr chr6 32,95,1 6p p21.32 chr chr7 61,15,1 7q11.1 7p11 / 2cen 1qh 16qh chr chr7 61,15,1 7q11.1 7q q12 7cen 16q11 / 2cen 9cen 13cen 14cen 15cen 22cen chr9_gl199_random chr9 51,749,339 9q12 9p11.2 chr9 chr chr9 55,217,84 9q12 9p11.2 9p11 chr chr9 66,75,1 9q13 13cen 22cen 9q13 / 14cen 15cen 21cen 1q12 4p12 chrun_gl211 chr9 69,85,1 9q21.11 chr chrx 24,25,1 Xp22.11 Xp22.33 Supplementary Table 1. Predicted positions of unplaced contigs in hg19. Shown are the predicted chromosome, locus and cytoband of each unplaced contig. We also show the cytoband predicted by Genovese et al. and the FISH localization. Green and red indicate agreement and disagreement of predictions, as defined in Methods. Nature Biotechnology: doi:1.138/nbt.2768

8 Chromosome Median error Median rank error Error > 1mb Rank error > 1 Correct neighbor ranks Spearman correlation Breaks E E E E E E E E E E E E E E E E E E E E E E Mean: 2.5E Median: 1.97E Std: 7.48E Min: 1.4E Max: 3.25E Supplemental Table 2. De novo chromosome scaffolding statistics for 1kb virtual contigs with 9kb gaps. Nature Biotechnology: doi:1.138/nbt.2768

9 Chromosome Median error Median rank error Error > 1mb Rank error > 1 Correct neighbor ranks Spearman correlation Breaks E Supplemental Table 3. De novo chromosome scaffolding statistics for real set of contigs from chromosome 14. Nature Biotechnology: doi:1.138/nbt.2768

10 Supplementary Discussion DNA triangulation is a novel approach to genome scaffolding, consisting of an ensemble of methods which share a common underlying theme: the use of high-throughput in vivo genome-wide chromatin interaction data to infer genomic location. We chose to evaluate our approach in extreme large-gap scenarios, in which contigs of 1 kb are surrounded by 1 Mb gaps. Such gaps are unbridgeable using standard long-insert libraries, and would remain so even if the size of long-insert molecules increases by an order of magnitude. Additionally, Hi-C data is exponentially sparser at long ranges, providing a more challenging setting for evaluation. Even so, our approach is able to estimate genomic position with remarkable accuracy. We also show our approach can scaffold an actual dataset of assembled contigs, without the use of a long-insert library. Interestingly, we find that incorrect predictions tend to be associated mostly with centromeric, pericentromeric or telomeric regions, all of which are highly-repetitive and difficult to assemble. However, it is difficult to tell whether this is due to sparse data, mapping difficulties, unique interaction patterns, or previous assembly errors. Recently, Genovese et al. 1 have devised an admixture mapping method that performs high-throughput genome augmentation based on extensive population SNP data from the 1 Genomes Project and applied it to previously unassembled contigs in the human genome. Using our method, we were able to predict the locations of 65 yet unplaced human contigs, largely in agreement with the Genovese et al. predictions and FISH data. Importantly, since our method requires simpler data that is easier to obtain, it is easily extendible to more contigs and to other species. Importantly, we each of the four computational problems that we present can be mapped to a wellstudied problem in the field of machine learning. Scaffold augmentation generally fits problems in the supervised learning framework, where chromosome and locus prediction are akin to classification and regression problems, respectively. De novo scaffolding generally fits problems in the unsupervised learning framework, where karyotyping and chromosome scaffolding are akin to clustering and multidimensional scaling (or manifold learning) problems, respectively. Each of these established problems is supported by an extensive theoretical background, several algorithms for solution, and strategies for evaluating results. Thus, given this framework it should be straightforward to improve the performance of our approach. However, these are by no means the only possible strategies. For example, an alternative approach to de-novo chromosome scaffolding could be first applying de-novo chromosome scaffolding to a subset of large contigs, producing a high confidence, but incomplete, scaffold and then applying scaffold augmentation to place the remaining smaller contigs that have less interaction data and are thus more difficult to place. While we focus in this paper on evaluating DNA triangulation in isolation from other data types, we do not necessarily suggest this should be the case for finalized genome assembly tools. Specifically, our approach is not aimed to be an alternative to the use of long-insert libraries for scaffolding. Rather, we suggest combining the long-range strength of Hi-C interaction data with the short-range accuracy of

11 long-insert libraries. Indeed, long-insert library scaffolding will benefit greatly from the accurate karyotyping and mostly complete, but approximate, scaffolds produced by DNA triangulation. Knowing the chromosomal assignment and approximate location of each contig could significantly reduce the computational complexity of long-insert scaffolding, as each contig need only be paired with contigs in its vicinity. This could help resolve ambiguous contig joining, and reduce some of the most critical largescale assembly errors where contigs which are located at distant regions of a chromosome or on different chromosomes, are incorrectly joined. Alternatively, our method can be used as an independent unbiased evaluation of classically assembled scaffolds. In parallel with the many opportunities for further development of the computational methods, there are ways in which the data can be improved as well. While we have restricted ourselves in most scenarios to use only a fraction of available reads in order to demonstrate the robustness of our approach, our results can easily be improved simply by using higher amounts of reads. Additionally, the presence of cell-type or locus-specific structures such as promoter-enhancer loops 2 or chromosome compartments 3,4 and domains 5,6 may limit the accuracy of our approach. One way to address this is by pooling Hi-C data from multiple cell types and multiple conditions (computationally or experimentally), which may strengthen the canonical interaction patterns by smoothing out cell-type specific structures in the data. Another way of eliminating locus-specific structure is to perform the Hi-C experiment on cells in metaphase, where locus-specific chromatin structure is not apparent and only the canonical patterns are retained 7. Standard genome assembly methods have reached a point where additional sequencing depth does little to bridge fragmented genomes. This is unfortunate, due to increasing ease of obtaining massive amounts of sequencing reads. The approach we present here resuscitates the usefulness of additional sequence reads, since its resolution and accuracy are largely a factor of the strength of the interaction patterns, and thus of the number of reads. Our approach builds on data measured in an established, simple experiment that can be performed on any species. The power of our method may be attributed not to the sophistication of the computational tools, which are purposefully simple, but rather to two canonical interaction patterns. The fact that these patterns are strong, consistent across the genome, and ubiquitous in all species, cell types and conditions observed to date, suggests that this method is widely applicable. In fact, it is straightforward to apply our method to existing Hi-C data in additional species, such as mouse, in order to improve on current genome assemblies. Finally, we note that we addressed only two out of several possible applications along similar lines. Applications include targeted assembly (e.g. by using 4C 8,9 or 5C 1 ), detection of assembly errors, resolution of non-unique genomic sequences and detection of chromosomal aberrations. References 1. Genovese, G. et al. Using population admixture to help complete maps of the human genome. Nat. Genet. 45, 46 14, 414e1 2 (213).

12 2. Sanyal, A., Lajoie, B., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, (212). 3. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, (29). 4. Yaffe, E. & Tanay, A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, (211). 5. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, (212). 6. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 1 5 (212). 7. Naumova, N. et al. Organization of the Mitotic Chromosome. Science (213). doi:1.1126/science Zhao, Z. et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet. 38, (26). 9. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet. 38, (26). 1. Dostie, J. et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, (26).

13 Predicted position (scaled) Predicted position (ranks) Predicted position (scaled) Predicted position (ranks) a 1 b Chr. 1 genomic position (Mb) Chr. 1 genomic position (ranks) c 1 d Chr. 5 genomic position (Mb) Chr. 5 genomic position (ranks) Supplemental Figure 1. Accurate de novo chromosome scaffolding with interaction frequencies. We retained every tenth 1-kb contig in the genome, leaving.9-mb gaps between contigs. We then estimated the positions of all contigs, without using any prior knowledge regarding their positions. We arbitrarily scaled the predicted positions to the interval [,1]. Note that the slope, which reflects scaling and orientation, is arbitrary. (a) Scaled predicted contig positions versus actual contig positions on chromosome 1. (b) Ranks of predicted contig positions versus ranks of actual contig positions on chromosome 1. (c) Scaled predicted contig positions versus actual contig positions on chromosome 5. (d) Ranks of predicted contig positions versus ranks of actual contig positions on chromosome 5.

14 Contig Chr. Position Band Genovese FISH (primary/secondary) from predict prediction prediction band pred. Genovese et al. chr1_gl191_random chr1 24,35,1 1p p36.11 chr1 chr chr1 123,425,767 1p11.1 1q21.1 1q11 16q11.2 7cen / 2p11.2 3p p11 chr chr1 128,55,638 1q11 1p11.2 1q11 16q21 / 2p12 7q12 chr chr1 128,771,438 1q11 chr chr1 131,22,948 1q12 1q21.1 chr chr1 133,851,195 1q12 chr7_gl195_random chr1 134,63,189 1q12 chr7 / 15p13 14p13 13p13 22p13 21p13 chr chr q12 14q cen 14cen 15cen 2cen 21cen 22cen chr chr1 142,41,328 1q12 chr chr q21.1 chr chr q21.1 chrun_gl225 chr q21.1 chrun_gl224 chr q21.1 chr chr q21.1 chr chr1 143,15,1 1q q p11.2 / 2cen 9p13 14cen 15cen chr1_gl192_random chr1 143,75,1 1q21.1 1q21.1 chr1 chr chr1 4,676,617 1q11.1 chr chr12 132,5,1 12q24.33 chr chr13 13,84,954 13q q34 13q34 chr chr13 15,77,915 13q q34 13q33 chr chr15 2,5,1 15q11.2 chr chr15 2,5,1 15q q11.2 chr chr15 2,25,1 15q q11.2 2cen / 15cen 13cen 22cen chr chr16 33,75,1 16p11.2 chr chr17 21,85,1 17p q p11 chr chr17 22,5,1 17p p11.1 chr chr17 22,228,64 17p p cen / 17p chr chr17 22,25,1 17p p11.2 7cen 22cen 16cen / 1cen 2cen 17cen chr chr17 24,229,586 17q p11.2 chr17_gl24_random chr17 77,447,792 17q q25.3 chr17 chr19_gl28_random chr19 23,98,855 19p12 chr19 / 9q32-9q34.11 chr chr19 24,45,1 19p12 chr chr2 97,85,1 2q11.2 chr chr2 112,35,1 2q13 chrun_gl219 chr2 19,85,1 2p11.23 chr chr2 28,612,693 2q11.1 2p11.1 chr chr2 28,87,556 2q11.1 9cen 13cen 15cen 2cen 21cen 22cen / 9q2 14cen 4q3

15 chr chr2 29,96,239 2q11.1 2p p13 2cen 22cen 14cen / 2q13 chr chr2 29,45,1 2q q q11 / 9qh 13cen 14cen 15cen 21cen 22cen chr chr21 9,45,1 21p11.2 chr chr21 9,45,1 21p11.2 1q42 chr chr21 9,45,1 21p q11.2 chrun_gl216 chr21 9,65,1 21p11.2 chr4_gl194_random chr21 9,85,1 21p11.2 chr4 chrun_gl214 chr21 1,15,1 21p11.2 chr chr21 1,15,1 21p11.2 chrun_gl22 chr21 1,15,1 21p q25 chrun_gl212 chr21 1,65,1 21p q11 13cen 22cen 2p11.2-2p12 chr chr21 11,15,1 21p11.1 1q12 16p13 15q11 / 14cen 13cen chrun_gl218 chr21 11,269,37 21p11.1 chr4_gl193_random chr21 11,97,35 21p q11.2 chr4 / 13p p q q11.2 4q28 9q12 14p p11.2 4q35 9p12 chr17_gl25_random chr21 41,85,1 21q22.2 chr17 / 15p13 14p13 13p13 22p13 21p13 chr chr22 16,85,1 22q11.1 chrun_gl222 chr3 75,95,1 3p12.3 3p12.3 chr chr3 9,45,1 3p11.1 3p11.1 6p12.1 3p14 chrun_gl221 chr3 93,933,811 3q11.1 chr chr5 49,45,1 5q11.1 chr chr6 32,95,1 6p p21.32 chr chr7 61,15,1 7q11.1 7p11 / 2cen 1qh 16qh chr chr7 61,15,1 7q11.1 7q q12 7cen 16q11 / 2cen 9cen 13cen 14cen 15cen 22cen chr9_gl199_random chr9 51,749,339 9q12 9p11.2 chr9 chr chr9 55,217,84 9q12 9p11.2 9p11 chr chr9 66,75,1 9q13 13cen 22cen 9q13 / 14cen 15cen 21cen 1q12 4p12 chrun_gl211 chr9 69,85,1 9q21.11 chr chrx 24,25,1 Xp22.11 Xp22.33 Supplementary Table 1. Predicted positions of unplaced contigs in hg19. Shown are the predicted chromosome, locus and cytoband of each unplaced contig. We also show the cytoband predicted by Genovese et al. and the FISH localization. Green and red indicate agreement and disagreement of predictions, as defined in Methods.

16 Chromosome Median error Median rank error Error > 1mb Rank error > 1 Correct neighbor ranks Spearman correlation Breaks E E E E E E E E E E E E E E E E E E E E E E Mean: 2.5E Median: 1.97E Std: 7.48E Min: 1.4E Max: 3.25E Supplemental Table 2. De novo chromosome scaffolding statistics for 1kb virtual contigs with 9kb gaps.

17 Chromosome Median error Median rank error Error > 1mb Rank error > 1 Correct neighbor ranks Spearman correlation Breaks E Supplemental Table 3. De novo chromosome scaffolding statistics for real set of contigs from chromosome 14.