SUPPLEMENTARY INFORMATION

Size: px

Start display at page:

Download "SUPPLEMENTARY INFORMATION"

Ezra Nicholson
6 years ago
Views:

1 doi: /nature17405 Supplementary Information 1 Determining a suitable lower size-cutoff for sequence alignments to the nuclear genome Analyses of nuclear DNA sequences from archaic genomes have until now been restricted to fragments of length 35bp or longer. However, to maximize the yield of DNA fragments we have used fragments as short as 30bp when reconstructing mtdna genomes from the SH specimens. Lowering the size-cutoff for the analysis of the nuclear genome of the SH specimens increases the number of informative sequences available for analysis, but also increases the risk of including incorrectly aligned microbial sequences that have spurious similarity to the much larger nuclear genome. The problem of spurious alignments becomes apparent when plotting the percentage of sequences that map to the human genome as a function of their length (Supplementary Figure 1). With the alignment software and parameters chosen here and in previous studies 1,2 (BWA 3 with options -n 0.01 o 2 l ), we observe a steep increase in the success of mapping for sequences shorter than 35bp. However, this is driven largely by alignments that include mismatches. The fraction of alignments without mismatches only increases at a much shorter length of around 25bps. Since neither evolutionary differences (approximately 1 in 700bp 1 ), nor deamination, nor sequencing error would reduce the fraction of perfectly aligning sequences in a manner that is dependent on sequence length to the extent observed here, we deem spurious alignments the best explanation for the difference between perfect matches and alignments allowing mismatches. To quantify the proportion of spurious alignments with size cutoffs of 30 and 35bp we identified 11,299 sites where the human reference genome differs from all present-day humans in the 1000 genomes project phase I data 4 as well as from the chimpanzee genome (pantro2) 5. We expect that nearly 100% of endogenous hominin sequences overlapping these sites, which represent errors in the human reference genome sequence or rare polymorphisms observed with ~0.1% frequency in presentday humans, will carry the state of other humans and the chimpanzee (i.e. the ancestral state). In contrast, less than ~3% of evolutionarily unrelated microbial sequences are expected to do so due to random similarity to the human reference genome (Supplementary Figure 2). By counting how often DNA fragments overlapping these positions share the derived state we obtained an approximate estimate of the proportion of spurious alignments. To minimize the impact of cytosine deamination in this analysis, we disregarded alignments if either the ancestral or the derived state at a given position was C in the orientation sequenced, or G if the complementary strand was sequenced. Using this 1

2 approach (Supplementary Figure 2), we estimate that at a fragment length cut-off of 30bp between 9% and 67% of all aligned DNA fragments from the five specimens are likely of microbial origin. When we increase the fragment length cut-off to 35bp we find no evidence for spurious microbial alignments although the numbers are admittedly small (Supplementary Table 1). Supplementary Table 1: The proportion of sequences aligned randomly to the human genome, estimated as 1 minus the frequency of sharing the ancestral state, using lower size cut-offs of 30 and 35bp. 95% binomial confidence intervals are provided in brackets. All sequences used in this analysis aligned to the genome with a map quality score

3 Supplementary Figure 1: Percentage of sequences from SH femur XIII that map to the human reference genome when mismatches are allowed (solid line) or perfect matches required (dashed line) as a function of their size. The alignment parameters chosen here decrease the number of allowed mismatch by one for reads shorter than 22bp, leading to the observed dip in the percentage of mapped sequences. 3

4 Supplementary Figure 2: Probability of sharing rare variants with the human reference genome. These variants (green) are identified by requiring that all of the present-day human genomes sequenced in phase I of the 1000 genomes project as well as the chimpanzee differ from the reference genome, i.e. share the ancestral state (red). With the parameters chosen for mapping, the number of mismatches between a DNA fragment and the reference is limited to 3 for fragments 41bp, to 4 for fragments >=42bp, and to 5 for fragments >=64bp, etc., i.e. less than 10% of the bases are allowed differ from the reference genome in any given alignment. Given that differences to the reference genome can be due to three different bases, less than 3.3% of spuriously aligned microbial DNA fragments are expected to share the ancestral state at these positions. In contrast, nearly all (>99.9%) hominin sequences are expected to be ancestral at these positions. 4

5 Supplementary Information 2 Using conditional substitution frequencies for detecting and quantifying ancient DNA in mixtures with recent contamination Library-based sample preparation techniques in combination with high-throughput sequencing have revealed that authentic ancient DNA sequences show a distinct elevation of C to T substitutions close to their ends (or G to A substitutions depending on the method used for library preparation) 6. These substitutions, which arise due to cytosine deamination, accumulate predominantly at molecule ends and are much less frequent in contaminant DNA that was introduced during or after excavation of fossils 7. Elevated frequencies of damage-induced substitutions can thus be used to provide evidence for the presence of endogenous ancient DNA. Although there is no strict linear relationship between the strength of deamination signals and the age of specimens 8, nearly all samples that are thousands of years in age have been reported to show terminal C to T substitution frequencies exceeding 10% (see for example ref. 9 and 10). Some samples in which authentic ancient DNA may still be present, including some of the SH specimens described here, exhibit lower frequencies of C to T substitutions, not compatible with the presumed age of the sequences due to an overwhelming background of presentday contamination that dilutes the deamination signal. To increase our ability to detect ancient DNA in highly contaminated samples based on patterns of DNA damage, we have recently introduced the concept of conditional substitution frequencies in the analysis of mtdna sequences of SH femur XIII 11. Conditional substitution frequencies are obtained by isolating sequences that show a C to T difference to the reference at their 5 ends and determining the frequency of C to T substitutions at their 3 ends, and vice versa. In the case of femur XIII, C to T substitution frequencies increased from 12 and 17% at the 5 and 3 ends of all sequences to 55 and 62% in the population of sequences with a C to T substitution at the other end. The latter numbers are close to the deamination signals detected in a similar-age cave bear from the same site, suggesting that they might be a good proxy for the deamination frequencies of the fraction of endogenous DNA fragments in the specimen. However, it is also conceivable that decay processes leading to deamination at one end of a molecule may increase the likelihood of deamination at the other end, thereby reducing the power to detect and quantify mixtures of ancient and contaminant DNA based on conditional substitution frequencies. 5

6 To test whether a population of highly deaminated molecules may also be present among contaminant DNA fragments, we analyzed published mtdna sequences from two ancient hominin specimens, SH femur XIII 11 and a Late Pleistocene Neanderthal (Mezmaiskaya 1) 12, that show high levels of modern human contamination. Using diagnostic positions that differentiate the mtdna genomes of 311 present-day humans from the two archaic individuals (132 positions for femur XIII and 41 for Mezmaiskaya 1), we separated contaminant and endogenous sequences based on their sharing of either the human-derived or the ancestral state. To minimize erroneous assignments due to deamination we disregarded sequences that overlap a diagnostic site with their first or last 3 positions. We then determined the frequencies of C to T substitutions in both sets of sequences, both with and without conditioning on the presence of a C to T substitution on the other end (Supplementary Table 2). As expected for mixtures of ancient and contaminant sequences, conditional substitution frequencies increase compared to regular substitution frequencies prior to separation of the sequences. In the subset of contaminant sequences, conditional substitution frequencies increase slightly in femur XIII but remain well below 10%, providing little evidence for the presence of a highly deaminated population of DNA fragments among the contaminant DNA in the two specimens. To test whether deamination occurs independently on both ends of molecules we determined regular and conditional substitution frequencies using larger data sets of nuclear sequences from six archaic individuals 1,2,13 that have been estimated to contain negligible levels of modern human contamination (1% or less). Regular and conditional substitution frequencies are expected to be equal if deamination occurs independently at either end of the molecules. However, conditional substitution frequencies are between 10% lower and 14% higher than regular substitution frequencies, indicating that deamination frequencies at the ends are not entirely independent (Supplementary Table 3). A reduction in conditional substitution frequencies in same samples may be explained by mapping bias, i.e. the difficulty of mapping sequences with 2 terminal C to T substitutions and possibly other differences to the reference genome. Comparisons of regular and conditional substitution frequencies therefore do not allow for accurate estimation of modern human contamination, but they can be used to detect the presence of small amounts of ancient sequences in a background of human contamination, and provide rough estimates of contamination in heavily contaminated samples. When applied to the nuclear DNA sequences from the Sima de los Huesos specimens, we estimate that modern human DNA contamination is present in at least a 2.7-fold excess in all libraries, contributing 63% or more of all sequences (Supplementary Table 3). To minimize the impact of contamination on our analyses, we therefore restricted down-stream analyses to only those fragments showing evidence of deamination. 6

7 Supplementary Table 2: C to T substitution frequencies and conditional substitution frequencies in mtdna sequences from two archaic individuals. Diagnostic sites that differentiate the respective archaic individual from 311 present-day humans were used to identify subsets of sequences that are putatively of endogenous origin or from recent human contamination. The percentage of sequences identified as contamination from diagnostic sites is provided in the second column. Specimen SH femur XIII 714,601 sequences Human contamination based on diagnostic Population of sites [%] 94.9 All sequences 6.2 (9,406/151,703) C to T substitution frequency [%] (observations) Conditional sequences 5 end 3 end 5 end 3 end (10,099/112,213) (1081/2,335) Human contamination 1.6 (532/33,269) 2.3 (539/23,453) 3.8 (5/132) 54.7 (1,082/1,978) 5.2 (5/96) Endogenous 59.8 (1,354/2,264) 54.2 (1,004/1,852) 56.5 (152/269) 67.0 (152/227) Mezmaiskaya 1 (library B9687) 43.3% All sequences 8.1 (512/6,320) 13.7 (723/5,277) 15.5 (28/181) 25.9 (28/108) 28,955 sequences Human contamination 0.9 (5/572) 0.7 (4/550) 0.0 (0/1) 0.0 (0/1) Endogenous 13.5 (93/690) 18.3 (104/568) 15.6 (5/32) 16.7 (5/30) 7

8 Supplementary Table 3: Damage-induced substitution frequencies and conditional substitution frequencies in 6 archaic genomes and the SH libraries sequenced in the present study. Substitution frequencies at the 3 ends of sequences in the Vindija and Mezmaiskaya samples correspond to G to A substitutions, as the respective libraries were generated using a double-stranded method. Rough contamination estimates based on the comparison of the two substitution frequencies are provided in the final column. Specimen C to T (or G to A) substitution frequency [%] Conditional 5 end 3 end 5 end 3 end Conditional / regular substitution frequencies Contamination estimate based on deamination Altai Neanderthal (chromosome 21 only) Denisova manual phalanx (chromosome 21 only) Vindija (all sequences) Vindija (all sequences) Vindija (all sequences) Mezmaiskaya 1 (all sequences) SH Femur XIII (AT2944) SH Incisor (AT-5482) SH Femur frag. (AT-5431) SH Molar (AT-5444) SH Scapula (AT-6672)

9 Supplementary Information 3 Estimating present-day human contamination based on the sharing of alleles that have risen to high frequency in modern humans since the split from the archaic hominins The use of diagnostic sites, i.e. sites that show fixed differences between the hominin group from which the sequence comes and the potential contaminant group, is well established for ancient mtdna analysis. Here, we extend this strategy to the autosomes. We identified 28,552 diagnostic positions where all present-day human genomes sequenced in phase I of the 1000 genomes project differ from the high-coverage genome sequences of the Denisovan individual, the Altai Neanderthal and the chimpanzee, bonobo, gorilla, orangutan and rhesus macaque. The states of the outgroups were retrieved from the Ensembl Compara EPO 6 primate whole genome alignments 14,15. All sites were required to be located within unique regions of the human genome as defined by the map35_50% criteria described in Prüfer et al (Supplementary Section 5b) 2. Sequences from the SH specimens showing the derived, i.e. the human state at such sites are either due to present-day human contamination, variants segregating in the archaic populations that are not present in any of the four archaic chromosomes used in this analysis, or sequencing error. The percentage of sequences sharing the derived state can therefore be used to estimate an upper limit on the present-day human contamination present in the libraries. To test the suitability of this approach we first performed the analysis using low coverage sequence data from four Neanderthal individuals that were previously determined to contain less than 1% human contamination 2,13. To reduce errors due to cytosine-deamination, we masked out thymines in the first three positions and adenines in the last three to account for the occurrence of damage-derived C to T substitutions at the 5 end and G to A substitutions at the 3 end. Based on the percentage of sequences sharing the derived state our contamination estimate for these samples is between 6.3 and 8.6% (Supplementary Table 4). Because this approach measures not only contamination, but also unknown variation in the archaic hominins and sequencing error, these estimates are higher than the actual contamination estimates but allow us to exclude contamination above the 10% level. We thus performed the same analysis for the SH sequences, this time masking out only thymines within the first and last three sequence positions as single-stranded library preparation fully retains the strand information of the sequenced molecules and does not induce artificial G to A substitutions. Estimates of contamination are high in all SH libraries (78% or more), but drop to 20% for incisor AT-5482 and 0% for 9

10 both femur AT-5431 and molar AT-5444 after filtering for putatively deaminated sequences by requiring a C to T substitution at the 5 or 3 terminus. Since the confidence intervals of our estimates are extremely wide due to the limited data, we tried to further increase the power of the analysis by considering sites (totaling 128,004) where the derived allele has risen to 90% or more frequency in modern humans. While providing more power, this analysis may slightly under-estimate contamination as some contaminant sequences may share the archaic state. Overall, contamination estimates are similar to the first estimates but with narrower confidence intervals (Supplementary Table 4). 10

11 Supplementary Table 4: Nuclear contamination estimates based on sites that are derived in all, or more than 90%, of present-day humans and ancestral in two Denisovan and two Neanderthal chromosomes. 95% binomial confidence intervals are provided in brackets. Specimen Femur XIII Library A2021 Incisor AT-5482 Femur fragment AT-5431 Molar AT-5444 Scapula AT-6672 Filtered for sequences with terminal C to T substitution Derived allele frequency in present-day humans #sites #presentday human state Present-day human contamination [%] (95% C.I.) 81.1 ( ) ( ) 83.9 ( ) 20.0 ( ) 68.1 ( ) 0.0 ( ) 76.7 ( ) 0.0 ( ) 92.7 ( ) #sites #present-day human state ,811 1, N/A 1 1 Vindija , Vindija , Vindija , Mezmaiskaya 1-13,068 1, ( ) 6.0 ( ) 6.3 ( ) 8.6 ( ) 36,470 1,931 38,037 1,676 33,972 1,528 59,783 4,100 Present-day human contamination [%] (95% C.I.) 87.5 ( ) 62.5 ( ) 85.7 ( ) 20.8 ( ) 76.1 ( ) 18.2 ( ) 88.0 ( ) 0.0 ( ) 97.6 ( ) ( ) 5.3 ( ) 4.4 ( ) 4.5 ( ) 6.9 ( ) 11

12 References 1. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, (2012). 2. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43-9 (2014). 3. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, (2009) Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, (2010). 5. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, (2005). 6. Briggs, A.W. et al. Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci U S A 104, (2007). 7. Krause, J. et al. A complete mtdna genome of an early modern human from Kostenki, Russia. Curr Biol 20, (2010). 8. Sawyer, S., Krause, J., Guschanski, K., Savolainen, V. & Paabo, S. Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA. PLoS One 7, e34131 (2012). 9. Gamba, C. et al. Genome flux and stasis in a five millennium transect of European prehistory. Nat Commun 5, 5257 (2014). 10. Allentoft, M.E. et al. Population genomics of Bronze Age Eurasia. Nature 522, (2015). 11. Meyer, M. et al. A mitochondrial genome sequence of a hominin from Sima de los Huesos. Nature 505, (2014). 12. Gansauge, M.T. & Meyer, M. Selective enrichment of damaged DNA molecules for ancient genome sequencing. Genome Res 24, (2014). 13. Green, R.E. et al. A draft sequence of the Neandertal genome. Science 328, (2010). 14. Paten, B., Herrero, J., Beal, K., Fitzgerald, S. & Birney, E. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 18, (2008). 15. Paten, B. et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18, (2008). 12

An early modern human from Romania with a recent Neanderthal ancestor

An early modern human from Romania with a recent Neanderthal ancestor The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation