Supplementary Information for The ratio of human X chromosome to autosome diversity

Supplementary Information for "The ratio of human X chromosome to autosome diversity is positively correlated with genetic distance from genes"

Michael F. Hammer, August E. Woerner, Fernando L. Mendez, Joseph C. Watkins, Murray P. Cox, Jeffrey D. Wall

Supplementary Table 1. Summaries of nucleotide diversity a and divergence in the expanded resequencing dataset.

Columns: Population; Sample Size b; Segregating Sites; π (%); D (%) c; π/D (%).
Rows (given separately for the autosomes and for the X chromosome): Mandenka, Biaka, San, Han, Basque, Oceanians.

a Mean diversity for 61 autosomal and 30 X-linked loci.
b Mean number of alleles sequenced per locus per population.
c D = human-orangutan sequence divergence (Orangutan Genome Project).

Note: Aligning human sequences with the recently available orangutan genome yielded slightly higher levels of divergence than those found in the study of Hammer et al. 1, especially for the X chromosome (4.8%). This difference may result from the fact that Hammer et al. 1 used primers designed from human sequences to PCR-amplify orthologous regions from orangutan.

1. Hammer, M.F., Mendez, F.L., Cox, M.P., Woerner, A.E. & Wall, J.D. Sex-biased evolutionary forces shape genomic patterns of human diversity. PLoS Genet 4, e1000202 (2008).
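The quantities summarized in Supplementary Table 1 can be illustrated with a minimal Python sketch. It computes π as the mean number of pairwise differences per site and D as the proportion of sites that differ from the outgroup. The toy sequences and function names are hypothetical, and the sketch ignores missing data and multiple-hit corrections; the study itself computed π with the libsequence C++ library and divergence with an in-house script.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site among aligned sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


def divergence(ingroup_seq, outgroup_seq):
    """Proportion of aligned sites that differ between one ingroup sequence and the outgroup."""
    mismatches = sum(a != b for a, b in zip(ingroup_seq, outgroup_seq))
    return mismatches / len(ingroup_seq)


# Hypothetical toy alignment (not data from the study).
humans = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
orangutan = "ACGTTCGTGC"

pi = nucleotide_diversity(humans)      # 4 differences over 3 pairs x 10 sites
d = divergence(humans[0], orangutan)   # 2 differences over 10 sites
pi_over_d = pi / d                     # diversity normalized by divergence
```

Dividing π by D is what allows regions with different mutation rates to be compared on a common scale.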

Supplementary Table 2. Coverage a (Mb) for four bins at increasing distances from a human gene b.

Columns: distance from nearest gene (cM); then autosomal (A) and X-linked (X) coverage for each outgroup: Marmoset, Rhesus, Orangutan, Gorilla, Chimpanzee. The final row gives the total length.

a A base counts as covered if it is present in the outgroup and aligned with a homologous base in at least one ingroup.
b See supplementary material for gene definitions.

Note: Repeated and ultraconserved regions were discarded from alignments.
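The coverage definition in footnote a can be sketched in a few lines of Python. This is only an illustration: the function name, the toy alignment, and the choice of `N` and `-` as the symbols for missing or unaligned bases are assumptions, not details taken from the study.

```python
# Symbols treated as "no base present" (an assumption for this sketch).
MISSING = {"N", "-"}


def covered_sites(outgroup, ingroups):
    """Count alignment columns with an outgroup base and at least one ingroup base."""
    count = 0
    for i, base in enumerate(outgroup):
        if base in MISSING:
            continue  # outgroup absent: column is not covered
        if any(seq[i] not in MISSING for seq in ingroups):
            count += 1
    return count


# Hypothetical toy alignment: one outgroup and two ingroup sequences.
outgroup = "ACGT-ACN"
ingroups = ["ACNNTACG", "A-NNTACG"]
n_covered = covered_sites(outgroup, ingroups)
```

In this toy alignment, columns 0, 1, 5 and 6 are covered; the others lack either the outgroup base or every ingroup base.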

Supplementary Table 3. Summary of X chromosome/autosomal data for European population samples a in three studies.

Columns: Hammer et al. 1; This study; Public Data (all bins); Public Data (Bin 4); each split into autosomes (A) and X chromosome (X).
Rows: n samples; n loci; n bases (Mb); S; π; D (H-O); π/D; N X /N aut.

a Hammer et al. 1 and this study: French Basque; Public data: see Supplementary Materials.

1. Hammer, M.F., Mendez, F.L., Cox, M.P., Woerner, A.E. & Wall, J.D. Sex-biased evolutionary forces shape genomic patterns of human diversity. PLoS Genet 4, e1000202 (2008).

Supplementary Table 4. X chromosomal/autosomal ratios of π/D.

Columns: bin a; Marmoset; Rhesus; Orangutan; Gorilla; Chimp.

a Distance from nearest gene (cM).

Supplementary Figure 1. Point estimates and 95% CIs for N X /N aut ratios for the 91-locus dataset. The tick represents the point estimate, while the vertical bar shows the estimated 95% confidence interval (see Hammer et al. 1 for methods). The dotted line represents the expected ratio (0.75) under a neutral model with a breeding sex ratio of 1. The mean value of ~0.9 is consistent with a 2-3-fold excess of breeding females. Three-letter population codes are as follows: Mandenka (Man), Biaka (Bia), San (San), French Basque (Bas), Han Chinese (Han), Oceanians (Oce).

1. Hammer, M.F., Mendez, F.L., Cox, M.P., Woerner, A.E. & Wall, J.D. Sex-biased evolutionary forces shape genomic patterns of human diversity. PLoS Genet 4, e1000202 (2008).
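The link between the breeding sex ratio and the expected N X /N aut ratio can be made concrete with a short Python sketch. It uses the standard neutral expressions for effective population size with separate sexes, N aut = 4 N_m N_f / (N_m + N_f) and N X = 9 N_m N_f / (4 N_m + 2 N_f). The function name is hypothetical, and the sketch assumes that the breeding sex ratio is the only factor shifting the ratio; it is not the estimation procedure of Hammer et al. 1.

```python
def x_to_autosome_ne_ratio(females_per_male):
    """Expected N_X / N_aut when there are r breeding females per breeding male.

    With N_f = r * N_m:
        N_aut = 4 * N_m * N_f / (N_m + N_f)
        N_X   = 9 * N_m * N_f / (4 * N_m + 2 * N_f)
    so N_X / N_aut = 9 * (1 + r) / (8 * (2 + r)).
    """
    r = females_per_male
    return 9 * (1 + r) / (8 * (2 + r))


equal_sexes = x_to_autosome_ne_ratio(1)    # 18/24 = 0.75
twofold = x_to_autosome_ne_ratio(2)        # 27/32 = 0.84375
threefold = x_to_autosome_ne_ratio(3)      # 36/40 = 0.9
```

Under these assumptions, an equal breeding sex ratio gives 0.75, and a two- to three-fold excess of breeding females gives roughly 0.84 to 0.9.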

Supplementary Figure 2. Strategies for sampling genomes. The binned approach (top panel) places each neutral region into several bins (0.1 cM) that are defined by their distance to the nearest gene (f 1, f 2, f 3, etc.). The continuous approach (bottom panel) takes each non-genic interval in the genome, finds the medial 0.1 cM subsection (as defined by the genetic map), and defines that as the neutral region.

[Figure: two schematic panels, "Binned" and "Continuous", showing neutral regions between flanking genes, with π/D measured at x cM from the nearest gene.]
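The two sampling strategies can be sketched in Python as follows. The sketch assumes that each non-genic interval is described by the genetic-map positions (in cM) of its two flanking gene boundaries; the function names and example coordinates are hypothetical, and the clipping of the medial window for intervals shorter than 0.1 cM is an assumption rather than a documented rule.

```python
import math

BIN_WIDTH_CM = 0.1  # bin size stated in the caption


def distance_bin(pos_cm, left_gene_cm, right_gene_cm, width=BIN_WIDTH_CM):
    """Binned approach: index of the bin holding a site, by genetic distance to the nearest gene."""
    dist = min(pos_cm - left_gene_cm, right_gene_cm - pos_cm)
    return math.floor(dist / width)


def medial_window(left_gene_cm, right_gene_cm, width=BIN_WIDTH_CM):
    """Continuous approach: the central window of the interval on the genetic map.

    Intervals shorter than `width` are returned whole (an assumption of this sketch).
    """
    mid = (left_gene_cm + right_gene_cm) / 2
    half = min(width, right_gene_cm - left_gene_cm) / 2
    return mid - half, mid + half


# Hypothetical interval between genes ending at 10.0 cM and starting at 11.0 cM.
bin_near_left = distance_bin(10.25, 10.0, 11.0)    # 0.25 cM from the left gene
bin_near_right = distance_bin(10.95, 10.0, 11.0)   # about 0.05 cM from the right gene
window = medial_window(10.0, 11.0)                 # centred on 10.5 cM
```

Sites that fall exactly on a bin boundary are sensitive to floating-point rounding in this sketch, so a production version would need an explicit boundary convention.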

Supplementary Figure 3. Diversity (π/D) on the X chromosome and the autosomes as a function of physical distance from genes (as per 2). The values shown are means ± standard errors of the mean. Note the different scales shown here and in Fig 1 (e.g., if 1 cM = 1 Mb, then the bar labeled > 100 kb here corresponds to the three right bars in Fig 1).

[Figure: π/D (y-axis) versus physical distance from nearest gene in kb (x-axis), plotted separately for autosomes and the X chromosome.]

2. Keinan, A., Mullikin, J.C., Patterson, N. & Reich, D. Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat Genet 41, 66-70 (2009).

Supplementary Figure 4. Percentage of genome sequence in each bin (binned approach) (see Table S2).

[Figure: % total sequence (y-axis) versus cM from the nearest gene (x-axis), plotted separately for autosomes and the X chromosome.]

Supplementary Methods

Genomes Sampled. Six human genomes, broadly characterizable as being of European descent, were sampled for these analyses: Venter 1, Watson 2, three CEU samples (NA12878, NA and NA07022), and one Personal Genome Project sample (NA20431) 4. We reconstructed diploid chromosome files by using the hg18 (2006) human genome as a template, overlaying SNPs on top of this template, and then masking out the regions where coverage is missing. Watson's and Venter's genomes were constructed using SNP files from the Genome Variants track of the UCSC genome database and coverage estimates from the 1000 Genomes browser. The first two CEU genomes were constructed using the SNPs ascertained in Kidd et al. 3, remapped to the hg18 human genome using liftOver. Coverage for these CEU samples was estimated from the qualityaligns files 3, where a base was considered covered if it and its immediate neighbors had a Phred quality of at least 30. Coverage and SNP information for the genomes found in Drmanac et al. 4 were constructed from the variation files provided for NA07022 and NA20431. To avoid missed heterozygous sites, which would reduce the apparent diversity of the autosomes more than that of the X chromosome, each reconstructed diploid chromosome was then subsampled to a single haploid chromosome by randomly picking one of the two alleles from the SNP files. Multiple outgroup sequences, including an orangutan, were obtained from the 44-way vertebrate alignments 5,6. Unless otherwise stated, the orangutan was used as the outgroup in all analyses.

Regions Sampled. We took the union of both relatively conservative (UCSC Genes) and lax (Gene Bounds and Spliced ESTs) gene predictions 7,8 from the UCSC genome browser database 9 for the hg18 (2006) human genome to produce an inclusive definition of putatively functional genomic regions. Recombination rate estimates were taken from HapMap Phase II 10, and the recombination rates for the X chromosome were further scaled by 2/3 as per Payseur and Nachman 11. By coupling the complement of the putatively functional regions with the fine-scale recombination rate estimates from HapMap, we were able to associate each non-genic region with its genetic distance to the nearest gene. Each non-genic interval of the genome is represented in this analysis in two ways. The binned approach places each neutral region into several bins that are defined by their distance to the nearest gene (see Figure S2). The bin size used was 0.1 cM. The second, continuous approach takes each non-genic interval in the genome, finds the medial 0.1 cM subsection of that interval (medial as defined by the genetic map), and defines that as the neutral region (see Figure S2).

To be conservative, we use several inclusive definitions of genes and attempt to control for the effects that conserved non-genic sequence, simple repeats, and duplications may have on patterns of nucleotide variability. Thus, each non-genic region was further filtered by removing simple repeats 12; duplicated regions, as defined by the segmental duplications and the self chain tracks 13,14; and conserved non-genic sequence, as defined by the 28-way vertebrate alignment most conserved track 6. After filtering, regions with less than 1 kb of coverage (where a base is considered covered if it is defined both in the outgroup and in at least one of the ingroups) were excluded from the analysis. From each non-genic region we then computed π, using the libsequence C++ library 15, and divergence, using an in-house script.

Calculations. We ran two-tailed Mann-Whitney U tests to determine whether samples of values were drawn from the same parental distribution. For the local regression analysis, we first determined a line of best fit of π/D versus genetic distance from genes for both the autosomal data and the X chromosome data. Because the residual standard deviation depends on genetic distance, we use an iterated weighted least squares regression approach. To begin the iteration, a line is determined using ordinary least squares. To prepare for the weighted regression line in iteration k+1, the weights are determined from local regression 16 of the squares of the residuals from iteration k. This is continued until the sum of the squares of the difference between the weights from two consecutive iterations, divided by the squares of the weights from the latest iteration, is less than 10^-6. This procedure is the basis for a test of the slopes β_aut and β_X of these two lines:

H_0: β_X ≤ (3/4) β_aut versus H_1: β_X > (3/4) β_aut

Code to perform the iteration was written in R, which gives an output value for the F-statistic. Given 1 degree of freedom in the numerator, we take the square root of the F-statistic to get the t-statistic.
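The iterated weighted least squares procedure described under Calculations can be sketched in Python as below. This is not the authors' R code. As a simplification, the local regression of squared residuals is replaced by a tricube-kernel local mean, and the stopping rule is read as sum((w_new - w_old)^2) / sum(w_new^2) < 10^-6. The function names, the `span` parameter, the variance floor, and the simulated data are all assumptions made for illustration.

```python
import random


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def smooth_variance(x, sq_resid, span=0.5):
    """Tricube-kernel local mean of squared residuals (stand-in for local regression)."""
    k = max(2, int(span * len(x)))
    smoothed = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1] or 1e-12  # bandwidth: k-th nearest neighbour distance
        kern = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        smoothed.append(sum(kw * r for kw, r in zip(kern, sq_resid)) / sum(kern))
    return smoothed


def iterated_wls(x, y, tol=1e-6, max_iter=100):
    """Start from OLS, then reweight by the inverse smoothed squared residuals until stable."""
    w = [1.0] * len(x)
    a, b = weighted_line(x, y, w)  # equal weights give ordinary least squares
    for _ in range(max_iter):
        sq_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
        var = smooth_variance(x, sq_resid)
        w_new = [1.0 / max(v, 1e-12) for v in var]  # floor avoids division by zero
        change = sum((wn - wo) ** 2 for wn, wo in zip(w_new, w)) / sum(wn ** 2 for wn in w_new)
        w = w_new
        a, b = weighted_line(x, y, w)
        if change < tol:
            break
    return a, b, w


# Simulated example: true line y = 1 + 2x with noise that grows with x.
random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [1.0 + 2.0 * xi + random.gauss(0.0, 0.05 + 0.3 * xi) for xi in xs]
intercept, slope, weights = iterated_wls(xs, ys)
```

Fitting one such line to the autosomal data and one to the X chromosome data gives the two slopes compared in the hypothesis test; the test statistic itself is not reproduced here.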

References for Supplementary Methods

1. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007).
2. Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452 (2008).
3. Kidd, J.M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453 (2008).
4. Drmanac, R. et al. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science (2009).
5. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14 (2004).
6. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15 (2005).
7. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. & Wheeler, D.L. GenBank: update. Nucleic Acids Res 32, D23-6 (2004).
8. Hsu, F. et al. The UCSC Known Genes. Bioinformatics 22 (2006).
9. Karolchik, D. et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 36, D773-9 (2008).
10. International HapMap Consortium. A haplotype map of the human genome. Nature 437 (2005).
11. Payseur, B.A. & Nachman, M.W. Gene density and human nucleotide polymorphism. Mol Biol Evol 19 (2002).
12. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27 (1999).
13. Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J. & Eichler, E.E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11 (2001).
14. Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100 (2003).
15. Thornton, K. Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19 (2003).
16. Cleveland, W., Grosse, E. & Shyu, W. Local regression models. in Statistical Models (eds. Chambers, J. & Hastie, T.) (CRC Press, Boca Raton, 1992).