Comparison of the levels of diversity between coldspots (CS) and highly recombining regions (HRRs) for SNPs in the FCQ data set.


Supplementary Figure 1 Comparison of the levels of diversity between coldspots (CS) and highly recombining regions (HRRs) for SNPs in the FCQ data set. Odds ratios (ORs) are computed to compare SNP density between coldspots and HRRs for all SNPs (red) and SNPs divided into different allele frequency classes (black). OR < 1 means that diversity is greater in HRRs than in coldspots. We confirm the lack of diversity in coldspots relative to HRRs, in line with previous evidence that diversity is reduced in regions with low recombination rates owing to background selection. The effect is seen for all frequency classes and does not differ significantly between classes of SNPs with MAF > The class of variants with MAF < 0.05 shows a smaller effect than the other frequency classes.

Supplementary Figure 2 Differential mutational burden between coldspots (CS) and highly recombining regions (HRRs) in a genomic subset of the data. Differential burden is computed using odds ratios (ORs), representing the relative enrichment of a category of variants compared to all variants in coldspots versus HRRs for (a) RNA and exome sequencing of French Canadians (FC) and for exome sequencing of (b) Europeans (EUR), (c) Asians (ASN) and (d) Africans (AFR) from the 1000 Genomes Project. Variants are categorized as rare (MAF < 0.01 in a population), nonsynonymous (missense and nonsense) and damaging (as predicted by both SIFT and PolyPhen-2). Highly covered exons (HC exons) have coverage above 20 for each position within the exons in all data sets. The set of exons analyzed does not affect the results, and the exome data set in French Canadians (FC) replicates the results found in RNA sequencing.
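The odds ratio described in this caption can be sketched as follows. This is only an illustration: the 2x2 layout (category versus all other variants, coldspots versus HRRs) and the Wald confidence interval are assumptions, since the caption does not say how the intervals were obtained. The function name `odds_ratio` is hypothetical, and the example counts are the FCQ HC-exon values as transcribed in Supplementary Table 1.

```python
import math


def odds_ratio(cs_in, cs_total, hrr_in, hrr_total, z=1.96):
    """Odds ratio of a variant category in coldspots vs. HRRs.

    Assumed 2x2 table: variants in the category vs. all other variants,
    in coldspots vs. HRRs. Returns (OR, lower, upper), where the bounds
    form a Wald interval on the log scale (z=1.96 gives about 95%).
    """
    a = cs_in                  # category variants in coldspots
    b = cs_total - cs_in       # other variants in coldspots
    c = hrr_in                 # category variants in HRRs
    d = hrr_total - hrr_in     # other variants in HRRs
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Nonsynonymous SNPs among all SNPs in HC exons (FCQ), counts taken from
# Supplementary Table 1 as transcribed; used here for illustration only.
or_ns, lo_ns, hi_ns = odds_ratio(14984, 30734, 6488, 14899)
print(f"OR = {or_ns:.2f} [{lo_ns:.2f};{hi_ns:.2f}]")
```

With this layout, an OR above 1 indicates that the category is enriched in coldspots relative to HRRs.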

Supplementary Figure 3 Impact of minor allele frequency (MAF) on odds ratios between coldspots (CS) and highly recombining regions (HRRs). Impact of MAF on the effects for functional mutations in the French-Canadian (FCQ) RNA sequencing data set (a,b) and for private and shared variants in (c) Europeans (EUR) and (d) Africans (AFR). (a) The enrichment of nonsynonymous and damaging mutations in coldspots remains significant for MAF < 0.05, indicating that the excess of rare variants in coldspots does not drive the effect for nonsynonymous and damaging variants. (b) Neutral variants with MAF < 0.05 are enriched in coldspots in comparison to more frequent variants, indicating that neutral diversity contributes to the excess of rare variants in coldspots. (c,d) The enrichment of private mutations in coldspots and of shared mutations in HRRs remains significant for MAF < 0.1 in both EUR and AFR, indicating that these effects are not driven only by differences in allele frequency between shared and private variants.

Supplementary Figure 4 Distribution of conservation across exons measured by GERP scores in coldspots (CS) and highly recombining regions (HRRs). (a) Mean GERP score per exon. (b) Proportion of constrained positions (GERP > 3) per exon. (c) Scatter plot of mean GERP by the proportion of constrained positions for all exons. (d) For each measure of conservation per exon, exons were grouped into four categories of equal size. Only exons that were concordant between the two classifications were kept in analyses within conservation categories, to minimize the effect of outliers for one of the two measures. Characteristics of exons in these four conservation categories in terms of average GERP score per base pair and number of constrained sites per base pair (GERP > 3) are reported in Supplementary Table 7.
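The concordance filter described in panel (d) can be sketched as below. This is a minimal illustration, assuming rank-based quartiles with ties broken by input order; the function names and toy values are hypothetical and not taken from the study.

```python
def quartile_labels(values):
    """Assign each value to one of four equal-size rank-based categories (0..3)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by increasing value
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 4 // n                    # 0 = low ... 3 = high
    return labels


def concordant_categories(mean_gerp, prop_constrained):
    """Keep only exons placed in the same category by both conservation measures.

    Returns a dict mapping exon index -> shared category.
    """
    by_mean = quartile_labels(mean_gerp)
    by_prop = quartile_labels(prop_constrained)
    return {i: by_mean[i] for i in range(len(by_mean)) if by_mean[i] == by_prop[i]}


# Toy data for eight exons (illustrative values only).
mean_gerp = [-1.0, 0.5, 1.5, 2.0, 2.8, 3.2, 3.9, 5.0]
prop_constrained = [0.1, 0.2, 0.7, 0.5, 0.65, 0.72, 0.8, 0.9]
kept = concordant_categories(mean_gerp, prop_constrained)
print(kept)
```

Exons 2 and 4 in the toy data fall into different categories under the two measures and are therefore dropped.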

Supplementary Figure 5 Differential mutational burden in conservation categories. Differential mutational burden between coldspots (CS) and highly recombining regions (HRRs) for rare (MAF < 0.01), nonsynonymous (nonsyn), damaging and constrained variants in (a) French Canadians (FCQ) and (b) Europeans (EUR) for highly covered (HC) exons and in (c) Asians (ASN) and (d) Africans (AFR) for the whole exome. Results for EUR in the whole exome are presented in Figure 3a. Conservation categories are described in Supplementary Table 7. Results for ASN and AFR in HC exons (data not shown) are similar to EUR results. For all populations and exon data sets, the medium-high and high conservation categories always show a significant enrichment for potentially deleterious mutations in coldspots.

Supplementary Figure 6 Haplotype load of nonsynonymous variants in Asians (ASN) and Africans (AFR) in the different conservation categories in coldspots (CS) and highly recombining regions (HRRs). Haplotype load is computed as described in the Online Methods. Haplotype load in Europeans is presented in Figure 3b, with the characteristics of conservation categories shown in Figure 3c.

Supplementary Figure 7 Additional simulations testing the effect of recombination rates and phasing. (a) Distribution of effects for initial and modified coldspot (CS) and highly recombining region (HRR) rates in simulations, with CS/HRR recombination rates matching the rates in the CEU and YRI maps, respectively (Supplementary Note, section 4). The distributions are significantly different, but the shift in the mean is very weak and unlikely to cause the large differences observed between populations in Figure 5. (b,c) Effect of phasing on the distribution of the number of haplotypes with two or more rare mutations (MAF < 0.01) in real haplotypes and phased haplotypes on chunks of the same length (25 kb) in simulated coldspots and HRRs. (b) The number of haplotypes with two mutations is reduced by statistical phasing with SHAPEIT2, (c) but no significant difference between coldspots and HRRs was found in this phasing bias.

Supplementary Figure 8 Effects for private and shared variants between African subpopulations. Comparison of closely related populations of African ancestry. Odds ratios comparing coldspots (CS) and highly recombining regions (HRRs) are computed on the basis of private and shared variants called in 88 Yoruba in Ibadan from Nigeria (YRI), 97 Luhya in Webuye from Kenya (LWK) and 61 Americans of African ancestry (ASW).

Supplementary Figure 9 Per-individual differential mutational burden across populations. Comparison of proportions of (a) rare and (b) nonsynonymous mutations between coldspots (CS) and highly recombining regions (HRRs) in French Canadians (FCQ), Europeans (EUR), Asians (ASN) and Africans (AFR). For each individual (ordered by their OR values), the relative proportions of rare or nonsynonymous mutations in coldspots and HRRs are shown, computed by dividing coldspot and HRR proportions by genome-wide proportions of rare or nonsynonymous variants within each individual, to adjust for differences across individuals. The larger symbols represent individuals with the minimum and maximum OR values in each population. Ticks at the bottom of the plots show individual OR values significantly different from 1 (two-tailed P < 0.05). The French-Canadian data used are the RNA sequencing data set (Supplementary Note, section 2); replication with exome data of 96 French Canadians is presented in Supplementary Figure 11.
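The per-individual normalization described in this caption amounts to a simple ratio of proportions. The sketch below assumes that each proportion is the count of variants in the category divided by the count of all variants the individual carries in that region; the function name and the counts are hypothetical.

```python
def relative_proportions(cs_cat, cs_all, hrr_cat, hrr_all, gw_cat, gw_all):
    """Relative proportions of a variant category for one individual.

    Each regional proportion (coldspots, HRRs) is divided by the same
    individual's genome-wide proportion, so individuals with different
    overall numbers of variants become comparable.
    """
    genome_wide = gw_cat / gw_all
    return (cs_cat / cs_all) / genome_wide, (hrr_cat / hrr_all) / genome_wide


# Hypothetical individual: 30 of 100 coldspot variants are rare, 20 of 100
# HRR variants are rare, and 250 of 1,000 variants genome-wide are rare.
rel_cs, rel_hrr = relative_proportions(30, 100, 20, 100, 250, 1000)
print(round(rel_cs, 3), round(rel_hrr, 3))
```

A value above 1 means the category is over-represented in that region relative to the individual's genome-wide proportion.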

Supplementary Figure 10 Per-individual differential mutational burden across European populations for private variants. Distribution of odds ratios (ORs) per individual comparing proportions of private variants between coldspots (CS) and highly recombining regions (HRRs) in closely related populations of western European ancestry. ORs are computed on the basis of private variants called in the exome sequencing data set of 96 French Canadians (FCX), 89 British individuals (GBR), 93 Finns (FIN), 98 Italians from Tuscany (TSI) and 85 European Americans (CEU). The left panel shows the frequencies of individual ORs in each population. The right panel shows, for each individual (ordered by their OR values), the relative proportions of private mutations in coldspots and HRRs, computed by dividing coldspot and HRR proportions by genome-wide proportions of private variants within each individual, to adjust for differences across individuals.

Supplementary Figure 11 Per-individual differential mutational burden across populations with FCQ exome sequencing data. Distribution of odds ratios (ORs) per individual comparing proportions of rare (a,b) and nonsynonymous (c,d) mutations between coldspots (CS) and highly recombining regions (HRRs). For Europeans (EUR), Asians (ASN) and Africans (AFR), the results are the same as shown in Figure 4 and Supplementary Figure 9, whereas French-Canadian (FCQ) results are computed using the exome sequencing data set from 96 individuals. Further descriptions of the plots are found in Figure 4 and Supplementary Figure 9.

Supplementary Figure 12 Quality checks on per-individual differential mutational burden across populations. Distribution of odds ratios (ORs) per individual in French Canadians (FCQ), Europeans (EUR), Asians (ASN) and Africans (AFR), comparing proportions of (a) nonsynonymous variants after modifying annotations in the 1000 Genomes Project populations (see the Supplementary Note, section 4.1) and (b) nonsynonymous and (c) rare variants, after excluding mutations that are fixed in one population but still segregating in others, between coldspots (CS) and highly recombining regions (HRRs). The differences between populations observed in Figure 5 remain the same after correcting for these potential technical differences.

Supplementary Figure 13 Population structure in regional populations of Quebec. Sampling from the CARTaGENE Project includes individuals from the Montreal area (MTL), Quebec City (QCC) and the Saguenay region (SAG). The regional origin of individuals was confirmed by a principal-component analysis of genetic diversity in FCQ individuals compared with genetic diversity within the Reference Panel of Quebec (RPQ) and in the CEU population from HapMap 3. Other populations included in the RPQ are GAS (Gaspesia region), ACA (Acadians), LOY (Loyalists) and CNO (North Shore region).

Supplementary Tables

Supplementary Table 1. Distribution of sequence (A) and SNPs (B) in coldspots (CS) and high recombination regions (HRRs) genome-wide and in highly covered (HC) exons.

A
Regions         CS (bp)        HRRs (bp)
Whole genome    1,048,937,     ,243,758
Whole exome     25,302,008     17,768,017
HC exons        6,906,137      2,036,963

The reference genome hg19 contains autosomal positions in total. The whole exome includes bp of sequences and the HC exons include bp of sequences.

B
POP a   Number of      Number of SNPs, whole datasets    Number of SNPs, HC exons b
        individuals    TOTAL     HRRs      CS            TOTAL     HRRs      CS
FCQ                    ,394      37,076    69,295        73,627    14,899    30,734
EUR                    ,296      34,248    48,665        69,672    12,489    29,827
ASN                    ,697      31,372    44,292        63,726    11,358    27,562
AFR                    ,549      45,816    62,944        89,789    16,464    38,157

a French Canadians (FCQ), Europeans (EUR), Asians (ASN) and Africans (AFR).

C
POP a   Number of      Number of non-synonymous SNPs, HC exons    Number of rare SNPs, HC exons
        individuals    TOTAL     HRRs      CS                     TOTAL     HRRs      CS
FCQ                    ,472      6,488     14,984                 42,627    10,097    22,530
EUR                    ,712      6,536     16,614                 48,045    8,111     21,447
ASN                    ,322      5,949     15,307                 44,676    7,511     20,196
AFR                    ,040      8,065     19,117                 51,863    8,871     23,111

a French Canadians (FCQ), Europeans (EUR), Asians (ASN) and Africans (AFR).

Supplementary Table 2. Summary of linear regression models (FCQ dataset).

Variable                  Rec rates a   GC content   Expression   Exon size   SNPs/Kb   R²
SNP/Kb                    + ***         + ***        + ***        NS          NA
Average MAF               + ***         + ***        + ***        + ***       NA
Expression b              - ***         + ***        NA           NS          NA
Density (SNP/Kb) for
  Constrained             - ***         NS           NS           - ***       + ***
  Non-synonymous          - *           - **         - ***        - ***       + ***
  Damaging                NS            NS           - ***        - ***       + ***
  Private                 - ***         + ***        NS           NS          + ***

*** p<0.001; ** p<0.01; * p<0.05; NS: non-significant; +/-: positive/negative correlation; NA: non-applicable.
a Average recombination rates per exon in cM/Mb are computed based on the FCQ genetic map.
b This correlation is evaluated considering all exons with minimum coverage of 20x in at least 50% of individuals. It is therefore biased against low-expression genes.

Supplementary Table 3. Robustness of the impact of recombination (odds ratios, OR) to GC content, gene expression levels and divergence (FCQ dataset). Feature Category OR Nonsynonymous OR Rare OR Damaging OR Neutral Low GC-content (% GC per exon) Medium Low Medium High High NS 1.23 NS 0.85 Low Average Expression (gene expression levels based on RNAseq FCQ dataset) Medium Low Medium High NS 1.27 NS 0.64 High Low NS 0.67 Divergence to Chimpanzee (dS per exon computed by PAML) Medium Low Medium High High All categories See Online Methods for description of categories. NS: non-significant.

Supplementary Table 4. Effects (odds ratios) for different mutation types (FCQ dataset) comparing coldspots and high recombining regions.

Mutation type               OR rare [CI 95%]    OR non-syn [CI 95%]   OR damaging [CI 95%]   OR neutral [CI 95%]
G→C/C→G                     1.44 [1.24;1.67]    1.49 [1.3;1.71]       1.44 [1.14;1.80]       0.54 [0.44;0.67]
A→G/T→C                     1.22 [1.16;1.30]    1.26 [1.19;1.33]      1.24 [1.13;1.37]       0.69 [0.64;0.75]
A→C/T→G                     1.42 [1.21;1.69]    1.21 [1.04;1.41]      1.27 [0.99;1.63]       0.59 [0.46;0.76]
G→A/C→T                     1.49 [1.36;1.64]    1.06 [0.97;1.16]      0.88 [0.74;1.05]       0.53 [0.48;0.60]
G→T/C→A                     1.53 [1.23;1.89]    1.40 [1.15;1.71]      1.37 [0.95;1.95]       0.39 [0.30;0.53]
A→T/T→A                     1.25 [1.18;1.32]    1.25 [1.19;1.32]      1.25 [1.14;1.36]       0.68 [1.63;1.74]
Exclusion of CpG sites      1.47 [1.39;1.56]    1.28 [1.22;1.35]      1.20 [1.09;1.33]       0.56 [0.52;0.60]
Exclusion of CpG islands    1.33 [1.27;1.39]    1.25 [1.20;1.30]      1.15 [1.07;1.24]       0.62 [0.58;0.66]

Supplementary Table 5. Robustness of the effect to recombination parameters used to define coldspots (CS) and high recombining regions (HRRs) in the FCQ dataset.

Rec. rates (cM/Mb) L H Nonsynonymous Odds ratios CS vs HRRs Number of SNPs Rare Private CS HRRs In between

Coldspots: SNPs within a 50-Kb region with no recombination rate higher than L.
High recombination regions (HRRs): SNPs within 50 Kb of at least two hotspots with rate higher than H.
Parameters used in this study (red) were chosen to maximize the overall number of SNPs included in the analyses while minimizing the difference between the number of SNPs in coldspots and in HRRs.

Supplementary Table 6. Differential mutational burden between coldspots and high recombining regions (odds ratios) by chromosome and by telomere bin.

A Chr Rare Non-synonymous Neutral FCQ EUR ASN AFR FCQ EUR ASN AFR FCQ EUR ASN AFR

B Telomere bin a Rare Non-synonymous Neutral FCQ EUR ASN AFR FCQ EUR ASN AFR FCQ EUR ASN AFR

a Each chromosome is divided into 10 bins of equal length, with bin 1 closest to the centromere and bin 10 closest to the telomere. Values in bold are significant ORs.

Supplementary Table 7. Characteristics of exons in the four conservation categories.

Conservation class   Average GERP per bp   Number of constrained sites per bp (GERP>3)   Number of exons (Total, HRRs, CS)   Number of SNPs (EUR) (Total, HRRs, CS)
Low                  [-5.179; 1.103]       [0; 0.34]
Medium Low           [1.103; 2.613]        [0.34; 0.63]
Medium High          [2.613; 3.548]        [0.63; 0.75]
High                 [3.548; 6.170]        [0.75; 1]

The numbers of exons and SNPs considered in Figure 3 are reported.

Supplementary Table 8. Demographic and selection models used in simulations.

Model   Parameters (μ, r)   Genetic map     Demography               Mean s a
EW      μ=r                 CS/HRRs         constant size
EW      μ=2r                CS/HRRs         constant size
AA      μ=r                 CS/HRRs         expansion
AA      μ=2r                CS/HRRs         expansion
EA      μ=r                 CS/HRRs         bottleneck + expansion
EA      μ=2r                CS/HRRs         bottleneck + expansion
EW      constant r          constant rate   constant size
NTR                         CS/HRRs         constant size            0

a Negative selection is modelled by a gamma distribution of mean s, with p=75% of mutations attributed a non-zero selective coefficient.
EW: DFE from Eyre-Walker et al
EA: DFE and demographic model from Boyko et al for European-Americans
AA: DFE and demographic model from Boyko et al for African-Americans
NTR: No selection

Supplementary Table 9. Differential mutational and haplotype load in coldspots versus high recombining regions (HRRs) in simulations.

rare: derived allele frequency (DAF) <0.01; ns: non-synonymous with s > -1/N0; nsneg: non-synonymous with s < -1/N0; nsdam: non-synonymous with s < 1%.

A Odds ratios for different models

Sample size n=500
Mut type   EW μ=r [CI 95%]      EW μ=2r [CI 95%]     constant ρ [CI 95%]   NTR [CI 95%]
rare       1.22 [1.12;1.29]     1.14 [1.08;1.2]      0.97 [0.9;1.05]       0.99 [0.92;1.07]
ns         1.05 [0.999;1.12]    1.04 [0.99;1.09]     0.99 [0.94;1.05]      -
nsneg      1.1 [1.04;1.16]      1.06 [1.01;1.12]     0.99 [0.93;1.06]      -
nsdam      1.12 [1.03;1.19]     1.08 [1.02;1.16]     0.98 [0.9;1.09]       -

Mut type   AA μ=r [CI 95%]      AA μ=2r [CI 95%]     EA μ=r [CI 95%]       EA μ=2r [CI 95%]
rare       1.18 [1.14;1.23]     1.14 [1.09;1.19]     1.21 [1.14;1.27]      1.14 [1.09;1.22]
ns         1.03 [0.998;1.07]    1.02 [0.98;1.06]     1.03 [0.99;1.07]      1.02 [0.98;1.05]
nsneg      1.06 [ ]             1.04 [1.002;1.09]    1.05 [1.01;1.09]      1.03 [0.99;1.08]
nsdam      1.09 [1.01;1.17]     1.07 [0.99;1.14]     1.06 [0.998;1.14]     1.04 [0.99;1.11]

Sample size n=200
Mut type   EW μ=r [CI 95%]      AA μ=r [CI 95%]      EA μ=r [CI 95%]
rare       1.20 [1.06;1.26]     1.15 [1.07;1.25]     1.18 [1.1;1.32]
ns         1.07 [0.95;1.15]     1.02 [0.95;1.08]     1.01 [0.97;1.09]
nsneg      1.11 [1.05;1.2]      1.05 [1.01;1.09]     1.07 [1.01;1.011]
nsdam      1.12 [1.02;1.25]     1.07 [0.98;1.16]     1.07 [1.004;1.13]

Values in bold represent ORs significantly different from 1.

B Models Mutation Type rare ns nsneg nsdam EW μ=r, p=0.75 < < AA μ=r < <0.01 EA μ=r < <0.01 EW μ=r, p= EW μ=r, p=0.5 < < bp deleterious motif (s = 0.1) EW constant r NTR NTR rephased with ShapeIT

Proportion of simulated replicates where haplotypes in HRRs have a higher proportion of a given mutation type. Values in bold represent significant results (one-tailed p-value <0.05).

Supplementary Table 10. Gene Ontology analysis.

Terms Number of genes a WebGestalt p-value PANTHER p-value
Biological processes: cell cycle; mitosis; protein metabolic processes; mRNA processing; organelle organisation (term not included); microtubule-based processes (term not included)
Molecular function: Binding: nucleotide binding; RNA binding; ATP binding (term not included). Catalytic activity: ligase; transferase

a The number of genes is the maximum number reported between WebGestalt and PANTHER.

Supplementary Table 11. Overrepresentation of clinically relevant mutations, cancer variants and sensitive genomic regions in coldspots (CS) compared to high recombining regions (HRRs).

COUNTS                               CS            HRRs          OR for CS vs HRRs

Genome-wide SNPs (1000G), correcting for genome-wide diversity
  rare (<0.01)                       7,445,567     5,435,142
  segregating (>0.01)                4,883,516     4,160,183
  ClinVar mutations                  18,103        11,           [1.15;1.21]
    segregating                                                  [0.83;0.99]
    non-segregating and rare         17,114        11,           [1.10;1.16]
  HumSaVar mutations                 11,080        8,            [1.01;1.07]
    segregating                      7,763         6,            [1.04;1.11]
    non-segregating and rare         3,317         2,            [1.07;1.20]

Whole exome (bp), correcting for sequence length
  whole exome                        25,302,008    17,768,017
  COSMIC mutations (point mut.)      574,          ,             [1.270;1.281]
    segregating                      16,254        12,           [0.881;0.923]
    non-segregating                  557,          ,             [1.285;1.296]
  motifs in Khurana et al. (bp)
    sensitive                        2,087,        ,             [1.518;1.525]
    ultra-sensitive                  165,453       17,           [6.406;6.607]

Genome-wide SNPs (1000G), correcting for total number of SNPs
  all SNPs                           12,329,083    9,595,325
  GWAS hits                          3,470         4,            [0.535;0.583]
  Affy 6.0 Chip                      223,          ,             [0.599;0.606]
  Illumina 1M                        301,          ,             [0.771;0.780]
  Illumina 2.5M                      573,          ,             [0.546;0.550]
  all chips                          869,577       1,037,        [0.650;0.654]

Common SNPs (1000G, >10%), correcting for number of common SNPs
  all common SNPs                    1,653,476     1,566,782
  GWAS hits                          2,367         3,            [0.641;0.713]
  Affy 6.0 Chip                      162,          ,             [0.696;0.706]
  Illumina 1M                        203,          ,             [0.771;0.781]
  Illumina 2.5M                      220,          ,             [0.447;0.453]
  all chips                          415,          ,             [0.639;0.644]

Supplementary Note

1 Recombination analyses

1. Overview

Accurate population recombination maps are necessary to identify regions that are in different recombination environments. We used five population genetic maps (1-3) to identify cold or hot regions. We first identified these regions using LD-based population maps and, in a second step, excluded regions for which recombination rates in pedigree and admixture maps were inconsistent with the definitions of the cold and hot regions. We considered LD-based methods first because we wanted to identify regions that have been low recombining (or high recombining) for a considerable amount of time during human evolution, as selective interference is a phenomenon that would occur over many generations. Pedigree maps give information on the recombination rates in the current generations, which does not guarantee that a given region was not recombining in the evolutionary past of humans. LD-based methods, although potentially biased by SNP density, use polymorphism data from population samples to look at ancestral recombination rates, and so provide the opportunity to identify regions with no evidence of recombination over hundreds of thousands of years.

2. Population genetic map of French-Canadians

The genetic map of the French-Canadians of Quebec (FCQ) population was built using LDhat (4). 521 FCQ individuals were genotyped on the Illumina Omni2.5M array. A total of 1,554,440 autosomal SNPs were obtained after filtering (quality control: HWE p<0.001, missingness < 0.05, MAF>0). We ran the interval program from the LDhat package on FCQ genotyping data. Because the likelihood tables for the interval program are pre-computed for a maximum number of 192 haplotypes, we randomly selected 96 unrelated individuals from the 521 FCQ individuals. The largest chromosomes (1 to 12) were broken into two segments (p and q arms) and all genomic segments were phased with ShapeIT2 (5). We ran the interval program on each genomic segment for 30,300,000 iterations with a burn-in of 300,000 iterations and sampled the population recombination rates ρ every 10,000 iterations. The estimate of the recombination rate ρ between each pair of adjacent SNPs, in units of 4N_e r per Kb, was computed by taking the average rate across iterations of the rjMCMC procedure implemented in interval. To convert the population recombination rate estimates in 4N_e r per Kb into centimorgans per megabase (cM/Mb), we inferred the effective population size N_e for the FC population using estimates of r computed for the 2010 deCODE map in cM units (2). Specifically, we identified chromosomal segments where both FCQ data and deCODE SNP positions allowed estimates of rates, and we summed rates across these genomic regions to obtain the total estimated distance (4N_e R) and the total genetic distance (R in cM units) from the deCODE map.

3. Population genetic maps from HapMap3 data

Population genetic maps from the HapMap2 data were built by the HapMap consortium in 2007 (3), using the 2002 deCODE pedigree map (6) and hg18 positions. They were subsequently lifted over to hg19 using the UCSC liftOver tool, and regions where the order of markers had changed were removed from the final maps. We re-computed the HapMap maps for CEU and YRI with the methodology used to compute the FCQ map described above, to allow direct comparison. Specifically, we performed a liftover of the HapMap3 SNP positions prior to estimating recombination rates with interval using 96 unrelated individuals from the CEU and YRI populations. We then converted the recombination rates to cM/Mb using the 2010 deCODE pedigree map (2), and obtained new HapMap genetic maps for these populations. These maps are available here:

4. Coldspots and High Recombining Regions

We used these genetic maps to locate coldspots and hotspots of recombination. We define coldspots (CS) as regions of more than 50 Kb with recombination rates between adjacent SNPs below 0.5 cM/Mb in the FCQ, CEU and YRI populations, such that they are shared between all human populations studied.
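The unit conversion described in section 2 above (from 4N_e r per Kb to cM/Mb) can be sketched as follows. This is a minimal illustration under stated assumptions: N_e is inferred as the summed LD-based distance (4N_e R) divided by four times the pedigree distance in Morgans, and a single N_e is applied to every interval. The function names and toy numbers are hypothetical and are not taken from the study.

```python
def infer_ne(rho_per_kb, lengths_kb, total_cm):
    """Infer N_e from LD-based rates and a pedigree-map distance.

    rho_per_kb: population rates (4*N_e*r per kb) for each SNP interval
    lengths_kb: length of each interval in kb
    total_cm:   pedigree-map genetic distance over the same intervals (cM)
    """
    total_rho = sum(r * l for r, l in zip(rho_per_kb, lengths_kb))  # 4*N_e*R
    total_morgans = total_cm / 100.0                                 # cM -> Morgans
    return total_rho / (4.0 * total_morgans)


def rho_to_cm_per_mb(rho_per_kb, ne):
    """Convert a rate in 4*N_e*r per kb into cM/Mb for a given N_e."""
    r_morgans_per_kb = rho_per_kb / (4.0 * ne)
    return r_morgans_per_kb * 100.0 * 1000.0  # 100 cM per Morgan, 1000 kb per Mb


# Toy example: two intervals of 10 kb and 90 kb spanning 0.019 cM in total.
ne = infer_ne([0.4, 0.04], [10.0, 90.0], total_cm=0.019)
print(round(ne), round(rho_to_cm_per_mb(0.4, ne), 3))
```

In the toy example the two intervals convert to 1.0 and 0.1 cM/Mb, which sum back to the 0.019 cM used as input.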
We excluded centromeric regions and required that at least 5 SNPs support the coldspot, to avoid regions with dramatically reduced diversity, where power to estimate recombination rates is decreased. For each region identified, we computed the mean recombination rate (cM/Mb) using the deCODE pedigree map (2) and the admixture-based African American map (1), and we excluded all regions that have a recombination rate larger than 0.5 cM/Mb in one of these maps. We obtained a list of 7,381 autosomal coldspots, spanning about a third of the human genome, for a total of Gb (Supplementary Table 1). A hotspot is defined as a short segment (<15 Kb) with recombination rates falling in the 90th percentile (> 5 cM/Mb). We define high recombination regions (HRRs) as regions with a high density of hotspots, such that the distance separating neighbouring hotspots is smaller than 50 Kb. We identified 12,500 HRRs genome-wide shared between the FC, CEU and YRI populations, covering a total of Mb (Supplementary Table 1). The definitions of coldspots, hotspots and HRRs are illustrated in Figure 1. A complete list of these regions can be found at

The recombination rate thresholds used to define coldspots and hotspots were chosen to maximize the overall number of SNPs included in the analyses while minimizing the difference between the number of SNPs in coldspots and in HRRs. The effects are robust to different recombination thresholds (Supplementary Table 5).

Although coldspots and HRRs are present by definition in both the YRI and CEU LD maps, they may display different recombination rates in these maps. We compared the mean rates per coldspot and per HRR between these two LD-based maps. The mean recombination rate in coldspots is cM/Mb for CEU and cM/Mb for YRI, with this difference being highly significant (p<10^-5, permutation test). In HRRs, the mean recombination rate is 5.70 cM/Mb for CEU and 4.62 cM/Mb for YRI (p<10^-5, permutation test). The distributions of rates within these regions differ strongly between YRI and CEU (Kruskal-Wallis chi-squared = , df = 1, p-value < 2.2e-16). These differences could be due to differences in LD-based maps caused by varying demography and population-specific selection; however, it is more likely that they reflect differences in local recombination rates due to the presence of different alleles of PRDM9, the protein responsible for recombination clustering in hotspots along the genome (1, 7).
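The coldspot definition given in this section (more than 50 Kb, all adjacent-SNP rates below 0.5 cM/Mb, at least 5 supporting SNPs) can be sketched for a single genetic map as below. The sketch is deliberately partial: it omits the exclusion of centromeric regions, the requirement that a coldspot be shared across the FCQ, CEU and YRI maps, and the filtering against the pedigree and admixture maps. The function name and the toy data are hypothetical.

```python
def call_coldspots(positions, rates, max_rate=0.5, min_len=50_000, min_snps=5):
    """Find candidate coldspots on one chromosome in one genetic map.

    positions: sorted SNP coordinates (bp)
    rates:     rates[i] is the rate (cM/Mb) between positions[i] and positions[i+1]
    Returns a list of (start_bp, end_bp) for runs of consecutive intervals with
    rate < max_rate that span more than min_len bp and contain >= min_snps SNPs.
    """
    if len(rates) != len(positions) - 1:
        raise ValueError("need exactly one rate per pair of adjacent SNPs")

    def keep(first_snp, last_snp, out):
        length = positions[last_snp] - positions[first_snp]
        n_snps = last_snp - first_snp + 1
        if length > min_len and n_snps >= min_snps:
            out.append((positions[first_snp], positions[last_snp]))

    coldspots = []
    run_start = None                      # index of the first SNP of the current run
    for i, rate in enumerate(rates):
        if rate < max_rate:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            keep(run_start, i, coldspots)  # run ends at SNP i
            run_start = None
    if run_start is not None:              # run reaching the last SNP
        keep(run_start, len(positions) - 1, coldspots)
    return coldspots


# Toy chromosome: one long low-rate run, a hotspot-like interval, then a
# low-rate run that is long enough but supported by only three SNPs.
positions = [0, 20_000, 40_000, 60_000, 80_000, 90_000, 95_000, 150_000]
rates = [0.1, 0.2, 0.1, 0.3, 6.0, 0.2, 0.1]
print(call_coldspots(positions, rates))
```

The thresholds are exposed as parameters so that the alternative cut-offs explored in Supplementary Table 5 could be tried in the same way.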
The impact of these differences in mean rates in coldspots and HRRs between African and non-African populations is explored in Supplementary Note section 3.5 and Supplementary Figure 11 and is unlikely to cause the large differences we observe between populations in Figure 5.

5. Comparison of coldspots and HRRs with the deCODE map

Only 481 coldspots were excluded because recombination rates were larger than 0.5 cM/Mb in the deCODE map. The deCODE recombination rate for each coldspot is reported in the supplementary data available online at

To ensure our results are robust to the choice of map, we computed coldspots and HRRs using the deCODE pedigree map alone and compared this set of regions to the set of regions used in our study. There are 8,165 coldspots inferred from the deCODE map, and 1,824 coldspots supported by more than 5 SNPs that do not overlap a coldspot in our final list of coldspots:
- 94 had been removed because of a lack of SNPs in the region in HapMap or FCQ data (fewer than 5 SNPs);
- have a recombination rate > 0.5 cM/Mb in all LD-based maps;
- For the remaining 406 coldspots:
  o 121 have a recombination rate > 0.5 cM/Mb in the FCQ LD map;
  o 125 have a recombination rate > 0.5 cM/Mb in the CEU LD map;
  o 142 have a recombination rate > 0.5 cM/Mb in the ASN LD map;
  o 332 have a recombination rate > 0.5 cM/Mb in the YRI LD map.

For HRRs, a smaller number of regions is found with the deCODE map (9,071). Discordances in HRR lists are mainly due to differences in the intensity of recombination hotspots as computed by the two methods, which are not directly comparable.

We recomputed the enrichment statistics for rare, non-synonymous, damaging and neutral variants for the FCQ dataset with the deCODE regions alone. Rare (1.11 [1.06;1.16]), non-synonymous (1.10 [1.06;1.15]) and damaging (1.08 [1.01;1.16]) variants remain significantly enriched in coldspots, and neutral variants are significantly underrepresented (0.85 [0.81;0.91]). However, the effects are somewhat weaker, which is explained by the fact that we include coldspots with evidence of recombination in LD maps. If we only take the overlap (i.e., variants in coldspots in both deCODE and LD maps), the effects become comparable to the ones observed in the final list of coldspots: rare (1.27 [1.22;1.33]), non-synonymous (1.17 [1.12;1.22]), damaging (1.05 [1.01;1.11]) and neutral (0.68 [0.65;0.73]). Our results are therefore robust to the choice of genetic map.

2 French-Canadian Genomic Data

1. Overview

The main analyses for the French-Canadian (FCQ) population rely on SNPs called from RNA sequencing (RNAseq) data.
SNPs from other populations come from the 1000 Genomes phase I high-coverage exome dataset and were grouped according to their ancestry (African, European, Asian). Admixed populations from the Americas were excluded. We performed extensive analyses to ensure that calling SNPs from transcriptomes does not create biases influencing our population genetic analyses and cannot explain the difference seen in the FCQ population. We also performed exome sequencing for a subset of FCQ individuals for which we had RNAseq data and further validated SNP calls and the overall results found with the RNAseq SNPs.

2. French Canadians

The CARTaGENE project (CaG) collected biological samples and data from 20,000 participants recruited throughout the province of Quebec (8), and high-density genotyping and RNA sequencing data were generated for 521 French-Canadian participants (Online Methods). Sampling includes individuals from three distinct metropolitan regions of Quebec: the Montreal area (MTL), Quebec City (QCC) and the Saguenay Lac-St-Jean region (SAG) (Supplementary Figure 13). Regional origins of the individuals were validated with a principal component analysis (PCA) of genetic diversity using genotypic data and including individuals from the Reference Panel of Quebec (RPQ) (9). Population structure is complex and made of regionally differentiated populations (Supplementary Figure 13), resulting from the very recent regional founder effect that occurred in Saguenay. This territory was colonized during the 19th century by a small number of settlers, who contributed massively to the genetic pool of individuals living in this region today (10).

3. Processing of the raw RNAseq Data and SNP calling

Approximately 3 ml of blood was collected for RNA work in Tempus Blood RNA Tubes (Life Technologies). Total RNA was extracted using a Tempus Spin RNA Isolation kit, followed by globin mRNA depletion using a GLOBINclear-Human kit (Life Technologies). Indexed 100-bp paired-end RNAseq libraries were constructed using the TruSeq RNASeq library kit (Illumina). Sequencing was done on HiSeq machines (Illumina), multiplexing three samples per lane.
After initial filtering based on sequencing read quality, paired-end reads were aligned using TopHat (V1.4.0) (11) to the hg19 European Major Allele Reference Genome (12). PCR duplicate removal was performed using Picard (picard_tools/1.56). Raw gene-level count data were generated using htseq 0.5.3p3 (13). These counts were then normalized using EDASeq v1.4.0 and a procedure that adjusts for GC content as well as for distributional differences between and within sequencing lanes (14, 15). Average normalized gene expression levels per gene were determined by averaging expression levels of each gene across all individuals (Idaghdour et al., in preparation). Every exon of a gene was attributed the gene-level value.

SNPs were called from RNAseq data using a procedure similar to SNP calling in exome sequencing data. However, prior to SNP calling, bowtie2 (0.12.7) (16) was used to remove abundant sequences (polyA, polyT, tRNA). Only reads that were properly paired and uniquely mapped were kept. Mapping quality scores were recalibrated using GATK (17) and SNP calling was performed with samtools (0.1.18) (18). Filtering of SNPs was done using vcftools v0.1.7 (19). We kept SNPs with variant quality of 30 and genotype quality of 20 (Phred scores). Minor allele frequencies (MAF), the proportion of individuals with non-missing genotypes and Hardy-Weinberg equilibrium (HWE) p-values were computed using plink v1.07. SNPs showing departures from HWE at p < were excluded. We obtained a total of 178,394 polymorphic SNPs (MAF > 0) in the 521 French-Canadian individuals (Supplementary Table 1B).

4. Selection of Highly Covered Exons

To ensure that sequencing SNPs are called throughout the length of exons and to reduce possible biases due to read depth, we selected highly covered exons (hereafter termed HC exons) with all positions of their sequence covered at a minimum of 20x in more than 50% of the sequenced individuals (i.e., at least 261 FCQ individuals). We used BAMStats-1.25 to obtain the minimum coverage per exon per individual for 208,226 autosomal exons. A total of 89,390 exons passed this stringent filter, containing a total of 73,627 SNPs. For subsequent analyses, we also excluded 9 genes for which the mutational profiles were abnormal (Online Methods).

5. FCQ Exome Sequencing Data and Genotyping

All individuals were also genotyped on the Illumina Omni2.5M array.
A total of 1,554,440 autosomal SNPs were obtained after filtering (quality control: HWE p < 0.001, MAF > 0). We took all positions in common between the Omni2.5 chip and the SNPs called from RNAseq. We filtered for missingness (<50% for RNAseq, <95% for Omni) and filtered out positions for which the alleles did not match between the chip and RNAseq after flipping, ending up with 26,615 positions to compare. We compared genotypes for which there was a call in both datasets, and the concordance rates were above 98.8% in all individuals, with a mean across individuals of 99.3%.

Exome sequencing was also performed for 96 FCQ individuals. DNA from each sample was extracted from peripheral blood cells and paired-end exome sequencing was performed on HiSeq machines (Illumina), multiplexing six samples per lane. We first trimmed the sequencing reads with Trim Galore prior to alignment to remove adaptors (with parameter q 0). Alignment was performed using BWA version r16. After recalibration with GATK (17), reads were trimmed for quality using bamutil version (genome.sph.umich.edu/wiki/bamutil). SNP calling was performed with samtools (0.1.18) (18) using only properly paired and uniquely mapped reads. We kept SNPs with a variant quality of at least 30, a genotype quality of at least 20 and a minimum coverage of 10x, for a total of 60,251 SNPs. Using the concordance procedure described above, we computed concordance rates between the exome dataset and the RNAseq dataset for 30,850 SNPs called in both datasets. The mean concordance rate across individuals is 99.01%. There is one outlier individual with a concordance rate of 94.8% (although its concordance rate between RNAseq and genotyping is 99.27%); all other individuals have concordance rates above 98%.

6. Checks in the FCQ dataset

Many additional analyses were performed on the FCQ data to ensure the robustness of the results:
- We evaluated differences in diversity between coldspots and HRRs in FCQ (Supplementary Figure 1), replicating the documented observation of decreased diversity in coldspots.
- To ensure that the differences in the effects observed in FCQ are not due to biases in the RNAseq data, all results were derived with both the RNAseq SNPs and exome re-sequencing data from 96 individuals (Supplementary Figure 2A, Supplementary Figure 10).
Furthermore, as no significant differences were found between these two samples of different sizes, the measure of enrichment used to assess the differential mutational burden is robust to sample size. To evaluate the effect of private variants (Figure 2B, Supplementary Figure 9), we used the exome sequencing dataset of 96 individuals.
- We evaluated the effect of confounding factors (Supplementary Tables 2-5). The results are robust to GC content, expression levels and between-species neutral substitution rates (Supplementary Table 3). Furthermore, we regressed the number of mutations of different types per exon on the recombination rate per exon, controlling for GC content, expression levels, exon size and total SNP density (Supplementary Table 2). The effect is seen for all mutation types, with no marked differences between transitions and transversions or for mutations towards GC (Supplementary Table 4), excluding the possibility that GC-biased gene conversion is responsible for the differences seen between recombination environments. More details on controlling for GC content are given below. Finally, we tested the effect using a wide array of recombination rate thresholds to define coldspots and HRRs (Supplementary Table 5).
- We computed ORs for non-synonymous and damaging mutations for different frequency classes, to verify that the enrichment of potentially deleterious mutations is not only due to an enrichment of rare variants, which include more non-synonymous variants (Supplementary Figure 3).

In all the above cases, the results obtained confirm our major conclusions. Similar checks were performed in the 1000 Genomes populations. In particular, we performed analyses using the highly covered exons from the RNAseq datasets as well as using all exons where SNPs were called in the 1000 Genomes populations (Supplementary Figure 2). For the other checks (confounding factors, frequency classes), the results obtained are generally the same as in the FCQ; therefore only the FCQ results are shown.

7. Controlling for GC bias

We controlled for variation in GC content in the genome in different ways to verify that this variable is not confounding our analyses. When comparing exons with the same GC content between coldspots and HRRs, the effects remain significant, with the exception of the high GC content class, for non-synonymous (1.07 [0.99;1.15]) and damaging (1.04 [0.92;1.18]) mutations.
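Each estimate above is reported as a point value with an interval in brackets, and an interval that includes 1 is read as non-significant. A minimal sketch of how such an odds ratio and interval can be obtained from a 2x2 table is given below. It assumes a standard Wald interval on the log odds ratio at the 95% level; the exact contingency table and interval method used in the study are defined in the Online Methods and are not reproduced here, so the function name, argument layout and example counts are illustrative assumptions.

```python
import math


def odds_ratio_ci(cs_in, cs_out, hrr_in, hrr_out, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    cs_in   : variants of the category of interest in coldspots
    cs_out  : all other variants in coldspots
    hrr_in  : variants of the category of interest in HRRs
    hrr_out : all other variants in HRRs
    z       : normal quantile (1.96 gives an approximate 95% interval)

    All four counts must be positive.
    """
    odds_ratio = (cs_in * hrr_out) / (cs_out * hrr_in)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / cs_in + 1 / cs_out + 1 / hrr_in + 1 / hrr_out)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Illustrative counts only, not taken from the study.
or_value, or_low, or_high = odds_ratio_ci(20, 80, 10, 90)
is_significant = not (or_low <= 1.0 <= or_high)
```

Under this approximation the interval widens as any of the four counts becomes small, which is why a class with few mutations has little power to exclude 1.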
These non-significant results are likely due to the small number of exons in coldspots with high GC content, and hence the small number of mutations in this class, leading to a lack of power to detect a significant effect. To make sure that our effects are not influenced by GC-biased gene conversion, a recombination-associated process that favors the fixation of G/C alleles over A/T alleles, we computed the effects independently for all mutation types, and found that mutations towards G or C showed the same effect as mutations towards A or T (Supplementary Table 4).

We further studied the impact of CpG sites. We identified CpG sites by retrieving the 3-nucleotide sequence centred on the position of every variant in the FCQ dataset for which the reference allele is a C or a G. Out of 179,005 potential variants, 100,012 were found to be CpG sites. Overall, we see a significant deficit of CpG mutations in coldspots (OR=0.62 [0.59;0.64]), reflecting the lower GC content in coldspots. When excluding all CpG sites from the analyses, the enrichment of putatively deleterious mutations remains significant (Supplementary Table 4). Similarly, when excluding mutations within CpG islands, the results remain unchanged (Supplementary Table 4). These additional analyses confirm that GC/CpG content is not responsible for the significant enrichment of putatively deleterious mutations in coldspots in the human genome.

3 Simulations

1. Overview

In the past, selective interference leading to Muller's ratchet has been mainly investigated through simulation studies. These studies demonstrated the impact of the absence of recombination on the accumulation of deleterious mutations within genomes (20-24). However, most results were produced for haploid genomes (but see (25)) and were set up to compare regions of free recombination with regions that entirely lack recombination. Here, we performed additional computer simulations (with SLiM (26) and sfs_code (27)) to describe the expectations under a model of selective interference between negatively selected mutations in diploid genomes, with recombination environments comparable to the ones observed in human autosomes. Both forward-in-time simulation programs gave similar results for all analyses.

2. Recombination environments

We simulated diploid genomes with a distribution of recombination rates similar to the one observed in the empirical data.
For 250 individuals (500 haplotypes), we simulated exon-like sequences with non-synonymous mutations, with the mutation rate μ = per base and Ne = 1000, chosen to minimize computing time while getting diversity data comparable to human data. We tested models with the overall recombination rate being r = μ or r = μ/2, to evaluate whether the relationship between r and μ had an impact. We defined three recombination environments: coldspots (CS), highly recombining regions (HRRs) and regions in between. In the human exome, coldspots and HRRs contain 4.1% and 58.6% of recombination events, respectively, according to values in the deCODE map. These values were used in the simulations to match the human genetic map. For each genome, we simulated 75 fragments of 200 Kb, with coldspots and HRRs of 95 Kb and 28 Kb, respectively, and 77 Kb of regions in between HRRs and coldspots. We also simulated a null model with constant r, to ensure that the effects seen do not reflect the difference in region length between coldspots and HRRs. Finally, we also simulated modified coldspots and HRRs, such that their recombination rates better match the African recombination rates within coldspots and HRRs (Supplementary Note section 1, Supplementary Figure 11).

3. Models of selection and demography

Our simplest model is a constant population size model with the distribution of fitness effects estimated by Eyre-Walker and colleagues (28), using their model without correction for demography (EW model). Simulations were performed with the same distributions of selection coefficients across coldspots and HRRs, with proportion p = 75% of variants attributed a selection coefficient from a gamma distribution of mean s = We also used other scenarios to model European and African human data, with parameters taken from the EA and AA models from (29), a study that inferred both selection and demographic parameters simultaneously. Finally, we simulated data under a model without selection (NTR), a control under selective neutrality but with the human-specific recombination map. These models are described in Supplementary Table 7. For each model, we generated 100 replicates.
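As a worked check on the recombination environments of section 2, the sketch below converts the share of recombination events and the length of each region into a per-base rate. It assumes that the events of a 200 Kb fragment are split among the three regions in proportion to the quoted shares, and that the in-between regions receive the remainder (100% - 4.1% - 58.6%); both are assumptions made for illustration, and the function and variable names are hypothetical. The overall rate is left as a parameter, so passing 1.0 yields multipliers relative to the fragment-wide average.

```python
def region_rates(overall_rate, regions):
    """Per-base recombination rate of each region within one fragment.

    overall_rate : average per-base rate over the whole fragment
    regions      : name -> (length in bp, share of recombination events)
    """
    total_length = sum(length for length, _ in regions.values())
    total_events = overall_rate * total_length  # expected events per fragment
    return {
        name: total_events * share / length
        for name, (length, share) in regions.items()
    }


# Region lengths and shares quoted in section 2; the in-between share is
# taken as the remainder (an assumption).
fragment = {
    "coldspot": (95_000, 0.041),
    "HRR": (28_000, 0.586),
    "between": (77_000, 1.0 - 0.041 - 0.586),
}
relative = region_rates(1.0, fragment)
hrr_to_cs = relative["HRR"] / relative["coldspot"]
```

Under these assumptions the per-base rate in HRRs is roughly 48 times that in coldspots, while the fragment-wide average remains equal to the overall rate.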
Each replicate took between 15 and 55 hours to run, depending on the model and on the simulation program. Odds ratios are used to estimate the differential mutational burden between coldspots and HRRs for rare and non-synonymous mutations (Online Methods). Mutations with derived allele frequency (DAF) below 0.01 are labelled as rare. Mutations with s larger than -1/(2Ne) (ns) are effectively neutral, and the others are negatively selected (nsneg). To model damaging mutations, we chose the threshold of below s = (nsdam), to match the number of non-synonymous damaging variants in the empirical data. Simulated coldspots have a higher proportion of rare and nsneg mutations than simulated HRRs for all models of selective interference with r = μ.
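The labelling rules just described can be sketched as a small classifier. This is an illustration only: the function name and the use of a set of labels are hypothetical, Ne = 1000 is the value used in the simulations, and the nsdam threshold is left as an optional argument with no default because its value is not given here.

```python
def classify_mutation(daf, s, ne=1000, rare_daf=0.01, dam_threshold=None):
    """Return the set of labels that apply to one simulated mutation.

    daf           : derived allele frequency of the mutation
    s             : selection coefficient (negative when deleterious)
    ne            : effective population size
    rare_daf      : mutations with DAF below this value are 'rare'
    dam_threshold : optional cutoff; mutations with s below it are
                    additionally labelled 'nsdam' (no value assumed)
    """
    labels = set()
    if daf < rare_daf:
        labels.add("rare")
    if s > -1.0 / (2 * ne):
        labels.add("ns")      # effectively neutral
    else:
        labels.add("nsneg")   # negatively selected
    if dam_threshold is not None and s < dam_threshold:
        labels.add("nsdam")
    return labels


# Illustrative mutations (values are not taken from the simulations).
weak_rare = classify_mutation(daf=0.005, s=-0.0001)
strong_common = classify_mutation(daf=0.2, s=-0.01)
```

Counting these labels separately in coldspots and in HRRs gives the entries of the 2x2 tables from which the odds ratios are computed.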

Results for r = μ/2 are highly similar, although the effect is somewhat reduced (Supplementary Table 8A). The neutral model with varying recombination rates does not show an enrichment of rare neutral variants in coldspots, indicating that the reduction of Ne in low recombination regions alone does not account for the excess of rare variants.

4. Models of background selection with and without interference

We simulated various models of background selection by changing the proportion p of non-synonymous variants and the mean selection coefficient s of the gamma distribution. We let p take values from 0.1 to 0.75 and s take values from to We find that the effects on both rare and deleterious diversity increase with p (Figure 4). We simulated a scenario with p = 0.1 and s = -0.3, such that the overall selective pressure acting on the loci is approximately the same as for p = 0.75 and s = (28). Interestingly, this model did not perform better than the scenario with p = 0.1 and s = , suggesting that a small proportion of strongly selected mutations is unlikely to cause the difference in mutational burden observed in the data. Scenarios modelling background selection with no interference further confirm this result: we simulated a single motif of 10 bp in the centre of each region, where all sites have fixed s = 0.05, 0.1 or These models are designed not to contain linkage between negatively selected sites, because the region is too short for many deleterious mutations to occur and interfere with each other. These scenarios do not predict differences in the proportion of rare neutral mutations between coldspots and HRRs (Figure 4), indicating that background selection acting at one independent locus does not lead to an enrichment of rare variants in regions with reduced recombination rate.
However, we note that U, the deleterious mutation rate, is very small in the models without interference, and these models may therefore not reflect the effects that background selection can have when many mutations are present and increase U. Unfortunately, adding more deleterious mutations to increase U also creates the opportunity for more selective interference between deleterious mutations, so the effects of Hill-Robertson interference and background selection are very hard to disentangle in this simulation framework. Nevertheless, these results show that many mutations with small effects are required to cause the patterns observed, which likely result from a combination of background selection and reduced efficiency of selection in coldspots due to interference between these mutations.

5. Simulation of various scenarios to explain population differences