Supporting Information

Similar documents
Supporting Information

Nature Genetics: doi: /ng.3254

Questions we are addressing. Hardy-Weinberg Theorem

Supplemental Data. Who's Who? Detecting and Resolving. Sample Anomalies in Human DNA. Sequencing Studies with Peddy

Supplementary Figures

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

Linkage Disequilibrium. Adele Crane & Angela Taravella

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

Supplementary Note: Detecting population structure in rare variant data

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2017

PLINK gplink Haploview

Understanding genetic association studies. Peter Kamerman

EXERCISE 1. Testing Hardy-Weinberg Equilibrium. 1a. Fill in Table 1. Calculate the initial genotype and allele frequencies.

Genome-wide analyses in admixed populations: Challenges and opportunities

N e =20,000 N e =150,000

Evolution of Populations (Ch. 17)

Detecting ancient admixture using DNA sequence data

Measuring Evolution of Populations. SLIDE SHOW MODIFIED FROM KIM

Introduction to Population Genetics. Spezielle Statistik in der Biomedizin WS 2014/15

Supplementary Methods 2. Supplementary Table 1: Bottleneck modeling estimates 5

COMPUTER SIMULATIONS AND PROBLEMS

Measuring Evolution of Populations

Haplotypes, linkage disequilibrium, and the HapMap

The evolutionary significance of structure. Detecting and describing structure. Implications for genetic variability

Population stratification. Background & PLINK practical

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

Supplementary Methods Illumina Genome-Wide Genotyping Single SNP and Microsatellite Genotyping. Supplementary Table 4a Supplementary Table 4b

Average % If you want to complete quiz corrections for extra credit you must come after school Starting new topic today. Grab your clickers.

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

Lab 2: Mathematical Modeling: Hardy-Weinberg 1. Overview. In this lab you will:

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus.

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

More Introduction to Positive Selection

Algorithms for Genetics: Introduction, and sources of variation

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

An Introduction to Population Genetics

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

5/18/2017. Genotypic, phenotypic or allelic frequencies each sum to 1. Changes in allele frequencies determine gene pool composition over generations

Exam 1, Fall 2012 Grade Summary. Points: Mean 95.3 Median 93 Std. Dev 8.7 Max 116 Min 83 Percentage: Average Grade Distribution:

General aspects of genome-wide association studies

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

1) (15 points) Next to each term in the left-hand column place the number from the right-hand column that best corresponds:

Supplementary Online Content

Digital RNA allelotyping reveals tissue-specific and allelespecific gene expression in human

Statistical Methods for Quantitative Trait Loci (QTL) Mapping

Summary. Introduction

H3A - Genome-Wide Association testing SOP

Edexcel (B) Biology A-level

Papers for 11 September

Neutral mutations in ideal populations. The null model in population genetics The Fisher-Wright population model

Human Genetics and Gene Mapping of Complex Traits

TEST FORM A. 2. Based on current estimates of mutation rate, how many mutations in protein encoding genes are typical for each human?

Why do we need statistics to study genetics and evolution?

Office Hours. We will try to find a time

A brief introduction to population genetics

Genome-Wide Association Studies (GWAS): Computational Them

UNDERSTANDING VARIATION IN HUMAN SKIN COLOR. Student Handout

Population Structure and Gene Flow. COMP Fall 2010 Luay Nakhleh, Rice University

Quantitative Genetics for Using Genetic Diversity

Lecture 10: Introduction to Genetic Drift. September 28, 2012

Conifer Translational Genomics Network Coordinated Agricultural Project

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Enhanced Resolution and Statistical Power Through SNP Distributions Within the Short Tandem Repeats

GENETICS - CLUTCH CH.21 POPULATION GENETICS.

Molecular markers in plant systematics and population biology

TITLE: Refugia genomics: Study of the evolution of wild boar through space and time

Fossils From Vindija Cave, Croatia (38 44 kya) Admixture between Archaic and Modern Humans

GENOME WIDE ASSOCIATION STUDY OF INSECT BITE HYPERSENSITIVITY IN TWO POPULATION OF ICELANDIC HORSES

POPULATION GENETICS studies the genetic. It includes the study of forces that induce evolution (the

Runs of Homozygosity Analysis Tutorial

Population Genetics II. Bio

Concepts: What are RFLPs and how do they act like genetic marker loci?

Supplementary File: In search of the Goldilocks zone for hybrid speciation

Diagnosis of MDD was based on criteria of the Diagnostic and Statistical Manual of Mental

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

QTL Mapping Using Multiple Markers Simultaneously

Hardy-Weinberg problem set

Themes. Homo erectus. Jin and Su, Nature Reviews Genetics (2000)

Computational Population Genomics

Supplementary Figure 1. Growth of E.coli strains with mutant DHFR genes as a function of trimethoprim concentration.

A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice

Statistical Tools for Predicting Ancestry from Genetic Data

Association mapping of Sclerotinia stalk rot resistance in domesticated sunflower plant introductions

EFFICIENT DESIGNS FOR FINE-MAPPING OF QUANTITATIVE TRAIT LOCI USING LINKAGE DISEQUILIBRIUM AND LINKAGE

Statistical Tests for Admixture Mapping with Case-Control and Cases-Only Data

UPIC: Perl scripts to determine the number of SSR markers to run

Introduction to Add Health GWAS Data Part I. Christy Avery Department of Epidemiology University of North Carolina at Chapel Hill

University of York Department of Biology B. Sc Stage 2 Degree Examinations

SLiM: Simulating Evolution with Selection and Linkage

LINKAGE DISEQUILIBRIUM MAPPING USING SINGLE NUCLEOTIDE POLYMORPHISMS -WHICH POPULATION?

In the presence of CWD the form of selection operating is Heterozygote Advantage (overdominance).

Multi-SNP Models for Fine-Mapping Studies: Application to an. Kallikrein Region and Prostate Cancer

Leveraging admixture analysis to resolve missing and crosspopulation. Noah Zaitlen

Analysis of genome-wide genotype data

The Evolution of Populations

Evolution of species range limits. Takuji Usui (Angert Lab) BIOL Nov 2017

Characterization of Allele-Specific Copy Number in Tumor Genomes

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Transcription:

Supporting Information Natural selection for the Duffy-null allele in the recently admixed people of Madagascar Jason A. Hodgson a,b, Joseph K. Pickrell c,d, Laurel N. Pearson a, Ellen E. Quillen a,e, António Prista f, Jorge Rocha g,h, Himla Soodyall i, Mark D. Shriver a, and George H. Perry a,j a Department of Anthropology, The Pennsylvania State University, University Park, PA 16802, USA, b Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, Berkshire SL5 7PY, UK, c New York Genome Center, New York, NY 10013, USA, d Department of Biological Sciences, Columbia University, New York, NY 10027, USA, e Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78245, USA, f Faculdade de Educação Física e Desporto, Universidade Pedagógica, Moçambique, g Centro de Investigação em Biodiversidade e Recursos Genéticos da Universidade do Porto (CIBIO), Vairão, Portugal, h Departamento de Biologia, Faculdade de Ciências da Universidade do Porto, Porto, Portugal, i Human Genomic Diversity and Disease Research Unit, Division of Human Genetics, School of Pathology, Faculty of Health Sciences, University of Witwatersrand and the National Health Laboratory Service, Johannesburg 2000, South Africa, j Department of Biology, Penn State University, University Park, PA 16802, USA 1. Supplementary Material and Methods. Page 2 2. Supplementary Table. Page 6 3. Supplementary Figures.. Page 7 4. References... Page 14 1

1. SUPPLEMENTARY MATERIAL AND METHODS Samples. The samples used in this study consist of 19 Malagasy, 16 Bantu speakers from South Africa, 20 Burunge from East Africa, 20 Mozambican, 42 Europeans, 11 Mala and 11 Brahmin from India, and 20 Chinese or Japanese. 10k SNP data were collected using methods described in ref. [1]. The Burunge, Europeans, Mala, Brahmin, Chinese and Japanese, and Mozambican samples have been described previously [2-4]. The European, and Chinese and Japanese samples are from the Coriell Caucasian and East Asian collections. The Malagasy and South African Bantu samples are newly described here. The Malagasy sample consists of Merina individuals collected from the Central Highlands of Madagascar. The sample of South African Bantu speakers was collected in Johannesburg, South Africa. Both the Malagasy and South African samples were collected with approval from the Human Research Ethics Committee of the University of Witwatersrand. Duffy-null allele genotyping. We genotyped the C -> T SNP that determines the Duffy positive/null genotype (rs2814778) using Sanger sequencing. A 464bp segment of the DARC promoter region containing rs2814778 was amplified with the PCR (electronic supplementary material, figure S2a) using the following primers (both 5-3 ): forward ACTTTCTGGTCCCCACCTTT, and reverse ACAAGAGGGAGCTAGGAGGC. Sequencing was performed on the positive strand using an internal primer TAGTCCCAACCAGCCAAATC on an Applied Biosystems 3730 capillary sequencer. Electropherograms were visualized with Geneious. Homozygotes were resolved as a single peak of normal magnitude, while heterozygotes were resolved as double peaks at half the normal magnitude (electronic supplementary material, figure S2b). Quantification of Malagasy admixture. An identity by state-multidimensional scaling (IBS- MDS) analysis was performed with PLINK v1.07 [5]. First, an IBS matrix was calculated using autosomal SNPs, excluding SNPs with MAF < 1, SNPS with > 20% missing data, and individuals with >50% missing data. The resulting IBS matrix was then used as input for the MDS analysis. Results were plotted using the ggplot2 library [6] for R v2.15.1 [7]. The same excluded dataset used to calculate the IBS matrix was also used in the ADMIXTURE analysis. To determine the number of ancestral populations, K, that best describes the data, we calculated cross-validation error for K=2 through K=6 (electronic supplementary material, figure S1), and chose K that minimized the error (K = 3), using the ADMIXTURE cross-validation 2

procedure. Statistical analyses and plots of estimated ancestry proportions were performed with R v2.15.1 [7]. Expected allele frequency in the Malagasy. Expected allele frequencies were calculated according to the following formula: E(p d ) = α(p 1 ) + (1 - α)(p 2 ) where E(p d ) is the expected frequency of allele p in the admixed daughter population, p 1 is the frequency of the allele in the parent population 1, p 2 is the frequency of the allele in the parent population 2, and α is the admixture proportion contributed by parent population 1. Likelihood ratio test to compare a history of Duffy-null allele frequency stasis to a history of change. Because we observed wide variance in sub-saharan African ancestry proportions among our Merina population sample (figure 1b), we tested whether the observed Duffy-null genotypes were different than expected, given the corresponding individual estimates of sub- Saharan African ancestry (and assuming Duffy-null allele frequencies of and 1.0 in the founding Austronesian and mainland African populations, respectively). Specifically, we used a likelihood ratio test to determine whether the observed Duffy genotypes (SNP rs2814778), and the corresponding estimates of sub-saharan African ancestry for 18 Malagasy individuals were more consistent with the null hypothesis of no allele frequency change since admixture began, to the alternative hypothesis that Duffy-null allele frequency has changed significantly. We assume no genotyping error and no error in the ancestry estimates. Let g = [g 1,g 2,...g i ] be a vector containing the numbers of Duffy-null alleles in each individual i and f = [ f 1, f 2,...f i ] be the corresponding vector of African ancestry proportions. If the Duffy-null allele was fixed in the African population that colonized Madagascar and absent from the Austronesian population that colonized Madagascar (as well as from any Eurasian populations that also contributed to Malagasy ancestry), then: P(g i f i ) ~ Bin(2, f i ) 3

We then define a simple model that allows the Duffy-null allele frequency to have changed since admixture. Let δ be a parameter controlling the amount of change. We define the following likelihood function: L( g 18 f ) = Bin(g i 2, f i +δ(1 f i )) i=1, where Bin(x n, p) represents the binomial density function with parameters n and p evaluated at x.using this likelihood function, we used a likelihood ratio test to compare the null hypothesis where δ = 0, to the alternative hypothesis where δ 0. This likelihood ratio test rejected the null hypothesis that the frequency of the Duffy-null allele has not changed since the time of admixture (P = 04), consistent with the notion of a recent Duffy-null allele frequency increase in the Malagasy. Computer simulations. A program, fastadsim.pl, was written to simulate admixture between two parent populations. The program simulates genetic drift for a single allele through time, assuming i) instantaneous admixture between two parent populations, ii) diploidy, iii) random mating, iv) no mutation, and the following user-defined parameters: a) allele frequency of parent population 1, b) allele frequency of parent population 2, c) admixture proportion, d) starting effective population size, e) per generation population growth, f) the number of generations of drift, g) per generation migration rate from population 1, h) per generation migration rate from population 2, and i) the number of simulation replicates to perform. An alternative test of the null hypothesis that the observed Duffy-null allele frequency change can be explained by genetic drift alone could be to compare the observed allele frequency shift to a null distribution derived from putatively neutrally evolving SNPs in the Malagasy with similar frequency difference between the mainland sub-saharan African and East Asian source populations. However, the extreme Duffy-null allele frequency difference observed between sub-saharan Africans and the rest of the world is practically unknown elsewhere in the genome. For example, the 1000 Genomes Project found only four fixed differences between West Africans and Western Europeans, including the Duffy-null allele, and only 72 between West Africans and East Asians [8], none of which are included in the Affymetrix 10k SNP panel that we used for this study. Of the 11,552 SNPs genotyped for this study, only 2 had frequency differences greater than 0.9, and 37 greater than 0.8, between Mozambicans and East Asians. Also, SNPs with such extreme frequency differences between populations may have resulted 4

from natural selection in at least one of the two populations [9, 10], making such SNPs inappropriate from which to derive a neutral null distribution. Consequently, we used the neutral evolution computer simulation approach described above to explore the amount of genetic drift that can be expected for the Duffy-null allele in Madagascar under various conservative demographic scenarios. Expected allelic diversity lost to drift. We used fastadsim.pl to estimate the amount of genetic diversity lost to drift that we would expect to observe for a given demographic scenario. We performed 200 simulations under the specified demographic parameters for alleles with starting frequencies of 01 0.500 in increments of 01 (100,000 simulations), recording the final allele frequency of each simulation. Then to account for sampling bias in our observed SNP data (i.e., low frequency polymorphisms will often be sampled to be monomorphic), for each simulation we drew a sample of 2N alleles with probability equal to the simulated final allele frequency, where N is the observed sample size. We then recorded the proportion of simulations in which one of the two alleles was sampled to be lost due to drift (fixation). Estimation of selection coefficient. We estimated the coefficient of selection, s, assuming w (+/+) = 1, w (+/-) = 1, and w (-/-) = 1 + s, where w (+/+) is the fitness of Duffy-positive phenotype homozygotes, w (+/-) is the fitness of Duffy-positive heterozygotes, and w (-/-) is the fitness of Duffynegative phenotype homozygotes. We used the following equation to calculate the frequency of the Duffy-null allele in the next generation (q ) given the frequency of Duff-null in the current generation (q) and the coefficient of selection (s): q = ((1-q)q + q 2 + sq 2 ) / (1 + sq 2 ) This equation was used recursively to find values of s that resulted in the observed change in Duffy-null allele frequency given g generations. A perl script was used to automate this process. This equation was also used recursively to estimate the number of generations required for a Duffy-null selective sweep. 5

2. SUPPLEMENTARY TABLE Table S1. Estimated selection coefficients. Population Duffy-null freq change # gens s Merina 0.48 à 0.72 * 33 52 Merina 0.48 à 0.72 * 43 40 Merina 0.48 à 0.72 * 77 22 Merina 0.48 à 0.78 ** 33 66 Merina 0.48 à 0.78 ** 43 50 Merina 0.48 à 0.78 ** 77 28 South coast 0.67 à 0.92 ** 33 65 South coast 0.67 à 0.92 ** 43 49 South coast 0.67 à 0.92 ** 77 27 Estimated selection coefficients assuming selection for the Duffy-null phenotype (e.g. w+/+ = 1, w+/- = 1, w-/- = 1 + s), and various assumptions about the number of generations of selection. *Duffy-null allele frequency in our sample of 18 Merina individuals. **Duffy-null allele frequency from Highland (N=69) and south coast (N=86) Malagasy [11]. 6

3. SUPPLEMENTARY FIGURES Figure S1. Plot of ADMIXTURE cross validation error from K=2 through K=6. We chose K=3 to analyze the SNP data, as the value that minimizes the error. Cross Validation Error 0.605 0.600 error 0.595 0.590 2 3 4 5 6 K 7

Figure S2. Duffy-positive/ null genotyping. a) The Duffy-positive/ null phenotype is determined by SNP rs2814778 located in the promoter region of the Duffy Antigen Receptor for Chemokines (DARC) gene, just upstream of the 5 UTR. We successfully genotyped this SNP in 18 of 19 Malagasy, by amplifying a 464 bp. region around the SNP and Sanger sequencing the PCR product. b) Genotypes were determined by examining the electropherograms with large single peaks recorded as homozygotes, and small double peaks recorded as heterozygotes. 8

Figure S3. Simulated frequencies of the Duffy-null allele under genetic drift with constant migration from sub-saharan Africa. For each demographic scenario comprised of variable initial effective population sizes (N e = 100, 200, or 500), number of generations (33, 43, 77; equivalent to ~1,000, 1,300, and 2,300 years respectively), number of migrants per generation from sub-saharan Africa (1, 2, or 4 when possible to result in a final average allele frequency of ~0.48), and 2% population growth per generation forward-evolution simulations were performed starting with the indicated initial admixture proportion (α 0 ). Initial admixture proportions were chosen such that the final admixture proportion (α x ) would be ~0.48 given the migration rate. The final allele frequency was recorded for each of 10,000 simulations for each demographic scenario. The bars show the 50%, 75%, 95%, and 99% distributions of the final simulated allele frequencies for each demographic scenario. The solid vertical line indicates the observed frequency of 0.78 for the Duffy-null allele in a sample of Highland Merina and Bezanozano Malagasy (N=69) [11], while the dashed line indicates the frequency of the Duffynull allele in our sample of 18 Merina. The percentage of simulations with an allele frequency greater than or equal to 0.72 and 0.78 are shown in the right hand column (>= 5% shown in bold). Simulated Duffy-null allele frequencies under drift with constant migration from Africa N e = 500 N e = 200 N e = 100 migrants Generations Simulation Parameters x 77 4 0.328 0.487 77 2 0.396 0.477 77 1 0.439 0.480 43 4 0.371 0.487 43 2 0.420 0.480 43 1 0.456 0.486 33 4 0.388 0.484 33 2 0.431 0.480 33 1 0.453 0.477 77 4 73 0.478 77 2 0.275 0.480 77 1 0.375 0.480 43 4 0.185 0.481 43 2 0.342 0.488 43 1 0.411 0.482 33 4 0.235 0.480 33 2 0.359 0.480 33 1 0.414 0.478 77 2 73 0.480 77 1 0.279 0.482 43 2 0.179 0.477 43 1 0.331 0.481 33 2 0.223 0.473 33 1 0.359 0.484 Proportion of simulations mean 50% 75% 95% 99% observed Duffy-null allele frequency (0.72, N=18) observed Duffy-null allele frequency Menard et al. Highlanders (0.78, N=69) % sims >= 0.72 0.78 0.2 0.3 0.5 0.1 0.1 0.2 0.6 0.1 2.2 0.4 3.6 1.0 0.4 1.6 0.2 2.0 0.3 0.4 0.8 0.1 1.3 0.2 3.6 1.1 7.7 3.2 3.0 0.7 5.9 2.2 2.3 0.5 5.5 1.8 0.25 0.50 0.75 simulated frequency of the Duffy-null allele 9

Figure S4. Simulated frequencies of the Duffy-null allele under genetic drift with constant migration from Eurasia. For each demographic scenario comprised of variable initial effective population sizes (N e = 100, 200, or 500), number of generations (33, 43, 77; equivalent to ~1,000, 1,300, and 2,300 years respectively), number of migrants per generation from Eurasia (1, 2, or 4 when possible to result in a final average allele frequency of ~0.48), and 2% population growth per generation forward-evolution simulations were performed starting with the indicated initial admixture proportion (α 0 ). Initial admixture proportions were chosen such that the final admixture proportion (α x ) would be ~0.48 given the migration rate. The final allele frequency was recorded for each of 10,000 simulations for each demographic scenario. The bars show the 50%, 75%, 95%, and 99% distributions of the final simulated allele frequencies for each demographic scenario. The solid vertical line indicates the observed frequency of 0.78 for the Duffy-null allele in a sample of Highland Merina and Bezanozano Malagasy (N=69) [11], while the dashed line indicates the frequency of the Duffy-null allele in our sample of 18 Merina. The percentage of simulations with an allele frequency greater than or equal to 0.72 and 0.78 are shown in the right hand column (>= 5% shown in bold). Simulated Duffy-null allele frequencies under drift with constant migration from Asia N e = 500 N e = 200 N e = 100 migrants Generations Simulation Parameters x 77 4 0.626 0.477 77 2 0.553 0.478 77 1 0.511 0.476 43 4 0.584 0.475 43 2 0.542 0.487 43 1 0.511 0.484 33 4 0.574 0.483 33 2 0.522 0.477 33 1 0.501 0.477 77 4 0.853 0.479 77 2 0.657 0.471 77 1 0.568 0.476 43 4 0.745 0.477 43 2 0.617 0.480 43 1 0.549 0.481 33 4 0.698 0.475 33 2 0.594 0.479 33 1 0.532 0.474 77 2 0.854 0.479 77 1 0.667 0.479 43 2 0.750 0.481 43 1 0.605 0.472 Proportion of simulations mean 50% 75% 95% 99% % sims >= 0.72 0.78 0.1 0.2 0.4 0.1 0.1 0.5 1.4 0.1 3.0 0.6 0.2 1.4 0.1 2.3 0.4 0.2 0.6 1.4 0.2 3.3 0.6 7.7 2.6 2.5 0.4 5.1 1.4 33 4 0.927 0.477 0.4 33 2 0.698 0.473 2.2 0.4 33 1 0.594 0.479 4.4 1.2 observed Duffy-null allele frequency (0.72, N=18) observed Duffy-null allele frequency Menard et al. Highlanders (0.78, N=69) 0.20 0.40 0.60 0.80 simulated frequency of the Duffy-null allele 10

Figure S5. Simulated frequencies of the Duffy-null allele under genetic drift in southern coastal populations with 67% sub-saharan African ancestry. For each demographic scenario - comprised of variable initial effective population sizes (Ne = 100, 200, or 500), number of generations (33, 43, 77; equivalent to ~1,000, 1,300, or 2,300 years, respectively), and levels of population growth (2%, 5% or 10%) - forward-evolution simulations were performed from a starting allele frequency of 0.67 based on the estimate of African ancestry in southern coastal Malagasy groups [12]. The final allele frequency was recorded for each of 10,000 simulations run for each demographic scenario. The bars show the 50%, 75%, 95%, and 99% distributions of the final simulated allele frequencies for each demographic scenario. The solid vertical line indicates the observed Duffy-null frequency of 0.92 from the southern coastal location of Faragangana (N=86) [11]. The percentage of simulations with a final Duffy-null allele frequency greater than or equal to 0.92 are shown in the right hand column (>= 5% shown in bold). Simulated Duffy-null allele frequencies under drift and with starting allele frequency = 0.67 Simulation Parameters N e = 500 N e = 200 N e = 100 Generations 77 10% 77 5% 77 2% 43 10% 43 5% 43 2% 33 10% 33 5% 33 2% 77 10% 77 5% 77 2% 43 10% 43 5% 43 2% 33 10% 33 5% 33 2% 77 10% 77 5% 77 2% 43 10% 43 5% 43 2% 33 10% 33 5% 33 2% Growth Proportion of simulations mean 50% 75% 95% 99% observed Duffy-null allele frequency south coast (0.92, N=86)) % sims >= 0.92 0.1 0.4 3.4 0.2 1.4 0.1 0.8 0.4 3.45 11.8 0.3 2.4 7.1 0.4 2.2 5.0 0.25 0.50 0.75 1.00 simulated frequency of the Duffy-null allele 11

Figure S6. Theoretical versus observed levels of allelic fixation in the Merina with 48% sub-saharan African ancestry, and with a Duffy-null allele frequency of 0.72. Proportions of fixed SNP loci in the Merina from simulated (dashed line) and observed (solid line) data, binned by predicted allele frequencies at the initial time of admixture, and observed allele frequencies in present-day sub-saharan African and East Asian population samples, given the least extreme scenario given 48% sub-saharan admixture and a Duffy-null allele frequency of 0.72 (see Fig. 2). The observed data are based on the Affymetrix 10k SNP data from our population sample of 19 Merina individuals. In each case the observed allelic fixation is less than that expected under the type of extreme demographic conditions necessary for such a large shift in the Duffy-null allele frequency in the absence of positive selection. 1.0 Genetic Diversity Lost to Drift Least extreme demographic scenario given an observed Duffy-null frequency of 0.72 48% sub-saharan African admixture 0.8 Simulated expected data: N e = 200, 77 generations, 2% growth Observed Malagasy SNP data proportion of fixed loci 0.6 0.4 0.2 5 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Predicted or simulated minor allele frequency prior to drift (upper limit of bin) 12

Figure S7. Simulated expected distribution of genetic drift for the Duffy-null allele for a demography consistent with observed levels of allelic fixation in the Merina. a) Expected level of allelic fixation given N e = 500, 33 generations, and 2% growth; a demography we found to produce similar allelic fixation to the observed Merina SNP data. b) Distribution of 10,000 genetic drift simulations for the Duffy-null allele given the demography this close fit demography. None of the 10,000 simulations finished with a Duffy-null allele frequency 0.78. Only seven of 10,000 simulations finished with a Duffy-null allele frequency 0.72. Proportion of fixed loci 1.0 0.8 0.6 0.4 0.2 a) Simulated demography with allelic diversity lost to drift similar to observed Merina genetic diversity 48% sub-saharan African admixture Simulated expected data: N e = 500, 33 generations, 2% growth Observed Merina SNP data 5 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Predicted or simulated minor allele frequency prior to drift (upper limit of bin) Number of simulations 400 200 0 b) Simulated Duffy-null alllele frequency with demography compatible with observed Merina genetic diversity 48% sub-saharan African admixture N e = 500 33 generations 2% growth 0.20 0.30 0.40 0.50 0.60 0.70 0.80 Simulated Duffy-null Allele Frequency observed Duffy-null allele frequency (0.72, N=18) observed Duffy-null allele frequency Menard et al. Highlanders (0.78, N=69) 13

4. REFERENCES 1. Shriver M.D., Mei R., Parra E.J., Sonpar V., Halder I., Tishkoff S.A., Schurr T.G., Zhadanov S.I., Osipova L.P., Brutsaert T.D., et al. 2005 Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2(2), 81-89. 2. Norton H.L., Kittles R.A., Parra E., McKeigue P., Mao X., Cheng K., Canfield V.A., Bradley D.G., McEvoy B., Shriver M.D. 2007 Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol Biol Evol 24(3), 710-722. 3. Bauchet M., McEvoy B., Pearson L.N., Quillen E.E., Sarkisian T., Hovhannesyan K., Deka R., Bradley D.G., Shriver M.D. 2007 Measuring European population stratification with microarray genotype data. Am J Hum Genet 80(5), 948-956. 4. Alves I., Coelho M., Gignoux C., Damasceno A., Prista A., Rocha J. 2011 Genetic Homogeneity Across Bantu-Speaking Groups from Mozambique and Angola Challenges Early Split Scenarios between East and West Bantu Populations. Human Biology 83(1), 13-38. 5. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., et al. 2007 PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3), 559-575. 6. Wickham H. 2009 ggplot2: elegant graphics for data analysis. (New York, NY, USA, Springer. 7. R_Development_Core_Team. 2011 R: A language and environment for statistical computing. (Vienna, Austria, R Foundation for Statistical Computing. 8. Durbin R.M., Abecasis G.R., Altshuler D.L., Auton A., Brooks L.D., Gibbs R.A., Hurles M.E., McVean G.A. 2010 A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061-1073. 9. Barreiro L.B., Laval G., Quach H., Patin E., Quintana-Murci L. 2008 Natural selection has driven population differentiation in modern humans. Nat Genet 40(3), 340-345. 10. Excoffier L., Hofer T., Foll M. 2009 Detecting loci under selection in a hierarchically structured population. Heredity 103(4), 285-298. 11. Menard D., Barnadas C., Bouchier C., Henry-Halldin C., Gray L.R., Ratsimbasoa A., Thonier V., Carod J.F., Domarle O., Colin Y., et al. 2010 Plasmodium vivax clinical malaria is commonly observed in Duffy-negative Malagasy people. Proc Natl Acad Sci U S A 107(13), 5967-5971. 12. Pierron D., Razafindrazaka H., Pagani L., Ricaut F.X., Antao T., Capredon M., Sambo C., Radimilahy C., Rakotoarisoa J.A., Blench R.M., et al. 2014 Genome-wide evidence of Austronesian-Bantu admixture and cultural reversion in a hunter-gatherer group of Madagascar. Proc Natl Acad Sci U S A. 14