Statistical Inference in the Wright-Fisher Model Using Allele Frequency Data


Syst. Biol. 66(1):e30-e46, 2017
The Authors. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
DOI: /sysbio/syw056
Advance Access publication August 2, 2016

PAULA TATARU, MARIA SIMONSEN, THOMAS BATAILLON, AND ASGER HOBOLTH
Bioinformatics Research Centre, Aarhus University, Aarhus, 8000, Denmark.
Correspondence to be sent to: Bioinformatics Research Centre, Aarhus University, C.F. Møllers Allé 8, Aarhus C 8000, Denmark; asger@birc.au.dk
These authors contributed equally to this work.
Received 4 December 2015; reviews returned 31 May 2016; accepted 6 June 2016
Associate Editor: David Bryant

Abstract. The Wright-Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright-Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright-Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright-Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.
[Allele frequency, diffusion, inference, moments, selection, Wright-Fisher.]

A central goal of population genetics is to infer the past history of populations and describe the evolutionary forces that have shaped their genetic variation. The Wright-Fisher model (Fisher 1930; Wright 1931) explicitly accounts for the effects of various evolutionary forces (random genetic drift, mutation, selection) on allele frequencies over time. This model can also accommodate the effect of demographic forces such as variation in population size through time and/or migration connecting populations. Information about these evolutionary and demographic forces can, in principle, be retrieved from allele frequency data.

The questions that researchers can answer and the types of inference they can make depend on the type of genetic data available, which can be broadly divided into two categories. One type of data is a time series of allele frequencies from a single population (Fig. 1a). Here, the task is often to quantify the amount of drift that has influenced the changes in allele frequencies over time. This is done by estimating the size of the ideal Wright-Fisher population that best accounts for the patterns of genetic drift observed in the data, that is, by estimating the effective population size. Furthermore, an important goal could be to identify those loci that have been under positive selection over the time interval considered.

The second type of data consists of allele frequencies from multiple populations, typically collected in the present (Fig. 1b). In this situation, the task is often to infer divergence times, population sizes, mutation rates, and, if applicable, migration rates between populations. Additionally, there is also considerable interest in evaluating the role of selection in shaping the observed data. Typical questions are: Do allele frequencies in regions of interest harbor footprints of selection?
What is the overall importance of purifying selection on a specific set of sites (e.g., non-coding regions of functional interest or non-synonymous positions in gene coding regions)?

We emphasize that this second type of data is very similar to the type of data analyzed in phylogenetics. In both instances, information is gained as new mutations arise at the nucleotide level and the fate of these mutations is influenced by the different evolutionary and demographic forces of interest. The difference between phylogenetics and population genetics essentially resides in the time scales that are modeled. Phylogenetics is often concerned with long time scales, and the data contain one sample per species. Differences among the sequences are most often substitutions. Population genetics typically considers data where several samples are available within a species, and many differences are detected due to mutations that are still segregating (polymorphic). Interestingly, these two time scales tend to merge when considering data sets containing sequences of individuals from recently diverged species, as both types of differences (mutations that are still polymorphic and mutations that have been fixed as substitutions) have to be modeled jointly.

To infer the evolutionary history of a population, model-based approaches in population genetics have to rely on an explicit model for the evolution of populations. The Wright-Fisher model (Fisher 1930; Wright 1931) occupies a central position in this endeavour. It provides an elegant mathematical framework for modeling allele frequency data. The dynamics of the model are well understood (Kimura 1955a, 1955b, 1964; Crow and Kimura 1956; Crow and Kimura 1970; Ewens 1972; Crow 1987; Ewens 2004), but inference under the Wright-Fisher model is complicated by the lack of a simple closed-form analytical expression for the distribution of allele frequencies (DAF). Common to all inference methods is the need to determine the DAF, either at equilibrium or over specified time intervals.
FIGURE 1. Data types. The gray boxes represent the unobserved history of the populations, together with the corresponding population allele frequency, whereas the white boxes indicate the observed data: the generation when the data are sampled, the size $n$ of the sample, and the allele count $z$, that is, how many alleles of a given type have been observed among the genotyped individuals. Given the population frequency $x_i$, $z_i$ follows a binomial distribution with size $n_i$ and probability $x_i$. In order to calculate the likelihood of the data, the DAF of $x_i$ is needed. a) Time series data where, typically, one population is sampled at different known generations. b) Single time-point data, where multiple populations are sampled just once, typically in the present. The history of the populations is given as a tree. The leaves and internal nodes represent the sampled and ancestral populations, respectively. The branch lengths reflect the amount of time populations have diverged since the split from the ancestral population. [Panels a and b each show three samples with sample sizes $n_1, n_2, n_3$ and allele counts $z_1, z_2, z_3$.]

Here, we focus on how the DAF is influenced by demographic and evolutionary forces and concentrate on both classical and more recent attempts to calculate the DAF that enable accurate yet tractable population genetics inference. We begin our review with the basic bi-allelic Wright-Fisher model by considering, in turn, the forces of pure genetic drift, mutation, migration, and selection. For each of these forces, we provide expressions for the mean and variance of the DAF, and discuss and compare the approaches used to obtain the DAF. We also review implementations of the inference methods (Table 1). Although the bi-allelic Wright-Fisher model captures a major part of data types, in particular single-nucleotide polymorphisms (SNPs), some loci are intrinsically multi-allelic. We therefore also briefly discuss recent progress in calculating the DAF under the general multi-allelic Wright-Fisher model.
We investigate whether one of the widely used approximations for the multi-allelic DAF can adequately capture the first two moments of the DAF, and point to limitations of the approximation.

A variety of methods that are grounded in the Wright-Fisher model use a range of tests and/or summary statistics to detect population differentiation (Balding and Nichols 1995; 1997; Nicholson et al. 2002; Gaggiotti and Foll 2010), or carry out genome-wide scans for selection (Foll and Gaggiotti 2008; Coop et al. 2010; Gautier et al. 2010; Gautier). Several of these methods use some of the approaches for calculating the DAF discussed here. However, they do not directly use or estimate the effect of the different evolutionary forces on the DAF. Therefore, we do not review such methods and refer the reader instead to Haasl and Payseur (2015) for details.

Next to the Wright-Fisher model, the coalescent (Kingman 2000, 1982a, 1982b, 1982c) and Moran (Moran 1958) models occupy an important role in the field. The coalescent process is dual to the Wright-Fisher model: although the Wright-Fisher model describes the evolution of a population forward in time in discrete non-overlapping generations, the coalescent process is built backwards in time, and arises as an approximation to the Wright-Fisher model when the population size is large. Unlike the coalescent, the Moran model is a forward-in-time process, and it is often regarded as an equivalent to the Wright-Fisher model (but see Bhaskar and Song). Both the coalescent and Moran models have been analyzed extensively and their dynamics are in several cases more amenable to mathematical analysis (Donnelly 1984; Ewens 2004; Hobolth et al. 2007; Muirhead and Wakeley 2009; Li and Durbin 2011; Paul et al. 2011; Vogl and Clemente). However, the Moran model is hardly ever used for inference (but see, e.g., De Maio et al. 2013; 2015), whereas the coalescent is typically restricted to a handful of individuals (Hobolth et al. 2007; Li and Durbin 2011; Paul et al. 2011; Mailund et al. 2012; Sheehan et al.
2013; Schiffels and Durbin 2014; Rasmussen et al.) and does not use allele frequency data (but see, e.g., Liu and Fu). Therefore, we do not include the coalescent and Moran models in this review, and refer the reader instead to Fu and Li 1999; Durrett 2008; Kuhner 2009; Liu et al. 2009; Wakeley 2009; Nielsen and Slatkin 2013; Edwards et al.

BI-ALLELIC WRIGHT-FISHER MODEL

The Wright-Fisher model assumes a randomly mating population of finite size reproducing in discrete non-overlapping generations, by allowing the individuals in generation $r+1$ to choose parents at random from the previous generation. The model describes the stochastic behavior through time of the frequency of an allele at a locus. This frequency is influenced by a series of evolutionary forces that, as discussed below, change the probability of choosing a parent. Here, we consider a diploid population of size $2N$ which contains only two alleles, denoted A and a. Below we review methods used to obtain the DAF of allele A after a certain number of generations.

Pure Drift

The Wright-Fisher model, in its simplest form, only considers random genetic drift (Fig. 2), where the stochastic fluctuations in the allele frequency are purely determined by the random mating of the population. This assumption is appropriate for the analysis of loci that have small mutation rates and the analysis of recently diverged populations, leaving little time for mutation to create new alleles, and where we expect an overall negligible effect of selection.

Dynamics and moments. Let $z_r$ be the number of A alleles in generation $r$ and $x_r = z_r/(2N)$ be the

corresponding allele frequency.

TABLE 1. Overview of recent inference methods for the bi-allelic Wright-Fisher model

Reference                        Data  Approach                 Availability
Markov chain theory
  Mathieson and McVean 2013 (a)  T     Normal                   -
  Gompert 2015 (a)               T     Beta                     spatpg
Diffusion approximation
  Bollback et al.                T     Finite-difference        -
  Gutenkunst et al.              S     Finite-difference        ∂a∂i
  Lukić and Hey 2012             S     Spectral decomposition   MultiPop
  Malaspinas et al.              T     Numerical approximation  upon request
  Gautier and Vitalis 2013       S     Spectral decomposition   KimTree
  Steinrücken et al.             T     Spectral decomposition   spectralHMM
  Vitalis et al.                 S     Stationary DAF           SelEstim
  Živković et al. 2015           S     Spectral decomposition   upon request
  Ferrer-Admetlla et al.         T     Numerical approximation  ApproxWF
Moment-based approximations
  Sirén et al.                   S     Beta                     -
  Pickrell and Pritchard 2012    S     Normal                   TreeMix
  Lacerda and Seoighe 2014       T     Normal                   upon request
  Hui and Burt 2015              T     Beta                     NB
  Tataru et al.                  S     Beta with spikes         SpikeyTree
  Terhorst et al.                T     Normal                   EandR-timeseries

Notes: The table indicates what type of data the method uses (Data: time series data from one population, T, or single time-point data from multiple populations, S); if the method models new mutations (Mut), migration (Mig) or selection (Sel); which type of approach is used for calculating the DAF (Approach); and whether the method is publicly available (Availability). All methods model genetic drift. (a) Analyze jointly time series data from multiple populations. The table covers only the more recent inference methods.

FIGURE 2. Dynamics in the pure drift bi-allelic Wright-Fisher model. The child inherits the parental allele. [Example shown: generation $r$ with alleles A A a a a A A a ($z_r = 4$) and generation $r+1$ with alleles A a a a A A a a ($z_{r+1} = 3$).]

The random mating of the population leads to a count of A alleles in generation $r+1$ that is binomially distributed (Fisher 1930; Wright 1931; Crow and Kimura 1970; Ewens 2004),

$$z_{r+1} \mid z_r \sim \mathrm{Bin}(2N, x_r). \qquad (1)$$

Here, $\mathrm{Bin}(n, p)$ is the binomial distribution with sample size $n$ and probability $p$.
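The binomial sampling step of equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the reviewed methods: the function names `transition_matrix` and `daf_after`, and the small value of $2N$ used in the example, are assumptions chosen for readability. The sketch builds the one-generation transition probabilities and propagates a starting allele count through a given number of generations.

```python
from math import comb


def transition_matrix(two_n):
    """Return P where P[z][k] = P(z_{r+1} = k | z_r = z) under equation 1."""
    matrix = []
    for z in range(two_n + 1):
        x = z / two_n  # allele frequency in the parental generation
        row = [comb(two_n, k) * x**k * (1.0 - x) ** (two_n - k)
               for k in range(two_n + 1)]
        matrix.append(row)
    return matrix


def daf_after(two_n, z0, generations):
    """Distribution of the allele count after `generations` steps from z0."""
    p = transition_matrix(two_n)
    dist = [0.0] * (two_n + 1)
    dist[z0] = 1.0  # all probability mass starts at the initial count
    for _ in range(generations):
        dist = [sum(dist[z] * p[z][k] for z in range(two_n + 1))
                for k in range(two_n + 1)]
    return dist


if __name__ == "__main__":
    two_n, z0 = 20, 10  # hypothetical small population, x_0 = 0.5
    dist = daf_after(two_n, z0, generations=5)
    mean = sum(k / two_n * pk for k, pk in enumerate(dist))
    print(f"mean frequency = {mean:.4f}")
    print(f"P(lost) = {dist[0]:.4f}, P(fixed) = {dist[-1]:.4f}")
```

Because every step multiplies by a matrix with $(2N+1)^2$ entries, this direct approach is only practical for small populations.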
The genetic variation present in the population is due to ancestral polymorphism, and because no new variation is added, the A allele is eventually fixed or lost (Fig. 3a). The goal is to determine the DAF: the distribution $f(x; r)$ of $x_r$, after evolving for $r$ generations from an initial frequency $x_0$ (Fig. 3b).

We first calculate the first two moments of the DAF. From the binomial sampling, the mean and variance over one generation are given by

$$E[x_{r+1} \mid x_r] = x_r, \qquad \mathrm{Var}(x_{r+1} \mid x_r) = \frac{1}{2N}\, x_r (1 - x_r).$$

The mean and variance after $r$ generations can be obtained by iterating the two expressions above or from alternative derivations (Wright 1942; Crow 1954; Crow and Kimura). The result is

$$E[x_r \mid x_0] = x_0, \qquad (2)$$

$$\mathrm{Var}(x_r \mid x_0) = x_0 (1 - x_0) \left[ 1 - \left( 1 - \frac{1}{2N} \right)^{r} \right]. \qquad (3)$$

For large $N$, we can approximate the variance by

$$\mathrm{Var}(x_t \mid x_0) \approx x_0 (1 - x_0) \left( 1 - e^{-t} \right), \qquad (4)$$

where $t = r/(2N)$. Note that this implies that $N$ can be estimated by equation 4 only if $r$ is known; otherwise only the ratio $t = r/(2N)$ can be estimated.

Markov chain theory. Because the allele frequency at generation $r+1$ only depends on generation $r$, the Wright-Fisher model is a discrete-time finite-space Markov chain. Using this property, the DAF can be obtained from classical Markov chain theory (Karlin and Taylor 1975), where the transition probabilities are given by equation 1 (Williamson and Slatkin). However, this procedure quickly becomes computationally infeasible, as the transition probability matrix has a size of $(2N+1) \times (2N+1)$. By recognizing that most of the probability mass from equation 1 is centered around $z_r$, the computational demand can be reduced by evaluating, storing and using only the transition probabilities that are large enough to contribute significantly to the DAF (Wang 2001; Freeman et al.). Under the assumption of large $N$, diffusion theory (see below) shows that the population size acts as a scaling factor (Feller et al. 1951; Wakeley 2005)

and therefore one could calculate the DAF using a smaller $N$. This approach was used by De Maio et al. (2013; 2015), though they relied on the Moran model rather than the Wright-Fisher. Alternatively, if $N$ is large enough such that the allele frequencies can be treated as continuous, the Markov chain can be built over discretized allele frequencies, and thus the computational burden is controlled by the number of bins. The original discrete binomial sampling probability from equation 1 is then replaced by the continuous normal or beta distributions (Mathieson and McVean 2013; Gompert).

FIGURE 3. a) Simulation under the pure drift model (equation 1) with $2N = 200$ and $x_0 = 0.5$. The vertical bars indicate three sampled time-points. The x-axis denotes the time measured in scaled number of generations. b) DAF at the three sampled time-points. The vertical bars indicate the simulated allele frequencies.

Diffusion approximation. One way to calculate the DAF is to take advantage of the diffusion approximation to the Wright-Fisher model, which is appropriate when the population size $N$ is large, such that both allele frequencies and time can be treated as continuous. Diffusion theory uses two fundamental equations, the Kolmogorov forward and backward equations (Kolmogorov). The forward equation was first used by Wright (1945) to calculate the rate of decay and stationary DAF, whereas Kimura (1957) first used the backward equation to study the problem of fixation. Let us define a new time scale by $\Delta t = 1/(2N)$ such that one time unit corresponds to $2N$ generations. Then, we have

$$2N\, x_{t+\Delta t} \mid x_t \sim \mathrm{Bin}(2N, x_t),$$

from which we can approximate

$$x_{t+\Delta t} \mid x_t \approx \mathcal{N}\big(x_t,\; x_t (1 - x_t)\, \Delta t\big). \qquad (5)$$

Here, $\mathcal{N}(\mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$. Equation 5 corresponds to the time-homogeneous stochastic differential equation

$$dx_t = a(x_t)\, dt + \sqrt{b(x_t)}\, dW_t, \qquad (6)$$

where $\{W_t : t \ge 0\}$ is a standard Brownian motion, and $a(x)$ and $b(x)$ are the infinitesimal mean and variance, respectively. For the Wright-Fisher model, $b(x) = x(1 - x)$, whereas $a(x)$ has different forms depending on the evolutionary forces.
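As an illustration of the stochastic differential equation in equation 6, the following sketch simulates sample paths with a simple Euler-Maruyama scheme. This is a substitute technique chosen for brevity; it is not one of the solvers discussed in the review. The names `simulate_path`, `drift`, and `dt` are hypothetical, the infinitesimal variance is assumed to be $x(1-x)$, and frequencies are clamped to $[0, 1]$ after every step.

```python
import math
import random


def simulate_path(x0, t_end, dt=1e-3, drift=None, rng=random):
    """Euler-Maruyama sketch of dx = a(x) dt + sqrt(x (1 - x)) dW.

    `drift` is the infinitesimal mean a(x); None means a(x) = 0.
    Time is measured in units of 2N generations.
    """
    if drift is None:
        drift = lambda x: 0.0
    x = x0
    for _ in range(round(t_end / dt)):
        noise = math.sqrt(x * (1.0 - x) * dt) * rng.gauss(0.0, 1.0)
        # Clamp so that the frequency stays a valid proportion.
        x = min(1.0, max(0.0, x + drift(x) * dt + noise))
    return x


if __name__ == "__main__":
    rng = random.Random(42)
    samples = [simulate_path(0.5, 0.1, dt=0.01, rng=rng) for _ in range(1000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"sample mean = {mean:.3f}, sample variance = {var:.4f}")
```

With $a(x) = 0$ the boundaries 0 and 1 are absorbing in this sketch, because both the drift and the noise term vanish there.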
Under pure drift, $a(x) = 0$, as is evident from equation 5. The DAF $f(x; t)$ at time $t$ is now determined by the forward Kolmogorov (or Fokker-Planck, or diffusion) equation (Kolmogorov 1931; Crow and Kimura 1970; Ewens 2004)

$$\frac{\partial f(x; t)}{\partial t} = -\frac{\partial}{\partial x} \big\{ a(x) f(x; t) \big\} + \frac{1}{2} \frac{\partial^2}{\partial x^2} \big\{ x (1 - x) f(x; t) \big\}, \qquad (7)$$

with boundary condition $x = x_0$ for $t = 0$. This equation can be solved using different approaches (Table 1). Kimura first described how the DAF can be calculated under pure drift (Kimura 1955a) using the spectral decomposition of equation 7, which results in an infinite sum of scaled Gegenbauer polynomials. In practice, the infinite sum needs to be truncated and the optimal truncation level depends on the convergence properties. This controls the accuracy, but also the computational performance. The diffusion equation can also be solved using purely numerical methods. Chang and Cooper (1970) developed a finite-difference scheme to numerically solve any diffusion equation, whereas Zhao et al. proposed a finite-volume scheme to solve the Wright-Fisher diffusion equation. Gautier and Vitalis (2013) relied on the solution proposed by Kimura (1955a) to estimate divergence times between populations that have been evolving under pure drift, from single time-point data.

Moment-based approximations. The use of the diffusion approximation is limited in practice due to the high computational burden. Cavalli-Sforza and Edwards (1967) approximated pure drift as a Brownian motion process, and current moment-based approximations are reminiscent of that approach, in that they are based on mathematically convenient instrumental distributions. By relying on the equations for the mean (2) and variance (3, 4), we can fit to the true DAF any distribution that is parameterized solely through its first two moments, such as the normal and beta distributions. These two distributions arise as special cases of the DAF approximated from the diffusion theory: the normal

distribution is a transient distribution (equation 5) which is appropriate for very short evolutionary times, whereas the stationary DAF under linear evolutionary pressure is given by a beta distribution (see Box 1, equation B.9). Several authors used the normal distribution (Nicholson et al. 2002; Coop et al. 2010; Gautier et al. 2010; Pickrell and Pritchard 2012; Lacerda and Seoighe 2014; Terhorst et al. 2015), which takes the form

$$x_r \mid x_0 \sim \mathcal{N}\big(E[x_r \mid x_0],\; \mathrm{Var}(x_r \mid x_0)\big). \qquad (8)$$

Equations 5 and 8 are equivalent under pure drift when the number of generations $r$ is small relative to the population size. Then, by using the approximation $e^{-t} \approx 1 - t$ in the variance (equation 3, in its large-$N$ form, equation 4), we recover equation 5 from equation 8 with $\Delta t = r/(2N)$.

Balding and Nichols (1995; 1997) first proposed the use of the Dirichlet distribution, the multivariate generalization of the beta distribution, for the multi-allelic Wright-Fisher (see the multi-allelic section below). For the bi-allelic Wright-Fisher model, the DAF can be approximated with a beta distribution as follows,

$$x_r \mid x_0 \sim \mathrm{Beta}\big(E[x_r \mid x_0],\; \mathrm{Var}(x_r \mid x_0)\big), \qquad (9)$$

where $\mathrm{Beta}(m, v)$ is the beta distribution parameterized by mean $m$ and variance $v$. We note here that a beta distribution always satisfies the condition $v < m(1 - m)$. For the alternative parameterization with shapes $\alpha$ and $\beta$, we have the relation

$$\alpha = \left( \frac{m (1 - m)}{v} - 1 \right) m, \qquad \beta = \left( \frac{m (1 - m)}{v} - 1 \right) (1 - m).$$

Although both the normal and beta distributions have been used for inference, they differ in accuracy. One major difference comes from the support of the distributions. The allele frequency always lies between 0 and 1, and, under the Wright-Fisher model, there can be a positive probability of being either 0 or 1 (the allele is lost or fixed, respectively). The normal distribution is defined over the whole real line, and a positive probability can exist outside $[0, 1]$. If $x_0$ is intermediate and $r$ is small, the probability that $x_r$ falls outside of $[0, 1]$ is small and therefore can be ignored (Pickrell and Pritchard 2012; Lacerda and Seoighe 2014; Terhorst et al.). If $x_0$ is close to the boundaries, the normal distribution from equation 8 can be truncated to $[0, 1]$.
The probabilities in the intervals $(-\infty, 0]$ and $[1, \infty)$ are added as two atoms at 0 and 1 and serve as the loss and fixation probabilities, respectively (Nicholson et al. 2002; Coop et al. 2010; Gautier et al.). Gautier and Vitalis (2013) noted that the truncated normal distribution no longer has the true variance of the DAF.

Unlike the normal distribution, the beta distribution has support in $[0, 1]$. However, due to its continuous nature, the beta distribution cannot account for the discrete events that $x_r$ can be 0 or 1. Tataru et al. addressed this issue and introduced a new approximation, the beta with spikes, a beta distribution for the polymorphic frequencies $0 < x_r < 1$, supplemented by two spikes at 0 and 1 accounting for the loss and fixation probabilities. Then the distribution of $x_r$ is

$$x_r \mid x_0 \sim \mathrm{Beta}^{*}\big(E[x_r \mid x_0],\; \mathrm{Var}(x_r \mid x_0),\; p_0,\; p_1\big),$$

where $\mathrm{Beta}^{*}(m, v, p_0, p_1)$ is the beta with spikes distribution parameterized by mean $m$, variance $v$, and probabilities $p_0$ and $p_1$ found at 0 and 1, respectively. This is given by

$$\mathrm{Beta}^{*}(x; m, v, p_0, p_1) = p_0\, \delta(x) + p_1\, \delta(1 - x) + (1 - p_0 - p_1)\, \frac{x^{\alpha^{*} - 1} (1 - x)^{\beta^{*} - 1}}{B(\alpha^{*}, \beta^{*})}.$$

Here, $\delta$ is the Dirac delta function, introduced to account for the non-zero probabilities at the boundaries, and $m^{*}$ and $v^{*}$ are the mean and variance of the beta distribution for the polymorphic frequencies, given by (Tataru et al.)

$$m^{*} = \frac{m - p_1}{1 - p_0 - p_1}, \qquad v^{*} = \frac{v + m^2 - p_1}{1 - p_0 - p_1} - (m^{*})^2.$$

The beta function $B(\alpha^{*}, \beta^{*})$ acts as a normalization factor, where $\alpha^{*}$ and $\beta^{*}$ are the shape parameters of $\mathrm{Beta}(m^{*}, v^{*})$ (equation 9). Using equations 2 and 3 for the mean and variance, the normal and beta approximations of the DAF can be written in closed form. However, the loss and fixation probabilities are not known in closed form, and therefore the beta with spikes relies on a recursive approach to calculate these probabilities (see Tataru et al. for details).

The moment-based approximations have been used in a series of inference methods (Table 1). Hui and Burt (2015) used the beta distribution to infer the effective size of one population undergoing pure drift from time series data.
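The moment matching behind the beta and beta with spikes approximations can be sketched as follows. The function names `beta_shapes` and `beta_with_spikes_params` are hypothetical. The spike probabilities $p_0$ and $p_1$ are taken as given inputs here, since the recursive scheme that computes them is not reproduced in this sketch.

```python
def beta_shapes(m, v):
    """Shape parameters (alpha, beta) of a beta distribution with mean m
    and variance v; requires 0 < m < 1 and 0 < v < m (1 - m)."""
    if not 0.0 < m < 1.0 or not 0.0 < v < m * (1.0 - m):
        raise ValueError("need 0 < m < 1 and 0 < v < m (1 - m)")
    common = m * (1.0 - m) / v - 1.0
    return common * m, common * (1.0 - m)


def beta_with_spikes_params(m, v, p0, p1):
    """Mean, variance and shapes of the beta part of a beta with spikes.

    m, v   : overall mean and variance of the allele frequency
    p0, p1 : loss and fixation probabilities (assumed known here)
    """
    mass = 1.0 - p0 - p1                      # weight of the polymorphic part
    m_star = (m - p1) / mass                  # conditional mean
    v_star = (v + m * m - p1) / mass - m_star ** 2  # conditional variance
    alpha_star, beta_star = beta_shapes(m_star, v_star)
    return m_star, v_star, alpha_star, beta_star


if __name__ == "__main__":
    # Hypothetical values: overall moments plus spike probabilities.
    print(beta_with_spikes_params(m=0.48, v=0.1096, p0=0.1, p1=0.2))
```

Setting $p_0 = p_1 = 0$ recovers the plain beta approximation of equation 9.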
Sirén et al. and Tataru et al. used single time-point data to infer divergence times between populations evolving under pure drift. Sirén et al. used the beta distribution, and therefore could not accurately model the alleles that are close to being lost or fixed. Tataru et al. used the beta with spikes approximation and demonstrated that the addition of spikes leads to a more accurate inference compared with merely using the beta distribution.

Quality of approximations. We evaluated the accuracy of the approximations to the true DAF obtained

from the Markov chain property, using the Hellinger distance (Le Cam and Yang 2000), which lies between 0 and 1, with 0 indicating a perfect match of the two distributions. The diffusion approximation is the most accurate, whereas the truncated normal and beta distributions are the least accurate (Fig. 4). They approximate the true DAF well when the probability mass is away from the boundaries: $x_0$ is close to 0.5 and the generation $r$ is not too large. As $r$ increases, the frequency drifts away from $x_0$ and more and more probability accumulates at the boundaries. The beta distribution fails to capture this, whereas the atoms and spikes in the truncated normal and beta with spikes distributions, respectively, approximate these probabilities with various degrees of accuracy. Overall, the beta with spikes distribution is more accurate than both the truncated normal and beta distributions.

FIGURE 4. Fit of various approximations to the pure drift true DAF, calculated using the Markov chain property for $2N = 200$ and a range of $x_0$ and $r/2N$. Each column shows a different type of approximation, indicated at the top of the figure (truncated normal, beta, beta with spikes, diffusion). a) Hellinger distance (on log scale) between the approximated and true DAF. The three marks in each of the heatmaps indicate the combinations of $x_0$ and $r/2N$ used in b. b) True (dashed lines) and approximated (solid lines) DAF for $x_0 = 0.5$ and different values of $r/2N$. The truncated normal, beta and beta with spikes are discretized as in Tataru et al. The diffusion DAF is calculated as in Zhao et al. 2013, with =0.01 and K =. We used $2N = 200$ for computational reasons, but we see similar patterns for larger $N$.

Neutral Mutations

The most common way to introduce variation in a population is by allowing the alleles to mutate (Fig. 5).

Dynamics and moments.
If $u$ is the probability of a mutation from A to a, and $v$ is the probability of the reverse event, the sampling probability from equation 1 is changed by allowing each individual to undergo a mutation after choosing its parent. Therefore, the individual is carrying an A allele if the parent had an A allele (probability $x_r$) and there was no mutation (probability $1 - u$), or the parent had an a allele (probability $1 - x_r$) and it mutated (probability $v$), leading to a sampling probability $x_r (1 - u) + (1 - x_r) v = (1 - u - v) x_r + v$. Then, the binomial distribution of $z_{r+1}$ becomes

$$z_{r+1} \mid z_r \sim \mathrm{Bin}\big(2N,\; (1 - u - v)\, x_r + v\big). \qquad (10)$$

For large $N$, Crow and Kimura (1956) derived general formulas for all moments of $x_t$. The mean and variance after $r$ generations of evolution can also be obtained by repeated use of the laws of total expectation and variance (Sirén). Tataru et al. provided the

formulas:

$$E[x_t \mid x_0] = \frac{\beta}{\alpha + \beta} + e^{-(\alpha + \beta) t} \left( x_0 - \frac{\beta}{\alpha + \beta} \right), \qquad (11)$$

$$\mathrm{Var}(x_t \mid x_0) = \frac{\alpha \beta}{(\alpha + \beta)^2}\, \frac{1 - e^{-(2(\alpha + \beta) + 1) t}}{2(\alpha + \beta) + 1} - \left( x_0 - \frac{\beta}{\alpha + \beta} \right)^{2} e^{-2(\alpha + \beta) t} \left( 1 - e^{-t} \right) + \frac{\alpha - \beta}{\alpha + \beta} \left( x_0 - \frac{\beta}{\alpha + \beta} \right) e^{-(\alpha + \beta) t}\, \frac{1 - e^{-(\alpha + \beta + 1) t}}{\alpha + \beta + 1}, \qquad (12)$$

where $t = r/(2N)$, $\alpha = 2Nu$, and $\beta = 2Nv$.

FIGURE 5. Dynamics in the bi-allelic Wright-Fisher model with mutations. If the parental allele is A, the child has the same allele with probability $1 - u$, and a mutation occurs with probability $u$. If the parental allele is a, the child allele is a with probability $1 - v$, and becomes A with probability $v$. [Example shown: generation $r$ with $z_r = 4$ and generation $r+1$ with $z_{r+1} = 3$.]

Diffusion approximation. The diffusion approximation of the Wright-Fisher with neutral mutations is obtained in a similar way as for pure drift. Let $\alpha = 2Nu$ and $\beta = 2Nv$ be the scaled mutation rates, and we again scale the time in units of $2N$ generations. Recall that the infinitesimal variance is independent of the evolutionary forces. For neutral mutations, the infinitesimal mean is given by

$$a(x) = -\alpha x + \beta (1 - x). \qquad (13)$$

When new variation is constantly introduced in the population, after enough time, the allele frequency will reach a stationary distribution. This was first obtained by Wright (1931) by noting that at stationarity, the mean and variance are unchanged between successive generations. Later on, the stationary DAF was re-derived using alternative methods, including diffusion (Wright 1945). The stationary DAF for neutral mutations is given by a beta distribution with shape parameters $2\beta$ and $2\alpha$ (Crow and Kimura 1970; Ewens). Note that this result is in agreement with the mean (equation 11) and variance (equation 12) in the limit $t \to \infty$. The spectral decomposition method developed by Kimura (1955a) to calculate the DAF under pure drift was extended to calculate the DAF with recurrent mutation (Crow and Kimura 1956; 1970; Song and Steinrücken 2012), and to incorporate mutation rates and population sizes that vary in time in a piecewise constant manner (Steinrücken et al.).

Moment-based approximations. Using the moments of the DAF for the bi-allelic Wright-Fisher with neutral mutations (equations 11 and 12), the moment-based approximations are obtained just as for pure drift.
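The mean and variance under neutral mutations can be evaluated directly, as in the sketch below. It assumes that equations 11 and 12 take the form of the general linear-pressure result of Box 1 with $A = \alpha + \beta$ and $B = \beta$; the function name `mutation_moments` is hypothetical. The example compares the long-time limit with the moments of the stationary beta distribution with shapes $2\beta$ and $2\alpha$.

```python
import math


def mutation_moments(x0, t, alpha, beta):
    """Mean and variance of x_t given x_0 under neutral mutations.

    alpha, beta : scaled mutation rates (assumed alpha = 2N u, beta = 2N v)
    t           : time in units of 2N generations
    """
    a_tot = alpha + beta
    mu = beta / a_tot          # stationary mean
    d = x0 - mu                # initial displacement from the stationary mean
    mean = mu + math.exp(-a_tot * t) * d
    var = (mu * (1.0 - mu) * (1.0 - math.exp(-(2 * a_tot + 1) * t))
           / (2 * a_tot + 1)
           - d * d * math.exp(-2 * a_tot * t) * (1.0 - math.exp(-t))
           + (1.0 - 2.0 * mu) * d * math.exp(-a_tot * t)
           * (1.0 - math.exp(-(a_tot + 1) * t)) / (a_tot + 1))
    return mean, var


if __name__ == "__main__":
    alpha = beta = 0.05
    mean, var = mutation_moments(x0=0.1, t=1000.0, alpha=alpha, beta=beta)
    # Moments of Beta(2 beta, 2 alpha) for comparison.
    a, b = 2 * beta, 2 * alpha
    print(mean, a / (a + b))
    print(var, a * b / ((a + b) ** 2 * (a + b + 1)))
```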
Quality of approximations. The non-zero mutation probabilities introduce variation in the population, and reduce the loss and fixation probabilities relative to pure drift (Figs. 4 and 6). For example, under pure drift, the probability that the mutation is lost (fixed) at $r/2N = 0.5$ is 0.072, while when alleles mutate with $\alpha = \beta = 0.05$, the probability is reduced. As more of the probability mass is now found away from the 0 and 1 boundaries, all approximations have an overall improved fit to the true DAF (Fig. 6).

Migration

In its simplest form, the migration model describes the evolution of the allele frequency in one population that sends migrants, with probability $m$, to an infinitely large population with constant allele frequency $x_c$, and receives immigrants such that the population size stays constant over time. Then the allele count at generation $r+1$ is given by (Crow and Kimura 1970)

$$z_{r+1} \mid z_r \sim \mathrm{Bin}\big(2N,\; (1 - m)\, x_r + m\, x_c\big). \qquad (14)$$

Under pure drift, the sampling among the alleles in generation $r$ is done uniformly (equation 1). However, as different evolutionary pressures act on the allele, the sampling probability is changed, as observed for neutral mutations (equation 10) and migration (equation 14). We can capture all the evolutionary pressures acting on the allele in a function $g : [0, 1] \to [0, 1]$ which alters the sampling probability of the binomial distribution from equation 1. We then obtain the more general process

$$z_{r+1} \mid z_r \sim \mathrm{Bin}\big(2N,\; g(x_r)\big). \qquad (15)$$

The evolutionary pressures for pure drift, mutation, and migration are linear in $x_r$ (see Box 1) and are therefore collectively called linear pressure (Crow and Kimura). It is this linearity that allows the calculation of the first two moments of the DAF in closed form. One can formulate a general linear evolutionary pressure model, where pure drift, mutation and migration are special cases (see Box 1).

The migration model from equation 14 is a good approximation if the immigrants represent a random sample of the entire species (Crow and Kimura). This is often not the case, and migrants are typically exchanged by at least two populations that have nonconstant allele frequencies.
This leads to an evolutionary pressure $g$ that is dependent on the generation $r$, and the DAFs of both populations need to be modeled jointly.

Markov chain theory. Mathieson and McVean (2013) inferred effective population sizes and migration rates from time series data (Table 1) while modeling multiple

populations distributed on a lattice, where neighboring populations exchange migrants every generation.

FIGURE 6. Fit of various approximations to the true DAF with neutral mutations, calculated using the Markov chain property for $2N = 200$, $\alpha = \beta = 0.05$ and a range of $x_0$ and $r/2N$. Each column shows a different type of approximation, indicated at the top of the figure (truncated normal, beta, beta with spikes, diffusion). a) Hellinger distance (on log scale) between the approximated and true DAF. The three marks in each of the heatmaps indicate the combinations of $x_0$ and $r/2N$ used in b. b) True (dashed lines) and approximated (solid lines) DAF for $x_0 = 0.5$ and different values of $r/2N$. Calculations are performed as for Figure 4. For comparison purposes, the a) heatmap and b) y-axis scales are the same as in Figure 4.

Diffusion approximation. Gutenkunst et al. built a diffusion equation to model jointly the allele frequencies in multiple populations. They solved this equation using the finite-difference scheme to infer divergence times between populations, and mutation and migration rates. From the joint DAF, Gutenkunst et al. calculated the expected multi-population allele frequency spectrum (AFS), which summarizes allele frequency data. Because the dimension of the AFS depends on the number of populations, the time needed to compute the AFS grows exponentially with the number of populations. This limited their analysis to only three populations. Lukić and Hey (2012) also calculated the expected AFS, but they extended the spectral decomposition method to calculate the joint DAF of multiple populations that exchange migrants, while accounting for de novo mutations. The implementation of Lukić and Hey (2012) was optimized to use little memory, and can therefore tackle more than three populations. However, compared with Gutenkunst et al. 2009, it has a lower computational speed on two and three populations.

Moment-based approximations.
Pickrell and Pritchard (2012) used the normal distribution to infer divergence times between populations that have been evolving under pure drift and have exchanged migrants. Because it relies on the normal distribution, the method is not accurate for alleles with frequencies close to 0 or 1.

Quality of approximations. As both the neutral mutation (equation 10) and migration (equation 14) models are special cases of the general linear evolutionary pressure model (Box 1), the quality of the approximations is similar. The approximation quality shown in Figure 6, where $2Nu=2Nv=0.05$, also applies for $2Nm=2N(u+v)=0.01$ and $c=v/(u+v)=0.5$.

Selection

When selection is present, the different genotypes are transmitted to the next generation with different probabilities, determined by their fitness. If the A allele

has frequency $x$ and selection is parameterized by the selection coefficient $s$ and the dominance parameter $h$, the three possible genotypes have the following frequencies (assuming Hardy-Weinberg equilibrium) and fitness (Crow and Kimura 1970):

Genotype     AA       Aa          aa
Frequency    x^2      2x(1-x)     (1-x)^2
Fitness      1+s      1+sh        1

The allele count $z(r+1)$ still follows the process given in equation 15, with the evolutionary pressure function from equation B.7.

BOX 1. Evolutionary models for the bi-allelic Wright-Fisher

Consider the general bi-allelic Wright-Fisher process, where $g:[0,1]\to[0,1]$ captures the evolutionary pressures acting on the allele,
$$z(r+1)\mid z(r)\sim\mathrm{Bin}\bigl(2N,\,g(x(r))\bigr). \tag{B.1}$$
The function $g$ can take different forms.

General linear evolutionary pressure:
$$g(x)=(1-a)x+b,\quad\text{for } 0\le b\le a<1, \tag{B.2}$$
where $a$ and $b$ are given by
$$\begin{array}{lll}\text{Pure drift:} & a=0, & b=0,\\ \text{Mutation:} & a=u+v, & b=v,\\ \text{Migration:} & a=m, & b=mc.\end{array} \tag{B.3}$$
Let $A=2Na$, $B=2Nb$ and $t=r/2N$. For large $N$, the mean and variance for the DAF are given by (Tataru et al.)
$$E[x(t)\mid x(0)]=\frac{B}{A}+e^{-At}\Bigl(x(0)-\frac{B}{A}\Bigr), \tag{B.4}$$
$$\mathrm{Var}(x(t)\mid x(0))=\frac{B}{A}\Bigl(1-\frac{B}{A}\Bigr)\frac{1-e^{-(2A+1)t}}{2A+1}-e^{-2At}\bigl(1-e^{-t}\bigr)\Bigl(x(0)-\frac{B}{A}\Bigr)^2+\Bigl(1-\frac{2B}{A}\Bigr)\Bigl(x(0)-\frac{B}{A}\Bigr)e^{-At}\,\frac{1-e^{-(A+1)t}}{A+1}. \tag{B.5}$$
For pure drift, $A=B=0$ and we set $0/0:=1$. Note that equations 2, 4, 11, and 12 can be obtained as special cases of the above.

Selection (non-linear evolutionary pressure):
$$g(x)=\frac{(1+s)x^2+(1+sh)x(1-x)}{1+sx^2+2sh\,x(1-x)} \tag{B.6}$$
$$\approx x+s\,x(1-x)\bigl(h+(1-2h)x\bigr), \tag{B.7}$$
where the approximation relies on the selection coefficients $s$ and $sh$ being small (Crow and Kimura 1970).

Selection with linear evolutionary pressure: Alleles can undergo linear evolutionary pressure and selection jointly. Then,
$$g(x)=(1-a)\bigl\{x+s\,x(1-x)\bigl(h+(1-2h)x\bigr)\bigr\}+b. \tag{B.8}$$

Stationary distribution: When $A,B\neq 0$, variation is constantly introduced in the population and the DAF has a stationary distribution given by (up to a normalization constant)
$$f(x)\propto x^{2B-1}(1-x)^{2(A-B)-1}e^{S\,x(2h+(1-2h)x)}, \tag{B.9}$$
where $S=2Ns$ is the scaled selection coefficient. When $s=0$, we obtain a beta distribution with shape parameters $2B$ and $2(A-B)$, which is in agreement with the expressions for mean and variance in the limit $t\to\infty$.

Dynamics and moments.
The first two moments of the DAF for the general linear evolutionary pressure (equations B.4 and B.5) can be obtained using the law of total expectation and variance, respectively. These take

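the form
$$E[x(r+1)\mid x(0)]=E\bigl[g(x(r))\mid x(0)\bigr], \tag{16}$$
$$\mathrm{Var}(x(r+1)\mid x(0))=\frac{1}{2N}E[x(r+1)\mid x(0)]-E[x(r+1)\mid x(0)]^2+\Bigl(1-\frac{1}{2N}\Bigr)E\bigl[g^2(x(r))\mid x(0)\bigr]. \tag{17}$$
The evaluation of $E[g(x(r))\mid x(0)]$ and $E[g^2(x(r))\mid x(0)]$ typically requires all moments of $x(r)$. However, these can be written as functions of only the first two moments when $g$ is a linear function in $x$, allowing the above recursions to be solved in closed form (Tataru et al.). When the allele is under selection and $g$ is no longer linear, we can approximate $E[g^i(x(r))\mid x(0)]$ using only the first two moments by relying on a Taylor series. This yields a recursion for calculating the mean and variance of the DAF. The Taylor series can be evaluated around the deterministic trajectory of $x(r)$ (Barton and Otto 2005; Terhorst et al. 2015), or around the pre-calculated mean of $x(r)$ (Lacerda and Seoighe 2014). To obtain the Taylor series about the deterministic trajectory, we decompose $x(r)$ as $x(r)=\bar{x}(r)+\varepsilon(r)$, where $\bar{x}(r)=g(\bar{x}(r-1))$ represents the deterministic trajectory followed by the allele frequency in the infinite-population limit, and $\varepsilon(r)$ is the random disturbance away from $\bar{x}(r)$. Then,
$$E[x(r)\mid x(0)]=\bar{x}(r)+E[\varepsilon(r)\mid x(0)], \tag{18}$$
$$\mathrm{Var}(x(r)\mid x(0))=\mathrm{Var}(\varepsilon(r)\mid x(0)). \tag{19}$$
From equations 16 and 18 we obtain, using the Taylor series for $E[g(x(r))\mid x(0)]$ about $\bar{x}(r)$,
$$E[\varepsilon(r+1)\mid x(0)]\approx E[\varepsilon(r)\mid x(0)]\,\frac{dg}{dx}\Big|_{\bar{x}(r)}+\frac{1}{2}E\bigl[\varepsilon^2(r)\mid x(0)\bigr]\,\frac{d^2g}{dx^2}\Big|_{\bar{x}(r)}.$$
Similarly, from the Taylor series of $E[g^2(x(r))\mid x(0)]$ about $\bar{x}(r)$, and using equations 17, 18, and 19, we obtain the recursion for $E[\varepsilon^2(r+1)\mid x(0)]$,
$$E\bigl[\varepsilon^2(r+1)\mid x(0)\bigr]\approx\frac{1}{2N}\bar{x}(r+1)\bigl(1-\bar{x}(r+1)\bigr)+\frac{1}{2N}\bigl(1-2\bar{x}(r+1)\bigr)E[\varepsilon(r+1)\mid x(0)]+\Bigl(1-\frac{1}{2N}\Bigr)E\bigl[\varepsilon^2(r)\mid x(0)\bigr]\Bigl(\frac{dg}{dx}\Big|_{\bar{x}(r)}\Bigr)^2.$$
By iterating the recursions above and calculating numerically the first two moments of $\varepsilon(r)$, we can recover the mean and variance of the DAF after $r$ generations.

Markov chain theory. Mathieson and McVean (2013) and Gompert (2015) inferred selection from time series data by discretizing continuous allele frequencies and building a Markov chain with normal and beta transition probabilities, respectively (Table 1). Gompert (2015) additionally allowed for variability in time of selection coefficients and population sizes.

Diffusion approximation.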
For a Wright-Fisher model with drift, mutation and selection, specified by equations B.1, B.2, B.3, and B.8, and letting $S=2Ns$, we obtain the following infinitesimal mean
$$a(x)=-2Nu\,x+2Nv\,(1-x)+S\,x(1-x)\bigl(h+(1-2h)x\bigr).$$
The diffusion equation when selection is present is the most difficult to solve. However, the stationary distribution is known in closed form (Wright 1937; Crow and Kimura 1970; Ewens 2004) and is, up to a normalization constant, given by a tilted beta distribution
$$f(x)\propto x^{4Nv-1}(1-x)^{4Nu-1}e^{S\,x(2h+(1-2h)x)}. \tag{20}$$
We note here that the diffusion limit to the Wright-Fisher model requires that the parameters involved in the evolutionary pressure, $u$, $v$, $m$, $s$, and $sh$, are all of the order of $1/2N$, such that the resulting scaled parameters, $2Nu$, $2Nv$, $2Nm$, $2Ns$, and $2Nsh$, are of the order of 1. This is the source of the approximation of equation B.6 with equation B.7, and of the common practice of simplifying expressions by removing small terms (Feller et al. 1951; Wakeley). It also indicates that in the diffusion limit, the population size $N$ acts as a scaling factor, and a rescaling of the parameters and time by a constant factor will not affect the DAF. This result is responsible for the notion that it is impossible to estimate, for example, the mutation rate and effective population size separately. However, although it may be true that there is low power in doing so, this is simply a consequence of the assumptions of the diffusion approximation. These might be expected to break down in cases in which the diffusion is not appropriate (Wakeley). In this respect, the moment-based approximations are free of the small parameters assumption, especially because the mean and variance of the general linear evolutionary pressure can be calculated without making the approximation of large $N$ (Tataru et al.). Therefore, moment-based approximations might be more appropriate when the evolutionary pressure is strong (Lacerda and Seoighe). Using the spectral decomposition of the diffusion equation, Kimura (1955b, 1957) found the DAF when selection is present.
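The remark that the moments under linear evolutionary pressure do not need the large-$N$ approximation can be checked numerically. The sketch below is illustrative only: the function names `exact_linear_moments` and `large_n_moments` and the parameter values are assumptions made here, not part of any published software. It iterates the recursion implied by equations 16 and 17 for $g(x)=(1-a)x+b$ and compares the result with the large-$N$ expressions B.4 and B.5.

```python
import math


def exact_linear_moments(x0, a, b, two_n, generations):
    """Mean and variance of x(r) for g(x) = (1 - a) x + b, with no large-N step.

    Uses equations 16 and 17; for linear g, Var(g(x)) = (1 - a)^2 Var(x).
    """
    mean, var = x0, 0.0
    for _ in range(generations):
        mean = (1.0 - a) * mean + b  # E[x(r+1)] = E[g(x(r))]
        # Binomial sampling noise plus the propagated variance of g(x(r)).
        var = mean * (1.0 - mean) / two_n + (1.0 - 1.0 / two_n) * (1.0 - a) ** 2 * var
    return mean, var


def large_n_moments(x0, a, b, two_n, generations):
    """Equations B.4 and B.5 with A = 2N a, B = 2N b, t = r / 2N (needs a > 0)."""
    big_a, big_b, t = two_n * a, two_n * b, generations / two_n
    mu = big_b / big_a  # stationary mean B / A
    delta = x0 - mu
    mean = mu + math.exp(-big_a * t) * delta
    var = (
        mu * (1.0 - mu) * (1.0 - math.exp(-(2.0 * big_a + 1.0) * t)) / (2.0 * big_a + 1.0)
        - math.exp(-2.0 * big_a * t) * (1.0 - math.exp(-t)) * delta ** 2
        + (1.0 - 2.0 * mu) * delta * math.exp(-big_a * t)
        * (1.0 - math.exp(-(big_a + 1.0) * t)) / (big_a + 1.0)
    )
    return mean, var


if __name__ == "__main__":
    # Hypothetical values: x(0), a = u + v, b = v, 2N, number of generations r.
    args = (0.2, 0.0005, 0.0002, 1000, 500)
    print(exact_linear_moments(*args))
    print(large_n_moments(*args))
```

For parameters of order $1/2N$ the two pairs of moments should agree closely; the gap is expected to widen as $a$ and $b$ grow relative to $1/2N$, which is the regime where the exact recursion is the safer choice.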
This spectral decomposition approach was extended by Song and Steinrücken (2012) to improve the convergence properties for stronger selection, whereas

Steinrücken et al. (2014) developed it further to model selection coefficients that vary over time in a piecewise constant manner. The DAF was also calculated using a finite-difference scheme (Bollback et al. 2008), a finite-volume scheme (Zhao et al. 2013), a path integral formalism (Schraiber 2014) and other numerical approaches (Malaspinas et al. 2012; Ferrer-Admetlla et al.). Bollback et al. (2008), Steinrücken et al. (2014), Malaspinas et al. (2012) and Ferrer-Admetlla et al. jointly estimated selection coefficients and effective population sizes from time series data from one population. Ferrer-Admetlla et al. could additionally infer mutation rates. Živković et al. used the spectral decomposition of Song and Steinrücken (2012) to infer mutation, selection and variable population size from present data from one population. Vitalis et al. used the stationary distribution of the DAF when multiple populations exchange migrants and experience selection. As they used the stationary DAF, they could not recover any information about the divergence of the populations. We would like to note here that although the method of Gutenkunst et al. (2009) can in principle incorporate selection, the inference software does not estimate selection coefficients.

Moment-based approximations. Using the numerically approximated moments of the DAF, the truncated normal and beta distributions are obtained as previously. The beta with spikes approximation has not been extended to include selection. However, the approximation developed by Tataru et al. for the loss and fixation probabilities should still be reasonable if the selection pressure is small and the loss and fixation probabilities are mainly dominated by genetic drift. Moment-based approximations have had limited use for inference of selection due to the difficulties in calculating the first two moments of the DAF. Both Lacerda and Seoighe (2014) and Terhorst et al. (2015) estimated effective population sizes and selection coefficients from time series data, using the normal distribution and the Taylor expansion approach.
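The Taylor expansion approach can be sketched in a few lines. The code below is a minimal illustration, not the implementation of any of the cited methods: it expands about the deterministic trajectory, uses the approximate pressure function of equation B.7, keeps terms up to second order in the disturbance, and then moment-matches a beta distribution. The names `selection_moments` and `beta_parameters` are hypothetical.

```python
def selection_moments(x0, s, h, two_n, generations):
    """Approximate mean and variance of x(r) under selection (second-order Taylor)."""
    c = 1.0 - 2.0 * h

    def g(x):  # equation B.7
        return x + s * x * (1.0 - x) * (h + c * x)

    def dg(x):  # first derivative of g
        return 1.0 + s * (h + 2.0 * (c - h) * x - 3.0 * c * x * x)

    def d2g(x):  # second derivative of g
        return s * (2.0 * (c - h) - 6.0 * c * x)

    xbar, e1, e2 = x0, 0.0, 0.0  # trajectory, E[eps], E[eps^2]
    for _ in range(generations):
        slope, curv = dg(xbar), d2g(xbar)
        xbar_next = g(xbar)
        e1_next = e1 * slope + 0.5 * e2 * curv
        e2_next = (
            xbar_next * (1.0 - xbar_next) / two_n
            + (1.0 - 2.0 * xbar_next) * e1_next / two_n
            + (1.0 - 1.0 / two_n) * slope ** 2 * e2
        )
        xbar, e1, e2 = xbar_next, e1_next, e2_next
    # Equations 18 and 19: shift the mean, and use Var(eps) for the variance.
    return xbar + e1, e2 - e1 ** 2


def beta_parameters(mean, var):
    """Moment-matched beta shapes, or None when var >= mean (1 - mean)."""
    if not 0.0 < mean < 1.0 or var <= 0.0 or var >= mean * (1.0 - mean):
        return None
    phi = mean * (1.0 - mean) / var - 1.0
    return mean * phi, (1.0 - mean) * phi


if __name__ == "__main__":
    # Hypothetical values: 2N = 200, S = 2Ns = 1, additive selection.
    m, v = selection_moments(0.5, 0.005, 0.5, 200, 100)
    print(m, v, beta_parameters(m, v))
```

With $s=0$ the recursion reduces to the exact pure drift variance, which gives a convenient sanity check. Returning `None` when the variance is too large mirrors the situation where the approximated moments cannot be matched by any beta distribution.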
One critical difference between these two methods is that Lacerda and Seoighe (2014) assumed additive selection ($h=0.5$) and used a Taylor series about the mean of $x(r)$, whereas Terhorst et al. (2015) made no assumptions about dominance and used a Taylor series about the deterministic trajectory. Additionally, Terhorst et al. (2015) were the first to incorporate linkage, but in practice their model is limited to jointly analyzing only a small number of loci (typically 3).

Quality of approximations. Relative to pure drift, positive selection acts by increasing the expected frequency and probability of fixation of the A allele, and decreasing the probability of loss (Figs. 4 and 7). For example, under pure drift and with a beginning frequency of $x(0)=0.5$, the probability that the mutation is lost (fixed) at $r/2N=0.5$ is …, while when selection is present with $S=1$, the probability is reduced (increased) to …. Overall, for $S=1$, all approximations have a fit to the true DAF (Fig. 7) that is very similar to that for pure drift (Fig. 4). We note here that $S=1$ is a very small selection coefficient. For large values of $S$, the Taylor series approach leads to estimated values for the mean $m$ and variance $v$ for which $v>m(1-m)$, and these cannot be fitted by a beta distribution.

MULTI-ALLELIC WRIGHT-FISHER MODEL

The bi-allelic Wright-Fisher model is typically a very good approximation for SNP data because the per-nucleotide mutation rate is typically small, but due to highly mutable sites, ancestral polymorphism, very large sample sizes or large evolutionary distances, a number of SNPs may contain 3 or 4 alleles. Furthermore, highly variable loci (e.g., short tandem repeats) are still widely used, especially in forensics (Balding and Nichols 1997; Balding and Steele 2015), and are typically multi-allelic. In these cases, the data can be analyzed using the multi-allelic Wright-Fisher model, an extension of the bi-allelic model.
Instead of following the frequency of one allele, which is sampled from a binomial distribution from one generation to the next, the multi-allelic model describes the joint distribution of the $K$ alleles present in the population, which are now sampled from one generation to the next from a multinomial distribution.

Pure Drift

Similar to the bi-allelic model, the simplest form is the pure random genetic drift model, where the stochastic fluctuations in the allele frequencies are purely determined by the random mating of the finite population (Fig. 8).

Dynamics and moments. Let $z_i(r)$ be the number of $i$ alleles in generation $r$, $z(r)=(z_1(r),\ldots,z_K(r))$ and $x(r)=z(r)/2N$ be the corresponding allele frequency. The distribution of $z(r+1)$ is
$$z(r+1)\mid z(r)\sim\mathrm{Mult}\bigl(2N,\,x(r)\bigr). \tag{21}$$
Here, $\mathrm{Mult}(n,p)$ is the multinomial distribution with sample size $n$ and probability vector $p$. To determine the mean and covariance of the DAF, we move from discrete generations to continuous time where one time unit corresponds to $2N$ generations, and set $t=r/2N$. Then,
$$E[x(t)\mid x(0)]=x(0), \tag{22}$$
$$\mathrm{Var}(x(t)\mid x(0))=\bigl(1-e^{-t}\bigr)\bigl\{\mathrm{diag}\{x(0)\}-x(0)'x(0)\bigr\}, \tag{23}$$

where $'$ denotes vector transpose. These formulas are natural extensions of equations 2 and 4.

FIGURE 7. Fit of various approximations to the true DAF with selection, calculated using the Markov chain property for $2N=200$, $S=h=1$, $2Nu=2Nv=0$ and a range of $x(0)$ and $r/2N$. Each column shows a different type of approximation (truncated normal, beta, beta with spikes, diffusion), indicated at the top of the figure. (a) Hellinger distance (on log scale) between the approximated and true DAF. The three marked points in each of the heatmaps indicate the combinations of $x(0)$ and $r/2N$ used in (b). (b) True (dashed lines) and approximated (solid lines) DAF for $x(0)=0.5$ and different values of $r/2N$. Calculations are performed as for Figure 4. For comparison purposes, the (a) heatmap and (b) y-axis scales are the same as in Figure 4.

FIGURE 8. Dynamics in the pure drift $K=4$ multi-allelic Wright-Fisher model for A, G, C, T. The child inherits the parental allele. In the illustrated example, generation $r$ carries the alleles A T A G T T A G, so $z(r)=(3,2,0,3)$, and generation $r+1$ carries T A G A A T G G, so $z(r+1)=(3,3,0,2)$, with counts given in the order A, G, C, T.

Diffusion approximation. Diffusion theory can be extended from the bi-allelic to the multi-allelic case. We will not cover this here, but refer to Ewens (2004; section 4.8, p. 151) for a general discussion of multidimensional diffusion processes, and Ewens (2004; section 5.10, p. 192) for the K-allele pure drift Wright-Fisher model. In particular, Ewens (2004) mentions that a generalization of equation 7 can be formulated and that a generalization of Kimura's solution in terms of orthogonal polynomials exists.

Moment-based approximations. The beta distribution is a natural choice for approximating the DAF for the bi-allelic Wright-Fisher model, and it provides a good approximation when the allele is not close to being lost or fixed (Figs. 4, 6, and 7).
It is therefore natural to approximate the DAF for the multi-allelic Wright-Fisher model using the generalization of the beta distribution, the Dirichlet distribution (Balding and Nichols 1995). Just like for the bi-allelic case, where the beta distribution arises as the stationary DAF under linear evolutionary pressure, the Dirichlet distribution is the stationary DAF for a specific mutation model (Ewens 2004; see below). Under the Dirichlet model, also called the Balding-Nichols model (Balding and Steele 2015), the allele frequency vector $x(t)$ follows a Dirichlet distribution, $x(t)\mid x(0)\sim\mathrm{Dirichlet}(\alpha)$, where $\alpha=(\alpha_1,\ldots,\alpha_K)>0$. This implies that allele $i=1,\ldots,K$ has marginal distribution $x_i(t)\sim\mathrm{Beta}(\alpha_i,\alpha_0-\alpha_i)$, with $\alpha_0=\sum_{i=1}^{K}\alpha_i$.
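As an illustration of the Dirichlet model, the sketch below chooses $\alpha$ so that the standard Dirichlet mean and covariance reproduce the pure drift moments of equations 22 and 23. It relies only on the textbook Dirichlet moments; the function names `dirichlet_for_pure_drift` and `dirichlet_moments` are assumptions made for this example.

```python
import math


def dirichlet_for_pure_drift(x0, t):
    """Dirichlet parameters with mean x(0) and covariance scale 1 - exp(-t)."""
    if t <= 0.0:
        raise ValueError("t must be positive")
    # Solve 1 / (1 + alpha_0) = 1 - exp(-t) for the concentration alpha_0.
    alpha0 = 1.0 / (1.0 - math.exp(-t)) - 1.0
    return [alpha0 * xi for xi in x0]


def dirichlet_moments(alpha):
    """Mean vector and covariance matrix of a Dirichlet(alpha) distribution."""
    alpha0 = sum(alpha)
    mean = [a / alpha0 for a in alpha]
    k = len(alpha)
    cov = [
        [((mean[i] if i == j else 0.0) - mean[i] * mean[j]) / (1.0 + alpha0)
         for j in range(k)]
        for i in range(k)
    ]
    return mean, cov


if __name__ == "__main__":
    alpha = dirichlet_for_pure_drift([0.7, 0.2, 0.1], 0.5)  # hypothetical inputs
    print(alpha)
    print(dirichlet_moments(alpha))
```

Alleles that are absent at time 0 would receive $\alpha_i=0$, which is outside the Dirichlet parameter space, so this matching assumes that every allele has a positive starting frequency.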

Under the Dirichlet distribution, the mean and covariance of the DAF are
$$E[x(t)\mid x(0)]=\frac{\alpha}{\alpha_0}, \tag{24}$$
$$\mathrm{Var}(x(t)\mid x(0))=\frac{1}{1+\alpha_0}\Bigl\{\mathrm{diag}\Bigl\{\frac{\alpha}{\alpha_0}\Bigr\}-\Bigl(\frac{\alpha}{\alpha_0}\Bigr)'\frac{\alpha}{\alpha_0}\Bigr\}. \tag{25}$$
The mean and covariance of the DAF (equations 22 and 23) are equivalent to those under the Dirichlet distribution (equations 24 and 25) when $x(0)=\alpha/\alpha_0$ and $1-e^{-t}=1/(1+\alpha_0)$. Therefore, the Dirichlet distribution can accurately capture the true mean and covariance of the multi-allelic pure drift Wright-Fisher model.

Neutral Mutations

Just as is the case for the bi-allelic model (Fig. 3), when the alleles evolve under pure drift, eventually the process will reach a monomorphic state, where only one of the alleles will be present in the population. Variation can be maintained in the population by allowing mutations (Fig. 9).

FIGURE 9. Dynamics in the $K=4$ multi-allelic Wright-Fisher model with mutations for A, G, C, T. If the parental allele is $i$, the child receives the same allele with probability $U_{ii}$ and another allele $j$ with probability $U_{ij}$, for $i,j\in\{A,G,C,T\}$. In the illustrated example, generation $r$ carries the alleles A T A G T T A G, so $z(r)=(3,2,0,3)$, and generation $r+1$ carries T C G A A T T G, so $z(r+1)=(2,2,1,3)$, with counts given in the order A, G, C, T. One A to C mutation (probability $U_{AC}$) and one G to T mutation (probability $U_{GT}$) are shown.

Dynamics and moments. If $U_{ij}$ is the probability of an $i$ allele to mutate to a $j$ allele, the multinomial distribution of $z(r+1)$ becomes
$$z(r+1)\mid z(r)\sim\mathrm{Mult}\bigl(2N,\,x(r)U\bigr),$$
where the mutation probabilities are stored in a $K\times K$ matrix $U$. By specifying the structure of $U$, different evolutionary mutation models can be formulated, such as the Jukes-Cantor (JC) model, parent independent mutation model, infinite alleles model, Kimura model, and single-step mutation model (Felsenstein). The mean and covariance of the DAF in continuous time $t=r/2N$ are obtained using the rate matrix $Q=2N(U-I)$, where $I$ is the identity matrix, from the diffusion approximation (Hobolth and Sirén 2016),
$$E[x(t)\mid x(0)]=x(0)e^{Qt}, \tag{26}$$
$$\mathrm{Var}(x(t)\mid x(0))=\int_0^t e^{-s}\,e^{Q's}\,\mathrm{diag}\bigl\{x(0)e^{Q(t-s)}\bigr\}\,e^{Qs}\,ds-e^{Q't}x(0)'x(0)e^{Qt}\bigl(1-e^{-t}\bigr). \tag{27}$$
These general formulas make it possible to numerically calculate the mean and covariance for any mutation model.
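To make the role of the rate matrix concrete, the following sketch evaluates the mean $x(0)e^{Qt}$ with $Q=2N(U-I)$ and compares it with the exact discrete-generation mean $x(0)U^r$ of the multinomial model. It is a toy illustration under stated assumptions: the helper names are hypothetical, and `mat_exp` is a plain truncated Taylor series that is adequate only for small matrices with modest norm, so a dedicated matrix exponential routine from a numerical library would be the sensible choice in real analyses.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=40):
    """Matrix exponential by a truncated Taylor series (toy implementation)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc_mutation_matrix(k, u):
    """Jukes-Cantor mutation matrix: U_ij = u / (K - 1) off the diagonal."""
    off = u / (k - 1)
    return [[1.0 - u if i == j else off for j in range(k)] for i in range(k)]


def mean_continuous(x0, mutation, two_n, generations):
    """E[x(t) | x(0)] = x(0) exp(Q t), with Q = 2N (U - I) and t = r / 2N."""
    n = len(x0)
    t = generations / two_n
    qt = [[two_n * (mutation[i][j] - (1.0 if i == j else 0.0)) * t
           for j in range(n)] for i in range(n)]
    p = mat_exp(qt)
    return [sum(x0[i] * p[i][j] for i in range(n)) for j in range(n)]


def mean_discrete(x0, mutation, generations):
    """Exact mean of the discrete model: x(0) U^r."""
    n = len(x0)
    x = list(x0)
    for _ in range(generations):
        x = [sum(x[i] * mutation[i][j] for i in range(n)) for j in range(n)]
    return x


if __name__ == "__main__":
    u_matrix = jc_mutation_matrix(4, 0.001)  # hypothetical mutation probability
    start = [0.7, 0.2, 0.1, 0.0]
    print(mean_continuous(start, u_matrix, 1000, 300))
    print(mean_discrete(start, u_matrix, 300))
```

The two means differ only through the replacement of $U^r$ by $e^{(U-I)r}$, so they should be close whenever the mutation probabilities are small.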
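In practice, the mean can be calculated using one of the many available numerical procedures for matrix exponentials (Moler and Van Loan). Calculating the covariance, which involves integrals of matrix exponentials, is more tedious, but this can be done numerically using the eigenvalue decomposition of the rate matrix (Hobolth and Sirén 2016). The JC model is the simplest mutation model, where all mutation probabilities are equal, $U_{ij}=u/(K-1)$, for all $i\neq j$. The entries in the rate matrix for the JC model are given by
$$Q_{ij}=2N(U_{ij}-I_{ij})=\begin{cases}\dfrac{q}{K-1} & \text{if } i\neq j,\\[4pt] -q & \text{if } i=j,\end{cases}$$
where $q=2Nu$. The rate matrix can be written in matrix form as
$$Q=\frac{q}{K-1}\,(E-KI),$$
where $E$ is the $K\times K$ matrix with 1 in every entry. We can now obtain a closed-form solution for the matrix exponential $e^{Qt}$, namely
$$e^{Qt}=e^{-\theta t/2}\Bigl(I-\frac{E}{K}\Bigr)+\frac{E}{K},$$
where $\theta=2qK/(K-1)$. The mean and covariance in the JC model are found from equations 26 and 27 and given by
$$E[x(t)\mid x(0)]=e^{-\theta t/2}\Bigl(x(0)-\frac{e}{K}\Bigr)+\frac{e}{K}, \tag{28}$$
$$\mathrm{Var}(x(t)\mid x(0))=\frac{1}{K}\Bigl(I-\frac{E}{K}\Bigr)\frac{1-e^{-(1+\theta)t}}{1+\theta}+\Bigl[\mathrm{diag}\Bigl\{x(0)-\frac{e}{K}\Bigr\}-\Bigl(x(0)-\frac{e}{K}\Bigr)'\frac{e}{K}-\frac{e'}{K}\Bigl(x(0)-\frac{e}{K}\Bigr)\Bigr]\frac{e^{-\theta t/2}\bigl(1-e^{-(1+\theta/2)t}\bigr)}{1+\theta/2}-\Bigl(x(0)-\frac{e}{K}\Bigr)'\Bigl(x(0)-\frac{e}{K}\Bigr)e^{-\theta t}\bigl(1-e^{-t}\bigr), \tag{29}$$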

where $e$ is the $1\times K$ vector with 1 in every entry. For $t\to\infty$, these reduce to
$$E[x(t)\mid x(0)]=\frac{e}{K},\qquad \mathrm{Var}(x(t)\mid x(0))=\frac{1}{K}\Bigl(I-\frac{E}{K}\Bigr)\frac{1}{1+\theta}.$$
We note that these moments are the same as for a Dirichlet distribution with $\alpha=\theta e/K$, and indeed the Dirichlet distribution is the stationary DAF of the multi-allelic JC Wright-Fisher model (Ewens 2004).

FIGURE 10. Fit of the Dirichlet distribution (dotted lines) to the true mean and covariance of the multi-allelic JC Wright-Fisher model (solid lines) with (a) $q=0.1$ ($\theta=0.3$, small), and (b) $q=1$ ($\theta=3$, large). All six plots are calculated for $K=3$, $x(0)=(0.7,0.2,0.1)$, $2N=100$ and different values of $t$. The panels show $E[x(t)\mid x(0)]$, $\mathrm{Var}(x(t))$ and $\mathrm{Cov}(x_i(t),x_j(t))$ as functions of $t$ for each allele.

Moment-based approximations. The mean and covariance of the Dirichlet distribution (equations 24 and 25) are equivalent to those under the JC model if the covariance approximately fulfills the proportionality condition
$$\mathrm{Var}(x(t)\mid x(0))\propto\mathrm{diag}\{E[x(t)]\}-E[x(t)]'E[x(t)]=\frac{1}{K}\Bigl(I-\frac{E}{K}\Bigr)+\Bigl[\mathrm{diag}\Bigl\{x(0)-\frac{e}{K}\Bigr\}-\Bigl(x(0)-\frac{e}{K}\Bigr)'\frac{e}{K}-\frac{e'}{K}\Bigl(x(0)-\frac{e}{K}\Bigr)\Bigr]e^{-\theta t/2}-\Bigl(x(0)-\frac{e}{K}\Bigr)'\Bigl(x(0)-\frac{e}{K}\Bigr)e^{-\theta t}, \tag{30}$$
where we used the expression for the mean in equation 28. By comparing equations 29 and 30, we observe that the expressions are approximately proportional with proportionality constant $1-e^{-t}$ when $\theta$ is small, which corresponds to the pure drift case. Regardless of the parameter $\theta$, the expressions are also approximately proportional, with proportionality constant $t$, when the evolutionary distance $t$ is small. Finally, for large $t$, the proportionality constant is $1/(1+\theta)$, because the Dirichlet distribution is the stationary distribution for the JC model. These analytical considerations are confirmed by Figure 10. The Dirichlet distribution cannot accurately capture the mean and covariance of the JC model for intermediate values of $t$, and the deviation is very clear for large values of $\theta$ (Fig. 10b). Therefore, care should be taken when using the Dirichlet distribution in practice.
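The proportionality argument can be explored numerically. The sketch below evaluates the JC mean and covariance in closed form, as these expressions are read here from equations 28 and 29, and reports the smallest and largest element-wise ratio between the covariance and the Dirichlet-shaped matrix $\mathrm{diag}\{E[x(t)]\}-E[x(t)]'E[x(t)]$ of equation 30. A Dirichlet distribution can match both moments only when all ratios coincide. The function names are hypothetical, and the code assumes $\theta>0$ and strictly positive starting frequencies so that no ratio has a zero denominator.

```python
import math


def jc_moments(x0, theta, t):
    """Closed-form JC mean and covariance, with theta = 2qK / (K - 1)."""
    k = len(x0)
    d = [xi - 1.0 / k for xi in x0]  # deviation from the uniform vector e / K
    decay = math.exp(-theta * t / 2.0)
    mean = [1.0 / k + decay * di for di in d]
    c1 = (1.0 - math.exp(-(1.0 + theta) * t)) / (1.0 + theta)
    c2 = decay * (1.0 - math.exp(-(1.0 + theta / 2.0) * t)) / (1.0 + theta / 2.0)
    c3 = math.exp(-theta * t) * (1.0 - math.exp(-t))
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            eye = 1.0 if i == j else 0.0
            m1 = (eye - 1.0 / k) / k                 # (1/K)(I - E/K)
            m2 = eye * d[i] - (d[i] + d[j]) / k      # diag{d} - d'e/K - e'd/K
            m3 = d[i] * d[j]                         # d'd
            cov[i][j] = c1 * m1 + c2 * m2 - c3 * m3
    return mean, cov


def dirichlet_ratio_range(x0, theta, t):
    """Smallest and largest ratio Var_ij / (diag{m} - m'm)_ij over all entries."""
    mean, cov = jc_moments(x0, theta, t)
    k = len(x0)
    ratios = [
        cov[i][j] / ((mean[i] if i == j else 0.0) - mean[i] * mean[j])
        for i in range(k) for j in range(k)
    ]
    return min(ratios), max(ratios)


if __name__ == "__main__":
    start = [0.7, 0.2, 0.1]  # starting frequencies used in Figure 10
    for theta in (0.3, 3.0):
        for t in (0.01, 0.5, 50.0):
            print(theta, t, dirichlet_ratio_range(start, theta, t))
```

A narrow ratio range indicates that a single Dirichlet concentration reproduces the whole covariance matrix, whereas a wide range signals the kind of mismatch described above for intermediate $t$ and large $\theta$.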
Because the JC model is the simplest mutation model, with just one parameter, one could expect the fit of the Dirichlet distribution to be even more problematic for more complex mutation models. An important step in developing more appropriate distributions for the DAF under the multi-allelic Wright-Fisher model was made by Sirén et al. and Hobolth and Sirén (2016), but in general more research is needed in this direction.

CONCLUSION AND PERSPECTIVES

We have provided a broad overview of methods to calculate the DAF under the Wright-Fisher model. These methods have a number of working assumptions in common. Here, we discuss each of these in turn, and how current methods tackle these issues or could potentially be improved to do so. Virtually all methods presented here rely on unlinked loci, with one notable exception that uses a moment-based approach (Terhorst et al. 2015). Several