arxiv: v1 [q-bio.pe] 16 Jul 2013


A model-based approach for identifying signatures of balancing selection in genetic data

Michael DeGiorgio 1, Kirk E. Lohmueller 1, and Rasmus Nielsen 1,2,3

1 Department of Integrative Biology, University of California, Berkeley, CA, USA
2 Department of Statistics, University of California, Berkeley, CA, USA
3 Department of Biology, University of Copenhagen, Copenhagen, Denmark
Present address: Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, USA

Corresponding author:
Michael DeGiorgio
Department of Integrative Biology
University of California
4134 Valley Life Sciences Building
Berkeley, CA 94720, USA
mdegiorgio@berkeley.edu

Running Head: Detecting balancing selection using the coalescent
Keywords: Coalescent, composite likelihood, transmission distortion
Classification: Biological Sciences, Genetics

Abstract

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of previously identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find evidence for many novel candidates, the strongest of which is FANK1, an imprinted gene that suppresses apoptosis, is expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our methods are able to reproduce many previously hypothesized signals of balancing selection, as well as discover novel interesting candidates.

Introduction

Balancing selection maintains variation within a population. Multiple processes can lead to balancing selection. In overdominance, the heterozygous genotype has higher fitness than either of the homozygous genotypes [1, 2]. In frequency-dependent balancing selection, the fitness of an allele is inversely related to its frequency in the population [2, 3]. In a fluctuating or spatially structured environment, balancing selection can occur when different alleles are favored in different environments over time or geography [2, 4, 5]. Finally, balancing selection can also arise when segregation distortion is opposed by negative selection against the distorter [6]. That is, segregation distortion leads to one allele increasing in frequency. However, if that allele is deleterious, then it is reduced in frequency by negative selection. The combined effect of these opposing forces leads to a balanced polymorphism.

The genetic signatures of long-term balancing selection at a locus can roughly be divided into three categories [2]. The first signature is that the distribution of allele frequencies will be enriched for intermediate-frequency alleles. This occurs because the selected locus itself is likely at moderate frequency within the population and, thus, neutral linked loci will also be at intermediate frequency. The second signature is the presence of trans-specific polymorphisms, which are polymorphisms that are shared among species [7]. This is a result of alleles being maintained over long evolutionary time periods, sometimes for millions of years [8-10]. The third signature is an increased density of polymorphic sites. This is due to neutral loci sharing similar deep genealogies as that of the linked selected site, increasing the probability of observing mutations at the neutral loci.

The majority of selection scans in humans have focused on positive and negative directional selection. These studies have found evidence of both types of selection, with negative selection being ubiquitous, and the amount and mechanism of positive selection currently being debated [11-13]. However, it is unclear how much balancing selection exists in the human genome. Some scans for balancing selection (e.g., Bubb et al. [14] and Andrés et al. [15]) have been carried out using summary statistics such as the Hudson-Kreitman-Aguadé (HKA) test [16] and Tajima's D [17] as well as combinations of summary statistics [15, 18] (though see Ségurel et al. [7] and Leffler et al. [19] for recent complementary approaches). The power of such approaches is unclear, and so it is uncertain how important balancing selection is in the human genome. Because balancing
selection shapes the genealogy of a sample around a selected locus, more power can be gained by implementing a model of the genealogical process under balancing selection [20, 21].

Composite likelihood methods have proven to be extremely useful for the analysis of genetic variation data using complex population genetic models [22-28]. This approach allows estimation under models without requiring full likelihood calculations, permitting many complex models to be investigated. In this article, we develop two composite likelihood ratio methods to detect balancing selection, which we denote by $T_1$ and $T_2$. These methods are based on modeling the effect of balancing selection on the genealogy at linked neutral loci (e.g., Kaplan et al. (1988) [20] and Hudson and Kaplan (1988) [21]) and take into consideration the spatial distributions of polymorphisms and substitutions around a selected site. Through simulations, we show that our methods outperform both HKA and Tajima's D under a variety of demographic assumptions. Further, we apply our methods to autosomal whole-genome sequencing data consisting of nine unrelated European (CEU) and nine unrelated African (YRI) individuals. We find support for multiple targets of balancing selection in the human genome, including previously hypothesized regions such as the human leukocyte antigen (HLA) locus. Additionally, we find evidence for balancing selection at the FANK1 gene, which we hypothesize to result from segregation distortion.

Results

Theory

A new test for balancing selection

In this section, we provide a basic overview of a new test for balancing selection. We describe the method in greater detail in the sections "Kaplan-Darden-Hudson model", "Solving the recursion relation", "A composite likelihood ratio test based on polymorphism and substitution", and "A composite likelihood ratio test based on frequency spectra and substitutions".

We have developed a new statistical method for detecting balancing selection, which is based on the model of Kaplan, Darden, and Hudson [20, 21] (full details provided in the Kaplan-Darden-Hudson model section). Under this model, we calculate the expected distribution of allele frequencies using simulations, and approximate the probability of observing a fixed difference or polymorphism at a site as a function
of its genomic distance to a putative site under balancing selection. Using these calculations, we construct composite likelihood tests that can be used to identify sites under balancing selection, similar to the approaches by Kim and Stephan [23] and Nielsen et al. [26] for detecting selective sweeps.

Basic framework

Consider a biallelic site $S$ that is under strong balancing selection and maintains an allele $A_1$ at frequency $x$ and an allele $A_2$ at frequency $1 - x$. Consider a neutral locus $i$ that is linked to the selected locus $S$. Denote the scaled recombination rate between the selected locus and the neutral locus as $\rho_i = 2Nr_i$, where $N$ is the diploid population size and $r_i$ is the per-generation recombination rate. Assume we have a sample of $n$ genomes from an ingroup species (e.g., humans) and a single genome from an outgroup species (e.g., chimpanzee). From these data, we can estimate the genome-wide expected coalescence time $\hat{C}$ between the ingroup and outgroup species (see Materials and Methods for details). Also, under the Kaplan-Darden-Hudson model, we can obtain the expected tree length $L_n(x, \rho)$ and height $H_n(x, \rho)$ for a sample of $n$ lineages affected by balancing selection by solving a set of recursive equations using the numerical approach described in the Solving the recursion relation section. The relationship among $\hat{C}$, $L_n(x, \rho)$, and $H_n(x, \rho)$ is depicted in Figure 1A. Assuming a small mutation rate, the probability that a site is polymorphic under a model of balancing selection, given that it contains either a polymorphism or a substitution (fixed difference), is

$$p_{n,\rho,x} = \frac{L_n(x, \rho)}{2\hat{C} - H_n(x, \rho) + L_n(x, \rho)}, \tag{1}$$

and the conditional probability that it contains a substitution is $s_{n,\rho,x} = 1 - p_{n,\rho,x}$. That is, conditional on a mutation occurring on the genealogy relating the $n$ ingroup genomes and the outgroup genome, the probability that a site is polymorphic is the probability that the mutation occurs more recently than the most recent common ancestor of the $n$ ingroup genomes (i.e., on the red branches indicated in Fig. 1B). The probability that a site contains a substitution is the probability that the mutation occurs along the branch connecting the outgroup sequence to the most recent common ancestor of the $n$ ingroup genomes (i.e., on the blue branches indicated in Fig. 1C).
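As a minimal sketch of eq. 1, the function below turns an expected tree length, an expected tree height, and the ingroup-outgroup coalescence time into the pair $(p_{n,\rho,x}, s_{n,\rho,x})$. The function name, the argument names, and the numbers in the example call are illustrative assumptions, not taken from the paper; all three inputs are assumed to be in the same time units.

```python
def site_probabilities(tree_length, tree_height, coal_time):
    """Return (p, s) from eq. 1.

    tree_length -- expected ingroup tree length L_n(x, rho)
    tree_height -- expected ingroup tree height H_n(x, rho)
    coal_time   -- genome-wide expected ingroup-outgroup coalescence time (C-hat)
    The names are hypothetical; inputs must share the same time units.
    """
    if tree_height <= 0.0 or tree_length < tree_height or coal_time < tree_height:
        raise ValueError("expected 0 < H <= L and C-hat >= H")
    # Branches on which a mutation yields a substitution: the outgroup branch
    # (length C-hat) plus the branch from the ingroup root up to C-hat.
    between_species = 2.0 * coal_time - tree_height
    p = tree_length / (between_species + tree_length)
    return p, 1.0 - p


# Made-up values: L = 4, H = 2, C-hat = 10 gives p = 4 / (18 + 4) = 2/11.
p, s = site_probabilities(4.0, 2.0, 10.0)
```

Because a deeper genealogy near a balanced polymorphism increases $L_n$ and $H_n$ while $\hat{C}$ is fixed genome-wide, $p$ rises toward the selected site under this formula.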

Figure 1D shows how the spatial distribution of polymorphism around a selected site is influenced by the underlying genealogy at the site and how this spatial distribution of polymorphism can be used to provide evidence for balancing selection. Within a window of sites, we can obtain the composite likelihood that a particular site is under selection by multiplying the conditional probability of observing a polymorphism or a substitution at every other neutral site as a function of the distance of the neutral site to the balanced polymorphism.

Kaplan-Darden-Hudson model

The genealogy of a neutral locus $i$ linked to the selected locus $S$ can be traced back in time using the Kaplan, Darden, and Hudson [20, 21] model, which provides a framework for modeling the coalescent process at a neutral locus that is linked to a locus under balancing selection. Their framework models selection as a structured population containing two demes, one for each of the two allelic classes, with recombination and mutation taking the role of migration. Lineages within the first deme are linked to $A_1$ alleles and lineages within the second deme are linked to $A_2$ alleles. Lineages migrate between demes by changing their genomic background. That is, a lineage in the first deme will migrate to the second deme if there was a mutation that changed an $A_1$ allele to an $A_2$ allele or if there was a recombination event that transferred a lineage linked to an $A_1$ allele to an $A_2$ background. Similarly, a lineage in the second deme will migrate to the first deme if there was a mutation that changed an $A_2$ allele to an $A_1$ allele or if there was a recombination event that transferred a lineage linked to an $A_2$ allele to an $A_1$ background. The rate at which a lineage linked to an $A_1$ background transfers to an $A_2$ background is $\beta_1 = \theta_1 + \rho_i(1 - x)$ and the rate at which a lineage linked to an $A_2$ background transfers to an $A_1$ background is $\beta_2 = \theta_2 + \rho_i x$.
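The two transfer rates can be written as a small helper. This is only a sketch of the definitions in the preceding paragraph; the function name and the example values are assumptions made for illustration.

```python
def transfer_rates(theta1, theta2, rho, x):
    """Return (beta_1, beta_2) as defined in the text.

    theta1, theta2 -- scaled mutation rates between the two allelic classes
    rho            -- scaled recombination rate between neutral and selected locus
    x              -- equilibrium frequency of allele A1
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    beta1 = theta1 + rho * (1.0 - x)
    beta2 = theta2 + rho * x
    return beta1, beta2


# Illustrative values only: no mutation between classes, rho = 2, x = 0.25.
beta1, beta2 = transfer_rates(0.0, 0.0, 2.0, 0.25)  # (1.5, 0.5)
```

When mutation between the allelic classes is negligible, both rates scale linearly with $\rho_i$, so tightly linked neutral sites rarely change background and their lineages stay confined to one deme for a long time.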
Consider a sample of $n$ lineages with $k$ lineages linked to allele $A_1$ (i.e., in the first deme) and $n - k$ lineages linked to allele $A_2$ (i.e., in the second deme). Given this configuration, only four events are possible. The first event involves a coalescence of a pair of lineages linked to $A_1$ alleles, the second involves a coalescence of a pair of lineages linked to $A_2$ alleles, the third involves the transfer of a lineage from an $A_1$ background to an $A_2$ background, and the fourth involves the transfer of a lineage from an $A_2$ background to an $A_1$ background. The time until the first event
(i.e., a coalescence or a transfer of background) is exponentially distributed with rate

$$\lambda_{k,n-k}(x, \rho) = \frac{\binom{k}{2}}{x} + \frac{\binom{n-k}{2}}{1 - x} + \frac{k\beta_2(1 - x)}{x} + \frac{(n - k)\beta_1 x}{1 - x}. \tag{2}$$

The probability that the event is a coalescence of a pair of $A_1$-linked lineages is

$$c^{(1)}_{k,n-k}(x, \rho) = \frac{\binom{k}{2}}{x\,\lambda_{k,n-k}(x, \rho)}, \tag{3}$$

the event is a coalescence of a pair of $A_2$-linked lineages is

$$c^{(2)}_{k,n-k}(x, \rho) = \frac{\binom{n-k}{2}}{(1 - x)\,\lambda_{k,n-k}(x, \rho)}, \tag{4}$$

the event is a transfer from an $A_1$ to an $A_2$ background is

$$m^{(1)}_{k,n-k}(x, \rho) = \frac{k\beta_2(1 - x)}{x\,\lambda_{k,n-k}(x, \rho)}, \tag{5}$$

and the event is a transfer from an $A_2$ to an $A_1$ background is

$$m^{(2)}_{k,n-k}(x, \rho) = \frac{(n - k)\beta_1 x}{(1 - x)\,\lambda_{k,n-k}(x, \rho)}. \tag{6}$$

Note that in the notation of Kaplan et al. (1988) [20], $\lambda_{k,n-k}(x, \rho) = h_{k,n-k}(x)$, $c^{(1)}_{k,n-k}(x, \rho) = q_{k-1,n-k}(x)$, $c^{(2)}_{k,n-k}(x, \rho) = q_{k,n-k-1}(x)$, $m^{(1)}_{k,n-k}(x, \rho) = q_{k-1,n-k+1}(x)$, and $m^{(2)}_{k,n-k}(x, \rho) = q_{k+1,n-k-1}(x)$.

Let $L_{k,n-k}(x, \rho)$ denote the expected tree length given a sample with $k$ $A_1$-linked lineages and $n - k$ $A_2$-linked lineages. Using eq. 18 of Kaplan et al. (1988) [20], the expected total tree length can be expressed using the recursion relation

$$L_{k,n-k}(x, \rho) = \frac{n}{\lambda_{k,n-k}(x, \rho)} + c^{(1)}_{k,n-k}(x, \rho)L_{k-1,n-k}(x, \rho) + c^{(2)}_{k,n-k}(x, \rho)L_{k,n-k-1}(x, \rho) + m^{(1)}_{k,n-k}(x, \rho)L_{k-1,n-k+1}(x, \rho) + m^{(2)}_{k,n-k}(x, \rho)L_{k+1,n-k-1}(x, \rho). \tag{7}$$

Similarly, the expected tree height $H_{k,n-k}(x, \rho)$ given a sample with $k$ $A_1$-linked lineages and $n - k$
$A_2$-linked lineages can be expressed by

$$H_{k,n-k}(x, \rho) = \frac{1}{\lambda_{k,n-k}(x, \rho)} + c^{(1)}_{k,n-k}(x, \rho)H_{k-1,n-k}(x, \rho) + c^{(2)}_{k,n-k}(x, \rho)H_{k,n-k-1}(x, \rho) + m^{(1)}_{k,n-k}(x, \rho)H_{k-1,n-k+1}(x, \rho) + m^{(2)}_{k,n-k}(x, \rho)H_{k+1,n-k-1}(x, \rho). \tag{8}$$

Solving the recursion relation

Consider a sample of $n$ lineages. Denote the $(n + 1)$-dimensional vector of tree lengths for a sample of size $n$ as

$$\mathbf{l}^{(n)} = \begin{bmatrix} L_{0,n}(x, \rho) \\ L_{1,n-1}(x, \rho) \\ L_{2,n-2}(x, \rho) \\ \vdots \\ L_{n,0}(x, \rho) \end{bmatrix},$$

such that element $k$, $k = 0, 1, \ldots, n$, of $\mathbf{l}^{(n)}$ is $l^{(n)}_k = L_{k,n-k}(x, \rho)$. Next, define the $(n + 1)$-dimensional vector

$$\mathbf{b}^{(n)} = \begin{bmatrix} \dfrac{n}{\lambda_{0,n}(x, \rho)} + c^{(2)}_{0,n}(x, \rho)L_{0,n-1}(x, \rho) \\ \dfrac{n}{\lambda_{1,n-1}(x, \rho)} + c^{(1)}_{1,n-1}(x, \rho)L_{0,n-1}(x, \rho) + c^{(2)}_{1,n-1}(x, \rho)L_{1,n-2}(x, \rho) \\ \dfrac{n}{\lambda_{2,n-2}(x, \rho)} + c^{(1)}_{2,n-2}(x, \rho)L_{1,n-2}(x, \rho) + c^{(2)}_{2,n-2}(x, \rho)L_{2,n-3}(x, \rho) \\ \vdots \\ \dfrac{n}{\lambda_{n,0}(x, \rho)} + c^{(1)}_{n,0}(x, \rho)L_{n-1,0}(x, \rho) \end{bmatrix},$$

such that element 0 is

$$b^{(n)}_0 = \frac{n}{\lambda_{0,n}(x, \rho)} + c^{(2)}_{0,n}(x, \rho)\,l^{(n-1)}_0,$$

element $n$ is

$$b^{(n)}_n = \frac{n}{\lambda_{n,0}(x, \rho)} + c^{(1)}_{n,0}(x, \rho)\,l^{(n-1)}_{n-1},$$

and element $k$, $k = 1, 2, \ldots, n - 1$, is

$$b^{(n)}_k = \frac{n}{\lambda_{k,n-k}(x, \rho)} + c^{(1)}_{k,n-k}(x, \rho)\,l^{(n-1)}_{k-1} + c^{(2)}_{k,n-k}(x, \rho)\,l^{(n-1)}_k.$$

Further, consider an $(n + 1) \times (n + 1)$-dimensional tridiagonal matrix of migration rates

$$\mathbf{M}^{(n)} = \begin{bmatrix} 1 & -m^{(2)}_{0,n}(x, \rho) & & & \\ -m^{(1)}_{1,n-1}(x, \rho) & 1 & -m^{(2)}_{1,n-1}(x, \rho) & & \\ & -m^{(1)}_{2,n-2}(x, \rho) & 1 & \ddots & \\ & & \ddots & \ddots & -m^{(2)}_{n-1,1}(x, \rho) \\ & & & -m^{(1)}_{n,0}(x, \rho) & 1 \end{bmatrix},$$

with $(n + 1)$-dimensional main diagonal $\mathrm{diag}(\mathbf{M}^{(n)}) = [1, 1, \ldots, 1]$, $n$-dimensional lower diagonal $\mathrm{lower}(\mathbf{M}^{(n)}) = [-m^{(1)}_{1,n-1}(x, \rho), -m^{(1)}_{2,n-2}(x, \rho), \ldots, -m^{(1)}_{n,0}(x, \rho)]$, and $n$-dimensional upper diagonal $\mathrm{upper}(\mathbf{M}^{(n)}) = [-m^{(2)}_{0,n}(x, \rho), -m^{(2)}_{1,n-1}(x, \rho), \ldots, -m^{(2)}_{n-1,1}(x, \rho)]$. All elements that do not fall on the main, lower, and upper diagonals of $\mathbf{M}^{(n)}$ are zero. Given $\mathbf{M}^{(n)}$, $\mathbf{b}^{(n)}$, and $\mathbf{l}^{(n)}$, we can rewrite the recursion relation in eq. 7 as the system of equations

$$\mathbf{M}^{(n)}\mathbf{l}^{(n)} = \mathbf{b}^{(n)}. \tag{9}$$

Because we can calculate eqs. 5 and 6, $\mathbf{M}^{(n)}$ is a constant matrix. For a sample of size $n$, suppose we know $\mathbf{l}^{(n-1)}$ for a sample of size $n - 1$. Then $\mathbf{l}^{(n-1)}$ is a constant vector and hence, because we can calculate eqs. 2-4, $\mathbf{b}^{(n)}$ is also a constant vector. Therefore, eq. 9 is a tridiagonal system of $n + 1$ equations with $n + 1$ unknowns, which can be solved in $O(n)$ time using the tridiagonal matrix algorithm [29].

The base case for the recursion in eq. 7 is when the number of lineages equals one. That is, when all lineages have coalesced and the most recent common ancestor is linked either to an $A_1$ allele or to an $A_2$ allele. This base case can be represented by $L_{0,1}(x, \rho) = 0$ and $L_{1,0}(x, \rho) = 0$. Given these values, set $\mathbf{l}^{(1)} = [L_{0,1}(x, \rho), L_{1,0}(x, \rho)] = [0, 0]$ and solve the system of equations $\mathbf{M}^{(2)}\mathbf{l}^{(2)} = \mathbf{b}^{(2)}$ for $\mathbf{l}^{(2)}$. Next, given $\mathbf{l}^{(2)}$, solve the system of equations $\mathbf{M}^{(3)}\mathbf{l}^{(3)} = \mathbf{b}^{(3)}$ for $\mathbf{l}^{(3)}$. Iterate this process
until $\mathbf{M}^{(n)}\mathbf{l}^{(n)} = \mathbf{b}^{(n)}$ is solved for $\mathbf{l}^{(n)}$. An analogous process can be used to solve the recursion (eq. 8) for the expected tree height. Using the framework in this section for a sample of size $n$, we can obtain values for $L_{0,n}(x, \rho), L_{1,n-1}(x, \rho), \ldots, L_{n,0}(x, \rho)$. Given that the $A_1$ allele has frequency $x$ and the $A_2$ allele has frequency $1 - x$, the expected tree length for a sample of size $n$ is

$$L_n(x, \rho) = \sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} L_{k,n-k}(x, \rho). \tag{10}$$

Similarly, we can obtain the expected tree height $H_n(x, \rho)$ for a sample of size $n$. The tree heights and total branch lengths are then used in eq. 1 to compute the likelihood of the data under the selection model.

A composite likelihood ratio test based on polymorphism and substitution

In this section, we illustrate how eq. 1 can be incorporated into a composite likelihood. We will then describe a likelihood ratio test that compares the balancing selection model described above to a neutral model based on the background genome patterns of polymorphism. Consider a window of $I$ sites that are either polymorphisms or substitutions and consider a putatively selected site $S$ located within the window. Suppose site $i$ within the window has $n_i$ sampled alleles, $a_i$ observed ancestral alleles, and is a recombination distance of $\rho_i$ from $S$. Let $\mathbf{n} = [n_1, n_2, \ldots, n_I]$, $\mathbf{a} = [a_1, a_2, \ldots, a_I]$, and $\boldsymbol{\rho} = [\rho_1, \rho_2, \ldots, \rho_I]$. Define the indicator random variable $\mathbf{1}_{\{a_i = k\}}$ that site $i$ has $k$ ancestral alleles. Using the Kaplan-Darden-Hudson model, the probability that site $i$ is polymorphic is $p_{n_i,\rho_i,x}$ and the probability that the site is a substitution (or fixed difference) is $s_{n_i,\rho_i,x} = 1 - p_{n_i,\rho_i,x}$. Under the model, the composite likelihood that site $S$ is under balancing selection is

$$\mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, x; \mathbf{a}) = \prod_{i=1}^{I} \left[ s_{n_i,\rho_i,x}\mathbf{1}_{\{a_i = 0\}} + p_{n_i,\rho_i,x} \sum_{k=1}^{n_i - 1} \mathbf{1}_{\{a_i = k\}} \right], \tag{11}$$

which is maximized at $\hat{x} = \arg\max_{x \in (0,1)} \mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, x; \mathbf{a})$.
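The iteration described in the Solving the recursion relation section (eqs. 2-10) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' C implementation: the function names are hypothetical, time is assumed to be measured in the units implied by eq. 2, and the transfer rates are assumed to be strictly positive so that every event rate is nonzero.

```python
from math import comb


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system in O(m) time (Thomas algorithm).

    Row i reads lower[i]*u[i-1] + diag[i]*u[i] + upper[i]*u[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored.
    """
    m = len(diag)
    c = [0.0] * m  # modified upper diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return u


def event_terms(k, n, x, beta1, beta2):
    """Total rate (eq. 2) and event probabilities (eqs. 3-6) for k A1-linked lineages of n."""
    coal1 = comb(k, 2) / x
    coal2 = comb(n - k, 2) / (1.0 - x)
    move1 = k * beta2 * (1.0 - x) / x
    move2 = (n - k) * beta1 * x / (1.0 - x)
    lam = coal1 + coal2 + move1 + move2
    return lam, coal1 / lam, coal2 / lam, move1 / lam, move2 / lam


def expected_length_and_height(n, x, rho, theta1=0.0, theta2=0.0):
    """Return (L_n, H_n) of eq. 10 by iterating eq. 9 from sample size 2 up to n."""
    beta1 = theta1 + rho * (1.0 - x)
    beta2 = theta2 + rho * x
    if n < 2 or not 0.0 < x < 1.0 or beta1 <= 0.0 or beta2 <= 0.0:
        raise ValueError("need n >= 2, 0 < x < 1, and positive transfer rates")
    length = [0.0, 0.0]  # l^(1): base case L_{0,1} = L_{1,0} = 0
    height = [0.0, 0.0]  # same base case for the height recursion (eq. 8)
    for size in range(2, n + 1):
        lower, upper = [0.0] * (size + 1), [0.0] * (size + 1)
        b_len, b_hgt = [0.0] * (size + 1), [0.0] * (size + 1)
        for k in range(size + 1):
            lam, c1, c2, m1, m2 = event_terms(k, size, x, beta1, beta2)
            lower[k], upper[k] = -m1, -m2  # off-diagonals of M^(size)
            b_len[k], b_hgt[k] = size / lam, 1.0 / lam
            if k >= 1:  # coalescence among A1-linked lineages
                b_len[k] += c1 * length[k - 1]
                b_hgt[k] += c1 * height[k - 1]
            if k <= size - 1:  # coalescence among A2-linked lineages
                b_len[k] += c2 * length[k]
                b_hgt[k] += c2 * height[k]
        diag = [1.0] * (size + 1)
        length = thomas_solve(lower, diag, upper, b_len)
        height = thomas_solve(lower, diag, upper, b_hgt)
    # Eq. 10: average over the binomial number of A1-linked lineages.
    weights = [comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(n + 1)]
    total_length = sum(w * v for w, v in zip(weights, length))
    total_height = sum(w * v for w, v in zip(weights, height))
    return total_length, total_height
```

As a check on the sketch, for $n = 2$, $x = 1/2$, and no mutation between allelic classes, solving eqs. 7 and 8 by hand gives $L_2 = 2 + 1/\rho$ and $H_2 = 1 + 1/(2\rho)$, so `expected_length_and_height(2, 0.5, 1.0)` should return values close to 3 and 1.5, and both approach the unlinked values 2 and 1 as $\rho$ grows.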
Further, suppose that for a sample of size $k$, $k = 2, 3, \ldots, n$, conditioning only on sites that are polymorphisms or substitutions, the proportion of loci across the genome that are polymorphic is $\hat{p}_k$ and the proportion of loci that are substitutions
is $\hat{s}_k = 1 - \hat{p}_k$. Then the composite likelihood that site $S$ is evolving neutrally is

$$\mathcal{L}_B(\mathbf{n}; \mathbf{a}) = \prod_{i=1}^{I} \left[ \hat{s}_{n_i}\mathbf{1}_{\{a_i = 0\}} + \hat{p}_{n_i} \sum_{k=1}^{n_i - 1} \mathbf{1}_{\{a_i = k\}} \right]. \tag{12}$$

It follows that the composite likelihood ratio test statistic that site $S$ is under balancing selection is $T_1 = 2\{\ln[\mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, \hat{x}; \mathbf{a})] - \ln[\mathcal{L}_B(\mathbf{n}; \mathbf{a})]\}$.

A composite likelihood ratio test based on frequency spectra and substitutions

A balanced polymorphism not only increases the number of polymorphisms at linked neutral sites, but also shifts allele frequencies at these sites toward intermediate values. Therefore, power can be gained by using frequency spectra information in addition to information on the density of polymorphisms and substitutions. Given a sample of size $n$, an $A_1$ allele at frequency $x$, an $A_2$ allele at frequency $1 - x$, and a polymorphic neutral site that is $\rho$ recombination units from a selected site, we can obtain the probability $p_{n,k,\rho,x}$ that there are $k$, $k = 1, 2, \ldots, n - 1$, ancestral alleles observed at the neutral site. The composite likelihood that site $S$ is under balancing selection is

$$\mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, x; \mathbf{a}) = \prod_{i=1}^{I} \left[ s_{n_i,\rho_i,x}\mathbf{1}_{\{a_i = 0\}} + p_{n_i,\rho_i,x} \sum_{k=1}^{n_i - 1} p_{n_i,k,\rho_i,x}\mathbf{1}_{\{a_i = k\}} \right], \tag{13}$$

which is maximized at $\hat{x} = \arg\max_{x \in (0,1)} \mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, x; \mathbf{a})$. Further, suppose that for a sample of size $k$, $k = 2, 3, \ldots, n$, conditioning only on sites that are polymorphisms or substitutions, the proportion of polymorphic loci across the genome that have $j$, $j = 1, 2, \ldots, k - 1$, ancestral alleles is $\hat{p}_{k,j}$. Then the composite likelihood that site $S$ is evolving neutrally is

$$\mathcal{L}_B(\mathbf{n}; \mathbf{a}) = \prod_{i=1}^{I} \left[ \hat{s}_{n_i}\mathbf{1}_{\{a_i = 0\}} + \hat{p}_{n_i} \sum_{k=1}^{n_i - 1} \hat{p}_{n_i,k}\mathbf{1}_{\{a_i = k\}} \right]. \tag{14}$$

It follows that the composite likelihood ratio test statistic that site $S$ is under balancing selection is $T_2 = 2\{\ln[\mathcal{L}_M(\mathbf{n}, \boldsymbol{\rho}, \hat{x}; \mathbf{a})] - \ln[\mathcal{L}_B(\mathbf{n}; \mathbf{a})]\}$.

The two new methods, $T_1$ and $T_2$, have been implemented in the software package ballet (BALancing selection LikElihood Test), which is written in C.
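To make the structure of $T_1$ concrete, the sketch below evaluates eqs. 11 and 12 on the log scale and maximizes over a grid of candidate values of $x$. It is a simplified illustration, not the ballet implementation: all sites are assumed to share one sample size, the maximization uses a coarse grid rather than a numerical optimizer, and `prob_poly` is a hypothetical callable standing in for $p_{n,\rho,x}$ from eq. 1. The function `toy_prob_poly` is a made-up placeholder with plausible qualitative behavior only.

```python
import math


def log_composite(is_poly, poly_probs):
    """Log of the product in eq. 11 or eq. 12 for sites with equal sample size."""
    return sum(math.log(p if poly else 1.0 - p) for poly, p in zip(is_poly, poly_probs))


def t1_statistic(is_poly, rhos, prob_poly, p_background, x_grid):
    """Return (T1, x_hat), maximizing the selection log likelihood over x_grid.

    is_poly      -- True where a site is polymorphic, False where it is a substitution
    rhos         -- scaled recombination distance of each site from the focal site
    prob_poly    -- hypothetical callable (x, rho) -> p_{n,rho,x}
    p_background -- genome-wide proportion of informative sites that are polymorphic
    x_grid       -- candidate equilibrium frequencies in (0, 1)
    """
    log_null = log_composite(is_poly, [p_background] * len(is_poly))
    best_x, best_log = None, -math.inf
    for x in x_grid:
        log_model = log_composite(is_poly, [prob_poly(x, rho) for rho in rhos])
        if log_model > best_log:
            best_x, best_log = x, log_model
    return 2.0 * (best_log - log_null), best_x


def toy_prob_poly(x, rho):
    # Placeholder, NOT from the paper: elevated near the focal site and
    # decaying to a background value of 0.2 with recombination distance.
    return 0.2 + 0.6 * 4.0 * x * (1.0 - x) * math.exp(-rho)


t1, x_hat = t1_statistic([True, True, False], [0.1, 0.2, 5.0], toy_prob_poly, 0.2,
                         [i / 10 for i in range(1, 10)])
```

Working with sums of logarithms avoids numerical underflow when a window contains many sites. A sketch of $T_2$ would follow the same pattern, multiplying each polymorphic site's term by the frequency-spectrum probability that appears in eqs. 13 and 14.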

Evaluating the methods using simulations

To evaluate the performance of $T_1$ and $T_2$ relative to HKA and Tajima's D, we carried out extensive simulations of balancing selection using different selection and demographic parameters. We simulated genomic data for a pair of species that diverged $\tau_D$ years ago. We introduced a site that is under balancing selection at time $\tau_S$, and the mode of balancing selection at the site is overdominance with selection strength $s$ and dominance parameter $h$. In the simulations discussed in this article, we varied the demographic history in the target ingroup species, the strength of selection $s$, the dominance parameter $h$, and the time at which the selected allele arises $\tau_S$. Details of how the simulations were implemented are further described in the Materials and Methods section.

Selected allele arising in ingroup species

We considered the demographic models shown in Figure 2 with $s = 10^{-2}$ and $h = 100$. For these simulations, we constructed receiver operating characteristic (ROC) curves, which illustrate the relationship between the true positive and false positive rates of the four methods. Figure 3 displays ROC curves for $T_1$, $T_2$, HKA, and Tajima's D under each of the three demographic models depicted in Figure 2. The dominance parameter $h = 100$ was chosen to represent an extremely strong level of heterozygote advantage. We later discuss a wider range of dominance parameters to test the limits of our methods.

Under a model of constant population size (Fig. 3A), for a given false positive rate, $T_2$ tends to obtain more true positives than $T_1$, $T_1$ more true positives than HKA, and HKA more true positives than Tajima's D. In practice, however, we are typically concerned with a method's performance at low false positive rates. For a false positive rate of 1%, $T_1$, $T_2$, HKA, and Tajima's D have true positive rates of 30, 40, 14, and 6%, respectively. Also, at a false positive rate of 5%, $T_1$, $T_2$, HKA, and Tajima's D have true positive rates of 58, 67, 37, and 25%, respectively. These results show that $T_1$ and $T_2$ vastly outperform both HKA and Tajima's D, with $T_2$ performing better than $T_1$. However, the demographic model used in these simulations is the same as the one assumed in $T_1$ and $T_2$, namely, the standard neutral model. To examine the robustness of our methods, we considered two complex demographic scenarios that could potentially affect the results of our methods: a population bottleneck (Fig. 2B) and a population expansion (Fig. 2C).
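The true positive rates quoted above are points on an ROC curve at fixed false positive rates. One common way to compute such a point from simulated test statistics is sketched below. The thresholding rule, the function name, and the toy scores are assumptions for illustration; the text does not specify how the curves were computed.

```python
import math


def true_positive_rate(null_scores, selected_scores, fpr):
    """Fraction of selected replicates exceeding an empirical null threshold.

    The threshold is chosen so that at most a fraction `fpr` of the neutral
    replicates score strictly above it. Larger scores are taken to indicate
    stronger evidence for selection.
    """
    ranked = sorted(null_scores, reverse=True)
    # Number of neutral replicates allowed above the threshold; the small
    # offset guards against floating-point error in the product.
    allowed = int(fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    hits = sum(1 for score in selected_scores if score > threshold)
    return hits / len(selected_scores)


# Toy scores only: 100 neutral replicates scoring 0..99, four selected replicates.
null = list(range(100))
selected = [95.5, 50.0, 99.5, 10.0]
power_at_5pct = true_positive_rate(null, selected, 0.05)  # threshold 94 -> 0.5
power_at_1pct = true_positive_rate(null, selected, 0.01)  # threshold 98 -> 0.25
```

For a statistic such as Tajima's D, where the direction of the signal matters, the scores would need to be oriented so that larger values correspond to stronger evidence of balancing selection before applying this rule.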

Figure 3B displays ROC curves under a model in which the ingroup species experiences a recent severe bottleneck. Aside from Tajima's D, all of the methods perform well under this scenario. For a false positive rate of 1%, the true positive rates of $T_1$, $T_2$, HKA, and Tajima's D are 75, 74, 72, and 5%, respectively. Similarly, for a false positive rate of 5%, the true positive rates of $T_1$, $T_2$, HKA, and Tajima's D are 80, 81, 80, and 14%, respectively. This is because, under a model with a severe population bottleneck, there is a lower level of diversity across the genome and, hence, a lower polymorphism-to-substitution ratio. Because $T_1$, $T_2$, and HKA compare the level of polymorphism and divergence at a putatively selected site with that of the corresponding genome-wide background levels, these three methods identify a large excess of polymorphism compared to background levels at a site that is under balancing selection. However, Tajima's D performs no such comparison and, thus, has little power to detect balancing selection under this scenario.

We next considered a demographic scenario in which the ingroup species experiences recent population growth (Fig. 2C). Under this setting (Fig. 3C), similar to that of constant population size, $T_2$ tends to obtain more true positives than $T_1$, $T_1$ more true positives than HKA, and HKA more true positives than Tajima's D for a given false positive rate. At a false positive rate of 1%, $T_1$, $T_2$, HKA, and Tajima's D have true positive rates of 39, 41, 15, and 10%, respectively, and at a false positive rate of 5%, $T_1$, $T_2$, HKA, and Tajima's D have true positive rates of 65, 69, 37, and 32%, respectively. Interestingly, at these false positive rates, none of the four methods performs worse under recent population growth than under a constant population size, and most perform better. This result is potentially due to more efficient selection after population growth.

By considering the demographic scenarios in Figure 2, we have demonstrated that our statistics, $T_1$ and $T_2$, generally outperform both HKA and Tajima's D. Additional simulation results are displayed in the Supplementary Material, in which we consider a range of values of the dominance parameter (i.e., $h = 100$, 10, 3, and 1.5), a strong selection coefficient $s = 10^{-2}$ (Fig. S1), a weak selection coefficient $s = 10^{-4}$ (Fig. S2), and a scenario in which the selected allele arises in the population ancestral to the split of the ingroup and outgroup species (Figs. S3-S5). In all scenarios tested, $T_1$ and $T_2$ perform at least as well as, and often better than, HKA and Tajima's D.

Next, we investigated scenarios in which we vary the dominance parameter $h$ with a selection coefficient of $s = 10^{-2}$. Considering an ingroup with a constant population size, $T_2$ outperforms $T_1$, $T_1$ outperforms HKA, and HKA outperforms Tajima's D (Fig. S1). As $h$ decreases, the performance
of HKA and Tajima's D decreases, yet the performance of $T_1$ and $T_2$ is not dramatically impacted. Hence, as $h$ decreases, the performance of $T_1$ and $T_2$ relative to HKA and Tajima's D increases, showing that the two new statistics provide a dramatic increase in power compared to HKA and Tajima's D. Under a scenario in which the ingroup undergoes a recent population bottleneck, $T_1$, $T_2$, and HKA perform well, whereas Tajima's D performs poorly (Fig. S1). In addition, $h$ appears to have little influence on the relative performance of these methods. Hence, population bottlenecks tend to enhance the performance of $T_1$, $T_2$, and HKA, whereas they inhibit the performance of Tajima's D. Moving to a scenario in which the ingroup undergoes a recent population expansion, Figure S1 shows that $T_2$ outperforms $T_1$, $T_1$ outperforms HKA, and HKA outperforms Tajima's D. The results in Figure S1 indicate that the performance of $T_1$ is generally similar to that of $T_2$, whereas the performance of HKA and Tajima's D is generally similar for large $h$ (i.e., $h = 10$ and 100), and dissimilar for low $h$ (i.e., $h = 1.5$ and 3). In addition, under the set of parameters investigated, $h$ appears to have little influence on the performance of $T_1$, $T_2$, and HKA, but causes the performance of Tajima's D to decrease with decreasing $h$.

By considering a selection coefficient of modest strength (i.e., $s = 10^{-2}$), we found that, in general, $T_1$ and $T_2$ perform quite well (Fig. S1). However, because these two methods were developed to detect long-term balancing selection, it is unclear how they should perform under a setting with weak selection. To investigate this scenario, we considered a weak selection coefficient of $s = 10^{-4}$, which is two orders of magnitude smaller than the one considered previously. For a setting in which the ingroup remains at constant size, $T_2$ outperforms $T_1$, $T_1$ outperforms HKA, and HKA outperforms Tajima's D (Fig. S2) for large $h$ (i.e., $h = 10$ and 100). In contrast to the results for the case of $s = 10^{-2}$, when $h$ is small (i.e., $h = 1.5$ and 3), all methods perform poorly, each identifying signatures of selection only slightly better than random. Hence, when selection is weak and the level of overdominance is low, $T_1$ and $T_2$ cannot extract enough information from the data to create meaningful predictions. However, HKA and Tajima's D perform just as poorly, and therefore $T_1$ and $T_2$ outperform HKA and Tajima's D in general under a demographic model with constant population size.

Next, considering a situation in which the ingroup undergoes a recent population bottleneck,
similarly to the observations for $s = 10^{-2}$, $T_1$, $T_2$, and HKA perform well, whereas Tajima's D performs poorly (Fig. S2). In contrast to the results for $s = 10^{-2}$, $h$ appears to have some influence on the relative performance of these methods. As $h$ decreases, the performance of all methods decreases, though not substantially. In addition, similarly to $s = 10^{-2}$, the performance of $T_1$, $T_2$, and HKA is approximately the same. Hence, even for weak selection, population bottlenecks tend to enhance the performance of $T_1$, $T_2$, and HKA, whereas they inhibit the performance of Tajima's D.

Finally, under a scenario in which the ingroup undergoes a recent population expansion, Figure S2 shows that $T_2$ outperforms $T_1$, $T_1$ outperforms HKA, and HKA outperforms Tajima's D for large $h$ (i.e., $h = 10$ and 100). In contrast to the results for the case of $s = 10^{-2}$, when $h$ is small (i.e., $h = 1.5$ and 3), all methods perform poorly. Hence, like the case for an ingroup population with constant size, when selection is weak and the level of overdominance is low, $T_1$ and $T_2$ cannot extract enough information from the data to create meaningful predictions. However, HKA and Tajima's D perform just as poorly, and therefore $T_1$ and $T_2$ outperform HKA and Tajima's D in general under a demographic model with recent population growth.

Selected allele arising within ancestral population

One hallmark of balancing selection is that it maintains polymorphism for a long time, potentially for millions of years [8-10]. Due to the extreme age of some balanced polymorphisms, they tend to occur within multiple species, thereby creating a polymorphism shared across species referred to as a trans-specific polymorphism. Figure S3 displays the three models that we consider in which a selected allele arises in the population ancestral to the split of the ingroup and outgroup species. For each of the three demographic scenarios, we set $\tau_S =$ [...] years ago, creating a selected allele that is three times as ancient as the one that we consider in Figure 2. All other model parameters are identical to those considered in Figure 2.

Here we investigate the performance of $T_1$, $T_2$, HKA, and Tajima's D in the context of demographic models in which a selected allele arises in an ancestral population and in which the selective pressure is of modest strength (i.e., $s = 10^{-2}$) and varying dominance $h$. For a setting in which the ingroup remains at constant size, $T_2$ outperforms $T_1$, $T_1$ outperforms HKA, and HKA outperforms Tajima's D (Fig. S4). As $h$ decreases, the performance of HKA and Tajima's D decreases, yet the
16 performance of T 1 and T 2 is not dramatically impacted. Hence as h decreases the performance of T 1 and T 2 relative to HKA and Tajima s D increases, mirroring the results observed in Figure S1. Next, considering a situation in which the ingroup undergoes a recent population bottleneck, T 1, T 2, and HKA perform well, whereas Tajima s D performs poorly (Fig. S4). In addition, h appears to have little influence on the relative performance among T 1, T 2, and HKA yet causes Tajima s D to perform worse for small h (i.e., h = 1.5). Hence, akin to the observations for Figure S1, population bottlenecks tend to enhance the performance of T 1, T 2, and HKA, whereas they inhibit the performance of Tajima s D. Under a scenario in which the ingroup undergoes a recent population expansion, Figure S4 shows that in most cases T 2 outperforms T 1, T 1 outperforms HKA, and HKA outperforms Tajima s D. The performance of T 1 is generally similar to T 2, whereas the performance of HKA and Tajima s D generally similar for large h (i.e, h = 10 and 100), and dissimilar for low h (i.e, h = 1.5 and 3). Interestingly, for h = 1.5, T 1 performs slightly better than T 2. In addition, under the set of parameters investigated, h appears to have little influence on the performance of T 1, T 2, and HKA, but causes the performance of Tajima s D to decrease with decreasing h. By considering the demographic model in Figure S3, we have shown that the performance of T 1, T 2, HKA, and Tajima s D are not greatly impacted by the age of the selected allele, provided that the selected allele is old and has maintained balancing selection for an extended period of time. Hence, though T 1 and T 2 make the assumption that lineages from the ingroup species are monophyletic, this assumption does not hinder the methods in practice. For a setting in which the ingroup remains at constant size, T 2 outperforms T 1, T 1 outperforms HKA, and HKA outperforms Tajima s D (Fig. 
S5) for large h (i.e., h = 10 and 100). In contrast, when h is small (i.e., h = 1.5 and 3), all methods perform poorly, each identifying signatures of selection only slightly better than random. Hence, as observed in Figure S2, when selection is weak and the level of overdominance is low, T1 and T2 cannot extract enough information from the data to make meaningful predictions. Next, considering a situation in which the ingroup undergoes a recent population bottleneck, T1, T2, and HKA perform well, whereas Tajima's D performs poorly (Fig. S5). In addition, h appears to have some influence on the relative performance of these methods. As h decreases, the performance of all methods decreases, though not substantially. Also, the performance of T1, T2, and HKA is approximately the same. These results mirror those observed in Figure S2, and thus, even for weak selection at a trans-specific polymorphism, population bottlenecks tend to enhance the performance of T1, T2, and HKA, whereas they inhibit the performance of Tajima's D. Finally, under a scenario in which the ingroup undergoes a recent population expansion, Figure S5 shows that T2 outperforms T1, T1 outperforms HKA, and HKA outperforms Tajima's D for large h (i.e., h = 10 and 100). In contrast, when h is small (i.e., h = 1.5 and 3), all methods perform poorly. Hence, as in the case of an ingroup population with constant size, when selection is weak and the level of overdominance is low, T1 and T2 cannot extract enough information from the data to make meaningful predictions. These results show that, for the case of weak selection, a setting in which the selected allele generates trans-specific polymorphisms has little effect on the performance of T1, T2, HKA, and Tajima's D when compared with their respective performances under the case in which the polymorphism is not trans-specific. Hence, we have shown that the performance of T1 and T2 is not influenced by the presence of a trans-specific polymorphism even though they are based on the assumption that lineages from the ingroup species are monophyletic.

Empirical analysis

Balancing selection in humans

We probed the effects of balancing selection in humans by using whole-genome sequencing data from nine unrelated individuals from the CEU population and nine unrelated individuals from the YRI population (see Materials and Methods). We performed a scan for balancing selection at each position in our dataset by considering a window of 100 substitutions or polymorphisms upstream and 100 downstream of the focal site. This window size was chosen for computational convenience, rather than by consideration of the recombination rate or polymorphism density within the region.
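The windowing scheme described above can be illustrated with a short sketch. It is not the code used for the scan: the function names are hypothetical, and it assumes that `positions` is a sorted list of coordinates of informative sites (polymorphisms or substitutions) on one chromosome and that the focal site is itself one of those sites, given by its index.

```python
from typing import List, Tuple


def flanking_window(positions: List[int], focal: int, flank: int = 100) -> Tuple[int, int]:
    """Return (start, end) index bounds, end exclusive, of the window that
    holds up to `flank` informative sites on each side of the focal site.
    Windows are truncated at the chromosome ends (an assumption)."""
    start = max(0, focal - flank)
    end = min(len(positions), focal + flank + 1)
    return start, end


def window_length_bp(positions: List[int], focal: int, flank: int = 100) -> int:
    """Physical span of the window, in base pairs."""
    start, end = flanking_window(positions, focal, flank)
    return positions[end - 1] - positions[start]


# Toy chromosome: one informative site every 10 bp (hypothetical data).
toy_positions = list(range(0, 5000, 10))
print(flanking_window(toy_positions, 250))
print(window_length_bp(toy_positions, 250))
```

Because the window is defined by a count of informative sites rather than by physical distance, its length in base pairs varies with the local density of polymorphisms and substitutions.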
Though we used a window size of 200 polymorphisms or substitutions for computational convenience, T1 and T2 can also be computed using all sites on a chromosome. The mean window length was 14.7 kb for the CEU and 13.7 kb for the YRI populations, which should be sufficiently long because recombination quickly breaks down the signal of balancing selection at distant neutral sites. Manhattan plots for the T1 (Figs. S6 and S7) and T2 (Figs. S8 and S9) test statistics suggest that there are multiple outlier candidate regions. Intersecting the locations of these scores with those from the longest transcript of each RefSeq gene (i.e., coding region) led to identification of many previously-hypothesized and novel genes potentially undergoing balancing selection (see Tables S1-S4, with previously-hypothesized genes highlighted in bold).

Multiple genes at the HLA region are strong outliers (top 0.01% of all scores across the genome) in our scan for balancing selection (Tables S1-S4). Because this study uses high-coverage sequencing data, resolution in the HLA region is particularly fine (Figs. 4 and S10), with strong signals in classical MHC genes such as the HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, and HLA-DP genes [14]. The HLA region, which is located on chromosome 6, is a well-known site of balancing selection in humans [8-10]. The protein products encoded by HLA genes are involved in antigen presentation, thus playing important roles in immune system function. Genes at the HLA locus are known to be highly polymorphic and are thought to be subject to balancing selection due to frequency-dependent selection, overdominance, or fluctuating selection in a rapidly changing pathogenic environment [30, 31]. Because the HLA region is so well known as a locus under balancing selection, it is important that our methods identify strong candidate genes in the region as a proof of concept.

One gene that we found particularly intriguing is FANK1 (Figs. 5 and S11). This gene is one of the top four candidates in the CEU and YRI populations when using either the T1 or T2 statistic (Tables S1-S4). In addition, FANK1 is the top candidate among genes that have not been previously hypothesized to be under balancing selection when using either test in the CEU and the T1 test in the YRI. FANK1 is expressed during the transition from diploid to haploid state in meiosis [32, 33].
Though it is often identified as spermatogenesis-specific [32, 33], it is also expressed during oogenesis in cattle [34] and mice [35]. Its function is to suppress apoptosis [33], and it is one of 10 to 20 genes identified as being imprinted in humans (i.e., showing allele-specific methylation) [36]. Interestingly, it also shows marginal evidence of segregation distortion (Fig. 5) [37]. Further, because a CpG island resides directly underneath our signal in both the CEU and YRI populations, we reanalyzed the region around FANK1 after removing all GC→AT transitions on chromosome 10, and again after removing all transitions on chromosome 10. The peak is retained in both analyses (Fig. S12), strongly suggesting that the signature of balancing selection that we identified around FANK1 is not driven by CpG mutational effects.
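The two site filters used in this robustness check can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, each site is represented as a `(position, ancestral, derived)` tuple, and a GC→AT transition is taken to mean a G→A or C→T change from the ancestral to the derived allele.

```python
from typing import Callable, List, Tuple

Site = Tuple[int, str, str]  # (position, ancestral allele, derived allele)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ancestral: str, derived: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ancestral.upper(), derived.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def is_gc_to_at(ancestral: str, derived: str) -> bool:
    """True for G->A or C->T, the changes expected from CpG deamination
    (assumes alleles are polarized as ancestral -> derived)."""
    return (ancestral.upper(), derived.upper()) in {("G", "A"), ("C", "T")}


def drop_sites(sites: List[Site], predicate: Callable[[str, str], bool]) -> List[Site]:
    """Keep only the sites for which the predicate is False."""
    return [s for s in sites if not predicate(s[1], s[2])]


# Hypothetical toy data: two GC->AT transitions, one other transition,
# and one transversion.
toy_sites = [(1, "C", "T"), (2, "T", "C"), (3, "A", "C"), (4, "G", "A")]
print(drop_sites(toy_sites, is_gc_to_at))
print(drop_sites(toy_sites, is_transition))
```

The second filter is strictly more conservative than the first, since every GC→AT change is a transition; retaining a peak under both filters therefore argues against a purely CpG-driven signal.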

Gene ontology analysis

To elucidate functional similarities among genes identified to be under balancing selection, we performed gene ontology (GO) enrichment analysis using GOrilla [38, 39]. First, we compared an unranked list of the top 100 candidate genes (Tables S1-S4) to the background list of all unique genes. Genes obtained using either test statistic are enriched for processes involved in the immune response in both the CEU and YRI populations (Tables S5-S8). Similarly, the top genes are enriched for MHC class II functional categories (Tables S9-S12), with the exception of the T2 statistic applied to YRI, which has no functional enrichment (Table S12). Further, these top genes tend to be components of the MHC complex and membranes (Tables S13-S16), which often directly interact with pathogens. Interestingly, removing all HLA genes from both the top 100 and background sets of genes reveals no GO enrichment for process, function, or component categories, indicating that enrichment is predominantly driven by the HLA region.

Because we can also provide a score for each candidate gene in our likelihood framework, we performed a second analysis in which we ranked genes by their likelihood ratio test statistic, with the goal of identifying GO categories that are enriched in top-ranked genes. Using this framework, the top candidate genes tend to be involved in immune response and cell adhesion processes (Tables S17-S20); MHC activity and membrane protein activity functions, such as transporting and binding molecules (Tables S21-S24); and MHC complex, membrane, and cell junction components (Tables S25-S28). In contrast to the case of the top 100 candidate genes, removing all HLA genes from the ranked list still resulted in GO enrichment in categories such as cell adhesion (processes), membrane protein activity (function), and components of membranes and cell junctions (component).
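For intuition about the unranked comparison of a target list against a background list, enrichment of a single GO category can be scored with a hypergeometric upper tail. The sketch below is a generic illustration and does not reproduce GOrilla's implementation, its ranked-list procedure, or any multiple-testing correction; the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background: int, n_annotated: int,
                      n_target: int, n_overlap: int) -> float:
    """P(X >= n_overlap) for X ~ Hypergeometric, where
    n_background = genes in the background list,
    n_annotated  = background genes carrying the GO term,
    n_target     = genes in the target list (e.g., top 100 candidates),
    n_overlap    = target genes carrying the GO term."""
    upper = min(n_annotated, n_target)
    favorable = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_target - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_background, n_target)


# Hypothetical counts: 20,000 background genes, 40 annotated with a term,
# a target list of 100 genes, 8 of which carry the term.
print(enrichment_pvalue(20000, 40, 100, 8))
```

Under this kind of test, a few dozen tightly clustered genes sharing one annotation can dominate the result, which is consistent with the enrichment disappearing once HLA genes are removed from both lists.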
Discussion

In this article, we presented two likelihood-based methods, T1 and T2, to identify genomic sites under balancing selection. These methods combine intra-species polymorphism and inter-species divergence with the spatial distribution of polymorphisms and substitutions around a selected site. Through simulations, we showed that T1 and T2 vastly outperform both the HKA test and Tajima's D under a diverse set of demographic assumptions, such as a population bottleneck and growth. In addition, application of T1 and T2 to whole-genome sequencing data from Europeans and Africans revealed many previously identified and novel loci displaying signatures of balancing selection.

Simulation results suggest that T2 performs at least as well as T1, and so a natural question is whether T1 would ever be used. Because T2 uses the allele frequency spectrum and T1 does not, T1 would be a valuable statistic to employ when allele frequencies cannot be estimated well. One example is a situation in which the sample size is small (e.g., one or two genomes). Under this scenario, the T2 test statistic would likely provide little additional power over the T1 statistic. As another example, it is becoming increasingly common for studies to sequence a pooled sample of individuals rather than each individual in the sample separately. This pooled sequencing will tend to yield inaccurate estimates of allele frequencies across the genome, which could heavily influence the performance of the T2 statistic. However, if there is sufficient evidence that a pair of alleles is observed at a site in the sample, then this site can be considered polymorphic regardless of its actual allele frequency. Future developments that can statistically account for this uncertainty in allele frequency estimation could be incorporated into the T2 test statistic so that it can be applied to pooled sequencing data.

The model of balancing selection used in this article is from Hudson and Kaplan [21], and assumes that natural selection is so strong that it maintains a constant allele frequency at the selected locus forever. The simulation scenarios considered here also assumed that the strength of balancing selection has been constant since the selected allele arose. However, selection coefficients can fluctuate over time, which motivates future work on the robustness of methods for detecting balancing selection under scenarios in which the strength of selection fluctuates or in which selection is weak.
Future work can use the framework developed here to construct methods for identifying balancing selection under models with more relaxed assumptions (e.g., see Barton and Etheridge [40] and Barton et al. [41] for potential models).

Though we have shown that T1 and T2 perform well under a population bottleneck and growth, they may be less robust to other forms of demographic model violations, such as population structure. Because population subdivision increases the time to coalescence and the corresponding length of a genealogy, we expect higher levels of polymorphism across the genome. Under most assumptions, population subdivision affects the genome uniformly; it increases the level of background polymorphism and likely only slightly decreases the power of the new statistics. However, in some cases, such as an ancient admixture event (e.g., with Neanderthals [42] or Denisovans [43]), levels of variability may increase in only a few regions of the genome, increasing the mean coalescence time in these regions. Such regions may appear to have excess polymorphism relative to background levels and, hence, display false signals of balancing selection under the T1 statistic. However, in non-African humans, introgressed regions typically have low population frequencies [42, 43], and, hence, it would be unlikely for polymorphic sites in these regions to harbor many introgressed alleles segregating at intermediate frequencies. Thus, the T2 statistic, which explicitly utilizes allele frequency spectrum information, would likely be able to distinguish these blocks of archaic admixture from regions of balancing selection. Further, as observed in other studies of natural selection [44, 45], increased robustness to confounding demographic processes can potentially be gained through the use of additional information. For example, population bottlenecks as well as gene flow can increase linkage disequilibrium [46, 47]. Therefore, knowledge about linkage disequilibrium in a region could aid in distinguishing population subdivision from long-term balancing selection.

Another concern when performing genome scans for balancing selection is the possibility of false positives due to bioinformatical errors. For example, misalignment of sequence reads in duplicated regions may lead to falsely elevated levels of variability. In many cases, this problem can be alleviated by removing duplicated regions from analyses. However, a non-negligible portion of the human genome is not represented in standard reference sequences and, thus, there may be many unidentified paralogs in the genome. Fortunately, removing sites that deviate from Hardy-Weinberg equilibrium helps to alleviate these problems, because sites that are fixed differences between paralogous regions, or that segregate at high frequency in one of two (or more) paralogous regions, will show an excess of heterozygotes in combined short-read alignments.
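To make the idea of a Hardy-Weinberg filter concrete, the following sketch computes an exact two-sided p-value from genotype counts at a biallelic site. The particular test applied in this study is described in Materials and Methods and is not reproduced here; the sketch assumes the standard exact test that conditions on the observed allele counts, and the function name is hypothetical.

```python
from math import comb


def hwe_exact_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Exact two-sided Hardy-Weinberg p-value for a biallelic site.

    Conditional on the allele counts, the probability of observing `het`
    heterozygotes is proportional to
        2**het * n! / (hom_a! * het! * hom_b!).
    The p-value sums the probabilities of all heterozygote counts that
    are no more likely than the observed one."""
    n = n_aa + n_ab + n_bb          # number of individuals
    n_a = 2 * n_aa + n_ab           # copies of allele A
    n_b = 2 * n - n_a               # copies of allele B

    def weight(het: int) -> int:
        hom_a = (n_a - het) // 2
        # Multinomial coefficient written as a product of two binomials.
        return (2 ** het) * comb(n, het) * comb(n - het, hom_a)

    # Heterozygote counts must share the parity of the allele count.
    weights = {het: weight(het) for het in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    observed = weights[n_ab]
    total = sum(weights.values())
    return sum(w for w in weights.values() if w <= observed) / total


# Hypothetical genotype counts: every individual heterozygous, the pattern
# expected when reads from two paralogs are collapsed onto one site.
print(hwe_exact_pvalue(0, 20, 0))
# Counts that match Hardy-Weinberg proportions.
print(hwe_exact_pvalue(25, 50, 25))
```

A site would be removed when its p-value falls below the chosen cutoff. Integer weights are compared exactly before the final division, which avoids floating-point ties when deciding which configurations are as extreme as the observed one.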
We applied a Hardy-Weinberg filter to all empirical data analyzed in this article. We note that deviations from Hardy-Weinberg equilibrium are expected under certain forms of balancing selection. In theory, a balancing selection signal could, therefore, be lost due to such filtering. However, we used a filtering cutoff of p < 10^{-4} (see Materials and Methods). The selection required to cause a deviation from Hardy-Weinberg equilibrium of the magnitude used in the filtering would be extremely strong, and such selection would almost certainly have been detected using other methods. Well-established examples of balancing selection in the human genome, such as the selection affecting the HLA loci, are not lost because of filtering, and would generally not be easily detectable using deviations from Hardy-Weinberg equilibrium as a test. Nonetheless, because phenomena other than balancing selection, such as bioinformatical errors or archaic admixture, could potentially lead to false signals of balancing selection, additional evidence should be obtained before definitively concluding that a site has been subjected to balancing selection.

One source of additional evidence of balancing selection is whether a signal lies within a region harboring a trans-specific polymorphism [7, 19], because a polymorphism is unlikely to segregate in both of a pair of closely related species unless selection maintains it. However, relying solely on evidence from trans-specific polymorphisms would miss many true signals of balancing selection that are not maintained as trans-specific polymorphisms. In addition, regions with bioinformatical errors (e.g., mapping errors) may give the same errors in both species, resulting in a false signal of a shared polymorphism between the pair of species. Nevertheless, the observation of a trans-specific polymorphism can provide convincing evidence of ancient balancing selection [7, 19]. Previous studies of selection have shown that combinations of statistics can be powerful tools when identifying genes under selection [15, 18, 48]. Hence, combining our methods with other summaries (e.g., linkage disequilibrium [44-47]) or information on trans-species polymorphisms [7, 19] will lead to increasingly effective approaches for detecting balancing selection.

Another commonly-cited source of evidence for balancing selection is based on consideration of the topology and branch lengths of within-species haplotype trees. Under long-term balancing selection, the underlying genealogy (e.g., see Fig. S13) will be symmetric, with long basal branches separating a pair of allelic classes (i.e., haplotypes containing one variant and haplotypes containing the other variant). However, the underlying genealogy for a linked neutral variant may differ substantially from that of the selected site.
Around a balanced polymorphism, there will be a strong reduction of linkage disequilibrium, not unlike a recombination hotspot, because the long genealogy in the balanced polymorphism provides extra opportunities for recombination. Consequently, the signal of balancing selection will be narrow, and trees estimated from sites located in a window around the balanced polymorphism may fail to detect the presence of highly divergent haplotypes. The utility of within-species haplotype trees as a signature of long-term balancing selection is unclear, as the genealogy of the haplotype may not match the genealogy of the selected region. For example, Figure S14 shows that haplotype trees based on scenarios under balancing selection appear similar to those under neutrality, with the difference that external branches are slightly longer under balancing selection than under neutrality, which contrasts with the generally-held belief that basal branches should be long. As such, haplotype networks or trees may not be powerful tools for