Approximate likelihood methods for estimating local recombination rates


J. R. Statist. Soc. B (2002) 64, Part 4

Paul Fearnhead and Peter Donnelly
University of Oxford, UK

[Read before The Royal Statistical Society at a meeting organized by the Research Section on 'Statistical modelling and analysis of genetic data' on Wednesday, May 22nd, 2002, Professor D. Firth and Professor R. A. Bailey in the Chair]

Summary. There is currently great interest in understanding the way in which recombination rates vary, over short scales, across the human genome. Aside from inherent interest, an understanding of this local variation is essential for the sensible design and analysis of many studies aimed at elucidating the genetic basis of common diseases or of human population histories. Standard pedigree-based approaches do not have the fine scale resolution that is needed to address this issue. In contrast, samples of deoxyribonucleic acid sequences from unrelated chromosomes in the population carry relevant information, but inference from such data is extremely challenging. Although there has been much recent interest in the development of full likelihood inference methods for estimating local recombination rates from such data, they are not currently practicable for data sets of the size being generated by modern experimental techniques. We introduce and study two approximate likelihood methods. The first, a marginal likelihood, ignores some of the data. A careful choice of what to ignore results in substantial computational savings with virtually no loss of relevant information. For larger sequences, we introduce a composite likelihood, which approximates the model of interest by ignoring certain long-range dependences. An informal asymptotic analysis and a simulation study suggest that inference based on the composite likelihood is practicable and performs well.
We combine both methods to reanalyse data from the lipoprotein lipase gene, and the results seriously question conclusions from some earlier studies of these data.

Keywords: Coalescent; Composite likelihood; Lipoprotein lipase; Marginal likelihood; Mutation rate heterogeneity; Pseudolikelihood; Recombinational hot spot; Recombination rate

Address for correspondence: Paul Fearnhead, Department of Mathematics and Statistics, Fylde College, Lancaster University, Lancaster, LA1 4YF, UK. E-mail: p.fearnhead@lancaster.ac.uk

© 2002 Royal Statistical Society

1. Introduction

Humans, like most (so-called diploid) organisms, have two versions of each chromosome, one inherited from each parent. Each sperm or egg has only a single copy of each chromosome, typically formed as a mosaic of the two parental copies. The process of shuffling the two parental copies to produce a single chromosome is called recombination. Although not completely understood, it involves variation, so for example the complete set of chromosomes in two different sperm or eggs from the same person would be expected to differ, because of differences in the recombination processes during their formation. The recombination rate between two specified positions, or loci, on a chromosome is defined to be the probability that the deoxyribonucleic acid (DNA) at each location on the offspring chromosome comes from different chromosomes in the parent. Between two positions which are very close together on the chromosome, the recombination rate will be extremely small. For positions which are far apart, it will be close to 1/2. (Recombination rates are bounded below by 0 and above by 0.5.)

There is enormous current interest in understanding the way in which recombination rates vary across the human genome. There is known to be variation over large scales, but little is known about the extent to which recombination rates vary over small scales. Aside from inherent interest, in shedding light on the underlying biological processes, a good understanding of patterns of local variation is central to both the design and the analysis of current and planned studies aimed at elucidating the genetic basis of common human diseases (e.g. Pritchard and Przeworski (2001)). Although the methods described in this paper can be applied to data from any diploid species, we shall, for definiteness, focus the discussion on the case of current major interest, namely humans.

It may be helpful to the subsequent discussion to have some (very rough!) idea of the scales involved in the human genome. The genome consists of around $3 \times 10^9$ nucleotides, or base pairs of DNA, broken into 23 chromosomes. (For our purposes, DNA can be thought of as being a linear string made up of discrete building-blocks, namely the four nucleotides {A, C, G, T}.) The chromosomes differ in size but think of a typical chromosome as being $10^8$ bp (base pairs), or 100 Mb (megabases), long; the smallest is about 50 Mb, and the largest about 300 Mb. Distances measured in base pairs are called physical distances. This contrasts with so-called genetic distance, which measures the amount of recombination between two loci. Two loci are at a genetic distance of 1 M (morgan) if the expected number of recombination events between them during one meiosis (the formation of a sperm or egg cell) is 1. Because recombination rates vary across the genome, there is no standard conversion between physical and genetic distances.
The average across the genome is close to 1 cM (centimorgan) per megabase (e.g. Pritchard and Przeworski (2001)). There are various models which relate recombination rates to genetic distance (see for example Speed and Zhao (2001)), but over small genetic distances recombination rates and genetic distances are almost the same. Thus for example the recombination rate between two loci at genetic distance 1 cM will be close to 0.01.

The standard method for assessing recombination rates in humans is from pedigree data. In its simplest form, we might have observations on parent-child duos and simply count the number of recombination events between the loci of interest. In practice it is much more complicated, and can be highly non-trivial statistically, effectively because often only incomplete information is available about recombination events (see for example Holmans (2001) and Thompson (2001)). There are natural limits to the resolution of pedigree-based methods: they can only be used to estimate recombination rates between loci whose genetic distance is of the order of centimorgans or more. If complete information were available, the problem would amount to estimating a binomial success probability. The real problem is more difficult, but for example thousands of meiotic events are needed to estimate recombination rates of the order of $10^{-2}$. Over the last 5 years or so, the ability to type single sperm separately has added to the resolution, allowing the estimation of rates of the order of $10^{-3}$ or $10^{-4}$. Recently, more sophisticated (sperm typing) experimental approaches allow a practicable estimation of male recombination rates as small as $10^{-5}$ (Jeffreys et al., 2001) but at very substantial cost. Sperm typing methods are currently not feasible for simultaneously assessing rates across the genome, and in any event provide no information on recombination rates in females (which differ in humans from rates in males, at least over large scales).
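As a rough numerical check of the remark that recombination rates and genetic distances nearly coincide over small distances, the following sketch evaluates Haldane's map function, r = (1 - exp(-2d))/2 with d in morgans. Haldane's function is only one of the available models (it assumes no interference) and is our choice for illustration, not one singled out in the text; the function name is hypothetical.

```python
import math


def haldane_recombination_rate(distance_morgans: float) -> float:
    """Recombination rate implied by Haldane's map function.

    Assumption: Haldane's model (no interference) is used purely as an
    example of a map function relating genetic distance to recombination rate.
    """
    if distance_morgans < 0:
        raise ValueError("genetic distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


if __name__ == "__main__":
    # Small distances: r is close to d. Large distances: r approaches 1/2.
    for d in (0.0001, 0.01, 0.1, 1.0, 5.0):
        r = haldane_recombination_rate(d)
        print(f"d = {d:g} M  ->  r = {r:.6f}")
```

At d = 0.01 M (1 cM) the rate differs from 0.01 only in the fourth decimal place, whereas for distances of several morgans it is indistinguishable from the upper bound of 0.5.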
Thus, in short, existing high-throughput methods provide information about recombination rates over scales of millions, or in some cases hundreds of thousands, of bases. (We note in passing that the estimation of recombination rates is much easier in most experimental organisms, effectively because experimenters can simply arrange for large numbers of matings of the type that are most helpful.)
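The resolution limit of pedigree-based methods can be illustrated with the idealized binomial calculation mentioned above. The sketch below assumes complete information and independent meioses, which the text notes is simpler than the real problem; the function names and the 25% target precision are hypothetical choices.

```python
import math


def binomial_relative_se(p: float, n_meioses: int) -> float:
    """Standard error of the binomial estimate of p, relative to p itself."""
    return math.sqrt(p * (1.0 - p) / n_meioses) / p


def meioses_needed(p: float, target_relative_se: float) -> int:
    """Smallest n for which the relative standard error is at most the target."""
    return math.ceil((1.0 - p) / (p * target_relative_se ** 2))


if __name__ == "__main__":
    # Hypothetical target: a relative standard error of 25%.
    for p in (1e-2, 1e-3, 1e-4):
        n = meioses_needed(p, 0.25)
        print(f"rate {p:g}: about {n} fully informative meioses needed")
```

Under these assumptions a rate of the order of 0.01 already requires well over a thousand informative meioses, and the requirement grows roughly in inverse proportion to the rate being estimated.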

A recent study (Jeffreys et al., 2001) showed substantial clumping of recombination events within one 200 kb region of the human genome. The possibility of such extreme local variation in recombination rates is also supported by several other lines of evidence (see for example Jeffreys et al. (2001) and Daly et al. (2001), and references therein), but little systematic information is currently available about recombination rates over physical distances of kilobases or tens of kilobases in humans.

There is potentially useful information for estimating local recombination rates in samples of chromosomes taken from unrelated individuals in a population. Such chromosomes are related by an (unobserved) pedigree, or genealogy, going back many generations (typically many tens of thousands for human DNA). If we could observe both the genealogy and the recombination events on it, this would allow (straightforward) estimation of recombination rates as small as $10^{-4}$ or smaller. In practice, neither the genealogy nor the recombination events are directly observed, but data of this kind carry some information about each, and hence, at least in principle, information for estimating recombination rates over small scales.

There has thus been intense recent interest in methods for estimating local recombination rates from population genetic data. Early approaches used summary statistics (often via a method-of-moments approach; see for example Hudson and Kaplan (1985), Hudson (1987) and Wakeley (1997)); however, these methods use only a small amount of the information that is available from the data and can have poor statistical properties (Wall, 2000). More encouragingly, several different groups have developed full likelihood inference methods (Griffiths and Marjoram, 1996a; Kuhner et al., 2000; Fearnhead and Donnelly, 2001), which are optimal in the sense that they use all the available information from the data.
However, all these approaches use computationally intensive statistical techniques (either Markov chain Monte Carlo or importance sampling), with very substantial computational burdens, and potential difficulties in assessing the accuracy of approximated likelihood curves. As one example, the most efficient current method takes a month on a 400 MHz Pentium personal computer to give reliable estimates of the likelihood surface for a data set of 500 bp of sequence data from each of 31 chromosomes (Fearnhead et al., 2002). (This method is typically about four orders of magnitude more efficient than some other published approaches; see Fearnhead and Donnelly (2001).) Existing full likelihood approaches are not practicable for the sizes of data set that are currently being generated. Although there may be scope for improving their efficiency, possibly substantially, we adopt a different approach in this paper, in studying two approximations. Rather than developing computational methods for approximating the full likelihood, we consider two different likelihoods. The first is a marginal likelihood, where we ignore some of the information in the data, considering instead the likelihood for a reduced data set. Through a careful choice of which aspects of the data to ignore, this results in considerable computational savings with only a very limited loss of information for estimating parameters of interest. Our second approximation in effect uses a different, simpler, model for the data, essentially by ignoring certain long-range dependences. We call the resulting likelihood a composite likelihood for our original problem. We sketch some asymptotic (in the length of the sequence) theory and describe simulation studies, both of which are encouraging, and suggest that the resulting procedures may be reasonable for estimation. Neither general idea is new, although the way in which we implement them is novel. 
For example Wall (2000) presented an estimation method for recombination based on the likelihood for a summary of the data. We choose a higher dimensional, more informative, summary, so rather more sophisticated methods for studying the associated likelihood are needed. We note that, with our chosen summary, the marginal likelihood is typically almost indistinguishable from the full likelihood. Our composite likelihood approach draws on ideas from spatial statistics (e.g. Besag (1975)). A related, though in some senses complementary, approach to estimating recombination rates has been developed by Hudson (2001a) and subsequently adapted by McVean et al. (2002). They created a composite log-likelihood by summing the log-likelihood of the data at all pairs of nucleotides. Instead, we split the chromosomal region of interest into subregions and create a composite log-likelihood by summing the full log-likelihoods for each subregion. The accuracy of inference based on our composite log-likelihood function depends crucially on the size of subregions used. The larger the subregions, the closer the composite log-likelihood is to the (optimal) full log-likelihood.

There are two ways of thinking about the generic statistical problem that is considered here. In the population genetics context there is an accepted family of models for population evolution, including mutation and recombination. Forwards in time these can be thought of as versions of the Fleming-Viot measure-valued diffusion, an infinite dimensional stochastic process. Backwards in time, they induce a random genealogy modelled by the coalescent, and its relatives. (There are important issues about the adequacy of these models for particular applications, but the models are none-the-less widely used, and we shall not consider such issues here. For a discussion in the context of estimating recombination rates, see Fearnhead and Donnelly (2001).) One way of thinking about the challenge of the inference problem is that our sample of chromosomes from the population corresponds to partial information about the value of the Fleming-Viot process at a single point in time, on the basis of which we wish to estimate the (two) parameters governing its evolution.
Another perspective, closer to the presentation four paragraphs above, is that this is a missing data problem. If we were told the unobserved genealogy, and recombination events, inference would be straightforward. Missing data problems have received considerable recent attention, but this one is particularly challenging. Although, as is typical, there are choices about exactly how to specify the missing data, however this is done, the space in which it resides is very large.

The paper is organized as follows. The next section gives a very brief outline of the role of the coalescent and its extension to allow for recombination. For further background see for example Hudson (1990) and Donnelly and Tavaré (1995). The marginal likelihood approximation is introduced and studied in Section 3, with the composite likelihood approximation in Section 4. In Section 5 we apply these methods to a data set of sequence variation in the lipoprotein lipase (LPL) gene, a data set which is many times too large for it to be practicable to use full likelihood methods. Aspects of our results differ markedly from a previous analysis of these data (Templeton et al., 2000a): we find no evidence for their conclusion that repeat mutation, and not recombination, is responsible for producing many of the features that are observed in the data. Evidence for their conjectured recombinational hot spot differs substantially across population samples.

The term site refers to a particular nucleotide position. A site is said to be segregating in a sample of chromosomes if not all chromosomes in the sample have the same nucleotide at that site. We also assume throughout that the data consist of DNA sequences from each chromosome sampled. In genetics terminology, this is an assumption that the haplotype of each chromosome is known, or equivalently that we know the phase at each segregating site. Phase information can either be obtained experimentally (e.g. Sobel and Lange (1996)) or inferred (e.g. Stephens et al. (2001)). Throughout we only consider estimating recombination rates from DNA sequence data, although the approximate likelihood methods that we suggest have obvious extensions to other types of population genetic data.

Fig. 1. Example of a genealogy at (a) a single site and (b) at two sites for a sample of size 3: moving up the tree or graph corresponds to going back in time; the joining of branches (going back in time) represents chromosomes sharing a common ancestor (these are called coalescent events); in (b) the dependence between the genealogies can be seen: they differ only because of the effect of recombination

2. Background

Population genetic data are generated by the interaction of two processes: the genealogical process (the interrelatedness of different chromosomes as a result of shared ancestry over long timescales) and the mutation process. Note that recombination affects the first of these, as it enables the DNA at two loci on one chromosome to be descended from different chromosomes in the previous generation.

First consider the genealogical history of a single position or site in the sequence. This can be represented by a tree (Fig. 1), with time going back into the past as we move up the page. For a sample of n chromosomes, there are initially n distinct branches, each representing the ancestry, at the site of interest, of one of the chromosomes sampled. As we go back in time, chromosomes in the sample share common ancestors (represented by the joining, or coalescing, of branches in the tree). The genealogy stops when all the chromosomes sampled are traced back to a single common ancestor at the site in question.

The genealogy for a stretch of recombining DNA is more complicated: each site in the region of interest has its own genealogical tree. However, genealogies at nearby sites will be strongly dependent; in fact they are often identical (they only differ if a recombination event occurs between the two sites during the genealogical history of the sample).
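The single-site genealogy described above can be simulated in a few lines under the standard coalescent. The sketch below is illustrative only: it assumes time is measured in units in which each pair of lineages coalesces at rate 1, so that with k lineages the waiting time to the next coalescence is exponential with rate k(k - 1)/2, and it ignores recombination entirely. The function names are hypothetical.

```python
import random


def simulate_coalescence_times(n: int, rng: random.Random) -> list:
    """Waiting times while k = n, n-1, ..., 2 lineages remain.

    Assumption: standard coalescent time units, in which each of the
    k*(k-1)/2 pairs of lineages coalesces independently at rate 1.
    """
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        times.append(rng.expovariate(rate))
    return times


def time_to_common_ancestor(n: int, rng: random.Random) -> float:
    """Height of the tree: time until a single common ancestor is reached."""
    return sum(simulate_coalescence_times(n, rng))


if __name__ == "__main__":
    rng = random.Random(2002)
    n, reps = 3, 20000
    mean_height = sum(time_to_common_ancestor(n, rng) for _ in range(reps)) / reps
    # Theory for this model: E[height] = 2 * (1 - 1/n).
    print(f"simulated mean height {mean_height:.3f}, theory {2 * (1 - 1 / n):.3f}")
```

For a sample of size 3, as in Fig. 1, the expected height of the tree in these units is 4/3, most of which is spent waiting for the final two lineages to coalesce.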
The entire collection of genealogical trees for the region of interest can be represented by a graph, called the ancestral recombination graph (ARG) (Griffiths and Marjoram, 1996b); see Fig. 1 for an example. Conditional on the genealogy of a sample, mutations occur as a Poisson process along the branches of the ARG, independently for distinct sites. Furthermore, given the realization of the ARG, it is straightforward to evaluate the probability of a particular configuration of sequences in the chromosomes sampled.

As noted above, we shall focus here on the simplest setting in which the model for the ARG is given by the coalescent with recombination. See for example Kingman (1982a, b), Hudson (1983) and Kaplan and Hudson (1985) for further background. The model is parameterized by scaled mutation and recombination rates, denoted by θ and ρ respectively. If N is the effective population size, and u and r are respectively the probabilities of mutation and recombination within the region of interest in a single generation, then θ = 4Nu and ρ = 4Nr. (The timescaling by N reflects the fact that chromosomes are typically related over times of the order of N generations. Thus unless the per generation recombination rate r between two loci is extremely small (say $10^{-4}$ or smaller) the scaled recombination parameter ρ will be large, and the genealogies, and hence genetic types in a sample, will be independent at the two loci. This is an important difference from pedigree analyses of recombination. In our population setting, dependence between the types at two loci (induced by the correlation in their genealogies) typically only extends over distances of fractions of centimorgans. In pedigree studies the dependence extends over chromosomal scales.)

Any full likelihood method essentially needs to calculate probabilities by averaging over the unobserved realizations of the genealogy. These are themselves very high dimensional objects, living in an infinite dimensional space. Two different classes of approach have been suggested, based on either importance sampling or Markov chain Monte Carlo methods. See Stephens and Donnelly (2000) and Stephens (2001) for a general discussion of the different approaches to this problem and Fearnhead and Donnelly (2001) for a comparison of published methods for full likelihood inference in the presence of recombination.

To proceed, we shall also need to fix on a particular model for mutation. For simplicity and definiteness we shall focus primarily on the so-called infinite sites model. This model, which assumes that each mutation event that occurs in the genealogical history of the sample will affect a different nucleotide site, may not be unreasonable for much human DNA sequence data (with the exception of mitochondrial DNA). An extension of the ideas in what follows to other mutational models is straightforward in principle, and in some cases also in practice.
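The scaling θ = 4Nu and ρ = 4Nr defined above is simple arithmetic, sketched below. The effective population size and the per-base, per-generation probabilities are hypothetical values chosen only so that the scaled rates come out at 1 per kilobase; they are not estimates taken from the text.

```python
def scaled_rate(effective_size: float, per_generation_prob: float) -> float:
    """Scaled rate 4 * N * x, where x is a per-generation probability (u or r)."""
    return 4.0 * effective_size * per_generation_prob


if __name__ == "__main__":
    # Hypothetical illustrative values (assumptions, not from the paper).
    N = 10_000            # effective population size
    u_per_bp = 2.5e-8     # mutation probability per base per generation
    r_per_bp = 2.5e-8     # recombination probability per base per generation
    region_bp = 1_000     # a 1 kb region

    theta = scaled_rate(N, u_per_bp * region_bp)
    rho = scaled_rate(N, r_per_bp * region_bp)
    print(f"theta = {theta:.3f}, rho = {rho:.3f} for a {region_bp} bp region")

    # With the same N, a per-generation recombination rate of 1e-4 between
    # two loci already gives a scaled rate of about 4.
    print(f"rho for r = 1e-4: {scaled_rate(N, 1e-4):.1f}")
```

Because both parameters are proportional to the per-generation probability for the whole region, they scale linearly with the length of the region when per-base rates are constant.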
One of our methods, and the data analysis of Section 5, is based on a finite sites model which explicitly allows recurrent mutation.

In the simulation studies which follow we fix particular values for θ and ρ, which are plausible for human populations, of 1 per kilobase in each case (e.g. Pritchard and Przeworski (2001)). Discussions of sequence length should be interpreted relative to these parameter values. Thus the computational burden depends on the total recombination and mutation rates across the regions in question. A method which is computationally feasible for a sequence of length 2 kb with θ = ρ = 1 per kilobase will be feasible for a sequence of length 10 kb if θ = ρ = 0.2 per kilobase.

3. Marginal likelihood

Our first approximation is to consider the likelihood of a summary of the data, the idea being to find a summary which is both informative about ρ and for which it is still practicable to calculate likelihood surfaces. Wall (2000) used this idea in the context of estimating recombination rates. He chose a summary which was sufficiently simple that the likelihood for the reduced data could be estimated well by naïve simulation. Here we work with a higher dimensional, and hopefully considerably more informative, summary. The price to be paid is that more sophistication is needed to estimate the relevant likelihood. As described below, we do this here by adapting to this simpler problem the importance sampling approach that we developed in Fearnhead and Donnelly (2001) for full likelihood estimation of recombination rates.

Let S be the set of segregating sites at which the minor nucleotide frequency (the number of times that the less common nucleotide appears in the data) is greater than some prespecified value. Several particular choices are discussed below. We summarize the data by $D_S$, the haplotypes defined (only) by sites in S, and $S_O$, the number of segregating sites in the data which are not in S. See Fig. 2 for an example.
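The summary just defined can be sketched in a few lines of Python, using the four eight-site haplotypes of the Fig. 2 example. The function name is hypothetical, and two conventions are assumptions: a site is kept when its minor count is at least the threshold (the convention used in the figure), and for a site with more than two nucleotides the minor count is taken to be the smallest count.

```python
from collections import Counter


def summarize(haplotypes, min_minor_count):
    """Return (sites in S, haplotypes restricted to S, number of other segregating sites)."""
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    kept_sites = []   # 0-based positions of the sites in S
    n_other = 0       # S_O: segregating sites that are not in S
    for site in range(n_sites):
        counts = Counter(h[site] for h in haplotypes)
        if len(counts) < 2:
            continue  # every chromosome agrees: the site is not segregating
        if min(counts.values()) >= min_minor_count:
            kept_sites.append(site)
        else:
            n_other += 1

    reduced = ["".join(h[s] for s in kept_sites) for h in haplotypes]
    return kept_sites, reduced, n_other


if __name__ == "__main__":
    data = ["ACGATTAG", "ACGATTAA", "AGGTTTAA", "AGGTCTAG"]
    sites, d_s, s_o = summarize(data, min_minor_count=2)
    print(sites)  # [1, 3, 7]
    print(d_s)    # ['CAG', 'CAA', 'GTA', 'GTG']
    print(s_o)    # 1
```

The three retained sites are those at which each nucleotide appears twice; the single site at which one chromosome carries C and the other three carry T contributes only to the count of other segregating sites.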
Full data    Summary
ACGATTAG     C A G
ACGATTAA     C A A
AGGTTTAA     G T A
AGGTCTAG     G T G
             + 1 other segregating site

Fig. 2. Example of our summary of the data: the full data consist of the DNA at eight sites in four chromosomes; here we have chosen to keep only the sites at which the minor nucleotide frequency is 2 (i.e. there are two of both of the nucleotides at that site); this defines our set S; our summary is the types of each chromosome at these three sites, and the number of other segregating sites (sites at which more than one nucleotide appears); we also assume known the position of the sites in S, the number of sites sequenced and the minor nucleotide frequency used in the definition of S

The idea is that $D_S$ should be informative about ρ, as it contains much of the linkage disequilibrium (LD) information from the data (i.e. information about the non-independence of the collections of nucleotides at different positions) and it is this information that is particularly informative about ρ. The total number of segregating sites should be informative about θ.

The marginal likelihood $L_M(\rho, \theta)$ is the likelihood of this summary of the data. Let an ancestral history $H$ be the collection of genealogies at all sites in the sequence, together with the mutational history at the sites in S. Thus $H$ is the ARG for the whole sequence (see Fig. 1), plus the positions in the ARG of the mutations which affect the sites in S. Conditionally on $H$, $D_S$ and $S_O$ are independent, so

$$L_M(\rho, \theta) = \sum_{H} p(D_S \mid H)\, p(S_O \mid H, \theta)\, p(H \mid \rho, \theta),$$

where the summation is over all possible ancestral histories. (For simplicity, in our notation we have suppressed the dependence of $p(S_O \mid H, \theta)$ on the position of sites in S and the threshold used to define S. Also we have slightly abused the notation as the summation over ancestral histories is in fact a sum over all topologies of the ARG, together with an integral over the lengths of the branches of this ARG, and the positions of the mutations in the ARG.)
Now $p(D_S \mid H)$ is either 1 or 0 (as the ancestral histories uniquely determine the sample at sites in S). If we let $\mathcal{H}$ denote the set of ancestral histories for which $p(D_S \mid H) = 1$, and if $q(H)$ is a probability mass function whose support contains $\mathcal{H}$, then

$$L_M(\rho, \theta) = \sum_{H \in \mathcal{H}} \frac{p(S_O \mid H, \theta)\, p(H \mid \rho, \theta)}{q(H)}\, q(H). \qquad (1)$$

This suggests using importance sampling, with proposal density $q(H)$, to approximate $L_M(\rho, \theta)$. This will be feasible provided that we can calculate $p(S_O \mid H, \theta)$ (up to a constant multiplier which does not depend on $H$).

We calculate $p(S_O \mid H, \theta)$ as follows. Consider just sites not in S, and write $L$ for the total length of branches in the genealogies at these sites, $L_F$ for the length of the subset of branches on which a mutation can occur without placing the site in S, and $\theta_b$ for the per base mutation rate. Then, assuming that there are no repeat mutations at sites not in S (which is automatic under the infinite sites assumption, but also may be a reasonable approximation under other mutation models), since mutations occur, independently, as a Poisson process of rate $\theta_b/2$ along each branch,

$$p(S_O \mid H, \theta) \propto (L_F \theta_b)^{S_O} \exp(-L \theta_b / 2).$$
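The two ingredients above, the Poisson factor for the sites outside S and the averaging of importance weights in equation (1), can be sketched on the log scale as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the numbers in the usage example are made-up stand-ins for quantities that a real implementation would compute from sampled ancestral histories.

```python
import math


def log_p_other_sites(total_length, free_length, n_other, theta_b):
    """log[(L_F * theta_b)^S_O * exp(-L * theta_b / 2)].

    This equals log p(S_O | H, theta) up to an additive constant that does
    not depend on the ancestral history H.
    """
    penalty = -total_length * theta_b / 2.0
    if n_other == 0:
        return penalty
    if free_length <= 0.0:
        return float("-inf")  # mutations are needed but no branch can carry them
    return n_other * math.log(free_length * theta_b) + penalty


def log_importance_estimate(log_weights):
    """log of the average importance weight, computed stably via log-sum-exp."""
    m = max(log_weights)
    if m == float("-inf"):
        return m
    total = sum(math.exp(w - m) for w in log_weights)
    return m + math.log(total / len(log_weights))


if __name__ == "__main__":
    # Hypothetical sampled histories: (L, L_F, log p(H | rho, theta) - log q(H)).
    sampled = [(40.0, 25.0, -1.2), (55.0, 30.0, -0.4), (35.0, 20.0, -2.0)]
    s_other, theta_b = 3, 0.05
    log_w = [log_p_other_sites(L, LF, s_other, theta_b) + log_ratio
             for (L, LF, log_ratio) in sampled]
    print(f"log-likelihood estimate (up to a constant): {log_importance_estimate(log_w):.4f}")
```

Working with log-weights avoids numerical underflow when the weights are very small, which is an implementation choice made here rather than something the text prescribes.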

Although in principle any choice of proposal density $q$ (with the appropriate support) in equation (1) will result in unbiased estimation of $L_M(\rho, \theta)$, the choice of $q$ can have dramatic effects on the efficiency of the method. We took as our choice of proposal density the one developed by Fearnhead and Donnelly (2001) for infinite sites data. We implemented this by adapting their program infs.

This approach is still highly computationally intensive because of the need to propose ancestral histories which contain genealogies at all sites in the sequence. Large computational savings can be obtained if we define a new ancestral history $H'$, consisting of just genealogies and mutations at the sites in S. As before, if $\mathcal{H}'$ denotes the set of all such ancestral histories with $p(D_S \mid H') = 1$, then

$$L_M(\rho, \theta) = \sum_{H' \in \mathcal{H}'} \frac{p(S_O \mid H', \theta)\, p(H' \mid \rho, \theta)}{q(H')}\, q(H'), \qquad (2)$$

for any proposal density $q(H')$ whose support contains $\mathcal{H}'$. Exact calculation of $p(S_O \mid H', \theta)$ is not now possible. Instead we approximate this probability by assuming that the genealogy at each site is the same as the genealogy of the site in S to which it is closest (if a site is equidistant from the two closest sites in S, then with probability 1/2 we assume that it has the genealogy of the left-hand site, and otherwise we assume that it has the genealogy of the right-hand site). Use of this approximation to $p(S_O \mid H', \theta)$ in equation (2) defines an approximation to the marginal likelihood, which we call the approximate marginal likelihood.

Again we approximated the approximate marginal likelihood via importance sampling. Our proposal density is that derived by Fearnhead and Donnelly (2001) for finite sites data. (Thus one advantage of our approximate marginal likelihood approach is that it is based on a mutation model which explicitly allows repeat mutations at sites in S.)
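The nearest-site rule used in the approximation above, including the random tie-break for equidistant sites, might be sketched as follows. The function name and the example positions are hypothetical; positions are assumed to be base-pair coordinates, with the sites in S supplied in increasing order.

```python
import bisect
import random


def nearest_s_site(position, s_positions, rng):
    """Index (into s_positions) of the site in S whose genealogy is borrowed.

    s_positions must be sorted in increasing order. A site equidistant from
    its two closest sites in S takes the left-hand one with probability 1/2.
    """
    i = bisect.bisect_left(s_positions, position)
    if i == 0:
        return 0                          # to the left of every site in S
    if i == len(s_positions):
        return len(s_positions) - 1       # to the right of every site in S
    d_left = position - s_positions[i - 1]
    d_right = s_positions[i] - position
    if d_left < d_right:
        return i - 1
    if d_right < d_left:
        return i
    return i - 1 if rng.random() < 0.5 else i


if __name__ == "__main__":
    rng = random.Random(0)
    s_positions = [100, 500, 900]         # hypothetical positions of sites in S
    for pos in (50, 290, 300, 310, 950):
        idx = nearest_s_site(pos, s_positions, rng)
        print(f"site {pos} borrows the genealogy at position {s_positions[idx]}")
```

Once every site outside S has been assigned in this way, the branch lengths needed for the Poisson factor can be accumulated from the genealogies at the sites in S alone.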
We implemented this by adapting the program fins of Fearnhead and Donnelly (2001).

3.1. Implementation

We simulated data for samples of 50 sequences of length 1 kb, 2 kb and 4 kb, for θ = ρ = 1 per kilobase, values which, as noted above, are plausible for human populations. At least for this region of parameter space, increasing the sample size has little effect on the precision of parameter estimation (Fearnhead and Donnelly, 2001), so here and elsewhere in the paper we focus simulation effort on exploring other aspects of the problem. The thresholds used in deciding which segregating sites to exclude in the marginal likelihoods were as follows:

(a) for the 1 kb data, include only sites at which the minor nucleotide frequency is at least 2;
(b) for 2 kb and 4 kb, include the five sites with the highest minor nucleotide frequency (including ties), and all other segregating sites with minor nucleotide frequency at least 30% of the sample.

There was striking agreement between the full likelihood, marginal likelihood and approximate marginal likelihood surfaces in the simulation study. For example, Fig. 3 shows likelihood curves for ρ with θ fixed at the true value, for 2 kb and 4 kb data. Similar results are obtained for many other data sets. For analysing 1 kb data the three curves are always almost identical (the results are not shown), confirming the plausible intuition that there is little information about recombination in singleton mutations (although such mutations are not completely uninformative). We conclude that in practice, for these thresholding schemes, there is little to be gained in using a full likelihood approach, nor in working hard to calculate the marginal likelihood exactly. We would thus recommend the use of the approximate marginal likelihood method for sequences of this length, implemented by ignoring (only) singleton sites.

Fig. 3. Comparisons of full, marginal and approximate marginal log-likelihood curves for ρ at the true value of θ for simulated data sets of 50 chromosomes (the thresholding schemes used are described in the text; the true value of ρ was 1): (a) 2 kb data; (b) 4 kb data (only the full and approximate marginal log-likelihood curves are shown)

The saving in computational time is considerable. Although it depends on the structure of particular data sets, and (often to a lesser extent) the level at which the threshold is set, calculating the approximate marginal likelihood accurately can reduce computing time by 1-2 orders of magnitude when compared with the marginal likelihood and 1-3 orders of magnitude compared with the full likelihood, with greater relative savings for more complicated problems. For sequences of length above about 5 kb, even the calculation of the marginal likelihoods can become computationally prohibitive.

4. A composite likelihood

Here we consider a different approximation, which is similar in spirit to that of Besag (1975) for spatial data. Consider DNA sequence data from a chromosomal region of interest. Split the region of interest into $R$ subregions. For $r = 1, \ldots, R$, let $D_r$ be the data from the $r$th subregion. For notational simplicity, we assume that each subregion is of the same length, and let ρ and θ now denote the recombination and mutation rates over one subregion (i.e. $1/R$ times the rates for the whole region of interest). We assume in this section that these rates are constant across the subregions. Now we define the composite likelihood $L_C(\rho, \theta)$ to be

$$L_C(\rho, \theta) = \prod_{r=1}^{R} p(D_r \mid \rho, \theta).$$

The composite likelihood ignores information in the data: it neglects the fact that the $i$th sequence in each of $D_1, \ldots, D_R$ comes from the same chromosome. Furthermore, the composite likelihood is not even a probability of some summary of the data, as it ignores the dependence between data from different subregions. However, we propose to base inference on this composite likelihood, and in particular to estimate the parameters of interest by the values which maximize the composite likelihood. Note that this approach has similarities to the pairwise methods of Hudson (2001a) and subsequently McVean et al. (2002), and the idea of zeroth-order likelihood for stationary stochastic processes (Azzalini, 1983).

The composite likelihood function can be calculated by using the importance sampling method of Fearnhead and Donnelly (2001) to evaluate each factor in the product. In view of the results of the previous section, it would seem natural to evaluate each subregion likelihood $p(D_r \mid \rho, \theta)$ via the approximate marginal likelihood, rather than the full likelihood. This is indeed what we recommend in practice (and what we have applied in Section 5). To understand the consequences of our various approximations better, we consider in this section the use of the composite likelihood with an evaluation of the full likelihood for each subregion.

In the next subsection we give a very informal discussion of some relevant theoretical issues. For a more complete consideration, see Fearnhead (2002). Section 4.2 then considers empirical evidence on the use of this composite likelihood.
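On the log scale the composite likelihood is simply a sum of subregion log-likelihood curves evaluated on a common grid of parameter values, which can then be maximized over the grid. The sketch below shows only this bookkeeping for ρ with θ held fixed; the quadratic curves are made-up stand-ins for the per-subregion log-likelihoods, which in practice would come from importance sampling, and all names are hypothetical.

```python
def composite_log_likelihood(subregion_logliks):
    """Sum per-subregion log-likelihood curves evaluated on a common grid."""
    n_grid = len(subregion_logliks[0])
    if any(len(curve) != n_grid for curve in subregion_logliks):
        raise ValueError("all curves must be evaluated on the same grid")
    return [sum(curve[j] for curve in subregion_logliks) for j in range(n_grid)]


def argmax_on_grid(grid, values):
    """Grid point at which the curve is largest."""
    best = max(range(len(grid)), key=lambda j: values[j])
    return grid[best]


if __name__ == "__main__":
    rho_grid = [round(0.1 * k, 1) for k in range(1, 31)]   # 0.1, 0.2, ..., 3.0

    # Hypothetical stand-ins for log p(D_r | rho, theta), r = 1, 2, 3:
    # quadratic curves peaking at different values of rho.
    peaks = [0.8, 1.0, 1.4]
    curves = [[-(rho - m) ** 2 for rho in rho_grid] for m in peaks]

    log_lc = composite_log_likelihood(curves)
    print("composite estimate of rho per subregion:", argmax_on_grid(rho_grid, log_lc))
```

With these stand-in curves the composite estimate is the grid point closest to the average of the three subregion peaks, which illustrates how the subregions pool their information about a rate assumed constant across them.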
Both the theoretical and the empirical considerations are encouraging.

4.1. Informal theoretical considerations

The obvious point estimates for ρ and θ are just the values $\hat{\rho}$ and $\hat{\theta}$ that maximize $L_C(\rho, \theta)$. However, the statistical properties of these estimators are unknown, and a rationale for interval estimation is not straightforward. Partial answers to these questions can be obtained by using asymptotic theory. By asymptotic we mean here the limit as the number of subregions, and hence the size of the region of interest, increases. Another limiting regime would be to fix the size of the region and then to let the number of chromosomes sampled tend to infinity. These are rather different scenarios. Additional sampled copies of the same region are very highly positively correlated, so the gain in information is small. Nothing is known formally, but it is plausible that in this limiting regime the information grows as the logarithm of the sample size. However, precisely because of recombination, there is rather more independence between sequenced regions, from the same chromosomes, as the regions move further apart. Again, the formal position is not clear, but it seems likely that information grows linearly, or close to linearly, in the number of subregions sequenced. For a discussion of these issues in a simpler setting see Pluzhnikov and Donnelly (1996).

Azzalini (1983) discussed the asymptotic properties of maximum likelihood estimates (MLEs) which are based on approximate likelihoods that are similar to our composite likelihood. How these ideas specifically apply to our composite likelihood is considered in detail in Fearnhead (2002). We briefly, and very informally, discuss these results here.

The asymptotic properties of the composite likelihood MLEs will depend on the correlation between the score functions for different subregions. We studied these correlations via simulation (the results are not shown), and they appeared small. This suggests treating the composite likelihood as a true likelihood, and in particular assuming that the MLE has an approximate normal distribution, and that the likelihood ratio statistic has an approximate $\chi^2$-distribution. The theoretical results of Fearnhead (2002) show that the correlation decays inversely with the amount of recombination between the subregions. Whereas this decay is sufficiently quick to ensure consistency of the MLEs based on the composite likelihood, it is sufficiently slow that the asymptotic distribution of the likelihood ratio statistic will not be $\chi^2$ (in fact a $\chi^2$-approximation can be made arbitrarily poor by using increasingly more subregions).

It may still be the case that the usual asymptotic distributional results will provide useful approximations in some settings. This may occur if the subregions themselves were large and the log-likelihood from each subregion were approximately quadratic, and if the number of subregions were small. Below, we use simulation to examine the distribution of the likelihood ratio statistic, and also the MLE, for our composite likelihood.

4.2. Simulation results

We now describe the results of a simulation study aimed at understanding the properties of the composite likelihood method. We consider the large sequence properties and sampling distributions of the estimators.

We simulated our data from the coalescent, assuming neutrality, random mating and a constant population size. For reasons discussed above we simulated data from 50 chromosomes throughout. We generated data over different sequence lengths, assuming that 1 kb of DNA corresponds to parameter values ρ = θ = 1.0. As noted above, these would be typical values for human populations.

Calculating the full likelihood even for a subregion of 1 or 2 kb is extremely computationally intensive.
Especially for the larger region, it is also person intensive, as care should ideally be used in deciding whether enough iterations of the importance sampling method have been used. (See Fearnhead and Donnelly (2001) for a detailed discussion.)

The trade-off in the composite likelihood approach is clear. The larger the size of the subregions, the less information is lost through ignoring dependences, but the greater the cost of obtaining good likelihood estimates, and the higher the chance, especially if the process is automated, of serious errors in the subregion likelihood estimates. The main effect here is that, although the importance sampling method that we use gives an unbiased estimate of the likelihood, it is the sample mean of a sample from a distribution with an extremely long right-hand tail. Thus in practice not running the method for sufficiently long will typically result in an underestimation of the likelihood. (See Fearnhead and Donnelly (2001) and Stephens and Donnelly (2000) for a fuller discussion.)

Simulation studies of highly computationally intensive methods are necessarily somewhat limited, so caution should be applied in interpreting the results below. For the composite likelihood approach, the largest size of subregion that is amenable to a simulation study will be smaller than the largest size which can be used for particular data analyses.

Fig. 4. Composite log-likelihood curves for the ten 100 kb data sets: each composite log-likelihood curve is based on 1 kb subregions; the bold curve shows the sum of the 10 composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0

Fig. 5. Composite log-likelihood curves for 100 kb data sets, based on 2 kb subregions: the bold curve shows the sum of the relevant composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0

We first considered the large sample properties of the composite likelihood MLE. We generated one 500 kb and ten 100 kb data sets. The composite log-likelihood curves (based on splitting each region into 1 kb subregions) for the 100 kb data sets are shown in Fig. 4. There is no noticeable bias in the estimation of ρ.
This conclusion is supported by the composite log-likelihood curve for the 500 kb data set (the results are not shown). We also analysed the 100 kb data sets using subregions of 2 kb. Fig. 5 shows the composite likelihood curves based on using 2 million simulated ancestral histories to estimate the likelihood for each subregion. There is evidence of a negative bias caused by occasional inaccuracies in estimating the likelihood curves for individual subregions.
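The importance sampling issue described above can be monitored with a simple diagnostic. The sketch below computes the likelihood estimate (the sample mean of the importance weights) together with one commonly used effective sample size, (Σw)² / Σw², which equals N / (1 + cv²) for N weights with coefficient of variation cv. This is an assumption about the form of the diagnostic; the precise definition used by Fearnhead and Donnelly (2001) should be taken from that paper. The log-normal weights are hypothetical and serve only to mimic a distribution with a long right-hand tail.

```python
import numpy as np


def effective_sample_size(weights):
    """Effective sample size (sum w)^2 / sum(w^2) of importance weights.

    Equals N when all weights are equal and approaches 1 when a single
    weight dominates the sum.
    """
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)


def likelihood_estimate(weights):
    """Importance sampling estimate of the likelihood: the mean weight."""
    return float(np.mean(weights))


# Hypothetical heavy-tailed weights: log-normal, scaled to have true mean 1.
rng = np.random.default_rng(0)
sigma = 3.0
weights = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma, size=10_000)

ess = effective_sample_size(weights)
estimate = likelihood_estimate(weights)
```

A small effective sample size relative to the number of simulated histories suggests that a few large weights dominate the estimate, in which case the subregion likelihood is likely to be unreliable and more simulation effort is warranted.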

Fig. 6. QQ-plots of the composite likelihood MLE and likelihood ratio statistic for (a) 20 kb data and (b) 10 kb data: each composite likelihood is based on 1 kb subregions; the results are based on 1500 kb of simulated data

Also of concern is the appropriateness of assuming an approximate normal distribution for $\hat{\rho}$, and an approximate $\chi^2_1$-distribution for the likelihood ratio statistic. Fig. 6 shows QQ-plots for both $\hat{\rho}$ and the likelihood ratio statistic for the composite likelihood for both 10 kb and 20 kb data (each composite likelihood was based on 1 kb subregions). For 20 kb data, the $\chi^2_1$-approximation for the likelihood ratio statistic seems poor, whereas (except for the constraint of ρ ≥ 0) $\hat{\rho}$ does have an approximate normal distribution. Also, as the sequence length is increased from 10 kb to 20 kb, the fit of the likelihood ratio statistic appears to be less good, as suggested by the theoretical analysis of Fearnhead (2002).

Finally, Table 1 summarizes how the performance of inference for ρ depends on the length of data that are analysed (again inference is based on the composite likelihood for the various choices of subregion size). Firstly consider the results based on 1 kb subregions. The point estimates of ρ appear good, though the length of the sequence plays a crucial role in the variance of the estimates. In contrast, the coverage properties of interval estimates based on the asymptotic distribution of the likelihood ratio statistic are poor. Interval estimation based on a normal approximation for $\hat{\rho}$ performs somewhat better, but again the confidence intervals have coverage probabilities for large sequences that are lower than nominal.

The results based on 2 kb subregions show evidence for a bias. Despite this, the performance of point estimation is better (measured either via the mean-square error or the proportion of times that the MLE is within a factor of 2 of the truth).
The extra information in a 2 kb subregion as opposed to two 1 kb subregions considerably reduces the variance of the estimators. Interval estimates have poor coverage properties. We note again that there is no theoretical foundation even for interval estimation for the full likelihood MLE for ρ, although limited empirical evidence is encouraging (Fearnhead and Donnelly, 2001).

Table 1. Summary of the sampling properties of the composite likelihood MLE for ρ per kilobase (the true value is 1), and associated confidence intervals, for different lengths of data, based on a sample of 50 chromosomes

Columns: Subregions; Length (kb); Mean; Variance; Median; g; Confidence interval coverage (a), (b). Rows are grouped by subregion size (1 kb, 2 kb and 5 kb).

The statistic g (used in Wall's (2000) comparisons) is the proportion of times that the MLE is within a factor of 2 of the truth. The final two columns give the estimated coverage probability of approximate 95% confidence intervals. These confidence intervals are based on (a) an approximate $\chi^2_1$-distribution for the likelihood ratio statistic and (b) an approximate normal distribution for the MLE, whose variance is the inverse of the curvature of the relevant composite log-likelihood curve. The results for subregions of 1 and 2 kb are based on 1500 kb of simulated data; those for 5 kb on 1000 kb of simulated data, with data simulated under the standard neutral model. Composite likelihoods for subregions of 1 and 2 kb were based on the exact likelihood for each subregion, that for 5 kb on the approximate marginal likelihood, omitting only singleton sites, for each subregion. The composite likelihood curve was calculated for ρ per kilobase between 0 and 5 in all cases. For 5 kb sequences, estimates of ρ = 5 were obtained in around 2% of cases; these indicate an estimate of ρ which is greater than 5.0, and hence the estimated means and variances for 5 kb sequences are negatively biased.

The analysis for the 5 kb subregions is at the limit of feasibility for a simulation study. One consequence is that the importance sampling estimates of the approximate marginal likelihoods for each subregion may not be particularly accurate. For example in many cases these likelihood estimates had estimated effective sample sizes (see Fearnhead and Donnelly (2001) for details) of around 10.
Here, an inaccurate likelihood surface may, and apparently does, particularly affect the properties of confidence intervals, although it may also be responsible for the increased bias. In analysing a particular data set, one can assess the effective sample size and if necessary increase the simulation effort in calculating subregion approximate likelihoods. As a consequence, the actual performance when using 5 kb subregions should be better than suggested by Table 1. We thus regard the use of 5 kb subregions as the best alternative among the composite likelihood approaches that we considered.

In connection with Table 1 we note that there are possible but very unlikely sample configurations for which the MLE of ρ will be infinite. Thus the moments of $\hat{\rho}$ do not exist. Nonetheless, we have found that the use of the sample mean and sample variance of $\hat{\rho}$ provides a helpful summary of the properties of different estimators.

We give estimated histograms of the sampling distributions for some scenarios in Fig. 7. These confirm the conclusions described above in discussing Table 1. Fig. 7(c) shows the performance of the approximate marginal likelihood when applied to 5 kb sequences. The sampling properties of the estimator compare favourably with those of the MLE for smaller sequences (see Fig. 9 of Fearnhead and Donnelly (2001)).

In spite of the approximations that are inherent in its construction, the behaviour of the composite likelihood MLEs, at least for point estimation, is encouraging. We briefly compare it with other available estimators.

Fig. 7. Histograms of the composite likelihood MLE for ρ based on samples of 50 chromosomes: (a) 5 kb sequences, 1 kb subregions; (b) 20 kb sequences, 1 kb subregions; (c) 5 kb sequences, 5 kb subregions; (d) 20 kb sequences, 5 kb subregions

The method of Wakeley (1997) estimates ρ per kilobase for a 50 kb region (based on a sample of 20 chromosomes) with a variance of 1.16, making it substantially worse than our composite likelihood estimators (for example, when based on 2 kb subregions, the composite likelihood MLE, as described in Table 1, has a sample variance of 0.09). Wall (2000) had also found that this estimator performed poorly in his comparisons. Wall's (2000) method performs worse than the approximate marginal likelihood for 5 kb data and has a performance which is comparable with that of the composite likelihood for 10 kb data (with 5 kb subregions) when performance is measured in terms of the statistic g (defined in Table 1) and better performance in terms of the mean-square error (Wall, personal communication). On the basis of our own simulations (the data are not shown), and those of Hudson (personal communication), Hudson's pairwise likelihood method performs somewhat worse than approximate marginal likelihood for 5 kb data, but comparably with, or sometimes better than, any composite likelihood for longer sequence lengths.

The comparison with Hudson's (2001a,b) method is interesting. His method combines data from all pairs of segregating sites. In long sequences, this will be dominated by comparisons between sites which are reasonably distant. In contrast, our composite likelihood explicitly ignores information in data from regions of the sequence that are widely separated, concentrating instead on the local information from nearby segregating sites but using joint information from many such sites, rather than just from pairs of sites.
The loss of the joint information seems expensive for relatively short sequences (for which full or marginal likelihood approaches perform better). As the sequence length grows, it begins to be offset by the additional information from patterns of LD between distant sites. In this sense Hudson's pairwise and our composite likelihood approaches are somewhat complementary. We are currently investigating ways of combining the two approaches.

5. Lipoprotein lipase data

We now apply our approximate likelihood methods to analyse sequence data from the LPL gene. The full data are presented in Nickerson et al. (1998) and Clark et al. (1998) and have been used by Templeton et al. (2000a), Przeworski and Wall (2001) and Kuhner et al. (2000) to estimate the amount of recombination in the LPL gene. The data consist of approximately 9.7 kb of DNA sequenced in 142 chromosomes from individuals in Jackson and Rochester in the USA and from North Karelia, Finland. The full haplotype information of some chromosomes is not known: the phase of singleton mutations was not determined, and for some segregating sites the alleles on some chromosomes are unknown.

We focus here on two specific issues raised in the original papers:

(a) the extent to which recurrent mutation (i.e. more than one mutation event at a particular site), rather than recombination, has shaped the data and
(b) the possibility of substantially elevated recombination rates towards the centre of the sequenced region.

5.1. Possible recurrent mutation

Templeton et al. (2000a, b) suggested that there is a significant amount of repeat mutation, due to variation in the mutation rate between sites. The issue is potentially important, because multiple mutations at the same site can leave patterns in the data that are similar to those from recombination events, so, as Templeton and colleagues have argued, a failure to account for this would lead to an overestimation of the recombination rate. In particular they suggested that CpG dinucleotides mutate at a much faster rate than the genome-wide average. (This has been noted from other data: for example, Nachman and Crowell (2000) estimated that they mutate 10 times more frequently than average; see also Krawczak et al. (1998).)
We first analysed the data from the 48 chromosomes from Jackson. We base our inference on the composite likelihood, splitting the data into 10, 975-base, subregions. As with our earlier suggestion, we used the approximate marginal likelihood in each subregion, in which singleton mutations were ignored. In addition to the computational saving, this has the advantage that for these data the phase of singleton mutations is not known in any case. We also note that it means that throughout this analysis we use a finite sites mutation model, which explicitly allows for the possibility of repeat mutations at each site. For each subregion, we omitted any chromosome whose full haplotype was not known, leaving samples of between 31 and 48 chromosomes for the 10 subregions. On average, it took half a day to calculate the approximate marginal likelihood for each subregion by using a 400 MHz personal computer. The coefficient of variation of the final importance sampling weights was generally of the order of 10 3 or 10 4, implying that the estimates of the approximate marginal likelihood surface in each subregion are within a few per cent of the correct values. As one way of assessing the consequences of increased mutation rates associated with CpG dinucleotides, we analysed the data under two mutational models. In the first, all sites mutated at the same rate. In the second, we identified CpG doublets in the data and allowed these to mutate at a rate that was 10 times higher than that for other sites. In fact, only mutations away from CpG sites seem to be at increased frequency and, of these, transitions at a higher rate than
