Approximate likelihood methods for estimating local recombination rates


J. R. Statist. Soc. B (2002) 64, Part 4, pp

Paul Fearnhead and Peter Donnelly
University of Oxford, UK

[Read before The Royal Statistical Society at a meeting organized by the Research Section on Statistical modelling and analysis of genetic data on Wednesday, May 22nd, 2002, Professor D. Firth and Professor R. A. Bailey in the Chair]

Summary. There is currently great interest in understanding the way in which recombination rates vary, over short scales, across the human genome. Aside from inherent interest, an understanding of this local variation is essential for the sensible design and analysis of many studies aimed at elucidating the genetic basis of common diseases or of human population histories. Standard pedigree-based approaches do not have the fine scale resolution that is needed to address this issue. In contrast, samples of deoxyribonucleic acid sequences from unrelated chromosomes in the population carry relevant information, but inference from such data is extremely challenging. Although there has been much recent interest in the development of full likelihood inference methods for estimating local recombination rates from such data, they are not currently practicable for data sets of the size being generated by modern experimental techniques. We introduce and study two approximate likelihood methods. The first, a marginal likelihood, ignores some of the data. A careful choice of what to ignore results in substantial computational savings with virtually no loss of relevant information. For larger sequences, we introduce a composite likelihood, which approximates the model of interest by ignoring certain long-range dependences. An informal asymptotic analysis and a simulation study suggest that inference based on the composite likelihood is practicable and performs well.
We combine both methods to reanalyse data from the lipoprotein lipase gene, and the results seriously question conclusions from some earlier studies of these data.

Keywords: Coalescent; Composite likelihood; Lipoprotein lipase; Marginal likelihood; Mutation rate heterogeneity; Pseudolikelihood; Recombinational hot spot; Recombination rate

Address for correspondence: Paul Fearnhead, Department of Mathematics and Statistics, Fylde College, Lancaster University, Lancaster, LA1 4YF, UK. p.fearnhead@lancaster.ac.uk
© 2002 Royal Statistical Society

1. Introduction

Humans, like most (so-called diploid) organisms, have two versions of each chromosome, one inherited from each parent. Each sperm or egg has only a single copy of each chromosome, typically formed as a mosaic of the two parental copies. The process of shuffling the two parental copies to produce a single chromosome is called recombination. Although not completely understood, it involves variation, so for example the complete set of chromosomes in two different sperm or eggs from the same person would be expected to differ, because of differences in the recombination processes during their formation.

The recombination rate between two specified positions, or loci, on a chromosome is defined to be the probability that the deoxyribonucleic acid (DNA) at each location on the offspring chromosome comes from different chromosomes in the parent. Between two positions which are very close together on the chromosome, the recombination rate will be extremely small. For

positions which are far apart, it will be close to 1/2. (Recombination rates are bounded below by 0 and above by 0.5.)

There is enormous current interest in understanding the way in which recombination rates vary across the human genome. There is known to be variation over large scales, but little is known about the extent to which recombination rates vary over small scales. Aside from inherent interest, in shedding light on the underlying biological processes, a good understanding of patterns of local variation is central to both the design and the analysis of current and planned studies aimed at elucidating the genetic basis of common human diseases (e.g. Pritchard and Przeworski (2001)). Although the methods described in this paper can be applied to data from any diploid species, we shall, for definiteness, focus the discussion on the case of current major interest, namely humans.

It may be helpful to the subsequent discussion to have some (very rough!) idea of the scales involved in the human genome. The genome consists of around [...] nucleotides, or base pairs of DNA, broken into 23 chromosomes. (For our purposes, DNA can be thought of as being a linear string made up of discrete building-blocks, namely the four nucleotides {A, C, G, T}.) The chromosomes differ in size, but think of a typical chromosome as being $10^8$ bp (base pairs), or 100 Mb (megabases), long; the smallest is about 50 Mb, and the largest about 300 Mb. Distances measured in base pairs are called physical distances. This contrasts with so-called genetic distance, which measures the amount of recombination between two loci. Two loci are at a genetic distance of 1 M (morgan) if the expected number of recombination events between them during one meiosis (the formation of a sperm or egg cell) is 1. Because recombination rates vary across the genome, there is no standard conversion between physical and genetic distances.
The average across the genome is close to 1 cM (centimorgan) per megabase (e.g. Pritchard and Przeworski (2001)). There are various models which relate recombination rates to genetic distance (see for example Speed and Zhao (2001)), but over small genetic distances recombination rates and genetic distances are almost the same. Thus for example the recombination rate between two loci at genetic distance 1 cM will be close to 0.01.

The standard method for assessing recombination rates in humans is from pedigree data. In its simplest form, we might have observations on parent-child duos and simply count the number of recombination events between the loci of interest. In practice it is much more complicated, and can be highly non-trivial statistically, effectively because often only incomplete information is available about recombination events (see for example Holmans (2001) and Thompson (2001)). There are natural limits to the resolution of pedigree-based methods: they can only be used to estimate recombination rates between loci whose genetic distance is of the order of centimorgans or more. If complete information were available, the problem would amount to estimating a binomial success probability. The real problem is more difficult, but for example thousands of meiotic events are needed to estimate recombination rates of the order of [...]. Over the last 5 years or so, the ability to type single sperm separately has added to the resolution, allowing the estimation of rates of the order of $10^{-3}$ or [...]. Recently, more sophisticated (sperm typing) experimental approaches allow a practicable estimation of male recombination rates as small as $10^{-5}$ (Jeffreys et al., 2001) but at very substantial cost. Sperm typing methods are currently not feasible for simultaneously assessing rates across the genome, and in any event provide no information on recombination rates in females (which differ in humans from rates in males, at least over large scales).
Thus, in short, existing high throughput methods provide information about recombination rates over scales of millions, or in some cases hundreds of thousands, of bases. (We note in passing that an estimation of recombination rates is much easier in most experimental organisms, effectively because experimenters can simply arrange for large numbers of matings of the type that are most helpful.)
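As an illustration of the earlier remark that recombination rates and genetic distances almost coincide over small distances, the sketch below uses Haldane's map function, one standard model relating the two. The choice of Haldane's function is an assumption made here for illustration; the text does not say which map function is intended, and the function name is hypothetical.

```python
import math


def haldane_recombination_rate(distance_morgans):
    """Recombination rate implied by Haldane's map function.

    Assumption: crossovers occur as a Poisson process with no interference,
    so the rate is r = (1 - exp(-2 d)) / 2 for genetic distance d in morgans.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


# At 1 cM (0.01 M) the rate is very nearly equal to the genetic distance.
rate_1cm = haldane_recombination_rate(0.01)

# At large genetic distances the rate approaches its upper bound of 0.5.
rate_far = haldane_recombination_rate(10.0)
```

Under this model the two quantities agree to within about 1% at a distance of 1 cM, while loci many morgans apart recombine with probability close to 1/2.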

A recent study (Jeffreys et al., 2001) showed substantial clumping of recombination events within one 200 kb region of the human genome. The possibility of such extreme local variation in recombination rates is also supported by several other lines of evidence (see for example Jeffreys et al. (2001) and Daly et al. (2001), and references therein), but little systematic information is currently available about recombination rates over physical distances of kilobases or tens of kilobases in humans.

There is potentially useful information for estimating local recombination rates in samples of chromosomes taken from unrelated individuals in a population. Such chromosomes are related by an (unobserved) pedigree, or genealogy, going back many generations (typically many tens of thousands for human DNA). If we could observe both the genealogy and the recombination events on it, this would allow (straightforward) estimation of recombination rates as small as $10^{-4}$ or [...]. In practice, neither the genealogy nor the recombination events are directly observed, but data of this kind carry some information about each, and hence, at least in principle, information for estimating recombination rates over small scales.

There has thus been intense recent interest in methods for estimating local recombination rates from population genetic data. Early approaches used summary statistics (often via a method-of-moments approach; see for example Hudson and Kaplan (1985), Hudson (1987) and Wakeley (1997)); however, these methods only use a small amount of the information that is available from the data and can have poor statistical properties (Wall, 2000). More encouragingly, several different groups have developed full likelihood inference methods (Griffiths and Marjoram, 1996a; Kuhner et al., 2000; Fearnhead and Donnelly, 2001), which are optimal in the sense that they use all the available information from the data.
However, all these approaches use computationally intensive statistical techniques (either Markov chain Monte Carlo or importance sampling), with very substantial computational burdens, and potential difficulties in assessing the accuracy of approximated likelihood curves. As one example, the most efficient current method takes a month on a 400 MHz Pentium personal computer to give reliable estimates of the likelihood surface for a data set of 500 bp of sequence data from each of 31 chromosomes (Fearnhead et al., 2002). (This method is typically about four orders of magnitude more efficient than some other published approaches; see Fearnhead and Donnelly (2001).) Existing full likelihood approaches are not practicable for the sizes of data set that are currently being generated. Although there may be scope for improving their efficiency, possibly substantially, we adopt a different approach in this paper, in studying two approximations. Rather than developing computational methods for approximating the full likelihood, we consider two different likelihoods. The first is a marginal likelihood, where we ignore some of the information in the data, considering instead the likelihood for a reduced data set. Through a careful choice of which aspects of the data to ignore, this results in considerable computational savings with only a very limited loss of information for estimating parameters of interest. Our second approximation in effect uses a different, simpler, model for the data, essentially by ignoring certain long-range dependences. We call the resulting likelihood a composite likelihood for our original problem. We sketch some asymptotic (in the length of the sequence) theory and describe simulation studies, both of which are encouraging, and suggest that the resulting procedures may be reasonable for estimation. Neither general idea is new, although the way in which we implement them is novel. 
For example Wall (2000) presented an estimation method for recombination based on the likelihood for a summary of the data. We choose a higher dimensional, more informative, summary, so rather more sophisticated methods for studying the associated likelihood are needed. We note that,

with our chosen summary, the marginal likelihood is typically almost indistinguishable from the full likelihood.

Our composite likelihood approach draws on ideas from spatial statistics (e.g. Besag (1975)). A related, though in some senses complementary, approach to estimating recombination rates has been developed by Hudson (2001a) and subsequently adapted by McVean et al. (2002). They created a composite log-likelihood by summing the log-likelihood of the data at all pairs of nucleotides. Instead, we split the chromosomal region of interest into subregions and create a composite log-likelihood by summing the full log-likelihoods for each subregion. The accuracy of inference based on our composite log-likelihood function depends crucially on the size of subregions used. The larger the subregions, the closer the composite log-likelihood is to the (optimal) full log-likelihood.

There are two ways of thinking about the generic statistical problem that is considered here. In the population genetics context there is an accepted family of models for population evolution, including mutation and recombination. Forwards in time these can be thought of as versions of the Fleming-Viot measure-valued diffusion, an infinite dimensional stochastic process. Backwards in time, they induce a random genealogy modelled by the coalescent, and its relatives. (There are important issues about the adequacy of these models for particular applications, but the models are nonetheless widely used, and we shall not consider such issues here. For a discussion in the context of estimating recombination rates, see Fearnhead and Donnelly (2001).) One way of thinking about the challenge of the inference problem is that our sample of chromosomes from the population corresponds to partial information about the value of the Fleming-Viot process at a single point in time, on the basis of which we wish to estimate the (two) parameters governing its evolution.
Another perspective, closer to the presentation four paragraphs above, is that this is a missing data problem. If we were told the unobserved genealogy, and recombination events, inference would be straightforward. Missing data problems have received considerable recent attention, but this one is particularly challenging. Although, as is typical, there are choices about exactly how to specify the missing data, however this is done, the space in which it resides is very large. The paper is organized as follows. The next section gives a very brief outline of the role of the coalescent and its extension to allow for recombination. For further background see for example Hudson (1990) and Donnelly and Tavaré (1995). The marginal likelihood approximation is introduced and studied in Section 3, with the composite likelihood approximation in Section 4. In Section 5 we apply these methods to a data set of sequence variation in the lipoprotein lipase (LPL) gene, a data set which is many times too large for it to be practicable to use full likelihood methods. Aspects of our results differ markedly from a previous analysis of these data (Templeton et al., 2000a): we find no evidence for their conclusion that repeat mutation, and not recombination, is responsible for producing many of the features that are observed in the data. Evidence for their conjectured recombinational hot spot differs substantially across population samples. The term site refers to a particular nucleotide position. A site is said to be segregating in a sample of chromosomes if not all chromosomes in the sample have the same nucleotide at that site. We also assume throughout that the data consist of DNA sequences from each chromosome sampled. In genetics terminology, this is an assumption that the haplotype of each chromosome is known, or equivalently that we know the phase at each segregating site. Phase information can either be obtained experimentally (e.g. Sobel and Lange (1996)) or inferred (e.g. 
Stephens et al. (2001)). Throughout we only consider estimating recombination rates from DNA sequence data, although the approximate likelihood methods that we suggest have obvious extensions to other types of population genetic data.

Fig. 1. Example of a genealogy at (a) a single site and (b) at two sites for a sample of size 3: moving up the tree or graph corresponds to going back in time; the joining of branches (going back in time) represents chromosomes sharing a common ancestor (these are called coalescent events); in (b) the dependence between the genealogies can be seen: they differ only because of the effect of recombination

2. Background

Population genetic data are generated by the interaction of two processes: the genealogical process (the interrelatedness of different chromosomes as a result of shared ancestry over long timescales) and the mutation process. Note that recombination affects the first of these, as it enables the DNA at two loci on one chromosome to be descended from different chromosomes in the previous generation.

First consider the genealogical history of a single position or site in the sequence. This can be represented by a tree (Fig. 1), with time going back into the past as we move up the page. For a sample of n chromosomes, there are initially n distinct branches, each representing the ancestry, at the site of interest, of one of the chromosomes sampled. As we go back in time, chromosomes in the sample share common ancestors (represented by the joining, or coalescing, of branches in the tree). The genealogy stops when all the chromosomes sampled are traced back to a single common ancestor at the site in question.

The genealogy for a stretch of recombining DNA is more complicated: each site in the region of interest has its own genealogical tree. However, genealogies at nearby sites will be strongly dependent; in fact they are often identical (they only differ if a recombination event occurs between the two sites during the genealogical history of the sample).
The entire collection of genealogical trees for the region of interest can be represented by a graph, called the ancestral recombination graph (ARG) (Griffiths and Marjoram, 1996b); see Fig. 1 for an example. Conditional on the genealogy of a sample, mutations occur as a Poisson process along the branches of the ARG, independently for distinct sites. Furthermore, given the realization of the ARG, it is straightforward to evaluate the probability of a particular configuration of sequences in the chromosomes sampled. As noted above, we shall focus here on the simplest setting in which the model for the ARG is given by the coalescent with recombination. See for example Kingman (1982a, b), Hudson (1983) and Kaplan and Hudson (1985) for further background. The model is parameterized by scaled mutation and recombination rates, denoted by θ and ρ respectively. If N is the effective population size, and u and r are respectively the probabilities of mutation and recombination

within the region of interest in a single generation, then θ = 4Nu and ρ = 4Nr. (The timescaling by N reflects the fact that chromosomes are typically related over times of the order of N generations. Thus unless the per generation recombination rate r between two loci is extremely small (say $10^{-4}$ or smaller) the scaled recombination parameter ρ will be large, and the genealogies, and hence genetic types in a sample, will be independent at the two loci. This is an important difference from pedigree analyses of recombination. In our population setting, dependence between the types at two loci (induced by the correlation in their genealogies) typically only extends over distances of fractions of centimorgans. In pedigree studies the dependence extends over chromosomal scales.)

Any full likelihood method essentially needs to calculate probabilities by averaging over the unobserved realizations of the genealogy. These are themselves very high dimensional objects, living in an infinite dimensional space. Two different classes of approach have been suggested, based on either importance sampling or Markov chain Monte Carlo methods. See Stephens and Donnelly (2000) and Stephens (2001) for a general discussion of the different approaches to this problem and Fearnhead and Donnelly (2001) for a comparison of published methods for full likelihood inference in the presence of recombination.

To proceed, we shall also need to fix on a particular model for mutation. For simplicity and definiteness we shall focus primarily on the so-called infinite sites model. This model, which assumes that each mutation event that occurs in the genealogical history of the sample will affect a different nucleotide site, may not be unreasonable for much human DNA sequence data (with the exception of mitochondrial DNA). An extension of the ideas in what follows to other mutational models is straightforward in principle, and in some cases also in practice.
One of our methods, and the data analysis of Section 5, is based on a finite sites model which explicitly allows recurrent mutation.

In the simulation studies which follow we fix particular values for θ and ρ, which are plausible for human populations, of 1 per kilobase in each case (e.g. Pritchard and Przeworski (2001)). Discussions of sequence length should be interpreted relative to these parameter values. Thus the computational burden depends on the total recombination and mutation rates across the regions in question. A method which is computationally feasible for a sequence of length 2 kb with θ = ρ = 1 per kilobase will be feasible for a sequence of length 10 kb if θ = ρ = 0.2 per kilobase.

3. Marginal likelihood

Our first approximation is to consider the likelihood of a summary of the data, the idea being to find a summary which is both informative about ρ and for which it is still practicable to calculate likelihood surfaces. Wall (2000) used this idea in the context of estimating recombination rates. He chose a summary which was sufficiently simple that the likelihood for the reduced data could be estimated well by naïve simulation. Here we work with a higher dimensional, and hopefully considerably more informative, summary. The price to be paid is that more sophistication is needed to estimate the relevant likelihood. As described below, we do this here by adapting to this simpler problem the importance sampling approach that we developed in Fearnhead and Donnelly (2001) for full likelihood estimation of recombination rates.

Let S be the set of segregating sites at which the minor nucleotide frequency (the number of times that the less common nucleotide appears in the data) is greater than some prespecified value. Several particular choices are discussed below. We summarize the data by $D_S$, the haplotypes defined (only) by sites in S, and $S_O$, the number of segregating sites in the data which are not in S. See Fig. 2 for an example.
The idea is that $D_S$ should be informative about ρ, as it

contains much of the linkage disequilibrium (LD) information from the data (i.e. information about the non-independence of the collections of nucleotides at different positions) and it is this information that is particularly informative about ρ. The total number of segregating sites should be informative about θ.

Full data    Summary
ACGATTAG     C A G
ACGATTAA     C A A
AGGTTTAA     G T A
AGGTCTAG     G T G
             1 other segregating site

Fig. 2. Example of our summary of the data: the full data consist of the DNA at eight sites in four chromosomes; here we have chosen to keep only the sites at which the minor nucleotide frequency is 2 (i.e. there are two of both of the nucleotides at that site); this defines our set S; our summary is the types of each chromosome at these three sites, and the number of other segregating sites (sites at which more than one nucleotide appears); we also assume known the position of the sites in S, the number of sites sequenced and the minor nucleotide frequency used in the definition of S

The marginal likelihood $L_M(\rho, \theta)$ is the likelihood of this summary of the data. Let an ancestral history H be the collection of genealogies at all sites in the sequence, together with the mutational history at the sites in S. Thus H is the ARG for the whole sequence (see Fig. 1), plus the positions in the ARG of the mutations which affect the sites in S. Conditionally on H, $D_S$ and $S_O$ are independent, so

$$L_M(\rho, \theta) = \sum_{H} p(D_S \mid H)\, p(S_O \mid H; \theta)\, p(H \mid \rho, \theta),$$

where the summation is over all possible ancestral histories. (For simplicity, in our notation we have suppressed the dependence of $p(S_O \mid H; \theta)$ on the position of sites in S and the threshold used to define S. Also we have slightly abused the notation, as the summation over ancestral histories is in fact a sum over all topologies of the ARG, together with an integral over the lengths of the branches of this ARG, and the positions of the mutations in the ARG.)
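The summary $(D_S, S_O)$ defined above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function name and the `min_minor_count` parameter are hypothetical, the threshold is applied as "at least", and the minor nucleotide frequency is taken to be the count of the rarest nucleotide at a site, which matches the definition in the text when sites carry two nucleotides.

```python
from collections import Counter


def summarise(haplotypes, min_minor_count):
    """Return (positions in S, D_S, S_O) for equal-length haplotype strings.

    A segregating site joins S when its rarest nucleotide appears at least
    `min_minor_count` times; all other segregating sites are only counted.
    """
    positions_in_s = []
    n_other_segregating = 0
    for j in range(len(haplotypes[0])):
        counts = Counter(h[j] for h in haplotypes)
        if len(counts) < 2:
            continue  # every chromosome agrees here: not a segregating site
        if min(counts.values()) >= min_minor_count:
            positions_in_s.append(j)
        else:
            n_other_segregating += 1
    d_s = ["".join(h[j] for j in positions_in_s) for h in haplotypes]
    return positions_in_s, d_s, n_other_segregating


# The four chromosomes of Fig. 2, keeping sites with minor frequency 2.
fig2 = ["ACGATTAG", "ACGATTAA", "AGGTTTAA", "AGGTCTAG"]
positions, d_s, s_o = summarise(fig2, min_minor_count=2)
```

Applied to the data of Fig. 2, this keeps three sites (0-based positions 1, 3 and 7), gives the reduced haplotypes CAG, CAA, GTA and GTG, and counts one other segregating site.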
Now $p(D_S \mid H)$ is either 1 or 0 (as the ancestral histories uniquely determine the sample at sites in S). If we let $\mathcal{H}$ denote the set of ancestral histories for which $p(D_S \mid H) = 1$, and if $q(H)$ is a probability mass function whose support contains $\mathcal{H}$, then

$$L_M(\rho, \theta) = \sum_{H \in \mathcal{H}} \frac{p(S_O \mid H; \theta)\, p(H \mid \rho, \theta)}{q(H)}\, q(H). \qquad (1)$$

This suggests using importance sampling, with proposal density $q(H)$, to approximate $L_M(\rho, \theta)$. This will be feasible provided that we can calculate $p(S_O \mid H; \theta)$ (up to a constant multiplier which does not depend on H).

We calculate $p(S_O \mid H; \theta)$ as follows. Consider just sites not in S, and write $L$ for the total length of branches in the genealogies at these sites, $L_F$ for the length of the subset of branches on which a mutation can occur without placing the site in S, and $\theta_b$ for the per base mutation rate. Then, assuming that there are no repeat mutations at sites not in S (which is automatic under the infinite sites assumption, but also may be a reasonable approximation under other mutation models), since mutations occur, independently, as a Poisson process of rate $\theta_b/2$ along each branch,

$$p(S_O \mid H; \theta) \propto (L_F \theta_b)^{S_O} \exp(-L \theta_b / 2).$$
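The proportionality above, and the averaging step behind equation (1), can be checked numerically with a short sketch. The function names are hypothetical and the code is not taken from the authors' programs. The "exact" version is our reading of the argument: $S_O$ Poisson events fall on the branches of total length $L_F$ and none fall on the remaining branches, which have total length $L - L_F$.

```python
import math


def log_p_so_unnormalised(total_len, free_len, theta_b, s_o):
    """log of (L_F * theta_b)^S_O * exp(-L * theta_b / 2)."""
    return s_o * math.log(free_len * theta_b) - total_len * theta_b / 2.0


def log_p_so_exact(total_len, free_len, theta_b, s_o):
    """log-probability of S_O mutations on the free branches and none elsewhere."""
    rate_free = free_len * theta_b / 2.0
    rate_rest = (total_len - free_len) * theta_b / 2.0
    log_poisson = s_o * math.log(rate_free) - rate_free - math.lgamma(s_o + 1)
    return log_poisson - rate_rest


def importance_estimate(log_weights):
    """Mean of exp(log_w) over sampled histories, computed stably in log space."""
    m = max(log_weights)
    return math.exp(m) * sum(math.exp(w - m) for w in log_weights) / len(log_weights)


# The two expressions differ by a constant that does not depend on the history:
# log(2^-S_O / S_O!), here evaluated for S_O = 3.
offset = log_p_so_exact(40.0, 25.0, 0.001, 3) - log_p_so_unnormalised(40.0, 25.0, 0.001, 3)
```

In an importance sampler each history H drawn from q would contribute the weight $p(S_O \mid H; \theta)\, p(H \mid \rho, \theta) / q(H)$, and `importance_estimate` would average these weights; working with log-weights avoids underflow when individual weights are very small.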

Although in principle any choice of proposal density q (with the appropriate support) in equation (1) will result in unbiased estimation of $L_M(\rho, \theta)$, the choice of q can have dramatic effects on the efficiency of the method. We took as our choice of proposal density the one developed by Fearnhead and Donnelly (2001) for infinite sites data. We implemented this by adapting their program infs.

This approach is still highly computationally intensive because of the need to propose ancestral histories which contain genealogies at all sites in the sequence. Large computational savings can be obtained if we define a new ancestral history $H'$, consisting of just genealogies and mutations at the sites in S. As before, if $\mathcal{H}'$ denotes the set of all such ancestral histories with $p(D_S \mid H') = 1$, then

$$L_M(\rho, \theta) = \sum_{H' \in \mathcal{H}'} \frac{p(S_O \mid H'; \theta)\, p(H' \mid \rho, \theta)}{q(H')}\, q(H'), \qquad (2)$$

for any proposal density $q(H')$ whose support contains $\mathcal{H}'$. Exact calculation of $p(S_O \mid H'; \theta)$ is not now possible. Instead we approximate this probability by assuming that the genealogy at each site is the same as the genealogy of the site in S to which it is closest (if a site is equidistant from the two closest sites in S, then with probability 1/2 we assume that it has the genealogy of the left-hand site, and otherwise we assume that it has the genealogy of the right-hand site). Use of this approximation to $p(S_O \mid H'; \theta)$ in equation (2) defines an approximation to the marginal likelihood, which we call the approximate marginal likelihood.

Again we estimated the approximate marginal likelihood via importance sampling. Our proposal density is that derived by Fearnhead and Donnelly (2001) for finite sites data. (Thus one advantage of our approximate marginal likelihood approach is that it is based on a mutation model which explicitly allows repeat mutations at sites in S.)
We implemented this by adapting their program fins.

3.1. Implementation

We simulated data for samples of 50 sequences of length 1 kb, 2 kb and 4 kb, for θ = ρ = 1 per kilobase, values which, as noted above, are plausible for human populations. At least for this region of parameter space, increasing the sample size has little effect on the precision of parameter estimation (Fearnhead and Donnelly, 2001), so here and elsewhere in the paper we focus simulation effort on exploring other aspects of the problem. The thresholds used in deciding which segregating sites to exclude in the marginal likelihoods were as follows:

(a) for the 1 kb data, include only sites at which the minor nucleotide frequency is at least 2;
(b) for 2 kb and 4 kb, include the five sites with the highest minor nucleotide frequency (including ties), and all other segregating sites with minor nucleotide frequency at least 30% of the sample.

There was striking agreement between each of the full likelihood, marginal likelihood and approximate marginal likelihood surfaces in the simulation study. For example, Fig. 3 shows likelihood curves for ρ with θ fixed at the true value, for 2 kb and 4 kb data. Similar results are obtained for many other data sets. For analysing 1 kb data the three curves are always almost identical (the results are not shown), confirming the plausible intuition that there is little information about recombination in singleton mutations (although such mutations are not completely uninformative). We conclude that in practice, for these thresholding schemes, there is little to be gained

in using a full likelihood approach, nor in working hard to calculate the marginal likelihood exactly. We would thus recommend the use of the approximate marginal likelihood method for sequences of this length, implemented by ignoring (only) singleton sites. The saving in computational time is considerable. Although it depends on the structure of particular data sets, and (often to a lesser extent) the level at which the threshold is set, calculating the approximate marginal likelihood accurately can reduce computing time by 1-2 orders of magnitude when compared with the marginal likelihood and 1-3 orders of magnitude compared with the full likelihood, with greater relative savings for more complicated problems. For sequences of length above about 5 kb, even the calculation of the marginal likelihoods can become computationally prohibitive.

Fig. 3. Comparisons of full, marginal and approximate marginal log-likelihood curves for ρ at the true value of θ for simulated data sets of 50 chromosomes (the thresholding schemes used are described in the text; the true value of ρ was 1): (a) 2 kb data; (b) 4 kb data (only the full and approximate marginal log-likelihood curves are shown)

4. A composite likelihood

Here we consider a different approximation, which is similar in spirit to that of Besag (1975) for spatial data. Consider DNA sequence data from a chromosomal region of interest. Split the region of interest into R subregions. For $r = 1, \ldots, R$, let $D_r$ be the data from the rth subregion. For notational simplicity, we assume that each subregion is of the same length, and let ρ and θ

now denote the recombination and mutation rates over one subregion (i.e. 1/R times the rates for the whole region of interest). We assume in this section that these rates are constant across the subregions. Now we define the composite likelihood $L_C(\rho, \theta)$ to be

$$L_C(\rho, \theta) = \prod_{r=1}^{R} p(D_r \mid \rho, \theta).$$

The composite likelihood ignores information in the data: it neglects the fact that the ith sequence in each of $D_1, \ldots, D_R$ comes from the same chromosome. Furthermore, the composite likelihood is not even a probability of some summary of the data, as it ignores the dependence between data from different subregions. However, we propose to base inference on this composite likelihood, and in particular to estimate the parameters of interest by the values which maximize the composite likelihood. Note that this approach has similarities to the pairwise methods of Hudson (2001a) and subsequently McVean et al. (2002), and the idea of zeroth-order likelihood for stationary stochastic processes (Azzalini, 1983).

The composite likelihood function can be calculated by using the importance sampling method of Fearnhead and Donnelly (2001) to evaluate each factor in the product. In view of the results of the previous section, it would seem natural to evaluate each subregion likelihood $p(D_r \mid \rho, \theta)$ via the approximate marginal likelihood, rather than the full likelihood. This is indeed what we recommend in practice (and what we have applied in Section 5). To understand the consequences of our various approximations better, we consider in this section the use of the composite likelihood with an evaluation of the full likelihood for each subregion. In the next subsection we give a very informal discussion of some relevant theoretical issues. For a more complete consideration, see Fearnhead (2002). Section 4.2 then considers empirical evidence on the use of this composite likelihood.
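Once each subregion log-likelihood has been evaluated on a common grid of parameter values, the composite likelihood defined above reduces to a sum of curves followed by a maximization. The sketch below assumes those per-subregion curves are already available (in the paper they come from importance sampling); the grid, the toy numbers and the function names are all hypothetical and serve only to show the bookkeeping.

```python
def composite_loglik(subregion_logliks):
    """Sum per-subregion log-likelihood curves evaluated on a common grid.

    Summing logs corresponds to taking the product over r = 1, ..., R of
    p(D_r | rho, theta) at each grid point.
    """
    n_grid = len(subregion_logliks[0])
    return [sum(curve[k] for curve in subregion_logliks) for k in range(n_grid)]


def maximise_on_grid(grid, loglik):
    """Return the grid value at which the log-likelihood curve is largest."""
    best = max(range(len(grid)), key=lambda k: loglik[k])
    return grid[best]


# Toy example: three subregions, log-likelihoods for rho on a three-point grid.
rho_grid = [0.5, 1.0, 2.0]
toy_curves = [
    [-3.0, -2.0, -2.5],
    [-1.0, -1.2, -3.0],
    [-4.0, -3.1, -3.0],
]
composite = composite_loglik(toy_curves)
rho_hat = maximise_on_grid(rho_grid, composite)
```

In this toy example no single subregion settles the answer: the three curves peak at different grid points, and the composite estimate is the value that is best on aggregate. A finer grid, or a two-dimensional grid over (ρ, θ), would follow the same pattern.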
Both the theoretical and the empirical considerations are encouraging.

4.1. Informal theoretical considerations

The obvious point estimates for ρ and θ are just the values $\hat\rho$ and $\hat\theta$ that maximize $L_C(\rho, \theta)$. However, the statistical properties of these estimators are unknown, and a rationale for interval estimation is not straightforward. Partial answers to these questions can be obtained by using asymptotic theory. By asymptotic we mean here the limit as the number of subregions, and hence the size of the region of interest, increases. Another limiting regime would be to fix the size of the region and then to let the number of chromosomes sampled tend to infinity. These are rather different scenarios. Additional sampled copies of the same region are very highly positively correlated, so the gain in information is small. Nothing is known formally, but it is plausible that in this limiting regime the information grows as the logarithm of the sample size. However, precisely because of recombination, there is rather more independence between sequenced regions, from the same chromosomes, as the regions move further apart. Again, the formal position is not clear, but it seems likely that information grows linearly, or close to linearly, in the number of subregions sequenced. For a discussion of these issues in a simpler setting see Pluzhnikov and Donnelly (1996).

Azzalini (1983) discussed the asymptotic properties of maximum likelihood estimates (MLEs) which are based on approximate likelihoods that are similar to our composite likelihood. How these ideas specifically apply to our composite likelihood is considered in detail in Fearnhead (2002). We briefly, and very informally, discuss these results here.

The asymptotic properties of the composite likelihood MLEs will depend on the correlation between the score functions for different subregions. We studied these correlations via simulation (the results are not shown), and they appeared small. This suggests treating the composite likelihood as a true likelihood, and in particular assuming that the MLE has an approximate normal distribution, and that the likelihood ratio statistic has an approximate $\chi^2$-distribution. The theoretical results of Fearnhead (2002) show that the correlation decays inversely with the amount of recombination between the subregions. Whereas this decay is sufficiently quick to ensure consistency of the MLEs based on the composite likelihood, it is sufficiently slow that the asymptotic distribution of the likelihood ratio statistic will not be $\chi^2$ (in fact a $\chi^2$-approximation can be made arbitrarily poor by using increasingly more subregions).

It may still be the case that the usual asymptotic distributional results will provide useful approximations in some settings. This may occur if the subregions themselves were large and the log-likelihood from each subregion were approximately quadratic, and if the number of subregions were small. Below, we use simulation to examine the distribution of the likelihood ratio statistic, and also the MLE, for our composite likelihood.

4.2. Simulation results

We now describe the results of a simulation study aimed at understanding the properties of the composite likelihood method. We consider the large sequence properties and sampling distributions of the estimators. We simulated our data from the coalescent, assuming neutrality, random mating and a constant population size. For reasons discussed above we simulated data from 50 chromosomes throughout. We generated data over different sequence lengths, assuming that 1 kb of DNA corresponds to parameter values ρ = θ = 1.0. As noted above, these would be typical values for human populations. Calculating the full likelihood even for a subregion of 1 or 2 kb is extremely computationally intensive.
Especially for the larger region, it is also person intensive, as care should ideally be used in deciding whether enough iterations of the importance sampling method have been used. (See Fearnhead and Donnelly (2001) for a detailed discussion.) The trade-off in the composite likelihood approach is clear. The larger the size of the subregions, the less information is lost through ignoring dependences, but the greater the cost of obtaining good likelihood estimates, and the higher the chance, especially if the process is automated, of serious errors in the subregion likelihood estimates. The main effect here is that, although the importance sampling method that we use gives an unbiased estimate of the likelihood, it is the sample mean of a sample from a distribution with an extremely long right-hand tail. Thus in practice not running the method for sufficiently long will typically result in an underestimation of the likelihood. (See Fearnhead and Donnelly (2001) and Stephens and Donnelly (2000) for a fuller discussion.)

Simulation studies of highly computationally intensive methods are necessarily somewhat limited, so caution should be applied in interpreting the results below. For the composite likelihood approach, the largest size of subregion that is amenable to a simulation study will be smaller than the largest size which can be used for particular data analyses.

Fig. 4. Composite log-likelihood curves for ten 100 kb data sets: each composite log-likelihood curve is based on 1 kb subregions; the bold curve shows the sum of the 10 composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0

Fig. 5. Composite log-likelihood curves for ten 100 kb data sets, based on 2 kb subregions: the bold curve shows the sum of the relevant composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0

We first considered the large sample properties of the composite likelihood MLE. We generated one 500 kb and ten 100 kb data sets. The composite log-likelihood curves (based on splitting each region into 1 kb subregions) for the ten 100 kb data sets are shown in Fig. 4. There is no noticeable bias in the estimation of ρ.
This conclusion is supported by the composite log-likelihood curve for the 500 kb data set (the results are not shown). We also analysed the 100 kb data sets using subregions of 2 kb. Fig. 5 shows the composite likelihood curves based on using 2 million simulated ancestral histories to estimate the likelihood for each subregion. There is evidence of a negative bias caused by occasional inaccuracies in estimating the likelihood curves for individual subregions.
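The remark above, that an unbiased importance sampling estimate with a long right-hand tail will typically come out too low when the run is short, can be illustrated with a small simulation. This is a sketch under stated assumptions, not the sampler of Fearnhead and Donnelly (2001): the weights are drawn from a lognormal distribution scaled to have mean 1 (standing in for a true likelihood of 1), and `effective_sample_size` uses the common $(\sum w)^2 / \sum w^2$ diagnostic, which is assumed here rather than taken from the paper.

```python
import random
import statistics
from typing import List, Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Common diagnostic (sum w)^2 / sum w^2 for importance weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def replicate_estimates(n_weights: int, n_replicates: int,
                        sigma: float, seed: int) -> List[float]:
    """Each replicate is the sample mean of n_weights heavy-tailed weights.

    Weights are lognormal with mu = -sigma^2 / 2, so their expectation is
    exactly 1; every replicate is an unbiased estimate of the value 1.
    """
    rng = random.Random(seed)
    mu = -0.5 * sigma * sigma
    estimates = []
    for _ in range(n_replicates):
        weights = [rng.lognormvariate(mu, sigma) for _ in range(n_weights)]
        estimates.append(sum(weights) / n_weights)
    return estimates


# Illustrative settings only: 200 short runs of 1000 weights each.
estimates = replicate_estimates(n_weights=1000, n_replicates=200,
                                sigma=3.0, seed=1)
share_too_low = sum(e < 1.0 for e in estimates) / len(estimates)
median_estimate = statistics.median(estimates)
```

In this toy setting most replicates fall below the true value even though the estimator is unbiased, because the mean is dominated by rare, very large weights that a short run usually misses. Summing the logarithms of such estimates over many subregions would compound the effect.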

Fig. 6. QQ-plots of the composite likelihood MLE and likelihood ratio statistic for (a) 20 kb data and (b) 10 kb data: each composite likelihood is based on 1 kb subregions; the results are based on 1500 kb of simulated data

Also of concern is the appropriateness of assuming an approximate normal distribution for $\hat\rho$, and an approximate $\chi^2_1$-distribution for the likelihood ratio statistic. Fig. 6 shows QQ-plots for both $\hat\rho$ and the likelihood ratio statistic for the composite likelihood for both 10 kb and 20 kb data (each composite likelihood was based on 1 kb subregions). For 20 kb data, the $\chi^2_1$-approximation for the likelihood ratio statistic seems poor, whereas (except for the constraint of ρ ≥ 0) $\hat\rho$ does have an approximate normal distribution. Also, as the sequence length is increased from 10 kb to 20 kb, the fit of the likelihood ratio statistic appears to be less good, as suggested by the theoretical analysis of Fearnhead (2002).

Finally, Table 1 summarizes how the performance of inference for ρ depends on the length of data that are analysed (again inference is based on the composite likelihood for the various choices of subregion size). Firstly consider the results based on 1 kb subregions. The point estimates of ρ appear good, though the length of the sequence plays a crucial role in the variance of the estimates. In contrast, the coverage properties of interval estimates based on the asymptotic distribution of the likelihood ratio statistic are poor. Interval estimation based on a normal approximation for $\hat\rho$ performs somewhat better, but again the confidence intervals have coverage probabilities for large sequences that are lower than nominal. The results based on 2 kb subregions show evidence for a bias. Despite this, the performance of point estimation is better (measured either via the mean-square error or the proportion of times that the MLE is within a factor of 2 of the truth).
The extra information in a 2 kb subregion as opposed to two 1 kb subregions considerably reduces the variance of the estimators. Interval estimates have poor coverage properties. We note again that there is no theoretical foundation even for interval estimation for the full likelihood MLE for ρ, although limited empirical evidence is encouraging (Fearnhead and Donnelly, 2001).
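The two interval constructions whose coverage is discussed above, and the "within a factor of 2" summary, can be sketched as follows for a log-likelihood curve stored on an equally spaced grid of ρ values. All function names and the quadratic `toy_curve` are illustrative assumptions; the normal interval takes the variance of the MLE to be the inverse of the curvature of the curve at its maximum, estimated here by a central finite difference.

```python
import math
from typing import List, Sequence, Tuple

CHI2_1_95 = 3.841  # 95% point of the chi-squared distribution with 1 d.f.
Z_975 = 1.96       # 97.5% point of the standard normal distribution


def lr_interval(grid: Sequence[float], curve: Sequence[float]
                ) -> Tuple[float, float]:
    """Range of grid values whose likelihood ratio statistic
    2 * (max - loglik) is at most the chi-squared(1) cut-off.
    If the accepted set is not contiguous, its outer limits are returned."""
    peak = max(curve)
    inside = [r for r, ll in zip(grid, curve) if 2.0 * (peak - ll) <= CHI2_1_95]
    return min(inside), max(inside)


def normal_interval(grid: Sequence[float], curve: Sequence[float]
                    ) -> Tuple[float, float]:
    """MLE +/- 1.96 standard errors, with variance = 1 / curvature at the
    maximum; the lower limit respects the constraint rho >= 0."""
    k = max(range(len(curve)), key=curve.__getitem__)
    if k == 0 or k == len(curve) - 1:
        raise ValueError("maximum lies on the grid boundary")
    h = grid[k + 1] - grid[k]  # assumes an equally spaced grid
    curvature = -(curve[k + 1] - 2.0 * curve[k] + curve[k - 1]) / (h * h)
    half_width = Z_975 / math.sqrt(curvature)
    return max(0.0, grid[k] - half_width), grid[k] + half_width


def g_statistic(estimates: Sequence[float], truth: float) -> float:
    """Proportion of estimates within a factor of 2 of the true value."""
    hits = sum(truth / 2.0 <= e <= 2.0 * truth for e in estimates)
    return hits / len(estimates)


# Toy quadratic curve with maximum at rho = 1 and variance 0.25 (illustrative).
grid: List[float] = [i / 100 for i in range(0, 501)]  # rho from 0 to 5
toy_curve = [-((r - 1.0) ** 2) / (2 * 0.25) for r in grid]
lr_lo, lr_hi = lr_interval(grid, toy_curve)
n_lo, n_hi = normal_interval(grid, toy_curve)
```

For an exactly quadratic curve the two intervals agree up to the grid resolution; the discrepancies reported for the composite likelihood arise because neither approximation is guaranteed to hold for it.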

Table 1. Summary of the sampling properties of the composite likelihood MLE, for ρ per kilobase (the true value is 1) and associated confidence intervals, for different lengths of data, based on a sample of 50 chromosomes

Columns: Subregions; Length (kb); Mean; Variance; Median; g; Confidence interval coverage (a), (b). Rows: subregions of 1 kb, 2 kb and 5 kb.

The statistic g (used in Wall's (2000) comparisons) is the proportion of times that the MLE is within a factor of 2 of the truth. The final two columns give the estimated coverage probability of approximate 95% confidence intervals. These confidence intervals are based on (a) an approximate $\chi^2_1$-distribution for the likelihood ratio statistic and (b) an approximate normal distribution for the MLE, whose variance is the inverse of the curvature of the relevant composite log-likelihood curve. The results for subregions of 1 and 2 kb are based on 1500 kb of simulated data; those for 5 kb on 1000 kb of simulated data, with data simulated under the standard neutral model. Composite likelihoods for subregions of 1 and 2 kb were based on the exact likelihood for each subregion, that for 5 kb on the approximate marginal likelihood, omitting only singleton sites, for each subregion. The composite likelihood curve was calculated for ρ per kilobase between 0 and 5 in all cases. For 5 kb sequences, estimates of ρ = 5 were obtained in around 2% of cases; these indicate an estimate of ρ which is greater than 5.0, and hence the estimated means and variances for 5 kb sequences are negatively biased.

The analysis for the 5 kb subregions is at the limit of feasibility for a simulation study. One consequence is that the importance sampling estimates of the approximate marginal likelihoods for each subregion may not be particularly accurate. For example in many cases these likelihood estimates had estimated effective sample sizes (see Fearnhead and Donnelly (2001) for details) of around 10.
Here, an inaccurate likelihood surface may, and apparently does, particularly affect the properties of confidence intervals, although it may also be responsible for the increased bias. In analysing a particular data set, one can assess the effective sample size and if necessary increase the simulation effort in calculating subregion approximate likelihoods. As a consequence, the actual performance when using 5 kb subregions should be better than suggested by Table 1. We thus regard the use of 5 kb subregions as the best alternative among the composite likelihood approaches that we considered.

In connection with Table 1 we note that there are possible but very unlikely sample configurations for which the MLE of ρ will be infinite. Thus the moments of $\hat\rho$ do not exist. Nonetheless, we have found that the sample mean and sample variance of $\hat\rho$ provide a helpful summary of the properties of different estimators.

We give estimated histograms of the sampling distributions for some scenarios in Fig. 7. These confirm the conclusions described above in discussing Table 1. Fig. 7(c) shows the performance of the approximate marginal likelihood when applied to 5 kb sequences. The sampling properties of the estimator compare favourably with those of the MLE for smaller sequences (see Fig. 9 of Fearnhead and Donnelly (2001)).

In spite of the approximations that are inherent in its construction, the behaviour of the composite likelihood MLEs, at least for point estimation, is encouraging. We briefly compare it with other available estimators.

Fig. 7. Histograms of the composite likelihood MLE for ρ based on samples of 50 chromosomes: (a) 5 kb sequences, 1 kb subregions; (b) 20 kb sequences, 1 kb subregions; (c) 5 kb sequences, 5 kb subregions; (d) 20 kb sequences, 5 kb subregions

The method of Wakeley (1997) estimates ρ per kilobase for a 50 kb region (based on a sample of 20 chromosomes) with a variance of 1.16, making it substantially worse than our composite likelihood estimators (for example, when based on 2 kb subregions, the composite likelihood MLE, as described in Table 1, has a sample variance of 0.09). Wall (2000) had also found that this estimator performed poorly in his comparisons. Wall's (2000) method performs worse than the approximate marginal likelihood for 5 kb data and has a performance which is comparable with that of the composite likelihood for 10 kb data (with 5 kb subregions) when performance is measured in terms of the statistic g (defined in Table 1) and better performance in terms of the mean-square error (Wall, personal communication). On the basis of our own simulations (the data are not shown), and those of Hudson (personal communication), Hudson's pairwise likelihood method performs somewhat worse than approximate marginal likelihood for 5 kb data, but comparably or sometimes better than any composite likelihood for longer sequence lengths.

The comparison with Hudson's (2001a,b) method is interesting. His method combines data from all pairs of segregating sites. In long sequences, this will be dominated by comparisons between sites which are reasonably distant. In contrast, our composite likelihood explicitly ignores information in data from regions of the sequence that are widely separated, concentrating instead on the local information from nearby segregating sites but using joint information from many such sites, rather than just from pairs of sites. The loss of the joint information seems expensive for relatively short sequences (for which full or marginal likelihood approaches perform better). As the sequence length grows, it begins to be offset by the additional information from patterns of LD between distant sites. In this sense Hudson's pairwise and our composite likelihood approaches are somewhat complementary. We are currently investigating ways of combining the two approaches.

5. Lipoprotein lipase data

We now apply our approximate likelihood methods to analyse sequence data from the LPL gene. The full data are presented in Nickerson et al. (1998) and Clark et al. (1998) and have been used by Templeton et al. (2000a), Przeworski and Wall (2001) and Kuhner et al. (2000) to estimate the amount of recombination in the LPL gene. The data consist of approximately 9.7 kb of DNA sequenced in 142 chromosomes from individuals in Jackson and Rochester in the USA and from North Karelia, Finland. The full haplotype information of some chromosomes is not known: the phase of singleton mutations was not determined, and for some segregating sites the alleles on some chromosomes are unknown. We focus here on two specific issues raised in the original papers: (a) the extent to which recurrent mutation (i.e. more than one mutation event at a particular site), rather than recombination, has shaped the data and (b) the possibility of substantially elevated recombination rates towards the centre of the sequenced region.

5.1. Possible recurrent mutation

Templeton et al. (2000a, b) suggested that there is a significant amount of repeat mutation, due to variation in the mutation rate between sites. The issue is potentially important, because multiple mutations at the same site can leave patterns in the data that are similar to those from recombination events, so, as Templeton and colleagues have argued, a failure to account for this would lead to an overestimation of the recombination rate. In particular they suggested that CpG dinucleotides mutate at a much faster rate than the genome-wide average. (This has been noted from other data: for example, Nachman and Crowell (2000) estimated that they mutate 10 times more frequently than average; see also Krawczak et al. (1998).)
We first analysed the data from the 48 chromosomes from Jackson. We base our inference on the composite likelihood, splitting the data into 10, 975-base, subregions. As with our earlier suggestion, we used the approximate marginal likelihood in each subregion, in which singleton mutations were ignored. In addition to the computational saving, this has the advantage that for these data the phase of singleton mutations is not known in any case. We also note that it means that throughout this analysis we use a finite sites mutation model, which explicitly allows for the possibility of repeat mutations at each site. For each subregion, we omitted any chromosome whose full haplotype was not known, leaving samples of between 31 and 48 chromosomes for the 10 subregions. On average, it took half a day to calculate the approximate marginal likelihood for each subregion by using a 400 MHz personal computer. The coefficient of variation of the final importance sampling weights was generally of the order of 10 3 or 10 4, implying that the estimates of the approximate marginal likelihood surface in each subregion are within a few per cent of the correct values. As one way of assessing the consequences of increased mutation rates associated with CpG dinucleotides, we analysed the data under two mutational models. In the first, all sites mutated at the same rate. In the second, we identified CpG doublets in the data and allowed these to mutate at a rate that was 10 times higher than that for other sites. In fact, only mutations away from CpG sites seem to be at increased frequency and, of these, transitions at a higher rate than


More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Molecular Evolution. COMP Fall 2010 Luay Nakhleh, Rice University

Molecular Evolution. COMP Fall 2010 Luay Nakhleh, Rice University Molecular Evolution COMP 571 - Fall 2010 Luay Nakhleh, Rice University Outline (1) The neutral theory (2) Measures of divergence and polymorphism (3) DNA sequence divergence and the molecular clock (4)

More information

HST.161 Molecular Biology and Genetics in Modern Medicine Fall 2007

HST.161 Molecular Biology and Genetics in Modern Medicine Fall 2007 MIT OpenCourseWare http://ocw.mit.edu HST.161 Molecular Biology and enetics in Modern Medicine Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat Overview Methods for gene mapping and haplotype analysis Prof. Hannu Toivonen hannu.toivonen@cs.helsinki.fi Discovery and utilization of patterns in the human genome Shared patterns family relationships,

More information

A Fast Estimate for the Population Recombination Rate Based on Regression

A Fast Estimate for the Population Recombination Rate Based on Regression INVESTIGATION A Fast Estimate for the Population Recombination Rate Based on Regression Kao Lin,* Andreas Futschik, and Haipeng Li*,1 *CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute

More information

Analysis of genome-wide genotype data

Analysis of genome-wide genotype data Analysis of genome-wide genotype data Acknowledgement: Several slides based on a lecture course given by Jonathan Marchini & Chris Spencer, Cape Town 2007 Introduction & definitions - Allele: A version

More information

An Introduction to Population Genetics

An Introduction to Population Genetics An Introduction to Population Genetics THEORY AND APPLICATIONS f 2 A (1 ) E 1 D [ ] = + 2M ES [ ] fa fa = 1 sf a Rasmus Nielsen Montgomery Slatkin Sinauer Associates, Inc. Publishers Sunderland, Massachusetts

More information

Genome-Wide Association Studies (GWAS): Computational Them

Genome-Wide Association Studies (GWAS): Computational Them Genome-Wide Association Studies (GWAS): Computational Themes and Caveats October 14, 2014 Many issues in Genomewide Association Studies We show that even for the simplest analysis, there is little consensus

More information

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes Coalescence Scribe: Alex Wells 2/18/16 Whenever you observe two sequences that are similar, there is actually a single individual

More information

High-throughput genome scaffolding from in vivo DNA interaction frequency

High-throughput genome scaffolding from in vivo DNA interaction frequency correction notice Nat. Biotechnol. 31, 1143 1147 (213) High-throughput genome scaffolding from in vivo DNA interaction frequency Noam Kaplan & Job Dekker In the version of this supplementary file originally

More information

Human linkage analysis. fundamental concepts

Human linkage analysis. fundamental concepts Human linkage analysis fundamental concepts Genes and chromosomes Alelles of genes located on different chromosomes show independent assortment (Mendel s 2nd law) For 2 genes: 4 gamete classes with equal

More information

N e =20,000 N e =150,000

N e =20,000 N e =150,000 Evolution: For Review Only Page 68 of 80 Standard T=100,000 r=0.3 cm/mb r=0.6 cm/mb p=0.1 p=0.3 N e =20,000 N e =150,000 Figure S1: Distribution of the length of ancestral segment according to our approximation

More information

ESTIMATION OF AVERAGE HETEROZYGOSITY AND GENETIC DISTANCE FROM A SMALL NUMBER OF INDIVIDUALS

ESTIMATION OF AVERAGE HETEROZYGOSITY AND GENETIC DISTANCE FROM A SMALL NUMBER OF INDIVIDUALS ESTIMATION OF AVERAGE HETEROZYGOSITY AND GENETIC DISTANCE FROM A SMALL NUMBER OF INDIVIDUALS MASATOSHI NE1 Center for Demographic and Population Genetics, University of Texas at Houston, Texas 77025 Manuscript

More information

Why do we need statistics to study genetics and evolution?

Why do we need statistics to study genetics and evolution? Why do we need statistics to study genetics and evolution? 1. Mapping traits to the genome [Linkage maps (incl. QTLs), LOD] 2. Quantifying genetic basis of complex traits [Concordance, heritability] 3.

More information

Usefulness of single nucleotide polymorphism (SNP) data for estimating population parameters

Usefulness of single nucleotide polymorphism (SNP) data for estimating population parameters Usefulness of single nucleotide polymorphism (SNP) data for estimating population parameters Mary K. Kuhner, Peter Beerli, Jon Yamato and Joseph Felsenstein Department of Genetics, University of Washington

More information

Genetic Recombination

Genetic Recombination Chapter 10 Genetic Recombination Mary Sara McPeek Genetic recombination and genetic linkage are dual phenomena that arise in connection with observations on the joint pattern of inheritance of two or more

More information

Selection and genetic drift

Selection and genetic drift Selection and genetic drift Introduction There are three basic facts about genetic drift that I really want you to remember, even if you forget everything else I ve told you about it: 1. Allele frequencies

More information

COMPUTER SIMULATIONS AND PROBLEMS

COMPUTER SIMULATIONS AND PROBLEMS Exercise 1: Exploring Evolutionary Mechanisms with Theoretical Computer Simulations, and Calculation of Allele and Genotype Frequencies & Hardy-Weinberg Equilibrium Theory INTRODUCTION Evolution is defined

More information

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA.

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA. QTL mapping in mice Karl W Broman Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA www.biostat.jhsph.edu/ kbroman Outline Experiments, data, and goals Models ANOVA at marker

More information

Lecture WS Evolutionary Genetics Part I - Jochen B. W. Wolf 1

Lecture WS Evolutionary Genetics Part I - Jochen B. W. Wolf 1 N µ s m r - - - Mutation Effect on allele frequencies We have seen that both genotype and allele frequencies are not expected to change by Mendelian inheritance in the absence of any other factors. We

More information

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping - from Darwin's time onward, it has been widely recognized that natural populations harbor a considerably degree of genetic

More information

Computational Workflows for Genome-Wide Association Study: I

Computational Workflows for Genome-Wide Association Study: I Computational Workflows for Genome-Wide Association Study: I Department of Computer Science Brown University, Providence sorin@cs.brown.edu October 16, 2014 Outline 1 Outline 2 3 Monogenic Mendelian Diseases

More information

PopGen1: Introduction to population genetics

PopGen1: Introduction to population genetics PopGen1: Introduction to population genetics Introduction MICROEVOLUTION is the term used to describe the dynamics of evolutionary change in populations and species over time. The discipline devoted to

More information

Strategy for Applying Genome-Wide Selection in Dairy Cattle

Strategy for Applying Genome-Wide Selection in Dairy Cattle Strategy for Applying Genome-Wide Selection in Dairy Cattle L. R. Schaeffer Centre for Genetic Improvement of Livestock Department of Animal & Poultry Science University of Guelph, Guelph, ON, Canada N1G

More information

Lecture 16 Major Genes, Polygenes, and QTLs

Lecture 16 Major Genes, Polygenes, and QTLs Lecture 16 Major Genes, Polygenes, and QTLs Bruce Walsh. jbwalsh@u.arizona.edu. University of Arizona. Notes from a short course taught May 2011 at University of Liege Major and Minor Genes While the machinery

More information

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Phasing of 2-SNP Genotypes based on Non-Random Mating Model Phasing of 2-SNP Genotypes based on Non-Random Mating Model Dumitru Brinza and Alexander Zelikovsky Department of Computer Science, Georgia State University, Atlanta, GA 30303 {dima,alexz}@cs.gsu.edu Abstract.

More information

LINKAGE AND CHROMOSOME MAPPING IN EUKARYOTES

LINKAGE AND CHROMOSOME MAPPING IN EUKARYOTES LINKAGE AND CHROMOSOME MAPPING IN EUKARYOTES Objectives: Upon completion of this lab, the students should be able to: Understand the different stages of meiosis. Describe the events during each phase of

More information

2014 Pearson Education, Inc. Mapping Gene Linkage

2014 Pearson Education, Inc. Mapping Gene Linkage Mapping Gene Linkage Dihybrid Cross - a cross showing two traits e.g pea shape and pea color The farther apart the genes are to one another the more likely a break between them happens and there will

More information

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip : Sample Size, Power, Imputation, and the Choice of Genotyping Chip Chris C. A. Spencer., Zhan Su., Peter Donnelly ", Jonathan Marchini " * Department of Statistics, University of Oxford, Oxford, United

More information

FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype?

FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype? FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype? 1 Linkage & Recombination HUH? What? Why? Who cares? How? Multiple choice question. Each colored line represents

More information

Signatures of a population bottleneck can be localised along a recombining chromosome

Signatures of a population bottleneck can be localised along a recombining chromosome Signatures of a population bottleneck can be localised along a recombining chromosome Céline Becquet, Peter Andolfatto Bioinformatics and Modelling, INSA of Lyon Institute for Cell, Animal and Population

More information

Notes on Intertemporal Consumption Choice

Notes on Intertemporal Consumption Choice Paul A Johnson Economics 304 Advanced Topics in Macroeconomics Notes on Intertemporal Consumption Choice A: The Two-Period Model Consider an individual who faces the problem of allocating their available

More information

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad.

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad. GENETIC ALGORITHMS Narra Priyanka K.Naga Sowjanya Vasavi College of Engineering. Ibrahimbahg,Hyderabad mynameissowji@yahoo.com priyankanarra@yahoo.com Abstract Genetic algorithms are a part of evolutionary

More information

Genetic drift. 1. The Nature of Genetic Drift

Genetic drift. 1. The Nature of Genetic Drift Genetic drift. The Nature of Genetic Drift To date, we have assumed that populations are infinite in size. This assumption enabled us to easily calculate the expected frequencies of alleles and genotypes

More information

Detecting ancient admixture using DNA sequence data

Detecting ancient admixture using DNA sequence data Detecting ancient admixture using DNA sequence data October 10, 2008 Jeff Wall Institute for Human Genetics UCSF Background Origin of genus Homo 2 2.5 Mya Out of Africa (part I)?? 1.6 1.8 Mya Further spread

More information

LD Mapping and the Coalescent

LD Mapping and the Coalescent Zhaojun Zhang zzj@cs.unc.edu April 2, 2009 Outline 1 Linkage Mapping 2 Linkage Disequilibrium Mapping 3 A role for coalescent 4 Prove existance of LD on simulated data Qualitiative measure Quantitiave

More information

Application of Statistical Sampling to Audit and Control

Application of Statistical Sampling to Audit and Control INTERNATIONAL JOURNAL OF BUSINESS FROM BHARATIYA VIDYA BHAVAN'S M. P. BIRLA INSTITUTE OF MANAGEMENT, BENGALURU Vol.12, #1 (2018) pp 14-19 ISSN 0974-0082 Application of Statistical Sampling to Audit and

More information

Genetics Lecture Notes Lectures 6 9

Genetics Lecture Notes Lectures 6 9 Genetics Lecture Notes 7.03 2005 Lectures 6 9 Lecture 6 Until now our analysis of genes has focused on gene function as determined by phenotype differences brought about by different alleles or by a direct

More information

RECOMBINATION plays a central role in shaping (MHC) class II region (Jeffreys et al. 2001; Jeffreys

RECOMBINATION plays a central role in shaping (MHC) class II region (Jeffreys et al. 2001; Jeffreys Copyright 2004 by the Genetics Society of America DOI: 10.1534/genetics.103.021584 Application of Coalescent Methods to Reveal Fine-Scale Rate Variation and Recombination Hotspots Paul Fearnhead,* Rosalind

More information

Crash-course in genomics

Crash-course in genomics Crash-course in genomics Molecular biology : How does the genome code for function? Genetics: How is the genome passed on from parent to child? Genetic variation: How does the genome change when it is

More information

Control Charts for Customer Satisfaction Surveys

Control Charts for Customer Satisfaction Surveys Control Charts for Customer Satisfaction Surveys Robert Kushler Department of Mathematics and Statistics, Oakland University Gary Radka RDA Group ABSTRACT Periodic customer satisfaction surveys are used

More information

2. Genetic Algorithms - An Overview

2. Genetic Algorithms - An Overview 2. Genetic Algorithms - An Overview 2.1 GA Terminology Genetic Algorithms (GAs), which are adaptive methods used to solve search and optimization problems, are based on the genetic processes of biological

More information

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA.

QTL mapping in mice. Karl W Broman. Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA. QTL mapping in mice Karl W Broman Department of Biostatistics Johns Hopkins University Baltimore, Maryland, USA www.biostat.jhsph.edu/ kbroman Outline Experiments, data, and goals Models ANOVA at marker

More information

Mathematics of Forensic DNA Identification

Mathematics of Forensic DNA Identification Mathematics of Forensic DNA Identification World Trade Center Project Extracting Information from Kinships and Limited Profiles Jonathan Hoyle Gene Codes Corporation 2/17/03 Introduction 2,795 people were

More information

Proceedings of the World Congress on Genetics Applied to Livestock Production 11.64

Proceedings of the World Congress on Genetics Applied to Livestock Production 11.64 Hidden Markov Models for Animal Imputation A. Whalen 1, A. Somavilla 1, A. Kranis 1,2, G. Gorjanc 1, J. M. Hickey 1 1 The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK

More information

Terry Speed s Work on Genetic Recombination

Terry Speed s Work on Genetic Recombination Terry Speed s Work on Genetic Recombination Mary Sara McPeek Genetic recombination and genetic linkage are dual phenomena that arise in connection with observations on the joint pattern of inheritance

More information

Improvement of Association-based Gene Mapping Accuracy by Selecting High Rank Features

Improvement of Association-based Gene Mapping Accuracy by Selecting High Rank Features Improvement of Association-based Gene Mapping Accuracy by Selecting High Rank Features 1 Zahra Mahoor, 2 Mohammad Saraee, 3 Mohammad Davarpanah Jazi 1,2,3 Department of Electrical and Computer Engineering,

More information

Gene Linkage and Genetic. Mapping. Key Concepts. Key Terms. Concepts in Action

Gene Linkage and Genetic. Mapping. Key Concepts. Key Terms. Concepts in Action Gene Linkage and Genetic 4 Mapping Key Concepts Genes that are located in the same chromosome and that do not show independent assortment are said to be linked. The alleles of linked genes present together

More information

FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype?

FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype? FINDING THE PAIN GENE How do geneticists connect a specific gene with a specific phenotype? 1 Linkage & Recombination HUH? What? Why? Who cares? How? Multiple choice question. Each colored line represents

More information

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012 Lecture 23: Causes and Consequences of Linkage Disequilibrium November 16, 2012 Last Time Signatures of selection based on synonymous and nonsynonymous substitutions Multiple loci and independent segregation

More information

Supplementary Methods 2. Supplementary Table 1: Bottleneck modeling estimates 5

Supplementary Methods 2. Supplementary Table 1: Bottleneck modeling estimates 5 Supplementary Information Accelerated genetic drift on chromosome X during the human dispersal out of Africa Keinan A, Mullikin JC, Patterson N, and Reich D Supplementary Methods 2 Supplementary Table

More information

Lecture 10: Introduction to Genetic Drift. September 28, 2012

Lecture 10: Introduction to Genetic Drift. September 28, 2012 Lecture 10: Introduction to Genetic Drift September 28, 2012 Announcements Exam to be returned Monday Mid-term course evaluation Class participation Office hours Last Time Transposable Elements Dominance

More information

Random Allelic Variation

Random Allelic Variation Random Allelic Variation AKA Genetic Drift Genetic Drift a non-adaptive mechanism of evolution (therefore, a theory of evolution) that sometimes operates simultaneously with others, such as natural selection

More information

ENVIRONMENTAL FINANCE CENTER AT THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL SCHOOL OF GOVERNMENT REPORT 3

ENVIRONMENTAL FINANCE CENTER AT THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL SCHOOL OF GOVERNMENT REPORT 3 ENVIRONMENTAL FINANCE CENTER AT THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL SCHOOL OF GOVERNMENT REPORT 3 Using a Statistical Sampling Approach to Wastewater Needs Surveys March 2017 Report to the

More information

Cross Haplotype Sharing Statistic: Haplotype length based method for whole genome association testing

Cross Haplotype Sharing Statistic: Haplotype length based method for whole genome association testing Cross Haplotype Sharing Statistic: Haplotype length based method for whole genome association testing André R. de Vries a, Ilja M. Nolte b, Geert T. Spijker c, Dumitru Brinza d, Alexander Zelikovsky d,

More information

Identifying Genes Underlying QTLs

Identifying Genes Underlying QTLs Identifying Genes Underlying QTLs Reading: Frary, A. et al. 2000. fw2.2: A quantitative trait locus key to the evolution of tomato fruit size. Science 289:85-87. Paran, I. and D. Zamir. 2003. Quantitative

More information

Basic Concepts of Human Genetics

Basic Concepts of Human Genetics Basic oncepts of Human enetics The genetic information of an individual is contained in 23 pairs of chromosomes. Every human cell contains the 23 pair of chromosomes. ne pair is called sex chromosomes

More information

Using Mapmaker/QTL for QTL mapping

Using Mapmaker/QTL for QTL mapping Using Mapmaker/QTL for QTL mapping M. Maheswaran Tamil Nadu Agriculture University, Coimbatore Mapmaker/QTL overview A number of methods have been developed to map genes controlling quantitatively measured

More information