PopGen1: Introduction to population genetics


Introduction

MICROEVOLUTION is the term used to describe the dynamics of evolutionary change in populations and species over time. The discipline devoted to the study of microevolution at the molecular level is called POPULATION GENETICS. Note that the complementary term, MACROEVOLUTION, is most often used to describe patterns and processes of diversification above the species level, such as rates of speciation and the origin of evolutionary novelties. Disciplines such as PHYLOGENETICS, COMPARATIVE BIOLOGY, and PALAEONTOLOGY concern themselves with macroevolution. It is important to understand that the process of microevolution can generate most, if not all, macroevolution by operating over the relevant time spans.

Since we are interested in the outcome of evolution at the molecular level, i.e., the diversity of genes, proteins, chromosomes, genomes, and organisms that we see in the world around us today, we must understand population genetics. The principles of population genetics are used to (i) measure the extent of genetic variation in natural populations, and (ii) explain natural genetic variation in terms of its origin, maintenance, and evolutionary significance.

The practice of population genetics is based on using the information contained in a sample of a population to make inferences about the evolutionary processes affecting that population. The discipline often uses explicit models as the basis for testing hypotheses about the evolutionary processes that generated the sample of data, and for estimating parameters; this approach is called PARAMETRIC INFERENCE. Sometimes no model is assumed, and that approach is called NONPARAMETRIC INFERENCE.

The discipline of population genetics has practical value in a wide variety of other disciplines. Medical biology relies heavily on population genetics to detect links between genetic variation and disease.
Agriculture uses the principles in the development of breeding programs. Conservation, fisheries, wildlife, and forestry biologists all use population genetics. In fact, the combination of molecular data and population genetic principles has revolutionized the fields of conservation biology and ecology. Anthropologists are very interested in population genetics because it can be used to make inferences about patterns of migration and interbreeding during the history of human evolution.

Until this point the term POPULATION has been used rather informally, but now is the time for a formal definition. A population is a subgroup of individuals of the same species living within some set of restrictions, usually a restricted geographic area. The term population is not used to refer to an entire species. Individuals of a species are rarely distributed uniformly; they tend to aggregate according to the most favourable aspects of the environment. For terrestrial vertebrates such aggregations are often obvious: herds, colonies, habitat preferences such as marshes or deserts, etc. Less clear are the boundaries associated with genetic similarity.

In population genetics, the population is defined by the restriction of local interbreeding. Sometimes such populations are called LOCAL POPULATIONS or DEMES, but they are very often simply called a POPULATION. The practical definition will vary from species to species, and also from researcher to researcher. Although a more precise definition is difficult, for our purposes we need only assume that we have defined the population appropriately. We want to define the local population correctly because it is the unit within which the evolution of adaptive characteristics occurs; i.e., it is the evolving unit of the species.

So, for the purposes of this course we come to a definition of EVOLUTION: the change in the genetic constitution of a population over time, or simply the change in allele frequencies in a population over time.
Allele frequencies in populations

In population genetics, variation is often quantified using the notion of ALLELE FREQUENCIES. Alleles are the alternative forms of genes, including the so-called normal form or wild-type form. The allele frequencies are the simple proportions of different alleles within a local population. Note that sometimes the term GENE FREQUENCY will be used synonymously with allele frequency. Long-term changes in allele frequencies, i.e., evolution, occur within the unit of the local population.

Before we can measure evolution by the change in allele frequencies in a population, we must be able to measure the frequencies of those alleles. Consider an autosomal locus in a collection of diploid individuals.

Let's take the human M-N blood group as one example. In this blood group there are two alleles, M and N. In this case there are three possible GENOTYPES: MM, MN, and NN. A sample of the Icelandic population was typed for the MN blood group, and the data are presented in the table below:

Table 1. MN blood group genotypes in a sample from Iceland
Genotype    MM     MN     NN     Total (n)
Count       233    385    129    747

From the data in this table we can compute the allele frequencies of M and N in this sample. By convention we use p and q to represent the frequencies of M and N, with p + q = 1. Each MM individual carries two copies of M and each MN individual carries one, out of 2n alleles in total:

$p = \frac{(2 \times MM) + (1 \times MN)}{2n}$ and $q = 1 - p$

Using the data in Table 1:

$p = \frac{(2 \times 233) + (1 \times 385)}{2 \times 747} = 0.57$ and $q = 1 - 0.57 = 0.43$

The exercise in calculating the frequencies of M and N leads us to an important point: the quantities p and q are PARAMETERS of a population with unknown true values. We had to estimate the values of those parameters from a sample of the population. Such estimates will have an error, and we must judge the reliability of our estimate by estimating its VARIANCE. We can estimate the sampling variance of p with:

$\mathrm{Var}(p) = \frac{p(1 - p)}{2n}$

Assuming that repeated estimates would be normally distributed, we can use the variance to make predictions about how close the true frequency of the allele in the population is to our estimate of the frequency. First we take the square root of the variance; this is known as the STANDARD ERROR of the estimate. Under the normal distribution approximately 68% of the estimates of a parameter would fall within the interval defined by the true value ± 1 standard error. This is called a CONFIDENCE INTERVAL. The 95% confidence interval is defined by ± 2 standard errors, and the 99.7% confidence interval is defined by ± 3 standard errors.
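The estimation steps described above can be sketched in a few lines of Python. This is only an illustration: the function name and the genotype counts are hypothetical, and it assumes the sampling variance p(1 - p)/(2n) for the 2n alleles in a diploid sample, together with the ± 2 standard error rule for an approximate 95% interval.

```python
import math


def allele_frequency_summary(n_hom1, n_het, n_hom2, z=2.0):
    """Estimate p, q, Var(p), SE and a +/- z*SE interval from genotype counts.

    n_hom1, n_het, n_hom2 are counts of the two homozygotes and the
    heterozygote; p is the frequency of the allele carried by n_hom1.
    """
    n = n_hom1 + n_het + n_hom2                 # number of diploid individuals
    p = (2 * n_hom1 + n_het) / (2 * n)          # allele count / total alleles
    q = 1.0 - p
    var_p = p * (1.0 - p) / (2 * n)             # binomial sampling variance
    se = math.sqrt(var_p)                       # standard error
    return {"n": n, "p": p, "q": q, "var_p": var_p, "se": se,
            "ci": (p - z * se, p + z * se)}


# Hypothetical counts, chosen only for illustration (not from the notes).
small = allele_frequency_summary(30, 50, 20)
large = allele_frequency_summary(300, 500, 200)  # same proportions, 10x sample

for label, result in (("small", small), ("large", large)):
    low, high = result["ci"]
    print(f"{label}: p = {result['p']:.3f}, Var(p) = {result['var_p']:.6f}, "
          f"interval = ({low:.3f}, {high:.3f})")
```

Both samples give the same estimate of p, but the larger sample gives a narrower interval, because the standard error shrinks with the square root of the sample size.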

Let's use the variance to evaluate the effect of sample size on the precision of an estimate of allele frequencies. Suppose we obtain the following sample from a population: AA = 8, Aa = 6, and aa = 2. Let's use the above formulas to estimate p and q and the 95% confidence interval.

p = 0.688, var(p) = 0.0067, 95% CI = (0.524, 0.851)

Now let's recalculate the results after multiplying the sample by a factor of 10.

p = 0.688, var(p) = 0.00067, 95% CI = (0.636, 0.739)

Increasing the sample size narrowed our confidence interval, indicating that the precision of our estimate has increased. Of course, had we actually taken another sample 10 times larger, we would not have estimated p exactly as we had in our small sample. The 95% CI is a tool for quantifying this sort of sampling error.

Changes in gene frequencies between populations

Let's return to our sample of the frequencies of the MN blood groups from Iceland, and compare it to a sample from Greenland. The data, presented in the table below, clearly indicate a difference in genotype frequencies between these populations. We can draw at least one conclusion from these data: evolution has occurred!

Genotype counts in samples from Iceland and Greenland
Population    MM     MN     NN     Total (n)
Iceland       233    385    129    747
Greenland     (counts not preserved)

We are now faced with the task of identifying the origin and evolutionary significance of such change. The causes of change are the subject matter for later lectures. For now we will simply list the possible causes of change:

1. Finite population size: The alleles passed from one generation to another are a sample of the genes of the parental generation. Because the proportion of alleles in the sample is subject to sampling variation between generations, allele frequencies will vary from generation to generation (evolution by our definition). The size of the sample, in this case the number of members of the population that are reproducing, influences the strength of this effect. Think back to the difference in the variance of our samples of allele frequencies above.

2. Mutation: Of course, if a mutation occurs then a change in allele frequencies has occurred. You have probably already guessed that a single mutational event will have a very small effect in large populations. It is other forces, such as sampling variance or selection, that determine the evolutionary significance of a new mutant over evolutionary time.

3. Non-random mating: Suppose that potential mates in the parental generation have a preference for a certain trait with a genetic basis. If they preferentially choose mates with such a trait, then the frequency of the gene encoding that trait will increase in the next generation. This is just one example of how non-random mating influences allele frequencies.

4. Natural selection: Individuals in a population differ in their ability to survive and reproduce (fitness). If an allele confers increased fitness on the individuals who carry it, then they will make a greater contribution to subsequent generations, thus increasing the frequency of that allele over time. Of course it can work the other way. If a new allele arose by mutation and reduced that individual's fitness, then that individual might not reproduce and the allele would not appear in the next generation.

5. Migration or Gene Flow: Given two or more populations, immigration of individuals from outside the population will change the frequencies of alleles in that population.
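The first cause in the list, finite population size, can be illustrated with a small simulation. The sketch below is not part of the original notes: it assumes a Wright-Fisher-style scheme in which each generation's 2N alleles are drawn at random from the previous generation's allele frequency, and the function name, population sizes, and seed are all hypothetical choices.

```python
import random


def simulate_drift(p0, n_individuals, generations, seed=0):
    """Track an allele frequency under random sampling of gametes alone.

    Assumes discrete generations, constant population size, and no
    mutation, selection, or migration (sampling is the only force).
    """
    rng = random.Random(seed)          # seeded for repeatable runs
    n_alleles = 2 * n_individuals      # diploid: two alleles per individual
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each allele in the new generation is an independent draw
        # from the parental allele frequency.
        copies = sum(1 for _ in range(n_alleles) if rng.random() < p)
        p = copies / n_alleles
        trajectory.append(p)
    return trajectory


small_pop = simulate_drift(p0=0.5, n_individuals=10, generations=50)
large_pop = simulate_drift(p0=0.5, n_individuals=1000, generations=50)

print("N = 10:  ", [round(x, 2) for x in small_pop[:10]])
print("N = 1000:", [round(x, 2) for x in large_pop[:10]])
```

In runs of this kind the small population typically wanders much further from its starting frequency than the large one, which mirrors the effect of sample size on the variance of an allele frequency estimate.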

In order to identify which of these is relevant, we must first understand the simple case where all these factors are excluded from the system. This very simple model results in a type of equilibrium that will be the focus of the remainder of this topic.

Hardy-Weinberg equilibrium

A MODEL is an intentional simplification of a complex situation designed to eliminate extraneous detail in order to focus attention on the essentials of the situation (Daniel L. Hartl). We need a model of an idealized, unchanging population as our starting point. Under such a model gene frequencies will not change between generations: i.e., no evolution occurs. Starting with such a model we can then add different causes of change, such as natural selection or non-random mating, and see how such processes affect the relationship between genotype frequencies and allele frequencies across generations. Modern population and quantitative genetics depend on understanding this relationship.

The HARDY-WEINBERG (HW) MODEL provides us with the model we need. It represents a population at equilibrium with respect to allele and genotype frequencies. Independently specified in 1908 by G.H. Hardy and W. Weinberg, the model is based on the following assumptions:

Assumptions of the HW model
1. The organism is diploid
2. Reproduction is sexual
3. Mating is random
4. Generations are discrete
5. Population size is infinite (or very large)
6. No migration
7. No mutation
8. No natural selection

Under these assumptions we may specify the probabilities of all possible combinations of gametes in the next generation as:

$(p + q) \times (p + q)$

Do you remember the PRODUCT RULE and ADDITIVE RULE from your genetics class? Also, the proportions of all possible genotypes must sum to 1:

$(p + q) \times (p + q) = 1$

Multiplication gives:

$p^2 + 2pq + q^2 = 1$

Let's say that p is the frequency of the A allele and q is the frequency of the a allele. Then the genotype frequencies in the next generation are as follows:

$f_{AA} = p^2$
$f_{Aa} = 2pq$
$f_{aa} = q^2$
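The expansion of (p + q) × (p + q) into genotype frequencies can be checked numerically. The following is a minimal sketch, with a hypothetical function name and an arbitrary illustrative value of p; it also recovers the allele frequency from the genotype frequencies to show that it is unchanged.

```python
def hw_genotype_frequencies(p):
    """Return expected genotype frequencies for a two-allele locus under HW."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


p = 0.7                                   # arbitrary illustrative value
freqs = hw_genotype_frequencies(p)

# Allele frequency recovered from the genotype frequencies:
# all alleles in AA plus half of the alleles in Aa are A.
p_next = freqs["AA"] + 0.5 * freqs["Aa"]

print(freqs)
print("sum of genotype frequencies:", sum(freqs.values()))
print("frequency of A in the next generation:", p_next)
```

The genotype frequencies sum to 1, and the frequency of A recovered from them equals the starting value, which is the sense in which the model describes a population that does not evolve.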

The genotype frequencies can also be seen in the form of a table of gametes:

                   Female gametes
Male gametes       A (p)           a (q)
A (p)              AA ($p^2$)      Aa ($pq$)
a (q)              Aa ($pq$)       aa ($q^2$)

Keynotes of the HW model:

1. The HW model specifies the relationship between allele frequencies and genotype frequencies in a population. Such a relationship can be tested with real data. When a population is consistent with HW expectations we say that the population is in HARDY-WEINBERG EQUILIBRIUM.

2. The standard mechanisms of Mendelian inheritance ensure that allele frequencies do not change from generation to generation unless some external pressure is acting.

3. No matter what the initial frequencies, just one generation of random mating will result in HW frequencies.

These features provide the foundation for many methods of investigation of the genetics of natural populations. In particular it is the departure from HW equilibrium, and the cause of such a departure (i.e., which assumptions of the model have been violated), that interests population geneticists. Unfortunately, HW is not very sensitive to certain kinds of violations. This means that failing to reject HW equilibrium for a natural population does not allow us to conclude that all the assumptions are valid.

Rare recessive alleles can hide in the heterozygotes

The relationship between allele and genotype frequency under HW is informative: when a recessive allele is rare, most of the individuals who carry the allele are heterozygous. The ratio of heterozygotes to rare homozygotes is $2pq : q^2$, which simplifies to $2p/q : 1$, so the ratio grows as q becomes small.

Let's look at a real example. Cystic fibrosis (CF) is a serious human illness caused by an autosomal recessive gene. An individual homozygous for the recessive CF allele produces abnormally thick glandular secretions, which block the normal function of organs such as the lungs and digestive organs, causing difficulties in breathing and digestion. The frequency of CF is about 1 in 1700 newborn Caucasians.
$q^2 = 1/1700$

$q = (1/1700)^{1/2} = 0.024$
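The square-root step, and the way rare recessive alleles concentrate in heterozygotes, can be sketched as follows. This is an illustration under the same HW assumption the example makes; the function names are hypothetical, and the extra values of q in the loop are arbitrary.

```python
import math


def recessive_allele_frequency(affected_frequency):
    """Estimate q from the frequency of recessive homozygotes, assuming HW."""
    return math.sqrt(affected_frequency)


def carrier_frequency(q):
    """Frequency of heterozygous carriers, 2pq, assuming HW."""
    return 2.0 * (1.0 - q) * q


def carriers_per_affected(q):
    """Ratio of heterozygotes to recessive homozygotes, 2pq : q^2 = 2p/q."""
    return 2.0 * (1.0 - q) / q


q_cf = recessive_allele_frequency(1 / 1700)   # incidence quoted in the notes
carriers = carrier_frequency(q_cf)

print(f"q = {q_cf:.3f}")
print(f"carrier frequency 2pq = {carriers:.3f}")
print(f"carriers per affected individual = {carriers_per_affected(q_cf):.0f}")

# Arbitrary illustrative values: the rarer the allele, the larger the ratio.
for q in (0.1, 0.01, 0.001):
    print(f"q = {q}: heterozygotes per homozygote = {carriers_per_affected(q):.0f}")
```

The ratio function makes the general point directly: as q falls, almost every copy of the recessive allele sits in a heterozygote.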

ASSUMING HW: Frequency of heterozygotes in the population = $2pq = 2 \times 0.024 \times (1 - 0.024) = 0.047$

So, if we allowed a generation of random mating and no selection, about 1 in 21 people would be heterozygous carriers of the CF allele, even though CF itself is quite rare in this population, at 1 in 1700! Note that in this case we just assumed HW without testing if the assumption was reasonable. CF is clearly under selection, and population sizes are finite. However, heterozygous individuals are not subject to selection, and the example illustrates that deleterious alleles can lurk in a large fraction of the population without any overt phenotypic effect. More realistic models of CF require consideration of selection, mutation, and drift; the frequency of the CF allele in human populations reflects the interaction of all these evolutionary forces. We will return to this example later in the course.

A proof of the HW principle

Let $P_1 = p^2$, $P_2 = 2pq$, and $P_3 = q^2$ in the parental generation.

Genotypes               AA        Aa        aa
Genotype frequencies    $P_1$     $P_2$     $P_3$

Allele frequencies:
A: $p = (P_1 \times 1) + (P_2 \times 1/2)$
a: $q = (P_3 \times 1) + (P_2 \times 1/2)$

The question is what happens to the three genotype frequencies in the next generation ($P_1'$, $P_2'$, and $P_3'$). Under HW, $P_1' = P_1$, $P_2' = P_2$, and $P_3' = P_3$. We can test for this outcome by listing all possible matings and the frequencies of their offspring genotypes. Then we simply sum the products of each mating that give rise to each of the three possible genotypes. The results are shown in the table below.

                             Genotype frequencies of offspring
Mating       Frequency       AA      Aa      aa
AA × AA      $P_1^2$         1       0       0
AA × Aa      $2P_1P_2$       1/2     1/2     0
AA × aa      $2P_1P_3$       0       1       0
Aa × Aa      $P_2^2$         1/4     1/2     1/4
Aa × aa      $2P_2P_3$       0       1/2     1/2
aa × aa      $P_3^2$         0       0       1

Total in next generation:

$P_1' = P_1^2 + \frac{1}{2}(2P_1P_2) + \frac{1}{4}P_2^2 = \left(P_1 + \frac{P_2}{2}\right)^2 = p^2 = P_1$

$P_2' = \frac{1}{2}(2P_1P_2) + 2P_1P_3 + \frac{1}{2}P_2^2 + \frac{1}{2}(2P_2P_3) = 2\left(P_1 + \frac{P_2}{2}\right)\left(P_3 + \frac{P_2}{2}\right) = 2pq = P_2$

$P_3' = \frac{1}{4}P_2^2 + \frac{1}{2}(2P_2P_3) + P_3^2 = \left(P_3 + \frac{P_2}{2}\right)^2 = q^2 = P_3$

So, we have our proof; after all this we see that the genotype frequencies, and hence the allele frequencies, do NOT change from generation to generation under HW!

Testing for HW proportions

We test a population for HW frequencies in three steps: (i) compute observed genotype frequencies; (ii) compute expected genotype frequencies under HW; and (iii) conduct a goodness of fit test. We will use the following data as an example:

MN blood types in Pueblo Indians
Genotypes          MM     MN     NN     Total (n)
Observed counts    83     46     11     140

PART 1:

Observed genotype frequencies:
MM = 83/140 = 0.59
MN = 46/140 = 0.33
NN = 11/140 = 0.08

Observed allele frequencies:
M = p = 0.59 + (1/2 × 0.33) = 0.755
N = q = 1 − 0.755 = 0.245

Do NOT compute the allele frequencies at this step by assuming HW (i.e., $p \neq \sqrt{0.59}$). Anyone who does this will automatically get an F in the class!

PART 2:

Expected genotype frequencies:
$p^2 = (0.755)^2 = 0.57$
$2pq = 2 \times 0.755 \times 0.245 = 0.37$
$q^2 = (0.245)^2 = 0.06$

Expected genotype counts:
$p^2 \times n = 0.57 \times 140 = 79.8$
$2pq \times n = 0.37 \times 140 = 51.8$
$q^2 \times n = 0.06 \times 140 = 8.4$

Compare these counts to the observed counts in the table above.
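The three steps listed above can be sketched as a single function. This is an illustrative sketch rather than the course's own procedure: the function name is hypothetical, the MM count of 83 is taken as implied by the worked example (46 MN, 11 NN, n = 140), and the P value uses the identity that a chi-square variable with 1 degree of freedom is the square of a standard normal variable. Because the code does not round intermediate values, its statistic is close to, but not identical with, a hand calculation from rounded frequencies.

```python
import math


def hw_chi_square(n_hom1, n_het, n_hom2):
    """Chi-square goodness-of-fit test for HW proportions at a two-allele locus."""
    n = n_hom1 + n_het + n_hom2
    observed = (n_hom1, n_het, n_hom2)

    # Step (i): allele frequencies from the observed genotype counts,
    # NOT from the square root of a homozygote frequency.
    p = (2 * n_hom1 + n_het) / (2 * n)
    q = 1.0 - p

    # Step (ii): expected genotype counts under HW.
    expected = (p * p * n, 2 * p * q * n, q * q * n)

    # Step (iii): goodness of fit with 3 - 1 - 1 = 1 degree of freedom.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail, 1 d.f. only
    return {"p": p, "q": q, "expected": expected,
            "chi2": chi2, "p_value": p_value}


result = hw_chi_square(83, 46, 11)   # MM, MN, NN counts from the example
print(f"p = {result['p']:.3f}, q = {result['q']:.3f}")
print("expected counts:", [round(e, 1) for e in result["expected"]])
print(f"chi-square = {result['chi2']:.2f}, P = {result['p_value']:.2f}")
```

Estimating p from the data costs one degree of freedom, which is why the tail probability is taken from the chi-square distribution with 1 degree of freedom rather than 2.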

PART 3:

$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

$\chi^2 = \frac{(83 - 79.8)^2}{79.8} + \frac{(46 - 51.8)^2}{51.8} + \frac{(11 - 8.4)^2}{8.4}$

$\chi^2 = 1.58$

d.f. = [(number of genotype classes) − 1 (non-independent category) − 1 (p calculated from the data)] = [3 − 1 − 1] = 1

P = 0.20; i.e., there is a 20% chance that we would have observed a test statistic at least this large under HW, so we cannot reject HW for these data.

Testing for HW is not possible under dominance

DOMINANCE occurs when the phenotypic effect of one allele conceals the effect of an alternate allele. An example is the D allele of the human Rh blood group:

Genotype    Phenotype
DD          Rh+
Dd          Rh+
dd          Rh-

In North America the frequency of the Rh+ phenotype (DD or Dd) is 0.86 and the frequency of the Rh- phenotype (dd) is 0.14. If we are willing to assume HW then we can calculate the expected frequencies of D and d:

$q = (0.14)^{1/2} = 0.37$
$p = (1 - 0.37) = 0.63$

Why not use these to test HW? These are EXPECTED frequencies under HW! The calculated frequencies fit the expected frequencies exactly, leaving no degrees of freedom for a test:

d.f. = 2 (for two classes of data) − 1 − 1 (for estimating q) = 0

HW equilibrium with three alleles

There is no reason why the allelic variation at a locus must be limited to two variants. Microsatellite loci commonly have very large numbers of alleles per locus. Let's look at HW for the case of three autosomal alleles:

Alleles: $A_1$, $A_2$, and $A_3$
Frequencies: $p_1$, $p_2$, and $p_3$
$p_1 + p_2 + p_3 = 1$

We can do a cross-multiplication square that shows the HW frequencies for three alleles at a single autosomal locus. Remember we are still making all the HW assumptions listed above.

                    Female gametes
Male gametes        $A_1$ ($p_1$)            $A_2$ ($p_2$)            $A_3$ ($p_3$)
$A_1$ ($p_1$)       $A_1A_1$ ($p_1^2$)       $A_1A_2$ ($p_1p_2$)      $A_1A_3$ ($p_1p_3$)
$A_2$ ($p_2$)       $A_1A_2$ ($p_1p_2$)      $A_2A_2$ ($p_2^2$)       $A_2A_3$ ($p_2p_3$)
$A_3$ ($p_3$)       $A_1A_3$ ($p_1p_3$)      $A_2A_3$ ($p_2p_3$)      $A_3A_3$ ($p_3^2$)

Frequencies in the next generation:
$A_1A_1$: $p_1^2$
$A_1A_2$: $2p_1p_2$
$A_1A_3$: $2p_1p_3$
$A_2A_2$: $p_2^2$
$A_2A_3$: $2p_2p_3$
$A_3A_3$: $p_3^2$

Hardy-Weinberg as a null model

We have seen that under the assumptions of the Hardy-Weinberg model, i.e., a large random-mating population with no selection, the allele frequencies are stable over time. In other words, our model is the equivalent of no evolution. Under this model nothing interesting ever happens: what good is such a model? It might not be interesting, but it is valuable as our NULL MODEL. By changing the assumptions of HW one at a time, through models that explicitly relax the assumption in question, we can investigate the effect of each on evolution. This approach is the basis of the next five topics in population genetics. The table below summarizes some of these assumptions and their effects.

                        Change in frequencies
Agency                  Genotype    Allele    Notes
Linkage                 no          no        Creates disequilibrium among loci
Inbreeding              yes         no        Acts on all loci in the genome; results in loss of heterozygosity
Assortative mating      yes         no        Only acts on the locus subject to assortment, and those loci linked to it
Migration (a)           yes         yes       Depends on migration rate and frequency differences between populations
Mutation                yes         yes       Very, very, very slow
Natural selection       yes         yes       Acts on the locus subject to selection, and those loci linked to it
Genetic drift           yes         yes       Acts on all loci in the genome; results in loss of heterozygosity and loss of alleles

(a) We use the term migration in population genetics to mean movement of genes or individuals among populations, i.e., gene flow. Migration has a different meaning in the discipline of ecology. We do not examine the impact of migration on evolution in this course; a full course in population genetics will cover migration in detail.
Don't worry if the table does not make much sense the first time you look at it. At some point you will look back on it and it will.
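Returning to the three-allele case discussed earlier, the same cross-multiplication extends to any number of alleles: each homozygote has frequency equal to the square of its allele frequency, and each heterozygote has twice the product of its two allele frequencies. The sketch below is an illustration only; the function name and the allele frequencies are hypothetical.

```python
from itertools import combinations_with_replacement


def hw_multiallele_frequencies(allele_freqs):
    """Expected HW genotype frequencies for any number of alleles at one locus.

    allele_freqs maps allele names to frequencies that sum to 1.
    """
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    genotypes = {}
    for a, b in combinations_with_replacement(list(allele_freqs), 2):
        if a == b:
            genotypes[(a, b)] = allele_freqs[a] ** 2                   # homozygote
        else:
            genotypes[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote
    return genotypes


# Hypothetical frequencies chosen only for illustration.
expected = hw_multiallele_frequencies({"A1": 0.5, "A2": 0.3, "A3": 0.2})
for genotype, freq in expected.items():
    print("".join(genotype), round(freq, 3))
print("total:", round(sum(expected.values()), 3))
```

With three alleles there are six genotype classes, and their expected frequencies still sum to 1.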