January 6, 2005 Bio 107/207 Winter 2005 Lecture 2 Measurement of genetic diversity

- in his 1974 book The Genetic Basis of Evolutionary Change, Richard Lewontin likened the field of population genetics to a complex and exquisitely oiled machine.
- this machine had been built over time from the theoretical work of Wright, Fisher, and Haldane.
- the problem with this machine was that it was designed to run on fuel (i.e., data) that no one had succeeded in mining.
- occasionally, some lucky prospector would come across a natural outcrop of high-grade ore and the machinery would be started up to prove to its backers that it really could work.
- however, for most of the time the machine was left to the engineers, forever tinkering and modifying, in hopes of the day that it could be called upon to carry out full production.
- the situation changed overnight in the mid 1960s with the discovery of the technique of protein electrophoresis.
- the mother-lode had been tapped and allozyme data was poured into the hopper of the great machine.
- and what has emerged from the other end?
- unfortunately, not much!
- this was not because the machinery did not work; it did, and the great clashing of gears was clearly audible.
- it was simply that the great machine had failed to work the great mass of data into some finished product.
- instead, it gave rise to the so-called neutralist-selectionist controversy that raged for 20 years before waning as population geneticists began to lose interest in continuously arguing with each other.
- this controversy has yet to be resolved.
- we will return to this issue in a coming class.
- I bring up this analogy today only to point out that for most of its history, the field of population genetics had been limited by data.
- things have progressed remarkably from this point.
- today, there exists a plethora of techniques that can be used to survey and quantify genetic diversity in natural populations.
- the most recent advancements have been made in the development of high-throughput DNA sequencers and microarrays.
- in today's class we will review the different ways in which genetic diversity can be characterized and quantified.
- this will lead into next Tuesday's class when we cover the Hardy-Weinberg-Castle equilibrium principle.

Measures of genetic diversity

1. Visible polymorphisms.
- the first population genetic studies were performed on species with discrete morphological polymorphisms [we will defer treatment of polygenic (i.e., continuous) variation until later].
- included here are the pioneering studies by E.B. Ford on wing color polymorphisms in butterflies and those by Cain and Sheppard on shell polymorphisms (color and banding patterns) in the land snail Cepaea nemoralis.
- these classic studies demonstrated that strong natural selection can be observed in nature.
- however, they do not allow any strong inference about the levels of genetic variation in nature.
- rather, they are interesting case studies with limited generality.

2. Chromosomal variation.
- a large number of studies in the 1950s and 1960s examined the fitness effects of chromosomes extracted from natural populations of Drosophila.
- these experiments confirmed that natural populations harbored a large amount of genetic variation.
- the number of lethal or semilethal mutations per genome has been estimated in different populations of D. melanogaster to be about 1.0 (although quite variable; see Table 1.8 in text).
- this class of variation is too rare, however, to produce the more subtle variation in fitness observed among individuals.
- experiments designed to estimate the numbers of mutations that affect viability (called fitness modifiers) have also demonstrated a large reservoir of deleterious mutations (see Figure 1.19 in text).
- roughly 37% and 55% of 2nd and 3rd chromosomes isolated from natural populations possessed lethal mutations!
- the remainder of chromosomes had broad distributions of relative fitnesses falling between 0.5 and [...].
- in contrast, heterozygote viabilities were unimodal and much less variable.
- this is strong evidence that most mutations in Drosophila that affect fitness are deleterious and recessive.
- the ability to study the fitness effects of chromosomal variation in Drosophila was possible only because it had served as a laboratory model for so many years.
- the creation of balancer chromosomes and the complete absence of crossing over in males made the isolation of chromosomes from natural populations relatively simple (see crossing design in Figure 1.17).
- these techniques cannot be applied to the vast majority of other species.
- the technique that revolutionized and reinvigorated the field of population genetics in the mid 1960s was protein electrophoresis.

3. Allozymes.
- allozymes are polymorphic protein loci.
- the term allozyme was coined by Prakash (1969) as a shortened form of allelic isozyme.
- an isozyme is a protein coding locus that produces a product having a distinct mobility on a gel.
- allozymes thus represent variable forms of isozymes.
- different allozyme alleles have mutations that affect their net charge.
- this causes them to have different electrophoretic mobilities.
- allozymes are separated on a matrix such as a starch gel and visualized by applying a specific histochemical stain for that enzyme.
- although the allozyme era has passed, most of the information we have on levels and patterns of genetic variation for most species is still based on allozyme data.

4. Restriction fragment length polymorphisms (RFLPs)
- prior to the advent of DNA sequencing, DNA polymorphism was determined by digesting DNA with restriction endonucleases and running the digested products out on polyacrylamide or agarose gels.
- alternatively, the digested fragments can be transferred to nylon filters and hybridized with probes that will bind to the fragments of interest.
- the basis of RFLP polymorphism is variation in small 4-6 bp sequences that are recognized by specific restriction endonucleases.
- the majority of RFLP data has been collected for mtDNA; the technique is no longer used now that DNA sequencing has become much easier and cheaper.
- in principle, information on the level of RFLP variation should provide some information on the level of DNA polymorphism in a species.
- however, there is far less precision than originally thought.

5. Mini- and microsatellite variation
- in 1985, the first tandem repeat polymorphisms were studied by Alec Jeffreys and colleagues.
- this class of loci was originally called VNTRs (Variable Number of Tandem Repeats) because the basis of the polymorphism is variation in the number of repeating elements.
- Jeffreys originally scored minisatellites that had repeating elements typically ranging between 16 and 64 bp.
- minisatellite loci were scored by hybridizing the minisatellite core repeat to DNA digested with a restriction endonuclease.
- the resulting DNA fingerprint is a variable number of DNA fragments that differs from individual to individual.
- VNTRs were originally used for paternity analyses and in forensics.
- today, they have been replaced by microsatellites, which have repeating units of typically 2-4 bp.
- microsatellites can be amplified by PCR and run out on automated DNA sequencers to enable very precise determination of allele sizes.

6. DNA sequence variation
- the first study to examine DNA sequence polymorphism directly was that of Kreitman (1983) on the alcohol dehydrogenase (Adh) locus of D. melanogaster.
- the great advantage of DNA sequencing is that it provides the most accurate and comprehensive assessment of genetic polymorphism.
- unlike other methods that attempt to infer the numbers of variable sites (i.e., RFLPs or allozymes), DNA sequences allow mutations to be directly identified.
- a further advantage of DNA sequence information is that it allows different classes of mutation to be studied, most notably silent and replacement mutations.
- comparisons of silent and replacement polymorphisms provide important insights into the mechanisms of evolutionary change, especially natural selection.

Measurement of genetic diversity

Allozyme variation.
- there are two standard measures of allozyme diversity.
- the first is P, the proportion of loci sampled that are polymorphic:
  $$P = x / m$$
  where x is the number of polymorphic loci in a sample of m loci.
- the second is mean heterozygosity, Hbar.
- this is calculated as follows.
- suppose we sample a locus and find two alleles present at frequencies of 0.4 and 0.6.
- let $p_1 = 0.4$ and $p_2 = 0.6$.
- if we assume Hardy-Weinberg equilibrium, then the two homozygotes will be present at the locus at frequencies of $p_1^2 = 0.16$ and $p_2^2 = 0.36$.
- summing the two together, we expect that 52% of genotypes at the locus will be homozygotes and the remainder (48%) will be heterozygotes.
- in general, the expected heterozygosity is
  $$H_E = 1 - \sum_{i=1}^{n} p_i^2$$
- this is simply 1 minus the frequency of homozygotes and can accommodate any number of alleles (n).
- an unbiased estimate of expected heterozygosity is
  $$H_E = \frac{2N}{2N-1}\left(1 - \sum_{i=1}^{n} p_i^2\right)$$
  where N is the number of individuals sampled.
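The heterozygosity calculation above can be sketched in a few lines of Python. This is only an illustration, not part of the original lecture: the function names are made up for this example, and the sample size of 25 individuals in the usage lines is a hypothetical value chosen to show the small-sample correction.

```python
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """H_E = 1 - sum(p_i^2): one minus the expected homozygote frequency."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def unbiased_heterozygosity(freqs: Sequence[float], n_individuals: int) -> float:
    """Apply the 2N / (2N - 1) small-sample correction to H_E."""
    two_n = 2 * n_individuals
    return two_n / (two_n - 1) * expected_heterozygosity(freqs)


# Two-allele example from the notes: p1 = 0.4, p2 = 0.6
h_e = expected_heterozygosity([0.4, 0.6])
print(f"{h_e:.2f}")  # 0.48

# Hypothetical sample of N = 25 individuals (an assumption, not from the notes)
h_unbiased = unbiased_heterozygosity([0.4, 0.6], 25)
print(f"{h_unbiased:.4f}")  # 0.4898
```

The correction factor approaches 1 as N grows, so it matters mainly for small samples.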

- mean heterozygosity, Hbar, is simply the average over all loci scored in the sample (including monomorphic ones!).
- another way to estimate mean heterozygosity for any class of genetic markers is by creating a matrix listing the genotypes for all individuals at all loci (see Table 2.17).
- each individual is assigned a 0 if it is a homozygote and a 1 if it is a heterozygote.
- an estimate of the mean heterozygosity of the population, for N individuals scored at m loci, is thus:
  $$H = \frac{1}{Nm} \sum_{i=1}^{N} \sum_{j=1}^{m} H_{ij}$$
- a measure of the sampling variance can be obtained:
  $$V(H) = \frac{H(1 - H)}{Nm}$$
- it is important to note that this variance has two components: that caused by variation in heterozygosity among individuals and that caused by variation in heterozygosity among loci.
- the variance in heterozygosity among loci turns out to be much greater than the variance among individuals.
- therefore, if one wants the most accurate measures of genetic diversity in nature it is better to increase the numbers of loci scored, not the numbers of individuals!
- Nei (1977) showed that one typically needs [...] loci!
- back in the heyday of allozymes it was not uncommon to see studies published with 30 or more electrophoretic loci scored.
- nowadays, it is rare to see a microsatellite study with this number of loci.
- a further problem is that many researchers today bias their choice of microsatellites.
- many thousands of species have been studied for their level of allozyme variation.
- the highest levels of allozyme variation have been observed in some marine invertebrates and plants; here P is roughly 50% and mean heterozygosity is about 20%.
- if these genes are a representative sample of the genome, then this suggests that these species possess a staggering amount of genetic polymorphism.
- the lowest levels of allozyme polymorphism have been observed in large mammals.
- Nevo (1984) reviewed some of this enormous literature and found that for a sample of 1042 species, P = [...] and H = [...].
- the levels of allozyme diversity are, however, extremely variable within groups as well as between groups.
- this is most notable in Drosophila, where H varies by a factor of three among different species.

DNA sequence variation

1. The proportion of polymorphic sites
- the simplest way to measure the amount of DNA sequence variation in a sample is to quantify the proportion of nucleotide positions that are polymorphic.

- if $n_t$ is the total number of base pairs in the region examined and $n_p$ is the number of polymorphic positions then the proportion of polymorphic nucleotide sites is estimated by:
  $$P_n = n_p / n_t$$

2. Nucleotide diversity
- a second measure is called nucleotide diversity, or π.
- nucleotide diversity is the average proportion of nucleotides that differ between any randomly sampled pair of sequences.
- nucleotide diversity uses information about the extent of differentiation between sequences as well as the relative frequencies of the sequences in the sample.
- it is calculated by the following equation
  $$\pi = \sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_j \pi_{ij}$$
  where $p_i$ is the frequency of sequence i, $p_j$ is the frequency of sequence j, and $\pi_{ij}$ is the proportion of nucleotides that differ between sequences i and j.
- an unbiased estimate of π is given by
  $$\pi = \frac{N}{N-1} \sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_j \pi_{ij}$$

3. The number of segregating sites, θ (theta)
- this is a measure of nucleotide polymorphism expected under a specific model of mutation known as the infinite-sites model (from Kimura 1969).
- the infinite-sites model assumes that the number of nucleotide sites is large enough that each new mutation occurs at a site that has not mutated before.
- it further assumes that the population has reached an equilibrium between the processes of mutation and random genetic drift (which we will be discussing in depth in coming classes).
- it also assumes that the mutations are neutral.
- under the infinite-sites model, the expected number of sites segregating for different nucleotides (S) is
  $$E(S) = a_1 \theta$$

  where $a_1 = \sum_{i=1}^{n-1} 1/i$ in a sample of n alleles.
- this equation can be rearranged to give an estimate of theta:
  $$\theta_S = S / a_1$$
- theta and π are two measures of DNA polymorphism that are expected to be equal according to the neutral theory of molecular evolution.
- nucleotide diversity is similar to a classic measure of heterozygosity and is not greatly influenced by rare alleles.
- in contrast, theta counts all segregating sites equally and can be strongly influenced by rare alleles.

Estimating π and θ from DNA sequence data
- suppose we collected a sample of 5 banana slugs from the woods outside the EMS building.
- we proceed to sequence a 500 bp region of the mitochondrial COI gene and observe 5 segregating sites in four distinct haplotypes:

                N    Segregating sites
  Haplotype 1   2    T  G  T  C  T
  Haplotype 2   1    T  A  T  T  A
  Haplotype 3   1    C  G  T  C  T
  Haplotype 4   1    C  G  G  C  T

1. The proportion of polymorphic sites
- the proportion of polymorphic sites in the sample is
  $$P_n = n_p / n_t = 5/500 = 0.010$$

2. Nucleotide diversity
- nucleotide diversity can be estimated after determining the proportion of nucleotide differences between all possible pairs of haplotypes:

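  Haplotype     1        2        3        4
  1             -
  2           0.006      -
  3           0.002    0.008      -
  4           0.004    0.010    0.002      -

- we then need to weight the pairwise differences (i.e., the π_ij values) by the frequencies of the haplotypes:

  i   j   p_i   p_j   π_ij    p_i p_j π_ij
  1   2   0.4   0.2   0.006   0.00048
  1   3   0.4   0.2   0.002   0.00016
  1   4   0.4   0.2   0.004   0.00032
  2   3   0.2   0.2   0.008   0.00032
  2   4   0.2   0.2   0.010   0.00040
  3   4   0.2   0.2   0.002   0.00008
                      SUM =   0.00176

- each pair of haplotypes appears twice in the double sum (once as i, j and once as j, i), so the double sum equals 2 × 0.00176 = 0.00352.
- therefore,
  $$\pi = \frac{N}{N-1} \sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_j \pi_{ij}$$
  $$\pi = \frac{5}{4}(0.00352) = 0.0044$$

3. Theta
- there are five segregating sites in the sample.
- therefore, S = 5/500 = 0.010
  $$a_1 = \sum_{i=1}^{n-1} 1/i$$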

  $$a_1 = 1/1 + 1/2 + 1/3 + 1/4$$
  $$a_1 = 2.083$$
- therefore
  $$\theta_S = S / a_1 = 0.010 / 2.083 = 0.0048$$
- notice that our two estimates of nucleotide diversity are similar.
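The banana slug example can be checked with a short Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the names `HAPLOTYPES`, `SEQ_LENGTH`, and `diversity_summary` are invented for this example, each haplotype is represented only by its five variable positions, and the remaining 495 sites are assumed to be identical in all sequences.

```python
from itertools import combinations

SEQ_LENGTH = 500  # bp sequenced per individual

# Haplotype (variable sites only) -> number of individuals carrying it
HAPLOTYPES = {"TGTCT": 2, "TATTA": 1, "CGTCT": 1, "CGGCT": 1}


def diversity_summary(haplotypes, seq_length):
    """Return (P_n, pi, theta_S) for a sample of haplotype counts."""
    n = sum(haplotypes.values())            # number of sequences sampled
    n_columns = len(next(iter(haplotypes)))

    # A site is segregating if more than one nucleotide occurs at it.
    seg_sites = sum(
        1 for k in range(n_columns) if len({h[k] for h in haplotypes}) > 1
    )
    p_n = seg_sites / seq_length

    # Double sum of p_i * p_j * pi_ij; each unordered pair counts twice.
    weighted = 0.0
    for (hap_i, count_i), (hap_j, count_j) in combinations(haplotypes.items(), 2):
        diffs = sum(a != b for a, b in zip(hap_i, hap_j))
        pi_ij = diffs / seq_length
        weighted += 2 * (count_i / n) * (count_j / n) * pi_ij
    pi = n / (n - 1) * weighted             # unbiased estimate

    # theta_S = S / a_1, with a_1 = sum of 1/i for i = 1 .. n-1
    a_1 = sum(1 / i for i in range(1, n))
    theta_s = p_n / a_1
    return p_n, pi, theta_s


p_n, pi, theta_s = diversity_summary(HAPLOTYPES, SEQ_LENGTH)
print(f"{p_n:.4f} {pi:.4f} {theta_s:.4f}")  # 0.0100 0.0044 0.0048
```

As in the notes, S is expressed per site (5/500), so π and θ_S are on the same per-nucleotide scale and can be compared directly.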