Measurement of Molecular Genetic Variation

Genetic variation is the necessary prerequisite for all evolution and for studying all the major problem areas in molecular evolution. How we score and measure genetic variation determines the questions we can address and even our fundamental view of evolution.

Forces Creating Genetic Variation
- Mutation
- Transposition
- Recombination and gene conversion

Mutation: Nucleotide Substitutions
[Figure: classes of nucleotide substitution; the only surviving label reads "Replacement, or Nonsynonymous, or"]
Mutation: Nucleotide Substitutions
β-thalassaemia is a genetic disease characterized by a lowered rate of production of β-globin protein. Many thalassaemia alleles are due to nucleotide substitutions in non-coding regions of the β-globin gene, including the 5′ and 3′ control areas and intron splice sites, that affect either transcription or mRNA processing.

Mutation: Nucleotide Substitutions
Not all silent mutations are phenotypically silent.

GGT GGT GAG G..
GGA GGT GAG G..

Both GGT and GGA code for glycine, but GGAGGTGAGG is a splice site sequence, so the GGA variant is spliced abnormally.

Mutation: Nucleotide Insertions & Deletions
Many thalassaemia alleles are due to insertions and deletions, some of which are frameshift mutations when they occur in the coding region.
Transposition
Many insertions and deletions are due to transposons, with the phenotypic effects ranging from undetectable to drastic, depending upon the location of the site of insertion or excision.
[Figure: various types of repetitive elements in the human gene encoding homogentisate 1,2-dioxygenase]

Recombination/Gene Conversion
Recombination and gene conversion can create much genetic variation by producing new combinations of genetic variants at two or more sites.

Genetic Variation At Different Biological Levels (All Subject To Study By Molecular Evolutionists)
- Genome
- Individual
- Local population
- Among local populations within a species
- Between species OR pairs of molecules
Variation Within A Genome
The ribosomal DNA in Drosophila mercatorum exists as a tandem family of about 200 repeats on the X chromosome and about 80 on the Y chromosome.

Variation Within An Individual
In diploid individuals, any homologous DNA region can be either heterozygous or homozygous. One type of variation frequently scored at the individual level is the SNP (single nucleotide polymorphism), because SNP scoring can be automated. There are about 6 million SNPs available for humans.

Variation Within An Individual
[Figure: underreplication in polytene tissues in Drosophila]
There is also somatic variation, often in copy number, and individuals can also be heterozygous for variation in this copy number (e.g., abnormal abdomen in D. mercatorum). Somatic cells can also differ due to somatic mutation, which can play an important role in individual phenotype (e.g., cancer) and in evolution (e.g., plants).
Variation Among Individuals Within A Local Population
It is at this level that the number of ways of measuring diversity increases substantially. Variation can be measured at both the genotypic level and the gametic level.

Genotypic Variation Within A Local Population (Single Locus or Single Site)
- The number of genotypes
- Genotype frequency = (no. of individuals with a specific genotype) / (total number of individuals sampled)
- Observed heterozygosity = (no. of individuals heterozygous at a locus) / (total number of individuals sampled)
The above can be averaged over several loci (but note that it is an average of single-locus measures and not a true multilocus statistic).

Gametic Variation Within A Local Population (Single Locus or Single Site)
- Number of alleles or haplotypes in the gene pool = $n_a$
- Allele or haplotype frequency = $p$ = (no. of DNA copies with a specific allele) / (total number of DNA copies sampled)
- Percent polymorphism = percent of loci with $p_{\text{most common}} \le 0.95$
- Expected heterozygosity (note: this is not a genotypic measure)
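The single-locus measures listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy sample, and the representation of a diploid genotype as a pair of allele labels are assumptions made for the example, and the 0.95 polymorphism threshold is applied as "most common allele frequency at or below 0.95".

```python
from collections import Counter

def genotype_summary(genotypes):
    """genotypes: list of (allele, allele) pairs, one pair per diploid individual."""
    n_individuals = len(genotypes)
    # Treat (A, G) and (G, A) as the same unordered genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    genotype_freqs = {g: c / n_individuals for g, c in counts.items()}
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    observed_het = sum(1 for a, b in genotypes if a != b) / n_individuals
    return genotype_freqs, observed_het

def allele_frequencies(genotypes):
    """Allele frequencies in the gene pool: count DNA copies, not individuals."""
    copies = [allele for g in genotypes for allele in g]
    counts = Counter(copies)
    return {allele: c / len(copies) for allele, c in counts.items()}

def is_polymorphic(allele_freqs, threshold=0.95):
    """A locus counts as polymorphic if its most common allele is at or below the threshold."""
    return max(allele_freqs.values()) <= threshold

# Hypothetical sample of five diploid individuals scored at one SNP.
sample = [("A", "A"), ("A", "G"), ("A", "G"), ("G", "G"), ("A", "A")]
geno_freqs, h_obs = genotype_summary(sample)
p = allele_frequencies(sample)
```

Note that the genotype frequencies divide by the number of individuals, whereas the allele frequencies divide by the number of DNA copies (twice the number of diploid individuals).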
Expected Heterozygosity
A gene pool is the population of gene copies that are collectively shared by the individuals of a deme (local population), OR a gene pool is the population of potential gametes that can be produced by the individuals of a deme. In either case, the gene pool consists of haploid genetic elements.

Expected Heterozygosity
If $n$ is the number of distinct alleles or haplotypes and $p_i$ is the frequency of the $i$th allele or haplotype in the gene pool, then

$$H = 1 - \sum_{i=1}^{n} p_i^2$$

If several loci are scored, then the average expected heterozygosity is

$$\bar{H} = \frac{1}{n_L} \sum_{j=1}^{n_L} H_j$$

where $H_j$ is the expected heterozygosity at locus $j$ and $n_L$ is the number of loci scored.

Expected Heterozygosity
At the nucleotide level, for DNA sequence data:
1. If you have SNPs, simply use the previous equations.
2. If you have haplotypes, then use

$$\pi = \sum_{i} \sum_{j} p_i \, p_j \, \pi_{ij}$$

where $p_i$ is the frequency of haplotype $i$ in the gene pool, and $\pi_{ij}$ is the number of nucleotide differences between haplotypes $i$ and $j$.

Note, recall from classical neutral theory that: [equation not recovered from the source]
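As a minimal sketch of the expected heterozygosity calculations, the code below assumes the standard forms $H = 1 - \sum_i p_i^2$ for a single locus and $\pi = \sum_i \sum_j p_i p_j \pi_{ij}$ for haplotypes, with $\pi_{ii} = 0$. The function names and the toy frequencies are hypothetical.

```python
def expected_heterozygosity(freqs):
    """H = 1 - sum(p_i^2) for the allele (or haplotype) frequencies at one locus."""
    return 1.0 - sum(p * p for p in freqs)

def mean_expected_heterozygosity(loci):
    """Average of the single-locus H values; `loci` is a list of frequency lists."""
    return sum(expected_heterozygosity(f) for f in loci) / len(loci)

def nucleotide_differences(seq_a, seq_b):
    """Number of sites at which two aligned haplotypes differ (pi_ij)."""
    return sum(x != y for x, y in zip(seq_a, seq_b))

def nucleotide_diversity(haplotype_freqs):
    """pi = sum_i sum_j p_i * p_j * pi_ij, for a dict mapping haplotype -> frequency."""
    haps = list(haplotype_freqs)
    return sum(
        haplotype_freqs[a] * haplotype_freqs[b] * nucleotide_differences(a, b)
        for a in haps
        for b in haps
    )

# Toy data: one locus with two alleles, and three two-site haplotypes.
h_single = expected_heterozygosity([0.6, 0.4])
h_mean = mean_expected_heterozygosity([[0.6, 0.4], [0.5, 0.5]])
pi = nucleotide_diversity({"AG": 0.5, "CG": 0.25, "CT": 0.25})
```

The double sum runs over ordered pairs, so each unordered pair of distinct haplotypes is counted twice, which is what makes the result the expected number of differences between two copies drawn at random from the gene pool.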
Expected Heterozygosity
At the nucleotide level, for DNA sequence data, another commonly encountered equation is

$$\theta = \frac{V}{L \sum_{i=1}^{n-1} \frac{1}{i}}$$

where $V$ is the number of variable sites in the sample, $L$ is the length in nucleotides of the DNA region surveyed, and $n$ is the number of DNA molecules in the sample. This equation assumes the INFINITE SITES MODEL.

Many statistics used in molecular evolution are based upon the infinite sites model, in which each mutation occurs at a new nucleotide site. This model allows for no multiple mutational hits at a single site. When dealing with a region of DNA in which mutational hotspots exist, the statistics based upon the infinite sites model may be very misleading. Despite this danger, such statistics are still widely used, and few investigators test this assumption.

Sensitivity To The Infinite Sites Assumption
Palsboll et al. (Evolution 58: 670-675, 2004) looked at mtDNA in fin whales to estimate θ and the gene flow parameter M = Nm: "our results clearly show that one can arrive at radically different conclusions if applying the wrong mutation model during estimation."
[Figure panels:
- Posterior distribution of θ under the infinite sites model
- Posterior distribution of θ under a fitted finite sites model
- Posterior distribution of M under the infinite sites model (implied either no gene flow or highly restricted gene flow and significant subdivision)
- Posterior distribution of M under a fitted finite sites model (implied extensive gene flow and near panmixia)]
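The estimator based on V, L, and n can be sketched as follows, assuming it takes the usual per-site form $\theta = V / (L \sum_{i=1}^{n-1} 1/i)$. The function name and the numbers in the example are hypothetical, chosen only to make the arithmetic easy to check.

```python
def theta_from_variable_sites(V, L, n):
    """Per-site theta from the number of variable sites, under the infinite sites model.

    V: number of variable sites in the sample
    L: length of the surveyed region in nucleotides
    n: number of DNA molecules sampled
    """
    if n < 2 or L <= 0:
        raise ValueError("need at least two molecules and a positive length")
    harmonic = sum(1.0 / i for i in range(1, n))  # 1 + 1/2 + ... + 1/(n-1)
    return V / (L * harmonic)

# Hypothetical survey: 10 variable sites in 1,000 bp scored on 5 molecules.
theta = theta_from_variable_sites(V=10, L=1000, n=5)
```

Because the estimate depends only on the count of variable sites, a site hit by several mutations still contributes one to V, which is one reason the statistic is sensitive to violations of the infinite sites assumption.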
Sensitivity To The Infinite Sites Assumption
The Four Gamete Test For Detecting Recombination

With recombination:
- Start with genetic variation at only one site (two gamete types): AG, AG, CG, CG
- Mutation at a second site produces three gamete types: AG, AG, CG, CT
- Recombination produces four gamete types: AG, AT, CG, CT

Without recombination:
- Start with genetic variation at only one site (two gamete types): AG, AG, CG, CG
- Mutation at a second site produces three gamete types: AG, AG, CG, CT
- A second mutation at that site produces four gamete types: AT, AG, CG, CT
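A minimal sketch of the four gamete test is given below. It assumes phased haplotypes supplied as equal-length strings with two alleles per site; the function name is hypothetical. A pair of sites is flagged when all four two-site gamete types are present in the sample.

```python
from itertools import combinations

def four_gamete_pairs(haplotypes):
    """Return the pairs of sites (i, j) at which all four gamete types are observed."""
    n_sites = len(haplotypes[0])
    flagged = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        alleles_i = {g[0] for g in gametes}
        alleles_j = {g[1] for g in gametes}
        # Two alleles at each site and all four combinations present.
        if len(alleles_i) == 2 and len(alleles_j) == 2 and len(gametes) == 4:
            flagged.append((i, j))
    return flagged

three_types = ["AG", "AG", "CG", "CT"]          # one mutation at each site
after_recombination = ["AG", "AT", "CG", "CT"]  # recombinant AT gamete present
after_second_hit = ["AT", "AG", "CG", "CT"]     # AT produced by a repeat mutation

flag_three = four_gamete_pairs(three_types)
flag_recomb = four_gamete_pairs(after_recombination)
flag_repeat = four_gamete_pairs(after_second_hit)
```

The last two samples contain the same set of gametes, so the test flags both. It cannot tell recombination from a second mutation at the same site, which is why its interpretation rests on the infinite sites model.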
Genetic Survey of Lipoprotein Lipase
- LPL has 10 exons over 30 kb of DNA on chromosome 8p22
- Sequenced 9,734 bp from the 3′ end of intron 3 to the 5′ end of intron 9
- Sequenced:
  24 individuals from North Karelia, Finland (world's highest frequency of CAD)
  23 European-Americans from Rochester, Minnesota
  24 African-Americans from Jackson, Mississippi
- Found 88 variable sites
- Ignored singleton and doubleton sites and variation due to a tetranucleotide repeat, but phased the remaining 69 polymorphic sites by a combination of allele-specific primer pairs and haplotype subtraction
- The phased site data identified 88 distinct haplotypes

Clark et al. (Am. J. Human Genet. 63:595-612, 1998) applied the 4-gamete test to the LPL region and inferred extensive recombination uniformly distributed throughout the LPL region.
[Figure: the four gamete types AG, AT, CG, CT]

But does this region satisfy the infinite sites model?

Mutagenesis via 5-Methylcytosine
ANALYSIS OF HIGHLY MUTABLE SITES IN LPL

TYPE OF SITE                                NUMBER OF     NUMBER        % POLY. PER
                                            NUCLEOTIDES   POLYMORPHIC   NUCLEOTIDE
CpG                                         198           19            9.6%
Mononucleotide runs ≥ 5                     456           15            3.3%
Polymerase α arrest site ± 3 nucleotides    264            8            3.0%
  [TG(A/G)(A/G)GA]
All other nucleotides                       8,866         46            0.5%

ln-likelihood ratio test of homogeneity = 99.8, 3 df, $p = 1.75 \times 10^{-7}$
ln-likelihood ratio test of homogeneity within the three mutable classes = 12.3, 2 df, $p = 0.002$

Templeton et al. (Am. J. Human Genet. 66:69-83, 2000) used a test that did not assume the infinite sites model and inferred much less recombination, concentrated in the 6th intron of the LPL region.
[Figure: x-axis labelled "Nucleotide position"]

The four gamete test was applied to human mtDNA (a molecule with no recombination) and identified 413 recombination events uniformly distributed across the molecule!
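The two ln-likelihood ratio statistics can be checked directly from the counts in the table. The sketch below assumes the test is the ordinary G-test of homogeneity on a table of polymorphic versus non-polymorphic nucleotides per site class; the function name is hypothetical. The 2-df p-value uses the fact that a chi-square variable with 2 degrees of freedom has upper tail probability exp(-x/2).

```python
import math

def g_statistic(rows):
    """G = 2 * sum(O * ln(O / E)) for a classes-by-2 table.

    rows: list of (number_polymorphic, number_of_nucleotides), one per site class.
    """
    total_poly = sum(poly for poly, n in rows)
    total_n = sum(n for poly, n in rows)
    g = 0.0
    for poly, n in rows:
        cells = ((poly, total_poly), (n - poly, total_n - total_poly))
        for observed, column_total in cells:
            expected = n * column_total / total_n
            if observed > 0:
                g += observed * math.log(observed / expected)
    return 2.0 * g

# Counts taken from the LPL table: (polymorphic, nucleotides) per class.
lpl = [(19, 198), (15, 456), (8, 264), (46, 8866)]

g_all = g_statistic(lpl)          # all four classes, 3 df
g_mutable = g_statistic(lpl[:3])  # the three mutable classes only, 2 df
p_mutable = math.exp(-g_mutable / 2.0)  # chi-square upper tail, valid for 2 df only
```

Under these assumptions the statistics come out close to the reported 99.8 and 12.3, and the 2-df p-value is close to 0.002.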
Large sections of chromosome 19 have been sequenced and can be used to study how often deviations from the infinite sites model occur.

An Analysis For The Mutagenic Effects of CpG Dinucleotides in the APOS Chromosome Block and for Heterogeneity Within The Block

Must look at your data before analysis!

Prob. homogeneity across blocks:
  CG, C→T      $1.7 \times 10^{-6}$
  CG, other    0.51
  Non-CG       $5.5 \times 10^{-3}$

CG dinucleotides accounted for 4.7% of the nucleotides and 40% of the polymorphic sites.

Prob. homogeneity: 0.076, 0.005, $3.1 \times 10^{-4}$, $1.4 \times 10^{-7}$, $5.5 \times 10^{-6}$, $1.7 \times 10^{-17}$, $1.4 \times 10^{-9}$, 0.033, $1.8 \times 10^{-7}$, $6.6 \times 10^{-14}$

No block satisfied the infinite sites model, but the degree of deviation varied significantly from block to block.

Several tools are available to help you look at your data, e.g., Posada's ModelTest at http://darwin.uvigo.es/
Variation Among Populations Within A Species
$n_s$ = number of subpopulations, with $N_s$ being the size of subpopulation $s$
$n_L$ = number of loci surveyed
$p_{ijs}$ = frequency of allele (haplotype) $i$ at locus $j$ in subpopulation $s$
$p_{ij}$ = frequency of allele (haplotype) $i$ at locus $j$ in the total population

Variation Among Populations Within A Species
$F_{ST}$ measures how genetic variation is distributed within and among subpopulations on a 0-1 scale:
- $F_{ST} = 0$: all demes have identical gene pools; all variation is shared equally throughout the species
- $F_{ST} = 1$: no variation within demes; all variation exists as differences between the demes' gene pools
Wright quantified the balance of gene flow to drift, as measured by $F_{st}$, for the island model.

Impact of Drift and Gene Flow On Average F In The Island Model
Effect of drift alone on the probability that two randomly chosen genes are identical by descent (I.B.D.):

$$F(t) = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right) F(t-1)$$

With gene flow, two genes can only be I.B.D. if they are from the same deme (by assumption), so:

$$F(t) = \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right) F(t-1)\right] (1-m)^2$$

At equilibrium, $F(t) = F(t-1) = F$, and the above equation yields:

$$F = F_{st} \approx \frac{1}{4Nm + 1}$$

The product $Nm$ in this equation is the $M$ seen before.

The equation $F \approx \frac{1}{4Nm+1}$ is NOT the universal relationship between gene flow and genetic drift, as often presented. For example, the one-dimensional stepping stone model (isolation by distance) gives a different relationship.
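The island model result can be checked numerically by iterating the recursion with gene flow until it stops changing and comparing the limit with 1/(4Nm + 1). This is a sketch under the stated model only; the function names, the convergence tolerance, and the parameter values are assumptions for illustration.

```python
def island_model_equilibrium(N, m, max_generations=100_000, tol=1e-12):
    """Iterate F(t) = [1/(2N) + (1 - 1/(2N)) F(t-1)] (1 - m)^2 from F(0) = 0."""
    F = 0.0
    for _ in range(max_generations):
        F_next = (1.0 / (2 * N) + (1.0 - 1.0 / (2 * N)) * F) * (1.0 - m) ** 2
        if abs(F_next - F) < tol:
            return F_next
        F = F_next
    return F

def island_model_approximation(N, m):
    """Equilibrium approximation F ~ 1 / (4Nm + 1), valid for small m and large N."""
    return 1.0 / (4 * N * m + 1.0)

# Hypothetical deme size and migration rate (Nm = 1, one migrant per generation).
F_iterated = island_model_equilibrium(N=100, m=0.01)
F_approx = island_model_approximation(N=100, m=0.01)
```

With these values the iterated equilibrium is about 0.198 and the approximation gives 0.2. The small gap comes from the terms in $m^2$ and $m/N$ that the approximation drops.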
Impact of Drift and Gene Flow On $f_{st}$ In The Stepping Stone Model

$$f_{st} \approx \frac{1}{1 + 4N_{ev}\sqrt{2 m_1 m_\infty}} \quad \text{when } m_1 \gg m_\infty$$

Here $m_1$ is gene flow between adjacent demes and $m_\infty$ is long-distance gene flow. Because the two migration parameters appear as the product $m_1 m_\infty$, even small amounts of long-distance gene flow have a major impact on $f_{st}$. The reason is that the evolutionary impact of gene flow depends both on the amount of gene flow and on the difference in allele frequency. The farther the distance, the greater the difference in allele frequency in general, so long-distance dispersal has a disproportionate evolutionary impact.

$F_{ST}$ can also be applied to a pair of demes as a measure of the genetic distance between their gene pools. Note: this genetic distance, as measured by $F_{ST}$, is a function only of allele frequency differences in the gene pools of the two demes being compared.
[Figure: human populations; theoretical expectations under Wright's isolation by distance model]

Nei created another genetic distance between populations, based on allele frequency differences, that is widely used. Genetic identity between two populations, X and Y:

$$I_{XY} = \frac{J_{XY}}{\sqrt{J_X J_Y}}, \qquad J_{XY} = \sum_i p_{iX}\, p_{iY}, \quad J_X = \sum_i p_{iX}^2, \quad J_Y = \sum_i p_{iY}^2$$

$I_{XY}$ ranges on a 0 to 1 scale, with 1 meaning that both populations share the same alleles with the same frequencies, and 0 meaning that the gene pools share no alleles in common. Nei converted this into a genetic distance on a 0 to ∞ scale by a log transformation:

$$D_{XY} = -\ln I_{XY}$$
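Nei's identity and distance can be sketched for a single locus as follows, assuming the standard forms $I_{XY} = J_{XY} / \sqrt{J_X J_Y}$ and $D_{XY} = -\ln I_{XY}$, where $J_{XY} = \sum_i p_{iX} p_{iY}$ and $J_X$, $J_Y$ are the expected homozygosities of the two gene pools. The function names and the allele frequencies are hypothetical.

```python
import math

def nei_identity(freqs_x, freqs_y):
    """Single-locus genetic identity; inputs are dicts mapping allele -> frequency."""
    alleles = set(freqs_x) | set(freqs_y)
    j_xy = sum(freqs_x.get(a, 0.0) * freqs_y.get(a, 0.0) for a in alleles)
    j_x = sum(p * p for p in freqs_x.values())
    j_y = sum(p * p for p in freqs_y.values())
    return j_xy / math.sqrt(j_x * j_y)

def nei_distance(freqs_x, freqs_y):
    """D = -ln(I); infinite when the two gene pools share no alleles."""
    identity = nei_identity(freqs_x, freqs_y)
    return math.inf if identity == 0 else -math.log(identity)

# Hypothetical gene pools.
pop_x = {"A": 0.7, "G": 0.3}
pop_y = {"A": 0.2, "G": 0.8}
pop_z = {"C": 1.0}

d_same = nei_distance(pop_x, pop_x)  # identical gene pools
d_xy = nei_distance(pop_x, pop_y)    # shared alleles, different frequencies
d_xz = nei_distance(pop_x, pop_z)    # no alleles in common
```

Everything here is computed from allele frequencies alone. Any force that shifts those frequencies changes the distance, whether or not a mutation has occurred.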
Variation Among Isolates And Species
There is no gene flow among isolates and species, so their long-term divergence is dominated by mutation. Genetic distances are the most common measures used to quantify this divergence.

Distance estimates attempt to estimate the mean number of mutational changes per site since two species (or isolates, or sequences) split from each other. Simply counting the number of differences (p distance) may underestimate the amount of change, especially if the sequences are very dissimilar, because of multiple hits. We therefore use a model that includes parameters reflecting how we think the sequences may have evolved.

The Bogus Proof Of Nei's Genetic Distance As Applied To Species
Let $t$ = the time since the split (cessation of gene flow), and assume a constant mutation rate of $\alpha$. Then Nei "showed":

$$J_{XY}(t) = J_{XY}(0)\, e^{-\alpha t}\, e^{-\alpha t} = J_{XY}(0)\, e^{-2\alpha t}$$

where $J_{XY}(0)$ is the expected homozygosity at the time of the split, the first $e^{-\alpha t}$ is the probability of no mutation in lineage X by time $t$, and the second is the probability of no mutation in lineage Y by time $t$.

Problems: Nei treats his expected homozygosity as a constant over time until a mutation occurs in one lineage, thereby destroying it completely. But $J_{XY}$ is exclusively a function of allele frequencies, and any evolutionary force that alters allele frequencies will cause $J_{XY}$ to change with time. Thus, $J$ can change even in the absence of mutation. Moreover, mutation is modeled as reducing $J$ to 0 instantly, which is incompatible with the definition of $J$ in terms of allele frequencies.

The Bogus Proof Of Nei's Genetic Distance As Applied To Species
Nei next assumes that $J_X$ and $J_Y$ are constants over time (realistic?). Then Nei "showed":

$$I_{XY}(t) = \frac{J_{XY}(t)}{\sqrt{J_X J_Y}} = I_{XY}(0)\, e^{-2\alpha t}, \qquad D_{XY}(t) = -\ln I_{XY}(t) = D_{XY}(0) + 2\alpha t$$

Nei next assumes that $D_{XY}(0) = 0$ (not valid when one or more of the isolates was established by a founder event, or if an ancestral population with $F_{ST} > 0$ is fragmented):

$$D_{XY}(t) = 2\alpha t$$
The Bogus Proof Of Nei's Genetic Distance As Applied To Species
The assumptions of Nei's proof of a linear genetic distance for species are not appropriate for a genetic distance between populations based on allele frequencies; rather, they are more appropriate for the divergence of two DNA lineages from a common ancestral molecule. Nei was trying to force his genetic distance to have properties similar to those Kimura calculated for DNA molecules under neutrality.

Nei's bogus justification of his distance through mutational accumulation has caused, and continues to cause, great confusion between population genetic distance (defined in terms of allele or haplotype frequencies) and molecule genetic distance. This distinction is critical, but it is often not made in the literature, so BEWARE!

Measurement of Molecular Genetic Variation
We now turn our attention to molecules, but to do so, we must first look at coalescent theory.