Why do we need statistics to study genetics and evolution?


1. Mapping traits to the genome [Linkage maps (incl. QTLs), LOD]
2. Quantifying genetic basis of complex traits [Concordance, heritability]
3. Testing hypotheses/significance [linkage, LD, HWE]

I. BASIC CONCEPTS OF STATISTICS

Continuous distributions
• Important parameters: mean and standard deviation/variance (measures of spread)
  o Variance = mean squared distance of datapoints from the mean
  o Standard deviation ($\sigma$) undoes the squaring in variance ($\sigma^2$), i.e., s.d. $= \sqrt{\text{var}}$
  o Same concept of variance that is operative in discussing heritability!
• One important example: the normal distribution
  o Special relationship: about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively
  o Application: continuous traits! (central limit theorem)
• [e.g. 1] Suppose height in a population of pea plants is normally distributed with a mean of 10 inches and a variance of 9 square inches. What percentage of the population falls between 7 inches and 16 inches?

Correlation
Two related concepts:
• Slope of the regression line, b
  o Directness of association of two variables
  o How much does changing one variable affect the other?
• Correlation coefficient, r
  o Precision of association of two variables (tightness of cluster)
  o How much noise is there in the association?
• For both:
  o Sign indicates direct/inverse relationship
  o Magnitude indicates degree
• [e.g. 3]
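Returning to example 1 above (pea plant heights), here is a minimal sketch of how the requested fraction could be computed. It assumes only the numbers given in the example; the helper name `normal_cdf` is illustrative and not part of the handout.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and s.d."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean = 10.0            # inches
sd = math.sqrt(9.0)    # variance of 9 square inches -> s.d. of 3 inches

# 7 inches is one s.d. below the mean; 16 inches is two s.d. above it.
fraction = normal_cdf(16.0, mean, sd) - normal_cdf(7.0, mean, sd)
print(f"Fraction between 7 and 16 inches: {fraction:.4f}")
```

The result is roughly 82%, which agrees with the hand estimate from the 68-95-99.7 rule: about 34% of values lie between the mean and one s.d. below it, and about 47.5% lie between the mean and two s.d. above it.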

Hypothesis-Testing
• Null hypothesis has two parts:
  o Substantive (what values do we expect if nothing interesting is happening?)
  o Formal (how much deviation from expected values do we allow?)
  o Some exemplars:
    - $H_0$: Alleles at locus A and locus B assort independently; thus any deviation from a 1:1:1:1 gametic ratio is no greater than could be explained by chance alone. (linkage)
    - $H_0$: The population is in Hardy-Weinberg equilibrium; thus any deviation from a $p^2 : 2pq : q^2$ genotypic ratio (1:2:1 when $p = q = 0.5$) is no greater than could be explained by chance alone. (HWE)
    - $H_0$: The disease-associated allele is dominant and does not exhibit recessive lethality; therefore any deviation from a 3:1 phenotypic ratio of offspring is no greater than could be explained by chance alone. (recessive lethality)
• Test statistic is calculated based on the data; it is an intermediate step to a p-value
  o Comparing expected vs. actual integer counts: use chi-squared, $\chi^2 = \sum (O - E)^2 / E$
  o Comparing the means of two populations: use a t-test (comparing variances calls for an F-test instead)
  o Many others out there! (e.g. Z-tests)
• Interpret using degrees of freedom
  o How many classes of data you need to know in order to determine the whole distribution of data
  o For Mendelian ratios: # phenotypic classes - 1
  o For HWE: # genotypic classes - # alleles
• p-value represents $P(\text{data at least as extreme as observed} \mid H_0)$
  o "Statistics means never having to say you're certain"
  o Statistical significance: p < .05
  o Fail to reject $H_0$: why not "accept" $H_0$?
  o Reject $H_0$: what can you therefore conclude (if anything)?

II. GENETIC APPLICATIONS

Heritability
• Technical definition: proportion of phenotypic variance attributable to genetic variance
• Interpretation: extent to which genetic differences among individuals explain phenotypic differences among individuals
• Doesn't tell us how many genes are involved in a trait, for example, but does help us understand the relative contribution of genetics and environment for a given population
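As an illustration of the chi-squared procedure described under Hypothesis-Testing, here is a minimal sketch that tests a 3:1 phenotypic ratio. The offspring counts are hypothetical (they do not come from the handout), and the p-value shortcut used here is valid only for one degree of freedom.

```python
import math

def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 290 dominant-phenotype and 110 recessive-phenotype offspring.
observed = [290, 110]
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]   # 3:1 ratio expected under H0

stat = chi_squared(observed, expected)
df = len(observed) - 1                      # Mendelian ratio: classes - 1

# For df == 1 only, the upper-tail chi-squared probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2.0))
reject = p_value < 0.05

print(f"chi-squared = {stat:.3f}, df = {df}, p = {p_value:.3f}, reject H0: {reject}")
```

With these counts the statistic is about 1.33 and the p-value is well above .05, so we fail to reject the null hypothesis. For more than one degree of freedom, a chi-squared table or a statistics library would be needed in place of the `erfc` shortcut.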

• Abstract concepts with complicated measurement; don't get hung up on the technical details (see Fisher, 1918 if you're curious)
  o $V_P = V_G + V_E + V_{GE}$ (phenotypic = genetic + environmental + gene-by-environment interaction variance)
  o $V_G = V_A + V_D + V_I$ (additive + dominance + epistatic interaction variance)
• Broad sense: $H^2 = V_G / V_P$
• Narrow sense: $h^2 = V_A / V_P$
  o $h^2 = b$ (slope of the regression of mean offspring value on mean parental value)
• Useful for predicting response to selection
  o "Breeder's equation": $R = h^2 S$ (R = response to selection, S = selection differential)
• [e.g. 2]

Concordance
• $P(\text{twin A has trait} \mid \text{twin B has trait})$
• Monozygotic vs. dizygotic
• $H^2 = 2(r_{MZ} - r_{DZ})$

Hardy-Weinberg Equilibrium
• Expected relationship between allele frequencies and genotype frequencies under complete neutrality / random mating (as opposed to: selection, population structure due to inbreeding, etc.)
• Can be used to furnish expected values for $H_0$

                  Freq(A1) = p     Freq(A2) = q
  Freq(A1) = p    A1A1: $p^2$      A1A2: $pq$
  Freq(A2) = q    A1A2: $pq$       A2A2: $q^2$

• [e.g. 4]
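To make the Hardy-Weinberg table concrete, here is a minimal sketch that estimates allele frequencies from genotype counts and derives the genotype counts expected under HWE. The function name `hwe_expected` and the sample counts are hypothetical and not taken from the handout.

```python
def hwe_expected(n_a1a1, n_a1a2, n_a2a2):
    """Return (p, q, expected genotype counts) for a two-allele locus."""
    n = n_a1a1 + n_a1a2 + n_a2a2
    # Each A1A1 individual carries two A1 alleles; each heterozygote carries one.
    p = (2 * n_a1a1 + n_a1a2) / (2 * n)
    q = 1.0 - p
    expected = {
        "A1A1": p * p * n,
        "A1A2": 2 * p * q * n,   # the two pq cells of the table combined
        "A2A2": q * q * n,
    }
    return p, q, expected

# Hypothetical sample of 1000 individuals.
p, q, expected = hwe_expected(400, 400, 200)
print(p, q, expected)
```

These expected counts are what would be compared against the observed counts in a test of the HWE null hypothesis, with # genotypic classes - # alleles = 1 degree of freedom.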

Linkage and Linkage Disequilibrium
Both are recombination-based metrics, used for (among other things) genetic mapping: trying to associate genotype and phenotype.
• Linkage is based on the amount of recombination between two loci in one generation.
  o Measured in cM (centimorgans; 1 cM = 1% probability of recombination), also called m.u. (map units)
  o Maximum observable recombination frequency between two loci: 50% (50 cM), which is indistinguishable from independent assortment
• Linkage disequilibrium is based on allele and haplotype frequencies within a population.
  o We compare observed haplotype frequencies to those expected if the loci were independent (analogous to HWE expectations; hence, disequilibrium)
  o Declines by a factor of $(1 - r)$ per generation, where r is the recombination rate, so $D_t = (1 - r)^t D_0$
  o We have not given you the tools to calculate it.
  o If you have an allele that is always on a given haplotype background (i.e., allele A1 at locus A always appears with allele B1 at locus B), there is complete LD between those two loci
    - Another way to think of it: there is LD if not all possible haplotypes are present in the population
    - If this relationship is two-way, you have complete LD and complete correlation

Genetic Mapping
• Why do we need this if we can sequence?
• Use pedigrees to determine recombination frequencies between loci (= genetic distance)
• Linkage group: set of loci with <50% recombination frequency
• Three-point testcross: the double crossover is the least probable outcome; it can be used to infer the relative order of loci
• [e.g. 5] Construct a map for the following 3-point cross (homozygous recessive x heterozygote):
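A minimal sketch of the three-point testcross logic follows, using hypothetical progeny counts rather than the data for example 5. It assumes each gamete class from the heterozygous parent is written as a three-character string (one character per locus, `+` for wild type), that the two most frequent classes are the parentals, and that the rarest class is a double crossover. The function name `three_point_map` is illustrative.

```python
from itertools import combinations

def three_point_map(counts):
    """Return (index of the middle locus, pairwise distances in cM)."""
    total = sum(counts.values())
    ranked = sorted(counts, key=counts.get, reverse=True)
    parentals = ranked[:2]     # two most frequent classes
    double_co = ranked[-1]     # least frequent class

    # A double crossover swaps only the middle locus relative to a parental class.
    closest = min(parentals,
                  key=lambda par: sum(x != y for x, y in zip(par, double_co)))
    middle = next(i for i in range(3) if closest[i] != double_co[i])

    def percent_recombinant(i, j):
        parental_pairs = {(par[i], par[j]) for par in parentals}
        recomb = sum(n for g, n in counts.items()
                     if (g[i], g[j]) not in parental_pairs)
        return 100 * recomb / total

    distances = {(i, j): percent_recombinant(i, j)
                 for i, j in combinations(range(3), 2)}
    return middle, distances

# Hypothetical testcross progeny counts for loci written in the order a, b, c.
counts = {
    "+++": 370, "abc": 380,   # parental classes
    "a++": 45,  "+bc": 55,    # single crossovers between a and b
    "ab+": 70,  "++c": 60,    # single crossovers between b and c
    "a+c": 8,   "+b+": 12,    # double crossovers
}
middle, distances = three_point_map(counts)
print(middle, distances)
```

Here the middle locus is b (index 1), with a-b and b-c distances of 12 and 15 cM. The directly measured a-c value (23 cM) is smaller than 12 + 15 = 27 cM because double crossovers restore the parental combination at the outer loci and so go uncounted.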

Logarithm of Odds (LOD)
• Metric to compare probabilities of two hypotheses
  o Here: linkage vs. independence
  o Application: use genetic markers to localize alleles of import (e.g. disease): is the marker linked to the causative locus?
• Used for complex traits as well (QTLs)
• The idea: calculate the probability of the observed data under each hypothesis and compare
  o To do so we must find recombinants (must know phase of the parent)
  o Calculate probability of each gamete, using values of Θ (the recombination fraction)
• Why are we taking a log?
  o Makes a messy number (relative probabilities) more friendly and interpretable
  o $A = 10^B \Leftrightarrow B = \log_{10} A$
• So, if our LOD score $= \log_{10}[P(H_1)/P(H_0)]$, where $P(H_1)/P(H_0)$ is the relative probability, then $P(H_1)/P(H_0) = 10^{\text{LOD}}$
• Thus, $10^{\text{LOD}}$ tells you how much more probable the observed data are under linkage than under independent assortment
• Cutoff for significance: LOD = 3, and $10^3 = 1000$; thus we are looking for cases where linkage is 1000 times as likely as independent assortment
• [e.g. 6] Calculate LOD for the following recessive trait, given Θ = .4
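The LOD calculation can be sketched as follows, assuming phase-known, fully informative meioses so that each gamete is either a recombinant (probability Θ) or a non-recombinant (probability 1 - Θ). The counts are hypothetical and are not the pedigree for example 6; the function name `lod_score` is illustrative.

```python
import math

def lod_score(recombinants, meioses, theta):
    """log10 of the likelihood under linkage at theta vs. independent assortment."""
    non_recombinants = meioses - recombinants
    likelihood_linked = theta ** recombinants * (1.0 - theta) ** non_recombinants
    likelihood_unlinked = 0.5 ** meioses   # theta = 0.5 means independent assortment
    return math.log10(likelihood_linked / likelihood_unlinked)

# Hypothetical data: 1 recombinant among 10 informative meioses, tested at theta = 0.1.
lod = lod_score(recombinants=1, meioses=10, theta=0.1)
odds = 10 ** lod
print(f"LOD = {lod:.2f}; data are about {odds:.0f} times as likely under linkage")
```

With these numbers the LOD is about 1.6, short of the cutoff of 3. In practice the score is evaluated over a range of Θ values and the maximum is reported.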

III. OTHER IMPORTANT TOPICS

Molecular Clock
• Assuming a constant mutation rate, can use # sequence differences to infer time since divergence
  o When is this assumption violated?
• Must account for both branches of the evolutionary tree since divergence
  o $t = d / (2k)$, where k = mutation rate and d = # neutral substitutions

Research techniques
• Forward genetics: find the gene responsible
  o QTL crosses:
    1. Cross different inbred strains (completely homozygous P -> completely heterozygous F1)
    2. Cross F1 to P. Thus, F2 is of variable genetic composition, and chromosomal inheritance from P can be easily measured
    3. Check for associations between phenotype and chromosomal inheritance (e.g. with LOD)
• Reverse genetics: what happens if I mutate this gene?
• Sequencing: PCR-style replication, with a twist: terminator nucleotides, distinguishable by color
  o Generate fragments and use electrophoresis to determine length

Tree building
1. Group the most similar taxa/individuals
2. Update distance matrix, using averages
3. Reiterate
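The three tree-building steps can be sketched as a simple average-linkage clustering loop. This is a sketch under stated assumptions: the averages in step 2 are weighted by cluster size (as in UPGMA), which the handout does not specify; the distances are hypothetical; and the function name `build_tree` is illustrative.

```python
def build_tree(dist):
    """dist maps a pair of taxon names to their distance; returns a nested grouping."""
    d = {tuple(sorted(pair)): value for pair, value in dist.items()}
    sizes = {name: 1 for pair in d for name in pair}   # cluster label -> # taxa

    while len(sizes) > 1:
        # Step 1: group the two most similar (closest) clusters.
        a, b = min(d, key=d.get)
        merged = f"({a},{b})"

        # Step 2: update the distance matrix using size-weighted averages (assumption).
        for c in sizes:
            if c in (a, b):
                continue
            d_ac = d.pop(tuple(sorted((a, c))))
            d_bc = d.pop(tuple(sorted((b, c))))
            average = (sizes[a] * d_ac + sizes[b] * d_bc) / (sizes[a] + sizes[b])
            d[tuple(sorted((merged, c)))] = average
        del d[(a, b)]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        # Step 3: reiterate until a single cluster remains.

    return next(iter(sizes))

# Hypothetical pairwise distances among four taxa.
dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = build_tree(dist)
print(tree)
```

The nested parentheses record the order of grouping: A and B join first, then C, then D. Branch lengths are omitted to keep the sketch short.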