Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift

1 Population and Statistical Genetics including Hardy-Weinberg Equilibrium (HWE) and Genetic Drift
Heather J. Cordell, Professor of Statistical Genetics
Institute of Genetic Medicine, Newcastle University, UK

3 DNA
Series of molecules (nucleotides, bases) arranged in a double helix structure
  A Adenine, C Cytosine, G Guanine, T Thymine
For our purposes, we can consider DNA as a long string of bases
  ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
In humans, the DNA sequence is divided into 23 pairs of chromosomes
  22 pairs of autosomes
  1 pair of sex chromosomes (X and Y)
One chromosome of each pair was inherited from your father, and one from your mother
The entire DNA sequence (all 23 pairs of chromosomes) is known as the genome

4 Genetic variation
The sequences of unrelated humans are 99.9% identical
Differences are mostly single nucleotide polymorphisms (SNPs) (= single base changes)
Three DNA sequences:
  ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAGTGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAATGGCGTCCCATAGTATCGG
Different sequences are said to possess different alleles at these positions (= locations or loci, singular = locus)
  i.e. the allele is the genetic variant present at a particular locus

5 Genetic variation
As well as SNPs
  ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAGTGGCGTCCCATAGTATCGG
other types of variation include deletions or inversions
  ACCTGTGTGCCCAAATGGCGTCCCATACTATCGG
  ACCTGTGTGCCCA           ATACTATCGG
  ACCTGTGTGCCCACCCTGCGGTAAATACTATCGG
Or differences in the number of repeats, e.g. copy number variants (CNVs) or short tandem repeats (STRs) (sometimes called microsatellites)
  ACCTG AGTT AGTT AGTT AGTT AGTT ATACTATCGG
  ACCTG AGTT AGTT AGTT ATACTATCGG
Rather than writing out the full sequence of these more complicated variations, we can simply label the alleles, e.g. 1, 2, 3, 4, 5 or A, B, C, D or a, b, c, d or D, d etc.

6 Alleles and genotypes
Each person has two homologous DNA sequences
  One inherited from their father, one from their mother
Each person has two alleles at each genetic position (locus)
  One inherited from their father, one from their mother
Their genotype at a locus is the combination of alleles they possess
Considering the sequences inherited from both parents, different individuals possess different genotypes at certain positions
Two individuals (Person 1, Person 2):
  ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAATGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAGTGGCGTCCCATACTATCGG
  ACCTGTGCGCCCAGTGGCGTCCCATAGTATCGG

7 Mendelian inheritance
When a parent passes one of their two alleles at a locus to an offspring, the allele passed on is chosen at random, i.e. each allele is transmitted with probability 0.5
When the offspring passes one of their two alleles to their own child (the grandchild), again each allele is transmitted with probability 0.5
  And so on down through the generations
This transmission of alleles down the generations forms the basis of population genetics
  Combined with phenomena like mutation (the appearance of new alleles)
  And selection (the fact that some alleles are disadvantageous, and so are less likely to get passed on)

10 Hardy-Weinberg Equilibrium (HWE)
Essentially a probabilistic statement relating the frequencies of the different possible genotypes at a locus to the frequencies of the constituent alleles.
Suppose we have a locus with 3 alleles: A, B, C.
  Gives rise to 6 possible genotypes: A/A, B/B, C/C, A/B (or B/A), A/C (or C/A), B/C (or C/B)
Suppose in a population of 1000 people (= 2000 chromosomes) we found that the A allele occurred 1000 times, the B allele 600 times and the C allele 400 times.
So allele frequencies (probabilities) are:
  $p_A = 1000/2000 = 0.5$, $p_B = 600/2000 = 0.3$, $p_C = 400/2000 = 0.2$
HWE states that the genotype frequencies are:
  $P(A/A) = p_A^2 = 0.25$
  $P(B/B) = p_B^2 = 0.09$
  $P(C/C) = p_C^2 = 0.04$
  $P(A/B) = 2 p_A p_B = 0.3$
  $P(A/C) = 2 p_A p_C = 0.2$
  $P(B/C) = 2 p_B p_C = 0.12$
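
The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original lecture: the function name `hwe_genotype_freqs` and the dictionary layout are assumptions made for illustration. It takes allele counts and returns the genotype frequencies expected under HWE ($p^2$ for homozygotes, $2pq$ for heterozygotes).

```python
from itertools import combinations_with_replacement


def hwe_genotype_freqs(allele_counts):
    """Expected genotype frequencies under HWE, given allele counts.

    `allele_counts` maps an allele label to the number of chromosomes
    carrying it (a hypothetical input format chosen for this sketch).
    """
    total = sum(allele_counts.values())
    freqs = {allele: n / total for allele, n in allele_counts.items()}
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            # Homozygote: both alleles drawn independently, p^2.
            expected[f"{a}/{b}"] = freqs[a] ** 2
        else:
            # Heterozygote: two possible orderings, 2pq.
            expected[f"{a}/{b}"] = 2 * freqs[a] * freqs[b]
    return expected


# Allele counts from the slide: 2000 chromosomes in 1000 people.
counts = {"A": 1000, "B": 600, "C": 400}
expected = hwe_genotype_freqs(counts)
for genotype, freq in expected.items():
    print(genotype, round(freq, 4))
```

The six frequencies sum to 1, which is a quick sanity check that no genotype has been missed.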

15 Hardy-Weinberg Equilibrium (HWE)
HWE holds after a single generation of random mating
  Even if it did not hold in the original (base) population
A nice mathematical exercise is to prove this...
For computationally-inclined students, an even nicer exercise is to program up a computer simulation demonstrating it...
Suppose we have a locus with 2 alleles: A, B.
  Gives rise to 3 possible genotypes: A/A, B/B, A/B (or B/A).
Suppose genotype frequencies in a population of 10,000 people are
  P(A/A) = 0.7 (7000 people)
  P(B/B) = 0.2 (2000 people)
  P(A/B) = 0.1 (1000 people)
Allele frequencies are
  $p_A = 0.7 + 0.1/2 = 0.75$, $p_B = 0.2 + 0.1/2 = 0.25$
Under HWE we would expect
  $P(A/A) = p_A^2 = 0.75^2 = 0.5625$
  $P(B/B) = p_B^2 = 0.25^2 = 0.0625$
  $P(A/B) = 2 p_A p_B = 2 \times 0.75 \times 0.25 = 0.375$
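
One possible version of the simulation exercise suggested on this slide is sketched below. It is an illustration under stated assumptions, not the lecturer's own program: the function names are hypothetical, parents are drawn with replacement from the whole population (so sexes are ignored and an individual can occasionally be chosen as both parents), and the population size is held fixed at 10,000. Starting from the non-HWE base population above, the genotype frequencies should jump to roughly 0.5625, 0.375 and 0.0625 after one generation and then only wobble because of random sampling.

```python
import random
from collections import Counter

GENOTYPES = [("A", "A"), ("A", "B"), ("B", "B")]


def next_generation(population, rng):
    """One generation of random mating with a constant population size."""
    children = []
    for _ in range(len(population)):
        # Pick two parents at random (simplifying assumption: no sexes).
        mother = rng.choice(population)
        father = rng.choice(population)
        # Each parent transmits one of its two alleles with probability 0.5.
        child = (rng.choice(mother), rng.choice(father))
        children.append(tuple(sorted(child)))
    return children


def genotype_freqs(population):
    """Observed frequency of each genotype in the population."""
    counts = Counter(population)
    return {g: counts[g] / len(population) for g in GENOTYPES}


rng = random.Random(1)  # fixed seed so the run is reproducible
# Base population from the slide: 7000 A/A, 1000 A/B, 2000 B/B.
population = [("A", "A")] * 7000 + [("A", "B")] * 1000 + [("B", "B")] * 2000

history = [genotype_freqs(population)]
for generation in range(5):
    population = next_generation(population, rng)
    history.append(genotype_freqs(population))

for generation, freqs in enumerate(history):
    print(generation, {"/".join(g): round(f, 3) for g, f in freqs.items()})
```

Because the population is finite, later generations drift slightly away from the exact HWE values computed from the original allele frequencies; an infinitely large population would stay put.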

16 HWE computer simulation
[Figure: genotype frequency (A/A, A/B, B/B) plotted against generation]

19 Utility of HWE
Useful as a quality control check of genotyping data
  Given that we expect HWE to hold in (most) populations, if it doesn't hold, this can indicate a problem with genotyping errors
  Particularly true of high-throughput SNP genotyping technologies; may wish to exclude SNPs failing HWE
Many methods/computer programs for linkage and/or association analysis have assumptions of HWE embedded within them
  E.g. for averaging over possible genotypes for untyped founder members in a family
  Or for reducing the number of parameters to estimate: m alleles produce $m(m+1)/2$ possible genotypes
  How important the HWE assumption is varies according to the method/program...
Not all populations are expected to be in HWE
  E.g. populations where there are consanguineous matings
  Such as populations where first cousin marriages are common
Can be a useful tool for exploring different population attributes
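
As an illustration of the quality control use, the sketch below tests a single diallelic SNP for departure from HWE with a Pearson chi-square goodness-of-fit test on 1 degree of freedom. The slides do not say which test is intended, so this choice is an assumption (exact tests are often preferred when an allele is rare), and the function name `hwe_chi_square` is hypothetical. Only the standard library is used; for 1 df the chi-square tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of HWE from observed genotype counts."""
    n = n_aa + n_ab + n_bb
    # Allele frequency estimated by counting alleles.
    p_a = (2 * n_aa + n_ab) / (2 * n)
    p_b = 1.0 - p_a
    observed = (n_aa, n_ab, n_bb)
    # Expected counts under HWE: n*p^2, n*2pq, n*q^2.
    expected = (n * p_a ** 2, 2 * n * p_a * p_b, n * p_b ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Genotype counts of 7000 A/A, 1000 A/B, 2000 B/B: far from HWE.
chi2_bad, p_bad = hwe_chi_square(7000, 1000, 2000)
# Counts that match HWE exactly for p_A = 0.5.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
print(round(chi2_bad, 2), p_bad)
print(chi2_ok, p_ok)
```

In a QC pipeline the p value would be compared with a chosen threshold, and SNPs falling below it would be flagged for exclusion or closer inspection.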

20 Genetic drift
Defined as the change in the frequency of an allele through different generations of a population due to random sampling
  As opposed to a change in frequency due to selection
  Whereby an allele that is detrimental has a lower chance of getting passed on to the next generation
  E.g. because individuals with that allele die before they produce any offspring
Genetic drift tends to be more important in small populations
  Such as founder populations, where a new population is established by a very small number of individuals from a larger population
  E.g. French Canadians in Quebec, Amish in USA, Ashkenazi Jews, the population of Tristan da Cunha

21 Genetic drift and fixation
[Figure] "Random sampling genetic drift" by Gringer, own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons (Random_sampling_genetic_drift.svg)

22 Genetic drift computer simulation
[Figure] By Professor marginalia (own work), CC BY-SA 3.0 or GFDL, via Wikimedia Commons
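
A drift simulation of this kind can be sketched with the Wright-Fisher model, which is an assumption here (the slides do not name the model behind the figure). Each generation, the 2N allele copies of the next generation are drawn at random according to the current allele frequency, with no selection or mutation. The function name and parameters are hypothetical.

```python
import random


def wright_fisher(n_individuals, p0, generations, rng):
    """Allele-frequency trajectory under pure drift (Wright-Fisher model).

    n_individuals: diploid population size N (so 2N allele copies)
    p0: starting frequency of the allele being tracked
    """
    n_copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Binomial sampling: each copy in the next generation carries
        # the allele with probability equal to its current frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory


rng = random.Random(42)  # fixed seed for reproducibility
small = wright_fisher(n_individuals=10, p0=0.5, generations=2000, rng=rng)
large = wright_fisher(n_individuals=1000, p0=0.5, generations=50, rng=rng)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

In the small population the allele is expected to be either lost (frequency 0) or fixed (frequency 1) long before generation 2000, whereas in the large population the frequency typically stays close to its starting value over 50 generations, in line with drift mattering more in small populations.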

23 Utility of Genetic Drift
Most relevant for ecological studies of non-human populations, e.g. starlings (?)
Can help explain why findings from genetic studies vary from one population to another
  i.e. why they do not appear to be replicable
Power of a study to detect a genetic effect on an outcome of interest (e.g. a correlation between genotype and disease phenotype) depends on the allele frequency
  So in populations where an allele is rare (or non-existent), there will be little or no power to detect the effect

25 Disease genes
The phenotype is the visible characteristic or trait (e.g. eye colour, height, occurrence of diabetes) that results from having a specific genotype
Disease genes (or disease loci) are genes (or locations on the genome) where there is a particular allele (perhaps a mutation in the genetic sequence) that increases disease risk
Simple Mendelian or monogenic disorders show a close correspondence between genotype (at a single genetic locus) and phenotype
  In dominant disorders, only one disease allele is required for an individual to get the disease
  In recessive disorders, two disease alleles are required for an individual to get the disease

26 Penetrances
The penetrance is the probability of being diseased, given genotype
[Table: penetrance for each genotype (dd, Dd, DD) under dominant, recessive and incomplete-penetrance models, with the genotype relative risk (GRR) / odds ratio (OR); numerical entries not recoverable]
GRR = factor by which your baseline penetrance should be multiplied (for 0, 1, 2 copies of D)
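
To make the link between penetrance, GRR and allele frequency concrete, here is a small sketch. None of the numbers come from the slide: the fully penetrant dominant and recessive models, the baseline penetrance, the GRR values and the disease allele frequency are all hypothetical, and genotype frequencies are assumed to follow HWE. Penetrances are listed in the order 0, 1, 2 copies of the disease allele D.

```python
def hwe_genotype_probs(p_d):
    """P(dd), P(Dd), P(DD) under HWE for disease allele frequency p_d."""
    q = 1.0 - p_d
    return (q * q, 2 * p_d * q, p_d * p_d)


def penetrances_from_grr(baseline, grr_one_copy, grr_two_copies):
    """Penetrances obtained by multiplying the baseline by each GRR."""
    return (baseline, baseline * grr_one_copy, baseline * grr_two_copies)


def prevalence(p_d, penetrances):
    """Population disease probability: sum over genotypes of
    P(genotype) * P(disease | genotype)."""
    return sum(g * f for g, f in zip(hwe_genotype_probs(p_d), penetrances))


# Idealised fully penetrant models (assumed values for illustration).
DOMINANT = (0.0, 1.0, 1.0)   # one copy of D is enough
RECESSIVE = (0.0, 0.0, 1.0)  # two copies of D are required

# Hypothetical incomplete-penetrance model: baseline 1%, GRRs of 2 and 4.
incomplete = penetrances_from_grr(0.01, 2.0, 4.0)

p_d = 0.1  # hypothetical disease allele frequency
print("dominant:", prevalence(p_d, DOMINANT))
print("recessive:", prevalence(p_d, RECESSIVE))
print("incomplete:", prevalence(p_d, incomplete))
```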

27 Mendelian inheritance and recombination
When a parent passes one of their two alleles at a locus to an offspring, each is transmitted with probability 0.5
However, alleles at different loci are not inherited independently
  Alleles at loci that are physically closer tend to get transmitted together (i.e. in coupling)
Parental transmission:
  Parent   ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
           ACCTGTGCGCCCATTGGCGTCCCATAATATCGG
  Child 1  ACCTGTGTGCCCAATGGCGTCCCATACTATCGG
  Child 2  ACCTGTGTGCCCAATGGCGTCCCATAATATCGG
  Child 3  ACCTGTGTGCCCATTGGCGTCCCATAATATCGG

28 Linkage and linkage disequilibrium (LD)
θ represents the probability of recombination between two loci
  θ ranges from 0 to 0.5
If two loci lie close together on the same chromosome, θ is small (≈ 0) and the loci are said to be tightly linked (completely linked when θ = 0)
If the loci are farther apart, θ increases and eventually approaches 0.5, at which point the loci are said to be unlinked
  At that point, alleles at the two loci are transmitted independently
Over many generations, this phenomenon of linkage induces a correlation between alleles at nearby loci, known as linkage disequilibrium or LD
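
The role of θ in maintaining LD can be made explicit with a standard population-genetics result, given here as a sketch under the assumptions of random mating and a large population with no selection, mutation or drift. The notation is introduced for illustration and is not on the slide: $D_t$ is the LD coefficient between the two loci after $t$ generations, where $D = P(AB) - p_A p_B$ measures how far the haplotype frequency departs from independence.

```latex
D_t = (1 - \theta)^t \, D_0
```

For unlinked loci ($\theta = 0.5$) any initial LD halves every generation, while for tightly linked loci ($\theta \approx 0$) it persists for many generations, which is why correlations are seen mainly between alleles at nearby loci.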

29 Visualising LD
[Figure] Plot showing LD measures $r^2$ (upper) and $D'$ (lower)
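
The two LD measures named in this plot can be computed from haplotype and allele frequencies. The sketch below uses the usual textbook definitions; the function name, argument names and example frequencies are hypothetical, both loci are assumed to be diallelic and polymorphic, and $|D'|$ is returned rather than a signed value.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two diallelic, polymorphic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    # D: departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    # Largest magnitude D could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2: squared correlation between alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Allele A only ever occurs with allele B, but B is much more common.
d, d_prime, r2 = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.6)
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

In this example $|D'|$ is 1 while $r^2$ is well below 1, which illustrates why the two measures can look quite different in the upper and lower halves of an LD plot.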

30 Linkage and association studies
Linkage studies measure correlation between alleles at a test locus and disease status within families
  Caused by a lack of recombination between alleles at the test locus and the underlying disease locus
  As genetic material is being passed down through the family
  In linkage studies, we test for this lack of recombination directly (parametric linkage analysis) or indirectly (non-parametric linkage analysis)
Association studies measure correlation between alleles at a test locus and disease status across families
  Caused by historical lack of recombination between alleles at the two loci over many generations
  Leads to linkage disequilibrium (LD) (= association/correlation) between the disease allele and alleles at nearby loci
  In the population as a whole
  Since correlations across families operate even across unrelated individuals (= families of size 1)

31 Within-family correlations

32 Genetics of common diseases
Linkage analysis has been a highly successful strategy for identifying (localising) genes involved in rare monogenic (single-gene) disorders
  e.g. Huntington's disease, Cystic Fibrosis
Less successful for common complex disorders
  Type 1 diabetes: confirmed the roles of HLA and insulin genes (Davies et al. 1994)
  Crohn's disease: NOD2/CARD15 gene implicated (Hugot et al. 2001)
  Age-related macular degeneration: Complement factor H gene identified through a combination of approaches, including follow-up of significant regions from non-parametric linkage scan (Haines et al. 2005)
A much more successful strategy for complex diseases has been association analysis
  Particularly genome-wide association studies (GWAS)

33 Genome-wide association studies (GWAS)
Popular approach over past 7-8 years
  Enabled by advances in microarray-based genotyping technologies
  Allowing us to measure between 500,000 and 4 million genetic variants in each individual
Scan through genome looking for variants that correlate with phenotype (e.g. disease status)
  In a hypothesis-free approach
  Don't need to have any prior reason (e.g. good biological candidate) to focus on specific genes or genetic variants

34 Association testing: case/control studies
Collect sample of affected individuals (cases) and unaffected individuals (controls)
  Or else a sample of random population controls
  Most of whom will not have the disease of interest
Examine the association (correlation) between alleles present at a genetic locus and presence/absence of disease
  By comparing the distribution of genotypes in affected individuals with that seen in controls

35 Case/control studies
Each person can have one of 3 possible genotypes at a diallelic genetic locus
  Genotype   Cases        Controls
  ...        ... (= a)    200 (= b)
  ...        ... (= c)    820 (= d)
  ...        ... (= e)    980 (= f)
  Total      ...          ...
  [Genotype labels, case counts and totals not recoverable]
Test for association (correlation) between genotype and presence/absence of disease using a standard $\chi^2$ test for independence on 2 df
  Or some other, more sophisticated, statistical test
Generates a p value indicating how significant the association/correlation appears to be
  i.e. how likely an association at least this strong would be to arise by chance if genotype and disease were unrelated
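
The 2 df test described here can be sketched in a few lines. The control counts (200, 820, 980) are the ones that survive on the slide; the case counts (300, 900, 800) are hypothetical values invented for this example, as is the function name. The sketch assumes every genotype row has at least one individual. For a chi-square distribution with 2 degrees of freedom the upper tail probability has the closed form `exp(-x / 2)`, so no statistics library is needed.

```python
import math


def genotype_chi_square(cases, controls):
    """Pearson chi-square test of independence on a 3 x 2 table (2 df).

    cases, controls: genotype counts in the same genotype order.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    chi2 = 0.0
    for case_n, control_n in zip(cases, controls):
        row_total = case_n + control_n
        for observed, col_total in ((case_n, n_cases), (control_n, n_controls)):
            # Expected count if genotype and disease status are independent.
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 2 df.
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


controls = [200, 820, 980]  # from the slide
cases = [300, 900, 800]     # hypothetical, for illustration only
chi2, p_value = genotype_chi_square(cases, controls)
print(round(chi2, 3), p_value)
```

In a GWAS this test (or a more sophisticated alternative) is repeated at every genotyped variant, so the resulting p values are judged against a stringent genome-wide threshold rather than the conventional 0.05.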

36 Manhattan Plots

37 Close-up of hit region

38 Success of GWAS
Over the last 7-8 years, there has been a slew of high-profile GWAS, in a variety of different diseases
Highly successful: many hundreds of associations (between genotype and phenotype) detected
GWAS point us to genomic regions (loci) highly likely to harbour disease genes
We still don't know the functional (causal) variant, in most cases
  Indeed, the causal variant may well not even have been genotyped (but SNPs that are correlated with it have been genotyped)

39 Success of GWAS
GWAS are best considered as a hypothesis-generating exercise
  Identifying candidate genomic regions for further investigation
  Possibly via different types of experiment
And potentially pointing us to new biology
  Ankylosing spondylitis (IL-23 pathway)
  Schizophrenia (calcium signalling)
  Inflammatory bowel disease (IL-23 pathway, autophagy pathway, innate immunity)
See Visscher et al. (2012) "Five Years of GWAS Discovery", AJHG 90:7-24