Intro to population genetics

Size: px
Start display at page:

Download "Intro to population genetics"

Transcription

1 Intro to population genetics Shamil Sunyaev Department of Biomedical Informatics Harvard Medical School Broad Institute of M.I.T. and Harvard

2 Forces responsible for genetic change Mutation µ Selection s Drift N e Population structure F ST

3 Mutations

4 Mutation rate in humans and flies 2.5x10-8 (Nachman & Crowell) 1.8x10-8 (Kondrashov) NGS estimates ~1.2X10-8 per nt changes genome ~10 2 per nt changes genome Other events: indels (10-9 ) repeat extensions/contractions (10-5 ) large events (?)

5 Mutation rate is variable along the genome Replication fidelity DNA damage DNA repair CpG deamination Regional variation of mutation rate Context dependence of mutation rate

6 Genetic drift

7 Drift is a random change of allele frequencies

8 Drift depends on population size

9 Demographic history

10

11 Selection

12 Most functional mutations are deleterious Selective effect of mutation Deleterious Neutral Advantageous New mutation Functional Nonfunctional Selection indicates functional mutations, whether or not the tested trait is under selection 12

13 Methods of mathematical population genetics

14 Dynamic of allelic substitution Mathematically, allele frequency change in a population follows a one-dimensional random walk 1 0 time

15 Diffusion approximation Random walk that does not jump long distances can be approximated by a diffusion process φ x, p,t ( ) t = Mφ x, p,t ( ) x Vφ x, p,t ( ) x 2

16 Coalescent theory Instead of modeling a population, we can model our sample Time goes backwards t

17 Natural selection in protein coding regions

18 Signatures of purifying selection Reduced in variation Excess of rare alleles

19

20 Effect of new missense mutations

21

22 Computer simulations Demographic history φ( x, p,t) t Mφ( x, p,t) = + 1 x 2 2 Vφ( x, p,t) x 2 time Natural selection

23

24 Can we find additional evidence in sequence data? Is there any information beyond frequency? Can we tell alleles under selection from neutral alleles if they are of the same frequency?

25

26 At a given frequency deleterious and advantageous alleles are younger than neutral Maruyama effect (1974): at any frequency advantageous, or deleterious alleles are younger than neutral alleles Frequency x Frequency 0% Time 26

27 Intuition: shorter trajectories require fewer lucky jumps Frequency x Frequency 0% Shorter trajectory: 4 jumps Longer trajectory: 6 jumps Time

28 Diffusion theory: deleterious alleles pass fast through higher frequencies allele frequency Neutrals: equal time at each frequency Selecteds: faster through higher frequencies time Idea: low accumulation of mutations at linked sites indicates selection

29 Intermediate allele frequency (%) mean sojourn time (2N generations) Selection coefficient (2Ns) 0 (neutral) 2 (weakly deleterious) 10 (deleterious) 3%

30 Neighborhood clock (fuzzy clock) Closest'variant'beyond'' recombina4on'event' Variant'' Closest'rarer' linked'variant' 30

31 Neighborhood clock is consistent with Maruyama-effect expectations MAC=3 Proportion NC <= x missense ancestral missense derived probably damaging derived synonymous derived NC statistic coding-synon 7486 missense 9728 benign 4464 possibly damaging 1656 probably damaging 2654 baseline Data: pilot Genome of Netherlands dataset

32 Can signals of selection guide prioritization? Genes of interest should be highly selectively constrained ExAC dataset combines exomes of >60,000 individuals Several methods to estimate gene-based selection constrained exist (pli, RVIS) Can we estimate fitness loss directly?

33 Selection inference using frequencies of individual SNPs Change in allele frequency = = Mutation + Selection + Drift Of the order of 10-8 Demographic history Population structure

34 Focusing on rare deleterious PTVs PTV protein truncating variant (a.k.a. nonsense) Combine all PTVs per gene we assume that they have identical effects Consider each gene as a bi-allelic locus PTV / no PTV

35 Selection inference using combined frequency of PTVs Change in allele frequency = = Mutation + Selection + Drift Assuming string selection and a very large population, combined frequency of rare deleterious PTVs is expected to be Poisson distributed with l =U/hs

36 Simulations Drift becomes non-negligible U = Large s "#$ Small s "#$

37 The model PTV counts in each gene are Poisson distributed but we lack sufficient data to estimate selection coefficients We can treat selection coefficients as random variables with a distribution to be estimated

38 Distribution of selection coefficients P(shet α,β ) Heterozygous selection coefficient, s het

39 Estimates for each gene The estimated distribution over selection coefficients can be now used as a prior, and per gene estimates from posteriors V "#$,6 N 6 ; W 6 = Ü _ á S àâä,á ;g á Ü S àâä,á ;f ã,d ã Ü _ á S;g á Ü S;f ã,d ã ds

40 What happens if we incorporate drift?

41 Fraction of 4% 2% AD and AR Mendelian genes 0% >= >= [c] Mode of Inheritance in Molecular Diagnoses [Baylor] s_het bin [d] Baylor s_het bins [e] UCLA s_het bins Number of observed genes Mode of Inheritance AD AR % 80% 60% 40% 20% Fraction of genes by Mode of Inheritance 19.57% 80.43% 96.04% 21.18% 78.82% 96.70% >= % s_het < 0.04 s_het > 0.04 s_het < 0.04 s_het > 0.04

42 Age of onset, penetrance and severity

43 [c] Cell-Essential by KBM7 CRISPR Assay [d] Cell-Essential by Yeast Gene Trap Concordance with mouse knockout data [a] Orthologous mouse knockouts by phenotype [b] Distribution of s_ s_het bin Percentage of genes in each bin, by phenotype 100% 80% 60% 40% 20% 0% >= Phenotype Lethal Subviable Viable 1 11

44 Concordance with cell essentiality screens Phenotype Lethal Subviable Viable [c] Cell-Essential by KBM7 CRISPR Assay [d] Cell-Essential by Yeast Gene Trap Assay 100 s_het bin 20% 70 s_het bin 25% Percentage of genes classified as essential 20% 15% 10% 5% 0% >= Percentage of genes classified as essential 15% 10% 5% 0% >=

45 Black hole in knowledge 0.6 Non-Viable Sanger Mice KBM7 Human Cell Line Protein- Protein Interactions s_het Value Yeast Gene Trap Score 0.5 Percentage of Genes in Each Group Fewest Publications Most Publications Fewest Publications Most Publications Fewest Publications Most Publications Fewest Publications Most Publications Fewest Publications Most Publications ê