12/8/09 Comp 590/Comp Fall


2 One of the first and simplest models of population genealogies was introduced by Fisher (1930) and Wright (1931). The model emphasizes the transmission of genes from one generation to the next. For simplicity, we'll first focus on a population of fixed size in which each individual starts with a distinct gene variant. 12/8/09 Comp 790 Introduction & Coalescence 2

3 Rules:
Antecedent genes are chosen randomly, with replacement, from their parental generation.
No selection.
Fixed population size.

G0: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
G1: ['J', 'A', 'H', 'B', 'I', 'E', 'D', 'G', 'A', 'B']
G2: ['A', 'J', 'E', 'G', 'D', 'E', 'B', 'I', 'A', 'A']
G3: ['A', 'A', 'E', 'J', 'I', 'A', 'I', 'A', 'J', 'B']
G4: ['E', 'A', 'B', 'B', 'A', 'E', 'A', 'A', 'A', 'A']
G5: ['A', 'A', 'B', 'A', 'A', 'E', 'A', 'A', 'A', 'B']
G6: ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'A']
G7: ['B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'A', 'A']

What will this population eventually look like?
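A minimal Python sketch of the sampling rule above, assuming a haploid population stored as a list of labels. The function names and the seed are illustrative and not part of the original slides; because sampling is random, the generations it prints will differ from the G0 to G7 listing.

```python
import random


def next_generation(parents, rng):
    """Each child copies a parent chosen uniformly at random, with replacement."""
    return [rng.choice(parents) for _ in parents]


def simulate(founders, generations, seed=0):
    """Return the founder generation followed by `generations` descendants."""
    rng = random.Random(seed)  # seeded only so that a run is repeatable
    history = [list(founders)]
    for _ in range(generations):
        history.append(next_generation(history[-1], rng))
    return history


if __name__ == "__main__":
    for i, generation in enumerate(simulate("ABCDEFGHIJ", 7)):
        print(f"G{i}: {generation}")
```

Since no new variants are ever created, the set of labels present can only shrink from one generation to the next.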

4 Discrete, non-overlapping generations. Haploid individuals. Population size is constant. All individuals are equally fit. No population or social structure. Genes segregate independently.

5 Replace letters with colors. Draw lineages. Sort topologically.

6 Every population eventually ends up with just one gene variant.

7 10,000 trials: Mode = 11 (616), Mean = [value lost in transcription]
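The experiment summarized above can be sketched as follows, assuming that each trial starts from 10 distinct variants and counts generations until only one variant remains. The function names, the default parameters, and the seed are assumptions made for illustration; the exact mode and mean depend on the random draws.

```python
import random
from collections import Counter


def generations_to_fixation(n, rng):
    """Count generations until all n individuals carry the same variant."""
    population = list(range(n))  # n distinct founder variants
    generations = 0
    while len(set(population)) > 1:
        population = [rng.choice(population) for _ in range(n)]
        generations += 1
    return generations


def fixation_summary(n=10, trials=10000, seed=0):
    """Return (mode, number of trials at the mode, mean) of fixation times."""
    rng = random.Random(seed)
    times = [generations_to_fixation(n, rng) for _ in range(trials)]
    mode, count = Counter(times).most_common(1)[0]
    return mode, count, sum(times) / trials


if __name__ == "__main__":
    mode, count, mean = fixation_summary()
    print(f"Mode = {mode} ({count}), Mean = {mean:.2f}")
```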

8 Thus far, we've considered very simple, admittedly oversimplified, models of biological and genetic processes. Next we'll discuss many of the biological realities that the coalescent model either crudely approximates or entirely ignores. We also want to move from our simple gene-centric view to that of a more complete organism. 12/8/09 Comp 790 Continuous-Time Coalescence 8

9 Gene: A unit of information transferred from one generation to the next.
Allele: An alternative form of a gene; information that comes in two or more forms.
SNP (Single Nucleotide Polymorphism): A position in a DNA sequence that can be found in more than one of the 4 nucleotide states (A, C, G, T). SNPs are one type of allele.
Haplotype: A subsequence of DNA that includes only the positions known to vary (SNPs).

10 Mutation: Changes in the genetic material of an organism. Events that actually modify genes, potentially generating new alleles.
Recombination: A process in which new gene combinations are introduced. Examples: crossovers, gene conversion, lateral gene transfer.
Structural Rearrangement: Modifications that impact the number of existing gene copies and their relative orderings. Examples: insertions, deletions, inversions.

11 There are many ways of altering a gene, some common and some rare: environmental exposure (radiation, chemicals, etc.) and random events (faulty DNA replication, other malfunctions of biochemical machinery). Many mutations affect cells of higher organisms without genetic ramifications (mutations of the so-called somatic cells), but they may still be important to the organism (e.g., by leading to cancer). Mutations of the germline (gamete) cells are the ones of genetic interest, because they impact the life of genes as opposed to their protective organism.

12 The DNA sequence is broken into several independent segments organized into structures called chromosomes. Chromosomes vary between organisms. The DNA molecule may be circular or linear, and can contain from 10,000 to 1,000,000,000 nucleotides. Simple single-cell organisms (prokaryotes, cells without nuclei, such as bacteria) generally have smaller circular chromosomes, although there are many exceptions. More complicated cells (eukaryotes, with nuclei) have linear DNA molecules that are broken into segments and wound around special proteins. These aggregates are called chromosomes.

13 The number of fragments that the DNA is broken into determines the number of distinct chromosomes. This number is called the monoploid number.

Organism      Unique chromosomes
Human         23
Chimpanzee    24
Mouse         20
Dog           39
Horse         32
Donkey        31
Hare          23

14 Having only one copy of the DNA is a risky proposition, since the loss of a single functional gene could lead to a bad outcome. Evolution has addressed this shortcoming by incorporating a mostly redundant copy of the entire sequence in most cells. The haploid number is the number of chromosomes in a gamete of an individual. Nearly all mammals are diploid and receive a homologous sequence from each parent. Many plants carry more than 2 copies of their sequence; 4 and 8 are typical, and the number can vary between subspecies.

15 In the formation of gametes (sperm and ovum), homologous DNA strands are combined in a process called crossover. This effectively combines the prefix of one sequence with the suffix of another.
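As a toy illustration of the prefix/suffix description above, the sketch below treats two homologous sequences as equal-length strings and exchanges their tails at a single breakpoint. The function name and the `point` parameter are hypothetical; real crossovers are far messier than a string splice.

```python
def crossover(seq_a, seq_b, point):
    """Return the two recombinants produced by a single crossover at `point`.

    The first recombinant is the prefix of seq_a followed by the suffix of
    seq_b; the second is the reciprocal product.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("homologous sequences must have equal length")
    if not 0 <= point <= len(seq_a):
        raise ValueError("crossover point is outside the sequence")
    return seq_a[:point] + seq_b[point:], seq_b[:point] + seq_a[point:]


if __name__ == "__main__":
    print(crossover("AAAAAA", "CCCCCC", 2))  # ('AACCCC', 'CCAAAA')
```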

16 Gene conversion: A DNA sequence is transferred from one copy (which remains unchanged) to another, whose sequence is altered. It results from the repair of damaged DNA, as described by the Double Strand Break Repair model.

17 Lateral gene transfer: Any process in which an organism incorporates genetic material from another organism without being the offspring of that organism. Horizontal gene transfer is a confounding factor when inferring phylogenetic trees from sequences. It was one of the most prevalent forms of recombination in early evolution.

18 Large-scale structural changes (deletions, insertions, inversions) may occur in a population. Wi 07 Vineet Bafna

19 Previously we allowed for gene variants (alleles), but without a model of how they came into being. Rather than the coalescence of a single gene, next we consider successive generations of gene sets. There are two things to consider: variants of a gene (alleles) and variants in allele combinations (sequences). We begin by treating each independently.
[Figure: successive generations G n, G n+1, G n+2, G n+3, G n+4]
12/8/09 Comp 790 Genealogies to Sequences 19

20 The Infinite Alleles Model assumes that all that is knowable is whether alleles are identical or different. There is no spatial (i.e., sequence position) or quantitative information about the observed differences. The model only keeps track of how many copies of each allele type exist; the number of mutations that led to a variant is lost. There are two event types, splits and mutations. Labels are arbitrary.
[Figure: genealogy with leaves B, D, A, C, C and successive states (A), (A,A), (B)(A), (B)(A), (B)(A,A), (B)(A)(C), (B)(A)(C,C), (B,B)(A)(C,C), (B)(D)(A)(C,C), (B)(D)(A)(C,C)]
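The bookkeeping described above (only the number of copies of each allele type, updated by splits and mutations) can be sketched with a counter. The event list below is one reading of the states shown in the figure, not something the slide states explicitly, and the function names are illustrative.

```python
from collections import Counter


def split(counts, allele):
    """A gene copy carrying `allele` leaves one more copy of the same type."""
    if counts[allele] == 0:
        raise ValueError(f"no copy of {allele!r} to split")
    counts[allele] += 1


def mutate(counts, allele, new_allele):
    """One copy of `allele` becomes a brand-new allele type."""
    if counts[allele] == 0:
        raise ValueError(f"no copy of {allele!r} to mutate")
    if counts[new_allele] > 0:
        raise ValueError("every mutation must create a previously unseen allele")
    counts[allele] -= 1
    if counts[allele] == 0:
        del counts[allele]  # that allele type is lost
    counts[new_allele] = 1


def replay(events, founder="A"):
    """Apply ('split', a) and ('mutate', a, b) events to a single founder."""
    counts = Counter({founder: 1})
    for event in events:
        if event[0] == "split":
            split(counts, event[1])
        else:
            mutate(counts, event[1], event[2])
    return dict(counts)


if __name__ == "__main__":
    events = [("split", "A"), ("mutate", "A", "B"), ("split", "A"),
              ("mutate", "A", "C"), ("split", "C"), ("split", "B"),
              ("mutate", "B", "D")]
    print(replay(events))  # counts match the final state (B)(D)(A)(C,C)
```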

21 Assumes mutations are rare events and DNA sequences are large, so multiple mutations at the same site are extremely rare. The Infinite Sites Model assumes that multiple mutations never occur at the same sequence position. Thus, all genes are biallelic.
[Figure: genealogy that includes a lost haplotype]

22 Observed haplotypes and SNPs from the previous example.
[Table: four haplotypes (rows) by SNPs S1 to S5 (columns); cell values lost in transcription]
Under the Infinite Sites Model, the haplotype size equals the number of historical mutations. While sequences can be lost, alleles cannot, in contrast to the Infinite Alleles Model. SNP Diversity Patterns (SDPs) can be repeated (e.g., S1 and S2). Since the assignment of 1s and 0s is arbitrary, a SNP and its complement share the same SDP. For N haplotypes, there are at most 2^(N-1) - 1 possible SDPs.
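The count of possible SDPs can be checked by brute force for small N. The sketch below assumes a SNP is a column of 0/1 alleles across N haplotypes and picks, as an arbitrary convention, the representative in which the first haplotype carries 0; the function names are illustrative.

```python
from itertools import product


def sdp(column):
    """Canonical form of a SNP column: a SNP and its complement map to the
    same tuple, chosen so that the first haplotype always carries 0."""
    column = tuple(column)
    if column[0] == 1:
        return tuple(1 - allele for allele in column)
    return column


def count_possible_sdps(n):
    """Enumerate every 0/1 column over n haplotypes and count distinct SDPs."""
    patterns = {sdp(column) for column in product((0, 1), repeat=n)}
    patterns.discard((0,) * n)  # a monomorphic column is not a SNP
    return len(patterns)


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, count_possible_sdps(n), 2 ** (n - 1) - 1)
```

The two columns printed after N should agree: complementing halves the 2^N columns, and the single monomorphic pattern is then removed.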

23 Unrooted Perfect Phylogeny
Nodes correspond to haplotypes (both visible and historical).
Edges correspond to SNPs.
Removal of an edge creates a bipartition.
Tree leaves correspond to mutations (allele variants) that are unique to a sequence, i.e., an SDP with only one minority allele instance, a singleton.

24 Assume we only have direct access to the observed haplotypes.
[Table: four haplotypes (rows) by SNPs S1 to S5 (columns); cell values lost in transcription]
Construct a pairwise distance matrix between haplotypes using Hamming distances.
Add the smallest-distance edges between nodes, skipping any edge that would introduce a loop.
If the smallest distance d is greater than 1, add d-1 hidden nodes between the pair so that adjacent nodes have a Hamming distance of 1.
Augment the distance matrix with the new nodes and claim the introduced edges.
Repeat finding the smallest distance and augmenting until the graph is fully connected.
[Figure: the resulting tree over the observed haplotypes and the hidden nodes]
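A partial sketch of the procedure above: it computes pairwise Hamming distances and then adds the smallest edges that do not close a loop, which is a Kruskal-style spanning tree. The hidden-node augmentation is only counted here, not performed, so this is not the full algorithm from the slide. The haplotype strings are made up for illustration because the original table values were lost.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def smallest_edges_without_loops(haplotypes):
    """Return tree edges as (distance, name_a, name_b), smallest first."""
    names = list(haplotypes)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    edges = sorted((hamming(haplotypes[a], haplotypes[b]), a, b)
                   for a, b in combinations(names, 2))
    tree = []
    for distance, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # joining two components cannot create a loop
            parent[root_a] = root_b
            tree.append((distance, a, b))
    return tree


def hidden_nodes_needed(tree):
    """An edge of distance d would need d-1 hidden nodes along it."""
    return sum(distance - 1 for distance, _, _ in tree)


if __name__ == "__main__":
    haps = {"H1": "00000", "H2": "10000", "H3": "11000", "H4": "00110"}
    tree = smallest_edges_without_loops(haps)
    print(tree)
    print(hidden_nodes_needed(tree))
```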

25 Under the assumption of the infinite sites model, all SNP pairs exhibit the property that no more than 3 of the 4 possible allele combinations occur. This is a direct consequence of allowing only one mutation per site. Showing that all SNP pairs satisfy the four-gamete test is a necessary and sufficient condition for a perfect phylogeny tree to exist.
[Table: four haplotypes (rows) by SNPs S1 to S5 (columns); cell values lost in transcription]
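The four-gamete test itself is short to write down. The sketch below assumes biallelic SNPs and haplotypes given as equal-length strings of '0' and '1'; the function names and the example panels are hypothetical.

```python
from itertools import combinations


def gametes(haplotypes, i, j):
    """Set of allele combinations observed at SNP positions i and j."""
    return {(hap[i], hap[j]) for hap in haplotypes}


def passes_four_gamete_test(haplotypes):
    """True if no pair of SNPs shows all 4 allele combinations."""
    n_sites = len(haplotypes[0])
    return all(len(gametes(haplotypes, i, j)) < 4
               for i, j in combinations(range(n_sites), 2))


if __name__ == "__main__":
    print(passes_four_gamete_test(["000", "100", "110", "001"]))  # True
    print(passes_four_gamete_test(["00", "01", "10", "11"]))      # False
```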

26 Which SDPs are compatible with any other SNP? Singleton SNPs are compatible with any other SNP.
Given N distinct haplotype sequences resulting from an infinite sites model, what is the minimum number of SDPs? N-1 edges are the fewest necessary to connect N haplotypes into a linear tree. How many singleton SNPs occur in such a tree? 2.
Given N distinct haplotype sequences resulting from an infinite sites model, what is the maximum number of SDPs? 2N-3 edges, the number of edges in an unrooted tree with N leaves.

27 Consider the following SNP panel.
[Table: SNP panel, haplotypes (rows) by SNPs (columns); cell values lost in transcription]
Does it satisfy the four-gamete test? Construct the tree. Is the SDP T possible?

28 Without recombination or recurring mutations, genomes would have a very specific structure. All SNP pairs would exhibit no more than 3 of their 4 possible allele combinations (a.k.a. the 4-gamete test; Hudson & Kaplan, 1985). Haplotype blocks in which all SNPs satisfy the 4-gamete test admit a perfect phylogeny tree. We can exploit this property to partition large SNP panels into haplotype blocks within which there is no evidence of homoplasy or recombination, and treat these regions as *mega-markers*.

29 Issues: Can we efficiently find all compatibility intervals? How many intervals are there (the fewest necessary to cover the entire genome)? Are they unique? Do they share common properties?
[Figure: compatibility matrix; axis labels: Haps, SNPs]

30 There are many ways of partitioning genomes into compatible intervals. However, there is a tight bound on the minimum number of intervals necessary to cover the whole genome. And all minimum-interval solutions include a common core set of SNPs.

31 We can also efficiently find *all* maximal compatible intervals across the genome as well as all minimum-interval solutions composed of maximal compatible intervals. These SNP regions show no evidence of historical recombination events or homoplasy.
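One simple way to compute such intervals is a left-to-right scan, sketched below. It assumes that each SNP is a string of '0'/'1' alleles across the haplotypes and that an interval is compatible when every pair of SNPs in it passes the four-gamete test. This is a quadratic illustration, not the efficient method the slides refer to, and all names are hypothetical. Because any sub-interval of a compatible interval is also compatible, the greedy scan should use the fewest intervals possible.

```python
def compatible(col_a, col_b):
    """Two SNP columns are compatible if fewer than 4 gametes are observed."""
    return len(set(zip(col_a, col_b))) < 4


def furthest_ends(columns):
    """ends[s] is the last index e such that SNPs s..e are pairwise compatible."""
    n = len(columns)
    ends = []
    end = 0  # exclusive end of the current window; it never moves left
    for start in range(n):
        end = max(end, start + 1)
        while end < n and all(compatible(columns[k], columns[end])
                              for k in range(start, end)):
            end += 1
        ends.append(end - 1)
    return ends


def maximal_compatible_intervals(columns):
    """Intervals (start, end) that cannot be extended in either direction."""
    ends = furthest_ends(columns)
    return [(s, e) for s, e in enumerate(ends) if s == 0 or e > ends[s - 1]]


def greedy_cover(columns):
    """Cover all SNPs left to right, always extending as far as possible."""
    ends = furthest_ends(columns)
    cover, pos = [], 0
    while pos < len(columns):
        cover.append((pos, ends[pos]))
        pos = ends[pos] + 1
    return cover


if __name__ == "__main__":
    snps = ["0011", "0001", "0101", "0100"]  # made-up SNP columns
    print(maximal_compatible_intervals(snps))  # [(0, 1), (1, 3)]
    print(greedy_cover(snps))                  # [(0, 1), (2, 3)]
```

In this toy panel the first and third SNPs are incompatible, so at least two intervals are needed, and the SNP at index 1 belongs to both maximal intervals.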

32 of Perlegen SNPs on Chr 1, 60 Billion pairwise relationships, >7.5 GBytes

33 Chromosome Trees based on Perfect Phylogenies