Recombination, and haplotype structure

Size: px
Start display at page:

Download "Recombination, and haplotype structure"

Transcription

1 2 The starting point We have a genome s worth of data on genetic variation Recombination, and haplotype structure Simon Myers, Gil McVean Department of Statistics, Oxford We wish to understand why the haplotype structure looks how it does Differences between regions, populations 3 4 Where do haplotypes come from? What determines the shape of the tree? In the absence of recombination, the most natural way to think about haplotypes is in terms of the genealogical tree representing the history of the chromosomes Present day Tree affects mutation patterns Mutation patterns give information on tree

2 5 6 Ancestry of current population Ancestry of sample Present day Present day 7 8 The coalescent: a model of genealogies Simulating histories with the coalescent Most recent common ancestor (MRCA) coalescence Ancestral lineages Present day time

3 9 10 Simulating data with the coalescent Haplotype structure in the absence of recombination In the absence of recombination, the shape of the tree and where mutations fall on it determine patterns of haplotype structure Two mutations on the same branch will be in complete association, mutations on different branches will have lower and often low association r 2 = 1 r 2 = Haplotypes when there is recombination A bit of recombination shuffles genetic variation When there is no recombination, haplotype structure reflects the age distribution of mutations and the shape of the underlying tree When there is some recombination, every nucleotide position has a tree, but the tree changes along the chromosome at a rate determined by the local recombination landscape By using SNP information to inform us about the trees, we can learn about how quickly the trees changes This relates to the recombination rate

4 13 14 Lots of recombination does lots of shuffling Recombination and haplotype diversity Without recombination, a new mutation can create at most one new haplotype Any two mutations delineate at most 3 haplotypes in total (ancestral, plus two new types) With recombination, this mutation can spread onto every existing haplotype background, creating the potential for more haplotypes For a given number of SNPs a region with recombination will tend to have (in comparison to a region with no recombination) More haplotypes Less variance in the pairwise differences between haplotypes Less skewed haplotype frequencies The ancestral recombination graph In humans, recombination is not uniformly distributed The combined history of recombination, mutation and coalescence is described by the ancestral recombination graph Event Coalescence Mutation Coalescence Coalescence Mutation Coalescence Recombination Most recombination occurs in recombination hotspots short (1-2kb) regions every kb that occupy at most 3% of the genome but probably account for 90% or more of the recombination This means that haplotype structure in humans is an interesting hybrid between the no recombination and lots of recombination situations

5 17 18 Learning about recombination The signal of recombination? Just like there is a true genealogy underlying a sample of sequences without recombination, there is a true ARG underlying samples of sequences with recombination We can consider nonparametric and parametric ways of learning about recombination There are useful nonparametric ways of learning about recombination which we will consider first These really only apply to species, such as humans, where we can be fairly sure that most SNPs are the result of a single ancestral mutation event Ancestral chromosome recombines Recurrent mutation Recombination Detecting recombination from DNA sequence data Improving the detection algorithm Look for all pairs of incompatible sites R m greatly underestimates the amount of recombination in the history of a set of sequences Myers and Griffiths (2003) developed an improved way of detecting recombination events Without recombination, every new mutation can create only a single new haplotype With recombination, mutations can be shuffled between haplotype background, generating haplotype diversity Each recombination makes at most one new haplotype If I see H haplotypes with S segregating sites, at least H-S-1 recombination events must have occurred Find minimum number of intervals in which recombination events must have occured (Hudson and Kaplan 1985): R m This offers potential to identify many more recombination events Carefully combine bounds from different collection of sites Dynamic programming algorithm makes computation extremely fast Better (sometimes slower) algorithms developed recently

6 21 22 Problems with counting recombination events Modelling recombination A tree-pair where we could see recombination events, but don t Tree-pairs where we cannot see recombination events Model-based approaches to learning about recombination allow us to ask more detailed questions than nonparametric approaches What is the rate of recombination (as opposed to just the number of events) Does gene A have a higher recombination rate than gene B? Is the rate of recombination across a region constant? Where are the recombination hotspots? We can use coalescent model approaches (approximations) to calculating the likelihood of arbitrary recombination maps given observed data Fitting a variable recombination rate Acceptance rates Use a reversible-jump MCMC approach (Green 1995) SNP positions Split blocks Merge blocks Change block size Change block rate lc ( ψ ) π ( ψ ) q( ψ, ψ ) ( ψ, u ) α( ψ, ψ ) = min 1, l ( ψ ) π ( ψ ) q( ψ, ψ ) ( ψ, u) Composite likelihood ratio C Ratio of priors Hastings ratio Jacobian of partial derivatives relating changes in dimension to sampled random numbers Include a prior on the number of change points that encourages smoothing

7 Strong concordance between fine-scale rate estimates from sperm and genetic variation 25 Inferring hotspots 26 We perform a statistical test for hotspot presence Based on an approximation to the coalescent similar to that used for rate estimation All previously identified hotspots are 1-2kb in size At a position in genome, consider where 2kb hotspot might be present Fit a model with hotspot Fit one without Compare in terms of (approximate) likelihood ratio test Evaluate significance via simulation When p-value below threshold, declare a hotspot Rates estimated from genetic variation McVean et al (2004) Rates estimated from sperm Jeffreys et al (2001) Rates and hotspots across the human genome Applications of recombination approaches to real data Rates and hotspots across the human genome (Myers et al. 2005) Previously, no understanding of why hotspots localise where they do Can 35,000 hotspots, accounting for >50% of human recombination, help? Comparison of recombination rates (Winckler et al. 2004, Ptak et al. 2005) Between humans and chimpanzees At individual recombination hotspots Understanding genomic rearrangements (Myers et al., submitted!) Cause a number of genomic disorders Relationship to recombination hotspots spots throughout human genome (35,000 identified) From Myers et al. (2005)

8 ,996 Phase II HapMap hotspots Estimated 50-70% of all human recombination spots on all chromosomes, including X THE1 consensus:...cttccgccatgattgtgaggcctccccagccatgtggaactgtgagtccatt... CCTCCCTAGCCAC (n=165) CCNCCNTNNCCNC (n=263) CCTCCCCNNCCAT (n=10,690) ~3-4% of hotspots Rate (cm/mb) Position relative to repeat center (kilobases) Perlegen chr12: GC Percent Known Genes II SNPs _ _ SINE LINE LTR DNA Simple Low Complexity Satellite trna Other Unknown Percentage GC in 5-Base Windows UCSC Known Genes II (June 05, Based on hg17 UCSC Known Genes) Simple Nucleotide Polymorphisms (SNPs) Oxford Recombination Rates from Perlegen data Oxford Recombination spots from Perlegen data Repeating Elements by RepeatMasker THE1B (LTR of retrotransposon) ~20,000 hotspots localised to within 5kb THE1B: Found in 1196 hotspots versus 606 coldspots (p<<10-20 ) AluY: Found in 3635 hotspots versus 3262 coldspots (p=7x10-5 ) L2 consensus:...tgtcacctcctcagagaggccttccctgaccaccctatctaaaatwgcacacc... CCTCCCTGACCAC (n=157) CCNCCNTNNCCNC (n=6,901) CTTCCCTNNCCAC (n=1,211) ~3-4% of hotspots AluY, AluSc, AluSg consensus:...ctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacag... CCGCCTTGGCCTC (n=14,028) CCNCCNTNNCCNC (n=15,706) CCGCCTCNNCCTC (n=55,916) ~3-4% of hotspots, including DNA3 Rate (cm/mb) Rate (cm/mb) Position relative to repeat center (kilobases) Position relative to repeat center (kilobases) Human hotspot motifs Variation in individual hotspots In humans, specific words produce recombination hotspot activity spot motif CCTCCCTNNCCAC (p<10-33 ) Raises probability of a hotspot across genetic backgrounds Degenerate versions CCNCCNTNNCCNC and truncated CCTCCCT also raise probability, to lesser extent Motif explains ~40% of human hotspots Operates in both sexes We don t know, very clearly, which hotspots On THE1 background, hotspot 70-80% of time! Biology not clearly understood We identified a second, different hotspot motif (the best 9bp motif), CCCCACCCC, also by comparison of hot and cold regions of the genome Sequence variation affects recombination at DNA2 (Jeffreys and Neumann, Nature Genetics 2002)

9 33 34 SNPs disrupting hotspots disrupt motifs! DNA2: Jeffreys and Neumann (Nature Genetics 2002, Hum Mol. Evol. 2005) SNPs disrupting hotspots disrupt motifs DNA2: Jeffreys and Neumann (Nature Genetics 2002, Hum Mol. Evol. 2005) AAAAGACAGCCTCCCTGTTGCTGC AAAAGACAGCCTCCCTGTTGCTGC AAAAGACAGCCCCCCTGTTGCTGC AAAAGACAGCCCCCCTGTTGCTGC Disruption of CCTCCCT, best 7bp motif NID1: NID1: CACCCCCCACCCCACCCCAACATA CACCTCCCACCCCACCCCAACATA CACCCCCCACCCCACCCCAACATA CACCTCCCACCCCACCCCAACATA Disruption of CCCCACCCC, best 9bp motif Role of motif in X-linked ichthyosis A more general link? chrx: Chromosome Band VCX2 VCX3A Segmental Dups IDS Chromosome Bands Localized by FISH Mapping Cloness Xp22.31 RefSeq Genes HDHD1A VCX VCX22 VCX-C FAM9A STS 1/5000 births PNPLA4 KAL1 Duplications of >1000 Bases of Non-RepeatMasked Sequence Copy Number Polymorphisms from BAC Microarray Analysis (Sharp) Copy Number Polymorphisms from BAC Microarray Analysis (Iafrate) Copy Number Polymorphisms from ROMA (Sebat) Structural Variation identified by Fosmids (Tuzun) Deletions from Genotype Analysis (McCarroll) chrx.3 chrx.4 Deletions from Genotype Analysis (Conrad) chrx.1 Deletions from Haploid Hybridization Analysis (Hinds) Many other diseases are caused by recombination-mediated deletions and duplications (NAHR) Smith-Magenis syndrome (hotspot) CMT1A (hotspot) NF1 microdeletion syndrome (hotspot) DiGeorge syndrome. The 1kb deletion hotspot contains 25 repeats of chrx: Chromosome Bands Localized by FISH Mapping Clones CCTCCCTNNCCAC Xp22.31 RefSeq Genes Highest motif density in VCX3A any LCR in entire genome Duplications of >1000 Bases of Non-RepeatMasked Sequence Strongly implicates Copy motif Number Polymorphisms in producing from BAC Microarray hotspot Analysis (Sharp) Copy Number Polymorphisms from BAC Microarray Analysis (Iafrate) Points to a link between Copy deletion-causing Number Polymorphisms from ROMA (Sebat) and normal Structural Variation identified by Fosmids (Tuzun) IDS recombination Deletions from Genotype Analysis (McCarroll) Chromosome Band Segmental Dups Deletion breakpoint hotspot (Van Esch et al. 2005) Deletions from Genotype Analysis (Conrad) Deletions from Haploid Hybridization Analysis (Hinds) Simple Tandem Repeats by TRF Simple Repeats Two recent studies suggest normal hotspots and hotspots of disease-causing deletion may coincide de Raedt, Stephens et al. (Nature Genetics, 2006) Two NF1 deletion hotspots both likely to coincide with crossover hotspots Lindsay et al. (ASHG, 2006) CMT1A deletion hotspot associated with crossover hotspot