Haplotype phasing in large cohorts: Modeling, search, or both?

Similar documents
Nature Genetics: doi: /ng Supplementary Figure 1

Improving the accuracy and efficiency of identity by descent detection in population

Proceedings of the World Congress on Genetics Applied to Livestock Production 11.64

SEGMENTS of indentity-by-descent (IBD) may be detected

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Supplementary information

Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis

Haplotype estimation for biobank scale datasets

Analysis of genome-wide genotype data

Understanding genetic association studies. Peter Kamerman

Imputation. Genetics of Human Complex Traits

Genotype Prediction with SVMs

Maternal Grandsire Verification and Detection without Imputation

Assessment of the performance of hidden Markov models for imputation in animal breeding

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Constrained Hidden Markov Models for Population-based Haplotyping

Supplementary Note: Detecting population structure in rare variant data

Haplotypes, linkage disequilibrium, and the HapMap

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

ARTICLE High-Resolution Detection of Identity by Descent in Unrelated Individuals

B) You can conclude that A 1 is identical by descent. Notice that A2 had to come from the father (and therefore, A1 is maternal in both cases).

The Clark Phase-able Sample Size Problem: Long-Range Phasing and Loss of Heterozygosity in GWAS

Genome-wide association studies (GWAS) Part 1

ARTICLE Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering

Population differentiation analysis of 54,734 European Americans reveals independent evolution of ADH1B gene in Europe and East Asia

The Lander-Green Algorithm. Biostatistics 666

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

Oral Cleft Targeted Sequencing Project

Crash-course in genomics

What is genetic variation?

SNP Selection. Outline of Tutorial. Why Do We Need tagsnps? Concepts of tagsnps. LD and haplotype definitions. Haplotype blocks and definitions

Human Genetics and Gene Mapping of Complex Traits

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

ARTICLE Haplotype Estimation Using Sequencing Reads

Using Big Data technologies to uncover genetic causes of Amyotrophic lateral sclerosis

PLINK gplink Haploview

Popula'on Gene'cs I: Gene'c Polymorphisms, Haplotype Inference, Recombina'on Computa.onal Genomics Seyoung Kim

Estimation problems in high throughput SNP platforms

QTL Mapping, MAS, and Genomic Selection

HISTORICAL LINGUISTICS AND MOLECULAR ANTHROPOLOGY

Title. Conflation of short identity- by- descent segments bias their inferred length distribution. Authors and Affiliations

Algorithms for Genetics: Introduction, and sources of variation

Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

FSuite: exploiting inbreeding in dense SNP chip and exome data

Genotype SNP Imputation Methods Manual, Version (May 14, 2010)

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Lecture 23: Causes and Consequences of Linkage Disequilibrium. November 16, 2012

Genome wide association studies. How do we know there is genetics involved in the disease susceptibility?

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

SUPPLEMENTARY INFORMATION

Making sense of DNA For the genealogist

Computing the Minimum Recombinant Haplotype Configuration from Incomplete Genotype Data on a Pedigree by Integer Linear Programming

Statistical Methods for Quantitative Trait Loci (QTL) Mapping

Haplotypes versus Genotypes on Pedigrees

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics

Introduction to Quantitative Genomics / Genetics

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Answers to additional linkage problems.

Association studies (Linkage disequilibrium)

Human Genetics and Gene Mapping of Complex Traits

Association Mapping. Mendelian versus Complex Phenotypes. How to Perform an Association Study. Why Association Studies (Can) Work

Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity

H3A - Genome-Wide Association testing SOP

THE cost of next-generation sequencing has fallen remarkably

12/8/09 Comp 590/Comp Fall

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

A fast and accurate method for detection of IBD shared haplotypes in genome-wide SNP data

BTRY 7210: Topics in Quantitative Genomics and Genetics

Haplotypes versus genotypes on pedigrees

Our motivation for a NGS/MPS SNP panel

PUBH 8445: Lecture 1. Saonli Basu, Ph.D. Division of Biostatistics School of Public Health University of Minnesota

Questions we are addressing. Hardy-Weinberg Theorem

Genetics and Psychiatric Disorders Lecture 1: Introduction

Leveraging admixture analysis to resolve missing and crosspopulation. Noah Zaitlen

Using the Trio Workflow in Partek Genomics Suite v6.6

CUMACH - A Fast GPU-based Genotype Imputation Tool. Agatha Hu

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations

A General Approach for Haplotype Phasing across the Full Spectrum of Relatedness

Fine-scale mapping of meiotic recombination in Asians

Genomic resources. for non-model systems

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2017

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Prostate Cancer Genetics: Today and tomorrow

Midterm 1 Results. Midterm 1 Akey/ Fields Median Number of Students. Exam Score

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Fast and Accurate Haplotype Inference with Hidden Markov Model. Yi Liu

Resources at HapMap.Org

Genetics - Fall 2004 Massachusetts Institute of Technology Professor Chris Kaiser Professor Gerry Fink Professor Leona Samson

Genome-Wide Association Studies. Ryan Collins, Gerissa Fowler, Sean Gamberg, Josselyn Hudasek & Victoria Mackey

Alkes Price Harvard School of Public Health January 24 & January 26, 2017

GBS Usage Cases: Examples from Maize

Improvement of Association-based Gene Mapping Accuracy by Selecting High Rank Features

Nature Genetics: doi: /ng.3143

Linkage Disequilibrium. Adele Crane & Angela Taravella

Haplotype Inference by Pure Parsimony via Genetic Algorithm

Computational Workflows for Genome-Wide Association Study: I

Transcription:

Haplotype phasing in large cohorts: Modeling, search, or both? Po-Ru Loh Harvard T.H. Chan School of Public Health Department of Epidemiology Broad MIA Seminar, 3/9/16

Overview Background: Haplotype phasing New long-range phasing method: Eagle Results on N=150K UK Biobank samples Algorithm overview Future directions GCCGCGTTAATTGGTGGTATCTCGGGTCGCCTCCCACTGGTAATTAATGGTGCGAACCCTACCTCTCCGTTATGTCTGGT TCTACGGTGAACGGTGGCCTCTAGTCTCGACAACCAGTAGAAATGAAAGGGGCCCACGCAATCCCACGTTTATGTCAGGT TTTGCTTGGAACGTTCGCCTCCAATGTAGACACCCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGAAACGC TTCGCGGTGAACGGTCGCCGCCAAGCACGCCAACCACGAGTAATGAAAGGTGTGAAAGCTACTCTACGTTTCTATATCTT GCCACTTTAAATGGTCGTCTCCAAGGACGCCTCGCAGTGTAAATGAATGGTGTCAAAGCTCCCCCACGGAACTGTCACTC TCCGCTGTGAACGGTGGCAGCTCGGCACGCCACGCTCTATTGATTAATGTGGTCCAACCTACTTTACGGTACTGAATCGC TCCACGGGGATTGTTGGCAGCCAGTGACGCCACCCACGATAAATTAATGTTGCCCAACCTACCCCACCTATCTGTCAGTC GTCGCGTGGAACGTTCGTATCTCGGCACGCCAACCACGGTAGATGAATGGTGCGCAAGCACTCTCTCCGTAATGACTGTC TCCACTTGGAATGTTGGTCTCCAGTCAAGCCTCCCAGGAGTAATTAAAGTGGCGCAACCTATCTTACCTAACTATCACGC TTTGCTGGGAACGTTCGTAGCCCGGCTCGACAACCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGTAAGTT

Humans have diploid genomes Pairs of chromosomes = Paternally-derived + Maternally-derived

Genotyping arrays combine paternal and maternal contributions Paternal haplotype: A C A T G G C Maternal haplotype: A T A T G G A Genotype calls (measured): AA CT AA TT GG GG AC

Problem: Phase information is lost! Paternal haplotype: A C A T G G C Maternal haplotype: A T A T G G A Genotype calls (measured): AA CT AA TT GG GG AC Haplotypes could have been: A C A T G G A + A T A T G G C

Problem: Phase information is lost! Paternal haplotype: A C A T G G C Maternal haplotype: A T A T G G A Genotype calls (measured): AA CT AA TT GG GG AC Haplotypes could have been: A C A T G G A + A T A T G G C Switch error!

Problem: Phase information is lost and phase information is useful Uses of phased haplotypes: Genotype imputation (e.g., for GWAS) Population genetic analyses Identity-by-descent (IBD) detection Demographic history inference Marchini et al. 2007 Howie et al. 2012

Problem: Phase information is lost and phase information is useful So how can we recover it? (and how accurately can we do it?)

Problem: Phase information is lost and phase information is useful So how can we recover it? (and how accurately can we do it?) Wet lab techniques exist, but they re expensive (e.g., long read sequencing, Hi-C)

Problem: Phase information is lost and phase information is useful So how can we recover it computationally? (and how accurately can we do it?) Teaser: It depends on your sample size, but for N=150K UK samples We can phase chr20 perfectly(!) in ~20% of samples and with 2 switches in >50% of samples

Problem: Phase information is lost and phase information is useful So how can we recover it computationally? (and how accurately can we do it?) Teaser: It depends on your sample size, but for N=150K UK samples We can phase chr20 perfectly(!) in ~20% of samples and with 2 switches in >50% of samples

Phasing (for computer scientists)???????????????? Question: Given a diploid genotype sequence, can we recover its parental haplotypes????? +???? 1 0 1 2 Binary encoding: Haploid 0=A, 1=G (at A/G SNP) Diploid 0=AA, 1=AG, 2=GG 1 0 0 1 + 0 0 1 1 1 0 1 2 0 0 1 1 + 1 0 0 1 1 0 1 2 1 0 1 1 + 0 0 0 1 1 0 1 2

Trio phasing???? +???? 0 1 1 2???? +???? 2 0 1 1 Question: Given a diploid genotype sequence and diploid parental genotypes, can we recover its parental haplotypes????? +???? 1 0 1 2

Trio phasing 0?? 1 + 0?? 1 0 1 1 2 1 0?? + 1 0?? 2 0 1 1 Answer: Yes (for the most part): 1. Fill in homozygous sites? 0? 1 +? 0? 1 1 0 1 2

Trio phasing 0?? 1 + 0 0? 1 0 1 1 2 1 0?? + 1 0? 1 2 0 1 1 Answer: Yes (for the most part): 1. Fill in homozygous sites 2. Propagate info 0 0? 1 + 1 0? 1 1 0 1 2

Trio phasing 0 1? 1 + 0 0? 1 0 1 1 2 1 0? 0 + 1 0? 1 2 0 1 1 Answer: Yes (for the most part): 1. Fill in homozygous sites 2. Propagate info until stuck 0 0? 1 + 1 0? 1 1 0 1 2

Trio phasing 0 1? 1 + 0 0? 1 0 1 1 2 1 0? 0 + 1 0? 1 2 0 1 1 Result: Perfect phase* in child at sites homozygous in 1 individual Almost-perfect phase* in parents (up to recombination) No phase at all-het sites; need to use LD from a panel of reference haplotypes * up to genotype error + de novo mutation 0 0? 1 + 1 0? 1 1 0 1 2

Pedigree phasing Extension of trio phasing to extended families Additional relatives => more Mendelian constraints (i.e., fewer all-het sites) MERLIN software (Abecasis et al. 2002 Nat Genet) 1 1? 0 + 1 0? 0 2 1 1 0 1 0? 0 + 0 0? 1 1 0 1 1 1 0? 1 + 0 0? 1 1 0 1 2

Statistical phasing Question: Given a large number of diploid genotype sequences, can we recover their parental haplotypes? 01201021020102100210010101010020100101010010010201201020101001020120100010010120 02012010001001012012010210201021002100101010100201001010100100102012010201010010 01021002100101010100201001010100100102012010201010012010210021001010101002010010 01021020100210021001010101002010010101001001020120110010120120102102010210021011 02100210010101010020100101010010010201201021002100101010100201001010100100102011 10101001001020120102100210010101010020100101010010010201001020120102100210010210 01001010100100102012010201010012010210021001010101002100210010101010020100101110 00100102012011001012012010210201021002102100210010101010020100101010010010201201 02100101010100201001010100100102012010210021001010101002010012011001012012010211

Statistical phasing Question: Given a large number of diploid genotype sequences, can we recover their parental haplotypes? Answer: Yes, using linkage disequilibrium (LD) = statistical correlation between nearby sites

General statistical phasing approach: Hidden Markov models (HMMs) Answer: Various methods based on probabilistic modeling of population LD work well: PHASE (Stephens et al. 2001 AJHG) Beagle (Browning & Browning 2007 AJHG) HAPI-UR (Williams et al. 2012 AJHG) SHAPEIT2 (Delaneau et al. 2013 Nat Meth)

Limitation of existing methods: Answer: Various HMM-based methods work well albeit not as well as pedigree-based methods: 2-5% switch error rates on N 10K samples Phase switches occur every 20-50 hets (~1-2 cm) Accuracy improves with N Accuracy Williams et al. 2012 AJHG; Delaneau et al. 2013 Nat Meth

Limitation of existing methods: Answer: Various HMM-based methods work well albeit not as well as pedigree-based methods and much more slowly: Several days to phase a single chromosome for N=16K samples Speed HMMs are efficient, but genomic data is big! Williams et al. 2012 AJHG

Long-range phasing (LRP) LRP method: Kong et al. 2008 Nat Genet 35K of 316K Icelanders typed (11%)! Identify local surrogate parents for each individual; proceed as in trio phasing (fast!) Yield: near-perfect phase calls at 90-95% of het sites LRP applications: ~25 subsequent papers in Nature or Nat Genet Note: Diagram is an oversimplification; in practice, surrogate parents only match proband in sub-chromosomal segments identical-by-descent (IBD)

Challenge: Recombination breaks down IBD segments Shared segment lengths decrease exponentially with # of generations LRP is easy with close relatives but hard otherwise Sub-chromosomal segment

Key contrast HMM vs. LRP: Modeling vs. Search accurate fast and accurate at very large sample sizes Let s build a great model! Let s use lots of data!

Predictions about LRP Kong et al. 2008: We speculate that having as little as 1% of a population genotyped may be adequate for the method to yield useful results. Browning & Browning 2011: It is likely that IBD-based phasing can be extended by using more sensitive methods for detecting IBD and combining IBD-based phasing with population haplotype frequency models.

Predictions about LRP Kong et al. 2008: Realization of (beyond Iceland) 0.2%, in fact! We speculate that having as little as 1% of a population genotyped may be adequate for the method to yield useful results. Browning & Browning 2011: It is likely that IBD-based phasing can be extended by using more sensitive methods for detecting IBD and combining IBD-based phasing with population haplotype frequency models. Loh, Palamara & Price 2016 Nat Genet (in press) Yes! Yes!

Overview Background: Haplotype phasing New long-range phasing method: Eagle Results on N=150K UK Biobank samples Algorithm overview Future directions GCCGCGTTAATTGGTGGTATCTCGGGTCGCCTCCCACTGGTAATTAATGGTGCGAACCCTACCTCTCCGTTATGTCTGGT TCTACGGTGAACGGTGGCCTCTAGTCTCGACAACCAGTAGAAATGAAAGGGGCCCACGCAATCCCACGTTTATGTCAGGT TTTGCTTGGAACGTTCGCCTCCAATGTAGACACCCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGAAACGC TTCGCGGTGAACGGTCGCCGCCAAGCACGCCAACCACGAGTAATGAAAGGTGTGAAAGCTACTCTACGTTTCTATATCTT GCCACTTTAAATGGTCGTCTCCAAGGACGCCTCGCAGTGTAAATGAATGGTGTCAAAGCTCCCCCACGGAACTGTCACTC TCCGCTGTGAACGGTGGCAGCTCGGCACGCCACGCTCTATTGATTAATGTGGTCCAACCTACTTTACGGTACTGAATCGC TCCACGGGGATTGTTGGCAGCCAGTGACGCCACCCACGATAAATTAATGTTGCCCAACCTACCCCACCTATCTGTCAGTC GTCGCGTGGAACGTTCGTATCTCGGCACGCCAACCACGGTAGATGAATGGTGCGCAAGCACTCTCTCCGTAATGACTGTC TCCACTTGGAATGTTGGTCTCCAGTCAAGCCTCCCAGGAGTAATTAAAGTGGCGCAACCTATCTTACCTAACTATCACGC TTTGCTGGGAACGTTCGTAGCCCGGCTCGACAACCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGTAAGTT

New LRP hybrid method: Eagle A. Rapidly identify long (>4cM) IBD segments => make initial phase calls Note: 4cM = common ancestor ~12 generations ago! GCCGCGTTAATTGGTGGTATCTCGGGTCGCCTCCCACTGGTAATTAATGGTGCGAACCCTACCTCTCCGTTATGTCTGGT TCTACGGTGAACGGTGGCCTCTAGTCTCGACAACCAGTAGAAATGAAAGGGGCCCACGCAATCCCACGTTTATGTCAGGT TTTGCTTGGAACGTTCGCCTCCAATGTAGACACCCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGAAACGC TTCGCGGTGAACGGTCGCCGCCAAGCACGCCAACCACGAGTAATGAAAGGTGTGAAAGCTACTCTACGTTTCTATATCTT GCCACTTTAAATGGTCGTCTCCAAGGACGCCTCGCAGTGTAAATGAATGGTGTCAAAGCTCCCCCACGGAACTGTCACTC TCCGCTGTGAACGGTGGCAGCTCGGCACGCCACGCTCTATTGATTAATGTGGTCCAACCTACTTTACGGTACTGAATCGC TCCACGGGGATTGTTGGCAGCCAGTGACGCCACCCACGATAAATTAATGTTGCCCAACCTACCCCACCTATCTGTCAGTC GTCGCGTGGAACGTTCGTATCTCGGCACGCCAACCACGGTAGATGAATGGTGCGCAAGCACTCTCTCCGTAATGACTGTC TCCACTTGGAATGTTGGTCTCCAGTCAAGCCTCCCAGGAGTAATTAAAGTGGCGCAACCTATCTTACCTAACTATCACGC TTTGCTGGGAACGTTCGTAGCCCGGCTCGACAACCTGGGGTAATTAATGTTGCGAACCCACTTTTACCTTACTGTAAGTT B. Run two iterations of HMM-based phasing

Samples: New data: UK Biobank (N=150K interim release) N=150K 0.2% of British population 94% White ethnicity 88% British + 3% Irish + 3% Other 72 trios (70 white) Markers: Allows benchmarking! M=650K autosomal SNPs after QC

Computational performance: Eagle achieves ~14x speedup, ~3x less RAM Run time (hours) 10 4 10 2 Beagle SHAPEIT2 HAPI-UR Eagle Memory (GB) 10 2 10 1 Beagle HAPI-UR SHAPEIT2 Eagle 10 0 10 0 15K 50K 150K Sample size (N) 15K 50K 150K Sample size (N) Benchmark: 6K SNPs 1% of genome Hardware: up to 10 cores of a Xeon L5640

Benchmarking phase accuracy Switch error: transition between following paternal haplotype vs. maternal haplotype Each row: Haplotype produced by algorithm Samples 0 20 40 Genetic position (cm) Blue = matches paternal Red = matches maternal (from trio phasing) Switch error rate = (# errors) / (# hets 1)

Accuracy: Eagle achieves 0.3% switch Benchmark: First 40cM of chr10 (~6K SNPs) Trio parents removed Phase accuracy evaluated on 70 trio children (at triophased hets) 0.3% error rate 1 switch per 13cM 2/3 of Eagle and SHAPEIT2 switch errors for N=150K are minor (1-3 SNPs): 1 major switch / 40cM error rate at N=150K Switch error rate (%) 2.5 2 1.5 1 0.5 0 HAPI-UR Beagle Eagle SHAPEIT2 15K 50K 150K Sample size (N) Large markers + solid lines: Methods that can phase 150K samples genome-wide in <200 node days

Eagle achieves ~4x lower error than other genome-wide tractable methods Method Run time (12% of genome) Switch error rate Eagle 1x150K 7.9 days 0.31% SHAPEIT2 10x15K 24.4 days 1.35% HAPI-UR 10x15K 15.8 days 2.20% Chromosome-scale analyses: chr1 (short arm), chr10, chr20 Methods: Eagle, 1 batch of all 150K samples SHAPEIT2 and HAPI-UR, 10 batches of 15K samples Batching is necessary due to computational cost Hardware: up to 10 cores of a Xeon L5640

Algorithm overview 1. Detect long matching segments (>4cM IBD); call initial phase 2. Locally refine phase in shorter segments (~1cM IBD/IBS) 3. Run two approximate HMM iterations Samples Samples Step 1: Direct IBD-based phasing 0 20 40 Step 3a: HMM iteration 1 Samples Samples Step 2: Local phase refinement 0 20 40 Step 3b: HMM iteration 2 0 20 40 Genetic position (cm) 0 20 40 Genetic position (cm)

Step 1: Rapidly search for long diploid matches (>4cM IBD); make initial phase calls Input: Diploid {0,1,2} data for N samples Procedure: Find pairs of diploid segments that share 1 haplotype IBD Combine IBD info to make phase calls Output: Accurate phase in long chunks of each sample (wherever IBD exists) = Haploid {0,1} reference panel of 2N haplotypes 00010000110110001001010010001001010010010010010101000010001001000100100001010001 10001010100000010000010100010000100001010010100001010100111010100100010001010010 Step 1: Direct IBD-based phasing 01000101010000010100011001100100000010010100101110001000000100100010100001001110 00001010010001010101111010000100001001100000010001000001001001000010010100001011 00100010010010010010101001000101010010000010010010101110010000100101010000010010 01010010101010010001001001001000110000101000001001001000100000100101100100100101 01010001010000100001000101100000001010000010100000100001010000000101010101010000 00100000010010000100100000100100000010000100000000101010000000100101000000001001 Samples 00101000100000100000100100000100001010000100100000010000101010000001001011000001 01010000000100001001100001000000101010010100101110001000000100100010100100001000 0 20 40 Genetic position (cm)

Finding IBD in diploid data If two diploid segments share a haplotype, they can t have opposite homozygous sites (0 vs. 2) 01010002011000100200000101001102000 00120021002010000101001200000110011 Possible haplotype sharing Basic idea: Scan all pairs of samples for long segments with no opposite homozygotes O MN 2 time (N=#samples, M=#markers), but very small constant factor via bit arithmetic Henn et al. 2012 PLOS ONE

Finding IBD in diploid data Basic idea: Scan all pairs of samples for long segments with no opposite homozygotes 01010002011000100200000101001102000 00120021002010000101001200000110011 Possible haplotype sharing Problem: Lots of false positives! Solution: Consider allele frequencies and LD information => likelihood ratio score (IBD vs. chance) Check overlapping regions for consistency (i.e., existence of maternal/paternal bicoloring); trust longer regions

End of step 1: Some good chunks, some bad chunks Step 1: Direct IBD-based phasing Samples 0 20 40 Genetic position (cm) We've used long IBD Phasing is great where >4cM IBD (detectable in diploids) exists How can we use shorter IBD?

Step 2: Locally refine phase using dip=hap1+hap2 constraint Input: Diploid {0,1,2} data for N samples + Haploid {0,1} provisional phasing (step 1) Procedure (for each sample in turn): Find long segments of provisional {0,1}-haplotypes consistent with {0,1,2} diploid sample In overlapping ~1cM windows, check existence of implied complementary haplotype: hap2 = dip hap1 Output: Improved phase calls (switch error rate ~1.5%) Samples Step 2: Local phase refinement 0 20 40 Genetic position (cm)

Using the dip=hap1+hap2 constraint Input: Diploid {0,1,2} data for N samples + Haploid {0,1} provisional phasing: 2N haplotypes??????????????????????????????????? (hap2) + 00010011001000000100000100000101000 (hap1) 00122021002010000101001200000110011 (dip) Possible haplotype sharing Find long segments of provisional {0,1}-haplotypes consistent with {0,1,2} diploid sample Need fast, error-tolerant search Locality-sensitive hashing (LSH) overcomes curse of dimensionality (work by Indyk, Motwani, et al.) Idea: multiple random hashes

Using the dip=hap1+hap2 constraint Input: Diploid {0,1,2} data for N samples + Haploid {0,1} provisional phasing: 2N haplotypes?????01000101000000100110000001???? (hap2) + 00010011001000000100000100000101000 (hap1) 00122021002010000101001200000110011 (dip) Possible haplotype sharing Does the implied haplotype (hap2 = dip hap1) exist among the provisional haplotypes? Need fast, error-tolerant search Locality-sensitive hashing (LSH) overcomes curse of dimensionality (work by Indyk, Motwani, et al.) Idea: multiple random hashes

End of step 2: Mostly good chunks; sporadic errors Step 2: Local phase refinement Samples 0 20 40 Genetic position (cm) We've used long+short IBD How can we refine the inference?

Step 3: Run two HMM-like iterations Input: Diploid {0,1,2} data for N samples + Haploid {0,1} provisional phasing (step 2) Procedure (for each sample in turn): Build small sets of locally best reference haplotypes Find approximate max-likelihood (Viterbi) path Instead of full dynamic programming, use beam search: branch and prune Step 3a: HMM iteration 1 Step 3b: HMM iteration 2 Samples Samples 0 20 40 Genetic position (cm) 0 20 40 Genetic position (cm)

LRP vs. HMM-based phasing LRP: top-down Search for long stretches of IBD; use homozygous sites to call phase Don t worry about all-het sites; clean up later HMM: bottom-up Model haplotype structure Get short-range phasing right first; then improve phase accuracy at longer scales Well-engineered HMMs (SHAPEIT2) implicitly make use of long IBD O Connell et al. 2014 PLOS Genet

Future direction: Best of both worlds? HMM vs. LRP: Modeling vs. Search accurate fast and accurate at very large sample sizes How about both? Positional Burrows-Wheeler Transform (PBWT, Durbin 2014): efficient data structure for haplotype storage/search

Eagle-PBWT (work in progress) N M 001000010100101000101110010001 010010100100100100011010010010 000100100100000100100100010000 0 1 0 1 0 0 1 0 0 1...... O MM time to build complete set of trees! O 1 time to move along a haplotype prefix! Series of haplotype trees...

Other future directions Reference-based phasing Haplotype Reference Consortium (HRC) has N=32K samples for phasing + imputation Applications of high-accuracy phasing Detecting allele-specific expression Detecting clonal mosaicism Inferring demography

Acknowledgments Alkes Price Pier Francesco Palamara Gaurav Bhatia Sasha Gusev Mark Lipson Bogdan Pasaniuc Nick Patterson Noah Zaitlen Eagle software: Loh et al. 2016 Nat Genet (in press) http://hsph.harvard.edu/alkes-price/software/

In-sample imputation accuracy: R 2 0.75 down to 0.1% MAF Benchmark: First 40cM of chr10 (~6K SNPs) 2% of genotypes randomly masked and imputed In-sample imputation R 2 measured on 120K British-ancestry samples Note: Eagle and SHAPEIT2 hard-call imputed haplotypes; dosages allow higher R 2 Mean in-sample imputation R 2 1 0.9 0.8 0.7 0.6 0.5 0.4 0.1-0.2 0.2-0.5 0.5-1 1-2 MAF bin (%) 2-5 Eagle 1x150K SHAPEIT2 1x150K SHAPEIT2 3x50K SHAPEIT2 10x15K 5-10 10-50

In-sample imputation accuracy remains high for non-british samples Benchmark: First 40cM of chr10 (~6K SNPs) 2% of genotypes randomly masked and imputed In-sample imputation R 2 stratified by selfreported ancestry (including minority populations): 5K any other white 1K Indian 1K Caribbean Mean in-sample imputation R 2 1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.1-0.2 British otherwhite Indian Caribbean 0.2-0.5 0.5-1 1-2 MAF bin (%) 2-5 5-10 10-50

Eagle pre-phasing improves downstream imputation accuracy MAF bin SHAPEIT2 10x15K Eagle 1x150K Difference 0.1 0.2% 0.574 (0.012) 0.594 (0.012) 0.020 (0.002) 0.2 0.5% 0.665 (0.010) 0.679 (0.010) 0.013 (0.002) 0.5 1% 0.753 (0.009) 0.765 (0.009) 0.012 (0.001) 1 2% 0.786 (0.008) 0.798 (0.008) 0.012 (0.001) 2 5% 0.812 (0.007) 0.822 (0.007) 0.010 (0.001) 5 10% 0.881 (0.007) 0.888 (0.006) 0.007 (0.000) 10 50% 0.924 (0.004) 0.928 (0.004) 0.004 (0.000) Chromosome-scale analyses: chr1 (short arm), chr10, chr20 Pre-phasing: Eagle, 1 batch of all 150K samples SHAPEIT2, 10 batches of 10K samples (due to computational cost) Imputation: Haplotype Reference Consortium (r1) PBWT imputation algorithm on Sanger Imputation Service