Fast and Accurate Haplotype Inference with Hidden Markov Model. Yi Liu


Fast and Accurate Haplotype Inference with Hidden Markov Model

Yi Liu

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science.

Chapel Hill 2013

Approved by: Wei Wang, Yun Li, Vladimir Jojic, Fernando Pardo Manuel de Villena, William Valdar, Ethan Lange

© 2013 Yi Liu ALL RIGHTS RESERVED

Abstract

YI LIU: Fast and Accurate Haplotype Inference with Hidden Markov Model (Under the direction of Wei Wang and Yun Li)

The genomes of humans and other diploid organisms consist of paired chromosomes. Haplotype information (the DNA constellation on a single chromosome) is crucial for disease association analysis and population genetic inference, among many other applications. It is, however, hidden in the data generated for diploid organisms (including humans) by modern high-throughput technologies, which cannot distinguish information from the two homologous chromosomes. Here, I consider the haplotype inference problem in two common scenarios of genetic studies:

1. Model organisms (such as laboratory mice): Individuals are bred through a prescribed pedigree design.
2. Out-bred organisms (such as humans): Individuals (mostly unrelated) are drawn from one or more populations or continental groups.

In both scenarios, an individual may share short blocks of chromosomes with other individuals or, if available, with founders. By identifying the shared blocks statistically, I have developed and implemented methods that reconstruct the haplotypes of the individuals under study accurately and more rapidly, and that solve important related problems including genotype imputation and ancestry inference. My methods, based on the hidden Markov model, can scale up to tens of thousands of individuals. Analysis based on my method leads to a new genetic map in a mouse population, which reveals important biological properties of the recombination process. I have also explored study design and empirical quality control for imputation tasks with large-scale datasets from admixed populations.

To my parents and my wife.

Acknowledgements

First of all, I would like to express my sincere thanks to my advisors, Drs. Wei Wang and Yun Li, for their continuous guidance and support, for being approachable any time I had a problem, for explaining things to me patiently even when I was in a memoryless state, and for giving me much freedom in working (and playing). I have been very lucky to have had chances to explore several different areas. I feel especially fortunate to have worked with Drs. Fernando Pardo Manuel de Villena and Vladimir Jojic. Fernando patiently taught me many basics of biology and helped me link computational methods to real biological problems; Vladimir introduced me to many optimization techniques and always inspired me through thoughtful questions. My thanks also go to my other research collaborators and committee members for helpful discussions on research and on completing my dissertation: William Valdar, Ethan Lange, Gary Churchill, Leonard McMillan, Xiang Zhang and all students (current and past) in the CompGen group and in the Li lab. I am very grateful to Qi Zhang for mentoring and helping me in many ways during my first year and my internships, and to Zhaojun Zhang and Qing Duan for carrying out research together.

Finally, I would like to thank my parents, Yongjian and Chengcui, for their endless support, and for buying me my first computer with all their savings 20 years ago. I am also deeply thankful to my wife, Ping, for her trust in me, and for following me across the ocean to every place we have lived in and to the next place we are heading for.

Table of Contents

List of Tables
List of Figures

1 Introduction
  Background
  DNA and Haplotype
  Genotype
  Model Organisms from Prescribed Breeding
  Samples from Out-bred Human Populations
  Thesis Statement
  Contributions
  Model Organisms from Prescribed Breeding
  Samples from Out-bred Human Populations

Efficient Genome Ancestry Inference in Complex Pedigrees with Inbreeding
  Introduction
  The Genome Ancestry Problem
  Modeling Inheritance in Pedigree
  2.3.1 Modeling Inbreeding Generations
  Integrating the Inbreeding Model
  Modeling the Collaborative Cross
  The Breeding Scheme
  Modeling the Genome of G2I_k Generation
  Experiments
  Experiments on Simulated Data
  Experiments on Real CC data
  Running Time Performance
  Discussion

High Definition Recombination Map in a Highly Divergent Mouse Population
  Introduction
  Materials and Methods
  The Genotype Data
  Haplotype Reconstruction and Recombination Inference
  Overview of the Recombination Map
  Sex Effect on Recombination
  Cold Regions
  Identification of Cold Regions in the G2I_1 Population
  External Validation of Cold Regions
  Genomic Analysis of Cold Regions
  Conclusion

4 MaCH-Admix: Genotype Imputation for Admixed Populations
  Introduction
  Materials and Methods
  General Framework
  Piecewise IBS-based Reference Selection
  Ancestry-weighted Approach
  MaCH-Admix
  Datasets
  Methods Compared
  Measure of Imputation Quality
  Results
  WHI-AA and WHI-HA with the 1000G Reference
  HapMap ASW and MEX with the 1000G Reference
  Imputation Performance with HapMap References
  WHI-HA and WHI-AA with HapMap references
  HapMap ASW and MEX with HapMap references
  Running Time
  Discussion

Genotype Imputation of Metabochip SNPs in African Americans Using a Study Specific Reference Panel
  Introduction
  Materials and Methods
  Pre-Imputation Quality Control
  5.2.2 General Pipeline for Reference Construction and Subsequent Imputation
  Results
  Genomewide Imputation
  Quality Estimate by Masking GWAS SNPs
  Quality Estimate by Masking Reference Individuals
  Overall Imputation Performance and Practical Guidelines
  Rare SNPs during Haplotype Reconstruction
  Discussion

Conclusion
  Future Directions
  Model Organisms from Prescribed Breeding
  Samples from Out-bred Human Populations

Bibliography

List of Tables

2.1 All Possible Transitions of S(a),S(b)
Summary of Identified Recombination Events in G2I_1 Mice
List of Cold Regions Identified
Median Half Life of r² (in Kb)
Imputation Results of WHI-HA Individuals over Five 5Mb Regions with the 1000G reference
Imputation Results of WHI-AA Individuals over Five 5Mb Regions with the 1000G reference
Imputation Results of HapMap ASW & MEX Individuals over Five 5Mb Regions with the 1000G reference
Imputation Results of WHI-HA Individuals over Five 5Mb Regions with the HapMapII reference
Imputation Results of WHI-AA Individuals over Five 5Mb Regions with the HapMapII reference
Imputation Results of 49 ASW Individuals Over All Five Short Regions
Imputation Results of 49 ASW Individuals Over All Five Short Regions
Imputation Results of 50 MEX Individuals Over All Five Short Regions
Average Dosage r² by MAF, Estimated by Masking 2% GWAS SNPs
5.2 Average Rsq and Dosage r² by MAF, Estimated by Masking 100 Reference Individuals
Effect of Including Rare Variants for Reference Panel Construction
Effect of Including Rare Variants for Haplotype Reconstruction among Target Individuals
Effect of Including/Excluding the 100 Masked Reference Individuals during Reference Haplotype Reconstruction
Average Rsq and Dosage r² by MAF, Estimated by Masking One Reference Individual at a Time

List of Figures

1.1 Toy example of two chromosomes with haplotypes defined on three sites containing variations
Inheritance indicators of an inbreeding process
Comparison of predicted probabilities and observed probabilities from simulations
Collaborative Cross breeding scheme and the corresponding inheritance indicators
Comparison of error rates of GAIN, MERLIN and HAPPY on simulated data sets
Proportion of probabilities assigned to wrong ancestry by GAIN and HAPPY on simulated data sets
The difference in ancestry estimated by GAIN and HAPPY
Two examples of ancestry inference by GAIN and HAPPY
Average running time of GAIN, HAPPY and MERLIN
The CC funnel pedigree to G2I_1 generation
Distribution of recombination interval length in log-scale
Recombination map length of autosomes by Prdm9 allele and gender
Distribution of recombination events along the autosomes in female and male meioses
Distribution of single and double recombination events along the autosomes in female and male meioses
4.1 A cartoon illustration of two scenarios where three IBS-based selection methods perform differently
Median r² half-life value of 5Mb windows on 5 chromosomes
Imputation of 3587 WHI-HA with the 1000G reference panel
Imputation of 8421 WHI-AA with the 1000G reference panel
Minor Allele Frequency (MAF) distribution of SNPs in WHI-AA and WHI-HA
Imputation of 49 HapMap ASW and 50 HapMap MEX individuals with the 1000G reference panel
Imputation quality of ASW with HapMapII CEU+YRI+LWK+MKK reference panel
Reference construction and imputation pipeline using a study-specific reference panel
Imputation accuracy by chromosome for 2% randomly masked GWAS SNPs
Rsq by dosage r² for 2% randomly masked GWAS SNPs
MAF distributions of Affymetrix 6.0 and Metabochip SNPs
Physical spreading of Affymetrix 6.0 and Metabochip SNPs
Imputation accuracy by chromosome for Metabochip SNPs (estimated by masking 100 reference individuals)
Accuracy and calibration of imputation
Rsq by dosage r² for Metabochip SNPs (estimated by masking 100 reference individuals)

Chapter 1 Introduction

Recent technological advances in the life sciences have generated massive amounts of data, which enable accurate analyses of genome ancestry, recombination properties, complex disease susceptibility, and drug response, among many others. However, haplotype information is often more powerful in such analyses than the data obtained directly from high-throughput technologies such as genotyping. Therefore, how to reconstruct haplotype information from massive amounts of raw data, and how to make related inferences based on the recovered haplotypes, are key problems in genetic studies that pose serious computational challenges. In this thesis, I have developed statistical methods and computational tools that, by reconstructing haplotype information, generate accurate inferences for important problems including genome ancestry and imputation. My methods, based on the Hidden Markov Model (HMM), can efficiently handle large-scale datasets from two common settings in modern genetic studies:

1. Model organisms (such as laboratory mice): Individuals are bred through a prescribed pedigree design.
2. Out-bred organisms (such as humans): Individuals (mostly unrelated) are drawn from one or more populations or continental groups.

1.1 Background

DNA and Haplotype

Diploid species, which include nearly all mammals, carry paired homologous chromosomes, one inherited from each parent. A haplotype refers to the DNA sequence of one of the paired chromosomes. Within the same species, DNA sequences are largely identical, differing only slightly among individuals. Thus haplotypes are often defined only at positions with sequence variations. Figure 1.1 shows a toy example of two chromosomes with 15 sites and the two haplotypes defined at the sites with variations.

Figure 1.1: Toy example of two chromosomes with haplotypes defined on three sites containing variations

Haplotype knowledge describes how genetic material is inherited from generation to generation. It thus provides direct knowledge of genome ancestry and historical recombination events. Furthermore, by utilizing haplotype sharing information, one can fill in missing genotypes (imputation) [Li et al., 2009]. Haplotypes are also important to many other fundamental problems in genetics. To name a few: (1) linkage analysis and linkage disequilibrium patterns [Stephens et al., 2001; Wall et al., 2003]; (2) mapping complex traits and diseases [Johnson et al., 2001; Altshuler et al., 2008]; (3) selection, evolution and historical migration in population genetics [Sabeti et al., 2002; Merriwether et al., 1995]. In these problems, haplotype information, even if reconstructed with uncertainty, can lead to significantly increased power in inference.

Even though it is possible to obtain the haplotypes of diploid organisms directly from biological experiments, doing so is generally expensive and does not scale to large sample sizes. In contrast, modern high-throughput genotyping technologies can generate accurate genotype readings on hundreds of thousands of markers at much lower cost. It is thus valuable to conduct analyses by reconstructing haplotypes from genotype inputs.

Genotype

Modern high-throughput genotyping technologies generate genotype readings on a preselected set of genetic markers. The set of markers can be defined by standard commercial platforms (e.g., Affymetrix 6.0, Illumina 1M), or customized by researchers (e.g., Yang et al. [2009]). Each genotype reading, or simply genotype, is an unordered combination of two alleles from the paired chromosomes. In other words, a genotype cannot distinguish between the two haplotypes of a diploid organism: it cannot tell which allele comes from which haplotype.

In this dissertation, I consider how to bridge the gap between genotype data and the desired genetic analyses by reconstructing haplotypes probabilistically. I consider two common settings in genetic studies and the inference problems specific to each setting.

1.2 Model Organisms from Prescribed Breeding

Model organisms, such as laboratory mice, are frequently bred or crossed in order to study genetic influences [Churchill et al., 2004; Valdar et al., 2006; Chia et al., 2005]. Often, organism resources are generated using a prescribed breeding system to ensure diversity and reproducibility, which leads to a complex pedigree structure consisting of many generations. Through recombination, the DNA sequences of founder organisms are intermixed in each generation. A DNA sequence of any descendant organism is a mosaic of its founders' DNA segments.

One example of such resources is the international Collaborative Cross (CC) project, which is a major effort in the mouse research community and has been under development for more than 10 years [Threadgill and Churchill, 2012]. The CC project consists of hundreds of independently bred, recombinant-inbred mouse lines generated through a funnel breeding design (Figure 2.3). Each line is expected to go through more than 20 generations. High-density genotype data of the CC resources not only provide opportunities for fine-resolution quantitative trait locus (QTL) studies, but also facilitate exciting new research areas such as the inference of genetic networks underlying phenotypic traits in mammals.

Among many analyses of interest, a core problem is to discover the founder attribution of genomes in subsequent generations. That is, given a descendant organism in the resource, I want to find out which part of its DNA sequences is inherited from which founder (genome ancestry in founders). The genome ancestry information provides direct knowledge of historical recombination events and opportunities for error detection and imputation. It also enables downstream analyses such as measuring strain effects in quantitative traits.

Inference of genome ancestry involves resolving the potential inheritance flow at all markers of interest. This naturally requires the resolution of haplotype information, as haplotypes correspond to the variants inherited together in the breeding process. It is straightforward to show that, in a pedigree with n non-founders and m markers of interest, there are $2^{mn}$ possible inheritance configurations, even if one assumes known founder haplotypes and only bi-allelic markers. In a typical CC pedigree, there could be more than 40 mice, and the enormous search space presents a major computational challenge.
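To give a sense of scale for the search space mentioned above, the following minimal sketch counts the decimal digits of the number of inheritance configurations. It assumes the count is read as 2^(mn), with n non-founders and m markers; the function name and the example values (40 non-founders, 10,000 markers) are illustrative choices, not figures taken from the CC data.

```python
import math


def digits_in_configuration_count(n_nonfounders: int, n_markers: int) -> int:
    """Return the number of decimal digits of 2**(m*n).

    Assumption: the number of inheritance configurations is 2**(m*n),
    following the count quoted in the text. Logarithms are used so that
    the huge integer never has to be converted to a string.
    """
    exponent = n_nonfounders * n_markers
    return math.floor(exponent * math.log10(2)) + 1


if __name__ == "__main__":
    # Hypothetical sizes: 40 non-founders, 10,000 markers on one chromosome.
    print(digits_in_configuration_count(40, 10_000))
```

Even for these modest sizes the count has far more digits than any exhaustive enumeration could handle, which is why the methods discussed next rely on dynamic programming over markers rather than on enumeration.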
The commonly favored pedigree-based haplotyping methods [Kruglyak et al., 1996; Abecasis et al., 2001; Gudbjartsson et al., 2005] are all based on the Lander-Green algorithm [Lander and Green, 1987], because its running time is linear in the number of markers, which far exceeds the other parameters. However, these methods are limited to pedigrees of moderate size, since the running time grows exponentially with pedigree size. When they are applied to the genotype data from the CC, the search space becomes extraordinarily large due to the large pedigree structure with many untyped intermediate generations. Other pedigree-based haplotyping methods include MCMC sampling methods [Sobel and Lange, 1996; Jensen and Kong, 1999], whose computing time can be substantial when applied to a large number of tightly linked markers, and rule-based methods [Qian and Beckmann, 2002; Li and Jiang, 2005], which produce a crude approximation by minimizing recombinations in the pedigree.

More computationally efficient approaches for solving the genome ancestry problem have ignored pedigree information, including the breeding scheme. Examples include the combinatorial optimization approach by Zhang et al. [2008] and the HMM-based method in HAPPY [Valdar et al., 2006; Mott et al., 2000], a QTL mapping tool suite for association studies. Both methods consider all ancestry compositions possible. While the breeding design does not determine the locations of recombination, it places important constraints on the possible ancestry choices at a single marker and at neighboring markers. Therefore, incorporating breeding design information would lead to more accurate inference.

1.3 Samples from Out-bred Human Populations

The ultimate goal of almost all genetic research is to understand genetic mechanisms in humans. Therefore, tremendous efforts have been spent on investigating human samples directly. In contrast to model organisms, where breeding is often designed and controlled, humans are out-bred and the genetic data of founders are generally unavailable.
Since Risch and Merikangas [1996] showed that association studies are more powerful than linkage studies, the genetic data collected for humans in the past one and a half decades have come largely from unrelated individuals. The consequence of out-breeding, lack of founder genetic information, and use of unrelated individuals is that the individuals studied tend to share only short haplotype segments (e.g., several hundred Kbs) of their chromosomes. This is further confounded by the presence of population and sub-population structure. Reconstruction of haplotypes in such out-bred populations is therefore challenging but of great importance in genetic studies.

By aligning the samples under study to samples in existing studies (e.g., the HapMap and 1000 Genomes projects [The International HapMap Consortium, 2010; The 1000 Genomes Project Consortium, 2012]), researchers can identify the haplotype segments shared among samples. Consequently, one can not only recover sporadic technological failures in genotypes, but also impute markers that are untyped in individual studies but typed in the reference samples. This genotype imputation technique greatly improves the marker density and analysis power of individual studies.

Moreover, because the typically small to moderate effect of an individual genetic variant on a complex trait entails a large sample size, collaborative efforts that pool information across multiple studies are commonly undertaken to enhance the statistical power for detecting causal variants. In these collaborative efforts, samples from different studies are typically genotyped at different sets of markers because different commercially available genotyping platforms are used. The commonly used genotyping platforms have only a small fraction of markers in common ( 10% is typical between platforms from two different companies). Restricting analysis to the markers in common leads to much reduced marker density and a huge loss of information. Imputation of the markers untyped in individual studies greatly facilitates the integration of samples across studies (meta-analysis).
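The idea of imputing untyped markers from shared haplotype segments can be illustrated with a deliberately simplified sketch. It is not the HMM-based approach of the imputation methods discussed in this dissertation: it assumes phased, binary-coded haplotypes and copies missing alleles from the single reference haplotype that agrees best with the target at the typed markers, whereas HMM-based methods let the best-matching reference change along the chromosome. The function name and toy data are hypothetical.

```python
from typing import List, Optional


def impute_from_best_match(
    target: List[Optional[int]],
    reference_panel: List[List[int]],
) -> List[int]:
    """Fill untyped markers (None) in a target haplotype.

    Simplifying assumption: the whole target haplotype is copied from the
    one reference haplotype with the most allele matches at typed markers.
    """
    typed = [i for i, allele in enumerate(target) if allele is not None]

    def n_matches(ref: List[int]) -> int:
        # Count typed markers where the reference carries the same allele.
        return sum(ref[i] == target[i] for i in typed)

    best = max(reference_panel, key=n_matches)
    return [best[i] if allele is None else allele
            for i, allele in enumerate(target)]


if __name__ == "__main__":
    # Toy reference panel typed at five markers (alleles coded 0/1).
    panel = [
        [0, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 1, 1, 0, 1],
    ]
    # The study sample is typed at markers 0, 2 and 4 only.
    sample = [1, None, 0, None, 1]
    print(impute_from_best_match(sample, panel))
```

The sketch also shows why marker overlap matters: the fewer markers a study shares with the reference, the less information there is to decide which reference haplotype the sample resembles.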
Several HMM-based imputation methods [Li et al., 2010a; Howie et al., 2009; Browning and Browning, 2009] have previously been developed that work by reconstructing haplotypes. They have been shown to achieve good imputation performance in a number of populations [Pei et al., 2008; Huang et al., 2009], particularly those with a high level of linkage disequilibrium (LD) or with closely matched reference population(s) from the HapMap or the 1000 Genomes Projects [The International HapMap Consortium, 2010; The 1000 Genomes Project Consortium, 2010].

The wealth of literature using genotype imputation has focused on using external reference panels (for example, phased haplotypes from the HapMap and 1000 Genomes projects), largely in individuals of European ancestry, for inference of genotypes at common (minor allele frequency [MAF] > 0.05) genetic markers. Several important issues have not been adequately addressed, including the utility of study-specific references, the accommodation of increasingly large reference panels, performance in admixed populations, and quality for less common (MAF ) and rare (MAF < 0.005) variants. These issues only recently became addressable with Genome-Wide Association (GWA) follow-up studies using dense genotyping or sequencing in large samples of non-European individuals. Also, little methodological work exists for imputation in admixed populations, such as African Americans and Hispanic Americans, which comprise more than 20% of the US population. Admixed populations offer a unique opportunity for gene mapping, but also impose challenges for imputation.

To benefit efficiently from emerging large reference panels, one key issue is how to traverse the part of the reference space harboring the most probability mass with minimum computational effort. In the modern genotype imputation framework, this corresponds to the selection of effective reference panels. Existing work has often focused on constructing a pre-defined reference panel prior to running the imputation engine. Such methods (e.g., a cosmopolitan panel [Hao et al., 2009; Li et al., 2009; Shriner et al., 2010] or a weighted combination panel [Egyud et al., 2009; Huang et al., 2009; Pasaniuc et al., 2010; Pemberton et al., 2008]) have limited flexibility and aggravate the already heavy computational burden.
Another approach, based on whole-haplotype closeness heuristics, has been adopted by IMPUTE2 [Howie et al., 2009] and can be embedded within other existing imputation models. The above-mentioned methods have shown promising results but have not been evaluated systematically. In addition, both categories of methods can be further improved statistically and computationally, for example, through integration of the former approach within (rather than prior to) the hidden Markov model, or through more elegant heuristics.

1.4 Thesis Statement

Genetic analyses of model organism resources and out-bred populations can be achieved by reconstructing haplotype information implicitly or explicitly via HMM. By applying effective state-space pruning strategies, I present haplotype-based inference algorithms that can scale to large datasets without compromising accuracy. Application to CC mouse data leads to new biological discoveries about the properties of recombination events. A case study on Women's Health Initiative (WHI) Metabochip data leads to generalizable quality control guidelines for imputation analysis.

1.5 Contributions

In this section, I briefly summarize the contributions presented in subsequent chapters.

Model Organisms from Prescribed Breeding

In Chapter 2, I propose a method, GAIN, to infer genome ancestry in organism resources. The method can efficiently handle complex pedigrees with inbreeding, which is an important process in generating organism resources. Using a pair of dependent quaternary indicators to capture all recombinations in the inbreeding history, my method achieves accurate ancestry inference without the need to explicitly model every intermediate generation. By encoding the inbreeding model into the inheritance vectors, I design a Lander-Green-like algorithm whose running time remains constant with respect to the number of inbreeding generations. GAIN is implemented and evaluated on the CC high-density single-nucleotide polymorphism (SNP) data with complex breeding design. Experiments show that GAIN generates accurate results efficiently on data that cannot be handled by existing pedigree haplotyping software. Compared with HAPPY [Mott et al., 2000], which does not model pedigree structure, GAIN substantially reduces ambiguities in ancestry inference.

In Chapter 3, I generate a new linkage map of the laboratory mouse genome using the GAIN method described in the previous chapter. The map is built with the recombination and ancestry information inferred from the genotypes of 237 male-female sibling pairs. Exploiting the large number of recombination events (n 22,000), the high precision in mapping each event ( 35kb) and the unique characteristics of the CC mice, I provide a new and powerful look at the effects of sex, strain and genotypes at polymorphic loci of interest (e.g., the Prdm9 gene) on recombination. In addition to an extended catalog of sex- and strain-specific hotspots, I report the presence of cold regions for recombination with striking distributions and genomic characteristics.

Samples from Out-bred Human Populations

In Chapter 4, I propose and evaluate a number of methods for effective reference panel construction to improve haplotype-based imputation engines. Using a novel piecewise IBS method, my software package MaCH-Admix yields consistently higher imputation quality than existing methods and software. I evaluated the performance on individuals from recently admixed populations, including 8421 African Americans and 3587 Hispanic Americans from the Women's Health Initiative (WHI), which allow assessment of imputation quality for uncommon variants. The advantage is particularly noteworthy among uncommon variants, where up to 5.1% information gain is observed, with the difference being highly significant (Wilcoxon signed rank test P-value < ). This work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison.

In Chapter 5, I present a case study of imputation in a large cohort of African Americans from the Women's Health Initiative (WHI) study. This study addresses three under-studied aspects: (1) imputation of markers from a region-centric platform that are largely of low frequency; (2) imputation using a study-specific reference panel; and (3) imputation in an admixed population. In this study, I describe a pipeline for constructing study-specific reference panels using individuals genotyped or sequenced at a larger set of genetic markers, and for imputation into individuals with genotype data at a subset of markers. I demonstrate several approaches to reliably estimate imputation quality for SNPs in different MAF categories. Experimental results suggest that imputation of region-centric SNPs, including low frequency SNPs with MAF , is feasible and well worthwhile for the power increase in downstream association analysis. I further provide practical guidelines regarding post-imputation quality control.

Chapter 2 Efficient Genome Ancestry Inference in Complex Pedigrees with Inbreeding

2.1 Introduction

Model organisms, such as laboratory mice, are frequently bred or crossed in order to study genetic influences [Churchill et al., 2004; Valdar et al., 2006; Chia et al., 2005]. Often, such animal resources are generated using a prescribed breeding system to ensure diversity and reproducibility, which leads to a complex pedigree structure consisting of many generations. Through recombination, the DNA sequences of founder organisms are intermixed in each generation. A DNA sequence of any descendant organism is a mosaic of its founders' DNA segments. As recombinations at each breeding stage cannot be observed directly, it is of great interest to infer the ancestry of the resulting DNA sequences, in other words, which part of a resulting DNA sequence is inherited from which founder.

The vast majority of sequence variations are attributed to single base-pair mutations known as single-nucleotide polymorphisms (SNPs), which makes SNPs ideal for resolving the genome ancestry problem. The set of SNPs on the same chromosome constitutes a haplotype. While any of the four nucleotides (A, T, C, G) is possible, in practice nearly all SNPs appear in only two variations. This results from the fact that SNPs originate as mutations, which are rare events within a vast genome. It is therefore convenient to encode a SNP allele as a binary value and to represent haplotypes as binary sequences. Modern high-throughput genotyping technologies are unable to distinguish between the two haplotypes of a diploid organism. Instead, a genotype sequence is measured where, at each SNP site, one of three possibilities is observed ({00, 01, 11}, since 10 cannot be distinguished from 01).

Using the genotype representation for DNA sequences, the genome ancestry problem estimates the origin of each genotype in a descendant's sequence given the genotype sequences of its distant founders. To achieve high resolution, dense SNP markers are used (tens of thousands on each chromosome). Knowledge of a genotype's ancestry is particularly useful in many problems, such as studying the structure and history of haplotype blocks [Gabriel et al., 2002; Zhang et al., 2002; Schwartz et al., 2004] and mapping quantitative trait loci (QTLs) [Valdar et al., 2006; Mott et al., 2000]. In these studies, a probabilistic interpretation is favored over discrete solutions, due to the prevalence of ambiguities and measurement errors.

The genome ancestry problem is closely related to haplotype inference with pedigree data. Inferring haplotypes in a pedigree often involves solving the inheritance flow of alleles at each generation. Conversely, given the genome ancestry information, it is straightforward to reconstruct the descendant haplotypes. As pedigree analysis is NP-hard [Piccolboni and Gusfield, 2003], existing algorithms are either approximate or suffer exponential running times. Among the maximum likelihood approaches, methods [Kruglyak et al., 1996; Abecasis et al., 2001; Gudbjartsson et al., 2005] based on the Lander-Green algorithm [Lander and Green, 1987] are often favored because their running time is linear in the number of markers.
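The loss of phase in the genotype encoding described earlier in this section can be made concrete with a short sketch. It assumes genotypes are coded as the count of allele 1 at each site (0 for 00, 1 for 01, 2 for 11), which is one common convention and not necessarily the one used later in this chapter; the function name is illustrative. The function enumerates every unordered pair of binary haplotypes that is consistent with a genotype sequence.

```python
from itertools import product
from typing import List, Set, Tuple

Haplotype = Tuple[int, ...]


def consistent_haplotype_pairs(genotype: List[int]) -> Set[Tuple[Haplotype, Haplotype]]:
    """Enumerate unordered haplotype pairs compatible with a genotype.

    Assumed coding: 0 = homozygous 00, 1 = heterozygous 01, 2 = homozygous 11.
    Only heterozygous sites are ambiguous, so the loop runs over all
    assignments of alleles at those sites.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        # Sort the two haplotypes so that (h1, h2) and (h2, h1) coincide.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs


if __name__ == "__main__":
    for pair in sorted(consistent_haplotype_pairs([1, 2, 1, 0])):
        print(pair)
```

With h heterozygous sites (h at least 1) there are 2^(h-1) distinct unordered pairs, so the ambiguity grows exponentially with the number of heterozygous markers. This is one reason a probabilistic treatment is preferred over enumerating discrete solutions.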
MERLIN [Abecasis et al., 2001], an implementation based on sparse binary trees, is one of the most successful pedigree analysis programs. Unfortunately, methods based on the Lander-Green algorithm are limited to pedigrees of moderate size, since the running time grows exponentially with pedigree size. MCMC sampling methods [Sobel and Lange, 1996; Jensen and Kong, 1999] have been proposed to address larger pedigrees, but their computing time can be substantial when applied to a large number of tightly linked markers. Other efforts include rule-based methods [Qian and Beckmann, 2002; Li and Jiang, 2005], which approximate a solution by minimizing recombinations in the pedigree (MRHC). PedPhase [Li and Jiang, 2005], which employs an effective integer linear programming (ILP) formulation, has been widely used in solving the MRHC.

Current haplotyping methods for pedigrees are incapable of solving the genome ancestry problem in animal resources for the following reasons: 1) Pedigrees of model animal resources often contain a large number of generations to ensure diversity and reproducibility. 2) None or few of the intermediate generations are genotyped, due to the size of the resources. 3) A large number of dense markers are genotyped to achieve fine resolution. As a concrete example, more than one thousand lines have been started in the Collaborative Cross project [Churchill et al., 2004; The Collaborative Cross Consortium, 2012]. Each line is expected to undergo at least 23 generations before reaching 99% inbred. Hundreds of mice of various generations were genotyped, but on average only a few are from the same line. The missing genotypes make the search space extraordinarily large.

Other computationally efficient approaches for solving the genome ancestry problem have largely ignored the breeding scheme. While the breeding design does not determine the locations of recombination, it often places constraints on the possible ancestry choices at a single site and at neighboring sites. The genome ancestry problem was modeled as a combinatorial optimization problem in [Zhang et al., 2008], where discrete solutions are generated by minimizing recombinations. Mott et al. proposed an approach using a Hidden Markov Model (HMM) for ancestry inference in HAPPY [Valdar et al., 2006; Mott et al., 2000], a QTL mapping tool suite for association studies. All founder pairs are considered as possible hidden states for emitting the observed genotype at each site. Besides founder genotypes, no pedigree data are used in these two approaches.

There have also been many efforts to analyze pedigrees by identifying symmetries in the HMM state space [Donnelly, 1983; McPeek, 2002; Browning and Browning, 2002; Geiger et al., 2009]. The states are then grouped to accelerate the calculation. However, finding the maximal grouping is non-trivial. In real-world problems, only obvious symmetries, such as founder phase and chain structure in the pedigree, can be exploited well.

Besides model organisms, the genetic ancestry problem has been studied for human individuals that have recently been admixed from a set of isolated populations, instead of a set of founders [Tang et al., 2006; Sundquist et al., 2008; Sankararaman et al., 2008; Paşaniuc et al., 2009]. In this problem, pedigree structure is usually not present (unrelated individuals) or the pedigree is small. Efficient methods have been developed to handle large-scale datasets [Tang et al., 2006; Sundquist et al., 2008; Sankararaman et al., 2008].

Leveraging the observation that large animal resource pedigrees often contain repetitive sub-structures, I propose a method that can efficiently handle complex pedigrees with inbreeding, which is an important process in generating animal resources. Using a pair of dependent quaternary indicators to capture all recombinations in the inbreeding history, my method achieves accurate ancestry inference without explicitly modeling every generation. By encoding the inbreeding model into the inheritance vectors, I design a Lander-Green-like algorithm whose running time remains constant with respect to the number of inbreeding generations. My method is implemented and evaluated on the Collaborative Cross breeding design [Chesler et al., 2008; The Collaborative Cross Consortium, 2012] with dense SNP data. Experiments show that my approach generates accurate results efficiently on data that cannot be handled by existing pedigree haplotyping software. Compared with HAPPY, which does not consider pedigree structure, my approach significantly reduces ambiguities and errors in ancestry inference.

2.2 The Genome Ancestry Problem

Given a pair of chromosomes, consider $L$ SNP markers ordered by their chromosomal locations. For each SNP site, we use 0 and 1 to encode the two possible values. The genotype at each site is the unordered combination of the corresponding alleles from both chromosomes, which can assume one of three values: 00, 01, 11. A genotype sequence is a genome-ordered set of genotypes denoted as $G = g_1 \ldots g_l \ldots g_L$ with $g_l \in \{00, 01, 11\}$. A haplotype $H = h_1 \ldots h_l \ldots h_L$ consists of the alleles from one of the chromosomes, where $h_l \in \{0, 1\}$.

Consider a pedigree containing a set of founders $FS = \{F_1, \ldots, F_N\}$ and a descendant of interest. I denote the set of founder genotype sequences by $\{G_{F_1}, \ldots, G_{F_N}\}$, all of which are given. Given the genotype sequence $G_D$ of the descendant generated through the pedigree structure, its genome ancestry is to be determined. Every genotype $g_l$ in $G_D$ inherits its alleles from two founders, say $F_A$ and $F_B$. I refer to the founder pair $(F_A, F_B)$ as the genome ancestry at site $l$ of genotype sequence $G_D$. I want to estimate, for every SNP site $l$, the probability $P(\mathrm{Ancestry}(g_l) = (F_A, F_B))$ for every founder pair $(F_A, F_B) \in FS \times FS$. Note that founder pairs are unordered, $(F_A, F_B) = (F_B, F_A)$, and it is possible that $F_A = F_B$.

2.3 Modeling Inheritance in Pedigree

I start from the standard Lander-Green approach to model a pedigree: at each SNP site, an inheritance indicator is used to indicate the outcome of each meiosis. These inheritance indicators together form the inheritance vector. Since a child haplotype inherits its allele from either the paternal or the maternal sequence, an inheritance indicator is a binary variable. For a pedigree with $n$ non-founder animals, there are $2n$ inheritance indicators at each site. Hence, the inheritance vector at site $l$, $v_l$, can be defined as a binary sequence of length $2n$. An instance of $v_l$ specifies a possible configuration of

inheritance flow at site $l$ of all animals in the pedigree. When SNP markers are dense enough, one can assume at most one recombination between two sites in generating one haplotype. If a recombination happens between sites $l$ and $l+1$, the corresponding inheritance indicator will have different states for the two sites. Hence, to measure the number of recombinations between $l$ and $l+1$ in the whole pedigree, one can count the bits that differ between $v_l$ and $v_{l+1}$. The probability of having $d$ recombinations between $l$ and $l+1$ is $\theta^d (1-\theta)^{2n-d}$, where $\theta$ is the recombination fraction.

The length of the inheritance vector grows linearly with the number of animals in the pedigree, and this causes exponential growth in the number of possible inheritance patterns. Because full pedigree analysis is computationally intractable, I overcome the issue by modeling important sub-structures in breeding systems as a shortcut to efficient computation. My first natural choice of sub-structure is inbreeding: 1) Inbreeding is often used in model animal resources to generate genetically diverse and/or reproducible descendants. 2) Inbreeding is often carried out for many generations, and each generation elongates the inheritance vectors by 4 bits. Hence, if a pedigree involves inbreeding, the inbreeding generations often account for most of the computational complexity. I seek an aggregated inheritance indicator to replace the collection of many inheritance indicators in the inbreeding process. Such an aggregated indicator can be encoded in a much shorter length and incorporated into the inheritance vector. If the state and transition probability of the aggregated indicator can be modeled efficiently, full pedigree analysis will become feasible on these animal resources.
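As a minimal sketch of the transition model just described, the Python snippet below encodes an inheritance vector as an integer bit mask and evaluates $\theta^d (1-\theta)^{2n-d}$ from the Hamming distance $d$ between two vectors. The function names and the bit-mask encoding are illustrative assumptions, not part of the original implementation.

```python
def hamming_distance(v1: int, v2: int) -> int:
    """Count the inheritance indicators (bits) that differ between two vectors."""
    return bin(v1 ^ v2).count("1")


def transition_probability(v_l: int, v_next: int, n_nonfounders: int, theta: float) -> float:
    """P(v_{l+1} | v_l) = theta^d * (1 - theta)^(2n - d), with d recombinations.

    Assumption: each of the 2n binary indicators flips independently with
    probability theta between two adjacent sites.
    """
    bits = 2 * n_nonfounders
    d = hamming_distance(v_l, v_next)
    return theta ** d * (1.0 - theta) ** (bits - d)


if __name__ == "__main__":
    # Two non-founders give 4 indicators; the probabilities over all 16 targets sum to 1.
    total = sum(transition_probability(0b0101, v, 2, 0.01) for v in range(16))
    print(total)
```

Summing over all $2^{2n}$ target vectors recovers a total probability of one, which is a quick sanity check on any transition matrix built this way.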
In the next section, I explain how inheritance in inbreeding generations can be modeled as an aggregated indicator.

Modeling Inbreeding Generations

During inbreeding, offspring are produced by sibling matings for many generations. At each generation, four new haplotypes are formed by recombining the four haplotypes from the previous generation. The inbreeding process at a single site is shown in Figure 2.1(a).

Figure 2.1: (a) Lattice of binary inheritance indicators representing the inheritance pattern of an inbreeding process at a single site. (b) An equivalent quaternary indicator representation.

I denote the beginning generation of inbreeding as generation $I_0$. Observe that, at each site, because of the symmetry of the inbreeding structure, the four alleles at generation $I_0$ have equal probabilities of being passed down to any haplotype after $I_1$. Thus, for a descendant haplotype at generation $I_k$ ($k > 2$), I can simply replace the lattice of binary inheritance indicators by a single quaternary indicator. Each choice of the quaternary indicator has probability 1/4. Two quaternary indicators are needed for the two haplotypes of an $I_k$ descendant (Figure 2.1(b)). However, the two quaternary indicators are not independent, as the two haplotypes share the same inbreeding history until $I_{k-1}$. To model this dependency between the two quaternary indicators, I determine the transition events and probabilities of the pair of indicators. The grouped pair is then used as an aggregated inheritance indicator as discussed above.

I label the four $I_0$ haplotypes as 1, 2, 3, 4. I then denote by $a, b$ the two $I_k$ descendant haplotypes; $S(a_l), S(b_l)$ are their $I_0$ sources at site $l$, i.e., $S(a_l), S(b_l) \in \{1,2,3,4\}$. Their $I_0$ sources along the chromosome are denoted by $S(a), S(b) \in \{1,2,3,4\}^L$. A transition happens in $S(a)$ between sites $l$ and $l+1$ if $S(a_l) \neq S(a_{l+1})$. I consider, between two

adjacent sites, $l$ and $l+1$, all the possible transitions from $S(a_l), S(b_l)$ to $S(a_{l+1}), S(b_{l+1})$ (Table 2.1). Note that:

\[ P_{EE0} + P_{EN1} + P_{EE2} + P_{EN2} = P(S(a_l) = S(b_l)) = P_{EE0} + P_{EE2} + P_{NE1} + P_{NE2} = P(S(a_{l+1}) = S(b_{l+1})) \]

and

\[ P_{NE1} + P_{NN0} + P_{NN1} + P_{NN2} + P_{NE2} = P(S(a_l) \neq S(b_l)) = P_{EN1} + P_{EN2} + P_{NN0} + P_{NN1} + P_{NN2} = P(S(a_{l+1}) \neq S(b_{l+1})) \]

The prior probability $P(S(a_l) = S(b_l))$ at any site $l$ is called the inbreeding coefficient [Wright, 1922]. To calculate the probability, let $IC_k$ denote the inbreeding coefficient at generation $I_k$. $IC_k$ can be computed recursively using

\[ IC_k = \sum_{j=0}^{k-2} \left(\frac{1}{2}\right)^{k-j} (1 + IC_j). \]
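The recursion for the inbreeding coefficient can be sketched in a few lines of Python. The function name is an illustrative assumption; the empty sum for $k = 0$ and $k = 1$ gives $IC_0 = IC_1 = 0$, which I take as the intended base case.

```python
def inbreeding_coefficients(k_max: int) -> list:
    """Return [IC_0, ..., IC_{k_max}] with IC_k = sum_{j=0}^{k-2} (1/2)^(k-j) * (1 + IC_j)."""
    ic = []
    for k in range(k_max + 1):
        # range(0, k - 1) covers j = 0 .. k-2; it is empty for k = 0 and k = 1.
        ic.append(sum(0.5 ** (k - j) * (1.0 + ic[j]) for j in range(0, k - 1)))
    return ic


if __name__ == "__main__":
    for k, value in enumerate(inbreeding_coefficients(6)):
        print(k, value)
```

Splitting off the $j = k-2$ term shows that the sum is equivalent to the two-term recurrence $IC_k = \frac{1}{4}(1 + 2\,IC_{k-1} + IC_{k-2})$, which may be more convenient when many generations are needed.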

Site l | Possible transitions | Site l+1 | Denoted by
$S(a_l) = S(b_l)$ | Neither $S(a)$ nor $S(b)$ transitions. | $S(a_{l+1}) = S(b_{l+1})$ | $P_{EE0}$
$S(a_l) = S(b_l)$ | Either $S(a)$ or $S(b)$ transitions, but not both. | $S(a_{l+1}) \neq S(b_{l+1})$ | $P_{EN1}$
$S(a_l) = S(b_l)$ | Both $S(a)$ and $S(b)$ transition to the same value. | $S(a_{l+1}) = S(b_{l+1})$ | $P_{EE2}$
$S(a_l) = S(b_l)$ | Both $S(a)$ and $S(b)$ transition, but to different values. | $S(a_{l+1}) \neq S(b_{l+1})$ | $P_{EN2}$
$S(a_l) \neq S(b_l)$ | Neither $S(a)$ nor $S(b)$ transitions. | $S(a_{l+1}) \neq S(b_{l+1})$ | $P_{NN0}$
$S(a_l) \neq S(b_l)$ | Either $S(a)$ or $S(b)$ transitions, but not both. $S(a)$ and $S(b)$ become equal after the transition. | $S(a_{l+1}) = S(b_{l+1})$ | $P_{NE1}$
$S(a_l) \neq S(b_l)$ | Either $S(a)$ or $S(b)$ transitions, but not both. $S(a)$ and $S(b)$ remain different after the transition. | $S(a_{l+1}) \neq S(b_{l+1})$ | $P_{NN1}$
$S(a_l) \neq S(b_l)$ | Both $S(a)$ and $S(b)$ transition. $S(a)$ and $S(b)$ remain different after the transition. | $S(a_{l+1}) \neq S(b_{l+1})$ | $P_{NN2}$
$S(a_l) \neq S(b_l)$ | Both $S(a)$ and $S(b)$ transition. $S(a)$ and $S(b)$ become the same after the transition. | $S(a_{l+1}) = S(b_{l+1})$ | $P_{NE2}$

Table 2.1: All possible transitions of $S(a)$, $S(b)$. Each type of transition is denoted by 3 characters. The first two letters indicate the equality (E) or inequality (N) of $S(a)$, $S(b)$ before and after the transition. They are followed by a digit indicating the number of transitions in $S(a)$, $S(b)$.
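To make the naming scheme of Table 2.1 concrete, the following Python sketch labels a transition of the indicator pair with its three-character event type and counts how many of the 256 possible state changes fall into each type. The function names are illustrative assumptions; the snippet only classifies events and does not assign probabilities.

```python
from collections import Counter
from itertools import product


def classify_transition(a_l: int, b_l: int, a_next: int, b_next: int) -> str:
    """Return the Table 2.1 label, e.g. 'EN1', for (a_l, b_l) -> (a_next, b_next)."""
    before = "E" if a_l == b_l else "N"          # equality of sources at site l
    after = "E" if a_next == b_next else "N"     # equality of sources at site l+1
    n_transitions = int(a_l != a_next) + int(b_l != b_next)
    return f"{before}{after}{n_transitions}"


def count_transition_types() -> Counter:
    """Count the state changes of each type over all (S(a), S(b)) in {1,2,3,4}^2."""
    sources = (1, 2, 3, 4)
    return Counter(
        classify_transition(a, b, a2, b2)
        for a, b, a2, b2 in product(sources, repeat=4)
    )


if __name__ == "__main__":
    for label, count in sorted(count_transition_types().items()):
        print(label, count)
```

Exactly nine labels occur, matching the nine rows of the table; combinations such as "EE1" cannot arise because a single transition from equal sources always makes them unequal.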

Next, I derive the probabilities in Table 2.1. Any transition in $S(a)$ or $S(b)$ is caused by one or more recombinations in the inbreeding process (Figure 2.1(a)). My calculation is based on the assumption that the recombination fraction, $\theta$, is reasonably small. Hence, for any haplotype $c$ at generation $I_j$ ($1 \le j \le k$), I assume that any single transition in $S(c)$ is caused solely by one recombination in generating $c$ or its ancestor haplotypes. In other words, a single transition in $S(c)$ is not the result of multiple recombinations in the pedigree. My assumption is generally true for dense SNP markers, where $\theta$ is usually well below [...].

Under the assumption, if a transition in $S(c)$ is caused by a recombination in generating $c$ itself, I define this to be a lead transition. Intuitively, a lead transition is one not inherited from its ancestors. A lead transition in $c$ will change the $I_0$ source of $c$ and of all descendant haplotypes inheriting the transition. A lead transition is only possible when the two parental haplotypes of $c$ have different $I_0$ sources. Hence, between two sites, a haplotype at generation $j$ has a lead transition with probability $\theta (1 - IC_{j-1})$.

With the inbreeding coefficients calculated, I can derive the marginal probability of observing a transition in one of the $I_k$ haplotypes, $P_{1T} = P(S(a_l) \neq S(a_{l+1})) = P(S(b_l) \neq S(b_{l+1}))$. Without loss of generality, I consider $P(S(a_l) \neq S(a_{l+1}))$ for haplotype $a$. $S(a)$ will transition if $a$ itself or any of its ancestor haplotypes has a lead transition. At generation $k$, the lead transition happens with probability $\theta (1 - IC_{k-1})$. For generation $k-1$, there are 2 possible ancestor haplotypes, each with a $\frac{1}{2}\theta (1 - IC_{k-2})$ chance of causing a transition in $S(a)$. For each generation $j$ from 1 to $k-2$, there are 4 possible ancestor haplotypes, each with probability $\frac{1}{4}\theta (1 - IC_{j-1})$. At one site, any two haplotypes from the same generation cannot both be the ancestor of $a$. Thus, for any generation $j$, the expected probability of causing a transition in $S(a)$ is $\theta (1 - IC_{j-1})$. Under my assumption, $P(S(a_l) \neq S(a_{l+1}))$ can be expressed by

\[ P_{1T} = 1 - \prod_{j=1}^{k} \bigl(1 - \theta (1 - IC_{j-1})\bigr). \]

I then derive the probability $P_{EE2}$ that $S(a)$ and $S(b)$ have equal state at site $l$, and both transition to another state at site $l+1$. This event happens only if a haplotype $c$ at

some previous generation is the common ancestor of $a, b$ and $c$ has a lead transition. The probability of $c$ at generation $j$ being the common ancestor of $a$ and $b$ is $\frac{1}{4} IC_{k-j}$. The probability that $c$ has a lead transition is $\theta (1 - IC_{j-1})$. Again, at one site, any two haplotypes from the same generation cannot both be the common ancestor of $a$ and $b$. Thus, the probability of an EE2 event caused by a lead transition at $I_j$ ($1 \le j \le k-2$) is $\theta (1 - IC_{j-1}) IC_{k-j}$. Assuming a small $\theta$, $P_{EE2}$ can be calculated by

\[ P_{EE2} = 1 - \prod_{j=1}^{k-2} \bigl(1 - \theta (1 - IC_{j-1}) IC_{k-j}\bigr). \]

Lastly, I consider the probability $P_{NN1}$. To simplify my discussion, assume that the transition happens in $S(a)$ (i.e., $S(a_l) \neq S(a_{l+1})$) and that it inherits a lead transition in haplotype $c$ of generation $j$. Since $S(a_l)$, $S(a_{l+1})$ and $S(b_l)$ all have different $I_0$ ancestry, alleles from at least 3 distinct $I_0$ haplotypes should be observed at generation $j-1$. Let $P_{Distinct}(m, j)$ be the probability of observing exactly $m$ distinct $I_0$ alleles at generation $j$. $P_{Distinct}(3, j)$ and $P_{Distinct}(4, j)$ can be computed recursively using:

\[ P_{Distinct}(4, j) = \frac{1}{4} P_{Distinct}(4, j-1) \]

\[ P_{Distinct}(3, j) = \frac{1}{2} P_{Distinct}(3, j-1) + \frac{1}{2} P_{Distinct}(4, j-1) \]

Then, $P_{NN1}$ is the probability that (1) at least 3 distinct $I_0$ alleles are present at generation $j-1$, (2) $a$'s ancestor $c$ at generation $j$ has a lead transition between sites $l$ and $l+1$ which is inherited by $a$, and (3) before and after the transition, the $I_0$ source of $c$ is different from that of $b$.

Under my assumption of a small $\theta$, $P_{NN2}, P_{NE2}, P_{EN2}$ are all sufficiently small and can be ignored in calculating other probabilities. The intuition is as follows: if $k$ is small, there are few animals in the inbreeding lattice and the chance of observing multiple transitions is small; when $k$ becomes larger, the probability $P(S(a_l) \neq S(b_l))$ approaches 0

rapidly and $P_{NN2}, P_{NE2}, P_{EN2}$ are much smaller than $P(S(a_l) \neq S(b_l))$.

With $P_{1T}$, $P_{EE2}$ and $P_{NN1}$ derived, I can easily solve for the remaining probabilities in Table 2.1:

\[ P_{NE1} = P_{EN1} = \frac{1}{2}\bigl(2 (P_{1T} - P_{EE2}) - P_{NN1}\bigr) \]

\[ P_{EE0} = IC_k - P_{EE2} - P_{EN1} \]

\[ P_{NN0} = 1 - IC_k - P_{NE1} - P_{NN1} \]

$P_{NN2}, P_{NE2}, P_{EN2}$ are approximated by a small probability $P_{NE1} \cdot P_{NE1}$.

I use simulation to validate the probabilities derived above. The results are shown in Figure 2.2. For $\theta$ around 0.01, my method gives a reasonably close approximation. For $\theta$ below 0.001, my method is very accurate. The recombination fraction between dense SNP markers is usually well below [...].

So far I have derived all event probabilities in Table 2.1. The transition probability from $(S(a_l), S(b_l))$ to $(S(a_{l+1}), S(b_{l+1}))$ is the corresponding probability in Table 2.1 conditioned on $P(S(a_l) = S(b_l))$ or $P(S(a_l) \neq S(b_l))$.

Integrating the Inbreeding Model

I have argued that each inbreeding process can be modeled by two quaternary indicators and that their transition probabilities can be accurately approximated when $\theta$ is small. It is then straightforward to integrate the inbreeding model into the original Lander-Green model. I encode the two quaternary indicators using 4 binary bits in the inheritance vector. Consider a pedigree containing $i$ inbreeding processes and $n$ other members not involved in inbreeding. The inheritance vector $v_l$ at every site $l$ now has length $2n + 4i$. Each possible realization of $v_l$ is a hidden state in the HMM. The transition probability from $v_l$ to $v_{l+1}$ is the product of the transition probabilities of all binary indicators and pairs of quaternary indicators. I can then solve the HMM using the standard forward-backward routine.

Figure 2.2: Comparison of predicted probabilities and observed probabilities from simulations (x-axis: inbreeding generation; y-axis: probability). The data points in the figures are observed probabilities from simulations. The curves are derived from my formulas. (a) Predicted and simulated $P_{EE0}$ for $\theta$ = 0.01, 0.001, [...]. (b) Predicted and simulated $P_{EN1} = P_{NE1}$ for $\theta$ = 0.001, [...]. (c) Predicted and simulated $P_{EE2}$ for $\theta$ = 0.001, [...]. I do not plot the case of $\theta$ = 0.01 in (b) and (c) because the values are much larger than those of the other two $\theta$ values.
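The closed-form approximations behind the curves in Figure 2.2 can be sketched in Python as follows. This is a minimal illustration under the small-$\theta$ assumption of the text; the function names are mine, and $P_{NN1}$ is passed in as an input because the text describes it in words rather than as a single formula.

```python
def inbreeding_coefficients(k_max):
    """[IC_0, ..., IC_{k_max}] with IC_k = sum_{j=0}^{k-2} (1/2)^(k-j) * (1 + IC_j)."""
    ic = []
    for k in range(k_max + 1):
        ic.append(sum(0.5 ** (k - j) * (1.0 + ic[j]) for j in range(0, k - 1)))
    return ic


def p_single_transition(theta, ic, k):
    """P_1T = 1 - prod_{j=1}^{k} (1 - theta * (1 - IC_{j-1}))."""
    no_transition = 1.0
    for j in range(1, k + 1):
        no_transition *= 1.0 - theta * (1.0 - ic[j - 1])
    return 1.0 - no_transition


def p_ee2(theta, ic, k):
    """P_EE2 = 1 - prod_{j=1}^{k-2} (1 - theta * (1 - IC_{j-1}) * IC_{k-j})."""
    no_event = 1.0
    for j in range(1, k - 1):  # j = 1 .. k-2
        no_event *= 1.0 - theta * (1.0 - ic[j - 1]) * ic[k - j]
    return 1.0 - no_event


def remaining_probabilities(ic_k, p_1t, p_ee2_value, p_nn1):
    """Solve for P_EN1 = P_NE1, P_EE0 and P_NN0, ignoring the double-transition events."""
    p_en1 = 0.5 * (2.0 * (p_1t - p_ee2_value) - p_nn1)
    return {
        "EE0": ic_k - p_ee2_value - p_en1,
        "EN1": p_en1,
        "EE2": p_ee2_value,
        "NE1": p_en1,
        "NN0": 1.0 - ic_k - p_en1 - p_nn1,
        "NN1": p_nn1,
    }
```

With the three double-transition events ignored, the six returned probabilities sum to one by construction, so any externally supplied $P_{NN1}$ only shifts mass between the N-type events.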

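The posterior probability of an inheritance vector $v_l$ given the descendant genotype sequence is:

\[ P(v_l \mid G_D) = \frac{P(G_D \mid v_l) P(v_l)}{P(G_D)} = \frac{P(g_1, \ldots, g_l \mid v_l)\, P(g_{l+1}, \ldots, g_L \mid v_l)\, P(v_l)}{P(G_D)} = \frac{P(g_1, \ldots, g_l, v_l)\, P(g_{l+1}, \ldots, g_L \mid v_l)}{P(G_D)} = \frac{\alpha(v_l) \beta(v_l)}{P(G_D)} \]

where

\[ \alpha(v_l) = P(g_1, \ldots, g_l, v_l) \]

\[ \beta(v_l) = P(g_{l+1}, \ldots, g_L \mid v_l) \]

$\alpha(v_l)$ and $\beta(v_l)$ can be solved recursively:

\[ \alpha(v_{l+1}) = \sum_{v_l} \alpha(v_l)\, P(v_{l+1} \mid v_l)\, P(g_{l+1} \mid v_{l+1}) \]

\[ \beta(v_l) = \sum_{v_{l+1}} \beta(v_{l+1})\, P(v_{l+1} \mid v_l)\, P(g_{l+1} \mid v_{l+1}) \]

$P(G_D)$ is obtained from the calculated $\alpha(v_l)$ and $\beta(v_l)$ at any site $l$:

\[ P(G_D) = \sum_{v_l} \alpha(v_l) \beta(v_l) \]

The genome ancestry at site $l$ is, for every founder pair $(F_A, F_B)$,

\[ P(\mathrm{Ancestry}(g_l) = (F_A, F_B)) = \sum_{v_l} P(v_l \mid G_D) \quad \text{for all } v_l \text{ s.t. } g_l \text{ is inherited from } (F_A, F_B). \]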
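The recursions above are the usual forward-backward pass of an HMM. The Python sketch below implements them for a generic state space, followed by the aggregation of state posteriors into ancestry posteriors. All names are illustrative assumptions, states are plain integer indices rather than inheritance vectors, and no rescaling is applied, so a practical version for thousands of markers would need to normalize $\alpha$ and $\beta$ at each site to avoid underflow.

```python
def posterior_states(init, trans, emit):
    """Return post[l][s] = P(v_l = s | G_D).

    init[s]     : P(v_1 = s)
    trans[s][t] : P(v_{l+1} = t | v_l = s)
    emit[l][s]  : P(g_l | v_l = s)
    """
    n_sites, n_states = len(emit), len(init)
    alpha = [[0.0] * n_states for _ in range(n_sites)]
    beta = [[1.0] * n_states for _ in range(n_sites)]

    # Forward pass: alpha(v_l) = P(g_1..g_l, v_l).
    for s in range(n_states):
        alpha[0][s] = init[s] * emit[0][s]
    for l in range(1, n_sites):
        for t in range(n_states):
            alpha[l][t] = emit[l][t] * sum(
                alpha[l - 1][s] * trans[s][t] for s in range(n_states)
            )

    # Backward pass: beta(v_l) = P(g_{l+1}..g_L | v_l).
    for l in range(n_sites - 2, -1, -1):
        for s in range(n_states):
            beta[l][s] = sum(
                trans[s][t] * emit[l + 1][t] * beta[l + 1][t] for t in range(n_states)
            )

    evidence = sum(alpha[n_sites - 1])  # P(G_D)
    return [
        [alpha[l][s] * beta[l][s] / evidence for s in range(n_states)]
        for l in range(n_sites)
    ]


def ancestry_posterior(posterior_at_site, ancestry_of_state):
    """Sum state posteriors over all states that map to the same founder pair."""
    result = {}
    for state, prob in enumerate(posterior_at_site):
        pair = ancestry_of_state[state]
        result[pair] = result.get(pair, 0.0) + prob
    return result
```

Here `ancestry_of_state` is a hypothetical lookup from each hidden state to the founder pair it implies; in the model of this chapter that mapping would be derived from the inheritance vector and the pedigree.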

Note that, if I place the bits of the quaternary indicators at the end of the inheritance vector, the recursive calculation of $\alpha$ and $\beta$ can still greatly benefit from the Elston-Idury algorithm [Idury and Elston, 1997].

2.4 Modeling the Collaborative Cross

The Collaborative Cross (CC) [Churchill et al., 2004; Chesler et al., 2008; The Collaborative Cross Consortium, 2012] is a large panel of reproducible, recombinant-inbred mouse lines proposed by the Complex Trait Consortium. Over a thousand mouse lines have been started, among which several hundred lines are still being inbred. All mouse lines are generated using eight genetically diverse founders via a common breeding scheme designed to randomize the genomic contribution of each founder. It provides an ideal platform for testing my approach.

The Breeding Scheme

CC mice are derived from 8 fully inbred founders using the 8-way funnel breeding scheme shown in Figure 2.3(a). The chromosomes of the eight founders (shown in different colors) are combined by two generations of crosses (labeled G1 and $G2I_0$), followed by at least 20 inbreeding generations ($G2I_1$ onward). The positions of the 8 founders are not fixed. Permutations of the founders are used to randomize the genomes and balance the founder contributions to the resulting CC lines. This variation in initial positions imposes different ancestry constraints on each line. Without loss of generality, I assume a founder order of $F_1 F_2 F_3 F_4 F_5 F_6 F_7 F_8$ as shown in Figure 2.3(a).
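As an illustration of how the founder order constrains ancestry, the Python sketch below enumerates the unordered founder pairs that are consistent with a given funnel order. It rests on my reading of the funnel: adjacent founders in the order form a mating pair, so their genetic material reaches the $G2I_0$ generation on a single haplotype, and the first inbred generation takes one haplotype from each half of the funnel. The function name and the boolean flag are hypothetical.

```python
from itertools import combinations_with_replacement


def consistent_founder_pairs(order, first_inbreeding_generation=False):
    """List unordered founder pairs allowed by an 8-way funnel order.

    Assumptions (not taken from any existing software):
      - order[0:2], order[2:4], order[4:6], order[6:8] are the founder mating pairs;
      - for G2I_1, one haplotype comes from the left half and one from the right half;
      - for later generations, a pair is allowed if both founders are identical or
        belong to different mating pairs.
    """
    quarter = {founder: i // 2 for i, founder in enumerate(order)}
    half = {founder: i // 4 for i, founder in enumerate(order)}
    pairs = []
    for fa, fb in combinations_with_replacement(order, 2):
        if first_inbreeding_generation:
            allowed = half[fa] != half[fb]
        else:
            allowed = fa == fb or quarter[fa] != quarter[fb]
        if allowed:
            pairs.append((fa, fb))
    return pairs


if __name__ == "__main__":
    funnel = ["F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8"]
    print(len(consistent_founder_pairs(funnel)))
    print(len(consistent_founder_pairs(funnel, first_inbreeding_generation=True)))
```

Under these assumptions, 4 of the 36 unordered pairs are excluded for later generations, namely the four founder mating pairs themselves, and permuting `order` changes which pairs those are.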

Figure 2.3: (a) Collaborative Cross breeding scheme: an example derivation of chromosomes by recombining chromosomes from 8 ordered founders. G1 and $G2I_0$ are two generations of crosses. $G2I_1$ onward are multiple generations of inbreeding. (b) The inheritance indicators used to represent the inheritance flow at a SNP site.

Modeling the Genome of $G2I_k$ Generation

In a CC pedigree, any recombination in the formation of G1 haplotypes can be virtually ignored since all founders are fully inbred. Hence, at each SNP site, I only need 4 inheritance indicators for the $G2I_0$ haplotypes and 2 quaternary indicators for the two haplotypes in a resulting $G2I_k$ descendant. The structure of the inheritance indicators is shown in Figure 2.3(b). $G2I_1$ mice are an exception, as they involve only one generation of inbreeding. For a $G2I_1$ mouse, I simply let the two quaternary indicators revert to binary indicators.

This becomes a standard Lander-Green model, and it can be seen that the two $G2I_1$ haplotypes are restricted to come from the left and right halves of the funnel, respectively.

2.5 Experiments

In this section, I evaluate the proposed model on both simulated data and real CC genotype data. I implement my model, GAIN (Genome Ancestry with INbreeding), for the CC using C++. GAIN is compared with MERLIN [Abecasis et al., 2001] and HAPPY [Mott et al., 2000]. MERLIN is a widely used pedigree analysis program based on the Lander-Green algorithm that can handle a large number of markers. HAPPY is a QTL mapping tool suite that can analyze genome ancestry based on only founder and descendant genotype data, i.e., it ignores pedigree structure. Both programs estimate the genome ancestry directly or indirectly.

Experiments on Simulated Data

As ground truth is generally unavailable for real data, I evaluate the accuracy of genome ancestry analysis using simulated data. I simulate the genotype of a $G2I_k$ mouse by recombining real CC founder haplotypes according to the CC pedigree structure. Given the founder genotypes, the founder haplotypes can be obtained trivially since all founders are fully inbred. At each generation, I choose recombination positions randomly. To simulate genotyping errors, I also introduce random errors into the resulting genotype sequence. When a site is selected to carry an error, I flip its value to heterozygous if it is originally homozygous. If a heterozygous site is selected, I change it to one of the homozygous states at random. This reflects the fact that most genotyping errors are between heterozygous and homozygous states, rather than between the two homozygous states. I simulate 20 test cases for each generation from $G2I_1$ to $G2I_{20}$. The number of markers ranges from 6 to 10 thousand. As MERLIN does not output probability distribution

for each inheritance vector, I first compare the best founder ancestry pair estimated by each method against the true answer. The error rate is measured by the percentage of sites where the estimated best founder ancestry does not match the ground truth. Figure 2.4 shows the error rate of all three methods on the simulated data with and without errors. Results of MERLIN are only available for the first 4 generations, as its running time grows exponentially with the size of the pedigree. No results can be generated within a reasonable running time (3 hours) for generations beyond $G2I_4$. By incorporating pedigree information, both GAIN and MERLIN produce accurate estimates (error rate less than 2%). In contrast, HAPPY has much higher error rates and is more sensitive to noise.

Figure 2.4: (a) Comparison of error rates of GAIN, MERLIN and HAPPY on a simulated data set with no noise. (b) Comparison on a simulated data set with 1% noise. (x-axis: inbreeding generation; y-axis: error rate.)

As mentioned previously, an accurate solution to the genome ancestry problem is important to subsequent studies such as QTL analysis. Such studies need not only the most likely genome ancestry but also the probability of each founder pair. Hence, it is also important to evaluate the probability distribution generated by each method. Both GAIN and HAPPY compute a probability distribution over founder pairs being the ancestry at a SNP site. I investigate the proportion of probability assigned to wrong founder ancestry. The result in Figure 2.5 shows that knowledge of the pedigree structure is indispensable in solving the genome ancestry problem. While HAPPY infers the most probable ancestry correctly for more than 80% of the markers, it assigns nearly 60% of the total probability to wrong ancestry choices. The mis-assigned probabilities could hamper further studies. With the pedigree structure modeled, GAIN can resolve most ambiguities and assigns less than 4% of the total probability to wrong ancestry.

Figure 2.5: (a) Proportion of probabilities assigned to wrong ancestry by GAIN and HAPPY on a simulated data set with no noise. (b) Proportion of probabilities assigned to wrong ancestry by GAIN and HAPPY on a simulated data set with 1% noise. (x-axis: inbreeding generation; y-axis: wrongly assigned probability.)

Experiments on Real CC data

The data set consists of genotypes of all autosomes from 96 mice of generations $G2I_5$ to $G2I_{12}$. The number of SNP markers on each chromosome ranges from 4122 to [...]. Due to the running time constraint of MERLIN, I only compare GAIN with HAPPY, which does not consider pedigree structure. Since the true genome ancestry is unknown, I investigate the difference between the results of the two approaches. I compare both the best ancestry estimated and the full probability distribution over possible ancestries. The first comparison (Figure 2.6(a)) shows the percentage of sites at which the best ancestries estimated by the two methods do not agree. The difference in best ancestry choice is very similar to that of my experiments on simulated

data with random error: the results from the two methods differ by 20%. I further measure the difference in probability distributions quantitatively using the Jensen-Shannon (JS) Divergence [Lin, 1991], which is a smoothed and bounded divergence based on the Kullback-Leibler Divergence. The JS Divergence (JSD) between two probability distributions $p_1$ and $p_2$ is defined as:

\[ JSD(p_1 \,\|\, p_2) = \sum_i p_1(i) \log_2 \frac{p_1(i)}{\frac{1}{2} p_1(i) + \frac{1}{2} p_2(i)} + \sum_i p_2(i) \log_2 \frac{p_2(i)}{\frac{1}{2} p_1(i) + \frac{1}{2} p_2(i)} \]

A low JS Divergence indicates high similarity between $p_1$ and $p_2$. The JS Divergence ranges between 0 and 2. Figure 2.6(b) compares the mean and standard deviation of the JS Divergence between HAPPY's results and mine over all markers and all 96 mice, grouped by chromosome.

Though I cannot compare the results against the ground truth for real CC data, I further investigate the sources of the difference. Consider again the CC pedigree in Figure 2.3(a). The initial four founder mating pairs $(F_1, F_2)$, $(F_3, F_4)$, $(F_5, F_6)$, $(F_7, F_8)$ cannot serve as ancestry for any genotype of $G2I_k$ descendants. This is because any genetic material passed from a founder mating pair is carried by a single haplotype in the $G2I_0$ generation. These four founder pairs are thus invalid ancestry choices if the pedigree structure is considered. As an example of the improved inference due to incorporating pedigree knowledge, the ancestry of chromosome 7 of a $G2I_6$ mouse inferred by GAIN and by HAPPY is shown in Figures 2.7(a) and 2.7(b), respectively. The most probable founder pair inferred by HAPPY agrees with GAIN's result at most sites, but their actual probabilities are often different. To quantify the extent to which HAPPY assigns positive probabilities to invalid ancestry, at each site $l$, I aggregate the probabilities of invalid ancestry and plot this pedigree inconsistency measure in Figure 2.7(c). I can see that the difference between Figures 2.7(a) and 2.7(b) is largely influenced

by the pedigree inconsistency. Moreover, the probability distributions of ancestry choices at neighboring sites are not independent. Probabilities assigned to pedigree-inconsistent ancestry can substantially influence the choice of ancestry at neighboring sites. Such propagated error is sometimes the main cause of the JS Divergence between HAPPY's results and mine. As an example, Figure 2.7(d) shows a region in chromosome 1 from another $G2I_6$ mouse where the propagated error is the main cause of divergence. In this region, HAPPY does not assign significant probabilities to invalid ancestry choices, except for a few sites at both ends of the region. But, in the middle part, HAPPY favors ancestry choices that are one recombination away from these invalid ancestry choices.

Figure 2.6: (a) The difference in best ancestry estimated by GAIN and HAPPY. (b) The average JS Divergence between results from GAIN and HAPPY on chromosomes 1 to 19 of 96 real CC mice. (x-axis: chromosome; y-axis: difference in (a), mean and standard deviation of JS Divergence in (b).)

To sum up, even partial pedigree knowledge causes a big difference in analyzing genome ancestry. Though HAPPY can conduct analysis rapidly, its results on complex

Figure 2.7: (a) Ancestry inference on chromosome 7 of a $G2I_6$ mouse by GAIN. (b) Ancestry inference on chromosome 7 of the same mouse by HAPPY. (c) The pedigree inconsistency in (b), i.e., the aggregated probability assigned to ancestry that violates pedigree knowledge. (d) A region in chromosome 1 from another $G2I_6$ mouse where propagated error is the main cause of divergence. (x-axis: location on chromosome; y-axis: probability.)

pedigrees can be biased. On the other hand, my method can provide a pedigree-consistent inference in comparable running time.

Running Time Performance

For a pedigree containing $i$ inbreeding processes and $n$ members not involved in inbreeding, the time complexity of GAIN is $O(L \cdot n \cdot 2^{2n} \cdot 2^{8i})$, where $L$ is the number of SNP markers. For any $G2I_k$ animal in the CC pedigree, the time complexity remains the same. The running time does not depend on the error rate of the genotype data either. Figure 2.8 shows the running time comparison of GAIN, MERLIN and HAPPY.

Figure 2.8: Average running time of the three methods on a data set containing 6644 markers (x-axis: inbreeding generation; y-axis: running time in seconds). The experiment is conducted on an Intel desktop with a 2.66 GHz CPU and 8 GB of memory.

2.6 Discussion

The development of high-density SNP technology makes model animal resources a powerful tool for studying genetic variation. It also makes any analysis on such resources computationally challenging. In this chapter, I demonstrate that modeling repetitive sub-structures of a pedigree can provide a significant improvement in efficiency without compromising accuracy. I introduce a novel method for modeling the inbreeding process. Integrated into the hidden Markov model framework originally introduced by the Lander-Green algorithm, my method can handle large pedigrees such as the Collaborative Cross efficiently. The inbreeding sub-structure model alone does not speed up ancestry inference for all types of pedigrees, but, as I have shown with the Collaborative Cross, the computational benefit can be crucial for analyzing many model animal resources. In analyzing such data, my method outperforms previous methods in terms of accuracy and efficiency. I believe that sub-structure modeling is a promising approach for large pedigree analysis, especially when specific types of pedigree are of interest.

Chapter 3 High Definition Recombination Map in a Highly Divergent Mouse Population

3.1 Introduction

Recombination is an essential biological process in sexual reproduction, as it ensures accurate chromosome segregation during meiosis and also contributes significantly to DNA repair and genetic diversity. Abnormal recombination can result in missegregation and is associated with multiple developmental diseases [Hassold and Hunt, 2001]. Despite its importance, the mechanism regulating the rate and pattern of recombination is largely unknown, although previous studies have shown the influence of factors including sex, chromosome, DNA sequence and hotspots [Robinson, 1996; Smagulova et al., 2011].

The Collaborative Cross (CC) provides a unique opportunity for the study of genome-wide recombination. The CC is a large panel of recombinant inbred lines (RIL) currently under development [Chesler et al., 2008; The Collaborative Cross Consortium, 2012]. It is derived from eight genetically diverse founder strains, including five classical inbred strains (A/J, C57BL/6J, 129S1/SvImJ, NOD/ShiLtJ, and NZO/HlLtJ) and three wild-derived strains (CAST/EiJ, PWK/PhJ, and WSB/EiJ). The eight founder strains were selected to capture a much greater level of genetic diversity than existing RIL panels

[Roberts et al., 2007]. Each of the independently bred lines has equal contributions from all eight founder strains via a funnel breeding scheme (Figure 2.3(a)). The eight founder strains are first intercrossed to generate the G1 generation. The G1 progeny are then crossed to create the four-way $G2I_0$ generation. The first eight-way progeny, the $G2I_1$s, are then generated from a $G2I_0 \times G2I_0$ cross¹. After this generation, CC strains become inbred through repeated generations of sibling mating. At the top of the funnel, the eight founder strains are arranged in an order that is randomized and not repeated across lines. The left four founders contribute to the left half of the funnel and the remaining four contribute to the right half. I also denote the four pairs of founders that are crossed to produce G1 progeny as the four quarters of the funnel.

In this study, I focus on the $G2I_1$ generation, which has balanced genome contributions from both sides of the funnel pedigree. The breeding pedigree leading to the $G2I_1$ generation contains eight observable meioses (Figure 3.1). I denote the four meioses at the crossing of the G1 generation as MGM, MGP, PGM, PGP and the four meioses at the crossing of the $G2I_0$ generation as $M_m$, $M_f$, $P_m$, $P_f$. Using the genotype data of the $G2I_1$ generation, I reconstructed the haplotypes at the $G2I_1$ generation and inferred all switching points of genome ancestry, which correspond to past recombination events in the pedigree. With the design of the breeding scheme, every inferred recombination event can be assigned uniquely to one of the eight meioses. With all recombinations inferred and characterized by gender, meiosis and genetic features, this study presents a high-definition genome-wide recombination map and an associated analysis of its properties.

¹ Other researchers have used G2 and G2F1 to denote the $G2I_0$ and $G2I_1$ generations.

Figure 3.1: The CC funnel pedigree to the G2I1 generation. In total there are eight meioses in the pedigree.

3.2 Materials and Methods

The Genotype Data

The genotype data were obtained from 244 male-female sibling pairs at the G2I1 generation using a customized high-density genotyping array [Yang et al., 2009]. The array contains 623,124 SNPs that capture the known genetic variation in the laboratory mouse. Before I conducted haplotype reconstruction, I separated SNPs into high-quality and mid-to-low-quality groups by examining:

- Genotype completeness (>0.99)
- Concordance between G2I1 mice, founder mice and partially available G1 genotypes

I kept only 15 to 25% of all SNPs on each chromosome in the high-quality group and used only these high-quality SNPs for haplotype reconstruction and recombination inference. The mid-to-low-quality SNPs were used later to help refine recombination boundaries. I also excluded samples (footnote 2) and chromosomes (footnote 3) with an exceptionally high discordance rate in haplotype reconstruction.

Haplotype Reconstruction and Recombination Inference

I used the method GAIN to conduct haplotype reconstruction and recombination inference. The method, as described in Chapter 2, is a hidden Markov model based method that can model haplotypes and recombinations with all pedigree knowledge incorporated. It has been shown that GAIN can perform analysis in the CC with both high accuracy and scalability with respect to the pedigree size (proportional to the number of generations). For the specific G2I1 generation, the model constructed in GAIN is similar to that in an efficient implementation of the Lander-Green algorithm (e.g., MERLIN [Abecasis et al., 2001]) because there are no further inbreeding generations.

I performed the analysis on each funnel independently but jointly on the siblings in the same funnel. This is because siblings can share recombinations, and joint analysis can help resolve ambiguity in recombination locations and haplotype boundaries. Recombinations, however, are not shared across funnels. For each pair of G2I1 sibling mice, GAIN took the genotypes of the eight founder mice and the genotypes of the two sibling mice as input. In addition, it required the funnel order of the eight founders. It then inferred the founder ancestry (in probabilities) at each SNP site by building a descendency model at each SNP and evaluating the probabilities of recombining between adjacent SNPs.
Footnote 2: Fourteen mouse samples, i.e., seven sibling pairs.
Footnote 3: Chromosome 18 in six samples and chromosome X in four samples.

The founder ancestry at each SNP describes the probability that each pair of founders (e.g., C57BL/6J and CAST/EiJ) are the two founders from which the two alleles are inherited. With pedigree knowledge considered and careful QC steps, GAIN achieved a very high level of confidence in estimating the best ancestry at most sites. More than 98% of the sites in all mice have the best ancestry choice estimated with 0.99 probability. With the ancestry probability information, I could define the haplotype blocks and recombinations trivially by tracing the most probable founder ancestry along chromosomes. Each recombination event is described by:

- a mid-point where the most probable founder ancestry changes
- proximal and distal boundaries where the probability of the most probable founder ancestry shrinks to a threshold
- proximal and distal ancestry founders on the recombining chromosome
- the type of meiosis it is associated with

The inferred recombination interval (from proximal to distal boundary) is expected to contain the recombination event with high probability. Note that there are regions where multiple founder ancestries have similar probabilities (due to lack of markers, low genotyping quality or similar DNA sequence in multiple founders). In such cases, long recombination intervals were obtained and the recombination events cannot be determined with high resolution.

Upon obtaining the recombination inference results, I further refined them with the mid-to-low-quality SNPs filtered out in the QC step. This was done by examining the consistency at mid-to-low-quality SNPs between founders, each G2I1 mouse, and all G2I1 mice assigned the same ancestry. On average, this reduced the inferred recombination intervals by approximately half.

Note also that GAIN fully enforces all constraints imposed by pedigree knowledge. For example, two of the strongest constraints for G2I1 mice are:

- For any SNP of any G2I1 mouse, the two alleles must come from different halves of the funnel.
- Two siblings cannot inherit different alleles from one quarter funnel at any SNP site.

If the input data contained errors (genotype data or funnel order), GAIN would infer significantly more recombinations in order to satisfy the corresponding constraints. This can be used as an effective indicator to identify and remove:

- Wrongly labeled funnels and mice
- Poorly performing and/or incorrectly mapped SNPs

3.3 Overview of the Recombination Map

Table 3.1: Summary of Identified Recombination Events in G2I1 Mice. Columns: Meiosis #; Type; Sex of G2I1; Autosomes (Non-Shared, Shared, All); X chromosome (Non-Shared, Shared, All); Total. Rows: Mf, Mm, Pf and Pm, followed by MGM, MGP, PGM and PGP, each broken down by sex of the G2I1 mouse (f, m, all).

A total of 25,038 recombination events were identified in the 474 individual G2I1 mice. Of these, 18,948 events are observed only once and 3,045 recombination events are shared by the sib pair. Therefore, we have identified 21,993 unique recombination events in our population, 21,368 on the autosomes and 625 on chromosome X. Table 3.1 presents a summary of all types of recombination events identified.

At a high level, I examined the correctness of the events by checking the ratio between types of events (expected and observed). Firstly, the ratio of shared vs non-shared events is expected to be 1:2 based on Mendel's Law of Segregation. In the observed data, non-shared events represent 67.3% of events in the MGM, MGP, PGM and PGP meioses (6,180 out of 9,177 events, the binomial test p-value is 0.17). This is consistently observed in each type of meiosis: MGM, 67.8%; MGP, 67.2%; PGM, 67.5% and PGP, 66.8% (binomial test p-values are 0.25, 0.54, 0.42, 0.93). Secondly, there should not be significant differences in the number of events in the same type of meiosis (Mf vs Mm, Pf vs Pm, MGM vs PGM and MGP vs PGP). The ratio of events observed is highly consistent: Mf vs Mm, 1.02; Pf vs Pm, 1.03; MGM vs PGM, 0.975; and MGP vs PGP, (binomial test p-values are 0.48, 0.25, 0.39, 0.88). Lastly, the ratio between (M+P) events and (MGM+MGP+PGM+PGP) events should be 4:3 (4/7 = 0.57) (footnote 4). I observed 12,191 and 9,177 events, respectively (12,191/21,368 = 0.57, the binomial test p-value is 0.79).

Figure 3.2: Distribution of recombination interval length in log-scale. (Histogram of counts of unique events against log10(recombination interval length), shown for all, non-shared and shared events.)

Footnote 4: For one G2I1 mouse, we expect to observe four informative independent meioses. If we consider two siblings, we expect to observe 2 × 4 − 4 × 0.25 = 7 meioses, because each G1 meiosis has 0.25 probability of being observed in both siblings.
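The ratio checks above can be approximated in a few lines. The sketch below is illustrative only: it uses a normal approximation to the binomial test (the text does not say whether an exact test was used), and the function name `binomial_test_normal` is a hypothetical helper, not part of GAIN. The event counts are the ones quoted in the text.

```python
import math


def binomial_test_normal(successes, trials, p_null):
    """Two-sided p-value from a normal approximation to the binomial test."""
    mean = trials * p_null
    sd = math.sqrt(trials * p_null * (1.0 - p_null))
    z = (successes - mean) / sd
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Non-shared events among G1-level meioses; the expected fraction is 2/3
# (shared : non-shared = 1 : 2).
p_non_shared = binomial_test_normal(6180, 9177, 2.0 / 3.0)

# G2-level (M+P) events among all unique autosomal events; the expected
# fraction is 4/7.
p_levels = binomial_test_normal(12191, 12191 + 9177, 4.0 / 7.0)

# Two siblings each observe 4 meiosis-equivalents, and each of the 4 G1
# meioses is seen in both siblings with probability 0.25.
expected_unique_meioses = 2 * 4 - 4 * 0.25

print(round(p_non_shared, 2), round(p_levels, 2), expected_unique_meioses)
```

Under this approximation both p-values should land close to the reported 0.17 and 0.79. An exact binomial test may differ slightly in the later decimals.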

On average, the resolution of recombination events is very high (Figure 3.2). The median size of the recombination interval is 35 kbp. There are, however, some recombinations that have very large uncertainty intervals (peak in Figure 3.2 between 1 and 3 Mbp). These are mainly due to strain-dependent identical-by-descent (IBD) regions or a lack of genetic markers in the interval. Based on the 21,993 identified unique recombination events, a recombination density map that can be smoothed at different scales is constructed. When smoothed with windows larger than 500 kb, the G2I1 map is remarkably similar to the recently published map, which has a much lower density of markers [Cox et al., 2009].

3.4 Sex Effect on Recombination

As expected, the total number of recombination events in autosomes is significantly smaller in the male germline than in the female germline (10,127 events and 11,241 events, respectively; binomial test p-value ; Table 3.1). This sex difference is also observed in the number of recombination events observed in each individual in both G1 and G2 meioses. To investigate the possible causes of this difference, the effect of the Prdm9 genotype on the size of the autosomal map was determined. One of the eight founder strains of the CC, CAST/EiJ, carries the Prdm9 a allele, four strains (A/J, C57BL/6J, 129S1/SvImJ and NZO/HlLtJ) carry the Prdm9 b allele, two strains (NOD/ShiLtJ and WSB/EiJ) carry the Prdm9 c allele and the PWK/PhJ strain carries the Prdm9 d allele. There is a significant expansion of the female map length and a reduction of the male map length in carriers of the Prdm9 a allele (1,450 cM and 1,195 cM, respectively). There is also a significant contraction of the female map length and an expansion of the male map length in carriers of the Prdm9 d allele (1,300 cM and 1,325 cM, respectively). Finally, carriers of the Prdm9 b and Prdm9 c alleles have similar ratios of female to male map lengths (Figure 3.3).

Figure 3.3: Recombination map length of autosomes by Prdm9 allele and gender.

In addition to the sex differences in overall recombination, there are dramatic sex differences in the pattern of recombination events in the autosomes. Figure 3.4 shows the distribution of recombination events along the autosomes in female and male meioses. The most obvious difference is the increase in the density of recombination events in the distal ends of chromosomes in male meioses. In female meioses, there is a more even distribution of recombination events along the autosomes (Figure 3.4(a)). In male meioses, approximately half of the recombination events occur in the distal quarter of the chromosomes and almost one third of events occur in the distal 10% of the autosomes (data not shown).

Comparison of the recombination density observed in single and double recombinants reveals striking differences while preserving the increase in recombination in the distal ends of the chromosomes (Figure 3.5). The most obvious difference is that in double recombinants there are two peaks of high recombination rate separated by a very low recombination rate in the middle, while in single recombinants the density proximal to the distal peak remains basically constant. In double recombinants from male meioses,

the distal peak is both higher and sharper than in singles and the proximal peak is lower and much wider. This pattern suggests that recombination may progress temporally from the telomere to centromere in males.

Figure 3.4: Distribution of recombination events along the autosomes in female and male meioses. Panels: (a) female meioses; (b) male meioses. The x-axis corresponds to the relative position in all autosomes. The y-axis indicates the kernel density estimates of recombinations in each type of meiosis.

Figure 3.5: Distribution of single and double recombination events along the autosomes in female and male meioses. Panels: (a) female meioses, single recombinant; (b) female meioses, double recombinants; (c) male meioses, single recombinant; (d) male meioses, double recombinants. The x-axis corresponds to the relative position in all autosomes. The y-axis indicates the kernel density estimates of recombinations in each type of meiosis.

3.5 Cold Regions

Regions with low levels of recombination have been reported previously [Smagulova et al., 2011] and many mouse researchers have anecdotal evidence that the ability to efficiently

reduce the size of many candidate regions of interest is undermined by an apparent lack of recombination. However, we know very little about the size, distribution, genomic features and evolutionary stability of such regions. Thus, identification and characterization of such cold regions may provide important information on the distribution of genetic variation, the level of linkage disequilibrium in the mammalian genome and the accuracy of imputation of genetic variants, and may provide new models to study the molecular and cellular mechanisms of meiotic recombination.

3.5.1 Identification of Cold Regions in the G2I1 Population

Cold regions are defined as long (>500 kb) continuous genomic intervals that are markedly depleted of recombination events in the G2I1 population. Given the total number of recombination events in my experiment (~22,000), I set this 500 kb threshold in the initial identification of cold regions to reduce the number of false positives (i.e., on average I expect 8.7 recombination events per Mb). The 50 coldest regions in male and female meioses are first identified independently to allow for possible cold regions on chromosome X. The union of these regions constituted the initial set and underwent several filtering steps. In the first step, regions in which no-calls (Ns) represent a large fraction of the nominal length are excluded. After this step, the boundaries of the 59 remaining cold regions are refined using the recombination intervals in the G2I1 population. For 51 of these regions the new refined interval has no recombination events, and they are bounded by the distal boundary of the proximal recombination event and the proximal boundary of the distal recombination event (Table 3.2). Overall, cold regions span Mb (~5% of the genome), distributed along 18 chromosomes (all chromosomes have cold regions except chromosomes 10 and 11) and with an enrichment for proximal and distal sections of the chromosomes (Table 3.2).
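The boundary rule described above (a cold region is bounded by the distal boundary of the proximal event and the proximal boundary of the distal event) can be illustrated with a small sketch. This is a simplified stand-in, not the procedure used in the study: it only scans one chromosome for event-free gaps longer than the threshold, it ignores the sex-specific ranking and the no-call filtering step, and the function name `find_cold_regions` and the toy coordinates are hypothetical.

```python
def find_cold_regions(intervals, min_length=500_000):
    """Return (start, end) gaps longer than min_length bp with no recombination interval.

    intervals: list of (proximal_bp, distal_bp) recombination intervals on one
    chromosome. Regions before the first or after the last event are not reported.
    """
    cold = []
    furthest_distal = None
    for proximal, distal in sorted(intervals):
        # The gap runs from the furthest distal boundary seen so far to the
        # proximal boundary of the next event.
        if furthest_distal is not None and proximal - furthest_distal > min_length:
            cold.append((furthest_distal, proximal))
        if furthest_distal is None or distal > furthest_distal:
            furthest_distal = distal
    return cold


# Toy recombination intervals (bp); only the second gap exceeds 500 kb.
events = [
    (1_000_000, 1_030_000),
    (1_200_000, 1_250_000),
    (2_400_000, 2_410_000),
    (2_600_000, 2_900_000),
]
regions = find_cold_regions(events)
print(regions)
```

Tracking the furthest distal boundary, rather than only the previous interval, keeps the gap correct when recombination intervals overlap or are nested.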

3.5.2 External Validation of Cold Regions

To determine whether the results in the G2I1 population are replicable in other populations, I estimated the recombination rate in these regions in the heterogeneous stock used to construct the most recent linkage map of the mouse [Cox et al., 2009]. On average, there is a four-fold reduction in recombination density in cold regions (0.14 cM/Mb versus the expected 0.5 cM/Mb that is observed genome-wide). In fact, 57 of the 59 regions are below the genome-wide average and for 16 regions the recombination density in the Cox map is zero (Table 3.2). The extent of validation is striking given the differences between these two populations in genetic background (only five of the 16 strains are shared between these two studies, and the non-shared strains include three wild-derived strains representing two subspecies that are rare or absent in the genetic makeup of the strains in the Cox study), marker density and approach to estimating recombination distances.

Recently, several maps of recombination initiation sites in the mouse have been published [Smagulova et al., 2011; Brick et al., 2012]. These studies identified regions with significant enrichment of double strand breaks (DSB) in the male germline of mice of different genetic backgrounds. Smagulova et al. [2011] identified 21 recombination deserts larger than 3 Mb, but noted that the inability to identify hotspots in some of these regions may be due to sequencing gaps or highly repetitive DNA. Eleven of the cold regions identified in the G2I1 population overlap with those described previously in Smagulova et al. [2011]. This level of concordance is even more remarkable once one considers that one of the Smagulova deserts was eliminated from my analysis because of a complete lack of sequence (footnote 5) and that nine additional regions that fail to make the cut in my list still show low levels of recombination in the G2I1 population.

More importantly, data from the second study [Brick et al., 2012] can be used to estimate the density of DSB in any given region. On average there is an 18X reduction in DSB density (range 14X to 24X) in cold regions compared to the genome average.

Footnote 5: chr 7: 39 Mb; see also the new GRCm38 assembly of the mouse genome.

Table 3.2: List of Cold Regions Identified. Columns: Chr; Start; End; Size; G2I1; Cox; B6; 9R; 13R; F1; Smagulova; (C+G); SD; # genes.

The table provides the chromosome location of 59 putative cold regions identified in the G2I1 population. In addition to the size of these regions, the table lists:

- G2I1: number of recombination events in the cold regions in the G2I1 population
- Cox: recombination rate in Cox et al. [2009]
- B6: log10(number of reads at DSB/Mb) in the C57BL/6J strain of Brick et al. [2012]
- 9R: log10(number of reads at DSB/Mb) in the 9R strain of Brick et al. [2012]
- 13R: log10(number of reads at DSB/Mb) in the 13R strain of Brick et al. [2012]
- F1: log10(number of reads at DSB/Mb) in the (9Rx13R) strain of Brick et al. [2012]
- Smagulova: recombination desert reported by Smagulova et al. [2011] in (C57BL/10.S × C57BL/10.F)F1 mice
- (C+G): base composition in percent
- SD: fraction of the interval included in segmental duplication, identified using dotplots generated using Gepard [Krumsiek et al., 2007]
- # genes: number of genes

3.5.3 Genomic Analysis of Cold Regions

Several genomic features have been associated with suppressed recombination in regions such as centromeres, including low C+G content, frequent and complex duplications and enrichment for repeated sequences. Therefore, I determined the content of cold regions for these and additional genomic features (gene content, presence of segmental duplications (tandem and inverted)). The overall C+G content in cold regions is significantly lower than the genome-wide average (Table 3.2). When all 59 intervals are plotted together, the plot resembles the aggregate of three different distributions with obvious peaks at 36%, 40% and 44%. The lower peak is the most pronounced and represents approximately half of the cold regions (26 cold regions with low C+G). This suggests that cold regions tend to be associated with local low C+G content. I also observed a highly significant enrichment in cold regions for large (>15 kb) segmental duplications, either in tandem or inverted. On average, 28% of the sequence in cold regions is involved in some type of rearrangement.

3.6 Conclusion

In this chapter, I present a genome-wide recombination study based on recombination events inferred from the G2I1 generation in the CC resource. The unique design of the CC allows us to fully determine the meiosis of each recombination event and attribute recombinations to gender and other genetic features. I performed careful quality control steps in constructing the recombination map. Extensive internal and external validations have been done to verify the correctness of the results obtained. Sex, jointly with Prdm9 alleles, has a strong effect on the pattern of recombination. The distribution of double recombinants in male meioses strongly suggests a temporal pattern of recombination progression. Furthermore, the vast majority of cold regions identified in the G2I1 population represent bona fide regions of suppressed recombination independent of the genetic background. Besides establishing the association with reduction in DSB density, I investigated the relationship between cold regions and local DNA sequence.

Chapter 4 MaCH-Admix: Genotype Imputation for Admixed Populations

4.1 Introduction

Imputation of untyped genetic markers has been routinely performed in genome-wide association studies (GWAS) [Sanna et al., 2010; Scott et al., 2007; WTCCC, 2007] and meta-analysis [Dupuis et al., 2010; Smith et al., 2010; Willer et al., 2008], and will continue to play an important role in sequencing-based studies [Fridley et al., 2010; The 1000 Genomes Project Consortium, 2010]. Li et al. [2010a] have previously developed a hidden Markov model (HMM) based method for imputation and shown that it achieves high imputation accuracy in a number of populations [Huang et al., 2009], particularly those with a high level of linkage disequilibrium (LD) or having closely matched reference population(s) from the HapMap [The International HapMap Consortium, 2010] or the 1000 Genomes Project (1000G) [The 1000 Genomes Project Consortium, 2010, 2012]. However, little methodological work exists for imputation in admixed populations, such as African Americans and Hispanic Americans, which comprise more than 20% of the US population (see Web Resources).

Admixed populations offer a unique opportunity for gene mapping because one could utilize admixture LD to search for genes underlying diseases that differ strikingly in prevalence across populations [Reich and Patterson, 2005; Rosenberg et al., 2010; Tang et al., 2006; Winkler et al., 2010; Zhu et al., 2004]. Although useful for admixture mapping, admixture LD also imposes challenges for imputation. Since an admixed individual's genome is a mosaic of ancestral chromosomal segments, to appropriately impute the genotypes, it is imperative to incorporate the underlying ancestry information. Practically, this is equivalent to selecting an appropriate reference panel that matches the corresponding ancestral population(s).

Existing studies have evaluated a wide range of choices for the construction of a reference panel prior to running the imputation engine. The recommendation is to use a pre-defined panel, either one that combines all reference populations (a cosmopolitan panel) [Hao et al., 2009; Li et al., 2009; Shriner et al., 2010] or a weighted combination panel [Egyud et al., 2009; Huang et al., 2009; Pasaniuc et al., 2010; Pemberton et al., 2008]. The cosmopolitan panel may include haplotypes from populations that are irrelevant, and fails to reflect the underlying ancestry proportions and consequently the LD pattern for the target population. The weighted combination panel is generated by duplicating haplotypes according to certain weights, which substantially and unnecessarily increases computational costs [Egyud et al., 2009].

An alternative approach, based on identity-by-state (IBS) sharing between the target individual and haplotypes in the reference populations, can be embedded within existing imputation models. This approach constructs individual-specific effective reference panels by selecting the most closely related haplotypes (according to IBS score) from the entire reference pool. The IBS-based selection is intuitive and useful for reducing the size of the effective reference panel and is tailored separately for each target individual.
The selection is usually conducted by finding pairwise Hamming distances, which is computationally very appealing. A simple IBS-based method, which selects a subset of haplotypes into the effective reference panel according to their Hamming distance to the haplotypes to be inferred across the entire genomic region to be imputed (hereafter referred to as whole-haplotype), has been adopted by IMPUTE2 [Howie et al., 2009]. Although some promising results have been shown when compared with random selection, no work has examined alternatives to this simple whole-haplotype based matching, partly due to the heavy computational burden posed.

In this chapter, I evaluated two classes of reference selection methods: IBS-based and ancestry-weighted approaches. Among the IBS-based approaches, I propose a novel method based on IBS matching in a piecewise manner. The method breaks the genomic region under investigation into small pieces and finds reference haplotypes that best represent every small piece, for each target individual separately. The method can be incorporated directly into existing imputation algorithms and has computational complexity identical to that of the existing whole-haplotype IBS-based method. Results from all real datasets evaluated suggest that my piecewise IBS method is highly robust and stable even when a small number of reference haplotypes are selected. Importantly, for uncommon variants, my piecewise IBS selection method shows a more pronounced advantage with large reference panels.

I have implemented all methods evaluated, including my piecewise IBS selection method, in the software package MaCH-Admix. Besides the new reference selection functionality, my software also retains high flexibility in two major aspects. First, both regional and whole-chromosome imputation can be accommodated. Second, both data-independent and data-dependent model parameter estimation are supported. Thus, besides standard reference panels with pre-calibrated parameters, the software can handle study-specific reference panels and target samples with unknown ethnic origin.

The rest of the chapter is organized as follows. I first present the general framework of the imputation algorithm, followed by the intuition and formulation of my piecewise IBS and various other effective reference selection methods.
Then I evaluate all these methods implemented in MaCH-Admix, the whole-haplotype IBS method implemented in IMPUTE2 [Howie et al., 2009], and BEAGLE [Browning and Browning, 2009] using the following datasets:

- 3587 Hispanic American individuals from the Women's Health Initiative (WHI)
- 8421 African American individuals from the WHI
- 49 HapMap III African American individuals
- 50 HapMap III Mexican individuals

All datasets are imputed with a reference from the 1000 Genomes Project (2188 haplotypes). I also explored the performance with small/medium reference sets from HapMap II/III. Finally, I provide practical guidelines for imputation in admixed populations in the Discussion section.

4.2 Materials and Methods

Assume that we have n individuals in the target population that are genotyped at a set of markers denoted by $M_g$. In addition, we have an independent set of H reference haplotypes, e.g., those from the International HapMap or the 1000 Genomes Projects, encompassing a set of markers denoted by $M_r$. Without loss of generality, I assume that the set of markers assayed in the target population, $M_g$, is a subset of $M_r$, the markers in the reference population. The goal of genotype imputation is to fill in missing genotypes, including those missing by design (for example, genotypes at markers in $M_r$ but not $M_g$, commonly referred to as untyped markers). As described earlier [Li et al., 2010a], the hidden Markov model as implemented in MaCH fulfills the goal by inferring the haplotypes encompassing the $M_r$ markers for each target individual from unphased genotypes at the directly assayed markers in $M_g$. Haplotype reconstruction is accomplished by building imperfect mosaics using some of the H reference haplotypes.

4.2.1 General Framework

Since admixed individuals have inherited genetic information from more than one ancestral population, I start with a pooled panel: a panel with haplotypes from all relevant populations, for example, CEU+YRI for African Americans and CEU+YRI+JPT+CHB for Hispanic Americans, where CEU is an abbreviation for Utah residents (CEPH) with Northern and Western European ancestry; YRI for Yoruba in Ibadan, Nigeria; JPT for Japanese in Tokyo, Japan; and CHB for Han Chinese in Beijing, China.

Let $G = (g_1, g_2, g_3, \ldots, g_{M_r})$ denote the unphased genotypes at $M_r$ markers for a target individual. Furthermore, I define a series of variables $S_m$, $m = 1, 2, \ldots, M_r$, to denote the hidden state underlying each unphased genotype $g_m$. The hidden state $S_m$ consists of an ordered pair of indices $(x_m, y_m)$ indicating that, at marker m, the first chromosome of this particular target individual uses reference haplotype $x_m$ as the template and the second chromosome uses reference haplotype $y_m$ as the template, where $x_m$ and $y_m$ both take values from $\{1, 2, \ldots, H\}$. I seek to infer the posterior probabilities of the sequence of hidden states $S = (S_1, S_2, \ldots, S_{M_r})$ for each individual, as the knowledge of S will determine the genotype at each of the $M_r$ markers. Define $P(S_m \mid H, G)$ as the posterior probability for $S_m$, the hidden state at marker m, with H denoting the pool of reference haplotypes and G denoting the genotype vector of the target individual. To infer these posterior probabilities, I run multiple Markov iterations. Within each iteration, I calculate the conditional joint probabilities $P(S_m, G \mid H)$ at each marker m via an adapted Baum's forward and backward algorithm as previously described [Li et al., 2010a].

For admixed populations, as one tends to include more reference haplotypes in the pool under the philosophy of erring on the safe side, and as one attempts not to duplicate haplotypes, one key aspect of the modeling is how to traverse the part of the sample space harboring the most probability mass with minimum computational effort.
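As a reading aid, the link between the joint probabilities produced by the forward-backward pass and the posterior defined above is simply Bayes' rule. The block below is a sketch under that reading, using the same symbols as the text; the index $s$ over ordered pairs and the count $t$ of selected haplotypes are the only additions.

```latex
% Posterior of the hidden state at marker m, obtained by normalising the joint
% probability over all ordered pairs s = (x, y) of reference haplotypes.
P(S_m = s \mid H, G)
  = \frac{P(S_m = s,\, G \mid H)}
         {\sum_{s'} P(S_m = s',\, G \mid H)},
\qquad s, s' \in \{1, 2, \ldots, H\} \times \{1, 2, \ldots, H\}.

% The number of hidden states per marker is H^2; restricting the templates to
% t effective reference haplotypes reduces it to t^2.
```

Because the number of states grows with the square of the pool size, trimming the pool to a small effective panel is what makes the reference selection methods in the following sections attractive.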

4.2.2 Piecewise IBS-based Reference Selection

In piecewise IBS selection, I seek to construct a set of t effective reference haplotypes from the pool of H haplotypes within each HMM iteration for each target individual separately. Selected reference panels are therefore tailored for each target individual. For presentation clarity, I consider a single target individual. Specifically, I calculate the genetic similarity (measured by IBS, the Hamming distance between two haplotypes) in a piecewise manner between the individual and each haplotype in the reference pool, ignoring the sub-populations (e.g., CEU or YRI) within the reference.

Denote $(h_1, h_2)$ as the current haplotype guess for the target individual. I break haplotype $h_1$ into a maximum of $t/2$ pieces so that the typed markers are evenly placed across pieces. Each piece has a minimum length of $\nu$ typed markers to ensure that the calculated Hamming distance is informative. Denote the number of pieces by p. For each haplotype piece, I calculate the piece-specific IBS score between $h_1$ and each reference haplotype and select the top $t/(2p)$ reference haplotypes, resulting in a total of $t/2$ selected for $h_1$ across all p pieces. I repeat the same procedure for $h_2$ and select a second set of $t/2$ reference haplotypes. In my implementation, I set $\nu = 32$, which corresponds to an average length of <200 kb for commonly used genome-wide genotyping platforms. To avoid creating spurious recombinations at piece boundaries, I apply a random offset to the first piece in each sampling so that the boundaries differ across iterations. In the case where $t/(2p)$ is not an integer, I select $\lceil t/(2p) \rceil$ (the ceiling integer) reference haplotypes in each piece for each target haplotype. Then I sample randomly from the selected reference haplotypes. Note that the piecewise selection is repeated for each individual in each sampling iteration. Thus the selection will change along with the intermediate sampling results.
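The selection procedure described above can be sketched for one target haplotype as follows. This is a minimal illustration, not the MaCH-Admix code: the function names are hypothetical, the random boundary offset is left out, and two details the text does not spell out are assumed (a haplotype already chosen for an earlier piece is skipped, and the final random sampling trims the selection back to t/2).

```python
import math
import random


def hamming(a, b):
    """Number of positions at which two equal-length allele sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def piecewise_select(target_hap, reference, t, nu=32, rng=None):
    """Pick about t/2 reference haplotype indices for one target haplotype."""
    rng = rng or random.Random(0)
    n_markers = len(target_hap)
    half = t // 2
    # At most t/2 pieces, each holding at least nu typed markers.
    p = max(1, min(half, n_markers // max(nu, 1)))
    per_piece = math.ceil(half / p)
    # Evenly spaced piece boundaries over the typed markers.
    bounds = [i * n_markers // p for i in range(p + 1)]

    chosen = []
    for i in range(p):
        lo, hi = bounds[i], bounds[i + 1]
        # Assumption: a haplotype already chosen for an earlier piece is skipped.
        candidates = [r for r in range(len(reference)) if r not in chosen]
        candidates.sort(key=lambda r: hamming(target_hap[lo:hi], reference[r][lo:hi]))
        chosen.extend(candidates[:per_piece])

    # Assumption: when ceil(t/(2p)) overshoots, sample back down to t/2 at random.
    if len(chosen) > half:
        chosen = rng.sample(chosen, half)
    return sorted(chosen)


dark, light = [0] * 8, [1] * 8
reference = [dark] * 8 + [light] * 8   # pool of H = 16: indices 0-7 dark, 8-15 light
h2 = [0, 0, 0, 0, 0, 0, 1, 1]          # six alleles match dark, the last two match light
picked = piecewise_select(h2, reference, t=8, nu=2)
print(picked)
```

In this toy pool the last piece matches only the light haplotypes, so one of them enters the selection even though every dark haplotype is closer over the whole region.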
I have also implemented two whole-haplotype IBS-based methods, IBS Single Queue (IBS-SQ) and IBS Double Queue (IBS-DQ). The former defines the IBS score of any reference haplotype as its minimum Hamming distance to $h_1$ and $h_2$, thus ordering the H reference haplotypes in a single queue. The top t reference haplotypes will be selected accordingly. The latter defines two separate IBS scores for $h_1$ and $h_2$, thus ordering the H reference haplotypes in two queues. The top $t/2$ reference haplotypes will be selected for $h_1$ according to the IBS scores for $h_1$. Similarly, another $t/2$ reference haplotypes will be selected for $h_2$.

Figure 4.1 explains the three IBS strategies under two simple scenarios. In both scenarios, there are eight markers measured in both target and reference, with color indicating the allelic status such that the same color at the same locus implies the same allele. In both Figures 4.1A and 4.1B, the first chromosome of the target individual shares all eight alleles with the dark-colored reference haplotypes and zero alleles with the light-shaded reference haplotypes. In Figure 4.1A, the second chromosome of the target individual shares two alleles with the dark-colored reference haplotypes and the remaining six alleles with the light-shaded reference haplotypes; whereas in Figure 4.1B, the second chromosome shares six alleles with the dark-colored reference haplotypes and the remaining two alleles with the light-shaded reference haplotypes. Suppose $t = H/2$. Figure 4.1A illustrates a scenario where the whole-haplotype Single Queue strategy is not optimal because only dark-colored haplotypes will be selected into the effective reference panel. By combining two sets selected from two separate queues, the whole-haplotype Double Queue strategy is advantageous in this scenario. On the other hand, neither the whole-haplotype Single Queue nor the whole-haplotype Double Queue strategy can handle the scenario in Figure 4.1B well because both strategies would only select the dark-colored reference haplotypes. Ideally, the selected reference haplotypes should, when possible, contain information to represent every part of both chromosomes carried by the target individual.
In the scenario presented in Figure 4.1B, because the target individual carries a segment of the light-shaded haplotype, it is desirable to have some representation of the light-shaded haplotypes in the effective reference panel. My piecewise IBS method achieves this by breaking the whole region into pieces and selecting some reference haplotypes according to genetic matching in each piece (illustrated in the bottom part of Figures 4.1A and 4.1B). By conducting local IBS matching and choosing a few reference haplotypes within each piece, it is able to include some of the light-shaded reference haplotypes. As a result, all parts of the target chromosomes are well represented by the selected reference haplotypes. In general, I believe that selecting a small number of reference haplotypes locally for each piece performs better than selecting globally at the whole-haplotype level. Note that the piecewise IBS method has the same computational complexity as the two whole-haplotype IBS methods.

Figure 4.1: A cartoon illustration of two scenarios (panels A and B) where three IBS-based selection methods perform differently. The two lines on the top panel represent the two chromosomes of a target individual and the lines on the bottom panel represent the pool of H = 16 reference haplotypes. Color determines the allelic status such that the same color at the same locus implies the same allele. The bottom parts ("Break into pieces to conduct IBS matching") show how my piecewise selection method breaks the imputation region into four pieces with t = H/2 = 8. Here I assume no constraint on the minimum piece size (i.e., ν = 0).
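The idea of local matching can be illustrated with a short Python sketch. It is a simplified stand-in for the piecewise method rather than its actual implementation: the function name is hypothetical, pieces are assumed to be equal-sized blocks of markers, each piece contributes the same number of haplotypes for each of the two target haplotypes, ties are broken by reference order, and already selected haplotypes are skipped. None of these details are specified in this section.

```python
def select_ibs_piecewise(h1, h2, refs, t, n_pieces):
    """Select about t reference haplotypes by matching within each piece."""
    n_markers = len(h1)
    # Assumption: split the budget evenly over pieces and target haplotypes.
    per_target = max(1, t // (2 * n_pieces))
    bounds = [round(k * n_markers / n_pieces) for k in range(n_pieces + 1)]
    selected = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        for h in (h1, h2):
            def local_dist(i):
                # Hamming distance restricted to the current piece.
                ref_piece = refs[i][start:stop]
                return sum(x != y for x, y in zip(ref_piece, h[start:stop]))

            order = sorted(range(len(refs)), key=local_dist)  # stable sort
            fresh = [i for i in order if i not in selected][:per_target]
            selected.extend(fresh)
    return selected
```

On a toy version of Figure 4.1B with four pieces and t = 8, the last piece, where the second target chromosome matches the light-shaded haplotypes, contributes one light-shaded haplotype, which neither whole-haplotype queue would select.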

4.2.3 Ancestry-weighted Approach

Besides IBS-based methods, I also evaluate an ancestry-weighted selection method, which is motivated by the idea of the weighted cosmopolitan panel discussed in the Introduction Section. This method concerns the scenario where the reference panel consists of haplotypes from several populations, for instance CEU and YRI, such that the H reference haplotypes are naturally decomposed into several groups. Let Q denote the number of populations included and H_q denote the number of haplotypes from reference population q, q = 1, 2, ..., Q.

I first consider the issue of weight determination for each contributing reference population, i.e., the fraction of reference haplotypes to be selected from that population. Intuitively, the weights should depend on the proportions of ancestry from these reference populations for the target admixed individual(s). The weights can be, on one extreme, the same for all individuals in the target population (for example, when the admixture makeup is similar across all individuals), or different for sub-populations within the target population, or, on the other extreme, specific to each target individual. For presentation clarity, I suppress the individual index i and denote by w = (w_1, w_2, ..., w_Q) the vector of weights, under the constraint that w_1 + w_2 + ... + w_Q = 1. In this work, I consider the same set of weights for all target individuals. The weights represent the average contributions over the imputation region and across all target individuals. I choose such average weights over weights specific to each individual because the average weights can be estimated more stably.

There are several natural ways to estimate the weights. One could pre-specify the weights according to estimates of ancestry proportion.
For example, it is reasonable to use a 2:8 CEU:YRI weighting scheme for African Americans, who are estimated to have about 20% Caucasian and 80% African ancestries [Lind et al., 2007; Parra et al., 1998; Reiner et al., 2007; Stefflova et al., 2011]. Alternatively, one can estimate the ancestry proportions for the target individuals under investigation. I have implemented an imputation-based approach within MaCH-Admix to infer ancestry proportions according to the contributions of reference haplotypes from each population to the constructed mosaics of the target individuals, so that the weights can be estimated by MaCH-Admix internally. I use the software package structure [Pritchard et al., 2000], specifically its Admix+LocPrior model, on an LD-pruned set of SNPs to confirm my internal ancestry inference.

Having determined the weights, I am interested in constructing a set of t effective reference haplotypes within each Markov iteration from the pool of H reference haplotypes according to the ancestry proportions. I achieve this by sampling without replacement t × w_q haplotypes from the H_q haplotypes in reference population q. For each target individual, I sample a different reference panel under the same set of weights.

4.2.4 MaCH-Admix

I have implemented the aforementioned methods (three IBS-based and one ancestry-weighted) in my software package MaCH-Admix. MaCH-Admix breaks the one-step imputation in MaCH into three steps: phasing, model parameter estimation (including error rate and recombination rate parameters), and haplotype-based imputation. The split into phasing and haplotype-based imputation is similar to IMPUTE2. My software can accommodate both regional and whole-chromosome imputation and allows both data-dependent and data-independent model parameter estimation. The flexibility regarding model parameter estimation allows one to perform imputation with standard reference panels, such as those from the HapMap or the 1000 Genomes Projects, with pre-calibrated parameters in a data-independent fashion, similar to IMPUTE2, which uses recombination rates estimated from the HapMap data and a constant mutation rate. Alternatively, if one works with study-specific reference panels, or suspects the model parameters differ from those pre-calibrated (for example, when target individuals are of

unknown ethnicity or from an isolated population), one has the option to simultaneously estimate these model parameters while performing imputation.

4.2.5 Datasets

I assessed the reference selection methods in the following six target sets:

- 3587 WHI Hispanic Americans (WHI-HA)
- 8421 WHI African Americans (WHI-AA)
- 200 randomly sampled WHI-HA individuals
- 200 randomly sampled WHI-AA individuals
- 49 HapMap III African Americans (ASW)
- 50 HapMap III Mexican individuals (MEX)

The WHI SHARe consortium offers one of the largest genetic studies in admixed populations. WHI [The WHI Study Group, 1998; Anderson et al., 2003] recruited a total of 161,808 women, 17% of them from minority groups (mostly African Americans and Hispanics), at 40 clinical centers across the U.S. The WHI SHARe consortium genotyped all the WHI-AA and WHI-HA individuals using the Affymetrix 6.0 platform. Detailed demographic and recruitment information for these genotyped samples has been described previously [Qayyum et al., 2012]. Besides standard quality control (details described previously in [Liu et al., 2012]), I removed SNPs with minor allele frequency (MAF) below 0.5%. To evaluate the imputation performance on target sets of smaller size, I randomly sampled 200 individuals from WHI-HA and WHI-AA separately.

For the two HapMapIII datasets, my target individuals are ASW (individuals of African ancestry in Southwest USA) and MEX (individuals of Mexican ancestry in Los Angeles, California) respectively from the phase III of the International HapMap Project

[The International HapMap Consortium, 2010]. These individuals (83 ASW and 77 MEX) were all genotyped using two platforms: the Illumina Human1M and the Affymetrix 6.0. I restricted my analysis to founders only: 49 ASW and 50 MEX.

The main focus of my work is imputation with a large reference panel. Thus, I first evaluated the imputation performance of all six target sets with the reference from the 1000 Genomes Project (release , H = 2188 haplotypes). For the WHI datasets, the number of markers overlapping between the target and reference, bounded by the number of markers typed in target samples, is smaller than that in the HapMap individuals. Therefore, I performed imputation 10 times, each time masking a different 5% of the Affymetrix 6.0 markers. This masking strategy allowed me to evaluate imputation quality at 50% of Affymetrix 6.0 SNPs. For HapMap III ASW and MEX individuals, I randomly masked 50% of the overlapping markers and evaluated the performance at these markers. I used two different masking schemes for the HapMap and WHI samples because I have 1.5 million typed markers in the HapMap samples and thus can still achieve reasonable imputation accuracy by masking 50% of the markers in a single trial. In the WHI samples, masking 50% of the 0.8 million markers in a single trial would substantially reduce imputation accuracy, and using one trial with a small percentage of markers masked would leave too few markers for evaluation. Therefore, I used multiple trials with 5% masking for the WHI datasets.

To provide a comprehensive evaluation, I also conducted imputation on all six target sets using HapMapII or HapMapIII haplotypes as the reference. I used HapMapII CEU+YRI (H = 240) for WHI-AA individuals and HapMapII CEU+YRI+JPT+CHB (H = 420) for WHI-HA individuals. The evaluation is based on masking 50% of the overlapping markers.
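The multi-trial masking scheme used for the WHI datasets can be sketched as follows. This is an illustrative sketch only: the function name, the seed, and the use of a random permutation are assumptions, and the text does not state how the masked sets were actually drawn beyond each trial masking a different 5% of markers.

```python
import random


def disjoint_masks(n_markers, n_trials=10, frac=0.05, seed=0):
    """Return n_trials disjoint sets of marker indices, each of size frac * n_markers."""
    per_trial = int(round(n_markers * frac))
    if per_trial * n_trials > n_markers:
        raise ValueError("not enough markers for disjoint masks")
    order = list(range(n_markers))
    random.Random(seed).shuffle(order)  # random marker order, reproducible
    # Consecutive slices of a permutation never overlap.
    return [sorted(order[k * per_trial:(k + 1) * per_trial])
            for k in range(n_trials)]
```

With ten trials of 5% each, the union of the masked sets covers 50% of the typed markers, while each individual run still retains 95% of them as imputation input.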
For the HapMap III ASW target set, I considered three different reference panels: HapMapII CEU+YRI (H = 240), HapMapIII CEU+YRI (H = 464), and HapMapIII CEU+YRI+LWK+MKK (H = 930), where LWK (Luhya in Webuye, Kenya) and MKK (Maasai in Kinyawa, Kenya) are two African populations from Kenya. For the HapMap III MEX target set, I considered HapMapII CEU+YRI+JPT+CHB (H = 420) and HapMapIII CEU+YRI+JPT+CHB (H = 804). For the HapMap target sets with HapMap references, I used genotypes at SNPs on the Illumina HumanHap650 BeadChip as imputation input and reserved other genotypes for evaluation. I have posted the HapMap data and the command lines used in this work on the MaCH-Admix website (see Web Resources).

I picked five 5Mb regions across the genome to represent a wide spectrum of LD levels. I first calculated the median half life of r^2, defined as the physical distance at which the median r^2 between pairs of SNPs is 0.5, for every 5Mb region using a sliding window of 1Mb, in CEU, YRI, and JPT+CHB, respectively. I used HapMapII phased haplotypes for the calculation. The five regions I picked are chromosome 3:80-85Mb, chromosome 1:75-80Mb, chromosome 4:57-62Mb, chromosome 14:50-55Mb, and chromosome 8:18-23Mb, in decreasing order of LD level. The median half life of r^2 is around the 90th, 70th, 50th, 30th, and 10th percentile within each of the three HapMap populations for the five regions respectively (Table 4.1). Figure 4.2 shows the LD levels for the five chromosomes on which these regions reside. For each region, I treat the middle 4Mb as the core region and the 500Kb on each end as flanking regions. Only SNPs imputed in the core region were evaluated to gauge imputation accuracy.

Table 4.1: Median Half Life of r^2 (in Kb) in CEU, YRI, and JPT+CHB, listed for genome-wide percentiles and for the five selected regions (chromosome3:80-85mb, chromosome1:75-80mb, chromosome4:57-62mb, chromosome14:50-55mb, chromosome8:18-23mb). Percentiles are calculated within each population using all 5Mb windows across the genome.
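The median half life of r^2 defined above can be approximated with a simple binning procedure. The Python sketch below is one possible reading of that definition, not the procedure actually used for Table 4.1: the function name, the 5 kb bin width, and the choice to report the midpoint of the first distance bin whose median r^2 is at or below 0.5 are all assumptions, and the sliding-window step is not modeled.

```python
from statistics import median


def median_r2_half_life(distances_kb, r2_values, bin_width_kb=5.0):
    """Approximate distance (kb) at which the median pairwise r^2 falls to 0.5."""
    bins = {}
    for dist, r2 in zip(distances_kb, r2_values):
        # Group SNP pairs by physical distance.
        bins.setdefault(int(dist // bin_width_kb), []).append(r2)
    for b in sorted(bins):
        if median(bins[b]) <= 0.5:
            return (b + 0.5) * bin_width_kb  # midpoint of the first such bin
    return None  # median r^2 never drops to 0.5 within the observed distances
```

A larger return value indicates that LD decays more slowly with distance, which corresponds to the high-LD end of the spectrum of regions selected above.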

Figure 4.2: Median r^2 half-life value of 5Mb windows on 5 chromosomes. Panels: (a) Chr 8, (b) Chr 14, (c) Chr 4, (d) Chr 1, (e) Chr 3. x-axis: Center Location of 5Mb Window (Mb); y-axis: Median Half-Life of r^2 (kb); curves: CEU, YRI, CHB+JPT.

4.2.6 Methods Compared

I evaluated the following reference selection approaches implemented in MaCH-Admix:

- random selection (MaCH-Admix Random or original MaCH)
- IBS Piecewise selection (MaCH-Admix IBS-PW)
- IBS Single-Queue selection (MaCH-Admix IBS-SQ)
- IBS Double-Queue selection (MaCH-Admix IBS-DQ)
- Ancestry-Weighted selection (MaCH-Admix AW) (for HapMapIII datasets)

I also included IMPUTE2 [Howie et al., 2009] and BEAGLE [Browning and Browning, 2009] for comparison. I used IMPUTE2 and BEAGLE with default settings (-k_hap 500 -iter 30 for IMPUTE2; niterations=10 nsamples=4 for BEAGLE). As mentioned above, MaCH-Admix can conduct imputation with pre-calibrated parameters (similar to IMPUTE2); alternatively, MaCH-Admix can perform imputation together with data-dependent parameter estimation in an integrated mode. The integrated mode generates slightly better results at the cost of increased computing time. Here, I report results from the pre-calibrated mode.

4.2.7 Measure of Imputation Quality

Previous studies have proposed multiple statistics to measure imputation quality [Browning and Browning, 2009; Li et al., 2009; Lin et al., 2010; Marchini and Howie, 2010], measuring the concordance rate, correlation, or agreement between the imputed genotypes or estimated allele dosages (the fractional counts of an arbitrary allele at each SNP for each individual, ranging continuously from 0 to 2) and their experimental counterparts. I opt to report the dosage r^2 values, which are the squared Pearson correlation between the estimated allele dosages and the true experimental genotypes (recoded as 0, 1, and 2, corresponding to the number of minor alleles), because dosage r^2 is a better measure for uncommon variants, as it takes allele frequency into account, and is directly related to the effective sample size for downstream association analysis [Pritchard and Przeworski, 2001]. For the remainder of the work, unless otherwise noted, average dosage r^2 values are plotted as a function of approximation level (measured by the effective reference panel size, i.e., t described in the Methods section, corresponding to MaCH-Admix's --states option and IMPUTE2's -k option). Hereafter, I use approximation level, effective reference size, t, and #states/-k interchangeably.

I note that for standard haplotypes-to-genotype imputation (that is, using reference haplotypes to impute target individuals with genotypes), computational costs increase quadratically with the approximation level. MaCH-Admix and IMPUTE2 both also have an approximation parameter at the haplotype-based imputation step, MaCH-Admix's --imputestates and IMPUTE2's -k_hap, which increases the computation time linearly and is by default set at a large value (500). I kept both at the default value because increasing them beyond the default has negligible effects on imputation quality and because the computing time attributable to the haplotype-based imputation step is typically much smaller than that governed by --states and -k.

4.3 Results

4.3.1 WHI-AA and WHI-HA with the 1000G Reference

Figures 4.3 and 4.4 show results for the full WHI-HA and WHI-AA sets using 2188 haplotypes from release of the 1000 Genomes Project as the reference (three of the five 5Mb regions are shown: the 1st, 3rd, and 5th regions according to level of LD). The remaining results under the default or middle settings are presented in Tables 4.2 and 4.3 (all five regions for WHI-HA and WHI-AA respectively). Note that BEAGLE's performance remains constant because it does not have a parameter analogous to MaCH-Admix's --states or IMPUTE2's -k.
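For reference, the dosage r^2 measure defined in Section 4.2.7 can be computed for one SNP as in the following Python sketch. The function name is hypothetical, and returning None for a SNP with no variation in either vector is a convention chosen here, since the correlation is undefined in that case.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0 to 2)
    and true genotypes coded as 0, 1, or 2 minor alleles."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return None  # correlation undefined for a constant vector
    return cov * cov / (var_d * var_g)
```

Because the correlation is invariant to linear rescaling, dosages that are systematically shrunk toward the mean can still achieve a high dosage r^2, which is one way this measure differs from a simple concordance rate.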

Generally, I observe higher imputation accuracy in regions with higher levels of LD for all approaches evaluated. In addition, in regions with higher LD, imputation accuracy reaches a plateau at smaller effective reference sizes. This is because the LD pattern can be captured fairly well by a smaller number of reference haplotypes in regions with higher levels of LD. In regions with lower levels of LD, the accuracy plateau is reached at larger effective reference sizes. Generally, an effective reference size of 80 to 120 is sufficient for MaCH-Admix to perform well at all LD levels.

Figure 4.3: Imputation of 3587 WHI-HA with the 1000G reference panel. Imputation quality (measured by dosage r^2) is plotted as a function of the effective reference panel size (i.e., #states), for WHI-HA individuals in three selected 5Mb regions (ordered by LD from high to low). x-axis: #states / -k; y-axis: Dosage r^2; curves: MaCH-Admix Random, MaCH-Admix IBS-SQ, MaCH-Admix IBS-DQ, MaCH-Admix IBS-PW, IMPUTE2, BEAGLE.
A: Imputation quality of WHI-HA with the 1000G reference panel. Panels: (a) Chr 3, 80-85Mb; (b) Chr 4, 57-62Mb; (c) Chr 8, 18-23Mb.
B: Uncommon SNP imputation quality of WHI-HA with the 1000G reference panel. Panels: (d) Chr 3, 80-85Mb; (e) Chr 4, 57-62Mb; (f) Chr 8, 18-23Mb. I set the maximum plotting range on y-axis to be 5%. IMPUTE2 in (c) is below the lower bound of the plotting range.

Figure 4.4: Imputation of 8421 WHI-AA with the 1000G reference panel. Imputation quality (measured by dosage r^2) is plotted as a function of the effective reference panel size (i.e., #states), for WHI-AA individuals in three selected 5Mb regions (ordered by LD from high to low). x-axis: #states / -k; y-axis: Dosage r^2; curves: MaCH-Admix Random, MaCH-Admix IBS-SQ, MaCH-Admix IBS-DQ, MaCH-Admix IBS-PW, IMPUTE2, BEAGLE.
A: Imputation quality of WHI-AA with the 1000G reference panel. Panels: (a) Chr 3, 80-85Mb; (b) Chr 4, 57-62Mb; (c) Chr 8, 18-23Mb.
B: Uncommon SNP imputation quality of WHI-AA with the 1000G reference panel. Panels: (d) Chr 3, 80-85Mb; (e) Chr 4, 57-62Mb; (f) Chr 8, 18-23Mb. Note that WHI-AA has significantly fewer SNPs in this category than WHI-HA does. Also, I set the maximum plotting range on y-axis to be 5%. MaCH-Admix Random in (b),(c) and BEAGLE in (a),(b),(c) are below the lower bound of the plotting range.

I found that the piecewise IBS selection approach (IBS-PW) is clearly the best among the three IBS-based methods implemented in MaCH-Admix. Its performance is stable even with a small #states value. For the other two IBS-based reference selection approaches implemented in MaCH-Admix, I observed that IBS-DQ performs better than IBS-SQ. The performance order of the three MaCH-Admix IBS-based methods is expected based on my reasoning in the Material and Methods Section. In addition, all three IBS-based methods show a clear advantage over random selection, particularly when the effective reference size is small. IMPUTE2 performs similarly to IBS-DQ when the effective reference size is small. Interestingly, IMPUTE2's accuracy curve tends to stay relatively flat while those for MaCH-Admix's IBS-based methods increase with the effective reference size.

Across all five regions evaluated, with effective reference size at 120, IBS-PW shows a consistent performance gain over the other evaluated methods. Importantly, IBS-PW and IBS-DQ, particularly IBS-PW, manifest a more pronounced advantage for uncommon variants (MAF <5%) in WHI-HA. For these uncommon variants, average dosage r^2 is 0.818, 0.782, and (0.808, 0.805, and 0.756) for WHI-HA (WHI-AA) using IBS-PW, IMPUTE2, and BEAGLE respectively. The advantage of IBS-PW in uncommon SNPs is however smaller in WHI-AA, largely because of the much smaller number of uncommon variants in WHI-AA (Figure 4.5). However, the difference is highly significant (p-value ) in both WHI samples. My observation is consistent in both the full set and the subset of 200 individuals (Tables 4.2 and 4.3).

The variance of imputation quality by markers is heavily influenced by the MAF distribution. All methods exhibit much larger variance in imputing uncommon variants. The standard error of my IBS-PW method ranges from to for all variants, and from to for uncommon variants in imputing the WHI-HA full set. In imputing the WHI-AA full set, the standard error of IBS-PW ranges from to for all variants, and from 0.02 to for uncommon variants.

Table 4.2: Imputation Results of WHI-HA Individuals over Five 5Mb Regions with the 1000G reference
Column groups: All 3587 individuals; Random 200 Subset. Columns in each group: overall dosage r^2 (std dev); uncommon SNPs dosage r^2 (std dev); running time.

chromosome3:80-85mb
MaCH-Admix Random 0.935(0.107) 0.796(0.189) (0.121) 0.794(0.231) 841
MaCH-Admix IBS-PW 0.942(0.101) 0.817(0.189) (0.111) 0.814(0.210) 1041
MaCH-Admix IBS-SQ 0.939(0.104) 0.799(0.190) (0.119) 0.796(0.231) 988
MaCH-Admix IBS-DQ 0.941(0.102) 0.806(0.191) (0.119) 0.799(0.232) 995
IMPUTE2 (0.104) 0.797(0.191) (0.119) 0.799(0.233) 2076
BEAGLE 0.931(0.107) 0.799(0.190) (0.128) 0.779(0.231) 6614

chromosome1:75-80mb
MaCH-Admix Random 0.918(0.130) 0.821(0.190) (0.129) 0.855(0.211) 1214
MaCH-Admix IBS-PW 0.927(0.123) 0.841(0.186) (0.121) 0.873(0.197) 1490
MaCH-Admix IBS-SQ 0.923(0.122) 0.823(0.187) (0.125) 0.861(0.209) 1443
MaCH-Admix IBS-DQ 0.926(0.121) 0.830(0.185) (0.123) 0.866(0.207) 1452
IMPUTE2 (0.121) 0.809(0.183) (0.127) 0.845(0.204) 2545
BEAGLE 0.917(0.124) 0.815(0.184) (0.129) 0.851(0.209) 9194

chromosome4:57-62mb
MaCH-Admix Random 0.904(0.148) 0.761(0.208) (0.137) 0.813(0.213) 1239
MaCH-Admix IBS-PW 0.913(0.139) 0.783(0.202) (0.134) 0.824(0.212) 1527
MaCH-Admix IBS-SQ 0.907(0.141) 0.757(0.195) (0.135) 0.807(0.209) 1460
MaCH-Admix IBS-DQ 0.911(0.138) 0.770(0.195) (0.133) 0.817(0.210) 1455
IMPUTE2 (0.142) 0.751(0.198) (0.147) 0.773(0.225) 2991
BEAGLE 0.900(0.150) 0.751(0.218) (0.155) 0.787(0.244)

chromosome14:50-55mb
MaCH-Admix Random 0.921(0.132) 0.800(0.202) (0.122) 0.847(0.202) 1600
MaCH-Admix IBS-PW 0.932(0.120) 0.826(0.184) (0.119) 0.859(0.198) 1876
MaCH-Admix IBS-SQ 0.927(0.118) 0.807(0.175) (0.119) 0.849(0.199) 1877
MaCH-Admix IBS-DQ 0.930(0.115) 0.819(0.176) (0.118) 0.854(0.197) 1876
IMPUTE2 (0.120) 0.793(0.180) (0.125) 0.828(0.216) 2579
BEAGLE 0.926(0.121) 0.806(0.189) (0.130) 0.824(0.218)

chromosome8:18-23mb
MaCH-Admix Random 0.896(0.155) 0.793(0.212) (0.150) 0.821(0.225) 1899
MaCH-Admix IBS-PW 0.911(0.143) 0.824(0.198) (0.147) 0.833(0.221) 2302
MaCH-Admix IBS-SQ 0.903(0.145) 0.797(0.200) (0.149) 0.820(0.227) 2270
MaCH-Admix IBS-DQ 0.906(0.143) 0.805(0.200) (0.147) 0.822(0.224) 2285
IMPUTE2 (0.145) 0.773(0.206) (0.159) 0.781(0.247) 3647
BEAGLE 0.905(0.142) 0.807(0.201) (0.154) 0.800(0.232)

All results were generated using default or suggested parameter values: MaCH-Admix: --rounds 30, --states 120, --imputestates 500; IMPUTE2: -iter 30, -k 120, -k_hap 500; BEAGLE: niterations=10 nsamples=4. Running time is measured in seconds.

Table 4.3: Imputation Results of WHI-AA Individuals over Five 5Mb Regions with the 1000G reference
Column groups: All 8421 Individuals; Random 200 Subset. Columns in each group: overall dosage r^2 (std dev); uncommon SNPs dosage r^2 (std dev); running time.

chromosome3:80-85mb
MaCH-Admix Random 0.912(0.100) 0.782(0.150) (0.091) 0.824(0.194) 897
MaCH-Admix IBS-PW 0.947(0.073) 0.850(0.158) (0.083) 0.849(0.194) 1026
MaCH-Admix IBS-SQ 0.944(0.075) 0.844(0.161) (0.086) 0.835(0.198) 1035
MaCH-Admix IBS-DQ 0.946(0.074) 0.849(0.160) (0.083) 0.851(0.198) 1021
IMPUTE2 (0.075) 0.847(0.151) (0.085) 0.836(0.187) 2017
BEAGLE 0.921(0.088) 0.795(0.170) (0.107) 0.784(0.217) 6435

chromosome1:75-80mb
MaCH-Admix Random 0.873(0.143) 0.703(0.219) (0.141) 0.726(0.241) 1240
MaCH-Admix IBS-PW 0.921(0.106) 0.802(0.176) (0.128) 0.770(0.232) 1530
MaCH-Admix IBS-SQ 0.915(0.109) 0.794(0.174) (0.130) 0.756(0.224) 1504
MaCH-Admix IBS-DQ 0.918(0.106) 0.803(0.168) (0.131) 0.762(0.235) 1476
IMPUTE2 (0.103) 0.810(0.157) (0.135) 0.760(0.240) 2412
BEAGLE 0.892(0.119) 0.759(0.173) (0.145) 0.713(0.242) 8621

chromosome4:57-62mb
MaCH-Admix Random 0.883(0.126) 0.688(0.187) (0.111) 0.749(0.169) 1290
MaCH-Admix IBS-PW 0.927(0.092) 0.795(0.159) (0.100) 0.792(0.175) 1508
MaCH-Admix IBS-SQ 0.920(0.094) 0.782(0.148) (0.105) 0.777(0.180) 1545
MaCH-Admix IBS-DQ 0.924(0.090) 0.796(0.138) (0.100) 0.793(0.175) 1478
IMPUTE2 (0.091) 0.787(0.129) (0.104) 0.778(0.168) 2939
BEAGLE 0.898(0.109) 0.735(0.167) (0.131) 0.738(0.222)

chromosome14:50-55mb
MaCH-Admix Random 0.875(0.140) 0.726(0.216) (0.120) 0.807(0.198) 1663
MaCH-Admix IBS-PW 0.921(0.105) 0.823(0.171) (0.104) 0.852(0.167) 1900
MaCH-Admix IBS-SQ 0.914(0.108) 0.809(0.172) (0.112) 0.835(0.191) 1918
MaCH-Admix IBS-DQ 0.918(0.105) 0.818(0.168) (0.107) 0.850(0.175) 1900
IMPUTE2 (0.106) 0.815(0.157) (0.116) 0.820(0.186) 2575
BEAGLE 0.893(0.118) 0.775(0.176) (0.127) 0.786(0.216)

chromosome8:18-23mb
MaCH-Admix Random 0.830(0.177) 0.682(0.235) (0.163) 0.735(0.235) 1977
MaCH-Admix IBS-PW 0.889(0.142) 0.798(0.207) (0.148) 0.800(0.218) 2377
MaCH-Admix IBS-SQ 0.882(0.145) 0.789(0.207) (0.152) 0.786(0.224) 2393
MaCH-Admix IBS-DQ 0.885(0.144) 0.795(0.205) (0.149) 0.797(0.220) 2318
IMPUTE2 (0.140) 0.795(0.194) (0.153) 0.795(0.218) 3618
BEAGLE 0.858(0.151) 0.743(0.206) (0.158) 0.767(0.229)

In my experiments, BEAGLE could not finish imputation with the complete 1000G references within 7 days, which is the hard limit on my cluster server. I thus restricted the markers in the reference panel to the set of Affymetrix 6.0 markers plus 2.5% of the remaining 1000G markers. The size of the restricted set in each region is about 10-15% of the size of the original 1000G marker set.
All results were generated using default or suggested parameter values: MaCH-Admix: --rounds 30, --states 120, --imputestates 500; IMPUTE2: -iter 30, -k 120, -k_hap 500; BEAGLE: niterations=10 nsamples=4. Running time is measured in seconds.

Figure 4.5: Minor Allele Frequency (MAF) distribution of SNPs in WHI-AA and WHI-HA. x-axis: Minor Allele Frequency; y-axis: # of SNPs; series: WHI-HA, WHI-AA.

4.3.2 HapMap ASW and MEX with the 1000G Reference

In this setting, I use a large reference panel to impute two small target sets. Figure 4.6 shows the imputation quality of three regions for both ASW and MEX. The complete results are presented in Table 4.4. As in the previous experiments, I found that IBS-PW is very effective in finding the most relevant reference haplotypes from a large panel (1000G) and clearly outperforms the other methods. IMPUTE2 again shows a flatter curve in most regions. Random selection and BEAGLE tend to perform worse than the IBS-based methods. This again indicates that IBS-based selection is very effective when working with large reference panels. In imputing ASW individuals, the standard error of my IBS-PW method ranges from to for all variants, and from to for uncommon variants. In imputing MEX individuals, the standard error of IBS-PW ranges from to for all variants, and from to for uncommon variants.

Figure 4.6: Imputation of 49 HapMap ASW and 50 HapMap MEX individuals with the 1000G reference panel. Imputation quality (measured by dosage r^2) is plotted as a function of the effective reference panel size (i.e., #states), for ASW and MEX individuals in three selected 5Mb regions (ordered by LD from high to low). x-axis: #states / -k; y-axis: Dosage r^2; curves: MaCH-Admix Random, MaCH-Admix IBS-PW, MaCH-Admix IBS-SQ, MaCH-Admix IBS-DQ, IMPUTE2, BEAGLE.
A: Overall imputation quality of HapMap ASW with the 1000G reference panel. Panels: (a) Chr 3, 80-85Mb; (b) Chr 4, 57-62Mb; (c) Chr 8, 18-23Mb.
B: Overall imputation quality of HapMap MEX with the 1000G reference panel. Panels: (d) Chr 3, 80-85Mb; (e) Chr 4, 57-62Mb; (f) Chr 8, 18-23Mb.

4.3.3 Imputation Performance with HapMap References

First, consistent with previous reports that imputation quality improves with reference panel size, imputation quality is indeed lower with HapMap references than with the 1000G reference. For example, average dosage r^2 is % with the 1000G reference (Table 4.2) for WHI-HA individuals in the chromosome 4:57-62Mb region but drops to % with HapMapII references (Table 4.5). Second, the difference among methods is much smaller with these smaller HapMap reference sets (H = ), which is consistent with my intuition that, given fixed computational costs, reference selection makes a more pronounced difference with a large reference panel, since only a small portion of the reference can be selected.

WHI-HA and WHI-AA with HapMap references

The complete results are presented in Tables 4.5 and 4.6. In WHI-HA (Table 4.5, H = 420), IBS-PW outperforms IBS-SQ and IBS-DQ slightly, but the advantage disappears in WHI-AA (Table 4.6, H = 240). MaCH-Admix and IMPUTE2 yield similar imputation accuracy, and both outperform BEAGLE slightly.

HapMap ASW and MEX with HapMap references

For ASW, I experimented with three reference panels: HapMapII CEU+YRI, HapMapIII CEU+YRI, and HapMapIII CEU+YRI+LWK+MKK; for MEX, with two reference panels: HapMapII CEU+YRI+JPT+CHB and HapMapIII CEU+YRI+JPT+CHB. Results for ASW with HapMapIII CEU+YRI+LWK+MKK as the reference are shown in Figure 4.7 (the same three selected regions). The remaining results are presented in Tables 4.7, 4.8 and 4.9. Again, MaCH-Admix and IMPUTE2 yield similar imputation accuracy, and both outperform BEAGLE slightly. IBS-PW still performs best in most regions and settings, but the relative difference among methods diminishes when H is small.

Table 4.4: Imputation Results of HapMap ASW & MEX Individuals over Five 5Mb Regions with the 1000G reference (H = 2188)
Column groups: 49 ASW Individuals; 50 MEX Individuals. Columns in each group: overall dosage r^2 (std dev); uncommon SNPs dosage r^2 (std dev); running time.

chromosome3:80-85mb
MaCH-Admix Random 0.937(0.104) 0.854(0.210) (0.080) 0.960(0.149) 173
MaCH-Admix IBS-PW 0.948(0.095) 0.888(0.192) (0.077) 0.968(0.148) 212
MaCH-Admix IBS-SQ 0.948(0.091) 0.898(0.176) (0.079) 0.961(0.148) 203
MaCH-Admix IBS-DQ 0.947(0.095) 0.889(0.190) (0.079) 0.963(0.149) 221
IMPUTE2 (0.106) 0.877(0.201) (0.086) 0.953(0.187) 477
BEAGLE 0.906(0.137) 0.774(0.267) (0.096) 0.938(0.196) 2760

chromosome1:75-80mb
MaCH-Admix Random 0.915(0.135) 0.828(0.233) (0.132) 0.854(0.238) 257
MaCH-Admix IBS-PW 0.930(0.123) 0.859(0.216) (0.134) 0.867(0.250) 302
MaCH-Admix IBS-SQ 0.926(0.128) 0.849(0.222) (0.130) 0.870(0.235) 293
MaCH-Admix IBS-DQ 0.928(0.127) 0.852(0.227) (0.132) 0.866(0.243) 299
IMPUTE2 (0.140) 0.842(0.229) (0.140) 0.847(0.270) 549
BEAGLE 0.900(0.148) 0.817(0.245) (0.144) 0.839(0.264) 3779

chromosome4:57-62mb
MaCH-Admix Random 0.922(0.116) 0.801(0.230) (0.127) 0.873(0.228) 244
MaCH-Admix IBS-PW 0.937(0.107) 0.852(0.220) (0.118) 0.896(0.203) 298
MaCH-Admix IBS-SQ 0.933(0.110) 0.837(0.224) (0.116) 0.894(0.200) 286
MaCH-Admix IBS-DQ 0.934(0.107) 0.845(0.215) (0.119) 0.883(0.210) 290
IMPUTE2 (0.116) 0.819(0.238) (0.120) 0.889(0.207) 785
BEAGLE 0.897(0.144) 0.755(0.284) (0.143) 0.839(0.263) 5677

chromosome14:50-55mb
MaCH-Admix Random 0.899(0.144) 0.739(0.280) (0.119) 0.891(0.218) 366
MaCH-Admix IBS-PW 0.914(0.134) 0.769(0.273) (0.118) 0.900(0.218) 420
MaCH-Admix IBS-SQ 0.909(0.138) 0.765(0.282) (0.120) 0.896(0.223) 420
MaCH-Admix IBS-DQ 0.909(0.135) 0.763(0.264) (0.122) 0.889(0.231) 429
IMPUTE2 (0.145) 0.770(0.281) (0.126) 0.874(0.234) 562
BEAGLE 0.879(0.167) 0.677(0.325) (0.128) 0.872(0.232) 4643

chromosome8:18-23mb
MaCH-Admix Random 0.859(0.172) 0.755(0.283) (0.145) 0.892(0.200) 404
MaCH-Admix IBS-PW 0.879(0.162) 0.792(0.280) (0.140) 0.908(0.186) 487
MaCH-Admix IBS-SQ 0.872(0.164) 0.775(0.282) (0.145) 0.898(0.197) 485
MaCH-Admix IBS-DQ 0.874(0.166) 0.774(0.293) (0.143) 0.901(0.196) 495
IMPUTE2 (0.173) 0.767(0.298) (0.164) 0.864(0.247) 854
BEAGLE 0.844(0.181) 0.760(0.285) (0.156) 0.875(0.233) 6509

All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120, --imputestates 500; IMPUTE2: -iter 30, -k 120, -k_hap 500; BEAGLE: niterations=10 nsamples=4. Running time is measured in seconds. Best performance in each comparison is highlighted by bold font.

91 Table 4.5: Imputation Results of WHI-HA Individuals over Five 5Mb Regions with the HapMapII reference (H = 420) All 3587 individuals Random 200 Subset overall dosage r 2 uncommon SNPs running overall dosage r 2 uncommon SNPs running (std dev) dosage r 2 (std dev) time (std dev) dosage r 2 (std dev) time chromosome3:80-85mb MaCH-Admix Random 0.897(0.157) 0.864(0.091) (0.150) 0.807(0.101) 234 MaCH-Admix IBS-PW 0.905(0.150) 0.918(0.021) (0.150) 0.831(0.081) 248 MaCH-Admix IBS-SQ 0.904(0.150) 0.913(0.033) (0.150) 0.838(0.088) 246 MaCH-Admix IBS-DQ 0.904(0.150) 0.911(0.036) (0.147) 0.845(0.082) 247 IMPUTE (0.148) 0.924(0.011) (0.144) 0.843(0.044) 403 BEAGLE 0.892(0.159) 0.902(0.062) (0.164) 0.831(0.106) 232 chromosome1:75-80mb MaCH-Admix Random 0.855(0.184) 0.752(0.222) (0.185) 0.723(0.253) 350 MaCH-Admix IBS-PW 0.863(0.176) 0.762(0.201) (0.183) 0.721(0.236) 367 MaCH-Admix IBS-SQ 0.860(0.179) 0.748(0.204) (0.181) 0.715(0.234) 363 MaCH-Admix IBS-DQ 0.861(0.178) 0.750(0.204) (0.183) 0.712(0.237) 377 IMPUTE (0.188) 0.740(0.248) (0.194) 0.701(0.282) 556 BEAGLE 0.851(0.186) 0.792(0.230) (0.191) 0.795(0.250) 296 chromosome4:57-62mb MaCH-Admix Random 0.852(0.169) 0.742(0.237) (0.165) 0.775(0.210) 343 MaCH-Admix IBS-PW 0.862(0.162) 0.764(0.217) (0.162) 0.787(0.201) 360 MaCH-Admix IBS-SQ 0.860(0.161) 0.756(0.223) (0.162) 0.779(0.211) 362 MaCH-Admix IBS-DQ 0.860(0.161) 0.757(0.224) (0.164) 0.786(0.205) 363 IMPUTE (0.176) 0.717(0.231) (0.180) 0.732(0.221) 541 BEAGLE 0.850(0.168) 0.740(0.234) (0.174) 0.734(0.263) 348 chromosome14:50-55mb MaCH-Admix Random 0.845(0.190) 0.669(0.285) (0.191) 0.677(0.290) 428 MaCH-Admix IBS-PW 0.854(0.184) 0.689(0.274) (0.186) 0.690(0.273) 448 MaCH-Admix IBS-SQ 0.852(0.184) 0.682(0.283) (0.186) 0.678(0.289) 450 MaCH-Admix IBS-DQ 0.852(0.184) 0.686(0.278) (0.186) 0.689(0.277) 453 IMPUTE (0.183) 0.681(0.272) (0.187) 0.686(0.286) 660 BEAGLE 0.846(0.186) 0.666(0.279) (0.191) 0.641(0.327) 356 chromosome8:18-23mb MaCH-Admix Random 0.826(0.216) 
0.760(0.246) (0.216) 0.754(0.244) 524 MaCH-Admix IBS-PW 0.838(0.211) 0.775(0.240) (0.213) 0.763(0.238) 551 MaCH-Admix IBS-SQ 0.832(0.213) 0.765(0.241) (0.213) 0.758(0.242) 551 MaCH-Admix IBS-DQ 0.833(0.213) 0.768(0.241) (0.216) 0.750(0.243) 553 IMPUTE (0.207) 0.772(0.236) (0.214) 0.744(0.253) 875 BEAGLE 0.826(0.211) 0.742(0.245) (0.215) 0.732(0.258) 543 All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120, --imputestates 500; IMPUTE2: -iter 30, -k 120, -k hap 500; BEAGLE: niterations=10 nsamples=4. Running time is measured in seconds. Best performance in each comparison is highlighted by bold font. 77

92 Table 4.6: Imputation Results of WHI-AA Individuals over Five 5Mb Regions with the HapMapII reference (H = 240) All 8421 Individuals Random 200 Subset overall dosage r 2 uncommon SNPs running overall dosage r 2 uncommon SNPs running (std dev) dosage r 2 (std dev) time (std dev) dosage r 2 (std dev) time chromosome3:80-85mb MaCH-Admix Random 0.877(0.140) 0.684(0.271) (0.149) 0.636(0.354) 259 MaCH-Admix IBS-PW 0.884(0.136) 0.683(0.294) (0.149) 0.641(0.369) 275 MaCH-Admix IBS-SQ 0.883(0.137) 0.678(0.294) (0.148) 0.645(0.356) 264 MaCH-Admix IBS-DQ 0.883(0.137) 0.677(0.297) (0.150) 0.627(0.349) 265 IMPUTE (0.135) 0.668(0.290) (0.148) 0.613(0.371) 388 BEAGLE 0.842(0.164) 0.575(0.259) (0.173) 0.558(0.368) 234 chromosome1:75-80mb MaCH-Admix Random 0.822(0.166) 0.746(0.146) (0.174) 0.746(0.176) 394 MaCH-Admix IBS-PW 0.830(0.160) 0.759(0.143) (0.172) 0.757(0.194) 407 MaCH-Admix IBS-SQ 0.830(0.160) 0.762(0.143) (0.174) 0.746(0.200) 403 MaCH-Admix IBS-DQ 0.831(0.160) 0.764(0.144) (0.173) 0.751(0.198) 402 IMPUTE (0.167) 0.736(0.137) (0.181) 0.712(0.167) 556 BEAGLE 0.798(0.185) 0.685(0.167) (0.200) 0.656(0.222) 291 chromosome4:57-62mb MaCH-Admix Random 0.832(0.150) 0.664(0.152) (0.154) 0.679(0.177) 368 MaCH-Admix IBS-PW 0.841(0.144) 0.686(0.149) (0.152) 0.693(0.177) 378 MaCH-Admix IBS-SQ 0.842(0.143) 0.689(0.143) (0.150) 0.704(0.169) 400 MaCH-Admix IBS-DQ 0.842(0.143) 0.691(0.142) (0.152) 0.693(0.159) 384 IMPUTE (0.153) 0.654(0.160) (0.162) 0.666(0.177) 513 BEAGLE 0.798(0.183) 0.552(0.271) (0.199) 0.464(0.261) 298 chromosome14:50-55mb MaCH-Admix Random 0.770(0.195) 0.628(0.288) (0.199) 0.671(0.278) 427 MaCH-Admix IBS-PW 0.781(0.188) 0.645(0.279) (0.195) 0.681(0.268) 442 MaCH-Admix IBS-SQ 0.780(0.187) 0.647(0.280) (0.196) 0.679(0.262) 436 MaCH-Admix IBS-DQ 0.780(0.188) 0.644(0.283) (0.194) 0.678(0.265) 450 IMPUTE (0.180) 0.667(0.270) (0.194) 0.689(0.265) 597 BEAGLE 0.742(0.210) 0.553(0.308) (0.221) 0.579(0.315) 336 chromosome8:18-23mb MaCH-Admix Random 0.754(0.222) 
0.619(0.241) (0.216) 0.649(0.233) 570 MaCH-Admix IBS-PW 0.764(0.217) 0.641(0.240) (0.214) 0.665(0.230) 584 MaCH-Admix IBS-SQ 0.768(0.214) 0.654(0.235) (0.213) 0.677(0.232) 593 MaCH-Admix IBS-DQ 0.768(0.213) 0.655(0.236) (0.213) 0.672(0.236) 590 IMPUTE (0.203) 0.659(0.232) (0.209) 0.675(0.225) 869 BEAGLE 0.717(0.232) 0.535(0.243) (0.237) 0.543(0.269) 452 All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120, --imputestates 500; IMPUTE2: -iter 30, -k 120, -k hap 500; BEAGLE: niterations=10 nsamples=4. Running time is measured in seconds. Best performance in each comparison is highlighted by bold font. 78

[Figure 4.7 plot: dosage r² (y-axis) against #states / -k (x-axis) for MaCH-Admix Random, MaCH-Admix AW, MaCH-Admix IBS-PW, MaCH-Admix IBS-SQ, MaCH-Admix IBS-DQ, IMPUTE2, and BEAGLE. Panels: (a) Chr 3, 80-85Mb; (b) Chr 4, 57-62Mb; (c) Chr 8, 18-23Mb.]

Figure 4.7: Imputation quality of ASW with HapMapII CEU+YRI+LWK+MKK reference panel. Imputation quality (measured by dosage r²) is plotted as a function of the effective reference panel size (i.e., #states), for ASW individuals in three selected 5Mb regions (ordered by LD from high to low).

I also included ancestry-weighted selection in the evaluation in this setting because weights can be estimated stably given the relatively simple population structure in the reference. Interestingly, I did not observe a noticeable advantage of the ancestry-weighted selection method despite the obvious population structure within the reference panel and the target being admixed individuals. It does, however, slightly outperform random selection in most ASW experiments.

Running Time

Methods implemented in MaCH-Admix have running times comparable to that of IMPUTE2. BEAGLE has similar running time in experiments with HapMap references.

94 Table 4.7: Imputation Results of 49 ASW Individuals Over All Five Short Regions HapMapII CEU+YRI reference HapMapIII CEU+YRI reference overall dosage r 2 uncommon SNPs running overall dosage r 2 uncommon SNPs running (std dev) dosage r 2 (std dev) time (std dev) dosage r 2 (std dev) time chromosome3:80-85mb MaCH-Admix Random 0.937(0.106) 0.721(0.230) (0.121) 0.833(0.275) 138 MaCH-Admix AW 0.938(0.102) 0.766(0.191) (0.120) 0.837(0.275) 147 MaCH-Admix IBS-PW 0.940(0.099) 0.787(0.190) (0.111) 0.860(0.249) 158 MaCH-Admix IBS-SQ 0.939(0.100) 0.759(0.184) (0.121) 0.836(0.281) 158 MaCH-Admix IBS-DQ 0.937(0.106) 0.739(0.209) (0.112) 0.857(0.249) 149 IMPUTE (0.099) 0.803(0.155) (0.119) 0.850(0.264) 316 BEAGLE 0.906(0.140) 0.702(0.276) (0.141) 0.796(0.296) 131 chromosome1:75-80mb MaCH-Admix Random 0.916(0.123) 0.862(0.201) (0.135) 0.810(0.237) 250 MaCH-Admix AW 0.915(0.124) 0.853(0.209) (0.134) 0.812(0.236) 280 MaCH-Admix IBS-PW 0.915(0.123) 0.857(0.202) (0.132) 0.826(0.234) 258 MaCH-Admix IBS-SQ 0.914(0.125) 0.853(0.207) (0.132) 0.819(0.230) 243 MaCH-Admix IBS-DQ 0.914(0.125) 0.858(0.211) (0.135) 0.815(0.240) 246 IMPUTE (0.131) 0.839(0.228) (0.140) 0.810(0.253) 442 BEAGLE 0.893(0.150) 0.824(0.245) (0.166) 0.777(0.285) 199 chromosome4:57-62mb MaCH-Admix Random 0.898(0.138) 0.808(0.230) (0.125) 0.840(0.239) 226 MaCH-Admix AW 0.898(0.138) 0.814(0.231) (0.123) 0.850(0.239) 210 MaCH-Admix IBS-PW 0.900(0.136) 0.821(0.231) (0.127) 0.847(0.249) 234 MaCH-Admix IBS-SQ 0.899(0.141) 0.811(0.238) (0.127) 0.841(0.243) 230 MaCH-Admix IBS-DQ 0.899(0.140) 0.814(0.232) (0.127) 0.845(0.247) 228 IMPUTE (0.140) 0.813(0.233) (0.128) 0.837(0.245) 452 BEAGLE 0.868(0.166) 0.775(0.252) (0.146) 0.803(0.280) 170 chromosome14:50-55mb MaCH-Admix Random 0.869(0.180) 0.744(0.298) (0.179) 0.757(0.306) 504 MaCH-Admix AW 0.871(0.176) 0.765(0.279) (0.177) 0.766(0.304) 282 MaCH-Admix IBS-PW 0.873(0.177) 0.762(0.293) (0.178) 0.769(0.304) 296 MaCH-Admix IBS-SQ 0.873(0.176) 0.761(0.289) (0.178) 0.765(0.302) 
311 MaCH-Admix IBS-DQ 0.873(0.176) 0.757(0.293) (0.180) 0.757(0.312) 310 IMPUTE (0.180) 0.756(0.289) (0.180) 0.766(0.301) 523 BEAGLE 0.841(0.199) 0.688(0.332) (0.201) 0.694(0.340) 214 chromosome8:18-23mb MaCH-Admix Random 0.861(0.170) 0.813(0.247) (0.189) 0.766(0.285) 392 MaCH-Admix AW 0.863(0.171) 0.824(0.249) (0.188) 0.765(0.288) 423 MaCH-Admix IBS-PW 0.862(0.170) 0.824(0.247) (0.189) 0.761(0.296) 423 MaCH-Admix IBS-SQ 0.861(0.172) 0.819(0.246) (0.191) 0.778(0.290) 508 MaCH-Admix IBS-DQ 0.862(0.171) 0.821(0.239) (0.190) 0.776(0.289) 418 IMPUTE (0.175) 0.793(0.263) (0.194) 0.767(0.299) 767 BEAGLE 0.820(0.200) 0.728(0.303) (0.206) 0.732(0.309) 269 All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120; IMPUTE2: -iter 30, -k 120, -k hap 500; BEAGLE: niterations=10 nsamples=4. Best performance in each comparison is highlighted by bold font. 80

Table 4.8: Imputation Results of 49 ASW Individuals Over All Five Short Regions (HapMapIII CEU+YRI+LWK+MKK reference)

Method                 Overall dosage r²    Uncommon SNPs dosage r²    Running time
                       mean (std dev)       mean (std dev)

chromosome3:80-85mb
MaCH-Admix Random      0.953 (0.101)        0.868 (0.232)              162
MaCH-Admix AW          0.954 (0.097)        0.881 (0.222)              159
MaCH-Admix IBS-PW      0.958 (0.091)        0.898 (0.208)              167
MaCH-Admix IBS-SQ      0.954 (0.100)        0.871 (0.233)              179
MaCH-Admix IBS-DQ      0.954 (0.100)        0.876 (0.233)              173
IMPUTE2                      (0.100)        0.877 (0.225)              291
BEAGLE                 0.934 (0.124)        0.811 (0.271)              334

chromosome1:75-80mb
MaCH-Admix Random      0.932 (0.122)        0.837 (0.222)              236
MaCH-Admix AW          0.935 (0.119)        0.847 (0.217)              238
MaCH-Admix IBS-PW      0.939 (0.117)        0.858 (0.222)              283
MaCH-Admix IBS-SQ      0.935 (0.124)        0.841 (0.235)              270
MaCH-Admix IBS-DQ      0.935 (0.120)        0.850 (0.226)              272
IMPUTE2                      (0.124)        0.846 (0.225)              553
BEAGLE                 0.918 (0.144)        0.819 (0.259)              491

chromosome4:57-62mb
MaCH-Admix Random      0.934 (0.107)        0.885 (0.200)              232
MaCH-Admix AW          0.934 (0.110)        0.884 (0.208)              251
MaCH-Admix IBS-PW      0.937 (0.106)        0.892 (0.200)              253
MaCH-Admix IBS-SQ      0.934 (0.110)        0.879 (0.211)              247
MaCH-Admix IBS-DQ      0.935 (0.109)        0.878 (0.210)              267
IMPUTE2                      (0.120)        0.861 (0.237)              426
BEAGLE                 0.914 (0.132)        0.833 (0.256)              469

chromosome14:50-55mb
MaCH-Admix Random      0.883 (0.170)        0.756 (0.301)              318
MaCH-Admix AW          0.886 (0.168)        0.772 (0.295)              309
MaCH-Admix IBS-PW      0.891 (0.167)        0.778 (0.304)              352
MaCH-Admix IBS-SQ      0.889 (0.166)        0.783 (0.295)              320
MaCH-Admix IBS-DQ      0.890 (0.166)        0.786 (0.294)              335
IMPUTE2                      (0.168)        0.785 (0.303)              642
BEAGLE                 0.873 (0.181)        0.757 (0.305)              514

chromosome8:18-23mb
MaCH-Admix Random      0.863 (0.178)        0.781 (0.274)              431
MaCH-Admix AW          0.865 (0.180)        0.788 (0.285)              417
MaCH-Admix IBS-PW      0.871 (0.177)        0.790 (0.286)              452
MaCH-Admix IBS-SQ      0.867 (0.180)        0.800 (0.281)              479
MaCH-Admix IBS-DQ      0.867 (0.178)        0.785 (0.281)              462
IMPUTE2                      (0.186)        0.800 (0.286)              923
BEAGLE                 0.848 (0.190)        0.768 (0.292)              718

All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120; IMPUTE2: -iter 30, -k 120, -k_hap 500; BEAGLE: niterations=10 nsamples=4.
Best performance in each comparison is highlighted by bold font.
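The tables in this section report dosage r² as a mean and standard deviation over SNPs. As a rough illustration only, the sketch below assumes the usual definition of dosage r² at a SNP, namely the squared Pearson correlation between imputed allele dosages (0 to 2) and the experimentally determined genotypes. The function names `dosage_r2` and `summarize` are hypothetical, and the use of the population standard deviation is an assumption; the evaluation scripts behind the tables are not shown in the text.

```python
import numpy as np


def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    dosages:   imputed allele dosages for one SNP, one value per individual (0..2)
    genotypes: experimentally determined genotypes for the same individuals (0, 1, 2)
    Returns NaN when either vector is constant, since r is then undefined.
    """
    d = np.asarray(dosages, dtype=float)
    g = np.asarray(genotypes, dtype=float)
    if d.std() == 0.0 or g.std() == 0.0:
        return float("nan")
    r = np.corrcoef(d, g)[0, 1]
    return float(r * r)


def summarize(per_snp_r2):
    """Mean and (population) standard deviation over SNPs, skipping NaN entries."""
    values = np.asarray(per_snp_r2, dtype=float)
    values = values[~np.isnan(values)]
    return float(values.mean()), float(values.std())
```

Because r² is invariant to linear rescaling, dosages that are systematically shrunk toward the mean can still score close to 1, which is one reason such a metric is usually read alongside other quality measures.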

96 Table 4.9: Imputation Results of 50 MEX Individuals Over All Five Short Regions HapMapII CEU+YRI+JPT+CHB reference HapMapIII CEU+YRI+JPT+CHB reference overall dosage r 2 uncommon SNPs running overall dosage r 2 uncommon SNPs running (std dev) dosage r 2 (std dev) time (std dev) dosage r 2 (std dev) time chromosome3:80-85mb MaCH-Admix Random 0.965(0.083) 0.988(0.040) (0.112) 0.893(0.227) 143 MaCH-Admix AW 0.965(0.080) 0.985(0.054) (0.109) 0.898(0.216) 144 MaCH-Admix IBS-PW 0.964(0.082) 0.989(0.037) (0.110) 0.899(0.222) 184 MaCH-Admix IBS-SQ 0.964(0.081) 0.987(0.046) (0.110) 0.897(0.221) 164 MaCH-Admix IBS-DQ 0.963(0.083) 0.989(0.042) (0.112) 0.896(0.223) 167 IMPUTE (0.089) 0.986(0.036) (0.119) 0.898(0.237) 311 BEAGLE 0.959(0.093) 0.995(0.012) (0.130) 0.854(0.245) 232 chromosome1:75-80mb MaCH-Admix Random 0.927(0.136) 0.832(0.244) (0.165) 0.818(0.296) 255 MaCH-Admix AW 0.929(0.134) 0.827(0.240) (0.168) 0.814(0.306) 248 MaCH-Admix IBS-PW 0.930(0.134) 0.838(0.245) (0.169) 0.819(0.312) 272 MaCH-Admix IBS-SQ 0.926(0.136) 0.838(0.221) (0.171) 0.829(0.308) 251 MaCH-Admix IBS-DQ 0.926(0.139) 0.832(0.230) (0.170) 0.822(0.309) 262 IMPUTE (0.141) 0.820(0.250) (0.177) 0.801(0.317) 476 BEAGLE 0.915(0.146) 0.806(0.245) (0.191) 0.775(0.338) 299 chromosome4:57-62mb MaCH-Admix Random 0.928(0.147) 0.806(0.296) (0.160) 0.840(0.286) 219 MaCH-Admix AW 0.929(0.146) 0.806(0.286) (0.162) 0.838(0.289) 214 MaCH-Admix IBS-PW 0.928(0.149) 0.802(0.304) (0.161) 0.844(0.287) 238 MaCH-Admix IBS-SQ 0.928(0.148) 0.812(0.286) (0.163) 0.851(0.288) 235 MaCH-Admix IBS-DQ 0.927(0.149) 0.809(0.292) (0.161) 0.839(0.291) 238 IMPUTE (0.156) 0.806(0.300) (0.169) 0.832(0.298) 501 BEAGLE 0.920(0.160) 0.793(0.305) (0.172) 0.824(0.304) 320 chromosome14:50-55mb MaCH-Admix Random 0.922(0.158) 0.895(0.167) (0.183) 0.823(0.290) 347 MaCH-Admix AW 0.921(0.161) 0.902(0.168) (0.183) 0.816(0.292) 286 MaCH-Admix IBS-PW 0.922(0.163) 0.900(0.171) (0.182) 0.827(0.293) 335 MaCH-Admix IBS-SQ 0.921(0.161) 0.903(0.168) (0.183) 
0.828(0.286) 316 MaCH-Admix IBS-DQ 0.920(0.161) 0.898(0.166) (0.181) 0.840(0.287) 315 IMPUTE (0.165) 0.901(0.169) (0.182) 0.827(0.290) 598 BEAGLE 0.911(0.170) 0.891(0.172) (0.190) 0.813(0.299) 319 chromosome8:18-23mb MaCH-Admix Random 0.900(0.162) 0.852(0.233) (0.191) 0.824(0.284) 402 MaCH-Admix AW 0.901(0.160) 0.858(0.224) (0.196) 0.815(0.294) 401 MaCH-Admix IBS-PW 0.903(0.159) 0.867(0.218) (0.197) 0.826(0.298) 513 MaCH-Admix IBS-SQ 0.900(0.163) 0.863(0.223) (0.198) 0.817(0.298) 465 MaCH-Admix IBS-DQ 0.900(0.161) 0.864(0.212) (0.199) 0.813(0.301) 459 IMPUTE (0.164) 0.871(0.199) (0.205) 0.811(0.302) 806 BEAGLE 0.889(0.169) 0.859(0.225) (0.211) 0.788(0.320) 434 All results were generated using the following parameter values: MaCH-Admix: --rounds 30, --states 120; IMPUTE2: -iter 30, -k 120, -k hap 500; BEAGLE: niterations=10 nsamples=4. Best performance in each comparison is highlighted by bold font. 82

BEAGLE, however, needs significantly more computing time than MaCH-Admix and IMPUTE2 when imputing with the 1000G reference, which I believe has to do with how consecutive untyped variants are modeled. Note that, due to the large number of experiments, I conducted all experiments on a large Linux cluster with more than 1000 CPUs. This leads to moderate fluctuations in running time over short regions due to I/O competition. Nevertheless, I obtain largely consistent conclusions across different experimental settings.

4.4 Discussion

In summary, the emergence of large reference panels calls for more efficient methods to utilize this rich resource. I have proposed two classes of reference-selection methods, namely IBS-based and ancestry-weighted approaches, to construct effective reference panels within a previously described HMM, and implemented them in the software package MaCH-Admix for genetic imputation in admixed populations. I have performed systematic evaluations on large (WHI-AA and WHI-HA full samples with 8421 and 3587 individuals), medium (subsets of 200 individuals from each of the two WHI admixed cohorts), and small (HapMap ASW and MEX with 49 and 50 founders respectively) target samples; using large (the latest 1000G with H = 2188) and small (HapMap with H = ) reference panels; and in five regions with different levels of LD.

Compared with popular existing methods, MaCH-Admix demonstrates its advantage mostly because its piecewise algorithm takes potential changes in haplotype pattern sharing across regions into direct account (versus IMPUTE2, which adopts a whole-haplotype IBS matching approach) and because it does not reduce local haplotype complexity (versus BEAGLE, which does so to gain computational efficiency). Based on my evaluations, I recommend the proposed piecewise IBS-based method, which demonstrates the best trade-off between quality and computing time.
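To make the piecewise idea concrete, the following is a minimal Python sketch, not the MaCH-Admix implementation. It assumes unphased target genotypes coded 0/1/2 and reference haplotypes coded 0/1, scores a haplotype by the number of markers at which it does not conflict with the genotype (a conflict being the opposite homozygote), splits the region into pieces of roughly equal marker count, and takes an equal share of the panel from each piece. The function names, the scoring rule, and the equal-share rule are all illustrative assumptions.

```python
import numpy as np


def ibs_compatibility(genotypes, haplotypes):
    """Count, per haplotype, the markers that do not conflict with the genotypes.

    genotypes:  (M,) array of allele counts in {0, 1, 2}
    haplotypes: (H, M) array of alleles in {0, 1}
    Assumption: a haplotype allele conflicts only with the opposite
    homozygote; heterozygous genotypes are compatible with either allele.
    """
    conflict = ((genotypes == 0) & (haplotypes == 1)) | (
        (genotypes == 2) & (haplotypes == 0)
    )
    return (~conflict).sum(axis=1)


def piecewise_ibs_select(genotypes, haplotypes, n_states, n_pieces):
    """Pick about n_states reference haplotypes, an equal share from each piece."""
    n_markers = genotypes.shape[0]
    per_piece = max(1, n_states // n_pieces)
    # Hypothetical choice: pieces with (nearly) equal numbers of markers.
    bounds = np.linspace(0, n_markers, n_pieces + 1).astype(int)
    chosen = set()
    for start, stop in zip(bounds[:-1], bounds[1:]):
        scores = ibs_compatibility(genotypes[start:stop], haplotypes[:, start:stop])
        # Best local matches first; a stable sort keeps ties in index order.
        ranked = np.argsort(-scores, kind="stable")
        fresh = [int(h) for h in ranked if int(h) not in chosen][:per_piece]
        chosen.update(fresh)
    return sorted(chosen)
```

In a toy case where one haplotype matches the target only in the first half of the region and another only in the second half, both receive the same whole-region score as each other, yet each is the clear best match within its own piece, so the piecewise rule selects both.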

As the reference panel continues to grow rapidly (for example, the 1000 Genomes Project will generate 5,000 haplotypes within two years), approaches that can rapidly explore the entire reference pool will become increasingly valuable. IBS-based approaches show such potential. As shown by results from both the WHI individuals and the HapMapIII individuals, IBS-based approaches can generate accurately imputed genotypes by preferentially selecting a small but different subset of 100 haplotypes (corresponding to 5% for the current 1000G case where H = 2188) from the entire reference pool in each iteration. As computational costs increase quadratically with the effective number of haplotypes used in each iteration, such a 95% reduction in the effective number of reference haplotypes corresponds to a >99.5% reduction in computational investment, since (0.05)² = 0.0025.

Previous studies [Hao et al., 2009; Li et al., 2009; Shriner et al., 2010; Zhang et al., 2011; Seldin et al., 2011] have recommended, for increased imputation accuracy, the use of a combined reference panel that pools haplotypes from all available reference populations (e.g., from the HapMap or the 1000 Genomes Projects), especially for populations that do not have a single best-match reference population. Including reference haplotypes from populations different from those in the target samples in such a cosmopolitan panel introduces two forces working in opposite directions: shared haplotype stretches (likely even shorter) increase imputation quality, while noise added by including population-specific local haplotypes harms it. Therefore, the recommendation of using a cosmopolitan panel to enhance imputation quality also applies to MaCH-Admix, and is conceptually even more applicable because MaCH-Admix reduces the noise by selecting the most relevant local haplotypes into the effective reference.

One key question concerns the optimal region size for imputation.
From the perspective of including more LD information, particularly the long-range LD information that is critical for the imputation of uncommon variants, imputation over longer regions is desirable. However, approaches that select reference haplotypes according to genetic matching between reference haplotypes and genotypes of target individuals across the entire region, such as whole-haplotype IBS-based methods, will likely suffer from changes in genetic matching over a long region. For example, in both scenarios presented in Figure 4.1, there are two distinct sub-regions according to the matching pattern. Lumping them together naively, particularly using a single queue, may well lead to inferior performance, as discussed earlier. I attempt to solve the problem by breaking the entire region into smaller pieces and, within each piece, selecting some reference haplotypes according to local genetic matching. This is conceptually similar to local ancestry adjustment in the analysis of admixed populations [Wang et al., 2011a]. Pasaniuc et al. [2011] also found that local ancestry increases imputation accuracy.

The proposed piecewise IBS-based selection method is robust to imputation region size. I have evaluated the performance on whole chromosomes using ASW/MEX with HapMap references and found that both piecewise IBS and ancestry-weighted selection perform much better than whole-haplotype IBS-based methods (data not shown). Between piecewise IBS and ancestry-weighted selection, the piecewise IBS method has an advantage in most whole-chromosome experiments and is very close to ancestry-weighted selection in the rest.

Ancestry-weighted approaches have previously been used to construct reference panels in admixed populations for tagSNP selection or imputation [Egyud et al., 2009; Pasaniuc et al., 2010; Pemberton et al., 2008]. However, such reference panels, created a priori, pose two problems for imputation. First, haplotypes from contributing reference populations are literally duplicated, substantially increasing the computational burden. Second, the same fixed, pre-constructed reference haplotypes are used for all Markov iterations, preventing imputation algorithms from taking into account the uncertainty in creating the reference panel.
My ancestry-weighted approach selects reference haplotypes probabilistically according to the estimated ancestry proportions and creates a different reference panel in each Markov iteration. This strategy ensures that all reference haplotypes can be selected when the Markov iterations are run long enough, thus avoiding both problems mentioned above. An attractive feature that I have added to MaCH-Admix is the ability to estimate ancestry proportions, so that it can internally generate weights for the ancestry-weighted approach without the need to install and call external programs. Although many methods exist to infer ancestry, including for example structure [Pritchard et al., 2000], HAPMIX [Price et al., 2009] and GEDI-ADMX [Pasaniuc et al., 2009], I believe that researchers will find this built-in feature convenient. I found my estimates to be reasonably close to estimates from structure and to work well for imputation purposes.

In this study, I have examined the performance of my proposed methods and other imputation methods in both Hispanics and African Americans. Between the two, Hispanics are known to have more complex LD structure because three ancestral populations are involved, as opposed to two for African Americans. The more complex LD in Hispanics indeed makes it essential to account more explicitly for the larger variability in local ancestry (for example, using my proposed piecewise approach). The more complex LD and population substructure in Hispanics have prevented many investigators from even attempting imputation. However, I observe similar, if not slightly better, imputation quality for Hispanics in the five regions examined, with an average dosage r² of 92.5% (81.8%) versus 92.1% (81.4%) for all (uncommon) SNPs in WHI-HA and WHI-AA respectively, using my piecewise IBS approach. That imputation performance for Hispanics is comparable to that for African Americans is expected, because Hispanics have on average less African ancestry (where LD is the lowest and imputation is thus most challenging) than African Americans. Therefore, I highly encourage investigators working with Hispanics to perform imputation as well.
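Returning to the ancestry-weighted selection described above, in which reference haplotypes are drawn probabilistically according to estimated ancestry proportions and a fresh panel is created in each Markov iteration, the following Python sketch shows one way such sampling could look. It is an assumption-laden illustration rather than the MaCH-Admix code: the function name, the rule of spreading each population's proportion evenly over its haplotypes, and the example labels and proportions (CEU 0.2, YRI 0.8) are all hypothetical.

```python
import numpy as np


def ancestry_weighted_panel(pop_labels, ancestry_props, n_states, rng):
    """Draw n_states distinct reference haplotype indices for one iteration.

    pop_labels:     population label of each reference haplotype
    ancestry_props: dict mapping population label to estimated ancestry proportion
    Assumption: a population's proportion is shared equally by its haplotypes.
    Populations with zero proportion can never be drawn, so n_states must not
    exceed the number of haplotypes with positive weight.
    """
    labels = np.asarray(pop_labels)
    weights = np.zeros(labels.shape[0], dtype=float)
    for pop, prop in ancestry_props.items():
        members = labels == pop
        if members.any():
            weights[members] = prop / members.sum()
    weights /= weights.sum()
    return rng.choice(labels.shape[0], size=n_states, replace=False, p=weights)


# Illustrative use: a new panel is drawn in every iteration.
rng = np.random.default_rng(0)
labels = ["CEU"] * 4 + ["YRI"] * 4
props = {"CEU": 0.2, "YRI": 0.8}
panels = [ancestry_weighted_panel(labels, props, 4, rng) for _ in range(500)]
```

Because every haplotype from a population with positive weight has a nonzero chance of being drawn in each iteration, no reference haplotype is duplicated within a panel and none is excluded permanently, which is the property the text relies on.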
Although in this work I propose the reference selection methods for imputation of admixed individuals, the methods can be directly applied to imputation in general for non-admixed populations by finding the best genetic match for each target individual. For the same reason, IBS-based methods tend to work better than ancestry-weighted approaches when between-individual variation among the target individuals is large (data not shown). This is not surprising because IBS-based approaches select a different effective reference panel tailored to each target individual, rather than applying one uniform reference sampling setting to all target individuals as in the ancestry-weighted approach.

I have also attempted to examine common and uncommon genetic variants separately, using a MAF of 5% as the cutoff. I observe more pronounced differences among the attempted methods with uncommon variants, suggesting that the choice of reference selection method matters more for uncommon variants. Due to the nature of the SNPs evaluated (either typed Affymetrix 6.0 markers for the WHI individuals, or HapMap markers) and the target sample size (49-50 for HapMapIII ASW and MEX), there are few really rare (MAF < 1%) variants. Although several attempts have been made [Wang et al., 2011b; Howie et al., 2011; The International HapMap Consortium, 2010; Liu et al., 2012], imputation quality for uncommon variants is far from being fully assessed and needs to be further evaluated when data from large-scale sequencing efforts become available.

The last, but clearly not the least, point concerns computational efficiency. MaCH-Admix is very flexible in terms of the effective number of haplotypes used in each iteration and the number of iterations. Imputation accuracy depends on both parameters. Since computational cost increases quadratically with --states and linearly with --rounds, for practical purposes I recommend using --states and --rounds 20. I also have an option, --imputestates, analogous to IMPUTE2's -k_hap, which increases computational costs only linearly; even at its large default value (500), it contributes only a small proportion of computing time. Between the two categories of approaches proposed, the ancestry-weighted approach requires only one-time up-front costs for the estimation of ancestry proportions.
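The scaling rules stated in this paragraph (quadratic in --states, linear in --rounds) lend themselves to a quick back-of-envelope check. The short Python sketch below is only an illustration under that stated cost model; `relative_cost` is a hypothetical helper, it ignores fixed overheads such as I/O and the per-iteration selection cost, and it reuses the figures quoted earlier in this discussion (100 selected haplotypes out of H = 2188).

```python
def relative_cost(states, rounds, base_states, base_rounds):
    """Cost relative to a baseline, assuming cost grows as states**2 * rounds."""
    return (states / base_states) ** 2 * (rounds / base_rounds)


# Using 100 of the 2188 reference haplotypes per iteration, same number of rounds.
saving = 1.0 - relative_cost(100, 30, 2188, 30)
print(f"{saving:.2%}")  # prints 99.79%

# Doubling --states quadruples the cost; doubling --rounds doubles it.
print(relative_cost(240, 30, 120, 30), relative_cost(120, 60, 120, 30))
```

Under this model the saving from selecting 100 haplotypes is consistent with the ">99.5% reduction" quoted earlier, and it shows why --states is the parameter to tune first when computing time is the constraint.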
The IBS-based methods, on the other hand, incur overhead costs at each iteration for calculating genetic similarities between individuals in the target population and the reference haplotypes. For both, the costs increase with the reference panel size. Finally, computational costs would increase only linearly with --states if I started with haplotypes of the target individuals, that is, for haplotype-to-haplotype imputation (both reference and target are in haplotype form) as performed by the software minimac. I plan to extend my proposed methods to minimac in the future.

Web Resources

Census fact for admixed populations, The 1000 Genomes Project, MaCH-Admix, MaCH, IMPUTE, BEAGLE, structure

Chapter 5

Genotype Imputation of Metabochip SNPs in African Americans Using a Study Specific Reference Panel

5.1 Introduction

Genotype imputation has become standard practice to increase genome coverage and improve power in Genome-Wide Association Studies (GWAS) and meta-analysis [de Bakker et al., 2008; Li et al., 2009; Marchini and Howie, 2010]. Most of the literature on genotype imputation has focused on using external reference panels (for example, phased haplotypes from the International HapMap Project [The International HapMap Consortium, 2007] or the 1000 Genomes Project [The 1000 Genomes Project Consortium, 2010]), largely in individuals of European ancestry, for inference of genotypes at common (MAF > 0.05) genetic markers.

GWAS have identified > 4,300 genetic variants associated with human diseases and traits [Hindorff et al., 2009]. Investigators across the world have begun efforts to fine map within regions where GWAS-identified SNPs reside, through dense genotyping (e.g., using region-centric or gene-centric chips like the Metabochip for metabolic related traits, the ITMAT-Broad-CARe [IBC] for cardiovascular related traits, or the immunochip for immune related diseases) or sequencing. Furthermore, multiethnic genetic association studies have been recognized as potentially more powerful for both gene discovery and fine mapping [McCarthy et al., 2008; Pulit et al., 2010; Rosenberg et al., 2010; Teo et al., 2010], and some initial efforts have been carried out [He et al., 2011; Keebler et al., 2010; Lanktree et al., 2009; Lettre et al., 2011; Smith et al., 2011; Waters et al., 2009]. In addition, because GWAS-identified SNPs (mostly common) explain only a small proportion of overall heritability for most complex diseases and traits [Eichler et al., 2010; Maher, 2008; Manolio et al., 2009], whole-genome or whole-exome sequencing for rare SNPs and genetic variants other than SNPs (e.g., copy number variations, structural variants) is under way.

So far, there has been relatively little research on the performance of genotype imputation in this new context. My study provides a typical scenario where 8,421 African Americans from the Women's Health Initiative [The WHI Study Group, 1998] SNP Health Association Resource (SHARe) were genotyped using the Affymetrix 6.0 genotyping platform. In an attempt to generalize genetic effects across racial groups, the Population Architecture using Genomics and Epidemiology (PAGE) consortium genotyped a subset of 1,962 African American WHI participants with data on multiple metabolic related phenotypes using the Metabochip [Matise et al., 2011]. To increase the power to detect moderate to small genetic effects, I sought to impute the Metabochip SNPs in the remaining 6,459 individuals in WHI SHARe with Affymetrix 6.0 data only. Imputing SNPs in fine-mapping regions tends to be more challenging because these SNPs tend to be rare and in low linkage disequilibrium (LD) with GWAS SNPs.
Here I describe a pipeline for constructing study-specific reference panels using individuals genotyped or sequenced at a larger set of genetic markers (in this case, individuals genotyped using both Affymetrix 6.0 and Metabochip) and for imputation into individuals with genotype data at a subset of markers (in this case, individuals genotyped using Affymetrix 6.0 only). I benchmark the quality of my imputation in an African American population, for SNPs on the Metabochip, a region-centric genotyping platform, with particular focus on low frequency SNPs (MAF down to 0.001), using a large study-specific reference panel containing 3,924 haplotypes. An African American sample poses a greater challenge for genotype imputation because LD patterns are more complex in African Americans than in individuals of European ancestry [Egyud et al., 2009; Shriner et al., 2010], and comparatively less discovery work has been done in this population.

I first describe how I constructed my study-specific reference panel using the 1,962 African American individuals with genotypes for both Affymetrix 6.0 and Metabochip SNPs and how I performed imputation of the Metabochip-only SNPs into the remaining 6,459 individuals. I then show several approaches through which I estimated imputation quality for SNPs in different MAF categories, with a special focus on less common (MAF: ) and rare (MAF < 0.01) variants. I provide practical guidelines regarding post-imputation quality control for different MAF categories, as well as for the inclusion of rare variants during imputation.

5.2 Materials and Methods

Pre-Imputation Quality Control

Prior to phasing and imputation, quality control was applied to both the Metabochip data and the GWAS data. Specifically, for the GWAS dataset (n = 6,459) I removed Affymetrix 6.0 SNPs with genotype call rates < 90% (m = 1,633), or Hardy-Weinberg exact test [Wigginton et al., 2005] p-value < 10^-6 (m = 16,327), or MAF < 0.01 (m = 14,014), resulting in 829,370 GWAS SNPs passing quality control criteria [Reiner et al., 2011]. Separate quality control criteria were applied to the Metabochip SNPs, leading to 182,397 QC+ SNPs with genotype call rates > 95% and Hardy-Weinberg p-value > 10^-6. Individuals were excluded if they had a call rate below 95%, showed excess heterozygosity, were part of an apparent first-degree relative pair, or were ancestry outliers as determined

by Eigensoft [Price et al., 2006]. Details can be found in the PAGE Metabochip platform paper [Buyske et al., 2011].

General Pipeline for Reference Construction and Subsequent Imputation

Figure 5.1 shows schematically how imputation was performed. In the top left panel, I first merged genotypes from the Affymetrix GWAS panel (blue) and the Metabochip (yellow) SNPs genotyped as part of the PAGE study for the 1,962 reference individuals (i.e., individuals with genotype data from both platforms). I then reconstructed haplotypes encompassing both GWAS and Metabochip SNPs for the reference individuals, constituting the reference panel of 3,924 haplotypes. In the top right panel, haplotype reconstruction for target individuals (i.e., individuals with GWAS genotypes only) was carried out similarly, but at the GWAS markers only. Finally, a haplotype-to-haplotype imputation (that is, data are in haplotype form for both the reference and target individuals) was performed to generate estimated genotypes at the Metabochip SNPs for the 6,459 target individuals.

Figure 5.1: Reference construction and imputation pipeline using a study-specific reference panel. This schematic cartoon shows how I constructed my study-specific reference panel using five individuals genotyped on both the Affymetrix 6.0 and the Metabochip platform and how I performed imputation into the remaining five individuals with Affymetrix 6.0 data only.

5.3 Results

Genomewide Imputation using Large Study-Specific Reference

After careful matching on strand (so that genotypes from both Affymetrix 6.0 and the Metabochip are on the same strand), SNP ID, genomic coordinates, and actual genotypes for SNPs in common, I had a merged set of 987,749 SNPs for the 1,962 reference individuals. The average concordance rate for the 23,703 SNPs in common was 99.7%. For discordant genotypes, I kept the GWAS genotypes to match those of the target individuals with GWAS data only. Haplotypes were reconstructed on the merged set using MaCH [Li et al., 2010a]. In parallel, I constructed haplotypes across the 829,370 QC+ GWAS SNPs for all 8,421 individuals. Finally, I used the 3,924 haplotypes across the merged set of 987,749 SNPs as reference to impute into haplotypes across GWAS SNPs of the target individuals. The final haplotype-to-haplotype imputation was performed using the software package minimac, which generates the allele dosages (the fractional counts of an arbitrary allele at each SNP for each individual, ranging continuously from 0 to 2). Minimac also generates the SNP-level quality metric Rsq, which is the SNP-specific estimated r² between allele dosages and the unknown true genotypes. Rsq has been