A Population-based Latent Variable Approach for Association Mapping of Quantitative Trait Loci


doi: /j x

Tao Wang 1, Bruce Weir 2 and Zhao-Bang Zeng 2
1 Division of Biostatistics & Human Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, WI
2 Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC

Summary

A population-based latent variable approach is proposed for association mapping of quantitative trait loci (QTL), using multiple closely linked genetic markers within a small candidate region in the genome. Because the QTL enter the penetrance model as latent variables, they can flexibly characterize either alleles at putative trait loci or potential risk haplotypes/sub-haplotypes of the markers. Under a general likelihood framework, we develop an EM-based algorithm to jointly estimate the genetic effects of the QTL and the haplotype frequencies of the QTL and markers. Closed form solutions derived in the maximization step of the EM procedure for updating the joint haplotype frequencies of QTL and markers can effectively reduce the computational intensity. Various association measures between QTL and markers can then be derived from the haplotype frequencies of markers and used to infer QTL positions. The likelihood ratio statistic also provides a joint test for association between a quantitative trait and marker genotypes without requiring adjustment for multiple testing. Extensive simulation studies are performed to evaluate the approach.

Keywords: association mapping, maximum likelihood, haplotype, latent variable, EM algorithm

Introduction

Classical linkage studies provide the first step towards locating disease susceptibility genes. Further narrowing down these candidate regions using a denser array of polymorphic genetic markers, such as single nucleotide polymorphisms (SNPs), is vital to identify the genes and their pathways that control a disease phenotype.
The association or linkage disequilibrium (LD) mapping approach is a powerful tool for fine mapping of disease genes (Hästbacka et al. 1992; Jorde, 1995; Long et al. 1995; Risch & Merikangas, 1996; Clayton, 2000). It has been shown that varied recombination rates across the human genome may correspond to low or high LD blocks (Peterson et al. 1995; Yu et al. 2001). As a result, the comparison of LD levels on large scales may not be so meaningful (Pritchard & Przeworski, 2001). However, some recent experimental evidence supports the idea that recombination may play a major role in shaping LD patterns in some small chromosomal regions (Kauppi et al. 2003). Therefore, with multiple linked markers (e.g., tagged SNPs) selected in such a small candidate region, it would be expected that the markers in stronger association with the disease phenotype would tend to be closer to the QTL.

The LD approach is applicable to both family-based and population-based samples. In general, family-based samples are more informative and contain both recombination and LD information, but they are compromised by the limited number of informative families. Population-based samples, on the other hand, are preferable for collecting large sample sizes and detecting genes with low penetrance and modest effects. Population-based association mapping, however, is complicated by the fact that a current population provides only a single sample generated from a complex stochastic evolutionary process, and that the extent and distribution of LD in a current population is a cumulative result of many evolutionary forces, including mutation, random genetic drift and natural selection. Admixture of populations with different allele frequencies may also generate spurious LD among loci. Numerous statistical methods for adjusting for population admixture have been developed (Devlin & Roeder, 1999; Bacanu et al. 2002; Pritchard et al. 2000a,b; Zhang & Zhao, 2001; Zhang et al. 2003). Classical epidemiological strategies can also be applied to adjust for population admixture through population stratification.

A number of population-based LD methods have been proposed to map QTL using polymorphic genetic markers (Terwilliger, 1995; Rabinowitz, 1997; Rannala & Slatkin, 1998; Page & Amos, 1999; Meuwissen & Goddard, 2000; Hoeschele, 2003). Most of these methods assumed that the potential QTL were located on markers. Given that we can only observe genotypes of markers but not of the QTL, this restriction may lead to biased estimation of parameters such as QTL allele frequencies and genetic effects (Nielsen & Weir, 1999). The power may also be substantially reduced by the incomplete disequilibrium and differing allele frequencies between QTL and markers (Tu & Whittemore, 1999). Alternatively, Luo et al. (2000) developed a population-based LD method to infer QTL by regarding QTL genotypes as missing data. In their approach the existence of the QTL can be tested based on a null hypothesis of no genetic effects of the QTL. Meanwhile, the LDs between the disease locus and marker loci provide information for locating the disease locus. More recently, Lou et al. (2003) also explored some extensions of the method of Luo et al. (2000). In these studies, however, the possible role of QTL was not clearly formulated. The application of their methods to multiple closely linked markers using unphased genotype data was also limited.

Correspondence address: Tao Wang, PhD, Division of Biostatistics, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI Tel: , Fax: , taowang@mcw.edu

506 Annals of Human Genetics (2006) 70, © 2006 The Authors

In fine mapping of QTL within a small candidate region, the identification of QTL is complicated by the strong LD among markers within the region. Simple association analysis using one marker at a time may not help in narrowing down the region, because single marker analysis often leads to inconsistent estimation of QTL allele frequencies and genetic effects. Using multiple linked markers simultaneously may increase the power of QTL detection and improve the accuracy of QTL locations (Kruglyak et al. 1996; Goddard, 1999; Meuwissen & Goddard, 2000). In this study we extend the method of Luo et al. (2000) to multiple linked markers with unphased genotype data. First, we present a latent variable model to describe the relationship between a quantitative trait and its QTL genotypes. The potential role of a QTL as a bridge linking the phenotype and the multiple genetic markers under study, and the underlying model structure, are clarified. Second, under a general likelihood framework we develop an EM-based algorithm for the maximum likelihood estimation (MLE) of QTL effects and haplotype frequencies of the QTL and markers jointly. In the maximization step of this EM procedure, closed form solutions are derived for updating the joint haplotype frequencies of QTL and markers, which allows the use of multiple unphased marker genotype data with an arbitrary number of alleles and can effectively reduce the computational intensity. A likelihood-based test is also constructed for testing the association between the quantitative trait and markers through the QTL. Third, we explain the calculation of various LD associations between QTL and markers from the joint haplotype frequencies of the QTL and markers. A partial correlation between QTL and markers is introduced as an association measure to separate the hitch-hiking effect of markers. A strategy for inferring QTL positions based on the association pattern is then presented.
Finally, we perform extensive simulation studies to evaluate the mapping approach and examine properties of the test statistics.

Recently, haplotype analysis has become one of the most active areas in current genomic studies. As a combination of closely linked alleles on a chromosome, the haplotype may provide a better approximation to functional genes than single marker alleles or genotypes (Schaid, 2004; Clark, 2004). One issue in current haplotype analysis is to determine which combination of markers should be used for construction of the haplotypes. To fully exploit association between a set of markers and a disease phenotype, various marker combinations need to be considered. A comprehensive test for association between the disease phenotype and a set of markers therefore requires a series of tests for marker alleles, genotypes, sub-haplotypes and whole-set haplotypes, which often leads to severe multiple testing problems (Schaid, 2004; Becker & Knapp, 2004). We show, via our simulation studies, that the latent QTL model can characterize either a putative trait locus that is linked to the markers or a potential risk haplotype or sub-haplotype of the markers. Therefore, the likelihood ratio statistic based on our latent QTL model may provide a joint test for association between the quantitative trait and a set of marker loci without requiring adjustment for multiple testing.

Methods

Latent Variable Model

In quantitative genetic studies a quantitative trait is assumed to be affected by its QTL and modified by environmental factors as well. This phenotype-QTL-genotype relationship can usually be described through a genetic model. Suppose that we have a random sample of $N$ unrelated individuals from a sampled population. Let $y_i$, $x_i$, and $z_i$, $i = 1, 2, \ldots, N$, denote the phenotypic value of the quantitative trait of interest, the marker genotypes, and the $p$ covariates of individual $i$, respectively. Let $q_i$ be the QTL genotypes of individual $i$. Then a common model for the quantitative response $y_i$ is given by

\[ y_i = z_i \alpha + Q_i \beta + e_i, \quad i = 1, \ldots, N \tag{1} \]

where $\alpha$ is a $p$-dimensional vector for fixed effects of the covariates, $\beta$ is a $q$-dimensional vector for fixed effects of the QTL that may include additive, dominance and probable genetic interactions of the QTL, and $Q_i$ is a $q$-dimensional vector corresponding to $\beta$ with its components coded according to the QTL genotypes $q_i$ of the individual. For example, applying the genetic model of Falconer & Mackay (1996) for a diallelic QTL with alleles $A_{j0}$ and $A_{j1}$ at the $j$-th QTL locus, one can define for individual $i$

\[ w_{ij} = \begin{cases} 1, & \text{genotype } A_{j0}A_{j0} \text{ at the } j\text{-th QTL} \\ 0, & \text{genotype } A_{j0}A_{j1} \text{ at the } j\text{-th QTL} \\ -1, & \text{genotype } A_{j1}A_{j1} \text{ at the } j\text{-th QTL,} \end{cases} \]

\[ v_{ij} = \begin{cases} 1, & \text{genotype } A_{j0}A_{j0} \text{ or } A_{j1}A_{j1} \text{ at the } j\text{-th QTL} \\ 0, & \text{genotype } A_{j0}A_{j1} \text{ at the } j\text{-th QTL,} \end{cases} \]

for $j = 1, 2, \ldots, l$ and $i = 1, \ldots, N$, where $l$ is the number of QTL affecting the trait.
Ignoring QTL interactions, we can take $Q_i = (w_{i1}, v_{i1}, \ldots, w_{il}, v_{il})$ (i.e., $q = 2l$) corresponding to $\beta = (a_1, d_1, \ldots, a_l, d_l)^T$, where $a_j$ and $d_j$ are the additive and dominance effects of the $j$-th QTL, respectively. QTL interactions or QTL by environment interactions may also be included in the model (Zeng et al. 2005). Model (1) can also be expressed in matrix form as

\[ y = Z\alpha + Q\beta + e \tag{2} \]

where $y = (y_1, y_2, \ldots, y_N)^T$ is the $N$-dimensional response vector, $Z = (z_1, z_2, \ldots, z_N)^T$ is an $N \times p$ design matrix of covariates, $Q = (Q_1, Q_2, \ldots, Q_N)^T$ is an $N \times q$ design matrix of QTL effects, and $e$ is an $N$-dimensional vector of residual errors, which is usually assumed to follow an $N(0, \sigma^2 I_N)$ distribution.

In most current QTL mapping studies the QTL are assumed to be located on markers, which is unlikely because the QTL are in most cases at unidentified positions. In the following we will treat the QTL genotypes $q_i$, $i = 1, 2, \ldots, N$, as missing data. As a result, the QTL related design matrix $Q$ in model (2) cannot be explicitly specified. The distribution of those unobserved QTL genotypes, however, is supposed to be associated with the distribution of the observed marker genotypes $x_i$. Regarding the QTL effects $\beta$ as fixed, our purpose is to estimate the QTL effects together with the joint distribution $P(q_i, x_i)$ of QTL and markers. $P(q_i, x_i)$ contains information on QTL allele frequencies as well as on the association between QTL and markers. The latter is valuable for inferring positions of the QTL. We will show via our simulation study that the QTL in this latent variable model may play a flexible role, characterizing either a specific trait locus linked to the markers or a haplotype/sub-haplotype of the markers associated with a quantitative trait.
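The genotype coding behind model (1) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: alleles are labelled 0 and 1, the additive code takes the values 1, 0, -1, the dominance code is 1 for homozygotes and 0 for the heterozygote, and the function names `code_qtl_genotype`, `design_row` and `expected_trait` are hypothetical.

```python
def code_qtl_genotype(genotype):
    """Return (w, v) for one diallelic QTL genotype.

    genotype is a pair of allele labels, 0 for A_j0 and 1 for A_j1
    (an assumed encoding, chosen for this sketch only).
    """
    n_a1 = sum(genotype)            # number of A_j1 alleles: 0, 1 or 2
    w = 1 - n_a1                    # additive code: 1, 0, -1
    v = 1 if n_a1 != 1 else 0       # dominance code: 1 homozygote, 0 heterozygote
    return w, v


def design_row(qtl_genotypes):
    """Build Q_i = (w_i1, v_i1, ..., w_il, v_il) for one individual."""
    row = []
    for genotype in qtl_genotypes:
        row.extend(code_qtl_genotype(genotype))
    return row


def expected_trait(z, alpha, qtl_genotypes, beta):
    """Mean of y_i under model (1): z_i * alpha + Q_i * beta."""
    q_row = design_row(qtl_genotypes)
    covariate_part = sum(zk * ak for zk, ak in zip(z, alpha))
    genetic_part = sum(qk * bk for qk, bk in zip(q_row, beta))
    return covariate_part + genetic_part
```

For a single QTL, `beta` would be `(a, d)`; with `l` QTL it has length `2l`, matching $q = 2l$ in the text.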
Likelihood Framework

We assume that (i) given QTL genotypes, each phenotype $y_i$ is conditionally independent of marker genotypes; and (ii) the genotypic distribution of QTL and markers is independent of the non-genetic covariates. Based on the previous latent variable model, the likelihood for the observed data $Y_{obs} = \{(y_i, x_i), i = 1, 2, \ldots, N\}$ conditional on the covariate information $Z = \{z_i, i = 1, 2, \ldots, N\}$ is given by

\[ L(\Theta; Y_{obs} \mid Z) = \prod_{i=1}^{N} \sum_{q_i} P(x_i, q_i)\, P(y_i \mid z_i, q_i) \tag{3} \]

where the summation is over all possible QTL genotypes, $P(y_i \mid z_i, q_i)$ is the penetrance probability, and the component $P(x_i, q_i)$ is determined by the joint genotypic frequency of QTL and markers.

Now, consider $m$ closely linked markers located within a candidate region of interest and genotypically associated with a single QTL within the region. Possible extension to multiple candidate regions on different chromosomes, each having a single QTL, is straightforward and will be discussed later. In order to explore the gametic association (i.e., LD), we need to distinguish the gametic phases of QTL and markers. Most current genetic markers manifest their genotype information with the two parental gametes confounded. For these linked QTL and marker loci, there is no unambiguous distinction of which alleles are on the same gamete (the so-called phase problem) when there is more than one heterozygous locus in their unphased genotypes. Hereafter, we use $k_1k_1' \cdots k_mk_m'$ to denote unphased marker genotypes, $q = jj'$ to denote an unphased QTL genotype, and ordered paternal/maternal pairs $\eta = jk_1 \cdots k_m / j'k_1' \cdots k_m'$ to specify the joint phase-known genotypes of QTL and markers, with QTL allele $A_j$ and marker alleles $M_{1k_1}, \ldots, M_{mk_m}$ on the paternal gamete, and QTL allele $A_{j'}$ and marker alleles $M_{1k_1'}, \ldots, M_{mk_m'}$ on the maternal gamete. Under the assumption of Hardy-Weinberg equilibrium or, more precisely, gametic phase equilibrium (Lynch & Walsh, 1998), the frequency of a phase-known genotype $\eta = jk_1 \cdots k_m / j'k_1' \cdots k_m'$ is the product of its two haplotype frequencies; i.e.,

\[ P(\eta) = P_{jk_1 \cdots k_m}\, P_{j'k_1' \cdots k_m'} \]

where $P_{jk_1 \cdots k_m}$ is the joint haplotype frequency of QTL and markers. Here we also assume that the paternal gametes and maternal gametes have the same haplotype distributions.
To clarify the phases, we can rewrite the likelihood $L(\Theta; Y_{obs} \mid Z)$ as

\[ L(\Theta; Y_{obs} \mid Z) = \prod_{i=1}^{N} \sum_{\eta} P(x_i, \eta)\, P(y_i \mid z_i, \eta) \tag{4} \]

where the summation is over all possible phase-known genotypes $\eta$ of QTL and markers. When a phase-known configuration $\eta = jk_1 \cdots k_m / j'k_1' \cdots k_m'$ is compatible with the observed marker genotypes $x_i$, i.e., $x_i = k_1k_1' \cdots k_mk_m'$, we have

\[ P(x_i, \eta) = P_{jk_1 \cdots k_m}\, P_{j'k_1' \cdots k_m'} \]

If $\eta$ is incompatible with the marker genotype $x_i$, then $P(x_i, \eta) = 0$. Based on model (1), $P(y_i \mid z_i, \eta)$ in (4) can be replaced by the following normal density function

\[ \phi(y_i \mid z_i, \eta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(y_i - z_i\alpha - Q_i\beta)^2}{2\sigma^2} \right] \]

where $\beta = (a, d)$, with $a$ and $d$ being the additive and dominance effects of the QTL, and $Q_i$ is coded by the QTL genotype $jj'$. Note that the penetrance $P(y_i \mid z_i, \eta)$ depends on $\eta$ only through its QTL genotype $q = jj'$ and does not depend on its marker genotypes $x_i = k_1k_1' \cdots k_mk_m'$.

The likelihood function (4) is a finite normal mixture model with unknown parameters in both the weights and the normal density components. The unknown parameters consist of two parts: the phenotypic model related parameters $\Theta_1 = (\alpha, \beta, \sigma^2)$, and the haplotype related parameters $\Theta_2$ involved in the joint haplotype frequency $P_{jk_1 \cdots k_m}$ of QTL and markers. In general, it is hard to specify a simple parametric model for the distribution of $P_{jk_1 \cdots k_m}$ due to complex unknown genetic pathways. In this paper we broadly assume that $P_{jk_1 \cdots k_m}$ has a multinomial distribution, with categories corresponding to haplotypes of the QTL and markers. Note that $P_{jk_1 \cdots k_m}$ includes potential parameters such as allele frequencies of QTL and markers, pairwise LDs between QTL and markers, and three-way or higher order disequilibria between the QTL and markers (Weir, 1996; Wang, 2001). In the simple case of a diallelic QTL and $m$ diallelic markers, for example, there are overall $2^{m+1}$ possible joint haplotypes of QTL and markers.
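The contribution of one individual to likelihood (4) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a diallelic QTL with alleles labelled 0 and 1, reduces the covariate term $z_i\alpha$ to a single intercept `mu`, and uses the hypothetical names `normal_pdf`, `is_compatible` and `individual_likelihood`. Joint haplotypes are tuples `(j, k_1, ..., k_m)` and frequencies are held in a dictionary.

```python
import math
from itertools import product


def normal_pdf(y, mean, sigma2):
    """Normal density phi(y) with the given mean and variance sigma2."""
    return math.exp(-(y - mean) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def is_compatible(hap1, hap2, marker_genotype):
    """True if the ordered pair hap1/hap2 matches the unphased marker genotype.

    hap = (qtl_allele, k_1, ..., k_m); marker_genotype = ((k_1, k_1'), ..., (k_m, k_m')).
    """
    return all(
        sorted((hap1[s + 1], hap2[s + 1])) == sorted(pair)
        for s, pair in enumerate(marker_genotype)
    )


def individual_likelihood(y, marker_genotype, hap_freq, mu, a, d, sigma2):
    """Sum over compatible phase-known genotypes eta of P(x_i, eta) * phi(y_i | eta)."""
    total = 0.0
    for (h1, p1), (h2, p2) in product(hap_freq.items(), repeat=2):
        if not is_compatible(h1, h2, marker_genotype):
            continue                      # P(x_i, eta) = 0 for incompatible eta
        n_a1 = h1[0] + h2[0]              # copies of QTL allele 1 in genotype jj'
        w = 1 - n_a1                      # additive code 1, 0, -1
        v = 1 if n_a1 != 1 else 0         # dominance code (1 for homozygotes)
        total += p1 * p2 * normal_pdf(y, mu + a * w + d * v, sigma2)
    return total
```

With `a = d = 0` the mixture collapses to the marker genotype probability times a single normal density, which is a convenient sanity check. The full likelihood is the product of these terms over individuals, usually accumulated on the log scale.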
If we assume that the haplotype distributions of markers are known (they are usually estimated from the observed marker genotypes directly), then there is one non-redundant parameter for QTL allele frequencies, there are $m$ non-redundant parameters for pairwise LDs between the QTL and markers, $m(m-1)/2$ non-redundant parameters for three-way disequilibria between the QTL and two markers, etc. The total number of non-redundant parameters (or degrees of freedom) related to $P_{jk_1 \cdots k_m}$ is $2^m$, which grows exponentially with the number of marker loci.

Due to the large number of non-redundant parameters involved implicitly in the joint haplotype distribution $P_{jk_1 \cdots k_m}$ of QTL and markers, the calculation of the likelihood function and its derivatives is very complicated. Therefore, a classical nonlinear searching approach such as the Newton-Raphson algorithm is not practical in this case. We will develop a searching method for the maximum likelihood estimation of parameters based on the expectation-maximization (EM) algorithm (Dempster et al. 1977).

EM Algorithm

The EM algorithm is an iterative nonlinear optimization procedure for maximum likelihood analysis. It usually consists of two steps: the E-step calculates the conditional expectation of the complete data log-likelihood, given the observed data and the current estimates of parameters; and the M-step maximizes the resulting function with respect to the parameters. Let $\eta_i$ denote the joint phase-known genotypes (complete information) of QTL and markers of individual $i$, $i = 1, 2, \ldots, N$. Then the log-likelihood function for the complete data $Y_{complete} = \{(y_i, \eta_i), i = 1, 2, \ldots, N\}$ conditional on the covariate information $Z$ is given by

\[ \ln L_c(\Theta; Y_{complete} \mid Z) = \sum_i \left[ \ln P(\eta_i) + \ln P(y_i \mid z_i, \eta_i) \right] \]

Given the observed data $Y_{obs}$ and the current parameter setting $\Theta^{(t)} = (\Theta_1^{(t)}, \Theta_2^{(t)})$, the expectation of $\ln L_c(\Theta; Y_{complete} \mid Z)$ is

\[ Q(\Theta \mid \Theta^{(t)}, Z) = E_{\Theta^{(t)}}\left[ \ln L_c(\Theta; Y_{complete} \mid Z) \mid Y_{obs} \right] = \sum_i \sum_{\eta} \left[ \ln P(\eta) + \ln P(y_i \mid z_i, \eta) \right] \omega_{\eta,i}(\Theta^{(t)}) \tag{5} \]

where $\omega_{\eta,i}(\Theta^{(t)}) = P_{\Theta^{(t)}}(\eta \mid y_i, z_i, x_i)$.
So, in the E-step at the $t$-th iteration, we need to calculate the conditional probability $\omega_{\eta,i}(\Theta^{(t)}) = P_{\Theta^{(t)}}(\eta \mid y_i, z_i, x_i)$ for all phase-known genotypes $\eta$ of QTL and markers per subject. From Bayes' rule, we have

\[ \omega_{\eta,i}(\Theta^{(t)}) = \frac{P_{\Theta^{(t)}}(\eta, x_i)\, P_{\Theta^{(t)}}(y_i \mid z_i, \eta)}{\sum_{\eta'} P_{\Theta^{(t)}}(\eta', x_i)\, P_{\Theta^{(t)}}(y_i \mid z_i, \eta')} \]

For each $\eta = jk_1 \cdots k_m / j'k_1' \cdots k_m'$, write $\omega_{\eta,i}(\Theta^{(t)})$ as $\omega^{j'k_1' \cdots k_m'}_{jk_1 \cdots k_m,i}(\Theta^{(t)})$. Let $H_i$ be the set of haplotype pairs that are compatible with the observed marker genotypes $x_i$ of the $i$-th individual. Then

\[ \omega^{j'k_1' \cdots k_m'}_{jk_1 \cdots k_m,i}(\Theta^{(t)}) = \frac{P^{(t)}_{jk_1 \cdots k_m}\, P^{(t)}_{j'k_1' \cdots k_m'}\, \phi(y_i \mid z_i, jj')_{\Theta^{(t)}}}{P_i^{(t)}} \]

for $\eta = jk_1 \cdots k_m / j'k_1' \cdots k_m' \in H_i$, and $\omega^{j'k_1' \cdots k_m'}_{jk_1 \cdots k_m,i}(\Theta^{(t)}) = 0$ if $\eta \notin H_i$, where

\[ P_i^{(t)} = \sum_{jk_1 \cdots k_m / j'k_1' \cdots k_m' \in H_i} P^{(t)}_{jk_1 \cdots k_m}\, P^{(t)}_{j'k_1' \cdots k_m'}\, \phi(y_i \mid z_i, jj')_{\Theta^{(t)}} \]

In the M-step, we need to maximize the Q-function defined in (5) given the posterior weights $\omega_{\eta,i}(\Theta^{(t)})$. Define

\[ W_{jj',i}(\Theta^{(t)}) = \sum_{k_1, \ldots, k_m} \sum_{k_1', \ldots, k_m'} \omega^{j'k_1' \cdots k_m'}_{jk_1 \cdots k_m,i}(\Theta^{(t)}) \]

Since the model related parameters $\Theta_1$ and the haplotype related parameters $\Theta_2$ are separable, $\Theta_1$ should be updated by minimizing

\[ \frac{1}{2\sigma^2} \sum_i \sum_{j,j'} \left[ y_i - z_i\alpha - Q(jj')\beta \right]^2 W_{jj',i}(\Theta^{(t)}) + \frac{N}{2} \ln(2\pi\sigma^2) \]

which can easily be implemented via classical weighted regression. Because the likelihood is invariant to switching the labels of the QTL alleles, which is a common phenomenon in latent model analysis, a correction may sometimes be required in the calculation of the additive genetic effect of the QTL. To update $P^{(t)}_{jk_1 \cdots k_m}$, we need to maximize the following function

\[ R(\Theta) = \sum_{j,k_1,\ldots,k_m} \left( \ln P_{jk_1 \cdots k_m} \right) \Omega^{(t)}_{jk_1 \cdots k_m} \tag{6} \]

where

\[ \Omega^{(t)}_{jk_1 \cdots k_m} = \sum_i \sum_{j'} \sum_{k_1', \ldots, k_m'} \left[ \omega^{j'k_1' \cdots k_m'}_{jk_1 \cdots k_m,i}(\Theta^{(t)}) + \omega^{jk_1 \cdots k_m}_{j'k_1' \cdots k_m',i}(\Theta^{(t)}) \right] \]

Note that $R(\Theta)$ is a weighted log function for all haplotype frequencies $P_{jk_1 \cdots k_m}$ of QTL and markers. Regarding all the QTL and markers as one super locus with their joint haplotypes as its alleles, we can rewrite (6) as $R(\Theta) = \sum_k (\ln p_k)\, w_k^{(t)}$, where $p_k$ denotes $P_{jk_1 \cdots k_m}$ and $w_k^{(t)}$ is the weight $\Omega^{(t)}_{jk_1 \cdots k_m}$. A simple derivation shows that $R(\Theta)$ achieves its global maximum at $\hat{p}_k = w_k^{(t)} / \sum_k w_k^{(t)}$. Note that $\sum_k w_k^{(t)} = 2N$; therefore

\[ P^{(t+1)}_{jk_1 \cdots k_m} = \frac{\Omega^{(t)}_{jk_1 \cdots k_m}}{2N}, \quad \forall\, j, k_1, k_2, \ldots, k_m \tag{7} \]

Through the updating equation (7), the joint haplotype frequencies of QTL and markers, which include the marker haplotype information, make use of the phenotype information through the posterior weights computed in the E-step. It has been recommended to estimate marker frequencies jointly with the phenotypes, because the marker genotypes are supposedly correlated with the phenotype through their LD with the QTL (Göring & Terwilliger, 2000). This method, however, may subject the penetrance model to possible misspecification and lead to increased degrees of freedom for the likelihood-based test statistics. Alternatively, we can first estimate the haplotype frequencies of markers $\hat{P}_{k_1 \cdots k_m}$ from the observed marker genotypes directly (discussed below) and then update $P^{(t+1)}_{jk_1 \cdots k_m}$ with the fixed $\hat{P}_{k_1 \cdots k_m}$. Similarly, if we shrink all the marker loci into a super locus $M$ with marker haplotypes as its alleles, then we can rewrite (6) as

\[ R(\Theta) = \sum_j \sum_k (\ln P_{jk})\, w_{jk}^{(t)} = \sum_j \sum_k \ln(p_j q_k + D_{jk})\, w_{jk}^{(t)} \]

where $p_j$ is the frequency of QTL allele $A_j$, $q_k$ is the frequency of allele $k$ at locus $M$, $D_{jk}$ is the LD between QTL allele $A_j$ and allele $k$ of $M$, and $w_{jk}^{(t)}$ is the corresponding weight. Assuming now that the $q_k$ are known, we can show that $R(\Theta)$ achieves its maximum at $P_{jk} = w_{jk}^{(t)} q_k / \sum_{j'} w_{j'k}^{(t)}$ for any $j, k$. Or, equivalently,

\[ P^{(t+1)}_{jk_1 \cdots k_m} = \frac{\Omega^{(t)}_{jk_1 \cdots k_m}\, \hat{P}_{k_1 \cdots k_m}}{\sum_{j'} \Omega^{(t)}_{j'k_1 \cdots k_m}}, \quad \forall\, j, k_1, k_2, \ldots, k_m \tag{8} \]

When the haplotype distribution of markers can be predetermined, using (8) can make the EM procedure more tractable and reduce the degrees of freedom for the likelihood ratio test statistics.
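The two closed-form haplotype updates, (7) and (8), can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the accumulated weights from the E-step are supplied as a dictionary keyed by joint haplotype tuples `(j, k_1, ..., k_m)`, and the names `update_joint` and `update_joint_fixed_markers` are hypothetical.

```python
def update_joint(omega, n_individuals):
    """Update (7): each joint haplotype frequency is its weight divided by 2N."""
    return {hap: w / (2.0 * n_individuals) for hap, w in omega.items()}


def update_joint_fixed_markers(omega, marker_freq):
    """Update (8): rescale weights so marker haplotype margins equal marker_freq.

    omega       : dict mapping (j, k_1, ..., k_m) -> accumulated weight
    marker_freq : dict mapping (k_1, ..., k_m) -> fixed marker haplotype frequency
    """
    # Sum the weights over QTL alleles j for every marker haplotype.
    totals = {}
    for hap, w in omega.items():
        totals[hap[1:]] = totals.get(hap[1:], 0.0) + w

    updated = {}
    for hap, w in omega.items():
        marker_hap = hap[1:]
        if totals[marker_hap] > 0.0:
            updated[hap] = w * marker_freq[marker_hap] / totals[marker_hap]
        else:
            updated[hap] = 0.0        # marker haplotype received no weight
    return updated
```

Under (8) the updated frequencies summed over the QTL allele reproduce the fixed marker haplotype frequencies, so only the conditional distribution of the QTL allele given the marker haplotype is re-estimated at each iteration.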
Besides, these pseudo-MLEs still satisfy properties such as consistency, asymptotic efficiency and asymptotic normality under some regularity conditions (Liang & Self, 1996), although they may have reduced precision.

When haplotypes of markers are available (e.g., through sperm typing or using genotypic information of relatives), direct counting gives MLEs of $P_{k_1 \cdots k_m}$. For unphased marker genotypes, under the assumption of gametic phase equilibrium, haplotype frequencies of markers can also be estimated using several available methods (Clark, 1990; Excoffier & Slatkin, 1995; Hawley & Kidd, 1995). In fact, by ignoring the penetrance model with the QTL, an EM procedure for estimation of the haplotype frequency $P_{k_1 \cdots k_m}$ of markers can be derived as a special case of the algorithm presented above. Similar to (7), starting from an initialized $P^{(0)}_{k_1 \cdots k_m}$, the haplotype frequencies of markers can be updated through

\[ P^{(t+1)}_{k_1 \cdots k_m} = \frac{1}{2N} \sum_i \sum_{k_1', \ldots, k_m'} \left[ \omega^{k_1' \cdots k_m'}_{k_1 \cdots k_m,i}(t) + \omega^{k_1 \cdots k_m}_{k_1' \cdots k_m',i}(t) \right] \tag{9} \]

where

\[ \omega^{k_1' \cdots k_m'}_{k_1 \cdots k_m,i}(t) = \begin{cases} \dfrac{P^{(t)}_{k_1 \cdots k_m}\, P^{(t)}_{k_1' \cdots k_m'}}{P_i^{(t)}}, & \text{if } x_i = k_1k_1' \cdots k_mk_m' \\ 0, & \text{otherwise,} \end{cases} \]

and

\[ P_i^{(t)} = \sum_{k_1 \cdots k_m / k_1' \cdots k_m' \in H_i} P^{(t)}_{k_1 \cdots k_m}\, P^{(t)}_{k_1' \cdots k_m'} \]

When $x_i = k_1k_1 \cdots k_mk_m$, i.e., it is homozygous at each marker locus, we have $\omega^{k_1 \cdots k_m}_{k_1 \cdots k_m,i}(t) = 1$. If $x_i = k_1k_1 \cdots k_jk_j' \cdots k_mk_m$ with $k_j \neq k_j'$, i.e., $x_i$ is heterozygous only at the $j$-th locus, then $\omega^{k_1 \cdots k_j' \cdots k_m}_{k_1 \cdots k_j \cdots k_m,i}(t) = \omega^{k_1 \cdots k_j \cdots k_m}_{k_1 \cdots k_j' \cdots k_m,i}(t) = 1/2$. The same EM procedure was developed in Excoffier & Slatkin (1995).

The EM algorithm above can easily handle missing data in marker genotypes. Missing information in the marker genotypes $x_i$ of an individual basically increases the number of phase-known genotype configurations $\eta$ that are compatible with $x_i$. Individuals with phenotype and covariate information but no marker genotypes contribute no information for estimating the haplotype frequencies of markers.
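The marker-only EM update (9) can be sketched in Python as below. This is an illustrative sketch rather than the authors' code: it assumes complete marker genotypes (no missing alleles), starts from uniform frequencies over the haplotypes that are compatible with at least one individual, and uses the hypothetical names `compatible_pairs` and `em_haplotype_frequencies`.

```python
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs (h, h') consistent with an unphased genotype.

    genotype is a tuple of allele pairs, one per marker, e.g. ((0, 1), (1, 1)).
    """
    per_locus = []
    for a, b in genotype:
        # A homozygous locus has one ordering, a heterozygous locus has two.
        per_locus.append([(a, b)] if a == b else [(a, b), (b, a)])
    pairs = []
    for choice in product(*per_locus):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Iterate update (9) until the largest frequency change falls below tol."""
    n = len(genotypes)
    pairs_by_ind = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pairs_by_ind for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}   # uniform start
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for pairs in pairs_by_ind:
            # E-step: posterior weight of each compatible ordered pair.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0.0:
                continue              # guard against numerical underflow
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by 2N.
        new_freq = {h: c / (2.0 * n) for h, c in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if change < tol:
            break
    return freq
```

Because EM may converge to a local maximum, it is sensible to rerun such a routine from several starting points, which this sketch does not do.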
Individuals with marker genotypes but no phenotypes can still be used for estimating the haplotype frequencies of markers, although they are better excluded from the QTL analysis because they provide no information about the phenotype-QTL-genotype association. It is known that the EM algorithm cannot guarantee convergence to a global maximum of the likelihood function. In general, different starting points should be examined; the optimum can then be selected as the one corresponding to the largest likelihood value.

Quantitative traits are usually complex traits that are likely affected by multiple QTL. The algorithm above might be extended to multiple QTL as well. For example, we can easily extend the algorithm to multiple candidate regions located on different chromosomes, each having a single QTL. In this case, we can simply deal with the joint haplotype frequencies of the QTL and markers within each candidate region separately. Since the QTL and markers within different regions are unlinked, the haplotype distributions from different regions can be assumed to be genotypically independent. However, all the QTL, one from each region, should jointly account for the genetic effects in the penetrance model. It might be the case that a candidate region contributes more than one QTL influencing the trait, or that more than one candidate region is located on the same chromosome. In this situation the multi-QTL penetrance model encounters confounding between the genetic effects of the QTL and their genotypic correlation in their influence on variation of the trait. Identification of these parameters becomes a problem. If possible, additional information or constraints on the correlation structure of these QTL are required. Another potential problem in multi-QTL modelling is how to identify the number of QTL. Essentially, this is a model selection problem. One important feature in this case is that an $l$-QTL model may not necessarily be nested within an $(l + 1)$-QTL model, since they may have completely different QTL positions. So the classical likelihood ratio test statistic may no longer be valid to handle the problem. Zeng et al.
(1999) adopted an interactive stepwise selection procedure to determine the number of QTL by applying F-to-drop and F-to-enter statistics, and stopping rules such as AIC, BIC or BIC$_\delta$ (Broman & Speed, 2002) to aid the model selection. However, it has to be pointed out that these traditional criteria for selection of models may not be satisfactory in genetic mapping studies, since they do not take into account factors such as heritability level, number and lengths of chromosomes, and number and distribution of markers, which are believed to play important roles in selection of the QTL.

Based on our likelihood framework above, a classical likelihood ratio (LR) statistic can be constructed to test for association between the observed marker genotypes and the phenotype. For example, using a simple one-QTL model and ignoring its dominance effect, we can test the association through the null hypothesis $H_0: a = 0$; i.e., test for the existence of a QTL. In this case the LR statistic is $S = 2(\ln L_F - \ln L_R)$, where $L_F$ is the maximum likelihood under the full model and $L_R$ is the maximum likelihood under the restriction $a = 0$. Note that under the null hypothesis of no QTL, the phenotypic values carry no information about the haplotype related parameters $\Theta_2$. So, $L_R$ only depends on the mean of the phenotypic values and the haplotype distribution of the markers, while $L_F$ involves estimation of the joint haplotype distribution of the QTL and markers, as well as the model fitting of the phenotypic values. It is known that under this non-standard condition the LR test statistic does not follow the classical $\chi^2$ distribution and is, for large samples, a mixture of a point mass at zero $\mathbf{1}_{\{0\}}$ (or $\chi^2_0$) and a $\chi^2_{df}$ distribution (Self & Liang, 1987). In our case, given the marker haplotypes, the number of QTL related parameters involved in $\Theta_2$ under our multinomial distributional model is $2^m$.
So, a conservative approximation to the asymptotic distribution of $S$ under the null $H_0$ would be $\tfrac{1}{2}\mathbf{1}_{\{0\}} + \tfrac{1}{2}\chi^2_{df}$ with $df = 2^m + 1$, although it is questionable whether the haplotype related parameters $\Theta_2$ have the same effect in fitting the model as the model related parameters $\Theta_1$, and likewise whether the QTL allele frequencies or pairwise LDs between QTL and markers have the same effect as the higher-order disequilibrium parameters involved in the haplotype distribution of the QTL and markers. When dominance effects or more than one QTL are involved, the asymptotic distribution of the LR statistic is a weighted sum of independent $\chi^2_0$ and $\chi^2_{df}$ distributions, where the weights depend on the genetic models that we use (Liang & Self, 1996). Alternatively, one may construct an empirical distribution of the test statistic through a permutation procedure, by randomly permuting the phenotypes (to which the covariates can be bound) and then reassigning them to the marker genotypes. This permutation procedure is computationally feasible at least with moderate number of markers of less than 10 and repeats of

The LR test above is actually a joint test of $H_0$: no QTL effects, or no LD between QTL and markers, similar to the TDT (Spielman & Ewens, 1996; Allison, 1997; Martin et al. 2000) being a joint test of association and linkage. Without LD between the QTL and markers, observed marker genotypes provide no information about the genotypic distribution of the QTL. So, it is a valid test for the existence of the QTL only in the presence of LD, which is often presumed in most genetic fine mapping studies. A rejection of the null implies that $a \neq 0$ or $d \neq 0$ and that there is LD between the QTL and markers. The stronger the LD between the QTL and markers is in the genome, the more power the LR statistic has to detect the association. In general, a comprehensive test for association between a phenotype and a set of markers requires consideration of alleles, genotypes, haplotypes or sub-haplotypes and results in a multiple testing problem. Since the QTL in our latent variable model may play a flexible role, describing either a causative allele at a putative trait locus linked with the markers or a haplotype or sub-haplotype of the markers (as shown in later simulation studies), the LR statistic can test for the phenotype and marker genotype association regardless of the various possible LD association patterns between the QTL and markers, and therefore avoids an adjustment for multiple testing in this case.

The method presented above could be regarded as an extension to natural populations of the classical interval mapping method (Lander & Botstein, 1989; Zeng, 1993, 1994; Kao & Zeng, 1997), which has been widely used in animal models for whole genome linkage scans.
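The permutation procedure for the empirical null distribution can be sketched generically in Python. Everything specific here is an assumption made for illustration: the default of 1000 permutations, the function names `permutation_p_value` and `mean_difference_statistic`, and the toy statistic, which merely stands in for the LR statistic $S$ that would be recomputed by the EM fit on each permuted data set. If covariates are present, each entry of `phenotypes` can be a tuple holding the phenotype and its covariates so that they are permuted together.

```python
import random


def permutation_p_value(phenotypes, genotypes, statistic, n_perm=1000, seed=0):
    """Empirical p-value of statistic(phenotypes, genotypes) under permutation.

    Phenotype records are shuffled and reassigned to the fixed marker
    genotypes; the statistic is recomputed for every shuffle.
    """
    rng = random.Random(seed)
    observed = statistic(phenotypes, genotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled, genotypes) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


def mean_difference_statistic(phenotypes, genotypes):
    """Toy stand-in statistic: squared difference in mean phenotype
    between genotype group 0 and all other genotypes."""
    g0 = [y for y, g in zip(phenotypes, genotypes) if g == 0]
    g1 = [y for y, g in zip(phenotypes, genotypes) if g != 0]
    return (sum(g1) / len(g1) - sum(g0) / len(g0)) ** 2
```

In practice the cost is dominated by refitting the model for each permutation, which is why the number of markers and the number of repeats both limit feasibility.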
Within an interval flanked by two markers, the interval mapping method forms a mixture of normal distributions corresponding to the three possible genotypes of a QTL, with the weights being determined by the genotypic distribution of the QTL, which is inferred from the genetic distance of the QTL position to its flanking markers and the observed genotypes at its flanking markers. Here we established a similar mixture normal distribution, with the weights being determined by the genotypic distribution of the QTL conditional on the marker genotypes, or the joint haplotype frequency of QTL and markers. Maximum likelihood estimation is carried out on both the haplotype distribution of QTL and markers and the mixture normal components, without the requirement of linkage information. This method allows the use of multiple linked markers simultaneously, and is applicable to natural populations as well as some experimentally designed populations such as backcrosses and F$_2$.

Partial Correlation

In association mapping studies the QTL position could be inferred based on known marker positions and the dependence structure of association between the QTL and markers. A variety of association measures have been proposed to describe the LD between QTL and markers. Two widely used association measures are the squared correlation coefficient $r^2$ and Lewontin's $D'$. The $r^2$ association measure between an allele $A_i$ at a QTL and an allele $M_j$ at a marker locus is defined as (Weir, 1996)

\[ r^2 = \frac{D_{ij}^2}{p_i(1 - p_i)\, q_j(1 - q_j)} \]

where $D_{ij} = P_{ij} - p_i q_j$ is the LD between $A_i$ and $M_j$. For a random gamete, if we define the following coding variables according to alleles at the QTL and marker loci

\[ z_i = \begin{cases} 1, & A_i \text{ at the QTL} \\ 0, & \text{otherwise} \end{cases} \qquad z^{*}_j = \begin{cases} 1, & M_j \text{ at the marker locus} \\ 0, & \text{otherwise} \end{cases} \]

then $D_{ij} = \mathrm{Cov}(z_i, z^{*}_j)$ is the covariance of $z_i$ and $z^{*}_j$, and

\[ r = \frac{D_{ij}}{\sqrt{p_i(1 - p_i)\, q_j(1 - q_j)}} \]

is the statistical correlation coefficient of $z_i$ and $z^{*}_j$. So, the range of $r^2$ is always between 0 and 1.
The D for the association measure between alleles A i and M j is defined as (Lewontin, 1964) D ij D min{ p i (1 q j ),(1 p i )q j D } ij > 0 = D ij min{ p i q j,(1 p i )(1 p j D )} ij < 0 The range of D is between -1 and 1. Though the D measure shows some advantages over the r 2 measure in certain cases, in general they have different strengths depending on the context (Devlin & Risch, 1995; Guo, 1997). C 2006 The Authors Annals of Human Genetics (2006) 70,
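The two definitions above can be illustrated with a minimal Python sketch. The function name `ld_measures` and the example frequencies are hypothetical and chosen only for illustration; the sketch assumes both loci are polymorphic, so that the denominators are non-zero.

```python
def ld_measures(p_ij, p_i, q_j):
    """Return (D, r2, D_prime) for a QTL allele A_i and a marker allele M_j.

    p_ij : frequency of the haplotype carrying both A_i and M_j
    p_i  : frequency of A_i
    q_j  : frequency of M_j
    Assumes 0 < p_i < 1 and 0 < q_j < 1.
    """
    d = p_ij - p_i * q_j                                  # D_ij = P_ij - p_i q_j
    r2 = d ** 2 / (p_i * (1 - p_i) * q_j * (1 - q_j))     # squared correlation
    if d > 0:
        d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
    elif d < 0:
        d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
    else:
        return d, r2, 0.0                                 # no LD at all
    return d, r2, d / d_max                               # Lewontin's D'


# Hypothetical example: A_i (frequency 0.15) is always found together
# with M_j (frequency 0.25), so P_ij equals p_i.
d, r2, d_prime = ld_measures(p_ij=0.15, p_i=0.15, q_j=0.25)
```

In this example $D'$ reaches its maximum of 1 because one of the four haplotypes is absent, while $r^2$ is only about 0.53 because the two allele frequencies differ. This is one way in which the two measures can rank the same marker differently.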

From the estimated joint haplotype frequencies of the QTL and markers, we can derive the marginal distributions of the QTL and markers and all pairwise LDs between the QTL and markers. The association measures $D'$ and $r^2$ between the QTL and markers can then be calculated.

Both the $D'$ and $r^2$ association measures, however, ignore the correlation among markers. Partial correlation is a standard statistical tool for separating the dependency among correlated variables. From the joint haplotype frequencies of the QTL and markers we can also compute the partial correlation (PC) between a QTL and a marker conditional on the other markers. For example, with diallelic QTL and markers, the calculation of the partial correlation between a QTL and a marker conditional on all the other markers can proceed as follows. First, one computes the QTL allele frequencies $p_{r0}$, $r = 1, \ldots, l$, and all pairwise LDs $D_{rs}$ ($r = 1, \ldots, l$; $s = 1, \ldots, m$) between QTL allele $A_{r0}$ and marker allele $M_{s0}$, from the joint haplotype distribution $P_{j k_1 \cdots k_m}$ of the QTL and markers. Next, one computes the marker allele frequencies $q_{s0}$, $s = 1, \ldots, m$, and all pairwise LDs $D_{ij}$ ($i, j = 1, \ldots, m$) for marker alleles $M_{i0}$ and $M_{j0}$ among markers, from the haplotype distribution $P_{k_1, \ldots, k_m}$ of the markers. Finally, for a specific QTL allele $A_{r0}$ at the $r$-th QTL and a marker allele $M_{s0}$ at the $s$-th marker locus, let $C_{11}$ be the $2 \times 2$ covariance matrix for $z_{r0}$, $z_{s0}$ at the $r$-th QTL and the $s$-th marker; i.e.,

$$ C_{11} = \begin{pmatrix} p_{r0} (1 - p_{r0}) & D_{rs} \\ D_{rs} & q_{s0} (1 - q_{s0}) \end{pmatrix} $$

In addition, let $C_{22}$ denote the $(m-1) \times (m-1)$ covariance matrix for all markers except the $s$-th marker (a submatrix of $(D_{ij})_{m \times m}$). Let $C_{12}$ be the $2 \times (m-1)$ covariance matrix between the pair formed by the $r$-th QTL and the $s$-th marker, and the set of other markers. The partial correlation coefficient $b_{rs}$ between the $r$-th QTL and the $s$-th marker, conditional on the other markers, can then be obtained from the following (Stuart et al. 1999)

$$ \begin{pmatrix} b_{rs0} & b_{rs} \\ b_{rs} & b_{rs1} \end{pmatrix} = C_{11} - C_{12} C_{22}^{-1} C_{21} $$

where $C_{21} = C_{12}^T$. In this way, we can calculate the partial correlation coefficients between all pairs of QTL and marker loci. For markers with more than two alleles we can focus on an allele of interest and collapse the others.

Identification of an underlying QTL can proceed by evaluating the association patterns between the QTL and markers in a particular targeted region. A convenient strategy to infer the QTL position is first to select the marker that shows the strongest association with the QTL, and then to choose a small region around it according to the association pattern of the other markers. The wider this fine mapped region is, the more likely it is to enclose the QTL, at the cost of reduced accuracy in locating the QTL position. When several targeted regions are involved, one can use the same strategy to determine a candidate region within each targeted region.

Simulation Results

To examine the possible roles of the latent QTL and the properties of the test statistic, we first considered random samples in which the haplotypes are generated at random from a predetermined haplotype distribution, a simulation scheme used by Lake et al. (2003). The haplotypes and haplotype frequencies are listed in Table 1. For the six diallelic markers with eight possible haplotypes we randomly generated 1,000 haplotypes based on the haplotype frequencies, and then randomly selected haplotype pairs to form the genotypes of a sample of 500 individuals. The genotypic values of the individuals were generated under two different scenarios by taking: 1) locus 6 as a trait locus (i.e., QTL); or 2) haplotype 2 as an effective disease allele 1 versus all other haplotypes with an ineffective allele 0 for the QTL.
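Before the phenotype model is described, the partial correlation calculation from the Partial Correlation section can be sketched in Python for a single diallelic QTL and $m$ diallelic markers. The function names, the array layout (axis 0 for the QTL, axes 1 to $m$ for the markers) and the toy frequencies are assumptions made for illustration. The final step, dividing the off-diagonal element of $C_{11} - C_{12} C_{22}^{-1} C_{21}$ by the square root of the product of its diagonal elements, is the standard normalisation of a partial covariance into a partial correlation and is likewise stated here as an assumption about how $b_{rs}$ is scaled.

```python
from itertools import product

import numpy as np


def indicator_covariance(joint):
    """Covariance matrix of the allele-1 indicators at all loci.

    joint: array of shape (2,) * n_loci holding haplotype frequencies
    (summing to 1); axis 0 is the QTL, the remaining axes are markers.
    """
    n_loci = joint.ndim
    # Enumerate haplotypes in the same (C) order used by joint.reshape(-1).
    haps = np.array(list(product([0, 1], repeat=n_loci)), dtype=float)
    freqs = joint.reshape(-1)
    mean = freqs @ haps                          # allele-1 frequencies
    second = haps.T @ (haps * freqs[:, None])    # E[z z^T]
    return second - np.outer(mean, mean)


def partial_correlation(joint, s):
    """Partial correlation between the QTL and marker s (0-based),
    conditional on all other markers."""
    cov = indicator_covariance(joint)
    m = joint.ndim - 1
    pair = [0, s + 1]
    rest = [k for k in range(1, m + 1) if k != s + 1]
    c11 = cov[np.ix_(pair, pair)]
    if rest:
        c12 = cov[np.ix_(pair, rest)]
        c22 = cov[np.ix_(rest, rest)]
        resid = c11 - c12 @ np.linalg.solve(c22, c12.T)   # C11 - C12 C22^-1 C21
    else:
        resid = c11
    return resid[0, 1] / np.sqrt(resid[0, 0] * resid[1, 1])


# Toy distribution (hypothetical): the QTL and marker 1 are each associated
# with marker 2, but are independent of one another given marker 2.
p_m2 = [0.6, 0.4]
p_q1_given_m2 = [0.1, 0.8]
p_m1_given_m2 = [0.2, 0.7]
joint = np.zeros((2, 2, 2))
for q, m1, m2 in product([0, 1], repeat=3):
    pq = p_q1_given_m2[m2] if q else 1 - p_q1_given_m2[m2]
    pm1 = p_m1_given_m2[m2] if m1 else 1 - p_m1_given_m2[m2]
    joint[q, m1, m2] = p_m2[m2] * pq * pm1

pc_marker1 = partial_correlation(joint, 0)   # association explained by marker 2
pc_marker2 = partial_correlation(joint, 1)   # association that remains
```

In this toy case the ordinary correlation between the QTL and marker 1 is clearly positive, but their partial correlation vanishes once marker 2 is conditioned on, whereas marker 2 keeps a substantial partial correlation with the QTL. This is the kind of separation among correlated markers that motivates the measure.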
Table 1 Haplotype distribution for the six diallelic markers
Column headings: SNP loci; Haplotypes; Hap frequencies; Frequencies of allele

Table 2 Means and SEs of parameter estimates from 100 replicates under scenario 1
Column headings, row 1: $\hat{a}$, $\hat{d}$, $\hat{h}^2$, $\hat{p}_Q$ (0.15)
Column headings, row 2: $\hat{p}_{M1}$ (0.25), $\hat{p}_{M2}$ (0.20), $\hat{p}_{M3}$ (0.40), $\hat{p}_{M4}$ (0.40), $\hat{p}_{M5}$ (0.35)
Column headings, row 3: $\hat{r}_{QM1}$ (0.024), $\hat{r}_{QM2}$ (0.127), $\hat{r}_{QM3}$ (0.100), $\hat{r}_{QM4}$ (0.047), $\hat{r}_{QM5}$ ( 0.046)
Parameters and settings:
a = 1, d = (0.05) 0.00 (0.09) 0.37 (0.06) 0.17 (0.02)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.01) 0.35 (0.01)
0.013(0.008) 0.113(0.009) 0.085(0.009) 0.039(0.009) 0.049(0.008)
a = 1, d = (0.10) 0.02 (0.19) 0.32 (0.07) 0.19 (0.04)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.01) 0.35 (0.01)
0.011(0.011) 0.106(0.012) 0.080(0.013) 0.037(0.013) 0.044(0.011)
a = 1, d = (0.17) 0.03 (0.33) 0.26 (0.09) 0.24 (0.07)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.01) 0.35 (0.01)
0.013(0.015) 0.093(0.016) 0.070(0.017) 0.032(0.016) 0.039(0.019)
a = 1, d = (0.14) 1.02 (0.22) 0.36 (0.06) 0.17 (0.04)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.02) 0.35 (0.01)
(0.008) (0.009) (0.009) (0.009) (0.007)
a = 1, d = (0.36) 0.99 (0.45) 0.30 (0.10) 0.20 (0.08)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.02) 0.35 (0.01)
(0.011) (0.015) (0.014) (0.012) 0.045(0.011)
a = 1, d = (0.66) 0.89 (0.72) 0.27 (0.13) 0.26 (0.11)
h 2 = (0.01) 0.20 (0.01) 0.40 (0.01) 0.40 (0.02) 0.35 (0.01)
(0.014) (0.020) (0.019) 0.031(0.015) (0.016)

Under each scenario the genotypic value $G_i$ of individual $i$ was given by $-a$, $d$ or $a$ according to its QTL genotype 00, 01 or 11, respectively, where $a$ and $d$ are the additive and dominance effects of the QTL based on Falconer & Mackay's genetic model. The phenotypic values were then simulated through $y_i = G_i + e_i$, where the residuals $e_i \sim N(0, \sigma^2)$ with the residual variance

$$ \sigma^2 = V_G \left( \frac{1}{h^2} - 1 \right), $$

$V_G = 2 p_0 p_1 [a + d (p_0 - p_1)]^2 + (2 p_0 p_1 d)^2$ being the genotypic variance and $h^2$ the (broad sense) heritability level contributed by the mapped QTL. Under each scenario above, we considered two different parameter settings for the QTL effects: 1) $a = 1.0$, $d = 0$ (no dominance); and 2) $a = 1.0$, $d = 1.0$ (complete dominance), together with three different heritability levels $h^2$ = 0.1, 0.2 and 0.3. For each parameter setting, we simulated 100 replicate samples, each with a sample size of 500.

For scenario 1 the simulated genotypes at locus 6 were discarded after the phenotypic values were simulated. Using the phenotypic values and the genotype data at loci 1 to 5 only, we fitted a one-QTL penetrance model and estimated the haplotype frequencies of the QTL and markers for each replicate sample.
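The data-generating steps for scenario 1 can be sketched in Python as follows. The haplotype table below is hypothetical (it is not the Table 1 distribution), the function name `simulate_scenario1` is invented for this sketch, and the use of the population allele frequency of locus 6 when computing $V_G$ is an assumption; only the genotypic values, the $V_G$ formula and the residual variance follow the description above.

```python
import numpy as np

# Hypothetical haplotype distribution over six diallelic loci (one row per
# haplotype). These are NOT the Table 1 values; they only illustrate the scheme.
HAPLOTYPES = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
])
HAP_FREQS = np.array([0.30, 0.15, 0.10, 0.10, 0.10, 0.05, 0.10, 0.10])


def simulate_scenario1(n_ind, a, d, h2, rng):
    """Simulate marker genotypes and phenotypes with locus 6 as the QTL."""
    # Draw 2 * n_ind haplotypes; consecutive independent draws form random pairs.
    idx = rng.choice(len(HAP_FREQS), size=2 * n_ind, p=HAP_FREQS)
    pairs = HAPLOTYPES[idx].reshape(n_ind, 2, -1)
    genotypes = pairs.sum(axis=1)            # allele-1 counts (0, 1, 2) per locus
    qtl = genotypes[:, 5]                    # locus 6 plays the role of the QTL

    # Genotypic values -a, d, a for QTL genotypes 00, 01, 11.
    g = np.array([-a, d, a], dtype=float)[qtl]

    # Genotypic variance and the residual variance implied by heritability h2.
    p1 = float(HAP_FREQS @ HAPLOTYPES[:, 5])
    p0 = 1.0 - p1
    v_g = 2 * p0 * p1 * (a + d * (p0 - p1)) ** 2 + (2 * p0 * p1 * d) ** 2
    sigma2 = v_g * (1.0 / h2 - 1.0)

    y = g + rng.normal(0.0, np.sqrt(sigma2), size=n_ind)
    # The QTL genotypes are discarded: only loci 1 to 5 are returned.
    return genotypes[:, :5], y, sigma2


rng = np.random.default_rng(2006)
markers, y, sigma2 = simulate_scenario1(500, a=1.0, d=0.0, h2=0.3, rng=rng)
```

With $a = 1$, $d = 0$ and a QTL allele frequency of 0.15, the genotypic variance is $2 \times 0.85 \times 0.15 = 0.255$, so a heritability of 0.3 corresponds to a residual variance of about 0.595. Lower heritability inflates the residual variance quickly, which is in line with the less precise estimates reported for $h^2 = 0.1$.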
Estimation of the model-related parameters, as well as of the QTL allele frequencies, marker allele frequencies, and correlation coefficients ($r$) between QTL and markers derived from the joint haplotype frequencies of QTL and markers, was then assessed with the means and standard errors (SEs) of the parameter estimates from the 100 replicate samples, shown in Table 2. It was interesting to see that the latent QTL played the same role as locus 6, even though the genotype data at this locus were not used in our estimation. Overall, a higher level of heritability corresponds to a lower noise level in the phenotypic model, and therefore leads to better estimation of parameters. The allele frequency $p = 0.15$ at the true QTL (locus 6 in this case) can be estimated with reasonable precision for a heritability level of $h^2 = 0.3$, even though it differs from the allele frequencies at all the marker loci. Marker allele frequencies were estimated better than QTL allele frequencies, owing to the lower variability of these parameters. The SEs also show that estimation of the additive effect $a$ is more precise than that of the dominance effect $d$, which is consistent with the fact that the dominance effect, as an interaction between two alleles at the QTL locus, is a second order approximation to the genotypic values of the QTL. The results have practical implications for LD mapping studies of quantitative traits. However, due to the large number of parameters involved, the heritability level appeared to be overestimated, while the correlation coefficients (association) between QTL and markers tended to be underestimated. A larger sample size may be required to overcome this problem.

In scenario 2 we considered a haplotype-driven phenotype model and generated the phenotypic response by regarding haplotype 2 as an effective QTL allele with additive and dominance effects. Here our main interest was to find out whether the latent QTL model can capture the same role played by haplotype 2. To show this we fitted the latent QTL model using genotype data at all six marker loci. Then we calculated the frequencies of the QTL alleles and of haplotype 2, as well as the conditional probabilities of the QTL alleles given or not given haplotype 2, from the estimated joint haplotype frequencies of QTL and markers. The means and SEs of these parameter estimates from the 100 replicate samples are shown in Table 3. It is clear that the latent QTL captures the main features of haplotype 2 and tends to play the same role as haplotype 2 under this scenario.

We further examined the behaviour of the likelihood ratio (LR) test statistic under the null hypothesis of no QTL (i.e., $H_0: a = d = 0$). Under scenario 1, and using exactly the same residuals $e_i$ that were added to the genotypic values above as trait values in the absence of the QTL effect, we proceeded with the estimation procedure using different sample sizes of 200, 500, 800 and 1000. Each time we calculated the value of the likelihood ratio statistic.
Figure 1(a) shows a combined histogram of the test statistics under the null, with 100 repeats for each combination of the two genetic models, four sample sizes and three heritability levels (2,400 values in total) which, based on data fitting, is close to a distribution of {0} + 5 χ with df = 11. With this distribution and a significance level of $\alpha = 0.05$, Figure 1(c) shows the type I error rates with sample sizes of 200, 500, 800 and 1000 for heritability levels of 0.1, 0.2 and 0.3 separately.

To analyse the power under scenario 1, we also carried out the estimation procedure using different sample sizes of 200, 500, 800 and 1000. Figure 1(b) shows a combined histogram of the LR statistics for testing the null hypothesis $H_0: a = d = 0$, again with 100 repeats for each combination of the two genetic models, four sample sizes, and three heritability levels. The distribution is close to a non-central $\chi^2$ with the minimum values of the observed test statistics. Based on the threshold determined by the same distribution of {0} + 5 χ above and the 5% significance level, Figure 1(d) shows the power curves with sample sizes of 200, 500, 800 and 1000 for heritability levels of 0.1, 0.2 and 0.3, separately. Since the two distributions of the test statistic under $H_0$ and $H_a$ are well separated, it is clear that the LR statistic provides a powerful test for association between the QTL and markers, at least under this simple simulation scheme.

In the previous examples, random samples were simulated from the same haplotype distribution of QTL and markers. This simulation scheme mimics sampling from a common population and might be suitable for some experimental animal models, but it may bear little resemblance to the actual evolutionary process in human populations. As a second example, we considered a typical case in LD mapping studies in humans.
Table 3 Means and SEs of parameter estimates from 100 replicates under scenario 2
Column headings: Parameter setting; $h^2$; $\hat{h}^2$; $\hat{a}$; $\hat{d}$; $\hat{p}_Q$; $\hat{p}_{hap}$; $\hat{P}(Q_1 \mid hap)$; $\hat{P}(Q_0 \mid hap^c)$
a=1, d= (0.06) 1.09 (0.06) 0.02 (0.10) 0.23 (0.03) 0.20 (0.01) 0.96 (0.05) 0.95 (0.03)
p hap = (0.08) 1.17 (0.09) 0.00 (0.19) 0.25 (0.04) 0.20 (0.01) 0.93 (0.08) 0.92 (0.04)
(0.11) 1.39 (0.26) 0.06 (0.42) 0.30 (0.08) 0.20 (0.01) 0.88 (0.12) 0.84 (0.09)
a=1, d= (0.07) 1.18 (0.24) 0.93 (0.35) 0.24 (0.07) 0.20 (0.01) 0.97 (0.04) 0.94 (0.08)
p hap = (0.05) 1.20 (0.15) 1.11 (0.22) 0.23 (0.03) 0.20 (0.01) 0.94 (0.06) 0.95 (0.03)
(0.10) 1.71 (0.51) 0.95 (0.70) 0.29 (0.09) 0.20 (0.01) 0.87 (0.12) 0.86 (0.10)

Figure 1 (a) and (b) are histograms for the test statistics under the null $H_0$ and the alternative $H_a$, respectively. (c) and (d) are type I error rate and power curves corresponding to sample sizes of 200, 500, 800 and 1000, and heritability levels of 0.1, 0.2 and 0.3, respectively.

Suppose that at some point (generation G0) in the evolutionary history of a current population, a mutant allele that contributes to variation of a quantitative trait was brought in by a small group of individuals, either from chromosomal mutations or population migration. When the mutation occurred, LD was present between the mutant allele and its linked marker alleles. Through the generations this LD pattern may vary due to factors such as recombination, mutation or random genetic drift. If we assume that recombination plays the major role in shaping the LD pattern at the markers near the mutant, then with passing generations the magnitude of LD between the disease allele and the alleles at its surrounding markers will decay through the action of recombination at a rate of $(1 - c)$ per generation, where $c$ is the recombination fraction (Falconer & Mackay, 1996). Consequently, after many generations the marker alleles close to the disease-causing variant may preserve a stronger association with the mutant, and therefore with the disease phenotype, in the current population.

To generate the simulation samples here we adopted the simulation scheme described in Abdallah et al. (2004), which applies the gene dropping method. Though several other methods based on coalescent theory may also be applied, these methods usually regard the selection of markers as random as well. The gene dropping method can provide repeated populations with fixed QTL and marker positions, which allows us to assess the precision of LD estimation and the mapping ability of our method. We consider an 8 cM chromosomal region with 5 diallelic markers evenly distributed over it, with the mutant locus (QTL) located in the middle, between the 2nd and 3rd markers.
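To give a sense of scale for the decay rate $(1 - c)$ per generation, the expected fraction of the initial LD that survives can be sketched in Python. This is a deterministic illustration only, ignoring drift, selection and marker mutation. It assumes Haldane's map function to convert map distance into a recombination fraction, marker positions of 0, 2, 4, 6 and 8 cM with the QTL at 3 cM (one reading of "evenly distributed" with the QTL midway between the 2nd and 3rd markers), and a horizon of 200 generations; the function names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination fraction c for a map distance in cM (Haldane's map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def ld_remaining(distance_cm, generations):
    """Expected fraction D_t / D_0 of the initial LD left after t generations."""
    c = haldane_recombination(distance_cm)
    return (1.0 - c) ** generations


# Assumed layout: markers at 0, 2, 4, 6, 8 cM and the QTL at 3 cM.
marker_positions_cm = [0.0, 2.0, 4.0, 6.0, 8.0]
qtl_position_cm = 3.0
remaining = [
    ld_remaining(abs(pos - qtl_position_cm), generations=200)
    for pos in marker_positions_cm
]
```

Under these assumptions the two markers flanking the QTL (1 cM away) are expected to retain roughly 14% of their initial LD after 200 generations, while markers 3 cM or 5 cM away retain far less. This ordering is what makes the marker with the strongest association a natural starting point for locating the QTL.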
At generation G0 we assumed that one individual within a base population of 1,000 individuals received a QTL mutation in one of its haplotypes and had genotypes {111111/000000} at the QTL and marker loci, while all other individuals had QTL genotype {0/0} and marker genotypes simulated by assuming Hardy-Weinberg and linkage equilibria with allele frequencies of 0.5, 0.5, 0.3, 0.3 and 0.3 for markers 1 to 5, respectively. To avoid losing too many simulation replicates as a result of genetic drift, we allowed the mutant QTL allele to expand in the first 9 generations following G0, by conferring a selective preference on the mutant allele and an exponential growth of the family members with at least one parent being a carrier of the mutant allele (e.g., having two children per family). In later generations the haplotypes of progeny were inherited at random from the parental population. Recombinations within the region were based on a Poisson number of cross-overs and the recombination probabilities. Each marker allele was also allowed to mutate at a rate of $10^{-4}$ per generation. For each generation we confined the population size to 1,000 (i.e., 2,000 haplotypes). We ended at the 200th generation (G200) and regarded it as a sample from the current population. Due to random genetic drift, some alleles might become fixed (allele frequency <0.01) during the simulation procedure. We discarded a population when a fixation occurred in any generation. In total we generated 500 replicate samples from the base population. For each replicate sample simulated above, we randomly