Genetic Mapping of Complex Discrete Human Diseases by Discriminant Analysis


Xia Li, Shaoqi Rao, Kathy L. Moser, Robert C. Elston, Jane M. Olson, Zheng Guo, Tianwen Zhang

Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 4409, USA; Department of Computer Science, Harbin Institute of Technology, Harbin, China 5000; Department of Mathematics, Harbin Medical University, Harbin, China 50086; Center for Molecular Genetics, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, Ohio 4495, USA; Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA

In P. R. China: Xia Li, Professor, Department of Mathematics, Harbin Medical University, Harbin, China. E-mail: Liia6@yahoo.com or Liia@ems.hrbmu.edu.cn

In the United States of America: Shaoqi Rao, PhD, Center for Molecular Genetics, Lerner Research Institute /ND40, Cleveland Clinic Foundation, 9500 Euclid Avenue, Cleveland, Ohio 4495. E-mail: shaoqir@yahoo.com or raos@ccf.org

Abstract

The objective of the present study is to propose and evaluate a novel multivariate approach for genetic mapping of complex categorical diseases. The approach applies standard stepwise discriminant analysis to detect linkage, based on the differential marker identity-by-descent (IBD) distributions among the different groups of sib pairs. For a binary human disease, the three affection groups of a sib pair are defined as concordantly affected, discordant and concordantly unaffected. Two major advantages of this method are that it allows all markers to be tested simultaneously, together with other genetic and environmental factors, in a single multivariate setting, and that it avoids explicitly modeling the complex relationship between the affection status of sib pairs and the underlying genetic determinants. The efficiency and properties of the method are demonstrated via simulations.
The proposed multivariate approach successfully located the true position(s) under various genetic scenarios. The more important finding is that using very densely spaced markers (1-2 cM) leads to only a marginal loss of statistical efficiency of the proposed method in terms of gene localization and statistical power. These results establish its utility and advantages as a fine-mapping tool. A unique property of the proposed method is its ability to map multiple linked trait loci to their precise positions, owing to its sequential nature, as demonstrated via simulations.

Key words: Categorical traits, IBD, Linkage analysis, Multivariate

Introduction

A complex disease trait refers to a phenotype that does not follow simple Mendelian segregation attributable to a single gene locus, but instead can be caused by multiple disease loci, their interactions, polygenic inheritance and environmental effects [1, 2]. It is desirable to take all these factors and their interactions into account in a linkage analysis. Indeed, genetic dissection of the roles and functions of the various disease-causing factors is regarded as a main task in modern human genetic studies. Genetic analysis of a complex disease trait is complicated by its discrete phenotypic nature (usually binary), for which a linear relationship between the observable phenotypes and the underlying genetic effects does not exist.

Current univariate methods for genetic mapping of complex disease traits analyze one marker (or one off-marker position) at a time, which may generate two problems. First, a univariate analysis does not take into account the correlation among multiple linked markers that arises purely from linkage between them, and this results in correlated test statistics. Hence it tends to produce a flat test statistic profile and a wide empirical confidence interval for the estimated trait location. This phenomenon is more obvious when tightly linked marker data (e.g., fine-mapping data) are used. Second, a complex trait can be caused by the effects of multiple linked trait loci and their interactions. A univariate method might generate a false peak of the test statistic at a position between two close trait loci, usually called a ghost trait locus. Therefore, a multivariate method that is robust to assumptions about the number of trait loci involved and can account for the correlation structure of linked markers is needed.

Current methods for genetic mapping of a categorical disease trait include model-based methods [3], model-free methods [4], association analysis based on linkage disequilibrium [5] and novel multivariate pattern recognition techniques [6, 7].
Although model-based approaches have been very successful in mapping hundreds of disease-predisposing genes, it is difficult to carry out a good model-based linkage analysis of complex human diseases, for which the modes of inheritance are usually unknown prior to substantial genetic analyses. Model-free approaches, in which no assumption is made about the mode of inheritance of the trait under study, are popular in practice because of their simplicity. Typical model-free methods in human genetics are the affected-sib-pair (ASP) test and the extended affected-relative-pair tests. This group of methods and the transmission/disequilibrium test (TDT) share one disadvantage: they do not make full use of all the available data, because information from unaffected individuals is discarded by the very nature of the methods. Furthermore, it might be very difficult to find linkage disequilibrium with genes predisposing to a complex trait, as those genes are often very common, leading to extreme allelic and nonallelic heterogeneity in the population or pedigrees.

According to how they model the relationship between the outward discrete disease phenotypes and the underlying genetic effects, the approaches for genetic mapping of a disease trait can be divided into two classes. A linear model assumes a linear relationship, and the analysis is performed as if the discrete phenotype were continuous. A popular example is the (new) Haseman-Elston (H-E) regression [8, 9], in which a second-moment quantity (the squared sib-pair phenotypic difference in the old H-E, and the mean-corrected product of the sibs' phenotypic values in the new H-E) is regressed on the proportion of alleles that the sibs share identical by descent (IBD) at a marker locus to reveal a potential linkage to an unobservable trait locus. H-E enjoys the simplicity and robustness of a least-squares approach, and asymptotic efficiency.
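The old H-E regression described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `haseman_elston_fit` and the toy phenotypes and IBD proportions are hypothetical, and the fit is ordinary least squares with a single marker.

```python
def haseman_elston_fit(y1, y2, pi_hat):
    """Old H-E: regress squared sib-pair trait differences on IBD sharing.

    y1, y2 : phenotypes of the first and second sib of each pair
    pi_hat : estimated proportion of alleles shared IBD at one marker
    Returns (slope, intercept) of the least-squares line.
    """
    sq_diff = [(a - b) ** 2 for a, b in zip(y1, y2)]
    n = len(pi_hat)
    mean_x = sum(pi_hat) / n
    mean_y = sum(sq_diff) / n
    sxx = sum((x - mean_x) ** 2 for x in pi_hat)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pi_hat, sq_diff))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: pairs sharing more alleles IBD differ less.
pi_hat = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
y1 = [2.0, 0.0, 1.0, 0.0, 0.5, 0.0]
y2 = [0.0, 2.0, 0.0, 1.0, 0.0, 0.5]
slope, intercept = haseman_elston_fit(y1, y2, pi_hat)
```

In the old H-E setting, a clearly negative slope is the signal of interest, since sib pairs that share more alleles IBD near a trait locus are expected to be phenotypically more alike.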
It is important to recognize that a linear model, often used to analyze linkage between markers and continuous traits, is a powerful and robust tool for genetic mapping of discrete disease traits as well [10~13]. A non-linear model, such as a generalized linear model, takes the non-linear relationship and the discrete nature of the phenotypes into account. Statistically it is desirable, but modeling a sophisticated genetic architecture of disease manifestation can be prohibitively complex, and the computational demand is high, especially when random polygenic and common environmental effects are included [14, 15].

In this study we propose a novel multivariate approach to mapping disease loci in human genomic studies, with the intention of overcoming the weaknesses inherent in current methods. We integrate a standard stepwise discriminant analysis into a sib-pair linkage study. The rationale underlying this approach is that, if a disease locus is tightly linked to one or more molecular markers, differential marker IBD distributions among the affection groups of sib pairs (for a binary human disease trait: concordantly affected, discordant and concordantly unaffected) can be observed, because of the genetic effects of the disease locus on phenotypic manifestation and its tight linkage to nearby markers. Characteristics of this multivariate approach include: (1) it uses information from both affected and unaffected sibs. Hence, instead of using a reference population as a control, the three affection groups serve as controls for each other; (2) discriminant analysis is a multivariate technique concerned with separating distinct sets of objects (or observations) and with allocating new objects to previously defined groups. We search for the best discriminants, whose numerical values are such that the collections are separated as much as possible; (3) this approach avoids explicitly modeling the relationships between outward ordinal (or binary) phenotypes and the underlying genetic effects (major gene and polygenic background).
In sib-pair linkage studies, the very act of taking a second-moment quantity from the marginal phenotypes prohibits the direct use of a threshold model for non-linearly modeling the relationship between the re-defined phenotype (groups) and its components (a quadratic form of the underlying variates); (4) owing to the sequential testing properties of stepwise discriminant analysis, the approach is robust to assumptions about the number of trait loci involved and can control for a background of multiple trait loci. Here we demonstrate its properties using simulations.

Statistical Method

Consider a sib-pair linkage study of a categorical disease trait. Each sib can take any possible ordinal value, say $c$ ($c = 1, \ldots, C$). We define an affection group (a specific combination of the two ordinal values of a sib pair) as $G_i$ ($i = 1, \ldots, K$). Thus the total number of mutually exclusive groups is

$$K = C + \binom{C}{2} = \frac{C(C+1)}{2}.$$

$C = 2$ corresponds to a binary human disease trait (a special case of an ordinal trait with just two categories). We then have $K = 3$, and the groups might be defined as:

$G_2$ = concordant affected: both sibs in a sib pair are affected
$G_1$ = discordant: only one sib in a sib pair is affected
$G_0$ = concordant unaffected: neither sib in a sib pair is affected

so that we have a population consisting of three mutually exclusive groups. Next, for each sib pair we define a discriminant vector ($X$), which can include the following feature variables: (1) the estimated proportions of alleles ($\hat{\pi}$) shared IBD by the sib pair at $L$ markers along a chromosome (segment); and (2) other covariates (potential confounding factors for a linkage study), for example polygenic inheritance and common environment (aliased with the so-called household effect), mother effects (including maternal genetic and environmental effects, and mitochondrial effects) and other epidemiological factors (gender, race, age and so on). Modeling these covariates serves two purposes: to eliminate their interference with the assessment of linkage, and to evaluate their importance in the manifestation of the disease. In a linkage study, we are searching for a marker or cluster of markers (tightly linked to an unobserved trait locus) whose IBD (the feature variable) distributions among the disease affection groups lead to the partition (grouping) that best fits the observed one.

Suppose that we have $K$ disease affection groups and $M$ feature variables ($L$ marker IBD variates plus $M - L$ covariates). Let $N_1, N_2, \ldots, N_K$ be the sample sizes for $G_i$ ($i = 1, \ldots, K$). Then the feature vector data for the $N$ ($N = \sum_{i=1}^{K} N_i$) sib pairs can be expressed as

$$x^{(1)}_1, \ldots, x^{(1)}_{N_1};\; \ldots;\; x^{(K)}_1, \ldots, x^{(K)}_{N_K}.$$

Denote by $\mu_i$ ($i = 1, \ldots, K$) the mean vector corresponding to the population $G_i$, and assume that $x^{(i)}$ follows a multivariate normal distribution with a constant variance-covariance matrix, i.e., $x^{(i)} \sim N(\mu_i, \Sigma)$. A Wilks's ratio ($\Lambda$), equivalent to a likelihood ratio test statistic [16], which is used to test the effects of the feature variates on the separation of the disease affection groups, is constructed as

$$\Lambda = \frac{\det W}{\det T},$$

where $\det$ is the determinant, $W = \{w_{ij}\}$ is the within-group covariance matrix and $T = \{t_{ij}\}$ is the total covariance matrix, with

$$w_{ij} = \sum_{a=1}^{K} \sum_{l=1}^{N_a} \left(x^{(a)}_{li} - \bar{x}^{(a)}_i\right)\left(x^{(a)}_{lj} - \bar{x}^{(a)}_j\right), \qquad t_{ij} = \sum_{a=1}^{K} \sum_{l=1}^{N_a} \left(x^{(a)}_{li} - \bar{x}_i\right)\left(x^{(a)}_{lj} - \bar{x}_j\right),$$

where

$$\bar{x}^{(a)}_j = \frac{1}{N_a} \sum_{l=1}^{N_a} x^{(a)}_{lj}, \qquad \bar{x}_j = \frac{1}{N} \sum_{a=1}^{K} \sum_{l=1}^{N_a} x^{(a)}_{lj}, \qquad i, j = 1, \ldots, M.$$

A function of $\Lambda$, $-\left[N - 1 - (M + K)/2\right] \ln \Lambda$, is a random variable asymptotically following a chi-square distribution with $M(K - 1)$ degrees of freedom under the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$, that is, no differences among the mean vectors for the $K$ disease affection groups.
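As a rough illustration of the statistic just defined, the sketch below computes Wilks's ratio from within-group and total cross-product matrices and then applies the chi-square approximation. It is not the authors' implementation (the analyses in this paper were run in SAS); the helper names `det`, `sscp`, `wilks_lambda` and `bartlett_statistic`, and the nine toy sib pairs, are hypothetical.

```python
import math


def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def sscp(rows):
    """Sums of squares and cross-products about the column means."""
    m = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
             for j in range(m)] for i in range(m)]


def wilks_lambda(features, groups):
    """Lambda = det(W) / det(T) for feature rows labelled by group."""
    m = len(features[0])
    total = sscp(features)
    within = [[0.0] * m for _ in range(m)]
    for g in sorted(set(groups)):
        sg = sscp([x for x, lab in zip(features, groups) if lab == g])
        for i in range(m):
            for j in range(m):
                within[i][j] += sg[i][j]
    return det(within) / det(total)


def bartlett_statistic(lam, n, m, k):
    """Chi-square approximation and its degrees of freedom M(K - 1)."""
    return -(n - 1 - (m + k) / 2.0) * math.log(lam), m * (k - 1)


# Hypothetical toy data: IBD proportions at two markers for nine sib
# pairs in three affection groups coded 0, 1, 2.
features = [[0.0, 0.5], [0.5, 0.0], [0.25, 0.5],
            [0.5, 0.5], [0.5, 1.0], [0.75, 0.5],
            [1.0, 0.5], [0.75, 1.0], [1.0, 1.0]]
groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lam = wilks_lambda(features, groups)
chi_sq, df = bartlett_statistic(lam, n=len(features), m=2, k=3)
```

Values of `lam` close to 0 indicate well-separated group means, and values close to 1 indicate little separation. The chi-square approximation is asymptotic, so a sample as small as this toy one serves only to show the mechanics.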
To assess the contribution of each feature variable, we use a stepwise discriminant analysis procedure [17], which combines forward selection and backward elimination according to user-defined criteria (with the same p-value of 0.05 for inclusion and exclusion in this study). If the feature variable is a marker IBD variate, then assessing it is equivalent to detecting linkage to a putative trait locus. Suppose now that $p$ variables have been selected from the total of $M$ feature variates at the $s$-th step. To assess the individual contribution of each variable, say $x_r$, from the remaining $M - p$ feature variates, we partition $W$ and $T$ corresponding to the subset of $p$ variables and $x_r$ as

$$W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}, \qquad T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix},$$

where $W_{ij}$ and $T_{ij}$ ($i, j = 1, 2$) have meanings similar to those described previously. $W_{11}$ and $T_{11}$ are $p \times p$ matrices, $W_{12}$ and $T_{12}$ are $p \times 1$ matrices (vectors), and $W_{22}$ and $T_{22}$ are $1 \times 1$ matrices (scalars). The Wilks's statistic for the $p$ feature variates is

$$\Lambda_p = \frac{\det W_{11}}{\det T_{11}},$$

and for $p + 1$ variates it is

$$\Lambda_{p+1} = \frac{\det W}{\det T} = \frac{\det W_{11} \, \det\left(W_{22} - W_{21} W_{11}^{-1} W_{12}\right)}{\det T_{11} \, \det\left(T_{22} - T_{21} T_{11}^{-1} T_{12}\right)} = \Lambda_p \, \frac{\det\left(W_{22} - W_{21} W_{11}^{-1} W_{12}\right)}{\det\left(T_{22} - T_{21} T_{11}^{-1} T_{12}\right)}.$$

Hence

$$\frac{\Lambda_p - \Lambda_{p+1}}{\Lambda_{p+1}} = \frac{\det\left(T_{22} - T_{21} T_{11}^{-1} T_{12}\right) - \det\left(W_{22} - W_{21} W_{11}^{-1} W_{12}\right)}{\det\left(W_{22} - W_{21} W_{11}^{-1} W_{12}\right)}.$$

Under the assumption of multivariate normality we have

$$F_r = \frac{\Lambda_p - \Lambda_{p+1}}{\Lambda_{p+1}} \cdot \frac{N - p - K}{K - 1} \sim F(K - 1,\; N - p - K).$$

The variable $x_r$ with the largest value of the partial F-statistic is added to the current subset of $p$ feature variates, provided that it exceeds the specified critical value at $\alpha = 0.05$: $F_r > F_{\alpha}(K - 1,\; N - p - K)$.

We use a similar method in the backward elimination step. First, consider deleting a single variable from a set of $p$ feature variables. For each variable, the partial F-statistic is computed to test whether it provides additional information over the remaining $p - 1$ variates. The variate with the smallest partial F-statistic is eliminated first, provided that the statistic does not exceed the specified critical value at $\alpha = 0.05$: $F_r \le F_{\alpha}(K - 1,\; N - (p - 1) - K)$.

The stepwise discriminant procedure alternates forward selection and backward elimination. The procedure stops when none of the included variables can be removed and no further feature variables can be added.
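One forward-selection step can be sketched directly from the partial F-statistic above. The sketch assumes the Wilks's ratios have already been computed; the function names `partial_f` and `forward_step`, the marker labels and the critical value `f_crit` are hypothetical placeholders (this study uses permutation-based empirical thresholds rather than a fixed tabulated value).

```python
def partial_f(lambda_p, lambda_p_plus_1, n, k, p):
    """Partial F for adding one variable to p already-selected variables.

    Returns (F, df1, df2) with df1 = K - 1 and df2 = N - p - K.
    """
    df1, df2 = k - 1, n - p - k
    f = (lambda_p / lambda_p_plus_1 - 1.0) * df2 / df1
    return f, df1, df2


def forward_step(lambda_p, candidates, n, k, p, f_crit):
    """Pick the candidate with the largest partial F, if it exceeds f_crit.

    candidates maps a variable name to the Wilks's ratio obtained when
    that variable is added to the current subset.
    Returns (name or None, largest partial F).
    """
    scored = {name: partial_f(lambda_p, lam, n, k, p)[0]
              for name, lam in candidates.items()}
    best = max(scored, key=scored.get)
    return (best if scored[best] > f_crit else None), scored[best]


# Hypothetical numbers: 103 sib pairs, 3 affection groups, 2 variables
# already selected, two candidate markers still outside the subset.
f_value, df1, df2 = partial_f(0.8, 0.5, n=103, k=3, p=2)
chosen, f_best = forward_step(0.8, {"marker_3": 0.75, "marker_4": 0.5},
                              n=103, k=3, p=2, f_crit=3.0)
```

The backward step mirrors this: the included variable with the smallest partial F is dropped when that value does not exceed the critical value, and the two steps alternate until neither changes the subset.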

We are cautious about the nominal p-values given by the SAS stepwise discriminant procedure. Extensive empirical simulations show that the theoretical p-value given by SAS is liberal. Hence we resort to a permutation technique to obtain empirical thresholds. The statistical power is determined from the empirical values rather than from the nominal threshold values.

Simulation Studies

To investigate the efficiency and properties of the proposed discriminant analysis, Monte Carlo simulations are carried out. We design five simulation experiments. Factors considered include: (1) heritability of the trait locus/loci ( and 0.9), which is defined as the ratio of the segregation variance at the trait locus (loci) to the total phenotypic variance of the underlying liability (as explained later) for the ordinal disease phenotype; (2) marker density (1, 2, 5, 10 and 20 cM marker spacings); (3) categorical nature of the observations (ordinal versus binary); (4) disease prevalence (5%, 10%, 20%, 30%, 50%, 70%, 90%); (5) sample size ( sib pairs); and (6) two linked trait loci with various distances between them ( and 90 cM).

Only single chromosome segments of different lengths (5, 10, 25, 50 and 100 cM for 1, 2, 5, 10 and 20 cM marker spacings, respectively), each covered by six evenly spaced codominant markers with eight equally frequent alleles, are simulated. In the scenarios with one trait locus, a diallelic trait locus is simulated in the middle of the interval between the second and third marker loci. The two alleles of the trait locus are equally frequent. Hence the expected trait and marker heterozygosities are one half and 87.5%, respectively. For simplicity, simulated pedigrees consist only of nuclear families (two generations), each having four full sibs unless indicated otherwise. The total number of progeny N is fixed at 800, except for Experiment 4, in which the effects of sample size are investigated. As a result, the total number of sib pairs is 1200 (200 families with six sib pairs each). Under each design the simulation is repeated 00 times.
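The permutation idea mentioned at the start of this section can be sketched as follows. This is a generic label-permutation scheme under stated assumptions, not the authors' exact procedure: the function names, the toy statistic `group_mean_range` (a stand-in for the largest partial F of the stepwise procedure), the toy data and the default of 500 permutations are all hypothetical.

```python
import math
import random


def permutation_threshold(stat_fn, features, groups,
                          n_perm=500, alpha=0.05, seed=0):
    """Empirical (1 - alpha) threshold of stat_fn under shuffled labels.

    Shuffling the affection-group labels breaks any association between
    grouping and marker IBD values while keeping the group sizes fixed.
    """
    rng = random.Random(seed)
    labels = list(groups)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        null_stats.append(stat_fn(features, labels))
    null_stats.sort()
    rank = math.ceil((1.0 - alpha) * n_perm)
    return null_stats[min(n_perm, rank) - 1]


def group_mean_range(features, labels):
    """Toy statistic: spread of the group means of one IBD variate."""
    sums, counts = {}, {}
    for x, g in zip(features, labels):
        sums[g] = sums.get(g, 0.0) + x
        counts[g] = counts.get(g, 0) + 1
    means = [sums[g] / counts[g] for g in sums]
    return max(means) - min(means)


# Hypothetical toy data: one IBD proportion per sib pair, three groups.
ibd = [0.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 0.5, 0.5, 1.0, 0.0, 0.5]
groups = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
observed = group_mean_range(ibd, groups)
threshold = permutation_threshold(group_mean_range, ibd, groups)
significant = observed > threshold
```

An observed statistic is declared significant only when it exceeds the empirical threshold, which avoids relying on a nominal p-value that may be liberal.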
The global statistical power is determined by counting the number of runs in which the largest F-statistic among the finally included marker IBD variates in the stepwise discriminant procedure is greater than the empirical threshold at α = 0.05 or 0.01, obtained by simulating 500 replicates under the null model of no trait locus (loci) segregating. The local statistical power is determined similarly, but confined to a 0 cM region around the true trait locus (loci). The standard deviations calculated over the 00 replicates represent the standard errors of the parameter estimates and test statistics.

The simulation of the genetic values of the trait locus for a liability starts by randomly assigning two alleles (and their effects) to each parent, drawn from a random-mating population. A dominance effect is simulated as an interaction between the two parental alleles. The liability of each offspring is the sum of its genetic value, the overall mean and its residual error sampled from N(0. Effects of various covariates and polygenes on disease liability are not simulated. A set of fixed thresholds truncates the underlying liability into mutually exclusive and exhaustive intervals, which are then translated into observable categorical scores with the desired categorical distribution (prevalence).

Experiment 1: Heritability of the disease locus

A single diallelic trait locus is simulated at 15 cM on a chromosome of 50 cM. Six evenly spaced codominant markers with eight equally frequent alleles (one marker every 10 cM on the chromosome) make up the linkage group. Four fixed thresholds ( and 0.5) truncate the underlying liability into five categorical scores with incidences of 0.56%, 5.3%, 24.99%, 38.30% and 30.85%, respectively. The means and standard deviations of the location, F-statistic and R², and the statistical power, obtained from 00 simulations under each of five trait-locus-attributed heritabilities, are given in Table 1.
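The thresholding step described above can be sketched as follows. The sketch assumes a liability standardized to N(0, 1), which need not match the scale used in the simulations reported here; the function names `thresholds_for` and `categorize` and the 30% prevalence example are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def thresholds_for(incidences):
    """Cut points on a standard-normal liability for given incidences.

    incidences lists the category frequencies from the lowest to the
    highest category and should sum to 1.
    """
    cuts, cumulative = [], 0.0
    for p in incidences[:-1]:
        cumulative += p
        cuts.append(NormalDist().inv_cdf(cumulative))
    return cuts


def categorize(liability, cuts):
    """Categorical score: the number of thresholds at or below the liability."""
    return bisect_right(cuts, liability)


# Binary trait with 30% prevalence: score 0 = unaffected, 1 = affected.
cuts = thresholds_for([0.70, 0.30])
rng = random.Random(1)
scores = [categorize(rng.gauss(0.0, 1.0), cuts) for _ in range(10000)]
prevalence = sum(scores) / len(scores)
```

The same two functions handle an ordinal trait: passing five incidences yields four thresholds and scores from 0 to 4.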

Table 1. Effects of heritability of the disease locus on statistical efficiency of the proposed discriminant analysis approach in terms of statistical power and gene localization, averaged over 00 replicates. Standard errors are in parentheses.

Columns: Heritability; Location (cM); Partial R²; Test statistic (F); Global statistical power (a) at α = 0.05 and α = 0.01; Local statistical power (b) at α = 0.05 and α = 0.01.

(a) Global statistical power is determined by counting the number of runs in which the largest F-statistic over the whole region, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.
(b) Local statistical power is determined by counting the number of runs in which the largest F-statistic over the 0 cM region around the true trait locus, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.

The proposed multivariate approach found the true position, with a moderate bias under low trait-locus-attributed heritability (h² = 0.0) that is due to interference from false positives. Heritability has a major impact on the efficiency of genetic mapping in terms of accuracy of localization and statistical power. As heritability increases, both global and local statistical power increase dramatically, and the mean estimated location moves closer to the true position. Heritability clearly plays an important role in separating the disease affection groups via marker IBD information. The proportion of the marker IBD variance among sib pairs explained by the disease affection grouping (R²) increases with the magnitude of heritability. The larger standard errors of the F-statistic under higher heritabilities are due to scaling effects; the coefficients of variation are smaller.
Experiment 2: Marker density

We explore the potential of the proposed multivariate approach as a fine-mapping tool. As increasingly dense marker maps become available, a powerful statistical fine-mapping tool is needed. The association analysis approaches (for example, TDT and its derivatives) are one option. However, they consider one marker at a time and are expected to give a large number of false positives because of tight linkage between densely spaced markers. Moreover, these approaches suffer from multiple testing problems. The proposed multivariate approach allows all markers to be tested simultaneously, together with other genetic and environmental factors, in a single multivariate setting. Theoretically, it can remove false positives due to tight linkage between markers, which might otherwise be a serious problem in fine-mapping situations.

The same experimental design as for Experiment 1 is used, except that the heritability of the trait locus is held fixed. Five levels of marker density, from sparsely to densely distributed markers, are investigated. As shown in Table 2, marker density influences the performance of the proposed approach in the sib-pair linkage analysis of an ordinal trait. There may be an optimal marker density ( 0 cM marker spacings) at which genetic mapping by the stepwise discriminant analysis approach reaches its maximal power and minimal estimation errors occur. If so, this would provide a guide for a multi-stage genetic mapping design. The more important finding is that using very densely spaced markers (1-2 cM) leads to little loss of statistical efficiency in terms of localization and global statistical power. These results support its utility and advantages as a fine-mapping tool.

Table 2. Effects of marker density on statistical efficiency of the proposed discriminant analysis approach in terms of statistical power and gene localization, averaged over 00 replicates. Standard errors are in parentheses.

Columns: Marker distance (a) (cM); Location (cM); Partial R²; Test statistic (F); Global statistical power (b) at α = 0.05 and α = 0.01; Local statistical power (c) at α = 0.05 and α = 0.01.

(a) The true location of the trait locus is in parentheses for this column.
(b) Global statistical power is determined by counting the number of runs in which the largest F-statistic over the whole region, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.
(c) Local statistical power is determined by counting the number of runs in which the largest F-statistic over the 0 cM region around the true trait locus, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.

Experiment 3: A binary trait and disease prevalence

A complex human disease is often scored as binary (affected and unaffected), which can be considered an extreme case of an ordinal disease trait with just two categories. The proposed approach utilizes information from both affected and unaffected individuals and classifies the phenotypes of a sib pair into three affection groups. Theory and empirical simulations suggest that genetic mapping of a binary disease is less efficient than that of an ordinal disease trait when generalized-linear-model-based approaches are used [0].
However, this argument may not apply to a sib-pair linkage study using a discriminant analysis approach, where we work on a second-moment form (group) of the marginal phenotypes instead of directly modeling the relationship between the marginal phenotypes and the underlying genetic effects. Moreover, in a discriminant analysis, the affection groups obtained from the marginal ordinal phenotypes of a sib pair are no longer a response variable, but instead serve as an explanatory variable for the marker IBD distributions. These characteristics may lead to conclusions very different from those of other approaches to modeling marginal categorical phenotypes.

Table 3. Effects of prevalence of the disease locus on statistical efficiency of the proposed discriminant analysis approach for genetic mapping of complex binary human diseases in terms of statistical power and gene localization, averaged over 00 replicates. Standard errors are in parentheses.

Columns: Prevalence; Location (cM); Partial R²; Test statistic (F); Global statistical power (a) at α = 0.05 and α = 0.01; Local statistical power (b) at α = 0.05 and α = 0.01.

(a) Global statistical power is determined by counting the number of runs in which the largest F-statistic over the whole region, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.
(b) Local statistical power is determined by counting the number of runs in which the largest F-statistic over the 0 cM region around the true trait locus, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.

We simulate a binary trait with incidence rates of 10%, 20%, 30%, 50%, 70% and 90%. The results are shown in Table 3. Three important conclusions can be drawn from this simulation experiment: (1) more power can be obtained for a binary disease trait than for an ordinal trait. The proposed multivariate approach for genetic mapping of a binary trait outperforms analyses of an ordinal trait in some situations. Under the same heritability of the trait locus (h² = 0.30), the power (94% at α = 0.05) for a binary disease trait with a prevalence of 30% is larger than that for the corresponding ordinal trait (8%; see Table for cross-reference); (2) the performance of the proposed approach is not symmetric with respect to the prevalence of the marginal binary phenotype. The optimal performance in terms of localization and statistical power is attained at a prevalence of 30%. It is interesting to note that a low prevalence (10%) leads to a higher statistical power than a high prevalence (90%), which might imply that the proposed method is a powerful tool for genetic mapping of diseases with low prevalence and for data randomly sampled from a general population rather than recruited in hospitals. We encourage further studies on this statistical issue.
Experiment 4: Sample size

Sampling issues, including sample size, are important in the experimental design of a genetic mapping study because of their direct association with recruiting costs. Generally, a sib-pair linkage study requires a larger sample size to achieve an acceptable power of detection of a disease locus (loci) than approaches that also exploit linkage phase information, for example the generalized-linear-model-based approaches for plant and animal populations []. However, our simulations in the previous section indicate that our method performs better, in terms of statistical power, at a low disease prevalence than at a high one, which implies that data might be recruited from the general population instead of from a hospital setting. Hence the requirement of a large sample size in human genomic studies is less restrictive when our proposed multivariate approach is applied, because recruiting from a general population is much easier.

To investigate the effects of sample size, we simulate five levels of sample size: and 50 nuclear families with four sibs each, respectively. Hence the corresponding numbers of sib pairs are and 00, respectively. Again, a trait with five categories is simulated. A single trait locus with a trait-locus-attributed heritability of 0.30, situated at 15 cM (between the second and third markers) on a 50 cM chromosomal segment covered by six evenly spaced markers, is simulated. As expected, our simulations show that increasing the sample size increases power and accuracy of localization, averaged over 00 replicates. However, the partial R², the proportion of the marker IBD variance among sib pairs explained by the disease affection grouping, does not behave as we expected: for unknown reasons, a small sample size leads to a large value of it.

Table 4. Effects of sample size on statistical efficiency of the proposed multivariate discriminant analysis approach for genetic mapping of complex ordinal human diseases in terms of statistical power and gene localization, averaged over 00 replicates. Standard errors are in parentheses.

Columns: Sample size; Location (cM); Partial R²; Test statistic (F); Global statistical power (a) at α = 0.05 and α = 0.01; Local statistical power (b) at α = 0.05 and α = 0.01.

(a) Global statistical power is determined by counting the number of runs in which the largest F-statistic over the whole region, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.
(b) Local statistical power is determined by counting the number of runs in which the largest F-statistic over the 0 cM region around the true trait locus, among the finally included marker IBD variates in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.01.

Experiment 5: A binary trait controlled by two linked trait loci

A unique property of the proposed method is its ability to map multiple linked trait loci to their precise positions, owing to its sequential nature. We demonstrate this property by simulating two linked trait loci with various distances between them.
For this purpose, a binary trait with an incidence of 30%, controlled by two trait loci located on a single chromosome of length 100 cM covered by eleven evenly spaced codominant markers (one every 10 cM), is simulated. For convenience, we assume that the two trait loci contribute equal genetic variance to the variability of the underlying liability. No epistasis between the two loci is simulated. Note that the covariance between the two trait loci due to linkage is

$$\frac{1}{4}\left[\left(\alpha^{m}_{11} - \alpha^{m}_{12}\right)\left(\alpha^{m}_{21} - \alpha^{m}_{22}\right) + \left(\alpha^{f}_{11} - \alpha^{f}_{12}\right)\left(\alpha^{f}_{21} - \alpha^{f}_{22}\right)\right](1 - 2r),$$

where $r$ is the recombination fraction between the two trait loci and $\left(\alpha^{s}_{q1} - \alpha^{s}_{q2}\right)$ represents the effect of gene substitution at the $q$th ($q = 1, 2$) trait locus in the $s$th ($s = m, f$) parent. It is evident that this covariance depends on the recombination fraction between the two loci: the closer the two trait loci, the larger the covariance. Under the assumption of no epistasis and no gender-specific genetic heterogeneity, this covariance simplifies to

$$\frac{1}{2}\left(\alpha_{11} - \alpha_{12}\right)\left(\alpha_{21} - \alpha_{22}\right)(1 - 2r).$$
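To give a feel for how the factor (1 - 2r) shrinks with map distance, the sketch below converts a distance in cM to a recombination fraction. The text does not state which map function was used in the simulations, so Haldane's map function (no crossover interference) is an assumption here, and the function names and listed distances are illustrative only.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination fraction r from map distance (cM), Haldane's function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def linkage_factor(distance_cm):
    """The factor (1 - 2r) that scales the covariance between two loci."""
    return 1.0 - 2.0 * haldane_recombination(distance_cm)


# Illustrative distances between two trait loci, in cM.
factors = {d: linkage_factor(d) for d in (20, 40, 60, 90)}
```

Under this assumption the factor equals exp(-d/50) for a distance of d cM, so it decays steadily from 1 for completely linked loci towards 0 for loci far apart, in line with the statement that closer loci have a larger covariance.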

We hold the total trait-locus-attributed heritability for the liability fixed. The gene substitution effects are appropriately scaled under each distance option. Distances between the two trait loci considered are cM respectively. Under the assumption of equal genetic contribution of the two trait loci, the corresponding gene substitutions are and respectively, which translate the marginal genetic variance of each trait locus to be (h² = (h² = (h² = 0.7 and (h² = 0.5. A total of 00 simulations for each distance are conducted for N = 1200 sib pairs. The two mean trait locations are determined by averaging the locations of the two markers with the largest F-statistics (included in the final subset) during the stepwise discriminant analysis. The statistical power is obtained by counting the number of replicates with an F-statistic larger than the critical value at the given α level. The averaged estimates of the trait locus parameters (locations, the proportion of the marker IBD variance among sib pairs explained by the disease affection grouping (R²), and test statistics) and the statistical power are listed in Table 5. Their standard errors, calculated as standard deviations over the 00 replicated simulations, are given in parentheses.

The results show that the proposed multivariate approach can map multiple linked loci to their precise positions, even when the two trait loci are close to each other (20 cM) and separated by only two markers. This property removes the concern of a ghost quantitative trait locus, a well-known phenomenon when closely linked trait loci are detected with a single-trait-locus model. It establishes the advantages of the proposed method as a genome screening tool, as it is robust to assumptions about the number of trait loci involved, which is indeed unknown prior to a substantial genetic analysis. Furthermore, it avoids the complex modeling work of a multiple-trait-loci model, because it is essentially model-free in this respect.
It is interesting to note that the averaged R² for each trait locus of two closely linked trait loci (e.g., 0 and 40 cM) is higher than that for loosely linked loci, although the marginal genetic contributions are lower at smaller distances between them. Generally, it is more difficult to detect multiple closely linked loci, which is demonstrated by a small decrease in statistical power.

Table 5. Effects of the distance between two QTLs on the statistical efficiency of the proposed discriminant analysis approach for genetic mapping of complex binary human diseases, in terms of statistical power and gene localization, averaged over 00 replicates. Standard errors are in parentheses.

Columns: Distance between two QTLs (a) | Location (cM): QTL1, QTL2 | Partial R²: QTL1, QTL2 | Test statistic (F): QTL1, QTL2 | Global statistical power (b): QTL1, QTL2, at α = 0.05 (α = 0.0) | Local statistical power (c): QTL1, QTL2, at α = 0.05 (α = 0.0)

Rows (distance between two QTLs, cM): 90, 60, 40, 0

a The true locations of the two trait loci are in parentheses for this column.
b Global statistical power is determined by counting the number of runs in which the largest F statistic over the whole region, among the marker IBD variates finally included in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.0.
c Local statistical power is determined by counting the number of runs in which the largest F statistic over the 0 cM region around the true trait locus, among the marker IBD variates finally included in the stepwise discriminant procedure, is greater than the empirical threshold value at α = 0.05 or 0.0.
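The distinction between global and local power in the table notes can be sketched as follows. Each replicate is represented by the (position, F) pairs of the markers retained by the stepwise procedure; the toy values, the threshold, and the window half-width are hypothetical and chosen only to make the counting rule concrete.

```python
def max_f_in_window(included, center=None, half_width=None):
    """Largest F among included markers.

    If center is given, only markers with |position - center| <= half_width
    are considered; 0.0 is returned when no marker qualifies.
    """
    values = [f for pos, f in included
              if center is None or abs(pos - center) <= half_width]
    return max(values, default=0.0)

def power(replicates, threshold, center=None, half_width=None):
    """Fraction of replicates whose (windowed) largest F exceeds the threshold."""
    hits = sum(max_f_in_window(rep, center, half_width) > threshold
               for rep in replicates)
    return hits / len(replicates)

# Made-up replicates: lists of (marker position in cM, F-statistic).
toy = [
    [(10, 9.0), (70, 3.0)],
    [(30, 6.0)],
    [(70, 5.5), (90, 2.0)],
    [(50, 1.0)],
]
global_power = power(toy, threshold=4.0)
local_power = power(toy, threshold=4.0, center=70, half_width=10)
```

Local power can never exceed global power under this rule, because the windowed maximum is taken over a subset of the markers used for the global maximum.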

3 Discussion

In this work, we have demonstrated the potential of stepwise discriminant analysis as a screening tool in a linkage study. The advantage of stepwise discriminant analysis is that it allows for the simultaneous testing of all marker and environmental factors in a single setting. Thus the method identifies not only single gene actions but also combinations of gene actions (for example, epistasis). Because the analysis is carried out in a single run, the method does not suffer from the problems associated with multiple testing. Another important feature of the proposed method is that it is essentially model-free, in that no specific relationship between the response variable and the independent variables is assumed. Hence it is not as sensitive to model misspecification as a linear model or a generalized linear model.

Discriminant analysis has some analogies to logistic regression; for example, both can be used to analyze discrete data. However, we point out a weakness in the application of logistic regression in a sib-pair linkage study. The very act of taking the cross-product in a sib-pair regression changes the relationship between the model components, so that the threshold model implied by a logistic regression, which is used to model the relationship between the redefined phenotype (the affection state of a sib pair) and its components, might not be valid. Consequently, (stepwise) logistic regression might have lower statistical power [7]. It is interesting to compare discriminant analysis with logistic regression. Though a bit surprising, it is logical that discriminant analysis performs better, because logistic regression is generally inferior to discriminant analysis when the multivariate assumptions hold or are nearly true [16]. In discriminant analysis we need distributional assumptions on the feature vector but not on the grouping variable, which is the key difference from logistic models.
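To make the selection step concrete, the following is a minimal sketch of forward selection driven by the partial Wilks' lambda F-to-enter statistic. It is a simplification: it has no removal step and uses a fixed, hypothetical F threshold, whereas the SAS procedure used in the study applies its own entry and removal criteria. The data are synthetic normal variates standing in for marker IBD values, and the group labels 0, 1, 2 stand in for the three affection groups of sib pairs.

```python
import numpy as np

def wilks_lambda(X, groups):
    """Wilks' lambda det(W) / det(T) for the columns of X."""
    centered = X - X.mean(axis=0)
    total = centered.T @ centered              # total SSCP matrix T
    within = np.zeros_like(total)              # pooled within-group SSCP matrix W
    for label in np.unique(groups):
        block = X[groups == label]
        dev = block - block.mean(axis=0)
        within += dev.T @ dev
    return np.linalg.det(within) / np.linalg.det(total)

def f_to_enter(X, groups, selected, candidate):
    """Partial F for adding `candidate` to the already selected columns."""
    n, g, p = X.shape[0], len(np.unique(groups)), len(selected)
    lam_new = wilks_lambda(X[:, selected + [candidate]], groups)
    lam_old = wilks_lambda(X[:, selected], groups) if selected else 1.0
    partial = lam_new / lam_old
    return (n - g - p) / (g - 1) * (1.0 - partial) / partial

def forward_select(X, groups, f_threshold=4.0):
    """Add the variable with the largest F-to-enter until none exceeds the threshold."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        scores = {j: f_to_enter(X, groups, selected, j) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < f_threshold:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic example: only column 2 differs between the three groups.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 5))
X[:, 2] += 0.8 * groups
selected = forward_select(X, groups)
```

With no variable yet selected, the F-to-enter reduces to the ordinary one-way analysis-of-variance F for that variable across the groups, so the first step simply picks the variate that best separates the groups.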
The sequential test properties of a (stepwise) discriminant analysis suggest that it might also be used to refine the genetic mapping of a discrete disease trait. Our own simulation study suggests that it can efficiently localize a trait locus to within less than a cM interval using fine-map data. This advantage and others deserve further investigation, as they might provide insights into the fine mapping of human diseases. We assumed in this study that the feature vectors for sib pairs are independent and follow multivariate normal distributions, both of which can be violated in the context of a sib-pair linkage study, especially when large sibships are recruited. For example, the IBD vectors of sib pairs within a family tend to be correlated. It may also happen that some feature variables are not normally distributed. Moreover, the p-values reported by the stepwise discriminant analysis procedure of SAS are liberal. These issues deserve attention in future studies.

Acknowledgements

XL, ZG and TZ are supported partly by Grant NSF from the National Science and Technology Committee of China and by the China Scholarship Council. SR, RCE and JMO are supported partly by Grant HG0577 from the National Center of Human Genome Research of the USA and GM8356 from the National Institute of General Medical Sciences. Some of the results of this paper were obtained by using the program package S.A.G.E., supported by a United States Public Health Service Resources Grant RR03655 from the National Center for Research Resources.

References

1 Terwilliger J D. Linkage analysis, model-based. In: Armitage P, Colton T, Editors-in-Chief. Encyclopedia of Biostatistics. West Sussex: John Wiley & Sons Ltd., 1998, Vol 3: 79~9
2 Olson J M, Witte J S, Elston R C. Genetic mapping of complex traits. Statist Med, 1999, 8: 96~98
3 Ott J. Analysis of Human Genetic Linkage. Baltimore: Johns Hopkins University Press, 1991
4 Olson J M. Linkage analysis, model-free. In: Armitage P, Colton T, Editors-in-Chief. Encyclopedia of Biostatistics. West Sussex: John Wiley & Sons Ltd., 1998, Vol 3:
5 Spielman R S, Ewens W J. The TDT and other family-based tests for linkage disequilibrium and association. Am J Human Genet, : 983~989
6 Lucek P R, Ott J. Neural network analysis of complex traits. Genet Epidemiol, 1997, 4: 0~06
7 Li X, Rao S, Elston R C, et al. Locating the genes underlying a simulated complex disease by discriminant analysis. Genet Epidemiol, 00 (in press)
8 Elston R C, Buxbaum S, Jacobs K B, et al. Haseman and Elston revisited. Genet Epidemiol, 2000, 8: ~7
9 Elston R C, et al. S.A.G.E.: Statistical Analysis for Genetic Epidemiology, Beta. The Department of Epidemiology and Biostatistics, Rammelkamp Center for Education and Research, Case Western Reserve University, Cleveland
10 Rao S, Xu S. Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity, 1998, 8: 4~4
11 Rao S, Li X. Strategies for genetic mapping of categorical traits. Genetica, : 83~97
12 Rao S, Li X, Guo Z, et al. Mapping quantitative trait loci of ordinal traits using a structural heteroskedastic threshold model: I. Theory. Mathematica Applicata, 00, 4(3): 46~5
13 Rao S, Zhang Q, Li X, et al. Mapping quantitative trait loci of ordinal traits using a structural heteroskedastic threshold model: II. Numerical application. Mathematica Applicata, 004(: 5~3
14 Yi N, Xu S. Bayesian mapping of quantitative loci under the identity-by-descent-based variance component model. Genetics, : 4~4
15 Yi N, Xu S. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics, : 39~403
16 McLachlan G J. Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, ~400
17 SAS Institute Inc. SAS/STAT User's Guide, Version 6, Fourth Edition, Volume . Cary, NC: SAS Institute Inc., ~509
