GENETIC LINKAGE ANALYSIS


Atlas of Genetics and Cytogenetics in Oncology and Haematology

I- Genetic linkage analysis
   I-1. Recombination fraction
   I-2. Definition of the "lod score" of a family
   I-3. Test for linkage
   I-4. Estimation of the recombination fraction
   I-5. Recombination fraction for a disease locus and a marker locus
   I-6. Linkage analysis for three loci: the phenomenon of interference
   I-7. References
II- Genetic heterogeneity of localization
   II-1. The "Predivided sample test"
   II-2. The "Admixture Test"
   II-3. Generalization of the "admixture test"
   II-4. References
III- Statistical properties of the method of lod scores
   III-1. The test procedure
      III-1.1. Impact of non-sequentiality
      III-1.2. Maximization of the lod score over the [0, 1/2] interval
   III-2. Genotype information
      III-2.1. Ambiguity in phenotype-genotype relationships at the disease locus
      III-2.2. Ambiguity in the marker genotype
      III-2.3. Gamete disequilibrium between alleles at the disease locus and at the marker locus
   III-3. The problem of multiple tests
   III-4. References

I- GENETIC LINKAGE ANALYSIS

Investigating the linked segregation of genes situated at different loci is a way of testing the independence of their transmission. This concept of independence is reflected in the recombination fraction θ, which is the proportion of the gametes transmitted by a parent that are recombined. If the genes are transmitted independently, there are as many recombined gametes as parental gametes, and so θ = 1/2. If they are not transmitted independently, the parental gametes are transmitted preferentially to the recombined gametes, and 0 ≤ θ < 1/2. In this case, there is said to be "linkage" between the two loci.

I-1. RECOMBINATION FRACTION

Let us consider the case of two loci, A and B, with two codominant alleles at each of these loci: A1, A2 and B1, B2 respectively.
Consider an individual who is heterozygous at both loci (genotype A1A2 B1B2). Such an individual can produce four types of gamete: A1B1, A2B1, A1B2 and A2B2. Two situations are possible:

1. The loci A and B are on different chromosome pairs. In this case, the four gametes all have the same probability: 1/4.

2. The loci A and B are on the same chromosome pair.

(Figure 1)

Here we have to distinguish between two possible situations: the alleles A1 and B1 may be on the same chromosome within the pair, in which case A1 and B1 are said to be "coupled"; or they may be on different chromosomes, in which case A1 and B1 are said to be in a state of "repulsion".

(Figure 2)

For instance, let us suppose that A1 and B1 are "coupled". Four types of gametes are still produced.

(Figure 3)

Gametes A1B1 and A2B2 are said to be "parental": in the offspring, as in the parent, A1 is "coupled" with B1 (and A2 is "coupled" with B2). The gametes A1B2 and A2B1, in which this coupling is broken, are described as "recombined": an odd number of recombination events, or "crossings-over", has occurred between the A and B loci.
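The distinction between parental and recombined gametes can be made concrete with a small sketch. It assumes only the definition of θ given above (the proportion of recombined gametes), so that each of the two recombined gametes has probability θ/2 and each of the two parental gametes (1-θ)/2. The function name `gamete_probabilities` and its `phase` argument are illustrative and do not come from the original text.

```python
def gamete_probabilities(theta, phase="coupling"):
    """Gamete probabilities for a double heterozygote A1A2 B1B2.

    theta is the recombination fraction (0 <= theta <= 1/2); phase says
    whether A1 and B1 sit on the same chromosome ("coupling") or on
    different chromosomes of the pair ("repulsion").
    Names are hypothetical, chosen for this illustration only.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 1/2]")
    parental = (1.0 - theta) / 2.0   # each of the two parental gametes
    recombined = theta / 2.0         # each of the two recombined gametes
    if phase == "coupling":
        return {"A1B1": parental, "A2B2": parental,
                "A1B2": recombined, "A2B1": recombined}
    if phase == "repulsion":
        return {"A1B2": parental, "A2B1": parental,
                "A1B1": recombined, "A2B2": recombined}
    raise ValueError("phase must be 'coupling' or 'repulsion'")
```

With θ = 1/2 the four gametes are equiprobable (1/4 each), which is the situation of loci on different chromosome pairs; with θ < 1/2 the parental gametes are favoured.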

(Figure 4)

Assuming that the number of crossing-over events between the two loci follows a Poisson distribution, and knowing that a parental gamete results from zero or an even number of crossings-over, whereas a recombined gamete results from an odd number, we can show that the frequency of recombined gametes is always equal to or lower than that of the parental gametes, and so 0 ≤ θ ≤ 1/2.

If θ = 1/2, then all the gamete types have the same probability and the alleles at the A and B loci are transmitted independently. Loci A and B are then said not to exhibit genetic linkage. This is the situation if A and B are on different pairs of chromosomes, and also if A and B are on the same pair but at some distance from each other. However, if θ < 1/2, then the two loci are genetically linked.

For a couple whose genotypes at the A and B loci are known, the probability of observing the genotypes of the offspring depends on the value of θ. Let us assume the following mating (Figure 5). Such a couple can have 4 types of offspring (Figure 6).

Assuming that there is gamete equilibrium at the A and B loci, in parent 1 there is a probability of 1/2 that alleles A1 and B1 will be coupled, and a probability of 1/2 that they will be in repulsion.

Case (1): A1 and B1 are coupled. The probability that parent 1 provides gamete A1B1, or gamete A2B2, is (1-θ)/2 each, and the probability that this parent provides gamete A1B2, or gamete A2B1, is θ/2 each. The probability that the couple will have a child of type (1), or of type (2), is therefore (1-θ)/2 each, and that of their having a type (3), or a type (4), child is θ/2 each.

The probability of finding n1 children of type (1), n2 of type (2), n3 of type (3) and n4 of type (4) is therefore

$$\left[\frac{1-\theta}{2}\right]^{n_1+n_2} \times \left(\frac{\theta}{2}\right)^{n_3+n_4}$$

Case (2): A1 and B1 are in a state of repulsion. The probability that parent 1 provides gamete A1B2, or gamete A2B1, is (1-θ)/2 each, and the probability that this parent provides gamete A1B1, or gamete A2B2, is θ/2 each. The probability of the previous observation is therefore:

$$\left(\frac{\theta}{2}\right)^{n_1+n_2} \times \left[\frac{1-\theta}{2}\right]^{n_3+n_4}$$

So in the end, with no additional information about the phase of A1 and B1, and assuming that the alleles at the A and B loci are in gamete equilibrium, the probability of finding n1, n2, n3 and n4 children in categories (1), (2), (3) and (4) is:

$$p(n_1,n_2,n_3,n_4 \mid \theta) = \frac{1}{2}\left\{\left[\frac{1-\theta}{2}\right]^{n_1+n_2}\left(\frac{\theta}{2}\right)^{n_3+n_4} + \left(\frac{\theta}{2}\right)^{n_1+n_2}\left[\frac{1-\theta}{2}\right]^{n_3+n_4}\right\}$$

So the likelihood of θ for an observation n1, n2, n3, n4 can be written:

$$L(\theta \mid n_1,n_2,n_3,n_4) = \frac{1}{2}\left\{\left[\frac{1-\theta}{2}\right]^{n_1+n_2}\left(\frac{\theta}{2}\right)^{n_3+n_4} + \left(\frac{\theta}{2}\right)^{n_1+n_2}\left[\frac{1-\theta}{2}\right]^{n_3+n_4}\right\}$$

Special case: number of children n = 1

Regardless of the category to which this child belongs,

$$L(\theta) = \frac{1}{2}\left[\frac{1-\theta}{2}\right] + \frac{1}{2}\left[\frac{\theta}{2}\right] = \frac{1}{4}$$

The likelihood of this observation for the family does not depend on θ. We can say that such a family is not informative for θ.

Informative families

An "informative family" is a family for which the likelihood is a non-constant function of θ. One essential condition for a family to be informative is, therefore, that it has more than one child. Furthermore, at least one of the parents must be heterozygous at both loci.

Definition: if one of the parents is doubly heterozygous and the other is
   a double homozygote, we have a backcross;
   a single homozygote, we have a simple backcross;
   a double heterozygote, we have a double intercross.

I-2. DEFINITION OF THE "LOD SCORE" OF A FAMILY

Take a family in which the genotypes at the A and B loci are known for each of the members.
Let L(θ) be the likelihood of a recombination fraction 0 ≤ θ < 1/2, and L(1/2) the likelihood of θ = 1/2, that is, of independent segregation at A and B. The lod score of the family at θ is:

$$Z(\theta) = \log_{10}\left[\frac{L(\theta)}{L(1/2)}\right]$$

Z can be taken to be a function of θ defined over the range [0, 1/2].

Lod score of a sample of families

The likelihood of a value of θ for a sample of independent families is the product of the likelihoods of each family, and so the lod score of the whole sample is the sum of the lod scores of each family.
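As a rough illustration of these definitions, the sketch below evaluates the phase-unknown likelihood of section I-1 and the corresponding lod score, then sums lod scores over independent families. The function names and the convention of passing the pooled counts n12 = n1 + n2 and n34 = n3 + n4 are assumptions made for this example, not part of the original text.

```python
import math

def family_likelihood(theta, n12, n34):
    """Phase-unknown likelihood L(theta) of section I-1.

    n12 = n1 + n2 and n34 = n3 + n4 are the pooled offspring counts
    (hypothetical argument names).
    """
    p = (1.0 - theta) / 2.0
    r = theta / 2.0
    coupling = p ** n12 * r ** n34    # A1 and B1 coupled in parent 1
    repulsion = r ** n12 * p ** n34   # A1 and B1 in repulsion
    return 0.5 * (coupling + repulsion)

def lod_score(theta, n12, n34):
    """Z(theta) = log10[L(theta) / L(1/2)] for one family."""
    l_theta = family_likelihood(theta, n12, n34)
    if l_theta == 0.0:                # observation impossible at this theta
        return float("-inf")
    return math.log10(l_theta / family_likelihood(0.5, n12, n34))

def sample_lod_score(theta, families):
    """Sum of family lod scores; families is a list of (n12, n34) pairs."""
    return sum(lod_score(theta, n12, n34) for n12, n34 in families)
```

A family with a single child gives `family_likelihood(theta, 1, 0) == 0.25` whatever the value of θ, which is the non-informative case described above.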

I-3. TEST FOR LINKAGE

Several methods have been proposed to detect linkage: "U scores", suggested by Bernstein in 1931; "the sib pair test", by Penrose in 1935; "likelihood ratios", by Haldane and Smith in 1947; and "the lod score method", proposed by Morton in 1955 (1). Morton's method is the one most commonly used at present.

The test procedure in the lod score method is sequential (Wald, 1947 (2)). Information, i.e. the number of families in the sample, is accumulated until it is possible to decide between the hypotheses H0 and H1:

   H0: genetic independence, θ = 1/2
   H1: linkage at θ = θ1, with 0 ≤ θ1 < 1/2

The lod score of the sample at θ1,

$$Z(\theta_1) = \log_{10}\left[\frac{L(\theta_1)}{L(1/2)}\right]$$

indicates the relative probabilities of observing the sample under H1 and under H0. Thus, a lod score of 3 means that the sample is 1000 times more probable under H1 than under H0 ("lod" = logarithm of the odds).

The decision thresholds of the test are usually set at -2 and +3, so that if:

   Z(θ1) ≥ 3: H0 is rejected, and linkage is accepted.
   Z(θ1) ≤ -2: linkage at θ1 is rejected.
   -2 < Z(θ1) < 3: it is impossible to decide between H0 and H1. It is necessary to go on accumulating information.

For the thresholds chosen, -2 and +3, we can show that:

   The type I error, α < 10^-3
   The type II error, β < 10^-2
   The reliability, 1-ρ > 0.95 for all θ1
   The power, P(θ) > 0.80 for all θ1 if the true value of θ is below 0.10

(Figure 7)

Details about the principle underlying the test are to be found in Wald (2), and the justification for the criteria -2 and +3 in Morton (1).

In fact, what is being tested is not a single value θ1 relative to θ = 1/2, but a whole set of values between 0 and 1/2, taken on a grid with a step such as 0.01 or 0.05.

If there is a value of θ1 such that Z(θ1) ≥ 3: linkage is concluded to exist.

(Figure 8)

If there is a value of θ1 such that Z(θ1) = -2: linkage is excluded for any θ ≤ θ1.

(Figure 9)

If, for all θ, -2 < Z(θ) < 3: no conclusion can be drawn; the sample is not sufficiently informative.
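The three outcomes above can be sketched as a scan of the lod score over a grid of θ values. This is a minimal illustration, not Morton's procedure as published: `linkage_decision`, its return format and the toy lod function `phase_known_lod` (n phase-known meioses with k recombinants, an assumption that is simpler than the phase-unknown families of section I-1) are all hypothetical names introduced here.

```python
import math

def phase_known_lod(theta, n, k):
    """Toy lod score for n phase-known meioses, k of them recombined.

    Assumed model for illustration: L(theta) = theta^k (1-theta)^(n-k),
    L(1/2) = (1/2)^n.
    """
    if theta == 0.0 and k > 0:
        return float("-inf")          # a recombinant is impossible at theta = 0
    rec = k * math.log10(theta) if k else 0.0
    return rec + (n - k) * math.log10(1.0 - theta) + n * math.log10(2.0)

def linkage_decision(lod, step=0.05, upper=3.0, lower=-2.0):
    """Scan Z(theta) on the grid 0, step, ... (< 1/2) and apply the thresholds."""
    n_points = round(0.5 / step)
    grid = [i * step for i in range(n_points)]
    scores = [(theta, lod(theta)) for theta in grid]
    theta_hat, z_max = max(scores, key=lambda s: s[1])
    excluded = [theta for theta, z in scores if z <= lower]
    if z_max >= upper:
        verdict = "linkage"           # some theta1 reaches +3
    elif excluded:
        verdict = "exclusion"         # some theta1 falls to -2 or below
    else:
        verdict = "undecided"         # -2 < Z < 3 everywhere on the grid
    return {"verdict": verdict, "theta_hat": theta_hat,
            "z_max": z_max, "excluded": excluded}
```

For example, `linkage_decision(lambda t: phase_known_lod(t, 10, 0))` concludes linkage, since ten non-recombined meioses give Z(0) = 10 log10(2), just above 3, whereas four such meioses leave the test undecided.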

(Figure 10)

The proposed test has the advantage of being very simple, and of providing protection against falsely concluding linkage. However, some criticisms can be levelled, not only against the criteria chosen (Chotai (3)), but also against the entire principle of using a sequential procedure (Smith (4)). The number of families typed is, indeed, rarely chosen in the light of the test results.

I-4. ESTIMATION OF THE RECOMBINATION FRACTION

If the test, on a sample of families, has demonstrated linkage between the A and B loci, then one may want to estimate the recombination fraction for these loci. The estimated value of θ is the value which maximizes the lod score function Z. Since L(1/2) does not depend on θ, this is equivalent to taking the value of θ for which the likelihood of the sample is greatest.

I-5. RECOMBINATION FRACTION FOR A DISEASE LOCUS AND A MARKER LOCUS

Let us assume we are dealing with a single-gene disease, determined by an allele g0 located at a locus G (g0: harmful allele, G0: normal allele). We would like to be able to situate locus G relative to a marker locus T, whose position on the genome is known. To do this, we can use families with one or several affected individuals, in which the genotype of each member of the family is known with regard to the marker T.

(Figure 11)

In order to be able to use the lod score method described above, we need to be able to infer, from the phenotype of the individuals (affected, not affected), their genotype at locus G (or their genotype probabilities at locus G). What we need to know is:

1. the frequency q of the allele g0

2. the penetrance vector (f1, f2, f3):
   f1 = P(affected | g0g0)
   f2 = P(affected | g0G0)
   f3 = P(affected | G0G0)

It will often happen that the information available for the marker is likewise not genotypic, but phenotypic in nature. Once again, all possible genotypes must be envisaged.

As a general rule, the information available about a family concerns the phenotype. To calculate the likelihood of θ, we must envisage all the possible genotype configurations at each of the loci for this family, write the likelihood of θ for each configuration, and weight it by the probability of this configuration given the phenotypes of the individuals at the two loci. Knowledge of the genetic parameters at each of the loci (gene frequency, penetrance values) is therefore necessary before we can estimate θ (Clerget-Darpoux et al (5)).

Calculating the lod scores, despite being simple in theory, is in fact a lengthy and tedious business. In 1955, Morton provided a set of tables giving the lod scores for various values of θ for a disease locus and a marker locus, for nuclear families with sibship sizes of 2 to 7. However, the situations envisaged were very restrictive. In particular, it was assumed that the disease was determined by a rare, fully penetrant, dominant or recessive gene.

"LIPED", written by Ott in 1974 (6), was the pioneering software in linkage analysis. It is able to carry out this calculation in an extensive pedigree for any values of q, f1, f2, f3, and for penetrance as a function of age. The "Linkage" program of Lathrop et al, 1984 (7,8) is the one most often used for gene mapping. It can be used to carry out multipoint analysis.
All the software we have described is based on the same recursive algorithm (Elston and Stewart), which means that it can be used to investigate pedigrees of any size, but that it envisages all the possible haplotype combinations of markers, and is therefore limited by the number of markers to be taken into account. In contrast, "Genehunter" (9), which is based on a Markov chain principle, is limited not by the number of markers taken into consideration in the analysis, but by the size of the family structure. The very recently developed software package "Allegro" (10) can exploit information from a large number of markers and extended family structures.

Analysis of genetic linkage has made it possible to construct a genetic map by locating new polymorphisms relative to one another on the genome. The measurement used on the genetic map is not the recombination fraction, which is not an additive quantity, but the genetic distance, which we will define below.

I-6. LINKAGE ANALYSIS FOR THREE LOCI: THE PHENOMENON OF INTERFERENCE (see Bailey, 1961)

Now let us consider three loci A, B and C. Let the recombination fraction between A and B be θ1, that between B and C be θ2 and that between A and C be θ3.

(Figure 12)

Let us consider the double recombination event: one recombination between A and B, and another between B and C. Let R12 be the probability of this event. If the crossings-over occur independently in segments AB and BC, then:

$$R_{12} = \theta_1 \theta_2$$

If this is not the case, an interference phenomenon is occurring and

$$R_{12} = C\,\theta_1 \theta_2 \quad \text{where } C \neq 1$$

If C < 1, the interference is said to be positive: crossings-over in segment AB inhibit those in segment BC.
If C > 1, the interference is said to be negative: crossings-over in segment AB promote those in segment BC.

Let us consider the case of a triply heterozygous individual. Such an individual can provide 8 types of gametes.

(Figure 13) (Figure 14)

We can write that

(Figure 15)

$$\theta_3 = \theta_1 + \theta_2 - 2R_{12} = \theta_1 + \theta_2 - 2C\,\theta_1\theta_2$$

If C = 1,

$$\theta_3 = \theta_1 + \theta_2 - 2\theta_1\theta_2$$

The recombination fraction is a non-additive measurement. However, we can write

$$(1-2\theta_3) = (1-2\theta_1)(1-2\theta_2)$$

If x(θ) = k ln(1-2θ), then we have x(θ3) = x(θ1) + x(θ2), and for k = -1/2, x(θ) ≈ θ for small values of θ.

$$x(\theta) = -\frac{1}{2}\ln(1-2\theta)$$

is an additive measurement. It is known as the genetic distance, and is measured in Morgans. It can be shown that x measures the mean number of crossings-over.

Test for the presence of interference

Let us consider a sample of families typed at the loci A, B and C. Let Lc be the greatest likelihood for θ1, θ2, θ3, and L1 the greatest likelihood when we impose the constraint C = 1 (i.e. θ3 = θ1 + θ2 - 2θ1θ2). Then -2 ln(L1/Lc) follows a χ² distribution with one degree of freedom.

I-7. REFERENCES

1. Morton NE. Sequential tests for detection of linkage. Am J Hum Genet 1955; 7.
2. Wald A. Sequential analysis. New York: Wiley, 1947.
3. Chotai J. On the lod score method in linkage analysis. Ann Hum Genet 1984; 48.
4. Smith CAB. Some comments on the statistical methods used in linkage investigations. Am J Hum Genet 1959; 11.
5. Clerget-Darpoux F, Bonaïti-Pellié C, Hochez J. Effects of misspecifying genetic parameters in lod score analysis. Biometrics 1986; 42.
6. Ott J. Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am J Hum Genet 1974; 26.
7. Lathrop GM, Lalouel JM. Easy calculations of lod scores and genetic risks on small computers. Am J Hum Genet 1984; 36 (2).
8. Lathrop GM, Lalouel JM, Julier C, Ott J. Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am J Hum Genet 1985; 37.
9. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 1996; 58.
10. Gudbjartsson DF, Jonasson K, Frigge M, Kong A. Allegro, a new computer program for multipoint linkage analysis. Nature Genet 2000; 25.
11. Bailey N. Introduction to the mathematical theory of genetic linkage.
London: Oxford University Press, Amen House, 1961.
12. Ott J. Analysis of human genetic linkage. Johns Hopkins University Press.
13. Morton NE. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 1956; 8.
14. Smith CAB. Testing for heterogeneity of recombination fractions in human genetics. Ann Hum Genet 1963; 27.

II- GENETIC HETEROGENEITY OF LOCALIZATION

The analysis of genetic linkage can be complicated by the fact that mutations of several genes, located at different places on the genome, can give rise to the same disorder. This is known as genetic heterogeneity of localization. One of two tests is used to identify heterogeneity of this type: the "Predivided sample test" or the "Admixture Test". The first test is usually only appropriate if there is a good criterion for stratifying the families, or if each family is individually highly informative.

II-1. THE PREDIVIDED SAMPLE TEST

This test is intended to demonstrate linkage heterogeneity between different sub-groups of a sample of families. The aim is to test whether the genetic linkage between a disease and its marker(s) is the same in all sub-groups. These groups are formed ad hoc on the basis of clinical criteria, geographical criteria, etc. Let us assume that the total sample of families has been divided into n sub-groups (it is possible to test for the existence of as

many sub-groups as families). θi denotes the true value of the recombination fraction in sub-group i. We want to test the null hypothesis H0: θ1 = θ2 = θ3 = ... = θn against the alternative hypothesis H1: the values of θi are not all equal. Under H0, the quantity Q

(Figure 16)

follows a χ² distribution with (n-1) degrees of freedom. The homogeneity of the sample for linkage is rejected, with a type I error equal to α, if Q is above the critical threshold of the χ² distribution with (n-1) degrees of freedom corresponding to α.

II-2. THE ADMIXTURE TEST

Unlike the previous test, the "admixture test" is not based on an ad hoc subdivision of the families. It is assumed that, among all the families studied, genetic linkage between the disease and the marker is found only in a proportion α of the families, with a recombination fraction θ < 1/2. In the remaining proportion (1-α) of families, it is assumed that there is no linkage with the marker (θ = 1/2). For each family i of the sample, the likelihood is calculated as

$$L_i(\alpha,\theta) = \alpha\,L_i(\theta) + (1-\alpha)\,L_i(1/2)$$

where Li(θ) is the likelihood of θ for family i. The likelihood of the pair (α, θ) is defined as the product of the likelihoods associated with all the families:

$$L(\alpha,\theta) = \prod_i L_i(\alpha,\theta)$$

We test whether α is significantly different from 1 by comparing Lmax(α = 1, θ), the likelihood maximized over θ assuming homogeneity, with Lmax(α, θ), the likelihood maximized over the two parameters α and θ (nested models). The variable

$$Q = 2\left[\ln L_{max}(\alpha,\theta) - \ln L_{max}(\alpha = 1,\theta)\right]$$

then follows a χ² distribution with one degree of freedom.

II-3. GENERALIZATION OF THE ADMIXTURE TEST

In some single-gene diseases, several genes have been shown to exist at different locations. This is true, for example, of multiple exostosis disease, for which 3 genes have been identified successively on 3 different chromosomes.
The "admixture test" is then extended to determine the proportion of families in which each of the three genes is implicated (Legeai-Mallet et al, 1997), and the possibility that there is a fourth gene. The three locations on chromosomes 8, 19 and 11 were denoted E1, E2 and E3, and the proportions of families concerned α1, α2 and α3 respectively. α4 was used to represent the proportion of the families in which another location was involved.

For each family i of the sample, the likelihood was calculated using the observed segregation within the family of the markers available in each of the three regions, according to the clinical status of each of its members Fi:

$$L_i(E_1,E_2,E_3,\alpha_1,\alpha_2,\alpha_3 \mid F_i) = \alpha_1\frac{L(E_1 \mid F_i)}{L(E_1 = 1/2 \mid F_i)} + \alpha_2\frac{L(E_2 \mid F_i)}{L(E_2 = 1/2 \mid F_i)} + \alpha_3\frac{L(E_3 \mid F_i)}{L(E_3 = 1/2 \mid F_i)} + \alpha_4$$

For all the families:

$$L(E_1,E_2,E_3,\alpha_1,\alpha_2,\alpha_3 \mid F_t) = \prod_i L_i(E_1,E_2,E_3,\alpha_1,\alpha_2,\alpha_3 \mid F_i)$$
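The mixture likelihood above, and the simpler admixture test of section II-2 of which it is the generalization, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' program: each family is summarized by precomputed likelihood ratios L(Ek | Fi) / L(Ek = 1/2 | Fi) at fixed locations, the last proportion is taken to be 1 minus the sum of the others, and `homogeneity_statistic` maximizes over α only on a grid, whereas the test described in the text also maximizes over θ. All function names are hypothetical.

```python
import math

def _safe_log(x):
    """Natural log, returning -inf for a zero likelihood."""
    return math.log(x) if x > 0.0 else float("-inf")

def mixture_log_likelihood(alphas, ratios):
    """Log of the product over families of sum_k alpha_k * ratio_ik + rest.

    alphas: proportions for each candidate location (alpha_1, ..., alpha_K).
    ratios: ratios[i][k] = likelihood ratio of family i at location k.
    rest = 1 - sum(alphas) is the proportion of unlinked families
    (alpha_4 in the three-location example), an assumption of this sketch.
    """
    rest = 1.0 - sum(alphas)
    if any(a < 0.0 for a in alphas) or rest < -1e-12:
        raise ValueError("proportions must be non-negative and sum to at most 1")
    rest = max(rest, 0.0)
    return sum(_safe_log(sum(a * r for a, r in zip(alphas, fam)) + rest)
               for fam in ratios)

def homogeneity_statistic(single_locus_ratios, steps=100):
    """Q = 2[max_alpha lnL(alpha) - lnL(alpha = 1)] for one location,
    with alpha scanned on a grid and theta held fixed."""
    fams = [[r] for r in single_locus_ratios]
    best = max(mixture_log_likelihood([i / steps], fams)
               for i in range(steps + 1))
    return 2.0 * (best - mixture_log_likelihood([1.0], fams))
```

With a single location, `mixture_log_likelihood([alpha], ...)` is the admixture likelihood of section II-2 divided, family by family, by Li(1/2), which does not change the value of Q.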

Each αi can be tested to see if it is equal to 0, and then the corresponding non-null αi and Ei values are estimated. It is also possible to calculate, for each of the families in the sample, the probability that the gene implicated is at E1, E2 or E3. This posterior probability makes use of the estimated proportions αi, but also of the specific observations in the family.

The sample investigated has been shown to consist of three types of families: in 48% of families the gene is located on chromosome 8, in 24% on chromosome 19, and in 28% on chromosome 11. There was no evidence of a fourth location in this sample. The posterior probabilities of belonging to one of these 3 sub-groups were then estimated: the probability exceeded 90% that the gene implicated was on chromosome 8 for 5 families, on chromosome 19 for 3 families, and on chromosome 11 for 4 families. For the other families, the situation was less clear-cut: the posterior probabilities are similar to the prior probabilities because of the paucity of information provided by the markers used.

II-4. REFERENCES

1. Legeai-Mallet L, Margaritte-Jeannin P, Clerget-Darpoux F et al. Genetic heterogeneity of hereditary multiple exostoses. Hum Genet 1997; 99.
2. Morton N. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 1956; 8.
3. Smith CAB. Testing for heterogeneity of recombination values in human genetics. Ann Hum Genet 1963; 27.

III- STATISTICAL PROPERTIES OF THE METHOD OF LOD SCORES

The test procedure used in the method of lod scores is sequential (Wald, 1947). The amount of information, i.e.
the number of families in the sample, is accumulated until it is possible to decide between the H0 and H1 hypotheses:

   H0: genetic independence, θ = 1/2
   H1: linkage at θ = θ1, with 0 ≤ θ1 < 1/2

The value of the lod score of the sample at θ1,

$$Z(\theta_1) = \log_{10}\left[\frac{L(\theta_1)}{L(1/2)}\right]$$

indicates the relative probabilities of observing the sample under H1 and under H0. Thus, a lod score of 3 implies that the sample is 1000 times more probable under H1 than under H0 ("lod" = logarithm of the odds).

The decision thresholds of the test are usually set at -2 and +3, so that if:

   Z(θ1) ≥ 3: H0 is rejected and linkage is concluded.
   Z(θ1) ≤ -2: linkage is rejected for θ1.
   -2 < Z(θ1) < 3: it is impossible to decide between H0 and H1. It is necessary to go on accumulating information.

For the -2 and +3 thresholds selected, it can be shown that:

   The type I error α < 10^-3
   The type II error β < 10^-2
   The reliability 1-ρ > 0.95 for all θ1

   The power P(θ) > 0.80 for all θ1 if the true value of θ is below 0.10

(Figure 17)

The conditions of application which underlie these properties (sequentiality; segregation of a simple single-gene disease in nuclear families in which all the members are genotyped for a genetic marker; non-ambiguity of the test) are not met in practice. The table below shows the change in these conditions of application. We discuss here the impact of these changes on the statistical properties.

III-1. THE TEST PROCEDURE

III-1.1. IMPACT OF NON-SEQUENTIALITY

(Figure 18)

In general, one is working on a sample of families of a fixed size. This problem of non-sequentiality was raised by Smith (1959) and investigated by Chotai (1984) and Guihenneuc (1991), who showed that the type I error of the test was not increased, but on the contrary reduced. Furthermore, the power will obviously depend on the size of the sample. It also depends on the parameters of the genetic model (penetrances, frequency of the morbid allele, degree of dominance), on the types of family analysed (nuclear or extended families), on the informativity of the markers, on what is known about the phase of the alleles at the disease locus and the marker locus, and on the value of the recombination fraction between these two loci.

For a known genetic model of the transmission of the disease and known parameters, the power of the method is greater the more easily recombination events between the disease locus and the marker locus can be detected; in other words, the more easily the genotype at each of the two loci, and also the haplotype (i.e. the combination of two alleles, one from each locus, on the same chromosome segment), can be identified from the phenotype. At the disease locus, the genotype can be deduced unambiguously from the phenotype if there is a rare dominant gene with total penetrance for the heterozygote and zero penetrance for the normal homozygote (no phenocopy).
The power diminishes as the degree of dominance and the penetrance decline, and as the gene frequency and the proportion of phenocopies increase (Ott, 1991). At the marker locus, the power is greater the higher the degree of heterozygosity, or in other words, the more polymorphic the marker. If we consider the two loci together, the amount of knowledge about the haplotype transmitted is greater if there is a large number of generations. Finally, the proximity of the two loci increases the power of detection of the genetic linkage.

Multipoint linkage analysis, which uses several reference markers near to each other on a given chromosome segment, increases the power of the method by increasing the informativity of the meioses. In general, it is used to pinpoint the location of a morbid locus once it has been established that genetic linkage is present.

III-1.2. MAXIMIZATION OF THE LOD SCORE OVER THE [0, 1/2] INTERVAL (E. Génin, Ann Hum Genet, 1995, 59)

However, in practice, the test is never carried out for a single value of θ1, but is done as follows: the lod score is calculated for various values of θ1, the maximum lod score Zmax is determined, and the test is applied to Zmax. A criterion of +3, or even less, is used to conclude that linkage is occurring, based on the argument that the risk remains sufficiently small. The posterior probability of no linkage is never calculated.

Considering a composite alternative hypothesis by using the maximum lod score Zmax (which amounts to testing H0: θ = 1/2 versus H1: θ < 1/2) actually reduces the reliability of the test considerably. Thus, the probability ρ that there is no linkage when a Zmax of +3 has been obtained can be as high as 16.4%, i.e. more than three times the probability calculated by Morton (1955). The table below shows the probability that linkage does not exist as a function of the Zmax obtained.

(Figure 19)

The relationship between ρ and Zmax depends on the type of family structure and on the mode of inheritance of the disease (in this case the calculation has been carried out for a dominant disease in a sample of nuclear families with two children). Reliability = 1-ρ.

The example of the conflicting results obtained for Alzheimer's disease is a good illustration of the usefulness of calculating the posterior probability of linkage. Alzheimer's disease is a form of dementia characterized by loss of memory and of cognitive function. Only a few families have multiple cases, but within this sub-group of families, the distribution of the patients is compatible with the hypothesis of a dominant mutation of an autosomal gene. Analyses of genetic linkage by the method of lod scores were therefore carried out to localize the gene involved.
In 1987, a maximum lod score of [...] was obtained using a marker of chromosome 21 in a large pedigree with numerous affected members (family FAD4), and this at first led people to conclude that the mutation responsible was located on chromosome 21 (St George-Hyslop et al. 1987). For many years, research into this disease was therefore focused on this chromosome. Five years later, however, several different teams provided a very significant demonstration of linkage with chromosome 14 markers. The very high lod scores that were obtained showed that most of the early-onset familial forms were due to a mutation of a gene on chromosome 14 (Schellenberg et al. 1992, St George-Hyslop et al. 1992). In particular, in the case of family FAD4, a lod score of [...] was obtained with markers for this region.

In view of the observations obtained for chromosome 21 markers in FAD4, the posterior probability that there was no linkage was 1/3. It is likely that if this calculation had been done in 1987, the existence of a mutation on chromosome 21 in this family would have looked less convincing. Furthermore, it has now been shown that the gene implicated is located on chromosome 14.

III-1.3. REFERENCES

1. Génin E, Martinez M, Clerget-Darpoux F. Posterior probability of linkage and maximal lod score. Ann Hum Genet 1995; 59.
2. Schellenberg GD, Bird T, Wijsman E et al. Genetic linkage evidence for a familial Alzheimer's disease locus on chromosome 14. Science 1992; 258.
3. St George-Hyslop PH, Haines J, Rogaev E et al. Genetic evidence for a novel familial Alzheimer's disease locus on chromosome 14. Nature Genet 1992; 2.
4. St George-Hyslop PH, Tanzi RE, Polinsky RJ et al. The genetic defect causing Alzheimer's disease maps on chromosome 21. Science 1987; 235.

III-2. GENOTYPE INFORMATION

III-2.1. AMBIGUITY IN PHENOTYPE-GENOTYPE RELATIONSHIPS AT THE DISEASE LOCUS

The original lod score method was applied to the study of nuclear families (the parents and their children), and this made it easy to deduce the genotype at each of the loci for each member of the family. Since it is the phenotypes that are observed, this means that the phenotype/genotype correspondence was known. In particular, when the analysis was carried out between a "disease" locus and a "marker" locus, the disease was assumed to involve a single gene, to be due to a rare allele of an autosomal or sex-linked gene, and to have complete penetrance (probability of being affected equal to 1 for people carrying one copy of the allele for dominant diseases, or two copies for recessive diseases). Gamete equilibrium was also assumed to exist between the alleles at the "disease" locus and the "marker" locus.

The method, the properties of which were fully established on the basis of these hypotheses, has been extended over the past twenty years to more varied and complex situations, but without questioning its underlying properties. In particular, it is applied to diseases whose mode of inheritance is poorly or even totally unknown, and which are studied in large pedigrees in which some of the members have an unknown phenotype. This leads us to investigate the power of the test under various models and its robustness to modeling errors.

It should be stressed that the "lod score", which is thought of above all as a function of the recombination fraction and used to estimate this variable, also depends on the value of the genetic parameters at the disease locus, i.e. the frequency of the alleles at this locus and the penetrances (probabilities of being affected) associated with each of the genotypes.
We evaluated the effects of an error in these parameters on the linkage test and on the estimation of the recombination fraction (Clerget-Darpoux et al. 1986, 1992, 1993).

Loss of power: The power to detect linkage can be very severely reduced if there is an error in the relative penetrance of the genotypes, i.e. in the ratio of the probabilities of being affected among those who carry two copies of the morbid allele, those who carry a single copy, and those who do not carry it at all (the "phenocopies").

False exclusion of linkage: The robustness of the method to misspecification of the parameter values is not symmetrical with regard to the two hypotheses being tested. We have shown that the lod score is always greatest for the correct values of the parameters and that it can be considerably reduced if these have been wrongly specified. As a consequence, an error in the parameter values does not lead to a false conclusion of linkage, although it can wrongly lead to the exclusion of linkage. This is particularly the case if the proportion of phenocopies is underestimated.

Bias in the recombination fraction: The estimate of the recombination fraction is very sensitive to an error in the value of any of the parameters. In addition, the effects of errors in the gene frequency and in the penetrance values are usually additive, because in most studies these parameters are linked by the constraint imposed by the prevalence of the disease in the population.

III-2.2. AMBIGUITY IN THE MARKER GENOTYPE

To calculate a lod score between a disease locus and a marker locus, it is necessary to consider all the possible genotypic configurations at each of the loci and to write the probabilities of these configurations. If some individuals have not been genotyped for the genetic marker, the probability of each possible genotype must be calculated.
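As a minimal sketch of that calculation for an untyped individual with no typed relatives, the prior probability of each marker genotype can be derived from the marker allele frequencies. The sketch assumes Hardy-Weinberg proportions, which the text does not state explicitly, and uses hypothetical allele names and frequencies; `genotype_priors` is an illustrative helper, not part of any linkage package.

```python
from itertools import combinations_with_replacement

def genotype_priors(allele_freqs):
    """Prior probability of each unordered marker genotype.

    Assumes Hardy-Weinberg proportions: p^2 for a homozygote,
    2pq for a heterozygote.
    """
    alleles = sorted(allele_freqs)
    priors = {}
    for a, b in combinations_with_replacement(alleles, 2):
        p, q = allele_freqs[a], allele_freqs[b]
        priors[(a, b)] = p * p if a == b else 2 * p * q
    return priors

# Hypothetical marker with three alleles; frequencies must sum to 1.
assumed = {"M1": 0.1, "M2": 0.3, "M3": 0.6}
priors = genotype_priors(assumed)

for genotype, prob in priors.items():
    print(genotype, round(prob, 4))
```

In a pedigree likelihood these priors weight every genotype that an untyped founder could carry, which is why the assumed allele frequencies enter the lod score directly.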
This calculation requires specifying the allele frequencies of the marker. Any error in these allele frequencies, in particular under-estimation of the frequency of an allele present in the patients, artificially increases the lod score values and can therefore lead to a false conclusion of genetic linkage (false positives) (Ott, 1991; Freimer et al, 1993; Knapp et al, 1993). The increasingly frequent use of very extensive genealogies, in which only individuals of the last generation are typed, calls for great caution in interpreting positive results.

III-2.3. GAMETE DISEQUILIBRIUM BETWEEN ALLELES AT THE DISEASE LOCUS AND AT THE MARKER LOCUS

An association between a susceptibility gene and a marker can lead to bias in the estimation of the recombination fraction. In particular, the "lod score" method requires that there be no selection on the marker in the sample. However, in the presence of an association, selection based on patient status implicitly involves selection on the marker. Furthermore, the calculation assumes that each genetic combination is equally probable in the parents, which is not true if there is an association. Failing to take into account in the analysis the disequilibrium between disease alleles and marker alleles induces a very great under-estimation of the "lod score" (in other words, a marked reduction in the power of the linkage test) and a very slight under-estimation of the recombination fraction (Clerget-Darpoux, 1982).

III-3. THE PROBLEM OF MULTIPLE TESTS

One of the difficulties in the statistical interpretation of genetic linkage analyses of complex diseases arises from the fact that, in general and more or less explicitly, the data are subjected to multiple tests: several clinical classifications, several genetic markers, several models, several samples. Clearly, the stopping criteria usually used in the lod score test no longer have the same statistical significance when several tests are applied simultaneously to the same sample or to several samples. E. Thompson (1984) investigated this problem for a monogenic disease for which genetic linkage is tested using several markers located on different chromosomes (and therefore independent). The situation is much more complex for multifactorial diseases, because the multiplicity of tests has several types of impact and these are not independent (Clerget-Darpoux et al, 1990).

Multiple tests could be taken into account by readjusting the stopping criterion of the lod score test. However, on the one hand, it is not always clear from publications which tests have actually been carried out, and on the other hand, this can make the test too conservative. This is why we think that the replication strategy should be favored. If a positive result is replicated in a new sample (using the same classification, the same marker, and the same transmission model), this provides a reliable threshold of significance.

III-4. REFERENCES

1. Chotai J. On the lod score method in linkage analysis. Ann Hum Genet 1984; 48:
2. Clerget-Darpoux F. Bias of the estimated recombination fraction and lod score due to an association between a disease gene and a marker gene. Ann Hum Genet 1982; 46:
3. Clerget-Darpoux F, Bonaïti-Pellié C, Hochez J. Effects of misspecifying genetic parameters in lod score analysis. Biometrics 1986; 42:
4. Clerget-Darpoux F, Babron MC, Bonaïti-Pellié C. Assessing the effect of multiple linkage tests in complex diseases. Genet Epidemiol 1990; 7:
5. Clerget-Darpoux F, Bonaïti-Pellié C. Strategies based on marker information for the study of human diseases. Ann Hum Genet 1992; 56:
6. Clerget-Darpoux F, Bonaïti-Pellié C. An exclusion map covering the whole genome: a new challenge for genetic epidemiologists? Am J Hum Genet 1993; 52:
7. Freimer NB, Sandkuijl LA, Blower SM. Incorrect specification of marker allele frequencies: effect on linkage analysis. Am J Hum Genet 1993; 56:
8. Guihenneuc C, Prum B, Clerget-Darpoux F, Bonaïti-Pellié C. Remarques sur la méthode du lod score en génétique. Pub Inst Stat Univ Paris 1990; 35:
9. Knapp M, Seuchter SA, Bauer MP. The effect of misspecifying allele frequencies in incompletely typed families. Genet Epidemiol 1993; 10:
10. Morton NE. Sequential tests for the detection of linkage. Am J Hum Genet 1955; 7:
11. Ott J. Analysis of human genetic linkage, 2nd edition. Johns Hopkins University Press,
12. Smith CAB. Some comments on the statistical methods used in linkage investigations. Am J Hum Genet 1959; 11:
13. Wald A. Sequential analysis. New York: Wiley,

Contributors: Françoise Clerget-Darpoux