Identifying Genes Underlying QTLs

Reading:

Frary, A. et al. 2000. fw2.2: A quantitative trait locus key to the evolution of tomato fruit size. Science 289:85-87.

Paran, I. and D. Zamir. 2003. Quantitative traits in plants: Beyond the QTL. Trends in Genet. 19:303-306.

Yu, J. et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genet. 38:203-208.

In previous lectures on marker-assisted selection, we saw that selection on linked markers can be useful under certain circumstances. Two limitations of marker-assisted selection are the loss of a target allele due to recombination between the marker and the target gene during the selection process, and the sometimes poor predictive ability of markers across populations (or crosses). One way to circumvent these two specific problems is to develop markers that are part of the gene sequence that causes the phenotypic difference underlying the QTL. The most difficult part of this approach is identifying the specific gene or sequence that causes the genetic difference. Several ways to tackle this problem are outlined below, but note that they are very intensive processes, requiring large investments of time and money. Therefore, these approaches should be targeted at traits that are of extremely high value for selection but are hard to select for, and for which there is a reasonable chance of success. Thus, at this point, these methods are most suitable for QTLs with relatively large effects on the phenotype.

Map-based cloning

Map-based cloning is, in a sense, simply very high-resolution QTL mapping. If you can define the interval of a QTL to within a very small physical distance, you can sequence this region, identify the gene(s) in the region, and use various methods to distinguish which genes correspond to the QTL. This is not at all trivial. Recall that typically in QTL mapping we can define an interval containing a QTL to somewhere on the order of 20 cM.
How many genes exist in a 20 cM interval? As a rough estimate, maize is predicted to have about 60,000 genes and maize genetic maps tend to be around 2,000 cM in total. Therefore, a 20 cM interval represents about 1% of the genetic map length. If genes are distributed evenly along the genetic map length, then on average, we expect to find 1% of 60,000 genes, or 600 genes, in such an interval. Thus, even if you already have the sequence available for such a region, the problem of distinguishing which gene or genes among these 600 cause the phenotypic effect of the QTL remains.

If you do not have sequence for this region already available, then you may need to first sequence the region. Again, using maize as an example, the total nuclear genome size is 2,500 Mbp of DNA. Therefore, if 20 cM represents 1% of the physical distance of the maize genome (which it should on average), then such a segment is expected to contain 25 Mbp, which is a massive region to sequence (and further, you may be unlucky and your
QTL may reside in a recombination cold spot, in which case the region could be much larger). Therefore, to reduce the size of the region that needs to be sequenced and the number of candidate genes that need to be sifted through, one needs to reduce the likely interval containing the QTL to as small a size as possible.

As an example of how this can be done, consider the study of Frary et al. (2000), who cloned a QTL with major effect on tomato fruit size. This work was based on the fine-mapping studies of Alpert and Tanksley (1996), who began by making introgression lines by backcrossing gene regions from a wild tomato into domesticated tomato. One of the introgression lines developed was a nearly isogenic line (NIL) of the recurrent domesticated parent: it carried almost all of the recurrent parent genome except for an introgression block around the previously mapped tomato fruit weight QTL on chromosome 2. They crossed this NIL to the recurrent parent, produced a huge F2 population of 3472 plants, and screened each of these plants with two RFLP markers flanking the QTL interval. They identified 55 F2 plants that had recombinations between these two markers, and grew progeny of those plants in the field for phenotyping and high-resolution QTL mapping. Additional markers were mapped within this small region so that, based on phenotypic data from the recombinant families, the QTL could be localized precisely to a region of 0.13 cM.

At the same time, these markers were used to screen a library of yeast artificial chromosomes (YACs), which can hold up to 1 Mbp of DNA, to identify those YACs that contain the sequences corresponding to the markers. Once such YACs were identified, they were cut with restriction enzymes, and specific pieces of the YACs were mapped in this region. In this way, they determined that the 0.13 cM interval containing the QTL spanned a physical size of about 150 kb. Next, Frary et al.
(2000) screened a cDNA library with this YAC to identify expressed gene sequences derived from the YAC and found four unique gene transcripts. They then directly tested each of these four genes for effects on fruit weight by transforming a large-fruited line with the small-fruited allele of each of these four genes separately. In this way, they directly confirmed that the fw2.2 QTL was caused by the ORFX gene, for which they had cDNA clones.

Association Mapping

All of the QTL mapping methods described in previous lectures have relied on a population developed from a cross of two parental lines. This requires the development of mapping populations, which, as in the case of recombinant inbred lines, may take multiple generations before the study can even be initiated. An alternative approach, termed association analysis, exploits the genetic variation already present in breeding populations or germplasm collections. As in typical QTL mapping, one tests for an association between variation at a known gene or marker and a phenotype in the germplasm collection tested. Such associations occur if the gene being tested actually causes the phenotypic differences or if there is linkage (gametic phase)
disequilibrium between the gene being tested and the gene(s) causing the phenotypic differences. We might be content to find markers linked to causal QTLs, but the problem is that linkage disequilibrium does not actually imply linkage (which is why I will refer to it as gametic phase disequilibrium).

Gametic phase disequilibrium is the nonrandom association of alleles at different loci. It is measured as the difference between observed and expected allele pair frequencies. For example, gametic disequilibrium between alleles a and b at loci A and B, respectively, is measured as:

D_{ab} = p_{ab} - p_a p_b,

where p_{ab} is the frequency of the ab gamete or haplotype, and p_a and p_b are the allele frequencies of a and b, respectively. Gametic phase disequilibrium is reduced by random mating, but can be increased by population subdivision, recent population hybridization, and mutation. It can be maintained by physical linkage and by selection on epistatically interacting loci. So, genes that are tightly linked are more likely to be in gametic disequilibrium, but even tightly linked genes may be in gametic equilibrium if there has been a sufficient history of recombination between them. Conversely, genes on different chromosomes can be in gametic disequilibrium due to selection or population structure. Do you expect gametic disequilibrium to be more extensive in maize or in wheat?

QTL mapping proceeds by artificially creating populations in which the level of linkage disequilibrium is solely a function of recombination frequency. In QTL mapping, therefore, a gene should be associated with a phenotype only if the gene is linked to an underlying QTL. In germplasm collections, breeding populations, or natural populations, however, gametic disequilibrium can occur due to linkage or population structure. Therefore, performing an association analysis in such populations requires separating the effects of population structure from linkage.
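The definition of gametic phase disequilibrium given above (observed frequency of the ab haplotype minus the product of the two allele frequencies) can be made concrete with a small calculation. The following is only a minimal sketch: the function name and the haplotype samples are hypothetical and chosen for illustration, and the haplotypes are assumed to be already phased.

```python
def gametic_disequilibrium(haplotypes, allele_a, allele_b):
    """Return D_ab = p_ab - p_a * p_b.

    haplotypes: list of (allele at locus A, allele at locus B) pairs,
    one pair per sampled gamete or inbred line (assumed phased).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    # Observed frequency of the ab haplotype.
    p_ab = sum(1 for a, b in haplotypes if a == allele_a and b == allele_b) / n
    # Marginal allele frequencies at each locus.
    p_a = sum(1 for a, _ in haplotypes if a == allele_a) / n
    p_b = sum(1 for _, b in haplotypes if b == allele_b) / n
    return p_ab - p_a * p_b


# Hypothetical sample in which a and b occur together more often than expected.
associated = [("a", "b")] * 4 + [("a", "B")] + [("A", "b")] + [("A", "B")] * 4
d_associated = gametic_disequilibrium(associated, "a", "b")

# Hypothetical sample in which all four haplotypes are equally frequent.
balanced = [("a", "b"), ("a", "B"), ("A", "b"), ("A", "B")]
d_balanced = gametic_disequilibrium(balanced, "a", "b")

print(round(d_associated, 3), d_balanced)
```

In the first sample, p_ab = 0.4 while p_a p_b = 0.25, so D_ab is about 0.15; in the second, the observed and expected haplotype frequencies agree and D_ab is zero. Note that the calculation says nothing about why the alleles are associated: linkage and population structure produce the same statistic.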
This separation can be done by first genotyping the population under study with a set of random markers, such as SSRs, to characterize the relationships among lines in the population. For example, when this was done in a sample of 260 maize lines collected from around the world, the lines were found to group primarily into three subpopulations: Stiff Stalk, Non-Stiff Stalk, and Tropical/Subtropical, which correspond to the major heterotic groups recognized by corn breeders (Liu et al., 2003, Genetics 165:2117). By assigning each line a probability that it belongs to each of the three groups, most of the effects of population structure can be accounted for. Then, the effects of markers or candidate genes can be tested while using the subpopulation membership probabilities as cofactors in the analysis model:

Y = Xβ + Sα + Qv + e,

where:
Xβ accounts for environment, block effects, etc., S are indicators of which candidate gene allele each line carries, α are candidate gene or marker effects, Q are the columns assigning the probability that each line belongs to each subpopulation, v are the effects of subpopulations, and e are residual error effects.

More recently, Yu et al. (2006) have extended this model to account for pairwise genetic relationships among all of the individuals in the study. This allows finer-scale correction for genetic relationships among the lines in the study, because even within a subpopulation, there will be differences in how closely or distantly related the individuals are. The random genetic marker information can be used both to assign lines to subpopulations (the Q matrix) and to estimate pairwise relationships between individuals (the K matrix):

Y = Xβ + Sα + Qv + Zu + e,

where the model is the same as above, with the addition of Z, which indicates the genotype, and u, which is the genetic background effect. The variance-covariance matrix of the genetic background effects is equal to K V_g, where V_g is the genetic background variance.

With these models, the effect of the gene being tested is effectively separated from population structure effects. If a significant effect is observed, one then must determine whether the gene being tested is the actual gene causing the phenotypic effect, or whether it is linked to some other gene(s) causing the phenotypic difference. To separate the effects of the gene itself from linked genes, one must carefully study the extent of gametic phase disequilibrium in the population being studied and in the genome region tested. These last two points are key: gametic disequilibrium depends strongly on the sample of genotypes tested and on the genome region. For example, in a diverse sample of maize inbreds, disequilibrium tends to decrease rapidly along the chromosomes (Remington et al.
2001 PNAS 98:11479), such that a specific gene causing effects on flowering time could be identified with association analysis (Thornsberry et al., 2001 Nat. Genet. 28:286). In contrast, in a highly selected, elite group of inbred lines from a private company, substantial linkage disequilibrium was found to extend up to 100 kb (Rafalski, 2002 Curr. Op. Plant Biol. 5:94), presumably due to selection. Furthermore, even in the diverse maize line set, disequilibrium was more extensive in some genome regions (Remington et al., 2001).

With this in mind, the association mapping strategy needs to take account of how extensive gametic phase disequilibrium is in the target population. If disequilibrium is extensive, the resolution of the analysis will be reduced, but, on the other hand, one does not need to test the causal gene itself; instead, one may identify QTL using marker loci. In contrast, if disequilibrium is limited, one may have the resolution needed to identify
the causal gene affecting a trait, but it also means that random marker loci are not likely to be associated with the trait. Therefore, if one does not have a good set of candidate genes that are likely to be involved in the trait, association analysis in a population with limited disequilibrium is probably not a good idea.
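The structure-corrected association model described earlier, Y = Xβ + Sα + Qv + e, can be illustrated with a toy calculation showing why the Q columns matter. This is only a sketch under loud assumptions: the nine lines, their membership probabilities, and the phenotypes are made up and noise-free; the helper functions are hypothetical; Xβ is reduced to an intercept; and everything is fitted as fixed effects by ordinary least squares. It does not implement the K-matrix mixed model of Yu et al. (2006), for which a dedicated mixed-model package would be needed.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical membership probabilities for subpopulations 1 and 2; the
# third probability is 1 minus these two, so its column is dropped to
# avoid collinearity with the intercept.
Q = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.0),
     (0.0, 1.0), (0.1, 0.9), (0.0, 0.8),
     (0.0, 0.0), (0.1, 0.0), (0.0, 0.2)]
# S = 1 if the line carries the candidate allele, which here is most
# common in subpopulation 1.
S = [1, 1, 1, 0, 0, 1, 0, 0, 0]
# Made-up phenotypes that depend only on subpopulation membership: the
# true candidate allele effect (alpha) is zero.
y = [5.0 + 10.0 * q1 + 5.0 * q2 for q1, q2 in Q]

# Naive model: intercept + candidate allele only.
alpha_naive = least_squares([[1.0, s] for s in S], y)[1]
# Q model: intercept + candidate allele + subpopulation memberships.
alpha_adjusted = least_squares(
    [[1.0, s, q1, q2] for s, (q1, q2) in zip(S, Q)], y)[1]

print(round(alpha_naive, 3), round(alpha_adjusted, 6))
```

In this toy data set the naive model assigns the candidate allele an apparent effect of about 5.4 units purely because the allele is most frequent in the high-valued subpopulation, whereas including the Q columns brings the estimate back to essentially zero. With real data there would also be residual error, so the comparison would be made with a significance test on α rather than by inspecting point estimates.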