Selection Of Genetically Diverse Recombinant Inbreds With An Ordered Gene Evolutionary Algorithm.

Dan Ashlock
Mathematics and Statistics, University of Guelph
Guelph, Ontario, Canada N1G 2W1

Ruth Swanson
Interdepartmental Genetics, Iowa State University, Ames, Iowa

Patrick Schnable
Plant Genetics Department, Iowa State University, Ames, Iowa

Abstract

Recombinant inbreds are created by crossing two genetically distinct inbred lines and then inbreeding the resulting progeny multiple times. In a process called genetic mapping, they are used to estimate associations between genes from the co-inheritance of alleles from the two parental inbred types in the recombinant inbreds derived from the cross. The recombinant inbred lines used in a genetic mapping study are typically well studied, which makes them natural choices for microarray, proteomic, and metabolomic studies. Such studies are quite costly and so typically use fewer individuals than most genetic mapping studies. An evolutionary algorithm is described for selecting, from a collection of recombinant inbred lines, a subset with maximum genetic diversity in its mapping characters. The evolutionary algorithm is an ordered-gene algorithm in which the first k genes in the ordered selection are taken to be the subset. Ordered genes are a convenient representation for subset selection. The problem turns out not to be difficult: in a well-mixed mapping population of recombinant inbreds, the marginal increase in diversity obtained by evolutionary optimization is small but significant. To better understand the problem, synthetic data are also examined; they suggest that the problem is easy in general, not only in the specific biological cases used.

I. INTRODUCTION

Heterosis is a phenomenon observed in hybrid crops, including corn. Two lines of corn are inbred until little heterozygosity remains. A type of corn produced in this manner is called an inbred. Inbred lines are then crossed to produce a hybrid. For some pairs of inbreds, the hybrid is superior in terms of growth, yield, or other factors. This superior performance of hybrids, relative to their inbred parents, is what is referred to as heterosis. The molecular mechanisms of heterosis are poorly understood.
As part of a project that will use microarray gene profiling, proteomics, and metabolomics to understand heterosis, this paper gives an evolutionary algorithm for selecting which lines of corn will be subjected to microarray and proteomic assays. The hope is that a model for heterosis will subsequently be created using diverse molecular approaches to study the selected types of corn.

Microarray analysis provides information about the global mRNA profile of an organism. The mRNA associated with a given gene gives an indication of that gene's expression level in the organism. Proteomics is the study of an organism's proteome: the collection of all the proteins present within the organism. Metabolomics is the measurement of the detectable metabolites in an organism. One way to measure such metabolites involves feeding isotopically labeled glucose, containing carbon-13, to a developing maize plant. The incorporation of the labeled glucose can be traced in the organism to identify the metabolic pathways involved in development for whatever nutrients have been tagged with the carbon-13 label. Together, the data from these four different approaches can be connected and then used to learn about genes and gene function in heterosis.

B73 and Mo17 are inbred lines of corn whose progeny exhibit heterosis. Lines of B73 and Mo17 have been crossed. The resulting hybrid progeny were randomly mated to produce a random set of intermated B73 and Mo17 lines. These lines were subsequently selfed (inbred) for several generations to generate a set of recombinant inbred (RI) intermated B73 and Mo17, or IBM, lines [3]. At each gene locus, the IBM lines will (usually) contain either the B73 or the Mo17 allele. Hundreds of these lines have been generated. The Schnable Lab at Iowa State University is currently engaged in a genetic mapping study using 91 of these IBM lines. The process of genetic mapping is described in [4]. These 91 RI-lines are called the IBM mapping population. They were produced using four generations of inter-mating, first of B73 and Mo17 and then of their crossbreeds, before inbreeding to create the RI-lines. This was intended to ensure better mixing of the alleles in the RI-lines. Each of the RI-lines in the IBM population contains different chromosomal regions derived from B73 and Mo17. Physical traits relating to heterosis (e.g., seed yield, total dry weight, height) can be mapped to different genomic locations, thereby identifying regions of the genome contributing to heterosis.

In the microarray, proteomic, and metabolomic analysis, the parental lines, hybrid progeny, and back-crosses to the parental lines will all be included. For reliable results, biological replicates are required. This can become quite costly, since the cost scales with the number of IBM lines used multiplied by the number of biological replicates. To make the analysis more cost-effective, an evolutionary algorithm is used to select the 30-member subset of the 91 IBM lines that is most genetically diverse. This maximizes diversity among the chosen RI-lines while reducing both the size and the cost of the experiment. A similar computation for RI-lines in mice was performed in [2], using a greedy algorithm to produce the initial point for a hill climber that proceeded by transpositions to find a locally best selection of lines for additional study.

Fig. 1. Recombinant inbred (RI) lines are generated by crossing two diverse, inbred (homozygous) lines to produce heterozygous F1 progeny. Recombinant gametes produced by the F1 are recovered by self-pollinating to produce F2 progeny.
Intermated recombinant inbred (IRI) lines are generated by random mating the F2 population for multiple generations prior to inbreeding. F2s (RI) or intermated F2s (IRI) are self-pollinated for multiple generations to produce inbred, homozygous lines. The chromosomes in the resulting (I)RI lines are mosaic and contain different combinations of the parental alleles.

A. The selection criterion

The mapping study on the IBM population has thus far identified 3149 mappable genetic loci. These are genes that show sufficient diversity between the B73 and Mo17 alleles to be placed on the genetic map. For each line in the 91-member IBM population we thus have 3149 values giving the allele at each of the mapped genetic markers. The possible values for an allele are B73, Mo17, missing, or mixed, with the latter two types being relatively rare. In three of the 91 RI-lines the density of mixed-type markers was high enough that the lines were excluded from the heterosis study. It is possible that these lines were contaminated by outside pollen during the inbreeding process. In any case they are not useful for the heterosis study. This leaves 88 lines for which we have 3149 allele values.

The goal is to pick 30 of the 88 lines to participate in the intensive heterosis study. These lines should have the maximum possible diversity of the two available allele types. The data are formatted as a matrix with 88 rows and 3149 columns with entries A (B73), B (Mo17), or C (missing or mixed). The evolutionary algorithm picks 30 rows so that the sum, over the columns, of the diversity in each column is as high as possible. Here diversity is measured by the information-theoretic entropy of the symbols in the column, modified to bias selection against missing or mixed allele types in a manner described subsequently.

Suppose that we have selected 30 rows of the data matrix. Let $N_i(X)$ be the number of symbols $X$ in the selected rows of column $i$ of the data matrix. We add $N_i(C)$ to whichever of $N_i(A)$ or $N_i(B)$ is larger to obtain $N'_i(A)$ and $N'_i(B)$. Set

\[ P_i(A) = \frac{N'_i(A)}{N'_i(A) + N'_i(B)} \quad \text{and} \quad P_i(B) = \frac{N'_i(B)}{N'_i(A) + N'_i(B)}. \]

Then the diversity score for column $i$ is

\[ E_i = -\left( P_i(A) \log_2 P_i(A) + P_i(B) \log_2 P_i(B) \right) \tag{1} \]

while the diversity score $DS$ for a selection of 30 rows of the matrix (RI-lines) is

\[ DS = \sum_{i=1}^{3149} E_i. \tag{2} \]

This diversity score roughly measures the information content of the selection in bits. Grouping the C-type alleles with the more numerous type in a column lowers the score and so favors exclusion of lines with C alleles by making selections containing them score as less diverse.

An explanation in more detail for biological readers unfamiliar with information theory follows. The basic function

\[ E = -\left( p \log_2 p + (1 - p) \log_2 (1 - p) \right) \tag{3} \]

measures the information content of a coin with probability $p$ of heads in the following sense. A fair coin, flipped many times, will generate a random string without bias or pattern. One bit of information is needed to report a flip of this fair coin. A coin that almost always flips heads will generate a highly compressible string. If we recorded 10,000 flips of the head-biased coin as bits and then used a zip or compress program, the resulting file would compress substantially. Notice that simply recording the indices of the rare tails in a string of 10,000 flips would give a very compressed description of the string of flips of this biased coin. In an information-theoretic sense the flips of the biased coin contain little information. The entropy of the coin, given in Equation 3, measures the number of bits required to store the outcome of the flipped coin, so long as we are storing many flips of the coin.
When this number is smaller than one bit it still requires one bit to report one flip of the coin. If an unfair coin is flipped many times, Equation 3 gives a close estimate of how much a long string of flips can be compressed. The more biased a coin is, the less information a flip of the coin contains. Figure 2 shows the entropy of a coin in bits as a function of its probability of flipping heads. Notice that the maximum information content for a coin is one bit and that this occurs for a fair coin, one with a 50/50 chance of producing a head or a tail.

The diversity measure for the selection of 30 RIs from the 88 available treats each of the genetic loci as a coin. The probability of heads is replaced with the probability, across the 30 selected lines, that the marker has the B73 allele. The probabilities are modified by adding the missing or ambiguous alleles to the count of whichever type, B73 or Mo17, will yield the lower score. The total diversity score is the sum, over all 3149 genetic markers, of the markers' individual entropies. The goal is thus to select the 30 lines that come as close as possible to a 50/50 split of the two allele types across all the markers.

Fig. 2. Entropy of a coin as a function of the coin's bias. (Plot of entropy in bits against P(heads).)

Fig. 3. An example of the permutation crossover used in this study. (Columns: Parents, Crossover Point, Children; the crossover point shown is 4.)

II. EXPERIMENTAL DESIGN

Subsets of the set of available RI-lines are stored as permutations of the index set {1, 2, ..., 88}. The first 30 indices in the permutation are the 30 selected lines in the subset. This permits an evolutionary algorithm that evolves permutations, a common and well-studied type of evolutionary algorithm [1], to be used to perform subset selection. Equation 2 gives the fitness function used to drive the evolutionary algorithm.
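As a minimal sketch of how Equation 2 can act as a fitness function on permutations, the Python below decodes the first k entries of a permutation into a set of rows and scores that set. The function names, the one-string-per-line data layout, and the zero-based indices are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def column_entropy(n_a, n_b, n_c=0):
    """Entropy in bits of one column, with C counts folded into the larger class."""
    # Folding C into the larger count pushes the split away from 50/50,
    # which lowers the score. Ties are symmetric, so either choice gives
    # the same entropy.
    if n_a >= n_b:
        n_a += n_c
    else:
        n_b += n_c
    total = n_a + n_b
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in (n_a, n_b):
        if n > 0:  # treat 0 * log2(0) as 0
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


def diversity_score(rows):
    """Sum of column entropies over equal-length strings of 'A', 'B', 'C'."""
    return sum(
        column_entropy(col.count("A"), col.count("B"), col.count("C"))
        for col in zip(*rows)
    )


def fitness(permutation, data, k=30):
    """Score the subset encoded by the first k entries of a permutation."""
    selected = [data[i] for i in permutation[:k]]
    return diversity_score(selected)
```

Because only the set of the first k entries matters, any reordering within the first k positions, or within the remaining positions, leaves the fitness unchanged.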
The algorithm is steady state and operates on a population of 100 permutations using size-seven single tournament selection. This model of evolution operates by performing mating events one at a time. A mating event starts by selecting seven members of the population at random. The best two in the group of seven are copied over the worst two. These copies are crossed over and then subjected to one to three mutations, with the number of mutations chosen uniformly at random. The crossover operator picks a point within each permutation. For a given permutation, entries at or before that point are preserved, while the entries after that point are reordered to appear in the order in which they appear in the other permutation. An example of this crossover operating on two permutations of eight objects is shown in Figure 3. A single mutation consists of swapping two entries of the permutation selected uniformly at random.
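The operators described above can be sketched in Python as follows. This is an illustrative reading of the description rather than the authors' code: it assumes a single crossover point shared by both parents, two distinct swap positions per mutation, and a caller-supplied `fitness` function; all names are hypothetical.

```python
import random


def crossover(parent_a, parent_b, point):
    """Keep the first `point` entries of each parent and reorder the rest
    to match the order in which they appear in the other parent."""
    def child(keep_from, order_from):
        head = list(keep_from[:point])
        kept = set(head)
        tail = [x for x in order_from if x not in kept]
        return head + tail
    return child(parent_a, parent_b), child(parent_b, parent_a)


def mutate(perm, rng):
    """Swap two distinct positions chosen uniformly at random (in place)."""
    i, j = rng.sample(range(len(perm)), 2)
    perm[i], perm[j] = perm[j], perm[i]


def mating_event(population, fitness, rng, tournament_size=7):
    """One steady-state mating event with single tournament selection."""
    # Pick the tournament group and rank it from best to worst.
    group = rng.sample(range(len(population)), tournament_size)
    group.sort(key=lambda idx: fitness(population[idx]), reverse=True)
    best_a, best_b = group[0], group[1]
    worst_a, worst_b = group[-1], group[-2]
    # Cross over copies of the two best, then apply 1-3 swap mutations each.
    point = rng.randrange(1, len(population[best_a]))
    child_a, child_b = crossover(population[best_a], population[best_b], point)
    for child in (child_a, child_b):
        for _ in range(rng.randint(1, 3)):
            mutate(child, rng)
    # The children overwrite the two worst members of the group.
    population[worst_a], population[worst_b] = child_a, child_b
```

For instance, crossing `[1, 2, 3, 4, 5, 6, 7, 8]` with `[8, 7, 6, 5, 4, 3, 2, 1]` at point 4 keeps `1 2 3 4` in the first child and appends the remaining entries in the second parent's order, giving `[1, 2, 3, 4, 8, 7, 6, 5]`. Both children are always valid permutations, which is the property that makes this operator suitable for an ordered-gene representation.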

A total of 100 evolutionary runs were performed, each initialized with a distinct population of permutations generated uniformly at random. Each run continued for 10,000 mating events, with summary statistics reported every 100 mating events. This relatively small number of mating events was chosen after all of 10 initial runs, run for 100,000 mating events, achieved their best final fitness before 10,000 mating events.

Fig. 4. Mean and best fitness for the first three runs to achieve the maximum fitness. (Three panels, Run 0, Run 3, and Run 5, each showing 95% confidence intervals on population mean fitness together with the best fitness.)

III. RESULTS

The maximum entropy found in any of the runs was found in 41 of the 100 runs. The maximum possible entropy, if there is a selection of RIs that yields a 15:15 split of alleles at every location, is 3149. To see this, first recall that we are using 3149 mapping loci, and then note that the maximum possible contribution of each locus is 1. The maximum found is thus a very good total, representing 98.07% of the possible entropy if there is a perfect set of RIs from the perspective of information content. It turns out that this excellent result is also quite easy to find. Typical runs start with population average fitnesses in excess of 3030, or 96.22% of the possible entropy. Figure 4 shows the trajectory of the fitness over evolution for the first three runs of the 41 that achieved the maximum fitness. In all three of these runs the maximum fitness was achieved before 6,000 mating events were performed.

IV. DISCUSSION AND CONCLUSIONS

The evolutionary algorithm added approximately 2% to the entropically estimated diversity of its selected lines as compared to the average of the randomly selected lines in the initial population of each run. Three examples, showing the evolution of population average and best fitness, appear in Figure 4.
In all 100 runs the final fitness was significantly better than the average initial fitness; the technique thus works but adds only modest value to the downstream biological applications.

The study in [2], using less powerful search techniques than a genetic algorithm (a hybrid of a greedy algorithm and a hill-climber), obtained very similar results for mapping data in mice. The optimization procedure made a small but significant increase in the diversity of computationally selected lines over that obtained for randomly selected lines. That paper inspired the current research, with the hope that a more powerful search technique would yield a better increase in diversity. In retrospect, neither the mouse populations nor the IBM mapping population for corn left much room for improvement over random selections of RI-lines.

In an effort to challenge the techniques for selecting RI-lines, two older corn mapping populations were tested, both of which possessed fewer markers and had not been subjected to random mating prior to inbreeding. For both of these mapping populations the results were essentially the same: small but significant increases in diversity occurred during evolution. This suggests that the crossover process used to re-assort genes in the cell is very efficient at randomizing alleles, yielding very high allele diversity even in randomly selected RIs.

The 41 populations of permutations that hit the maximum fitness all located the same set of recombinant inbreds, albeit in 41 different orders. We conjecture that this set is the true global optimum, but cannot at present prove it. Using brute force this would require checking $\binom{88}{30}$, or about 2.97e23, possible subsets. Given the size of the search space and the representation of subsets as permutations, it is not intuitive to the authors that this problem is as easy as it appears to be.
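The size of the search space quoted above can be checked directly. The short Python sketch below counts the 30-element subsets of 88 lines and, as an added illustration not taken from the paper, the number of permutations that encode each subset under the first-k representation.

```python
import math

# Number of ways to choose 30 of the 88 RI-lines.
subsets = math.comb(88, 30)

# Under the permutation encoding, the first 30 entries can be arranged in
# 30! ways and the remaining 58 in 58! ways without changing the subset.
perms_per_subset = math.factorial(30) * math.factorial(58)

print(f"{subsets:.2e} subsets")
print(f"{perms_per_subset:.2e} permutations per subset")

# Every subset is represented equally often among the 88! permutations.
assert subsets * perms_per_subset == math.factorial(88)
```

The encoding is therefore highly redundant, but uniformly so: no subset is favored by the representation itself.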
At present our hypothesis is that the biological cases of this problem are among the easiest because of the efficiency of cellular processes at distributing alleles of the parents among their progeny.

TABLE I. Summary of results on synthetic data sets. Alpha is the probability of adjacent alleles being different in a given RI-line. Maxreps is the largest number of times a single solution was found in 100 trials. Bestreps is the number of times the best result was found. Improvement is the percent of the theoretical maximum entropy (500 for the synthetic data) improvement found by the algorithm in the first run attaining the highest fitness. Fitness is the best fitness attained. (Columns: α, Maxreps, Bestreps, Improvement, Fitness. Rows: the three synthetic data sets and the IBM mapping population, for which α is n/a.)

In order to test this hypothesis the algorithm was rewritten to operate on synthetic data. The number of markers was reduced from 3149 to 500 to permit more rapid testing (note that the time to evaluate the fitness function is directly proportional to the number of markers). For the sake of simplicity, the synthetic data contained no missing or mixed alleles. The number of RI-lines used was left at 88. To generate the markers for a particular line, the data were filled in by moving down the loci of the line, filling in symbols. The initial allele type, A or B, was selected at random. Thereafter the next allele was different from the current one with probability α. This simulates the way that blocks of adjacent alleles share their type. A set of 100 runs of the evolutionary algorithm was performed for each of three synthetic data sets generated with α set to 0.01, 0.25, and 0.5. The parameter α controls the average size of adjacent blocks of similar alleles. When α is small, the character of adjacent alleles changes less often, yielding larger blocks.
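The generation procedure described above can be sketched in Python as follows. The function names, the string representation, and the optional `seed` argument are assumptions made for illustration; the paper does not specify how its generator was implemented.

```python
import random


def synthetic_line(num_markers, alpha, rng):
    """One RI-line: the allele type flips between adjacent markers
    with probability alpha."""
    allele = rng.choice("AB")  # initial allele type chosen at random
    line = [allele]
    for _ in range(num_markers - 1):
        if rng.random() < alpha:
            allele = "B" if allele == "A" else "A"
        line.append(allele)
    return "".join(line)


def synthetic_data(num_lines=88, num_markers=500, alpha=0.5, seed=None):
    """A data set of independent synthetic RI-lines over 'A' and 'B'."""
    rng = random.Random(seed)
    return [synthetic_line(num_markers, alpha, rng) for _ in range(num_lines)]
```

Under this scheme block lengths are roughly geometric with mean 1/α, so α = 0.01 gives blocks on the order of 100 markers, while at α = 0.5 each allele is independent of its neighbor.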
For each of the three sets of runs, the maximum number of times any solution was found, the number of times the best solution was found, and the percentage improvement of the best solution over the average of its initial random population were computed. These are given, together with the analogous data for the IBM population, in Table I. If we take the tendency to have a best solution that is also the most frequently found solution as evidence of a biological character, then the synthetic data set with α = 0.5, possessed of the smallest blocks of similar alleles, is the most like the biological data set. Small block size should result from efficient mixing of alleles by biological crossover. The synthetic data thus at least weakly support the hypothesis that the problem is easy because of efficient mixing of alleles in the biological system.

The less well mixed allele sets yield higher fitness (more diverse) collections of 30 lines. At values of α nearer to 0.5 there is more scatter, among the columns, of the per-allele diversity of all 88 lines; the fraction of A in a column varies more across the columns in the more mixed data set. This may explain why the more mixed a problem's alleles are, the worse both its initial populations and its final solutions are. Examples of the fitness tracks of runs ending with the best fitness found for each of the three values of α are given in Figure 5.

Fig. 5. Mean and best fitness for the first runs attaining best fitness on the synthetic data sets. Top to bottom, these runs are for α = 0.01, 0.25, 0.5. (Each panel shows 95% confidence intervals on population mean fitness together with the best fitness.)

The number of distinct solutions found was inversely correlated with the allele type mixing parameter α. This means that smaller values of α produce harder problems, or at least problems with a more diverse collection of optima.
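The per-experiment statistics discussed above (Maxreps, Bestreps, and the number of distinct solutions) can be computed from the final result of each run. The Python sketch below is one hypothetical way to do so; it assumes each run reports a `(best_fitness, best_permutation)` pair and treats two solutions as identical when the first k entries of their permutations form the same set.

```python
from collections import Counter


def summarize_runs(results, k=30):
    """Summarize a list with one (best_fitness, best_permutation) pair per run."""
    # Two permutations encode the same solution if their first k entries
    # contain the same indices, regardless of order.
    subsets = [frozenset(perm[:k]) for _, perm in results]
    counts = Counter(subsets)
    top_fitness = max(fit for fit, _ in results)
    return {
        "maxreps": max(counts.values()),
        "bestreps": sum(1 for fit, _ in results if fit == top_fitness),
        "distinct": len(counts),
        "fitness": top_fitness,
    }
```

Exact comparison of fitness values is used here for brevity; a tolerance may be preferable if fitness values are accumulated in different orders across runs.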
If the conjecture that high rates of mixing yield synthetic data that are more nearly biological in character is correct, then we also have support for the notion that the instances of the RI-line selection problem arising from biology are the easy ones. Even if the less biological synthetic data sets are harder search problems, they are still not too hard. They share with the biological problem the property that the evolutionary algorithm causes a very small increase in total diversity over that present in the initial random populations.

V. FUTURE WORK

The solution to the motivating biological problem of selecting diverse collections of RI-lines given in this study is perfectly adequate, leaving little room for improvement. This sense of having thrown one's weight against an open door and fallen on one's face leaves open the question of the abstract difficulty of RI-selection and the relative difficulty of the biological cases. The abstract problem space is that of binary matrices together with a number of rows to be selected. The goal is to maximize the sum of column diversities in the sub-matrix containing the selected rows. Both algebraic investigation of the properties of the search space and simulations on randomly generated instances of the problem may shed light on why the problem appears so easy.

Thus far, four biological data sets have been subject to optimization of this kind: the mouse data set from [2] and three very similar data sets from corn, one of which is reported in detail in this study. The analysis of additional biological data sets may produce a surprise, but the synthetic data results suggest that the problem may be intrinsically easy. One line of research that may be valuable is to try to construct a synthetic problem of high difficulty. The synthetic data experiments suggest that constructively varying the fraction of one allele type in the columns of the matrix, so that this variation is high, may yield difficult problem instances. This also suggests which cases may be best for theoretical, algebraic analysis.

VI. ACKNOWLEDGMENTS

We thank Debbie H. Chen and Josh Shendelman for genetic mapping data from the IBM population and Ben Burr for genetic mapping data from the other two RI populations. We also thank Sang-Duck Seo, the Schnable Lab's graphic designer, for providing the image in Figure 1.
This research was funded in part by a competitive grant from the National Science Foundation Plant Genome Program (DBI ) and by an NSERC discovery grant to the first author; additional support was provided by the Hatch Act and State of Iowa funds.

REFERENCES

[1] D. Ashlock. Optimization and Modeling with Evolutionary Computation. Springer-Verlag, New York.
[2] C. Jin, H. Lan, A. D. Attie, G. A. Churchill, D. Bulutuglo, and B. S. Yandell. Selective phenotyping for increased efficiency in genetic mapping. Genetics, 168, December.
[3] M. Lee, W. Beavis, J. Vogel, W. Woodman, S. Tingley, M. J. Long, M. Kakowsky, A. Hallauer, D. Austin, and D. Ritland. Tools for high resolution genetic mapping in maize: status report. In Proceedings of Plant and Animal Genome VII, page 146.
[4] B. Lewin. Genes VIII. Prentice Hall, Upper Saddle River, NJ, 2004.