Fitness Uniform Optimization

Size: px
Start display at page:

Download "Fitness Uniform Optimization"

Transcription

1 Technical Report IDSIA-6-06 Fitness Uniorm Optimization arxiv:cs/06026v [cs.ne] 20 Oct 2006 Abstract Marcus Hutter IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland In evolutionary algorithms, the itness o a population increases with time by mutating and recombining individuals and by a biased selection o more it individuals. The right selection pressure is critical in ensuring suicient optimization progress on the one hand and in preserving genetic diversity to be able to escape rom local optima on the other hand. Motivated by a universal similarity relation on the individuals, we propose a new selection scheme, which is uniorm in the itness values. It generates selection pressure toward sparsely populated itness regions, not necessarily toward higher itness, as is the case or all other selection schemes. We show analytically on a simple example that the new selection scheme can be much more eective than standard selection schemes. We also propose a new deletion scheme which achieves a similar result via deletion and show how such a scheme preserves genetic diversity more eectively than standard approaches. We compare the perormance o the new schemes to tournament selection and random deletion on an artiicial deceptive problem and a range o NP-hard problems: traveling salesman, set covering and satisiability. Keywords Evolutionary algorithms, itness uniorm selection scheme, itness uniorm deletion scheme, preserve diversity, local optima, evolution, universal similarity relation, correlated recombination, itness tree model, traveling salesman, set covering, satisiability Introduction Evolutionary algorithms (EA). Evolutionary algorithms are capable o solving complicated optimization tasks in which an objective unction : I IR shall be maximized. i I is an individual rom the set I o easible solutions. Ineasible solutions due to constraints may also be considered by reducing or each violated constraint. A population P is a multi-set o individuals rom I which February, 2008 Shane Legg IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland shane@idsia.ch is maintained and updated as ollows: one or more individuals are selected according to some selection strategy. In generation based EAs, the selected individuals are recombined (e.g. crossover) and mutated, and constitute the new population. We preer the more incremental, steady-state population update, which selects (and possibly deletes) only one or two individuals rom the current population and adds the newly recombined and mutated individuals to it. We are interested in inding a single individual o maximal objective value or diicult multi-modal and deceptive problems. Standard selection schemes (STD). The standard selection schemes (abbreviated by STD in the ollowing), proportionate, truncation, ranking and tournament selection all avor individuals o higher itness [Gol89, GD9, BT95, BT97]. This is also true or less common schemes, like Boltzmann selection [MT93]. The itness unction is identiied with the objective unction (possibly ater a monotone transormation). In linear proportionate selection the probability o selecting an individual depends linearly on its itness [Hol75]. In truncation selection the α% ittest individuals are selected, usually with multiplicity α% to keep the population size ixed [MSV94].(Linear) ranking selection orders the individuals according to their itness. The selection probability is, then, a (linear) unction o the rank [Whi89]. Tournament selection [Bak85], which selects the best l out o k individuals has primarily developed or steady-state EAs, but can be adapted to generation based EAs. All these selection schemes have the property (and goal!) to increase the average itness o a population, i.e. to evolve the population toward higher itness. The problem o the right selection pressure. The standard selection schemes STD, together with mutation and recombination, evolve the population toward higher itness. I the selection pressure is too high, the EA gets stuck in a local optimum, since the genetic diversity rapidly decreases. The suboptimal genetic material which might help in inding the global optimum is deleted too rapidly (premature convergence). On the other hand, the selection pressure cannot be chosen arbitrarily low i

2 2 Marcus Hutter & Shane Legg we want the EA to be eective. In diicult optimization problems, suitable population sizes, mutation and recombination rates, and selection parameters, which inluence the selection intensity, are usually not known beorehand. Oten, constant values are not suicient at all [EHM99]. There are various suggestions to dynamically determine and adapt the parameters [Esh9, BHS9, Her92, SVM94]. Other approaches to preserve genetic diversity are itness sharing [GR87] and crowding [Jon75]. They depend on the proper design o a neighborhood unction based on the speciic problem structure and/or coding. One approach which does not require a neighborhood unction based on the genome is local mating [CJ9], however it has been shown that rapid takeover can still occur or basic spatial topologies [Rud00]. Another approach which has not been widely studied is preselection [Cav70]. We are interested in evolutionary algorithms which do not require special problem insight (problem speciic neighborhood unction and/or coding) and is able to eectively prevent population takeover. In this paper we introduce and analyze two potential approaches to this problem: the Fitness Uniorm Selection Scheme (FUSS) and the Fitness Uniorm Deletion Scheme (FUDS). The itness uniorm selection scheme. FUSS is based on the insight that we are not primarily interested in a population converging to maximal itness, but only in a single individual o maximal itness. The scheme automatically creates a suitable selection pressure and preserves genetic diversity better than STD. The proposed itness uniorm selection scheme FUSS (see also Figure ) is deined as ollows: i the lowest/highest itness values in the current population P are min/max we select a itness value uniormly in the interval [ min, max ]. Then, the individual i P with itness nearest to is selected and a copy is added to P, possibly ater mutation and recombination. We will see that FUSS maintains genetic diversity better than STD, since a distribution over the itness values is used, unlike STD, which all use a distribution over individuals. Premature convergence is avoided in FUSS by abandoning convergence at all. Nevertheless there is a selection pressure in FUSS toward higher itness. The probability o selecting a speciic individual is proportional to the distance to its nearest itness neighbor. In a population with a high density o unit and low density o it individuals, the itter ones are eectively avored. The itness uniorm deletion scheme. We may also preserve diversity through deletion rather than through selection. By always deleting rom those individuals which have very commonly occurring itness values we achieve a population which is uniormly distributed across itness values, like with FUSS. Because these deleted individuals are commonly occurring in some sense this should help preserve population diversity. Under FUDS the role o the selection scheme is to govern how actively dierent parts o the solution space are searched rather than to move the population as a whole toward higher itness. Thus, like with FUSS, premature convergence is avoided by abandoning convergence as our goal. However as FUDS is only a deletion scheme, the EA still requires a selection scheme which may require a selection intensity parameter to be set. Thus we do not necessarily have a parameterless EA, as we do with FUSS. Nevertheless due to the impossibility o population collapse the perormance is more robust than usual with respect to variation in selection intensity. Thus FUDS is at least a partial solution to the problem o having to correctly set a selection intensity parameter. Contents. This paper extends and supersedes the earlier results reported in the conerence papers [Hut02], [LHK04] and [LH05]. Among other things, this paper: extends the previous theoretical analysis o FUSS and gives the irst theoretical analysis o FUDS and o their perormance when combined; presents a new method o analysis called itness tree analysis; is the irst set o experimental results which directly compares the two proposed schemes on the same problems with the same parameters, including when they are used together; gives the irst ull analysis o population diversity measurements or FUSS and in particular extends and corrects some o the earlier speculation about perormance problems in some situations. The paper is structured as ollows: In Section 2 we discuss the problems o local optima and population takeover [GD9] in STD, which could be lowered by restricting the number o similar individuals in a population. As we oten do not have an appropriate unctional similarity relation, we deine a universal distance (semi-metric) d(i,j):= (i) (j) based on the available itness only, which will serve our needs. Motivated by the universal similarity relation d and by the need to preserve genetic diversity, we deine in Section 3 the itness uniorm selection scheme. We discuss under which circumstances FUSS leads to an (approximate) itness uniorm population. Further properties o FUSS are discussed in Section 4, especially, how FUSS creates selection pressure toward higher itness and how it preserves diversity better than STD. Further topics are the equilibrium distribution, the transormation properties o FUSS under linear and nonlinear transormations o. Another way to utilize the ability o the universal similarity relation d to preserve diversity, is to use it to help target deletion. This gives us the itness uniorm deletion scheme which we deine in Section 5. As this produces a population which is approximately uniormly distributed across itness levels, like with FUSS, many o the properties o FUSS carry over to an EA using FUDS. Some o these properties are highlighted in Section 6. In Section 7 we theoretically demonstrate, by way o a simple optimization example, that an EA with FUSS or FUDS can optimize much aster than with STD. We show that crossover can be eective in FUSS, even when ineective in STD. Furthermore, FUSS, FUDS and STD are compared to random search with and without crossover. In Section 8 we develop a itness tree model, which we believe to cover the essential eatures o itness landscapes or diicult problems with many local optima. Within this model we derive heuristic expressions or the optimization time o random walk, FUSS, FUDS and STD. They are compared, and a worst case slowdown o FUSS relative to

3 Fitness Uniorm Optimization 3 STD is obtained. There is a possible additional slowdown when including recombination, as discussed in Section 9, which can be avoided by using a scale independent pair selection. It is a best compromise between unrestricted recombination and recombination o d-similar individuals only. It also has other interesting properties when used without crossover. To simpliy the discussion we have concentrated on the case o discrete, equi-spaced itness values. In many practical problems, the itness unction is continuously valued. FUSS and some o the discussion o the previous sections is generalized to the continuous case in Section 0. Section begins our experimental analysis o FUSS and FUDS. In this section we give a detailed account o the EA sotware we have used or our experiments, including links to where the source code can be downloaded. Section 2 examines the empirical perormance o FUSS and FUDS on the artiicially constructed deceptive optimization problem described in Section 7. These results conirm the correctness o our theoretical analysis. In Section 3 we test randomly generated traveling salesman problems. In Section 4 we examine the set covering problem, an NP hard optimization problem which has many real world applications. For our inal test in Section 5 we look at random CNF3 SAT problems. These are also NP hard optimization problems. Section 6 contains a summary o our results and possible avenues or uture research. 2 Universal Similarity Relation The problem o local optima. Proportionate, truncation, ranking and tournament are the standard (STD) selection algorithms used in evolutionary optimization. They have the ollowing property: i a local optimum i lopt has been ound, the number o individuals with itness lopt = (i lopt ) tends to increase rapidly. Assume a low mutation and recombination rate, or, or instance, truncation selection ater mutation and recombination. Further, assume that it is very diicult to ind an individual itter than i lopt. The population will then degenerate and will consist mostly o i lopt ater a ew rounds. This decreased diversity makes it even less likely that lopt gets improved. The suboptimal genetic material which might help in inding the global optimum has been deleted too rapidly. On the other hand, too high mutation and recombination rates convert the EA into an ineicient random search. Possible solution. Sometimes it is possible to appropriately choose the mutation and recombination rate and population size by some insight into the nature o the problem. More oten this is a trial and error process, or no single ixed rate works at all. A naive ix o the problem is to artiicially limit the number o identical individuals to a signiicant but small raction ε. I the space o individuals I is large, there could be many very similar (but not identical) individuals o, or instance, itness lopt. The EA can still converge to a population containing only this class o similar individuals, with all others becoming extinct. In order or the limitation approach to work, one has to restrict the number o similar individuals. Signiicant contributions in this direction are itness sharing [GR87] and crowding [Jon75]. The problem o inding a similarity relation. I the individuals are coded binary one might use the Hamming distance as a similarity relation. This distance is consistent with a mutation operator which lips a ew bits. It produces Hamming-similar individuals, but recombination (like crossover) can produce very dissimilar individuals w.r.t. this measure. In any case, genotypic similarity relations, like the Hamming distance, depend on the representation o the individuals as binary strings. Individuals with very dissimilar genomes might actually be unctionally (phenotypically) very similar. For instance, when most bits are unused (like introns in genetic programming), they can be randomly disturbed without aecting the properties o the individual. For speciic problems at hand, it might be possible to ind suitable representationindependent unctional similarity relations. On the other hand, in genetic programming, or instance, it is in general undecidable whether two individuals are unctionally similar. A universal similarity relation. Here we want to take a dierent approach. We deine the dierence or distance between two individuals as d(i, j) := (i) (j). The distance is based solely on the itness unction, which is provided as part o the problem speciication. It is independent o the coding/representation and other problem details, and o the optimization algorithm (e.g. the genetic mutation and recombination operators), and can trivially be computed rom the itness values. I we make the natural assumption that unctionally similar individuals have similar itness, they are also similar w.r.t. the distance d. On the other hand, individuals with very dierent coding, and even unctionally dissimilar individuals may be d-similar, but we will see that this is acceptable. For instance, individuals rom dierent local optima o equal height are d-similar. Relation to niching and crowding. Unlike itness uniorm optimization, diversity control methods like niching or crowding require a metric g to be deined over the genome space. By looking at the relationship between g and we can relate these two types o diversity control: We say that a itness unction is smooth with respect to g, i g(i,j) being small implies that (i) (j) is also small, that is, d(i,j) is small. This implies that i d(i,j) is not small, g(i,j) also cannot be small. Thus, i we limit the number o d similar individuals, as we do in itness uniorm optimization, this will also limit the number o g similar individuals, as is done in crowding and niching methods. The advantage o itness uniorm optimization is that we do not need to know what g is, or to compute its

4 4 Marcus Hutter & Shane Legg value. Indeed, the above argument is true or any metric g on the genome space that is smooth with respect to. On the other hand, i the itness unction is not generally smooth with respect to g, then such a comparison between the methods cannot be made. However, in this case an EA is less likely to be eective as small mutations in genome space with respect to g will produce unpredictable changes in itness. Topologies on individual space I. The distance d:i I IR + 0 induced by the itness unction is a semi-metric on the individual space I (semi only because d(i,j) = 0 or i j is possible). The semi-metric induces a topology on I. Equal itness suices to declare two individuals as d- equivalent, i.e. d is a rather small semi-metric in the sense that the induced topology is rather coarse. We will see that a non-zero distance between individuals o dierent itness is suicient to avoiding the population takeover. d induces the coarsest topology (is the smallest distance) avoiding population takeover. The problem o genetic drit. Besides elitist selection, the other major cause o diversity loss in a population is genetic drit. This occurs due to the stochastic nature o the selection operator breeding some individuals more oten than others. In a inite population this will cause some individuals to be replaced which have no close relatives, thus reducing diversity. Indeed, without a suicient rate o mutation, eventually a population will converge on a single genome; even i no selection pressure is applied. Although itness uniorm optimization does not attempt to address this problem, some implications can be drawn. Clearly, with itness uniorm optimization a complete collapse in diversity is impossible as individuals with a wide range o itness values are always preserved in the population. However, within a given itness level genetic drit can occur, although the sustained presence o many individuals in other itness levels to breed with will reduce this eect. Theoretical analysis o genetic drit is oten perormed by calculating the Markov chain transition matrices to compute the time or the system to reach an absorption state where all o the population members have the same genome. As these results can be diicult to generalize, an alternative approach has been to measure genetic drit by measuring the loss in itness diversity in a population over time [RPB99a]. This is interesting as itness uniorm optimization attempts to maximize the entropy o the itness values in the population, producing a very high variance in population itness. Thus, at least according to the second method o analysis, very little genetic drit would be evident in the population. 3 Fitness Uniorm Selection Scheme (FUSS) Discrete itness unction. In this section we propose a new selection scheme, which limits the raction o d- similar individuals. For simplicity we start with a itness unction : I F with discrete equi-spaced values F = { min, min +ε, min +2ε,..., max ε, max }. We call two individuals i and j δ-similar i d(i,j) (i) (j) δ. The continuous valued case F =[ min, max ] is considered later. In the ollowing we assume δ <ε. In this case, two individuals are δ-similar i and only i they have the same itness. The goal. We have argued that in order to escape local optima, genetic variety should be preserved somehow. One way is to limit the number o δ-similar individuals in the population. In an exact itness uniorm distribution there would be P / F individuals or each o the F itness values, i.e. each itness level would be occupied by a raction o / F individuals. The ollowing selection scheme asymptotically transorms any inite population into a itness uniorm one. The itness uniorm selection scheme (FUSS). FUSS is deined as ollows: randomly select a itness value uniormly rom the itness values F. Then, uniormly at random select an individual i P with itness. Add another copy o i to P. Note the two stage uniorm selection process which is very dierent rom a one step uniorm selection o an individual o P (see Figure ). In STD, inertia increases with population size. A large mass o unit individuals reduces the probability o selecting it individuals. This is not the case or FUSS. Hence, without loss o perormance, we can deine a pure model, in which no individual is ever deleted; the population size increases with time. No genetic material is ever discarded and no ine-tuning in population size is necessary. What may prevent the pure model rom being applied to practical problems are not computation time issues, but memory problems. I space becomes a problem we delete random individuals, as is usually done with a steady state EA. Asymptotically itness uniorm distribution. The expected number o individuals per itness level ater t selections is n t ()=n 0 ()+t/ F, where n 0 () is the initial distribution. Hence, asymptotically each itness level gets occupied uniormly by a raction n t () P t = n 0() + t/ F P 0 + t F or t, where P t is the population at time t. The same limit holds i each selection is accompanied by uniormly deleting one individual rom the (now constant sized) population. Fitness gaps and continuous itness. We made two unrealistic assumptions. First, we assumed that each itness level is initially occupied. I the smallest/largest itness values in P t are min/max t we extend the deinition o FUSS by selecting a itness value uniormly in the interval [min t 2 ε,t max + 2 ε] and an individual i P t with itness nearest to (see Figure 2). This also covers the case when there are missing intermediate itness values, and also works or continuous valued itness unctions (ε 0). Mutation and recombination. The second assumption was that there is no mutation and recombination. In the

5 Fitness Uniorm Optimization 5 n() p() n() * = proportional n() p() truncation * = n() Figure 2: I the lowest/highest itness values in the current population P are min/max, FUSS selects a itness value uniormly in the interval [ min, max ], then, the individual i P with itness nearest to is selected and a copy is added to P, possibly ater mutation and recombination. n() n() p() ranking & * = tournament p() uniorm n() n() presence o a small mutation and/or recombination rate eventually each itness level will become occupied and the occupation raction is still asymptotically approximately uniorm. For larger rate the distribution will be no longer uniorm, but the important point is that the occupation raction o no itness level decreases to zero or t, unlike or STD. Furthermore, FUSS selects by construction uniormly in the itness levels, even i the levels are not uniormly occupied. * = 4 Properties o FUSS n() p() FUSS * = n() Figure : Eects o proportionate, truncation, ranking & tournament, uniorm, and itness uniorm (FUSS) selection on the itness distribution in a generation based EA. The let/right diagrams depict itness distributions beore/ater applying the selection schemes depicted in the middle diagrams. Note that or populations with a non-gaussian distribution o itness values (let column), the graph o selection probability vs. itness or FUSS (center bottom) can be totally dierent to that illustrated above, however the population distribution that results (right bottom) will be the same. FUSS eectively avors it individuals. FUSS preserves diversity better than STD, but the latter have a (higher) selection pressure toward higher itness, which is necessary or optimization. At irst glance it seems that there is no such pressure at all in FUSS, but this is deceiving. As FUSS selects uniormly in the itness levels, individuals o low populated itness levels are eectively avored. The probability o selecting a speciic individual with itness is inversely proportional to n t () (see Figure ). In an initial typical (FUSS) population there are many unit and only a ew it individuals. Hence, it individuals are eectively avored until the population becomes itness uniorm. Occasionally, a new higher itness level is discovered and occupied by a single new individual, which then, again, is avored. No takeover in FUSS. With FUSS, takeover o the highest itness level never happens. The concept o takeover time [GD9] is meaningless or FUSS. The raction o ittest individuals in a population is always small. This implies that the average population itness is always much lower than the best itness. Actually, a large number o it individuals is usually not the true optimization goal. A single ittest individual usually suices to solve the optimization task. FUSS may also avor unit individuals. Note, i it is also diicult to ind individuals o low itness, i.e. i there

6 6 Marcus Hutter & Shane Legg are only a ew individuals o low itness, FUSS will also avor these individuals. Hal o the time is wasted in searching on the wrong end o the itness scale. This possible slowdown by a actor o 2 is usually acceptable. In Section 7 we will see that in certain circumstances this behavior can actually speedup the search. In general, itness levels which are diicult to reach, are avored. Distribution within a itness level. Within a itness level there is no selection pressure which could urther exponentially decrease the population in certain regions o the individual space. This (exponential) reduction is the major enemy o diversity, which is suppressed by FUSS. Within a itness level, the individuals reely drit around (by mutation). Furthermore, there is a steady stream o individuals into and out o a level by (d)evolution rom (higher)lower levels. Consequently, FUSS develops an equilibrium distribution which is nowhere zero. This does not mean that the distribution within a level is uniorm. For instance, i there are two (local) maxima o same height, a very broad one and a very narrow one, the broad one may be populated much more than the narrow one, since it is much easier to ind. Steady creation o individuals rom every itness level. In STD, a wrong step (mutation) at some point in evolution might cause urther evolution in the wrong direction. Once a local optimum has been ound and all unit individuals were eliminated it is very diicult to undo the wrong step. In FUSS, all itness levels remain occupied rom which new mutants are steadily created, occasionally leading to urther evolution in a more promising direction (see Figure 3). Transormation properties o FUSS. FUSS (with continuous itness) is independent o a scaling and a shit o the itness unction, i.e. FUSS( ) with (i):=a (i)+b is identical to FUSS(). This is true even or a < 0, since FUSS searches or maxima and minima, as we have seen. It is not independent o a non-linear (monotone) transormation unlike tournament, ranking and truncation selection. The non-linear transormation properties are more like the ones o proportionate selection. Figure 3: Evolution o the population under FUSS versus standard selection schemes (STD): STD may get stuck in a local optimum i all unit individuals were eliminated too quickly. In FUSS, all itness levels remain occupied with ree drit within and in-between itness levels, rom which new mutants are steadily created, occasionally leading to urther evolution in a more promising direction. 5 Fitness Uniorm Deletion Scheme (FUDS) For a steady state evolutionary algorithm each cycle o the system consists o both selecting which individual or individuals to crossover and mutate, and then selecting which individual is to be deleted in order to make space or the new child. The usual deletion scheme used is random deletion as this is neutral in the sense that it does not bias the distribution o the population in any way and does not require additional work to be done, such as evaluating the similarity o individuals based on their genes. Another common strategy is to use an elitist deletion scheme. Here we propose to use the similarity semi-metric d deined in Section 2 to achieve a uniorm distribution across itness levels, like with FUSS, except that we achieve this

7 Fitness Uniorm Optimization 7 by selectively deleting those members o the population which have very commonly occurring itness values. O course this leaves the selection scheme unspeciied, indeed we may use any standard selection scheme such as tournament selection in combination with FUDS. It also means that we lose one o the nice eatures o FUSS as we now need to manually tune the selection intensity or our application FUSS o course is parameterless. Nevertheless it allows us to give many FUSS like properties to an existing EA using a standard selection scheme with only a minor modiication to the deletion scheme. The intuition behind why FUDS preserves population diversity is very simple: I an individual has a itness value which is very rare in the population then this individual almost certainly contains unique inormation which, i it were to be deleted, would decrease the total population diversity. Conversely, i we delete an individual with very commonly occurring itness then we are unlikely to be losing signiicant diversity. Presumably most o these individuals are common in some sense and likely exist in parts o the solution space which are easy to reach. Thus the itness uniorm deletion strategy is now clear: Only delete individuals with very commonly occurring itness values as these individuals are less likely to contain important genetic diversity. Practically FUDS is implemented as ollows. Let min and max be the minimum and maximum itness values possible or a problem, or at least reasonable upper and lower bounds. We divide the interval [ min, max ] into a collection o subintervals o equal length {[ min, min + a),[ min +a, min +2a),...,[ max a, max ]} which we call itness levels. As individuals are added to the population their itness is computed and they are placed in the set o individuals corresponding to the itness level they belong to. Thus the number o individuals in each itness level describes how common itness values within this interval are in the current population. When a deletion is required the algorithm locates the itness level with the greatest number o individuals and then deletes a random individual rom this level. In the case where multiple itness levels have maximal size the lowest o these levels is used. I the number o itness levels is chosen too low, say 5 levels, then the resulting model o the distribution o individuals across the itness range will be too coarse. Alternatively i a large number o itness levels is used with a very small population the individuals may become too thinly spread across the itness levels. While in these extreme cases this could aect the perormance o FUDS, in practice we have ound that the system is not very sensitive to the setting o this parameter. I n is the population size then setting the number o itness levels to be n is a good rule o thumb. For discrete valued itness unctions there is a natural lower bound on the interval length a because below a certain value there will be more intervals than unique itness values. O course this cannot happen when the itness unction is continuous. Other than this small technical detail, the two cases are treated identically. As FUDS spreads the individuals out across a wide range o itness values, or small populations the EA may become ineicient as only a ew individuals will have relatively high itness. For problems which are not deceptive this is especially true as there will be little value in having individuals in the population with low to medium itness. O course these are not the kinds o problems or which FUDS was designed. In practice we have always used populations o between 250 and 5,000 individuals and have not observed a decline in perormance relative to random deletion at the lower end o this range. An alternative implementation that avoids discretization is to choose the two individuals that have the most similar itness and delete one o them. An eicient implementation keeps a list o the individuals ordered by their itness along with an ordered list o the distances between the individuals. Then in each cycle one o the two individuals with closest itness to each other is selected or deletion. Although the perormance o this algorithm was better than random deletion, it was not as good as the implementation o FUDS using bins. We conjecture that the reason or this is as ollows: When there are just a ew very it individuals in the population it is quite likely that they will be highly related to each other and have very similar itness. This means that i we delete the individuals with most similar itness it is likely that many o the very it individuals will be deleted. However with the bins approach this will not happen as there are typically ew individuals in the high itness bins. Thus, although deleting one o the closest individuals in terms o itness might preserve diversity well, it also changes the pressure on the population distribution over itness levels. This small change in distribution dynamics appears to reduce perormance in practice. 6 Properties o FUDS As FUDS uniormly distributes the population across itness levels, like FUSS does, many o the key properties o FUSS also carry over to an EA that is using a standard selection scheme (STD) combined with FUDS deletion. No takeover in FUDS. Under FUDS the takeover o the highest itness level, or indeed any itness level, is impossible. This is easy to see because as soon as any itness level starts to dominate, all o the deletions become ocused on this level until it is no longer the most populated itness level. As a by-product, this also means that individuals on relatively unpopulated itness levels are preserved. Steady creation o individuals rom every itness level. Another similarity with FUSS is the steady creation o individuals on many dierent itness levels. This occurs because under FUDS some individuals on each itness level are always kept. This makes it relatively easy or the EA to ind its way out o local optima as it keeps on exploring evolutionary paths which do not at irst appear to be promising. Robust perormance with respect to selection intensity. Because FUDS is only a deletion scheme, we still need to choose a selection scheme or the EA. O course

8 8 Marcus Hutter & Shane Legg this selection scheme may then require us to set a selection intensity parameter. While this is not as desirable as FUSS, which has no such parameter, at least with FUDS we expect the perormance o the system to be less sensitive to the correct setting o this parameter. For example, i the selection intensity is set too high the normal problem is that the population rushes into a local optimum too soon and becomes stuck beore it has had a chance to properly explore the genotype space or other promising regions. However, as we noted above, with FUDS a total collapse in population diversity is impossible. Thus much higher levels o selection intensity may be used without the risk o premature convergence. In some situations i very low section intensity is used along with random deletion, the population tends not to explore the higher areas o the itness landscape at all. This can be illustrated by a simple example. Consider a population which contains,000 individuals. Under random deletion all o these individuals, including the highly it ones, will have a in,000 chance o being deleted in each cycle and so the expected lie time o an individual is,000 deletion cycles. Thus i a highly it individual is to contribute a child o the same itness or higher, it must do so reasonably quickly. However or some optimization problems the probability o a it individual having such a child when it is selected is very low, so low in act that it is more likely to be deleted beore this happens. As a result the population becomes stuck, unable to ind individuals o greater itness beore the ittest individuals are killed o. The usual solution to this problem is to increase the selection intensity because then the it individuals are selected more oten and thus are more likely to contribute a child o similar or greater itness beore they are deleted. Another is to change the deletion scheme so that these individuals live longer. This is what happens with FUDS as rare it individuals are not deleted. Eectively it means that with FUDS we can oten use much lower selection intensity without the population becoming stuck. Transormation properties o FUDS. While with FUDS we have the added complication o having to choose the number o subintervals with which to break up the itness values, this number is only a unction o the population size and distributional characteristics o the problem. Thus any linear transormation o the itness unction has no eect on FUDS. However, non-linear transormations will aect perormance. Problem and representation independence. Because FUDS only requires the itness o individuals, the method is completely independent o the problem and genotype representation, i.e. how the individuals are coded. Simple implementation and low computational cost. As the algorithm is simple and the itness unction is given as part o the problem speciication, FUDS is very easy to implement and requires ew computational resources. 7 A Simple Example In the ollowing, we use a simple example problem to compare the perormance o itness uniorm selection (FUSS), random search (RAND) and standard selection (STD), each used both with and without recombination. We also examine the perormance o standard selection when used with the itness uniorm deletion scheme (FUDS). We regard this problem as a prototype or deceptive multimodal unctions. The example demonstrates how FUSS and FUDS can be superior to RAND and STD in some situations. More generic situations will be considered in Section 8. An experimental analysis o this problem appears in Section 2. Simple 2D example. Consider individuals (x,y) I := [0,] [0,], which are tuples o real numbers, each coordinate in the interval [0,]. The example models individuals possessing up to 2 eatures. Individual i possesses eature I i i I :=[a,a+ ] [0,], and eature I 2 i i I 2 :=[0,] [b,b+ ]. The itness unction :I {,2,3} is deined as i (x, y) I \I 2, 2 i (x, y) I (x, y) = 2 \I, 3 i (x, y) I I 2, 4 i (x, y) I I 2. y b (x, y) We assume. Individuals with neither o the two eatures (i I\(I I 2 )) have itness =3. These local = 3 optima occupy most o the individual space I, namely a raction ( ) 2. It is disadvantageous or an individual to possess only one o the two eatures (i (I \I 2 ) (I 2 \I )), since = or 2 in this case. In combination (i I I 2 )), the two eatures lead to the highest itness, but the global maximum = 4 occupies the smallest raction 2 o the individual space I. With a raction ( ), the = / =2 minima are in between. The example has sort o an XOR structure, which is hard or many optimizers. Random search. Individuals are created uniormly in the unit square. The local optimum = 3 is easy to ind, since it occupies nearly the whole space. The global optimum = 4 is diicult to ind, since it occupies only 2 o the space. The expected time, i.e. the expected number o individuals created and tested until one with =4 is ound, is T RAND =. Here and in the 2 ollowing, the time T is deined as the expected number o created individuals until the irst optimal individual (with =4) is ound. T is neither a takeover time nor the number o generations (we consider steady-state EAs). Random search with crossover. Let us occasionally perorm a recombination o individuals in the current population. We combine the x-coordinate o one uniormly selected individual i with the y coordinate o another individual i 2. This crossover operation maintains a uniorm distribution o individuals in [0,] 2. It leads to the global optimum i i I and i 2 I 2. The probability o selecting an individual in I i is ( ) (we assumed that a 3 x

9 Fitness Uniorm Optimization 9 the global optimum has not yet been ound). Hence, the probability that I crosses with I 2 is 2. The time to ind the global optimum by random search including crossover is still 2 ( denotes asymptotic proportionality). Mutation. The result remains valid (to leading order in ) i, instead o a random search, we uniormly select an individual and mutate it according to some probabilistic, suiciently mixing rule, which preserves uniormity in [0,]. One popular such mutation operator is to use a suiciently long binary representation o each coordinate, like in genetic algorithms, and lip a single bit. For simplicity we assume in the ollowing a mutation operator which replaces with probability 2 / 2 the irst/second coordinate by a new uniorm random number. Other mutation operators which mutate with probability 2 / 2 the irst/second coordinate, preserve uniormity, are suiciently mixing, and leave the other coordinate unchanged (like the single-bitlip operator) lead to the same scaling o T with (but with dierent proportionality constants). Standard selection with crossover. The = and = 2 individuals contain useul building blocks, which could speedup the search by a suitable selection and crossover scheme. Unortunately, the standard selection schemes avor individuals o higher itness and will diminish the = / = 2 population raction. The probability o selecting = / = 2 individuals is even smaller than in random search. Hence T STD 2. Standard selection does not improve perormance, even not in combination with crossover, although crossover is well suited to produce the needed recombination. FUSS. At the beginning, only the =3 level is occupied and individuals are uniormly selected and mutated. The expected time until an = or =2 individual in I I 2 is created is T (not 2, since only one coordinate is mutated). From this time on FUSS will select one hal(!) o the time the = / = 2 individual(s) and only the remaining hal the abundant =3 individuals. When level = and level =2 are occupied, the selection probability is or these levels. With probability 2 the mutation operator will mutate the y coordinate o an individual in I or the x coordinate o an individual in I 2 and produces a new = /2/4 individual. The relative probability o creating an = 4 individual is. The expected time to ind this global optimum rom the =/ =2 individuals, hence, is T 2 =[( ) 2 ]. The total expected time is T FUSS T +T 2 = T 2 STD. FUSS is much aster by exploiting unit = / = 2 individuals. This is an example where (local) minima can help the search. Examples where a low local maxima can help in inding the global maximum, but where standard selection sweeps over too quickly to higher but useless local maxima, can also be constructed. FUSS with crossover. The expected time until an = individual in I and an = 2 individual in I 2 is ound is T, even with crossover. The probability o selecting an = / = 2 individual is 3 / 3. Thus, the probability that a crossing operation crosses I with I 2 is ( 3 )2. The expected time to ind the global optimum rom the = / =2 individuals, hence, is T 2 =9 O(), where the O() actor depends on the requency o crossover operations. This is ar aster than by STD, even i the =/=2 levels were local maxima, since to get a high standard selection probability, the level has irst to be taken over, which itsel needs some time depending on the population size. In FUSS a single = and a single = 2 individual suice to guarantee a high selection probability and an eective crossover. Crossover does not signiicantly decrease the total time T FUSSX T +T 2 +O(9), but or a suitable 3D generalization we get a large speedup by a actor o. FUDS with crossover. Assume that initially all o the individuals have =3 and that we are using random selection. For any mutation the probability o the child being in I I 2 is. Until I I 2 becomes quite ull FUDS will never delete individuals rom these areas. Furthermore i an individual in I I 2 is mutated then the mutant will also be in I I 2 with probability 2 (+ ). Thereore while most o the population has =3 we can lower bound the probability o a new child being in I I 2 by. It then ollows that i P is the size o the population we can upper bound the expected time or I I 2 to contain hal the total population by P 2. Once this occurs (and most likely well beore this point) crossover will produce an individual with = 4 almost immediately by crossing a member o I with a member o I 2. Thus T FUDS T 2 STD. This gives FUDS when used with random selection scaling characteristics which are similar to FUSS. I we use a selection scheme with higher intensity our bound on the expected time or hal the population to have =3 remains unchanged as the bound holds in the worst case situation where only individuals with =3 are selected. However higher selection intensity makes the inal crossover required to ind an individual with = 4 less likely. For moderate levels o selection intensity this is clearly not a signiicant actor and more importantly it is O() and independent o. Thus the order o scaling or T FUDS is just or this diicult problem, which is the same as T FUSSX. Simple 3D example. We generalize the 2D example to D-dimensional individuals x [0,] D and a itness unction D ( x) := (D + ) d= χ d ( x) max d D d χ d( x) + D +, where χ d ( x) is the characteristic unction o eature I d { i ai x χ d ( x) := i a i +, 0 else. For D=2, coincides with the 2D example. For D=3, the ractions o [0,] 3 where =/2/3/4/5 are approximately 2 / 2 / 2 // 3. With the same line o reasoning we get the ollowing expected search times or the global optimum: T RAND T STD 3, T FUSS 2, T FUSSX T FUDS.

10 0 Marcus Hutter & Shane Legg Species B 2 Species C 2 Species A Figure 4: Generic 2D itness landscape with evolution tree. Each connected slice represents a species. A species is also symbolized by a node in the slice. The number in a slice and near a node is the itness value o the species. I individuals rom one species can evolve to individuals o another species, the nodes are connected by a solid line. Altogether, they orm the itness tree. The branching actor b is 2 and the number o species per itness level s is 4 or intermediate itness values (3,4,5). This demonstrates the existence o problems where FUSS is much aster than RAND and STD, and where crossover can give a urther boost to FUSS, even when it is ineective in combination with STD. 8 Fitness-Tree Analysis subsectionthe itness tree model A general, problem independent comparison o the various optimization algorithms is diicult. We are interested in the perormance or diicult itness landscapes with many local optima. We only consider mutation; recombination is discussed in the next section. The evolutionary neighborhood (not to be conused with d-similarity) o an individual i is deined as the set o individuals that can be created rom i by a single mutation. Two individuals i and j with the same itness are deined to belong to the same species i there is a inite sequence o mutations which transorms i into j and all individuals o the sequence also have itness (i) = (j). Each itness level is partitioned in this way into disjoint species. We say a species o itness +ε can evolve rom a species o itness, i there is a mutation which transorms an individual rom the latter species to one o the ormer. Those species are connected by an edge in Figures 4 and 5. A species is said to be promising i it can evolve to the global optimum max. Additional deinitions and simpliying assumptions. i) Evolution which skips itness levels is ignored, and We have small mutations in mind, e.g. single bit lips, not macro mutations, which connect all individuals. Figure 5: Generic itness unction with evolution tree. Individuals which are evolutionary neighbors are connected by a dashed line. They belong to the species indicated by a node on the dashed line. A species which can evolve rom another is connected to it by a solid line. The smooth curve visualizes (somewhat misleading, since the itness is discrete) the itness unction with many local maxima. also devolution to species o lower itness other than the primordial species. ii) Random individuals have lowest itness min with high probability, and there is only one species o itness min. iii) There is a ixed branching actor b, i.e. each species can evolve into b improved species, or represents a local optimum rom which no urther evolution is possible. iv) There is a single global optimum max (or b optima to be consistent with the previous item). v) There are s dierent species per itness level (except near min and max where there must be ewer to be consistent with the previous items). vi) The probability p that an individual evolves to a higher itness is very small. In most cases a mutation keeps an individual within its species or devolves it. vii) The probability to evolve to one o the ospring species is uniorm, i.e. /b or all ospring species. We have the eeling that this picture covers the essential eatures o itness landscapes or diicult problems. The qualitative conclusions we will draw should still hold when some or all o the additional simpliying assumptions are violated. Example. Consider the case o individuals, which are real-valued D dimensional vectors, i.e. I = IR D. Let the itness unction be continuous and positive with many local maxima, which tends to zero or large arguments. This covers a large range o physical optimization problems. Mutation shall be local in IR D, i.e. i original i mutated D. As FUSS and the itness tree model is only deined or discrete itness unctions, we discretize to := ε,

11 Fitness Uniorm Optimization which is acceptable or suiciently small ε. A typical itness landscape or D = 2 and D = together with their itness tree are depicted in Figures 4 and 5. Since mutation is a local operation, each species is a (possibly multiply punched) connected slice (D-dimensional sub-volume) and evolution can only occur rom to + (ε=). Assumption (i) is generally satisied. The special itness landscapes depicted in Figures 4 and 5 also satisy (ii,iii,iv,v) with b=2 and s=4. Random walk. Consider a mutation induced random walk o a single individual. Due to the low evolution probability p, most o the time will be spent on individuals o the lowest itness min. As evolution is a tree, there is only one evolution sequence which leads to the global optimum. At each evolution step, the correct ospring species (out o b) has to be evolved. The probability o an evolution step in the right direction, hence, is p/b. F evolution steps are necessary to reach max. Thereore, the expected time to ind the global maximum by random walk is T RW (b/p) F. Random walk is very slow; it is exponential in the number o itness levels F to a very large basis b/p. FUSS. Assume that L itness levels rom min to are occupied. The probability that FUSS selects an individual o itness is /L. Under this additional assumption that the occupation o species within one itness level is approximately uniorm most o the time, the probability o selecting an individual o the promising species, which can evolve to the global optimum, is /s. The probability o an evolution step in the right direction is p/b as in the random walk case. Hence, the total expected time or an evolution in the right direction is L s b/p. The total time T FUSS 2 F 2 b/p or an evolution rom L = to the global optimum L = F is obtained by summation over L=... F. FUDS. A similar analysis can be applied to FUDS. Assume again that the L itness levels rom min to are occupied and that the occupation o species within each itness level is approximately uniorm most o the time. Because FUDS tends to spread the population out, like FUSS, this assumption is not unreasonable. As FUDS is only a deletion scheme we must also speciy a selection scheme. For our analysis we will take a very simple elitist selection scheme that hal o the time selects an individual rom the highest itness level, and the other hal o the time selects an individual rom a lower level. It ollows then that the probability o selecting a promising species is /2s and the probability that this then results in an evolutionary step in the right direction is p/b. Thus the total expected time or an evolutionary step in the right direction is 2 s b/p. Thereore by summation the total expected time to evolve to the global optimum is T FUDS 2 F s b/p. O course this analysis rests on our choice o selection scheme and the assumptions about the uniormity o the population that we have made. When FUDS is used with selection schemes which are very greedy these uniormity assumptions will likely be violated and less avorable bounds could result. Standard selection. We assumed a ixed number o s species per itness level and 0 or b ospring species. This implies that only a raction o /b species can evolve to higher itness. We assume that itness level has been taken over, i.e. most individuals have itness. The probability o evolution is p. A signiicant raction (or simplicity we assume most) o the P individuals must evolve to the next itness level beore evolution with a relevant rate can occur to the next to next level. Hence, the time to take over the next itness level is roughly P b/p. As there are F itness levels, the total time is T STD > F P b/p. We wrote > as we have made two signiicant avorable assumptions. In order to ensure convergence, the promising species in the current itness level has to be occupied. I we assume a uniorm occupation o species within one itness level, as or FUSS, this means that all species o the current itness level have to be populated. As there are s species, P has to be at least s, which can be quite large. On the other hand, STD linearly slows down with P, unlike FUSS. Hence, there is a trade-o in the choice o P. More serious is the ollowing problem. Assume that the irst individual evolved with itness +ε is one in a nonpromising species a. Due to selection pressure it might happen that species a takes over the whole population beore all (or at least the promising) species with itness +ε can evolve rom the ones o itness. The probability to ind the global optimum in the worst case scenario, where at each level only one species is occupied, is (/b) F. This is the original problem o the loss o genetic diversity discussed at the outset, which lead to the invention o FUSS. Every other ix the authors are aware o only seems to diminish the problem, but does not solve it. One ix is to repeatedly restart the EA, but the huge number o b F restarts might be necessary. The time is exponential in F like or random walk but with a smaller basis b. The true time is expected to be somewhere in between F P b/p and this worst case analysis, although an unavorable setting may never reach the global optimum (T STD = in this case). Perormance comparison. The times T FUSS, T FUDS and T STD should be regarded, at best, as rules o thumb, since the derivation was rather heuristic due to the list o assumptions. The quotient is more reliable: and T FUSS < F s T STD 2 P T FUDS T STD < 2 F F, s P. We will give a more direct argument in Section 9 that the slowdown o FUSS relative to STD is at most F. Finally, a truism has been recovered, namely that an EA can, under certain circumstances, be much aster than random walk, that is, T RW T FUSS,T FUDS,T STD.

12 2 Marcus Hutter & Shane Legg 9 Scale-Independent Selection and Recombination Worst case analysis. We now want to estimate the maximal possible slowdown o FUSS compared to STD. Let us assume that all individuals in STD have itness, and once one individual with itness +ε has been ound, takeover o level +ε is quick. Let us assume that this quick takeover is actually good (e.g. i there are no local maxima). The selection probability o individuals o same itness is equal. For FUSS we assume individuals in the range o min and. Uniormity is not necessary. In the worst case, a selection o an individual o itness < never leads to an individual o itness, i.e. is always useless. The probability o selecting an individual with itness is F. At least every F th FUSS selection corresponds to a STD selection. Hence, we expect a maximal slowdown by a actor o F, since FUSS simulates STD statistically every F th selection. It is possible to construct problems where this slowdown occurs (unimodal unction, local mutation x x± ε, no crossover). Gradient ascent would be the algorithm o choice in this case. On the other hand, we have not observed this slowdown in our simple 2D example and the TSP experiments, where FUSS outperormed STD in solution quality/time (see the experimental results in Section 2). Since real world problems oten lie in between these extreme cases it is desirable to modiy FUSS to cope with simple problems as well, without destroying its advantages or complex objective unctions. Quadratic slowdown due to recombination. We have seen that T FUSS F T STD. In the presence o recombination, a pair o individuals has to be selected. The probability that FUSS selects two individuals with itness is F. Hence, in the worst case, there could be a slowdown by a actor o F 2 or independent selection we 2 expect T FUSS F 2 T STD. This potential quadratic slowdown can be avoided by selecting one itness value at random, and then two individuals o this single itness value. For this dependent selection, we expect T FUSS F T STD. On the other hand, crossing two individuals o dierent itness can also be advantageous, like the crossing o = with =2 individuals in the 2D example o Section 7. Scale independent selection. A near optimal compromise is possible: a high selection probability p() i max and p() F otherwise. A scale independent probability distribution p() max is appropriate or this. We deine p() := c ln F ε max +. () The + in the denominator has been added to regularize the expression or = max. The actor c/ln F ensures correct normalization ( b+ p() = ). By using ln a b i=a i ln b a, one can show that ln F +ln F c i.e. c or F. In the ollowing we assume F 3, i.e. c 2. Apart rom a minor additional logarithmic suppression o order ln F we have the desired behavior p() or max and p() F otherwise: p( max mε) p() 2 ln F m +, 2 ln F F During optimization, the minimal/maximal itness o an individual in population P t is min/max t. In the deinition o p one has to use F t :={min t,t min +ε,...,t max } instead o F, i.e. F replaced with F t = ε (t max t min )+ F. So () can not be achieved by a static re-parametrization o itness replaced with g(). Furthermore the important idea o sampling rom a itness level instead o individuals directly is still maintained. The only dierence now is that the population will no longer converge to a itness uniorm one but to one with distribution p() which is biased toward higher itness but still never converges to a ittest individual. In the worst case, we expect a small slowdown o the order o ln F as compared to FUSS, as well as compared to STD. Scale independent pair selection. It is possible to (nearly) have the best o independent and dependent selection: a high selection probability p(, ) F i and p(, ) F otherwise, with uniorm marginal p()= 2 F. The idea is to use a strongly correlated joint distribution or selecting a itness pair. A scale independent probability distribution p(, ) is appropriate. We deine the joint probability p(, ) o selecting two individuals o itness and and the marginal p() as p(, ) := 2 F ln F ε +, (2) p() := p(, )= p(,). F F We assume F 3 in the ollowing. The + in the denominator has been added to regularize the expression or =. The actor (2 F ln F ) ensures correct normalization or F. More precisely, using ln b+ a b i=a i ln b a, one can show that ln F, F p(, ), 2 F p(), i.e. p is not strictly normalized to and the marginal p() is only approximately (within a actor o 2) uniorm. The irst deect can be corrected by appropriately increasing the diagonal probabilities p(,). This also solves the second problem. p(, ) := { p(, ) or p(, ) + [ F p()] or = (3) Properties o p(, ). p is normalized to with uniorm marginal p():= p(, )= F, F

13 Fitness Uniorm Optimization 3, Fp(, )= F p(, ) p(, ). p()=, Apart rom a minor additional logarithmic suppression o order ln F we have the desired behavior p(, ) F or and p(, ) F otherwise: 2 p(, ±mε) 2ln F m+ F, p(, ) 2ln F F 2. During optimization, the minimal/maximal itness o an individual in population P t is min/max t. In the deinition o p one has to use F t :={min t,t min +ε,...,t max} instead o F, i.e. F replaced with L:= F t = ε (t max min t )+ F. Scale-Independent Deletion. Just as the selection scheme FUSS has its dual in the deletion scheme FUDS, we can likewise create the dual o Scale-Independent Selection in the orm o Scale-Independent Deletion. Thus rather than targeting deletion rom the population so that the distribution becomes lat, as we do with FUDS, we now deine a convex curve g which is peaked at the ittest individual in the population and delete the population down so that it ollows the shape o this curve. This retains some o the advantages o FUDS, or example the population cannot collapse to just a ew itness levels, and yet it recognizes that or many problems it is useul to bias the population distribution toward it individuals. O course such problems are less deceptive than the kind that FUSS and FUDS are intended or. 0 Continuous Fitness Functions Eective discretization scale. Up to now we have considered a discrete valued itness unction with values in F = { min, min +ε,..., max }. In many practical problems, the itness unction is continuous valued with F =[ min, max ]. We generalize FUSS, and some o the discussion o the previous sections to the continuous case by replacing the discretization scale ε by an eective (time-dependent) discretization scale ˆε. By construction, FUSS shits the population toward a more uniorm one. Although the itness values are no longer equi-spaced, they still orm a discrete set or inite population P. For a itness uniorm distribution, the average distance between (itness) neighboring individuals is P t (t max t min ) =: ˆε. We deine ˆF t := { t min,t min +ˆε,...,t max }. ˆF t = ˆε (t max t min )+= P t. FUSS. Fitness uniorm selection or a continuous valued unction has already been mentioned in Section 3. We just take a uniorm random itness in the interval [min t 2 ˆε,t max+ 2 ˆε]. Independent and dependent itness pair selection as described in the last section works analogously. An ˆε = 0 version o correlated selection does not exist; a non-zero ˆε is important. A discrete pair (, ) is drawn with probability p(, ) as deined in (2) and (3) with ε and F replaced by ˆε and ˆF t. The additional suppression ln ˆF =ln P t is small or all practically realizable population sizes. In all cases an individual with itness nearest to ( ) is selected rom the population P (randomly i there is more than one nearest individual). I we assume a itness uniorm distribution then our worst case bound o T FUSS < TST D t= P t is plausible, since the probability o selecting the best individual is approximately P t. For constant population size we get a bound T FUSS < P T STD. For the preerred non-deletion case with population size P t = t the bound gets much worse T FUSS < 2 T STD 2. This possible (but not necessary!) slowdown has similarities to the slowdown problems o proportionate selection in later optimization stages. The species deinition in Section 8 has to be relaxed by allowing mutation sequences o individuals with ˆε-similar itness. Larger choices o ˆε may be avorable i the standard choice causes problems. FUDS. Fitness uniorm deletion already requires the range o the itness unction to be broken up into a inite number o intervals. While or discrete valued itness unctions the intervals may correspond to the unique values o the itness unction, this is not a requirement. Indeed i the population is small and the itness unction has a large number o possible values then a more coarse discretization is necessary. Continuous valued itness unctions can thereore be treated in exactly the same way and do not cause any special problems. In act they are slightly simpler in that we are now ree to choose the discretization as ine as we like without being limited by the number o possible itness values. O course, like in the discrete case, we still must choose a discretization which is appropriate given the size o the population. The EA Test System To test FUSS and FUDS we have implemented an EA test system in Java. The complete source code along with the test problems presented in this paper and basic usage instructions can be downloaded rom [Leg04]. The EA model we have chosen or our tests is the so called steady state model as opposed to the more usual generational model. In a generational EA in each generation we select an entirely new population based on the old population. The old population is then simply discarded. Under the steady state model that we use, each step o the optimization adds and removes just one individual at a time. Speciically the process occurs as ollows: Firstly an individual is selected by the selection scheme and then with a certain probability another individual is also selected and the crossover operator is applied to produce a new individual. Then with another probability a mutation operator is applied to produce the child individual which is then added to the population. We reer to the probability o crossing as the crossover probability and the probability o mutating ollowing a crossover as the mutation probability. In the case where no crossover takes place the individual is always mutated to ensure that we are not simply adding

14 4 Marcus Hutter & Shane Legg a clone o an existing individual into the population. Finally, an individual must be deleted in order to keep the population size constant. This individual is selected by the deletion scheme. The deletion scheme is important as it has the power to bias the population in a similar way to the selection scheme. Our task in this paper is to experimentally analyze how FUSS perorms relative to other selection schemes and how FUDS perorms relative to other deletion schemes. Because any particular run o a steady state EA requires both a selection and a deletion scheme to be used, there are many possible combinations that we could test. We have narrowed this range o possibilities down to just a ew that are commonly used. Among the selection schemes, tournament selection is one o the simplest and most commonly used and we consider it to be roughly representative o other standard selection schemes which avor the itter individuals in the population; indeed in the case o tournament size 2 it can be shown that tournament selection is equivalent to the linear ranking selection scheme [Hut9, Sec.2.2.4]. With tournament selection we randomly pick a group o individuals and then select the ittest individual rom this group. The size o the group is called the tournament size and it is clear that the larger this group is the more likely we are to select a highly it individual rom the population. At some point in the uture we may implement other standard selection schemes to broaden our comparison, however we expect the perormance o these schemes to be at best comparable to tournament selection when used with a correctly tuned selection intensity. Among the deletion schemes one o the most commonly used in steady state EAs is random deletion. The rational or this is that it is neutral in the sense that it does not skew the distribution o the population in any way. Thus whether the population tends toward high or low itness etc. is solely a unction o the selection scheme and its settings. O course random deletion, unlike FUDS, makes no eort to preserve diversity in the population as all individuals have an equal chance o being removed. In this paper we will compare FUDS against random deletion as this is the standard deletion schemes in situations where it is diicult or impossible to directly measure the similarity o individuals based on their genomes. When reporting test results we will adopt the ollowing notation: TOUR2 means tournament selection with a tournament size o 2. Similarly or TOUR3, TOUR4 and so on. Under random selection, denoted RAND, all members o the population have an equal probability o being selected. This is sometimes called uniorm selection. When a graph shows the perormance o tournament selection over a range o tournament sizes we will simply write TOURx. Naturally FUSS indicates the itness uniorm selection scheme. To indicate the deletion scheme used we will add either the suix -R or -F to indicate random deletion or FUDS respectively. Thus, TOUR0-R is tournament selection with a tournament size o 0 used with random deletion, while is FUSS selection used with FUDS deletion. The important ree parameters to set or each test are the population size, and the crossover and mutation probabilities. Good values or the crossover and mutation probabilities depend on the problem and must be manually tuned based on experience as there are ew theoretical guidelines on how to do this. For some problems perormance can be quite sensitive to these values while or others they are less important. Our deault values are 0.5 or both as this has oten provided us with reasonable perormance in the past. For each test we ran the system multiple times with the same mutation and crossover probabilities and the same population size. The only dierence was which selection and deletion schemes were used by the code. Thus even i our various parameters, mutation operators etc. were not optimal or a given problem, the comparison is still air. Indeed we oten deliberately set the optimization parameters to non-optimal values in order to compare the robustness o the systems. As a steady state optimizer operates on just one individual at a time, the number o cycles within a given run can be high, perhaps 00,000 or more. In order to make our results more comparable to a generational optimizer we divide this number by the size o the population to give the approximate number o generations. Unortunately the theoretical understanding o the relationship between steady state and generational optimizers is not strong. It has been shown that under the assumption o no crossover the eective selection intensity using tournament selection with size 2 is approximately twice as strong under a steady state EA as it is with a generational EA [RPB99b]. As ar as we are aware a similar comparison or systems with crossover has not been perormed. Depending on the purpose o a test run, dierent stopping criteria were applied. For example, in situations where we wanted to graph how rapidly dierent strategies converged with respect to generations, it made sense to ix the number o generations. In other situations we wanted to stop a run once the optimizer appeared to have become stuck, that is, when the maximum itness had not improved ater some speciied number o generations. In any case we explain or each test the stopping criterion that has been used. In order to generate reliable statistics we ran each test multiple times; typically 30 times but sometimes up to 00 times i the results were noisy. From these runs we then calculated the mean perormance as well as the sample standard deviation and rom this the standard error in our estimate o the mean. This value was then used to generate the 95% conidence intervals which appear as error bars on the graphs. 2 A Deceptive 2D Problem The irst problem we examine is the simple but highly deceptive 2D optimization problem which was theoretically analyzed in Section 7. As in the theoretical analysis, we set up the mutation operator to randomly replace either

15 Fitness Uniorm Optimization 5 Deceptive 2D - Random Deletion Deceptive 2D - FUDS Generations To Find Max TOUR3-R TOUR2-R RAND-R Generations To Find Max TOUR3-F TOUR2-F RAND-F Delta Delta 0.0 Figure 6: With random deletion (let graph) FUSS signiicantly outperorms TOURx and RAND. By switching to FUDS (right graph) the perormance o TOURx and RAND now scale the same as FUSS. the x or y position o an individual and the crossover to take the x position rom one individual and the y position rom another to produce an ospring. The size o the domain or which the unction is maximized is just δ 2 which is very small or small values o δ, while the local maxima at itness level 3 covers most o the space. Clearly the only way to reach the global maximum is by leaving this local maximum and exploring the space o individuals with lower itness values o or 2. Thus, with respect to the mutation and crossover operators we have deined, this is a deceptive optimization problem as these partitions mislead the EA [FM93]. For this test we set the maximum population size to,000 and made 20 runs or each δ value. With a steady state EA it is usual to start with a ull population o random individuals. However or this particular problem we reduced the initial population size down to just 0 in order to avoid the eect o doing a large random search when we created the initial population and thereby distorting the scaling. Usually this might create diiculties due to the poor genetic diversity in the initial population. However due to the act that any individual can mutate to any other in just two steps this is not a problem in this situation. Initial tests indicated that reducing the crossover probability rom 0.5 to 0.25 improved the perormance slightly and so we have used the latter value. The irst set o results or the selection schemes used with random deletion appear in the let graph o Figure 6. As expected, higher selection intensity is a signiicant disadvantage or this problem. Indeed even with just a tournament size o 3 the number o generations required to ind the maximum became ineasible to compute or smaller values o δ. Our results conirm the theoretical scaling orders o δ or TOUR2-R, and 2 δ or, as predicted in Section 7. Be aware that this is a log-log scaled graph and so the dierent slopes indicate signiicantly dierent orders o scaling. In the second set o tests we switch rom random deletion to FUDS. These results appear in the right graph o Figure 6. We see that with FUDS as the deletion scheme the scaling improves dramatically or RAND, TOUR2 and TOUR3. Indeed they are now o the same order δ as FUSS, as predicted in Section 7. This shows that or very deceptive problems much higher levels o selection intensity can be applied when using FUDS rather than random deletion. The perormance o is very similar to that o. This is not surprising as the population distribution under FUSS already tends to be approximately uniorm across itness levels and thus we expect the eect o FUDS to be quite weak. Although this problem was artiicially constructed, the results clearly demonstrate how FUSS and FUDS can dramatically improve perormance in some situations. 3 Traveling Salesman Problem A well known optimization problem is the so called Traveling Salesman Problem (TSP). The task is to ind the shortest Hamiltonian cycle (path) in a graph o N vertexes (cities) connected by edges o certain lengths. There exist highly specialized population based optimizers which use advanced mutation and crossover operators and are capable o inding paths less than one percent longer than the optimal path or up to 0 7 cities [LK73, MO96, JM97, ACR00]. As our goal is only to study the relative perormance o selection and deletion schemes, having a highly reined implementation is not important. Thus the mutation and crossover operators we used were quite simple: Mutation was achieved by just switching the position o two o the cities in the solution, while or crossover we used the partial mapped crossover technique [GA85]. Fitness was computed by taking the reciprocal o the tour length. For our irst set o tests we used randomly generated

16 6 Marcus Hutter & Shane Legg Tour Length Random TSP - Random Deletion TOUR2-R TOUR6-R TOUR3-R Tour Length Random TSP - FUDS TOUR2-F TOUR6-F TOUR3-F Generations Generations Figure 7: TOUR3-R converged too slowly while TOUR2-R converged prematurely and became stuck. TOUR6-R appears to be about the correct tournament size or this problem, however it is still inerior to. With FUDS all o the selection schemes perormed well though FUSS was still the best. TSP problems, that is, the distance between any two cities was chosen uniormly rom the unit interval [0,]. We chose this as it is known to be a particularly deceptive orm o the TSP problem as the usual triangle inequality relation does not hold in general. For example, the distance between cities A and B might be 0., between cities B and C 0.2, and yet the distance between A and C might be 0.8. The problem still has some structure though as eicient partial solutions tend to be useul building blocks or eicient complete tours. For this test we used random distance TSP problems with 20 cities and a population size o 000. We ound that changing the crossover and mutation probabilities did not improve perormance and so these have been let at their deault values o 0.5. Our stopping criterion was simply to let the EA run or 300 generations as this appeared to be adequate or all o the methods to converge and allowed us to easily graph perormance versus generations. The irst graph in Figure 7 shows each o the selection schemes used with random deletion. We see that TOUR3- R has insuicient selection intensity or adequate convergence while TOUR2-R quickly converges to a local optimum and then becomes stuck. TOUR6-R has about the correct level o selection intensity or this problem and population size. however initially converges as rapidly as TOUR2-R but avoids becoming stuck in local optima. This suggests improved population diversity. The perormance curve or is impressive, especially considering that it is parameterless. At irst it might seem surprising that the maximum itness with FUSS climbs very quickly or the irst 20 generations, especially considering that FUSS makes no attempt to increase the average itness in the population. However we can explain this very rapid rise in solution itness by considering a simple example. Consider a situation where there is a large number o individuals in a small band o itness levels, say,000 with itness values ranging rom 50 to 70. Add to this population one individual with a itness value o 73. Thus the total itness range contains 24 values. Whenever FUSS picks a random point rom 72 to 73 inclusive this single individual with maximal itness will be selected. That is, the probability that the single ittest individual will be selected is 2/24 = In comparison under TOUR2 the probability that the ittest individual is selected is the same as the probability that it is picked or the sample o 2 elements used or the tournament, which is approximately, 2/000 = Thus the probability o the ittest individual in the population being selected is higher under FUSS than under TOUR2 and so the maximum itness would rise quickly to start with. Previously in [LHK04] we speculated that this may have been responsible or perormance problems that we had observed with FUSS in some situations. However urther experimentation has shown that very rapid rises in maximal itness are quite rare and are also very shortly lived when they do occur too short to cause any signiicant diversity problems in the population. We now believe that the population distribution is to blame in these situations; something that we will explore in detail in Section 5. The second graph in Figure 7 shows the same set o selection schemes but now using FUDS as the deletion scheme. With FUDS the perormance o all o the selection schemes either stayed the same or improved. In the case o TOUR3 the improvement was dramatic and or TOUR2 the improvement was also quite signiicant. This is interesting because it shows that with itness uniorm deletion, perormance can improve when the selection intensity is either too high or too low. That is, when using FUDS the perormance o the EA now appears to be more robust with respect to variation in selection intensity. In the case o TOUR2-F this is evidence o improved

17 Fitness Uniorm Optimization 7 Random TSP Population 250 Random TSP Population Tour Length TOURx-R TOURx-F Tour Length TOURx-R TOURx-F Tournament Size Tournament Size Random TSP Population 000 Random TSP Population Tour Length TOURx-R TOURx-F Tour Length TOURx-R TOURx-F Tournament Size Tournament Size Figure 8: The perormance o TOURx-F is much more stable than TOURx-R under variation in the selection intensity. Also both and produce very good results, especially with the larger populations. population diversity as the EA is no longer becoming stuck. However or TOUR3-R the selection intensity is quite low and thus we would expect the population diversity to be relatively good. Thus the act that TOUR3-F was so much better than TOUR3-R suggests that FUDS can have signiicant perormance beneits that are not related to improved population diversity. Investigating urther it seems that this eect is due to the way that FUDS ocuses the deletion on the large mass o individuals which have an average level o itness while completely leaving the less common it individuals alone. This helps a system with very weak selection intensity move the mass o the population up through the itness space. With higher selection intensity this problem tends not to occur as individuals in this central mass are less likely to be selected thus reducing the rate at which new individuals o average itness are added to the population. In order to better understand how stable FUDS perormance is when used with dierent selection intensities we ran another set o tests on random TSP problems with 20 cities and graphed how perormance varied by tournament size. For these tests we set the EA to stop each run when no improvement had occurred in 40 generations. We also tested on a range o population sizes: 250, 500, 000 and The results appear in Figure 8. In these graphs we can now clearly see how the perormance o TOURx-R varies signiicantly with tournament size. Below the optimal tournament size perormance worsened quickly while above this value it also worsened, though more slowly. Interestingly, with a population size o 5000 the optimal tournament size was about 6, while with small populations the optimal value ell to just 4. Presumably this was partly because smaller populations have lower diversity and thus cannot withstand as much

18 8 Marcus Hutter & Shane Legg selection intensity. In contrast and appear as horizontal lines as they do not have a tournament size parameter. We see that they have perormed as well as the optimal perormance o TOURx-R without requiring any tuning. Indeed or larger populations appears to be even better than the optimally tuned perormance o TOURx-R. This is a very positive result or the parameterless FUSS. Comparing FUDS with random deletion we also see impressive results. For every combination o selection scheme, tournament size and population size the result with FUDS was better than the corresponding result with random deletion, and in some cases much better. Furthermore these graphs clearly display the improved robustness o tournament selection with FUDS as TOURx-F produced near optimal results or all tournament sizes. Even with an optimally tuned tournament size FUDS increased perormance, particularly with the smaller populations. Indeed or each population size tested the worst perormance o TOURx-F was equal to the best perormance o TOURx-R. With FUSS there was also a perormance advantage when using FUDS, again more so with the smaller populations. The combination o both FUSS and FUDS was especially eective as can be seen by the consistently superior perormance o across all o the graphs. More tests were run exploring perormance with up to 00 cities. Although the perormance o FUDS remained stronger than random deletion or very low selection intensity, or high selection intensity the two were equal. We believe that the reason or this is the ollowing: When the space o potential solutions is very large inding anything close to a global optimum is practically impossible, indeed it is diicult to even ind the top o a reasonable local optimum as the space has so many dimensions. In these situations it is more important to put eort into simply climbing in the space rather than spreading out and trying to thoroughly explore. Thus higher selection intensity can be an advantage or large problem spaces. At any rate, or large problems and with high selection intensity FUDS did not appear to hinder the perormance, while with low selection intensity it continued to signiicantly improve it. Experiments were also perormed using the more eicient 2-Opt mutation operator. As expected, this increased perormance and allowed much higher selection pressure to be used. O course the problem then no longer had the kind o deceptive structure that heavily punishes high selection pressure that we are looking or. Nevertheless, FUDS continued to signiicantly boost the perormance o tournament selection, in particular when the tournament size was too small. 4 Set Covering Problem The set covering problem (SCP) is a reasonably well known NP-complete optimization problem with many real world applications. Let M {0,} m n be a binary valued matrix and let c j >0 or j {,...n} be the cost o column j. The goal is to ind a subset o the columns such that the cost is minimized. Deine x j = i column j is in our solution and 0 otherwise. We can then express the cost o this solution as n j= c jx j subject to the condition that n j= m ijx j or i {,...m}. Our system o representation, mutation operators and crossover ollow that used by Beasley [BC96] and we compute the itness by taking the reciprocal o the cost. The results presented here are based on the scp42 problem rom a standard collection o SCP problems [Bea03]. The results obtained on other problems in this test set were similar. We ound that increasing the crossover probability and reducing the mutation probability improved perormance, especially when the selection intensity was low. Thus we have tested the system with a crossover probability o 0.8 and a mutation probability o 0.2. We perormed each test at least 50 times in order to minimize the error bars. Our stopping criterion was to terminate each run ater no improvement in minimal cost had occurred or 40 generations. The results or this test appear in Figure 9. Similar to the TSP graphs we again see the importance o correctly tuning the tournament size with TOURx-R. We also see the optimal range o perormance or TOURx- R moving to the right as the population sizes increases. This is what we would expect due to the greater diversity in larger populations. This kind o variability is one o the reasons why the selection intensity parameter usually has to be determined by experimentation. Unlike with TSP however, the perormance o FUSS was less convincing in these results. With the smaller populations o 250 and 500 was only better than TOURx-R when the tournament size was very low or very high. With the larger populations o,000 and 5,000 the results were much better with perorming as well as the optimal perormance o TOURx-R. perormed better than, in particular with the smaller populations though this improvement was still insuicient or it to match the optimal perormance o TOURx-R in these cases. The act that the perormance o FUSS varied by population size suggests that FUSS might be experiencing some kind o population diversity problem. We will look more careully at diversity issues in the next section. With FUDS the results were again very impressive. As with the TSP tests; or all combinations o selection scheme, tournament size and population size that we tested, the perormance with FUDS was superior to the corresponding perormance with random deletion. This was true even when the tournament size was optimal. While the perormance o TOURx-F did vary signiicantly with dierent tournament sizes, the results were more robust than TOURx-R, especially with the larger populations. Indeed or the larger two populations we again have a situation where the worst perormance o TOURx-F is equal to the optimal perormance o TOURx-R.

19 Fitness Uniorm Optimization 9 SCP42 Population 250 SCP42 Population Solution Cost TOURx-R TOURx-F Solution Cost TOURx-R TOURx-F Tournament Size Tournament Size SCP42 Population 000 SCP42 Population TOURx-R TOURx-F 570 TOURx-R TOURx-F Solution Cost Solution Cost Tournament Size Tournament Size Figure 9: The perormance o FUSS or the two smaller populations was relatively poor, while or the larger populations it matched the optimal perormance o TOURx-R. FUDS again produced superior results to random deletion in all situations tested. 5 Maximum CNF3 SAT Maximum CNF3 SAT is a well known NP hard optimization problem [CK03] that has been extensively studied. A three literal conjunctive normal orm (CNF) logical equation is a boolean equation that consists o a conjunction o clauses where each clause contains a disjunction o three literals. So or example, (a b c) (a e ) is a CNF3 expression. The goal in the maximum CNF3 SAT problem is to ind an instantiation o the variables such that the maximum number o clauses evaluate to true. Thus or the above equation i a=f, b=t, c=t, e=t, and =F then just one clause evaluates to true and thus this instantiation gets a score o one. Achieving signiicant results in this area would be diicult and this is not our aim; we are simply using this problem as a test to compare selection and deletion schemes. Our test problems have been taken rom the SATLIB collection o SAT benchmark tests [HS00]. The irst test was perormed on the ull set o 00 instances o randomly generated CNF3 ormula with 50 variables and 645 clauses, all o which are known to be satisiable. Based on test results the crossover and mutation probabilities were let at the deault values. Our mutation operator simply lips one boolean variable and the crossover operator orms a new individual by randomly selecting or each variable which parent s state to take. Fitness was simply taken to be the number o classes satisied. Again we tested across a range o tournament sizes and population sizes. The results o these tests appear in Figure 0. We have shown only the population sizes o 500 and 5,000 as the other population sizes tested ollowed the same pattern. Interestingly or this problem there was no evidence o better perormance with FUDS at higher selection intensities. Nor or that matter was there the decline in perormance with TOURx-R that we have seen elsewhere. Indeed with random deletion the selection intensity appeared to have no impact on perormance at all.

20 20 Marcus Hutter & Shane Legg CNF3 SAT50 Population CNF3 SAT50 Population Clauses Satisied TOURx-R TOURx-F Tournament Size Clauses Satisied TOURx-R TOURx-F Tournament Size Figure 0: With low selection intensity TOURx-F perormed slightly below TOURx-R, but was otherwise comparable. FUSS had serious diiculties. While SAT3 CNF is an NP hard optimization problem, this lack o dependence o our selection intensity parameter suggests that it may not have the deceptive structure that FUSS and FUDS are designed or. With low selection intensity FUDS caused perormance to all below that o random deletion; something that we have not seen beore. Because the advantages o FUDS have been more apparent with low populations in other test problems, we also tested the system with a population size o only 50. Unortunately no interesting changes in behavior were observed. While FUDS had minor diiculties, FUSS had serious problems or all the population sizes that we tested. We suspected that the uniorm nature o the population distribution that should occur with both FUSS and FUDS might be to blame as we only expect this to be a beneit or very deceptive problems which are sensitive to the tuning o the selection intensity parameter. Thus we ran the EA with a population o 000 and graphed the population distribution across the number o clauses satisied at the end o the run. We stopped each run when the EA made no progress in 40 generations. The results o this appear in Figure. The irst thing to note is that with TOUR4-R the population collapses to a narrow band o itness levels, as expected. With TOUR4-F the distribution is now uniorm, though practically none o the population satisies ewer than 550 clauses. The reason or this is quite simple: While FUDS levels the population distribution out, TOUR4 tends to select the most it individuals and thus pushes the population to the right rom its starting point. In contrast, FUSS pushes the population toward currently unoccupied itness levels. This results in the population spreading out in both directions and so the number o individuals with extremely poor itness is much higher. Given that our goal is to ind an instantiation that satisies all 645 clauses, it is questionable whether having a large percentage o the population unable to satisy even 600 clauses is o much beneit. While the total population diversity under might be very high, perhaps the kind o diversity that matters the most is the diversity among the relatively it individuals in the population. This should be true or all but the most excessively deceptive problems. By thinly spreading the population across a very wide range o itness levels we actually end up with very ew individuals with the kind o diversity that matters. O course this depends on the nature o the problem we are trying to solve and the itness unction that we use. Fortunately with CNF3 SAT we can directly measure population diversity by taking the average hamming distance between individuals genomes. While this means that the value o the itness based similarity metric is questionable or this problem, as more direct methods like crowding can be applied, it is a useul situation or our analysis as it allows us to directly measure how eective FUSS and FUDS are at preserving population diversity. The hope o course is that any positive beneits that we have seen here will also carry over to problems where directly measuring the diversity is problematic. For the diversity tests we used a population size o 000 again. For comparison we used FUSS, TOUR3 and TOUR2 both with random deletion and with FUDS. In each run we calculated two dierent statistics: The average hamming distance between individuals in the whole population, and the average hamming distance between individuals whose itness was no more than 20 below the ittest individual in the population at the time. These two measurements give us the total population diversity and high itness diversity graphs in Figure 2. We graphed these measurements against the solution cost o the ittest individual rather than the number o generations. This is only air because i good solutions are ound very quickly then an equally rapid decline in diversity is acceptable and to be expected. Indeed it is trivial to come up with a system which always maintains high population diversity how ever long it runs, but is unlikely

21 CNF TOUR4-F Population Distribution CNF Population Distribution n 600 oit al up op Clauses Satisied CNF Population Distribution n 600 oit al up op Clauses Satisied Figure : With TOUR4-R the population collapses to a narrow band o itness levels while with TOUR4-F the distribution is lat. Under FUSS the population spreads out in both directions with in particular giving an extremely uniorm distribution. to ind any good solutions. The results were averaged over the perormance problems with. all 00 problems in the test set. Because the best solution The top right graph shows the same selection schemes, ound in each run varied we have only graphed each curve but this time with FUDS. As expected FUDS has signiicantly improved the total population diversity with both until such a point where ewer than 50% o the runs were able to achieve this level o itness. Thus the terminal TOUR3 and TOUR2, while having little impact on FUSS point at the right o each curve is representative o airly which already has a relatively lat population distribution. As the maximal solution ound by TOUR3-F and typical runs rather than just a ew exceptional ones that perhaps ound unusually good solutions by chance. TOUR2-F were not better than TOUR3-R and TOUR2- The top two graphs in Figure 2 show the total population diversity. As expected the diversity with TOUR3-R is not a signiicant actor in the perormance o the EA or R, this indicates that improved total population diversity and TOUR2-R decline steadily as inding better solutions this type o optimization problem. That FUDS has lited becomes increasingly diicult and the population tends to the total diversity or TOUR3 and TOUR2 so that they collapse into a narrow band o itness. As we would expect, are now above, is particularly interesting. This the total population diversity with TOUR3-R is higher suggests that while FUSS has high total population diversity, there appears to be some more subtle eects that than with TOUR2-R. While declines initially it then stabilizes at around 50 beore becoming stuck. As the are causing the diversity to be lower than it could be. It TOUR3-R and TOUR2-R curves both extend urther to may be related to the act the FUSS sometimes heavily selects rom small groups within the population during the the right, even though the total population diversity becomes quite low, this show that diversity problems in the early stages o the optimization process, as we noted in population as a whole are not a signiicant actor behind Section 3. However we are not certain whether this is Population Clauses Satisied Population CNF TOUR4-R Population Distribution Clauses Satisied Fitness Uniorm Optimization 2

22 22 Marcus Hutter & Shane Legg CNF3 SAT Total Fitness Diversity CNF3 SAT Total Fitness Diversity Average Hamming Distance TOUR3-R TOUR2-R Average Hamming Distance TOUR3-F TOUR2-F Average Hamming Distance Clauses Satisied CNF3 SAT Top Fitness Diversity TOUR3-R TOUR2-R Clauses Satisied Average Hamming Distance Clauses Satisied CNF3 SAT Top Fitness Diversity TOUR3-F TOUR2-F Clauses Satisied Figure 2: While the total population diversity is very strong under FUSS, the diversity among it individuals is weak. FUDS improves the total population diversity compared to random deletion, but has little eect on the diversity among the it individuals. occurring in this case. On the lower set o graphs we see the diversity among the itter individuals in the population; speciically those whose itness is no more than 20 below the ittest individual in the population at the time. On the irst graph on the let we see that TOUR3 has signiicantly greater diversity than TOUR2 with both deletion schemes. This is expected as TOUR3 tends to search more evolutionary paths while TOUR2 just rushes down a ew. Disappointingly FUDS does not appear to have made much dierence to the diversity among these highly it individuals, though the curves do latten out a little as the diversity drops below 30, so perhaps FUDS is having a slight impact. For both and the diversity among the it individuals was poor, indeed it was even worse than TOUR2 or both deletion schemes. Thus, while the total population diversity with FUSS tends to be high, the diversity among the ittest individuals in the population can be quite poor. Furthermore, the curves or high itness diversity all end once the diversity drops into the 2 to 7 range. As this pattern was absent rom the graphs o total population diversity, this indicates that it is indeed the diversity among the relatively it individuals in the population that most determines when the EA is going to become stuck. In summary, these results show that while FUSS has been successul in maximizing total population diversity, or problems such as CNF3 SAT this is not suicient. It appears to be more important that the EA maximizes the diversity among those individuals which have higher itness and in this regard FUSS is poor, which leads to poor perormance. This is most likely a characteristic o optimization problems which, while still diicult, are not as deceptive as SCP or random TSP.

23 Fitness Uniorm Optimization 23 6 Conclusions and Future Research Directions We have addressed the problem o balancing the selection intensity in EAs, which determines speed versus quality o a solution. We invented a new itness uniorm selection scheme FUSS. It generates a selection pressure toward sparsely populated itness levels. This property is unique to FUSS as compared to other selection schemes (STD). It results in the desired high selection pressure toward higher itness i there are only a ew it individuals. The selection pressure is automatically reduced when the number o it individuals increases. We motivated FUSS as a scheme which bounds the number o similar individuals in a population. We deined a universal similarity relation solely depending on the itness, independent o the problem structure, representation and EA details. We showed analytically by way o a simple example that FUSS can be much more eective than STD. A joint pair selection scheme or recombination has been deined. A heuristic worst case analysis o FUSS compared to STD has been given. For this, the itness tree model has been deined, which is an interesting analytic tool in itsel. FUSS solves the problem o population takeover and the resulting loss o genetic diversity o STD, while still generating enough selection pressure. It does not help in getting a more uniorm distribution within a itness level. We have also invented a related system called FUDS which achieves a similar eect to FUSS except that it works through deletion rather than through selection. This means that FUDS shares many o the important characteristics o FUSS including strong total population diversity and the impossibility o population collapse. We showed analytically that or a simple deceptive optimization problem the perormance o STD when used with FUDS scales similarly to FUSS. A test system has been constructed and used to evaluate the empirical perormance o both FUSS and FUDS on a range o optimization problems with dierent population sizes, mutation probabilities and crossover probabilities. Their perormance has been compared to the more standard methods o tournament selection and random deletion. For the artiicial deceptive 2D optimization problem and random distance matrix TSP problems both FUSS and FUDS perormed extremely well. For the deceptive 2D problem they dramatically improved the scaling exponent in the number o generations needed to ind the global optimum. For the TSP problems perormed as well as optimally tuned TOURx-R or all population sizes, and FUDS caused TOURx to perorm near optimally or all tournament sizes and population sizes. With SCP problems with small populations the perormance o was only better than TOURx-R when the tournament size was poorly set. For populations larger than,000 however, continued to perorm as well as the optimal results or TOURx-R. FUDS was again consistently superior returning better results than random deletion or every combination o selection scheme, tournament size and population size tested. For CNF3 SAT problems we ran into diiculties however. While FUDS signiicantly improved the perormance o FUSS, it was inerior to random deletion or low selection intensities. In other cases the perormance was comparable. FUSS however had serious perormance problems. Further investigations revealed that this appears to be due to the small number o individuals in the population that have relatively high itness when using FUSS. We measured the diversity in the population and ound that while the total population diversity with FUSS was high, the diversity among the it individuals was relatively poor. This produced a serious diversity problem in the population when combined with the act that there are relatively ew individuals o high itness when using FUSS. As the perormance o TOURx-R was not impacted by high selection intensity on the CNF3 SAT problem this indicates that this problem does not have the kind o deceptive nature that harshly punishes greedy exploration that we were looking or. Perhaps or such problems a less extreme approach is called or. For example, rather than trying to spread the population across all itness levels uniormly we should instead control the distribution so that it is biased toward high itness but never collapses totally as it does with TOURx-R. We have experimented with a deletion scheme which deletes the population distribution down to a convex curve peaked at the ittest individual in the population. This is the deletion equivalent o the scale independent selection scheme described in Section 9. Our results thus ar indicate that the perormance is equal or slightly superior to random deletion in all situations. However the dramatic improvements that FUDS has over random deletion in some cases are now less signiicant. Another possibility is to manipulate the itness unction to eectively achieve the same thing. For example, we have ound that by taking the itness to be the reciprocal o the number o unsatisied clauses in the CNF3 SAT problem the perormance o FUSS improves signiicantly, indeed it is then comparable to TOURx. Perhaps however it would be better to avoid these perormance tricks and instead ocus on extremely deceptive problems where high selection intensity is heavily punished, that is, the kinds o problems that FUSS and FUDS were speciically designed or. Acknowledgments. This work was supported by SNF grants and Reerences [ACR00] D. Applegate, W. Cook, and A. Rohe. Chained Lin-Kernighan or large traveling salesman problems. Technical report, Department o Computational and Applied Mathematics, Rice University, Houston, TX, [Bak85] J. E. Baker. Adaptive selection methods or genetic algorithms. In Proc. st International Conerence on Genetic Algorithms and their Applications, pages 0, Pittsburgh, PA, 985. Lawrence Erlbaum Associates.

24 24 Marcus Hutter & Shane Legg [BC96] J. Beasley and P. Chu. A genetic algorithm or the set covering problem. European Journal o Operational Research, 94: , 996. [Bea03] J. Beasley. Or-library. mscmga.ms.ic.ac.uk/jeb/orlib/scpino.html, [BHS9] T. Bäck, F. Homeister, and H. P. Schweel. A survey o evolution strategies. In Proc. 4th International Conerence on Genetic Algorithms, pages 2 9, San Diego, CA, July 99. Morgan Kaumann. [BT95] [BT97] T. Blickle and L. Thiele. A mathematical analysis o tournament selection. In Proc. Sixth International Conerence on Genetic Algorithms (ICGA 95), pages 9 6, San Francisco, Caliornia, 995. Morgan Kaumann Publishers. T. Blickle and L. Thiele. A comparison o selection schemes used in evolutionary algorithms. Evolutionary Computation, 4(4):36 394, 997. [Cav70] D. J. Cavicchio. Adaptive search using simulated evolution. PhD thesis, Unpublished doctoral dissertation, University o Michigan, Ann Arbor, 970. [CJ9] R. J. Collins and D. R. Jeerson. Selection in massively parallel genetic algorithms. In Proc. Fourth International Conerence on Genetic Algorithms, San Mateo, CA, 99. Morgan Kaumann Publishers. [CK03] P. Crescenzi and V. Kann. A compendium o NP optimization problems. viggo/problemlist/compendium.html, [EHM99] A. E. Eiben, R. Hinterding, and Z. Michalewicz. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):24 4, 999. [Esh9] L. J. Eshelman. The CHC adaptive search algorithm: How to sae search when engaging in nontraditional genetic recombination. In G. J. E. Rawlings, editor, Foundations o genetic algorithms, pages Morgan Kaumann, San Mateo, 99. [FM93] S. Forrest and M. Mitchell. What makes a problem hard or a genetic algorithm? Some anomalous results and their explanation. Machine Learning, 3(2 3):285 39, 993. [GA85] [GD9] [Gol89] [GR87] D. Goldberg and R. Lingle. Alleles. Loci and the traveling salesman problem. In Proc. International Conerence on Genetic Algorithms and their Applications, pages Lawrence Erlbaum Associates, 985. D. E. Goldberg and K. Deb. A comparative analysis o selection schemes used in genetic algorithms. In G. J. E. Rawlings, editor, Foundations o genetic algorithms, pages Morgan Kaumann, San Mateo, 99. D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, Mass., 989. D. E. Goldberg and J. Richardson. Genetic algorithms with sharing or multi-modal unction optimization. In Proc. 2nd International Conerence on Genetic Algorithms and their Applications, pages 4 49, Cambridge, MA, July 987. Lawrence Erlbaum Associates. [Her92] M. Herdy. Reproductive isolation as strategy parameter in hierarchically organized evolution strategies. In Parallel problem solving rom nature 2, pages , Amsterdam, 992. North-Holland. [Hol75] [HS00] John H. Holland. Adpatation in Natural and Artiicial Systems. University o Michigan Press, Ann Arbor, MI, 975. H. H. Hoos and T. Stützle. SATLIB: An Online Resource or Research on SAT. In SAT 2000, pages IOS press, [Hut9] M. Hutter. Implementierung eines Klassiizierungs- Systems. Master s thesis, Theoretische Inormatik, TU München, pages with C listing, in German, marcus/ai/pcs.htm. [Hut02] M. Hutter. Fitness uniorm selection to preserve genetic diversity. In Proc Congress on Evolutionary Computation (CEC-2002), pages , Washington D.C, USA, May IEEE. [JM97] [Jon75] D. S. Johnson and A. McGeoch. The traveling salesman problem: A case study. In E. H. L. Aarts and J. K. Lenstra, editors, Local Search in Combinatorial Optimization, Discrete Mathematics and Optimization, chapter 8, pages Wiley-Interscience, Chichester, England, 997. K. de Jong. An analysis o the behavior o a class o genetic adaptive systems. Dissertation Abstracts International, 36(0), 540B, 975. [Leg04] S. Legg. Website. shane, [LH05] S. Legg and M. Hutter. Fitness uniorm deletion or robust optimization. In Proc. Genetic and Evolutionary Computation Conerence (GECCO 05), pages , Washington, OR, ACM SigEvo. [LHK04] S. Legg, M. Hutter, and A. Kumar. Tournament versus itness uniorm selection. In Proc Congress on Evolutionary Computation (CEC 04), pages , Portland, OR, IEEE. [LK73] S. Lin and B. W. Kernighan. An eective heuristic or the travelling salesman problem. Operations Research, 2:498 56, 973. [MO96] O. Martin and S. Otto. Combining simulated annealing with local search heuristics. Annals o Operations Research, 63:57 75, 996. [MSV94] Heinz Mühlenbein and Dirk Schlierkamp-Voosen. The science o breeding and its application to the breeder genetic algorithm (BGA). Evolutionary Computation, (4): , 994. [MT93] M. de la Maza and B. Tidor. An analysis o selection procedures with particular attention paid to proportional and Boltzmann selection. In Proc. 5th International Conerence on Genetic Algorithms, pages 24 3, San Mateo, CA, USA, 993. Morgan Kaumann. [RPB99a] A. Rogers and A. Prügel-Bennett. Genetic drit in genetic algorithm selection schemes. IEEE Transactions on Evolutionary Computation, 3(4): , 999. [RPB99b] A. Rogers and A. Prügel-Bennett. Modelling the dynamics o a steady-state genetic algorithm. In Wolgang Banzha and Colin Reeves, editors, Foundations o Genetic Algorithms 5, pages Morgan Kaumann, San Francisco, CA, 999.

25 Fitness Uniorm Optimization 25 [Rud00] Günter Rudolph. On takeover times in spatially structured populations: array and ring. In K. K. Lai, O. Katai, M. Gen, and B. Lin, editors, Proceedings o the Second Asia-Paciic Conerence on Genetic Algorithms and Applications (APGA 00), pages 44 5, Hong Kong, PR China, Global-Link Publishing Company. [SVM94] D. Schlierkamp-Voosen and H. Mühlenbein. Strategy adaptation by competing subpopulations. In Parallel Problem Solving rom Nature PPSN III, pages , Berlin, 994. Springer. Lecture Notes in Computer Science 866. [Whi89] D. Whitley. The GENITOR algorithm and selection pressure: Why rank-based allocation o reproductive trials is best. In Proc. Third International Conerence on Genetic Algorithms (ICGA 89), pages 6 23, San Mateo, Caliornia, 989. Morgan Kaumann Publishers, Inc. Marcus Hutter received the M.Sc. degree in computer science and the Ph.D. degree in theoretical particle physics rom the (Technical) University, Munich, Germany, in 992 and 995, respectively. Thereater, he developed algorithms in a medical sotware company or ive years. Since 2000, he has published over 35 research papers while a Researcher with the Dalle Molle Institute or Artiicial Intelligence (IDSIA), Lugano, Switzerland. He is the author o Universal Artiicial Intelligence (EATCS: Springer, 2004). His current interests are centered around reinorcement learning, algorithmic inormation theory and statistics, universal induction schemes, adaptive control theory, and related areas. Shane Legg received the B.C.M.S. degree in mathematical and computer sciences rom the University o Waikato, Hamilton, New Zealand, in 996 and the M.Sc. degree in mathematics rom Auckland University, Auckland, New Zealand, in 997. He is currently working towards the Ph.D. degree at the Dalle Molle Institute or Artiicial Intelligence (IDSIA), Lugano, Switzerland. Ater receiving the M.Sc. degree in 997, he then worked in a number o companies in New Zealand and the United States mainly ocusing on commercial applications o artiicial intelligence. His research is ocused on genetic algorithms, complexity theory, and theoretical models o artiicial intelligence.