The Importance of semantic context in tree based GP and its application in defining a less destructive, context aware crossover for GP

by
Hammad Majeed B.Sc.

Supervisor: Dr. Conor Ryan
External Examiner: Dr. William B. Langdon

A thesis for the PhD Degree
Submitted to the University of Limerick
October 2007

Declaration

I hereby declare that the work presented in this thesis is original, except where an acknowledgement is made or a reference is given to other work, and that I have read the University handbook of academic administration and accept its procedures.

Student: Hammad Majeed    Signed:    Date: November 2, 2007
Supervisor: Dr. Conor Ryan    Signed:    Date: November 2, 2007

Abstract

This thesis gives empirical proof of the existence of competitive building blocks in Grammatical Evolution (GE), a grammar based program evolving algorithm. It shows that in GE both rooted and non-rooted building blocks exist and that, over time, rooted building blocks compete with each other to grow in size, while non-rooted building blocks help them to do so. This is an offline study, carried out retrospectively.

We also present a comprehensive study of the importance of the semantic context of a sub-tree in tree based systems and introduce a novel context aware technique for evaluating sub-trees in context. The usefulness of this technique is demonstrated on a benchmark problem.

In this work, we introduce a new constructive and context aware crossover for GP, Context-Aware crossover, which works by placing the selected sub-trees in their best possible context in any tree. This greedy approach results in improved performance. It was tested on a wide range of problems and performed better on all of them except the Uni-Variate and Bi-Variate Polynomial Symbolic Regression problems. Furthermore, the results show that it generates very compact trees without adversely affecting their fitness.

Finally, we show the usefulness of the context aware evaluation technique in encapsulating useful trees at the end of a run and using them to create a module repository. This repository is later used to improve the performance of a second, cascaded run. For the second run, a variant of context-aware crossover is introduced, Context-Aware mutation, which works on the module repository. The effectiveness of this setup is demonstrated by re-examining a real world blood flow problem and improving on the previously published results.

Dedicated to my whole family for their support and priceless love over the years.

Acknowledgements

All praise to God Almighty, who provided me with an opportunity to work with a group of competent and nice people. He has been most merciful and gracious to me all my life.

I would like to thank my supervisor, Dr. Conor Ryan, for mentoring me during my research, helping me at different levels and teaching me the art of research. It would be inappropriate if I did not thank him for being considerate towards my religion and beliefs. He always did what was in his power to accommodate me and my beliefs. I can't thank him enough for that. Lastly, I am also thankful to him for reading this thesis in such a short time and helping me to improve it with his valuable feedback and suggestions. Without his help it would not have been possible.

Dr. Raja Muhammad Atif Azad, an old buddy and colleague, provided me with good company and helped me many times during this research. He made my stay comfortable, to say the least. Thank you, Atif!

I would like to thank Eoin Ryan, David Power and Fiacc Larkin for keeping the Beowulf cluster up and running for our resource hungry experiments. In general, I would like to thank the whole BDS group (David Wallin, Dr. Miguel Nicolau, Darwin Slattery and Miriam O'Riordan) for their support in all these years. The addition of Lorrain Morgan to our group was a big blessing. She made our lives easy and took care of various tasks with zeal and interest. Recently, she spent an hour and a half printing this thesis for me. Thanks, I appreciate it!

I would also like to thank Jennifer Willies, a great administrator, who always paid special attention to making my visits to EuroGP hassle free. She is a woman of extraordinary talent. Finally, I would like to thank Gemma Swift, who always greets me with a smile.

Contents

Declaration
Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
    1.1 Overview of the Thesis
    1.2 Contribution of the Thesis
    1.3 Outline of the Thesis

2 Background and Related Work
    2.1 Genetic Programming
        2.1.1 Problem Specification
        2.1.2 Initial Population
        2.1.3 Fitness measure
        2.1.4 Selection method
        2.1.5 Recombination operators
        2.1.6 Building Block hypothesis
    2.2 Where it all started
    2.3 Grammatical Evolution (GE)
    Ripple Effect and Ripple Crossover
    Building Blocks in GE
    Experimental Design
    Results
    Analysis
    Extending the study to phenomic level
    Problems with tree based representation
    Inefficient genetic operators
    Summary

3 Complexity Involved in Tree Structures
    Quantifying destructiveness of crossover
    Parent to children fitness ratio
    Using Robinson and Foulds topologic distance
    Structural Approaches to Define Sub-tree's Context
    Strong Context Preserving Crossover & Weak Context Preserving Crossover
    Depth Dependent Crossover
    Homologous Crossover for GP
    Size fair and Homologous crossovers for GP
    Uniform Crossover for GP
    Functional Approaches to Define Sub-tree's Context
    Brood Recombination in GP
    Selective Crossover
    Heuristic Based Approaches
    Simple Dominance Selective Crossover
    3.4.2 Looseness Control Crossover
    Salient features of an ideal crossover
    Constructiveness
    Maintain diversity
    Contain code bloat
    Non-dependent on inaccurate heuristics
    Non-dependent on structural preservation
    Universally applicable
    Inexpensive
    Easy to implement
    Defining the context of a sub-tree in tree based GP
    What is the best way to define the context of a sub-tree?
    Syntax based
    Fitness based
    Summary
    Our definition of the context of a sub-tree
    Summary

4 Evaluating GP Schema in Context
    Examining Schema in Context
    Generalization of schema
    Contribution of schema
    Limitation of this approach
    Experiments
    Schema contribution to fitness and size
    Comparing schemata within runs
    Comparing Schemata across runs
    Context or depth?
    4.3 Benefit of this study
    Modules Discovery and Use
    Identification of dead code
    Summary

5 Context-Aware Crossover Operator for GP
    Context-Aware crossover
    Salient features
    Improving the performance of GP by employing context-aware crossover
    Order of application of the crossover operators
    Problem domains examined
    Experimental setup
    Quartic Polynomial Symbolic Regression
    11-Bit Multiplexer
    Lawnmower
    Analysis of the results
    Summary

6 Optimising the Use of Context-Aware and Standard Crossovers
    Optimising GP's Performance
    Varying order and rates of crossovers
    Effect of standard crossover on a GP run
    Testing the hypothesis
    Experimental Setup
    Quartic Polynomial Symbolic Regression
    11-Bit Multiplexer
    Lawnmower
    6.2.5 Results analysis
    Summary

7 Validation of the Hypothesis
    A few important concepts
    Experimental setup
    Performance on training cases
    Performance on test cases
    Size of the generated trees
    Count of different types of crossover events
    Fitness gain of the children over their parents
    Quartic Polynomial Symbolic Regression problem
    Results
    Analysis of the results
    Uni-variate Polynomial Symbolic Regression problem
    Results
    Analysis of the results
    Bi-variate Polynomial Symbolic Regression problem
    Results
    Analysis of the results
    11-Bit Multiplexer problem
    Results
    Analysis of the results
    Lawnmower problem
    Results
    Analysis of the results
    General analysis of the performance of context-aware crossover on the tested problems
    7.8.1 Fitness gain
    Tree size
    Fitness gain in the children
    Constructiveness
    Conclusion

8 Discovery of Useful Modules to Improve GP Performance
    Recombination operators used
    Context-Aware Crossover
    Context-Aware Mutation operator
    Module Mutation operator
    Calculating the contribution of a sub-tree
    Identification of modules and repository creation
    Selection of modules
    Experimental Setup
    Does good first run guarantee good second run?
    Results
    Use of standard crossover in the first run
    Use of Context-Aware crossover in the first run
    Discussion
    Summary

9 Re-examination of a real world blood flow problem using context-aware crossover
    Experimental Setup
    Results comparison
    Comparison of goodness of fit
    9.4 Fitness plots
    Analysis of the results
    Summary

10 Use of cache in context-aware crossover
    Experiments conducted
    Quartic Polynomial Symbolic Regression
    Sextic Polynomial
    11-Bit Multiplexer
    Summary

11 Conclusions & Future Work
    Future work
    Extending GE analysis to other evolutionary algorithms
    Employing context-based schema evaluation technique in existing methods
    Using context-aware crossover as a tool
    Selection of better sub-tree

A Graphs for the results discussed in section
    A.1 Quartic Polynomial Symbolic Regression
    A.2 11-Bit Multiplexer
    A.3 Lawnmower
    A.4 Summary

B Publications

List of Figures

2.1 A sample GP tree
2.2 Working of one point crossover
Ripple crossover in GE
Count of correctly and incorrectly positioned building blocks (size 9)
Count of some more correctly and incorrectly positioned building blocks (size 9)
Rooted building block size statistics
Statistics for the largest rooted building blocks
Tree based structures make it hard to calculate the contribution/fitness of any of their sub-trees
Possible destructive effects of random selection of a sub-tree
Protection of good building blocks
Computing the Robinson and Foulds distance
A rooted node with node co-ordinate system
Strong context preserving crossover
Weak context preserving crossover
Uniform Crossover
Looseness control crossover
Preserving the parent nodes is not always useful
3.8 Importance of the deep sub-trees of a tree
In Brood Recombination the probability of selection of a crossover point closer to the root node is small
Evaluating GP schema in context
Replacing GP schema with its identity nodes before re-evaluation
Average size and fitness contribution of schema (exp #) and (/ X #)
Different size and fitness contribution of (exp #) in another run
Comparison of schemata found in different runs
A few more schemata from two different runs
Effect of schemata size on its instance counts
Relation between the depth of a schema and its fitness contribution
The sub-tree (* (- X X) #) acts as an intron
Working of context-aware crossover
Performance of context-aware crossover in the Quartic Polynomial Symbolic Regression problem
Performance plots for the 11-Bit Multiplexer problem
Plots for the Lawnmower problem
6.1 Behavior of the exponential (in terms of curr gen, max gen and slope)
Effect of order and varying probability of using recombination operators on the average fitness of the system
Effect of order and varying probability of using recombination operators on the average fitness of the system
Effect of using standard and context-aware crossovers on the average performance of different problems
6.5 Effect of using standard and context-aware crossovers on the best performance of different problems
Effect of using standard and context-aware crossovers on the size of the generated trees
Fitness plots for the Quartic Polynomial Symbolic Regression problem
Tree size plots for the Quartic Polynomial Symbolic Regression problem
Crossover events count plots for the Quartic Polynomial Symbolic Regression problem
Fitness gain plots for the Quartic Polynomial Symbolic Regression problem
Behavior of the uni-variate polynomial $0.3\,x \sin(2\pi x)$
Fitness plots for the Uni-variate Polynomial Symbolic Regression problem
Tree size plots for the Uni-variate Polynomial Symbolic Regression problem
Fitness gain plots for the Uni-variate Polynomial Symbolic Regression problem
Crossover events count plots for the Uni-variate Polynomial Symbolic Regression problem
Behavior of the bi-variate polynomial $x^4 - x^3 + y^2/2 - y$
Fitness plots for the Bi-variate Polynomial Symbolic Regression problem
Tree size plots for the Bi-variate Polynomial Symbolic Regression problem
7.13 Fitness gain plots for the Bi-variate Polynomial Symbolic Regression problem
Crossover events count plots for the Bi-variate Polynomial Symbolic Regression problem
Fitness plots for the 11-Bit Multiplexer problem
Tree size plots for the 11-Bit Multiplexer problem
Fitness gain plots for the 11-Bit Multiplexer problem
Crossover events count plots for the 11-Bit Multiplexer problem
Fitness plots for the Lawnmower problem
Tree size plots for the Lawnmower problem
Crossover events count plots for the Lawnmower problem
Fitness gain plots for the Lawnmower problem
Plots for cascaded runs setup using standard crossover in the first run
Standard deviation plots for selected modules
Plots for cascaded runs setup using standard crossover in the first run
Plots for cascaded runs setup using context-aware crossover in the first run
Standard deviation plots of the selected modules
A visual demonstration of a bypass graft
Highly fit individuals with poor fits
Left: Fits of the best-of-the-run individuals for the setups employing linear scaling. Right: Magnified view of the left plots at the origin to show the error
9.4 Left: Fits of the best-of-the-run individuals for the setups not using linear scaling. Right: Magnified view of the left plots at the origin to show the error
Mean best fitness plots for the populations generated by the setups not using linear scaling
Mean best fitness plots for the populations generated by the setups employing linear scaling
Usefulness of cache in tree based GP. The sub-trees of the two parents are cached before the start of the crossover and then used while evaluating the generated children. Only the parent nodes of the swapped sub-trees need to be evaluated
Effect of using cache on the number of the nodes evaluated in the Quartic Polynomial Symbolic Regression problem
Effect of using cache on the number of the nodes evaluated in the Sextic Polynomial Symbolic Regression problem
Effect of using cache on the number of the nodes evaluated in the 11-Bit Multiplexer problem
A.1 Effect of switching of crossover operators on the performance and size of the generated trees in the Quartic Polynomial Symbolic Regression problem
A.2 Effect of switching of crossover operators on the performance and size of the generated trees in the 11-Bit Multiplexer problem
A.3 Effect of switching of crossover operators on the performance and size of the generated trees in the Lawnmower problem

List of Tables

4.1 Average instance frequency counts of potentially useful schemata that could be encapsulated
Common run time parameters for the studied problems
Run parameters for the different setups used to solve the Quartic Polynomial Symbolic Regression problem
Run parameters for the different setups used to solve the Quartic Polynomial Symbolic Regression problem
Run parameters for the experiments conducted to solve the Lawnmower problem
Run parameters for the Quartic, Sextic Polynomial Symbolic Regression and Multiplexer problems
Common run time parameters for all the studied problems
Performance comparison of the results obtained for the Quartic Polynomial Symbolic Regression problem by switching the crossover operators at different stages of the run with the results obtained for the same problem employing std var and var setup in the previous chapter. Only the top three pct x setups are mentioned in the table and the best setup is shown in bold
6.4 Performance comparison of the results obtained for the 11-Bit Multiplexer problem by switching the crossover operators at different stages of the run with the results obtained for the same problem employing std var and var setup in the previous chapter. Only the top three pct x setups are mentioned in the table and the best setup is shown in bold
The Lawnmower problem specific run time parameters
Performance comparison of the results obtained for the 11-Bit Multiplexer problem by switching the crossover operators at different stages of the run with the results obtained for the same problem employing std var and var setup in the previous chapter. Only the top three pct x setups are mentioned in the table and the best setup is shown in bold
Run parameters for the experiments conducted to evolve the uni-variate polynomial
Comparison of the goodness of the fit of different setups. The best setup is shown in bold

Chapter 1

Introduction

Semantic based study in Genetic Programming (GP) has always been challenging because of the complexities involved in the tree representation and the tight linkage between the nodes of a tree. These complex tree structures make the evaluation of a sub-tree in context difficult, if not impossible. The only available solution is to look for ad hoc methods that achieve the same effect. Ad hoc methods are mostly error-prone and inaccurate, and the accuracy of each method depends on its implementation.

In Chapter 3, we first present theoretical and empirical proof of the importance of the semantic context of the sub-trees of a GP tree, and then show its constructive effects on the performance of the system when it is considered during a run and its destructive effects when it is ignored. Chapter 4 introduces a new context aware technique to calculate the fitness contribution of a sub-tree and shows its usefulness in marking the sub-tree as good or bad for a GP run. Its limitations are also discussed in that chapter.

This work introduces a new context aware crossover operator for tree based GP, Context-Aware crossover, and shows its effectiveness on a wide range of problems. We also discuss different ways to optimise its use in a GP run.

Finally, in Chapter 8, we employ the context aware sub-tree evaluation technique along with context-aware crossover to identify and encapsulate good modules from the population. These modules are then used in a subsequent run to improve its performance. The effectiveness of this technique is shown by testing it on a benchmark problem and by improving the existing results published for a real world blood flow problem (Chapter 9).

1.1 Overview of the Thesis

The work presented in this thesis matured into its current form over the years through a number of incremental, logical steps, each built upon its predecessor. We think it appropriate to present the work in the order in which the research was carried out, as this should help the reader to understand the rationale behind the course of action we took.

The initial focus of this research was to study the possible existence of competitive building blocks in Grammatical Evolution (GE), a grammar based program evolving algorithm [Ryan et al., 1998]. The competitive hypothesis says that in GE, at the genomic level, rooted competitive and non-rooted co-operative building blocks exist. The competitive building blocks try to grow in size with time, while the co-operative building blocks help them to do so. The results obtained were quite promising and encouraged us to extend the study to GP-like systems.

For GP, we decided not to restrict our research to the size and position of the building blocks or schemata, but to look at their semantics (context) too. The semantics of a schema, sub-tree or building block normally boils down to its contribution towards the tree containing it (the container-tree), its goodness, the effect of its removal or inclusion on the container-tree, and so on. Unfortunately, there were many hurdles involved in this study. The biggest one was the tree representation of GP-like systems, where the tight linkage of the nodes made our job quite difficult. At this stage we decided to devise our own tool that would allow us to study the semantics of a sub-tree and give us accurate readings. This led us to introduce a new approach to evaluate a sub-tree in context. It helped us to answer some of the questions most often asked by the GP community, although it is not a perfect technique and has some limitations.

After this, our study focused mostly on the importance of the context of a sub-tree and the factors affecting it. The empirical proof taught us that GP's crossover operator is one of the most influential factors and that it can shift the results in either direction depending on its effectiveness. Unfortunately, most of the crossover techniques available at that time were not context aware, worked at the syntactic level and were quite destructive. This tempted us to take up the daunting task of developing a new context aware crossover, which we called Context-Aware crossover. We tested its performance on a wide range of problems and optimised its use by testing different setups.

Lastly, we combined the technique of evaluating schema in context with this new crossover operator to identify useful building blocks (modules) in the population for encapsulation, and used them in subsequent runs to improve the performance of GP.

1.2 Contribution of the Thesis

As mentioned in the previous section, this research was furthered by inspecting and solving the problems faced by the GP community. The main contributions of this dissertation are as follows:

- The GE community can benefit from the competitive building block hypothesis described in this thesis and use it to look under the hood of the GE engine. This analysis can be further used to improve the working of GE. The study can also be extended and applied to other evolutionary algorithms.
- A comprehensive study of the shortfalls and flaws in the existing crossover approaches is presented. It can be used as an initial study to develop a new system or to improve the existing ones.
- The technique to evaluate a schema in context has numerous applications:
    - It can be used as a heuristic to calculate the effect of a sub-tree on the tree containing it.
    - It can be employed to develop more informed mutation or crossover operators.
    - It has successfully removed dead or ineffective code from the trees, which makes it a useful technique to contain code bloat.
    - It can help to improve some of the existing schema selection and identification techniques.
    - It can be used in identifying good modules or sub-trees for encapsulation.
    - It can be helpful in studying the dynamics of a GP system and in identifying the schemata or building blocks responsible for the success or otherwise of a run.
- A new context-aware crossover is introduced in this thesis. It can be used to improve the performance of the system, and it has the ability to tackle hard problems and to generate small trees.

1.3 Outline of the Thesis

Chapter 2 - Background and Related Work introduces the GP system and discusses our initial work on GE. The possible extension of this work to GP and the problems involved in conducting this study in GP are also discussed.

Chapter 3 - Complexity Involved in Tree Structures discusses the various techniques adopted by different researchers to overcome the problems presented in Chapter 2. At the end, it discusses the problems with those approaches and possible ways to improve them.

Chapter 4 - Evaluating GP Schema in Context introduces a new technique to calculate the contribution of a sub-tree/schema in context. It also demonstrates the usefulness of this technique in identifying good schemata at the end of a run.

Chapter 5 - Context-Aware Crossover for GP introduces a new context aware crossover for GP and discusses its features in detail. The ways to use it in a GP run along with standard crossover are also discussed.

Chapter 6 - Optimising the Use of Context-Aware and Standard Crossovers discusses different setups for employing standard and context-aware crossovers in a GP run. It also shows the results obtained and identifies the best setup to use.

Chapter 7 - Validation of the Hypothesis shows the effectiveness of the best setup identified in Chapter 6. It also shows the results obtained for a wide range of problems and compares these results.

Chapter 8 - Discovery of Useful Modules to Improve GP Performance introduces a new way to combine the technique for evaluating schema in context, introduced in Chapter 4, with context-aware crossover to identify useful modules from a GP run and then use them in the next run to improve the performance of GP.

Chapter 9 - Re-examination of a Real World Blood Flow Problem Using Context-Aware Crossover discusses the usefulness of context-aware crossover on a real world problem and the improvement over the already published results for this problem.

Chapter 10 - Use of Cache in Context-Aware Crossover discusses the usefulness of a very simple cache technique in reducing the expense of context-aware crossover.

Chapter 11 - Conclusions & Future Work concludes the thesis and discusses some possible extensions of this work.

Chapter 2

Background and Related Work

Computer programs are considered to be a natural choice for solving different types of problems owing to their flexibility and the variables involved. Unfortunately, coding a computer program is not a trivial task and requires planning, problem knowledge and possible inputs. This realization brought up the idea of automatic programming. Genetic Programming is one of the endeavours in this area. In the early 90s, Koza [Koza, 1990] introduced the idea of evolving program structures by simulating Darwin's theory of evolution, "survival of the fittest". Since then, GP has enjoyed considerable success across a wide range of fields and has produced human competitive results [Koza, 2000] [Jones et al., 2006] [Koza et al., 2006a] [Koza et al., 2006b]. In the following section we shall discuss it in detail.

2.1 Genetic Programming

GP is an evolutionary method introduced by Koza to solve problems by evolving computer programs. Koza adopted Lisp like expressions to represent his programs and called them expression trees. These trees are of variable length and are composed of basic units called functions and terminals. Functions perform some operation on their inputs, while terminals are effectively functions with zero inputs. A function can take as an input the value returned by a terminal or by another function. A sample GP expression tree with terminals and functions marked is shown in Figure 2.1.

Figure 2.1: A sample GP tree. Functions and terminals are marked and the Lisp expression for the tree, (+ (sin (* 1.2 X)) (* X (/ 1 2))), is shown below the tree.

Each generated GP program is assigned a fitness depending upon its ability to solve the given problem. This fitness value is normally used as a criterion by the selection method to select certain individuals for further improvement. GP tries to improve the selected programs by applying genetic operators to them. The commonly used genetic operators are one point crossover and mutation. In the following sections each component of GP is discussed in a little more detail. The principal components of a GP system are as follows:

- Problem specification
- Initial population
- Fitness measure
- Selection methods
- Recombination operators

2.1.1 Problem Specification

To solve a problem using GP, the problem needs to be translated into a machine understandable form. In GP, a problem is mostly represented in the form of terminals, non-terminals and termination criteria. Terminals are mostly variables and functions with no input. Non-terminals are functions with one or more inputs, and the termination criteria tell the system when to stop the evolution process.

GP is widely used to solve problems from the symbolic regression domain. The Quartic Polynomial Symbolic Regression problem is one example [Koza, 1992]. The goal of GP for this problem is the evolution of an expression satisfying the data points of the uni-variate expression $x^4 + x^3 + x^2 + x$ sampled over the domain $[-1 : 1]$. Typical terminal and non-terminal sets for this problem are $\{x, R\}$ and $\{+, -, *, /, \ln, \sin, \cos, \exp\}$ respectively. Here $x$ is the variable and $R$ is the set of random constants (commonly referred to as ephemeral random constants), while $+, -, *, /, \ln, \sin, \cos, \exp$ are non-terminals taking 2, 2, 2, 2, 1, 1, 1, 1 inputs, respectively.
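The tree representation and the quartic problem specification described above can be made concrete with a small sketch. Everything in it beyond the function set, the target polynomial and the tree of Figure 2.1 is an assumption made for illustration: trees are encoded as nested Python tuples, division, logarithm and exponential are "protected" against invalid inputs (a common GP convention that this section does not specify), fitness is taken as the sum of absolute errors, and 21 evenly spaced sample points are used.

```python
import math
import operator

# Protected primitives: an assumption of this sketch, not taken from the thesis.
def protected_div(a, b):
    return a / b if abs(b) > 1e-9 else 1.0

def protected_ln(a):
    return math.log(abs(a)) if abs(a) > 1e-9 else 0.0

def protected_exp(a):
    return math.exp(min(a, 50.0))  # clamp the argument to avoid overflow

# Non-terminal set with arities 2, 2, 2, 2, 1, 1, 1, 1.
FUNCTIONS = {
    "+": (operator.add, 2), "-": (operator.sub, 2),
    "*": (operator.mul, 2), "/": (protected_div, 2),
    "ln": (protected_ln, 1), "sin": (math.sin, 1),
    "cos": (math.cos, 1), "exp": (protected_exp, 1),
}

def evaluate(tree, x):
    """Evaluate a tree: a tuple (function, child, ...), the variable "X", or a constant."""
    if tree == "X":
        return x
    if isinstance(tree, (int, float)):
        return float(tree)
    name, *children = tree
    func, arity = FUNCTIONS[name]
    assert len(children) == arity, "wrong number of inputs for " + name
    return func(*(evaluate(child, x) for child in children))

def quartic(x):
    return x**4 + x**3 + x**2 + x

def fitness(tree, cases):
    """Sum of absolute errors over the training cases; lower is better."""
    return sum(abs(evaluate(tree, x) - quartic(x)) for x in cases)

# 21 evenly spaced points in [-1, 1]; the sample count is an assumption.
CASES = [-1 + 0.1 * i for i in range(21)]

# The tree of Figure 2.1: (+ (sin (* 1.2 X)) (* X (/ 1 2)))
figure_tree = ("+", ("sin", ("*", 1.2, "X")), ("*", "X", ("/", 1, 2)))

# A perfect individual in Horner form: x * (1 + x * (1 + x * (1 + x)))
perfect = ("*", "X", ("+", 1, ("*", "X", ("+", 1, ("*", "X", ("+", 1, "X"))))))

print(fitness(figure_tree, CASES), fitness(perfect, CASES))
```

Under these assumptions the Horner-form individual has an error of essentially zero, while the tree of Figure 2.1 does not, which is the kind of feedback the fitness measure gives to selection.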

2.1.2 Initial Population

A GP engine needs an initial population to start working. Ramped half-and-half is the most widely used method for generating it. In this method, each individual of the initial population is generated by randomly selecting terminals and non-terminals from the corresponding sets. The size of the initial population varies from problem to problem. A detailed discussion of this topic is beyond the scope of this work; interested readers can refer to [Koza, 1992].

Fitness measure

GP assigns a fitness value to each individual of the population by measuring its ability to predict outputs from the given inputs. The goal of fitness evaluation is to give the system feedback on which individuals should be selected to further improve its performance. Generally, the fitness of an individual is calculated on a set of inputs called the training set.

Selection method

Selection picks certain individuals of the population by biasing the selection process. The biasing criterion is usually set before the start of the evolution process and helps in generating a new, improved population. Commonly used selection methods are fitness proportionate selection, random selection, tournament selection and greedy over-selection.

Fitness proportionate selection works by maintaining a roulette wheel. A part of the wheel is assigned to each individual of the population depending on its fitness: fitter programs get bigger shares and vice versa. Finally, the individuals are selected by spinning the wheel. Clearly this selection is biased towards the fit individuals, hence the name fitness proportionate selection.

Tournament selection is a relatively less expensive selection process, as it works on a subset of the population. The size of the subset is dictated by the tournament size, which is set before the start of the run. Typical values of tournament size are 2, 4, 6, 7. A higher tournament size exerts more selection pressure on the population and vice versa. After randomly selecting the subset of the population, the best one among them is allowed to reproduce and replace the loser of the tournament.

Recombination operators

The individuals selected by the selection method are commonly called parents and are combined to generate one, two or more offspring. The recombination process is dictated by the recombination operator used. A typical example of a recombination operator is standard one point crossover, whose working is shown in Figure 2.2. The selected parents are labelled Parent 1 and Parent 2. The encircled sub-trees of the parents are selected randomly for swapping. After swapping, two new children, labelled Child 1 and Child 2, are produced.

Another commonly used genetic operator is mutation. This operator generates a new offspring by first randomly selecting a sub-tree of the selected parent and then replacing it with a new, randomly generated sub-tree.

Figure 2.2: Working of one point crossover. (Panels: Parent 1 and Parent 2 with the crossover points marked, and the resulting Child 1 and Child 2.)

Building Block hypothesis

Since the introduction of GP [Koza, 1992], there has been much research carried out to discover exactly how it conducts its search. Some researchers concentrated on trying to extend Genetic Algorithm (GA) [Holland, 1975] schema theory to GP [Poli and Langdon, 1997b] [Poli and Langdon, 1998b], while others tried to define their own [Koza, 1992] [O'Reilly, 1995] [O'Reilly and Oppacher, 1994] [Rosca, 1997]. Most GP schema work involves the identification and analysis of sub-trees and tree fragments. As in GAs, a don't-care symbol (#) is employed to generalize a schema. Depending on the definition of schema, this # can match any sub-tree [O'Reilly, 1995] [O'Reilly and Oppacher, 1994], a rooted sub-tree [Rosca, 1997], or a single terminal or function [Poli and Langdon, 1997b] [Poli and Langdon, 1998b]. The challenge, clearly, is the identification of useful schemata.
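As an illustration of the three matching conventions just listed, the following is a minimal Python sketch. It is not taken from the cited works. It assumes trees are nested tuples of the form (label, child, ...) with bare strings as terminals, and the function names match_rooted, match_anywhere and match_node_wildcard are hypothetical. Each function is a simplified reading of the corresponding definition; the original papers contain details that are not modelled here.

```python
WILDCARD = "#"  # the don't-care symbol


def label(node):
    """Return the label of a node (function name or terminal)."""
    return node[0] if isinstance(node, tuple) else node


def children(node):
    """Return the child sub-trees of a node (empty for terminals)."""
    return node[1:] if isinstance(node, tuple) else ()


def match_rooted(schema, tree):
    """'#' stands for any whole sub-tree; the schema is anchored at the root."""
    if schema == WILDCARD:
        return True
    if label(schema) != label(tree):
        return False
    sc, tc = children(schema), children(tree)
    return len(sc) == len(tc) and all(
        match_rooted(s, t) for s, t in zip(sc, tc)
    )


def match_anywhere(schema, tree):
    """'#' stands for any sub-tree; the fragment may occur at any node."""
    if match_rooted(schema, tree):
        return True
    return any(match_anywhere(schema, c) for c in children(tree))


def match_node_wildcard(schema, tree):
    """'#' stands for exactly one node, so schema and tree share one shape."""
    sc, tc = children(schema), children(tree)
    if len(sc) != len(tc):
        return False
    if label(schema) != WILDCARD and label(schema) != label(tree):
        return False
    return all(match_node_wildcard(s, t) for s, t in zip(sc, tc))


# Example tree for (x + x) * sin(x)
tree = ("*", ("+", "x", "x"), ("sin", "x"))
print(match_rooted(("*", "#", ("sin", "x")), tree))
print(match_anywhere(("+", "x", "#"), tree))
print(match_node_wildcard(("#", ("+", "#", "x"), ("sin", "x")), tree))
```

The only difference between the first two functions is whether the schema must start at the root, while the third also fixes the shape of the matched program, which is why a leaf # there cannot absorb a whole sub-tree.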

In the literature, these useful schemata are commonly referred to as building blocks or schemata [Poli and Langdon, 1998a] [Poli and Langdon, 1998b] [Poli and Langdon, 1997a]. It is argued that GP finds the solution to a problem by combining these building blocks over time. Therefore, by explicitly identifying, preserving and exchanging these blocks, the performance of the system can be improved manifold. Furthermore, this can be helpful in solving scaled-up versions of the same problem. Now that we have finished introducing GP, we shall discuss the initial work done and its outcome.

2.2 Where it all started

Initially, the focus of our study was on the Grammatical Evolution (GE) system [Ryan et al., 1998] [O'Neill and Ryan, 2003] and the possible existence of building blocks and their growth over the period of a run [Ryan et al., 2004]. GE is an automatic expression evolving system, like Genetic Programming (GP), and uses different types of grammars to generate the final expression. To fully understand the initial work, it is important to briefly discuss the working of GE.

2.3 Grammatical Evolution (GE)

Grammatical Evolution is a Genetic Programming system that combines the convenience of a GA-style binary string representation with the expressive power of a high-level language program. Following a biological metaphor, the binary string is termed the genotype, which is translated into a program, the phenotype, through a genotype-phenotype mapping process. The mapping process typically uses a context free grammar represented in Backus Naur Form (BNF). However, the modular design of this evolutionary algorithm permits the use of other types of grammars and notations. For example, the Adaptive Logic Programming system (ALP) [Keijzer, 2002] combines the context sensitive programming features of Prolog logic programs with GE and demonstrates many interesting applications. As discussed later, the unique manner in which the linear strings are processed allows the use of unconstrained genetic operators while still producing valid offspring. Thus, unlike Montana [Montana, 1995], specialized operators are not needed. Similarly, a repair mechanism such as that used by Keller and Banzhaf [Keller and Banzhaf, 1996] is not required.

A context free grammar is represented by a tuple {T, N, P, S}. T is the set of terminals, the symbols that occur in the valid sentences of the language defined by the grammar. N denotes the set of non-terminals, the interim items that lead to the terminals. P is a set of production rules and S ∈ N is the start symbol. A grammar typically used by GE for symbolic regression problems is given below.

    S = <expr>
    <expr>   ::= (<expr> <op> <expr>) | <pre-op> (<expr>) | <var>
    <op>     ::= / | - | * | +
    <pre-op> ::= Sin | Cos | Log | Exp
    <var>    ::= x | 1.0

The mapping process chooses different rules from the grammar to arrive at a particular sentence. The genotype is treated as a sequence of 8 bit genes. The mapping starts with the start symbol. At each step during the mapping process, a gene is read from the genome to pick a rule for the non-terminal under consideration. The mod operation is used to decode a gene into a rule in the following manner:

    Rule index = (Gene) mod (Number of rules for the particular non-terminal)

To elucidate the mapping process, consider a sample individual. The start symbol <expr> has the following options:

    <expr> ::= ( <expr> <op> <expr> )   (0)
             | <pre-op> ( <expr> )      (1)
             | <var>                    (2)

24 mod 3 = 0. Thus, (<expr> <op> <expr>) is chosen. The mapping process always resolves the leftmost non-terminal. As <expr> is the same non-terminal as before, the set of options remains unchanged. The mapping proceeds by reading the next gene, 32, which decodes to <var>. The expression now becomes (<var> <op> <expr>). For <var> the set of choices is:

    <var> ::= x     (0)
            | 1.0   (1)

14 mod 2 = 0. Thus, <var> is replaced by x in the expression. The mapping continues in this manner until the individual is completely mapped to (x + Exp(x)). The genetic material is reused if the mapping is incomplete at the end of a single pass through the individual. This phenomenon is termed wrapping and is discussed comprehensively in [Ryan et al., 2003].

Ripple Effect and Ripple Crossover

Systems predating GE, such as GADS [Paterson and Livesey, 1997], used a fixed mapping from a gene to a rule in the grammar. This required that the genes appear in the order the mapping demands. For example, if a rule is required for <pre-op>, GADS skips all the genes it encounters until it finds an appropriate one. CFG/GP [Freeman, 1998] uses the interim genes if they are applicable to any of the unresolved non-terminals in the derivation tree. However, this does not account for the possibility that the currently unusable genes may become useful as the mapping continues. Both approaches suffer from a proliferation of introns.

The use of the mod operation ensures that a gene is always interpreted in the context of the mapping process. As a result, a gene always decodes to a rule that is used immediately. This property is termed intrinsic polymorphism [Keijzer et al., 2001]. Consequently, the meaning of a gene depends on all the genes that precede it in the chromosome. If a change occurs in the earlier part of the chromosome, the effect ripples through the rest of the genome. Thus, one point crossover in GE is termed ripple crossover. Fig. 2.3 exemplifies ripple crossover for the individual discussed previously. When the chromosome is cut at a certain location, it effectively dismantles multiple branches from the derivation tree, leaving behind many ripple sites. The incoming fragment may be of a different length and may contain genes used for different non-terminals. The polymorphic interpretation ensures that they are used in context to fill in the ripple sites.

Figure 2.3: Ripple crossover in GE. The change in the chromosome in (i) is correlated with the corresponding changes in the derivation tree and the abstract syntax tree in (ii) and (iii) respectively. (iv) depicts the spine and the vacant ripple sites. (Panels: (i) Genotype, (ii) Derivation Tree, (iii) Abstract Syntax Tree, (iv) Ripple Sites and Spine.)

Building Blocks in GE

The previous section may make it appear unlikely that building blocks can exist in GE. Not only do they share with GP the property that the operation of their phenotype depends to a large extent on the ones that precede them (in the same manner that nodes in GP depend on their parent nodes), they can also actually code for something entirely different. An investigation into whether or not building blocks appear in GE was carried out by [O'Neill and Ryan, 2000], in which they compared a variety of crossover operators, including the so-called headless chicken crossover [Angeline and Pollack, 1993c], in which randomly generated material is inserted into parents rather than having them exchange genes. They showed that a homologous crossover that prevented individuals from choosing crossover points within their common areas performed approximately the same as standard one point crossover, suggesting that a type of homologous crossover comes for free with GE. Our belief is that, while this is true, it is more the case that many of the successful crossovers are of this variety.
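The mod-based mapping of Section 2.3 and the ripple effect it produces under one point crossover can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation. The function name map_genome and the max_wraps limit are hypothetical. The first three gene values (24, 32, 14) are the ones quoted in the text, while all remaining gene values, including the whole of parent_b, are made up for illustration. The grammar and the order of its alternatives follow the symbolic regression grammar given in Section 2.3.

```python
# Grammar from Section 2.3; each non-terminal maps to its list of rules.
GRAMMAR = {
    "<expr>": [["(", "<expr>", "<op>", "<expr>", ")"],
               ["<pre-op>", "(", "<expr>", ")"],
               ["<var>"]],
    "<op>": [["/"], ["-"], ["*"], ["+"]],
    "<pre-op>": [["Sin"], ["Cos"], ["Log"], ["Exp"]],
    "<var>": [["x"], ["1.0"]],
}


def map_genome(genome, start="<expr>", max_wraps=2):
    """Map a list of integer genes to a sentence, or None if mapping fails.

    The leftmost non-terminal is always expanded next, using
    rule index = gene mod (number of rules for that non-terminal).
    Genes are reused (wrapping) at most max_wraps times (an assumed limit).
    """
    symbols = [start]
    used = 0
    limit = len(genome) * (max_wraps + 1)
    while True:
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:
            return " ".join(symbols)  # fully mapped
        if used >= limit:
            return None  # ran out of genes even after wrapping
        rules = GRAMMAR[symbols[idx]]
        gene = genome[used % len(genome)]
        symbols[idx:idx + 1] = rules[gene % len(rules)]
        used += 1


# Hypothetical parents; only 24, 32 and 14 come from the text.
parent_a = [24, 32, 14, 7, 4, 11, 5, 6]
parent_b = [4, 7, 5, 6]

# One point crossover: head of parent_a, tail of parent_b.
child = parent_a[:3] + parent_b[1:]

print(map_genome(parent_a))  # ( x + Exp ( x ) )
print(map_genome(parent_b))  # Exp ( x )
print(map_genome(child))     # ( x + x )
```

In parent_b the gene 7 is read while <pre-op> is being resolved and so selects Exp, whereas in the child the same gene is read while <op> is being resolved and selects +. This is the context-dependent interpretation that the text calls intrinsic polymorphism.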

Rooted or unrooted building blocks?

Rooted building blocks are clearly of importance to any type of GP system, but the fact that headless chicken crossover usually under-performs compared to standard crossover suggests that there have to be unrooted building blocks at play as well. The following section presents a suite of experiments designed to test whether crucial building blocks from ideal individuals appear in the wrong position (and possibly with an entirely different meaning).

Experimental Design

The first experiment tested whether important building blocks appear in the population, but in the wrong location on individuals' chromosomes. Important building blocks are defined as those that appear in the best-of-run individual for a particular run. A second experiment was designed to investigate how newly extended rooted building blocks spread throughout the population. If there really is a competition to discover a rooted building block, we should be able to see evidence of the appearance of increasingly larger rooted building blocks, which then start to take over the population.

We used Koza's quartic polynomial symbolic regression problem, with a population of 5 running for 200 generations, using steady state replacement with roulette wheel selection. Only the crossover operator was used and mutation was turned off; this permitted us to focus on the effects of crossover alone. It has been shown [O'Sullivan and Ryan, 2002] that the performance of GE drops off very sharply when mutation is removed, and we experienced a similar drop off in performance. In fact, across all the experiments here, we experienced only two successes, even though they ran for two hundred generations.

Occurrence of incorrectly positioned building blocks

These experiments were designed to check how frequently a building block exists at different positions throughout the population. The best individual was selected and compared to the entire population of a particular generation. Two counts were maintained. The first records how often a building block occurs in the population at the same position as in the best individual, while the second records how often it occurs at a different position. This test shows how the building blocks are spread across the genome. As the size of the building block was not known beforehand, we checked block sizes from one up to the maximum size of the best individual. Due to space considerations, we only show measures for building blocks of size nine, as these are representative of the other results. They are shown in Figs. 2.4 and 2.5. Every building block of length nine was counted, in an overlapping fashion, so the first building block occupies gene positions 0 to 8, while the second occupies 1 to 9.

As Figures 2.4 and 2.5 indicate, as early as after the first generation, the population (Figure 2.4) has, on average, produced around seven individuals with the correct top root structure. The further along the genome we examined, the less likely a building block was to be found in its correct position; recall that we are interested in individual building blocks in the traditional sense rather than rooted building blocks. Curiously, the trend is that, as the number of building blocks in the correct position decreases, their number in the incorrect position increases, indicating a good mix of diversity in the population.
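The two counts described above can be sketched in Python as follows. This is a minimal reading of the procedure, not the code used for the experiments. The function name count_block_matches is hypothetical. The sketch assumes that each count is a number of individuals (rather than a number of occurrences), and that an individual holding the block both at the original position and elsewhere contributes to both counts.

```python
def count_block_matches(best, population, block_size):
    """Count same-position and different-position matches of each block.

    best       -- genome (list of genes) of the best individual
    population -- list of genomes to compare against
    block_size -- length of the overlapping blocks taken from `best`

    Returns two lists indexed by the block's starting position in `best`:
    how many individuals hold the block at that same position, and how
    many hold it at some other position (assumption: per-individual counts).
    """
    same_pos, diff_pos = [], []
    for start in range(len(best) - block_size + 1):
        block = best[start:start + block_size]
        same = diff = 0
        for genome in population:
            # All starting positions at which this genome contains the block.
            positions = [
                q for q in range(len(genome) - block_size + 1)
                if genome[q:q + block_size] == block
            ]
            if start in positions:
                same += 1
            if any(q != start for q in positions):
                diff += 1
        same_pos.append(same)
        diff_pos.append(diff)
    return same_pos, diff_pos


# Toy example with blocks of size 2 (the text reports size 9).
best = [1, 2, 3, 4]
population = [[1, 2, 3, 4], [9, 1, 2, 3], [1, 2, 9, 9], [2, 3, 2, 3]]
same, diff = count_block_matches(best, population, block_size=2)
print(same)  # [2, 1, 1]
print(diff)  # [1, 2, 0]
```

With block_size set to 9, the first block covers gene positions 0 to 8 and the second covers 1 to 9, as in the text. Plotting the two returned lists against the starting position gives curves of the kind labelled "Same pos" and "Diff pos" in Figures 2.4 and 2.5.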

Figure 2.4: Count of correctly and incorrectly positioned (overlapping) building blocks of size 9 after the first and 5 generations. (Panels: "Generation = 1, Block size = 9, Averaged over 3 Runs" and "Generation = 5, Block size = 9, Averaged over 3 Runs"; x-axis: Starting Position; y-axis: Match Count; series: Same pos, Diff pos.)

Figure 2.5: Count of correctly and incorrectly positioned (overlapping) building blocks of size 9 after the 1 and 15 generations. (Panels: "Generation = 1, Block size = 9, Averaged over 3 Runs" and "Generation = 15, Block size = 9, Averaged over 3 Runs"; x-axis: Starting Position; y-axis: Match Count; series: Same pos, Diff pos.)

By generation 5, also in Figure 2.4, we see a very different picture. Notice the difference in the scales and that, on average, around 35 individuals have the same first nine genes. Further, for certain sequences towards the end of the chromosomes, we see relatively high peaks, indicating that many individuals have those sequences, albeit not in the correct position. This seems to support the findings of [Ryan et al., 2003], who postulated the existence of a stop sequence in GE, that is, a sequence of genes that can successfully terminate most of the chromosomes in the population. An analogy in GP