Genetic Programming and its Application in Real-time Runoff Forecasting


This is a Pre-Published Version. Published version available at Wiley-Blackwell: Genetic Programming and Its Application in Real-time Runoff Forecasting. Khu, S. T., Liong, S. Y., Babovic, V., Madsen, H. and Muttil, N. Journal of American Water Resources Assoc. (JAWRA), Vol. 37, No. 2,

Khu, S. T., Liong, S. Y., Babovic, V., Madsen, H. and Muttil, N.

ABSTRACT

Genetic programming (GP), a relatively new evolutionary technique, is demonstrated in this study to evolve codes for the solution of problems. First, a simple example in the area of symbolic regression is considered. GP is then applied to real-time runoff forecasting for the Orgeval catchment in France. In this study, GP functions as an error updating scheme to complement a rainfall-runoff model, NAM. Hourly runoff forecasts with different updating intervals are performed for forecast horizons of up to nine hours. The results show that the proposed updating scheme is able to predict the runoff quite accurately for all updating intervals considered, and particularly for updating intervals not exceeding the time of concentration of the catchment. The results are also compared with those of an earlier study by the World Meteorological Organization (WMO, 1992), in which auto-regression and the Kalman filter were used as the updating methods. Comparisons show that GP is a better updating tool for real-time flow forecasting. Another important finding from this study is that nondimensionalizing the variables enhances the symbolic regression process significantly.

Key terms: genetic programming, evolutionary algorithms, rainfall-runoff, real-time forecasting, updating, regression.

INTRODUCTION

One of the central challenges of computer science is to get a computer to perform a task without telling it how to do it. In hydrologic engineering, the challenge is to derive a model that relates two or more physical processes without knowing the actual mechanics of conversion. Genetic Programming (GP) addresses the first challenge by providing a method which automatically creates a working computer program from a high-level statement of the problem. GP achieves this automatic program discovery (also known as program synthesis or program induction) by genetically breeding a population of computer programs using principles of Darwinian natural selection and biologically inspired operations. GP can also be applied to infer models in hydrologic engineering problems such as rainfall-runoff modelling or runoff forecasting. In problems where complete understanding of the physical process is lacking or the process is too complicated to be modeled, GP may offer some assistance or insight.

An application area of GP is that of real-time runoff forecasting. In real-time runoff forecasting, for example, incorporating knowledge of the prediction errors of past forecasts into forecasting models of different horizons can greatly improve the models' performance. In runoff forecasting, information on the immediate past and current states of meteorological conditions and those of the catchment is essential to forecast the catchment's response for different forecast horizons. When a model is applied in a real-time mode, it is necessary to modify or update the forecast based on current information such as observed discharges. There are four updating approaches that update either (1) the input parameters, (2) the state variables, (3) the model parameters, or (4) the output variables. The most commonly used scheme is updating the output variables, or error correction. This approach is adopted in this study.

This study is organized as follows: genetic programming is first introduced, followed by a detailed examination of the various biologically inspired operations used to form new solutions. A simple example is then used to illustrate the strength of GP. Finally, GP is used to improve the simulated runoff of a rainfall-runoff model applied to an actual catchment.

GENETIC PROGRAMMING

Genetic Programming (GP) is a relatively new domain-independent method for evolving computer programs to solve, or approximately solve, problems (Koza, 1992). In engineering applications, GP is frequently applied to model structure identification problems. In such applications, GP is used to infer the underlying structure of either a natural or experimental process in order to model the process numerically. GP-inferred models have the advantages of (1) generating simple, parsimonious expressions and (2) offering some possible interpretations of the underlying process.

GP began as an attempt to discover how computers could learn to solve problems without being explicitly programmed to do so. One successful application of GP in automatic program discovery is symbolic regression, as opposed to traditional numerical regression. This makes the application of GP even more relevant in fields where large amounts of data are accumulating in machine-readable form. For example, GP has been applied to (1) predict chaotic financial time series (Oakley and Howard, 1994); (2) predict the occurrence of sunspots (Lee and Suzuki, 1995); (3) solve various hydraulics problems, such as the rainfall-runoff relationship from synthetic data, sediment transport modelling, salt intrusion in estuaries and flow over a flexible vegetated bed (Babovic and Abbott, 1997); and (4) emulate the rainfall-runoff process (Whigham and Crapper, 1999).

GP belongs to a class of probabilistic search procedures known as Evolutionary Algorithms (EAs), which includes Genetic Algorithms (GA) (Holland, 1975), Evolutionary Programming (EP) (Fogel et al., 1966) and Evolutionary Strategy (ES) (Schwefel, 1981). These techniques use computational models of natural evolutionary processes for the development of computer-based problem-solving systems. All evolutionary algorithms function by simulating the evolution of individual structures via processes of reproductive variation and fitness-based selection. The techniques have become extremely popular due to their success at searching complex non-linear spaces and their robustness in practical applications.

Basic Principles of Genetic Symbolic Regression

Genetic Symbolic Regression (GSR) is a special application of GP in the field of symbolic regression. In traditional numerical regression, one pre-determines the functional form, either linear or higher order, and the task is to determine the coefficients. In symbolic regression, the task is both to find a suitable functional form and to determine the coefficients. Hence, GSR involves finding a mathematical expression, in symbolic form, relating a finite sampling of values of a set of independent variables ($x_i$) and a set of dependent variables ($y_j$).

GSR can be viewed as an extension of the Genetic Algorithm (GA) in terms of the basic principles of operation. Like GA, GSR works with a number of solution sets, known collectively as a population, rather than a single solution at any one time. Working with a large number of solution sets gives both techniques the advantage of being less likely to become trapped in local optima. There are, however, two major differences between GP and GA:

(i) GSR works with two sets of variables, instead of one set of variables as in GA. One set, known as the terminal set, contains the independent variables and constants, {$x_i$}, similar to GA. The other set, known as the functional set, contains the basic operators used to form the function, f( ). For example, the functional set may contain the operators {+, -, *, /, ^, log, sin, tanh, exp, ...}, depending on the perceived degree of complexity of the regression. Thus, the symbolic regression is performed using these two sets, and it is possible to derive a large number of possible functional relationships to fit the data.

(ii) In most EAs, the length of the solution set is normally fixed. In GP, however, the length is allowed to vary from one solution set to another. This variation in length is due to the two genetic operators, crossover and mutation. The flexibility in the structure length increases the search space significantly.

The solution sets in each iteration are collectively known as a generation. In GP, the size of a population does not have to be the same from one generation to the next. The solutions of the very first generation are usually generated through a random process, whereas those of the subsequent generations are generated through genetic operations. Each possible solution set can be represented and visualized either in parse tree form or in Polish notation (Lukasiewicz, 1957), as shown in Fig. 1.

As the population evolves, new solution sets replace the older ones and are supposed to perform better. The solution sets in a population associated with the best-fit individuals will, on average, be reproduced more often than the less-fit solution sets. This is known as the Darwinian principle of the "survival of the fittest". The basic procedure of GP (Fig. 2) can be described as follows:

1. generate the set of initial population;
2. evaluate each parse tree and assign the fitness;
3. form the temporary population by selecting candidates according to their fitness. This temporary population is called the mating pool. Candidates with higher fitness are given greater probabilities to mate and, hence, to produce children or offspring;
4. choose pairs of parse trees randomly from the temporary mating pool for mating and apply the genetic operator called crossover. Crossover is the exchange of genetic material (such as fitness, composition) between two selected candidates;
5. randomly select a crossover site where the material will be exchanged, thereby resulting in the creation of offspring;
6. apply another genetic operator, known as mutation, which randomly changes the genetic information of the candidate;
7. copy the resultant chromosomes into the new population;
8. evaluate the performance of the new population;
9. repeat steps 3-8 until a predetermined criterion is reached.

Selection, Crossover and Mutation

Selection is the process of altering the fitness of the individual with respect to the whole generation's average performance. This is an important step because it directly determines the individual's chances of being represented in the next generation. A common selection method is fitness ranking: individuals are sorted according to their fitness values and ranked accordingly. Another selection method is tournament selection. This method fills the mating pool without the need for fitness mapping. Pairs of individuals are picked at random from the population, and the individual with the higher fitness value is copied into the mating pool directly. This is repeated until the pool is filled.

Crossover is the first process of producing new individuals from the selected individuals in the mating pool. It takes two individuals and cuts each of them, at a randomly chosen position, into two segments (Figure 3a). Exchanging the segments produces two new possible solution sets (Figure 3b). The two new solution sets, or children, inherit some characteristics from their parents, and genetic information is thereby exchanged in the process. As the process continues, the fitness of the whole population increases and the search converges toward a near-optimal solution set.

Mutation is the random alteration of the individual parse tree at the branch or node level (Figure 4). Mutation is a mechanism that perturbs the parse tree structure. It does not usually change the tree structure but rather the information content in the parse tree. It therefore explores a new domain and serves to free the search from the possibility of being trapped in local optima. It should be noted that mutation can be destructive, causing rapid degradation of relatively "fit" solution sets if the probability of mutation is set too high. Depending on the strategy adopted, the mutation rate can be low, as in GA-type mutation, or high, as in ES-type mutation.

The genetic programming introduced here is one of the simplest forms available. Instead of Polish notation, GP solution sets can also be represented in other forms such as directed acyclic graphs (Handley, 1994), linear representation (Perkis, 1994) and directed graphs (Poli, 1996).
The evolution processes can also vary, for example by using automatically defined functions (Koza, 1994) or adaptive representation through learning (Rosca and Ballard, 1996). These details can be found, for example, in Koza (1992, 1994), Kinnear (1994), Angeline and Kinnear (1996) and Langdon (1998).

AN EXAMPLE OF GENETIC SYMBOLIC REGRESSION

An example is shown here to illustrate the concept of GSR. The problem of interest is to infer the Bernoulli equation for a steady, 1-dimensional fluid flow:

$$E = z + \frac{p}{\gamma} + \frac{v^2}{2g} = \text{const} \qquad (1)$$

where
z = vertical distance above a datum (m)
p = pressure (N/m²)
v = velocity (m/s)
g = Earth's gravitational acceleration (9.81 m/s²)
γ = specific weight of water (9810 N/m³)

From Eq. (1), 1000 sets of different combinations of z, p and v are generated using a standard random number generator. The values of the energy head, E, are then computed correspondingly. It should be noted that this example was used in Maarten and Babovic (1999), where problems involving variables of different dimensions are discussed. For example, if variable z were to be multiplied by variable v, the resulting expression would have a dimension of square-of-length/time, which is inconsistent with the dimension of variable E (length). In their paper, such occurrences of dimensional inconsistency are penalized to ensure dimensional consistency of the resulting GP expression. In this study, however, the problem of dimensional inconsistency is circumvented by normalizing, or non-dimensionalizing, each of the values of z, p and v using that variable's maximum value. By doing so, Eq. (1) is transformed to a new form:

$$E_{\max}\,\bar{E} = z_{\max}\,\bar{z} + \frac{p_{\max}\,\bar{p}}{\gamma} + \frac{v_{\max}^2\,\bar{v}^2}{2g} = \text{const} \qquad (2)$$

or

$$\bar{E} = C_1\,\bar{z} + C_2\,\bar{p} + C_3\,\bar{v}^2 = \text{const} \qquad (3)$$

where the non-dimensionalized variables are given by:

$$\bar{E} = \frac{E}{E_{\max}} \qquad (4a)$$
$$\bar{z} = \frac{z}{z_{\max}} \qquad (4b)$$
$$\bar{p} = \frac{p}{p_{\max}} \qquad (4c)$$
$$\bar{v} = \frac{v}{v_{\max}} \qquad (4d)$$

and the coefficients are:

$$C_1 = \frac{z_{\max}}{E_{\max}} \qquad (5a)$$
$$C_2 = \frac{p_{\max}}{\gamma\,E_{\max}} \qquad (5b)$$
$$C_3 = \frac{v_{\max}^2}{2g\,E_{\max}} \qquad (5c)$$

The terminal set used is the set of non-dimensionalized variables {$\bar{z}$, $\bar{p}$, $\bar{v}$}, while the functional set is {+, -, /, *}. Crossover is performed by randomly choosing the sub-tree insertion location. The objective function is to find the minimum of the root mean square error (RMSE) of the predicted energy head. The other GP relevant parameters and their values are shown in Table 1. The genetic programming software used in this study is GPkernel, developed at the Danish Hydraulic Institute. The initial population is generated by a random number generator. The size of the trees of the initial population is constrained to a maximum of fifteen levels, and the subsequent tree size is constrained to forty-five levels. This restriction is necessary since GP has the tendency to infer a Fourier expansion-type function if the tree size is not limited.
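The non-dimensionalization in Eqs. (2) to (5) can be checked numerically. The following is a minimal Python sketch, not the GPkernel setup: the sampling ranges for z, p and v and the function names are illustrative assumptions, and only the constants g and γ are taken from Eq. (1). It generates 1000 samples, scales each variable by its maximum, and confirms that the scaled energy head equals $C_1\bar{z} + C_2\bar{p} + C_3\bar{v}^2$.

```python
import random

G = 9.81        # gravitational acceleration, m/s^2 (from Eq. 1)
GAMMA = 9810.0  # specific weight of water, N/m^3 (from Eq. 1)

def energy_head(z, p, v):
    """Bernoulli energy head E = z + p/gamma + v^2/(2g), Eq. (1)."""
    return z + p / GAMMA + v * v / (2.0 * G)

def make_samples(n, seed=0):
    """Draw n random (z, p, v) triples; the ranges are illustrative assumptions."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 1.0e5), rng.uniform(0.0, 5.0))
            for _ in range(n)]

def nondimensionalize(samples):
    """Scale every variable by its maximum and return the coefficients of Eq. (5)."""
    heads = [energy_head(z, p, v) for z, p, v in samples]
    z_max = max(s[0] for s in samples)
    p_max = max(s[1] for s in samples)
    v_max = max(s[2] for s in samples)
    e_max = max(heads)
    coeffs = (z_max / e_max,                    # C1, Eq. (5a)
              p_max / (GAMMA * e_max),          # C2, Eq. (5b)
              v_max ** 2 / (2.0 * G * e_max))   # C3, Eq. (5c)
    scaled = [(z / z_max, p / p_max, v / v_max, e / e_max)
              for (z, p, v), e in zip(samples, heads)]
    return scaled, coeffs

samples = make_samples(1000)
scaled, (c1, c2, c3) = nondimensionalize(samples)
# Largest deviation between the scaled head and C1*z + C2*p + C3*v^2, Eq. (3)
max_residual = max(abs(e - (c1 * z + c2 * p + c3 * v * v)) for z, p, v, e in scaled)
print(max_residual)
```

After scaling, every terminal lies in [0, 1] and carries no physical dimension, so any combination of terminals that the functional set can produce is dimensionally admissible.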

This type of expansion, although it may fit the data very well, does not add value in the function interpretation.

Ten different runs are performed, each using a different seed to generate the random numbers. Each time, GPkernel is run for 15 minutes. Fig. 5 shows the average root mean square error (RMSE) of each generation in the GSR runs. It should be noted that the average RMSE, for all 10 runs, decreases rapidly as the generations progress and reaches values near zero for most of the runs. As a result, exact formulae (up to 3 significant figures) are produced from each run.

The above simple example illustrates the capability of GP, or the GSR technique, to infer the correct functional relationship when there are no errors in the raw data. The next section illustrates the use of GSR as a new updating procedure combined with rainfall-runoff simulation models.

GP APPLICATION IN REAL-TIME RUNOFF FORECASTING

A forecasting system takes information on the past and current states of meteorological conditions and those of the catchment as inputs, and forecasts the catchment's response into the future. In real-time forecasting, however, the originally forecast values may be updated or modified as measured data become available; prediction errors can thus be determined and used for forecasting.

In real-time runoff forecasting with rainfall-runoff simulation models, the rainfall time series up to the desired runoff forecast horizon must be available. The required rainfall time series within the runoff forecast horizon may be estimated with, for example, a nonlinear prediction method. In this study, the measured rainfall time series is made available for every runoff forecast horizon, so that the performance of the proposed GSR-based error updating scheme can be evaluated.

The focus of this study is to: (1) compare the forecasts of a calibrated rainfall-runoff model, e.g. NAM, with and without the GSR-based error updating scheme; and (2) suggest how far into the future, i.e. the maximum forecast horizon, the GSR-based error updating scheme can be used with confidence.

The catchment simulation model used in this study is a lumped rainfall-runoff model called Nedbør-Afstrømnings-Model (NAM), which is presently part of the MIKE11 software. The NAM model was developed by the Hydrological Section of the Institute of Hydrodynamics and Hydraulic Engineering at the Technical University of Denmark (Nielsen and Hansen, 1973). The model can be defined as a deterministic, conceptual, lumped-type model with moderate input data requirements. The NAM model consists of a set of linked mathematical statements describing, in a simplified quantitative form, the behaviour of the land phase of the hydrological cycle. The input data requirements are the catchment size, precipitation, potential evapo-transpiration and temperature. The model operates by continuously accounting for the moisture content in the snow zone, surface zone, sub-surface zone and groundwater zone storage. It can be calibrated using historical data by adjusting one or more of its seventeen parameters.

Figure 6 shows the schematic diagram of the proposed error updating using GSR. The NAM model is first used to simulate the discharge, QSIM, for the whole period of interest based on the rainfall data, R. The proposed procedure is then used to compute the prediction error, ε, by comparing the simulated discharge, QSIM, with the observed discharge, QOBS, for time t. The new simulated or improved discharge, $QIMP_t$, is computed by adjusting $QSIM_t$ for each forecast lead-time within the forecast horizon.
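The updating loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `placeholder_error_model` is a hypothetical stand-in (damped persistence of the last error) for the function that GSR would evolve, the window of five hourly steps is chosen here for illustration, and the discharge values are made up. Beyond the last observation, predicted errors take the place of measured ones so that longer lead times can be reached.

```python
def prediction_errors(qobs, qsim):
    """Simulation errors eps_t = QOBS_t - QSIM_t over the observed period."""
    return [o - s for o, s in zip(qobs, qsim)]

def placeholder_error_model(recent_qsim, recent_errors):
    """Hypothetical stand-in for the GSR-evolved function: damped persistence."""
    return 0.8 * recent_errors[-1]

def updated_forecast(qobs, qsim, horizon, error_model=placeholder_error_model, window=5):
    """Return QIMP for the `horizon` steps that follow the last observation.

    qsim must cover the observed period plus the forecast horizon.
    """
    n_obs = len(qobs)
    errors = prediction_errors(qobs, qsim[:n_obs])
    improved = []
    for lead in range(horizon):
        idx = n_obs + lead
        # Most recent simulated discharges (including the forecast step) and errors
        eps_hat = error_model(qsim[idx - window + 1: idx + 1], errors[-window:])
        improved.append(qsim[idx] + eps_hat)  # QIMP = QSIM + predicted error
        errors.append(eps_hat)                # predicted error feeds later lead times
    return improved

# Made-up hourly discharges (m^3/s): five observed steps, three forecast steps
qobs = [4.0, 5.0, 6.5, 8.0, 9.0]
qsim = [3.5, 4.4, 5.8, 7.2, 8.0, 8.5, 8.2, 7.6]
print(updated_forecast(qobs, qsim, horizon=3))
```

Passing a different `error_model` callable is all that is needed to swap the placeholder for an evolved expression, which keeps the simulation model and the error corrector decoupled.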

Mathematically, the measured discharge, $QOBS_t$, at time t, can be expressed as:

$$QOBS_t = QSIM_t + \varepsilon_t \qquad (6a)$$

or

$$\varepsilon_t = QOBS_t - QSIM_t \qquad (6b)$$

GSR is used to infer the functional relationship, F( ), between the simulated discharges and the past simulation errors, and the present simulation error. For a lead time of 1 hour, the functional relationship for the GSR prediction error, $\hat{\varepsilon}_{t+1}$, may be expressed as follows:

$$\hat{\varepsilon}_{t+1} = F\{QSIM_{t+1}, QSIM_t, \ldots, QSIM_{t-4}, \varepsilon_t, \varepsilon_{t-1}, \ldots, \varepsilon_{t-4}\} \qquad (7a)$$

and the forecast improved discharge, $QIMP_{t+1}$, can be obtained from:

$$QIMP_{t+1} = QSIM_{t+1} + \hat{\varepsilon}_{t+1} \qquad (7b)$$

For lead times of 2, 3, ..., α hours, the recursive form of Eq. (7) can be written as:

$$\hat{\varepsilon}_{t+\alpha} = F\{QSIM_{t+\alpha}, \ldots, QSIM_{t+\alpha-4}, \hat{\varepsilon}_{t+\alpha-1}, \ldots, \hat{\varepsilon}_{t+\alpha-4}\} \qquad (8a)$$

$$QIMP_{t+\alpha} = QSIM_{t+\alpha} + \hat{\varepsilon}_{t+\alpha} \qquad (8b)$$

QSIM and ε of the immediate past 5 time steps are included in the functional relationship since the catchment's time of concentration varies up to a maximum of 5 hours, i.e. 5 time steps (WMO, 1992). It should be noted that the error terms on the right-hand side of Eq. (8a) may be either the actual errors, at instances when measured data are available, or GSR-derived errors. The real-time flood forecasting with updating procedure for a 1-hour lead-time could be summarized as follows:

1. The NAM model is first calibrated using an automatic calibration routine, e.g. Accelerated Convergence Genetic Algorithm, ACGA (Liong et al., 1998), on the entire runoff period from ;
2. The prediction errors, denoted by ε, between the NAM simulated and observed runoff for each time interval, are computed;
3. Ten storm events, representing high flow regimes with a minimum discharge of 4 m³/s, are selected from the calibration runoff period for the symbolic regression using GP. This minimum discharge criterion is also used in the selection of the verification data set;
4. GSR is then used to derive the functional relationship between the present prediction error, $\hat{\varepsilon}$, and the NAM simulated discharge, QSIM, and the past prediction errors, ε, as given in Eq. (7a);
5. The improved simulated discharge, QIMP, is finally calculated using Eq. (7b) and compared with the measured discharge.

For α > 1, the above procedure is repeated following Eqs. (8a) and (8b).

In this study, the catchment under consideration is the Orgeval catchment in France (Fig. 7), which has been studied extensively in the World Meteorological Organization's inter-comparison project (WMO, 1992). The catchment is located about 80 km east of Paris, and the main river that drains the catchment runoff is the Orgeval. The catchment has an area of about 104 km². It comprises mainly rural area, with only 1 percent of the total area urban and 18 percent of the total area forest.

In this study, a total of ten storm events (denoted as storms S1 to S10) from a hourly flow record was selected for training the GSR (Fig. 8(a)), while a total of six storm events (denoted as storms S11 to S16) between (Fig. 8(b)) was selected and used for the verification of the updating procedure. Fig. 9(a) and 9(b) show the observed and NAM simulated hydrographs for two of the selected GSR training events. These ten storm events represent high flow regimes, which are the focus of GSR training in this study. It should be noted that the maximum peak discharge of the GSR training storms is 7.38 m³/s, while those of the verification storm events range from 10 m³/s to 29 m³/s.

Following the example in the previous section, both the dependent variable, $\hat{\varepsilon}$, and the independent variables, QSIM and ε, are nondimensionalized using their respective maximum values. Hence, the terminal set in GSR for the 1-hr lead time, for example, is given by the normalized values of the simulated discharges and past prediction errors that appear as arguments of F in Eq. (7a), and the functional set is given by the basic algebraic operators {+, -, *, /}. Henceforth, all variables used in the study are normalized variables and the bar sign on each variable is therefore omitted. The objective function searches for the minimum of the root mean square error (RMSE) of the predicted error, $\hat{\varepsilon}$.

In this study, the GP program, GPkernel, was run for 30 minutes on a Pentium II 300 PC. The other GP relevant parameters and their values are shown in Table 2. The population sizes of the parents and children are both set at 3000 (Table 2). It is to be noted that, from the various runs attempted, it was difficult to achieve good prediction accuracy with a smaller population size.

Discussion and Analysis of Results

The best functional form, with the minimum RMSE, resulting from GSR is as follows:

ˆ ε t+ 1 = ε t ε ( QSIM QSIM ) t 2 t 0.644ε t 1 t 1 (9a)

or

QIMP t+ 1 = QSIM + t ε t ( QSIM QSIM ) t ε t t ε t 1 (9b)

Eq. (9a) shows a certain degree of similarity with the commonly used auto-regressive form, with the exception of the fourth term, an interaction term between the simulated discharges and a prediction error. This interaction term is significant in rectifying under- or over-prediction trends of the simulation model. Eq. (9) also shows that only simulated discharges and/or prediction errors of up to three previous time-steps are important. This implies that, for the data used to derive the above functional relationship, the catchment's time of concentration is more likely about three hours.

Table 3 shows the root mean square errors (RMSE) of the various prediction horizons for each of the verification storm events. From this table, it can be seen that the RMSE of each event is better than, or of the same order of magnitude as, that of the simulation model (NAM). Up to a 4-hour lead time, for all storm events considered except storm event S11, the RMSE values of NAM+GSR are consistently better than those resulting from NAM only. Thus, the proposed GSR updating scheme can be used up to a lead time of 4 hours with high confidence.

Figs. 10 and 11 show the performance of the proposed procedure with different updating frequencies of 2 hours, 4 hours and 6 hours for 2 verification storm events. They show clearly that the performance of the GSR error updating method is acceptable for all the updating frequencies.

The World Meteorological Organization (WMO) conducted a workshop in 1988 and published a report entitled Simulated Real-Time Inter-Comparison of Hydrological Models (WMO, 1992). The WMO study compared the performances of 14 different updating procedures. The study found that the error updating procedure NAMS11 (Havno et al., 1995) and the state updating procedure NAMKAL (Storm et al., 1988) yielded the best performance on the French Orgeval catchment. Briefly, NAMS11 applies: (1) the NAM rainfall-runoff simulation model and the MIKE11 hydrodynamic module; and (2) an error correction technique based on a first-order autoregressive model. NAMKAL is a modified NAM model, reformulated in state space form and updated with an extended Kalman filtering algorithm.

In the WMO study, forecasts were updated every fourth hour. Thus, their results are now compared with this study's proposed GSR-based updating scheme for a 4-hour runoff forecast horizon. The choice of the updating interval coincides with the conclusion drawn earlier from Table 3. Figure 12(a) and (b) show the average RMSE of two verification storm events (storms S12 and S15) resulting from the various updating schemes: NAM+GSR, NAMS11 and NAMKAL. These two storm events are the same as those chosen in Figs. 10 and 11. Fig. 12(a) shows that, for storm event S12, the proposed NAM+GSR performs better in the first 3 hours than NAMS11 and NAMKAL, while in the fourth hour they all perform equally. Fig. 12(b), however, shows clearly that, for storm event S15, NAM+GSR is consistently better than the two other techniques.
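For readers unfamiliar with the baseline, a first-order autoregressive error correction of the kind attributed to NAMS11 above can be sketched as follows. This is a generic AR(1) illustration in Python, not the NAMS11 or MIKE11 implementation: the least-squares estimate of the coefficient, the function names and all numbers are assumptions made for the example. The last known error is propagated forward as $\phi^k \varepsilon_t$ and added to the simulated discharge, and RMSE is used to compare corrected and uncorrected forecasts.

```python
import math

def fit_ar1(errors):
    """Least-squares AR(1) coefficient phi for eps_t = phi * eps_{t-1} + noise."""
    num = sum(a * b for a, b in zip(errors[1:], errors[:-1]))
    den = sum(b * b for b in errors[:-1])
    return num / den if den else 0.0

def ar1_error_forecast(last_error, phi, horizon):
    """Predicted errors for lead times 1..horizon: phi**k * last_error."""
    return [last_error * phi ** k for k in range(1, horizon + 1)]

def rmse(observed, predicted):
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Made-up numbers: past simulation errors, then a 4-hour forecast window
past_errors = [1.0, 0.5, 0.25, 0.125]
phi = fit_ar1(past_errors)
qsim_future = [8.0, 7.5, 7.0, 6.5]
qobs_future = [8.06, 7.53, 7.02, 6.51]
eps_hat = ar1_error_forecast(past_errors[-1], phi, horizon=4)
corrected = [q + e for q, e in zip(qsim_future, eps_hat)]
print(phi, rmse(qobs_future, qsim_future), rmse(qobs_future, corrected))
```

Because the AR(1) correction depends only on the last error, it cannot express an interaction between simulated discharges and errors such as the one GSR found in Eq. (9a).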

CONCLUSIONS

A relatively new evolutionary technique, known as genetic programming (GP), has been introduced. GP was used to evolve codes for the solution of problems. A simple example of the Bernoulli equation was used to illustrate how GP symbolically regresses, or infers, the relationship between the input and output variables. An important conclusion from this study is that non-dimensionalizing the variables prior to the symbolic regression process significantly enhances the success of GSR.

GP was then applied to the problem of real-time runoff forecasting for the Orgeval catchment in France. GP functioned as an error updating procedure complementing the rainfall-runoff model, NAM. Ten storm events were used to infer the relationship between the NAM simulated runoff and the corresponding prediction error. That relationship was subsequently used for real-time forecasting of six storm events. The results indicate that the proposed methodology is able to forecast different storm events with great accuracy for different updating intervals. The forecast hydrograph performs well even for a long forecast horizon of up to nine hours. However, for practical applications in real-time runoff forecasting, the updating interval should be less than or equal to the time of concentration of the catchment. The results were also compared with two known updating methods, auto-regression and the Kalman filter. Comparisons show that the proposed scheme, NAM+GSR, is comparable to these methods for real-time runoff forecasting.

ACKNOWLEDGMENTS

The work is sponsored jointly by the National University of Singapore research project, RP and the Talent Project No. Data to Knowledge - D2K funded by the Danish Technical Research Council (STVF). Part of the work was done by the first author during his study leave at the Danish Hydraulic Institute (DHI).

LITERATURE CITED

Angeline, P. J. and Kinnear, K. E., (1996). Advances in Genetic Programming 2. The MIT Press, Cambridge, MA.

Babovic, V. and Abbott, M. B., (1997). The evolution of equations from hydraulic data, part II: applications. Journal of Hydraulic Research, 35(3):

Fogel, L. J., Owens, A. J. and Walsh, M. J., (1966). Artificial Intelligence through Simulated Evolution. John Wiley, New York.

Handley, S., (1994). On the use of a directed acyclic graph to represent a population of computer programs. In Proceedings of the 1994 IEEE World Congress on Computational Intelligence. IEEE Press, pp

Holland, J. H., (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Kinnear, K. E., (1994). Advances in Genetic Programming. The MIT Press, Cambridge, MA.

Koza, J. R., (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA.

Koza, J. R., (1994). Genetic Programming 2: Automatic Discovery of Reusable Programs. The MIT Press, Cambridge, MA.

Langdon, W. B., (1998). Genetic Programming and Data Structures. Kluwer Academic Publishers, Norwell, MA.

Lee, G. Y. and Suzuki, A., (1995). Genetic programming approach for time series analysis and prediction. Journal of Graduate School and Faculty of Engineering, University of Tokyo (B), 43(2):

Liong, S. Y., Khu, S. T. and Chan, W. T., (1998). Derivation of Pareto front with accelerated convergence genetic algorithm, ACGA. In Proceedings of the 3rd International Conference on Hydroinformatics, Babovic, V. and Larsen, L. C. (eds.), Volume 2, pp

Lukasiewicz, J., (1957). Aristotle's Syllogistic from the Standpoint of Modern Formal Logic. Oxford, Clarendon Press.

Maarten, K. and Babovic, V., (1999). Dimensionally Aware Genetic Programming. Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-99, Daida, W. B., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M. and Smith, R. E. (eds.), Morgan Kaufmann, CA.

Nielsen, S. A. and Hansen, E., (1973). Numerical simulation of rainfall runoff process on a daily basis. Nordic Hydrology, 4:

Oakley, N. and Howard, E., (1994). The application of genetic programming to the investigation of short, noisy, chaotic data series. In Evolutionary Programming, Lecture Notes in Computer Science, Fogarty, T. C. (ed.), No. 865, Springer-Verlag, Germany, pp

Perkis, T., (1994). Stack-based genetic programming. In Proceedings of the 1994 IEEE World Congress on Computational Intelligence, Vol. 1, IEEE Press, pp

Poli, R., (1996). Parallel distributed genetic programming. Technical Report CSRP , School of Computer Science, University of Birmingham.

Rosca, J. P. and Ballard, D. H., (1996). Discovery of subroutines in genetic programming. In Advances in Genetic Programming 2, Angeline, P. J. and Kinnear, K. E. (eds.), The MIT Press, Cambridge, MA., pp

Rungo, M., Refsgaard, J. C. and Havno, K., (1991). The updating procedure in the MIKE11 modelling system for real-time forecasting. Proceedings of the International Symposium on Hydrological Applications of Weather Radar, Ellis Horwood publication, University of Salford, August 1989, pp

Schwefel, H. P., (1981). Numerical Optimization of Computer Models. John Wiley, Chichester.

Storm, B., Jensen, K. H. and Refsgaard, J. C., (1988). Estimation of catchment rainfall uncertainty and its influence on runoff prediction. Nordic Hydrology, Vol. 19, pp

Whigham, P. A. and Crapper, P. F., (1999). Modelling Rainfall-runoff Relationships using Genetic Programming. Special Issue of Journal of Mathematical and Computer Modelling (in press).

WMO, (1992). Simulated Real-Time Inter-Comparison of Hydrological Models. WMO Operational Hydrology Report no. 38, WMO no World Meteorological Organization, Geneva.

List of Tables:

Table 1: Genetic Programming Parameters Used in Bernoulli Equation Example
Table 2: Genetic Programming Parameters Used in Real-time Runoff Forecasting Example
Table 3: Root Mean Square Error of Testing Storms for Different Prediction Lead-times

Table 1: Genetic Programming Parameters Used in Bernoulli Equation Example

Parameter                    Value
Size of parent               1000
Size of children             1000
Tournament size              3
Crossover rate               1.0
Mutation rate                0.3
Maximum initial tree size    15
Maximum tree size            45

Table 2: Genetic Programming Parameters Used in Real-time Runoff Forecasting Example

Parameter                    Value
Size of parent               3000
Size of children             3000
Tournament size              3
Crossover rate               1.0
Mutation rate                0.3
Maximum initial tree size    15
Maximum tree size            30

Table 3: Root Mean Square Error of Testing Storms for Different Prediction Lead-times

Lead-time (hours) 1 (NAM) average RMSE
Storm Storm Storm Storm Storm Storm
(0.878) (2.675) (1.810) (4.515) (3.199) (1.657)
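As a reminder of how an entry in a table like this is obtained, the following minimal sketch computes a root mean square error between observed and forecast discharges. It assumes the usual definition of RMSE; the function name `rmse` and the discharge values are hypothetical and are not taken from the study, and any averaging over storms or lead times that the study applies is not reproduced here.

```python
import math

def rmse(observed, forecast):
    """Root mean square error between two equally long discharge series."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(o - f) ** 2 for o, f in zip(observed, forecast)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical discharges in m3/s, for illustration only.
q_obs = [2.0, 4.0, 6.0]
q_fc = [2.0, 5.0, 8.0]
print(rmse(q_obs, q_fc))
```

Because the errors are squared before averaging, a single large miss near the flood peak weighs more heavily than several small misses on the recession limb.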

List of Figures

Figure 1: Different Forms of Representation in Genetic Programming
Figure 2: Basic Procedure of Genetic Programming
Figure 3: Crossover in Genetic Programming
Figure 4: Mutation in Genetic Programming
Figure 5: Rapid Convergence of GSR for Bernoulli Equation Example
Figure 6: Schematic Diagram of Updating Procedure
Figure 7: Location of Orgeval Catchment
Figure 8: Hydrographs of (a) Training and (b) Verification Storm Events
Figure 9: Comparison of Observed and Simulated Hydrographs for 2 Training Storm Events: (a) S2 and (b) S7
Figure 10: Updating Every α hours and Forecasting up to 6 hours for Verification Storm Event S12: (a) α = 2 hours; (b) α = 4 hours; and (c) α = 6 hours
Figure 11: Updating Every α hours and Forecasting up to 6 hours for Verification Storm Event S15: (a) α = 2 hours; (b) α = 4 hours; and (c) α = 6 hours
Figure 12: Average RMSE of 2 Verification Storm Events: (a) S12 and (b) S15

Figure 1: Different Forms of Representation in Genetic Programming

(i) A simple expression: a + (b * c)
(ii) As Polish notation (prefix): + a * b c
(iii) As reverse Polish notation (postfix): a b c * +
(iv) As parse tree:

      +
     / \
    a   *
       / \
      b   c
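The link between these representations can be illustrated with a short sketch. It assumes the parse tree with root +, left child a and right child b * c, stored as nested Python tuples of the form (operator, left, right). This encoding and the helper names are illustrative choices, not something prescribed by the paper. Walking the tree operator-first gives the prefix form, operator-last gives the postfix form, and operator-in-between gives the ordinary algebraic form.

```python
def to_prefix(tree):
    """Operator first, then left and right subtrees (Polish notation)."""
    if isinstance(tree, str):          # a leaf is a variable name
        return [tree]
    op, left, right = tree
    return [op] + to_prefix(left) + to_prefix(right)

def to_postfix(tree):
    """Left and right subtrees first, operator last (reverse Polish notation)."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return to_postfix(left) + to_postfix(right) + [op]

def to_infix(tree):
    """Fully parenthesised algebraic form."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return f"({to_infix(left)} {op} {to_infix(right)})"

expr = ("+", "a", ("*", "b", "c"))     # parse tree of a + (b * c)
print(" ".join(to_prefix(expr)))       # + a * b c
print(" ".join(to_postfix(expr)))      # a b c * +
print(to_infix(expr))                  # (a + (b * c))
```

The three strings carry the same information as the tree. The tree form is convenient for the genetic operators because a subtree is always a complete, valid sub-expression.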

Figure 2: Basic Procedure of Genetic Programming (flowchart)

1. START: set Generation = 0 and generate an initial random population.
2. Termination criteria satisfied? If YES, STOP; if NO, continue.
3. Evaluate the fitness of each individual in the population.
4. Conduct genetic operations based on probability:
   - Crossover: select 2 individuals based on fitness and perform crossover.
   - Mutation: select one individual and perform mutation.
5. Introduce the newly generated individuals into the new population pool.
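The procedure in this flowchart can be sketched as a generic loop. The sketch below rests on several assumptions that are not from the paper: termination is a fixed number of generations, selection is by tournament with lower error being fitter, and the operation probability `p_crossover` is an arbitrary illustrative value. To stay short, the demonstration evolves plain numbers towards a target value instead of program trees, so `blend` and `jitter` are stand-ins for tree crossover and tree mutation.

```python
import random

def tournament(population, fitness, k, rng):
    """Draw k individuals at random and return the one with the lowest error."""
    return min(rng.sample(population, k), key=fitness)

def evolve(population, fitness, crossover, mutate,
           generations=20, p_crossover=0.7, tournament_size=3, seed=0):
    """Generational loop: select, apply one genetic operation, refill the pool."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):       # assumed termination: fixed generation count
        new_pool = []
        while len(new_pool) < size:
            if rng.random() < p_crossover:
                first = tournament(population, fitness, tournament_size, rng)
                second = tournament(population, fitness, tournament_size, rng)
                new_pool.extend(crossover(first, second, rng))
            else:
                parent = tournament(population, fitness, tournament_size, rng)
                new_pool.append(mutate(parent, rng))
        population = new_pool[:size]   # keep the population size constant
    return population

# Toy stand-ins (hypothetical): individuals are floats, the target value is 3.0.
def error(x):
    return abs(x - 3.0)

def blend(a, b, rng):
    w = rng.random()
    return [w * a + (1 - w) * b, (1 - w) * a + w * b]

def jitter(x, rng):
    return x + rng.gauss(0.0, 0.1)

start_rng = random.Random(1)
initial = [start_rng.uniform(-10.0, 10.0) for _ in range(50)]
final = evolve(initial, error, blend, jitter)
print(min(map(error, initial)), min(map(error, final)))
```

Swapping the toy operators for tree-based crossover and mutation, and the error function for a regression error on data, turns the same loop into a symbolic regression run.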

Figure 3: Crossover in Genetic Programming

(a) Parents, in direct algebraic form:
    Parent 1: a + (b * c)
    Parent 2: (b + a) * ((a * c) - d)

(b) Children, in direct algebraic form:
    Child 1: a + ((a * c) - d)
    Child 2: (b + a) * (b * c)
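The exchange shown in this figure can be reproduced with a small sketch of subtree crossover. It assumes trees stored as nested tuples (operator, left, right) and crossover points given as paths of child indices. Both conventions are illustrative choices. In an actual GP run the crossover points are drawn at random, whereas here they are fixed so that the children match the figure.

```python
def get_subtree(tree, path):
    """Follow a path of child indices (1 = left, 2 = right) from the root."""
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of tree in which the subtree at path is replaced by new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(children)

def crossover(parent1, path1, parent2, path2):
    """Swap the subtrees rooted at path1 in parent1 and path2 in parent2."""
    sub1 = get_subtree(parent1, path1)
    sub2 = get_subtree(parent2, path2)
    return (replace_subtree(parent1, path1, sub2),
            replace_subtree(parent2, path2, sub1))

def to_infix(tree):
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return f"({to_infix(left)} {op} {to_infix(right)})"

parent1 = ("+", "a", ("*", "b", "c"))                          # a + (b * c)
parent2 = ("*", ("+", "b", "a"), ("-", ("*", "a", "c"), "d"))  # (b + a) * ((a * c) - d)

# Swap the right-hand subtree of each parent.
child1, child2 = crossover(parent1, (2,), parent2, (2,))
print(to_infix(child1))   # (a + ((a * c) - d))
print(to_infix(child2))   # ((b + a) * (b * c))
```

Because whole subtrees are exchanged, both children are guaranteed to be syntactically valid expressions, whatever crossover points are chosen.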

Figure 4: Mutation in Genetic Programming

Before mutation: a + (b * c)
After mutation:  a + (b / c)
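The change shown in this figure, where one operator is replaced by another, can be sketched as a point mutation. The function set `FUNCTIONS`, the tuple encoding (operator, left, right) and the helper names are assumptions made for illustration. Mutation in GP is also often implemented by replacing a whole subtree with a newly generated one, which this sketch does not cover.

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]   # assumed function set, for illustration

def operator_paths(tree, path=()):
    """List the paths (child indices from the root) of all operator nodes."""
    if isinstance(tree, str):      # leaves carry no operator
        return []
    return ([path]
            + operator_paths(tree[1], path + (1,))
            + operator_paths(tree[2], path + (2,)))

def set_operator(tree, path, new_op):
    """Return a copy of tree with the operator at path replaced by new_op."""
    if not path:
        return (new_op, tree[1], tree[2])
    children = list(tree)
    children[path[0]] = set_operator(tree[path[0]], path[1:], new_op)
    return tuple(children)

def mutate(tree, rng):
    """Pick one operator node at random and give it a different operator."""
    path = rng.choice(operator_paths(tree))
    node = tree
    for index in path:
        node = node[index]
    new_op = rng.choice([f for f in FUNCTIONS if f != node[0]])
    return set_operator(tree, path, new_op)

expr = ("+", "a", ("*", "b", "c"))         # a + (b * c)
print(set_operator(expr, (2,), "/"))       # ('+', 'a', ('/', 'b', 'c'))
print(mutate(expr, random.Random(0)))      # one randomly chosen operator changed
```

The first call reproduces the figure deterministically, and the second shows the randomised form that would be used inside an evolutionary loop.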

Figure 5: Rapid Convergence of GSR for Bernoulli Equation Example (plot of average root mean square error against generation).

Figure 6: Schematic Diagram of Updating Procedure. Meteorological data such as rainfall enter the rainfall-runoff simulation model (NAM), whose output is the simulated runoff (QSIM). QSIM and the observed runoff (QOBS) enter the updating procedure (genetic symbolic regression), whose output is the improved runoff (QIMP).

Figure 7: Location of Orgeval Catchment (map labels: Ru de Bourgogne, Ru de Rognon, Ru des Avenelles, Orgeval, The Grand Morin).

Figure 8 (a): Hydrograph of training storm events. Observed discharge (m³/s) against date and time, 1972 to 1975, with the threshold for storm events marked; labelled storm events include S1 to S9.

Figure 8 (b): Hydrograph of verification storm events (validation data). Discharge (m³/s) against date, 1978 to 1980, with the threshold for storm events marked; labelled storm events are S11, S12, S13, S14, S15 and S16.

Figure 9: Comparison of Observed and Simulated Hydrographs for 2 Training Storm Events: (a) storm event 2 (calibration), February 1973; (b) storm event 7 (calibration), March 1974. Each panel plots discharge (m³/s) against date and time, with series Observed and NAM.

Figure 10: Updating Every α hours and Forecasting up to 6 hours for Verification Storm Event S12: (a) α = 2 hours; (b) α = 4 hours; and (c) α = 6 hours. Each panel plots discharge (m³/s) against date and time, from 01/02/79 12:00 to 03/02/79 12:00, with series Observed, NAM+GSR and NAM simulation, and marks the point where updating begins.

Figure 11: Updating Every α hours and Forecasting up to 6 hours for Verification Storm Event S15: (a) α = 2 hours; (b) α = 4 hours; and (c) α = 6 hours. Each panel plots discharge (m³/s) against date and time, from 13/07/80 12:00 to 15/07/80 00:00, with series Observed, NAM+GSR and NAM simulation, and marks the point where updating begins.

Figure 12: Average RMSE of 2 Verification Storm Events: (a) storm event S12 and (b) storm event S15. Each panel plots RMSE (m³/s) against forecast lead time (hrs), with series NAM without updating, GP+GSR, NAMS11 and NAMKAL.