Efficient prediction uncertainty approximation in the calibration of environmental simulation models


WATER RESOURCES RESEARCH, VOL. 44, doi:10.1029/2007WR005869, 2008

Efficient prediction uncertainty approximation in the calibration of environmental simulation models

Bryan A. Tolson (Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada) and Christine A. Shoemaker (School of Civil and Environmental Engineering, Cornell University, Ithaca, New York, USA)

Received 8 January 2007; revised 7 November 2007; accepted 16 January 2008; published 11 April 2008.

[1] This paper is aimed at improving the efficiency of model uncertainty analyses that are conditioned on measured calibration data. Specifically, the focus is on developing an alternative methodology to the generalized likelihood uncertainty estimation (GLUE) technique when pseudolikelihood functions are utilized instead of a traditional statistical likelihood function. We demonstrate for multiple calibration case studies that the most common sampling approach utilized in GLUE applications, uniform random sampling, is much too inefficient and can generate misleading estimates of prediction uncertainty. We present how the new dynamically dimensioned search (DDS) optimization algorithm can be used to independently identify multiple acceptable or behavioral model parameter sets in two ways. DDS could replace random sampling in typical applications of GLUE. More importantly, we present a new, practical, and efficient uncertainty analysis methodology called DDS approximation of uncertainty (DDS-AU) that quantifies prediction uncertainty using prediction bounds rather than prediction limits. Results for 13, 14, 26, and 30 parameter calibration problems show that DDS-AU can be hundreds or thousands of times more efficient at finding behavioral parameter sets than GLUE with random sampling. Results for one example show that for the same limited computational effort, DDS-AU prediction bounds can simultaneously be smaller and contain more of the measured data in comparison to GLUE prediction bounds. We also argue and then demonstrate that within the GLUE framework, when behavioral parameter sets are not sampled frequently enough, Latin hypercube sampling does not offer any improvements over simple random sampling.

Citation: Tolson, B. A., and C. A. Shoemaker (2008), Efficient prediction uncertainty approximation in the calibration of environmental simulation models, Water Resour. Res., 44, doi:10.1029/2007WR005869.

1. Introduction

[2] Environmental management decisions are often based on the results of complex environmental simulation models. Since all models are approximations of reality, they are all subject to varying degrees of uncertainty. Thus, effective decision making should be based on model simulation results that report an estimate of the uncertainty associated with predictions.

[3] In general, uncertainty sources in environmental modeling include parameter, data and model structure. With highly parameterized simulation models and/or a limited amount of measured data for parameter estimation, model parameter estimates are uncertain. Measured data for initial or boundary conditions, as well as measured calibration data, such as flow or water quality, are also measured imperfectly and often infrequently and are therefore also subject to uncertainty. Finally, the model itself can be uncertain since it is typically known that the simulated processes, timescale or spatial scale utilized do not precisely characterize the system being modeled.
Although each of these sources of uncertainty is important, this study only considers parameter uncertainty.

[4] Traditional Monte Carlo propagation of parameter uncertainty through to model forecasts becomes complicated when model calibration data are considered because random samples from the assumed probability distributions for the parameters must also be deemed to produce reasonable predictions of the available measured calibration data. Two methods of uncertainty analysis that cope with this complication in the context of hydrologic and watershed modeling are the generalized likelihood uncertainty estimation (GLUE) methodology [Beven and Binley, 1992] and various Bayesian Markov chain Monte Carlo or MCMC methods [e.g., Kuczera and Parent, 1998; Vrugt et al., 2003]. While these approaches tend to be computationally intensive, the GLUE methodology is more easily implemented as there are fewer stringent assumptions. Beven [2006b] questions whether the assumptions required in a formal Bayesian analysis are valid for any nonsynthetic hydrologic system being modeled. Consider that in our modeling case study, performing a Bayesian MCMC analysis would require a very difficult process of deriving a correct description of the residual errors, which are correlated in time, space and across multiple simulated constituents (flow, sediment and total phosphorus). As a result, our study will utilize pseudolikelihood functions and thus focus comparisons on GLUE and will not investigate other uncertainty analysis methods, such as Bayesian MCMC methods, that require the definition of a formal likelihood function.

[5] Chapra [2003] recently reviewed the current state of TMDL water quality modeling and concluded that research is needed to identify practical and feasible approaches for uncertainty analysis of process-oriented, mechanistic models in TMDL studies. A review of the uncertainty literature demonstrating GLUE methodologies shows that the number of model evaluations utilized in many studies is likely infeasible for studies involving environmental simulation models that are even modestly computationally demanding (i.e., a few minutes or more). GLUE case studies typically report using from 60,000 to 100,000 model evaluations [e.g., Blazkova et al., 2002; Campling et al., 2002; Freer et al., 1996] for a range of dimensions (number of uncertain parameters). Brazier et al. [2000] and Freer et al. [2004] are two extreme examples showing how computationally intractable the application of GLUE can become as they reported using roughly 3,000,000 and 5,600,000 model evaluations, respectively. In the context of the Cannonsville watershed SWAT2000 model utilized in this case study, 100,000 model evaluations would require 4.6 months of serial computing time on a Pentium IV, 3 GHz computer.

[6] On the basis of the computational efficiency considerations described above, a new, more efficient methodology for uncertainty analysis conditioned on calibration data is required. The purpose of this paper is to propose an alternative approximate, high-dimensional uncertainty analysis methodology called dynamically dimensioned search approximation of uncertainty (DDS-AU), which utilizes the new DDS algorithm developed by Tolson and Shoemaker [2007a]. DDS-AU is an alternative methodology to GLUE when pseudolikelihood functions are utilized to measure the agreement between model predictions and data.

[7] This paper will compare DDS-AU to the GLUE methodology. GLUE was selected for comparison because it is simple, very prevalent in the literature and it does not require the definition of a statistically based likelihood function. The DDS-AU method efficiently, effectively and independently samples for multiple high-pseudolikelihood (i.e., high-quality) model parameter sets. The numerical results presented later show that our DDS-AU approach requires one to three orders of magnitude fewer model evaluations than a typical GLUE analysis and can sample significantly higher-pseudolikelihood parameter sets than GLUE. The efficiency and robustness of the DDS optimization algorithm are responsible for the greatly improved sampling efficiency.

[8] The remainder of this paper is organized as follows. Section 1.1 briefly introduces the GLUE methodology and reviews some of the GLUE application literature. Section 2.1 describes each component of the DDS-AU methodology and section 2.3 briefly outlines the modeling case study. Results for the modeling case study are presented in section 3 and focus first on efficiency comparisons and last on uncertainty bound comparisons between DDS-AU and GLUE. The discussion in section 4 covers comments on GLUE performance and the improved DDS-AU uncertainty estimates.
Section 5 outlines the study conclusions and future work.

1.1. Review of GLUE Methodology

[9] The GLUE methodology [Beven and Binley, 1992] for uncertainty analysis of environmental simulation model predictions was an important advancement because the simplicity and flexibility of GLUE make it an appealing and practical approach that has enabled numerous modelers to move immediately from the traditional deterministic model prediction approach to one that generates a distribution of plausible outcomes. For brevity, we keep our description of GLUE to a minimum but refer readers to Beven and Binley [1992], Beven and Freer [2001] or Montanari [2005] for more complete descriptions of GLUE.

[10] GLUE requires that modelers subjectively define a likelihood function that monotonically increases as agreement between model predictions and measured calibration data increases [Beven and Binley, 1992]. The GLUE likelihood function can be, but is not required to be and is not typically, a statistically based likelihood function. A large number of GLUE studies utilize the Nash-Sutcliffe coefficient [Nash and Sutcliffe, 1970] or some transformation of it to define the likelihood function. From this point forward we will use the word likelihood in situations where the word could be or is referring to the traditional statistical definition of likelihood and otherwise will use pseudolikelihood when we are referring to a function that is definitely not consistent with the traditional definition.

[11] In essence, when considering parameter uncertainty, GLUE is focused on identifying multiple acceptable or behavioral parameter sets rather than the optimal parameter set. The idea that multiple parameter sets (and multiple model structures) lead to reasonable or acceptable predictions of calibration data is referred to as the equifinality concept [Beven and Freer, 2001]. For each case study, the modeler must subjectively define behavioral in terms of the selected likelihood function (e.g., a threshold value of the likelihood function). A Monte Carlo experiment is then conducted to independently sample model parameter space and identify various behavioral parameter sets. The modeler must subjectively determine whether a sufficient number of behavioral parameter sets are sampled. In addition, the modeler would subjectively determine (typically make an implicit assumption) that behavioral samples from across behavioral parameter space (e.g., higher-quality samples) were also identified. These behavioral parameter sets are then subjectively transformed and assigned relative weights on the basis of their likelihood measure such that the weights sum to one. Then, the weighted behavioral parameter sets can be used in an ensemble forecasting procedure to estimate prediction quantiles of the quantity of interest [Beven and Freer, 2001]. This means that for any time period of interest, model predictions under all the behavioral parameter sets are generated in order to quantify uncertainty.

[12] The description of GLUE above is intentionally vague to reflect the fact that the modeler has an enormous amount of flexibility in terms of how to implement the various details of the analysis. More than 250 publications refer to the Beven and Binley [1992] GLUE publication. The vast majority of GLUE studies define a pseudolikelihood function as well as sample for behavioral solutions using simple uniform random sampling. For example, Beven [2006a] reports that the normal sampling approach in GLUE is uniform random sampling.
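To make the procedure summarized in paragraphs [9]-[11] concrete, the following is a minimal sketch (Python with NumPy) of a GLUE-style analysis using uniform random sampling, a Nash-Sutcliffe pseudolikelihood, a behavioral threshold, and pseudolikelihood-weighted prediction quantiles. The function names, the simple rescaling of pseudolikelihoods into weights, and the 5%/95% quantile levels are illustrative assumptions, not settings taken from any particular published GLUE study.

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def weighted_quantile(values, weights, q):
    """Quantile of `values` under normalized `weights` (midpoint-CDF interpolation)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) - 0.5 * w
    return float(np.interp(q, cdf, v))

def glue_uniform_sampling(model, bounds, obs, n_samples=10_000, threshold=0.6, seed=1):
    """Bare-bones GLUE analysis conditioned on the calibration series `obs`.

    model(theta) -> simulated series of the same length as obs;
    bounds -> sequence of (min, max) pairs, one per uncertain parameter.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, float).T
    thetas = rng.uniform(lo, hi, size=(n_samples, len(lo)))    # uniform random sampling
    sims = np.array([model(t) for t in thetas])                # one model run per sample
    pseudo = np.array([nash_sutcliffe(obs, s) for s in sims])  # pseudolikelihoods
    behavioral = pseudo >= threshold                           # behavioral classification
    if not behavioral.any():
        raise RuntimeError("no behavioral parameter sets were sampled")
    w = np.maximum(pseudo[behavioral] - threshold, 1e-12)      # rescale above the threshold
    w /= w.sum()                                               # weights sum to one
    beh_sims = sims[behavioral]
    n_steps = beh_sims.shape[1]
    lower = np.array([weighted_quantile(beh_sims[:, t], w, 0.05) for t in range(n_steps)])
    upper = np.array([weighted_quantile(beh_sims[:, t], w, 0.95) for t in range(n_steps)])
    return thetas[behavioral], w, lower, upper                 # 5%/95% prediction limits per time step
```

The same skeleton makes the sampling-efficiency problem discussed below easy to see: if only one sample in every ten thousand is behavioral, n_samples must be enormous before the weighted quantiles are based on more than a handful of parameter sets.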
The bulk of GLUE publications are focused on uncertainty analysis of rainfall-runoff models. A smaller number of studies have applied GLUE to biogeochemical or watershed-scale modeling of sediment or nutrients [e.g., Jia and Culver, 2006; Smith and Wheater, 2004; Zak and Beven, 1999]. Although a large number of GLUE studies consider approximately ten parameters uncertain, others consider upward of 40 [Uhlenbrook and Sieber, 2005] or even 50 model parameters uncertain [Zak and Beven, 1999].

[13] The results of a GLUE analysis are typically presented as GLUE prediction limits (prediction quantiles) around a model prediction of interest. It is important to realize for the purposes of this paper that GLUE prediction limits generated with a pseudolikelihood function are not equivalent to statistical confidence limits. According to Beven and Freer [2001], the resultant GLUE prediction limits are not direct estimates of the probability of simulating a particular observation. Some authors have critically evaluated the GLUE methodology with respect to its statistical rigor. Montanari [2005] presented a hydrologic modeling case study where the 95% GLUE prediction limits include only 62% of the observed data. Mantovan and Todini [2006] and Thiemann et al. [2001] explain why GLUE is inconsistent with Bayesian inference. Kuczera and Parent [1998] show that even when a valid likelihood function is used, GLUE-derived prediction limits are misleading, and argue that the typical GLUE sampling approach of uniform random sampling can be very inefficient and that this inefficiency will increase as the number of uncertain parameters increases.

[14] The focus of this paper is not to further critique GLUE from a statistical perspective. We believe the description of model prediction uncertainty in an approximate but nonstatistical way like GLUE can be valuable. Instead, our intent in this paper is to highlight the gross sampling inefficiency of GLUE when Monte Carlo sampling is used to search for behavioral parameter sets. Furthermore, our analysis is also restricted to GLUE applications that utilize a pseudolikelihood function. It is clear that the GLUE methodology is completely general in that it can admit statistically based likelihood functions and utilize sampling strategies other than simple random sampling. However, our focus on GLUE applied with pseudolikelihood functions and random sampling is appropriate because the majority of the GLUE modeling literature applies GLUE in this way. We review some of the issues and proposed solutions to the Monte Carlo sampling inefficiency problem with GLUE in the next sections (1.2 and 1.3) before proposing a new methodology for conducting an efficient but approximate uncertainty analysis in section 2.

1.2. Issues With Monte Carlo Sampling in GLUE

[15] In the practical application of GLUE, we believe that the determination of the behavioral threshold of the pseudolikelihood function can be inappropriately influenced by ineffective Monte Carlo sampling results. If a behavioral threshold is to be applied in any uncertainty analysis, it should be based largely on two primary considerations: (1) the modeler's subjective prior definition of behavioral and (2) whether any solutions exist that meet the modeler's behavioral definition. Clearly, the second consideration can influence the first consideration.
[16] When GLUE is applied to high-dimensional problems (many uncertain parameters) in the literature, the definition of behavioral is often conditioned on the efficiency of uniform random sampling to identify high-pseudolikelihood parameter sets. For example, the behavioral threshold would be set such that the best 10% of the Monte Carlo samples would be classified as behavioral [e.g., Zak and Beven, 1999]. This approach to setting the behavioral threshold is an example of how a modeler can relax their prior/original definition of the behavioral threshold so that a sufficient number of behavioral parameter sets (with reduced likelihoods) are identified to conduct the GLUE analysis. The failure of Monte Carlo sampling to identify parameter sets that exceed the prior/original behavioral threshold in no way means that these do not exist. Without an incredibly large and often computationally intractable number of samples, Monte Carlo sampling is typically one of the most inefficient optimization strategies. If parameter sets above the prior behavioral threshold in fact exist but are simply not sampled, a revision of the original behavioral threshold on the basis of such poor sampling results is questionable.

[17] A prerequisite of any uncertainty analysis technique that is conditioned on model calibration data is that either the maximum likelihood solution or multiple relatively high likelihood solutions are identified. Given that most GLUE studies do not invoke (or at least do not report on) the results of a reasonable global optimization algorithm to maximize the likelihood function, this would seem to indicate an implicit assumption among GLUE practitioners that as long as a large number of random samples are taken, Monte Carlo sampling (most typically uniform random sampling) will sample adequately high-likelihood solutions. Monte Carlo sampling can only be guaranteed to find the global optimum when an infinite number of samples are taken. Therefore, in practice, GLUE will fail to find the global optimum solution.

[18] The GLUE authors readily accept the inability of random sampling to sample the global optimum since GLUE rejects the idea of an optimal model [Beven and Freer, 2001] and instead accepts the concept of equifinality, where multiple behavioral models adequately simulate the measured data with approximately equal levels of predictive performance. However, since the GLUE methodology reweights behavioral parameter set likelihoods on the basis of the likelihood magnitude, the equifinality concept in the GLUE methodology seems to allow for the existence of multiple mediocre models, multiple good models, multiple very good models and so on. This in turn implies that the attained level of equifinality is conditional on the effectiveness of the sampling methodology in GLUE (typically uniform random sampling). For example, too few samples will either fail to identify any behavioral models or will only identify mediocre models and will not identify good or very good models. Assuming that it is desirable to identify at least a few good models, a question central to the efficacy of GLUE for uncertainty estimation then becomes how well does the GLUE sampling approach approximate the maximum likelihood solution?

1.3. Strategies for Dealing With Monte Carlo Sampling Inefficiency

[19] Monte Carlo sampling inefficiency means that behavioral parameter sets are sampled with an incredibly low frequency (1 in every 10,000 model evaluations, for example). Sampling ineffectiveness is a related term used throughout the paper to describe when Monte Carlo sampling does not identify any parameter sets from the highest-pseudolikelihood regions of parameter space. There are at least four strategies reported in the literature for combating the inefficiency and ineffectiveness of uniform random sampling in GLUE. These include using brute force (i.e., millions of model evaluations using a parallel computing network), sensitivity analysis methods to reduce the number and/or range of uncertain parameters, Latin hypercube sampling (LHS) and metamodeling. Unfortunately, as will be discussed below, there are problems or limitations with each of these strategies.

[20] The brute force approach has been applied, for example, by Brazier et al. [2000], who used roughly 3,000,000 model evaluations for a soil erosion model with 16 uncertain parameters, and Freer et al. [2004], who used 5,600,000 model evaluations for a hydrologic model with only eight uncertain parameters. In both studies, a parallel computing network was utilized to conduct these model simulations. The results of Freer et al. [2004] clearly show that as more and more data become available for calibration (so that there are multiple calibration objectives), GLUE with uniform random sampling becomes more and more inefficient. The main issue is that millions of model evaluations are clearly infeasible for many practical modeling case studies, even more so for distributed environmental models, and almost certainly impossible for any modeler who does not have access to parallel computing facilities.

[21] In principle, an initial sensitivity analysis is an attractive method for reducing the dimension and size of parameter space before GLUE sampling is conducted [e.g., Benaman and Shoemaker, 2004; Zak and Beven, 1999]. However, sensitivity analysis is an imperfect solution for at least two reasons. First, in high-dimensional cases, sensitivity analysis methods, such as the regionalized sensitivity analysis (RSA) by Spear and Hornberger [1980] and RSA variants, can be computationally burdensome. For example, Zak and Beven [1999] used 60,000 model evaluations in their sensitivity analysis before they conducted their GLUE analysis using another 60,000 model evaluations. Even with more efficient sensitivity analysis approaches like that of Benaman and Shoemaker [2004], there is still a second concern. Insensitive parameters during the calibration period may be sensitive in the prediction period (i.e., when the uncertain system state or performance under various management alternatives is evaluated). Furthermore, even if the model sensitivity is low for each insensitive parameter independently, the model prediction uncertainty can be very sensitive to a subset of parameters deemed individually insensitive.

[22] In a traditional uncertainty analysis framework where calibration data are not available or not utilized to condition the random samples from the assumed input probability distributions, LHS is generally accepted to be a more efficient alternative to random sampling. Tung et al. [2006] report that LHS has been widely applied in many areas of hydrosystems engineering.
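For readers unfamiliar with the distinction drawn here, the sketch below (Python/NumPy, illustrative only) contrasts simple uniform random sampling of a parameter hypercube with a basic Latin hypercube design, in which each parameter range is split into n equal-probability strata and exactly one value is drawn from each stratum.

```python
import numpy as np

def uniform_random_samples(bounds, n, rng):
    """Simple random sampling of independent uniform parameter distributions."""
    lo, hi = np.asarray(bounds, float).T
    return rng.uniform(lo, hi, size=(n, len(lo)))

def latin_hypercube_samples(bounds, n, rng):
    """Basic LHS: one sample per stratum in each dimension, with strata randomly
    paired across dimensions by independent permutations."""
    lo, hi = np.asarray(bounds, float).T
    d = len(lo)
    strata = np.tile(np.arange(n), (d, 1))       # stratum indices 0..n-1 per parameter
    strata = rng.permuted(strata, axis=1).T      # shuffle each parameter independently -> (n, d)
    u = (strata + rng.random((n, d))) / n        # one uniform draw inside each stratum
    return lo + u * (hi - lo)

# Both designs cover the same hypercube; LHS only stratifies the marginals.
# If, say, 0.01% of the hypercube is behavioral, the expected number of
# behavioral samples is the same for both designs; LHS mainly reduces the
# variance of quantities estimated from the full sample, which is the point
# argued in the surrounding discussion.
rng = np.random.default_rng(0)
bounds = [(0.0, 1.0)] * 14                       # e.g., 14 uncertain parameters
x_rand = uniform_random_samples(bounds, 1000, rng)
x_lhs = latin_hypercube_samples(bounds, 1000, rng)
```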
LHS provides improvements over random sampling in terms of reducing the variance of the estimated sample mean and empirical cumulative distribution function (CDF) if certain monotonicity conditions hold [McKay et al., 1979]. Melching [1995] demonstrates LHS efficiency over simple random sampling in a traditional type of uncertainty analysis of a hydrological model.

[23] Some GLUE studies [e.g., Christiaens and Feyen, 2002; Uhlenbrook and Sieber, 2005] report using LHS within the GLUE framework because of severe model computational restrictions. For example, Uhlenbrook and Sieber [2005, p. 23] report the use of LHS in their GLUE study because "the number of necessary model runs can be reduced." It must be clarified that despite the use of LHS in these and other GLUE studies, when the total number of model runs is limited in a GLUE sampling exercise, using LHS instead of random sampling should not be expected to increase the number of behavioral samples.

[24] LHS is a variance reduction technique used when sampling from known input probability distributions. In GLUE, the probability distributions or joint probability distribution describing behavioral parameter space is not known prior to the GLUE sampling experiment. When LHS is utilized to search for behavioral samples in GLUE, LHS is applied with input variable distributions that characterize all of the model parameter space (typically a hypercube because of assumed independent uniform distributions of all parameters). Model parameter space includes the behavioral and nonbehavioral regions of parameter space. LHS should function to estimate the mean pseudolikelihood and the CDF of the pseudolikelihood resulting from sampling model parameter space more precisely than random sampling. Therefore, the estimated proportion of parameter space that is behavioral should also be more precise. LHS will not function to increase the expected value of this proportion. Thus, when random sampling fails to sample a sufficient number of behavioral parameter sets, LHS with the same number of samples will not typically improve the sampling frequency of behavioral parameter sets. Only after behavioral parameter space was identified (and somehow characterized by a joint probability distribution) and then required resampling in some postcalibration scenario analysis where model prediction uncertainty was to be estimated could LHS function to reduce the computational burden of a GLUE analysis.

[25] One other approach to GLUE efficiency improvements involves the use of computationally efficient metamodels such as an artificial neural network (ANN) to approximate the pseudolikelihood response surfaces. Khu and Werner [2003] propose a modified GLUE approach using an ANN on hydrological model case studies with two and eight uncertain parameters to improve basic GLUE efficiency. Mugunthan and Shoemaker [2006] introduce a new and more efficient uncertainty analysis methodology in comparison to GLUE that uses a radial basis function approximation approach for optimization and apply it to groundwater quality model case studies with three and seven model parameters considered uncertain. Hossain et al. [2004b] demonstrated the use of a Hermite polynomial chaos expansion of normal random variables as a substitute for a computationally expensive land surface model with five parameters considered uncertain.
It is not clear how well the above methods, all of which were demonstrated with eight or fewer uncertain parameters, will work in higher dimensions such as those considered in this study (13-30 parameters considered uncertain). Ong et al. [2004] note that in the context of evolutionary optimization, global metamodels can readily be used with any search method but they can, however, be inefficient as problem dimension increases. More importantly, these methods substantially increase the level of complexity of the uncertainty analysis relative to GLUE or the proposed DDS-AU methodology since metamodels must be calibrated or trained to mimic the pseudolikelihood response surface.

1.4. Goals and Outline

[26] Despite the previous attempts to address GLUE efficiency as described in the previous section, a new methodology is needed to enable practical and meaningful uncertainty analysis when models with a high number of model parameters are considered uncertain. We consider model calibration case studies in which 13 to 30 model parameters are considered uncertain and focus on improving uncertainty analysis efficiency without metamodel or response surface approximation techniques. Since we argue that LHS will not function to improve behavioral sampling efficiency in comparison to random sampling within the GLUE framework, this is first demonstrated for one of our case studies. Then, we compare our new methodology against what we consider to be the common or standard type of high-dimensional GLUE analysis, which utilizes uniform random sampling to search for behavioral parameter sets. Since uniform random sampling is only one approach to randomly sample parameter space in GLUE, the focus of this paper is not to refute the general methodological framework of GLUE. Indeed, without the original research on GLUE [Beven and Binley, 1992] and the formalization of the equifinality concept [Beven and Freer, 2001], our research probably would not have evolved as it has. Instead, we present substantially more efficient alternatives to the standard GLUE analysis that serve to address a future research question recently noted by Beven [2006a] regarding how to efficiently search for behavioral parameter sets.

[27] Our new alternative high-dimensional uncertainty analysis methodology is called the dynamically dimensioned search approximation of uncertainty (DDS-AU) method. DDS-AU relies largely on the new DDS optimization algorithm developed by Tolson and Shoemaker [2007a]. DDS can improve future uncertainty analyses in two ways. Most importantly, we will show how to utilize DDS within a new, simple, flexible and practical methodology (DDS-AU) for describing model prediction uncertainty. We also note that DDS could simply replace uniform random sampling in most typical applications of GLUE.

2. Methodology

2.1. Dynamically Dimensioned Search Approximation of Uncertainty Methodology

[28] The DDS-AU methodology is termed an approximation because it utilizes a pseudolikelihood function rather than a statistically rigorous likelihood function. Therefore, very much like GLUE, a subjective decision must be made as to the pseudolikelihood measure appropriate for the case study. Some of the example pseudolikelihood functions defined in this study utilize the Nash-Sutcliffe coefficient (E_NS) [Nash and Sutcliffe, 1970] and some functions combine the E_NS coefficients for flow and multiple water quality constituents. However, modelers are free to use any single-objective function they deem appropriate within the DDS-AU framework, including objectives that are to be minimized.
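As an illustration of the kinds of single-objective pseudolikelihood functions mentioned above, the sketch below defines E_NS for a single constituent and one simple way of combining E_NS values across flow and several water quality constituents. The weighted averaging used here is an assumption for illustration and is not necessarily the combination used in the paper's case studies.

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """E_NS = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2); 1 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def flow_pseudolikelihood(obs_flow, sim_flow):
    """Single-constituent pseudolikelihood: E_NS for the flow series."""
    return nash_sutcliffe(obs_flow, sim_flow)

def combined_pseudolikelihood(obs_by_constituent, sim_by_constituent, weights=None):
    """Combine E_NS values for several constituents (e.g., flow, TSS, total P)
    into one objective as a weighted average, to be maximized by the search."""
    keys = sorted(obs_by_constituent)
    e = np.array([nash_sutcliffe(obs_by_constituent[k], sim_by_constituent[k]) for k in keys])
    w = np.ones(len(keys)) if weights is None else np.array([weights[k] for k in keys], float)
    return float(np.sum(w * e) / np.sum(w))
```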
This freedom makes DDS-AU (like GLUE) a method for prediction uncertainty estimation that can be immediately applied to all deterministic model calibration studies since the definition of a statistically rigorous likelihood function is not required.

[29] Prior to conducting the DDS-AU analysis, modelers should also select a threshold value of the pseudolikelihood function, termed the behavioral threshold, such that all parameter sets leading to pseudolikelihoods less than this value (or more generally degraded values of the objective function) are deemed to produce unacceptable predictions and are therefore classified as nonbehavioral. Only the behavioral parameter sets are used to characterize the prediction uncertainty. The behavioral threshold can be based on prior experience and can be adjusted later in the analysis. An initial subjective behavioral threshold can understandably be influenced and/or modified on the basis of a known reasonable lower bound estimate of the true maximum of the pseudolikelihood function. A good estimate of the maximum of the pseudolikelihood function should be established via the application of an effective global optimization technique such as DDS using perhaps 1% to 5% of the total uncertainty analysis computational budget (i.e., model evaluations).

[30] After making these two basic subjective decisions regarding the form of the pseudolikelihood function and the behavioral threshold value, the remainder of the DDS-AU methodology is separated into three sections in order to distinguish DDS-AU as a separate uncertainty analysis methodology from GLUE. Each section corresponds to one part of the DDS-AU methodology (behavioral solution sampling, uncertainty characterization and the updating procedure for new data) to demonstrate it is a distinct alternative to the corresponding GLUE approach. The DDS-AU behavioral sampling methodology in section 2.1.1 is important to consider independently because it could easily be incorporated into the GLUE methodology in order to improve GLUE behavioral sampling efficiency. Uncertainty characterization in DDS-AU is summarized in section 2.1.2, followed by a simple updating procedure when new calibration data becomes available (section 2.1.3).

2.1.1. DDS-AU Behavioral Sampling Methodology

[31] The DDS-AU methodology for sampling or identifying multiple independent high-pseudolikelihood solutions is based on the results of multiple independent optimization trials of the DDS stochastic global optimization algorithm, each using a relatively small number of model evaluations (e.g., 100). The DDS-AU methodology is made feasible because of the simplicity and efficiency of the DDS algorithm. Tolson and Shoemaker [2007a] demonstrated good comparative performance of DDS on 6-, 10-, 14-, 20-, 26- and 30-dimensional optimization problems. A brief description of the DDS algorithm is provided in the following paragraph but readers should refer to Tolson and Shoemaker [2007a] for the complete DDS algorithm pseudocode.

[32] DDS was designed for calibration problems with a large number of parameters and requires no algorithm parameter tuning. The algorithm is designed to scale the search to the user-specified number of maximum function evaluations and thus has no other stopping criteria. In short, the algorithm searches globally at the start of the search and becomes a more local search as the number of iterations approaches the maximum allowable number of function evaluations. Tolson and Shoemaker [2007a] describe how this strategy roughly mimics a manual or trial and error model calibration approach. The adjustment from global to local search is achieved by dynamically and probabilistically reducing the number of dimensions in the neighborhood (i.e., the set of decision variables or parameters modified from their best value). A candidate solution is created by perturbing the current solution values in the randomly selected dimensions only. These perturbation magnitudes are randomly sampled from a normal distribution with a mean of zero. DDS is a greedy type of algorithm since the current solution, also the best solution identified so far, is never updated with a solution that has an inferior value of the objective function.

[33] The stochastic nature of DDS means that multiple DDS optimization trials initialized to different initial solutions and/or a different random seed will follow a different trajectory or search path and can terminate at different final solutions. DDS-AU takes advantage of this fact to find multiple behavioral solutions and thus each DDS optimization trial in DDS-AU must be initialized to a different random initial solution and a different random seed. In the simplest implementation of DDS-AU, only the final best solution from each DDS optimization trial is considered as a possible behavioral sample. Only a single behavioral solution per DDS optimization trial should be used for prediction uncertainty characterization because multiple behavioral solutions from a single optimization trial are likely to be very similar given the nature of the DDS algorithm. Late in the optimization, many of the solutions evaluated by DDS will differ in only one parameter value. The use of independent DDS optimization trials to search for each behavioral solution distinguishes the DDS-AU sampling methodology from the methodology proposed by Mugunthan and Shoemaker [2006], which extracts multiple behavioral solutions, after a declustering technique is applied, from the search history of a single optimization trial by a function approximation based algorithm.

[34] The basic sampling approach in DDS-AU described above is easy to implement and very efficient if conducted across a parallel computing network. After defining the objective function, possibly a prior behavioral threshold, as well as model parameters considered uncertain and their corresponding minimum and maximum values, the modeler should follow the specific steps below and make a few subjective decisions to implement the most basic DDS-AU sampling methodology (a code sketch of these steps follows the list):

[35] Step 1: Define the maximum total number of model evaluations for the analysis (N_total) and the desired number (i.e., maximum required) of behavioral samples to identify (n_beh). Calculate the number of model evaluations per DDS optimization trial (m_DDS) as m_DDS = N_total / n_beh (rounded down to an integer).

[36] Step 2: Perform n_beh DDS optimization trials from n_beh random initial solutions.

[37] Step 3: Classify the n_beh final best DDS solutions (one per trial) as behavioral or nonbehavioral.

[38] Step 4: If the number of behavioral solutions is deemed too small then, if time permits, refine some or all of the final best DDS nonbehavioral solutions with additional DDS optimization trials; otherwise, consider easing the behavioral threshold.
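The following sketch (Python/NumPy) puts steps 1-4 together with a compact version of the DDS search itself. The neighborhood inclusion probability 1 - ln(i)/ln(m_DDS) and the perturbation standard deviation of 0.2 times each parameter range follow the published DDS defaults of Tolson and Shoemaker [2007a], but the bound handling and all function and variable names here are simplified, illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def dds_trial(objective, bounds, m_dds, rng, r=0.2):
    """One DDS optimization trial (maximization), greedy and tuning-free:
    perturb a randomly chosen, gradually shrinking subset of dimensions of the
    current best solution using zero-mean normal steps."""
    lo, hi = np.asarray(bounds, float).T
    d = len(lo)
    x_best = rng.uniform(lo, hi)                    # random initial solution
    f_best = objective(x_best)
    for i in range(1, m_dds):
        p = 1.0 - np.log(i) / np.log(m_dds)         # probability each dimension is perturbed
        mask = rng.random(d) < p
        if not mask.any():
            mask[rng.integers(d)] = True            # always perturb at least one dimension
        x_new = x_best.copy()
        step = rng.normal(0.0, r * (hi - lo))       # zero-mean normal perturbations
        x_new[mask] += step[mask]
        x_new = np.where(x_new < lo, 2 * lo - x_new, x_new)   # reflect at the lower bound
        x_new = np.where(x_new > hi, 2 * hi - x_new, x_new)   # reflect at the upper bound
        x_new = np.clip(x_new, lo, hi)              # simplified handling of double violations
        f_new = objective(x_new)
        if f_new > f_best:                          # greedy: never accept a worse solution
            x_best, f_best = x_new, f_new
    return x_best, f_best

def dds_au_sampling(pseudolikelihood, bounds, n_total=10_000, n_beh=100,
                    threshold=0.6, seed=1):
    """Steps 1-4 of the basic DDS-AU behavioral sampling methodology."""
    rng = np.random.default_rng(seed)
    m_dds = n_total // n_beh                        # step 1: evaluations per trial
    finals = [dds_trial(pseudolikelihood, bounds, m_dds, rng)
              for _ in range(n_beh)]                # step 2: independent trials
    behavioral = [(x, f) for x, f in finals if f >= threshold]     # step 3
    nonbehavioral = [(x, f) for x, f in finals if f < threshold]
    # Step 4 is left to the modeler: refine the nonbehavioral final solutions
    # with further DDS trials started from them, or reconsider the threshold.
    return behavioral, nonbehavioral
```

Each trial consumes m_DDS model evaluations, so the loop respects the N_total budget chosen in step 1; the trials are shown running serially but, because they are independent, they parallelize trivially.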
[39] The total number of model evaluations used for the analysis in step 1 should be determined largely on the basis of the case study specific computational and time constraints. Determining the ideal number of behavioral samples (n_beh) given a fixed total number of model evaluations (N_total) requires the modeler to weigh the benefits of a higher n_beh against the reduced probability that each optimization trial will find a behavioral solution. Fixing n_beh determines the effort expended by each DDS-AU optimization trial (m_DDS). A small value of n_beh will yield high-pseudolikelihood solutions but result in fewer DDS-AU sampled behavioral solutions. Too high a value of n_beh will produce too many DDS-AU samples that are nonbehavioral. Given a fixed total number of model evaluations available, the ability of modelers to have some control over this tradeoff decision is an advantage of DDS-AU in comparison with GLUE. In this study, promising results were attained by fixing n_beh from 100 to 200, which then typically gave m_DDS values from 3D to 7D, where D is the dimension of the problem (the number of uncertain parameters).

[40] In step 4, the modeler must subjectively decide whether an acceptable number of behavioral samples have been identified for their purposes (this value should be lower than n_beh). The same subjective decision is required in a GLUE analysis. If more behavioral samples are needed then, time permitting, the final best nonbehavioral DDS solutions can be refined or polished using perhaps another m_DDS model evaluations per optimization trial. Thus, even if the modeler chooses an overly optimistic value of n_beh and the DDS optimization trials do not have enough evaluations to identify behavioral solutions, the initial effort is not lost because the first set of nonbehavioral DDS-AU samples are higher in quality and in a sense closer to becoming behavioral with additional optimization. An example of this approach is outlined in section 3.3. Without additional model evaluations, the only option available to the modeler to identify more behavioral samples is to ease the behavioral threshold. Easing the prior behavioral threshold is only recommended after determining the prior behavioral threshold was too stringent in comparison to a reasonable lower bound estimate of the true maximum of the pseudolikelihood function. Given the importance of the maximum of the pseudolikelihood function, even for simply judging the quality of DDS-AU (or GLUE) sampled pseudolikelihoods, it would be beneficial to optimize the pseudolikelihood function. This can be done easily with one longer optimization trial of the DDS algorithm using a small percentage of the total computational budget. This approximation of the solution with the highest pseudolikelihood can also be used as a behavioral sample.

[41] A reasonable concern that should be evident with the DDS-AU sampling strategy described above is the question of how DDS-AU ensures some of the behavioral solution pseudolikelihoods are reasonably close to the unknown maximum of the pseudolikelihood function while others are distributed throughout the search space and thus have pseudolikelihoods distributed between the behavioral threshold and the best known pseudolikelihood value. Provided the quantity m_DDS/D is relatively small (e.g., it is 3 to 7 in the examples in this paper) and random initial solutions are used for each DDS trial, it is very unlikely that the behavioral solutions independently identified by the DDS trials will be inappropriately clustered around an unrepresentative subset of the behavioral local optima. In fact, the probability that any of the DDS solutions are a precise local optimum is quite small given the nature of DDS as described by Tolson and Shoemaker [2007a] and the use of a rather limited number of model evaluations per optimization trial. Nonetheless, we suggest two alternative strategies for doubly ensuring DDS-AU solutions are well distributed throughout behavioral parameter space. One alternative would be to randomize m_DDS, the total number of model evaluations per DDS optimization trial (while respecting settings for N_total and n_beh), and then take the final best solution from each trial as a behavioral sample. The majority of our results with DDS-AU have been generated using this alternative. A second alternative, when a constant m_DDS was used across DDS optimization trials, would be to randomly sample a single behavioral solution from the search history of each DDS optimization trial. This second alternative may be invoked if the basic DDS-AU sampling approach (constant m_DDS) results show that m_DDS was set too high such that the modeler deems behavioral solutions (final best DDS solutions) are inappropriately clustered in parameter space. In this way, the uncertainty analysis can continue without expending additional computational effort (model evaluations) to sample more dispersed behavioral parameter sets.

2.1.2. DDS-AU Prediction Uncertainty Characterization

[42] Model predictions using DDS-AU sampled parameter sets are not weighted by the pseudolikelihood function values to calculate pseudolikelihood weighted predictions or quantiles as in GLUE. DDS-AU simply describes the range in model predictions that will result from the acceptance of the equifinality concept. Constructing pseudolikelihood weighted quantiles (or 95% prediction limits) on the basis of pseudolikelihood functions can be misleading and therefore the DDS-AU methodology avoids this issue. Without any pseudolikelihood reweighting, DDS-AU prediction bounds are constructed on the basis of the maximum and minimum simulation results of the quantity of interest (e.g., flow) in each time step derived from model simulations under all behavioral parameter sets. As in the GLUE approach, prediction bounds for a time series are determined independently for each time step. Provided that the behavioral threshold is truly based on a modeler's belief of what is acceptable rather than being conditioned on inefficient Monte Carlo sampling results, this approach should generate meaningful prediction bounds (a minimal code sketch of this bound construction follows the list below).

[43] The interpretation of the DDS-AU prediction bounds is very straightforward but requires that all subjective decisions made by the modeler be explicitly detailed. The subjective decisions that need to be clearly associated with any DDS-AU prediction bounds or characterization of prediction uncertainty are as follows:

[44] 1. Behavioral threshold in actual pseudolikelihood function units.

[45] 2. Number of behavioral samples used to characterize the prediction bounds.

[46] 3. Total number of model evaluations in the analysis and the number of function evaluations per optimization trial.
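A minimal sketch of the prediction bound construction described above, plus the frequency-based summary discussed in the next paragraph; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def dds_au_prediction_bounds(behavioral_sims):
    """DDS-AU prediction bounds: the minimum and maximum simulated value in each
    time step over all behavioral parameter sets, with no pseudolikelihood
    weighting. `behavioral_sims` has shape (n_behavioral, n_timesteps)."""
    sims = np.asarray(behavioral_sims, float)
    return sims.min(axis=0), sims.max(axis=0)

def exceedance_frequency(behavioral_predictions, design_limit):
    """Fraction of behavioral parameter sets whose prediction of interest
    (e.g., peak streamflow) exceeds a design limit. Reported as a frequency,
    not as a probability."""
    pred = np.asarray(behavioral_predictions, float)
    return float(np.mean(pred > design_limit))
```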
[47] There may be cases where modelers wish to characterize DDS-AU prediction uncertainty using more detail than prediction bounds on the basis of the minimum and maximum model prediction. For example, the frequency with which a predicted quantity of interest (e.g., peak streamflow) exceeds or does not exceed a certain design limit may be of interest. Given the use of a pseudolikelihood function in DDS-AU (as in most GLUE analyses), it is most appropriate to describe DDS-AU prediction uncertainty explicitly in terms of the frequency with which DDS-AU sampled behavioral parameter sets generate a specific model prediction of interest (i.e., exceed a design limit). Furthermore, this frequency should not be interpreted as a probability but only used as an indication as to whether the prediction of interest is relatively likely or unlikely. This importantly avoids the implication that DDS-AU generates statistically rigorous descriptions of prediction uncertainty such as confidence limits, in light of recent work by Montanari [2005] demonstrating GLUE prediction limits should not be interpreted as confidence limits.

2.1.3. Updating DDS-AU Behavioral Solutions With Additional Calibration Data

[48] Unlike the GLUE procedure for likelihood weight updating when new measured system response data for calibration becomes available [see Beven and Binley, 1992; Beven and Freer, 2001], a simple non-Bayesian methodology for updating the behavioral solution pseudolikelihoods and thus the prediction uncertainty is proposed for DDS-AU. In other words, this section describes procedures for updating a current set of DDS-AU behavioral parameter sets when new calibration data are available to refine a previous DDS-AU analysis. Given adequate time and computational resources, modelers may want to repeat the entire DDS-AU sampling procedure (section 2.1.1) for the entire set of currently available calibration data. This is more likely to be necessary if the amount of new data available is comparable to the amount of data used for calibration during the previous DDS-AU analysis. In this case, modelers should initialize all DDS-AU optimization trials to the behavioral solutions found in the previous DDS-AU analysis. Otherwise, if computational time is at a premium, modelers can use the updating methodology described below to very efficiently update DDS-AU prediction bounds (i.e., with minimal or no dependence on further optimization or sampling).

[49] 1. For each of the behavioral samples generated from the prior DDS-AU analysis (n_1) during the calibration period (period 1): simulate the model for the new time period (period 2), concatenate the model simulated time series from period 1 and period 2, and calculate new overall pseudolikelihood measures for the entire measured data period.

[50] 2. Reclassify each of the n_1 solutions as behavioral or nonbehavioral.

[51] 3. Determine if a sufficient number of the n_1 samples are retained as behavioral. If yes, then the pseudolikelihood updating procedure is complete. Otherwise, some or all of the nonbehavioral samples must be refined with additional DDS optimization trials as described in step 4 of the DDS-AU sampling methodology in section 2.1.1.
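A schematic sketch of the three updating steps above (not a demonstration with the paper's data): the model interface model(theta, period=2), the variable names, and the E_NS pseudolikelihood are illustrative assumptions.

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def update_behavioral_sets(behavioral_params, sims_period1, model,
                           obs_period1, obs_period2, threshold):
    """Non-Bayesian DDS-AU updating: for each previously behavioral parameter
    set, simulate only the new period, concatenate the two periods, and
    recompute a single pseudolikelihood over the combined record."""
    obs_all = np.concatenate([obs_period1, obs_period2])
    retained, rejected = [], []
    for theta, sim1 in zip(behavioral_params, sims_period1):
        sim2 = model(theta, period=2)            # one new model run per behavioral set
        sim_all = np.concatenate([sim1, sim2])   # periods 1 and 2 treated as one record
        if nash_sutcliffe(obs_all, sim_all) >= threshold:
            retained.append(theta)               # still behavioral
        else:
            rejected.append(theta)               # candidate for further DDS refinement (step 4)
    return retained, rejected
```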
[52] Given the simplicity of the above procedure and the length of this paper, we do not demonstrate the DDS-AU updating procedure in our results. However, it is important to note that the DDS-AU updating procedure is distinct and less subjective in comparison to the GLUE updating procedure. Beven and Freer [2001] report that for certain pseudolikelihood functions the Bayesian updating of behavioral solution pseudolikelihoods in GLUE, for example combining pseudolikelihoods for year 1 and year 2, can generate a combined or updated pseudolikelihood for the entire time period (years 1 and 2) that is different than the pseudolikelihood that would be estimated by simply considering years 1 and 2 together as one time period. We do not consider this to be a desirable characteristic of GLUE. The DDS-AU updating procedure ensures that this situation will not happen as updated DDS-AU pseudolikelihoods are defined as the pseudolikelihood for the combined time period.

Previous Studies Similar to DDS-AU

[53] Previous studies have, like DDS-AU, suggested various approaches for approximating uncertainty with the help of various optimization algorithms without conducting a formal Bayesian statistical analysis. Many of these studies [Evers and Lerner, 1998; Khu and Werner, 2003; Mugunthan and Shoemaker, 2006; van Griensven and Meixner, 2006] are applied to problems with a much lower number of uncertain parameters in comparison with our study (9 or fewer in comparison to 13, 14, 26 and 30) and would become much more inefficient if they were utilized on our case study. Khu and Werner [2003] utilize a genetic algorithm (GA) to enhance the efficiency of sampling within GLUE, whereas Mugunthan and Shoemaker [2006] utilize a function approximation based optimization algorithm (see section 1.3) for a non-GLUE uncertainty assessment. The search history from a modified version of the shuffled complex evolution (SCE) algorithm [Duan et al., 1992] is used by van Griensven and Meixner [2006] to approximate parameter uncertainty. Considering the relatively poor SCE algorithm performance on the exact problems utilized in this case study as reported by Tolson and Shoemaker [2007a], an SCE-based approach would likely be more inefficient than DDS-AU. Seibert and McDonnell [2002] report using a GA to assess parameter uncertainty in a problem with 16 uncertain parameters and report using 150,000 total model evaluations to identify 50 behavioral parameter sets. In this study, we show that for our modeling case study, DDS-AU identifies more than 50 behavioral parameter sets in only 10,000 model evaluations.

[54] Evers and Lerner [1998] approximate parameter uncertainty in a groundwater flow modeling study. They attempt to conduct an exhaustive search of five-dimensional space (uncertain parameters and boundary conditions or simply uncertain inputs) using a gridded search combined with a local search technique to find reasonable input sets. These reasonable input sets are then used with equal weighting (i.e., they are all equally valid input sets) to quantify prediction uncertainty. Evers and Lerner [1998] present their approach independently of the GLUE methodology. This establishes that approaches like DDS-AU, which are similar in some ways to GLUE, have been considered a distinct methodology. Our contribution in this paper is distinguished from the approach of Evers and Lerner [1998] because DDS-AU can feasibly be applied to much higher than five-dimensional problems. Unlike the method of Evers and Lerner [1998], which finds only local optima to characterize uncertainty, DDS-AU can sample from all high-pseudolikelihood regions rather than local optima only. Finally, DDS-AU is presented as a methodology that parallels GLUE, complete with a procedure for updating prediction uncertainty as additional calibration data becomes available.

Using DDS Within the GLUE Framework

[55] As described in section 2.1, DDS-based sampling can be utilized to independently find multiple behavioral samples for use within the GLUE framework. However, there is another way to utilize DDS-based sampling in a traditional GLUE analysis given ample computational resources and time. All parameter sets evaluated in the DDS optimization trials (not necessarily just one solution per optimization trial) can be analyzed to determine the minimum-sized parameter space hypercube that contains all parameter sets known to meet the behavioral threshold. Then, a second sampling experiment such as traditional uniform random sampling can be conducted within the reduced hypercube bounds to augment the DDS-based behavioral samples. This parameter bound reduction approach has the potential to substantially increase uniform random sampling efficiency. An example GLUE efficiency improvement is estimated on the basis of our results at the end of section 3.2. However, we do not fully demonstrate either approach to utilizing DDS within the GLUE framework in this paper.

Cannonsville Watershed SWAT2000 Calibration Case Study

[56] Four real calibration uncertainty analysis problems based on the Cannonsville watershed SWAT2000 model described by Tolson and Shoemaker [2007a, 2007b] and Tolson [2005] were utilized in this study. Three of the calibration uncertainty analysis problems focused on a small subwatershed called Town Brook so as to minimize computational constraints and allow for large numbers of model simulations. Two of the problems were focused on flow calibration (14 uncertain parameters) so as to better replicate the majority of GLUE hydrologic calibration uncertainty analysis studies reported in the literature. The other two problems were focused on the simultaneous calibration of 26 or 30 flow, total suspended sediment (TSS) and total phosphorus (P) parameters since the Cannonsville SWAT2000 model was developed for water quality management purposes.

Town Brook Flow Calibration

[57] A flow calibration example for a small subwatershed of the Cannonsville watershed called Town Brook was formulated to consider the uncertainty in the set of SWAT2000 flow parameters optimized in Tolson and Shoemaker [2007a]. These 14 model parameters impact snowmelt, surface runoff, groundwater, lateral flow and evapotranspiration predictions and are listed along with their ranges in Table 1. The Town Brook calibration was repeated for two cases. First, the wide default parameter