Treatment of Censored Data and Dependency among Sampling Distributions in Probabilistic Human Health Exposure Assessment


Junyu Zheng 1, H. Christopher Frey 2

1 College of Environmental Science and Engineering, South China University of Technology, Guangzhou, China
2 Department of Civil, Construction, and Environmental Engineering, North Carolina State University, Raleigh, NC, USA

Abstract

Probabilistic human health exposure assessment is commonly challenged by non-detected data in measurement datasets and by dependencies among sampling distributions. Methods for dealing with non-detected data, dependencies, and the associated uncertainty are introduced. An example case study was performed to demonstrate application of the methods and to investigate how exposure estimates and associated uncertainty are affected by different methods. The results show that improper handling of non-detected data may bias the mean and other statistics, such as the 97.5th percentile, as well as their uncertainty. A maximum likelihood estimation (MLE)/bootstrap approach is suggested for dealing with non-detected data. A modified two-stage Monte Carlo approach, in which sampling distributions obtained from bootstrap simulation are kept in pair-wise data format, is able to capture complex dependencies among sampling distributions and is therefore recommended as a technique for propagating uncertainty.

Keywords: Bootstrap Simulation, Censored Data, Monte Carlo Simulation, Exposure Assessment, Uncertainty

1. Introduction

Probabilistic methods for quantitatively characterizing variability and uncertainty in exposure and risk assessment have received increasing attention in the last 15 years [1-4]. For example, probabilistic approaches for assessing human exposure have been recommended by the National Research Council [5] and the U.S. Environmental Protection Agency [6] because they can quantify the variability and uncertainty in exposure for a population of interest. In probabilistic approaches, variability and uncertainty in model inputs are explicitly represented and are propagated through a model. With quantitative information on both the variability and uncertainty of exposures and risks, decision-makers can assess whether a particular decision is likely to be robust to variability, incomplete knowledge, or both [7, 8].

There is a growing track record of the use of quantitative methods for characterizing both variability and uncertainty in various applications, including human health and ecological risk assessment [1-4, 9-11]. For example, Cohen et al. [9] demonstrated a two-stage Monte Carlo simulation to quantify variability and uncertainty in human health exposure assessment. EPA is quantifying the variability and uncertainty in cumulative human exposure to chemical pollutants in the Stochastic Human Exposure and Dose Simulation (SHEDS) series of models (e.g., SHEDS/Pesticide, SHEDS/Wood, and SHEDS/Air Toxics) [12] using the two-stage Monte Carlo simulation framework.

However, probabilistic exposure assessment is commonly challenged by incomplete data, such as censored values in sample measurement datasets. For example, in the National Human Exposure Assessment Survey (NHEXAS) database, 30% to 70% of observations are below detection limits for many pollutants; in some cases, as much as 90% of observations are below detection limits, such as for chlorpyrifos in outdoor air [13]. NHEXAS is a multistage probability-based empirical exposure study covering multiple pollutants, media, and pathways.
In most published analyses using NHEXAS data, censored data were ignored or treated using inaccurate assumptions (e.g., replacing non-detects with zero, half of the detection limit, or the detection limit, or using so-called robust methods [11]). However, all of these approaches are biased [14].

Failure to properly account for dependencies among model inputs or sampling distributions may also lead to bias in exposure estimates. Currently, a two-stage Monte Carlo simulation approach is often used for propagating the variability and uncertainty in model inputs to the model output [9]. For example, Buck et al. [15] used the two-stage Monte Carlo simulation technique to estimate variability and uncertainty in chlorpyrifos exposure and dose for case studies based upon Minnesota and Arizona NHEXAS data. In this approach, parametric probability distribution models are generally used to describe independent sampling distributions representing uncertainty in the parameters of a variability distribution. However, if there are strong dependencies between distribution parameters, such as for the gamma distribution [16], this approach may lead to bias in exposure estimates and associated uncertainty.

The purposes of this paper are to: (1) develop methods for dealing with non-detects and associated uncertainty in probabilistic human exposure assessment; (2) investigate how dependencies among sampling distributions can be handled in order to reduce bias in exposure estimates and associated uncertainty; and (3) analyze how exposure estimates and associated uncertainty are affected by different methods for dealing with these issues. A case study is used to demonstrate the methods and to help analyze the effects of the different methods on exposure estimates.

2. Methodology

2.1. Methods for dealing with non-detects and associated uncertainty

Measurement data may be reported as non-detected, or censored, when observed values are less than the detection limit of the measurement instrument. Conventional methods for treating non-detected data include replacing non-detects with zero, half of the detection limit, or the detection limit, or simply discarding such data; however, these approaches may lead to biased estimates. A maximum likelihood estimation and bootstrap (MLE/bootstrap) simulation approach, introduced by Zhao and Frey [14], is used here to deal with censored data and associated uncertainty in the context of human health exposure assessment. This approach is expected to provide more robust estimates than the conventional approaches.

In this approach, MLE is used to estimate the parameters of a distribution representing the censored data set. The likelihood function for a censored data set having one or more detection limits is [14]:

$$L(\alpha, \beta) = \prod_{i=1}^{n} f(x_i \mid \alpha, \beta) \cdot \prod_{m=1}^{P} \left[ F(DL_m \mid \alpha, \beta) \right]^{ND_m} \quad (1)$$

where:
x_i = detected data point, i = 1, 2, ..., n
α, β = parameters of the distribution
ND_m = number of non-detects corresponding to detection limit DL_m, m = 1, 2, ..., P
P = number of detection limits
f = probability density function
F = cumulative distribution function

For data that are below a detection limit, the cumulative probability at the detection limit is used in lieu of the likelihood. For computational convenience, it is more common to work with the log-likelihood function than with the likelihood itself. The optimal parameter estimates are found by maximizing the log-likelihood; a non-linear optimization algorithm is typically used because of its computational efficiency. A minimal sketch of this fit is given below.
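To make equation (1) concrete, the following is a minimal sketch (not the authors' code) of fitting a gamma distribution to a left-censored sample by maximizing the censored log-likelihood with SciPy. The sample values, the single detection limit, and the non-detect count are illustrative assumptions, not the NHEXAS data.

```python
import numpy as np
from scipy import stats, optimize

# Illustrative left-censored sample: detected values plus a count of
# non-detects below a single detection limit (equation 1 with P = 1).
detects = np.array([2.1, 2.7, 3.3, 4.0, 5.6, 7.9, 9.4])
dl, n_nd = 2.0, 5   # detection limit and number of non-detects (made up)

def neg_log_likelihood(theta):
    shape, scale = theta
    if shape <= 0 or scale <= 0:
        return np.inf
    # Detected points contribute log f(x_i | alpha, beta); each non-detect
    # contributes log F(DL | alpha, beta), per equation (1).
    ll = stats.gamma.logpdf(detects, a=shape, scale=scale).sum()
    ll += n_nd * stats.gamma.logcdf(dl, a=shape, scale=scale)
    return -ll

# Nelder-Mead: a simple derivative-free non-linear optimizer for this 2-D problem.
fit = optimize.minimize(neg_log_likelihood, x0=[1.0, 3.0], method="Nelder-Mead")
shape_hat, scale_hat = fit.x
print(f"MLE gamma fit: shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
```

With a single detection limit, equation (1) reduces to the two terms in `neg_log_likelihood`; additional detection limits would simply add one `logcdf` term per limit.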
Bootstrap simulation is applicable to characterizing the uncertainty associated with a distribution for variability fit to censored data. Compared to a typical parametric bootstrap simulation [17], a bootstrap sample cannot be generated directly for a censored data set, because random variation in the number of data points below the detection limit must be considered. Instead, an empirical bootstrap approach is used. Details of the empirical bootstrap approach for characterizing the uncertainty associated with censored data are found in Zhao and Frey [14].

2.2. Methods for treating dependencies among sampling distributions

Because a typical two-stage Monte Carlo framework cannot properly capture dependencies among the sampling distributions [9], a modified approach was proposed. Rather than using a compact parametric probability distribution model, the approach uses pair-wise sampling data, obtained from bootstrap simulation, to represent uncertainty in, and dependence between, two or more parameters of a model input's variability distribution. During uncertainty propagation, each alternative distribution of variability for a given input is generated from appropriately paired sampling data for the distribution parameters, not from a random combination of parameters generated from independent marginal sampling distributions. Because the dependencies among the parameters are preserved, the modified approach reduces the bias in exposure estimates and associated uncertainty that would otherwise arise from inappropriately handling dependencies among sampling distributions. A hedged sketch of generating such pair-wise parameter samples follows.
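The following sketch, continuing the example above, illustrates one plausible reading of the empirical bootstrap for censored data; it is not Zhao and Frey's implementation [14]. Each replicate resamples the pooled detects and non-detect placeholders with replacement, so the censoring count varies randomly across replicates, and each replicate is refit by MLE. The rows of the returned array are the pair-wise (shape, scale) samples that the modified two-stage approach of Section 2.2 consumes intact.

```python
def bootstrap_censored(detects, dl, n_nd, n_boot=500, seed=0):
    """Empirical bootstrap for a left-censored sample: pool detected values
    with NaN placeholders for non-detects, resample the pool with replacement
    (so the number of censored points varies randomly per replicate), and
    refit the gamma parameters by MLE for each replicate.  Keeping the rows
    intact preserves any dependence between the two parameters."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([detects, np.full(n_nd, np.nan)])
    pairs = []
    for _ in range(n_boot):
        resample = rng.choice(pooled, size=pooled.size, replace=True)
        d = resample[~np.isnan(resample)]        # detected values this replicate
        nd = int(np.isnan(resample).sum())       # non-detect count this replicate

        def nll(theta, d=d, nd=nd):
            shape, scale = theta
            if shape <= 0 or scale <= 0:
                return np.inf
            return -(stats.gamma.logpdf(d, a=shape, scale=scale).sum()
                     + nd * stats.gamma.logcdf(dl, a=shape, scale=scale))

        pairs.append(optimize.minimize(nll, x0=[1.0, 3.0],
                                       method="Nelder-Mead").x)
    return np.array(pairs)                       # shape (n_boot, 2)

pairs = bootstrap_censored(detects, dl, n_nd)    # variables from the sketch above
```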

3. Introduction to Case Study

The example case study focuses on characterizing inter-individual variability and uncertainty in outdoor human exposures to urban air toxics. Outdoor exposures to urban air toxics have been linked to health hazards such as asthma [13]. The data for the case study are from the NHEXAS-Arizona Stage III database [18]. The case study assesses the average weekly residential outdoor exposure to formaldehyde of a population residing in Tucson, Arizona. Since the main purpose of the case study is to demonstrate how to properly deal with non-detects and uncertainty propagation, only the inhalation pathway was assessed.

Valid sample measurements from the NHEXAS database are available for 37 individuals, of whom 23 are male and 14 are female. The observed formaldehyde concentrations are reported as one-week averages, and 23 samples (65 percent of the data) are non-detects below a single detection limit (in ug/m3).

The exposure model for inhalation used in this study is [19]:

$$E = \frac{C \cdot T \cdot IR \cdot AF}{BW \cdot 1000} \quad (2)$$

where:
AF = absorption factor (constant; 0.80 for formaldehyde [20])
BW = body weight (kg)
C = observed concentration of the pollutant in residential outdoor air (ug/m3)
E = exposure (ug/kg/day)
IR = inhalation rate (L/min)
T = time spent outdoors (min/day)

Estimates of the inhalation rate across all population groups and activity types are obtained from body surface area, which depends on height and body weight, and the unit-area breathing rate [19]:

$$IR = SA \cdot R_B \quad (3)$$

where:
R_B = unit-area breathing rate (L/min/m2): 5.0 for males and 4.7 for females [19]
SA = body surface area (m2), calculated as an empirical function of body height H (cm) and body weight BW (kg) [11]:

$$SA = a \cdot H^{b} \cdot BW^{c} \quad (4)$$

where a, b, and c are placeholder symbols for the empirical constants given in [11], which are not reproduced here. A hedged implementation sketch of equations (2) through (4) follows.
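As a worked illustration of equations (2) through (4), here is a minimal sketch of the inhalation exposure calculation. Because the constants of equation (4) are not reproduced above, the classic DuBois and DuBois surface-area formula is substituted as a stand-in, and the input values in the example call are hypothetical.

```python
def exposure_ug_per_kg_day(C, T, BW, H, sex, AF=0.80):
    """Equation (2): E = C*T*IR*AF / (BW*1000), with IR from equation (3).
    C: outdoor concentration (ug/m3); T: time outdoors (min/day);
    BW: body weight (kg); H: height (cm).  The factor of 1000 converts the
    inhaled volume from liters to cubic meters, giving E in ug/kg/day."""
    RB = 5.0 if sex == "male" else 4.7   # unit-area breathing rate, L/min/m2 [19]
    # Stand-in for equation (4): DuBois & DuBois body surface area (m2);
    # the paper's own coefficients from [11] are not reproduced here.
    SA = 0.007184 * H**0.725 * BW**0.425
    IR = SA * RB                         # inhalation rate, L/min (equation 3)
    return C * T * IR * AF / (BW * 1000.0)

# Example: a hypothetical male subject at an illustrative concentration.
print(exposure_ug_per_kg_day(C=1.5, T=120, BW=70, H=175, sex="male"))
```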
4. Results and Discussion

4.1. Variability and uncertainty in model inputs

The MLE/bootstrap approach was used to estimate the parameters of selected probability distributions for the model inputs and to quantify the uncertainty in selected statistics. Table 1 summarizes the distributions fitted to the model inputs.

Table 1. Summary of fitted distributions for model inputs

Model input                | Sample size | Distribution | First param. (a) | Second param. (b)
Formaldehyde concentration | 37          | Gamma        |                  |
Body weight                | 37          | Gamma        |                  |
Height                     | 37          | Normal       |                  |
Exposure time              | 30 (c)      | Gamma        |                  |

Notes: (a) First parameter: mean of ln(x) for lognormal; scale parameter for gamma and Weibull. (b) Second parameter: standard deviation of ln(x) for lognormal; shape parameter for gamma and Weibull. (c) The exposure time for 7 individuals was reported as zero; these records were discarded in this example.

Figure 1 displays the distribution for variability in formaldehyde concentration, along with bootstrap probability bands representing uncertainty; the figures for the other model inputs are not shown here. These results were obtained from the Analysis of Uncertainty and Variability Tool (AuvTool) [21] based on 500 bootstrap samples. Both the Kolmogorov-Smirnov (K-S) test results and the visual comparison provided by AuvTool show that the chosen distributions appropriately represent the data for the model inputs.

In Figure 1, the range of uncertainty for the portion of the CDF below the detection limit is large. The large uncertainty at the lower tail arises from the high percentage (65%) of left-censoring in the data set. The estimated mean value from the fitted distribution is less than the detection limit.

Fig. 1: Gamma distribution fitted to data containing 23 non-detects, with 50, 90, and 95 percent probability bands representing uncertainty for formaldehyde concentration (cumulative probability versus formaldehyde concentration, ug/m3; detection limit marked).

The estimated mean occurs at the 63.5th percentile of the fitted distribution. Thus, the estimate of the mean formaldehyde concentration is affected by censoring, since it falls in the censored portion of the distribution. In comparison, if all non-detects are replaced with one-half of the detection limit, the estimated mean occurs at the 66.8th percentile of the fitted distribution; if non-detects are replaced with the detection limit, the estimated mean occurs at the 75.2nd percentile. Thus, it is clear that different approaches for dealing with non-detects produce different estimates of the mean. The MLE-based approach has the advantage of being asymptotically unbiased [14].

4.2. Dependencies

Dependencies among the sampling distributions representing uncertainty in distribution parameters and model inputs were investigated. The sampling distributions were obtained from bootstrap simulation. Figure 2 displays an example scatter plot of the sampling distributions of the two parameters of the gamma distribution representing formaldehyde concentration.

Fig. 2: Dependency among parameters of the gamma distribution representing formaldehyde concentration.

As shown in Figure 2, there are strong nonlinear dependencies between the sampling distributions of the parameters of the distributions representing formaldehyde concentration, body weight, and exposure time; the exception is body height, which is described by a normal distribution. A correlation was also identified between the model inputs body weight and height. However, correlations among insignificant inputs generally have little effect on results: in this study, the inhalation rate, body weight, and height were shown to be insignificant contributors to the overall variance, overshadowed by the importance of pollutant concentration and exposure time.

4.3. Variability and uncertainty in exposure estimates

Variability and uncertainty in the model inputs were propagated through equation (2) using Monte Carlo random sampling. The simulation results show that the mean exposure to formaldehyde from the first-stage variability analysis is 0.129 ug/kg/day. The inter-individual variability in weekly average residential outdoor exposure to formaldehyde spans more than three orders of magnitude over a 95% probability range. Selected statistics for representing uncertainty in the exposure estimates include the 2.5th percentile, mean, median, and 97.5th percentile. The mean exposure from the second-stage uncertainty analysis agrees well with the mean estimate (0.129 ug/kg/day) from the first-stage variability analysis. The average estimates of statistics such as the 2.5th and 97.5th percentiles also agree well with their corresponding estimates from the variability analysis. These results show that the approach proposed in this study provides asymptotically unbiased exposure estimates for the mean and other statistics. The uncertainty in the mean exposure estimate ranges from 0.075 to 0.189 ug/kg/day over a 95% probability range.
The relative range of uncertainty in the exposure estimates is greater at the lower tail; for example, the confidence interval at the 2.5th percentile spans more than three orders of magnitude. This implies that a left-censored data set may produce larger uncertainty at the lower tail of the cumulative probability distribution of exposure estimates.

4.4. Effects of different methods for dealing with non-detects on exposure estimates

For comparison, the alternative methods of replacing non-detects with zero, half of the detection limit, or the detection limit, and of discarding non-detected data, were used to evaluate their effects on exposure estimates. Mean exposure estimates and their uncertainties from these methods are shown in Table 2; a small illustrative sketch of the substitution rules appears below.
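The following sketch (a synthetic gamma "truth" and a hypothetical detection limit, not the case-study data) shows how the substitution rules shift the sample mean relative to the full, uncensored sample:

```python
# Illustrative comparison of substitution rules for non-detects on a
# synthetic censored sample; all values here are made up.
rng = np.random.default_rng(1)
truth = stats.gamma.rvs(a=1.2, scale=2.0, size=200, random_state=rng)
limit = 2.5                                   # hypothetical detection limit
detected = truth[truth >= limit]
n_below = int((truth < limit).sum())

for name, v in {"zero": 0.0, "DL/2": limit / 2, "DL": limit}.items():
    est = np.concatenate([detected, np.full(n_below, v)]).mean()
    print(f"replace with {name:5s}: mean = {est:.3f}")
print(f"discard non-detects: mean = {detected.mean():.3f}")
print(f"full-sample mean   : mean = {truth.mean():.3f}")
```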

Table 2. Comparison of methods for dealing with non-detects

Category             | Mean (ug/kg/day) | 95% confidence interval (ug/kg/day)
MLE/bootstrap        |                  | (0.075, 0.189)
Discarded            |                  | (0.153, 0.375)
Replaced with 1/2 DL |                  | (0.082, 0.201)
Replaced with DL     |                  | (0.117, 0.304)
Replaced with 0      |                  | (0.044, 0.155)

These results show that improper handling of non-detects causes bias in the mean exposure estimates as well as in their uncertainties. For example, if non-detected data are replaced with zero, the mean exposure is underestimated. Discarding non-detects or replacing them with the detection limit leads to an overestimate of the mean exposure, as expected. Replacing non-detected observations with half of the detection limit may lead to an over- or under-estimate of the mean exposure, depending on the proportion of non-detects in the sample and the size of the detection limit; in this case, a slight overestimate is observed. Bias in the uncertainty range of the mean exposure was also found for the conventional methods. For example, when non-detected observations are replaced with the detection limit, the confidence interval for the mean is shifted toward higher values compared with the MLE/bootstrap result.

4.5. Effects of dependencies among sampling distributions on uncertainty in exposure estimates

To demonstrate the effects of improperly handling dependencies among sampling distributions, uncertainties in the model inputs were also propagated using the typical two-stage Monte Carlo simulation framework for comparison. The marginal distributions for the distribution parameters were based upon bootstrap simulation results from AuvTool, as summarized in Table 3. These marginal distributions were sampled independently to create families of frequency distributions for variability.

Table 3. Fitted distributions representing uncertainty in the parameters of variability distributions

Model input   | Variability distribution | Parameter | Fitted sampling distribution
Formaldehyde  | Gamma                    | Shape     | gamma(3.554, 4.158)
Formaldehyde  | Gamma                    | Scale     | Weibull(1.381, 1.667)
Body weight   | Gamma                    | Scale     | lognormal(2.618, 0.264)
Body weight   | Gamma                    | Shape     | lognormal(1.627, 0.261)
Height        | Normal                   | Mean      | normal(1.651, 0.0245)
Height        | Normal                   | Std. dev. | beta(58.04, 359.5)
Exposure time | Gamma                    | Scale     | gamma(10.46, 0.187)
Exposure time | Gamma                    | Shape     | lognormal(-0.133, 0.365)

Generally, the mean of any statistic of a distribution from the first-stage variability analysis and from the second-stage uncertainty analysis should be consistent if no bias is created during uncertainty propagation. The results from the typical two-stage approach are summarized in Table 4, together with those from the modified approach used in this study.

Table 4. Comparison of uncertainty in exposure estimates using different propagation methods

Category                   | Statistic         | 95% confidence interval (ug/kg/day)
MLE/bootstrap approach     | Mean              | (0.075, 0.189)
MLE/bootstrap approach     | 2.5th percentile  |
MLE/bootstrap approach     | 97.5th percentile | (0.377, 1.033)
Typical two-stage approach | Mean              | (0.0136, 0.713)
Typical two-stage approach | 2.5th percentile  |
Typical two-stage approach | 97.5th percentile | (0.098, 2.917)

These results indicate that biases in exposure estimates for the mean and other statistics are created if complex dependencies among the sampling distributions are not properly captured. For example, with the typical two-stage approach, the mean exposure estimate from the uncertainty analysis differs significantly from the mean exposure estimate (0.129 ug/kg/day) obtained from the first-stage variability analysis, and the 97.5th percentile of variability is substantially overestimated. A hedged sketch contrasting paired and independently sampled parameters follows.
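As an illustration of why the pairing matters, here is a minimal sketch of the second-stage uncertainty loop for a single gamma-distributed input, using the pair-wise bootstrap samples from the sketch in Section 2. Shuffling one parameter column mimics independent sampling of the marginal sampling distributions; this is an illustrative device, not the paper's exact procedure.

```python
def second_stage_means(pairs, n_var=1000, paired=True, seed=2):
    """Second-stage uncertainty loop for a single gamma-distributed input.
    Each uncertainty iteration realizes one (shape, scale) pair and draws a
    variability sample of size n_var.  paired=False shuffles the scale column
    independently, destroying the shape-scale dependence, as the typical
    two-stage approach implicitly does."""
    rng = np.random.default_rng(seed)
    shapes = pairs[:, 0].copy()
    scales = pairs[:, 1].copy()
    if not paired:
        rng.shuffle(scales)                  # break the shape-scale pairing
    means = [stats.gamma.rvs(a=a, scale=s, size=n_var, random_state=rng).mean()
             for a, s in zip(shapes, scales)]
    return np.percentile(means, [2.5, 50.0, 97.5])

# 'pairs' comes from the bootstrap sketch in Section 2.
print("paired sampling     :", second_stage_means(pairs, paired=True))
print("independent sampling:", second_stage_means(pairs, paired=False))
```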
5. Conclusions

Conventional methods for dealing with non-detected data lead to under- or over-estimates of the mean exposure and other statistics. Generally, replacing non-detected data with zero underestimates the mean exposure, while replacing them with the detection limit or discarding them overestimates it. Replacing non-detected data with half of the detection limit may lead to an under- or over-estimate, depending on the proportion of non-detects in the sample and the size of the detection limit. A left-censored sample with a higher percentage of non-detects leads to greater uncertainty at the lower tail of the cumulative distribution of exposure estimates. The results from the case study show that improper analysis of non-detected data,

even for left-censored data, may cause significant bias in the mean and in upper-tail statistics such as the 95th percentile, as well as in their uncertainty. The MLE/bootstrap approach used in this study is recommended as a method for dealing with censored data because it accounts for the uncertainty associated with random sampling error and provides asymptotically unbiased estimates of statistics such as the mean and the 95th percentile.

Substantial bias can occur in estimates of uncertainty when dependencies among the sampling distributions of the parameters of inter-individual variability distributions are not properly captured. When there are complex dependencies among the sampling distributions, a modified two-stage Monte Carlo approach, in which the sampling distributions obtained from bootstrap simulation are kept in pair-wise data format, is recommended, since it retains such dependencies. However, when there are no dependencies among the sampling distributions, using marginal parametric probability distribution models, as in the typical two-stage Monte Carlo approach, may be preferred, since the uncertainty inputs are organized in a compact form and the simulation is therefore easier to manage and implement during uncertainty propagation.

6. References

[1] Morgan, M.G., and Henrion, M., Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis, Cambridge University Press: New York, 1990.
[2] Frey, H.C., Quantitative Analysis of Uncertainty and Variability in Environmental Policy Making, Directorate for Science and Policy Programs, American Association for the Advancement of Science, Washington, DC, 1992.
[3] EPA (U.S. Environmental Protection Agency), Guiding Principles for Monte Carlo Analysis, EPA/630/R-97/001, Washington, DC, 1997.
[4] Zartarian, V.G., Özkaynak, H., Burke, J.M., Zufall, M.J., Rigas, M.L., and Furtaw, E.J., Jr., A modeling framework for estimating children's residential exposure and dose to chlorpyrifos via dermal residue contact and nondietary ingestion, Environmental Health Perspectives, 108(3), 2000.
[5] NRC (National Research Council), Science and Judgment in Risk Assessment, National Academy Press: Washington, DC, 1994.
[6] EPA (U.S. Environmental Protection Agency), Summary of the U.S. EPA Colloquium on a Framework for Human Health Risk Assessment (Volume 2), Risk Assessment Forum, Washington, DC, 1998.
[7] Cullen, A.C., and Frey, H.C., Probabilistic Techniques in Exposure Assessment, Plenum Press: New York, 1999.
[8] Thompson, K.M., and Graham, J.D., Going beyond the single number: using probabilistic risk assessment to improve risk management, Human and Ecological Risk Assessment, 2(4), 1996.
[9] Cohen, J.T., Lampson, M.A., and Bowers, S., The use of two-stage Monte Carlo simulation techniques to characterize variability and uncertainty in risk analysis, Human and Ecological Risk Assessment, 2(4), 1996.
[10] Frey, H.C., and Zheng, J., Quantification of variability and uncertainty in utility NOx emission inventories, Journal of the Air & Waste Management Association, 52(9), 2002.
[11] Moschandreas, D.J., Ari, H., Karuchit, S., et al., Exposure to pesticides by medium and route: the 90th percentile and related uncertainties, Journal of Environmental Engineering, ASCE, 127(9), 2001.
[12] Ozkaynak, H., Zartarian, V., Xue, J., and Dang, W., USEPA SHEDS model: methodology for exposure assessment for wood preservatives, The 2004 Annual International Conference on Contaminated Soils, Sediments and Water, Amherst, MA, 2004.
[13] Gordon, S.M., P.J.
Callahan, and M.G. Nishioka, Residential environmental measurements in the National Human Exposure Assessment Survey (NHEXAS) pilot study in Arizona: preliminary results for pesticides and VOCs, Journal of Exposure Analysis and Environmental Epidemiology, 9(5), 1999.
[14] Zhao, Y., and Frey, H.C., Quantification of variability and uncertainty for censored data sets and application to air toxic emission factors, Risk Analysis, 24(6), 2004.
[15] Buck, R.J., H. Ozkaynak, J.P. Xue, et al., Modeled estimates of chlorpyrifos exposure and dose for the Minnesota and Arizona NHEXAS populations, Journal of Exposure Analysis and Environmental Epidemiology, 11(3), 2001.
[16] Frey, H.C., and Rhodes, D.S., Characterization and simulation of uncertain frequency distributions: effects of distribution choice, variability, uncertainty, and parameter dependence, Human and Ecological Risk Assessment, 4(3), 1998.

[17] Efron, B., and Tibshirani, R.J., An Introduction to the Bootstrap, Chapman & Hall: London, UK, 1993.
[18] EPA (U.S. Environmental Protection Agency), NHEXAS-Arizona Stage III database (accessed 2005).
[19] EPA (U.S. Environmental Protection Agency), Exposure Factors Handbook, EPA/600/P-95/002Fa-c, Office of Research and Development, Washington, DC, 1997.
[20] USDOE (U.S. Department of Energy), Guidance for Conducting Risk Assessment and Related Risk Activities for the DOE-ORO Environmental Management Program, DOE/BJC/OR-271, 1999.
[21] Frey, H.C., J. Zheng, Y. Zhao, S. Li, and Y. Zhu, Technical Documentation for the Analysis of Uncertainty and Variability Tool (AuvTool), prepared by North Carolina State University for the Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC.