Statistical methods for assessing groundwater compliance and cleanup: a regulatory perspective


Groundwater Quality: Remediation and Protection (Proceedings of the Prague Conference, May 1995). IAHS Publ. no. 225.

E. F. WEBER
Landau Associates, th Ave. W., PO Box 1029, Edmonds, Washington, USA

Abstract Regulatory corrective action requirements can be imposed at sites in the USA where a statistically significant groundwater quality impact is documented. The regulatory basis for determining that an impact has occurred is a detection monitoring decision. The basis for determining that an impact no longer exists is a corrective action monitoring decision. Detection monitoring typically involves periodic comparisons between background and downgradient monitoring well data to verify compliance of a facility. Corrective action monitoring involves comparisons between downgradient monitoring well data and a cleanup standard to evaluate remedial action effectiveness. Recommended statistical procedures for making these regulatory decisions include hypothesis tests, statistical interval estimates, and trend analyses. Application of these methods within the context of regulatory groundwater monitoring programmes raises a number of issues regarding acceptable statistical power, false positive rates, and distributional assumptions.

INTRODUCTION

Two major federal laws that provide for protection and cleanup of groundwater in the USA are the Resource Conservation and Recovery Act (RCRA) and the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA), or Superfund. These laws, administered by the US Environmental Protection Agency (EPA), require a determination of whether a groundwater quality impact exists at a site as a basis for either requiring implementation or allowing termination of remedial corrective action. The regulatory basis for determining whether an impact has occurred is a detection monitoring decision. The basis for determining whether an impact no longer exists is a corrective action monitoring decision. The purpose of this paper is to present the recommended statistical approaches used to make these decisions and to discuss some of the issues that have arisen regarding their implementation and effectiveness.

REGULATORY FRAMEWORK

Both RCRA and CERCLA address releases of hazardous substances to groundwater; however, they differ fundamentally in their approaches. RCRA focuses on preventing releases through regulated management practices at landfills and at facilities that treat, store, or dispose of hazardous waste. RCRA also has stringent cleanup provisions but is primarily a preventive programme. By contrast, CERCLA is a response-oriented programme enacted to address threats to human health and the environment, typically from accidental spills or unsafe management practices.

Nevertheless, RCRA and CERCLA both require characterization of groundwater flow direction, rate, and quality when an aquifer is threatened. They also dictate the use of statistical procedures in making regulatory decisions to assess the need for corrective action. The regulations allow the use of an alternative procedure subject to EPA approval; however, in practice EPA's recommended procedures are typically used (see Table 1).

Table 1 RCRA and CERCLA statistical regulations and guidance.

Monitoring            Regulations                           Guidance
programme
Detection and         RCRA: 40 CFR(1) parts 264             1. Statistical Analysis of Ground-Water
compliance            (hazardous waste) and 258             Monitoring Data at RCRA Facilities:
monitoring            (municipal waste)                     Interim Final Guidance (EPA, 1989)
                      CERCLA: not applicable                2. Statistical Analysis of Ground-Water
                                                            Monitoring Data at RCRA Facilities:
                                                            Addendum to the Interim Final
                                                            Guidance (EPA, 1992a)
Corrective action     RCRA: same as above                   Methods for Evaluating the Attainment
monitoring            CERCLA: no specific regulations       of Cleanup Standards, vol. 2: Ground
                                                            Water (EPA, 1992b)

(1) CFR = Code of Federal Regulations.

GROUNDWATER MONITORING APPROACH

RCRA defines three monitoring programmes for facilities that are required to monitor groundwater: detection monitoring, compliance monitoring, and corrective action monitoring. CERCLA, by contrast, addresses only corrective action monitoring.

Detection monitoring (Fig. 1) requires at least one background well (hydraulically upgradient from a potential source) and three downgradient wells. The purpose of detection monitoring is early detection of a release to groundwater, should one occur, based on comparison of downgradient well data with background data for a limited number of water quality parameters.

Compliance monitoring is implemented if detection monitoring indicates a statistically significant likelihood of a release. Compliance monitoring mandates sampling for an expanded suite of hazardous constituents and requires establishment of concentration limits (compliance or cleanup standards) for any of these constituents that are detected. Downgradient well data are compared to the concentration limits for each well on a periodic basis (e.g. semi-annually). The purpose of compliance monitoring is to determine whether the release to groundwater is significant enough to warrant corrective action.

Corrective action monitoring is typically implemented if compliance monitoring indicates a statistically significant groundwater impact. Corrective action typically requires a more extensive characterization programme and remedial measures.

[Fig. 1 Detection monitoring paradigm (plan view): background wells upgradient of the facility, downgradient compliance wells below it.]

Data comparisons are similar to those of compliance monitoring; however, they also involve trend analyses of the downgradient well data. The purpose of corrective action monitoring is to document the effectiveness of remediation and the attainment of cleanup goals.

STATISTICAL FRAMEWORK

Regulatory decisions for the three monitoring programmes use a basic statistical approach (EPA, 1988, 1989) consisting of the following steps: (a) establish the null hypothesis and corresponding false positive rate, (b) determine the data distribution and choose a statistical method, and (c) apply the statistical method and reject or accept the null hypothesis based on the outcome. The first step, establishing the null hypothesis, is taken by EPA.

Determination of the monitoring hypothesis

The monitoring or null hypothesis (H0) is what is assumed to be true about the system prior to data collection, until indicated otherwise. The alternative hypothesis (H1) is accepted if the data indicate that the null hypothesis is unlikely. For a detection monitoring decision null hypothesis, the groundwater is assumed to be clean. For a corrective action monitoring decision null hypothesis, the groundwater is assumed to be contaminated. Therefore, from a statistical standpoint, the burden of proof to demonstrate that the site is clean is opposite for the two programmes. For the compliance monitoring programme, the regulations allow either a detection monitoring or a corrective action monitoring decision null hypothesis to be used, depending on which statistical method is used.
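As an illustration of steps (a) to (c), the following minimal Python sketch (not part of the guidance; the data, the choice of tests, and the 0.05 normality threshold are illustrative assumptions) walks a single well comparison through the detection monitoring null hypothesis using scipy:

```python
# Minimal sketch of the three-step framework (hypothetical data, not from the paper).
import numpy as np
from scipy import stats

background = np.array([3.1, 2.7, 3.4, 2.9, 3.6, 3.0, 2.8, 3.3])  # mg/l, upgradient well
downgradient = np.array([3.8, 4.1, 3.5, 4.4])                    # mg/l, compliance well

alpha = 0.01  # comparisonwise false positive rate

# Step (a): H0 = "no impact" (downgradient concentrations equal background).
# Step (b): check the distributional assumption before choosing a test.
_, p_norm = stats.shapiro(np.concatenate([background, downgradient]))
if p_norm > 0.05:
    # Data plausibly normal: one-sided two-sample t-test.
    _, p = stats.ttest_ind(downgradient, background, alternative='greater')
else:
    # Otherwise fall back to a nonparametric rank test.
    _, p = stats.mannwhitneyu(downgradient, background, alternative='greater')

# Step (c): reject H0 (declare a statistically significant impact) if p < alpha.
print('impact indicated' if p < alpha else 'no impact indicated')
```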

False positive and false negative rates

With statistical inference there is a probability of making two types of wrong decision concerning the null hypothesis: a false positive or a false negative. For detection monitoring, a false positive is equivalent to determining that the site is contaminated when it is not. A false negative is failing to detect contamination, and statistical power (1 minus the false negative rate) is the ability of a test to detect contamination. For a corrective action decision, these definitions are reversed because of the different null hypothesis; for example, a false positive is equivalent to determining that the site is clean when it is not.

For detection monitoring, EPA typically specifies a minimum false positive rate, α, of 0.01 for a single well comparison (comparisonwise) and 0.05 for multiple comparisons (experimentwise). Unfortunately, the experimentwise false positive rate of 0.05 is typically difficult to achieve because of the multiple comparisons necessary to demonstrate compliance. Detection and compliance monitoring require a statistical assessment for each downgradient well for each constituent. Because there are typically numerous downgradient wells and multiple constituents, multiple statistical tests or comparisons (each with an α of 0.01) are necessary for each sampling event to demonstrate compliance. For 20 independent comparisons, the experimentwise false positive rate for each sampling event equals 1 - (1 - 0.01)^20, or 18%. Lowering the experimentwise false positive rate requires lowering the comparisonwise false positive rate below 0.01; unfortunately, the result is lower statistical power to detect contamination when it exists. Controlling the experimentwise false positive rate is of less concern for corrective action because the number of comparisons is typically smaller, and the attainment decision may be a discrete event as opposed to the periodic (e.g. semi-annual) decisions required for detection and compliance monitoring. EPA's primary goal in specifying statistical tests is to maintain statistical power while keeping a low experimentwise false positive rate. One approach to increasing statistical power without impacting the false positive rate is to assume an appropriate statistical distribution for the data.

Distributional assumptions

In 1992 EPA reversed its policy of using the normal distribution as the default assumption for application of statistical tests in favour of the lognormal distribution, and put more emphasis on nonparametric methods (EPA, 1992a). A primary reason for this reversal is that statistical tests to detect nonnormality lack sufficient power for the small sample sizes (Montgomery et al., 1987) typical of many RCRA and CERCLA sites. Therefore, if normality is the default assumption, it will typically be applied unless the data are grossly nonnormal. Because groundwater quality data are more typically approximated by a lognormal distribution (McBean & Rovers, 1992; Helsel & Hirsch, 1992; Montgomery et al., 1987), the new policy should result in increased power of statistical tests. Physical explanations can also account for lognormal groundwater quality data: flow-related soil properties such as hydraulic conductivity and solute dispersion coefficients are reported as lognormal (Rao et al., 1979), and successive dilutions can also produce lognormal concentrations (Ott, 1990).
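The multiple-comparisons arithmetic discussed above under false positive rates is easy to reproduce; the following sketch (illustrative only) computes the experimentwise rate for 20 comparisons and, as an aside, the comparisonwise rate that would hold the experimentwise rate at 0.05:

```python
# Experimentwise false positive rate for k independent comparisons,
# reproducing the 1 - (1 - 0.01)**20, or 18%, figure from the text.
def experimentwise_rate(alpha_comparison, k):
    return 1.0 - (1.0 - alpha_comparison) ** k

print(experimentwise_rate(0.01, 20))       # ~0.182

# The comparisonwise rate needed to hold the experimentwise
# rate at 0.05 across 20 independent comparisons:
print(1.0 - (1.0 - 0.05) ** (1 / 20))      # ~0.0026, well below 0.01
```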

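A minimal sketch of how the distributional check might be carried out in practice (hypothetical concentration data; note that, as stated above, such tests have little power at the small sample sizes typical of monitoring programmes):

```python
# Checking the default lognormal assumption (hypothetical concentrations).
import numpy as np
from scipy import stats

conc = np.array([1.2, 0.8, 5.4, 2.1, 0.6, 9.7, 1.5, 3.3, 0.9, 2.8])  # mg/l

_, p_raw = stats.shapiro(conc)           # test normality of the raw data
_, p_log = stats.shapiro(np.log(conc))   # test normality of the log-transformed data

# Under the 1992 policy the lognormal model is the default, so parametric tests
# are applied to log-transformed data unless the logs are clearly nonnormal.
print(f'raw p = {p_raw:.3f}, log p = {p_log:.3f}')
```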
For practical reasons, EPA also stresses nonparametric methods, which are not dependent on distributional assumptions (typically difficult to verify) and which are more resistant to the large proportions of nondetects common in sample data. Helsel & Hirsch (1992) suggest that nonparametric methods typically have equivalent or greater power than parametric tests for most water resource data sets.

STATISTICAL METHODS

The type of statistical method chosen to test the null hypothesis depends on the type of monitoring programme and the data distribution. The performance of the method depends on factors that affect the false positive rate and the corresponding statistical power. Some of these issues are discussed below, along with the recommended methods for each monitoring programme.

Detection monitoring

Statistical tests for detection monitoring include hypothesis tests and interval estimates. RCRA guidance recommends a one-way analysis of variance (ANOVA) hypothesis test procedure, or prediction or tolerance intervals. These tests can be either parametric or nonparametric. Another approach discussed by EPA (1988), intra-well comparisons, has some advantages for detection monitoring but is limited from a practical standpoint; intra-well comparisons are not discussed in this paper.

ANOVA tests for differences among several independent group means; data from individual background and downgradient wells are each treated as a separate group. ANOVA controls the experimentwise false positive rate by testing multiple downgradient wells in a single test for a given constituent. Gibbons (1994), however, points out disadvantages of using ANOVA. For example, the false positive rate can still be high if there is a large number of constituents, each requiring a separate test. ANOVA is sensitive to spatial variability between downgradient wells, has low power to detect a thin plume of contamination in large monitoring networks, and increases monitoring costs because four independent samples at each well are necessary before a comparison can be made.

Three types of statistical intervals are described in the guidance: confidence, tolerance, and prediction (Table 2). They are all calculated based on the form x̄ + K(1-α, n) × s, where K is a factor typically read from statistical tables, n is the number of sample observations, and x̄ and s are the sample mean and standard deviation. The interval is compared to an observation at a downgradient well (detection monitoring decision) or to a risk based standard (corrective action decision). Prediction and tolerance intervals (calculated from pooled background data) are used for detection monitoring.

Prediction intervals are typically calculated to contain all (100% coverage) observations (m) from a future sample with a specified percent confidence [(1 - α) × 100]. The experimentwise false positive rate, α, can be kept at 0.05 by adjusting the comparisonwise false positive rate to α/m (for 100 future comparisons: 0.05/100 = 0.0005). In the case of prediction intervals, the factor K depends on α/m instead of α. With large numbers of future observations, the comparisonwise false positive rate becomes very low (typically lower than the EPA threshold of 0.01), leading to a very large interval and low statistical power. For this reason, prediction intervals are most applicable to smaller compliance well networks.
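As an illustration of the interval form x̄ + K × s, the following sketch computes an upper prediction limit using one common normal-theory choice of K (the Bonferroni-adjusted t factor; the tabled factors in the guidance may differ) with hypothetical background data:

```python
# Upper prediction limit of the form xbar + K*s, with
# K = t_(1 - alpha/m, n-1) * sqrt(1 + 1/n),
# splitting alpha over m future comparisons as in the text's
# 0.05/100 = 0.0005 example.
import numpy as np
from scipy import stats

def upper_prediction_limit(background, alpha=0.05, m=1):
    x = np.asarray(background, dtype=float)
    n, xbar, s = x.size, x.mean(), x.std(ddof=1)
    k = stats.t.ppf(1 - alpha / m, df=n - 1) * np.sqrt(1 + 1 / n)
    return xbar + k * s

bg = [3.1, 2.7, 3.4, 2.9, 3.6, 3.0, 2.8, 3.3]         # hypothetical background data
print(upper_prediction_limit(bg, alpha=0.05, m=100))  # the limit grows as m increases
```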

Table 2 Statistical intervals in the guidance.

Type                  Purpose(a)                                    Detection   Compliance   Corrective
                                                                                             action
Confidence interval   Interval to contain a population                  -           /            /
                      parameter (mean)
Tolerance interval    Interval to contain at least a specified          /           -            -
                      proportion (95%) of the population
Prediction interval   Interval to contain all of the                    /           -            -
                      observations from a future sample

(a) Adapted from Hahn & Meeker (1991).

Tolerance intervals are calculated to contain a proportion (e.g. 95% coverage) of observations from a future sample with a specified percent confidence [(1 - α) × 100]. Implicit in the tolerance interval is a certain percentage of false positive results (100% minus the coverage), leading inevitably to facility noncompliance and made worse as the number of comparisons increases. Because false positives are expected with this interval, it is reasonable to retest failures to verify that they are indicative of an actual impact. This approach was suggested by McNichols & Davis (1988) and Gibbons (1991) and was eventually endorsed in the RCRA guidance (EPA, 1992a). Retesting is a simple approach that controls the false positive rate, yet has good statistical power.

One proposed retesting approach combines the two intervals (Gibbons, 1991). A tolerance interval (with 95% coverage) is initially constructed. If a facility has 100 compliance measurements, then five failures should occur by chance alone. A prediction interval is then constructed to contain five future observations, which are then retested. Noncompliance is indicated only if the second sampling event results in failure; the power can be increased slightly by adding a second retest (a sketch of this two-stage scheme is given at the end of this section). The effectiveness of this approach relative to the EPA reference power curve (EPA's recommended minimum acceptable power) is demonstrated in Fig. 2.

Compliance monitoring

Statistical tests for compliance monitoring include comparison of a confidence interval or tolerance interval (calculated from downgradient well data) to a compliance standard. The compliance standard can be equal to the average background concentration or to a risk based standard. In the case of a background standard, a lower confidence interval on the mean is recommended; this approach uses the detection monitoring null hypothesis. In the case of a risk based standard, an upper tolerance interval is recommended; this approach uses the corrective action monitoring null hypothesis. Conceptually, these two approaches are not compatible and will be prone to result in different compliance decisions for the same data set. Gibbons (1994) addresses additional statistical problems with compliance monitoring policy.
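Returning to the two-stage retesting strategy described above under detection monitoring, a minimal sketch follows (hypothetical data and well names; the limit factors are standard normal-theory constructions, not the exact tabled values in the guidance):

```python
# Sketch of the tolerance-limit-plus-retest strategy attributed to Gibbons (1991).
import numpy as np
from scipy import stats

def upper_tolerance_limit(x, coverage=0.95, confidence=0.95):
    # One-sided normal tolerance limit via the noncentral t distribution.
    x = np.asarray(x, dtype=float)
    n = x.size
    nc = stats.norm.ppf(coverage) * np.sqrt(n)   # noncentrality parameter
    k = stats.nct.ppf(confidence, df=n - 1, nc=nc) / np.sqrt(n)
    return x.mean() + k * x.std(ddof=1)

def upper_prediction_limit(x, alpha, m):
    x = np.asarray(x, dtype=float)
    n = x.size
    k = stats.t.ppf(1 - alpha / m, df=n - 1) * np.sqrt(1 + 1 / n)
    return x.mean() + k * x.std(ddof=1)

background = np.array([3.1, 2.7, 3.4, 2.9, 3.6, 3.0, 2.8, 3.3])  # hypothetical
initial = {'MW-1': 3.9, 'MW-2': 5.2}                             # first-round results

tl = upper_tolerance_limit(background)
failures = [w for w, v in initial.items() if v > tl]   # ~5% expected by chance

if failures:
    resample = {'MW-2': 5.0}                           # hypothetical resamples
    pl = upper_prediction_limit(background, alpha=0.05, m=len(failures))
    # Noncompliance is indicated only if the retest also exceeds the limit.
    confirmed = [w for w in failures if resample.get(w, 0.0) > pl]
    print('confirmed exceedances:', confirmed)
```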

[Fig. 2 Power curve for 95% tolerance and 98% prediction limit (24 background samples; 50 wells; horizontal axis: standard units above background).]

He notes that a 95% confidence, 95% coverage tolerance interval calculated from four samples (a typical number of independent samples available semi-annually) will be approximately five standard deviations above the mean. The calculated interval will have very low statistical power (because of the small sample size), resulting in almost certain noncompliance of the facility.

Corrective action monitoring

Statistical tests for corrective action include trend analyses to determine the effectiveness of remedial action and statistical interval estimates to determine compliance with risk based cleanup standards after termination of remedial action. If the risk based standard is based on chronic toxicity (e.g. incremental cancer risk), a confidence interval on the mean is recommended. If the standard is based on acute toxicity, a tolerance interval is recommended (EPA, 1992b).

The CERCLA guidance discusses three decision points in corrective action monitoring: when to terminate remedial action, when to start attainment sampling, and when to evaluate attainment of cleanup standards. The first two decisions involve trend analyses; evaluation of attainment (whether the remedial action has attained the groundwater cleanup standard) involves interval estimates. The decision to terminate treatment is somewhat subjective but typically involves a determination that contaminant concentrations are decreasing (based on trend analysis) and are at or below the cleanup standard on an absolute basis. After treatment is terminated, sampling for attainment typically does not start until the transient effects of remediation have dissipated (i.e. the system is at steady state). Evaluation of steady state can be made based on trend analyses. Trend analysis applications to groundwater data sets are discussed by Gibbons (1994), Helsel & Hirsch (1992), and Gilbert (1987). Once steady state has been verified, attainment sampling begins.
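A minimal sketch of a trend check of the kind used for these first two decisions (hypothetical quarterly data; Kendall's tau of concentration against sampling time is the statistic behind the Mann-Kendall trend test discussed in the references cited above):

```python
# Trend check for the corrective action decisions (hypothetical data).
import numpy as np
from scipy import stats

time = np.arange(12)                              # quarterly sampling events
conc = np.array([9.8, 9.1, 8.7, 8.9, 7.6, 7.2,
                 6.8, 6.1, 5.9, 5.2, 4.8, 4.5])   # mg/l during remediation

tau, p = stats.kendalltau(time, conc)

if p < 0.05 and tau < 0:
    print('significant decreasing trend: remediation appears effective')
elif p < 0.05 and tau > 0:
    print('significant increasing trend')
else:
    print('no significant trend: plausibly at steady state')
```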

After collecting attainment samples for a specified period, typically a number of years, a statistical interval is calculated. Comparison of the upper interval limit to the cleanup standard is the basis for determining whether the cleanup has been attained or whether more remediation is necessary.

REFERENCES

EPA (1988) Statistical Methods for Evaluating the Attainment of Superfund Cleanup Standards, vol. 2: Groundwater. Draft. US Environmental Protection Agency.
EPA (1989) Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. Office of Solid Waste, Waste Management Division, US Environmental Protection Agency.
EPA (1992a) Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency.
EPA (1992b) Methods for Evaluating the Attainment of Cleanup Standards, vol. 2: Ground Water. US Environmental Protection Agency.
Gibbons, R.D. (1991) Statistical tolerance limits for ground-water monitoring. Groundwater 29(4).
Gibbons, R.D. (1994) Statistical Methods for Groundwater Monitoring. John Wiley & Sons, New York.
Gilbert, R.O. (1987) Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold Co., New York.
Hahn, G.J. & Meeker, W.Q. (1991) Statistical Intervals: A Guide for Practitioners. John Wiley & Sons, New York.
Helsel, D.R. & Hirsch, R.M. (1992) Statistical Methods in Water Resources. Studies in Environmental Science 49, Elsevier, Amsterdam.
McBean, E.A. & Rovers, F.A. (1992) Estimation of the probability of exceedance of contaminant concentrations. Groundwat. Monitor. Review 12(1).
McNichols, R.J. & Davis, C.B. (1988) Statistical issues and problems in ground water detection monitoring at hazardous waste facilities. Groundwat. Monitor. Review 8(4).
Montgomery, R.H., Loftis, J.C. & Harris, J. (1987) Statistical characteristics of ground-water quality variables. Groundwater 25(2).
Ott, W.R. (1990) A physical explanation of the lognormality of pollutant concentrations. J. Air Waste Management Ass. 40.
Rao, P.V., Rao, P.S.C., Davidson, J.M. & Hammond, L.C. (1979) Use of goodness-of-fit tests for characterizing the spatial variability of soil properties. Soil Sci. Soc. Am. J. 43.