USING 99% STATISTICAL CONFIDENCE LEVELS TO MINIMIZE FALSE POSITIVES IN TOXICITY TESTING

Background

The presence or absence of toxicity is determined by comparing the survival, growth or reproduction of effluent-exposed organisms to that observed for control organisms. Because survival, growth and reproduction vary naturally, the data are analyzed statistically to ensure that any observed difference is large enough that it is unlikely to have been caused by chance alone (see, for example, Guidance for Data Quality Assessment: Practical Methods for Data Analysis EPA QA/G-9 QA00 Update EPA/600/R-96/084 July, 2000; pg. 1-5 & 1-7).

The critical threshold for statistical significance, usually called alpha (α), is selected before the test begins. Historically, the statistical analysis was performed at a 95% confidence level (α = 0.05). This means there was only a 5% chance that a test would fail when, in fact, the sample was not toxic. Such an error is known as a false positive or Type-1 error.

When EPA promulgated standard methods for whole effluent toxicity testing, the agency recommended, but did not require, that the alpha-level be set to 0.05 (corresponding to 95% confidence).

"The data analysis examples included in the manual specify an alpha level of 0.01 for testing the assumptions of hypothesis tests and an alpha level of 0.05 for the hypothesis tests themselves. These levels are common and well accepted levels for this type of analysis and are presented as a recommended minimum significance level for toxicity test data analysis." Short-Term Methods for Estimating the Chronic Toxicity of Effluents and Receiving Water to Freshwater Organisms; EPA (July, 1994) pg. 49.

Recently, EPA published new guidance that recommends the alpha-level be set to 0.01 (corresponding to 99% confidence) under certain conditions. The conditions are discussed in a separate white paper entitled: Demonstrating Toxicity Test Sensitivity.
"The alpha level used for hypothesis testing in WET data analysis may be reduced from 0.05 to 0.01 when: sublethal endpoints (reproduction or growth) from the Ceriodaphnia dubia or fathead minnow tests are reported under NPDES permit requirements, or the NPDES permit limit for WET was derived without allowing for receiving water dilution due to low dilution potential in the receiving system, provided that the WET test is able to maintain adequate test sensitivity." Method Guidance and Recommendation for Whole Effluent Toxicity (WET) Testing (40 CFR Part 136); EPA-821-B (July, 2000) pg

Risk Sciences Page 1 of 6
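The meaning of alpha described above can be illustrated with a small simulation. The sketch below is not part of any EPA method: it assumes a simplified one-sided comparison of group means with a known standard deviation, and the function name, replicate count and trial count are hypothetical choices made only for illustration. Both groups are drawn from the same distribution, so every "failure" is by construction a false positive.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha, trials=20000, replicates=10, sigma=1.0, seed=1):
    """Fraction of simulated tests on a non-toxic sample that fail at `alpha`."""
    rng = random.Random(seed)
    # One-sided critical value: a test "fails" only when the effluent group
    # scores sufficiently lower than the control group.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Standard error of the difference between two group means (sigma known).
    std_err = sigma * (2 / replicates) ** 0.5
    failures = 0
    for _ in range(trials):
        # Both groups come from the same distribution, so there is no true effect.
        control = sum(rng.gauss(0, sigma) for _ in range(replicates)) / replicates
        effluent = sum(rng.gauss(0, sigma) for _ in range(replicates)) / replicates
        if (control - effluent) / std_err > z_crit:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        rate = false_positive_rate(alpha)
        print(f"alpha = {alpha}: simulated false positive rate = {rate:.3f}")
```

Under these assumptions the simulated failure rate should settle close to the chosen alpha: roughly 1 test in 20 at α = 0.05 and roughly 1 in 100 at α = 0.01.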

Benefits

Reducing the alpha-level from 0.05 to 0.01 increases statistical confidence from 95% to 99%. This, in turn, reduces the risk of false positive results by 80% (from 1 in 20 tests to only 1 in 100 tests). In essence, the test becomes somewhat harder to fail. A larger adverse effect must be observed before the laboratory would rule out the possibility of chance variation. For example, on average there must be about a 30% reduction in reproduction among effluent-exposed Ceriodaphnia to be 95% confident that toxicity is present. But, to be 99% confident, the laboratory must record more than a 40% reduction in reproduction.

At first glance, such a difference may appear relatively trivial. However, over the 5-year life of most NPDES permits, the beneficial effect is quite substantial (see Table 1).

Table 1: Probable Number of False Positives in 5 Years of Single Species Toxicity Tests

Test        Reported Endpoints       Pct. of All Dischargers w/ at Least One False Positive
Frequency   (# statistical ...)      95% Confidence         99% Confidence
Monthly     Acute Only (1)           95%                    45%
Monthly     Chronic Only (2)*        99%                    70%
Monthly     Acute & Chronic (3)      All but 1 in 10,000    83%
Quarterly   Acute Only (1)           64%                    18%
Quarterly   Chronic Only (2)         87%                    33%
Quarterly   Acute & Chronic (3)      95%                    45%

*Chronic tests have two endpoints: mortality and sub-lethal effects; both must be analyzed statistically.

Table 1 demonstrates that 64% of all dischargers running quarterly acute toxicity tests, using a 95% confidence level, will observe at least one false positive during the normal term of their permit. But only 18% of all dischargers running similar tests will record at least one false positive if a 99% confidence level is used when analyzing the exact same toxicity data. Note that, even at 99% confidence, 83% of all dischargers running monthly acute and chronic toxicity tests on a single species will report at least one false positive during a standard 5-year permit term.
This will occur regardless of how clean the effluent is or how well the bioassay laboratory performs. Table 1 reflects the level of unavoidable statistical error that will remain even after implementing all of the measures to minimize test variability recommended in EPA's most recent guidance (June, 2000).

If a discharger has reasonable potential to cause or contribute to an exceedance of instream standards, the permit must contain a limit for whole effluent toxicity. Since a single test failure is generally deemed to demonstrate reasonable potential, it is essential to account for the expected level of statistical error, intrinsic to the test method itself, when establishing criteria, setting permit limits, or evaluating compliance for WET.

"The allowable frequency for criteria excursions should refer to true excursions of the criteria, not to spurious excursions caused by analytical variability or error." Technical Support Document for Water Quality Based Toxics Control - Responsiveness Summary; May 9, 1991, pg
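The percentages in Table 1 can be reproduced with a short calculation. The sketch below assumes that every statistical analysis is independent, that the effluent is truly non-toxic, and that each analysis therefore has a false positive probability equal to alpha; the chance of at least one false positive in n analyses is then 1 - (1 - alpha)^n. The function and variable names are illustrative only.

```python
def chance_of_false_positive(alpha, analyses_per_test, tests_per_year, years=5):
    """Probability of at least one false positive over the permit term."""
    total_analyses = analyses_per_test * tests_per_year * years
    # Each analysis passes a non-toxic sample with probability (1 - alpha),
    # so all of them pass with probability (1 - alpha) ** total_analyses.
    return 1 - (1 - alpha) ** total_analyses


# (test frequency, reported endpoints, analyses per test, tests per year)
SCHEDULES = [
    ("Monthly", "Acute Only", 1, 12),
    ("Monthly", "Chronic Only", 2, 12),
    ("Monthly", "Acute & Chronic", 3, 12),
    ("Quarterly", "Acute Only", 1, 4),
    ("Quarterly", "Chronic Only", 2, 4),
    ("Quarterly", "Acute & Chronic", 3, 4),
]

if __name__ == "__main__":
    for frequency, endpoints, analyses, per_year in SCHEDULES:
        p95 = chance_of_false_positive(0.05, analyses, per_year)
        p99 = chance_of_false_positive(0.01, analyses, per_year)
        print(f"{frequency:<10}{endpoints:<17}{p95:>9.2%}{p99:>9.2%}")
```

Under these assumptions the results agree with Table 1 once the percentages are truncated to whole numbers. For example, 20 quarterly acute analyses give about 64% at α = 0.05 and about 18% at α = 0.01.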

Implementation

EPA does not specify a mandatory confidence level for statistical analyses performed as part of whole effluent toxicity tests. That decision is considered a state implementation issue:

"The toxicity test manuals provide extensive and complete guidance for determining the endpoints of each type of test. The interpretation and application of the test results are part of the implementation policy and are not addressed in this rulemaking... It is not always obvious that an effect level that is determined to be statistically significant is also biologically significant. The implied question, concerning the biological significance of (threshold) statistically significant occurrences of adverse biological effects observed in toxicity tests, is an implementation question, and is not addressed in this rulemaking." Whole Effluent Toxicity: Guidelines Establishing Test Procedures for the Analysis of Pollutants - Supplementary Information Document (SID). Oct. 2, p. 28 & 33

"Resorting to arbitrary values such as false rejection = 0.05, false acceptance= is not recommended. The circumstances of the investigation may allow for a less stringent option, or possibly a more stringent requirement" (p. 6-6) ... "The value of 0.01 should not be considered a prescriptive value for setting decision error rates, nor should it be considered EPA policy to encourage the use of any particular decision error rate." (p. 6-11) Guidance for the Data Quality Objectives Process EPA QA/G-4; EPA/600/R-96/005 (August, 2000).
"participants can select any method approved for an analyte when multiple methods are approved for the analyte" Availability, Adequacy, and Comparability of Testing Procedures for the Analysis of Pollutants Established Under Section 304(h) of the Federal Water Pollution Control Act (EPA/600/9-87/030); September, 1988; p

Because the whole effluent toxicity test methods do not mandate a specific confidence level, dischargers may elect to use either the 95% or 99% threshold unless their permit specifies otherwise. Dischargers should clarify the issue at the time the permit is issued by discussing it with their regulatory authority. In practice, each state will likely rely on EPA's guidance to select an appropriate confidence level. Dischargers should be prepared to justify a higher confidence threshold, and it may be necessary to increase the number of test organisms to maintain adequate test sensitivity.
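The trade-off between a stricter alpha and test sensitivity can be roughed out as follows. This is only a back-of-the-envelope sketch under stated assumptions, not the procedure in any EPA manual: it uses a one-sided normal approximation, ignores the t-distribution, multiple-comparison adjustments and statistical power, and assumes that the smallest difference a test can declare significant is proportional to the critical value divided by the square root of the number of replicates. The function names are hypothetical.

```python
from statistics import NormalDist


def critical_value_ratio(alpha_old=0.05, alpha_new=0.01):
    """How much the one-sided normal critical value grows when alpha is tightened."""
    standard_normal = NormalDist()
    return standard_normal.inv_cdf(1 - alpha_new) / standard_normal.inv_cdf(1 - alpha_old)


def replicate_multiplier(alpha_old=0.05, alpha_new=0.01):
    """Factor by which replicates must grow to keep the same detectable difference.

    Assumes the detectable difference scales as z / sqrt(n), so holding it
    constant requires n to scale with the square of the critical value ratio.
    """
    return critical_value_ratio(alpha_old, alpha_new) ** 2


if __name__ == "__main__":
    print(f"critical value ratio: {critical_value_ratio():.3f}")
    print(f"replicate multiplier: {replicate_multiplier():.2f}")
```

Under this approximation the detectable difference grows by roughly 41% when alpha moves from 0.05 to 0.01 with the same number of replicates, and roughly twice as many replicates would be needed to hold it constant. Actual requirements depend on the test design and on the variability observed in the laboratory.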

Justification (in the event your state is reluctant to allow a 99% confidence level)

1) The 99% confidence level is consistent with EPA's recommended approach for determining the presence or absence of chemical pollutants.

"Detection limit means the minimum concentration of an analyte (substance) that can be measured and reported with a 99% confidence that the analyte concentration is greater than zero as determined by the procedures set forth at appendix B of this part." 40 CFR

"This guidance recommends using 0.01 as the starting point for setting decision error rates. If the consequences of a decision error are not severe enough to warrant this stringent decision error limit, this value may be relaxed (a larger probability may be selected). However, if this limit is relaxed from a value of 0.01 for either the decision error rate at the Action Level or the other bound of the gray region, your planning team should document the rationale for relaxing the decision error rate. This rationale may include regulatory guidelines; potential impacts on cost, human health, and ecological conditions; and sociopolitical consequences." Guidance for the Data Quality Objectives Process EPA QA/G-4; EPA/600/R-96/005 (Aug. 2000) pg

"The generation of scientifically accurate and valid biological measurements for environmental pollutants requires approximately the same criteria for assessing the adequacy of a method as previously described for chemical analyses (precision, accuracy, comparability, representativeness & completeness). The same performance characteristics and development states of the method must be known in order to make an assessment of adequacy...as with chemical methods, variability is an essential criterion for assessing the adequacy of a test method. The mechanism within each method for making this assessment is the quality assurance/quality control (QA/QC) section...precision statements, detection limits, dynamic range, and inherent biological variability are intricately related." Availability, Adequacy, and Comparability of Testing Procedures for the Analysis of Pollutants Established Under Section 304(h) of the Federal Water Pollution Control Act (EPA/600/9-87/030); September, 1988; p

2) The 99% confidence level is consistent with EPA's recommended threshold for scientific certainty in other important permitting decisions.

EPA guidance for assessing reasonable potential recommends a 99% confidence level when setting a daily maximum limit (see Technical Support Document for Water Quality-based Toxics Control; March, 1991; pg. 110). EPA's guidance for calculating an interlaboratory detection level (IDL) for chemical pollutants also recommends a 99% confidence level (see Availability, Adequacy, and Comparability of Testing Procedures for the Analysis of Pollutants Established Under Section 304(h) of the Federal Water Pollution Control Act (EPA/600/9-87/030); September, 1988; p).

Historically, EPA has routinely used the 99% confidence level to define the boundaries of acceptable laboratory performance for toxicity tests performed as part of the annual DMR-QA studies. If it is reasonable to accept laboratory variability within a 99% error band, then it is also reasonable to allow permittees the same level of scientific certainty when using the same laboratory data to assess compliance with permit limits for WET.

For example, most states allocate mixing credits based on conservative assumptions about available dilution. Generally, the lowest 7-day flow expected to occur in 10 years (7Q10) is used to set WET limits. The 7Q10 condition is equivalent to roughly 99.8% confidence (7 days in 3,650 days). EPA's water quality criteria for chemical pollutants are usually set based on an assumption that aquatic organisms cannot tolerate more than one excursion of the instream water quality objective in a given 3-year period. This is equivalent to 99.4% confidence (1 week in 156 weeks).

3) The 99% confidence level is necessary to offset our inherent inability to validate the accuracy of toxicity test results.

According to EPA:

"The term 'accuracy' means the nearness of a measurement to its real, or true, value...an accurate result agrees closely with the real value. The closer the result to the real value, the more accurate the result." NPDES Permit Writer's Guide to Data Quality Objectives; Nov., 1990; p. 1-7

When EPA promulgated the WET testing methods under 40 CFR 136, the agency warned that:

"The accuracy of toxicity tests cannot be determined." Short-Term Methods for Estimating the Chronic Toxicity of Effluents and Receiving Water to Freshwater Organisms; EPA (July, 1994); Section , pg. 49 & Section , pg

"Accuracy of toxicity test results cannot be ascertained, only the precision of toxicity can be estimated..." Federal Register; Vol. 60, No. 199; Oct. 16, 1995; p

"It should be noted here that the dilution factor selected for a test determines the width of the No-Observed-Effect-Concentration and the Lowest Observed Effect Concentration Interval and the inherent maximum precision of the test...with a dilution factor of 0.5, the NOEC could be considered to have a relative variability of plus or minus 100%." Short-Term Methods for Estimating Chronic Toxicity of Effluents and Receiving Water to Freshwater Organisms (EPA ); July, 1994; Section ; p

NPDES permit limits and water quality criteria must account for the inherent analytical variability associated with any standard method, including toxicity test procedures.

"The precision of toxicity measurements is similar to that of finely tuned instruments operating at detection limits. The users of biological methods must account for the inherent variability in response. Typically for toxicity test methods, this means using replicate exposures at each concentration and running parallel tests with each sample or batch of test organisms using a standardized toxicant so that the 'health' or sensitivity of the test organisms can be independently measured. It also means that the natural variability in sensitivity will have to be accounted for. More importantly, this variability must also be accounted for when permit limits, criteria, or standards are set." NPDES Permit Writer's Guide to Data Quality Objectives; November, 1990; p

"To assess the precision of biological tests, the EPA report (600/9-87/030, 1988) indicated that the methods must account for inherent variability of response and natural variability of within-species sensitivity." Whole Effluent Toxicity: Guidelines Establishing Test Procedures for the Analysis of Pollutants - Supplementary Information Document (SID); Oct. 2, 1995; p

EPA frequently claims the precision of WET tests is similar to that found for many other chemical analysis methods. Assuming that is true, such variability is normally addressed for chemical methods by adopting method detection levels (MDLs) or Practical Quantitation Levels (PQLs). However, EPA has also stated that it is not possible to develop MDLs or PQLs for toxicity test methods. Without an MDL or PQL, the inherent variability of WET test methods is more likely to result in a decision error than is the case when results from chemical analyses are used to assess compliance with permit limits. This is because toxicity is one of the few pollutant parameters for which there is zero tolerance.
If toxicity is detected at the instream waste concentration, a permit violation has occurred. Very few chemical pollutants have an effective limit of zero. And, even if they did, the MDL/PQL approach would prevent most compliance errors caused solely by analytical variability. WET tests may have precision profiles similar to chemical analyses, but the two methods are not treated the same when certifying results on the monthly Discharge Monitoring Report (DMR).

Since permit violations can be the subject of criminal prosecution under the Clean Water Act, and such violations impose strict liability on the discharger, it is appropriate to require evidence that establishes guilt beyond a reasonable doubt. The 99% confidence level most closely approximates the judicial standard for criminal culpability.

2001, Risk Sciences