Estimation of Relative Potency from Bioassay Data that Include Values below the Limit of Quantitation


FRANCIS BURSA, KELLY J. FLEETWOOD, KARIE J. HIRST, AND ANN YELLOWLEES

Experimental data may include values below the limit of detection or limit of quantitation. Such values are often replaced with a fixed value, such as half the limit of quantitation, or even omitted entirely. Either approach can bias the results and, in the case of parallel line bioassay, lead to unnecessary assay failure. This article demonstrates that a better statistical technique, Tobit analysis, can account properly for values that are below the limit of quantitation. Specifically, this article focuses on the analysis of bioassay data. We apply Tobit analysis to simulated bioassay data and show that this method leads to a higher probability that an assay will correctly pass parallelism testing, when compared with the commonly used substitution method. We conclude that Tobit analysis should be the method of choice for the analysis of bioassays containing values below the limit of quantitation.

Keywords: bioassay, relative potency, statistical analysis, limit of quantitation (LOQ), limit of detection (LOD), censoring, Tobit analysis

Biological assays may be based on quantitative responses such as the level of antibodies expressed in animal serum. Such responses may be subject to a limit of quantitation (LOQ), that is, the lowest concentration at which the analyte can be reliably quantified, or a limit of detection (LOD), that is, the minimal amount of the analyte that can be distinguished from background (Armbruster and Pry 2008). In cases in which there is sufficient doubt about the measured responses (i.e., in which the response is so low that a result cannot reliably be either measured or reported), it is common practice to replace the unknown value with some fraction, typically one-half or one-third, of the LOQ. This is known as substitution.
For example, the acellular pertussis vaccine immunopotency assay uses the immune response of groups of mice to estimate the relative potency of vaccine lots. If the measured antibody content is below the LOQ, then the recommendation is that the value be replaced by one-half of the LOQ (WHO 2013). The substitution method has historically been used in many areas of science, including environmental chemistry, microbiology, and pharmacokinetic and pharmacodynamic modelling (Beal 2001, Helsel 2006, Lorimer and Kiermeier 2007). However, it is well documented that the use of the substitution method may produce biased estimates of the mean of a group of values and may underestimate or overestimate their variance (Helsel 2006, Lorimer and Kiermeier 2007). It has been strongly criticized by a number of authors, including Helsel (2006) in environmental chemistry, Lorimer and Kiermeier (2007) in microbiology, and Lynn (2001) in bioassay. In general, these authors recommend the use of maximum likelihood estimation, as explored in this article. Despite this, the substitution method is still the de facto standard used in bioassay. Neither the European Pharmacopoeia chapter on the statistical analysis of bioassays (Council of Europe 2015) nor any of the United States Pharmacopeia chapters on bioassays (USP 2012) provides any guidance on the issue.

Maximum likelihood estimation methods for data including values only recorded as less than (or greater than) a limiting value have been widely adopted in various scientific fields, especially clinical medicine, and are available in many standard statistical analysis packages. Data only recorded as below (or above) a limit are censored in the sense that their value is only partially known: measurements below the LOQ have a value that lies between zero and the limit, but the exact value is unknown.
Values are said to be right censored when they are known to be greater than a particular value, and left censored when they are known to be less than a particular value. For example, in clinical studies of cancer treatments, patients who are still alive at the end of the trial have right-censored data because the patient is only known to have survived for at least the duration of the trial. Methods for handling right-censored data are often collectively referred to as survival analysis because they are used to investigate survival times (Kalbfleisch and Prentice 2002). Those applied to left-censored data are often referred to as Tobit analysis, after James Tobin, who described the use of these techniques in econometrics (Tobin 1958).

BioScience 66: The Author(s) Published by Oxford University Press on behalf of the American Institute of Biological Sciences. All rights reserved. For Permissions, please journals.permissions@oup.com. doi: /biosci/biw126 Advance Access publication 8 October November 2016 / Vol. 66 No. 11 BioScience 983

Tobit analysis finds the parameters that maximize the likelihood of the observed responses, including both those above and below the limit. For normally distributed data, the likelihood is the following:

\[
\text{Likelihood} = \prod_{\text{uncensored observations}} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right) \times \prod_{\text{censored observations}} \Phi\!\left(\frac{\mathrm{LOQ} - \hat{y}_i}{\sigma}\right)
\]

Here $y_i$ is the ith response, $\hat{y}_i$ is the ith fitted value (which is a function of the fit parameters), $\sigma$ is the standard deviation, and $\Phi$ is the cumulative distribution function of the standard normal distribution.

Handling censored values is a particular problem in the case of bioassays in which a series of doses are tested. Here, the likely true value of a censored data point varies depending on the dose. Figure 1 illustrates this situation with simulated data for an assay with four doses and a limit of quantitation (LOQ) of 1. A decision has been made to substitute all values that are recorded only as below the limit of 1 with the value 0.5 (i.e., half the limit). The mean responses to the assay are illustrated by the diagonal line. Individual responses were assumed to be normally distributed on the log scale (an assumption underlying all analyses of quantitative bioassay data; USP 2012, Council of Europe 2015).

Figure 1. Example distributions of responses across a range of doses. The curved lines show the distributions of the responses for each dose, and the shaded areas indicate where the distributions fall below the limit. The simulated responses are illustrated by circles. Solid circles indicate observed values, and hollow circles indicate censored values. (Legend: above LOQ; below LOQ; substituted value; expected value of response below LOQ.)
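The two quantities discussed in this passage can be sketched in a few lines of Python: the log of the Tobit likelihood for normally distributed data, and the expected value of a response that is known only to lie below the limit (the quantity labelled "expected value of response below LOQ" in figure 1). This is a minimal illustration using only the standard library, not the authors' implementation; the function and argument names are hypothetical, and the limit is assumed to be on the same (log) scale as the responses.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf() and cdf()


def tobit_log_likelihood(y, fitted, censored, limit, sigma):
    """Log-likelihood for left-censored, normally distributed responses.

    y        : responses (entries for censored observations are ignored)
    fitted   : fitted values, one per observation
    censored : True where the response is known only to be below `limit`
    limit    : censoring limit (the LOQ), on the same scale as y
    sigma    : residual standard deviation
    """
    total = 0.0
    for y_i, fit_i, is_censored in zip(y, fitted, censored):
        if is_censored:
            # Censored term: log Phi((LOQ - fitted value) / sigma)
            total += math.log(STD_NORMAL.cdf((limit - fit_i) / sigma))
        else:
            # Uncensored term: log of the normal density at y_i
            z = (y_i - fit_i) / sigma
            total += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total


def expected_value_below_limit(mean, sigma, limit):
    """E[Y | Y < limit] for Y ~ Normal(mean, sigma).

    Standard truncated-normal result: mean - sigma * phi(a) / Phi(a),
    with a = (limit - mean) / sigma.
    """
    a = (limit - mean) / sigma
    return mean - sigma * STD_NORMAL.pdf(a) / STD_NORMAL.cdf(a)
```

Because the conditional expectation depends on the group mean, it moves with dose, which is why a single substituted value cannot suit every dose group. A Tobit fit would maximize `tobit_log_likelihood` over the fit parameters and `sigma` with a numerical optimizer, which is omitted here.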
The hollow circles represent unknown values; we are assuming here that these values are only recorded as less than the given limit. In the analysis, they have been substituted with the value 0.5; the substituted values are shown as hollow triangles (to facilitate visualization, the hollow triangles are laid out to the right of the dose). Clearly, 0.5 is not a good choice for the substitution in this example, because it is below the true (but unknown) values of all the censored observations (hollow circles), moderately so at the lowest dose and more so at the higher doses. In general, a reasonable value for a substitution would be the expected value of the response, knowing that it is less than the LOQ. This value can be calculated and is also shown in the figure, per dose group, as a solid triangle. The value depends on the true mean value for the dose group, and therefore a common value across all the dose groups cannot logically be used for substitution in this situation.

In this article, we aim to show that Tobit analysis is an appropriate method for the analysis of relative potency (RP) bioassay data in which values below the LOQ may occur. It is important to note that we assume throughout that these values below the LOQ are really unknown. If a value has been recorded but discarded as not sufficiently accurate, it may be that the recorded value should be used, possibly with some scheme for down-weighting, but we do not explore that approach here. First we describe the assay scenario and how the RP is defined. We then simulate a range of assays and compare the substitution and Tobit approaches in terms of their impact on the estimation of the RP.

Methods

The relative potency (RP) of one sample compared with another is defined as the ratio of equally effective doses for the samples (Finney 1964), and RP bioassays are designed to measure the potency of a test batch of material relative to a reference standard.
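For the linear log dose-response models considered later in this article, the definition of RP as a ratio of equally effective doses can be written out explicitly. The following is the standard parallel-line result; the symbols ($x$ for log dose, $a_S$ and $a_T$ for the reference and test intercepts, $b$ for the common slope, and $\rho$ for the RP) are notation assumed here for illustration and do not appear in the original text.

```latex
% Parallel lines for the reference (S) and test (T) preparations, x = log dose
\begin{align*}
  y_S(x) &= a_S + b\,x, & y_T(x) &= a_T + b\,x .
\end{align*}
% A dose z of the test preparation acts like a dose \rho z of the reference, so
\begin{align*}
  y_T(x) &= y_S(x + \log\rho) = a_S + b\,(x + \log\rho)
  \quad\Longrightarrow\quad
  \log\rho = \frac{a_T - a_S}{b}, \qquad
  \rho = \exp\!\left(\frac{a_T - a_S}{b}\right).
\end{align*}
```

With a positive slope, a test line lying below the reference line ($a_T < a_S$) therefore corresponds to an RP below 100%.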
Sets of dilutions of standard and test materials are administered to cells or animals, and the concentration (dose)-response relationships are used to estimate potency. It is only appropriate to measure the RP of similar preparations, that is, test preparations that behave as dilutions of the standard. The similarity and relative potency of a standard and test preparation can be evaluated by fitting a dose-response model to the data. For each preparation, such a model describes the relationship between the response and log dose. Details of standard models for bioassay data are readily available: for example, see chapter 5.3 of the European Pharmacopoeia (Council of Europe 2015), the United States Pharmacopeia (2012) general chapter 1034, or Finney (1964). In this article, we focus on linear models for the log dose-response relationship, although the methods are equally applicable to nonlinear models such as the four-parameter logistic (Deming 2015).

Table 1. Simulation parameters (examples for a subset of RP values). For each dose group, the table lists the standard deviation of the log response and, for the reference and for the test at RP = 100%, 80%, 60%, and 40%, the mean of the log response and the mean on the raw scale.

If the preparations are similar, then the linear log dose-response models for the two preparations will be approximately parallel, and the test preparation will appear to be a simple displacement of the reference along the log dose axis. Parallelism can be evaluated in several ways (Fleetwood et al. 2015); in this article, we measure parallelism by calculating a 95% confidence interval for the difference between slopes and comparing this with predefined limits (the equivalence approach).

We explored three approaches to fitting the linear log dose-response models in which a decision has been made not to rely on reported values less than the LOQ but instead to censor them at the LOQ:

(1) Substitution with LOQ/2; ordinary linear regression: values below the LOQ were replaced by one-half of the LOQ.
(2) Substitution with LOQ/3; ordinary linear regression: values below the LOQ were replaced by one-third of the LOQ.
(3) Tobit analysis: maximum likelihood was used to fit a linear model to the data, treating the values below the LOQ as censored between 0 and the LOQ.
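The three approaches listed above differ only in how a response below the LOQ enters the fit. The sketch below prepares the log responses for each approach. It is an illustration under stated assumptions rather than the authors' code: the function name, the dictionary keys, and the use of `None` to mark a value recorded only as "less than the LOQ" are all hypothetical.

```python
import math


def prepare_log_responses(values, loq):
    """Prepare log responses for the two substitution fits and the Tobit fit.

    values : raw-scale responses; None marks a value recorded only as '< LOQ'
    loq    : limit of quantitation on the raw scale
    """
    # A response is treated as censored if it is missing or below the LOQ
    censored = [v is None or v < loq for v in values]

    def substituted(fraction):
        # Replace each censored response by log(fraction * LOQ)
        return [
            math.log(fraction * loq) if c else math.log(v)
            for v, c in zip(values, censored)
        ]

    return {
        "loq_half": substituted(1.0 / 2.0),    # approach (1): ordinary regression
        "loq_third": substituted(1.0 / 3.0),   # approach (2): ordinary regression
        # Approach (3): keep the limit itself plus a flag, so that a Tobit fit
        # can treat the response as lying anywhere between 0 and the LOQ
        "tobit_limit_or_value": [
            math.log(loq) if c else math.log(v) for v, c in zip(values, censored)
        ],
        "censored": censored,
    }
```

The substitution outputs can be passed directly to ordinary least squares, whereas the Tobit output must go to a censored-regression routine that uses the flags.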
Data

We used simulation to compare the performance of the substitution methods and Tobit analysis as applied to bioassay data with values censored if they lie below the LOQ. The methods were compared according to how well they evaluated parallelism and estimated the true underlying RP. The simulated assays were based on the standard approach for a parallel line in vivo assay (such as immunopotency), which includes parallelism assessment.

The simulated bioassays compared a reference preparation with a test preparation using four doses of each preparation. The doses were equally spaced on the log scale. Each of the four doses per preparation was assumed to have been administered to a group of 10 animals. The responses were normally distributed on the log scale, with a fixed standard deviation (listed in table 1). This value was chosen so that when there was very little censoring (up to 1%), the parallelism test (see below) would pass nearly all the time (90% or more), as is likely to be the case in practice. The test line was parallel to the reference line. The LOQ (on the original scale of measurement) was set to 1, and simulated values less than the LOQ were censored and recorded only as less than the LOQ.

The reference line was fixed, and RP values ranging from 100% down to 40% were simulated. Thirty RP values were used, uniformly spaced on the log scale. Table 1 shows the parameters used in the simulations for 4 of the 30 RP values for illustration (100%, 80%, 60%, 40%). For each RP value, 100,000 simulations of the reference-test pair were generated.

The first panel in figure 2 illustrates the simulation of one reference-test pair in which RP = 40%. The mean responses for the reference and test are illustrated by the diagonal lines. The LOQ (set at 1) is shown by the dotted horizontal line. The curved lines show the distributions of the responses for each line and dose, and the shaded areas indicate where the distributions fall below the LOQ.
For the reference line, the probability of a response below the LOQ is low. However, for the lower doses of the test line, the probability is much higher. As the true potency decreases, the test line will become lower, and the probability that a response at one of the lower doses is below the LOQ will become even higher. The simulated responses for one simulation are illustrated by circles; solid circles indicate observed values, and hollow circles indicate censored values. The hollow values would be recorded as less than the LOQ in practice.
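The simulation design described in this section can be sketched as follows. The structure (a reference and a test preparation, four log-spaced doses, 10 animals per dose, normal responses on the log scale, LOQ of 1, test line shifted by log RP along the log dose axis) follows the text. The numerical values of the intercept, slope, doses, and standard deviation are hypothetical placeholders, because the values the authors used are not given here; the function names are likewise illustrative.

```python
import math
import random
from statistics import NormalDist


def simulate_pair(intercept, slope, log_doses, sigma, rp, n_per_dose, loq, rng):
    """Simulate one reference-test pair with left-censoring at the LOQ."""
    log_loq = math.log(loq)
    records = []
    # The test line is the reference line displaced by log(RP) on the log dose axis
    for prep, shift in (("reference", 0.0), ("test", math.log(rp))):
        for x in log_doses:
            mean = intercept + slope * (x + shift)
            for _ in range(n_per_dose):
                y = rng.gauss(mean, sigma)
                is_censored = y < log_loq
                records.append({
                    "prep": prep,
                    "log_dose": x,
                    # A censored response is recorded only as 'below the LOQ'
                    "log_response": None if is_censored else y,
                    "censored": is_censored,
                })
    return records


def censoring_probability(mean, sigma, loq):
    """Probability that a log response with this mean falls below log(LOQ)."""
    return NormalDist(mean, sigma).cdf(math.log(loq))


# Hypothetical parameter values, chosen only for illustration
LOG_DOSES = [math.log(d) for d in (1.0, 2.0, 4.0, 8.0)]  # equally spaced on log scale
pair = simulate_pair(intercept=1.0, slope=1.0, log_doses=LOG_DOSES, sigma=0.5,
                     rp=0.4, n_per_dose=10, loq=1.0, rng=random.Random(1))
```

`censoring_probability` gives the shaded area below the LOQ for one dose group; with a positive slope it is largest at the lowest dose of the test line and grows as the RP decreases.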

For each simulation of a reference-test pair, the three analysis methods were applied. In all cases, the parallelism test was an equivalence test: an assay passed the test if the 95% confidence interval for the slope difference lay entirely within the interval -1 to 1. (Note: as the assays were simulated to be parallel, the correct result is a pass in all cases.) We compared the three methods in terms of the following: (a) the proportion passing the parallelism test; (b) the accuracy of the RP estimate, defined as the mean difference between the true RP and the fitted RP; and (c) the precision of the RP estimate, defined as the mean width of the 95% confidence interval (CI) for the log(RP). All analysis was conducted using R (R Development Core Team 2005), using the survreg function in the survival package (Therneau 2015).

The remaining panels in figure 2 illustrate the three analysis methods. The hollow triangles show the location of the substituted values. For the simulated data in the top left panel, the top right panel shows the result of substituting one-half of the LOQ for each censored value. The bottom left panel shows the effect of substituting one-third of the LOQ. The bottom right panel illustrates the Tobit fit. The impact on the slope of the test line can be clearly seen.

Figure 2. Distributions of simulated responses: RP = 40%, with estimated linear and Tobit models. (Panels: simulated data and true underlying distributions; linear fits with substitution at LOQ/2; linear fits with substitution at LOQ/3; Tobit fits. Legend: test response below LOQ; test substituted at LOQ/2; test substituted at LOQ/3; censored test response.)
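The equivalence test for parallelism described above can be sketched as follows. The pass rule (the 95% confidence interval for the slope difference must lie entirely within -1 to 1) is taken from the text. How the interval itself was computed is not stated, so this sketch assumes independent slope estimates from separate fits and a normal-approximation critical value; both are assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical value under a normal approximation (an assumption;
# a t-distribution quantile could be passed in instead)
Z_975 = NormalDist().inv_cdf(0.975)


def passes_parallelism(slope_ref, se_ref, slope_test, se_test,
                       crit=Z_975, lower=-1.0, upper=1.0):
    """Equivalence test on the difference between test and reference slopes.

    Returns (passed, (ci_low, ci_high)).
    """
    diff = slope_test - slope_ref
    # Standard error of a difference of two independent estimates
    se_diff = math.sqrt(se_ref ** 2 + se_test ** 2)
    ci_low = diff - crit * se_diff
    ci_high = diff + crit * se_diff
    # Pass only if the whole interval lies inside the equivalence limits
    passed = lower < ci_low and ci_high < upper
    return passed, (ci_low, ci_high)
```

An assay can fail this test either because the estimated slope difference is large (the bias introduced by substitution) or because the interval is wide (the extra uncertainty that censoring brings to the Tobit fit).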

Results

An example simulation, with the fitted models from the three methods overlaid, is shown in figure 3. Here the true RP is 40%. No censored values are seen in the reference line, but the test line has seven at the lowest dose and one at each of the next two doses. The test for parallelism failed (incorrectly) for both the substitution methods but passed (correctly) for the Tobit analysis. Note that for the reference lines, the three fits are identical, because in this example there were no censored values in the reference line.

Figure 3. A comparison of the fits shown in figure 2.

Figure 4 shows the percentage (over all simulations) passing the parallelism test, as a function of RP. The percentage of values that were censored (across both test and reference lines) is directly related to the RP and is shown along the top of the graph; it increases to the right, from essentially none when the true RP is 100% to 13% (corresponding to about 10 responses below the LOQ) when the true RP is 40%. (Note, however, that the specific relationship is influenced by the underlying model and variability.) The percentage of values censored for the reference preparation is equivalent to the 100% RP figure, that is, almost nil.

When the number of censored values is small (up to about 1%, corresponding to RP values down to about 80% in this particular scenario), all methods have a pass rate of at least 90% for the parallelism test. As the percentage of censored values increases (with decreasing RP), the pass rates for both the substitution methods decrease substantially. This happens because, for the parameters we have used in our simulations, most censored values occur at the lowest dose of the test line and are only slightly below the LOQ, so the substituted value of LOQ/2 or LOQ/3 is too low, which steepens the slope of the fitted test line. This bias decreases the probability that the assay will pass the parallelism test.
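The steepening of the fitted slope under substitution can be checked with a short calculation. Because the least-squares slope is linear in the responses, its expected value can be obtained by fitting a line through the expected responses after substitution. The sketch below does this for a single line with hypothetical parameters (four log doses, true slope 1, standard deviation 0.5, LOQ of 1); these numbers are illustrative assumptions, not the values used in the simulations reported here.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_log_response_after_substitution(mean, sigma, log_loq, log_substitute):
    """Expected log response when values below the LOQ are replaced by a constant."""
    a = (log_loq - mean) / sigma
    p_below = STD_NORMAL.cdf(a)
    # (1 - p_below) * E[Y | Y >= limit] = mean * (1 - p_below) + sigma * phi(a)
    above_part = mean * (1.0 - p_below) + sigma * STD_NORMAL.pdf(a)
    return p_below * log_substitute + above_part


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical line: censoring occurs mostly at the lowest dose
LOG_DOSES = [0.0, 1.0, 2.0, 3.0]
INTERCEPT, SLOPE, SIGMA = 0.2, 1.0, 0.5
LOG_LOQ = math.log(1.0)


def expected_slope(fraction_of_loq):
    """Expected fitted slope when censored values are set to fraction * LOQ."""
    log_sub = LOG_LOQ + math.log(fraction_of_loq)
    expected = [
        expected_log_response_after_substitution(INTERCEPT + SLOPE * x, SIGMA,
                                                 LOG_LOQ, log_sub)
        for x in LOG_DOSES
    ]
    return ols_slope(LOG_DOSES, expected)


slope_half = expected_slope(1.0 / 2.0)
slope_third = expected_slope(1.0 / 3.0)
```

Under these assumed parameters both expected slopes exceed the true slope of 1, and LOQ/3 gives the steeper line, in keeping with the ordering of the pass rates for the two substitution methods.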
In the scenario simulated here, the pass probability falls below 25% for the LOQ/3 substitution and below 75% for the LOQ/2 substitution. For the Tobit analysis, the situation is simpler: this method estimates the slopes correctly on average, but as the number of censored values increases, the confidence interval for the test slope widens. This means it is more likely that the confidence interval for the difference in slopes will extend outside the equivalence interval of (-1, +1), slightly decreasing the parallelism pass rate. For the Tobit method, the failure rate was less than 5% for RP values down to 41%, corresponding to a percentage of censored values up to 24% in the test sample (or 12% across the whole assay).

Figure 3 (legend): true dose response; fit with substitution at LOQ/2; fit with substitution at LOQ/3; Tobit fit.

Figure 4. Percentage passing parallelism. (Top axis: percentage of values below the LOQ. Series: substitution at LOQ/2; substitution at LOQ/3; Tobit.)

Figure 5. Accuracy of RP estimate. (Vertical axis: mean RP estimate. Top axis: percentage of values below the LOQ. Series: substitution at LOQ/2; substitution at LOQ/3; Tobit.)

Figure 6. Precision of RP estimate. (Vertical axis: mean width of CI for log(RP). Top axis: percentage of values below the LOQ. Series: substitution at LOQ/2; substitution at LOQ/3; Tobit.)

Figure 5 shows the average estimate of RP plotted against the true (simulated) RP, providing a comparison of the bias of the estimate for the methods. Figure 6 shows the average width of the 95% confidence interval for the log(RP) estimate, again plotted against the true RP value. This provides a comparison of the precision of the estimate across the three methods. All three methods have similar accuracy and precision within the range of RP examined. This happens because the RP is actually calculated from a forced parallel fit, and the parallel fit is less influenced by the presence of values below the LOQ than are the separate model fits used to test for parallelism. There is a small effect on the precision: the Tobit method is slightly more precise at low RPs. The behavior is similar for other assay characteristics (not shown), such as the accuracy and precision of the reference and test intercepts or the proportion of confidence intervals for the RPs that contain the true value (the coverage).

Table 2 shows summary statistics, for reference, for the three parameters illustrated in the graphs. The most striking differences are in the pass rates for the parallelism test. Where the RP is below 60%, more than 25% of cases fail the test if LOQ/3 is substituted and more than 10% of cases fail when LOQ/2 is substituted, whereas less than 5% of cases fail when Tobit analysis is used.

Discussion

We have explored three approaches to the analysis of data sets that include values below the LOQ: substitution with LOQ/2, substitution with LOQ/3, and Tobit analysis.
It is well established that substitution methods may produce biased estimates of means and variances (Helsel 2006, Lorimer and Kiermeier 2007). In this article, we used a simulation study to explore how these methods handle bioassay data with values below the LOQ. In our example, all three methods gave reasonable estimates of the relative potency, in terms of both accuracy and precision. However, the substitution methods led to high percentages of unnecessary failures of the test for parallelism, in contrast to the Tobit method, whose failure rate was less than 5%, even when the percentage of censored values was as high as 20% in the test sample (or 10% across the whole assay). In many companies and laboratories, Standard Operating Procedures dictate that a failure of the parallelism test means that the assay must be repeated, leading to increased use of resources, lost time and increased cost. Of course, in this situation, the repeat assay is quite likely to fail again.

Table 2. Simulation results at illustrative RP values.

RP      Percentage passing equivalence    Mean RP estimate             Mean width of 95% CI
        test for parallelism                                           for log(RP)
        LOQ/2    LOQ/3    Tobit           LOQ/2     LOQ/3     Tobit    LOQ/2    LOQ/3    Tobit
100%    96.0%    94.9%    96.9%           100.0%    100.0%    100.0%
80%     94.7%    91.6%    96.8%           79.7%     79.6%     79.8%
60%     89.1%    75.0%    96.6%           59.9%     59.9%     59.9%
40%     73.7%    23.8%    94.7%           40.2%     41.1%     39.4%

Although it may be possible to choose a value to substitute for censored data in the case of a single dose group, when the assay response is related to dose (as in a relative potency assay), there is no substitution value that is ideal for all the dose groups. The single value will be too low for some groups, too high for others, and just right for at most one group.

The arguments we have presented in this article are generalizable; they are not restricted to the scenarios chosen for the simulations, to the choice of parallelism test, or to parallel-line models. Tobit analysis can be applied to other bioassay models, such as slope ratio models and four- or five-parameter logistic models. However, the specific values of the RP that correspond to particular rates of censored values or to particular values of precision, accuracy, or probability of passing the parallelism test all depend on the specific assay.

Our conclusions are also equally applicable to other types of experiments that may have values that are reported only as less than a given limit. It should be noted, however, that if a value has been recorded but discarded as not sufficiently accurate, it may be preferable and easier to use the recorded value, possibly with some scheme for down-weighting; this could be a topic for future research.

Our aim was to alert scientists to the fact that substitution methods are often not optimal for handling censored values and may lead to unnecessary conclusions of experimental or assay failure, whereas Tobit analysis works well.
Therefore, in cases in which there is a potential for values to be reported as less than a particular limit, Tobit analysis should be considered. The method has clear benefits for bioassay, is widely used in other areas of research, and is available in standard statistical software packages.

References cited

Armbruster DA, Pry T. 2008. Limit of blank, limit of detection and limit of quantitation. Clinical Biochemist Reviews 29: S49-S52.
Beal SL. 2001. Ways to fit a PK model with some data below the quantification limit. Journal of Pharmacokinetics and Pharmacodynamics 28:
Council of Europe. 2015. Statistical analysis of results of biological assays and tests. Pages in European Pharmacopoeia Commission. European Pharmacopoeia, 8th ed. Council of Europe.
Deming SN. 2015. The 4PL. Statistical Designs.
Finney DJ. 1964. Statistical Method in Biological Assay. Charles Griffin.
Fleetwood K, Bursa F, Yellowlees A. 2015. Parallelism in practice: Approaches to parallelism in bioassays. PDA Journal of Pharmaceutical Science and Technology 69:
Helsel DR. 2006. Fabricating data: How substituting values for nondetects can ruin results, and what can be done about it. Chemosphere 65:
Kalbfleisch JD, Prentice RL. 2002. The Statistical Analysis of Failure Time Data, 2nd ed. Wiley.
Lorimer M, Kiermeier A. 2007. Analysing microbiological data: Tobit or not Tobit? International Journal of Food Microbiology 116:
R Development Core Team. 2005. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. (24 August 2016)
Tobin J. 1958. Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society 26:
[USP] US Pharmacopeial Convention. 2012. Analysis of biological assays. Pages in USP. First Supplement to US Pharmacopeia 35: National Formulary 30. USP Convention. (24 August 2016; www.drugfuture.com/pharmacopoeia/usp35/pdf/ %20[1034]%20 Analysis%20of%20Biological%20Assays.pdf)
[WHO] World Health Organization. 2013. Recommendations to assure the quality, safety and efficacy of acellular pertussis vaccines. Pages in WHO. WHO Expert Committee on Biological Standardization. WHO. Technical Report Series no