Calibration Verification Performance Relates to Proficiency Testing Performance

Martin H. Kroll, MD; Patricia E. Styer, PhD; Delmiro Anthony Vasquez, MT(ASCP)

Context. Since 1988, the College of American Pathologists has been offering materials for calibration verification coupled with the surveys for linearity, called the linearity (LN) surveys.

Objective. To determine whether successful completion of the College of American Pathologists LN surveys provides a benefit in terms of improved proficiency testing (PT) performance.

Design. In this study, we used information from LN surveys [...], LN3, and [...] and from the PT surveys [...], Z, and [...], administered and analyzed in the year [...]. For the PT data, we calculated 4 measures of performance: passing PT, results exceeding 2 SDs, sum of absolute SD intervals, and the absolute sum of SD intervals. For the LN data, we classified laboratories as participants versus nonparticipants in LN surveys and by whether or not LN survey performance was successful.

Results. LN enrollees had fewer unacceptable PT results than did nonenrollees. Additionally, for many analytes there was a significant positive association between LN performance and PT performance.

Conclusions. For most analytes studied, there was strong evidence linking performance on PT surveys with performance on LN surveys. Eight of 13 analyses (62%) demonstrated improved performance with successful calibration verification.

(Arch Pathol Lab Med. 2004;128)

Federal regulations (Clinical Laboratory Improvement Amendments of 1988) specifically state that calibration verification, that is, the determination of analyte in materials composed of a matrix similar to that of patient samples, be performed every 6 months, at every change in lot numbers of reagents, after major preventive maintenance, when controls show unusual trends or are out of acceptable limits, or more frequently if recommended by the manufacturer or laboratory.
Sometimes, the term validation of the analytical measurement range is used instead of the term calibration verification. Calibration verification must cover the analytical measurement range, thereby requiring that the concentrations or activities being evaluated include the lowest and the highest values of this range. Since 1988, the College of American Pathologists (CAP) has been offering materials for calibration verification coupled with the surveys for linearity, called the linearity (LN) surveys. The concentrations or activities evaluated in these surveys span most reportable ranges and exceed the ranges found for proficiency testing (PT), as offered by the CAP, for most analytes. Calibration verification in the LN surveys determines whether the instruments or methods are properly calibrated against those of a peer group. At least 4 of 5 consecutive solution averages must fall within specified limits derived from the analytical goal for error for that analyte. Successful completion of calibration verification surveys validates the analytical measurement range. We investigated whether successful calibration verification also improves results of PT.

Accepted for publication December 8, [...].
From the Department of Clinical Chemistry, Dallas VA Medical Center, Dallas, Tex (Dr Kroll); the College of American Pathologists, Northfield, Ill (Dr Styer); and Spectrum 3 Inc, Miami, Fla (Mr Vasquez).
The authors have no relevant financial interest in the products or companies described in this article.
Reprints: Martin H. Kroll, MD, Department of Clinical Chemistry, Dallas VA Medical Center, Room 113, 4500 Lancaster Rd, Dallas, TX (e-mail: martin.kroll@med.va.gov).

MATERIALS AND METHODS

In this study, we used information from LN surveys [...], LN3, and [...] and from the PT surveys [...], Z, and [...]. We studied the analytes sodium, potassium, glucose, albumin, iron, creatinine, alanine aminotransferase, digoxin, carcinoembryonic antigen, cortisol, human chorionic gonadotropin, folate, and vitamin B12.
All surveys had been administered and analyzed in the year [...]. Table 1 lists the details of these CAP surveys, including the number of participants. For the PT data, we calculated 4 measures of performance for each of the analytes listed in Table 1. For the LN data, we grouped laboratories into a simple 2-stage hierarchy for each analyte. In this hierarchy, laboratories were first classified as either participants or nonparticipants based on LN enrollment. Among those enrolled in both PT and LN surveys, we further classified laboratories into 2 groups based on LN performance: successful (the laboratory met the criteria for validation of the analytical measurement range) and unsuccessful (the laboratory's results fell outside of those limits).

Measuring PT Performance

We evaluated PT performance using the grading rules mandated by the Clinical Laboratory Improvement Amendments for each analyte and using additional measures intended to identify laboratories that may be at risk for failing PT. Our 4 PT performance variables are defined as follows.

Presence of Unacceptable PT Results. The first measure of laboratory PT performance is a binary variable indicating whether there are any unacceptable PT results in a single mailing for each analyte examined in this study, using the PT grading rules. Typically, a laboratory completes 5 challenges per analyte, and the grading rules determine an interval of acceptable results

544 Arch Pathol Lab Med, Vol 128, May 2004. Calibration Verification. Kroll et al

based on the peer group mean and in some instances the peer group SD. The binary variable defined as the presence of unacceptable results is 1 for the ith laboratory when any results fall outside of the grading interval for that analyte and 0 otherwise.

Table 1. Summary of College of American Pathologists Survey Data Used in Analysis*
Column headings: Analyte; LN Survey; LN Mailing; PT Survey; PT Mailing; No. of PT Participants; No. of LN Participants
[Table body not recovered; surviving entries: Iron, ALT, LN3, Z]
* LN indicates linearity; PT, proficiency testing; ALT, alanine aminotransferase; CEA, carcinoembryonic antigen; and hCG, human chorionic gonadotropin.
B and C indicate separate mailings during a 1-year period.
Number of participants with complete data used in the current analysis.

We defined 3 other measures of PT performance to identify laboratories that may be at risk for failing PT based on deviations from their respective peer group means. These remaining 3 variables are intended to measure within-laboratory variability relative to a peer group standard without incorporating specific PT grading rules. In each case, results are standardized using the corresponding peer group mean and SD. We then calculated 3 different measures of laboratory variability using these standardized variables.

Results Greater than 2 SDs. For a given peer group and analyte, let $Z_{ij}$ be the standardized result, or SD interval (SDI), for the jth challenge of the ith participant. That is, if $X_{ij}$ is the participant's result, $\bar{X}_j$ is the peer group mean, and $SD_j$ is the peer group SD for the jth challenge, then $Z_{ij} = (X_{ij} - \bar{X}_j)/SD_j$. For the first variable using the standardized results, we construct a binary variable to indicate whether any of the standardized results are less than $-2$ or greater than 2. If any challenges are greater than 2 SDs from the participant's peer group mean, we assign a value of 1. Otherwise, the value is 0.

Sum of Absolute SDIs. The SDI is the laboratory result standardized by the corresponding peer group mean and SD.
For the ith laboratory, the sum of the absolute SDIs is defined for a given analyte as

$$S_i = \sum_j |Z_{ij}|, \tag{1}$$

where the summation is taken over the number of challenges, usually 5, per analyte. This variable is intended to identify laboratories with large variability from results either substantially greater or less than the peer group average.

Absolute Sum of SDIs. The final variable, the absolute sum of the SDIs, is intended to identify laboratories with persistent bias, ie, with results that are consistently greater than or less than the peer group mean. Here, the summation is taken first, and then the absolute value is calculated. The variable is formally defined as

$$A_i = \Bigl|\sum_j Z_{ij}\Bigr|, \tag{2}$$

where the summation is taken over the number of challenges, usually 5, per analyte.

Comparison of LN and PT Performance

In the first set of analyses, we compared LN enrollees and nonenrollees using these 4 PT performance measures. For the 2 binary variables, we compared results using either a $\chi^2$ test or a Fisher exact test when expected cell counts were small. For the 2 continuous variables defined in Equations 1 and 2, we used the Wilcoxon 2-sample test to evaluate whether the locations of the distributions of the enrollee and nonenrollee performance measures differed.2 In the Results section, we have listed specific results with P values of <.10. Generally, when the significance level of a comparison was >.10, we have indicated that the test was done but no significant differences were found. We have also listed all results for iron.

In the second set of analyses, we compared successful and unsuccessful LN enrollees using the 4 PT performance measures. Again, we used either the $\chi^2$ or Fisher exact test for the binary variables and the Wilcoxon 2-sample test for the continuous variables to formally evaluate the significance of the observed differences in performance measures for the different groups of participants.
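As an illustration of the definitions above, the following is a minimal Python sketch of the 3 SDI-based measures and of a 2-sided Fisher exact test for a 2 × 2 table. It is not the authors' analysis code: the function names, the example laboratory, and the example counts are hypothetical, and the Fisher routine uses the common convention of summing all tables no more probable than the observed one. The $\chi^2$ and Wilcoxon tests are not shown.

```python
from math import comb, fsum


def sdi(result, peer_mean, peer_sd):
    """Standardized result (SD interval): Z_ij = (X_ij - mean_j) / SD_j."""
    return (result - peer_mean) / peer_sd


def pt_variability_measures(results, peer_means, peer_sds):
    """Return the 3 SDI-based measures for one laboratory and one analyte."""
    z = [sdi(x, m, s) for x, m, s in zip(results, peer_means, peer_sds)]
    return {
        # 1 if any challenge lies more than 2 SDs from the peer group mean
        "any_beyond_2sd": int(any(abs(v) > 2 for v in z)),
        # Equation 1: take absolute values first, then sum
        "sum_abs_sdi": fsum(abs(v) for v in z),
        # Equation 2: sum first, then take the absolute value
        "abs_sum_sdi": abs(fsum(z)),
    }


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    total = fsum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical laboratory: 5 challenges, peer mean 100 and peer SD 2 for each
lab = pt_variability_measures([102, 98, 105, 95, 100], [100] * 5, [2] * 5)
print(lab)

# Hypothetical counts: rows = LN enrollees / nonenrollees,
# columns = laboratories with / without an unacceptable PT result
print(fisher_exact_two_sided(1, 9, 11, 3))
```

In the hypothetical laboratory, the SDIs are 1, −1, 2.5, −2.5, and 0, so the sum of absolute SDIs is large while the absolute sum is 0: the results scatter widely around the peer group mean without a consistent bias, which is the distinction Equations 1 and 2 are meant to capture.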
All comparisons with P values of <.10 are listed in the Results section, and nonsignificant results are noted. We have listed all results for iron, which is the single analyte that showed no suggestion of association between PT performance and LN enrollment or success.

RESULTS

LN Participants Versus Nonparticipants

For many analytes, laboratories enrolled in the corresponding LN surveys demonstrated superior PT performance, as measured by the presence of unacceptable results among all challenges, compared with laboratories not enrolled in the corresponding LN surveys (Table 2). Specifically, LN enrollees had lower rates of unacceptable PT results. The intralaboratory variability of the PT results, however, was generally of similar magnitude for LN enrollees and nonenrollees, as measured by the remaining 3 variables (performance measures), except for albumin and alanine aminotransferase, which had reduced results for excessive deviation from participant means, and increased SDIs. Table 2 lists the results for analytes with significant differences for LN enrollees versus nonenrollees and indicates which analytes had no significant differences. The summary measures in column 4 are taken

over all LN enrollees, regardless of their performance in the LN surveys.

Table 2. Results of Statistical Analyses Comparing Proficiency Testing (PT) Performance for Linearity (LN) Survey Participants and Nonparticipants*
Column headings (as recovered): % of Laboratories Enrolled in LN Surveys; Measure of PT Performance; % or Average for LN Enrollees; % or Average for Nonenrollees; P
[Table body not recovered; surviving entries: 20, ALT, 5%]
* No significant differences were observed for carcinoembryonic antigen, cortisol, folate, human chorionic gonadotropin, iron, potassium, and vitamin B12. ALT indicates alanine aminotransferase.
Includes only measures of PT performance for which there is a significant or marginally significant difference between LN enrollees and nonenrollees. SDI indicates SD interval.
Summary measure is the percentage of laboratories with at least 1 unacceptable result or at least 1 result more than 2 SDs from the peer group [...]

Table 3. Results of Statistical Analyses Comparing Proficiency Testing (PT) Performance for Successful Linearity (LN) Survey Participants (Results Calibrated and Linear) Versus Unsuccessful LN Participants (Either Not Calibrated or Not Linear)*
Column headings (as recovered): Measure of PT Performance; Mean Good Performance for LN; Mean Poor Performance for LN; P
[Table body not recovered; surviving entries: 28%, 19%, 1]
* No significant differences were observed for alanine aminotransferase, glucose, and iron. CEA indicates carcinoembryonic antigen; hCG, human chorionic gonadotropin.
Includes only measures of PT performance for which there is a significant or marginally significant difference between LN enrollees and nonenrollees. SDI indicates SD interval.
Summary measure is the percentage of laboratories with at least 1 unacceptable result or at least 1 result more than 2 SDs from the peer group [...]

PT Performance by LN Performance

For many analytes, there was a significant positive association between LN performance and PT performance.
Table 3 lists the analytes and measures of PT performance, providing evidence that successful LN performance, as measured by calibration verification, translates into successful PT performance. For most analytes, the presence of unacceptable results using the PT grading rules illustrates this association. For folate and vitamin B12, the association appears only for the measures of intralaboratory variability (results more than 2 SDs from the participant mean, the sum of absolute SDIs, or the absolute sum of SDIs). There were no significant differences in PT performance for alanine aminotransferase, glucose, and iron. Table 4 lists the complete set of results for iron, the single analyte with no suggestion of association between PT performance and LN enrollment or performance. There

was no trend in the direction of association between the measures of PT performance and LN enrollment or success. Eight (62%) of 13 analytes demonstrated reduced number of unacceptable results with successful calibration verification, and 5 analytes demonstrated simultaneous reduction in the number of unacceptable results and variability (Table 5). All analytes except iron showed at least some suggestion of positive association between PT performance and LN enrollment or success.

Table 4. Results for Iron as the Single Analyte Showing No Significant or Marginally Significant Differences in Proficiency Testing (PT) Performance Based on Linearity (LN) Survey Enrollment or Performance*
Column headings (as recovered): Measure of PT Performance; LN Enrollment (No, P); LN Performance (Good, Poor, P)
[Table body not recovered; surviving entries: 15%, 1.6%]
* Twenty-three percent of laboratories were enrolled in [...]. SDI indicates SD interval.
Summary measure is the percentage of laboratories with at least 1 unacceptable result or at least 1 result more than 2 SDs from the peer group [...]

Table 5. Summary of Results Comparing Proficiency Testing and Linearity (LN) Survey Performance*
Column headings: LN Survey; LN Enrollment Associated With: Reduced No. of Unacceptable Results, Reduced Variability; LN Performance Associated With: Reduced No. of Unacceptable Results, Reduced Variability
[Table body not recovered; surviving entries: Iron, ALT, LN3]
* ALT indicates alanine aminotransferase; CEA, carcinoembryonic antigen; and hCG, human chorionic gonadotropin.

COMMENT

In this analysis, we demonstrated that for most analytes, there is evidence linking performance on PT surveys with performance on LN surveys. Only iron did not demonstrate better PT performance with LN enrollment or performance. Compared with the previous study by Lum et al,3 we demonstrated significant differences for cortisol, creatinine, beta human chorionic gonadotropin, and potassium. In that previous study, 19 (58%) of 33 analyses demonstrated improved performance (reduced number of unacceptable results) with successful calibration verification.
In the present study, 8 (62%) of 13 analyses demonstrated improved performance (reduced number of unacceptable results) with successful calibration verification. The biggest change since the previous study in the calibration verification portion of the LN survey has been a change in the LN materials from lyophilized to liquid. Participants no longer are required to make dilutions for most of the analytes, resulting in a decreased number of poor admixtures. Thus, fewer false assessments of poor calibration verification occur. In addition, these results may reflect better performance by the liquid material, as indicated by accuracy within peer groups.

In addition to outright unacceptable results, we also studied other conditions related to poor performance with PT. Many laboratories investigate PT results when the participant result exceeds the group mean by more than 2 SDs (increased variability). Because such investigations require additional resources and may be a source of concern, a survey that can reduce the incidence of increased variability would be useful. Many laboratories investigate PT when the sum of the SD indices (SDIs) is greater than 5. Usually the SDIs in such cases are all positive or all negative, and such results indicate a strong positive or negative bias. In cases where they are not all positive or negative but the sum of the absolute values of the SDIs is greater than 5, then increased imprecision in the methodology should be suspected. These occurrences are not failures but rather represent warnings that a method is not performing up to expectations. For these types of cases, we measured the number of laboratories that experienced an increase in the sums of absolute SDIs or an increase in the absolute sum of the SDIs. By reducing the number of instances of these non-failure-oriented warnings, the LN calibration verification surveys can save valuable laboratory resources.
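The screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold of 5 comes from the text, whereas the function name, the warning labels, the example SDIs, and the choice to test the magnitude of the signed sum first (so that a sum below −5 also flags bias) are hypothetical.

```python
def sdi_warning(sdis, threshold=5.0):
    """Classify the SDIs of one analyte from a single PT mailing.

    The threshold of 5 follows the text; the labels and the order of the
    checks are illustrative assumptions.
    """
    abs_of_sum = abs(sum(sdis))             # large when results share a sign
    sum_of_abs = sum(abs(z) for z in sdis)  # large when results scatter widely
    if abs_of_sum > threshold:
        return "possible bias"
    if sum_of_abs > threshold:
        return "possible imprecision"
    return "no warning"


# Hypothetical SDIs for 5 challenges each
print(sdi_warning([1.5, 1.25, 1.0, 0.75, 1.0]))    # all positive
print(sdi_warning([2.0, -1.5, 1.5, -1.0, 0.5]))    # mixed signs, wide scatter
print(sdi_warning([0.5, -0.5, 0.25, 0.0, -0.25]))  # close to the peer mean
```

Neither warning is a PT failure under the grading rules; both only mark results that a laboratory may choose to investigate.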
Lum and his colleagues3 did not examine the effect of

successful calibration verification on variability, as was done in the present study. There are several reasons why successful performance with LN surveys may translate into better performance (both reduced number of unacceptable results and reduced variability) with PT. Failure to validate the analytical measurement range may alert a laboratory to problems before they are experienced with PT, thus allowing the time to make appropriate corrections. The LN material frequently covers a much wider range of values than do the PT materials, providing a greater challenge for a particular method's performance. The allowance for error is tighter for the LN surveys than for PT.

References

1. Clinical Laboratory Improvement Amendment of 1988: Final Rule. 42 CFR Part 405, et al. Washington, DC: US Dept of Health and Human Services, Health Care Financing Administration, Public Health Service; Federal Register, February 28, 1992.
2. Fisher LD, van Belle G. Biostatistics: A Methodology for the Health Sciences. New York, NY: John Wiley & Sons.
3. Lum G, Tholen DW, Floering DA. The usefulness of calibration verification and linearity surveys in predicting acceptable performance in graded proficiency tests. Arch Pathol Lab Med. 1995;119.