Validation of Cell-based Fluorescence Assays: Practice Guidelines from the ICSH and ICCS Part V Assay Performance Criteria


Cytometry Part B (Clinical Cytometry) 84B: (2013)

Brent Wood (1), Dragan Jevremovic (2), Marie C. Bene (3), Ming Yan (4), Patrick Jacobs (5), Virginia Litwin (6)*; on behalf of ICSH/ICCS Working Group

(1) University of Washington, Seattle, Washington, USA
(2) Mayo Clinic, Division of Hematopathology, Rochester, Minnesota, USA
(3) Immunology Laboratory, CHU de Nancy & Lorraine Universite, Vandoeuvre les Nancy, France
(4) Immunocytometry Systems, BD Biosciences, San Jose, California, USA
(5) Life Technologies, Director of Flow Cytometry Sales, North America
(6) Hematology, Covance Central Laboratory Services, Inc., Indianapolis, Indiana, USA

Multi-color flow cytometry is a unique technology, which enables the analysis of heterogeneous cellular systems and provides multiparametric information on a cell-by-cell basis. A variety of factors contribute to the complexity of validating cell-based flow cytometric methods, including the lack of fully characterized cellular reference materials and the difficulty in obtaining, or creating, samples with varying levels of a given cell type or varying levels of expression of a given antigen. This document summarizes validation requirements and describes validation strategies for quasi-quantitative and qualitative cell-based flow cytometric assays. © 2013 International Clinical Cytometry Society

Key terms: Cytometry; accuracy; sensitivity; specificity; imprecision

How to cite this article: Wood B, Jevremovic D, Bene MC, Yan M, Jacobs P, Litwin V; on behalf of ICSH/ICCS working group. Validation of Cell-based Fluorescence Assays: Practice Guidelines from the ICSH and ICCS - Part V - Assay performance criteria. Cytometry Part B 2013; 84B:

The U.S. Centers for Disease Control and Prevention (CDC) categorizes flow cytometry as high complexity laboratory testing.
Understanding which category of bioanalytical assay is applicable for a given method is essential in designing and implementing method validation strategies. Lee et al. (1) have grouped bioanalytical methods into four categories: definitive quantitative; relative quantitative; quasi-quantitative; and qualitative (Table 1). Definitive quantitative assays include calibrators fit to a regression model to calculate absolute values and reference standards that are well characterized and fully representative of the endogenous measurand. Definitive quantitative assays can be both accurate and precise. Relative quantitative assays utilize response-concentration calibration; however, in this scenario the reference standards are not fully characterized or truly representative of the endogenous measurand. As such, imprecision can be demonstrated for a relative quantitative method, but accuracy can only be estimated. With quasi-quantitative assays there is a relationship between the response and the measurand, but calibration standards are not used. Thus, quasi-quantitative methods can be validated for imprecision, but not accuracy. Qualitative methods generate categorical data. Flow cytometric methods largely fall in the two latter categories and are therefore essentially quasi-quantitative or qualitative.

Multi-color flow cytometry is a unique technology, which enables the analysis of heterogeneous cellular systems and provides multiparametric information at a cell-by-cell level. The strength of flow cytometry lies not only in the ability to simultaneously measure multiple parameters, but also in the flexibility to report them in different ways. The appropriate data output depends on the biology of the system being investigated, the analytical or scientific question being asked, and the intended use of the results.
A wide variety of data outputs can be reported, usually expressed in terms of several characteristics of cells, or cell subsets, in the sample tested: for example, percentage of positive events, absolute counts, median fluorescence intensity, quantitative antigen expression levels, ratiometric indices, marker coexpression, or relative nucleic acid content.

*Correspondence to: Virginia Litwin, Ph.D., Principal Scientist, Hematology, Covance Central Laboratory Services, Inc., 8211 SciCor Drive, Indianapolis, IN, virginia.litwin@covance.com
Received 6 November 2012; Revised 20 May 2013; Accepted 14 June 2013
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: /cyto.b

Table 1. Categories of Bioanalytical Method (a)

Assay category: Quantitative
Definition: Uses calibration standard to determine the absolute quantitative values for unknown samples. The reference material is well defined and fully representative of the endogenous analyte.
Example: Pharmacokinetic assays

Assay category: Relative quantitative
Definition: Uses a calibration standard to estimate the absolute quantitative values for unknown samples. The reference material is not fully representative of the endogenous analyte (e.g. MESF units).
Example: Cytokine enzyme immunoassays, PMN CD64 assay

Assay category: Quasi-quantitative
Definition: Does not use calibration standard, but has a continuous response. Numeric data is reported.
Example: Immunogenicity assays, phenotypic and functional biomarker assays, receptor occupancy assays; cell counts

Assay category: Qualitative
Definition: Lacks proportionality to the amount of analyte. Categorical data is reported.
Example: Immunogenicity and immunohistochemical assays

(a) Adapted from Lee et al., 2006 and O'Hara et al.

Validation Samples

A variety of factors contribute to the complexity of validating cell-based flow cytometric methods (2,3). The lack of fully characterized cellular reference materials and the difficulty in obtaining, or creating, samples with varying levels of a given cell type or varying levels of expression of a given antigen contribute most significantly to these challenges. Assessments of the analytical measurement range, sensitivity, and linearity are made difficult by the lack of cellular reference material and of samples with varying levels of the analyte. Although creative solutions can be designed, none are without limitations. For example, low positive samples may be constructed by spiking negative samples with a small number of positive cells; however, spiking samples with cell lines is discouraged, as cell lines are substantially different from the specimens intended for clinical use.
Moreover, in clinical settings especially, the results obtained in flow cytometry can be correlated with clinical information and with the results of other laboratory techniques, as a means of cross-validating the information provided by flow cytometry.

Performance Assessment for Relative and Quasi-Quantitative Methods

Relative and quasi-quantitative methods include, but are not limited to, lymphocyte subset analysis, CD4+ T cell enumeration, immunodeficiency exploration, CD34+ stem cell enumeration, fetal RBC enumeration, neutrophil CD64 expression, PNH testing, and minimal residual disease detection.

Accuracy

According to the ISO definition, accuracy is the closeness of agreement between the average value obtained from a large series of test results and an accepted reference value (International Organization for Standardization (ISO) Statistics-Vocabulary and Symbols. ISO Geneva: 1993). By this definition, it is not possible to assess accuracy in the measurement of cellular populations by flow cytometry, because true cellular standard reference material with appropriate biological matrices is not available. Alternative accepted approaches for establishing accuracy include comparison to current reference methodology (e.g., ICSH immunoplatelet count assay), inter-laboratory comparison, or verification with specimens obtained from patients with a diagnosis confirmed by alternative methods. For fluorescence cell-based methods used as diagnostic assays, accuracy assessment presents a considerable challenge. Reference methods are not available and reference laboratories are rarely available. Moreover, comparison to other methodologies may result in unexplained differences due to the different technologies and definitions employed. Although most approaches will be imperfect comparisons, the validating laboratory must document some attempt at accuracy assessment.
A minimum of 10 sample replicates is recommended whenever accuracy is being assessed, unless data indicate that more or fewer replicates would be more appropriate. The acceptance criteria will vary depending on the degree of accuracy required for the intended use, but should nevertheless be clearly defined for each assay. Ninety percent or greater agreement between methods is considered acceptable performance for accuracy, but as discussed above, with the method comparison approach this may not be achievable. In cases where agreement is low, the validating laboratory should provide a probable explanation for the discrepancies and a means to identify any interfering conditions.

Assay of Standard or Reference Materials

The best approach for assessing accuracy is to evaluate reference material, which simply does not exist for cell-based methods, even for the fluorochromes commonly used in these assays. For most FDA-cleared or CE-approved methodologies, CAP Proficiency Testing Surveys or characterized control materials are available, but many EQA programs are consensus-based, not accuracy-based. Alternatively, accuracy, albeit in terms of correlation to comparable methods, can be verified by acceptable performance in Proficiency Testing Surveys or by demonstrating that QC results are within the manufacturer's ranges (2,4). Unfortunately, for most cell-based fluorescence assays, proficiency testing surveys and qualified QC material are not available, so in most cases this approach will not be feasible. Alternatives are presented below.

Method Comparison

Method comparison can be used to demonstrate accuracy if there is another method with documented clinical or analytical equivalency. For example, morphology would be appropriate for hematopathology and tritiated thymidine incorporation for flow-based proliferation analysis (5). When determining the acceptance criteria, it is important to be aware that differences observed in the comparison of a flow cytometric method to another methodology are likely to be due to technological differences rather than to the accuracy of a cell-based Laboratory Diagnostic (or Developed) Test (LDT). For example, a flow cytometric method for absolute monocyte counts would likely include 2 light scatter properties and a minimum of 2 fluorochrome-conjugated monoclonal antibodies (i.e., specific for CD45 and CD14 or CD64). Thus, it is likely that the flow cytometric monocyte assay will yield biased results in comparison to either the gold standard morphologic identification of monocytes on a blood smear or enumeration with a clinical hematology analyzer using light scatter alone (6).

Inter-laboratory Comparison

Split samples can be used to compare results if there is another laboratory that runs the same test by the same or similar methods. As discussed above, it must be recognized that differences in technologies are likely to result in differences in test performance. Yet, in the absence of a well-defined standard, demonstration of acceptable correlations between methods may be the only available option.
Comparison to a Confirmed Diagnosis - Clinical Validation

It is similarly challenging to compare assay results with clinical diagnosis or outcome, due to the multiple variables involved and the need for carefully controlled patient cohorts that are beyond the capacity of most laboratories. Such methods can only be used if the assay serves as a diagnostic test for a unique clinical entity. Accuracy should not be determined by comparing assay results with clinical diagnosis when a test is not uniquely diagnostic for that entity. Although a standard validation should include instrument-to-instrument bias assessment and between-analyst comparisons, these parameters do not assess accuracy and should not be considered as such in a validation plan.

Specificity

Analytical specificity

For flow cytometric methods several factors influence analytical specificity: the markers used to define the cellular population or antigens of interest, the gating strategy, and interactions with other reagents. For cell-based assays, the justification for the phenotype of a given cell subset must be provided in the form of reagent specificity, demonstrating the ability of the phenotype to appropriately distinguish the cell population or antigen of interest from all other cell populations and events. The gating strategies must be verified to establish that the cell subset of interest is included, whilst other cell subsets and nonspecific events are excluded. The specificity of the reagents and antibodies under the conditions of the assay must be verified by assaying a suitable material known to contain the population of interest as well as other populations expected to be present in material to which the assay will be applied. Publications in peer-reviewed journals or information from the Leucocyte Differentiation Antigens Workshops may be used to verify the specificity of specific commercially available monoclonal antibody clones.
If the clone has not been through the workshop, then data on specificity might include immunoassays (ELISAs, immunoblots, etc.) to purified measurand and/or comparison to previously characterized reagents. It is well established that hemolyzed, clotted and partially clotted samples may cause aberrant results and should not be used unless acceptable assay performance under these conditions is demonstrated (7).

Clinical specificity

Clinical specificity is established by documenting that the assay result correlates with other clinical and/or other laboratory data for the clinical situation of interest, and does not correlate with other clinical situations. The criteria relevant to provide assay specificity for a given clinical situation should be clearly defined for each assay. Interfering conditions or substances should be identified and the extent of interference investigated during the validation process.

Sensitivity

Sensitivity-analytical (LOD/LOB)

The limit of blank (LOB) is the highest apparent signal measured in the absence of the measurand and is commonly calculated as Mean(blank) + 1.645 × SD(blank); 95% of negative values are below this limit. The limit of detection (LOD) is closely related to the LOB and is commonly defined as LOB + 1.645 × SD(low positive samples). The LOD is the level where 95% of low levels of measurand are detected above the level of the blank. This assumes 5% false positive and false negative rates. The lower limit of quantitation (LLOQ) is the lowest level of measurand that can be reliably detected (LOD) and whose total error (bias + imprecision) meets a desired criterion for accuracy (clinical utility). LLOQ may equal LOD under some circumstances, but is never lower than LOD.
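The LOB and LOD definitions above can be illustrated with a minimal Python sketch. It assumes the conventional multiplier of 1.645 (the one-sided 95% quantile of the standard normal distribution, consistent with the 5% false positive and false negative rates stated above) and approximately normally distributed replicate results. The function names and the replicate values are hypothetical and are not taken from the guideline.

```python
from statistics import mean, stdev

# Assumption: 1.645 is the one-sided 95% quantile of the standard normal
# distribution, matching the 5% false positive / false negative rates.
Z_95 = 1.645


def limit_of_blank(blank_values):
    """LOB = mean(blank) + 1.645 * SD(blank)."""
    return mean(blank_values) + Z_95 * stdev(blank_values)


def limit_of_detection(blank_values, low_positive_values):
    """LOD = LOB + 1.645 * SD(low positive samples)."""
    return limit_of_blank(blank_values) + Z_95 * stdev(low_positive_values)


# Hypothetical replicate results (% positive events), for illustration only.
blanks = [0.010, 0.012, 0.008, 0.011, 0.009]
low_positives = [0.050, 0.056, 0.047, 0.053, 0.049]

lob = limit_of_blank(blanks)
lod = limit_of_detection(blanks, low_positives)
print(f"LOB = {lob:.4f}%  LOD = {lod:.4f}%")
```

With so few replicates the SD estimates are themselves imprecise, so a calculation of this kind is better read as a rough estimate than as a validated limit.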

For cell-based assays, establishing the analytical sensitivity is relevant for quasi-quantitative flow cytometric assays designed to measure dim levels of fluorescence or detect rare events, e.g. minimal residual disease or CD34+ peripheral stem cells (frequencies down to /10^-6). Replicate assay of samples with dim fluorescence or a low population frequency, as well as negative samples, is required to estimate analytical sensitivity. A standard recommendation for soluble analytes is to establish LOB/LOD with 100 low positive and 60 negative samples, each consisting of X samples assayed X times each over X days, where X is the number of samples divided by 10. This is difficult to achieve for many flow cytometric assays, not only due to inherent differences in cell-based assays through cellular autofluorescence, but also due to limited specimen availability and considerable reagent costs. Under these circumstances, it may be acceptable to verify a desired target LOD by assaying fewer replicates (e.g. N = 5) from a smaller number (e.g. N = 5) of low positive and negative specimens, each of which is collected as five separate listmode files ( measurements). This design also recognizes that a flow cytometric analysis represents a statistical sample of tens of thousands of individual cell measurements. These analyses might be run over a minimum of 3 separate days so as to assess the consistency of instrument start-up cycles and daily quality control. If no more than 5% of the blank replicates exceed the low positive target, then LOB is confirmed. If no more than 5% of low positive sample replicates fall below the target LOD, then LOD is confirmed.

The LOB can also be assessed by using a gating or fluorescence-minus-one (FMO) control tube (9). LOB can also be assessed by autofluorescence in whole blood assays or in assays where non-specific binding is not a sensitivity limitation in the assay design.
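The replicate-based confirmation rule described above (no more than 5% of blank replicates above the target, and no more than 5% of low positive replicates below the target LOD) can be sketched as follows. This is an illustrative reading of the rule, not a prescribed procedure; the function names, the single shared target value, and the replicate data are hypothetical.

```python
def fraction_where(values, predicate):
    """Fraction of values for which predicate(value) is true."""
    return sum(1 for v in values if predicate(v)) / len(values)


def lob_confirmed(blank_replicates, target, max_fraction=0.05):
    """True if no more than 5% of blank replicates exceed the target."""
    return fraction_where(blank_replicates, lambda v: v > target) <= max_fraction


def lod_confirmed(low_positive_replicates, target, max_fraction=0.05):
    """True if no more than 5% of low positive replicates fall below the target."""
    return fraction_where(low_positive_replicates, lambda v: v < target) <= max_fraction


# Hypothetical design: 5 specimens x 5 replicates = 25 results per group.
target_lod = 0.020  # hypothetical target, % positive events
blank_results = [0.010] * 24 + [0.022]          # 1 of 25 (4%) above target
low_positive_results = [0.050] * 23 + [0.015] * 2  # 2 of 25 (8%) below target

print("LOB confirmed:", lob_confirmed(blank_results, target_lod))
print("LOD confirmed:", lod_confirmed(low_positive_results, target_lod))
```

Note that with 25 replicates per group the 5% limit allows at most one discordant result, which is one reason the text recommends spreading the runs over several days.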
The FMO tube would include all the gating antibodies used in the assay to identify the target cell population, with the exception of the one antibody directed against the analyte of interest. Data generated from the FMO tube could potentially provide an appropriate blank, or a level of antigen expression for a defined cell type below which antigen expression cannot be detected. However, this type of control does not take into account potential contributions to background from the test reagent and may thus overestimate the analytical sensitivity of the assay. Isotype negative control antibodies may also be included for this purpose, but must be carefully matched for performance; otherwise they may not accurately reflect the background contributed by the test reagent (see ICSH-ICCS Guidelines Validation of cell-based fluorescence assays: Part III-Analytical Issues [cyto.b.21106]). Assaying cell populations known to be negative for the cellular antigen of interest is another way to estimate non-specific fluorescence background, but these populations should be matched in a manner accounting for autofluorescence and other background characteristics of the population of interest. Controls of these types, as might be done with internal cell controls, allow estimation of LOB in the manner described above.

Determination of LOD can be achieved by repeated assay of samples having low levels of the measurand to determine SD and calculated imprecision. However, flow cytometric data intrinsically consist of measurements of fluorescence intensity on many individual cells. Consequently, for assays relying on measurement of fluorescence intensity for quantitation, i.e. level of expression above background, LOD can be estimated from a single sample where a low level of measurand near LOD is present, provided the SD is sufficiently low, in conjunction with a single negative control tube (FMO) to estimate LOB.
However, assay of multiple samples of each type (minimum, N = 5) will better incorporate sample and preparative variation, five replicates of each sample being a suggested minimum number to determine the imprecision near the LOB.

Sensitivity-functional (LLOQ)

The difference of the mean from the presumed true value is the Bias. The Total Error (TE) is estimated as Bias + 2SD and should be compared with the desired TE. If the TE exceeds the desired limit, TE at a higher value of measurand must be assessed. As mentioned above, well-characterized reference standards or stabilized control materials to assess an assay's accuracy are for the most part not available for cell-based techniques, especially for novel LDT methods. Therefore, the concept of total error cannot be correctly applied. As a result, one can assume that the Bias is 0 and use SD, expressed as the coefficient of variation (CV), to estimate a lower limit for the assay functional sensitivity. An advantage of using the %CV, rather than the SD, is that it normalizes variations at lower levels of event detection or measurand expression.

A standard recommendation for chemistry assays for soluble analytes to establish LLOQ for a new assay is to measure 40 replicates from 3 to 5 samples on at least 5 runs. This is difficult to achieve for many flow cytometric assays due to limited specimen availability and considerable reagent costs. Under these circumstances, it may be appropriate to verify a desired LLOQ by assaying five replicates near the LLOQ, analyzing each replicate, and confirming that an acceptable level of imprecision is achieved. For assays evaluating the frequency of populations, one approach is to create samples near the LLOQ through serial dilutions of stained/fixed/(±washed) whole blood or PBMC samples into unstained or partially stained autologous whole blood. A related approach is to prepare serial dilutions of cells from a positive sample spiked into a negative sample.
However, spiking samples with cultured cell lines is discouraged, as they are substantially different from their normal counterparts in peripheral blood. For assays evaluating the intensity of antigen expression, a series of samples having increasing amounts of competing unlabeled antibody may be prepared to mimic low-level antigen expression. Repeated assay of the sample with the lowest detectable frequency or intensity will provide an estimate of LLOQ. If the distribution of the reportable populations is different in disease state samples, it may be necessary to establish the LLOQ with disease-state samples, if available. Otherwise, lacking evidence of difference in the analyte composition or behavior of diseased cells from those of non-diseased cells, clinical samples from a variety of patient populations may be suitable for assay validation studies.

Imprecision

Intra-assay imprecision

Intra-assay imprecision should be conducted in the same assay matrix (bone marrow/blood or cell suspension originating from patient tissue or fluid). If possible, samples from both the diseased state and healthy donors should be utilized, as diseased samples often display different population distributions than healthy samples. A minimum of five samples should be assayed in triplicate (or more) in a single analytical run (2,4). Ideally the determination of LLOQ (see above) and intra-assay imprecision can be performed at the same time. Note that for rare specimen types it may not be possible to obtain five diseased samples. In such cases, alternative specimen types should be sought for evaluation of imprecision. The five samples should be selected to span the expected analytical range for clinical decisions. The laboratory should document an explanation for the substitution and the rationale for any alternative selected. Unless there is scientific evidence of chemical differences in the molecular structure of the measurand in disease states (altered amino acid sequence, changes in glycosylation or sialylation, etc.)
compared to healthy cells, there is no valid scientific reason to perform the assay on a variety of diseased specimens beyond validation of the analytical claims of the assay. The mean, SD and %CV for each reportable result from each sample should be calculated. Percent CV rather than SD should be used as the acceptance criterion. An advantage of using the %CV rather than the SD is that it normalizes variations at lower levels of event detection. Considerations when establishing the level of acceptable imprecision for a reportable result include the frequency of the population and the total number of events acquired (11,12). A desirable target for assay imprecision is a CV of less than 10%, but for less abundant populations (where frequency is at a level of 1:1,000 (0.1%) or lower), such as minimal residual disease detection or fetomaternal hemorrhage detection, a CV of less than 20% may be acceptable (13). In such instances where the population is rare, higher variation may be acceptable depending on the disease and intended use of the method, but more replicates and more samples should be used for the imprecision evaluation.

Inter-assay imprecision

In order to avoid the contribution of specimen instability, inter-assay imprecision can be performed in a commercially available stabilized whole blood quality control material suitable for flow cytometry. If such stabilized material is not available, multiple runs may be conducted in the same day, provided that the instrument is powered down and recalibrated according to the laboratory standard procedure. Unlike method validation with stable soluble analytes, it is misleading and inappropriate to perform inter-assay imprecision studies separated in time by more than 4 to 6 hours, due to the alterations related to changes in cell viability in biologic specimens. Data generated during method validation can also be used to generate the preliminary quality control (QC) range (e.g. mean ± 2 SD).
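The imprecision statistics described above (mean, SD, %CV, a frequency-dependent CV target, and a preliminary QC range of mean ± 2 SD) can be computed as in the following minimal sketch. The function names and the replicate values are hypothetical; the 10% and 20% limits and the 0.1% frequency cut-off are taken from the text, and a laboratory may justify different limits for its intended use.

```python
from statistics import mean, stdev


def imprecision(replicates):
    """Return (mean, SD, %CV) for one reportable result."""
    m = mean(replicates)
    sd = stdev(replicates)
    return m, sd, 100.0 * sd / m


def cv_target(population_frequency_percent):
    """CV target from the text: <20% at or below 0.1% frequency, else <10%."""
    return 20.0 if population_frequency_percent <= 0.1 else 10.0


def preliminary_qc_range(values, k=2.0):
    """Preliminary QC range as mean +/- k * SD (k = 2 in the text's example)."""
    m, sd, _ = imprecision(values)
    return m - k * sd, m + k * sd


# Hypothetical triplicate: % of lymphocytes positive for a marker.
triplicate = [45.2, 44.8, 45.6]
m, sd, cv = imprecision(triplicate)
passes = cv < cv_target(m)
low, high = preliminary_qc_range(triplicate)
print(f"mean={m:.2f} SD={sd:.2f} CV={cv:.2f}% pass={passes}")
print(f"preliminary QC range: {low:.2f} to {high:.2f}")
```

A QC range derived from a single triplicate is shown only to keep the example short; in practice the range would be built from the full set of validation runs.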
Note that for some assays, the final reportable populations may not be present in the QC material. In such cases, alternative populations should be evaluated in order to demonstrate the repeatability of reagent performance. Two to three levels of quality control material should be assayed in triplicate, in three to five independent analytical runs (2,4,5). The mean, SD and %CV for each reportable result from each sample should be calculated. Considerations when establishing the level of acceptable imprecision for a reportable result include the frequency of the population and the total number of events acquired, as defined earlier (9,10).

Linearity

By and large, the reportable results for flow cytometric methods are considered quasi-quantitative (1); thus linearity is not directly applicable. The two instances where linearity may be applicable are when the primary assay result is the frequency of a population, e.g. minimal residual disease or fetal RBC detection, or a measured level of fluorescence intensity, e.g. PMN CD64 measurements. In principle, either may be evaluated as part of the same experiment, and using the same samples, used to assess LLOQ. For example, the linearity of population frequency may be estimated by assay of a positive sample prepared by known serial dilution into a negative sample, as above. Only when the fluorescence intensity signal output is quantified using fluorescence calibration/quantitation beads is the assay considered relative quantitative, in which case linearity should be evaluated using alternative approaches to serial dilutions (11). On the other hand, instrument linearity can be demonstrated and should be verified semi-annually per the laboratory's instrument SOP; this is a sufficient and appropriate assessment of linearity for relative quantitative assays. An alternative means for linearity assessment is to utilize calibrated fluorescence beads, which are spectrally and environmentally matched to the assay being validated.
By using a set of spectrally matched beads of different fluorescence intensities spanning the measurement range of the assay, and careful techniques to gate on bead singlets, doublets and triplets, one should demonstrate a linear relationship between the MESF values determined for the beads and the median (or mean) fluorescence intensity measurements of the multiple bead populations as measured by the flow cytometer. This should be determined both for the instrument model used for the assay, by analyzing calibration beads in buffer solution, and for the assay conditions, by adding the bead mixtures into the stained or processed specimen and demonstrating a linear measurement of the same beads under the total assay conditions. Attempts can be made to prepare samples with varying levels of the reportable result in order to assess linearity. Data should be analyzed according to the recommendations in CLSI Guideline EP06-A (12).

Carryover

Validation of the level of carryover between specimens on an automated instrument is important in fluorescent cell-based assays, particularly in rare event assays such as fetomaternal hemorrhage testing (14). For cell counting assays, the validation procedure of sequentially analyzing replicates of a low end specimen, followed by a high end specimen, and again followed by the low end specimen, is outlined in the CLSI H52-A guideline (14). However, for flow cytometry, the source of carryover is an instrument issue, particularly for those instruments using a carousel or multi-well sampler design. Thus the assay carryover can be determined, as with linearity validations, through the use of beads of differing fluorescence intensity. As the beads have a characteristic fluorescence and light scatter signature, carryover can be determined to levels below 1% using the same principle for carryover detection outlined in CLSI H52.

Measurement Range/Reportable Range

The reportable range is defined as the acceptable limits (low and high, if applicable) within which each of the reported analytes has met the analytical imprecision and linearity requirements.
The information derived from assessing the LLOQ will establish the lower range. Given the large dynamic range of the instrument, up to five decades in some instances, an upper limit to the measurement range need only be determined if very bright fluorescence intensities are anticipated. For qualitative assays, the reportable range would merely be present or absent, while for quasi-quantitative assays, the upper and lower limits for each reported result should be tested for analytical imprecision and, if appropriate, linearity. During the assay development phase, the upper and lower limits of cell number or sample volume required for saturating or acceptable staining conditions should be established.

Stability

Specimen stability

Samples from a minimum of five apparently healthy or disease state donors should be tested fresh (ideally within 2 h of collection) as baseline, and at various intervals depending on when specimens would be expected to arrive at the laboratory. Sample stability must be evaluated for each anticoagulant or processing condition to be used. Specimens must be held under the anticipated storage conditions (4 °C, RT, cryopreserved) in a similar manner to that to be used for the assay in practice. The effect of cell viability on the assay should be evaluated and understood. If relevant to the assay design, isolated cell subsets should be assessed for stability of the reportable results following isolation and storage. For example, isolated and cryopreserved cells should be regularly assayed for stability of the reportable results after thawing, in comparison to the level seen prior to processing and storage. This might qualify the use of biobanked specimens for assay clinical validation studies. Stability is established at the latest time point where a ±20% change from baseline is achieved or a minimum of 80% of the samples are within initial assay imprecision (2,15).
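One possible reading of the stability criterion above can be sketched in Python. As a simplifying assumption made only for this illustration, the sketch combines the two criteria: a time point passes when at least 80% of samples are within ±20% of their own baseline value, and stability is reported as the latest time point reached before any failure. The function names, the time points and the cell counts are hypothetical.

```python
def percent_change(baseline, value):
    """Percent change of a result relative to its baseline value."""
    return 100.0 * (value - baseline) / baseline


def latest_stable_timepoint(baseline, results_by_hour,
                            tolerance=20.0, min_fraction=0.8):
    """Return the latest time point (hours) that passes, or None.

    Assumption for this sketch: a time point passes when at least
    `min_fraction` of samples are within +/- `tolerance` percent of
    baseline; evaluation stops at the first failing time point.
    """
    last_ok = None
    for hours in sorted(results_by_hour):
        changes = [percent_change(b, v)
                   for b, v in zip(baseline, results_by_hour[hours])]
        within = sum(abs(c) <= tolerance for c in changes) / len(changes)
        if within < min_fraction:
            break
        last_ok = hours
    return last_ok


# Hypothetical absolute counts (cells/uL) for five donors.
baseline = [500, 620, 480, 710, 555]
results_by_hour = {
    24: [490, 600, 470, 700, 540],
    48: [450, 560, 430, 650, 500],
    72: [380, 470, 400, 520, 430],
}
stable_hours = latest_stable_timepoint(baseline, results_by_hour)
print("Stable up to (hours):", stable_hours)
```

A laboratory applying the second criterion literally would replace the fixed ±20% tolerance with limits derived from the initial assay imprecision of each reportable result.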
Processed specimen stability

The stability of processed (stained, lysed, fixed) samples should also be evaluated to determine how soon processed samples need to be acquired on the instrument. Processed samples (N = 5) should be tested within 1 h of staining (baseline) and at various intervals depending on when specimens would be expected to be acquired on the instrument. Specimens must be held under the same conditions to be used in practice, i.e. at 4 °C or RT and in the dark. A recommendation for flow cytometry is to acquire data as soon as the samples are ready. This is especially critical for intracellular antigen and no-wash assays. Post-processing stability is also established at the latest time point at which the change from baseline remains within ±20% or a minimum of 80% of the samples are within initial assay imprecision (2,15).

Reagent stability

Reagents should be used within the manufacturer's expiration dates. Any manufacturer or LDT validator should have stability data for a minimum of three lots of the assay reagents, with inter-batch CVs anticipated to be <10%. If reagents are used outside of expiration dates, equivalent performance must be documented. Reagent cocktails and other solutions prepared within the laboratory must be evaluated for stability in the expected storage containers and at the expected storage temperatures.

Reference range

A reference range can be defined as a set of values obtained from the analysis of a cohort of healthy (normal) adult and/or pediatric individuals for the purpose of interpretation of results. It is understood that reference ranges are primarily applicable to quantitative assays, since actual numerical data are reported. For qualitative assays, it would be relevant to indicate whether the measurand of interest is positive (present) or negative (absent) in a healthy population (see below with regard to disease-specific ranges). A reference range can also include disease-specific reference values, where the parameters for the assay are measured in the context of a defined clinical entity. For some assays there may be multiple applicable clinical contexts, and it may therefore be impractical to obtain an adequate number of clinical samples to establish a disease range. Whenever possible, however, an attempt should be made to incorporate disease-specific samples to ensure that the analytical parameters are being appropriately defined in the specific clinical context. The reference range should be developed under the same analytical conditions as the actual clinical analysis (e.g. including consideration of biological variability, such as the timing of sample collection to reflect the impact of diurnal variation). Certain biological factors such as pregnancy or age may influence immune function or WBC counts and should therefore be taken into consideration before recruiting study participants. It would be most helpful to develop a set of exclusion criteria that accounts for known factors with the potential to affect immune responses and/or function, while acknowledging the presence of unknown factors and the practical difficulties of recruitment under stringent conditions. The number of samples required to establish the reference range is determined by statistical considerations and by the availability of a pool of healthy donors who meet the criteria for participation in a reference range study. In some clinical contexts, a disease range may be required in addition to, or in place of, a reference range for accurate interpretation of clinical data, keeping in mind that developing a disease-specific range can be significantly more challenging, time-consuming and expensive than developing a reference range with healthy donors (see above).
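To illustrate the kind of statistical consideration involved, the sketch below estimates a reference interval as the central 95% of healthy-donor values using a rank-based nonparametric method. The choice of the 2.5th and 97.5th percentiles, the rank formula p(n + 1) with linear interpolation, the function names and the synthetic donor values are all assumptions made for illustration; they are not taken from the text above.

```python
def value_at_rank(sorted_values, rank):
    """Value at a 1-based, possibly fractional rank (linear interpolation)."""
    n = len(sorted_values)
    rank = min(max(rank, 1.0), float(n))  # clamp to the observed range
    lower = int(rank)                     # floor, since rank >= 1
    if lower >= n:
        return sorted_values[-1]
    fraction = rank - lower
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


def nonparametric_reference_interval(values, coverage=0.95):
    """Central `coverage` interval from ranks p * (n + 1) of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lower_limit = value_at_rank(ordered, tail * (n + 1))
    upper_limit = value_at_rank(ordered, (1.0 - tail) * (n + 1))
    return lower_limit, upper_limit


# Synthetic example: 120 hypothetical donor results with values 1..120.
donor_values = [float(v) for v in range(1, 121)]
low, high = nonparametric_reference_interval(donor_values)
print(round(low, 3), round(high, 3))  # 3.025 117.975
```

With small cohorts the extreme ranks fall on or near the smallest and largest observations, which is one reason the cohort size matters for this kind of estimate.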
Conversely, patient cohorts tested as part of routine analyses may be more readily available to assess disease ranges than samples obtained from apparently healthy subjects. The number of healthy donors used for a reference range analysis is classically proposed to include a minimum of 120 individuals (60 males and 60 females), adult or pediatric (CLSI C28-A3). If a pediatric range is not specifically defined, then it should be shown that pediatric-derived data for that specific assay (whether it is functional or a single marker or group of markers) are comparable to the adult data in at least children across the pediatric age spectrum. These data could come from prospectively recruited pediatric donors or from pediatric samples that have been tested as part of routine clinical testing but have values that are considered normal. Another, secondary option is to use ranges based on the literature when appropriate. If these data cannot be obtained for pediatric individuals but the test has to be performed in this population, then consideration should be given to reporting pediatric data qualitatively instead of quantitatively. Validation should still be performed to support the intended use. Data analysis should be performed according to the recommendations in CLSI C28-A3 (16).

Performance Assessment for Qualitative Methods

Qualitative methods include, but are not limited to, leukemia/lymphoma analysis, assessment of whether an abnormal population is present or absent (for instance in MDS assessment), and immunophenotypic description and level of involvement on which no threshold for clinical action is based.

Accuracy

As mentioned above, accuracy, as defined by ISO (17), cannot be established for cell-based assays. Qualitative methods are often only one of the modalities used in establishing a diagnosis, and generally are correlated with other clinical and laboratory findings.
The alternative to assessment of true accuracy during the validation process is comparison of qualitative results from flow cytometric assays to the expected results as discussed above. As a general rule, at least 20 normal and 20 abnormal specimens should be included in the validation of an LDT. Additional samples would be required to support specific IVD claims. Note that for rare specimen types it may not be possible to obtain this number of samples. In such cases, alternative specimen types or spiked samples should be sought for evaluation. The laboratory should document an explanation for the substitution and the rationale for the alternative selected. It is important to recognize that newly developed tests usually use a higher number of parameters for cell characterization, and may have an improved ability to detect abnormalities. Literature data can also be helpful in this context.

Specificity

Analytical specificity

Refer to the section for quasi-quantitative methods.

Clinical specificity

Clinical specificity for qualitative methods is defined as the ability to distinguish normal from abnormal specimens. From a statistical viewpoint, specificity for a classification function of this type is defined as the proportion of truly negative samples that are recognized as negative: Specificity = TN/(TN + FP), where TN = true negatives and FP = false positives (18). This may be determined by assaying a series of samples and scoring the presence or absence of an abnormality in comparison with a suitable reference method, such as morphology, clinical findings, immunohistochemistry or molecular studies. The number of samples to be assessed will depend on the nature of the assay and the variability in expected immunophenotypes. Sensitivity and specificity may be determined from the same sample cohort provided a balanced mixture of positive and negative samples is present.
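As a worked illustration of the specificity formula and of scoring one cohort against a reference method, the sketch below tallies a 2 x 2 table and derives specificity together with sensitivity, using the standard definition TP/(TP + FN) for the latter. The function names and the cohort of 20 abnormal and 20 normal specimens with the calls shown are hypothetical.

```python
def confusion_counts(flow_calls, reference_calls):
    """Tally TP, FP, TN, FN; True means 'abnormality present'."""
    tp = fp = tn = fn = 0
    for flow_abnormal, ref_abnormal in zip(flow_calls, reference_calls):
        if ref_abnormal and flow_abnormal:
            tp += 1
        elif ref_abnormal and not flow_abnormal:
            fn += 1
        elif not ref_abnormal and flow_abnormal:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical cohort: 20 abnormal and 20 normal by the reference method.
reference = [True] * 20 + [False] * 20
# Flow cytometry misses 1 abnormal specimen and flags 2 normal specimens.
flow = [True] * 19 + [False] + [True] * 2 + [False] * 18

tp, fp, tn, fn = confusion_counts(flow, reference)
print(tp, fp, tn, fn)                        # 19 2 18 1
print(round(sensitivity(tp, fn), 2))         # 0.95
print(round(specificity(tn, fp), 2))         # 0.9
```

With only 20 specimens per group, a single discordant call shifts either estimate by five percentage points, which is worth keeping in mind when acceptance limits are set.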
The range of specimens used for validation should reflect the range of specimens intended to be analyzed in the clinical setting (blood or bone marrow, lymph node, tissues, fluids, etc.). In addition, as some qualitative flow cytometry tests are often used in different diseases, a range of expected disease processes should also be reflected in the validation cohort.

Sensitivity

Sensitivity-analytical (LOD/LOB)

Sensitivity in a qualitative assay is the ability to recognize a finding above background, e.g. distinction of an abnormal population from normal populations or recognition of an abnormal level of antigen expression. Thus, the limit of detection of a qualitative assay may be defined as the minimum number of events that constitute a recognizable cell population divided by the total number of events collected. It may also be defined as the minimum level of deviation in intensity that allows recognition of an abnormality in antigen expression divided by the absolute intensity of expression. Both of these measures will be heavily dependent on the relative immunophenotype of the normal and abnormal populations, as well as on background staining and other technical artifacts. Consequently, sensitivity in assays of this type is very difficult to quantify. To assess sensitivity for abnormal population detection, one can perform dilution or spiking experiments with known abnormal samples in an appropriate normal background, e.g. peripheral blood or bone marrow, at a variety of levels. The level at which an observer can confidently distinguish the abnormal population from normal populations is the limit of detection of the assay. Given the great variability in immunophenotype observed in these situations, a general recommendation as to the number of samples to be tested is not possible. Another particular feature of blood or bone marrow flow cytometry assays is the presence of numerous different cell subsets that can be used as internal controls (i.e., not stained by specific reagents). In fact, assay sensitivity is likely to vary on a per-sample basis such that an impractically large number of samples would need to be assessed by this method.
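The first definition of the limit of detection given above, and the reading of a spiking series, can be sketched as follows. The event counts, the spiking levels and the function names are hypothetical, and the rule of walking down from the highest level and stopping at the first level that was not detected is an assumption about how an observer's calls would be summarized.

```python
def limit_of_detection(min_cluster_events, total_events):
    """LOD as the smallest recognizable cluster divided by events collected."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return min_cluster_events / total_events


def lowest_detected_level(spike_results):
    """Lowest spiked frequency detected before the first missed level.

    spike_results maps spiked frequency (fraction of total cells) to
    True/False for 'observer confidently identified the population'.
    """
    lowest = None
    for level in sorted(spike_results, reverse=True):
        if not spike_results[level]:
            break
        lowest = level
    return lowest


# Hypothetical: a cluster of 50 events is recognizable among 500,000 collected.
lod = limit_of_detection(50, 500_000)
print(f"{lod:.4%}")  # 0.0100%

# Hypothetical spiking series read by one observer.
series = {0.1: True, 0.01: True, 0.001: True, 0.0001: False}
print(lowest_detected_level(series))  # 0.001
```

Because the number of events that makes a population recognizable depends on the immunophenotype and background, both inputs would need to be justified per assay rather than reused across assays.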
Another way to view assay sensitivity is as the percentage of samples correctly identified as containing an abnormal finding or population out of all samples known to be abnormal by any other method. This may be determined by assaying a series of samples, scoring the presence or absence of an abnormality, and comparing with a suitable reference method, such as morphology, clinical findings, immunohistochemistry or molecular studies. This type of assessment is more realistic as it does not require manufactured samples. The number of samples to be assessed will depend on the nature of the assay and the variability in expected immunophenotypes. Sensitivity = TP/(TP + FN), where TP = true positives and FN = false negatives (15).

Imprecision

Although results of qualitative assays are descriptive, it is important to establish reproducibility of testing during the validation process. The general criterion for acceptable intra- and inter-assay imprecision is concordance in interpretation. As a desirable number of samples, a minimum of 3 replicates each of a positive and a negative specimen should be assayed, and concordance between the replicates assessed for each reported parameter. For population identification, a quantitative measurement of imprecision (e.g. percentage of abnormal cells) should be recorded in at least 3 replicates of a single specimen, starting from antibody staining through analysis. A desirable target for assay imprecision is less than 10% CV, but for less abundant populations (<1%), imprecision of less than 20% CV is acceptable. This exercise ensures that, despite the subjective analysis of the data, technical assay performance (including instrument, antibodies, and gating) is reproducible.

Linearity

Not applicable for qualitative methods.

Measurement range/reportable range

Not applicable for qualitative methods.

Establishing reference intervals

Not applicable for qualitative methods.
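The imprecision targets stated under Imprecision above (less than 10% CV, or less than 20% CV for populations below 1%) can be checked with a short calculation. The sketch assumes the CV is computed from the sample standard deviation (n - 1 denominator) and that abundance is judged from the mean of the replicates; the function names and replicate values are hypothetical.

```python
from statistics import mean, stdev


def percent_cv(replicates):
    """Coefficient of variation in percent, using the sample SD (n - 1)."""
    return 100.0 * stdev(replicates) / mean(replicates)


def imprecision_acceptable(replicates, rare_threshold_pct=1.0):
    """Apply the <10% CV target, relaxed to <20% CV for populations <1%."""
    limit = 20.0 if mean(replicates) < rare_threshold_pct else 10.0
    return percent_cv(replicates) < limit


# Hypothetical triplicates, expressed as % abnormal cells.
abundant = [12.1, 11.8, 12.6]
rare = [0.40, 0.50, 0.45]
rare_noisy = [0.30, 0.50, 0.45]

for name, reps in [("abundant", abundant), ("rare", rare), ("rare_noisy", rare_noisy)]:
    print(name, round(percent_cv(reps), 1), imprecision_acceptable(reps))
```

With only three replicates the CV estimate is itself imprecise, so a result close to either limit would usually justify additional replicates before a conclusion is drawn.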
Specimen stability

Stability

As for quasi-quantitative assays, stability is established at the latest time point at which the change from baseline remains within ±20% or a minimum of 80% of the samples are within initial assay imprecision (2,15). Specifically, for qualitative assays, that means that interpretation is concordant in 4 out of 5 samples for a given time point or temperature range. Compared with other types of laboratory tests, flow cytometry has the advantage of a visual data display that enables detection of sample compromise by light scatter and linear non-specific antibody binding. It is recommended that procedures clearly state that specimen integrity is assessed by visual inspection of dot plots. For specimens whose integrity is more questionable (e.g. solid tissues), it is recommended that a viability assessment be included with every specimen. In addition, there are occasional clinical situations in which precious samples are obtained (CSFs, small volumes from infants, etc.) that must be analyzed even though they fall outside validated specimen stability limits (age, temperature, specimen type). If results on these precious specimens are reported, a statement indicating validation limitations and caution with interpretation must be included in the report.

Processed specimen stability

Processed samples (3 to 5) should be tested within 1 h of staining (baseline), and additionally at various intervals, depending on when specimens are expected to be acquired on the instrument. Specimens must be held under the same conditions to be used in practice, i.e. at 4 °C or RT and in the dark. A recommendation for flow cytometry is to acquire data as soon as the samples are ready. The acceptance criterion is concordance.

Reagent stability

Refer to the section for quasi-quantitative methods.

Reference range

Refer to the section for quasi-quantitative methods.

LITERATURE CITED

1. Lee JW, Devanarayan V, Barrett YC, Weiner R, Allinson J, Fountain S, Keller S, Weinryb I, Green M, Duan L, Rogers JA, Millham R, O'Brien PJ, Sailstad J, Khan M, Ray C, Wagner JA. Fit-for-purpose method development and validation for successful biomarker measurement. Pharm Res 2006;23:
2. O'Hara D, Xu Y, Lianz E, Reddy M, Wu D, Litwin V. Recommendations for the Validation of Flow Cytometric Testing During Drug Development: II Assays. J Immunol Meth 2011;363:
3. Cunliffe J, Derbyshire N, Keeler S, Coldwell R. An approach to the validation of flow cytometry methods. Pharm Res 2009;26:
4. New York State Department of Health, Clinical Laboratory Evaluation Program, Assay Approval in Cellular Immunology, February
5. Owens MA, Vall HG, Hurley AA, Wormsley SB. Validation and quality control of immunophenotyping in clinical flow cytometry. J Immunol Meth 2000;243:
6. Grimaldi E, Carandente P, Scopacasa F, Romano MF, Pellegrino M, Bisogni R, De Caterina M. Evaluation of the monocyte counting by two automated haematology analysers compared with flow cytometry. Clin Lab Haem 2005;27:
7. CLSI H42-A2: Enumeration of Immunologically Defined Cell Populations by Flow Cytometry, Second Edition, Approved Guideline. Wayne, PA: National Committee for Clinical Laboratory Standards,
8. Roederer M. Spectral Compensation for Flow Cytometry: Visualization Artifacts, Limitations, and Caveats. Cytometry 2001;45:
9. Hoy T. Chapter 7, Further clinical applications, from Flow Cytometry: A Practical Approach, Third Edition, edited by Ormerod MG. Oxford University Press (2000, reprint 2003).
10. Johansson U. Chapter 8, Immunological studies of human cells, from Flow Cytometry: Principles and Applications, edited by MG Macey. Humana Press Inc., Totowa, NJ (2007).
11. Wang L, Gaigalas AK, Marti G, Abbasi F, Hoffman RA. Toward quantitative fluorescence measurements with multicolor flow cytometry. Cytometry Part A 2008;73:
12. CLSI EP06-A: Evaluation of the Linearity of Quantitative Measurement Procedures, Approved Guideline. Wayne, PA: National Committee for Clinical Laboratory Standards,
13. Barnett D, Granger V, Kraan J, Whitby L, Reilly JT, Papa S, Gratama JW. Reduction of intra- and inter laboratory variation in CD34+ stem cell enumeration using stable test material, standard protocols and targeted training. DK34 Task Force of the European Working Group of Clinical Cell Analysis (EWGCCA). Br J Haematol 2000;108:
14. CLSI H52-A: Fetal Red Cell Counting Procedures, Approved Guideline. Wayne, PA: National Committee for Clinical Laboratory Standards,
15. Nowatzke W, Woolf E. Best practices during bioanalytical method validation for the characterization of assay reagents and the evaluation of analyte stability in assay standards, quality controls, and study samples. AAPS J 2007;9:E
16. CLSI C28-A3: Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratories, Approved Guideline, Third Edition. Wayne, PA: National Committee for Clinical Laboratory Standards,
17. International Organization for Standardization (ISO). Statistics - Vocabulary and Symbols. ISO Geneva:
18. CLSI EP12-A2: User Protocol for Evaluation of Qualitative Test Performance, Second Edition, Approved Guideline. Wayne, PA: National Committee for Clinical Laboratory Standards, 2008.