Image-guided radiotherapy quality control: Statistical process control using image similarity metrics


Satomi Shiraishi, Michael P. Grams, and Luis E. Fong de los Santos a)
Department of Radiation Oncology, Mayo Clinic, 200 First St SW, Rochester, MN 55905, USA

(Received 28 August 2017; revised 7 February 2018; accepted for publication 23 February 2018; published 30 March 2018)

Purpose: The purpose of this study was to demonstrate an objective quality control framework for the image review process.

Methods and materials: A total of 927 cone-beam computed tomography (CBCT) registrations were retrospectively analyzed for 33 bilateral head and neck cancer patients who received definitive radiotherapy. Two registration tracking volumes (RTVs), cervical spine (C-spine) and mandible, were defined, within which a similarity metric was calculated and used as a registration quality tracking metric over the course of treatment. First, sensitivity to large misregistrations was analyzed for normalized cross-correlation (NCC) and mutual information (MI) in the context of statistical analysis. The distribution of metrics was obtained for displacements that varied according to a normal distribution with a standard deviation of σ = 2 mm, and the detectability of displacements greater than 5 mm was investigated. Then, similarity metric control charts were created using a statistical process control (SPC) framework to objectively monitor the image registration and review process. Patient-specific control charts were created using NCC values from the first five fractions to set a patient-specific process capability limit. Population control charts were created using the average of the first five NCC values for all patients in the study. For each patient, the similarity metrics were calculated as a function of unidirectional translation, referred to as the effective displacement. Patient-specific action limits corresponding to 5 mm effective displacements were defined.
Furthermore, effective displacements of the ten registrations with the lowest similarity metrics were compared with the three-dimensional (3DoF) couch displacement required to align the anatomical landmarks.

Results: Normalized cross-correlation identified suboptimal registrations more effectively than MI within the framework of SPC. Deviations greater than 5 mm were detected at 2.8σ and 2.1σ from the mean for NCC and MI, respectively. Patient-specific control charts using NCC evaluated daily variation and identified statistically significant deviations. This study also showed that subjective evaluations of the images were not always consistent. Population control charts identified a patient whose tracking metrics were significantly lower than those of other patients. The patient-specific action limits identified registrations that warranted immediate evaluation by an expert. When effective displacements in the anterior-posterior direction were compared to 3DoF couch displacements, the agreement was within 1 mm for seven of 10 patients for both C-spine and mandible RTVs.

Conclusions: Qualitative review alone of IGRT images can result in inconsistent feedback to the IGRT process. Registration tracking using NCC objectively identifies statistically significant deviations. When used in conjunction with the current image review process, this tool can assist in improving the safety and consistency of the IGRT process.

American Association of Physicists in Medicine

Key words: image-guided radiotherapy, quality control

1. INTRODUCTION

In modern-day radiotherapy, most institutions use image guidance for patient localization before treatment, with volumetric imaging being the most commonly used modality across all treatment sites.
[1] Coupled with advances in delivery technology, image-guided radiotherapy (IGRT) has opened up opportunities such as dose escalation, hypofractionation, and a reduced margin between clinical target volume (CTV) and planning target volume (PTV) by decreasing geometric uncertainties associated with patient setup at each treatment. [1-3] While image guidance is a critical part of radiotherapy, the current quality control process for daily review of image registration by both therapists and physicians is subjective; each reviewer may have different expectations for alignment accuracy. [1,4] Furthermore, evaluation of three-dimensional (3-D) image registration is challenging when multiple critical anatomies are present or when there is deformation. At most institutions where volumetric localization is the standard of practice, therapists also have to review image registrations and decide whether the alignment is acceptable for treatment, which complicates achieving inter-reviewer consistency. Furthermore, because methods to evaluate statistical trends in image registrations over multiple treatment days are not standard practice, achieving inter-fraction consistency is difficult.

Med. Phys. 45 (5), May 2018. American Association of Physicists in Medicine.

Also, automatic rigid registration

of images may not result in clinically optimal registrations. Automatic registration depends on many variables such as the size of the selected volume of interest, the registration algorithm, the search space, and image quality. [5] Studies have shown the presence of significant residual setup errors due to deformed anatomy after automatic rigid registrations. [6-9] While IGRT has opened up the opportunity to precisely deliver highly conformal dose distributions, the IGRT process alone does not guarantee an optimal patient setup, warranting the development of an objective quality control process.

The purpose of this work was to demonstrate an objective quality control framework for image review to improve the consistency and efficiency of the IGRT process. We used an image similarity metric to quantify patient-specific inter-fraction consistency in a selected volume of interest (VOI). With the similarity metric as a quantitative measure of image registration consistency, we established a quality control system using statistical process control (SPC). Statistical process control is an established framework used to create control charts for process monitoring and feedback. It is frequently used in industry and has also been applied to radiation therapy; Pawlicki et al. have published articles describing the application of SPC in radiation therapy, and others have used the framework for intensity-modulated radiotherapy process control. [13,14] In addition to the stepwise method for creating control charts, the process capability limit and action limit are critical concepts employed in SPC. [10,11] In the context of IGRT, the process capability limit bounds statistically expected variations and identifies statistically significant deviations. The action limit is a threshold that indicates clinically meaningful deviations.
Ideally, the process capability limit should fall within the action limit decided by the clinic. However, in the case of unstable immobilization devices or setups, the process capability limit may be outside the action limit. While both process capability limits and action limits are important for full clinical implementation of SPC, this manuscript primarily focuses on establishing process capability limits and briefly discusses a preliminary analysis of action limits. The proposed framework has immediate application as a means for objective, on-line registration monitoring and outlier detection prior to treatment. This tool can also assist in the physician's off-line review process by objectively monitoring the consistency of registrations over the course of treatment and identifying setups that deviate significantly from previous fractions.

2. MATERIALS AND METHODS

This study was an institutional review board-approved retrospective study involving 33 bilateral head and neck (H&N) cancer patients who received definitive radiotherapy between July 2015 and August All patients received daily cone-beam computed tomography (CBCT) localization. When multiple CBCTs were acquired within a single treatment session, the CBCT acquired immediately before the treatment was used for the analysis, resulting in 927 registrations. All planning CTs were acquired on a SOMATOM Definition AS scanner (Siemens, Munich, Germany) using an in-house adult head and neck protocol with 65 cm field of view, mm² axial resolution, and 2 mm slice thickness. The CBCTs were acquired with TrueBeam linacs (v2.0; Varian Medical Systems, Palo Alto, CA) using a half-scan head protocol with 26.2 cm field of view, mm² axial resolution, and 1.99 mm slice thickness. [15] All 33 patients were immobilized using an Orfit 5-point mask (Orfit Industries, Wijnegem, Belgium) with AccuCushions™ (Klarity Medical Products, Newark, OH) and a neck rest.
Twenty-two patients were fitted with a Precise Bite™ (CIVCO, Coralville, IA) to localize the mandible, and seven patients used an oral sponge. Prior to each treatment, a daily CBCT was registered to the planning CT using 6 degrees of freedom (DoF) and the mutual information (MI) algorithm provided by the TrueBeam software. The algorithm searched for the best registration in three steps with increasing spatial resolution. [15] A bone intensity filter ( HU) was applied to create a registration volume within the user-defined VOI during the last step. Standard practice in our clinic is on-line verification for the first fraction, in which images are obtained and reviewed by a physician prior to treatment, and off-line verification for the remaining fractions, in which images are reviewed by a physician after treatment. From the second fraction on, therapists are requested to align the images using the previously described automatic registration process without manual adjustments. If therapists have questions regarding the results of the registration, a physician is called to review and may choose to adjust the alignment manually prior to treatment.

2.A. Registration tracking volume and similarity metrics

The proposed quality control process relies on a similarity metric calculated for a predefined, consistent VOI referred to as a registration tracking volume (RTV). Since similarity metric values depend on the size and location of the volume, the RTV was fixed for each patient throughout the course of treatment. Two RTVs were defined for this analysis and are shown in Fig. 1: (a) C-spine, which is an cm³ box around the cervical spine (approximately C2-C7); and (b) Mandible, which is cm³ around the lower mandible. The analysis was performed using in-house MATLAB (Mathworks, Natick, MA) codes.
Each patient's planning CT, daily CBCT, and co-ordinate transformation matrix from image registration at the time of treatment were exported from our radiation oncology information system database. Co-ordinate transformations were applied to the daily CBCTs using the registration matrix obtained at treatment to register them to the corresponding planning CT. The CBCT was then resampled to match the planning CT resolution ( mm³). Two similarity metrics, mutual information (MI) and normalized cross-correlation (NCC), were evaluated for suitability as tracking metrics in the SPC framework. Mutual information is defined as

$$\mathrm{MI} = \sum_{a,b} p(a,b)\,\log\frac{p(a,b)}{p(a)\,p(b)}, \tag{1}$$

where $a$ and $b$ are pixel values in the planning CT and CBCT, respectively, and $p(a)$ and $p(b)$ are the probabilities of those pixel values in the RTV. [16] The $p(a,b)$ term is the joint probability that the planning CT has pixel value $a$ and the CBCT has pixel value $b$ in the corresponding voxel. Normalized cross-correlation is defined as

$$\mathrm{NCC} = \frac{1}{N}\sum_{\vec{x}} \frac{\left(A(\vec{x}) - \bar{A}\right)\left(B(\vec{x}) - \bar{B}\right)}{\sigma_A\,\sigma_B}, \tag{2}$$

where $N$ is the number of voxels in the RTV, $A(\vec{x})$ is the pixel value at $\vec{x}$, and $\bar{A}$ and $\sigma_A$ are the average and standard deviation, respectively, of all pixel values in the RTV for the planning CT. Corresponding values for the CBCT are $B(\vec{x})$, $\bar{B}$, and $\sigma_B$.

FIG. 1. Examples of registration tracking volumes used for this study. The C-spine RTV is a cm³ box around approximately C2-C7 and the Mandible RTV is a cm³ box around the lower mandible. [Color figure can be viewed at wileyonlinelibrary.com]

2.B. Similarity metric sensitivity study

Sensitivities of MI and NCC to systematic misregistrations in the context of statistical analysis were evaluated using the C-spine RTV. To identify suboptimal registrations, we needed to statistically separate small, random daily variations from larger, potentially clinically significant deviations. Registrations from the first fraction for 20 randomly selected patients were reregistered to find reference registrations corresponding to maximum MI values for the RTV, which also corresponded to the maximum NCC values. The CBCT images were then systematically displaced around the reference registration, and similarity metrics were calculated at each displacement. The metrics were normalized by the value at the reference registration to allow comparison between MI and NCC.
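As a rough illustration of Eqs. (1) and (2) and of the normalized sensitivity curve described above, the following Python sketch computes both metrics on a synthetic volume. It is not the authors' in-house MATLAB code. The histogram bin count, the synthetic "bone" block, the noise level, and the use of `np.roll` for whole-voxel shifts are all assumptions made for illustration; a clinical implementation would resample real CT and CBCT data with sub-voxel interpolation.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally shaped volumes, Eq. (2)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    # Population standard deviations (ddof=0), matching the 1/N normalization.
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

def mutual_information(a, b, bins=32):
    """Mutual information from a joint histogram of voxel values, Eq. (1).

    The bin count is an assumed parameter; the source does not state one.
    """
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    p_ab = joint / joint.sum()                # joint probability p(a, b)
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal p(a), shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal p(b), shape (1, bins)
    nz = p_ab > 0                             # skip empty cells: 0 * log(0) = 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

def sensitivity_curve(fixed, moving, shifts, axis, metric=ncc):
    """Metric versus whole-voxel shift, normalized to the zero-shift value."""
    ref = metric(fixed, moving)
    values = [metric(fixed, np.roll(moving, s, axis=axis)) for s in shifts]
    return np.array(values) / ref

# Synthetic stand-in for an RTV: a bright block on a noisy background.
rng = np.random.default_rng(0)
block = np.zeros((20, 20, 20))
block[5:15, 8:12, 5:15] = 1000.0
fixed = block + rng.normal(0.0, 20.0, size=block.shape)    # "planning CT"
moving = block + rng.normal(0.0, 20.0, size=block.shape)   # "CBCT"

shifts = list(range(-3, 4))                                # voxels
curve = sensitivity_curve(fixed, moving, shifts, axis=1)
```

Passing `metric=mutual_information` to `sensitivity_curve` produces the corresponding MI curve, so the two metrics can be compared on the same normalized scale.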
To estimate patient setup errors, the variability in similarity metrics was obtained for 300 simulated displacements in each translational direction (lateral, vertical, longitudinal) that varied according to a normal distribution with standard deviation σ = 2 mm. The detectability of displacements greater than 5 mm, corresponding to a CTV-to-PTV margin in our clinic, was compared to that of more prevalent, smaller displacements and was presented as the distance from the mean in units of standard deviations.

2.C. Statistical process control

2.C.1. Overview

A statistical process control (SPC) framework was employed to quantitatively monitor the image review process. Similarity metrics quantified residual variations in the patient setup at the time of treatment and were used as a quality metric in SPC. Detailed discussions of SPC and its application can be found elsewhere. [10,12,17,18] In brief, SPC is a framework to create control charts which characterize process performance (e.g., the image registration review process). In addition to measured data points, control charts may display a center line, which indicates the average, and the process capability limits and action limits described in the Introduction. If the metric follows a normal distribution, the process capability limits estimate ±3 standard deviations. [18] The SPC framework is still valid when the metric does not follow a normal distribution, although the limits no longer indicate a 99.7% confidence interval. [10,18] Methods used to create control charts for this study are discussed below.

2.C.2. Patient-specific control chart

Patient-specific control charts were created to monitor daily variation and to identify inconsistent registrations. In this study, the first five fractions were used as baseline data, so the center line ($\bar{x}$) of the control chart represented the average similarity metric for the first five fractions.
Upper and lower process capability limits (UL and LL) were calculated as $\mathrm{UL} = \bar{x} + 3\hat{\sigma}$ and $\mathrm{LL} = \bar{x} - 3\hat{\sigma}$, where $\hat{\sigma}$ was an estimator of the true standard deviation. The relationship between the sample range $R$ and the standard deviation $\sigma$ is described by the relative range, $W = R/\sigma$, which is tabulated for a normal distribution; $W = 1.128 \pm 0.853$ for a sample size of two. [17,18] Statistical process control uses the mean of the relative range to estimate the standard deviation by $\hat{\sigma} = \bar{R}/1.128$. The variable $\bar{R}$ is the average of the first four range measurements, $R_2$ through $R_5$, where $R_i = |x_i - x_{i-1}|$, with $x_i$ being the similarity metric on the $i$th fraction. This patient-specific control chart is a way to monitor daily setup consistency and assist in identifying statistically significant deviations.

2.C.3. Rationale for population analysis

Since similarity metrics depend on pixel value distributions in the RTV, and each patient has anatomical landmarks of slightly different shape and size, similarity metric values are patient-specific. However, if variability among patient anatomy is less than variability originating from image misalignment, a population analysis can help identify patients whose setup reproducibility is very different from that of other patients. To evaluate the feasibility of conducting a population analysis, the variations due to anatomical differences were compared to the variations observed for treatment registrations. Metric variations originating from anatomical differences were calculated by reregistering images from the first fractions using the C-spine and mandible RTVs independently to maximize the similarity metric for each patient. Three patients were removed from the mandible study since two of them did not have fully intact mandibles, and one had a large surgical screw across the mandible. The standard deviation of the metrics was calculated for these reregistered alignments to characterize variations due to anatomical differences.
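For the patient-specific chart of Section 2.C.2, a minimal Python sketch of the limit calculation is given here. It assumes the individual-measurement, moving-range construction stated in the text (center line from the first five fractions, $\hat{\sigma} = \bar{R}/1.128$, limits at three estimated standard deviations). The function names and the NCC values are hypothetical and chosen only to show the arithmetic.

```python
import numpy as np

D2_N2 = 1.128  # mean relative range for a sample size of two (normal distribution)

def patient_limits(metrics, n_baseline=5):
    """Center line and 3-sigma capability limits from the baseline fractions."""
    x = np.asarray(metrics[:n_baseline], dtype=float)
    center = x.mean()
    # Moving ranges R_i = |x_i - x_{i-1}| for i = 2..n_baseline, then their mean.
    r_bar = np.abs(np.diff(x)).mean()
    sigma_hat = r_bar / D2_N2
    return center, center - 3.0 * sigma_hat, center + 3.0 * sigma_hat

def flag_out_of_control(metrics, lower, upper):
    """Return 1-based fraction numbers whose metric falls outside the limits."""
    m = np.asarray(metrics, dtype=float)
    return np.flatnonzero((m < lower) | (m > upper)) + 1

# Hypothetical daily NCC values for one patient (illustrative only).
ncc_values = [0.90, 0.92, 0.91, 0.89, 0.90, 0.91, 0.78, 0.90]
center, lower, upper = patient_limits(ncc_values)
flagged = flag_out_of_control(ncc_values, lower, upper)
```

With these made-up numbers the average moving range is 0.015, so the limits sit roughly 0.04 either side of the center line and only the seventh fraction falls outside them.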
From the metrics measured for the treatment registrations, the weighted average of the mean similarity metric and the standard deviation were calculated using

$$\bar{x} = \frac{\sum_{j=1}^{m} n_j\,\bar{x}_j}{\sum_{j=1}^{m} n_j} \quad\text{and}\quad s = \left[\frac{\sum_{j=1}^{m} (n_j - 1)\,s_j^2}{\sum_{j=1}^{m} n_j - m}\right]^{1/2},$$

where $\bar{x}_j$ is the average metric, $n_j$ is the number of fractions, and $s_j$ is the standard deviation for the $j$th patient, and $m$ is the number of patients. The standard deviation due to anatomical differences was compared to the standard deviation observed for treatment alignments to evaluate the dominant cause of the variation.

2.C.4. Population control chart

A population control chart can help identify patients whose setup reproducibility is significantly different from that of other patients. Unstable immobilization devices or patient weight loss or swelling could cause less reproducible setups. Two types of population control charts were created: the average control chart, which monitored average metric values across patients, and the sigma control chart, which monitored the variation in the metric across patients. Metrics during the first five fractions for each patient were used to create these control charts. The center line for the average control chart was calculated as $\bar{\bar{X}} = \frac{1}{m}\sum_{j=1}^{m} \bar{X}_j$, where $\bar{X}_j$ is the average of the first five measurements for the $j$th patient. Similar to the patient-specific control chart, the upper and lower limits were set at three standard deviations: $\mathrm{UL} = \bar{\bar{X}} + 3\hat{\sigma}/\sqrt{n}$ and $\mathrm{LL} = \bar{\bar{X}} - 3\hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}/\sqrt{n}$ is the estimator of the standard deviation of the means. In this control chart, $\hat{\sigma}$ is calculated using $\hat{\sigma} = \bar{S}/c_4$, where $\bar{S} = \frac{1}{m}\sum_{j=1}^{m} S_j$, and $S_j$ is the standard deviation for the first five fractions for the $j$th patient. The constant $c_4$ is tabulated for a normal distribution and is $c_4 = 0.940$ for a sample size of five.
[18] The sigma control chart was created with a center line of $\bar{S}$ and upper and lower limits of $\mathrm{UL} = \bar{S} + 3\hat{\sigma}\sqrt{1 - c_4^2}$ and $\mathrm{LL} = \bar{S} - 3\hat{\sigma}\sqrt{1 - c_4^2}$, where $\hat{\sigma}\sqrt{1 - c_4^2}$ is the estimator of the standard deviation of $S$. [18] These control charts identified patients whose similarity metrics during the first five fractions were significantly lower or had greater variability than those of other patients. For one particular outlier patient, images were reregistered using the C-spine RTV as the registration VOI to evaluate the achievable similarity metric when only the C-spine was considered for registration. We then evaluated the corresponding mandible alignment.

2.D. Preliminary study on action limits

2.D.1. Similarity metrics and effective displacements

Relating similarity metric values to spatial displacement would make interpretation of the metric more intuitive for users and would allow us to set an action limit based on spatial displacements. However, since similarity metrics are scalar values and have no directional information, assumptions about the direction of displacement were necessary to relate the values to spatial distances. As a preliminary analysis, we converted the similarity metric to a spatial displacement assuming the misalignment was only in one direction, and refer to the result as the effective displacement. For each patient, the sensitivity curves were produced in a manner similar to that described in Section 2.B. Unlike in Section 2.B, to account for patient variations, the metric was not normalized to the value at the reference registration. Three sensitivity curves, left/right (L/R), anterior/posterior (A/P), and superior/inferior (S/I), were created for each patient. These patient-specific sensitivity curves were used to convert metric values to effective displacements.

2.D.2. Effective displacement vs couch shift

The effective displacements were compared to the total couch shifts required to align the C-spine and mandible in their corresponding RTVs. Ten registrations with the lowest similarity metrics from each of the C-spine and mandible datasets were evaluated. Since treatment registration is often a compromise between C-spine, mandible, and clavicle, reregistering the images using RTVs allowed us to estimate

residual misalignments of each anatomical landmark at the time of treatment. Couch shifts were obtained by 3DoF automatic registration using the RTV as the VOI. The preferential direction of displacement was first determined as the direction that was most often the largest component of the calculated 3DoF couch shifts. The total shift was calculated as

$$\Delta\mathrm{Couch} = \left[(\Delta x)^2 + (\Delta y)^2 + (\Delta z)^2\right]^{1/2},$$

where $\Delta x$, $\Delta y$, and $\Delta z$ represent couch shifts in the lateral, vertical, and longitudinal directions, respectively. Total couch shifts were compared to the effective displacements in the preferential direction of displacement.

2.D.3. Action limits

Action limits in the SPC framework define the threshold at which the deviations are clinically significant. Sensitivity curves were interpolated to provide an estimate of similarity metrics corresponding to 5 mm displacements, a CTV-to-PTV margin used in our clinic.

FIG. 2. Left/right sensitivity curves for MI (red dashed line) and NCC (blue solid line) calculated using 20 randomly selected patients from the cohort. Values are normalized to the maximum metric value at 0 displacement (Metric Ref). The shaded bands indicate 1σ. [Color figure can be viewed at wileyonlinelibrary.com]

3. RESULTS

3.A. Similarity metric sensitivity study

The sensitivity of MI and NCC was evaluated in the context of statistical analysis using the C-spine RTV. Figure 2 shows an example of normalized sensitivity curves, which represent the fractional change in metric as a function of displacement from the reference registration. A general trend, a steeper gradient for MI than for NCC around the reference registrations, was seen in all three translational directions. For the simulated displacements with σ = 2 mm, distributions of measured similarity metrics are shown in Fig. 3. The NCC distribution is narrower than that of MI; the standard deviation was 0.09 and 0.18 for NCC and MI, respectively.
On average, displacements greater than 5 mm were 2.8σ and 2.1σ away from the means for NCC and MI, respectively. Due to the steep gradient in the sensitivity curves for MI, small fluctuations in displacements resulted in a large change in MI. This is a reason MI is often preferred for finding an optimal registration for IGRT, but this feature decreases the sensitivity for outlier detection in the SPC framework by increasing the standard deviation. Normalized cross-correlation was selected for this study due to its ability to better separate small, daily variations from larger, potentially clinically significant deviations for H&N treatments.

FIG. 3. Distributions of MIs (red) and NCCs (blue) for simulated displacements where the displacements were based on a Gaussian distribution with σ = 2 mm. Displacements in all three directions, L/R, A/P, and S/I, are shown. [Color figure can be viewed at wileyonlinelibrary.com]

3.B. Statistical process control

3.B.1. Patient-specific control charts

Daily registrations showed various degrees of consistency among patients. Examples of control charts for the C-spine and mandible are shown in Fig. 4 along with axial images from selected days. These control charts also display patient-specific process capability limits, which describe the reproducibility of the setups. In Fig. 4, patient 13 showed the best consistency, and patient 23 showed the largest variation. In fact, patient 23 had the largest fluctuations in the C-spine in this entire cohort. The patient gained weight between the time of planning CT acquisition and the first treatment fraction and lost weight throughout the course of treatment; his body mass index was 37.9 at the time of simulation, 39.0 on the first treatment day, and 34.6 on the last day. For patient 19 shown in Fig. 4(d), C-spine alignment was consistent for the entire course of treatment.
However, despite using Precise Bite™, mandible alignment suddenly changed on day 7 and remained misaligned for the rest of the treatment course, as shown in Fig. 4(e). This misalignment was visually confirmed, and selected axial slices of the planning CT and CBCT overlay are shown in Fig. 4(f). On day 23, the registration was rejected during the off-line review with the physician, who noted, "Mandible is way off, is the Precise bite in correctly?" when in fact the mandible had been misaligned since day 7. This example highlights that the off-line review process does not reliably monitor registration consistency. In addition, misalignment can persist over the course of multiple fractions, which reinforces the need for more objective quality control of the image review process. Comparison of daily similarity metrics with process capability limits aids in deciding whether a given day's setup is within statistical variation in the patient setup or represents an anomalous setup. Furthermore, the standard deviation of NCC values during the course of treatment spanned an order of magnitude across all patients from to 0.09 for C-spine and 0.01 to 0.09 for the mandible. The consistency of the IGRT registration is patient-specific. These variations support the idea that evaluation of patient-specific process capability (i.e., setup reproducibility) is important in understanding the cause of setup inconsistency.

3.B.2. Rationale for population analysis

Variations of metrics across patients originating from anatomical differences were compared to variations originating from treatment setups. Figures 5(a) and 5(b) show the NCC distributions across the patient cohort when the metrics were maximized for their corresponding RTVs. The mean and standard deviation of NCC values were for the C-spine RTV and for the mandible RTV. These variations originating from anatomical differences are much smaller than the variation due to the clinical treatment setups, as shown in Figs. 5(c) and 5(d). For the treatment setups, the weighted average and standard deviations of NCC values were for C-spine and for the mandible.
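The pooled statistics of Section 2.C.3 and the population chart limits of Section 2.C.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-patient standard deviations use the sample (n − 1) form that the tabulated $c_4$ constant presumes, the lower sigma-chart limit is left exactly as the formula gives it (it can be negative for a sample size of five), and the baseline NCC values are invented.

```python
import numpy as np

C4_N5 = 0.940  # tabulated c4 constant for a sample size of five

def weighted_mean_and_pooled_sd(means, sds, counts):
    """Weighted average of patient means and pooled standard deviation."""
    means, sds, n = (np.asarray(v, dtype=float) for v in (means, sds, counts))
    x_bar = np.sum(n * means) / np.sum(n)
    s = np.sqrt(np.sum((n - 1.0) * sds**2) / (np.sum(n) - len(n)))
    return x_bar, s

def population_limits(baseline, c4=C4_N5):
    """Average and sigma chart limits from an (m patients, n fractions) array.

    Returns (LL, center, UL) for each chart plus the per-patient means and SDs.
    """
    baseline = np.asarray(baseline, dtype=float)
    n = baseline.shape[1]
    x_j = baseline.mean(axis=1)            # per-patient baseline averages
    s_j = baseline.std(axis=1, ddof=1)     # per-patient sample SDs (assumed ddof=1)
    grand_mean, s_bar = x_j.mean(), s_j.mean()
    sigma_hat = s_bar / c4
    half_avg = 3.0 * sigma_hat / np.sqrt(n)
    half_sig = 3.0 * sigma_hat * np.sqrt(1.0 - c4**2)
    avg_chart = (grand_mean - half_avg, grand_mean, grand_mean + half_avg)
    sig_chart = (s_bar - half_sig, s_bar, s_bar + half_sig)
    return avg_chart, sig_chart, x_j, s_j

# Invented baseline NCC values: eight patients, five fractions each.
offsets = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
patient_means = np.array([0.900, 0.904, 0.896, 0.902, 0.898, 0.903, 0.897, 0.820])
baseline = patient_means[:, None] + offsets[None, :]

avg_chart, sig_chart, x_j, s_j = population_limits(baseline)
outliers = np.flatnonzero((x_j < avg_chart[0]) | (x_j > avg_chart[2]))
x_w, s_w = weighted_mean_and_pooled_sd([1.0, 3.0], [1.0, 1.0], [10, 30])
```

In this toy cohort only the last patient, whose baseline average is about 0.08 lower than the others, falls outside the average chart, while every patient stays inside the sigma chart because the within-patient spread is identical by construction.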
The small variation across patients due to anatomy is likely because RTVs were set at the same size for all patients as described in Section 2.A and the anatomical landmarks did not deform easily; cervical vertebrae and mandibles do not change drastically in shape and size, as only modest spinal deformations occur within the immobilization mask. This analysis suggests that setup variation is likely a more dominant cause of fluctuation than patient-specific anatomy, and a population analysis for these particular RTVs is reasonable.

3.B.3. Population control chart

Variability across patients was studied using average and sigma control charts calculated from the first five fractions for all 33 patients using registrations from the clinical treatments. The population control charts for C-spine and mandible RTVs are shown in Fig. 6. For an outlier patient (patient 23, indicated with dotted circles), images from the first five fractions were reregistered using the C-spine RTV as the registration VOI; the recalculated NCC averages and standard deviations are shown as hollow circles in Fig. 6(a). When registration was focused only on the C-spine, NCCs were approximately at the mean of all patients. By focusing the registration around the C-spine, mandible alignment worsened [Fig. 6(b)], indicating that the relative position of the C-spine and mandible was different between the planning CT and CBCTs. This example shows that deformation (relative position of the C-spine and mandible) forced a compromise of alignment between C-spine and mandible, resulting in the lower tracking metric for this patient. For the C-spine RTV, all patients had at least nine registrations that were within the population average capability limits during the course of treatment. For the mandible RTV, all patients who used the Precise Bite™ had at least five registrations that were within the population average bounds, showing that there were no patients who could not achieve these population limits. While achieving these population average limits may not be clinically necessary, the limits can be useful during the first five fractions (baseline period) to make sure that patient-specific limits are reasonable.

FIG. 4. Examples of individual measurement control charts for the C-spine (a, d, g) and mandible (b, e, h) RTVs and selected axial images (c, f, i). Each row represents a patient. In the control charts, the black dashed lines and the solid lines indicate, respectively, the center line and capability limits for the patient. Axial images are approximately at the C-4 and C-5 level to give a sense of both C-spine and mandible alignment. The planning CT is shown in magenta, and the CBCT is shown in green. [Color figure can be viewed at wileyonlinelibrary.com]

FIG. 5. (a) Variation of NCCs across 33 patients when images from the first fraction were reregistered using the C-spine RTV. (b) Variation of NCCs across 30 patients when images from the first fraction were reregistered using the mandible RTV. (c) Whisker plot showing the distribution of C-spine NCCs for all patients during the course of treatment. The median is indicated with the circles with a dot; 25th and 75th percentiles are indicated by the blue rectangles, and outliers defined by 1.5 interquartile ranges (IQR) are shown with hollow circles. The minimum and maximum NCC values that are not outliers are indicated by the blue lines. The dashed line and the grey shaded band indicate the mean and 1σ from (a). (d) Corresponding plot for the mandible, where the dashed line and the grey shaded band indicate the mean and 1σ from (b). [Color figure can be viewed at wileyonlinelibrary.com]

3.C. Preliminary study on action limits

3.C.1. Similarity metrics and effective displacements

As a preliminary analysis, we converted the similarity metric to a spatial displacement assuming the misalignment was only in one direction, and refer to the result as the effective displacement. Average sensitivity curves for the C-spine and mandible are shown in Fig. 7. Curves for the C-spine are shown in the top row, and those for the mandible are shown in the bottom row. The asymmetry seen in Fig.
7(b) is due to the asymmetric spatial distribution of pixel values in the RTVs. For the C-spine RTV, moving the CBCT image anteriorly reduced the volume of airway in the RTV, while moving the image posteriorly increased it. These changes in the distribution of pixel values resulted in asymmetric NCC values when images were translated. Furthermore, the shape and size of the airway varied among patients, resulting in larger variation in the sensitivity curves. The slope of the curve in the C-spine S/I direction was reduced if a patient's spine was parallel to the axis of displacement because the pixel distribution of the vertebral column was similar under translational displacement. Some patients were set up with a pronounced curvature of the C-spine, which resulted in steeper sensitivity curves as images were displaced from the reference registrations. This study highlights that the shape of a sensitivity curve is highly dependent on the selection of RTV; each RTV requires an investigation of metric sensitivity.

3.C.2. Effective displacement vs couch shift

The preferential direction of displacement was determined to be the A/P direction for both C-spine and mandible; for nine of 10 C-spine patients, the 3DoF couch shift corresponding to the A/P direction was the largest in magnitude. Similarly, for the mandible, six of 10 patients had their largest displacements in the A/P direction. For the remainder of patients, the A/P displacements were the second largest component of the 3DoF shifts after displacements in the S/I direction. We compared effective displacements in the A/P direction with the total couch displacements required to align the anatomy.
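The conversion between a metric value and an effective displacement, and the total couch shift it is compared against, can be sketched in Python as follows. The sensitivity curve values are hypothetical, and the sketch assumes a single monotonic branch of the curve (displacements on one side of the reference registration), which sidesteps the asymmetry discussed above. Linear interpolation via `np.interp` is also an assumption; the source says only that the curves were interpolated.

```python
import numpy as np

def total_couch_shift(dx, dy, dz):
    """Magnitude of a 3DoF couch shift (lateral, vertical, longitudinal), in mm."""
    return float(np.sqrt(dx**2 + dy**2 + dz**2))

def metric_at_displacement(disp_mm, curve_disp_mm, curve_metric):
    """Metric value on the sensitivity curve at a given displacement.

    curve_disp_mm must be increasing, as np.interp requires.
    """
    return float(np.interp(disp_mm, curve_disp_mm, curve_metric))

def effective_displacement(metric, curve_disp_mm, curve_metric):
    """Invert one monotonic branch of a sensitivity curve.

    Values outside the sampled metric range are clamped to the curve ends.
    """
    d = np.asarray(curve_disp_mm, dtype=float)
    m = np.asarray(curve_metric, dtype=float)
    order = np.argsort(m)                  # np.interp needs increasing x values
    return float(np.interp(metric, m[order], d[order]))

# Hypothetical A/P sensitivity curve for one patient (not measured data).
curve_disp = [0, 1, 2, 3, 4, 5, 6, 8, 10]                              # mm
curve_ncc = [0.95, 0.93, 0.89, 0.84, 0.78, 0.72, 0.66, 0.56, 0.48]

action_limit = metric_at_displacement(5.0, curve_disp, curve_ncc)      # NCC at 5 mm
eff_disp = effective_displacement(0.75, curve_disp, curve_ncc)         # mm
shift = total_couch_shift(3.0, 4.0, 0.0)                               # mm
```

A daily NCC below `action_limit` would correspond to an effective displacement beyond 5 mm on this hypothetical curve, which is the comparison the preliminary action limits rely on.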

FIG. 6. Population average and sigma control charts for the C-spine (a) and mandible (b). The C-spine charts show quality metrics from all 33 patients, while the mandible charts show those from patients who used the Precise Bite™ (22 patients). The hollow circles indicate the values when patient 23 was reregistered focusing on the C-spine RTV. [Color figure can be viewed at wileyonlinelibrary.com]

FIG. 7. Average sensitivity curves for the C-spine (a-c) and mandible (d-f) RTVs. The curve for each of the three directions (L/R, A/P, S/I) was obtained by displacing the CBCT images by a given distance. The shaded areas indicate 1σ. All 30 patients were analyzed for the C-spine curves, and 30 patients were analyzed for the mandible. [Color figure can be viewed at wileyonlinelibrary.com]
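Population average and sigma charts of the kind shown in Fig. 6 can be sketched with the textbook Shewhart mean and standard deviation chart, treating each patient's first five NCC values as one subgroup. This is an assumption about the construction, not a description of the calculation used in this study. The constants A3, B3, and B4 are the standard tabulated values for subgroups of five, and the function names and NCC values below are hypothetical.

```python
from statistics import mean, stdev

# Standard Shewhart control chart constants for subgroups of size n = 5.
A3, B3, B4 = 1.427, 0.0, 2.089

def population_limits(baseline):
    """Return mean-chart and sigma-chart limits from per-patient baselines.

    `baseline` holds one entry per patient, each containing that patient's
    first five NCC values.
    """
    if any(len(row) != 5 for row in baseline):
        raise ValueError("the constants above are valid only for five values per patient")
    patient_means = [mean(row) for row in baseline]
    patient_sds = [stdev(row) for row in baseline]  # sample standard deviation
    grand_mean = mean(patient_means)
    s_bar = mean(patient_sds)
    return {
        "mean_chart": (grand_mean - A3 * s_bar, grand_mean, grand_mean + A3 * s_bar),
        "sigma_chart": (B3 * s_bar, s_bar, B4 * s_bar),
        "patient_means": patient_means,
        "patient_sds": patient_sds,
    }

def flag_patients(baseline):
    """Indices of patients whose mean or sigma falls outside the chart limits."""
    lim = population_limits(baseline)
    m_lo, _, m_hi = lim["mean_chart"]
    s_lo, _, s_hi = lim["sigma_chart"]
    pairs = zip(lim["patient_means"], lim["patient_sds"])
    return [
        i for i, (m, s) in enumerate(pairs)
        if not (m_lo <= m <= m_hi) or not (s_lo <= s <= s_hi)
    ]

# Hypothetical NCC baselines: 19 similar patients and one with lower values.
typical = [0.90, 0.91, 0.92, 0.93, 0.94]
cohort = [typical[:] for _ in range(19)] + [[v - 0.20 for v in typical]]
print(flag_patients(cohort))
```

A patient flagged in this way is statistically different from the rest of the cohort; whether the difference matters clinically is a separate question.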

When the 3DoF couch displacements required to align the images were compared to effective A/P displacements, they showed a reasonable correlation, as shown in Fig. 8. For seven of 10 patients in both the C-spine and mandible cohorts, the agreement was within 1 mm. However, for one patient in the C-spine cohort, the A/P effective displacement underestimated the shift by 4.4 mm. Based on the 3DoF automatic registration, this patient required an additional 6.8 mm correction in the S/I direction on top of the 4.8 mm shift in the A/P direction. When the S/I and A/P effective displacements were averaged, the effective displacement agreed much better with the couch shift for this patient. This point is shown as a hollow circle in Fig. 8. The observed agreement between effective displacements and total couch shifts suggests that correlating similarity metrics with physical displacements may be reasonable when the preferential direction of displacement can be easily determined.

FIG. 8. Total couch shift compared to effective A/P displacement for ten registrations. Red circles and blue squares indicate C-spine and mandible registrations, respectively. The one-to-one line is shown as a solid black line, and the dashed lines indicate ±1 mm. A red hollow circle indicates the value obtained when the S/I and A/P effective displacements are averaged for the patient who also required a large longitudinal shift. [Color figure can be viewed at wileyonlinelibrary.com]

3.C.3. Action limits

Action limits define the threshold at which deviations are clinically significant. NCC values corresponding to 5 mm displacements from the population sensitivity curves are listed in Table I as preliminary action limits. Using a patient's own sensitivity curves to calculate 5 mm action limits for his/her treatments would be more accurate than applying population-based action limits. However, reregistering images using an RTV to calculate patient-specific sensitivity curves may not always be realistic because re-registration is resource intensive. In such a scenario, population-based action limits can be used as estimates.

TABLE I. NCC values corresponding to 5 mm displacements based on the population sensitivity curves. [Columns: C-spine, Mandible. Rows: 5 mm L/R, 5 mm A/P, 5 mm S/I. Values not recoverable from the transcription.]

4. DISCUSSION

4.A. Clinical benefits

We demonstrated a quantitative quality control tool for IGRT image review using similarity metrics in the SPC framework. While this study was retrospective, control charts like those described here could be used at the time of treatment to assist in making decisions about the acceptability of a particular setup. Figure 9 shows an example use of a control chart. The first five fractions comprise the baseline acquisition period, and their NCC values are monitored with respect to the population average (shaded blue region) calculated in Section 3.B.3. In this C-spine cohort, all patients achieved registrations within the population average at some point during the treatment course. The NCC values from the first five fractions are used to calculate the patient-specific mean (black dashed line) and capability limits (black solid line). Although a measurement that falls outside the process capability limits does not imply that the setup is unacceptable, the measurement is identified as a statistical outlier that warrants an investigation into its cause. Action limits define the threshold at which deviations become clinically significant. In the context of IGRT geometric uncertainty, the action limit depends on numerous parameters such as the CTV-PTV margin, direction and magnitude of anatomical displacement, local dose profile and gradient, and treatment modality.
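The review logic described above (a baseline of five fractions, patient-specific capability limits, and a separate action limit) can be sketched in Python as follows. How the capability limits were computed in this study is not stated in this section; the sketch assumes the common individuals-chart convention of the mean plus or minus three sigma, with sigma estimated from the mean moving range divided by 1.128. The function names, the NCC values, and the action limit of 0.80 are hypothetical placeholders.

```python
from statistics import mean

D2 = 1.128  # converts the mean moving range of consecutive pairs to a sigma estimate

def capability_limits(ncc_values, n_baseline=5):
    """Lower limit, center line, and upper limit from the baseline fractions."""
    baseline = list(ncc_values[:n_baseline])
    if n_baseline < 2 or len(baseline) < n_baseline:
        raise ValueError("not enough baseline fractions")
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    sigma = mean(moving_ranges) / D2
    return center - 3 * sigma, center, center + 3 * sigma

def review_fraction(ncc_value, lower, upper, action_limit):
    """Classify one registration against the capability and action limits."""
    if ncc_value < action_limit:
        return "action"        # immediate evaluation by an expert
    if ncc_value < lower or ncc_value > upper:
        return "investigate"   # statistical outlier, not necessarily unacceptable
    return "in control"

# Hypothetical NCC values for one patient; the action limit is a placeholder.
series = [0.90, 0.92, 0.91, 0.93, 0.92, 0.91, 0.86, 0.75]
lower, center, upper = capability_limits(series)
for fraction, value in enumerate(series[5:], start=6):
    print(fraction, value, review_fraction(value, lower, upper, action_limit=0.80))
```

The action limit is checked first because, for a patient with a highly variable setup, it need not lie below the lower capability limit.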
Action limits accounting for these clinical considerations require extensive analysis that is not within the scope of this paper. In this study, we investigated a preliminary action limit corresponding to a 5 mm effective displacement. Figure 9 shows the patient-specific action limit indicating a 5 mm effective displacement in the S/I direction with a red dash-dot line. Since the slope of the sensitivity curve in the S/I direction was less steep than that in the A/P or L/R directions for the C-spine, the S/I action limit corresponded to the conservative threshold for this patient. In this quality control framework, a measurement that falls outside the action limit warrants an immediate evaluation by an expert. Addition of control charts to the current image review process provides a secondary check of patient position. It is important to note that this study explored only translational displacements. In reality, displacements often occur in multiple directions and include rotations. Because we cannot account for

all potential combinations of deformations and displacements that may occur in patients, there remains the possibility of false negatives, in which an NCC value is within the action limit but the actual motion is outside the acceptable margin. Therefore, we recommend using this tool as a complement to the current registration evaluations. With this in mind, RTV tracking can also be employed during the off-line image verification process to alert physicians when image alignment differs significantly from that of previous fractions. This may reduce situations similar to the one discussed in Section 3.B.1, in which the off-line review process did not consistently detect mandible misalignments.

FIG. 9. An example of a control chart use. [Color figure can be viewed at wileyonlinelibrary.com]

4.B. Future work

This work was a proof-of-principle demonstration of the SPC framework for the image review process. Process control using similarity metrics and SPC as described in this manuscript is one way of applying SPC, and details could be further optimized. For example, the first five fractions were used to collect patient-specific baseline data in this study. This was an arbitrary choice with the goal of balancing statistical accuracy and limited sample size, and could be optimized through further investigation. Furthermore, the difference in sensitivity to the direction of displacement should be investigated further. If we can automatically identify the largest direction of displacement, this similarity metric-based image review tool may become more accurate in predicting effective displacements. Further development and optimization of this SPC framework are necessary. Other RTV tracking structures and similarity metrics should also be explored; one advantage of this tool is its ability to track changes in the body, including body surfaces, which can show signs of weight loss and swelling that may affect treatment indirectly. We have noticed that weight loss can be visually tracked, as shown in Fig. 10. Tracking the consistency of the body contour, which is closely tied to how the patient fits in the immobilization device and consequently to setup reproducibility, may help identify patients whose setup is inconsistent.

FIG. 10. An example of body surface tracking. (a) Patient-specific control chart where the black lines indicate capability limits. (b) Planning CT (magenta) and CBCT (green) overlay for the first and last fractions. [Color figure can be viewed at wileyonlinelibrary.com]

It is imperative to emphasize that the sensitivity of a similarity metric depends on the anatomy inside and around the RTV. The sensitivity of similarity metrics should be evaluated for each RTV. Furthermore, the desired sensitivity may depend on treatment modality; for example, stereotactic body radiation, which requires tighter tolerances, may be better tracked with MI, since daily setup variations are likely to be smaller than those seen in this analysis. Investigation of other treatment sites and similarity metrics is crucial in developing this tool further.

A limitation of this study is the small sample size. To obtain more statistically robust patient averages and variations, a larger patient cohort would be desirable. The presence of surgical clips and dental fillings caused image artifacts in some patients. While RTVs mostly avoided those artifacts, a more systematic study using a larger patient cohort would be necessary to differentiate similarity metric variations stemming from patient localization from those caused by inconsistent image quality. Implementation of SPC allows a continuing cycle of process monitoring and improvement, and data can be continuously accumulated for further analysis.

5. CONCLUSIONS

We have demonstrated a quality control framework for IGRT using a similarity metric-based registration tracking process. Cervical spine and mandible alignments were analyzed using NCC as a quality metric. Control charts showed varying degrees of setup consistency between patients, illustrated by patient-specific capability limits. The study also showed that the current subjective image review process was not always consistent in identifying mandible misalignment. Metric variations originating from anatomical differences were approximately one-half of the variations seen in treatment registrations, justifying population analysis using similarity metrics for these RTVs. Population control charts identified patients whose metrics were statistically different from those of other patients. When effective displacements in the anterior-posterior direction were compared to 3DoF couch displacements, the agreement was within 1 mm for seven of 10 patients for both C-spine and mandible. Preliminary action limits were calculated assuming a 5 mm unidirectional displacement, corresponding to a CTV-PTV margin used in our clinic. In short, current qualitative quality control methods for reviewing image registration can lead to inconsistent feedback to the IGRT process.
Tracking of a similarity metric with respect to the capability limits allows identification of statistically significant deviations. Comparison of the similarity metric with action limits aids in identifying registrations where further evaluation by an expert is warranted. When used in conjunction with the current image review process, this framework can objectively assist in improving safety and consistency of the IGRT process.

ACKNOWLEDGMENT

This work was partially supported by Varian Medical Systems.

CONFLICT OF INTEREST

The authors have no conflicts to disclose.

a) Author to whom correspondence should be addressed. Electronic mail: fongdelossantos.luis@mayo.edu.