Investigating Common-Item Screening Procedures in Developing a Vertical Scale


Annual Meeting of the National Council on Measurement in Education
New Orleans, LA

Marc Johnson
Qing Yi

April 2011

Abstract

Creating a vertical scale involves several decisions about assessment designs and statistical analyses to determine the most appropriate vertical scale. This research study investigates common item stability check procedures used to arrive at vertical linking item sets, which produce the constants needed for computing vertical theta (ability) estimates and scale scores on a vertical scale metric. The research reported in this paper examines the phenomenon of common items (across adjacent levels) that have lower difficulty estimates ("easier") at the lower level than at the upper level, and the vertical scales that result. A major finding of this research is that the presence of linking items that appeared to be easier at the lower level than at the upper level can still lead to patterns of increasing achievement growth from the lowest level of the scale to the highest level.

Investigating Common-Item Screening Procedures in Developing a Vertical Scale

Introduction

Vertical scaling is a process of placing scores from tests that measure similar constructs, but at different educational levels, onto a common scale (Kolen & Brennan, 2004). Vertical scales, therefore, can be thought of as a progression of scale scores used to monitor academic achievement across age/grade levels (hereafter, levels). The need for vertical scales has received much attention in the past decade because No Child Left Behind (NCLB) requires that assessment programs track academic progress. Despite the prevalence of vertical scales in assessment, at both the national and state levels, the methodologies used to derive vertical scales are numerous and often produce different results. In deriving vertical scales, practitioners must choose among scaling methodologies (e.g., item response theory (IRT), Thurstone scaling), vertical linking strategies across levels (e.g., concurrent, separate level-groups, level-by-level), and scaling designs (e.g., scaling test, common items across levels, equivalent groups design). There are other factors that should be considered when designing a vertical scale, and several studies have analyzed how various combinations of these factors affect the resulting vertical scales (Ito, Sykes, & Yao, 2008; Tong & Kolen, 2007). These studies have not provided clear guidance on which combination of factors produces the best vertical scales. In practice, practitioners of vertical scales, or those interested in designing them, derive appropriate vertical scales by analyzing how combinations of these factors affect the vertical scales in relation to the expectation of growth within individual assessment programs.

One factor that deserves more attention in vertical scaling is the set of items ultimately used to create the vertical link among levels. In other words, vertical scales are created via a set of items, regardless of the scaling design, that are responded to by examinees at differing levels. In the common item approach, vertical linking items are assessed within on-level test forms as well as within off-level test forms. Within the equivalent groups design, examinees can be randomly assigned to respond to either an on-level test or an off-level test. With a scaling test design, however, examinees respond to a test that consists of all vertical linking items across all levels. The scaling test is in addition to an on-level test from which scores are linked to the scaling test. In practice, examinee performance on the vertical linking items is compared between the off-level and on-level examinees. This comparison can result in items being removed from the vertical linking item set prior to the construction of vertical scales (analogous to common item screening in horizontal equating). Common item screening methodologies used in vertical linking studies can be the same procedures found in horizontal equating (e.g., Robust Z analysis, perpendicular distance). However, the assumptions about item instability differ in the vertical linking context from those of conventional horizontal equating practices. In vertical linking, the vertical linking items are expected to exhibit a differential in performance between on-level and off-level examinees, whereas that expectation is irrelevant in horizontal equating studies. This raises the question of whether the common item screening methodologies used in horizontal equating are appropriate within vertical linking contexts. Should items be removed at all in vertical linking studies when a differential in performance between on-level and off-level examinees exists?

The research interest expressed in this paper involves examining common item screening methodologies for vertical linking items and the impact of removal decisions on vertical scales. In other words, this study investigates different procedures for adjusting vertical linking item sets and how these decisions affect the resulting vertical scales. As already stated, some item performance differential is expected in vertical linking studies, but this study investigates varying degrees of this expectation and the justifiable decisions that can be made based on the empirical differential in item performance.

Linking Items in Equating

In practice, horizontal equating (statistically placing a test form onto a particular measurement scale) is often accomplished through a set of items designated as linking items. When a test form is being placed onto the measurement scale of another test form, the linking items are those items that are common to both test forms. When a test form is being placed onto the measurement scale of an item pool, however, the linking items can be either all scored test items or a subset of the items. In either situation, a measurement link is established that allows a test form to be placed onto the same scale as a previous test form or the item pool. The selection of linking items, when they represent only a subset of the tested items, has been considered critical to the design of horizontal equating studies, and guidelines have been established that continue to be used in the psychometric analyses of large scale assessment programs. These guidelines address test content representation relative to the entire test form, the position of the linking items throughout the test, the number of linking items in relation to the total number of test items, and the statistical properties of the intended linking items, usually based on past performance. Although important, a detailed discussion of these guidelines is beyond the scope of this research study, and readers are referred to texts that discuss these

guidelines in more detail (Klein & Jarjoura, 1985; Wingersky, Cook, & Eignor, 1987).

Vertical linking can be accomplished through a variety of methods. One method is through the use of linking items, analogous to horizontal equating. When used as common items across adjacent levels, vertical linking item sets will mostly consist of items that students at the adjacent levels can respond to correctly. The linking item guidelines of horizontal equating, mentioned above, are applicable in the vertical linking context so that a strong measurement link can be established that will foster a reasonable scale of growth across all levels. The scaling test method of vertical linking, however, relies on examinees responding to an on-level test as well as a test that consists of items spanning all levels (the scaling test; Kolen & Brennan, 2004).

Linking Item Performance in Equating

Stability Check Procedures

When using linking items to determine a measurement link between test forms or between a test form and an item pool, the item statistics are analyzed and compared between previous item statistics and newly obtained statistics. Under the Rasch model, the IRT statistics can be compared through procedures such as the Robust Z analysis (Huynh, Gleaton, & Seaman, 1992), the perpendicular distance mentioned earlier, and the 0.3-logit difference procedure (Miller, Rotou, & Twing, 2004). All of these procedures (discussed below), referred to as item stability checks, aim at identifying the items that show a greater than expected difference between the old and new statistics, each with its own criterion of acceptable difference. In practice, the items identified at this stage are considered for removal from the linking item set before the final measurement link is established and raw scores are scaled to scale scores. However, there are guidelines for how common items are removed from the linking item set under each procedure.

Robust Z Statistic

The Robust Z statistic is determined through the following formula:

z = \frac{(b_{i1} - b_{i2}) - M_d}{0.74 \times IQR},

where b_{i1} is one difficulty estimate for a given linking item, b_{i2} is the other estimated item difficulty for that linking item, M_d is the median of the differences for all potential linking items, and IQR is the interquartile range of the differences for all potential linking items. In contrast, traditional z statistics are computed as z = (score - mean)/standard deviation, which can be affected by outliers. The Robust Z statistic was designed to be robust against outliers in its calculation. Also, evaluating the Robust Z statistic against a predetermined value alone does not provide the mechanism for removing unstable linking items. This procedure also incorporates the ratio of standard deviations and the correlation of the two sets of item difficulty estimates to determine whether linking items should be dropped. The full procedure is outlined in Appendix A.

0.3-Logit Difference

This procedure identifies items that have an absolute difference between item difficulty estimates of 0.3 logit or greater. These items are considered for removal from the linking item set following standard guidelines for removing items.

Perpendicular Distance

Based on the delta-plot method (Angoff, 1972; Dorans & Holland, 1993) of item difficulty differences, the perpendicular distance procedure evaluates the standard deviation of the perpendicular distances to the line of best fit. Although this method has been applied to differences in proportion-correct values (item p-values; Karkee & Choi, 2005), the research study presented in this paper uses it to evaluate differences in Rasch item difficulty values. Also, the computation of the statistics for this procedure is slightly different from what was

presented by Karkee and Choi, based on the application of this procedure to equating studies for large-scale assessment programs. As computed here, the perpendicular distance is

D = \frac{[A I_1 - I_2 + B]}{\sqrt{A^2 + 1}},

where I_1 and I_2 signify the item difficulty estimates at the two levels;

A = \frac{(\sigma_2^2 - \sigma_1^2) + \sqrt{(\sigma_2^2 - \sigma_1^2)^2 + 4 r_{12}^2 \sigma_1^2 \sigma_2^2}}{2 r_{12} \sigma_1 \sigma_2},

which includes the variances (\sigma_1^2 and \sigma_2^2), standard deviations (\sigma_1 and \sigma_2), correlation (r_{12}), and squared correlation (r_{12}^2) of the item difficulty sets; and B = \mu_2 - A \mu_1, which includes the means of the item difficulty sets (\mu_1 and \mu_2). For this research study, the perpendicular distance for each linking item is transformed into a z-value by

z_D = \frac{D - \bar{D}}{\sigma_D},

where \bar{D} is the mean distance and \sigma_D is the standard deviation of the distances. From this, any linking item with a z-value greater than 3.0 is removed. It should be pointed out, though, that linking items removed with this procedure are removed one at a time, leading to a recalculation of the distance estimates for the remaining linking items after each removal.
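To make the three screening procedures just described concrete, the sketch below applies each check to a pair of Rasch difficulty vectors for the same linking items (one estimate per adjacent level). It is a minimal illustration, not the study's code; the function names, the use of NumPy, and the sample-statistic conventions (n-1 denominators) are assumptions.

```python
import numpy as np

def robust_z(b1, b2):
    """Robust Z per linking item: [(b1 - b2) - median(d)] / (0.74 * IQR(d))."""
    d = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    iqr = np.percentile(d, 75) - np.percentile(d, 25)
    return (d - np.median(d)) / (0.74 * iqr)

def logit_difference_flags(b1, b2, threshold=0.3):
    """Flag items whose absolute difficulty difference is 0.3 logit or greater."""
    return np.abs(np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)) >= threshold

def perpendicular_distance_z(b1, b2):
    """z-transformed perpendicular distances to the principal-axis (delta-plot) line."""
    b1 = np.asarray(b1, dtype=float)
    b2 = np.asarray(b2, dtype=float)
    s1, s2 = b1.std(ddof=1), b2.std(ddof=1)
    r = np.corrcoef(b1, b2)[0, 1]
    # Principal-axis slope A and intercept B of the line of best fit
    A = ((s2**2 - s1**2) + np.sqrt((s2**2 - s1**2)**2 + 4 * r**2 * s1**2 * s2**2)) / (2 * r * s1 * s2)
    B = b2.mean() - A * b1.mean()
    D = np.abs(A * b1 - b2 + B) / np.sqrt(A**2 + 1)
    return (D - D.mean()) / D.std(ddof=1)   # items with z > 3.0 would be flagged
```

As described above, the perpendicular distance check would be applied iteratively in practice, removing the worst item and recomputing the distances after each removal.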

Horizontal vs. Vertical Linking

For horizontal equating, there is often the expectation that the linking items will perform similarly to their most recent test administration, so the item stability checks should not result in any linking items being dropped prior to the scaling of raw scores. However, the unpredictable nature of testing and student responses can result in items showing large differences in performance across test administrations. In those cases, the stability checks result in items being dropped from the linking set so that these items will not impact the measurement link that is being sought for the purposes of equating.

Vertical linking, however, presents a slight challenge to the idea of item stability used in horizontal equating. In vertical linking, the expectation for common items is that items presented to students at two different (mainly adjacent) levels will appear easier at the higher level than at the lower level. This can be summarized as items performing better at the higher level than at the lower level. Item stability checks are still appropriate in this situation, to monitor these differences and to investigate closely those differences that are greater than expected. Therefore, as with horizontal equating, stability procedures can result in vertical linking items being removed from the analysis prior to a vertical link being established. Within vertical linking, though, items used as vertical links can display better performance at the lower level than at the higher level. In other words, the items are easier for the lower level students than for the higher level students. There are several reasons this may occur, but the relevance of this phenomenon to the current paper is the appropriate handling of these instances in creating a vertical scale. This introduces a dilemma in creating vertical scales, since the goal of these scales is to show a progression of achievement from one level (e.g., level 1) through another (e.g., level 6). Anomalies in item performance between two adjacent levels may limit the perception of the progression of achievement.

Purpose of Research

One aspect analyzed in particular is the case in which vertical linking items show better performance at a lower level than at a higher level, to the degree that linking items would be removed from the linking set. The expectation is that items presented at a higher level will yield lower item difficulty estimates than when those items are administered at a lower level; those items should appear easier at the higher level than at the lower level. However, items may perform better at a lower level than at the higher level. This may affect the

construction of vertical scales, since the goal of these scales is to show a progression of achievement from one level (e.g., level 1) through another (e.g., level 6). Anomalies in item performance between two levels may limit the progression of achievement. This situation is discussed because typical common item screening methodologies may not lead practitioners to discard from the vertical linking set items that perform better at a lower level, thus leaving items in the linking set that do not fully comply with expectations. This research study investigated various common item screening methodologies to determine linking item sets that will lead to the development of vertical scales. The primary goal of this research study was to analyze the pattern of vertical scales among the different common item screening methodologies, noting how anomalies in item performance across adjacent levels affect the trajectories of growth.

Method

Student data from a large scale assessment program were used in this study. These data, obtained through the common-item non-equivalent-groups design (Young, 2006; Kolen & Brennan, 2004), reflected student responses to on-level test forms that included off-level items ("vertical linking items") according to the design shown in Table 1. As shown in Table 1, the off-level items for level 1 came from the level 2 test only, while the level 6 test included off-level items from only the level 5 test. However, for levels 2 through 5, each test included off-level items from one level above and one level below. With this design, 36 items were classified as vertical linking items among adjacent levels. All items were calibrated to the Rasch measurement model with the WINSTEPS software (Linacre, 2007). The non-linking items were calibrated first and then used as anchors for the calibration of the vertical linking items. The Rasch item difficulty estimates of the common

items were analyzed across adjacent levels, and differences between these estimates were examined through multiple item screening procedures: Robust Z, perpendicular distance, and 0.3-logit difference. The goal of this screening investigation was to (1) identify vertical linking items that show a substantial differential in examinee performance across adjacent levels and (2) note the occurrence of items performing better at the lower level than at the higher level and how inclusion of these items affects the vertical scales. This research study used two conditions of linking item removal: directional and non-directional. The directional approach removed only those linking items that were, in fact, easier at the lower level than at the higher level, while the non-directional approach removed linking items based on the results of the item stability procedures, regardless of whether or not the items were easier at the lower level. Also, the maximum number of linking items that could be removed within any research condition was set at 7, approximately 20% of the original linking item set. This maximum percentage is widely used in practice. From the item screening methodologies, vertical linking constants were computed as the difference in mean Rasch difficulty estimates between two adjacent levels. These constants were added cumulatively to on-level theta estimates (obtained through WINSTEPS), using level 3 as the base scale. For example, for level 1, the vertical linking constant computed between level 1 and level 2 and the vertical linking constant between level 2 and level 3 were added to each theta estimate in level 1, placing all level 1 theta estimates onto the scale of level 3.
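As a rough sketch of this chaining step (not the study's code), the following computes an adjacent-level constant from the retained common items and places on-level thetas onto the level 3 base scale. The sign convention and the handling of levels above the base are assumptions, since the paper spells out only the level 1 example.

```python
import numpy as np

def linking_constant(b_lower_level, b_upper_level):
    """Vertical linking constant for a pair of adjacent levels: the difference in
    mean Rasch difficulty of the retained common items (sign convention assumed)."""
    return float(np.mean(b_lower_level) - np.mean(b_upper_level))

def theta_to_base(theta, level, constants, base_level=3):
    """Place on-level theta estimates onto the base-level scale by cumulatively
    adding adjacent-level constants, e.g. level 1 thetas get c(1,2) + c(2,3).
    `constants[(k, k + 1)]` is the constant linking level k to level k + 1.
    Levels above the base accumulate the constants with reversed sign (assumed)."""
    theta = np.asarray(theta, dtype=float)
    shift = 0.0
    if level < base_level:
        for k in range(level, base_level):
            shift += constants[(k, k + 1)]
    elif level > base_level:
        for k in range(base_level, level):
            shift -= constants[(k, k + 1)]
    return theta + shift
```

For instance, theta_to_base(level1_thetas, level=1, constants=c) reproduces the level 1 example described above.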

From the adjusted theta estimates, vertically linked scale scores (reportable scores on the vertical scale) were derived using a procedure outlined by Kolen and Brennan (2004), which uses a unique slope and intercept for each level. Using this approach, slopes were determined by

slope = \frac{sc(y_2) - sc(y_1)}{y_2 - y_1},

where sc(y_2) is the desired scale score mean for the highest level of the vertical scale (e.g., level 6), sc(y_1) is the desired scale score mean for the lowest level of the vertical scale (e.g., level 1), y_2 is the vertically linked ability estimate for the highest level corresponding to a cumulative percent of 75% (of the population of ability estimates), and y_1 is the vertically linked ability estimate corresponding to a cumulative percent of 75% for the lowest level. For this research study, 250 was used as the desired mean scale score for level 6, while 200 was used for level 1, as proposed by the authors of this approach. Intercepts under this approach were determined by

intercept = sc(y_1) - \frac{sc(y_2) - sc(y_1)}{y_2 - y_1} \, y_1,

where the terms are defined as they were for computing the slopes.
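A minimal sketch of this linear transformation follows, using the desired scale score means of 200 and 250 as read from the description above; the function and parameter names are illustrative rather than taken from the study.

```python
import numpy as np

def scale_transformation(theta_low_p75, theta_high_p75, sc_low=200.0, sc_high=250.0):
    """Slope and intercept for converting vertically linked thetas to scale scores.
    theta_low_p75 / theta_high_p75 are the vertically linked ability estimates at the
    75th cumulative percent for the lowest and highest levels; sc_low / sc_high are
    the desired mean scale scores for those levels."""
    slope = (sc_high - sc_low) / (theta_high_p75 - theta_low_p75)
    intercept = sc_low - slope * theta_low_p75
    return slope, intercept

def to_scale_scores(theta, slope, intercept):
    """Apply the linear transformation to vertically linked theta estimates."""
    return slope * np.asarray(theta, dtype=float) + intercept
```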

Evaluation

For each item screening methodology, descriptive statistics of the vertical scale scores were plotted across levels to display average performance across levels, and effect sizes were computed and plotted across levels to provide an index of the separation of scale score distributions among adjacent levels. Following Kolen and Brennan (2004), the effect size index was computed as

es = \frac{\mu(Y)_{upper} - \mu(Y)_{lower}}{\sqrt{(\sigma^2(Y)_{upper} + \sigma^2(Y)_{lower})/2}},

where \mu(Y)_{upper} is the mean scale score for the upper level, \mu(Y)_{lower} is the mean scale score of the lower level, \sigma^2(Y)_{upper} is the variance for the upper level, and \sigma^2(Y)_{lower} is the variance of the lower level. Also, the vertical linking item sets were compared across the item screening methodologies to analyze the number of items used to create the vertical link that, in fact, performed better at the lower level than at the higher level.
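A minimal sketch of this effect-size index (illustrative names, n-1 variances assumed):

```python
import numpy as np

def effect_size(scores_upper, scores_lower):
    """Separation of adjacent-level scale score distributions: mean difference
    divided by the square root of the average of the two variances."""
    upper = np.asarray(scores_upper, dtype=float)
    lower = np.asarray(scores_lower, dtype=float)
    pooled_sd = np.sqrt((upper.var(ddof=1) + lower.var(ddof=1)) / 2.0)
    return (upper.mean() - lower.mean()) / pooled_sd
```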

Results

Table 2 shows the number of items removed through each item stability screening procedure within each research condition, the number of items retained that were easier at the lower level, and the vertical linking constants computed with the final linking item sets. Both the directional and non-directional approaches resulted in vertical linking items that were easier at the lower level but were kept for the computation of the vertical linking constants. In the cases where the number of linking items removed was less than 7 (the maximum allowed), the remaining linking items were not flagged as problematic by the item stability check procedures. However, in the cases where seven linking items were removed from the linking set, some of the remaining linking items were flagged as problematic but could not be removed without violating the maximum allowed for removal. It should be pointed out that the perpendicular distance procedure resulted in the same vertical level linking constants under both research conditions (directional and non-directional). In both of these cases, no linking items were flagged for removal at any level. This result manifests itself throughout the rest of the results of this study and is discussed again in the conclusion section of this paper.

Table 3 shows the averages of the theta estimates after the vertical linking constants were applied as previously outlined. From these results, a few things are worth pointing out that will manifest themselves throughout the rest of the results. First, the mean theta estimates increase from level 1 to level 6 for all conditions/procedures except for the non-directional approach with the 0.3-logit difference procedure. In this condition, the average theta estimate for level 3 (the base level) is slightly lower than that of level 2, and the average theta estimate for level 5 is lower than that of level 4. Second, with the exception of the perpendicular distance procedure, the mean theta estimates for each item stability check procedure under the non-directional condition cover a smaller range and are smaller in magnitude than those under the directional condition. Third, the 0.3-logit difference procedure under the non-directional condition resulted in the smallest (in magnitude) mean theta estimates, with a level 6 mean theta that, while the highest value of all levels within this procedure, is the smallest mean theta estimate for level 6 across all research conditions.

Table 4 shows the slopes and intercepts derived for each study condition. The slope and intercept values for the 0.3-logit difference procedure under the non-directional approach to removing linking items were much higher than the other values. This was because the vertically linked theta estimates corresponding to a cumulative percent of 75% for level 1 (used in the calculation, but not presented in this paper) were much higher for this condition than for the others, a product of the vertical level linking constants from Table 2.

Table 5 shows the mean scale scores after the theta transformation using the derived slopes and intercepts, and Figure 1 shows the mean scale scores across levels for each condition/procedure. It should be noted that there were some negative minimum scale scores, especially for the Robust Z and 0.3-logit difference procedures under the non-directional condition, which had negative minimum scale scores for every level. Consistent with the results for the mean theta

estimates, the mean scale scores for the non-directional condition are lower than those for the directional condition, with the lowest mean scale scores coming from the 0.3-logit difference procedure.

Table 6 and Figure 2 show the standard deviations of the scale scores across levels for each condition/procedure. Here, the standard deviations from the non-directional condition were much higher than those from the directional condition. In addition, the 0.3-logit difference procedure under the non-directional approach resulted in standard deviations that were approximately five times greater than those from the directional approach, indicating greater variability among the scale scores.

As mentioned earlier, effect sizes provide an index of the separation of scale score distributions among adjacent levels. Table 7 and Figure 3 depict the effect sizes computed for this research study. As the figure shows, the pattern of effect sizes is relatively consistent across conditions/procedures, but with large jumps from the effect size between levels 4 and 5 to that between levels 5 and 6. The majority of the effect sizes can be considered small, equal to or greater than 0.2 but less than 0.5 (Cohen, 1988). However, the 0.3-logit difference procedure under the non-directional condition resulted in some negligible effect sizes, less than 0.2, essentially indicating no separation of scale score distributions between adjacent levels.

Discussion

As can be seen throughout the results of this research study, the strategy for removing linking items during stability checks (directional vs. non-directional), as well as the item stability screening procedure itself, affects the items removed and, subsequently, the linking constants obtained for developing a vertical scale. More importantly, the results in Table 2 indicate that each vertical scale created within this research study was created with several vertical linking items that were easier at the lower level than at the upper level. The data used for this study

provided an invaluable opportunity to investigate how this item performance phenomenon manifests itself in vertical scales. In Figure 1, the mean scale scores increase from level 1 to level 6 for all research conditions except the 0.3-logit difference procedure under the non-directional approach to removing linking items. The vertical trend for this research condition shows no growth from level 1 to level 2 and an unexpected decrease from level 4 to level 5, followed by a large growth from level 5 to level 6. The variance of the scale scores for this stability screening procedure is also much higher than for the other research conditions. It can be inferred from the results of this study that the 0.3-logit difference procedure, at least under the non-directional approach, was affected by the presence of linking items that were easier at the lower level than at the upper level, given that some unstable items that were easier at the upper level had been removed. The negligible effect sizes from the 0.3-logit difference procedure under the non-directional approach provide further evidence of the pattern of growth shown in Figure 1. This inferred effect of the linking item set might cause practitioners some concern about adopting this option to create a vertical scale.

Limitations

Although the results of this study provide promising applications to future vertical scale development, there are limitations worth mentioning. First, these data were not compared against linking item sets in which the items common to adjacent levels are easier at the upper level and more difficult at the lower level, the general expectation of common-item performance across adjacent levels. This comparison would shed more insight into whether this phenomenon is a major concern when developing a vertical scale through assessment items. Another

limitation of this research study is the criteria used for identifying and removing unstable linking items. As previously mentioned, the criteria used for this study reflect criteria used in linking studies performed for operational large-scale assessment programs, which are sometimes different from the originally published criteria. A third limitation of the study is that only one data set was included in the research, which may limit the generalizability of the results.

References

Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Huynh, H., Gleaton, J., & Seaman, S. P. (1992). Technical documentation for the South Carolina high school exit examination of reading and mathematics: Paper No. 2 (2nd ed.). Columbia, SC: University of South Carolina, College of Education.

Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21.

Karkee, T., & Choi, S. (2005, April). Impact of eliminating anchor items flagged from statistical criteria on test score classifications in common item equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Linacre, J. M. (2007). A user's guide to WINSTEPS MINISTEP Rasch-model computer programs. Chicago, IL.

Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits screening criterion in common item equating. Journal of Applied Measurement, 5(2).

Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2).

Wingersky, M. S., Cook, L. L., & Eignor, D. R. (1987). Specifying the characteristics of linking items used for item response theory item calibration (ETS Research Report). Princeton, NJ: Educational Testing Service.

Young, M. J. (2006). Vertical scaling. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.

Table 1. Common Item Design for Developing a Vertical Scale

Item source    | Level 1 form      | Level 2 form      | Level 3 form      | Level 4 form      | Level 5 form      | Level 6 form
Level 1 items  | Level 1 on-level  | Level 1 off-level |                   |                   |                   |
Level 2 items  | Level 2 off-level | Level 2 on-level  | Level 2 off-level |                   |                   |
Level 3 items  |                   | Level 3 off-level | Level 3 on-level  | Level 3 off-level |                   |
Level 4 items  |                   |                   | Level 4 off-level | Level 4 on-level  | Level 4 off-level |
Level 5 items  |                   |                   |                   | Level 5 off-level | Level 5 on-level  | Level 5 off-level
Level 6 items  |                   |                   |                   |                   | Level 6 off-level | Level 6 on-level

Table 2. Removed/Retained Linking Items and Vertical Linking Constants
Columns: Condition; Procedure; Level; Items Removed; Items Kept (performed better at lower level); Vertical Level Linking Constant. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Table 3. Mean Vertical Linking Theta Estimates
Columns: Level 1 through Level 6. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Table 4. Slope and Intercept Values for Scale Transformation
Columns: Slope; Intercept. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Table 5. Mean Vertical Linking Scale Scores
Columns: Level 1 through Level 6. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Table 6. Standard Deviation of Vertical Linking Scale Scores
Columns: Level 1 through Level 6. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Table 7. Vertical Scale Effect Sizes
Columns: Level 1/Level 2; Level 2/Level 3; Level 3/Level 4; Level 4/Level 5; Level 5/Level 6. Rows: the Robust Z, Perpendicular Distance, and 0.3-Logit Difference procedures under the Directional and Non-Directional conditions.

Figure 1. Mean Vertical Linking Scale Scores (mean scale score plotted by level for each condition/procedure combination).

Figure 2. Standard Deviation of Vertical Linking Scale Scores (standard deviation of scale scores plotted by level for each condition/procedure combination).

Figure 3. Vertical Scale Effect Sizes (effect size plotted for each adjacent-level pair, L1/L2 through L5/L6, for each condition/procedure combination).

Appendix A

Robust Z Stability Check Guidelines

1. Calculate the mean and standard deviation of both sets of item difficulties for all linking items.
2. Calculate the ratio of the standard deviations.
3. Calculate the correlation between the two sets of item difficulties.
4. Calculate the Robust Z statistic for each linking item and flag all linking items with an absolute Robust Z value greater than 1.645.
5. The ratio of the standard deviations (from step 2) must be between 0.9 and the prescribed upper bound.
6. The correlation (from step 3) must be at least the prescribed minimum.
7. If the ratio of standard deviations or the correlation is outside of the prescribed bounds, remove the item whose absolute Robust Z value is the largest and is greater than 1.645.
8. Recompute the ratio of standard deviations and the correlation with the remaining linking items.
9. Continue dropping items in a stepwise fashion until the ratio of standard deviations and correlation are within the prescribed bounds, there are no items left with a Robust Z greater than 1.645, or 20% of the linking set has been dropped. Note that the Robust Z values are not recalculated each time; only the ratio of standard deviations and the correlation are.
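A sketch of this stepwise logic follows; it is illustrative only, not the procedure's reference implementation. The 1.645 cutoff and the 20% cap come from the steps above, while the standard-deviation-ratio and correlation bounds are left as parameters because their exact values are not given here.

```python
import numpy as np

def robust_z_screening(b_old, b_new, z_cut=1.645, sd_ratio_bounds=(0.9, None),
                       min_correlation=None, max_drop_fraction=0.20):
    """Stepwise Robust Z screening following Appendix A. Items are dropped one at a
    time (largest |Robust Z| above the cutoff) until the SD ratio and correlation of
    the remaining items fall within the prescribed bounds, no flagged items remain,
    or the maximum fraction of the linking set has been dropped. The Robust Z values
    are computed once and not recalculated after removals, as the appendix notes."""
    b_old = np.asarray(b_old, dtype=float)
    b_new = np.asarray(b_new, dtype=float)
    d = b_old - b_new
    iqr = np.percentile(d, 75) - np.percentile(d, 25)
    z = (d - np.median(d)) / (0.74 * iqr)

    keep = np.ones(len(d), dtype=bool)
    max_drops = int(max_drop_fraction * len(d))

    def within_bounds(mask):
        ratio = b_new[mask].std(ddof=1) / b_old[mask].std(ddof=1)
        corr = np.corrcoef(b_old[mask], b_new[mask])[0, 1]
        lo, hi = sd_ratio_bounds
        ratio_ok = (lo is None or ratio >= lo) and (hi is None or ratio <= hi)
        corr_ok = min_correlation is None or corr >= min_correlation
        return ratio_ok and corr_ok

    drops = 0
    while drops < max_drops and not within_bounds(keep):
        flagged = np.where(keep & (np.abs(z) > z_cut))[0]
        if flagged.size == 0:
            break
        worst = flagged[np.argmax(np.abs(z[flagged]))]
        keep[worst] = False
        drops += 1
    return keep, z
```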