
RESEARCH REPORT

COMPUTERIZED MASTERY TESTING WITH NONEQUIVALENT TESTLETS

Kathleen Sheehan
Charles Lewis

Educational Testing Service
Princeton, New Jersey
August 1990

Computerized Mastery Testing With Nonequivalent Testlets

Kathleen Sheehan and Charles Lewis
Educational Testing Service
Princeton, NJ
August 1990

*This research was supported by Educational Testing Service through the Program Research Planning Council.

Copyright © 1990, Educational Testing Service. All Rights Reserved.

Computerized Mastery Testing With Nonequivalent Testlets

Abstract

A practical procedure for determining the effect of testlet nonequivalence on the operating characteristics of a testlet-based computerized mastery test (CMT) is introduced. The procedure involves estimating the CMT decision rule twice, once with testlets treated as equivalent and once with testlets treated as nonequivalent. In the equivalent testlet mode, the likelihood functions estimated for specific number-right scores are assumed to be constant across testlets and a single set of cutscores is estimated for all testlets. In the nonequivalent testlet mode, the likelihood functions estimated for specific number-right scores are allowed to vary from one testlet to another and a different set of cutscores is estimated for each permutation of testlet presentation order. Small differences between the estimated operating characteristics of the equivalent testlet decision rule and the nonequivalent testlet decision rule indicate that the assumption of equivalent testlets was warranted. The procedure is illustrated with data from the Architect Registration Examination, a professional certification examination administered by the National Council of Architectural Registration Boards (NCARB).

Keywords: Bayesian methods, computerized mastery testing, decision theory, item response theory.

Computerized Mastery Testing With Nonequivalent Testlets

In 1988, the National Council of Architectural Registration Boards (NCARB) commissioned ETS to develop a computerized mastery test (CMT) for use in its architectural certification process. Because NCARB was interested in reducing total testing time, ETS designed a test with an adaptive stopping rule. In this new test, items are administered to examinees in randomly selected blocks called testlets. After each testlet is administered, a decision is made either to classify the examinee (as either a master or a nonmaster) or to administer another testlet. The decision rule is specified in terms of the cumulative number of correct responses obtained by the examinee at the end of each stage of testing. Concretely, examinees with low cumulative number correct scores are failed, those with high cumulative number correct scores are passed, and those with scores indicating an intermediate level of mastery are required to respond to an additional testlet. The cutoff values to which examinees' scores are compared are selected to minimize expected posterior loss. These values are estimated prior to the actual test administration using a procedure based on Item Response Theory (IRT) combined with Bayesian decision theory (Lewis and Sheehan, in press).

Implementation of this new CMT begins with the creation of a pool of testlets. In the original Lewis and Sheehan study, all testlets were assumed to be parallel, that is, composed of the same number of items and equivalent with respect to content coverage and the likelihood of particular number-right scores at proficiency levels located near the master/nonmaster cutoff value.

The assumption of equivalent testlets serves three purposes. First, since the number of testlets administered to each examinee is variable, the fact that all testlets are parallel insures that the tests administered to different examinees are equally difficult and cover the same content areas, even when they are not the same length; second, since there are only a finite number of testlets in the pool, use of equivalent testlets minimizes the impact of the fact that sampling must be performed without replacement; and third, the equivalent testlet design simplifies the computations needed to determine the optimal decision rule. In many testing situations, it will not be difficult to create a pool of testlets which meets the first two equivalence criteria, i.e., equal testlet length and equivalent content coverage. However, the requirement of equivalent likelihoods for particular number-right scores at proficiency levels located near the master/nonmaster cutoff value may be considerably more difficult to satisfy. Thus, a procedure for determining the effect of testlet nonequivalence on the operating characteristics of a particular CMT decision rule is needed. This paper introduces such a procedure and illustrates its use with two different testlet pools from the NCARB program.

The procedure for evaluating testlet equivalence introduced in this paper assumes that the first two equivalence criteria have been met and considers testlet nonequivalence with respect to the third criterion only. The procedure involves estimating the CMT decision rule twice, once with testlet likelihoods treated as equivalent and once with testlet likelihoods treated as nonequivalent. In the equivalent mode, the likelihood functions estimated for specific number-right scores are assumed to be constant across testlets and a single set of cutscores is estimated for all testlets. In the nonequivalent

mode, the likelihood functions estimated for specific number-right scores are allowed to vary from one testlet to another and a different set of cutscores is estimated for each permutation of testlet presentation order. Concretely, cutscores associated with more difficult testlets will be somewhat lower, while those for easier testlets will be somewhat higher. Less discriminating testlets will have passing and failing cutscores which are further apart, while those for more discriminating testlets will be closer together. In the final step of the procedure, the two CMT decision rules are applied to a set of simulated data and the results are tabulated in terms of the percent of simulees classified as masters at each proficiency level considered. These data are then used to estimate the operating characteristics of the equivalent testlet CMT and the nonequivalent testlet CMT. The operating characteristics which will generally be of interest include: the average test length; the expected number of false positive decisions; the expected number of false negative decisions; and the overall pass rate. Small differences between the operating characteristics of the equivalent testlet CMT and those of the nonequivalent testlet CMT indicate that the assumption of equivalent testlets was warranted.

The following section provides a brief review of the theory needed to determine an optimal CMT decision rule in both the equivalent and nonequivalent testlet modes. Subsequent sections describe two simulation studies which were performed to evaluate the feasibility of the validation procedure introduced in this paper. Each simulation was designed to model the performance of a specific candidate population responding to three different tests: (i) a variable-length CMT defined with the equivalent testlet assumption, (ii) a variable-length CMT defined without the equivalent testlet

assumption, and (iii) a fixed-length paper-and-pencil test. The fixed-length paper-and-pencil test was included as a baseline from which to judge the performance of the two variable-length CMTs. In both simulations, candidate populations were modeled after the populations tested in a recent administration of NCARB's annual certification examination, the Architect Registration Examination (ARE). In each case, item parameters available for actual ARE items were used to generate the simulated data and to determine the decision rules for both the simulated paper-and-pencil test and the two simulated CMTs.

Determining an Optimal CMT Decision Rule

This section provides a brief review of the theory needed to determine an optimal CMT decision rule when the testlet pool consists of content-balanced n-item testlets and a maximum test length of k testlets has been specified. Both equivalent and nonequivalent testlet pools are considered.

In the Lewis and Sheehan model for mastery testing, all items are assumed to follow a unidimensional IRT model with known item parameters and the master/nonmaster cutscore is conceived of as a point θ_c on the latent proficiency scale. Although all examinees are evaluated with respect to the single unobservable cutscore θ_c, mastery decisions are made on the basis of cumulative number correct cutscores which vary depending on the number of testlets administered. The cumulative number correct cutscores are determined using a sequential hypothesis testing technique. Unlike other applications of sequential mastery testing (see Ferguson, 1969a, 1969b; Reckase, 1983; and Kingsbury & Weiss, 1983), it is expected that items will differ in difficulty and discrimination, but that the assignment of items to testlets will be conducted

to minimize differences in the average difficulty and discrimination levels of the testlets in the pool.

In both the equivalent and nonequivalent modes, misclassification rates are controlled through a decision theory approach. In this approach, the user's preferences for alternative classification outcomes are established using loss functions. Early applications of the decision theory approach to mastery testing include Cronbach & Gleser (1965), Hambleton & Novick (1973), Huynh (1976), Petersen (1976), Swaminathan, Hambleton & Algina (1975), and van der Linden & Mellenbergh (1977). The object in adopting a decision theory approach is to determine a decision rule which, in some way, reflects the preferences for alternative outcomes which have been built into the loss function. Since the loss function depends on the true mastery status of individuals, and that status is never known in practice, the optimal decision rule to associate with a particular loss function will not be unique. Different methods for dealing with this problem have been proposed. Lewis and Sheehan employ a Bayesian solution in which the unique decision rule to associate with a particular loss function is found by minimizing posterior expected loss at each stage of testing. Bayesian decision theory methods are discussed in Chernoff and Moses (1959), Lindley (1971), and Wetherill (1975).

To apply the Bayesian decision theory approach, two additional points on the latent proficiency scale must be specified: θ_n, the highest level at which an examinee will be considered a nonmaster, and θ_m, the lowest level at which an examinee will be considered a master. One must also select a loss function and a prior distribution. The loss function is specified in terms of three parameters: A, the loss associated with passing a nonmaster, B, the loss

associated with failing a master, and C, the cost of administering a single testlet. C is usually taken to be one in order to set the scale of measurement. The prior distribution is specified in terms of two probabilities: P_m, the prior probability of being a master, and P_n = 1 - P_m, the prior probability of being a nonmaster. Note that P_m can be interpreted as the prior probability of a candidate having a proficiency level which is greater than or equal to θ_m, and P_n can be interpreted as the prior probability of a candidate having a proficiency level which is less than or equal to θ_n.

In a departure from some IRT-based Bayesian procedures, posteriors are calculated conditional on the observed number-right score rather than the entire vector of observed item responses. To simplify the notation, let P_{m|i} denote the posterior probability of mastery given the scores X_1, ..., X_i, where X_i is the score observed for the ith testlet administered (i = 1, ..., k). This probability is calculated iteratively as follows:

P_{m|i} = P(X_i | θ = θ_m) P_{m|i-1} / [ P(X_i | θ = θ_m) P_{m|i-1} + P(X_i | θ = θ_n) P_{n|i-1} ],   (1)

where P(X_i | θ = θ_m) and P(X_i | θ = θ_n) refer to the conditional probability of observing a number-right score of X_i, given a proficiency level of θ_m or θ_n, respectively. Procedures for calculating these conditional probabilities differ depending on whether the testlet equivalence assumption is in effect. Note that, in both the equivalent and nonequivalent testlet modes, when i = 1, P_{m|i-1} is the prior probability of being a master, P_m, and P_{n|i-1} is the prior probability of being a nonmaster, P_n.
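The update in Equation 1 is a two-point Bayes rule and is easy to compute directly. The following Python sketch illustrates one step of the update; the function and variable names are illustrative only and are not part of the original procedure.

def update_posterior(p_m_prev, lik_m, lik_n):
    """One step of Equation 1: posterior probability of mastery after the current testlet.

    p_m_prev : prior (or previous posterior) probability of mastery, P_{m|i-1}
    lik_m    : P(X_i | theta = theta_m), likelihood of the observed score for a master
    lik_n    : P(X_i | theta = theta_n), likelihood of the observed score for a nonmaster
    """
    p_n_prev = 1.0 - p_m_prev
    numerator = lik_m * p_m_prev
    return numerator / (numerator + lik_n * p_n_prev)

# Illustrative values: prior P_m = 0.5, observed score four times as likely under theta_m
p_m = update_posterior(0.5, lik_m=0.20, lik_n=0.05)   # -> 0.8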

In the equivalent testlet mode, the conditional probabilities P(X_i | θ = θ_m) and P(X_i | θ = θ_n) are obtained by averaging over all testlets in the pool as follows:

P(X_i = s | θ = θ_m) = exp{ (1/T) Σ_{t=1}^{T} ln[ P(X_i = s | θ = θ_m, t) ] },   (2)

where s refers to a particular number-right score (s = 0, ..., n) and t refers to a particular testlet (t = 1, ..., T). The testlet-specific probabilities on the right-hand side of the equals sign are obtained as follows:

P(X_i = s | θ = θ_m, t) = Σ Π_{j=1}^{n} p_{jt}(θ_m)^{x_j} [1 - p_{jt}(θ_m)]^{1 - x_j},   (3)

where the summation is taken over all response patterns such that the total score is s, x_j is 1 or 0 depending on whether the response pattern considered is defined with a correct or incorrect response to the jth item, and p_{jt}(θ_m) is the conditional probability of a correct response to the jth item in testlet t by an examinee with proficiency level θ_m (as given by the assumed IRT model).

In the nonequivalent testlet mode, conditional score probabilities are not averaged. Instead, posterior probabilities are calculated conditional on the specific set of testlets administered. Thus, the testlet-specific likelihood functions given in Equation 3 are used instead of the pool-wide average likelihood function given in Equation 2.
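As an illustration, the number-right likelihood in Equation 3 can be evaluated without enumerating response patterns by using the standard recursion over items, and Equation 2 is then a pointwise geometric mean of the testlet-specific likelihoods. The Python sketch below assumes a three-parameter logistic item response function with the conventional 1.7 scaling constant; the parameter layout and all names are illustrative, not taken from the original implementation.

import numpy as np

def p_correct_3pl(a, b, c, theta):
    # Three-parameter logistic item response function
    # (the 1.7 scaling constant is a common convention, assumed here)
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def number_right_likelihood(p):
    # P(X = s | theta) for s = 0, ..., n, given per-item success probabilities p.
    # Recursion over items; algebraically equivalent to the pattern sum in Equation 3.
    dist = np.array([1.0])
    for pj in p:
        dist = np.append(dist * (1.0 - pj), 0.0) + np.append(0.0, dist * pj)
    return dist

def pooled_likelihood(testlet_params, theta):
    # Equation 2: pointwise geometric mean of the testlet-specific likelihoods.
    # testlet_params: list of testlets, each a list of (a, b, c) item parameter triples.
    logs = [np.log(number_right_likelihood(
                np.array([p_correct_3pl(a, b, c, theta) for a, b, c in testlet])))
            for testlet in testlet_params]
    return np.exp(np.mean(logs, axis=0))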

The expected losses associated with the decisions to pass or fail at stage i are expressed as functions of P_{m|i} as follows:

E[ l(pass | θ) | X_1, ..., X_i ] = iC + A (1 - P_{m|i}),
E[ l(fail | θ) | X_1, ..., X_i ] = iC + B P_{m|i}.

To determine the expected loss associated with the decision to administer another testlet at stage i (for i < k), it is necessary to consider all possible outcomes at stage i+1. For example, if stage i+1 were to result in a "pass immediately" decision, then the loss of deciding to continue testing at stage i would be equal to the loss of deciding to pass the examinee at stage i+1. However, since the set of all possible outcomes at stage i+1 includes the option of administering another testlet, all possible outcomes at stage i+2 must also be considered, and so on. The uncertainty associated with future testlet administrations is accounted for by averaging the expected loss associated with each of the various outcomes in proportion to the probability of observing those outcomes, that is, the predictive probability. The predictive probability for a score of X_{i+1} at Stage i+1, given the scores for the first i stages, is calculated as a function of the posterior probability of mastery at Stage i, as follows:

P(X_{i+1} = s | X_1, ..., X_i) = P(X_{i+1} = s | θ = θ_m) P_{m|i} + P(X_{i+1} = s | θ = θ_n) P_{n|i}.

In the equivalent testlet mode, P(X_{i+1} = s | θ = θ_m) and P(X_{i+1} = s | θ = θ_n) are determined using the pool-wide average score probabilities given in Equation 2. In the nonequivalent testlet mode, P(X_{i+1} = s | θ = θ_m) and P(X_{i+1} = s | θ = θ_n) must be determined from a testlet-specific likelihood function. Thus, advance knowledge of the identity of the next testlet to be administered is required. This problem is handled by estimating a different set of cutscores for each

permutation of testlet presentation order. At testing time, the cutscores used for a particular individual are determined by the permutation of testlet presentation order which was selected for that individual at the start of the testing session. Because the permutation provides the identity of all testlets which could be administered, the next testlet to be administered is always known.

To determine the expected loss of the continue testing option, it is useful to introduce risk functions at each stage of testing. Beginning with Stage k, define

r_k(P_{m|k}) = min{ kC + A (1 - P_{m|k}), kC + B P_{m|k} }.

The expected loss of deciding to administer another testlet at Stage k-1 can now be written in terms of the risk at Stage k as follows:

E[ l(continue testing | θ) | X_1, ..., X_{k-1} ] = Σ_s P(X_k = s | X_1, ..., X_{k-1}) r_k(P_{m|k}),

where P_{m|k} is evaluated for each value which X_k may take on. The risk function at Stage k-1 may now be defined as

r_{k-1}(P_{m|k-1}) = min{ (k-1)C + A (1 - P_{m|k-1}), (k-1)C + B P_{m|k-1}, E[ l(continue testing | θ) | X_1, ..., X_{k-1} ] }.

In general, the risk at Stage i is defined in terms of the risk at Stage i+1 as follows:

r_i(P_{m|i}) = min{ iC + A (1 - P_{m|i}), iC + B P_{m|i}, Σ_s P(X_{i+1} = s | X_1, ..., X_i) r_{i+1}(P_{m|i+1}) }.

The decision rule which minimizes expected posterior loss at stage i, for i = 1, ..., k-1, can now be defined as follows:

d_i(P_{m|i}) = pass if r_i(P_{m|i}) = iC + A (1 - P_{m|i}),
               fail if r_i(P_{m|i}) = iC + B P_{m|i},   (4)
               continue testing otherwise,

and for stage i = k:

d_k(P_{m|k}) = pass if kC + A (1 - P_{m|k}) ≤ kC + B P_{m|k},
               fail otherwise,   (5)

or equivalently,

d_k(P_{m|k}) = pass if P_{m|k} ≥ A/(A+B),
               fail otherwise.

This decision rule, which is specified in terms of posterior probabilities, can be evaluated with respect to any pool of parallel testlets, or any permutation of nonequivalent testlets. To minimize online calculations, the decision rule given above is translated into a set of probability metric cutscores, as follows. First, compute the risk at stage k for a set of values of P_{m|k}, say .000, .001, ..., 1.000; for each value of P_{m|k}, select the decision which minimizes expected posterior loss by applying Equation 5. Next, compute the risk at stage k-1 for the same set of values of P_{m|k-1}. Note that the task of computing the risk at stage k-1 is considerably simplified by the fact that the risk at stage k has already been determined.
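A computational sketch of this backward induction, for the equivalent testlet mode, is given below in Python. It evaluates the risk functions on a grid of posterior values and records, for each stage, the approximate probability metric thresholds implied by Equations 4 and 5. The function names, the grid resolution, and the handling of ties are illustrative choices, not details of the original implementation.

import numpy as np

def risk_thresholds(lik_m, lik_n, A, B, C, k, grid=np.linspace(0.0, 1.0, 1001)):
    """Backward induction over a grid of posterior mastery probabilities.

    lik_m, lik_n : arrays, P(X = s | theta_m) and P(X = s | theta_n) for one testlet
                   (the pool-wide averages of Equation 2 in the equivalent testlet mode)
    A, B, C      : losses for a false pass, a false fail, and one testlet administration
    k            : maximum number of testlets
    Returns, for each stage i, grid approximations to (lambda_1i, lambda_2i)."""
    thresholds = {}
    risk_next = None
    for i in range(k, 0, -1):
        pass_loss = i * C + A * (1.0 - grid)
        fail_loss = i * C + B * grid
        if risk_next is None:                       # Stage k: Equation 5, no continue option
            risk = np.minimum(pass_loss, fail_loss)
        else:                                       # Stages i < k: Equation 4
            cont_loss = np.zeros_like(grid)
            for s in range(len(lik_m)):
                pred = lik_m[s] * grid + lik_n[s] * (1.0 - grid)      # predictive probability
                post = np.where(pred > 0, lik_m[s] * grid / np.maximum(pred, 1e-12), 0.0)
                cont_loss += pred * np.interp(post, grid, risk_next)  # expected next-stage risk
            risk = np.minimum(np.minimum(pass_loss, fail_loss), cont_loss)
        # Largest posterior that still fails, smallest that already passes (grid approximations)
        lam1 = grid[risk == fail_loss].max() if np.any(risk == fail_loss) else 0.0
        lam2 = grid[risk == pass_loss].min() if np.any(risk == pass_loss) else 1.0
        thresholds[i] = (lam1, lam2)
        risk_next = risk
    return thresholds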

15 immediately" and the smallest values will result in the decision "fail immediately". Thus, we can easily keep track of the threshold values defined as Ali - largest value such that application of Equation 4 results in the decision to fail immediately whenever P m1i < Ali' A2i smallest value such that application of Equation 4 results in the decision to pass immediately whenever P lil l i ~ AZi, (Note that application of Equation 4 will result in the decision to continue testing whenever >'li ~ P m1i ~ >'2i)' It is easily seen that >'lk... >'Zk = A/(A + B). It is also possible to translate the probability metric cutscores into an equivalent set of cutscores expressed on the cumulative number right score metric. That is, by computing the posterior probabilities corresponding to all possible combinations of Xl"" X k, we can determine a set of approximate threshold values (Y1i,YZi, i=l,...,k) such that ~ X j < Yu for most combinations in which P m1i < >'u, and ~ X j ~ Y 2i for most combinations in which P m1i ~ >'2i for i=l,..., k, where the summation is over ~ for j=l,...,i. Note that the translation from the posterior probability metric to the numb~r right score metric is a manyto-l transformation. An Application The data available for this study were derived from NCARB's annual certification examination, the Architect Registration Examination (ARE). This examination covers eight distinct skill areas, known as divisions, which are administered as independently scored subtests. To pass the ARE, a candidate 11

must pass all eight divisions. In 1988, NCARB introduced quarterly computer-administered versions of the ARE for three different divisions. The design for these alternative tests followed the equivalent testlet CMT model described above. For the past two years, candidates have been given the option of taking either the paper-and-pencil version or the CMT version of the examination in these three skill areas.

Two of the divisions for which both paper-and-pencil and computer-administered examinations exist are considered in this study: Division E, a test of the skill area known as Structural Technology - Lateral Forces, and Division D/F, a test of the skill area known as Structural Technology - General and Long Span. Test specifications for the computer-administered and paper-and-pencil (P&P) versions of these two divisions are given in Table 1. (We would like to take this opportunity to note that the NCARB testlet pools have been redefined since this research was conducted; as a result, the cutscores and performance characteristics reported here do not refer to any currently existing testlet pool.)

Insert Table 1 about here

The ARE testlets are constructed from items which have been previously administered, in the standard paper-and-pencil format, and calibrated with the LOGIST program (Wingersky, Barton, and Lord, 1982). Independent, unidimensional IRT scales are assumed for each division. The LOGIST item parameter estimates are used: (i) to construct peaked testlets, (ii) to evaluate testlet equivalence, and (iii) to generate the likelihoods needed to

determine optimal decision rules. For this study, two additional LOGIST calibrations were performed in order to obtain IRT item parameter estimates expressed on the same scale, for both the CMT items and the items from the most recent paper-and-pencil test, for the two divisions studied. These additional item calibrations also used paper-and-pencil data exclusively.

Both the paper-and-pencil and the CMT versions of the ARE are scored number right. The cutscore for the initial form of the paper-and-pencil test was determined in an Angoff study. Cutscores for subsequent paper-and-pencil forms are determined by equating back to the initial form. The cutscore for the CMT is first set on the theta metric and then translated to the number-right score metric using the equivalent testlet procedure described above. The theta metric CMT cutscore is simply the theta value corresponding to the number-right cutscore defined for the initial paper-and-pencil form by the initial standard setting study.

The CMT version of the ARE was designed to provide accurate measurement in the region of the theta scale near the cutscore. The degree to which this design objective was achieved can be seen in Figure 1, which provides plots of the Test Characteristic Curves (TCCs) calculated for recent forms of both the CMT and the paper-and-pencil test for Divisions E and D/F. In each plot, the TCC for the operational paper-and-pencil test is plotted with a solid line and the TCC for the operational CMT is plotted with a dashed line. For Division E, the CMT TCC was constructed from the IRT item parameters defined for all six of the available testlets. For Division D/F, the CMT TCC was constructed from the IRT item parameters defined for five testlets which were randomly selected from the pool. (This five-testlet limitation reflects the five-testlet maximum given in the Division D/F test

specifications.) The cutscore separating nonmasters from masters is noted on each plot by a vertical line. The plots show that the CMTs are generally more discriminating (have steeper slopes) in the area near the cutscore.

Insert Figure 1 about here

All of the NCARB testlet pools were constructed to contain equivalent testlets. Thus all testlets are composed of the same number of items and cover the same content areas. To determine whether the third equivalence criterion has been met, one must consider the amount of agreement found among the testlet-specific likelihood functions estimated for the number-right score. These functions, which are theoretically determined by the IRT item parameters of the items comprising each testlet, provide the probability of observing any possible number-right score on a specific testlet, for a candidate who is either a maximally competent nonmaster or a minimally competent master. The likelihood functions which were estimated for the six 10-item Division E testlets are plotted in the top portion of Figure 2. The plot shows considerable agreement: for each testlet, a maximally competent nonmaster is most likely to score 4 out of 10, whereas a minimally competent master is most likely to score 7 out of 10. The likelihood functions which were estimated for the ten 25-item Division D/F testlets are given in the bottom portion of Figure 2. Note that the Division D/F testlets show slightly less agreement. In the following section, the significance of these deviations is evaluated by comparing the classification performance of the CMT decision rule defined with the equivalence assumption to the classification performance of the CMT decision rule defined without the equivalence

assumption. For each test, classification performance is estimated using simulated data.

Insert Figure 2 about here

The Simulation Technique

Two independent simulation studies were performed, one for each of the divisions considered in this paper. Each study was designed to model the performance of a single group of examinees responding to three different tests: (i) the most recent operational form of the paper-and-pencil test (although scored somewhat differently, as will be described later); (ii) the most recent operational version of the CMT, scored under the assumption of equivalent testlets; and (iii) the most recent operational version of the CMT, scored without assuming equivalent testlets. Actual NCARB item parameters were used in simulating responses to each test. The simulated paper-and-pencil responses can be used as a baseline from which to gauge the performance of the two simulated CMTs.

For each division, generating examinee theta values were selected based on the estimated proficiency distribution of the candidate population tested in the most recent administration of the paper-and-pencil test. These empirical distributions are plotted in Figure 3. On each plot, the cutscore separating nonmasters from masters is indicated by a vertical line.

Insert Figure 3 about here

The simulated data set which was constructed for each division contained

100 simulated examinees at each of 40 different proficiency levels. The proficiency levels selected for each division correspond to the 20 raw score points located immediately above and below the cutscore defined for the current operational paper-and-pencil test. Case weights were generated for each simulated examinee so that the weighted distribution of true scores in the simulated data set matched the distribution of observed scores plotted in Figure 3.

For each simulated examinee, item responses were generated according to the three-parameter logistic IRT model for (i) all of the items which appeared on the operational paper-and-pencil test (60 items for Division E and 125 items for Division D/F), and (ii) all of the items which appeared in the operational CMT testlet pool (60 items for Division E and 125 items for Division D/F). There was no overlap between the items included on the paper-and-pencil test and the items included in the CMT testlet pools. The numbers of simulated response vectors included in the final simulated data set are given in Table 2. The procedures used to determine optimal decision rules for each test are described in the following section.

Insert Table 2 about here
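The following Python sketch indicates how weighted simulated response data of this kind might be generated. The item parameter arrays, the target density used to form the case weights, and all names are hypothetical; the original data were generated from actual NCARB item parameter estimates.

import numpy as np

rng = np.random.default_rng(1990)

def p_correct_3pl(a, b, c, theta):
    # Three-parameter logistic item response function (1.7 scaling constant assumed)
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def simulate_responses(item_params, theta_levels, n_per_level=100):
    # Generate 0/1 response vectors under the 3PL model, n_per_level simulees at each
    # generating theta. item_params is an (n_items, 3) array of (a, b, c) triples.
    records = []
    for theta in theta_levels:
        p = np.array([p_correct_3pl(a, b, c, theta) for a, b, c in item_params])
        x = (rng.random((n_per_level, len(item_params))) < p).astype(int)
        records.append((theta, x))
    return records

def case_weights(theta_levels, target_density):
    # Weights that make the simulated theta distribution match a target distribution
    # (e.g., the empirical proficiency distribution plotted in Figure 3).
    w = np.array([target_density(t) for t in theta_levels], dtype=float)
    return w / w.sum()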

Determining the Decision Rules

For each division, three decision rules were generated: one estimated from the simulated paper-and-pencil data and two estimated from the simulated CMT data. The common set of input parameters which entered into the calculations for each rule included: (i) the proficiency estimate corresponding to a maximally competent nonmaster (θ_n); (ii) the proficiency estimate corresponding to a minimally competent master (θ_m); (iii) the loss associated with misclassifying a nonmaster; and (iv) the loss associated with misclassifying a master.

For each division, the proficiency estimate corresponding to a maximally competent nonmaster was taken to be the theta value located 1 1/2 standard errors of measurement below the cutscore. (A standard error of measurement at the cutscore was calculated, for each division, from data collected in the most recent administration of the paper-and-pencil test.) Similarly, the proficiency estimate corresponding to a minimally competent master was taken to be the theta value located 1 1/2 standard errors of measurement above the cutscore. The loss function parameters were taken to be 40 for misclassifying a nonmaster as a master, and 20 for misclassifying a master as a nonmaster, on a scale in which 1 corresponds to the loss associated with administering an additional testlet. This loss function is referred to as a 40/20 loss function. These values were selected based on results obtained in the simulation studies reported in Lewis and Sheehan (in press).

The sequential decision procedure described above was used to determine the optimal decision rule to apply to both the simulated paper-and-pencil data and the simulated CMT data. For the simulated paper-and-pencil data, the sequential procedure was implemented by considering each paper-and-pencil form as a single testlet. Thus, the 60-item Division E paper-and-pencil test was considered to be a single 60-item testlet and the 125-item Division D/F paper-and-pencil test was considered to be a single 125-item testlet. For each division, the testing procedure was forced to terminate after the first stage, that is, after the first testlet was administered, because there were no unused testlets remaining in the pool. Thus, the decision rule which minimized

posterior expected loss was estimated as the first stage cutscore such that, for examinees with number correct scores below the cutscore, the posterior expected loss of a fail decision was less than that of a pass decision, and for examinees with number correct scores above the cutscore, the posterior expected loss of a pass decision was less than that of a fail decision. Note that, since each pool contained just one testlet, this approach did not require the assumption of equivalent testlets. That is, the issue of whether or not to average score probabilities across testlets never had to be resolved. The cutscores determined in this manner were 34 for the 60-item Division E test and 77 for the 125-item Division D/F test.

For the simulated CMT data, two decision rules were estimated, one which incorporated the assumption of equivalent testlets and one which did not. To determine the decision rule which incorporated the assumption of equivalent testlets, the likelihood functions given in Figure 2 were averaged to provide a single pool-wide average likelihood function for each division, as shown in Figure 4. These curves were used to determine the posterior probability of mastery associated with alternative number-right scores, at each stage of testing. The resulting decision rules are listed, by stage, in Table 3. (Stage 1 number-right cutscores were not estimated because the test specifications called for a minimum of two testlets per examinee.) Implementation of this rule is straightforward: at the second (and perhaps subsequent) stage(s) of testing, examinees with cumulative number correct scores less than or equal to the maximum fail score are failed, examinees with cumulative number correct scores greater than or equal to the minimum pass score are passed, and those with scores between the two cutscores are administered an additional testlet.
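Operationally, applying this rule amounts to a table lookup at each stage. The Python sketch below assumes stage-indexed dictionaries of maximum fail and minimum pass scores (such as the values reported in Table 3); the function and argument names are illustrative.

def administer_cmt(cum_scores, max_fail, min_pass):
    # cum_scores: cumulative number-correct score after each administered testlet
    # max_fail, min_pass: dicts keyed by stage (2, 3, ..., k), e.g. the Table 3 cutscores
    for stage, score in enumerate(cum_scores, start=1):
        if stage not in max_fail:          # no decision at stage 1 (minimum of two testlets)
            continue
        if score <= max_fail[stage]:
            return 'fail', stage
        if score >= min_pass[stage]:
            return 'pass', stage
    return 'continue', len(cum_scores)     # another testlet is required

At the final stage the fail and pass cutscores are adjacent, so a definite pass or fail decision is always reached by stage k.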

Insert Figure 4 and Table 3 about here

The cutscores estimated for the nonequivalent testlet CMT were based on the testlet-specific likelihood functions given in Figure 2, rather than the pool-wide average likelihood functions given in Figure 4. For the Division E pool, testlet-specific cutscores were estimated for each of the 6! = 720 possible permutations of testlet presentation order. A frequency distribution for the unique cutscores obtained, at each stage of testing, is given in Table 4. The table shows, for example, that of the 720 permutations of testlet presentation order analyzed, 72 percent resulted in a Stage 2 maximum fail score of 9, and 28 percent resulted in a Stage 2 maximum fail score of 10. In general, there were only two or three unique pass or fail cutscores obtained at each stage of testing. The table also indicates which cutscores correspond to the cutscores obtained under the equivalent testlet model. Note that, at each stage of testing, the testlet-specific cutscores with the highest frequency are those which correspond to the equivalent testlet model. Implementation of this rule is also straightforward: prior to each individual testing session, a specific testlet permutation is randomly selected from the set of 720 that are available; this permutation determines (i) the order in which testlets will be administered to the examinee and (ii) the set of testlet-specific cutscores which will be used to make the pass/fail/continue testing decision at each stage of testing.

A slightly different procedure was employed to obtain testlet-specific cutscores for the Division D/F pool. Specifically, since the Division D/F pool contains ten testlets and the test specifications call for a maximum of

five testlets per examinee, calibration of the entire set of testlet presentation permutations would have involved 30,240 (10 x 9 x 8 x 7 x 6) sets of testlet-specific cutscores. Full calibration of all possible testlet presentation permutations was never seriously considered, however, since the number of examinees expected to register for the exam was always far smaller than 30,240. Instead, a random subset of 1,000 permutations was selected for calibration. The results are summarized in Table 5. At each stage, the calibration yielded between five and six unique sets of cutscores. Note that the equivalent testlet cutscores always appear at the center of the distribution and generally have the highest frequency. This rule is implemented in exactly the same way as the Division E rule. That is, prior to each individual testing session, a particular permutation is randomly selected from the set of 1,000 that were calibrated; this permutation determines (i) which subset of testlets will be administered, (ii) the order of presentation of the selected testlets, and (iii) the set of testlet-specific cutscores which will be applied to the observed item responses.
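The calibration over presentation orders can be organized as in the following Python sketch. The routine calibrate_cutscores is hypothetical; it stands for a run of the backward induction of Equations 4 and 5 using the testlet-specific likelihoods (Equation 3) in the given presentation order, returning the stage-wise (maximum fail, minimum pass) cutscores as a tuple.

import itertools
import random
from collections import Counter

def calibrate_permutations(testlet_ids, max_testlets, n_sample, calibrate_cutscores, seed=0):
    # testlet_ids  : identifiers of the testlets in the pool
    # max_testlets : maximum number of testlets administered (k)
    # n_sample     : number of presentation orders to calibrate (e.g., 720 or 1,000)
    rng = random.Random(seed)
    all_perms = list(itertools.permutations(testlet_ids, max_testlets))
    sample = all_perms if len(all_perms) <= n_sample else rng.sample(all_perms, n_sample)
    table = {perm: calibrate_cutscores(perm) for perm in sample}
    # Frequency of each unique set of cutscores, as summarized in Tables 4 and 5
    freq = Counter(tuple(cuts) for cuts in table.values())
    return table, freq

# At testing time, a permutation is drawn for each examinee and its cutscores applied:
#   perm = random.choice(list(table)); cutscores = table[perm]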

The Simulation Results

The classification performance of the P&P test, the equivalent testlet CMT (EQ CMT), and the nonequivalent testlet CMT (NE CMT) were estimated by applying the decision rules noted above to the simulated response vectors described previously. Overall passing rates calculated from the weighted data are given in Table 6. Note that, for both divisions, the pass rates calculated for the two CMTs are almost identical. This indicates that the deviations from equivalence evidenced in the testlet-specific likelihood functions shown previously were not sufficient to have a differential impact on the overall CMT pass rates. The P&P pass rate is also similar to the two CMT pass rates. This indicates that the differences observed in the test characteristic curves estimated for the P&P test and the CMT were also not sufficient to have a differential impact on the two pass rates. The table also provides the "true" pass rate determined from the generating thetas. For both divisions, the "true" pass rate is higher than the pass rates calculated from the simulated data. This difference can be attributed to the 40/20 loss function which was built into all of the simulated decision rules. The 40/20 loss function tends to lower the pass rate because the larger loss associated with misclassifying a nonmaster as a master serves to limit the number of simulated examinees classified as masters.

Insert Table 6 about here

The classification accuracy of the three simulated tests, as determined from the weighted simulation results, is summarized in Table 7. Note that, in each division, the error rates listed for the two CMTs are very similar. In Division E, for example, both CMTs have a false positive error rate of 6 percent and a false negative error rate between 13 and 14 percent. (The false positive error rate is always lower than the false negative error rate because the 40/20 loss function specifies that a false positive decision is twice as serious as a false negative decision.) The similar error rates listed in the table provide further evidence that violations of the testlet equivalence assumption have not adversely affected the classification accuracy of the CMT developed under the equivalent testlet design.
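The pass rates and error rates of this kind are weighted tabulations of the simulated classifications; the test length summaries reported later are obtained in the same way. A minimal Python sketch, with illustrative names, is given below; error rates are computed conditionally on true mastery status, as in Table 7.

import numpy as np

def operating_characteristics(decisions, true_master, weights, test_lengths):
    # decisions    : array of 'pass' / 'fail' classifications for each simulee
    # true_master  : boolean array, True if the generating theta is at or above the cutscore
    # weights      : case weights matching the target proficiency distribution
    # test_lengths : number of items administered to each simulee
    passed = np.asarray(decisions) == 'pass'
    w = np.asarray(weights, dtype=float)
    m = np.asarray(true_master, dtype=bool)
    pass_rate = np.average(passed, weights=w)
    false_pos = np.average(passed[~m], weights=w[~m])    # true nonmasters classified as master
    false_neg = np.average(~passed[m], weights=w[m])     # true masters classified as nonmaster
    avg_length = np.average(test_lengths, weights=w)
    return {'pass rate': pass_rate, 'false positive': false_pos,
            'false negative': false_neg, 'average test length': avg_length}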

Table 7 also shows that, in both divisions, the variable-length CMTs achieved classification accuracy similar to that of the paper-and-pencil tests, while using fewer items.

Insert Table 7 about here

Detailed classification accuracy results are provided in Tables 8 and 9. Each table provides the percent of examinees classified as masters by the P&P test and the two CMTs, at each of the generating theta values used in the simulation. Results for Division E are listed in Table 8; results for Division D/F are listed in Table 9. Both tables show a high degree of similarity between the numbers listed for the two CMTs. To provide a visual display of these data, Figure 5 presents the percent of examinees classified as masters by the two CMTs plotted as a function of theta. The fact that the two curves are nearly indistinguishable underscores the fact that the nonequivalent testlet CMT has not provided significantly improved classification accuracy over the equivalent testlet CMT.

Insert Tables 8 and 9 and Figure 5 about here

Tables 8 and 9 also provide classification consistency data for the two CMTs. For both divisions, examinees with true abilities either far above or far below the cutscore were consistently classified by both CMTs 100 percent of the time. For Division E, examinees with true abilities near the cutscore were consistently classified more than 95 percent of the time. The data for Division D/F are somewhat less consistent, as low as 89 percent for some theta values near the cutscore.

In Table 10, the simulation results are summarized in terms of the

minimum, maximum and mean test length observed for each exam. The table shows no significant differences between the two CMTs with respect to these statistics. An alternative view of the test length data is provided in Figure 6, which depicts average test length plotted as a function of theta. Since 100 response vectors were generated at each theta value, each point is an average of 100 values. For both divisions, the plots show minimal differences.

Insert Table 10 and Figure 6 about here

Cost Considerations

The equivalent testlet CMT and the nonequivalent testlet CMT can also be compared on the basis of costs. Since both models require similar implementation procedures, differences in implementation costs should be negligible. Developmental costs can vary widely, however, as shown in Table 11. The table provides execution times (measured in hours) for the computer runs used to determine the equivalent testlet and nonequivalent testlet cutscores for the two divisions studied. Although program execution times are machine dependent (the times listed are for a PC-AT with a 386S-32 bit CPU), relative differences should be more broadly meaningful. As can be seen, the nonequivalent testlet model requires significantly more computer processing time than the equivalent testlet model. Differences in the times listed for Division E and Division D/F can be attributed to (i) differences in testlet lengths (10 items for Division E and 25 items for Division D/F), and (ii) differences in the size of the testlet pool (six testlets for Division E and ten testlets for Division D/F).

Insert Table 11 about here

Summary and Conclusions

This paper introduced a practical procedure for determining the effect of testlet nonequivalence on the operating characteristics of a testlet-based computerized mastery test. The procedure assumes that all testlets are composed of the same number of items and cover the same content areas. It then determines the effect of testlet-to-testlet variation in the likelihood functions estimated for particular number-right scores. Operationally, the procedure is an extension of the equivalent testlet procedure described in Lewis and Sheehan (in press). The procedure involves estimating the CMT cutscores twice, once with testlets treated as equivalent and once with testlets treated as nonequivalent. In the nonequivalent testlet mode, different sets of cutscores are developed for each permutation of testlet presentation order, taking into account whatever differences may exist in the likelihoods associated with each testlet. Concretely, cutscores associated with more difficult testlets are somewhat lower while those for easier testlets are somewhat higher. Less discriminating testlets have passing and failing cutscores which are further apart, while those for more discriminating testlets are closer together. A drawback of this design is that, although the different sets of cutscores are specifically constructed to insure that all examinees are treated equally with respect to the measurement properties of their particular set of testlets, the fact that different cutscores are used for different individuals may be objectionable to some test takers. Note that this objection does not hold for the equivalent testlet model, in which a single set of cutscores is used for all examinees.

To evaluate the feasibility of this new CMT procedure, two simulation studies were performed. Each simulation was designed to model the performance of a specific candidate population responding to three different tests: (i) a standard paper-and-pencil test, (ii) a CMT defined with the equivalent testlet assumption, and (iii) a CMT defined without the equivalent testlet assumption. Candidate populations were modeled after the populations tested in a recent administration of the Architect Registration Examination (ARE), a professional certification examination which is currently being administered in the CMT format. The simulation results showed that it was feasible (although costly in terms of computer processing time) to construct the testlet-specific sets of cutscores needed for the nonequivalent testlet CMT design and that, for the two ARE testlet pools studied, dropping the assumption that all testlets are equivalent had negligible impact on classification performance. These results can be interpreted as a validation of the decision to score the ARE testlet pools using the equivalent testlet design.

The implications of the research described in this paper include the following: (1) It is now possible to employ testlet-based computerized mastery testing procedures when testlets are not constructed to be equivalent. This capability may prove useful for adaptive testing applications. (2) The methods and techniques used in the simulation provide a validation procedure which can be applied to any testlet pool which is developed to contain equivalent testlets. It is recommended that such validation be performed, on a routine basis, as new equivalent testlet pools are developed.

REFERENCES

Chernoff, H., & Moses, L. E. (1959). Elementary decision theory. New York: John Wiley.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana, IL: University of Illinois Press.

Ferguson, R. L. (1969a). Computer-assisted criterion-referenced measurement (Working Paper No. 41). Pittsburgh: University of Pittsburgh Learning and Research Development Center. (ERIC Documentation Reproduction No. ED ).

Ferguson, R. L. (1969b). The development, implementation, and evaluation of a computer-assisted branched test for a program of individually prescribed instruction. (Doctoral dissertation, University of Pittsburgh) (University Microfilms No. ).

Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10.

Huynh, H. (1976). Statistical consideration of mastery scores. Psychometrika, 41.

Kingsbury, G. G., & Weiss, D. J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Lewis, C., & Sheehan, K. (in press). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement.

Lindley, D. V. (1971). Making decisions. London and New York: Wiley Interscience.

Petersen, N. S. (1976). An expected utility model for "optimal" selection. Journal of Educational Statistics, 1.

Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Swaminathan, H., Hambleton, R. K., & Algina, J. (1975). A Bayesian decision-theoretic procedure for use with criterion-referenced tests. Journal of Educational Measurement, 12.

van der Linden, W. J., & Mellenbergh, G. J. (1977). Optimal cutting scores using a linear loss function. Applied Psychological Measurement, 1.

Wetherill, G. B. (1975). Sequential methods in statistics. London: Chapman and Hall.

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST V user's guide. Princeton, NJ: Educational Testing Service.

Table 1
Test Specifications 1

                                        Div. E    Div. D/F
Length of the P&P test (items)            60        125
CMT Testlet Length (items)                10         25
Min. No. of Testlets Administered          2          2
Max. No. of Testlets Administered          6          5
Minimum CMT Test Length (items)           20         50
Maximum CMT Test Length (items)           60        125
Total No. of Testlets in the Pool          6         10

1. The ARE test specifications have been redefined since this research was conducted.

Table 2
Numbers of Simulated Response Vectors

                      Unweighted    Weighted
Division E
  Nonmasters              2,000       2,462
  Masters                 2,000       3,963
  Total                   4,000       6,425
Division D/F
  Nonmasters              2,000       4,957
  Masters                 2,000       3,601
  Total                   4,000       8,558

Table 3
Number Correct Cutscores Calculated Under the Assumption of Equivalent Testlets

Division E
Stage    Items    Maximum Fail Score    Minimum Pass Score

Division D/F
Stage    Items    Maximum Fail Score    Minimum Pass Score

Table 4
Testlet Specific Number Right Cutscores Calculated for the Division E Pool
Without Assuming Equivalent Testlets

           Maximum Fail Score   Proportion¹     Minimum Pass Score   Proportion¹
Stage 2        9*   10           .72   .28          14   15*
Stage 3        14   15*   16                         19   20*   21
Stage 4        20   21*                              25   26*
Stage 5        26   27*                              30   31*
Stage 6        33   34*                              34   35*

* Cutscore for the equivalent testlet design.
1. Proportion of times obtained in analysis of 6! = 720 distinct testlet presentation permutations.

Table 5
Testlet Specific Number Right Cutscores Calculated for the Division D/F Pool
Without Assuming Equivalent Testlets

           Maximum Fail Score   Proportion¹     Minimum Pass Score   Proportion¹
Stage 2                           .42                35*
Stage 3                           .36                50*
Stage 4
Stage 5                           .31                78*

* Cutscore for the equivalent testlet design.
1. Proportion of times obtained in analysis of 1,000 randomly selected testlet presentation permutations.

Table 6
Comparison of Overall Pass Rates

Test        Div. E    Div. D/F
P&P
EQ CMT
NE CMT
Truth

Table 7
Comparison of Classification Accuracy

Division E 1
                 True Nonmasters                          True Masters
          Percent Classified   Percent Classified   Percent Classified   Percent Classified
Test        as Nonmaster         as Master            as Nonmaster         as Master
P&P
EQ CMT
NE CMT

1. The simulated data set included 2,462 true nonmasters and 3,963 true masters.

Division D/F 2
                 True Nonmasters                          True Masters
          Percent Classified   Percent Classified   Percent Classified   Percent Classified
Test        as Nonmaster         as Master            as Nonmaster         as Master
P&P
EQ CMT
NE CMT

2. The simulated data set included 4,957 true nonmasters and 3,601 true masters.

Table 8
Detailed Classification Accuracy Data for Alternative Division E Mastery Tests

Generating Parameters           % Classified as Master         CMT Classification
True Score    Theta    Wt.      P&P    EQ CMT    NE CMT        Consistency