Item Calibration: on Computerized Adaptive Scores. Medium-of-Administration Effect DTIQ 1I I III I I BE~U. I!IJilillI. ll Il

Navy Personnel Research and Development Center San Diego, California 92152.7250 TR-93-9 September 1993 AD-A271 042 1I I III I I BE~U Item Calibration: Medium-of-Administration Effect on Computerized Adaptive Scores DTIQ I E LF--CT E OCT2 11993 Rebecca D. Hetter Bruce M. Bloxom Daniel 0. Segall 93-25378 I!IJilillI ll 1111111Il 9 3 10 20 1 2 9 Approved for public release; ditdbution is unkrits.

NPRDC-TR-93-9 September 1993 Item Calibration: Medium-of-Administration Effect on Computerized Adaptive Scores Rebecca D. Hetter Bruce M. Bloxom Defense Manpower Data Center Daniel 0. Segall Reviewed and approved by W. A. Sands Released by John D. McAfee Captain, U.S.Navy Commanding Officer and Richard C. Sorenson Technical Director (Acting) Approved for public release; distribution is unlimited. Navy Personnel Research and Development Center San Diego, California 92152-7250

REPORT DOCUMENTATION PAGE OBf.00-8 Ptk wmolpng bu len for thb colection of Idi rnikn isa enaece to aen"e 1 hour per responme k, lhdi the lno tnr mvbftag robuctlims, searching extng data aoumee. ptherkvg and mawht the daa needed, and conlting and rviewing the colledion at Ildmouian. Send convneda reg&awn ttu burden edete or any oie Arped of fti clection ot krolineton. tndudlng suggetlion tor reduckg this burden. to Washtngon Headquanem Services. Directorate ior Wammo Operallon arw Repte, 1215 Jefferon Davis HIghway, Sult 1204, Arlnglon, VA 22202-4302. and to the Office of Manmgamen and Budget. Paperwoo Reduction Ptled (0704-0168). Washngon, DC 20603. 1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 3. REPORT TYPE AND DATE COVERED I September 1993 Final-March 1987-December 1989 4. TITLE AND SUBTITLE 5. FUNDING NUMBERS Item Calibration: Medium-of-Administration Effect on Computerized Program Element: 0604703N Adaptive Scores Work Units: R1822-MHOO1 6. AUTHOR(S) R1822-MHOO1A Program Element: OWN Rebecca D. Hetter, Bruce M. Bloxom, Daniel 0. Segall Reimbursable 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION Navy Personnel Research and Development Center REPORT NUMBER San Diego, California 92152-7250 NPRDC-TR-93-9 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING/MONITORING Bureau of Naval Personnel (PERS-23) AGENCY REPORT NUMBER Navy Department Washington, DC 20370-5000 11. SUPPLEMENTARY NOTES Functional Area: Personnel Systems Product Line: Computerized Testing Effort: Computer Adaptive Testing (CAT) 12a. DISTRIBUTION/AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE Approved for public release; distribution is unlimited. A 13. ABSTRACT (Maximum 200 words) An important question in the development of item pools for computerized adaptive tests (CATs) is whether data for calibrating items should be collected by a paper-and-pencil (P&P) or a computer administration of the items. This study evaluated the effect on adaptive scores of using a P&P calibration. The correspondence between adaptive scores obtained with computeradministered items and a P&P calibration with adaptive scores obtained with computer-administered items and a computer calibration was evaluated. Forty items from each of four Armed Services Vocational Aptitude Battery (ASVAB) content areas (general science, arithmetic reasoning, work knowledge, and shop information) were administered by computer to one group of Navy recruits and by P&P to a second group. These data were used to obtain computer-based and P&P-based calibrations of the items. Each calibration was then used to estimate item response theory adaptive scores for a third group of recruits who received the items by computer. The effect of medium of administration was assessed by comparative regression, correlation, and reliability analyses of the scores using the alternative calibrations. Results indicate that, although statistically significant medium effects were found on some content areas, medium of administration did not affect the reliability of the adaptive scores. Although these findings support the use of the P&P parameters of the current CAT-ASVAB item pool, it is recommended that further analyses be performed to elucidate the significant effects. 14. SUBJECT TERMS 15. NUMBER OF PAGES Item calibration, computerized adaptive testing, computerized ability tests, computer- 36 administered tests, calibration mode, ability tests 16. PRICE CODE 17. SECURITY CLASSIFICATION 18. SECURITY CLASSIFICATION 19. qfcurity CLASSIFICATION 20. LIMITATION OF ABSTRACT OF REPORT OF THIS PAGE OF ABSTRACT UNCLASSIFIED UNCLASSIFIED UNCLASSIFIED UNLIMITED NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) Prescribed by ANSI Std. Z39-18 298-102

Foreword A joint-service effort is in progress to develop a computerized adaptive testing (CAT) system and to evaluate its potential for use in the military entrance processing stations as a replacement for the paper-and-pencil Armed Services Vocational Aptitude Battery (ASVAB). The Department of the Navy has been designated as lead service for CAT system development and the Navy Personnel Research and Development Center has been designated as lead laboratory. This research was funded under CAT-ASVAB Program Element 0604703N, Work Units R1822-MHOO and R1822-MHOO1A, and reimbursable Navy funding (O&M,N), sponsored by the Bureau of Naval Personnel (PERS-23). This research was part of the overall evaluation of CAT-ASVAB. This report presents the results of an evaluation of the effect on adaptive scores of item calibration medium of administration. The data were collected by RGI, Inc., pursuant to contract N66001-86-C-0217. Results are directed toward technical, professional, and contractor personnel involved in implementing CAT. JOHN D. McAFEE Captain, U.S. Navy Commanding Officer RICHARD C. SORENSON Technical Director (Acting) Acoession For NTIS GRA&I DTIC TAB 0 Unannouneed 0 Justlilcatio By _ Dl stributim/f Avellab~l itv og s Avai... V a

Summary Problem The Navy Personnel Research and Development Center is conducting research to design and evaluate a computerized adaptive test (CAT) as a potential replacement for the paper-andpencil (P&P) Armed Services Vocational Aptitude Battery (ASVAB). In support of this effort, the Accelerated CAT-ASVAB Program (ACAP) is evaluating item pools specifically developed for computerized adaptive testing. An important question in the development of item pools for computerized adaptive tests is whether data for calibrating items should be collected by a P&P a computer administration of the items. If P&P administrations do not yield precise enough calibrations, items must be administered by computer for calibration just as they are during testing. Since the CAT-ASVAB item pools have been calibrated using P&P administrations, this is an issue of interest for ACAP research. Objective The objective of this study was to evaluate the effect on adaptive scores of using a P&P calibration. Specifically, to what extent do adaptive scores obtained with computer-administered items and a P&P calibration correspond to adaptive scores obtained with computer-administered items and a computer calibration? Method Forty items from each of four ASVAB content areas-general science (GS), arithmetic reasoning (AR), word knowledge (WK), and shop information (SI)-were administered by computer to one group of Navy recruits and by P&P to a second group. These data were used to obtain computer-based and P&P-based calibrations of the items. Each calibration was then used to estimate item response theory adaptive scores for a third group of recruits who had received the items by computer. The effect of medium of administration was assessed by comparative analyses of the scores using the alternative calibrations. Testing was conducted at the Recruit Training Center in San Diego, CA. ASVAB scores of record were obtained for nearly all of the recruits and were used to assess whether the groups were comparable in ability levels. Results and Discussion Results of the reliability analyses indicate that random errors due to calibration have equivalent variance across different media. These results suggest that the use of item parameters obtained in a P&P calibration will not affect the reliability of CAT-ASVAB test scores, an important concern of the ACAP program. Results of the regression and correlation analyses show statistically significant mediumof-administration effects. The rlgression results showed effects on AR, WK, and SI; amid the correlation results showed effects on GS and WK. vii

Condusions Results of regression, correlation, and reliability analyses conducted to evaluate calibration medium-of-administration effect on adaptive scores indicate that, although statistically significant medium effects were found on some content areas, these effects did not affect the reliability of the CAT-ASVAB scores. Recommendations Although these findings support the use of the P&P parameters of the current CAT-ASVAB item pool, further hypothesis testing with an expanded reliability model is recommended to elucidate the significant effects. In addition, analyses of individual item parameters may be necessary for understanding these effects. viii

Contents Introduction... 1 Background... 1 Objective... 1 M ethod... Subjects... I Items.......... s..2.. 2 Item Calibrations... 2 Scores... 3 Adaptive Scores... 3 Nonadaptive Scores... 3 ASVAB Scores... 4 R esults and D iscussion... 4 Calibration Sam ples... 4 AFQT Com parisons... 4 Num ber-right Scores... 5 Regression Analysis... 7 Correlation Analysis.... 8 Reliability Analysis... 8 Statistical M odel... 9 LISREL M odels... 1I C onclusions... 12 R ecom m endation... 12 R eferences... 13 Appendix-Regressions, Correlations, and Scatter Plots... A-0 Distribution List Page ix

List of Tables 1. Medium-of-Administration Subtest Order and Time Limits... 2 2. Calibration Design... 3 3. Computation of Theta Scores... 3 4. Analysis of Variance: AFQT by Calibration Group... 4 5. ASVAB vs. Group 1 (Computer): Number-Right Score Means, Standard Deviations, and Intercorrelations... 5 6. ASVAB vs. Group 2 (P&P): Number-Right Score Means, Standard Deviations, and Intercorrelations... 6 7. ASVAB vs. Group 3 (Computer): Number-Right Score Means, Standard Deviations, and Intercorrelations... 7 8. Test for Equality of Covariances: COV(TIT3) = (COV(T2,T3)... 8 9. t-test of the Difference Between Dependent Correlations... 9 10. Correlation and Variance Matrices in the Model... 10 11. Test for Equality of Reliabilities... 12 x

Introduction Background The Navy Personnel Research and Development Center is conducting research to design and evaluate a computerized adaptive test (CAT) as a potential replacement for the paper-and-pencil (P&P) Armed Services Vocational Aptitude Battery (ASVAB). In support of this effort, the Accelerated CAT-ASVAB Program (ACAP) is evaluating item pools specifically developed for computerized adaptive testing. An important question in the development of item pools for computerized adaptive tests is whether data for calibrating items should be collected by a P&P or a computer administration of the items. Although research shows that computerized adaptive tests with P&P item calibrations can have validities comparable to conventional P&P tests (Moreno, Segall, & Kieckhaefer, 1985, pp. 29-33), how much less than optimal these computerized adaptive tests might be is unknown. The concern about medium of administration in item calibration is that item parameters for some types of items (e.g., items with long paragraphs or with graphics) may differ between computer and P&P administrations. This could result in less-than-optimal item selection and score estimation in adaptive tests. If P&P administrations do not yield precise enough calibrations, items must be administered by computer during calibration just as they are during testing. Objective The objective of this study was to evaluate the effect on adaptive scores of using a P&P calibration. Specifically, to what extent do. adaptive scores obtained with computer-administered items and a P&P calibration correspond to adaptive scores obtained with computer-administered items and a computer calibration? Method Fixed blocks of items were administered by computer to one group of examinees and by P&P to a sernnd group. These data were used to obtain computer-based and P&P-based calibrations of the items. Each calibration was then used to estimate item response theory (RT) adaptive scores (thetas) for a third group of examinees who had received the items by computer. The effect of medium of administration was assessed by comparative analyses of the thetas using the alternative calibrations. Subjects The subjects were Navy recruits who were randomly assigned to one of three groups. Data were collected for 2,955 examinees with 989 in Group 1 (computer), 978 in Group 2 (P&P), and 988 in Group 3 (computer). These sample sizes provide enough data for independent calibrations, since simulation results obtained by Hulin, Drasgow, and Jarsons (1983, pp.101-110) suggest that substantially larger samples produce little improvement in the precision of item characteristic curves and scores, given the number of items (40) used in these calibrations.

Testing was conducted at a Recruit Training Center in San Diego, CA. ASVAB scores of record were obtained for nearly all of the recruits and were used to assess whether the groups were comparable in ability levels. Items The items were taken from pools specifically developed in support of CAT-ASVAB by Prestwood, Vale, Massey, and Welsh (1985). Forty items from each of four ASVAB content areas (general science, arithmetic reasoning, word knowledge, and shop information) were administered by computer to Groups 1 and 3, and by P&P to Group 2. The items were conventionally administered in ascending order of difficulty, using the difficulties obtained by Prestwood et al. (1985). The three groups received the same items with the same instructions and practice problems, in the same order and with the same time limits. Although only 4 of the 11 CAT-ASVAB content areas were included in this study, the medium-of-administration (MOA) subtests were administered in the same order as in the CAT-ASVAB. Tune limits were prorated from 95% completion times for the same content areas in ACAP, with 10% added to allow for a higher completion rate. Subtest order and time limits are shown in Table 1. Table 1 Medium-of-Administration Subtest Order and Time Limits Time Subtest (Minutes) General Science (GS) 19 Arithmetic Reasoning (AR) 63 Word Knowledge (WK) 16 Shop Information (SI) 17 Total 115 The 40 items included 34 high-usage items (usage obtained from ACAP simulation studies) and six "seeds" (not-scored items administered for the purpose of gathering data for on-line calibration research). The booklet format was the same as that used in the original P&P calibration by Prestwood et al. (1985), and the computer format *as the same as that used in ACAP. Practice problems and instructions were also as in ACAP. Item Calibrations IRT parameter estimates based on the three-parameter logistic model (Birnbaum, 1968) were obtained in separate calibrations for each of the two computer groups (calibrations Cl and C3) and for the P&P group (calibration C2). The data sets on which the calibrations are based are labelled Ul, U3, and U2, correspondingly. The calibrations were performed with LOGIST6 (Wingersky, Barton, & Lord, 1982), a computer program that uses a joint maximum-likelihood approach. The design with the corresponding notation is summarized in Table 2. 2

Table 2 Calibration Design Group Data Set/ Item Parameters/ No. Medium Item Responses Calibrations 1 Computer U1 C1 2 P&P U2 C2 3 Computer U3 C3 NM. PAP = pape and pencl. Scores For each recruit in Group 3, three theta scores were computed: TI, T2, and T3 (see Table 3). All three scores were based on U3 responses. T1 scores were calculated from the computeradministration item parameters (Cl). T2 scores were based on the P&P-administration parameters (C2), and T3 scores were calculated from the third parameter set (C3), also based on a computeradministration. As described below, TI ar.d T2 were adaptive scores, using only 10 of the 40 responses from a given examinee, while T3 theta was nonadaptive, using all 40 responses. Table 3 Computation of Theta Scores Calibration Parameters Response Scoring Test Theta Set Method Length C1 (Group 1, computer) U3 Adaptive 10 items TI C2 (Group 2, P&P) U3 Adaptive 10 items T12 C3 (Group 3, computer) U3 Nonadaptive 40 items T3 N=te. P&P = paper and pencil. Adaptive Scores To compute the adaptive thetas (TI and T2), 10-item adaptive tests were simulated using actual examinee responses. Owen's Bayesian scoring (Owen, 1975) was used throughout the test to update the ability estimate, and a Bayesian modal estimate was computed at the end of the test to obtain the final score. Items were selected from information tables on the basis of maximum information. (An information table consists of lists of i. ns by ability level Within each list, all the items in the pool-40 in this case-are arranged in descending order of the values of their information functions computed at that ability level. The information tables used in this study were computed for 37 ability levels equally spaced along the [-2.25, +2.25] interval). Nonadaptive Scores The nonadaptive thetas (T3) included all 40 items in the test. Final thetas were computed using the Bayesian modal estimate (since all the items go into the score, it is not necessary to update the ability estimate after each item). 3

Table 3 summarizes the method used for computing the theta scores used in the analyses. ASVAB Scores ASVAB subtest scores for the four content areas of interest and the Armed Forces Qualification Test (AFQT) were obtained from the records for most of the examinees. The subtests were General Science (GS), Arithmetic Reasoning (AR), Word Knowledge (WK), and Auto Shop (AS). Notice that the ASVAB's Auto Shop subtest covers two content areas: auto information and shop information, whereas in the CAT-ASVAB each area constitutes a separate subtest. Since only shop information was administered in this study, ASVAB-AS was compared to MOA-SI. Calibration Samples Results and Discussion Two subjects in Group 3 (computer) had fewer than 10 valid responses on subt': sts WK and SI, and LOGIST omitted them from the calibrations. These subjects were eliminated from all subsequent analyses of WK and -I, Group 3 (computer). Fir-J sample sizes were 989 for Group 1 (computer), 978 for the Group 2 (P&P), 988 for GS & AR in Group 3 (computer), and 986 for WK & SI in Group 3 (computer). AFQT Comparisons To determine whether the three calibration groups were comparable in examinee ability, a one-way analysis of variance of AFQT by calibration group was computed. Results (Table 4) clearly indicate that there are no AFQT differences among the three groups. Sample sizes are slightly smaller in Tables 4 through 7 because AFQT scores were not available for some examinees. Table 4 Analysis of Variance: AFQT by Calibration Group Sum of Mean Source df Squares Squares F-Ratio F-Prob. Between Groups 2 29.8 14.9199 0.035 0.9656 Within Groups 2923 1247241.6 426.6990 Total 2925 1247271.4 Standard Standard Group No. Count Mean Deviation Error 1 985 55.5655 21.0157 0.6695 2 963 55.3946 20.4010 0.6574 3 978 55.3221 20.5418 0.6569 Total 2926 55.4279 20.6499 0.3818 4

Number-Right Scores Tables 5, 6, and 7 show means, standard deviations, and intercorrelations of number-right scores for same-name ASVAB and MOA subtests by calibration group. Table 5 ASVAB vs. Group I (Computer): Number-Right Score Means, Standard Deviations, and Intercorrelations (N= 985) P&P ASVAB Group I (Computer) Subtest GS AR WK AS AFQT GS AR WK SI P&P ASVAB GS 1.00 AR 0.57 1.00 WK 0.73 0.52 1.00 AS 0.50 0.42 0.48 1.00 Correlation Matrix AFQT 0.69 0.85 0.78 0.45 1.00 Group I (Computer) GS 0.81 0.58 0.73 0.53 0.71 1.00 AR 0.49 0.73 0.46 0.35 0.71 0.57 1.00 WK 0.69 0.47 0.77 0.43 0.67 0.72 0.46 1.00 SI 0.54 0.44 0.49 0.78 0.46 0.60 0.41 0.48 1.00 Means and Standard Deviations Min 4.00 7.00.00 0.00 21.00 8.00 6.00 6.00 6.00 Max 25.00 30.00 35.00 25.00 99.00 39.00 40.00 39.00 39.00 Mean 17.55 20.22 26.85 16.14 55.57 25.66 20.06 22.82 22.62 SD 4.44 5.80 5.41 5.16 21.01 6.23 5.96 5.13 6.88 No. ASVAB = Anned Services Vocational Aptibide Battery, P&P = paper and pencil,s = General Science, AR = Arithmetic Reasoning, WK = Word Knowledge, AS = Auto Shop, AFQT = Armed Forces Qualification Test, SI = Shop Infornatikn. 5

Table 6 ASVAB vs. Group 2 (P&P): Number-Right Score Means, Standard Deviations, and Intercorrelations (N = 963) P&P ASVAB Group 2 (P&P) Subtest GS AR WK AS AFQT GS AR WK SI P&P ASVAB GS 1.00 AR 0.50 1.00 WK 0.74 0.48 1.00 AS 0.52 0.39 0.46 1.00 Correlation Matrix AFQT 0.69 0.82 0.78 0.42 1.00 Group 2 (P&P) GS 0.80 0.54 0.76 0.51 0.72 1.00 AR 0.45 0.72 0.44 0.30 0.68 0.54 1.00 WK 0.69 0.45 0.79 0.38 0.68 0.74 0.44 1.00 SI 0.53 0.41 0.50 0.77 0.46 0.56 0.36 0.43 1.00 Means and Standard Deviations Min 5.00 5.00 10.00 3.00 17.00 8.00 6.00 5.00 3.00 Max 25.00 30.00 35.00 25.00 99.00 39.00 39.00 40.00 40.00 Mew. 17.44 20.41 26.68 16.16 55.39 25.33 20.22 22.74 22.96 SD 4.38 5.52 5.32 5.02 20.40 6.18 5.81 5.28 6.84 Not. See Table 5 for definitions of acronyms. 6

Table 7 ASVAB vs. Group 3 (Computer): Number-Right Score Means, Standard Deviations, and Intercorrelatlons (N = 978) P&P ASVAB Group 3 (Computer) Subtest GS AR WK AS AFQT GS AR WK SI P&P ASVAB GS 1.00 AR 0.48 1.00 WK 0.74 0.44 1.00 AS 0.47 0.35 0.45 1.00 Correlation Matrix AFQT 0.66 0.83 0.75 0.40 1.00 Group 3 (Computer) GS 0.81 0.51 0.75 0.52 0.69 1.00 AR 0.46 0.74 0.42 0.31 0.70 0.53 1.00 WK 0.70 0.40 0.79 0.38 0.66 0.73 0.45 1.00 SI 0.52 0.38 0.49 0.76 0.45 0.60 0.36 0.48 1.00 Means and Standard Deviations Min 4.00 5.00 6.00 0.00 20.00 6.00 5.00 5.00 5.00 Max 25.00 30.00 35.00 25.00 99.00 40.00 40.00 40.00 39.00 Mean 17.42 20.14 26.84 16.14 55.32 25.70 19.70 22.66 22.56 SD 4.47 5.67 5.48 4.93 20.54 6.17 5.94 5.22 7.03 Note. See Table 5 for definition of acronyms. Regmsion Analysis This analysis was designed to test whether 10-item adaptive ability scores computed using computer or P&P calibrated items are equivalent. For each of the four content areas, the appendix presents the regressions and scatter plots of TI on T3, and of T2 on T3, where Ti, 12, and T3 are as defined in Table 3. Then, a LISREL-based (Joreskborg & Sorbom, 1986) analysis was designed to test for the equality of the regression lines of TI on T3 and of T2 on T3. The testing was sequential, first for equality of slopes and then for equality of intercepts (if the slopes are different, testing for equality of intercepts is not required). 7

To test for equality of slopes, a LISREL run that tested for equality of covariances was performed for each of the four content areas. The model specification for LISREL was as follows: 0 = covariance matrix, free AZ = identity matrix = error matrix, zero ( )V(TI,T3) = COV (T2,T3) Results are presented in Table 8. Significant differences were obtained for all content areas except GS. Table 8 Test for Equality of Covariances: COV(T1,T3) = COV(T2,T3) Subtest Chi Sq df p GoF Adj GoF RMSR GS.02 1.876.999.995.001 AR 10.92" 1.001.644-1.137.011 WK 21.38* 1.000.495-2.028.014 SI 12.13* 1.000.790-0.261.016 Notes. I. p = probability, GoF = Goodness of Fit. RMSR = Root Mean Square Residuals. 2. See Table 5 for definitions of othe acronyms. *p <.0 5. Correlation Analysis After the results of the regression analyses were obtained, it was decided to use directional hypotheses in an attempt to explain the differences found. For each of the four content areas, the Pearson correlations, r(t1 on T3) and r(t2 on T3), were obtained, and t-tests of the difference between dependent correlations (Cohen & Cohen, 1975, p. 53) were computed. Table 9 presents these results. For subtests GS and WK, when correlations were computed between 10-item adaptive and 40-item nonadaptive thetas, r(t1,t3) (the computer-computer correlation) was significantly higher than r(t2,t3) (the P&P-computer correlation). This is consistent with a hypothesis that thetas based on the same calibration medium of administration are more similar than thetas based on different media of administration. However, without further analysis, it is not clear why these results were obtained for GS and WK and not for AR and SL Reliability Analysis Since the results from the regression and correlation analyses are conflicting and difficult to interpret, further analyses were required. A design was developed to assess the effect of calibration medium on test reliabilities. The model and the LISREL specifications are described below. These tests assess overall effect across the four content areas simultaneously; if a significant effect is found, further analyses would be required to attribute the error to specific subtests. 8

Table 9 t.test of the Difference Between Dependent Correlations i9 r13 > r23 Subtest NO r(tl,t2) Y(T1,T3) r(t2,t3) df t GS 988 0.9703 0.9608 0.9552 985 2.768** AR 988 0.9813 0.9586 0.9587 985-0.060 WK 986 0.9798 0.9665 0.9632 983 2.114" SI 986 0.9564 0.9508 0.9507 983 0.039 Nogla. 1. r = Pearson correlaton coefficient T1 = Caibation Group I (computer). 10-item adaptive theta; -12 = Calibration Group 2 (P&P), 10-ibm adaptive theta; T3 = Calibration Crou 3 (compute), 40-item nonadaptive theta. 2. See Table 5 for definition of other acronyms. "AlU responses from Group 3 (computer) (U3). **p <.01 (one tailed). *p <.05 (one tailed). Statistical Model A Assume that the observed theta, 0, values have three components: the true ability level 0, measurement error E, and random error due to calibration 8. Then, A = 1 (0+ E) + A where 4 = 0 + e, the true ability plus the error of measurement, and X is a scale factor. Then, the basic measurement model can be described by the following eight equations: Subtest Equation Score Responses Item Parameters 41 = X1 41 + 81 Tl-GS U3 Computer 02 = X 2 42+ 82 Ti-AR U3 Computer 43 = X3 43 + 53 Tl-WK U3 Computer 94 = X4 4A + 84 Ti-SI U3 Computer 5 = X5 + 5 T2-GS U3 P&P 96 = X6 6 + 86 T2-AR U3 P&P 97 = X7 7 + 6 7 T2-WK U3 P&P 98= 8 + T2-SI U3 P&P 9

Selecting the best-fitting model consists of: (1) estimating the model in which certain parameters are set to be equal, (2) estimating a less constrained model, and (3) assessing the statistical significance of the improvement in fit going from the more constrained model to the less constrained model. If the more constrained model fits the data as well as the less constrained model (i.e., within sampling error limits), then one may conclude that the constraints do not seriously erode the fit of the model. In this case, one model is specified such that the calibration errors of the pseudo-true test scores are constrained to be equal for the computer-based and the P&P item parameters; another model is specified such that these calibration errors are free to vary between the two media of administration. If the constrained model provides just as good a fit as the free model, then constraining calibration errors to be equal across item-parameter sets does not erode the fit of the model to the data, and one may conclude that the calibration errors of the ability scores are equal for computer and P&P item-parameters. According to the model, the variance-covariance matrix I: among the observed scores has the form: I = Ax Ox A'x + 08 where Ax is a diagonal matrix with standard deviations in the diagonal, 08 is a diagonal matrix of variances attributable to calibration error, and 40 is the attenuated correlation matrix among the ability values 4. Notice that the matrix (D is attenuated from only one source of error-the source attributable to the calibration; 0 is not attenuated with respect to measurement error. The fixed and estimated parameters of this model are displayed in Table 10. Table 10 Correlation and Variance Matrices in the Model Correlation Matrix -0 41 (2 (3 04 05 06 (7 08 Subtest Score (TI-GS) (TI-AR) (TI-WK) (TI-SI) (T2-GS) (12-AR) (T2-WK) (12-SI) O1 (TI-GS) 1.0 02 (TI-AR)r(02, AA (h) AA 0.1 03 (TI-WK) r(03, 32) 1.0 05 (T2-GS) r O r(04, 02) r(04, 03) 1.0 (12AR)1) ~ v~ A (0 4 *.~) (O2, 01) 1.0 07 (T2-WK) r(1, 01) r(02, 02) 10A r(13, r(r30 A1) 1.0 08 (T2-SI) r(04, 01) r(04, 02) r(14, 03) r(04, 03) A4 0) r 0.02) Variance Matrix 05 r(0 1, 01) r(02, 02) r(03, 03) r(0 4, 04) r(0 5, 05) r(0 6, 06) r(07, 07) r(0 8, O8) 10

Notice in Table 10 that the correlation of a subtest with itself (across media) is equal to one, and that the correlations between same-name subtests are assumed to be equal both across and within calibration media. Fo exc ple, the correlation between P&P,core, T2-AR and T2-GS should be represented by r(16, '05); however, it is represented by r( W2, ei) because under the model all the correlations between GS and AR are assumed to be equal; that is, r(2,) =r6= r(tjj) r(65,12) = 46A). LISREL Models To test the model fit, two models were specified and corresponding LISREL runs were performed. In Model 1, the variances of errors due to calibration (0&) were free to vary between the two media of administration. In Model 2, the variances of errors due to calibration were constrained to be equal for same-name subtests. The 0 constraints in Table 10 were imposed for both Model 1 and Model 2. The LISREL output yields a chi-square statistic that is a measure of how much E differs from S; that is, how well the model fits the data. The difference in the chi-squares from the two models is also a chi-square with df equal to the difference in df from the two models. If this difference is not significant, then the data satisfy/fit the model independently of the calibration errors; that is, errors due to calibration across media for same-name subtests are equal. The LISREL specifications for Model 1 were: 1. Lambda-X = Diagonal Matrix, Free. 2. PHI = Symmetrical Matrix, Free. 3. Theta-Delta = Diagonal Matrix, Free. The LISREL specifications for Model 2 were: 1. Lambda-X = Diagonal Matrix, Free. 2. PHI = Symmetrical Matrix, Free. 3. Theta-Delta (TD) = Diagonal Matrix with Constraints. 4. TD Constraint No. 1: All off-diagonal 0& fixed at zero. 5. TD Constraint No. 2: 08 (computer) = 08 (P&P); that is, 08(l,1) = 05(5,5) 08(2,2) = 08(6,6) 08(3,3) = 08(7,7) 08(4,4) = 08(8,8). Table 11 presents goodness-of-fit statistics for these models. The likelihood ratio chi-square value of the model in Model 1 was 14.07 with 14 df. The result is not statistically significant, indicating that Model 1 adequately explains the observed covariance matrices. Results for Model 2 show a chi-square value of 19.57 with 18 df, which is also not statistically significant. 11

The difference in chi-squares between Model 1 and Model 2 is distributed as a chi-square with df equal to the difference in df from Model 1 and Model 2. This value (5.50 with 4 dj) was not significant, indicating that allowing the error term to be free does not change the fit of the model. Table 11 Test for Equality of Reliabilities Model No. Specification Chi Sq df p GoF Adj Gof RMSR 1 05 = free 14.07 14 n.s. 0.996 0.991 0.002 0 (Computer) = 06 2 (P&P) 19.57 18 n.s. 0.995 0.990 0.003 3 Model 2 - Model 1 5.50 4 n.s. Note. p = probability, GoF = Goodness of Fit., RMSR = Root Mean Square Residuals, n.s. = not significant. Conclusions Results of the regression, correlation, and reliability analyses indicate that, although statistically significant medium effects were found for some content areas, these effects did not affect the reliability of adaptive scores. Results of the reliability analyses indicate that random errors due to calibration have equivalent variance across different media. This suggests that the use of item parameters obtained in a P&P calibration will not affect the reliability of CAT-ASVAB test scores, an important concern of the ACAP program. The regression and correlation results showed significant medium-of-administration effects. The regression results showed effects on AR, WK, and SI, and the correlation results showed effects on GS and WK. Because these results can be attributed to the effects of calibration medium on the scale factor X in the reliability analysis, further hypothesis testing with the reliability model is necessary to elucidate the scale effects. Analyses of individual item parameters may also be necessary for understanding these effects. In addition, the results may be clarified by alternative treatment of not-reached items when the thetas (Tl, T2, and T3), are being computed. Recommendation Although these findings support the use of the P&P parameters of the current CAT-ASVAB item pool, further hypothesis testing with an expanded reliability model is recommended to elucidate the significant effects. In addition, analyses of individual item parameters may be necessary for understanding these effects. 12

References Birnbaum, A. (1968). Some latent-trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley. Cohen, J., & Cohen, P. (1975). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates. Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones-Irwin. Joreskborg, K. G., & Sorbom, D. (1986). LISREL VI: Analysis of linear structural relationships by maximum likelihood and least square methods. Mooresville, IN: Scientific Software, Inc. Moreno, K. E., Segall, D. 0., & Kieckhaefer, W. F. (1985). A validity study of the computerized adaptive testing version of the Armed Services Vocational Aptitude Battery. Proceedings of the 27th Annual Conference of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center. Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive testing. Journal of the American Statistical Association, 70, 351-356. Prestwood, J. S., Vale, C. D., Massey, R. H., & Welsh, J. R. (1985). Armed Services Vocational Aptitude Battery: Development of an adaptive item pool (AFHRL-TR-85-19). Brooks AFB, TX: Air Force Human Resources Laboratory, Manpower and Personnel Division. Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST user's guide. Princeton, NJ: Educational Testing Service. 13

Appendix Regressions, Correlations, and Scatter Plots A-O

GENERAL SCIENCE TI = ADAPTIVE THETA (Cl, U3), 10-items "T2- ADAPTIVE THETA (C2, U3), 10-items T3 - NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 988 points of 2 variables: Variable T2 TI Min -2.5830-2.6970 Max 2.5110 2.5660 Sum 66.6930 23.7240 Mean 0.0675 0.0240 SD 0.8631 0.8570 Correlation Matrix: r2 1.0000 TI 0.9703 1.0000 Variable T2 TI 3-2.... so 17 0* -i * ) o /.J'.4. ft -2- -3- I I I II -3-2 -I 0 1 2 3 Ti A-I

GENERAL SCIENCE TI = ADAPTIVE THETA (CI, U3). 10-ifems T2 - ADAPTIVE THETA (C2, U3). 10-items T3 - NON-ADAPTIVE THETA (C3, U3). 40-items Analysis for 988 points of 2 variables: Variable TI T3 Min -2.6970-3.0300 Max 2.5660 2.5710 Sum 23.7240 2.4860 Mean 0.0240 0.0025 SD 0.8570 0.9232 Correlation Matrix: TI 1.0000 T3 0.9608 1.0000 Variable TI T3 Regression Equation for TI: TI = 0.8919 T3 + 0.0217679 Significance test for prediction of TI Mult-R R-Squared F(I,986) prob (F) 0.9608 0.9232 11856.9617 0.0000 3-2- TI 0-2- -3-5* -3-2 -1 0 2 3 T3 A-2

GENERAL SCIENCE TI = ADAPTIVE THETA (Cl, U3), 10-itens T2 = ADAPTIVE THETA (C2, U3). 10-items T3. NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 988 points of 2 variables: Variable T2 T3 Mi -2.5830-3.0300 Max 2.5110 2.5710 Sum 66.6930 2.4860 Mean 0.0675 0.0025 SD 0.8631 0.9232 Correlation Matrix: 7'2 1.0000 T3 0.9552 1.0000 Variable 12 T3 Regression Equation for T2: T2 = 0.8931 T3 + 0.0652559 Significance test for prediction of T2 Mult-R R-Squared F(1 986) prob (r) 0.9552 0.9124 10271.9555 0.0000 3-2- I : 12 0. -2 p...'- -3 - I I IIII -3-2 -1 0 1 2 3 T3 A-3

ARITHMETIC REASONING TI - ADAPTIVE THETA (Cl, U3), 10-items T2 = ADAPTIVE THETA (C2, U3), 10-itens T3 = NON-ADAPTIVE THETA (C3, U3), 40-iWms Analysis for 988 points of 2 variables: Variable T2 TI Min -2.5440-2.7190 Max 2.5850 2.5890 Sum -66.1150-25.5640 Mean -0.0669-0.025ý SD 0.9463 0.926o Correlation Matrix: T2 1.0000 TI 0.9813 1.0000 Variable 12 TI 3-2-""' pre T2 0 -... -1 %* t*. -2- -3- I I IInII -3-2 -1 0 1 2 3 T1 A-4

ARrrHMETIC REASONING TI = ADAPTIVE THETA (Cl, U3), 10-items 12 = ADAPTIVE THETA (C2, U3), 10-items T3 - NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 988 points of 2 variables: Variable TI T3 Min -2.7190-2.5370 Max 2.5890 2.9080 Sum -25.5640 4.1980 Mean -0.0259 0.0042 SD 0.9266 0.9449 Correlation Matrix: TI 1.0000 T3 0.9586 1.0000 Variable TI T3 Regression Equation for TI: TI = 0.94 T3 + -0.0298686 Significance test for prediction of T1 Mult-R R-Squared F(1,986) prob (F) 0.9586 0.9189 11174.2661 0.0000 3- -3,3* 3-0 1 2 3,-:.. * A-5

ARITHMETIC REASONING TI = ADAPTIVE THETA (Cl, U3), 10-items T2 = ADAPTIVE THETA (C2, U3), 10-items T3 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 988 points of 2 variables: Variable 12 T3 Mini -2.5440-2.5370 Max 2.5850 2.9080 Sum -66.1150 4.1980 Mean -0.0669 0.0042 SD 0.9463 0.9449 Correlation Matrix: T2 1.0000 T3 0.9587 1.0000 Variable r12 T3 Regression Equation for T2. 12 = 0.9602 T3 + -0.070998 Significance test for prediction of T2 Mult-R R-Squared F(1,986) prob (F) 0.9587 0.9192 11216.0675 0.0000 3 2-0,T "*.*.. 0a. - - -3 I I IIIII.3-2 -1 0 1 2 3 A3 A-6

WORD KNOWLEDGE TI = ADAPTIVE THETA (Cl, U3), 10-items T2 - ADAPTIVE THETA (C2, U3), 10-items T3 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 986 points of 2 variables: Variable T2 TI Min -2.2940-2A290 Max 2.6370 2-5080 Sum 33.4100 11.9230 Mean 0.0339 0.0121 SD 0.8531 0.8767 Correlation Matrix: T2 1.0000 TI 0.9798 1.0000 Variable 12 T1 3-2- 1. N,.o -2-* A-7' -3-2 -1 TI7

WORD KNOWLEDGE TI = ADAPTIVE THETA (Cl, U3), 10-items T2 m ADAPTIVE THETA (C2, U3), 10-items T3 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 986 points of 2 variables: Variable TI 73 Mini -2.4290-2.5000 Max 2.5080 2.9960 Sun 11.9230 5.2260 Mean 0.0121 0.0053 SD 0.8767 0.8920 CoffelatiLo' Matrix: Ti 1.0000 13 0.9665 1.0000 Variable T1 T3 Regression Equation for TI: T1 = 0.95 T3 + 0.00705733 Significance wst for prediction of TI Mult-R R-Squared F(1,984) prob (F) 0.9665 0.934113958.1727 0.0000 3-2 - TI 0 - -3-3 -2-1 0 1 2 3 T3 A-8

WORD KNOWLEDGE TI m ADAPTIVE THETA (C1, U3). 10-items T2 = ADAPTIVE THETA (C2, U3), 10-items "13 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 986 points of 2 variables: Variable 12 T3 Min -2.2940-25000 MAX 2.6370 2.9960 Sum 33.41CO 5.2260 Mean 0.0339 0.0053 SD 0.8531 0.8920 Correlation Matrix: 12 1.0000 T3 0.9632 1.0000 Variable 12 T3 Regression Equation for T2: T2 = 0.9211 T3 + 0.0290022 Significance test for prediction of 12 Mult-R R-Squared F(1,984) prob (F) 0.9632 0.9277 12624.8208 0.0000 3-2 * : 1 2 -if sot -2- -3- I l Il It -3-2 -1 0 1 2 3 "3 A-9

SHOP INFORMATION TI = ADAPTIVE THETA (CI, U3), 10-items 12 = ADAFTIVE THETA (C2, U3), 10-iwtnm T3 - NON-ADAPTIVE THETA (C3, U3), 40-itms Analysis for 986 points of 2 variables Variable T2 TI Min -2.7960-2.6640 Max 2.7030 2.6750 Sum 12.1180 40.9270 Mean 0.0123 0.0415 SD 0.8962 0.8656 Correlation Matrix: T2 1.0000 TI 0.9564 1.0000 Variable T2 TI 3-2- : *... 120 -" 1. S I " *1-2- -3- I I I III -3-2 -1 0 1 2 3 TI A-10

SHOP INRORMATION TI - ADAPTIVE THETA (Cl, U3), 10-iems T2 - ADAPTIVE THETA (C2. 13), 10-items T3 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 986 points of 2 variables: Variable TI "3 Min -2.6640-3.6840 Max 2.6750 2.5460 Sum 40.9270-10.4760 Mean 0.0415-0.0106 SD 0.8656 0.9146 Correlation Matrix: TI 1.0000 T3 0.9508 1.0000 Variable TI T3 Regression Equation for TI: TI = 0.8999 T3 + 0.0510692 Significance test for prediction of Ti Mult-R R-Squared F(1,984) prob (F) 0.9508 0.9041 9273.3592 0.0000 3- TI 0 "!!..3-2-* -I I I -3- -3-2 -1 0 1 2 3 T3 A-i1

SHOP INFORMATION TI = ADAPITVE THETA (CI, U3), 10-items T2 = ADAPTIVE THETA (C2, U3), 10-items "T3 = NON-ADAPTIVE THETA (C3, U3), 40-items Analysis for 986 points of 2 variables: Variable "2 '"3 Mini -2.7960-3.6840 max 2.7030 25460 Sum 12.1180-10.4760 Mean 0.0123-0.0106 SD 0.8962 0.9146 Correlation Matrix: 12 1.0000 1"3 0.9507 1.0000 Variable 12 T3 Regression Equation for 12: T2 = 0.9316 T3 + 0.0221877 Significance test for prediction of 12 Mult-R R-Squared F(1,984) prob (F) 0.9507 0.9039 9254.1786 0.0000 3-2-.: ~.:'. - 2 S ~ -2 *:. " *e S -3- I 1III I I -3-2 -1 0 1 2 3 13 A-12

Distribution List Bureau of Naval Personnel (PERS-23) HQ USMEPCOM Office of the Assistant Secretary of Defense (FM&P) Defense Technical Information Center (DTIC) (4)