THE EFFECTIVENESS OF COMPUTERIZED ADAPTIVE TESTING ON ADVANCED PROGRESSIVE MATRICES TEST PERFORMANCE 1
Aries Yulianto
Faculty of Psychology, University of Indonesia

Abstract

Although computers are still rarely used for test administration in Indonesia, there is considerable opportunity to develop such use. This experiment measured the effectiveness of computerized test administration, especially computerized adaptive testing (CAT). Two weeks before the experiment, subjects had taken the Advanced Progressive Matrices (APM) test in paper-and-pencil test (PPT) form. The subjects were then randomly assigned to six experimental groups to take the same test in classical computerized test (CT) or CAT form, with time limits of 25 minutes, 50 minutes, or no limit. Test scores were estimated with the maximum likelihood method. Following Embretson and Reise (2000), items with difficulty (b) between -0.5 and 0.5 were chosen as the first items administered through CAT. Subsequent items were chosen by the maximum information criterion, and administration stopped when the standard error of the score was 0.4 or smaller. There was no significant difference between CAT and PPT scores, but there was a significant difference between CT and PPT scores. The study found that CAT is effective because it consumed less time and administered fewer items (12 on average) than CT and PPT (36 items in total).

Keywords: Computerized Adaptive Testing, Progressive Matrices, Paper-Pencil Test.

INTRODUCTION

Psychological tests are in wide use in Indonesia, from diagnosis to selection purposes and from academic to industrial settings. The ultimate aim of psychological testing is to select people, with the main objective of placing the right person in the right place. Most tests are delivered with paper-and-pencil administration; only a small number are administered as performance tests. As a result, some time is required to administer, score, and report test results.
This becomes a heavier job for the tester when large numbers of examinees are involved. Unfortunately, fast reporting is a major objective in almost all testing situations. Another problem arises from using the same tests over several decades: most tests lack security, so their reliability and validity need to be questioned. On the other hand, computers have increasingly been used for many purposes and in many settings, and governments and individuals promote computer use in most aspects of life. Unfortunately, using the computer as a method of test administration has received little attention. Some software has been built to help testers score and report test results, but testers still administer tests with paper and pencil.

Many nonverbal tests are used as assessment tools, such as Raven's Progressive Matrices (PM), the General Intelligence Test subtest 5 (TIU-5), the Culture-Fair Intelligence Test (CFIT), and the Figure Reasoning Test (FRT). McAulay, Deary, Ferguson, and Frier (2001) found that nonverbal ability reflects adaptive ability and problem solving better than verbal ability. Verbal tests are considered a disadvantage for some groups of examinees, such as people with hearing, verbal, or vision disabilities, people with mental retardation, or children with severe emotional disturbance (Bracken & McCallum, in Fives & Flanagan, 2002). Among nonverbal ability tests, the PM test is one of the most frequently used (Murphy & Davidshofer, 2001). The PM test was constructed on the basis of Spearman's g-factor theory of intelligence and is widely used in basic research and intellectual screening (Gregory, 2000). As a culture-fair test, the PM is also used in general cognitive ability research to compare intellectual ability across nations, races, or majority-minority groups. Ackerman (2000) used the PM test to find a major factor in adult intelligence. In neuropsychological settings, the PM has been used to assess the intellectual ability of brain-damaged patients (Caffarra, Vezzadini, Zonato, Copelli, & Venneri, 2003).

Over the past three decades in the U.S., computers have increasingly been used to automate the administration, scoring, and interpretation of results from a wide variety of psychological measures, including assessments of ability and academic achievement (Brown & Weiss, 1977), neuropsychological status (e.g., Jenskins, Fitzpatrick, Garrat, Peto, & Steward-Brown, 2001), vocational interests, and personality (e.g., Butcher, Perry, & Atlis, 2000; Simms & Clark, 2005). Computers provide an objective, efficient, and reliable means of delivering assessment services to clients and research participants. A concern in both research and clinical settings is the length of many personality measures. For instance, an hour or longer is often required to complete such measures as the 567-item MMPI-2, the 344-item Personality Assessment Inventory, or the 240-item NEO Personality Inventory-Revised (NEO-PI-R). The time required for such assessments is difficult to accommodate in many applied and research settings.

1 Paper presented at the International Meeting of the Psychometric Society (IMPS) 2007, Tokyo, Japan, 9-13 July 2007.
Managed care companies have limited the types of assessments for which they will reimburse practitioners to those that require less time and effort to administer, score, and interpret. Research time is also scarce and costly. Moreover, long measures can lead to fatigue and drifting attention in many test takers, which ultimately compromises the validity of the test profile and complicates test interpretation.

With developing technology, the shift from paper-and-pencil administration to computer-administered testing started in the 1970s (Bunderson, Inouye, & Olsen, 1989). In this first generation of computerized testing, the computer was used to deliver items just as in the paper-and-pencil test. This offered several advantages, such as fast scoring, immediate reporting, better standardization of test administration, increased test security, and reduced measurement error. Combined with Item Response Theory (IRT), the computer can deliver items suited to the examinee's ability, so each examinee receives a different set of items. This second-generation use of computer administration is known as computerized adaptive testing.

Computerized Adaptive Testing

In the most basic sense, computerized adaptive testing (CAT) permits the selection and administration of items that are individually tailored to the trait or ability level of the examinee, with the potential of substantial item and time savings (Embretson & Reise, 2000). A typical CAT selects and administers only those items that provide the most psychometric information (i.e., yield the lowest standard errors of measurement) at a given trait level. For example, IRT and CAT have been shown to offer noteworthy solutions to the challenge of constructing patient-based health status measures that are both more practical and more reliable over a wide range of score levels (Ware, Gandek, Sinclair, & Bjorner, 2005). Figure 1 shows the scheme of CAT administration:

1. Start with an initial ability estimate.
2. Select and deliver an optimum item.
3. Evaluate the response.
4. Re-estimate the ability and its standard error.
5. If the stopping rule is not satisfied, return to step 2; otherwise end the test.

Figure 1. Scheme of CAT administration.

Considerations in CAT administration

Embretson and Reise (2000) state some considerations in CAT administration:

Item bank. The basic goal of CAT is to administer a set of items that is, in some sense, maximally efficient and informative for each examinee. Because of the primary importance of the item bank in determining the efficiency of CAT, much thought and research has gone into issues involving the creation and maintenance of item banks. No precise number can be given regarding how many items are required, but a rough estimate is around 100. Items in the bank should be calibrated with one of the item-parameter models: 1PL, 2PL, or 3PL.

Administer the first item. If the examinee population can be assumed to be normally distributed, then a reasonable choice for starting a CAT is an item of moderate difficulty, such as one with a difficulty parameter between -0.5 and 0.5. If some prior information is available regarding the examinee's position on the trait continuum, it can be used in selecting the difficulty level of the first item; the average θ of the examinee population could be used as the initial ability estimate to make the CAT optimal (Thissen & Mislevy, 1990). Some testers like to begin a CAT with an easy item so that the examinee has a success experience, which may in turn alleviate problems such as test anxiety (Embretson & Reise, 2000).

Score the examinee's ability. There are three main methods for estimating an examinee's ability: (a) Maximum Likelihood (ML), (b) Maximum a Posteriori (MAP), and (c) Expected a Posteriori (EAP).
Some researchers do not endorse the use of priors because they can affect scores; for example, if few items are administered, ability estimates may be pulled toward the mean of the prior distribution. For this reason, some researchers have implemented a step-size procedure for assigning scores at the beginning of a CAT.
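The ML scoring option can be made concrete for the Rasch (1PL) model that this study uses later. The following is a minimal sketch, not the paper's implementation; the bisection bounds and tolerance are my own arbitrary choices.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_theta(responses, bs, lo=-4.0, hi=4.0, tol=1e-6):
    """Maximum-likelihood trait estimate via bisection on the score equation.

    The ML estimate solves sum_i (u_i - P_i(theta)) = 0. For an all-correct
    or all-incorrect pattern the likelihood is monotone in theta, so no
    finite ML estimate exists and None is returned (the caveat noted in the
    Dependent Measure section below).
    """
    if all(u == 1 for u in responses) or all(u == 0 for u in responses):
        return None
    def score(theta):
        return sum(u - rasch_p(theta, b) for u, b in zip(responses, bs))
    # score() decreases in theta: positive below the solution, negative above.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With one correct and one incorrect response on two items of difficulty 0, the estimate is θ = 0; with two correct out of three such items it is ln 2 ≈ 0.69.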
Select the next item. Two strategies can be used to select the next item: maximum information and minimum expected posterior standard deviation; Thissen and Mislevy (1990) call the latter strategy Bayesian estimation. The maximum information strategy selects the item that provides the most psychometric information at the examinee's current estimated ability level, and usually corresponds to ML scoring. The second strategy selects the item that minimizes the examinee's expected posterior standard deviation, that is, the item that makes the examinee's standard error smallest. This typically corresponds to Bayesian scoring procedures and does not always yield the same result as the maximum information strategy.

Test termination. In CAT, after every item response the examinee's trait level and standard error are re-estimated and the computer selects the next item to administer. This cannot go on forever, so the CAT algorithm needs a stopping rule. There are four stopping rules: (1) variable length, (2) fixed length, (3) variable-fixed length, and (4) time limit. Under the variable-length rule, the test terminates when the standard error falls below some acceptable value; Thissen and Mislevy (1990) call this the target strategy. Its advantage is consistency with the classical-theory assumption of equal measurement error variance, which suits statistical analyses that take measurement error into account. The standard error (SE) limit varies among researchers. Ury used an SE of .3162 or smaller, since √(1 − .90) = .3162, so this limit yields the same result as a classical reliability coefficient of .90 (Thissen & Mislevy, 1990). Hornke (2000) used .38 as the SE limit. Blais and Raiche (2002) found in their simulation that if the SE is .40 or smaller, the SE of the ability estimate differs by only .03 from the previous estimation.
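For the Rasch model, maximum-information selection and the variable-length SE rule both have simple closed forms: item information is P(1 − P), and the SE of the ML estimate is 1/√(test information). The sketch below is illustrative only; the function names are mine, not the study's.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_info(theta, b):
    """Fisher information of a Rasch item: I(theta) = P(1 - P)."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def select_max_info(theta, bank, used):
    """Maximum-information selection: the unused item most informative at theta.

    For the Rasch model this is simply the item whose difficulty b is
    closest to the current ability estimate.
    """
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_info(theta, bank[i]))

def standard_error(theta, administered_bs):
    """SE of the ML ability estimate: 1 / sqrt(sum of item informations)."""
    return 1.0 / math.sqrt(sum(item_info(theta, b) for b in administered_bs))

def should_stop(theta, administered_bs, se_limit=0.40):
    """Variable-length stopping rule: terminate once SE <= se_limit."""
    return standard_error(theta, administered_bs) <= se_limit
```

The default `se_limit=0.40` mirrors the Blais and Raiche (2002) recommendation adopted later in this study; a stricter limit such as .3162 simply requires more administered information before stopping.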
The second termination strategy, fixed length, depends on the number of items delivered; Thissen and Mislevy (1990) call this strategy maximum number of items. Its advantages are that it is easy to implement and that item usage can be predicted. These two strategies can be combined in the third, variable-fixed length strategy, for cases where the item bank could run out before the precision target is reached. Thissen and Mislevy (1990) suggest a fourth strategy, terminating the test after a specific time; this strategy is an advantage for speeded tests, but not for power tests. Embretson and Reise (2000) recommend the SE criterion as an effective strategy for test termination, since it uses CAT's own algorithm.

The basic objective of this study is to show that CAT delivers a test more efficiently than conventional administrations: the paper-and-pencil test and the classical computerized test. To address this objective, two independent variables were involved: test administration type and working-time limit. The research question is: do test administration type and working-time limit influence test performance on the APM?

Method

Participants

First, 298 undergraduate students of the Faculty of Psychology, University of Indonesia, took the 36-item APM in paper-and-pencil form. Two weeks later, one hundred and twenty of these students (112 women and 8 men) volunteered to take part in the experiment in the faculty's computer laboratory.
Design

The experiment followed a 2 (test administration type: classical computerized test / computerized adaptive test) x 3 (time limit: 25 minutes / 50 minutes / no limit) randomized between-subjects factorial design. The subjects were randomly assigned to six experimental groups to take the APM test in classical computerized test (CT) or computerized adaptive test (CAT) form, with a time limit of 25 minutes, 50 minutes, or no limit.

Procedure

This research involved test performance as the dependent variable and two independent variables: test administration type and working-time limit.

Manipulation. The APM test was delivered using the Fastest Pro 1.6 trial version software, which has two options for delivering a test, classical or adaptive, and a feature to control the time limit. Test administration was set to three variations: 25 minutes (the same as paper-and-pencil administration), 50 minutes (twice the paper-and-pencil limit), and no time limit. Unlike in paper-and-pencil administration, the test instructions in computerized administration were presented individually on the computer monitor. The CAT administration was configured as follows:

Item bank. One-parameter logistic (1PL, or Rasch) item parameters were estimated using ACER QUEST. For this purpose, previously available test data from 1,216 subjects were added. All 36 items were considered fit and were used as the item bank; difficulty parameters varied from to .

Administer the first item. An item with a difficulty parameter between -0.5 and 0.5 was randomly selected, because the subjects' trait levels were assumed to be normally distributed.

Score the examinee's ability. Subjects' trait levels were estimated using ML.

Select the next item. The item with maximum information at the current ability estimate was selected for delivery; with the maximum-information method, CAT administration is more effective (Embretson & Reise, 2000).

Test termination.
A variable-length criterion was used to terminate the test: following the Blais and Raiche (2002) recommendation, the test terminated when the SE was .40 or below.

Dependent Measure

The test score, the subject's trait level (θ), was estimated using maximum likelihood (ML). Although no ML estimate can be obtained from a perfect all-endorsed or all-not-endorsed response pattern, the ML trait estimator has several positive asymptotic features (Embretson & Reise, 2000): it is unbiased (the expected value of the estimate equals the true θ), it is efficient, and its errors are normally distributed.

Statistical Analyses

To compare subjects' estimated θ across two test administrations, paper-and-pencil versus computer administration (CT or CAT), a paired-sample t-test was used. Factorial analysis of variance was used to examine the main effect of each independent variable, test administration type and working-time limit, as well as their interaction. With a significance level of 0.05, data were analyzed with SPSS.
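The paired-sample t-test compares the same subjects under two administrations by testing whether the mean pairwise difference is zero. A minimal standard-library sketch (the helper name and the illustrative scores are mine, not the study's data):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic: the mean of the pairwise differences
    divided by the standard error of that mean (df = n - 1)."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical theta estimates for the same ten subjects under two
# administrations (illustrative values only).
ppt = [0.8, 0.5, 1.1, -0.2, 0.6, 0.9, 0.3, 0.7, 1.0, 0.4]
cat = [0.7, 0.6, 1.0, -0.1, 0.5, 1.0, 0.2, 0.8, 0.9, 0.5]
t = paired_t(ppt, cat)  # near zero here: the mean paired difference is 0
```

The statistic is then compared with the critical t value for n − 1 degrees of freedom at the chosen significance level (about 2.26 for df = 9 at α = .05, two-tailed).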
Results

Means, standard deviations, and minimum-maximum estimated ability levels (θ) for each experimental group are shown in Table 1. Although the CAT groups had a smaller mean than the CT groups, the difference was not significant (F = .721, p > .05). In other words, the subjects in this experiment had equal abstract reasoning ability as measured by the APM test.

Subjects' θ under paper-and-pencil administration differed significantly from their θ under CT administration (t = 3.479, p < .01); the CT mean (M = .4879) was lower than the paper-and-pencil mean (M = .6737). This did not happen in the comparison between paper-and-pencil and CAT administration: although θ under CAT (M = .5059) was lower than under paper-and-pencil administration (M = .5469), the difference was not significant (t = .547, p > .05). From these results, CAT has an advantage over CT administration in that it keeps the θ estimate close to the true θ (assuming the paper-and-pencil score approximates the true θ). However, when subjects' θ in the CT groups was compared with subjects' θ in the CAT groups, no difference was found (F = 2.202, p > .05); this result is not consistent with the previous one.

Table 1. Means, standard deviations, and minimum-maximum θ scores for each experimental group (CT and CAT at each time limit); the numeric entries are not preserved in this transcription.

Comparisons between the two types of test administration at each time limit showed similar results. With the 25-minute limit, there was no difference between CAT and CT in the θ estimate (F = .035, p > .05). Although the CAT θ estimate was higher than CT's, there was no significant difference with the 50-minute limit (F = 1.748, p > .05).
A similar result was found in the groups with no time limit (F = 1.339, p > .05). Comparing the three time limits within CT administration, there was no significant difference in the θ estimates (F = .160, p > .05); a similar result was found within CAT administration (F = .408, p > .05).
Figure 2. Mean plot of estimated marginal means of θ by test administration (CT, CAT) and time limit (25 minutes, 50 minutes, no limit), showing the interaction effect.

One purpose of this experiment was to show that CAT is more efficient than the other types of test administration. Efficiency is evaluated by the amount of time spent administering the test, which depends on the number of items administered: the fewer the items, the less time spent. Table 2 shows the average number of items for each treatment condition. Under every time-limit condition, CAT administration delivered fewer items (12 on average) than CT (34 on average), and the number of items delivered differed significantly between the two types of administration. We can therefore conclude that CAT is more efficient than CT, because it delivers fewer items with no difference in the ability estimate.

Table 2. Average number of items for each treatment condition (CT and CAT at each time limit); the numeric entries are not preserved in this transcription.
Discussion

This experiment shows that CAT is a more efficient method of delivering a test than the classical methods (paper-and-pencil administration and the classical computerized test). This result is consistent with Embretson and Reise's (2000) argument that an IRT-based CAT requires fewer items than a conventional or paper-and-pencil test.

One thing that needs further exploration is the contribution of psychological factors to test performance, especially in computerized testing. As noted earlier, examinees in Indonesia usually take tests in paper-and-pencil form, so performance in a computerized setting may differ from performance on a paper-and-pencil test. Tonidandel, Quinones, and Adams (2002) found that test anxiety correlates negatively with test performance, supporting an earlier finding by Wise (1997b) that anxiety rising during a test decreases test performance; this can happen because computerized testing is unfamiliar (Wise, 1997a). Since the subjects in this experiment were all college students who were familiar with computers, I assume there was little or no test anxiety resulting from computer administration. As a consequence, this finding should not be generalized to populations unfamiliar with computers; further research should consider the effect of psychological factors on test performance.

References

Blais, J., & Raiche, G. (2002). Some features of the sampling distribution of the ability estimate in computerized adaptive testing according to two stopping rules. Paper presented at the 11th International Objective Measurement Workshop, New Orleans, April 2002 (unpublished).

Brown, J.L., & Weiss, D.J. (1977). An adaptive testing strategy for achievement test batteries.

Bunderson, C.V., Inouye, D.K., & Olsen, J.B. (1989). The four generations of computerized educational measurement. In R.L. Linn (Ed.), Educational Measurement (3rd ed.). New York: American Council on Education & Macmillan.

Butcher, J.M., Perry, J.L., & Atlis, M.M. (2000). Validity and utility of computer-based test interpretation. Psychological Assessment, 12(1).

Caffarra, P., Vezzadini, G., Zonato, F., Copelli, S., & Venneri, A. (2003). A normative study of a shorter version of Raven's Progressive Matrices. Neurological Sciences, 24.

Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for Psychologists. New Jersey: Lawrence Erlbaum Associates.

Fives, C.J., & Flanagan, R. (2002). A review of the Universal Nonverbal Intelligence Test (UNIT): An advance for evaluating youngsters with diverse needs. School Psychology International, 23(4).

Gregory, R.J. (2000). Psychological Testing: History, Principles, and Applications (3rd ed.). MA: Allyn & Bacon.

Hornke, L.F. (2000). Item response times in computerized adaptive testing. Psicológica, 21.

Jenskins, C., Fitzpatrick, R., Garrat, A., Peto, V., & Steward-Brown, S. (2001). Can item response theory reduce patient burden when measuring health status in neurological status? Journal of Neurology, 71(2).

McAulay, V., Deary, I.J., Ferguson, S.C., & Frier, B.M. (2001). Acute hypoglycemia in humans causes attentional dysfunction while nonverbal intelligence is preserved. Diabetes Care, 24(10).

Murphy, K.R., & Davidshofer, K.O. (2001). Psychological Testing: Principles and Applications (5th ed.). New Jersey: Prentice-Hall.

Simms, L.J., & Clark, L.A. (2005). Validation of a computerized adaptive version of the Schedule for Nonadaptive and Adaptive Personality (SNAP). Psychological Assessment, 17(1).

Thissen, D., & Mislevy, R.J. (1990). Testing algorithms. In H. Wainer, N.J. Dorans, R. Flaugher, & B.F. Green (Eds.), Computerized Adaptive Testing: A Primer. New Jersey: Lawrence Erlbaum Associates.

Tonidandel, S., Quinones, M.A., & Adams, A.A. (2002). Computer-adaptive testing: The impact of test characteristics on perceived performance and test takers' performance. Journal of Applied Psychology, 87(2).

Ware, J.E. Jr., Gandek, B., Sinclair, S.J., & Bjorner, J.B. (2005). Item response theory and computerized adaptive testing: Implications for outcomes measurement in rehabilitation. Rehabilitation Psychology, 50(1).

Wise, S.L. (1997a). Examinee issues in CAT. Paper presented at the annual meeting of the National Council on Measurement in Education (unpublished).

Wise, S.L. (1997b). Overview of practical issues in a CAT program. Paper presented at the annual meeting of the National Council on Measurement in Education (unpublished).
More informationDealing with Variability within Item Clones in Computerized Adaptive Testing
Dealing with Variability within Item Clones in Computerized Adaptive Testing Research Report Chingwei David Shin Yuehmei Chien May 2013 Item Cloning in CAT 1 About Pearson Everything we do at Pearson grows
More informationEvaluating the Technical Adequacy and Usability of Early Reading Measures
This is a chapter excerpt from Guilford Publications. Early Reading Assessment: A Practitioner's Handbook, Natalie Rathvon. Copyright 2004. chapter 2 Evaluating the Technical Adequacy and Usability of
More informationA Gradual Maximum Information Ratio Approach to Item Selection in Computerized Adaptive Testing. Kyung T. Han Graduate Management Admission Council
A Gradual Maimum Information Ratio Approach to Item Selection in Computerized Adaptive Testing Kyung T. Han Graduate Management Admission Council Presented at the Item Selection Paper Session, June 2,
More informationHarrison Assessments Validation Overview
Harrison Assessments Validation Overview Dan Harrison, Ph.D. 2016 Copyright 2016 Harrison Assessments Int l, Ltd www.optimizepeople.com HARRISON ASSESSMENT VALIDATION OVERVIEW Two underlying theories are
More informationChapter Standardization and Derivation of Scores
19 3 Chapter Standardization and Derivation of Scores This chapter presents the sampling and standardization procedures used to create the normative scores for the UNIT. The demographic characteristics
More informationAn Approach to Implementing Adaptive Testing Using Item Response Theory Both Offline and Online
An Approach to Implementing Adaptive Testing Using Item Response Theory Both Offline and Online Madan Padaki and V. Natarajan MeritTrac Services (P) Ltd. Presented at the CAT Research and Applications
More informationUsing the CTI to Assess Client Readiness for Career and Employment Decision Making
Using the CTI to Assess Client Readiness for Career and Employment Decision Making James P. Sampson, Jr., Gary W. Peterson, Robert C. Reardon, Janet G. Lenz, & Denise E. Saunders Florida State University
More informationDesign of Intelligence Test Short Forms
Empirical Versus Random Item Selection in the Design of Intelligence Test Short Forms The WISC-R Example David S. Goh Central Michigan University This study demonstrated that the design of current intelligence
More informationTest Partnership Insights Series Technical Manual
Test Partnership Insights Series Technical Manual 2017 First published March 2017 All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic
More informationRedesign of MCAS Tests Based on a Consideration of Information Functions 1,2. (Revised Version) Ronald K. Hambleton and Wendy Lam
Redesign of MCAS Tests Based on a Consideration of Information Functions 1,2 (Revised Version) Ronald K. Hambleton and Wendy Lam University of Massachusetts Amherst January 9, 2009 1 Center for Educational
More informationUK Clinical Aptitude Test (UKCAT) Consortium UKCAT Examination. Executive Summary Testing Interval: 1 July October 2016
UK Clinical Aptitude Test (UKCAT) Consortium UKCAT Examination Executive Summary Testing Interval: 1 July 2016 4 October 2016 Prepared by: Pearson VUE 6 February 2017 Non-disclosure and Confidentiality
More informationInnovative Item Types Require Innovative Analysis
Innovative Item Types Require Innovative Analysis Nathan A. Thompson Assessment Systems Corporation Shungwon Ro, Larissa Smith Prometric Jo Santos American Health Information Management Association Paper
More informationWeb-Based Assessment: Issues and Applications in Personnel Selection
Web-Based Assessment: Issues and Applications in Personnel Selection John A. Weiner Psychological Services, Inc. June 22, 2004 IPMAAC 28th Annual Conference on Personnel Assessment 1 Introduction Computers
More informationDetermining the accuracy of item parameter standard error of estimates in BILOG-MG 3
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Public Access Theses and Dissertations from the College of Education and Human Sciences Education and Human Sciences, College
More informationTest and Measurement Chapter 10: The Wechsler Intelligence Scales: WAIS-IV, WISC-IV and WPPSI-III
Test and Measurement Chapter 10: The Wechsler Intelligence Scales: WAIS-IV, WISC-IV and WPPSI-III Throughout his career, Wechsler emphasized that factors other than intellectual ability are involved in
More informationSales Selector Technical Report 2017
Sales Selector Technical Report 2017 Table of Contents Executive Summary... 3 1. Purpose... 5 2. Development of an Experimental Battery... 5 3. Sample Characteristics... 6 4. Dimensions of Performance...
More informationCONSTRUCTING A STANDARDIZED TEST
Proceedings of the 2 nd SULE IC 2016, FKIP, Unsri, Palembang October 7 th 9 th, 2016 CONSTRUCTING A STANDARDIZED TEST SOFENDI English Education Study Program Sriwijaya University Palembang, e-mail: sofendi@yahoo.com
More informationTheory and Characteristics
Canadian Journal of School Psychology OnlineFirst, published on September 19, 2008 as doi:10.1177/0829573508324458 Reynolds, C. R., & Kamphaus, R. W. (2003). RIAS: Reynolds Intellectual Assessment Scales.
More information1. BE A SQUEAKY WHEEL.
Tips for Parents: Intellectual Assessment of Exceptionally and Profoundly Gifted Children Author: Wasserman, J. D. Source: Davidson Institute for Talent Development 2006 The goal of norm-referenced intelligence
More informationStandardized Measurement and Assessment
Standardized Measurement and Assessment Measurement Identify dimensions, quantity, capacity, or degree of something Assign a symbol or number according to rules (e.g., assign a number for height in inches
More informationJournal of Statistical Software
JSS Journal of Statistical Software May 2012, Volume 48, Issue 8. http://www.jstatsoft.org/ Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package catr David Magis
More informationInternational Journal in Foundations of Computer Science & Technology (IJFCST) Vol.6, No.1, January Azerbaijan, Iran
DESIGNING DIGITAL COMPREHENSIVE SYSTEM TO TEST AND ASSESS THE INTELLIGENTLY BEHAVIORS OF FROM 6 TO 12 YEARS OLD CHILDREN BASED ON THE WECHSLER INTELLIGENCE THEORY Yaser Rahmani 1 and Ahmad Habibizad Navin
More informationA SIMULATION MODEL FOR INTEGRATING QUAY TRANSPORT AND STACKING POLICIES ON AUTOMATED CONTAINER TERMINALS
A SIMULATION MODEL FOR INTEGRATING QUAY TRANSPORT AND STACKING POLICIES ON AUTOMATED CONTAINER TERMINALS Mark B. Duinkerken, Joseph J.M. Evers and Jaap A. Ottjes Faculty of OCP, department of Mechanical
More informationUsing the WASI II with the WAIS IV: Substituting WASI II Subtest Scores When Deriving WAIS IV Composite Scores
Introduction Using the WASI II with the WAIS IV: Substituting WASI II Subtest Scores When Deriving WAIS IV Composite Scores Technical Report #2 November 2011 Xiaobin Zhou, PhD Susan Engi Raiford, PhD This
More informationEquivalence of Q-interactive and Paper Administrations of Cognitive Tasks: Selected NEPSY II and CMS Subtests
Equivalence of Q-interactive and Paper Administrations of Cognitive Tasks: Selected NEPSY II and CMS Subtests Q-interactive Technical Report 4 Mark H. Daniel, PhD Senior Scientist for Research Innovation
More information(1960) had proposed similar procedures for the measurement of attitude. The present paper
Rasch Analysis of the Central Life Interest Measure Neal Schmitt Michigan State University Rasch item analyses were conducted and estimates of item residuals correlated with various demographic or person
More informationAdministration duration for the Wechsler Adult Intelligence Scale-III and Wechsler Memory Scale-III
Archives of Clinical Neuropsychology 16 (2001) 293±301 Administration duration for the Wechsler Adult Intelligence Scale-III and Wechsler Memory Scale-III Bradley N. Axelrod* Psychology Section (116B),
More informationNear-Balanced Incomplete Block Designs with An Application to Poster Competitions
Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania
More informationAn introduction to: Q-interactive. October 2014 Jeremy Clarke Technology Consultant
An introduction to: Q-interactive October 2014 Jeremy Clarke Technology Consultant Good bye paper Changing Times What is Q-interactive? Q-interactive is a Comprehensive Digital Assessment Platform, where
More informationBalancing Security and Efficiency in Limited-Size Computer Adaptive Test Libraries
Balancing Security and Efficiency in Limited-Size Computer Adaptive Test Libraries Cory oclaire KSH Solutions/Naval Aerospace edical Institute Eric iddleton Naval Aerospace edical Institute Brennan D.
More informationField Testing and Equating Designs for State Educational Assessments. Rob Kirkpatrick. Walter D. Way. Pearson
Field Testing and Equating Designs for State Educational Assessments Rob Kirkpatrick Walter D. Way Pearson Paper presented at the annual meeting of the American Educational Research Association, New York,
More informationJOB DESCRIPTION 1. JOB IDENTIFICATION. Job Title: Assistant Clinical Psychologist : Adult. Department: Psychological Services
JOB DESCRIPTION 1. JOB IDENTIFICATION Job Title: Assistant Clinical Psychologist : Adult Department: Psychological Services Accountable to: Consultant Adult Psychology. Job Holder Reference: MHS472 No
More informationUnderstanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report
Understanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report Dr Paul A. Tiffin, Reader in Psychometric Epidemiology,
More informationApplication of Multilevel IRT to Multiple-Form Linking When Common Items Are Drifted. Chanho Park 1 Taehoon Kang 2 James A.
Application of Multilevel IRT to Multiple-Form Linking When Common Items Are Drifted Chanho Park 1 Taehoon Kang 2 James A. Wollack 1 1 University of Wisconsin-Madison 2 ACT, Inc. April 11, 2007 Paper presented
More informationESTIMATING TOTAL-TEST SCORES FROM PARTIAL SCORES IN A MATRIX SAMPLING DESIGN JANE SACHAR. The Rand Corporatlon
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 1980,40 ESTIMATING TOTAL-TEST SCORES FROM PARTIAL SCORES IN A MATRIX SAMPLING DESIGN JANE SACHAR The Rand Corporatlon PATRICK SUPPES Institute for Mathematmal
More informationPersonnel Psychology Centre: Recent Achievements and Future Challenges
Personnel Psychology Centre: Recent Achievements and Future Challenges PRESENTATION TO THE EUROPEAN ASSOCIATION OF TEST PUBLISHERS SEPTEMBER 2016 The Public Service Commission (PSC) Independent agency
More informationRaven's Advanced Progressive Matrices (APM)
Raven's Advanced Progressive Matrices (APM) Development 888-298-6227 TalentLens.com Copyright 2007 NCS Pearson, Inc. All rights reserved. Copyright 2007 by NCS Pearson, Inc. All rights reserved. No part
More informationCHAPTER 2 Understanding the Legal Context of Assessment- Employment Laws and Regulations with Implications for Assessment
CHAPTER 2 Understanding the Legal Context of Assessment- Employment Laws and Regulations with Implications for Assessment The number of laws and regulations governing the employment process has increased
More informationPsychometric Issues in Through Course Assessment
Psychometric Issues in Through Course Assessment Jonathan Templin The University of Georgia Neal Kingston and Wenhao Wang University of Kansas Talk Overview Formative, Interim, and Summative Tests Examining
More informationWorker Types: A New Approach to Human Capital Management
Worker Types: A New Approach to Human Capital Management James Houran, President, 20 20 Skills Employee Assessment 20 20 SKILLS ASSESSMENT 372 Willis Ave. Mineola, NY 11501 +1 516.248.8828 (ph) +1 516.742.3059
More informationCultural Intelligence
Cultural Intelligence Group Report for Bethel College May 28, 2014 www.culturalq.com info@culturalq.com Page 1 Overview This report provides summary feedback on Cultural Intelligence (CQ) of those who
More informationIntroducing WISC-V Spanish Anise Flowers, Ph.D.
Introducing Introducing Assessment Consultant Introducing the WISC V Spanish, a culturally and linguistically valid test of cognitive ability in Spanish for use with Spanish-speaking children ages 6:0
More informationAbility tests, such as Talent Q Elements, have been scientifically proven* to be strong predictors of job performance.
Talent Q Elements Ability tests, such as Talent Q Elements, have been scientifically proven* to be strong predictors of job performance. Elements is a suite of online adaptive ability tests measuring verbal,
More informationReliability & Validity Evidence for PATH
Reliability & Validity Evidence for PATH Talegent Whitepaper October 2014 Technology meets Psychology www.talegent.com Outline the empirical evidence from peer reviewed sources for the validity and reliability
More informationspecialist is 20 or fewer clients. 3= Ratio of clients per employment specialist.
SUPPORTED EMPLOYMENT FIDELITY SCALE* 1/7/08 Rater: Site: Date: Total Score: Directions: Circle one anchor number for each criterion. Criterion Data Anchor Source** Staffing 1. Caseload size: Employment
More informationThe Application of the Item Response Theory in China s Public Opinion Survey Design
Management Science and Engineering Vol. 5, No. 3, 2011, pp. 143-148 DOI:10.3968/j.mse.1913035X20110503.1z242 ISSN 1913-0341[Print] ISSN 1913-035X[Online] www.cscanada.net www.cscanada.org The Application
More informationConstruct-Related Validity Vis-A-Vis Internal Structure of the Test
Construct-Related Validity Vis-A-Vis Internal Structure of the Test Rufina C. Rosaroso (PhD) 1, Enriqueta D. Reston (PhD) 2, Nelson A. Rosaroso (Phd) 3 1 Cebu Normal University, 2,3 University of San Carlos,
More informationSaville Consulting Wave Professional Styles Handbook
Saville Consulting Wave Professional Styles Handbook PART 1: OVERVIEW Chapter 2: Applications This manual has been generated electronically. Saville Consulting do not guarantee that it has not been changed
More informationEvaluating the Performance of CATSIB in a Multi-Stage Adaptive Testing Environment. Mark J. Gierl Hollis Lai Johnson Li
Evaluating the Performance of CATSIB in a Multi-Stage Adaptive Testing Environment Mark J. Gierl Hollis Lai Johnson Li Centre for Research in Applied Measurement and Evaluation University of Alberta FINAL
More informationALTE Quality Assurance Checklists. Unit 1. Test Construction
s Unit 1 Test Construction Name(s) of people completing this checklist: Which examination are the checklists being completed for? At which ALTE Level is the examination at? Date of completion: Instructions
More informationAcademic Screening Frequently Asked Questions (FAQ)
Academic Screening Frequently Asked Questions (FAQ) 1. How does the TRC consider evidence for tools that can be used at multiple grade levels?... 2 2. For classification accuracy, the protocol requires
More informationPRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT
PRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT CLASS 3: DESCRIPTIVE STATISTICS & RELIABILITY AND VALIDITY FEBRUARY 2, 2015 OBJECTIVES Define basic terminology used in assessment, such as validity,
More informationDiscoveries with item response theory (IRT)
Chapter 5 Test Modeling Ratna Nandakumar Terry Ackerman Discoveries with item response theory (IRT) principles, since the 1960s, have led to major breakthroughs in psychological and educational assessment.
More informationKey Elements of the CIP Approach
Key Elements of the CIP Approach James P. Sampson, Jr., Gary W. Peterson, Robert C. Reardon, and Janet G. Lenz Florida State University Copyright 2003 by James P. Sampson, Jr., Gary W. Peterson, Robert
More informationInfluence of the Big Five Personality Traits of IT Workers on Job Satisfaction
, pp.126-131 http://dx.doi.org/10.14257/astl.2016.142.23 Influence of the Big Five Personality Traits of IT Workers on Job Satisfaction Hyo Jung Kim 1Dept. Liberal Education University, Keimyung University
More informationMultidimensional Aptitude Battery-II (MAB-II) Clinical Report
Multidimensional Aptitude Battery-II (MAB-II) Clinical Report Name: Sam Sample ID Number: 1000 A g e : 14 (Age Group 16-17) G e n d e r : Male Years of Education: 15 Report Date: August 19, 2010 Summary
More informationFrequently Asked Questions (FAQs)
I N T E G R A T E D WECHSLER INTELLIGENCE SCALE FOR CHILDREN FIFTH EDITION INTEGRATED Frequently Asked Questions (FAQs) Related sets of FAQs: For general WISC V CDN FAQs, please visit: https://www.pearsonclinical.ca/content/dam/school/global/clinical/canada/programs/wisc5/wisc-v-cdn-faqs.pdf
More informationAn Exploration of the Robustness of Four Test Equating Models
An Exploration of the Robustness of Four Test Equating Models Gary Skaggs and Robert W. Lissitz University of Maryland This monte carlo study explored how four commonly used test equating methods (linear,
More informationChapter 9 External Selection: Testing
Chapter 9 External Selection: Testing Substantive Assessment Methods are used to make more precise decisions about the applicants & to separate finalists from candidates; they are used after the initial
More informationMastering Modern Psychological Testing Theory & Methods Cecil R. Reynolds Ronald B. Livingston First Edition
Mastering Modern Psychological Testing Theory & Methods Cecil R. Reynolds Ronald B. Livingston First Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies
More informationpersonality assessment s average coefficient alpha of.83 is among the highest of all assessments. It
Validity and reliability of the WorkPlace Big Five Profile 4.0 Today s organizations and leaders face a demanding challenge in choosing from among thousands of personality assessment products and services.
More informationStefanie Moerbeek, Product Developer, EXIN Greg Pope, Questionmark, Analytics and Psychometrics Manager
Stefanie Moerbeek, Product Developer, EXIN Greg Pope, Questionmark, Analytics and Psychometrics Manager Stefanie Moerbeek introduction EXIN (Examination institute for Information Science), Senior Coordinator
More informationIN HUMAN RESOURCE MANAGEMENT
RESEARCH AND PRACTICE IN HUMAN RESOURCE MANAGEMENT Lu, L. & Lin, G. C. (2002). Work Values and Job Adjustment of Taiwanese workers, Research and Practice in Human Resource Management, 10(2), 70-76. Work
More informationAudience: Six to eight New employees of YouthCARE, young staff members new to full time youth work.
YouthCARE Youth Workers and Audience: Six to eight New employees of YouthCARE, young staff members new to full time youth work. Goal: To prepare new youth workers to critically think about and demonstrate
More informationALTE Quality Assurance Checklists. Unit 1. Test Construction
ALTE Quality Assurance Checklists Unit 1 Test Construction Name(s) of people completing this checklist: Which examination are the checklists being completed for? At which ALTE Level is the examination
More informationPresented by Anne Buckett, Precision HR, South Africa
The customisation of simulation exercises and other challenges as part of a large skills audit project for development Presented by Anne Buckett, Precision HR, South Africa Part of soon to be published
More information