Automated Test Assembly for COMLEX-USA: A SAS Operations Research (SAS/OR) Approach

Similar documents
An Integer Programming Approach to Item Bank Design

Designing item pools to optimize the functioning of a computerized adaptive test

Audit - The process of conducting an evaluation of an entity's compliance with published standards. This is also referred to as a program audit.

National Council for Strength & Fitness

What are the Steps in the Development of an Exam Program? 1

Computer Adaptive Testing and Multidimensional Computer Adaptive Testing

An Automatic Online Calibration Design in Adaptive Testing 1. Guido Makransky 2. Master Management International A/S and University of Twente

A Strategy for Optimizing Item-Pool Management

Psychometric Issues in Through Course Assessment

SEE Evaluation Report September 1, 2017-August 31, 2018

A standardization approach to adjusting pretest item statistics. Shun-Wen Chang National Taiwan Normal University

UK Clinical Aptitude Test (UKCAT) Consortium UKCAT Examination. Executive Summary Testing Interval: 1 July October 2016

Innovative Item Types Require Innovative Analysis

An Introduction to Psychometrics. Sharon E. Osborn Popp, Ph.D. AADB Mid-Year Meeting April 23, 2017

Assembling a Computerized Adaptive Testing Item Pool as a Set of Linear Tests

ESTIMATING TOTAL-TEST SCORES FROM PARTIAL SCORES IN A MATRIX SAMPLING DESIGN JANE SACHAR. The Rand Corporation

Understanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report

Conjoint analysis based on Thurstone judgement comparison model in the optimization of banking products

Evaluating the use of psychometrics

STAAR-Like Quality Starts with Reliability

Effects of Selected Multi-Stage Test Design Alternatives on Credentialing Examination Outcomes 1,2. April L. Zenisky and Ronald K. Hambleton

The computer-adaptive multistage testing (ca-mst) has been developed as an

Equating and Scaling for Examination Programs

Investigating Common-Item Screening Procedures in Developing a Vertical Scale

Test-Free Person Measurement with the Rasch Simple Logistic Model

IBM Workforce Science. IBM Kenexa Ability Series Computerized Adaptive Tests (IKASCAT) Technical Manual

HOGAN BUSINESS REASONING INVENTORY

Redesign of MCAS Tests Based on a Consideration of Information Functions 1,2. (Revised Version) Ronald K. Hambleton and Wendy Lam

Fusion Analytical Method Validation

Conditional Item-Exposure Control in Adaptive Testing Using Item-Ineligibility Probabilities

Using a Performance Test Development & Validation Framework

STATE OF THE ART ANALYTICS

9.7 Getting Schooled. A Solidify Understanding Task

Test Development. and. Psychometric Services

A Comparison of Item-Selection Methods for Adaptive Tests with Content Constraints

Reliability & Validity

Examination Report for Testing Year Board of Certification (BOC) Certification Examination for Athletic Trainers.

Solving Business Problems with Analytics

Validity and Reliability Issues in the Large-Scale Assessment of English Language Proficiency

Field Testing and Equating Designs for State Educational Assessments. Rob Kirkpatrick. Walter D. Way. Pearson

A Production Problem

7 Statistical characteristics of the test

The Effects of Model Misfit in Computerized Classification Test. Hong Jiao Florida State University

Item Analysis of National Examination Council Senior School Certificate Examination Economics Objective Tests

Logistic Regression with Expert Intervention

A Gradual Maximum Information Ratio Approach to Item Selection in Computerized Adaptive Testing. Kyung T. Han Graduate Management Admission Council

Telecommunications Churn Analysis Using Cox Regression

SURVEY OF SOFTWARE FOR THE TEST QUALITY ANALYSIS. Varazdat Avetisyan

Linear model to forecast sales from past data of Rossmann drug Store

Quadratic Regressions Group Activity 2 Business Project Week #4

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Multiple Choice (#1-9). Circle the letter corresponding to the best answer.

Percentiles the precise definition page 1. Percentiles and textbook definitions confused or what?

After completion of this unit you will be able to: Define data analytic and explain why it is important Outline the data analytic tools and

Creative Commons Attribution-NonCommercial-Share Alike License

ITEM RESPONSE THEORY FOR WEIGHTED SUMMED SCORES. Brian Dale Stucky

Mastering Modern Psychological Testing Theory & Methods Cecil R. Reynolds Ronald B. Livingston First Edition

Disentangling Prognostic and Predictive Biomarkers Through Mutual Information

Worker Types: A New Approach to Human Capital Management

A Test Development Life Cycle Framework for Testing Program Planning

proficiency that the entire response pattern provides, assuming that the model summarizes the data accurately (p. 169).

Applying Tabu Search to Container Loading Problems

Potential Impact of Item Parameter Drift Due to Practice and Curriculum Change on Item Calibration in Computerized Adaptive Testing

Operational Check of the 2010 FCAT 3rd Grade Reading Equating Results

PASSPOINT SETTING FOR MULTIPLE CHOICE EXAMINATIONS

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Band Battery

A Statistical Comparison Of Accelerated Concrete Testing Methods

Mining for Gold gets easier and a lot more fun! By Ken Deal

Rounding a method for estimating a number by increasing or retaining a specific place value digit according to specific rules and changing all

Test Development: Ten Steps to a Valid and Reliable Certification Exam Linda A. Althouse, Ph.D., SAS, Cary, NC

QUANTITATIVE COMPARABILITY STUDY of the ICC INDEX and THE QUALITY OF LIFE DATA

1. Contingency Table (Cross Tabulation Table)

Technical Report: June 2018 CKE 1. Human Resources Professionals Association

Diagnostic Online Math Assessment: Technical Document. Published by Let's Go Learn, Inc.

Setting Standards. John Norcini, Ph.D.

National Assessment Collaboration Annual Technical Report

Linear Programming: Basic Concepts

The Examination for Professional Practice in Psychology: The Enhanced EPPP Frequently Asked Questions

Analysis and Modelling of Flexible Manufacturing System

Evolving Control for Micro Aerial Vehicles (MAVs)

Evolutionary Algorithms

Copyright © 2009 Stanley B. Gershwin. All rights reserved. 2/30 People Philosophy Basic Issues Industry Needs Uncertainty, Variability, and Randomness

Advanced analytics at your hands

Score Reporting: More Than Just Pass/Fail. Susan Davis-Becker, Alpine Testing Solutions Sheila Mauldin, NCCPA Debbra Hecker, NAWCO

Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT

Prescriptive Analytics for Facility Location: an AIMMS-based perspective

(1960) had proposed similar procedures for the measurement of attitude. The present paper

Issues surrounding conversion of paperand-pencil to computerized testing

Linking Current and Future Score Scales for the AICPA Uniform CPA Exam i

A Simulation-based Multi-level Redundancy Allocation for a Multi-level System

Final Examination. Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring Tuesday June 7, 2011

PSS E. High-Performance Transmission Planning Application for the Power Industry. Answers for energy.

3 Ways to Improve Your Targeted Marketing with Analytics

Core vs NYS Standards

Smarter Balanced Adaptive Item Selection Algorithm Design Report

Know Your Data (Chapter 2)

Transcription:

Automated Test Assembly for COMLEX-USA: A SAS Operations Research (SAS/OR) Approach
Dr. Hao Song, Senior Director for Psychometrics and Research
Dr. Hongwei Patrick Yang, Senior Research Associate

Introduction
- Automated test assembly (ATA) is the process of automating test form construction through constrained optimization, as opposed to manual assembly.
- Improved effectiveness and efficiency in constructing multiple parallel test forms.
- Improved psychometric quality: increased form comparability and less variation.
- Targeting the population ability to ensure more accurate pass/fail decisions at the cut score.

Introduction
- In this ATA demonstration, we choose the optimization procedure PROC OPTMODEL, part of SAS Operations Research (SAS/OR), as the tool for ATA.
- SAS is the official statistical analysis platform at NBOME.
- SAS is an industry-standard product in mathematical and statistical computing.
- This is important for operational work related to COMLEX-USA, a high-stakes licensure examination designed to protect the public.
- Note: Operations research deals with the application of advanced analytical methods to help make better decisions.

Three Fundamental Components
In the ATA work, we utilize the technique of mixed/pure integer linear/nonlinear programming. Three fundamental components need to be established (a code sketch combining all three appears after the Objective Function(s) slide):
- Decision variables
- Constraints, including both content and psychometric constraints
- Objective function(s)

Decision Variables
The decision variables defined here are binary (0/1) variables indicating the inclusion or exclusion of each item in each test form:
$x_{if} = 1$ if item $i$ is assigned to form $f$; $x_{if} = 0$ otherwise.
Here $i = 1, \ldots, N$ and $f = 1, \ldots, M$, where $N$ is the total number of items in the item pool and $M$ is the total number of forms to be assembled.

Constraints
Constraints are test specifications that need to be met. Typical constraints include:
- Restricting the length of form $f$ to exactly $n$ items: $\sum_{i=1}^{N} x_{if} = n$
- Ensuring that item $i$ is selected no more than once across all $M$ forms: $0 \le \sum_{f=1}^{M} x_{if} \le 1$
- Limiting the number of items on a certain topic (say, OPP items, or a set of enemy items, etc.) to between $l$ and $u$ on any given form. Let $t_i$ be a binary indicator with 1 indicating that item $i$ falls into the topic and 0 otherwise. Then: $l \le \sum_{i=1}^{N} t_i\, x_{if} \le u$

Objective Function(s)
Finally, the objective function is formulated by requiring the test information function (TIF) of each assembled form to be as close as possible to the target value at the cut score $\theta = \theta_c$:
Minimize $\left| \sum_{i=1}^{N} I_i(\theta_c)\, x_{if} - T(\theta_c) \right|$ for each form $f$,
where $I_i(\theta_c)$ is the information of item $i$ at the cut score and $T(\theta_c)$ is the target TIF value.
Careful consideration is given to keeping examinations comparable over years.
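
Since the deck describes the model but shows no code, here is a minimal PROC OPTMODEL sketch combining the three components. It is not NBOME's operational program: the pool size, number of forms, test length, topic bounds, target TIF value, and the generated item parameters are all illustrative assumptions, and the absolute-value objective is linearized with nonnegative deviation variables, a standard device in linear test assembly (van der Linden, 2005).

```sas
proc optmodel;
   /* Illustrative problem dimensions (assumptions, not NBOME values) */
   num nItems    = 200;  /* items in the pool                 */
   num nForms    = 4;    /* parallel forms to assemble        */
   num formLen   = 50;   /* items per form                    */
   num tifTarget = 10;   /* target TIF value at the cut score */

   set ITEMS = 1..nItems;
   set FORMS = 1..nForms;

   /* Item information at the cut score and a 0/1 topic flag;
      generated arbitrarily here, read from the item bank in practice */
   num info  {i in ITEMS} = 0.05 + 0.3 * mod(i, 5) / 4;
   num topic {i in ITEMS} = if mod(i, 7) = 0 then 1 else 0;

   /* Decision variables: x[i,f] = 1 if item i is assigned to form f */
   var x {ITEMS, FORMS} binary;
   /* Nonnegative deviation variables that linearize |TIF - target| */
   var y {FORMS} >= 0;

   /* Each form contains exactly formLen items */
   con TestLength {f in FORMS}: sum {i in ITEMS} x[i,f] = formLen;
   /* No item appears on more than one form */
   con NoOverlap {i in ITEMS}: sum {f in FORMS} x[i,f] <= 1;
   /* Between 3 and 8 topic items on each form (illustrative bounds) */
   con TopicRange {f in FORMS}: 3 <= sum {i in ITEMS} topic[i] * x[i,f] <= 8;

   /* y[f] bounds the absolute deviation of form f's TIF from the target */
   con DevUp   {f in FORMS}: sum {i in ITEMS} info[i] * x[i,f] - tifTarget <= y[f];
   con DevDown {f in FORMS}: tifTarget - sum {i in ITEMS} info[i] * x[i,f] <= y[f];

   /* Minimize the total absolute deviation across forms */
   min TotalDev = sum {f in FORMS} y[f];

   solve with milp;
   print TotalDev y;
quit;
```

Enemy-item sets can be handled the same way as the topic constraint, for example by requiring at most one member of each enemy set on any given form.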

Measures of Test Quality
- Basically, the test information function (TIF) tells us how well the test is doing in estimating ability over the whole range of ability scores.
- Given an ability $\theta$, a higher TIF value indicates that the test is doing a better job.
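
For reference, the TIF is the sum of the item information functions, and the standard error of the ability estimate is inversely related to it. The Rasch form of the item response function below is an illustrative assumption; the deck cites Rasch (1960) but does not state the operational model:

\[
I(\theta) = \sum_{i=1}^{n} I_i(\theta), \qquad
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}, \qquad
I_i(\theta) = P_i(\theta)\bigl(1 - P_i(\theta)\bigr), \quad
P_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}}
\]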

Data and Constraints Applied
- In developing the ATA engine, one-level data was used with the target latent ability cut score of $\theta = \theta_c$.
- Data sources: anchor, operational, and pretest items.
- The following criteria are specified as constraints in this ATA demonstration:
  - Blueprint Dimension 1 criteria
  - Blueprint Dimension 2 criteria
  - Life stage in Clinical presentations
  - Number of items in a test form
  - etc.

New ATA Forms vs. Previous Forms: TIF by Ability (Figure 1)

New ATA Forms vs. Previous Forms
- The newly assembled ATA forms (in red) are presented graphically against a set of forms (in blue) assembled using the traditional manual assembly method.
- Figure 1 overlays both groups of curves, plotting the values of the test information functions against ability levels ranging from -4 to +4 across all assembled forms.

New ATA Forms vs. Previous Forms
- Within each group of curves, there is very good equivalency among forms: the curves closely overlap one another.
- The new ATA forms demonstrate noticeably less variability around the cut score $\theta = \theta_c$ than the traditional forms.

New ATA Forms vs. Previous Forms
- Across the two groups of curves, the new ATA forms show higher test information function values than the traditional forms over a major portion of the ability continuum.

New ATA Forms vs. Previous Forms
- In sum, the new and traditional forms are reasonably comparable, each showing good equivalency within its own group.
- The new ATA forms can be better tailored to candidate ability, particularly around the cut score $\theta = \theta_c$.

Impact Analysis for Classification Accuracy
- To further evaluate the new ATA forms, we conducted an impact analysis via a simulation study using empirical administration data.
- Assuming the same cohort of candidates were to take the newly assembled ATA forms, we compare their between-year examination scores and pass/fail decisions.
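
The deck does not show how the responses were simulated. A minimal sketch in a SAS DATA step might look like the following, assuming a Rasch response model (Linacre, 2007, is cited, but the operational model is not stated; the abilities, difficulties, and sample size below are illustrative):

```sas
/* Simulate dichotomous responses under an assumed Rasch model */
data sim_responses;
   call streaminit(20240501);                       /* fixed seed for reproducibility */
   array b[5] _temporary_ (-1.5 -0.5 0 0.5 1.5);    /* illustrative item difficulties */
   do person = 1 to 1000;
      theta = rand('NORMAL', 0, 1);                 /* illustrative ability draw      */
      do item = 1 to 5;
         p = 1 / (1 + exp(-(theta - b[item])));     /* Rasch success probability      */
         resp = (rand('UNIFORM') < p);              /* Bernoulli item response        */
         output;
      end;
   end;
run;
```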

Impact Analysis for Classification Accuracy (Figure 2)

Impact Analysis for Classification Accuracy
- Figure 2 plots the newly estimated ability ($\theta$) values after equating (vertical axis) against the previous estimates (horizontal axis) for two selected ATA forms.
- In each plot, the points fall around a 45-degree reference line, indicating that the newly equated ability estimates are nearly identical to their previously obtained values.

Impact Analysis for Classification Accuracy
- In each scatterplot, an ordinary least squares regression line, with the equated ability estimates as the dependent variable and the previous estimates as the predictor, almost completely overlaps the 45-degree reference line.
- This provides additional convincing evidence in support of the new, equated ability estimates from the ATA forms.
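
A plot of this kind can be produced in a few lines of SAS; the dataset and variable names (impact, theta_prev, theta_new) are hypothetical placeholders, not from the deck:

```sas
/* Equated vs. previous ability estimates with a 45-degree identity
   line and an OLS fit (hypothetical dataset and variable names) */
proc sgplot data=impact;
   scatter x=theta_prev y=theta_new;
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);  /* identity line */
   reg x=theta_prev y=theta_new;                         /* OLS fit       */
run;
```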

Impact Analysis for Classification Accuracy (Table 1)

Impact Analysis for Classification Accuracy
- Table 1 cross-tabulates two sets of classification results obtained under the same classification criterion for measuring ability:
  - one based on the actual data from first-time candidates in one recent administration cycle, and
  - one based on the data simulated from the ATA forms when administered to the same group of candidates.

Impact Analysis for Classification Accuracy
- Depending on the form, the passing rate from ATA ranges from 91.52% to 92.33% across the ATA forms, highly comparable from form to form and close to the actual passing rate of 92%.
- Correspondingly, the failing rate from ATA ranges from 7.67% to 8.48% across the ATA forms.

Impact Analysis for Classification Accuracy
- The sensitivity estimate ranges from 97.25% to 98.03% across the ATA forms. (Sensitivity: the proportion of truly qualified candidates who actually pass the examination.)
- The specificity estimate ranges from 78.55% to 81.80% across the ATA forms. (Specificity: the proportion of truly unqualified candidates who actually fail the examination.)
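
As a quick reference, both statistics can be computed directly from the pass/fail cross-tabulation. The sketch below assumes a hypothetical dataset named classifications with 0/1 variables true_pass and ata_pass; these names are not from the deck:

```sas
/* Sensitivity: P(ATA pass | truly qualified);
   Specificity: P(ATA fail | truly unqualified) */
proc sql;
   select sum(true_pass = 1 and ata_pass = 1) / sum(true_pass = 1)
             as sensitivity format=percent8.2,
          sum(true_pass = 0 and ata_pass = 0) / sum(true_pass = 0)
             as specificity format=percent8.2
   from classifications;
quit;
```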

Conclusions
The ATA approach is preferred over the manual assembly approach because the ATA forms are:
- More equivalent, with reduced variability among forms over the continuum of candidate ability.
- More rigorous in their psychometric and content properties:
  - higher test information function values along a major portion of the ability continuum, and
  - more content constraints factored into form assembly.
- As accurate as traditional forms in scoring and classifying candidates.

Conclusions
In actual form assembly, we go even further in an effort to keep our strong commitment to the public:
- Numerous communications among the Test Development team, the Psychometrics and Research team, and external subject matter experts on both content and psychometric issues, to enhance form equivalency to the greatest possible extent.
- Flexibility to add other content and psychometric constraints whenever needed.
- Multiple stages of ATA, where feedback from item and form review meetings can be factored into each stage.

Conclusions
- This small-scale study is based on rigorous mathematical optimization procedures implemented in an industry-standard software package, and it demonstrates ATA as one of many ongoing innovations at NBOME.
- It is a *demonstration* of the ATA work only and should not be viewed as reflecting the full process typically used in a real ATA project at NBOME.

References
Choe, E. M., & Denbleyker, J. (2014). Quality psychometrics of Common Block Assembly: Summary report. Chicago, IL: National Board of Osteopathic Medical Examiners (NBOME).
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Kalinowski, K. (2015). COMAT form assembly instructions for 2015. Chicago, IL: National Board of Osteopathic Medical Examiners (NBOME).
Lathrop, Q. N. (2015). cacIRT: Classification accuracy and consistency under item response theory. R package version 1.4. http://cran.r-project.org/package=cacirt
Linacre, J. M. (2007). How to simulate Rasch data. Rasch Measurement Transactions, 21(3), 1125.
Papadimitriou, C. H., & Steiglitz, K. (1998). Combinatorial optimization: Algorithms and complexity. Mineola, NY: Dover Publications.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Reif, M. (2014). PP: Estimation of person parameters for the 1,2,3,4-PL model and the GPCM. R package version 0.5.3. https://github.com/manuelreif/pp
Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14), 1-5.
Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research & Evaluation, 10(13), 1-4.
Schrijver, A. (2003). Combinatorial optimization. New York, NY: Springer.
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
Woo, A., & Gorham, J. L. (2010). Understanding the impact of enemy items on test validity and measurement precision. CLEAR Exam Review, 21(1), 15-17.

Feel Free to Follow Up with Questions!
If you have any remaining questions, please do not hesitate to contact Dr. Hao Song or Dr. Hongwei Patrick Yang via e-mail or phone:
- Dr. Hao Song: HSong@nbome.org, 773-714-0622 extension 294
- Dr. Hongwei Patrick Yang: Pyang1@nbome.org, 773-714-0622 extension 290

And, finally, on behalf of NBOME: THANK YOU!
© 2013 NBOME