Second Generation, Multidimensional, and Multilevel Item Response Theory Modeling


Li Cai
CSE/CRESST, Department of Education, Department of Psychology, UCLA
With contributions from Mark Hansen, Scott Monroe, and Ji Seung Yang
NCME 2012, April 16, 2012

Introduction: Multidimensionality
Modern educational and psychological assessments can be complex in dimensionality.
- Consider the PISA assessment framework.
- Consider many multi-faceted health outcomes measures.

Introduction: Two-tier Model for PISA (Cai, 2010)
[Figure not reproduced.]


Introduction: NIH (NIDA) PROMIS Smoking Module
Project goal: to develop, evaluate, and standardize item banks to assess cigarette smoking behavior and constructs associated with smoking for both daily and non-daily smokers.

The bifactor structure for the Dependence/Craving domain:
- G: dependence/craving (55)
- S1: first cig of the day (3)
- S2: automatic/mindless (4)
- S3: heavy (5)
- S4: out of control (8)
- S5: withdrawal (8)
- S6: cravings (3)
- S7: if I couldn't smoke (4)
- S8: can't quit (2)
- S9: temptations (5)
- S10: consistency (6)
- S11: when not allowed (3)

Introduction: Multilevel Data
At the same time, large-scale data collection efforts typically employ multi-stage sampling, resulting in natural nesting of respondents within independent sampling units.
- Two-stage stratified sampling design of PISA within countries
- Three-stage PPS sampling design of NAEQ (the new national assessment in China)
- Of course, there is NAEP and other large-scale surveys.
In addition, multilevel data structures arise when the study design involves repeated measurement or multiple raters or sources of information.

New Models? Current Models
- Calibration of item parameters typically employs the unidimensional model, assuming normality.
- Single-level latent variable models.
- Aggregated/group inference by conditioning (latent regression model) and the plausible value methodology.
- Model fit evaluation?
- As an aside (but useful for operational psychometrics): summed score based calculations (EAP to IRT scale score translation tables, item fit, linking, etc.) from the perspective of IRT without the Rasch (equal discrimination) requirement.

New Models? From Yesterday (Yang, Monroe, and Cai)
- Multilevel
  - Clustering
  - Variances at L1, L2
  - Intraclass Correlation
- Multidimensional
- Multiple groups
- Efficient computation
- Constraints
- Scores at L1 and L2
- Fit tests
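The intraclass correlation in the list above is conventionally the share of total variance that lies between level-2 units. The following is a minimal sketch of that ratio, assuming a simple two-level decomposition; the function name and the variance values are hypothetical and are not taken from the talk.

```python
def intraclass_correlation(var_level2: float, var_level1: float) -> float:
    """Share of total variance that lies between level-2 units (clusters)."""
    if var_level1 < 0 or var_level2 < 0:
        raise ValueError("variances must be non-negative")
    total = var_level1 + var_level2
    if total == 0:
        raise ValueError("total variance must be positive")
    return var_level2 / total


# Hypothetical variance components, chosen only for illustration.
icc = intraclass_correlation(var_level2=0.25, var_level1=0.75)
print(f"ICC = {icc:.2f}")
```

A value near 0 would suggest that clustering matters little; a larger value suggests that a single-level model understates the dependence among respondents in the same unit.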

Some of the Equations: Multilevel Two-Tier Model
[Equations not reproduced.]

Item Parameter Estimation: Curse of Dimensionality
- High-dimensional integration problem in likelihood-based estimation and inference.
- Decrease the computational burden through analytical dimension reduction (Gibbons & Hedeker, 1992).
- Only (p+q+1)-dimensional quadrature is required.
- Individual response pattern scores for all level-1 latent variables can be produced as posterior expectations.
- Similarly dimensioned quadrature computations provide the basis for score reporting systems based on posterior expectations for level-2 variables, e.g., school- or state-level achievement.
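To illustrate the idea of dimension reduction, here is a minimal sketch for the simplest case: a single-level bifactor model with one general and two specific dimensions. It is not the multilevel two-tier computation from the talk. The logistic response function, the equally spaced quadrature grid, the item parameters, and all function names are assumptions made for illustration. The brute-force version sums over a 3-dimensional grid; the reduced version uses nested one-dimensional sums, so no more than two dimensions are handled at once.

```python
import math
from itertools import product


def normal_grid(n_points=11, lo=-4.0, hi=4.0):
    """Equally spaced nodes with normalized standard-normal weights."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + i * step for i in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def item_lik(item, u, theta_g, theta_s):
    """Likelihood of response u (0/1) for a logistic item with two slopes."""
    a_gen, a_spec, intercept, _ = item
    p = 1.0 / (1.0 + math.exp(-(a_gen * theta_g + a_spec * theta_s + intercept)))
    return p if u == 1 else 1.0 - p


def brute_force(items, resp, nodes, weights, n_spec):
    """Marginal likelihood by full (1 + n_spec)-dimensional quadrature."""
    total = 0.0
    for combo in product(range(len(nodes)), repeat=1 + n_spec):
        weight = math.prod(weights[k] for k in combo)
        lik = 1.0
        for item, u in zip(items, resp):
            lik *= item_lik(item, u, nodes[combo[0]], nodes[combo[1 + item[3]]])
        total += weight * lik
    return total


def reduced(items, resp, nodes, weights, n_spec):
    """Same marginal likelihood using nested one-dimensional sums."""
    total = 0.0
    for theta_g, w_g in zip(nodes, weights):
        across_groups = 1.0
        for s in range(n_spec):
            # Integrate out specific dimension s for the items that load on it.
            inner = 0.0
            for theta_s, w_s in zip(nodes, weights):
                lik = 1.0
                for item, u in zip(items, resp):
                    if item[3] == s:
                        lik *= item_lik(item, u, theta_g, theta_s)
                inner += w_s * lik
            across_groups *= inner
        total += w_g * across_groups
    return total


# Hypothetical items: (general slope, specific slope, intercept, specific index).
items = [(1.2, 0.8, 0.5, 0), (1.0, 0.7, -0.3, 0), (1.5, 0.6, 0.0, 1), (0.9, 0.9, 0.4, 1)]
resp = [1, 0, 1, 1]
nodes, weights = normal_grid()
full = brute_force(items, resp, nodes, weights, n_spec=2)
fast = reduced(items, resp, nodes, weights, n_spec=2)
print(abs(full - fast) < 1e-12)
```

On the same grid the two computations agree up to rounding, because the sum over the specific dimensions factorizes once the general dimension is held fixed. The brute-force cost grows exponentially with the number of specific dimensions, while the reduced cost grows only linearly.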

Non-normality in Latent Variables: Empirical Histogram Item Bifactor Model
- Non-normal latent variables have been studied extensively in the unidimensional IRT setting (e.g., work by Carol Woods).
- Little discussion of non-normality in latent variables for multidimensional models.
- Cai & Woods proposed an item bifactor model wherein the general dimension is characterized non-parametrically.
- Supports multiple-group estimation.
- Only a 2-dimensional integral is needed for MML estimation.
- EAP scores are a by-product.
[Figure labels: N, N, N, N, EH]


What About Model Fit? Sparseness and Overall Model Fit Tests
- In IRT, the full-information Pearson's X² or the likelihood ratio statistic G² may be used to examine model fit.
- However, if the number of response patterns is large (relative to the sample size), the underlying table can become sparse.
- Maydeu-Olivares and colleagues proposed statistics based on lower-order margins, particularly M₂, which is based on first- and second-order marginal residuals.
- M₂ is successful for testing unidimensional dichotomous IRT models, but adaptation of M₂ to hierarchical and polytomous item factor models is not straightforward.
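As a small illustration of the two full-information statistics named above, the sketch below computes Pearson's X² and the likelihood ratio G² from response-pattern counts and model-implied pattern probabilities. The counts, the probabilities, and the function names are hypothetical. The last line shows why sparseness arises: the number of possible patterns for dichotomous items doubles with every added item.

```python
import math


def pearson_x2(observed, probs):
    """Pearson's X^2 summed over response-pattern cells."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def likelihood_ratio_g2(observed, probs):
    """Likelihood ratio G^2; empty observed cells contribute zero."""
    n = sum(observed)
    return 2.0 * sum(
        o * math.log(o / (n * p)) for o, p in zip(observed, probs) if o > 0
    )


# Hypothetical counts for the 4 response patterns of 2 dichotomous items,
# with hypothetical model-implied pattern probabilities.
observed = [30, 20, 25, 25]
probs = [0.25, 0.25, 0.25, 0.25]
x2 = pearson_x2(observed, probs)
g2 = likelihood_ratio_g2(observed, probs)

# Number of possible response patterns for 20 dichotomous items.
n_patterns_20_items = 2 ** 20
```

With 20 dichotomous items there are over a million possible patterns, so most cells are empty at typical sample sizes and the reference chi-square distribution for X² and G² becomes unreliable.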

What About Model Fit? Challenges of Dimensionality
Two challenges arise:
1. When the number of response categories increases, even the second-order marginals become sparse.
2. The calculation of expected cell frequencies, Jacobian elements, and weight matrix elements for high-dimensional IRT models can be computationally burdensome.
Cai & Hansen (in press; BJMSP) tackle (1) by a further reduction, M₂*, and resolve (2) by extending Gibbons & Hedeker's (1992) strategy of dimension reduction for item parameter estimation to goodness-of-fit testing. For bifactor models, the maximum dimension of integration is 2, regardless of the number of factors.

Summed-Score Computations: Lord-Wingersky Algorithm Version 2.0
For hierarchical multidimensional item factor models, can we obtain similar summed-score posterior based indices as in unidimensional IRT models?
Lord-Wingersky Algorithm Version 2.0:
- Part 1: For each testlet, form likelihoods for within-testlet summed scores. This is standard Lord-Wingersky as applied to 2-dimensional quadrature grids.
- Part 2: For each within-testlet summed score likelihood, integrate out the specific dimension. This is the same as in bifactor/testlet item parameter estimation.
- Part 3: Treat testlets as items. Treat testlet scores as item scores. Apply standard Lord-Wingersky.

Summed-Score Computations: Item Fit?
- The new algorithm provides a convenient computational shortcut for obtaining the Orlando-Thissen-Bjorner S-X² type item fit statistics.
- Hierarchical (or reparameterized higher-order) models probably remain the only multidimensional IRT models where summed scores correspond nicely with latent traits.
- Ying Li and Andre Rupp recently examined the performance of these statistics in a paper in EPM.

Summed-Score Computations: Score Combinations
Reproducing the scoring table from Thissen & Wainer (2001), Chapter 7 material on the Wisconsin 3rd grade mixed-format test.

Rows: MC summed score. Columns: open-ended rated score sum.

MC     0     1     2     3     4     5     6     7     8     9    10    11    12
0   -3.3  -3.0  -2.8  -2.7  -2.5  -2.4  -2.2  -2.1  -2.0  -1.9  -1.8  -1.8  -1.7
1   -3.2  -3.0  -2.8  -2.6  -2.4  -2.3  -2.1  -2.0  -1.9  -1.8  -1.7  -1.7  -1.6
2   -3.2  -2.9  -2.7  -2.5  -2.3  -2.2  -2.0  -1.9  -1.8  -1.7  -1.6  -1.6  -1.5
3   -3.1  -2.8  -2.6  -2.4  -2.2  -2.1  -1.9  -1.8  -1.7  -1.6  -1.5  -1.4  -1.4
4   -3.0  -2.7  -2.5  -2.3  -2.1  -1.9  -1.8  -1.7  -1.6  -1.5  -1.4  -1.3  -1.3
5   -2.9  -2.6  -2.3  -2.1  -1.9  -1.8  -1.6  -1.5  -1.4  -1.3  -1.3  -1.2  -1.1
6   -2.8  -2.4  -2.2  -1.9  -1.8  -1.6  -1.5  -1.4  -1.3  -1.2  -1.1  -1.0  -1.0
7   -2.6  -2.2  -2.0  -1.7  -1.6  -1.4  -1.3  -1.2  -1.1  -1.0  -1.0  -0.9  -0.9
8   -2.3  -2.0  -1.7  -1.5  -1.4  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.8  -0.7
9   -1.9  -1.6  -1.4  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.8  -0.7  -0.6  -0.6
10  -1.5  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.7  -0.7  -0.6  -0.5  -0.5  -0.4
11  -1.2  -1.0  -0.9  -0.9  -0.8  -0.7  -0.7  -0.6  -0.5  -0.4  -0.4  -0.3  -0.2
12  -0.8  -0.8  -0.7  -0.6  -0.6  -0.5  -0.5  -0.4  -0.3  -0.3  -0.2  -0.1   0.0
13  -0.6  -0.5  -0.5  -0.4  -0.4  -0.3  -0.3  -0.2  -0.1   0.0   0.1   0.2   0.3
14  -0.3  -0.3  -0.3  -0.2  -0.2  -0.1  -0.1   0.0   0.1   0.2   0.3   0.5   0.6
15  -0.1  -0.1   0.0   0.0   0.1   0.1   0.2   0.3   0.4   0.5   0.7   0.9   1.1
16   0.2   0.2   0.2   0.3   0.3   0.4   0.5   0.6   0.7   0.9   1.1   1.4   1.7

Summed-Score Computations: Calibrated Projection Linking
Thissen et al.'s (2011) paper described a linking method that fuses calibration with projection and demonstrated how one may conduct summed score based projection linking.
[Figure from the 2010 IMPS presentation by Thissen; not reproduced.]

Software Implementations
As the result of a National Cancer Institute SBIR development contract, many of the multidimensional models are implemented in IRTPRO (ssicentral.com):
- Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Lincolnwood, IL: Scientific Software International.
The newer multilevel, non-parametric, and model fit testing procedures are implemented in flexMIRT (flexmirt.com):
- Cai, L. (2012). flexMIRT: Flexible multilevel item factor analysis and test scoring [Computer software]. Seattle, WA: Vector Psychometric Group, LLC.
The National Cancer Institute also funded:
- Wu, E. J. C., & Bentler, P. M. (2011). EQSIRT: A user-friendly IRT program [Computer software]. Encino, CA: Multivariate Software, Inc.

Acknowledgements
Thank you for bearing with me. And many thanks to the program organizers and discussant!
LCAI [at] UCLA [dot] EDU

Part of this research is made possible by grants from the Institute of Education Sciences (R305B080016 and R305D100039) and grants from the National Institute on Drug Abuse (R01DA026943 and R01DA030466).

I would like to thank the following members of my research group at UCLA: Mark Hansen, Ji Seung Yang, Scott Monroe, and the RAND/UCLA PROMIS Smoking Initiative Group. I would also like to thank Dave Thissen at UNC-Chapel Hill.