Second Generation, Multidimensional, and Multilevel Item Response Theory Modeling

Li Cai
CSE/CRESST, Department of Education, Department of Psychology, UCLA
With contributions from Mark Hansen, Scott Monroe, and Ji Seung Yang
NCME 2012, April 16, 2012

Introduction: Multidimensionality
Modern educational and psychological assessments can be complex in dimensionality.
- Consider the PISA assessment framework.
- Consider the many multi-faceted health outcomes measures.

Introduction: Two-tier Model for PISA (Cai, 2010)


Introduction: NIH (NIDA) PROMIS Smoking Module
Project goal: to develop, evaluate, and standardize item banks that assess cigarette smoking behavior and constructs associated with smoking, for both daily and non-daily smokers.
The bifactor structure for the Dependence/Craving domain:
- G: dependence/craving (55)
- S1: first cig of the day (3)
- S2: automatic/mindless (4)
- S3: heavy (5)
- S4: out of control (8)
- S5: withdrawal (8)
- S6: cravings (3)
- S7: if I couldn't smoke (4)
- S8: can't quit (2)
- S9: temptations (5)
- S10: consistency (6)
- S11: when not allowed (3)

Introduction: Multilevel Data
At the same time, large-scale data collection efforts typically employ multi-stage sampling, resulting in natural nesting of respondents within independent sampling units.
- Two-stage stratified sampling design of PISA within countries
- Three-stage PPS sampling design of NAEQ (the new national assessment in China)
- Of course, there are NAEP and other large-scale surveys.
In addition, multilevel data structures arise when the study design involves repeated measurement or multiple raters or sources of information.

New Models? Current Models
- Calibration of item parameters typically employs the unidimensional model, assuming normality.
- Single-level latent variable models.
- Aggregated/group inference by conditioning (latent regression model) and the plausible value methodology.
- Model fit evaluation?
As an aside (but useful for operational psychometrics): summed score based calculations (EAP to IRT scale score translation tables, item fit, linking, etc.) from the perspective of IRT without the Rasch (equal discrimination) requirement.

New Models? From Yesterday (Yang, Monroe, and Cai)
- Multilevel
  - Clustering
  - Variances at L1, L2
  - Intraclass Correlation
- Multidimensional
- Multiple groups
- Efficient computation
- Constraints
- Scores at L1 and L2
- Fit tests

Some of the Equations: Multilevel Two-Tier Model

Item Parameter Estimation: Curse of Dimensionality
- High-dimensional integration problem in likelihood-based estimation and inference.
- Decrease the computational burden through analytical dimension reduction (Gibbons & Hedeker, 1992).
- Only (p+q+1)-dimensional quadrature is required.
- Individual response pattern scores for all level-1 latent variables can be produced as posterior expectations.
- Similarly dimensioned quadrature computations provide the basis for score reporting systems based on posterior expectations for level-2 variables, e.g., school- or state-level achievement.
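The dimension reduction idea can be illustrated with a minimal sketch. It assumes the simplest special case only: a single-level bifactor model with one general dimension, dichotomous 2PL-type items, and independent standard normal factors, so the reduced integral is 2-dimensional. The function names, the item parameters, and the equally spaced quadrature grid are illustrative assumptions and are not taken from the talk or from any software package. The reduced version integrates each specific factor out inside a loop over the general factor; the brute-force version sums over the full grid and should return the same value on the same nodes.

```python
import itertools
import math


def grid(n=7, lo=-4.0, hi=4.0):
    """Equally spaced nodes with normalized standard-normal weights."""
    nodes = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    raw = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(raw)
    return nodes, [v / total for v in raw]


def item_lik(item, y, g, u):
    """Likelihood of response y (0/1) given general g and specific u."""
    z = item["a_g"] * g + item["a_s"] * u + item["c"]
    p = 1.0 / (1.0 + math.exp(-z))
    return p if y == 1 else 1.0 - p


def marginal_reduced(items, y, nodes, weights):
    """Nested 1-D sums: specific factors are integrated out one at a time."""
    groups = sorted({it["s"] for it in items})
    total = 0.0
    for g, wg in zip(nodes, weights):
        prod = 1.0
        for s in groups:
            inner = 0.0
            for u, wu in zip(nodes, weights):
                lik = 1.0
                for it, yi in zip(items, y):
                    if it["s"] == s:
                        lik *= item_lik(it, yi, g, u)
                inner += wu * lik
            prod *= inner
        total += wg * prod
    return total


def marginal_brute(items, y, nodes, weights):
    """Full (1 + number of specific factors)-dimensional grid sum."""
    groups = sorted({it["s"] for it in items})
    total = 0.0
    for combo in itertools.product(range(len(nodes)), repeat=len(groups) + 1):
        w = 1.0
        for k in combo:
            w *= weights[k]
        lik = 1.0
        for it, yi in zip(items, y):
            u = nodes[combo[1 + groups.index(it["s"])]]
            lik *= item_lik(it, yi, nodes[combo[0]], u)
        total += w * lik
    return total


# Hypothetical parameters: 6 items, 3 specific factors ("s" is the group id).
ITEMS = [
    {"a_g": 1.2, "a_s": 0.8, "c": 0.0, "s": 0},
    {"a_g": 0.9, "a_s": 1.1, "c": -0.5, "s": 0},
    {"a_g": 1.5, "a_s": 0.6, "c": 0.3, "s": 1},
    {"a_g": 1.0, "a_s": 0.7, "c": 0.8, "s": 1},
    {"a_g": 0.7, "a_s": 1.3, "c": -1.0, "s": 2},
    {"a_g": 1.1, "a_s": 0.9, "c": 0.2, "s": 2},
]
Y = (1, 0, 1, 1, 0, 0)
```

With n nodes per dimension and S specific factors, the brute-force sum visits n^(S+1) grid points, while the reduced sum visits on the order of S * n^2, which is why the work grows only linearly in the number of specific factors.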

Non-normality in Latent Variables: Empirical Histogram Item Bifactor Model
- Non-normal latent variables have been studied extensively in the unidimensional IRT setting (e.g., work by Carol Woods).
- There has been little discussion of non-normality in latent variables for multidimensional models.
- Cai & Woods proposed an item bifactor model wherein the general dimension is characterized non-parametrically.
- Supports multiple-group estimation.
- Only a 2-dimensional integral is needed for MML estimation.
- EAP scores are a by-product.


What About Model Fit? Sparseness and Overall Model Fit Tests
- In IRT, the full-information Pearson's X^2 or the likelihood ratio statistic G^2 may be used to examine model fit.
- However, if the number of response patterns is large (relative to the sample size), the underlying table can become sparse.
- Maydeu-Olivares and colleagues proposed statistics based on lower-order margins, particularly M2, which is based on first- and second-order marginal residuals.
- M2 is successful for testing unidimensional dichotomous IRT models, but adaptation of M2 to hierarchical and polytomous item factor models is not straightforward.
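As a rough illustration of the full-information statistics and of how quickly the pattern table becomes sparse, the sketch below computes Pearson's X^2 and the likelihood ratio G^2 over all 2^n response patterns of n dichotomous items. The names `fit_statistics`, `sparseness`, and `model_prob` are hypothetical; `model_prob` stands in for whatever fitted model supplies the pattern probabilities, and it is assumed to return a strictly positive probability for every pattern.

```python
import math
from collections import Counter
from itertools import product


def fit_statistics(patterns, model_prob):
    """Full-information X^2 and G^2 for dichotomous response patterns.

    patterns: list of equal-length tuples of 0/1 responses.
    model_prob: hypothetical callable giving the model-implied
        probability (> 0) of one response pattern.
    """
    n = len(patterns)
    n_items = len(patterns[0])
    observed = Counter(patterns)
    x2 = 0.0
    g2 = 0.0
    for cell in product((0, 1), repeat=n_items):
        expected = n * model_prob(cell)
        count = observed.get(cell, 0)
        x2 += (count - expected) ** 2 / expected
        if count > 0:  # empty cells contribute nothing to G^2
            g2 += 2.0 * count * math.log(count / expected)
    return x2, g2


def sparseness(patterns):
    """Return (number of distinct observed patterns, number of possible patterns)."""
    return len(set(patterns)), 2 ** len(patterns[0])
```

With 20 dichotomous items there are 2^20 (more than a million) possible patterns, so even a sample of several thousand respondents leaves almost every cell empty; this is the motivation for working with lower-order margins instead.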

What About Model Fit? Challenges of Dimensionality
Two challenges arise:
1. When the number of response categories increases, even the second-order marginals become sparse.
2. The calculation of expected cell frequencies, Jacobian elements, and weight matrix elements for high-dimensional IRT models can be computationally burdensome.
Cai & Hansen (in press; BJMSP) tackle (1) by a further reduction, M2*, and resolve (2) by extending Gibbons & Hedeker's (1992) strategy of dimension reduction for item parameter estimation to goodness-of-fit testing. For bifactor models, the maximum dimension of integration is 2, regardless of the number of factors.

Summed-Score Computations: Lord-Wingersky Algorithm Version 2.0
For hierarchical multidimensional item factor models, can we obtain summed-score posterior based indices similar to those in unidimensional IRT models?
Lord-Wingersky Algorithm Version 2.0:
- Part 1: For each testlet, form likelihoods for within-testlet summed scores. This is standard Lord-Wingersky as applied to 2-dimensional quadrature grids.
- Part 2: For each within-testlet summed score likelihood, integrate out the specific dimension. This is the same as in bifactor/testlet item parameter estimation.
- Part 3: Treat testlets as items. Treat testlet scores as item scores. Apply standard Lord-Wingersky.
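The three parts can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: items are dichotomous with 2PL-type trace lines, each item belongs to exactly one testlet, the general and specific dimensions are independent standard normal, and a simple equally spaced grid stands in for the quadrature rule. All function names and item parameters are hypothetical.

```python
import math


def grid(n=21, lo=-5.0, hi=5.0):
    """Equally spaced nodes with normalized standard-normal weights."""
    nodes = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    raw = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(raw)
    return nodes, [v / total for v in raw]


def lord_wingersky(category_probs):
    """Standard recursion at one fixed node.

    category_probs[i][k] is the probability that item i earns score k.
    Returns lik, where lik[x] is the probability of summed score x.
    """
    lik = [1.0]
    for probs in category_probs:
        new = [0.0] * (len(lik) + len(probs) - 1)
        for x, lx in enumerate(lik):
            for k, pk in enumerate(probs):
                new[x + k] += lx * pk
        lik = new
    return lik


def trace(item, g, u):
    """Category probabilities [P(0), P(1)] for one dichotomous item."""
    z = item["a_g"] * g + item["a_s"] * u + item["c"]
    p = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p, p]


def summed_score_table(testlets, nodes, weights):
    """Marginal summed-score probabilities and EAPs on the general dimension."""
    max_score = sum(len(items) for items in testlets)
    marginal = [0.0] * (max_score + 1)
    eap_num = [0.0] * (max_score + 1)
    for g, wg in zip(nodes, weights):
        testlet_liks = []
        for items in testlets:
            collapsed = [0.0] * (len(items) + 1)
            for u, wu in zip(nodes, weights):
                # Part 1: within-testlet summed-score likelihood at (g, u).
                within = lord_wingersky([trace(it, g, u) for it in items])
                # Part 2: integrate out the specific dimension u.
                for x, lx in enumerate(within):
                    collapsed[x] += wu * lx
            testlet_liks.append(collapsed)
        # Part 3: treat each testlet as one polytomous "item".
        total = lord_wingersky(testlet_liks)
        for x, lx in enumerate(total):
            marginal[x] += wg * lx
            eap_num[x] += wg * g * lx
    eap = [num / p for num, p in zip(eap_num, marginal)]
    return marginal, eap


# Hypothetical parameters: two testlets of two items each.
TESTLETS = [
    [{"a_g": 1.2, "a_s": 0.8, "c": 0.0}, {"a_g": 0.9, "a_s": 1.1, "c": -0.5}],
    [{"a_g": 1.5, "a_s": 0.6, "c": 0.3}, {"a_g": 1.0, "a_s": 0.7, "c": 0.8}],
]
```

Calling `summed_score_table(TESTLETS, *grid())` returns one marginal probability and one EAP per total summed score (0 through 4 here); the same per-node likelihoods could feed a summed-score to scale-score translation table.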

Summed-Score Computations: Item Fit?
- The new algorithm provides a convenient computational shortcut for obtaining the Orlando-Thissen-Bjorner S-X^2 type item fit statistics.
- Hierarchical (or reparameterized higher-order) models probably remain the only multidimensional IRT models where summed scores correspond nicely with latent traits.
- Ying Li and Andre Rupp recently examined the performance of these statistics in a paper in EPM.

Summed-Score Computations: Score Combinations
Reproducing the scoring table from the Thissen & Wainer (2001) Chapter 7 material on the Wisconsin 3rd grade mixed-format test.
Rows: MC summed score. Columns: open-ended (summed) rated score.
MC Sum     0     1     2     3     4     5     6     7     8     9    10    11    12
0       -3.3  -3.0  -2.8  -2.7  -2.5  -2.4  -2.2  -2.1  -2.0  -1.9  -1.8  -1.8  -1.7
1       -3.2  -3.0  -2.8  -2.6  -2.4  -2.3  -2.1  -2.0  -1.9  -1.8  -1.7  -1.7  -1.6
2       -3.2  -2.9  -2.7  -2.5  -2.3  -2.2  -2.0  -1.9  -1.8  -1.7  -1.6  -1.6  -1.5
3       -3.1  -2.8  -2.6  -2.4  -2.2  -2.1  -1.9  -1.8  -1.7  -1.6  -1.5  -1.4  -1.4
4       -3.0  -2.7  -2.5  -2.3  -2.1  -1.9  -1.8  -1.7  -1.6  -1.5  -1.4  -1.3  -1.3
5       -2.9  -2.6  -2.3  -2.1  -1.9  -1.8  -1.6  -1.5  -1.4  -1.3  -1.3  -1.2  -1.1
6       -2.8  -2.4  -2.2  -1.9  -1.8  -1.6  -1.5  -1.4  -1.3  -1.2  -1.1  -1.0  -1.0
7       -2.6  -2.2  -2.0  -1.7  -1.6  -1.4  -1.3  -1.2  -1.1  -1.0  -1.0  -0.9  -0.9
8       -2.3  -2.0  -1.7  -1.5  -1.4  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.8  -0.7
9       -1.9  -1.6  -1.4  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.8  -0.7  -0.6  -0.6
10      -1.5  -1.3  -1.2  -1.1  -1.0  -0.9  -0.8  -0.7  -0.7  -0.6  -0.5  -0.5  -0.4
11      -1.2  -1.0  -0.9  -0.9  -0.8  -0.7  -0.7  -0.6  -0.5  -0.4  -0.4  -0.3  -0.2
12      -0.8  -0.8  -0.7  -0.6  -0.6  -0.5  -0.5  -0.4  -0.3  -0.3  -0.2  -0.1   0.0
13      -0.6  -0.5  -0.5  -0.4  -0.4  -0.3  -0.3  -0.2  -0.1   0.0   0.1   0.2   0.3
14      -0.3  -0.3  -0.3  -0.2  -0.2  -0.1  -0.1   0.0   0.1   0.2   0.3   0.5   0.6
15      -0.1  -0.1   0.0   0.0   0.1   0.1   0.2   0.3   0.4   0.5   0.7   0.9   1.1
16       0.2   0.2   0.2   0.3   0.3   0.4   0.5   0.6   0.7   0.9   1.1   1.4   1.7

Summed-Score Computations: Calibrated Projection Linking
- Thissen et al.'s (2011) paper described a linking method that fuses calibration with projection and demonstrated how one may conduct summed score based projection linking.
- Figure on the right stolen from the 2010 IMPS presentation by Thissen.

Software Implementations
As a result of a National Cancer Institute SBIR development contract, many of the multidimensional models are implemented in IRTPRO (ssicentral.com):
- Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Lincolnwood, IL: Scientific Software International.
The newer multilevel, non-parametric, and model fit testing procedures are implemented in flexMIRT (flexmirt.com):
- Cai, L. (2012). Flexible multilevel item factor analysis and test scoring [Computer software]. Seattle, WA: Vector Psychometric Group, LLC.
The National Cancer Institute also funded:
- Wu, E. J. C., & Bentler, P. M. (2011). EQSIRT: A user-friendly IRT program [Computer software]. Encino, CA: Multivariate Software, Inc.

Acknowledgements
Thank you for bearing with me. And many thanks to the program organizers and discussant!
LCAI [at] UCLA [dot] EDU
Part of this research is made possible by grants from the Institute of Education Sciences (R305B080016 and R305D100039) and grants from the National Institute on Drug Abuse (R01DA026943 and R01DA030466).
I would like to thank the following members of my research group at UCLA: Mark Hansen, Ji Seung Yang, Scott Monroe, and the RAND/UCLA PROMIS Smoking Initiative Group. I would also like to thank Dave Thissen at UNC-Chapel Hill.