differential item functioning Wang 2008 DIF-free-then-DIF DFTD DIF DIF-free


2 differential item functioning Type I error Wang free-then- DFTD likelihood ratio test LRT power -free -free -free -free I

The Effect of DIF-free-then-DIF Strategy on Likelihood Ratio Test in Assessing Differential Item Functioning

Abstract

The Type I error rate of differential item functioning (DIF) assessment has been found to be seriously influenced by the percentage of DIF items in the test. To diminish this problem, the DIF-free-then-DIF (DFTD) strategy has been strongly recommended for use in DIF assessment methods (Wang, 2008). Among the many DIF assessment methods, the likelihood ratio test (LRT) has been found to perform better than the alternatives, so it was used in this study. The performance of the DFTD strategy with the LRT method was evaluated through two-stage simulation studies. The results indicated that Type I error rates were well controlled when the LRT method used DIF-free items as anchors. Furthermore, the scale purification procedure selected a set of DIF-free items more accurately than the other methods. With these items as the anchor in the DFTD strategy, the subsequent constant-item method performed well in both Type I error control and power of DIF assessment.

Keywords: differential item functioning, likelihood ratio test, DIF-free-then-DIF, scale purification


7 item response theory, IRT differential item functioning, Trends in International 1987 Mathematics and Science Study, TIMSS American Mathematics Competitions, AMC 1

8 Cole & Zieky, 2001 Educational Testing Service, ETS power Type I error Finch, 2005; Lord, 1980; Stark, Chernyshenko, & Drasgow, % inflated scale purification Candell & Drasgow, 1988; Clauser, Mazor, & Hambleton, 1993; French & Maller, 2007; Hidalgo-Montesinos & Gómez-Benito, 2003; Holland & Thayer, 1988; Lord, 1980; Park & Lautenschlager, 1990; Miller & Oshima, 1992; Navas-Ara & Gómez-Benito, 2002; Wang & Su, % 20% 20% constant-item method, CI; Thissen, Steinberg, & Wainer, 1988; Wang & Yeh,

9 -free anchor Wang free-then-, DFTD MIMIC multiple indicators, multiple causes confirmatory factor analysis methods; Oort, 1998) Shih & Wang, 2009; Wang & Shih, 2010 DFTD MIMIC Woods, 2009 Woods Thissen, Steinberg, & Wainer, 1988 likelihood ratio test, LRT; graded response model, GRM; Samejima,1969 rank-based likelihood ratio test statistic, LR statistic 1 all-other-item method, AOI; Wang & Yeh, f LR statistic 3 LR statistic f 4 3 g 4 g g -free -free rank-based rank-based LR statistic 3

10 LR statistic f -free Stark et al., 2006; Wang, 2004; Wang & Yeh, 2003 rank-based Woods 2009 Wang, 2004; Wang & Yeh, 2003 Woods rank-based GRM DFTD -free rank-based DFTD IRT DFTD -free -free -free rank-based rank-based rank-based LR statistic iterative constant item method, ICI method; Wang, 2008 rank-based -free -free DFTD -free -free DFTD DFTD ICI 4

11 -free 2009 ICI -free 5

12 item characteristic curves, ICCs item bias Ironson & Subkoviak, 1979; Rudner, Getson, & Knight, 1980; Shepard, Camilli, & Averill, 1981 Camilli & Shepard, 1994; Drasgow & Kang, 1984; Holland & Wainer, 1993; Lord, 1980 IRT IRT dichotomous 0 1 IRT 6

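In IRT, the models most often used for dichotomous items are the three-parameter logistic model (3PLM; Birnbaum, 1968), the two-parameter logistic model (2PLM; Birnbaum, 1968), and the Rasch model. Under the 3PLM, the probability that person n scores 1 on item i is

$$P_{ni} = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_n - b_i)]}{1 + \exp[a_i(\theta_n - b_i)]} \tag{1}$$

where \theta_n is the ability of person n, a_i is the discrimination parameter of item i, b_i is the difficulty parameter of item i, and c_i is the guessing parameter of item i. When c_i = 0, the 3PLM reduces to the 2PLM:

$$P_{ni} = \frac{\exp[a_i(\theta_n - b_i)]}{1 + \exp[a_i(\theta_n - b_i)]} \tag{2}$$

When a_i = 1 in Equation 2, the 2PLM reduces to the Rasch model:

$$P_{ni} = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} \tag{3}$$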
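The nesting of the three logistic models can be checked numerically. The following is a minimal sketch, not code from the study; the function name `irt_probability` and its argument names are chosen here for illustration only.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PLM.

    theta: person ability; b: item difficulty;
    a: item discrimination; c: item guessing parameter.
    With c = 0 this is the 2PLM; with a = 1 and c = 0 it is the Rasch model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# When ability equals difficulty, the Rasch and 2PLM probability is 0.5,
# while under the 3PLM it is c + (1 - c) / 2.
p_rasch = irt_probability(theta=0.0, b=0.0)
p_3plm = irt_probability(theta=0.0, b=0.0, a=1.2, c=0.2)
```

Setting `c=0.0` and `a=1.0` as defaults mirrors the reduction from Equation 1 to Equations 2 and 3.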

14 0 IRT sample independent impact IRTLRDIF Thissen, reference group focal group uniform nonuniform Mellenbergh, 1982 studied item 1 8

15 1 ICC 2 2 9

16 matching variable studied item IRT IRT Holland & Wainer, 1993 IRT IRT IRT IRT IRT IRT Mantel-Haenszel Holland & Thayer, 1988 Dorans & Kulick, 1986 Swaminathan & Rogers, 1990 IRT Lord's χ² Lord, 1980 Raju's Raju, 1988 Bolt, 2002; Kim & Cohen, 1995; Stark, Chernyshenko, & Drasgow, 2006; Wang, 2004; Wang & Yeh, 2003 Thissen Steinberg Gerrard 1986 Thissen Steinberg Wainer 1988 likelihood ratio test LR Neyman & Pearson, 1928 IRT polytomous MULTILOG Thissen, 1991 IRTLRDIF IRT Rasch 2PL 3PL 10

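The LRT compares the likelihood deviance (deviance = -2 log likelihood) of two nested models. In the compact model, the parameters of the studied item are constrained to be equal across groups; its deviance G_C^2 is given by Equation 4:

$$G_C^2 = -2\log(\text{likelihood}_{\text{compact}}) \tag{4}$$

In the augmented model, the parameters of the studied item are free to differ between groups; its deviance G_A^2 is given by Equation 5:

$$G_A^2 = -2\log(\text{likelihood}_{\text{augmented}}) \tag{5}$$

The difference between the two deviances, G^2, is given by Equation 6:

$$G^2 = G_C^2 - G_A^2 \tag{6}$$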
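The deviance difference can be turned into a test decision once the two deviances are known. The sketch below is illustrative only: it assumes G^2 is referred to a chi-square distribution whose degrees of freedom equal the number of parameters freed in the augmented model, and the helper names `chi2_sf` and `likelihood_ratio_test` are hypothetical. The deviance values in the usage lines are made up.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 2 (even case) and df = 1 (odd case).
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(x | k + 2) = Q(x | k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(deviance_compact, deviance_augmented, df):
    """Return (G2, p_value) for the compact versus augmented comparison."""
    g2 = deviance_compact - deviance_augmented
    return g2, chi2_sf(g2, df)


# Made-up deviances: G2 = 6.0 with 2 freed parameters.
g2, p_value = likelihood_ratio_test(10250.0, 10244.0, df=2)
```

A studied item would be flagged when `p_value` falls below the nominal level chosen by the analyst.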

18 2 ( 1) G G 1 2 G 2 all-other item method, Wang & Yeh, 2003 constant item method, Thissen et al., 1988; Wang & Yeh, 2003 Clauser, Mazor, & Hambleton, 1993; Kim & Cohen, % 12

19 scale purification IRT IRT Ackerman, 1992; Clauser et al., 1993 Rasch Miller & Oshima, 1992; Navas-Ara & Gómez-Benito, 2002 Holland Thayer 1988 Mantel-Haenszel two-step purification process Mantel-Haenszel iterative purification process Candell & Drasgow, 1988; Kok, Mellenbergh, & Van der Flier, 1985; Van der Flier, Mellenbergh, Adèr, & Wijn, % 20% 20% Candell & Drasgow, 1988; Cheung & Rensvold, 1999; Stark et al., 2006; Thissen et al., 1988, Wang & Shih, 2009 equal-mean-difficulty method, EMD Wang, 2004; Wang & Shih, 2010; Wang & Yeh, free Wang 2004 iterative constant item method 13
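The iterative purification process mentioned above can be sketched as a loop that re-tests every item against the currently unflagged items until the flagged set stops changing. This is a generic illustration, not the procedure used in the study: `flag_dif` is a hypothetical callback standing in for whatever DIF test is applied, and the toy detector below simply compares an item's difficulty shift with the mean shift of its anchors.

```python
def iterative_purification(items, flag_dif, max_iter=20):
    """Repeat DIF testing, removing flagged items from the anchor set each pass.

    flag_dif(item, anchors) is a hypothetical callback that returns True
    when `item` shows DIF given the list of anchor items.
    """
    flagged = set()
    for _ in range(max_iter):
        anchors = [i for i in items if i not in flagged]
        new_flagged = {
            i for i in items if flag_dif(i, [a for a in anchors if a != i])
        }
        if new_flagged == flagged:  # converged: same items flagged twice in a row
            break
        flagged = new_flagged
    return sorted(flagged)


# Toy data: true difficulty shifts between groups (items 6 to 9 have DIF).
shifts = [0.0] * 6 + [0.6, 0.6, 0.6, 0.3]


def toy_detector(item, anchors):
    anchor_mean = sum(shifts[a] for a in anchors) / len(anchors)
    return abs(shifts[item] - anchor_mean) > 0.25


single_pass = iterative_purification(range(10), toy_detector, max_iter=1)
purified = iterative_purification(range(10), toy_detector)
```

In this toy case a single pass misses item 9 because the contaminated anchor set masks its smaller shift, whereas the purified run recovers all four DIF items.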

20 -free -free Wang free-then- -free 4 Thissen et al., 1988; Wang & Yeh, 2003 ; Shih & Wang, 2009 DFTD MIMIC Shih & Wang, 2009 IRT DFTD -free 2009 MIMIC IRT IRT -free -free DFTD DFTD DFTD 14

21 IRT DFTD -free -free -free -free -free -free -free -free -free Shih & Wang, 2009; Wang & Shih, 2010 rank-based -free -free LRT-ST -free -free item by standard LRT method ; LRT-DFST, DFST LRT-SP -free -free item by LRT method with scale purification ; LRT-DFSP, DFSP -free -free item by LRT method with iterative constant ; LRT-DFICI, DFICI -free 15

22 -free DFST 1 LR statistic 2 1 LR statistic -free rank-based LR statistic -free DFSP 1 LR statistic 2 1 LR statistic -free DFST LR statistic -free DFICI 1 LR statistic LR statistic 2 1 -free LR statistic 1 LR statistic -free Matlab Hanson Beguin IRTLR 16
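The rank-based selection of DIF-free items can be sketched as follows, assuming that each item already has an LR statistic and that the items with the smallest statistics are taken as the anchor set. The function name, the default of four anchors, and the statistics in the usage lines are illustrative assumptions, not values from the study.

```python
def select_dif_free_anchors(lr_statistics, n_anchors=4):
    """Return the indices of the n_anchors items with the smallest LR statistics."""
    ranked = sorted(range(len(lr_statistics)), key=lambda i: lr_statistics[i])
    return sorted(ranked[:n_anchors])


# Made-up LR statistics for an 8-item test.
lr_statistics = [0.8, 12.4, 0.3, 5.1, 1.9, 0.5, 20.2, 2.7]
anchors = select_dif_free_anchors(lr_statistics, n_anchors=4)
```

The selected indices would then serve as the constant-item anchor when the remaining items are tested in the second DFTD step.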

23 ability difference sample size test length pattern percentage 0% Wang Su, free reference group R focal group F IRT Tang 1994 IRT 200 R250/F250 R500/F500 R1000/F free constant 17
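Generating one replication for a reference group and a focal group can be sketched as below. This is only an illustration under stated assumptions: responses follow the Rasch model, abilities are normal with unit variance, the focal group mean is lower by the ability difference, and the seed, difficulties, and function name are invented for the example.

```python
import math
import random


def simulate_responses(n_persons, ability_mean, difficulties, rng):
    """Simulate dichotomous Rasch responses; rows are persons, columns are items."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(ability_mean, 1.0)
        row = []
        for b in difficulties:
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


rng = random.Random(2024)  # arbitrary seed for reproducibility
b_reference = [rng.gauss(0.0, 1.0) for _ in range(20)]  # assumed difficulties
b_focal = list(b_reference)  # identical difficulties, i.e. no DIF items yet
reference = simulate_responses(250, 0.0, b_reference, rng)  # R250
focal = simulate_responses(250, -1.0, b_focal, rng)  # F250, ability difference of 1
```

Adding a shift to selected entries of `b_focal` before simulating would introduce uniform DIF for those items.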

24 balanced constant 20 20% 4 balanced 20 20% 2 2 balanced balanced constant -free 10% 20% 30% 40% -free free -free 1 -free 18
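The two DIF patterns can be illustrated with a small helper that assigns difficulty shifts to the DIF items. In this sketch, which is an assumption-laden illustration rather than the study's generator, the first items of the test are the DIF items, a positive shift means the item is harder for the focal group, and the function name and the magnitude of 0.6 are chosen for the example.

```python
def dif_shifts(n_items, dif_percentage, magnitude, pattern):
    """Return per-item difficulty shifts (focal minus reference).

    pattern "constant": every DIF item is shifted in the same direction.
    pattern "balanced": DIF items alternate direction, so the shifts cancel
    when the number of DIF items is even.
    """
    n_dif = round(n_items * dif_percentage)
    shifts = [0.0] * n_items
    for k in range(n_dif):
        if pattern == "constant":
            sign = 1.0
        elif pattern == "balanced":
            sign = 1.0 if k % 2 == 0 else -1.0
        else:
            raise ValueError("pattern must be 'constant' or 'balanced'")
        shifts[k] = sign * magnitude
    return shifts


# 20 items with 20% DIF items: 4 items in one direction (constant)
# or 2 items in each direction (balanced).
constant = dif_shifts(20, 0.20, 0.6, "constant")
balanced = dif_shifts(20, 0.20, 0.6, "balanced")
```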

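25 0.75 -free 0.5 -free 0.25

Wang (2001) proposed the average signed area (ASA). It is based on Raju's signed area between the ICCs of the two groups. The signed area (SA) of item i is

$$SA_i = (1 - c_i)(b_{iF} - b_{iR})$$

and, for a test of I items with c_i = 0, the ASA is

$$ASA = \sum_{i=1}^{I} SA_i / I = \sum_{i=1}^{I} (b_{iF} - b_{iR}) / I = \bar{b}_F - \bar{b}_R$$

where b_{iF} and b_{iR} are the difficulties of item i in the focal and reference groups. Under the balanced pattern the ASA is 0 (Wang, 2001); under the constant pattern the ASA differs from 0 and grows with the percentage of DIF items.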
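A short numerical check of the ASA follows. It is a sketch under the assumption of uniform DIF on the difficulty parameter; the function name and the 20-item difficulty vectors are made up for illustration.

```python
def average_signed_area(b_focal, b_reference, c=None):
    """Average of (1 - c_i) * (b_iF - b_iR) over all items."""
    n_items = len(b_focal)
    if c is None:
        c = [0.0] * n_items  # no guessing, as in the Rasch model and 2PLM
    areas = [(1.0 - ci) * (bf - br) for ci, bf, br in zip(c, b_focal, b_reference)]
    return sum(areas) / n_items


b_reference = [0.0] * 20
# Constant pattern: 4 of 20 items (20%) are 0.6 harder for the focal group.
b_constant = [0.6] * 4 + [0.0] * 16
# Balanced pattern: 2 items harder and 2 items easier for the focal group.
b_balanced = [0.6, -0.6, 0.6, -0.6] + [0.0] * 16

asa_constant = average_signed_area(b_constant, b_reference)
asa_balanced = average_signed_area(b_balanced, b_reference)
```

With a shift of 0.6, a constant pattern with 20% DIF items gives an ASA of about 0.12, and the balanced pattern gives 0.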

26 ASA ASA -free -free -free -free -free LRT-ST standard LRT method, LRT method with scale purification, LRT-SP DFTD pure anchor -free pure anchor LRT method with pure anchor, LRT-PA LRT-PA LRT-PA LRT-SP matching variable pure anchor LRT-PA 20

27 -free -free 1 60 Matlab IRTLR ability difference sample size test length pattern percentage amount R250/F250 R500/F500 R1000/F1000 Rogers & Swaminathan,

28 constant balanced Finch, 2005; Wang & Yeh, % 10% 20% 30% 40% uniform ASA ASA 22

29

30 EZ Waller, 1998 IRTLR IRTLF DOS University of North Carolina at Chapel Hill David Thissen IRTLR item hypothesis test deviance G 2 degrees of freedom

31 -free free constant 2 R250/F250 DFST 20% % DFSP 30% DFICI 0.9 DFST 0.9 DFSP DFICI 0.9 DFSP 30% 40% DFICI 40% DFSP 1 DFICI 40%

32 20 DFST 20% 0.9 DFSP 30% DFICI 40% 0.94 DFICI 0.92 DFST DFSP 30% 1 DFST DFSP DFST DFSP 40% DFSP 1 DFICI 40% % DFSP 0.96 DFSP constant DFICI 0.9 constant ASA ASA balanced balanced DFST DFSP

33 1 DFICI 10% % DFST DFSP DFICI 20 balanced ASA 0 ASA 0.01 ASA 0 -free ASA ASA 0 DFST DFSP 0.9 DFICI ASA 0 balanced DFSP DFST constant DFSP -free DFST DFICI DFSP constant DFICI DFSP 27

34 balanced DFICI free ASA DFSP ASA 0.24 DFICI ASA 0.24 DFSP ASA 0 DFICI DFSP DFICI ASA 0.24 ASA ICI -free ICI DFICI ASA 0 -free 28

[Table 2: selection of DIF-free items for 20-item and 40-item tests, by ability difference, pattern (constant, balanced), percentage, ASA, and sample size (R250/F250, R500/F500, R1000/F1000), for DFST, DFSP, and DFICI; cell values not preserved.]

36 3 ASA 20 4 ASA 40 5 ASA 20 6 ASA 40 30

37 -free DFSP -free LRT-DFSP LRT-DFSP LRT-ST LRT-SP LRT-PA constant LRT-ST 20% LRT-PA LRT-DFST LRT-PA LRT-SP LRT-ST 20% LRT-SP R500/F500 LRT-DFST LRT-PA 60 LRT-ST 20% LRT-SP R250/F250 LRT-DFSP LRT-PA LRT-DFSP LRT-SP LRT-SP LRT-DFSP 31

38 LRT-ST R1000/F % LRT-DFSP LRT-PA 0.4 LRT-SP LRT-ST R500/F500 R1000/F % LRT-SP R250/F250 40% LRT-PA LRT-DFSP LRT-SP LRT-ST 20% LRT-SP LRT-DFSP LRT-SP LRT-DFSP LRT-ST R500/F500 20% LRT-SP R500/F LRT-PA LRT-DFSP LRT-DFSP LRT-PA LRT-ST 30% LRT-SP R250/F250 R500/F500 40% LRT-PA LRT-DFSP 32 LRT-ST 20%

39 LRT-DFSP LRT-ST 20% LRT-SP 40% LRT-PA LRT-DFSP LRT-PA LRT-DFSP 0.83 LRT-DFSP LRT-PA 4 7 constant ASA LRT-ST ASA 0.12 balanced balanced balanced ASA 0 Wang, 2001 ASA LRT-ST LRT-SP LRT-PA LRT-DFSP LRT-SP 33

40 LRT-ST LRT-ST LRT-DFSP LRT-PA LRT-PA LRT-DFSP LRT-ST LRT-SP LRT-ST LRT-ST constant LRT-DFSP LRT-SP 34
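Type I error and power in a simulation of this kind are usually summarized from per-replication flags. The sketch below assumes that the Type I error rate is the proportion of DIF-free items flagged and that power is the proportion of true DIF items flagged, both pooled over replications; the function name and the toy flags are illustrative only.

```python
def type1_error_and_power(flags, is_dif):
    """Summarize detection results over replications.

    flags: one list per replication, with True where an item was flagged.
    is_dif: list with True where an item was generated with DIF.
    """
    false_positives = hits = n_free = n_dif = 0
    for replication in flags:
        for flagged, dif in zip(replication, is_dif):
            if dif:
                n_dif += 1
                hits += int(flagged)
            else:
                n_free += 1
                false_positives += int(flagged)
    type1 = false_positives / n_free if n_free else float("nan")
    power = hits / n_dif if n_dif else float("nan")
    return type1, power


# Toy example: 4 items (the last one has DIF) and 4 replications.
is_dif = [False, False, False, True]
flags = [
    [False, False, False, True],
    [True, False, False, True],
    [False, False, False, False],
    [False, False, False, True],
]
type1, power = type1_error_and_power(flags, is_dif)
```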

[Table 4: constant, 0.4. Type I error and power of L-ST, L-SP, L-PA, and L-DFSP, by sample size (R250_F250, R500_F500, R1000_F1000), percentage, and ASA, for NI20, NI40, and NI60; cell values not preserved.]

[Table 5: constant, 0.6. Same layout as Table 4; cell values not preserved.]

[Table 6: constant, 0.4. Same layout as Table 4; cell values not preserved.]

[Table 7: constant, 0.6. Same layout as Table 4; cell values not preserved.]

[Table 8: balanced, 0.4. Same layout as Table 4; cell values not preserved.]

[Table 9: balanced, 0.6. Same layout as Table 4; cell values not preserved.]

[Table 10: balanced, 0.4. Same layout as Table 4; cell values not preserved.]

[Table 11: balanced, 0.6. Same layout as Table 4; cell values not preserved.]

49 constant LRT-ST 20% Finch, 2005; Stark et al., 2006 LRT-SP LRT-PA LRT-PA -free pure anchor Shih Wang 2009 MIMIC LRT-DFSP LRT-PA LRT-SP 2001 LRT-PA LRT-DFSP LRT-DFSP LRT-SP LRT-DFSP balanced LRT-ST LRT-SP LRT-DFSP LRT-SP constant balanced LRT-SP 30% LRT-ST LRT-DFSP 43

50 Analysis of Variance 1. LRT-ST ANOVA F F η 2 η Cohen, LRT-ST ANOVA ASA F 7,298 = η 2 =0.969 F 2,298 = η 2 =0.865 F 2,298 = η 2 =0.338 F 1,298 = η 2 =0.280 ASA F 14,298 = η 2 =0.889 ASA F 14,298 = η 2 =0.404 ASA F 7,298 = η 2 =0.406 LRT-ST ASA ASA 0.12 constant ASA 2. LRT-SP 13 LRT-SP ANOVA ASA F 7,298 = η 2 =0.809 F 1,298 =

51 η 2 =0.187 F 2,298 = η 2 =0.162 ASA F 7,298 = η 2 =0.356 ASA F 14,298 =7.596 η 2 =0.263 ASA F 14,298 =6.708 η 2 =0.240 LRT-SP ASA ASA LRT-PA LRT-PA ANOVA 14 F 2,298 = η 2 =0.327 LRT-PA LRT-PA 4. LRT-DFSP 15 LRT-DFSP ANOVA ASA F 7,298 = η 2 =0.662 F 1,298 = η 2 =0.302 F 2,298 = η 2 =0.183 F 2,298 = η 2 =0.156 ASA F 14,298 =4.701 η 2 =0.181 LRT-DFSP ASA constant ASA 1. LRT-ST 16 LRT-ST ANOVA F 2,155 = η 2 =0.843 ASA F 7,155 = η 2 =0.598 F 1,155 = η 2 =

52 F 2,155 = η 2 =0.294 ASA F 12,155 =3.775 η 2 =0.226 LRT-ST R250/F250 R1000/F % 180% % ASA 0.12 ASA ASA 2. LRT-SP 17 LRT-SP ANOVA F 2,213 = η 2 =0.914 F 1,213 = η 2 =0.790 F 2,213 = η 2 =0.497 ASA F 7,213 = η 2 =0.480 F 1,213 = η 2 =0.330 ASA F 14,213 =5.083 η 2 =0.250 LRT-SP R250/F250 R1000/F % 200% LRT-SP ASA ASA 0.18 ASA LRT-ST 3. LRT-PA LRT-PA ANOVA 18 F 2,226 = η 2 =0.888 F 1,226 = η 2 =0.761 F 2,226 = η 2 =0.667 F 1,226 = η 2 =0.628 ASA F 7,226 = η 2 =0.306 ASA F 14, 226 =4.560 η 2 =0.220 LRT-PA LRT-SP LRT-PA 46

53 LRT-SP LRT-PA LRT-SP 14% 4. LRT-DFSP 19 LRT-DFSP ANOVA F 2,226 = η 2 =0.923 F 1,226 = η 2 =0.801 F 1,226 = η 2 =0.775 ASA F 7,226 = η 2 =0.519 F 2,226 = η 2 =0.466 ASA F 7,226 =7.487 η 2 =0.188 ASA F 14, 226 =2.970 η 2 =0.155 LRT-DFSP R250/F250 R1000/F % 300% % 2001 DFTD 47
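The reported effect sizes can be cross-checked against the F ratios that survive in the text. The sketch below assumes that the reported η² is the partial eta squared, which is an inference from the numbers rather than something the text states; under that assumption η² = F·df1 / (F·df1 + df2), and the surviving pairs, such as F(14, 298) = 7.596 with η² = 0.263, agree with it. The function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Pairs that survive in the text: F(14, 298) = 7.596, 6.708, 4.701
# and F(12, 155) = 3.775.
eta_a = partial_eta_squared(7.596, 14, 298)
eta_b = partial_eta_squared(6.708, 14, 298)
eta_c = partial_eta_squared(4.701, 14, 298)
eta_d = partial_eta_squared(3.775, 12, 155)
```

Rounded to three decimals, these four values match the η² figures printed next to the same F ratios.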

[Table 12: ANOVA for LRT-ST. Effects: testlength, samplesize, abilitydifference, amount, ASA, testlength * samplesize, testlength * abilitydifference, testlength * amount, testlength * ASA, samplesize * abilitydifference, samplesize * amount, samplesize * ASA, abilitydifference * amount, abilitydifference * ASA. Columns: F, Eta. Cell values not preserved.]

[Table 13: ANOVA for LRT-SP. Same effects and columns as Table 12; cell values not preserved.]

[Table 14: ANOVA for LRT-PA. Same effects and columns as Table 12; cell values not preserved.]

[Table 15: ANOVA for LRT-DFSP. Same effects and columns as Table 12; cell values not preserved.]

[Table 16: ANOVA for LRT-ST. Same effects and columns as Table 12; cell values not preserved.]

[Table 17: ANOVA for LRT-SP. Same effects and columns as Table 12; cell values not preserved.]

[Table 18: ANOVA for LRT-PA. Same effects and columns as Table 12; cell values not preserved.]

[Table 19: ANOVA for LRT-DFSP. Same effects and columns as Table 12; cell values not preserved.]

58 rank-based rank-based -free rank-based -free -free 1 rank-based -free -free rank-based -free LRT-DFSP DFTD LRT-PA LRT-DFSP LRT-PA -free DFTD LRT-DFSP 0.9 IRT LRT-DFSP ASA 52

59 LRT-DFSP free -free -free constant -free balanced -free 20% 53 -free

60 DFTD hierarchical generalized linear modeling ; HGLM SIBTEST 54

(2009) -Free-then- Logistic Regression (2010) IRTLR (2001)

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact and item validity from a multidimensional perspective. Journal of Educational Measurement, 29.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12.

Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27.

Clauser, B., Mazor, K., & Hambleton, R. K. (1993). The effects of purification of the matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cole, N. S., & Zieky, M. J. (2001). The new faces of fairness. Journal of Educational Measurement, 38.

Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23.

Drasgow, F., & Kang, T. (1984). Statistical power of differential validity and differential prediction analyses for detecting measurement nonequivalence. Journal of Applied Psychology, 69.

Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29.

French, B. F., & Maller, S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67.

Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26.

Hidalgo-Montesinos, M. D., & Gómez-Benito, J. (2003). Test purification and the evaluation of differential item functioning with multinomial logistic regression. European Journal of Psychological Assessment, 19.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.

Holland, P. W., & Wainer, H. (1993). DIF detection and description: Mantel-Haenszel and standardization. In N. J. Dorans & P. W. Holland (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Ironson, G. H., & Subkoviak, M. J. (1979). A comparison of several methods of assessing bias. Journal of Educational Measurement, 16.

Kim, S.-H., & Cohen, A. S. (1992). Effects of linking methods on detection of DIF. Journal of Educational Measurement, 29(1).

Kim, S.-H., & Cohen, A. S. (1995). A comparison of Lord's chi-square, Raju's area measures, and the likelihood ratio test on detection of differential item functioning. Applied Measurement in Education, 8.

Kok, F. G., Mellenbergh, G. J., & Van der Flier, H. (1985). Detecting experimentally induced item bias using the iterative logit method. Journal of Educational Measurement, 22.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7.

Miller, M. D., & Oshima, T. C. (1992). Effect of sample size, number of biased items, and magnitude of bias on a two-stage item bias estimation method. Applied Psychological Measurement, 16.

Navas-Ara, M. J., & Gómez-Benito, J. (2002). Effects of ability scale purification on identification of DIF. European Journal of Psychological Assessment, 18.

Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A.

Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5.

Park, D. G., & Lautenschlager, G. J. (1990). Improving IRT item bias detection with iterative linking and ability scale purification. Applied Psychological Measurement, 14.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original edition published in 1960)

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17.

Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). Biased item detection techniques. Journal of Educational Statistics, 5.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.

Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test-item bias with internal and external ability criteria. Journal of Educational Statistics, 6.

Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33.

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with CFA and IRT: Toward a unified strategy. Journal of Applied Psychology, 91.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27.

Tang, H. (1994, January). A new IRT-based small sample DIF method. Paper presented at the annual meeting of the Southwest Educational Research Association, San Antonio, TX.

Thissen, D. (1991). MULTILOG user's guide (Version 6) [Computer software]. Mooresville, IN: Scientific Software.

Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. University of North Carolina at Chapel Hill.

Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Erlbaum.

Van der Flier, H., Mellenbergh, G. J., Adèr, H. J., & Wijn, M. (1984). An iterative item bias detection method. Journal of Educational Measurement, 21.

Waller, N. G. (1998). EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures. Applied Psychological Measurement, 22, 391.

Wang, W.-C. (2001, September). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. Paper presented at the annual meeting of the Chinese Psychological Association, Chia-Yi, Taiwan. Manuscript submitted for publication.

Wang, W.-C. (2004). Effects of anchor item methods on differential item functioning detection within the family of Rasch models. Journal of Experimental Education, 72.

Wang, W.-C. (2008). Assessment of differential item functioning. Journal of Applied Measurement, 9.

Wang, W.-C., & Shih, C.-L. (2010). A new strategy to assess differential item functioning. Educational and Psychological Measurement.

Wang, W.-C., & Su, Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17.

Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27.

Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33.