Six Major Challenges for Educational and Psychological Testing Practices Ronald K. Hambleton University of Massachusetts at Amherst


1 Six Major Challenges for Educational and Psychological Testing Practices Ronald K. Hambleton University of Massachusetts at Amherst Annual APA Meeting, New Orleans, Aug. 11, 2006

2 In 1966 (when I began my studies at the University of Toronto in Canada): 1. Multiple-Choice Tests 2. Relatively Simple Statistics (little beyond ANOVA and linear regression) 3. Routine Psychometric Studies Could Be Published 4. Computer Cards/Tapes

3 In 2006 (40 years later) 1. Wide Array of Item Types 2. Complex Statistical Modeling of Data (IRT, GT, SEM) 3. Standard-Setting, DIF, CBT, CAT, Performance Testing, Automated Scoring and Test Development 4. Laptops, Desktops, Internet

4 It was impossible in 1966 to predict the changes that would come by 2006, but a few initial predictions about the next 40 years seem possible because some trends are clear: 1. Wider Uses of Psychological Tests in International Markets 2. Advances in Modeling of Test Data 3. New Item Types/Scoring Are Coming -High Fidelity Simulations -Item Algorithms, Item Cloning -Computer Scoring of Free Responses

5 State of Affairs Today, cont.: 4. Advances with Computer-Based Tests 5. Improvements in Score Reporting Practices (e.g., simpler, clearer, more informative displays) 6. And, Better Training in Psychometric Methods Is Needed (for Psychologists and Educational Research Specialists)

6 Two Goals of the Presentation Address these six (likely) advances and their impact on educational and psychological testing practices. Describe challenges that need to be addressed.

7 1. Use of Tests in International Markets Interest in test translations and test adaptations has increased tremendously in the past 15 years: --Several IQ and personality tests have been adapted into more than 100 languages. --Achievement tests for large scale international assessments (PISA, TIMSS) in over 30 languages.

8 1. Use of Tests in International Markets --International use of credentialing exams is expanding (e.g., see Microsoft). --Many high school graduation/college admissions tests are in multiple languages (e.g., see Israel, South Africa, USA). --Health scientists' Quality of Life measures are receiving wide use in many languages and cultures. --Marketing researchers are doing more of this work as well.

9 1. Use of Tests in International Markets But--major misunderstandings about the difficulties of translating and adapting tests from one language and culture to another. (See Hambleton, Merenda, & Spielberger, 2006; ITC Brussels Conference, 2006)

10 Example 1 "Out of sight, out of mind" (back-translated from French): "invisible, insane"

11 Example 2 (IEA Study in Reading) Are these words similar in meaning? Pessimistic -- Sanguine

12 Pessimistic -- Sanguine Adapted to Pessimistic -- Optimistic

13 Example 3 (1995 TIMSS Pilot) Alex reads his book for 1 hour and then uses a bookmark to keep his place. How much longer will it take him to finish the book? A. ½ hour B. 2 hours C. 5 hours D. 10 hours

14 Common Misunderstandings: That almost anyone who knows two languages can do the translation. That a backward translation design is sufficient. (A forward design is needed.) That translators, if they have the correct training, can produce a valid instrument in a second language and culture. That use of bilinguals to compile empirical evidence is sufficient.

15 Challenges Ahead: Hire qualified translators (and several of them). Use forward and backward designs (and newer designs) to review test items. Compile empirical evidence to address construct, method, and item bias.

16 Challenges Ahead, cont.: Integrate best methodologies and practices to guide future test adaptation studies. Recognize the complexity of the work, so more resources, time, and expertise are available to do the job consistent with ITC and AERA/APA/NCME test standards.

17 2. Advances in Statistical Modeling of Test and Item Level Data IRT models have become popular, and for several good reasons: they have many positive features (e.g., model parameter invariance, item and test information). Modern measurement theory and practices are now here.

18 Item Response Functions (4-choice item): [Figure: category response functions for k = 0, 1, 2, 3, plotting probability (0 to 1.0) against ability for an item with a_i = 1.00 and thresholds b_i1, b_i2, b_i3]

19 Graded Response Model: $P^*_{ix}(\theta) = \dfrac{e^{Da_i(\theta - b_{ix})}}{1 + e^{Da_i(\theta - b_{ix})}}$ for $x = 1, \ldots, m_i$, with $P^*_{i0}(\theta) = 1.0$ and $P^*_{i(m_i+1)}(\theta) = 0.0$; category probabilities are $P_{ix}(\theta) = P^*_{ix}(\theta) - P^*_{i(x+1)}(\theta)$.
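
To make the graded response model concrete, here is a minimal Python sketch that differences adjacent boundary curves to obtain category probabilities; the parameter values are hypothetical, and the D = 1.7 scaling constant is taken from the Da_i(θ − b_ix) term above.

```python
import numpy as np

D = 1.7  # logistic scaling constant, as in the Da_i(theta - b_ix) term above

def grm_category_probs(theta, a, b):
    """Graded response model: category probabilities P_ix(theta).

    a : item discrimination a_i
    b : ordered threshold parameters b_i1 < ... < b_im
    Returns P_i0(theta), ..., P_im(theta).
    """
    b = np.asarray(b, dtype=float)
    # Boundary curves P*_ix(theta) for x = 1..m
    p_star = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    # Add P*_i0 = 1.0 and P*_i(m+1) = 0.0, then difference adjacent boundaries
    p_star = np.concatenate(([1.0], p_star, [0.0]))
    return p_star[:-1] - p_star[1:]

# Hypothetical 4-category item
probs = grm_category_probs(theta=0.5, a=1.0, b=[-1.0, 0.0, 1.0])
print(probs, probs.sum())  # four category probabilities; they sum to 1
```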

20 Generalized Partial Credit Model: $P(x_i = k \mid \theta) = \dfrac{\exp\left[\sum_{s=1}^{k} a_i(\theta - b_{si})\right]}{1 + \sum_{r=1}^{m_i} \exp\left[\sum_{s=1}^{r} a_i(\theta - b_{si})\right]}$ for $k = 1, \ldots, m_i$ (the numerator is 1 when $k = 0$).
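
A companion sketch for the generalized partial credit model, again with hypothetical step parameters; it follows the cumulative-sum form of the formula above, with a numerator of 1 for k = 0.

```python
import numpy as np

def gpcm_category_probs(theta, a, b):
    """Generalized partial credit model: P(x_i = k | theta) for k = 0..m.

    a : item discrimination a_i
    b : step parameters b_1i, ..., b_mi
    """
    b = np.asarray(b, dtype=float)
    # Cumulative sums sum_{s=1}^{k} a_i * (theta - b_si), for k = 1..m
    z = np.cumsum(a * (theta - b))
    numer = np.concatenate(([1.0], np.exp(z)))  # 1 for k = 0, exp(z_k) otherwise
    return numer / numer.sum()

# Hypothetical 4-category (3-step) item
print(gpcm_category_probs(theta=0.0, a=1.2, b=[-0.8, 0.2, 1.1]))
```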

21 New IRT Polytomous Response Models Partial credit model Generalized partial credit model Graded response model Logistic multidimensional model Rating scale models Hundreds more models exist!

22 Many Examples of Successful IRT Applications Automated test assembly (targeting) Computer-adaptive testing (shorten) Detection of potentially biased test items Equating (fairness and change) Test score reporting (e.g., item mapping) (IRT creates options)

23 Challenges Ahead: There are questions of model choice (fit, practicality), and calibration of items with small samples. Identifying and handling dependencies in the data (common with new item types).

24 Challenges Ahead, cont.: Establishing invariance of item parameters over subgroups of the population of interest. (e.g., Black, Hispanic, White; Male, Female; state to state, country to country) More training is needed for persons to do the IRT applications, read the test manuals, etc.

25 Ability Estimation [0-1 vs. Testlet Scoring] See paper by Zenisky, et al., JEM. [Figure: scatterplot of dichotomously-scored ability estimates against polytomously-scored ability estimates]


27 3. Generation of New Item Types Lots of sizzle here with simulations (e.g., virtual reality, performance tasks) and other item types. But-- --Can new skills be measured? --Can old skills be measured better? --What's the value-added versus the costs of development? Measurement/minute of testing?

28 Site Planning Vignettes (Bejar, 1991) Image from NCARB (2000)

29 Site Planning Vignettes (Bejar, 1991) Image from NCARB (2000)

30 Dynamic Problem Solving Simulation (Clauser, et al., 1997) Image from NBME (2001)

31 Examples of Advances Pioneering research of Bennett and his colleagues with the architectural exams. Work of Clauser and Nungester with sequential problem solving tests in medicine.

32 Immediate, Less Costly, and Useful New Item Formats Multiple-Correct Answers Short Answer Extended Answer (Essay) Highlighting Text Inserting Text

33 Ranking (or Ordering) Numerical Responses (Including Multiple) Drag and Drop Sequential Problems

34 More than 50 new item formats. Complex item stems, sorting tasks, interactive graphics, audio, visual, job aids, sequential problems, joysticks, touch screens, pattern scoring, and more.

35 Challenges Ahead: An increased commitment to validation of these new item types is needed: --Face validity is important but not sufficient. Much more empirical validity evidence is needed to support the use of new item types. --Need to judge increase in test score validity against extra time and costs.

36 4. Computer-Based Testing Advantages are well-known: --Flexibility in scheduling tests --Potential for immediate score reporting --Assessment of higher level thinking with new item types (in principle) --New test designs (to reduce time) Many testing agencies on computer.

37 Computer-Based Test (CBT) Designs: Linear, Multi-Stage, CAT

38 Fixed Length Multiple Forms (Linear) A Single Form (acceptable if volume is low) Multiple Parallel Forms Linear on the Fly Tests (LOFT)

39 [Figure: item bank with easy (E), medium (M), and hard (H) items arrayed along the proficiency scale from low to high]

40 Three-Stage Test Design: Stage 1 (Routing Test) → Stage 2 modules: Easy (E), Medium (M), Hard (H) → Stage 3 modules: E-E, E-M, M-E, M-M, H-M, H-H
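
A minimal sketch of how routing through the three-stage design above might work. The number-correct routing rule and the cut scores are invented for illustration; operational multistage tests typically route on provisional IRT ability estimates rather than raw scores.

```python
def route(stage1_score, stage2_score, cut1=(10, 20), cut2=15):
    """Route a candidate through a three-stage design.

    cut1 : hypothetical Stage 1 cut scores separating the Easy/Medium/Hard modules
    cut2 : hypothetical Stage 2 cut score separating the two Stage 3 options
    """
    # Stage 1 (routing test) determines the Stage 2 module
    if stage1_score < cut1[0]:
        stage2 = "E"
    elif stage1_score < cut1[1]:
        stage2 = "M"
    else:
        stage2 = "H"

    # Stage 2 performance picks the Stage 3 module from the allowed paths
    stage3_options = {"E": ("E-E", "E-M"), "M": ("M-E", "M-M"), "H": ("H-M", "H-H")}
    easier, harder = stage3_options[stage2]
    stage3 = harder if stage2_score >= cut2 else easier
    return stage2, stage3

print(route(stage1_score=22, stage2_score=17))  # -> ('H', 'H-H')
```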

41 Automated Test Construction Mimicking test development committees Content and statistical considerations, exposure controls Operations research methodology, linear programming, IRT van der Linden, Luecht, Stocking, and others have advanced the topic
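
The slide names operations research methodology and linear programming (van der Linden and others) as the engines of automated test assembly. The sketch below substitutes a simple greedy heuristic and an invented 2PL item bank just to make the basic idea concrete: build a form toward a statistical target while honoring a content constraint.

```python
import numpy as np

def item_information(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def greedy_assembly(bank, theta_target, test_length, max_per_area):
    """Greedily pick items that maximize information at theta_target,
    allowing at most max_per_area items from any one content area."""
    chosen, per_area = [], {}
    ranked = sorted(bank, key=lambda it: -item_information(theta_target, it["a"], it["b"]))
    for it in ranked:
        if len(chosen) == test_length:
            break
        if per_area.get(it["area"], 0) < max_per_area:
            chosen.append(it)
            per_area[it["area"]] = per_area.get(it["area"], 0) + 1
    return chosen

# Hypothetical 50-item bank spread over three content areas
rng = np.random.default_rng(0)
bank = [{"id": i, "a": rng.uniform(0.5, 2.0), "b": rng.normal(), "area": f"A{i % 3}"}
        for i in range(50)]
form = greedy_assembly(bank, theta_target=0.0, test_length=10, max_per_area=4)
print(sorted(it["id"] for it in form))
```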

42 One Big Challenge: Item Exposure Items are exposed to candidates every day testing is done. How serious is item exposure? When present, test score validity is lowered. (e.g., the GRE case)

43 Moving Averages (Ning and Hambleton, 2006) [Figure: moving-average chart with reference lines at M + 2*SD, M, and M − 2*SD]
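
A rough sketch of the moving-average idea behind the chart above: track an item's running proportion correct over successive candidates and flag windows that drift above the baseline mean plus two standard errors. The window and baseline sizes here are invented for illustration, not the values used in the Ning and Hambleton study.

```python
import numpy as np

def moving_average_flags(responses, window=50, baseline_n=200):
    """Flag possible exposure from a stream of 0/1 responses to one item.

    The baseline mean (M) and its window-mean SD come from the first
    baseline_n responses; later moving averages above M + 2*SD are flagged.
    """
    responses = np.asarray(responses, dtype=float)
    m = responses[:baseline_n].mean()
    sd = responses[:baseline_n].std(ddof=1) / np.sqrt(window)  # SD of a window mean
    kernel = np.ones(window) / window
    ma = np.convolve(responses, kernel, mode="valid")  # moving average
    flags = np.where(ma > m + 2 * sd)[0]
    return m, sd, flags

# Simulated stream: proportion correct jumps from .55 to .75 after the item leaks
rng = np.random.default_rng(1)
stream = np.concatenate([rng.binomial(1, 0.55, 400), rng.binomial(1, 0.75, 200)])
m, sd, flags = moving_average_flags(stream)
print(f"baseline M = {m:.2f}; first flagged window: {flags[0] if flags.size else None}")
```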

44 Example of an Exposed Item

45 One Big Challenge: Item Exposure How can item exposure be detected? How much more vulnerable are the performance based tasks? How can the tasks be disguised and/or cloned? Impact of even minor revisions on item statistics? Can item types be found that may be less susceptible to exposure?

46 Other Challenges, cont.: How to make CBT cost effective for schools? Researching other ways to address item exposure: Increasing the size of item banks via cloning, algorithmic item writing, rotating banks, writing items to statistical specs., etc. Matching test designs to intended uses of the scores.

47 5. Improvements in Score Reporting One of the least studied topics in assessment today (do you know of any research?), and one of the most important: there is lots of evidence that score users are easily confused. (The concept of measurement error is not understood; error bands are confusing.)

48 Score Reporting Critically important topic, and almost no educational research studies available. Substantial empirical evidence suggesting that policy-makers, educators, and the public are confused by test score scales and reports. (What are typical IQ scores?) Thanks to April Zenisky for the next slide:

49 [Figure: draft score report with reviewer annotations: put the results for both years for a single state together, then list the next state; lots of questions about the axis]

50 One Promising Advance: Placing meaningful points on test score scales--e.g., performance standards, defining skills at selected scores, providing averages, market basket concept (e.g., explaining what respondents can do in relation to a collection of test items).
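
One way to give scale points meaning is item mapping: anchor each item to the point on the scale where examinees reach a specified probability of success, then describe the scale in terms of the anchored skills. The sketch below assumes a 2PL model and the common (but not universal) RP = 0.67 response-probability criterion; the items and skill descriptions are invented.

```python
import numpy as np

def item_map_point(a, b, rp=0.67, D=1.7):
    """Scale point where a 2PL item is answered correctly with probability rp.

    Solving rp = 1 / (1 + exp(-D a (theta - b))) for theta gives
    theta = b + ln(rp / (1 - rp)) / (D a).
    """
    return b + np.log(rp / (1.0 - rp)) / (D * a)

# Hypothetical items: (a, b, short skill description)
items = [(1.2, -0.5, "identify the main idea"),
         (0.9,  0.3, "summarize an argument"),
         (1.5,  1.1, "evaluate competing claims")]
for a, b, skill in sorted(items, key=lambda it: item_map_point(it[0], it[1])):
    print(f"theta = {item_map_point(a, b):+.2f}: {skill}")
```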

51 Item Characteristic Curves for an Item Bank [Figure: item characteristic curves plotting expected score (on the 0-1 metric) against the proficiency scale, with reporting categories (topics) and item points mapped onto the scale]

52 Candidate Diagnostic Score Report 1 Performance levels and score ranges: PASSING (75 to 100), NEAR PASSING (65 to 74), WEAKNESSES (55 to 64), MAJOR WEAKNESSES (1 to 54). Candidate Performance: 60. Content/Skill Areas: each performance level carries descriptive text of the form "Candidates in this performance level can [text to be inserted here]. Many candidates in this performance level do not [insert relevant text here]."

53 Diagnostic Score Report No. 2

54 Challenges: Can we develop empirically based principles to assist in the design of meaningful and useful score scales and reports? How can diagnostic reports be enhanced? (e.g., rule space methodology, MIRT, collateral and prior information)

55 Challenges, cont.: Evaluation of new methods for studying score reports: Focus groups, think aloud studies, experimental studies, field-tests. Need to commit more resources and time to this immensely important topic!

56 6. Improvement in Training for Specialists and Others Major shortage of persons with good psychometric training. We need to do a better job in training educators and psychologists to construct and to use tests incorporating recent advances. --Many Schools of Education and Psychology offer only minimal training.

57 Challenges: What knowledge and skills do modern psychometricians need? What do counselors, teachers, and others need to learn about testing and testing practices to increase the validity of test score uses?

58 Conclusions It is easy to make the case that the emerging technology (IRT models, computers, item types, etc.) should be used to improve credentialing, selection, achievement, and personality tests; the face validity is high. At the same time, research on the various advances must be carried out, and the AERA-APA-NCME Test Standards followed, to confirm the strengths and weaknesses of these advances.

59 Conclusions, cont.: Innovations and technological advances without supporting research findings and validity evidence are simply sizzle and marketing and won't necessarily lead to more valid assessments.

60 Conclusions, cont.: More important topics to study too: Admissions testing Cognition and testing Hierarchical modeling and analysis of test data

61 Conclusions, cont.: A strong argument has been made here for full employment of psychometricians! At the same time, all six topics, and many more, are critical if tests in the 21st century are going to meet the complex informational needs of our society.