Stefanie Moerbeek, Product Developer, EXIN; Greg Pope, Analytics and Psychometrics Manager, Questionmark


1 Stefanie Moerbeek, Product Developer, EXIN; Greg Pope, Analytics and Psychometrics Manager, Questionmark

2 Stefanie Moerbeek introduction: EXIN (Examination Institute for Information Science), Senior Coordinator Examination Development; almost 10 years of development experience across all exam products. We design, develop and deliver for the national and international market: Business Information Management, Business IT Alignment, User Support and Training, Instructional Design, learning and testing. Greg Pope introduction: Questionmark, Analytics and Psychometrics Manager; spent 7 years as a psychometrician with the high stakes testing program in Alberta, Canada (Alberta Education); after that spent 3 years designing psychometric software (statistical cheater analysis, computer adaptive testing); joined Questionmark three years ago in product management; worked on the Results Management System (RMS); currently working on the data warehouse/ETL project and soon the new reporting system. Slide 2

3 By show of hands: How many people work in corporations? How many work in academic institutions? How many work in government/military? Other? By show of hands: How many people feel very comfortable with psychometrics (item and test statistics)? How many feel somewhat comfortable? How many feel not comfortable and want to be more comfortable? Slide 3

4 This session will feature a facilitated discussion of several hot topics in the testing industry today that many learning and assessment professionals encounter. Stefanie and I will present a topic and we can all have a brief discussion. We will then form groups (at your tables) to come up with strategies to deal with the issues raised. Notes on the strategies will be documented on the notepads provided at your table by one person in the group. The notes will be collected at the end, and Stefanie and I will publish them to the Questionmark Customer Wiki for all Questionmark customers to use. Slide 4

5 A. Overview: Stefanie and Greg present an overview of the area. B. Full group discussion: as a group we all have a brief discussion of the topic. C. Scenario presentation: Stefanie and Greg present a real-life scenario to the group. D. Small group discussion: at your tables, discuss the problem and document strategies for how to resolve the problem. Slide 5

6 Topics for discussion: 1. Effective techniques for beta testing of questions 2. Using random administration of assessments 3. Determining and using psychometric rules for defensible questions and assessments 4. Dealing with intellectual property theft of exams Slide 6

7 The process of beta testing is an important step in ensuring the reliability and validity of an assessment. Beta testing can be defined as evaluating the statistical performance of newly created questions before they are included on an actual, large-scale high-stakes exam. Generally, newly developed questions that have gone through the necessary editing and review processes are administered to representative samples of participants, either in advance of or during an actual high-stakes assessment. Psychometric information about the new questions can then be used to build high-stakes assessments that meet certain quality benchmarks. Slide 7

8 Questionmark has recently released a best practice guide, written by the Questionmark documentation team (with input from Greg), which is available at: best_practice/index.aspx Slide 8

9 Does beta testing pose fairness/ethical issues (participants answering questions that don't count)? Does your organization have participants sign a candidate agreement seeking permission for taking beta tests? How long are your beta tests (e.g., 10 questions)? What sort of feedback do you provide to item developers (e.g., beta statistics back to authors)? What sort of quality assurance review processes are the beta results used in? Other discussion issues? Slide 9

10 Scenario: Your organization wants to implement beta testing using an embedded model (unscored questions placed on actual scored assessments) to obtain statistical information about questions to build new test forms. You have been tasked to lead this effort, figure out how to do it, and prepare for the implications. How can you do this in Perception? What issues could be raised by doing this (e.g., fairness, reporting, appeals)? Slide 10

13 Scenario: Your organization wants to implement beta testing using an embedded model. 1. Identify what the key issues are that need to be solved. 2. Discuss how you and your organization would solve the problem. 3. Document the strategies you and your organization would use to address the key issues. Slide 13

14 It is possible within Perception to administer an assessment by selecting questions randomly from a topic. This allows users to create on-the-fly assessments rather than fixed forms. This approach has the benefits of using one's repository in an efficient way and reducing cheating potential among participants. However, it does cause challenges, especially in the area of reporting and equivalence of forms. Slide 14

15 Results obtained from assessments that randomly select questions from topics will by nature have gaps (missing data), because not all participants will answer the same questions. An assessment then becomes all of the possible combinations of questions that are administered to participants (e.g., thousands of forms). This manifests itself in several ways in reports. In item analysis reports there will be many questions in the analysis (however many were administered); some of these questions will be answered by only a small number of participants, so the statistics will not be very robust. In test analysis reporting, statistics like Cronbach's Alpha (α) rely on the assumption that all participants took the same questions, so α will be very low (or even negative). Slide 15
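
As a rough sketch of the reporting problem (not Questionmark's actual item analysis code; the response matrix and question labels are invented), the example below computes per-question difficulty and respondent counts from a sparse 0/1 response matrix in which random selection leaves gaps:

```python
import numpy as np

# Hypothetical sparse response matrix: rows = participants, columns = questions.
# 1 = correct, 0 = incorrect, NaN = that question was never administered to that participant.
responses = np.array([
    [1.0, np.nan, 0.0, 1.0],
    [np.nan, 1.0, 1.0, np.nan],
    [0.0, 1.0, np.nan, 1.0],
    [1.0, np.nan, np.nan, 0.0],
])

n_answered = np.sum(~np.isnan(responses), axis=0)  # participants who saw each question
p_values = np.nanmean(responses, axis=0)           # difficulty: proportion correct among those who saw it

for q, (n, p) in enumerate(zip(n_answered, p_values), start=1):
    print(f"Question {q}: answered by {n} participants, p-value = {p:.2f}")

# With random selection, many columns end up with only a handful of respondents,
# which is why item statistics (and any alpha computed over the full matrix) are unstable.
```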

16 What can be done about this? Really this is a missing data issue, and there is not a great deal that can be done via software. Reports and statistics generally rely on complete data, and missing data is a problem for conventional reports and statistics. One solution may be to have a special report that deals specifically with assessments administered in this way; it is not yet clear what this would look like, but it is on the to-do list. Slide 16

17 Generally this delivery approach is used in low stakes contexts (e.g., formative tests, quizzes). It is difficult using this approach to validate the equivalency of the tests that each participant receives: some participants could get lots of hard questions, some participants could get lots of easy questions, and many participants will be in the middle (a mix of easy, average and hard). But there is no way to statistically show that equivalent forms have been delivered. Slide 17

18 The best, and maybe only, way to address this issue completely is to conduct computerized adaptive testing (CAT). CAT requires Item Response Theory (IRT), which has advantages (e.g., comparability of question statistics because the data are fit to a model; with p-values the values are dependent on the sample of participants tested). IRT, however, requires larger sample sizes (e.g., 900 participants for the 3-parameter model) and has other requirements (it is more technical and requires advanced psychometric knowledge). For more information on CAT see: Slide 18
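
For illustration only, here is a minimal sketch of the two pieces CAT builds on: the 3-parameter logistic (3PL) IRT model and maximum-information item selection. The item bank, parameter values and ability estimate are invented; a real CAT engine would also re-estimate ability after each response and apply exposure and content controls.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical calibrated item bank: (a, b, c) per item.
bank = {"Q1": (1.2, -0.5, 0.20), "Q2": (0.8, 0.0, 0.25), "Q3": (1.5, 1.0, 0.20)}

theta_estimate = 0.3  # current ability estimate for this participant
next_item = max(bank, key=lambda q: item_information(theta_estimate, *bank[q]))
print("Administer next:", next_item)
```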

19 Does anyone use random administration? If so, in what context (high, medium, low stakes)? Why do people use or not use random administration? Do people find the psychometric issues problematic (e.g., result reporting)? Do people find it difficult to answer questions about the comparability of scores using this method? Do candidates find this approach fair? Other discussion issues? Slide 19

20 Scenario: Your organization wants to implement random administration of assessments for a medium stakes examination to maximize the use of the item banks. You have been tasked to lead this effort, figure out how to do it, and prepare for the implications. What issues could be raised by doing this (e.g., reporting, functionalities needed)? Slide 20

21 Scenario: Your organization wants to implement random administration of assessments for a medium stakes examination. 1. Identify what the key issues are that need to be solved. 2. Discuss how you and your organization would solve the problem. 3. Document the strategies you and your organization would use to address the key issues. Slide 21

22 Classical Test Theory (CTT) and Item Response Theory (IRT) look at difficulty and discrimination differently. Most common is the CTT approach, which looks at: Difficulty = P-value (proportion of participants selecting the correct answer); Discrimination = correlation between assessment score and question score (participants who get higher assessment scores should get higher question scores). IRT uses up to 3 parameters: a-parameter = discrimination of the question; b-parameter = difficulty of the question; c-parameter = pseudo-guessing parameter. Slide 22
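
As a concrete illustration of the CTT definitions above (the scored responses are made up; Perception's own reports compute these statistics for you), the sketch below derives each question's P-value and an item-total discrimination. It uses the corrected item-total correlation, a common variant in which the question is excluded from the total score it is correlated against:

```python
import numpy as np

# Hypothetical scored responses: rows = participants, columns = questions (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
total = scores.sum(axis=1)

for q in range(scores.shape[1]):
    p_value = scores[:, q].mean()                            # CTT difficulty (proportion correct)
    rest = total - scores[:, q]                              # total score excluding this question
    discrimination = np.corrcoef(scores[:, q], rest)[0, 1]   # corrected item-total correlation
    print(f"Q{q + 1}: P = {p_value:.2f}, D = {discrimination:.2f}")
```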

24 We will focus on CTT as this is what most people are probably using. Any IRT users here today? Difficulty criteria depend on the type/purpose of the assessment. Mastery tests should have most people getting high scores on questions; acceptable item difficulty: P ≥ 0.70. Criterion referenced certification exams expect a certain number of participants to pass and fail; acceptable item difficulty: P. Most CTT statistics show maximum possible discrimination (D max) when P = 0.5, and it is difficult to get high discrimination on questions with extreme difficulties. Slide 24

25 Question discrimination (item-total correlation / point-biserial correlation) criteria vary significantly by: organization and legal considerations; the stakes of the assessments being delivered; financial and resource realities. Psychometricians generally say it depends, but the higher the better: for high stakes, D ≥ 0.4; for low stakes (e.g., quizzes), D > 0; for medium stakes, D ≥ 0.3. Better discrimination means better measurement efficacy, which benefits everyone, and more items with high discrimination will increase overall test reliability (Cronbach's Alpha). Here is what some organizations use for discrimination criteria: ETS: >0.300; CTB McGraw-Hill: >0.300; Alberta Diploma Examination Program: >0.300. The statistical value is not the only thing to consider, though: if participants are supposed to know something and the question is sound, but they are not answering the question as expected, it could be an instructional issue. Content review is crucial. Slide 25
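
One simple way to operationalize criteria like these is a review filter that flags questions whose statistics fall outside the chosen thresholds so they can be routed to content review. The sketch below is illustrative only; the item statistics, field names and cut-offs are assumptions, not an organizational standard:

```python
# Hypothetical per-question statistics (e.g., exported from an item analysis report).
items = [
    {"id": "Q101", "p": 0.92, "d": 0.12},
    {"id": "Q102", "p": 0.55, "d": 0.41},
    {"id": "Q103", "p": 0.18, "d": 0.05},
]

# Illustrative criteria; adjust to your organization's policy and the stakes of the exam.
P_MIN, P_MAX, D_MIN = 0.30, 0.90, 0.30

for item in items:
    flags = []
    if not (P_MIN <= item["p"] <= P_MAX):
        flags.append("difficulty out of range")
    if item["d"] < D_MIN:
        flags.append("low discrimination")
    if flags:
        print(f'{item["id"]}: send to content review ({", ".join(flags)})')
```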

26 Kuder-Richardson Formula 20 (KR-20): first published in 1937; designed for dichotomous (1/0, right/wrong) items; values range from 0 to +1 (closer to +1 = higher reliability). Cronbach's Alpha (α): published by Cronbach in 1951; designed for dichotomous and non-dichotomous (continuous, e.g., 1 to 5) items; values generally range from 0 to +1 (closer to +1 = higher reliability). Questionmark uses Cronbach's Alpha on the Test Analysis Report and in the Results Management System. Acceptable values: high: 0.90 or higher (acceptable for high stakes); moderate (acceptable for medium stakes); low: 0.7 (unacceptable for high/medium stakes, may be OK for low stakes). Slide 26
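
For readers who want to see the two coefficients in action, here is a minimal sketch of both (the response data are invented; Questionmark's reports compute Alpha for you). KR-20 is the special case of Alpha for right/wrong items, with item variances computed as p(1-p):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's Alpha for an items-in-columns score matrix (handles non-dichotomous items too)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(scores):
    """KR-20 for dichotomous (0/1) items; item variances are taken as p * (1 - p)."""
    k = scores.shape[1]
    p = scores.mean(axis=0)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical 0/1 responses: rows = participants, columns = questions.
scores = np.array([[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [1, 1, 1, 0]])
print(f"alpha = {cronbach_alpha(scores):.3f}, KR-20 = {kr20(scores):.3f}")
```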

27 What values do people here use for acceptable question and test statistics? Are there challenges in your organization in understanding and communicating these values? What are the relationships between statistics and question design/development (process)? How do you use statistics to improve items? What software do people use to compute statistics (other than Perception), and why? Other discussion issues?

28 Scenario: Your organization has a Cronbach's Alpha test reliability criterion for high stakes exams of 0.90 or higher. A high stakes exam your organization administered comes back with a Cronbach's Alpha below that criterion. You have been asked to find out why this happened and what to do about it. Are questions performing properly? Are questions measuring the same dimension? Slide 28

29 Scenario: Your organization has a Cronbach's Alpha test reliability criterion for high stakes exams of 0.90 or higher. A high stakes exam your organization administered comes back with a Cronbach's Alpha below that criterion. 1. Identify what the key issues are that need to be solved. 2. Discuss how you and your organization would solve the problem. 3. Document the strategies you and your organization would use to address the key issues. Slide 29

31 Intellectual Property (IP) theft has to do with question content being taken from secure exams and made available to other test takers to practice from. Many times test takers will come into an exam and memorize questions, or take pictures or screenshots, which they then sell or post to web sites that make the questions available for a charge. There are three areas from which to approach IP theft:* 1. Detection 2. Reaction 3. Prevention * Based on work done by the ATP Test Security IP Theft CWC Slide 31

32 1. Detection: use a web crawl program or service; use alert mechanisms; contact Google AdWords and report foul play; create a tip line. 2. Reaction: send a cease-and-desist letter; engage the infringing site in a live chat session; make a posting to a bulletin board or forum; contact payment processing providers; notify eBay of infringement (VeRO program); identify and notify website hosting entities; pull exams from test centers or countries. 3. Prevention: review candidate agreements or NDAs; post candidate agreements or NDAs on forums; notify candidates of certification revocation; publish multiple exam forms; restrict the offering of beta exams; monitor new exams; incorporate stealth or Trojan items; periodically update exam forms; flag candidates with exceptional test performance; deploy internal security leak protection measures; produce your own exam prep material. Slide 32
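
As a very small illustration of the detection idea only (not a replacement for a web-crawl service or alert tooling), the sketch below fetches a list of suspect pages and checks whether verbatim question stems appear in them. The URLs and stems are placeholders, not real exam content:

```python
import urllib.request

# Hypothetical secure question stems to look for (placeholders only).
secure_stems = [
    "Which process is responsible for managing known errors?",
    "A participant score report must include which of the following?",
]

# Hypothetical suspect pages, e.g., collected from a tip line or search alerts.
suspect_urls = ["http://example.com/braindump-page"]

for url in suspect_urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            page = resp.read().decode("utf-8", errors="ignore").lower()
    except OSError as err:
        print(f"{url}: could not fetch ({err})")
        continue
    hits = [stem for stem in secure_stems if stem.lower() in page]
    if hits:
        print(f"{url}: possible IP theft, {len(hits)} verbatim stem(s) found")
```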

33 Resources: If you are interested in obtaining resources that go into more detail on this, bring me your business card or a USB drive and I will give you two very useful documents: the ATP Test Security IP Theft CWC Report and the ATP Test Security Plan document. Slide 33

34 Does anyone have problems in this area? Any examples people would like to share? How severe do you feel the problem is? Does anyone use detection methods? Does anyone use legal avenues to shut down sites that make stolen content available? Other discussion issues? Slide 34

35 Scenario: Your organization received an anonymous tip that exam questions have been stolen and are up on a web site for sale. You are tasked to: look into whether this is true; if it is true, determine to what extent items have been exposed; and do something about it (e.g., get the questions off the site). Slide 35

36 Scenario: Your organization received an anonymous tip that exam questions have been stolen and are up on a web site for sale. 1. Identify what the key issues are that need to be solved. 2. Discuss how you and your organization would solve the problem. 3. Document the strategies you and your organization would use to address the key issues. Slide 36

37 Multiple-Choice and True/False Tests: Myths and Misapprehensions, Burton, Richard F. (2005), ERIC# EJ. Multiple Choice and True/False Tests: Reliability Measures and Some Implications of Negative Marking, Burton, Richard F. (2004), ERIC# EJ. Correcting Computer-Based Assessments for Guessing, Harper, R. (2003), ERIC# EJ. Other useful articles: ERIC: Slide 37