The Time Crunch: How much time should candidates be given to take an exam?

Similar documents
UK Clinical Aptitude Test (UKCAT) Consortium UKCAT Examination. Executive Summary Testing Interval: 1 July October 2016

Examination Report for Testing Year Board of Certification (BOC) Certification Examination for Athletic Trainers.

What are the Steps in the Development of an Exam Program? 1

Technical Report: May 2018 CHRP ELE. Human Resources Professionals Association

Technical Report: June 2018 CKE 1. Human Resources Professionals Association

The Examination for Professional Practice in Psychology: The Enhanced EPPP Frequently Asked Questions

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Band Battery

National Council for Strength & Fitness

Customer Surveys: How to Raise Your Sales & Marketing IQ in 6 Easy Steps. By Rebecca Sprynczynatyk with Andrea Parker

Using a Performance Test Development & Validation Framework

Audit - The process of conducting an evaluation of an entity's compliance with published standards. This is also referred to as a program audit.

STUDY GUIDE. California Commission on Teacher Credentialing CX-SG

INTERNAL CONSISTENCY OF THE TEST OF WORKPLACE ESSENTIAL SKILLS (TOWES) ITEMS BASED ON THE LINKING STUDY DATA SET. Report Submitted To: SkillPlan

Chapter 8: Estimating with Confidence. Section 8.2 Estimating a Population Proportion

Field Testing and Equating Designs for State Educational Assessments. Rob Kirkpatrick. Walter D. Way. Pearson

Most measures are still uncommon: The need for comparability

Investigating Common-Item Screening Procedures in Developing a Vertical Scale

2. Is there a list of commonly used resources (textbooks, software, videos, case studies, etc.) for program directors for each program level?

Setting Standards. John Norcini, Ph.D.

Reliability & Validity

SERVSAFE EXAM DEVELOPMENT FAQs

The Assessment Center Process

Intercultural Development Inventory (IDI): Independent Review

WHITE PAPER. Implementing Project Portfolio Management in Digestible Bites. by Keith Carlson, President and CEO, INNOTAS

Test Development: Ten Steps to a Valid and Reliable Certification Exam Linda A. Althouse, Ph.D., SAS, Cary, NC

External Assessment Report 2015

Investigation : Exponential Growth & Decay

Concept Paper. An Item Bank Approach to Testing. Prepared by. Dr. Paul Squires

SEE Evaluation Report September 1, 2017-August 31, 2018

Test Development. and. Psychometric Services

proficiency that the entire response pattern provides, assuming that the model summarizes the data accurately (p. 169).

Corporate and Business Law (LW) (ENG) Syllabus and study guide

WIP Cycle Counting. It's not easy keeping track of Work In Process, which -- by its nature -- is always a moving target.

provided that the population is at least 10 times as large as the sample (10% condition).

September Promoting Regulatory Excellence. In this session we will: Getting It Right: Implementing Innovative Item Types.

Operational Check of the 2010 FCAT 3 rd Grade Reading Equating Results

Use of a delayed sampling design in business surveys coordinated by permanent random numbers

Job Analysis 101. Pamela Ing Stemmer, Ph.D. April 11, 2016

EST Accuracy of FEL 2 Estimates in Process Plants

Metrics Driven Gift Processing. Metrics Driven?

Margaret K. Schultz, Princeton, New Jersey, May 25, 1954. William H. Angoff. THE DEVELOPMENT OF NEW SCALES FOR THE APTITUDE AND ADVANCED

R&D Connections. The Facts About Subscores. What Are Subscores and Why Is There Such an Interest in Them? William Monaghan

Six Mark Questions EXAM REPORT

Setting ACA Status in the Payroll System

WSF Assessor Manual August 2011

Equating and Scaling for Examination Programs

The PI Learning Indicator FAQ. General Questions Predictive Index, LLC The PI Learning Indicator FAQ. What is the PI Learning Indicator?

The statistics used in this report have been compiled before the completion of any Post Results Services.

AssessHub. Employment Suitability. Assured.

Linking Current and Future Score Scales for the AICPA Uniform CPA Exam i

Section 8.2 Estimating a Population Proportion. ACTIVITY The beads. Conditions for Estimating p

Comprehensive tools to prepare for the PCAT!

Overview of the 2nd Edition of ISO 26262: Functional Safety Road Vehicles

Application Form for Online Assessment

A Strategy for Optimizing Item-Pool Management

ASSOCIATION OF ACCOUNTING TECHNICIANS OF SRI LANKA. Examiner's Report AA3 EXAMINATION - JANUARY 2016 (AA34) PROCESSES, CONTROLS AND AUDIT

New Zealand Bookkeeper / Assistant Accountant Skills & Knowledge Test Report

Developing Certification Exam Questions: More Deliberate Than You May Think

Special Payroll Help Document Guidelines for Processing a Data Change

Speed to Value in Portfolio Management

Psychometrics 101: An Overview of Fundamental Psychometric Principles A S S E S S M E N T, E D U C A T I O N, A N D R E S E A R C H E X P E R T S

An Evaluation of Kernel Equating: Parallel Equating With Classical Methods in the SAT Subject Tests Program

CGEIT QAE ITEM DEVELOPMENT GUIDE

FOUNDATION DOCUMENT FOR THE QUALIFYING EXAMINATION. Introduction

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Market mechanisms and stochastic programming

Innovative Item Types Require Innovative Analysis

AN EVALUATION OF THE ARMY RADIO CODE APTITUDE TEST

Aon Talent, Rewards & Performance. January 2018

2. How many bushels of corn should Madison produce in the short-run? a. 4 b. 10 c. 20 d. 5 e. 0 (since closed in the short-run)

Total Survey Error Model for Estimating Population Total in Two Stage Cluster Sampling

Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT

Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT

Introduction to FACETS: A Many-Facet Rasch Model Computer Program

APPRENTICESHIP. Apprentice Employer Training Program Sponsor Warren County Career Center Your Local Educational Agency. Page 1

Coach Participants and Develop Personal Practice: Rowing (SCQF Level 5)

Entry-Level. Analysis Project. Project Description

Running head: GROUP COMPARABILITY

+ Purpose of assessment and use in candidate monitoring or decisions are consequential

GCE EXAMINERS' REPORTS

The State of Missouri

Research & Reviews: Journal of Social Sciences

Flawless Field Service Perfecting Productivity

Item Tracking in Microsoft Dynamics

7 Statistical characteristics of the test

Fairness In Employment Testing: Validity Generalization, Minority Issues, And The General Aptitude Test Battery By John A.

PASSTCERT QUESTION & ANSWER

Testing: The critical success factor in the transition to ICD-10

future candidates and tuition providers.

Timetabling with Genetic Algorithms

National Board of Certification and Recertification for Nurse Anesthetists Summary of NCE and SEE Performance and Clinical Experience September 1,

Implementation of a Social Work Jurisprudence Exam in Texas BEYOND CE: Regulating Competency in a Dynamic Profession

Recommendations for Revising the Educator Evaluation System (Act 82 of 2012)

Validity and Reliability Issues in the Large-Scale Assessment of English Language Proficiency

Content Specification Outline

Recommendations for Revising the Educator Evaluation System

CRL1-XRL-Z7-STP-CR

NATIONAL CREDENTIALING BOARD OF HOME HEALTH AND HOSPICE TECHNICAL REPORT: JANUARY 2013 ADMINISTRATION. Leon J. Gross, Ph.D. Psychometric Consultant

Psychometric tests are a series of standardised tasks, Understanding Psychometric Tests COPYRIGHTED MATERIAL. Chapter 1.


By Dr. Chris Beauchamp, May 2016

When developing an assessment, two of the major decisions a credentialing organization needs to make are: how many items will be on the exam, and how much time candidates will be given to complete it. These decisions can have a large impact on fairness and validity. Often, once an exam has been administered, many candidates will anecdotally report that they ran out of time and that the assessment was unfair. So, an important question to ask is: what can credentialing organizations do to investigate and address these concerns?

In the credentialing field, assessments almost always have time limits. Although time limits are often set in order to provide standardized administration conditions, it is also important to allow candidates enough time to complete the assessment without undue time pressure. If the allocated time is too short, this can undermine validity (Lu & Sireci, 2007). The purpose of this backgrounder is to explore this topic and provide a framework that credentialing organizations can use to determine whether the time limit and length of an assessment promote fairness and validity.

Tests are classified as either speed or power tests (Gulliksen, 1950). The difference between these two types of tests affects what we call test speededness, or the rate at which an exam is completed, as well as the correctness of candidate responses.

Speed tests: A speed test is composed of items that are so easy that examinees would rarely give a wrong answer. However, the test is so long that no candidate could complete it in the allotted time. As a result, candidates are judged by how far they get before running out of time. This approach is commonly used in IQ tests and other types of aptitude tests.

Power tests: A pure power test is a test in which all items should be attempted, and a candidate's performance is judged by the correctness of their responses. Although most credentialing examinations are designed to be power tests, time limits are generally used. The question then becomes: does the use of a time limit change a credentialing exam from a power test into a speed test?

Whenever a test involves a time limit, the rate at which candidates move through the items will affect their performance. A small percentage of candidates, no matter how much time is given, will not complete the exam. As a result, most examinations contain a mixture of speed and power components (Rindler, 1979).

There are several ways in which unintended test speededness can undermine test validity. For example, when speededness is unintended, candidate scores can be lowered by factors such as anxiety and stress. Test speededness can also negatively affect content validity, because some scored items are never attempted. This is problematic if the unattempted items fall into one or more content domains. It is therefore important to demonstrate that a test is not overly affected by speededness. Several statistical indices are available to assess speededness using a single-administration approach.

Methods of assessing speededness

Educational Testing Service (ETS) provides three guidelines for assessing test speededness (Schaeffer, Reese, Steffen, McKinley & Mills, 1993; Swineford, 1973). A test is speeded when 1) fewer than 80% of candidates complete the exam, 2) fewer than 100% of candidates reach 75% of the test, and/or 3) the ratio of not-reached variance to total score variance is greater than 0.15. The not-reached variance is the variance of the number of items left unanswered after the last item to which the candidate responded. This statistic is divided by the total score variance to obtain the not-reached to total score variance ratio.
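To make the three guidelines concrete, the following is a minimal sketch of how they might be computed from a matrix of candidate responses. It assumes omitted items are coded as None, that everything after a candidate's last response counts as "not reached," and that total scores are supplied separately; the function names and data layout are illustrative, not prescribed by ETS.

```python
from statistics import pvariance

def last_answered_index(responses):
    """Index of the last item the candidate answered, or -1 if none."""
    for i in range(len(responses) - 1, -1, -1):
        if responses[i] is not None:
            return i
    return -1

def speededness_screen(response_matrix, total_scores):
    """Apply the three ETS-style screening guidelines to one exam form.

    response_matrix: one list per candidate, one entry per item
                     (None = item not answered).
    total_scores:    each candidate's total score, in the same order.
    """
    n_candidates = len(response_matrix)
    n_items = len(response_matrix[0])
    cutoff_75 = 0.75 * n_items

    # Last item (1-based) each candidate reached, i.e., answered.
    last_reached = [last_answered_index(r) + 1 for r in response_matrix]

    # Guideline 1: proportion of candidates who reach the final item.
    prop_completed = sum(k == n_items for k in last_reached) / n_candidates

    # Guideline 2: proportion of candidates who reach at least 75% of the items.
    prop_reached_75 = sum(k >= cutoff_75 for k in last_reached) / n_candidates

    # Guideline 3: ratio of not-reached variance to total-score variance.
    not_reached = [n_items - k for k in last_reached]
    variance_ratio = pvariance(not_reached) / pvariance(total_scores)

    return {
        "prop_completed": prop_completed,              # flag if below 0.80
        "prop_reached_75pct": prop_reached_75,         # flag if below 1.00
        "not_reached_variance_ratio": variance_ratio,  # flag if above 0.15
    }
```

A form would be flagged for further review if the completion rate fell below 0.80, the proportion reaching 75% of the items fell below 1.00, or the variance ratio exceeded 0.15.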

The three criteria above are based on the notion that a candidate will run out of time at the end of the exam and fail to respond to the items he or she has not reached. There are, however, two other scenarios that should be considered. Near the end of the exam, a candidate may realize that there is insufficient time to finish the test. In one scenario, the candidate accelerates their work rate and skips items that would take too much time; this candidate will have a sporadic response pattern near the end of the exam. In another scenario, upon recognizing that time is running out, the candidate responds to all the remaining items in a random manner (Oshima, 1994).

Modification #1: Correcting for sporadic responding

Under the original method, a candidate's stopping point is identified as the point beyond which the candidate did not respond to any further items. However, it is possible that a candidate who is feeling significant time pressure will begin to answer questions sporadically. As a result, the original method should be modified: the candidate's stopping point can be identified as the last point at which the candidate responded to at least three questions in a row. The following is an illustration based on a 200-item test (a dash indicates no response):

Question:  190  191  192  193  194  195  196  197  198  199  200
Response:   4    1    2    2    -    -    -    -    -    -    1

Original method: the candidate is considered to have finished the exam.
Refined method: the candidate reached question 193 and did not complete the exam.

Modification #2: Correcting for random responding

Most candidates see a drop in performance near the end of the exam, most likely due to fatigue. However, a candidate who is under significant time pressure may choose either to 1) respond to questions very quickly without fully reading or comprehending the material, or 2) answer in a completely random manner. In some cases, a candidate may respond to every single question on the exam but may have responded randomly near the end. To account for this, candidate performance on the first 25 and the last 25 scored items of the exam can be compared. Candidates who display a statistically significant (p < 0.01) drop in performance on the last 25 questions can be deemed to have run out of time. This is in contrast to the original method, which would have considered such a candidate to have finished the exam.
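Returning to Modification #1, the refined stopping-point rule can be sketched as follows, under the same assumptions as above (None marks an omitted item; the run length of three follows the rule described in the text, and the function name is illustrative):

```python
def refined_stopping_point(responses, min_run=3):
    """Last item (1-based) at the end of the final run of at least
    `min_run` consecutive responses; later isolated responses are ignored."""
    stop, run = 0, 0
    for position, answer in enumerate(responses, start=1):
        run = run + 1 if answer is not None else 0
        if run >= min_run:
            stop = position  # extend the stopping point to the end of this run
    return stop

# The candidate from the 200-item illustration above: a run of responses
# ending at item 193, then only an isolated response at item 200.
responses = [None] * 189 + [4, 1, 2, 2] + [None] * 6 + [1]
print(refined_stopping_point(responses))  # 193 -> exam treated as unfinished
```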

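Modification #2 requires a significance test for the drop in performance between the first 25 and last 25 scored items. The article does not prescribe a specific test; one plausible choice, sketched below with illustrative function names, is a one-sided two-proportion z-test applied to each candidate.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ran_out_of_time(first_correct, last_correct, n=25, alpha=0.01):
    """Flag a candidate whose accuracy on the last `n` scored items is
    significantly lower than on the first `n` (one-sided z-test)."""
    p_first, p_last = first_correct / n, last_correct / n
    pooled = (first_correct + last_correct) / (2 * n)
    if pooled in (0.0, 1.0):  # no variability, nothing to test
        return False
    z = (p_first - p_last) / sqrt(pooled * (1 - pooled) * (2 / n))
    return (1.0 - normal_cdf(z)) < alpha  # small p-value -> significant drop

# Example: 21/25 correct at the start of the exam, 9/25 at the end.
print(ran_out_of_time(21, 9))  # True -> candidate deemed to have run out of time
```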
What can be done to prevent an exam from being overly speeded?

Decisions about the number of items on an exam and the time allocated to candidates should not be taken lightly; significant research should be conducted first. This can include looking at historical response patterns for the exam or at comparable assessments. For an entry-to-practice exam, it is also helpful to look at the policies in place at the educational/training level.

Some exams contain experimental/pilot items that are presented to candidates but do not count toward their total score. These items are typically presented throughout the exam. One strategy, if speededness is a concern, is to place the experimental/pilot items at the end of the exam (without indicating to candidates which items are experimental/pilot items). This way, if candidates run out of time, their exam scores are not affected because the unattempted items are experimental/pilot items.

Many exam blueprints provide guidelines on the number of items from different competency categories. Items can then be presented in two ways: 1) items can be grouped by competency category, or 2) items from different competency categories can be interspersed. If speededness is a concern, the latter approach should be used (a simple sketch follows below). This way, if candidates are unable to complete the exam, the impact on content validity is mitigated because the unattempted items do not come systematically from one specific area. In addition, many exam blueprints contain targets for taxonomy levels such as knowledge, application and critical thinking. In general, knowledge questions take less time to answer than critical thinking questions, so taxonomy should also be considered when setting the blueprint.

Finally, from an exam administration perspective, candidates should be given frequent timing updates throughout the exam (e.g., "one hour left") and a clock should be visible. This prevents candidates from losing track of their progress and running out of time.
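As a simple illustration of the interspersing strategy mentioned above, a round-robin ordering across competency categories spreads any unattempted items at the end of a form roughly evenly across the blueprint; the data structure, category names, and item IDs below are hypothetical.

```python
from itertools import zip_longest

def intersperse_by_category(items_by_category):
    """Round-robin across competency categories so that consecutive items
    come from different domains wherever possible.

    items_by_category: dict mapping category name -> list of item IDs.
    """
    ordered = []
    for round_of_items in zip_longest(*items_by_category.values()):
        ordered.extend(item for item in round_of_items if item is not None)
    return ordered

# Hypothetical nine-item form drawn from a three-domain blueprint.
form = intersperse_by_category({
    "Assessment":   ["A1", "A2", "A3", "A4"],
    "Intervention": ["I1", "I2", "I3"],
    "Ethics":       ["E1", "E2"],
})
print(form)  # ['A1', 'I1', 'E1', 'A2', 'I2', 'E2', 'A3', 'I3', 'A4']
```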

What if speededness is detected?

Evidence of speededness on an exam form may require the credentialing organization to revisit its exam blueprint and modify 1) the number of questions on the exam and/or 2) the time given to candidates to take the exam. However, this does not address what to do for the candidates who took the form that displayed evidence of excessive speededness, a problem that is exacerbated when numerous candidates anecdotally report that the time limit was insufficient. What can be done to make the assessment fair for these candidates?

One solution is to exclude the last handful of questions on the exam. For example, on a 200-item exam, the speededness analysis can be redone after deleting the last 10 questions. Does the revised 190-item exam still display evidence of speededness? If so, it may be necessary to remove additional items.

References

Gulliksen, H. (1950). Theory of mental tests. New York: John Wiley.

Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Oshima, T.C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31(3), 200-219.

Rindler, S.E. (1979). Pitfalls in assessing test speededness. Journal of Educational Measurement, 16(4), 261-270.

Schaeffer, G.A., Reese, C.M., Steffen, M., McKinley, R.L., & Mills, C.N. (1993). Field test of a computer-based GRE General Test (ETS Report No. RR-93-07). Princeton, NJ: Educational Testing Service.

Stafford, R.E. (1971). The speed quotient: A new descriptive statistic for tests. Journal of Educational Measurement, 8, 275-278.

To read more articles related to eLearning, examination, and instructional design, go to www.getyardstick.com and check out our blog.