Received: 22 January 2016 | Revised: 12 February 2016 | Accepted: 22 February 2016
DOI: /cbe

EMERGING ISSUES

Inter-rater reliability calibration program: critical components for competency-based education

K. L. Gunnell (1), D. Fowler (2), K. Colaizzi (1,3)

(1) Institutional Research, Western Governors University, Salt Lake City, UT, USA
(2) Evaluation, Western Governors University, Salt Lake City, UT, USA
(3) Western Governors University, Salt Lake City, UT, USA

Correspondence: Kurt L. Gunnell, Institutional Research, Western Governors University, 4001 South 700 East, Suite 700, Salt Lake City, UT 84107, USA; kurt.gunnell@wgu.edu

Conflict of interest: No conflicts declared.

Measurement accuracy and reliability are important components of competency-based education (CBE). The ability to assess student performance reliably and consistently over time is crucial in these programs. For CBE programs that require multiple faculty or rater evaluations due to large student enrollments, the faculty or raters are expected to score student responses similarly. The reliability of ratings is critical for CBE programs because the integrity of the program rests on students demonstrating the specific competencies within the curriculum. Because of Western Governors University's (WGU) large student enrollment, the institution employs several hundred evaluators to rate students' performance assessments. WGU developed a program to assist its Evaluation Department in training evaluators to score students' assessments similarly. By using this program, WGU is able to maintain high inter-rater reliability among its evaluators and to calibrate their ratings to ensure equitable and impartial scoring of assessments. The calibration has been very successful for training new evaluators and retraining current evaluators. WGU will continue to use this program to review and rate its assessments in an organized and systematic manner.

KEYWORDS: assessment, faculty training, inter-rater reliability

1 INTRODUCTION

Measurement accuracy and reliability of student assessments are principal elements of competency-based education (CBE). These two principles are inherently couched within several principles of a successful CBE program (Johnstone & Soares, 2014). The ability to assess student performance accurately and consistently over time is imperative when establishing and managing a CBE program. For CBE programs that require multiple faculty or rater evaluations due to large student enrollments, the faculty or raters are expected to score the student responses similarly; in other words, a student should receive a similar score regardless of the rater. The reliability of evaluators' ratings is critical for CBE programs because the integrity of the program rests on students demonstrating the specific competencies within the curriculum.

2 WESTERN GOVERNORS UNIVERSITY

Measurement and evaluation calibration have always been a significant focus of the Western Governors University (WGU) Evaluation Department. WGU has an enrollment of over 65,000 students and employs hundreds of evaluators to rate students' assessments. Consequently, it is critical to maintain a departmental standard that assures students receive equitable and impartial assessment scoring. However, because of the large quantity of assessments submitted on an ongoing basis (hourly and daily), a process had to be developed to provide students with timely feedback that was accurate, robust, and personalized.
Currently, over 60,000 performance assessment submissions are received and scored each month. The WGU Evaluation Department had established internal rating processes to maintain high levels of inter-rater reliability among its raters, or evaluators.

These processes ensured that reliable measurements of student performances naturally occurred. Even though the evaluator training at WGU took place on a regular basis, it was tedious and time-consuming for both the evaluators and the Department leadership. Other factors also influenced the training of evaluators, with specific emphasis on the content (i.e., the tasks to be reviewed), the pedagogy of the training, and the increased hiring of evaluators. In an after-the-fact manner, the tasks needing review and calibration were often revealed and identified during students' appeal processes. The training or calibration processes also tended to vary across evaluation teams organized by discipline and program area. In addition, because of continuously increasing student enrollment, it was determined that a more scalable and consistent model was required to keep pace with the corresponding increase in student submissions. Consequently, the WGU Evaluation Department worked with the WGU Institutional Research office to develop a standardized application process to train evaluators and to ensure that inter-rater reliability was preserved. The final product design and functional specifications had the following components:

- Ease of use by evaluators
- Mirroring of the student evaluation experience
- Calibration on both scoring and feedback
- Immediate results and coaching for the individual evaluator
- Reporting at the task level with drill-down to the individual evaluator

By implementing these functional processes, the calibration training became more efficient and simpler for everyone involved.

3 THE SPECIFICS

Using a survey management program (Qualtrics), an application was developed for the Evaluation Department to standardize the inter-rater calibration process. After several iterations across 3-4 months, the application was released to the evaluators and their managers to assist in their calibration training. This training centered only on the ratings of performance assessments, one of the main assessment types at WGU.

4 OVERVIEW OF WGU ASSESSMENTS

Most assessments at WGU are delivered to students via a secure online delivery platform. Students complete remote-proctored objective and performance-based assessments on their own computer or in a computer lab to which they have access. The assessments are delivered securely and are only accessible on WGU-supported learning platforms and internally created course-of-study management systems. The institution created internal verification processes to identify and validate students who were taking the assessments through remote-proctored testing procedures. Objective assessments, or multiple-choice tests, are automatically evaluated by an internal scoring program, whereas performance assessments are scored by assigned evaluators in the WGU Evaluation Department. Performance assessments consist of a series of performance tasks that may require students to submit artifacts (e.g., essays, research papers, case studies) demonstrating their application of knowledge and skills. Performance tasks are made up of smaller content units called aspects. These aspects are the measurement components scored by WGU evaluators using the Taskstream assessment tool. The Taskstream tool allows evaluators to review the performance tasks of students and then provide the feedback and grading results of the tasks back to the students. Students are able to review the evaluators' feedback on their performance assessments and verify their passing scores on their own WGU student portal homepage. If a student is unable to pass the task, he/she works with faculty to focus his/her areas of additional study before attempting to pass the assessment again.

5 PROJECT INITIATION

The inter-rater reliability (IRR) project started in early 2015 with the combined efforts of the WGU Evaluation and Institutional Research Departments. In consultation with the evaluators, the Institutional Research Department created a program within Qualtrics that fit the testing and reporting parameters of this project. The project was to start small and allow for reliability testing and calibration of only a few tasks. The Evaluation Department wanted a smaller, more manageable project rollout in order to evaluate its success and overall functionality before expanding it to the remaining assessments. The smaller rollout was a success and accomplished all of the functionality and training goals of the Evaluation Department. At present, the Evaluation Department is adding more tasks each month while more evaluators are being assigned to the calibration process. As of January 2016, the Evaluation Department had just completed the eighth round of evaluator training and ratings. Plans are now being finalized to set up a more systematized schedule of task calibrations for a wider scope of evaluators.

6 PREPARATION

The WGU Evaluation management team selected one task within a specific assessment to use as the sample to be analyzed. The team then uploaded three basic components of the task into a preconstructed template in the Qualtrics program: (a) the scoring rubric, (b) the sample submissions, and (c) the modeled feedback. These three elements provided the structure for the rating processes performed by the evaluators. Each scoring rubric contained separate scales, either 3-point, 4-point, or 5-point, based on the task-aspect evaluation requirements. An example of a 5-point scoring rubric can be seen in Figure 1. After all required elements were loaded into the calibration program, several screens were generated to mirror the actual screens and submission processes of a student taking an assessment.

In addition to setting up the ratings homepages, several behind-the-program systems were established to properly coordinate and deliver the appropriate submissions to the evaluators, as well as to record the evaluators' responses and feedback.

Figure 1. Calibration scoring form (5-point). The form ("Evaluation Calibration for ABC1 Task 1: Submission 1") includes a link to open the task submission and, for each criterion (e.g., "Articulation of Response (clarity, organization, mechanics)" and "A. Development of Culture"), a 0-4 rating scale with a comments field.

7 RATING PROCESS OF EVALUATORS

The first step was assigning a cohort of evaluators a specific task to review. Next, an introductory email with an embedded link to the ratings was sent to all evaluators in the cohort so that they could access the calibration program through a web browser. The evaluators opened the calibration program and navigated to the introductory webpage, which contained specific instructions on how to review the submissions and record their ratings. After reviewing the instructions, they selected another link at the bottom of the page, which took them to the actual page where the submission link was presented and the rating tool rubrics were displayed. Next, the evaluator completed the rubric rating scale for each aspect and then submitted his/her responses. Immediately, the evaluator was provided a feedback form showing which of his/her ratings were consistent with the correct rubric scorings and which were not. Modeled feedback, which provided the rationale and basis for the correctness of the answers, was also provided to the evaluator on the form. A basic assumption of the ratings tool was that every evaluator assigned to a given submission was provided the same rubric and rating scales. Evaluators also received the same predetermined feedback responses based on their own ratings, whether they rated the aspect correctly or not. In this manner, the tool assured standardized and consistent measurement procedures across the evaluators during the training period.
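To make the preparation and rating steps above concrete, the sketch below models a calibration task as a rubric of aspects with modeled ("correct") scores and modeled feedback, and checks an evaluator's submitted ratings against that key to produce immediate per-aspect feedback. This is a minimal illustration, not WGU's Qualtrics application; the aspect names, scores, and feedback strings are hypothetical.

```python
# Minimal sketch of the calibration check described above -- not WGU's actual
# Qualtrics application. Aspect names, scores, and feedback text are hypothetical.
from dataclasses import dataclass

@dataclass
class Aspect:
    name: str              # rubric aspect, e.g., "Articulation of Response"
    correct_score: int     # the modeled ("correct") rubric score, 0-4
    modeled_feedback: str  # rationale shown to the evaluator after submission

# A calibration task: aspects with an answer key and modeled feedback.
task = [
    Aspect("Articulation of Response", 3, "The response is organized but has minor mechanical errors."),
    Aspect("A. Development of Culture", 4, "The submission fully develops the culture criterion."),
]

def score_evaluator(task, ratings):
    """Compare an evaluator's per-aspect ratings to the answer key and
    return the immediate feedback displayed after submission."""
    feedback = []
    for aspect, rating in zip(task, ratings):
        feedback.append({
            "aspect": aspect.name,
            "your_score": rating,
            "correct_score": aspect.correct_score,
            "consistent": rating == aspect.correct_score,
            "modeled_feedback": aspect.modeled_feedback,
        })
    return feedback

# Example: an evaluator rates the two aspects 3 and 3; the second is inconsistent.
for row in score_evaluator(task, [3, 3]):
    status = "consistent" if row["consistent"] else "inconsistent"
    print(f'{row["aspect"]}: {status} (you: {row["your_score"]}, correct: {row["correct_score"]})')
    print(f'  Modeled feedback: {row["modeled_feedback"]}')
```

Because every evaluator in a cohort is checked against the same key and receives the same predetermined feedback, the results can be aggregated directly into the agreement reports described in the next section.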

8 REPORTING CAPABILITIES

Being able to perform this calibration testing among evaluators is very useful: it allows the training and retraining of evaluators for reliability testing. This is an obvious benefit to CBE students, whose progress in their programs is a direct function of their assessment scores. Another component of the calibration program that has proven informative for the consistency of the rating process is its capacity to report results in real time. The Evaluation Department leadership team is immediately provided with reports that show the final results and rating scores by task and aspect. They are able to intervene directly and immediately to provide timely training to the evaluators instead of waiting for issues to emerge during a student's appeal. The Evaluation leadership team can generate reports based on (1) time ranges or (2) specific tasks and aspects to view the results of the calibrations. The reports are generated automatically with user prompts and can provide the following data points by task and aspect:

1. Number of submissions reviewed
2. Number of evaluations performed by the evaluators
3. Number of evaluations that meet the task requirements (evaluations with correct ratings)
4. Number of evaluations that do not meet the task requirements (evaluations that were incorrectly rated)
5. Total percentage of correct evaluations
6. Total percentage of agreement across evaluators
7. Kappa statistic by submission and ratings

The Kappa statistic was added to show the robustness of the agreement between the evaluators. It is a more stable measure than the simple percent agreement because its calculation accounts for the probability of agreement occurring merely by chance (Cohen, 1960). The report (Figure 2) is broken out into two separate parts: (a) submission totals and (b) submission-by-aspect totals. A total of eight student submissions were included in this report. The first section displays the overall totals and percentages for each submission. In the second section, the submission-by-aspect row totals and percentages are displayed for each respective rating scale. The two data points, percent agreement and the Kappa statistic, are especially important to show because they point out where there are evaluator agreements or disagreements at this basic unit of the task. The scoring breakouts at the aspect level also show the ranges of the evaluators' responses in terms of agreement and diversity.

Looking closer at the top table (submission totals) in Figure 2, it can be determined that 11 evaluators reviewed the first submission: seven of the evaluators reported that the submission did not meet the requirements to pass the task, and four reported that it did. The actual, or correct, rating of this submission was that it did meet the requirements. Thus, only 36% (4 divided by 11) of the evaluators reported the correct result. The percent agreement column shows a 49% agreement value; this value is derived from the number of agreements divided by the number of comparisons made among the evaluators' actual responses and the correct responses. The other tables of Figure 2 show, by submission, the aspect-level ratings by response option.
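As a rough illustration of how these agreement measures can be computed (a minimal sketch with assumed formulas, not the report's actual implementation), the snippet below reproduces the Submission 01 figures: the percent-correct value (4 of 11 evaluators, about 36%) and a pairwise percent agreement among evaluators (27 of 55 pairs, about 49%), plus a small Cohen's (1960) kappa helper that adjusts two raters' agreement for chance. The rating data are hypothetical codings of the pass/fail decisions described above.

```python
# Minimal sketch (assumed formulas, hypothetical codings) of the agreement
# measures discussed above for Submission 01: 11 evaluators, 7 rating
# "does not meet" and 4 rating "meets," with the correct result being "meets."
from itertools import combinations

ratings = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 0 = does not meet, 1 = meets
correct = 1                                   # the modeled ("correct") result

# Percent correct: share of evaluators who matched the modeled result (4/11, ~36%).
pct_correct = sum(r == correct for r in ratings) / len(ratings)

# Pairwise percent agreement: share of evaluator pairs that agree with each
# other (27 of 55 pairs, ~49%), a common way to summarize multi-rater agreement.
pairs = list(combinations(ratings, 2))
pct_agreement = sum(a == b for a, b in pairs) / len(pairs)

def cohens_kappa(r1, r2):
    """Cohen's (1960) kappa for two raters scoring the same submissions:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n                      # observed
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)     # by chance
    return (p_o - p_e) / (1 - p_e)

print(f"percent correct:   {pct_correct:.0%}")    # 36%
print(f"percent agreement: {pct_agreement:.0%}")  # 49%
# Example kappa for two raters' pass/fail decisions across eight submissions
# (hypothetical data): 75% observed agreement yields kappa of about 0.47.
print(f"kappa: {cohens_kappa([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 1, 0, 0, 1, 1, 1]):.2f}")
```

Because kappa subtracts the agreement expected by chance, two evaluators who agree 75% of the time on a skewed pass/fail distribution can still show a noticeably lower kappa, which is why the report tracks both values.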
Returning to Figure 2, the final score values for Submission 01 are shown for the 11 evaluators, along with the aspect-level breakout by response option. For Aspect 1 of Submission 01, one evaluator reported a rubric score of 2, seven reported a score of 3, and three reported a score of 4. The correct score, shown by the green-formatted cells, was 4, which produced a 27% correct rate (3 divided by 11). Similar to the top table, the percent agreement and kappa values are calculated and provided for each aspect.

Another option built into the report is the ability to drill down into the evaluator ratings by score. For example, a manager is able to view the eight incorrect evaluator scorings for Aspect 1 of Submission 01 by drilling into the one evaluator who reported a score of 2 and the seven evaluators who reported a score of 3. Thus, managers are able to review the ratings by specific aspect and then, if desired, drill down into the evaluators' ratings to determine the scoring breakouts. The drill-down option displays each evaluator's name, rating score, and actual comments. This option is particularly helpful for the Evaluation leadership in determining areas of improvement for training or retraining.

9 CURRENT IMPLICATIONS

The WGU Evaluation Department has put in place several ongoing projects to use the calibration program as both a training tool and a testing tool. When coaching new evaluators in their roles and responsibilities for assessing student submissions, the program can be used not only to train them on how to appropriately rate student submissions but also to simulate an actual task rating session. In this manner, the new evaluator can be mentored through the process of completing the ratings in a proxy environment where proper and relevant training can occur without the risk of inadvertently modifying real student submissions and responses. Another function of the program is to provide training to current evaluators when issues arise around the agreement or disagreement of evaluator ratings as it relates to feedback language and clarity. For example, when there is a difference of opinion about the actual substance (i.e., wording, grammar, or structure) of a correct rating's feedback, the evaluation team can set up one or more rounds of the calibration program to fine-tune and clarify any conflicting ratings of the evaluators. This process can synchronize the evaluators' responses and feedback on students' possible submissions to a task, thus preserving appropriate levels of inter-rater reliability.

10 A PRINCIPLE OF COMPETENCY-BASED EDUCATION

This calibration tool, albeit technologically designed and developed for a specific evaluation process within a competency-based assessment model, is an example of how to set up internal processes that support the structure and validity of CBE within a specific college or university. The program identifies areas of improvement for evaluation training where faculty ratings are more diverse and dissimilar than should be expected or appropriate. It not only provides the monitoring capability to regularly test and retest the rating skills of institutional evaluators but also provides feedback to institutional academic staff members about the quality and reliability of CBE assessment evaluations. The academic leadership of an institution has a vested interest in the ability to verify that students are receiving the appropriate scores and feedback from their faculty. Leadership can also provide evidence that internal processes confirm and support the basic CBE principle that faculty are consistently trained and retrained on these particular rating methods.

Figure 2. Inter-rater agreement report.

The curriculum for a specific program, along with its related courses and assessments, should be neither stagnant nor static. It should be regularly updated, revised, and enhanced to reflect the current and relevant competencies for that field of study. Thus, the grading of assessments should also be updated and revised to mirror these curriculum upgrades. The calibration tool can be implemented across an institution's curricular base, within several fields of study, to authenticate proper and correct assessment rating processes.

How to cite this article: Gunnell, K. L., Fowler, D., & Colaizzi, K. Inter-rater reliability calibration program: critical components for competency-based education. Competency-based Education. 2016, 1. DOI: /cbe

REFERENCES

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

Johnstone, S. M., & Soares, L. (2014, March-April). Principles for developing competency-based education programs. Change: The Magazine of Higher Learning. Retrieved from Back%20Issues/2014/March-April%202014/Principles_full.html

AUTHOR BIOGRAPHIES

K. L. Gunnell is the current Director of Institutional Research at Western Governors University. Previously, he was the Associate Director of Institutional Research at the Kansas Board of Regents. He has presented at several national conferences about competency-based education, distance education, and higher education data management issues.

D. Fowler serves as an Associate Provost of WGU, overseeing the evaluation, delivery, and proctoring of assessments of student competency. She holds a Juris Doctor and degrees in Mathematical Science and Economics. Her association with WGU began over a decade ago when she was invited to evaluate assessments in her spare time. Intrigued by WGU's mission and the CBE model, she changed careers and has never looked back.

K. Colaizzi is the Survey Research Specialist in the Institutional Research office at Western Governors University. She manages the institutional surveys for WGU academic units and staff.