VALIDATING CRITERION-REFERENCED READING TESTS a

Daniel S. Sheehan and Mary Marcus
Dallas Independent School District


Abstract. Validation is an important step in the developmental process of any reading test. Yet traditional approaches to reliability and validity assessment are not relevant for criterion-referenced reading tests. A more relevant approach defines reliability in terms of the consistency of decision-making across repeated test administrations and validity in terms of the accuracy of decision-making between an administration of a reading test and an administration of a criterion measure. A technique for measuring this type of reliability and validity is described in this paper, along with an example of how it was used with a criterion-referenced reading battery.

a Reprints may be requested from Dr. Sheehan, Dallas Independent School District, 3700 Ross Avenue, Dallas, Texas.

Varieties of new instructional models are being used in school reading programs where the usual procedures for constructing, validating, and interpreting tests are not so useful and in some cases are completely inappropriate. Examples of these instructional models include Read On (Random House School Division, 1971), the Fountain Valley Teacher Support System (Richard L. Zweig Associates, Inc., 1971), the Prescriptive Reading Inventory (CTB/McGraw-Hill, 197), A Model of School Learning (Carroll, 1963, 1970), Individualized Instruction (Glaser, 1969), and Project Plan (Flanagan, 1967, 1969). With these models, reading tests are being used to establish an individual's achievement on specific reading content, i.e., instructional objectives, and to provide information for making a variety of instructional decisions (Hambleton and Gorth, 1970). Since traditional norm-referenced tests are clearly inappropriate, the reading programs that are based on these instructional models use criterion-referenced tests.

Criterion-referenced tests are deliberately constructed to yield measurements that are directly interpretable in terms of specific performance standards (Glaser and Nitko, 1971). While norm-referenced tests are constructed for the purpose of making comparisons among individuals, criterion-referenced tests are specifically designed to evaluate an individual's mastery of the instructional objectives covered in the test (Hambleton, 1974).

Traditional approaches to test validation include estimating reliability by internal consistency or stability coefficients and validity by predictive coefficients and construct designs. These validation procedures are all correlational in nature and depend upon test score variance. Yet because criterion-referenced tests are not designed to discriminate among individuals, in many situations their use results in little or no test score variance. In addition, if reliability and validity are considered in decision-theoretic terms, the correlational methods represent an inappropriate choice of a loss function (Hambleton and Novick, 1973). Thus these classical validation approaches are not very useful in the analysis of criterion-referenced tests.

An examination of a variety of criterion-referenced reading tests revealed that most of these tests were not validated or were validated with traditional approaches that were clearly inappropriate for criterion-referenced tests (see, for example, Lichtman, 1974). Thus the primary purpose of this paper is to describe a validation approach that is appropriate for criterion-referenced reading tests. The first section of the paper will describe a criterion-referenced reading battery, the second section will present an appropriate validation procedure, and the third section will give an example of how the validation procedure was implemented with the reading battery.

CRITERION-REFERENCED READING BATTERY

The Survey of Reading Skills is a criterion-referenced battery based upon skills emphasized in the Dallas Independent School District's basal reading curriculum. The battery is classified into six levels of difficulty, with the levels corresponding to the groupings of the Houghton Mifflin reading books. In addition, there is an individually and group administered pre-reading level and a secondary level that is divided into four major skill areas. The purpose of the battery is to provide an estimate of whether a student has attained mastery status on each of a series of reading objectives. Thus each level of the battery consists of a series of subtests which measure objectives relevant to the basal reading program at the corresponding level of difficulty. All items within a subtest are replications of the same task, and multiple items are included to provide a reliable estimate of the mastery status of each student. Actual mastery status is determined by a student answering a specified number of items correctly on a particular subtest. The specified number of items is set so that a student can miss one item and still be included in the mastery group, with a low probability that the classification is due to chance (Olson, 1973).
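To make the chance-classification argument concrete, the short sketch below (an editorial illustration, not part of the original study) computes the probability that a student who merely guesses on multiple-choice items would nonetheless reach a mastery cutoff. The number of items, the cutoff, and the per-item guessing probability are hypothetical values, not the actual specifications of the Survey of Reading Skills.

```python
# Sketch: chance of a non-master being classified as a master by guessing alone.
# The subtest length, cutoff, and per-item guessing probability are hypothetical
# values for illustration, not the battery's actual specifications.
from math import comb

def chance_mastery_probability(n_items: int, cutoff: int, p_guess: float) -> float:
    """P(X >= cutoff) for X ~ Binomial(n_items, p_guess)."""
    return sum(
        comb(n_items, k) * p_guess**k * (1 - p_guess) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )

# Example: 5 four-option items, mastery = at least 4 correct (one miss allowed).
print(round(chance_mastery_probability(n_items=5, cutoff=4, p_guess=0.25), 4))
# -> 0.0156, i.e., a guessing non-master is rarely classified as a master.
```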

A VALIDATION APPROACH

Considering the purpose of the reading battery, in order to validate it, something has to be known about the consistency of decision-making across repeated administrations (the reliability of the test). Another aspect of the validation procedure that has to be considered is the accuracy of decision-making (the validity of the test) (Hambleton and Novick, 1973). Finally, because each level of the test consists of subtests, i.e., groups of items measuring individual reading objectives, it is necessary to determine the reliability and validity for each subtest. This follows since it is on the basis of subtest scores rather than total test scores that mastery decisions are made. Each level of the battery will have as many reliabilities and validities as there are objectives included in the level (Swaminathan et al., 1974).

Reliability Estimation

The reliability question amounts to a check on the consistency of decision-making across two administrations of the reading battery. Consider the situation where one level of the reading battery is administered to a group of students on two occasions and each student is classified into one of two mutually exclusive decision states on each administration and for each objective. The two decision states are: (1) master of the objective, and (2) non-master of the objective. For each objective, $p_{ij}$ is used to denote the proportion of students placed in the $i$th decision state on the first administration and the $j$th decision state on the second administration. One estimate of the reliability of the items measuring a particular objective (subtest) in the reading battery would then be simply the proportion of agreement, $\sum_{i=1}^{2} p_{ii}$, where $p_{ii}$ is the proportion of students who were classified into the same decision state on both administrations. This measure would be interpreted as the proportion of students about whom the same mastery decisions were made on the two administrations. It would, however, tend to over-estimate the "true" amount of agreement, because a certain amount of agreement between decisions made on the two administrations would be expected by chance. A more appropriate measure of reliability is the kappa (κ) coefficient developed by Cohen (1960). The κ coefficient can be defined as the proportion of agreement after chance agreement is removed from consideration (Swaminathan et al., 1974). Formally,

$$\kappa = \frac{p_0 - p_c}{1 - p_c},$$

where $p_0$, the observed proportion of students about whom the same decisions were made on the two administrations, is defined by

$$p_0 = \sum_{i=1}^{2} p_{ii},$$

and $p_c$, the proportion of students about whom the same decisions would be expected by chance on the two administrations, is defined by

$$p_c = \sum_{i=1}^{2} p_{i\cdot}\, p_{\cdot i}.$$

The $p_{i\cdot}$ and $p_{\cdot i}$ values represent the proportions of students classified in decision state $i$ on the first and second administrations, respectively.
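As a computational illustration (not drawn from the original paper), the sketch below evaluates this formula from a 2 x 2 joint frequency distribution of mastery decisions. The counts in the example are hypothetical.

```python
# Sketch of the kappa coefficient defined above, computed from a 2x2 joint
# frequency distribution of mastery decisions (rows = first administration,
# columns = second administration or criterion measure). Not the authors' code.

def kappa_from_table(table: list[list[int]]) -> float:
    """Cohen's kappa: (p_o - p_c) / (1 - p_c) for a square contingency table."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n       # observed agreement
    row_margins = [sum(row) / n for row in table]                # p_i.
    col_margins = [sum(col) / n for col in zip(*table)]          # p_.i
    p_c = sum(r * c for r, c in zip(row_margins, col_margins))   # chance agreement
    return (p_o - p_c) / (1 - p_c)

# Hypothetical joint frequencies for one subtest:
#                   2nd admin: master, non-master
joint = [[40, 5],   # 1st admin: master
         [6, 9]]    # 1st admin: non-master
print(round(kappa_from_table(joint), 2))   # -> 0.5 with these made-up counts
```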

Validity Estimation

The validity question amounts to a check on the accuracy of decision-making between an administration of a reading battery and an administration of a criterion measure. Consider the situation where one level of a reading battery is administered to a group of students and each student is classified into either a master or non-master decision state for each objective. If criterion test results could be obtained, each student could also be classified as a master or non-master of each objective on the criterion measure. (Examples of possible criterion measures would be another test measuring the same objectives or student performance on the next unit of instruction.) Following the same form as reliability estimation, the decision-making agreement between the reading battery and the criterion measure could then be calculated.

AN EXAMPLE OF THE VALIDATION PROCEDURE

Validation of the Survey of Reading Skills was limited to levels, 5, and S (secondary). In addition, the validation was further limited to one randomly sampled objective (subtest) at each of the three test levels. These objectives or subtests were: Level Word Recognition (Given an orally presented word from the Basal Word List, the student will identify its printed form.); Level 5 Compound Words (Given printed words or combinations of words, i.e., compound words, of appropriate difficulty, the student will differentiate between those which are and are not compound words.); and Level S Verbs with Inflected Endings (Given a printed sentence in which a word used as a verb has been omitted, the student will complete the sentence by selecting a verb or verb phrase with an appropriate inflected ending, i.e., -d, -ed, -s, -ing, or -en.).

Reliability Estimation

To be certain of attaining a sufficient level of precision in the reliability estimation, six classes were randomly sampled from throughout the Dallas Independent School District at each of the grade levels corresponding to the three test levels (level S was used with grade 8). Thus six classes were randomly sampled at each of the grade levels, 5, and 8. Those parts of the tests at levels, 5, and S which included the randomly sampled objective or subtest for that level were initially administered to the sampled classes in the corresponding grades during the first week of December. The tests were readministered to the same classes approximately one week later. It was important that the time between test administrations be relatively short, so that any changes in student responses could be attributed to the unreliability of the measures rather than to changes in the students' achievement levels. While these two factors will always be confounded, it was felt that if the time between administrations was short (i.e., less than two weeks) and no instruction occurred in the interim, the first factor would likely explain the bulk of the changes in student responses.

For each of the three sampled subtests, students were classified into either a master or non-master decision state on each of the two administrations. From these data, a joint frequency distribution was formed for each subtest. The κ coefficients, or above-chance proportions of students about whom the same decisions were made on the two administrations, were calculated from each joint frequency distribution. The κ coefficients were .67 for the "word recognition" subtest, .45 for the "compound words" subtest, and .64 for the "verbs with inflected endings" subtest.
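To show what this classification and tabulation step looks like in practice, here is a continuation of the earlier sketch. The item scores and mastery cutoff are entirely hypothetical, not the study's data; the sketch builds the 2 x 2 joint frequency distribution from two administrations and reuses the kappa_from_table function defined above.

```python
# Sketch of the test-retest procedure described above, using made-up item scores.
# A student is a "master" when at least `cutoff` items are answered correctly;
# the joint frequency distribution of the two administrations then feeds kappa.
# Reuses kappa_from_table from the earlier sketch; all numbers are hypothetical.

def classify(num_correct: int, cutoff: int) -> int:
    """1 = master, 0 = non-master."""
    return 1 if num_correct >= cutoff else 0

first_admin  = [5, 4, 2, 5, 3, 4, 1, 5, 4, 0]   # items correct, administration 1
second_admin = [5, 4, 3, 4, 2, 4, 2, 5, 3, 1]   # items correct, administration 2
cutoff = 4

# Joint frequency distribution: rows = admin 1 decision, cols = admin 2 decision.
table = [[0, 0], [0, 0]]
for a, b in zip(first_admin, second_admin):
    i, j = classify(a, cutoff), classify(b, cutoff)
    table[1 - i][1 - j] += 1    # row/column index 0 = master, 1 = non-master

print(table, round(kappa_from_table(table), 2))   # -> [[5, 1], [0, 4]] 0.8
```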

Validity Estimation

Because the methodology for validity estimation was identical to that for reliability estimation, the validity sample also consisted of six randomly selected classes at each of the grade levels, 5, and 8. This ensured sufficient precision in the validity estimates. Two criterion measures were used to classify students into master or non-master decision states on the three objectives. The first consisted of a separate test at each of the levels, 5, and S. Each of these tests was similar to the corresponding subtest in the battery; the only differences were that the criterion tests were individually administered to students and contained items that were different from (but parallel to) the items in the actual subtests. The second criterion measure consisted of teacher ratings as to whether each student was a master or non-master of the objective sampled at that test level.

Thus those parts of the tests at levels, 5, and S which included the randomly sampled objective or subtest were administered to the sampled classes in the corresponding grades during the first week of December. Approximately one week later, the appropriate individually administered criterion measure was given to the students in these classes. At the same time, the teachers were asked to classify each student into either a master or non-master decision state on the objective sampled at that test level.

For each of the three sampled subtests, students were classified into either a master or non-master decision state on the test administration and on the administration of each criterion. From these data, a joint frequency distribution was formed for each subtest and for each criterion measure. The κ coefficients, or above-chance proportions of students about whom the same decisions were made on the test administration and the criterion administration, were calculated from each joint frequency distribution for each subtest and for each criterion measure. The above-chance proportions of agreement with the individually administered criterion measure were .71 for the "word recognition" subtest, .49 for the "compound words" subtest, and .68 for the "verbs with inflected endings" subtest. The above-chance proportions of agreement with the teacher rating criterion were .65 for the "word recognition" subtest, .51 for the "compound words" subtest, and .63 for the "verbs with inflected endings" subtest.
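A final continuation of the sketch (again reusing kappa_from_table, and again with hypothetical mastery decisions rather than the study's data) shows how the same agreement statistic is applied between the subtest's decisions and each criterion measure.

```python
# Sketch of the validity check described above: the same agreement statistic,
# now between the subtest's mastery decisions and each criterion measure.
# Decisions are hypothetical (1 = master, 0 = non-master); reuses kappa_from_table.

subtest_decisions    = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
individual_criterion = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]   # individually administered parallel test
teacher_rating       = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]   # teacher's master/non-master judgment

def joint_table(x: list[int], y: list[int]) -> list[list[int]]:
    """2x2 joint frequency distribution; row/column index 0 = master, 1 = non-master."""
    table = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        table[1 - a][1 - b] += 1
    return table

for name, criterion in [("individual test", individual_criterion),
                        ("teacher rating", teacher_rating)]:
    print(name, round(kappa_from_table(joint_table(subtest_decisions, criterion)), 2))
```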

Concluding Remarks

The "word recognition" and "verbs with inflected endings" subtests gave both consistent decision-making across test administrations and accurate decision-making in terms of an individually administered criterion and a teacher rating criterion. Although high in absolute terms, the "compound words" subtest was relatively less reliable and less valid. This could have been caused by inconsistencies in whether or not students were given the definition of a compound word before completing the test. More explicit directions could improve both the decision-making consistency and the decision-making accuracy of this subtest.

It should be pointed out that acceptable levels of reliability and validity vary with the use of the test and can be determined only by the user. One would demand higher levels of reliability for an instrument used to accept or reject airline pilots, for example, than for an instrument designed to assign students to mastery status on a classroom reading task. The important point, however, is that this approach to reliability and validity assessment produces statistics that are directly interpretable in terms of the decision-making consistency and the decision-making accuracy of the tests.

REFERENCES

CARROLL, J.B. A model of school learning. Teachers College Record, 1963, 69.
CARROLL, J.B. Problems of measurement related to the concept of learning for mastery. Educational Horizons, 1970, 48.
COHEN, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20.
FLANAGAN, J.C. Functional education for the seventies. Phi Delta Kappan, 1967, 49.
FLANAGAN, J.C. Program for learning in accordance with needs. Psychology in the Schools, 1969, 6.
Fountain Valley Teacher Support System. Huntington Beach, California: Richard L. Zweig Associates, Inc., 1971.
GLASER, R. Adapting the elementary school curriculum to individual performance. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service.
GLASER, R., & NITKO, A.J. Measurement in learning and instruction. In R.L. Thorndike (Ed.), Educational Measurement. Washington: American Council on Education, 1971.
HAMBLETON, R.K. A review of testing and decision-making procedures for selected individualized instructional programs. Review of Educational Research, 1974, 44.
HAMBLETON, R.K., & GORTH, W.P. Criterion-referenced testing: Issues and applications. Paper presented at the annual meeting of the Northeastern Educational Research Association, Liberty, New York, 1970.
HAMBLETON, R.K., & NOVICK, M.R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 1973, 10.
LICHTMAN, M. The development and validation of R/EAL, an instrument to assess functional literacy. Journal of Reading Behavior, 1974, 6.
OLSON, M.A. Evaluation of test items for the Dallas Independent School District's Survey of Reading Skills (Research Report No. ). Dallas: Dallas Independent School District, 1973.
Prescriptive Reading Inventory. Monterey, California: CTB/McGraw-Hill, 197.
Read On. New York: Random House School Division, 1971.
SWAMINATHAN, H., HAMBLETON, R.K., & ALGINA, J. Reliability of criterion-referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 1974, 11.